@AlfrescoPublicApi public abstract class TikaPoweredMetadataExtracter extends AbstractMappingMetadataExtracter implements MetadataEmbedder
author:   -- cm:author
title:    -- cm:title
subject:  -- cm:description
created:  -- cm:created
comments:
Modifier and Type | Class and Description |
---|---|
protected static class |
TikaPoweredMetadataExtracter.HeadContentHandler
This content handler will capture entries from within
the header of the Tika content XHTML, but ignore the
rest.
|
protected static class |
TikaPoweredMetadataExtracter.MapCaptureContentHandler
This content handler will grab all tags and attributes,
and record the textual content of the last seen one
of them.
|
protected static class |
TikaPoweredMetadataExtracter.NullContentHandler
A content handler that ignores all the content it finds.
|
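As a rough illustration of the head-only capture described for HeadContentHandler, the sketch below uses only the JDK's SAX classes. The class name `HeadOnlySketchHandler` and its `parse` helper are hypothetical and are not part of Alfresco or Tika; the real handler may be structured quite differently.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: records element text found inside <head> and ignores the rest. */
final class HeadOnlySketchHandler extends DefaultHandler {
    private final Map<String, String> captured = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inHead = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("head".equals(qName)) {
            inHead = true;
        }
        text.setLength(0); // begin collecting text for the element that just opened
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inHead) {
            text.append(ch, start, length); // body text is never collected
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("head".equals(qName)) {
            inHead = false;
        } else if (inHead) {
            captured.put(qName, text.toString().trim());
        }
        text.setLength(0);
    }

    /** Parses well-formed XHTML text and returns the captured header entries. */
    static Map<String, String> parse(String xhtml) {
        HeadOnlySketchHandler handler = new HeadOnlySketchHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.captured;
    }
}
```

Given a document with a `title` in the head and a paragraph in the body, only the `title` entry ends up in the returned map.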
Nested classes/interfaces inherited from interface MetadataExtracter: MetadataExtracter.OverwritePolicy
Modifier and Type | Field and Description |
---|---|
protected org.apache.tika.extractor.DocumentSelector |
documentSelector |
protected static String |
KEY_AUTHOR |
protected static String |
KEY_COMMENTS |
protected static String |
KEY_CREATED |
protected static String |
KEY_DESCRIPTION |
protected static String |
KEY_SUBJECT |
protected static String |
KEY_TAGS |
protected static String |
KEY_TITLE |
protected static org.apache.commons.logging.Log |
logger |
Fields inherited from class AbstractMappingMetadataExtracter: MEGABYTE_SIZE, metadataExtracterConfig, NAMESPACE_PROPERTY_PREFIX, PROPERTY_COMPONENT_EMBED, PROPERTY_COMPONENT_EXTRACT, PROPERTY_PREFIX_METADATA
Constructor and Description |
---|
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes) |
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes,
ArrayList<String> supportedEmbedMimeTypes) |
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes) |
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes,
HashSet<String> supportedEmbedMimeTypes) |
TikaPoweredMetadataExtracter(String extractorContext,
ArrayList<String> supportedMimeTypes) |
TikaPoweredMetadataExtracter(String extractorContext,
HashSet<String> supportedMimeTypes,
HashSet<String> supportedEmbedMimeTypes) |
Modifier and Type | Method and Description |
---|---|
protected org.apache.tika.parser.ParseContext |
buildParseContext(org.apache.tika.metadata.Metadata metadata,
String sourceMimeType)
By default returns a new ParseContext
|
protected static ArrayList<String> |
buildSupportedMimetypes(String[] explicitTypes,
org.apache.tika.parser.Parser... tikaParsers)
Builds up a list of supported mime types by merging
an explicit list with any that Tika also claims to support
|
protected void |
embedInternal(Map<String,Serializable> properties,
org.alfresco.service.cmr.repository.ContentReader reader,
org.alfresco.service.cmr.repository.ContentWriter writer)
Override to embed metadata values.
|
protected Map<String,Serializable> |
extractRaw(org.alfresco.service.cmr.repository.ContentReader reader)
Override to provide the raw extracted metadata values.
|
protected String |
extractSize(String sizeText)
Exif metadata for size also returns the string "pixels"
after the numeric value; this function stops
at the first non-digit character found in the text.
|
protected Map<String,Serializable> |
extractSpecific(org.apache.tika.metadata.Metadata metadata,
Map<String,Serializable> properties,
Map<String,String> headers)
Allows implementation specific mappings to be done.
|
protected org.apache.tika.extractor.DocumentSelector |
getDocumentSelector(org.apache.tika.metadata.Metadata metadata,
String targetMimeType)
Gets the document selector, used for determining whether to parse embedded resources;
null by default, so all embedded resources are parsed.
|
protected org.apache.tika.embedder.Embedder |
getEmbedder()
Returns the Tika Embedder to modify
the document.
|
protected String |
getExtractorContext()
Gets context for the current implementation
|
protected InputStream |
getInputStream(org.alfresco.service.cmr.repository.ContentReader reader)
There seems to be an issue between some downstream
third-party libraries and input streams that come from
a ContentReader. |
String |
getMetadataSeparator() |
protected abstract org.apache.tika.parser.Parser |
getParser()
Returns the correct Tika Parser to process the document.
|
protected Date |
makeDate(String dateStr)
Version which also tries the ISO-8601 formats (in order)
and similar formats that Tika makes use of.
|
protected boolean |
needHeaderContents()
Do we care about the contents of the
extracted header, or nothing at all?
|
void |
setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector)
Sets the document selector, used for determining whether to parse embedded resources.
|
void |
setMetadataSeparator(String metadataSeparator) |
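The digit-scanning behaviour described for extractSize can be sketched as follows. `ExtractSizeSketch` and `leadingDigits` are hypothetical names, and the sketch assumes a non-null input; it is a minimal illustration rather than the Alfresco source.

```java
/** Hypothetical sketch of "stop at the first non-digit character". */
final class ExtractSizeSketch {
    /** Returns the leading run of digits in sizeText, or an empty string if there is none. */
    static String leadingDigits(String sizeText) {
        StringBuilder sizeValue = new StringBuilder();
        for (char c : sizeText.toCharArray()) {
            if (!Character.isDigit(c)) {
                break; // e.g. the space before "pixels"
            }
            sizeValue.append(c);
        }
        return sizeValue.toString();
    }
}
```

With this sketch, an Exif-style value such as `1024 pixels` yields `1024`, and a value with no leading digits yields an empty string.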
Methods inherited from class AbstractMappingMetadataExtracter: checkIsEmbedSupported, checkIsSupported, embed, extract, extract, extract, filterSystemProperties, getBeanName, getDefaultEmbedMapping, getDefaultMapping, getEmbedMapping, getExecutorService, getLimits, getMapping, getMimetypeService, init, isEmbeddingSupported, isSupported, newRawMap, putRawValue, readEmbedMappingProperties, readEmbedMappingProperties, readGlobalEmbedMappingProperties, readGlobalExtractMappingProperties, readMappingProperties, readMappingProperties, register, setApplicationContext, setBeanName, setDictionaryService, setEmbedMapping, setEmbedMappingProperties, setEnableStringTagging, setExecutorService, setFailOnTypeConversion, setInheritDefaultEmbedMapping, setInheritDefaultMapping, setMapping, setMappingProperties, setMetadataExtracterConfig, setMimetypeLimits, setMimetypeService, setOverwritePolicy, setProperties, setRegistry, setSupportedDateFormats, setSupportedEmbedMimetypes, setSupportedMimetypes
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface MetadataEmbedder: embed, isEmbeddingSupported
protected static org.apache.commons.logging.Log logger
protected static final String KEY_AUTHOR
protected static final String KEY_TITLE
protected static final String KEY_SUBJECT
protected static final String KEY_CREATED
protected static final String KEY_DESCRIPTION
protected static final String KEY_COMMENTS
protected static final String KEY_TAGS
protected org.apache.tika.extractor.DocumentSelector documentSelector
public TikaPoweredMetadataExtracter(String extractorContext, ArrayList<String> supportedMimeTypes)
public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes)
public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes, ArrayList<String> supportedEmbedMimeTypes)
public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes)
public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes)
public String getMetadataSeparator()
public void setMetadataSeparator(String metadataSeparator)
protected static ArrayList<String> buildSupportedMimetypes(String[] explicitTypes, org.apache.tika.parser.Parser... tikaParsers)
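The merge that buildSupportedMimetypes is described as performing (an explicit list combined with the types the Tika parsers claim to support) might look roughly like the sketch below. To stay free of external dependencies, the parsers are modelled as plain collections of mime type strings; in real code those would come from each parser, for example via Tika's `Parser.getSupportedTypes(ParseContext)`. The class name, the `merge` method and the de-duplication step are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical sketch: merge explicit mime types with those claimed by parsers. */
final class SupportedMimetypesSketch {
    static ArrayList<String> merge(String[] explicitTypes,
                                   List<? extends Collection<String>> claimedByParsers) {
        ArrayList<String> types = new ArrayList<>();
        // Explicit types come first, in the order given
        for (String type : explicitTypes) {
            if (!types.contains(type)) {
                types.add(type);
            }
        }
        // Then anything a parser claims that is not already listed (assumed behaviour)
        for (Collection<String> claimed : claimedByParsers) {
            for (String type : claimed) {
                if (!types.contains(type)) {
                    types.add(type);
                }
            }
        }
        return types;
    }
}
```

Keeping the explicit types first preserves the caller's ordering, while the containment check avoids listing a type twice when a parser also claims it.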
protected String getExtractorContext()
Returns: String value which determines the current context
protected Date makeDate(String dateStr)
Overrides: makeDate in class AbstractMappingMetadataExtracter
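The idea of trying ISO-8601 style formats in order can be sketched with the JDK's `java.time` parsers. This is not the Alfresco implementation: the class and method names are hypothetical, treating values without an offset as UTC is an assumption, and returning `null` on failure stands in for whatever fallback the real method uses.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Date;

/** Hypothetical sketch: try ISO-8601 variants from most to least specific. */
final class MakeDateSketch {
    static Date parseIsoLike(String dateStr) {
        try {
            // Full date-time with offset, e.g. 2017-03-04T10:15:30Z
            return Date.from(OffsetDateTime.parse(dateStr).toInstant());
        } catch (DateTimeParseException e) {
            // fall through to the next format
        }
        try {
            // Date-time without offset; assumed to be UTC in this sketch
            return Date.from(LocalDateTime.parse(dateStr).toInstant(ZoneOffset.UTC));
        } catch (DateTimeParseException e) {
            // fall through to the next format
        }
        try {
            // Date only; midnight UTC is assumed
            return Date.from(LocalDate.parse(dateStr).atStartOfDay(ZoneOffset.UTC).toInstant());
        } catch (DateTimeParseException e) {
            return null; // no format matched
        }
    }
}
```

Ordering the attempts from most to least specific means a value carrying an offset is never silently reinterpreted as UTC.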
protected abstract org.apache.tika.parser.Parser getParser()
Returns the correct Tika Parser to process the document. If unsure which parser to use, see TikaAutoMetadataExtracter, which makes use of the Tika auto-detection.
protected org.apache.tika.embedder.Embedder getEmbedder()
protected boolean needHeaderContents()
protected Map<String,Serializable> extractSpecific(org.apache.tika.metadata.Metadata metadata, Map<String,Serializable> properties, Map<String,String> headers)
protected InputStream getInputStream(org.alfresco.service.cmr.repository.ContentReader reader) throws IOException
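There seems to be an issue between some downstream third-party libraries and input streams that come from a ContentReader. This happens most often with JPEG and TIFF files. For these cases, buffer out to a local file if not already there.
Throws: IOException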
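Buffering a stream out to a local file before handing it to a downstream library could be done along these lines. The sketch uses only `java.nio.file`; the class name, the temp file prefix and the `deleteOnExit` clean-up are assumptions, and the real method works from a ContentReader rather than a raw stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch: copy a stream to a local temp file and reopen it from disk. */
final class LocalBufferSketch {
    static InputStream bufferToLocalFile(InputStream source) {
        try {
            Path tmp = Files.createTempFile("tika-buffer-", ".tmp");
            tmp.toFile().deleteOnExit(); // best-effort clean-up when the JVM exits
            try (InputStream in = source) {
                // createTempFile already created the file, so it must be replaced
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            }
            return Files.newInputStream(tmp); // file-backed stream for the parser
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller remains responsible for closing the returned stream once parsing is complete.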
public void setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector)
Parameters: documentSelector
protected org.apache.tika.extractor.DocumentSelector getDocumentSelector(org.apache.tika.metadata.Metadata metadata, String targetMimeType)
Parameters: metadata, targetMimeType
protected org.apache.tika.parser.ParseContext buildParseContext(org.apache.tika.metadata.Metadata metadata, String sourceMimeType)
Parameters: metadata, sourceMimeType
protected Map<String,Serializable> extractRaw(org.alfresco.service.cmr.repository.ContentReader reader) throws Throwable
Description copied from class: AbstractMappingMetadataExtracter
Override to provide the raw extracted metadata values. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently, and more or fewer of the properties may be used in different installations.
Raw values must not be trimmed or removed for any reason. Null values and empty strings are
Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:
editor: - the document editor --> cm:author
title: - the document title --> cm:title
user1: - the document summary
user2: - the document description --> cm:description
user3: -
user4: -
Specified by: extractRaw in class AbstractMappingMetadataExtracter
Parameters:
reader - the document to extract the values from. The stream provided by the reader must be closed if accessed directly.
Throws: Throwable - All exception conditions can be handled.
See Also: AbstractMappingMetadataExtracter.getDefaultMapping()
protected void embedInternal(Map<String,Serializable> properties, org.alfresco.service.cmr.repository.ContentReader reader, org.alfresco.service.cmr.repository.ContentWriter writer) throws Throwable
Description copied from class: AbstractMappingMetadataExtracter
Override to embed metadata values. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently, and more or fewer of the properties may be used in different installations.
Overrides: embedInternal in class AbstractMappingMetadataExtracter
Parameters:
properties - the metadata keys and values to embed in the content file
reader - the reader for the original document. The stream provided by the reader must be closed if accessed directly.
writer - the writer for the document to embed the values in. The stream provided by the writer must be closed if accessed directly.
Throws: Throwable - All exception conditions can be handled.
See Also: AbstractMappingMetadataExtracter.getDefaultEmbedMapping()
Copyright © 2005–2017 Alfresco Software. All rights reserved.