|
author: -- cm:author
title: -- cm:title
subject: -- cm:description
created: -- cm:created
comments:
Nested classes/interfaces inherited from interface org.alfresco.repo.content.metadata.MetadataExtracter |
MetadataExtracter.OverwritePolicy |
Field Summary | ||
protected org.apache.tika.extractor.DocumentSelector |
documentSelector | |
protected static String |
KEY_AUTHOR | |
protected static String |
KEY_COMMENTS | |
protected static String |
KEY_CREATED | |
protected static String |
KEY_DESCRIPTION | |
protected static String |
KEY_SUBJECT | |
protected static String |
KEY_TAGS | |
protected static String |
KEY_TITLE | |
protected static org.apache.commons.logging.Log |
logger |
Fields inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter |
MEGABYTE_SIZE, metadataExtracterConfig, NAMESPACE_PROPERTY_PREFIX, PROPERTY_COMPONENT_EMBED, PROPERTY_COMPONENT_EXTRACT, PROPERTY_PREFIX_METADATA |
Constructor Summary | ||
TikaPoweredMetadataExtracter(String extractorContext, ArrayList<String> supportedMimeTypes) | ||
TikaPoweredMetadataExtracter(String extractorContext, HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes) | ||
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes) | ||
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes, ArrayList<String> supportedEmbedMimeTypes) | ||
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes) | ||
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes) |
Method Summary | ||
protected org.apache.tika.parser.ParseContext |
buildParseContext(org.apache.tika.metadata.Metadata metadata, String sourceMimeType) By default, returns a new ParseContext. |
|
buildSupportedMimetypes(String[] explicitTypes, org.apache.tika.parser.Parser... tikaParsers) Builds a list of supported MIME types by merging an explicit list with any types that the given Tika parsers also claim to support. |
||
protected void |
embedInternal(Map<String,Serializable> properties, ContentReader reader, ContentWriter writer) Override to embed metadata values. |
|
protected Map<String,Serializable> |
extractRaw(ContentReader reader) Override to provide the raw extracted metadata values. |
|
protected String |
extractSize(String sizeText) Exif size metadata includes the string "pixels" after the numeric value; this method stops at the first non-digit character found in the text. |
|
protected Map<String,Serializable> |
extractSpecific(org.apache.tika.metadata.Metadata metadata, Map<String,Serializable> properties, Map<String,String> headers) Allows implementation-specific mappings to be done. |
|
protected org.apache.tika.extractor.DocumentSelector |
getDocumentSelector(org.apache.tika.metadata.Metadata metadata, String targetMimeType) Gets the document selector, used for determining whether to parse embedded resources. Null by default, so all embedded resources are parsed. |
|
protected org.apache.tika.embedder.Embedder |
getEmbedder() Returns the Tika Embedder to modify the document. |
|
protected String |
getExtractorContext() Gets context for the current implementation |
|
protected InputStream |
getInputStream(ContentReader reader) There appears to be an issue between some downstream third-party libraries and input streams that come from a ContentReader. |
|
getMetadataSeparator() | ||
protected abstract org.apache.tika.parser.Parser |
getParser() Returns the correct Tika Parser to process the document. |
|
protected Date |
makeDate(String dateStr) Variant that also tries the ISO-8601 formats (in order), and similar formats that Tika makes use of. |
|
protected boolean |
needHeaderContents() Do we care about the contents of the extracted header, or nothing at all? |
|
void |
setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector) Sets the document selector, used for determining whether to parse embedded resources. |
|
void |
setMetadataSeparator(String metadataSeparator) |
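The extractSize summary above says the method stops at the first non-digit character, so that an Exif value such as "1024 pixels" yields only the number. The following is a minimal sketch of that behaviour, not the class's actual implementation: the class name SizeTextSketch is hypothetical, and the handling of input with no leading digits (an empty string is returned) is an assumption.

```java
// Hypothetical stand-alone sketch; not part of the Alfresco API.
final class SizeTextSketch {
    private SizeTextSketch() {}

    /**
     * Returns the leading run of ASCII digits in sizeText.
     * Assumption: text with no leading digits yields an empty string.
     */
    static String extractSize(String sizeText) {
        int end = 0;
        // Advance until the first character that is not 0-9.
        while (end < sizeText.length()
                && sizeText.charAt(end) >= '0'
                && sizeText.charAt(end) <= '9') {
            end++;
        }
        return sizeText.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(extractSize("1024 pixels"));
    }
}
```

Scanning from the left and stopping early keeps the unit suffix out of the result without needing to know which suffixes a given parser emits.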
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
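The makeDate summary above describes trying ISO-8601 formats in order. One way to sketch that idea with java.time is shown below. The class name IsoDateSketch, the particular formats tried, the UTC default for values without an offset, and the Optional return type are all assumptions made for illustration; the real method returns a Date, and its exact format list is not given here.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Date;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-alone sketch; not part of the Alfresco API.
final class IsoDateSketch {
    private IsoDateSketch() {}

    // Candidate parsers, tried in order from most to least specific.
    private static final List<Function<String, Instant>> PARSERS = List.of(
            // e.g. 2011-03-04T10:15:30Z or 2011-03-04T10:15:30+01:00
            s -> OffsetDateTime.parse(s).toInstant(),
            // e.g. 2011-03-04T10:15:30 (assumption: treated as UTC)
            s -> LocalDateTime.parse(s).toInstant(ZoneOffset.UTC),
            // e.g. 2011-03-04 (assumption: midnight UTC)
            s -> LocalDate.parse(s).atStartOfDay(ZoneOffset.UTC).toInstant());

    /** Returns the first successful parse, or empty if no format matches. */
    static Optional<Date> makeDate(String dateStr) {
        for (Function<String, Instant> parser : PARSERS) {
            try {
                return Optional.of(Date.from(parser.apply(dateStr)));
            } catch (DateTimeParseException e) {
                // This format did not match; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(makeDate("2011-03-04T10:15:30Z").isPresent());
    }
}
```

Ordering the parsers from most to least specific means a value that carries an explicit offset is never reinterpreted under the UTC default.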
Raw values must not be trimmed or removed for any reason. Null values and empty strings are
Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:
editor: - the document editor --> cm:author
title: - the document title --> cm:title
user1: - the document summary
user2: - the document description --> cm:description
user3: -
user4: -
|