
org.alfresco.repo.content.metadata
Class TikaPoweredMetadataExtracter
java.lang.Object
  org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
      org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter
All Implemented Interfaces:
MetadataEmbedder, ContentWorker, MetadataExtracter, org.springframework.beans.factory.BeanNameAware, org.springframework.beans.factory.Aware, org.springframework.context.ApplicationContextAware
Direct Known Subclasses:
TikaSpringConfiguredMetadataExtracter

@org.alfresco.api.AlfrescoPublicApi
public abstract class TikaPoweredMetadataExtracter
extends AbstractMappingMetadataExtracter
implements MetadataEmbedder
The parent class of all metadata extracters that use Apache Tika under the hood. It handles the common parts of file processing and the common mappings listed below. Individual extracters extend this class to add custom mappings.
   author:                 --      cm:author
   title:                  --      cm:title
   subject:                --      cm:description
   created:                --      cm:created
   comments:
 
Since:
3.4
Author:
Nick Burch

Nested classes/interfaces inherited from interface org.alfresco.repo.content.metadata.MetadataExtracter
MetadataExtracter.OverwritePolicy
Field Summary
protected org.apache.tika.extractor.DocumentSelector
documentSelector
protected static String
KEY_AUTHOR
protected static String
KEY_COMMENTS
protected static String
KEY_CREATED
protected static String
KEY_DESCRIPTION
protected static String
KEY_SUBJECT
protected static String
KEY_TAGS
protected static String
KEY_TITLE
protected static org.apache.commons.logging.Log
logger
Fields inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
MEGABYTE_SIZE, metadataExtracterConfig, NAMESPACE_PROPERTY_PREFIX, PROPERTY_COMPONENT_EMBED, PROPERTY_COMPONENT_EXTRACT, PROPERTY_PREFIX_METADATA
Constructor Summary
TikaPoweredMetadataExtracter(String extractorContext, ArrayList<String> supportedMimeTypes)
TikaPoweredMetadataExtracter(String extractorContext, HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes)
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes)
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes, ArrayList<String> supportedEmbedMimeTypes)
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes)
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes)
Method Summary
protected org.apache.tika.parser.ParseContext
buildParseContext(org.apache.tika.metadata.Metadata metadata, String sourceMimeType)
          By default, returns a new ParseContext.
protected static ArrayList<String>
buildSupportedMimetypes(String[] explicitTypes, org.apache.tika.parser.Parser... tikaParsers)
          Builds up a list of supported mime types by merging an explicit list with any that Tika also claims to support
protected void
embedInternal(Map<String,Serializable> properties, ContentReader reader, ContentWriter writer)
          Override to embed metadata values.
protected Map<String,Serializable>
extractRaw(ContentReader reader)
          Override to provide the raw extracted metadata values.
protected String
extractSize(String sizeText)
          Exif metadata for size also returns the string "pixels" after the numeric value; this function stops at the first non-digit character found in the text.
protected Map<String,Serializable>
extractSpecific(org.apache.tika.metadata.Metadata metadata, Map<String,Serializable> properties, Map<String,String> headers)
          Allows implementation specific mappings to be done.
protected org.apache.tika.extractor.DocumentSelector
getDocumentSelector(org.apache.tika.metadata.Metadata metadata, String targetMimeType)
          Gets the document selector, used for determining whether to parse embedded resources, null by default so parse all.
protected org.apache.tika.embedder.Embedder
getEmbedder()
          Returns the Tika Embedder to modify the document.
protected String
getExtractorContext()
          Gets context for the current implementation
protected InputStream
getInputStream(ContentReader reader)
          Some downstream third-party libraries appear to have problems with input streams that come from a ContentReader.
String
getMetadataSeparator()
protected abstract org.apache.tika.parser.Parser
getParser()
          Returns the correct Tika Parser to process the document.
protected Date
makeDate(String dateStr)
          Variant that also tries the ISO-8601 formats (in order) and similar formats that Tika uses.
protected boolean
needHeaderContents()
          Returns whether the extracter needs the contents of the extracted header; if not, the contents are ignored.
void
setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector)
          Sets the document selector, used for determining whether to parse embedded resources.
void
setMetadataSeparator(String metadataSeparator)
Methods inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
checkIsEmbedSupported, checkIsSupported, embed, extract, extract, extract, filterSystemProperties, getBeanName, getDefaultEmbedMapping, getDefaultMapping, getEmbedMapping, getExecutorService, getExtractionTime, getLimits, getMapping, getMimetypeService, getReliability, init, isEmbeddingSupported, isSupported, newRawMap, putRawValue, readEmbedMappingProperties, readEmbedMappingProperties, readGlobalEmbedMappingProperties, readGlobalExtractMappingProperties, readMappingProperties, readMappingProperties, register, setApplicationContext, setBeanName, setDictionaryService, setEmbedMapping, setEmbedMappingProperties, setEnableStringTagging, setExecutorService, setFailOnTypeConversion, setInheritDefaultEmbedMapping, setInheritDefaultMapping, setMapping, setMappingProperties, setMetadataExtracterConfig, setMimetypeLimits, setMimetypeService, setOverwritePolicy, setOverwritePolicy, setProperties, setRegistry, setSupportedDateFormats, setSupportedEmbedMimetypes, setSupportedMimetypes
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
documentSelector
protected org.apache.tika.extractor.DocumentSelector documentSelector

KEY_AUTHOR
protected static final String KEY_AUTHOR
See Also:
Constant Field Values

KEY_COMMENTS
protected static final String KEY_COMMENTS
See Also:
Constant Field Values

KEY_CREATED
protected static final String KEY_CREATED
See Also:
Constant Field Values

KEY_DESCRIPTION
protected static final String KEY_DESCRIPTION
See Also:
Constant Field Values

KEY_SUBJECT
protected static final String KEY_SUBJECT
See Also:
Constant Field Values

KEY_TAGS
protected static final String KEY_TAGS
See Also:
Constant Field Values

KEY_TITLE
protected static final String KEY_TITLE
See Also:
Constant Field Values

logger
protected static org.apache.commons.logging.Log logger
Constructor Detail
TikaPoweredMetadataExtracter
public TikaPoweredMetadataExtracter(String extractorContext,
                                    ArrayList<String> supportedMimeTypes)

TikaPoweredMetadataExtracter
public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes)

TikaPoweredMetadataExtracter
public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes,
                                    ArrayList<String> supportedEmbedMimeTypes)

TikaPoweredMetadataExtracter
public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes)

TikaPoweredMetadataExtracter
public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes,
                                    HashSet<String> supportedEmbedMimeTypes)

TikaPoweredMetadataExtracter
public TikaPoweredMetadataExtracter(String extractorContext,
                                    HashSet<String> supportedMimeTypes,
                                    HashSet<String> supportedEmbedMimeTypes)
Method Detail
getMetadataSeparator
public String getMetadataSeparator()

setMetadataSeparator
public void setMetadataSeparator(String metadataSeparator)

buildSupportedMimetypes
protected static ArrayList<String> buildSupportedMimetypes(String[] explicitTypes,
                                                           org.apache.tika.parser.Parser... tikaParsers)
Builds up a list of supported mime types by merging an explicit list with any that Tika also claims to support
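
The merge described above can be sketched with the Java standard library alone. This is a minimal illustration, not the Alfresco implementation: the class name MimetypeMergeSketch is hypothetical, plain sets of strings stand in for the types each Tika parser claims, and dropping duplicates while keeping first-seen order is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in; not part of the Alfresco or Tika APIs.
final class MimetypeMergeSketch {

    /**
     * Merges an explicit list of mime types with the types claimed by parsers.
     * Assumption: duplicates are dropped and first-seen order is preserved.
     */
    @SafeVarargs
    static ArrayList<String> merge(String[] explicitTypes, Set<String>... claimedTypes) {
        Set<String> merged = new LinkedHashSet<>();
        if (explicitTypes != null) {
            merged.addAll(Arrays.asList(explicitTypes));
        }
        // Each set represents what one parser says it supports.
        for (Set<String> claimed : claimedTypes) {
            merged.addAll(claimed);
        }
        return new ArrayList<>(merged);
    }
}
```

A subclass would typically pass the result of such a merge to one of the constructors that accept an ArrayList of supported mime types.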

getExtractorContext
protected String getExtractorContext()
Gets context for the current implementation
Returns:
String value which determines current context

makeDate
protected Date makeDate(String dateStr)
Variant that also tries the ISO-8601 formats (in order) and similar formats that Tika uses.
Overrides:
makeDate in class AbstractMappingMetadataExtracter
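
Trying formats in order can be sketched as follows. This is an illustration under stated assumptions, not the method's actual code: the class name DateFallbackSketch and the pattern list are hypothetical (this page does not list the patterns the real method tries), and returning null when nothing matches is also an assumption.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in; the real pattern list is not given on this page.
final class DateFallbackSketch {

    // Most specific patterns first: SimpleDateFormat.parse(String) accepts a
    // matching prefix, so a date-only pattern placed first would match everything.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ssXXX",
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd"
    };

    /** Returns the first successful parse, or null if no pattern matches. */
    static Date parse(String dateStr) {
        if (dateStr == null) {
            return null;
        }
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setLenient(false);
            // Values without an explicit offset are read as UTC (assumption).
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            try {
                return format.parse(dateStr);
            } catch (ParseException e) {
                // Not this pattern; try the next one.
            }
        }
        return null;
    }
}
```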

getParser
protected abstract org.apache.tika.parser.Parser getParser()
Returns the correct Tika Parser to process the document. If you don't know which you want, use TikaAutoMetadataExtracter which makes use of the Tika auto-detection.

getEmbedder
protected org.apache.tika.embedder.Embedder getEmbedder()
Returns the Tika Embedder to modify the document.
Returns:
the Tika embedder

needHeaderContents
protected boolean needHeaderContents()
Returns whether the extracter needs the contents of the extracted header; if not, the contents are ignored.

extractSpecific
protected Map<String,Serializable> extractSpecific(org.apache.tika.metadata.Metadata metadata,
                                                   Map<String,Serializable> properties,
                                                   Map<String,String> headers)
Allows implementation specific mappings to be done.

getInputStream
protected InputStream getInputStream(ContentReader reader)
                              throws IOException
Some downstream third-party libraries appear to have problems with input streams that come from a ContentReader. This happens most often with JPEG and TIFF files. In these cases, the content is buffered out to a local file if it is not already there.
Throws:
IOException
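
Buffering a stream out to a local file can be sketched with java.nio alone. This is a rough illustration, not the Alfresco code: the class name LocalBufferSketch and the temp-file prefix are hypothetical, the sketch takes a plain InputStream rather than a ContentReader, and it always copies instead of first checking whether the content is already on local disk.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in; not part of the Alfresco API.
final class LocalBufferSketch {

    /** Copies the source stream to a temp file and returns a stream over that file. */
    static InputStream bufferToLocalFile(InputStream source) throws IOException {
        Path tempFile = Files.createTempFile("metadata-extract-", ".bin");
        // Drain and close the original stream before handing back the local copy.
        try (InputStream in = source) {
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
        }
        // DELETE_ON_CLOSE removes the temp file once the caller closes the stream.
        return Files.newInputStream(tempFile, StandardOpenOption.DELETE_ON_CLOSE);
    }
}
```

The caller reads from the returned stream exactly as it would from the original one, and closing it cleans up the temporary file.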

setDocumentSelector
public void setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector)
Sets the document selector, used for determining whether to parse embedded resources.

getDocumentSelector
protected org.apache.tika.extractor.DocumentSelector getDocumentSelector(org.apache.tika.metadata.Metadata metadata,
                                                                         String targetMimeType)
Gets the document selector, used for determining whether to parse embedded resources, null by default so parse all.
Returns:
the document selector

buildParseContext
protected org.apache.tika.parser.ParseContext buildParseContext(org.apache.tika.metadata.Metadata metadata,
                                                                String sourceMimeType)
By default, returns a new ParseContext.
Returns:
the parse context

extractRaw
protected Map<String,Serializable> extractRaw(ContentReader reader)
                                       throws Throwable
Description copied from class: AbstractMappingMetadataExtracter
Override to provide the raw extracted metadata values. An extracter should extract as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, each instance of the extracter can be configured differently, so more or fewer of the properties may be used in different installations.

Raw values must not be trimmed or removed for any reason. Null values, empty strings, and non-serializable values are handled as follows:

  • Null: Removed
  • Empty String: Passed to the OverwritePolicy
  • Non Serializable: Converted to String or fails if that is not possible
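
The three rules above can be illustrated with a small standalone sketch. It is not the framework's code: the class name RawValueRulesSketch is hypothetical, and using String.valueOf as the string conversion is an assumption, since this page does not say how the conversion is performed or when it fails.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in; not part of the Alfresco API.
final class RawValueRulesSketch {

    /** Applies the null, empty-string and non-serializable rules to raw values. */
    static Map<String, Serializable> applyRules(Map<String, Object> rawValues) {
        Map<String, Serializable> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : rawValues.entrySet()) {
            Object value = entry.getValue();
            if (value == null) {
                continue; // Null: removed
            }
            if (value instanceof Serializable) {
                // Empty strings land here and are kept, so that the
                // OverwritePolicy can decide what to do with them.
                result.put(entry.getKey(), (Serializable) value);
            } else {
                // Non-serializable: converted to String (assumed conversion).
                result.put(entry.getKey(), String.valueOf(value));
            }
        }
        return result;
    }
}
```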

Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:

 editor: - the document editor        -->  cm:author
 title:  - the document title         -->  cm:title
 user1:  - the document summary
 user2:  - the document description   -->  cm:description
 user3:  -
 user4:  -
 
Overrides:
extractRaw in class AbstractMappingMetadataExtracter
Parameters:
reader - the document to extract the values from. The stream provided by the reader must be closed if accessed directly.
Returns:
Returns a map of document property values keyed by property name.
Throws:
Throwable - All exception conditions can be handled.

embedInternal
protected void embedInternal(Map<String,Serializable> properties,
                             ContentReader reader,
                             ContentWriter writer)
                      throws Throwable
Description copied from class: AbstractMappingMetadataExtracter
Override to embed metadata values. An extracter should embed as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, each instance of the extracter can be configured differently, so more or fewer of the properties may be used in different installations.
Overrides:
embedInternal in class AbstractMappingMetadataExtracter
Parameters:
reader - the reader for the original document. The stream provided by the reader must be closed if accessed directly.
writer - the writer for the document to embed the values in. The stream provided by the writer must be closed if accessed directly.
Throws:
Throwable - All exception conditions can be handled.

extractSize
protected String extractSize(String sizeText)
Exif metadata for size also returns the string "pixels" after the numeric value; this function stops at the first non-digit character found in the text.
Parameters:
sizeText - string text
Returns:
the size value
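
Stopping at the first non-digit character can be sketched as below. The class and method names are hypothetical, and returning null for null input is an assumption; this page does not say how the real method treats null or text with no leading digits.

```java
// Hypothetical stand-in; not part of the Alfresco API.
final class SizeTextSketch {

    /** Returns the leading run of digits in sizeText. */
    static String leadingDigits(String sizeText) {
        if (sizeText == null) {
            return null; // assumed behaviour for null input
        }
        StringBuilder digits = new StringBuilder();
        for (char c : sizeText.toCharArray()) {
            if (!Character.isDigit(c)) {
                break; // stop at the first non-digit character
            }
            digits.append(c);
        }
        return digits.toString();
    }
}
```

With this sketch, an Exif value such as "1024 pixels" yields "1024", while text that does not start with a digit yields an empty string.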


Copyright © 2005–2018 Alfresco Software. All rights reserved.

Java API documentation generated with DocFlex/Javadoc 1.6.1 using JavadocPro template set.