TikaPoweredMetadataExtracter (Alfresco 5.3.a-SNAPSHOT API)

java.lang.Object
- org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
- - org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter

All Implemented Interfaces:

ContentWorker, MetadataEmbedder, MetadataExtracter, org.springframework.beans.factory.Aware, org.springframework.beans.factory.BeanNameAware, org.springframework.context.ApplicationContextAware

Direct Known Subclasses:

DWGMetadataExtracter, MailMetadataExtracter, OfficeMetadataExtracter, OpenDocumentMetadataExtracter, PdfBoxMetadataExtracter, PoiMetadataExtracter, TikaAudioMetadataExtracter, TikaAutoMetadataExtracter, TikaSpringConfiguredMetadataExtracter
```
@AlfrescoPublicApi
public abstract class TikaPoweredMetadataExtracter
extends AbstractMappingMetadataExtracter
implements MetadataEmbedder
```
The parent of all metadata extractors that use Apache Tika under the hood. It handles the common parts of file processing and the common mappings. Individual extractors extend this class to add their own custom mappings.
```
   author:                 --      cm:author
   title:                  --      cm:title
   subject:                --      cm:description
   created:                --      cm:created
   comments:
 
```
Since:

3.4

Author:

Nick Burch

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`protected static class`	`TikaPoweredMetadataExtracter.HeadContentHandler` This content handler will capture entries from within the header of the Tika content XHTML, but ignore the rest.
`protected static class`	`TikaPoweredMetadataExtracter.MapCaptureContentHandler` This content handler will grab all tags and attributes, and record the textual content of the last seen one of them.
`protected static class`	`TikaPoweredMetadataExtracter.NullContentHandler` A content handler that ignores all the content it finds.
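
To illustrate the idea behind `HeadContentHandler`, the following is a minimal sketch built only on the JDK's SAX classes. It is not the Alfresco implementation: the class name `HeadOnlyHandlerSketch`, the map-based result, and the assumption that header entries arrive as `<meta name="..." content="..."/>` elements are all illustrative choices.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the Alfresco class: records <meta> entries
// found inside <head> and ignores everything outside it.
class HeadOnlyHandlerSketch extends DefaultHandler {
    private final Map<String, String> headers = new LinkedHashMap<>();
    private boolean inHead = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("head".equals(qName)) {
            inHead = true;
        } else if (inHead && "meta".equals(qName)) {
            // Assumed shape of a header entry: <meta name="..." content="..."/>
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                headers.put(name, content);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("head".equals(qName)) {
            inHead = false; // everything after </head> is skipped
        }
    }

    Map<String, String> getHeaders() {
        return headers;
    }

    // Convenience helper for the sketch: parse a well-formed XHTML string.
    static Map<String, String> parse(String xhtml) {
        try {
            HeadOnlyHandlerSketch handler = new HeadOnlyHandlerSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getHeaders();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

A `<meta>` element that appears inside `<body>` is not recorded, which mirrors the "ignore the rest" behaviour described above.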

Nested classes/interfaces inherited from interface org.alfresco.repo.content.metadata.MetadataExtracter
MetadataExtracter.OverwritePolicy

Field Summary

Fields
Modifier and Type	Field and Description
`protected org.apache.tika.extractor.DocumentSelector`	`documentSelector`
`protected static String`	`KEY_AUTHOR`
`protected static String`	`KEY_COMMENTS`
`protected static String`	`KEY_CREATED`
`protected static String`	`KEY_DESCRIPTION`
`protected static String`	`KEY_SUBJECT`
`protected static String`	`KEY_TAGS`
`protected static String`	`KEY_TITLE`
`protected static org.apache.commons.logging.Log`	`logger`

Fields inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
MEGABYTE_SIZE, metadataExtracterConfig, NAMESPACE_PROPERTY_PREFIX, PROPERTY_COMPONENT_EMBED, PROPERTY_COMPONENT_EXTRACT, PROPERTY_PREFIX_METADATA

Constructor Summary

Constructors
Constructor and Description
`TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes)`
`TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes, ArrayList<String> supportedEmbedMimeTypes)`
`TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes)`
`TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes)`
`TikaPoweredMetadataExtracter(String extractorContext, ArrayList<String> supportedMimeTypes)`
`TikaPoweredMetadataExtracter(String extractorContext, HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes)`

Method Summary

Modifier and Type	Method and Description
`protected org.apache.tika.parser.ParseContext`	`buildParseContext(org.apache.tika.metadata.Metadata metadata, String sourceMimeType)` By default returns a new ParseContext
`protected static ArrayList<String>`	`buildSupportedMimetypes(String[] explicitTypes, org.apache.tika.parser.Parser... tikaParsers)` Builds up a list of supported mime types by merging an explicit list with any that Tika also claims to support
`protected void`	`embedInternal(Map<String,Serializable> properties, org.alfresco.service.cmr.repository.ContentReader reader, org.alfresco.service.cmr.repository.ContentWriter writer)` Override to embed metadata values.
`protected Map<String,Serializable>`	`extractRaw(org.alfresco.service.cmr.repository.ContentReader reader)` Override to provide the raw extracted metadata values.
`protected String`	`extractSize(String sizeText)` Exif metadata for size also returns the string "pixels" after the numeric value; this function stops at the first non-digit character found in the text.
`protected Map<String,Serializable>`	`extractSpecific(org.apache.tika.metadata.Metadata metadata, Map<String,Serializable> properties, Map<String,String> headers)` Allows implementation specific mappings to be done.
`protected org.apache.tika.extractor.DocumentSelector`	`getDocumentSelector(org.apache.tika.metadata.Metadata metadata, String targetMimeType)` Gets the document selector, used for determining whether to parse embedded resources, null by default so parse all.
`protected org.apache.tika.embedder.Embedder`	`getEmbedder()` Returns the Tika Embedder to modify the document.
`protected String`	`getExtractorContext()` Gets context for the current implementation
`protected InputStream`	`getInputStream(org.alfresco.service.cmr.repository.ContentReader reader)` There seems to be an issue between some downstream third-party libraries and input streams that come from a `ContentReader`.
`String`	`getMetadataSeparator()`
`protected abstract org.apache.tika.parser.Parser`	`getParser()` Returns the correct Tika Parser to process the document.
`protected Date`	`makeDate(String dateStr)` Variant that also tries the ISO-8601 formats, in order, along with the similar formats that Tika uses.
`protected boolean`	`needHeaderContents()` Do we care about the contents of the extracted header, or nothing at all?
`void`	`setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector)` Sets the document selector, used for determining whether to parse embedded resources.
`void`	`setMetadataSeparator(String metadataSeparator)`

Methods inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
checkIsEmbedSupported, checkIsSupported, embed, extract, extract, extract, filterSystemProperties, getBeanName, getDefaultEmbedMapping, getDefaultMapping, getEmbedMapping, getExecutorService, getLimits, getMapping, getMimetypeService, init, isEmbeddingSupported, isSupported, newRawMap, putRawValue, readEmbedMappingProperties, readEmbedMappingProperties, readGlobalEmbedMappingProperties, readGlobalExtractMappingProperties, readMappingProperties, readMappingProperties, register, setApplicationContext, setBeanName, setDictionaryService, setEmbedMapping, setEmbedMappingProperties, setEnableStringTagging, setExecutorService, setFailOnTypeConversion, setInheritDefaultEmbedMapping, setInheritDefaultMapping, setMapping, setMappingProperties, setMetadataExtracterConfig, setMimetypeLimits, setMimetypeService, setOverwritePolicy, setProperties, setRegistry, setSupportedDateFormats, setSupportedEmbedMimetypes, setSupportedMimetypes

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.alfresco.repo.content.metadata.MetadataEmbedder
embed, isEmbeddingSupported

- Field Detail
  - logger
```
protected static org.apache.commons.logging.Log logger
```
  - KEY_AUTHOR
```
protected static final String KEY_AUTHOR
```
    See Also:
    
    Constant Field Values
  - KEY_TITLE
```
protected static final String KEY_TITLE
```
    See Also:
    
    Constant Field Values
  - KEY_SUBJECT
```
protected static final String KEY_SUBJECT
```
    See Also:
    
    Constant Field Values
  - KEY_CREATED
```
protected static final String KEY_CREATED
```
    See Also:
    
    Constant Field Values
  - KEY_DESCRIPTION
```
protected static final String KEY_DESCRIPTION
```
    See Also:
    
    Constant Field Values
  - KEY_COMMENTS
```
protected static final String KEY_COMMENTS
```
    See Also:
    
    Constant Field Values
  - KEY_TAGS
```
protected static final String KEY_TAGS
```
    See Also:
    
    Constant Field Values
  - documentSelector
```
protected org.apache.tika.extractor.DocumentSelector documentSelector
```
- Constructor Detail
  - TikaPoweredMetadataExtracter
```
public TikaPoweredMetadataExtracter(String extractorContext,
                                    ArrayList<String> supportedMimeTypes)
```
  - TikaPoweredMetadataExtracter
```
public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes)
```
  - TikaPoweredMetadataExtracter
```
public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes,
                                    ArrayList<String> supportedEmbedMimeTypes)
```
  - TikaPoweredMetadataExtracter
```
public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes)
```
  - TikaPoweredMetadataExtracter
```
public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes,
                                    HashSet<String> supportedEmbedMimeTypes)
```
  - TikaPoweredMetadataExtracter
```
public TikaPoweredMetadataExtracter(String extractorContext,
                                    HashSet<String> supportedMimeTypes,
                                    HashSet<String> supportedEmbedMimeTypes)
```
- Method Detail
  - getMetadataSeparator
```
public String getMetadataSeparator()
```
  - setMetadataSeparator
```
public void setMetadataSeparator(String metadataSeparator)
```
  - buildSupportedMimetypes
```
protected static ArrayList<String> buildSupportedMimetypes(String[] explicitTypes,
                                                           org.apache.tika.parser.Parser... tikaParsers)
```
    Builds up a list of supported mime types by merging an explicit list with any that Tika also claims to support
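    The merge described above can be sketched with plain JDK collections. This is only an illustration under stated assumptions: each parser's claimed types are passed in as a `Set<String>` (in Tika they would come from `Parser.getSupportedTypes(ParseContext)`), the class and method names are hypothetical, and de-duplicating while keeping the explicit types first is an assumed behaviour.
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: merges an explicit mime type list with the types
// that each parser claims to support, keeping first-seen order.
class SupportedMimetypesSketch {
    @SafeVarargs
    static ArrayList<String> merge(String[] explicitTypes, Set<String>... parserTypes) {
        // LinkedHashSet drops duplicates but preserves insertion order,
        // so the explicit types stay at the front of the result.
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        if (explicitTypes != null) {
            merged.addAll(Arrays.asList(explicitTypes));
        }
        if (parserTypes != null) {
            for (Set<String> types : parserTypes) {
                merged.addAll(types);
            }
        }
        return new ArrayList<>(merged);
    }
}
```
    A subclass would typically compute such a list once, for example in a static initialiser, and pass it to one of the constructors listed above.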
  - getExtractorContext
```
protected String getExtractorContext()
```
    Gets context for the current implementation
    
    Returns:
    
    String value which determines current context
  - makeDate
```
protected Date makeDate(String dateStr)
```
    Variant that also tries the ISO-8601 formats, in order, along with the similar formats that Tika uses.
    
    Overrides:
    
    makeDate in class AbstractMappingMetadataExtracter
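    Trying several date formats in order can be sketched as follows. The pattern list here is hypothetical and is not the list that Alfresco or Tika actually uses; the class name and the choice to return `null` when nothing matches are likewise assumptions made for the example.
```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical sketch of "try each format in order".
class DateFallbackSketch {
    // Most specific pattern first: SimpleDateFormat.parse(String) accepts a
    // matching prefix, so "yyyy-MM-dd" would otherwise match a full timestamp
    // and silently drop the time part.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ssX",
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd"
    };

    static Date makeDate(String dateStr) {
        if (dateStr == null) {
            return null;
        }
        for (String pattern : PATTERNS) {
            // A new formatter per attempt, since SimpleDateFormat is not thread-safe.
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setLenient(false);
            try {
                return format.parse(dateStr.trim());
            } catch (ParseException e) {
                // This pattern did not match; fall through to the next one.
            }
        }
        return null; // Assumption: the sketch signals "no match" with null.
    }
}
```
    The ordering matters: placing the date-only pattern first would make every timestamp parse as midnight of that day.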
  - getParser
```
protected abstract org.apache.tika.parser.Parser getParser()
```
    Returns the correct Tika Parser to process the document. If you don't know which you want, use TikaAutoMetadataExtracter which makes use of the Tika auto-detection.
  - getEmbedder
```
protected org.apache.tika.embedder.Embedder getEmbedder()
```
    Returns the Tika Embedder to modify the document.
    
    Returns:
    
    the Tika embedder
  - needHeaderContents
```
protected boolean needHeaderContents()
```
    Do we care about the contents of the extracted header, or nothing at all?
  - extractSpecific
```
protected Map<String,Serializable> extractSpecific(org.apache.tika.metadata.Metadata metadata,
                                                   Map<String,Serializable> properties,
                                                   Map<String,String> headers)
```
    Allows implementation specific mappings to be done.
  - getInputStream
```
protected InputStream getInputStream(org.alfresco.service.cmr.repository.ContentReader reader)
                              throws IOException
```
    There seems to be an issue between some downstream third-party libraries and input streams that come from a ContentReader. It occurs most often with JPEG and TIFF files. In those cases, the content is buffered out to a local file if it is not already held in one.
    
    Throws:
    
    IOException
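    The buffering step can be sketched with the JDK alone. This is not the Alfresco code: it takes a plain `InputStream` instead of a `ContentReader`, always copies (the real method is described as skipping the copy when the content is already a local file), and the class name, method name, and temp file prefix are hypothetical.
```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: copy a stream to a local temp file so that
// downstream libraries can read from an ordinary file.
class LocalBufferSketch {
    static Path bufferToLocalFile(InputStream source) {
        try {
            Path temp = Files.createTempFile("metadata-extract-", ".bin");
            temp.toFile().deleteOnExit(); // best-effort clean-up for the sketch
            try (InputStream in = source) {
                // createTempFile already created the file, so replace it.
                Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
            }
            return temp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```
    A caller could then open the result with `Files.newInputStream(path)` and hand that stream to the parser.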
  - setDocumentSelector
```
public void setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector)
```
    Sets the document selector, used for determining whether to parse embedded resources.
    
    Parameters:
    
    documentSelector -
  - getDocumentSelector
```
protected org.apache.tika.extractor.DocumentSelector getDocumentSelector(org.apache.tika.metadata.Metadata metadata,
                                                                         String targetMimeType)
```
    Gets the document selector, used for determining whether to parse embedded resources, null by default so parse all.
    
    Parameters:
    
    metadata -
    
    targetMimeType -
    
    Returns:
    
    the document selector
  - buildParseContext
```
protected org.apache.tika.parser.ParseContext buildParseContext(org.apache.tika.metadata.Metadata metadata,
                                                                String sourceMimeType)
```
    By default returns a new ParseContext
    
    Parameters:
    
    metadata -
    
    sourceMimeType -
    
    Returns:
    
    the parse context
  - extractRaw
```
protected Map<String,Serializable> extractRaw(org.alfresco.service.cmr.repository.ContentReader reader)
                                       throws Throwable
```
    Description copied from class: AbstractMappingMetadataExtracter
    Override to provide the raw extracted metadata values. An extracter should extract as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently and more or less of the properties may be used in different installations.
    Raw values must not be trimmed or removed for any reason. Null values, empty strings, and non-serializable values are handled as follows:
    - Null: Removed
    - Empty String: Passed to the OverwritePolicy
    - Non Serializable: Converted to String or fails if that is not possible
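    The three rules above can be sketched as a small helper. This is an illustration only, not the framework's code: the class and method names are hypothetical, and using `String.valueOf` for the conversion is an assumption.
```java
import java.io.Serializable;
import java.util.Map;

// Hypothetical sketch of the null / empty string / non-serializable rules.
class RawValueSketch {
    // Returns true if a value was stored in the destination map.
    static boolean put(String key, Object value, Map<String, Serializable> destination) {
        if (value == null) {
            return false; // Null: removed, nothing is stored
        }
        if (value instanceof Serializable) {
            // Includes the empty string, which is kept so that the
            // OverwritePolicy can decide what to do with it later.
            destination.put(key, (Serializable) value);
        } else {
            // Non-serializable: converted to a String (assumed conversion).
            destination.put(key, String.valueOf(value));
        }
        return true;
    }
}
```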
    Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:
```
 editor: - the document editor        -->  cm:author
 title:  - the document title         -->  cm:title
 user1:  - the document summary
 user2:  - the document description   -->  cm:description
 user3:  -
 user4:  -
 
```
    Specified by:
    
    extractRaw in class AbstractMappingMetadataExtracter
    
    Parameters:
    
    reader - the document to extract the values from. This stream provided by the reader must be closed if accessed directly.
    
    Returns:
    
    Returns a map of document property values keyed by property name.
    
    Throws:
    
    Throwable - All exception conditions can be handled.
    
    See Also:
    
    AbstractMappingMetadataExtracter.getDefaultMapping()
  - embedInternal
```
protected void embedInternal(Map<String,Serializable> properties,
                             org.alfresco.service.cmr.repository.ContentReader reader,
                             org.alfresco.service.cmr.repository.ContentWriter writer)
                      throws Throwable
```
    Description copied from class: AbstractMappingMetadataExtracter
    
    Override to embed metadata values. An extracter should embed as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently and more or less of the properties may be used in different installations.
    
    Overrides:
    
    embedInternal in class AbstractMappingMetadataExtracter
    
    Parameters:
    
    properties - the metadata keys and values to embed in the content file
    
    reader - the reader for the original document. This stream provided by the reader must be closed if accessed directly.
    
    writer - the writer for the document to embed the values in. This stream provided by the writer must be closed if accessed directly.
    
    Throws:
    
    Throwable - All exception conditions can be handled.
    
    See Also:
    
    AbstractMappingMetadataExtracter.getDefaultEmbedMapping()
  - extractSize
```
protected String extractSize(String sizeText)
```
    Exif metadata for size also returns the string "pixels" after the numeric value; this function stops at the first non-digit character found in the text.
    
    Parameters:
    
    sizeText - string text
    
    Returns:
    
    the size value
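    The "stop at the first non-digit character" rule can be sketched as follows. The class name is hypothetical, and trimming the input and returning an empty string for `null` are assumptions of the sketch rather than documented behaviour.
```java
// Hypothetical sketch of extracting the leading digits from a size string.
class ExtractSizeSketch {
    static String extractSize(String sizeText) {
        if (sizeText == null) {
            return ""; // Assumption: the sketch treats null as "no digits".
        }
        StringBuilder digits = new StringBuilder();
        for (char c : sizeText.trim().toCharArray()) {
            if (!Character.isDigit(c)) {
                break; // stop at the first non-digit character
            }
            digits.append(c);
        }
        return digits.toString();
    }
}
```
    With this sketch, `ExtractSizeSketch.extractSize("1024 pixels")` returns `"1024"`.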
