TikaPoweredMetadataExtracter (Alfresco 5.0.3 Public API)

All Implemented Interfaces:

MetadataEmbedder, ContentWorker, MetadataExtracter, org.springframework.beans.factory.BeanNameAware, org.springframework.beans.factory.Aware, org.springframework.context.ApplicationContextAware

Direct Known Subclasses:

TikaSpringConfiguredMetadataExtracter

public abstract class TikaPoweredMetadataExtracter

extends AbstractMappingMetadataExtracter

implements MetadataEmbedder

The parent of all metadata extractors that use Apache Tika under the hood. It handles the common parts of file processing and the common mappings. Individual extractors extend it to add custom mappings.

   author:                 --      cm:author
   title:                  --      cm:title
   subject:                --      cm:description
   created:                --      cm:created
   comments:

Since:

3.4

Author:

Nick Burch

Nested classes/interfaces inherited from interface org.alfresco.repo.content.metadata.MetadataExtracter

MetadataExtracter.OverwritePolicy

Field Summary
protected org.apache.tika.extractor.DocumentSelector	documentSelector
protected static String	KEY_AUTHOR
protected static String	KEY_COMMENTS
protected static String	KEY_CREATED
protected static String	KEY_DESCRIPTION
protected static String	KEY_SUBJECT
protected static String	KEY_TITLE
protected static org.apache.commons.logging.Log	logger

Fields inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter

NAMESPACE_PROPERTY_PREFIX, PROPERTY_COMPONENT_EMBED, PROPERTY_COMPONENT_EXTRACT, PROPERTY_PREFIX_METADATA

Constructor Summary
TikaPoweredMetadataExtracter(String extractorContext, ArrayList<String> supportedMimeTypes)
TikaPoweredMetadataExtracter(String extractorContext, HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes)
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes)
TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes, ArrayList<String> supportedEmbedMimeTypes)
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes)
TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes, HashSet<String> supportedEmbedMimeTypes)

Method Summary
protected org.apache.tika.parser.ParseContext	buildParseContext(org.apache.tika.metadata.Metadata metadata, String sourceMimeType) By default returns a new ParseContext.
protected static ArrayList<String>	buildSupportedMimetypes(String[] explicitTypes, org.apache.tika.parser.Parser... tikaParsers) Builds up a list of supported mime types by merging an explicit list with any that Tika also claims to support
protected void	embedInternal(Map<String,Serializable> properties, ContentReader reader, ContentWriter writer) Override to embed metadata values.
protected Map<String,Serializable>	extractRaw(ContentReader reader) Override to provide the raw extracted metadata values.
protected String	extractSize(String sizeText) Exif size metadata includes the string "pixels" after the numeric value; this method returns the leading digits, stopping at the first non-digit character found in the text.
protected Map<String,Serializable>	extractSpecific(org.apache.tika.metadata.Metadata metadata, Map<String,Serializable> properties, Map<String,String> headers) Allows implementation-specific mappings to be done.
protected org.apache.tika.extractor.DocumentSelector	getDocumentSelector(org.apache.tika.metadata.Metadata metadata, String targetMimeType) Gets the document selector, used for determining whether to parse embedded resources, null by default so parse all.
protected org.apache.tika.embedder.Embedder	getEmbedder() Returns the Tika Embedder to modify the document.
protected String	getExtractorContext() Gets context for the current implementation
protected InputStream	getInputStream(ContentReader reader) There seems to be some sort of issue with some downstream 3rd party libraries, and input streams that come from a ContentReader.
protected abstract org.apache.tika.parser.Parser	getParser() Returns the correct Tika Parser to process the document.
protected Date	makeDate(String dateStr) Variant that also tries the ISO-8601 formats (in order) and the similar formats that Tika uses.
protected boolean	needHeaderContents() Do we care about the contents of the extracted header, or nothing at all?
void	setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector) Sets the document selector, used for determining whether to parse embedded resources.

Methods inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter

checkIsEmbedSupported, checkIsSupported, embed, extract, extract, extract, filterSystemProperties, getBeanName, getDefaultEmbedMapping, getDefaultMapping, getEmbedMapping, getExecutorService, getExtractionTime, getLimits, getMapping, getMimetypeService, getReliability, init, isEmbeddingSupported, isSupported, newRawMap, putRawValue, readEmbedMappingProperties, readEmbedMappingProperties, readGlobalEmbedMappingProperties, readGlobalExtractMappingProperties, readMappingProperties, readMappingProperties, register, setApplicationContext, setBeanName, setDictionaryService, setEmbedMapping, setEmbedMappingProperties, setEnableStringTagging, setExecutorService, setFailOnTypeConversion, setInheritDefaultEmbedMapping, setInheritDefaultMapping, setMapping, setMappingProperties, setMimetypeLimits, setMimetypeService, setOverwritePolicy, setOverwritePolicy, setProperties, setRegistry, setSupportedDateFormats, setSupportedEmbedMimetypes, setSupportedMimetypes

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

documentSelector

protected org.apache.tika.extractor.DocumentSelector documentSelector

KEY_AUTHOR

protected static final String KEY_AUTHOR

See Also:

Constant Field Values

KEY_COMMENTS

protected static final String KEY_COMMENTS

See Also:

Constant Field Values

KEY_CREATED

protected static final String KEY_CREATED

See Also:

Constant Field Values

KEY_DESCRIPTION

protected static final String KEY_DESCRIPTION

See Also:

Constant Field Values

KEY_SUBJECT

protected static final String KEY_SUBJECT

See Also:

Constant Field Values

KEY_TITLE

protected static final String KEY_TITLE

See Also:

Constant Field Values

logger

protected static org.apache.commons.logging.Log logger

Constructor Detail

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(String extractorContext,
ArrayList<String> supportedMimeTypes)

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes)

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(ArrayList<String> supportedMimeTypes,
ArrayList<String> supportedEmbedMimeTypes)

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes)

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(HashSet<String> supportedMimeTypes,
HashSet<String> supportedEmbedMimeTypes)

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(String extractorContext,
HashSet<String> supportedMimeTypes,
HashSet<String> supportedEmbedMimeTypes)

Method Detail

buildSupportedMimetypes

protected static ArrayList<String> buildSupportedMimetypes(String[] explicitTypes,
org.apache.tika.parser.Parser... tikaParsers)

Builds up a list of supported mime types by merging an explicit list with any that Tika also claims to support
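The merge described above can be sketched in plain Java. This is an illustration only: `MimetypeMergeSketch` and `merge` are hypothetical names, and the types each Tika parser claims are represented by stand-in `Set<String>` values instead of real `org.apache.tika.parser.Parser` instances. Whether the actual method removes duplicates or preserves order is not stated on this page; the sketch assumes both.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class MimetypeMergeSketch {

    /**
     * Merges an explicit list of mime types with the types claimed by each
     * parser. The parser-claimed types are stand-in sets (an assumption);
     * the real method receives Tika Parser instances.
     */
    @SafeVarargs
    static ArrayList<String> merge(String[] explicitTypes, Set<String>... parserClaimedTypes) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> merged = new LinkedHashSet<>();
        if (explicitTypes != null) {
            merged.addAll(Arrays.asList(explicitTypes));
        }
        for (Set<String> claimed : parserClaimedTypes) {
            merged.addAll(claimed);
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        Set<String> claimedByParser = new LinkedHashSet<>(
                Arrays.asList("image/jpeg", "application/pdf"));
        System.out.println(merge(new String[] {"application/pdf", "text/plain"}, claimedByParser));
    }
}
```

Explicit types come first in the result, so a subclass can list its preferred types ahead of whatever the parsers report.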

getExtractorContext

protected String getExtractorContext()

Gets context for the current implementation

Returns:

String value which determines current context

makeDate

protected Date makeDate(String dateStr)

Variant that also tries the ISO-8601 formats (in order) and the similar formats that Tika uses.

Overrides:

makeDate in class AbstractMappingMetadataExtracter
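Trying several date formats in order can be sketched as follows. The pattern list, the UTC default, and the names `DateFallbackSketch` and `parseInOrder` are assumptions made for illustration; this page does not list the formats the class actually tries, and the sketch does not fall back to the superclass `makeDate` as an overriding method could.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class DateFallbackSketch {

    // Hypothetical patterns, most specific first. Order matters: a shorter
    // pattern such as yyyy-MM-dd would also accept the prefix of a longer value.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd"
    };

    /** Returns the first successful parse, or null if no pattern matches. */
    static Date parseInOrder(String dateStr) {
        if (dateStr == null) {
            return null;
        }
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: treat values as UTC
            format.setLenient(false);
            try {
                return format.parse(dateStr.trim());
            } catch (ParseException e) {
                // Fall through and try the next pattern.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parseInOrder("2010-05-01T10:20:30Z"));
        System.out.println(parseInOrder("not a date"));
    }
}
```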

getParser

protected abstract org.apache.tika.parser.Parser getParser()

Returns the correct Tika Parser to process the document. If you don't know which you want, use TikaAutoMetadataExtracter which makes use of the Tika auto-detection.

getEmbedder

protected org.apache.tika.embedder.Embedder getEmbedder()

Returns the Tika Embedder to modify the document.

Returns:

the Tika embedder

needHeaderContents

protected boolean needHeaderContents()

Do we care about the contents of the extracted header, or nothing at all?

extractSpecific

protected Map<String,Serializable> extractSpecific(org.apache.tika.metadata.Metadata metadata,
Map<String,Serializable> properties,
Map<String,String> headers)

Allows implementation-specific mappings to be done.

getInputStream

protected InputStream getInputStream(ContentReader reader)
throws IOException

There seems to be an issue with some downstream third-party libraries and input streams that come from a ContentReader. It occurs most often with JPEG and TIFF files. In these cases, the content is buffered out to a local file if it is not already in one.

Throws:

IOException
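Buffering a stream out to a local file can be sketched with the standard library alone. `LocalBufferSketch` and `bufferToLocalFile` are hypothetical names, and a plain `InputStream` stands in for the stream obtained from a `ContentReader`. Unlike the method described above, the sketch always buffers: it does not check the mime type or whether the content already lives in a local file.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

final class LocalBufferSketch {

    /**
     * Copies the source stream to a temporary file and returns a stream over
     * that file, so downstream libraries read from local disk.
     */
    static InputStream bufferToLocalFile(InputStream source) throws IOException {
        File temp = File.createTempFile("metadata-sketch-", ".bin");
        temp.deleteOnExit();
        try (InputStream in = source) {
            // createTempFile has already created the file, so replace it.
            Files.copy(in, temp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        return new FileInputStream(temp);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4};
        try (InputStream buffered = bufferToLocalFile(new ByteArrayInputStream(data))) {
            System.out.println("first byte: " + buffered.read());
        }
    }
}
```

The caller remains responsible for closing the returned stream.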

setDocumentSelector

public void setDocumentSelector(org.apache.tika.extractor.DocumentSelector documentSelector)

Sets the document selector, used for determining whether to parse embedded resources.

getDocumentSelector

protected org.apache.tika.extractor.DocumentSelector getDocumentSelector(org.apache.tika.metadata.Metadata metadata,
String targetMimeType)

Gets the document selector, used for determining whether to parse embedded resources, null by default so parse all.

Returns:

the document selector

buildParseContext

protected org.apache.tika.parser.ParseContext buildParseContext(org.apache.tika.metadata.Metadata metadata,
String sourceMimeType)

By default returns a new ParseContext.

Returns:

the parse context

extractRaw

protected Map<String,Serializable> extractRaw(ContentReader reader)
throws Throwable

Description copied from class: AbstractMappingMetadataExtracter

Override to provide the raw extracted metadata values. An extracter should extract as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently and more or less of the properties may be used in different installations.

Raw values must not be trimmed or removed for any reason. Null values, empty strings, and non-serializable values are handled as follows:

Null: Removed
Empty String: Passed to the OverwritePolicy
Non Serializable: Converted to String or fails if that is not possible
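The three rules listed above can be illustrated with a small stand-alone sketch. `RawValueRulesSketch` and `applyRules` are hypothetical names; this shows only the listed behaviour and is not the framework's own code, which also applies the OverwritePolicy and the configured mappings.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

final class RawValueRulesSketch {

    /** Applies the null, empty-string, and non-serializable rules to raw values. */
    static Map<String, Serializable> applyRules(Map<String, Object> rawValues) {
        Map<String, Serializable> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : rawValues.entrySet()) {
            Object value = entry.getValue();
            if (value == null) {
                continue; // Null: removed
            }
            if (value instanceof Serializable) {
                // Empty strings are kept as-is so a later policy step can decide.
                result.put(entry.getKey(), (Serializable) value);
            } else {
                // Non-serializable: converted to String (toString may itself fail).
                result.put(entry.getKey(), String.valueOf(value));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("author", null);
        raw.put("title", "");
        raw.put("subject", new StringBuilder("a summary"));
        System.out.println(applyRules(raw).keySet());
    }
}
```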

Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:

 editor: - the document editor        -->  cm:author
 title:  - the document title         -->  cm:title
 user1:  - the document summary
 user2:  - the document description   -->  cm:description
 user3:  -
 user4:  -

Overrides:

extractRaw in class AbstractMappingMetadataExtracter

Parameters:

reader - the document to extract the values from. The stream provided by the reader must be closed if accessed directly.

Returns:

Returns a map of document property values keyed by property name.

Throws:

Throwable - All exception conditions can be handled.

embedInternal

protected void embedInternal(Map<String,Serializable> properties,
                             ContentReader reader,
                             ContentWriter writer)
                      throws Throwable

Description copied from class: AbstractMappingMetadataExtracter

Override to embed metadata values. An extracter should embed as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently and more or less of the properties may be used in different installations.

Overrides:

embedInternal in class AbstractMappingMetadataExtracter

Parameters:

reader - the reader for the original document. The stream provided by the reader must be closed if accessed directly.

writer - the writer for the document to embed the values in. The stream provided by the writer must be closed if accessed directly.

Throws:

Throwable - All exception conditions can be handled.

extractSize

protected String extractSize(String sizeText)

Exif size metadata includes the string "pixels" after the numeric value. This method returns the leading digits, stopping at the first non-digit character found in the text.

Parameters:

sizeText - string text

Returns:

the size value
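A minimal sketch of the "stop at the first non-digit character" behaviour follows. `SizeTextSketch` and `leadingDigits` are hypothetical names. The sketch assumes a non-null input and returns an empty string when the text does not start with a digit; neither case is specified on this page.

```java
final class SizeTextSketch {

    /** Returns the run of digits at the start of the text, e.g. for "1024 pixels". */
    static String leadingDigits(String sizeText) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < sizeText.length(); i++) {
            char c = sizeText.charAt(i);
            if (!Character.isDigit(c)) {
                break; // stop at the first non-digit character
            }
            digits.append(c);
        }
        return digits.toString();
    }

    public static void main(String[] args) {
        System.out.println(leadingDigits("1024 pixels")); // prints 1024
    }
}
```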


Java API documentation generated with DocFlex/Javadoc 1.6.1 using JavadocPro template set.