|
Migrating an existing extracter to use this class is straightforward:
Nested classes/interfaces inherited from interface org.alfresco.repo.content.metadata.MetadataExtracter |
MetadataExtracter.OverwritePolicy |
Field Summary | ||
protected static org.apache.commons.logging.Log |
logger | |
static int |
MEGABYTE_SIZE | |
protected org.alfresco.repo.content.metadata.MetadataExtracterConfig |
metadataExtracterConfig | |
static String |
NAMESPACE_PROPERTY_PREFIX | |
static String |
PROPERTY_COMPONENT_EMBED | |
static String |
PROPERTY_COMPONENT_EXTRACT | |
static String |
PROPERTY_PREFIX_METADATA |
Constructor Summary | ||
protected |
AbstractMappingMetadataExtracter() Default constructor. |
|
protected |
AbstractMappingMetadataExtracter(Set<String> supportedMimetypes) Constructor that can be used when the list of supported mimetypes is known up front. |
|
protected |
AbstractMappingMetadataExtracter(Set<String> supportedMimetypes, Set<String> supportedEmbedMimetypes) Constructor that can be used when the list of supported extract and embed mimetypes is known up front. |
Method Summary | ||
protected void |
checkIsEmbedSupported(ContentWriter writer) Checks if embedding for the mimetype is supported. |
|
protected void |
checkIsSupported(ContentReader reader) Checks if the mimetype is supported. |
|
void |
embed(Map<QName,Serializable> properties, ContentReader reader, ContentWriter writer) Embeds the given properties into the file specified by the given content writer. |
|
protected void |
embedInternal(Map<String,Serializable> metadata, ContentReader reader, ContentWriter writer) Override to embed metadata values. |
|
extract(ContentReader reader, Map<QName,Serializable> destination) Extracts the metadata values from the content provided by the reader and source mimetype to the supplied map. |
||
extract(ContentReader reader, MetadataExtracter.OverwritePolicy overwritePolicy, Map<QName,Serializable> destination) Extracts the metadata values from the content provided by the reader and source mimetype to the supplied map. |
||
extract(ContentReader reader, MetadataExtracter.OverwritePolicy overwritePolicy, Map<QName,Serializable> destination, Map<String,Set<QName>> mapping) Extracts the metadata from the content provided by the reader and source mimetype to the supplied map. |
||
protected abstract Map<String,Serializable> |
extractRaw(ContentReader reader) Override to provide the raw extracted metadata values. |
|
protected void |
filterSystemProperties(Map<QName,Serializable> systemProperties, Map<QName,Serializable> targetProperties) Filters the system properties that are going to be applied. |
|
getBeanName() | ||
getDefaultEmbedMapping() This method provides a best guess of what model properties should be embedded in content. |
||
getDefaultMapping() This method provides a best guess of where to store the values extracted from the documents. |
||
getEmbedMapping() Helper method for derived classes to obtain the embed mappings. |
||
protected ExecutorService |
getExecutorService() Gets the ExecutorService to be used for timeout-aware extraction. |
|
long |
getExtractionTime() Provides an estimate, usually a worst case guess, of how long an extraction will take. |
|
protected MetadataExtracterLimits |
getLimits(String mimetype) Gets the metadata extracter limits for the given mimetype. |
|
getMapping() Helper method for derived classes to obtain the mappings that will be applied to raw values. |
||
protected MimetypeService |
getMimetypeService() | |
double |
getReliability(String mimetype) TODO - This doesn't appear to be used, so should be removed / deprecated / replaced |
|
protected void |
init() Provides a hook point for implementations to perform initialization. |
|
boolean |
isEmbeddingSupported(String sourceMimetype) Determines if the extracter works against the given mimetype. |
|
boolean |
isSupported(String sourceMimetype) Determines if the extracter works against the given mimetype. |
|
protected Date |
makeDate(String dateStr) Convert a date String to a Date object |
|
protected Map<String,Serializable> |
newRawMap() Helper method to fetch a clean map into which raw values can be dumped. |
|
protected boolean |
putRawValue(String key, Serializable value, Map<String,Serializable> destination) Adds a value to the map, conserving null values. |
|
readEmbedMappingProperties(Properties mappingProperties) A utility method to convert mapping properties to the Map form. |
||
readEmbedMappingProperties(String propertiesUrl) A utility method to read embed mapping properties from a resource file and convert to the map form. |
||
readGlobalEmbedMappingProperties() A utility method to convert global mapping properties to the Map form. |
||
readGlobalExtractMappingProperties() A utility method to convert global properties to the Map form for the given propertyComponent. |
||
readMappingProperties(Properties mappingProperties) A utility method to convert mapping properties to the Map form. |
||
readMappingProperties(String propertiesUrl) A utility method to read mapping properties from a resource file and convert to the map form. |
||
void |
register() Registers this instance of the extracter with the registry. |
|
void |
setApplicationContext(org.springframework.context.ApplicationContext applicationContext) | |
void |
setBeanName(String beanName) | |
void |
setDictionaryService(DictionaryService dictionaryService) | |
void |
setEmbedMapping(Map<QName,Set<String>> embedMapping) Set the embed mapping from document metadata to system metadata. |
|
void |
setEmbedMappingProperties(Properties embedMappingProperties) Set the properties that contain the embed mapping from model properties to content file metadata. |
|
void |
setEnableStringTagging(boolean enableStringTagging) Whether or not to enable the pass through of simple strings to cm:taggable tags |
|
void |
setExecutorService(ExecutorService executorService) Sets the ExecutorService to be used for timeout-aware extraction. |
|
void |
setFailOnTypeConversion(boolean failOnTypeConversion) Set whether the extractor should discard metadata that fails to convert to the target type defined in the data dictionary model. |
|
void |
setInheritDefaultEmbedMapping(boolean inheritDefaultEmbedMapping) Set if the embed property mappings augment or override the mapping generically provided by the extracter implementation. |
|
void |
setInheritDefaultMapping(boolean inheritDefaultMapping) Set if the property mappings augment or override the mapping generically provided by the extracter implementation. |
|
void |
setMapping(Map<String,Set<QName>> mapping) Set the mapping from document metadata to system metadata. |
|
void |
setMappingProperties(Properties mappingProperties) Set the properties that contain the mapping from document metadata to system metadata. |
|
void |
setMetadataExtracterConfig(org.alfresco.repo.content.metadata.MetadataExtracterConfig metadataExtracterConfig) The metadata extracter config. |
|
void |
setMimetypeLimits(Map<String,MetadataExtracterLimits> mimetypeLimits) Sets the map of source mimetypes to metadata extracter limits. |
|
void |
setMimetypeService(MimetypeService mimetypeService) | |
void |
setOverwritePolicy(MetadataExtracter.OverwritePolicy overwritePolicy) Set the policy to use when existing values are encountered. |
|
void |
setOverwritePolicy(String overwritePolicyStr) Set the policy to use when existing values are encountered. |
|
void |
setProperties(Properties properties) The Alfresco global properties. |
|
void |
setRegistry(MetadataExtracterRegistry registry) Set the registry to register with. |
|
void |
setSupportedDateFormats(List<String> supportedDateFormats) Set the date formats, over and above the ISO8601 format, that will be supported for string to date conversions. |
|
void |
setSupportedEmbedMimetypes(Collection<String> supportedEmbedMimetypes) Set the mimetypes that are supported for embedding. |
|
void |
setSupportedMimetypes(Collection<String> supportedMimetypes) Set the mimetypes that are supported by the extracter. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
1.0 if the mimetype is supported, otherwise 0.0
Note that even when set to true an individual property mapping entry replaces the entry provided by the extracter implementation.
Gets the ExecutorService to be used for timeout-aware extraction. If no ExecutorService has been defined, a default of Executors.newCachedThreadPool() is used during init().
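As an illustration of what timeout-aware extraction with an ExecutorService might look like, here is a minimal sketch using only java.util.concurrent. The class TimeoutExtractionSketch, the method callWithTimeout and the fallback parameter are hypothetical names introduced for this example; they are not part of this class, and the actual implementation may handle timeouts differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of the Alfresco API. */
final class TimeoutExtractionSketch
{
    /**
     * Runs the task on the executor and waits at most timeoutMs milliseconds.
     * Returns the fallback value if the task does not finish in time.
     */
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> task,
                                 long timeoutMs, T fallback)
    {
        Future<T> future = executor.submit(task);
        try
        {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e)
        {
            // Ask the worker thread to stop; the task must honour interruption
            future.cancel(true);
            return fallback;
        }
        catch (InterruptedException e)
        {
            // Restore the interrupt flag for the caller and give up on the task
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
        catch (ExecutionException e)
        {
            throw new IllegalStateException("Task failed", e.getCause());
        }
    }

    public static void main(String[] args)
    {
        // Same default the text above mentions: a cached thread pool
        ExecutorService executor = Executors.newCachedThreadPool();
        try
        {
            System.out.println(callWithTimeout(executor, () -> "fast", 1000L, "timed out"));
            System.out.println(callWithTimeout(executor, () -> {
                Thread.sleep(5000L);
                return "slow";
            }, 50L, "timed out"));
        }
        finally
        {
            executor.shutdownNow();
        }
    }
}
```

Returning a fallback on timeout keeps the caller free of checked exceptions; a real extracter might prefer to log the timeout and return an empty raw map.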
Sets the ExecutorService to be used for timeout-aware extraction, that is, the ExecutorService for timeouts.
# Namespaces prefixes
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.my=http://www....com/alfresco/1.0
# Mapping
editor=cm:author, my:editor
title=cm:title
user1=cm:summary
user2=cm:description
The mapping can therefore be from a single document property onto several system properties.
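To make the properties layout above concrete, the following sketch parses such a file into a map from document property to a set of qualified names. It is an approximation only: ExtractMappingSketch and parse are hypothetical names, qualified names are represented as plain strings of the form {uri}localName rather than Alfresco QName objects, and the real readMappingProperties may validate or resolve names differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical parser for the extract mapping format; not the Alfresco implementation. */
final class ExtractMappingSketch
{
    /** Key prefix that declares a namespace, as seen in the sample properties. */
    static final String NAMESPACE_PREFIX = "namespace.prefix.";

    static Map<String, Set<String>> parse(String propertiesText)
    {
        Properties props = new Properties();
        try
        {
            props.load(new StringReader(propertiesText));
        }
        catch (IOException e)
        {
            throw new IllegalArgumentException("Unreadable mapping properties", e);
        }

        // First pass: collect prefix -> namespace URI declarations
        Map<String, String> namespaces = new TreeMap<>();
        for (String key : props.stringPropertyNames())
        {
            if (key.startsWith(NAMESPACE_PREFIX))
            {
                namespaces.put(key.substring(NAMESPACE_PREFIX.length()),
                               props.getProperty(key).trim());
            }
        }

        // Second pass: every other key is a document property mapped to one or more names
        Map<String, Set<String>> mapping = new TreeMap<>();
        for (String key : props.stringPropertyNames())
        {
            if (key.startsWith(NAMESPACE_PREFIX))
            {
                continue;
            }
            Set<String> targets = new LinkedHashSet<>();
            for (String target : props.getProperty(key).split(","))
            {
                String[] parts = target.trim().split(":", 2);
                if (parts.length != 2 || !namespaces.containsKey(parts[0]))
                {
                    throw new IllegalArgumentException("Cannot resolve '" + target.trim() + "'");
                }
                targets.add("{" + namespaces.get(parts[0]) + "}" + parts[1]);
            }
            mapping.put(key, targets);
        }
        return mapping;
    }
}
```

Each comma-separated value becomes one entry in the target set, which is how a single document property such as editor can feed several system properties.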
# Namespaces prefixes
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
namespace.prefix.my=http://www....com/alfresco/1.0
# Mapping
cm\:author=editor
cm\:title=title
cm\:summary=user1
cm\:description=description,user2
The embed mapping can therefore be from a model property onto several content file metadata properties.
Normally, the list of properties that can be extracted from a document is fixed and well-known; in that case, just extract everything. However, some implementations may have an extra, indeterminate set of values available for extraction. If the extraction of these runtime parameters is expensive, then the keys provided by the return value can be used to extract values from the documents. The metadata extraction becomes fully configuration-driven, i.e. declaring further mappings will result in more values being extracted from the documents.
Most extractors will not be using this method. For an example of its use, see the OpenDocument extractor, which uses the mapping to select specific user properties from a document.
Normally, the list of properties that can be embedded in a document is fixed and well-known. However, some implementations may have an extra, indeterminate set of values available for embedding. If the embedding of these runtime parameters is expensive, then the keys provided by the return value can be used to embed values in the documents. The metadata embedding becomes fully configuration-driven, i.e. declaring further mappings will result in more values being embedded in the documents.
Mappings can be specified using the same method defined for normal mapping properties files but with a prefix of metadata.extracter, the extracter bean name, and the extract component.
For example:
metadata.extracter.TikaAuto.extract.namespace.prefix.my=http://DummyMappingMetadataExtracter
metadata.extracter.TikaAuto.extract.namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
metadata.extracter.TikaAuto.extract.dc\:description=cm:description, my:customDescription
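The prefixing scheme shown in the example can be pictured as a simple key filter: keep only the global properties that start with the prefix for a given bean and component, then strip that prefix. The sketch below assumes exactly that. GlobalMappingSketch and select are hypothetical names, and the literal prefix string is taken from the example keys rather than from the class constants.

```java
import java.util.Properties;

/** Hypothetical filter for bean-specific global mapping properties. */
final class GlobalMappingSketch
{
    /**
     * Returns the properties whose keys start with
     * "metadata.extracter.<beanName>.<component>." with that prefix removed.
     */
    static Properties select(Properties global, String beanName, String component)
    {
        String prefix = "metadata.extracter." + beanName + "." + component + ".";
        Properties selected = new Properties();
        for (String key : global.stringPropertyNames())
        {
            if (key.startsWith(prefix))
            {
                selected.setProperty(key.substring(prefix.length()), global.getProperty(key));
            }
        }
        return selected;
    }
}
```

The result has the same shape as a normal mapping properties file, so it could be handed to the same parsing step used for per-extracter files.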
Different from readGlobalExtractMappingProperties in that keys are the Alfresco QNames and values are file metadata properties.
Mappings can be specified using the same method defined for normal embed mapping properties files but with a prefix of metadata.extracter, the extracter bean name, and the embed component.
For example:
metadata.extracter.TikaAuto.embed.namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
metadata.extracter.TikaAuto.embed.cm\:description=description
Different from readMappingProperties in that keys are the Alfresco QNames and values are file metadata properties.
This method is used to determine, up front, which of a set of equally reliable extracters will be used for a specific extraction.
The extraction viability can be determined by an up front call to MetadataExtracter.isSupported(String).
The source mimetype must be available on the ContentAccessor.getMimetype() method of the reader.
The embedding viability can be determined by an up front call to MetadataEmbedder.isEmbeddingSupported(String).
The source mimetype must be available on the ContentAccessor.getMimetype() method of the writer.
The default implementation looks for the default mapping file in the location given by the class name and .properties. If the extracter's class is x.y.z.MyExtracter then the default properties will be picked up at classpath:/alfresco/metadata/MyExtracter.properties. The previous location of classpath:/x/y/z/MyExtracter.properties is still supported but may be removed in a future release. Inner classes are supported, but the '$' in the class name is replaced with '-', so default properties for x.y.z.MyStuff$MyExtracter will be located using classpath:/alfresco/metadata/MyStuff-MyExtracter.properties.
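The naming rule for the default mapping file can be sketched as a small string transformation on the class name. DefaultMappingLocationSketch and defaultLocation are hypothetical names used only for illustration; the sketch covers the current alfresco/metadata location and does not attempt the older package-based location.

```java
/** Hypothetical helper that derives the default mapping file location from a class name. */
final class DefaultMappingLocationSketch
{
    /**
     * Takes a binary class name such as "x.y.z.MyStuff$MyExtracter" and returns
     * the location under classpath:/alfresco/metadata/ described in the text.
     */
    static String defaultLocation(String className)
    {
        // Drop the package part, keeping outer and inner class names
        String simpleName = className.substring(className.lastIndexOf('.') + 1);
        // Inner class separator '$' becomes '-'
        return "classpath:/alfresco/metadata/" + simpleName.replace('$', '-') + ".properties";
    }

    public static void main(String[] args)
    {
        System.out.println(defaultLocation("x.y.z.MyExtracter"));
        System.out.println(defaultLocation("x.y.z.MyStuff$MyExtracter"));
    }
}
```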
The default mapping implementation should include thorough Javadocs so that the system administrators can accurately determine how to best enhance or override the default mapping.
If the default mapping is declared in a properties file other than the one named after the class, then the readMappingProperties(String) method can be used to quickly generate the return value:
protected Map<String, Set<QName>> getDefaultMapping()
{
    // DEFAULT_MAPPING is the URL of the properties file that holds the mapping
    return readMappingProperties(DEFAULT_MAPPING);
}
The map can also be created in code either statically or during the call.
The default implementation looks for the default mapping file in the location given by the class name and .embed.properties. If the extracter's class is x.y.z.MyExtracter then the default properties will be picked up at classpath:/x/y/z/MyExtracter.embed.properties. Inner classes are supported, but the '$' in the class name is replaced with '-', so default properties for x.y.z.MyStuff$MyExtracter will be located using classpath:/x/y/z/MyStuff-MyExtracter.embed.properties.
The default mapping implementation should include thorough Javadocs so that the system administrators can accurately determine how to best enhance or override the default mapping.
If the default mapping is declared in a properties file other than the one named after the class, then the readEmbedMappingProperties(String) method can be used to quickly generate the return value:
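protected Map<QName, Set<String>> getDefaultEmbedMapping()
{
    // DEFAULT_MAPPING is the URL of the properties file that holds the embed mapping
    return readEmbedMappingProperties(DEFAULT_MAPPING);
}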
The map can also be created in code either statically or during the call.
If no embed mapping properties file is found, a reverse of the extract mapping in getDefaultMapping() is assumed: the first QName in each value is used as the key for this mapping, with a last-wins approach for duplicates.
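The fallback just described (reverse the extract mapping, take the first QName of each value, let the last duplicate win) can be sketched as follows. This is an assumption-laden illustration: ReverseMappingSketch and reverse are hypothetical names, QNames are plain strings, and "first" relies on the value sets having a stable iteration order, such as a LinkedHashSet.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical reversal of an extract mapping into an embed mapping. */
final class ReverseMappingSketch
{
    /**
     * For each document property, uses the first mapped name as the embed key.
     * When two document properties share the same first name, the later entry
     * replaces the earlier one (last wins).
     */
    static Map<String, Set<String>> reverse(Map<String, Set<String>> extractMapping)
    {
        Map<String, Set<String>> embedMapping = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : extractMapping.entrySet())
        {
            if (entry.getValue().isEmpty())
            {
                continue; // nothing to reverse for this document property
            }
            String firstName = entry.getValue().iterator().next();
            embedMapping.put(firstName, Collections.singleton(entry.getKey()));
        }
        return embedMapping;
    }
}
```

Only the first name of each value takes part, so secondary targets such as a second QName for the same document property do not appear in the reversed mapping.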
A specific match for the given mimetype is tried first. If none is found, a wildcard of "*" is tried; if that is not found either, the default values are used.
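The lookup order described here (exact mimetype, then the "*" wildcard, then defaults) can be sketched with a generic map lookup. MimetypeLimitsSketch and resolve are hypothetical names, and the limits type is left generic because the structure of MetadataExtracterLimits is not shown on this page.

```java
import java.util.Map;

/** Hypothetical three-step lookup for per-mimetype limits. */
final class MimetypeLimitsSketch
{
    static final String WILDCARD = "*";

    /**
     * Returns the entry for the exact mimetype if present, otherwise the
     * wildcard entry, otherwise the supplied defaults.
     */
    static <T> T resolve(Map<String, T> limitsByMimetype, String mimetype, T defaults)
    {
        T limits = limitsByMimetype.get(mimetype);
        if (limits == null)
        {
            limits = limitsByMimetype.get(WILDCARD);
        }
        return limits != null ? limits : defaults;
    }
}
```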
Raw values must not be trimmed or removed for any reason. Null values and empty strings are
Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:
editor: - the document editor --> cm:author
title: - the document title --> cm:title
user1: - the document summary
user2: - the document description --> cm:description
user3: -
user4: -
|