Content Services performs metadata extraction on content automatically. However, you may wish to create custom metadata extractors to handle custom file properties and custom content models.
Architecture Information: Platform Architecture
Every time a file is uploaded to the repository the file’s MIME type is automatically detected. Based on the MIME type a related Metadata Extractor is invoked on the file. It will extract common properties from the file, such as author, and set the corresponding content model property accordingly. Each Metadata Extractor has a mapping between the properties it can extract and the content model properties.
Metadata extraction is primarily based on the Apache Tika library. This means that whatever file formats Tika can extract metadata from, Content Services can also handle. To give you an idea of what file formats Content Services can extract metadata from, here is a list of the most common formats:
The properties that are extracted are limited to the out-of-the-box content model, which is very generic. Here are some examples of extracted property names and the content model properties they map to:
cm:author
cm:title
cm:description
cm:created
exif:exif (pixel dimensions, manufacturer, model, software, date-time etc.)
cm:geographic (longitude & latitude)
audio:audio (album, artist, composer, engineer, genre etc.)
cm:emailed (from, to, subject, sent date)

Note that even if an extractor can extract any of the system-controlled properties, such as created date, the extracted value will not be used. Created date, creator, modified date, and modifier are always controlled by the Content Services system, unless you are using the Bulk Import tool, in which case the last modified date can be preserved.
The extraction of metadata in the repository is performed in T-Engines (transform engines). Prior to Content Services version 7, it was performed inside the repository. T-Engines provide improved scalability, stability, security and flexibility. New extractors may be added without the need for a new Content Services release or applying an AMP on top of the repository (i.e. alfresco.war).
The Content Services version 6 framework for creating metadata extractors that run as part of the repository still exists, so existing AMPs that add extractors will still work as long as there is not an extractor in a T-Engine that claims to do the same task. The framework is deprecated and could well be removed in a future release.
This page describes how metadata extraction and embedding work, so that it is possible to add a custom T-Engine that handles other types. It also lists the various extractors that have been moved to T-Engines.
A framework for embedding metadata into a file was provided as part of the repository prior to Content Services version 7. This too still exists, but has been deprecated. Although the content repository did not provide any out-of-the-box embedder implementations, a framework for embedding metadata via T-Engines exists.
In the case of an extract, the T-Engine returns a JSON file that contains name value pairs. The names are fully qualified QNames of properties on the source node. The values are the metadata values extracted from the content. The transform defines the mapping of metadata values to properties. Once returned to the repository, the properties are automatically set.
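To make the shape of an extract response concrete, here is a minimal sketch that renders a map of fully qualified property names to values as a JSON object. The class `ExtractResponseSketch` and its methods are hypothetical helpers written for this illustration; they are not part of any Alfresco or T-Engine API, and a real T-Engine would normally use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not an Alfresco API.
final class ExtractResponseSketch {

    // Builds "{namespaceUri}localName", the fully qualified form used as the JSON name.
    static String qname(String namespaceUri, String localName) {
        return "{" + namespaceUri + "}" + localName;
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Renders the properties as a flat JSON object of string values.
    static String toJson(Map<String, String> properties) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : properties.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        String cm = "http://www.alfresco.org/model/content/1.0";
        Map<String, String> props = new LinkedHashMap<>();
        props.put(qname(cm, "title"), "Making Bread");
        props.put(qname(cm, "author"), "Fred");
        System.out.println(toJson(props));
    }
}
```

The sketch only handles string values; multi-valued properties would need an array form.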
In the case of an embed, the T-Engine takes name value pairs from the transform options, maps them to metadata values which are then updated in the supplied content. The content is then returned to the content repository and the node is updated.
Metadata extractors and embedders are just a specialist form of transform. The targetMediaType in the T-Engine engine-config.json is set to "alfresco-metadata-extract" or "alfresco-metadata-embed". The following is a snippet from the tika_engine_config.json:

    {
      "transformerName": "TikaAudioMetadataExtractor",
      "supportedSourceAndTargetList": [
        {"sourceMediaType": "video/x-m4v", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "audio/x-oggflac", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "application/mp4", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "audio/vorbis", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "video/3gpp", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "audio/x-flac", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "video/3gpp2", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "video/quicktime", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "audio/mp4", "targetMediaType": "alfresco-metadata-extract"},
        {"sourceMediaType": "video/mp4", "targetMediaType": "alfresco-metadata-extract"}
      ],
      "transformOptions": [
        "metadataOptions"
      ]
    },

If a T-Engine definition says it supports a metadata extract or embed, it will be used in preference to any extractor or embedder using the deprecated frameworks in the content repository.
Code that transforms a specific document type in a T-Engine generally implements the Transformer interface. In addition to the transform method, the extractMetadata and embedMetadata methods will be called depending on the target media type. The implementing class is called from the transformImpl method of the controller class.

    default void transform(String transformName, String sourceMimetype, String targetMimetype,
                           Map<String, String> transformOptions,
                           File sourceFile, File targetFile) throws Exception
    {
    }

    default void extractMetadata(String transformName, String sourceMimetype, String targetMimetype,
                                 Map<String, String> transformOptions,
                                 File sourceFile, File targetFile) throws Exception
    {
    }

    default void embedMetadata(String transformName, String sourceMimetype, String targetMimetype,
                               Map<String, String> transformOptions,
                               File sourceFile, File targetFile) throws Exception
    {
    }

It is typical for the extractMetadata method to call another extractMetadata method on a subclass of AbstractMetadataExtractor, as this class provides the bulk of the functionality needed to configure metadata extraction or embedding.

    public void extractMetadata(String transformName, String sourceMimetype, String targetMimetype,
                                Map<String, String> transformOptions,
                                File sourceFile, File targetFile) throws Exception
    {
        AbstractMetadataExtractor extractor = ...
        extractor.extractMetadata(sourceMimetype, transformOptions, sourceFile, targetFile);
    }

    // Similar code for embedMetadata

The AbstractMetadataExtractor may be extended to perform metadata extract and embed tasks, by overriding two methods in the subclasses:

    public abstract Map<String, Serializable> extractMetadata(String sourceMimetype,
                                                              Map<String, String> transformOptions,
                                                              File sourceFile) throws Exception;

    public void embedMetadata(String sourceMimetype, String targetMimetype,
                              Map<String, String> transformOptions,
                              File sourceFile, File targetFile) throws Exception
    {
        // Default nothing, as embedding is not supported in most cases
    }

Method parameters:
sourceMimetype: mimetype of the source
transformOptions: transform options from the client
sourceFile: the source as a file

The extractMetadata method should extract and return ALL available metadata from the sourceFile. These values are then mapped into content repository property names and values, depending on what is defined in a <classname>_metadata_extract.properties file. Values may be discarded, or a single value may even be used for multiple properties. The selected values are sent back to the repository as JSON, as a mapping of fully qualified content model property names to values, where the values are applied to the source node.
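As a rough illustration of an extractMetadata implementation that returns all available metadata, the sketch below reads "Name: value" header lines from a plain-text file. `SketchExtractorBase` is a stand-in defined here so that the example compiles on its own; only the method signature is taken from the text above. `HeaderLineExtractorSketch` and the header-line file format are assumptions made for this example, and a real extractor would extend AbstractMetadataExtractor instead.

```java
import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for the real base class (assumption); only the signature is copied from the text.
abstract class SketchExtractorBase {
    public abstract Map<String, Serializable> extractMetadata(String sourceMimetype,
            Map<String, String> transformOptions, File sourceFile) throws Exception;
}

// Hypothetical extractor: returns every "Name: value" line found before the first blank line.
class HeaderLineExtractorSketch extends SketchExtractorBase {
    @Override
    public Map<String, Serializable> extractMetadata(String sourceMimetype,
            Map<String, String> transformOptions, File sourceFile) throws IOException {
        Map<String, Serializable> metadata = new LinkedHashMap<>();
        for (String line : Files.readAllLines(sourceFile.toPath(), StandardCharsets.UTF_8)) {
            if (line.trim().isEmpty()) {
                break; // headers end at the first blank line
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                // Return everything found; the mapping step decides what is kept.
                metadata.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return metadata;
    }
}
```

The extractor does not filter or rename anything itself. Selecting and renaming values is left to the mapping step.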
The AbstractMetadataExtractor class reads the <classname>_metadata_extract.properties file, so that it knows how to map metadata returned from the subclass extractMetadata method onto content model properties. The following is an example for an email (file extension .eml):

    #
    # RFC822MetadataExtractor - default mapping
    #

    # Namespaces
    namespace.prefix.imap=http://www.alfresco.org/model/imap/1.0
    namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

    # Mappings
    messageFrom=imap:messageFrom, cm:originator
    messageTo=imap:messageTo, cm:addressee
    messageCc=imap:messageCc, cm:addressees
    messageSubject=imap:messageSubject, cm:title, cm:description, cm:subjectline
    messageSent=imap:dateSent, cm:sentdate
    messageReceived=imap:dateReceived
    Thread-Index=imap:threadIndex
    Message-ID=imap:messageId

As can be seen, the email’s metadata for messageFrom (if available) will be used to set two properties in the content repository (if they exist): imap:messageFrom and cm:originator. The property names use the namespace prefixes specified above.
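The way such a mapping file resolves prefixed names into fully qualified property names can be sketched as follows. This is an illustration only, not the logic in AbstractMetadataExtractor: the class `MappingPropertiesSketch` and its `resolve` method are hypothetical, and the sketch assumes that any key not starting with `namespace.prefix.` is a metadata name whose value is a comma-separated list of prefixed properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not an Alfresco API.
final class MappingPropertiesSketch {
    private static final String NS_KEY = "namespace.prefix.";

    // Returns metadata name -> set of "{namespaceUri}localName" property names.
    static Map<String, Set<String>> resolve(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));

        // First pass: collect the declared namespace prefixes.
        Map<String, String> namespaces = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NS_KEY)) {
                namespaces.put(key.substring(NS_KEY.length()), props.getProperty(key).trim());
            }
        }

        // Second pass: expand each prefixed property name using those prefixes.
        Map<String, Set<String>> mapping = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NS_KEY)) {
                continue;
            }
            Set<String> qnames = new LinkedHashSet<>();
            for (String part : props.getProperty(key).split(",")) {
                String name = part.trim();
                if (name.isEmpty()) {
                    continue;
                }
                int colon = name.indexOf(':');
                String uri = colon > 0 ? namespaces.get(name.substring(0, colon)) : null;
                if (uri == null) {
                    throw new IllegalArgumentException("No namespace declared for: " + name);
                }
                qnames.add("{" + uri + "}" + name.substring(colon + 1));
            }
            mapping.put(key, qnames);
        }
        return mapping;
    }
}
```

Failing fast on an undeclared prefix is a design choice of this sketch; it surfaces a missing `namespace.prefix.*` line early rather than silently dropping a mapping.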
It is possible to specify if properties in the repository will be set if the extracted values are not null or if the properties already have a value. By default, PRAGMATIC is used. Generally you will not need to change this. Other values (CAUTIOUS, EAGER, PRUDENT) are described in OverwritePolicy. To use a different policy, add a sys:overwritePolicy value to the Map returned from the extractMetadata method of the class extending AbstractMetadataExtractor (described above).
The following table shows which conditions must be met for overwriting the value:

When a property that is part of an aspect is extracted, it is possible to remove all other properties in the same aspect that do not have an extracted value. In this way only extracted values will be set and any previously set aspect properties will be cleared. By default, this does not take place and newly extracted values are just added to the node’s properties. To clear other aspect properties, add sys:carryAspectProperties=false to the Map returned from the extractMetadata method.
When an extracted property is taggable, it is possible to automatically extract tags from the value. By default, this is disabled, but may be enabled by adding sys:enableStringTagging=true to the Map returned from the extractMetadata method.
Assuming enableStringTagging is true, it is also possible to change the default separators of the tags in the value. The default separators are "," ";" and "\|". To change them, add a sys:stringTaggingSeparators value to the Map returned from the extractMetadata method. Please note that escaping of characters takes place in both Java and JSON, so the JSON response would look like "sys:stringTaggingSeparators": ";,\",\",\\|" if the code explicitly sets the default separators.
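The sys: options described above are all passed in the same way: as extra entries in the Map returned from extractMetadata. A minimal sketch, assuming the option values are supplied as strings (the value types are an assumption of this example, as are the class and method names):

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not an Alfresco API.
final class ExtractOptionsSketch {

    // Copies the extracted metadata and adds the sys: control entries named in the text.
    static Map<String, Serializable> withSystemOptions(Map<String, Serializable> extracted) {
        Map<String, Serializable> result = new LinkedHashMap<>(extracted);
        result.put("sys:overwritePolicy", "EAGER");        // instead of the default PRAGMATIC
        result.put("sys:carryAspectProperties", "false");  // clear aspect properties that were not extracted
        result.put("sys:enableStringTagging", "true");     // extract tags from taggable values
        return result;
    }
}
```

A subclass would typically call something like this at the end of its extractMetadata method, after collecting the real metadata values.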
The request from the repository to extract metadata goes through RenditionService2, so it will use the asynchronous Alfresco Transform Service if available and a synchronous Local transform if not.
Normally the only transform options are timeout and sourceEncoding, so the extractor code only has the source mimetype and content itself to work on. Customisation of the property mapping should really be done in the T-Engine as described above.
However, it is currently possible for code running in the repository (i.e. alfresco.war) to override the default mapping of metadata to content model properties, with an extractMapping transform option. This approach is deprecated and may be removed in a future minor Content Services 7.x release.
An AMP should supply a class that implements the MetadataExtractorPropertyMappingOverride interface and add it to the metadataExtractorPropertyMappingOverrides property of the extractor.Asynchronous spring bean.

    /**
     * Overrides the default metadata mappings for PDF documents:
     *
     * <pre>
     * author=cm:author
     * title=cm:title
     * subject=cm:description
     * created=cm:created
     * </pre>
     * with:
     * <pre>
     * author=cm:author
     * title=cm:title,cm:description
     * </pre>
     */
    public class PdfMetadataExtractorOverride implements MetadataExtractorPropertyMappingOverride {
        @Override
        public boolean match(String sourceMimetype) {
            return MIMETYPE_PDF.equals(sourceMimetype);
        }

        @Override
        public Map<String, Set<String>> getExtractMapping(NodeRef nodeRef) {
            Map<String, Set<String>> mapping = new HashMap<>();
            mapping.put("author", Collections.singleton("{http://www.alfresco.org/model/content/1.0}author"));
            mapping.put("title", Set.of("{http://www.alfresco.org/model/content/1.0}title",
                                        "{http://www.alfresco.org/model/content/1.0}description"));
            return mapping;
        }
    }

Resulting in a request that contains the following transform options:

    {"extractMapping":{
        "author":["{http://www.alfresco.org/model/content/1.0}author"],
        "title":["{http://www.alfresco.org/model/content/1.0}title",
                 "{http://www.alfresco.org/model/content/1.0}description"]},
     "timeout":20000,
     "sourceEncoding":"UTF-8"}

The transformed content that is returned to the repository is JSON and specifies which properties should be updated on the source node. For example:

    {"{http://www.alfresco.org/model/content/1.0}description":"Making Bread",
     "{http://www.alfresco.org/model/content/1.0}title":"Making Bread",
     "{http://www.alfresco.org/model/content/1.0}author":"Fred"}

An embed request simply contains a transform option called metadata that contains a map of property names to values, resulting in transform options like the following:

    {"metadata":
        {"{http://www.alfresco.org/model/content/1.0}author":"Fred",
         "{http://www.alfresco.org/model/content/1.0}title":"Making Bread",
         "{http://www.alfresco.org/model/content/1.0}helpers":["Jane","Paul"]},
     "timeout":20000,
     "sourceEncoding":"UTF-8"}

Values are either a String or a Collection of Strings. The mapping of these content repository properties to metadata properties is normally the reverse of the mapping defined in the <classname>_metadata_extract.properties file in the T-Engine.
The response to an embed request is simply the source content with the metadata embedded. The content repository updates the content of the node with what is returned.
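The reverse relationship between the extract mapping and the embed mapping can be sketched by inverting a map of metadata names to property names. The class and method names below are hypothetical, and this is not how the T-Engine code is necessarily written; it only illustrates what "the reverse" means when one metadata value feeds several properties.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not an Alfresco API.
final class EmbedMappingSketch {

    // Inverts "metadata name -> property names" into "property name -> metadata names".
    static Map<String, Set<String>> invert(Map<String, Set<String>> extractMapping) {
        Map<String, Set<String>> embedMapping = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : extractMapping.entrySet()) {
            for (String property : entry.getValue()) {
                embedMapping.computeIfAbsent(property, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return embedMapping;
    }
}
```

When two metadata names map to the same property, the inverted entry lists both, so an embedder has to decide which of them to write.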
The repository still contains metadata extraction code.
The Content Services version 6 framework for running metadata extractors and embedders still exists. An additional AsynchronousExtractor has been added to communicate with the RenditionService2 from Content Services version 7. The AsynchronousExtractor handles the request and response in a generic way, allowing all the content type specific code to be moved to a T-Engine.
The following XML based extractors have NOT been removed from the content repository as custom extensions may be using them. There are no out-of-the-box extractors that use them as part of the repository. Ideally any custom extensions should be moved to a custom T-Engine using code based on these classes.
The following extractors, and their configuration (i.e. property mappings), now exist in T-Engines rather than in the repository (i.e. alfresco.war):
The LibreOffice extractor has also been moved to a T-Engine, even though Tika based extractors are now used for all types it supported. This has been the case since ACS 6.0.1. It was moved into a T-Engine to simplify moving any custom code that may have extended it.
The Tika based classes for extractors using configuration files or spring context files have been removed from the repository, as the preferred way to create extractors is via a T-Engine and these approaches require in-process extensions.
A common requirement is to be able to change the mapping of out-of-the-box properties, such as having the subject property mapped to cm:title instead of cm:description for a PDF file. This is quite easy to achieve: override the out-of-the-box JSON configuration and re-configure the mapping. The out-of-the-box definitions for Metadata Extractors can be found in the places described in the above section.
To change the subject property so it is mapped to the content model property cm:title for PDF files, re-define the PdfBoxMetadataExtractor_metadata_extract.properties configuration as follows:

    #
    # PdfBoxMetadataExtracter - custom mapping
    #

    # Namespaces
    namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

    # Mappings
    author=cm:author
    title=cm:title
    subject=cm:title

Note that all the namespaces that the content model properties belong to have to be specified, as in the above example with namespace.prefix.cm. It is also very important to know that the property names are case sensitive.
Sometimes it can be useful to know which metadata extractor is actually used when you upload a document. Turning on metadata extraction logging is a good way to see what is happening. Set the following property in log4j.properties:

    log4j.logger.org.alfresco.repo.content.metadata=DEBUG

What about the properties? It is likely that you will struggle to figure out what properties are extracted and their names. You can have this logged with the following log file configuration:

    log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=DEBUG

This logger is set to a different log level out-of-the-box, so you need to re-configure it explicitly to see any output. With it enabled, the extracted document properties are also logged.
The next requirement is most likely to map properties to custom content models. There is an ACME content model tutorial where the base document type has an acme:documentId property. You might want to add a document identifier to the PDFs you are uploading and have it automatically set in the ACME content model. Start by updating the PdfBoxMetadataExtractor_metadata_extract.properties configuration as follows:

    #
    # PdfBoxMetadataExtracter - custom mapping
    #

    # Namespaces
    namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
    namespace.prefix.acme=http://www.acme.org/model/content/1.0

    # Mappings
    author=cm:author
    title=cm:title
    DocumentId=acme:documentId

Here the custom document property DocumentId has been added so it is mapped to the ACME content model property acme:documentId. When doing this you also need to define the new custom namespace acme. For this to work you need to have a rule on the folder that applies the acme:document type to any PDF document uploaded to the folder. This type has the acme:documentId property.
Now, what if you would like to extract metadata from an XML file? This can be achieved with the XmlMetadataExtracter, which in turn uses the XPathMetadataExtracter to navigate the XML and extract metadata. These extractors are still in the repository, see this section.
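The idea of navigating XML with XPath expressions to pull out metadata can be sketched with the JDK's own javax.xml.xpath API. This is not the XPathMetadataExtracter implementation; the class `XPathExtractSketch` and its `extract` method are hypothetical, and it simply evaluates one XPath expression per metadata name against trusted input.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not an Alfresco API.
final class XPathExtractSketch {

    // Evaluates each XPath against the XML text and returns the non-empty string results.
    static Map<String, String> extract(String xml, Map<String, String> xpathByName)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : xpathByName.entrySet()) {
            // The XML is re-parsed per expression, which is acceptable for a short sketch.
            String value = xpath.evaluate(entry.getValue(),
                    new InputSource(new StringReader(xml))).trim();
            if (!value.isEmpty()) {
                metadata.put(entry.getKey(), value);
            }
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document; the id value is an assumption for the example.
        String xml = "<doc id=\"D-1\"><project><number>PX001</number></project></doc>";
        Map<String, String> xpaths = new LinkedHashMap<>();
        xpaths.put("documentId", "/doc/@id");
        xpaths.put("projectNumber", "/doc/project/number");
        System.out.println(extract(xml, xpaths));
    }
}
```

An expression that matches nothing evaluates to an empty string, so the sketch leaves that name out of the result rather than returning an empty value.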
Let’s say we had XML files looking like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <doc>
      <project>
        <number>PX001</number>
      </project>
      <securityClassification>Company Confidential</securityClassification>
      <text>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tincidunt luctus ante, in pulvinar ante rutrum quis. Etiam maximus arcu ut metus sollicitudin laoreet. Pellentesque ac purus nec massa euismod iaculis a sed sapien. Integer id nisi eu tellus commodo congue. In bibendum dapibus porttitor. Aenean lobortis sodales risus ....
      </text>
    </doc>

And whenever we upload one we want to have the /doc/@id attribute set as acme:documentId, /doc/project/number set as acme:projectNumber, and /doc/securityClassification set as acme:securityClassification. This requires configuration like the following. Note that these are new bean definitions, not overrides:

    <bean parent="baseMetadataExtracter" init-method="init">
      <property name="mappingProperties">
        <bean>
          <property name="location">
            <value>
              classpath:alfresco/module/${project.artifactId}/metadataextraction/acme-content-model-mappings.properties
            </value>
          </property>
        </bean>
      </property>
      <property name="xpathMappingProperties">
        <bean>
          <property name="location">
            <value>
              classpath:alfresco/module/${project.artifactId}/metadataextraction/acme-xml-doc-xpath-mappings.properties
            </value>
          </property>
        </bean>
      </property>
    </bean>

    <bean init-method="init">
      <property name="workers">
        <map>
          <entry key="/*">
            <ref bean="org.alfresco.tutorial.metadataextracter.xml.AcmeDocXPathMetadataExtracter"/>
          </entry>
        </map>
      </property>
    </bean>

    <bean parent="baseMetadataExtracter">
      <property name="overwritePolicy">
        <value>EAGER</value> <!-- Put the extracted metadata into the content model property as long as it is not null -->
      </property>
      <property name="selectors">
        <list>
          <ref bean="org.alfresco.tutorial.metadataextracter.xml.selector.AcmeDocXPathSelector"/>
        </list>
      </property>
    </bean>

The acme-content-model-mappings.properties file contains mappings from the extracted XML doc properties to the content model properties:

    # Namespaces
    namespace.prefix.acme=http://www.acme.org/model/content/1.0

    # Mappings - metadata property -> content model property
    documentId=acme:documentId
    securityClassification=acme:securityClassification
    projectNumber=acme:projectNumber

The property mapping can always be done in .properties files if we like, and we could have used a .properties file for the PDFBoxMetadataExtracter too. The other properties file, called acme-xml-doc-xpath-mappings.properties, contains the XPath expression configuration for where to find the metadata in the XML file:

    # XPath Mappings - metadata property -> XML Document XPATH
    documentId=/doc/@id
    securityClassification=/doc/securityClassification
    projectNumber=/doc/project/number

Metadata extraction limits allow the following to be configured on AbstractMappingMetadataExtracter:
The default value for each of these properties is the MAX value specified in the Java code. These limits are configured per extractor and mimetype.
The limits configured for Content Services are:
Timeout, configured for all extractors and all mimetypes:

    content.metadataExtracter.default.timeoutMs=20000

Maximum size of a document to process, configured for PdfBoxMetadataExtracter (PDF files):

    content.metadataExtracter.pdf.maxDocumentSizeMB=10

Maximum number of concurrent extractions, configured for PdfBoxMetadataExtracter (PDF files):

    content.metadataExtracter.pdf.maxConcurrentExtractionsCount=5

For XML metadata extraction you will still use the SDK and a JAR project applied to the Repository (i.e. alfresco.war).
To change the configuration for the majority of the metadata extractors you will have to generate a new Transform Core AIO Docker image with the new configuration. Another option would be to create a new separate T-Engine that has a higher priority (lower number) for this metadata extraction. That way you can still use the standard T-Engine, with the new one used only for this one special case.
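The "higher priority (lower number)" rule can be illustrated with a small selection sketch. The `Candidate` type, the transformer names, and the priority values below are all made up for the example; how the repository actually chooses between T-Engines is not shown here.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical model for illustration only; not an Alfresco API.
final class PrioritySelectionSketch {

    static final class Candidate {
        final String transformerName;
        final int priority;

        Candidate(String transformerName, int priority) {
            this.transformerName = transformerName;
            this.priority = priority;
        }
    }

    // Picks the candidate with the lowest priority number, i.e. the highest priority.
    static Optional<Candidate> select(List<Candidate> candidates) {
        return candidates.stream().min(Comparator.comparingInt(c -> c.priority));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("StandardPdfExtractor", 50),  // made-up name and value
                new Candidate("CustomPdfExtractor", 40));   // made-up name and value
        System.out.println(select(candidates).get().transformerName);
    }
}
```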