Content Services performs metadata extraction on content automatically. However, you may wish to create custom metadata extractors to handle custom file properties and custom content models.
Architecture Information: Platform Architecture
Every time a file is uploaded to the repository, the file’s MIME type is automatically detected. Based on the MIME type, a related Metadata Extractor is invoked on the file. It extracts common properties from the file, such as author, and sets the corresponding content model property accordingly. Each Metadata Extractor has a mapping between the properties it can extract and the content model properties. Metadata extraction is primarily based on the Apache Tika library. This means that whatever file formats Tika can extract metadata from, Content Services can also handle. To give you an idea of what file formats Content Services can extract metadata from, here is a list of the most common formats:
The properties that are extracted are limited to the out-of-the-box content model, which is very generic. Here are some examples of extracted property names and the content model properties they map to:
cm:author
cm:title
cm:description
cm:created
exif:exif (pixel dimensions, manufacturer, model, software, date-time etc.)
cm:geographic (longitude & latitude)
audio:audio (album, artist, composer, engineer, genre etc.)
cm:emailed (from, to, subject, sent date)

One thing to note, though: even if an extractor can extract any of the system-controlled properties, such as created date, the extracted value will not be used. Created date, creator, modified date, and modifier are always controlled by the Content Services system, unless you are using the Bulk Import tool, in which case the last modified date can be preserved.
A common requirement is to change the mapping of out-of-the-box properties, such as having the subject property mapped to cm:title instead of cm:description. This is quite easy to achieve: just override the out-of-the-box bean and reconfigure the mapping. The out-of-the-box Spring bean definitions for Metadata Extractors can be found in the content-services-context.xml file. Search for “Content Metadata Extractors” in the file and you will find an ordered list of extractor definitions.
When overriding a Metadata Extractor configuration, you have the option to inherit the default property mapping or define a new one from scratch. For example, to change the subject property so it is mapped to the content model property cm:title for PDF files, redefine the extracter.PDFBox Spring bean as follows:
<bean id="extracter.PDFBox"
      class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter"
      parent="baseMetadataExtracter">
    <property name="documentSelector" ref="pdfBoxEmbededDocumentSelector" />
    <!-- Do not inherit the default mappings; only the entries below apply -->
    <property name="inheritDefaultMapping">
        <value>false</value>
    </property>
    <property name="mappingProperties">
        <props>
            <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
            <prop key="author">cm:author</prop>
            <prop key="subject">cm:title</prop>
            <prop key="Keywords">cm:description</prop>
        </props>
    </property>
</bean>

In this case you also map the author property. This is because when you set the inheritDefaultMapping property to false, none of the default property mappings are used. Another property called Keywords has also been mapped to the cm:description property. Note that every namespace that the content model properties belong to has to be declared, as in the above example with namespace.prefix.cm. It is also very important to know that property names are case sensitive. If the Keywords property had been written with a lower-case k, it would not have been picked up.

Sometimes it can be useful to know which metadata extractor is actually used when you upload a document. Turning on Metadata Extraction logging is a good way to see what is happening. Set the following property in log4j.properties:
log4j.logger.org.alfresco.repo.content.metadata=DEBUG

With logging turned on, the following information is logged when uploading a PDF:
2015-12-07 13:56:51,324 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Get extractors for application/pdf
2015-12-07 13:56:51,324 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Finding extractors for application/pdf
2015-12-07 13:56:51,326 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find supported: extracter.TikaAuto
2015-12-07 13:56:51,326 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find supported: extracter.PDFBox
2015-12-07 13:56:51,326 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.Poi
2015-12-07 13:56:51,326 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.Office
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.Mail
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.Html
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.OpenDocument
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.DWG
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.RFC822
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.MP3
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.Audio
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: extracter.OpenOffice
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find unsupported: org.alfresco.tutorial.metadataextracter.xml.AcmeDocXMLMetadataExtracter
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Find returning: [org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter@763b7315, org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@6acadc76]
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Get supported: extracter.TikaAuto
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Get supported: extracter.PDFBox
2015-12-07 13:56:51,327 DEBUG [content.metadata.MetadataExtracterRegistry] [http-bio-8080-exec-14] Get returning: extracter.PDFBox

You can clearly see that the PDFBox extractor is invoked, so you know you have customized the correct one. What about the properties? You will likely struggle to figure out which properties are extracted and what they are called. You can have this logged with the following log configuration:
log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=DEBUG

This logger is set to a different log level out-of-the-box, so you need to specifically reconfigure it to see any output. Now when running you will also see the extracted document properties, as in the following example:
Found: { pdf:PDFVersion=1.4, xmp:CreatorTool=Writer, Keywords=SomeKeyword1, SomeKeyword2, subject=SomeSubject, dc:creator=Martin Bergljung, description=SomeSubject, dcterms:created=2015-12-07T14:22:15Z, dc:format=application/pdf; version=1.4, title=SomeTitle, dc:title=SomeTitle, pdf:encrypted=false, cp:subject=SomeSubject, Content-Type=application/pdf, creator=Martin Bergljung, comments=null, meta:author=Martin Bergljung, dc:subject=SomeKeyword1, SomeKeyword2, meta:creation-date=2015-12-07T14:22:15Z, created=2015-12-07T14:22:15Z, author=Martin Bergljung, xmpTPg:NPages=1, Creation-Date=2015-12-07T14:22:15Z, meta:keyword=SomeKeyword1, SomeKeyword2, Author=Martin Bergljung, producer=LibreOffice 4.2}

There is also a log entry showing which properties were actually mapped successfully:
Mapped and Accepted: { {http://www.alfresco.org/model/content/1.0}description={en_GB=SomeKeyword1, SomeKeyword2}, {http://www.alfresco.org/model/content/1.0}title={en_GB=SomeSubject}, {http://www.alfresco.org/model/content/1.0}author=Martin Bergljung}

The next requirement is most likely to map properties to custom content models. There is an ACME content model tutorial where the base document type has an acme:documentId property. You might want to add a document identifier to the PDFs you are uploading and have it automatically set in the ACME content model. Start by updating the extractor configuration as follows:
<bean id="extracter.PDFBox"
      class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter"
      parent="baseMetadataExtracter">
    <property name="documentSelector" ref="pdfBoxEmbededDocumentSelector" />
    <property name="inheritDefaultMapping">
        <value>false</value>
    </property>
    <property name="mappingProperties">
        <props>
            <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
            <!-- The custom namespace must be declared before it can be used in a mapping -->
            <prop key="namespace.prefix.acme">http://www.acme.org/model/content/1.0</prop>
            <prop key="author">cm:author</prop>
            <prop key="subject">cm:title</prop>
            <prop key="Keywords">cm:description</prop>
            <prop key="DocumentId">acme:documentId</prop>
        </props>
    </property>
</bean>

Here the custom document property DocumentId has been added and mapped to the ACME content model property acme:documentId. When doing this you also need to declare the new custom namespace acme. For this to work, you need a rule on the folder that applies the acme:document type to any PDF document uploaded to the folder. This type has the acme:documentId property.
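The mapping rules described above (explicit namespace declarations, case-sensitive property names, and nothing inherited once inheritDefaultMapping is false) can be illustrated with a small standalone sketch. This is not the Alfresco implementation: the class MappingSketch, its methods, and the DOC-001 value are hypothetical names chosen for illustration, and the sketch only mimics the lookup using java.util.Properties.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Illustrative sketch only; not the Alfresco mapping code. */
class MappingSketch {

    /** Expands a prefixed name such as "cm:title" to "{namespace-uri}title". */
    static String toQName(String prefixed, Properties mapping) {
        int colon = prefixed.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Not a prefixed name: " + prefixed);
        }
        String prefix = prefixed.substring(0, colon);
        String uri = mapping.getProperty("namespace.prefix." + prefix);
        if (uri == null) {
            throw new IllegalArgumentException("No namespace declared for prefix: " + prefix);
        }
        return "{" + uri + "}" + prefixed.substring(colon + 1);
    }

    /** Keeps only raw properties that have a mapping entry; the lookup is case sensitive. */
    static Map<String, String> apply(Map<String, String> raw, Properties mapping) {
        Map<String, String> mapped = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = mapping.getProperty(entry.getKey());
            if (target != null) {
                mapped.put(toQName(target, mapping), entry.getValue());
            }
        }
        return mapped;
    }

    /** The same entries as the mappingProperties in the bean definition above. */
    static Properties acmeMapping() {
        Properties p = new Properties();
        p.setProperty("namespace.prefix.cm", "http://www.alfresco.org/model/content/1.0");
        p.setProperty("namespace.prefix.acme", "http://www.acme.org/model/content/1.0");
        p.setProperty("author", "cm:author");
        p.setProperty("subject", "cm:title");
        p.setProperty("Keywords", "cm:description");
        p.setProperty("DocumentId", "acme:documentId");
        return p;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("author", "Martin Bergljung");
        raw.put("subject", "SomeSubject");
        raw.put("Keywords", "SomeKeyword1, SomeKeyword2");
        raw.put("title", "SomeTitle");     // no mapping entry, so it is dropped
        raw.put("documentid", "DOC-001");  // hypothetical value; wrong case, so it is dropped
        apply(raw, acmeMapping()).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The fully qualified names produced here have the same {namespace-uri}localName shape as the entries in the "Mapped and Accepted" log line, which can help when comparing a mapping against the debug output.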
Now, what if you would like to extract metadata from an XML file? This can be achieved with the XmlMetadataExtracter, which in turn uses the XPathMetadataExtracter to navigate the XML and extract metadata. Let’s say we had XML files looking like this:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <project>
        <number>PX001</number>
    </project>
    <securityClassification>Company Confidential</securityClassification>
    <text>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent tincidunt luctus ante, in pulvinar ante rutrum quis.
        Etiam maximus arcu ut metus sollicitudin laoreet. Pellentesque ac purus nec massa euismod iaculis a sed sapien.
        Integer id nisi eu tellus commodo congue. In bibendum dapibus porttitor. Aenean lobortis sodales risus ....
    </text>
</doc>

Whenever we upload one, we want the /doc/@id attribute set as acme:documentId, /doc/project/number set as acme:projectNumber, and /doc/securityClassification set as acme:securityClassification. This requires configuration like the following. Note that these are new bean definitions, not overrides as in the previous examples:
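<!-- Extracts the metadata values from the XML using XPath expressions -->
<bean id="org.alfresco.tutorial.metadataextracter.xml.AcmeDocXPathMetadataExtracter"
      class="org.alfresco.repo.content.metadata.xml.XPathMetadataExtracter"
      parent="baseMetadataExtracter"
      init-method="init">
    <property name="mappingProperties">
        <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="location">
                <value>
                    classpath:alfresco/module/${project.artifactId}/metadataextraction/acme-content-model-mappings.properties
                </value>
            </property>
        </bean>
    </property>
    <property name="xpathMappingProperties">
        <bean class="org.springframework.beans.factory.config.PropertiesFactoryBean">
            <property name="location">
                <value>
                    classpath:alfresco/module/${project.artifactId}/metadataextraction/acme-xml-doc-xpath-mappings.properties
                </value>
            </property>
        </bean>
    </property>
</bean>

<!-- Selects the XPath extractor above for any XML document with a root element -->
<bean id="org.alfresco.tutorial.metadataextracter.xml.selector.AcmeDocXPathSelector"
      class="org.alfresco.repo.content.selector.XPathContentWorkerSelector"
      init-method="init">
    <property name="workers">
        <map>
            <entry key="/*">
                <ref bean="org.alfresco.tutorial.metadataextracter.xml.AcmeDocXPathMetadataExtracter"/>
            </entry>
        </map>
    </property>
</bean>

<!-- The extractor registered for XML files; delegates to the selector above -->
<bean id="org.alfresco.tutorial.metadataextracter.xml.AcmeDocXMLMetadataExtracter"
      class="org.alfresco.repo.content.metadata.xml.XmlMetadataExtracter"
      parent="baseMetadataExtracter">
    <property name="overwritePolicy">
        <value>EAGER</value> <!-- Put the extracted metadata into the content model property as long as it is not null -->
    </property>
    <property name="selectors">
        <list>
            <ref bean="org.alfresco.tutorial.metadataextracter.xml.selector.AcmeDocXPathSelector"/>
        </list>
    </property>
</bean>

The acme-content-model-mappings.properties file contains mappings from the extracted XML document properties to the content model properties: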
# Namespaces
namespace.prefix.acme=http://www.acme.org/model/content/1.0

# Mappings - metadata property -> content model property
documentId=acme:documentId
securityClassification=acme:securityClassification
projectNumber=acme:projectNumber

The property mapping can always be done in .properties files if we like, and we could have used a .properties file for the PdfBoxMetadataExtracter too. The other properties file, acme-xml-doc-xpath-mappings.properties, contains the XPath expressions that say where to find the metadata in the XML file:
# XPath Mappings - metadata property -> XML Document XPATH
documentId=/doc/@id
securityClassification=/doc/securityClassification
projectNumber=/doc/project/number

Metadata extraction limits allow an AbstractMappingMetadataExtracter to be configured with a timeout, a maximum document size, and a maximum number of concurrent extractions.
The default value for each of these properties is the MAX value specified in the Java code. These limits are configured per extractor and MIME type.
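To make the three kinds of limit more concrete, here is a minimal, self-contained sketch of how a timeout, a document size cap, and a cap on concurrent extractions could be enforced around an extraction task. It is not taken from the Alfresco source: the class LimitedExtractionSketch and its extract method are hypothetical, and returning null when a limit is hit is an assumption made for this sketch rather than documented product behaviour.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative sketch only; not the Alfresco limits implementation. */
class LimitedExtractionSketch {

    private final long timeoutMs;
    private final long maxDocumentSizeBytes;
    private final Semaphore slots;

    // Daemon threads so that a timed-out task cannot keep the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable);
        thread.setDaemon(true);
        return thread;
    });

    LimitedExtractionSketch(long timeoutMs, long maxDocumentSizeMB, int maxConcurrentExtractions) {
        this.timeoutMs = timeoutMs;
        this.maxDocumentSizeBytes = maxDocumentSizeMB * 1024L * 1024L;
        this.slots = new Semaphore(maxConcurrentExtractions);
    }

    /**
     * Runs the extraction if all limits allow it.
     * Returns null (an assumption of this sketch) when a limit is exceeded.
     */
    <T> T extract(long documentSizeBytes, Callable<T> extraction) throws Exception {
        if (documentSizeBytes > maxDocumentSizeBytes) {
            return null; // document too large
        }
        if (!slots.tryAcquire()) {
            return null; // too many extractions already running
        }
        try {
            Future<T> future = pool.submit(extraction);
            try {
                return future.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the worker; it may take a moment to stop
                return null;         // extraction took too long
            }
        } finally {
            slots.release();
        }
    }

    public static void main(String[] args) throws Exception {
        // Example values: 20 s timeout, 10 MB size cap, 5 concurrent extractions.
        LimitedExtractionSketch limits = new LimitedExtractionSketch(20000, 10, 5);
        System.out.println(limits.extract(2L * 1024 * 1024, () -> "extracted"));
        System.out.println(limits.extract(50L * 1024 * 1024, () -> "never runs"));
    }
}
```

In this sketch one instance holds the limits for one extractor and MIME type pairing, which mirrors the statement that limits are configured per extractor and MIME type.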
The limits configured for Content Services are:
Timeout, configured for all extractors and all MIME types:
content.metadataExtracter.default.timeoutMs=20000

Maximum size of a document to process, configured for PdfBoxMetadataExtracter (PDF files):
content.metadataExtracter.pdf.maxDocumentSizeMB=10

Maximum number of concurrent extractions, configured for PdfBoxMetadataExtracter (PDF files):
content.metadataExtracter.pdf.maxConcurrentExtractionsCount=5

There are four types of overwrite policies that can be used when extracting metadata:
EAGER
CAUTIOUS
PRUDENT
PRAGMATIC

The following table shows which conditions must be met for overwriting the value:

The default overwrite policy is PRAGMATIC. To change the overwrite policy, set the overwritePolicy property. For example:
<property name="overwritePolicy">
    <value>EAGER</value>
</property>

To change the overwrite policy for the PDF metadata extractor, set the overwritePolicy property in the alfresco-global.properties file. For example:
content.metadataExtracter.pdf.overwritePolicy=EAGER

tomcat/shared/classes/alfresco/extension - Rename custom-metadata-extractors-context.xml.sample to custom-metadata-extractors-context.xml and define the extractor beans. Rename metadata-embedding-context.xml.sample to metadata-embedding-context.xml and define the embedder beans.
aio/platform-jar/src/main/resources/alfresco/module/platform-jar/context/service-context.xml - Metadata Extractor bean definitions and metadata embedder bean definitions
aio/platform-jar/src/main/resources/alfresco/module/platform-jar/metadataextraction - Properties files with mappings
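Before deploying the XPath mapping properties for the ACME XML documents, it can be handy to check the expressions against a sample file outside the repository. The following sketch does that with the standard javax.xml.xpath API. The class XPathMappingSketch is a hypothetical helper, not part of Content Services, and the id="DOC-001" attribute on the sample document is an assumed value added only so that /doc/@id has something to match.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustrative sketch only; evaluates XPath mappings with the JDK, not with Alfresco. */
class XPathMappingSketch {

    /** The same entries as acme-xml-doc-xpath-mappings.properties. */
    static Map<String, String> acmeXPathMappings() {
        Map<String, String> mappings = new LinkedHashMap<>();
        mappings.put("documentId", "/doc/@id");
        mappings.put("securityClassification", "/doc/securityClassification");
        mappings.put("projectNumber", "/doc/project/number");
        return mappings;
    }

    /** Evaluates each XPath against the XML and keeps the non-empty results. */
    static Map<String, String> extract(String xml, Map<String, String> xpathMappings) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> extracted = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : xpathMappings.entrySet()) {
            // evaluate(...) returns an empty string when nothing matches
            String value = xpath.evaluate(mapping.getValue(), document).trim();
            if (!value.isEmpty()) {
                extracted.put(mapping.getKey(), value);
            }
        }
        return extracted;
    }

    public static void main(String[] args) throws Exception {
        // The id attribute value is hypothetical and used for illustration only.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<doc id=\"DOC-001\">"
                + "<project><number>PX001</number></project>"
                + "<securityClassification>Company Confidential</securityClassification>"
                + "<text>Lorem ipsum dolor sit amet</text>"
                + "</doc>";
        extract(xml, acmeXPathMappings())
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The keys in the resulting map are the intermediate metadata property names (documentId, securityClassification, projectNumber), which the content model mapping file then maps to acme: properties.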