<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Missing embedded metadata when uploading PDF in Alfresco Forum</title>
    <link>https://connect.hyland.com/t5/alfresco-forum/missing-embedded-metadata-when-uploading-pdf/m-p/69973#M23047</link>
    <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello Jeff, first of all thank you very much for your response.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm sorry, I see now that I didn't make myself clear. I read that page of the documentation carefully. I'm writing because, while following the instructions in the documentation, I am seeing behaviour that is not discussed there or in any other document I could find on the web. I understand that by default only some fields are mapped, so I wanted to map the fields I need. First, of course, I created a new model that contains a custom type with the fields I needed (for example: DOI, volume, ISSN), and created a rule in the folder so that any document added to that folder would be specialized to that type.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Then I needed to create a new mapping, but for that I first needed to know the names of the properties as Alfresco sees them. To do this, I edited log4j.properties to set log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=debug. With this, after uploading a document that contains the metadata I need, I could check the names of the properties I should use in the mapping.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is where I found my problem. In the Alfresco log file, &lt;STRONG&gt;when I upload one of these documents, not all the metadata that is available in the PDF (see image 1 in first post of the thread) appears as a raw property&lt;/STRONG&gt;. 
For example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE class="jive_macro_quote jive-quote jive_text_macro"&gt;&lt;P&gt;&lt;SPAN&gt;Raw Properties:&amp;nbsp; &amp;nbsp; {date=2018-08-13T08:56:21Z, pdf:PDFVersion=1.6, xmp:CreatorTool=Springer, Keywords=Highly-cited documents,Google Scholar,Web of Science,Scopus,Coverage,Academic journals,Classic Papers, subject=Scientometrics, https://doi.org/10.1007/s11192-018-2820-9, pdfa:PDFVersion=A-2b, dc:creator=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, description=Scientometrics, https://doi.org/10.1007/s11192-018-2820-9, dcterms:created=2018-06-26T11:18:02Z, Last-Modified=2018-08-13T08:56:21Z, dcterms:modified=2018-08-13T08:56:21Z, dc:format=application/pdf; version=1.6, application/pdf; version="A-2b", title=Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison, Last-Save-Date=2018-08-13T08:56:21Z, CrossMarkDomains[1]=springer.com, meta:save-date=2018-08-13T08:56:21Z, dc:title=Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison, pdf:encrypted=false, modified=2018-08-13T08:56:21Z, cp:subject=Scientometrics,
https://doi.org/10.1007/s11192-018-2820-9, robots=noindex, Content-Type=application/pdf, TIKA_PARSER_PARSE_SHAPES=false, creator=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, pdfaid:conformance=B, comments=null, meta:author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, dc:subject=[Ljava.lang.String;@91aba4, meta:creation-date=2018-06-26T11:18:02Z, created=2018-06-26T11:18:02Z, author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, xmpTPg:NPages=14, Creation-Date=2018-06-26T11:18:02Z, pdfaid:part=2, CrossMarkDomains[2]=springerlink.com, meta:keyword=Highly-cited documents,Google Scholar,Web of Science,Scopus,Coverage,Academic journals,Classic Papers, Author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, producer=Acrobat Distiller 10.1.8 (Windows), CrossmarkDomainExclusive=true, CrossmarkMajorVersionDate=2010-04-23, doi=10.1007/s11192-018-2820-9}&lt;/SPAN&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;My main question is,&lt;STRONG&gt; why is Alfresco not detecting all available metadata in the PDF as raw properties?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried changing the mapping in the custom-repository-context.xml file anyway, trying to guess the name of the properties that don't appear in the list of raw properties. 
I tried mapping the DOI (which is available in the raw properties), the volume, and the ISSN (which are not available as raw properties):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE class="jive_macro_quote jive-quote jive_text_macro"&gt;&lt;P&gt;&amp;lt;bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter"&lt;BR /&gt; parent="baseMetadataExtracter"&amp;gt;&lt;BR /&gt; &amp;lt;property name="documentSelector" ref="pdfBoxEmbededDocumentSelector" /&amp;gt;&lt;BR /&gt; &amp;lt;property name="inheritDefaultMapping"&amp;gt;&lt;BR /&gt; &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;&lt;BR /&gt; &amp;lt;/property&amp;gt;&lt;BR /&gt; &amp;lt;property name="mappingProperties"&amp;gt;&lt;BR /&gt; &amp;lt;props&amp;gt;&lt;BR /&gt;&lt;SPAN&gt; &amp;lt;prop key="namespace.prefix.prism"&amp;gt;&lt;/SPAN&gt;&lt;A _jive_internal="true" href="https://community.alfresco.com/prismstandard.org/namespaces/basic/2.0" rel="nofollow noopener noreferrer" target="_blank"&gt;http://prismstandard.org/namespaces/basic/2.0&lt;/A&gt;&lt;SPAN&gt;&amp;lt;/prop&amp;gt;&lt;/SPAN&gt;&lt;BR /&gt; &amp;lt;prop key="doi"&amp;gt;prism:doi&amp;lt;/prop&amp;gt;&lt;BR /&gt; &amp;lt;prop key="prism:volume"&amp;gt;prism:volume&amp;lt;/prop&amp;gt;&lt;BR /&gt; &amp;lt;prop key="issn"&amp;gt;prism:issn&amp;lt;/prop&amp;gt;&lt;BR /&gt; &amp;lt;/props&amp;gt;&lt;BR /&gt; &amp;lt;/property&amp;gt;&lt;BR /&gt; &amp;lt;/bean&amp;gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;After uploading another document with this configuration in place, as I expected and feared, only the DOI was correctly extracted.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any ideas as to why some metadata from the PDF is not being detected by Alfresco?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you very much for your help in advance.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
    <pubDate>Mon, 19 Nov 2018 10:11:30 GMT</pubDate>
    <dc:creator>albertomartin</dc:creator>
    <dc:date>2018-11-19T10:11:30Z</dc:date>
    <item>
      <title>Missing embedded metadata when uploading PDF</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/missing-embedded-metadata-when-uploading-pdf/m-p/69971#M23045</link>
      <description>Hello, I'm trying to automate metadata extraction in Alfresco Community 5.2 so that my custom models get populated automatically when documents are uploaded. My PDFs have custom embedded metadata fields (see image 1). However, when I import these PDFs into Alfresco, according to the information in the</description>
      <pubDate>Sun, 18 Nov 2018 13:56:22 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/missing-embedded-metadata-when-uploading-pdf/m-p/69971#M23045</guid>
      <dc:creator>albertomartin</dc:creator>
      <dc:date>2018-11-18T13:56:22Z</dc:date>
    </item>
    <item>
      <title>Re: Missing embedded metadata when uploading PDF</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/missing-embedded-metadata-when-uploading-pdf/m-p/69972#M23046</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Check the docs:&lt;/P&gt;&lt;P&gt;&lt;A class="link-titled" href="https://docs.alfresco.com/5.2/references/dev-extension-points-custom-metadata-extractor.html" title="https://docs.alfresco.com/5.2/references/dev-extension-points-custom-metadata-extractor.html" rel="nofollow noopener noreferrer"&gt;Metadata Extractors | Alfresco Documentation&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;By default, the metadata extraction grabs the author, title, subject, and created date. If you want anything else, you'll have to tweak the metadata extractor. Because there is already an extractor that knows how to pull fields from PDFs, you should not have to write your own from scratch, but you could if you needed to.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I think you'll just need to map the fields to actual properties in your model. The docs are pretty thorough on this topic and there are a number of other pages around the net that discuss customizing metadata extraction.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 19 Nov 2018 04:56:36 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/missing-embedded-metadata-when-uploading-pdf/m-p/69972#M23046</guid>
      <dc:creator>jpotts</dc:creator>
      <dc:date>2018-11-19T04:56:36Z</dc:date>
    </item>
    <item>
      <title>Re: Missing embedded metadata when uploading PDF</title>
      <link>https://connect.hyland.com/t5/alfresco-forum/missing-embedded-metadata-when-uploading-pdf/m-p/69973#M23047</link>
      <description>&lt;HTML&gt;&lt;HEAD&gt;&lt;/HEAD&gt;&lt;BODY&gt;&lt;P&gt;Hello Jeff, first of all thank you very much for your response.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm sorry, I see now that I didn't make myself clear. I read that page of the documentation carefully. I'm writing because, while following the instructions in the documentation, I am seeing behaviour that is not discussed there or in any other document I could find on the web. I understand that by default only some fields are mapped, so I wanted to map the fields I need. First, of course, I created a new model that contains a custom type with the fields I needed (for example: DOI, volume, ISSN), and created a rule in the folder so that any document added to that folder would be specialized to that type.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Then I needed to create a new mapping, but for that I first needed to know the names of the properties as Alfresco sees them. To do this, I edited log4j.properties to set log4j.logger.org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter=debug. With this, after uploading a document that contains the metadata I need, I could check the names of the properties I should use in the mapping.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is where I found my problem. In the Alfresco log file, &lt;STRONG&gt;when I upload one of these documents, not all the metadata that is available in the PDF (see image 1 in first post of the thread) appears as a raw property&lt;/STRONG&gt;. 
For example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE class="jive_macro_quote jive-quote jive_text_macro"&gt;&lt;P&gt;&lt;SPAN&gt;Raw Properties:&amp;nbsp; &amp;nbsp; {date=2018-08-13T08:56:21Z, pdf:PDFVersion=1.6, xmp:CreatorTool=Springer, Keywords=Highly-cited documents,Google Scholar,Web of Science,Scopus,Coverage,Academic journals,Classic Papers, subject=Scientometrics, https://doi.org/10.1007/s11192-018-2820-9, pdfa:PDFVersion=A-2b, dc:creator=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, description=Scientometrics, https://doi.org/10.1007/s11192-018-2820-9, dcterms:created=2018-06-26T11:18:02Z, Last-Modified=2018-08-13T08:56:21Z, dcterms:modified=2018-08-13T08:56:21Z, dc:format=application/pdf; version=1.6, application/pdf; version="A-2b", title=Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison, Last-Save-Date=2018-08-13T08:56:21Z, CrossMarkDomains[1]=springer.com, meta:save-date=2018-08-13T08:56:21Z, dc:title=Coverage of highly-cited documents in Google Scholar, Web of Science, and Scopus: a multidisciplinary comparison, pdf:encrypted=false, modified=2018-08-13T08:56:21Z, cp:subject=Scientometrics,
https://doi.org/10.1007/s11192-018-2820-9, robots=noindex, Content-Type=application/pdf, TIKA_PARSER_PARSE_SHAPES=false, creator=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, pdfaid:conformance=B, comments=null, meta:author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, dc:subject=[Ljava.lang.String;@91aba4, meta:creation-date=2018-06-26T11:18:02Z, created=2018-06-26T11:18:02Z, author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, xmpTPg:NPages=14, Creation-Date=2018-06-26T11:18:02Z, pdfaid:part=2, CrossMarkDomains[2]=springerlink.com, meta:keyword=Highly-cited documents,Google Scholar,Web of Science,Scopus,Coverage,Academic journals,Classic Papers, Author=Enrique Orduna-Malea , Emilio Delgado López-Cózar , Alberto Martín-Martín, producer=Acrobat Distiller 10.1.8 (Windows), CrossmarkDomainExclusive=true, CrossmarkMajorVersionDate=2010-04-23, doi=10.1007/s11192-018-2820-9}&lt;/SPAN&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;My main question is,&lt;STRONG&gt; why is Alfresco not detecting all available metadata in the PDF as raw properties?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried changing the mapping in the custom-repository-context.xml file anyway, trying to guess the name of the properties that don't appear in the list of raw properties. 
I tried mapping the DOI (which is available in the raw properties), the volume, and the ISSN (which are not available as raw properties):&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE class="jive_macro_quote jive-quote jive_text_macro"&gt;&lt;P&gt;&amp;lt;bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter"&lt;BR /&gt; parent="baseMetadataExtracter"&amp;gt;&lt;BR /&gt; &amp;lt;property name="documentSelector" ref="pdfBoxEmbededDocumentSelector" /&amp;gt;&lt;BR /&gt; &amp;lt;property name="inheritDefaultMapping"&amp;gt;&lt;BR /&gt; &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;&lt;BR /&gt; &amp;lt;/property&amp;gt;&lt;BR /&gt; &amp;lt;property name="mappingProperties"&amp;gt;&lt;BR /&gt; &amp;lt;props&amp;gt;&lt;BR /&gt;&lt;SPAN&gt; &amp;lt;prop key="namespace.prefix.prism"&amp;gt;&lt;/SPAN&gt;&lt;A _jive_internal="true" href="https://community.alfresco.com/prismstandard.org/namespaces/basic/2.0" rel="nofollow noopener noreferrer" target="_blank"&gt;http://prismstandard.org/namespaces/basic/2.0&lt;/A&gt;&lt;SPAN&gt;&amp;lt;/prop&amp;gt;&lt;/SPAN&gt;&lt;BR /&gt; &amp;lt;prop key="doi"&amp;gt;prism:doi&amp;lt;/prop&amp;gt;&lt;BR /&gt; &amp;lt;prop key="prism:volume"&amp;gt;prism:volume&amp;lt;/prop&amp;gt;&lt;BR /&gt; &amp;lt;prop key="issn"&amp;gt;prism:issn&amp;lt;/prop&amp;gt;&lt;BR /&gt; &amp;lt;/props&amp;gt;&lt;BR /&gt; &amp;lt;/property&amp;gt;&lt;BR /&gt; &amp;lt;/bean&amp;gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;After uploading another document with this configuration in place, as I expected and feared, only the DOI was correctly extracted.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any ideas as to why some metadata from the PDF is not being detected by Alfresco?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you very much for your help in advance.&lt;/P&gt;&lt;/BODY&gt;&lt;/HTML&gt;</description>
      <pubDate>Mon, 19 Nov 2018 10:11:30 GMT</pubDate>
      <guid>https://connect.hyland.com/t5/alfresco-forum/missing-embedded-metadata-when-uploading-pdf/m-p/69973#M23047</guid>
      <dc:creator>albertomartin</dc:creator>
      <dc:date>2018-11-19T10:11:30Z</dc:date>
    </item>
  </channel>
</rss>

