Sending PDF to Alfresco via CIFS doesn't fire metadata extracter

jpbuttet
Confirmed Champ

I run a dockerized Alfresco with the following components:

  • Alfresco 5.2.f & api-explorer 5.2.0
  • Share 5.2.e
  • Nginx (reverse proxy on port 143)
  • Postgres 9.4
  • Libreoffice 5.1.2
  • Solr6 (alfresco-search-services-1.0.0)
A Fujitsu N7100 network scanner is attached to Alfresco via CIFS.
The CIFS interface is configured in the alfresco-global.properties file:
 
### CIFS configuration###
cifs.enabled=true
cifs.serverName=alfresco
cifs.domain=WORKGROUP
cifs.broadcast=255.255.255.255
cifs.ipv6.enabled=false
cifs.hostannounce=true
cifs.tcpipSMB.port=1445
cifs.netBIOSSMB.sessionPort=1139
cifs.netBIOSSMB.namePort=1137
cifs.netBIOSSMB.datagramPort=1138
 
The scanner has an integrated OCR engine and sends PDF files to Alfresco.
 
Unfortunately, the PDF files that Alfresco receives from the scanner are not searchable and don't fire the metadata extractor (according to the debug log).
When the same file is downloaded to a Windows 10 workstation (via the Share interface) and then uploaded again into Alfresco (via the Share interface), the upload fires the metadata extractor and the file becomes searchable within Alfresco.
 
Here is my question: why don't the PDF files received by Alfresco via the CIFS interface fire Alfresco's metadata extractor?
 
Alfresco log showing the metadata extractor firing when the PDF file is uploaded via the Share interface (no such log appears when the same file is received via the CIFS interface):
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@b83b1f3
alfresco_1 | 2019-10-21 09:11:34,854 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Concurrent extractions : 0
alfresco_1 | 2019-10-21 09:11:34,854 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] New extraction accepted. Concurrent extractions : 1
alfresco_1 | 2019-10-21 09:11:34,891 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Extraction finalized. Remaining concurrent extraction : 0
alfresco_1 | 2019-10-21 09:11:34,891 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Converted extracted raw values to system values:
alfresco_1 | 2019-10-21 09:11:34,900 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-368] Completed metadata extraction:
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@b83b1f3
 
Could someone help me solve this issue?
 
Best regards.
 
Jean-Pierre Buttet
 
1 ACCEPTED ANSWER

Thank you all for your time.

I'd like to close this issue because of this:

CIFS doesn't work as I would like and I found that it will be removed:
https://hub.alfresco.com/t5/alfresco-content-services-blog/architecture-changes-for-alfresco-content...

I just switched to Alfresco 6.1 by using this https://github.com/Alfresco/alfresco-docker-installer

Then I configured the good old FTP interface. FTP works like a charm. I just added this to JAVA_OPTS in docker-compose.yml:
-Dftp.enabled=true
-Dftp.port=2121

Added the port to the alfresco service:

  • 21:2121 #ftp

Edited alfresco/Dockerfile and added this line at the bottom:
EXPOSE 2121

I had to use ACTIVE communication mode in the client in order to connect to Alfresco.
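
For anyone who wants to test the FTP endpoint without the scanner, here is a minimal Python sketch of an active-mode upload using the standard `ftplib` module. The helper names `upload_active` and `send_scan` are made up for illustration, and the host, port, credentials, and file path are placeholders to adapt to your setup; it is only a sketch, not a tested client for Alfresco.

```python
from ftplib import FTP
from typing import BinaryIO


def upload_active(ftp, data: BinaryIO, remote_name: str) -> None:
    """Upload a binary stream over an already logged-in FTP session in active mode."""
    # ftplib uses passive mode by default; turning it off selects active mode,
    # where the server opens the data connection back to the client.
    ftp.set_pasv(False)
    ftp.storbinary(f"STOR {remote_name}", data)


def send_scan(host: str, port: int, user: str, password: str, path: str) -> None:
    """Connect, log in, and upload one file. All arguments are placeholders."""
    with FTP() as ftp:
        ftp.connect(host, port)
        ftp.login(user, password)
        with open(path, "rb") as fh:
            # The file is stored under its base name in the session's current directory.
            upload_active(ftp, fh, path.rsplit("/", 1)[-1])
```

With the port mapping above, a call would look something like `send_scan("alfresco-host", 21, "xxx", "secret", "scan.pdf")`, where every value is hypothetical.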

The files received by Alfresco are readable AND the metadata is searchable with the Share interface. All right.


8 REPLIES

jljwoznica
Star Collaborator

Have you confirmed that the PDFs are not image PDFs? So, are they PDFs wrapped around an image file or was OCR done by the scanner putting the text into the PDF file? Have you also confirmed that if you take the scanner output and manually upload it that it properly goes through as expected?
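
One rough way to check this without opening the file in a viewer is sketched below in Python. It relies on a heuristic that is an assumption on my part: a PDF carrying an OCR text layer normally declares font resources (`/Font`), while a pure image PDF usually does not. The function name `looks_text_based` is invented for this example. Newer PDFs that use compressed object streams can hide these entries, so a negative result is not conclusive, and a proper PDF text-extraction tool is more reliable.

```python
def looks_text_based(pdf_bytes: bytes) -> bool:
    """Heuristic check: does this PDF declare any font resources?

    Returns False for data that does not look like a PDF at all.
    """
    # The PDF header is expected near the start of the file.
    if b"%PDF-" not in pdf_bytes[:1024]:
        return False
    # Text drawn on a page needs a font; image-only pages usually have none.
    return b"/Font" in pdf_bytes
```

Usage would be along the lines of `looks_text_based(open("scan.pdf", "rb").read())`, with `scan.pdf` standing in for a file produced by the scanner.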

Hello,

Thank you for your answer.

Yes, the scanner performs OCR and puts text into the PDF. This is confirmed because, as I wrote, the very same file becomes searchable after a simple download (to the workstation) followed immediately by an upload into Alfresco using the Share interface.

It looks like the PDF with text content doesn't fire the metadata extracter when sent by the scanner via the CIFS interface, but everything is okay when the same file is uploaded to Alfresco via Share by the user...

I'm not sure it's related to the strange extracter behavior I described, but I get the following debug logs during Alfresco startup:

alfresco_1 | 2019-10-21 15:24:41,219 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [localhost-startStop-1] Loaded mapping properties from resource: alfresco/metadata/TikaAutoMetadataExtracter.properties
alfresco_1 | 2019-10-21 15:24:41,222 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [localhost-startStop-1] No explicit embed mapping properties found at: alfresco/metadata/TikaAutoMetadataExtracter.embed.properties, assuming reverse of extract mapping
alfresco_1

Any ideas?

I'm still looking into this. Just to make completely sure - you do the exact same thing manually (same folder, same repository, same user, etc.) and it works?

Hello,

Thanks for your reply. Yes.

I just checked again:

first step:

The PDF file is sent to Alfresco (to the folder "Numerisation") via CIFS --> no DEBUG metadata extractor log, and the PDF is not searchable.

The Fujitsu N7100 scanner logs into Alfresco as user "xxx".

second step:

User "xxx" (a real user this time, not the scanner) connects to Alfresco and performs the following actions:

From the folder "Numerisation", the user downloads the file to a Windows 10 workstation via Share (Firefox browser),

then immediately afterwards, the user uploads the same file to Alfresco via Share (Firefox browser).

---> This fires the metadata extractor according to the DEBUG log, and the PDF becomes searchable:

alfresco_1 | 2019-10-22 14:08:30,582 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Starting metadata extraction:
alfresco_1 | reader: ContentAccessor[ contentUrl=store://2019/10/22/14/8/99524f4b-2ace-4202-8bd8-b83933f7edf9.bin, mimetype=application/pdf, size=405868, encoding=UTF-8, locale=fr]
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@3839fcb4
alfresco_1 | 2019-10-22 14:08:30,584 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Concurrent extractions : 0
alfresco_1 | 2019-10-22 14:08:30,585 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] New extraction accepted. Concurrent extractions : 1
alfresco_1 | 2019-10-22 14:08:30,601 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Extraction finalized. Remaining concurrent extraction : 0
alfresco_1 | 2019-10-22 14:08:30,602 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Converted extracted raw values to system values:
alfresco_1 | Raw Properties: {date=2019-10-22T15:06:30Z, pdf:PDFVersion=1.3, TIKA_PARSER_PARSE_SHAPES=false, xmp:CreatorTool=N7100 1.0, comments=null, dc:subject=null, meta:creation-date=2019-10-22T15:06:30Z, created=2019-10-22T15:06:30Z, author=null, MetadataDate=D:20191022160630+01'00', xmpTPg:NPages=1, Creation-Date=2019-10-22T15:06:30Z, dcterms:created=2019-10-22T15:06:30Z, Last-Modified=2019-10-22T15:06:30Z, dcterms:modified=2019-10-22T15:06:30Z, dc:format=application/pdf; version=1.3, title=null, Last-Save-Date=2019-10-22T15:06:30Z, meta:save-date=2019-10-22T15:06:30Z, pdf:encrypted=false, producer=PFU PDF Library 1.0, modified=2019-10-22T15:06:30Z, Content-Type=application/pdf}
alfresco_1 | System Properties: {{http://www.alfresco.org/model/content/1.0}created=2019-10-22T15:06:30Z, {http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
alfresco_1 | 2019-10-22 14:08:30,603 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Extracted Metadata from ContentAccessor[ contentUrl=store://2019/10/22/14/8/99524f4b-2ace-4202-8bd8-b83933f7edf9.bin, mimetype=application/pdf, size=405868, encoding=UTF-8, locale=fr]
alfresco_1 | Found: {date=2019-10-22T15:06:30Z, pdf:PDFVersion=1.3, TIKA_PARSER_PARSE_SHAPES=false, xmp:CreatorTool=N7100 1.0, comments=null, dc:subject=null, meta:creation-date=2019-10-22T15:06:30Z, created=2019-10-22T15:06:30Z, author=null, MetadataDate=D:20191022160630+01'00', xmpTPg:NPages=1, Creation-Date=2019-10-22T15:06:30Z, dcterms:created=2019-10-22T15:06:30Z, Last-Modified=2019-10-22T15:06:30Z, dcterms:modified=2019-10-22T15:06:30Z, dc:format=application/pdf; version=1.3, title=null, Last-Save-Date=2019-10-22T15:06:30Z, meta:save-date=2019-10-22T15:06:30Z, pdf:encrypted=false, producer=PFU PDF Library 1.0, modified=2019-10-22T15:06:30Z, Content-Type=application/pdf}
alfresco_1 | Mapped and Accepted: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
alfresco_1 | 2019-10-22 14:08:30,605 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Completed metadata extraction:
alfresco_1 | reader: ContentAccessor[ contentUrl=store://2019/10/22/14/8/99524f4b-2ace-4202-8bd8-b83933f7edf9.bin, mimetype=application/pdf, size=405868, encoding=UTF-8, locale=fr]
alfresco_1 | extracter: org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter@3839fcb4
alfresco_1 | changed: {{http://www.alfresco.org/model/content/1.0}title=null, {http://www.alfresco.org/model/content/1.0}author=null}
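
To compare a CIFS upload with a Share upload without reading the whole log by eye, a small Python sketch like the one below could pick out the extraction events. The regular expression is modeled only on the DEBUG lines shown in this thread, and the function name `extraction_events` is made up for illustration; other log layouts would need a different pattern.

```python
import re

# Modeled on lines such as:
# alfresco_1 | 2019-10-22 14:08:30,582 DEBUG [content.metadata.AbstractMappingMetadataExtracter] [http-bio-8080-exec-38] Starting metadata extraction:
EVENT = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) DEBUG "
    r"\[content\.metadata\.AbstractMappingMetadataExtracter\] "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<event>Starting|Completed) metadata extraction:"
)


def extraction_events(lines):
    """Return (timestamp, thread, event) tuples for each extraction start or end."""
    events = []
    for line in lines:
        match = EVENT.search(line)
        if match:
            events.append((match["time"], match["thread"], match["event"]))
    return events
```

An empty result for the time window of a CIFS upload, next to a Starting/Completed pair for the Share upload, would match what is described in this post.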

Thank you for your support. 🙂

Another question: the PDF is showing up when added via CIFS, correct? I am wondering if it is a permission or user issue. Is CIFS working in other situations for other operations?

Hello,

A network folder is configured in the network scanner. This network folder is a folder within Alfresco. The scanner sends the pdf via the CIFS interface to Alfresco. So we can say that the document is pushed by the scanner to Alfresco.

Alfresco receives the document, then the document can be read by the user by using the share interface but the metadata doesn't show up when the user performs a search (by using the share interface).

Surprisingly, if the user downloads the same PDF file to the workstation (with the Share interface) and then uploads the same file (with the Share interface), the metadata shows up during search.

Thank you for your help.

Best regards.
