Wednesday
Hello,
I am running Alfresco Community Edition 25.1.0 deployed using
alfresco-ansible-deployment on Rocky Linux 9.
Context:
- When I migrate a large number of documents using the Share UI
(drag & drop of a big folder, several GB),
Alfresco eventually crashes / times out (504), which makes this approach unusable for production migration.
- However, when the upload succeeds via Share, full-text extraction works correctly:
searching for a word contained inside DOCX/PDF files returns the expected documents.
To avoid crashes, I switched to the built-in Bulk Filesystem Import tool
(/alfresco/s/bulkfsimport), which works very well from a performance point of view:
- no timeout
- no repository instability
- all files are imported correctly
Problem:
After a bulk import:
- Solr indexing seems OK (documents are visible, metadata search works)
- BUT full-text extraction does not work:
searching for a word contained inside DOCX files returns no results
This is very different from the behavior of the Share upload.
What I already checked:
- Transform service is running
- MIME types are correctly detected (DOCX, PDF)
- Re-uploading the same file via Share immediately fixes full-text search for that document
- Re-setting the content via REST API (PUT /nodes/{id}/content) also triggers extraction correctly
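The REST workaround from the last bullet could be scripted roughly as follows. This is only a sketch under assumptions: the base URL (localhost:8080, the `-default-` tenant), the basic-auth credentials and the helper names are hypothetical, and it simply downloads each node's bytes and PUTs them back unchanged. Note that on versionable nodes this is likely to create a new version, so it should be tried on a test folder first.

```python
import base64
import urllib.request

# Assumption: default public REST API path on a local repository.
BASE_URL = "http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1"


def content_url(node_id, base_url=BASE_URL):
    """URL of the content endpoint for one node."""
    return f"{base_url}/nodes/{node_id}/content"


def basic_auth_header(user, password):
    """Build an HTTP Basic Authorization header (hypothetical credentials)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def build_put_request(node_id, data, mime_type, headers, base_url=BASE_URL):
    """Prepare PUT /nodes/{id}/content with the given bytes as the body."""
    req = urllib.request.Request(content_url(node_id, base_url), data=data, method="PUT")
    req.add_header("Content-Type", mime_type)
    for name, value in headers.items():
        req.add_header(name, value)
    return req


def rewrite_content(node_id, mime_type, headers, base_url=BASE_URL):
    """Download the node's current bytes and PUT them back unchanged."""
    get_req = urllib.request.Request(content_url(node_id, base_url), headers=headers)
    with urllib.request.urlopen(get_req) as resp:
        data = resp.read()
    put_req = build_put_request(node_id, data, mime_type, headers, base_url)
    with urllib.request.urlopen(put_req) as resp:
        return resp.status
```

Usage would be something like `rewrite_content(node_id, "application/pdf", basic_auth_header("admin", "admin"))` for each imported node id, ideally in small batches so the transform service is not flooded.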
This makes me think that:
- either text extraction is skipped during bulk import
- or the transform/extract step is not triggered the same way as with Share uploads
Questions:
1. Is this a known limitation or expected behavior of Bulk Import in Community Edition?
2. Is there a recommended way to force text extraction after a bulk import
(reindex, re-transform, batch operation, configuration option)?
3. Are there specific settings (transform, tika, bulk import flags)
that must be enabled to get full-text extraction during bulk import?
Any guidance or confirmation would be greatly appreciated.
Thank you.
Thursday
Hello,
Could it be that you have locale issues?
If you don't have Solr configured as multilanguage and the server and web browser languages are not the same, content uploaded via drag & drop will be indexed in the language of the web browser, and content loaded via bulkfsimport in the server language.
If this is the case, only the content indexed in the matching language will be returned.
I have been able to replicate your behaviour using the browser in Spanish and the server with the default English language, but it disappeared when I configured the server in Spanish too.
You could configure both in the same language, or configure Solr as multilanguage.
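One way to check whether locale is really the cause might be to run the same full-text query under two locales and compare the hit counts. The sketch below assumes the Search REST API at its default path and that the request body accepts a `localization.locales` list; please verify both against the API explorer of your installation. The URL, the helper names and the search term are hypothetical.

```python
import json
import urllib.request

# Assumption: default Search REST API path on a local repository.
SEARCH_URL = "http://localhost:8080/alfresco/api/-default-/public/search/versions/1/search"


def build_search_body(term, locale):
    """AFTS full-text query for one term, restricted to a single locale."""
    return {
        "query": {"language": "afts", "query": f'TEXT:"{term}"'},
        "localization": {"locales": [locale]},
    }


def build_search_request(term, locale, headers, url=SEARCH_URL):
    """Prepare the POST request carrying the JSON search body."""
    payload = json.dumps(build_search_body(term, locale)).encode("utf-8")
    req = urllib.request.Request(url, data=payload, method="POST")
    req.add_header("Content-Type", "application/json")
    for name, value in headers.items():
        req.add_header(name, value)
    return req


def count_hits(term, locale, headers, url=SEARCH_URL):
    """Run the query and return the number of entries on the first page."""
    with urllib.request.urlopen(build_search_request(term, locale, headers, url)) as resp:
        result = json.load(resp)
    return len(result["list"]["entries"])
```

If `count_hits("someword", "es", headers)` and `count_hits("someword", "en", headers)` differ for a word that only appears in bulk-imported documents, that would point towards the locale explanation.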
Regards
Roberto Gámiz Sánchez
Alfresco Content Services Engineer
Friday
Hello Roberto,
Thank you very much for your answer, that’s a really interesting point.
I honestly hadn’t thought about the language aspect at all.
In our case, the server and the browser are indeed not using the same language, so this could clearly explain the different behavior we observe between bulk import and Share uploads.
I will test aligning the server language with the browser language, and also look into configuring Solr in multi-language mode.
I also have a related question regarding text extraction.
We noticed that very large PDF files (for example, documents with more than 1000 pages) are not fully text-extracted, even when uploaded via Share. Metadata is indexed and the file is accessible, but full-text search does not return results from the document content.
We suspect this might be related to timeouts during text extraction.
Is there a way in Alfresco (Community Edition) to:
Any guidance or best practices for handling text extraction on very large documents would be greatly appreciated.
Thanks again for your help!
Best regards,
yesterday
Sorry that this answer is not very concrete. I can only point you in a direction.
There are some actions in Alfresco for which the Share application times out, but the action carries on in the background.
What @roberto_gamiz says makes sense. Check the sys:locale property and the Solr6 configuration for multilanguage.
For the large files, look for the configuration property that limits the source size for transformations of that MIME type. The search subsystem uses a transformer to get the text from the file, so raise the size limit for PDF files. For "very large documents", also look at the transformer's resources: the transformation itself can take a lot of memory and CPU.
Serge