We have a little transformer that interacts with Tesseract OCR and is able to extract text from image PDFs. Although Tesseract does not do layout analysis, this approach is quite useful for generating indexable text from documents that would otherwise not be searchable at all.
The problem is that, due to the way Lucene works, whenever a document is updated, no matter what the update changes, the node is fully reindexed.
That might not be a big deal for a fast transformer, but OCR transformation can easily take over 10 seconds per page, so a script that changes 1000 documents can easily keep your server busy for over 3 hours. It also means that, because of the batching strategy Alfresco uses for index writing, you might not notice any "improvement" in your searches until the whole job is done. If you upload a "TXT/Office/PDF with text" file right after those 1000 OCR-requiring files have changed, you won't be able to search for the content of that last file until all the OCR finishes…
Does SOLR indexing change this in any way? I'm not sure, but I guess not, since I suppose the transformation process is always done on the Alfresco server in a similar way. Can someone confirm?
What I want to do to solve this is create an "ExternalizedTransformer". I intend to build a transformer that simply emits a placeholder text such as "ExternalizedTransformerPending".
Then a secondary process/API or whatever would search for those nodes and transform them to text. After processing each one, it would attach an aspect with a "bnv:extractedText" property holding the text.
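To make the idea concrete, here is a minimal sketch of what that secondary process would do, with the Alfresco query machinery and the Tesseract call stubbed out as plain maps and a dummy method; the property and placeholder names come from this post, everything else is hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the secondary OCR worker: find nodes still carrying the
// placeholder text, run OCR on them (stubbed here) and store the result
// in the bnv:extractedText property. In Alfresco this would be a scheduled
// job querying the index; the maps below are illustrative stand-ins.
public class ExternalOcrWorker {
    static final String PLACEHOLDER = "ExternalizedTransformerPending";
    static final String EXTRACTED_TEXT_PROP = "bnv:extractedText";

    // Stub standing in for the real Tesseract invocation.
    static String runOcr(String nodeId) {
        return "ocr-text-for-" + nodeId;
    }

    // indexedText maps nodeId -> last indexed text;
    // properties maps nodeId -> that node's property map.
    static void processPending(Map<String, String> indexedText,
                               Map<String, Map<String, String>> properties) {
        for (Map.Entry<String, String> e : indexedText.entrySet()) {
            if (PLACEHOLDER.equals(e.getValue())) {
                properties.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                          .put(EXTRACTED_TEXT_PROP, runOcr(e.getKey()));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> indexed = new HashMap<>();
        indexed.put("node-1", PLACEHOLDER);
        indexed.put("node-2", "already has real text");
        Map<String, Map<String, String>> props = new HashMap<>();
        processPending(indexed, props);
        System.out.println(props.get("node-1").get(EXTRACTED_TEXT_PROP)); // ocr-text-for-node-1
        System.out.println(props.containsKey("node-2"));                  // false
    }
}
```

The point of the loop is that only placeholder nodes pay the OCR cost; nodes that already have real text are skipped entirely.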
Here comes the tricky part:
After the update, the transformer will obviously fire again, but this time "ExternalizedTransformer" should detect that "bnv:extractedText" exists and use its value as the transformer's payload. From then on, whenever the node is updated, the transformer will recover its text payload from the property, which should be lightning fast.
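The decision logic of "the tricky part" boils down to a single cache check. The sketch below shows just that logic, with the node's properties modelled as a plain map instead of Alfresco's NodeService (which is exactly the access I don't know how to get from the transformer, see below); names other than the placeholder and the property come from this post:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ExternalizedTransformer's core decision: if the node already
// carries bnv:extractedText, return it; otherwise emit the placeholder so
// the node gets indexed immediately and picked up by the external OCR job.
public class ExternalizedTransformerSketch {
    static final String PLACEHOLDER = "ExternalizedTransformerPending";
    static final String EXTRACTED_TEXT_PROP = "bnv:extractedText";

    // Returns the text payload the transformer would emit for a node.
    static String transform(Map<String, String> nodeProperties) {
        String cached = nodeProperties.get(EXTRACTED_TEXT_PROP);
        if (cached != null) {
            // Fast path: OCR was already done by the external process.
            return cached;
        }
        // First pass: placeholder text, so indexing finishes quickly.
        return PLACEHOLDER;
    }

    public static void main(String[] args) {
        Map<String, String> fresh = new HashMap<>();
        System.out.println(transform(fresh)); // ExternalizedTransformerPending

        Map<String, String> processed = new HashMap<>();
        processed.put(EXTRACTED_TEXT_PROP, "OCR result text");
        System.out.println(transform(processed)); // OCR result text
    }
}
```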
Obviously I will need to complement this with a behaviour attached to version updates that deletes the "bnv:extractedText" property whenever the content changes, but that doesn't seem like a big deal.
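The invalidation rule itself is trivial; in Alfresco it would live in a behaviour bound to a content-update policy, but stripped of the framework it is just this (again with the node's properties as a plain map for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the invalidation step: when the binary content changes, the
// cached bnv:extractedText is stale and must be dropped, so the next
// transform falls back to the placeholder and the external OCR runs again.
// The class and method names are hypothetical stand-ins for the behaviour.
public class ExtractedTextInvalidation {
    static final String EXTRACTED_TEXT_PROP = "bnv:extractedText";

    // Would be invoked by the behaviour whenever the node's content changes.
    static void onContentUpdate(Map<String, String> nodeProperties) {
        nodeProperties.remove(EXTRACTED_TEXT_PROP);
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put(EXTRACTED_TEXT_PROP, "stale OCR text");
        onContentUpdate(props);
        System.out.println(props.containsKey(EXTRACTED_TEXT_PROP)); // false
    }
}
```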
The problem is: how do I get a reference to the node being transformed? Java transformers usually extend AbstractContentTransformer2, but those only give you access to a ContentReader, not the node itself. Having access to the node is essential for "the tricky part": if I can't detect the aspect, I can't use this approach.
Any suggestions or ideas?
Thank you