Project and Background: We are currently building out an Alfresco environment with 8-16 clustered nodes supporting a portion of a website that receives 80,000 - 150,000 logins per day.
I have been tasked with providing a search interface to the documents (PDF, OpenOffice docs, etc.) that we have stored in our Alfresco implementation. To accomplish this we have a cluster of Google Search Appliances that we will be using to index the Alfresco content.
The Question: I believe we need to write a custom search feed manager as opposed to using CIFS, because not only do we need to index all of the content, we also need to add metadata based on the folder/directory structure of the content in Alfresco. I am a total n00b when it comes to Alfresco (first saw it yesterday), but having read the docs, these are my possible solutions:
1) Create a scheduled action that will run nightly or hourly or whatever to pick up files changed within the specified time frame. I am hoping this can give me new, edited, and deleted content. According to the wiki docs (http://wiki.alfresco.com/wiki/Scheduled_Actions), I can only use the built-in actions?
2) Create a custom content transformation as described at (http://wiki.alfresco.com/wiki/Content_Transformations) that does not really transform the content but rather just calls our Google Search Appliance with the custom-created metadata.
3) Use web services to pull the changed content. I am not sure how to write a query that returns only the documents changed within a given time frame.
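Whichever option collects the changes, the part that sends them to the appliance might look roughly like the sketch below. It is only an illustration under assumptions: the element and attribute names follow my understanding of the GSA XML feed format (gsafeed, header, group, record, metadata, meta) and should be checked against the feeds documentation for your appliance version. The folder_level_N metadata names, the example URLs, and the shape of the change dictionaries are all made up for this example. How the list of changes is obtained from Alfresco is not shown.

```python
import xml.etree.ElementTree as ET


def folder_metadata(path):
    """Turn a repository path such as '/Company Home/Policies/HR/leave.pdf'
    into (name, value) pairs, one per folder level.
    The 'folder_level_N' naming scheme is hypothetical."""
    parts = [p for p in path.strip("/").split("/") if p]
    folders = parts[:-1]  # drop the document name itself
    return [("folder_level_%d" % i, name) for i, name in enumerate(folders, 1)]


def build_feed(datasource, changes):
    """Build a feed document from a list of change dicts.
    Each dict is assumed to have: url, path, mimetype, deleted (bool)."""
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    # Assumption: the appliance fetches the content itself from each URL.
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(feed, "group")
    for change in changes:
        record = ET.SubElement(
            group,
            "record",
            url=change["url"],
            mimetype=change["mimetype"],
            action="delete" if change["deleted"] else "add",
        )
        if not change["deleted"]:
            # Only added or updated documents carry folder metadata.
            metadata = ET.SubElement(record, "metadata")
            for name, content in folder_metadata(change["path"]):
                ET.SubElement(metadata, "meta", name=name, content=content)
    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    example_changes = [
        {"url": "http://alfresco.example.com/d/1",
         "path": "/Company Home/Policies/HR/leave.pdf",
         "mimetype": "application/pdf", "deleted": False},
        {"url": "http://alfresco.example.com/d/2",
         "path": "/Company Home/Old/stale.odt",
         "mimetype": "application/vnd.oasis.opendocument.text", "deleted": True},
    ]
    print(build_feed("alfresco", example_changes))
```

The resulting XML would then be posted to the appliance's feed endpoint. A real feed may also need the XML declaration and DOCTYPE line from the feeds documentation, which this sketch omits.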
It's been a while, but we are looking for ways to crawl Alfresco from an external search appliance. Did you end up using a Rule with a custom action? Were you able to handle ACLs, updates, deletes, and so on?