Open Source Text Mining
12-16-2006 02:29 PM
I am personally delighted to see the important work being accomplished here, and how incredibly swiftly it has been delivered. I am quite envious of the privileged few who are involved with it.
As you look forward, please consider embedding an open source text mining framework, such as GATE, and an easy-to-use web-based rule entry and output analysis UI into the Alfresco suite.
In my humble opinion:
a) I sense the general proximity of these two organizations (3 hours distance?) could facilitate great work. Both organizations are made up of brilliant, highly motivated inventors and of users who want to see them succeed.
b) Given a user-selected corpus (full set or subset of documents), exposing text relationships within sets of documents would be a major attraction for industries like law enforcement, pharma research, insurance, and any corporation with a legal staff needing in-house e-Discovery capability. The format transformation services of Alfresco (I suppose this leverages the work of OpenOffice?) would fit very well as input to a GATE engine.
c) Offering the Alfresco suite as open source is already a monumental accomplishment. I can imagine the demand if this were the only ECM tool with this capability embedded, with a set of customers demonstrating its value and sharing it openly. The Big 3 (maybe with the exception of UIMA) don't appear to be giving this full consideration, perhaps understandably. They are leaving it to demand from their own customers, who probably don't know the functional value well enough to ask for it and who have other, long-standing functional needs to solve first.
d) The meta-data extraction threads I've read here discuss information extraction and full-text indexing, both valuable functions of text mining. The information extraction threads seem to assume parsing of structured areas within documents, e.g. email headers, known Office document metadata sections, etc. I only request a more detailed evaluation of how to incorporate unstructured extraction using a rules framework, hopefully without waiting until Alfresco v4.4.
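To make point (d) a little more concrete, here is a minimal sketch of what rule-driven extraction from unstructured text could look like. It is plain Python with regular expressions, not GATE and not any Alfresco API; the rule names, the patterns, and the `Annotation` type are all hypothetical and only stand in for what a real rules framework would provide.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    rule: str   # name of the rule that matched
    text: str   # matched text
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


# Hypothetical rules entered by a user: rule name -> pattern.
# A real rules framework would offer far richer matching than regexes.
RULES = {
    "date": re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract(text, rules=RULES):
    """Apply every rule to free text and return annotations sorted by offset."""
    found = []
    for name, pattern in rules.items():
        for match in pattern.finditer(text):
            found.append(Annotation(name, match.group(0), match.start(), match.end()))
    return sorted(found, key=lambda a: a.start)


if __name__ == "__main__":
    for annotation in extract("Paid $1,250.00 on 12-16-2006."):
        print(annotation)
```

The point of the sketch is only that the rules live in data rather than in code, so a web-based rule entry UI could edit them and an output analysis UI could display the resulting annotations.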


05-02-2007 07:33 AM
Hi,
I'm so glad to read your post. I am definitely interested in incorporating text mining functionality into a DMS.
I work in the Archives department of a non-profit organization. We regularly receive audio recordings of a certain person speaking on a large number of different subjects (often several subjects are addressed within the same recording).
As soon as we receive each recording we convert it to a .wav file and transcribe it as an MS Word document (we currently have around 3000 such documents). Our publications department regularly produces books, magazine and newspaper articles based on this material, and they constantly request material on particular subjects from us, which is very difficult to find by hand.
After spending a considerable amount of time researching text mining tools to help with this work, I found a piece of software called Leximancer - http://www.leximancer.com . This does pretty much exactly what we need, and enables us to specify a pre-built list of concepts around which it will build a semantic network. It then outputs a set of indexed HTML files (indexed according to a number of user-set parameters, which optimize it for different requirements - e.g. finding long passages about a single subject, or just one-line quotes), and then provides a browser which allows you to search for material on different subjects. Unfortunately it is not open source, and I could not find any open source tool which does what Leximancer does. (If anybody knows of one, I would love to hear about it.)
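For anyone trying to picture the kind of concept-seeded indexing described above, here is a rough sketch in Python. It is not how Leximancer works internally (it does plain keyword matching, not semantic network building); the function name `build_concept_index`, the file name, and the sample concepts are all made up for illustration.

```python
from collections import defaultdict


def build_concept_index(documents, concepts):
    """Map each concept to the (document name, paragraph number) pairs
    whose text mentions at least one of the concept's seed terms.

    documents: dict of document name -> plain text, paragraphs separated
               by blank lines.
    concepts:  dict of concept name -> list of lowercase seed terms.
    Matching is naive substring matching on lowercased text.
    """
    index = defaultdict(list)
    for name, text in documents.items():
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        for number, paragraph in enumerate(paragraphs):
            lowered = paragraph.lower()
            for concept, seeds in concepts.items():
                if any(seed in lowered for seed in seeds):
                    index[concept].append((name, number))
    return dict(index)


if __name__ == "__main__":
    docs = {"talk1.txt": "We spoke about peace and calm.\n\nThen about work and duty."}
    seeds = {"peace": ["peace", "calm"], "service": ["work", "duty"]}
    print(build_concept_index(docs, seeds))
```

A real tool would also rank passages and let you tune for long passages versus one-line quotes; this only shows the basic shape of "list of concepts in, index of passages out".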
Right now I am in the process of streamlining our entire workflow, so I am looking for a good document management system which is extensible enough for me to build in a third-party text analysis tool such as Leximancer. All its functions should be available within the DMS.
I also need a way of tagging and storing selections of text (once I have found interesting material using the text mining tool). These tagged text selections would be stored elsewhere in the repository and form the basis of a searchable database that I could regularly export and give to our publications department. (Something that works a bit like the Clipmarks social web clipping tool - http://www.clipmarks.com .)
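As a rough illustration of the tagged-selection idea, here is a small Python sketch. It is independent of Alfresco and of Clipmarks; the `Clip` record, the helper names, and the CSV column layout are assumptions made up for this example.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Clip:
    source: str  # name of the transcript the selection came from
    start: int   # character offset where the selection begins
    end: int     # character offset just past the selection
    text: str    # the selected text itself
    tags: list = field(default_factory=list)


def clip(source, document_text, start, end, tags):
    """Create a tagged selection from a character range of a document."""
    return Clip(source, start, end, document_text[start:end], list(tags))


def search(clips, tag):
    """Return every clip that carries the given tag."""
    return [c for c in clips if tag in c.tags]


def export_csv(clips):
    """Serialize clips to CSV text, one row per clip, tags joined by ';'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "start", "end", "tags", "text"])
    for c in clips:
        writer.writerow([c.source, c.start, c.end, ";".join(c.tags), c.text])
    return buffer.getvalue()


if __name__ == "__main__":
    transcript = "Patience is a form of courage."
    clips = [clip("talk1.doc", transcript, 0, 8, ["patience"])]
    print(export_csv(search(clips, "patience")))
```

Keeping the source name and offsets with each selection means an exported clip can always be traced back to the transcript it came from.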
We also want to start using the Dragon NaturallySpeaking voice recognition software to transcribe our files for us, so integration with that would be very useful for us too.
I have yet to explore Alfresco properly, but from what I've seen on the website it looks like a great tool, and I am seriously considering using it for our Archives.
