Hyland Connect

camille · ‎09-05-2005

Hello, French speakers!

We at NeoDoc are currently evaluating the translation files and will start translation asap (after PR6 is published).

If you wish to participate, please post a message here.

See you soon,

davidc · ‎03-02-2006

I would go for importing the French analyzer library into Alfresco. There may be all sorts of issues upgrading to Lucene 1.9. You're welcome to try, but that path may be much longer and require more testing.

sam69 · ‎03-03-2006

Hi !
Thanks for your responses.

I managed to integrate a modified version of Lucene for the French language by following these steps:

- copy FrenchAnalyzer.class, FrenchStemFilter.class and FrenchStemmer.class into lucene.jar (more precisely, into org\apache\lucene\analysis\fr)
(I know this method is not really good; I should create a new jar or something…)

- copy the lucene.jar (which now includes the FrenchAnalyzer) into:
\alfresco\WEB-INF\lib

- create dataTypeAnalyzers_fr_FR.properties from the dataTypeAnalyzers.properties in
alfresco\WEB-INF\classes\alfresco\model
and replace standard.StandardAnalyzer with fr.FrenchAnalyzer where appropriate.

- restart Alfresco and import/create documents.
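
The properties step above can be sketched in a few lines of Java. This is only an illustration, not the actual Alfresco file: the property keys and the second analyser class name in `main` are hypothetical, and the sketch assumes that the values are fully qualified class names under org.apache.lucene.analysis.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class AnalyzerPropertiesSketch {
    static final String STANDARD = "org.apache.lucene.analysis.standard.StandardAnalyzer";
    static final String FRENCH = "org.apache.lucene.analysis.fr.FrenchAnalyzer";

    // Returns a copy of the given properties in which every value naming the
    // StandardAnalyzer is replaced by the FrenchAnalyzer; other values are kept.
    static Properties toFrench(Properties source) {
        Properties result = new Properties();
        for (String key : source.stringPropertyNames()) {
            String value = source.getProperty(key).trim();
            result.setProperty(key, value.equals(STANDARD) ? FRENCH : value);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical keys and values, for illustration only.
        String sample =
            "example.datatype.text.analyzer=" + STANDARD + "\n"
            + "example.datatype.int.analyzer=com.example.HypotheticalIntegerAnalyzer\n";
        Properties original = new Properties();
        original.load(new StringReader(sample));

        Properties french = toFrench(original);
        for (String key : french.stringPropertyNames()) {
            System.out.println(key + "=" + french.getProperty(key));
        }
    }
}
```

Only the entries that pointed at the StandardAnalyzer change; analysers for other data types are left alone, which matches the "where appropriate" in the step above.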

The French analyser seems to be used, and I will run some tests.

PS: Is there an Alfresco repository for language packs where we can work on and post files?

lgr · ‎03-03-2006

Sam,

You have, in my opinion, three options:
- send the file to Alfresco so they can post it somewhere (I don't think there is a place for contributions yet)
- send the file to Neobiz, which uses an svn repository to let people work on language files. I don't really know how it works
- or send the files to me. I have a small site dedicated to Alfresco where I post language packs as soon as I release them, and I can post any kind of file there. (I think I can pool all contributions to the French version of Alfresco, as long as there aren't too many.)

Laurent.

sam69 · ‎03-03-2006

OK Laurent. It's strange that the Alfresco team doesn't use their own product for this 😉

I have tested the Lucene jar modified for French, and the results are not very good. I am not sure that the French analyser is used for the queries.

For testing, I added an invalid line to dataTypeAnalyzers_fr_FR.properties to see whether the file was read and, as expected, I got an exception. So I think the problem is not there.

I got the French analyser from the Lucene web page. Here is an extract of FrenchStemmer.java:



      replaceFrom( R2, new String[] { "ences", "ence" }, "ent" );

      String[] search = { "atrices", "ateurs", "ations", "atrice", "ateur", "ation"};
      deleteButSuffixFromElseReplace( R2, search, "ic",  true, R0, "iqU" );

[….]

// if one of the next steps is performed, we will need to perform step2a
      boolean temp = false;
      temp = replaceFrom( RV, new String[] { "amment" }, "ant" );

So I tested some keywords:

in the file -> searched -> result

constamment -> constant -> not found
amateur -> amatrice -> not found
amateur -> amateurs -> not found
fréquence -> fréquent -> not found
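
To make the expectation behind the first test row concrete, here is a minimal sketch of the single "amment" to "ant" rule from the extract. It is not the real FrenchStemmer: the real class restricts each rule to a region of the word (RV, R2) and runs many more steps. The class and method names below are made up for the illustration.

```java
class SuffixRuleSketch {
    // Applies one illustrative rule: a trailing "amment" becomes "ant".
    // Words without that suffix are returned unchanged.
    static String applyAmmentRule(String word) {
        String suffix = "amment";
        if (word.endsWith(suffix)) {
            return word.substring(0, word.length() - suffix.length()) + "ant";
        }
        return word;
    }

    public static void main(String[] args) {
        String indexed = applyAmmentRule("constamment"); // term as it would be stored
        String queried = applyAmmentRule("constant");    // term as it would be searched
        System.out.println(indexed + " / " + queried + " / match=" + indexed.equals(queried));
    }
}
```

If the same rule runs at index time and at query time, both words reduce to the same term, so a search for "constant" should find a file containing "constamment". A "not found" therefore suggests the rule was skipped on at least one side.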

Another extract of the code:

stop words: words that will not be indexed at all

  public final static String[] FRENCH_STOP_WORDS = {
    "a", "afin", "ai", "ainsi", "après", "attendu", "au", "aujourd", "auquel", "aussi",
    "autre", "autres", "aux", "auxquelles", "auxquels", "avait", "avant", "avec", "avoir",
    "c", "car", "ce", "ceci", "cela", "celle", "celles", "celui", "cependant", "certain",
    "certaine", "certaines", "certains", "ces", "cet", "cette", "ceux", "chez", "ci",
    "combien", "comme", "comment", "concernant", "contre", "d", "dans", "de", "debout",
    "dedans", "dehors", "delà", "depuis", "derrière", "des", "désormais", "desquelles",

I also tried searching for some stop words, but they seem to be indexed, because I got results…

So I think the FrenchAnalyzer is not being used. Any idea what the problem is?

PS: I discovered that Alfresco appends the wildcard character '*' to the search keyword…

andy · ‎03-03-2006

Hi

The analyser is determined from the data dictionary using the config as described; this is done in the LuceneAnalyser class. The same analyser is used for writing to the index and for parsing the query. The config settings you have should select the correct analyser. You could set a breakpoint there to check.

If you think it is tokenised correctly at index time, it should be OK at query time.

Upgrading to Lucene 1.9/2.0 is not recommended.
We have some stability/scalability fixes in our 1.4.3 jar that have not gone into Lucene for some reason. I have raised the issues. We also have some improvements for merging segments without continual index optimisation, which is very expensive.

We will migrate at some point.

Only documents indexed or changed after you have set the indexer will be treated correctly. Searching an index with a tokeniser that does not match how the field was indexed will produce odd results. Note that stop words may have been indexed previously or be valid for other data types, so files indexed previously will be wrong. Edit the text or an attribute to force a reindex.
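
The point about previously indexed files can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Lucene code: the "index" is just a set of tokens, the tokeniser splits on whitespace, and the stop list is a handful of entries from the FRENCH_STOP_WORDS extract quoted earlier in the thread. All class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StaleIndexSketch {
    // A few entries from the stop-word extract quoted in this thread.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("afin", "ainsi", "avec", "dans", "de"));

    // Old configuration: lower-case, split on whitespace, keep every token.
    static List<String> analyseWithoutStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // New configuration: same tokens, minus the stop words.
    static List<String> analyseWithStopWords(String text) {
        List<String> tokens = analyseWithoutStopWords(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Travailler avec soin afin de finir";
        Set<String> indexedBefore = new HashSet<>(analyseWithoutStopWords(text));
        Set<String> indexedAfter = new HashSet<>(analyseWithStopWords(text));
        System.out.println("indexed before the change, contains afin: " + indexedBefore.contains("afin"));
        System.out.println("re-indexed after the change, contains afin: " + indexedAfter.contains("afin"));
    }
}
```

A document indexed before the change still carries its stop words until it is re-indexed, so finding "afin" in an old document does not by itself prove that the new analyser is being ignored.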

Does this help?

Cheers

Andy

sam69 · ‎03-06-2006

Thank you for your response, Andy!

OK, to debug I need to set up the development environment for Alfresco.
I will see whether the French analyser is used or not.

Cheers,

Samuel

sam69 · ‎03-06-2006

Hi, I am back. After debugging, it sounds like the French analyser is used some of the time (e.g. for the document's metadata), but maybe not for the content of the document itself.

I set a breakpoint in this method (in LuceneAnalyser.java):

public TokenStream tokenStream(String fieldName, Reader reader)
    {
        Analyzer analyser = (Analyzer) analysers.get(fieldName);
        if (analyser == null)
        {
            analyser = findAnalyser(fieldName);
        }
        return analyser.tokenStream(fieldName, reader);
    }

And it uses the French analyser for all of these field names:


During Alfresco startup:

fieldName= "@{http://www.alfresco.org/model/system/1.0}node-uuid"
fieldName= "@{http://www.alfresco.org/model/system/1.0}store-protocol"
fieldName= "@{http://www.alfresco.org/model/system/1.0}versionMinor"
fieldName= "@{http://www.alfresco.org/model/system/1.0}versionRevision"
fieldName= "@{http://www.alfresco.org/model/content/1.0}name"
fieldName= "@{http://www.alfresco.org/model/system/1.0}store-identifier"
fieldName= "@{http://www.alfresco.org/model/user/1.0}username"

While creating a new HTML file:

fieldName= "@{http://www.alfresco.org/model/content/1.0}modifier"
fieldName= "@{http://www.alfresco.org/model/content/1.0}owner"
fieldName= "@{http://www.alfresco.org/model/content/1.0}description"
fieldName= "@{http://www.alfresco.org/model/application/1.0}icon"
fieldName= "@{http://www.alfresco.org/model/system/1.0}node-uuid"
fieldName= "@{http://www.alfresco.org/model/content/1.0}creator"
fieldName= "@{http://www.alfresco.org/model/content/1.0}title"
fieldName= "@{http://www.alfresco.org/model/system/1.0}store-identifier"
fieldName= "@{http://www.alfresco.org/model/application/1.0}editInline"
fieldName= "@{http://www.alfresco.org/model/content/1.0}modifier"
fieldName= "@{http://www.alfresco.org/model/content/1.0}content"
…and more

Here is an extract of the FrenchAnalyzer as shown in the debugger (it looks like the right one):

analyser= FrenchAnalyzer  (id=125)
stoptable= HashSet<E>  (id=538) ([ont, siennes, ceux, donc, là, été, comme, hélas, mien, ?, depuis, avait, mienne, cette, se, d, cependant, autres, outre, ès, ni, néanmoins, vers, doit, de, certain, la, hors, si, tu, des, nos, non, sien, les, comment, aux, ce, mais, certaines, debout, toute, leurs, sa, contre, sienne, l, on, divers, près, toutes, voilà, désormais, dont, même, soit, moi, nôtres, chez, laquelle, ?])

But for fieldName= "TEXT" it takes the default analyser, which is the StandardAnalyzer. I don't know whether that field is the content of the added document itself, but it sounds like it is.
Can somebody confirm that?
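
The behaviour seen in the debugger can be modelled roughly as follows. This is a guess at the shape of the lookup, not the actual LuceneAnalyser code: the configuration map, the string analyser names and the rule "anything not configured gets the default" are all assumptions made for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

class FieldAnalyserLookupSketch {
    static final String DEFAULT_ANALYSER = "StandardAnalyzer";

    // Cache of already resolved fields, like the analysers map in tokenStream().
    private final Map<String, String> analysers = new HashMap<>();
    // Hypothetical per-field configuration (field name -> analyser name).
    private final Map<String, String> configured;

    FieldAnalyserLookupSketch(Map<String, String> configured) {
        this.configured = new HashMap<>(configured);
    }

    // Same cache-then-resolve shape as the tokenStream() method quoted above.
    String analyserFor(String fieldName) {
        String analyser = analysers.get(fieldName);
        if (analyser == null) {
            analyser = findAnalyser(fieldName);
        }
        return analyser;
    }

    // Assumed resolution rule: configured fields get their analyser,
    // every other field silently falls back to the default.
    private String findAnalyser(String fieldName) {
        String analyser = configured.getOrDefault(fieldName, DEFAULT_ANALYSER);
        analysers.put(fieldName, analyser);
        return analyser;
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("@{http://www.alfresco.org/model/content/1.0}title", "FrenchAnalyzer");

        FieldAnalyserLookupSketch lookup = new FieldAnalyserLookupSketch(config);
        System.out.println(lookup.analyserFor("@{http://www.alfresco.org/model/content/1.0}title"));
        System.out.println(lookup.analyserFor("TEXT"));
    }
}
```

Under these assumptions, a field such as TEXT that is not covered by the data type configuration ends up with the default analyser without any error, which would be consistent with property fields being stemmed while full-text content is not.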

After that, I checked whether the StandardAnalyzer (for English) works properly. I set the language to English and added a new document containing some words from the English stop list (such as the, this, with, and …, which should not be indexed), and they seem to be indexed despite the stop list.
Can anyone confirm whether English stop words are indexed on their version of Alfresco? (I am no longer sure of my Alfresco configuration…)

Thanks in advance !

Samuel

sam69 · ‎03-06-2006

One more thing: I am not sure about the FrenchAnalyzer I took. Maybe it is not for the same version of Lucene that Alfresco uses (1.4.3).
So if someone has a FrenchAnalyzer for Lucene, it would be great if you could send it to me (sampub aat gmail doot com) so I can try it.

Thanks

andy · ‎03-13-2006

Hi

TEXT is not being indexed correctly; this will be fixed shortly.

The French analyser from the Lucene trunk works fine for me.
It has one dependency on 1.9, but this is self-contained.

TEXT is then tokenised correctly and stop words are not found.

The search from the UI adds its own ugly stemming by appending * to search terms. It does not exclusively use phrases, which would be tokenised correctly.

So in the example given

File -> Search -> Found

amateur -> amat -> Yes
amateur -> amateur -> Yes
amateur -> amateurs -> No
amatrice -> amat -> Yes
amatrice -> amateur -> No
afin -> afin -> No (Stop word not indexed)
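
The "amateur" rows of this table can be reproduced with a toy prefix match. This is a sketch, not the Alfresco query code. It assumes that the file containing "amateur" is indexed under the term "amateur" and that the search text is used as a raw prefix without being analysed; both are assumptions that happen to fit the results above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TrailingWildcardSketch {
    // True if any indexed term starts with the raw, unanalysed search text,
    // which is how a query of the form text* behaves in this toy model.
    static boolean prefixMatches(Set<String> indexedTerms, String searchText) {
        for (String term : indexedTerms) {
            if (term.startsWith(searchText)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: the indexed term for the file is "amateur".
        Set<String> index = new HashSet<>(Arrays.asList("amateur"));
        for (String search : new String[] {"amat", "amateur", "amateurs"}) {
            System.out.println(search + "* -> " + prefixMatches(index, search));
        }
    }
}
```

Because the search text is never stemmed in this model, the plural "amateurs" is longer than the indexed term and cannot match as a prefix, even though an analysed query would reduce it to the same term.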

Improving how the query entered in the UI is parsed is on the todo list.

Regards

Andy

sam69 · ‎04-05-2006

Thanks for your response, Andy!

So I will wait for the next release to test it again, and I will provide the French analyser in a jar along with the procedure to integrate it.

Regards,
Samuel

French (fr_FR)