Chinese documents

chiho80
Champ in-the-making
I have added some Chinese Word documents and created a Chinese HTML document. I cannot find the documents when I search with Chinese terms (using English terms is OK).

I have looked into the index using Luke, and the Chinese words do not seem to be indexed. I suspected the StandardAnalyzer was the problem, but I tried different analyzers (a custom one and two from the Lucene sandbox), recreating the index on each attempt, still in vain. I have checked the analyzers against Lucene directly and they worked fine (in my test script).
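
(For reference, what I mean by "test script" is a minimal token dump along these lines - a rough sketch assuming the Lucene 1.4-era TokenStream API that Alfresco 1.2 bundles and the sandbox CJKAnalyzer; the field name and sample string are just placeholders:)

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerCheck {

    // Print every token an analyzer produces for a sample string.
    static void dump(String label, Analyzer analyzer, String text) throws Exception {
        System.out.println(label + ":");
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        Token token;
        while ((token = ts.next()) != null) {
            System.out.println("  " + token.termText());
        }
        ts.close();
    }

    public static void main(String[] args) throws Exception {
        String sample = "中文搜索测试 plus some English";   // placeholder sample text
        dump("StandardAnalyzer", new StandardAnalyzer(), sample);
        dump("CJKAnalyzer", new CJKAnalyzer(), sample);
    }
}

If the StandardAnalyzer output is single Chinese characters and the CJKAnalyzer output is overlapping two-character tokens, both analyzers are doing their job; the question is which one is actually used at index and query time.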

Can anyone tell me if there are other ways to check what happened? Is it a problem in the web layer?

Thanks in advance.

akinori
Champ in-the-making
Hi,

Have you succeeded in enabling CJKAnalyzer in Alfresco?

I'm Japanese. There are two major options for analyzing Japanese content in Lucene: JapaneseAnalyzer and CJKAnalyzer. So far I have failed with both of them.
I tried JapaneseAnalyzer first, because it looked better. JapaneseAnalyzer requires another piece of software called Sen to build a dictionary in advance, and it may need some system-property settings which I'm not sure I have set properly in the Tomcat context. I saw lots of errors and warnings in the bootstrap messages.

So now I'm trying CJKAnalyzer instead. It seemed to work; no error messages. But I still cannot find any double-byte characters in the index.

What I did…
1. Copied lucene-ja.jar, which includes CJKAnalyzer, into Alfresco's lib directory.
2. Created dataTypeAnalyzers_ja_JP.properties pointing to CJKAnalyzer (see the snippet after this list).
3. Restarted Alfresco (1.2 RC2).
4. Imported a Word file containing both English and Japanese text.
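
Roughly, that properties file just maps the text and content datatypes onto the analyzer class, something like this (the key names below are only my assumption of the format - copy the exact keys from the default dataTypeAnalyzers.properties and change nothing but the analyzer class):

# assumed key names - take the real ones from the stock dataTypeAnalyzers.properties
d_dictionary.datatype.d_text.analyzer=org.apache.lucene.analysis.cjk.CJKAnalyzer
d_dictionary.datatype.d_content.analyzer=org.apache.lucene.analysis.cjk.CJKAnalyzer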

After that,
1. The Word file was found by a query with an English word.
2. The Word file wasn't found by any query with a Japanese word (I tried words longer than three characters) - see the term-dump sketch below.
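
A quick way to see whether any double-byte terms made it into the index at all is to dump the index terms directly (a rough sketch, again assuming the Lucene 1.4-era API; the index path is a placeholder for wherever Alfresco keeps its Lucene indexes):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermDump {
    public static void main(String[] args) throws Exception {
        // Placeholder path - point this at the real Lucene index directory
        IndexReader reader = IndexReader.open("/path/to/lucene-index");
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term term = terms.term();
            // Only print terms containing non-ASCII characters
            if (term.text().matches(".*[^\\u0000-\\u007F].*")) {
                System.out.println(term.field() + " : " + term.text());
            }
        }
        terms.close();
        reader.close();
    }
}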

akinori
Champ in-the-making
Hi again,

I still haven't succeeded in searching Japanese content.
Anyway, I have found some more facts.

1. Both CJKAnalyzer and JapaneseAnalyzer ran without any error. (Only when I removed the analyzer's jar file did Alfresco give me exception messages, so I believe my properties file points to the analyzer properly.)
2. I can find Japanese file names and topic titles, but searching with Japanese keywords does not find any files whose contents include Japanese characters (see the sketch after this list).
3. English keywords still work. If the target file contains both English and Japanese words, I can find it with any of the English words.
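
One way to narrow down whether this is an indexing problem or a problem in the web layer is to query the index directly, outside Alfresco (a rough sketch assuming the Lucene 1.4-era API; the index path, field name and Japanese keyword are placeholders):

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class DirectSearch {
    public static void main(String[] args) throws Exception {
        // Placeholder index path and field name - adjust to the real ones
        IndexSearcher searcher = new IndexSearcher("/path/to/lucene-index");
        QueryParser parser = new QueryParser("TEXT", new CJKAnalyzer());
        Query query = parser.parse("日本語");   // placeholder Japanese keyword
        Hits hits = searcher.search(query);
        System.out.println("hits: " + hits.length());
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i));   // prints the stored fields of each hit
        }
        searcher.close();
    }
}

If this finds the document but the web client does not, the problem is above the index; if it finds nothing, the terms were never indexed with the analyzer you expect.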

Sorry, I know this is off-topic in the strict sense - the title says Chinese and I'm writing about a Japanese issue - but I supposed my situation looked similar to chiho80's case.
I was wondering if someone could show me how to check how Lucene works in Alfresco's context.

Akinori

andy
Champ on-the-rise
Hi

Apologies, this issue has mainly been discussed under French analysis.

There was a bug tokenising the TEXT attribute in the index - it was always using the standard analyser and not picking up the one defined for text attributes. This is fixed in the upcoming 1.2.1 build (tested with the French analyser). These changes will be merged into the HEAD code stream in due course. The code can be obtained from svn.

I will look into merge and 1.2.1 timings.

Regards

Andy

derek
Star Contributor
The HEAD now contains all the latest bug fixes from the V1.2.0 branch (future 1.2.1 release).