Chinese documents

chiho80
Champ in-the-making
I have added some Chinese Word documents and created a Chinese HTML document. I cannot search within these documents when I use Chinese terms (searching with English terms works fine).

I have looked into the index using Luke, and the Chinese words do not appear to be indexed. I suspected the StandardAnalyzer was the problem, but I tried different analyzers (a custom one and two from the Lucene sandbox), recreating the index on each attempt, still in vain. I have checked the analyzers against Lucene directly and they worked fine (in my test script).

Can anyone suggest other ways to check what happened? Is it a problem in the web layer?

Thanks in advance.

andy
Champ on-the-rise
Hi

The tokeniser is specified at the type level and is localisable.
If there is no localisation then the values from dictionaryModel.xml are used.
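
For example, the data type definitions in dictionaryModel.xml carry the analyzer class directly. The snippet below is only a sketch from memory of the default file's shape, not an exact copy:

   <data-type name="d:text">
      <analyser-class>org.alfresco.repo.search.impl.lucene.analysis.AlfrescoStandardAnalyser</analyser-class>
      <java-class>java.lang.String</java-class>
   </data-type>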

In the default configuration there is a default localisation bundle that specifies the tokenisers that are used. This is in the file

dataTypeAnalyzers.properties

You can set these for a particular locale by adding something like

dataTypeAnalyzers_zh_CN.properties

in the language pack. I don't think any of the language bundles include this as yet. If no locale-specific file is found, it falls back to the default.
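
For illustration, entries in these files map each dictionary data type onto an analyzer class. A dataTypeAnalyzers_zh_CN.properties might look like the sketch below - the key format mirrors the default file, and the sandbox CJKAnalyzer is just an illustrative choice, not what we ship:

   d_dictionary.datatype.d_text.analyzer=org.apache.lucene.analysis.cjk.CJKAnalyzer
   d_dictionary.datatype.d_content.analyzer=org.apache.lucene.analysis.cjk.CJKAnalyzer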

The next question is: "How is the locale found?"
The default locale is picked up from the server, or it can be set when interacting with the client. As a result, interactively adding a document may produce different tokenisation from indexing triggered by a rule: one path could take the locale set via the client, while the other uses the default Java locale in the repository.
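
If in doubt about which default the repository ends up with, a one-line check of the JVM's default locale (plain Java, nothing Alfresco-specific) can help:

   // Prints the default Java locale the repository falls back to
   System.out.println(java.util.Locale.getDefault());

The default can also be forced at JVM startup with the standard system properties, e.g. -Duser.language=zh -Duser.country=CN.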

I hope this helps. Let me know how you get on.

Regards

Andy

chiho80
Champ in-the-making
Hi Andy,

Thanks for your advice. I did try changing dataTypeAnalyzers.properties to use the custom analyzer (which can handle zh, zh_CN and en), but it had no effect at all.

This time, I have read through the index and it seems the index is not being written properly. The locale used when writing the index left the Chinese characters unreadable.

In Hong Kong we use zh as the main Chinese locale; zh_CN is used in mainland China. When we write zh content using the zh_CN locale, the characters get garbled, and when I look into the index it appears to have been written as zh_CN-encoded. I will try to set up a debugger to see where the problem is.
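
The effect is easy to reproduce outside Alfresco by decoding UTF-8 bytes with a mismatched charset; a minimal demo (the charset names here are only examples):

   public class CharsetDemo {
       public static void main(String[] args) throws Exception {
           // Decoding UTF-8 bytes with the wrong charset garbles CJK text
           byte[] utf8Bytes = "中文測試".getBytes("UTF-8");
           System.out.println(new String(utf8Bytes, "Big5"));  // mojibake
           System.out.println(new String(utf8Bytes, "UTF-8")); // round-trips correctly
       }
   }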

I am sorry that I cannot post the analyzer here for your testing, because it contains some copyrighted code. I will seek your advice along the way and will post the result / code changes here once the bug is fixed.

chiho80
Champ in-the-making
The problem is solved. It was due to how the Chinese file is read: when LuceneIndexerImpl loads a document from the file system, it does not specify a charset, which causes problems when reading CJK documents. Please find below the code I used to solve the problem.

[LuceneIndexerImpl] At around line 1428, I changed the code to:

InputStreamReader isr = null;
InputStream ris = reader.getContentInputStream();
try {
    // Read the content as UTF-8 instead of the platform default charset
    isr = new InputStreamReader(ris, "UTF-8");
} catch (UnsupportedEncodingException e) {
    // Defensive fallback; UTF-8 support is guaranteed in Java
    isr = new InputStreamReader(ris);
}
doc.add(Field.Text("TEXT", isr));

derek
Star Contributor
Hi,

Thanks for this. You are right, and we have fixed it.

During the indexing, we perform a transformation on any text that is not UTF-8 and force it to UTF-8.
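
A minimal sketch of that kind of normalisation, assuming the source charset is already known (the class and method names below are placeholders for illustration, not our actual API):

   import java.io.*;

   public class Utf8Transcoder {
       // Copy character data from a known source charset out as UTF-8
       public static void transcode(InputStream src, String srcCharset, OutputStream dest) throws IOException {
           Reader in = new InputStreamReader(src, srcCharset);
           Writer out = new OutputStreamWriter(dest, "UTF-8");
           char[] buf = new char[8192];
           for (int n; (n = in.read(buf)) != -1; ) {
               out.write(buf, 0, n);
           }
           out.flush();
       }
   }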

Regards

lhy719
Champ in-the-making
Hi Derek and Chiho,

I'm Chinese, based in Taiwan. I have Alfresco 1.1 installed, but I still can't find Chinese file or folder names. Likewise, searching for Chinese in any file content doesn't work.

Has the fix mentioned above been applied in Alfresco 1.1? Or does something else need to be done for Traditional Chinese?

Thanks!
Hammer Lee

derek
Star Contributor
Hi,

The default analyzer is for English.  You will notice that chiho80 was writing a custom analyzer for Chinese.  These are fairly standard components to write - you might search the Lucene forums for more information on where to get one.

Or you could beg chiho80 to convince his company to release the code.
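
For a sense of scale, the skeleton of such a component is tiny. Here is a hedged sketch against the Lucene 1.x-era API, delegating to the sandbox CJK tokenizer (the class name SimpleChineseAnalyzer is made up for illustration):

   import java.io.Reader;
   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.cjk.CJKTokenizer;

   // Minimal Chinese analyzer: every field is run through the CJK bigram tokenizer
   public class SimpleChineseAnalyzer extends Analyzer {
       public TokenStream tokenStream(String fieldName, Reader reader) {
           return new CJKTokenizer(reader);
       }
   }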

Regards

goodguy
Champ in-the-making
derek

Is this going into Alfresco 1.2?

derek
Star Contributor
That depends on whether you can convince chiho80 to give it to us.
:wink:

chiho80
Champ in-the-making
It's been a while since I was last here. I am sorry to say that, after a discussion with my boss, it will not be possible to contribute the analyzer, as it is one of the core technologies used in our (non-Alfresco) system.

If you want an open source alternative, you may try CJKAnalyzer, though it is not as accurate as a dictionary-based analyzer. You may contact me for help.
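
For anyone trying it, a minimal test along these lines shows the overlapping bigram tokens CJKAnalyzer produces (written against the Lucene 1.x-era API current at the time of this thread):

   import java.io.StringReader;
   import org.apache.lucene.analysis.Token;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.cjk.CJKAnalyzer;

   public class CJKAnalyzerDemo {
       public static void main(String[] args) throws Exception {
           // Tokenize a Chinese string and print each term the analyzer emits
           TokenStream ts = new CJKAnalyzer().tokenStream("TEXT", new StringReader("中文全文檢索"));
           for (Token t = ts.next(); t != null; t = ts.next()) {
               System.out.println(t.termText());
           }
       }
   }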