Is Multilingual Searching Possible in Java?

Kushal.20 · June 19, 2019, 7:23am

Hello!
I am working with the Search product. Does GroupDocs support multilingual searching? For example, if I enter a Japanese or Chinese character as the search query, will I get the documents that contain it? Is that feasible?

atir.tahir · June 19, 2019, 9:08am

@Kushal.20,

We are investigating this. Your investigation ticket ID is SEARCHJAVA-81. As soon as we have any further update, you’ll be notified.
However, the API does recognize search queries typed in a different keyboard layout. For example, the word ‘pause’ typed on the Greek keyboard layout is ‘παθσε’. Please go through this article.

atir.tahir · June 19, 2019, 11:23am

@Kushal.20,

The API does support multilingual search. It doesn’t matter which languages are used in a search query.
Japanese and Chinese need some extra setup. By default, all Japanese and Chinese characters are treated as separators, i.e. they are not indexed. A character is indexed only if you set its character type to letter in the Alphabet dictionary.
Another nuance is that these languages use their own separators (white-space characters).
If you do not set these separators, words of an acceptable length (less than 81 characters) will not be detected and therefore will not be indexed.
Below is the example code. Sample file for this use case: The Little Red Book (chinese).zip (76.0 KB)

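// Create (or open) an index in the given folder.
Index index = new Index(Utilities.INDEX_PATH);
// Mark the characters to be searched as letters; by default they are separators.
char[] characters = new char[] { '\u5E74', '\u6708' };
index.getDictionaries().getAlphabet().setRange(characters, CharacterType.Letter);
index.addToIndex(Utilities.DOCUMENTS_PATH);
// Single backslash, so that Java passes the character itself as the query.
SearchResults searchResults = index.search("\u5E74");
// Write the text of the first found document to HTML with the hits highlighted.
index.highlightInText("D:\\highlighted.html", searchResults.get_Item(0));
for (DocumentResultInfo result : searchResults) {
    System.out.println(result.getHitCount() + " hits are in " + result.getFileName());
}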
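
The snippet above registers only two characters as letters. If you need a whole block, the array can be built in a loop. Below is a minimal sketch in plain Java; the class and method names (CjkAlphabetRange, charRange) are made up for illustration and are not part of the GroupDocs API. It assumes that the CJK Unified Ideographs block (U+4E00 to U+9FFF) is the range you want and that the resulting array is passed to the same setRange call as above.

```java
public class CjkAlphabetRange {

    // Hypothetical helper: returns every char from first to last, inclusive.
    // Works for characters in the Basic Multilingual Plane only.
    static char[] charRange(char first, char last) {
        if (first > last) {
            throw new IllegalArgumentException("first must not be greater than last");
        }
        char[] result = new char[last - first + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = (char) (first + i);
        }
        return result;
    }

    public static void main(String[] args) {
        // CJK Unified Ideographs block.
        char[] cjk = charRange('\u4E00', '\u9FFF');
        System.out.println(cjk.length + " characters in the range");

        // Assumption: the array is then used exactly like the two-character
        // array in the example above, e.g.
        // index.getDictionaries().getAlphabet().setRange(cjk, CharacterType.Letter);
    }
}
```

For Japanese text, the hiragana and katakana blocks (U+3040 to U+30FF) could be added the same way.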

There could be a problem with these two languages. The API does not fully support hieroglyphic languages like Chinese and Japanese, because it cannot add each hieroglyph to the index as a separate word. However, this could be improved in a future release.
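
To illustrate what "each hieroglyph as a separate word" would mean, here is a rough sketch in plain Java that puts a space around every Han character in a string, so that an ordinary white-space tokenizer would see each one as its own word. This is not a GroupDocs feature; the class and method names (HanSpacer, spaceOutHan) are hypothetical, and it only helps if you can preprocess plain text before it is indexed.

```java
public class HanSpacer {

    // Hypothetical helper: surrounds every Han ideograph with spaces and
    // collapses the resulting runs of spaces into a single space.
    static String spaceOutHan(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                out.append(' ').appendCodePoint(cp).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString().trim().replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        // Two adjacent ideographs end up as two space-separated tokens.
        System.out.println(spaceOutHan("\u5E74\u6708"));
    }
}
```

Hiragana and katakana are separate Unicode scripts, so this sketch leaves them untouched; Japanese text would need its own word segmentation.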