Indexing Chinese in Solr

We have asked Dan Funk, a committer to Project Blacklight, to provide a guest blog post for us on the details of how to approach indexing Chinese, particularly when you are a non-speaker.

OK, here are the two most important pieces of advice I can give you: separate your Chinese text into its own fields, and pick a tokenizer that actually understands Chinese. Separating the text into its own fields sets you up for handling additional languages fluidly and effectively, and it removes confusing, and likely false, results in a language the end user does not understand. Chinese is written with either Traditional or Standardized characters; since Traditional can be converted to Standardized fairly easily, the focus of this document is on Standardized.
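To make the first piece of advice concrete, here is a minimal sketch of what separate Chinese fields could look like in schema.xml. The field names and the text_zh type are hypothetical stand-ins, not names from the original post; text_zh is defined in the tokenizer examples further down.

    <!-- English content keeps its usual analysis chain -->
    <field name="title"    type="text_general" indexed="true" stored="true"/>
    <!-- Chinese content lives in its own fields with a Chinese-aware type -->
    <field name="title_zh" type="text_zh"      indexed="true" stored="true"/>
    <field name="body_zh"  type="text_zh"      indexed="true" stored="true"/>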
Chinese presents some real challenges for search. Chinese words are frequently made up of more than one character, and words are not separated by spaces. A common misconception is that a Chinese word is just its characters, but this is the case only a fraction of the time; 喜欢 ("to like"), for instance, is a single two-character word. The character 的, to take another example, is the single most common character in Standard Chinese by far, yet on its own it is a grammatical particle, not a content word.
While commercial options definitely exist, they were not a part of this comparison. For the comparison, a set of 12 documents was loaded into Lucene and three open-source tokenizers were evaluated:

Standard: tokenizes on spaces, but will shift to character tokenization for Chinese text. Because each character becomes its own token, you will not get good results.

Smart Chinese: uses a dictionary to pull words out of runs of characters. Ships with Solr as an add-on package. The dictionary is minimal and handles general cases well, but many nuances of the language are lost.

Paoding: uses a large set of dictionaries, and provides exceptionally good search results across a multitude of contexts. It thoughtfully parses Chinese characters and understands that character groups alter meaning.
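If you want to try Smart Chinese, a field type along the following lines is a reasonable starting point. This is a sketch for the Solr 3.x/4.x era this post dates from; the factory class names changed in later Solr releases, so check them against your version rather than treating this as canonical:

    <!-- Smart Chinese ships in Solr's analysis-extras add-on package -->
    <fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
      </analyzer>
    </fieldType>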
Paoding takes a little more setup. It has to be able to locate its dictionaries, typically via the PAODING_DIC_HOME environment variable or a paoding.dic.home entry in a paoding-dic-home.properties file on the classpath; if it cannot, it fails immediately with "PaodingAnalysisException: not found the dic home dirctory" (sic).
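Once the dictionaries are found, one way to wire Paoding into Solr is to point a field type straight at the Lucene Analyzer class that ships with paoding-analysis. This is a sketch, assuming the paoding-analysis jar and its dependencies are on Solr's classpath; many installations instead write a small custom TokenizerFactory around PaodingTokenizer, which the stock distribution does not provide:

    <!-- Solr accepts a Lucene Analyzer class directly in place of a tokenizer chain -->
    <fieldType name="text_zh_paoding" class="solr.TextField" positionIncrementGap="100">
      <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
    </fieldType>

To try it against the same fields, switch their type from text_zh to text_zh_paoding and reindex.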
Across the test documents, Paoding's results made it the standout of this comparison. And Eric Pugh (dep4b) provided some much needed mentoring, helping me see a way forward in what I initially believed was an intractable problem.
My Note on Solutions: open-source Chinese Tokenizer Library PaoDing with Apache Solr
From the comments on that post:

Hi, I was trying out this configuration. I use Paoding when indexing documents and when analyzing queries, and I compiled it with paoding-analysis. But when I search on this field, I get 0 results. I tried using the Paoding tokenizer as the query field tokenizer as well, but that didn't help. Is there anywhere else I should add it? Thanks a lot in advance. Awaiting reply. Regards, Rajani

Hi, thanks for your quick answer. Client integration with Solr is done using SolrJ (Java). When it came to search queries on the index, it could only search once; on the second query Solr caught an exception that the input reader was closed, thrown from the read() call in the tokenizer.
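For anyone trying to reproduce this, the SolrJ side of such a setup is small. Below is a bare-bones sketch using the SolrJ 4.x API that was current when this was written; the URL, core name, and field names are hypothetical, and the second query is there only to exercise the reader-closed failure described above:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class PaodingSmokeTest {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            // Index one document into the Chinese field
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("body_zh", "我喜欢吃苹果");
            server.add(doc);
            server.commit();

            // Query the same field twice; the failure reported above hit the second query
            for (int i = 0; i < 2; i++) {
                QueryResponse rsp = server.query(new SolrQuery("body_zh:苹果"));
                for (SolrDocument d : rsp.getResults()) {
                    System.out.println(d.getFieldValue("id"));
                }
            }
        }
    }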