Japanese Language support #532
-
Japanese Language support
## Current behavior, pointed-out issues, and possible enhancements

### Language Detection

**Current behavior**

Meilisearch language detection is handled by an external library named whatlang; then, depending on the detected script and language, a specialized segmenter and specialized normalizers are chosen to tokenize the provided text.

related to:

**Possible enhancement**
### Segmentation

Meilisearch Japanese segmentation is handled by an external library named lindera.

### Normalization

**Current behavior**

Currently, there is no specialized normalization for Japanese.

**Possible enhancement**

We could normalize Japanese words by converting them into hiragana. This could increase the recall of Meilisearch, because a query typed in hiragana could then match the katakana spelling of the same word:
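As a rough sketch of what that conversion could look like, using the `wana_kana` crate (the same dependency mentioned later in this thread for Charabia's `japanese-transliteration` feature; the exact helper path is an assumption based on the crate's docs):

```rust
// Minimal sketch: collapsing katakana (and romaji) into hiragana, so that
// differently written forms of the same word normalize to a single form.
use wana_kana::to_hiragana::to_hiragana;

fn main() {
    // Katakana and romaji spellings map to the same hiragana string,
    // so a hiragana query could match the katakana form of a word.
    assert_eq!(to_hiragana("カタカナ"), "かたかな");
    assert_eq!(to_hiragana("katakana"), "かたかな");
}
```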
## Troubleshooting 🆘

### A query containing kanji doesn't retrieve all the relevant documents

When doing a search query containing only kanji, the text may be detected as Chinese instead of Japanese, so the Japanese-specialized segmenter and normalizers are not applied and some relevant documents are not retrieved.
**Workaround**

The only workaround is to use a specialized Meilisearch version that deactivates the Chinese language support. Below is the link to the PR containing all the released versions:

## Possible fixes

## Contribute!

In Meilisearch, we don't speak or understand all the languages of the world; we could be wrong in our interpretation of how to support a new language in order to provide a relevant search experience.
Thanks for your help!
-
@ManyTheFish Thank you for writing the details.
It's fine. This is enough for Japanese to understand 👍

### Language detection only with Kanji/Hanzi strings

I researched various things to see if whatlang could handle it, but it might be better not to expect too much. I'm just thinking it might be better to think of another way.

### Normalization

I think Unicode NFKC is enough for Japanese normalization (e.g. https://github.com/unicode-rs/unicode-normalization). It converts compatibility characters, such as half-width katakana and full-width alphanumerics, into their canonical forms.
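A quick illustration with that crate (a minimal, self-contained sketch; the sample strings are arbitrary):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFKC folds compatibility variants into one canonical form:
    // half-width katakana becomes full-width katakana…
    assert_eq!("ｱｯﾌﾟﾙ".nfkc().collect::<String>(), "アップル");
    // …and full-width Latin letters and digits become ASCII.
    assert_eq!("Ｍｅｉｌｉ１０".nfkc().collect::<String>(), "Meili10");
}
```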
### Indexing

I am very happy to be able to do ambiguous searches in hiragana, so I agree. Since Lindera holds the pronunciation in katakana, I feel this can be achieved by devising the indexing.

### About the Japanese input methods

This is a supplement in the hope that it will be of some help.
There are two input methods for Japanese: "Romaji" and "Kana".
These can be switched via the Japanese IME options. Most Japanese users choose romaji input because the keyboard layout is easier to learn.
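For reference, the same `wana_kana` crate mentioned elsewhere in this thread can also go the other way, from kana to romaji, which hints at how romaji input relates to the kana forms users search for (the module path is an assumption based on the crate's docs):

```rust
// Sketch: mapping kana back to romaji, the inverse of the IME conversion
// most Japanese users rely on when typing.
use wana_kana::to_romaji::to_romaji;

fn main() {
    assert_eq!(to_romaji("ひらがな"), "hiragana");
    assert_eq!(to_romaji("カタカナ"), "katakana");
}
```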
-
Great summary! In Lucene, there is a good example explaining romaji input variations. The Lucene Japanese tokenizer can output romaji or katakana pronunciation information. I hope this is helpful.
-
First, thank you a lot for your suggestions, they are really helpful! 🙏

### Normalization

If I try to summarize your points, we have several options to normalize Japanese: Unicode NFKC normalization, and converting words into hiragana based on the katakana pronunciation that Lindera provides.
### Language detection only with Kanji/Hanzi strings
When I see your graph, I understand why it would not significantly improve language detection for our case. 🤔
-
Hello all!
All these issues are open to external contributions during the whole month, so don't hesitate to contribute! 🧑‍💻 This is another step in enhancing Japanese language support; depending on future feedback, we will be able to go further. Thanks for all your feedback! ✍️ 🇯🇵
-
All great! If they are fixed, I think it will be a product that can be used without hesitation in a Japanese environment. 👍👍👍👍👍
-
rel & ref: meilisearch/meilisearch#3565, meilisearch/meilisearch#3569, and Twitter. I thought it would not be right to write in a closed issue, so I will write here, which is the starting point.
If so, it may still be difficult to use in production in a Japanese environment at this time. It may be that you have considered this and not adopted it, but wouldn't it be simpler and easier to use if we could add an option to specify the language, like "Algolia", or document settings?

Ref "Algolia settings":

I imagine:

```jsonc
// Priority to Japanese > Chinese > English > Others
"languagePriorityAttributes": ["ja", "zh", "en"]

// Priority to Chinese > Japanese > English > Others
"languagePriorityAttributes": ["zh", "ja", "en"]

// Auto = Same behavior as current implementation (v1.1.0-rc1)
"languagePriorityAttributes": []
```

In my opinion, given the compatibility of instant-meilisearch and InstantSearch, it would be better for the user to match the Algolia implementation; that is, users can choose the language they want to search in.
-
Hello @miiton, this PR should temporarily solve your problem before we find a permanent fix! About the possible fixes:
Until a permanent fix is implemented, I highly suggest you subscribe to the PR above and use the linked Docker image; we will keep it up to date with Meilisearch stable versions. Sorry for all these inconveniences. I hope we will quickly release a permanent fix to properly support the Japanese language.
-
Thanks a lot! I am trying that Docker image and it seems to work almost perfectly. Great... I understand the other matters as well, and look forward to future updates 😍
-
Re: using index settings for language priority: this is kind of tricky in my situation. I've got many records in both Chinese and Japanese that are stored in the same index. If possible, I'd rather have a search param that specifies the preferred language (though I'd assume storage would blow up during tokenization if both Chinese and Japanese tokens are being made).
-
Hi all, I also put here the comment I wrote on meilisearch/charabia#139. This character normalization seems to be performed after tokenization, but in some cases it is better to perform character normalization before tokenization in Japanese. For example, this is a case where there is no problem even after tokenization:
Half-width katakana converted to full-width katakana. But the following cases can be problematic.
Since full-width numbers are already registered in the morphological dictionary (IPADIC), each number becomes a single token, so a full-width number such as １０００ is segmented into four separate digit tokens rather than one number. If possible, I would like you to consider a way to perform character normalization before tokenization.
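To illustrate the suggested ordering, here is a minimal sketch using the `unicode-normalization` crate; the whitespace segmenter is only a stand-in for a real dictionary-based segmenter like Lindera:

```rust
use unicode_normalization::UnicodeNormalization;

// Stand-in segmenter: splits on whitespace. A dictionary-based segmenter
// such as Lindera would instead split "１０００" into four single-digit
// tokens, because IPADIC registers full-width digits as separate entries.
fn segment(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

// Suggested order: normalize first, then segment, so the segmenter sees
// the ASCII run "1000" instead of four full-width digits.
fn tokenize(text: &str) -> Vec<String> {
    let normalized: String = text.nfkc().collect();
    segment(&normalized).into_iter().map(str::to_owned).collect()
}

fn main() {
    assert_eq!(tokenize("１０００ 円"), vec!["1000", "円"]);
}
```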
-
@ManyTheFish I hope it will be of some help to you. https://speakerdeck.com/mosuka/the-importance-of-morphological-analysis-in-japanese-search-engines
-
I have published a simple application that I made to confirm that Meilisearch works in Japanese.
-
Hello people!

## The current behavior

### Language Detection

Today, we are using whatlang-rs to detect the script and the language of a text. Language detection is really important for Japanese language support, mainly to differentiate Japanese from Chinese when only kanji are used in a text, for example a small title or a query.

### Segmentation

To segment Japanese text, we are currently using Lindera, based on a Viterbi algorithm using the IPA dictionaries. Thanks to @mosuka for maintaining it. (a small explanation of Japanese segmentation)

### Normalization

So far, we only normalize Japanese characters by replacing them with their decomposed compatibility form; to give an example, half-width kanas are converted into full-width kanas. To know more about this, I put some documentation below:

## The remaining issues we should tackle in the future
## Prototypes

There is a prototype of Meilisearch that completely deactivates the Chinese support; this way we avoid language detection mistakes. In addition, this prototype activates the katakana-to-hiragana conversion. If you want to try this prototype, here is the link to it:

Thanks!
-
### Handling of proper nouns in Japanese

Is the issue of not being able to search for proper nouns that are not in IPADIC already being discussed, like the Chinese language support issue? ref: misskey-dev/misskey/issues/10845

Target contents:
Search word:
-
@miiton @ManyTheFish
-
@ManyTheFish
-
Hello @mosuka,

```toml
# allow japanese specialized tokenization
japanese = ["japanese-segmentation-unidic"]
japanese-segmentation-ipadic = ["lindera-tokenizer/ipadic", "lindera-tokenizer/ipadic-compress"]
japanese-segmentation-unidic = ["lindera-tokenizer/unidic", "lindera-tokenizer/unidic-compress"]
japanese-transliteration = ["dep:wana_kana"]
```

What do you think? 🙂
-
That's a good idea.
-
@ManyTheFish Thanks.
-
Hello everyone 👋 Here is an update on Meilisearch and the Japanese support.

### New release v1.3 🦁

v1.3 has been released today 🦁 including a change in the Japanese segmentation: Meilisearch now relies on UniDic instead of IPADIC to segment Japanese words, which should increase the number of documents retrieved by Meilisearch. We still encounter difficulties when a dataset contains small documents with kanji-only fields; if you don't manage to retrieve documents containing kanji-only fields with the official Meilisearch version, please try the Japanese specialized Docker image that deactivates the other language support.

### A preview of v1.4 👀

We just released a 🧪 prototype that allows users to customize how Meilisearch tokenizes documents and queries, and we'd love your feedback.
#### How to get the prototype?

Using Docker, use the following command:
From source, compile Meilisearch on the corresponding prototype branch.

#### How to use the prototype?

You can find some examples below, or look at the original PR for more info.

Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️
-
### Facet search is not working as I expected in Japanese

I am trying facet search on a Japanese demo site I have published, but it doesn't seem to work the way I want it to. I am trying to narrow down the prefectures... the example of Osaka-fu is easy to understand. (Meilisearch version: prototype-japanese-5)

If I type in
-
Hi! Thank you guys for your fantastic work on improving Japanese support. I tried the Docker image meilisearch/prototype-japanese-7, which works really well in my case (Drupal Search API + Meilisearch backend). I have two questions:
My apologies if this is not the right place to ask. Thank you for any input!
-
Hello @hktang,
Yes, sorry, the link to the dedicated PR with all the released versions is below; I will link this PR in a troubleshooting section here.
Not so far, but there is a feature request to let users choose the language used on each index; I put the link below. Don't hesitate to react or comment in the discussion explaining your use case; this helps us in our prioritization process. Don't hesitate to ask more questions if you need to.
-
Hi @ManyTheFish, great! Thank you so much for the information. This is exactly what I was looking for.
-
Thanks for the great tool. My search query was '大学' and it is showing results for '大垣' as well, which doesn't make sense.
-
Hi @bedus-creation, you may be able to solve the problem by changing the settings.
-
Hi @bedus-creation and @mosuka,

By default, Meilisearch's matching strategy is `last`, which lets documents that match only some of the query terms be returned; setting `matchingStrategy` to `all` requires every term to match, so a query segmented as 大 + 学 will no longer return documents that only contain 大:

```sh
curl \
  -X POST 'http://localhost:7700/indexes/movies/search' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "q": "大学",
    "matchingStrategy": "all"
  }'
```

Don't hesitate to come back to us if none of these solutions work, and thank you @mosuka for your help,
-
Hello!
-
@tec-koshi I have been told that it is possible, but I think it is best to contact Cloud Support to be sure.
-
@miiton
-
Discord is fine 👍 ref: https://help.meilisearch.com/en/article/where-can-i-find-support-1qgucwt/
-
Thank you very much!
Hello All,
If you want more information about the latest release, I put the link to it below.

Below is the PR with all the Japanese-language-specialized Docker images:

See you!
-
### Specify a user dictionary for Japanese

I am impressed with the creation of such a wonderful search engine! 😀 Japanese does not separate words with spaces, so a good dictionary is necessary to determine word boundaries. Particularly for new words or proper nouns, registering words in a user dictionary may be required. Lindera has a feature to specify a user dictionary; if this is appropriate, I would like to make a PR for it.
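For context, a rough sketch of the kind of entry Lindera's simple user dictionary takes: a CSV line of surface form, part of speech, and katakana reading. The entries below are made-up examples, and the exact column layout should be checked against Lindera's documentation:

```rust
// Made-up entries illustrating the simple user-dictionary CSV layout:
// surface form, part of speech, reading (katakana), one word per line.
const USER_DICT_CSV: &str = "\
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
ミーリサーチ,カスタム名詞,ミーリサーチ";

fn main() {
    for entry in USER_DICT_CSV.lines() {
        let fields: Vec<&str> = entry.split(',').collect();
        // With such an entry, the segmenter can keep the proper noun whole
        // instead of splitting it into unrelated dictionary sub-words.
        println!("surface: {}, pos: {}, reading: {}", fields[0], fields[1], fields[2]);
    }
}
```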
-
Hello @kamiyn, thank you for your suggestion. Today, there is a dictionary feature in Meilisearch allowing you to declare specific words related to your dataset; this is not as advanced as Lindera's, but it may fit your needs. Let me know about it; a sketch of how to set it is below. See you!
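A minimal sketch (not an official example) of pushing such words through the dictionary settings route documented for Meilisearch; the index name and words are placeholders, and the `reqwest` dependency (with its `blocking` and `json` features) is an assumption:

```rust
// Sketch: updating the Meilisearch `dictionary` setting over HTTP.
// Assumes a local instance without an API key and an index named "movies".
fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    client
        .put("http://localhost:7700/indexes/movies/settings/dictionary")
        .json(&["東京スカイツリー", "ミーリサーチ"])
        .send()?
        .error_for_status()?;
    Ok(())
}
```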
-
Hello @ManyTheFish, thank you for your response. I also recognize that a lot of work is needed to make custom dictionaries easily accessible to end users of the Meilisearch services. On the other hand, it is common to specify parameters via environment variables when using Docker, so I thought it would be beneficial. Especially if it is provided as an explanation of how to use the Japanese-specific Docker image, I believe the number of users via Docker will increase. In that sense, this modification to Charabia seems to be a question of how the Japanese-specific Docker image is positioned.
-
Hello All,
If you want more information about the latest release, I put the link to it below.

Below is the PR with all the Japanese-language-specialized Docker images:

See you!
-
Thanks for the great support and great software! We started using … Is there any way to get back these features?
-
Nice! Thank you @PSeitz
-
@miiton @ManyTheFish Thanks for the response. I finally tested hiragana and katakana and they're working fine. But I found an issue with half-width and full-width; for example: searching an
-
I usually use NFKC. For example, it converts full-width "Ａｐｐｌｅ" to "Apple" and half-width "ｱｯﾌﾟﾙ" to "アップル". NFKC can be beneficial for languages beyond Japanese.
-
We need to check the tokenizer used by Meilisearch; it's a bit weird that it doesn't work if NFKC normalizes this variation 🤔 In Charabia we are using NFKD to normalize this, which should be basically the same.
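A small check with the `unicode-normalization` crate illustrating why the two forms should behave the same once recomposed (a sketch; the sample string is arbitrary):

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let input = "Ａｐｐｌｅ ｱｯﾌﾟﾙ";
    let nfkc: String = input.nfkc().collect();
    let nfkd: String = input.nfkd().collect();
    // NFKC and NFKD apply the same compatibility mapping; NFKD only leaves
    // combining marks (like the kana voicing mark) uncomposed, and
    // recomposing NFKD output with NFC yields exactly the NFKC form.
    assert_eq!(nfkc, "Apple アップル");
    assert_eq!(nfkd.nfc().collect::<String>(), nfkc);
}
```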
-
For those who try Meilisearch for the first time: you should try v1.10.2 (or any later versions) with
-
I would like to enable Lindera's character filters and token filters in Charabia. I am thinking that if I don't create the Charabia token based on values other than the text recorded in Lindera's token, the term will end up out of position for highlighting, etc. I made it possible to describe Lindera settings in YAML; we would be very happy if this could be accomplished, allowing Japanese-specific string handling to be configured from outside Meilisearch using an environment variable. Is there a better way to do this?
-
Hey @mosuka, I think we could add a metadata type:

```rust
use std::borrow::Cow;

type LinderaMetadata<'o> = Vec<Cow<'o, str>>;

#[derive(Default)]
enum Metadata<'o> {
    #[default]
    None,
    Lindera(LinderaMetadata<'o>),
}
```

Then the segmenter trait should implement a new method with a default implementation:

```rust
pub trait Segmenter: Sync + Send {
    /// Segments the provided text, creating an Iterator over `&str`.
    fn segment_str<'o>(&self, s: &'o str) -> Box<dyn Iterator<Item = &'o str> + 'o>;

    /// Segments the provided text, creating an Iterator over `(&str, Metadata)`.
    ///
    /// This method uses `segment_str` by default.
    fn segment_str_with_meta<'o>(
        &self,
        s: &'o str,
    ) -> Box<dyn Iterator<Item = (&'o str, Metadata<'o>)> + 'o> {
        Box::new(self.segment_str(s).map(|s| (s, Metadata::None)))
    }
}
```

But the Lindera segmenter would override the default implementation and add the metadata itself. After this, we just have to make `SegmentedTokenIter` use `segment_str_with_meta`.

It's a bit of work, but it's totally acceptable for Charabia and seems future-proof for other tokenizers 🤔
-
Hi all, thanks for the great support for this (niche) direction, but I'd like to know if there's any plan to rebase/bump the Japanese support tag/branch against the newest developments, e.g. given that the new indexer is much faster?
-
Hi! As far as I know, the specialized tag/branch is no longer needed since we added