Meilisearch

Japanese Language support#532

ManyTheFish started this conversation inFeedback & Feature Proposal

Japanese Language support#532

ManyTheFish

Sep 15, 2022

· 19 comments· 56 replies

Return to top

Discussion options

ManyTheFish
Sep 15, 2022
Collaborator

Japanese Languages support

officially supported

Current behavior, pointed out issues, and Possible enhancement

Language Detection

Current behavior

Meilisearch Language detection is handled by an external library namedwhatlang, then, depending on the detected Script and Language, a specialized segmenter and specialized Normalizers are chosen to tokenize the provided text.

related to:

meilisearch#2403

Possible enhancement

An issue has been created on the whatlang repository by@miiton to enhance "CJK" language detection:whatlang-rs#122

Segmentation

Meilisearch Japanese Segmentation is handled by an external library namedlindera.

Normalization

Currently, there is no specialized normalization for Japanese.

Possible enhancement

We could normalize Japanese words by converting them into Hiragana, this could increase the recall of Meilisearch because:

This would allow finding Katakana words when making a Hiragana query

for instanceダメ, is also spelled駄目, orだめ (I didn't find a better example, sorry 🙏)

Japanese keyboard seems to only contain Hiragana characters. And so,Katakana andKanji characters are written inHiragana by the user, then the computer will suggest aKatakana or aKanji version of the written text.

💭 correct me if I am wrong 🙏, but this would allow Meilisearch to better retrieve typos by suggestingKanjis characters that have a close hiragana writting instead of suggesting any character.

Troubleshooting 🆘

A Query containing Kanjis doesn't retrieve all the relevant documents

When doing a search query with onlyKanji characters, the language detection doesn't classify the query as a Japanese one but as a Chinese one because:

Kanjis is a set of traditional Chinese characters used in Japanese and some are used in both Languages
whatlang doesn't have an advanced algorithm to make the difference between both Languages

💭 basically whatlang will classify texts as Mandarinunless there is at least 5% of Hiragana or Katakana characters.
So if we are in the case of only havingKanjis, it will always be classified as Mandarin, and then, the wrong tokenization pipeline will be chosen.

Workaround

The only workaround is to use a specialized Meilisearch version that deactivates the Chinese Language support. Below is the link to the PR containing all the released versions:
meilisearch/meilisearch#3882

Possible fixes

Allow the user to set the Language in the index settings

Contribute!

In Meilisearch, we don't speak nor understand all the Languages in the world, we could be wrong in our interpretation of how to support a new Language in order to provide a relevant search experience.
However, if you are a native speaker, don't hesitate to contribute to enhancing this experience:

⬆️ by upvoting this discussion to help us in prioritizing language supports
💬 by pointing out errors and oversights in this discussion
🧑‍💻 by opening a pull request oncharabia, the tokenizer used by Meilisearch

Thanks for your help!

You must be logged in to vote

Replies: 19 comments 56 replies

Comment options

miiton
Sep 22, 2022

@ManyTheFish Thank you for writing the details.

for instance ダメ, is also spelled 駄目, or だめ (I didn't find a better example, sorry 🙏)

It's fine. This is enough for Japanese to understand 👍

Language detection only with Kanji/Hanzi strings

I researched various things to see if whatlang could handle it, but it might be better not to expect too much.
This is because, as you know, Japanese kanji overlaps with Chinese, and the ratio is very large.

I'm just thinking it might be better to think of another way.

Normalization

I think Unicode NKFC is enough for Japanese normalization. ( e.g.https://github.com/unicode-rs/unicode-normalization )

It's convert...

source	unicode(source)	converted	unicode(converted)
`ｶﾞｷﾞｸﾞｹﾞｺﾞ`	`\uff76\uff9e\uff77\uff9e\uff78\uff9e\uff79\uff9e\uff7a\uff9e`	`ガギグゲゴ`	`\u30ac\u30ae\u30b0\u30b2\u30b4`

Indexing

I am very happy to be able to do ambiguous searches in hiragana, so I agree.

Since Lindera holds Pronounce by Katakana, I feel that it can be achieved by devising the indexing.

About the Japanese input method.

This is a supplement in the hope that it will be of some help.

Japanese keyboard seems to only contain Hiragana characters. And so, Katakana and Kanji characters are written in Hiragana by the user, then the computer will suggest a Katakana or a Kanji version of the written text.

There are two input methods for Japanese: "Romaji" and "Kana".

Romaji (Input by alhpabets)	Kana (Input hiragana directly)

These can switch by Japanese IME option.

Most Japanese choose romaji input because it is easy to learn the keyboard layout.

You must be logged in to vote

2 replies

Comment options

johtani Sep 22, 2022

Great summary!

In Lucene, there is good example explanation about Romaji Input variation.
https://issues.apache.org/jira/browse/LUCENE-10102

Lucene Japanese Tokenizer can output Romaji or Katakana Pronounce information.
They haveKatakana-Romaji Converter method. But it only supports Hepburn romanization and it's a bit difference between Romaji Input.

I hope this is helpful.

Comment options

ManyTheFish Sep 22, 2022
Collaborator Author

Hello@miiton @johtani,

First, thank you a lot for your suggestion, It is really helpful! 🙏

Normalization

If I try to summarize your points, we have several options to normalize Japanese:

doing anNFKC orNFKD normalization in order to unify canonical and compatibility equivalences

💭 it seems to be our softer option in terms of Normalization because we only lose little information.
However, it would be impossible for Meilisearch to retrieveだめ by searchingダメ.

Making a phonological Normalization by:

Romanizing characters
converting every character in Hiragana
converting every character in Katakana

💭 these last options seem, in my sense, similar in terms of impact on Meilisearch,だめ andダメ will be unified after normalization allowing Meilisearch to retrieve equally both versions. The only difference I see is, which keyboard should be favored.

Language detection only with Kanji/Hanzi strings

I researched various things to see if whatlang could handle it, but it might be better not to expect too much.
This is because, as you know, Japanese kanji overlaps with Chinese, and the ratio is very large.
[..]
I'm just thinking it might be better to think of another way.

When I see your graph, I understand why it would not significantly improve language detection for our case. 🤔
So I have followed your suggestion about finding another solution, and here I am:

It is almost impossible to know if a pair of kanji alone is a Chinese query or a Japanese one.
However, despite knowing the Language of the search request, we are able to know the main Languages of the Index where the user is searching, because, as a search query, we tokenize documents during the indexing.
Moreover, whatlang provides a way to set a subset of Languages that can be detected with theDetector::with_allowlist method.
Therefore, we could save, in the index, the detected Languages during the indexing in order to set anallowlist at search time,
this would avoid choosing the wrong tokenization pipeline when trying to detect a small query unless we have documents of both Chinese and Japanese Languages in the same index.

Comment options

ManyTheFish
Oct 5, 2022
Collaborator Author

Hello all!
I come back to you to make some updates.
ForHacktoberfest we've created several issues to enhance the Japanese support of Meilisearch:

Implement a Compatibility Decomposition Normalizer charabia#139:

A simple normalization allowing Meilisearch to retrieve "same" characters with different Unicode code points likeｶﾞ andガ.

Implement a Japanese specialized Normalizer charabia#131:

Allow Meilisearch to retrieve Hiragana words with Katakana search queries likeダメ andだめ

Add an allowlist to the tokenizer builder charabia#132:

the first step to enhance Meilisearch Language detection by forcing Charabia to choose in a sub-set of Language avoiding it to detect Chinese when it is not in the list

Store detected Language per document during indexing milli#646:

the second step to enhance Meilisearch Language detection by detecting languages during document indexing and storing it.

All these issues are open to external contributions during the whole month, so don't hesitate to contribute! 🧑‍💻
I'll be glad to help you if you have any questions.

This is another step in enhancing Japanese Language support, depending on future feedback, we will be able to go further.

Thanks for all your feedback! ✍️ 🇯🇵

You must be logged in to vote

5 replies

Comment options

miiton Oct 6, 2022

All great!

If they are fixed, I think it will be a product that can be used without hesitation in a Japanese. 👍👍👍👍👍

Comment options

miiton Mar 13, 2023

rel & ref:meilisearch/meilisearch#3565,meilisearch/meilisearch#3569 andTwitter

@ManyTheFish

I thought it would be different to write in a closed Issue, so I will write here, which is the starting point.

I think that I exhausted the possible enhancements on Meilisearch for now.

If so, it may still be difficult to use it production in a Japanese environment at this time.
Although a series of modifications have been made to strengthen the language judgment at the time of indexing, I still felt that the "difficulty in distinguishing Chinese or Japanese in Kanji-only words" is still difficult.

It may be that you have considered this and have not adopted it, but wouldn't it be simpler and easier to use if we could add an option to specify the language like "Algolia" or document settings likelanguagePriorityAttributes? Without thinking too much about it, I think it would be simpler and easier to use.

Ref "Algolia settings" :indexLanguages,queryLanguages

I imaginelanguagePriorityAttributes to be defined as follows

// Priority to Japanese > Chinese > English > Others"languagePriorityAttributes":["ja","zh","en"]// Priority to Chinese > Japanese > English > Others"languagePriorityAttributes":["zh","ja","en"]// Auto = Same behavior as current implementation (v1.1.0-rc1)"languagePriorityAttributes":[]

In my opinion, given the compatibility of Instant-meilisearch and InstantSearch, it would be better for the user to match the Algolia implementation. That is, users can choose the language they want to search in.

Comment options

ManyTheFish Mar 14, 2023
Collaborator Author

Hello@miiton,
about the miss-detection of the Japanese Language, I put below a PR providing a docker image that deactivates Chinese tokenization:
meilisearch/meilisearch#3588

This PR should temporarily solve your problem before we find a permanent fix!

About the possible fixes:

You suggested a setting to set the detectable Languages, that's something we have in mind, but, because it changes the API, we have to discuss with the product managers to choose the best user interface, and it takes more time than enhancing Meilisearch without changing the API. (poking@gmourier on this)
We could enhance Language detection by contributing to WhatLang following your suggestions or trying to replace the Language detection withhttps://github.com/quickwit-oss/whichlang.

Unless a permanent fix is implemented, I highly suggest you subscribe to the upper PR and use the linked Docker image, We will keep it up to date with Meilisearch stable versions.

Sorry for all these inconveniences, I hope that we will quickly release a permanent fix to definitely support the Japanese Language.☺️

Comment options

miiton Mar 16, 2023

@ManyTheFish

Thanks a lot! I am trying that docker image and it seems to work almost perfectly. I am trying that docker image and it seems to be working almost perfectly. Great...

I understand the other matters as well, and look forward to future updates 😍

Comment options

jamsch May 29, 2023

Re: using index settings for language priority: This is kind of tricky in my situation. I've got many records in both Chinese and Japanese that are stored on the same index. If possible, I'd rather have a search param that specifies the preferred language (but I'd assume during tokenization it'd blow up storage if both cn/jp tokens are being made)

Comment options

mosuka
Oct 6, 2022

Hi all,

I also put here the comment I wrote onmeilisearch/charabia#139 .

This character normalization seems to be performed after tokenization, but in some cases, it is better to perform character normalization before tokenization in Japanese.

For example, this is a case where there is no problem even after tokenization:

$ echo "私はメガネを買いました。" | lindera私名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシは助詞,係助詞,*,*,*,*,は,ハ,ワメガネ名詞,一般,*,*,*,*,メガネ,メガネ,メガネを助詞,格助詞,一般,*,*,*,を,ヲ,ヲ買い動詞,自立,*,*,五段・ワ行促音便,連用形,買う,カイ,カイまし助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシた助動詞,*,*,*,特殊・タ,基本形,た,タ,タ。記号,句点,*,*,*,*,。,。,。EOS

$ echo "私はﾒｶﾞﾈを買いました。" | lindera私名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシは助詞,係助詞,*,*,*,*,は,ハ,ワﾒｶﾞﾈUNKを助詞,格助詞,一般,*,*,*,を,ヲ,ヲ買い動詞,自立,*,*,五段・ワ行促音便,連用形,買う,カイ,カイまし助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシた助動詞,*,*,*,特殊・タ,基本形,た,タ,タ。記号,句点,*,*,*,*,。,。,。EOS

Half-widthﾒｶﾞﾈ is not a problem because it is tokenized as a single unknown word, although it does not exist in the Japanese morphological dictionary (IPADIC). Of course, if normalization has been done in advance, morphological analysis can be used to retrieve the part-of-speech information of words from the dictionary accurately.

But the following cases can be problematic.

$ echo "私は時給1000円です。" | lindera私名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシは助詞,係助詞,*,*,*,*,は,ハ,ワ時給名詞,一般,*,*,*,*,時給,ジキュウ,ジキュー1000UNK円名詞,接尾,助数詞,*,*,*,円,エン,エンです助動詞,*,*,*,特殊・デス,基本形,です,デス,デス。記号,句点,*,*,*,*,。,。,。EOS

$ echo "私は時給１０００円です。" | lindera私名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシは助詞,係助詞,*,*,*,*,は,ハ,ワ時給名詞,一般,*,*,*,*,時給,ジキュウ,ジキュー１名詞,数,*,*,*,*,１,イチ,イチ０名詞,数,*,*,*,*,０,ゼロ,ゼロ０名詞,数,*,*,*,*,０,ゼロ,ゼロ０名詞,数,*,*,*,*,０,ゼロ,ゼロ円名詞,接尾,助数詞,*,*,*,円,エン,エンです助動詞,*,*,*,特殊・デス,基本形,です,デス,デス。記号,句点,*,*,*,*,。,。,。EOS

Since full-width numbers are already registered in the morphological dictionary (IPADIC), each number becomes a single token, so a full-width１０００ cannot be searched for1000.
For this reason, it is common for search engines that handle Japanese to perform character normalization before tokenization.

If possible, I would like you to consider a way to perform character normalization before tokenization.
Thanks!

You must be logged in to vote

0 replies

Comment options

mosuka
Oct 17, 2022

@ManyTheFish
This is a slide that I created previously that explains how tokenizer works in Japanese search engines for software engineers who are not familiar with the Japanese language.

I hope it will be of some help to you.

https://speakerdeck.com/mosuka/the-importance-of-morphological-analysis-in-japanese-search-engines

You must be logged in to vote

0 replies

Comment options

miiton
Mar 27, 2023

I have published a simple application that I had made to confirm that Meilisearch works in Japanese.

https://meilisearch-example-jp.miiton.dev/

You must be logged in to vote

0 replies

Comment options

ManyTheFish
Apr 26, 2023
Collaborator Author

Hello people!
I've been putting this message off for a long time, I'm sorry.
I want to make a complete state of the current Japanese Language support in Meilisearch.

The current behavior

Language Detection

Today, we are usingwhatlang-rs to detect the Script and the Language in a text, Language detection is really important for Japanese Language support mainly to make the difference with the Chinese Language when only Kanjis are used in a text, for example, a small title or a query.

Segmentation

To segment Japanese text, we are currently usingLindera based on a Viterbi algorithm using the IPA dictionaries. Thanks to@mosuka for maintaining it. (a small explanation of Japanese segmentation)

Normalization

So far, we only normalize Japanese characters by replacing them with their decomposed compatible form, to give an example, half-width kanas are converted into kanas. To know more about this, I put below some documentation about it below:
https://unicode.org/reports/tr15/

The remaining issues we should tackle in the future

The Language detection library we are using is efficient in detecting the Script used, however, it's not sufficient to make the difference between Japanese and Chinese when only Kanjis are used in a text. There are some things that could be done to enhance this:
a) Contribute to thewhatlang-rs library, afirst issue has been created by@miiton that could help, however, if someone sees a better approach, don't hesitate to suggest it.
b) Replace the Library, here I didn't make any comparisons with other ones.
c) Add a configuration in Meilisearch allowing the user to authorize some Language to be detected or not.
The compatibility decomposition normalization is made after segmentation, however,@mosuka suggested finding a way to put it before,here is the link to the comment.
We had a discussion with@mosuka about converting all the Katakana characters into Hiragana characters during the normalization, this would be able to raise the recall, by allowing any form of a word to find the other one (ダメ <-> だめ). This feature is currently deactivated in Meilisearch buthere is the link to the discussion.

Prototypes

There is a prototype of Meilisearch deactivating completely the Chinese support, this way we avoid Language detection mistakes, in addition, this prototype activates the katakana-to-hiragana conversion, if you want to try this prototype I put the link to it:
meilisearch/meilisearch#3588

Thanks!

You must be logged in to vote

0 replies

Comment options

miiton
May 14, 2023

Handling of Proper Nouns in Japanese

@ManyTheFish

Is the issue of not being able to search for proper nouns that are not in ipadic already being discussed, such as Chinese language support, or etc?

ref:misskey-dev/misskey/issues/10845

Target Contents

id	msg
1	Misskeyはしゅいろさんが開発しています
2	Misskeyはしゅいろさんがつくったのだよ

# echo "Misskeyはしゅいろさんがつくったのだよ" | linderaMisskey UNKは      助詞,係助詞,*,*,*,*,は,ハ,ワし      動詞,自立,*,*,サ変・スル,連用形,する,シ,シゅいろさんがつくったのだよ      UNK# echo "Misskeyはしゅいろさんが開発しています" | linderaMisskey UNKは      助詞,係助詞,*,*,*,*,は,ハ,ワし      動詞,自立,*,*,サ変・スル,連用形,する,シ,シゅいろさんが    UNK開発    名詞,サ変接続,*,*,*,*,開発,カイハツ,カイハツし      動詞,自立,*,*,サ変・スル,連用形,する,シ,シて      助詞,接続助詞,*,*,*,*,て,テ,テい      動詞,非自立,*,*,一段,連用形,いる,イ,イます    助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス

Search Word:`しゅいろ`

# echo "しゅいろ" | linderaし      動詞,自立,*,*,サ変・スル,連用形,する,シ,シゅいろ  UNK

Results

❌Meilisearch (v1.1.1 & prototype-japanese-2)

id:2 no hits

# curl -s -H "Authorization: Bearer hoge" -H "Content-Type: application/json" -X POST localhost:7700/indexes/hogehoge/search -d '{"q":"しゅいろ","attributesToHighlight":["msg"]}' | jq -r '.hits[] | [.id, ._formatted.msg] | @tsv'1       Misskeyはし<em>ゅいろ</em>さんが開発<em>し</em>ています

⭕Algolia

⭕Manticore Search

mysql> CREATE TABLE notes(id bigint, msg text) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk';mysql> INSERT INTO notes(id, msg) VALUES (1, 'Misskeyはしゅいろさんが開発しています');mysql> INSERT INTO notes(id, msg) VALUES (2, 'Misskeyはしゅいろさんがつくったのだよ');mysql> SELECT id, msg, highlight() FROM notes WHERE match('しゅいろ');+------+------------------------------------------------------+---------------------------------------------------------------------------+| id   | msg                                                  | highlight()                                                               |+------+------------------------------------------------------+---------------------------------------------------------------------------+|    2 | Misskeyはしゅいろさんがつくったのだよ                | Misskeyは<b>しゅいろ</b>さんがつくったのだよ                              ||    1 | Misskeyはしゅいろさんが開発しています                | Misskeyは<b>しゅいろ</b>さんが開発<b>し</b>て<b>い</b>ます                |+------+------------------------------------------------------+---------------------------------------------------------------------------+

You must be logged in to vote

28 replies

Comment options

mosuka May 25, 2023

@miiton @ManyTheFish
Thanks for the explanation.
It seems to me that this problem could be solved if we could use UniDic to have consistent tokenization for searching and indexing, and also relax the limit on the number of words that can be registered for a multi-word synonyms.
What do you think?

Comment options

mosuka Jun 14, 2023

@ManyTheFish
I checked how much the binary size of lindera-cli changes between IPADIC and UniDic.
UniDic is about 30MB larger than IPADIC when built in release mode.
Nevertheless, I would like to switch to UniDic because of the expected improvement in search quality in Japanese, what do you think?

$ cargo build --features=ipadic,ipadic-compress --release$ ls -alh target/release/lindera-rwxr-xr-x 2 minoru minoru 15M Jun 12 21:51 target/release/lindera

$ cargo build --features=unidic,unidic-compress --release$ ls -alh target/release/lindera-rwxr-xr-x 2 minoru minoru 48M Jun 12 21:54 target/release/lindera

Comment options

ManyTheFish Jun 15, 2023
Collaborator Author

Hello@mosuka,
You can add the UniDic implementation to Charabia under a feature flag, then use thejapanese feature flag as a "meta" flag for the Japanese default behavior like:

# allow japanese specialized tokenizationjapanese = ["japanese-segmentation-unidic"]japanese-segmentation-ipadic = ["lindera-tokenizer/ipadic","lindera-tokenizer/ipadic-compress"]japanese-segmentation-unidic = ["lindera-tokenizer/unidic","lindera-tokenizer/unidic-compress"]japanese-transliteration = ["dep:wana_kana"]

What do you think? 🙂

Comment options

mosuka Jun 15, 2023

That's a good idea.
I'll send a pull request this weekend.
Thanks! 😃

Comment options

mosuka Jun 17, 2023

@ManyTheFish
I sent a pull request.
meilisearch/charabia#218

Thanks.

Comment options

ManyTheFish
Jul 31, 2023
Collaborator Author

Hello everyone 👋

An update on Meilisearch and the Japanese support

New release V1.3 🦁

v1.3 has been released today 🦁 including a change in the Japanese segmentation, Meilisearch now relies on UniDic instead of IPADIC to segment Japanese words which should increase the amount of document retrieved by Meilisearch.

We still encounter difficulties when a dataset contains small documents with kanji-only fields, if you don't manage to retrieve documents containing kanji-only fields with the official Meilisearch version, please try theJapanese specialized docker image that deactivates other Language support.

A preview of V1.4 👀

We just released a 🧪 prototype that allows the users to customize how Meilisearch tokenizes documents and queries, and we'd love your feedback.
Additionally to the already existingstopWords andsynonyms this prototype provides:

nonSeparatorTokens: allows removing some tokens from the default list of separators
separatorTokens: allows to add some tokens to the list of separators
dictionary: allows to override the word segmentation on the list of words defined in the dictionary

How to get the prototype?

Using docker, use the following command:

docker run -p 7700:7700 -v $(pwd)/meili_data:/meili_data getmeili/meilisearch:prototype-tokenizer-customization-2

From source, compile Meilisearch on theprototype-tokenizer-customization-2 tag

How to use the prototype?

You can find some examples below, or look at the originalPR for more info.

We know that thenonSeparatorTokens andseparatorTokens aren't as useful in Japanese as in other languages, however, thedictionary setting should allow defining Japanese proper nouns, company names, or any specific words to better tokenize documents related to a specific use case and retrieve them easily!

⚠️ We do NOT recommend using this prototype in production. This is for test purposes only.

Feedback and bug reporting when using this prototype are encouraged! Thanks in advance for your involvement. It means a lot to us ❤️

You must be logged in to vote

0 replies

Comment options

miiton
Sep 11, 2023

Facet Search is not working as I expected in Japanese.

I am trying Facet Search on aJapanese demo site I have published, but it doesn't seem to work the way I want it to.

I am trying to narrow down the prefectures... the example of Osaka-fu is easy to understand.

( Meilisearch version:prototype-japanese-5 )

If I type in`大阪`, it returns`大分県, 大阪府`.

I think it is because of the ambiguous search that大分県 is hit, but I would like the results to be returned in the order of大阪府, 大分県 at least.

If I type in`大阪府`, it returns`京都府, 大分県, 大阪府`.

I think it is a little strange for a "filter" to increase the number of results even though the number of input characters is increased.

You must be logged in to vote

0 replies

Comment options

hktang
Nov 7, 2023

Hi! Thank you guys for your fantastic work on improving Japanese support.

I tried the Docker imagemeilisearch/prototype-japanese-7, which works really well in my case (Drupal search API + Meilisearch backend).

I have two questions:

I noticed the image is numbered 0-7. Where can I find information about new releases so I can use the latest one?
Is there a configuration option in Meilisearch to explicitly set the fallback tokenizer to the Japanese one?

My apologies if this is not the right place to ask. Thank you for any input!

You must be logged in to vote

2 replies

Comment options

ManyTheFish Nov 7, 2023
Collaborator Author

Hello@hktang,
First of all, if you have any questions, suggestions, or relevancy issues with the Meilisearch Japanese support, this discussion is the right place. 👍

I noticed the image is numbered 0-7. Where can I find information about new releases so I can use the latest one?

Yes, sorry, the link to the dedicated PR with all the released versions is below:
meilisearch/meilisearch#3882

I will link this PR in a troubleshooting section here.

Is there a configuration option in Meilisearch to explicitly set the fallback tokenizer to the Japanese one?

Not so far, but there is a feature request to let the users choose the Language used on each index, I put the link below:
https://github.com/orgs/meilisearch/discussions/702

don't hesitate to react or comment in the discussion explaining your use case, this helps us in our prioritization process.☺️

Don't hesitate to ask more questions if you need them,
See you! 👋

Comment options

hktang Nov 8, 2023

Hi@ManyTheFish great! Thank you so much for the information. This is exactly what I was looking for.
Sure, I definitely look forward to having the language selector feature. Will give it a bump!

Comment options

bedus-creation
Nov 9, 2023

Thanks for the Great Tools.
We were experimenting with this getmeili/meilisearch:prototype-japanese-7 image in our Team. And It seems like it still has some incorrect results.

My search query was '大学' and it is showing results for '大垣' as well which doesn't make sense.

You must be logged in to vote

2 replies

Comment options

mosuka Nov 9, 2023

Hi@bedus-creation ,
This may be due to typo tolerance.
In Japanese, a single letter can have a completely different meaning in some cases, so unintended documents may appear in the search results.

You may be able to solve the problem by changing the setting.

https://blog.meilisearch.com/typo-tolerance/amp/

Comment options

ManyTheFish Nov 13, 2023
Collaborator Author

Hi@bedus-creation and@mosuka,
There is another feature that could cause this behavior: thematchingStrategy, this feature allows Meilisearch to remove some token if there is not enough results to show, but you can easily deactivate this feature by changing the matchingStrategy directly in the search request:

$ curl \  -X POST 'http://localhost:7700/indexes/movies/search' \  -H 'Content-Type: application/json' \  --data-binary '{    "q": "大学",+   "matchingStrategy": "all"  }'

Don't hesitate to come back to us if none of these solutions work, and thank you@mosuka, for your help,
see you!

Comment options

tec-koshi
Jan 12, 2024

こんにちは！
以下のような
meilisearch:prototype-japanese-7
https://hub.docker.com/layers/getmeili/meilisearch/prototype-japanese-7/images/sha256-0cbcaafc43d3db7d7e934e4d1340c18312b775e4426db75c46bd0d580f7b9cfd?context=explore
Docker Imageの日本語特化検索機能はMeilisearch Cloudでは利用されているのでしょうか？
何がしたいかと言いますと、Meilisearch Cloudでは、こちらのDocker Imageのように日本語に特化した検索はできますでしょうか？

Hello!
like below
meilisearch:prototype-japanese-7
https://hub.docker.com/layers/getmeili/meilisearch/prototype-japanese-7/images/sha256-0cbcaafc43d3db7d7e934e4d1340c18312b775e4426db75c46bd0d580f7b9cfd?context=explore
Is Docker Image's Japanese-specific search function used in Meilisearch Cloud?
What I want to do is, with Meilisearch Cloud, is it possible to search specifically for Japanese like this Docker Image?

You must be logged in to vote

4 replies

Comment options

miiton Jan 12, 2024

@tec-koshi
ref:https://twitter.com/StriftCodes/status/1732376643073724511

可能というお話は見ましたが、Cloudサポートに問い合わせるのが確実です。

I have been told that it is possible, but I think it is best to contact Cloud Support to be sure.

Comment options

tec-koshi Jan 15, 2024

@miiton
回答ありがとうございます。
Cloudサポートに問い合わせてみます。
ちなみにでしたが、Meilisearchの問い合わせ先を見つけることができずにDiscordにて問い合わせしました。
クラウドサポートの先知っておられましたら教えていただきたかったです。
よろしくお願いします。

Comment options

miiton Jan 15, 2024

Discordで大丈夫ですよ 👍

ref:https://help.meilisearch.com/en/article/where-can-i-find-support-1qgucwt/

Comment options

tec-koshi Jan 15, 2024

ありがとうございます！

Comment options

ManyTheFish
Jan 15, 2024
Collaborator Author

Hello All,
The Japanese Language-specialized docker image, up to date with the last Meilisearch version (v1.6.0), has been released:

$ docker pull getmeili/meilisearch:prototype-japanese-9

If you want more information about the last release, I put below the link to it:
https://github.com/meilisearch/meilisearch/releases/tag/v1.6.0

Below is the PR with all the Japanese Language-specialized docker images:
meilisearch/meilisearch#3882

See you!

You must be logged in to vote

0 replies

Comment options

kamiyn
May 21, 2024

Specify an user dictionary for Japanese:

I am impressed with the creation of such a wonderful search engine! 😀

Japanese does not perform word segmentation using spaces, so a good dictionary is necessary to determine word boundaries.

Particularly for new words or proper nouns, registration of words in a user dictionary may be required.

Lindera has a feature to specify a user dictionary,
and I am experimenting to specify a user dictionary via environment variable.
meilisearch/charabia@main...kamiyn:charabia:users/kamiyn/japanese_user_dictionary

If this is appropriate, I would like to make PR with it.

You must be logged in to vote

2 replies

Comment options

ManyTheFish May 27, 2024
Collaborator Author

Hello@kamiyn, thank you for your suggestion,
To be honest with you, I would gladly accept a contribution on Charabia's side to implement this option; however, adding this in Meilisearch needs more work in terms of API definition.
This means that even if we integrate your work on Charabia, the chances of being fully integrated into Meilisearch are very small.

Today, there is adictionary feature in Meilisearch allowing you to precise some specific words related to your dataset, this is not as advanced as Lindera's one, but it may be fit your needs, let me know about it.

See you!

Comment options

kamiyn May 28, 2024

Hello@ManyTheFish, thank you for your response.

I also recognize that a lot of work is needed to make custom dictionaries easily accessible to end-users in the MeiliSearch services.

On the other hand, it is common to specify parameters via environment variables when using Docker, so I thought it would be beneficial.

Especially if it is provided as an explanation of how to use the Japanese-specific Docker image, I believe that the number of users via Docker will increase.

In that sense, this modification into Charabia seems to be an issue of how the Japanese-specific Docker image is positioned.

Comment options

ManyTheFish
May 27, 2024
Collaborator Author

Hello All,
The Japanese Language-specialized docker image, up to date with the last Meilisearch version (v1.8.1), has been released:

$ docker pull getmeili/meilisearch:prototype-japanese-11

If you want more information about the last release, I put below the link to it:

Below is the PR with all the Japanese Language-specialized docker images:
meilisearch/meilisearch#3882

See you!

You must be logged in to vote

0 replies

Comment options

bedus-creation
Jul 5, 2024

Thanks for the great support & great software

We started usinggetmeili/meilisearch:prototype-japanese-11 in production, however I found that the default configuration doesn't seems to support hiragana and katakana search, for example, searchingニホン should also returnにほん、.

Is there anyway to get back this features ?

You must be logged in to vote

9 replies

Comment options

ManyTheFish Oct 2, 2024
Collaborator Author

Nice!
The feature can be reactivated for Meilisearch v1.12 so :)

Thank you@PSeitz

Comment options

tats-u Oct 14, 2024

meilisearch/charabia#312

Comment options

bedus-creation Feb 3, 2025

@miiton @ManyTheFish Thanks for the response, I finally tested the hiragana and katakana and it's working fine. But I found an issue with half-width and full-width for example: searching anapple is not returningＡｐｐｌｅ , is there any recommended way to fix it ?

Comment options

kamiyn Feb 3, 2025

I usually use
Unicode Normalization Form KC (NFKC)
against the issue with half-width and full-width
https://www.unicode.org/reports/tr15/#Norm_Forms

For example, it converts "Ａｐｐｌｅ" to "Apple", "ｱｯﾌﾟﾙ" to "アップル".

NFKC can be beneficial for languages beyond Japanese.

Comment options

ManyTheFish Feb 3, 2025
Collaborator Author

We need to check the tokenizer used by Meilisearch, it's a bit weird that it doesn't work if NFKC normalizes this derivation 🤔

In Charabia we are using NFKD to normalize this, which should be basically the same,
I've created an issue on the repository to check if it works.

meilisearch/charabia#327

Comment options

tats-u
Oct 3, 2024

For those who try Meilisearch for the first time: you should try v1.10.2 (or any later versions) with"locales": ["jpn"] first today.prototype-japanese-* images are based on older versions.
If you're usingprototype-japanese-12 now for example:

-12 →-184 → (dump migration) →-13 → (dump migration) →v1.10.2(→(dump migration)→1.11→…)

You must be logged in to vote

0 replies

Comment options

mosuka
Nov 23, 2024

@ManyTheFish

I would like to enable Lindera's character filters and token filters in Charabia.
For example, if I use Lindera's RegexCharacterFilter or JapaneseStopTagsFilter, the correct offset and the other values are recorded in Lindera's Token, but only Lindera's token.text is passed to Charabia.

I am thinking that if I don't create a Charabia token at this time based on values other than text recorded by Lindera's token, the term will be out of position due to highlighting, etc.

I made it possible to describe Lindera settings in YAML, we would be very happy if this could be accomplished, allowing Japanese-specific string handling to be configured by out of Meilisearch using environment variable.

Is there a better way to do this?

https://github.com/meilisearch/charabia/blob/467b3e4e767c58d54a8f2974813e3ed6ac6c9795/charabia/src/segmenter/japanese.rs#L33-L36

You must be logged in to vote

1 reply

Comment options

ManyTheFish Dec 3, 2024
Collaborator Author

Hey@mosuka,

I think we could add ametadata field in thecharabia::Token, that could be an enum:

typeLinderaMetadata<'o> =Vec<Cow<'o,str>>;enumMetadata<'o>{#[default]None,Lindera(LinderaMetadata<'o>),}

Then,the segmenter trait should implement a newsegment_str_with_meta method returning(&'o str, Metadata):

pubtraitSegmenter:Sync +Send{/// Segments the provided text, creating an Iterator over `&str`.fnsegment_str<'o>(&self,s:&'ostr) ->Box<dynIterator<Item =&'ostr> +'o>;/// Segments the provided text, creating an Iterator over `Token`.////// This method uses `segment_str` by defaultfnsegment_str_with_meta<'o>(&self,s:&'ostr) ->Box<dynIterator<Item =(&'ostr,Metadata<'o>)> +'o>{Box::new(self.segment_str(s).map(|s|(s,Metadata::None)))}}

But Lindera would override the implementation by itself and add the Metadata.

After this, we just have to. makeSegmentedTokenIter usesegment_str_with_meta instead_ofsegment_str.

It's a bit of work but it's totally acceptable for Charabia and seems future proof for other tokenizers 🤔

Comment options

jimexist
Feb 26, 2025

hi all, thanks for the great support for this (niche) direction, but i'd like to know if there's any plan to rebase/bump the Japanese supporttag/branch against the newest developments, e.g. given thenew indexer is much faster?

You must be logged in to vote

1 reply