
Collection of Wongnai's datasets


wongnai/wongnai-corpus


This project is a collection of Wongnai's datasets, most of which are in Thai. We hope that these datasets will advance research in natural language processing (NLP), especially for the Thai language.

1. Search query dataset

There are 500,000 unique words extracted from search queries. These words were labeled by algorithms and by judges for a word segmentation task. Our segmentation criterion is to keep the longest possible food word as a single segment, in order to achieve the highest precision in the search system.
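To illustrate the "longest food word" criterion, here is a minimal sketch of greedy forward longest matching against a dictionary. It is not Wongnai's labeling algorithm (that is described in their blog post); the function name `longest_match_segment` and the three-word toy dictionary are assumptions made up for this example.

```python
def longest_match_segment(text, dictionary):
    """Greedy forward longest matching (illustrative only).

    At each position, take the longest dictionary word that starts there.
    Characters not covered by any dictionary word become single-character tokens.
    """
    if not dictionary:
        return list(text)
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: emit one character as its own token.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens


# Hypothetical toy dictionary: "rice", "fried rice", "shrimp".
rice = "ข้าว"
fried_rice = "ข้าวผัด"
shrimp = "กุ้ง"
food_words = {rice, fried_rice, shrimp}

# The query "fried rice shrimp" written without spaces, as Thai normally is.
text = fried_rice + shrimp
tokens = longest_match_segment(text, food_words)
# tokens == [fried_rice, shrimp]: the longer "fried rice" wins over "rice".
```

Preferring the longest match keeps a multi-word dish name together as one token instead of splitting it into shorter dictionary words.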

1.1 Files

  • search/labeled_queries_by_algo.txt : List of 500K words labeled by algorithms, which are described in detail in the blog post.

  • search/labeled_queries_by_judges.txt : List of 10K words labeled by judges following Wongnai's search criteria.

  • search/food_dictionary.txt : List of 400K food words used for labeling labeled_queries_by_algo.txt.

Please note that these words were collected from user-generated content (UGC), which might include some off-topic words.

1.2 Usage

  • You may use labeled_queries_by_algo.txt to train your own word segmentation model by splitting it into training and validation sets, and then evaluate your model with labeled_queries_by_judges.txt.
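The workflow above can be sketched as follows. This is a minimal example under stated assumptions: the helpers `read_lines`, `train_validation_split`, and `exact_match_accuracy` are hypothetical names, the 10% validation fraction is arbitrary, and the line format of the label files is not specified here, so each line is treated as an opaque example. Exact-match accuracy is only one possible metric.

```python
import random


def read_lines(path):
    """Read non-empty lines from a UTF-8 text file (line format is left uninterpreted)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def train_validation_split(lines, validation_fraction=0.1, seed=42):
    """Shuffle with a fixed seed and split into (train, validation) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def exact_match_accuracy(predicted, gold):
    """Fraction of examples whose predicted segmentation equals the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)


# Intended use (paths follow the file list above; parsing each line is up to you):
#   algo_lines = read_lines("search/labeled_queries_by_algo.txt")
#   train, validation = train_validation_split(algo_lines)
#   judge_lines = read_lines("search/labeled_queries_by_judges.txt")  # held-out test set
```

Fixing the seed makes the split reproducible, and keeping the judge-labeled file out of training leaves it as an independent test set.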

2. Review dataset

The review dataset contains restaurant reviews and their ratings. Ratings fall into 5 classes, from 1 to 5 stars.

2.1 Files

2.2 Usage

Wongnai data services
