mesolitica/malaysian-dataset

Malaysian-Dataset: we gather Malaysian datasets!

We are working to improve the documentation. All data is pushed to https://huggingface.co/mesolitica and https://huggingface.co/malaysia-ai

Documentation

Proper documentation is available at https://malaysian-dataset.readthedocs.io

How do we gather datasets?

Crawling

Contributors have crawled Malaysian websites extensively. The full list of crawled websites is at https://github.com/users/huseinzol05/projects/1

Social media

  1. We capture most live data from Twitter, Facebook and Instagram using crawlers, so we simply search it with Elasticsearch queries.
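
As a rough illustration of the kind of Elasticsearch query this implies, here is a minimal sketch that only builds a query body. The field names (`text`, `platform`, `created_at`) and the helper `build_search_body` are assumptions made for this example; the actual index schema is not described here.

```python
import json


def build_search_body(keyword, platform, since, size=100):
    """Build an Elasticsearch query DSL body (hypothetical field names)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match on the post content (assumed field: "text").
                "must": [{"match": {"text": keyword}}],
                # Exact filters do not affect relevance scoring.
                "filter": [
                    {"term": {"platform": platform}},
                    {"range": {"created_at": {"gte": since}}},
                ],
            }
        },
    }


body = build_search_body("banjir", "twitter", "2024-01-01")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the `_search` endpoint of whichever index holds the crawled posts.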

Translation

  1. We use Google Translate.
  2. We use LLMs, including ChatGPT3.5, ChatGPT4, Mixtral and Llama 3 70B.
  3. We use the Malaya translation model, https://huggingface.co/mesolitica/translation-t5-small-standard-bahasa-cased-v2

Semisupervised

Teacher-student

  1. Label a small sample by hand, then train a base model on it.
  2. Use the trained base model to predict labels for a larger sample, then retrain the next student model on the high-confidence labelled data.
  3. Repeat.
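
The loop above can be sketched in a few lines. This is a toy illustration only, not the code used for these datasets: the nearest-centroid "model", the confidence formula, the 0.8 threshold and all function names are assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def train_centroids(samples):
    """Toy model: one mean value per label (stand-in for a real classifier)."""
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(model, x):
    """Return (label, confidence); confidence is 1.0 when x sits on a centroid."""
    dists = {y: abs(x - c) for y, c in model.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    conf = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, conf


def teacher_student(labelled, unlabelled, threshold=0.8, rounds=3):
    model = train_centroids(labelled)  # step 1: base model on hand-labelled data
    train_set, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):  # step 3: repeat
        confident, remaining = [], []
        for x in pool:  # step 2: teacher labels the larger sample
            label, conf = predict(model, x)
            if conf >= threshold:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from
        train_set.extend(confident)
        model = train_centroids(train_set)  # retrain the next student
        pool = remaining
    return model, train_set, pool


labelled = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
unlabelled = [0.5, 2.0, 4.9, 8.0, 9.5]
model, train_set, pool = teacher_student(labelled, unlabelled)
print(len(train_set), pool)
```

In this run, the ambiguous point 4.9 never clears the threshold, so it stays unlabelled instead of adding noise to the training set.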

LLM

  1. Generate data using ChatGPT3.5, ChatGPT4, Mixtral and Llama 3 70B.

Notes

  1. If mp.py is missing, get it at https://gist.github.com/huseinzol05/98974ae8c6c7a65d4bc0af9f5003786a
  2. If any Python scripts are missing, please contact me ASAP or create an issue.
  3. Please at least email us before distributing this data. Remember that we are giving away all this hard work for free.
  4. What you see is just the data; nobody sees how much it cost us to make it public.

Suggestion

  1. Feel free to contact me to request a new dataset.
  2. Feel free to open an issue if a dataset link is forbidden; sometimes I forget to make it public.

Non-commercial Usage

A lot of the data here was semi-supervised, translated, tagged or decoded using third-party software, for example Google Translate and Google Speech. To avoid any future complications, it is better not to use this data for commercial purposes, though use for certain research purposes is allowed.

Acknowledgement

Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring AWS, Google and private cloud resources to deploy distributed crawlers.

