We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/
mesolitica/malaysian-dataset
Malaysian-Dataset, We gather Malaysian dataset!
We are doing our best to improve the documentation. All data is pushed to https://huggingface.co/mesolitica and https://huggingface.co/malaysia-ai
Proper documentation is available at https://malaysian-dataset.readthedocs.io
Contributors crawled Malaysian websites heavily; the full list of crawled websites is at https://github.com/users/huseinzol05/projects/1
- We capture most live data from Twitter, Facebook and Instagram using crawlers, so we simply search it with Elasticsearch queries.
- We use Google Translate.
- We use LLMs, including ChatGPT3.5, ChatGPT4, Mixtral, LLama3 70B.
- We use Malaya translation, https://huggingface.co/mesolitica/translation-t5-small-standard-bahasa-cased-v2
- Label a small sample under supervision, then train a base model.
- The trained base model predicts labels for larger samples; the next student models are retrained on the high-confidence labelled data.
- Repeat.
- Generate using ChatGPT3.5, ChatGPT4, Mixtral, LLama3 70B.
- If `mp.py` is missing, get it at https://gist.github.com/huseinzol05/98974ae8c6c7a65d4bc0af9f5003786a
- If any Python scripts are missing, please contact me ASAP or create an issue.
- Please at least email us before distributing this data. Remember that we are giving away all this hard work for free.
- What you see is just the data; nobody sees how much it cost us to make it public.
- Feel free to contact me to request a new dataset.
- Feel free to open an issue if a dataset link is forbidden; sometimes I forget to make it public.
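
The semi-supervised labelling steps listed above (label a small sample, train a base model, predict on larger samples, retrain on high-confidence labels, repeat) can be illustrated with a minimal sketch. This is not the code used in this repository: the nearest-centroid classifier, the function names `train_centroids`, `predict` and `self_train`, the toy 2-D features and the `0.9` confidence threshold are all assumptions chosen only to keep the example self-contained.

```python
import math


def train_centroids(samples):
    """Fit one centroid per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return (label, confidence) from a softmax over negative distances."""
    scores = {
        label: math.exp(-math.dist(features, centroid))
        for label, centroid in centroids.items()
    }
    total = sum(scores.values())
    label = max(scores, key=scores.get)
    return label, scores[label] / total


def self_train(labelled, unlabelled, threshold=0.9, max_rounds=5):
    """Grow the labelled set with high-confidence predictions, then retrain."""
    labelled = list(labelled)
    pool = list(unlabelled)
    model = train_centroids(labelled)  # base model from the small labelled sample
    for _ in range(max_rounds):
        confident, remaining = [], []
        for features in pool:
            label, confidence = predict(model, features)
            if confidence >= threshold:
                confident.append((features, label))
            else:
                remaining.append(features)
        if not confident:
            break  # nothing new to learn from
        labelled.extend(confident)
        pool = remaining
        model = train_centroids(labelled)  # next "student" model
    return model, labelled, pool


# Hypothetical toy data: two labelled points and three unlabelled ones.
seed = [((0.0, 0.0), "neg"), ((10.0, 10.0), "pos")]
unlabelled = [(0.5, 0.5), (9.5, 9.0), (5.0, 5.0)]
model, labelled, leftover = self_train(seed, unlabelled)
```

In this sketch, samples the model is unsure about stay in `leftover` instead of being forced into a class, which is the point of retraining only on high-confidence labels.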
A lot of the data here is semi-supervised / translated / tagged / decoded using third-party software, for example Google Translate and Google Speech. To avoid any future complications, it is better not to use this data for commercial purposes, though certain research purposes are allowed.
Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring AWS, Google and private cloud to deploy distributed crawlers.