mesolitica/malaysian-dataset

We gather Malaysian dataset! https://malaysian-dataset.readthedocs.io/

Folders and files

.github
chatbot
corpus
crawl
dictionary
docs
embedding
keyphrase
knowledge-graph
lexicon
llm-benchmark
llm-instruction
news
nlq
normalization
ocr
paraphrase
parsing
phoneme
question-answer
segmentation
sentiment
speech-to-text-semisupervised
speech-to-text
speech
spelling-correction/neuspell
summarization
tagging
tatabahasa
text-similarity
text-to-speech
tokenization/syllable
translation
true-case
vlm
.gitignore
.readthedocs.yml
README.rst
malay-dataset1.png
malaysian-dataset.png
wordcloud.png

Malaysian-Dataset, We gather Malaysian dataset!

We are working to improve the documentation. All data is pushed to https://huggingface.co/mesolitica and https://huggingface.co/malaysia-ai

Documentation

Proper documentation is available at https://malaysian-dataset.readthedocs.io

How do we gather the datasets?

Crawling

Contributors have crawled Malaysian websites heavily. You can check the full list of crawled websites at https://github.com/users/huseinzol05/projects/1

Social media

We capture most live data from Twitter, Facebook and Instagram using crawlers, so we simply search it with Elasticsearch queries.
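As a rough illustration of what such a search might look like, the sketch below builds an Elasticsearch query DSL body for a keyword within a date range. The field names `text` and `created_at` and the helper `build_search_query` are assumptions made for this example; the actual index mapping used by the crawlers is not documented here.

```python
import json


def build_search_query(keyword, start_date, end_date, size=100):
    """Build a query DSL body: full-text match on a keyword, filtered by date.

    The field names "text" and "created_at" are hypothetical; replace them
    with the fields of your own index.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match on the post content.
                "must": [{"match": {"text": keyword}}],
                # Unscored date filter, inclusive on both ends.
                "filter": [
                    {"range": {"created_at": {"gte": start_date, "lte": end_date}}}
                ],
            }
        },
        # Newest posts first.
        "sort": [{"created_at": {"order": "desc"}}],
    }


if __name__ == "__main__":
    body = build_search_query("banjir", "2020-01-01", "2020-12-31")
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the request body of a search call with whichever Elasticsearch client you use.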

Translation

We use Google Translate.
We use LLMs, including ChatGPT3.5, ChatGPT4, Mixtral and Llama 3 70B.
We use the Malaya translation model, https://huggingface.co/mesolitica/translation-t5-small-standard-bahasa-cased-v2

Semisupervised

Teacher-student

Manually label a small sample and train a base (teacher) model on it.
Use the trained base model to predict labels for a larger sample, then retrain the next student model on the high-confidence labelled data.
Repeat.
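The loop above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in this repository: the toy word-count classifier, the confidence formula, the `threshold` of 0.7 and the function names `train`, `predict` and `self_train` are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter, defaultdict


def train(samples):
    """Fit a toy bag-of-words model: one word counter per label."""
    counts = defaultdict(Counter)
    for text, label in samples:
        counts[label].update(text.lower().split())
    return counts


def predict(model, text):
    """Return (label, confidence); confidence is the winner's share of the total score."""
    words = text.lower().split()
    # Add 1 to every label score so the total is never zero.
    scores = {label: 1 + sum(c[w] for w in words) for label, c in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def self_train(labelled, unlabelled, threshold=0.7, rounds=3):
    """Teacher-student loop: pseudo-label, keep confident predictions, retrain."""
    model = train(labelled)  # base (teacher) model
    pool = list(unlabelled)
    for _ in range(rounds):
        confident, remaining = [], []
        for text in pool:
            label, conf = predict(model, text)
            (confident if conf >= threshold else remaining).append((text, label))
        if not confident:
            break  # nothing new to learn from
        labelled = labelled + confident
        pool = [text for text, _ in remaining]
        model = train(labelled)  # next student model
    return model, labelled


if __name__ == "__main__":
    seed = [("sedap sangat makanan ini", "positive"),
            ("teruk sangat servis ini", "negative")]
    model, data = self_train(seed, ["makanan sedap", "servis teruk", "entah"])
    for text, label in data:
        print(label, "<-", text)
```

In practice the toy classifier would be replaced by a real model, and the threshold tuned so that only reliable pseudo-labels are added in each round.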

LLM

We generate data using ChatGPT3.5, ChatGPT4, Mixtral and Llama 3 70B.

Notes

If mp.py is missing, get it at https://gist.github.com/huseinzol05/98974ae8c6c7a65d4bc0af9f5003786a
If any Python scripts are missing, please contact me as soon as possible or create an issue.
Please at least email us before distributing this data. Remember that we are giving away all this hard work for free.
What you see is just the data; nobody sees how much it cost us to make it public.

Suggestion

Feel free to contact me to request a new dataset.
Feel free to open an issue if a dataset link returns a forbidden error; sometimes I forget to make it public.

Non-commercial Usage

Much of the data here was semi-supervised, translated, tagged or decoded using third-party software, for example Google Translate and Google Speech. To avoid any future complications, it is better not to use this data for commercial purposes; use for certain research purposes is allowed.

Acknowledgement

Thanks to Im Big, LigBlou, Mesolitica and KeyReply for sponsoring AWS, Google and private cloud resources to deploy distributed crawlers.

