MindXpansion/language_models_from_scratch
A repository to build a Small Language Model from scratch, inspired by QWEN 3 and GPT 2.
This project focuses on implementing and experimenting with different tokenizers and data preprocessing strategies to prepare text data for training.
- Python
- HuggingFace
- 📂 Data Load & Preprocessing: Efficient pipeline to load raw text data and prepare it for training.
- 🔎 Data Diversity Check (MinHash): Uses MinHash to detect near-duplicate samples and ensure dataset diversity.
- 🔤 Custom BPE Tokenizer: Implements a Byte-Pair Encoding (BPE) tokenizer from scratch for experimentation.
- 📘 BLT Tokenizer: Implementation of Facebook Research's BLT Tokenizer.
- 📝 SentencePiece Tokenizer: Integration of Google's SentencePiece Tokenizer.
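To make the custom BPE idea concrete, here is a minimal character-level sketch of BPE training and encoding. It is not the repository's implementation: the function names (`merge_symbols`, `train_bpe`, `encode_word`), the whitespace pre-tokenization, and the tie-breaking rule are all assumptions chosen for illustration.

```python
from collections import Counter


def merge_symbols(symbols, pair):
    """Return a new symbol list with every adjacent occurrence of `pair` fused."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split text."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically larger
        # pair (an arbitrary but deterministic choice for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(merge_symbols(symbols, best))] += freq
        words = new_words
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols


if __name__ == "__main__":
    rules = train_bpe("low low low lower lowest", num_merges=2)
    print(rules)
    print(encode_word("lower", rules))
```

A production tokenizer would typically operate on bytes rather than characters and would handle special tokens and vocabulary serialization, which this sketch leaves out.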
Clone the repository:
git clone https://github.com/ltrc/mini-project-slm-sourabiiit7.git
cd small-language-model
Install dependencies:
pip install -r requirements.txt
Update configuration: modify parameters as needed in config/config.json
Run the project:
python src/main.py
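The README does not show what `config/config.json` contains, so the following is only a sketch of how a script such as `src/main.py` might read it. The key names in `DEFAULTS` (`tokenizer`, `vocab_size`, `language`) are hypothetical placeholders, not the repository's actual schema.

```python
import json
from pathlib import Path

# Hypothetical defaults: these keys are illustrative and are NOT taken
# from the repository's real config file.
DEFAULTS = {"tokenizer": "bpe", "vocab_size": 8000, "language": "English"}


def load_config(path="config/config.json"):
    """Return DEFAULTS overridden by whatever the JSON file provides."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        with p.open(encoding="utf-8") as f:
            cfg.update(json.load(f))
    return cfg


if __name__ == "__main__":
    print(load_config())
```

Merging over defaults keeps the script runnable when a key is missing from the file; whether the project does this is not stated.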
To perform PEFT on Google Colab:
Run the cells in PEFT.ipynb
- Facebook Research BLT
- Google SentencePiece
- MinHash (Datasketch library)
- https://github.com/rasbt/LLMs-from-scratch
- ChatGPT
- GitHub Copilot
- Google Translate
| Language | Dataset | Train | Test |
|---|---|---|---|
| Bangla | HuggingFaceFW/fineweb-2 ben_Beng | split='train' | split='test' |
| English | HuggingFaceFW/fineweb-edu | split='train' | split='train' |
| Konkani | HuggingFaceFW/fineweb-2 gom_Deva | split='train' | split='test' |
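The table above maps onto the HuggingFace `datasets` API roughly as follows. This is a sketch, not the repository's loader: the `DATASETS` dictionary and the `load_split` helper are hypothetical names, and streaming mode is an assumption made so that nothing is downloaded in full.

```python
# Mirror of the dataset table. "name" is the dataset configuration (subset);
# None selects the default configuration.
DATASETS = {
    "Bangla": {"path": "HuggingFaceFW/fineweb-2", "name": "ben_Beng",
               "train": "train", "test": "test"},
    "English": {"path": "HuggingFaceFW/fineweb-edu", "name": None,
                "train": "train", "test": "train"},
    "Konkani": {"path": "HuggingFaceFW/fineweb-2", "name": "gom_Deva",
                "train": "train", "test": "test"},
}


def load_split(language, which="train", streaming=True):
    """Load the train or test split listed in the table for `language`."""
    # Imported lazily so the mapping above can be inspected without the
    # library installed (requires `pip install datasets` to actually load).
    from datasets import load_dataset

    spec = DATASETS[language]
    return load_dataset(spec["path"], name=spec["name"],
                        split=spec[which], streaming=streaming)


if __name__ == "__main__":
    stream = load_split("Konkani", "test")
    print(next(iter(stream)))
```

For English the table lists `split='train'` in both columns, so the sketch reuses that split for testing. How the repository separates held-out English data, if it does, is not stated.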
| Language | Similarity Score | Token Count |
|---|---|---|
| Bangla | 0.414 | 500M |
| English | 0.000 | 200M |
| Konkani | 0.117 | 100K |
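For readers unfamiliar with MinHash, the sketch below shows how a similarity estimate between two texts can be computed. It is a pure-Python stand-in, not the Datasketch API the project lists, and it is only an assumption that the similarity scores above are Jaccard estimates of this kind. The word 3-gram shingling and the helper names are illustrative choices.

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of overlapping word k-grams."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_perm=128):
    """Keep the minimum hash of `items` under num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        # A distinct salt per seed gives a distinct hash function.
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                                salt=salt).digest(), "big")
            for s in items))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = minhash_signature(shingles("the quick brown fox leaps over the lazy dog"))
    print(estimate_jaccard(a, b))
```

Under this reading, a score near 0 means very few near-duplicate shingles were found and a score near 1 means heavy duplication. More hash functions (`num_perm`) reduce the variance of the estimate at the cost of more computation.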
- Google Colab Free Tier
- Google Drive for storage
- Google Translate