MindXpansion/language_models_from_scratch (Public)

Forked from sourabpanch7/language_models_from_scratch

Repo for simple small language training from scratch

Folders and files

config
resources/fine_tuning
src
test
.gitignore
PEFT.ipynb
README.md
requirements.txt

Small Language Model

A repository to build a Small Language Model from scratch, inspired by QWEN 3 and GPT 2.
This project focuses on implementing and experimenting with different tokenizers and data preprocessing strategies to prepare text data for training.

🚀 Tech Stack

Python
HuggingFace

✨ Features

📂 Data Load & Preprocessing
Efficient pipeline to load raw text data and prepare it for training.
🔎 Data Diversity Check (MinHash)
Uses MinHash to detect near-duplicate samples and ensure dataset diversity.
🔤 Custom BPE Tokenizer
Implements a Byte-Pair Encoding (BPE) tokenizer from scratch for experimentation.
📘 BLT Tokenizer
Implementation of Facebook Research's BLT Tokenizer.
📝 SentencePiece Tokenizer
Integration of Google's SentencePiece Tokenizer.
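
To illustrate the idea behind the custom BPE tokenizer listed above, here is a minimal sketch of BPE training. It is not the code in `src`: it works on characters rather than bytes for readability, and the function names (`train_bpe`, `encode`, and the helpers) are hypothetical.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself
        # so that training is deterministic.
        best = max(counts, key=lambda p: (counts[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)
```

Calling `train_bpe(text, num_merges)` returns the ordered merge rules, and `encode(word, merges)` applies them to a new word. A byte-level variant would start from `word.encode("utf-8")` instead of characters.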

⚙️ Setup & Installation

Clone the repository:

git clone https://github.com/ltrc/mini-project-slm-sourabiiit7.git
cd small-language-model

Install dependencies:
```
pip install -r requirements.txt
```
Update configuration: modify parameters as needed in:
```
config/config.json
```
Run the project:
```
python src/main.py
```
To perform PEFT on Google Colab:
```
Run the cells in PEFT.ipynb
```
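
The README does not list the parameters stored in `config/config.json`. As a rough sketch of how such a file could be read with defaults, assuming it holds a single JSON object, something like the following would work. The helper `load_config` and the keys used in the usage note are hypothetical, not taken from the repository.

```python
import json
from pathlib import Path


def load_config(path="config/config.json", defaults=None):
    """Read a JSON config file and overlay it on optional default values."""
    config = dict(defaults or {})
    with Path(path).open(encoding="utf-8") as fh:
        loaded = json.load(fh)
    if not isinstance(loaded, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")
    config.update(loaded)  # values from the file override the defaults
    return config
```

For example, `load_config(defaults={"vocab_size": 8000})` would return the file's settings, falling back to the hypothetical `vocab_size` default only when the file omits that key.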

📚 References

🛠️ Tools Used:

ChatGPT
GitHub Copilot
Google Translate

📚 Datasets used

Language	Dataset	Train	Test
`Bangla`	`HuggingFaceFW/fineweb-2 ben_Beng`	`split='train'`	`split='test'`
`English`	`HuggingFaceFW/fineweb-edu`	`split='train'`	`split='train'`
`Konkani`	`HuggingFaceFW/fineweb-2 gom_Deva`	`split='train'`	`split='test'`
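
The table above can be turned into `datasets.load_dataset` calls roughly as follows. This is a sketch under two assumptions: that the second token in each Dataset cell (for example `ben_Beng`) is the Hugging Face configuration name, and that streaming is acceptable so the corpus is not downloaded in full. The names `DATASETS`, `load_kwargs`, and `load_split` are hypothetical.

```python
# Mirrors the table above: (repository path, config name, train split, test split).
DATASETS = {
    "Bangla": ("HuggingFaceFW/fineweb-2", "ben_Beng", "train", "test"),
    "English": ("HuggingFaceFW/fineweb-edu", None, "train", "train"),
    "Konkani": ("HuggingFaceFW/fineweb-2", "gom_Deva", "train", "test"),
}


def load_kwargs(language, which="train", streaming=True):
    """Build keyword arguments for `datasets.load_dataset` from the table."""
    if which not in ("train", "test"):
        raise ValueError("which must be 'train' or 'test'")
    path, name, train_split, test_split = DATASETS[language]
    split = train_split if which == "train" else test_split
    return {"path": path, "name": name, "split": split, "streaming": streaming}


def load_split(language, which="train", streaming=True):
    """Load one split; needs the `datasets` package and network access."""
    from datasets import load_dataset  # imported lazily so the table is usable offline

    return load_dataset(**load_kwargs(language, which, streaming))
```

With streaming enabled, `load_split("Konkani", "test")` returns an iterable dataset whose records can be consumed one at a time, which suits a storage-constrained environment.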

📊 Data Statistics

Language	Similarity Score	Token Count
`Bangla`	`0.414`	`500M`
`English`	`0.000`	`200M`
`Konkani`	`0.117`	`100K`
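
The README does not say how the similarity scores above are aggregated. As background, the sketch below shows how MinHash estimates the Jaccard similarity between two texts, using only the standard library. The shingle size, the number of hash functions, and all function names are assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams in `text` after whitespace normalization."""
    text = " ".join(text.split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(items, num_perm=64):
    """For each of `num_perm` salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        item.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "little",
                )
                for item in items
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents are compared with `estimate_jaccard(minhash_signature(shingles(a)), minhash_signature(shingles(b)))`. Values near 1 indicate near-duplicates, and a larger `num_perm` reduces the variance of the estimate at the cost of more hashing.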

🛠️ Infra Used:

Google Colab Free Tier
Google Drive for storage
Google Translate

Languages

Python 66.5%
Jupyter Notebook 33.5%
