Repo for simple small language training from scratch


MindXpansion/language_models_from_scratch

 
 


A repository for building a Small Language Model from scratch, inspired by Qwen3 and GPT-2.
This project focuses on implementing and experimenting with different tokenizers and data preprocessing strategies to prepare text data for training.

🚀 Tech Stack

  • Python
  • HuggingFace

✨ Features

  • 📂 Data Load & Preprocessing
    Efficient pipeline to load raw text data and prepare it for training.

  • 🔎 Data Diversity Check (MinHash)
    Uses MinHash to detect near-duplicate samples and ensure dataset diversity.

  • 🔤 Custom BPE Tokenizer
    Implements a Byte-Pair Encoding (BPE) tokenizer from scratch for experimentation.

  • 📘 BLT Tokenizer
    Implementation of Facebook Research's BLT Tokenizer.

  • 📝 SentencePiece Tokenizer
    Integration of Google's SentencePiece Tokenizer.
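
To make the BPE feature concrete, here is a minimal sketch of how byte-pair merges can be learned and applied. It is not the repository's implementation: the function names (`train_bpe`, `encode_word`), the whitespace pre-tokenization, and the character-level starting vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Assumption: words start as sequences of single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = list(word)
    for pair in merges:
        i = 0
        while i < len(symbols) - 1:
            if (symbols[i], symbols[i + 1]) == pair:
                symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
            else:
                i += 1
    return symbols


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(encode_word("lower", merges))
```

A production tokenizer would typically start from bytes rather than characters and map the merged symbols to integer ids; both are left out to keep the sketch short.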

⚙️ Setup & Installation

  1. Clone the repository:

    git clone https://github.com/ltrc/mini-project-slm-sourabiiit7.git
    cd mini-project-slm-sourabiiit7
  2. Install dependencies:

    pip install -r requirements.txt
  3. Update configuration: modify parameters as needed in:

    config/config.json
  4. Run the project:

    python src/main.py
  5. To perform PEFT on Google Colab:

    Run the cells in PEFT.ipynb
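
For step 3, a small loader like the following could read `config/config.json` and apply temporary overrides without editing the file. This is a sketch only: the `load_config` helper is hypothetical, and the README does not document which keys the config actually contains.

```python
import json
from pathlib import Path


def load_config(path="config/config.json", overrides=None):
    """Read a JSON config file and apply optional in-memory overrides.

    `overrides` is a hypothetical convenience for quick experiments;
    it never modifies the file on disk.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    config.update(overrides or {})
    return config
```

Calling `load_config()` from the repository root returns the parsed file as a dictionary. Passing something like `overrides={"vocab_size": 16000}` would change one value for a single run, where `vocab_size` is only an illustrative key name.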

📚 References

🛠️ Tools Used:

  • ChatGPT
  • GitHub Copilot
  • Google Translate

📚 Datasets used

| Language | Dataset | Train | Test |
|----------|---------|-------|------|
| Bangla | HuggingFaceFW/fineweb-2 ben_Beng | split='train' | split='test' |
| English | HuggingFaceFW/fineweb-edu | split='train' | split='train' |
| Konkani | HuggingFaceFW/fineweb-2 gom_Deva | split='train' | split='test' |
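
The datasets and splits listed above could be loaded with the Hugging Face `datasets` library roughly as follows. This is a sketch, not the repository's loader: the `DATASETS` mapping and the helper names are hypothetical, and streaming mode is an assumption chosen to avoid downloading full corpora on a free Colab instance.

```python
# Dataset paths, config names, and splits are copied from the table above.
DATASETS = {
    "Bangla": {"path": "HuggingFaceFW/fineweb-2", "name": "ben_Beng",
               "train": "train", "test": "test"},
    "English": {"path": "HuggingFaceFW/fineweb-edu", "name": None,
                "train": "train", "test": "train"},
    "Konkani": {"path": "HuggingFaceFW/fineweb-2", "name": "gom_Deva",
                "train": "train", "test": "test"},
}


def split_kwargs(language, which="train"):
    """Build the `load_dataset` arguments for one language and split."""
    spec = DATASETS[language]
    return {"path": spec["path"], "name": spec["name"], "split": spec[which]}


def load_split(language, which="train", streaming=True):
    """Load one split; needs `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(**split_kwargs(language, which), streaming=streaming)
```

For example, `load_split("Konkani", "test")` would return an iterable dataset. Because the English row uses `split='train'` for both columns, a held-out English test set would have to be carved out of that split separately.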

📊 Data Statistics

| Language | Similarity Score | Token Count |
|----------|------------------|-------------|
| Bangla | 0.414 | 500M |
| English | 0.000 | 200M |
| Konkani | 0.117 | 100K |
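
How the similarity score is computed is not stated here. As an illustration of the MinHash technique named in the features list, the sketch below estimates the Jaccard similarity between two texts from fixed-length signatures. The shingle size, the number of hash functions, and all function names are assumptions, and the repository may well use a dedicated library instead.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64, k=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text, k)
    return [min(_hash(s, seed) for s in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumps over the lazy cat")
print(estimate_jaccard(a, b))
```

Pairs whose estimate exceeds a chosen threshold can be flagged as near-duplicates. Comparing every pair is quadratic in the number of samples, so larger corpora usually add locality-sensitive hashing on top of the signatures.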

🛠️ Infra Used:

  • Google Colab Free Tier
  • Google Drive for storage
  • Google Translate
