llm-datasets
Here are 22 public repositories matching this topic...
Collection of text2cypher datasets, evaluations, and finetuning instructions
- Updated
Jun 13, 2024 - Jupyter Notebook
Repository for organizing datasets and papers used in Open LLM.
- Updated
Jul 6, 2023
A data-centric AI package for ML/AI. Get the best high-quality data for the best results. Discord: https://discord.gg/t6ADqBKrdZ
- Updated
Nov 20, 2023 - Python
A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks
- Updated
Oct 8, 2024 - Python
SyGra - Graph-oriented Synthetic data generation Pipeline
- Updated
Nov 6, 2025 - Python
A collection of recent open-source math datasets for training and evaluating Math LLMs
- Updated
Oct 14, 2025
Efficiently fetch and perform sentiment analysis (Turkish Only) on eksisozluk.com entries using Rust
- Updated
Feb 8, 2024 - Rust
A framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems, rather than as the fruit of a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.
- Updated
Aug 6, 2025
Synthetically Generating Intent-Aware Information-Seeking Dialogues! Useful for tasks such as training and evaluating User Intent Predictors, with the option of training and evaluating on real human dialogues. The backbone LLM of SOLID is Zephyr-7b-beta.
- Updated
Aug 18, 2024 - Python
Convert multi-speaker audio files to structured chat data for LLMs
- Updated
Jan 29, 2025 - Python
WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki. (WIP)
- Updated
Nov 20, 2024 - TypeScript
Source code from a number of well-known Python repositories, gathered in this repo as plain local docs for training code AI
- Updated
Dec 12, 2024 - Python
LLM-Powered Dataset Creation Tool
- Updated
Aug 15, 2025 - HTML
Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.
- Updated
Oct 4, 2025 - HTML
PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.
- Updated
Oct 14, 2024 - Python
A modified dataset consisting of English dialogs between a user and an assistant discussing movie preferences in natural language.
- Updated
Sep 29, 2023
An automated tool for generating high-quality outputs for LLM finetuning datasets using reverse-engineered model APIs.
- Updated
Oct 27, 2025
A collection of Persian poems structured for NLP and LLM tasks. Each poem is stored as a separate file, organized by poet, and formatted for easy use in training, fine-tuning, or text analysis workflows.
- Updated
Jul 3, 2025
- Updated
Feb 11, 2025