llm-datasets

A framework for analyzing how AGI/ASI might emerge from decentralized, adaptive systems rather than from a single model deployment. It also aims to offer guidance in the form of a dynamic, self-evolving Magna Carta that helps steer the emergence of such phenomena.

machine-learning agi dataset artificial-general-intelligence machine-learning-library datasets machine-learning-projects llm llms rlhf llm-datasets llm-framework llms-benchmarking llm-benchmarking artificial-general-super-intelligence agi-development

Updated Aug 6, 2025

arian-askari/SOLID

Star 4

Synthetically generating intent-aware information-seeking dialogues. Useful for tasks such as training and evaluating user intent predictors, with the option of training and evaluating on real human dialogues. The backbone LLM of SOLID is Zephyr-7b-beta.

solid dataset-generation conversational-ai intent-classification llm-training llm-inference llm-datasets llm-dialogs llm-conversations zephyr-7b-beta intent-aware-conversation-generation solid-rl

Updated Aug 18, 2024
Python

neuralwork/audio2chat

Star 3

Convert multi-speaker audio files to structured chat data for LLMs

chat transcription whisper speaker-diarization llm llm-datasets

Updated Jan 29, 2025
Python
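
The step from diarized transcription to chat data can be illustrated with a minimal sketch. It is not taken from audio2chat: the `Segment` structure, the `segments_to_chat` helper, and the `SPEAKER_00` style labels are all hypothetical, and it assumes a transcription and diarization stage has already produced speaker-attributed text spans.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed span attributed to a speaker (hypothetical structure)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def segments_to_chat(segments):
    """Merge consecutive segments from the same speaker into chat messages."""
    messages = []
    for seg in sorted(segments, key=lambda s: s.start):
        text = seg.text.strip()
        if not text:
            continue  # skip spans with no transcribed text
        if messages and messages[-1]["speaker"] == seg.speaker:
            # Same speaker keeps talking: extend the previous message.
            messages[-1]["content"] += " " + text
        else:
            messages.append({"speaker": seg.speaker, "content": text})
    return messages


# Assumed example input; the speaker labels are placeholders.
segments = [
    Segment("SPEAKER_00", 0.0, 1.8, "Hi, thanks for joining."),
    Segment("SPEAKER_00", 1.9, 3.2, "Shall we start?"),
    Segment("SPEAKER_01", 3.4, 4.0, "Sure."),
]
chat = segments_to_chat(segments)
```

Merging adjacent spans from one speaker keeps each chat turn whole, which is usually what a chat-formatted training example needs.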

tiddly-gittly/TiddlyWiki-LLM-dataset

Star 2

WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki. (WIP)

dataset tiddlywiki wikitext llm llm-training llm-datasets

Updated Nov 20, 2024
TypeScript

DefinetlyNotAI/LLM_Data

Star 2

Source code from a number of very well-known repositories, in Python, gathered in this repo as plain localdocs for training code AI.

c data cpp cuda jupyter-notebook python3 code-examples llm llm-datasets data-dum programming-data programming-data-sets llm-code

Updated Dec 12, 2024
Python

dmeldrum6/LLM-Dataset-Builder

Star 2

LLM-Powered Dataset Creation Tool

synthetic-data synthetic-dataset-generation synthetic-data-generation llm llm-training llm-datasets

Updated Aug 15, 2025
HTML

AmanPriyanshu/Stratified-LLM-Subsets-100K-1M-Scale

Star 1

Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.

training diversity data deep-learning dataset pretrained-models reasoning training-data fine-tuning finetuning sft pre-training instruction-following llm llms instruction-tuning llm-training llm-datasets finetuning-llms llm-training-data

Updated Oct 4, 2025
HTML
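
As a rough illustration of cluster-stratified subsetting, the sketch below draws items evenly across cluster labels. It is not the repository's code: `stratified_sample` is a hypothetical helper, and it assumes the k-means step over embeddings has already assigned one integer label per item.

```python
import random


def stratified_sample(cluster_labels, n_samples, seed=0):
    """Pick item indices, spreading them as evenly as possible across clusters."""
    rng = random.Random(seed)
    by_cluster = {}
    for idx, label in enumerate(cluster_labels):
        by_cluster.setdefault(label, []).append(idx)
    # Shuffle within each cluster so picks are random but reproducible.
    pools = [by_cluster[label] for label in sorted(by_cluster)]
    for pool in pools:
        rng.shuffle(pool)
    selected = []
    # Round-robin over clusters until enough items are drawn
    # or every cluster has been exhausted.
    while len(selected) < n_samples and any(pools):
        for pool in pools:
            if pool and len(selected) < n_samples:
                selected.append(pool.pop())
    return selected


# Assumed labels: cluster 0 is much larger than clusters 1 and 2.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
subset = stratified_sample(labels, n_samples=6)
```

Drawing round-robin rather than uniformly at random prevents a large cluster from dominating the subset, which is the intuition behind using clusters as strata.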

redblock-ai/parrot-python

Star 1

PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.

benchmarking-framework llm-inference llm-datasets llm-qa-document llm-benchmarking

Updated Oct 14, 2024
Python

aloobun/ccpem-modified

Star 0

A modified dataset consisting of English dialogs between a user and an assistant discussing movie preferences in natural language.

dataset llm-datasets

Updated Sep 29, 2023

sujalrajpoot/LLM-Finetuning-Dataset-Generator

Star 0

An automated tool for generating high-quality outputs for LLM finetuning datasets using reverse-engineered model APIs.

dataset-generation dataset-generator llm llm-datasets llm-dataset-generator

Updated Oct 27, 2025

mohammadreza-mohammadi94/Persian-Poem-Dataset

Star 0

A collection of Persian poems structured for NLP and LLM tasks. Each poem is stored as a separate file, organized by poet, and formatted for easy use in training, fine-tuning, or text analysis workflows.

dataset persian-dataset llm-datasets
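
A one-file-per-poem layout grouped by poet could be loaded along the following lines. This is a sketch under assumptions, not the repository's tooling: the `<root>/<poet>/<poem>.txt` layout, the `.txt` extension, the `load_poems` helper, and the `poet_a` directory name are all hypothetical.

```python
import tempfile
from pathlib import Path


def load_poems(root):
    """Return one record per poem file found under <root>/<poet>/<poem>.txt."""
    records = []
    for path in sorted(Path(root).glob("*/*.txt")):
        records.append({
            "poet": path.parent.name,   # directory name stands in for the poet
            "title": path.stem,         # file name without extension
            "text": path.read_text(encoding="utf-8").strip(),
        })
    return records


# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    poet_dir = Path(tmp) / "poet_a"
    poet_dir.mkdir()
    (poet_dir / "poem_001.txt").write_text(
        "placeholder line one\nplaceholder line two\n", encoding="utf-8"
    )
    poems = load_poems(tmp)
```

Reading with an explicit `utf-8` encoding matters for Persian text, since a platform default encoding may not decode it correctly.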