
llm-datasets

Here are 22 public repositories matching this topic...

Collection of text2cypher datasets, evaluations, and finetuning instructions

  • Updated Jun 13, 2024
  • Jupyter Notebook

Repository for organizing datasets and papers used in Open LLM.

  • Updated Jul 6, 2023

A collection of LLM-related papers, theses, tools, datasets, courses, open-source models, and benchmarks

  • Updated Oct 8, 2024
  • Python
SyGra

A collection of recent open-source math datasets for training and evaluating Math LLMs

  • Updated Oct 14, 2025

Efficiently fetch eksisozluk.com entries and perform sentiment analysis on them (Turkish only) using Rust

  • Updated Feb 8, 2024
  • Rust

A framework to analyze how AGI/ASI might emerge from decentralized, adaptive systems, rather than as the fruit of a single model deployment. It also aims to present orientation as a dynamic and self-evolving Magna Carta, helping to guide the emergence of such phenomena.

  • Updated Aug 6, 2025

Synthetically generating intent-aware information-seeking dialogues. Useful for tasks such as training and evaluating user intent predictors, with the option of training or evaluating on real human dialogues. The backbone LLM of SOLID is Zephyr-7b-beta.

  • Updated Aug 18, 2024
  • Python

Convert multi-speaker audio files to structured chat data for LLMs

  • Updated Jan 29, 2025
  • Python

WikiText syntax dataset generation pipeline and open dataset for auto UI generation in TiddlyWiki. (WIP)

  • Updated Nov 20, 2024
  • TypeScript

Source code of a number of very well-known Python repositories, collected in this repo as plain local docs for training code AI

  • Updated Dec 12, 2024
  • Python

Stratified LLM Subsets delivers diverse training data at 100K-1M scales across pre-training (FineWeb-Edu, Proof-Pile-2), instruction-following (Tulu-3, Orca AgentInstruct), and reasoning distillation (Llama-Nemotron). Embedding-based k-means clustering ensures maximum diversity across 5 high-quality open datasets.

  • Updated Oct 4, 2025
  • HTML

PARROT (Performance Assessment of Reasoning and Responses On Trivia) is a novel benchmarking framework designed to evaluate Large Language Models (LLMs) on real-world, complex, and ambiguous QA tasks.

  • Updated Oct 14, 2024
  • Python

A modified dataset consisting of English dialogs between a user and an assistant discussing movie preferences in natural language.

  • Updated Sep 29, 2023

An automated tool for generating high-quality outputs for LLM finetuning datasets using reverse-engineered model APIs.

  • Updated Oct 27, 2025

A collection of Persian poems structured for NLP and LLM tasks. Each poem is stored as a separate file, organized by poet, and formatted for easy use in training, fine-tuning, or text analysis workflows.

  • Updated Jul 3, 2025

Minimal dataset consisting of 363 Human & Assistant dialogs

  • Updated Oct 1, 2023
