luo-junyu/Awesome-Data-Efficient-LLM
A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective

[Figure: flywheel]

❖ Paper List

| Title | TLDR | Category | Paper Link | Year | Publish |
| --- | --- | --- | --- | --- | --- |
| Data-efficient Fine-tuning for LLM-based Recommendation | Propose data pruning method for efficient LLM-based recommendation. | Data Selection | link | 2024 | ACM |
| CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning | CoachLM automatically revises samples to enhance instruction dataset quality. | Data Selection, Data Quality Enhancement | link | 2023 | IEEE |
| Alpagasus: Training a Better Alpaca with Fewer Data | Propose data selection strategy, filter low-quality data for IFT, ALPAGASUS as example. | Data Selection | link | 2024 | NIPS/ICML/ICLR |
| From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning | Introduce self-guided method for LLMs to select samples, key innovation IFD metric. | Data Selection | link | 2024 | *ACL |
| Rethinking the Instruction Quality: LIFT is What You Need | LIFT elevates instruction quality by broadening data distribution. | Data Selection | link | 2023 | arxiv |
| Instag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models | Propose INSTAG to tag instructions, find benefits for LLMs, and a data sampling procedure. | Data Selection | link | 2024 | NIPS/ICML/ICLR |
| MoDS: Model-oriented Data Selection for Instruction Tuning | MoDS selects instruction data by quality, coverage and necessity. | Data Selection | link | 2023 | arxiv |
| SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions | SELF-INSTRUCT bootstraps from the LM for instruction-following, nearly annotation-free. | Data Selection | link | 2023 | *ACL |
| Active Instruction Tuning: Improving Cross-Task Generalization by Training on Prompt Sensitive Tasks | Propose active IT based on prompt uncertainty to select tasks for LLM tuning. | Data Selection | link | 2023 | *ACL |
| Automated Data Curation for Robust Language Model Fine-Tuning | Introduce CLEAR for data curation in LLM fine-tuning without extra computations. | Data Selection | link | 2024 | *ACL |
| CLUES: Collaborative Private-domain High-quality Data Selection for LLMs via Training Dynamics | Propose data quality control via training dynamics for collaborative LLM training. | Data Selection | link | 2024 | NIPS/ICML/ICLR |
| Compute-Constrained Data Selection | Formalize data selection as a cost-aware problem, model the trade-offs. | Data Selection | link | 2025 | NIPS/ICML/ICLR |
| DATA ADVISOR: Dynamic Data Curation for Safety Alignment of Large Language Models | DATA ADVISOR for data generation to enhance LLM safety. | Data Selection | link | 2024 | *ACL |
| Data Curation Alone Can Stabilize In-context Learning | Two methods curate training data subsets to stabilize ICL without algorithm changes. | Data Selection | link | 2023 | *ACL |
| Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs | Select data to nudge the pre-training distribution closer to the target distribution for cost-effective fine-tuning. | Data Selection | link | 2024 | NIPS/ICML/ICLR |
| Improving Data Efficiency via Curating LLM-Driven Rating Systems | DS2 corrects LLM-based scores for data selection promoting diversity. | Data Selection | link | 2025 | NIPS/ICML/ICLR |
| LLM-Select: Feature Selection with Large Language Models | LLMs can select predictive features without seeing training data. | Data Selection | link | 2024 | Journal |
| One-Shot Learning as Instruction Data Prospector for Large Language Models | NUGGETS uses one-shot learning to select high-quality instruction data. | Data Selection | link | 2024 | *ACL |
| SAMPLE-EFFICIENT ALIGNMENT FOR LLMS | Introduce unified algorithm for LLM alignment based on Thompson sampling. | Data Selection | link | 2024 | arxiv |
| LESS: Selecting Influential Data for Targeted Instruction Tuning | Propose LESS to select data for targeted instruction tuning in LLMs. | Data Selection | link | 2024 | NIPS/ICML/ICLR |
| An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models | Propose experimental design for SFT in LLMs to mitigate annotation cost. | Data Selection | link | 2024 | *ACL |
| DELE: Data Efficient LLM Evaluation | Propose adaptive sampling for LLM evaluation to reduce cost without losing integrity. | Data Selection | link | 2024 | NIPS/ICML/ICLR |
| Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective | Model the synthetic data generation process, relate generalization & info gain. | Data Synthesis | link | 2024 | arxiv |
| Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data | Generate Lean 4 proof data to enhance LLM theorem-proving. | Data Synthesis | link | 2024 | NIPS/ICML/ICLR |
| Are LLMs Naturally Good at Synthetic Tabular Data Generation? | LLMs as-is or fine-tuned are bad at tabular data generation; permutation awareness can help. | Data Synthesis | link | 2024 | arxiv |
| Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs | Group synthetic data strategies, study LLM training, propose selection framework. | Data Synthesis | link | 2024 | NIPS/ICML/ICLR |
| Best Practices and Lessons Learned on Synthetic Data for Language Models | The paper focuses on synthetic data for LMs, its use, challenges and responsible use. | Data Synthesis | link | 2024 | arxiv |
| ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning | ChatTS, a TS-MLLM, uses synthetic data for time series analysis. | Data Synthesis | link | 2024 | arxiv |
| Data extraction for evidence synthesis using a large language model: A proof-of-concept study | The study assesses Claude 2's data extraction in evidence synthesis. | Data Synthesis | link | 2024 | Journal |
| Illuminating Blind Spots of Language Models with Targeted Agent-in-the-Loop Synthetic Data | Use intelligent agents as teachers to generate samples for blind spot mitigation. | Data Synthesis | link | 2024 | arxiv |
| Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science | The paper studies strategies to increase synthetic data faithfulness. | Data Synthesis | link | 2023 | arxiv |
| Generative LLMs for Synthetic Data Generation: Methods, Challenges and the Future | The paper focuses on using LLMs for synthetic data generation & related aspects. | Data Synthesis | link | 2023 | Journal |
| HARMONIC: Harnessing LLMs for Tabular Data Synthesis and Privacy Protection | Introduce HARMONIC for tabular data synthesis & privacy, using LLMs with fine-tuning. | Data Synthesis | link | 2024 | NIPS/ICML/ICLR |
| Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing | MAGPIE self-synthesizes alignment data from aligned LLMs without human prompts. | Data Synthesis | link | 2024 | arxiv |
| Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation | MATRIX multi-agent simulator creates scenarios for data synthesis in LLM post-training. | Data Synthesis | link | 2025 | NIPS/ICML/ICLR |
| Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations | Explore factors moderating LLM-generated data effectiveness in text classification. | Data Synthesis | link | 2023 | *ACL |
| Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance | Develop theoretical foundations for synthetic oversampling using LLMs. | Data Synthesis | link | 2024 | arxiv |
| Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models | This paper explores synthetic data flaws in LLMs & presents a mitigation method. | Data Synthesis | link | 2024 | *ACL |
| Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement | Condor generates high-quality SFT data with a two-stage framework for LLMs. | Data Synthesis | link | 2025 | arxiv |
| Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and Challenges | The paper explores LLM-based data augmentation, challenges & learning paradigms. | Data Augmentation | link | 2024 | *ACL |
| Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework | Propose an automated design-data augmentation framework for LLMs in chip design. | Data Augmentation | link | 2024 | ACM |
| LLM-powered Data Augmentation for Enhanced Cross-lingual Performance | Uses LLMs for data augmentation in limited multilingual datasets. | Data Augmentation, Survey | link | 2023 | *ACL |
| LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition | LLM-DA augments data at context/entity levels for few-shot NER. | Data Augmentation | link | 2024 | arxiv |
| LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods | Calculates LLMNL and HNL by scaling laws, proposes ZGPTDA for data augmentation. | Data Augmentation | link | 2024 | arxiv |
| A Survey on Data Augmentation in Large Model Era | Paper reviews large-model-driven data augmentation methods, applications & future challenges. | Data Augmentation | link | 2024 | arxiv |
| ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs | Use ChatGPT to generate data for LLM debiasing with two strategies. | Data Augmentation | link | 2024 | COLM |
| A Guide To Effectively Leveraging LLMs for Low-Resource Text Summarization: Data Augmentation and Semi-supervised Approaches | Two new methods for low-resource text summarization are proposed. | Data Augmentation | link | 2025 | *ACL |
| Empowering Large Language Models for Textual Data Augmentation | Propose a solution to auto-generate LLM augmentation instructions for quality data. | Data Augmentation | link | 2024 | *ACL |
| LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods | Introduce scaling laws for LLMNL and HNL, a new data augmentation method ZGPTDA. | Data Augmentation | link | 2024 | arxiv |
| LLM-AutoDA: Large Language Model-Driven Automatic Data Augmentation for Long-tailed Problems | Proposes LLM-AutoDA for long-tailed data augmentation by leveraging large-scale models. | Data Augmentation | link | 2024 | NIPS/ICML/ICLR |
| Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud | Present data augmentation models for low-cost LLM fine-tuning with key functionalities. | Data Augmentation | link | 2025 | *ACL |
| Mini-DA: Improving Your Model Performance through Minimal Data Augmentation using LLM | Mini-DA selects challenging samples for augmentation, improving resource utilization. | Data Augmentation | link | 2024 | *ACL |
| Data Augmentation for Text-based Person Retrieval Using Large Language Models | Propose LLM-DA for TPR, use TFF & BSS to augment data concisely & efficiently. | Data Augmentation | link | 2024 | *ACL |
| Data Augmentation for Cross-domain Parsing via Lightweight LLM Generation and Tree Hybridization | Propose data augmentation via LLM & tree hybridization for cross-domain parsing. | Data Augmentation | link | 2025 | *ACL |
| AugGPT: Leveraging ChatGPT for Text Data Augmentation | Propose AugGPT for text data augmentation, rephrasing training samples. | Data Augmentation | link | 2025 | IEEE |
| PGA-SciRE: Harnessing LLM on Data Augmentation for Enhancing Scientific Relation Extraction | Propose PGA framework for RE in the scientific domain, with two data augmentation ways. | Data Augmentation | link | 2024 | arxiv |
| Improving Topic Relevance Model by Mix-structured Summarization and LLM-based Data Augmentation | Use query/doc summaries & LLM data augmentation for topic relevance modeling. | Data Augmentation | link | 2024 | arxiv |
| Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks | Propose RADA framework to augment data for low-resource domain tasks. | Data Augmentation | link | 2024 | arxiv |
| The Applicability of LLMs in Generating Textual Samples for Analysis of Imbalanced Datasets | The paper compares approaches for handling text data class imbalance. | Data Augmentation | link | 2024 | IEEE |
| Self-Rewarding Language Models | Study self-rewarding LMs, use LLM-as-a-Judge for self-rewards during training. | Self Evolution | link | 2024 | NIPS/ICML/ICLR |
| Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models | Propose SPIN method for LLMs; a self-play mechanism refines the model's own capabilities. | Self Evolution | link | 2024 | NIPS/ICML/ICLR |
| Self-Boosting Large Language Models with Synthetic Preference Data | SynPO self-boosts LLMs via synthetic preference data, eliminating large-scale annotation. | Self Evolution | link | 2024 | arxiv |
| MEMORYLLM: Towards Self-Updatable Large Language Models | MEMORYLLM is self-updatable, can integrate new knowledge and retain long-term info. | Self Evolution | link | 2024 | NIPS/ICML/ICLR |
| Self-Refine: Iterative Refinement with Self-Feedback | Self-Refine iteratively refines LLM outputs without extra training data or RL. | Self Evolution | link | 2023 | NIPS/ICML/ICLR |
| META-REWARDING LANGUAGE MODELS: Self-Improving Alignment with LLM-as-a-Meta-Judge | Introduce Meta-Rewarding step for self-improving LLMs' judgment skills. | Self Evolution | link | 2024 | arxiv |
| Automated Proof Generation for Rust Code via Self-Evolution | SAFE framework enables Rust code proof generation via self-evolving cycle. | Self Evolution | link | 2025 | NIPS/ICML/ICLR |
| Arxiv Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance | Arxiv Copilot is a self-evolving LLM system for personalized academic assistance. | Self Evolution | link | 2024 | *ACL |
| Automatic programming via large language models with population self-evolution for dynamic job shop scheduling problem | This paper proposes the SeEvo method for HDR design, inspired by experts' strategies. | Self Evolution | link | 2024 | arxiv |
| Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation | A multi-agent framework for dynamic LLM evaluation through instance reframing. | Self Evolution | link | 2025 | *ACL |
| Bias Amplification in Language Model Evolution: An Iterated Learning Perspective | Draws parallels between LLM behavior & human culture evolution via Iterated Learning. | Self Evolution | link | 2024 | NIPS/ICML/ICLR |
| Enhanced Fine-Tuning of Lightweight Domain-Specific Q&A Model Based on Large Language Models | Propose Self-Evolution framework for lightweight LLM fine-tuning. | Self Evolution | link | 2024 | IEEE |
| Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models | Propose ENVISIONS to self-train LLMs in neural-symbolic scenarios, overcoming two challenges. | Self Evolution | link | 2024 | arxiv |
| I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm | I-SHEEP paradigm enables LLMs to self-improve iteratively in low-resource scenarios. | Self Evolution | link | 2024 | arxiv |
| Language Models as Continuous Self-Evolving Data Engineers | Propose LANCE for LLMs to self-train via automated data operations, reducing post-training cost. | Self Evolution | link | 2024 | arxiv |
| LLM Guided Evolution - The Automation of Models Advancing Models | GE uses LLMs to directly modify code for model evolution. | Self Evolution | link | 2024 | arxiv |
| LLM-Evolve: Evaluation for LLM's Evolving Capability on Benchmarks | Proposes LLM-Evolve framework to evaluate LLMs' evolving ability on benchmarks. | Self Evolution | link | 2024 | *ACL |
| Long Term Memory: The Foundation of AI Self-Evolution | This paper explores AI self-evolution with long-term memory (LTM). | Self Evolution | link | 2024 | arxiv |
| METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth | Propose Meteor method for model evolution with three training phases to maximize domain capabilities. | Self Evolution, Distillation | link | 2024 | arxiv |
| Promptbreeder: Self-referential self-improvement via prompt evolution | Promptbreeder self-improves prompts via self-referential evolution. | Self Evolution | link | 2024 | NIPS/ICML/ICLR |
| rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | rStar-Math uses deep thinking via MCTS for SLMs to master math reasoning. | Self Evolution | link | 2025 | arxiv |
| Self: Language-driven self-evolution for large language model | SELF enables LLMs to self-evolve without human intervention via language feedback. | Self Evolution | link | 2024 | NIPS/ICML/ICLR |
| Self-Evolution Fine-Tuning for Policy Optimization | SEFT for policy optimization eliminates the need for annotated samples. | Self Evolution | link | 2024 | *ACL |
| Self-Evolutionary Group-wise Log Parsing Based on Large Language Model | SelfLog self-evolves by LLM-extracted similar pairs and uses N-Gram-based methods. | Self Evolution | link | 2024 | IEEE |
| Self-Evolutionary Large Language Models through Uncertainty-Enhanced Preference Optimization | UPO framework mitigates noisy preference data for LLM self-evolution via reliable feedback. | Self Evolution | link | 2024 | arxiv |
| Self-Evolved Reward Learning for LLMs | Self-Evolved Reward Learning (SER) iteratively improves the reward model with self-generated data. | Self Evolution | link | 2025 | NIPS/ICML/ICLR |
| AugmenToxic: Leveraging Reinforcement Learning to Optimize LLM Instruction Fine-Tuning for Data Augmentation to Enhance Toxicity Detection | Propose RL-based method for LLM fine-tuning to augment toxic language data. | Toxicity / Trust-worthy | link | 2024 | ACM |
| Benchmarking LLMs in Political Content Text-Annotation: Proof-of-Concept with Toxicity and Incivility Data | Benchmarked LLMs on political text annotation with toxicity and incivility data. | Toxicity / Trust-worthy | link | 2024 | arxiv |
| Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric | Introduce LLM-based toxicity metric, analyze factors, evaluate its performance. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation | The paper uses geometry to understand LLMs and solve toxicity-related issues. | Toxicity / Trust-worthy | link | 2024 | NIPS/ICML/ICLR |
| Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations | Presents a detectors library for LLM harms, its uses & challenges. | Toxicity / Trust-worthy | link | 2024 | arxiv |
| Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs | This paper creates an open-source dataset to evaluate LLM safeguards. | Toxicity / Trust-worthy | link | 2023 | arxiv |
| Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models | BD-LLM bootstraps & distills LLMs for toxic content detection via DToT. | Toxicity / Trust-worthy | link | 2024 | AAAI/IJCAI |
| Evaluating the Impact of Model Size on Toxicity and Stereotyping in Generative LLM | Explore the relation between LLM size and toxicity & stereotyping; the smallest model performs best. | Toxicity / Trust-worthy | link | 2023 | Journal |
| How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models | EvoTox tests LLM toxicity post-alignment via an iterative evolution strategy. | Toxicity / Trust-worthy | link | 2025 | arxiv |
| Improving Covert Toxicity Detection by Retrieving and Generating References | This paper explores the potential of references for covert toxicity detection. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs | The paper analyzes data contamination & evaluation malpractices in closed-source LLMs. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| LLM-Based Synthetic Datasets: Applications and Limitations in Toxicity Detection | The paper explores LLM-based synthetic data in toxicity detection, its potential and limits. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language | New annotation benchmark reduces bias, shows LLM annotation value. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection | Assess whether CAD generation for harmful language detection can be automated using NLP models. | Toxicity / Trust-worthy | link | 2023 | *ACL |
| Realistic Evaluation of Toxicity in Large Language Models | New TET dataset helps rigorously evaluate toxicity in popular LLMs. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| TOXICCHAT: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation | Introduces the TOXICCHAT benchmark for toxicity detection in real-world user-AI conversations. | Toxicity / Trust-worthy | link | 2023 | *ACL |
| Toxicity Detection with Generative Prompt-based Inference | Explore generative zero-shot prompt-based toxicity detection. | Toxicity / Trust-worthy | link | 2022 | arxiv |
| Toxicity in CHATGPT: Analyzing Persona-assigned Language Models | The paper evaluates ChatGPT toxicity with persona-assigned language models. | Toxicity / Trust-worthy | link | 2023 | *ACL |
| ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information | The paper proposes ToxiCraft to generate harmful info datasets, addressing two issues. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| TOXIGEN: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection | Create TOXIGEN dataset, new method for generating text, human evaluation. | Toxicity / Trust-worthy | link | 2022 | arxiv |
| Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties | Evaluates LLM-as-a-judge consistency for dialectal toxicity detection across language varieties. | Toxicity / Trust-worthy, LLM-as-Judger | link | 2024 | arxiv |
| Do-Not-Answer: Evaluating Safeguards in LLMs | The paper curates a dataset to evaluate LLM safeguards for safer deployment. | Toxicity / Trust-worthy | link | 2024 | *ACL |
| An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4 | Fine-tuned judge models have limitations; an integrated method improves them. | LLM-as-Judger | link | 2024 | *ACL |
| CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges | CalibraEval mitigates LLM-as-Judges selection bias via NOA. | LLM-as-Judger | link | 2024 | arxiv |
| Can LLMs be Good Graph Judger for Knowledge Graph Construction? | The paper proposes GraphJudger to address KG construction challenges. | LLM-as-Judger | link | 2024 | arxiv |
| CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences | Propose LLM-as-a-Judge methodology for evaluating LLM coding preference alignment. | LLM-as-Judger | link | 2024 | arxiv |
| Crowd score: A method for the evaluation of jokes using large language model AI voters as judges | Crowd Score method assesses joke funniness via LLMs as AI judges. | LLM-as-Judger | link | 2022 | arxiv |
| Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation | Introduce FLAMe, trained on quality tasks, less biased than other LLM-as-a-Judge models. | LLM-as-Judger | link | 2024 | *ACL |
| Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | Use LLM-as-a-judge to evaluate chat assistants, verify with two benchmarks. | LLM-as-Judger | link | 2023 | NIPS/ICML/ICLR |
| Judgelm: Fine-tuned large language models are scalable judges | Fine-tune LLMs as scalable judges, propose dataset & techniques. | LLM-as-Judger | link | 2023 | arxiv |
| Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges | The paper studies LLM-as-judges, judges' performance and vulnerabilities. | LLM-as-Judger | link | 2024 | arxiv |
| Large Language Models are Inconsistent and Biased Evaluators | LLMs are inconsistent/biased evaluators; recipes to mitigate limitations are shared. | LLM-as-Judger | link | 2024 | arxiv |
| Llm-as-a-judge & reward model: What they can and cannot do | Analysis of automated evaluators: English evaluation & limitations. | LLM-as-Judger | link | 2024 | arxiv |
| LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks | Evaluated 11 LLMs on 20 datasets; LLMs need human validation before use as evaluators. | LLM-as-Judger | link | 2024 | arxiv |
| Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge | Introduce Meta-Rewarding step to self-improve LLMs' judgment skills. | LLM-as-Judger | link | 2024 | arxiv |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | This paper introduces MLLM-as-a-Judge benchmark to assess MLLMs' judging ability. | LLM-as-Judger | link | 2024 | NIPS/ICML/ICLR |
| R-Judge: Benchmarking Safety Risk Awareness for LLM Agents | R-Judge benchmarks LLM agents' safety risk awareness in interactions. | LLM-as-Judger | link | 2024 | arxiv |
| Self-Taught Evaluators | An approach improves evaluators using only synthetic training data. | LLM-as-Judger | link | 2024 | arxiv |
| Style Over Substance: Evaluation Biases for Large Language Models | Study shows evaluation bias for LLMs, proposes MERS to improve LLM-based evaluations. | LLM-as-Judger | link | 2025 | *ACL |
| Wider and Deeper LLM Networks are Fairer LLM Evaluators | The paper uses wider & deeper LLM networks for fairer LLM evaluation. | LLM-as-Judger | link | 2023 | arxiv |
| Internal Consistency and Self-Feedback in Large Language Models: A Survey | This paper uses an internal consistency perspective to explain LLM issues and introduces Self-Feedback. | Survey | link | 2024 | arxiv |
| A Survey on Self-Evolution of Large Language Models | The paper surveys self-evolution in LLMs, including its process and challenges. | Survey, Self Evolution | link | 2024 | arxiv |
| Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies | Reviews advances in auto-correcting LLMs via feedback, categorizes approaches. | Survey | link | 2024 | Journal |
| A Survey on Data Selection for LLM Instruction Tuning | This paper surveys data selection for LLM instruction tuning. | Survey, Data Selection | link | 2024 | arxiv |
| Large Language Models for Data Annotation and Synthesis: A Survey | Surveys the use of LLMs for data annotation and synthesis. | Survey, Data Synthesis | link | 2024 | *ACL |
| On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey | The paper organizes LLM-driven data generation studies to show research gaps and future directions. | Survey | link | 2024 | *ACL |
| Trustworthy LLMs: A survey and guideline for evaluating large language models' alignment | The paper surveys LLM trustworthiness dimensions for alignment evaluation. | Survey, Toxicity / Trust-worthy | link | 2024 | NIPS/ICML/ICLR |
| A Survey on Data Selection for Language Models | Comprehensive review of data selection for LMs to accelerate related research. | Survey, Data Selection | link | 2024 | Journal |
| LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods | A comprehensive survey of LLM-based evaluation methods (LLMs-as-judges). | Survey, LLM-as-Judger | link | 2024 | arxiv |
| A Survey on Data Synthesis and Augmentation for Large Language Models | Reviews LLM data generation techniques, discusses constraints. | Survey, Data Synthesis, Data Augmentation | link | 2024 | arxiv |
| A Survey on Knowledge Distillation of Large Language Models | Comprehensive survey on KD in LLMs: mechanisms, skills, verticalization & DA interplay. | Survey, Distillation | link | 2024 | arxiv |
| Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application | Survey on LLM knowledge distillation methods, evaluation & applications. | Survey, Distillation | link | 2024 | ACM |
| Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing | Distill a high-quality dataset & model from a low-quality model for summarization & paraphrasing. | Distillation | link | 2023 | arxiv |
| Prompt Distillation for Efficient LLM-based Recommendation | Propose prompt distillation to bridge IDs & words & reduce inference time. | Distillation | link | 2023 | ACM |
| Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale | PGKD for text classification, an LLM distillation method with a versatile framework. | Distillation | link | 2024 | *ACL |
| Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels | The paper tests LLM-generated labels for supervised text classification workflows. | Distillation | link | 2024 | *ACL |
| Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation | Propose MCKD for semi-supervised sequence generation, iteratively improving pseudolabels. | Distillation | link | 2024 | *ACL |
| Self-Data Distillation for Recovering Quality in Pruned Large Language Models | Self-data distillation fine-tuning mitigates quality loss from pruning and SFT. | Distillation | link | 2024 | NIPS/ICML/ICLR |
| Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Models | Proposes DLLM2Rec to distill LLM-based recommendation models into sequential models. | Distillation | link | 2024 | ACM |
| Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs | Introduce ULD loss for cross-tokenizer distillation in LLMs. | Distillation | link | 2025 | Journal |
| Self-Evolution Knowledge Distillation for LLM-based Machine Translation | Self-Evolution KD dynamically integrates prior knowledge for better knowledge transfer. | Distillation, Self Evolution | link | 2025 | *ACL |
| Efficiently Distilling LLMs for Edge Applications | Propose MLFS for parameter-efficient supernet training of LLMs. | Distillation | link | 2024 | *ACL |
| Xai-driven knowledge distillation of large language models for efficient deployment on low-resource devices | DiXtill uses XAI to distill LLM knowledge into a self-explainable student model. | Distillation | link | 2024 | Journal |
| Compact Language Models via Pruning and Knowledge Distillation | Develop compression practices for LLMs via pruning and distillation. | Distillation | link | 2024 | NIPS/ICML/ICLR |
| LLM-Enhanced Multi-Teacher Knowledge Distillation for Modality-Incomplete Emotion Recognition in Daily Healthcare | Propose LLM-enhanced multi-teacher KD for emotion recognition in modality-incomplete cases. | Distillation | link | 2024 | IEEE |
| BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation | BitDistiller combines QAT and KD for sub-4-bit LLMs with new techniques. | Distillation | link | 2024 | *ACL |
| Reducing LLM Hallucination Using Knowledge Distillation: A Case Study with Mistral Large and MMLU Benchmark | Knowledge distillation reduces LLM hallucination via specific methods. | Distillation | link | 2024 | arxiv |
| Distilling Large Language Models for Text-Attributed Graph Learning | Propose distilling LLMs into a local graph model for TAG learning, with a novel training method. | Distillation | link | 2024 | ACM |
| CourseGPT-zh: an Educational Large Language Model Based on Knowledge Distillation Incorporating Prompt Optimization | CourseGPT-zh uses prompt optimization in a distillation framework for an educational LLM. | Distillation | link | 2024 | arxiv |
| LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression | Propose data distillation for prompt compression, formulated as token classification. | Distillation | link | 2024 | *ACL |
| LLM for Patient-Trial Matching: Privacy-Aware Data Augmentation Towards Better Performance and Generalizability | Propose LLM-PTM for patient-trial matching, ensuring data privacy in the methodology. | Applications | link | 2023 | Others |
| LLM-Assisted Data Augmentation for Chinese Dialogue-Level Dependency Parsing | Present three LLM-based strategies for Chinese dialogue-level dependency parsing. | Applications | link | 2024 | Others |
| Resolving the Imbalance Issue in Hierarchical Disciplinary Topic Inference via LLM-based Data Augmentation | Use Llama V1 to augment data for balancing disciplinary topic inference. | Applications | link | 2023 | IEEE |
| LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification | Propose a DP-based DA method for text classification in private domains. | Applications | link | 2024 | Others |
| Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching | An LLM-based patient-trial matching approach with privacy-aware data augmentation. | Applications | link | 2024 | Others |
| Identifying Citizen-Related Issues from Social Media Using LLM-Based Data Augmentation | Propose LLM-based data augmentation method to extract citizen-related data from tweets. | Applications, Data Augmentation | link | 2024 | Others |
| Synthetic Data Augmentation Using Large Language Models (LLM): A Case-Study of the Kamyr Digester | Introduces LLM-based data augmentation technique for data scarcity. | Applications | link | 2024 | IEEE |
| Conditional Label Smoothing For LLM-Based Data Augmentation in Medical Text Classification | Propose CLS for data augmentation in medical text classification. | Applications | link | 2024 | IEEE |
| Curriculum-style Data Augmentation for LLM-based Metaphor Detection | Propose open-source LLM fine-tuning and CDA for metaphor detection. | Applications, Data Augmentation | link | 2024 | arxiv |
| Enhancing Speech De-Identification with LLM-Based Data Augmentation | A novel data augmentation method for speech de-identification using an LLM and an end-to-end model. | Applications | link | 2024 | IEEE |
| Enhancing Multilingual Fake News Detection through LLM-Based Data Augmentation | Use Llama 3 via LLM-based data augmentation to enrich fake news datasets. | Applications | link | 2024 | Others |
| LLMs Accelerate Annotation for Medical Information Extraction | Propose LLM-human combination for medical text annotation, reducing human burden. | Applications, Active Annotation | link | 2023 | Others |
| Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare | Propose crowdsourcing framework with quality control for LLMs in healthcare, addressing resource scarcity. | Applications | link | 2024 | Others |
| LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement | LLM2LLM iteratively augments data for LLM fine-tuning in low-data scenarios. | Data Quality Enhancement, Data Augmentation | link | 2024 | *ACL |
| Data Quality Enhancement on the Basis of Diversity with Large Language Models for Text Classification: Uncovered, Difficult, and Noisy | Propose DQE method for text classification with LLMs, selecting data via a greedy algorithm. | Data Quality Enhancement | link | 2025 | *ACL |
| Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation | Use an LLM for data cleansing in the Multi-News dataset; no need for costly human annotators. | Data Quality Enhancement | link | 2024 | *ACL |
| LLM-Enhanced Data Management | LLMDB for data management: avoid hallucination, reduce cost, improve accuracy. | Data Quality Enhancement | link | 2024 | ACM |
| Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement | Propose using SQL Quality Measurement to enhance LLM-based Text-to-SQL performance. | Data Quality Enhancement | link | 2024 | arxiv |
| On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation | Enriching prompts with domain insights improves LLM-based tabular data generation. | Data Quality Enhancement | link | 2024 | arxiv |
| On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing | Propose UnIMP with BiHMP and Xfusion for mixed-type data imputation. | Data Quality Enhancement | link | 2025 | arxiv |
| SEMIEVOL: Semi-supervised Fine-tuning for LLM Adaptation | SEMIEVOL, a semi-supervised LLM fine-tuning framework, propagates and selects knowledge. | Data Curation | link | 2024 | arxiv |
| Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes | Introduce CLLM for tabular augmentation in low-data regimes, with a curation mechanism for data. | Data Curation | link | 2024 | NIPS/ICML/ICLR |
| Data to Defense: The Role of Curation in Customizing LLMs Against Jailbreaking Attacks | Propose data curation approach & mitigation framework to counter jailbreaking. | Data Curation | link | 2024 | arxiv |
| DATA ADVISOR: Dynamic Data Curation for Safety Alignment of Large Language Models | Propose Data Advisor for data generation considering dataset characteristics to enhance quality. | Data Curation | link | 2024 | *ACL |
| Data Curation Alone Can Stabilize In-context Learning | Two methods curate data subsets to stabilize ICL without algorithm changes. | Data Curation | link | 2023 | *ACL |
| Automated Data Curation for Robust Language Model Fine-Tuning | Introduced CLEAR for instruction tuning datasets to curate data without extra computations. | Data Curation | link | 2024 | *ACL |
| Improving Data Efficiency via Curating LLM-Driven Rating Systems | DS2, a data selection method, corrects LLM scores and promotes data sample diversity. | Data Curation, Data Selection | link | 2025 | NIPS/ICML/ICLR |
| The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data Only | Show web data alone can lead to powerful models without curated data. | Data Curation | link | 2023 | NIPS/ICML/ICLR |
| Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models | LLMs can improve metadata curation with a structured knowledge base. | Data Curation | link | 2024 | arxiv |
| Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources | Source2Synth generates synthetic data from real sources without human annotations. | Data Curation, Data Synthesis | link | 2024 | arxiv |
| AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark | Investigated LLM-based data-cleaning workflow auto-generation, proposed a benchmark. | Data Curation | link | 2024 | arxiv |
| Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation | Dynosaur automatically constructs instruction-tuning data and reduces costs by leveraging existing datasets. | Data Curation | link | 2023 | *ACL |
| AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning | Proposes system to auto-filter web data for LLM training with trusted AI models. | Data Curation | link | 2024 | arxiv |
| Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond | Propose ADC for efficient dataset construction, offer benchmarks. | Data Curation | link | 2024 | arxiv |
| Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement | Proposes k-means & iterative refinement for data selection to fine-tune LLMs. | Data Curation | link | 2025 | NIPS/ICML/ICLR |
| Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions | Explore human-AI partnerships for high-quality LLM-based text data generation. | Data Curation | link | 2023 | *ACL |
| Balancing performance and cost of LLMs in a multi-agent framework for BIM data retrieval | Propose MAS method to match queries with LLMs for balanced BIM data retrieval. | Data Curation, Applications | link | 2025 | Others |
| Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System | Optima framework in LLM-based MAS improves communication & task effectiveness via LLM training. | Data Curation | link | 2025 | NIPS/ICML/ICLR |
| Synergized Data Efficiency and Compression (SEC) Optimization for Large Language Models | Propose SEC for LLMs to enhance efficiency without sacrificing performance. | Data Curation | link | 2024 | Others |
| LLMaAA: Making Large Language Models as Active Annotators | LLMaAA uses LLMs as annotators in an active learning loop, optimizing annotation & training. | Active Annotation | link | 2023 | *ACL |
| Enhancing Review Classification Via Llm-Based Data Annotation and Multi-Perspective Feature Representation Learning | Propose MJAR dataset & MPFR approach for review classification. | Active Annotation | link | 2024 | Others |
| AutoLabel: Automated Textual Data Annotation Method Based on Active Learning and Large Language Model | AutoLabel uses an LLM & active learning to assist text data annotation. | Active Annotation, Data Quality Enhancement | link | 2024 | Others |
| Human-LLM Collaborative Annotation Through Effective Verification of LLM Labels | A multi-step human-LLM collaborative approach for accurate annotations. | Active Annotation | link | 2024 | ACM |
| PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | PDFChatAnnotator links data & extracts info; the user can guide LLM annotations. | Active Annotation, Applications | link | 2024 | ACM |
| Selective Annotation via Data Allocation: These Data Should Be Triaged to Experts for Annotation Rather Than the Model | Propose SANT for selective annotation, allocating data to expert & model effectively. | Active Annotation | link | 2024 | *ACL |
| Entity Alignment with Noisy Annotations from Large Language Models | Propose LLM4EA framework for entity alignment with reduced annotation space and a label refiner. | Active Annotation | link | 2024 | NIPS/ICML/ICLR |
| CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation | The paper proposes CoAnnotating for human-LLM co-annotation using uncertainty. | Active Annotation | link | 2023 | *ACL |
| Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning | Present techniques to enhance code LLM training efficiency with data pruning. | Data Pruning | link | 2024 | *ACL |
| Data-efficient Fine-tuning for LLM-based Recommendation | Propose a data pruning method with two scores for efficient LLM-based recommendation. | Data Pruning | link | 2024 | ACM |
| LLM-Pruner: On the Structural Pruning of Large Language Models | LLM-Pruner compresses LLMs task-agnostically via structural pruning. | Data Pruning | link | 2023 | NIPS/ICML/ICLR |
| Pruning as a Domain-specific LLM Extractor | Introduce D-Pruner for domain-specific LLM compression by dual-pruning. | Data Pruning | link | 2024 | *ACL |
| Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy | Rank training samples by informativeness via entropy for data pruning of LLMs. | Data Pruning | link | 2024 | arxiv |
| P3: A Policy-Driven, Pace-Adaptive, and Diversity-Promoted Framework for data pruning in LLM Training | P3 optimizes LLM fine-tuning via iterative data pruning with three key components. | Data Pruning | link | 2024 | NIPS/ICML/ICLR |
| All-in-One Tuning and Structural Pruning for Domain-Specific LLMs | ATP is a unified approach to pruning & fine-tuning LLMs via a trainable generator. | Data Pruning | link | 2024 | arxiv |
| Language Model-Driven Data Pruning Enables Efficient Active Learning | ActivePrune, a novel pruning strategy for AL, uses LMs to prune unlabeled data. | Data Pruning | link | 2025 | NIPS/ICML/ICLR |
| Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models | Compresso: structured pruning via algorithm-LLM collaboration, using LoRA & prompts. | Data Pruning | link | 2024 | NIPS/ICML/ICLR |
| Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference | Propose VIB-based pruning method, post-pruning for LLMs to compress & speed up. | Data Pruning | link | 2024 | Others |
| SlimGPT: Layer-wise Structured Pruning for Large Language Models | SlimGPT, a fast LLM pruning method, uses strategies for near-optimal results. | Data Pruning | link | 2024 | NIPS/ICML/ICLR |
| Shortened LLaMA: A Simple Depth Pruning for Large Language Models | Simple depth pruning can compete with width pruning in zero-shot LLM tasks. | Data Pruning | link | 2024 | NIPS/ICML/ICLR |
| Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning | CoT-Influx maximizes concise CoT examples in the input to boost LLM math reasoning. | Data Pruning | link | 2024 | *ACL |

🤗 Welcome to contribute to this repo! You can create a pull request or email me at luo.junyu@outlook.com.
