evaluation-metrics

Here are 662 public repositories matching this topic...

By language: All 662, Python 279, Jupyter Notebook 271, HTML 17, R 13, C++ 11, Java 8, TypeScript 6, JavaScript 5, Roff 2, Perl 2

confident-ai/deepeval

Star 12.6k

The LLM Evaluation Framework

python evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Dec 17, 2025
Python

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

agent ai openai evaluation-metrics mistral cost-estimation autogen groq agentops llm langchain anthropic evals ollama crewai agents-sdk openai-agents

Updated Oct 30, 2025
Python

datawhalechina/tiny-universe

Star 4.2k

"A White-Box Guide to Building Large Models": a fully hand-built Tiny-Universe

agent transformers llama evaluation-metrics diffusion rag qwen

Updated Dec 2, 2025
Jupyter Notebook

huggingface/lighteval

Star 2.2k

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

evaluation evaluation-metrics evaluation-framework huggingface

Updated Dec 15, 2025
Python

huggingface/evaluation-guidebook

Star 2k

Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!

machine-learning tutorial evaluation evaluation-metrics guidebook large-language-models llm

Updated Dec 3, 2025
Jupyter Notebook

xinshuoweng/AB3DMOT

Star 1.8k

(IROS 2020, ECCVW 2020) Official Python Implementation for "3D Multi-Object Tracking: A Baseline and New Evaluation Metrics"

tracking machine-learning real-time computer-vision robotics evaluation evaluation-metrics multi-object-tracking kitti 3d-tracking 3d-multi-object-tracking 2d-mot-evaluation 3d-mot 3d-multi kitti-3d

Updated Apr 3, 2024
Python

google-research/rliable

Star 859

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

benchmarking machine-learning google reinforcement-learning rl evaluation-metrics

Updated Aug 12, 2024
Jupyter Notebook

jitsi/jiwer

Star 832

Evaluate your speech-to-text system with similarity measures such as word error rate (WER)

python3 automatic-speech-recognition speech-to-text evaluation-metrics wer word-error-rate

Updated Feb 15, 2025
Python

MIND-Lab/OCTIS

Star 793

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at the EACL 2021 demo track)

nlp natural-language-processing hyperparameter-optimization topic-modeling nlp-library bayesian-optimization hyperparameter-tuning latent-dirichlet-allocation evaluation-metrics neural-topic-models latent-semantic-analysis topic-models hyperparameter-search non-negative-matrix-factorization nlproc

Updated Nov 24, 2025
Python

Unbabel/COMET

Star 692

A Neural Framework for MT Evaluation

nlp machine-learning natural-language-processing machine-translation artificial-intelligence evaluation-metrics

Updated Sep 1, 2025
Python

nekhtiari/image-similarity-measures

Star 635

📈 Implementation of eight evaluation metrics to assess the similarity between two images. The eight metrics are as follows: RMSE, PSNR, SSIM, ISSM, FSIM, SRE, SAM, and UIQ.

processing machine-learning image metrics evaluation-metrics p1

Updated Aug 31, 2024
Python

AmenRa/ranx

Star 628

⚡️ A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍

python information-retrieval evaluation comparison numba recommender-systems evaluation-metrics metasearch data-fusion score-fusion ranking-metrics information-retrieval-evaluation information-retrieval-metrics rank-fusion

Updated Aug 7, 2025
Python

relari-ai/continuous-eval

Star 515

Data-Driven Evaluation for LLM-Powered Applications

information-retrieval evaluation-metrics evaluation-framework rag llmops retrieval-augmented-generation llm-evaluation

Updated Jan 22, 2025
Python

proycon/pynlpl

Star 479

PyNLPl, pronounced 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as extracting n-grams and frequency lists, and for building simple language models. There are also more complex data types and algorithms. Mor…

python nlp machine-learning natural-language-processing library linguistics computational-linguistics text-processing nlp-library search-algorithms evaluation-metrics folia language-modelling