evaluation
Here are 1,801 public repositories matching this topic...
The open source developer platform to build AI agents and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
- Updated Dec 18, 2025 - Python
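This description reads like MLflow's current tagline; assuming so, a minimal experiment-tracking sketch (the experiment name, parameter, and metric are illustrative placeholders, not from the project):

```python
# A minimal tracking sketch, assuming this entry is MLflow.
import mlflow

mlflow.set_experiment("agent-evals")          # illustrative experiment name
with mlflow.start_run():
    mlflow.log_param("model", "gpt-4o-mini")  # configuration under test
    mlflow.log_metric("answer_accuracy", 0.87)  # evaluation result
```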
🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
- Updated Dec 17, 2025 - TypeScript
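If this entry is Langfuse, tracing an LLM call can be as small as one decorator. A minimal sketch, assuming the langfuse Python SDK with LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY set in the environment (older 2.x releases export the decorator from langfuse.decorators instead):

```python
from langfuse import observe  # v2.x: from langfuse.decorators import observe

@observe()  # records this call as a trace in Langfuse
def answer(question: str) -> str:
    # ...call your LLM here; nested @observe functions become child spans
    return f"echo: {question}"

answer("What is retrieval-augmented generation?")
```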
Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
- Updated Dec 18, 2025 - Python
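This description matches Comet's Opik; assuming its Python SDK (configured once via `opik configure`), a minimal tracing sketch:

```python
from opik import track

@track  # logs inputs, outputs, and timing as a trace
def summarize(text: str) -> str:
    # ...invoke an LLM here; nested tracked calls appear as child spans
    return text[:80]

summarize("Opik traces LLM apps, RAG pipelines, and agent workflows.")
```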
🤘 awesome-semantic-segmentation
- Updated May 8, 2021
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
- Updated Dec 18, 2025 - TypeScript
LLM-powered framework for deep document understanding, semantic retrieval, and context-aware answers using the RAG paradigm.
- Updated Dec 17, 2025 - Go
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM / VLM!
- Updated Dec 18, 2025 - Python
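This reads like Unsloth; assuming its FastLanguageModel API, a minimal fine-tuning setup sketch (the checkpoint id and hyperparameters are illustrative):

```python
from unsloth import FastLanguageModel

# Load a quantized base model; the checkpoint id is a placeholder.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=2048,
    load_in_4bit=True,  # 4-bit quantization to fit consumer GPUs
)
# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```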
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
- Updated Dec 17, 2025 - Python
Next-generation AI agent optimization platform: CozeLoop addresses challenges in AI agent development by providing full-lifecycle management capabilities, from development, debugging, and evaluation to monitoring.
- Updated Dec 18, 2025 - Go
🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓
- Updated Dec 18, 2025 - TypeScript
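The "one line of code" here appears to be Helicone's gateway integration: point the OpenAI SDK at the proxy and requests are logged automatically. A sketch, assuming the documented base URL and auth header:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # proxy instead of api.openai.com
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)
# Calls work as usual but are now captured for monitoring and evals.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```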
Easily build AI systems with Evals, RAG, Agents, fine-tuning, synthetic data, and more.
- Updated Dec 18, 2025 - Python
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
- Updated Nov 20, 2025 - Python
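A minimal sketch of AutoRAG's entry point as its README describes it (the file paths and YAML name are placeholders):

```python
from autorag.evaluator import Evaluator

evaluator = Evaluator(
    qa_data_path="qa.parquet",          # evaluation question/answer pairs
    corpus_data_path="corpus.parquet",  # documents to retrieve from
)
# Searches over candidate RAG pipeline configs and scores each one.
evaluator.start_trial("config.yaml")
```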
Python package for the evaluation of odometry and SLAM
- Updated Nov 12, 2025 - Python
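This is the typical shape of an absolute pose error (APE) computation with such a package. A sketch, assuming evo's Python API (evo.tools.file_interface, evo.core.metrics); the file paths are illustrative:

```python
from evo.tools import file_interface
from evo.core import metrics, sync

ref = file_interface.read_tum_trajectory_file("ground_truth.txt")
est = file_interface.read_tum_trajectory_file("estimate.txt")
ref, est = sync.associate_trajectories(ref, est)  # match by timestamp
est.align(ref)  # Umeyama alignment before computing error

ape = metrics.APE(metrics.PoseRelation.translation_part)
ape.process_data((ref, est))
print("APE RMSE:", ape.get_statistic(metrics.StatisticsType.rmse))
```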
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.
- Updated Dec 17, 2025 - Python
Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)
- Updated Jan 11, 2021 - Haskell
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
- Updated Dec 18, 2025 - Python
SuperCLUE: A Comprehensive Benchmark for General-Purpose Foundation Models in Chinese
- Updated Sep 8, 2025
Klipse is a JavaScript plugin for embedding interactive code snippets in tech blogs.
- Updated Oct 1, 2024 - HTML