llm-eval
Here are 37 public repositories matching this topic...
Test your prompts, agents, and RAGs. Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
- Updated Mar 17, 2025 - TypeScript
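The entry above pitches declarative, config-driven comparison of several models with CLI and CI/CD integration. As a rough illustration of that idea only — the config layout, the call_model() stub, and the substring check below are assumptions for this listing, not any particular tool's format — such a harness can be sketched in a few lines of Python:

```python
"""Generic sketch: run a small declarative test suite against several models.

The config layout, the call_model() stub, and the substring check are all
illustrative assumptions, not any particular project's configuration format.
"""

CONFIG = {
    "prompts": ["Summarize in one sentence: {text}"],
    "providers": ["gpt-4o-mini", "claude-3-5-haiku-latest", "gemini-1.5-flash"],
    "tests": [
        {
            "vars": {"text": "LLM evals catch regressions before release."},
            "expect_contains": "regressions",
        },
    ],
}


def call_model(provider: str, prompt: str) -> str:
    """Stand-in for a real provider call (OpenAI, Anthropic, Google, ...)."""
    raise NotImplementedError(f"wire the {provider} SDK in here")


def run_suite(config: dict) -> None:
    # Cross every provider with every prompt template and test case.
    for provider in config["providers"]:
        for template in config["prompts"]:
            for test in config["tests"]:
                prompt = template.format(**test["vars"])
                output = call_model(provider, prompt)
                ok = test["expect_contains"].lower() in output.lower()
                print(f"{provider}: {'PASS' if ok else 'FAIL'} - {prompt[:40]}")


if __name__ == "__main__":
    run_suite(CONFIG)
```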
AI Observability & Evaluation
- Updated Mar 17, 2025 - Jupyter Notebook
🐢 Open-Source Evaluation & Testing for AI & LLM systems
- Updated Mar 10, 2025 - Python
ETL, Analytics, Versioning for Unstructured Data
- Updated Mar 17, 2025 - Python
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, and embedding use cases), perform root cause analysis on failure cases, and give insights on how to resolve them.
- Updated Aug 18, 2024 - Python
Python SDK for running evaluations on LLM-generated responses
- Updated Mar 17, 2025 - Python
Generate ideal question-answer pairs for testing RAG
- Updated Feb 25, 2025 - Python
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
- Updated Jan 29, 2024 - Python
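The entry above describes GPT-based, multi-aspect scoring of model outputs. A minimal sketch of that general LLM-as-judge pattern using the OpenAI Python SDK follows; the aspect names, rubric wording, and JSON shape are illustrative assumptions, not that tool's interface.

```python
"""Minimal sketch of the GPT-as-judge pattern: score one answer on several aspects.

The aspect list, rubric wording, and JSON shape are illustrative assumptions,
not the listed repository's API. Requires `pip install openai` and an
OPENAI_API_KEY in the environment.
"""
import json

from openai import OpenAI

client = OpenAI()
ASPECTS = ["relevance", "factuality", "coherence", "fluency"]


def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    rubric = (
        "Rate the answer to the question on each aspect from 1 (poor) to 5 (excellent). "
        f"Aspects: {', '.join(ASPECTS)}. "
        'Reply as a JSON object with one integer per aspect plus a short "rationale".\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": rubric}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(resp.choices[0].message.content)


# Example: print(judge("What causes tides?", "Mostly the Moon's gravity."))
```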
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
- Updated Feb 13, 2025 - Python
A benchmark comparing Russian-language ChatGPT analogues: Saiga, YandexGPT, Gigachat
- Updated Sep 26, 2023 - Jupyter Notebook
🎯 A free LLM evaluation toolkit that helps you assess factual accuracy, contextual understanding, tone, and more, so you can see how well your LLM applications perform.
- Updated Jan 7, 2025 - Python
Develop reliable AI apps
- Updated Mar 12, 2025 - Svelte
An open source library for asynchronous querying of LLM endpoints
- Updated Mar 3, 2025 - Python
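For the asynchronous-querying entry above, here is a generic sketch of bounded-concurrency fan-out with asyncio, using the OpenAI async client purely as an example backend; the semaphore limit and model name are illustrative choices, not the listed library's own API.

```python
"""Generic sketch: asynchronous fan-out to an LLM endpoint with bounded concurrency.

Uses the OpenAI async client as an example backend only; this is not the
listed library's interface. Requires `pip install openai`.
"""
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment


async def ask(prompt: str, sem: asyncio.Semaphore, model: str = "gpt-4o-mini") -> str:
    async with sem:  # cap the number of in-flight requests
        resp = await client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content


async def ask_all(prompts: list[str], max_concurrency: int = 8) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(ask(p, sem) for p in prompts))


# Example: answers = asyncio.run(ask_all(["Define perplexity.", "What is RAG?"]))
```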
Realign is a testing and simulation framework for AI applications.
- Updated Dec 4, 2024 - Python
Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.
- Updated Oct 28, 2024 - Jupyter Notebook
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
- Updated Jan 14, 2025 - Jupyter Notebook
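The entry above suggests wiring an eval into your test suite. A minimal pytest sketch of that idea is below; the generate_answer() hook and the expected keywords are hypothetical placeholders standing in for your application code, not code from the repo.

```python
"""Sketch of folding an LLM eval into a pytest suite.

generate_answer() and the expected keywords are hypothetical placeholders
for your own application code, not taken from the listed repository.
"""
import pytest

CASES = [
    ("What is the capital of France?", ["paris"]),
    ("Name a transformer-based language model.", ["gpt", "bert", "llama"]),
]


def generate_answer(question: str) -> str:
    """Hypothetical hook into the LLM application under test."""
    raise NotImplementedError("call your application here")


@pytest.mark.parametrize("question,keywords", CASES)
def test_answer_mentions_expected_keyword(question, keywords):
    answer = generate_answer(question).lower()
    assert any(k in answer for k in keywords), f"none of {keywords} in {answer!r}"
```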
The prompt engineering, prompt management, and prompt evaluation tool for Python
- Updated Sep 17, 2024 - Python
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
- Updated Sep 14, 2024 - TypeScript