Evals
Explore our comprehensive evaluations across AI capabilities.
SimpleQA Verified
SimpleQA Verified is a 1,000-prompt benchmark for reliably evaluating Large Language Models (LLMs) on short-form factuality and parametric knowledge. The authors, from Google DeepMind and Google Research, address various limitations of SimpleQA, originally designed by Wei et al. (2024) at OpenAI, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created to provide the research community with a more precise instrument to track genuine progress in factuality, discourage overfitting to benchmark artifacts, and ultimately foster the development of more trustworthy AI systems.
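To make the evaluation setup concrete, here is a minimal sketch of how short-form factuality scoring of this kind can work. Note the assumptions: the real SimpleQA Verified grader uses an LLM judge to label each answer as correct, incorrect, or not attempted, whereas this stand-in uses simple string matching, and all question/answer data below is purely illustrative.

```python
import string


def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace for lenient matching."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()


def grade(prediction: str, gold: str) -> str:
    """Label one answer: 'correct', 'incorrect', or 'not_attempted'.

    A simplified stand-in for an LLM autograder: an empty response counts
    as not attempted; otherwise we check whether the normalized gold
    answer appears in the normalized prediction.
    """
    if not prediction.strip():
        return "not_attempted"
    return "correct" if normalize(gold) in normalize(prediction) else "incorrect"


def score(pairs):
    """Aggregate (prediction, gold) pairs into overall accuracy and
    accuracy restricted to attempted questions."""
    labels = [grade(p, g) for p, g in pairs]
    n = len(labels)
    correct = labels.count("correct")
    attempted = n - labels.count("not_attempted")
    return {
        "accuracy": correct / n,
        "accuracy_given_attempted": correct / attempted if attempted else 0.0,
    }


# Illustrative data only, not taken from the benchmark.
results = score([
    ("Paris", "Paris"),                 # correct
    ("I think it's London.", "Paris"),  # incorrect
    ("", "Paris"),                      # not attempted
])
```

Separating overall accuracy from accuracy on attempted questions matters for factuality benchmarks, since it distinguishes a model that abstains when unsure from one that confidently answers incorrectly.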