Grounding and Evaluation for Large Language Models (Tutorial)
Overview
With the ongoing rapid adoption of Artificial Intelligence (AI) based systems in high-stakes domains such as financial services, healthcare and life sciences, hiring and human resources, education, societal infrastructure, and national security, it is crucial to develop and deploy the underlying AI models and systems in a responsible manner and to ensure their trustworthiness, safety, and observability. Our focus is on large language models (LLMs) and other generative AI models and applications. Such models and applications need to be evaluated and monitored not only for accuracy and quality but also for robustness against adversarial attacks; robustness under distribution shifts; bias and discrimination against underrepresented groups; security and privacy protection; interpretability; hallucinations and other ungrounded or low-quality outputs; harmful content (such as sexual, racist, and hateful responses); jailbreaks of safety and alignment mechanisms; prompt injection attacks; misinformation and disinformation; fake, misleading, and manipulative content; copyright infringement; and other responsible AI dimensions.
In this tutorial, we first highlight key harms associated with generative AI systems, focusing on ungrounded answers (hallucinations), jailbreaks and prompt injection attacks, harmful content, and copyright infringement. We then discuss how to effectively address these potential risks and challenges, following the framework of identification, measurement, mitigation (with four mitigation layers: model, safety system, application, and positioning), and operationalization. We present real-world LLM use cases, practical challenges, best practices, lessons learned from deploying solution approaches in industry, and key open problems. Our goal is to stimulate further research on grounding and evaluation of LLMs, and to enable researchers and practitioners to build more robust and trustworthy LLM applications.
Contributors
Krishnaram Kenthapadi (Oracle Health AI, USA)
Mehrnoosh Sameki (Microsoft Azure AI, USA)
Ankur Taly (Google Cloud AI, USA)
Tutorial Logistics
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2024)
2 PM - 5 PM local time on Monday, August 26, 2024, in Room 133 [KDD Program Agenda]
Presenters: Krishnaram Kenthapadi (Oracle Health AI), Mehrnoosh Sameki & Sarah Bird (Microsoft Azure AI), Ankur Taly (Google Cloud AI)
Tutorial Video Recording: KDD'24 Video (TBA)

Contributor Bios
Krishnaram Kenthapadi is the Chief Scientist, Clinical AI at Oracle Health, where he leads the AI initiatives for Clinical Digital Assistant and other Oracle Health products. Previously, as the Chief AI Officer & Chief Scientist of Fiddler AI, he led initiatives on generative AI (e.g., Fiddler Auditor, an open-source library for evaluating & red-teaming LLMs before deployment, and AI safety, observability & feedback mechanisms for LLMs in production) and on AI safety, alignment, observability, and trustworthiness, as well as the technical strategy, customer-driven innovation, and thought leadership for Fiddler. Prior to that, he was a Principal Scientist at Amazon AWS AI, where he led the fairness, explainability, privacy, and model understanding initiatives in the Amazon AWS AI platform. Prior to joining Amazon, he led similar efforts at the LinkedIn AI team and served as LinkedIn’s representative in Microsoft’s AI and Ethics in Engineering and Research (AETHER) Advisory Board. Previously, he was a Researcher at Microsoft Research Silicon Valley Lab. Krishnaram received his Ph.D. in Computer Science from Stanford University in 2006. His work has been recognized through awards at NAACL, WWW, SODA, CIKM, ICML AutoML workshop, and Microsoft’s AI/ML conference (MLADS). He has published 50+ papers, with 7000+ citations, and filed 150+ patents (72 granted). He has presented tutorials on privacy, fairness, explainable AI, ML model monitoring, responsible AI, and trustworthy generative AI in industry at forums such as KDD '18 '19 '22 '23, WSDM '19, WWW '19 '20 '21 '23, FAccT '20 '21 '22 '23, AAAI '20 '21, and ICML '21 '23, and has instructed a course on responsible AI at Stanford.
Mehrnoosh Sameki is a principal product manager and responsible AI tools area lead at Microsoft, leading a group of AI product managers to develop and deliver cutting-edge tools for model evaluation and responsible AI, both in open source and through the Azure AI platform, for generative AI solutions and ML models. She is also an adjunct assistant professor at Boston University, School of Computer Science, where she earned her PhD degree in 2017. Mehrnoosh has presented at several industry forums (including Microsoft Build) and has delivered tutorials on fairness, ML model monitoring, and responsible AI in industry at forums such as KDD '19 '22, WWW '21 '23, FAccT '21 '22, AAAI '21, and ICML '21.
Ankur Taly is a Senior Staff Research Scientist on the Gemini team at Google, leading initiatives on post-training data and model quality. Previously, he was part of Google Cloud AI, leading initiatives around factuality and grounding for LLMs, and prior to that, he served as the Head of Data Science at Fiddler AI, where he was responsible for developing and evangelizing core explainable AI technology. Ankur is best known for his contribution to developing and applying Integrated Gradients (6000+ citations), an interpretability algorithm for deep networks. His research in this area has resulted in publications at several top-tier machine learning conferences and journals. Besides machine learning, Ankur has a broad research background and has published 30+ papers in areas including computer security, programming languages, and formal verification. He co-teaches CS 328: Trustworthy Machine Learning at Stanford University, and has instructed several short courses and tutorials at various conferences (KDD 2024, KDD 2019, FAccT 2019, FOSAD 2016, etc.). Ankur earned his PhD in Computer Science from Stanford University in 2012 and a BTech in Computer Science from IIT Bombay in 2007.
Tutorial Outline and Description
The tutorial will consist of the following parts:
Introduction and overview of the generative AI landscape
We give an overview of the generative AI landscape in industry and motivate the topic of the tutorial with the following questions. What constitutes generative AI? Why is generative AI an important topic? What are key applications of generative AI that are being deployed across different industry verticals? Why is it crucial to develop and deploy generative AI models and applications in a responsible manner?
Holistic Evaluation of LLMs
We highlight key challenges that arise when developing and deploying LLMs and other generative AI models in enterprise settings, and present an overview of solution approaches and open problems. We discuss evaluation dimensions such as truthfulness, safety and alignment, bias and fairness, robustness and security, privacy, model disgorgement and unlearning, copyright infringement, calibration and confidence, and transparency and causal interventions.
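To make this concrete, below is a minimal, illustrative Python sketch of a multi-dimension evaluation harness. Everything here is a hypothetical placeholder rather than any specific framework: `generate` stands in for a call to the LLM under evaluation, and the per-dimension scorers stand in for real metrics (human ratings, LLM-as-judge prompts, or classifier-based evaluators).

```python
from statistics import mean

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM under evaluation."""
    return "Model response to: " + prompt

# Hypothetical per-dimension scorers; each maps a (prompt, response)
# pair to a score in [0, 1]. Real implementations would use human
# ratings, LLM-as-judge prompts, or classifier-based metrics.
def score_truthfulness(prompt: str, response: str) -> float:
    return 1.0  # placeholder

def score_safety(prompt: str, response: str) -> float:
    blocked = {"hateful", "violent"}  # toy blocklist
    return 0.0 if any(term in response.lower() for term in blocked) else 1.0

SCORERS = {"truthfulness": score_truthfulness, "safety": score_safety}

def evaluate(prompts: list[str]) -> dict[str, float]:
    """Average each dimension's score over an evaluation prompt set."""
    pairs = [(p, generate(p)) for p in prompts]
    return {dim: mean(score(p, r) for p, r in pairs)
            for dim, score in SCORERS.items()}

print(evaluate(["What is the capital of France?",
                "Summarize this discharge note."]))
```

The value of this shape is that further dimensions (robustness, privacy, calibration, and so on) can be added as additional scorers without changing the evaluation loop.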
Grounding for LLMs
We then provide a deeper discussion of grounding for LLMs, that is, ensuring that every claim in the response generated by an LLM can be attributed to a document in the user-specified knowledge base. We highlight how grounding differs from factuality in the context of LLMs, and present technical solution approaches such as retrieval-augmented generation, constrained decoding, evaluation, guardrails and revision, and corpus tuning.
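As a minimal sketch of this workflow (with hypothetical `retrieve`, `generate`, and `is_supported` placeholders; a production system would use a dense or hybrid retriever, an actual LLM, and an entailment or attribution model), the following Python code retrieves passages, generates an answer conditioned on them, and flags sentences that cannot be attributed to any retrieved passage, so that a revision step can rewrite or drop them:

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Hypothetical retriever: rank passages by naive term overlap."""
    terms = set(query.lower().split())
    return sorted(corpus,
                  key=lambda p: len(terms & set(p.lower().split())),
                  reverse=True)[:k]

def generate(query: str, passages: list[str]) -> str:
    """Hypothetical stand-in for an LLM prompted to answer the query
    using only the retrieved passages."""
    return "According to the provided documents: " + passages[0]

def is_supported(sentence: str, passages: list[str]) -> bool:
    """Hypothetical attribution check via word overlap; real systems
    typically use a natural language inference (entailment) model."""
    words = set(sentence.lower().split())
    return any(len(words & set(p.lower().split())) > len(words) // 2
               for p in passages)

def grounded_answer(query: str, corpus: list[str]):
    """Retrieve, generate, then flag unsupported sentences for revision."""
    passages = retrieve(query, corpus)
    answer = generate(query, passages)
    unsupported = [s for s in answer.split(". ")
                   if s and not is_supported(s, passages)]
    return answer, unsupported

corpus = ["Paris is the capital of France.",
          "The Seine flows through Paris."]
answer, unsupported = grounded_answer("What is the capital of France?", corpus)
```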
LLM Operations and Observability
We present processes and best practices for addressing grounding- and evaluation-related challenges in real-world LLM application settings. We discuss mechanisms for managing safety risks and vulnerabilities associated with deployed LLM and generative AI applications, as well as practical approaches for monitoring the underlying models and systems with respect to quality and other responsible AI metrics.
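One common observability pattern, sketched below, is a rolling monitor over production traffic: score each logged (prompt, response) pair on a quality or groundedness metric, and alert when a windowed average drifts below a threshold. The `groundedness_score` evaluator here is a hypothetical placeholder; in practice it would be backed by a real attribution or quality-scoring model.

```python
from collections import deque

def groundedness_score(prompt: str, response: str) -> float:
    """Hypothetical evaluator returning a groundedness score in [0, 1]."""
    return 0.9

class RollingMonitor:
    """Alert when the windowed mean of a metric drops below a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.8):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, prompt: str, response: str) -> bool:
        """Record one production pair; return True if the rolling
        average has fallen below the alert threshold."""
        self.scores.append(groundedness_score(prompt, response))
        return sum(self.scores) / len(self.scores) < self.threshold

monitor = RollingMonitor(window=50, threshold=0.8)
if monitor.observe("user question", "model answer"):
    print("Groundedness below threshold: investigate or roll back.")
```

The same loop generalizes to other responsible AI metrics (safety, bias, drift) by swapping in different evaluators.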
We will present real-world LLM case studies across application domains such as healthcare, financial services, hiring, conversational assistants, and search and recommendation systems, and discuss solution approaches for addressing the above challenges, highlighting practical challenges, best practices, lessons learned from deploying solution approaches in industry, and key open problems. We hope that our tutorial will inform both researchers and practitioners, stimulate further research on grounding and evaluation approaches for LLMs and other generative AI models, and pave the way for building more reliable generative AI models and applications in the future.
This tutorial is aimed at attendees with a wide range of interests and backgrounds in both academia and industry: researchers interested in learning about grounding, evaluation, and, more broadly, responsible AI techniques and tools in the context of LLMs and generative AI models, as well as practitioners interested in implementing such tools for LLM and generative AI applications. We will not assume any prerequisite knowledge, and will present the advances, challenges, and opportunities related to the evaluation and grounding of LLMs by building intuition, so that the material is accessible to all attendees.
Related Tutorials and Resources
Sara Hajian, Francesco Bonchi, and Carlos Castillo, Algorithmic bias: From discrimination discovery to fairness-aware data mining, KDD Tutorial, 2016.
Solon Barocas and Moritz Hardt, Fairness in machine learning, NeurIPS Tutorial, 2017.
Kate Crawford, The Trouble with Bias, NeurIPS Keynote, 2017.
Arvind Narayanan, 21 fairness definitions and their politics, FAccT Tutorial, 2018.
Sam Corbett-Davies and Sharad Goel, Defining and Designing Fair Algorithms, Tutorials at EC 2018 and ICML 2018.
Ben Hutchinson and Margaret Mitchell, Translation Tutorial: A History of Quantitative Fairness in Testing, FAccT Tutorial, 2019.
Henriette Cramer, Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daumé III, Miroslav Dudík, Hanna Wallach, Sravana Reddy, and Jean Garcia-Gathright, Translation Tutorial: Challenges of incorporating algorithmic fairness into industry practice, FAccT Tutorial, 2019.
Sarah Bird, Ben Hutchinson, Krishnaram Kenthapadi, Emre Kiciman, and Margaret Mitchell, Fairness-Aware Machine Learning: Practical Challenges and Lessons Learned, Tutorials at WSDM 2019, WWW 2019, and KDD 2019.
Krishna Gade, Sahin Cem Geyik, Krishnaram Kenthapadi, Varun Mithal, and Ankur Taly, Explainable AI in Industry, Tutorials at KDD 2019, FAccT 2020, and WWW 2020.
Freddy Lecue, Krishna Gade, Fosca Giannotti, Sahin Geyik, Riccardo Guidotti, Krishnaram Kenthapadi, Pasquale Minervini, Varun Mithal, and Ankur Taly, Explainable AI: Foundations, Industrial Applications, Practical Challenges, and Lessons Learned, AAAI 2020 Tutorial.
Himabindu Lakkaraju, Julius Adebayo, and Sameer Singh, Explaining Machine Learning Predictions: State-of-the-art, Challenges, and Opportunities, Tutorials at NeurIPS 2020 and AAAI 2021.
Freddy Lecue, Pasquale Minervini, Fosca Giannotti, and Riccardo Guidotti, On Explainable AI: From Theory to Motivation, Industrial Applications and Coding Practices, AAAI 2021 Tutorial.
Kamalika Chaudhuri and Anand D. Sarwate, Differentially Private Machine Learning: Theory, Algorithms, and Applications, NeurIPS 2017 Tutorial.
Krishnaram Kenthapadi, Ilya Mironov, and Abhradeep Guha Thakurta, Privacy-preserving Data Mining in Industry, Tutorials at KDD 2018, WSDM 2019, and WWW 2019.
Krishnaram Kenthapadi, Ben Packer, Mehrnoosh Sameki, and Nashlie Sephus, Responsible AI in Industry, Tutorials at AAAI 2021, FAccT 2021, WWW 2021, and ICML 2021.
Krishnaram Kenthapadi, Himabindu Lakkaraju, Pradeep Natarajan, and Mehrnoosh Sameki, Model Monitoring in Practice, Tutorials at FAccT 2022, KDD 2022, and WWW 2023.
Dmitry Ustalov and Nathan Lambert, Reinforcement Learning from Human Feedback, ICML 2023 Tutorial (Slides).
Krishnaram Kenthapadi, Himabindu Lakkaraju, and Nazneen Rajani, Trustworthy Generative AI, Tutorials at ICML 2023, KDD 2023, and FAccT 2023.