# SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
- Project page for our paper "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models"
- We investigated several variants of the SelfCheck approach: BERTScore, Question-Answering, n-gram, NLI, and LLM-Prompting.
- [Nov 2023] SelfCheckGPT-NLI calibration analysis, thanks to Daniel Huynh [Link to Article]
- [Oct 2023] The paper is accepted to appear at EMNLP 2023 [Poster]
- [Aug 2023] Slides from the ML Collective Talk [Link to Slides]
Install the package:

```bash
pip install selfcheckgpt
```

There are three variants of SelfCheck scores in this package, as described in the paper: `SelfCheckBERTScore()`, `SelfCheckMQAG()`, and `SelfCheckNgram()`. All of the variants have a `predict()` method which outputs sentence-level scores w.r.t. the sampled passages. You can use packages such as spacy to split a passage into sentences. For reproducibility, you can set `torch.manual_seed` before calling this function. See more details in the Jupyter notebook `demo/SelfCheck_demo1.ipynb`.
```python
# Include necessary packages (torch, spacy, ...)
import torch
import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.

# LLM's text (e.g. GPT-3 response) to be evaluated at the sentence level & split it into sentences
nlp = spacy.load("en_core_web_sm") # spacy pipeline for sentence tokenization
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization
print(sentences)
# ['Michael Alan Weiner (born March 31, 1942) is an American radio host.',
#  'He is the host of The Savage Nation.']

# Other samples generated by the same LLM to perform self-check for consistency
sample1 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country."
sample2 = "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times."
sample3 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
# Additional params for each scoring_method:
# -> counting: AT (answerability threshold, i.e. questions with answerability_score < AT are rejected)
# -> bayes: AT, beta1, beta2
# -> bayes_with_alpha: beta1, beta2
sent_scores_mqag = selfcheck_mqag.predict(
    sentences = sentences,               # list of sentences
    passage = passage,                   # passage (before sentence-split)
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
    num_questions_per_sent = 5,          # number of questions to be drawn
    scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'
    beta1 = 0.8, beta2 = 0.8,            # additional params depending on scoring_method
)
print(sent_scores_mqag)
# [0.30990949 0.42376232]

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
sent_scores_bertscore = selfcheck_bertscore.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_bertscore)
# [0.0695562 0.45590915]

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-Ngram: Score at sentence- and document-level where value is in [0.0, +inf) and high value means non-factual
# as opposed to SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded
sent_scores_ngram = selfcheck_ngram.predict(
    sentences = sentences,
    passage = passage,
    sampled_passages = [sample1, sample2, sample3],
)
print(sent_scores_ngram)
# {'sent_level': { # sentence-level score similar to the MQAG and BERTScore variants
#      'avg_neg_logprob': [3.184312, 3.279774],
#      'max_neg_logprob': [3.476098, 4.574710]
#  },
#  'doc_level': {  # document-level score such that avg_neg_logprob is computed over all tokens
#      'avg_neg_logprob': 3.218678904916201,
#      'avg_max_neg_logprob': 4.025404834169327
#  }
# }
```
The entailment (or contradiction) score, computed with the sentence and a sampled passage as input, can be used as the SelfCheck score. We use DeBERTa-v3-large fine-tuned on MultiNLI: the probabilities of the "entailment" and "contradiction" classes are normalized, and P(contradiction) is taken as the score.
```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device) # set device to 'cuda' if GPU is available

sent_scores_nli = selfcheck_nli.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_nli)
# [0.334014 0.975106] -- based on the example above
```
We prompt an LLM (Llama2, Mistral, OpenAI's GPT) to assess information consistency in a zero-shot setup: the LLM is queried to assess whether the i-th sentence is supported by the sample (as the context). As with the other methods, a higher score indicates a higher chance of hallucination. An example using Mistral is shown below:
```python
# Option 1: open-source model
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# Option 2: API access
# (currently only OpenAI and Groq are supported)
# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="openai", model="gpt-3.5-turbo")
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="groq", model="llama3-70b-8192", api_key="your-api-key")

sent_scores_prompt = selfcheck_prompt.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
    verbose = True,                                 # whether to show a progress bar
)
print(sent_scores_prompt)
# [0.33333333, 0.66666667] -- based on the example above
```
The LLM can be any model available on HuggingFace. The default prompt template is `Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer:`, but you can change it using `selfcheck_prompt.set_prompt_template(new_prompt)`.
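For instance, a custom template can be set as follows (a minimal sketch; the template wording here is illustrative, and the `{context}` and `{sentence}` placeholders must be kept):

```python
# Illustrative custom template; keep the {context} and {sentence}
# placeholders, which the package fills in for each sentence-sample pair.
new_prompt = (
    "Context: {context}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? "
    "Answer only Yes or No.\n\n"
    "Answer:"
)
selfcheck_prompt.set_prompt_template(new_prompt)
```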
Most models (gpt-3.5-turbo, Llama2, Mistral) output either 'Yes' or 'No' more than 95% of the time, and any remaining outputs are treated as N/A. The output is converted to a score: Yes -> 0.0, No -> 1.0, N/A -> 0.5. The sentence-level inconsistency score is then obtained by averaging over the sampled passages.
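To make the scoring concrete, here is a minimal sketch of the mapping and averaging described above (illustrative only; the package implements this internally):

```python
# Map the LLM's judgement to an inconsistency score (Yes -> 0.0, No -> 1.0, N/A -> 0.5)
def answer_to_score(answer: str) -> float:
    answer = answer.strip().lower()
    if answer.startswith("yes"):
        return 0.0  # sentence supported by the sample -> consistent
    if answer.startswith("no"):
        return 1.0  # sentence not supported -> inconsistent
    return 0.5      # anything else is treated as N/A

# Sentence-level score = average over all sampled passages
answers = ["Yes", "No", "Yes"]  # hypothetical LLM outputs for one sentence
score = sum(answer_to_score(a) for a in answers) / len(answers)
print(score)  # 0.333... -> mostly consistent with the samples
```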
The `wiki_bio_gpt3_hallucination` dataset currently consists of 238 annotated passages (v3). You can find more information in the paper or in our data card on HuggingFace: https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination. To use this dataset, you can either load it through the HuggingFace datasets API, or download it directly in JSON format (see below).
We've annotated GPT-3 WikiBio passages further, and the dataset now consists of 238 annotated passages. Here is the link to the IDs of the first 65 passages in v1.
```python
from datasets import load_dataset

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
```
Download from our Google Drive, then load it in Python:
```python
import json

with open("dataset.json", "r") as f:
    content = f.read()
dataset = json.loads(content)
```
Each instance consists of:
- `gpt3_text`: GPT-3 generated passage
- `wiki_bio_text`: actual Wikipedia passage (first paragraph)
- `gpt3_sentences`: `gpt3_text` split into sentences using `spacy`
- `annotation`: human annotation at the sentence level
- `wiki_bio_test_idx`: ID of the concept/individual from the original WikiBio dataset (test set)
- `gpt3_text_samples`: list of sampled passages (do_sample = True & temperature = 1.0)
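For example, one instance can be inspected as follows (a minimal sketch using the field names above; the split name is read from the dataset rather than hard-coded):

```python
from datasets import load_dataset

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
split_name = list(dataset.keys())[0]      # avoid hard-coding the split name
example = dataset[split_name][0]

print(example["gpt3_sentences"][0])       # first GPT-3 generated sentence
print(example["annotation"][0])           # its sentence-level human label
print(len(example["gpt3_text_samples"]))  # number of sampled passages
```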
As described in our paper, the probabilities (and generation entropies) of the generative LLM can be used to measure its confidence. See our example implementation of this approach in `demo/experiments/probability-based-baselines.ipynb`.
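As a rough illustration of these baselines (a sketch, not the notebook's code), given per-token probabilities from the generative LLM for one sentence:

```python
import math

# Hypothetical p(token | context) values for the tokens of one sentence
token_probs = [0.91, 0.85, 0.40, 0.77]
neg_logprobs = [-math.log(p) for p in token_probs]

avg_neg_logprob = sum(neg_logprobs) / len(neg_logprobs)  # higher => model less confident
max_neg_logprob = max(neg_logprobs)                      # most "surprising" token
print(avg_neg_logprob, max_neg_logprob)
```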
- Full details can be found in our paper.
- Note that our new results show that LLMs such as GPT-3 (text-davinci-003) or ChatGPT (gpt-3.5-turbo) are good at textual inconsistency assessment. Based on this finding, we try SelfCheckGPT-Prompt, where each sentence (to be evaluated) is compared against each and every sampled passage by prompting ChatGPT. SelfCheckGPT-Prompt is the best-performing method.
Results on the `wiki_bio_gpt3_hallucination` dataset.
| Method | NonFact (AUC-PR) | Factual (AUC-PR) | Ranking (PCC) |
|---|---|---|---|
| Random Guessing | 72.96 | 27.04 | - |
| GPT-3 Avg(-logP) | 83.21 | 53.97 | 57.04 |
| SelfCheck-BERTScore | 81.96 | 44.23 | 58.18 |
| SelfCheck-QA | 84.26 | 48.14 | 61.07 |
| SelfCheck-Unigram | 85.63 | 58.47 | 64.71 |
| SelfCheck-NLI | 92.50 | 66.08 | 74.14 |
| SelfCheck-Prompt (Llama2-7B-chat) | 89.05 | 63.06 | 61.52 |
| SelfCheck-Prompt (Llama2-13B-chat) | 91.91 | 64.34 | 75.44 |
| SelfCheck-Prompt (Mistral-7B-Instruct-v0.2) | 91.31 | 62.76 | 74.46 |
| SelfCheck-Prompt (gpt-3.5-turbo) | 93.42 | 67.09 | 78.32 |
MQAG (Multiple-choice Question Answering and Generation) was proposed in our previous work. Our MQAG implementation is included in this package; it can be used to: (1) generate multiple-choice questions, (2) answer multiple-choice questions, and (3) obtain the MQAG score.
```python
from selfcheckgpt.modeling_mqag import MQAG

mqag_model = MQAG()
```
It has three main functions: `generate()`, `answer()`, and `score()`. We show example usage in `demo/MQAG_demo1.ipynb`.
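A hedged sketch of how the three functions fit together is shown below; the keyword arguments are assumptions rather than a documented API, so check `demo/MQAG_demo1.ipynb` for authoritative usage:

```python
# NOTE: argument names below are assumptions -- see demo/MQAG_demo1.ipynb
context = "Michael Alan Weiner (born March 31, 1942) is an American radio host."
candidate = "Michael Alan Weiner is a Canadian radio host."  # hypothetical text to check

# (1) generate multiple-choice questions from the context
questions = mqag_model.generate(context=context, do_sample=True, num_questions=3)

# (2) answer the generated questions conditioned on a given text
# probs = mqag_model.answer(questions=questions, context=context)

# (3) obtain the MQAG score between a candidate text and a reference
# score = mqag_model.score(candidate=candidate, reference=context, num_questions=3)
```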
This work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge, and the Cambridge Commonwealth, European & International Trust.
```bibtex
@article{manakul2023selfcheckgpt,
  title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},
  author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},
  journal={arXiv preprint arXiv:2303.08896},
  year={2023}
}
```