# SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
- Project page for our paper "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models"
- We investigated several variants of the SelfCheck approach: BERTScore, Question-Answering, n-gram, NLI, and LLM-Prompting.
- [Nov 2023] SelfCheckGPT-NLI calibration analysis, thanks to Daniel Huynh [Link to Article]
- [Oct 2023] The paper is accepted to appear at EMNLP 2023 [Poster]
- [Aug 2023] Slides from the ML Collective Talk [Link to Slides]
Install the package:

```bash
pip install selfcheckgpt
```

There are three variants of SelfCheck scores in this package, as described in the paper: `SelfCheckBERTScore()`, `SelfCheckMQAG()`, and `SelfCheckNgram()`. All of the variants have a `predict()` method which outputs sentence-level scores w.r.t. the sampled passages. You can use packages such as spacy to split a passage into sentences. For reproducibility, you can set `torch.manual_seed` before calling this function. See more details in the Jupyter notebook `demo/SelfCheck_demo1.ipynb`.
```python
# Include necessary packages (torch, spacy, ...)
import torch
import spacy
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.

# LLM's text (e.g. GPT-3 response) to be evaluated at the sentence level & split it into sentences
nlp = spacy.load("en_core_web_sm") # spacy pipeline for sentence tokenization
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization
print(sentences)
# ['Michael Alan Weiner (born March 31, 1942) is an American radio host.',
#  'He is the host of The Savage Nation.']

# Other samples generated by the same LLM to perform self-check for consistency
sample1 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country."
sample2 = "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times."
sample3 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
# Additional params for each scoring_method:
# -> counting: AT (answerability threshold, i.e. questions with answerability_score < AT are rejected)
# -> bayes: AT, beta1, beta2
# -> bayes_with_alpha: beta1, beta2
sent_scores_mqag = selfcheck_mqag.predict(
    sentences = sentences,               # list of sentences
    passage = passage,                   # passage (before sentence-split)
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
    num_questions_per_sent = 5,          # number of questions to be drawn
    scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'
    beta1 = 0.8, beta2 = 0.8,            # additional params depending on scoring_method
)
print(sent_scores_mqag)
# [0.30990949 0.42376232]

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
sent_scores_bertscore = selfcheck_bertscore.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_bertscore)
# [0.0695562 0.45590915]

# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-Ngram: Score at sentence- and document-level where value is in [0.0, +inf) and high value means non-factual
# as opposed to SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded
sent_scores_ngram = selfcheck_ngram.predict(
    sentences = sentences,
    passage = passage,
    sampled_passages = [sample1, sample2, sample3],
)
print(sent_scores_ngram)
# {'sent_level': { # sentence-level score similar to the MQAG and BERTScore variants
#      'avg_neg_logprob': [3.184312, 3.279774],
#      'max_neg_logprob': [3.476098, 4.574710]
#  },
#  'doc_level': {  # document-level score such that avg_neg_logprob is computed over all tokens
#      'avg_neg_logprob': 3.218678904916201,
#      'avg_max_neg_logprob': 4.025404834169327
#  }
# }
```
The entailment (or contradiction) score, computed with the sentence and a sampled passage as input, can be used as the SelfCheck score. We use DeBERTa-v3-large fine-tuned on MultiNLI: the probabilities of the "entailment" and "contradiction" classes are normalized, and P(contradiction) is taken as the score.
```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device) # set device to 'cuda' if GPU is available

sent_scores_nli = selfcheck_nli.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_nli)
# [0.334014 0.975106] -- based on the example above
```
We prompt an LLM (Llama2, Mistral, OpenAI's GPT) to assess information consistency in a zero-shot setup: the LLM is queried to assess whether the i-th sentence is supported by the sample (as the context). As with the other methods, a higher score indicates a higher chance of hallucination. An example using Mistral is shown below:
```python
# Option 1: open-source model
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# Option 2: API access
# (currently only OpenAI and Groq are supported)
# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="openai", model="gpt-3.5-turbo")
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="groq", model="llama3-70b-8192", api_key="your-api-key")

sent_scores_prompt = selfcheck_prompt.predict(
    sentences = sentences,                          # list of sentences
    sampled_passages = [sample1, sample2, sample3], # list of sampled passages
    verbose = True,                                 # whether to show a progress bar
)
print(sent_scores_prompt)
# [0.33333333, 0.66666667] -- based on the example above
```
The LLM can be any model available on HuggingFace. The default prompt template is `Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer:`, but you can change it using `selfcheck_prompt.set_prompt_template(new_prompt)`.
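For instance, a custom template can be set as follows (a minimal sketch; the template wording here is illustrative, and the `{context}` and `{sentence}` placeholders must be kept):

```python
# Illustrative custom template; keep the {context} and {sentence}
# placeholders, which the package fills in for each sentence-sample pair.
new_prompt = (
    "Context: {context}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? "
    "Answer only Yes or No.\n\n"
    "Answer:"
)
selfcheck_prompt.set_prompt_template(new_prompt)
```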
Most models (gpt-3.5-turbo, Llama2, Mistral) output either 'Yes' or 'No' more than 95% of the time, and any remaining outputs are treated as N/A. The output is converted to a score: Yes -> 0.0, No -> 1.0, N/A -> 0.5. The sentence-level inconsistency score is then obtained by averaging over the sampled passages.
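To make the scoring concrete, here is a minimal sketch of the mapping and averaging described above (illustrative only; the package implements this internally):

```python
# Map the LLM's judgement to an inconsistency score (Yes -> 0.0, No -> 1.0, N/A -> 0.5)
def answer_to_score(answer: str) -> float:
    answer = answer.strip().lower()
    if answer.startswith("yes"):
        return 0.0  # sentence supported by the sample -> consistent
    if answer.startswith("no"):
        return 1.0  # sentence not supported -> inconsistent
    return 0.5      # anything else is treated as N/A

# Sentence-level score = average over all sampled passages
answers = ["Yes", "No", "Yes"]  # hypothetical LLM outputs for one sentence
score = sum(answer_to_score(a) for a in answers) / len(answers)
print(score)  # 0.333... -> mostly consistent with the samples
```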
The `wiki_bio_gpt3_hallucination` dataset currently consists of 238 annotated passages (v3). You can find more information in the paper or in our data card on HuggingFace: https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination. To use this dataset, you can either load it through the HuggingFace datasets API, or download it directly in JSON format (see below).
We've annotated GPT-3 WikiBio passages further, and the dataset now consists of 238 annotated passages. Here is the link to the IDs of the first 65 passages in v1.
```python
from datasets import load_dataset

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
```
Download from our Google Drive, then load it in Python:
```python
import json

with open("dataset.json", "r") as f:
    content = f.read()
dataset = json.loads(content)
```
Each instance consists of:
- `gpt3_text`: GPT-3 generated passage
- `wiki_bio_text`: actual Wikipedia passage (first paragraph)
- `gpt3_sentences`: `gpt3_text` split into sentences using `spacy`
- `annotation`: human annotation at the sentence level
- `wiki_bio_test_idx`: ID of the concept/individual from the original WikiBio dataset (test set)
- `gpt3_text_samples`: list of sampled passages (do_sample = True & temperature = 1.0)
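For example, one instance can be inspected as follows (a minimal sketch using the field names above; the split name is read from the dataset rather than hard-coded):

```python
from datasets import load_dataset

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
split_name = list(dataset.keys())[0]      # avoid hard-coding the split name
example = dataset[split_name][0]

print(example["gpt3_sentences"][0])       # first GPT-3 generated sentence
print(example["annotation"][0])           # its sentence-level human label
print(len(example["gpt3_text_samples"]))  # number of sampled passages
```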
As described in our paper, the probabilities (and generation entropies) of the generative LLM can be used to measure its confidence. See our example implementation of this approach in `demo/experiments/probability-based-baselines.ipynb`.
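As a rough illustration of these baselines (a sketch, not the notebook's code), given per-token probabilities from the generative LLM for one sentence:

```python
import math

# Hypothetical p(token | context) values for the tokens of one sentence
token_probs = [0.91, 0.85, 0.40, 0.77]
neg_logprobs = [-math.log(p) for p in token_probs]

avg_neg_logprob = sum(neg_logprobs) / len(neg_logprobs)  # higher => model less confident
max_neg_logprob = max(neg_logprobs)                      # most "surprising" token
print(avg_neg_logprob, max_neg_logprob)
```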
- Full details can be found in our paper.
- Note that our new results show that LLMs such as GPT-3 (text-davinci-003) or ChatGPT (gpt-3.5-turbo) are good at textual inconsistency assessment. Based on this finding, we try SelfCheckGPT-Prompt, where each sentence (to be evaluated) is compared against each and every sampled passage by prompting ChatGPT. SelfCheckGPT-Prompt is the best-performing method.
Results on the `wiki_bio_gpt3_hallucination` dataset.
| Method | NonFact (AUC-PR) | Factual (AUC-PR) | Ranking (PCC) |
|---|---|---|---|
| Random Guessing | 72.96 | 27.04 | - |
| GPT-3 Avg(-logP) | 83.21 | 53.97 | 57.04 |
| SelfCheck-BERTScore | 81.96 | 44.23 | 58.18 |
| SelfCheck-QA | 84.26 | 48.14 | 61.07 |
| SelfCheck-Unigram | 85.63 | 58.47 | 64.71 |
| SelfCheck-NLI | 92.50 | 66.08 | 74.14 |
| SelfCheck-Prompt (Llama2-7B-chat) | 89.05 | 63.06 | 61.52 |
| SelfCheck-Prompt (Llama2-13B-chat) | 91.91 | 64.34 | 75.44 |
| SelfCheck-Prompt (Mistral-7B-Instruct-v0.2) | 91.31 | 62.76 | 74.46 |
| SelfCheck-Prompt (gpt-3.5-turbo) | 93.42 | 67.09 | 78.32 |
MQAG (Multiple-choice Question Answering and Generation) was proposed in our previous work. Our MQAG implementation is included in this package; it can be used to: (1) generate multiple-choice questions, (2) answer multiple-choice questions, and (3) obtain the MQAG score.
```python
from selfcheckgpt.modeling_mqag import MQAG

mqag_model = MQAG()
```
It has three main functions: `generate()`, `answer()`, and `score()`. We show example usage in `demo/MQAG_demo1.ipynb`.
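A hedged sketch of how the three functions fit together is shown below; the keyword arguments are assumptions rather than a documented API, so check `demo/MQAG_demo1.ipynb` for authoritative usage:

```python
# NOTE: argument names below are assumptions -- see demo/MQAG_demo1.ipynb
context = "Michael Alan Weiner (born March 31, 1942) is an American radio host."
candidate = "Michael Alan Weiner is a Canadian radio host."  # hypothetical text to check

# (1) generate multiple-choice questions from the context
questions = mqag_model.generate(context=context, do_sample=True, num_questions=3)

# (2) answer the generated questions conditioned on a given text
# probs = mqag_model.answer(questions=questions, context=context)

# (3) obtain the MQAG score between a candidate text and a reference
# score = mqag_model.score(candidate=candidate, reference=context, num_questions=3)
```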
This work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge, and the Cambridge Commonwealth, European & International Trust.
```bibtex
@article{manakul2023selfcheckgpt,
  title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},
  author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},
  journal={arXiv preprint arXiv:2303.08896},
  year={2023}
}
```