- Notifications
You must be signed in to change notification settings - Fork6
Check for data drift between two OpenAI multi-turn chat jsonl files.
License
hamelsmu/ft-drift
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
ft-drift
helps you check for data drift by comparing two OpenAImulti-turn chat jsonlfiles.
pip install ft_drift
Checking for dataset drift can help you debug if:
- Your model is trained on data that doesn’t reflect production(different prompts, functions, etc).
- Your training data contains unexpected or accidental artifacts.
In either situation, you can compare data from relevant sources(i.e. production vs fine-tuning) to find unwanted changes. This is oneof the most common source of errors when fine-tuning models!
The demo below shows a cli tool used to detect data drift between twofiles,file_a.jsonl
andfile_b.jsonl
. Afterwards, a table ofimportant tokens that account for the drift are shown, such as:
END-UI-FORMAT
UI-FORMAT
- “```json”
- etc.
Currently,ft_drift
only detects drift in prompt templates, schemasand other token-based drift (as opposed to semantic drift).
After installingft_drift
, the cli commanddetect_drift
will beavailable to you.
This works by doing the following steps:
- Fit a binary classifier (random forest) to discriminate between twodatasets.
- If the classifier can predict a material difference (ex: AUC >=0.60) then we know there is drift (something is systematicallydifferent b/w the two datasets).
- We show the most important features from the classifier which aretokens (segments of text) to help you debug what is different.
If this tool doesn’t detect drift, it doesn’t mean drift doesn’t exist.It just means we didn’t find it. For more background on this approach,see this slide frommy talk on MLOpstools:
Other things that could be added:
- Semantic drift by incorporating embeddings.
- More features: length of messages, # of turns etc.
- Wiring up the function definition diff to the CLI (I don’t needthis yet for my use case).
About
Check for data drift between two OpenAI multi-turn chat jsonl files.
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors2
Uh oh!
There was an error while loading.Please reload this page.