Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Check for data drift between two OpenAI multi-turn chat jsonl files.

License

NotificationsYou must be signed in to change notification settings

hamelsmu/ft-drift

Repository files navigation

ft-drift helps you check for data drift by comparing two OpenAImulti-turn chat jsonlfiles.

Install

pip install ft_drift

Background

Checking for dataset drift can help you debug if:

  1. Your model is trained on data that doesn’t reflect production(different prompts, functions, etc).
  2. Your training data contains unexpected or accidental artifacts.

In either situation, you can compare data from relevant sources(i.e. production vs fine-tuning) to find unwanted changes. This is oneof the most common source of errors when fine-tuning models!

The demo below shows a cli tool used to detect data drift between twofiles,file_a.jsonl andfile_b.jsonl. Afterwards, a table ofimportant tokens that account for the drift are shown, such as:

  • END-UI-FORMAT
  • UI-FORMAT
  • “```json”
  • etc.

Currently,ft_drift only detects drift in prompt templates, schemasand other token-based drift (as opposed to semantic drift).

Usage

After installingft_drift, the cli commanddetect_drift will beavailable to you.

How Does it Work?

This works by doing the following steps:

  1. Fit a binary classifier (random forest) to discriminate between twodatasets.
  2. If the classifier can predict a material difference (ex: AUC >=0.60) then we know there is drift (something is systematicallydifferent b/w the two datasets).
  3. We show the most important features from the classifier which aretokens (segments of text) to help you debug what is different.

If this tool doesn’t detect drift, it doesn’t mean drift doesn’t exist.It just means we didn’t find it. For more background on this approach,see this slide frommy talk on MLOpstools:

TODO

Other things that could be added:

  • Semantic drift by incorporating embeddings.
  • More features: length of messages, # of turns etc.
  • Wiring up the function definition diff to the CLI (I don’t needthis yet for my use case).

About

Check for data drift between two OpenAI multi-turn chat jsonl files.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors2

  •  
  •  

[8]ページ先頭

©2009-2025 Movatter.jp