If you’ve used the ChatGPT API before, you’ve likely noticed the different model versions associated with it. The original ChatGPT API was released on March 1st, 2023, leading to the naming convention gpt-3.5-turbo-0301. Next came gpt-3.5-turbo-0613, gpt-3.5-turbo-1106, and, as of this article’s writing, gpt-3.5-turbo-0125.
After the initial gpt-3.5-turbo-0301 release, it was challenging to predict the impact of each of these versions; OpenAI did not release new benchmarks for each one and did not disclose major changes. LLMs are also non-deterministic, making it difficult to measure the impact of such a change. So, from a high-level perspective, something shifted with these models due to a version upgrade, but it wasn’t clear what the consequences were.
Once gpt-3.5-turbo-0613 was released, a few papers were published looking into task-specific aspects of this model version change [1], but they mainly focused on small tasks like prime-number detection that are usually inconsequential to real-world use cases.
Many of these tasks were also 0-shot rather than few-shot, the technique that provides the formatting and response stability production applications look for. Over the course of 2023, more research was conducted on the overall sensitivity of prompts, showing dramatic differences across models from small tweaks in prompt format [2].
While ChatGPT and LLMs changed the narrative around building AI Agents, intent classification remains an important problem for directing user conversation and structuring agent actions.
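For context, the few-shot setup we’re describing looks roughly like the sketch below: a handful of labelled examples in the prompt, followed by the user’s message, with the model constrained to answer with a single intent name. The intent labels and examples here are hypothetical placeholders, not the customer’s actual prompt.

```python
# A minimal few-shot intent classification prompt builder. The intent labels
# and examples are hypothetical placeholders, not the customer's actual data.
INTENTS = ["track_order", "cancel_order", "talk_to_agent", "none"]

FEW_SHOT_EXAMPLES = [
    ("Where is my package?", "track_order"),
    ("Please stop my last order", "cancel_order"),
    ("Can I talk to a human?", "talk_to_agent"),
]

def build_prompt(utterance: str) -> str:
    """Build a few-shot prompt that constrains the model to reply with one label."""
    lines = [
        "Classify the user's message into exactly one of these intents: "
        + ", ".join(INTENTS),
        "Respond with the intent name only.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Message: {text}\nIntent: {label}")
    lines.append(f"Message: {utterance}\nIntent:")
    return "\n".join(lines)
```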
In late 2023, we were working with a customer who was using LLMs for intent classification and noticed that gpt-3.5-turbo-0301 was getting close to end of life. So, we upgraded this customer’s base model to gpt-3.5-turbo-1106, and it dramatically decreased performance.
After realizing the impact of our change (and reverting it), we retroactively ran the benchmarks and saw a severe degradation of ~10%* on the newer model.
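The benchmark itself is simple: classify every labelled example, repeat the run several times to account for non-determinism, and report mean accuracy with its standard deviation (see the footnotes for our run counts and dataset sizes). The sketch below assumes a `classify(utterance, model_version)` helper and a list of (utterance, expected_intent) pairs; both are placeholders for illustration.

```python
import statistics

def run_benchmark(dataset, classify, model_version, runs=5):
    """Return (mean accuracy, standard deviation) over repeated runs.

    `dataset` is a list of (utterance, expected_intent) pairs and
    `classify(utterance, model_version)` is assumed to return a single label.
    """
    accuracies = []
    for _ in range(runs):
        correct = sum(
            classify(utterance, model_version) == expected
            for utterance, expected in dataset
        )
        accuracies.append(correct / len(dataset))
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# e.g. compare the old and new snapshots on the same labelled data:
# run_benchmark(data, classify, "gpt-3.5-turbo-0301")
# run_benchmark(data, classify, "gpt-3.5-turbo-1106")
```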
With this degradation in mind, we decided to tweak the initial prompt slightly to bring model performance back to our initial benchmarked accuracy. At face value, the changes are not obvious, but they make an impact.
To improve our prompts and reduce version discrepancies, we iterated on one portion of our real-life dataset to improve our validation accuracy and find a prompt that brought gpt-3.5-turbo-1106’s accuracy back to an acceptable range.
After these iterations, we ran our evaluation dataset** once and confirmed that model results were much better for gpt-3.5-turbo-1106*** (and for gpt-3.5-turbo-0125 a few months later, for curiosity’s sake).
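The iteration loop itself is nothing exotic: score each candidate prompt on the validation split only, pick the best one, and touch the held-out evaluation set a single time at the end. A sketch, assuming a hypothetical `classify_with_prompt` helper:

```python
def pick_best_prompt(prompt_candidates, validation_set, classify_with_prompt,
                     model_version="gpt-3.5-turbo-1106"):
    """Score each candidate prompt on the validation split and return the best.

    `classify_with_prompt(prompt, utterance, model_version)` is assumed to
    return a single intent label; the evaluation set is kept held out.
    """
    best_prompt, best_accuracy = None, -1.0
    for prompt in prompt_candidates:
        correct = sum(
            classify_with_prompt(prompt, utterance, model_version) == expected
            for utterance, expected in validation_set
        )
        accuracy = correct / len(validation_set)
        if accuracy > best_accuracy:
            best_prompt, best_accuracy = prompt, accuracy
    return best_prompt, best_accuracy
```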
Our final prompt had three main changes:
To measure the impact of each of these changes, we ran a small ablation study. Our goal with the ablations was to get better insight into the magnitude of each change’s effect on performance.
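Mechanically, an ablation like this just applies each prompt change on its own, re-runs the benchmark, and compares against the unmodified prompt. The sketch below is a simplified version of that loop; the change names and helper functions are placeholders, not our actual prompt edits.

```python
PROMPT_CHANGES = ["change_1", "change_2", "change_3"]  # placeholders for our three edits

def run_ablation(base_prompt, apply_change, benchmark, model_versions):
    """Return the accuracy delta of each change, per model, vs. the base prompt.

    `apply_change(prompt, change)` returns a modified prompt and
    `benchmark(prompt, model)` returns an accuracy score; both are assumed.
    """
    deltas = {}
    for model in model_versions:
        baseline = benchmark(base_prompt, model)
        for change in PROMPT_CHANGES:
            ablated = apply_change(base_prompt, change)
            deltas[(model, change)] = benchmark(ablated, model) - baseline
    return deltas
```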
The ablations generally improved performance for later models, with impacts ranging from -10.66% to 13.08%. Bolded results show the best-performing ablation for each model. Interestingly, the ablations performed better on later models, showing a higher impact across techniques. However, there were significant swings for seemingly small changes, showing the brittle nature of prompts.
With our three prompt modifications, our intent classification task performed much better on newer models. Accuracy decreased for the initial gpt-3.5-turbo versions, but increased across the most recent versions.
LLM versions can make a large difference, even with a few-shot approach. Having good, non-benchmark datasets to validate performance is important, even when onboarding new use cases. In the future, we’ll be discussing some of our techniques for internal prompt optimization!
The Voiceflow research section covers industry-relevant and applicable research on ML and LLM work in the conversational AI space.
* standard deviation across 5 runs was between 0.3-0.7% accuracy
** our validation set contained ~400 examples and our evaluation set contained ~200
*** our request parameters:

messages = [{"role": "user", "content": request}]
response = openai.ChatCompletion.create(
    model=model_version,
    messages=messages,
    temperature=0.1,
    max_tokens=100,
)
[1] L. Chen, M. Zaharia, and J. Zou. 2023. How is ChatGPT’s behavior changing over time? arXiv:2307.09009 [cs].
[2] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv:2310.11324 [cs].
@article{IntentClassifcationChatGPTVersions,
  author       = {Linkov, Denys},
  title        = {How much do ChatGPT versions affect real world performance?},
  year         = {2024},
  month        = {03},
  howpublished = {\url{https://voiceflow.com}},
  url          = {https://www.voiceflow.com/blog/how-much-do-chatgpt-versions-affect-real-world-performance}
}