Practical Tips for Bootstrapping Information Extraction Pipelines

In this presentation, I will build on Ines Montani's keynote, "Applied NLP in the Age of Generative AI", by demonstrating how to create an information extraction pipeline. The talk will focus on using the spaCy NLP library and the Prodigy annotation tool, although the principles discussed will also apply to other frameworks.


Matthew Honnibal

August 09, 2024

Resources

spaCy: Industrial-Strength NLP

https://spacy.io

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text.
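
As a quick orientation, the core API is a few lines. A minimal sketch, assuming the small English pipeline has been installed with python -m spacy download en_core_web_sm:

    import spacy

    # Load a pretrained pipeline: tokenizer, tagger, parser and NER.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Hooli raises $5m to revolutionize search, led by ACME Ventures.")

    # Each entity span carries a label such as ORG or MONEY.
    for ent in doc.ents:
        print(ent.text, ent.label_)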

Prodigy: Radically efficient machine teaching

https://prodi.gy

Prodigy is a modern annotation tool for creating training data for machine learning models. It’s so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration.
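
As a taste of the workflow, a manual NER session is started from the command line; in this sketch the dataset name and input file are hypothetical:

    $ prodigy ner.manual funding_ner blank:en ./news.jsonl --label COMPANY,MONEY,INVESTOR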

spacy-llm: Integrating LLMs into structured NLP pipelines

https://github.com/explosion/spacy-llm

spacy-llm features a modular system for fast prototyping and prompting, turning unstructured LLM responses into robust outputs for various NLP tasks, with no training data required.
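
Wiring up an LLM-backed component looks roughly like this. A minimal sketch based on the spacy-llm docs; it assumes spacy-llm is installed and an OpenAI API key is available in the environment:

    import spacy

    nlp = spacy.blank("en")
    # The task defines the structured output; the model picks the provider.
    nlp.add_pipe("llm", config={
        "task": {
            "@llm_tasks": "spacy.NER.v3",
            "labels": ["COMPANY", "MONEY", "INVESTOR"],
        },
        "model": {"@llm_models": "spacy.GPT-4.v2"},
    })

    doc = nlp("Hooli raises $5m, led by ACME Ventures.")
    print([(ent.text, ent.label_) for ent in doc.ents])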

A practical guide to human-in-the-loop distillation

https://explosion.ai/blog/human-in-the-loop-distillation

This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

https://explosion.ai/blog/sp-global-commodities

A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment using human-in-the-loop distillation.


Transcript

  1. PRACTICAL TIPS FOR BOOTSTRAPPING INFORMATION EXTRACTION PIPELINES
     Matthew Honnibal, Explosion
     [diagram: 🤠 you, the developer, and the GPT-4 API]
  2. SPACY: open-source library for industrial-strength natural language processing. 250m+ downloads. ChatGPT can write spaCy code! (spacy.io)
  3. PRODIGY: modern scriptable annotation tool for machine learning developers. 900+ companies, 10k+ users. (prodigy.ai)
     [diagram: Alex Smith, developer, and Kim Miller, analyst, working with the GPT-4 API]
  4. BACK TO OUR ROOTS: we’re back to running Explosion as a smaller, independent-minded and self-sufficient company: consulting and open-source developer tools. (explosion.ai/blog/back-to-our-roots)
  5. WHAT I MEAN BY INFORMATION EXTRACTION
     📝 Turn text into data. Make a database from earnings reports, skills in job postings, product feedback in social media, and many more.
     🗂 Lots of subtasks. Text classification, named entity recognition, entity linking and relation extraction can all be part of an information extraction pipeline.
     🎯 Mostly static schema. Most people are solving one problem at a time, so that’s what I’ll focus on.
  6. “Hooli raises $5m to revolutionize search, led by ACME Ventures”
     [diagram: named entity recognition tags Hooli and ACME Ventures as COMPANY and $5m as MONEY; currency normalization turns $5m into a number; a custom database lookup disambiguates the entities to IDs 5923214 and 1681056; relation extraction marks ACME Ventures as the INVESTOR; the results flow into a database]
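
To make the middle steps concrete, here is a sketch of the currency normalization and record assembly around the NER output; the helper and the multiplier table are illustrative, not part of spaCy:

    import re

    # Illustrative multipliers for the currency normalization step.
    MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}

    def normalize_money(text: str) -> int:
        """Turn a MONEY span like '$5m' into an integer amount."""
        match = re.fullmatch(r"\$(\d+(?:\.\d+)?)(k|m|bn)?", text.lower())
        if not match:
            raise ValueError(f"Unrecognized amount: {text!r}")
        value, suffix = match.groups()
        return int(float(value) * MULTIPLIERS.get(suffix, 1))

    # One extracted row for the database, using the entities above.
    row = {
        "company_id": 5923214,                 # Hooli, via database lookup
        "amount_usd": normalize_money("$5m"),  # 5000000
        "investor_id": 1681056,                # ACME Ventures, the INVESTOR
    }
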
  7. RIE: RETRIEVAL VIA INFORMATION EXTRACTION
     [diagram: 💬 question → ⚙ text-to-SQL → query → data, served by an 📦 NLP pipeline over 📖 texts]
     RAG: RETRIEVAL-AUGMENTED GENERATION
     [diagram: 💬 question → ⚙ vectorizer → query → answers, served by a 📚 vector DB of 📖 snippets built with a ⚙ vectorizer]
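
The point of RIE is that once extraction has filled a table, retrieval is an ordinary database query. A minimal sketch with Python’s built-in sqlite3; the table and rows are invented for illustration, and the SQL is hand-written where the diagram uses text-to-SQL:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE funding (company TEXT, amount_usd INTEGER, investor TEXT)")
    # Rows produced upstream by the extraction pipeline.
    conn.execute("INSERT INTO funding VALUES ('Hooli', 5000000, 'ACME Ventures')")

    # “Which companies did ACME Ventures back, and for how much?”
    for company, amount in conn.execute(
        "SELECT company, amount_usd FROM funding WHERE investor = ?",
        ("ACME Ventures",),
    ):
        print(company, amount)
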
  8. TALK OUTLINE 💡
     1. Training tips
     2. Modelling tips
     3. Data annotation tips
  9. SUPERVISED LEARNING IS STILL VERY STRONG
     Example data is super powerful. Example data can do things that instructions can’t. In-context learning can’t use examples scalably.
  10. KNOW YOUR ENEMIES: what makes supervised learning hard?
      [diagram: a chicken-and-egg problem linking 👁 product vision, 📈 accuracy estimate, 🔮 training & evaluation, 📚 labelled data and 🏷 annotation scheme]
  11. RESULTS ARE HARD TO INTERPRET
      😬 Model doesn’t train at all. Is the data messed up somehow?
      🤨 Model learns barely better than chance. Could be data, hyper-parameters, modelling…
      🥹 Results are decent! But can it be better? How do I know if I’m missing out?
      🤔 Results are too good to be true. Probably messed up the data…
  12. ⚗ PART 1: TRAINING

  13. FORM AND FALSIFY HYPOTHESES

  14. HYPOTHESIS: this is the bit that’s broken.
      QUESTION: if this bit is broken, what should I expect to see?
      TEST: is that what actually happens?
      “I can’t connect to this site.”
      SOLUTION MINDSET: “Maybe it’ll work if I reconnect to the wi-fi or if I restart my router.”
      SCIENTIFIC MINDSET: “If the problem is between me and the site, other sites won’t load either. If the problem is between me and the router, I won’t be able to ping it.”
  15. EXAMPLES OF DEBUGGING TRAINING
      📉 What happens if I train on a tiny amount of data? Does the model converge?
      🔀 What happens if I randomize the training labels? Does the model still learn?
      🪄 Are my model weights changing at all during training?
      🧮 What’s the mean and variance of my gradients?
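
The label-randomization check is a few lines in any framework. A framework-agnostic sketch; train_and_score stands in for your own training loop:

    import random

    def shuffled_labels(examples):
        """Return the same texts with labels randomly reassigned.
        A model trained on this should score near chance; if it scores
        much higher, the evaluation is probably leaking information."""
        texts, labels = zip(*examples)
        labels = list(labels)
        random.shuffle(labels)
        return list(zip(texts, labels))

    # Hypothetical usage with your own training function:
    # real_score = train_and_score(train_data, dev_data)
    # null_score = train_and_score(shuffled_labels(train_data), dev_data)
    # Expect null_score near chance and real_score well above it.
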
  16. PRIORITIZE ROBUSTNESS, NOT ACCURACY

  17. 📈 Better needs to look better. You need it to not be like this: [plot: a noisy curve where the improvement is invisible]
      📦 Larger models are often less practical.
      🤏 You need it to work with small samples.
      🌪 Large models are less stable with small batch sizes.
  18. 🔮 PART 2: MODELLING

  19. ITERATE ON YOUR DATA AND SCALE DOWN

  20. PROTOTYPE: 📖 text plus a 💬 prompt go to the 🔮 GPT-4 API; spacy-llm prompts the model and transforms the output to structured data, yielding task-specific output. (github.com/explosion/spacy-llm)
      PRODUCTION: 📖 text runs through distilled task-specific components 📦 📦 📦 to yield the same task-specific output: modular, small & fast, data-private.
  21. config.cfg ⚙ (spacy.io/usage/large-language-models)
      [annotated config: the component; the model and provider; the task definition and labels (named entity recognition, text classification, relation extraction, …); the label definitions to use in the prompt. Example from the case study: explosion.ai/blog/sp-global-commodities]
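
Pieced together, the annotated sections of such a config might look like this. A sketch in the spacy-llm config format; the labels and definitions are invented for the funding example:

    [components.llm]
    factory = "llm"

    [components.llm.task]
    @llm_tasks = "spacy.NER.v3"
    labels = ["COMPANY", "MONEY", "INVESTOR"]

    [components.llm.task.label_definitions]
    COMPANY = "A business organization, such as a startup or an investment firm."
    MONEY = "A monetary amount, such as $5m."
    INVESTOR = "The company or fund providing the money in a funding round."

    [components.llm.model]
    @llm_models = "spacy.GPT-4.v2"
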
  22. 📒 PART 3: DATA ANNOTATION

  23. HOW MUCH DATA DO YOU NEED?
      TRAINING: Prodigy’s train curve diagnostic trains 4 times with 25%, 50%, 75% and 100% of the data:

           %    Score     ner
        ----   ------   ------
          0%    0.00     0.00
         25%    0.31 ▲   0.31 ▲
         50%    0.44 ▲   0.44 ▲
         75%    0.43 ▼   0.43 ▼
        100%    0.56 ▲   0.56 ▲

      [plot: accuracy vs. % of examples, with a projection past 100%]
      EVALUATION:
      ⚠ You need enough data to avoid reporting meaningless precision.
      📊 Ten samples per significant figure is a good rule of thumb. 1,000 samples is pretty good: enough for 94% vs. 95%.
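
The rule of thumb lines up with the standard error of a proportion. A quick check, using the standard binomial approximation rather than anything from the talk:

    import math

    n, p = 1000, 0.94
    se = math.sqrt(p * (1 - p) / n)     # ≈ 0.0075
    print(f"95% CI: ±{1.96 * se:.3f}")  # ≈ ±0.015, about 1.5 points

So at 1,000 evaluation samples the interval is roughly ±1.5 points, which is about where the 94% vs. 95% distinction starts to be meaningful.
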
  24. KEEP TASKS SMALL
      Humans have a cache, too!

      GOOD ✅
      for i in range(rows):
          access_data(array[i])

      BAD ❌
      for j in range(columns):
          access_data(array[:, j])

      DO THIS ✅
      for annotation_type in annotation_types:
          for example in examples:
              annotate(example, annotation_type)

      NOT THIS ❌
      for example in examples:
          for annotation_type in annotation_types:
              annotate(example, annotation_type)
  25. USE MODEL ASSISTANCE
      🔮 Suggest annotations however you can. Rule-based, an initial trained model, an LLM, or a combination of all.
      🔥 Suggestions improve efficiency. Common cases are common, so getting them preset speeds up annotation a lot.
      📈 Suggestions improve accuracy. You need the common cases to be annotated consistently. Humans suck at this.
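
A rule-based suggester can be as small as a spaCy Matcher pattern that pre-highlights the obvious cases; the pattern and example below are illustrative:

    import spacy
    from spacy.matcher import Matcher
    from spacy.util import filter_spans

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)
    # Suggest amounts like "$5 million" as preset MONEY annotations.
    matcher.add("MONEY", [[
        {"TEXT": "$"},
        {"LIKE_NUM": True},
        {"LOWER": {"IN": ["million", "billion", "m", "bn"]}, "OP": "?"},
    ]])

    doc = nlp("Hooli raises $5 million to revolutionize search.")
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in filter_spans(spans):  # keep the longest overlapping match
        print(span.text)              # $5 million
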
  26. HUMAN IN THE LOOP 🔮 (explosion.ai/blog/human-in-the-loop-distillation)
      [diagram: continuous evaluation against a baseline, then prompting, then transfer learning, ending in a 📦 distilled model]
  27. $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl
      [annotated command: ner.llm.correct is the recipe function with the workflow; todo_eval is the dataset to save annotations to; ./config.cfg holds the model config ([components.llm.model] @llm_models = "spacy.GPT-4.v2"); ./examples.jsonl is the raw data]
      ✨ Starting the web server at localhost:8080 ... Open the app and start annotating!
      [diagram: 🤠 you, the developer, in the loop with the GPT-4 API]
      (prodigy.ai/docs/large-language-models)
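
Once enough corrected annotations are saved to the dataset, a distilled model can be trained from the same tool. In recent Prodigy versions that is a one-liner along these lines, reusing the dataset name from the example above:

    $ prodigy train ./output --ner todo_eval
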
  28. ANNOTATION STARTS AT HOME
      annotation guidelines, annotation meeting
      (case study: explosion.ai/blog/guardian)
  29. ⚗ 🔮 📒
      Form and falsify hypotheses.
      Prioritize robustness.
      Scale down and iterate.
      Imagine you’re the model.
      Finish the pipeline to production.
      Be agile and annotate yourself.
      Keep tasks small.
      Use model assistance.
  30. THANK YOU!
      Explosion: explosion.ai · spaCy: spacy.io · Prodigy: prodigy.ai
      LinkedIn / Twitter / Mastodon / Bluesky: @honnibal · @[email protected] · @honnibal.bsky.social
