Practical Tips for Bootstrapping Information Extraction Pipelines

In this presentation, I will build on Ines Montani's keynote, "Applied NLP in the Age of Generative AI", by demonstrating how to create an information extraction pipeline. The talk will focus on using the spaCy NLP library and the Prodigy annotation tool, although the principles discussed will also apply to other frameworks.


Matthew Honnibal

August 09, 2024

Resources

spaCy: Industrial-Strength NLP

https://spacy.io

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text.
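
As a quick orientation, the core API is a few lines. A minimal sketch, assuming the small English pipeline has been installed with python -m spacy download en_core_web_sm:

    import spacy

    # Load a pretrained pipeline: tokenizer, tagger, parser and NER.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Hooli raises $5m to revolutionize search, led by ACME Ventures.")

    # Each entity span carries a label such as ORG or MONEY.
    for ent in doc.ents:
        print(ent.text, ent.label_)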

Prodigy: Radically efficient machine teaching

https://prodi.gy

Prodigy is a modern annotation tool for creating training data for machine learning models. It’s so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration.
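
As a taste of the workflow, a manual NER session is started from the command line; in this sketch the dataset name and input file are hypothetical:

    $ prodigy ner.manual funding_ner blank:en ./news.jsonl --label COMPANY,MONEY,INVESTOR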

spacy-llm: Integrating LLMs into structured NLP pipelines

https://github.com/explosion/spacy-llm

spacy-llm features a modular system for fast prototyping and prompting, turning unstructured LLM responses into robust outputs for various NLP tasks, with no training data required.
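
Wiring up an LLM-backed component looks roughly like this. A minimal sketch based on the spacy-llm docs; it assumes spacy-llm is installed and an OpenAI API key is available in the environment:

    import spacy

    nlp = spacy.blank("en")
    # The task defines the structured output; the model picks the provider.
    nlp.add_pipe("llm", config={
        "task": {
            "@llm_tasks": "spacy.NER.v3",
            "labels": ["COMPANY", "MONEY", "INVESTOR"],
        },
        "model": {"@llm_models": "spacy.GPT-4.v2"},
    })

    doc = nlp("Hooli raises $5m, led by ACME Ventures.")
    print([(ent.text, ent.label_) for ent in doc.ents])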

A practical guide to human-in-the-loop distillation

https://explosion.ai/blog/human-in-the-loop-distillation

This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

https://explosion.ai/blog/sp-global-commodities

A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment using human-in-the-loop distillation.


Transcript

  1. PRACTICAL TIPS FOR BOOTSTRAPPING INFORMATION EXTRACTION PIPELINES
     Matthew Honnibal, Explosion
     [diagram: 🤠 you, the developer, and the GPT-4 API]
  2. SPACY: open-source library for industrial-strength natural language processing. 250m+ downloads. ChatGPT can write spaCy code! (spacy.io)
  3. PRODIGY: modern scriptable annotation tool for machine learning developers. 900+ companies, 10k+ users. (prodigy.ai)
     [diagram: Alex Smith, developer, and Kim Miller, analyst, working with the GPT-4 API]
  4. BACK TO OUR ROOTS: we’re back to running Explosion as a smaller, independent-minded and self-sufficient company: consulting and open-source developer tools. (explosion.ai/blog/back-to-our-roots)
  5. WHAT I MEAN BY INFORMATION EXTRACTION
     📝 Turn text into data. Make a database from earnings reports, skills in job postings, product feedback in social media, and many more.
     🗂 Lots of subtasks. Text classification, named entity recognition, entity linking and relation extraction can all be part of an information extraction pipeline.
     🎯 Mostly static schema. Most people are solving one problem at a time, so that’s what I’ll focus on.
  6. “Hooli raises $5m to revolutionize search, led by ACME Ventures”
     [diagram: named entity recognition tags Hooli and ACME Ventures as COMPANY and $5m as MONEY; currency normalization turns $5m into a number; a custom database lookup disambiguates the entities to IDs 5923214 and 1681056; relation extraction marks ACME Ventures as the INVESTOR; the results flow into a database]
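
To make the middle steps concrete, here is a sketch of the currency normalization and record assembly around the NER output; the helper and the multiplier table are illustrative, not part of spaCy:

    import re

    # Illustrative multipliers for the currency normalization step.
    MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}

    def normalize_money(text: str) -> int:
        """Turn a MONEY span like '$5m' into an integer amount."""
        match = re.fullmatch(r"\$(\d+(?:\.\d+)?)(k|m|bn)?", text.lower())
        if not match:
            raise ValueError(f"Unrecognized amount: {text!r}")
        value, suffix = match.groups()
        return int(float(value) * MULTIPLIERS.get(suffix, 1))

    # One extracted row for the database, using the entities above.
    row = {
        "company_id": 5923214,                 # Hooli, via database lookup
        "amount_usd": normalize_money("$5m"),  # 5000000
        "investor_id": 1681056,                # ACME Ventures, the INVESTOR
    }
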
  7. RIE: RETRIEVAL VIA INFORMATION EXTRACTION
     [diagram: 💬 question → ⚙ text-to-SQL → query → data, served by an 📦 NLP pipeline over 📖 texts]
     RAG: RETRIEVAL-AUGMENTED GENERATION
     [diagram: 💬 question → ⚙ vectorizer → query → answers, served by a 📚 vector DB of 📖 snippets built with a ⚙ vectorizer]
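
The point of RIE is that once extraction has filled a table, retrieval is an ordinary database query. A minimal sketch with Python’s built-in sqlite3; the table and rows are invented for illustration, and the SQL is hand-written where the diagram uses text-to-SQL:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE funding (company TEXT, amount_usd INTEGER, investor TEXT)")
    # Rows produced upstream by the extraction pipeline.
    conn.execute("INSERT INTO funding VALUES ('Hooli', 5000000, 'ACME Ventures')")

    # “Which companies did ACME Ventures back, and for how much?”
    for company, amount in conn.execute(
        "SELECT company, amount_usd FROM funding WHERE investor = ?",
        ("ACME Ventures",),
    ):
        print(company, amount)
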
  8. TALK OUTLINE 💡
     1. Training tips
     2. Modelling tips
     3. Data annotation tips
  9. SUPERVISED LEARNING IS STILL VERY STRONG
     Example data is super powerful. Example data can do things that instructions can’t. In-context learning can’t use examples scalably.
  10. KNOW YOUR ENEMIES: what makes supervised learning hard?
      [diagram: a chicken-and-egg problem linking 👁 product vision, 📈 accuracy estimate, 🔮 training & evaluation, 📚 labelled data and 🏷 annotation scheme]
  11. RESULTS ARE HARD TO INTERPRET
      😬 Model doesn’t train at all. Is the data messed up somehow?
      🤨 Model learns barely better than chance. Could be data, hyper-parameters, modelling…
      🥹 Results are decent! But can it be better? How do I know if I’m missing out?
      🤔 Results are too good to be true. Probably messed up the data…
  12. ⚗ PART 1: TRAINING

  13. FORM AND FALSIFY HYPOTHESES

  14. HYPOTHESIS: this is the bit that’s broken.
      QUESTION: if this bit is broken, what should I expect to see?
      TEST: is that what actually happens?
      “I can’t connect to this site.”
      SOLUTION MINDSET: “Maybe it’ll work if I reconnect to the wi-fi or if I restart my router.”
      SCIENTIFIC MINDSET: “If the problem is between me and the site, other sites won’t load either. If the problem is between me and the router, I won’t be able to ping it.”
  15. EXAMPLES OF DEBUGGING TRAINING
      📉 What happens if I train on a tiny amount of data? Does the model converge?
      🔀 What happens if I randomize the training labels? Does the model still learn?
      🪄 Are my model weights changing at all during training?
      🧮 What’s the mean and variance of my gradients?
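
The label-randomization check is a few lines in any framework. A framework-agnostic sketch; train_and_score stands in for your own training loop:

    import random

    def shuffled_labels(examples):
        """Return the same texts with labels randomly reassigned.
        A model trained on this should score near chance; if it scores
        much higher, the evaluation is probably leaking information."""
        texts, labels = zip(*examples)
        labels = list(labels)
        random.shuffle(labels)
        return list(zip(texts, labels))

    # Hypothetical usage with your own training function:
    # real_score = train_and_score(train_data, dev_data)
    # null_score = train_and_score(shuffled_labels(train_data), dev_data)
    # Expect null_score near chance and real_score well above it.
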
  16. PRIORITIZE ROBUSTNESS, NOT ACCURACY

  17. 📈 Better needs to look better. You need it to not be like this: [plot: a noisy curve where the improvement is invisible]
      📦 Larger models are often less practical.
      🤏 You need it to work with small samples.
      🌪 Large models are less stable with small batch sizes.
  18. 🔮 PART 2: MODELLING

  19. ITERATE ON YOUR DATA AND SCALE DOWN

  20. PROTOTYPE: 📖 text plus a 💬 prompt go to the 🔮 GPT-4 API; spacy-llm prompts the model and transforms the output to structured data, yielding task-specific output. (github.com/explosion/spacy-llm)
      PRODUCTION: 📖 text runs through distilled task-specific components 📦 📦 📦 to yield the same task-specific output: modular, small & fast, data-private.
  21. config.cfg ⚙ (spacy.io/usage/large-language-models)
      [annotated config: the component; the model and provider; the task definition and labels (named entity recognition, text classification, relation extraction, …); the label definitions to use in the prompt. Example from the case study: explosion.ai/blog/sp-global-commodities]
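
Pieced together, the annotated sections of such a config might look like this. A sketch in the spacy-llm config format; the labels and definitions are invented for the funding example:

    [components.llm]
    factory = "llm"

    [components.llm.task]
    @llm_tasks = "spacy.NER.v3"
    labels = ["COMPANY", "MONEY", "INVESTOR"]

    [components.llm.task.label_definitions]
    COMPANY = "A business organization, such as a startup or an investment firm."
    MONEY = "A monetary amount, such as $5m."
    INVESTOR = "The company or fund providing the money in a funding round."

    [components.llm.model]
    @llm_models = "spacy.GPT-4.v2"
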
  22. 📒 PART 3: DATA ANNOTATION

  23. HOW MUCH DATA DO YOU NEED?
      TRAINING: Prodigy’s train curve diagnostic trains 4 times with 25%, 50%, 75% and 100% of the data:

           %    Score     ner
        ----   ------   ------
          0%    0.00     0.00
         25%    0.31 ▲   0.31 ▲
         50%    0.44 ▲   0.44 ▲
         75%    0.43 ▼   0.43 ▼
        100%    0.56 ▲   0.56 ▲

      [plot: accuracy vs. % of examples, with a projection past 100%]
      EVALUATION:
      ⚠ You need enough data to avoid reporting meaningless precision.
      📊 Ten samples per significant figure is a good rule of thumb. 1,000 samples is pretty good: enough for 94% vs. 95%.
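
The rule of thumb lines up with the standard error of a proportion. A quick check, using the standard binomial approximation rather than anything from the talk:

    import math

    n, p = 1000, 0.94
    se = math.sqrt(p * (1 - p) / n)     # ≈ 0.0075
    print(f"95% CI: ±{1.96 * se:.3f}")  # ≈ ±0.015, about 1.5 points

So at 1,000 evaluation samples the interval is roughly ±1.5 points, which is about where the 94% vs. 95% distinction starts to be meaningful.
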
  24. KEEP TASKS SMALL
      Humans have a cache, too!

      GOOD ✅
      for i in range(rows):
          access_data(array[i])

      BAD ❌
      for j in range(columns):
          access_data(array[:, j])

      DO THIS ✅
      for annotation_type in annotation_types:
          for example in examples:
              annotate(example, annotation_type)

      NOT THIS ❌
      for example in examples:
          for annotation_type in annotation_types:
              annotate(example, annotation_type)
  25. USE MODEL ASSISTANCE
      🔮 Suggest annotations however you can. Rule-based, an initial trained model, an LLM, or a combination of all.
      🔥 Suggestions improve efficiency. Common cases are common, so getting them preset speeds up annotation a lot.
      📈 Suggestions improve accuracy. You need the common cases to be annotated consistently. Humans suck at this.
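
A rule-based suggester can be as small as a spaCy Matcher pattern that pre-highlights the obvious cases; the pattern and example below are illustrative:

    import spacy
    from spacy.matcher import Matcher
    from spacy.util import filter_spans

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)
    # Suggest amounts like "$5 million" as preset MONEY annotations.
    matcher.add("MONEY", [[
        {"TEXT": "$"},
        {"LIKE_NUM": True},
        {"LOWER": {"IN": ["million", "billion", "m", "bn"]}, "OP": "?"},
    ]])

    doc = nlp("Hooli raises $5 million to revolutionize search.")
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in filter_spans(spans):  # keep the longest overlapping match
        print(span.text)              # $5 million
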
  26. HUMAN IN THE LOOP 🔮 (explosion.ai/blog/human-in-the-loop-distillation)
      [diagram: continuous evaluation against a baseline, then prompting, then transfer learning, ending in a 📦 distilled model]
  27. $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl
      [annotated command: ner.llm.correct is the recipe function with the workflow; todo_eval is the dataset to save annotations to; ./config.cfg holds the model config ([components.llm.model] @llm_models = "spacy.GPT-4.v2"); ./examples.jsonl is the raw data]
      ✨ Starting the web server at localhost:8080 ... Open the app and start annotating!
      [diagram: 🤠 you, the developer, in the loop with the GPT-4 API]
      (prodigy.ai/docs/large-language-models)
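
Once enough corrected annotations are saved to the dataset, a distilled model can be trained from the same tool. In recent Prodigy versions that is a one-liner along these lines, reusing the dataset name from the example above:

    $ prodigy train ./output --ner todo_eval
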
  28. ANNOTATION STARTS AT HOME
      annotation guidelines, annotation meeting
      (case study: explosion.ai/blog/guardian)
  29. ⚗ 🔮 📒
      Form and falsify hypotheses.
      Prioritize robustness.
      Scale down and iterate.
      Imagine you’re the model.
      Finish the pipeline to production.
      Be agile and annotate yourself.
      Keep tasks small.
      Use model assistance.
  30. THANK YOU!
      Explosion: explosion.ai · spaCy: spacy.io · Prodigy: prodigy.ai
      LinkedIn / Twitter / Mastodon / Bluesky: @honnibal · @[email protected] · @honnibal.bsky.social
