👻 Experimental library for scraping websites using OpenAI's GPT API.
jamesturk/scrapeghost
scrapeghost is an experimental library for scraping websites using OpenAI's GPT.
Source: https://github.com/jamesturk/scrapeghost
Documentation: https://jamesturk.github.io/scrapeghost/
Issues: https://github.com/jamesturk/scrapeghost/issues
Use at your own risk. This library makes expensive API calls (about $0.36 for a GPT-4 call on a moderately sized page). Cost estimates are based on the OpenAI pricing page and are not guaranteed to be accurate.
The purpose of this library is to provide a convenient interface for exploring web scraping with GPT.
While the bulk of the work is done by the GPT model, scrapeghost provides a number of features to make it easier to use.
Python-based schema definition - Define the shape of the data you want to extract as any Python object, with as much or little detail as you want.
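As a rough illustration of what "any Python object" can mean here, a schema can be an ordinary dict whose values describe the expected types. The sketch below only shows the shape and how it might be serialized into a prompt; the field names and the `build_prompt` helper are hypothetical and are not part of scrapeghost's API.

```python
import json

# Hypothetical schema: keys are the fields to extract,
# values are loose type hints written as plain strings.
schema = {
    "name": "string",
    "url": "url",
    "district": "string",
    "offices": [{"name": "string", "phone": "string"}],
}


def build_prompt(schema: dict) -> str:
    """Render the schema as an instruction (illustrative helper only)."""
    header = "Extract data from the HTML as JSON matching this schema:"
    return header + "\n" + json.dumps(schema, indent=2)


prompt = build_prompt(schema)
```

Because the schema is plain data, it can be as detailed or as sparse as the page requires.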
Preprocessing
- HTML cleaning - Remove unnecessary HTML to reduce the size and cost of API requests.
- CSS and XPath selectors - Pre-filter HTML by writing a single CSS or XPath selector.
- Auto-splitting - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.
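The cleaning and splitting steps can be sketched with the standard library alone. This is not scrapeghost's implementation: `Cleaner`, `clean_html`, and `split_chunks` are hypothetical names, and character counts stand in for token counts.

```python
import html
from html.parser import HTMLParser


class Cleaner(HTMLParser):
    """Re-emit HTML without <script>/<style> content, comments, or non-href attributes."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif not self.skip_depth:
            # Keep only href, since link targets are often part of the data.
            href = "".join(
                f' href="{html.escape(v, quote=True)}"'
                for k, v in attrs
                if k == "href" and v
            )
            self.out.append(f"<{tag}{href}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.out.append(html.escape(data.strip()))


def clean_html(source: str) -> str:
    cleaner = Cleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.out)


def split_chunks(pieces, max_chars):
    """Greedily pack pieces into chunks of at most max_chars characters.

    A single piece longer than max_chars still becomes its own chunk.
    """
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the model in a separate call and the results combined.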
Postprocessing
- JSON validation - Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it's not.)
- Schema validation - Go a step further, use a pydantic schema to validate the response.
- Hallucination check - Does the data in the response truly exist on the page?
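A minimal sketch of two of these checks, assuming the simplest possible rules: the reply must parse as JSON, and every extracted string must appear verbatim in the page. The helper names are hypothetical, the pydantic step is omitted, and the library's own checks may work differently.

```python
import json


def parse_json(text):
    """Return parsed JSON, or None when the model's reply is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def iter_strings(value):
    """Yield every string leaf in nested dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def missing_from_page(data, page_html):
    """List extracted strings that do not appear verbatim in the source HTML."""
    return [s for s in iter_strings(data) if s and s not in page_html]
```

A `None` from `parse_json` is the point at which a scraper could send the reply back to the model for repair; a non-empty list from `missing_from_page` flags values the model may have invented.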
Cost Controls
- Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
- Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)
- Allows setting a budget and stops the scraper if the budget is exceeded.
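The running-total and budget ideas can be sketched as a small tracker. The class and exception names are hypothetical, and the per-1K-token prices in the usage lines are placeholder values rather than current OpenAI pricing.

```python
class BudgetExceeded(Exception):
    """Raised once the accumulated cost passes the configured budget."""


class CostTracker:
    def __init__(self, budget_usd, prompt_price_per_1k, completion_price_per_1k):
        self.budget_usd = budget_usd
        self.prompt_price_per_1k = prompt_price_per_1k
        self.completion_price_per_1k = completion_price_per_1k
        self.prompt_tokens = 0
        self.completion_tokens = 0

    @property
    def total_cost(self):
        """Cost in USD of all tokens recorded so far."""
        return (
            self.prompt_tokens / 1000 * self.prompt_price_per_1k
            + self.completion_tokens / 1000 * self.completion_price_per_1k
        )

    def record(self, prompt_tokens, completion_tokens):
        """Add one call's token usage, then stop if the budget is exceeded."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        if self.total_cost > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.total_cost:.2f} of ${self.budget_usd:.2f}"
            )


# Placeholder prices, for illustration only.
tracker = CostTracker(budget_usd=0.50, prompt_price_per_1k=0.03, completion_price_per_1k=0.06)
tracker.record(prompt_tokens=8000, completion_tokens=2000)
```

A fallback strategy would fit the same shape: try the cheaper model first, record its usage, and only call the more expensive model when the first result fails validation.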