Containerized Ferret worker
MontFerret/worker
Worker is a simple HTTP server that accepts FQL (Ferret Query Language) queries, executes them, and returns their results.
Ferret is a declarative web scraping query language that allows you to extract data from web pages using a SQL-like syntax. Worker provides a REST API interface to execute FQL queries remotely, making it easy to integrate web scraping capabilities into your applications.
Common use cases:
- Web scraping and data extraction from websites
- Automated testing of web applications
- Monitoring web pages for changes
- Generating PDFs or screenshots from web pages
- Collecting data for analytics and research
The OpenAPI v2 schema can be found in the repository's reference/ directory.
- Docker (recommended) or Go 1.23+ for local installation
- For local installation without Docker: Google Chrome or Chromium browser
The Worker ships with a dedicated Docker image that contains headless Google Chrome, so you can run queries using the cdp driver:
DockerHub:
docker run -d -p 8080:8080 montferret/worker
GitHub Container Registry:
docker run -d -p 8080:8080 ghcr.io/montferret/worker
Alternatively, if you want to use your own version of Chrome, you can run the Worker locally.
Install from script:
curl https://raw.githubusercontent.com/MontFerret/worker/master/install.sh | sh
worker
Build from source:
git clone https://github.com/MontFerret/worker.git
cd worker
make
Once the Worker is running, you can send FQL queries via POST requests to http://localhost:8080/:
Simple data extraction:
curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET doc = DOCUMENT(\"https://example.com\") RETURN doc.title"
  }'
Web scraping with browser automation:
curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET page = DOCUMENT(\"https://github.com\", { driver: \"cdp\" }) WAIT_ELEMENT(page, \"h1\") RETURN INNER_TEXT(page, \"h1\")"
  }'
Query with parameters:
curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET doc = DOCUMENT(@url) RETURN doc.title",
    "params": {
      "url": "https://example.com"
    }
  }'
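Hand-escaping quotes inside the JSON body is error-prone once queries grow. As a minimal sketch (not part of Worker itself), the same request can be built with Python's standard library. The helper names `build_payload` and `run_query` are hypothetical, and the default base URL assumes the Worker was started with the Docker command shown earlier.

```python
import json
import urllib.request


def build_payload(text, params=None):
    """Serialize an FQL query and optional parameters into the JSON request body."""
    body = {"text": text}
    if params:
        body["params"] = params
    return json.dumps(body).encode("utf-8")


def run_query(text, params=None, base_url="http://localhost:8080/", timeout=60):
    """POST the query to a running Worker and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url,
        data=build_payload(text, params),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a Worker listening on localhost:8080):
#   run_query("LET doc = DOCUMENT(@url) RETURN doc.title", {"url": "https://example.com"})
```

Because `json.dumps` handles the escaping, the FQL text can contain double quotes without any backslashes in the calling code.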
- 2 CPU
- 2 GB of RAM
Executes a given FQL query. The payload must have the following shape:
{
  "text": "LET doc = DOCUMENT('https://example.com') RETURN doc.title",
  "params": {
    "optional_param": "value"
  }
}
Request body:
- text (string, required): The FQL query to execute
- params (object, optional): Parameters to pass to the query (accessible via @param_name)
Response:
{
  "data": "Example Domain",
  "stats": {
    "execution_time": "1.234s"
  }
}
Example with complex data extraction:
curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET page = DOCUMENT(@url, { driver: \"cdp\" }) LET links = ELEMENTS(page, \"a\") RETURN links[* LIMIT 5].href",
    "params": {
      "url": "https://news.ycombinator.com"
    }
  }'
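On the client side, the response can be split into its `data` and `stats` parts. The sketch below is illustrative only: it assumes `execution_time` is always a Go-style duration string (such as the "1.234s" in the sample response), which the sample suggests but does not guarantee. `duration_to_seconds` and `parse_response` are hypothetical helper names.

```python
import json
import re

# Seconds per unit for Go-style duration strings (an assumed format for execution_time).
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def duration_to_seconds(value):
    """Convert a duration such as "1.234s" or "1m2.5s" to seconds."""
    parts = _PART.findall(value)
    # Reject strings that contain anything besides number+unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != value:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def parse_response(raw):
    """Split a Worker response body into (data, execution time in seconds)."""
    body = json.loads(raw)
    return body["data"], duration_to_seconds(body["stats"]["execution_time"])
```

Converting the duration to a number makes it easier to log or alert on slow queries.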
Returns worker information including Chrome, Ferret and worker versions:
{
  "ip": "127.0.0.1",
  "version": {
    "worker": "1.18.0",
    "chrome": {
      "browser": "125.0.6422.141",
      "protocol": "1.3",
      "v8": "12.5.227.39",
      "webkit": "537.36"
    },
    "ferret": "0.18.1"
  }
}
Health check endpoint that returns HTTP 200 when the service is healthy and all dependencies (like Chrome) are accessible. Returns HTTP 424 when dependencies are unavailable.
Healthy response:
HTTP/1.1 200 OK
Unhealthy response:
HTTP/1.1 424 Failed Dependency
-log-level="debug"              log level (trace, debug, info, warn, error, fatal, panic)
-port=8080                      port to listen
-body-limit=1000                maximum size of request body in kb. 0 means no limit.
-request-limit=0                amount of requests per second for each IP. 0 means no limit.
-request-limit-time-window=180  amount of seconds for request rate limit time window.
-cache-size=100                 amount of cached queries. 0 means no caching.
-chrome-ip="127.0.0.1"          Google Chrome remote IP address
-chrome-port=9222               Google Chrome remote debugging port
-no-chrome=false                disable Chrome driver
-version=false                  show version
-help=false                     show this list
Production deployment with rate limiting:
worker \
  -port=8080 \
  -log-level=info \
  -request-limit=10 \
  -request-limit-time-window=60 \
  -body-limit=2000 \
  -cache-size=500
Development with debugging:
worker \
  -port=3000 \
  -log-level=debug \
  -cache-size=0
Using external Chrome instance:
# Start Chrome with remote debugging
google-chrome --headless --remote-debugging-port=9222 &

# Start worker pointing to external Chrome
worker -chrome-ip=localhost -chrome-port=9222
Without Chrome (HTTP driver only):
worker -no-chrome=true
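When the Worker is launched from a script or supervisor, the flags can be assembled from a configuration mapping instead of a hand-written command line. This is a minimal sketch under stated assumptions: `build_worker_args` is a hypothetical helper, the set of flag names is copied from the list above, and the binary is assumed to be on the PATH as `worker`.

```python
# Flag names taken from the Worker flag list; anything else is treated as a typo.
KNOWN_FLAGS = {
    "log-level", "port", "body-limit", "request-limit", "request-limit-time-window",
    "cache-size", "chrome-ip", "chrome-port", "no-chrome", "version", "help",
}


def build_worker_args(options, binary="worker"):
    """Turn a mapping such as {"port": 3000} into an argument list like ["worker", "-port=3000"]."""
    args = [binary]
    for name, value in options.items():
        if name not in KNOWN_FLAGS:
            raise ValueError(f"unknown worker flag: {name}")
        # Booleans are rendered as true/false, matching the -no-chrome=true form used above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"-{name}={value}")
    return args
```

The resulting list can be passed directly to `subprocess.Popen`, which avoids shell quoting issues.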
Custom port and configuration:
docker run -d \
  -p 3000:3000 \
  -e PORT=3000 \
  montferret/worker \
  worker -port=3000 -log-level=info
With volume for persistent cache:
docker run -d \
  -p 8080:8080 \
  -v /host/cache:/app/cache \
  montferret/worker
- Rate Limiting: Always enable rate limiting in production (-request-limit)
- Body Size Limits: Set appropriate body size limits (-body-limit) to prevent abuse
- Network Security: Worker should not be exposed directly to the internet without proper authentication
- Query Validation: Consider implementing query validation/filtering for untrusted input
- Resource Monitoring: Monitor CPU and memory usage as complex queries can be resource-intensive
- Chrome Security: The bundled Chrome runs in sandboxed mode, but avoid running as root in production
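One way to approach query validation is to keep the FQL text fixed on the server side, accept only parameters from untrusted callers, and check any URL parameter against an allow-list before forwarding it to the Worker. The sketch below illustrates that idea; `is_allowed_url` and the allow-list are hypothetical and not something Worker provides.

```python
from urllib.parse import urlsplit


def is_allowed_url(url, allowed_hosts):
    """Accept only http(s) URLs whose host is explicitly allow-listed."""
    parts = urlsplit(url)
    # Block non-web schemes such as file:// outright.
    if parts.scheme not in ("http", "https"):
        return False
    # hostname excludes credentials and port, so "https://a@evil.test/" resolves to "evil.test".
    host = (parts.hostname or "").lower()
    return host in allowed_hosts
```

A gateway in front of the Worker could call this on the `url` parameter and reject the request before any page is loaded.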
Recommended production configuration:
worker \
  -port=8080 \
  -log-level=warn \
  -request-limit=5 \
  -request-limit-time-window=60 \
  -body-limit=1000 \
  -cache-size=200
Chrome connection failed:
Error: failed to connect to Chrome
- Ensure Chrome is running with --remote-debugging-port=9222
- Check if Chrome is accessible at the configured IP/port
- For Docker: make sure Chrome service is healthy
Query timeout:
Error: query execution timeout
- Complex pages may take longer to load
- Consider adding explicit waits in your FQL query
- Check network connectivity to target websites
Memory issues:
Error: out of memory
- Reduce cache size (-cache-size)
- Limit concurrent requests (-request-limit)
- Monitor Chrome memory usage
Permission denied:
Error: permission denied accessing Chrome
- Ensure proper user permissions for Chrome binary
- In Docker, avoid running as root when possible
Enable debug logging to troubleshoot issues:
worker -log-level=debug
Monitor worker health:
curl http://localhost:8080/health
curl http://localhost:8080/info
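For automated monitoring, the 200/424 contract of the health check can be wrapped in a small probe. This is a sketch using only the Python standard library; `interpret_health` and `check_health` are hypothetical names, and the "unknown" and "unreachable" labels are choices made for this example rather than states reported by Worker.

```python
import urllib.error
import urllib.request


def interpret_health(status_code):
    """Map the documented health check status codes to a label."""
    if status_code == 200:
        return "healthy"
    if status_code == 424:
        return "unhealthy"  # a dependency such as Chrome is unavailable
    return "unknown"


def check_health(base_url="http://localhost:8080", timeout=5):
    """Probe /health and classify the result without raising on failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return interpret_health(response.status)
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx responses, including 424.
        return interpret_health(error.code)
    except OSError:
        # Connection refused, DNS failure, or timeout.
        return "unreachable"
```

Keeping the status mapping separate from the network call makes the mapping easy to test without a running Worker.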
// Extract page title
LET doc = DOCUMENT("https://example.com")
RETURN doc.title

// Get all links
LET doc = DOCUMENT("https://example.com")
LET links = ELEMENTS(doc, "a")
RETURN links[*].href

// Extract structured data
LET doc = DOCUMENT("https://news.ycombinator.com")
LET stories = ELEMENTS(doc, ".titleline > a")
RETURN stories[* LIMIT 10].{ title: INNER_TEXT(@), url: @.href }
// Navigate and interact with page
LET page = DOCUMENT("https://github.com", { driver: "cdp" })
WAIT_ELEMENT(page, "input[name='q']")
INPUT(page, "input[name='q']", "ferret")
CLICK(page, "button[type='submit']")
WAIT_ELEMENT(page, ".repo-list-item")
RETURN ELEMENTS(page, ".repo-list-item h3 a")[*].{ name: INNER_TEXT(@), url: @.href }

// Generate a PDF of the page
LET page = DOCUMENT("https://example.com", { driver: "cdp" })
RETURN PDF(page)
// Query with parameters (pass via "params" in POST body)
LET page = DOCUMENT(@url, { driver: "cdp" })
LET selector = @css_selector
RETURN ELEMENTS(page, selector)[*].{ text: INNER_TEXT(@), href: @.href }
# Clone repository
git clone https://github.com/MontFerret/worker.git
cd worker

# Install dependencies
make install

# Build
make build

# Run tests
make test

# Start development server
make start
- Fork the repository
- Create a feature branch: git checkout -b my-feature
- Make your changes
- Run tests: make test
- Run linter: make lint
- Commit changes: git commit -am 'Add some feature'
- Push to the branch: git push origin my-feature
- Submit a pull request
├── cmd/                # Command-line interface
├── internal/           # Internal application code
│   ├── controllers/    # HTTP request handlers
│   ├── server/         # HTTP server configuration
│   └── storage/        # Caching layer
├── pkg/                # Public packages
│   ├── caching/        # Cache implementation
│   └── worker/         # Core worker logic
├── reference/          # OpenAPI specification
└── assets/             # Documentation assets
