Containerized Ferret worker

MontFerret/worker

Go Report Status | Discord Chat | Lab release | Apache-2.0 License

Worker is a simple HTTP server that accepts FQL (Ferret Query Language) queries, executes them, and returns their results.

What is Ferret?

Ferret is a declarative web scraping query language that allows you to extract data from web pages using a SQL-like syntax. Worker provides a REST API interface to execute FQL queries remotely, making it easy to integrate web scraping capabilities into your applications.

Common use cases:

  • Web scraping and data extraction from websites
  • Automated testing of web applications
  • Monitoring web pages for changes
  • Generating PDFs or screenshots from web pages
  • Collecting data for analytics and research

The OpenAPI v2 schema can be found in the reference/ directory of the repository.

Quick start

Prerequisites

  • Docker (recommended) or Go 1.23+ for local installation
  • For local installation without Docker: Google Chrome or Chromium browser

Running with Docker

The Worker ships with a dedicated Docker image that contains headless Google Chrome, so you can run queries using the cdp driver right away:

DockerHub:

docker run -d -p 8080:8080 montferret/worker

GitHub Container Registry:

docker run -d -p 8080:8080 ghcr.io/montferret/worker

Local Installation

Alternatively, if you want to use your own version of Chrome, you can run the Worker locally.

Install from script:

curl https://raw.githubusercontent.com/MontFerret/worker/master/install.sh | sh
worker

Build from source:

git clone https://github.com/MontFerret/worker.git
cd worker
make

Your First Query

Once the Worker is running, you can send FQL queries via POST requests to http://localhost:8080/:

Simple data extraction:

curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET doc = DOCUMENT(\"https://example.com\") RETURN doc.title"
  }'

Web scraping with browser automation:

curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET page = DOCUMENT(\"https://github.com\", { driver: \"cdp\" }) WAIT_ELEMENT(page, \"h1\") RETURN INNER_TEXT(page, \"h1\")"
  }'

Query with parameters:

curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET doc = DOCUMENT(@url) RETURN doc.title",
    "params": {
      "url": "https://example.com"
    }
  }'
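The same requests can be sent from application code. The following is a minimal Python sketch using only the standard library. The function names `build_query_request` and `run_query` are illustrative and not part of Worker, and the URL assumes the default port used in the examples above.

```python
import json
import urllib.request

# Assumption: Worker is listening on the default port from the examples above.
WORKER_URL = "http://localhost:8080/"


def build_query_request(text, params=None, url=WORKER_URL):
    """Build a POST request that carries an FQL query as a JSON body."""
    payload = {"text": text}
    if params:
        payload["params"] = params
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(text, params=None, url=WORKER_URL, timeout=60):
    """Send the query to a running Worker and return the decoded JSON response."""
    request = build_query_request(text, params, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a Worker running locally, `run_query("LET doc = DOCUMENT(@url) RETURN doc.title", {"url": "https://example.com"})` would send the parameterized query shown above.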

Visual Example

[Image: worker]

System Resource Requirements

  • 2 CPU cores
  • 2 GB of RAM

Usage

API Reference

Endpoints

POST /

Executes a given FQL query. The payload must have the following shape:

{
  "text": "LET doc = DOCUMENT('https://example.com') RETURN doc.title",
  "params": {
    "optional_param": "value"
  }
}

Request body:

  • text (string, required): The FQL query to execute
  • params (object, optional): Parameters to pass to the query (accessible via @param_name)

Response:

{
  "data": "Example Domain",
  "stats": {
    "execution_time": "1.234s"
  }
}
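A client usually wants the query result and a numeric execution time. The sketch below unpacks a decoded response of the shape shown above. It assumes that `execution_time` is a Go-style duration string (for example "1.234s", "150ms" or "1m2.5s"); the source shows only the "1.234s" form, so treat the other units as an assumption. The helper names are illustrative.

```python
import re

# Assumption: "execution_time" uses Go-style duration units.
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# Longer unit names come first so "ms" is not read as "m".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_duration(value):
    """Convert a duration string such as "1.234s" or "1m2.5s" to seconds."""
    parts = _PART.findall(value)
    # Reject input that the pattern did not cover completely.
    if not parts or "".join(number + unit for number, unit in parts) != value:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def unpack_result(response):
    """Return (data, execution time in seconds) from a decoded response body."""
    return response["data"], parse_duration(response["stats"]["execution_time"])
```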

Example with complex data extraction:

curl -X POST http://localhost:8080/ \
  -H "Content-Type: application/json" \
  -d '{
    "text": "LET page = DOCUMENT(@url, { driver: \"cdp\" }) LET links = ELEMENTS(page, \"a\") RETURN links[* LIMIT 5].href",
    "params": {
      "url": "https://news.ycombinator.com"
    }
  }'

GET /info

Returns worker information including Chrome, Ferret and worker versions:

{
  "ip": "127.0.0.1",
  "version": {
    "worker": "1.18.0",
    "chrome": {
      "browser": "125.0.6422.141",
      "protocol": "1.3",
      "v8": "12.5.227.39",
      "webkit": "537.36"
    },
    "ferret": "0.18.1"
  }
}
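The nested version object can be flattened for logging or compatibility checks. The sketch below works on a decoded /info response of the shape shown above. The helper names are illustrative, and the tolerance for a missing `chrome` entry is an assumption (for example when Chrome is disabled), not documented behavior.

```python
def summarize_versions(info):
    """Flatten the version block of a decoded /info response."""
    version = info.get("version", {})
    # Assumption: the "chrome" entry may be absent or empty when Chrome is disabled.
    chrome = version.get("chrome") or {}
    return {
        "worker": version.get("worker"),
        "ferret": version.get("ferret"),
        "chrome": chrome.get("browser"),
    }


def chrome_major(info):
    """Return the Chrome major version as an int, or None if it is unknown."""
    browser = summarize_versions(info)["chrome"]
    return int(browser.split(".")[0]) if browser else None
```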

GET /health

Health check endpoint that returns HTTP 200 when the service is healthy and all dependencies (like Chrome) are accessible. Returns HTTP 424 when dependencies are unavailable.

Healthy response:

HTTP/1.1 200 OK

Unhealthy response:

HTTP/1.1 424 Failed Dependency
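A client can map these status codes to a simple state. The sketch below uses Python's standard library, where `urllib.request.urlopen` raises `HTTPError` for a 424 response, so both paths go through the same classifier. The state labels and function names are illustrative, and the base URL assumes the default port.

```python
import urllib.error
import urllib.request


def classify_health(status_code):
    """Map a /health status code to a short state label (labels are illustrative)."""
    if status_code == 200:
        return "healthy"
    if status_code == 424:
        return "dependency-unavailable"
    return "unknown"


def check_health(base_url="http://localhost:8080"):
    """Query /health on a running Worker and classify the result."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=5) as response:
            return classify_health(response.status)
    except urllib.error.HTTPError as error:
        # Non-2xx responses such as 424 arrive here.
        return classify_health(error.code)
    except urllib.error.URLError:
        # The server could not be reached at all.
        return "unreachable"
```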

Configuration

Command Line Options

  -log-level="debug"
        log level (trace, debug, info, warn, error, fatal, panic)
  -port=8080
        port to listen
  -body-limit=1000
        maximum size of request body in kb. 0 means no limit.
  -request-limit=0
        amount of requests per second for each IP. 0 means no limit.
  -request-limit-time-window=180
        amount of seconds for request rate limit time window.
  -cache-size=100
        amount of cached queries. 0 means no caching.
  -chrome-ip="127.0.0.1"
        Google Chrome remote IP address
  -chrome-port=9222
        Google Chrome remote debugging port
  -no-chrome=false
        disable Chrome driver
  -version=false
        show version
  -help=false
        show this list
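When Worker is launched from a script or supervisor, it can help to build the argument list from a configuration mapping and reject misspelled flags early. The sketch below is one way to do that in Python. The function name `build_worker_command` is illustrative, and the flag set is copied from the option list above.

```python
# Flags taken from the option list above (-version and -help omitted).
KNOWN_FLAGS = {
    "log-level", "port", "body-limit", "request-limit",
    "request-limit-time-window", "cache-size",
    "chrome-ip", "chrome-port", "no-chrome",
}


def build_worker_command(**options):
    """Turn keyword options into a worker argument list, e.g. port=8080 -> -port=8080."""
    command = ["worker"]
    for name, value in options.items():
        flag = name.replace("_", "-")
        if flag not in KNOWN_FLAGS:
            raise ValueError(f"unknown worker flag: -{flag}")
        if isinstance(value, bool):
            # Boolean flags are written as true/false, as in -no-chrome=true.
            value = "true" if value else "false"
        command.append(f"-{flag}={value}")
    return command
```

The resulting list can be passed to `subprocess.run` or a process manager without shell quoting concerns.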

Configuration Examples

Production deployment with rate limiting:

worker \
  -port=8080 \
  -log-level=info \
  -request-limit=10 \
  -request-limit-time-window=60 \
  -body-limit=2000 \
  -cache-size=500

Development with debugging:

worker \
  -port=3000 \
  -log-level=debug \
  -cache-size=0

Using external Chrome instance:

# Start Chrome with remote debugging
google-chrome --headless --remote-debugging-port=9222 &

# Start worker pointing to external Chrome
worker -chrome-ip=localhost -chrome-port=9222

Without Chrome (HTTP driver only):

worker -no-chrome=true

Docker Configuration

Custom port and configuration:

docker run -d \
  -p 3000:3000 \
  -e PORT=3000 \
  montferret/worker \
  worker -port=3000 -log-level=info

With volume for persistent cache:

docker run -d \
  -p 8080:8080 \
  -v /host/cache:/app/cache \
  montferret/worker

Security Considerations

⚠️ Important for Production Deployments:

  • Rate Limiting: Always enable rate limiting in production (-request-limit)
  • Body Size Limits: Set appropriate body size limits (-body-limit) to prevent abuse
  • Network Security: Worker should not be exposed directly to the internet without proper authentication
  • Query Validation: Consider implementing query validation/filtering for untrusted input
  • Resource Monitoring: Monitor CPU and memory usage as complex queries can be resource-intensive
  • Chrome Security: The bundled Chrome runs in sandboxed mode, but avoid running as root in production
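One way to approach the query validation point above is to keep raw FQL out of untrusted hands: callers choose a named query template and supply only a URL, which is checked against an allowlist before the payload is built. The sketch below illustrates that idea in Python. The template table, the allowlist and the function name are all hypothetical and would sit in a gateway in front of Worker; none of this is part of Worker itself.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only.
QUERY_TEMPLATES = {
    "title": "LET doc = DOCUMENT(@url) RETURN doc.title",
}
ALLOWED_HOSTS = {"example.com", "news.ycombinator.com"}


def build_safe_payload(template_name, url):
    """Build a Worker payload from a vetted template and an allowlisted URL."""
    if template_name not in QUERY_TEMPLATES:
        raise ValueError(f"unknown query template: {template_name!r}")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {parsed.hostname!r}")
    # The URL travels as a query parameter, never spliced into the FQL text.
    return {"text": QUERY_TEMPLATES[template_name], "params": {"url": url}}
```

Passing the URL through `params` rather than string concatenation keeps user input out of the query text, which is the main reason to prefer parameterized queries for untrusted input.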

Recommended production configuration:

worker \
  -port=8080 \
  -log-level=warn \
  -request-limit=5 \
  -request-limit-time-window=60 \
  -body-limit=1000 \
  -cache-size=200

Troubleshooting

Common Issues

Chrome connection failed:

Error: failed to connect to Chrome
  • Ensure Chrome is running with --remote-debugging-port=9222
  • Check if Chrome is accessible at the configured IP/port
  • For Docker: make sure Chrome service is healthy

Query timeout:

Error: query execution timeout
  • Complex pages may take longer to load
  • Consider adding explicit waits in your FQL query
  • Check network connectivity to target websites

Memory issues:

Error: out of memory
  • Reduce cache size (-cache-size)
  • Limit concurrent requests (-request-limit)
  • Monitor Chrome memory usage

Permission denied:

Error: permission denied accessing Chrome
  • Ensure proper user permissions for Chrome binary
  • In Docker, avoid running as root when possible

Debug Mode

Enable debug logging to troubleshoot issues:

worker -log-level=debug

Health Check

Monitor worker health:

curl http://localhost:8080/health
curl http://localhost:8080/info
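In deployment scripts it is often useful to wait until the health endpoint reports success before sending queries. The sketch below polls a probe function a fixed number of times. The names `wait_until_healthy` and `http_probe`, the attempt count and the delay are illustrative choices, and the URL assumes the default port.

```python
import time
import urllib.request


def http_probe(url="http://localhost:8080/health"):
    """Return True if /health answers with HTTP 200, False on any other outcome."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 424, refused connections and timeouts.
        return False


def wait_until_healthy(probe=http_probe, attempts=10, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `probe` and `sleep` arguments are injectable so the loop can be exercised without a running server.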

FQL Query Examples

Basic Web Scraping

// Extract page title
LET doc = DOCUMENT("https://example.com")
RETURN doc.title

// Get all links
LET doc = DOCUMENT("https://example.com")
LET links = ELEMENTS(doc, "a")
RETURN links[*].href

// Extract structured data
LET doc = DOCUMENT("https://news.ycombinator.com")
LET stories = ELEMENTS(doc, ".titleline > a")
RETURN stories[* LIMIT 10].{
  title: INNER_TEXT(@),
  url: @.href
}

Browser Automation with CDP

// Navigate and interact with page
LET page = DOCUMENT("https://github.com", { driver: "cdp" })
WAIT_ELEMENT(page, "input[name='q']")
INPUT(page, "input[name='q']", "ferret")
CLICK(page, "button[type='submit']")
WAIT_ELEMENT(page, ".repo-list-item")
RETURN ELEMENTS(page, ".repo-list-item h3 a")[*].{
  name: INNER_TEXT(@),
  url: @.href
}

// Generate a PDF of the page
LET page = DOCUMENT("https://example.com", { driver: "cdp" })
RETURN PDF(page)

Using Parameters

// Query with parameters (pass via "params" in POST body)
LET page = DOCUMENT(@url, { driver: "cdp" })
LET selector = @css_selector
RETURN ELEMENTS(page, selector)[*].{
  text: INNER_TEXT(@),
  href: @.href
}
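Before sending a parameterized query, a client can check that every `@name` referenced in the query text has a matching entry in "params". The Python sketch below does this with a regular expression. It is a heuristic, not an FQL parser: a bare `@` (as in `INNER_TEXT(@)` or `@.href`) is ignored, but an `@word` inside a string literal would be counted as a parameter. The function name is illustrative.

```python
import re

# Matches @identifier; a bare "@" followed by ")" or "." is not matched.
_PARAM = re.compile(r"@([A-Za-z_][A-Za-z0-9_]*)")


def missing_params(query, params):
    """Return the sorted parameter names used in the query but absent from params."""
    referenced = set(_PARAM.findall(query))
    return sorted(referenced - set(params))
```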

Development

Building from Source

# Clone repository
git clone https://github.com/MontFerret/worker.git
cd worker

# Install dependencies
make install

# Build
make build

# Run tests
make test

# Start development server
make start

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b my-feature
  3. Make your changes
  4. Run tests: make test
  5. Run linter: make lint
  6. Commit changes: git commit -am 'Add some feature'
  7. Push to the branch: git push origin my-feature
  8. Submit a pull request

Project Structure

├── cmd/                    # Command-line interface
├── internal/               # Internal application code
│   ├── controllers/        # HTTP request handlers
│   ├── server/             # HTTP server configuration
│   └── storage/            # Caching layer
├── pkg/                    # Public packages
│   ├── caching/            # Cache implementation
│   └── worker/             # Core worker logic
├── reference/              # OpenAPI specification
└── assets/                 # Documentation assets
