
The unix-way web crawler


s0rg/crawley


crawley

Crawls web pages and prints any link it can find.

features

  • fast html SAX-parser (powered by x/net/html)
  • js/css lexical parsers (powered by tdewolff/parse) - extract api endpoints from js code and url() properties
  • small (below 1500 SLOC), idiomatic, 100% test-covered codebase
  • grabs most of the useful resource urls (pics, videos, audios, forms, etc...)
  • found urls are streamed to stdout and guaranteed to be unique (with fragments omitted)
  • configurable scan depth (limited by starting host and path, 0 by default)
  • can be polite - respects crawl rules and sitemaps from robots.txt
  • brute mode - scans html comments for urls (this can lead to bogus results)
  • makes use of the HTTP_PROXY / HTTPS_PROXY environment values and handles proxy auth (use HTTP_PROXY="socks5://127.0.0.1:1080/" crawley for socks5)
  • directory-only scan mode (aka fast-scan)
  • user-defined cookies, in curl-compatible format (i.e. -cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file)
  • user-defined headers, same as curl: -header "ONE: 1" -header "TWO: 2" -header @headers-file
  • tag filter - lets you specify which tags to crawl (single: -tag a -tag form, multiple: -tag a,form, or mixed)
  • url ignore - skips urls containing any of the given substrings (e.g. -ignore logout)
  • subdomains support - extends depth crawling to subdomains as well (e.g. crawley http://some-test.site will also crawl http://www.some-test.site)
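The @-prefixed file form of -cookie and -header mirrors the inline syntax; a plausible layout is one entry per line, like curl's. This is a sketch under that assumption - the file names, comment lines, and values below are illustrative, not taken from crawley's documentation:

```
# cookie-file: one curl-style cookie string per line
ONE=1; TWO=2
ITS=ME

# headers-file: one "Name: value" header per line
ONE: 1
TWO: 2
```

Such files would then be passed as -cookie @cookie-file and -header @headers-file.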

examples

```
# print all links from the first page:
crawley http://some-test.site

# print all js files and api endpoints:
crawley -depth -1 -tag script -js http://some-test.site

# print all endpoints from js:
crawley -js http://some-test.site/app.js

# download all png images from the site:
crawley -depth -1 -tag img http://some-test.site | grep '\.png$' | wget -i -

# fast directory traversal:
crawley -headless -delay 0 -depth -1 -dirs only http://some-test.site
```
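Because crawley streams one unique URL per line to stdout, its output composes with ordinary unix tools. A self-contained sketch of the filtering step from the wget example above, with printf standing in for crawley's output stream (the URLs are made up):

```shell
# Stand-in for `crawley -depth -1 -tag img http://some-test.site`:
# crawley emits unique URLs, one per line, on stdout.
printf 'http://some-test.site/a.png\nhttp://some-test.site/app.js\nhttp://some-test.site/b.png\n' |
  grep '\.png$'   # keep only the .png links, as in the wget pipeline above
```

This prints only the two .png URLs; wget -i - then reads that filtered list from stdin.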

installation

  • binaries / deb / rpm for Linux, FreeBSD, macOS and Windows.
  • on Arch Linux, use your favourite AUR helper to install it, e.g. paru -S crawley-bin.

usage

```
crawley [flags] url

possible flags with default values:

-all
    scan all known sources (js/css/...)
-brute
    scan html comments
-cookie value
    extra cookies for request, can be used multiple times, accepts files with '@' prefix
-css
    scan css for urls
-delay duration
    per-request delay (0 - disable) (default 150ms)
-depth int
    scan depth (set -1 for unlimited)
-dirs string
    policy for non-resource urls: show / hide / only (default "show")
-header value
    extra headers for request, can be used multiple times, accepts files with '@' prefix
-headless
    disable pre-flight HEAD requests
-ignore value
    patterns (in urls) to be ignored in crawl process
-js
    scan js code for endpoints
-proxy-auth string
    credentials for proxy: user:password
-robots string
    policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent
    suppress info and error messages in stderr
-skip-ssl
    skip ssl verification
-subdomains
    support subdomains (e.g. if www.domain.com found, recurse over it)
-tag value
    tags filter, single or comma-separated tag names
-timeout duration
    request timeout (min: 1 second, max: 10 minutes) (default 5s)
-user-agent string
    user-agent string
-version
    show version
-workers int
    number of workers (default - number of CPU cores)
```

flags autocompletion

Crawley supports flag autocompletion in bash and zsh via complete:

complete -C"/full-path-to/bin/crawley" crawley

license

FOSSA Status

