# crawley
Crawls web pages and prints any link it can find.
## Features

- fast html SAX-parser (powered by `x/net/html`)
- js/css lexical parsers (powered by `tdewolff/parse`) - extract api endpoints from js code and `url()` properties
- small (below 1500 SLOC), idiomatic, 100% test-covered codebase
- grabs most of the useful resource urls (pics, videos, audios, forms, etc...)
- found urls are streamed to stdout and guaranteed to be unique (with fragments omitted)
- scan depth (limited by starting host and path, 0 by default) can be configured
- can be polite - follows crawl rules and sitemaps from `robots.txt`
- `brute` mode - scans html comments for urls (this can lead to bogus results)
- makes use of the `HTTP_PROXY` / `HTTPS_PROXY` environment variables and handles proxy auth (use `HTTP_PROXY="socks5://127.0.0.1:1080/" crawley` for socks5)
- directory-only scan mode (aka `fast-scan`)
- user-defined cookies, in curl-compatible format (i.e. `-cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file`)
- user-defined headers, same as curl: `-header "ONE: 1" -header "TWO: 2" -header @headers-file`
- tag filter - lets you specify which tags to crawl (single: `-tag a -tag form`, multiple: `-tag a,form`, or mixed)
- url ignore - skips urls containing any of the given substrings (i.e.: `-ignore logout`)
- subdomains support - allows depth crawling of subdomains as well (e.g. `crawley http://some-test.site` will also be able to crawl `http://www.some-test.site`)
## Examples

```sh
# print all links from the first page:
crawley http://some-test.site

# print all js files and api endpoints:
crawley -depth -1 -tag script -js http://some-test.site

# print all endpoints from js:
crawley -js http://some-test.site/app.js

# download all png images from a site:
crawley -depth -1 -tag img http://some-test.site | grep '\.png$' | wget -i -

# fast directory traversal:
crawley -headless -delay 0 -depth -1 -dirs only http://some-test.site
```
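The flags can be freely combined. Here is an illustrative sketch pairing the cookie, header, tag-filter and ignore options described in the features list; the host name, cookie and header values are placeholders, not taken from a real setup:

```sh
# crawl only links and forms up to depth 2, send a session cookie and a
# custom header, and skip any url that contains "logout":
crawley -depth 2 -tag a,form \
    -cookie "SESSION=placeholder-value" \
    -header "X-Scanner: crawley" \
    -ignore logout \
    http://some-test.site
```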
## Installation

- pre-built binaries / deb / rpm packages for Linux, FreeBSD, macOS and Windows.
- on archlinux you can use your favourite AUR helper to install it, e.g. `paru -S crawley-bin`.
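If you have a Go toolchain available, building from source is another option. The module path below is an assumption based on the repository name; verify the exact command against the repository before use:

```sh
# assumed module path - check the repository for the canonical install command
go install github.com/s0rg/crawley/cmd/crawley@latest
```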
## Usage

```sh
crawley [flags] url

possible flags with default values:

-all                 scan all known sources (js/css/...)
-brute               scan html comments
-cookie value        extra cookies for request, can be used multiple times, accept files with '@'-prefix
-css                 scan css for urls
-delay duration      per-request delay (0 - disable) (default 150ms)
-depth int           scan depth (set -1 for unlimited)
-dirs string         policy for non-resource urls: show / hide / only (default "show")
-header value        extra headers for request, can be used multiple times, accept files with '@'-prefix
-headless            disable pre-flight HEAD requests
-ignore value        patterns (in urls) to be ignored in crawl process
-js                  scan js code for endpoints
-proxy-auth string   credentials for proxy: user:password
-robots string       policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent              suppress info and error messages in stderr
-skip-ssl            skip ssl verification
-subdomains          support subdomains (e.g. if www.domain.com found, recurse over it)
-tag value           tags filter, single or comma-separated tag names
-timeout duration    request timeout (min: 1 second, max: 10 minutes) (default 5s)
-user-agent string   user-agent string
-version             show version
-workers int         number of workers (default - number of CPU cores)
```
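As an illustration, the proxy-related flags above can be combined with the `HTTP_PROXY` environment variable mentioned in the features list; the proxy address and credentials below are placeholders:

```sh
# route traffic through an authenticated socks5 proxy and respect the
# target's robots.txt while crawling without a depth limit:
HTTP_PROXY="socks5://127.0.0.1:1080/" crawley \
    -proxy-auth "user:password" \
    -robots respect \
    -depth -1 \
    http://some-test.site
```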
## Flags autocompletion

Crawley can handle flags autocompletion in bash and zsh via `complete`:

```sh
complete -C "/full-path-to/bin/crawley" crawley
```
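To make the completion persistent, one option is to add that line to your shell startup file; in zsh the bash-style `complete` builtin becomes available after loading `bashcompinit`. A sketch, keeping the placeholder path from above:

```sh
# bash: register completion on every new shell
echo 'complete -C "/full-path-to/bin/crawley" crawley' >> ~/.bashrc

# zsh (in ~/.zshrc): enable bash-style completion, then register crawley
autoload -U +X bashcompinit && bashcompinit
complete -C "/full-path-to/bin/crawley" crawley
```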