# s0rg/crawley
The unix-way web crawler: crawls web pages and prints any link it can find.
# features

- fast html SAX-parser (powered by `x/net/html`)
- js/css lexical parsers (powered by `tdewolff/parse`) - extract api endpoints from js code and `url()` properties
- small (below 1500 SLOC), idiomatic, 100% test-covered codebase
- grabs most of the useful resource urls (pics, videos, audio, forms, etc...)
- found urls are streamed to stdout and guaranteed to be unique (with fragments omitted)
- scan depth (limited by starting host and path, 0 by default) can be configured
- can be polite - follows crawl rules and sitemaps from `robots.txt`
- `brute` mode - scans html comments for urls (this can lead to bogus results)
- makes use of the `HTTP_PROXY` / `HTTPS_PROXY` environment values and handles proxy auth (use `HTTP_PROXY="socks5://127.0.0.1:1080/" crawley` for socks5)
- directory-only scan mode (aka `fast-scan`)
- user-defined cookies, in curl-compatible format (i.e. `-cookie "ONE=1; TWO=2" -cookie "ITS=ME" -cookie @cookie-file`; see the extra examples below)
- user-defined headers, same as curl: `-header "ONE: 1" -header "TWO: 2" -header @headers-file`
- tag filter - allows you to specify tags to crawl for (single: `-tag a -tag form`, multiple: `-tag a,form`, or mixed)
- url ignore - allows you to exclude urls containing the given substrings from crawling (i.e.: `-ignore logout`)
- subdomains support - allows depth crawling of subdomains as well (e.g. `crawley http://some-test.site` will be able to crawl `http://www.some-test.site`)
# examples

```
# print all links from first page:
crawley http://some-test.site

# print all js files and api endpoints:
crawley -depth -1 -tag script -js http://some-test.site

# print all endpoints from js:
crawley -js http://some-test.site/app.js

# download all png images from site:
crawley -depth -1 -tag img http://some-test.site | grep '\.png$' | wget -i -

# fast directory traversal:
crawley -headless -delay 0 -depth -1 -dirs only http://some-test.site
```
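The cookie, header, ignore and subdomain options from the feature list combine in the same way. The invocations below are illustrative sketches against a placeholder host (the cookie and header values are made up); they use only flags documented in the usage section further down:

```
# authenticated crawl with a session cookie and a custom header, skipping logout links:
crawley -depth -1 -cookie "SESSION=deadbeef" -header "X-Requested-With: crawley" -ignore logout http://some-test.site

# crawl the host and its subdomains through a socks5 proxy:
HTTP_PROXY="socks5://127.0.0.1:1080/" crawley -subdomains -depth -1 http://some-test.site
```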
# installation

- binaries / deb / rpm packages for Linux, FreeBSD, macOS and Windows (see the project's releases).
- archlinux: you can use your favourite AUR helper to install it, e.g. `paru -S crawley-bin`.
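If you prefer building from source with a Go toolchain, a plain `go install` should also work. The module path below is an assumption about the repository layout (main package under `cmd/crawley`); check the repo if it fails:

```
# assumption: the main package lives at cmd/crawley in the s0rg/crawley module
go install github.com/s0rg/crawley/cmd/crawley@latest
```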
# usage

```
crawley [flags] url

possible flags with default values:

-all
    scan all known sources (js/css/...)
-brute
    scan html comments
-cookie value
    extra cookies for request, can be used multiple times, accept files with '@'-prefix
-css
    scan css for urls
-delay duration
    per-request delay (0 - disable) (default 150ms)
-depth int
    scan depth (set -1 for unlimited)
-dirs string
    policy for non-resource urls: show / hide / only (default "show")
-header value
    extra headers for request, can be used multiple times, accept files with '@'-prefix
-headless
    disable pre-flight HEAD requests
-ignore value
    patterns (in urls) to be ignored in crawl process
-js
    scan js code for endpoints
-proxy-auth string
    credentials for proxy: user:password
-robots string
    policy for robots.txt: ignore / crawl / respect (default "ignore")
-silent
    suppress info and error messages in stderr
-skip-ssl
    skip ssl verification
-subdomains
    support subdomains (e.g. if www.domain.com found, recurse over it)
-tag value
    tags filter, single or comma-separated tag names
-timeout duration
    request timeout (min: 1 second, max: 10 minutes) (default 5s)
-user-agent string
    user-agent string
-version
    show version
-workers int
    number of workers (default - number of CPU cores)
```
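As an example of combining these flags, a polite, throttled scan that respects robots.txt and limits concurrency could look like the sketch below (the target url is a placeholder; all flags are taken from the list above):

```
# two-level deep, respectful crawl with a longer timeout and reduced parallelism:
crawley -robots respect -delay 500ms -workers 2 -timeout 10s -depth 2 http://some-test.site
```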
# flags autocompletion

Crawley can handle flags autocompletion in bash and zsh via `complete`:

```
complete -C "/full-path-to/bin/crawley" crawley
```
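To make the completion persistent, you can append that line to your shell's startup file. The paths and the zsh `bashcompinit` step below are the usual defaults and may differ on your system:

```
# bash (assuming ~/.bashrc is sourced for interactive shells):
echo 'complete -C "/full-path-to/bin/crawley" crawley' >> ~/.bashrc

# zsh (assuming bash-style completion via bashcompinit; ~/.zshrc is the usual startup file):
echo 'autoload -U +X bashcompinit && bashcompinit' >> ~/.zshrc
echo 'complete -C "/full-path-to/bin/crawley" crawley' >> ~/.zshrc
```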