Gocrawl is a small Go library for scraping data from websites. Some command-line utilities are provided as well.
Gocrawl is a simple command-line utility to crawl websites. It uses gopkg.in/xmlpath.v2
to perform HTML parsing and XPath evaluation.
See the examples directory for an example job specification. Example usage and output:
```
$ gocrawl -input example/job.json
{
  "day": {
    "error": "",
    "values": [
      "Saturday"
    ]
  },
  "invalid_regexp": {
    "error": "error parsing regexp: missing closing ]: `[a-z+`",
    "values": [
      "Saturday, 2 June 2018"
    ]
  },
  "invalid_xpath": {
    "error": "compiling xml path \"//*[@id=\\\"ctdat\\\"]\":8: expected a literal string",
    "values": []
  },
  "month": {
    "error": "",
    "values": [
      "June"
    ]
  },
  "no_matching_xpath": {
    "error": "no match for xpath //*[@id=\"cttdat\"]",
    "values": []
  },
  "time": {
    "error": "",
    "values": [
      "13:18:59"
    ]
  }
}
```
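The output maps each extraction name to an error string (empty on success) and the list of extracted values, so it is easy to consume from Go. A minimal sketch of decoding it (the `Result` type name is an assumption for illustration, not a gocrawl type):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors one entry of gocrawl's JSON output: an error message
// (empty on success) and the extracted values.
type Result struct {
	Error  string   `json:"error"`
	Values []string `json:"values"`
}

// decodeResults parses a full gocrawl output document into a map of
// extraction name to Result.
func decodeResults(data []byte) (map[string]Result, error) {
	results := map[string]Result{}
	err := json.Unmarshal(data, &results)
	return results, err
}

func main() {
	out := []byte(`{"day":{"error":"","values":["Saturday"]},"month":{"error":"","values":["June"]}}`)
	results, err := decodeResults(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(results["day"].Values[0], results["month"].Values[0]) // Saturday June
}
```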
Gocrawld is a daemon version of gocrawl. It accepts a POST request containing a job specification identical to that of gocrawl
and returns the result of executing the crawl job, encoded as JSON.
Example usage:
```
$ gocrawld -host localhost -port 12345 &
<pid>
$ curl -XPOST localhost:12345 --data @example/job.json
<job output will be the same as above except less pretty>
```
Example docker usage:
```
docker run --rm --net=host --detach johnstcn/gocrawld
```