This repository was archived by the owner on Jun 27, 2022. It is now read-only.

watzon/arachnid (public archive)


Powerful web scraping framework for Crystal

watzon.github.io/arachnid

License

MIT license


Arachnid

This project is no longer maintained. Please see Mechanize for an alternative.

Arachnid is a fast web crawler for Crystal, with multi-threading support planned. It recently underwent a full rewrite for Crystal 0.35.1, so see the documentation below for updated usage instructions.

Installation

Add the dependency to your shard.yml:

dependencies:
  arachnid:
    github: watzon/arachnid

Run shards install

Usage

First, of course, you need to require arachnid in your project:

require "arachnid"

The Agent

Agent is the class that does all the heavy lifting and will be the main one you interact with. To create a new Agent, use Agent.new.

agent = Arachnid::Agent.new

The initialize method takes a bunch of optional parameters:

`:client`

You can, if you wish, supply your own HTTP::Client instance to the Agent. This can be useful if you want to use a proxy, provided the proxy client extends HTTP::Client.

`:user_agent`

The user agent to be added to every request header. You can override this on a per-host basis with either :host_headers or :default_headers.

`:default_headers`

The default headers to be used in every request.

`:host_headers`

Headers to be applied on a per-host basis. This is a hash of String (host name) => HTTP::Headers.

`:queue`

The Arachnid::Queue instance to use for storing links waiting to be processed. The default is a MemoryQueue (which is the only one for now), but you can easily implement your own Queue using whatever you want as a backend.

`:stop_on_empty`

Whether or not to stop running when the queue is empty. This is true by default. If it's made false, the loop will continue when the queue empties, so be sure you have a way to keep adding items to the queue.

`:follow_redirects`

Whether or not to follow redirects (add them to the queue).
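
As a rough illustration of the options above, the sketch below creates an Agent with a few of them set. It assumes the options are accepted as named arguments to Agent.new and that :default_headers takes an HTTP::Headers instance; the user agent string and the header value are made up for the example.

```crystal
require "http/client"
require "arachnid"

# Headers sent with every request (the value here is purely illustrative).
default_headers = HTTP::Headers{"Accept-Language" => "en"}

# Assumption: each option listed above is passed as a named argument.
agent = Arachnid::Agent.new(
  user_agent: "my-crawler/0.1",     # hypothetical user agent string
  default_headers: default_headers,
  stop_on_empty: true,              # stop once the queue is empty (the documented default)
  follow_redirects: false           # do not enqueue redirect targets
)
```

Any option left out keeps its default.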

Starting the Agent

There are four ways to start your Agent once it's been created. Here are some examples:

`#start_at`

#start_at starts the Agent running on a particular URL. It adds a single URL to the queue and starts there.

agent.start_at("https://crystal-lang.org") do
  # ...
end

`#site`

#site starts the agent running at the given URL and adds a rule that keeps the agent restricted to the given site. This allows the agent to scan the given domain and any subdomains. For instance:

agent.site("https://crystal-lang.org") do
  # ...
end

The above will match crystal-lang.org and forum.crystal-lang.org, but not github.com/crystal-lang or any other site not within the *.crystal-lang.org space.

`#host`

#host is like #site, but restricted to the exact host given. Subdomains are not included.

agent.host("crystal-lang.org") do
  # ...
end
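
The difference between #site and #host can be illustrated with a small stand-alone sketch. It does not use Arachnid at all: `same_site?` and `same_host?` are hypothetical helpers that only approximate the matching rules described above, and the library's actual implementation may differ.

```crystal
require "uri"

# Hypothetical helper: true when the URI's host is `domain` or one of its
# subdomains (approximates the rule #site is described as adding).
def same_site?(uri : URI, domain : String) : Bool
  host = uri.host
  return false if host.nil?
  host == domain || host.ends_with?(".#{domain}")
end

# Hypothetical helper: true only when the URI's host is exactly `domain`
# (approximates the rule #host is described as adding).
def same_host?(uri : URI, domain : String) : Bool
  uri.host == domain
end

urls = [
  "https://crystal-lang.org/reference",
  "https://forum.crystal-lang.org",
  "https://github.com/crystal-lang",
]

urls.each do |url|
  uri = URI.parse(url)
  site = same_site?(uri, "crystal-lang.org")
  host = same_host?(uri, "crystal-lang.org")
  puts "#{url}: site=#{site} host=#{host}"
end
```

Under these assumptions, forum.crystal-lang.org passes the site check but not the host check, and github.com passes neither.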

`#start`

Provided you already have URIs in the queue ready to be scanned, you can also just use #start to start the Agent running.

agent.enqueue("https://crystal-lang.org")
agent.enqueue("https://kemalcr.com")
agent.start

Filters

URIs can be filtered before being enqueued. There are two kinds of filters: accept and reject. Accept filters ensure that a URI matches before it is enqueued. Reject filters do the opposite, keeping URIs from being enqueued if they do match.

For instance:

# This will filter out all sites where the host is not "crystal-lang.org"
agent.accept_filter { |uri| uri.host == "crystal-lang.org" }

To exclude some of the URIs that the accept filter above lets through, add a reject filter:

# This will ignore paths starting with "/api"
agent.reject_filter { |uri| uri.path.to_s.starts_with?("/api") }

The #site and #host methods add a default accept filter in order to keep things in the given site or host.
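
To see how the two example filters interact, the stand-alone sketch below applies the same predicates to a few sample URIs without running a crawl. The `enqueue?` helper is hypothetical; it assumes a URI is enqueued only when the accept filter matches and the reject filter does not, which may not be exactly how Arachnid combines multiple filters.

```crystal
require "uri"

# The two predicates from the filter examples, written as stand-alone procs.
accept = ->(uri : URI) { uri.host == "crystal-lang.org" }
reject = ->(uri : URI) { uri.path.to_s.starts_with?("/api") }

# Hypothetical helper. Assumption: a URI is enqueued only if the accept
# filter matches it and the reject filter does not.
def enqueue?(uri : URI, accept : Proc(URI, Bool), reject : Proc(URI, Bool)) : Bool
  accept.call(uri) && !reject.call(uri)
end

[
  "https://crystal-lang.org/reference",  # accepted
  "https://crystal-lang.org/api/latest", # rejected: path starts with "/api"
  "https://kemalcr.com",                 # rejected: host does not match
].each do |url|
  puts "#{url} => #{enqueue?(URI.parse(url), accept, reject)}"
end
```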

Resources

All the above is useless if you can't do anything with the scanned resources, which is why we have the Resource class. Every scanned resource is converted into a Resource (or subclass) based on the content type. For instance, text/html becomes a Resource::HTML, which is parsed using kostya/myhtml for extra speed.

Each resource has an associated Agent#on_ method so you can do something when one of those resources is scanned:

agent.on_html do |page|
  puts typeof(page) # => Arachnid::Resource::HTML
  puts page.title   # => The Title of the Page
end

Currently we have:

#on_html
#on_image
#on_script
#on_stylesheet
#on_xml

There is also #on_resource, which is called for every resource, including ones that don't match the above types. All resources include, at minimum, the URI at which the resource was found and the response (HTTP::Client::Response) instance.

Contributing

1. Fork it (https://github.com/watzon/arachnid/fork)
2. Create your feature branch (git checkout -b my-new-feature)
3. Commit your changes (git commit -am 'Add some feature')
4. Push to the branch (git push origin my-new-feature)
5. Create a new Pull Request

Contributors

your-name-here - creator and maintainer


Releases

3 releases. Latest: Updated CLI (Jul 1, 2019)
