# x-ray

The next web scraper. See through the `<html>` noise.
```js
var Xray = require('x-ray')
var x = Xray()

x('https://blog.ycombinator.com/', '.post', [{
  title: 'h1 a',
  link: '.article-title@href'
}])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')
```
## Installation

```
npm install x-ray
```
- **Flexible schema:** Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
- **Composable:** The API is entirely composable, giving you great flexibility in how you scrape each page.
- **Pagination support:** Paginate through websites, scraping each page. X-ray also supports a request `delay` and a pagination `limit`. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.
- **Crawler support:** Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.
- **Responsible:** X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.
- **Pluggable drivers:** Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.
## API

### xray(url, selector)(fn)

Scrape the `url` for the following `selector`, returning an object in the callback `fn`. The `selector` takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is `selector@attribute`. If you do not supply an attribute, the default is selecting the `innerText`.
Here are a few examples:
- Scrape a single tag:

```js
xray('http://google.com', 'title')(function (err, title) {
  console.log(title) // Google
})
```

- Scrape a single class:

```js
xray('http://reddit.com', '.content')(fn)
```

- Scrape an attribute:

```js
xray('http://techcrunch.com', 'img.logo@src')(fn)
```

- Scrape `innerHTML`:

```js
xray('http://news.ycombinator.com', 'body@html')(fn)
```
You can also supply a `scope` to each `selector`. In jQuery, this would look something like this: `$(scope).find(selector)`.

Instead of a url, you can also supply raw HTML, and all the same semantics apply.

```js
var html = '<body><h2>Pear</h2></body>'
x(html, 'body', 'h2')(function (err, header) {
  header // => Pear
})
```
### xray.driver(driver)

Specify a `driver` to make requests through. Available drivers include:

- request: A simple driver built around request. Use this to set headers, cookies or HTTP methods.
- phantom: A high-level browser automation library. Use this to render pages, to interact with elements, or when elements are created dynamically using JavaScript (e.g. Ajax calls).
### xray(url, selector).stream()

Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:

```js
var app = require('express')()
var x = require('x-ray')()

app.get('/', function (req, res) {
  var stream = x('http://google.com', 'title').stream()
  stream.pipe(res)
})
```
### xray(url, selector).write([path])

Stream the results to a `path`.

If no path is provided, then the behavior is the same as `.stream()`.
### xray(url, selector).then(cb)

Constructs a `Promise` object and invokes its `then` function with the callback `cb`. Be sure to invoke `then()` as the last step of the x-ray method chaining, since the other methods are not promisified.

```js
x('https://dribbble.com', 'li.group', [{
  title: '.dribbble-img strong',
  image: '.dribbble-img [data-src]@data-src'
}])
  .paginate('.next_page@href')
  .limit(3)
  .then(function (res) {
    console.log(res[0]) // prints the first result
  })
  .catch(function (err) {
    console.log(err) // handle error in promise
  })
```
### xray(url, selector).paginate(selector)

Select a `url` from a `selector` and visit that page.
### xray(url, selector).limit(n)

Limit the amount of pagination to `n` requests.
### xray(url, selector).abort(validator)

Abort pagination if the `validator` function returns `true`. The `validator` function receives two arguments:

- `result`: The scrape result object for the current page.
- `nextUrl`: The URL of the next page to scrape.
### xray(url, selector).delay(from, [to])

Delay the next request between `from` and `to` milliseconds. If only `from` is specified, delay exactly `from` milliseconds.

```js
var x = Xray().delay('1s', '10s')
```
### xray(url, selector).concurrency(n)

Set the request concurrency to `n`. Defaults to `Infinity`.

```js
var x = Xray().concurrency(2)
```
### xray(url, selector).throttle(n, ms)

Throttle the requests to `n` requests per `ms` milliseconds.

```js
var x = Xray().throttle(2, '1s')
```
### xray(url, selector).timeout(ms)

Specify a timeout of `ms` milliseconds for each request.

```js
var x = Xray().timeout(30)
```
## Collections

X-ray also has support for selecting collections of tags. While `x('ul', 'li')` will only select the first list item in an unordered list, `x('ul', ['li'])` will select all of them.

Additionally, X-ray supports "collections of collections", allowing you to smartly select all list items in all lists with a command like this: `x(['ul'], ['li'])`.
## Composition

X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:

```js
// Follow a link on the page and scrape the destination.
var Xray = require('x-ray')
var x = Xray()

x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title') // follow link to google images
})(function (err, obj) {
  /*
    { main: 'Google', image: 'Google Images' }
  */
})
```

```js
// Scope a collection of objects to a repeated element on the page.
var Xray = require('x-ray')
var x = Xray()

x('http://mat.io', {
  title: 'title',
  items: x('.item', [{
    title: '.item-content h2',
    description: '.item-content section'
  }])
})(function (err, obj) {
  /*
    {
      title: 'mat.io',
      items: [{
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }]
    }
  */
})
```
## Filters

Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using `|`.

```js
var Xray = require('x-ray')
var x = Xray({
  filters: {
    trim: function (value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function (value) {
      return typeof value === 'string' ? value.split('').reverse().join('') : value
    },
    slice: function (value, start, end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
})

x('http://mat.io', {
  title: 'title | trim | reverse | slice:2,3'
})(function (err, obj) {
  /*
    { title: 'oi' }
  */
})
```
- selector: simple string selector
- collections: selects an object
- arrays: selects an array
- collections of collections: selects an array of objects
- array of arrays: selects an array of arrays
## In the wild

- Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.
Support us with a monthly donation and help us continue our activities. [Become a backer]
Become a sponsor and get your logo on our website and on our README on Github with a link to your site. [Become a sponsor]
## License

MIT