# parsel

parallel execution of RSelenium

CRAN status | License: MIT

parsel is a framework for parallelized dynamic web-scraping using RSelenium. Leveraging parallel processing, it allows you to run any RSelenium web-scraping routine on multiple browser instances simultaneously, thus greatly increasing the efficiency of your scraping. parsel utilizes chunked input processing as well as error catching and logging to ensure seamless execution of your scraping routine and minimal data loss, even in the presence of unforeseen RSelenium errors. parsel additionally provides convenient wrapper functions around RSelenium methods that allow you to quickly generate safe scraping code with minimal coding on your end.

You can find the parsel website here.

## Installation

``` r
# Install parsel from CRAN
install.packages("parsel")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("till-tietz/parsel")
```

## Usage

### Parallel Scraping

The following example will hopefully serve to illustrate the functionality and ideas behind how parsel operates. We’ll set up the following scraping job:

  1. navigate to a random Wikipedia article
  2. retrieve its title
  3. navigate to the first linked page on the article
  4. retrieve the linked page’s title and first section

and parallelize it with parsel.

parsel requires two things:

  1. a scraping function defining the actions to be executed in each RSelenium instance. Actions to be executed in each browser instance should be written in the conventional RSelenium syntax, with remDr$ specifying the remote driver.
  2. some input x to those actions (e.g. search terms to be entered in search boxes, or links to navigate to, etc.)
``` r
library(RSelenium)
library(parsel)

# let's define our scraping function input
# we want to run our function 4 times and we want it to
# start on the wikipedia main page each time
input <- rep("https://de.wikipedia.org", 4)

# let's define our scraping function
get_wiki_text <- function(x) {
  input_i <- x

  # navigate to input page (i.e. wikipedia)
  remDr$navigate(input_i)

  # find and click random article
  rand_art <- remDr$findElement(using = "id", "n-randompage")$clickElement()

  # get random article title
  title <- remDr$findElement(using = "id", "firstHeading")$getElementText()[[1]]

  # check if there is a linked page
  link_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]"))

  # if no linked page fill output with NA
  if (is(link_exists, "try-error")) {
    first_link_title <- NA
    first_link_text <- NA

  # if there is a linked page
  } else {
    # click on link
    link <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]")$clickElement()

    # get link page title
    first_link_title <- try(remDr$findElement(using = "id", "firstHeading"))

    if (is(first_link_title, "try-error")) {
      first_link_title <- NA
    } else {
      first_link_title <- first_link_title$getElementText()[[1]]
    }

    # get 1st section of link page
    first_link_text <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"))

    if (is(first_link_text, "try-error")) {
      first_link_text <- NA
    } else {
      first_link_text <- first_link_text$getElementText()[[1]]
    }
  }

  out <- data.frame(
    "random_article"   = title,
    "first_link_title" = first_link_title,
    "first_link_text"  = first_link_text
  )
  return(out)
}
```

Now that we have our scrape function and input, we can parallelize the execution of the function. For speed and efficiency reasons, it is advisable to specify the headless browser option in the extraCapabilities argument. parscrape will show a progress bar, as well as elapsed and estimated remaining time, so you can keep track of scraping progress.

``` r
wiki_text <- parsel::parscrape(
  scrape_fun = get_wiki_text,
  scrape_input = input,
  cores = 2,
  packages = c("RSelenium", "XML"),
  browser = "firefox",
  scrape_tries = 1,
  extraCapabilities = list(
    "moz:firefoxOptions" = list(args = list("--headless"))
  )
)
```

parscrape returns a list with two elements:

  1. a list of your scrape function output
  2. a data.frame of inputs it was unable to scrape, and the associated error messages
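As a minimal sketch of working with this return value (the two elements are accessed by position here, since that is all the structure described above guarantees; the `do.call(rbind, ...)` step assumes each successful scrape returned a data.frame, as in the example function):

``` r
# first element: list of scrape function output, one entry per input
results <- wiki_text[[1]]

# second element: data.frame of failed inputs and their error messages
failures <- wiki_text[[2]]

# stack the successfully scraped data.frames into one
all_text <- do.call(rbind, results)
```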

### RSelenium Constructors

parsel allows you to generate safe scraping code with minimal hassle by simply composing constructor functions, which effectively act as wrappers around RSelenium methods, in a pipe. You can return a scraper function defined by constructors to the environment by starting your pipe with start_scraper() and ending it with build_scraper(). Alternatively, you can dump the code generated by your constructor pipe to the console via show(). We’ll reproduce a slightly stripped-down version of the RSelenium code in the above wikipedia scraping routine via the parsel constructor functions.

``` r
library(parsel)

# returning a scraper function
start_scraper(args = "x", name = "get_wiki_text") %>>%
  go(url = "x") %>>%
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  build_scraper()
#> [1] "scraping function get_wiki_text constructed and in environment"

ls()
#> [1] "get_wiki_text"

# dumping generated code to console
go(url = "x") %>>%
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  show()
#> # navigate to url
#> not_loaded <- TRUE
#> remDr$navigate(x)
#> while(not_loaded){
#> Sys.sleep(0.25)
#> current <- remDr$getCurrentUrl()[[1]]
#> if(current == x){
#> not_loaded <- FALSE
#> }
#> }
#>
#> rand_art <- remDr$findElement(using = 'id', 'n-randompage')
#> rand_art$clickElement()
#> Sys.sleep(0.25)
#>
#> title <- try(remDr$findElement(using = 'id', 'firstHeading'))
#> if(is(title,'try-error')){
#> title <- NA
#> } else {
#> title <- title$getElementText()[[1]]
#> }
#>
#> link <- remDr$findElement(using = 'xpath', '/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]')
#> link$clickElement()
#> Sys.sleep(0.25)
#>
#> first_link_title <- try(remDr$findElement(using = 'id', 'firstHeading'))
#> if(is(first_link_title,'try-error')){
#> first_link_title <- NA
#> } else {
#> first_link_title <- first_link_title$getElementText()[[1]]
#> }
#>
#> first_link_text <- try(remDr$findElement(using = 'xpath', '/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'))
#> if(is(first_link_text,'try-error')){
#> first_link_text <- NA
#> } else {
#> first_link_text <- first_link_text$getElementText()[[1]]
#> }
```
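Since build_scraper() places get_wiki_text in the calling environment, the constructor-built function can then be handed to parscrape() just like the hand-written version earlier. A minimal sketch, reusing the argument values from the example above (the input URLs and core count are illustrative):

``` r
# run the constructor-built scraper on multiple browser instances in parallel
wiki_text <- parsel::parscrape(
  scrape_fun = get_wiki_text,
  scrape_input = rep("https://de.wikipedia.org", 4),
  cores = 2,
  browser = "firefox",
  extraCapabilities = list(
    "moz:firefoxOptions" = list(args = list("--headless"))
  )
)
```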
