parallel execution of RSelenium
parsel is a framework for parallelized dynamic web-scraping using RSelenium. Leveraging parallel processing, it allows you to run any RSelenium web-scraping routine on multiple browser instances simultaneously, thus greatly increasing the efficiency of your scraping. parsel utilizes chunked input processing as well as error catching and logging to ensure seamless execution of your scraping routine and minimal data loss, even in the presence of unforeseen RSelenium errors. parsel additionally provides convenient wrapper functions around RSelenium methods that allow you to quickly generate safe scraping code with minimal coding on your end.
You can find the parsel website here.
``` r
# Install parsel from CRAN
install.packages("parsel")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("till-tietz/parsel")
```
The following example will hopefully serve to illustrate the functionality and ideas behind how parsel operates. We'll set up the following scraping job:
- navigate to a random Wikipedia article
- retrieve its title
- navigate to the first linked page on the article
- retrieve the linked page’s title and first section
and parallelize it with parsel.
parsel requires two things:
- a scraping function defining the actions to be executed in each
  RSelenium instance. Actions to be executed in each browser instance
  should be written in the conventional RSelenium syntax, with `remDr$`
  specifying the remote driver.
- some input `x` to those actions (e.g. search terms to be entered in
  search boxes, or links to navigate to, etc.)
``` r
library(RSelenium)
library(parsel)

# let's define our scraping function input
# we want to run our function 4 times and we want it to start on the wikipedia main page each time
input <- rep("https://de.wikipedia.org", 4)

# let's define our scraping function
get_wiki_text <- function(x) {
  input_i <- x

  # navigate to input page (i.e. wikipedia)
  remDr$navigate(input_i)

  # find and click random article
  rand_art <- remDr$findElement(using = "id", "n-randompage")$clickElement()

  # get random article title
  title <- remDr$findElement(using = "id", "firstHeading")$getElementText()[[1]]

  # check if there is a linked page
  link_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]"))

  # if no linked page fill output with NA
  if (is(link_exists, "try-error")) {
    first_link_title <- NA
    first_link_text <- NA

  # if there is a linked page
  } else {
    # click on link
    link <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]")$clickElement()

    # get link page title
    first_link_title <- try(remDr$findElement(using = "id", "firstHeading"))

    if (is(first_link_title, "try-error")) {
      first_link_title <- NA
    } else {
      first_link_title <- first_link_title$getElementText()[[1]]
    }

    # get 1st section of link page
    first_link_text <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"))

    if (is(first_link_text, "try-error")) {
      first_link_text <- NA
    } else {
      first_link_text <- first_link_text$getElementText()[[1]]
    }
  }

  out <- data.frame(
    "random_article"   = title,
    "first_link_title" = first_link_title,
    "first_link_text"  = first_link_text
  )
  return(out)
}
```
Now that we have our scrape function and input, we can parallelize the execution of the function. For speed and efficiency reasons, it is advisable to specify the headless browser option in the extraCapabilities argument. parscrape will show a progress bar, as well as elapsed and estimated remaining time, so you can keep track of scraping progress.
``` r
wiki_text <- parsel::parscrape(
  scrape_fun = get_wiki_text,
  scrape_input = input,
  cores = 2,
  packages = c("RSelenium", "XML"),
  browser = "firefox",
  scrape_tries = 1,
  extraCapabilities = list(
    "moz:firefoxOptions" = list(args = list('--headless'))
  )
)
```
parscrape returns a list with two elements:
- a list of your scrape function output
- a data.frame of inputs it was unable to scrape, and the associated error messages
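Since each successful scrape returns a one-row data.frame, a natural next step is to row-bind the successes and inspect the failures separately. A minimal post-processing sketch; the positional access below is an assumption, so check `names(wiki_text)` on your own output:

``` r
# sketch: combine the scraped data.frames and inspect failures;
# element order is an assumption -- verify with names(wiki_text)
scraped <- wiki_text[[1]]           # list of per-input data.frames
failed  <- wiki_text[[2]]           # data.frame of inputs + error messages

results <- do.call(rbind, scraped)  # one row per successful scrape
```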
parsel allows you to generate safe scraping code with minimal hassle by simply composing constructor functions, which effectively act as wrappers around RSelenium methods, in a pipe. You can return a scraper function defined by constructors to the environment by starting your pipe with start_scraper() and ending it with build_scraper(). Alternatively, you can dump the code generated by your constructor pipe to the console via show(). We'll reproduce a slightly stripped-down version of the RSelenium code in the above Wikipedia scraping routine via the parsel constructor functions.
``` r
library(parsel)

# returning a scraper function
start_scraper(args = "x", name = "get_wiki_text") %>>%
  go(url = "x") %>>%
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  build_scraper()
#> [1] "scraping function get_wiki_text constructed and in environment"
#> [1] "scraping function get_wiki_text constructed and in environment"

ls()
#> [1] "get_wiki_text"

# dumping generated code to console
go(url = "x") %>>%
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  show()
#> # navigate to url
#> not_loaded <- TRUE
#> remDr$navigate(x)
#> while(not_loaded){
#>   Sys.sleep(0.25)
#>   current <- remDr$getCurrentUrl()[[1]]
#>   if(current == x){
#>     not_loaded <- FALSE
#>   }
#> }
#>
#> rand_art <- remDr$findElement(using = 'id', 'n-randompage')
#> rand_art$clickElement()
#> Sys.sleep(0.25)
#>
#> title <- try(remDr$findElement(using = 'id', 'firstHeading'))
#> if(is(title,'try-error')){
#>   title <- NA
#> } else {
#>   title <- title$getElementText()[[1]]
#> }
#>
#> link <- remDr$findElement(using = 'xpath', '/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]')
#> link$clickElement()
#> Sys.sleep(0.25)
#>
#> first_link_title <- try(remDr$findElement(using = 'id', 'firstHeading'))
#> if(is(first_link_title,'try-error')){
#>   first_link_title <- NA
#> } else {
#>   first_link_title <- first_link_title$getElementText()[[1]]
#> }
#>
#> first_link_text <- try(remDr$findElement(using = 'xpath', '/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'))
#> if(is(first_link_text,'try-error')){
#>   first_link_text <- NA
#> } else {
#>   first_link_text <- first_link_text$getElementText()[[1]]
#> }
```
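Because build_scraper() places get_wiki_text in the environment, the constructor-built scraper can be handed straight to parscrape() like any hand-written scrape function. A sketch reusing the input vector and parscrape() arguments from the earlier example (a running Selenium setup is assumed):

``` r
# run the constructor-built scraper in parallel,
# reusing the input vector from the earlier example
wiki_text <- parsel::parscrape(
  scrape_fun = get_wiki_text,
  scrape_input = input,
  cores = 2,
  packages = c("RSelenium"),
  browser = "firefox",
  scrape_tries = 1,
  extraCapabilities = list(
    "moz:firefoxOptions" = list(args = list("--headless"))
  )
)
```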