# Getting started with Puppeteer and Chrome Headless for Web Scraping
Here is a link to the Medium Article.
Here is the Chinese Version, thanks to @csbun.
Puppeteer is the official tool for Chrome Headless by the Google Chrome team. Since the official announcement of Chrome Headless, many of the industry-standard libraries for automated testing have been discontinued by their maintainers, including PhantomJS. Selenium IDE for Firefox has also been discontinued due to lack of maintainers.

With Chrome being the market leader in web browsing, Chrome Headless is going to be the industry leader in automated testing of web applications. So, I have put together this starter guide on how to get started with web scraping in Chrome Headless.
In this guide we will scrape GitHub: log in to it, then extract and save the emails of users, using Chrome Headless, Puppeteer, Node and MongoDB. Don't worry, GitHub has a rate limiting mechanism in place to keep you under control, but this post will give you a good idea of scraping with Chrome Headless and Node. Also, always stay updated with the documentation, because Puppeteer is under development and its APIs are prone to change.
Before we start, we need the following tools installed. Head over to their websites and install them.
Start off by making the project directory
```
$ mkdir thal
$ cd thal
```
Initialize npm and put in the necessary details:
```
$ npm init
```
Install Puppeteer. It's not stable and the repository is updated daily. If you want to avail yourself of the latest functionality, you can install it directly from its GitHub repository.
```
$ npm i --save puppeteer
```
Puppeteer includes its own Chrome / Chromium, which is guaranteed to work headless. So each time you install or update Puppeteer, it will download its specific Chromium version.
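If you're curious where that bundled browser lives, Puppeteer exposes an `executablePath()` helper; a quick sketch (printing the path is my own addition, not part of this guide's scraper):

```js
const puppeteer = require('puppeteer');

// prints the path of the Chromium build downloaded for this puppeteer version
console.log(puppeteer.executablePath());
```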
We will start by taking a screenshot of the page. This is code from their documentation.
```js
const puppeteer = require('puppeteer');

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://github.com');
  await page.screenshot({ path: 'screenshots/github.png' });

  browser.close();
}

run();
```
If it's your first time using Node 7 or 8, you might be unfamiliar with the async and await keywords. To put async/await in really simple words: an async function returns a Promise. The promise, when resolved, might return the result that you asked for. But to get that result in a single line, you tie the call to the async function with await.
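To make the idea concrete, here is a minimal, self-contained illustration (a hypothetical example, not part of the scraper):

```js
// an async function implicitly wraps its return value in a Promise
async function getGreeting() {
  return 'hello';
}

async function main() {
  // calling getGreeting() without await gives back a pending Promise;
  // await unwraps the resolved value in a single line
  const greeting = await getGreeting();
  console.log(greeting); // prints 'hello'
}

main();
```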
Save this in index.js inside the project directory.
Also create the screenshots directory:
```
$ mkdir screenshots
```
Run the code with
```
$ node index.js
```
The screenshot is now saved inside the screenshots/ directory.
If you go to GitHub, search for john, and then click the Users tab, you will see a list of all the users with matching names.
Some of them have made their emails publicly visible and some have chosen not to. But the thing is, you can't see these emails without logging in. So, let's log in. We will make heavy use of the Puppeteer documentation.
Add a file creds.js in the project root. I highly recommend signing up for a new account with a new dummy email, because you might end up getting your account blocked.
```js
module.exports = {
  username: '<GITHUB_USERNAME>',
  password: '<GITHUB_PASSWORD>'
};
```
Add another file .gitignore and put the following content inside it:
```
node_modules/
creds.js
```
For visual debugging, make Chrome launch with a GUI by passing an object with headless: false to the launch method:
```js
const browser = await puppeteer.launch({ headless: false });
```
Let's navigate to the login page:
```js
await page.goto('https://github.com/login');
```
Open https://github.com/login in your browser. Right click on the input box below Username or email address and select Inspect. In the developer tools, right click on the highlighted code, select Copy and then Copy selector.
Paste that value into the following constant:
```js
const USERNAME_SELECTOR = '#login_field'; // "#login_field" is the copied value
```
Repeat the process for the Password input box and the Sign in button. You should end up with the following:
```js
// dom element selectors
const USERNAME_SELECTOR = '#login_field';
const PASSWORD_SELECTOR = '#password';
const BUTTON_SELECTOR = '#login > form > div.auth-form-body.mt-3 > input.btn.btn-primary.btn-block';
```
Puppeteer provides a click method to click a DOM element and a type method to type text into an input box. Let's fill in the credentials, then click sign in and wait for the redirect.
Up on top, require the creds.js file:
```js
const CREDS = require('./creds');
```
And then
```js
await page.click(USERNAME_SELECTOR);
await page.keyboard.type(CREDS.username);

await page.click(PASSWORD_SELECTOR);
await page.keyboard.type(CREDS.password);

// start waiting for the navigation before the click triggers it,
// so the redirect isn't missed
await Promise.all([
  page.click(BUTTON_SELECTOR),
  page.waitForNavigation()
]);
```
Now we are logged in. We could programmatically click on the search box, fill it in and, on the results page, click the Users tab. But there's an easier way: search requests are usually GET requests, so everything is sent via the URL. So, manually type john inside the search box, then click the Users tab and copy the URL. It would be:
```js
const searchUrl = 'https://github.com/search?q=john&type=Users&utf8=%E2%9C%93';
```
Rearranging a bit:
```js
const userToSearch = 'john';
const searchUrl = `https://github.com/search?q=${userToSearch}&type=Users&utf8=%E2%9C%93`;
```
Let's navigate to this page and wait to see if it actually searched:
```js
await page.goto(searchUrl);
await page.waitFor(2 * 1000);
```
We are interested in extracting the username and email of each user. Let's copy the DOM element selectors like we did above:
```js
const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > a';
const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > ul > li:nth-child(2) > a';
const LENGTH_SELECTOR_CLASS = 'user-list-item';
```
You can see that I also added LENGTH_SELECTOR_CLASS above. If you look at the GitHub page's code inside the developer tools, you will observe that the divs with class user-list-item each house the information about a single user.
Currently, one way to extract text from an element is by using the evaluate method of Page or ElementHandle. When we navigate to the page with search results, we will use the page.evaluate method to get the length of the users list on the page. The evaluate method evaluates the code inside the browser context:
```js
let listLength = await page.evaluate((sel) => {
  return document.getElementsByClassName(sel).length;
}, LENGTH_SELECTOR_CLASS);
```
Let's loop through all the listed users and extract their emails. As we loop through the DOM, we have to change the index inside the selectors so that they point to the next DOM element. So, I put the INDEX string at the place where we want to insert the index as we loop:
```js
// const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > a';
const LIST_USERNAME_SELECTOR = '#user_search_results > div.user-list > div:nth-child(INDEX) div.d-flex > div > a';

// const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(1) div.d-flex > div > ul > li:nth-child(2) > a';
const LIST_EMAIL_SELECTOR = '#user_search_results > div.user-list > div:nth-child(INDEX) div.d-flex > div > ul > li:nth-child(2) > a';

const LENGTH_SELECTOR_CLASS = 'user-list-item';
```
The loop and extraction:
```js
for (let i = 1; i <= listLength; i++) {
  // change the index to the next child
  let usernameSelector = LIST_USERNAME_SELECTOR.replace("INDEX", i);
  let emailSelector = LIST_EMAIL_SELECTOR.replace("INDEX", i);

  let username = await page.evaluate((sel) => {
    return document.querySelector(sel).getAttribute('href').replace('/', '');
  }, usernameSelector);

  let email = await page.evaluate((sel) => {
    let element = document.querySelector(sel);
    return element ? element.innerHTML : null;
  }, emailSelector);

  // not all users have emails visible
  if (!email)
    continue;

  console.log(username, ' -> ', email);

  // TODO save this user
}
```
Now if you run the script with node index.js, you will see usernames and their corresponding emails printed.
First, we estimate the last page number of the search results. On the search results page, at the top, you can see 69,769 users at the time of this writing.
Fun fact: if you compare with the previous screenshot of the page, you will notice that 6 more johns have joined GitHub in the matter of a few hours.
Copy its selector from the developer tools. We will write a new function, below the run function, that returns the number of pages we can go through:
```js
async function getNumPages(page) {
  const NUM_USER_SELECTOR = '#js-pjax-container > div.container > div > div.column.three-fourths.codesearch-results.pr-6 > div.d-flex.flex-justify-between.border-bottom.pb-3 > h3';

  let inner = await page.evaluate((sel) => {
    let html = document.querySelector(sel).innerHTML;
    // format is: "69,803 users"
    return html.replace(',', '').replace('users', '').trim();
  }, NUM_USER_SELECTOR);

  let numUsers = parseInt(inner);

  console.log('numUsers: ', numUsers);

  // GitHub shows 10 results per page, so
  let numPages = Math.ceil(numUsers / 10);
  return numPages;
}
```
At the bottom of the search results page, if you hover the mouse over the buttons with page numbers, you can see they link to the next pages. The link to the 2nd page of results is https://github.com/search?p=2&q=john&type=Users&utf8=%E2%9C%93. Notice the p=2 query parameter in the URL. This will help us navigate to the next page.
After wrapping our previous loop in an outer loop that goes through all the pages, the code looks like:
```js
let numPages = await getNumPages(page);

console.log('Numpages: ', numPages);

for (let h = 1; h <= numPages; h++) {
  let pageUrl = searchUrl + '&p=' + h;

  await page.goto(pageUrl);

  let listLength = await page.evaluate((sel) => {
    return document.getElementsByClassName(sel).length;
  }, LENGTH_SELECTOR_CLASS);

  for (let i = 1; i <= listLength; i++) {
    // change the index to the next child
    let usernameSelector = LIST_USERNAME_SELECTOR.replace("INDEX", i);
    let emailSelector = LIST_EMAIL_SELECTOR.replace("INDEX", i);

    let username = await page.evaluate((sel) => {
      return document.querySelector(sel).getAttribute('href').replace('/', '');
    }, usernameSelector);

    let email = await page.evaluate((sel) => {
      let element = document.querySelector(sel);
      return element ? element.innerHTML : null;
    }, emailSelector);

    // not all users have emails visible
    if (!email)
      continue;

    console.log(username, ' -> ', email);

    // TODO save this user
  }
}
```
The part with puppeteer is over now. We will use mongoose to store the information into MongoDB. It's an ODM; actually, just a library to facilitate information storage and retrieval from the database.
```
$ npm i --save mongoose
```
MongoDB is a schema-less NoSQL database, but we can make it follow some rules using Mongoose. First we have to create a Model, which is just a representation of a MongoDB Collection in code. Create a directory models, then create a file user.js inside it and put the following code in it: the structure of our collection. From then on, whatever we insert into the users collection with mongoose will have to follow this structure.
```js
const mongoose = require('mongoose');

let userSchema = new mongoose.Schema({
  username: String,
  email: String,
  dateCrawled: Date
});

let User = mongoose.model('User', userSchema);

module.exports = User;
```
Let's now actually insert. We don't want duplicate emails in our database, so we only insert a user's information if the email is not already present; otherwise we just update the information. For this we will use mongoose's Model.findOneAndUpdate method.
At the top of index.js, add the imports:
```js
const mongoose = require('mongoose');
const User = require('./models/user');
```
Add the following function at the bottom of index.js to upsert (update or insert) the User model:
```js
function upsertUser(userObj) {
  const DB_URL = 'mongodb://localhost/thal';

  if (mongoose.connection.readyState == 0) {
    mongoose.connect(DB_URL);
  }

  // if this email exists, update the entry, don't insert
  const conditions = { email: userObj.email };
  // upsert: insert if no document matches; new: return the modified document
  const options = { upsert: true, new: true, setDefaultsOnInsert: true };

  User.findOneAndUpdate(conditions, userObj, options, (err, result) => {
    if (err) throw err;
  });
}
```
Start the MongoDB server. To save the user, put the following code inside the for loops, in place of the comment // TODO save this user:
```js
upsertUser({
  username: username,
  email: email,
  dateCrawled: new Date()
});
```
To check if the users are actually being saved, get inside the mongo shell:
```
$ mongo
> use thal
> db.users.find().pretty()
```
You will see multiple users added there. This marks the crux of this guide.
Chrome Headless and Puppeteer are the start of a new era in web scraping and automated testing. Chrome Headless also supports WebGL. You can deploy your scraper in the cloud and sit back and let it do the heavy lifting. Remember to remove the headless: false option when you deploy on a server.
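One way to handle this, as a minimal sketch (deriving the flag from NODE_ENV is my assumption, not part of the original code):

```js
// assumption: NODE_ENV is set to 'production' on the server
const headless = process.env.NODE_ENV === 'production';
const browser = await puppeteer.launch({ headless: headless });
```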
- While scraping, you might be halted by GitHub's rate limiting (a simple mitigation is sketched below).
- Another thing I noticed: you cannot go beyond 100 pages of search results on GitHub.
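A simple, hedged mitigation for the rate limiting (my own suggestion, not something the original code does) is to pause between page loads:

```js
// hypothetical helper: resolves after the given number of milliseconds
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// inside the outer page loop, after page.goto(pageUrl):
await delay(1000); // be gentle with GitHub's servers
```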
Deserts symbolize vastness and are witness to the struggles and sacrifices of the people who traversed these giant mountains of sand. Thal is a desert in Pakistan, spanning multiple districts including my home district Bhakkar. Somewhat similar is the case with the Internet that we traversed today in quest of data. That's why I named the repository Thal. If you like this effort, please like and share this with others. If you have any suggestions, comment here or approach me directly @e_mad_ehsan. I would love to hear from you.