
Cover photo by Kelly Sikkema
Typically, when we want to pull external data into an application, we'd use an API. But what if the data we want isn't in an API, or the API isn't easily accessible?
That question is what we'll be answering in this tutorial.
Project Overview
The end result of this tutorial is an open API and docs page deployed on Vercel.
As for the project itself, we'll let our users pass in a Twitter handle and we'll return that user's follower count. Note that we won't be using the Twitter API; instead, we'll be web scraping the data.
If you'd rather check out the code used to scrape followers from Twitter, feel free to check out this gist I put together and adapt it to your projects 🙂
But if you'd rather understand the process, problems, and reasoning as we create it together, keep on reading 😄
Understanding Web Scraping
Web scraping, or data mining, may sound scary or dangerous, but it's really just using basic tools to access data we already have access to.
To kick things off, we'll start this project with an image of the data we'd like to collect:
Using my Twitter profile as an example, we'd want to grab my follower count as a number.
But how?
If you've ever worked with the Twitter API before, you know that it requires an approval process and has gotten more restrictive over the years.
Usually, in that case, I like to inspect the network tab in the Developer Tools to try and find an XHR request that serves as a JSON endpoint I can try to use.
In this case, no luck, and the incoming requests are polluted. So now is the time to use the element selector to see if there's a way to grab what we want with basic CSS selectors:
"I want to grab the text3,605 that lives in the anchor tag with an
href
of/mtliendo/followers
"
It's important to be as specific as possible, but no more specific than necessary. Meaning, I wouldn't want to select one of those obscure-looking class names because they're likely coming from a CSS module or computed class name that changes every time we visit the page.
🗒️ If you're wondering how we've gotten this far and haven't written any code, welcome to web scraping! Deciding how you're going to grab the data you want is actually the hard part.
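One low-tech way to validate a selector before committing to it is to try it straight in the DevTools console on the profile page. A quick sketch, using the selector we just settled on (my handle as the example):

```js
// Run in the browser console while viewing the profile page.
// If the selector is right, this logs the raw anchor text with the count in it.
const anchor = document.querySelector("a[href='/mtliendo/followers']")
console.log(anchor.textContent)
```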
Scaffolding Our Application
Now that we understand why we're resorting to web scraping and have identified a way to grab the data we want, let's get to coding.
First, we'll initialize a NextJS project by running the following command:
```bash
npx create-next-app twitter-followers && cd $_
```
💡 The `&& cd $_` at the end will change into the directory we just created after the NextJS scaffolding takes place.
Next, let's add in the package we'll be using to perform the web scraping.
```bash
npm i cheerio
```
cheerio is a fantastic library that uses methods similar to jQuery to traverse HTML-like strings on the backend.
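If you haven't used cheerio before, here's a minimal sketch of that jQuery-style API against a hard-coded HTML string:

```js
const cheerio = require('cheerio')

// load() parses an HTML string; $ then queries it with CSS selectors,
// just like jQuery would in the browser
const $ = cheerio.load('<ul><li class="handle">@mtliendo</li></ul>')
console.log($('.handle').text()) // "@mtliendo"
```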
With our project scaffolded, and our main dependency installed, let's start writing out our function.
The last step before opening up our project in an editor is to create a file for our API route.
```bash
touch pages/api/followers.js
```
Writing the Scraper
In the `followers.js` file we just created, add the following code:
```js
const cheerio = require('cheerio') // 1

export default async (req, res) => {
  // 2
  if (req.method === 'POST') {
    // 3
    const username = req.body.TWuser
    try {
      // 4
      const response = await fetch(`https://mobile.twitter.com/${username}`)
      const htmlString = await response.text()
      const $ = cheerio.load(htmlString)
      const searchContext = `a[href='/${username}/followers']`
      const followerCountString = $(searchContext)
        .text()
        .match(/[0-9]/gi)
        .join('')

      res.statusCode = 200
      return res.json({
        user: username,
        followerCount: Number(followerCountString),
      })
    } catch (e) {
      // 5
      res.statusCode = 404
      return res.json({
        user: username,
        error: `${username} not found. Tip: Double check the spelling.`,
        followerCount: -1,
      })
    }
  }
}
```
Breaking down the code
1. First we import the `cheerio` module using CommonJS (`require` instead of `import`).
2. We export a function. NextJS will create a serverless endpoint for us. In doing so, it gives us a way to see what data came in via `req` (the request), and a way to send data back via `res` (the response). Because we're doing some asynchronous stuff in this function, I'm marking it as `async`.
3. As mentioned above, the `req` gives us info about what's coming in. Here we're saying, "If this incoming request is a `POST` request, look at the `body` and grab the `TWuser` piece of data." You'll see shortly how we send the `TWuser` along.
4. This is the heart of our application. Line by line, we're fetching data from Twitter and parsing the response as `.text()` instead of `.json()`. This gives us back the HTML of the page as a string, which is exactly what `cheerio` expects. From there, you'll notice the `a[href='/${username}/followers']` piece of code. This grabs the anchor tag that contains the follower count. The problem is that it's a long string that looks like this: `"      3,606\n      Followers"`. To solve for that, we use the `match` method with a bit of regex that grabs the numbers from the string; from there, we join the numbers back together (see the snippet after this list) and send the data back to the user as a JSON object.
5. A final bit of error handling that sends back some data if the user can't be found.
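To see that string cleanup in isolation, here's what the match/join combo from step 4 does to the raw anchor text (runnable anywhere JavaScript runs):

```js
const rawText = '      3,606\n      Followers'

// match(/[0-9]/gi) pulls out each digit: ['3', '6', '0', '6']
// join('') glues them back together, dropping the comma and the "Followers" label
const digits = rawText.match(/[0-9]/gi).join('')
console.log(Number(digits)) // 3606
```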
🚨 Perceptive devs may have noticed that I changed the URL from `twitter.com` to `mobile.twitter.com`. This is because the desktop site uses client-side rendering, but the mobile site (the legacy desktop site) renders the data on the server.
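If you want to verify that yourself, a quick sketch (Node 18+, where `fetch` is global) is to pull the raw HTML and check whether the followers anchor exists before any JavaScript runs:

```js
// If the page is server-rendered, the followers link exists in the raw HTML;
// on the client-rendered desktop site this would log false.
fetch('https://mobile.twitter.com/mtliendo')
  .then((res) => res.text())
  .then((html) => console.log(html.includes('/mtliendo/followers')))
```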
Testing Out the Endpoint
Now that the worst is over, let's test out the endpoint in a browser. To do so, replace `pages/index.js` with the following code:
```jsx
import Head from 'next/head'
import styles from '../styles/Home.module.css'
import React from 'react'

export default function Home() {
  const [inputValue, setInputValue] = React.useState('')
  const [userFollowers, setUserFollowers] = React.useState({})
  const handleSubmit = (e) => {
    e.preventDefault()
    fetch('/api/followers', {
      method: 'post',
      headers: {
        'content-type': 'application/json',
      },
      body: JSON.stringify({ TWuser: inputValue }),
    })
      .then((res) => res.json())
      .then((userData) => {
        setUserFollowers(userData)
      })
  }
  return (
    <div className={styles.container}>
      <Head>
        <title>Fetch Twitter Follower</title>
        <link rel="icon" href="/favicon.ico" />
      </Head>
      <main className={styles.main}>
        <h1>Fetch A Twitter Follower</h1>
        <form onSubmit={handleSubmit}>
          <label>
            Enter a Twitter username
            <input
              value={inputValue}
              onChange={(e) => setInputValue(e.target.value)}
            />
          </label>
          <button>Submit</button>
        </form>
        {userFollowers.followerCount >= 0 ? (
          <p>Followers: {userFollowers.followerCount}</p>
        ) : (
          <p>{userFollowers.error}</p>
        )}
      </main>
    </div>
  )
}
```
Not a whole lot going on here, but the main part I'll call out is the `fetch` call in the `handleSubmit` function.
The nice thing about NextJS is that we can just reference the file in the `api` directory by using the `api/<filename>` syntax. After that, we set the method to `post`, add a `json` header, and pass in our data as a stringified object.
Special attention goes to the `TWuser` key in the body. This has to match what we grab from `req.body` in our API.
If you haven't already, test out your application by running the following command in your terminal:
```bash
npm run dev
```
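If you'd rather poke the API directly instead of going through the form, here's a quick sketch you can paste into the browser console while sitting on the local dev page (swap in any handle you like):

```js
// POST to the API route the same way handleSubmit does
fetch('/api/followers', {
  method: 'post',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ TWuser: 'mtliendo' }),
})
  .then((res) => res.json())
  .then(console.log) // e.g. { user: 'mtliendo', followerCount: 3606 }
```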
🚀 Deploy!
After making sure you can get a follower count and a proper error message, it's time to deploy!
Vercel is an awesome service to build and deploy apps built with NextJS. After logging in, it asks for a git URL for the project, so you'll want to make sure you put your project up on GitHub first.
Once done, your project will be live, and your example page should work. However, if you fetch your live API endpoint from a different origin, it won't work: our API doesn't send CORS headers, so browsers block cross-origin requests by default.
As a final step, let's fix that and redeploy it.
First, let's add the `cors` package:
```bash
npm i cors
```
Then, we'll update the `api/followers.js` file to include the following at the top of the file:
```js
const cheerio = require('cheerio')
const Cors = require('cors')

// Initializing the cors middleware
const cors = Cors({
  methods: ['POST'],
})

// Helper method to wait for a middleware to execute before continuing
// And to throw an error when an error happens in a middleware
function runMiddleware(req, res, fn) {
  return new Promise((resolve, reject) => {
    fn(req, res, (result) => {
      if (result instanceof Error) {
        return reject(result)
      }
      return resolve(result)
    })
  })
}

export default async (req, res) => {
  await runMiddleware(req, res, cors)
  if (req.method === 'POST') {
    const username = req.body.TWuser
    // ...rest of code
```
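As configured above, the middleware allows requests from any origin, which is what we want for an open API. If you ever want to restrict it to a known frontend instead, the `cors` package also accepts an `origin` option; a minimal sketch (the domain is a hypothetical placeholder):

```js
// Only allow POSTs from a specific site (hypothetical domain for illustration)
const cors = Cors({
  methods: ['POST'],
  origin: 'https://my-frontend.example.com',
})
```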
Now, once you push your code up to GitHub, Vercel will automatically detect your changes and start a new build.
Test out your deployed function and notice that now you can use your API endpoint outside of your application!
Conclusion
🎉 Congrats on making it this far! 🎉
I hope you learned a lot and now have a solid foundation: not only how to web scrape, but when it's appropriate, and how to do it with NextJS!
The one thing this is missing is a database to store info 🤔
So if you liked this tutorial, you'll definitely like the next one, where we'll take a slightly different approach by adding AWS Amplify to the mix!