
Cover photo by Kelly Sikkema
Typically, when we want to pull external data into an application, we'd use an API. But what if the data we want isn't in an API, or the API isn't easily accessible?
That question is what we'll be answering in this tutorial.
Project Overview
The end result of this tutorial is an open API and docs page deployed on Vercel.
As for the project itself, we'll let our users pass in a Twitter handle and we'll return that user's follower count. Note that we won't be using the Twitter API; instead, we'll be web scraping the data.
If you'd rather check out the code used to scrape followers from Twitter, feel free to check out this gist I put together and adapt it to your projects 🙂
But if you'd rather understand the process, problems, and reasoning as we create it together, keep on reading 😄
Understanding Web Scraping
Web scraping, or data mining, may sound scary or dangerous, but it's really just using basic tools to access data we already have access to.
To kick things off, we'll start this project with an image of the data we'd like to collect:
Using my Twitter profile as an example, we'd want to grab my follower count as a number.
But how?
If you've ever worked with the Twitter API before, you know that it requires an approval process and has gotten more restrictive over the years.
Usually, in that case, I like to inspect the network tab in the Developer Tools to try and find an XHR request that serves as a JSON endpoint I can try to use.
In this case, no luck, and the incoming requests are polluted. So now is the time to use the element selector to see if there's a way to grab what we want with basic CSS selectors:
"I want to grab the text3,605 that lives in the anchor tag with an
href
of/mtliendo/followers
"
It's important to be as specific as possible, but no more specific than necessary. Meaning, I wouldn't want to select one of those obscure-looking class names because they're likely coming from a CSS module or computed class name that changes every time we visit the page.
🗒️ If you're wondering how we've gotten this far and haven't written any code, welcome to web scraping! Deciding how you're going to grab the data you want is actually the hard part.
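One low-tech way to validate a selector before committing to it is to try it straight in the DevTools console on the profile page. A quick sketch, using the selector we just settled on (my handle as the example):

```js
// Run in the browser console while viewing the profile page.
// If the selector is right, this logs the raw anchor text with the count in it.
const anchor = document.querySelector("a[href='/mtliendo/followers']")
console.log(anchor.textContent)
```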
Scaffolding Our Application
Now that we understand why we're resorting to web scraping and have identified a way to grab the data we want, let's get to coding.
First, we'll initialize a NextJS project by running the following command:
```bash
npx create-next-app twitter-followers && cd $_
```
💡 The `&& cd $_` at the end will change into the directory we just created after the NextJS scaffolding takes place.
Next, let's add in the package we'll be using to perform the web scraping.
```bash
npm i cheerio
```
cheerio is a fantastic library that uses methods similar to jQuery to traverse HTML-like strings on the backend.
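If you haven't used cheerio before, here's a minimal sketch of that jQuery-style API against a hard-coded HTML string:

```js
const cheerio = require('cheerio')

// load() parses an HTML string; $ then queries it with CSS selectors,
// just like jQuery would in the browser
const $ = cheerio.load('<ul><li class="handle">@mtliendo</li></ul>')
console.log($('.handle').text()) // "@mtliendo"
```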
With our project scaffolded, and our main dependency installed, let's start writing out our function.
The last step before opening up our project in an editor is to create a file for our API route.
```bash
touch pages/api/followers.js
```
Writing the Scraper
In the `followers.js` file we just created, add the following code:
```js
const cheerio = require('cheerio') // 1

export default async (req, res) => {
  // 2
  if (req.method === 'POST') {
    // 3
    const username = req.body.TWuser
    try {
      // 4
      const response = await fetch(`https://mobile.twitter.com/${username}`)
      const htmlString = await response.text()
      const $ = cheerio.load(htmlString)
      const searchContext = `a[href='/${username}/followers']`
      const followerCountString = $(searchContext)
        .text()
        .match(/[0-9]/gi)
        .join('')

      res.statusCode = 200
      return res.json({
        user: username,
        followerCount: Number(followerCountString),
      })
    } catch (e) {
      // 5
      res.statusCode = 404
      return res.json({
        user: username,
        error: `${username} not found. Tip: Double check the spelling.`,
        followerCount: -1,
      })
    }
  }
}
```
Breaking down the code
1. First we import the `cheerio` module using CommonJS (`require` instead of `import`).
2. We export a function. NextJS will create a serverless endpoint for us. In doing so, it gives us a way to see what data came in via `req` (the request), and a way to send data back via `res` (the response). Because we're doing some asynchronous stuff in this function, I'm marking it as `async`.
3. As mentioned above, the `req` gives us info about what's coming in. Here we're saying, "If this incoming request is a `POST` request, look at the `body` and grab the `TWuser` piece of data." You'll see shortly how we send the `TWuser` along.
4. This is the heart of our application. Line by line, we're fetching data from Twitter and parsing the response as `.text()` instead of `.json()`. This gives us back the HTML of the page as a string, which is exactly what `cheerio` expects. From there, you'll notice the `a[href='/${username}/followers']` piece of code. This grabs the anchor tag that contains the follower count. The problem is that it's a long string that looks like this: `"      3,606\n      Followers"`. To solve for that, we use the `match` method with a bit of regex that grabs the numbers from the string; from there, we join the numbers back together (see the snippet after this list) and send the data back to the user as a JSON object.
5. A final bit of error handling that sends back some data if the user can't be found.
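To see that string cleanup in isolation, here's what the match/join combo from step 4 does to the raw anchor text (runnable anywhere JavaScript runs):

```js
const rawText = '      3,606\n      Followers'

// match(/[0-9]/gi) pulls out each digit: ['3', '6', '0', '6']
// join('') glues them back together, dropping the comma and the "Followers" label
const digits = rawText.match(/[0-9]/gi).join('')
console.log(Number(digits)) // 3606
```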
🚨 Perceptive devs may have noticed that I changed the URL from `twitter.com` to `mobile.twitter.com`. This is because the desktop site uses client-side rendering, but the mobile site (the legacy desktop site) renders the data on the server.
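If you want to verify that yourself, a quick sketch (Node 18+, where `fetch` is global) is to pull the raw HTML and check whether the followers anchor exists before any JavaScript runs:

```js
// If the page is server-rendered, the followers link exists in the raw HTML;
// on the client-rendered desktop site this would log false.
fetch('https://mobile.twitter.com/mtliendo')
  .then((res) => res.text())
  .then((html) => console.log(html.includes('/mtliendo/followers')))
```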
Testing Out the Endpoint
Now that the worst is over, let's test out the endpoint in a browser. To do so, replace `pages/index.js` with the following code:
```jsx
import Head from 'next/head'
import styles from '../styles/Home.module.css'
import React from 'react'

export default function Home() {
  const [inputValue, setInputValue] = React.useState('')
  const [userFollowers, setUserFollowers] = React.useState({})
  const handleSubmit = (e) => {
    e.preventDefault()
    fetch('/api/followers', {
      method: 'post',
      headers: {
        'content-type': 'application/json',
      },
      body: JSON.stringify({ TWuser: inputValue }),
    })
      .then((res) => res.json())
      .then((userData) => {
        setUserFollowers(userData)
      })
  }
  return (
    <div className={styles.container}>
      <Head>
        <title>Fetch Twitter Follower</title>
        <link rel="icon" href="/favicon.ico" />
      </Head>
      <main className={styles.main}>
        <h1>Fetch A Twitter Follower</h1>
        <form onSubmit={handleSubmit}>
          <label>
            Enter a Twitter username
            <input
              value={inputValue}
              onChange={(e) => setInputValue(e.target.value)}
            />
          </label>
          <button>Submit</button>
        </form>
        {userFollowers.followerCount >= 0 ? (
          <p>Followers: {userFollowers.followerCount}</p>
        ) : (
          <p>{userFollowers.error}</p>
        )}
      </main>
    </div>
  )
}
```
Not a whole lot going on here, but the main part I'll call out is the `fetch` call in the `handleSubmit` function.
The nice thing about NextJS is that we can just reference the file in the `api` directory by using the `api/<filename>` syntax. After that, we set the method to `post`, add a `json` header, and pass in our data as a stringified object.
Special attention goes to the `TWuser` key in the body. This has to match what we grab from `req.body` in our API.
If you haven't already, test out your application by running the following command in your terminal:
```bash
npm run dev
```
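If you'd rather poke the API directly instead of going through the form, here's a quick sketch you can paste into the browser console while sitting on the local dev page (swap in any handle you like):

```js
// POST to the API route the same way handleSubmit does
fetch('/api/followers', {
  method: 'post',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({ TWuser: 'mtliendo' }),
})
  .then((res) => res.json())
  .then(console.log) // e.g. { user: 'mtliendo', followerCount: 3606 }
```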
🚀 Deploy!
After making sure you can get a follower count and a proper error message, it's time to deploy!
Vercel is an awesome service to build and deploy apps built with NextJS. After logging in, it asks for a git URL for the project, so you'll want to make sure you put your project up on GitHub first.
Once done, your project will be live, and your example page should work. However, if you fetch your live API endpoint from a different origin, it won't work: our API doesn't send CORS headers, so browsers block cross-origin requests by default.
As a final step, let's fix that and redeploy it.
First, let's add the `cors` package:
```bash
npm i cors
```
Then, we'll update the `api/followers.js` file to include the following at the top of the file:
```js
const cheerio = require('cheerio')
const Cors = require('cors')

// Initializing the cors middleware
const cors = Cors({
  methods: ['POST'],
})

// Helper method to wait for a middleware to execute before continuing
// And to throw an error when an error happens in a middleware
function runMiddleware(req, res, fn) {
  return new Promise((resolve, reject) => {
    fn(req, res, (result) => {
      if (result instanceof Error) {
        return reject(result)
      }
      return resolve(result)
    })
  })
}

export default async (req, res) => {
  await runMiddleware(req, res, cors)
  if (req.method === 'POST') {
    const username = req.body.TWuser
    // ...rest of code
```
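As configured above, the middleware allows requests from any origin, which is what we want for an open API. If you ever want to restrict it to a known frontend instead, the `cors` package also accepts an `origin` option; a minimal sketch (the domain is a hypothetical placeholder):

```js
// Only allow POSTs from a specific site (hypothetical domain for illustration)
const cors = Cors({
  methods: ['POST'],
  origin: 'https://my-frontend.example.com',
})
```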
Now, once you push your code up to GitHub, Vercel will automatically detect your changes and start a new build.
Test out your deployed function and notice that now you can use your API endpoint outside of your application!
Conclusion
🎉 Congrats on making it this far! 🎉
I hope you learned a lot and now have a solid foundation: not only how to web scrape, but when it's appropriate, and how to do it with NextJS!
The one thing this is missing is a database to store info 🤔
So if you liked this tutorial, you'll definitely like the next one, where we'll take a slightly different approach by adding AWS Amplify to the mix!