DEV Community
Michael Liendo


Create a public API by web scraping in NextJS

Cover photo by Kelly Sikkema

Typically, when we want to pull external data into an application, we use an API. But what if the data we want isn't in an API, or the API isn't easily accessible?

That question is what we'll be answering in this tutorial.

Project Overview

The end result of this tutorial is an open API and docs page deployed on Vercel.

As for the project itself, we'll let our users pass in a Twitter handle and we'll return that user's follower count. Note that we won't be using the Twitter API; instead, we'll be web scraping the data.

If you'd rather check out the code used to scrape followers from Twitter, feel free to check out this gist I put together and adapt it to your projects 🙂

But if you'd rather understand the process, problems, and reasoning as we create it together, keep on reading 😄

Understanding Web Scraping

Web scraping, or data mining, may sound scary or dangerous, but it's really just using basic tools to access data we already have access to.

To kick things off, we'll start this project with an image of the data we'd like to collect:

mtliendo twitter profile

Using my Twitter profile as an example, we'd want to grab my follower count as a number.

But how?

If you've ever worked with the Twitter API before, you know that access requires an approval process and that the API has gotten more restrictive over the years.

Usually, in that case, I like to inspect the network tab in the Developer Tools to try and find an XHR request that serves as a JSON endpoint I can try to use.

copy the request as a node fetch

In this case, no luck, and the incoming requests are polluted. So now is the time to use the element selector to see if there is a way to grab what we want with basic JavaScript.

Using the Element Selector in the dev tools to inspect the DOM

"I want to grab the text 3,605 that lives in the anchor tag with an href of /mtliendo/followers"

It's important to be as specific as possible, but no more than that. Meaning, I wouldn't want to select one of those obscure-looking class names, because they're likely coming from a CSS module or a computed class name that changes every time we visit the page.
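As a rough sketch of that idea, the selector can be built from the handle alone, with no generated class names involved. The function names below are hypothetical, and the handle check assumes Twitter's usual rule of 1 to 15 letters, digits, or underscores.

```javascript
// Hypothetical helpers, not part of the tutorial's code.
// Assumes handles are 1-15 letters, digits, or underscores.
function isValidHandle(handle) {
  return /^[A-Za-z0-9_]{1,15}$/.test(handle)
}

// Build the attribute selector described above: an anchor whose
// href is exactly /<handle>/followers.
function followersSelector(handle) {
  if (!isValidHandle(handle)) {
    throw new Error(`Invalid handle: ${handle}`)
  }
  return `a[href='/${handle}/followers']`
}

console.log(followersSelector('mtliendo'))
// prints a[href='/mtliendo/followers']
```

Checking the handle first also keeps quotes or brackets in user input from changing the meaning of the selector.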

🗒️ If you're wondering how we've gotten this far and haven't written any code, welcome to web scraping! Deciding how you're going to grab the data you want is actually the hard part.

Scaffolding Our Application

Now that we understand why we're resorting to web scraping and have identified a way to grab the data we want, let's get to coding.

First, we'll initialize a NextJS project by running the following command:

npx create-next-app twitter-followers && cd $_

💡 The && cd $_ at the end changes into the directory we just created once the NextJS scaffolding finishes.

Next, let's add in the package we'll be using to perform the web scraping.

npm i cheerio

cheerio is a fantastic library that uses methods similar to jQuery to traverse HTML-like strings on the backend.

With our project scaffolded, and our main dependency installed, let's start writing out our function.

The last step before opening up our project in an editor is to create a file for our API route.

touch pages/api/followers.js

Writing the Scraper

In the followers.js file we just created, add the following code:

const cheerio = require('cheerio') // 1

export default async (req, res) => {
  // 2
  if (req.method === 'POST') {
    // 3
    const username = req.body.TWuser

    try {
      // 4
      const response = await fetch(`https://mobile.twitter.com/${username}`)
      const htmlString = await response.text()
      const $ = cheerio.load(htmlString)
      const searchContext = `a[href='/${username}/followers']`
      const followerCountString = $(searchContext)
        .text()
        .match(/[0-9]/gi)
        .join('')

      res.statusCode = 200
      return res.json({
        user: username,
        followerCount: Number(followerCountString),
      })
    } catch (e) {
      // 5
      res.statusCode = 404
      return res.json({
        user: username,
        error: `${username} not found. Tip: Double check the spelling.`,
        followerCount: -1,
      })
    }
  }
}

Breaking down the code

  1. First, we import the cheerio module using CommonJS (require instead of import).

  2. We export a function. NextJS will create a serverless endpoint for us. In doing so, it gives us a way to see what data came in via req (the request), and a way to send data back via res (the response). Because we're doing some asynchronous work in this function, I'm marking it as async.

  3. As mentioned above, req gives us info about what's coming in. Here we're saying, "If this incoming request is a POST request, look at the body and grab the TWuser piece of data." You'll see shortly how we send TWuser along.

  4. This is the heart of our application. Line by line, we're fetching data from Twitter and parsing the response as .text() instead of .json().

This gives us back the HTML of the page as a string, which is exactly what cheerio expects. From there, you'll notice the a[href='/${username}/followers'] piece of code. This grabs the anchor tag that contains the follower count. The problem is that its text is a long string that looks like this:

"         3,606\n           Followers"

To solve that, we use the match method with a bit of regex that grabs the digits from the string. From there, we join the digits back together and send the data back to the user as a JSON object.

  5. A final bit of error handling that sends back some data if the user can't be found.
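To see the match and join step from item 4 in isolation, here is a minimal sketch. The parseFollowerCount name is hypothetical, and the null check is an addition for illustration; the handler above relies on its catch block instead.

```javascript
// Hypothetical helper that mirrors the match/join step in the handler.
function parseFollowerCount(text) {
  // match() returns null when the string contains no digits at all.
  const digits = text.match(/[0-9]/g)
  if (digits === null) return null
  return Number(digits.join(''))
}

console.log(parseFollowerCount('         3,606\n           Followers')) // prints 3606
console.log(parseFollowerCount('Followers')) // prints null
```

Because only digits are kept, an abbreviated count such as "12.5K" would come back as 125, so this approach assumes the page prints the full number.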

🚨 Perceptive devs may have noticed that I changed the URL from twitter.com to mobile.twitter.com. This is because the desktop site uses client-side rendering, but the mobile site (the legacy desktop site) renders the data on the server.
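One thing the handler above leaves open is what happens to requests that aren't POST: no branch responds to them. Below is a minimal sketch of a guard, run against a stand-in res object so it works without a server. The function names and the choice of a 405 response are assumptions, not part of the original code.

```javascript
// Hypothetical guard, not part of the original handler.
// Returns true when the request may continue, otherwise answers with 405.
function allowOnlyPost(req, res) {
  if (req.method === 'POST') return true
  res.setHeader('Allow', 'POST')
  res.statusCode = 405
  res.json({ error: `Method ${req.method} not allowed` })
  return false
}

// Stand-in response object that records what the guard does.
function createFakeRes() {
  return {
    statusCode: 200,
    headers: {},
    body: undefined,
    setHeader(name, value) {
      this.headers[name] = value
    },
    json(payload) {
      this.body = payload
    },
  }
}

const fakeRes = createFakeRes()
console.log(allowOnlyPost({ method: 'GET' }, fakeRes)) // prints false
console.log(fakeRes.statusCode) // prints 405
```

In the API route, this would sit at the top of the exported function, returning early whenever the guard returns false.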

Testing Out the Endpoint

Now that the worst is over, let's test out the endpoint in a browser. To do so, replace pages/index.js with the following code:

import Head from 'next/head'
import styles from '../styles/Home.module.css'
import React from 'react'

export default function Home() {
  const [inputValue, setInputValue] = React.useState('')
  const [userFollowers, setUserFollowers] = React.useState({})

  const handleSubmit = (e) => {
    e.preventDefault()
    fetch('/api/followers', {
      method: 'post',
      headers: {
        'content-type': 'application/json',
      },
      body: JSON.stringify({ TWuser: inputValue }),
    })
      .then((res) => res.json())
      .then((userData) => {
        setUserFollowers(userData)
      })
  }

  return (
    <div className={styles.container}>
      <Head>
        <title>Fetch Twitter Follower</title>
        <link rel="icon" href="/favicon.ico" />
      </Head>
      <main className={styles.main}>
        <h1>Fetch A Twitter Follower</h1>
        <form onSubmit={handleSubmit}>
          <label>
            Enter a Twitter username
            <input
              value={inputValue}
              onChange={(e) => setInputValue(e.target.value)}
            />
          </label>
          <button>Submit</button>
        </form>
        {userFollowers.followerCount >= 0 ? (
          <p>Followers: {userFollowers.followerCount}</p>
        ) : (
          <p>{userFollowers.error}</p>
        )}
      </main>
    </div>
  )
}

Not a whole lot going on here, but the main part I'll call out is the fetch call in the handleSubmit function.

The nice thing about NextJS is that we can reference the file in the api directory with the /api/<filename> path. After that, we set the method to post, add a JSON content-type header, and pass in our data as a stringified object.

Special attention goes to the TWuser key in the body. This has to match what we grab from req.body in our API route.
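The same request can be wrapped in a small helper, which makes the TWuser contract easy to see in one place. This is a sketch only: getFollowerCount, baseUrl, and fetchImpl are hypothetical names, and the injectable fetchImpl exists so the helper can be exercised with a stub instead of a live server.

```javascript
// Hypothetical client helper. `fetchImpl` defaults to the global fetch
// (browsers and Node 18+) and can be swapped for a stub in tests.
async function getFollowerCount(handle, baseUrl = '', fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/api/followers`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    // The key must be TWuser, because the API route reads req.body.TWuser.
    body: JSON.stringify({ TWuser: handle }),
  })
  // Resolves to the JSON object the route sends back.
  return res.json()
}
```

Called from a page inside the app, getFollowerCount('mtliendo') resolves to the same object the route returns, with user and followerCount on success or user, error, and followerCount: -1 on failure.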

If you haven't already, test out your application by running the following command in your terminal:

npm run dev

🚀 Deploy!

After making sure you can get a follower count and a proper error message, it's time to deploy!

Vercel is an awesome service to build and deploy apps built with NextJS. After logging in, they ask for a git URL for the project, so you'll want to make sure you put your project up on GitHub first.

Deploying to Vercel

Once done, your project will be live, and your example page should work. However, if you call your live API endpoint from a page on a different origin, the browser will block the request. This is because the endpoint doesn't send CORS headers by default.

CORS error

As a final step, let's fix that and redeploy it.

First, let's add the cors package:

npm i cors

Then, we'll update the pages/api/followers.js file to include the following at the top of the file:

const cheerio = require('cheerio')
const Cors = require('cors')

// Initializing the cors middleware
const cors = Cors({
  methods: ['POST'],
})

// Helper method to wait for a middleware to execute before continuing
// And to throw an error when an error happens in a middleware
function runMiddleware(req, res, fn) {
  return new Promise((resolve, reject) => {
    fn(req, res, (result) => {
      if (result instanceof Error) {
        return reject(result)
      }
      return resolve(result)
    })
  })
}

export default async (req, res) => {
  await runMiddleware(req, res, cors)

  if (req.method === 'POST') {
    const username = req.body.TWuser
    // ...rest of code
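If the runMiddleware helper looks opaque, this small sketch runs it against two stand-in middlewares instead of the real cors package. The passes and fails functions are hypothetical and exist only to show the resolve and reject paths.

```javascript
// Same helper as in the snippet above.
function runMiddleware(req, res, fn) {
  return new Promise((resolve, reject) => {
    fn(req, res, (result) => {
      if (result instanceof Error) {
        return reject(result)
      }
      return resolve(result)
    })
  })
}

// Two stand-in middlewares (hypothetical, for illustration only).
const passes = (req, res, next) => next()
const fails = (req, res, next) => next(new Error('blocked'))

async function demo() {
  await runMiddleware({}, {}, passes)
  console.log('first middleware finished')
  try {
    await runMiddleware({}, {}, fails)
  } catch (err) {
    console.log(`second middleware rejected: ${err.message}`)
  }
}

demo()
```

The helper turns the callback-style next() convention into a promise, which is what lets the API route simply await the middleware before touching req.body.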

Now, once you push your code up to GitHub, Vercel will automatically detect your changes and will start a new build.

Test out your function by editing the sandbox below and notice that now you can use your API endpoint outside of your application!

Conclusion

🎉 Congrats on making it this far 🎉

I hope you learned a lot and now have a solid foundation, not only for how to web scrape, but also for when it's appropriate and how to do it with NextJS!

The one thing this is missing is a database to store info 🤔
So if you liked this tutorial, you'll definitely like the next one where we'll take a slightly different approach by adding AWS Amplify to the mix!

Top comments (5)

Josh May

Spent a solid 10-15 hours trying to figure out why I was getting a CORS error. Then found this and cracked it in 15 minutes. Thanks Michael!!

Michael Liendo
Will code for tacos 🤤Lover of teaching, speaking, learning, and growing. Egghead Instructor. AWS Community Builder
  • Location
    Midwest
  • Education
    self-paced
  • Work
    Senior Developer Advocate at AWS

Awesome to hear and happy to help!

Omar López
Front-End Web Developer, lots and lots of CSS

Nice! Would this work for backing up an Instagram profile? Also would be cool being able to proxy a third party site but with custom/enhanced CSS/markup.

Michael Liendo

Hey Omar, thanks for checking out the post! Yup, you could definitely use this trick to grab all of your Instagram pics and store them in a database (or just save the files locally!)

Ruslan Korkin
  • Location
    Ukraine

Have you ever exceeded the serverless function time limit of 25 seconds (see "What can I do about Vercel Serverless Functions timing out?")?

How do you handle it, or what would you propose for handling it?
