Jordan Hansen

Originally published at javascriptwebscrapingguy.com

     

Jordan Scrapes Secretary of State: North Carolina

Demo code here

Today we do web scraping on the North Carolina Secretary of State. I’ve been to North Carolina once and it seemed like a great state. Really pretty with some beautiful beaches. This is the 15th (!!) entry in the Secretary of States web scraping series.

Investigation


I try to look for the most recently registered businesses. They are the businesses that are very likely trying to get set up with new services and products and probably don’t have existing relationships. I think these are typically going to be the more valuable leads.

If the state doesn’t offer a date range to search by, I’ve discovered a trick that works pretty well: I just search for “2020”. 2020 is kind of a catchy number, and because we are currently in that year, people tend to start businesses with it in the name.

Once I find one of these that was registered recently, I look for a business id somewhere. It’s typically a query parameter in the URL or form data in the POST request. Either way, if I can increment that id by one and still get a company that was recently registered, I know I can find recently registered businesses simply by increasing the id I search with.
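The incrementing idea can be sketched as a tiny helper that lists the ids to probe next. This is only an illustration: `candidateIds` and its parameters are hypothetical names that do not appear in the original code.

```typescript
// Hypothetical helper: list the ids to probe, starting from a known recent id.
function candidateIds(startId: number, count: number, step: number = 1): number[] {
    const ids: number[] = [];
    for (let i = 0; i < count; i++) {
        ids.push(startId + i * step);
    }
    return ids;
}

// The next three ids, starting from a known recent registration.
console.log(candidateIds(2006637, 3)); // [ 2006637, 2006638, 2006639 ]
```

Each id in the list would then be looked up, stopping once the results are no longer recent registrations.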

North Carolina was not much different. They allow you to search with pretty standard stuff. No date range, sadly, so the above “2020” trick works pretty well.

North Carolina secretary of state business search for 2020

Bingo, just like that we find a business registered in July of this year. Worked like a charm.

Whenever I’m first investigating a site, I always check the network requests. Often you can find that there are direct requests to an API that has the data you need. When I selected this company, 2020 Analytics LLC, I saw this network request and I thought I was in business.

ajax request for the business registration profile

This request didn’t return any easy-to-parse JSON, sadly, only HTML. Still, I figured I should be able to POST that Sos id to this endpoint, get what I wanted, and just increment from there.

Maybe you’re seeing what I missed.

Database id vs Sos id


The id shown in that photo was a lot bigger than the Secretary of State id. 16199332 vs 2006637. I started making requests and plucking out the filing date and the business title starting with 16199332.

The results were pretty intermittent. The first indication that something was up was that the numbers weren’t exactly sequential. One business would be registered on 7/21/2020 and then 10 numbers later a business was registered on 6/24/2020.
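A quick way to spot this kind of mismatch is to collect a few (id, filing date) samples and flag every sample whose date is earlier than the one before it. The sketch below assumes dates come back as M/D/YYYY strings like the ones above. `Sample`, `parseUsDate`, and `outOfOrderIds` are hypothetical names, and the ids in the usage example are illustrative rather than real lookups.

```typescript
interface Sample {
    id: number;
    filingDate: string; // assumed format: M/D/YYYY
}

// Convert an M/D/YYYY string to a UTC timestamp so dates can be compared.
function parseUsDate(value: string): number {
    const parts = value.split('/').map(Number);
    return Date.UTC(parts[2], parts[0] - 1, parts[1]);
}

// Return the ids whose filing date is earlier than the previous sample's date.
function outOfOrderIds(samples: Sample[]): number[] {
    const flagged: number[] = [];
    for (let i = 1; i < samples.length; i++) {
        if (parseUsDate(samples[i].filingDate) < parseUsDate(samples[i - 1].filingDate)) {
            flagged.push(samples[i].id);
        }
    }
    return flagged;
}

// Illustrative ids only: the second sample is ten ids later but filed a month earlier.
console.log(outOfOrderIds([
    { id: 16199332, filingDate: '7/21/2020' },
    { id: 16199342, filingDate: '6/24/2020' },
]));
```

If this returns anything other than an empty list, the id being incremented probably isn’t assigned in registration order.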

I’m not exactly sure what is happening programmatically that makes them insert entries into the database like that. In any case, I soon realized that something wasn’t matching up: the id in that request was a database id, not the Sos id.

I wanted to call this details endpoint directly, but for that I needed to get the database id somehow. Fortunately, North Carolina has a way to search by Sos id.

Search by Sos id

The resulting HTML looks like this:

HTML for sos id search

Because I’m searching by Sos id, it only returned one result. I just grabbed and parsed this anchor tag to pluck out the database id from that ShowProfile function. Two requests: one to get the database id, another to use that database id to get the business details.

The code


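// Assumed imports; the original snippet does not show them
import axios from 'axios';
import * as cheerio from 'cheerio';
import FormData from 'form-data';

// Small helper for the "good neighbor" pause between requests
function timeout(ms: number) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

(async () => {
    const startingSosId = 2011748;
    // Increment by large amounts so we can find the most recently registered businesses
    for (let i = 0; i < 5000; i += 100) {
        // We use the query post to get the database id
        const databaseId = await getDatabaseId(startingSosId + i);
        // With the database id we can just POST directly to the details endpoint
        if (databaseId) {
            await getBusinessDetails(databaseId);
        }
        // Good neighbor timeout
        await timeout(1000);
    }
})();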

This is the base of my scraping code. It shows how I’m incrementing by larger jumps to quickly determine where the end is. I go out and get the database id and then use that to get the business details.
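One way to turn the big-jump idea into a search for the newest id is a coarse pass followed by a fine pass. This is only a sketch under assumptions: `findLatestId` and `exists` are hypothetical names, `exists` stands in for the request that checks whether an id returns a business, and it is synchronous here just to keep the example small. It also assumes the newest id sits within one coarse step after the last coarse hit.

```typescript
// Hypothetical sketch: find the highest id that still returns a business.
function findLatestId(
    startId: number,
    coarseStep: number,
    maxOffset: number,
    exists: (id: number) => boolean
): number | null {
    // Coarse pass: remember the last probed id that returned a business.
    let lastCoarseHit: number | null = null;
    for (let offset = 0; offset <= maxOffset; offset += coarseStep) {
        if (exists(startId + offset)) {
            lastCoarseHit = startId + offset;
        }
    }
    if (lastCoarseHit === null) {
        return null;
    }
    // Fine pass: walk one id at a time through the window after the last coarse hit.
    let latest = lastCoarseHit;
    for (let id = lastCoarseHit + 1; id < lastCoarseHit + coarseStep; id++) {
        if (exists(id)) {
            latest = id;
        }
    }
    return latest;
}

// Pretend every id up to 2012345 has been assigned (made-up cutoff for the demo).
const latest = findLatestId(2011748, 100, 5000, (id) => id <= 2012345);
console.log(latest); // 2012345
```

With a step of 100 over a range of 5000, that is 51 coarse probes plus at most 99 fine ones, far fewer than checking all 5000 ids one by one.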

async function getDatabaseId(sosId: number) {
    const url = 'https://www.sosnc.gov/online_services/search/Business_Registration_Results';

    // Same form fields the site's own search page sends when searching by Sos id
    const formData = new FormData();
    formData.append('SearchCriteria', sosId.toString());
    // Anti-forgery token, hard-coded here
    formData.append('__RequestVerificationToken', 'qnPxLQeaFPiEj4f1so7zWF8e5pTwiW0Ur8A0qkiK_45A_3TL__wTjYlmaBmvWvYJVd2GiFppbLB39eD0F6bmbEUFsQc1');
    formData.append('CorpSearchType', 'CORPORATION');
    formData.append('EntityType', 'ORGANIZATION');
    formData.append('Words', 'SOSID');

    const axiosResponse = await axios.post(url, formData,
        {
            headers: formData.getHeaders()
        });

    const $ = cheerio.load(axiosResponse.data);

    // The single result row has an anchor whose onclick calls ShowProfile('<database id>')
    const onclickAttrib = $('.double tbody tr td a').attr('onclick');
    if (onclickAttrib) {
        const databaseId = onclickAttrib.split("ShowProfile('")[1].replace("')", '');
        return databaseId;
    }
    else {
        console.log('No business found for SosId', sosId);
        return null;
    }
}

That’s what getting the database id looks like: simply select the anchor tag shown above and parse the function call to grab the database id.

The most enjoyable part was working on the business details. This section had a lot of the data that I wanted, but the fields weren’t always in the same order, and a company didn’t always have the same fields.
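As a variation on the split-and-replace parsing, a regular expression can pull the id out and return null when the attribute doesn’t contain a ShowProfile call at all. This is a sketch, not the code from the post: `extractDatabaseId` is a hypothetical helper, and the sample attribute value assumes the onclick looks like `ShowProfile('<id>')`.

```typescript
// Hypothetical helper: pull the id out of an onclick value such as "ShowProfile('16199332')".
function extractDatabaseId(onclick: string): string | null {
    const match = /ShowProfile\('([^']+)'\)/.exec(onclick);
    return match ? match[1] : null;
}

console.log(extractDatabaseId("ShowProfile('16199332')")); // 16199332
console.log(extractDatabaseId('somethingElse()'));         // null
```

The null branch is the main difference from indexing into the result of `split`, which throws if the expected text is missing.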

North Carolina Secretary of State information fields

So I used a trick I’ve used before where I just loop through all of the elements in this section, get the text from the label section, and put the value where it needs to go based on that label.

const informationFields = $('.printFloatLeft section:nth-of-type(2) div:nth-of-type(1) span');
for (let i = 0; i < informationFields.length; i++) {
    if (informationFields[i].attribs.class === 'greenLabel') {
        // This is kind of perverting cheerio objects
        const label = informationFields[i].children[0].data.trim();
        const value = informationFields[i + 1].children[0].data.trim();

        switch (label) {
            case 'SosId:':
                business.sosId = value;
                break;
            case 'Citizenship:':
                business.citizenShip = value;
                break;
            case 'Status:':
                business.status = value;
                break;
            case 'Date Formed:':
                business.filingDate = value;
                break;
            default:
                break;
        }
    }
}
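The same label-driven idea can also be written with a lookup table instead of a switch, which keeps the label-to-field mapping in one place. This is an alternative sketch rather than the code used in the post: `Business`, `fieldByLabel`, and `mapLabelledValues` are hypothetical names, and the sample values are made up.

```typescript
interface Business {
    sosId?: string;
    citizenShip?: string;
    status?: string;
    filingDate?: string;
}

// Label text as it appears on the page, mapped to the field it fills.
const fieldByLabel = new Map<string, keyof Business>([
    ['SosId:', 'sosId'],
    ['Citizenship:', 'citizenShip'],
    ['Status:', 'status'],
    ['Date Formed:', 'filingDate'],
]);

// Takes [label, value] pairs in any order and ignores labels it doesn't know.
function mapLabelledValues(pairs: Array<[string, string]>): Business {
    const business: Business = {};
    for (let i = 0; i < pairs.length; i++) {
        const field = fieldByLabel.get(pairs[i][0].trim());
        if (field) {
            business[field] = pairs[i][1].trim();
        }
    }
    return business;
}

// Made-up sample values; order and unknown labels don't matter.
console.log(mapLabelledValues([
    ['Date Formed:', ' 7/21/2020 '],
    ['Some Other Field:', 'ignored'],
    ['Status:', 'Current-Active'],
]));
```

Supporting a new field then means adding one entry to the map rather than another case.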

I had to do a little almost-abuse of cheerio’s normally very easy API. The problem was that, as you can see at the top, I’m selecting all the spans in this information section. I needed to loop through each one and I couldn’t find a way to access the text() function without using a proper CSS selector. For example, $('something').text() is easy. But as I looped I didn’t want to select any further. I wanted that element. And that’s why I ended up with children[0].data.
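For what it’s worth, cheerio also accepts a raw element, so `$(element)` wraps it again and gives back `.text()` and `.hasClass()` without another selector. The sketch below keeps cheerio out of the function itself so it can run on its own. `Wrapped` and `readLabelValuePairs` are hypothetical names, and with cheerio the call would look something like `readLabelValuePairs($('... span').toArray(), (el) => $(el))`.

```typescript
// The two methods this sketch needs from a wrapped element (cheerio objects provide both).
interface Wrapped {
    text(): string;
    hasClass(className: string): boolean;
}

// Pairs each 'greenLabel' node with the node right after it, which holds the value.
function readLabelValuePairs<T>(nodes: T[], wrap: (node: T) => Wrapped): Array<[string, string]> {
    const pairs: Array<[string, string]> = [];
    for (let i = 0; i < nodes.length - 1; i++) {
        const current = wrap(nodes[i]);
        if (current.hasClass('greenLabel')) {
            pairs.push([current.text().trim(), wrap(nodes[i + 1]).text().trim()]);
        }
    }
    return pairs;
}

// Stand-in nodes so the example runs without cheerio; the values are made up.
const fakeSpans = [
    { cls: 'greenLabel', txt: ' Status: ' },
    { cls: '', txt: ' Current-Active ' },
];
console.log(readLabelValuePairs(fakeSpans, (node) => ({
    text: () => node.txt,
    hasClass: (name: string) => node.cls === name,
})));
```

Note that `hasClass` is slightly looser than comparing `attribs.class` to an exact string, since it also matches elements that carry additional classes.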

Here’s the full function:

async function getBusinessDetails(databaseId: string) {
    const url = 'https://www.sosnc.gov/online_services/search/_Business_Registration_profile';

    const formData = new FormData();
    formData.append('Id', databaseId);

    const axiosResponse = await axios.post(url, formData,
        {
            headers: formData.getHeaders()
        });

    const $ = cheerio.load(axiosResponse.data);

    const business: any = {
        businessId: databaseId
    };

    business.title = $('.printFloatLeft section:nth-of-type(1) div:nth-of-type(1) span:nth-of-type(2)').text();
    if (business.title) {
        business.title = business.title.replace(/\n/g, '').trim()
    }
    else {
        console.log('No business title found. Likely no business here', databaseId);
        return;
    }

    const informationFields = $('.printFloatLeft section:nth-of-type(2) div:nth-of-type(1) span');
    for (let i = 0; i < informationFields.length; i++) {
        if (informationFields[i].attribs.class === 'greenLabel') {
            // This is kind of perverting cheerio objects
            const label = informationFields[i].children[0].data.trim();
            const value = informationFields[i + 1].children[0].data.trim();

            switch (label) {
                case 'SosId:':
                    business.sosId = value;
                    break;
                case 'Citizenship:':
                    business.citizenShip = value;
                    break;
                case 'Status:':
                    business.status = value;
                    break;
                case 'Date Formed:':
                    business.filingDate = value;
                    break;
                default:
                    break;
            }
        }
    }

    console.log('business', business);
}

And…that’s it! It turned out pretty nice.

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!

The post Jordan Scrapes Secretary of State: North Carolina appeared first on JavaScript Web Scraping Guy.

Top comments (1)

GODZilla: Is it legal to scrape gov website data?


Software engineer and JavaScript lover. I love the power of the web and getting the data from it with web scraping.
  • Location: Eagle, ID
  • Education: Boise State University
  • Work: Software Engineer at Lenovo Software
