DEV Community
Artur Chukhrai for SerpApi

Originally published at serpapi.com

Scrape Google Product Online Sellers with Python

What will be scraped

[Image: wwbs-google-online-sellers]

📌 Note: This image demonstrates that the data is received with pagination. It shows only 5 sellers rather than all of them, because the full list would take up a lot of space.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

import requests, json
from parsel import Selector


def get_online_sellers_results(url, headers):
    data = []

    while True:
        html = requests.get(url, headers=headers)
        selector = Selector(html.text)

        for result in selector.css('.sh-osd__offer-row'):
            name = result.css('.kjM2Bf::text, .b5ycib::text').get()
            link = 'https://www.google.com' + result.css('.b5ycib::attr(href)').get() if result.css('.b5ycib') else None
            base_price = result.css('.fObmGc::text').get()
            shipping = result.css('.SuutWb tr:nth-child(2) td:nth-child(2)::text').get()
            tax = result.css('.SuutWb tr:nth-child(3) td:nth-child(2)::text').get()
            total_price = result.css('.drzWO::text').get()

            data.append({
                'name': name,
                'link': link,
                'base_price': base_price,
                'additional_price': {
                    'shipping': shipping,
                    'tax': tax
                },
                'total_price': total_price
            })

        if 'Next' in selector.css('.R9e18b .internal-link::text').get():
            url = 'https://www.google.com' + selector.css('.R9e18b .internal-link::attr(data-url)').get()
        else:
            break

    return data


def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = 'https://www.google.com/shopping/product/14019378181107046593/offers?hl=en&gl=us'

    online_sellers = get_online_sellers_results(URL, headers)

    print(json.dumps(online_sellers, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()

Preparation

Install libraries:

pip install requests parsel

Reduce the chance of being blocked

Make sure you pass a user-agent in the request headers so that the request looks like a visit from a "real" user. The default requests user-agent is python-requests, so websites can tell that the request was most likely sent by a script. Check what your user-agent is.

There's a "how to reduce the chance of being blocked while web scraping" blog post that can get you familiar with basic and more advanced approaches.
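To see the difference for yourself, here is a small sketch that compares the default requests user-agent with a custom one. It only prepares a request and never sends it; the URL is the product page used in this post, and the variable names are mine.

```python
import requests

# The user-agent that requests sends when no custom headers are passed.
default_user_agent = requests.utils.default_user_agent()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

# Prepare (but do not send) a request to inspect the headers that would go out.
session = requests.Session()
request = requests.Request(
    'GET',
    'https://www.google.com/shopping/product/14019378181107046593/offers?hl=en&gl=us',
    headers=headers,
)
prepared = session.prepare_request(request)

print(default_user_agent)              # starts with "python-requests/"
print(prepared.headers['User-Agent'])  # the custom browser-like value
```

Headers passed with the request take precedence over the session defaults, which is why the second line prints the browser-like string.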

Code Explanation

Import libraries:

import requests, json
from parsel import Selector
| Library | Purpose |
|---|---|
| requests | to make a request to the website. |
| json | to convert extracted data to a JSON string. |
| Selector | XML/HTML parser with full XPath and CSS selector support. |

At the beginning of the main() function, the headers and URL are defined. This data is then passed to the get_online_sellers_results(URL, headers) function to form a request and extract information.

The online_sellers list contains the received data that this function returns. At the end of the function, the data is output in JSON format:

def main():
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }

    URL = 'https://www.google.com/shopping/product/14019378181107046593/offers?hl=en&gl=us'

    online_sellers = get_online_sellers_results(URL, headers)

    print(json.dumps(online_sellers, indent=2, ensure_ascii=False))
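As a quick illustration of what the indent and ensure_ascii arguments do, here is a minimal sketch. The seller data is made up for the example and is not real output.

```python
import json

# Made-up seller with non-ASCII characters, for illustration only.
sellers = [{'name': 'Café Électronique', 'total_price': '€38.10'}]

escaped = json.dumps(sellers)                              # non-ASCII becomes \uXXXX escapes
readable = json.dumps(sellers, ensure_ascii=False)         # characters are kept as-is
pretty = json.dumps(sellers, indent=2, ensure_ascii=False) # one key per line, 2-space indent

print(escaped)
print(readable)
print(pretty)
```

With ensure_ascii=False, seller names that contain accented or non-Latin characters stay readable in the printed JSON.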

This code uses the generally accepted rule of using the __name__ == "__main__" construct:

if __name__ == "__main__":
    main()

This check passes only when the user runs this file directly. If the user imports this file into another one, __name__ holds the module name instead of "__main__", so the check fails and main() is not called. You can watch the video Python Tutorial: if __name__ == '__main__' for more details.
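If you want to observe this behavior, the sketch below writes a tiny stand-in module to a temporary folder, runs it directly, and then imports it from a second Python process. The module name demo_scraper.py and the helper run_python are hypothetical and exist only for this demonstration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# A tiny stand-in module (hypothetical name: demo_scraper.py).
MODULE_SOURCE = '''
def main():
    print("main() was called")

if __name__ == "__main__":
    main()
'''


def run_python(args, cwd):
    """Run the current Python interpreter with args and return what it printed."""
    completed = subprocess.run(
        [sys.executable, *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()


with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, 'demo_scraper.py').write_text(MODULE_SOURCE)
    # Run the file directly: __name__ is "__main__", so main() runs.
    direct_output = run_python(['demo_scraper.py'], cwd=tmp_dir)
    # Import the file: __name__ is "demo_scraper", so main() is skipped.
    import_output = run_python(['-c', 'import demo_scraper'], cwd=tmp_dir)

print(repr(direct_output))
print(repr(import_output))
```

The direct run prints the message from main(), while the import prints nothing.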

Let's take a look at the get_online_sellers_results(url, headers) function mentioned earlier. This function takes url and headers parameters to create a request. At the beginning of the function, the data list in which the data will be stored is defined:

def get_online_sellers_results(url, headers):
    data = []

Now we need to parse the HTML with the Selector class from the Parsel package, into which we pass the HTML structure that was received after the request.

Up to 20 sellers fit on one page. If there are more than 20 of them, the remaining sellers are placed on additional pages. To scrape Google Product Online Sellers with pagination, you need to check for the presence of the Next button. While the Next button exists, you need to fetch the url of the next page in order to access it. If the Next button is not present, you need to break out of the while loop:

while True:
    html = requests.get(url, headers=headers)
    selector = Selector(html.text)

    # data extraction from current page will be here

    if 'Next' in selector.css('.R9e18b .internal-link::text').get():
        url = 'https://www.google.com' + selector.css('.R9e18b .internal-link::attr(data-url)').get()
    else:
        break
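One edge case worth keeping in mind: get() returns None when nothing matches the selector, and `'Next' in None` raises a TypeError. The sketch below shows the same loop shape with that case handled and with a cap on the number of pages. It is not the code from this post: fetch_page, FAKE_PAGES, and max_pages are hypothetical stand-ins so that the example runs without any network requests.

```python
def collect_sellers(start_url, fetch_page, max_pages=50):
    """Follow "Next" links until there are none left.

    fetch_page(url) is a hypothetical callable that returns a tuple of
    (sellers_on_page, next_label, next_url); next_label and next_url are
    None when the page has no pagination link.
    """
    data = []
    url = start_url

    for _ in range(max_pages):  # hard cap so a markup change cannot loop forever
        sellers, next_label, next_url = fetch_page(url)
        data.extend(sellers)

        # `'Next' in None` raises TypeError, so fall back to an empty string.
        if 'Next' in (next_label or '') and next_url:
            url = next_url
        else:
            break

    return data


# Fake three-page site used instead of real requests.
FAKE_PAGES = {
    'page-1': (['Seller A', 'Seller B'], 'Next', 'page-2'),
    'page-2': (['Seller C'], 'Next', 'page-3'),
    'page-3': (['Seller D'], None, None),
}

all_sellers = collect_sellers('page-1', FAKE_PAGES.__getitem__)
print(all_sellers)
```

In the real scraper, the same effect could be achieved by passing a default to get(), for example get(default=''), before checking for 'Next'.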

To retrieve data, you first need to find the .sh-osd__offer-row selector that is responsible for each seller and iterate over it:

for result in selector.css('.sh-osd__offer-row'):
    # data extraction from each seller will be here

Data such as name, base_price, shipping, tax and total_price are retrieved for each seller. I want to draw your attention to the fact that not every seller has a link, so a ternary expression is used when extracting:

name = result.css('.kjM2Bf::text, .b5ycib::text').get()
link = 'https://www.google.com' + result.css('.b5ycib::attr(href)').get() if result.css('.b5ycib') else None
base_price = result.css('.fObmGc::text').get()
shipping = result.css('.SuutWb tr:nth-child(2) td:nth-child(2)::text').get()
tax = result.css('.SuutWb tr:nth-child(3) td:nth-child(2)::text').get()
total_price = result.css('.drzWO::text').get()
| Code | Explanation |
|---|---|
| css() | to access elements by the passed selector. |
| ::text or ::attr(<attribute>) | to extract textual or attribute data from the node. |
| get() | to actually extract the textual data. |

After extracting all data about the seller, a dictionary with this data is appended to the data list:

data.append({
    'name': name,
    'link': link,
    'base_price': base_price,
    'additional_price': {
        'shipping': shipping,
        'tax': tax
    },
    'total_price': total_price
})

At the end of the function, the data list is returned.

return data

Output:

[
  {
    "name": "Best Buy",
    "link": "https://www.google.com/url?q=https://www.bestbuy.com/site/steelseries-aerox-3-2022-edition-lightweight-wired-optical-gaming-mouse-onyx/6485231.p%3FskuId%3D6485231%26ref%3DNS%26loc%3D101&sa=U&ved=0ahUKEwiSuKKm1r_7AhWESDABHQvhDGwQ2ykIJA&usg=AOvVaw37TQlxlXfUf7Aow3-oj3Wr",
    "base_price": "$34.99",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$3.11"
    },
    "total_price": "$38.10"
  },
  ... other sellers
  {
    "name": "Network Hardwares",
    "link": "https://www.google.com/url?q=https://www.networkhardwares.com/products/aerox-3-wireless-2022-edition-62611%3Fcurrency%3DUSD%26variant%3D41025510441165%26utm_medium%3Dcpc%26utm_source%3Dgoogle%26utm_campaign%3DGoogle%2520Shopping%26srsltid%3DAYJSbAeM3Wi-nx6CPNXcQIZqlFcEv3uyBEgwTXa36ijEua1hx_LNmAm5EiM&sa=U&ved=0ahUKEwiSuKKm1r_7AhWESDABHQvhDGwQ2ykImgE&usg=AOvVaw1rOVOsiroUgnyyTT2JBN61",
    "base_price": "$64.51",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$5.73"
    },
    "total_price": "$70.24"
  }
]
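The prices come back as strings, so you may want to turn them into numbers before comparing sellers. Here is a minimal sketch that assumes US dollar formatting like in the output above and that no field is missing; the helper names to_decimal and total_matches are mine.

```python
from decimal import Decimal


def to_decimal(price):
    """Convert a string such as "$1,234.50" to Decimal("1234.50")."""
    return Decimal(price.replace('$', '').replace(',', ''))


def total_matches(seller):
    """Check that base price + shipping + tax equals the total price."""
    extra = seller['additional_price']
    expected = (
        to_decimal(seller['base_price'])
        + to_decimal(extra['shipping'])
        + to_decimal(extra['tax'])
    )
    return expected == to_decimal(seller['total_price'])


# Values copied from the first seller in the output above.
best_buy = {
    'name': 'Best Buy',
    'base_price': '$34.99',
    'additional_price': {'shipping': '$0.00', 'tax': '$3.11'},
    'total_price': '$38.10',
}

print(to_decimal(best_buy['total_price']))
print(total_matches(best_buy))
```

Decimal is used instead of float so that sums of prices compare exactly.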

Using Google Online Sellers API from SerpApi

This section compares the DIY solution with our solution.

The main difference is that it's a quicker approach. Google Online Sellers API will bypass blocks from search engines, and you don't have to create the parser from scratch and maintain it.

First, we need to install google-search-results:

pip install google-search-results

Import the necessary libraries for work:

from serpapi import GoogleSearch
import os, json

Next, we write the necessary parameters for making a request:

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine
    'product_id': '14019378181107046593',   # product id
    'offers': True,                         # more offers, could also be set to '1', which is the same as True
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}
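Note that os.getenv() returns None when the variable is not set. As a small sketch, you could wrap the parameters in a helper that fails early with a clear message in that case. The function build_params is hypothetical and not part of the SerpApi library; the parameter values are the ones from this post.

```python
import os


def build_params(product_id, api_key=None):
    """Build the request parameters, failing early when no API key is available."""
    api_key = api_key or os.getenv('API_KEY')
    if not api_key:
        raise RuntimeError('Set the API_KEY environment variable before running the script.')

    return {
        'api_key': api_key,
        'engine': 'google_product',
        'product_id': product_id,
        'offers': True,
        'hl': 'en',
        'gl': 'us',
    }


# 'demo-key' is a placeholder, not a working key.
params = build_params('14019378181107046593', api_key='demo-key')
print(params['engine'])
```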

We then create a search object where the data is retrieved from the SerpApi backend. In the results dictionary we get data from JSON:

search = GoogleSearch(params)   # where data extraction happens on the SerpApi backend
results = search.get_dict()     # JSON -> Python dict

Retrieving the data is quite simple: we just need to access the 'sellers_results' key and then the 'online_sellers' key:

online_sellers = results['sellers_results']['online_sellers']

After reviewing the playground, you will be able to understand which keys you can access in this JSON structure.
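Direct indexing raises a KeyError if one of the keys is missing. Here is a hedged sketch of a more defensive lookup. It assumes that a failed request carries its message under an 'error' key, which you should confirm in the playground; the helper extract_online_sellers and the sample dictionaries are made up for the example.

```python
def extract_online_sellers(results):
    """Return the list of online sellers, or an empty list when the keys are absent."""
    # Assumption: failed requests carry their message under an 'error' key.
    if 'error' in results:
        raise RuntimeError(f"SerpApi returned an error: {results['error']}")

    return results.get('sellers_results', {}).get('online_sellers', [])


# Hand-written sample dictionaries, not real API responses.
sample = {'sellers_results': {'online_sellers': [{'position': 1, 'name': 'Best Buy'}]}}

print(extract_online_sellers(sample))
print(extract_online_sellers({}))
```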

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),        # your SerpApi API key
    'engine': 'google_product',             # SerpApi search engine
    'product_id': '14019378181107046593',   # product id
    'offers': True,                         # more offers, could also be set to '1', which is the same as True
    'hl': 'en',                             # language
    'gl': 'us'                              # country of the search, US -> USA
}

search = GoogleSearch(params)   # where data extraction happens on the backend
results = search.get_dict()     # JSON -> Python dict

online_sellers = results['sellers_results']['online_sellers']

print(json.dumps(online_sellers, indent=2, ensure_ascii=False))

Output:

[
  {
    "position": 1,
    "name": "Best Buy",
    "link": "https://www.google.com/url?q=https://www.bestbuy.com/site/steelseries-aerox-3-2022-edition-lightweight-wired-optical-gaming-mouse-onyx/6485231.p%3FskuId%3D6485231%26ref%3DNS%26loc%3D101&sa=U&ved=0ahUKEwiYt4fxyb_7AhXGFlkFHQZoCLMQ2ykIJA&usg=AOvVaw198AdAmbpUT5YEupYrp_iH",
    "base_price": "$34.99",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$3.02"
    },
    "total_price": "$38.01"
  },
  ... other sellers
  {
    "position": 38,
    "name": "Network Hardwares",
    "link": "https://www.google.com/url?q=https://www.networkhardwares.com/products/aerox-3-wireless-2022-edition-62611%3Fcurrency%3DUSD%26variant%3D41025510441165%26utm_medium%3Dcpc%26utm_source%3Dgoogle%26utm_campaign%3DGoogle%2520Shopping%26srsltid%3DAYJSbAdn6Cgm7HKsOdgiZ1_T8TK8NyOtSJpq2EC5meylVz982o4QDNcuTfA&sa=U&ved=0ahUKEwiYt4fxyb_7AhXGFlkFHQZoCLMQ2ykI5wE&usg=AOvVaw18MAXohnYThkG5Ip4Igqx-",
    "base_price": "$64.51",
    "additional_price": {
      "shipping": "$0.00",
      "tax": "$5.56"
    },
    "total_price": "$70.07"
  }
]
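In both outputs, the link field is a Google redirect URL that carries the seller's page in its q query parameter. If you need the seller's own URL, a minimal sketch with the standard library could look like this. The helper name merchant_url is mine, and the sample link is copied from the output above.

```python
from urllib.parse import parse_qs, urlparse


def merchant_url(google_link):
    """Return the decoded value of the q parameter, or None when it is absent."""
    query = parse_qs(urlparse(google_link).query)
    return query.get('q', [None])[0]


link = (
    'https://www.google.com/url?q=https://www.bestbuy.com/site/'
    'steelseries-aerox-3-2022-edition-lightweight-wired-optical-gaming-mouse-onyx/'
    '6485231.p%3FskuId%3D6485231%26ref%3DNS%26loc%3D101'
    '&sa=U&ved=0ahUKEwiYt4fxyb_7AhXGFlkFHQZoCLMQ2ykIJA&usg=AOvVaw198AdAmbpUT5YEupYrp_iH'
)

print(merchant_url(link))
```

parse_qs also decodes the percent-encoded characters, so the seller's own query string comes back in readable form.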

Links

Join us on Twitter | YouTube

Add a Feature Request 💫 or a Bug 🐞
