Artur Chukhrai for SerpApi

Posted on • Edited on • Originally published at serpapi.com

     

Scrape Google Realtime Search Trends with Python

What will be scraped

blog-google-trends-realtime-what-will-be-scraped

📌Note: For now, we don't have an API that supports extracting data from Google Realtime Search Trends.

This blog post shows how you can do it yourself in the meantime, while we work on releasing a proper API. We'll post an update on our Twitter once the API is released.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

```python
import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector


def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--lang=en")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, "body")))

    flag = True

    while flag:
        try:
            search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
            driver.execute_script("arguments[0].click();", search_input)
            time.sleep(3)
        except Exception:
            flag = False

    selector = Selector(driver.page_source)
    driver.quit()

    return selector


def scrape_realtime_search(selector):
    realtime_search_trends = []

    for result in selector.css(".feed-item-header"):
        index = result.css(".index::text").get().strip()
        subtitle = result.css(".summary-text a::text").get()
        subtitle_link = result.css(".summary-text a::attr(href)").get()
        source = result.css(".source-and-time span::text").get().strip()
        time = result.css(".subtitles-overlap div::text").get().strip()
        image_source = result.css(".image-text::text").get().strip()
        image_source_link = result.css(".image-link-wrapper a::attr(href)").get()
        thumbnail = f"https:{result.css('.feed-item-image-wrapper img::attr(src)').get()}"

        title = []
        title_links = {}

        for part in result.css(".title span a"):
            title_part = part.css("::text").get().strip()
            title.append(title_part)
            title_links[title_part] = f"https://trends.google.com{part.css('::attr(href)').get()}"

        realtime_search_trends.append({
            "index": index,
            "title": " • ".join(title),
            "title_links": title_links,
            "subtitle": subtitle,
            "subtitle_link": subtitle_link,
            "source": source,
            "time": time,
            "image_source": image_source,
            "image_source_link": image_source_link,
            "thumbnail": thumbnail,
        })

    print(json.dumps(realtime_search_trends, indent=2, ensure_ascii=False))


def main():
    GEO = "US"
    CATEGORY = "all"
    URL = f"https://trends.google.com/trends/trendingsearches/realtime?geo={GEO}&category={CATEGORY}"

    result = scroll_page(URL)
    scrape_realtime_search(result)


if __name__ == "__main__":
    main()
```

Preparation

Install libraries:

```bash
pip install parsel selenium webdriver_manager
```

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which also makes them useful for extracting data from matching tags and attributes.

If you haven't scraped with CSS selectors before, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Reduce the chance of being blocked

Make sure you're passing a user-agent in the request headers to act as a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request was most likely sent by a script. Check what your user-agent is.

There's a how to reduce the chance of being blocked while web scraping blog post that can get you familiar with basic and more advanced approaches.
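For example, with the `requests` library a browser-like user-agent is passed via the `headers` argument. This sketch only prepares the request instead of sending it, so you can inspect the header that would go out:

```python
import requests

# Browser-like user-agent string; the default would be "python-requests/x.y.z".
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

# Prepare (but don't send) a request to inspect the outgoing headers.
prepared = requests.Request("GET", "https://www.google.com", headers=headers).prepare()
print(prepared.headers["User-Agent"])

# To actually send the request:
# response = requests.get("https://www.google.com", headers=headers)
```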

Code Explanation

Import libraries:

```python
import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector
```
| Library | Purpose |
|---|---|
| `time` | to work with time in Python. |
| `json` | to convert extracted data to a JSON object. |
| `webdriver` | to drive a browser natively, as a user would, either locally or on a remote machine using the Selenium server. |
| `Service` | to manage the starting and stopping of ChromeDriver. |
| `By` | a set of supported locator strategies (`By.ID`, `By.TAG_NAME`, `By.XPATH`, etc.). |
| `WebDriverWait` | to wait only as long as required. |
| `expected_conditions` | a set of predefined conditions to use with `WebDriverWait`. |
| `Selector` | an XML/HTML parser with full XPath and CSS selectors support. |

Top-level code environment

This code uses the generally accepted `__name__ == "__main__"` construct:

```python
def main():
    GEO = "US"
    CATEGORY = "all"
    URL = f"https://trends.google.com/trends/trendingsearches/realtime?geo={GEO}&category={CATEGORY}"

    result = scroll_page(URL)
    scrape_realtime_search(result)


if __name__ == "__main__":
    main()
```

This check only succeeds when the user runs the file directly. If the file is imported into another module, the check fails and `main()` is not called.

You can watch the video Python Tutorial: if __name__ == '__main__' for more details.
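A minimal sketch of the behavior (the function and its return value are just for illustration):

```python
def main():
    return "running as a script"

# __name__ equals "__main__" only when this file is executed directly
# (e.g. `python script.py`); when the file is imported from another
# module, __name__ is the module's name instead and main() is not called.
if __name__ == "__main__":
    print(main())
```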

A small description of the main function:

realtime search

Scroll page

The function takes the URL and returns a full HTML structure.

First, let's understand how pagination works on the Realtime search trends page. To load more information, you must click on the LOAD MORE button below:

load more 1

📌Note: To get all the data, you need to press the button until the data runs out.

In this case, the selenium library is used, which allows you to simulate user actions in the browser. For selenium to work, you need ChromeDriver, which can be downloaded manually or via code. In our case, the second method is used. To control the start and stop of ChromeDriver, use Service, which installs the browser binaries under the hood:

```python
service = Service(ChromeDriverManager().install())
```

You should also add options for it to work correctly:

```python
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
```
| Chrome option | Explanation |
|---|---|
| `--headless` | to run Chrome in headless mode. |
| `--lang=en` | to set the browser language to English. |
| `user-agent` | to act as a "real" user request from the browser by passing it to the request headers. Check what your user-agent is. |

Now we can start webdriver and pass the URL to the get() method.

```python
driver = webdriver.Chrome(service=service, options=options)
driver.get(url)
```

It is often difficult to predict how long a page will take to load; it depends on internet speed, computer performance, and other factors. The method described below is better than a fixed delay in seconds, because the wait lasts exactly until the moment the page is fully loaded:

```python
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, "body")))
```

📌Note: In this case, we give the page 10 seconds to load; if it loads earlier, the wait ends.

When the page has loaded, it is necessary to find the LOAD MORE button. Selenium provides the ability to find an element by CSS selector.

Clicking the button is done by passing JavaScript code to the execute_script() method. Then wait a while for the data to load using the sleep() method. These actions are repeated as long as the button exists and more data can be loaded.

```python
flag = True

while flag:
    try:
        search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
        driver.execute_script("arguments[0].click();", search_input)
        time.sleep(3)
    except Exception:
        flag = False
```

Now we will use the Selector from the Parsel library, passing it the HTML structure with all the data, pagination included.

parsel offers much faster scraping times because of its engine, and there is no longer a network component: no real-time interaction with the page and its elements, only HTML parsing.

After all the operations are done, stop the driver:

```python
selector = Selector(driver.page_source)  # extracting code from HTML
driver.quit()
```

The function looks like this:

```python
def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--lang=en")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, "body")))

    flag = True

    while flag:
        try:
            search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
            driver.execute_script("arguments[0].click();", search_input)
            time.sleep(3)
        except Exception:
            flag = False

    selector = Selector(driver.page_source)
    driver.quit()

    return selector
```

In the gif below, I demonstrate how this function works:

blog-button-clicking

Scrape realtime search

This function takes the full HTML structure and prints all results in JSON format.

To scrape all items, you need to access the .feed-item-header selector. All data except the title is easily retrieved.

The title consists of several parts, separated by the • symbol. To extract it completely, each result of the .title span a selector must be additionally iterated. Add all parts to the title list and then assemble the string from the list. In addition, each title part has its own link, which is retrieved and added to the title_links dictionary, where the key is the title part and the value is the link.

The complete function to scrape all data would look like this:

```python
def scrape_realtime_search(selector):
    realtime_search_trends = []

    for result in selector.css(".feed-item-header"):
        index = result.css(".index::text").get().strip()
        subtitle = result.css(".summary-text a::text").get()
        subtitle_link = result.css(".summary-text a::attr(href)").get()
        source = result.css(".source-and-time span::text").get().strip()
        time = result.css(".subtitles-overlap div::text").get().strip()
        image_source = result.css(".image-text::text").get().strip()
        image_source_link = result.css(".image-link-wrapper a::attr(href)").get()
        thumbnail = f"https:{result.css('.feed-item-image-wrapper img::attr(src)').get()}"

        title = []
        title_links = {}

        for part in result.css(".title span a"):
            title_part = part.css("::text").get().strip()
            title.append(title_part)
            title_links[title_part] = f"https://trends.google.com{part.css('::attr(href)').get()}"

        realtime_search_trends.append({
            "index": index,
            "title": " • ".join(title),
            "title_links": title_links,
            "subtitle": subtitle,
            "subtitle_link": subtitle_link,
            "source": source,
            "time": time,
            "image_source": image_source,
            "image_source_link": image_source_link,
            "thumbnail": thumbnail,
        })

    print(json.dumps(realtime_search_trends, indent=2, ensure_ascii=False))
```
| Code | Explanation |
|---|---|
| `realtime_search_trends` | a temporary `list` where extracted data will be appended at the end of the function. |
| `css()` | to access elements by the passed selector. |
| `::text` or `::attr(<attribute>)` | to extract textual or attribute data from the node. |
| `get()` | to actually extract the textual data. |
| `strip()` | to return a copy of the string with leading and trailing characters removed. |
| `" • ".join()` | to concatenate a list into a string. |
| `realtime_search_trends.append({})` | to append extracted data to a `list` as a dictionary. |

Output

```json
[
  {
    "index": "1",
    "title": "Student • Inflation • Student debt • Student loan • CNN • Joe Biden",
    "title_links": {
      "Student": "https://trends.google.com/trends/explore?q=/m/014cnc&date=now+7-d&geo=US",
      "Inflation": "https://trends.google.com/trends/explore?q=/m/09jx2&date=now+7-d&geo=US",
      "Student debt": "https://trends.google.com/trends/explore?q=/m/051zcxv&date=now+7-d&geo=US",
      "Student loan": "https://trends.google.com/trends/explore?q=/m/02crs_&date=now+7-d&geo=US",
      "CNN": "https://trends.google.com/trends/explore?q=/m/0gsgr&date=now+7-d&geo=US",
      "Joe Biden": "https://trends.google.com/trends/explore?q=/m/012gx2&date=now+7-d&geo=US"
    },
    "subtitle": "Rival Senate candidates offer differing solutions for inflation woes",
    "subtitle_link": "https://www.cachevalleydaily.com/news/archive/2022/08/23/rival-senate-candidates-offer-differing-solutions-for-inflation-woes/",
    "source": "Cache Valley Daily",
    "time": "Aug 22, 2022 - Now",
    "image_source": "Cache Valley Daily",
    "image_source_link": "https://www.cachevalleydaily.com/news/archive/2022/08/23/rival-senate-candidates-offer-differing-solutions-for-inflation-woes/",
    "thumbnail": "https://t2.gstatic.com/images?q=tbn:ANd9GcRR2kTVJd2bJLcv4U1CgyLUf5DWZFVekQF5tRbUS6QgEKKPLcB2QvMLCC2SnuID1gr370ISH6RniOc"
  },
  ... other results
  {
    "index": "225",
    "title": "Primary election • Lee County • Ron DeSantis",
    "title_links": {
      "Primary election": "https://trends.google.com/trends/explore?q=/m/016ncr&date=now+7-d&geo=US",
      "Lee County": "https://trends.google.com/trends/explore?q=/m/0jrjb&date=now+7-d&geo=US",
      "Ron DeSantis": "https://trends.google.com/trends/explore?q=/m/0l8mn35&date=now+7-d&geo=US"
    },
    "subtitle": "Karnes, governor appointee, retains Lee Clerk of Courts position with more than 55% of the vote",
    "subtitle_link": "https://www.news-press.com/story/news/politics/elections/2022/08/23/lee-county-florida-election-results-kevin-karnes-secures-republican-primary-lee-clerk-of-courts/7836666001/",
    "source": "The News-Press",
    "time": "Aug 23, 2022 - Now",
    "image_source": "The News-Press",
    "image_source_link": "https://www.news-press.com/story/news/politics/elections/2022/08/23/lee-county-florida-election-results-kevin-karnes-secures-republican-primary-lee-clerk-of-courts/7836666001/",
    "thumbnail": "https://t1.gstatic.com/images?q=tbn:ANd9GcScg7Zo9OiYmDd4rlaqqaFkOm1okyJvAgjHZP8MQdsSjwNFtQcjbWiL0KXZl6X-VMGSYXnOa-Msa-w"
  }
]
```

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
