
Scrape Google Realtime Search Trends with Python
What will be scraped
📌Note: For now, we don't have an API that supports extracting data from Google Realtime Search Trends.
This blog post shows a way you can do it yourself while we work on releasing our proper API in the meantime. We'll announce it on our Twitter once the API is released.
Full Code
If you don't need an explanation, have a look at the full code example in the online IDE.
```python
import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector


def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument('--lang=en')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)

    # wait up to 10 seconds for the page to load
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

    # click the LOAD MORE button until it disappears
    flag = True
    while flag:
        try:
            search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
            driver.execute_script("arguments[0].click();", search_input)
            time.sleep(3)
        except:
            flag = False

    selector = Selector(driver.page_source)
    driver.quit()

    return selector


def scrape_realtime_search(selector):
    realtime_search_trends = []

    for result in selector.css('.feed-item-header'):
        index = result.css('.index::text').get().strip()
        subtitle = result.css('.summary-text a::text').get()
        subtitle_link = result.css('.summary-text a::attr(href)').get()
        source = result.css('.source-and-time span::text').get().strip()
        time = result.css('.subtitles-overlap div::text').get().strip()
        image_source = result.css('.image-text::text').get().strip()
        image_source_link = result.css('.image-link-wrapper a::attr(href)').get()
        thumbnail = f"https:{result.css('.feed-item-image-wrapper img::attr(src)').get()}"

        # the title consists of several parts, each with its own link
        title = []
        title_links = {}
        for part in result.css('.title span a'):
            title_part = part.css('::text').get().strip()
            title.append(title_part)
            title_links[title_part] = f"https://trends.google.com{part.css('::attr(href)').get()}"

        realtime_search_trends.append({
            'index': index,
            'title': " • ".join(title),
            'title_links': title_links,
            'subtitle': subtitle,
            'subtitle_link': subtitle_link,
            'source': source,
            'time': time,
            'image_source': image_source,
            'image_source_link': image_source_link,
            'thumbnail': thumbnail,
        })

    print(json.dumps(realtime_search_trends, indent=2, ensure_ascii=False))


def main():
    GEO = "US"
    CATEGORY = "all"
    URL = f"https://trends.google.com/trends/trendingsearches/realtime?geo={GEO}&category={CATEGORY}"

    result = scroll_page(URL)
    scrape_realtime_search(result)


if __name__ == "__main__":
    main()
```
Preparation
Install libraries:
```bash
pip install parsel selenium webdriver-manager
```
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, thus allowing you to extract data from matching tags and attributes.
If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
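As a quick illustration, here's how a CSS selector extracts text and an attribute with `parsel` (the HTML snippet here is made up for the example):

```python
from parsel import Selector

html = '<div class="summary-text"><a href="https://example.com">Example</a></div>'
sel = Selector(text=html)

print(sel.css('.summary-text a::text').get())        # Example
print(sel.css('.summary-text a::attr(href)').get())  # https://example.com
```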
Reduce the chance of being blocked
Make sure you're using a `user-agent` in your request headers to act as a "real" user visit. The default `requests` user-agent is `python-requests`, and websites understand that it's most likely a script sending the request. Check what your user-agent is.
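A minimal sketch of passing a custom user-agent with `requests` (the header string below is just an example; any recent browser user-agent works):

```python
import requests

headers = {
    # identify as a regular browser instead of the default "python-requests/x.y.z"
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

response = requests.get("https://trends.google.com/", headers=headers)
print(response.status_code)
```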
There's a how to reduce the chance of being blocked while web scraping blog post that can get you familiar with basic and more advanced approaches.
Code Explanation
Import libraries:
```python
import time, json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector
```
Library | Purpose |
---|---|
`time` | to work with time in Python. |
`json` | to convert extracted data to a JSON object. |
`webdriver` | to drive a browser natively, as a user would, either locally or on a remote machine using the Selenium server. |
`Service` | to manage the starting and stopping of ChromeDriver. |
`By` | a set of supported locator strategies (`By.ID`, `By.TAG_NAME`, `By.XPATH`, etc.). |
`WebDriverWait` | to wait only as long as required. |
`expected_conditions` | contains a set of predefined conditions to use with `WebDriverWait`. |
`Selector` | an XML/HTML parser that has full XPath and CSS selectors support. |
Top-level code environment
This code uses the generally accepted rule of using the `__name__ == "__main__"` construct:
```python
def main():
    GEO = "US"
    CATEGORY = "all"
    URL = f"https://trends.google.com/trends/trendingsearches/realtime?geo={GEO}&category={CATEGORY}"

    result = scroll_page(URL)
    scrape_realtime_search(result)


if __name__ == "__main__":
    main()
```
This check will only pass if the user has run this file directly. If the file is imported into another one, the check will not pass and `main()` will not be called.
You can watch the video Python Tutorial: if __name__ == '__main__' for more details.
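A minimal sketch of this behavior, using a hypothetical demo.py module:

```python
# demo.py
def main():
    print("running as a script")


if __name__ == "__main__":
    # runs for `python demo.py`, but not for `import demo` from another file
    main()
```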
A small description of the main function:

- `GEO` and `CATEGORY` define the country and the trends category for the search.
- `URL` is the Realtime search trends page URL with those parameters applied.
- `scroll_page(URL)` loads the page, clicks through the pagination, and returns the full HTML structure.
- `scrape_realtime_search(result)` extracts the data and prints it in JSON format.
Scroll page
The function takes the URL and returns a full HTML structure.
First, let's understand how pagination works on the Realtime search trends page. To load more information, you must click on the LOAD MORE button below:
📌Note: To get all the data, you need to keep clicking the button until no more data is loaded.
In this case, the `selenium` library is used, which allows you to simulate user actions in the browser. For `selenium` to work, you need `ChromeDriver`, which can be downloaded manually or with code. In our case, the second method is used. To control the starting and stopping of `ChromeDriver`, you need to use `Service`, which will install the browser binaries under the hood:
```python
service = Service(ChromeDriverManager().install())
```
You should also add `options` for everything to work correctly:
```python
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--lang=en')
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
```
Chrome options | Explanation |
---|---|
`--headless` | to run Chrome in headless mode. |
`--lang=en` | to set the browser language to English. |
`user-agent` | to act as a "real" user request from the browser by passing it to the request headers. Check what your user-agent is. |
Now we can start `webdriver` and pass the URL to the `get()` method.
```python
driver = webdriver.Chrome(service=service, options=options)
driver.get(url)
```
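As a quick sanity check, you can ask the browser which user-agent it actually reports (a small sketch, not part of the original code; it assumes `driver` was created with the options above):

```python
print(driver.execute_script("return navigator.userAgent;"))
```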
Sometimes it is difficult to predict how long a page will take to load; it all depends on internet speed, computer power, and other factors. The method described below is much better than a fixed delay in seconds, since the wait lasts exactly until the moment the page is fully loaded:
```python
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))
```
📌Note: In this case, we give the page up to 10 seconds to load; if it loads earlier, the wait ends.
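If you want to be stricter, you could wait for the trends feed itself rather than just the `<body>` tag, reusing the same selector that is scraped later (a sketch, not part of the original code):

```python
# wait until at least one trend item is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.feed-item-header'))
)
```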
When the page has loaded, we need to find the LOAD MORE button. Selenium provides the ability to find elements by CSS selectors. The button is clicked by passing JavaScript code to the `execute_script()` method, after which we wait a while for the data to load using the `sleep()` method. These actions are repeated as long as the button exists and lets us load more data.
```python
flag = True

while flag:
    try:
        search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
        driver.execute_script("arguments[0].click();", search_input)
        time.sleep(3)
    except:
        flag = False
```
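A slightly stricter variant of the same loop catches only the exception raised when the button is gone, instead of a bare `except` (a sketch, not part of the original code):

```python
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        button = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
        driver.execute_script("arguments[0].click();", button)
        time.sleep(3)
    except NoSuchElementException:
        break  # no LOAD MORE button left, all results are loaded
```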
Now we will use the `Selector` from the Parsel library, to which we pass the HTML structure with all the data, pagination included. `parsel` has much faster scraping times because of the engine itself: there is no network component anymore and no real-time interaction with the page and its elements, only HTML parsing is involved.
After all the operations are done, stop the driver:
```python
selector = Selector(driver.page_source)  # pass the loaded HTML to the parser
driver.quit()
```
The function looks like this:
```python
def scroll_page(url):
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument('--lang=en')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    driver = webdriver.Chrome(service=service, options=options)
    driver.get(url)

    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

    flag = True

    while flag:
        try:
            search_input = driver.find_element(By.CSS_SELECTOR, 'div[class*="feed-load-more-button"]')
            driver.execute_script("arguments[0].click();", search_input)
            time.sleep(3)
        except:
            flag = False

    selector = Selector(driver.page_source)
    driver.quit()

    return selector
```
In the gif below, I demonstrate how this function works:
Scrape realtime search
This function takes the full HTML structure and prints all results in JSON format.
To scrape all items, you need to iterate over the `.feed-item-header` selector. All data except the title is easily retrieved.
The title consists of several parts, separated by the • symbol. To extract it completely, each result of the `.title span a` selector must additionally be iterated. All parts are added to the `title` list, and the string is then assembled from that list. In addition, each title part has its own link, which is retrieved and added to the `title_links` dictionary, where the key is the title part and the value is the link.
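This is the corresponding part of the function below:

```python
title = []
title_links = {}

for part in result.css('.title span a'):
    title_part = part.css('::text').get().strip()
    title.append(title_part)
    title_links[title_part] = f"https://trends.google.com{part.css('::attr(href)').get()}"
```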
The complete function to scrape all data would look like this:
```python
def scrape_realtime_search(selector):
    realtime_search_trends = []

    for result in selector.css('.feed-item-header'):
        index = result.css('.index::text').get().strip()
        subtitle = result.css('.summary-text a::text').get()
        subtitle_link = result.css('.summary-text a::attr(href)').get()
        source = result.css('.source-and-time span::text').get().strip()
        time = result.css('.subtitles-overlap div::text').get().strip()
        image_source = result.css('.image-text::text').get().strip()
        image_source_link = result.css('.image-link-wrapper a::attr(href)').get()
        thumbnail = f"https:{result.css('.feed-item-image-wrapper img::attr(src)').get()}"

        title = []
        title_links = {}

        for part in result.css('.title span a'):
            title_part = part.css('::text').get().strip()
            title.append(title_part)
            title_links[title_part] = f"https://trends.google.com{part.css('::attr(href)').get()}"

        realtime_search_trends.append({
            'index': index,
            'title': " • ".join(title),
            'title_links': title_links,
            'subtitle': subtitle,
            'subtitle_link': subtitle_link,
            'source': source,
            'time': time,
            'image_source': image_source,
            'image_source_link': image_source_link,
            'thumbnail': thumbnail,
        })

    print(json.dumps(realtime_search_trends, indent=2, ensure_ascii=False))
```
Code | Explanation |
---|---|
`realtime_search_trends` | a temporary `list` where extracted data will be appended at the end of the function. |
`css()` | to access elements by the passed selector. |
`::text` or `::attr(<attribute>)` | to extract textual or attribute data from the node. |
`get()` | to actually extract the textual data. |
`strip()` | to return a copy of the string with the leading and trailing characters removed. |
`" • ".join()` | to concatenate a list into a string. |
`realtime_search_trends.append({})` | to append extracted data to a `list` as a dictionary. |
Output
[{"index":"1","title":"Student • Inflation • Student debt • Student loan • CNN • Joe Biden","title_links":{"Student":"https://trends.google.com/trends/explore?q=/m/014cnc&date=now+7-d&geo=US","Inflation":"https://trends.google.com/trends/explore?q=/m/09jx2&date=now+7-d&geo=US","Student debt":"https://trends.google.com/trends/explore?q=/m/051zcxv&date=now+7-d&geo=US","Student loan":"https://trends.google.com/trends/explore?q=/m/02crs_&date=now+7-d&geo=US","CNN":"https://trends.google.com/trends/explore?q=/m/0gsgr&date=now+7-d&geo=US","Joe Biden":"https://trends.google.com/trends/explore?q=/m/012gx2&date=now+7-d&geo=US"},"subtitle":"Rival Senate candidates offer differing solutions for inflation woes","subtitle_link":"https://www.cachevalleydaily.com/news/archive/2022/08/23/rival-senate-candidates-offer-differing-solutions-for-inflation-woes/","source":"Cache Valley Daily","time":"Aug 22, 2022 - Now","image_source":"Cache Valley Daily","image_source_link":"https://www.cachevalleydaily.com/news/archive/2022/08/23/rival-senate-candidates-offer-differing-solutions-for-inflation-woes/","thumbnail":"https://t2.gstatic.com/images?q=tbn:ANd9GcRR2kTVJd2bJLcv4U1CgyLUf5DWZFVekQF5tRbUS6QgEKKPLcB2QvMLCC2SnuID1gr370ISH6RniOc"},...otherresults{"index":"225","title":"Primary election • Lee County • Ron DeSantis","title_links":{"Primary election":"https://trends.google.com/trends/explore?q=/m/016ncr&date=now+7-d&geo=US","Lee County":"https://trends.google.com/trends/explore?q=/m/0jrjb&date=now+7-d&geo=US","Ron DeSantis":"https://trends.google.com/trends/explore?q=/m/0l8mn35&date=now+7-d&geo=US"},"subtitle":"Karnes, governor appointee, retains Lee Clerk of Courts position with more than 55% of the vote","subtitle_link":"https://www.news-press.com/story/news/politics/elections/2022/08/23/lee-county-florida-election-results-kevin-karnes-secures-republican-primary-lee-clerk-of-courts/7836666001/","source":"The News-Press","time":"Aug 23, 2022 - Now","image_source":"The News-Press","image_source_link":"https://www.news-press.com/story/news/politics/elections/2022/08/23/lee-county-florida-election-results-kevin-karnes-secures-republican-primary-lee-clerk-of-courts/7836666001/","thumbnail":"https://t1.gstatic.com/images?q=tbn:ANd9GcScg7Zo9OiYmDd4rlaqqaFkOm1okyJvAgjHZP8MQdsSjwNFtQcjbWiL0KXZl6X-VMGSYXnOa-Msa-w"}]
Add a Feature Request💫 or a Bug🐞