
Originally published at serpapi.com
Scrape Naver News Results with Python
In this blog post you'll see how to scrape the title, link, snippet, news press name, and date the news was published from Naver News Results using Python.
If you're already familiar with how I structure blog posts, then you can jump to the what will be scraped section, since the Intro, Prerequisites, and Imports sections are, for the most part, boilerplate.
This blog is suited for users with little web scraping experience.
What is Naver Search
Naver is the most widely used platform in South Korea, used there more than Google, based on Link Assistant and Croud blog posts.
Intro
This blog post is the first in a Naver web scraping series. Here you'll see how to scrape Naver News Results using Python with the `beautifulsoup`, `requests`, and `lxml` libraries.
Note: This blog post shows how to extract the data shown in the what will be scraped section, and doesn't cover handling different layouts (unless stated otherwise).
Prerequisites
```bash
pip install requests
pip install lxml
pip install beautifulsoup4
```
Make sure you have a basic knowledge of Python, a basic idea of the libraries mentioned above, and a basic understanding of CSS selectors, because you'll mostly see usage of the `select()`/`select_one()` `beautifulsoup` methods, which accept CSS selectors.
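To make the difference between the two methods concrete, here's a quick sketch of mine (not from the original post) on a throwaway snippet of HTML:

```python
from bs4 import BeautifulSoup

html = "<div class='a'>one</div><div class='a'>two</div>"
soup = BeautifulSoup(html, "lxml")

print(soup.select(".a"))      # every match -> a list of Tag objects
print(soup.select_one(".a"))  # first match only -> a single Tag (or None)
```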
Usually, I'm using the SelectorGadget extension to grab CSS selectors by clicking on the desired element in the browser. See the CSS selectors reference, or train on a few examples via CSS Diner.
However, if SelectorGadget can't get the desired element, I use the Elements tab via Dev Tools (F12 on a keyboard) to locate and grab CSS selector(s) or other HTML elements.
To test if a selector extracts the correct data, you can place those CSS selector(s) in the SelectorGadget window, or use the Dev Tools Console tab with `$$(".SELECTOR")`, which is equivalent to `document.querySelectorAll(".SELECTOR")`, to see if the correct elements are being selected.
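You can also sanity-check a selector on the Python side. A minimal sketch (it assumes a saved HTML file, like the minecraft_naver_news.html created later in this post):

```python
from bs4 import BeautifulSoup

# assumes minecraft_naver_news.html exists (created later in this post)
with open("minecraft_naver_news.html") as f:
    soup = BeautifulSoup(f.read(), "lxml")

matches = soup.select(".news_tit")
print(len(matches))  # how many elements the selector matched
print(matches[0].text if matches else "selector matched nothing")
```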
Imports
```python
import requests, lxml
from bs4 import BeautifulSoup
```
What will be scraped
All News Results from the first page.
Process
If you don't need an explanation, jump to the full code section.
There aren't a lot of steps that need to be done. We need to:
- Make a request and save the HTML locally.
- Find the correct CSS selectors or HTML elements from which to extract data.
- Extract the data.
Make a request and save HTML locally
Why save locally?
The main point of this is to make sure that your IP won't be banned or blocked for some time, which would delay the script development process.
When requests are sent constantly from the same IP (a regular user won't do that), this can be detected (tagged as unusual behavior) and blocked or banned to secure the website.
Try to save the HTML locally first, test everything you need there, and then start making actual requests.
```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "minecraft",
    "where": "news",
}

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

with open(f"{params['query']}_naver_news.html", mode="w") as file:
    file.write(html)
```
What have we done here?
Import the `requests` library:
```python
import requests
```
Add a `user-agent`:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
```
Add search query parameters:
```python
params = {
    "query": "minecraft",  # search query
    "where": "news",       # news results
}
```
Pass `user-agent` and query `params`
Pass the `user-agent` to the request `headers`, and pass the query `params` while making a request.
You can read more in-depth about why it's a good idea to pass a `user-agent` to the request headers.
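In short, `requests` identifies itself as something like `python-requests/2.x.x` by default, which is easy to flag as a bot. As a quick illustration (my example, using the third-party echo service httpbin.org, not part of the original post):

```python
import requests

# httpbin.org echoes back the request headers it received,
# so we can see the default User-Agent that requests sends
default_ua = requests.get("https://httpbin.org/headers").json()["headers"]["User-Agent"]
print(default_ua)  # e.g. "python-requests/2.26.0"
```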
After the request is made, we receive a response, which is decoded via `.text`.
```python
html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text
```
Save HTML locally
```python
with open(f"{params['query']}_naver_news.html", mode="w") as file:
    file.write(html)  # output file will be minecraft_naver_news.html
```
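One caveat: `open(mode="w")` uses the platform's default encoding, which may not handle the Korean text in the results (e.g., on some Windows setups). A safer variant of the same snippet (my suggestion, not from the original code):

```python
# same as above, but with an explicit encoding so Korean text survives the round trip
with open(f"{params['query']}_naver_news.html", mode="w", encoding="utf-8") as file:
    file.write(html)
```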
Find the correct selectors or HTML elements
Get a CSS selector of the container with all the needed data, such as title, link, etc.
```python
for news_result in soup.select(".list_news .bx"):
    # further code
```
Get a CSS selector for the title, link, etc. that will be used in the extracting part:
```python
for news_result in soup.select(".list_news .bx"):
    # hey, news_result, grab TEXT from every element with the ".news_tit" selector
    title = news_result.select_one(".news_tit").text

    # hey, news_result, grab href (link) from every element with the ".news_tit" selector
    link = news_result.select_one(".news_tit")["href"]

    # other elements..
```
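A defensive note (my addition, not part of the original code): `select_one()` returns `None` when a selector matches nothing, so a small guard avoids an `AttributeError` on layouts this post doesn't cover:

```python
for news_result in soup.select(".list_news .bx"):
    title_tag = news_result.select_one(".news_tit")
    # skip results where the expected element is missing instead of crashing
    if title_tag is None:
        continue
    title = title_tag.text
    link = title_tag["href"]
```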
Extract data
```python
import lxml, json
from bs4 import BeautifulSoup

with open("minecraft_naver_news.html", mode="r") as html_file:
    html = html_file.read()

soup = BeautifulSoup(html, "lxml")

news_data = []

for news_result in soup.select(".list_news .bx"):
    title = news_result.select_one(".news_tit").text
    link = news_result.select_one(".news_tit")["href"]
    thumbnail = news_result.select_one(".dsc_thumb img")["src"]
    snippet = news_result.select_one(".news_dsc").text
    press_name = news_result.select_one(".info.press").text
    news_date = news_result.select_one("span.info").text

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": news_date
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))
```
Now let's see what is going on here.
Import `bs4`, `lxml`, and `json` libraries:
```python
import lxml, json
from bs4 import BeautifulSoup
```
Open saved HTML file and pass to `BeautifulSoup()`
Open the saved HTML file, change the mode from writing (`mode="w"`) to reading (`mode="r"`), and pass it to `BeautifulSoup()` so it can extract elements, assigning `"lxml"` as the HTML parser.
```python
with open("minecraft_naver_news.html", mode="r") as html_file:
    html = html_file.read()  # reading

soup = BeautifulSoup(html, "lxml")
```
Create a `list()` to temporarily store the data:
```python
news_data = []
```
Iterate over the container
By container I mean the CSS selector that wraps other elements (title, link, etc.) with all the needed data inside itself, and extract it.
```python
# news_data = []

for news_result in soup.select(".list_news .bx"):
    title = news_result.select_one(".news_tit").text
    link = news_result.select_one(".news_tit")["href"]
    thumbnail = news_result.select_one(".dsc_thumb img")["src"]
    snippet = news_result.select_one(".news_dsc").text
    press_name = news_result.select_one(".info.press").text
    news_date = news_result.select_one("span.info").text
```
Append the extracted data as a dictionary to the earlier created `list()`:
```python
    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": news_date
    })
```
Print collected data
Print the data using `json.dumps()`, which in this case is used just for pretty-printing purposes.
```python
print(json.dumps(news_data, indent=2, ensure_ascii=False))

# part of the output
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": " 마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과...",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''
```
Use the newly added data
```python
for news in news_data:
    title = news["title"]
    # link, snippet, thumbnail..
    print(title)  # prints all titles that were appended to the list()
```
Full Code
Have a look at the third function, which makes an actual request to Naver search with the passed query parameters. Test it in the online IDE yourself.
```python
import lxml, json, requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "minecraft",
    "where": "news",
}

# function that parses content from a local copy of the HTML
def extract_news_from_html():
    with open("minecraft_naver_news.html", mode="r") as html_file:
        html = html_file.read()

    # calls naver_parser() function to parse the page
    data = naver_parser(html)
    print(json.dumps(data, indent=2, ensure_ascii=False))

# function that makes an actual request
def extract_naver_news_from_url():
    # .text so that both functions pass an HTML string to naver_parser()
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # calls naver_parser() function to parse the page
    data = naver_parser(html)
    print(json.dumps(data, indent=2, ensure_ascii=False))

# parser that accepts the html argument from extract_news_from_html() or extract_naver_news_from_url()
def naver_parser(html):
    soup = BeautifulSoup(html, "lxml")

    news_data = []

    for news_result in soup.select(".list_news .bx"):
        title = news_result.select_one(".news_tit").text
        link = news_result.select_one(".news_tit")["href"]
        thumbnail = news_result.select_one(".dsc_thumb img")["src"]
        snippet = news_result.select_one(".news_dsc").text
        press_name = news_result.select_one(".info.press").text
        news_date = news_result.select_one("span.info").text

        news_data.append({
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "snippet": snippet,
            "press_name": press_name,
            "news_date": news_date
        })

    return news_data
```
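Note that the functions above are only defined. To actually run either path, call one of them (a minimal usage sketch; the local file from the earlier step must exist for the first call):

```python
# parse the locally saved copy (requires minecraft_naver_news.html from the earlier step)
extract_news_from_html()

# or make a live request to Naver instead
# extract_naver_news_from_url()
```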
Using Naver News Results API
As an alternative, you can achieve the same by using SerpApi. SerpApi is a paid API with a free plan.
The difference is that there's no need to code the parser from scratch and maintain it over time (if something changes in the HTML), figure out which selectors to use, or work out how to bypass blocks from search engines.
Install SerpApi library
```bash
pip install google-search-results
```
Example code to integrate:
```python
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "naver",
    "query": "Minecraft",
    "where": "news"
}

search = GoogleSearch(params)  # where extraction happens
results = search.get_dict()    # where structured json appears

news_data = []

for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))
```
Let's see how this code works.
Import `serpapi`, `os`, and `json` libraries:
```python
from serpapi import GoogleSearch
import os
import json  # in this case used for pretty printing
```
The `os` library stands for operating system (miscellaneous operating system interfaces), and `os.getenv(SECRET_KEY)` returns the value of the environment variable key if it exists.
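For example (a small sketch; `API_KEY` here is whatever variable name you exported your SerpApi key under):

```python
import os

# returns the value of API_KEY if set, otherwise the fallback provided here
api_key = os.getenv("API_KEY", "no API_KEY set")
print(api_key)
```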
Define search parameters
Note that these parameters will be different depending on which `"engine"` you're using (except, in this case, `"api_key"` and `"query"`).
```python
params = {
    "api_key": os.getenv("API_KEY"),  # API key that is stored in the environment variable
    "engine": "naver",                # search engine
    "query": "Minecraft",             # search query
    "where": "news"                   # news results filter
    # other parameters
}
```
Create a `list()` to temporarily store the data:
```python
news_data = []
```
Iterate over each `["news_results"]` entry, and store it in the `news_data` `list()`.
The difference here is that instead of calling CSS selectors, we're extracting data from the dictionary (provided by SerpApi) by its `key`.
```python
for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })
```
Print the collected data via `json.dumps()` to see the output:
```python
print(json.dumps(news_data, indent=2, ensure_ascii=False))

# part of the output
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": " 마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과...",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''
```
Links
- Code in the online IDE
- Naver News Results API
- SelectorGadget
- An introduction to Naver
- Google Vs. Naver: Why Can’t Google Dominate Search in Korea?
Outro
If you have anything to share, any questions, suggestions, or something that isn't working correctly, feel free to drop a comment in the comment section or reach out via Twitter at @dimitryzub or @serp_api.
Yours,
Dimitry, and the rest of SerpApi Team.