Dmitriy Zub ☀️

Posted on • Edited on • Originally published at serpapi.com

Scrape Naver News Results with Python

In this blog post you will see how to scrape the title, link, snippet, news press name, and publication date from Naver News Results using Python.

If you're already familiar with how I structure blog posts, you can jump to the what will be scraped section, since the Intro, Prerequisites, and Imports sections are, for the most part, boilerplate.

This blog is suited for users with little web scraping experience.


What is Naver Search

Naver is the most widely used search platform in South Korea, where it is used more than Google, according to Link Assistant and Croud blog posts.

Intro

This blog post is the first in a Naver web scraping series. Here you'll see how to scrape Naver News Results using Python with the beautifulsoup, requests, and lxml libraries.

Note: This blog post shows how to extract the data shown in the what will be scraped section, and doesn't cover handling of different layouts (unless said otherwise).

Prerequisites

pip install requests
pip install lxml
pip install beautifulsoup4

Make sure you have a basic knowledge of Python, a basic idea of the libraries mentioned above, and a basic understanding of CSS selectors, because you'll mostly see usage of the select()/select_one() beautifulsoup methods, which accept CSS selectors.

Usually, I use the SelectorGadget extension to grab CSS selectors by clicking on the desired element in the browser. See the CSS selectors reference, or train on a few examples via CSS Diner.

However, if SelectorGadget can't get the desired element, I use the Elements tab in Dev Tools (F12 on the keyboard) to locate and grab the CSS selector(s) or other HTML elements.

To test if a selector extracts the correct data, you can place the CSS selector(s) in the SelectorGadget window, or use the Dev Tools Console tab with $$(".SELECTOR"), which is equivalent to document.querySelectorAll(".SELECTOR"), to see whether the correct elements are being selected.
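The same kind of check can be done in Python once you have the HTML; a minimal sketch, where the markup below is a made-up stand-in for a real results page, not actual Naver markup:

```python
from bs4 import BeautifulSoup

# made-up markup imitating a results list; real pages are more complex
html = """
<ul class="list_news">
  <li class="bx"><a class="news_tit" href="https://example.com/1">First title</a></li>
  <li class="bx"><a class="news_tit" href="https://example.com/2">Second title</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser; "lxml" works too

# select() returns every match; select_one() returns the first match or None
results = soup.select(".list_news .bx")
print(len(results))                                # → 2
print(results[0].select_one(".news_tit").text)     # → First title
print(results[0].select_one(".news_tit")["href"])  # → https://example.com/1
```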

Imports

import requests, lxml
from bs4 import BeautifulSoup

What will be scraped

All News Results from the first page.

what will be scraped from Naver News results

Process

If you don't need an explanation, jump to the code section.

There aren't a lot of steps that need to be done; we need to:

  1. Make a request and save the HTML locally.
  2. Find the correct CSS selectors or HTML elements from which to extract data.
  3. Extract the data.

Make a request and save HTML locally

Why save locally?

The main point of this is to make sure your IP won't be banned or blocked for some time, which would delay the script development process.

When requests are sent constantly from the same IP (a regular user won't do that), this can be detected (tagged as unusual behavior) and blocked or banned to protect the website.

Try to save HTML locally first, test everything you need there, and then start making actual requests.
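If you do end up making repeated live requests, spacing them out with a randomized delay is a common way to look less like a bot. The helper below is a sketch of that idea, not something from the original post; fetch is any zero-argument callable you supply:

```python
import time
import random

def fetch_with_delay(fetch, min_delay=2.0, max_delay=5.0):
    """Wait a randomized interval, then call fetch().

    Constant-interval traffic from one IP is easy to flag as unusual
    behavior; a randomized pause makes the pattern less regular.
    """
    time.sleep(random.uniform(min_delay, max_delay))
    return fetch()

# hypothetical usage:
# html = fetch_with_delay(lambda: requests.get(url, params=params, headers=headers).text)
```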

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "minecraft",
    "where": "news",
}

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

with open(f"{params['query']}_naver_news.html", mode="w") as file:
    file.write(html)

What have we done here?

Import the requests library

import requests

Add a user-agent

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

Add search query parameters

params = {
    "query": "minecraft",  # search query
    "where": "news",       # news results
}

Pass user-agent and query params

Pass the user-agent to the request headers and pass the query params while making the request.

Passing a user-agent to the request headers makes the request look like it comes from an actual browser, which lowers the chance of it being blocked.

After the request is made, we receive a response, which will be decoded via .text.

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

Save HTML locally

with open(f"{params['query']}_naver_news.html", mode="w") as file:
    file.write(html)  # output file will be minecraft_naver_news.html

Find correct selectors or HTML elements

Get a CSS selector of the container with all the needed data, such as title, link, etc.

Gif that shows which selectors being used as a container

for news_result in soup.select(".list_news .bx"):
    # further code

Get a CSS selector for the title, link, etc. that will be used in the extracting part

Gif that shows which selectors being used for title, link, snippet, thumbnail and other data

for news_result in soup.select(".list_news .bx"):
    # grab the TEXT from every element matching the ".news_tit" selector
    title = news_result.select_one(".news_tit").text

    # grab the href (link) from every element matching the ".news_tit" selector
    link = news_result.select_one(".news_tit")["href"]

    # other elements..

Extract data

import lxml, json
from bs4 import BeautifulSoup

with open("minecraft_naver_news.html", mode="r") as html_file:
    html = html_file.read()

soup = BeautifulSoup(html, "lxml")

news_data = []

for news_result in soup.select(".list_news .bx"):
    title = news_result.select_one(".news_tit").text
    link = news_result.select_one(".news_tit")["href"]
    thumbnail = news_result.select_one(".dsc_thumb img")["src"]
    snippet = news_result.select_one(".news_dsc").text
    press_name = news_result.select_one(".info.press").text
    news_date = news_result.select_one("span.info").text

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": news_date
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))

Now let's see what is going on here.

Import the bs4, lxml, and json libraries

import lxml, json
from bs4 import BeautifulSoup

Open the saved HTML file and pass it to BeautifulSoup()

Open the saved HTML file, change the mode from writing (mode="w") to reading (mode="r"), and pass it to BeautifulSoup() so it can extract elements, assigning "lxml" as the HTML parser.

with open("minecraft_naver_news.html", mode="r") as html_file:
    html = html_file.read()  # reading

soup = BeautifulSoup(html, "lxml")

Create a list() to temporarily store the data

news_data = []

Iterate over container

By container I mean the CSS selector that wraps other elements, such as title, link, etc., with all the needed data inside itself, from which we extract that data.

# news_data = []

for news_result in soup.select(".list_news .bx"):
    title = news_result.select_one(".news_tit").text
    link = news_result.select_one(".news_tit")["href"]
    thumbnail = news_result.select_one(".dsc_thumb img")["src"]
    snippet = news_result.select_one(".news_dsc").text
    press_name = news_result.select_one(".info.press").text
    news_date = news_result.select_one("span.info").text

Append the extracted data as a dictionary to the earlier created list()

news_data.append({
    "title": title,
    "link": link,
    "thumbnail": thumbnail,
    "snippet": snippet,
    "press_name": press_name,
    "news_date": news_date
})

Print collected data

Print the data using json.dumps(), which in this case is used just for pretty-printing purposes.

print(json.dumps(news_data, indent=2, ensure_ascii=False))

# part of the output
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": "  마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과...",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''

Access the newly added data

for news in news_data:
    title = news["title"]
    # link, snippet, thumbnail..
    print(title)  # prints all titles that were appended to the list()

Full Code

Have a look at the third function, which makes an actual request to Naver search with the passed query parameters. Test it in the online IDE yourself.

import lxml, json, requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "minecraft",
    "where": "news",
}

# function that parses content from a local copy of the html
def extract_news_from_html():
    with open("minecraft_naver_news.html", mode="r") as html_file:
        html = html_file.read()

    # calls naver_parser() function to parse the page
    data = naver_parser(html)

    print(json.dumps(data, indent=2, ensure_ascii=False))

# function that makes an actual request
def extract_naver_news_from_url():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # calls naver_parser() function to parse the page
    data = naver_parser(html)

    print(json.dumps(data, indent=2, ensure_ascii=False))

# parser that accepts the html text from either
# extract_news_from_html() or extract_naver_news_from_url()
def naver_parser(html):
    soup = BeautifulSoup(html, "lxml")

    news_data = []

    for news_result in soup.select(".list_news .bx"):
        title = news_result.select_one(".news_tit").text
        link = news_result.select_one(".news_tit")["href"]
        thumbnail = news_result.select_one(".dsc_thumb img")["src"]
        snippet = news_result.select_one(".news_dsc").text
        press_name = news_result.select_one(".info.press").text
        news_date = news_result.select_one("span.info").text

        news_data.append({
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "snippet": snippet,
            "press_name": press_name,
            "news_date": news_date
        })

    return news_data

Using Naver News Results API

As an alternative, you can achieve the same by using SerpApi. SerpApi is a paid API with a free plan.

The difference is that there's no need to code the parser from scratch and maintain it over time (in case something changes in the HTML), figure out which selectors to use, or work out how to bypass blocks from search engines.

Install SerpApi library

pip install google-search-results

Example code to integrate:

from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "naver",
    "query": "Minecraft",
    "where": "news"
}

search = GoogleSearch(params)  # where extraction happens
results = search.get_dict()    # where structured json appears

news_data = []

for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))

Let's see how this code works.

Import the serpapi, os, and json libraries

from serpapi import GoogleSearch
import os
import json  # in this case used for pretty printing

The os library stands for operating system (miscellaneous operating system interfaces); os.getenv(SECRET_KEY) returns the value of the environment variable if it exists.
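As a quick illustration (the variable names below are made up for the example):

```python
import os

os.environ["EXAMPLE_API_KEY"] = "secret-value"  # pretend this was set in your shell

print(os.getenv("EXAMPLE_API_KEY"))          # → secret-value
print(os.getenv("MISSING_KEY"))              # → None (variable doesn't exist)
print(os.getenv("MISSING_KEY", "fallback"))  # → fallback (optional default)
```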

Define search parameters

Note that these parameters will differ depending on which "engine" you're using (except, in this case, "api_key" and "query").

params = {
    "api_key": os.getenv("API_KEY"),  # API key stored in the environment variable
    "engine": "naver",                # search engine
    "query": "Minecraft",             # search query
    "where": "news"                   # news results filter
    # other parameters
}

Create a list() to temporarily store the data

news_data = []

Iterate over each item in ["news_results"], and store the data in the news_data list().

The difference here is that instead of calling CSS selectors, we're extracting data from the dictionary (provided by SerpApi) by key.

for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })

Print collected data via json.dumps() to see the output

print(json.dumps(news_data, indent=2, ensure_ascii=False))

# part of the output
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": "  마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과...",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''

Links

  1. Code in the online IDE
  2. Naver News Results API
  3. SelectorGadget
  4. An introduction to Naver
  5. Google Vs. Naver: Why Can’t Google Dominate Search in Korea?

Outro

If you have anything to share, any questions, suggestions, or something that isn't working correctly, feel free to drop a comment in the comment section or reach out via Twitter at @dimitryzub or @serp_api.

Yours,
Dimitry, and the rest of SerpApi Team.


Join us on Reddit | Twitter | YouTube
