
Originally published at serpapi.com
Scrape Naver News Results with Python
In this blog post you'll see how to scrape the title, link, snippet, news press name, and date the news was published from Naver News Results using Python.
If you're already familiar with how I structure blog posts, then you can jump to the what will be scraped section, since the Intro, Prerequisites, and Imports sections are, for the most part, boilerplate.
This blog is suited for users with little web scraping experience.
What is Naver Search
Naver is the most widely used platform in South Korea, used there more than Google, based on Link Assistant and Croud blog posts.
Intro
This blog post is the first in a Naver web scraping series. Here you'll see how to scrape Naver News Results using Python with the `beautifulsoup`, `requests`, and `lxml` libraries.
Note: This blog post shows how to extract the data shown in the what will be scraped section, and doesn't cover handling different layouts (unless stated otherwise).
Prerequisites
```bash
pip install requests
pip install lxml
pip install beautifulsoup4
```
Make sure you have a basic knowledge of Python, a basic idea of the libraries mentioned above, and a basic understanding of CSS selectors, because you'll mostly see usage of the `select()`/`select_one()` `beautifulsoup` methods, which accept CSS selectors.
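To make the difference between the two methods concrete, here's a quick sketch of mine (not from the original post) on a throwaway snippet of HTML:

```python
from bs4 import BeautifulSoup

html = "<div class='a'>one</div><div class='a'>two</div>"
soup = BeautifulSoup(html, "lxml")

print(soup.select(".a"))      # every match -> a list of Tag objects
print(soup.select_one(".a"))  # first match only -> a single Tag (or None)
```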
Usually, I'm using the SelectorGadget extension to grab CSS selectors by clicking on the desired element in the browser. See the CSS selectors reference, or train on a few examples via CSS Diner.
However, if SelectorGadget can't get the desired element, I use the Elements tab via Dev Tools (F12 on a keyboard) to locate and grab CSS selector(s) or other HTML elements.
To test if a selector extracts the correct data, you can place those CSS selector(s) in the SelectorGadget window, or use the Dev Tools Console tab with `$$(".SELECTOR")`, which is equivalent to `document.querySelectorAll(".SELECTOR")`, to see if the correct elements are being selected.
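You can also sanity-check a selector on the Python side. A minimal sketch (it assumes a saved HTML file, like the minecraft_naver_news.html created later in this post):

```python
from bs4 import BeautifulSoup

# assumes minecraft_naver_news.html exists (created later in this post)
with open("minecraft_naver_news.html") as f:
    soup = BeautifulSoup(f.read(), "lxml")

matches = soup.select(".news_tit")
print(len(matches))  # how many elements the selector matched
print(matches[0].text if matches else "selector matched nothing")
```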
Imports
```python
import requests, lxml
from bs4 import BeautifulSoup
```
What will be scraped
All News Results from the first page.
Process
If you don't need an explanation, jump to the full code section.
There aren't a lot of steps that need to be done. We need to:
- Make a request and save the HTML locally.
- Find the correct CSS selectors or HTML elements from which to extract data.
- Extract the data.
Make a request and save HTML locally
Why save locally?
The main point of this is to make sure that your IP won't be banned or blocked for some time, which would delay the script development process.
When requests are sent constantly from the same IP (a regular user won't do that), this can be detected (tagged as unusual behavior) and blocked or banned to secure the website.
Try to save the HTML locally first, test everything you need there, and then start making actual requests.
```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "minecraft",
    "where": "news",
}

html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

with open(f"{params['query']}_naver_news.html", mode="w") as file:
    file.write(html)
```
What have we done here?
Import the `requests` library:
```python
import requests
```
Add a `user-agent`:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
```
Add search query parameters:
```python
params = {
    "query": "minecraft",  # search query
    "where": "news",       # news results
}
```
Pass `user-agent` and query `params`
Pass the `user-agent` to the request `headers`, and pass the query `params` while making a request.
You can read more in-depth about why it's a good idea to pass a `user-agent` to the request headers.
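In short, `requests` identifies itself as something like `python-requests/2.x.x` by default, which is easy to flag as a bot. As a quick illustration (my example, using the third-party echo service httpbin.org, not part of the original post):

```python
import requests

# httpbin.org echoes back the request headers it received,
# so we can see the default User-Agent that requests sends
default_ua = requests.get("https://httpbin.org/headers").json()["headers"]["User-Agent"]
print(default_ua)  # e.g. "python-requests/2.26.0"
```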
After the request is made, we receive a response, which is decoded via `.text`.
```python
html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text
```
Save HTML locally
```python
with open(f"{params['query']}_naver_news.html", mode="w") as file:
    file.write(html)  # output file will be minecraft_naver_news.html
```
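One caveat: `open(mode="w")` uses the platform's default encoding, which may not handle the Korean text in the results (e.g., on some Windows setups). A safer variant of the same snippet (my suggestion, not from the original code):

```python
# same as above, but with an explicit encoding so Korean text survives the round trip
with open(f"{params['query']}_naver_news.html", mode="w", encoding="utf-8") as file:
    file.write(html)
```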
Find the correct selectors or HTML elements
Get a CSS selector of the container with all the needed data, such as title, link, etc.
```python
for news_result in soup.select(".list_news .bx"):
    # further code
```
Get a CSS selector for the title, link, etc. that will be used in the extracting part:
```python
for news_result in soup.select(".list_news .bx"):
    # hey, news_result, grab TEXT from every element with the ".news_tit" selector
    title = news_result.select_one(".news_tit").text

    # hey, news_result, grab href (link) from every element with the ".news_tit" selector
    link = news_result.select_one(".news_tit")["href"]

    # other elements..
```
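A defensive note (my addition, not part of the original code): `select_one()` returns `None` when a selector matches nothing, so a small guard avoids an `AttributeError` on layouts this post doesn't cover:

```python
for news_result in soup.select(".list_news .bx"):
    title_tag = news_result.select_one(".news_tit")
    # skip results where the expected element is missing instead of crashing
    if title_tag is None:
        continue
    title = title_tag.text
    link = title_tag["href"]
```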
Extract data
```python
import lxml, json
from bs4 import BeautifulSoup

with open("minecraft_naver_news.html", mode="r") as html_file:
    html = html_file.read()

soup = BeautifulSoup(html, "lxml")

news_data = []

for news_result in soup.select(".list_news .bx"):
    title = news_result.select_one(".news_tit").text
    link = news_result.select_one(".news_tit")["href"]
    thumbnail = news_result.select_one(".dsc_thumb img")["src"]
    snippet = news_result.select_one(".news_dsc").text
    press_name = news_result.select_one(".info.press").text
    news_date = news_result.select_one("span.info").text

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": news_date
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))
```
Now let's see what is going on here.
Import `bs4`, `lxml`, and `json` libraries:
```python
import lxml, json
from bs4 import BeautifulSoup
```
Open saved HTML file and pass to `BeautifulSoup()`
Open the saved HTML file, change the mode from writing (`mode="w"`) to reading (`mode="r"`), and pass it to `BeautifulSoup()` so it can extract elements, assigning `"lxml"` as the HTML parser.
```python
with open("minecraft_naver_news.html", mode="r") as html_file:
    html = html_file.read()  # reading

soup = BeautifulSoup(html, "lxml")
```
Create a `list()` to temporarily store the data:
```python
news_data = []
```
Iterate over the container
By container I mean the CSS selector that wraps other elements (title, link, etc.) with all the needed data inside itself, and extract it.
```python
# news_data = []

for news_result in soup.select(".list_news .bx"):
    title = news_result.select_one(".news_tit").text
    link = news_result.select_one(".news_tit")["href"]
    thumbnail = news_result.select_one(".dsc_thumb img")["src"]
    snippet = news_result.select_one(".news_dsc").text
    press_name = news_result.select_one(".info.press").text
    news_date = news_result.select_one("span.info").text
```
Append the extracted data as a dictionary to the earlier created `list()`:
```python
    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": news_date
    })
```
Print collected data
Print the data using `json.dumps()`, which in this case is used just for pretty-printing purposes.
```python
print(json.dumps(news_data, indent=2, ensure_ascii=False))

# part of the output
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": " 마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과...",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''
```
Use the newly added data
```python
for news in news_data:
    title = news["title"]
    # link, snippet, thumbnail..
    print(title)  # prints all titles that were appended to the list()
```
Full Code
Have a look at the third function, which makes an actual request to Naver search with the passed query parameters. Test it in the online IDE yourself.
```python
import lxml, json, requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "query": "minecraft",
    "where": "news",
}

# function that parses content from a local copy of the HTML
def extract_news_from_html():
    with open("minecraft_naver_news.html", mode="r") as html_file:
        html = html_file.read()

    # calls naver_parser() function to parse the page
    data = naver_parser(html)
    print(json.dumps(data, indent=2, ensure_ascii=False))

# function that makes an actual request
def extract_naver_news_from_url():
    # .text so that both functions pass an HTML string to naver_parser()
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text

    # calls naver_parser() function to parse the page
    data = naver_parser(html)
    print(json.dumps(data, indent=2, ensure_ascii=False))

# parser that accepts the html argument from extract_news_from_html() or extract_naver_news_from_url()
def naver_parser(html):
    soup = BeautifulSoup(html, "lxml")

    news_data = []

    for news_result in soup.select(".list_news .bx"):
        title = news_result.select_one(".news_tit").text
        link = news_result.select_one(".news_tit")["href"]
        thumbnail = news_result.select_one(".dsc_thumb img")["src"]
        snippet = news_result.select_one(".news_dsc").text
        press_name = news_result.select_one(".info.press").text
        news_date = news_result.select_one("span.info").text

        news_data.append({
            "title": title,
            "link": link,
            "thumbnail": thumbnail,
            "snippet": snippet,
            "press_name": press_name,
            "news_date": news_date
        })

    return news_data
```
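Note that the functions above are only defined. To actually run either path, call one of them (a minimal usage sketch; the local file from the earlier step must exist for the first call):

```python
# parse the locally saved copy (requires minecraft_naver_news.html from the earlier step)
extract_news_from_html()

# or make a live request to Naver instead
# extract_naver_news_from_url()
```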
Using Naver News Results API
As an alternative, you can achieve the same by using SerpApi. SerpApi is a paid API with a free plan.
The difference is that there's no need to code the parser from scratch and maintain it over time (if something changes in the HTML), figure out which selectors to use, or work out how to bypass blocks from search engines.
Install SerpApi library
```bash
pip install google-search-results
```
Example code to integrate:
```python
from serpapi import GoogleSearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "naver",
    "query": "Minecraft",
    "where": "news"
}

search = GoogleSearch(params)  # where extraction happens
results = search.get_dict()    # where structured json appears

news_data = []

for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })

print(json.dumps(news_data, indent=2, ensure_ascii=False))
```
Let's see how this code works.
Import `serpapi`, `os`, and `json` libraries:
```python
from serpapi import GoogleSearch
import os
import json  # in this case used for pretty printing
```
The `os` library stands for operating system (miscellaneous operating system interfaces), and `os.getenv(SECRET_KEY)` returns the value of the environment variable key if it exists.
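For example (a small sketch; `API_KEY` here is whatever variable name you exported your SerpApi key under):

```python
import os

# returns the value of API_KEY if set, otherwise the fallback provided here
api_key = os.getenv("API_KEY", "no API_KEY set")
print(api_key)
```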
Define search parameters
Note that these parameters will be different depending on which `"engine"` you're using (except, in this case, `"api_key"` and `"query"`).
```python
params = {
    "api_key": os.getenv("API_KEY"),  # API key that is stored in the environment variable
    "engine": "naver",                # search engine
    "query": "Minecraft",             # search query
    "where": "news"                   # news results filter
    # other parameters
}
```
Create a `list()` to temporarily store the data:
```python
news_data = []
```
Iterate over each `["news_results"]` entry, and store it in the `news_data` `list()`.
The difference here is that instead of calling CSS selectors, we're extracting data from the dictionary (provided by SerpApi) by its `key`.
```python
for news_result in results["news_results"]:
    title = news_result["title"]
    link = news_result["link"]
    thumbnail = news_result["thumbnail"]
    snippet = news_result["snippet"]
    press_name = news_result["news_info"]["press_name"]
    date_news_posted = news_result["news_info"]["news_date"]

    news_data.append({
        "title": title,
        "link": link,
        "thumbnail": thumbnail,
        "snippet": snippet,
        "press_name": press_name,
        "news_date": date_news_posted
    })
```
Print the collected data via `json.dumps()` to see the output:
```python
print(json.dumps(news_data, indent=2, ensure_ascii=False))

# part of the output
'''
[
  {
    "title": "Xbox, 11월부터 블록버스터 게임 연이어 출시",
    "link": "http://www.gameshot.net/common/con_view.php?code=GA617793ce93c74",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5739%2F2021%2F10%2F26%2F19571.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": " 마인크래프트(Minecraft) – 11월 3일(한국 시간) 마인크래프트는 11월 3일 Xbox Game Pass PC용에 추가될 예정이며, 새로운 마인크래프트 던전스 시즈널 어드벤처(Minecraft Dungeons Seasonal Adventures), 동굴과...",
    "press_name": "게임샷",
    "news_date": "6일 전"
  }
  # other results...
]
'''
```
Links
- Code in the online IDE
- Naver News Results API
- SelectorGadget
- An introduction to Naver
- Google Vs. Naver: Why Can’t Google Dominate Search in Korea?
Outro
If you have anything to share, any questions, suggestions, or something that isn't working correctly, feel free to drop a comment in the comment section or reach out via Twitter at @dimitryzub or @serp_api.
Yours,
Dimitry, and the rest of SerpApi Team.