# Web scraping with python3 requests and BeautifulSoup

## Installation

```
pip install -r requirements.txt
```

`requirements.txt`

```
requests==2.19.1
beautifulsoup4==4.6.3
```

The `requests` module is used for requesting the url and fetching the response, and `bs4` (beautifulsoup4) makes web scraping easier.
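If you want to confirm the install worked before going further, a quick sanity check (just a sketch, not part of this repo) is to import both packages and print their versions:

```python
import requests
import bs4

# If either import fails, the corresponding package is not installed.
print(requests.__version__)
print(bs4.__version__)
```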

## Requesting and Souping

Once you have the requirements installed you can simply import and use them. For now we will be requesting the yelp service and playing around with our modules.

`requesting_yelp.py`

```python
import requests
from bs4 import BeautifulSoup
```

You can visit yelp and search for anything in the search bar. For example, we searched for restaurants and it returned Best Restaurants in San Francisco, CA.

Copy the url from the browser and paste it into your file

`requesting_yelp.py`

```python
url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=San+Francisco%2C+CA&ns=1"
```

Now requests comes to work. Since our browser makes a GET request for this url and gets back the whole html response in the form of a new webpage, we are going to do the same with `requests.get(url)` and store the result.

`requesting_yelp.py`

```python
response = requests.get(url)
```

Now response contains the result returned from the GET request to the url. We can call methods on response, and you can actually print it:

```python
print(response)
print(response.status_code)
```

Run the file:

```
.../python_scraping_web> py requesting_yelp.py
<Response [200]>
200
```

200. The HTTP 200 OK success status response code indicates that the request has succeeded.
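As a rule of thumb, any code in the 2xx range means success. A tiny helper (hypothetical, not in the repo) captures that check before you start parsing a response:

```python
def is_success(status_code):
    """Return True for any HTTP status code in the 2xx success range."""
    return 200 <= status_code < 300

print(is_success(200))  # True
print(is_success(404))  # False
```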

To actually print the whole html that the webpage contains, we can print:

```python
print(response.text)
```

Earlier versions of python requests used to print the html from `response.text` in an ugly way, but printing it now gives us the prettified html. We can also use the bs4 module.

For that we need to create a BeautifulSoup object by passing in the text returned from the url:

```python
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
```

The tail of the output looks something like this:

```html
...
  <img height="1" src="https://www.facebook.com/tr?id=102029836881428&amp;ev=PageView&amp;noscript=1" width="1"/>
 </noscript>
 <script>
  (function() { ... })();
 </script>
 <noscript>
  <iframe frameborder="0" height="1" src="https://6372968.fls.doubleclick.net/activityi;src=6372968;type=invmedia;cat=qr3hlsqk;dc_lat=;dc_rdid=;tag_for_child_directed_treatment=;ord=1?" width="1">
  </iframe>
 </noscript>
</body>
</html>
```

The resultant html should look something like this in both the requests and BeautifulSoup cases.

But BeautifulSoup gives us more advanced methods for scraping, like `find()` and `findAll()`.
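The difference between the two is worth seeing on a tiny inline document (a sketch; the html string here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div><a href="/sf">San Francisco</a><a href="/la">Los Angeles</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching tag
print(soup.find('a').text)      # San Francisco

# findAll() returns a list of every match
print(len(soup.findAll('a')))   # 2
```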

`requesting_yelp.py`

```python
links = soup.findAll('a')
print(links)
```

```
... <a href="https://www.yelp.com.tr/" role="menuitem"><span>Turkey</span></a>,
<a href="https://www.yelp.co.uk/" role="menuitem"><span>United Kingdom</span></a>,
<a href="https://www.yelp.com/" role="menuitem"><span>United States</span></a> ...
```

A lot of links exist, so your terminal should be full of links and html tags.

We can loop over the links variable and print each individual link:

```python
for link in links:
    print(link)
```

On running:

```
...
<a href="/atlanta">Atlanta</a>
<a href="/austin">Austin</a>
<a href="/boston">Boston</a>
<a href="/chicago">Chicago</a>
<a href="/dallas">Dallas</a>
<a href="/denver">Denver</a>
<a href="/detroit">Detroit</a>
<a href="/honolulu">Honolulu</a>
<a href="/houston">Houston</a>
<a href="/la">Los Angeles</a>
<a href="/miami">Miami</a>
<a href="/minneapolis">Minneapolis</a>
<a href="/nyc">New York</a>
<a href="/philadelphia">Philadelphia</a>
<a href="/portland">Portland</a>
<a href="/sacramento">Sacramento</a>
<a href="/san-diego">San Diego</a>
<a href="/sf">San Francisco</a>
<a href="/san-jose">San Jose</a>
<a href="/seattle">Seattle</a>
<a href="/dc">Washington, DC</a>
<a href="/locations">More Cities</a>
<a href="https://yelp.com/about">About</a>
<a href="https://officialblog.yelp.com/">Blog</a>
<a href="https://www.yelp-support.com/?l=en_US">Support</a>
<a href="/static?p=tos">Terms</a>
<a href="http://www.databyacxiom.com" rel="nofollow">Some Data By Acxiom</a>
```

This looks a lot cleaner now.
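If you only need the link targets rather than the whole tags, each tag's attributes are available through `get()`. A minimal sketch on an inline snippet (the html here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<a href="/atlanta">Atlanta</a><a href="/austin">Austin</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.findAll('a'):
    # get('href') returns the attribute value, or None if it is missing
    print(link.get('href'), link.text)
```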

## Requesting Pages

So far we have been requesting a single url. In this section we will be formatting the url to request a different page.

If you look at the yelp url we used before, you might find at the very bottom of the page that pagination is being used.

So what we can do is visit another search page, say page 2, and we find that the url changed a bit. Specifically, the url has a new value at the end for page 2:

```
https://www.yelp.com/search?find_desc=Restaurants&find_loc=los+angeles&start=30
```

You guessed it right:

```
&start=30
```

is what is new in the url. If you have worked with django, you might have used pagination somewhere in your templates.

So that means we can add this value at the end of the existing url to navigate to another search result page.

Have a look at `formatting_url.py`:

```python
import requests
from bs4 import BeautifulSoup

base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc={}"
city = "los angeles"
start = 30

url = base_url.format(city)
second_page = url + '&start=' + str(start)

response = requests.get(second_page)
print(f"STATUS CODE: {response.status_code} FOR {response.url}")

soup = BeautifulSoup(response.text, 'html.parser')
links = soup.findAll('a')
```

We assign the value 30 to start, add it as `str(start)` at the end of the url, name the result `second_page` and then request that page. We get a 200 status code.

This means that by finding patterns in the url we can request more urls.

So what more could be done? We can start a loop that requests the urls and each time increments the start value by 30:

```python
start = 0
for i in range(40):
    url = base_url.format(city)
    url += '&start=' + str(start)
    start += 30
    if start == 270:
        break
    ...
```

Now, how do I know that we have to increment by 30? Well, I checked the pattern of urls by visiting the pages. We stop at 270 so that we only request the first nine pages (start values 0 through 240). You can use whatever value you want, but it should be a multiple of 30.
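Instead of mutating start inside the loop, the same pagination pattern can be built as a list of urls up front (a sketch mirroring the break at `start == 270`, so the last page uses start=240):

```python
base_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc={}"
city = "los angeles"

# start values 0, 30, 60, ..., 240 -- one url per results page
page_urls = [base_url.format(city) + '&start=' + str(start)
             for start in range(0, 270, 30)]

print(len(page_urls))        # 9
print(page_urls[-1][-10:])   # &start=240
```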

## Reading Restaurant Title

Now we will be using the previous code that we wrote in `formatting_url.py` and extracting the particular piece of text that we need from the html tags: the title of the restaurant on each search page.

Visit the url, open developer tools, point at the block of a restaurant with its title, rating, reviews etc., and find the `li` tag with class `regular-search-result`.

We will be using this class to search for the particular `li` tags in the response using BeautifulSoup.

`reading_name.py`

```python
import requests
...

info_block = soup.findAll('li', {'class': 'regular-search-result'})
print(info_block)
```

Run the file and you should see the whole `li` tag and its inner tags printed. But we want to extract the title of the restaurant from each `li` tag; for that we have to find the class used for the title of the restaurant.

The title is wrapped inside an anchor tag with class `biz-name`:

```python
info_block = soup.findAll('a', {'class': 'biz-name'})
print(info_block)

count = 0
for info in info_block:
    print(info.text)
    count += 1
print(count)
```

On printing the `text` of the html tag we get the title of the restaurant. These are not all the titles, because some blocks don't have the `biz-name` class, but we have what we need.

We can also write the names of the restaurants to a file, but we have to use a try/except block while performing the file writing operation, since the restaurant titles contain some non-str characters which could cause errors.

```python
with open('los_angeles_restaurants.txt', 'a') as file:
    start = 0
    for i in range(100):
        url = base_url.format(city, start)
        response = requests.get(url)
        start += 30
        print(f"STATUS CODE: {response.status_code} FOR {response.url}")
        soup = BeautifulSoup(response.text, 'html.parser')
        names = soup.findAll('a', {'class': 'biz-name'})
        count = 0
        for info in names:
            try:
                title = info.text
                print(title)
                file.write(title + '\n')
                count += 1
            except Exception as e:
                print(e)
        print(f"{count} RESTAURANTS EXTRACTED...")
        print(start)
        if start == 990:
            break
```

For any questions regarding what we have done so far, contact me at CodeMentor.

## Advanced Extraction

In this section we will go a little further and extract the name, address and phone number of each restaurant.

This time we will be looking for the `div` tag with class `biz-listing-large`, which contains the restaurant details.

In `writing_details.py` we have reused a lot of code from the other files; the only difference is that we open a new file, fetch the title, address and phone number from their respective classes and write them into the file.

```python
...
city = "los+angeles"
...
file_path = f'yelp-{city}.txt'
with open(file_path, 'w') as textFile:
    soup = BeautifulSoup(response.text, 'html.parser')
    businesses = soup.findAll('div', {'class': 'biz-listing-large'})
    count = 0
    for biz in businesses:
        title = biz.find('a', {'class': 'biz-name'}).text
        address = biz.find('address').text
        phone = biz.find('span', {'class': 'biz-phone'}).text
        detail = f"{title}\n{address}\n{phone}"
        textFile.write(str(detail) + '\n\n')
```

We edit the city value so that it conflicts with neither the url nor our file path name.
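Rather than hand-editing the value, the standard library can do the url encoding for us (a sketch; the original file simply hardcodes the "+"):

```python
from urllib.parse import quote_plus

# quote_plus() url-encodes a string, turning spaces into '+'
city = quote_plus("los angeles")
print(city)  # los+angeles
```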

`yelp-los+angeles.txt` still doesn't have the text formatted as nicely as we wanted, but we will be working on that in the next section.

```
AMF Beverly Lanes
            1201 W Beverly Blvd
                    (323) 728-9161
Maccheroni Republic
            332 S Broadway
                    (213) 346-9725
Home Restaurant - Los Feliz
            1760 Hillhurst Ave
                    (323) 669-0211
...
```

## Writing Clean Data

Once we have extracted the data we want to make it look good, i.e. without stray spaces and newlines, so we will use some simple logic.

`writing_clean_data.py`

```python
    ...
    with open(file_path, 'a') as textFile:
        count = 0
        for biz in businesses:
            try:
                title = biz.find('a', {'class': 'biz-name'}).text
                address = biz.find('address').contents
                # print(address)
                phone = biz.find('span', {'class': 'biz-phone'}).text
                region = biz.find('span', {'class': 'neighborhood-str-list'}).contents
                count += 1
                for item in address:
                    if "br" in item:
                        print(item.getText())
                    else:
                        print('\n' + item.strip(" \n\r\t"))
                for item in region:
                    if "br" in item:
                        print(item.getText())
                    else:
                        print(item.strip(" \n\t\r") + '\n')
    ...
```

We simply get the text of the item if there are any `br` tags; otherwise we strip the newlines, carriage returns, tabs and spaces from the text. On running the file:

```
800 W Sunset Blvd
Echo Park
4156 Santa Monica Blvd
Silver Lake
8500 Beverly Blvd
Beverly Grove
5484 Wilshire Blvd
Mid-Wilshire
5115 Wilshire Blvd
Hancock Park
126 E 6th St
Downtown
8164 W 3rd St
Beverly Grove
7910 W 3rd St
Beverly Grove
4163 W 5th St
Koreatown
435 N Fairfax Ave
Beverly Grove
1267 W Temple St
Echo Park
429 W 8th St
Downtown
724 S Spring St
Downtown
8450 W 3rd St
Beverly Grove
2308 S Union Ave
University Park
5583 W Pico Blvd
Mid-Wilshire
'NoneType' object has no attribute 'contents'
3413 Cahuenga Blvd W
Hollywood Hills
727 N Broadway
Chinatown
6602 Melrose Ave
Hancock Park
612 E 11th St
Downtown
...
```

In the same way, we have to clean the phone number:

```python
                ...
                for item in phone:
                    if "br" in item:
                        phone_number += item.getText() + " "
                    else:
                        phone_number += item.strip(" \n\t\r") + " "
                    ...
            except Exception as e:
                print(e)
                logs = open('errors.log', 'a')
                logs.write(str(e) + '\n')
                logs.close()
                address = None
                phone_number = None
                region = None
```

Again, change the value to `start = 990`, delete all the content in `yelp-{city}-clean.txt` and run the file again. All the restaurant details will be written to the file.

`yelp-{city}-clean.txt`

```
Tea Station Express
Bestia
2121 E 7th Pl Downtown (213) 514-5724
République
624 S La Brea Ave Hancock Park (310) 362-6115
The Morrison
3179 Los Feliz Blvd Atwater Village (323) 667-1839
A Food Affair
1513 S Robertson Blvd Pico-Robertson (310) 557-9795
Running Goose
1620 N Cahuenga Blvd Hollywood (323) 469-1080
Howlin’ Ray’s
Perch
448 S Hill St Downtown (213) 802-1770
Faith & Flower
705 W 9th St Downtown (213) 239-0642
```

Notice that some of the data is missing: when an error occurs, we reduce the risk of the code crashing by setting the values to None.
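One way to cut down on those None fallbacks is to guard each `find()` call so a missing tag yields an empty string instead of raising. This helper is hypothetical (`safe_text` is not in the repo), shown on made-up markup:

```python
from bs4 import BeautifulSoup

def safe_text(parent, name, attrs=None):
    """Return the stripped text of the first matching tag, or '' if absent."""
    tag = parent.find(name, attrs or {})
    return tag.text.strip() if tag else ''

# A listing that is missing its phone number
html = '<div><a class="biz-name">Bestia</a><address>2121 E 7th Pl</address></div>'
biz = BeautifulSoup(html, 'html.parser')

print(safe_text(biz, 'a', {'class': 'biz-name'}))      # Bestia
print(safe_text(biz, 'span', {'class': 'biz-phone'}))  # prints an empty line
```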
