A complete step-by-step Python web scraping guide for 2025 - learn how to fetch, parse, and analyze website data using Requests and BeautifulSoup.
If you want to explore web scraping, Python is the best place to start. Thanks to its simple syntax and great library support, Python makes it easy to extract data from websites.
In this tutorial, you’ll learn how to use Requests and Beautiful Soup to scrape web pages and analyze them. As an example, the project will collect post titles from the r/programming subreddit and determine the most mentioned programming languages.
Web scraping is the automated collection of data from websites.
Scrapers fetch a page’s HTML and extract needed data. Advanced tools may even use headless browsers to simulate user actions.
⚠️ Web scraping can break easily when a website’s structure changes. Always check for available APIs before scraping.
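For instance, Reddit serves a JSON version of most listing pages when you append `.json` to the URL. Here is a minimal sketch using that unofficial endpoint (the response structure isn't guaranteed, so treat the nested keys below as assumptions):

```python
import requests

# Unofficial JSON endpoint: the same listing as the HTML page, machine-readable.
response = requests.get(
    "https://old.reddit.com/r/programming/.json",
    headers={"User-agent": "Sorry, learning Python!"},
)
data = response.json()

# Reddit listings conventionally nest posts under data -> children.
for post in data["data"]["children"]:
    print(post["data"]["title"])
```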
Python offers unmatched simplicity and a strong ecosystem:
- Requests for handling HTTP requests
- BeautifulSoup for HTML parsing
- Scrapy and Playwright for advanced use cases
These tools are well-documented, reliable, and widely used by developers.
You’ll need Python installed. Then, install the libraries:
```
pip install requests
pip install bs4
```
Create a file named scraper.py for your code.
Fetching page data is the first step. The example below loads the front page of r/programming from the old Reddit interface.
```python
import requests

page = requests.get(
    "https://old.reddit.com/r/programming/",
    headers={"User-agent": "Sorry, learning Python!"},
)
html = page.content
```
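Before parsing, it's worth confirming the request actually succeeded. Requests can raise an exception on HTTP error status codes:

```python
# Raise an exception for 4xx/5xx responses instead of parsing an error page.
page.raise_for_status()
print(page.status_code)  # 200 on success
```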
To extract titles from the HTML, use BeautifulSoup.
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
p_tags = soup.find_all("p", "title")
titles = [p.find("a").get_text() for p in p_tags]
print(titles)
```
This prints the titles of the posts on the first page.
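If you prefer CSS selectors, BeautifulSoup's select() method expresses the same query in one step. A sketch, assuming old Reddit's current markup (each title is an anchor inside a p tag with the title class):

```python
# Equivalent extraction with a CSS selector: <a> tags directly
# inside <p class="title"> elements.
titles = [a.get_text() for a in soup.select("p.title > a")]
print(titles)
```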
You can extend the script to scrape multiple pages by looping through them.
```python
import requests
from bs4 import BeautifulSoup
import time

post_titles = []
next_page = "https://old.reddit.com/r/programming/"

for current_page in range(0, 20):
    page = requests.get(next_page, headers={"User-agent": "Sorry, learning Python!"})
    html = page.content
    soup = BeautifulSoup(html, "html.parser")
    p_tags = soup.find_all("p", "title")
    titles = [p.find("a").get_text() for p in p_tags]
    post_titles += titles
    next_page = soup.find("span", "next-button").find("a")["href"]
    time.sleep(3)

print(post_titles)
```
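One fragility worth noting: if the listing runs out of pages before the loop finishes, the next button disappears, soup.find() returns None, and the script crashes. A defensive variant of the same loop (the early break is my addition, assuming old Reddit drops the button on the final page; imports as above):

```python
post_titles = []
next_page = "https://old.reddit.com/r/programming/"

for current_page in range(0, 20):
    page = requests.get(next_page, headers={"User-agent": "Sorry, learning Python!"})
    soup = BeautifulSoup(page.content, "html.parser")
    post_titles += [p.find("a").get_text() for p in soup.find_all("p", "title")]

    # Stop paginating gracefully when there is no next button.
    next_button = soup.find("span", "next-button")
    if next_button is None:
        break
    next_page = next_button.find("a")["href"]
    time.sleep(3)
```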
After scraping, you can analyze which programming languages appear most often in post titles.
```python
language_counter = {
    "javascript": 0, "html": 0, "css": 0, "sql": 0, "python": 0,
    "typescript": 0, "java": 0, "c#": 0, "c++": 0, "php": 0, "c": 0,
    "powershell": 0, "go": 0, "rust": 0, "kotlin": 0, "dart": 0, "ruby": 0,
}

words = []
for title in post_titles:
    words += [word.lower() for word in title.split()]

for word in words:
    for key in language_counter:
        if word == key:
            language_counter[key] += 1

print(language_counter)
```
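To read the results at a glance, you can sort the counts in descending order with Python's built-in sorted():

```python
# Print language counts from most to least mentioned.
for language, count in sorted(language_counter.items(), key=lambda item: item[1], reverse=True):
    print(f"{language}: {count}")
```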
Frequent scraping can get you blocked. Use a proxy server to hide your IP and distribute requests.
Example with IPRoyal Residential Proxies:
```python
PROXIES = {
    "http": "http://yourusername:yourpassword@geo.iproyal.com:22323",
    "https": "http://yourusername:yourpassword@geo.iproyal.com:22323",
}

page = requests.get(
    next_page,
    headers={"User-agent": "Just learning Python, sorry!"},
    proxies=PROXIES,
)
```
This routes all requests through the proxy, reducing the risk of rate limiting or bans.
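If every request uses the same proxy settings, a requests.Session keeps the configuration in one place instead of repeating proxies= on each call. A sketch of that alternative:

```python
import requests

session = requests.Session()
session.headers.update({"User-agent": "Just learning Python, sorry!"})
session.proxies.update({
    "http": "http://yourusername:yourpassword@geo.iproyal.com:22323",
    "https": "http://yourusername:yourpassword@geo.iproyal.com:22323",
})

# Every request made through the session now uses the proxy and headers.
page = session.get("https://old.reddit.com/r/programming/")
```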
You’ve learned how to:
- Fetch and parse HTML with Requests + BeautifulSoup
- Scrape multiple pages of Reddit
- Count programming language mentions
- Add proxy rotation for safer scraping
For more advanced scraping, explore frameworks like Scrapy or Playwright.
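As a taste of the headless-browser approach, here is a minimal Playwright sketch (assumes pip install playwright followed by playwright install to download a browser):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a headless Chromium instance and load the page like a real browser.
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://old.reddit.com/r/programming/")
    print(page.title())
    browser.close()
```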