Anurag Rana

Posted on • Originally published at pythoncircle.com

     

Collecting one million website links

I needed a collection of different website links to experiment with a Docker cluster, so I wrote this small script to collect one million website URLs.

Code is available on GitHub too.

Running the script:

Either create a new virtual environment using python3 (a short sketch is shown below the install command) or use an existing one on your system.
Install the dependencies:

pip install requests beautifulsoup4 lxml
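If you need to create the environment from scratch, on Linux it might look like this (a sketch assuming bash and a python3 interpreter on your PATH; on Windows the activation command is venv\Scripts\activate instead):

python3 -m venv venv
source venv/bin/activate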

Activate the virtual environment and run the code.

python one_million_websites.py

Complete Code:

import requests
from bs4 import BeautifulSoup
import sys
import time

# Browser-like headers so the site does not reject the scraper outright
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/64.0.3282.167 Chrome/64.0.3282.167 Safari/537.36"
}

site_link_count = 0

# Loop over the 200 listing pages on websitelists.in
for i in range(1, 201):
    url = "http://websitelists.in/website-list-" + str(i) + ".html"
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(url + str(response.status_code))
        continue

    soup = BeautifulSoup(response.text, 'lxml')
    sites = soup.find_all("td", {"class": "web_width"})

    links = ""
    for site in sites:
        site = site.find("a")["href"]  # href of the anchor inside the table cell
        links += site + "\n"
        site_link_count += 1

    # Append this page's links to the output file
    with open("one_million_websites.txt", "a") as f:
        f.write(links)

    print(str(site_link_count) + " links found")
    time.sleep(1)  # pause between page requests

We are scraping links from the site http://www.websitelists.in/. If you inspect the webpage, you can see an anchor tag inside a td tag with the class web_width.

We convert the page response into a BeautifulSoup object, find all such elements, and extract the href value from each one.
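As a minimal illustration of that step, here is a toy example against a made-up HTML fragment (the fragment is invented for illustration; the real page's markup is larger):

from bs4 import BeautifulSoup

# Invented stand-in for one row of the website-list table
html = '<table><tr><td class="web_width"><a href="http://example.com">example.com</a></td></tr></table>'

soup = BeautifulSoup(html, "lxml")
for cell in soup.find_all("td", {"class": "web_width"}):
    print(cell.find("a")["href"])  # -> http://example.com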

Although there is already a natural delay of more than a second between consecutive requests, which is slow but gentle on the server, I still introduced an explicit one-second delay to avoid HTTP 429 (Too Many Requests) responses.
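If the site did start returning 429s, a slightly more defensive fetch could back off and honour the Retry-After header when present (a sketch, not part of the original script; the polite_get helper is hypothetical):

import time
import requests

def polite_get(url, headers, max_retries=3):
    # Retry a GET on HTTP 429, waiting for Retry-After (or 5 seconds) between attempts
    for _ in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            return response
        retry_after = response.headers.get("Retry-After", "")
        time.sleep(int(retry_after) if retry_after.isdigit() else 5)
    return response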

Scraped links are appended to the text file one_million_websites.txt in the same directory.
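After the run, a quick sanity check on the output might be counting the unique links collected (a small sketch reading the file written above):

with open("one_million_websites.txt") as f:
    unique_links = {line.strip() for line in f if line.strip()}
print(str(len(unique_links)) + " unique links collected")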

Originally published on pythoncircle.com


Top comments (3)

 
Dinny Paul:
You could use the fake_useragent Python library to change the user agent with every request so that you don't get blocked by that website, and you could also use free proxies, thereby changing your IP address with every request. :)
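For anyone who wants to try that, a rough sketch could look like the following (assuming the fake_useragent package is installed and that you have your own list of working proxies; the proxy addresses below are placeholders, and none of this is part of the original script):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
proxies = ["http://203.0.113.1:8080", "http://203.0.113.2:3128"]  # placeholder proxy addresses

for proxy in proxies:
    headers = {"User-Agent": ua.random}           # fresh user agent for each request
    proxy_map = {"http": proxy, "https": proxy}   # route both schemes through the proxy
    response = requests.get("http://websitelists.in/website-list-1.html",
                            headers=headers, proxies=proxy_map, timeout=10)
    print(proxy, response.status_code)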

Anurag Rana:
Great suggestions, Dinny. However, I feel we should be gentle on sites and not send too many requests per second. That is why I didn't feel the need to use these two libraries.

I have written another article where I used a Docker cluster to scrape data at a very high speed, although I was not able to achieve the desired results.

pythoncircle.com/post/518/scraping...

Theo Nguyen:
Good writing, thanks for that!
