Commit 53da94f: updates

1 parent: 239a0ff

3 files changed, +37 -53 lines

08_basic_email_web_crawler.py (20 additions, 24 deletions)
@@ -7,40 +7,36 @@
 link_re = re.compile(r'href="(.*?)"')


-def crawl(url, maxlevel):
+def crawl(url):

     result = set()

-    while maxlevel > 0:
+    req = requests.get(url)

-        # Get the webpage
-        req = requests.get(url)
+    # Check if successful
+    if(req.status_code != 200):
+        return []

-        # Check if successful
-        if(req.status_code != 200):
-            return []
+    # Find links
+    links = link_re.findall(req.text)

-        # Find and follow all the links
-        links = link_re.findall(req.text)
-        for link in links:
-            # Get an absolute URL for a link
-            link = urlparse.urljoin(url, link)
+    print "\nFound {} links".format(len(links))

-            # Find all emails on current page
-            result.update(email_re.findall(req.text))
+    # Search links for emails
+    for link in links:

-        print "Crawled level: {}".format(maxlevel)
+        # Get an absolute URL for a link
+        link = urlparse.urljoin(url, link)

-        # new level
-        maxlevel -= 1
-
-        # recurse
-        crawl(link, maxlevel)
+        # Find all emails on current page
+        result.update(email_re.findall(req.text))

     return result

-emails = crawl('http://www.website_goes_here_dot_com', 2)
+if __name__ == '__main__':
+    emails = crawl('http://www.realpython.com')

-print "\nScrapped e-mail addresses:"
-for email in emails:
-    print email
+    print "\nScrapped e-mail addresses:"
+    for email in emails:
+        print email
+    print "\n"

09_basic_link_web_crawler.py (15 additions, 27 deletions)
@@ -6,39 +6,27 @@
 link_re = re.compile(r'href="(.*?)"')


-def crawl(url, maxlevel):
+def crawl(url):

-    result = set()
+    req = requests.get(url)

-    while maxlevel > 0:
+    # Check if successful
+    if(req.status_code != 200):
+        return []

-        # Get the webpage
-        req = requests.get(url)
+    # Find links
+    links = link_re.findall(req.text)

-        # Check if successful
-        if(req.status_code != 200):
-            return []
+    print "\nFound {} links".format(len(links))

-        # Find and follow all the links
-        links = link_re.findall(req.text)
-        for link in links:
-            # Get an absolute URL for a link
-            link = urlparse.urljoin(url, link)
-            # add links to result set
-            result.update(link)
+    # Search links for emails
+    for link in links:

-        print "Crawled level: {}".format(maxlevel)
+        # Get an absolute URL for a link
+        link = urlparse.urljoin(url, link)

-        # new level
-        maxlevel -= 1
+        print link

-        # recurse
-        crawl(link, maxlevel)

-    return result
-
-emails = crawl('http://www.website_goes_here_dot_com', 2)
-
-print "\nScrapped links:"
-for link in links:
-    print link
+if __name__ == '__main__':
+    crawl('http://www.realpython.com')
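
Both crawlers rely on urlparse.urljoin to turn each extracted href into an absolute URL before using it. A quick illustration of that resolution step, written for Python 3 where the function lives in urllib.parse (the example URLs are made up):

# Demonstration of urljoin, the absolute-URL step used in both crawlers.
# Python 3: urllib.parse.urljoin replaces Python 2's urlparse.urljoin.
from urllib.parse import urljoin

base = 'http://www.realpython.com/blog/'

print(urljoin(base, '/about'))    # http://www.realpython.com/about
print(urljoin(base, 'post-1'))    # http://www.realpython.com/blog/post-1
print(urljoin(base, 'http://example.com/x'))  # absolute hrefs pass through unchanged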

readme.md (2 additions, 2 deletions)
@@ -7,6 +7,6 @@
 1. **05_load_json_without_dupes.py**: load json, convert to dict, raise error if there is a duplicate key
 1. **06_execution_time.py**: class used for timing execution of code
 1. **07_benchmark_permissions_loading_django.py**: benchmark loading of permissions in Django
-1. **08_basic_email_web_crawler.py**: web crawler for grabbing emails from a website recursively
-1. **09_basic_link_web_crawler.py**: web crawler for grabbing links from a website recursively
+1. **08_basic_email_web_crawler.py**: web crawler for grabbing emails from a website
+1. **09_basic_link_web_crawler.py**: web crawler for grabbing links from a website
 1. **10_find_files_recursively.py**: recursively grab files from a directory

