Update 08_basic_email_web_crawler.py #5


Merged

mjhea0 merged 1 commit into realpython:master from RajuKoushik:patch-1 on Feb 18, 2016.
47 changes: 14 additions & 33 deletions in 08_basic_email_web_crawler.py
```diff
@@ -1,45 +1,26 @@
 import requests
 import re
-try:
-    from urllib.parse import urljoin
-except ImportError:
-    from urlparse import urljoin
-
-# regex
-email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')
-link_re = re.compile(r'href="(.*?)"')
-
-
-def crawl(url):
-
-    result = set()
-
-    req = requests.get(url)
-
-    # Check if successful
-    if(req.status_code != 200):
-        return []
-
-    # Find links
-    links = link_re.findall(req.text)
-
-    print("\nFound {} links".format(len(links)))
-
-    # Search links for emails
-    for link in links:
-
-        # Get an absolute URL for a link
-        link = urljoin(url, link)
-
-        # Find all emails on current page
-        result.update(email_re.findall(req.text))
-
-    return result
-
-if __name__ == '__main__':
-    emails = crawl('http://www.realpython.com')
-
-    print("\nScrapped e-mail addresses:")
-    for email in emails:
-        print(email)
-    print("\n")
+
+#get url
+#url=input('Enter a URL (include 'http://'):')--this is wrong
+url = input('Enter a URL (include `http://`):')
+
+#connect to the url
+website = requests.get(url)
+
+#read html
+html = website.text
+
+#use re.findall to grab all the links
+links = re.findall('"((http|ftp)s?://.*?)"', html)
+
+emails = re.findall('([\w\.,]+@[\w\.,]+\.\w+)', html)
+
+#prints the number of links in the list
+print("\nFound {} links".format(len(links)))
+
+for email in emails:
+    print(email)
```
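The patched script relies entirely on two regular expressions applied to the raw HTML. As a quick sanity check of how they behave, here is a minimal sketch; the sample HTML string and variable names are illustrative, not from the repo:

```python
import re

# The two patterns from the patched script.
link_pattern = r'"((http|ftp)s?://.*?)"'
email_pattern = r'([\w\.,]+@[\w\.,]+\.\w+)'

# Illustrative sample page (not from the repo).
html = '<a href="http://example.com/about">About</a> Contact: alice@example.com'

# The link pattern has two groups, so findall returns tuples;
# the full URL is the first element of each tuple.
links = [match[0] for match in re.findall(link_pattern, html)]
emails = re.findall(email_pattern, html)

print(links)   # ['http://example.com/about']
print(emails)  # ['alice@example.com']
```

Note that the character classes include a comma, so a string like `a,b@example.com` matches in full; a stricter pattern would drop the comma from the classes.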

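The deleted `crawl()` version resolved each discovered `href` against the page URL with `urljoin` before collecting emails, which is what made relative links usable as absolute URLs. A minimal illustration of that call, using example URLs rather than anything from the script:

```python
try:
    # Python 3
    from urllib.parse import urljoin
except ImportError:
    # Python 2 fallback, as in the deleted code
    from urlparse import urljoin

base = 'http://www.realpython.com/blog/post'

# A root-relative href replaces the whole path.
print(urljoin(base, '/about/'))       # http://www.realpython.com/about/

# A relative href is resolved against the base's directory.
print(urljoin(base, 'archive.html'))  # http://www.realpython.com/blog/archive.html

# An absolute href passes through unchanged.
print(urljoin(base, 'https://github.com/realpython'))  # https://github.com/realpython
```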