Scrapy is an open-source web-crawling framework written in Python, used for web scraping and general-purpose data extraction. In this project, all sub-page links are first collected from the main page, and then email IDs are scraped from those sub-pages using a regular expression.

This article demonstrates email ID extraction using the GeeksforGeeks site as a reference.

Email IDs to be scraped from the GeeksforGeeks site: ['feedback@geeksforgeeks.org', 'classes@geeksforgeeks.org', 'complaints@geeksforgeeks.org', 'review-team@geeksforgeeks.org']

How to create an Email ID Extractor Project using Scrapy?

1. Installation of packages - run the following commands from the terminal:

pip install scrapy
pip install scrapy-selenium

2. Create the project -

scrapy startproject projectname    (here projectname is geeksemailtrack)
cd projectname
scrapy genspider spidername domain    (here spidername is emails and domain is the site to crawl, e.g. scrapy genspider emails geeksforgeeks.org)
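Running genspider creates a skeleton spider file inside the spiders/ folder; its contents are then replaced by the spider code built up in the steps below. As a rough sketch (the exact template varies with the Scrapy version, and the domain shown here is just the one used in this article):

Python3
import scrapy


class EmailsSpider(scrapy.Spider):
    name = "emails"
    allowed_domains = ["geeksforgeeks.org"]
    start_urls = ["https://www.geeksforgeeks.org/"]

    def parse(self, response):
        pass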

3. Add the following code to the settings.py file to use scrapy-selenium:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = []

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

4. Now download the ChromeDriver that matches your version of Chrome and place it next to the project's scrapy.cfg file. Refer to the ChromeDriver download site to get it.
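Since SELENIUM_DRIVER_EXECUTABLE_PATH relies on shutil.which, a quick sanity check (a minimal sketch, not part of the original project) can confirm that the driver is discoverable before running the spider:

Python3
from shutil import which

# which() returns the full path if 'chromedriver' can be located on the PATH
# (or in the current working directory on Windows), otherwise None
driver_path = which('chromedriver')

if driver_path is None:
    print("chromedriver not found - check its location or your PATH")
else:
    print("chromedriver found at:", driver_path)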

Directory structure - 
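A Scrapy project created with the commands above typically has the following layout (a sketch; the names assume projectname geeksemailtrack and spidername emails):

geeksemailtrack/
    scrapy.cfg
    geeksemailtrack/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            emails.py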


Step by Step Code - 

1. Import all required libraries - 

Python3
# web scraping framework
import scrapy

# for regular expression
import re

# for selenium request
from scrapy_selenium import SeleniumRequest

# for link extraction
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

2. Create the start_requests function to request the site through Selenium. You can add your own URL here.

Python3
def start_requests(self):
    yield SeleniumRequest(
        url="https://www.geeksforgeeks.org/",
        wait_time=3,
        screenshot=True,
        callback=self.parse,
        dont_filter=True
    )

3. Create the parse function:

Python3
def parse(self, response):
    # this helps to get all links from the source code
    links = LxmlLinkExtractor(allow=()).extract_links(response)

    # Finallinks contains the link urls
    Finallinks = [str(link.url) for link in links]

    # links list for urls that may have email ids
    links = []

    # filtering and storing only the needed urls in the links list
    # pages like about us and contact us are the ones that have email ids
    for link in Finallinks:
        if ('Contact' in link or 'contact' in link or
                'About' in link or 'about' in link or
                'CONTACT' in link or 'ABOUT' in link):
            links.append(link)

    # current page url is also added because a few sites have email ids on their main page
    links.append(str(response.url))

    # parse_link function is called for extracting email ids
    l = links[0]
    links.pop(0)

    # meta helps to transfer the links list from parse to parse_link
    yield SeleniumRequest(
        url=l,
        wait_time=3,
        screenshot=True,
        callback=self.parse_link,
        dont_filter=True,
        meta={'links': links}
    )

Explanation of the parse function -

  • In the following lines, all links are extracted from the https://www.geeksforgeeks.org/ response:
links = LxmlLinkExtractor(allow=()).extract_links(response)
Finallinks = [str(link.url) for link in links]
  • Finallinks is a list containing all the extracted links.
  • To avoid unnecessary links, a filter is applied: details are scraped from a page only if its link points to a contact or about page.
for link in Finallinks:
    if ('Contact' in link or 'contact' in link or
            'About' in link or 'about' in link or
            'CONTACT' in link or 'ABOUT' in link):
        links.append(link)
  • The filter above is not strictly necessary, but sites contain lots of links; without it, a site with 50 sub-pages would have emails extracted from all 50 sub-URLs. Since emails usually appear on the home, contact, and about pages, this filter avoids wasting time on URLs that are unlikely to contain email IDs. (A more compact, case-insensitive version of this filter is sketched after this list.)
  • The links of pages that may have email IDs are requested one by one, and email IDs are scraped using a regular expression.
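The filter in the parse function spells out each case variant explicitly. An equivalent, more compact case-insensitive version (a sketch, not part of the original spider) could replace the loop inside parse:

Python3
# case-insensitive filter for links that likely contain email ids
keywords = ('contact', 'about')
links = [link for link in Finallinks
         if any(keyword in link.lower() for keyword in keywords)]

# the current page url is still appended afterwards, as in the original parse function
links.append(str(response.url))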


 

4. Create the parse_link function:

Python3
def parse_link(self, response):
    # response.meta['links'] helps to get the links list
    links = response.meta['links']
    flag = 0

    # links that contain the following bad words are discarded
    bad_words = ['facebook', 'instagram', 'youtube', 'twitter', 'wiki', 'linkedin']

    for word in bad_words:
        # if any bad word is found in the current page url,
        # flag is assigned to 1
        if word in str(response.url):
            flag = 1
            break

    # if flag is 1 then no need to get emails from that url/page
    if (flag != 1):
        html_text = str(response.text)

        # regular expression used for email ids
        email_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

        # set of email_list to get unique email ids
        email_list = set(email_list)
        if (len(email_list) != 0):
            for i in email_list:
                # adding email ids to the final uniqueemail set
                self.uniqueemail.add(i)

    # parse_link is called again while links remain,
    # else move to the parsed function
    if (len(links) > 0):
        l = links[0]
        links.pop(0)
        yield SeleniumRequest(
            url=l,
            callback=self.parse_link,
            dont_filter=True,
            meta={'links': links}
        )
    else:
        yield SeleniumRequest(
            url=response.url,
            callback=self.parsed,
            dont_filter=True
        )

Explanation of the parse_link function:
response.text gives the complete source code of the requested URL. The regular expression '\w+@\w+\.{1}\w+' used here can be read as: look for every piece of string that starts with one or more word characters, followed by an at sign ('@'), followed by one or more word characters, a single dot, and then one or more word characters again. It is a simple regex for matching email IDs.
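A quick, standalone demonstration of this regex on a sample string (independent of the spider):

Python3
import re

sample = "Write to feedback@geeksforgeeks.org for any queries."

# same pattern used in parse_link
print(re.findall(r'\w+@\w+\.{1}\w+', sample))
# output: ['feedback@geeksforgeeks.org']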

5. Create the parsed function:

Python3
def parsed(self, response):
    # emails list built from the uniqueemail set
    emails = list(self.uniqueemail)
    finalemail = []

    for email in emails:
        # avoid garbage values by checking for '.in', '.com',
        # 'info' and 'org', and append valid email ids to finalemail
        if ('.in' in email or '.com' in email or 'info' in email or 'org' in email):
            finalemail.append(email)

    # final unique email ids from the geeksforgeeks site
    print('\n' * 2)
    print("Emails scraped", finalemail)
    print('\n' * 2)

Explanation of the parsed function:
The regex above also matches garbage values such as select@1.13; while scraping email IDs from GeeksforGeeks we know that select@1.13 is not an email ID. The parsed function therefore applies a filter that only keeps emails containing '.in', '.com', 'info', or 'org'.
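Another way to cut down on such garbage values is to tighten the regular expression itself, for example by requiring the part after the last dot to be letters only. This is a sketch of that idea, not something the original spider does:

Python3
import re

html_text = "contact: feedback@geeksforgeeks.org, noise: select@1.13"

# loose pattern used in parse_link
loose = re.findall(r'\w+@\w+\.{1}\w+', html_text)

# stricter pattern: the top-level domain must be at least two letters
strict = re.findall(r'[\w.+-]+@[\w-]+\.[A-Za-z]{2,}', html_text)

print(loose)   # ['feedback@geeksforgeeks.org', 'select@1.13']
print(strict)  # ['feedback@geeksforgeeks.org']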
 

Run the spider using the following command:

scrapy crawl spidername    (spidername is the name of the spider)
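With the complete spider shown below, where the spider's name attribute is 'emailtrack', the command becomes:

scrapy crawl emailtrack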

Garbage value in scraped emails: 


Final scraped emails: 
 


 

Complete code of the spider:

Python3
# web scraping framework
import scrapy

# for regular expression
import re

# for selenium request
from scrapy_selenium import SeleniumRequest

# for link extraction
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class EmailtrackSpider(scrapy.Spider):
    # name of spider
    name = 'emailtrack'

    # to have unique email ids
    uniqueemail = set()

    # start_requests sends a request to https://www.geeksforgeeks.org/
    # and the parse function is called on its response
    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.geeksforgeeks.org/",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        # this helps to get all links from the source code
        links = LxmlLinkExtractor(allow=()).extract_links(response)

        # Finallinks contains the link urls
        Finallinks = [str(link.url) for link in links]

        # links list for urls that may have email ids
        links = []

        # filtering and storing only the needed urls in the links list
        # pages like about us and contact us are the ones that have email ids
        for link in Finallinks:
            if ('Contact' in link or 'contact' in link or
                    'About' in link or 'about' in link or
                    'CONTACT' in link or 'ABOUT' in link):
                links.append(link)

        # current page url is also added because a few sites have email ids on their main page
        links.append(str(response.url))

        # parse_link function is called for extracting email ids
        l = links[0]
        links.pop(0)

        # meta helps to transfer the links list from parse to parse_link
        yield SeleniumRequest(
            url=l,
            wait_time=3,
            screenshot=True,
            callback=self.parse_link,
            dont_filter=True,
            meta={'links': links}
        )

    def parse_link(self, response):
        # response.meta['links'] helps to get the links list
        links = response.meta['links']
        flag = 0

        # links that contain the following bad words are discarded
        bad_words = ['facebook', 'instagram', 'youtube', 'twitter', 'wiki', 'linkedin']

        for word in bad_words:
            # if any bad word is found in the current page url,
            # flag is assigned to 1
            if word in str(response.url):
                flag = 1
                break

        # if flag is 1 then no need to get emails from that url/page
        if (flag != 1):
            html_text = str(response.text)

            # regular expression used for email ids
            email_list = re.findall(r'\w+@\w+\.{1}\w+', html_text)

            # set of email_list to get unique email ids
            email_list = set(email_list)
            if (len(email_list) != 0):
                for i in email_list:
                    # adding email ids to the final uniqueemail set
                    self.uniqueemail.add(i)

        # parse_link is called again while links remain,
        # else move to the parsed function
        if (len(links) > 0):
            l = links[0]
            links.pop(0)
            yield SeleniumRequest(
                url=l,
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
        else:
            yield SeleniumRequest(
                url=response.url,
                callback=self.parsed,
                dont_filter=True
            )

    def parsed(self, response):
        # emails list built from the uniqueemail set
        emails = list(self.uniqueemail)
        finalemail = []

        for email in emails:
            # avoid garbage values by checking for '.in', '.com',
            # 'info' and 'org', and append valid email ids to finalemail
            if ('.in' in email or '.com' in email or 'info' in email or 'org' in email):
                finalemail.append(email)

        # final unique email ids from the geeksforgeeks site
        print('\n' * 2)
        print("Emails scraped", finalemail)
        print('\n' * 2)


Reference: linkextractors
 

