Link extraction is a very common task when parsing HTML. For a general web crawler, it is the most important function to perform. Of the Python libraries available, lxml is one of the best to work with. As explained in this article, lxml provides a number of helper functions for extracting links.

lxml installation -

lxml is a Python binding for the C libraries libxslt and libxml2. So, while keeping a Python interface, it is a very fast HTML and XML parsing library. For it to work, the C libraries also need to be installed. For installation instructions, refer to the official lxml installation documentation. Command to install -
sudo apt-get install python3-lxml
or
pip install lxml
What is lxml? lxml is a library for processing XML and HTML, and it comes with an html module designed specifically for parsing HTML. An HTML string can be easily parsed with the fromstring() function, which returns the root element of the parsed tree. Calling the iterlinks() method on that element yields every link in the document. Each item it yields is a tuple of four values -
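element : The parsed node that contains the link, here the anchor tag. If only the URL is of interest, this can be ignored.
attr : The attribute the link came from, here simply 'href'. It may be None when the link sits in the text of a tag rather than in an attribute.
link : The actual URL extracted from the anchor tag.
pos : The numeric position at which the link starts within the attribute value or text. This is usually 0, but it can differ for links found in stylesheets or style attributes.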
Code #1 :
# importing library
from lxml import html

string_document = html.fromstring('hi <a href="/world">geeks</a>')

# list of (element, attribute, link, pos) tuples, one per link found
link = list(string_document.iterlinks())

# number of links found
print("Length of the link : ", len(link))
Output :
Length of the link : 1
Code #2 : Retrieving the iterlinks() tuple
# unpack the first tuple; "link" is the list built in Code #1
(element, attribute, url, pos) = link[0]

print("attribute : ", attribute)
print("\nlink : ", url)
print("\nposition : ", pos)
Output :
attribute :  href

link :  /world

position :  0
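The two snippets above handle a single link. As a minimal sketch of the same idea on a document with several links, the example below loops over everything iterlinks() yields and keeps only the anchor URLs. The SAMPLE markup and the names all_links and anchor_urls are made up for illustration.

```python
from lxml import html

# Hypothetical sample markup, not taken from any real page.
SAMPLE = (
    '<div>'
    '<a href="/world">geeks</a>'
    '<img src="/logo.png">'
    '<a href="https://example.com/about">about</a>'
    '</div>'
)

document = html.fromstring(SAMPLE)

# iterlinks() yields one (element, attribute, link, pos) tuple per link,
# including links in attributes other than href (for example img src).
all_links = [(el.tag, attr, url, pos) for el, attr, url, pos in document.iterlinks()]

# Keep only the URLs that come from the href attribute of anchor tags.
anchor_urls = [url for tag, attr, url, pos in all_links if tag == "a" and attr == "href"]

for row in all_links:
    print(row)
print("Anchor URLs :", anchor_urls)
```

Filtering on the tag and attribute matters because iterlinks() reports image sources and other link-like attributes as well as anchors.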

Working -

An ElementTree is built when lxml parses the HTML. An ElementTree is a tree structure made of parent and child nodes. Each node in the tree represents an HTML tag and holds all the relevant attributes of that tag. Once created, the tree can be iterated over to find elements, such as anchor or link tags. While the lxml.html module contains only the HTML-specific functions for creating and iterating a tree, the lxml.etree module contains the core tree-handling code.
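The parent and child structure described above can be explored directly. The following is a small sketch, assuming a made-up SAMPLE document, that walks the tree in document order and then inspects one anchor node, its attributes and its parent.

```python
from lxml import html

# Hypothetical sample document used only for illustration.
SAMPLE = (
    '<html><head><title>Demo</title></head>'
    '<body><p>Read <a href="/world">geeks</a> today.</p></body></html>'
)

root = html.fromstring(SAMPLE)

# iter() visits the root and all of its descendants in document order.
tags = [el.tag for el in root.iter()]
print("Tags in document order :", tags)

# find() returns the first element matching the path, here the first anchor.
anchor = root.find('.//a')
print("Anchor attributes :", dict(anchor.attrib))
print("Anchor text :", anchor.text)
print("Anchor parent tag :", anchor.getparent().tag)
```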

HTML parsing from files -

Instead of using the fromstring() function to parse HTML, the parse() function can be called with a filename or a URL, like html.parse('http://the/url') or html.parse('/path/to/filename'). The result is the same as loading the contents of the URL or file into a string and then calling fromstring(), except that parse() returns an ElementTree, so getroot() must be called to obtain the root element.

Code #3 : ElementTree working
import requests
import lxml.html

# requesting url
web_response = requests.get('https://www.geeksforgeeks.org/')

# building the element tree from the response body
element_tree = lxml.html.fromstring(web_response.text)

# xpath() returns a list of matches; take the first <title> element
tree_title_element = element_tree.xpath('//title')[0]

print("Tag title : ", tree_title_element.tag)
print("\nText title :", tree_title_element.text_content())
print("\nhtml title :", lxml.html.tostring(tree_title_element))
print("\ntitle tag:", tree_title_element.tag)
print("\nParent's tag title:", tree_title_element.getparent().tag)
Output :
Tag title :  title

Text title : GeeksforGeeks | A computer science portal for geeks

html title : b'<title>GeeksforGeeks | A computer science portal for geeks</title>\r\n'

title tag: title

Parent's tag title: head
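The parse() route mentioned in this section can be tried without network access. The sketch below, which assumes a made-up SAMPLE page written to a temporary file, parses the file with html.parse(), takes the root with getroot(), and checks that the links match those found by fromstring() on the same markup.

```python
import os
import tempfile

from lxml import html

# Hypothetical sample page used only for illustration.
SAMPLE = (
    '<html><head><title>Demo</title></head>'
    '<body><a href="/world">geeks</a></body></html>'
)

# Write the sample to a temporary file so that parse() has a path to read.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as handle:
    handle.write(SAMPLE)
    path = handle.name

try:
    tree = html.parse(path)      # parse() returns an ElementTree
    file_root = tree.getroot()   # getroot() gives the <html> element
finally:
    os.remove(path)

# The same markup parsed from a string, for comparison.
string_root = html.fromstring(SAMPLE)

file_links = [url for _, _, url, _ in file_root.iterlinks()]
string_links = [url for _, _, url, _ in string_root.iterlinks()]

print("Links from file   :", file_links)
print("Links from string :", string_links)
```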

Using requests to scrape -

requests is a Python library used to scrape websites. It sends a request to the web server with the get() method, which takes the URL as a parameter, and returns a Response object. This object includes details about both the request and the response. To read the web content, use the text attribute of the response (response.text is a property, not a method). This is the content the web server sent back in answer to the request.

Code #4 : Requesting web server
import requests

web_response = requests.get('https://www.geeksforgeeks.org/')
print("Response from web server :\n", web_response.text)
Output : This generates a very large amount of markup, of which only a sample is shown here.
Response from web server :
 <!DOCTYPE html><!--[if IE 7]><html lang="en-US" prefix="og: https://ogp.me/ns/"><![endif]--><<!--><html lang="en-US" prefix="og: https://ogp.me/ns/" >.........
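Putting the two libraries together, the sketch below is one possible way to go from a page's source to a list of absolute link URLs. The helper names extract_links() and fetch_links() are hypothetical and not part of either library. make_links_absolute() is used here as an extra step so that relative links such as '/world' are resolved against the page URL.

```python
from lxml import html


def extract_links(page_source, base_url):
    """Return the absolute URLs of all <a href> links in page_source."""
    document = html.fromstring(page_source)
    # Rewrite relative links in place, resolving them against base_url.
    document.make_links_absolute(base_url)
    return [
        url
        for element, attribute, url, pos in document.iterlinks()
        if element.tag == "a" and attribute == "href"
    ]


def fetch_links(url):
    """Download url with requests and return its anchor links (needs network)."""
    import requests  # imported here so extract_links() works offline

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_links(response.text, url)


# Offline demonstration on the small string used in Code #1.
SAMPLE = 'hi <a href="/world">geeks</a>'
print(extract_links(SAMPLE, "https://www.geeksforgeeks.org/"))
```

Calling fetch_links('https://www.geeksforgeeks.org/') would run the same extraction on the live page. Its output depends on the site's current content, so none is shown here.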
