I started practicing web scraping a few days ago. I wrote this code to extract data from a Wikipedia page. There are several tables that classify mountains based on their height, but they differ in size: some contain 5 columns while others contain 4. So I made this algorithm to extract all the names and attributes of the mountains into separate lists. My approach was to build a `lengths` list containing the number of `<td>` tags within the `<tr>` tags. The algorithm detects which tables contain only four columns and, for those, fills the missing column with `None`. However, I believe there is a more efficient and more Pythonic way to do it, especially in the part where I use `find_next()` repeatedly. Any suggestions are welcome.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"
content = requests.get(URL).content
soup = BeautifulSoup(content, 'html.parser')

all_tables = soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})

mountain_names = []
metres_KM = []
metres_FT = []
range_Mnt = []
location = []
lengths = []

for table in range(len(all_tables)):
    x = all_tables[table].find("tr").find_next("tr")
    y = x.find_all("td")
    lengths.append(len(y))
    for row in all_tables[table].find_all("tr"):
        try:
            mountain_names.append(row.find("td").text)
            metres_KM.append(row.find("td").find_next("td").text)
            metres_FT.append(row.find("td").find_next("td").find_next("td").text)
            if lengths[table] == 5:
                range_Mnt.append(row.find("td").find_next("td").find_next("td").find_next("td").text)
            else:
                range_Mnt.append(None)
            location.append(row.find("td").find_next("td").find_next("td").find_next("td").find_next("td").text)
        except:
            pass
```

- Is the code working as expected? – Commented Jun 25, 2018 at 23:17
- Yes, totally. However I want to find out a better way to scrape tables rather than using `find_next()` all the time. – brain_dead_cow, Jun 25, 2018 at 23:18
- Alright; by the way, welcome to Code Review. Hopefully you receive good answers! – Commented Jun 25, 2018 at 23:19
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. – Commented Jun 26, 2018 at 14:59
1 Answer
You're just looping on the rows, but not on the cells:
```python
for row in all_tables[table].find_all("tr"):
```

Rather than chaining multiple `find_next("td")` calls one after the other, add another loop using `row.find_all('td')` and append each row and cell to a 2D array.
Manipulating a 2D array is much easier and will make your code look much cleaner than `row.find("td").find_next("td").find_next("td")`.
Good luck!
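A minimal sketch of that advice applied to tables like the ones in the question, padding four-column rows with `None` so every row has five fields. (The inline HTML below is an illustrative stand-in for the Wikipedia page; on the real page you would keep the question's `soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})` lookup.)

```python
from bs4 import BeautifulSoup

# Stand-in markup: one 5-column table and one 4-column table,
# mimicking the structure of the mountain tables.
html = """
<table class="sortable">
  <tr><th>Mountain</th><th>m</th><th>ft</th><th>Range</th><th>Location</th></tr>
  <tr><td>Everest</td><td>8848</td><td>29029</td><td>Himalayas</td><td>Nepal/China</td></tr>
</table>
<table class="sortable">
  <tr><th>Mountain</th><th>m</th><th>ft</th><th>Location</th></tr>
  <tr><td>Vinson</td><td>4892</td><td>16050</td><td>Antarctica</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows_2d = []  # one inner list per data row: [name, metres, feet, range, location]
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if not cells:        # header rows use <th>, so they yield no <td> cells
            continue
        if len(cells) == 4:  # table has no "Range" column: pad it with None
            cells.insert(3, None)
        rows_2d.append(cells)

print(rows_2d)
```

Each column list then falls out with a comprehension, e.g. `mountain_names = [r[0] for r in rows_2d]`, with no chained `find_next()` calls and no `try`/`except` needed.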
Those questions contain some answers that might interest you:
To be more specific, this code snippet from @shaktimaan:
```python
data = []
table = soup.find('table', attrs={'class': 'lineItemsTable'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
```

- Thank you for your reply. Because I am new to scraping and to Python generally, what I understand is that you mean to replace the `try` part of my code with this loop? I did it, by the way, but the `data` list is empty. – brain_dead_cow, Jun 26, 2018 at 11:57
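A possible reason the `data` list came back empty: the snippet above looks up a table with class `lineItemsTable`, which belongs to the page from the linked question, not the mountains article, so nothing matches. A hedged sketch of the same loop, pointed at the class names from the question's own code (demonstrated here against a small stand-in for the page's markup, since the idea is the same):

```python
from bs4 import BeautifulSoup

# Stand-in markup: the first table carries the classes the question's code
# searches for; the second has the class from the other question's page.
html = """
<table class="sortable plainrowheaders">
  <tr><th>Mountain</th><th>m</th></tr>
  <tr><td>K2</td><td>8611</td></tr>
</table>
<table class="lineItemsTable"><tr><td>wrong table</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

data = []
# Use the classes that actually appear on the target page, and iterate the
# rows directly -- going through find('tbody') is unnecessary here, since a
# <tbody> is only present if the parsed source actually contains one.
for table in soup.find_all("table", {"class": ["sortable", "plainrowheaders"]}):
    for row in table.find_all("tr"):
        cols = [td.get_text(strip=True) for td in row.find_all("td")]
        if cols:
            data.append(cols)

print(data)
```

With matching class names the inner loop is otherwise identical to @shaktimaan's snippet.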
