I started practicing web scraping a few days ago. I wrote this code to extract data from a Wikipedia page. There are several tables that classify mountains based on their height, but they differ in size: some contain 5 columns while others contain 4. So I made this algorithm to extract all the names and attributes of the mountains into separate lists. My approach was to build a `lengths` list containing the number of `<td>` tags within the `<tr>` tags. The algorithm detects which tables contain only four columns and, for those, fills the missing column with `None`. However, I believe there is a more efficient and more Pythonic way to do it, especially in the part where I use `find_next()` repeatedly. Any suggestions are welcome.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"
content = requests.get(URL).content
soup = BeautifulSoup(content, 'html.parser')

all_tables = soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})

mountain_names = []
metres_KM = []
metres_FT = []
range_Mnt = []
location = []
lengths = []

for table in range(len(all_tables)):
    x = all_tables[table].find("tr").find_next("tr")
    y = x.find_all("td")
    lengths.append(len(y))
    for row in all_tables[table].find_all("tr"):
        try:
            mountain_names.append(row.find("td").text)
            metres_KM.append(row.find("td").find_next("td").text)
            metres_FT.append(row.find("td").find_next("td").find_next("td").text)
            if lengths[table] == 5:
                range_Mnt.append(row.find("td").find_next("td").find_next("td").find_next("td").text)
            else:
                range_Mnt.append(None)
            location.append(row.find("td").find_next("td").find_next("td").find_next("td").find_next("td").text)
        except:
            pass
```

- Is the code working as expected? – Commented Jun 25, 2018 at 23:17
- Yes, totally. However I want to find out a better way to scrape tables rather than using `find_next()` all the time. – brain_dead_cow, Jun 25, 2018 at 23:18
- Alright; by the way, welcome to Code Review. Hopefully you receive good answers! – Commented Jun 25, 2018 at 23:19
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. – Commented Jun 26, 2018 at 14:59
1 Answer
You're just looping on the rows, but not on the cells:
```python
for row in all_tables[table].find_all("tr"):
```

Rather than chaining multiple `find_next("td")` calls one after the other, add another loop using `row.find_all('td')` and append each row and cell to a 2D array.
Manipulating a 2D array is much easier and will make your code look much cleaner than `row.find("td").find_next("td").find_next("td")`.
Good luck!
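A minimal sketch of that advice applied to tables like the ones in the question, padding four-column rows with `None` so every row has five fields. (The inline HTML below is an illustrative stand-in for the Wikipedia page; on the real page you would keep the question's `soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})` lookup.)

```python
from bs4 import BeautifulSoup

# Stand-in markup: one 5-column table and one 4-column table,
# mimicking the structure of the mountain tables.
html = """
<table class="sortable">
  <tr><th>Mountain</th><th>m</th><th>ft</th><th>Range</th><th>Location</th></tr>
  <tr><td>Everest</td><td>8848</td><td>29029</td><td>Himalayas</td><td>Nepal/China</td></tr>
</table>
<table class="sortable">
  <tr><th>Mountain</th><th>m</th><th>ft</th><th>Location</th></tr>
  <tr><td>Vinson</td><td>4892</td><td>16050</td><td>Antarctica</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows_2d = []  # one inner list per data row: [name, metres, feet, range, location]
for table in soup.find_all("table"):
    for row in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if not cells:        # header rows use <th>, so they yield no <td> cells
            continue
        if len(cells) == 4:  # table has no "Range" column: pad it with None
            cells.insert(3, None)
        rows_2d.append(cells)

print(rows_2d)
```

Each column list then falls out with a comprehension, e.g. `mountain_names = [r[0] for r in rows_2d]`, with no chained `find_next()` calls and no `try`/`except` needed.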
Those questions contain some answers that might interest you:
To be more specific, this code snippet from @shaktimaan:
```python
data = []
table = soup.find('table', attrs={'class': 'lineItemsTable'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
```

- Thank you for your reply. Because I am new to scraping and to Python generally, what I understand is that you mean to replace the `try` part of my code with this loop? I did it, by the way, but the `data` list is empty. – brain_dead_cow, Jun 26, 2018 at 11:57
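A possible reason the `data` list came back empty: the snippet above looks up a table with class `lineItemsTable`, which belongs to the page from the linked question, not the mountains article, so nothing matches. A hedged sketch of the same loop, pointed at the class names from the question's own code (demonstrated here against a small stand-in for the page's markup, since the idea is the same):

```python
from bs4 import BeautifulSoup

# Stand-in markup: the first table carries the classes the question's code
# searches for; the second has the class from the other question's page.
html = """
<table class="sortable plainrowheaders">
  <tr><th>Mountain</th><th>m</th></tr>
  <tr><td>K2</td><td>8611</td></tr>
</table>
<table class="lineItemsTable"><tr><td>wrong table</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

data = []
# Use the classes that actually appear on the target page, and iterate the
# rows directly -- going through find('tbody') is unnecessary here, since a
# <tbody> is only present if the parsed source actually contains one.
for table in soup.find_all("table", {"class": ["sortable", "plainrowheaders"]}):
    for row in table.find_all("tr"):
        cols = [td.get_text(strip=True) for td in row.find_all("td")]
        if cols:
            data.append(cols)

print(data)
```

With matching class names the inner loop is otherwise identical to @shaktimaan's snippet.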
