
Abdeladim Fadheli · 5 min read · Updated May 2024 · Web Scraping


Have you ever wanted to automatically extract HTML tables from web pages and save them in a proper format on your computer? If so, you're in the right place. In this tutorial, we will use the requests and BeautifulSoup libraries to extract any table on a web page and save it to disk.

We will also use pandas to easily convert the tables to CSV format (or any other format that pandas supports). If you don't have requests, BeautifulSoup, and pandas installed, install them with the following command:

pip3 install requests bs4 pandas

If you want to go the other way around and convert pandas DataFrames to HTML tables, check this tutorial.

Open up a new Python file and follow along. Let's import the libraries:

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

We need a function that accepts the target URL and returns the corresponding soup object:

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# US english
LANGUAGE = "en-US,en;q=0.5"

def get_soup(url):
    """Constructs and returns a soup using the HTML content of `url` passed"""
    # initialize a session
    session = requests.Session()
    # set the User-Agent as a regular browser
    session.headers['User-Agent'] = USER_AGENT
    # request for english content (optional)
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    # make the request
    html = session.get(url)
    # return the soup
    return bs(html.content, "html.parser")

We first initialize a requests session and set the User-Agent header so that the request looks like it comes from a regular browser rather than a bot (some websites block bots). We then fetch the HTML content with the session.get() method and construct a BeautifulSoup object using html.parser.

Related tutorial: How to Make an Email Extractor in Python.

Since we want to extract every table on a page, we need to find all the table HTML tags and return them. The following function does exactly that:

def get_all_tables(soup):
    """Extracts and returns all tables in a soup object"""
    return soup.find_all("table")

Now we need a way to get the table headers, the column names, or whatever you want to call them:

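def get_table_headers(table):
    """Given a table soup, returns all the headers"""
    headers = []
    first_row = table.find("tr")
    if first_row is None:
        # a table without any rows has no header row
        return headers
    for th in first_row.find_all("th"):
        headers.append(th.text.strip())
    return headers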

The above function finds the first row of the table and extracts all the th tags (table headers).

Now that we know how to extract table headers, all that remains is to extract the table rows:

def get_table_rows(table):
    """Given a table, returns all its rows"""
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags
            # can be found especially in wikipedia tables below the table
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows

All the above function does is find the tr tags (table rows), extract the text of their td elements, and append it to a list. If a row has no td tags, it falls back to that row's th tags. We use table.find_all("tr")[1:] rather than all the tr tags because the first tr tag corresponds to the table headers, which we don't want to add here.
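To make the header and row logic concrete, here is a small worked example on an inline HTML string instead of a live page. The table contents are made up for illustration, and the extraction is written as comprehensions rather than as the functions above; it is a sketch of the same idea, not a replacement for them.

```python
from bs4 import BeautifulSoup

# A made-up table, used only for illustration
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find_all("table")[0]

# The first row holds the <th> header cells
headers = [th.text.strip() for th in table.find("tr").find_all("th")]

# Every row after the first holds the <td> data cells
rows = [
    [td.text.strip() for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]

print(headers)  # ['Name', 'Age']
print(rows)     # [['Alice', '30'], ['Bob', '25']]
```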

The function below takes the table name, the table headers, and all the rows, and saves them in CSV format:

def save_as_csv(table_name, headers, rows):
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")
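One caveat: building a DataFrame with columns=headers generally expects every row to have as many cells as there are headers. Tables with merged cells, or tables whose first row has no th tags, can yield rows of a different width, and pandas will typically raise a ValueError in that case. The helper below is a hypothetical addition (fit_rows_to_headers and the column-N placeholder names are not part of the original script) that pads short rows and invents names for extra columns before saving.

```python
def fit_rows_to_headers(headers, rows):
    """Return (columns, rows) where every row has exactly len(columns) cells."""
    # The widest of the header list and all rows decides the column count
    width = max([len(headers)] + [len(row) for row in rows])
    # Keep the scraped headers and add placeholder names for any extra columns
    columns = list(headers) + [
        f"column-{i}" for i in range(len(headers) + 1, width + 1)
    ]
    # Pad short rows with empty strings so each row matches the column count
    fitted = [list(row) + [""] * (width - len(row)) for row in rows]
    return columns, fitted


# Made-up data: one row is too long, one is too short
columns, fitted = fit_rows_to_headers(
    ["Name", "Age"],
    [["Alice", "30", "extra"], ["Bob"]],
)
print(columns)  # ['Name', 'Age', 'column-3']
print(fitted)   # [['Alice', '30', 'extra'], ['Bob', '', '']]
```

The returned columns and rows could then be passed to save_as_csv in place of the raw headers and rows.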

Now that we have all the core functions, let's bring them all together in the main() function:

def main(url):
    # get the soup
    soup = get_soup(url)
    # extract all the tables from the web page
    tables = get_all_tables(soup)
    print(f"[+] Found a total of {len(tables)} tables.")
    # iterate over all tables
    for i, table in enumerate(tables, start=1):
        # get the table headers
        headers = get_table_headers(table)
        # get all the rows of the table
        rows = get_table_rows(table)
        # save table as csv file
        table_name = f"table-{i}"
        print(f"[+] Saving {table_name}")
        save_as_csv(table_name, headers, rows)

The above function does the following:

Parsing the HTML content of the web page given its URL by constructing the BeautifulSoup object.
Finding all the tables on that HTML page.
Iterating over all these extracted tables and saving them one by one.

Finally, let's call the main function:

if __name__ == "__main__":
    import sys
    try:
        url = sys.argv[1]
    except IndexError:
        print("Please specify a URL.\nUsage: python html_table_extractor.py [URL]")
        # sys.exit is always available, unlike the interactive-only exit() helper
        sys.exit(1)
    main(url)
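As an alternative to reading sys.argv by hand, the standard library's argparse module can produce the usage message and the error handling automatically. This is only a sketch: the parse_args helper is a hypothetical name, and the example passes an explicit argument list with a placeholder URL so that it runs without a real command line.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(
        description="Extract every HTML table on a page and save each one as CSV."
    )
    # A single required positional argument: the page to scrape
    parser.add_argument("url", help="URL of the page to scrape")
    return parser.parse_args(argv)


# Explicit list with a placeholder URL, for demonstration only
args = parse_args(["https://example.com/page"])
print(args.url)  # https://example.com/page
```

In the script itself, calling parse_args() with no arguments and passing args.url to main() would replace the try/except block.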

The script reads the URL from its command-line arguments. Let's see whether it works:

C:\pythoncode-tutorials\web-scraping\html-table-extractor>python html_table_extractor.py https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population
[+] Found a total of 2 tables.
[+] Saving table-1
[+] Saving table-2

Nice, two CSV files appeared in my current directory, corresponding to the two tables on that Wikipedia page. Here is part of one of the extracted tables:

[Figure: Wikipedia page table extracted successfully]

Awesome! We have successfully built a Python script to extract any table from any website. Try passing other URLs and see if it works.

For JavaScript-driven websites (which load their data dynamically using JavaScript), try using the requests-html library or selenium instead.

You can also make a web crawler that downloads all tables from an entire website by extracting all website links and running this script on each URL you collect.
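One thing to watch for when running the script over many pages: main() names its files table-1, table-2, and so on for every page, so later pages would overwrite the files from earlier ones. A minimal sketch of one way around this is to derive the file name from the page URL. The table_name_for helper and its naming scheme are assumptions for illustration, not part of the original script.

```python
import re
from urllib.parse import urlparse


def table_name_for(url, index):
    """Build a file-name-safe table name that is unique per page and table."""
    parsed = urlparse(url)
    # Combine host and path, then collapse anything non-alphanumeric into "-"
    raw = f"{parsed.netloc}{parsed.path}"
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    # Fall back to a generic name if the URL has no usable host or path
    return f"{slug or 'page'}-table-{index}"


# Placeholder URLs for demonstration
print(table_name_for("https://example.com/a/b.html", 2))  # example-com-a-b-html-table-2
print(table_name_for("https://example.com/", 1))          # example-com-table-1
```

Inside main(), table_name = table_name_for(url, i) could take the place of the fixed f"table-{i}" name.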

Also, if the website you're scraping blocks your IP address for whatever reason, you can use a proxy server as a countermeasure.
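With requests, a proxy can be attached to the session so that every request made through it is routed the same way. The sketch below assumes a hypothetical build_session helper and uses a placeholder proxy address; substitute a proxy you are permitted to use. No request is sent here, so it only shows the configuration step.

```python
import requests


def build_session(proxy_url, user_agent=None):
    """Create a requests session that sends HTTP and HTTPS traffic via proxy_url."""
    session = requests.Session()
    # requests looks up the proxy to use by URL scheme
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    if user_agent:
        session.headers["User-Agent"] = user_agent
    return session


# Placeholder address: replace with a real proxy before making requests
session = build_session("http://127.0.0.1:8080")
print(session.proxies)
```

A session built this way could be used inside get_soup() in place of the plain requests.Session().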

Happy Scraping ♥

How to Convert HTML Tables into CSV Files in Python
