Every once in awhile, I’ll have the need to load data from a spreadsheet into a Python program, but one question always comes up: what’s the best way to parse a spreadsheet in Python? The goal of today’s article is to find out!
Table of Contents
Problem Introduction
Recently, I was learning a visualization library in Python called VTK, and I needed to find a way to visualize some data from a spreadsheet. Unfortunately, I had two problems:
- I didn’t know how to read a spreadsheet
- I didn’t know how to parse the data that I read
In other words, what are some good ways to read spreadsheet data? And, what are some good ways to model that data in a Python program?
Parsing CSV Files
That first question will be the target of this article. In particular, we’ll be looking at various ways to read data from a CSV file. For example, does it make sense to try to write our own parsing implementation? After all, a CSV is one of the easier file formats to parse (as seen below), and Python is great for working with strings:
Name,Age,Favorite ColorJeremy,25,BlueAlly,41,MagentaJasmine,29,Aqua
That said, we may prefer to use some of the utilities provided by Python like thecsv package. What’s the point of reinventing the wheel when there’s a battle tested library built right into the language?
As usual, we’ll tackle a handful of solutions and discuss their pros and cons. By the end of this article, you should feel comfortable parsing CSV files yourself.
Internal Representation
As for the second question, we have a couple of options. For starters, we could interpret the spreadsheet as a giant list of lists. In this example, we could give each row of data its own list and store these rows in a list:
# Heading: Name, Age, Favorite Colorcsv_matrix = [ ["Jeremy", 25, "Blue"], ["Ally", 41, "Magenta"], ["Jasmine", 29, "Aqua"]]
Alternatively, we could give each column of data their own list and store those lists in a list:
# Heading: Name, Age, Favorite Colorcsv_matrix = [ ["Jeremy", "Ally", "Jasmine"], [25, 41, 29], ["Blue", "Magenta", "Aqua"]]
In either case, we’d have a giant matrix of data that would closely resemble the original CSV file.
Of course, I’m a bit partial to dictionaries, so I might like to use one of those. For instance, what’s stopping us from creating a dictionary where each key provides us with an entire column of data?
csv_dict = { "Name": ["Jeremy", "Ally", "Jasmine"], "Age": [25, 41, 29], "Favorite Color": ["Blue", "Magenta", "Aqua"]}
Alternatively, we could flip the relationship so we’re storing a list of dictionaries. That way, the data rows are mappings:
csv_mapping_list = [ { "Name": "Jeremy", "Age": 25, "Favorite Color": "Blue" }, { "Name": "Ally", "Age": 41, "Favorite Color": "Magenta" }, { "Name": "Jasmine", "Age": 29, "Favorite Color": "Aqua" }]
Personally, I prefer this last representation because data samples stick together. In other words, there’s no risk of distorting the original data set during actions like sorting. Meanwhile, the other three representations have independent lists which have to be sorted together. That’s enough to give this last representation the win.
In the next section, we’ll start digging into some parsing solutions which will all leverage this last representation. If you have any questions about the other three, feel free to drop them in the comments below.
Solutions
As always, we’ll cover a handful of common solutions to today’s problem. Feel free to grab what you need and run, but I recommend reading through the pros and cons of each solution. After all, you’ll never know when some of this might be useful.
Parse a Spreadsheet with Brute Force
At its core, a spreadsheet is a comma separated file (CSV). If we want to parse one by hand, we need to be aware of a few things:
- The header row
- The line endings
- The delimiter (in this case a comma)
In our original example, we had a spreadsheet that looked something like the following:
Name,Age,Favorite ColorJeremy,25,BlueAlly,41,MagentaJasmine,29,Aqua
To parse a spreadsheet in Python by hand, we’d want to read each line and split it by comma. After that, we’d need to do some post processing to get the format we want:
csv_mapping_list = []with open("/path/to/data.csv") as my_data: line_count = 0 for line in my_data: row_list = [val.strip() for val in line.split(",")] if line_count == 0: header = row_list else: row_dict = {} for i, key in enumerate(header): row_dict[key] = row_list[i] csv_mapping_list.append(row_dict) line_count += 1
Here is my very rough attempt at parsing a spreadsheet by hand. In this example, we open the CSV and read it line by line. For each line, we split it by comma and systematically trim each value using a list comprehension.
From there, we decide if the line we just parsed is the header or not. If it is, we save its value for later. Otherwise, we iterate over the row of values and map them into a dictionary using the header. I suppose it may have been simpler to use one of the methods from ourHow to Convert Two Lists into a Dictionary in Python article:
csv_mapping_list = []with open("/path/to/data.csv") as my_data: line_count = 0 for line in my_data: row_list = [val.strip() for val in line.split(",")] if line_count == 0: header = row_list else: row_dict = {key: value for key, value in zip(header, row_list)} csv_mapping_list.append(row_dict) line_count += 1
In either case, this solution leaves a lot of room for modification. For instance, if your file has some other kind of delimiter, this may be the solution for you.
That said, be aware that the brute force solution has a major drawback. According to Juha-Matti Santala, splitting by comma can fail if one of the rows contains text with a comma in it (i.e.
"Grifski, Jeremy",25,Blue
). To make matters worse, this problem changes depending on which delimiter you use. Fortunately, there are better solutions to follow!
Parse a Spreadsheet with the CSV Reader Object
As mentioned previously, we don’t have to write our own CSV parser if we don’t want to. Instead, we can get by with thecsv package. As you can probably imagine, it has a ton of CSV parsing functionality. In particular, it contains thereader object which we can use to read a CSV file like we did previously:
import csvcsv_mapping_list = []with open("/path/to/data.csv") as my_data: csv_reader = csv.reader(my_data, delimiter=",") line_count = 0 for line in csv_reader: if line_count == 0: header = line else: row_dict = {key: value for key, value in zip(header, line)} csv_mapping_list.append(row_dict) line_count += 1
With thereader object, we haven’t managed to simplify our code that much. In fact, all we did was replace thelist comprehension with the reader instantiation. Regardless, it’s a nice option for those who would prefer not to write their own parser.
Parse a Spreadsheet with the CSV DictReader Object
At this point, you’re probably wondering why we would even use thecsv library. After all, it barely made a difference in the last two examples. Fortunately, there is a way to reduce our code a bit using a special class in thecsv library called DictReader:
import csvwith open("/path/to/dict.csv") as my_data: csv_mapping_list = list(csv.DictReader(my_data))
And, there we have it! All the parsing we did in the first two sections has now been drastically reduced to just 3 lines of code.
That said, there is a minor difference in this solution. Instead of creating a list of dictionaries, we’ve created a list ofOrderedDict objects. They can be used just like dictionaries, but their key order is fixed. Regardless, the elegance of this solution should more than make up for that fact, and in some cases we may even want to preserve key order.
A Little Recap
At this point, I find it nice to look at all the potential solutions at once:
# Brute force solutioncsv_mapping_list = []with open("/path/to/data.csv") as my_data: line_count = 0 for line in my_data: row_list = [val.strip() for val in line.split(",")] if line_count == 0: header = row_list else: row_dict = {key: value for key, value in zip(header, row_list)} csv_mapping_list.append(row_dict) line_count += 1# CSV reader solutionimport csvcsv_mapping_list = []with open("/path/to/data.csv") as my_data: csv_reader = csv.reader(my_data, delimiter=",") line_count = 0 for line in csv_reader: if line_count == 0: header = line else: row_dict = {key: value for key, value in zip(header, line)} csv_mapping_list.append(row_dict) line_count += 1# CSV DictReader solutionimport csvwith open("/path/to/dict.csv") as my_data: csv_mapping_list = list(csv.DictReader(my_data))
Up next, I’m looking to write an article onhow to sort a list of dictionaries, so look out for that! Until then, thanks for sticking around. Hopefully, this article was helpful. If so, why not give it a share? Even better,become a member of The Renegade Coder community, so we can stay in touch!
If you’re not ready to leave just yet, I have plenty ofPython content for you:
- How I Almost Built a Twitter Bot
- How I Automated My Grading Responsibilities
- How to Automate Your GitHub Wiki
Once again, thanks for sticking around!
Jeremy grew up in a small town where he enjoyed playing soccer and video games, practicing taekwondo, and trading Pokémon cards. Once out of the nest, he pursued a Bachelors in Computer Engineering with a minor in Game Design. After college, he spent about two years writing software for a major engineering company. Then, he earned a master's in Computer Science and Engineering. Most recently, he earned a PhD in Engineering Education and now works as a Lecturer. In his spare time, Jeremy enjoys spending time with his wife and kid, playing Overwatch 2, Lethal Company, and Baldur's Gate 3, reading manga, watching Penguins hockey, and traveling the world.
Recent Posts
I wanted to take a moment to just talk about how I'm doing, which is really well actually.
The Worst Use Cases for Generative AI That Are Already Mainstream
This article looks at several use cases of generative AI that I find dystopian at best, and I want to talk about them.