In many Excel files, it’s common to have multiple sheets containing different sets of data. Being able to read in multiple sheets and consolidate the data into a single data structure can be highly valuable for analysis and processing. In this section, we will explore how to achieve this using theopenpyxl library and acustom function.
The importance of reading multiple sheets
When working with complex Excel files, it’s not uncommon to encounter scenarios where related data is spread across different sheets. For example, you may have one sheet for sales data, another for customer information, and yet another for product inventory. By reading in multiple sheets and consolidating the data, you can gain a holistic view and perform acomprehensive analysis.
Let’s start by examining the basic steps involved in reading inmultiple sheets:
- Load the workbook: Before accessingthe sheets, we need to load the Excel workbook using the
load_workbook function providedbyopenpyxl. - Get the sheet names: We can obtain the names of all the sheets in the workbook using the
sheetnames attribute. This allows us to identify the sheets we wantto read. - Read data from each sheet: By iterating over the sheet names, we can access each sheet individually and read the data.
Openpyxl provides methods such asiter_rows oriter_cols to traverse the cells of each sheet and retrieve thedesired data. - Store the data: To consolidate the data from multiple sheets, we can use a suitable data structure, such as a
pandasDataFrame or a Python list. As we read the data from each sheet, we concatenate or merge it into the consolidateddata structure:- If the data in all sheets follows the same format (as is the case in the example used in this chapter), we can simply concatenatethe datasets
- However, if the datasets have different structures because they describe different aspects of a dataset (for example, one sheet contains product information, the next contains customer data, and the third contains the sales of the products to the customers), then we can merge these datasets based on unique identifiers to create acomprehensive dataset
Using openpyxl to access sheets
openpyxl is a powerful library that allows us to interact with Excel files using Python. It provides a wide range of functionalities, including accessing and manipulating multiple sheets. Before we dive into the details, let’s take a moment to understand whyopenpyxl is a popular choice forthis task.
One of the primary advantages ofopenpyxl is its ability to handle various Excel file formats, such as.xlsx and.xlsm. This flexibility allows us to work with different versions of Excel files without compatibility issues. Additionally,openpyxl provides a straightforward and intuitive interface to access sheet data, making it easier for us to retrieve thedesired information.
Reading data from each sheet
To begin reading in multiple sheets, we need to load the Excel workbook using theload_workbook function provided byopenpyxl. This function takes the file path as input and returns a workbook object that represents the entireExcel file.
Once we have loaded the workbook, we can retrieve the names of all the sheets using the sheetnames attribute. This gives us a list of sheet names present in the Excel file. We can then iterate over these sheet names to read the data from eachsheet individually.
Retrieving sheet data with openpyxl
openpyxl provides variousmethods to access the data withina sheet.
Two commonly used methods areiter_rows anditer_cols. These methods allow us to iterate over the rows or columns of a sheet and retrieve thecell values.
Let’s have a look at howiter_rows canbe used:
# Assuming you are working with the first sheetsheet = wb['versicolor']# Iterate over rows and print raw valuesfor row in sheet.iter_rows(min_row=1, max_row=5, values_only=True): print(row)
Similarly,iter_cols can be usedlike this:
# Iterate over columns and print raw valuesfor column in sheet.iter_cols(min_col=1, max_col=5, values_only=True): print(column)
When usingiter_rows oriter_cols, we can specify whether we want to retrieve the cell values as raw values or as formatted values. Raw values give us the actual data stored in the cells, while formatted values include any formatting applied to the cells, such as date formatting ornumber formatting.
By iterating over the rows or columns of a sheet, we can retrieve the desired data and store it in a suitable data structure. One popular choice is to usepandasDataFrame, which provide a tabular representation of the data and offer convenient methods for data manipulationand analysis.
An alternative solution is using thevalues attribute of the sheet object. This provides a generator for all values in the sheet (much likeiter_rows anditer_cols do for rows and columns, respectively). While generators cannot be used directly to access the data, they can be used infor cycles to iterate over each value. Thepandas library’sDataFrame function also allows direct conversion from a suitable generator object toaDataFrame.
Combining data from multiple sheets
As we read the data from each sheet, we can store it in a list or dictionary, depending on our needs. Once we have retrieved the data from all the sheets, we can combine it into a single consolidated data structure. This step is crucial for further analysisand processing.
To combine the data, we can usepandasDataFrame. By creating individualDataFrame for each sheet’s data and then concatenating or merging them into a singleDataFrame, we can obtain a comprehensive dataset that includes all the information frommultiple sheets.
Custom function for reading multiple sheets
To simplify the process of reading in multiple sheetsand consolidating the data, we can create custom functions tailored to our specific requirements. These functions encapsulate the necessary steps and allow us to reuse thecode efficiently.
In our example, we define a function calledread_multiple_sheets that takes the file path as input. Inside the function, we load the workbook usingload_workbook and iterate over the sheet names retrieved with thesheets attribute.
For each sheet, we access it using the workbook object and retrieve the data using the custom read_single_sheet function. We then store the retrieved data in a list. Finally, we combine the data from all the sheets into a singlepandasDataFrame using the appropriate concatenation methodfrompandas.
By using these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset that’s ready for analysis. The function provides a reusable and efficient solution, saving us time and effort in dealing with complexExcel files.
Customizing the code
The provided example is a starting point that you can customize based on your specific requirements. Here are a few considerations for customizingthe code:
- Filtering columns: If you only need specific columns from each sheet, you can modify the code to extract only the desired columns during the data retrieval step. You can do this by using the
iter_cols method instead of thevalues attribute and using a filtered list in afor cycle or by filtering the resultingpandasDataFrame object(s). - Handling missing data: If the sheets contain missing data, you can incorporate appropriate handling techniques, such as filling in missing values or excludingincomplete rows.
- Applying transformations: Depending on the nature of your data, you might need to apply transformations or calculations to the consolidated dataset. The custom function can be expanded to accommodatethese transformations.
Remember, the goal is to tailor the code to suit your unique needs and ensure it aligns with your dataprocessing requirements.
By leveraging the powerofopenpyxl and creating custom functions, you can efficiently read in multiple sheets from Excel files, consolidate the data, and prepare it for further analysis. This capability enables you to unlock valuable insights from complex Excel files and leverage the full potential ofyour data.
Now, let’s dive into an example that demonstratesthis process:
from openpyxl import load_workbookimport pandas as pddef read_single_sheet(workbook, sheet_name): # Load the sheet from the workbook sheet = workbook[sheet_name] # Read out the raaw data including headers sheet_data_raw = sheet.values # Separate the headers into a variable columns = next(sheet_data_raw)[0:] # Create a DataFrame based on the second and subsequent lines of data with the header as column names and return it return pd.DataFrame(sheet_data_raw, columns=columns)def read_multiple_sheets(file_path): # Load the workbook workbook = load_workbook(file_path) # Get a list of all sheet names in the workbook sheet_names = workbook.sheetnames # Cycle through the sheet names, load the data for each and concatenate them into a single DataFrame return pd.concat([read_single_sheet(workbook=workbook, sheet_name=sheet_name) for sheet_name in sheet_names], ignore_index=True)# Define the file path and sheet namesfile_path = 'iris_data.xlsx' # adjust the path as needed# Read the data from multiple sheetsconsolidated_data = read_multiple_sheets(file_path)# Display the consolidated dataprint(consolidated_data.head())
Let’s have a look atthe results:
Figure 1.8 – Using the openxlsx package to read in the Excel file
In the preceding code, we definetwo functions:
read_single_sheet: This reads the data from a single sheet into apandasDataFrameread_multiple_sheets: This reads and concatenates the data from all sheets inthe workbook
Within theread_multiple_sheets function, we load the workbook usingload_workbook and iterate over the sheet names. For each sheet, we retrieve the data using theread_single_sheet helper function, which reads the data from a sheet and creates apandasDataFrame for each sheet’s data, with the header row used as column names. Finally, we usepd.concat to combine all theDataFrame into a singleconsolidatedDataFrame.
By utilizing these custom functions, we can easily read in multiple sheets from an Excel file and obtain a consolidated dataset. This allows us to perform various data manipulations, analyses, or visualizations on thecombined data.
Understanding how to handle multiple sheets efficiently enhances our ability to work with complex Excel files and extract valuable insights fromdiverse datasets.