
Python Pandas - Parsing XML File
XML stands for "Extensible Markup Language" and it is similar to HTML in structure but serves a different purpose. XML is widely used as a data representation format due to its flexibility.
The Pandas library provides tools for reading and writing XML data. Its read_xml() function parses XML strings, files, or URLs directly into a Pandas DataFrame.
In this tutorial, we will learn how to parse XML data using Pandas, including reading XML data, handling nested structures, extracting attributes, and customizing column names.
The Pandas read_xml() Function
The read_xml() function reads XML data from a string, file, or URL and converts it into a Pandas DataFrame. It supports several parameters for handling complex XML structures and attributes.
Following is the syntax of the read_xml() function −
pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, ...)
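Key parameters:
- path_or_buffer: Accepts an XML string, file path, URL, or file-like object containing the XML data. In recent pandas versions (2.1 and later), passing a literal XML string directly is deprecated, so wrap it in io.StringIO as the examples in this tutorial do.
- xpath: Specifies the XPath expression that selects the nodes to parse. Default is './*'.
- namespaces: Dictionary mapping namespace prefixes to namespace URIs, needed when the XPath expression uses prefixed names.
- elems_only: If True, parses only the child elements of the selected nodes and ignores their attributes.
- attrs_only: If True, parses only the attributes of the selected nodes and ignores their child elements.
- iterparse: Dictionary that names a repeating element and the descendants or attributes to collect from it. Intended for memory-efficient parsing of very large XML files.
- parser: Specifies the parser ('lxml' or 'etree'). Default is 'lxml'.
You can get more details about this method from the following tutorial: pandas.read_xml().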
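The iterparse parameter is the least self-explanatory entry in the list above. The following is a minimal sketch, assuming pandas 1.5 or later and a hypothetical file named books.xml written to a temporary directory, because iterparse expects an uncompressed file on local disk. It passes parser="etree" only to avoid depending on lxml. The dictionary key names the repeating element, and the list names the attributes or descendants to collect from it.

```python
import os
import tempfile

import pandas as pd

# Small sample document; a real use case would be a very large file
xml = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # books.xml is a hypothetical file name used only for this sketch
    path = os.path.join(tmp_dir, "books.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml)

    # Key: the repeating element; value: attributes/descendants to keep
    df_iter = pd.read_xml(
        path,
        iterparse={"book": ["category", "title", "year"]},
        parser="etree",
    )

print(df_iter)
```

When iterparse is given, it takes the place of the xpath-based selection, so the two are not combined.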
Reading an XML String
The pandas.read_xml() function reads XML data directly into a Pandas DataFrame and can handle various XML structures.
Example
This example demonstrates how to parse an XML string representing contact information into a Pandas DataFrame. Each child node of <contact-info> (here <contact1> and <contact2>) contains the elements name, company, and phone. Using the pandas.read_xml() function, we extract this data into a DataFrame, one row per contact node.

    from io import StringIO
    import pandas as pd

    # Create a string representing XML data
    xml = """<contact-info>
       <contact1>
          <name>Tanmay</name>
          <company>TutorialsPoint</company>
          <phone>(011) 123-4567</phone>
       </contact1>
       <contact2>
          <name>Manisha</name>
          <company>TutorialsPoint</company>
          <phone>(011) 789-4567</phone>
       </contact2>
    </contact-info>"""

    # Parse the XML string into a DataFrame
    df = pd.read_xml(StringIO(xml))
    print(df)

Following is the output of the above code −
| | name | company | phone |
|---|---|---|---|
| 0 | Tanmay | TutorialsPoint | (011) 123-4567 |
| 1 | Manisha | TutorialsPoint | (011) 789-4567 |
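Since read_xml() also accepts a file path, similar data can be read from disk. The following is a minimal sketch, assuming a hypothetical file named contacts.xml that is created in a temporary directory just for the demonstration. It passes parser="etree" so that it runs without lxml installed.

```python
import os
import tempfile

import pandas as pd

xml = """<contact-info>
  <contact>
    <name>Tanmay</name>
    <company>TutorialsPoint</company>
    <phone>(011) 123-4567</phone>
  </contact>
  <contact>
    <name>Manisha</name>
    <company>TutorialsPoint</company>
    <phone>(011) 789-4567</phone>
  </contact>
</contact-info>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # contacts.xml is a hypothetical file name used only for this sketch
    path = os.path.join(tmp_dir, "contacts.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(xml)

    # Pass the path instead of a StringIO buffer
    df_file = pd.read_xml(path, parser="etree")

print(df_file)
```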
Parsing Nested XML Structures
For deeply nested or complex XML files, you can use the xpath and namespaces parameters to extract specific nodes.
Example
This example shows how to parse a nested XML structure representing a bookstore. Each <book> node has child elements such as title, author, year, and price. By using the xpath parameter, we can locate the <book> nodes and extract their contents into a DataFrame.

    import pandas as pd
    from io import StringIO

    # Create a string representing XML data
    xml = """<?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
      <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>"""

    # Parse the XML data into a DataFrame
    df = pd.read_xml(StringIO(xml), xpath=".//book")

    # Display the output DataFrame
    print('Output DataFrame from XML:')
    print(df)

Following is the output of the above code −
Output DataFrame from XML:
| | category | title | author | year | price |
|---|---|---|---|---|---|
| 0 | cooking | Everyday Italian | Giada De Laurentiis | 2005 | 30.00 |
| 1 | children | Harry Potter | J K. Rowling | 2005 | 29.99 |
| 2 | web | Learning XML | Erik T. Ray | 2003 | 39.95 |
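The bookstore example uses only xpath. When element names carry a namespace, the namespaces parameter is needed as well. The following is a minimal sketch, assuming a made-up namespace URI (http://example.com/books) and a shortened version of the bookstore data. The prefix used in the XPath expression (b) is an arbitrary choice and does not have to match the prefix in the document (bk), since prefixes are resolved through their URI. It passes parser="etree" so that it runs without lxml installed.

```python
from io import StringIO

import pandas as pd

# The namespace URI below is a placeholder invented for this sketch
xml = """<bk:bookstore xmlns:bk="http://example.com/books">
  <bk:book>
    <bk:title>Everyday Italian</bk:title>
    <bk:year>2005</bk:year>
  </bk:book>
  <bk:book>
    <bk:title>Learning XML</bk:title>
    <bk:year>2003</bk:year>
  </bk:book>
</bk:bookstore>"""

# Map a prefix of our choosing to the namespace URI used in the document
df_ns = pd.read_xml(
    StringIO(xml),
    xpath=".//b:book",
    namespaces={"b": "http://example.com/books"},
    parser="etree",
)

print(df_ns)
```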
Reading XML Attributes Only
To extract only the attributes of the selected XML nodes, set the attrs_only parameter of the pandas.read_xml() function to True.
Example
The following example demonstrates parsing only the attributes of the XML data into a DataFrame using the attrs_only parameter.

    import pandas as pd
    from io import StringIO

    # Create a string representing XML data
    xml = """<?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
      <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>"""

    # Parse only the attributes of each <book> node
    df = pd.read_xml(StringIO(xml), attrs_only=True)
    print(df)

Following is the output of the above code −
| | category |
|---|---|
| 0 | cooking |
| 1 | children |
| 2 | web |
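The elems_only parameter is the counterpart of attrs_only: it keeps the child elements and drops the attributes. The following is a minimal sketch on a shortened version of the bookstore data. It passes parser="etree" so that it runs without lxml installed.

```python
from io import StringIO

import pandas as pd

xml = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

# Keep only child elements; the category attribute is ignored
df_elems = pd.read_xml(StringIO(xml), elems_only=True, parser="etree")

print(df_elems)
```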
Customizing Column Names While Reading XML
The names parameter of the Pandas read_xml() function allows you to customize the column names during parsing. This is useful for files with generic or duplicate element names.
Example
The following example demonstrates how to customize column names while parsing XML data using Pandas.

    from io import StringIO
    import pandas as pd

    # Create a string representing XML data
    xml = """<?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
      <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>"""

    # Parse the XML string and assign custom column names
    df = pd.read_xml(StringIO(xml), names=['Book_Category', 'Book_Name', 'Author', 'Published_year', 'Price'])
    print(df)

Following is the output of the above code −
| | Book_Category | Book_Name | Author | Published_year | Price |
|---|---|---|---|---|---|
| 0 | cooking | Everyday Italian | Giada De Laurentiis | 2005 | 30.00 |
| 1 | children | Harry Potter | J K. Rowling | 2005 | 29.99 |
| 2 | web | Learning XML | Erik T. Ray | 2003 | 39.95 |
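The names list appears to be matched to the parsed columns by position, so every column has to be listed in order. When only a few columns need new labels, one alternative (not part of read_xml() itself) is to rename them after parsing with DataFrame.rename(). The following is a minimal sketch on a shortened version of the bookstore data. It passes parser="etree" so that it runs without lxml installed.

```python
from io import StringIO

import pandas as pd

xml = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

df_books = pd.read_xml(StringIO(xml), parser="etree")

# Rename selected columns by their original names; others stay unchanged
df_renamed = df_books.rename(
    columns={"title": "Book_Name", "year": "Published_year"}
)

print(df_renamed)
```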