Python Pandas - Parsing XML File

XML stands for "Extensible Markup Language" and it is similar to HTML in structure but serves a different purpose. XML is widely used as a data representation format due to its flexibility.

The Pandas library provides tools for reading and writing XML data. The read_xml() function of the Pandas library allows you to parse XML strings, files, or URLs directly into a Pandas DataFrame.

In this tutorial, we will learn how to parse XML data using Pandas, including reading XML data, handling nested structures, extracting attributes, and customizing column names.

The Pandas read_xml() Function

The read_xml() function reads XML data from a string, file, or URL and converts it into a Pandas DataFrame. This function supports various parameters for handling complex XML structures and attributes.

Following is the syntax of the read_xml() function −

pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, ...)

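Key Parameters

path_or_buffer: Accepts an XML string, file path, URL, or file-like object containing the XML data. Passing a literal XML string is deprecated as of pandas 2.1, so wrap the string in StringIO, as the examples in this tutorial do.
xpath: Specifies the XPath expression that selects the nodes to parse into rows. Default is './*'.
namespaces: Dictionary of namespace prefixes and URIs used in the XML file.
elems_only: If True, parses only the child elements of the selected nodes.
attrs_only: If True, parses only the attributes of the selected nodes.
iterparse: A dictionary mapping the repeating element to the list of its descendant elements and attributes to read. Used for memory-efficient parsing of very large XML files stored on local disk.
parser: Specifies the parser (lxml or etree). Default is 'lxml'.

For more details about this method, see the pandas.read_xml() tutorial.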
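The iterparse parameter described above is not demonstrated elsewhere in this tutorial, so here is a minimal sketch of how it might be used. It assumes pandas 1.5 or later (where iterparse is available), writes a small made-up bookstore document to a temporary file because iterparse expects a path on local disk, and uses parser="etree" only so that the sketch does not depend on lxml being installed.

```python
import os
import tempfile

import pandas as pd

# Small sample document (illustrative data, not a real large file)
xml = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

# iterparse works on files on local disk, so save the sample to a temporary file
with tempfile.NamedTemporaryFile(
    "w", suffix=".xml", delete=False, encoding="utf-8"
) as f:
    f.write(xml)
    path = f.name

try:
    # Key: the repeating element; value: the attributes and descendants to read
    df = pd.read_xml(
        path,
        iterparse={"book": ["category", "title", "year"]},
        parser="etree",
    )
finally:
    os.remove(path)

print(df)
```

Each <book> element becomes one row, and only the names listed in the dictionary are read, so the whole document does not have to be held in memory at once.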

Reading an XML String

The pandas.read_xml() function reads XML data directly into a Pandas DataFrame. The function can handle various XML structures.

Example

This example demonstrates how to parse an XML string representing contact information into a Pandas DataFrame. Each child node of <contact-info> (here <contact1> and <contact2>) contains elements such as name, company, and phone. Using the pandas.read_xml() function with its default xpath of './*', we extract each child node as one row of the DataFrame.

from io import StringIO
import pandas as pd

# Create a string representing XML data
xml = """<contact-info>
   <contact1>
      <name>Tanmay </name>
      <company>TutorialsPoint</company>
      <phone>(011) 123-4567</phone>
   </contact1>
   <contact2>
      <name>Manisha </name>
      <company>TutorialsPoint</company>
      <phone>(011) 789-4567</phone>
   </contact2>
</contact-info>"""

# Parse the XML string into a DataFrame
df = pd.read_xml(StringIO(xml))
print(df)

Following is the output of the above code −

	name	company	phone
0	Tanmay	TutorialsPoint	(011) 123-4567
1	Manisha	TutorialsPoint	(011) 789-4567

Parsing Nested XML Structures

For deeply nested or complex XML files, you can use the xpath and namespaces parameters to extract specific nodes.

Example

This example shows how to parse a nested XML structure representing a bookstore. Each <book> node has child elements like title, author, year, and price. By using the xpath parameter, we can locate these <book> nodes and extract their contents into a DataFrame.

import pandas as pd
from io import StringIO

# Create a string representing XML data
xml = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""

# Parse the XML data into a DataFrame
df = pd.read_xml(StringIO(xml), xpath=".//book")

# Display the output DataFrame
print('Output DataFrame from XML:')
print(df)

Following is the output of the above code −

Output DataFrame from XML:

	category	title	author	year	price
0	cooking	Everyday Italian	Giada De Laurentiis	2005	30.00
1	children	Harry Potter	J K. Rowling	2005	29.99
2	web	Learning XML	Erik T. Ray	2003	39.95
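The namespaces parameter mentioned at the start of this section comes into play when element names carry a prefix. The following is a minimal sketch under stated assumptions: the bk prefix and the http://example.com/books URI are made up for illustration, and parser="etree" is used only so that the sketch runs without lxml.

```python
from io import StringIO

import pandas as pd

# Sample document whose elements live in a (hypothetical) namespace
xml = """<bk:bookstore xmlns:bk="http://example.com/books">
  <bk:book>
    <bk:title>Everyday Italian</bk:title>
    <bk:year>2005</bk:year>
  </bk:book>
  <bk:book>
    <bk:title>Learning XML</bk:title>
    <bk:year>2003</bk:year>
  </bk:book>
</bk:bookstore>"""

# The prefix used in xpath must be declared in the namespaces dictionary
df = pd.read_xml(
    StringIO(xml),
    xpath=".//bk:book",
    namespaces={"bk": "http://example.com/books"},
    parser="etree",
)
print(df)
```

The prefix in the namespaces dictionary only has to match the one used in the xpath expression; what ties it to the document is the URI.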

Reading XML Attributes Only

To extract only the attributes of the XML data, set the attrs_only parameter of the pandas.read_xml() function to True.

Example

The following example demonstrates parsing only the attributes of the XML data into a DataFrame using the attrs_only parameter.

import pandas as pd
from io import StringIO

# Create a string representing XML data
xml = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""

# Parse only the attributes of each <book> node
df = pd.read_xml(StringIO(xml), attrs_only=True)
print(df)

Following is the output of the above code −

   category
0   cooking
1  children
2       web
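The counterpart of attrs_only is the elems_only parameter from the parameter list, which keeps the child elements and drops the attributes. Here is a minimal sketch of that behaviour, using a shortened version of the bookstore data and parser="etree" (chosen only so the sketch does not require lxml).

```python
from io import StringIO

import pandas as pd

# Shortened bookstore sample: each <book> has one attribute and two child elements
xml = """<bookstore>
  <book category="cooking">
    <title>Everyday Italian</title>
    <year>2005</year>
  </book>
  <book category="web">
    <title>Learning XML</title>
    <year>2003</year>
  </book>
</bookstore>"""

# Keep only the child elements of each <book>; the category attribute is skipped
df = pd.read_xml(StringIO(xml), xpath="./book", elems_only=True, parser="etree")
print(df)
```

Note that elems_only and attrs_only cannot both be True in the same call.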

Customizing Column Names While Reading XML

The names parameter of the Pandas read_xml() function allows you to customize the column names during the parsing process. This is useful for large files with generic or duplicate element names.

Example

The following example demonstrates how to customize column names while parsing XML data using Pandas.

from io import StringIO
import pandas as pd

# Create a string representing XML data
xml = """<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>"""

# Parse the XML data, assigning custom column names
df = pd.read_xml(StringIO(xml), names=['Book_Category', 'Book_Name', 'Author', 'Published_year', 'Price'])
print(df)

Following is the output of the above code −

	Book_Category	Book_Name	Author	Published_year	Price
0	cooking	Everyday Italian	Giada De Laurentiis	2005	30.00
1	children	Harry Potter	J K. Rowling	2005	29.99
2	web	Learning XML	Erik T. Ray	2003	39.95
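The duplicate-element case mentioned above can be sketched as follows. This is an illustration under assumptions: the <reading> and <value> element names and the column names first and second are invented for the example, and parser="etree" is used only to avoid a dependency on lxml.

```python
from io import StringIO

import pandas as pd

# Each <reading> repeats the same child tag, so the tags alone cannot
# serve as distinct column names
xml = """<readings>
  <reading>
    <value>1</value>
    <value>2</value>
  </reading>
  <reading>
    <value>3</value>
    <value>4</value>
  </reading>
</readings>"""

# names assigns one column name per child element, in document order
df = pd.read_xml(
    StringIO(xml),
    elems_only=True,
    names=["first", "second"],
    parser="etree",
)
print(df)
```

Because the names are applied positionally, their order should follow the order in which the child elements appear in each row node.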
