Python ElementTree Tutorial

XML is a markup language, like HTML, that is used to describe data. XML documents are commonly used to share structured data over the internet. In this tutorial, we will cover the basics of Python's ElementTree library, which can be used for reading, creating and writing XML documents.

Python's standard library has historically shipped two versions of ElementTree: a pure Python implementation (xml.etree.ElementTree) and one implemented in C for performance (xml.etree.cElementTree). The C version is preferable wherever it is available. Since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically, and the cElementTree module was removed in Python 3.9. The import below uses the C version of the library if it is available and otherwise falls back to the standard module.


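try:
    # C implementation (deprecated since Python 3.3, removed in Python 3.9)
    import xml.etree.cElementTree as ET
except ImportError:
    # Standard module; on Python 3.3+ it picks up the C accelerator by itself
    import xml.etree.ElementTree as ET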

The ElementTree library has two main classes to represent an XML document in memory:

ElementTree, which represents the complete document as a tree
Element, which represents a single node of the XML document

Reading the XML Data

XML data can be read either from a file or from a string. ET.fromstring parses the XML stored in a string and returns an Element instance (the root of the tree).

Reading the data from a string:

xml_string = """<?xml version="1.0"?>
<catalog>
   <book id="BID101">
        <title>Introducing Python</title>
        <author>Bill Lubanovic</author>
        <publisher>SHR Off Publishers</publisher>
        <category>Python</category>
   </book>
   <book id="BID102">
        <title>Learning Perl</title>
        <author>Randal L. Schwartz</author>
        <publisher>SHR Off Publishers</publisher>
        <category>Perl</category>
   </book>
</catalog>"""

root = ET.fromstring(xml_string)
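
As a quick check of what fromstring returns, here is a minimal sketch. It uses a shortened copy of the catalog so that it runs on its own; the variable names root_tag, book_count and first_id are only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog used in this tutorial.
xml_string = """<?xml version="1.0"?>
<catalog>
   <book id="BID101"><title>Introducing Python</title></book>
   <book id="BID102"><title>Learning Perl</title></book>
</catalog>"""

root = ET.fromstring(xml_string)

root_tag = root.tag            # tag name of the root element
book_count = len(root)         # number of direct children
first_id = root[0].get("id")   # children can be indexed like a list
print(root_tag, book_count, first_id)
```

The returned object is an Element, so it can be indexed and measured with len() like a list of its children.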

Reading the data from a file:

To read the data from an XML file, use the ET.parse function, which takes the file name as its argument. ET.parse parses the XML document and returns an ElementTree instance.

tree = ET.parse('books.xml')
root = tree.getroot() # Find the root of the tree using getroot function

Iterating over the XML Nodes

Once the XML data has been read into memory, we can iterate over the XML nodes. There are a couple of ways to do this. A common one is a for loop over the root node (or any Element instance), which iterates over all the immediate children of that element.

Each node (Element) has certain properties, which can be accessed using the following attributes:

Element.tag returns a string identifying what kind of data this element represents.
Element.text returns the text associated with the element.
Element.attrib returns a dictionary holding the element's attribute names and values.

Let's now try to print these for our sample XML:

# Loop over the direct children of the root element
for node in root:
    print("Node Tag : " + node.tag)
    print("Node Text : " + node.text)
    print("Node Attributes : " + str(node.attrib))

>> ...
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID101'}
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID102'}
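
One thing to watch for: Element.text is None for an element that has no text at all (for example an empty tag), so string concatenation with it would fail. The sketch below guards against that; describe is a hypothetical helper and the tiny XML snippet is made up for the example.

```python
import xml.etree.ElementTree as ET

def describe(node):
    """Return a one-line summary of an element (hypothetical helper)."""
    text = (node.text or "").strip()   # node.text is None for empty elements
    return "{0} text={1!r} attrib={2}".format(node.tag, text, node.attrib)

root = ET.fromstring("<catalog><book id='BID101'/><note>hi</note></catalog>")
lines = [describe(child) for child in root]
for line in lines:
    print(line)
```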

Note that iterating over the root element instance only visits the direct children of the root tag. If we want to iterate over all of the nested children as well, we can use the Element.iter() method, which also includes the element it is called on.

# Walk the whole tree, starting with the root element itself
for node in root.iter():
    print("Node Tag : " + node.tag)
    print("Node Text : " + node.text)
    print("Node Attributes : " + str(node.attrib))

>> ...
Node Tag : catalog
Node Text :
Node Attributes : {}
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID101'}
Node Tag : title
Node Text : Introducing Python
Node Attributes : {}
Node Tag : author
Node Text : Bill Lubanovic
Node Attributes : {}
Node Tag : publisher
Node Text : SHR Off Publishers
Node Attributes : {}
Node Tag : category
Node Text : Python
Node Attributes : {}
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID102'}
Node Tag : title
Node Text : Learning Perl
Node Attributes : {}
Node Tag : author
Node Text : Randal L. Schwartz
Node Attributes : {}
Node Tag : publisher
Node Text : SHR Off Publishers
Node Attributes : {}
Node Tag : category
Node Text : Perl
Node Attributes : {}

Just like Element.iter, ElementTree also has an iter method, which simply runs iter on the root node.

Element.tail Attribute:

The Element.tail attribute is worth mentioning here. It holds the text that follows an element's closing tag, up to the next tag.
For example, to extract the text "Picture of a cat" from the XML string <item><img src="cat.jpg" /> Picture of a cat</item>, we can use the Element.tail attribute on the img tag.
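
To illustrate, a minimal sketch (with a made-up one-book document) showing that both calls visit the same elements in the same order:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<catalog><book><title>A</title></book></catalog>")
tree = ET.ElementTree(root)

# Both iterators start at the root and walk the tree in document order.
tags_from_tree = [elem.tag for elem in tree.iter()]
tags_from_root = [elem.tag for elem in root.iter()]
print(tags_from_tree)
```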
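
The example above can be sketched as follows. The strip() call is an addition of mine to drop the leading space that is part of the tail text.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring('<item><img src="cat.jpg" /> Picture of a cat</item>')
img = item.find("img")

# The text after <img ... /> belongs to img.tail, not to item.text.
caption = img.tail.strip()
print(caption)
print(item.text)   # None, because no text comes before the img tag
```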

Creating and Writing the XML data

The ElementTree library provides classes and functions to create an XML tree, along with methods to write that tree to a file.
To create the tree, we create Elements and append them to the root of the tree. For example, to add a new book entry to our catalog, we can do the following:

# Create the book tag
bookelement = ET.Element("book")
# Add the id attribute to the tag
bookelement.set('id', 'BID103')
# Add a sub-element to the book tag
subelement = ET.SubElement(bookelement, "title")
# Text for the sub-element
subelement.text = "Practical Programming in Tcl and Tk"
# Add a sub-element to the book tag
subelement = ET.SubElement(bookelement, "author")
# Text for the sub-element
subelement.text = "Brent Welch"
# Add a sub-element to the book tag
subelement = ET.SubElement(bookelement, "publisher")
# Text for the sub-element
subelement.text = "Prentice Hall"
# Add a sub-element to the book tag
subelement = ET.SubElement(bookelement, "category")
# Text for the sub-element
subelement.text = "Tcl"
# Add the book element to the end of our tree
root.append(bookelement)
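
When many entries have to be added, the repeated SubElement calls can be folded into a small function. The following is only a sketch: make_book is a hypothetical helper, not part of ElementTree, and it assumes every child of a book is a plain tag with text.

```python
import xml.etree.ElementTree as ET

def make_book(book_id, **fields):
    """Build a <book> element (hypothetical helper, not an ElementTree API)."""
    book = ET.Element("book", id=book_id)
    for tag, text in fields.items():
        # One child element per keyword argument, e.g. title="..."
        ET.SubElement(book, tag).text = text
    return book

root = ET.Element("catalog")
root.append(make_book("BID103",
                      title="Practical Programming in Tcl and Tk",
                      author="Brent Welch"))

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

ET.tostring is used here just to inspect the result as a string.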

Besides append, the Element object also supports the insert and remove methods. insert places a node at a particular index, while remove takes the Element to be removed as its argument.

Once we have added the new element to our XML tree, we can write the tree to an XML file or print it on standard output. ElementTree provides a write method that can be used for both.

import sys

tree = ET.ElementTree(root) # Generate the tree from the root element
# Dump the XML tree on stdout (on Python 3, encoding="unicode" makes write emit text rather than bytes)
tree.write(sys.stdout, encoding="unicode")

# To write the XML tree to a file, pass the file name to the write method
tree.write("xml-file.xml")
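
The insert and remove methods mentioned above can be sketched like this. The two-book catalog and the id BID100 are made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<catalog><book id='BID101'/><book id='BID102'/></catalog>")

new_book = ET.Element("book", id="BID100")
root.insert(0, new_book)       # place the new element at index 0

stale = root.find("book[@id='BID102']")
root.remove(stale)             # remove() takes the Element itself, not an index

ids = [book.get("id") for book in root]
print(ids)
```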

Reading Large XML Files:

Reading a complete XML file into memory is inefficient for large files because of the memory consumed. In such cases, the file can be read incrementally using ElementTree's iterparse function. Note that iterparse still builds up the complete tree in memory as it goes, so it is up to the program to process the desired tags and free the memory once they have been processed.

context = ET.iterparse('books.xml', events=("start", "end"))
context = iter(context)
# Get the root element (it arrives with the first "start" event)
event, root = next(context)
for event, elem in context:
    if event == "end" and elem.tag == "author":
        print(elem.text)
        # Drop everything attached to the root so far to free memory
        root.clear()

In the above example, we are only interested in the author tag. So each time an author tag has been read, we clear the rest of the tree from memory. If you check the contents of root at the end, it will be empty. Had we not used root.clear(), we would again be left with the complete tree at the end.
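
A common variation, shown here only as a sketch, is to wait for the "end" event of each record and clear that record once it has been handled. The example assumes that each book is one record, and it uses an in-memory io.BytesIO object in place of a large file.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for a large XML file.
data = io.BytesIO(
    b"<catalog>"
    b"<book><author>Bill Lubanovic</author></book>"
    b"<book><author>Randal L. Schwartz</author></book>"
    b"</catalog>"
)

authors = []
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "book":
        authors.append(elem.findtext("author"))
        elem.clear()   # drop the children and text of the finished book

print(authors)
```

Note that the emptied book elements themselves stay attached to the root with this approach, so for very large files clearing the root as well is the more thorough option.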

Searching for XML Elements:

The ElementTree library provides several methods for finding elements of interest, such as find, findall, findtext and iter. The find methods accept path expressions written in a limited subset of XPath.

# XPath queries using findall
root = ET.fromstring(xml_string)
print(root.findall("*/publisher"))
print(root.findall("book/category"))

# Iterate over interesting elements only
tree = ET.ElementTree(root)
for elem in tree.iter(tag='category'):
    print(elem.tag, elem.attrib, elem.text)
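
A few more search calls, sketched on a shortened copy of the catalog: find returns the first match (or None), findtext returns the text of the first match, and an attribute predicate such as [@id='...'] narrows the search.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<catalog>"
    "<book id='BID101'><title>Introducing Python</title><category>Python</category></book>"
    "<book id='BID102'><title>Learning Perl</title><category>Perl</category></book>"
    "</catalog>"
)

first_book = root.find("book")                           # first matching child
perl_title = root.findtext("book[@id='BID102']/title")   # text of the first match
categories = [c.text for c in root.findall("./book/category")]
missing = root.find("magazine")                          # None when nothing matches

print(first_book.get("id"), perl_title, categories, missing)
```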

Sarbjit Singh
