XML is a markup language, like HTML, that is used to describe data. XML documents are commonly used to exchange structured data over the internet. In this tutorial, we will cover the basics of Python's ElementTree library, which can be used for parsing, creating and writing XML documents.
Python's standard library has shipped two implementations of ElementTree: a pure-Python one and a faster one written in C (cElementTree). The C version is preferable whenever it is available; where it is not, the pure-Python module can be used instead. The import below uses the C version if it can be imported and otherwise falls back to the pure-Python module. On Python 3.3 and later, xml.etree.ElementTree already uses the C accelerator automatically, and cElementTree was removed in Python 3.9, so on those newer versions the import falls through to the except branch.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
The ElementTree library has two main classes to represent an XML document in memory:
- ElementTree, which represents the complete document as a tree
- Element, which represents a single node of the XML document
XML data can be read either from a file or from a string. ET.fromstring parses the XML stored in a string and returns an Element instance (the root of the tree).
Reading the data from a string:
xml_string = """<?xml version="1.0"?>
<catalog>
    <book id="BID101">
        <title>Introducing Python</title>
        <author>Bill Lubanovic</author>
        <publisher>SHR Off Publishers</publisher>
        <category>Python</category>
    </book>
    <book id="BID102">
        <title>Learning Perl</title>
        <author>Randal L. Schwartz</author>
        <publisher>SHR Off Publishers</publisher>
        <category>Perl</category>
    </book>
</catalog>"""

root = ET.fromstring(xml_string)
To read the data from an XML file, use the ET.parse function, which takes the file name as its argument. ET.parse parses the XML document and returns an ElementTree instance.
tree = ET.parse('books.xml')
root = tree.getroot()  # Find the root of the tree using the getroot method
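Since the contents of books.xml are not shown here, the following is only a minimal sketch of the same ET.parse call against a temporary file. The sample content, the temporary file and the variable names are assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample content; the tutorial's books.xml is not reproduced here.
sample = (
    "<catalog><book id='BID101'>"
    "<title>Introducing Python</title>"
    "</book></catalog>"
)

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(sample)
    path = handle.name

try:
    tree = ET.parse(path)         # returns an ElementTree instance
    root = tree.getroot()         # the root Element, here <catalog>
    root_tag = root.tag
    first_title = root[0][0].text  # first child of the first <book>
finally:
    os.remove(path)               # clean up the temporary file

print(root_tag, first_title)
```

Indexing an element, as in root[0], returns its children by position, which is convenient for quick checks like this one.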
Once the XML data has been read into memory, we can iterate over the XML nodes. There are a couple of ways to do this. A common one is to use a for loop over the root node or any other Element instance, which iterates over the immediate children of that element.
Each node (Element) has certain properties, which can be accessed using the following attributes:
- Element.tag returns a string identifying what kind of data this element represents.
- Element.text returns the text between the element's start tag and its first child (or its end tag), or None if there is no such text.
- Element.attrib returns a dictionary holding the element's attribute names and values.
for node in root:
    print("Node Tag : " + node.tag)
    print("Node Text : " + node.text)
    print("Node Attributes : " + str(node.attrib))

>> ...
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID101'}
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID102'}
# iter() walks the element itself and all of its descendants, in document order
for node in root.iter():
    print("Node Tag : " + node.tag)
    print("Node Text : " + node.text)
    print("Node Attributes : " + str(node.attrib))

>> ...
Node Tag : catalog
Node Text :
Node Attributes : {}
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID101'}
Node Tag : title
Node Text : Introducing Python
Node Attributes : {}
Node Tag : author
Node Text : Bill Lubanovic
Node Attributes : {}
Node Tag : publisher
Node Text : SHR Off Publishers
Node Attributes : {}
Node Tag : category
Node Text : Python
Node Attributes : {}
Node Tag : book
Node Text :
Node Attributes : {'id': 'BID102'}
Node Tag : title
Node Text : Learning Perl
Node Attributes : {}
Node Tag : author
Node Text : Randal L. Schwartz
Node Attributes : {}
Node Tag : publisher
Node Text : SHR Off Publishers
Node Attributes : {}
Node Tag : category
Node Text : Perl
Node Attributes : {}
Just like Element, ElementTree also has an iter method, which simply calls iter on the root node.
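As a small sketch of that equivalence, the snippet below walks the same tree through ElementTree.iter and through Element.iter and compares the results. The sample XML (including the empty note element) is made up for illustration. It also shows that an element without text has text set to None, so string concatenation needs a fallback value.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; <note/> is included only to show an element with no text.
root = ET.fromstring(
    "<catalog><book id='BID101'>"
    "<title>Introducing Python</title><note/>"
    "</book></catalog>"
)
tree = ET.ElementTree(root)

# ElementTree.iter() delegates to the root element, so both walks agree.
tags_from_tree = [node.tag for node in tree.iter()]
tags_from_root = [node.tag for node in root.iter()]

# Element.text is None when there is no text; "or ''" keeps concatenation safe.
texts = [(node.tag, node.text or "") for node in tree.iter()]

for tag, text in texts:
    print("Node Tag : " + tag + ", Node Text : " + text)
```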
Element.tail Attribute:
The Element.tail attribute is also worth mentioning here. It holds the text that follows an element's end tag, up to the next tag.
As an example, to extract the text "Picture of a cat" from the XML string <item><img src="cat.jpg" /> Picture of a cat</item>, we can read the tail attribute of the img element.
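The img example above can be sketched as follows. The variable names are illustrative only, and the snippet assumes the XML string is exactly the one quoted above.

```python
import xml.etree.ElementTree as ET

item = ET.fromstring('<item><img src="cat.jpg" /> Picture of a cat</item>')
img = item.find("img")

# There is no text between <item> and <img>, and <img> is empty,
# so both .text attributes are None.
print(repr(item.text))  # None
print(repr(img.text))   # None

# The text after the <img /> tag belongs to the img element's tail.
print(repr(img.tail))   # ' Picture of a cat'

caption = img.tail.strip()  # drop the leading space
print(caption)
```

Note that the text is stored on the img element rather than on item, which is why reading item.text does not find it.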
Creating and Writing the XML data
The ElementTree library provides classes and functions to build an XML tree, as well as methods to write that tree to a file.
To build the tree, we create Elements and append them to the root of the tree. For example, to add a new book entry to our catalog, we can do the following:
# Create the book tag
bookelement = ET.Element("book")
# Add the ID attribute to the tag
bookelement.set('id', 'BID103')

# Add sub-elements to the book tag and set the text of each one
subelement = ET.SubElement(bookelement, "title")
subelement.text = "Practical Programming in Tcl and Tk"
subelement = ET.SubElement(bookelement, "author")
subelement.text = "Brent Welch"
subelement = ET.SubElement(bookelement, "publisher")
subelement.text = "Prentice Hall"
subelement = ET.SubElement(bookelement, "category")
subelement.text = "Tcl"

# Add the book element to the end of our tree
root.append(bookelement)
Once we have added the new element to our XML tree, we can write the tree to an XML file or print it to standard output. ElementTree provides a write method for both cases: it accepts either a file name or a file object such as sys.stdout.
The Element object also supports the insert and remove methods. insert places a node at a particular index among the element's children, and remove takes the child Element to be removed as its argument.
import sys

tree = ET.ElementTree(root)  # Generate the tree from the root element
# Dump the XML tree on stdout; on Python 3, encoding="unicode" makes write() emit text instead of bytes
tree.write(sys.stdout, encoding="unicode")
# To write the XML tree to a file, pass the file name to the write method
tree.write("xml-file.xml")
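The insert and remove methods mentioned above could be used roughly like this. It is a minimal sketch on a cut-down catalog, and the BID100 entry is a made-up value for illustration.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<catalog><book id='BID101'/><book id='BID102'/></catalog>"
)

# insert(index, element) places the new node at the given child position.
new_book = ET.Element("book", {"id": "BID100"})  # hypothetical entry
catalog.insert(0, new_book)
ids_after_insert = [book.get("id") for book in catalog]

# remove(element) takes the Element object itself, which must be a direct child.
to_remove = catalog.find("book[@id='BID101']")
catalog.remove(to_remove)
ids_after_remove = [book.get("id") for book in catalog]

print(ids_after_insert)
print(ids_after_remove)
```

Because remove compares against direct children only, removing a deeper node means calling remove on that node's own parent.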
Reading Large XML Files:
Reading a complete XML file into memory is inefficient for large files because of the memory it consumes. In such cases, the file can be read incrementally using ElementTree's iterparse function. Note that iterparse also ends up building the complete tree in memory, so it is up to the program to process the tags it needs and to free each part of the tree once it has been processed.
context = ET.iterparse('books.xml', events=("start", "end"))
context = iter(context)
# get the root element
event, root = next(context)
for event, elem in context:
    if event == "end" and elem.tag == "author":
        print(elem.text)
        root.clear()
In the above example, we are interested only in the author tag. Each time the end of an author tag is reached, we print its text and clear everything collected under the root so far. As a result, the root is empty once parsing finishes. Had we not called root.clear(), we would again be left with the complete tree in memory at the end.
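A variation on the same idea is to wait for the end of each book element, read what is needed from it, and only then clear the root. The sketch below assumes a small in-memory document in place of books.xml so that it can run on its own. The sample data and variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a large file; iterparse also accepts file objects.
xml_bytes = b"""<catalog>
  <book id="BID101"><author>Bill Lubanovic</author></book>
  <book id="BID102"><author>Randal L. Schwartz</author></book>
</catalog>"""

authors = []
context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
event, root = next(context)  # the first event is the start of the root element

for event, elem in context:
    # An "end" event means the element and all of its children are complete.
    if event == "end" and elem.tag == "book":
        authors.append(elem.findtext("author"))
        root.clear()  # drop the finished <book> from the tree

remaining_children = len(root)
print(authors, remaining_children)
```

Clearing at the end of each book keeps at most one book in memory at a time, while still letting the loop read any of the book's children before they are discarded.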
Searching for XML Elements:
The ElementTree library offers several methods to search for elements of interest, including support for a limited subset of XPath.
# XPath queries using findall
root = ET.fromstring(xml_string)
print(root.findall("*/publisher"))
print(root.findall("book/category"))

# Iterate over interesting elements only
tree = ET.ElementTree(root)
for elem in tree.iter(tag='category'):
    print(elem.tag, elem.attrib, elem.text)
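Beyond findall, Element also provides find and findtext, and the supported XPath subset includes simple predicates. The following sketch uses a shortened copy of the catalog, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<catalog>"
    "<book id='BID101'><title>Introducing Python</title>"
    "<category>Python</category></book>"
    "<book id='BID102'><title>Learning Perl</title>"
    "<category>Perl</category></book>"
    "</catalog>"
)

# find() returns the first match, or None when nothing matches.
first_book = root.find("book")
missing = root.find("magazine")

# Attribute predicate: select the book whose id is BID102.
perl_book = root.find("book[@id='BID102']")
perl_title = perl_book.findtext("title")  # text of the first matching child

# Child-text predicate: books whose <category> child has the text "Python".
python_books = root.findall("book[category='Python']")

# ".//" searches all descendants, not only direct children.
all_titles = [title.text for title in root.findall(".//title")]

print(first_book.get("id"), missing, perl_title, all_titles)
```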