Parsing XML Files in Python

Hello all,

This is the sixth article in the Python for Data Science series. If you are new to this series, we recommend reading our previous articles:

  1. Python for Data Science Series - Part 1
  2. Python for Data Science Series - Part 2
  3. Using Numpy in Python
  4. Using Pandas in Python
  5. Data Visualization using Matplotlib Python


Please refer to the videos below for a detailed explanation of how to parse XML in Python.


Please refer to the following notebook to understand how to parse XML in Python.




In [4]:
import csv 
import requests 
import xml.etree.ElementTree as ET 
import os
In [2]:
main_folder_path = r"E:\openknowledgeshare.blogspot.com\Python\Outputs"
In [5]:
url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'

# creating HTTP response object from given url 
resp = requests.get(url) 

# saving the xml file 
with open(os.path.join(main_folder_path,'topnewsfeed.xml'), 'wb') as f: 
    f.write(resp.content) 
In [10]:
resp
Out[10]:
<Response [401]>
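The `<Response [401]>` above is HTTP status 401 (Unauthorized), meaning the server rejected the request, so the file written to disk would hold an error body rather than the feed. A minimal sketch of guarding the save step follows. `save_if_ok` is a hypothetical helper, not part of `requests`; it only assumes the response object exposes `status_code` and `content`, as `requests.Response` does.

```python
import os
import tempfile
from types import SimpleNamespace


def save_if_ok(resp, path):
    """Write resp.content to path only when the HTTP status is 200.

    Returns True if the file was written, False otherwise.
    """
    if resp.status_code != 200:
        print(f"Not saving: server returned status {resp.status_code}")
        return False
    with open(path, "wb") as f:
        f.write(resp.content)
    return True


# Stand-in objects so the sketch runs without network access;
# in the notebook, pass the object returned by requests.get(url).
rejected = SimpleNamespace(status_code=401, content=b"Unauthorized")
accepted = SimpleNamespace(status_code=200, content=b"<rss></rss>")

out_path = os.path.join(tempfile.mkdtemp(), "topnewsfeed.xml")
saved_rejected = save_if_ok(rejected, out_path)  # skipped, nothing written
saved_accepted = save_if_ok(accepted, out_path)  # written to out_path
```

With a real `requests` response, calling `resp.raise_for_status()` before writing is another option; it raises an exception for 4xx and 5xx responses.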

Read XML

In [42]:
xml_file_path = os.path.join(main_folder_path,'topnewsfeed.xml')
# create element tree object 
tree = ET.parse(xml_file_path)
In [43]:
tree
Out[43]:
<xml.etree.ElementTree.ElementTree at 0x21d49074438>
In [44]:
# get root element 
root = tree.getroot()
root
Out[44]:
<Element 'holidays' at 0x0000021D49078278>
In [45]:
root.items()
Out[45]:
[('year', '2017'), ('onemoreitem', 'abc')]
In [46]:
root.items()[0][0]
Out[46]:
'year'
In [47]:
root.items()[0][1]
Out[47]:
'2017'
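Indexing into `root.items()` depends on the order of the attributes. Attributes can also be read by name through `Element.get` or the `attrib` dictionary. The XML file itself is not shown in the notebook, so the sketch below uses a hypothetical document, `SAMPLE_XML`, written only to be consistent with the outputs above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, reconstructed to match the attribute values printed above;
# the real file's contents are not shown in the notebook.
SAMPLE_XML = """
<holidays year="2017" onemoreitem="abc">
    <holiday type="other">
        <date>Jan 1</date>
        <name>New Year</name>
    </holiday>
</holidays>
"""

root = ET.fromstring(SAMPLE_XML)

year = root.get("year")                 # look up one attribute by name
missing = root.get("month", "unknown")  # fallback value for an absent attribute
all_attributes = dict(root.attrib)      # plain dict copy of every attribute

print(year, missing, all_attributes)
```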
In [48]:
# getchildren() was removed in Python 3.9; list(elem) returns the direct children
list(root)
Out[48]:
[<Element 'holiday' at 0x0000021D490782C8>,
 <Element 'holiday' at 0x0000021D490783B8>,
 <Element 'anckfk' at 0x0000021D490784A8>,
 <Element 'anckfk' at 0x0000021D49078598>]
In [49]:
# root[0] is the first child element
root[0].items()
Out[49]:
[('type', 'other')]
In [50]:
# children of the first <holiday> element
list(root[0])
Out[50]:
[<Element 'date' at 0x0000021D49078318>,
 <Element 'name' at 0x0000021D49078368>]
In [51]:
root[0][0].text
Out[51]:
'Jan 1'
In [52]:
root[0][1].text
Out[52]:
'New Year'
In [53]:
root[1][0].text
Out[53]:
'Oct 2'
In [54]:
root[1][1].text
Out[54]:
'Gandhi Jayanti'
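Chained index lookups become hard to read as the tree grows. A minimal sketch of walking the same structure by iterating over each element, which yields its direct children, is shown here. The sample document is a hypothetical reconstruction based on the outputs above; the second holiday's `type` value is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML consistent with the values printed in the notebook.
SAMPLE_XML = """
<holidays year="2017">
    <holiday type="other">
        <date>Jan 1</date>
        <name>New Year</name>
    </holiday>
    <holiday type="national">
        <date>Oct 2</date>
        <name>Gandhi Jayanti</name>
    </holiday>
</holidays>
"""

root = ET.fromstring(SAMPLE_XML)

rows = []
for holiday in root:        # iterating an element yields its direct children
    for field in holiday:   # the <date> and <name> children of each holiday
        rows.append((field.tag, field.text))

for tag, text in rows:
    print(tag, text)
```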
In [55]:
root.findall("holiday")
Out[55]:
[<Element 'holiday' at 0x0000021D490782C8>,
 <Element 'holiday' at 0x0000021D490783B8>]
In [61]:
root.findall("anckfk")
Out[61]:
[<Element 'anckfk' at 0x0000021D490784A8>,
 <Element 'anckfk' at 0x0000021D49078598>]
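`findall` accepts more than a bare tag name: ElementTree supports a limited subset of XPath, including attribute predicates and multi-level paths. The sketch below illustrates a few of these on a hypothetical document modelled on the outputs above; the `type="national"` value is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML consistent with the values printed in the notebook.
SAMPLE_XML = """
<holidays year="2017">
    <holiday type="other">
        <date>Jan 1</date>
        <name>New Year</name>
    </holiday>
    <holiday type="national">
        <date>Oct 2</date>
        <name>Gandhi Jayanti</name>
    </holiday>
</holidays>
"""

root = ET.fromstring(SAMPLE_XML)

# Only <holiday> children whose type attribute equals "other"
other_holidays = root.findall("holiday[@type='other']")

# <name> elements one level below each <holiday>
names = [node.text for node in root.findall("holiday/name")]

# A bare tag only matches direct children, so this finds nothing from the root
direct_dates = root.findall("date")

# ".//" searches the whole subtree below the root
all_dates = [node.text for node in root.findall(".//date")]

print(len(other_holidays), names, direct_dates, all_dates)
```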
In [59]:
for each_node in root.findall('holiday'):
    date_node = each_node.find('date')
    name_node = each_node.find('name')
    print(date_node.text, name_node.text)
Jan 1 New Year
Oct 2 Gandhi Jayanti
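The notebook imports `csv` but never uses it. One possible next step, offered as a sketch rather than the author's method, is to collect the parsed values into records and write them out as CSV. The sample XML and the column names are assumptions. `findtext` is used instead of `find(...).text` so that a missing child yields the default string rather than raising `AttributeError`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML consistent with the values printed in the notebook.
SAMPLE_XML = """
<holidays year="2017">
    <holiday type="other">
        <date>Jan 1</date>
        <name>New Year</name>
    </holiday>
    <holiday type="national">
        <date>Oct 2</date>
        <name>Gandhi Jayanti</name>
    </holiday>
</holidays>
"""

root = ET.fromstring(SAMPLE_XML)

# One dictionary per <holiday>, with "" for any missing child element
records = [
    {
        "date": node.findtext("date", default=""),
        "name": node.findtext("name", default=""),
    }
    for node in root.findall("holiday")
]

# In-memory stand-in for a file opened with open(path, "w", newline="")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "name"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```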