[Tutor] reading and processing xml files with python

Sun Jun 21 09:30:14 CEST 2009

Hi,

python.list at Safe-mail.net wrote:
> I am a total python XML noob and wanted some clarification on using python with reading remote XML data.

For XML in general, there's xml.etree.ElementTree in the stdlib. For remote
data (and for various other features), you should also try lxml.etree,
which is an advanced re-implementation.

http://codespeak.net/lxml/

> All examples I have found assume the data is stored locally, or have I misunderstood this?

Likely a misunderstanding. Any XML library I know of can parse from a
string or a file-like object (which the response object returned by
urlopen() is, for example).
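
A minimal sketch of both variants, assuming Python 3 and only the stdlib
ElementTree. The sample document is shortened from the response in the
question, and io.BytesIO merely stands in for whatever file-like object
(an HTTP response, an open file) you actually have:

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample, taken from the response quoted in the question.
xml_bytes = b"""<HotelAvailabilityListResults size="25">
  <Hotel>
    <hotelId>134388</hotelId>
    <name>Milford Plaza at Times Square</name>
  </Hotel>
</HotelAvailabilityListResults>"""

# 1) Parse from a string: fromstring() returns the root Element directly.
root_from_string = ET.fromstring(xml_bytes)

# 2) Parse from a file-like object: parse() returns an ElementTree.
#    BytesIO is a stand-in for any object that has a read() method.
tree = ET.parse(io.BytesIO(xml_bytes))
root_from_file = tree.getroot()

print(root_from_string.tag, root_from_string.get("size"))
print(root_from_file.findtext("Hotel/name"))
```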

> If I browse to:
> 'user:password at domain.com/external/xmlinterface.jsp?cid=xxx&resType=hotel200631&intfc=ws&xml='
> 
> This request returns a page like:
> 
> <HotelAvailabilityListResults size="25">
> <Hotel>
> <hotelId>134388</hotelId>
> <name>Milford Plaza at Times Square</name>
> <address1>700 8th Avenue</address1>
> <address2/>
> <address3/>
> <city>New York</city>
> <stateProvince>NY</stateProvince>
> <country>US</country>
> <postalCode>10036</postalCode>
> <airportCode>NYC</airportCode>
> <lowRate>155.4</lowRate>
> <highRate>259.0</highRate>
> <rateCurrencyCode>USD</rateCurrencyCode>
> <latitude>40.75905</latitude>
> <longitude>-73.98844</longitude>
[...]
> <rateFrequency>B</rateFrequency>
> </PromoRateInfo>
> </HotelProperty>
> </Hotel>
> 
> 
> I got this so far:
> 
>>>> import urllib2
>>>> request = urllib2.Request('user:password at domain.com/external/xmlinterface.jsp?cid=xxx&resType=hotel200631&intfc=ws&xml=')
>>>> opener = urllib2.build_opener()
>>>> firstdatastream = opener.open(request)
>>>> firstdata = firstdatastream.read()
>>>> print firstdata

I have never used HTTP authentication with lxml (ElementTree doesn't support
parsing from remote URLs at all), so I'm not sure whether this works:

	url = 'http://user:password@domain.com/external/...'

	from lxml import etree
	document = etree.parse(url)

If it doesn't, you can use your above code (BTW, isn't urlopen() enough
here?) up to the .open() call and do this afterwards:

	document = etree.parse( firstdatastream )
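
On Python 3, urllib2 has become urllib.request. The following is only a
sketch of how HTTP Basic authentication and parsing from the response
could fit together there. make_auth_opener() and parse_response() are
hypothetical helper names, the host and credentials are placeholders, and
the stdlib ElementTree is used so that it runs without lxml
(lxml.etree.parse() accepts a file-like object in the same way). Nothing
is sent over the network here; a BytesIO stands in for the response:

```python
import io
import urllib.request
import xml.etree.ElementTree as ET


def make_auth_opener(base_url, user, password):
    """Build an opener that answers HTTP Basic auth challenges for base_url."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, base_url, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def parse_response(stream):
    """Parse XML from any file-like object and return the root element."""
    return ET.parse(stream).getroot()


# Placeholder values (assumptions, not the real service).
opener = make_auth_opener("http://domain.example/", "user", "password")

# With a real service the next step would be:
#     with opener.open("http://domain.example/external/...") as response:
#         root = parse_response(response)
# Here a BytesIO stands in for the response so the sketch runs offline.
fake_response = io.BytesIO(b"<HotelAvailabilityListResults size='25'/>")
root = parse_response(fake_response)
print(root.tag, root.attrib)
```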

> <HotelAvailabilityListResults size='25'>
>   <Hotel>
>     <hotelId>134388</hotelId>
>     <name>Milford Plaza at Times Square</name>
>     <address1>700 8th Avenue</address1>
>     <address2/>
>     <address3/>
>     <city>New York</city>
>     <stateProvince>NY</stateProvince>
>     <country>US</country>
>     <postalCode>10036</postalCode>
> 
> ...
> 
> I would like to understand how to manipulate the data further and extract for example all the hotel names in a list?

Read the tutorials on ElementTree and/or lxml. To get a list of hotel
names, I'd expect this to work:

	print [ name.text for name in document.findall('.//Hotel/name') ]
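
For illustration, here is the same idea as a self-contained sketch,
assuming Python 3 and the stdlib ElementTree. The first <Hotel> is
shortened from the data in the question; the second one is invented so
that the list has more than one entry. Note that findall() is needed to
get every match, since find() returns only the first matching element:

```python
import xml.etree.ElementTree as ET

# First <Hotel> is from the question; the second is invented for the demo.
sample = """<HotelAvailabilityListResults size="2">
  <Hotel>
    <hotelId>134388</hotelId>
    <name>Milford Plaza at Times Square</name>
    <lowRate>155.4</lowRate>
  </Hotel>
  <Hotel>
    <hotelId>1</hotelId>
    <name>Example Hotel</name>
    <lowRate>99.0</lowRate>
  </Hotel>
</HotelAvailabilityListResults>"""

root = ET.fromstring(sample)

# Collect the text of every <name> that is a direct child of a <Hotel>.
names = [name.text for name in root.findall(".//Hotel/name")]
print(names)

# Element text is always a string, so convert numbers explicitly.
low_rates = {hotel.findtext("name"): float(hotel.findtext("lowRate"))
             for hotel in root.iter("Hotel")}
print(low_rates)
```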

Stefan