Should I learn Python instead?

Fri Apr 14 18:50:00 EDT 2006

"fyleow" <fyleow at gmail.com> wrote :

<snip>

> I realize that learning the library is part of the process, but as a
> beginner I appreciate simplicity.
> Is Python easier than C#?
Absolutely.

> Can someone show how to access an XML document on the web and have it
> ready to be manipulated for comparison?
Yes. (see below)

> Any other advice for a newbie?
Learn Python. It's good to know other languages too, but when it comes to
getting stuff done fast & cleanly, you will find Python an invaluable tool.

Here's a simple use of the urllib module to fetch a document from the web
(output from an interactive Python interpreter session):
> python
Python 2.3.4 (#1, Feb  7 2005, 15:50:45)
[GCC 3.3.4 (pre 3.3.5 20040809)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> url = 'http://www.google.com'
>>> fd = urllib.urlopen(url)
>>> html = fd.read()
>>> print html
<html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"><title>Google</title><style>
...
<snip>
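
The session above is Python 2. For anyone on Python 3, here is a rough
equivalent, offered as a sketch rather than the one right way: urlopen now
lives in urllib.request, and read() returns bytes that have to be decoded.
The fetch_text name and the UTF-8 fallback are my own assumptions.

```python
# Python 3 sketch of the fetch shown in the session above.
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8"):
    """Return the body of `url` as text (fetch_text is a hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str, in Python 3
        # Use the charset the server declares, if any; otherwise fall back
        # to default_encoding (an assumption, not something the server said).
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding, errors="replace")


# Example use (needs network access, so it is left commented out):
# html = fetch_text("http://www.google.com")
# print(html[:200])
```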

I don't use RSS myself, and don't happen to know a URL that gets me to an
RSS document, but here's an RSS document I found at:
http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html

Here's just the document itself:

 <rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
>
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for
the XML community.</description>
    <language>en-us</language>
    <items>
      <rdf:Seq>
        <rdf:li
rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
        <rdf:li
rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
        <rdf:li
rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
    <title>Normalizing XML, Part 2</title>
    <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
    <description>In this second and final look at applying relational
normalization techniques to W3C XML Schema data modeling, Will Provost
discusses when not to normalize, the scope of uniqueness and the fourth and
fifth normal forms.</description>
    <dc:creator>Will Provost</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
    <title>The .NET Schema Object Model</title>
    <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
    <description>Priya Lakshminarayanan describes in detail the use of the
.NET Schema Object Model for programmatic manipulation of W3C XML
Schemas.</description>
    <dc:creator>Priya Lakshminarayanan</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
    <description>In this month's SVG column, Antoine Quint looks back at
SVG's journey through 2002 and looks forward to 2003.</description>
    <dc:creator>Antoine Quint</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
</rdf:RDF>

Here's a separate interpreter session in which I have put the document above
into a string called xml.

This is the important, functional part:

>>> from xml.dom import minidom
>>> doc = minidom.parseString(xml)
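
parseString wants the whole document already in memory. As a sketch of the
alternative (Python 3 assumed): minidom.parse accepts a filename or an open
file-like object, so the object returned by urlopen could be handed to it
directly. Below, an io.BytesIO stands in for the network response, and the
sample is a trimmed copy of the document above.

```python
import io
from xml.dom import minidom

# Trimmed copy of the RSS document above, standing in for a real download.
sample = b"""<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
  </channel>
</rdf:RDF>"""

# Option 1: parse a string (or bytes) already held in memory.
doc_from_string = minidom.parseString(sample)

# Option 2: parse straight from a file-like object, such as the one
# returned by urllib.request.urlopen(); BytesIO plays that role here.
doc_from_file = minidom.parse(io.BytesIO(sample))

# documentElement is always the root element; firstChild can instead be
# a comment or doctype if the document starts with one.
root = doc_from_file.documentElement
print(root.tagName)
```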

The rest of what follows is just me poking around at the document structure.
You can write whatever code you need to get what you want:

>>> doc.firstChild
<DOM Element: rdf:RDF at 0x404a56ac>
>>> c1 = doc.firstChild
>>> c1.getElementsByTagName('dc:date')
[<DOM Element: dc:date at 0x404acc6c>, <DOM Element: dc:date at 0x404acfec>,
<DOM Element: dc:date at 0x404db38c>]
>>> c1.getElementsByTagName('dc:date')[0].toxml()
u'<dc:date>2002-12-04</dc:date>'
>>> date = c1.getElementsByTagName('dc:date')[0]
>>> date.nodeValue
>>> date.hasChildNodes
<bound method Element.hasChildNodes of <DOM Element: dc:date at 0x404acc6c>>
>>> date.hasChildNodes()
True
>>> date.firstChild
<DOM Text node "2002-12-04">
>>> dir(date.firstChild)
['ATTRIBUTE_NODE', 'CDATA_SECTION_NODE', 'COMMENT_NODE',
'DOCUMENT_FRAGMENT_NODE', 'DOCUMENT_NODE', 'DOCUMENT_TYPE_NODE',
'ELEMENT_NODE', 'ENTITY_NODE', 'ENTITY_REFERENCE_NODE', 'NOTATION_NODE',
'PROCESSING_INSTRUCTION_NODE', 'TEXT_NODE', 'TREE_POSITION_ANCESTOR',
'TREE_POSITION_DESCENDENT', 'TREE_POSITION_DISCONNECTED',
'TREE_POSITION_EQUIVALENT', 'TREE_POSITION_FOLLOWING',
'TREE_POSITION_PRECEDING', 'TREE_POSITION_SAME_NODE', '__doc__', '__len__',
'__module__', '__nonzero__', '__repr__', '__setattr__',
'_call_user_data_handler', '_get_childNodes', '_get_data',
'_get_firstChild', '_get_isWhitespaceInElementContent', '_get_lastChild',
'_get_length', '_get_localName', '_get_nodeValue', '_get_wholeText',
'_set_data', '_set_nodeValue', 'appendChild', 'appendData', 'attributes',
'childNodes', 'cloneNode', 'data', 'deleteData', 'firstChild',
'getInterface', 'getUserData', 'hasAttributes', 'hasChildNodes',
'insertBefore', 'insertData', 'isSameNode', 'isSupported',
'isWhitespaceInElementContent', 'lastChild', 'length', 'localName',
'namespaceURI', 'nextSibling', 'nodeName', 'nodeType', 'nodeValue',
'normalize', 'ownerDocument', 'parentNode', 'prefix', 'previousSibling',
'removeChild', 'replaceChild', 'replaceData', 'replaceWholeText',
'setUserData', 'splitText', 'substringData', 'toprettyxml', 'toxml',
'unlink', 'wholeText', 'writexml']
>>> date.firstChild.wholeText
u'2002-12-04'
>>>
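
Note that date.nodeValue printed nothing in the session above: an element's
nodeValue is None, and the text lives in its child text nodes. Here is one
way that poking around might be turned into a small script, written for
Python 3. The text_of and first_text helpers are names I made up, and the
document is a shortened copy of the one above.

```python
from xml.dom import minidom

xml = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
    <title>The .NET Schema Object Model</title>
    <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
    <dc:date>2002-12-04</dc:date>
  </item>
</rdf:RDF>"""


def text_of(element):
    """Join the text nodes directly under `element` (hypothetical helper)."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


def first_text(parent, tag):
    """Text of the first `tag` element under `parent`, or None if absent."""
    found = parent.getElementsByTagName(tag)
    return text_of(found[0]) if found else None


doc = minidom.parseString(xml)

# One plain dict per <item>, which is easy to sort or compare later.
items = [
    {
        "title": first_text(item, "title"),
        "link": first_text(item, "link"),
        "date": first_text(item, "dc:date"),
    }
    for item in doc.getElementsByTagName("item")
]

for entry in items:
    print(entry["date"], entry["title"])
```

Plain dicts are just one choice here; once the values are ordinary strings,
comparing two feeds is ordinary Python.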

Call dir() and help() on your doc and various other things to see what is
available. For example:

>>> dir(c1)
['ATTRIBUTE_NODE', 'CDATA_SECTION_NODE', 'COMMENT_NODE',
'DOCUMENT_FRAGMENT_NODE', 'DOCUMENT_NODE', 'DOCUMENT_TYPE_NODE',
'ELEMENT_NODE', 'ENTITY_NODE', 'ENTITY_REFERENCE_NODE', 'NOTATION_NODE',
'PROCESSING_INSTRUCTION_NODE', 'TEXT_NODE', 'TREE_POSITION_ANCESTOR',
'TREE_POSITION_DESCENDENT', 'TREE_POSITION_DISCONNECTED',
'TREE_POSITION_EQUIVALENT', 'TREE_POSITION_FOLLOWING',
'TREE_POSITION_PRECEDING', 'TREE_POSITION_SAME_NODE', '__doc__', '__init__',
'__module__', '__nonzero__', '__repr__', '_attrs', '_attrsNS',
'_call_user_data_handler', '_child_node_types', '_get_attributes',
'_get_childNodes', '_get_firstChild', '_get_lastChild', '_get_localName',
'_get_tagName', '_magic_id_nodes', 'appendChild', 'attributes',
'childNodes', 'cloneNode', 'firstChild', 'getAttribute', 'getAttributeNS',
'getAttributeNode', 'getAttributeNodeNS', 'getElementsByTagName',
'getElementsByTagNameNS', 'getInterface', 'getUserData', 'hasAttribute',
'hasAttributeNS', 'hasAttributes', 'hasChildNodes', 'insertBefore',
'isSameNode', 'isSupported', 'lastChild', 'localName', 'namespaceURI',
'nextSibling', 'nodeName', 'nodeType', 'nodeValue', 'normalize',
'ownerDocument', 'parentNode', 'prefix', 'previousSibling',
'removeAttribute', 'removeAttributeNS', 'removeAttributeNode',
'removeAttributeNodeNS', 'removeChild', 'replaceChild', 'schemaType',
'setAttribute', 'setAttributeNS', 'setAttributeNode', 'setAttributeNodeNS',
'setIdAttribute', 'setIdAttributeNS', 'setIdAttributeNode', 'setUserData',
'tagName', 'toprettyxml', 'toxml', 'unlink', 'writexml']
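
One name in that listing worth a closer look is getElementsByTagNameNS.
Looking up 'dc:date' only works while the feed happens to use the prefix
"dc"; another feed can bind the same namespace to a different prefix. A
small sketch of the difference, for Python 3, using a made-up two-line
document (the "dublin" prefix and the feed element are my inventions):

```python
from xml.dom import minidom

# Dublin Core namespace URI, as declared in the RSS document above.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Same namespace as before, but bound to a different (made-up) prefix.
xml = """<feed xmlns:dublin="http://purl.org/dc/elements/1.1/">
  <dublin:date>2002-12-04</dublin:date>
</feed>"""

doc = minidom.parseString(xml)

# Matching on the literal prefixed name misses the element...
by_prefix = doc.getElementsByTagName("dc:date")
# ...while matching on (namespace URI, local name) finds it.
by_namespace = doc.getElementsByTagNameNS(DC_NS, "date")

print(len(by_prefix), len(by_namespace))
print(by_namespace[0].firstChild.data)
```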

# See also: http://docs.python.org/lib/module-xml.dom.minidom.html

There may well be other, more specialized Python modules just for doing
RSS. Check around, including at the Python Cheese Shop:
http://www.python.org/pypi

HTH,
-ej