Encoding when building XML via minidom? (...beginner needs advice)

Wed Mar 27 05:25:04 EST 2002

Hi gurus,

(I am very new to Python and to this group.  I did read the
manuals, but--you know--it is a lot of information for a Python
beginner.  Please be patient with me ;-)

Briefly: I want to parse text files that have some internal
structure and convert them into a well-formed XML intermediate
format (custom markup).  I do not know how to handle the
windows-1250 input encoding or how to produce the XML output
in the same encoding.

Motivation:
I have just learned the basics of Python.  The first thing I want
to try is rewriting one of my Perl scripts that converts a certain
kind of text document into other formats.  The main motivation is to
enhance the functionality of the utilities and to rewrite them in
Python (basically, to learn Python better ;).

Problem:
The text documents use some formatting rules and locally defined
markup (not compatible with any standard, tailored for
the purpose).  The documents are encoded in windows-1250
and frequently use characters with codes above 127 (Czech
texts written in the Windows encoding).  (The format, the
encoding, and the tools used to produce the documents must
remain unchanged.)

XML seems to be the right format for the intermediate
form of the documents, and DOM seems to be the right
model/apparatus to use.  I tried some simple things with the
xml.dom.minidom implementation in Python 2.1.1.  I was able to build
a simple XML document that contained ASCII text (just a sequence
of dom.createElement(), dom.createTextNode(), elem.appendChild(),
and dom.toxml() calls).
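
For readers following along, the sequence of calls described above
might look roughly like the sketch below.  It assumes a current
Python 3 interpreter rather than 2.1.1, and the element names
"document" and "para" are invented purely for illustration.

```python
from xml.dom.minidom import Document


def build_document(paragraphs):
    """Build a small DOM tree with one <para> element per input string."""
    dom = Document()
    root = dom.createElement("document")      # hypothetical root element name
    dom.appendChild(root)
    for text in paragraphs:
        para = dom.createElement("para")      # hypothetical element name
        para.appendChild(dom.createTextNode(text))
        root.appendChild(para)
    return dom


dom = build_document(["first", "second"])
xml_text = dom.toxml()    # serialize the whole tree to a single string
print(xml_text)
```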

I need advice on the best way to create my
own parser and how to use it with xml.dom.minidom.  The main
problem is handling the input encoding correctly with respect to
the produced XML.  Ideally, the output encoding should stay
the same as the input, and the output should start with:
    <?xml version="1.0" encoding="windows-1250" ?>
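
One possible way to get such output, sketched under the assumption
of a current Python 3 (where toxml() accepts an encoding argument;
this may not apply to 2.1.1): decode the input bytes to Unicode,
build the DOM from the decoded text, and pass the encoding name when
serializing.  The sample sentence and the element name "document"
are stand-ins, not part of the real data.  Note that minidom writes
the declaration without the space before "?>", which is equivalent
as far as XML is concerned.

```python
from xml.dom.minidom import Document

ENCODING = "windows-1250"

# Stand-in for bytes that would normally be read from a file opened in "rb" mode.
raw = "Příliš žluťoučký kůň".encode(ENCODING)

# Decode to a Unicode string before handing the text to the DOM.
text = raw.decode(ENCODING)

dom = Document()
root = dom.createElement("document")          # hypothetical element name
root.appendChild(dom.createTextNode(text))
dom.appendChild(root)

# With an encoding argument, toxml() returns bytes and writes the
# encoding into the XML declaration.
xml_bytes = dom.toxml(encoding=ENCODING)
print(xml_bytes.decode(ENCODING))
```

The resulting bytes can be written unchanged to a file opened in
binary mode.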

Are there preferred ways of integrating a parser with xml.dom
in Python?  Are there any tutorials on working with encoding
conversions? (Python's DOM works with Unicode internally, doesn't it?)
Is there anything in Python that makes writing indented plain
text easier?
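
On the indentation question, two standard-library helpers may be
relevant, depending on what is meant: toprettyxml() for indented XML
output and textwrap.indent() for indenting plain text.  The sketch
below assumes a current Python 3; the element names are again made
up for illustration.

```python
import textwrap
from xml.dom.minidom import Document

dom = Document()
root = dom.createElement("document")          # hypothetical element name
dom.appendChild(root)
para = dom.createElement("para")              # hypothetical element name
para.appendChild(dom.createTextNode("first"))
root.appendChild(para)

# Indented XML: each nesting level is prefixed with two spaces.
pretty = dom.toprettyxml(indent="  ")
print(pretty)

# Indented plain text: every non-blank line gets the given prefix.
indented = textwrap.indent("line one\nline two", "    ")
print(indented)
```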

Thanks for your patience and for your help,
  Petr

--
Petr Prikryl (prikrylp at skil dot cz)