fun with unicode files

Thomas Heller theller at python.net
Tue Aug 24 04:33:03 EDT 2004


I want to use ConfigParser with both NT4-style .reg files, which are
ASCII (or ANSI?) files, and XP-style .reg files, which seem to be UTF-16
encoded Unicode files (I hope that's the correct terminology).  [And yes, I
have read the warning in the manual that ConfigParser doesn't interpret
the value-type prefixes in the .reg files.]

Here's the start of the method I wrote to detect the encoding and read
the file:

def _parse_regfile(self, filename):
    ifi = open(filename, "r")
    import codecs, StringIO
    if ifi.read(2) in (codecs.BOM_LE, codecs.BOM_BE):
        ifi.close()
        ifi = codecs.open(filename, "r", "utf-16")

I wonder: do I really have to check for the BOM manually, or is there a
Python function which does that?
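
For the BOM question, here is a minimal sketch of a manual check written as
a standalone helper. The name sniff_encoding and the "ascii" default are
assumptions made for illustration, not standard-library API, and the code
targets current Python 3 rather than the Python this post was written for.
The "utf-16" codec reads the BOM itself to choose the byte order, so the
helper only has to decide whether a BOM is present at all.

```python
import codecs


def sniff_encoding(data, default="ascii"):
    """Guess an encoding name from a leading UTF-16 byte order mark.

    `data` is the first few bytes of the file. sniff_encoding is a
    hypothetical helper; `default` is an assumed fallback for files
    that carry no BOM (for example NT4-style .reg files).
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # "utf-16" consumes the BOM and selects little or big endian.
        return "utf-16"
    return default
```

The file would have to be opened in binary mode ("rb") so that the first
two bytes reach the helper unchanged.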
Continuing the code:

        # ConfigParser calls .readline(), but:
        # NotImplementedError: '.readline() is not implemented for UTF-16'
        # so we need to put the data into a StringIO instance.
        # Um, cStringIO doesn't handle unicode correctly, so we'll have
        # to use the slower StringIO
        ifi = StringIO.StringIO(ifi.read())
    ifi.readline() # skip the first two lines
    ifi.readline()
    c = ConfigParser()
    c.readfp(ifi)
    return c

Is there a better way to do this?  Why doesn't the UTF-16 codec
implement readline()?
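
One possible shape for the whole method is sketched below: decode the
complete file once and hand the parser an in-memory text buffer, which
avoids the missing readline(). It is written for current Python 3, where
the module is named configparser and readfp() became read_file(). The
name parse_reg_bytes, the "latin-1" fallback and the choice of
RawConfigParser are assumptions for illustration, not part of the method
above.

```python
import codecs
import io
from configparser import RawConfigParser


def parse_reg_bytes(raw, fallback="latin-1"):
    """Decode the raw bytes of a .reg export and parse them.

    parse_reg_bytes is a hypothetical helper; `fallback` is an assumed
    encoding for files without a UTF-16 byte order mark.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")  # the codec strips the BOM itself
    else:
        text = raw.decode(fallback)
    buf = io.StringIO(text)          # in-memory text file with readline()
    buf.readline()                   # skip the header line ...
    buf.readline()                   # ... and the blank line after it
    parser = RawConfigParser()       # no "%" interpolation of values
    parser.optionxform = str         # keep the case of value names
    parser.read_file(buf)
    return parser
```

Called as parse_reg_bytes(open(filename, "rb").read()), this handles both
file flavours in one code path. As the manual warns, value names and data
keep their quotes and type prefixes.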

Thomas


