right adjusted strings containing umlauts

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Aug 9 21:29:52 EDT 2013


On Thu, 08 Aug 2013 17:24:49 +0200, Kurt Mueller wrote:

> What do I do, when input_strings/output_list has other codings like
> iso-8859-1?

When reading from a text file, honour a BOM if there is one, or else some 
sort of encoding cookie at the top (or bottom) of the file, like Emacs and 
Vim use. If there is neither, assume UTF-8.

When reading from stdin, assume UTF-8.
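
As a rough illustration of the stdin case, here is a minimal sketch that 
assumes Python 3, where sys.stdin.buffer exposes the raw bytes. The helper 
name utf8_reader is made up for this example.

```python
import io
import sys


def utf8_reader(binary_stream):
    """Wrap a binary stream so that reading from it yields UTF-8 decoded text."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


if __name__ == "__main__":
    # Decode stdin as UTF-8 regardless of the locale, then right-adjust
    # each line. After decoding, rjust() counts characters, not bytes.
    for line in utf8_reader(sys.stdin.buffer):
        print(line.rstrip("\n").rjust(20))
```

Once the bytes are decoded, a name with an umlaut pads the same way as any 
other string of the same length in characters.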

Otherwise, make it the caller's responsibility to specify the encoding if 
they wish to use something else.

Pseudo-code:

encoding = None

if command line arguments include '--encoding':
    encoding = --encoding argument

if encoding is None:
    if input file is stdin:
        encoding = 'utf-8'
    else:
        open file as binary
        if first 2-4 bytes look like a BOM:
            encoding = one of UTF-8 or UTF-16 or UTF-32
        else:
            read first two lines 
            if either looks like an encoding cookie:
                encoding = cookie
            # optionally check the end of the file as well
        close file

if encoding is None:
    encoding = 'utf-8'

read from file using encoding
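
The pseudo-code above could be sketched in Python 3 roughly as follows. 
This is only one possible reading of it: the function names 
(sniff_encoding, choose_encoding), the use of "-" to mean stdin, the 4096 
byte read size and the cookie pattern (modelled on the one in PEP 263) are 
all assumptions of this example. It does not check the end of the file.

```python
import codecs
import re

# UTF-32 is tested before UTF-16 because the UTF-32-LE BOM starts with
# the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Assumed cookie pattern: matches Emacs style "coding: latin-1" as well as
# Vim style "fileencoding=latin1".
COOKIE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def sniff_encoding(head, default="utf-8"):
    """Guess an encoding from the first bytes of a file.

    Checks for a BOM first, then for a cookie in the first two lines,
    and otherwise returns the default.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    for line in head.splitlines()[:2]:
        match = COOKIE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default


def choose_encoding(cli_encoding, path):
    """Apply the precedence: explicit argument, then stdin, then sniffing."""
    if cli_encoding is not None:
        return cli_encoding
    if path == "-":  # assumption: "-" stands for stdin
        return "utf-8"
    with open(path, "rb") as f:
        head = f.read(4096)
    return sniff_encoding(head)


def read_text(path, cli_encoding=None):
    """Read a whole file as text using the chosen encoding."""
    with open(path, "rb") as f:
        return f.read().decode(choose_encoding(cli_encoding, path))
```

The codec names "utf-8-sig", "utf-16" and "utf-32" are used so that the 
BOM is consumed during decoding rather than showing up in the text. If a 
cookie names an encoding Python does not know, decode() raises LookupError, 
which the caller would have to handle.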




-- 
Steven


