remove BOM from string read from utf-8 file

Matt Gerrans matt.gerrans at hp.com
Fri Feb 27 15:03:47 EST 2004


I found myself often needing to read text files that might be UTF-8, UTF-16
("unicode") or ANSI, without knowing beforehand which, so I wrote a single
function to do it. I don't know if this is the correct way to handle this
situation, but I couldn't find any function that would simply open a file
with the appropriate codec automatically, so I use this (it doesn't handle
all cases, just the ones I've needed so far):

import os, codecs
#----------------------------------------------------------------------------
#                                OpenTextFile()
#
# Opens a file with the codec indicated by its BOM (UTF-8, UTF-16 or
# UTF-32).  If the file doesn't exist, or has no BOM (e.g. ansi), the
# encoding argument is used as given; None means no codec is applied.
#
# Python documentation of the codecs module is pretty weak; for instance
# there are all these:
#    BOM
#    BOM_BE
#    BOM_LE
#    BOM_UTF8
#    BOM_UTF16
#    BOM_UTF16_BE
#    BOM_UTF16_LE
#    BOM_UTF32
#    BOM_UTF32_BE
#    BOM_UTF32_LE
# but no explanation of how they map to the encodings like 'utf-16'.  Some
# can be inferred, but some are not so clear.
#----------------------------------------------------------------------------
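def OpenTextFile(filename,mode='r',encoding=None):
   if os.path.isfile(filename):
      f = open(filename,'rb')   # open() rather than file(), which Python 3 lacks
      header = f.read(4) # Read just the first four bytes.
      f.close()
      # Don't change this to a map, because it is ordered!!!  The UTF-32
      # BOMs must be tested first: BOM_UTF32_LE begins with BOM_UTF16_LE.
      # Both byte orders are listed because codecs.BOM_UTF32 and
      # codecs.BOM_UTF16 are only the platform's native-order marks.
      encodings = [ ( codecs.BOM_UTF32_LE, 'utf-32' ),
                    ( codecs.BOM_UTF32_BE, 'utf-32' ),
                    ( codecs.BOM_UTF16_LE, 'utf-16' ),
                    ( codecs.BOM_UTF16_BE, 'utf-16' ),
                    # 'utf-8-sig' (Python 2.5+) strips the BOM; 'utf-8' keeps it.
                    ( codecs.BOM_UTF8,     'utf-8-sig' ) ]
      for h,e in encodings:
         if header.startswith(h):
            encoding = e
            break
   return codecs.open(filename,mode,encoding)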
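
For what it's worth, here is a minimal sketch of the same BOM-sniffing idea,
written so it can be run on its own. The names sniff_encoding, read_text and
demo are hypothetical (they are not from the post above), and the 'latin-1'
fallback for files without a BOM is an assumption standing in for "ansi". It
relies on two things the codecs module does provide: BOM_UTF16 and BOM_UTF32
are just the native-byte-order marks, so both the _LE and _BE constants are
checked explicitly, and the 'utf-8-sig' codec drops a leading UTF-8 BOM when
decoding, whereas plain 'utf-8' leaves it in the string.

```python
import codecs
import os
import tempfile

# Order matters: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE),
# so the UTF-32 marks have to be tested before the UTF-16 ones.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
]

def sniff_encoding(header, fallback=None):
    """Return a codec name for the BOM at the start of header, else fallback."""
    for bom, name in BOMS:
        if header.startswith(bom):
            return name
    return fallback

def read_text(filename, fallback='latin-1'):
    """Read a whole file as text, choosing the codec from its BOM.

    The 'latin-1' fallback is an assumption for files without a BOM.
    """
    with open(filename, 'rb') as f:
        data = f.read()
    return data.decode(sniff_encoding(data[:4], fallback))

def demo():
    """Write the same text in several encodings and read each file back."""
    results = {}
    for codec in ('utf-8-sig', 'utf-16', 'utf-32', 'latin-1'):
        fd, path = tempfile.mkstemp()
        os.close(fd)
        try:
            with open(path, 'wb') as f:
                f.write(u'caf\xe9'.encode(codec))
            results[codec] = read_text(path)
        finally:
            os.remove(path)
    return results
```

Calling demo() should give back the same four-character string for every
encoding, with no stray BOM character at the front. The generic 'utf-16' and
'utf-32' codecs consume the BOM themselves and use it to pick the byte order,
which is why both byte orders can map to the same codec name.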




