removing BOM prepended by codecs?

Steven D'Aprano steve at pearwood.info
Tue Sep 24 06:56:21 EDT 2013


On Tue, 24 Sep 2013 10:42:22 +0100, J. Bagg wrote:

> I'm having trouble with the BOM that is now prepended to codecs files.
> The files have to be read by java servlets which expect a clean file
> without any BOM.
> 
> Is there a way to stop the BOM being written?

Of course there is :-) but first we need to know how you are writing it 
in the first place.

If you are dealing with existing files, which already contain a BOM, you 
may need to open the files and re-save them without the BOM.
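
For the existing-files case, here is a minimal sketch of the re-save step. 
The helper names strip_bom and resave_without_bom are made up for this 
example, and it assumes the files are small enough to read into memory. 
It works on raw bytes, so it does not need to know the encoding beyond 
recognising the BOM itself.

```python
import codecs

# Known BOMs. The UTF-32 ones come first because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def strip_bom(raw):
    """Return the bytes with one leading BOM removed, if there is one."""
    for bom in BOMS:
        if raw.startswith(bom):
            return raw[len(bom):]
    return raw


def resave_without_bom(path):
    """Rewrite the file at path in place, minus any leading BOM."""
    with open(path, "rb") as f:
        raw = f.read()
    with open(path, "wb") as f:
        f.write(strip_bom(raw))
```

One caveat: a UTF-16-LE file whose first character after the BOM is NUL 
would be mistaken for UTF-32-LE. That should be rare in real text files, 
but the sketch does not guard against it.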

If you are dealing with temporary files you're creating programmatically, 
it depends how you're creating them. My guess is that you're doing 
something like this:

f = open("some file", "w", encoding="UTF-16")  # or UTF-32
f.write(data)
f.close()

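or similar. Both the UTF-16 and UTF-32 codecs write a BOM when encoding. 
To avoid that, name the byte order explicitly: UTF-16-BE or UTF-16-LE 
(Big Endian or Little Endian), choosing whichever order the program 
reading the file expects. The same applies to UTF-32-BE and UTF-32-LE.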
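
You can see the difference without touching the file system. This is only 
an illustration on a throwaway string (the variable names are mine), 
assuming Python 3:

```python
import codecs

text = "hi"

with_bom = text.encode("utf-16")    # BOM first, then native byte order
little = text.encode("utf-16-le")   # explicit byte order, no BOM
big = text.encode("utf-16-be")      # explicit byte order, no BOM

# The generic codec prepends the BOM for the machine's byte order.
print(with_bom.startswith(codecs.BOM_UTF16))  # True
print(little)  # b'h\x00i\x00'
print(big)     # b'\x00h\x00i'
```

Passing the same codec names as the encoding argument to open() has the 
same effect on what ends up in the file.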

If you're getting a UTF-8 BOM, that's seriously weird. The standard UTF-8 
codec doesn't write a BOM. (Strictly speaking, it's not a Byte Order 
Mark, but a Signature.) Unless you're using encoding='UTF-8-sig', I can't 
guess how you're getting a UTF-8 BOM.
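
If you want to check which of the two UTF-8 codecs is in play, here is a 
quick comparison, again just a sketch on a sample string, assuming 
Python 3:

```python
import codecs

text = "caf\u00e9"

plain = text.encode("utf-8")        # no signature
signed = text.encode("utf-8-sig")   # three-byte signature, then UTF-8

# The only difference is the leading EF BB BF signature.
assert signed == codecs.BOM_UTF8 + plain

# Decoding with utf-8-sig accepts both forms and drops the signature.
assert signed.decode("utf-8-sig") == text
assert plain.decode("utf-8-sig") == text

# Decoding signed data with plain utf-8 keeps a leading U+FEFF.
assert signed.decode("utf-8") == "\ufeff" + text
```

So a file that starts with the bytes EF BB BF was most likely written with 
utf-8-sig, or by some other tool that adds the signature.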

If you're doing something else, well, you'll have to explain what you're 
doing before we can tell you how to stop doing it :-)


> I'm working on Linux with a locale of en_GB.UTF8

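The locale sets the encoding the OS environment advertises; it does not 
change Python's own default for converting between text and bytes, which 
is ASCII in Python 2 and UTF-8 in Python 3. One caveat: in Python 3, 
open() called without an encoding argument normally uses the locale's 
preferred encoding, so under en_GB.UTF8 that is UTF-8, which writes no BOM.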
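
To see which encodings are actually in effect on your machine, something 
like this should do. The function name report_encodings is mine, and the 
results depend on your Python version and environment, so I won't guess 
at the output:

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python will fall back on by default."""
    return {
        # Default for str.encode() / bytes.decode() with no argument.
        "default": sys.getdefaultencoding(),
        # What text-mode open() normally uses when no encoding is given
        # (Python 3, outside UTF-8 mode).
        "preferred": locale.getpreferredencoding(False),
    }


for name, value in sorted(report_encodings().items()):
    print(name, value)
```

Whatever it reports, passing encoding= explicitly to open() takes the 
locale out of the picture.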


-- 
Steven


