removing BOM prepended by codecs?
Steven D'Aprano
steve at pearwood.info
Tue Sep 24 06:56:21 EDT 2013
On Tue, 24 Sep 2013 10:42:22 +0100, J. Bagg wrote:
> I'm having trouble with the BOM that is now prepended to codecs files.
> The files have to be read by java servlets which expect a clean file
> without any BOM.
>
> Is there a way to stop the BOM being written?
Of course there is :-) but first we need to know how you are writing it
in the first place.
If you are dealing with existing files, which already contain a BOM, you
may need to open the files and re-save them without the BOM.
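For the existing-files case, here is a rough sketch of what "re-save without the BOM" could look like. The names strip_bom and resave_without_bom are made up for illustration, and it assumes the files are small enough to read into memory in one go:

```python
import codecs

# Check the longer BOMs first: the UTF-32-LE BOM starts with the same
# two bytes as the UTF-16-LE BOM.
_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)

def strip_bom(raw):
    """Return the bytes in `raw` with one leading BOM removed, if present."""
    for bom in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):]
    return raw

def resave_without_bom(path):
    """Rewrite the file at `path` in place, minus any leading BOM."""
    with open(path, "rb") as f:
        raw = f.read()
    cleaned = strip_bom(raw)
    if cleaned != raw:
        with open(path, "wb") as f:
            f.write(cleaned)
```

Bear in mind that once the BOM is gone from a UTF-16 or UTF-32 file, whatever reads it has to be told the byte order some other way.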
If you are dealing with temporary files you're creating programmatically,
it depends how you're creating them. My guess is that you're doing
something like this:
    f = open("some file", "w", encoding="UTF-16") # or UTF-32
    f.write(data)
    f.close()
or similar. Both the UTF-16 and UTF-32 codecs write BOMs. To avoid that,
you should use UTF-16-BE or UTF-16-LE (Big Endian or Little Endian), or
the UTF-32 equivalents, whichever byte order the program reading the file
expects.
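You can see the difference without touching a file by encoding a short string (Python 3 assumed; the same codec names work as the encoding argument to open()):

```python
import codecs

text = "hi"

# The generic codec prepends a BOM, then uses the native byte order.
with_bom = text.encode("utf-16")

# The endian-specific codecs write no BOM at all.
no_bom_le = text.encode("utf-16-le")
no_bom_be = text.encode("utf-16-be")

has_bom = with_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(has_bom, no_bom_le, no_bom_be)
```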
If you're getting a UTF-8 BOM, that's seriously weird. The standard UTF-8
codec doesn't write a BOM. (Strictly speaking, it's not a Byte Order
Mark, but a Signature.) Unless you're using encoding='UTF-8-sig', I can't
guess how you're getting a UTF-8 BOM.
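For comparison, this is how the two UTF-8 codecs behave on a short string (again assuming Python 3):

```python
import codecs

text = "hi"

plain = text.encode("utf-8")        # no signature
signed = text.encode("utf-8-sig")   # signature EF BB BF, then the text

print(plain)
print(signed)
print(signed.startswith(codecs.BOM_UTF8))

# When reading, utf-8-sig drops the signature if it is there and
# otherwise behaves like plain utf-8.
from_signed = signed.decode("utf-8-sig")
from_plain = plain.decode("utf-8-sig")
```

Decoding the signed bytes with plain "utf-8" instead leaves a U+FEFF character at the start of the string.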
If you're doing something else, well, you'll have to explain what you're
doing before we can tell you how to stop doing it :-)
> I'm working on Linux with a locale of en_GB.UTF8
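The locale only sets the default encoding used by the OS. Python's default
string encoding (sys.getdefaultencoding()) does not depend on it: it is
ASCII in Python 2 and UTF-8 in Python 3. Note, though, that Python 3's
open() falls back to the locale's preferred encoding when you don't pass
an encoding argument.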
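If in doubt, a quick check of what a given interpreter reports (Python 3 assumed):

```python
import locale
import sys

# Used for implicit str/bytes conversions; not affected by the locale.
default_str_encoding = sys.getdefaultencoding()

# What open() uses for text files when no encoding argument is passed.
default_file_encoding = locale.getpreferredencoding(False)

print(default_str_encoding, default_file_encoding)
```

Passing encoding= explicitly to open() sidesteps the question.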
--
Steven