removing BOM prepended by codecs?

Tue Sep 24 14:43:41 EDT 2013

On Tuesday, 24 September 2013 11:42:22 UTC+2, J. Bagg wrote:
> I'm having trouble with the BOM that is now prepended to codecs files.
> The files have to be read by java servlets which expect a clean file
> without any BOM.
>
> Is there a way to stop the BOM being written?
>
> It is seriously messing up my work as the servlets do not expect it to
> be there. I could delete it but that means another delay in retrieving
> the data. My work is a bibliographic system and I'm writing a new search
> engine in Python to replace an ancient one in C.
>
> I'm working on Linux with a locale of en_GB.UTF8
>
> -- 
> Dr Janet Bagg
> CSAC, Dept of Anthropology,
> University of Kent, UK

---------

Some points.

- The encoding of a text file does not matter. What
counts is knowing which encoding was used.

- The *mark* (the term once used in the Unicode.org FAQ) indicating
a Unicode-encoded raw text file is neither a byte order mark
nor a signature. It is an encoded code point: the encoded
U+FEFF, 'ZERO WIDTH NO-BREAK SPACE'. (Note that a
no-break space at the start of a text is nonsense.)

- When such a mark is present, the encoding can always be
identified 100% safely. No error is possible.

- When such a mark is absent, in many cases guessing
is the only option.

These are facts.
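
To make the second and third points concrete, here is a minimal sketch,
assuming Python 3. The helper name has_utf8_mark is only illustrative;
codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs

# The utf-8 "mark" is simply the code point U+FEFF encoded as utf-8.
MARK = "\ufeff".encode("utf-8")
assert MARK == codecs.BOM_UTF8 == b"\xef\xbb\xbf"


def has_utf8_mark(data: bytes) -> bool:
    """Return True if the raw bytes start with the encoded U+FEFF."""
    return data.startswith(codecs.BOM_UTF8)


marked = codecs.BOM_UTF8 + "café".encode("utf-8")
unmarked = "café".encode("utf-8")

# "utf-8-sig" drops a leading mark when decoding; plain "utf-8" keeps it
# as an ordinary character at the start of the text.
text_without_mark = marked.decode("utf-8-sig")
text_with_mark = marked.decode("utf-8")
```

Decoding with plain "utf-8" does not fail on a marked file; the U+FEFF just
stays in the text, which is what trips up consumers that do not expect it.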

Now to the question: should one write such a mark,
especially in utf-8? I would say the following:

It seems to me that one sees more and more marked utf-8 files.
(Windows is probably a reason.)

More importantly, more and more tools and software
handle this utf-8 mark, or are being fixed to support it,
so it basically does not hurt too much. E.g. Python, golang 1.1
(this was not the case in 1.0), LibreOffice, TeXWorks (which
once did not support it), the Unicode TeX engines, ...
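
As an illustration of how Python handles the mark, here is a small sketch,
assuming Python 3. The function names write_text, read_text and demo are
hypothetical; only the codec names "utf-8" and "utf-8-sig" come from the
standard library. Writing with "utf-8" never emits the mark, writing with
"utf-8-sig" always does, and reading with "utf-8-sig" accepts both kinds
of file.

```python
import codecs
import os
import tempfile


def write_text(path, text, with_mark=False):
    """Write text as utf-8; prepend the encoded U+FEFF only on request."""
    encoding = "utf-8-sig" if with_mark else "utf-8"
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)


def read_text(path):
    """Read utf-8 text, dropping a leading mark if there is one."""
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return f.read()


def demo():
    """Write one unmarked and one marked file, then read both back."""
    with tempfile.TemporaryDirectory() as d:
        plain = os.path.join(d, "plain.txt")
        marked = os.path.join(d, "marked.txt")
        write_text(plain, "café\n")
        write_text(marked, "café\n", with_mark=True)
        with open(plain, "rb") as f:
            plain_bytes = f.read()
        with open(marked, "rb") as f:
            marked_bytes = f.read()
        return plain_bytes, marked_bytes, read_text(plain), read_text(marked)
```

For a consumer that cannot cope with the mark, such as the servlets in the
original question, writing with plain "utf-8" should be enough.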

If I had to work in "archiving", I would seriously think
twice.

PS Unicode encodes characters on a per *script* ("alphabet")
basis, not per *language*.

jmf