[issue1328] feature request: force BOM option

Thu Nov 1 20:56:30 CET 2007

James G. sack (jim) added the comment:

Adam Olsen wrote:
> Adam Olsen added the comment:
> 
> The problem with "being tolerant" as you suggest is you lose the ability
> to round-trip.  Read in a file using the UTF-8 signature, write it back
> out, and suddenly nothing else can open it.

I'm sorry, I don't see the round-trip problem you describe.

If codec utf_8 or utf_8_sig were to accept input with or without the
3-byte BOM, and write it as currently specified without/with the BOM
respectively, then _I_ can reread again with either utf_8 or utf_8_sig.

No round trip problem _for me_.
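
To make that concrete, here is a minimal sketch of the tolerant reading I have in mind. It assumes the behavior of the codecs in current CPython 3, where decoding with utf_8_sig already accepts input with or without the signature; the helper name tolerant_read is hypothetical and only for illustration, not a proposed API.

```python
import codecs

def tolerant_read(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, skipping one leading signature if present."""
    # The utf-8-sig decoder strips a leading BOM when there is one
    # and decodes plain UTF-8 unchanged when there is not.
    return data.decode("utf-8-sig")

text = "hello"
with_sig = text.encode("utf-8-sig")   # written with the 3-byte signature
without_sig = text.encode("utf-8")    # written without it

# The signature is the only difference between the two outputs.
assert with_sig == codecs.BOM_UTF8 + without_sig

# Either form can be read back by the tolerant reader.
assert tolerant_read(with_sig) == text
assert tolerant_read(without_sig) == text
```

Whichever codec wrote the file, the tolerant reader gets the same text back, which is all I mean by "no round trip problem for me".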

Now if I need to exchange with someone else, that's a different matter. One
way or another I need to know what format they need and create the
output they require for their input.

Am I missing something in your statement of a problem?

> Conceptually, these signatures shouldn't even be part of the encoding;
> they're a prefix in the file indicating which encoding to use.

Yes, I'm aware of that, but you can't predict what you may find in dusty
archives, or what someone may give to you. IMO, that's the basis of
being tolerant in what you accept, is it not?

> Note that the BOM signature (ZWNBSP) is a valid code point.  Although it
> seems unlikely for a file to start with ZWNBSP, if you were to chop a file
> up into smaller chunks and decode them individually you'd be more likely
> to run into it.  (However, it seems general use of ZWNBSP is being
> discouraged precisely due to this potential for confusion[1]).

I understand that throwing away a ZWNBSP at the beginning of a file does
risk discarding data rather than metadata. I also believe the standards
people recognized that and deliberately picked a BOM character that is a
calculated low risk. I'm willing to accept that risk.
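
For the record, the risk looks like this in practice. This is only a sketch, again assuming current CPython 3 codec behavior; the variable names are mine.

```python
ZWNBSP = "\ufeff"  # same code point as the BOM

# A chunk whose first character is a genuine ZWNBSP (data, not a signature).
chunk = (ZWNBSP + "data").encode("utf-8")

strict = chunk.decode("utf-8")        # keeps the leading U+FEFF
tolerant = chunk.decode("utf-8-sig")  # treats it as a signature and drops it

# Only a U+FEFF at the very start is affected; one in the middle survives.
inner = ("a" + ZWNBSP + "b").encode("utf-8").decode("utf-8-sig")
```

Here strict still begins with U+FEFF while tolerant does not, so the one character lost is exactly the leading ZWNBSP.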

> In summary, guessing the encoding should never be the default.  Although
> it may be appropriate in some contexts, we must ensure we emit the right
> encoding for those contexts as well. [2]
> 
> [1] http://unicode.org/faq/utf_bom.html#38
> [2] http://unicode.org/faq/utf_bom.html#28

From my point of view, I don't see that being tolerant in what _I_ (or
my applications) accept violates any guidelines.

Please explain where I am wrong.

Regards,
..jim

__________________________________
Tracker <report at bugs.python.org>
<http://bugs.python.org/issue1328>
__________________________________