[issue1328] feature request: force BOM option
James G. sack (jim)
report at bugs.python.org
Tue Nov 20 04:39:57 CET 2007
James G. sack (jim) added the comment:
More discussion of utf_8.py decoding behavior (and possible change):
For my needs, I would like the decoding parts of the utf_8 module to treat
an initial BOM as an optional signature and skip it if there is one (just
like the utf_8_sig decoder). In fact I have a working patch that replaces
the utf_8 decode, IncrementalDecoder and StreamReader components by
direct transplants from utf_8_sig (as recently repaired -- there was a
StreamReader error).
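For reference, the difference between the two decoders as they stand can
be seen with a short check. This is only an illustrative sketch, written
in Python 3 syntax (an assumption; the report itself targets 2.6), and
the variable names are made up for the example:

```python
import codecs

# A UTF-8 byte string that starts with the BOM signature (EF BB BF).
with_bom = codecs.BOM_UTF8 + "abc".encode("utf_8")

# utf_8 keeps the BOM and hands it back as a leading U+FEFF character.
kept = with_bom.decode("utf_8")

# utf_8_sig treats the BOM as an optional signature and skips it.
skipped = with_bom.decode("utf_8_sig")

# Without a BOM the two decoders agree.
no_bom = b"abc"
same = no_bom.decode("utf_8") == no_bom.decode("utf_8_sig")
```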
However the reason for discussion is to ask how it might impact existing
code.
I can imagine there might be utf_8 client code out there which expects to
see a leading U+feff as (perhaps) a clue that the output should be returned
with a BOM-signature (say) to accommodate the guessed input requirements of
the remote correspondent.
Making my work easier might actually make someone else's work (probably,
annoyingly) harder.
So what to do?
I can just live with code like
    if input[0] == u"\ufeff":
        input = input[1:]
spread around, and of course slightly different for incremental and stream
inputs.
But I probably wouldn't. I would probably substitute a
"my_utf_8" encoding to make my code a little cleaner.
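One way such a private encoding might be wired up is sketched below,
assuming Python 3 and the standard codecs.register / codecs.CodecInfo
machinery. The search function name _search_my_utf_8 is hypothetical;
"my_utf_8" is just the placeholder name used above. The decoding half is
borrowed from utf_8_sig and the encoding half from utf_8:

```python
import codecs


def _search_my_utf_8(name):
    """Codec search function (hypothetical) answering only for 'my_utf_8'."""
    if name != "my_utf_8":
        return None
    plain = codecs.lookup("utf_8")    # encoder side: never writes a BOM
    sig = codecs.lookup("utf_8_sig")  # decoder side: skips a leading BOM
    return codecs.CodecInfo(
        name="my_utf_8",
        encode=plain.encode,
        decode=sig.decode,
        incrementalencoder=plain.incrementalencoder,
        incrementaldecoder=sig.incrementaldecoder,
        streamwriter=plain.streamwriter,
        streamreader=sig.streamreader,
    )


codecs.register(_search_my_utf_8)

# Decoding skips the optional signature; encoding does not add one.
decoded = (codecs.BOM_UTF8 + b"abc").decode("my_utf_8")
encoded = "abc".encode("my_utf_8")
```

This keeps the BOM handling out of the calling code, at the cost of
every program needing to register the codec before using it.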
Another thought I had would require "the other guy" to update his code, but
at least it wouldn't make his work annoyingly difficult like my original
change might have.
Here's the basic outline:
- Add another decoder function that returns a 3-tuple
    decode3(input, errors='strict') => (data, consumed, had_bom)
  where had_bom is true if a leading BOM was seen and skipped
- then the usual decode is just something like
    def decode(input, errors='strict'):
        return decode3(input, errors)[:2]
- add a member variable and accessor to both the IncrementalDecoder and
  StreamReader classes, something like
    def had_bom(self):
        return self._had_bom
  and initialize/set the self._had_bom variable as required.
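A rough sketch of how the outline above could look, assuming Python 3,
bytes input, and the existing codecs.utf_8_decode function. Nothing here
is taken from an actual patch: decode3, had_bom and the private _had_bom
and _first attributes are hypothetical names, and the StreamReader side
is left out:

```python
import codecs


def decode3(input, errors="strict"):
    """Hypothetical: decode UTF-8, skipping and reporting a leading BOM."""
    had_bom = input[:3] == codecs.BOM_UTF8
    start = 3 if had_bom else 0
    data, consumed = codecs.utf_8_decode(input[start:], errors, True)
    # Count the skipped BOM bytes as consumed input.
    return data, consumed + start, had_bom


def decode(input, errors="strict"):
    # The usual 2-tuple interface, built on top of decode3.
    return decode3(input, errors)[:2]


class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    def __init__(self, errors="strict"):
        super().__init__(errors)
        self._first = True      # still looking at the start of the input
        self._had_bom = False

    def had_bom(self):
        return self._had_bom

    def reset(self):
        super().reset()
        self._first = True
        self._had_bom = False

    def _buffer_decode(self, input, errors, final):
        if self._first:
            if len(input) < 3 and codecs.BOM_UTF8.startswith(input) and not final:
                # Could still turn out to be a BOM; wait for more bytes.
                return "", 0
            self._first = False
            if input[:3] == codecs.BOM_UTF8:
                self._had_bom = True
                data, consumed = codecs.utf_8_decode(input[3:], errors, final)
                return data, consumed + 3
        return codecs.utf_8_decode(input, errors, final)
```

A caller that needs to echo the signature back could then check
had_bom() (or the third item from decode3) instead of looking for a
leading U+FEFF in the decoded text.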
This complicates the interface somewhat and requires some additional
documentation.
To document my original simple [-minded] idea would have required
possibly only a few more words in the existing paragraph
on utf_8_sig, to mention that both modules have the same
decoding behavior but different encoding.
I thought of a secondary consideration: If utf_8 and utf_8_sig are "almost
the same", it's possible that future refactoring might unify them with
differences contained in behavior flags (e.g., skip_leading_bom). The leading
bom processing might even be pushed into codecs.utf_8_decode for possible
minor advantages.
Is there anybody monitoring this who has an opinion on this?
..jim
----------
versions: +Python 2.6
__________________________________
Tracker <report at bugs.python.org>
<http://bugs.python.org/issue1328>
__________________________________