[issue1328] feature request: force BOM option

Tue Nov 20 04:39:57 CET 2007

James G. sack (jim) added the comment:

More discussion of utf_8.py decoding behavior (and possible change):

For my needs, I would like the decoding parts of the utf_8 module to treat 
an initial BOM as an optional signature and skip it if there is one (just 
like the utf_8_sig decoder). In fact I have a working patch that replaces 
the utf_8 decode, IncrementalDecoder and StreamReader components by 
direct transplants from utf_8_sig (as recently repaired -- there was a 
StreamReader error).
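
For reference, a small sketch of how the two existing decoders differ on 
the same input (written in Python 3 syntax; the variable names are only 
for illustration):

```python
import codecs

# Three-byte UTF-8 signature followed by ordinary text.
raw = codecs.BOM_UTF8 + u"abc".encode("utf_8")

# Plain utf_8 keeps the signature as a leading U+FEFF character.
plain = raw.decode("utf_8")

# utf_8_sig treats it as an optional signature and drops it.
sig = raw.decode("utf_8_sig")

# Without a BOM the two decoders agree.
same = b"abc".decode("utf_8") == b"abc".decode("utf_8_sig")
```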

However the reason for discussion is to ask how it might impact existing 
code.

I can imagine there might be utf_8 client code out there which expects to 
see a leading U+feff as (perhaps) a clue that the output should be returned 
with a BOM-signature (say) to accommodate the guessed input requirements of 
the remote correspondent.

Making my work easier might actually make someone else's work (probably, 
annoyingly) harder. 

So what to do?

I can just live with code like
  if input[:1] == u"\ufeff":
    input = input[1:]
spread around, and of course slightly different for incremental and stream 
inputs. 

  But I probably wouldn't. I would probably substitute a
  "my_utf_8" encoding to make my code a little cleaner.
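
  One way such a private codec might be put together, as a rough sketch 
  only: it borrows the decoding side from utf_8_sig and the encoding side 
  from utf_8, and registers the result under the name "my_utf_8". The 
  search function name is an assumption made up for this example.

```python
import codecs
import encodings.utf_8 as utf_8
import encodings.utf_8_sig as utf_8_sig


def _my_utf_8_search(name):
    """Codec search function (hypothetical name) that answers only for 'my_utf_8'."""
    if name != "my_utf_8":
        return None
    return codecs.CodecInfo(
        name="my_utf_8",
        # Encoding never writes a BOM, exactly like utf_8.
        encode=utf_8.encode,
        incrementalencoder=utf_8.IncrementalEncoder,
        streamwriter=utf_8.StreamWriter,
        # Decoding skips one optional leading BOM, exactly like utf_8_sig.
        decode=utf_8_sig.decode,
        incrementaldecoder=utf_8_sig.IncrementalDecoder,
        streamreader=utf_8_sig.StreamReader,
    )


codecs.register(_my_utf_8_search)

with_bom = (codecs.BOM_UTF8 + b"abc").decode("my_utf_8")
without_bom = b"abc".decode("my_utf_8")
encoded = u"abc".encode("my_utf_8")
```

  This keeps the BOM test out of the calling code, at the cost of a 
  nonstandard encoding name.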

Another thought I had would require "the other guy" to update his code, but 
at least it wouldn't make his work annoyingly difficult like my original 
change might have.

Here's the basic outline:

- Add another decoder function that returns a 3-tuple
  decode3(input, errors='strict') => (data, consumed, had_bom)
where had_bom is true if a leading bom was seen and skipped

- then the usual decode is just something like
  def decode(input, errors='strict'):
    return decode3(input, errors)[:2]

- add a member variable and an accessor to both the IncrementalDecoder and 
StreamReader classes, something like
  def had_bom(self):
    return self._had_bom
and initialize/set the self._had_bom variable as required.
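
The outline above could be sketched roughly as follows. This is not the 
patch itself: decode3 and had_bom are the names proposed above, while the 
class layout and the _first and _had_bom attributes are assumptions made 
for illustration. Only codecs.utf_8_decode, codecs.BOM_UTF8 and 
codecs.BufferedIncrementalDecoder are existing library pieces.

```python
import codecs


def decode3(input, errors='strict'):
    """Decode UTF-8, skip one leading BOM, and report whether it was there."""
    data = bytes(input)
    had_bom = data.startswith(codecs.BOM_UTF8)
    prefix = len(codecs.BOM_UTF8) if had_bom else 0
    text, consumed = codecs.utf_8_decode(data[prefix:], errors, True)
    # Count the skipped BOM bytes as consumed input.
    return text, consumed + prefix, had_bom


def decode(input, errors='strict'):
    """The usual 2-tuple interface, built on decode3."""
    return decode3(input, errors)[:2]


class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    def __init__(self, errors='strict'):
        codecs.BufferedIncrementalDecoder.__init__(self, errors)
        self._first = True      # still looking at the start of the input
        self._had_bom = False

    def _buffer_decode(self, input, errors, final):
        bom = codecs.BOM_UTF8
        if self._first:
            if len(input) < len(bom) and bom.startswith(input) and not final:
                # Could still turn out to be a BOM: wait for more bytes.
                return (u"", 0)
            self._first = False
            if input[:len(bom)] == bom:
                self._had_bom = True
                text, consumed = codecs.utf_8_decode(
                    input[len(bom):], errors, final)
                return text, consumed + len(bom)
        return codecs.utf_8_decode(input, errors, final)

    def reset(self):
        codecs.BufferedIncrementalDecoder.reset(self)
        self._first = True
        self._had_bom = False

    def had_bom(self):
        return self._had_bom


# A BOM split across two chunks is still recognized.
dec = IncrementalDecoder()
text = dec.decode(b"\xef\xbb") + dec.decode(b"\xbfhi", True)
```

A StreamReader would need the same flag and accessor; it is left out here 
to keep the sketch short.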

This complicates the interface somewhat and requires some additional 
documentation.

   To document my original simple [-minded] idea would have required 
   possibly only a few more words in the existing paragraph
   on utf_8_sig, to mention that both modules had the same 
   decoding behavior but different encoding.

I thought of a secondary consideration: If utf_8 and utf_8_sig are "almost 
the same", it's possible that future refactoring might unify them with 
differences contained in behavior flags (e.g., skip_leading_bom). The leading 
bom processing might even be pushed into codecs.utf_8_decode for possible 
minor advantages. 
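
To make the refactoring idea concrete, here is one possible shape for it, 
offered only as a sketch. The function names utf8_decode and utf8_encode 
and the write_bom flag are invented for this example; skip_leading_bom is 
the flag name suggested above.

```python
import codecs


def utf8_decode(input, errors='strict', skip_leading_bom=False):
    """One decoder for both codecs; the flag selects utf_8_sig behavior."""
    data = bytes(input)
    prefix = 0
    if skip_leading_bom and data.startswith(codecs.BOM_UTF8):
        prefix = len(codecs.BOM_UTF8)
    text, consumed = codecs.utf_8_decode(data[prefix:], errors, True)
    return text, consumed + prefix


def utf8_encode(input, errors='strict', write_bom=False):
    """One encoder for both codecs; the flag selects utf_8_sig behavior."""
    data, consumed = codecs.utf_8_encode(input, errors)
    if write_bom:
        data = codecs.BOM_UTF8 + data
    return data, consumed


kept = utf8_decode(b"\xef\xbb\xbfa")
skipped = utf8_decode(b"\xef\xbb\xbfa", skip_leading_bom=True)
signed = utf8_encode(u"a", write_bom=True)
```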

Is there anybody monitoring this who has an opinion on this? 

..jim

----------
versions: +Python 2.6

__________________________________
Tracker <report at bugs.python.org>
<http://bugs.python.org/issue1328>
__________________________________