[I18n-sig] UTF-8 and BOM

Paul Prescod paulp@ActiveState.com
Wed, 16 May 2001 12:26:41 -0700


"M.-A. Lemburg" wrote:
> 
>...
> 
> BOMs are standard Unicode char points, so they are legal in all
> Unicode encodings.

My point is that it is legal to interpret U+FEFF as a BOM and not just
as a character.

>...
> Uhm, I can't follow you here... BOMs in UTF-8 look like this:
> 
> >>> u'\ufeff'.encode('utf-8')
> '\xef\xbb\xbf'
> 
> which is somewhat different from '\xff\xfe' or '\xfe\xff'.

That's what's great about it!
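
A minimal sketch of why the distinct byte pattern is handy: because the
UTF-8 form of U+FEFF cannot be confused with either UTF-16 BOM, a reader
can guess the encoding from the first few bytes. This assumes a current
Python 3 interpreter rather than the Python of this thread. The helper
name sniff_bom is hypothetical, and it only knows the three signatures
discussed here.

```python
import codecs

def sniff_bom(data):
    """Guess an encoding from a leading BOM.

    Hypothetical helper for illustration only. Returns a tuple
    (encoding_name, bom_length), or (None, 0) if no known BOM is found.
    Only the UTF-8 and UTF-16 signatures from this thread are checked,
    so UTF-32 input is not handled.
    """
    signatures = [
        (codecs.BOM_UTF8, "utf-8"),         # b'\xef\xbb\xbf'
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # b'\xff\xfe'
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # b'\xfe\xff'
    ]
    for bom, name in signatures:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# U+FEFF encoded as UTF-8 is exactly the UTF-8 signature.
utf8_bom = "\ufeff".encode("utf-8")
print(utf8_bom == codecs.BOM_UTF8)

print(sniff_bom(utf8_bom + b"abc"))    # UTF-8 signature, 3 bytes long
print(sniff_bom(b"\xff\xfea\x00"))     # little-endian UTF-16
print(sniff_bom(b"abc"))               # no BOM at all
```

Whether to strip the signature or keep it as a character is still the
caller's decision; the sketch only reports what it found.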

>...
> >>> u'\ufeff'.encode('utf-16')
> '\xff\xfe\xff\xfe'

It is curious that decoding this removes both FEFF characters. Is it
right that the decoder removes all BOM sequences?

>>> import codecs
>>> codecs.utf_16_decode(codecs.BOM*10 + "a".encode("UTF-16") + codecs.BOM*10)
(u'a', 44)
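
For comparison, here is a hedged sketch of the same experiment on a
current Python 3 interpreter, which is an assumption and not the
interpreter used above. As far as I can tell, the UTF-16 decoder there
consumes only the first BOM to pick the byte order, and every later
U+FEFF survives as an ordinary character, so the result differs from
the transcript above.

```python
import codecs

# Same input as above: 10 BOMs, then "a" encoded with its own BOM,
# then 10 more BOMs. That is 20 + 4 + 20 = 44 bytes.
data = codecs.BOM * 10 + "a".encode("utf-16") + codecs.BOM * 10

text, consumed = codecs.utf_16_decode(data)

# Count how many U+FEFF characters are left in the decoded text.
remaining = text.count("\ufeff")
print(len(data), consumed, len(text), remaining)

# If the caller wants the behaviour shown in the transcript above
# (no U+FEFF left at all), it can drop them explicitly.
stripped = text.replace("\ufeff", "")
print(repr(stripped))
```

Dropping every U+FEFF by hand, as in the last step, makes the policy
explicit instead of leaving it to the codec.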

-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook