Eclipse/PyDev - BOM Lexical Error

Ethan Furman ethan at stoneleaf.us
Fri Oct 8 11:58:51 EDT 2010


Lawrence D'Oliveiro wrote:
> In message <87hbgyosdc.fsf at web.de>, Diez B. Roggisch wrote:
> 
>> Lawrence D'Oliveiro <ldo at geek-central.gen.new_zealand> writes:
>>
>>> In message <87d3rorf2f.fsf at web.de>, Diez B. Roggisch wrote:
>>>
>>>> Lawrence D'Oliveiro <ldo at geek-central.gen.new_zealand> writes:
>>>>
>>>>> What exactly is the point of a BOM in a UTF-8-encoded file?
>>>> It's a marker like the "coding: utf-8" in python-files. It tells the
>>>> software aware of it that the content is UTF-8.
>>> But if the software is aware of it, then why does it need to be told?
>> Let me rephrase: windows editors such as notepad recognize the BOM, and
>> then assume (hopefully rightfully so) that the rest of the file is text
>> in utf-8 encoding.
> 
> But they can only recognize it as a BOM if they assume UTF-8 encoding to 
> begin with. Otherwise it could be interpreted as some other coding.

Not so.  The first three bytes are a flag that gets checked before any 
encoding is assumed.  For example, in a .dbf file, the first byte 
determines what type of dbf the file is: \x03 = dBase III, \x83 = 
dBase III with memos, etc.  More checking should naturally be done to 
ensure the rest of the fields make sense for the dbf type specified.
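
As a rough illustration of that kind of flag check, here is a minimal 
sketch in Python.  The table and the dbf_type() helper are hypothetical 
names made up for this example, and only the two type bytes mentioned 
above are listed; a real reader would know more types and would go on 
to validate the rest of the header.

```python
# Hypothetical lookup table: only the two type bytes named above.
DBF_TYPES = {
    0x03: "dBase III",
    0x83: "dBase III with memos",
}

def dbf_type(header: bytes) -> str:
    """Guess the dbf flavour from the first byte of the file header."""
    if not header:
        return "empty"
    # Indexing a bytes object yields an int, which is the dict key.
    return DBF_TYPES.get(header[0], "unknown")
```

The point is that the check looks at raw byte values only; no text 
encoding has to be assumed to read the flag.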

MS decided that if the first three bytes are \xEF \xBB \xBF then it's a 
UTF-8 file, and if they are not, don't open it with an MS product. 
Likewise, MS editors such as Notepad add those bytes to UTF-8 files 
they save.
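
The same byte-level check can be sketched in Python.  This is only an 
illustration, not how any MS product does it; has_utf8_bom() and 
decode_text() are hypothetical helper names, while codecs.BOM_UTF8 and 
the "utf-8-sig" codec come from the standard library.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def decode_text(data: bytes) -> str:
    """Decode UTF-8 data, dropping a leading BOM if one is present."""
    # 'utf-8-sig' skips the BOM when it is there and otherwise
    # behaves like plain 'utf-8'.
    return data.decode("utf-8-sig")
```

Reading a file with open(path, encoding="utf-8-sig") has the same 
effect, so a consumer that does not want the BOM need not strip it by 
hand.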

Naturally, this causes problems for non-MS tools that do not expect the 
extra bytes, but anybody who's had to work with both MS and non-MS 
platforms/products/methodologies knows that MS does not play well with 
others.

~Ethan~


