[Tutor] name shortening in a csv module output

Steven D'Aprano steve at pearwood.info
Fri Apr 24 03:29:38 CEST 2015


On Thu, Apr 23, 2015 at 10:08:05PM +0100, Mark Lawrence wrote:

> Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :)

:-)

I'm not sure if you mean that as a serious question or not. 

BOM stands for Byte Order Mark, and it is needed for UTF-16 and UTF-32 
encodings because there is some ambiguity due to hardware differences.

UTF-16 uses two byte "code units", that is, each character is 
represented by two bytes (or four, but we can ignore that). Similarly, 
UTF-32 uses four byte code units. The problem is, different platforms 
link multiple bytes in opposite order: "big-endian" and "little-endian". 
(In the Bad Old Days, there were "middle-endian" platforms too, and you 
really don't want to know about them.) For example, the two-byte 
quantity 01FE (in hex) might have:

(1) the 01 byte at address 100 and the FE byte at address 101 
    (big end first)

(2) the 01 byte at address 101 and the FE byte at address 100
    (little end first)


We always write the two bytes as 01FE in hexadecimal notation, with the 
"little end" (units) on the right, just as we do with decimals: 

01FE = E units + F sixteens + 1 sixteen-squares + 0 sixteen-cubes 

is the same as 510 in decimal:

510 = 0 units + 1 ten + 5 ten-squares

but in computer memory those two bytes could be stored either big-end 
first:

    01 has the lower address
    FE has the higher address

or little-end first:

    FE has the lower address
    01 has the higher address
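
If you want to see the two layouts for yourself, here is a small sketch 
using int.to_bytes, which takes the byte order as an argument. The 
variable names are just mine, for illustration:

```python
import sys

value = 0x01FE          # the two-byte quantity from the example
assert value == 510     # the same number in decimal

big = value.to_bytes(2, 'big')        # 01 at the lower address, then FE
little = value.to_bytes(2, 'little')  # FE at the lower address, then 01

print(big.hex(), little.hex())
print(sys.byteorder)    # 'little' or 'big', depending on your machine
```

The last line reports which order your own hardware uses.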


So when you read a text file, and you see two bytes 01FE, unless you 
know whether that came from a big-endian system or a little-endian 
system, you don't know how to interpret it:

- If both the sender and the receiver are big-endian, or little-endian, 
  then two bytes 01 followed by FE represent 01FE or 510 in decimal, 
  which is LATIN CAPITAL LETTER O WITH STROKE AND ACUTE in UTF-16.

- If the sender and the receiver are opposite endians (one big-, 
  the other little-, it doesn't matter which) then the two bytes 01 
  followed by FE will be seen as FE01 (65025 in decimal) which is 
  VARIATION SELECTOR-2 in UTF-16.
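
The two readings can be reproduced with the explicit-endian codecs. This 
is only a sketch, and it assumes the bytes arrive as 01 first, then FE:

```python
import unicodedata

raw = b'\x01\xfe'   # the two bytes as they arrive, 01 first

same = raw.decode('utf-16-be')      # read big end first: 01FE
swapped = raw.decode('utf-16-le')   # read little end first: FE01

for ch in (same, swapped):
    # Show the code point in hex and decimal, plus its Unicode name.
    print('U+%04X' % ord(ch), ord(ch), unicodedata.name(ch))
```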

I've deliberately used the terms "sender" and "receiver", because this 
problem is fundamental to networking too. Whenever you transfer data 
from a Motorola-based platform to an Intel-based platform, the byte 
order has to be swapped like this. If it's not swapped, the data looks 
weird. (TIFF files also have to take endian issues into account.)

UTF-16 and UTF-32 solve this problem by putting a Byte Order Mark at the 
start of the file. The BOM is two bytes in the case of UTF-16 (a single 
code-unit):

py> ''.encode('utf-16')
b'\xff\xfe'

So when you read a UTF-16 file back on a little-endian machine (like 
the one used for the example above), if it starts with two bytes FFFE 
(in hex), you can continue. If it starts with FEFF, then the bytes have 
been reversed.
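
If you ever do need to check by hand, the codecs module has constants 
for both BOMs. Here is a rough sketch; sniff_utf16_bom is a made-up 
helper name, not something from the standard library:

```python
import codecs

def sniff_utf16_bom(data):
    """Return 'little', 'big', or None based on the first two bytes."""
    # Caveat: the UTF-32 little-endian BOM is FF FE 00 00, which also
    # starts with FF FE. This sketch assumes the data is UTF-16.
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return 'little'
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return 'big'
    return None

print(sniff_utf16_bom(b'\xff\xfeh\x00i\x00'))   # BOM says little-endian
print(sniff_utf16_bom(b'\xfe\xff\x00h\x00i'))   # BOM says big-endian
print(sniff_utf16_bom(b'h\x00i\x00'))           # no BOM at all
```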

If you treat the file as Latin-1 (one of the many versions of "extended 
ASCII"), then you will see the BOM as either:

ÿþ  # byte order matches

þÿ  # byte order doesn't match


Normally you don't need to care about any of this! The encoding handles 
all the gory details. You just pick the right encoding:

- UTF-16 if the file will have a BOM;

- UTF-16BE if it is big-endian, and doesn't have a BOM;

- UTF-16LE if it is little-endian, and doesn't have a BOM;

and the encoding deals with the BOM.
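
To make that concrete, here is a short sketch of the three encodings on 
a two-letter string. The sample text is arbitrary, and the bytes you get 
from plain 'utf-16' depend on your machine's byte order:

```python
text = 'hi'

with_bom = text.encode('utf-16')   # BOM first, then native byte order
be = text.encode('utf-16-be')      # no BOM, big end first
le = text.encode('utf-16-le')      # no BOM, little end first

print(with_bom, be, le)

# Decoding with plain 'utf-16' reads the BOM, picks the byte order,
# and drops the BOM from the result.
print(with_bom.decode('utf-16'))
print((b'\xfe\xff' + be).decode('utf-16'))

# Decoding BOM-prefixed data with an explicit byte order leaves the
# BOM in the string as a U+FEFF character.
leftover = (b'\xff\xfe' + le).decode('utf-16-le')
print(ascii(leftover))
```

So if you see a stray U+FEFF at the start of your text, the likely cause 
is picking UTF-16LE or UTF-16BE for a file that really has a BOM.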




-- 
Steve

