[Tutor] name shortening in a csv module output
Steven D'Aprano
steve at pearwood.info
Fri Apr 24 03:29:38 CEST 2015
On Thu, Apr 23, 2015 at 10:08:05PM +0100, Mark Lawrence wrote:
> Slight aside, why a BOM, all I ever think of is Inspector Clouseau? :)
:-)
I'm not sure if you mean that as a serious question or not.
BOM stands for Byte Order Mark, and it is needed for UTF-16 and UTF-32
encodings because there is some ambiguity due to hardware differences.
UTF-16 uses two byte "code units", that is, each character is
represented by two bytes (or four, but we can ignore that). Similarly,
UTF-32 uses four byte code units. The problem is, different platforms
store multi-byte values in opposite orders: "big-endian" and "little-endian".
(In the Bad Old Days, there were "middle-endian" platforms too, and you
really don't want to know about them.) For example, the two-byte
quantity 01FE (in hex) might have:
(1) the 01 byte at address 100 and the FE byte at address 101
(big end first)
(2) the 01 byte at address 101 and the FE byte at address 100
(little end first)
We always write the two bytes as 01FE in hexadecimal notation, with the
"little end" (units) on the right, just as we do with decimals:
01FE = E units + F sixteens + 1 sixteen-squares + 0 sixteen-cubes
is the same as 510 in decimal:
510 = 0 units + 1 ten + 5 ten-squares
but in computer memory those two bytes could be stored either big-end
first:
01 has the lower address
FE has the higher address
or little-end first:
FE has the lower address
01 has the higher address
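
As a small sketch of the two layouts (assuming Python 3, and using the
same 01FE value as above), int.to_bytes lets you ask for either byte
order explicitly:

```python
value = 0x01FE  # 510 in decimal

# Lay the same number out in memory both ways.
big = value.to_bytes(2, byteorder="big")        # 01 first, then FE
little = value.to_bytes(2, byteorder="little")  # FE first, then 01

print(big.hex())     # 01fe
print(little.hex())  # fe01

# Reading the big-endian bytes back with the wrong byte order
# gives a different number.
print(int.from_bytes(big, byteorder="big"))     # 510
print(int.from_bytes(big, byteorder="little"))  # 65025
```

The number itself never changes; only the order of the bytes in memory
(or in the file) does.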
So when you read a text file, and you see two bytes 01FE, unless you
know whether that came from a big-endian system or a little-endian
system, you don't know how to interpret it:
- If the sender and the receiver use the same byte order (both big-endian,
or both little-endian), the receiver reads back the value the sender
wrote: 01FE, or 510 in decimal, which is
LATIN CAPITAL LETTER O WITH STROKE AND ACUTE in UTF-16.
- If the sender and the receiver are opposite endians (one big-,
the other little-, it doesn't matter which) then the value the sender
wrote as 01FE will be seen as FE01 (65025 in decimal), which is
VARIATION SELECTOR-2 in UTF-16.
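
You can check the two readings yourself. This is only a sketch,
assuming Python 3: it decodes the same two bytes with the explicit
big-endian and little-endian UTF-16 codecs and looks the results up
with unicodedata:

```python
import unicodedata

raw = b"\x01\xfe"  # the byte 01 followed by the byte FE

# Same bytes, two different assumptions about byte order.
as_big = raw.decode("utf-16-be")
as_little = raw.decode("utf-16-le")

print(hex(ord(as_big)), unicodedata.name(as_big))
print(hex(ord(as_little)), unicodedata.name(as_little))
```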
I've deliberately used the terms "sender" and "receiver", because this
problem is fundamental to networking too. Whenever you transfer data
from a Motorola based platform to an Intel based platform, the byte
order has to be swapped like this. If it's not swapped, the data looks
weird. (TIFF files also have to take endian issues into account.)
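
For the networking case, here is a minimal sketch using the struct
module (assuming Python 3). The "!" prefix asks for network byte order,
which by convention is big-endian, so both ends agree no matter what
their hardware does:

```python
import struct

value = 0x01FE

# "!H" = one unsigned 16-bit integer in network (big-endian) order.
wire = struct.pack("!H", value)
print(wire.hex())  # 01fe

# A receiver that also uses "!" gets the original value back.
(received,) = struct.unpack("!H", wire)
print(received)  # 510

# A receiver that wrongly assumes little-endian sees the swapped value.
(swapped,) = struct.unpack("<H", wire)
print(swapped)  # 65025
```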
UTF-16 and UTF-32 solve this problem by putting a Byte Order Mark at the
start of the file. The BOM is two bytes in the case of UTF-16 (a single
code-unit):
py> ''.encode('utf-16')
b'\xff\xfe'
So when you read a UTF-16 file back on a machine like the one that
produced the output above (a little-endian one), if it starts with the
two bytes FFFE (in hex), you can continue. If it starts with FEFF, then
the bytes have been reversed.
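
If you ever want to peek at the BOM by hand, something like the
following would do. It is a rough sketch, not how the codec itself is
implemented: sniff_utf16_byte_order is a made-up helper name, and it
assumes you already know the data is UTF-16 (a UTF-32 little-endian BOM
starts with the same two bytes).

```python
import codecs


def sniff_utf16_byte_order(data):
    """Hypothetical helper: guess the byte order from a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return "big"
    return None  # no BOM, so the bytes alone cannot tell us


print(sniff_utf16_byte_order(b"\xff\xfeh\x00i\x00"))  # little
print(sniff_utf16_byte_order(b"\xfe\xff\x00h\x00i"))  # big
print(sniff_utf16_byte_order(b"h\x00i\x00"))          # None
```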
If you treat the file as Latin-1 (one of the many versions of "extended
ASCII"), then you will see the BOM as either:
ÿþ # byte order matches
þÿ # byte order doesn't match
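
Those two strings are easy to reproduce (a quick sketch, assuming
Python 3) by decoding the BOM constants from the codecs module as
Latin-1:

```python
import codecs

# Latin-1 maps every byte to the character with the same code point.
le_view = codecs.BOM_UTF16_LE.decode("latin-1")  # bytes FF FE
be_view = codecs.BOM_UTF16_BE.decode("latin-1")  # bytes FE FF

print(le_view)  # ÿþ
print(be_view)  # þÿ
```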
Normally you don't need to care about any of this! The encoding handles
all the gory details. You just pick the right encoding:
- UTF-16 if the file will have a BOM;
- UTF-16BE if it is big-endian, and doesn't have a BOM;
- UTF-16LE if it is little-endian, and doesn't have a BOM;
and the encoding deals with the BOM.
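
Here is a short sketch of how the three choices behave (assuming
Python 3; the sample text "hi" is just for illustration):

```python
import codecs

text = "hi"

with_bom = text.encode("utf-16")   # BOM first, then native byte order
big = text.encode("utf-16-be")     # no BOM, big-endian
little = text.encode("utf-16-le")  # no BOM, little-endian

print(big)     # b'\x00h\x00i'
print(little)  # b'h\x00i\x00'

# The plain "utf-16" decoder reads the BOM, picks the byte order from
# it, and strips it from the result.
print(with_bom.decode("utf-16"))                     # hi
print((codecs.BOM_UTF16_BE + big).decode("utf-16"))  # hi

# The explicit -BE/-LE decoders do not expect a BOM; if one is present
# it is kept in the text as the character U+FEFF.
kept = (codecs.BOM_UTF16_BE + big).decode("utf-16-be")
print(repr(kept))  # '\ufeffhi'
```

So if you pick UTF-16BE or UTF-16LE for a file that does have a BOM,
expect a stray U+FEFF at the start of the text.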
--
Steve