Issue 555360: UTF-16 BOM handling counterintuitive



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/36597

classification

Title:	UTF-16 BOM handling counterintuitive
Type:		Stage:
Components:	Unicode	Versions:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	doerwalter	Nosy List:	doerwalter, lemburg, yaseppochi
Priority:	low	Keywords:

Created on 2002-05-13 09:21 by yaseppochi, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name	Uploaded	Description
diff.txt	doerwalter, 2002-06-03 14:54
diff2.txt	doerwalter, 2002-06-03 15:41

Messages (18)
msg10743	Author: Stephen J. Turnbull (yaseppochi)	Date: 2002-05-13 09:21
A search on "Unicode BOM" doesn't turn up anything related. Sorry, I don't have a 2.2 or CVS to hand. Easy enough to replicate, anyway.

The UTF-16 codec happily corrupts files by writing another BOM before the encoded text when a file is opened for appending:

    bash-2.05a$ python
    Python 2.1.3 (#1, Apr 20 2002, 10:14:34)
    [GCC 2.95.4 20011002 (Debian prerelease)] on linux2
    Type "copyright", "credits" or "license" for more information.
    >>> import codecs
    >>> f = codecs.open("/tmp/utf16","w","utf-16")
    >>> f.write(u"a")
    >>> f.close()
    >>> f = codecs.open("/tmp/utf16","a","utf-16")
    >>> f.write(u"a")
    >>> f.close()
    >>> f = open("/tmp/utf16","r")
    >>> f.read()
    '\xff\xfea\x00\xff\xfea\x00'

Oops.

Also, dir(codecs) shows BOM64* constants are defined (to what purpose, I have no idea---Microsoft Word files on Alpha, maybe?), but no BOM8, which actually has some basis in the standards. (I think the idea of a UTF-8 signature is an abomination, so you can leave it out<wink>, but people who do use the BOM as signature in UTF-8 files would find it useful.)

Hmm ...

    >>> codecs.BOM_BE
    '\xfe\xff'
    >>> codecs.BOM64_BE
    '\x00\x00\xfe\xff'
    >>> codecs.BOM32_BE
    '\xfe\xff'
    >>>

Urk! I only count 32 bits in BOM64 and 16 bits in BOM32! Maybe BOM32_* was intended as an alias for BOM_, and BOM64_ was a continuation of the typo, as it were?

I wonder if this is the right interface, actually. Wouldn't prefixBOM() and checkBOM() methods for streams and strings make more sense? prefixBOM should be idempotent, and checkBOM would return either a codec (with size and endianness determined) or a codec.BOM* constant.
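The prefixBOM()/checkBOM() idea from the message above could look roughly like the following. This is only a sketch, assuming Python 3 and byte strings; the names `check_bom`, `prefix_bom` and `_SIGNATURES` are hypothetical and are not part of the codecs module. The BOM constants it uses are the ones the codecs module provides today.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so checking UTF-16 first would misreport UTF-32 LE data.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def check_bom(data):
    """Return (bom, codec_name) for a leading signature, or (b"", None)."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return bom, name
    return b"", None


def prefix_bom(data, bom):
    """Prepend bom unless data already starts with it (idempotent)."""
    return data if data.startswith(bom) else bom + data


# Applying prefix_bom twice gives the same bytes as applying it once.
once = prefix_bom("a".encode("utf-16-le"), codecs.BOM_UTF16_LE)
twice = prefix_bom(once, codecs.BOM_UTF16_LE)
```

Detection by signature is a heuristic: UTF-16 LE text whose first character after the BOM is U+0000 has the same leading bytes as the UTF-32 LE BOM, so a caller who already knows the encoding family should say so rather than guess.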
msg10744	Author: Walter Dörwald (doerwalter) *	Date: 2002-05-13 20:43
And if you're using a different encoding for the second open call, the data will really be corrupted:

    f = codecs.open("/tmp/foo","w","utf-8")
    f.write("ää")
    f = codecs.open("/tmp/foo","a","latin-1")
    f.write("ää")

But how should codecs.open be able to determine that the file is always opened with the same encoding, or which encoding was used for the open call last time? And if it could, would it have to read the content using the old encoding and rewrite it using the new encoding to keep the file consistent?

I agree that the BOM names are broken.

> I wonder if this is the right interface, actually.
> Wouldn't prefixBOM() and checkBOM() methods for streams
> and strings make more sense? prefixBOM should be
> idempotent, and checkBOM would return either a codec
> (with size and endianness determined) or a codec.BOM*
> constant.

You should consider UTF-16 to be a stateful encoding, so if you want to do your output in multiple pieces you have to use a stateful encoder, i.e. a StreamWriter:

    >>> import codecs, cStringIO as StringIO
    >>> stream = StringIO.StringIO()
    >>> writer = codecs.getwriter("utf-16")(stream)
    >>> writer.write(u"a")
    >>> writer.write(u"b")
    >>> stream.getvalue()
    '\xff\xfea\x00b\x00'
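The stateless versus stateful distinction can be sketched in Python 3 as well, where cStringIO no longer exists and io.BytesIO takes its place. The variable names are illustrative only; the comparison uses codecs.BOM_UTF16, the native-order mark, so it does not depend on the machine's endianness.

```python
import codecs
import io

# Stateless: each encode() call is independent and emits its own BOM.
stateless = "a".encode("utf-16") + "b".encode("utf-16")

# Stateful StreamWriter: the BOM is written on the first write only.
stream = io.BytesIO()
writer = codecs.getwriter("utf-16")(stream)
writer.write("a")
writer.write("b")
streamed = stream.getvalue()

# An incremental encoder keeps the same kind of state without a stream.
encoder = codecs.getincrementalencoder("utf-16")()
incremental = encoder.encode("a") + encoder.encode("b", final=True)
```

Decoding `stateless` as UTF-16 yields three characters, because the second mark is read back as U+FEFF in the middle of the text.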
msg10745	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-06-02 17:16
I agree that opening a file in append mode should probably be smarter in the sense that the BOM is only written in case file.seek() points to the beginning of the file; patches are welcome.

On the other points:

* I don't see the point of adding an 8-bit BOM mark (UTF-8 does not depend on byte order).
* The 32 vs. 16 refer to the number of bits in the Unicode internal type; they don't refer to the number of bits in the mark.
msg10746	Author: Walter Dörwald (doerwalter) *	Date: 2002-06-03 11:54
> The 32 vs. 16 refer to the number of bits in the
> Unicode internal type; they don't refer to the number of
> bits in the mark.

Yes, but unfortunately the constants in codecs are not BOM32_?? and BOM16_??, but BOM64_?? and BOM32_??.
msg10747	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-06-03 12:10
Hmm, you're right. Something is wrong here. Perhaps we should add aliases called BOM_UCS2_* and BOM_UCS4_* and update the documentation accordingly?

About the append mode: is file.mode considered to be part of the file interface or not... I think that a patch for the UTF-16 codec should check for this attribute on the stream object to tell whether or not to prepend the BOM mark.
msg10748	Author: Walter Dörwald (doerwalter) *	Date: 2002-06-03 12:38
> Perhaps we should add aliases called BOM_UCS2_*
> and BOM_UCS4_* and update the documentation
> accordingly ?

Sounds reasonable! If you want, I'll change it (including documentation).

About the append mode: I don't think it's a good idea to try to fix this. There is much that can go wrong: seeking an odd number of bytes, mixed endianness on writes, using a different encoding on the second write. And what about UTF-8 and UTF-7? What should happen if the user seeks into the middle of a UTF-[78] byte sequence and starts to write?
msg10749	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-06-03 12:57
Ok, please do and then close the bug.

About the append mode: I think you're right. It's not worth the trouble. Applications can easily figure this out for themselves (and then use the proper non-BOM prepending codec name).
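One way an application might "figure this out for itself" is sketched below, assuming Python 3 and a file that is either empty or already UTF-16 with a BOM. The helper name `append_utf16` is hypothetical. It reads the existing mark to choose the byte order, then appends with the matching BOM-less codec ("utf-16-le" or "utf-16-be"); only a new or empty file gets the plain "utf-16" codec, which writes the mark.

```python
import codecs
import os
import tempfile


def append_utf16(path, text):
    """Append text to a UTF-16 file, writing a BOM only if the file is empty."""
    try:
        with open(path, "rb") as f:
            head = f.read(2)  # the existing BOM, if there is one
    except FileNotFoundError:
        head = b""

    if not head:
        encoding = "utf-16"  # new or empty file: let the codec write the BOM
    elif head == codecs.BOM_UTF16_LE:
        encoding = "utf-16-le"  # match the existing byte order, no BOM
    elif head == codecs.BOM_UTF16_BE:
        encoding = "utf-16-be"
    else:
        raise ValueError("existing file does not start with a UTF-16 BOM")

    with open(path, "ab") as f:
        f.write(text.encode(encoding))


# Repeat the two-step write from the original report.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "utf16")
    append_utf16(path, "a")
    append_utf16(path, "a")
    with open(path, "rb") as f:
        data = f.read()
```

The sketch does not address the other failure modes raised in the thread, such as a file that was written in a different encoding or one that ends in the middle of a code unit.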
msg10750	Author: Walter Dörwald (doerwalter) *	Date: 2002-06-03 13:35
codecs.py says:

    BOM = struct.pack('=H', 0xFEFF)

This is not correct in a wide build. Should this be changed to

    if sys.maxunicode>0xffff:
        BOM = struct.pack('=L', 0x0000FEFF)
    else:
        BOM = struct.pack('=H', 0xFEFF)

(with two additional constants BOM_UCS2 and BOM_UCS4?)
msg10751	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-06-03 13:58
The only requirement we have for BOM is that it matches the BOM mark which actually gets written to the file.

The purpose of BOM_UCS2_ and BOM_UCS4_ is to be able to figure out which underlying Unicode version was used. I'm not even sure whether there's a standard for this on 64-bit machines; could be that Microsoft invented something here... (maybe that's also where the old names originated, I don't know)
msg10752	Author: Walter Dörwald (doerwalter) *	Date: 2002-06-03 14:54
> The only requirement we have for BOM is that it matches
> the BOM mark which actually gets written to the file.

But this is independent from the internal byte size of the character type. UTF-16 always writes two bytes (except for surrogates). The attached diff.txt shows what I think the BOM stuff should look like.
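The point that the mark's size follows the encoding rather than the interpreter's internal character width can be checked with a few lines. This sketch assumes Python 3, where the narrow/wide build distinction no longer exists; the variable names are illustrative only.

```python
import codecs
import struct

# The mark is U+FEFF packed at the width of the encoding's code unit,
# in native byte order ('=' selects standard sizes: H is 2 bytes, L is 4).
bom16 = struct.pack("=H", 0xFEFF)
bom32 = struct.pack("=L", 0xFEFF)

# UTF-16 uses one 16-bit unit per BMP character and a surrogate pair
# (two units, four bytes) for characters beyond U+FFFF.
bmp = "a".encode("utf-16-le")
astral = "\U00010000".encode("utf-16-le")
```

Here `bom16` and `bom32` correspond to the native-order constants codecs.BOM_UTF16 and codecs.BOM_UTF32.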
msg10753	Author: Walter Dörwald (doerwalter) *	Date: 2002-06-03 15:41
In old code BOM, BOM_LE and BOM_BE are all 16-bit, so to be backwards compatible maybe the attached patch diff2.txt should be applied instead. But this feels strange, because I'd expect that for a --enable-unicode=ucs2 build codecs.BOM==codecs.BOM_UCS2 and for an --enable-unicode=ucs4 build codecs.BOM==codecs.BOM_UCS4.
msg10754	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-06-03 17:02
Googling around a bit, I can't find a single reference to something like a special BOM mark on 64-bit machines. Perhaps this was just some wild idea which has no real meaning?

Hmm, it could have a meaning for UTF-32... but then it should really be BOM_UTF16_ vs. BOM_UTF32_... and have nothing to do with the internal storage format.
msg10755	Author: Walter Dörwald (doerwalter) *	Date: 2002-06-03 17:19
So should I name them BOM_UTF16_* and BOM_UTF32_*? (IMHO it makes much more sense this way)

Maybe Python should get a UTF-32 codec (see http://www.unicode.org/unicode/reports/tr19/)?
msg10756	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-06-03 18:39
Yes, please (but do leave the existing ones around for backward compatibility).

About the UTF-32 codec: sure, why not. Patches are welcome!
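The naming scheme agreed on here can be inspected in a current interpreter. This is a small sketch, assuming a CPython 3 release in which the legacy names are still kept as aliases of the BOM_UTF16_* and BOM_UTF32_* constants; the `legacy` dictionary is purely illustrative.

```python
import codecs

# Map each legacy name to the encoding-based name it is expected to alias.
legacy = {
    "BOM": "BOM_UTF16",
    "BOM_LE": "BOM_UTF16_LE",
    "BOM_BE": "BOM_UTF16_BE",
    "BOM32_LE": "BOM_UTF16_LE",
    "BOM32_BE": "BOM_UTF16_BE",
    "BOM64_LE": "BOM_UTF32_LE",
    "BOM64_BE": "BOM_UTF32_BE",
}

# True for every legacy name that holds the same bytes as its new name.
matches = {
    old: getattr(codecs, old) == getattr(codecs, new)
    for old, new in legacy.items()
}
```

New code would normally use only the BOM_UTF8, BOM_UTF16_* and BOM_UTF32_* names, since the numbers in the legacy names do not describe the size of the mark.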
msg10757	Author: Stephen J. Turnbull (yaseppochi)	Date: 2002-06-04 02:09
The reason for a BOM8 is for use as a _signature_, cf. ISO/IEC 10646-1, Annex F, as Amended by Amendment 2. Implementers of PEP 263 and those who have to interchange with MS Notepad and other such applications that use a leading ZERO-WIDTH NO-BREAK SPACE as a Unicode signature may find it convenient.

The name BOM8 is for consistency with the other signatures. Of course you could trash _all_ the BOM names in favor of "SIGNATURE_UTF(8\|16\|32)(_[BL]E)?", which applies in all cases.
msg10758	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-06-04 07:19
Stephen, what would your BOM8 look like? As explained earlier, the two constants are there for checking which signature was used, not so much for generating it (since this is up to the UTF-16/32 codecs).

UTF-8 doesn't need a BOM. Still, it can be used as a signature, so I'd say we add BOM_UTF8_ = '\xef\xbb\xbf' as well.
msg10759	Author: Stephen J. Turnbull (yaseppochi)	Date: 2002-06-04 09:50
"My" BOM8 would look exactly as you give it: '\xef\xbb\xbf'

This would be useful in the same kinds of contexts as the BOM16/32 variants, I would think.
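The three bytes discussed above are simply U+FEFF (ZERO-WIDTH NO-BREAK SPACE) encoded as UTF-8. A short sketch, assuming Python 3, shows the constant used as a signature and the difference between the plain "utf-8" codec and the "utf-8-sig" codec, which was added to Python after this thread; the variable names are illustrative only.

```python
import codecs

# The UTF-8 signature is the UTF-8 encoding of U+FEFF.
signature = "\ufeff".encode("utf-8")

# A signed UTF-8 byte string, as some editors write it.
signed = codecs.BOM_UTF8 + "abc".encode("utf-8")

# The plain codec keeps the signature as a character; utf-8-sig strips it.
plain_decode = signed.decode("utf-8")
sig_decode = signed.decode("utf-8-sig")

# Encoding with utf-8-sig writes the signature.
sig_encode = "abc".encode("utf-8-sig")
```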
msg10760	Author: Walter Dörwald (doerwalter) *	Date: 2002-06-04 15:19
Checked in as:

    Misc/NEWS 1.413
    Lib/codecs.py 1.25
    Doc/lib/libcodecs.tex 1.9

History
Date	User	Action	Args
2022-04-10 16:05:19	admin	set	github: 36597
2002-05-13 09:21:28	yaseppochi	create