Conversion of Unicode to ASCII NIGHTMARE

Serge Orlov Serge.Orlov at gmail.com
Thu Apr 6 03:46:15 EDT 2006


Roger Binns wrote:
> "Serge Orlov" <Serge.Orlov at gmail.com> wrote in message news:1144295335.353840.322190 at i40g2000cwc.googlegroups.com...
> > I have an impression that handling/production of byte order marks is
> > pretty clear: they are produced/consumed only by two codecs: utf-16 and
> > utf-8-sig. What is not clear?
>
> Are you talking about the C APIs in Python/SQLite (that is what I
> have been discussing) or the language level?

Both. The documentation for PyUnicode_DecodeUTF16 and PyUnicode_EncodeUTF16
is pretty clear about when a BOM is produced/removed. The only problem is
that you have to find out the host endianness yourself. In Python it's
sys.byteorder; in C you use a hack like

unsigned long one = 1;
/* first byte is 0 on a big-endian host: 1 = big-endian, -1 = little-endian */
int endianness = (*(char *) &one == 0) ? 1 : -1;

And then pass endianness to PyUnicode_(De/En)codeUTF16. So I still don't
see what is unclear about BOM production/handling.
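
At the language level, the same behaviour can be checked with a short
sketch like the one below. It assumes a modern Python 3 interpreter (not
the Python of this thread's date), and the names text, host_bom and
c_byteorder are made up for illustration only.

```python
import codecs
import sys

text = "hi"

# Map sys.byteorder onto the -1 / 1 convention used by the C API's
# byteorder argument (-1 = little-endian, 1 = big-endian).
c_byteorder = -1 if sys.byteorder == "little" else 1

# The generic "utf-16" codec writes a BOM in host byte order when encoding...
host_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
utf16_with_bom = text.encode("utf-16")

# ...and consumes a BOM when decoding, whatever byte order it announces.
decoded_be = (codecs.BOM_UTF16_BE + b"\x00h\x00i").decode("utf-16")

# The explicit-endian codecs never write a BOM.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# "utf-8-sig" writes EF BB BF and strips it on decode; plain "utf-8"
# does neither, so the BOM survives as U+FEFF in the decoded string.
utf8_sig = text.encode("utf-8-sig")
kept_bom = (codecs.BOM_UTF8 + b"hi").decode("utf-8")
stripped_bom = (codecs.BOM_UTF8 + b"hi").decode("utf-8-sig")
```

If the explicit-endian codecs are used instead, any BOM has to be added
or stripped by the caller.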


>
> At the C level, SQLite doesn't accept boms.

It would be surprising if it did. Quote from
<http://www.unicode.org/faq/utf_bom.html>: "Where the data is typed,
such as a field in a database, a BOM is unnecessary"
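
In practice that means stripping any BOM before the text reaches the
database. A minimal sketch, assuming Python 3 and the standard sqlite3
module; strip_bom and the notes table are hypothetical names used only
for this illustration:

```python
import sqlite3


def strip_bom(text):
    """Drop a single leading U+FEFF (the decoded BOM), if present."""
    return text[1:] if text.startswith("\ufeff") else text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

# Decoding with plain "utf-8" keeps the BOM as a leading U+FEFF character.
raw = b"\xef\xbb\xbfhello".decode("utf-8")

# The column is already typed as TEXT, so only the payload is stored.
conn.execute("INSERT INTO notes (body) VALUES (?)", (strip_bom(raw),))
stored = conn.execute("SELECT body FROM notes").fetchone()[0]
conn.close()
```

Decoding the input with "utf-8-sig" in the first place would make the
helper unnecessary.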



