Comparing UTF-8 into USC-2 and vice versa (newbie :-) )

John Machin sjmachin at lexicon.net
Sun Jun 17 05:03:19 EDT 2007


On Jun 17, 6:48 pm, Tzury <Afro.Syst... at gmail.com> wrote:
> On Jun 17, 10:48 am, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
>
>
>
> > > > I recently rewrote a .NET application in Python.
> > > > The application basically receives streams via a TCP socket and
> > > > performs operations against an existing database.
> > > > The database is SQLite3 (encoded as UTF-8).
> > > > The network streams are encoded as UCS-2.
>
> > > > In UCS-2, 'A' = '0041', but when I check with the built-in
> > > > functions I get unicode("A", "utf-8") = u'A' = u'\u0041'. I
> > > > wonder what the difference is, and how I can safely encode/decode
> > > > UCS-2 streams and match them with the UTF-8 representation.
>
> > In unicode("A", "utf-8"), the "utf-8" parameter does *not* mean
> > that the output is in UTF-8, but the *input*.
> > So "A" = '41' != '0041'. In UCS-2, the A consumes two bytes; in
> > UTF-8, it consumes only one byte.
>
> > For different letters, that's different: For example, for u'\xf6',
> > the UCS-2 representation (big-endian) is '00F6', for UTF-8, it is
> > 'C3B6'. For u'\u20AC', the UCS-2 is '20AC', the UTF-8 is 'E282AC'
> > (i.e. three bytes).
>
> > You should use Unicode objects in your program always, and encode
> > to or from UCS-2 or UTF-8 only when interfacing with the
> > network/database.
>
> > HTH,
> > Martin
>
> Thanks Martin for this guideline. But in fact say I get a USC-2 string
> and need to compare it with UTF-8 value in the database. How can I do
> it given the Python built-in libraries?

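Decode both byte strings to unicode objects with the str.decode method,
using the appropriate encoding for each, then compare the results with ==.
For characters in the Basic Multilingual Plane, UCS-2 is byte-for-byte the
same as UTF-16, so a UTF-16 codec will decode it. Borrowing Martin's last
example: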

>>> '\xE2\x82\xAC'.decode('utf8')
u'\u20ac'
>>> '\x20\xAC'.decode('utf_16_be')
u'\u20ac'
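
Putting the two decodes together, a comparison could look something like
the sketch below. It is written for Python 3, where the method lives on
bytes rather than str; same_text is a made-up helper name, and the byte
order of your network stream is an assumption you will need to check
(big-endian here to match the example above, but a .NET sender may well
be producing little-endian, in which case pass 'utf_16_le').

```python
# Minimal sketch, Python 3. same_text is a hypothetical helper, not a
# standard-library function.

def same_text(ucs2_bytes, utf8_bytes, ucs2_codec="utf_16_be"):
    """Return True if both byte strings decode to the same text.

    ucs2_codec is an assumption about the stream's byte order;
    use "utf_16_le" for little-endian data.
    """
    return ucs2_bytes.decode(ucs2_codec) == utf8_bytes.decode("utf_8")


from_network = b"\x20\xac"        # euro sign, UCS-2 big-endian
from_database = b"\xe2\x82\xac"   # euro sign, UTF-8

print(same_text(from_network, from_database))                # True
print(same_text(b"\xac\x20", from_database, "utf_16_le"))    # True
print(same_text(b"\x00A", from_database))                    # False
```

The point is that the comparison happens on decoded text, never on the
raw bytes, so it does not matter that the two encodings use different
byte sequences for the same character.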

BTW, the TLA 'USC' should be 'UCS'.
HTH
SJM



