length of unicode strings
Trond Eivind Glomsrød
teg at redhat.com
Fri Aug 23 11:51:47 EDT 2002
Mark Hammond <mhammond at skippinet.com.au> writes:
> Trond Eivind Glomsrød wrote:
> > When running on a utf-8 system, python doesn't seem to take its input
> > in unicode:
> > Python 2.2.1 (#1, Aug 19 2002, 18:04:04)
> > [GCC 3.2 (Red Hat Linux Rawhide 3.2-1)] on linux2
> > Type "help", "copyright", "credits" or "license" for more information.
> >
>
> :( unicode is hard. I won't pretend to understand, but as no other
> replies exist this may be useful.
>
> > >>> a="å"
> > >>> a
> > '\xc3\xa5'
>
> Here we do indeed seem to have a UTF8 representation of the
> character.
The entire system is running a utf-8 locale... the problem is that
python doesn't treat it as such, and I don't see a way to make it do so.
What I'll probably need is a way for python to set all these strings
as unicode by default...
>
> indeed,
> >>> len(unicode('\xc3\xa5', "utf8"))
> 1
>
> >
> > >>> len(a)
> > 2
> >
> > >>> b=u"å"
> > >>> b
>
> What we see here is, effectively,
> b=u"\xc3\xa5"
Yes, I included the above to show that.
> ie, we are creating a unicode string from a 2 character ascii
> string. I'm really not sure what the semantics of the default encoding
> are here, but I would expect it to work if you changed the default
> encoding in site.py
>>> import sys
>>> sys.getdefaultencoding()
'utf'
>>> a="å"
>>> len(a)
2
>>>
(this is what you get from enabling the locale-sensitive encoding
detection in site.py)
Hardcoding it to utf-8 doesn't help either...
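For what it's worth, the two-element u'\xc3\xa5' quoted in this thread
looks as if each UTF-8 byte was taken as a separate Latin-1 code point.
A minimal sketch, assuming that is what happened, of how such a string
could be repaired after the fact (the variable names are only for
illustration):

```python
# Assumption: the two UTF-8 bytes of "å" were each read as one Latin-1 character.
wrong = u"\xc3\xa5"

# Latin-1 maps code points 0-255 straight back to the same byte values,
# so encoding with it recovers the original bytes, which can then be
# decoded as the UTF-8 they really were.
fixed = wrong.encode("latin-1").decode("utf-8")

print(len(wrong))  # two code points
print(len(fixed))  # one code point, U+00E5
```

This only works when every character in the string is below U+0100;
anything else cannot be encoded as Latin-1 and raises an error.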
> > u'\xc3\xa5'
> >
> > >>> len(b)
> > 2
> >
> > >>> a.isalpha()
> > 0
> > Any particular things to configure? Enabling the
> > locale.getdefaultlocale() part in site.py doesn't help :(
>
> At the end of the day, it seems the character you want is \xe5, and, if
> decoded properly, the len() function works correctly. eg:
Yes. It boils down to a need to get python to recognize the string as
unicode automatically and mark it as such.
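Short of that, each byte string has to be decoded by hand. A rough
sketch, assuming the input really is UTF-8 (to_text is a hypothetical
helper name, not something from the standard library):

```python
import locale


def to_text(raw, encoding=None):
    """Decode a byte string, defaulting to the locale's preferred encoding."""
    if encoding is None:
        # Assumption: the locale reports the encoding the terminal really uses.
        encoding = locale.getpreferredencoding(False) or "utf-8"
    return raw.decode(encoding)


raw = b"\xc3\xa5"              # the two bytes the terminal sent for "å"
text = to_text(raw, "utf-8")   # encoding passed explicitly here

print(len(raw))        # counts bytes
print(len(text))       # counts characters
print(text.isalpha())  # the decoded character is a letter
```

The catch is that this has to be done for every string that comes in,
which is exactly the manual step I was hoping to avoid.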
--
Trond Eivind Glomsrød
Red Hat, Inc.