[portland] Umlats from another dimension

Scott Garman sgarman at zenlinux.com
Mon Oct 13 00:56:24 CEST 2014


What a great community. :) Thanks for this, Tim.

Scott

On 10/12/2014 03:35 PM, Tim wrote:
>
> Hi Scott,
>
>> Any insights on what I'm missing would be greatly appreciated.
>
>
> Traditional Python strings are much more like byte arrays than
> character strings.  Explicit unicode strings can be defined as well,
> but they are a separate data type.
>
> Your isinstance() test is merely checking the data type of the object,
> but this has nothing to do with the content stored within.
>
> For instance:
>
> >>> s = u'ä'
> >>> if isinstance(s, unicode):
> ...         print "This is unicode"
> ...
> This is unicode
>
>
> And:
>
> >>> s = u'any string, now stored as unicode'
> >>> if isinstance(s, unicode):
> ...         print "This is unicode"
> ...
> This is unicode
>
>
> Note the "u" letter prefix to the string definitions.
>
>
> I suspect that when you include a character with an umlaut as a
> literal in the script as a traditional string, it is stored in the
> script's source encoding (I guess utf-8), so the string holds the
> encoded bytes (once again, just a sequence of bytes).
>
> When you read data in from your users and want to inspect it for
> characters that fall outside plain ASCII, I recommend you first
> decode it to unicode and then perform operations on it that way.
> But for goodness' sake, don't force it to "ascii"!  If you want to
> handle unicode, then interpret the input as utf-8 or whatever makes
> sense, then manipulate the resulting unicode object, preserving the
> extended character set.
>
> Consider this:
>
> >>> raw = 'ä'
> >>> text = raw.decode('utf-8')   # avoid shadowing the unicode built-in
> >>> for c in text:
> ...     print ord(c)
> ...
> 228
>
>
> Here, since Python knows how to interpret the value stored in the
> unicode object, the logical character value (the code point) is
> printed out, rather than the two encoded bytes.
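The same decode-then-inspect idea, as a minimal Python 3 sketch.  The
helper name non_ascii_chars and the UTF-8 default are assumptions made
for illustration; they are not from the original script.

```python
# Hypothetical helper: the name and the UTF-8 default are assumptions.
def non_ascii_chars(raw, encoding="utf-8"):
    """Decode raw bytes and return the characters outside the ASCII range."""
    text = raw.decode(encoding)  # raises UnicodeDecodeError on invalid input
    return [c for c in text if ord(c) > 127]

raw = b"M\xc3\xa4dchen"              # UTF-8 bytes for "Mädchen"
print(len(raw))                      # 8 bytes
print(len(raw.decode("utf-8")))      # 7 characters
print(non_ascii_chars(raw))          # ['ä']
```

Decoding once at the input boundary means every later check works on
characters rather than on individual encoded bytes.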
>
>
> Now, beyond just getting the characters converted into unicode
> properly, you still have to worry about what Python considers to be
> an uppercase vs. lowercase character.  I believe that for traditional
> byte strings this depends on the locale you have set in the
> environment, while unicode objects follow the Unicode character
> database.  But that's about as far as my knowledge goes here...
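One way to see how decoded text is classified is to ask the Unicode
character database directly.  This is only a sketch for Python 3; the
function name case_report is an assumption for illustration.

```python
import unicodedata

# Hypothetical helper: reports how each character of decoded text is classified.
def case_report(text):
    """Return (char, isupper, islower, Unicode category) for each character."""
    return [(c, c.isupper(), c.islower(), unicodedata.category(c)) for c in text]

text = b"\xc3\x84\xc3\xa4".decode("utf-8")   # "Ää" decoded from UTF-8 bytes
for row in case_report(text):
    print(row)
# ('Ä', True, False, 'Lu')
# ('ä', False, True, 'Ll')
```

On Python 3 str objects these methods use Unicode character
properties, so the result here does not change with the locale
setting.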
>
> Hope that helps,
> tim
>
>
> PS- In Python 3, the default string object *is* unicode.  The old
>      behavior of strings is relegated to bytes().  In some ways this
>      makes it easier to understand what is going on with unicode.
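A small Python 3 sketch of that split between str and bytes.  The
variable names are arbitrary, and the literal is written as an escape
so it does not depend on the file's encoding.

```python
text = "\u00e4"                  # str: one Unicode code point, "ä"
data = text.encode("utf-8")      # bytes: the UTF-8 encoded form

print(type(text).__name__, len(text), [ord(c) for c in text])   # str 1 [228]
print(type(data).__name__, len(data), list(data))               # bytes 2 [195, 164]
print(data.decode("utf-8") == text)                             # True
```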
>


