unicode by default

John Machin sjmachin at lexicon.net
Wed May 11 23:54:20 EDT 2011


On Thu, May 12, 2011 11:22 am, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John.  In reverse order, I understand point (2). I'm less clear
> on point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters) and write them to a file (stdout) how does
> default encoding affect that file, by default..?   I'm not seeing that
> there is anything unusual going on...

About """characters that I presume to be ascii/utf-8 (no \u0404 type
characters)""": All Unicode characters (including U+0404) are encodable in
bytes using UTF-8.
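
A minimal sketch of that point, assuming Python 3 (where str holds Unicode
code points); the variable names are only for illustration:

```python
# Sketch, assuming Python 3: any str can be encoded to bytes with UTF-8.
ch = "\u0404"                      # CYRILLIC CAPITAL LETTER UKRAINIAN IE
encoded = ch.encode("utf-8")       # two bytes for this code point

# Pure-ASCII text produces identical bytes under ASCII and UTF-8,
# which is why "ascii/utf-8" output often looks unremarkable.
ascii_text = "plain text"
same = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(encoded, same)
```

So a string with no "\u0404 type characters" is just the special case where
the UTF-8 bytes happen to coincide with the ASCII bytes.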

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
mostly on sys.stdout.encoding. This is likely to be UTF-8 on a
Linux/OSX platform. On a typical American / Western European / [former]
colonies Windows box, this is likely to be cp850 in a Command Prompt
window, and cp1252 in IDLE.
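
To see what your own setup reports, something like the following should do.
It is a sketch assuming Python 3; the values printed depend entirely on the
platform and on how the interpreter was started:

```python
import locale
import sys

# Encoding the interpreter attached to stdout; can be None if stdout
# has been replaced by an object without an encoding attribute.
stdout_encoding = sys.stdout.encoding

# Encoding the locale suggests for text data on this machine.
preferred = locale.getpreferredencoding(False)

print("sys.stdout.encoding:", stdout_encoding)
print("locale preferred encoding:", preferred)
```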

UTF-8: All Unicode characters are encodable in UTF-8. Only problem arises
if the terminal can't render the character -- you'll get spaces or blobs
or boxes with hex digits in them or nothing.

Windows (Command Prompt window): only a small subset of characters can be
encoded in e.g. cp850; anything else causes an exception.

Windows (IDLE): ignores sys.stdout.encoding and renders the characters
itself. Same outcome as *x/UTF-8 above.
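
The Command Prompt case can be reproduced on any platform by encoding
explicitly. A sketch, assuming Python 3; try_encode is a hypothetical helper
written for this example, not a library function:

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

ok = try_encode("\u00e9", "cp850")       # e-acute exists in cp850
failed = try_encode("\u0404", "cp850")   # Cyrillic letter does not

# An error handler trades the exception for a lossy substitute.
replaced = "\u0404".encode("cp850", errors="replace")

print(ok, failed, replaced)
```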

If you write directly (or sys.stdout is redirected) to a FILE, the default
encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
the machine's site.py has been fiddled with to make it UTF-8 or something
else.
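
Whatever the default turns out to be on a given machine, naming the encoding
when opening the file avoids depending on it. A sketch, assuming Python 3
(io.open also exists in Python 2.6+); the file name out.txt is made up for
the example:

```python
import io
import os
import tempfile

text = "\u0404 and plain ascii\n"
path = os.path.join(tempfile.mkdtemp(), "out.txt")  # hypothetical file name

# State the encoding explicitly instead of relying on a default.
# newline="\n" keeps the line ending the same on every platform.
with io.open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text)

# Read the file back as raw bytes to see what actually landed on disk.
with io.open(path, "rb") as f:
    raw = f.read()

print(raw)
```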

>   If I open the file with vi?  If
> I open the file with gedit?  emacs?

Any editor will have a default encoding; if that doesn't match the file
encoding, you have a (hopefully obvious) problem if the editor doesn't
detect the mismatch. Consult your editor's docs or HTFF1K.
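
The mismatch is easy to simulate without an editor. A sketch, assuming
Python 3, with cp1252 standing in for an editor's default encoding:

```python
# Bytes as they would sit in a UTF-8 encoded file.
data = "caf\u00e9".encode("utf-8")

# What a reader assuming cp1252 makes of those bytes: two wrong
# characters where the single e-acute should be.
wrong = data.decode("cp1252")

# What a reader using the file's real encoding sees.
right = data.decode("utf-8")

print(ascii(wrong), ascii(right))
```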

> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe?   ... guessing, are these unicode code
> points,

yes

> and if so what is the best way to 'guess' the encoding?

google("chardet") or rummage through the mail headers (but 4 hex digits in
a box are a symptom of inability to render, not necessarily caused by an
incorrect decoding)

> ... is it coded in the stream somewhere...protocol?

Should be.
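
For mail, "in the stream" usually means the charset parameter of the
Content-Type header. A sketch of reading it with the standard library email
package, assuming Python 3; the message below is invented for the example:

```python
from email import message_from_bytes

# Invented single-part message declaring its body charset in the headers.
raw_message = (
    b"From: someone@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="koi8-r"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"\xf0\xd2\xc9\xd7\xc5\xd4\r\n"
)

msg = message_from_bytes(raw_message)
charset = msg.get_content_charset()      # None if no charset is declared
payload = msg.get_payload(decode=True)   # body as raw bytes

# Fall back to ascii with replacement if the header is missing.
body = payload.decode(charset or "ascii", errors="replace")

print(charset, ascii(body))
```

If the header is absent or wrong, guessing (chardet and friends) is the
fallback.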



