Try this

mensanator at aol.com mensanator at aol.com
Mon Sep 17 01:55:02 EDT 2007


On Sep 16, 9:27 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> On Sun, 16 Sep 2007 21:58:09 -0300, mensana... at aol.com
> <mensana... at aol.com> wrote:
>
> >> I'm eagerly awaiting publication of your professional specification
> >> for correctly detecting the encoding of an arbitrary stream of
> >> bytes
>
> > The very presence of an algorithm to detect encoding is a bug.
> > Files with the .txt extension should always be treated as ANSI
> > even if they contain binary data.
>
> Why ANSI?

Because that's the absence of encoding?

> Because it's convenient to *you*?

No, it's ANSI unless told otherwise.

> What about the rest of the world that don't speak
> English or, even worse, don't use the Latin alphabet?

When the rest of the world creates the next
generation of computers, THEY can choose the
defaults.

> What do you mean by "binary data"?

8-bit, ASCII is only 7-bit.

> Notepad is not interpreting the file as  
> "binary", it's text,

And will treat non-ASCII data as if it were ASCII.

> but interpreted using the wrong encoding.

So that's not a serious bug? To decide that a file
is Unicode despite the absence of the appropriate
markers?

>
> If you want to understand what happens here: The Unicode block for 'CJK
> Unified Han' goes from U+4E00 to U+9FFF and is the largest block in the
> basic plane, with more than 20000 code points. The previous block contains
> the famous 64 hexagrams, and the one before that is 'CJK Unified Han
> Extension A', ranging from U+3400 to U+4DBF.
> Note that ASCII letters go from 0x41-0x5A and 0x61-0x7A, and the range
> 0x4100-0x7AFF is totally contained inside the above Unicode blocks.
> Reading a small phrase containing only ASCII letters as if it were UTF16
> would collapse each two letters into a single character, each character
> being part of 'CJK Unified Han'. (Space and punctuation are allowed in odd
> positions only, else the character would not belong to the Han blocks).
> As every character goes into the same code block the heuristic concludes
> that the text is some Eastern language encoded in UTF16.

But...but...Notepad doesn't have a UTF16 option.
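
For reference, the range argument quoted above can be checked directly.
This is a minimal sketch, assuming Python 3 and a little-endian UTF-16
reading. looks_cjk is a hypothetical helper written for this illustration,
not Notepad's actual detection code, which is not shown anywhere in this
thread.

```python
import itertools
import string

# Bounds taken from the quoted text: Extension A starts at U+3400 and
# CJK Unified Han ends at U+9FFF. The small hexagram block that sits
# between the two is also accepted by this simple range test.
CJK_LOW, CJK_HIGH = 0x3400, 0x9FFF

def looks_cjk(data: bytes) -> bool:
    """Return True if every UTF-16-LE code unit in data is in the CJK range."""
    text = data.decode("utf-16-le")  # needs an even number of bytes
    return all(CJK_LOW <= ord(ch) <= CJK_HIGH for ch in text)

# Check every possible pair of ASCII letters.
letter_pairs = itertools.product(string.ascii_letters, repeat=2)
print(all(looks_cjk((a + b).encode("ascii")) for a, b in letter_pairs))

# A space as the first byte of a pair versus as the second byte.
print(looks_cjk(b" y"), looks_cjk(b"y "))
```

Every letter pair falls between 0x4141 and 0x7A7A, so the first check
passes. A space is tolerated only as the first byte of a pair, which is
the "odd positions only" remark in the quoted text.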

> This is the "Well you are speed" phrase interpreted as UTF16:  
> u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'

How can you tell from that that it's UTF16? If there's
something stored in addition to those 18 bytes, you're
being misleading.
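
The byte count is easy to confirm. A small sketch, again assuming Python 3
and little-endian byte order (the quoted code points only match under that
assumption). The variable names are made up for this illustration.

```python
import codecs

phrase = b"Well you are speed"

# The string quoted above.
quoted = "\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465"

# Read the same bytes two at a time, low byte first.
decoded = phrase.decode("utf-16-le")

print(len(phrase), len(decoded))               # byte count, character count
print(decoded == quoted)                       # does it match the quote?
print(phrase.startswith(codecs.BOM_UTF16_LE))  # is a byte order mark present?
```

The 18 bytes decode to the 9 quoted characters, and they do not begin with
a byte order mark, so nothing in the bytes themselves labels them as UTF16.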

>
> > Notepad should never be
> > allowed to try to decide what the encoding is if the open
> > dialog has the encoding set to ANSI.
>
> I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
> that's exactly what happens. I have to explicitly select Unicode in order
> to see those Han characters.

So which is worse, you having to tell it that it's
Unicode or Notepad deciding on its own that a file
is Unicode when it isn't?

>
> --
> Gabriel Genellina




