Try this
mensanator at aol.com
Mon Sep 17 01:55:02 EDT 2007
On Sep 16, 9:27 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> On Sun, 16 Sep 2007 21:58:09 -0300, mensana... at aol.com
> <mensana... at aol.com> wrote:
>
> >> I'm eagerly awaiting publication of your professional specification
> >> for correctly detecting the encoding of an arbitrary stream of
> >> bytes
>
> > The very presence of an algorithm to detect encoding is a bug.
> > Files with the .txt extension should always be treated as ANSI
> > even if they contain binary data.
>
> Why ANSI?
Because that's the absence of encoding?
> Because it's convenient to *you*?
No, it's ANSI unless told otherwise.
> What about the rest of the world that doesn't speak
> English or even worse, doesn't use the Latin alphabet?
When the rest of the world creates the next
generation of computers, THEY can choose the
defaults.
> What do you mean by "binary data"?
8-bit, ASCII is only 7-bit.
> Notepad is not interpreting the file as
> "binary", it's text,
And will treat non-ASCII data as if it were ASCII.
> but interpreted using the wrong encoding.
So that's not a serious bug? To decide that a file
is Unicode despite the absence of the appropriate
markers?
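For reference, the "markers" in question are byte-order marks at the start of the file. A minimal sketch of checking for them in Python 3 follows; `sniff_bom` is a hypothetical helper name, not anything Notepad or the standard library provides, and it only looks for the UTF-8 and UTF-16 marks:

```python
import codecs

# Hypothetical helper (not from this thread): report which byte-order
# mark, if any, starts a byte string. Only UTF-8 and UTF-16 are checked.
def sniff_bom(data):
    candidates = (
        ("utf-8-sig", codecs.BOM_UTF8),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    )
    for name, bom in candidates:
        if data.startswith(bom):
            return name
    # No marker present: the bytes alone do not say what the encoding is.
    return None

plain = b"Well you are speed"
marked = codecs.BOM_UTF16_LE + plain
```

Here `sniff_bom(plain)` finds no marker, while `sniff_bom(marked)` reports the little-endian UTF-16 mark.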
>
> If you want to understand what happens here: The Unicode block for 'CJK
> Unified Han' goes from U+4E00 to U+9FFF and is the largest block in the
> basic plane, with more than 20000 code points. The previous block contains
> the famous 64 hexagrams, and the one before that is 'CJK Unified Han
> Extension A' ranging from U+3400 to U+4DBF.
> Note that ASCII letters go from 0x41-0x5A and 0x61-0x7A, and the range
> 0x4100-0x7AFF is totally contained inside the above Unicode blocks.
> Reading a small phrase containing only ASCII letters as if it were UTF16
> would collapse each two letters into a single character, each character
> being part of 'CJK Unified Han'. (Space and punctuation are allowed in odd
> positions only, else the character would not belong to the Han blocks).
> As every character goes into the same code block the heuristic concludes
> that the text is some Eastern language encoded in UTF16.
But...but...Notepad doesn't have a UTF16 option.
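The range claim in the quoted paragraph can be checked directly. A rough sketch in Python 3, using the block boundaries exactly as quoted above and assuming little-endian byte order (second byte of each pair is the high byte); `in_han_blocks` is a hypothetical helper name:

```python
import string

# Block boundaries as given in the quoted text.
EXT_A = (0x3400, 0x4DBF)   # 'CJK Unified Han Extension A'
HAN = (0x4E00, 0x9FFF)     # 'CJK Unified Han'

def in_han_blocks(code):
    """True if a 16-bit value falls in either of the two quoted Han blocks."""
    return EXT_A[0] <= code <= EXT_A[1] or HAN[0] <= code <= HAN[1]

# All 52 ASCII letters as byte values (iterating bytes yields ints).
letters = string.ascii_letters.encode("ascii")

# Combine every pair of letters into one 16-bit unit, little-endian,
# and collect any pair that lands outside the Han blocks.
outside = [(lo, hi) for lo in letters for hi in letters
           if not in_han_blocks((hi << 8) | lo)]
```

If the quoted claim holds, `outside` comes back empty. A space (0x20) in the high-byte position gives a value such as 0x2057, which is outside both blocks, matching the remark about where spaces are allowed.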
> This is the "Well you are speed" phrase interpreted as UTF16:
> u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
How can you tell from that that it's UTF16? If there's
something stored in addition to those 18 bytes, you're
being misleading.
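The quoted string can be reproduced from the 18 bytes alone. A small sketch in Python 3, assuming little-endian UTF-16 without a byte-order mark (the byte order is an assumption here, it is not stated in the quoted text):

```python
# The phrase as 18 bytes of plain ASCII, nothing else stored.
data = b"Well you are speed"

# Reinterpret the same bytes as little-endian UTF-16: each pair of
# bytes becomes one 16-bit code unit.
as_utf16 = data.decode("utf-16-le")

# The string quoted above, for comparison.
quoted = u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'

same = (as_utf16 == quoted)
round_trip = as_utf16.encode("utf-16-le")
```

The 18 bytes turn into 9 characters, and encoding them back gives the original bytes, so nothing in the data itself marks it as either ASCII or UTF16.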
>
> > Notepad should never be
> > allowed to try to decide what the encoding is if the open
> > dialog has the encoding set to ANSI.
>
> I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and
> that's exactly what happens. I have to explicitly select Unicode in order
> to see those Han characters.
So which is worse, you having to tell it that it's
Unicode or Notepad deciding on its own that a file
is Unicode when it isn't?
>
> --
> Gabriel Genellina
More information about the Python-list
mailing list