Try this

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Mon Sep 17 04:17:31 EDT 2007


On Sep 17, 02:55, "mensana... at aol.com" <mensana... at aol.com> wrote:
> On Sep 16, 9:27 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> wrote:
>
> > On Sun, 16 Sep 2007 21:58:09 -0300, mensana... at aol.com
> > <mensana... at aol.com> wrote:
>
> > >> I'm eagerly awaiting publication of your professional specification
> > >> for correctly detecting the encoding of an arbitrary stream of
> > >> bytes
>
> > > The very presence of an algorithm to detect encoding is a bug.
> > > Files with the .txt extension should always be treated as ANSI
> > > even if they contain binary data.
>
> > Why ANSI?
>
> Because that's the absence of encoding?

Are you kidding?

> > Because it's convenient to *you*?
>
> No, it's ANSI unless told otherwise.

Oh, yes, it's a joke surely.
(Anyway, *which* ANSI standard? AFAIK, the Windows character set has
never been standardized by ANSI).

> > What about the rest of the world that doesn't speak
> > English or, even worse, doesn't use the Latin alphabet?
>
> When the rest of the world creates the next
> generation of computers, THEY can choose the
> defaults.

No comments.

> > What do you mean by "binary data"?
>
> 8-bit, ASCII is only 7-bit.

Being "binary" as opposed to "text" has nothing to do with the number
of bits. "¡Olé!" is text, and contains characters outside the ASCII
set. A signal with range 0-63 can be encoded into 6 bits, but it's
binary data, not text.
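
To make the distinction concrete, here is a minimal sketch in Python 3 (an assumption; any Unicode-aware language would do). It shows that "¡Olé!" is text whose bytes depend on the encoding chosen, and that it cannot be written in 7-bit ASCII at all:

```python
text = "¡Olé!"  # five characters, two of them outside the ASCII set

latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # each non-ASCII character takes two bytes

# ASCII simply has no representation for these characters.
try:
    text.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(len(text), len(latin1_bytes), len(utf8_bytes), fits_in_ascii)
```

The same five characters come out as five bytes in one encoding and seven in another; neither byte string is "binary data", it is just encoded text.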

> > Notepad is not interpreting the file as  
> > "binary", it's text,
>
> And will treat non-ASCII data as if it were ASCII.

I think you were complaining about the opposite situation.

> > but interpreted using the wrong encoding.
>
> So that's not a serious bug? To decide that a file
> is Unicode despite the absence of the appropriate
> markers?

What are "the appropriate markers"? A BOM is not always required, and
Notepad supported Unicode even before the BOM was invented.
Please redirect your bug reports to bugs at microsoft.com
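
For reference, checking for a BOM is the easy part. The sketch below (Python 3; `sniff_bom` and the `BOMS` table are hypothetical names, and UTF-32 is deliberately left out) only reports what a leading BOM says. It does not try to guess anything when the BOM is absent, which is exactly the case under discussion:

```python
import codecs

# Byte order marks provided by the standard codecs module.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xff\xfeW\x00e\x00"))   # file that starts with a UTF-16 LE BOM
print(sniff_bom(b"Well you are speed"))   # no BOM, so no answer
```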

> > Since every character falls into the same code block, the heuristic concludes
> > that the text is some Eastern language encoded in UTF16.
>
> But...but...Notepad doesn't have a UTF16 option.

What it calls "Unicode" is in fact UTF16, or UCS2 on some previous
Windows versions.

> > This is the "Well you are speed" phrase interpreted as UTF16:  
> > u'\u6557\u6c6c\u7920\u756f\u6120\u6572\u7320\u6570\u6465'
>
> How can you tell from that that it's UTF16? If there's
> something stored in addition to those 18 bytes, you're
> being misleading.

*I* can tell it's not, but Notepad (which presumably calls
IsTextUnicode) cannot, and I can't blame it given such a small sample of
less than 20 bytes.
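
The misreading is easy to reproduce. This is a small sketch in Python 3 (an assumption; the literal quoted above is Python 2 syntax) that takes the 18 ASCII bytes of the phrase and decodes them as little-endian UTF-16, which is the byte order Windows uses:

```python
raw = b"Well you are speed"  # 18 bytes of plain ASCII

# Reinterpret the same bytes as little-endian UTF-16: each pair of
# ASCII bytes becomes a single 16-bit code unit.
misread = raw.decode("utf-16-le")

# Check whether every resulting character lands in the
# CJK Unified Ideographs block (U+4E00..U+9FFF).
all_cjk = all(0x4E00 <= ord(c) <= 0x9FFF for c in misread)

print(len(raw), len(misread), all_cjk)
print(ascii(misread))
```

Nothing is stored besides those 18 bytes; only the interpretation changes.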

> > > Notepad should never be
> > > allowed to try to decide what the encoding is if the open
> > > dialog has the encoding set to ANSI.
>
> > I'm using notepad.exe version 5.1.2600.2180 (XP SP2 fully updated) and  
> > that's exactly what happens. I have to explicitly select Unicode in order
> > to see those Han characters.
>
> So which is worse, you having to tell it that it's
> Unicode or Notepad deciding on its own that a file
> is Unicode when it isn't?

I don't know, and I don't care, and I don't use Notepad.

--
Gabriel Genellina



