unicode(s, enc).encode(enc) == s ?

Thu Jan 3 09:52:15 EST 2008

On Jan 2, 9:34 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > In any case, it goes well beyond the situation that triggered my
> > original question in the first place, that basically was to provide a
> > reasonable check on whether round-tripping a string is successful --
> > this is in the context of a small utility to guess an encoding and to
> > use it to decode a byte string. This utility module was triggered by
> > one that Skip Montanaro had written some time ago, but I wanted to add
> > and combine several ideas and techniques (and support for my usage
> > scenarios) for guessing a string's encoding in one convenient place.
>
> Notice that this algorithm is not capable of detecting the ISO-2022
> encodings - they look like ASCII to this algorithm. This is by design,
> as the encoding was designed to only use 7-bit bytes, so that you can
> safely transport them in Email and such (*)

Well, one could specify decode_heuristically(s, enc="iso-2022-jp") and
that encoding will be checked before ascii or any other encoding in the
list.
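
To illustrate why the order matters, here is a minimal sketch, assuming
Python 3 bytes objects rather than the Python 2 unicode(s, enc) call in
the subject line. The name decode_in_order and its default candidate
list are hypothetical and are not the actual decodeh code; the sketch
only shows a preferred encoding being tried first, with a round-trip
check on each candidate.

```python
def decode_in_order(data, enc=None, candidates=("ascii", "utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that round-trips.

    Hypothetical helper: `enc`, if given, is tried before the defaults.
    Returns (None, None) when no candidate round-trips.
    """
    order = ([enc] if enc else []) + [c for c in candidates if c != enc]
    for name in order:
        try:
            text = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # not decodable, or unknown codec name
        try:
            # Round-trip check: re-encoding must reproduce the input bytes.
            if text.encode(name) == data:
                return text, name
        except UnicodeEncodeError:
            continue
    return None, None


data = "日本語".encode("iso-2022-jp")  # 7-bit bytes, so also valid ascii
print(decode_in_order(data)[1])
print(decode_in_order(data, enc="iso-2022-jp")[1])
```

Without the hint, the 7-bit bytes round-trip cleanly as ascii, which is
the limitation Martin points out; with the hint, iso-2022-jp is tried
first and wins.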

> If you want to add support for ISO-2022, you should look for escape
> characters, and then check whether the escape sequences are among
> the ISO-2022 ones:
> - ESC (  - 94-character graphic character set, G0
> - ESC )  - 94-character graphic character set, G1
> - ESC *  - 94-character graphic character set, G2
> - ESC +  - 94-character graphic character set, G3
> - ESC -  - 96-character graphic character set, G1
> - ESC .  - 96-character graphic character set, G2
> - ESC /  - 96-character graphic character set, G3
> - ESC $  - Multibyte
>            ( G0
>            ) G1
>            * G2
>            + G3
> - ESC %   - Non-ISO-2022 (e.g. UTF-8)
>
> If you see any of these, it should be ISO-2022; see
> the Wiki page as to what subset may be in use.
>
> G0..G3 means what register the character set is loaded
> into; when you have loaded a character set into a register,
> you can switch between registers through ^N (to G1),
> ^O (to G0), ESC n (to G2), ESC o (to G3) (*)

OK, suppose we do not know that the string is likely to be iso-2022, but
we still want to detect it if it is. I have added a "may_do_better"
mechanism to the algorithm, to run special checks on a *guessed*
encoding. I am not sure, however, that this will not introduce more
problems than the one it addresses...

I have re-instated checks for iso-8859-1 control chars (likely to be
cp1252), for special symbols in iso-8859-15 when they occur in
iso-8859-1 and cp1252, and for the iso-2022-jp escape sequences.
Fleshing this out with other checks is mechanical work...
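
A rough sketch of what two of these checks could look like, again
assuming Python 3 bytes. The function below borrows the "may_do_better"
name but is a hypothetical illustration, not the code on the page; the
regular expressions and the choice of which guessed encodings trigger
which check are my assumptions. The iso-8859-15 symbol check is left
out.

```python
import re

# ESC followed by one of the bytes from Martin's list: ( ) * + - . / $ %
ISO2022_ESC = re.compile(rb"\x1b[()*+\-./$%]")
# Bytes 0x80-0x9F are C1 control characters in iso-8859-1,
# but mostly printable characters in cp1252.
C1_RANGE = re.compile(rb"[\x80-\x9f]")


def _decodes(data, name):
    """True if `data` decodes without error under codec `name`."""
    try:
        data.decode(name)
        return True
    except UnicodeDecodeError:
        return False


def may_do_better(data, guessed):
    """Return a more plausible encoding than `guessed`, or `guessed` itself.

    Hypothetical sketch: only two special checks are implemented.
    """
    if guessed in ("ascii", "utf-8") and ISO2022_ESC.search(data):
        # Only iso-2022-jp is tried here; other ISO-2022 subsets are ignored.
        if _decodes(data, "iso-2022-jp"):
            return "iso-2022-jp"
    if guessed == "iso-8859-1" and C1_RANGE.search(data):
        # cp1252 leaves a few bytes undefined, so confirm it decodes.
        if _decodes(data, "cp1252"):
            return "cp1252"
    return guessed


print(may_do_better("日本語".encode("iso-2022-jp"), "ascii"))
print(may_do_better(b"\x93quoted\x94", "iso-8859-1"))
```

Note that the first check fires on any input with such an escape
sequence that happens to decode as iso-2022-jp, whatever its real
encoding.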

If you could take a look at the updated page:

> >http://gizmojo.org/code/decodeh/

I still have issues with what happens when, for example, a file
contains iso-2022 escape sequences but is actually in ascii or utf-8,
e.g. this mail message! I'll let this issue sit for a little
while...

> > I will be very interested in any remarks any of you may have!
>
> From a shallow inspection, it looks right. I would have spelled
> "losses" as "loses".

Yes, corrected.

Thanks, mario