[Patches] [ python-Patches-670715 ] Universal Unicode Codec for POSIX iconv

Tue, 04 Feb 2003 10:58:36 -0800

Patches item #670715, was opened at 2003-01-19 17:51
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=670715&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Closed
Resolution: Accepted
Priority: 5
Submitted By: Hye-Shik Chang (perky)
Assigned to: Martin v. Löwis (loewis)
Summary: Universal Unicode Codec for POSIX iconv

Initial Comment:
Here's a Unicode codec using the POSIX iconv(3) library.

Tested on these platforms and seems to work:
  FreeBSD/i386, FreeBSD/alpha, FreeBSD/ia64,
  FreeBSD/sparc64, MacOS X/ppc, HP-UX/pa-risc2

This codec implementation also supports PEP 293.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2003-02-04 19:57

Message:
Logged In: YES 
user_id=89016

"ISO8859_7" is not known on Linux, can you try "ISO8859-7"
or better "ISO8859-1" or "LATIN1"?

Also I wonder whether it is a good thing to test iconv()
with the character '\x01'. Can you try diff-char.txt and see
what happens?

If all this fails, try diff-debug.txt and report what the
output is.

----------------------------------------------------------------------

Comment By: Christos Georgiou (tzot)
Date: 2003-02-04 18:02

Message:
Logged In: YES 
user_id=539787

I am afraid that SGI Irix' iconv must be added to the list of 
scary commercial implementations... at first the module did 
not compile due to two missing typecasts (see patch 
680146), but even after that, the module fails (and python 
dumps core):

Fatal Python error: can't initialize the _iconv_codec module: 
iconv_open() failed
Abort (core dumped)

(message at line 674 of the module)

This is because Irix iconv knows nothing about ASCII 
encoding...

I changed the "ASCII" argument to something 
existing, "ISO8859_7" which is my country's encoding, and 
then python dumps core with:

Fatal Python error: can't initialize the _iconv_codec module: 
mixed endianess
Abort (core dumped)

MIPS processors are big endian.

Python works fine in all my programs that make no use of 
str.decode and unicode.encode.
To make sure that the problem exists only in this module, I 
need to compile without the _iconv_codec .  Do I do that by 
changing setup.py?  This seems the way, but I haven't 
succeeded yet.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2003-01-30 21:04

Message:
Logged In: YES 
user_id=89016

I checked in a version of iconvcodec-3.txt that does a byte
swapping check in the init function as
Modules/_iconv_codec.c 1.5

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2003-01-30 19:26

Message:
Logged In: YES 
user_id=89016

iconvcodec-3.txt does byteswapping under the following
conditions:
    #ifndef WORDS_BIGENDIAN
    #ifdef __GNU_LIBRARY__
When encoding, the whole input is byteswapped before it is
passed to iconv(); when decoding, every piece returned from
iconv() is byteswapped.

Detecting whether byteswapping is necessary might not work
reliably with the above tests. If this is the case, a test
call to iconv() should probably be done in
Modules/_iconv_codec.c::init_iconv_codec() to determine
whether to byte swap or not.
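
To illustrate the test-call idea, here is a minimal sketch in Python
rather than C. It assumes the probe is the two bytes iconv() returns
when converting U+0001 to UCS-2; the function names are hypothetical
and are not taken from Modules/_iconv_codec.c.

```python
import sys


def ucs2_byte_order(probe: bytes) -> str:
    """Classify the 2-byte result of converting U+0001 to UCS-2."""
    if probe == b"\x00\x01":
        return "big"
    if probe == b"\x01\x00":
        return "little"
    raise ValueError("unexpected probe output: %r" % (probe,))


def swap_ucs2(data: bytes) -> bytes:
    """Swap the two bytes of every UCS-2 code unit."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even length")
    swapped = bytearray(data)
    # Even positions receive the odd-position bytes and vice versa.
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


def to_native_order(data: bytes, probe: bytes) -> bytes:
    """Byteswap only when the probed order differs from the host order."""
    if ucs2_byte_order(probe) == sys.byteorder:
        return data
    return swap_ucs2(data)
```

Running the probe once at module initialisation would replace the
compile-time #ifdef tests with a check of what the linked iconv
actually produces.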

Another possibility might be to use utf-16/utf-32 instead of
ucs-2/ucs-4.
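
A rough sketch of why utf-16 could help: output that starts with a
byte order mark describes its own endianness. This uses Python's own
utf-16 codec for illustration only; whether a given iconv emits a BOM
for "UTF-16" is an assumption that would have to be checked per
platform. The helper name is hypothetical.

```python
import codecs
import sys


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order announced by a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return "unknown"  # no BOM, so the order has to be known some other way


# Python's "utf-16" codec prepends a BOM in the host's byte order.
encoded = "A".encode("utf-16")
```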

One test still fails: test_sane(), because it uses the
internal encoding in Python, where the real endianness is of
course unknown. The test also assumes a narrow Python build.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-26 16:19

Message:
Logged In: YES 
user_id=55188

Thank you very much for your work.
I'm working on UCS endianness detection and a UCS transparent 
wrapper for UTF-{8,16}. I'll submit a new patch within a week.
Please feel free to change my code, because I'm not familiar 
with Python coding conventions and culture. :-)

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-26 12:32

Message:
Logged In: YES 
user_id=21627

I have committed it now with minimal changes so that it
works on Linux, as

setup.py 1.138
__init__.py 1.15
iconv_codec.py 1.1
regrtest.py 1.117
NEWS 1.627

I will make further changes; please watch the CVS. If you
would like to make further changes, please submit patches
against the CVS.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-22 02:07

Message:
Logged In: YES 
user_id=21627

Hmm. I see that Solaris does support conversion of CJK to
UTF-8. So even though we cannot convert into the internal
encoding, we could still convert through UTF-8.

Looking at /usr/lib/nls/iconv/config.iconv of HP-UX 11.11, I
see conversions from and to ucs2, for iso-8859, eucJP, sjis,
eucTW, big5, roc15, kore5, hp15CN, and many IBM code pages.

So I think the iconv codec should convert into the Python
internal representation if possible. If no encoding name for
that is known, it should convert to ucs2 (be) if possible,
or else to UTF-8; in all cases, it will then construct a
Unicode object from the resulting string.
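
The fallback order described above could be sketched like this. It is
only an outline in Python: `is_supported` stands in for "iconv_open()
succeeds for this name", and the encoding names "UCS-2BE" and "UTF-8"
are assumptions, since the spelling each iconv accepts differs.

```python
def pick_intermediate(is_supported, internal_name=None):
    """Return the first intermediate encoding the iconv accepts.

    is_supported:  callable taking an encoding name, returning bool
                   (hypothetical stand-in for a trial iconv_open()).
    internal_name: name of the Python-internal representation, or
                   None when no such name is known for this iconv.
    """
    candidates = []
    if internal_name is not None:
        candidates.append(internal_name)   # 1. internal representation
    candidates.append("UCS-2BE")           # 2. big-endian UCS-2
    candidates.append("UTF-8")             # 3. last resort
    for name in candidates:
        if is_supported(name):
            return name
    raise LookupError("iconv offers no usable Unicode encoding")
```

For example, an iconv that only knows UTF-8 would yield "UTF-8", and
the codec would then build the Unicode object from the UTF-8 string.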

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-22 01:21

Message:
Logged In: YES 
user_id=55188

iconv implementations on commercial UNIXes are very scary. 

Solaris implementation:
  no support for CJK <-> UCS conversion.
  They support UCS[24] only for iso-8859 and UTFs.

HP-UX implementation:
  They have useless iconv. HP-UX iconv has no unicode support.

BSD implementation (Konstantin Chuguev's):
  compatible with this patch (provides ucs-[24]-internal)

GLIBC implementation:
  provides ucs-[24], which are the same as GNU iconv's
ucs-[24]-internal.
  Because ucs-[24] in the GNU/BSD implementations is always big
endian, we can't use ucs-[24] on every platform.

In conclusion, we must use 3rd party iconv on Solaris or HP-UX.
And, we need to detect whether the linked iconv is GNU/BSD
iconv or GLIBC iconv. (how?)

I'll investigate how to detect them, but ... :)

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-21 23:27

Message:
Logged In: YES 
user_id=21627

I'm quite happy with this patch, and will apply it shortly.
However, I am concerned that it is specific for GNU iconv.
IMO, there should be machinery to find out the "internal"
encoding, in case the native iconv implementation is
used instead of GNU iconv.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-20 14:33

Message:
Logged In: YES 
user_id=55188

Thank you for comments. :->
I uploaded a new revised patch with unittest and some code
style fixes.

I saw Martin v. Loewis's iconvcodecs about a year ago.
His implementation is very neat, but its error handling was
limited by a recursive call.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-01-19 23:56

Message:
Logged In: YES 
user_id=38388

The patch looks good, but you'll need to add some form
of testing to underline the "seems to work" :-)

Some docs on how to use the codec would also be needed.

Martin von Loewis has written a similar codec some months ago.
Perhaps you two could get in touch and sort out the details ?!

----------------------------------------------------------------------
