[Patches] [ python-Patches-670715 ] Universal Unicode Codec for POSIX iconv

SourceForge.net noreply@sourceforge.net
Thu, 30 Jan 2003 10:26:51 -0800


Patches item #670715, was opened at 2003-01-19 17:51
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=670715&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Closed
Resolution: Accepted
Priority: 5
Submitted By: Hye-Shik Chang (perky)
Assigned to: Martin v. Löwis (loewis)
Summary: Universal Unicode Codec for POSIX iconv

Initial Comment:
Here's a Unicode codec that uses the POSIX iconv(3) library.

Tested on these platforms and seems to work:
  FreeBSD/i386, FreeBSD/alpha, FreeBSD/ia64,
  FreeBSD/sparc64, MacOS X/ppc, HP-UX/pa-risc2

This codec implementation also supports PEP 293.
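
For readers unfamiliar with PEP 293 (codec error handling callbacks), here is a minimal sketch of what such a callback looks like from the Python side. It uses the standard codecs.register_error() API with Python's built-in ascii codec. The handler name "bracket-codepoint" is made up for illustration and is not part of the patch.

```python
import codecs


def bracket_codepoint(exc):
    """Replace each unencodable character with a "[U+XXXX]" marker."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only handles encoding errors.
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("[U+%04X]" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


# "bracket-codepoint" is a hypothetical handler name chosen for this example.
codecs.register_error("bracket-codepoint", bracket_codepoint)

encoded = "caf\u00e9".encode("ascii", errors="bracket-codepoint")
print(encoded)  # b'caf[U+00E9]'
```

A codec that supports PEP 293 calls the registered handler instead of failing outright, and continues from the position the handler returns.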


----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2003-01-30 19:26

Message:
Logged In: YES 
user_id=89016

iconvcodec-3.txt does byteswapping under the following
conditions:
    #ifndef WORDS_BIGENDIAN
    #ifdef __GNU_LIBRARY__
When encoding, the whole input is byteswapped before it is
passed to iconv(); when decoding, every piece returned from
iconv() is byteswapped.
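
The byteswapping step described above amounts to exchanging the two bytes of every 16-bit unit. A minimal Python sketch, assuming a UCS-2 buffer of even length (the function name byteswap_ucs2 is hypothetical; the patch does this in C):

```python
def byteswap_ucs2(data):
    """Swap the two bytes of every 16-bit unit in a UCS-2 buffer."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # odd-indexed bytes move to even positions
    swapped[1::2] = data[0::2]  # even-indexed bytes move to odd positions
    return bytes(swapped)


big_endian = "AB".encode("utf-16-be")
little_endian = byteswap_ucs2(big_endian)
```

Swapping twice returns the original buffer, so the same routine serves both the encoding and the decoding direction.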

Detecting whether byteswapping is necessary might not work
reliably with the above tests. If this is the case, a test
call to iconv() should probably be done in
Modules/_conv_codec.c::init_iconv_codec() to determine
whether to byte swap or not.
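
The idea of a test call can be sketched in Python as follows. This is only an illustration of the probing logic, not the C code that would go into init_iconv_codec(). The convert argument is a hypothetical callable standing in for an iconv() conversion into the encoding under test, and Python's own utf-16-be codec is used here as a stand-in for a real iconv.

```python
import sys


def detect_byte_order(convert):
    """Return "big" or "little" depending on how convert() lays out U+0001.

    convert is a hypothetical stand-in for an iconv() call that encodes
    text into the 16-bit encoding under test (for example "ucs-2").
    """
    probe = convert("\x01")
    if probe == b"\x00\x01":
        return "big"
    if probe == b"\x01\x00":
        return "little"
    raise ValueError("unexpected probe output: %r" % (probe,))


def needs_byteswap(convert):
    """True if convert() produces the opposite of the host byte order."""
    return detect_byte_order(convert) != sys.byteorder


def fake_big_endian_iconv(text):
    # Stand-in for an iconv whose "ucs-2" is always big endian.
    return text.encode("utf-16-be")


swap = needs_byteswap(fake_big_endian_iconv)
```

Probing at initialization time replaces the compile-time guess made by the preprocessor tests with an observation of what the linked iconv actually does.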

Another possibility might be to use utf-16/utf-32 instead of
ucs-2/ucs-4.
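
One reason utf-16 could help is that its output can carry a byte order mark (BOM), which makes the byte order self-describing. The sketch below shows this with Python's built-in utf-16 codec. Whether a particular iconv implementation emits a BOM for "utf-16" is an assumption that would have to be checked per platform. The helper name split_bom is hypothetical.

```python
import codecs


def split_bom(data):
    """Return (byte_order, payload) for UTF-16 data.

    byte_order is "big", "little", or None when no BOM is present.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big", data[len(codecs.BOM_UTF16_BE):]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little", data[len(codecs.BOM_UTF16_LE):]
    return None, data


# Python's utf-16 codec writes a BOM in the host byte order when encoding.
order, payload = split_bom("A".encode("utf-16"))
codec_name = "utf-16-be" if order == "big" else "utf-16-le"
text = payload.decode(codec_name)
```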

One test still fails: test_sane(), because it uses the
internal encoding in Python, where the real endianness is of
course unknown. The test also assumes a narrow Python build.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-26 16:19

Message:
Logged In: YES 
user_id=55188

Thank you very much for your work.
I'm working on UCS endian detection and a transparent UCS
wrapper for UTF-{8,16}. I'll submit a new patch in a week.
Please feel free to change my code, because I'm not familiar
with Python coding conventions and culture. :-)


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-26 12:32

Message:
Logged In: YES 
user_id=21627

I have committed it now with minimal changes so that it
works on Linux, as

setup.py 1.138
__init__.py 1.15
iconv_codec.py 1.1
regrtest.py 1.117
NEWS 1.627

I will make further changes; please watch the CVS. If you
would like to make further changes, please submit patches
against the CVS.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-22 02:07

Message:
Logged In: YES 
user_id=21627

Hmm. I see that Solaris does support conversion of CJK to
UTF-8. So even though we cannot convert into the internal
encoding, we could still convert through UTF-8.

Looking at /usr/lib/nls/iconv/config.iconv of HP-UX 11.11, I
see conversions from and to ucs2, for iso-8859, eucJP, sjis,
eucTW, big5, roc15, kore5, hp15CN, and many IBM code pages.

So I think the iconv codec should convert into the Python
internal representation if possible. If no encoding name for
that is known, it should convert to ucs2 (be) if possible,
or else to UTF-8; in all cases, it will then construct a
Unicode object from the resulting string.
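
The fallback order proposed above can be sketched as follows. This is an illustration only: can_convert_to is a hypothetical predicate standing in for "iconv_open() succeeds for this target encoding", and the encoding names "ucs-2be" and "utf-8" are assumptions, since the names accepted by iconv differ between platforms.

```python
def choose_intermediate(can_convert_to, internal_name=None):
    """Pick the encoding that iconv output is converted into.

    Preference order: the Python-internal representation (if a name for
    it is known), then big-endian UCS-2, then UTF-8.
    """
    candidates = []
    if internal_name is not None:
        candidates.append(internal_name)
    # Assumed encoding names; real iconv implementations may spell them differently.
    candidates.extend(["ucs-2be", "utf-8"])
    for name in candidates:
        if can_convert_to(name):
            return name
    raise LookupError("no usable intermediate encoding")


# Example: a platform that can only convert through UTF-8.
chosen = choose_intermediate(lambda name: name == "utf-8",
                             internal_name="ucs-2-internal")
```

Whatever encoding is chosen, the resulting bytes are then decoded into a Unicode object in a second step.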

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-22 01:21

Message:
Logged In: YES 
user_id=55188

The iconv implementations on commercial UNIXes are very scary.

Solaris implementation:
  no support for CJK <-> UCS conversion.
  It supports UCS[24] only for iso-8859 and the UTFs.

HP-UX implementation:
  The HP-UX iconv is useless: it has no Unicode support.

BSD implementation (Konstantin Chuguev's):
  compatible with this patch (provides ucs-[24]-internal)

GLIBC implementation:
  provides ucs-[24], which are the same as GNU iconv's
ucs-[24]-internal.
  Because ucs-[24] in the GNU/BSD implementations is always
big endian, we can't use ucs-[24] on every platform.

In conclusion, we must use a third-party iconv on Solaris or
HP-UX. We also need to detect whether the linked iconv is
GNU/BSD iconv or GLIBC iconv. (how?)

I'll investigate how to detect them, but ... :)

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2003-01-21 23:27

Message:
Logged In: YES 
user_id=21627

I'm quite happy with this patch, and will apply it shortly.
However, I am concerned that it is specific to GNU iconv.
IMO, there should be machinery to find out the "internal"
encoding, in case the native iconv implementation is
used instead of GNU iconv.

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2003-01-20 14:33

Message:
Logged In: YES 
user_id=55188

Thank you for the comments. :->
I uploaded a new, revised patch with unit tests and some code
style fixes.

I saw Martin v. Loewis's iconvcodecs about a year ago.
His implementation is very neat, but its error handling was
limited because of a recursive call.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2003-01-19 23:56

Message:
Logged In: YES 
user_id=38388

The patch looks good, but you'll need to add some form
of testing to underline the "seems to work" :-)

Some docs on how to use the codec would also be needed.

Martin von Loewis wrote a similar codec some months ago.
Perhaps you two could get in touch and sort out the details ?!

----------------------------------------------------------------------
