Jython: How to import escaped Unicode and export utf-8?

Wed Apr 25 16:50:24 EDT 2001

Thank you, Martin, for your help. I've been taking a closer look at the codecs
module as a way to read escaped Unicode characters and strings from an 8-bit
text file into lists. The document "What's New in Python 2.0" by A.M. Kuchling
and Moshe Zadka has the best examples I've seen so far. I've had to supplement
it with Marc-Andre Lemburg's "Python Unicode Integration", version 1.8, for
listings of the parameters to use.

Martin von Loewis wrote:

> Maurice Bauhahn <bauhahnm at clara.net> writes:
>
> > My imports of escaped Unicode (u'\u1780' or '\u1780') end up in my lists
> > as:
> >
> > ["u'\\u1780'"]
>
> I very much doubt this. This looks more like the repr of a list,
> instead of like the list itself. That could be an incompatibility of
> repr for Unicode objects in Python, but I assume that the list is
> still built correctly.

It could be that, because Jython's default encoding is 'ascii', my reader did
not interpret those escapes. I used sys.getdefaultencoding() to check that
encoding. I then tried the following:

(UTF8_encode, UTF8_decode, UTF8_streamreader, UTF8_streamwriter) = \
    codecs.lookup('UTF-8')
(UNIESCAPE_encode, UNIESCAPE_decode, UNIESCAPE_streamreader,
 UNIESCAPE_streamwriter) = codecs.lookup('unicode-escape')
oneencoding = UNIESCAPE_streamreader(
    open('H:\\jy\\encodings\\KSCIIOne.txt', 'r'))
outdocument = UTF8_streamwriter(open('h:\\jy\\outtest.txt', 'wb'))

for encodingline in oneencoding.readlines():

The error returned from this last line is:

SyntaxError: invalid syntax
>>> execfile('h:\\jy\\test.py')
Traceback (innermost last):
  File "<console>", line 1, in ?
  File "h:\jy\test.py", line 408, in ?
  File "h:\jy\test.py", line 80, in loadencode
  File "D:\Java\jython\Lib\codecs.py", line 269, in readlines
TypeError: unicode_escape_decode(): expected 2 args; got 1

Which two arguments were expected, and where?
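
For reference, here is a minimal sketch of the pipeline I am after. It calls the
decoder returned by codecs.lookup() directly and passes both the input and an
errors argument, instead of going through the stream reader. It assumes a
current CPython 3 rather than Jython, and it uses in-memory streams and a
made-up sample (escaped_bytes) in place of my real files, so it is only an
illustration and not a fix for the traceback above.

```python
import codecs
import io

# Hypothetical sample standing in for the escaped 8-bit input file.
escaped_bytes = b"\\u1780\\u1781 abc\n\\u1782\n"

# codecs.lookup() returns the codec's functions. A decoder takes
# (input, errors) and returns a tuple (decoded_text, bytes_consumed).
unescape = codecs.lookup("unicode-escape").decode
to_utf8 = codecs.lookup("utf-8").encode

# Decode each line of escaped text into a real Unicode string.
lines = []
for raw_line in io.BytesIO(escaped_bytes).readlines():
    text, _consumed = unescape(raw_line, "strict")
    lines.append(text)

# Encode explicitly before writing to a byte stream.
out = io.BytesIO()
for text in lines:
    encoded, _length = to_utf8(text, "strict")
    out.write(encoded)
utf8_output = out.getvalue()
```

With this approach the list holds the characters themselves, and the escapes
only exist in the input file.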

>
>
> > and .write as u'\u1780'.
>
> In CPython, that would give an exception. You cannot write a Unicode
> object onto a stream without encoding it first.

Which encoding would you recommend for the write() function (if I want to use
regular expressions on the output)? I like utf-8 because it leaves ASCII
characters pretty much as they were; however, I'm afraid that parsing and
regular expression tools will have problems with the variable byte length of
characters. Next I want to do letter pair frequency studies with the output.
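
A rough sketch of the letter pair count, assuming a current CPython 3 where a
decoded string is indexed by code point: if the counting and the regular
expressions run on the decoded Unicode text, and utf-8 is only applied when
writing, the variable byte length should not get in the way. The function name
letter_pair_counts and the sample text are made up for illustration, and the
sketch treats every code point as a "letter", which may be too simple for
scripts that use combining marks.

```python
import re
from collections import Counter

def letter_pair_counts(text):
    """Count adjacent character pairs inside each run of non-space text."""
    counts = Counter()
    for word in re.findall(r"\S+", text):
        # zip(word, word[1:]) yields each character with its successor.
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts

# Hypothetical sample: two Khmer letters repeated, then an ASCII word.
sample = "\u1780\u1781\u1780\u1781 ab"
pairs = letter_pair_counts(sample)

# The same character is one item in the string but three bytes in utf-8.
char_length = len("\u1780")
byte_length = len("\u1780".encode("utf-8"))
```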

>
>
> > From the command line I can get something useful by writing:
> >
> > u'\u1780'.encode('utf-8')
> >
> > but it does not appear to work within my jython script.
>
> That should work. How does it fail?

The problem probably lies back at my input: my list of input strings still has
that u'\\u1780' format.
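
To make the symptom concrete, here is a small sketch, assuming a current
CPython 3: an entry in that format is the nine-character text of a literal, not
the single character it describes. One way to turn such text back into the
character is ast.literal_eval(), which evaluates string literals without
running arbitrary code. The variable names are hypothetical.

```python
import ast

# Hypothetical list entry in the format described above: the text of a
# literal (u, quote, backslash, u, four hex digits, quote).
entry = "u'\\u1780'"

# Evaluate the literal text to obtain the actual one-character string.
decoded = ast.literal_eval(entry)

# Encoding now behaves the same as it does at the command line.
encoded = decoded.encode("utf-8")
```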

>
>
> Regards,
> Martin

--
Maurice Bauhahn

United Kingdom

Home: bauhahnm at clara dot net