treating str as unicode in legacy code?

Sat Apr 14 18:14:56 EDT 2007

On Apr 13, 5:57 am, "Ben" <benjamin.... at gmail.com> wrote:
> I'm left with some legacy code using plain old str, and I need to make
> sure it works with unicode input/output. I have a simple plan to do
> this:
>
> - Run the code with "python -U" so all the string literals become unicode literals.

Requiring that the code is always run with a non-default argument
doesn't seem very robust/portable to me.

> - Add this statement
>
>  str = unicode
>
>   to all .py files so the type comparison (e.g., type('123') == str)
> would work.
>

IMVHO (1) doing that merely changes "legacy code" to "kludged legacy
code", and (2) there is no substitute for reading the code and trying to
nut out what it is doing.

Do you mean that those two things are the ONLY changes you plan to
make?

> Did I miss anything? Does this sound like a workable plan?

Do you need to make sure it still works with ASCII input? With input
in some other encoding e.g. cp1252?

What do you mean by "unicode input"? Bear in mind that if you want to
work with Python unicode objects internally, input from a file /
socket / whatever will need to be decoded i.e. you will have to read
the code and make appropriate changes. Data stored in (say) utf_16_le
encoding is not "unicode" in the sense that you need; it still has to
be decoded.
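
A minimal sketch of decoding at the input boundary, to make the point
concrete. The helper name decode_input and the sample bytes are made up
for illustration, and the b'' prefix assumes Python 2.6 or later:

```python
# Sketch: decode bytes at the input boundary, work with unicode internally.
# decode_input and the sample data are hypothetical, for illustration only.

def decode_input(raw_bytes, encoding):
    """Turn encoded bytes into a unicode object.

    Raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    return raw_bytes.decode(encoding)

utf16_bytes = b'a\x00b\x00\xff\x00'   # u'ab\xff' stored as utf_16_le
cp1252_bytes = b'ab\xff'              # the same text stored as cp1252

text_from_utf16 = decode_input(utf16_bytes, 'utf_16_le')
text_from_cp1252 = decode_input(cp1252_bytes, 'cp1252')

# Different bytes on disk, same unicode object once decoded.
same_text = (text_from_utf16 == text_from_cp1252)
```

The caller has to know (or be told) the encoding; nothing in the bytes
themselves says which one was used.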

What do you mean by "unicode output"? You are going to need to encode
your output.

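This doesn't work; the output is not "unicode" in any meaningful
sense (the file just holds the implicitly ASCII-encoded bytes, and the
\r\n is Windows text-mode newline translation):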
>>> f = open(u'uout', u'w')
### Warning: you need to hope that all builtins etc. that you are
### calling cope with unicode arguments as well as the above one does.
>>> f.write(u'abcde\n')
>>> f.close()
>>> open(u'uout', u'rb').read()
'abcde\r\n'

This doesn't work either; it crashes as soon as the data contains a
non-ASCII character:
>>> f = open('uout2', u'w')
>>> f.write(u'abcde\xff\n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 5: ordinal not in range(128)
>>>
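
One way around that crash, sketched under the assumption that utf_8 is
an acceptable output encoding (you have to choose one; the file path
below is hypothetical): encode explicitly and write the resulting bytes
in binary mode.

```python
import os
import tempfile

# Hypothetical output path, placed in a temp directory so the sketch
# does not write into the current working directory.
out_path = os.path.join(tempfile.mkdtemp(), 'uout2')

text = u'abcde\xff\n'

# Encode explicitly instead of relying on the implicit ASCII encoding.
# utf_8 is an assumption here; use whatever your consumers expect.
encoded = text.encode('utf_8')

f = open(out_path, 'wb')   # binary mode: no newline translation
f.write(encoded)
f.close()

# Read the raw bytes back to see what actually landed on disk.
f = open(out_path, 'rb')
raw = f.read()
f.close()
```

Here u'\xff' ends up as the two bytes '\xc3\xbf' in the file, and
decoding raw with the same encoding gives back the original text.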

Some object methods work differently with unicode, e.g. (1)
str.translate takes a 256-character translation table, whereas
unicode.translate takes a mapping of Unicode ordinals.

(2) split and isspace treat the no-break space U+00A0 differently:
>>> 'abc\xA0def'.split()
['abc\xa0def']
>>> u'abc\xA0def'.split()
[u'abc', u'def']
>>> '\xA0'.isspace()
False
>>> u'\xA0'.isspace()
True
>>>
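
The translate difference in point (1), as a rough sketch. The sample
strings and the particular substitutions are made up for illustration,
and the b'' prefix and bytearray assume Python 2.6 or later:

```python
# 8-bit strings: translate takes a 256-byte lookup table,
# indexed by the byte value of each input character.
table = bytearray(range(256))      # identity table to start with
table[ord('a')] = ord('A')         # map 'a' -> 'A'
byte_result = b'banana'.translate(bytes(table))

# unicode strings: translate takes a mapping keyed by code point.
# A value may be a code point, a unicode string, or None (delete).
mapping = {ord(u'a'): u'A', ord(u'n'): None}
text_result = u'banana'.translate(mapping)
```

Code that builds a 256-character table and passes it to translate will
not do the same thing once the string it is called on becomes unicode,
so each call site needs to be looked at.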

HTH,
John