Unicode list

Sat Mar 31 21:04:17 EDT 2007

Rehceb Rotkiv wrote:
> Hello,
>
> I have this little grep-like program:
>
> ++++++++++snip++++++++++
> #!/usr/bin/python
>
> import sys
> import re
>
> pattern = sys.argv[1]
> inputfile = file(sys.argv[2], 'r')
>
> for line in inputfile:
>     matches = re.findall(pattern, line)
>     if matches:
>         print matches
> ++++++++++snip++++++++++
>
> Like this, the program prints some characters as strange escape
> sequences, which is due to the input file being encoded in utf-8:

So the UTF-8 data gets printed to your terminal, which isn't configured
for UTF-8, right?

> When I convert "re.findall..." to a string and wrap an "unicode()" around it,
> the matches get printed correctly.

How do you meaningfully convert it to a string? The matches variable
refers to a list, but you surely don't want to be dealing with the
list's string representation.
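
To illustrate the difference, here is a minimal sketch. The sample word
and the variable names are made up for the example, and UTF-8 input is
assumed. Converting the whole list to a string gives you the list's
repr, in which non-ASCII bytes show up as escape sequences; decoding
each element gives you usable text.

```python
# A stand-in for what re.findall returns when the input was never decoded:
# a list of UTF-8 encoded byte strings (the sample word is made up).
raw_matches = [u"caf\u00e9".encode("utf-8")]

# Converting the whole list yields its repr, where the two bytes that
# encode the accented letter appear as backslash escapes.
listing = str(raw_matches)

# Decoding element by element keeps a list, now holding Unicode objects.
decoded = [m.decode("utf-8") for m in raw_matches]
```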

> Is it possible to make "matches" unicode without saving it as a single string first?

Why not convert your input into Unicode and then use re.findall in
Unicode mode (by specifying re.U as a flag), so that character classes
such as \w also match non-ASCII letters? Then, each match will be
produced as a Unicode object.
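
Something along these lines might do, as a rough sketch only. The
function name find_unicode_matches and the sample line are my own
inventions, and I'm assuming the file really is UTF-8 and that each
line arrives as an undecoded byte string.

```python
import re

def find_unicode_matches(pattern, raw_line, encoding="utf-8"):
    """Decode one raw input line and return its matches as Unicode objects.

    Hypothetical helper: `pattern` is expected to be a Unicode string and
    `raw_line` an encoded byte string.
    """
    text = raw_line.decode(encoding)        # encoded bytes -> Unicode
    return re.findall(pattern, text, re.U)  # re.U: \w etc. follow Unicode rules

# A made-up line standing in for one read from the input file.
raw = u"na\u00efve caf\u00e9\n".encode("utf-8")
words = find_unicode_matches(u"\\w+", raw)
```

With re.U, \w+ picks up the accented words whole instead of stopping at
the first non-ASCII letter.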

> The function "unicode()" seems only to work for strings. Or is there a general way of telling
> Python to abandon the ancient and evil land of iso-8859 for good and use utf-8 only?

The only refuge from ancient and evil lands is found by climbing the
mountain of Unicode: convert from encoded text as soon as you can,
work only with Unicode objects, produce encoded text only when
necessary.
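
Applied to a grep-like program, that advice could look roughly like the
sketch below. The function name grep_unicode is hypothetical, UTF-8 is
assumed on both ends, and in-memory byte streams stand in for the real
input file and the terminal.

```python
import io
import re

def grep_unicode(pattern, in_stream, out_stream, encoding="utf-8"):
    """Write every match found in in_stream to out_stream, one per line.

    Hypothetical helper: both streams carry encoded bytes, and `pattern`
    is a Unicode string.
    """
    regex = re.compile(pattern, re.U)
    for raw_line in in_stream:
        line = raw_line.decode(encoding)      # decode as soon as you can
        for match in regex.findall(line):     # work only with Unicode objects
            # Encode only at the point of output.
            out_stream.write((match + u"\n").encode(encoding))

# In-memory streams stand in for the input file and the terminal.
source = io.BytesIO(u"gr\u00fc\u00dfe aus K\u00f6ln\n".encode("utf-8"))
sink = io.BytesIO()
grep_unicode(u"\\w*[\u00f6\u00fc]\\w*", source, sink)
```

The encoding is named in exactly two places, on the way in and on the
way out, so changing it later touches nothing in between.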

Paul