re.compile.match() results in unicode strings - why?

Fri Nov 12 06:04:11 EST 2004

Axel Bock wrote:

> Kent Johnson wrote:
> 
>> Apparently if the input strings are unicode then the groups will be as
>> well:
>> [...]
>> Are you sure that exp is not a unicode string?
> 
> hm. pretty much - i read the lines from a text file which contains only
> normal text. a sample line looks like that:
> 
> 6.     call_noparam    1000 runs       149453,1 ms     149,4531 ms/call
> 
> no surprise here, i think ... . Actually I also wrote the program which
> produces that file, and I really didn't use unicode then. opening the file
> with a text editor also does not show unicode, and I can't believe that
> windows does actually manage the unicode stuff transparently to text
> editors. and also I have never heard of file-attached codepage
> information, those would be the only things i could imagine as a reason.

Why do you keep speculating?

[Your code from another post]
> ** CODE **
> string = "1. asdf asdf 327,88"
> exp = re.compile("(\S+) (\S+) (\S+) (\S+).*")
> m = exp.match(string)
> print m.groups()
> ** /CODE **

You could modify that along these lines (untested):

import re

string = "1. asdf asdf 327,88"
# raw string, so the backslashes reach the regex engine unchanged
pattern = r"(\S+) (\S+) (\S+) (\S+).*"
# make sure that there is no unicode input:
assert not isinstance(string, unicode)
assert not isinstance(pattern, unicode)
exp = re.compile(pattern)
m = exp.match(string)
# if the match succeeded, make sure at least one group is a unicode string
# (this assertion fails when every group is a plain str):
if m:
    assert [g for g in m.groups() if isinstance(g, unicode)]

If this runs without raising an AssertionError, meaning you really do get
unicode groups from non-unicode input, we can look further, but I still
think that is unlikely.
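
As a variation, here is a minimal sketch that reports the types instead of
asserting on them. It is written for a current Python 3 interpreter, where
the two string types are str (text) and bytes, rather than the str and
unicode of Python 2. The helper name group_types and the constant PATTERN
are made up for this illustration.

```python
import re

# Same pattern as in the thread, written as a raw string.
PATTERN = r"(\S+) (\S+) (\S+) (\S+).*"


def group_types(pattern, text):
    """Return the type name of the input and of each matched group.

    Hypothetical helper for illustration; returns None when there is no match.
    """
    m = re.match(pattern, text)
    if m is None:
        return None  # no match, so there are no groups to inspect
    return type(text).__name__, [type(g).__name__ for g in m.groups()]


# Text pattern applied to text input.
print(group_types(PATTERN, "1. asdf asdf 327,88"))

# Bytes pattern applied to bytes input.
print(group_types(PATTERN.encode("ascii"), b"1. asdf asdf 327,88"))
```

In both calls the groups come back with the same type as the input, so
printing the type of each line you read from the file should show where
the unicode strings enter.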

Peter