No explanation for weird behavior in re module!

Mon Feb 11 03:27:27 EST 2002

synthespian <synthespian at uol.com.br> wrote in message news:<87pu3c7osj.fsf at uol.com.br>...
> sjmachin at lexicon.net (John Machin) writes:
> 
> > synthespian <synthespian at uol.com.br> wrote in message news:<a470re$1cl30l$1 at ID-78052.news.dfncis.de>...
> > > Hi-
> > > 
> > > 	I'm really intrigued by this behavior:
> > > 
> > > >>> import re
> > > >>> p = re.compile('^(der|die|das(\s\w+))')
> > > >>> m = p.match('die Tür, Türen')
> > > >>> n = p.match('das Auto, Autos')
> > > >>> m.group(0)
> > > 'die'
> > > >>> m.group(1)
> > > 'die'
> > > >>> m.group(2)
> > > [nothing!!!!]
> > > >>> n.group(0)
> > > 'das Auto'
> > > >>> n.group(1)
> > > 'das Auto'
> > > >>> n.group(2)
> > > 'Auto'
>  (snip)
> > see that this matches just 'die' also.
> > 
> > Anticipating the next raft of problems: (1) You will need to use the
> > re.UNICODE flag when you call re.compile(), otherwise \w will not
> > recognise the Unicode alphabetics (this *is* documented) (2) You may
> > need to give it an input whose Python type is 'unicode' -- being able
> > to see the umlaut on your screen is not sufficient evidence of this
> > :-) (3) You should get into the habit of using the raw string notation
> > with your regexes whether it is necessary or not, else you will be
> > bitten in the future.
> > 
> > Anyhow, the following works for me:
> > 
> > Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
> > >>> p = re.compile(r'^((der|die|das)(\s\w+))',re.UNICODE)
> > >>> p.match('das Auto').groups()
> > ('das Auto', 'das', ' Auto')
> > >>> z = u'die T\u00FCr, T\u00FCren'
> > >>> p.match(z).groups()
> > (u'die T\xfcr', u'die', u' T\xfcr')
> 
> Hi-
> 	Thanks for helping. Unfortunately, the solution did not yet solve the problem at hand.
> 	You see, I must read lines from a plain text file, which has the form "das Auto, Autos".

It can't be very plain, if you expect to be able to represent German
characters with umlauts and the "Eszett". Do you actually know what
character set you have in this "plain text file"? Unicode UTF-16?
Unicode UTF-8? "Latin 1" ? If not, I suggest that you create a small
file with the umlauts etc and use xd to dump it out and post what are
the hex representations of u-umlaut and eszett and so forth.
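
As a rough sketch of what such a dump would tell you (assuming a modern
Python 3 interpreter rather than 2.2; hex_bytes is a made-up helper
name, not a library function), here is how u-umlaut and eszett come out
as bytes under the three candidate encodings:

```python
def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated hex pairs."""
    return " ".join("%02x" % b for b in text.encode(encoding))

# U+00FC is u-umlaut, U+00DF is eszett.
for enc in ("latin-1", "utf-8", "utf-16-le"):
    print(enc, "|", hex_bytes("\u00fc", enc), "|", hex_bytes("\u00df", enc))
```

A lone fc byte for u-umlaut would point to Latin-1, the pair c3 bc to
UTF-8, and fc 00 to little-endian UTF-16.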

> 	The original program has to ask if "Auto" has "der, die or das" as the definite article.
> 	The problem is that I can't make Python read anything with non-ASCII character set.

The problem is not the reading, it is knowing what character set you
have, and using Python accordingly.

> 	The output you have suggested, "T\xfcr", is, for all practical purposes, unreadable.

That was *not* "suggested output". It was printing out the "repr" of
the data to show that correctly-formulated Python code, derived from a
brief perusal of the documentation, was in fact working on Unicode
data. 0x00FC is the Unicode 16-bit representation of 'Latin small
letter u with diaeresis'. In discussing these types of problems, it is
best to focus on the internal representation of the data, not what you
see on the screen. WYSIWYG? Well, what you see as identical on your
screen or printer can have two or more possible internal
representations.
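
One concrete instance, sketched under the assumption of a Python 3
interpreter (the variable names are illustrative only): the precomposed
u-umlaut and a plain 'u' followed by a combining diaeresis look the same
on screen but are different data, and ascii() plays the role that repr
played in the session above.

```python
import unicodedata

composed = "T\u00fcr"                                 # one code point for u-umlaut
decomposed = unicodedata.normalize("NFD", composed)   # 'u' followed by U+0308

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 3 4
print(ascii(composed))                  # 'T\xfcr'
print(ascii(decomposed))                # 'Tu\u0308r'
```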

> 	As I understood it from the other posts, the "\w+" on the regex will depend on my locales. 

(1) You have only one locale at a time. (2) Was the input file created
in your locale? (3) RTFM & re-read Tim's post: if you want to work in
Unicode, you will need to (a) supply Unicode data and (b) to use \w
properly, supply the re.UNICODE flag to re.compile().
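
To illustrate point (b) with a hedged sketch: in Python 3 (an
assumption; the session quoted earlier is 2.2) str patterns behave as if
re.UNICODE were always on, so the effect of leaving the flag off in 2.x
can only be approximated, here with re.ASCII:

```python
import re

word = "T\u00fcr"

# Approximates a 2.x pattern compiled without re.UNICODE: \w is ASCII-only.
ascii_only = re.compile(r'\w+', re.ASCII)
# Python 3 default for str patterns: \w follows the Unicode database.
unicode_aware = re.compile(r'\w+')

print(ascii_only.match(word).group())      # T
print(unicode_aware.match(word).group())   # Tür
```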

> 
> 	I have, in /etc/environment, the following set-up (native language: Portuguese):
> 
> 	LANG=C
> 	LANGUAGE=pt_BR
> 	LC_ALL=pt_BR
> 	LC_CTYPE=latin1
> 	LESSCHARSET=latin1
> 	NLSPATH=/var/catman
> 	MM_CHARSET=ISO-8859-1
> 
> 	Seems like non-ASCII is a real bother in Python...I'd expected this to have been better looked
> after in Python-2.0...
> 
We don't have enough information yet to know whether it's a
"non-ASCII" problem.
Let's assume for the moment that Unicode was a red herring, caused by
your "Other than the fact that 'Tür' has the 'ü' unicode character".
Let's also assume that your script and your data are both in
ISO-8859-1 aka Latin-1, which can hack both Portuguese and German. Then
your only remaining problem should be, as both Tim and I posted, the
problem with precedence of | in a regex. Try fixing that first, and
see if you still have a problem ... if so, it is likely to be with the
'ü' not being recognised as alphabetic --- then you will need to talk
to the locale experts.  Oh and try this: print 'ü'.isalpha()
Whether the result is 1, 0, or an exception should tell you something.
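
For reference, the precedence point can be sketched as follows (run
under Python 3 by assumption, where \w already covers 'ü' in str
patterns; the names broken and fixed are mine):

```python
import re

# | has the lowest precedence, so this reads as: der  OR  die  OR  das(\s\w+)
broken = re.compile(r'^(der|die|das(\s\w+))')
# Group the articles first, then require the noun after any of them.
fixed = re.compile(r'^((der|die|das)(\s\w+))')

line = 'die T\u00fcr, T\u00fcren'
print(broken.match(line).groups())   # ('die', None)
print(fixed.match(line).groups())    # ('die Tür', 'die', ' Tür')
```

With the broken pattern only the 'das' branch ever demands a following
noun, which is why 'die' matched alone and group(2) came back empty.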

HTH,
John