No explanation for weird behavior in re module!

Sun Feb 10 21:09:00 EST 2002

sjmachin at lexicon.net (John Machin) writes:

> synthespian <synthespian at uol.com.br> wrote in message news:<a470re$1cl30l$1 at ID-78052.news.dfncis.de>...
> > Hi-
> > 
> > 	I'm really intrigued by this behavior:
> > 
> > >>> import re
> > >>> p = re.compile('^(der|die|das(\s\w+))')
> > >>> m = p.match('die Tür, Türen')
> > >>> n = p.match('das Auto, Autos')
> > >>> m.group(0)
> >  'die'
> > >>> m.group(1)
> >  'die'
> > >>> m.group(2)
> >  [nothing!!!!]
> > >>> n.group(0)
> >  'das Auto'
> > >>> n.group(1)
> >  'das Auto'
> > >>> n.group(2)
> > 'Auto'
(snip)
> see that this matches just 'die' also.
> 
> Anticipating the next raft of problems: (1) You will need to use the
> re.UNICODE flag when you call re.compile(), otherwise \w will not
> recognise the Unicode alphabetics (this *is* documented) (2) You may
> need to give it an input whose Python type is 'unicode' -- being able
> to see the umlaut on your screen is not sufficient evidence of this
> :-) (3) You should get into the habit of using the raw string notation
> with your regexes whether it is necessary or not, else you will be
> bitten in the future.
> 
> Anyhow, the following works for me:
> 
> Python 2.2 (#28, Dec 21 2001, 12:21:22) [MSC 32 bit (Intel)] on win32
> >>> p = re.compile(r'^((der|die|das)(\s\w+))',re.UNICODE)
> >>> p.match('das Auto').groups()
> ('das Auto', 'das', ' Auto')
> >>> z = u'die T\u00FCr, T\u00FCren'
> >>> p.match(z).groups()
> (u'die T\xfcr', u'die', u' T\xfcr')
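
The grouping difference discussed above can be reproduced in a few lines. This is only a minimal sketch, and it assumes a Python 3 interpreter, where str is already Unicode and \w matches Unicode letters without any flag (unlike the Python 2.2 session quoted above). The names BUGGY and FIXED are made up for illustration.

```python
import re

# Original pattern: the alternatives are der | die | das(\s\w+),
# so the noun group belongs to the 'das' branch only.
BUGGY = re.compile(r'^(der|die|das(\s\w+))')

# Suggested pattern: the articles get their own group, and the noun
# group follows whichever article matched.
FIXED = re.compile(r'^((der|die|das)(\s\w+))')

buggy_groups = BUGGY.match('die Tür, Türen').groups()
fixed_groups = FIXED.match('die Tür, Türen').groups()

print(buggy_groups)   # the noun group did not take part in the match
print(fixed_groups)   # whole match, article, noun with leading space
```

With the first pattern the noun group is None for 'der' and 'die' because that group is never entered; with the second it is filled for all three articles.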

Hi-
	Thanks for helping. Unfortunately, the suggested solution does not yet solve the problem at hand.
	You see, I must read lines from a plain text file, each of the form "das Auto, Autos".
	The original program has to ask whether "Auto" takes "der", "die" or "das" as its definite article.
	The problem is that I can't make Python read anything containing non-ASCII characters.
	The output you suggested, "T\xfcr", is, for all practical purposes, unreadable.
	As I understood it from the other posts, what "\w+" matches in the regex depends on my locale.
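
The task described above (read "das Auto, Autos" lines from a Latin-1 file and pull out article and noun) can be sketched as follows. This is an illustration under stated assumptions, not the original program: it assumes Python 3, assumes the file is encoded as ISO-8859-1, and uses the hypothetical names ENTRY and parse_entry. The sample file is created on the fly only so that the sketch is self-contained.

```python
import os
import re
import tempfile

# Hypothetical pattern: article, whitespace, then the noun.
ENTRY = re.compile(r'^(der|die|das)\s+(\w+)')

def parse_entry(line):
    """Return (article, noun) for a line like 'das Auto, Autos', or None."""
    m = ENTRY.match(line)
    return (m.group(1), m.group(2)) if m else None

# Build a small Latin-1 sample file (assumed encoding).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='latin-1') as f:
    f.write('das Auto, Autos\ndie Tür, Türen\n')

# The bytes are decoded while reading; the encoding is named explicitly
# instead of being taken from the locale settings.
with open(path, encoding='latin-1') as f:
    entries = [parse_entry(line) for line in f]
os.remove(path)

print(entries)
```

Note that "\xfc" in the quoted session is only how the interactive prompt displays the string; the string itself holds the single character ü. In the Python 2 line of the thread, codecs.open(path, 'r', 'latin-1') plays the role that open(..., encoding='latin-1') plays here.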

	I have the following set-up in /etc/environment (native language: Portuguese):

	LANG=C
	LANGUAGE=pt_BR
	LC_ALL=pt_BR
	LC_CTYPE=latin1
	LESSCHARSET=latin1
	NLSPATH=/var/catman
	MM_CHARSET=ISO-8859-1

	Seems like non-ASCII is a real bother in Python... I'd expected this to have been better looked after in Python-2.0...

	I appreciate any help...

	Thank you.

	H


-- 
(º·.¸(¨*·.¸     ¸.·*¨)¸.·º)
 «.·°·. synthespian .·°·.»
(¸.·º(¸.·¨*    *¨·.¸)º·.¸)
	     @
	   u o l
	   c o m
	    b r