help needed with regex and unicode

Mark Tolonen mark.e.tolonen at mailinator.com
Tue Mar 4 02:43:32 EST 2008


"Marc 'BlackJack' Rintsch" <bj_666 at gmx.net> wrote in message 
news:6349rmF23qmbmU1 at mid.uni-berlin.de...
> On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:
>
>> I have a file which contains chinese characters. I just want to find out
>> all the places that these chinese characters occur.
>>
>> The following script doesn't seem to work :(
>>
>> **********************************************************************
>> class RemCh(object):
>>     def __init__(self, fName):
>>         self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
>>         fp = open(fName, 'r')
>>         content = fp.read()
>>         s = re.search('[\u2F00-\u2fdf]', content, re.U)
>>         if s:
>>             print s.group(0)
>> if __name__ == '__main__':
>>     rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
>> **********************************************************************
>>
>> the php file content is something like the following:
>>
>> **********************************************************************
>>     // Check if the folder still has subscribed blogs
>>     $subCount = function1($param1, $param2);
>>     if ($subCount > 0) {
>>         $errors['summary'] = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
>>         $errorMessage  = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
>>     }
>
> Looks like a UTF-8 encoded file viewed as ISO-8859-1.  So you should
> decode `content` to unicode before searching for the chinese characters.
>

I couldn't get your data to decode into anything resembling Chinese, so I
created my own file as an example.  When you read an encoded text file, it
comes in as just a bunch of bytes:

   >>> print open('chinese.txt','r').read()
   我是美国人。  Wǒ shì Měiguórén.  I am an American.

Depending on the terminal, this can print as garbage, because the bytes are
written out without any knowledge of the file's encoding.  Provide the
correct encoding and decode it to Unicode:

   >>> print open('chinese.txt','r').read().decode('utf8')
   我是美国人。  Wǒ shì Měiguórén.  I am an American.
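
For anyone following along on Python 3, here is a rough sketch of the same
two steps.  It is not from the session above: the sample text is held in a
bytes object rather than read from chinese.txt, and the latin-1 decode is
only an assumption about how garbage like that in the original post can
arise.

```python
# Python 3 sketch: bytes read from a file are not text until decoded.
text = "我是美国人。"
raw = text.encode("utf-8")          # what open(name, "rb").read() would return

# Decoding with the wrong codec yields mojibake instead of an error.
garbled = raw.decode("latin-1")

# Decoding with the right codec recovers the original characters.
decoded = raw.decode("utf-8")

print(len(raw), len(garbled), len(decoded))
```

Each of these six characters takes three bytes in UTF-8, which is why the
wrongly decoded string comes out three times as long as the right one.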

Here's the Unicode string.  Note the 'u' before the quotes to indicate
Unicode.

   >>> s=open('chinese.txt','r').read().decode('utf8')
   >>> s
   u'\ufeff\u6211\u662f\u7f8e\u56fd\u4eba\u3002  W\u01d2 sh\xec M\u011bigu\xf3r\xe9n.  I am an American.'
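
The leading \ufeff is a byte order mark that was saved at the start of the
file.  It does no harm to the search, but if it gets in the way, the
'utf-8-sig' codec drops it while decoding.  A small sketch in Python 3
syntax, where the bytes object is a stand-in for the file contents:

```python
import codecs

# A UTF-8 file saved with a byte order mark starts with EF BB BF.
raw = codecs.BOM_UTF8 + "我是美国人。".encode("utf-8")

with_bom = raw.decode("utf-8")         # BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped if present

print(repr(with_bom[0]), repr(without_bom[0]))
```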

When the text is a Unicode string, give the re module a Unicode pattern as
well (in a plain byte-string pattern, \u is not treated as an escape):

   >>> import re
   >>> print re.search(ur'[\u4E00-\u9FA5]',s).group(0)
   我
   >>> print re.findall(ur'[\u4E00-\u9FA5]',s)
   [u'\u6211', u'\u662f', u'\u7f8e', u'\u56fd', u'\u4eba']
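
Since the original question was about finding all the places the characters
occur, re.finditer may be more useful than findall, because each match
carries its offset.  This is only a sketch in Python 3 syntax: find_cjk is a
name I made up, the sample string is invented, and the range U+4E00-U+9FFF
(the CJK Unified Ideographs block) is my choice, slightly wider than the
one above.

```python
import re

# CJK Unified Ideographs block; punctuation such as U+3002 is outside it.
CJK_RUN = re.compile("[\u4e00-\u9fff]+")

def find_cjk(text):
    """Return (line_number, column, run) for each run of CJK ideographs."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in CJK_RUN.finditer(line):
            hits.append((line_number, match.start(), match.group(0)))
    return hits

sample = "$a = 1;\n$msg = '我是美国人。';\n"
for hit in find_cjk(sample):
    print(hit)
```

Line numbers are 1-based and columns are 0-based character offsets into the
decoded line, not byte offsets into the file.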

Hope that helps you.

--Mark



