help needed with regex and unicode

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Tue Mar 4 01:51:35 EST 2008


On Tue, 04 Mar 2008 10:49:54 +0530, Pradnyesh Sawant wrote:

> I have a file which contains chinese characters. I just want to find out
> all the places that these chinese characters occur.
> 
> The following script doesn't seem to work :(
> 
> **********************************************************************
> class RemCh(object):
>     def __init__(self, fName):
>         self.pattern = re.compile(r'[\u2F00-\u2FDF]+')
>         fp = open(fName, 'r')
>         content = fp.read()
>         s = re.search('[\u2F00-\u2fdf]', content, re.U)
>         if s:
>             print s.group(0)
> if __name__ == '__main__':
>     rc = RemCh('/home/pradnyesh/removeChinese/delFolder.php')
> **********************************************************************
> 
> the php file content is something like the following:
> 
> **********************************************************************
>     // Check if the folder still has subscribed blogs
>     $subCount = function1($param1, $param2);
>     if ($subCount > 0) {
>         $errors['summary'] = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
>         $errorMessage  = 'æ­ï½ æ½å¤此åï«åéé§ç²è';
>     }

Looks like a UTF-8 encoded file viewed as ISO-8859-1.  So you should
decode `content` to unicode before searching for the Chinese characters.
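
A minimal sketch of what I mean, assuming the file really is UTF-8.  The
names `CJK_PATTERN`, `find_chinese` and `find_chinese_in_file` are made up
for illustration.  The pattern has to be a unicode string (not a raw byte
string) so that the `\u` escapes are interpreted by the string literal.
Note also that U+2F00-U+2FDF is the Kangxi Radicals block; I am assuming
you are really after the CJK Unified Ideographs at U+4E00-U+9FFF, so
adjust the range if that guess is wrong.

```python
# -*- coding: utf-8 -*-
import io
import re

# Assumption: "chinese characters" means CJK Unified Ideographs
# (U+4E00-U+9FFF).  A u'' literal makes Python expand the \u escapes.
CJK_PATTERN = re.compile(u'[\u4e00-\u9fff]+')


def find_chinese(text):
    """Return (offset, run) pairs for each run of CJK ideographs.

    `text` must already be decoded to unicode, not raw bytes.
    """
    return [(m.start(), m.group(0)) for m in CJK_PATTERN.finditer(text)]


def find_chinese_in_file(path, encoding='utf-8'):
    """Decode the file while reading it, then search the unicode text."""
    # io.open decodes the bytes using `encoding` as it reads.
    with io.open(path, 'r', encoding=encoding) as fp:
        return find_chinese(fp.read())
```

Calling `find_chinese_in_file()` on the path to your PHP file should then
give you the character offsets and the matched runs.  If the file turns
out not to be UTF-8, pass the right codec name as `encoding`.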

Ciao,
	Marc 'BlackJack' Rintsch


