How to use Unicode regexes?

rhys tucker rhystucker at rhystucker.fsnet.co.uk
Fri Jul 27 16:14:56 EDT 2001


Could somebody show me how to do Unicode regexes? I'm trying to write a strings-like utility for Windows, so I want to match ASCII and Unicode characters in a binary file. Do I need one regex pattern, since ASCII and Unicode are similar for ASCII text characters, or are two patterns needed, since the characters are different byte sizes?
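
To illustrate the size difference, here is a small sketch in current Python 3 syntax. It assumes that "Unicode" here means UTF-16LE, the encoding Windows uses for wide strings; the sample text is arbitrary.

```python
# Assumption: "Unicode" in a Windows binary means UTF-16LE.
text = "Hello"

ascii_bytes = text.encode("ascii")      # one byte per character
utf16_bytes = text.encode("utf-16-le")  # two bytes per character

print(ascii_bytes)  # b'Hello'
print(utf16_bytes)  # b'H\x00e\x00l\x00l\x00o\x00'
```

For ASCII-range characters, the UTF-16LE form is the same byte followed by a NUL byte, so the two forms share byte values but not layout.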

The documentation suggests that I need to use the \w pattern to match Unicode and set the UNICODE flag. I'm not sure what to set or how to set it.
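
As a sketch of what the flag does (written for current Python 3, where re.UNICODE is already the default for str patterns and re.ASCII turns it off), \w can be compared with and without Unicode matching. This applies to decoded text, not to the raw bytes of a binary file, and the sample string is made up for illustration.

```python
import re

# Made-up sample text containing two non-ASCII letters.
text = "caf\u00e9 na\u00efve"

# With Unicode matching, \w accepts any Unicode word character.
unicode_words = re.findall(r"\w+", text, flags=re.UNICODE)
# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['café', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```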

This is what I've done so far. It matches (some?) ASCII characters but misses the Unicode strings.

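#!/usr/bin/env python
# strings program

import sys
from re import compile, findall

# Binary mode: read the raw bytes without newline translation.
f = open(sys.argv[1], "rb")
fl = f.read()
f.close()

# Printable ASCII is octal 040-176 (octal 032 is byte 26, a control
# character), plus NUL.
patt = compile(b"[\040-\176\000]{4,}")

matches = findall(patt, fl)

for match in matches:
    print(match)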
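
One possible way to cover both cases is sketched below in current Python 3 syntax, using two byte patterns rather than one. It assumes the Unicode strings are UTF-16LE and contain only ASCII-range characters; `find_strings` and the sample data are hypothetical names made up for this illustration.

```python
import re

# Runs of 4+ printable ASCII bytes (space through tilde).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")
# Runs of 4+ printable ASCII characters stored as UTF-16LE:
# each character is one printable byte followed by a NUL byte.
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")


def find_strings(data):
    """Return printable strings found in a bytes object (hypothetical helper)."""
    found = [m.decode("ascii") for m in ASCII_RUN.findall(data)]
    found += [m.decode("utf-16-le") for m in UTF16_RUN.findall(data)]
    return found


# Made-up sample: some binary junk around one ASCII and one UTF-16LE string.
sample = (b"\x01\x02plain text\x00\xff"
          + "wide text".encode("utf-16-le")
          + b"\x00\x00\x03")
print(find_strings(sample))  # ['plain text', 'wide text']
```

For a real file the data would come from something like open(path, "rb").read(). The sketch lists all ASCII matches first and then the UTF-16LE matches, not in file order, and it would not find UTF-16 text outside the ASCII range.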



Thanks to those people who answered my earlier question.


rhys







