Using dictionary key as a regular expression class

Terry Reedy tjreedy at udel.edu
Sat Jan 23 02:45:41 EST 2010


On 1/22/2010 9:58 PM, Chris Jones wrote:
> On Fri, Jan 22, 2010 at 08:46:35PM EST, Terry Reedy wrote:

> Do you mean I should just read the file one character at a time?

Whoops, my misdirection (you can .read(1), but this is s  l   o   w).
I meant to suggest processing it a char at a time.

1. If not too big,

for c in open(x, 'rb').read():  # I left .read() off before
# 'b' will get bytes, though ord(c) is the same for ascii chars, whether
# bytes or unicode

2. If too big for that,

for line in open():
   for c in line:    # or left off this part
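
Spelled out a bit more, the two options above might look like the sketch below. It is written for Python 3 and opens the file in text mode, so each c is a one-character str. The function names chars_whole and chars_by_line and the default 'ascii' encoding are assumptions for illustration, not anything from the thread.

```python
def chars_whole(path, encoding="ascii"):
    """Option 1: read the whole file into memory, then walk its characters."""
    with open(path, encoding=encoding) as f:
        for c in f.read():
            yield c


def chars_by_line(path, encoding="ascii"):
    """Option 2: read one line at a time, so a big file never sits in memory."""
    with open(path, encoding=encoding) as f:
        for line in f:
            for c in line:
                yield c
```

Both generators yield the same characters in the same order; only the peak memory use differs.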


>> To only count ascii chars, as should be the case for C code,
>>
>> achars = [0]*63
>> for c in open('xxx', 'c'):
>>    try:
>>      achars[ord(c)-32] += 1
>>    except IndexError:
>>      pass
>>
>> for i,n in enumerate(achars)
>>    print chr(i), n
>>
>> or sum subsets as desired.
>
> Thanks much for the snippet, let me play with it and see if I can come
> up with a Unicode/utf-8 version.. since while I'm at it I might as well
> write something a bit more general than C code.
>
> Since utf-8 is backward-compatible with 7bit ASCII, this shouldn't be
> a problem.

For any extended ascii, use a larger array and count without decoding
(decode only when printing, if need be). For unicode, pass an encoding
to open() and 'c in line' will return unicode chars. Then use *one* dict
or defaultdict. I think something like

from collections import defaultdict
d = defaultdict(int)
...
     d[c] += 1 # if c is new, d[c] defaults to int() == 0
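
Filling in the elided part, a fuller sketch of both suggestions could look like this. It assumes Python 3, where open() takes an encoding argument directly and iterating over bytes yields ints. The names count_chars and count_bytes, the 'utf-8' default, and the 64 KiB chunk size are illustrative assumptions.

```python
from collections import defaultdict


def count_chars(path, encoding="utf-8"):
    """Count decoded characters with a single defaultdict."""
    counts = defaultdict(int)
    with open(path, encoding=encoding) as f:
        for line in f:
            for c in line:
                counts[c] += 1  # a new c starts at int() == 0
    return counts


def count_bytes(path):
    """Count raw byte values in a 256-slot list, with no decoding."""
    counts = [0] * 256
    with open(path, "rb") as f:
        # Read fixed-size chunks until f.read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(65536), b""):
            for b in chunk:  # in Python 3 each b is an int in 0..255
                counts[b] += 1
    return counts
```

With the 256-slot list every byte value is a valid index, so no offset arithmetic or IndexError handling is needed.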

Terry Jan Reedy





