readlines() reading incorrect number of lines?

Fri Dec 21 14:42:21 EST 2007

With problem text files, I've occasionally found it helpful to build
a histogram of character counts, something like this:

"""
chist.py
print a histogram of character frequencies in a named input file
"""

import sys

whitespace      = ' \t\n\r\v\f'
lowercase       = 'abcdefghijklmnopqrstuvwxyz'
uppercase       = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
letters         = lowercase + uppercase
ascii_lowercase = lowercase
ascii_uppercase = uppercase
ascii_letters   = ascii_lowercase + ascii_uppercase
digits          = '0123456789'
hexdigits       = digits + 'abcdef' + 'ABCDEF'
octdigits       = '01234567'
punctuation     = """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""
printable       = digits + letters + punctuation

try:
    fname       = sys.argv[1]
except IndexError:
    # no file name was given on the command line
    print "usage is chist yourfilename"
    sys.exit()

chars           = {}

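# binary mode, so Windows does not treat Ctrl-Z as end of file
f               = open(fname, "rb")
lines           = f.readlines()
f.close()
for line in lines:
    for c in line:
        try:
            chars[ord(c)] += 1
        except KeyError:
            # first time this character has been seen
            chars[ord(c)] = 1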

ords = chars.keys()
ords.sort()

for o in ords:
    if chr(o) in printable:
        c = chr(o)
    else:
        c = "UNP"  # unprintable: whitespace, control or non-ASCII byte

    # columns: ordinal, character (or UNP), count
    print "%5d %-5s %10d" % (o, c, chars[o])
print "_" * 50
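
On a newer Python the same idea gets shorter. This is only a sketch, assuming Python 3, where iterating over a bytes object yields integers and collections.Counter does the tallying. The names char_histogram, format_histogram and print_histogram are made up for illustration; they are not part of chist.py.

```python
import string
from collections import Counter

# Same notion of "printable" as chist.py: digits, letters, punctuation.
# Whitespace is left out on purpose so that it shows up as UNP.
PRINTABLE = string.digits + string.ascii_letters + string.punctuation


def char_histogram(data):
    """Count how often each byte value occurs in a bytes object."""
    return Counter(data)  # iterating bytes yields ints on Python 3


def format_histogram(counts):
    """Return one 'ordinal character count' line per byte value, sorted."""
    rows = []
    for o in sorted(counts):
        c = chr(o) if chr(o) in PRINTABLE else "UNP"
        rows.append("%5d %-5s %10d" % (o, c, counts[o]))
    return rows


def print_histogram(fname):
    """Print the histogram for the named file, read in binary mode."""
    with open(fname, "rb") as f:  # binary mode: no Ctrl-Z handling
        counts = char_histogram(f.read())
    print("\n".join(format_histogram(counts)))
    print("_" * 50)
```

Calling print_histogram("yourfilename") should give the same three columns as the script above. A Ctrl-Z in the file would appear as a row for ordinal 26 marked UNP.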

Gerry

On Dec 20, 5:47 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Dec 21, 8:13 am, Steven D'Aprano <st... at REMOVE-THIS-cybersource.com.au> wrote:
> > [Fixing top-posting.]
>
> > On Thu, 20 Dec 2007 12:41:44 -0800, Wojciech Gryc wrote:
> > > On Dec 20, 3:30 pm, John Machin <sjmac... at lexicon.net> wrote:
> > [snip]
> > >> > However, when I use Python's various methods -- readline(),
> > >> > readlines(), or xreadlines() and loop through the lines of the file,
> > >> > the line program exits at 16,000 lines. No error output or anything
> > >> > -- it seems the end of the loop was reached, and the code was
> > >> > executed successfully.
> > ...
> > >> One possibility: you are running this on Windows and the file contains
> > >> Ctrl-Z aka chr(26) aka '\x1a'.
>
> > > Hi,
>
> > > Python 2.5, on Windows XP. Actually, I think you may be right about \x1a
> > > -- there's a few lines that definitely have some strange character
> > > sequences, so this would make sense... Would you happen to know how I
> > > can actually fix this (e.g. replace the character)? Since Python doesn't
> > > see the rest of the file, I don't even know how to get to it to fix the
> > > problem... Due to the nature of the data I'm working with, manual
> > > editing is also not an option.
>
> > > Thanks,
> > > Wojciech
>
> > Open the file in binary mode:
>
> > open(filename, 'rb')
>
> > and Windows should do no special handling of Ctrl-Z characters.
>
> > --
> > Steven
>
> I don't know whether it's a bug or a feature or just a dark corner,
> but using mode='rU' does no special handling of Ctrl-Z either.
>
> >>> x = 'foo\r\n\x1abar\r\n'
> >>> f = open('udcray.txt', 'wb')
> >>> f.write(x)
> >>> f.close()
> >>> open('udcray.txt', 'r').readlines()
> ['foo\n']
> >>> open('udcray.txt', 'rU').readlines()
> ['foo\n', '\x1abar\n']
> >>> for line in open('udcray.txt', 'rU'):
> ...    print repr(line)
> ...
> 'foo\n'
> '\x1abar\n'
>
> Using 'rU' should make the OP's task of finding the strange character
> sequences a bit easier -- he won't have to read a block at a time and
> worry about the guff straddling a block boundary.
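
As for actually replacing the character once it has been found: a rough sketch, assuming Python 3 and assuming that dropping (or substituting) every Ctrl-Z byte is acceptable for the data. The function name replace_ctrl_z and the block size are made up for illustration. Since the target here is a single byte, it cannot straddle a block boundary, so reading a block at a time in binary mode is safe in this case; longer sequences would need more care.

```python
def replace_ctrl_z(src_path, dst_path, replacement=b"", block_size=64 * 1024):
    """Copy src_path to dst_path, replacing each Ctrl-Z ('\\x1a') byte.

    Returns the number of Ctrl-Z bytes that were replaced.
    """
    replaced = 0
    # Binary mode on both sides: no Ctrl-Z or newline translation.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break  # end of file
            replaced += block.count(b"\x1a")
            dst.write(block.replace(b"\x1a", replacement))
    return replaced
```

Writing to a separate output file leaves the original untouched, which seems the safer choice when the data cannot be checked by hand.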