Regex Matching on Readline()

John Machin sjmachin at lexicon.net
Thu Dec 20 15:48:13 EST 2007


On Dec 21, 7:21 am, jwwest <jww... at gmail.com> wrote:
> On Dec 20, 2:13 pm, John Machin <sjmac... at lexicon.net> wrote:
>
>
>
> > On Dec 21, 6:50 am, jwwest <jww... at gmail.com> wrote:
>
> > > Anyone have any trouble pattern matching on lines returned by
> > > readline? Here's an example:
>
> > > string = "Accounting - General"
> > > pat = ".+\s-"
>
> > > Should match on "Accounting -". However, if I read that string in from
> > > a file it will not match. In fact, I can't get anything to match
> > > except ".*".
>
> > > I'm almost certain that it has something to do with the characters
> > > that python returns from readline(). If I have this in a file:
>
> > > Accounting - General
>
> > > And do a:
>
> > > line = f.readline()
> > > print line
>
> > > I get:
>
> > > A c c o u n t i n g  -  G e n e r a l
>
> > > Not sure why, I'm a nub at Python so any help is appreciated. They
> > > look like spaces to me, but aren't (I've tried matching on spaces too).
>
> > > - james
>
> > To find out what the pseudo-spaces are, do this:
>
> >     print repr(open("the_file", "rb").read()[:100])
>
> > and show us (copy/paste) what you get.
>
> > Also, tell us what platform you are running Python on, and how the
> > file was created (by what software, on what platform).
>
> Here's my output:
> 'A\x00c\x00c\x00o\x00u\x00n\x00t\x00i\x00n\x00g\x00 \x00-\x00 \x00G
> \x00e\x00n\x00e\x00r\x00a\x00l\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
> \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00
> \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00'
>
> I'm running Python on Windows. The file was initially created as
> output from SQL Management Studio. I've re-saved it using TextPad
> which tells me it's Unicode and PC formatted.

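In TextPad (and most Windows software) "Unicode" means UTF-16. The \x00
after every character in your output shows the file is little-endian
UTF-16, and there is no byte order mark at the start. That is also why
your pattern fails on the raw line: the space is followed by \x00, not
by the hyphen, so "\s-" never matches.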

Try this:

import codecs
# little-endian UTF-16, going by the \x00 bytes in your dump
f = codecs.open("the_file", "r", encoding="utf-16-le")
for uline in f:
    # or some other encoding if my guess isn't correct
    line = uline.encode('cp1252')
    # proceed as usual
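
If you want to see the effect without touching the file, here is a
minimal sketch that builds the same bytes in memory and runs your
pattern before and after decoding. It assumes the data really is
little-endian UTF-16 with no byte order mark, as your dump suggests.
match_prefix is a made-up helper name, not anything from the standard
library, and the sketch is written for a current Python (3.x, or 2.7)
rather than for whatever version you have installed.

```python
import re

def match_prefix(raw_bytes, encoding="utf-16-le"):
    """Decode raw file bytes, then apply the pattern from the original post.

    Hypothetical helper for illustration; the encoding default is a guess
    based on the byte dump shown earlier in the thread.
    """
    text = raw_bytes.decode(encoding)
    m = re.match(r".+\s-", text)
    return m.group(0) if m else None

# Stand-in for one line of the file: UTF-16 little-endian, no BOM.
raw = u"Accounting - General".encode("utf-16-le")

# Undecoded: each character is followed by a NUL byte, so the space is
# never immediately followed by "-" and the bytes pattern finds nothing.
undecoded = re.match(br".+\s-", raw)

# Decoded: the NUL bytes are gone and the pattern matches as expected.
decoded = match_prefix(raw)

print(undecoded)  # None
print(decoded)    # Accounting -
```

You can also match directly on the decoded unicode line and skip the
re-encoding to cp1252, provided the rest of your code is happy with
unicode strings.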

Cheers,
John


