regex line by line over file

Steven D'Aprano steve at pearwood.info
Thu Mar 27 01:32:03 EDT 2014


On Wed, 26 Mar 2014 20:23:29 -0700, James Smith wrote:

> I can't get this to work.
> It runs but there is no output when I try it on a file.

Simplify, simplify, simplify. Either you will find the problem, or you 
will find the simplest example that demonstrates the problem.

In this case, the problem is that your regex is not matching what you 
expect it to match. So eliminate all the irrelevant cruft that is 
just noise, complicating the problem. Start with the simplest thing that
works and add complexity until the problem returns.

Eliminate the file. You can embed your data in a string, and try to match 
the regex against the string. Eliminate all the old commented-out code; 
it's irrelevant. Eliminate reading from sys.argv; that has nothing 
to do with the problem.

So we get down to this:

import re
pat = re.compile('^\s*\"SHELF-.*,SC,.*,:\\\"Log Collection In Progress\\\"')
line1 = '    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:\"Log Collection In Progress\",NONE:1700000035-6364-1048,:YEAR=2014,MODE=NONE"'
print(pat.match(line1))

which matches.

Now let's get rid of those leaning toothpicks. We can use print
to see the actual characters in the pattern (the string itself, not 
its repr()), and then use a raw string to clean it up.
At the interactive interpreter:


py> print(pat.pattern)
^\s*"SHELF-.*,SC,.*,:\"Log Collection In Progress\"


Similarly for line1. I'll also use implicit concatenation to split 
it over multiple source lines. In raw strings, r'' or r"", you don't 
need to escape the backslashes. Implicit concatenation means that two 
string literals with no operator between them are implicitly 
concatenated into a single string:

    'abc' "def"

becomes 'abcdef'. By putting the pieces inside parentheses, I can 
put each piece on a separate line, which makes it easier to read 
compared to one giant long line.

pat = re.compile(
        r'^\s*"SHELF-.*,SC,.*,:\"Log Collection In Progress\"'
        )

line1 = (
        '    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:"'
        'Log Collection In Progress",NONE:1700000035-6364-1048,:'
        'YEAR=2014,MODE=NONE"'
        )
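
If you want to convince yourself that the cleaned-up versions really 
are the same strings as the originals, a quick check along these lines 
should do. This is only a sketch: the names old_pattern, new_pattern, 
old_line and new_line are mine, and I've written the \s in the old 
pattern as \\s so the non-raw string doesn't rely on an unrecognised 
escape sequence.

```python
import re

# The original spelling, full of leaning toothpicks (non-raw string).
old_pattern = '^\\s*\"SHELF-.*,SC,.*,:\\\"Log Collection In Progress\\\"'
# The cleaned-up spelling, as a raw string.
new_pattern = r'^\s*"SHELF-.*,SC,.*,:\"Log Collection In Progress\"'

# The original one-line spelling of the data.
old_line = '    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:\"Log Collection In Progress\",NONE:1700000035-6364-1048,:YEAR=2014,MODE=NONE"'
# The same data split with implicit concatenation.
new_line = (
        '    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:"'
        'Log Collection In Progress",NONE:1700000035-6364-1048,:'
        'YEAR=2014,MODE=NONE"'
        )

# Both comparisons should print True if the rewrite changed nothing.
print(old_pattern == new_pattern)
print(old_line == new_line)
```

If either comparison printed False, the clean-up would have changed 
the meaning and the later results couldn't be trusted.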


And at the interactive interpreter, I get a match:

py> pat.match(line1)
<_sre.SRE_Match object at 0xb721ad78>


So now we move on to the content of the one-line file. I don't have
access to the file, so all I have to go by is what you state it 
contains:

[quote]
The test file just has one line:
    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:\"Log Collection In Progress\",NONE:1700000035-6364-1048,:YEAR=2014,MODE=NONE"
[end quote]

which I interpret like this:

line2 = '    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:\"Log Collection In Progress\",NONE:1700000035-6364-1048,:YEAR=2014,MODE=NONE"\n'


(note the newline at the end), or if you prefer:

line2 = (
        '    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:"'
        'Log Collection In Progress",NONE:1700000035-6364-1048,:'
        'YEAR=2014,MODE=NONE"\n'
        )


Except for the newline, it equals line1, and it also matches the 
pattern:

py> pat.match(line2)
<_sre.SRE_Match object at 0xb721ab48>
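
The trailing newline makes no difference because match() only anchors 
at the start of the string; it doesn't care what comes after the part 
the pattern consumed. A small sketch to illustrate, reusing the 
pattern and data from above (m1 and m2 are just names I've picked):

```python
import re

pat = re.compile(
        r'^\s*"SHELF-.*,SC,.*,:\"Log Collection In Progress\"'
        )

line1 = (
        '    "SHELF-17:LOG_COLN_IP,SC,03-25,01-18-58,NEND,NA,,,:"'
        'Log Collection In Progress",NONE:1700000035-6364-1048,:'
        'YEAR=2014,MODE=NONE"'
        )
# Same data, but with the newline you get when reading from a file.
line2 = line1 + '\n'

m1 = pat.match(line1)
m2 = pat.match(line2)

# The matched text stops at the closing quote after "In Progress",
# so it is the same with or without the trailing newline.
print(m1.group(0) == m2.group(0))
print(m1.end() == m2.end())
```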


So now we know that the regex matches the data you think you have.
The next questions are:

- are you reading the right file?
- are you mistaken about the content of the file?

I can't help you with the first. But the second: try running this:

# line2 and pat as defined above
import sys

filename = sys.argv[1]
with open(filename) as f:
    for line in f:
        # Length, equality with the expected line, and the exact characters.
        print(len(line), line==line2, repr(line))
        print(repr(pat.match(line)))


which will show you what the file actually contains and whether or 
not it matches what you think it contains. I expect that the file 
contents are not what you think they are, because the regex matches 
the sample line.
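
If the line from the file turns out not to equal the expected line, 
comparing two long reprs by eye is tedious. Something like the 
following might help pinpoint where they diverge. It's only a sketch: 
first_difference is a helper I've made up for illustration, and the 
sample strings are hypothetical, not taken from your file.

```python
def first_difference(expected, actual):
    """Return the index of the first differing character, or None if equal."""
    for i, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return i
    # One string is a prefix of the other: they differ where the shorter ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Hypothetical example: the second string contains literal backslashes.
expected = 'abc "def"\n'
actual = 'abc \\"def\\"\n'

index = first_difference(expected, actual)
if index is None:
    print("strings are identical")
else:
    # Show a little context around the first mismatch.
    print(index, repr(expected[index:index + 10]), repr(actual[index:index + 10]))
```

In the loop over the file you would call it as 
first_difference(line2, line) for any line where line==line2 is False.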

Good luck!



-- 
Steven


