question about nasty regex
Tim Chase
python.list at tim.thechases.com
Mon Apr 3 12:50:33 EDT 2006
> What I mean is, I want to change, e.g.:
>
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72
> S. Ct. 394, 397, 96 L.Ed. 475 (1952)."
>
> into:
>
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."
>
> Generally, the beginning pattern would consist of:
>
> 1. Two names, consisting of one or more words, always separated by a
> "v."
>
> 2. One, two, or three citations, each of which always has a volume
> number ("342") followed by a name, consisting of one or two word
> units always ending with "." ("U.S."), followed by a page number ("429")
>
> 3. Each citation may contain a comma and a second page number (", 434")
>
> 4. Optionally, a parenthesized year ("(1952)")
>
> 5. A final "."
>>> import re
>>> tests = ['Doremus v. Board of Education of Hawthorne, '
...     '342 U.S. 429, 434, 72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).',
...     'Joe v. Volcano, Fork, 123 Internet, et. al, 314 '
...     'U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459 (2005).',
...     'Grandma v. RIAA, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 '
...     'L.Ed. 459.']
>>> r = re.compile(r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+'
...     r'((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')
>>> results = [r.match(x) for x in tests]
>>> for x in range(0,3):
...     print("Test %i" % x)
...     print("="*20)
...     print("\n".join(["%s: %s" % (a, results[x].group(b))
...         for a, b in zip(["Party1", "Party2", "Court", "Pages",
...         "Extra", "Year"], range(1,7))]))
...
Test 0
====================
Party1: Doremus
Party2: Board of Education of Hawthorne,
Court: 342
Pages: 429, 434,
Extra: 72 S. Ct. 394, 397, 96 L.Ed. 475
Year: (1952)
Test 1
====================
Party1: Joe
Party2: Volcano, Fork, 123 Internet, et. al,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: (2005)
Test 2
====================
Party1: Grandma
Party2: RIAA,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: None
Things get a little messy if one of the parties has digits
followed by whitespace, followed by "U.S." in their name,
such as a fictitious "99 U.S. Luftballoons". Caveat
regextor. Trailing commas also end up in some of the items
(Party2 and Pages in the output above). You may want to
strip them off before reassembling the pieces.
Reassemble the pieces as needed. Season to taste. Bake at
350 for 20-25 minutes until golden brown.
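One way the reassembly might look, as a minimal sketch rather
than a finished solution. The names CITE and shorten are made
up for illustration, and the sketch assumes the first citation
is always a "U.S." one and that a comma separates the second
party from the volume number, as in the test strings above.

```python
import re

# Same pattern as in the session above, split over two lines.
CITE = re.compile(r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+'
                  r'((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')

def shorten(text):
    """Keep the parties, the first (U.S.) citation and the year.

    Hypothetical helper: the extra citations (group 5) are dropped.
    """
    m = CITE.match(text)
    if m is None:
        return text  # not recognized, so hand it back unchanged
    party1, party2, volume, pages, extra, year = m.groups()
    # Strip the trailing commas/whitespace the pattern leaves behind.
    party2 = party2.rstrip(', ')
    pages = pages.rstrip(', ')
    out = '%s v. %s, %s U.S. %s' % (party1, party2, volume, pages)
    if year:
        out += ' ' + year
    return out + '.'

print(shorten('Doremus v. Board of Education of Hawthorne, '
              '342 U.S. 429, 434, 72 S. Ct. 394, 397, 96 L.Ed. 475 '
              '(1952).'))
# Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952).
```

Strings the pattern does not match come back untouched, so
it should be safe to run over text that mixes citations with
other sentences, one candidate string at a time.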
HTH, or at least gets you on the path to regexp mangling.
-tkc