question about nasty regex

Tim Chase python.list at tim.thechases.com
Mon Apr 3 12:50:33 EDT 2006


> What I mean is, I want to change, e.g.:
> 
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72 
> S. Ct. 394, 397, 96 L.Ed. 475 (1952)."
> 
> into:
> 
> "Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952)."
> 
> Generally, the beginning pattern would consist of:
> 
> 1. Two names, consisting of one or more words, always separated by a 
> "v."
> 
> 2. One, two, or three citations, each of which always has a volume 
> number ("342") followed by a name, consisting of one or two word 
> units always ending with "." ("U.S."), followed by a page number ("429")
> 
> 3. Each citation may contain a comma and a second page number (", 434")
> 
> 4. Optionally, a parenthesized year ("(1952)")
> 
> 5. A final "."

 >>> import re
 >>> tests = [
...     'Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, 72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).',
...     'Joe v. Volcano, Fork, 123 Internet, et. al, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459 (2005).',
...     'Grandma v. RIAA, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459.',
... ]
 >>> r = re.compile(r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$')
 >>> results = [r.match(x) for x in tests]
 >>> for x in range(0,3):
...     print "Test %i" % x
...     print "="*20
...     print "\n".join(["%s: %s" % (a, results[x].group(b))
...         for a, b in zip(["Party1", "Party2", "Court", "Pages", "Extra", "Year"], range(1, 7))])
...
Test 0
====================
Party1: Doremus
Party2: Board of Education of Hawthorne,
Court: 342
Pages: 429, 434,
Extra: 72 S. Ct. 394, 397, 96 L.Ed. 475
Year: (1952)
Test 1
====================
Party1: Joe
Party2: Volcano, Fork, 123 Internet, et. al,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: (2005)
Test 2
====================
Party1: Grandma
Party2: RIAA,
Court: 314
Pages: 123, 43,
Extra: 88 S. Ct. 394, 397, 97 L.Ed. 459
Year: None


Things get a little messy if one of the parties has digits 
followed by whitespace, followed by "U.S." in its name, 
such as a fictitious "99 U.S. Luftballoons".  Caveat 
regextor.  Trailing commas also end up in some of the 
captured groups (Party2 and Pages in the output above), so 
you may want to strip them off before reassembling the 
pieces.
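One way the comma stripping might look, as a rough sketch 
rather than the only way to do it.  It is written in Python 3 
syntax (unlike the session above), and the names CITE, LABELS 
and parse are made up for illustration; the pattern itself is 
copied unchanged.

```python
import re

# Pattern copied unchanged from the session above.
CITE = re.compile(
    r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$'
)

# Hypothetical labels for groups 1-6 (group 3 is the volume number).
LABELS = ("party1", "party2", "volume", "pages", "extra", "year")

def parse(citation):
    """Return a dict of cleaned-up groups, or None if the pattern does not match."""
    m = CITE.match(citation)
    if m is None:
        return None
    # rstrip(", ") drops any run of trailing commas and spaces; an absent
    # optional group (the year) stays None.
    return {
        label: (value.strip().rstrip(", ") if value is not None else None)
        for label, value in zip(LABELS, m.groups())
    }

fields = parse('Grandma v. RIAA, 314 U.S. 123, 43, 88 S. Ct. 394, 397, 97 L.Ed. 459.')
print(fields["party2"])  # RIAA
print(fields["pages"])   # 123, 43
```

Note that rstrip(", ") would also eat a comma that legitimately 
ends a party name, which seems unlikely in a citation but is 
worth knowing about.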

Reassemble the pieces as needed.  Season to taste.  Bake at 
350 for 20-25 minutes until golden brown.
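For the reassembly step, something along these lines might 
do.  It is only a sketch in Python 3 syntax: the names CITE 
and shorten are invented here, and it assumes the short form 
you want is always "Party1 v. Party2, volume U.S. pages 
(year)." with the year left out when the pattern did not 
capture one.

```python
import re

# Pattern copied unchanged from the session above.
CITE = re.compile(
    r'(.*?)\s+v\.\s+(.*?)\s+(\d+)\s+U\.S\.\s+((?:\d+,\s*)+)\s*(.*?)(\(\d{4}\))?\.$'
)

def shorten(citation):
    """Keep only the U.S. reporter cite; return None if the pattern does not match."""
    m = CITE.match(citation)
    if m is None:
        return None
    party1, party2, volume, pages, _extra, year = m.groups()
    # Groups 2 and 4 are captured with a trailing comma (and possibly a space).
    party2 = party2.rstrip(", ")
    pages = pages.rstrip(", ")
    short = "%s v. %s, %s U.S. %s" % (party1, party2, volume, pages)
    if year:
        short += " " + year
    return short + "."

print(shorten('Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434, '
              '72 S. Ct. 394, 397, 96 L.Ed. 475 (1952).'))
# Doremus v. Board of Education of Hawthorne, 342 U.S. 429, 434 (1952).
```

Returning None for a non-match leaves it to the caller to 
decide whether to keep the original text or flag it for a 
human to look at.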

HTH, or at least gets you on the path to regexp mangling.

-tkc