[Tutor] Picking up citations

Dinesh B Vadhia dineshbvadhia at hotmail.com
Sat Feb 7 17:53:55 CET 2009


Wow Kent, what a great start!

I found this post, http://mail.python.org/pipermail/python-list/2006-April/376149.html, which lays out some patterns of legal citations, i.e.:

1. Two names, consisting of one or more words, separated by a "v."
2. One, two, or three citations, each of which has a volume number ("90") followed by a Reporter name ("U.S." or "S.Ct." or "L.Ed."), which consists of one or two words always ending with a ".", followed by a page number ("1893")
3. Each citation may contain a comma and a second page number (", 234 ")
4. Optionally, a parenthesized year ("(1970)") or optional information in parentheses ("(DCMD Ala.1966)")
5. An ending "."
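
As a rough illustration of rules 2 and 3, here is a minimal sketch using only the standard re module (not pyparsing). The names CITE and find_citations are made up for this example. The optional "2d"-style suffix is an assumption on my part, added so that "L.Ed.2d" matches even though it does not end with a "." as rule 2 says.

```python
import re

# Hypothetical pattern for rules 2 and 3: volume, reporter, page,
# plus an optional second page after a comma.
CITE = re.compile(r"""
    \b(?P<volume>\d+)\s+
    (?P<reporter>[A-Z][A-Za-z.]*\.(?:\d[a-z]+)?)\s+  # U.S. / S.Ct. / L.Ed.2d
    (?P<page>\d+)
    (?:,\s*(?P<page2>\d+)\b(?!\s+[A-Z]))?            # second page, not the next volume
""", re.VERBOSE)


def find_citations(text):
    """Return (volume, reporter, page, page2) tuples; page2 may be None."""
    return [(m.group("volume"), m.group("reporter"),
             m.group("page"), m.group("page2"))
            for m in CITE.finditer(text)]


sample = "399 U.S. 78, 90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970)"
for cite in find_citations(sample):
    print(cite)
```

The negative lookahead is what keeps ", 90" (the volume of the next citation) from being read as a second page of "399 U.S. 78". It will not cope with page references that lack the comma, such as "320 876".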

Some things I've noticed include:

* A sequence of DIFFERENT citations is separated by a ';', as in Carter v. Jury Commission of Greene County, 396 U.S. 320, 90 S.Ct. 518, 24 L.Ed.2d 549 (1970); Lathe Turner v. Fouche, 396 U.S. 346, 90 S.Ct. 532, 24 L.Ed.2d 567 (1970); White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966), with the first comma-separated part of each citation containing the names (separated by a "v.")

* A sequence of SIMILAR citations is separated by a ',', as in John Doggone Williams v. Florida, 399 U.S. 78, 90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970), with the first comma-separated part containing the names (separated by a "v.")
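
The two observations above suggest a split-based approach: split on ';' first, then on ','. Below is a minimal sketch of that idea using only the standard library. ENTRY and expand_citations are names invented for this example, and it assumes the run of citations has already been isolated from the surrounding prose (no leading "Page 500", no trailing sentence).

```python
import re

# Hypothetical shape of one entry: "A v. B, <citations> (<year or court note>)"
ENTRY = re.compile(
    r"^(?P<names>.+?\sv\.\s.+?),\s+(?P<cites>\d.*?)\s*\((?P<note>[^)]*)\)$")


def expand_citations(run):
    """Split a ';'-separated run into one line per reporter citation."""
    lines = []
    for entry in run.split(";"):
        m = ENTRY.match(entry.strip())
        if m is None:
            continue  # not shaped like "A v. B, citations (note)"
        cites = []
        for piece in m.group("cites").split(","):
            piece = piece.strip()
            if piece.isdigit() and cites:
                # A bare number is a second page of the previous citation
                cites[-1] += ", " + piece
            else:
                cites.append(piece)
        for cite in cites:
            lines.append("%s, %s (%s)" % (m.group("names"), cite,
                                          m.group("note")))
    return lines


run = ("John Doggone Williams v. Florida, 399 U.S. 78, "
       "90 S.Ct. 1893, 234, 26 L.Ed.2d 446 (1970)")
for line in expand_citations(run):
    print(line)
```

The hard part, finding where each run starts and ends inside free text, is not addressed here.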

I was pondering the same issue about names, i.e. how do you know that "Page 500" is not part of "Carter"?  My thought was to start from the "v." and step backwards a word at a time. Assume that the first word reached (the one right before the "v.") is valid. For each earlier word, check whether its last character is a digit [0-9] or one of these punctuation marks [.,:;]; if so, it is unlikely to be part of the name, so stop there.
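
To make the backward walk concrete, here is a minimal sketch in plain Python. STOP_CHARS and first_party_names are names invented for this example, and splitting on whitespace is an assumption about what counts as a "word".

```python
# Characters that, at the end of a word, suggest it is not part of a name
STOP_CHARS = set("0123456789.,:;")


def first_party_names(text):
    """For each 'v.', walk backwards to collect the first party's name."""
    words = text.split()
    names = []
    for i, word in enumerate(words):
        if word != "v." or i == 0:
            continue
        start = i - 1  # the word right before "v." is assumed valid
        # Keep stepping back until the preceding word ends in a stop character
        while start > 0 and words[start - 1][-1] not in STOP_CHARS:
            start -= 1
        names.append(" ".join(words[start:i]))
    return names


sample = ("Page 500 Carter v. Jury Commission of Greene County, "
          "396 U.S. 320 (1970); Lathe Turner v. Fouche, 396 U.S. 346 (1970). "
          "In John Doggone Williams v. Florida, 399 U.S. 78 (1970), we sought")
print(first_party_names(sample))
```

On this sample the walk stops correctly at "500" and at "(1970);", but it still picks up "In" as part of "In John Doggone Williams", so the rule on its own is not enough for that case.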

I've changed the sample text to include examples of multiple page references:

text = "Page 500 Carter v. Jury Commission of Greene County, 396 U.S. 320 876, 90 S.Ct. 518, 24 L.Ed.2d 549 (1970); Lathe Turner v. Fouche, 396 U.S. 346, 90 S.Ct. 532, 24 L.Ed.2d 567 (1970); White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966). Moreover, the Court has also recognized that the exclusion of a discernible class from jury service injures not only those defendants who belong to the excluded class, but other defendants as well, in that it destroys the possibility that the jury will reflect a representative cross section of the community. In John Doggone Williams v. Florida, 399 U.S. 78, 90 S.Ct. 1893, 234, 26 L.Ed.2d 446 159-60 (1970), we sought to delineate some of the essential features of the jury that is guaranteed, in certain circumstances, by the Sixth Amendment. We concluded that it comprehends, inter alia, 'a fair possibility for obtaining a representative cross-section of the community.' 399 U.S., at 100, 90 S.Ct., at 1906.9 Thus if the Sixth Amendment were applicable here, and petitioner were challenging a post-Duncan petit jury, he would clearly have standing to challenge the systematic exclusion of any identifiable group from jury service."

Okay, I'd better get to grips with pyparsing!

Dinesh




From: Kent Johnson 
Sent: Saturday, February 07, 2009 6:21 AM
To: Dinesh B Vadhia 
Cc: tutor at python.org 
Subject: Re: [Tutor] Picking up citations


On Sat, Feb 7, 2009 at 1:11 AM, Dinesh B Vadhia
<dineshbvadhia at hotmail.com> wrote:
> Hi!  I want to process text that contains citations, in this case in legal
> documents, and pull-out each individual citation.  Here is a sample text:

<snip>

> The results required are:
>
> Carter v. Jury Commission of Greene County, 396 U.S. 320 (1970)
> Carter v. Jury Commission of Greene County, 90 S.Ct. 518 (1970)
> Carter v. Jury Commission of Greene County, 24 L.Ed.2d 549 (1970)
>
> Lathe Turner v. Fouche, 396 U.S. 346 (1970)
> Lathe Turner v. Fouche, 90 S.Ct. 532 (1970)
> Lathe Turner v. Fouche, 24 L.Ed.2d 567 (1970)
>
> White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966)
>
> John Doggone Williams v. Florida, 399 U.S. 78 (1970)
> John Doggone Williams v. Florida, 90 S.Ct. 1893, 234 (1970)
> John Doggone Williams v. Florida, 26 L.Ed.2d 446 (1970)

Here is a close solution using pyparsing. It only gets the last word
of the first name, and it doesn't handle multiple page numbers, so it
misses J. D. Williams entirely. The name is hard - how do you know
that "Page 500" is not part of "Carter" and "In" is not part of "John
Doggone Williams"? The page numbers seem possible in theory but I
don't know how to get pyparsing to do it.

from pprint import pprint as pp
from pyparsing import (CharsNotIn, Combine, OneOrMore, Suppress, Word,
                       alphanums, alphas, delimitedList, nums)

text = ""  # your text

# First party: only the single word right before "v." is captured
Name1 = Word(alphas).setResultsName('name1')
# Second party: one or more words, joined with single spaces
Name2 = Combine(OneOrMore(Word(alphas)), joinString=' ',
                adjacent=False).setResultsName('name2')

# One reporter citation: volume number, reporter abbreviation, page number
Volume = Word(nums).setResultsName('volume')
Reporter = Word(alphas, alphanums + ".").setResultsName('reporter')
Page = Word(nums).setResultsName('page')

VolumeCitation = (Volume + Reporter + Page).setResultsName(
    'volume_citation', listAllMatches=True)
VolumeCitations = delimitedList(VolumeCitation)

# Parenthesized year or court note, e.g. (1970) or (DCMD Ala.1966)
Date = (Suppress('(') +
        Combine(CharsNotIn(')')).setResultsName('date') + Suppress(')'))

FullCitation = (Name1 + Suppress('v.') + Name2 + Suppress(',') +
                VolumeCitations + Date)

for item in FullCitation.scanString(text):
    fc = item[0]
    # Uncomment the following to see the raw parse results
    # pp(fc)
    # print()
    # print(fc.name1)
    # print(fc.name2)
    # for vc in fc.volume_citation:
    #     pp(vc)
    for vc in fc.volume_citation:
        print('%s v. %s, %s %s %s (%s)' % (fc.name1, fc.name2,
              vc.volume, vc.reporter, vc.page, fc.date))
    print()

The output is:
Carter v. Jury Commission of Greene County, 396 U.S. 320 (1970)
Carter v. Jury Commission of Greene County, 90 S.Ct. 518 (1970)
Carter v. Jury Commission of Greene County, 24 L.Ed.2d 549 (1970)

Turner v. Fouche, 396 U.S. 346 (1970)
Turner v. Fouche, 90 S.Ct. 532 (1970)
Turner v. Fouche, 24 L.Ed.2d 567 (1970)

White v. Crook, 251 F.Supp. 401 (DCMD Ala.1966)

Kent