Suggestions for how to approach this problem?

James Stroud jstroud at mbi.ucla.edu
Tue May 8 17:34:05 EDT 2007


John Salerno wrote:
> Marc 'BlackJack' Rintsch wrote:
> Here's what it looks like now:
> 
> 1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray 
> irradiated
> bacteriophage T2.  J. Bacteriol. 87:1330-1338.
> 2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R 
> factor.  Lancet 2:1138.
> 3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic 
> resistance factors in
> Enterobacteriaceae.  34.  The specific effects of the inhibitors of DNA 
> synthesis on the
> transfer of R factor and F factor.  Med. Biol. (Tokyo)  73:79-83.
> 4.  Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician 
> 16:50-54.
> 5.  Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of 
> diverticular disease of the
> colon:  Evaluation of an eleven-year period.  Annals Surg.  166:947-955.
> 
> As you can see, any single citation is broken over several lines as a 
> result of a line break. I want it to look like this:
> 
> 1.  Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray
>     irradiated bacteriophage T2.  J. Bacteriol. 87:1330-1338.
> 2.  Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R
>     factor.  Lancet 2:1138.
> 3.  Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic
>     resistance factors in Enterobacteriaceae.  34.  The specific effects
>     of the inhibitors of DNA synthesis on the
>     transfer of R factor and F factor.  Med. Biol. (Tokyo)  73:79-83.
> 4.  Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician
>     16:50-54.
> 5.  Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of
>     diverticular disease of the colon:  Evaluation of an eleven-year
>     period.  Annals Surg.  166:947-955.
> 
> Now, since this is pasted, it might not even look good to you. But in 
> the second example, the numbers are meant to be bullets and so the 
> indentation would happen automatically (in Word). But for now they are 
> just typed.

If you can count on the person not skipping any numbers in the 
citations, you can take an "AI" approach: keep track of the next number 
you expect, and treat a line as the start of a new citation only if it 
begins with that number followed by a period. That should weed out the 
rare circumstance where some other number followed by a period starts a 
line in the middle of a citation. This is not failsafe: say you were in 
citation 33, it referred to chapter 34, and that "34." happened to start 
a new line. But then again, even a human would take a little time to 
figure that one out, and probably wouldn't be 100% accurate either. I'm 
sure there is an AI word for the type of parser that could parse 
something like this unambiguously, and I'm sure that it has been proven 
to be impossible to create:

import re

# 'lines' is assumed to be a list holding the lines of the pasted text
records = []
record = None
counter = 1
regex = re.compile(r'^(\d+)\. (.*)')
for aline in lines:
   m = regex.search(aline)
   if m is not None:
     recnum, rest = m.groups()
     if int(recnum) == counter:
       # the number we expected: close the old record, start a new one
       if record is not None:
         records.append(record)
       record = [rest.strip()]
       counter += 1
       continue
   # continuation line, including one that starts with an unexpected
   # number such as a stray "34."; skipped if no record has started yet
   if record is not None:
     record.append(aline.strip())

if record is not None:
   records.append(record)

records = [" ".join(r) for r in records]


py> import re
py> records = []
py> record = None
py> counter = 1
py> regex = re.compile(r'^(\d+)\. (.*)')
py> for aline in lines:
...   m = regex.search(aline)
...   if m is not None:
...     recnum, rest = m.groups()
...     if int(recnum) == counter:
...       if record is not None:
...         records.append(record)
...       record = [rest.strip()]
...       counter += 1
...       continue
...   if record is not None:
...     record.append(aline.strip())
...
py> if record is not None:
...   records.append(record)
...
py> records = [" ".join(r) for r in records]
py> records

['Levy, S.B. (1964)  Isologous interference with ultraviolet and X-ray 
irradiated bacteriophage T2.  J. Bacteriol. 87:1330-1338.',
  'Levy, S.B. and T. Watanabe (1966)  Mepacrine and transfer of R 
factor.  Lancet 2:1138.',
  'Takano, I., S. Sato, S.B. Levy and T. Watanabe (1966)  Episomic 
resistance factors in Enterobacteriaceae.  34.  The specific effects of 
the inhibitors of DNA synthesis on the transfer of R factor and F 
factor.  Med. Biol. (Tokyo)  73:79-83.',
  'Levy, S.B. (1967)  Blood safari into Kenya.  The New Physician 
16:50-54.',
  'Levy, S.B., W.T. Fitts and J.B. Leach (1967)  Surgical treatment of 
diverticular disease of the colon:  Evaluation of an eleven-year period. 
  Annals Surg.  166:947-955.']
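
If you then want the numbered, indented layout from John's second example, 
something along these lines might do. It is only a sketch: the function 
name format_citations and the width of 76 are my own assumptions, and it 
relies on textwrap.fill from the standard library to produce the hanging 
indent.

```python
import textwrap

def format_citations(records, width=76):
    """Number each record and wrap it with a hanging indent.

    'records' is assumed to be a list of one-line citation strings.
    The name and the default width are illustrative choices.
    """
    out = []
    for num, rec in enumerate(records, 1):
        prefix = "%d.  " % num
        out.append(textwrap.fill(
            rec,
            width=width,
            initial_indent=prefix,
            # continuation lines line up under the citation text
            subsequent_indent=" " * len(prefix),
        ))
    return "\n".join(out)
```

Calling print(format_citations(records)) on the list above should give one 
numbered citation per entry, with wrapped lines indented under the text. 
Where the lines break depends on the width you pick, so it will not 
necessarily match the hand-wrapped example line for line.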


James


