[Tutor] Python Help: Converting a text file into a specified format

Peter Otten __peter__ at web.de
Sun Mar 29 11:16:17 EDT 2020


Vazquez, Juliana Mary wrote:

> Your task is to process each record in a file named “punned_result.txt”
> (see attached) and convert it into the following format:
> 
> –[2] Brandt, Mary D; London, Jack E. “Health Informatics Standards: A
> User’s Guide.” Journal of AHIMA 71, no. 4 (2000): 39-43.
> 
> 1.    You are required to list all the author names [last_name, first_name
> initial]
> 
> 2.    The content in the quotation is the title of the article
> 
> 3.    Here [2] is the order of the record, 71 is the volume, 4 is the
> issue number, 2000 is the year of publication, and 39-43 is the start and
> end pages of the article in that issue of the journal.
> 
> Hints:
> 
> –You may want to use if/elif/else structure inside a while statement to
> test the first two characters in each line so that you can determine
> whether you need the info in that line or not.

Uh -- that seems to be a rather hard problem to solve with just if and 
while.

> Please help me work through this problem!

Start small and write a script that breaks your input data into individual 
records. These seem to be separated by empty lines. For now you can build a 
list of lists where each publication record is one of the inner lists:

[ 
   ["PMID- 32203977", ...],
   ["PMID- 32203970", ...],
   ...
]
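
A minimal sketch of that first step, assuming that records really are
separated by blank lines. The function name split_records and the sample
lines are made up for illustration; with the real file you would pass the
open file object instead of the sample list.

```python
def split_records(lines):
    """Group lines into records; one or more blank lines end a record."""
    records = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the record collected so far.
            records.append(current)
            current = []
    if current:
        # The last record may not be followed by a blank line.
        records.append(current)
    return records


# Made-up sample standing in for the lines of the real file.
sample = [
    "PMID- 32203977\n",
    "OWN - NLM\n",
    "\n",
    "PMID- 32203970\n",
    "OWN - NLM\n",
]
records = split_records(sample)
print(records)
```

With the real data that would be something like
"with open(filename) as f: records = split_records(f)".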

When you have that working (in the process you may need to ask here again)
you can split each inner list into
(key, [value1, value2, ...]) pairs. Note that the value part is a list
because one key can occur several times in a record. Example: FAU (which is
"full author" according to
https://www.nlm.nih.gov/bsd/mms/medlineelements.html) will occur once for
every author of a publication. A Python dict is well-suited for that data
(look at the dict.setdefault() method, or collections.defaultdict).
Example dict:


{'AD': ['Department of Cardiovascular, Endocrine-Metabolic Diseases '
        'and Aging, IstitutoSuperiore di Sanita, Rome, Italy.',
        'Department of Infectious Diseases, Istituto Superiore di '
        'Sanita, Rome, Italy.',
        'Office of the President, Istituto Superiore di Sanita, Rome, '
        'Italy.'],
 'AID': ['2763667 [pii]', '10.1001/jama.2020.4683 [doi]'],
 'AU': ['Onder G', 'Rezza G', 'Brusaferro S'],
 'CRDT': ['2020/03/24 06:00'],
 'DEP': ['20200323'],
 'DP': ['2020 Mar 23'],
 'EDAT': ['2020/03/24 06:00'],
 'FAU': ['Onder, Graziano', 'Rezza, Giovanni', 'Brusaferro, Silvio'],
 'IS': ['1538-3598 (Electronic)', '0098-7484 (Linking)'],
 'JID': ['7501160'],
 'JT': ['JAMA'],
 'LA': ['eng'],
 'LID': ['10.1001/jama.2020.4683 [doi]'],
 'LR': ['20200323'],
 'MHDA': ['2020/03/24 06:00'],
 'OWN': ['NLM'],
 'PHST': ['2020/03/24 06:00 [entrez]',
          '2020/03/24 06:00 [pubmed]',
          '2020/03/24 06:00 [medline]'],
 'PL': ['United States'],
 'PMID': ['32203977'],
 'PST': ['aheadofprint'],
 'PT': ['Journal Article'],
 'SB': ['AIM', 'IM'],
 'SO': ['JAMA. 2020 Mar 23. pii: 2763667. doi: '
        '10.1001/jama.2020.4683.'],
 'STAT': ['Publisher'],
 'TA': ['JAMA'],
 'TI': ['Case-Fatality Rate and Characteristics of Patients Dying in '
        'Relation to COVID-19 in Italy.']}
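
One possible way to get from a record (a list of lines) to such a dict,
sketched with collections.defaultdict. It assumes that every field line has
the key in the first four columns followed by "-", and that lines without
that marker continue the previous value. Check both assumptions against the
actual file. The name record_to_dict and the line breaks in the sample are
made up.

```python
from collections import defaultdict


def record_to_dict(record):
    """Convert the lines of one record into a {key: [value, ...]} dict."""
    fields = defaultdict(list)
    key = None
    for line in record:
        if line[4:5] == "-":
            # Assumed layout: 4-column key, then "-", then the value.
            key = line[:4].strip()
            fields[key].append(line[5:].strip())
        elif key is not None:
            # Assumed continuation line: append to the latest value.
            fields[key][-1] += " " + line.strip()
    return dict(fields)


# Made-up sample record; the line breaks are only illustrative.
record = [
    "PMID- 32203977",
    "TI  - Case-Fatality Rate and Characteristics of Patients Dying in",
    "      Relation to COVID-19 in Italy.",
    "FAU - Onder, Graziano",
    "FAU - Rezza, Giovanni",
]
fields = record_to_dict(record)
print(fields)
```

The same effect can be had with a plain dict and
fields.setdefault(key, []).append(value).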


Once you have your data in the format above the next step is to print it as 
requested -- and to come up with a way to cope with missing information 
(example: some of the publications do not provide an author).
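
A rough sketch of the printing step with fallbacks for missing fields. It
assumes the volume, issue, and pages are stored under the MEDLINE keys VI,
IP, and PG; verify that against the element list and your data. The helper
names, the placeholder texts, and the use of straight quotes are my own
choices, and the publication date is inserted unchanged. The first sample
dict is reconstructed from the target line in the task, not taken from the
file.

```python
def first(fields, key, default=""):
    """Return the first value stored under key, or default if missing."""
    values = fields.get(key)
    return values[0] if values else default


def format_record(index, fields):
    """Build one citation line; parts without data are left out."""
    authors = "; ".join(fields.get("FAU", [])) or "[No authors listed]"
    # Strip a trailing period so that exactly one is added back.
    title = first(fields, "TI", "[No title]").rstrip(".")
    journal = first(fields, "JT", "[No journal]")
    volume = first(fields, "VI")
    issue = first(fields, "IP")
    date = first(fields, "DP")
    pages = first(fields, "PG")

    text = f'[{index}] {authors}. "{title}." {journal}'
    if volume:
        text += f" {volume}"
    if issue:
        text += f", no. {issue}"
    if date:
        text += f" ({date})"
    if pages:
        text += f": {pages}"
    return text + "."


# Illustrative values reconstructed from the example in the task.
complete = {
    "FAU": ["Brandt, Mary D", "London, Jack E"],
    "TI": ["Health Informatics Standards: A User's Guide."],
    "JT": ["Journal of AHIMA"],
    "VI": ["71"],
    "IP": ["4"],
    "DP": ["2000"],
    "PG": ["39-43"],
}
# A record without authors, volume, issue, or pages.
sparse = {"TI": ["Some title."], "JT": ["JAMA"], "DP": ["2020 Mar 23"]}

print(format_record(2, complete))
print(format_record(1, sparse))
```

Using dict.get() with a default avoids the KeyError you would otherwise see
for the first record that lacks an author.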

Some final tweaks (like extracting the year from the publication date), and 
you are there.
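
For the year, one small sketch, assuming the DP value always starts with a
four-digit year as in '2020 Mar 23' above. The function name extract_year
is made up, and an empty string is returned when no year is found.

```python
import re


def extract_year(publication_date):
    """Return the leading four-digit year of a DP value, or ''."""
    match = re.match(r"\d{4}", publication_date.strip())
    return match.group() if match else ""


print(extract_year("2020 Mar 23"))
```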


