joining strings question

patrick.waldo at gmail.com
Fri Feb 29 11:21:47 EST 2008


I tried to make a simple abstraction of my problem, but it's probably
better to get down to the details.  As for the funkiness of the data:
I'm relatively new to Python, so either I'm not processing it well or
it's an artifact of BeautifulSoup.

Basically, I'm using BeautifulSoup to strip the tables from the
Federal Register
(http://www.access.gpo.gov/su_docs/aces/fr-cont.html).  So far my code
strips the HTML and keeps only the departments I'd like to see.  Now I
need to put the results into an Excel file (with pyExcelerator), with
the name of each record and its pdf.  A snippet of my data from
BeautifulSoup looks like this:

['Environmental Protection Agency', 'RULES', 'Approval and
Promulgation of Air Quality Implementation Plans:', 'Illinois;
Revisions to Emission Reduction Market System, ', '11042 [E8-3800]',
'E8-3800.pdf', 'Ohio; Oxides of Nitrogen Budget Trading Program;
Correction, ', '11192 [Z8-2506]', 'Z8-2506.pdf', 'NOTICES', 'Agency
Information Collection Activities; Proposals, Submissions, and
Approvals, ', '11108-11110 [E8-3934]', 'E8-3934.pdf', 'Data
Availability for Lead National Ambient Air Quality Standard Review, ',
'11110-11111 [E8-3935]', 'E8-3935.pdf', 'Environmental Impacts
Statements; Notice of  Availability, ', '11112 [E8-3917]',
'E8-3917.pdf']

What I'd like to see in Excel is this:
'Approval and Promulgation of Air Quality Implementation Plans:
Illinois; Revisions to Emission Reduction Market System, 11042
[E8-3800]'  | 'E8-3800.pdf' | RULES
'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, 11192
[Z8-2506]' |  'Z8-2506.pdf' | RULES
'Agency Information Collection Activities; Proposals, Submissions, and
Approvals, 11108-11110 [E8-3934]' | 'E8-3934.pdf' | NOTICES
'Data Availability for Lead National Ambient Air Quality Standard
Review, 11110-11111 [E8-3935]' | 'E8-3935.pdf' | NOTICES
'Environmental Impacts Statements; Notice of  Availability, 11112
[E8-3917]' | 'E8-3917.pdf' | NOTICES
etc...for every department I want.

Now that I look at it, I've got another problem: 'Approval and
Promulgation of Air Quality Implementation Plans:' should be joined to
both the Illinois and Ohio entries...I love finding these little
inconsistencies!  Once the data is organized with all the titles
joined together appropriately, outputting it to Excel should be
relatively easy.

So my problem is how to join these titles together.  There are a
couple of patterns: every record is followed by a page number, which
is always followed by its pdf filename.
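For concreteness, that pattern could be exploited roughly like this (an
untested sketch; the function name, the pdf regex, and the ':'-carries-over
rule are my own assumptions about the data):

```python
import re

# Regroup the flat BeautifulSoup output into (description, pdf, category)
# rows.  Assumes every record ends with a filename like 'E8-3800.pdf',
# that all-caps entries such as 'RULES' mark a new category, and that a
# title ending in ':' carries over to each entry that follows it.
PDF = re.compile(r'^[A-Z]\d+-\d+\.pdf$')
CATEGORIES = ('RULES', 'PROPOSED RULES', 'NOTICES')

def regroup(flat):
    rows = []
    category = None
    heading = ''    # a shared title ending in ':' (e.g. 'Approval and ...:')
    pending = []    # text pieces collected since the last pdf
    for item in flat:
        if item in CATEGORIES:
            category, heading, pending = item, '', []
        elif PDF.match(item):
            # Normalize whitespace while joining heading + subtitle + pages
            text = ' '.join((heading + ' '.join(pending)).split())
            rows.append((text, item, category))
            pending = []
        elif item.endswith(':'):
            heading = item + ' '
            pending = []
        else:
            pending.append(item)
    return rows
```

Each row then carries the joined title, the pdf, and the RULES/NOTICES
category, which maps directly onto the Excel layout above.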

Any ideas would be much appreciated.

My code so far (excuse the ugliness):

import urllib
import re
from pyExcelerator import *
from BeautifulSoup import BeautifulSoup as BS

#Get the url, make the soup, and get the table to be processed
url = "http://www.access.gpo.gov/su_docs/aces/fr-cont.html"
site = urllib.urlopen(url)
soup = BS(site)
body = soup('table')[1]
tds = body.findAll('td')
mess = []
for td in tds:
    mess.append(str(td))
spacer = re.compile(r'<td colspan="4" height="10">.*')
data = []
x=0
for n, t in enumerate(mess):
    if spacer.match(t):
        data.append(mess[x:n])
        x = n

dept = re.compile(r'<td colspan="4">.*')
title = re.compile(r'<td colspan="3">.*')
title2 = re.compile(r'<td colspan="2".*')
link = re.compile(r'<td align="right">.*')
none = re.compile(r'None')

#Strip the html and organize by department
group = []
db_list = []
for d in data:
    pre_list = []
    for item in d:
        if dept.match(item):
            dept_soup = BS(item)
            try:
                dept_contents = dept_soup('a')[0]['name']
                pre_list.append(str(dept_contents))
            except IndexError:
                break
        elif title.match(item) or title2.match(item):
            title_soup = BS(item)
            title_contents = title_soup.td.string
            if none.match(str(title_contents)):
                pre_list.append(str(title_soup('a')[0]['href']))
            else:
                pre_list.append(str(title_contents))
        elif link.match(item):
            link_soup = BS(item)
            link_contents = link_soup('a')[1]['href']
            pre_list.append(str(link_contents))
    db_list.append(pre_list)
for db in db_list:
    for n, dash_space in enumerate(db):
        dash_space = dash_space.replace('–', '-')     # en dash -> plain hyphen
        dash_space = dash_space.replace('\xa0', ' ')  # non-breaking space -> space
        db[n] = dash_space
download = re.compile(r'http://.*')
for db in db_list:
    for n, pdf in enumerate(db):
        if download.match(pdf):
            filename = re.split('http://.*/',pdf)
            db[n] = filename[1]
#Strip out these departments
AgrDep = re.compile(r'Agriculture Department')
EPA = re.compile(r'Environmental Protection Agency')
FDA = re.compile(r'Food and Drug Administration')
key_data = []
for lst in db_list:
    for db in lst:
        if AgrDep.match(db) or EPA.match(db) or FDA.match(db):
            key_data.append(lst)
#Get appropriate links from covered departments as well
LINK = re.compile(r'^#.*')
links = []
for kd in key_data:
    for item in kd:
        if LINK.match(item):
            links.append(item[1:])
for lst in db_list:
    for db in lst:
        if db in links:
            key_data.append(lst)

print key_data[1]
On Feb 29, 4:35 pm, Steve Holden <st... at holdenweb.com> wrote:
> patrick.wa... at gmail.com wrote:
> > Hi all,
>
> > I have some data with some categories, titles, subtitles, and a link
> > to their pdf and I need to join the title and the subtitle for every
> > file and divide them into their separate groups.
>
> > So the data comes in like this:
>
> > data = ['RULES', 'title','subtitle','pdf',
> > 'title1','subtitle1','pdf1','NOTICES','title2','subtitle2','pdf','title3','subtitle3','pdf']
>
> > What I'd like to see is this:
>
> > [RULES', 'title subtitle','pdf', 'title1 subtitle1','pdf1'],
> > ['NOTICES','title2 subtitle2','pdf','title3 subtitle3','pdf'], etc...
>
> > I've racked my brain for a while about this and I can't seem to figure
> > it out.  Any ideas would be much appreciated.
>
> > Thanks
>
> data = ['RULES', 'title','subtitle','pdf',
> 'title1','subtitle1','pdf1','NOTICES','title2','subtitle2','pdf','title3','subtitle3','pdf']
> olist = []
> while data:
>    if data[0] == data[0].upper():
>      olist.append([data[0]])
>      del data[0]
>    else:
>      olist[-1].append(data[0]+' '+data[1])
>      olist[-1].append(data[2])
>      del data[:3]
> print olist
>
> However, I suspect you should be asking yourself whether this is really
> an appropriate data structure for your needs. If you say what you are
> trying to achieve in the large rather than focusing on a limited
> programming issue there may be much better solutions.
>
> I suspect, for example, that a dict indexed by the categories and with
> the entries each containing a list of tuples might suit your needs much
> better, i.e.
>
> {
>    'RULES':   [('title subtitle', 'pdf'),
>                ('title1 subtitle1', 'pdf')],
>    'NOTICES': [('title2 subtitle2', 'pdf'),
>                ('title3 subtitle3', 'pdf')]}
>
> One final observation: if all the files are PDFs then you might just as
> well throw the 'pdf' strings away and use a constant extension when you
> try and open them or whatever :-). Then the lists of tuples in the dict
> example could just become lists of strings.
>
> regards
>   Steve
>
> --
> Steve Holden        +1 571 484 6266   +1 800 494 3119
> Holden Web LLC              http://www.holdenweb.com/
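
Steve's dict suggestion can be sketched out like this (a minimal sketch;
the all-caps test for category markers and the fixed
title/subtitle/pdf record shape are assumptions carried over from his
example):

```python
# Build the suggested {category: [(title subtitle, pdf), ...]} mapping
# from the flat list.  Assumes all-caps entries mark a new category and
# every record is exactly three items: title, subtitle, pdf.
data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']

grouped = {}
category = None
i = 0
while i < len(data):
    if data[i].isupper():              # category marker, e.g. 'RULES'
        category = data[i]
        grouped[category] = []
        i += 1
    else:
        title, subtitle, pdf = data[i:i + 3]
        grouped[category].append((title + ' ' + subtitle, pdf))
        i += 3
```

Unlike the while-loop version, this leaves the input list intact, and
looking up all the RULES entries later is a single dict access.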
