Using Beautiful Soup to entangle bookmarks.html

Anthra Norell anthra.norell at tiscalinet.ch
Thu Sep 7 16:47:38 EDT 2006


> Hi,
>
> I'm trying to use the Beautiful Soup package to parse through the
> "bookmarks.html" file which Firefox exports all your bookmarks into.
> I've been struggling with the documentation trying to figure out how to
> extract all the urls. Has anybody got a couple of longer examples using
> Beautiful Soup I could play around with?
>
> Thanks,
> Martin.


Martin,

   SE is a stream editor that does not introduce the overhead and complications of overkill parsing. See if it suits your needs:
http://cheeseshop.python.org/pypi/SE/2.2%20beta

>>> import SE
>>> Bookmark_Filter  = SE.SE ('''
      <EAT>   # delete all unmatched input
      "~(?i)<a.*?href.*?>~==\n"    # keep hrefs and add a new line
      "~(?i)[^>]+/a>~==\n\n"  # keep text till end of anchor and add two newlines
      |   # run
       <a= <A= </a>= </A>= href\== HREF\==  >=      # delete the noise (extend to your liking)
''')

>>> # The 2nd parameter '' requests string output. The default is a file.
>>> print Bookmark_Filter (r'C:\WINDOWS\Application Data\Mozilla\Profiles\default\wwaidm0p.slt\bookmarks.html', '')
...

 "http://www.inksupply.com/index.cfm?source=html/main2.html" ADD_DATE="1016024829" LAST_VISIT="1039439802" LAST_CHARSET="ISO-8859-1"
MIS Associates Inc.

 "http://www.weink.com/" ADD_DATE="1016034183" LAST_VISIT="1118782455" LAST_CHARSET="windows-1252"
Inkjet, Laser, Copier, Fax Supplies

 "http://www.nextrend.com/analysis/content/pr_9-19-2000.asp" ADD_DATE="1018037196" LAST_VISIT="1126289805" LAST_CHARSET="ISO-8859-1"
NexTrend - Press Releases

 "http://wp.netscape.com/escapes/search/netsearch_E.html" ADD_DATE="1021644432" LAST_VISIT="1023182857" LAST_CHARSET="ISO-8859-1"
Net Search Page - Google

 "http://www.python.org/" ADD_DATE="1021651575" LAST_VISIT="1121690494" LAST_CHARSET="ISO-8859-1"
Python Language Website

 "http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch" ADD_DATE="1027354641" LAST_VISIT="1115386846" LAST_CHARSET="windows-1252"
http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch

... etc.


You can refine this further by adding more deletions or substitutions. Adding them one at a time and checking the output after each
change is straightforward. The SE object accepts strings as well as file names, and when given a string it returns a string by
default, so developing interactively in an IDLE window with a sample data string is fast: you can build the filter incrementally,
one step at a time.
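
For readers without SE installed, the same keep-the-anchors idea can be sketched with the standard re module. This is only an illustration in Python 3, not how SE works internally; the pattern, the function name extract_bookmarks and the sample string are assumptions made up for the example.

```python
import re

# Match an anchor tag, capturing the href value and the link text.
# Case-insensitive because bookmarks.html uses upper-case tags (<A HREF=...>).
ANCHOR = re.compile(
    r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)

def extract_bookmarks(html):
    """Return a list of (url, title) pairs found in the bookmark markup."""
    return [(url, title.strip()) for url, title in ANCHOR.findall(html)]

# Hypothetical sample shaped like two bookmarks.html entries.
sample = '''
<DT><A HREF="http://www.python.org/" ADD_DATE="1021651575" LAST_CHARSET="ISO-8859-1">Python Language Website</A>
<DT><A HREF="http://www.weink.com/" ADD_DATE="1016034183">Inkjet, Laser, Copier, Fax Supplies</A>
'''

if __name__ == "__main__":
    # Print each URL followed by its title, separated by blank lines.
    for url, title in extract_bookmarks(sample):
        print(url)
        print(title)
        print()
```

Because only the href value and the link text are captured, attributes such as ADD_DATE and LAST_VISIT are dropped without separate deletion rules. A regular expression like this is adequate for the regular layout of an exported bookmarks file, but it is not a general HTML parser.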

>>> Bookmark_Filter.save ('bookmark_filter.se')    # Save definitions to an editable text file
>>> Bookmark_Filter = SE.SE ('bookmark_filter.se')    # Next time, naming the definition file rebuilds the same object

Regards

Frederic




