What's the best way to parse this HTML tag?

Roy Smith roy at panix.com
Sun Mar 11 20:28:33 EDT 2012


In article 
<239c4ad7-ac93-45c5-98d6-71a434e1c5aa at r21g2000yqa.googlegroups.com>,
 John Salerno <johnjsal at gmail.com> wrote:

> Getting the time that the song is played is easy, because the time is
> wrapped in a <div> tag all by itself with a class attribute that has a
> specific value I can search for. But the actual song title and artist
> information is harder, because the HTML isn't quite as precise. Here's
> a sample:
> 
> <div class="cmPlaylistContent">
>  <strong>
>   <a href="/lsp/t2995/">
>    Love Without End, Amen
>   </a>
>  </strong>
>  <br/>
>  <a href="/lsp/a436/">
>   George Strait
>  </a>
> [...]
> Therefore, I appeal to your greater wisdom in these matters. Given
> this HTML, is there a "best practice" for how to refer to the song
> title and artist?

Obviously, any attempt at screen scraping is fraught with peril.  
Beautiful Soup is a great tool but it doesn't negate the fact that 
you've made a pact with the devil.  That being said, if I had to guess, 
here's your puppy:

>   <a href="/lsp/t2995/">
>    Love Without End, Amen
>   </a>

The thing to look for is an "a" element with an href that starts with 
"/lsp/t", where "t" is for "track".  Likewise:

>  <a href="/lsp/a436/">
>   George Strait
>  </a>

An href starting with "/lsp/a" is probably an artist link.
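
For what it's worth, here is a rough sketch of that href-prefix idea.  To 
keep it runnable without installing anything, it uses the standard 
library's html.parser rather than Beautiful Soup.  The class and function 
names (PlaylistLinkParser, extract_tracks_and_artists) are made up for 
illustration, and pairing tracks with artists by position assumes every 
track link is followed by exactly one artist link, which the real page may 
not guarantee.

```python
from html.parser import HTMLParser

# Sample markup taken from the quoted message, up to the point where it
# was elided with [...]. The parser tolerates the unclosed <div>.
SAMPLE = """
<div class="cmPlaylistContent">
 <strong>
  <a href="/lsp/t2995/">
   Love Without End, Amen
  </a>
 </strong>
 <br/>
 <a href="/lsp/a436/">
  George Strait
 </a>
"""


class PlaylistLinkParser(HTMLParser):
    """Collect the text of <a> elements by href prefix (hypothetical helper).

    Assumption from the thread: "/lsp/t..." links are tracks and
    "/lsp/a..." links are artists. This is a guess about the site.
    """

    def __init__(self):
        super().__init__()
        self.tracks = []
        self.artists = []
        self._target = None  # list receiving the current link text, if any
        self._chunks = []    # text fragments seen inside the current link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/lsp/t"):
            self._target = self.tracks
        elif href.startswith("/lsp/a"):
            self._target = self.artists
        else:
            self._target = None
        self._chunks = []

    def handle_data(self, data):
        if self._target is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            # Collapse newlines and indentation inside the link text.
            text = " ".join("".join(self._chunks).split())
            self._target.append(text)
            self._target = None
            self._chunks = []


def extract_tracks_and_artists(html):
    """Return (track, artist) pairs, matched up by order of appearance."""
    parser = PlaylistLinkParser()
    parser.feed(html)
    parser.close()
    return list(zip(parser.tracks, parser.artists))


pairs = extract_tracks_and_artists(SAMPLE)
print(pairs)
```

If the site ever changes its URL scheme, both prefixes break at once, so 
it is worth checking that the two lists come back the same length before 
trusting the pairs.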

You owe the Oracle three helpings of tag soup.


