first python program.. lyrics search

Eric @ Zomething eric at zomething.com
Fri Apr 2 22:39:29 EST 2004


Gustavo wrote:

> I am trying to parse the html code and get the information that I need. 
[snip]
> 
> The problem is once I have the M.htm page I don't know how to get from 
> the html code the "http://www.musicsonglyrics.com/M/Metallica/Metallica 
> lyrics.htm" link to proceed with the search. I tried to read about 
> Regular Expressions [snip]


There are several challenges here.  First, some lyric sites have pages that mention, for example, "Metallica" but contain no Metallica lyrics.  Second, the way lyrics are organized and looked up (files or a database) varies by website.  Third, the presentation of the lyrics varies by website.  This makes it hard to generalize an approach, I think, without using a fairly robust spider.

Lyric sites may also put up some access impediments, depending on the site, and cagey sysadmins have been known to pollute the data sent in response to unwelcome requests.

I do think regular expressions are your friend.  They can be used successfully to parse the lyrics, or other information, out of a specific website's pages, even if those pages are not, for example, valid (X)HTML.  One nice way to build a working understanding of REs is to experiment with a program such as redemo.py, which may have been included with your Python distribution under "/Scripts" or the equivalent.
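
As a rough illustration of the regular-expression approach, here is a minimal sketch that pulls links out of an index page and keeps the ones that mention an artist.  The sample markup, the pattern, and the function name find_artist_links are all assumptions made for the example; the real M.htm page may be laid out differently, and the pattern would need adjusting to match it.

```python
import re

# Hypothetical sample of what an index page such as M.htm might contain.
# The real markup on the site may differ.
SAMPLE_HTML = '''
<td><A HREF="http://www.musicsonglyrics.com/M/Madonna/Madonna lyrics.htm">Madonna lyrics</A></td>
<td><a href="http://www.musicsonglyrics.com/M/Metallica/Metallica lyrics.htm">Metallica lyrics</a></td>
'''

# Capture the quoted href value of each anchor tag.  IGNORECASE lets the
# pattern match both <a href=...> and <A HREF=...>, since old pages mix case.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def find_artist_links(html, artist):
    """Return every href in html whose URL mentions the artist name."""
    needle = artist.lower()
    return [url for url in HREF_RE.findall(html) if needle in url.lower()]


links = find_artist_links(SAMPLE_HTML, "Metallica")
print(links)
```

The pattern only handles quoted href values and is tied to one site's habits, which is usually acceptable for a single-site scraper but will not generalize across lyric sites.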


It might be an interesting problem to identify song and poem lyrics by analyzing the strings (line length, word content, etc.), but I am not sure how good a success rate could be achieved.  I suppose it's all about characterizing the data...  that could be fun.
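
To make the line-length idea concrete, here is a small sketch of the kind of features one might compute.  The function names and the thresholds (max_mean_length, min_lines) are arbitrary assumptions chosen for illustration, not tested values, and I would not expect this to classify real pages reliably without tuning.

```python
def line_stats(text):
    """Summarize a block of text by simple line-level features."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return {"lines": 0, "mean_length": 0.0, "repeat_ratio": 0.0}
    mean_length = sum(len(ln) for ln in lines) / len(lines)
    # Share of lines that duplicate another line (choruses tend to repeat).
    repeat_ratio = 1 - len(set(ln.lower() for ln in lines)) / len(lines)
    return {
        "lines": len(lines),
        "mean_length": mean_length,
        "repeat_ratio": repeat_ratio,
    }


def looks_like_lyrics(text, max_mean_length=50, min_lines=4):
    """Crude guess: lyrics are several short lines.  Thresholds are assumed."""
    stats = line_stats(text)
    return stats["lines"] >= min_lines and stats["mean_length"] <= max_mean_length


# Made-up sample text, not taken from any real song.
verse = (
    "the river runs down to the sea\n"
    "the river runs down to the sea\n"
    "and I go where it carries me\n"
    "under a pale and empty sky\n"
)
prose = "This is one long paragraph of ordinary prose that keeps going without any line breaks at all."

print(line_stats(verse))
print(looks_like_lyrics(verse), looks_like_lyrics(prose))
```

Word content (rhyme, vocabulary) could be added as further features, but line shape alone is the easiest place to start.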

Good luck!


Eric 



