[Mailman-Users] Migrating from YahooGroups to Mailman
Greg Ward
gward at mems-exchange.org
Tue Jul 31 22:46:19 CEST 2001
On 31 July 2001, Bradford Shaw said:
> I would also be interested in this. We just moved a large list off of
> yahoogroups to its own server and Yahoo refuses to cooperate in letting us
> have the archived messages (over 2 years worth at approx 1500 per month).
I've just poked around groups.yahoo.com a bit, and it looks like this is
doable (but painful). From any message, you can hit the "View source"
link. This takes you to a page like
http://groups.yahoo.com/group/NucNews/message/3859?source=1
(message 3859 of the "NucNews" list).
The good news:
  * the URL is dead easy to generate, e.g. here's Python code to suck
    down the entire archive for a list, one HTML file per message:
        import urllib.request
        url_template = "http://groups.yahoo.com/group/%s/message/%d?source=1"
        group_name = "NucNews"
        num_messages = 3887    # total count; every message page shows this
        # message numbers start at 1, so loop over 1..num_messages inclusive
        for msg_num in range(1, num_messages + 1):
            url = url_template % (group_name, msg_num)
            msg_filename = "msg-%04d.html" % msg_num
            urllib.request.urlretrieve(url, msg_filename)
    (UNTESTED -- YMMV)
The bad news:
  * the HTML you download will need serious massaging before it's genuinely
    plain text (i.e. valid RFC 822 messages). E.g. here's a sample from
    message 1 of NucNews:
"""
<!-- start of guts !-->
<!-- Layout !-->
<table border="0" cellspacing="0" cellpadding="0" width="100%">
<pre>From <a href="/group/NucNews/post?protectID=197212253115067135062046036199121208067038025008">prop1 at xxxxx.xxxx</a> Tue Jan 12 14:08:50 1999
X-Digest-Num: 0
Message-ID: <<a href="/group/NucNews/post?protectID=204183107153178091074082017036098126254083020093090065230045073141210143030150043098201196026">62814.0.1.959296473 at e...</a>>
Date: Tue, 12 Jan 1999 17:08:50 -0500
[...]
"""
Note how anything that looks like an email address (including in the
return-path and message-id headers) is turned into a hyperlink. You'll
need to strip out these <a href> tags (see Python's htmllib, or
html.parser in Python 3, although you can probably kludge it with a
regex) and just preserve the content of the tag, e.g. "prop1 at xxxxx.xxxx".
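For what it's worth, the regex kludge might look something like the
following Python 3 sketch. The names ANCHOR_RE, strip_anchors and
dehtmlify are made up for illustration, and the entity-decoding step
assumes the raw page escapes characters like "<" as "&lt;", which the
excerpt above doesn't actually show. Untested against real Yahoo pages.

```python
import html
import re

# Matches one <a ...>label</a> pair and captures only the label.
# (Hypothetical helper names; a kludge, not a real HTML parser.)
ANCHOR_RE = re.compile(r"<a\s[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

def strip_anchors(text):
    """Replace each <a href=...>label</a> with just its label."""
    return ANCHOR_RE.sub(r"\1", text)

def dehtmlify(text):
    """Strip anchors, then decode entities such as &lt; and &amp;.

    Entity decoding is an assumption about the raw page source.
    """
    return html.unescape(strip_anchors(text))

sample = ('From <a href="/group/NucNews/post?protectID='
          '197212253115067135062046036199121208067038025008">'
          'prop1 at xxxxx.xxxx</a> Tue Jan 12 14:08:50 1999')
print(dehtmlify(sample))
```

This leaves the mangled "prop1 at xxxxx.xxxx" address exactly as Yahoo
served it; recovering the real address is a separate problem.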
Even after doing this, you still don't have a valid RFC 822 message,
e.g. here's another excerpt from msg 1 of NucNews:
From: Peace Through Reason <<a href="/group/NucNews/post?protectID=197212253115067135062046036199121208067038025008">prop1 at xxxxx.xxxx</a>
De-HTML-ified, that becomes:
From: Peace Through Reason <prop1 at xxxxx.xxxx
which, even if you ignore the mangled email address, is still missing a
trailing angle-bracket. Someone goofed in converting this message to
HTML, and now you lose! You'll probably have to add some "Fix Yahoo!
bogosity" heuristics to your script. Bummer.
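One such heuristic, as a rough Python 3 sketch: if an address header
has an opening "<" that is never closed, tack a ">" on the end. The
function name fix_unclosed_bracket and the list of headers it checks
are my own assumptions, not anything Yahoo documents, and it only
handles the one breakage shown above.

```python
import re

# An address header whose last "<" has no ">" after it.
# The header list is a guess at which fields carry addresses.
UNCLOSED_RE = re.compile(r"^(From|To|Cc|Reply-To|Sender):.*<[^<>]*$")

def fix_unclosed_bracket(line):
    """Append ">" to an address header whose final "<" is never closed.

    Expects a single header line; trailing whitespace is dropped
    when a fix is applied.
    """
    if UNCLOSED_RE.match(line):
        return line.rstrip() + ">"
    return line

print(fix_unclosed_bracket("From: Peace Through Reason <prop1 at xxxxx.xxxx"))
```

Lines that are already well-formed, and headers not in the list, pass
through untouched, so it should be safe to run over every header line.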
Executive summary: your situation isn't completely hopeless, but it sure
does suck. Bummer.
Greg
--
Greg Ward - software developer gward at mems-exchange.org
MEMS Exchange http://www.mems-exchange.org