[Mailman-Users] txt.gz character encoding

Mark Sapiro mark at msapiro.net
Sat Jul 28 18:28:18 CEST 2012


Gary Kopp wrote:

>I am trying to download Pipermail archives from
>http://lists.xwiki.org/pipermail/users/. They are offered in txt.gz files. I
>now understand that even though it is not immediately obvious, I can
>download the uncompressed .txt versions by modifying the URL, and the
>resulting files are fine. But if I download one of the txt.gz files and
>unzip it to create a .txt file the results are undecipherable. It looks like
>a different character encoding was used. The beginning of the unzipped file
>has the host server's path in clear text at the top (information that is not
>in the .txt file downloaded directly, BTW), but the rest is gibberish. Is
>there something special about the process that Pipermail uses to produce the
>.gz files, or is this something xwiki.org might have changed?


Mailman/pipermail creates the .txt.gz files in one of two ways
depending on configuration, but both use the same underlying process.
In either case, the message being archived is appended to the .txt
text file.

In the default case, that's all that's done, but Mailman's
cron/nightly_gzip is run overnight to (re)create the .txt.gz file from
the .txt file.

If the installation has set GZIP_ARCHIVE_TXT_FILES to a true value in
mm_cfg.py, when the message is added to the .txt file, the .txt.gz is
(re)created from the .txt file at that time. This involves more
overhead than the default but avoids the issue of messages added
during a day not being in the .txt.gz file until the next day.
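
For illustration, the override would look something like the line
below. This is a sketch that assumes a standard Mailman 2.1 mm_cfg.py;
any true Python value should do.

```python
# In mm_cfg.py (assumed standard Mailman 2.1 layout): regenerate the
# .txt.gz every time a message is appended to the .txt file.
GZIP_ARCHIVE_TXT_FILES = True
```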

In my case, I avoid both the overhead and the delay issue by just not
running cron/nightly_gzip. Then the files served from the archive TOC
page are the .txt files as there are no .txt.gz files.

None of the above addresses your question, however. Whether the
gzipping is done on the fly by pipermail, nightly by
cron/nightly_gzip, or both, it is done via the Python gzip module,
which in turn relies on the Python zlib module to do the actual
compression.
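
As a rough illustration of what compressing the .txt file exactly once
with the gzip module looks like, here is a minimal sketch. It is not
Mailman's actual code; it is written for Python 3, and the helper name
regzip, the file name and the sample contents are all made up for the
example.

```python
import gzip
import os
import shutil
import tempfile


def regzip(txt_path):
    """Recreate txt_path + '.gz' from txt_path, compressing exactly once.

    Hypothetical helper for illustration; not taken from Mailman.
    """
    gz_path = txt_path + ".gz"
    with open(txt_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Exercise the helper on a throwaway file (made-up name and contents).
workdir = tempfile.mkdtemp()
txt = os.path.join(workdir, "2012-July.txt")
sample = b"From someone at example.com  Fri Jul 27 20:27:03 2012\n"
with open(txt, "wb") as f:
    f.write(sample)

gz = regzip(txt)

# Reading the result back through one gzip layer yields the plain text.
with gzip.open(gz, "rb") as f:
    round_trip = f.read()
```

A file produced this way decompresses to readable text after a single
gunzip, which is what you should normally see.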

It appears that there is something in this process in the xwiki.org
installation that actually gzips the file twice.

[msapiro at MSAPIRO ~/Desktop]$ file 2012-July.txt.gz
2012-July.txt.gz: gzip compressed data, from Unix
[msapiro at MSAPIRO ~/Desktop]$ gunzip 2012-July.txt.gz
[msapiro at MSAPIRO ~/Desktop]$ file 2012-July.txt
2012-July.txt: gzip compressed data, was "/var/lib/mailman/archives/private/users/2012-July.txt", last modified: Fri Jul 27 20:27:03 2012, max compression


I.e., it appears that
/var/lib/mailman/archives/private/users/2012-July.txt was compressed
by gzip with its (default) --name option and then the result was
gzipped again.

You can recover the original .txt file from the .txt.gz file in this
case by gunzipping it twice, e.g.

[msapiro at MSAPIRO ~/Desktop]$ gunzip 2012-July.txt.gz
[msapiro at MSAPIRO ~/Desktop]$ mv 2012-July.txt 2012-July.txt.gz
[msapiro at MSAPIRO ~/Desktop]$ gunzip --no-name 2012-July.txt.gz
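
The same recovery can be sketched in Python 3 by stripping gzip layers
until the data no longer starts with the gzip magic bytes. The function
name gunzip_all and the max_layers limit are made up for this example,
and it assumes the real text does not itself begin with those two
bytes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_all(data, max_layers=5):
    """Strip gzip layers until the payload no longer looks gzipped.

    Returns (payload, number_of_layers_removed). Hypothetical helper;
    max_layers is an arbitrary safety limit.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers


# Simulate the doubly compressed archive with made-up contents.
original = b"From someone at example.com  Fri Jul 27 20:27:03 2012\n"
twice = gzip.compress(gzip.compress(original))

text, layers = gunzip_all(twice)
```

For a real download you would read the bytes with
open("2012-July.txt.gz", "rb").read() and write the returned payload
to a .txt file.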


This situation is specific to the xwiki.org installation.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan


