Python doc problem example: gzip module (reprise)

Xah Lee xah at xahlee.org
Sat Nov 5 05:24:46 EST 2005


Python Doc Problem Example: gzip

Xah Lee, 20050831

Today i need to use Python to compress/decompress gzip files. Since
i've read the official Python tutorial 8 months ago, have spent 30
minutes with Python 3 times a week since, have 14 years of computing
experience, 8 years in mathematical computing and 4 years in unix admin
and perl, i have quickly found the official doc:
http://python.org/doc/2.4.1/lib/module-gzip.html

I'd imagine it being a function something like:

fileContent = GzipFile(filePath, comprress/decompress)

However, scanning the doc after 20 seconds there's no single example
showing how it is used.

Instead, the doc starts with some arcane info about compatibility with
some other compression module and other software. Then it talks in a
very haphazard way with confused writing about the main function
GzipFile. No perspectives whatsoever about using it to solve a problem
nor a concrete description of how to use it. Instead, jargons of Class,
Constructor, Object etc are thrown together with presumption of
reader's expertise of IO programing in Python and gzip compression
arcana.

After no understanding, and being not a Python expert, i wanted to read
about file objects but there's no link.

After locating the file object's doc page:
http://python.org/doc/2.4.1/lib/bltin-file-objects.html, but itself is
written and organized in a very unhelpful way.

Here's the detail of the problems of its documentation. It starts with:

    «The data compression provided by the zlib module is compatible
with that used by the GNU compression program gzip. Accordingly, the
gzip module provides the GzipFile class to read and write gzip-format
files, automatically compressing or decompressing the data so it looks
like an ordinary file object. Note that additional file formats which
can be decompressed by the gzip and gunzip programs, such as those
produced by compress and pack, are not supported by this module.»

This intro paragraph is about 3 things: (1) the purpose of this gzip
module. (2) its relation with zlib module. (3) A gratuitous arcana
about gzip program's support of “compress and pack” software being
not supported by Python's gzip module. Necessarily mentioned because
how the writing in this paragraph is phrased. The writing itself is a
jumble.

Of the people using the gzip module, vast majority really just need to
decompress a gzip file. They don't need to know (2) and (3) in a
preamble. The worst aspect here is the jumbled writing.

    «class GzipFile( [filename[, mode[, compresslevel[, fileobj]]]])
Constructor for the GzipFile class, which simulates most of the methods
of a file object, with the exception of the readinto() and truncate()
methods. At least one of fileobj and filename must be given a
non-trivial value. The new class instance is based on fileobj, which
can be a regular file, a StringIO object, or any other object which
simulates a file. It defaults to None, in which case filename is opened
to provide a file object.»

This paragraph assumes that readers are thoroughly familiar with
Python's File Objects and its methods. The writing is haphazard and
extremely confusive. Instead of explicitness and clarity, it tries to
convey its meanings by side effects.

• The words “simulate” are usd twice inanely. The sentence
“...Gzipfile class, which simulates...” is better said by
“Gzipfile is modeled after Python's File Objects class.”

• The intention to state that it has all Python's File Object methods
except two of them, is ambiguous phrased. It is as if to say all
methods exists, except that two of them works differently.

• The used of the word “non-trivial value” is inane. What does a
non-trivial value mean here? Does “non-trivial value” have specific
meaning in Python? Or, is it meant with generic English interpretation?
If the latter, then what does it mean to say: “At least one of
fileobj and filename must be given a non-trivial value”? Does it
simply mean one of these parameters must be given?

• The rest of the paragraph is just incomprehensible.

    «When fileobj is not None, the filename argument is only used to
be included in the gzip file header, which may includes the original
filename of the uncompressed file. It defaults to the filename of
fileobj, if discernible; otherwise, it defaults to the empty string,
and in this case the original filename is not included in the header.»

“discernible”? This writing is very confused, and it assumes the
reader is familiar with the technical specification of Gzip.

    «The mode argument can be any of 'r', 'rb', 'a', 'ab', 'w', or
'wb', depending on whether the file will be read or written. The
default is the mode of fileobj if discernible; otherwise, the default
is 'rb'. If not given, the 'b' flag will be added to the mode to ensure
the file is opened in binary mode for cross-platform portability.»

“discernible”? Again, familiarity with the working of Python's file
object is implicitly assumed. For people who do not have expertise with
working with files using Python, it necessatates the reading of
Python's file objects documentation.

    «The compresslevel argument is an integer from 1 to 9 controlling
the level of compression; 1 is fastest and produces the least
compression, and 9 is slowest and produces the most compression. The
default is 9.»

    «Calling a GzipFile object's close() method does not close
fileobj, since you might wish to append more material after the
compressed data. This also allows you to pass a StringIO object opened
for writing as fileobj, and retrieve the resulting memory buffer using
the StringIO object's getvalue() method.»

huh? append more material? pass a StringIO? and memory buffer?

Here, expertise in programing with IO is assumed of the reader.
Meanwhile, the writing is not clear about how exactly what it is trying
to say about the close() method.
Suggestions
--------------------------
A quality documentation should be clear, succinct, precise. And, the
least it assumes reader's expertise to obtain these qualities, the
better it is.

Vast majority of programers using this module really just want to
compress or decompress a file. They do not need to know any more
details about the technicalities of this module nor about the Gzip
compression specification. Here's what Python documentation writers
should do to improve it:

• Rewrite the intro paragraph. Example: “This module provides a
simple interface to compress and decompress files using the GNU
compression format gzip. For detailed working with gzip format, use the
zlib module.”. The “zlib module” phrase should be linked to its
documentation.

• Near the top of the documentation, add a example of usage. A
example is worth a thousand words:

 # decompressing a file
import gzip
fileObj = gzip.GzipFile("/Users/joe/war_and_peace.txt.gz", 'rb');
fileContent = fileObj.read()
fileObj.close()

 # compressing a file
import gzip
fileObj = gzip.GzipFile("/Users/mary/hamlet.txt.gz", 'wb');
fileObj.write(fileContent)
fileObj.close()

• Add at the beginning of the documentation a explicit statement,
that GzipFile() is modeled after Python's File Objects, and provide a
link to it.

• Rephrase the writing so as to not assume that the reader is
thoroughly familiar with Python's IO. For example, when speaking of the
modes 'r', 'rb', ... add a brief statement on what they mean. This way,
readers may not have to take a extra step to read the page on File
Objects.

• Remove arcane technical details about gzip compression to the
bottom as footnotes.

• General advice on the writing: The goal of writing on this module
is to document its behavior, and effectively indicate how to use it.
Keep this in mind when writing the documentation. Make it clear on what
you are trying to say for each itemized paragraph. Make it precise, but
without over doing it. Assume your readers are familiar with Python
language or gzip compression. For example, what are classes and objects
in Python, and what compressions are, compression levels, file name
suffix convention. However, do not assume that the readers are expert
of Python IO, or gzip specification or compression technology and
software in the industry. If exact technical details or warnings are
necessary, move them to footnotes.
---------------
For a collection of essays on Python's documentation problems, see
bottom of:
http://xahlee.org/perl-python/python.html

 Xah
 xah at xahlee.orghttp://xahlee.org/




More information about the Python-list mailing list