[SciPy-Dev] docstring standard: parameter shape description

Mon Jan 28 19:35:28 EST 2013

On Mon, 28 Jan 2013 13:47:26 -0800 Nathaniel Smith <njs at pobox.com> wrote:

> On Mon, Jan 28, 2013 at 1:21 PM, Joe Harrington <jh at physics.ucf.edu> wrote:
> > On Sun, Jan 27, 2013 at 2:51 PM, Ralf Gommers <ralf.gommers at gmail.com> wrote:
> >> Hi,
> >>
> >> When merging the doc wiki edits there were a large number of changes to the
> >> shape description of parameters/returns. This is not yet described in the
> >> docstring standard
> >> (https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt), and
> >> currently is done in various ways:
> >>
> >> param1 : ndarray, shape (N,)
> >
> > I think it should be consistent between all cases, start with the class
> > and then the shape, and solve the general problem.
> >
> > Initially, I agreed with Josef about being terse, but it reads hard that
> > way and if you're a newbie you might wonder what the numbers in parens
> > are.  The word "shape" does not add an extra line, and the comma makes
> > sense as an appositive in English.
> 
> +1, the word 'shape' is a pretty critical clue the first time you see this.
> 
> > So, I prefer:
> >
> > param1 : ndarray, shape XXXXX
> >
> > For XXXXX, we need to specify:
> >
> > ranges of allowed numbers of dimensions
> > ranges of allowed sizes within each dimension
> > low- or high-side unconstrained sizes in either case
> >
> > We should accept the output of .shape, and define some range
> > conventions.  Of course, there will be pathological cases, particularly
> > in specialist packages that adopt the numpy doc standard, where nothing
> > but text will adequately describe the allowed dimensions ("If there are
> > three dimensions, then the second dimension must...").  A "(see text)"
> > should be allowed after the shape spec.
> >
> > So, this is my counterproposal for inclusion in the standard:
> >
> >
> > -------------------------------------------------------------------------------
> > param1 : ndarray, shape <shapespec> [(see text)]
> > as in
> > param1 : ndarray, shape (2, 2+, dim(any), 4-, 4-6, any) (see text)
> >
> > in <shapespec>:
> >   the spec reads from the slowest-varying to the fastest-varying dimension
> >   a number means exactly that number of items on that axis
> >   a number followed by a "+" ("-") means that number or more (fewer) items
> >   a-b means between a and b items, INCLUSIVE
> >   "any" means any number of items on that axis
> >   dim(dimspec) means the conventions above apply for dimensions instead of items
> >
> > The example would mean an array with dimensions, from slowest to
> > fastest-varying, of size:
> > 2
> > 2 or more
> > (0 or more axes can be inserted here)
> > 0 to 4
> > 4 to 6
> > any size, including absent (use 1+ to require a dimension)
> 
> "any size" should mean 0+. "absent" is not a size. If a function does
> accept an optional final dimension, can we write that like 'shape (N,
> D) or shape (N,)'?

An array dimension cannot have 0 elements (the total size is the product
of the shape tuple's elements).

"1+" means the dimension has to be there.  "any" means it could be there
or not.  There are many cases in image processing where an optional
initial or final dimension appears, so I felt this would cover most of
our cases and avoid most uses of dim().  But you could get rid of "any"
and use "dim(1-)" instead.  I'm not sure which is clearer to a beginner.

(N,) is too obscure to a beginner and might be missed by anyone reading
fast.  Also, (2,) is a valid 1D shape (i.e., it's valid tuple notation).
Having a different meaning from the rules governing the normal output of
.shape is not a good idea.

> For inserting axes, "..." is clearer than the rather opaque
> "any(dim)", and matches existing Python convention.

It's dim(any), not any(dim), so it's clear enough.  "dim()" just means
that its contents apply to dimensions, not items.  The reason to use dim()
is the generality.  What if you can only insert 1 axis, or up to two?
Then you can say "dim(1)" or "dim(2-)".  "..."  doesn't capture this at
all.

However, I don't mind allowing "..." as a shorthand for "dim(any)".
How about adding:

  "..." is an alias for dim(any)

to the spec list.

> Generally, though, for input parameters it's usually best to specify
> the size as a variable rather than a numeric range so it can be
> referred back to later, right? And for output parameters there's no
> need to specify ranges, since the shape should be determined by the
> input?  'in1 : ndarray, shape (N, M), in2 : ndarray, shape (M, K), out
> : ndarray, shape (N, K)'. 

I agree completely.  How about adding:

  a capital letter (variable) means any number of items, for later reference

in the spec list.
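
To illustrate what "for later reference" could mean in practice, here is a
minimal sketch that binds capital-letter variables to sizes and checks that
a repeated variable gets the same size everywhere.  The helper name
bind_shape_variables is hypothetical, and the specs are given as plain
tuples of letters rather than parsed from a docstring; the shapes are what
.shape would return.

```python
def bind_shape_variables(specs, shapes):
    """Match shapes against variable-only specs such as ("N", "M").

    Hypothetical helper, for illustration only.  Returns a dict mapping
    each variable to its size; raises ValueError if one variable would
    need two different sizes.
    """
    bound = {}
    for spec, shape in zip(specs, shapes):
        if len(spec) != len(shape):
            raise ValueError(f"shape {shape} does not match spec {spec}")
        for name, size in zip(spec, shape):
            # setdefault binds the variable the first time it is seen
            # and returns the earlier binding on every later occurrence.
            if bound.setdefault(name, size) != size:
                raise ValueError(f"{name} is {bound[name]}, got {size}")
    return bound


# in1 : ndarray, shape (N, M); in2 : ndarray, shape (M, K)
sizes = bind_shape_variables([("N", "M"), ("M", "K")], [(2, 3), (3, 4)])
# out : ndarray, shape (N, K)
out_shape = (sizes["N"], sizes["K"])
```

Because M appears in both inputs, shapes such as (2, 3) and (5, 4) would be
rejected, which is the constraint the shared variable expresses.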

> The spec in this complexity seems to be in
> peril of overengineering.

Rather than overengineering, I'm trying to prevent underengineering
ourselves into a corner from which it is difficult to recover.  The spec
as proposed will not look strange in any normal case (now that the
omission of variables is fixed).  For those objects that are weird, it
will look as good as it can while still delivering the desired
information.  The danger of not thinking it through now is that we
underspecify, document things a certain way for a while, then discover
that we need more, and further that what we have specified is not
compatible with the best way to do it generally.  Then we need to comb
numpy's 2000+ functions, rewriting all the shapespecs, not to mention
all the other packages that now use the numpy doc spec or derivatives of
it.

> Do we have examples of when these more
> elaborate specifiers would be useful?

Sure, mainly in scientific image processing, such as in astronomy,
especially where true-color, tagged, and mosaicked images are involved.
I've written many routines that handle both the case of a single image
and stacks/arrays of images in arbitrary or semi-arbitrary
configurations, which would be

(dim(any), N, M)
or
(dim(1-2), N, M)

The latter case is common as the input to image mosaicking software
handling either strips or 2D mosaics.
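
A rough sketch of how a routine might accept such inputs, assuming the image
axes are always the last two and working only on the shape tuple.  The
function name split_image_stack_shape and its min_extra/max_extra
parameters are hypothetical, not part of any proposal.

```python
def split_image_stack_shape(shape, min_extra=0, max_extra=None):
    """Split a shape into (leading stack axes, (N, M)).

    Hypothetical helper.  The defaults correspond to (dim(any), N, M);
    min_extra=1, max_extra=2 corresponds to (dim(1-2), N, M).
    """
    shape = tuple(shape)
    n_extra = len(shape) - 2  # number of axes in front of the image axes
    too_few = n_extra < min_extra
    too_many = max_extra is not None and n_extra > max_extra
    if too_few or too_many:
        raise ValueError(f"shape {shape} has {n_extra} leading axes")
    return shape[:-2], shape[-2:]


# A single image and a 5 x 4 mosaic of images both pass (dim(any), N, M).
single = split_image_stack_shape((480, 640))
mosaic = split_image_stack_shape((5, 4, 480, 640), min_extra=1, max_extra=2)
```

Here single is ((), (480, 640)) and mosaic is ((5, 4), (480, 640)); a bare
(480, 640) image would be rejected under the dim(1-2) setting.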

I've also seen cases where there is an array of information ancillary to
each pixel in an image.  For example, the per-pixel status bits from the
Spitzer Space Telescope's calibration pipeline could be carried along
this way, or the temperature, pressure, and humidity in each grid cell
of a general circulation model could be stored this way, or
uncertainties could be kept this way.  The spec would be

(N, M, dim(any))
or
(N, M, dim(0-1))

A true-color image is

(N, M, 3)

except when it's 

(N, M, 4)

or

(3, N, M)

if stored as 3 separate images, or 

(4, N, M)

if it's got transparency.  So, that's 

(3-4, N, M) or (N, M, 3-4)

So, there's an argument for allowing a list of shapespecs.
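
As a small illustration of what a list of two shapespecs implies for the
code behind the docstring, the sketch below decides which alternative a
given shape satisfies.  The name color_axis is hypothetical, and rejecting
shapes that fit both alternatives is an assumption of this sketch, not
something the thread specifies.

```python
def color_axis(shape):
    """Return 0 for (3-4, N, M) or 2 for (N, M, 3-4).

    Hypothetical helper; raises ValueError when neither or both of the
    two alternatives match.
    """
    if len(shape) != 3:
        raise ValueError(f"expected 3 axes, got shape {shape}")
    first = shape[0] in (3, 4)
    last = shape[2] in (3, 4)
    if first == last:
        # Either no color axis at all, or an ambiguous case like (3, 100, 4).
        raise ValueError(f"cannot identify the color axis of shape {shape}")
    return 0 if first else 2
```

For example, color_axis((3, 480, 640)) gives 0 and color_axis((480, 640, 4))
gives 2, while (3, 100, 4) is refused because both alternatives fit.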

The input for mosaicking color, tagged images would be:

(dim(1-2), 3-4, N, M, dim(0-1))

If the latter arrays were restricted to being image stacks as opposed to
2D mosaics, and the routine were smart enough to know that if there's
only one image in the stack then just return it, then:

(dim(1-), 3-4, N, M, dim(0-1))

Also, the shapespec is a little underspecified, still.  In the latter
case, what if you wanted to handle both monochrome and color images?
Then the 3-4 dimension is optional.  I suppose we should add to the spec:

  opt() means this part of the specification is optional

as in

(dim(1-), opt(3-4,) N, M, dim(0-1))

So, my proposal is now:

------------------------------------------------------------------------------
param1 : ndarray, shape <shapespec> [(see text)]
as in
param1 : ndarray, shape (2, 2+, dim(0+), 4-, 4-6, opt(N)) (see text)

in <shapespec>:
  the spec reads from the slowest-varying to the fastest-varying dimension
  a number means exactly that number of items on that axis
  a capital letter (variable) means 1 or more items, for later reference
  a number followed by a "+" ("-") means that number or more (fewer) items
  a-b, where a and b are numerical sizes, means between a and b items, INCLUSIVE
  dim(dimspec) means the conventions above apply for dimensions instead of items
  opt() means this part of the specification is optional
shorthands:
  0 starting a range means the dimension (or list of dimensions) is optional
  "..." is an alias for dim(0+) (any number of dimensions, or none)

The example would mean an array with dimensions, from slowest to
fastest-varying, of size:
2
2 or more
(0 or more axes can appear here)
0 to 4
4 to 6
any size, including absent axis

While the shapespec allows for complex shape options to be specified,
always use the simplest shapespec possible for the object.
-------------------------------------------------------------------------------

The variables let us get rid of "any".  Use either "N" or "opt(N)",
depending on what you mean.

The important thing to remember is that the vast majority of routines
will have nice, simple shapespecs.  We're just ensuring that the complex
cases can be handled in the same documentation standard.
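
To make the rules concrete, here is a sketch of a checker that tests a shape
tuple (the output of .shape) against a shapespec string.  It is only one
possible reading of the proposal, with these assumptions: "n-" is taken as
0 through n, following the "0 to 4" line in the example; the "0 starting a
range means optional" shorthand is applied to dim() only, not to item
counts; opt() must be followed by a comma; and variables are single capital
letters.  All function names are hypothetical.

```python
import re


def _parse_range(text):
    """Turn '3', '2+', '4-' or '4-6' into an inclusive (lo, hi) pair."""
    m = re.fullmatch(r"(\d+)(\+|-\d*)?", text)
    if m is None:
        raise ValueError(f"bad range: {text!r}")
    n, tail = int(m.group(1)), m.group(2)
    if tail is None:
        return n, n          # exactly n
    if tail == "+":
        return n, None       # n or more
    if tail == "-":
        return 0, n          # n or fewer (assumption: lower bound 0)
    return n, int(tail[1:])  # a-b, inclusive


def _split(body):
    """Split on commas that are not inside parentheses."""
    parts, depth, cur = [], 0, ""
    for ch in body:
        if ch == "," and depth == 0:
            parts.append(cur.strip())
            cur = ""
        else:
            depth += (ch == "(") - (ch == ")")
            cur += ch
    parts.append(cur.strip())
    return [p for p in parts if p]


def parse_shapespec(spec):
    """Parse e.g. '(2, 2+, dim(0+), opt(N))' into a list of tokens."""
    tokens = []
    for part in _split(spec.strip()[1:-1]):
        if part == "...":
            tokens.append(("dim", 0, None))  # alias for dim(0+)
        elif part.startswith("dim(") and part.endswith(")"):
            tokens.append(("dim",) + _parse_range(part[4:-1]))
        elif part.startswith("opt(") and part.endswith(")"):
            tokens.append(("opt", parse_shapespec(part[3:])))
        elif re.fullmatch(r"[A-Z]", part):
            tokens.append(("var", part))
        else:
            tokens.append(("size",) + _parse_range(part))
    return tokens


def _match(tokens, shape, bound):
    """Backtracking match of tokens against the remaining shape."""
    if not tokens:
        return not shape
    head, rest = tokens[0], tokens[1:]
    if head[0] == "opt":
        # Try with the optional part present, then with it absent.
        return (_match(head[1] + rest, shape, bound)
                or _match(rest, shape, bound))
    if head[0] == "dim":
        lo, hi = head[1], head[2]
        hi = len(shape) if hi is None else min(hi, len(shape))
        # Try every allowed number of unconstrained axes.
        return any(_match(rest, shape[k:], bound) for k in range(lo, hi + 1))
    if not shape:
        return False
    size = shape[0]
    if head[0] == "var":
        # A variable means 1 or more items and must stay consistent.
        if size < 1 or bound.get(head[1], size) != size:
            return False
        return _match(rest, shape[1:], {**bound, head[1]: size})
    lo, hi = head[1], head[2]
    if size < lo or (hi is not None and size > hi):
        return False
    return _match(rest, shape[1:], bound)


def shape_matches(spec, shape):
    """Return True if the shape tuple satisfies the shapespec string."""
    return _match(parse_shapespec(spec), tuple(shape), {})
```

With this reading, shape_matches("(2, 2+, dim(0+), 4-, 4-6, opt(N))",
(2, 5, 3, 4)) is True (no inserted axes, final axis absent), and
shape_matches("(dim(1-2), N, M)", (480, 640)) is False because at least one
leading axis is required.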

--jh--