[Python-3000] bytes & Py_TPFLAGS_BASETYPE

Mon Sep 17 03:56:09 CEST 2007

On 9/16/07, Mathieu Fenniak <mathieu.fenniak at gmail.com> wrote:
> On 16-Sep-07, at 12:38 PM, Guido van Rossum wrote:
> > I'm not doubting that *your* subclass works well enough. The problem
> > is that it must be robust in the light of *any* subclass, no matter how
> > crazy.
>
> I understand that, but I'm not sure what kind of problems can be
> created by crazy subclasses.  But my imagination of "crazy subclass"
> is pretty limited.
>
> > I'd have to understand more about your app to see whether subclassing
> > truly makes sense.
>
> I didn't want to flood the discussion with too many pointless
> details, so here's the minimum that I think is relevant.  The
> project is pyPdf, a library for reading and writing PDF files.  I've
> been working on making the library support unicode text strings
> within PDF documents.
>
> In a PDF file, a "string" can either be a text string, or a byte
> string.  A string is a text string if it starts with a UTF-16BE BOM
> marker, or if it can be decoded using an encoding called
> PDFDocEncoding (which is specified by the PDF reference, similar to
> Latin-1 but different just to make life difficult).  pyPdf needs to
> be capable of reading and writing these string objects.  Whether a
> string is a byte or a text string, writing out the raw bytes is the
> same process after the text has been encoded.  This lends itself to a
> common StringObject base class:
>
> class StringObject(PdfObject):
>      # contains common behavior for both types of strings, such as
>      # the ability to serialize out a byte array, encrypt/decrypt
>      # strings for "secure" PDF files
>      # also contains reading code that attempts to autodetect whether
>      # the string is a byte or text string
>
> class ByteStringObject(bytes, StringObject):
>      # adds the byte array storage, and passes self back to
>      # StringObject for serialization output
>
> class TextStringObject(str, StringObject):
>      # overrides the default output serialization to encode the
>      # unicode string to match PDF's requirements,
>      # passes the resulting byte array up for serialization.
>
> (complete source code, if you're interested:
> http://hg.pybrary.net/pyPdf-py3/file/fe0dc2014a1b/pyPdf/generic.py)
>
> Deriving from the bytes type provides storage, and also direct & easy
> access to the byte array content.  I think in this case using bytes
> as a base type makes sense, at least as much as using str as a base
> type.  pyPdf derives from list and dict for different PDF object
> types in a similar manner as well.
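
The autodetection rule described in the quoted message could be sketched
roughly as follows. This is only an illustration, not pyPdf's actual code:
the function name decode_pdf_string and the decode_pdfdoc parameter are
hypothetical, and since Python ships no PDFDocEncoding codec, the caller
is assumed to supply a decoder for it.

```python
import codecs


def decode_pdf_string(raw, decode_pdfdoc=None):
    """Return str for a PDF text string, or the original bytes otherwise.

    decode_pdfdoc is a hypothetical caller-supplied function that decodes
    PDFDocEncoding and raises UnicodeDecodeError on failure.
    """
    bom = codecs.BOM_UTF16_BE  # b'\xfe\xff'
    if raw.startswith(bom):
        # Text string: strip the BOM and decode the rest as UTF-16BE.
        return raw[len(bom):].decode("utf-16-be")
    if decode_pdfdoc is not None:
        try:
            return decode_pdfdoc(raw)
        except UnicodeDecodeError:
            pass  # not decodable as text, so treat it as a byte string
    return raw
```

Returning either str or bytes from one function is a simplification; the
classes in the message above would wrap the result instead.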

So suppose my answer was "no, bytes won't be subclassable". How much
would you really lose by having to wrap a separate object around a
bytes object, rather than being able to subclass? How much extra code
do you think you would have to write?

Another way to look at it: how much of the bytes type's API do your
objects really have to support?
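
For a sense of scale, a minimal sketch of the wrapping approach might look
like this. It assumes an immutable bytes type as in current Python 3, and
the methods forwarded here (length, indexing, equality, hashing) are only a
guess at what such objects would need. The write_to method is hypothetical
and writes the raw bytes without any PDF escaping.

```python
import io


class ByteStringObject:
    """Hypothetical wrapper that holds a bytes value instead of subclassing."""

    def __init__(self, data=b""):
        self._data = bytes(data)

    def __bytes__(self):
        return self._data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]

    def __eq__(self, other):
        if isinstance(other, ByteStringObject):
            return self._data == other._data
        if isinstance(other, bytes):
            return self._data == other
        return NotImplemented

    def __hash__(self):
        return hash(self._data)

    def write_to(self, stream):
        # Simplified serialization: raw bytes only, no PDF string escaping.
        stream.write(self._data)


buf = io.BytesIO()
obj = ByteStringObject(b"abc")
obj.write_to(buf)
```

Each further bytes method the objects need would have to be forwarded
explicitly in the same way, which is the extra code being asked about.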

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)