programming unlimited "categories" in python ?

Stephen shriek at gmx.co.uk
Tue Oct 23 09:29:14 EDT 2001


> > I'm developing a catalog application (storing its data in a
> > relational database), where catalog entries are categorized
> > by one or more categories/subcategories.  It would be nice to
> > have "unlimited" levels of subcategories and that seemed
> > simple enough ~ "Use a parent/child" I hear you say ~ and that's
> > what I did but I've since found it's not very flexible further
> > down the line.
> 
> This is a database rather than Python question, but since you asked,
> I'll answer from my database experience.

Not so fast.  From a database point of view, the table schema
is relatively trivial.  However, once you start programming
with "unlimited" categories/subcategories, you find you have
to recurse through the tree again and again.

E.g., in the earlier example, one would have to select all categories
having "Africa" as a parent, then select all categories that are
subcategories of those categories, and so on down the tree. This is
tiresome enough with just 3 levels of categorisation.
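
To make the recursion concrete, here is a minimal sketch in plain Python,
assuming the parent/child rows from the example table have already been
fetched into a list of tuples. The names ROWS, build_children and
descendants are my own, purely for illustration.

```python
from collections import defaultdict

# (parent_id, category_id, label) rows from the example table;
# rows 10 and 11 are taken from the tree diagram.
ROWS = [
    (0, 1, "ByLocation"),
    (0, 2, "BySeverity"),
    (1, 3, "Africa"),
    (3, 4, "Mozambique"),
    (1, 5, "Europe"),
    (3, 6, "SouthAfrica"),
    (2, 7, "Critical"),
    (7, 8, "Death"),
    (5, 9, "Portugal"),
    (2, 10, "Moderate"),
    (7, 11, "Handicap"),
]


def build_children(rows):
    """Map each parent ID to the list of its direct child IDs."""
    children = defaultdict(list)
    for parent_id, category_id, _label in rows:
        children[parent_id].append(category_id)
    return children


def descendants(children, category_id):
    """Return every category ID below category_id, at any depth."""
    found = []
    for child in children.get(category_id, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found


CHILDREN = build_children(ROWS)
print(descendants(CHILDREN, 3))  # all subcategories of Africa
```

This works to any depth, but only because the whole table is already in
memory; done against the database, each level costs another SELECT.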

Almost every application I've encountered requires categorisation of
some sort so I figured it would be good to learn how to do this
"properly" once and for all. I don't believe it's just a matter 
of SQL theory.



> > Let me demonstrate with some example categories/subcategories
> > which we place in a category tree, giving each node a unique ID.
> > The root node is deemed to have node ID zero (0).
> >
> > Root -- 1. ByLocation --- 3. Africa   --- 4.  Mozambique
> >                                       --- 6.  SouthAfrica
> >                       --- 5. Europe   --- 9.  Portugal
> >      -- 2. BySeverity --- 7. Critical --- 8.  Death
> >                                       --- 11. Handicap
> >                       --- 10. Moderate---
> 
> Location and Severity are two separate variables that should be coded
> completely separately from each other.  Severity, if not numeric, is
> usually an ordered series of categories, such as 1=mild, 2=moderate,
> 3=handicap, 4=critical, 5=death.  Coding three-level Location is more
> complicated.  Either
> 1. Split it into three variables - continent, country, city, with
> continent required and country and city optional -or-
> 2. Encode the levels into 'one' entry (which you split into pieces as
> needed).  You could try a separator-delimited entry such as
> continent-country-city (like internet address, windows registry, etc)
> or positional-subfield encoded (like many product ids and library book
> categorizers (as in either Dewey Decimal or Library of Congress
> systems)).  For instance, 0 = Africa, 10 = Mozambique, 11=(capital
> city), 20= Angola, 1000 = Europe, etc.  In other words, 1000s =
> continent, hundreds/tens = country, ones position = city.

Thank you for the advice.  The examples I provided of Location and
Severity were just that though - examples.  And who is to say how
a user of your latest application will wish to categorize his data ?

For example, in the Act! contact management application, you can assign
multiple categories/subcategories to each contact, and the categories
and subcategories seem to be unlimited. I was merely wondering how
to build a Python application (be it web-based or with Tkinter) that
uses such categorization.



> > This structure can be stored in a single table of a database.
> >
> > Parent_ID    Category_ID     Category_Label
> > 0            1               ByLocation
> > 0            2               BySeverity
> > 1            3               Africa
> > 3            4               Mozambique
> > 1            5               Europe
> > 3            6               SouthAfrica
> > 2            7               Critical
> > 7            8               Death
> > 5            9               Portugal
> > etc
> 
> Mixing two variables like this is a mistake.

How else to provide unlimited scalability ?
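
For reference, here is a rough sketch of the single-table layout plus the
one-to-many link table, assuming an in-memory SQLite database through
Python's sqlite3 module. The table and column names (category,
item_category) and the item names (illness_a and so on) are made up for
illustration. A single category is one JOIN; a whole subtree needs one
SELECT per level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        parent_id   INTEGER NOT NULL,
        label       TEXT NOT NULL
    );
    CREATE TABLE item_category (
        item        TEXT NOT NULL,
        category_id INTEGER NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO category (parent_id, category_id, label) VALUES (?, ?, ?)",
    [(0, 1, "ByLocation"), (0, 2, "BySeverity"), (1, 3, "Africa"),
     (3, 4, "Mozambique"), (1, 5, "Europe"), (3, 6, "SouthAfrica"),
     (2, 7, "Critical"), (7, 8, "Death"), (5, 9, "Portugal")],
)
# Hypothetical catalog entries; one row per (item, category) pair.
conn.executemany(
    "INSERT INTO item_category (item, category_id) VALUES (?, ?)",
    [("illness_a", 4), ("illness_a", 8), ("illness_b", 9), ("illness_c", 6)],
)


def items_in_category(conn, category_id):
    """Items tagged with exactly this category: a single JOIN."""
    rows = conn.execute(
        "SELECT DISTINCT ic.item FROM item_category ic "
        "JOIN category c ON c.category_id = ic.category_id "
        "WHERE c.category_id = ? ORDER BY ic.item",
        (category_id,),
    )
    return [row[0] for row in rows]


def subtree_ids(conn, category_id):
    """Collect category_id and everything below it, one SELECT per level."""
    ids = [category_id]
    frontier = [category_id]
    while frontier:
        marks = ",".join("?" * len(frontier))
        rows = conn.execute(
            f"SELECT category_id FROM category WHERE parent_id IN ({marks})",
            frontier,
        ).fetchall()
        frontier = [row[0] for row in rows]
        ids.extend(frontier)
    return ids


def items_under(conn, category_id):
    """Items tagged with this category or any of its subcategories."""
    ids = subtree_ids(conn, category_id)
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT DISTINCT item FROM item_category "
        f"WHERE category_id IN ({marks}) ORDER BY item",
        ids,
    )
    return [row[0] for row in rows]


print(items_in_category(conn, 9))  # items tagged Portugal
print(items_under(conn, 3))        # items anywhere under Africa
```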


> > So far so good.   Cataloged items/illnesses record their
> > categories in a one-to-many table.  For example, an illness
> > with categories  "4" and "8" occurs in Mozambique and can
> > result in death.
> >
> > This appears scalable.
> >
> > Likewise you can easily select all illnesses occurring in
> > Portugal using a JOIN and filtering category ID "9".
> >
> > So what's the problem ?
> >
> > The problem arises if one asks "Which illnesses occur
> > in Africa ?".  First, one has to find all category IDs
> > for which this is true. This may be easy for the example
> > above but imagine if the category ("Africa") has a
> > subcategory for each and every country, then further
> > subcategorized by major cities. To find all possible
> > categories, one would have to do a recursive select
> > finding each subcategory with the appropriate "parent ID".
> > This does not seem like a very efficient method.
> 
> If location were coded with continent either a separate variable/field
> or separate subfield, this query would be trivial, as it should be.

But such an argument can be applied to each and every category,
not just "continent", which implies that there is no such thing
as the boilerplate "unlimited categories/subcategories" code; 
and that each application must be customized instead. 
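
One way the encoded-entry idea might be kept generic, rather than
customized per variable, is to derive a path string for every category
from the parent/child rows themselves. This is only a sketch under my own
assumptions: the "/1/3/4/" path format and the names build_paths and
subtree are hypothetical, not something from the original design.

```python
# (parent_id, category_id, label) rows from the example table and tree.
ROWS = [
    (0, 1, "ByLocation"), (0, 2, "BySeverity"), (1, 3, "Africa"),
    (3, 4, "Mozambique"), (1, 5, "Europe"), (3, 6, "SouthAfrica"),
    (2, 7, "Critical"), (7, 8, "Death"), (5, 9, "Portugal"),
    (2, 10, "Moderate"), (7, 11, "Handicap"),
]


def build_paths(rows):
    """Return {category_id: "/id/id/.../"} by walking up to the root (ID 0)."""
    parent_of = {category_id: parent_id for parent_id, category_id, _ in rows}
    paths = {}
    for category_id in parent_of:
        parts = []
        node = category_id
        while node != 0:
            parts.append(str(node))
            node = parent_of[node]
        # The trailing slash stops "/1/3/" from also matching "/1/30/".
        paths[category_id] = "/" + "/".join(reversed(parts)) + "/"
    return paths


def subtree(paths, category_id):
    """A category plus all its subcategories: a plain prefix match."""
    prefix = paths[category_id]
    return sorted(cid for cid, path in paths.items() if path.startswith(prefix))


PATHS = build_paths(ROWS)
print(PATHS[4])           # path of Mozambique
print(subtree(PATHS, 3))  # Africa and everything below it
```

If such a path were stored as an extra column, the subtree lookup would
presumably become a single prefix query (a LIKE on the path) with no
recursion, at the cost of rewriting paths whenever a category is moved.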

Fortunately, a search on Google
http://www.google.com/search?q=programming+unlimited+categories+&btnG=Google+Search

shows this not to be the case, turning up a multitude of applications
(mainly in Perl) that offer "unlimited categories/subcategories".

Thank you for the advice, Terry.  Sorry if I'm being obstinate,
but I think the problem is a little less simple than it first
appears. Take the everyday address book in Act! or Outlook:
how would you create unlimited levels of categorisation
and unlimited numbers of subcategories ?

Stephen.


