[Catalog-sig] [Distutils] Specification for package indexes?

Fri Jul 7 20:52:51 CEST 2006

On Jul 7, 2006, at 2:31 PM, Phillip J. Eby wrote:

> At 02:04 PM 7/7/2006 -0400, Fred Drake wrote:
>> On 7/7/06, Jim Fulton <jim at zope.com> wrote:
>> > > +1 on static pages.  I don't, however, see a reason to require
>> > > valid XML.  Or rather, I don't expect to implement XML parsing in
>> > > easy_install; if the spec is too complex to implement with regular
>> > > expression matching, it's probably too complex for people to throw
>> > > together an index with what's at hand.  In particular, I'd like it
>> > > to be practical to put together a simple index just using Apache's
>> > > built-in directory indexes, as long as they use the right URL
>> > > hierarchy.  That means that class or rel attributes should only be
>> > > required for links that are requesting non-index pages to be
>> > > spidered.
>> >
>> > I would find parsing much easier with an XML parser than with
>> > regular expressions.  I think it would be much more robust too.
>>
>> XHTML would be best, though I agree we shouldn't care about validity
>> so much as just well-formedness (which is required).  I think it
>> should be possible to do it with valid XHTML, though, since whether
>> that's desired or not is a python.org policy concern.  (Not that I
>> suspect we'll ever really care about that.)
>>
>> Of course, it should be possible to parse with htmllib and  
>> HTMLParser as well.
>
> I still think requiring even HTML validity or well-formedness is  
> YAGNI; one could indeed just pull all well-formed URLs from the  
> page.  EasyInstall uses this case-insensitive regular expression to  
> find only href'd urls:
>
>     href\s*=\s*['"]?([^'"> ]+)
>
> In the absence of a requirement for more information than this  
> (perhaps coupled with a "rel" attribute in the same element), I'm  
> wary of starting out by requiring even well-formedness, because  
> it's way overkill for the requirements as I understand them.

But I thought we *were* talking about adding rel or class attributes so
that we could determine information about the intended use of a URL.
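
To make the trade-off concrete, here is a rough sketch in present-day
Python that runs the expression quoted above and, for comparison, a small
HTMLParser subclass that also reports each link's rel attribute.  The
sample page, the rel value "homepage", and the names hrefs_by_regex,
LinkCollector and links_with_rel are made up for illustration; none of
this is taken from setuptools.

```python
import re
from html.parser import HTMLParser

# The expression quoted above, compiled case-insensitively as described.
HREF_RE = re.compile(r'''href\s*=\s*['"]?([^'"> ]+)''', re.IGNORECASE)


def hrefs_by_regex(page):
    """Return every href value the regular expression finds, in order."""
    return HREF_RE.findall(page)


class LinkCollector(HTMLParser):
    """Collect (href, rel) pairs from <a> tags; rel is None when absent."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("href"):
            self.links.append((attrs["href"], attrs.get("rel")))


def links_with_rel(page):
    """Parse the page leniently and return the collected (href, rel) pairs."""
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links


# Hypothetical index page; the rel value is an assumption, not a proposal.
SAMPLE = '''<html><body>
<a href="Foo-1.0.tar.gz">Foo-1.0.tar.gz</a>
<A HREF='http://example.com/foo/' rel="homepage">Foo home page</A>
</body></html>'''

if __name__ == "__main__":
    print(hrefs_by_regex(SAMPLE))
    print(links_with_rel(SAMPLE))
```

The regular expression returns only the URLs.  The parser returns the
same URLs plus the rel value, and it does not require the page to be
well-formed.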

> One of the advantages of defining the URL layout as part of the API
> is that it gives you enough contextual information to decide what  
> links should be followed, and which ones are purely informational.

Perhaps someone should propose an API and we'll see. :)
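
As one guess at what "contextual information" from the URL layout could
mean in practice: if a tool knows the root URL of the index, it can tell
from a link's URL alone whether the link stays inside the index.  The
sketch below assumes exactly that rule and nothing more.  The function
name is_index_link and the example.org URLs are hypothetical, and no such
layout has been agreed on in this thread.

```python
from urllib.parse import urljoin, urlsplit


def is_index_link(index_root, page_url, href):
    """Return True when href, resolved against page_url, lies under index_root.

    Assumption: "inside the index" means same scheme and host, with a path
    below the root path.  A real specification might define it differently.
    """
    target = urlsplit(urljoin(page_url, href))
    root = urlsplit(index_root)
    # Treat the root as a directory so "/index" does not match "/indexes/".
    root_path = root.path if root.path.endswith("/") else root.path + "/"
    same_site = (target.scheme, target.netloc) == (root.scheme, root.netloc)
    return same_site and target.path.startswith(root_path)


if __name__ == "__main__":
    root = "http://example.org/index/"
    page = "http://example.org/index/Foo/"
    for href in ("Foo-1.0.tar.gz", "../Bar/", "http://example.com/foo/"):
        print(href, is_index_link(root, page, href))
```

Under this rule a tool would need extra markup, such as a rel attribute,
only for links that fall outside the root and should still be followed.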

> Indeed, the only reason to look at anything *but* hrefs is to  
> indicate that an *external* (i.e. non-index) link should be  
> followed, to spider for other download links.  So if following  
> external links is out of scope for the API we want to define, then  
> *any* information other than the URLs in an API page is YAGNI.

Who said following external links is out of scope?

> Now, all of this is based on my assumption that the use case here  
> is somebody wants to throw together a rough-and-ready package index  
> that tools should be able to use to find *downloadable  
> distributions*.  If you and Jim have much more elaborate use cases  
> in mind, then of course some well-formedness might be useful.

setuptools has a notion of an index.  That notion is not at all well
defined.  Currently, the index has links that are followed to find
package links elsewhere.  This seems reasonably useful.  I dunno.  I'm
not sure I care.  What I do care about is that the index API should be
well defined so that we can implement alternate indexes and alternate
tools to read indexes.  I'm not looking to satisfy use cases beyond what
we have now.

All I want is an API. :)  I'm not bent on XML.

Jim

--
Jim Fulton			mailto:jim at zope.com		Python Powered!
CTO 				(540) 361-1714			http://www.python.org
Zope Corporation	http://www.zope.com		http://www.zope.org