[Catalog-sig] Hosting documentation on PyPI

Wed Aug 6 19:11:23 CEST 2008

Martin v. Löwis wrote:
>> There's an XSS concern if users can upload arbitrary HTML.  Approval
>> would address some of that, but it might be better to avoid the issue
>> altogether.
>>
>> One way to handle that would be to host each package's documentation on
>> a different domain.  E.g., package.pypi.python.org.
> 
> Can you please elaborate? What is the issue, and how could creating
> domains resolve it?

The issue is that you can put in Javascript that does XMLHttpRequests to 
other URLs on the same domain, and those requests can do things like 
change a user's password, delete packages, etc.  The Javascript will be 
run as the person who is viewing the page.  So if I am logged in to PyPI 
and view some random page on pypi.python.org, and that page contains 
malicious Javascript, that malicious Javascript can do anything on 
pypi.python.org as though it was me doing it.

Browsers don't allow XMLHttpRequests across domains (the same-origin 
policy), so by putting each package on its own domain you avoid the 
problem.
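
To make the domain point concrete, here is a rough Python sketch of the 
comparison a browser does before letting a script's request through.  It 
is only an illustration: the helper names and the example URLs are made 
up for this sketch, and real browsers apply more rules than this.

```python
from urllib.parse import urlsplit

# Default ports, so "http://host" and "http://host:80" compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple used for the comparison."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(a, b):
    """True if a script loaded from URL a could make requests to URL b."""
    return origin(a) == origin(b)


# Hypothetical URLs: docs hosted under the main domain share an origin
# with the PyPI web interface...
print(same_origin("http://pypi.python.org/docs/foo/index.html",
                  "http://pypi.python.org/pypi"))
# ...while docs on a per-package subdomain do not.
print(same_origin("http://foo.pypi.python.org/index.html",
                  "http://pypi.python.org/pypi"))
```

The first call prints True and the second prints False, which is the 
whole argument for one domain per package.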

> Also, what would be the best way to set up the web server to implement
> that? Getting a delegation for a pypi.python.org zone onto that machine
> should be possible, and I know how to update zone files once an hour.
> However, I feel slightly uncomfortable with generating a huge Apache
> config with hundreds of virtual hosts, and having Apache restart every
> hour.

I'd set up a new IP address for the wildcard, and then I think something 
like:

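<VirtualHost wildcard_ip_address>
   # mod_rewrite must be enabled explicitly in each virtual host
   RewriteEngine On
   # %1 is the package name taken from the Host header
   RewriteCond %{HTTP_HOST} ^([a-z0-9-]+)\.pypi\.python\.org
   # In this context the matched path starts with "/", so strip it
   RewriteRule ^/(.*) /pypi/sites/%1/$1 [L]
</VirtualHost>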

and of course the other important Apache stuff, like turning off all 
extraneous options, etc.
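
For anyone who wants to check the mapping without touching Apache, the 
same host-to-directory logic can be sketched in Python.  This is only a 
model of the rewrite rule above, not how Apache implements it; the 
function name docs_path and the traversal check are additions for this 
sketch, and /pypi/sites is just the directory used in the example config.

```python
import posixpath
import re

# Same host pattern as the RewriteCond, anchored at both ends.
HOST_RE = re.compile(r"^([a-z0-9-]+)\.pypi\.python\.org$")
SITES_ROOT = "/pypi/sites"  # directory taken from the example config


def docs_path(host, url_path):
    """Map a Host header and URL path to a per-package file path.

    Returns None if the host is not a package subdomain or if the
    path would escape the package's directory.
    """
    host = host.lower().split(":")[0]  # drop an optional port
    match = HOST_RE.match(host)
    if match is None:
        return None
    base = posixpath.join(SITES_ROOT, match.group(1))
    full = posixpath.normpath(posixpath.join(base, url_path.lstrip("/")))
    # Extra check for this sketch: refuse paths that climb out with "..".
    if full != base and not full.startswith(base + "/"):
        return None
    return full


print(docs_path("foo.pypi.python.org", "/index.html"))
print(docs_path("pypi.python.org", "/index.html"))
print(docs_path("foo.pypi.python.org", "/../bar/secret.html"))
```

Only the first call yields a path (/pypi/sites/foo/index.html); the bare 
domain and the ".." request both give None.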

>> Another option is using an HTML scrubber.  But removing Javascript would
>> be unfortunate in this case as there's a lot of good uses of it, so
>> multiple domains would be better IMHO.
> 
> For this, I'm very skeptical. There will be too many complaints that it
> removes stuff incorrectly.
> 
>> If implemented I think all existing packages could be approved, which
>> would greatly reduce the approval queue.
> 
> I wouldn't mind this starting slowly, say, being experimental until the
> end of the year. Currently, python.org doesn't provide any similar
> hosting (although the PyPI-generated package pages come close), so there
> could be many risks that cause us to pull the plug.
> 
> As for "all existing packages could be approved": the existing ones
> perhaps, but for new ones, wouldn't there still be a chance of somebody
> uploading/linking porn, viruses, whatever?
> 
> Most likely, it works out just fine, of course, as people have to leave
> real email addresses, and interact in a fairly involved manner already,
> which has prevented spambots from registering so far (I'm sure the RSS
> publication would cause immediate reaction from the community should a
> spammer make it "through").

Yes.  I don't think any of the current packages are spam packages 
(though I did see one spam package in the past, but that was years ago), 
and at the moment there's little incentive... mostly because it's just 
too complicated to upload a package.  You could do link spam, it's just 
a lot of trouble.  It would be easier with this system to hide pages in 
weird locations, though you'd still have the spam package as evidence. 
So I don't think the danger of spam is particularly high.  If there were 
a hundred PyPIs out there accepting submissions then it might be worth 
coding a bot to spam them, but with just one it seems like it'd be a 
waste of time on the spammer's part.

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org