[Catalog-sig] Hosting documentation on PyPI
Ian Bicking
ianb at colorstudy.com
Wed Aug 6 19:11:23 CEST 2008
Martin v. Löwis wrote:
>> There's an XSS concern if users can upload arbitrary HTML. Approval
>> would address some of that, but it might be better to avoid the issue
>> altogether.
>>
>> One way to handle that would be to host each package's documentation on
>> a different domain. E.g., package.pypi.python.org.
>
> Can you please elaborate? What is the issue, and how could creating
> domains resolve it?
The issue is that an uploaded page can include JavaScript that makes
XMLHttpRequests to other URLs on the same domain, and those requests can
do things like change a user's password, delete packages, etc. The
JavaScript runs as the person who is viewing the page. So if I am logged
in to PyPI and view some random page on pypi.python.org, and that page
contains malicious JavaScript, it can do anything on pypi.python.org as
though I were doing it myself.
Browsers won't let a page make XMLHttpRequests across domains, so by
putting each package's documentation on its own domain you avoid the
problem.
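
To make the domain argument concrete, here is a rough sketch in Python of
the comparison a browser applies under its same-origin policy before
allowing an XMLHttpRequest. It is only an illustration: the names
origin(), same_origin() and DEFAULT_PORTS are made up for this example,
the URLs are hypothetical, and real browsers have more rules than this.

```python
from urllib.parse import urlsplit

# Assumption for this sketch: only http and https matter here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(page_url, target_url):
    """True if a script on page_url would share an origin with target_url."""
    return origin(page_url) == origin(target_url)


# Docs served from the main PyPI host share its origin.
shared_host = same_origin("http://pypi.python.org/docs/foo/index.html",
                          "http://pypi.python.org/pypi")

# Docs served from a per-package host do not.
per_package_host = same_origin("http://foo.pypi.python.org/index.html",
                               "http://pypi.python.org/pypi")
```

With these inputs shared_host is True and per_package_host is False,
which is the whole point of giving each package its own hostname.
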
> Also, what would be the best way to set up the web server to implement
> that? Getting a delegation for a pypi.python.org zone onto that machine
> should be possible, and I know how to update zone files once an hour.
> However, I feel slightly uncomfortable with generating a huge Apache
> config with hundreds of virtual hosts, and having Apache restart every
> hour.
I'd set up a new IP address for the wildcard, and then I think something
like:
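<VirtualHost wildcard_ip_address>
    # mod_rewrite has to be switched on inside this virtual host
    RewriteEngine On
    # %1 is the package name captured from the Host header
    RewriteCond %{HTTP_HOST} ^([a-z0-9-]+)\.pypi\.python\.org
    RewriteRule (.*) /pypi/sites/%1/$1 [L]
</VirtualHost>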
and of course the other important Apache stuff, like turning off all
extraneous options, etc.
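
For illustration only, this is roughly the mapping the rewrite rule above
is meant to perform, sketched in Python so it can be tried outside Apache.
The names HOST_RE, DOCS_ROOT and docs_path() are invented for the sketch,
and the check that refuses paths escaping the package directory is an
extra assumption of mine, not something the Apache snippet spells out.

```python
import posixpath
import re

# Same host pattern as the RewriteCond, plus an optional port (assumption).
HOST_RE = re.compile(r"^([a-z0-9-]+)\.pypi\.python\.org(?::\d+)?$")
# Directory prefix taken from the RewriteRule substitution.
DOCS_ROOT = "/pypi/sites"


def docs_path(host, url_path):
    """Map a Host header and URL path to a per-package docs path, or None."""
    match = HOST_RE.match(host.lower())
    if match is None:
        # Not a package subdomain, so this virtual host should not serve it.
        return None
    base = posixpath.join(DOCS_ROOT, match.group(1))
    # Resolve "." and ".." segments, then refuse anything outside base.
    candidate = posixpath.normpath(posixpath.join(base, url_path.lstrip("/")))
    if candidate != base and not candidate.startswith(base + "/"):
        return None
    return candidate


index_page = docs_path("foo.pypi.python.org", "/index.html")
main_site = docs_path("pypi.python.org", "/pypi")
escape_attempt = docs_path("foo.pypi.python.org", "/../bar/secret.html")
```

Here index_page comes out as /pypi/sites/foo/index.html, while main_site
and escape_attempt are both None.
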
>> Another option is using an HTML scrubber. But removing Javascript would
>> be unfortunate in this case as there's a lot of good uses of it, so
>> multiple domains would be better IMHO.
>
> For this, I'm very skeptical. There will be too many complaints that it
> removes stuff incorrectly.
>
>> If implemented I think all existing packages could be approved, which
>> would greatly reduce the approval queue.
>
> I wouldn't mind this starting slowly, say, being experimental until the
> end of the year. Currently, python.org doesn't provide any similar
> hosting (although the PyPI-generated package pages come close), so there
> could be many risks that cause us to pull the plug.
>
> As for "all existing packages could be approved": the existing ones
> perhaps, but for new ones, wouldn't there still be a chance of somebody
> uploading/linking porn, viruses, whatever?
>
> Most likely, it works out just fine, of course, as people have to leave
> real email addresses, and interact in a fairly involved manner already,
> which has prevented spambots from registering so far (I'm sure the RSS
> publication would cause immediate reaction from the community should a
> spammer make it "through").
Yes. I don't think any of the current packages are spam packages
(though I did see one spam package in the past, but that was years ago),
and at the moment there's little incentive... mostly because it's just
too complicated to upload a package. You could do link spam, it's just
a lot of trouble. It would be easier with this system to hide pages in
weird locations, though you'd still have the spam package as evidence.
So I don't think the danger of spam is particularly high. If there were
a hundred PyPIs out there accepting submissions then it might be worth
coding a bot to spam them, but with just one it seems like it'd be a
waste of time on the spammer's part.
--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org