[Distutils] Proposal: Restrict the characters in a project name

Donald Stufft donald at stufft.io
Wed May 15 19:07:30 CEST 2013


On May 15, 2013, at 10:31 AM, Daniel Holth <dholth at gmail.com> wrote:

> How to avoid confusables.
> 
> These scripts are recommended for use in identifiers:
> http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
> 
> This report details a confusables detection algorithm:
> http://www.unicode.org/reports/tr39/#Confusable_Detection
> 
> And ICU implements it:
> http://www.icu-project.org/apiref/icu4c/uspoof_8h.html (see also
> PyICU).
> 
> The package index would enforce uniqueness of the "skeleton" of each
> registered package which is just an internal normalization based on
> confusability. if skeleton(identifier1) == skeleton(identifier2) then
> id1 and id2 are confusable.
> 
> The tooling could get away with a simpler rule like
> re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)
> 
> As a bonus to including the world, this should be able to prevent
> people from exchanging zeroes for capital O.
> 
> On Wed, May 15, 2013 at 7:17 AM, Eric V. Smith <eric at trueblade.com> wrote:
>> On 05/15/2013 07:10 AM, Donald Stufft wrote:
>>>>>> Anyone want to run a scan over the PyPI package set to see
>>>>>> how many packages would cause problems for a "[a-zA-Z0-9_.-]"
>>>>>> only filter?
>>>>> 
>>>>> See my previous email where I did queries against my local DB.
>>>>> It's 225 total projects that wouldn't be allowed.
>>>> 
>>>> Can you send the list of those projects?
>>>> 
>>>> Eric.
>>>> 
>>> 
>>> Here you go https://gist.github.com/dstufft/5583225 used a Python
>>> oneliner and the PyPI API so others can reproduce easily if they
>>> wish.
>> 
>> Perfect. Thanks.
>> 
>> It looks like space causes most of the issues. I'm not sure how
>> "Twisted Flow >= 1.0" would be expected to parse.
>> 
>> Eric.
>> 
>> 
>> _______________________________________________
>> Distutils-SIG maillist  -  Distutils-SIG at python.org
>> http://mail.python.org/mailman/listinfo/distutils-sig



This gets into an area that is complicated to set up, more complicated to maintain, and harder to explain to people.

I also cannot find any data on whether the confusables list is a whitelist or a blacklist, but given its nature as a list of characters that are confusing, I'm going to guess it's a blacklist, which means it's very possible (and likely) that there are glyphs missing from it that can easily be confused for one another.
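
To make the blacklist concern concrete, here is a minimal Python 3 sketch of a skeleton-style comparison. It only follows the rough shape of the TR39 approach (normalize, map each character through a table, normalize again); the CONFUSABLES table is a tiny hypothetical stand-in, not the real Unicode data, and the skeleton() and confusable() helpers are names made up for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny confusables table. This is NOT the real
# TR39 data; it exists only to show how a mapping-based check behaves.
CONFUSABLES = {
    "0": "O",        # digit zero -> capital letter O
    "\u0430": "a",   # Cyrillic small letter a -> Latin small letter a
}


def skeleton(name):
    """Return a simplified skeleton: NFD, map known confusables, NFD again."""
    decomposed = unicodedata.normalize("NFD", name)
    # Characters missing from the table pass through unchanged.
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a, b):
    """Two names are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)
```

With this table, confusable("Djang0", "DjangO") and confusable("Dj\u0430ngo", "Django") are both true, but confusable("Djang\u043eo"[:-1], "Django") is false: the Cyrillic "о" (U+043E) simply isn't listed, so the look-alike goes undetected. Any pair absent from the list slips through the same way.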

It also doesn't solve the problem that these names can and will be used in systems outside of a Python runtime, which may or may not support Unicode characters, so it affords a much smaller window of compatibility.

It also makes the URLs a whole heck of a lot less nice.

All for something that people haven't really even attempted to use (here's the complete list of things that have ever been registered to PyPI with a name that uses Unicode):

Manual de Py2Exe en Español
flügelform
☃
t☃
py<U+1F4A9>
<U+2063>_init__
D<U+2063>jango
D\x01jango
pyramid-✔

The vast bulk of them are from people either playing with Unicode or attempting to do exactly what I outlined as a potential threat.
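
For comparison, a check against the "[a-zA-Z0-9_.-]" filter discussed earlier in the thread could look roughly like the Python 3 sketch below. The ALLOWED pattern and the is_allowed() helper are names invented for this example, and the sample names are taken from this message and the quoted thread; this is not the query that produced the 225 figure.

```python
import re

# Only ASCII letters, digits, underscore, period and hyphen are permitted.
ALLOWED = re.compile(r"[a-zA-Z0-9_.-]+")


def is_allowed(name):
    """Return True when every character of the name is in the allowed set."""
    return ALLOWED.fullmatch(name) is not None


names = [
    "Django",
    "flügelform",
    "☃",
    "pyramid-✔",
    "Manual de Py2Exe en Español",
    "Twisted Flow >= 1.0",
]

# Names that the ASCII-only filter would refuse to register.
rejected = [name for name in names if not is_allowed(name)]
```

Here rejected ends up holding every sample except "Django": the Unicode names fail on their non-ASCII characters, and the last two also fail on spaces.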

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
