[Python-ideas] AST Hash

Wed Sep 11 22:53:39 CEST 2013

On 11.09.2013 18:05, anatoly techtonik wrote:
> Hi,
> 
> We need a checksum for code pieces. The goal of the checksum is to
> reliably detect pieces of code with absolutely identical behaviour.
> Borders of such checksum can be functions, classes, modules,.
> Practical application for such checksums are:
> 
>  - detecting usage of recipes and examples across PyPI packages
>  - detecting usage of standard stdlib calls
>  - creating execution safe serialization formats for data
>    - choosing class to deserialize data fields of the object based on its hash
>  - enable consistent validation and testing of results across various AST tools
> 
> There can be two approaches to build such checksum:
> 1. Code Section Hash
> 2. AST Hash
> 
> Code Section Hash is built from a substring of a source code, cut on
> function or class boundaries. This hash is flaky - whitespace and
> comment differences ruin it, even when behaviour (and bytecode) stays
> the same. It is possible to reduce the effect of whitespace and
> comment changes by normalizing the substring - dedenting, reindenting
> with 4 spaces, stripping empty lines, comments and trailing
> whitespace. And it still will be unreliable and affected by whitespace
> changes in the middle of the string. Therefore a 2nd way of hashing is
> more preferable.
> 
> AST Hash is build on AST. This excludes any comments, whitespace etc.
> and makes the hash strict and reliable. This is a canonical Default
> AST Hash.
> 
> There are cases when Default AST Hash may not be enough for
> comparison. For example, if local variables are renamed, or docstrings
> changed, the behaviour of a function may not change, but its AST hash
> will. In these cases additional normalization rules apply. Such as
> changing all local variable names to var1, var2, ... in order of
> appearance, stripping docstrings etc. Every set of such normalization
> rules should have a name. This will also be the name of resulting
> custom AST Hash.
> 
> Explicit naming of AST Hashes and hardlinking of names to rules that
> are used to build them will settle common ground (base) for AST tools
> interoperability and research papers. As such, it most likely require
> a separate PEP.

You might want to have a look at this paper which discussed
AST compression (for Java, but the ideas apply to Python just
as well):

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.135.5917&rep=rep1&type=pdf

If you compress the AST into a string and take its hash,
you should pretty much have what you want.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Sep 11 2013)
>>> Python Projects, Consulting and Support ...   http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ...       http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2013-09-11: Released eGenix PyRun 1.3.0 ...       http://egenix.com/go49
2013-09-04: Released eGenix pyOpenSSL 0.13.2 ...  http://egenix.com/go48
2013-09-20: PyCon UK 2013, Coventry, UK ...                 9 days to go

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/