A Python script to put CTAN into git (from DVDs)

Jakub Narebski jnareb at gmail.com
Sun Nov 6 15:29:28 EST 2011


The following message is a courtesy copy of an article
that has been posted to comp.lang.python,comp.text.tex as well.

Jonathan Fine <jfine at pytex.org> writes:
> On 06/11/11 16:42, Jakub Narebski wrote:
>> Jonathan Fine<jfine at pytex.org>  writes:
>>
>>> This is to let you know that I'm writing (in Python) a script that
>>> places the content of CTAN into a git repository.
>>>       https://bitbucket.org/jfine/python-ctantools
>>
>> I hope that you meant "repositories" (plural) here, one per tool,
>> rather than putting all of CTAN into single Git repository.

[moved]
>> There was similar effort done in putting CPAN (Comprehensive _Perl_
>> Archive Network) in Git, hosting repositories on GitHub[1], by the name
>> of gitPAN, see e.g.:
>>
>>    "The gitPAN Import is Complete"
>>    http://perlisalive.com/articles/36
>>
>> [1]: https://github.com/gitpan
[/moved]
 
> There are complex dependencies among LaTeX macro packages, and TeX is
> often distributed and installed from a DVD.  So it makes sense here to
> put *all* the content of a DVD into a repository.

Note that for gitPAN each "distribution" (usually but not always
corresponding to a single Perl module) is in a separate repository.
The dependencies are handled by the CPAN / CPANPLUS / cpanm client
(i.e. during install).
 
Putting the whole DVD (is it the "TeX Live" DVD, by the way?) into a
single repository would put quite a bit of stress on git; it was created
for software development (admittedly of large projects like the Linux
kernel), not for 4GB+ trees.

> Once you've done that, it is then possible and sensible to select
> suitable interesting subsets, such as releases of a particular
> package. Users could even define their own subsets, such as "all
> resources needed to process this file, exactly as it processes on my
> machine".

This could be handled using submodules, by having a superrepository that
consists solely of references to other repositories by way of
submodules... plus perhaps some administrative files (like a README for
the whole of CTAN, or a search tool, or DVD install, etc.)

This could then be used to get, for example, the contents of the DVD
from 2010.
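
To illustrate the superrepository idea, here is a minimal sketch in
Python that renders a .gitmodules file with one entry per package.  The
package names and the base URL are hypothetical, and in practice
"git submodule add" writes this file itself and also records the pinned
commit of each submodule; .gitmodules alone only maps paths to URLs.

```python
def gitmodules_text(packages, base_url):
    """Render .gitmodules content with one submodule stanza per package."""
    stanzas = []
    for name in sorted(packages):
        stanzas.append(
            f'[submodule "{name}"]\n'
            f"\tpath = {name}\n"
            f"\turl = {base_url}/{name}.git\n"
        )
    return "".join(stanzas)


if __name__ == "__main__":
    # Hypothetical package names and hosting URL, for illustration only.
    print(gitmodules_text(["geometry", "amsmath"], "https://example.org/ctan"), end="")
```

A superrepository commit for "DVD 2010" would then pin each submodule to
the package revision that shipped on that DVD.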


But even though submodules (cf. Subversion svn:externals, the Mercurial
forest extension, etc.) have been in Git for quite a while, they don't
have the best user interface.
 
> In addition, many TeX users have a TeX DVD.  If they import it into a
> git repository (using for example my script) then the update from 2011
> to 2012 would require much less bandwidth.

???

> Finally, I'd rather be working within git than a modified copy of the
> ISO when doing the subsetting.  I'm pretty sure that I can manage to
> pull the small repositories from the big git-CTAN repository.

No, you cannot.  It is all or nothing; there is no support for partial
_clone_ (yet), and it looks like it is a hard problem.

NB: there is support for partial _checkout_, but this is something
different.

> But as I proceed, perhaps I'll change my mind (smile).
> 
>>> I'm working from the TeX Collection DVDs that are published each year
>>> by the TeX user groups, which contain a snapshot of CTAN (about
>>> 100,000 files occupying 4GB), which means I have to unzip folders and
>>> do a few other things.
>>
>> There is 'contrib/fast-import/import-zips.py' in git.git repository.
>> If you are not using it, or its equivalent, it might be worth checking
>> out.
> 
> Well, I didn't know about that.  I took a look, and it doesn't do what
> I want.  I need to walk the tree (on a mounted ISO) and unpack some
> (but not all) zip files as I come across them.  For details see:
>  https://bitbucket.org/jfine/python-ctantools/src/tip/ctantools/filetools.py
> 
> In addition, I don't want to make a commit.  I just want to make a ref
> at the end of building the tree.  This is because I want the import of
> a TeX DVD to give effectively identical results for all users, and so
> any commit information would be effectively constant.
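
The walk-and-unpack step described in the quoted text could look roughly
like the sketch below.  This is not the code from ctantools/filetools.py;
the function name, the "unpack every .zip" default, and the choice to
place archive members under the archive's name without its extension are
all assumptions made for illustration.

```python
import os
import zipfile


def iter_files(root, should_unpack=lambda rel: rel.endswith(".zip")):
    """Yield (logical_path, content_bytes) for every file under root.

    Zip archives accepted by should_unpack are replaced by their members;
    all other files are yielded as they are.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit directories in a stable order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if should_unpack(rel) and zipfile.is_zipfile(full):
                prefix = os.path.splitext(rel)[0]
                with zipfile.ZipFile(full) as zf:
                    for info in zf.infolist():
                        if not info.is_dir():
                            yield prefix + "/" + info.filename, zf.read(info)
            else:
                with open(full, "rb") as fh:
                    yield rel, fh.read()
```

Passing a different should_unpack predicate is how "some (but not all)
zip files" would be selected.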

Commit = tree + parent + metadata.

I think you would very much want to have a linear sequence of trees,
ordered via a DAG of commits.  "Naked" trees are a rather bad idea, I
think.
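
A small Python sketch of "commit = tree + parent + metadata": a commit id
is the SHA-1 of the commit text, so if every user supplies the same tree,
parent and metadata, every user gets the same commit id.  The identity
string, timestamps and messages below are hypothetical, and the empty
tree merely stands in for the tree of a real DVD import.

```python
import hashlib


def git_object_id(kind, body):
    """Return the SHA-1 id git gives an object of this kind and content."""
    header = f"{kind} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, ident, timestamp, message):
    """Assemble the text of a commit object: tree + parents + metadata."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {ident} {timestamp} +0000")
    lines.append(f"committer {ident} {timestamp} +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


# Stand-in for a real DVD tree: the id of the empty tree.
EMPTY_TREE = git_object_id("tree", b"")

# Hypothetical constant metadata, identical for every user.
IDENT = "CTAN import <ctan@example.org>"
c2010 = git_object_id(
    "commit", commit_body(EMPTY_TREE, [], IDENT, 1262304000, "DVD 2010"))
c2011 = git_object_id(
    "commit", commit_body(EMPTY_TREE, [c2010], IDENT, 1293840000, "DVD 2011"))
```

Here c2011 names c2010 as its parent, which gives the yearly trees their
order without making the result depend on who ran the import.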
 
> As I recall the first 'commit' to the git repository for the Linux
> kernel was just a tree, with a reference to that tree as a tag.  But
> no commit.

It was an unfortunate accident that there is a tag pointing directly to
the tree of the _initial import_; it is not something to copy.

-- 
Jakub Narębski


