very large graph

Tue Jun 24 12:23:29 EDT 2008

On Jun 24, 5:58 am, "A.T.Hofkamp" <h... at se-162.se.wtb.tue.nl> wrote:
> On 2008-06-24, MRAB <goo... at mrabarnett.plus.com> wrote:
>
> > On Jun 24, 1:26 am, chrispoliq... at gmail.com wrote:
> >> I need to represent the hyperlinks between a large number of HTML
> >> files as a graph.  My non-directed graph will have about 63,000 nodes
> >> and and probably close to 500,000 edges.
>
> >> I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
> >> python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
> >> generating a file to store the graph, and I have also looked into
> >> Graphviz for visualization.  I'm just not sure which modules are
> >> best.  I need to be able to do the following:
>
> Afaik Graphviz is not good at abstracting the graph, which you may need here.
> A page with 63,000 circles on it, and 500,000 edges will probably come out of
> the printer as a black sheet of paper.
> (8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
> You need a factor 28 more circles which leads to a circle of about 0.007".)
>
> Even when actual paper format would not be a problem, you will need some
> abstraction and/or tools, as you will not find anything manually in an ocean of
> 63,000 elements.
>
> One area you may want to start looking for tools is in state graphs, where the
> set of possible states of an entire system is unfolded. These things go up to
> over a million states, so you only have a 'small' problem there...
>
>
>
>
>
> >> 1)  The names of my nodes are not known ahead of time, so I will
> >> extract the title from all the HTML files to name the nodes prior to
> >> parsing the files for hyperlinks (edges).
>
> >> 2) Every file will be parsed for links and nondirectional connections
> >> will be drawn between the two nodes.
>
> >> 3)  The files might link to each other so the graph package needs to
> >> be able to check to see if an edge between two nodes already exists,
> >> or at least not double draw connections between the two nodes when
> >> adding edges.
>
> >> I'm relatively new to graph theory so I would greatly appreciate any
> >> suggestions for filetypes.  I imagine doing this as a python
> >> dictionary with a list for the edges and a node:list paring is out of
> >> the question for such a large graph?
>
> > Perhaps a dictionary where the key is a node and the value is a set of
> > destination nodes?
>
> For undirected edges, you could make an Edge class and have a set of Edge's
> (where two Edge objects are equal when they connect the same nodes).
> I don't expect 500,000 elements in a set to be a problem.
>
> Sincerely,
> Albert- Hide quoted text -
>
> - Show quoted text -

Readability counts: Do you need to be able to examine the datafile by
hand?  If not, investigate the 'shelve' module.  You can go two ways.
One is to map each node to a list of nodes.  It's probably more
intuitive, but it needs repeated code:

shelf['pageA'].add( pageB )
shelf['pageB']= pageB
shelf['pageB'].add( pageA )

Which is incorrect as is anyway-- updates to shelved objects aren't
committed automatically.

a= shelf['pageA']
a.add( pageB )
shelf['pageA']= a
b= shelf['pageB']
b.add( pageA )
shelf['pageB']= b

is closer.  Otherwise, two is to store a 2-tuple of node keys in a
secondary file.

shelfEdges[ frozenset(
    shelfVerts[ 'pageA' ], shelfVerts[ 'pageB' ]
) ]= True