[Tutor] Python Networkx with file in gexf format

Tue Jun 2 14:38:36 EDT 2020

Hello everybody,

I'm taking a continuing-education course in which Python plays a relatively large part. With only basic knowledge I am struggling, especially with the project work, which we have to hand in on June 20, 2020. We are not allowed to use Gephi; everything must be analyzed and derived in Python. So I depend on help and am grateful for any support, no matter how small.

I have a dataset in GEXF format whose nodes are "students" (student ID) and "teachers" (teacher ID). Each node belongs to a school class and has a gender; for teachers no gender is given. There are 10 school classes, e.g. 1A, 2B, etc.
The edges connect the pupil IDs by means of "Origin" and "Destination". All edges are of type "Undirected" and have weight "1". "duration" is the total duration of all interactions between the two nodes (origin and destination); "count" is the number of interactions between origin and destination. I have attached the dataset.

If someone can help me with this, I would send them the dataset by mail. And of course I would pay something for the work.

So far I have managed to get the system to tell me how many nodes and edges there are and what the average degree per node is. That was it.
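
For reference, a minimal sketch of that first step. The real file is not reproduced here, so the sketch writes a tiny made-up graph to a temporary GEXF file first; the IDs and the attribute names "classname", "gender", "duration" and "count" are assumptions, not necessarily the names used in the actual dataset.

```python
import os
import tempfile

import networkx as nx

# Toy stand-in for the real dataset; IDs and attribute names are made up.
toy = nx.Graph()
toy.add_node("1551", classname="1A", gender="F")
toy.add_node("1552", classname="1A", gender="M")
toy.add_node("1600", classname="2B", gender="Unknown")
toy.add_edge("1551", "1552", duration=120, count=4)
toy.add_edge("1551", "1600", duration=40, count=2)

# Write it to a temporary GEXF file so the read step has something to load.
path = os.path.join(tempfile.mkdtemp(), "toy.gexf")
nx.write_gexf(toy, path)

# With the real data, only this call is needed (with the real file name).
G = nx.read_gexf(path)

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
# In an undirected graph every edge adds to the degree of two nodes.
avg_degree = 2 * n_edges / n_nodes

print(n_nodes, n_edges, round(avg_degree, 2))
```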

Now I would like to extract various information from this dataset with Python and the networkx package and, above all, display it graphically. This is difficult for me, because I want to work not only with the nodes and edges in the GEXF document but also with the attribute values of each item, for example a student's ID or a class name. These are my questions:

1. How can I find out what data type an attribute has? How can I convert the data type of an attribute, for example from string to int (preprocessing)?
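
A possible sketch for question 1, on a made-up graph whose values are all strings. The attribute names are assumptions. Note that nx.read_gexf returns node IDs as strings unless it is called with node_type=int.

```python
import networkx as nx

# Made-up graph in which every value is still a string.
G = nx.Graph()
G.add_node("1551", classname="1A")
G.add_node("1552", classname="1A")
G.add_edge("1551", "1552", duration="120", count="4")

# Inspect the Python type of every edge attribute.
for u, v, data in G.edges(data=True):
    for key, value in data.items():
        print(u, v, key, type(value).__name__)

# Convert selected edge attributes from str to int.
# The dict yielded by edges(data=True) is the stored one, so this edits G.
for u, v, data in G.edges(data=True):
    data["duration"] = int(data["duration"])
    data["count"] = int(data["count"])

# Node IDs are dictionary keys, so converting them means relabelling.
H = nx.relabel_nodes(G, {n: int(n) for n in G.nodes})
```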

2. How can I calculate the number of edges per node (one origin and how many targets)? I have seen that this is called the "degree". So I want to know how many connections (edges) node 1551 has to other nodes, i.e. how many connections student 1551 has to other students (and teachers). How can I list them per node? What does the code for this look like?

3. How can I calculate, for example, the sum of "count" or "duration" per student ID? How can I divide that sum by the number of connected nodes per student ID, for example to calculate the average duration?
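
A minimal sketch for the degree and per-student sums, on a made-up three-node graph. Passing weight="duration" to G.degree makes networkx add up that edge attribute instead of counting edges; the attribute names are assumed.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("1551", "1552", duration=120, count=4)
G.add_edge("1551", "1600", duration=40, count=2)
G.add_edge("1552", "1600", duration=20, count=1)

# Number of edges at one node, and the nodes on the other end.
degree_1551 = G.degree("1551")
neighbours_1551 = sorted(G.neighbors("1551"))

# Sum of an edge attribute over all edges of each node.
total_duration = dict(G.degree(weight="duration"))
total_count = dict(G.degree(weight="count"))

# Average duration per contact partner (nodes without edges are skipped).
avg_duration = {
    n: total_duration[n] / G.degree(n) for n in G.nodes if G.degree(n) > 0
}

print(degree_1551, neighbours_1551, total_duration["1551"], avg_duration["1551"])
```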

4. Build and display the graph for the whole dataset. What does the code for this look like?

5. How can I display graphs, i.e. the connections and nodes of a single school class (clusters?), individual nodes (students) of the same class, etc., in different colors? What does the code for this look like?
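
One way questions 4 and 5 could be approached: give every class a color and pass the resulting list to the drawing function. This is a sketch on made-up data; the attribute name "classname", the palette and the helper draw_by_class are assumptions, and the drawing step needs matplotlib installed.

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["1551", "1552"], classname="1A")
G.add_nodes_from(["1600", "1601"], classname="2B")
G.add_edges_from([("1551", "1552"), ("1551", "1600"), ("1600", "1601")])

# One color per class, in a fixed order so the mapping is reproducible.
palette = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]
classes = sorted({c for _, c in G.nodes(data="classname")})
color_of = {c: palette[i % len(palette)] for i, c in enumerate(classes)}

# networkx expects the colors in the same order as G.nodes.
node_colors = [color_of[G.nodes[n]["classname"]] for n in G.nodes]


def draw_by_class(graph, colors):
    """Draw the whole graph with one color per class (hypothetical helper)."""
    import matplotlib.pyplot as plt

    pos = nx.spring_layout(graph, seed=42)  # fixed seed: same layout each run
    nx.draw_networkx(graph, pos, node_color=colors, with_labels=True)
    plt.axis("off")
    plt.show()
```

Calling draw_by_class(G, node_colors) opens the plot window.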

6. Subgraphs: as I understand it, they are parts of a whole graph, is that right? How can I display them, for example for two or three classes, for ten students per class, for students and teachers together, etc.? Or for the connections within a class? What does the code for this look like?
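
A sketch of subgraph selection with G.subgraph, which keeps the chosen nodes and only the edges that run between them. The data, the "classname" attribute, the "Teachers" label and the helper nodes_in_classes are all made up for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["1551", "1552", "1553"], classname="1A")
G.add_nodes_from(["1600", "1601"], classname="2B")
G.add_node("T1", classname="Teachers")
G.add_edges_from([
    ("1551", "1552"), ("1552", "1553"),  # inside 1A
    ("1600", "1601"),                    # inside 2B
    ("1551", "1600"),                    # between 1A and 2B
    ("T1", "1551"),                      # teacher to student
])


def nodes_in_classes(graph, wanted):
    """Return the nodes whose class is in the set `wanted` (hypothetical helper)."""
    return [n for n, c in graph.nodes(data="classname") if c in wanted]


# Connections within a single class.
class_1a = G.subgraph(nodes_in_classes(G, {"1A"}))

# Two classes together, including the edges between them.
two_classes = G.subgraph(nodes_in_classes(G, {"1A", "2B"}))

print(class_1a.number_of_edges(), two_classes.number_of_edges())
```

A subgraph is a graph like any other, so it can be passed to the same drawing and analysis functions as the full graph.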

7. Are subgraphs also called "subgroups" and "clusters"? What does the code look like to represent such properties graphically? What should I concentrate on in the dataset: the items, or the values?
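
As far as the terms go, a subgraph is any selected part of the graph, while "clusters" or "communities" are usually groups found from the edge structure itself. A minimal sketch using greedy modularity maximisation, which is only one of several community-detection methods in networkx, on two made-up triangles joined by a single edge:

```python
import networkx as nx
from networkx.algorithms import community

# Two tightly connected triangles with one bridge edge between them.
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b3"),
])

# Each community is returned as a frozenset of node IDs.
communities = community.greedy_modularity_communities(G)

# Sort for stable printing.
groups = sorted(sorted(c) for c in communities)
print(groups)
```

With the real data, the communities found this way could be compared with the school classes stored on the nodes.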

8. How can I determine which student is an "influencer" in the class and which is not an influencer at all? And which students are the "influencers" between school classes? What does the code for this look like?
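
"Influencer" has no single definition in graph terms. One simple reading, sketched here on made-up data, counts each student's contacts inside the own class and contacts to other classes separately. The "classname" attribute and both helper functions are assumptions.

```python
import networkx as nx

G = nx.Graph()
G.add_nodes_from(["1551", "1552", "1553"], classname="1A")
G.add_nodes_from(["1600", "1601"], classname="2B")
G.add_edges_from([
    ("1551", "1552"), ("1551", "1553"),
    ("1553", "1600"), ("1553", "1601"),
    ("1600", "1601"),
])


def in_class_contacts(graph, node):
    """Number of neighbours in the same class as `node` (hypothetical helper)."""
    own = graph.nodes[node]["classname"]
    return sum(1 for m in graph.neighbors(node) if graph.nodes[m]["classname"] == own)


def cross_class_contacts(graph, node):
    """Number of neighbours in a different class (hypothetical helper)."""
    return graph.degree(node) - in_class_contacts(graph, node)


class_1a = [n for n, c in G.nodes(data="classname") if c == "1A"]

# Best-connected student inside class 1A.
top_inside = max(class_1a, key=lambda n: in_class_contacts(G, n))

# Student with the most contacts to other classes.
top_between = max(G.nodes, key=lambda n: cross_class_contacts(G, n))

# Students with no contact outside their own class.
no_bridge = sorted(n for n in G.nodes if cross_class_contacts(G, n) == 0)

print(top_inside, top_between, no_bridge)
```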

9. How can I remove items, for example because they are outliers? Or simply remove all items whose "count" is below 10? What does the code for this look like?
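
A sketch of such a filter on made-up data: it drops every edge whose "count" is below 10 and then the nodes left without any edge. It works on a copy so the original graph stays intact; the threshold and attribute name follow the question.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("1551", "1552", count=12)
G.add_edge("1551", "1600", count=3)
G.add_edge("1552", "1553", count=25)

H = G.copy()

# Collect first, then remove: a graph must not change while it is iterated.
weak_edges = [(u, v) for u, v, c in H.edges(data="count") if c < 10]
H.remove_edges_from(weak_edges)

# Nodes that lost all their edges are now isolated; drop them too.
H.remove_nodes_from(list(nx.isolates(H)))

print(sorted(H.nodes), H.number_of_edges())
```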

10. Weight: I have read a lot about it, but I have not been able to figure out what it is and what purpose it serves. What does it tell me? What can I do with it? Why is the weight set differently for different connections, e.g. 2 for some and 3 for others? And above all: why should I change a weight, i.e. what could be the intention, and in which case would I do it? What would this mean for "my" dataset, where every edge between origin and destination has weight 1? What is the code for changing the weight?
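
The weight is simply a number stored on an edge that many algorithms can take into account. One plausible use with this kind of data, shown as a sketch on made-up values, is to copy "count" into "weight" so that frequent contacts count for more than rare ones. Whether that suits the project is a modelling choice, not something the dataset dictates.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("1551", "1552", weight=1, count=4)
G.add_edge("1551", "1600", weight=1, count=2)

# With every weight equal to 1, the weighted degree is just the edge count.
before = G.degree("1551", weight="weight")

# Change the weight of every edge, here to the number of interactions.
for u, v, data in G.edges(data=True):
    data["weight"] = data["count"]

# The weighted degree now reflects how often node 1551 interacted.
after = G.degree("1551", weight="weight")

print(before, after)
```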

11. How are the following calculated? Are they calculated per student or per cluster (e.g. school class)? Or for which properties?
- Degree centrality
- Betweenness centrality
- Closeness centrality
- Indegree prestige
- Ego network
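
A sketch of these measures on a small made-up graph. In networkx each centrality function returns one value per node. In-degree needs a directed graph; since the edges here are undirected, the sketch converts with to_directed(), which makes the in-degree values equal to the degree centrality.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])

# Each call returns a dict: node -> value.
degree_c = nx.degree_centrality(G)            # degree / (n - 1)
betweenness_c = nx.betweenness_centrality(G)  # share of shortest paths through a node
closeness_c = nx.closeness_centrality(G)      # inverse of the average distance

# In-degree on the directed version (every edge becomes two arcs).
in_degree_c = nx.in_degree_centrality(G.to_directed())

# Ego network: one node plus everything within `radius` steps of it.
ego_d = nx.ego_graph(G, "D", radius=1)

print(degree_c["A"], closeness_c["A"], sorted(ego_d.nodes))
```

To get values per class rather than for the whole school, the same functions can be applied to a subgraph of that class.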

12. Link prediction: I heard that there are ways to use different models (or algorithms?) to predict which other connections between nodes, in our case the students, might form. For example Jaccard, common neighbours, preferential attachment, resource allocation, etc. What does the code for these look like?
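
A sketch of the four scores named above, using the link-prediction functions in networkx on a made-up graph. Each function takes candidate node pairs and yields (u, v, score) tuples; a higher score suggests a more likely future link.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("D", "E")])

# Candidate pairs that are not connected yet.
pairs = [("A", "B"), ("C", "E")]

jaccard = {(u, v): p for u, v, p in nx.jaccard_coefficient(G, pairs)}
pref_attach = {(u, v): p for u, v, p in nx.preferential_attachment(G, pairs)}
resource_alloc = {(u, v): p for u, v, p in nx.resource_allocation_index(G, pairs)}

# Common neighbours is a plain count of shared neighbours.
common = {(u, v): len(list(nx.common_neighbors(G, u, v))) for u, v in pairs}

for pair in pairs:
    print(pair, common[pair], jaccard[pair], pref_attach[pair], resource_alloc[pair])
```

Leaving out the pair list makes the three scoring functions evaluate all non-existent edges of the graph.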

I am looking forward to your feedback. Thank you very much and best regards

Daniel Wobmann