Introduction to Network Analysis and Visualization with Python NetworkX
This post provides an overview of getting started with analyzing and visualizing networks using the NetworkX package for Python.
Any system of interconnected objects can be called a network. Major classes of networks include technological, biological, economic, and social. As such, networks surround us and include each of us. The most important network concepts are the entities and the relationships between them. Network analysis has its foundation in graph theory, and discrete entities are called the nodes and the relationships or links between them the edges. Together, these nodes and edges form a graph.
There are a few different Python packages for network analysis. NetworkX is popular as it is built with the Python language, included with the Conda distribution, and in conjunction with the package Matplotlib can create network visualizations.
Network Data
My data for this example are from the subreddit Tech News (r/technews), covering one week in May 2020, the week with the most posts of any that year. I enjoy this subreddit because it features posts from a variety of news domains on the latest happenings in technology.
With each comment, a Redditor is responding to either the post author or to another commenter. This gives the edges direction. When there is at least one directed edge in a network, it is a directed network. When there are entirely symmetric relationships between the network entities, these form an undirected network.
I began by importing a few Python packages.
I organized the connections between the subreddit authors and commenters into a Pandas data frame with about 5K rows. In each row, count is the number of times the ‘source’ responded to a post or comment of the ‘target’.
NetworkX offers several ways to read data into a network graph. My data are structured as an edge list, which holds pairs of connected entities. There can also be attributes about each entity (node) and/or attributes about the links (edges) between them. In my example, the ‘count’ is an edge attribute.
NetworkX understands this Pandas data frame structure as an edge list. I read the data into a directed network graph.
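Here is a minimal sketch of that step, assuming the columns are named source, target, and count as described above. The user names and counts are invented for illustration and are not taken from the subreddit data.

```python
import networkx as nx
import pandas as pd

# Made-up edge list for illustration; the real data frame has about 5K rows.
edges = pd.DataFrame(
    {
        "source": ["user_a", "user_b", "user_c", "user_a"],
        "target": ["user_b", "user_a", "user_a", "user_c"],
        "count": [2, 1, 3, 1],
    }
)

# Each row becomes a directed edge source -> target,
# and 'count' is stored as an attribute on that edge.
G = nx.from_pandas_edgelist(
    edges,
    source="source",
    target="target",
    edge_attr="count",
    create_using=nx.DiGraph(),
)

print(G.edges(data=True))
```

Passing edge_attr="count" is what carries the reply counts into the graph, so they can later be used for things like edge widths.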
To analyze an undirected network, change the create_using argument to nx.Graph(). Or, exclude this argument as the default is an undirected graph.
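One consequence is worth seeing in a small sketch: in an undirected graph, a reply from A to B and a reply from B to A collapse into a single edge. The user names below are made up.

```python
import networkx as nx
import pandas as pd

# Invented example: user_a and user_b replied to each other.
edges = pd.DataFrame(
    {
        "source": ["user_a", "user_b", "user_c"],
        "target": ["user_b", "user_a", "user_a"],
    }
)

# Directed: a -> b and b -> a are two separate edges.
directed = nx.from_pandas_edgelist(
    edges, "source", "target", create_using=nx.DiGraph()
)

# No create_using argument: the default is an undirected nx.Graph,
# so the two reciprocal replies become one edge.
undirected = nx.from_pandas_edgelist(edges, "source", "target")

print(directed.number_of_edges(), undirected.number_of_edges())
```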
Network Analysis
To analyze my example network, I started with the network size. The size of the network can be described by the number of nodes, or the number of edges, or both! This example has 3,461 nodes and 5,224 edges. In other words, there are 3,461 Redditors who posted or commented in the selected week, and 5,224 links between them.
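Both counts are available directly from the graph object. The sketch below uses a tiny stand-in graph with invented user names rather than the Reddit data.

```python
import networkx as nx

# Toy directed graph standing in for the Reddit network.
G = nx.DiGraph()
G.add_edges_from(
    [("user_a", "user_b"), ("user_c", "user_b"), ("user_b", "user_a")]
)

n_nodes = G.number_of_nodes()  # how many users appear in the network
n_edges = G.number_of_edges()  # how many directed links connect them

print(f"{n_nodes} nodes, {n_edges} edges")
```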
Measures of centrality indicate which node(s) has the most effect on others. There are many centrality measures. Here are the most common:
- Degree Centrality
Important nodes have many connections.
- Betweenness Centrality
Important nodes connect other nodes.
- Closeness Centrality
Important nodes are close to other nodes.
- Eigenvector Centrality
Important nodes have many connections to other important nodes.
- Page Rank
Important nodes have many in-coming edges. A variant of the Eigenvector Centrality that is used to analyze directed networks.
NetworkX has methods to calculate these different centrality measures. Each of the methods returns a Python dictionary.
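As a small sketch of what such a dictionary looks like, here is degree centrality on a toy graph with invented user names. The variable names are my own choices.

```python
import networkx as nx

# Toy reply network: three users reply to user_a, who replies to user_b.
G = nx.DiGraph()
G.add_edges_from(
    [
        ("user_b", "user_a"),
        ("user_c", "user_a"),
        ("user_d", "user_a"),
        ("user_a", "user_b"),
    ]
)

# Dictionary mapping each node to its degree divided by (n - 1).
degree_cent = nx.degree_centrality(G)

# The node with the highest score is the key with the largest value.
top_node = max(degree_cent, key=degree_cent.get)
print(top_node, degree_cent[top_node])
```

Because the result is a plain dictionary, the usual tools such as max, sorted, or a conversion to a Pandas Series work for ranking nodes.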
The code for the other centrality measures is below. Note that since my example is a directed network, I calculated each node’s Page Rank rather than Eigenvector Centrality.
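A minimal sketch of those calls on a toy directed graph follows. The graph and user names are invented, and default parameters are assumed throughout.

```python
import networkx as nx

# Toy reply network: three users reply to user_a, who replies to user_b.
G = nx.DiGraph()
G.add_edges_from(
    [
        ("user_b", "user_a"),
        ("user_c", "user_a"),
        ("user_d", "user_a"),
        ("user_a", "user_b"),
    ]
)

# Each call returns a dictionary keyed by node.
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
page_rank = nx.pagerank(G)

# Report the top-scoring node for each measure.
for name, scores in [
    ("betweenness", betweenness),
    ("closeness", closeness),
    ("page rank", page_rank),
]:
    top = max(scores, key=scores.get)
    print(name, top, round(scores[top], 3))
```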
And the results!
A node’s neighbors are the nodes it is connected to. The Redditor with the highest degree and closeness centralities was ‘totatree’, who had 283 neighbors in just this one week in May!
A node’s degree is how many edges, or connections, it has. So ‘totatree’ has a node degree of 283. Since this is a directed network, total degree can be separated into in-degree and out-degree. In-degree in this example means the number of users who responded to the Redditor, and out-degree means the number of users the Redditor replied to.
After checking the in-degree and out-degree, I found that ‘totatree’ received replies from 283 users and did not reply to anyone else during the one-week example period. This indicates that ‘totatree’ authored some posts to which many other users replied. By contrast, ‘limache’, who has the highest betweenness centrality and the highest Page Rank, has an in-degree of 166 and an out-degree of 41. In other words, ‘limache’ both received replies from many users and replied to 41 others.
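The in-degree and out-degree checks can be sketched as follows, on a toy graph with invented user names where an edge u -> v means u replied to v.

```python
import networkx as nx

# An edge u -> v means u replied to v.
G = nx.DiGraph()
G.add_edges_from(
    [("user_b", "user_a"), ("user_c", "user_a"), ("user_a", "user_d")]
)

node = "user_a"
in_deg = G.in_degree(node)    # number of users who replied to user_a
out_deg = G.out_degree(node)  # number of users user_a replied to
total = G.degree(node)        # in-degree plus out-degree

print(in_deg, out_deg, total)
```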
Network Visualization
Sometimes when visualizing networks the nodes overlap and it can be difficult to see the connections. There are a variety of layout algorithms to display the network that attempt to prevent node overlaps. These algorithms include: circular_layout, spectral_layout, random_layout, and spring_layout.
The spring layout usually works best at preventing node overlap. This layout applies the Fruchterman-Reingold force-directed algorithm. Let’s start with a portion of the entire network and visualize the network of user ‘limache.’
There are a few ways to create visualizations in NetworkX. Drawing with explicit code allows for adding optional arguments to customize the nodes, edges, and layout. Running the layout algorithm for more iterations helps to spread the nodes apart. It took several trials with the number of iterations, as well as with the values of k and scale, to yield the best visualization. I chose to color the nodes Reddit red, and I based the width of the edges on the number of times one user replied to another.
The many arrows directed toward this Redditor have merged together. This is not surprising given the node’s in-degree and out-degree calculated earlier.
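A sketch of this kind of drawing is below. It is not the exact code behind the figure: the toy graph is invented, nx.ego_graph is only one assumed way to select a user's network, the k, iterations, and scale values are placeholders rather than the tuned ones, and "#FF4500" is my assumption for Reddit red.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Made-up reply network; 'count' is the number of replies along each edge.
G = nx.DiGraph()
G.add_weighted_edges_from(
    [
        ("user_b", "limache", 3),
        ("user_c", "limache", 1),
        ("limache", "user_d", 2),
        ("user_d", "user_e", 1),
    ],
    weight="count",
)

# Assumption: the user's network is the user plus direct neighbors
# in either direction (undirected=True follows in- and out-edges).
ego = nx.ego_graph(G, "limache", radius=1, undirected=True)

# Spring (Fruchterman-Reingold) layout; a fixed seed makes it repeatable.
pos = nx.spring_layout(ego, k=0.5, iterations=50, scale=2, seed=42)

# Edge widths follow the reply counts, in the same order as ego.edges().
widths = [data["count"] for _, _, data in ego.edges(data=True)]

fig, ax = plt.subplots(figsize=(6, 6))
nx.draw_networkx(
    ego,
    pos=pos,
    ax=ax,
    node_color="#FF4500",
    width=widths,
    with_labels=True,
    font_size=8,
)
ax.set_axis_off()
```

Calling plt.show() or fig.savefig(...) afterward displays or saves the figure. Larger k pushes nodes further apart, which is the knob most worth experimenting with on dense neighborhoods.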
For the visualization at the top of this article, I filtered the network to nodes with degree of at least 12. Here is the code.
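One way to do this kind of filtering is sketched below. The helper name filter_by_degree and the toy network are hypothetical; only the threshold of 12 comes from the text above.

```python
import networkx as nx


def filter_by_degree(G, min_degree):
    """Return the subgraph induced by nodes with total degree >= min_degree."""
    keep = [node for node, degree in G.degree() if degree >= min_degree]
    # copy() gives an independent graph instead of a read-only view.
    return G.subgraph(keep).copy()


# Made-up network: two hubs each receive replies from 12 users,
# and one hub replies to the other.
G = nx.DiGraph()
G.add_edges_from((f"fan_{i}", "hub_1") for i in range(12))
G.add_edges_from((f"follower_{i}", "hub_2") for i in range(12))
G.add_edge("hub_1", "hub_2")

core = filter_by_degree(G, min_degree=12)
print(sorted(core.nodes), core.number_of_edges())
```

Note that degrees are measured in the full network, so a node kept by the filter can appear with fewer connections in the filtered graph.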
The entire network was too large to visualize effectively with NetworkX. There are several open-source tools built specifically to create network visualizations. One such tool is Gephi. Graphs from NetworkX can be exported as GraphML files and then imported into Gephi. Before saving the NetworkX graph as a GraphML file, I added a couple of the Python dictionaries containing centrality measures as node attributes. This allows one of these measures to be selected to size the nodes when creating the visualization.
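The export step can be sketched as follows. The toy graph, the attribute names, and the file name technews.graphml are all my own placeholders, and the two measures shown are just examples of dictionaries that could be attached.

```python
import os
import tempfile

import networkx as nx

# Toy reply network with invented user names.
G = nx.DiGraph()
G.add_edges_from(
    [("user_b", "user_a"), ("user_c", "user_a"), ("user_a", "user_b")]
)

# Attach centrality dictionaries as node attributes.
nx.set_node_attributes(G, nx.degree_centrality(G), "degree_centrality")
nx.set_node_attributes(G, nx.betweenness_centrality(G), "betweenness")

# Written to a temporary folder here; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "technews.graphml")
nx.write_graphml(G, path)

# Reading the file back confirms the attributes were saved with the nodes.
H = nx.read_graphml(path)
print(H.nodes["user_a"])
```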
I’ll leave describing my adventures in Gephi for another day!
Conclusion
This article provided an introduction to the main concepts of network analysis as well as Python code using the NetworkX package. Although the example data came from a social network, networks can be composed of any type of entity. NetworkX has some limitations with large networks of over 100M edges, and there are other Python packages better suited for such massive networks. In addition, visualizing more than a few hundred nodes may be better accomplished with a graph visualization tool such as Gephi.