This is a draft version, and a formal version is required to use LaTex or well-organized word document.
- Designed the contact frequency, find communities with max-flow and min-cut, and weighted PR ranking and HITS with Wang Shuai
- Crawled bilateral (mutual) friends of a single user within a distance of 2
- Obtain small communities (subgroups) via min-cut
- Plot the graph, with calculated PR, authorities and hubs
Crawling 2-hop Bilateral Friends
After a short inspection, I found that my frequently contact friends have an average number of 150 bilateral friends. In this case, a 2-hop-distance social network would have 150^2 = 22500 nodes, whilst a 3-hop-distance one would have 150^3 = 3375000 nodes, which is too heavy for further calculations, so the distance of 2 is chosen.
I used python, weibopy (an open source Sina Weibo API SDK) to do this task. Specifically, I stored only user ID and bilateral friends and pack them into vertices and edges in networkx. Eventually, the graph is exported to graphml for further use. After this step, I obtained a undirected graph of 20k vertices and 30k edges.
// TODO: Average bilateral friends per person ?
The graph obtained in previous step is rather sparse. // Density? It is because the users on the leaf (which have a degree of 1 or very small) are weakly connected to the main graph, and less likely to have strong relationship with the root and its community.
Moreover, the leaf nodes will significantly affect the max-flow/min-cut procedure in terms of in the early stage of min cut iteration, most of the cuts will be on the weak links of the leaf nodes, and it's time consumption is massive. Therefore, pruning the leaf nodes is necessary.
But by how many degrees of a leaf node has to prune? What is the critical point, as for keep the balance of performance and less potential community information loss? To find out, I pruned the graph with degree from 1 to 21.
At first we chose degree threshold 4, which is the obvious choice because the curve decreases mildly after 4 degrees. But in further experiments, there are still too many nodes to compute the max-flow/min-cut, and the initial communities are normally single nodes. So we decided to prune more.
The selected degree threshold is 15, for the sake of keeping as much information as we could, and a compromise to performance.
After pruning, the graph is now 288 nodes taking 3999 edges.
Max-flow and Min-cut
To identify the small communities of the social network, we defined the capacity of each edge. We calculated the contact frequency of each pair of users, and assigned the normalized value (0-1) to the corresponding edges. For details, please refer Wang Shuai's report.
Up to this point, each edge in the graph carries a weight of capacity and still, the graph is undirected.
Under the principle of using off-shelf utilities and tools, I have made several attempts on graph-mining packages, including networkx, graph-tools, igraph on python and igraph on R.
Despite networkx is a very powerful and easy-to-use python package, the
min_cut(G, s, t, capacity='capacity') function computes only the value of the cut, rather than returning 2 partitions that every other packages can do.
graph-tools is a python package but it is written in c, and has a huge amount of package dependencies which part of them are painful to install to my developing environment. After hours and hours of making and linking, I decided to abandon graph-tools.
igraph, with has both python and R support, is even more powerful (has a massive number of built-in functions) that networkx. Nonetheless, after I wrote a demo using python-igraph on the 1st question of assignment II to validate, I found that the result of min-cut differs (not even close) from the solution.
Luckily, the demo I wrote in R version of igraph matches the solution. Therefore, igraph on R is chosen in spite of some inconveniences of graph importing/exporting issues.
The Dilemma of Choosing Source and Sink
In a s-t cut, the flow starts from s (AKA. source) and is received at t (AKA. sink). But how do we determine a pair of s-t in a particular graph? Two method was proposed by Wang Shuai and myself.
- 2 highest weighted-degree nodes
As for as selecting the optimal pair of s-t, the diameter is intuitively considered, on account of the farthest pair has a high chance on the opposite sides of 2 partitions after min-cut, which meets the rule of the source and sink definition.
Instead of using the standard diameter measurement, we introduced the weighted diameter, the calculation method is as below:
WeightedDiameter = sum(weight_of_edge_on_path)
Both networkx and igraph provide the functions that compute the value of diameter only. On behalf of obtain the nodes which linked by path of diameter, normally we traverse all possible pairs and find the farthest one, whereas this method has a time complexity of O(n^2), which is intolerably slow. We need to seek a faster algorithm instead.
At the end we happened to learn a heuristic algorithm called Fast Map, which solves the farthest pair of nodes problem in PCA, on Prof. Tao's DM course. It can find two pair of nodes with no guarantee of farthest far enough distance, and more importantly, in linear time.
2 highest weighted-degree nodes
In some cases, using the farthest pair of nodes also doesn't assure on two potential communities. Hence, a method of using 2 highest weighted-degree nodes as s-t is introduced.
The princeple is very straightforward. Identify a pair of nodes, which have the largest and second largest sum of weighted-degree, and treat them as two most active users in respective community.
As a result, the min-cut algorithm will return 2 communities, either s-t on two fartherst rims, or represent the most active users of two communities.
The essential idea is using
maxflow(G, s, t, capacity) built-in function in igraph. The return value of
maxflow() is consisted by value, cut, partition1, partition2, etc.
In ideal condition, the algorithm is supposed to return a 50%-50% community (in terms of nodes) of the entire graph. Yet, in practise, the maxflow algorithm is likely to return 2 partitions that one is a single node, the other is the rest of nodes. The reason is selected s-t is not optimal, causing single node in one partition, which is merely a outlyer.
To acquire the proper communities, a min-cut iteration is devised. Procedure is described below.
Set i = 0. At the beginning of each loop, i++
Load current graph Gi. If i = 1, load the raw
Find the s-t, using diameter or 2 highest weighted-degree nodes
maxflow(), getting the edges being cut, and remove them, so that the graph contains two connected components.
Find the smaller connected component, output as a result, and remove the connected component's according nodes and edges from the graph.
The iteration terminates when there's only a single node remain in the graph.
Then identify small communities by selecting connected components which number of nodes larger than a threshold, e.g., 4, which means consider communities larger than 4 only.