Freitag, August 25, 2006

DB,SNA: Degrees of Separation

In a previous post I already described the power and usefulness of Self Joins. In this article I want to go one step further and address a possible approach to the computation of the famous indicator "degrees of separation" (see Six degrees of separation) by using database self joins. The algorithm can be used for computations on nearly arbitrary networks, either directed or undirected ones, either weighted or uniform networks. As byproducts the algorithm provides the diameter of the network and the best reachable person within the network.
The degrees of separation are defined as the medium distance between each pair of people/nodes in a network. So the first and most complex step is the generation of the closure of the network, i.e. a table that contains each pair of nodes and their minimal distances (shortest path). This is done by a flooding algorithm. After the n-th iteration all pairs of nodes are in the IterationTable that are n+1 steps apart at the very most.
Precondition: All direct links are given in table Relations (columns source, sink, weight (is 1 for unweighted graphs)). If you want to consider the network as undirected do not forget to add the corresponding backlinks as extra rows (union of the relation and its transposed pendant).

  1. Construct the table IterationTable and fill it with the contents of table Relations.

  2. Execute the SQL-Query IterationStep as long as the view Distances is changing. Remove redundant and unnecessary rows after each step (you may store the result of Distances in a temporary table, clear IterationTable and feed the data from the temporary table back in IterationTable).


Thereby the view Distances is given by

SELECT source, sink, MIN(distance) AS mindistance
FROM IterationTable
GROUP BY [source], [sink]

The IterationStep is given by

INSERT INTO IterationTable
SELECT IterationTable.source AS source, Relations.sink AS sink,
IterationTable.distance+Relations.weight AS distance
FROM IterationTable INNER JOIN Relations ON
IterationTable.sink=Relations.source

The deletion of redundant rows is very, very important for performance. Redundant are rows where there is already a shorter path in the data set.
You may also consider table maintenance operations of your dbms as intermediate steps to boost performance. For example, in postgresql you may vacuum the table. Thereby it is cleaned up and the table statistics are recalculated for faster access to a particular row.
The following view ConnectedTo can be used to check, whether Distances is changing:

SELECT source, COUNT(*) AS numreachable FROM Distances GROUP BY source

For undirected graphs we can state that within one connected component all belonging nodes will have the same value numreachable, namely the cardinality of the connected component. For directed graphs you have to compare row by row (computers are good at this task).
Let us assume that we have an undirected graph. Then the next step is the identification of the different connected components. This can be done by checking the view ConnectedTo for differing numbers. It is not very likely that there are two distinct connected components with exactly the same number of nodes. However, you can check by a simple SQL-Query whether there are two nodes with the same number that are not connected - just to be sure. The indicator "degrees of separation" is only meaningful within a single connected component. So you have to compute the degrees of separation for each connected component independently.
Assuming that there is just one connected component (all fragments are removed) now the following query has to be executed:

SELECT AVG(mindistance) AS DegreesOfSeparation
FROM Distances

And you are done.
You can get the diameter by using the SQL MAX aggregation operator and the best reachable person by grouping the previous query by source.
The sole problem of my pretty naïve approach is its scalability. There are very good algorithms to solve the shortest path problem between two nodes, however I am not aware of an efficient algorithm that computes the shortest paths between each pair of nodes within the network at once. And up to now I am also not aware of an efficient heuristic approach for the computation of the degrees of separation (please comment if you know). Please do not expect the described algorithm to work for large data sets. The art is probably to find useful intermediate steps to simplify the IterationTable. The more rows you can remove the better. For example, for the Go-network I have always removed the links where neither source nor sink were signed up to the survey. The Go-network consisted of nearly 800 direct links, 120 nodes (signed-up) and nearly 400 external nodes. The algorithm worked, however the intermediate table consisted of more than half a million rows before simplification.

0 Comments:

Post a Comment

<< Home