In a previous post I already described the power and usefulness of
Self Joins. In this article I want to go one step further and address a possible approach to the computation of the famous indicator "degrees of separation" (see
Six degrees of separation) by using database self joins. The algorithm can be used for computations on nearly arbitrary networks, either directed or undirected ones, either weighted or uniform networks. As byproducts the algorithm provides the diameter of the network and the best reachable person within the network.
The degrees of separation are defined as the medium distance between each pair of people/nodes in a network. So the first and most complex step is the generation of the closure of the network, i.e. a table that contains each pair of nodes and their minimal distances (shortest path). This is done by a flooding algorithm. After the n-th iteration all pairs of nodes are in the
IterationTable that are n+1 steps apart at the very most.
Precondition: All direct links are given in table
Relations (columns
source,
sink,
weight (is 1 for unweighted graphs)). If you want to consider the network as undirected do not forget to add the corresponding backlinks as extra rows (union of the relation and its transposed pendant).
- Construct the table
IterationTable and fill it with the contents of table Relations.
- Execute the SQL-Query
IterationStep as long as the view Distances is changing. Remove redundant and unnecessary rows after each step (you may store the result of Distances in a temporary table, clear IterationTable and feed the data from the temporary table back in IterationTable).
Thereby the view
Distances is given by
SELECT source, sink, MIN(distance) AS mindistance
FROM IterationTable
GROUP BY [source], [sink]
The
IterationStep is given by
INSERT INTO IterationTable
SELECT IterationTable.source AS source, Relations.sink AS sink,
IterationTable.distance+Relations.weight AS distance
FROM IterationTable INNER JOIN Relations ON
IterationTable.sink=Relations.source
The deletion of redundant rows is very, very important for performance. Redundant are rows where there is already a shorter path in the data set.
You may also consider table maintenance operations of your
dbms as intermediate steps to boost performance. For example, in
postgresql you may
vacuum the table. Thereby it is cleaned up and the table statistics are recalculated for faster access to a particular row.
The following view
ConnectedTo can be used to check, whether
Distances is changing:
SELECT source, COUNT(*) AS numreachable FROM Distances GROUP BY source
For undirected graphs we can state that within one connected component all belonging nodes will have the same value
numreachable, namely the cardinality of the connected component. For directed graphs you have to compare row by row (computers are good at this task).
Let us assume that we have an undirected graph. Then the next step is the identification of the different connected components. This can be done by checking the view
ConnectedTo for differing numbers. It is not very likely that there are two distinct connected components with exactly the same number of nodes. However, you can check by a simple SQL-Query whether there are two nodes with the same number that are not connected - just to be sure. The indicator "degrees of separation" is only meaningful within a single connected component. So you have to compute the degrees of separation for each connected component independently.
Assuming that there is just one connected component (all fragments are removed) now the following query has to be executed:
SELECT AVG(mindistance) AS DegreesOfSeparation
FROM Distances
And you are done.
You can get the diameter by using the SQL
MAX aggregation operator and the best reachable person by grouping the previous query by
source.
The sole problem of my pretty naïve approach is its scalability. There are very good algorithms to solve the shortest path problem between two nodes, however I am not aware of an efficient algorithm that computes the shortest paths between each pair of nodes within the network at once. And up to now I am also not aware of an efficient heuristic approach for the computation of the degrees of separation (please comment if you know). Please do not expect the described algorithm to work for large data sets. The art is probably to find useful intermediate steps to simplify the
IterationTable. The more rows you can remove the better. For example, for the Go-network I have always removed the links where neither source nor sink were signed up to the survey. The Go-network consisted of nearly 800 direct links, 120 nodes (signed-up) and nearly 400 external nodes. The algorithm worked, however the intermediate table consisted of more than half a million rows before simplification.