SQL Graph Algorithms
This page is for various algorithms that I've developed for handling directed graphs efficiently in SQL, based on real-word problems, and discussions on comp.databases.theory. The real-world problems I've worked on are:
This approach is a quick way to traverse a tree or graph with a minimum of code. Starting from a root or set of roots, it's a simple join to get everything reachable one edge away from the root, and everything one edge away from that set, and so on. This example assumes a table edges( parent_id int, child_id int ), and iteratively fills a temp table #reached of nodes which are reachable from the starting node, indicated by @root_id:
create table #reached( id int primary key ) insert into #reached values ( @root_id ) -- substitute root id as needed while ( @@rowcount > 0 ) begin insert into #reached (id ) select distinct child_id from edges e join #reached p on p.id = e.parent_id where e.child_id not in ( select id from #reached ) endThis script assumes that the graph may not be a tree, so it checks that each node it reaches has not already been inserted, and uses DISTINCT to handle the case where a node is reached via multiple paths during the same iteration. The script also does not try to avoid repetition; every reached node is used in every iteration. In trees with a large out-degree, new nodes in a given iteration will outnumber old ones, so explicitly separating the new nodes (the "frontier set") from the others may take more time than just throwing them all together in #reached.
This algorithm will not fail if a cycle exists in the graph.
If one can do this for every node, then one can obviously work one's way all around the tree with this method. The main difference in a database environment which makes this approach possible is that any access path can be used: parent-to-child, child-parent, and child-sibling can be traversed in one query, whereas typical pointer-based trees only allow one direction to be followed.
In depth-first search, the stack is really only needed in order to backtrack up the tree, from child to parent, which can be done via a query as noted. So, when the graph being searched is a tree (not a DAG, with multiple paths to a node), there is no need to keep a stack. This procedure begins at the root of a tree and travels its nodes in depth-first order, using simple rules to figure out which node to visit next.
treecrawl.sql (Oracle PL/SQL, with sample data)
"I just picked up a copy and it looks great! You are right about the whole approach and my stuff stinks." - Joe Celko, author of SQL for Smarties.
Most depth-first search algorithms work through the tree in a single-threaded fashion, issuing a sql statement for each node reached. Unfortunately the overhead of such a process can be considerable, and the single-threaded structure of the process makes it hard to take advantage of parallel processing.
When the object of such a traversal is a sequential labelling of the tree's nodes in preorder (as is needed for the "nested sets" indexing scheme, for instance), a more parallel method can be used which will yield the same numbering.
The following routine uses a more parallel approach to the problem, using statements which access an entire level of the tree at a time.
The first phase of the algorithm is a counting of each node's descendants. The first iteration sets the descendant count for each leaf node to 1, since a leaf node can have no child nodes by definition. Subsequent iterations update the descendant counts for nodes which have NULL in the count field; if all of a node's children have non-null counts, their sum will be non-null and the node will have the correct total. The iterations end when no further nodes are found for updating (or the root is updated, an equivalent condition). (I've since found that sum() is supposed to throw out nulls and return a non-null value, but this procedure works so far, possibly due to non-standard behavior.)
The second phase of the algorithm uses the fact that the descendant counts indicate how much the sequence number will increment when a given node's subtree is traversed. That is, given a node or a group of sibling nodes, it is no longer necessary to walk the nodes below them to derive the sequence increments for each subsearch. If a given parent node has sequence number i, its first child will be given i+1, and if the first child has j descendants, the second child node will be numbered (i+1)+j, and so on.
The first node to have its sequence number set is the root, with a value of 1. Then for each parent with a non-null sequence number, the child sequence numbers are calculated using the descendant counts and totalling across (in a theta join). Each iteration moves the frontier down a level, until no more nodes are found.
The result is a sequential numbering of the tree's nodes the same way a tree crawling algorithm does. The sample output from this script is the same as for treecrawl.sql.
superdfs.sql - Oracle PL/SQL procedures with sample data.
The transitive closure of a graph G is a set of node pairs (a,b) where there exists a path from a to b in G. Warshall's algorithm starts by finding paths of length 1 (edges), then length 2, 4, 8, and so on. Each iteration uses the connections found in the previous one, to find connections up to twice as long. In SQL, the process of joining pairs of connections (a,b) and (b,c) is straightforward, and the only caveat is to avoid inserting the same connection twice. The process ends when no new connections are found.
tc.sql Transact-SQL, not tested (although based on similar code that has worked)
The benefit of the transitive closure is that all the nodes under a given root in a tree, say, can be found with one query. In addition, cycles in the graph are shown by identical pairs (a,a), which indicate that a non-trivial path from a to a exists in the graph. Therefore you can avoid this message:
ORA-01436: CONNECT BY loop in user data Cause: The condition specified in a CONNECT BY clause caused a loop in the query, where the next record to be selected is a descendent of itself. When this happens, there can be no end to the query. Action: Check the CONNECT BY clause and remove the circular reference.