Strict Weak Ordering in a Tree Data Structure

This post presents an idea of a strict weak ordering definition for elements in a set of nodes from a tree data structure. After presenting the main idea, there's an example of some XPath queries that could be used in an XML document to achieve this ordering plus some complexity notes on the algorithms.

Assume we have a tree data structure (it may be any representation of a logical tree, e.g., DOM, JSON, YAML, your own... you name it) such that all nodes have at most one parent and we have a set of nodes from an instance of that tree whose elements are to be evaluated sequentially. If your set implementation requires a strict weak ordering for elements in the set (refer to STL's documentation for a definition of strict weak ordering and its application to C++'s set implementation) you might use a natural order provided by the set from which the node values are taken from (for example, if your nodes have string values, you may use lexicographical ordering which is a strict weak ordering for the set of strings).

Yet let us further assume that we need the elements in the set to be ordered in such a way that if Node A is an ancestor of Node B, then Node A is evaluated first than Node B. In such a case, we need to define a strict weak ordering that will satisfy this condition. Let < be a relation defined as follows:

if (A is ancestor of B) then A < B

then it's clear that this relation is antisymmetric and transitive for if Node A is an ancestor of Node B and Node B is an ancestor of Node C, then Node B cannot be an ancestor of Node A and Node A is an ancestor of Node C. Then this relation defines a strict weak order (it's a well known fact that transitiveness and antisymmetry imply antireflexiveness). Notice, though, that this order is not total, because there are nodes that cannot be compared (intuitively, two nodes that belong to distinct branches), thus it defines some equivalence classes inside the tree.

Now, let us explore some implementation details of this strict weak order in XML documents using XPath. The first inmediate approach is given by the following pseudocode

function order(A, B) // returns true if A < B
result <= XPathQuery("//NodeB/ancestor::NodeA") //evaluate the query and store results on variable
return (size(result) > 0) //was there any ancestor of node B that matched node A?
end function

But, depending on your XPath library, you may need to parse the document each time you wish to execute a query (if so, then either you have a bad library or you have a bad idea of how to use your library). In such case, an alternative algorithm to implement this ordering could be as follows: we define a path as a sequence of nodes (a(0), a(1),..., a(n)) such that a0 is the root node and an is a leaf node and a(i) is the parent node of a(i+1) for i=0,1,...,n-1. Then, by loading all paths from the XML document into a set of vectors, we can check if Node A < Node B by checking if there's any vector containing both nodes and Node A is before Node B in that vector. To load all paths, a simple algorithm is as follows:

set_of_vectors s
allLeaves = XPathQuery("//[count(child::) = 0]") //Load all nodes that have no children
for each leave in allLeaves
pathForLeaf = XPathQuery("ancestor-or-self::") // Notice that the context is the leaf node
insert pathForLeaf into s
end for

In the first case, no complexity guarantees can be examined from the code, so you would need to check complexity guarantees from the XPath library you are using (if they are provided at all). In the second case, some observations should be made: path loading depends solely on the XPath library as in the first case, yet the containers used to store retrieved nodes may affect the overall performance of the checking function. For example, using a set of vectors instead of a vector of vectors may not improve searching for a vector containing both nodes, even if the set implementation is hash-based, because usual hashing is done on the whole value of the elements and not on particular conditions of them (although I can think of at least one hash function that may optimize this search). On the other hand, and talking specifically about a C++ implementation of the algorithm, using a set of sets or a set of vectors may not improve performance: if you use a set, then finding the iterators iA and iB that point to Node A and Node B respectively is O(1), but then iA < iB is not a valid expression for set iterators (since it's not a Random Acces Container); on the other hand, if you use vectors, iterators may be found in O(log n) (using binary search from the STL) but then iA < iB is a valid expression (although I don't know what's the complexity guarantee of it).