Abstract: Finding frequent subsets from a group of itemsets is a critical part of the classic "Market Basket" problem. Walking through each itemset's subsets in lexicographic order is a useful way to find and count frequent itemsets, without candidate generation. The traversal can be optimized to skip set extensions which are known to be infrequent. A prefix tree is the most useful structure for this optimization.
Given a sorted group of itemsets, whose members are themselves in ascending order, eg:
ABCD
ABDE
ABE
AE
how can one find all subsets which occur in two (or in general some integer c) or more itemsets in the group?
One subset is already visible on the upper left, the common prefix AB of the first three rows. Simply sorting the sets lexicographically yields this result. One can also see, that if we skip C in the first set, the first two rows will be prefixed by ABD, and by various deletions and resortings, we can make all the common sets appear as prefixes in the first two rows.
If this deletion (subsetting) and re-sorting process can be done systematically, we can be sure that all frequent subsets will be found.
Given a group of sets as above, imagine sorting all of their subsets lexicographically, for instance:
ABCD: ABDE: ABE: AE: A A A A AB AB AB AE ABC ABD ABE E ABCD ABDE AE AC AD B ACD ADE BE AD AE E B B BC BD BCD BDE BD BE C D CD DE D E
This picture allows us to see the first frequent subset immediately, "A". If we remove all the A's, we see AB at the head of three of the four lists. To work down without missing anything, we simply need to scan the head of each list for the minimum value S, count and remove it from lists which have it, and repeat until all lists are empty.
This process is simply a merge join of all the ordered subset vectors, for all sets T. The difference from the classical join is that, if a given subset S is at the head of the vector for itemset T, we can calculate the next value S' from S and T. We only need to store the current value at the head of the vector in order to iterate through all successive values.
(Algorithm 1) 1. For each T, insert minimum non-empty subset S_T of T in a heap H Loop: 2. Remove S_T from the head of H 3. If S != S_prev, Set S_prev = S and S-Count = 1 4. Otherwise, S-Count = S-Count + 1 5. If S-Count = c, save S in frequent itemsets list 6. Find the successor S_T' of S_T in P(T), insert S_T' into H 7. Continue
This process is conceptually simple, but it generates every possible subset of every set T. It is correct, but inefficient. However the process can be pruned to skip some useless steps, while preserving the basic order of traversal.
Whereas algorithm 1 operates on a single subset at a time, the grouping of subsets sharing the same prefix allows us to count them and compare their number to the cutoff value c. A trie in which descendant counts are kept at each node allows us to make this measurement quickly, and allows groups of subsets sharing the same prefix to be operated on in one operation. Instead of n extractions from the heap in algorithm 1, we are able to retrieve the same n sets as one subtree from the trie.
Optimization 1: The trie allows us to extend to the longest prefix immediately by descending the minimal path. For instance if ABC is the minimum value in the trie, we may find that A and AB have more than c descendants, but ABC does not. We can extend to this longest frequent prefix P by simply descending the "left side" of the trie and checking the descendant counts on each node.
Optimization 2: Once a longest prefix P has been found, only 1-extensions of P need to be considered. For instance, if AB is frequent but ABC is not, then strings of the form ABCy, which occupy subsequent values in the ordering, cannot be frequent. Skipping these extensions of ABC is done by deleting node ABC and moving each descendant to its next possible value not having that prefix, for instance ABCDE -> ABD, skipping ABCD, ABCDE. In general, if P is frequent, Px is the first child of P and is infrequent, and Pxy_..Pxy_n are the descendants of Px, then we move them to Py_1..Py_n, simply deleting member x. If any y_i is empty, we backtrack to find the next subset of the parent itemset.
This operation preserves the lexicographic order used in algorithm 1 above, but skips the useless extensions of Px. For a given set T and its vector of subsets, this operation removes a range of subsets which are guaranteed not to be frequent.
The algorithm can be written as follows:
(Algorithm 2)
Start: 0. Insert the subsets T, with elements in ascending order, into a trie
Loop:
1. Find the minimum (in the lex. ordering) value S on the trie,
and the longest prefix P of S which has at least c descendants
2. Find the minimum child node x of P and its descendants of the form
Pxy_1...Pxy_n
3. Reinsert the descendants at Py_1..Py_n, for all non-empty y_i
4. Where y_i is empty, find the next subset of the parent set T
in the lexicographical subset ordering, if any.
5. Break when all subsets of every T have been exhausted.
Step 4 means taking a string Px and finding minimal Qw, where Q is the parent of
P and P < Qw <= Qx. In other words, we backtrack through the last element from
P, and
find the next subset of T following P.
Algorithm 2 makes substantial use of a trie (or digital search tree, or ternary search tree) to perform group operations on sets. However its operation uses the same ordering and joining technique as algorithm 1, which is simpler and based on the idea of a merge join. Both traverse candidate subsets in lexicographic order, and perform a similar join. Algorithm 2 gains efficiency by doing group operations on a trie, removing up to c-1 subsets at a time from the head of a "heap" represented by the trie, and skipping useless extensions of failed sets.
Algorithm 2 explores frequent subsets in a depth-first manner and never explores beyond a 1-extension of a frequent subset. The algorithm only tests combinations existing in at least one set.