Regular Expression Mining

By Sergei Savchenko, February 01, 2004

Regular expressions are convenient devices for identifying patterns in collections of strings.

Feb04: Regular Expression Mining

Sergei is a researcher with DecisionsToday Inc. in Ottawa, Canada. He can be contacted at [email protected].

Regular expressions are convenient devices to identify common patterns in collections of strings. Search engines, text editors, and other textdata-oriented tools use them to widen search targets.

The simplest regular expression over some alphabet, say {a,b,c}, is an atomic expression emitting a string consisting of a single symbol from an alphabet (a,b, or c). More complex expressions are constructed from simpler expressions:

If r₁ and r₂ are regular expressions, then so is the sequence of expressions r₁r₂, producing strings where the prefixes are emitted by the first expression and the suffixes by the second one.
If r₁ and r₂ are regular expressions, then so is a disjunction of the expressions r₁| r₂, producing strings of either expression r₁ or r₂.
If r is a regular expression, then so is r*, where symbols produced by r are repeated 0 to n times.

Additionally, brackets are used to clarify operator precedence.

Although the constructive rules I've described are sufficient, software that uses regular expressions often introduces syntactic additions. It is possible, for instance, to denote rr* as r⁺, thus describing a regular expression occurring 1 to n times. Also common are character classes, such as [a-z], which denote a disjunction of all letters between a and z (for instance, a|b|...|z). For example, the regular expression (ab⁺)⁺ denotes the family of strings ab, abb, abab, abbbabb, ababbbabb, and so on.

An interesting problem involving regular expressions revolves around regular expression mining (also known as "regular expression induction"). The problem is this: Given a set of strings, you want to automatically derive a regular expression that is likely to describe strings in the set. As it turns out, this isn't an easy problem to solve.

To begin with, it can be difficult to express in algorithmic terms exactly what you are looking for. You don't necessarily want an expression that exactly describes only a given ("training") set. Expressions such as these can be easily obtained as a disjunction of all strings in the training set. Besides, you probably want to find a pattern to help you identify previously unseen strings and determine whether they belong to the same class as the strings in the training set. Thus, some generalization of the training set must occur. However, an ultimate generalization is simply a disjunction of all symbols in the alphabet repeated multiple times. While an expression such as this describes the strings in the training set, it also describes any other strings over the same alphabet. This is not particularly helpful. What you want is a certain number of reasonable generalizations to obtain an expression that describes the training set, yet which is not so broad that it includes strings unlike the ones in the training set. It is hard, however, to quantitatively express these criteria.

The field of grammatical inference offers some approaches as to how regular expressions can be induced given a training set of strings. In this article, I present an algorithm—based on several of these approaches—that produces satisfactory results. Java source code that implements this algorithm is available electronically; see "Resource Center," page 5.

With this algorithm, you begin by constructing an overly specific automaton that represents all strings in the training set. Nodes in this automaton are marked with emitted symbols, whereas directed edges represent which nodes follow which nodes in the produced strings. You start construction of an overly specific automaton when you create two special nodes that do not emit symbols to represent the beginning and end. You can further represent each string as a node chain connecting the initial and final nodes. Clearly, this automaton exactly describes the training set and nothing else.

To illustrate, suppose you are given a training set consisting of only the two strings:

abb

abab

Figure 1(a) is an overly specific automaton for this set.

The algorithm proceeds by merging pairs of similar nodes in the automaton. You are looking for such nodes that have similar predecessor nodes, are similar in terms of symbols that they produce, and have similar successor nodes. The comparison criteria could potentially be recursive or limited to immediate neighbors. The comparison criteria I use is the weighted sum in Example 1, where p₁ and p₂ are:

p₁ and p₂ are sets of symbols emitted by immediate predecessor nodes.
n₁ and n₂ are sets of symbols produced by the nodes being compared.
t₁and t₂ are the types of symbols in the compared nodes.
s₁and s₂ are sets of symbols produced by immediate successor nodes.

Ratios of the size of the intersection over the size of the union describe the similarity of individual components of the weighted sum. Every node is only allowed a single type (such as a character, number, or separator type). Thus, the intersection of types is 1 if the nodes are of the same type; otherwise 0. The weight coefficients used could be as simple as 0.3, 0.3, 0.1, and 0.3 for c₁,c₂,c₃, and c₄, respectively. The weighted sum gives a value between 0 and 1 so that it is close to 1 for similar nodes, and close to 0 for dissimilar ones.

The algorithm proceeds iteratively at each step, merging a pair of nodes with the best comparison coefficient. When the comparison coefficients fall below a certain threshold (0.4, for instance), the merging process stops.

Figure 1 illustrates this process for a sample training set. The broken line identifies nodes to be merged. For example, initially in Figure 1(a), the two nodes producing symbol a are similar since they share the same predecessor node and their successor nodes emit the same symbol. Thus they are merged.

To merge two nodes, you create a new node such that it emits any symbol emitted by the parents and shares all of the arriving and departing edges. Although originally, any node produces only a single symbol, it is possible that after a merge a node will be producing one of several symbols.

Loops can be obtained in the merging process; see Figure 1(f). It is not hard to see that the resulting automaton represents the training set. Indeed it describes strings containing multiple blocks of repeating bs separated by as—exactly what the training set seems to suggest.

Once the automaton is derived, you want to turn it into a regular expression. The next algorithm I present does this. You can iteratively delete every node in the automaton (with the exception of the initial and the final nodes) by replacing it with edges marked with symbols that the node used to emit. For instance, assume in Figure 2(a) that you want to delete a node producing symbol s. Since this node has a loop, you will necessarily produce a string of at least one repeating s when passing through it. Thus, any two nodes that are connected via the current node can be joined directly by an edge marked with such an expression. When this is done for all nodes connected via the current one, the node itself can be safely deleted.

It may be beneficial to allow only a single marked edge to connect any two nodes. When another marked edge between the pair of nodes has to be created, you can update the expression marking the original edge. The new expression is simply a disjunction of the new and old expressions.

Figure 2 illustrates a trivial case. In the general case in Figure 3, there could already exist marked edges and marked loops. Since you are imposing a condition to allow only one edge to connect any two nodes in one direction, there may not be more than a single marked and single unmarked loop. Thus, Figure 3 represents the general transformation to delete a node by replacing it with marked edges.

Assuming that the arriving edge was marked by an expression a, the node emits an expression s, the departing edge is marked by an expression d, and there is a loop marked with expression l (as well as an unmarked loop); the resulting expression on the new marked edge is as⁺(ls⁺)*d. Why? Because on entering the node, the unmarked loop can be taken multiple times (represented by the first s⁺ in the expression). At an arbitrary point, the marked loop can be taken 0 to n times. However, on every new arrival back to the node, the unmarked loop may be used once again. Thus, to delete a node, all arriving edges must be reconnected with every departing edge using the procedure just described. Eventually, all nodes are deleted except for the initial and final nodes that, in the end, must be connected with a single edge marked with the expression we are seeking. Figure 4 illustrates the entire derivation process for this example. Postprocessing could shorten the resulting expression even further. In the particular case of the example, it can be shortened to just (ab⁺)⁺.

The algorithm just described runs in O(n³) time, where n is the number of symbols (not strings) in the training set. This order of complexity exists because, during automaton derivation to determine appropriate merge candidates, it is necessary to compare every node with every remaining node (thus, n²), and there are as many as n-1 pairs of nodes to merge (thus, n³). If you used recursive comparison criteria, the complexity would have been even higher.

Although this algorithm works well on small alphabets, it is not particularly useful on practical sets of strings in its unmodified form. One modification you can add to improve performance has to do with character classes. It is also based on the observation that regular expressions in practical applications often contain such classes as [a-z], [A-Z], and [0-9]. You can explicitly examine every node in the automaton during the derivation. If a subset of departing (or arriving) nodes produces a significant portion of some character class, you can collapse all these nodes together into just one marked with appropriate character class expression.

Another modification is based on the observation that many types of strings are of finite length. In this case, you want to avoid excessive generalizations. For instance, a ZIP code is a number occurring five times—not an arbitrary number of times. If a training set is determined to have a predominant length (or lengths), you can introduce a restriction to rule out merges that result in loops in the automaton.

You can also use knowledge of separator characters that occur exactly once in all strings in the training set. Such characters should be prohibited from appearing in any loop. These heuristics seem to significantly enhance practical performance of the algorithm.

In conclusion, the regular expression mining approach augments other techniques used in data mining. In a modified form, regular expression mining can be used to ensure data quality when the nature of data is not known beforehand.

DDJ

1 2 3 4 5 6 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Regular Expression Mining

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Regular Expression Mining

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content