June 08, 2007
The C5 Generic Collection LibraryUnusual Functionality
In addition to what you'd expect from a good collection library, C5 offers some unusual functionality:
Using C5
To illustrate the use of C5, consider a simple index of a filea sorted list of all words that appear in the file, and for each such word, a sorted list of the lines in which it appears. We use a dictionary to map each word to a set of line numbers; this avoids duplicate line numbers for each word. Using a tree dictionary makes sure the words come out sorted when the dictionary is printed, and similarly for the line numbers. For each word, the line numbers are represented by a TreeSet<int>, and hence the entire file index is represented by a TreeDictionary<String, TreeSet<int>>.
Let filename be the name of the file. Listing One(a) then builds the index in the tree-dictionary index. The foreach loop body is not optimal: It makes at least two searches in index for each string s; namely, one call to Contains and one or two calls to the indexer index[s]. Since this is a common pattern, dictionaries in the C5 library provide a method bool Find(K k, out V v) that combines the Contains(k) test with the lookup of the associated value v. Using Find, the number of searches is reduced to one or two; see Listing One(b).
(a)
IDictionary<String, TreeSet<int>> index
= new TreeDictionary<String, TreeSet<int>>();
Regex delim = new Regex("[^a-zA-Z0-9]+");
using (TextReader rd = new StreamReader(filename)) {
int lineno = 0;
for (String line = rd.ReadLine(); line != null; line = rd.ReadLine()) {
String[] res = delim.Split(line);
lineno++;
foreach (String s in res)
if (s != "") {
if (!index.Contains(s))
index[s] = new TreeSet<int>();
index[s].Add(lineno);
}
}
}
Listing One
The number of searches can be further reduced to exactly one for each string s by using the C5 method FindOrAdd, at the expense of creating a single empty TreeSet<int> in addition to those needed for all the words in the index. Regardless of how the index is built, it can be printed in alphabetical order using standard C# idioms; see Listing One(c). Listing One(d) is a fragment of the output.
A similar approach can be used to find all anagrams, such as "generate" and "teenager," in a text. From a word, you can compute the corresponding bag of characters, such as {a, e, e, e, g, n, r, t g}; this represents the anagram class of the word. You can create a dictionary mapping bags of characters to sets of words, and populate the dictionary in much the same way as the previous file index.
|
|
||||||||||||||||||||||||||||||
|
|
|
|