October 04, 2006
Great Book on High-Dimensional Indexing
Anyone dealing professionally with problems of abstracting and working with large masses of complex data -- from stock- and market analysis to the esoterica of temperature-analog "annealing" in back-prop neural networks -- will be familiar with the practice of converting data features to high-dimensional points. In some cases, the goal is visualization in a "terrain map" or similar format. But while visualization can certainly be useful, it's just one way we can imagine wanting to extract information and insight from such transformed data.
But it's hard to do similarity searches and other forms of knowledge discovery with high-dimensional datasets. In many applications, you swiftly encounter a combinatorial explosion (trees! everywhere, trees!) that disqualifies B+-tree "low-dimensional" indexing strategies, and even makes R-tree and similar multi-dimensional tools hard to scale.
No problemo (mei guangxi), says professor Cui Yu of the Dept. of Computer Science, Monmouth University. In her 2002 monograph, High-Dimensional Data Indexing, (Springer, preface online here as PDF), Prof. Yu proposes a set of improved transformative indexing, space-pruning, iteration-constraining and similarity-"distance"-judging strategies and algorithms that can make searches of even very dense HD datasets readily doable using a B+-tree indexing architecture. Her iMinMax(theta) strategy maps points in high-dimensional spaces to one-dimensional values by reference to minima and maxima of each point among all dimensions -- the process "tunable" for different data distributions by modifying the 'theta' value. The resulting data can be efficiently indexed with B+-tree.
Dr. Yu has product-architecture and development as well as academic and theoretical experience, and prioritizes real-world applications -- both in the generalized (i.e., she'd like to see standard database products and libraries incorporate these tools) and specific sense. Her writing is terse and charming, she explains hard things really well, and the applications of her work are undeniably cool. Definitely worth a read.
Posted by John Jainschigg at 11:14 AM Permalink
|