News Story
The Dwarf Structure for Creating, Storing, and Querying Highly Compressed Data Cubes (ISR IP)
For more information, contact ISR External Relations Director
Jeff Coriale at coriale@umd.edu or 301.405.6604.
Inventors: Nicholas Roussopoulos, John Sismanis, Antonios Deligiannakis, Ioannis Kotidis
U.S. Patent: 7,133,876
Description
University of Maryland researchers have developed and implemented a super-compressed data structure called Dwarf for storing and querying cubes of multidimensional data. Dwarf can "slice and dice" data cubes on every possible projection. For example, click-data from a web ISP server can be stored very efficiently in data slice that group and aggregate data by click destination, or by destination, source, by time (minute/hour/day/week/month/year, etc.). Such groupings are then used in "data mining" over an extremely large set of log data.
Dwarf automatically identifies prefix and suffix structural redundancies and factors them out by coalescing their store. Prefix redundancy is high on dense areas of cubes but suffix redundancy is significantly higher for sparse areas. Putting the two together fuses the exponential sizes of high dimensional full cubes into a dramatically condensed data structure. The elimination of suffix redundancy has an equally dramatic reduction in the computation of the cube because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation amongst attributes in the cube. A Petabyte 25-dimensional cube was shrunk this way to a 2.3GB Dwarf Cube, in less than 20 minutes, a 1:400000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure which requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions.
The dwarf structure plays a double role as a storage and indexing mechanism for high dimension data. Roll-up and drill-down queries seem to benefit from the dwarf structure due to common paths that are exploited while caching. In terms of update speed, dwarf by far outperforms the closest competitor for storing the full data cube, while their performance is comparable when the competitor is reduced to storing only a partial cube of the same size as Dwarf. For the APB-1 Benchmark density 5, the fastest performance results are published by Oracle-Express and take 4.5 hours to generate a partial cube of about 30GB while the corresponding Dwarf cube is generated in 21 minutes and has a size of 4.3GB.
A number of Dwarf patents are pending.
For more information
If you would like to license this intellectual property, have questions, would like to contact the inventors, or need more information, contact ISR External Relations Director Jeff Coriale at coriale@umd.edu or 301.405.6604.
Find more ISR IP
You can go to our main IP search page to search by research category or faculty name. Or view the entire list of available IP on our complete IP listing page.
ISR-IP-Roussopoulos ISR-IP-multidimensional ISR-IP databases
Published June 19, 2007