Maximum Entropy Techniques
for Analyzing Large Transaction Datasets

Project Participants

Introduction

Massive datasets containing huge numbers of records are of increasing interest among both organizations that routinely collect such data and data miners who try to find regularities in them. One class of such datasets is transaction data that is typically binary in nature. Such data sets are characterized by high sparseness, i.e., there may be hundreds and thousands of binary attributes but a particular record may only have a few of them set to 1. Examples include retail transaction data and Web log data, where each row is a transaction or session and each column represents a product or Web page.

Knowing the counts of Boolean conjunctive queries on arbitrary subsets of attributes is an important data mining task (which is usually referred to as "query selectivity estimation"). This task can be reduced to finding the probability of the query P(Q) with respect to the empirical probability distribution defined by the data. While any Boolean query can be answered using a single scan through the dataset, this approach becomes infeasible for real-time queries on huge datasets that do not reside in main memory. We would thus like to have an approximate algorithm that would allow us to trade accuracy in the estimate of P(Q) with the time taken to calculate it.

We study different probabilistic models for approximating P(Q), in particular, the independence model, the Chow-Liu tree model, the maximum entropy model with various enhancements, such as clustering and bucket elimination, the mixture of Bernoulli models, and AD trees. We compare the models with respect to their memory footprint, accuracy, and the online time to estimate the query count. For more discussion on the models used and the experimental setup see our papers referenced below. The following plots illustrate the relative performance of the models.

Figure 1. Average relative error on the 500 random queries as a function of model complexity for the Microsoft Web data. The X-axis reflects the number of parameters in the models. By varying the complexity parameters (such as the number of clusters in the mixture model, the size of the AD tree or the upper limit on the frequency of itemsets used in learning of the maximum entropy model) we obtained multiple points on the graphs for some of the models. Not surprisingly the more memory the model takes, the more accurate it is (since we are specifically interested in querying the same data as the model is built on rather than generalization to new data).

Figure 2. Average relative error on the 500 random queries as a function of the online time for the Microsoft Web data. Multiple points for some of the models were obtained by varying the model complexity as on the Figure 1. Note that AD trees are the best in terms of error-online time tradeoff but they take orders of magnitude more memory to achieve such performance. The maximum entropy models are still able to offer high accuracy and require less memory than AD trees but the online time requirements for them grow exponentially with the query size.

Results to date

Papers

Collaborator and Funding

This is joint work with Heikki Mannila at Nokia Research, partially funded by the National Science Foundation.

Related Web Pages of Interest


Information and Computer Science
University of California, Irvine CA 92697-3425


Last modified: September 4, 2000