Top | Web home | SourceForge home | ||
$Id: //guest/julian_hyde/mondrian/doc/overview.html#3 $ | ||
(C) Copyright 2002, Kana Software, Inc. and others | ||
Author | Julian Hyde (julian.hyde@mail.com) | |
---|---|---|
Created | February 13th, 2002 |
Mondrian is an OLAP engine written in Java. It executes queries written in the MDX language, reading data from a relational database (RDBMS), and presents the results in a multidimensional format via a Java API. Let's go into what that means.
Online Analytical Processing (OLAP) means analysing large quantities of data in real-time. Unlike Online Transaction Processing (OLTP), where typical operations read and modify individual and small numbers of records, OLAP deals with data in bulk, and operations are generally read-only. The term 'online' implies that even though huge quantities of data are involved — typically many millions of records, occupying several gigabytes — the system must respond to queries fast enough to allow an interactive exploration of the data. As we shall see, that presents considerable technical challenges.
OLAP employs a technique called Multidimensional Analysis. Whereas a relational database stores all data in the form of rows and columns, a multidimensional dataset consists of axes and cells. Consider the dataset
Year 2000 2001 Growth Product Dollar sales Unit sales Dollar sales Unit sales Dollar sales Unit sales Total $17,165 $2,825 $18,867 3,163 10% 12% — Books $12,845 956 $14,562 1,121 13% 17% —— Fiction $1,341 424 $1,202 380 16% 37% —— Non-fiction $1,412 400 $1,224 386 11% 2% — Magazines $2,753 824 $2,426 766 -12% -7% — Greetings cards $1,567 1,045 $1,879 1,276 20% 22%
The rows axis consists of the members 'All products', 'Books', 'Fiction', and so forth, and the columns axis consists of the cartesian product of the years '2000' and '2001', and the calculation 'Growth', and the measures 'Unit sales' and 'Dollar sales'. Each cell represents the sales of a product category in a particular year; for example, the dollar sales of Magazines in 2001 were $2426.
This is a richer view of the data than would be presented by a relational database. The members of a multidimensional dataset are not always values from a relational column. 'Total', 'Books' and 'Fiction' are members at successive levels in a hierarchy, each of which is rolled up to the next. And even though it is alongside the years '2000' and '2001', 'Growth' is a calculated member, which introduces a formula for computing cells from other cells.
The dimensions used here — products, time, and measures — are just three of many dimensions by which the dataset can be categorized and filtered. The collection of dimensions, hierarchies and measures is called a cube.
I hope I have demonstrated that multidimensional is above all a way of presenting data. Although some multidimensional databases store the data in multidimensional format, I shall argue that it is simpler to store the data in relational format. It's time to look at the architecture of an OLAP system.
A Mondrian OLAP System consists of four layers; working from the eyes of the end-user to the bowels of the data center, these are the presentation layer, the calculation layer, the aggregation layer, and the storage layer.
The presentation layer determines what the end-user sees on his or her monitor, and how he or she can interact to ask new questions. There are many ways to present multidimensional datasets, including pivot tables (an interactive version of the table shown above), pie, line and bar charts, and advanced visualization tools such as clickable maps and dynamic graphics. These might be written in Swing or JSP, charts rendered in JPEG or GIF format, or transmitted to a remote application via XML. What all of these forms of presentation have in common is the multidimensional 'grammar' of dimensions, measures and cells in which the presentation layer asks the question is asked, and OLAP server returns the answer.
The second layer is the calculation layer. The calculation layer parses, validates and executes MDX queries. A query is evaluted in multiple phases. The axes are computed first, then the values of the cells within the axes. For efficiency, the calculation layer sends cell-requests to the aggregation layer in batches. A query transformer allows the application to manipulate existing queries, rather than building an MDX statement from scratch for each request. And metadata describes the the dimensional model, and how it maps onto the relational model.
The third layer is the aggregation layer. An aggregation is a set of measure values ('cells') in memory, qualified by a set of dimension column values. The calculation layer sends requests for sets of cells. If the requested cells are not in the cache, or derivable by rolling up an aggregation in the cache, the aggregation manager and sends a request to the storage layer.
The storage layer is an RDBMS. It is responsible for providing aggregated cell data, and members from dimension tables. I describe below why I decided to use the features of the RDBMS rather than developing a storage system optimized for multidimensional data.
All four of these components can exist on the same machine. Layers 2 and 3, which comprise the Mondrian server, must be on the same machine. The storage layer could be on another machine, accessed via remote JDBC connection. In a multi-user system, the presentation layer would exist on each end-user's machine (except in the case of JSP pages generated on the server).
OLAP Servers are generally categorized according to how they store their data:
Three kinds of data need to be stored: fact table data (the transactional records), aggregates, and dimensions.
MOLAP databases store fact data in multidimensional format, but if there are more than a few dimensions, this data will be sparse, and the multidimensional format does not perform well. A HOLAP (hybrid OLAP) system solves this problem by leaving the most granular data in the relational database, but stores aggregates in multidimensional format.
Pre-computed aggregates are necessary for large data sets, otherwise certain
queries could not be answered without reading the entire contents of the fact
table. MOLAP aggregates are often an image of the in-memory data structure,
broken up into pages and stored on disk. ROLAP aggregates are stored in tables.
In some ROLAP systems these are explicitly managed by the OLAP server; in other
systems, the tables are declared as materialized views, and they are implicitly
used when the OLAP server issues a query with the right combination of columns
in the group by
clause.
The final component of the aggregation strategy is the cache. The cache holds pre-computed aggregations in memory so subsequent queries can access cell values without going to disk. If the cache holds the required data set at a lower level of aggregation, it can compute the required data set by rolling up.
The cache is arguably the most important part of the aggregation strategy because it is adaptive. It is difficult to choose a set of aggregations to pre-compute which speed up the system without using huge amounts of disk, particularly those with a high dimensionality or if the users are submitting unpredictable queries. And in a system where data is changing in real-time, it is impractical to maintain pre-computed aggregates. A reasonably sized cache can allow a system to perform adequately in the face of unpredictable queries, with few or no pre-computed aggregates.
Mondrian's aggregation strategy is as follows:
group by
queries. Again, why develop an aggregator when the RDBMS has one?The general idea is to delegate unto the database what is the database's. This places additional burden on the database, but once those features are added to the database, all clients of the database will benefit from them. Multidimensional storage would reduce I/O and result in faster operation in some circumstances, but I don't think it warrants the complexity at this stage.
A wonderful side-effect is that because Mondrian requires no storage of its own, it can be installed by adding a JAR file to the class path and be up and running immediately. Because there are no redundant data sets to manage, the data-loading process is easier, and Mondrian is ideally suited to do OLAP on data sets which change in real time.
Note to self: The cache manager ought to distinguish between data which is being pulled into the cache to be rolled up immediately into some other aggregation, and an aggregation which is explicitly needed.
See {@link mondrian.olap.Parser}.
It is represented as an XML file. The metadata is loaded into memory the
first time you reference a dimensional model. You can modify the model at
runtime by creating instances of classes such as {@link
mondrian.rolap.RolapHierarchy}
.
todo: See {@link mondrian.olap.Query} and {@link mondrian.olap.Result}.
todo: The package {@link mondrian.rolap}
. is the one and
only implementation of the API. The DriverManager (class {@link
mondrian.olap.DriverManager}
) acts as class-factory.
todo: How members are calculated...
todo: How aggregations are batched...
todo: MDX functions. See user-defined functions.
Aggregations are based upon the relational model: as far as the aggregation
manager is concerned, there is no relationship between the columns city
and state
. This means that all roll-ups are the same: you just drop
a column. Consider the 3 roll-ups possible by dropping a column from the
aggregation {gender
, city
, state
}:
dropping gender
is equivalent to removing the [Gender]
dimension; dropping city
is equivalent to rolling up to a higher
level in the [Geography]
hierarchy; and dropping state
is not even allowed in the dimensional model (no, sorry, you can't ask about
products sold in a cities called 'Portland'). This approach will also allow us
to implement 'drill anywhere'.
An aggregation is defined by a search condition, for example, {state in
('CA', 'OR', 'WA'), city = any, gender = 'M', measure = 'Unit sales'}
.
The any
value is important; if we had asked for a specific
set of cities, we would not later be able to roll-up by dropping the city
column.
The caching strategy is to throw out the aggregation with the lowest cost/benefit ratio. The 'benefit' of an item is the effort it took to produce (effort which it is saving future queries) multiplied by its 'usefulness' which declines exponentially if it is not used over time. The 'cost' of an item is its size.
Something like this.
Because there isn't one. MDX is a component of Microsoft's OLE DB for OLAP standard which, as the name implies, only runs on Windows. Mondrian's API is fairly similar in flavor to ADO MD (ActiveX Data Objects for Multidimensional), a API which Microsoft built in order to make OLE DB for OLAP easier to use.
XML for Analysis is pretty much OLE DB for OLAP expressed in Web Services rather than COM, and therefore seems to offer a platform-neutral standard for OLAP, but take-up seems to be limited to vendors who supported OLE DB for OLAP already.
The other query vendors failed to reach consensus several years ago with the OLAP Council API, and are now encamped on the JOLAP specification.
I plan to provide a JOLAP API to Mondrian as soon as JOLAP is available.
Not very much.
StrToSet()
and StrToTuple()
functions take
an extra parameter.Param()
and ParamRef()
allow
you to create parameterized MDX statements.todo: User-defined functions
todo: Cell readers
todo: Member readers
Yes, if your RDBMS can. We delegate the aggregation to the RDBMS, and if your RDBMS happens to have materialized group by views created, your query will fly. And the next time you run the same or a similar query, that will really fly, because the results will be in the aggregation cache.
Please send me an email, and let me know what you liked and didn't like about it. If you can think of ways that Mondrian can be improved, roll up your sleeves and help make it better. If you use Mondrian in your application, consider sharing your work so that everyone can use it.
End $Id: //guest/julian_hyde/mondrian/doc/overview.html#3 $ |