<body> <html> <head> <meta http-equiv="Content-Language" content="en-us"> <meta name="GENERATOR" content="Microsoft FrontPage 5.0"> <meta name="ProgId" content="FrontPage.Editor.Document"> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> <title>Mondrian overview</title> <style> A:link { color:#000066; } A:visited { color:#666666; } A.clsIncCpyRt, A.clsIncCpyRt:visited, P.clsIncCpyRt { font-weight:normal; font-size:75%; font-family:verdana,arial,helvetica,sans-serif; color:black; text-decoration:none; } A.clsLeftMenu, A.clsLeftMenu:visited { color:#000000; text-decoration:none; font-weight:bold; font-size:8pt; } A.clsBackTop, A.clsBackTop:visited { margin-top:10; margin-bottom:0; padding-bottom:0; font-size:75%; color:black; } A:hover, A.clsBackTop:hover, A.clsIncCpyRt:hover, A:active { color:blue; } A.clsGlossary { font-size:10pt; color:green; } BODY { font-size:80%; font-family:verdana,arial,helvetica,sans-serif; } BUTTON.clsShowme, BUTTON.clsShowme5 { font-weight:bold; font-size:11; font-family:arial; width:68; height:23; position:relative; top:2; background-color:#002F90; color:#FFFFFF; } DIV.clsBeta { font-weight:bold; color:red; } DIV.clsDocBody { margin-left:10px; margin-right:10px; } DIV.clsDocBody HR { margin-top:0; } DIV.clsDesFooter { margin:10px 10px 0px 223px; } DIV.clsFPfig { font-size:80%; } DIV.clsHi { padding-left:2em; text-indent:-2em } DIV.clsShowme { margin-bottom:.5em; margin-top:.5em; } H1{ font-size:145%; margin-top:1.25em; margin-bottom:0em; } H2 { font-size:135%; margin-top:1.25em; margin-bottom:.5em; } H3 { font-size:128%; margin-top:1em; margin-bottom:0em; } H4 { font-size:120%; margin-top:.8em; margin-bottom:0em; } H5 { font-size:110%; margin-top:.8em; margin-bottom:0em; } H6 { font-size:70%; margin-top:.6em; margin-bottom:0em; } HR.clsTransHR { position:relative; top:20; margin-bottom:15; } P.clsRef { font-weight:bold; margin-top:12pt; margin-bottom:0pt; } PRE { background:#EEEEEE; margin-top:1em; 
margin-bottom:1em; margin-left:0px; padding:5pt; } PRE.clsCode, CODE.clsText { font-family:'courier new',courier,serif; font-size:130%; } PRE.clsSyntax { font-family:verdana,arial,helvetica,sans-serif; font-size:120%; } SPAN.clsEntryText { line-height:12pt; font-size:8pt; } SPAN.clsHeading { color:#00319C; font-size:11pt; font-weight:bold; } SPAN.clsDefValue, TD.clsDefValue { font-weight:bold; font-family:'courier new' } SPAN.clsLiteral, TD.clsLiteral { font-family:'courier new'; } SPAN.clsRange, TD.clsRange { font-style:italic; } SPAN.clsShowme { width:100%; filter:dropshadow(color=#000000,OffX=2.5,OffY=2.5,Positive=1); position:relative; top:-8; } TABLE { font-size:100%; } TABLE.clsStd { background-color:#444; border:1px none; cellspacing:0; cellpadding:0 } TABLE.clsStd TH, BLOCKQUOTE TH { font-size:100%; text-align:left; vertical-align:top; background-color:#DDD; padding:2px; } TABLE.clsStd TD, BLOCKQUOTE TD { font-size:100%; vertical-align:top; background-color:#EEE; padding:2px; } TABLE.clsParamVls, TABLE.clsParamVls TD { padding-left:2pt; padding-right:2pt; } #TOC { visibility:hidden; } UL UL, OL UL { list-style-type:square; } .clsHide { display:none; } .clsShow { } .clsShowDiv { visibility:hidden; position:absolute; left:230px; top:140px; height:0px; width:170px; z-index:-1; } .#pBackTop { display:none; } #idTransDiv { position:relative; width:90%; top:20; filter:revealTrans(duration=1.0, transition=23); } /*** INDEX-SPECIFIC ***/ A.clsDisabled { text-decoration:none; color:black; cursor:text; } A.clsEnabled { cursor:auto; } SPAN.clsAccess { text-decoration:underline; } TABLE.clsIndex { font-size:100%; padding-left:2pt; padding-right:2pt; margin-top: 17pt; } TABLE.clsIndex TD { margin:3pt; background-color:#EEEEEE; } TR.clsEntry { vertical-align:top; } TABLE.clsIndex TD.clsLetters { background-color:#CCCCCC; text-align:center; } TD.clsMainHead { background-color:#FFFFFF; vertical-align:top; font-size:145%; font-weight:bold; margin-top:1.35em; 
margin-bottom:.5em; } UL.clsIndex { margin-left:20pt; margin-top:0pt; margin-bottom:5pt; } LI OL { padding-bottom: 1.5em } /*** GALLERY/TOOLS/SAMPLES ***/ FORM.clsSamples { margin-bottom:0; margin-top:0; } H1.clsSampH1 { font-size:145%; margin-top:.25em; margin-bottom:.25em; } H1.clsSampHead { margin-top:5px; margin-bottom:5px; font-size:24px; font-weight:bold; font-family:verdana,arial,helvetica,sans-serif; } H2.clsSampTitle { font-size:128%; margin-top:.2em; margin-bottom:0em; } TD.clsDemo { font-size:8pt; color:#00319C; text-decoration:underline; } .clsSampDnldMain { font-size:11px; font-family:verdana,arial,helvetica,sans-serif; } .clsShowDesc { cursor:hand; } A.clsTools { color:#0B3586; font-weight:bold; } H1.clsTools, H2.clsTools { color:#0B3586; margin-top:5px; } TD.clsToolsHome { font-size:9pt; line-height:15pt; } SPAN.clsToolsTitle { color:#00319C; font-size:11pt; font-weight:bold; text-decoration:none; } /*** DESIGN ***/ P.cat { font-size:13pt; color:#787800; text-decoration:none; margin-top:18px; } P.author { font-size:9pt; font-style:italic; line-height:13pt; margin-top:10px; } P.date { font-size:8pt; line-height:12px; margin-top:0px; color:#3366FF; } P.graph1 { line-height:13pt; margin-top:-10px; } P.col { line-height:13pt; margin-top:10px; margin-left:5px; } P.cal1 { text-decoration:none; margin-top:-10px; } P.cal2 {margin-top:-10px; } P.photo { font-size:8pt; } /*** DOCTOP ***/ #tblNavLinks A { color:black; text-decoration:none; font-family:verdana,arial,helvetica,sans-serif; } #lnkShowText, #lnkSyncText, #lnkSearchText, #lnkIndexText { font-size:8pt; font-weight:bold; } #lnkPrevText, #lnkNextText, #lnkUpText { font-size:7.5pt; font-weight:normal; } DIV.clsBucketBranch { margin-left:10px; margin-top:15px; margin-bottom:-10pt; font-style:italic; font-size:85%; } DIV.clsBucketBranch A, DIV.clsBucketBranch A:link, DIV.clsBucketBranch A:active, DIV.clsBucketBranch A:visited { text-decoration:none; color:black; } DIV.clsBucketBranch A:hover { color:blue; 
} /*** SDK, IE4 ONLY ***/ DIV.clsExpanded, A.clsExpanded { display:inline; color:black; } DIV.clsCollapsed, A.clsCollapsed { display:none; } SPAN.clsPropattr { font-weight:bold; } #pStyles, #pCode, #pSyntax, #pEvents, #pStyles {display:none; text-decoration:underline; cursor:hand; } /*** jhyde added ***/ CODE { color:maroon; font-family:'courier new' } DFN { font-weight:bold; font-style:italic; } </style> </head> <!-- This sentence is here to fool javadoc (which is looking for a period, and otherwise finds one inside one of our header tables). --> <table border="1" class="clsStd" width="100%"> <tr> <td colspan="2"><a href="index.html">Top</a> | <a href="http://public.perforce.com/guest/julian_hyde/mondrian/doc/index.html">Web home</a> | <a href="http://sourceforge.net/projects/mondrian/">SourceForge home</a></td> <td width="0" align="right" rowspan="2"> <a href="http://sourceforge.net"><img src="http://sourceforge.net/sflogo.php?group_id=35302&type=1" width="88" height="31" border="0" alt="SourceForge.net Logo"></a></td> </tr> <tr> <td colspan="2"><em>$Id: //guest/paul_dymecki/mondrian/doc/overview.html#1 $</em></td> </tr> <tr> <td colspan="3"><em>(C) Copyright 2002, Kana Software, Inc. and others</em></td> </tr> <tr> <th align="right" width="30%">Author</th> <td colspan="2">Julian Hyde (<a href="mailto:julian.hyde@mail.com">julian.hyde@mail.com</a>)</td> </tr> <tr> <th align="right" width="30%">Created</th> <td colspan="2">February 13<sup><font face="Verdana">th</font></sup>, 2002</td> </tr> </table> <h1>Mondrian overview</h1> <p>Mondrian is an OLAP engine written in Java. It executes queries written in the MDX language, reading data from a relational database (RDBMS), and presents the results in a multidimensional format via a Java API. Let's go into what that means.</p> <h2>Online Analytical Processing</h2> <p><dfn><font face="Verdana">Online Analytical Processing (OLAP)</font></dfn> means analysing large quantities of data in real-time. 
Unlike Online Transaction Processing (OLTP), where typical operations read and modify individual records or small numbers of records, OLAP deals with data in bulk, and operations are generally read-only. The term 'online' implies that even though huge quantities of data are involved (typically many millions of records, occupying several gigabytes), the system must respond to queries fast enough to allow interactive exploration of the data. As we shall see, that presents considerable technical challenges.</p> <p>OLAP employs a technique called <dfn><font face="Verdana">Multidimensional Analysis</font></dfn>. Whereas a relational database stores all data in the form of rows and columns, a multidimensional dataset consists of <dfn><font face="Verdana">axes</font></dfn> and <dfn><font face="Verdana">cells</font></dfn>. Consider the following dataset:</p> <blockquote> <table border="0" style="clsStd" id="AutoNumber1" cellpadding="2"> <tr> <td nowrap><i>Year</i></td> <th align="right" colspan="2">2000</th> <th align="right" colspan="2">2001</th> <th align="right" colspan="2">Growth</th> </tr> <tr> <td nowrap><i>Product</i></td> <th align="right">Dollar sales</th> <th align="right">Unit sales</th> <th align="right">Dollar sales</th> <th align="right">Unit sales</th> <th align="right">Dollar sales</th> <th align="right">Unit sales</th> </tr> <tr> <th nowrap>Total</th> <td align="right">$17,165</td> <td align="right">2,825</td> <td align="right">$18,867</td> <td align="right">3,163</td> <td align="right">10%</td> <td align="right">12%</td> </tr> <tr> <th nowrap> Books</th> <td align="right">$12,845</td> <td align="right">956</td> <td align="right">$14,562</td> <td align="right">1,121</td> <td align="right">13%</td> <td align="right">17%</td> </tr> <tr> <th nowrap> Fiction</th> <td align="right">$1,341</td> <td align="right">424</td> <td align="right">$1,202</td> <td align="right">380</td> <td align="right">16%</td> <td align="right">37%</td> </tr> <tr> <th nowrap> Non-fiction</th> <td
align="right">$1,412</td> <td align="right">400</td> <td align="right">$1,224</td> <td align="right">386</td> <td align="right">11%</td> <td align="right">2%</td> </tr> <tr> <th nowrap> Magazines</th> <td align="right">$2,753</td> <td align="right">824</td> <td align="right">$2,426</td> <td align="right">766</td> <td align="right">-12%</td> <td align="right">-7%</td> </tr> <tr> <th nowrap> Greetings cards</th> <td align="right">$1,567</td> <td align="right">1,045</td> <td align="right">$1,879</td> <td align="right">1,276</td> <td align="right">20%</td> <td align="right">22%</td> </tr> </table> </blockquote> <p>The rows axis consists of the members 'Total', 'Books', 'Fiction', and so forth. The columns axis consists of the cartesian product of two sets: the years '2000' and '2001' together with the <font face="Verdana">calculation</font> 'Growth', and the <dfn>measures</dfn> 'Unit sales' and 'Dollar sales'. Each cell represents the sales of a product category in a particular year; for example, the dollar sales of Magazines in 2001 were $2,426.</p> <p>This is a richer view of the data than would be presented by a relational database. The members of a multidimensional dataset are not always values from a relational column. 'Total', 'Books' and 'Fiction' are members at successive levels in a <dfn>hierarchy</dfn>, each of which is rolled up to the next. And even though it is alongside the years '2000' and '2001', 'Growth' is a <dfn>calculated member</dfn>, which introduces a formula for computing cells from other cells.</p> <p>The dimensions used here (products, time, and measures) are just three of many dimensions by which the dataset can be categorized and filtered. The collection of dimensions, hierarchies and measures is called a <dfn> <font face="Verdana">cube</font></dfn>.</p> <p>I hope I have demonstrated that multidimensional analysis is above all a way of <em> <font face="Verdana">presenting</font></em> data.
Although some multidimensional databases <em><font face="Verdana">store</font></em> the data in multidimensional format, I shall argue that it is simpler to store the data in relational format. It's time to look at the architecture of an OLAP system.</p> <h2>Architecture</h2> <p>A Mondrian OLAP System consists of four layers; working from the eyes of the end-user to the bowels of the data center, these are the presentation layer, the calculation layer, the aggregation layer, and the storage layer.</p> <p>The <dfn><font face="Verdana">presentation layer</font></dfn> determines what the end-user sees on his or her monitor, and how he or she can interact to ask new questions. There are many ways to present multidimensional datasets, including pivot tables (an interactive version of the table shown above), pie, line and bar charts, and advanced visualization tools such as clickable maps and dynamic graphics. These might be written in Swing or JSP, with charts rendered in JPEG or GIF format, or transmitted to a remote application via XML. What all of these forms of presentation have in common is the multidimensional 'grammar' of dimensions, measures and cells in which the presentation layer asks the question and the OLAP server returns the answer.</p> <p>The second layer is the <dfn><font face="Verdana">calculation layer</font></dfn>. The calculation layer parses, validates and executes MDX queries. A query is evaluated in multiple phases. The axes are computed first, then the values of the cells within the axes. For efficiency, the calculation layer sends cell requests to the aggregation layer in batches. A <dfn> <font face="Verdana">query transformer</font></dfn> allows the application to manipulate existing queries, rather than building an MDX statement from scratch for each request.
<dfn> <font face="Verdana">Metadata</font></dfn> describes the dimensional model, and how it maps onto the relational model.</p> <p>The third layer is the <dfn><font face="Verdana">aggregation layer</font></dfn>. An aggregation is a set of measure values ('cells') in memory, qualified by a set of dimension column values. The calculation layer sends requests for sets of cells. If the requested cells are neither in the cache nor derivable by rolling up an aggregation in the cache, the aggregation manager sends a request to the storage layer.</p> <p>The <dfn><font face="Verdana">storage layer</font></dfn> is an RDBMS. It is responsible for providing aggregated cell data, and members from dimension tables. I describe <a href="#Storage_and_aggregation_strategies">below</a> why I decided to use the features of the RDBMS rather than developing a storage system optimized for multidimensional data.</p> <p>All four of these layers can exist on the same machine. Layers 2 and 3, which comprise the Mondrian server, must be on the same machine. The storage layer could be on another machine, accessed via a remote JDBC connection. In a multi-user system, the presentation layer would exist on each end-user's machine (except in the case of JSP pages generated on the server).</p> <h3><a name="Storage_and_aggregation_strategies">Storage and aggregation strategies</a></h3> <p>OLAP servers are generally categorized according to how they store their data:</p> <ul> <li>A <font face="Verdana"><dfn>MOLAP (multidimensional OLAP)</dfn></font> server stores all of its data on disk in structures optimized for multidimensional access. Typically, data is stored in dense arrays, requiring only 4 or 8 bytes per cell value.</li> <li>A <font face="Verdana"><dfn>ROLAP (relational OLAP)</dfn></font> server stores its data in a relational database.
Each row in a fact table has a column for each dimension and measure.</li> </ul> <p>Three kinds of data need to be stored: fact table data (the transactional records), aggregates, and dimensions.</p> <p>MOLAP databases store fact data in multidimensional format, but if there are more than a few dimensions, this data will be sparse, and the multidimensional format does not perform well. A <font face="Verdana"><dfn>HOLAP (hybrid OLAP)</dfn></font> system solves this problem by leaving the most granular data in the relational database and storing aggregates in multidimensional format.</p> <p>Pre-computed aggregates are necessary for large data sets; otherwise certain queries could not be answered without reading the entire contents of the fact table. MOLAP aggregates are often an image of the in-memory data structure, broken up into pages and stored on disk. ROLAP aggregates are stored in tables. In some ROLAP systems these are explicitly managed by the OLAP server; in other systems, the tables are declared as materialized views, and they are implicitly used when the OLAP server issues a query with the right combination of columns in the <code>group by</code> clause.</p> <p>The final component of the aggregation strategy is the cache. The cache holds pre-computed aggregations in memory so subsequent queries can access cell values without going to disk. If the cache holds the required data set at a lower level of aggregation, it can compute the required data set by rolling up.</p> <p>The cache is arguably the most important part of the aggregation strategy because it is <em><font face="Verdana">adaptive</font></em>. It is difficult to choose a set of aggregations to pre-compute that speeds up the system without using huge amounts of disk, particularly when the data has high dimensionality or the users are submitting unpredictable queries. And in a system where data is changing in real time, it is impractical to maintain pre-computed aggregates.
A reasonably sized cache can allow a system to perform adequately in the face of unpredictable queries, with few or no pre-computed aggregates.</p> <p>Mondrian's aggregation strategy is as follows:</p> <ul> <li>Fact data is stored in the RDBMS. Why develop a storage manager when the RDBMS already has one?</li> <li>Aggregate data is read into the cache by submitting <code>group by</code> queries. Again, why develop an aggregator when the RDBMS has one?</li> <li><em><font face="Verdana">If</font></em> the RDBMS supports materialized views, <em><font face="Verdana">and </font></em>the database administrator chooses to create materialized views for particular aggregations, then Mondrian will use them implicitly. Ideally, Mondrian's aggregation manager should be aware that these materialized views exist and that those particular aggregations are cheap to compute. It should even offer tuning suggestions to the database administrator.</li> </ul> <p>The general idea is to delegate unto the database what is the database's. This places an additional burden on the database, but once those features are added to the database, all clients of the database will benefit from them. Multidimensional storage would reduce I/O and result in faster operation in some circumstances, but I don't think it warrants the complexity at this stage.</p> <p>A wonderful side-effect is that because Mondrian requires no storage of its own, it can be installed by adding a JAR file to the class path and be up and running immediately.
Because there are no redundant data sets to manage, the data-loading process is easier, and Mondrian is ideally suited to do OLAP on data sets which change in real time.</p> <p><i>Note to self</i>: The cache manager ought to distinguish between data which is being pulled into the cache to be rolled up immediately into some other aggregation, and an aggregation which is explicitly needed.</p> <h2>Components</h2> <h3>Query transformer</h3> <p>See {@link mondrian.olap.Parser}.</p> <h3>Metadata</h3> <p>Metadata is represented as an XML file. It is loaded into memory the first time you reference a dimensional model. You can modify the model at runtime by creating instances of classes such as <code>{@link mondrian.rolap.RolapHierarchy}</code>.</p> <h3>Calculation layer</h3> <p><i>todo</i>: See {@link mondrian.olap.Query} and {@link mondrian.olap.Result}.</p> <p><i>todo</i>: The <code>package {@link mondrian.rolap}</code> is the one and only implementation of the API. The DriverManager (<code>class {@link mondrian.olap.DriverManager}</code>) acts as a class factory.</p> <p><i>todo</i>: How members are calculated...</p> <p><i>todo</i>: How aggregations are batched...</p> <p><i>todo</i>: MDX functions. See <a href="#User_defined_functions">user-defined functions</a>.</p> <h3>Aggregation manager</h3> <p>Aggregations are based upon the relational model: as far as the aggregation manager is concerned, there is no relationship between the columns <code>city</code> and <code>state</code>. This means that all roll-ups are the same: you just drop a column.
Consider the three roll-ups possible by dropping a column from the aggregation {<code>gender</code>, <code>city</code>, <code>state</code>}: dropping <code>gender</code> is equivalent to removing the <code>[Gender]</code> dimension; dropping <code>city</code> is equivalent to rolling up to a higher level in the <code>[Geography]</code> hierarchy; and dropping <code>state</code> is not even allowed in the dimensional model (no, sorry, you can't ask about products sold in cities called 'Portland'). This approach will also allow us to implement 'drill anywhere'.</p> <p>An aggregation is defined by a search condition, for example, <code>{state in ('CA', 'OR', 'WA'), city = <i>any</i>, gender = 'M', measure = 'Unit sales'}</code>. The <i><code>any</code></i> value is important; if we had asked for a specific set of cities, we would not later be able to roll up by dropping the <code>city</code> column.</p> <p>The caching strategy is to throw out the aggregation with the lowest benefit/cost ratio. The 'benefit' of an item is the effort it took to produce (effort which it is saving future queries) multiplied by its 'usefulness', which declines exponentially over time if the item is not used. The 'cost' of an item is its size.</p> <h2>How do I use Mondrian in my application?</h2> <p>Something like this:</p> <ol> <li>Install the JAR.</li> <li>Create an XML mapping file.</li> <li>Create a Mondrian connection, specifying the JDBC URL of the RDBMS, and the URL of the mapping file.</li> <li>Execute an MDX statement.</li> <li>Render the result. (There are currently no presentation tools which can render it.)</li> <li>In response to user actions such as drill-down and pivot, use the query transformer services to transform the query, and re-execute.</li> </ol> <h2>Why doesn't Mondrian use a standard API?</h2> <p>Because there isn't one. MDX is a component of Microsoft's OLE DB for OLAP standard which, as the name implies, only runs on Windows.
Mondrian's API is fairly similar in flavor to ADO MD (ActiveX Data Objects for Multidimensional), an API which Microsoft built in order to make OLE DB for OLAP easier to use.</p> <p>XML for Analysis is pretty much OLE DB for OLAP expressed in Web Services rather than COM, and therefore seems to offer a platform-neutral standard for OLAP, but take-up seems to be limited to vendors who supported OLE DB for OLAP already.</p> <p>The other query vendors failed to reach consensus several years ago with the OLAP Council API, and are now encamped on the JOLAP specification.</p> <p>I plan to provide a JOLAP API to Mondrian as soon as JOLAP is available.</p> <h2>How does Mondrian's dialect of MDX differ from MSOLAP's?</h2> <p>Not very much.</p> <ol> <li>The <code>StrToSet()</code> and <code>StrToTuple()</code> functions take an extra parameter.</li> <li>Parsing is case-sensitive.</li> <li>Pseudo-functions <code>Param()</code> and <code>ParamRef()</code> allow you to create parameterized MDX statements.</li> </ol> <h2>How can Mondrian be extended?</h2> <p><i>todo</i>: <a name="User_defined_functions">User-defined functions</a></p> <p><i>todo</i>: Cell readers</p> <p><i>todo</i>: Member readers</p> <h2>Can Mondrian handle large datasets?</h2> <p>Yes, if your RDBMS can. We delegate the aggregation to the RDBMS, and if your RDBMS happens to have materialized <code>group by</code> views created, your query will fly. And the next time you run the same or a similar query, it will really fly, because the results will be in the aggregation cache.</p> <h2>Where is Mondrian going in the future?</h2> <ol> <li>Presentation layer</li> <li>Complete implementation of MDX (not many functions implemented yet)</li> <li>Tuning</li> <li>Support for the JOLAP API</li> </ol> <h2>Mondrian is fantastic! How can I possibly thank you?</h2> <p>Please send me an email, and let me know what you liked and didn't like about it.
If you can think of ways that Mondrian can be improved, roll up your sleeves and help make it better. If you use Mondrian in your application, consider sharing your work so that everyone can use it.</p> <b> <table border="1" width="100%" class="clsStd"> <tr> <td>End <i>$Id: //guest/paul_dymecki/mondrian/doc/overview.html#1 $</i></td> </tr> </table> <p> </p> </b> </html> </body>