<body> <html> <head> <meta http-equiv="Content-Language" content="en-us"> <meta name="GENERATOR" content="Microsoft FrontPage 5.0"> <meta name="ProgId" content="FrontPage.Editor.Document"> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> <title>Saffron overview</title> <style> A:link { color:#000066; } A:visited { color:#666666; } A.clsIncCpyRt, A.clsIncCpyRt:visited, P.clsIncCpyRt { font-weight:normal; font-size:75%; font-family:verdana,arial,helvetica,sans-serif; color:black; text-decoration:none; } A.clsLeftMenu, A.clsLeftMenu:visited { color:#000000; text-decoration:none; font-weight:bold; font-size:8pt; } A.clsBackTop, A.clsBackTop:visited { margin-top:10; margin-bottom:0; padding-bottom:0; font-size:75%; color:black; } A:hover, A.clsBackTop:hover, A.clsIncCpyRt:hover, A:active { color:blue; } A.clsGlossary { font-size:10pt; color:green; } BODY { font-size:80%; font-family:verdana,arial,helvetica,sans-serif; } BUTTON.clsShowme, BUTTON.clsShowme5 { font-weight:bold; font-size:11; font-family:arial; width:68; height:23; position:relative; top:2; background-color:#002F90; color:#FFFFFF; } DIV.clsBeta { font-weight:bold; color:red; } DIV.clsDocBody { margin-left:10px; margin-right:10px; } DIV.clsDocBody HR { margin-top:0; } DIV.clsDesFooter { margin:10px 10px 0px 223px; } DIV.clsFPfig { font-size:80%; } DIV.clsHi { padding-left:2em; text-indent:-2em } DIV.clsShowme { margin-bottom:.5em; margin-top:.5em; } H1{ font-size:145%; margin-top:1.25em; margin-bottom:0em; } H2 { font-size:135%; margin-top:1.25em; margin-bottom:.5em; } H3 { font-size:128%; margin-top:1em; margin-bottom:0em; } H4 { font-size:120%; margin-top:.8em; margin-bottom:0em; } H5 { font-size:110%; margin-top:.8em; margin-bottom:0em; } H6 { font-size:70%; margin-top:.6em; margin-bottom:0em; } HR.clsTransHR { position:relative; top:20; margin-bottom:15; } P.clsRef { font-weight:bold; margin-top:12pt; margin-bottom:0pt; } PRE { background:#EEEEEE; margin-top:1em; margin-bottom:1em; margin-left:0px; padding:5pt; } PRE.clsCode, CODE.clsText { font-family:'courier new',courier,serif; font-size:130%; } PRE.clsSyntax { font-family:verdana,arial,helvetica,sans-serif; font-size:120%; } SPAN.clsEntryText { line-height:12pt; font-size:8pt; } SPAN.clsHeading { color:#00319C; font-size:11pt; font-weight:bold; } SPAN.clsDefValue, TD.clsDefValue { font-weight:bold; font-family:'courier new' } SPAN.clsLiteral, TD.clsLiteral { font-family:'courier new'; } SPAN.clsRange, TD.clsRange { font-style:italic; } SPAN.clsShowme { width:100%; filter:dropshadow(color=#000000,OffX=2.5,OffY=2.5,Positive=1); position:relative; top:-8; } TABLE { font-size:100%; } TABLE.clsStd { background-color:#444; border:1px none; cellspacing:0; cellpadding:0 } TABLE.clsStd TH, BLOCKQUOTE TH { font-size:100%; text-align:left; vertical-align:top; background-color:#DDD; padding:2px; } TABLE.clsStd TD, BLOCKQUOTE TD { font-size:100%; vertical-align:top; background-color:#EEE; padding:2px; } TABLE.clsParamVls, TABLE.clsParamVls TD { padding-left:2pt; padding-right:2pt; } #TOC { visibility:hidden; } UL UL, OL UL { list-style-type:square; } .clsHide { display:none; } .clsShow { } .clsShowDiv { visibility:hidden; position:absolute; left:230px; top:140px; height:0px; width:170px; z-index:-1; } .#pBackTop { display:none; } #idTransDiv { position:relative; width:90%; top:20; filter:revealTrans(duration=1.0, transition=23); } /*** INDEX-SPECIFIC ***/ A.clsDisabled { text-decoration:none; color:black; cursor:text; } A.clsEnabled { cursor:auto; } SPAN.clsAccess { text-decoration:underline; } TABLE.clsIndex { font-size:100%; padding-left:2pt; padding-right:2pt; margin-top: 17pt; } TABLE.clsIndex TD { margin:3pt; background-color:#EEEEEE; } TR.clsEntry { vertical-align:top; } TABLE.clsIndex TD.clsLetters { background-color:#CCCCCC; text-align:center; } TD.clsMainHead { background-color:#FFFFFF; vertical-align:top; font-size:145%; font-weight:bold; margin-top:1.35em; margin-bottom:.5em; } UL.clsIndex { margin-left:20pt; margin-top:0pt; margin-bottom:5pt; } LI OL { padding-bottom: 1.5em } /*** GALLERY/TOOLS/SAMPLES ***/ FORM.clsSamples { margin-bottom:0; margin-top:0; } H1.clsSampH1 { font-size:145%; margin-top:.25em; margin-bottom:.25em; } H1.clsSampHead { margin-top:5px; margin-bottom:5px; font-size:24px; font-weight:bold; font-family:verdana,arial,helvetica,sans-serif; } H2.clsSampTitle { font-size:128%; margin-top:.2em; margin-bottom:0em; } TD.clsDemo { font-size:8pt; color:#00319C; text-decoration:underline; } .clsSampDnldMain { font-size:11px; font-family:verdana,arial,helvetica,sans-serif; } .clsShowDesc { cursor:hand; } A.clsTools { color:#0B3586; font-weight:bold; } H1.clsTools, H2.clsTools { color:#0B3586; margin-top:5px; } TD.clsToolsHome { font-size:9pt; line-height:15pt; } SPAN.clsToolsTitle { color:#00319C; font-size:11pt; font-weight:bold; text-decoration:none; } /*** DESIGN ***/ P.cat { font-size:13pt; color:#787800; text-decoration:none; margin-top:18px; } P.author { font-size:9pt; font-style:italic; line-height:13pt; margin-top:10px; } P.date { font-size:8pt; line-height:12px; margin-top:0px; color:#3366FF; } P.graph1 { line-height:13pt; margin-top:-10px; } P.col { line-height:13pt; margin-top:10px; margin-left:5px; } P.cal1 { text-decoration:none; margin-top:-10px; } P.cal2 {margin-top:-10px; } P.photo { font-size:8pt; } /*** DOCTOP ***/ #tblNavLinks A { color:black; text-decoration:none; font-family:verdana,arial,helvetica,sans-serif; } #lnkShowText, #lnkSyncText, #lnkSearchText, #lnkIndexText { font-size:8pt; font-weight:bold; } #lnkPrevText, #lnkNextText, #lnkUpText { font-size:7.5pt; font-weight:normal; } DIV.clsBucketBranch { margin-left:10px; margin-top:15px; margin-bottom:-10pt; font-style:italic; font-size:85%; } DIV.clsBucketBranch A, DIV.clsBucketBranch A:link, DIV.clsBucketBranch A:active, DIV.clsBucketBranch A:visited { text-decoration:none; color:black; } DIV.clsBucketBranch A:hover { color:blue; } /*** SDK, IE4 ONLY ***/ DIV.clsExpanded, A.clsExpanded { display:inline; color:black; } DIV.clsCollapsed, A.clsCollapsed { display:none; } SPAN.clsPropattr { font-weight:bold; } #pStyles, #pCode, #pSyntax, #pEvents, #pStyles {display:none; text-decoration:underline; cursor:hand; } /*** jhyde added ***/ CODE { color:maroon; font-family:'courier new' } DFN { font-weight:bold; font-style:italic; } </style> </head> <!-- This sentence is here to fool javadoc (which is looking for a period, and otherwise finds one inside one of our header tables). --> <table border="1" class="clsStd" width="100%"> <tr> <td colspan="2"><a href="index.html">Top</a> | <a href="http://public.perforce.com/guest/julian_hyde/saffron/doc/index.html">Web home</a> | <a href="http://sourceforge.net/projects/saffron/">SourceForge home</a> <div align="right"> <a href="http://sourceforge.net"> <img src="http://sourceforge.net/sflogo.php?group_id=46646&type=1" border="0" alt="SourceForge.net Logo"> </a> </div> </td> </tr> <tr> <td colspan="2"><em>$Id: //guest/julian_hyde/saffron/doc/overview.html#5 $</em></td> </tr> <tr> <td colspan="2"><em>(C) Copyright 2002, Julian Hyde</em></td> </tr> <tr> <th align="right">Author</th> <td>Julian Hyde (<a href="mailto:julian.hyde@mail.com">julian.hyde@mail.com</a>)</td> </tr> <tr> <th align="right">Created</th> <td>November 5<sup>th</sup>, 2002</td> </tr> </table> <h1>Saffron overview</h1> <p>Julian Hyde, November 5th, 2001.</p> <h2>Overview</h2> <p>Saffron is a seamless integration of SQL-like relational expressions into Java. As we shall see, the tight integration means that relational expressions are much more powerful than those offered by SQL.</p> <h2>How is Saffron different from SQL?</h2> <ol> <li><b>You can use Saffron expressions directly from Java code</b>. For example, an <code>exists </code>condition:<blockquote> <pre>if (exists (select from emp where emp.salary > 50000)) { ... }</pre> </blockquote> <p>An <code>in </code>condition:</p> <blockquote> <pre>if ("Fred" in (select emp.ename from emp where emp.salary > 50000)) { ... }</pre> </blockquote> <p>A <code>for </code>loop:</p> <blockquote> <pre>for (emp in (select from emp order by emp.salary)) { System.out.println(emp.ename + " earns " + emp.salary); }</pre> </blockquote> <p>And an array initialized using a relational expression:</p> <blockquote> <pre>Emp[] programmers = (select from emp where emp.title.equals("Programmer"));</pre> </blockquote> </li> <li><b>Saffron</b><b> uses Java expression syntax</b>. You can call java methods, construct objects and arrays, and access java objects just as you would in regular java code. For example,<blockquote> <pre>select emp.ename.substring(ename.indexOf(".")) from emp</pre> </blockquote> </li> <li><b>Saffron</b><b> relational expressions use Java's type system</b>. Runtime type conversions (of, say, a <code>DECIMAL(8,2)</code> column to a java <code>float </code>variable) are inefficient, error-prone, and can lose information. The names of tables and columns are checked at compile time, so your programs are more maintainable.</li> <li><b>Saffron</b><b> doesn't use 3-valued logic</b>. A condition in Saffron always evaluates to <code>true</code> or <code>false</code>, whereas in SQL a condition can sometimes evaluates to NULL.</li> <li><b>Saffron</b><b> expressions are evaluated efficiently</b>. If an expression involves only Java objects (no tables), the Saffron compiler will generate Java code which is almost as efficient as hand-written code.</li> <li><b>Saffron</b><b> accesses heterogeneous data better</b>. Consider the following query, which accesses data from an in-memory array (<code>manufacturers</code>), and tables from two different relational databases (<code>orderDb </code>and <code>productDb</code>).<blockquote> <pre>String[] manufacturers = new String[] {"Heinz", "Guinness"}; int[] customers = ( select order.customerId from productDb.products as product join orderDb.orders as order on order.productId == productId where product.manufacturer in manufacturers);</pre> </blockquote> Should the query be executed on the orders database or the products database? And how should the manufacturer list be sent over: as a hard-coded in-list, or by restarting the query for each manufacturer in the list? It really doesn't matter: the Saffron optimizer will decide for you. </li> </ol> <h2>How does Saffron improve Java?</h2> <p>We have already discussed how Saffron makes your life easier if you are embedding SQL statements in Java using JDBC. But Saffron can help even if you are not accessing data stored in a relational database. This is because in-memory data is often relational, and the algorithms to deal with it are often the familiar relational operations: sort, join, filter, union, intersect, minus.</p> <p>Suppose I have two lists of customers, and I want to create a list of both, with duplicates eliminated. An experienced Java programmer would probably write something like this:</p> <blockquote> <pre>Customer[] addCustomerLists(Customer[] list1, Customer[] list2) { Hashtable hashtable = new Hashtable(); for (int i = 0; i < list1.length; i++) { Customer c = list1[i]; hashtable.put(c, c); } for (int i = 0; i < list2.length; i++) { Customer c = list2[i]; hashtable.put(c, c); } Customer[] list3 = new Customer[hashtable.size()]; hashtable.asList().copyInto(list3); return list3; }</pre> </blockquote> <p>They would test it, notice that the customers are now in arbitrary order, and add</p> <blockquote> <pre> java.util.Arrays.sort(list3);</pre> </blockquote> <p>just before returning the result. In Saffron, a programmer would write</p> <blockquote> <pre>Customer[] list3 = select from (list1 union list2) c order by c;</pre> </blockquote> <p>They don't need to know Hashtable is the 'correct' data structure to use. They tell the system <i>what </i>they want, and the system figures <i>how </i>to compute it efficiently.</p> <p>This is a rather simple example. It gets better, because Saffron is a <i> closed </i>language. This means that you can use a Saffron expression anywhere </p> <h2>Grammar</h2> <p><i>Definition</i>. A <dfn>set expression</dfn> is an expression which returns more than one item; namely, a query, an array, an iterator or collection, or a <code>Table</code>.</p> <h3>Relational expressions</h3> <blockquote> <pre><code>select</code> [<code>{</code>[<i>expression</i> [<code>as </code><i>alias</i>]], ...<code>}</code> | <i>expression</i>] [<code>from </code><i>from-expression</i>] [<code>where </code><i>expression</i>] [<code>order by </code>[<i>expression </i>[<code>asc </code>| <code>desc</code>]], ...]</pre> <pre><i>expression </i><code>union </code><i>expression</i></pre> <pre><i>expression </i><code>intersect </code><i>expression</i></pre> <pre><i>expression </i><code>minus </code><i>expression</i></pre> </blockquote> <p>The syntax for the from list <i>from-expression</i> is as follows:</p> <blockquote> <pre><i>expression </i>[as alias]</pre> <pre><i>from-expression </i>[<code>left</code> | <code>right</code> | <code>full</code> | ] <code>join</code><i> </i> <i>from-expression </i>[<code>on</code> <i>condition</i>]</pre> </blockquote> <p>Notes:</p> <ol> <li>The select list can have 3 forms.<ol> <li>An <i><dfn>empty select list</dfn> </i>means return everything from the from clause. If the from clause has precisely one item, the result is the same type as that item (for example, each row from <code>select from emp</code> has type <code> Emp</code>); otherwise, the result is a record, with one field per from item (for example, each row from <code>select from emp join dept</code> has type <code>{Emp emp; Dept dept}</code>; each row from <code>select</code> has type <code>{}</code>).</li> <li>A <dfn>single-item select list</dfn> means return rows without boxing them into records. So, for example, each row from <code>select three from new int[] {1, 2, 3} as three</code> is an <code>int</code>. </li> <li>A <dfn>boxed select list</dfn> (enclosed in <code>{</code> ... <code>}</code>) means create a record for each row. The optimizer will create a synthetic class to describe the type of the row. Two rows with the same types and aliases, in the same order, will have the same synthetic class (and are therefore union-compatible). For example, each row from <code>select {three} from new int[] {1, 2, 3} as three</code> is <code>class Synthetic$124 { int three; }</code>; the type of <code>select {emp.empno, emp.ename as name} from emp</code> is <code>class Synthetic$527 { int empno; String name; }</code>.</li> </ol> </li> <li>Aliases of items in from and select lists default to variable, field, and method names. The aliases of the fields in <code>Emp emp; select emp, dept.deptno, dept.dname.substring(3) from dept</code> are <code>emp</code>, <code>deptno</code>, and <code>substring</code>, respectively. Other kinds of expression have no default alias, and therefore an alias must be specified. Whether implied or explicit, the aliases in a select or from list must be distinct.</li> <li>If the from clause is missing, the input to the select statement is a table with one row and no columns.</li> <li>Expressions in the from list must be set expressions. Any kind of expression is allowed in the select list; queries are returned as arrays.</li> <li>If <code>on</code><i> condition </i>is omitted from a <i>from-expression</i> it defaults to <code>true</code> (hence, a cartesian product). The keywords <code>left</code>, <code>right</code> and <code>full</code> indicate that an outer join is to be performed.</li> <li>No plans to do <i>aggregation</i>.</li> </ol> <h3>Boolean expressions</h3> <p>We add the following boolean expressions to Java:</p> <blockquote> <pre><i>expression</i> <code>in </code><i>expression</i></pre> <pre><code>exists </code>(<i>expression</i>)</pre> </blockquote> <p>The right-hand expressions must be set expressions.</p> <p>These expressions can, of course, occur inside relational expressions. The sub-query so created is said to be <dfn>correlated</dfn> if it refers to rows from the outer query. For example, the following query is correlated on <code>dept.deptno</code>:</p> <blockquote> <pre><code>select from </code>depts <code>as </code>dept <code>where exists </code>( <code>select from </code>emps <code>as </code>emp <code> where </code>emp.deptno == dept.deptno && emp.gender.equals("F"))</pre> </blockquote> <p>We do not provide a <code>not</code> operator to negate these. Use the regular Java <code>!</code> operator, for example, <code>!("apple" in fruit)</code> or <code>!exists (select from fruit where fruit.equals("apple")</code>.</p> <h3>Statements</h3> <blockquote> <pre><code>for </code>([<i>type</i>]<i> variable</i> <code>in </code><i>expression</i>) { <i>statement list </i>}</pre> </blockquote> <h3>Aggregation</h3> <p>Simple example:</p> <blockquote> <pre><code>select</code> emp.deptno, sum(emp.sal)<code> as</code> sumSal<code> group by</code> emp.deptno<code> from </code>emps <code>as </code>emp</pre> </blockquote> <p>Unlike SQL, the <code>group by</code> clause comes before the <code>from</code> clause, and is mandatory if you want to aggregate. The equivalent of the SQL <code>select sum(emp.sal) from emp</code> has an empty <code>group by</code> clause in Saffron: <code>select sum(emp.sal) group by from emp</code>.</p> <p>In Saffron, the <code>where</code> clause is applied <i>after</i> aggregation has been performed, more like a SQL <code>having</code> clause than a <code> where</code> clause. Expressions in the <code>where</code> clause can only reference aggregations and grouping expressions, not ungrouped expressions. If you want to filter the rows going into the aggregation, use a subquery in the from list:</p> <blockquote> <pre><code>select</code> femp.deptno, sum(femp.sal)<code> as</code> sumSal<code> group by</code> femp.deptno<code> from </code>( <code>select</code> <code>from</code> emps <code>as</code> emp <code>where</code> emp.gender.equals("F")) <code>as </code>femp <code>where </code>avg(femp.sal) > 50000</pre> </blockquote> <p>There are 3 kinds of aggregation function:</p> <ol> <li>Builtin aggregation functions (such as <code>count</code>) are treated specially by the system, and generate optimized code.</li> <li>A regular java function which takes a set (array, collection, or iterator) will be called if its argument matches the element type in the actual usage. For example, if <code>emp.sal</code> is of type <code>double</code>, then <code>double sum(double[] a)</code> would match. Likewise, <code>int count(Iterator iter)</code> would match <code>select count(emp) from emps as emp</code>, because Emp can be up-cast to Object, the element type of Iterator.</li> <li>User-defined aggregations.</li> </ol> <pA user-defined aggregation is a class which implements <code>interface {@link saffron.AggregationExtender}</code>. If you invoke the {@link saffron.AggregationExtender#aggregate} pseudo-method, the system generates the right code to calculate the aggregation.</p> <p>Now let's see how the Saffron system invokes a user-defined aggregation while evaluating a query. Suppose we have written <code>class MySum</code>, which can total both <code> int</code> and <code>double</code> values, and evaluate the query</p> <blockquote> <pre><code>select</code> MySum(i) <code>from new int</code>[] {2, 3, 4} <code>as</code> i</pre> </blockquote> <p>The Saffron system makes following calls. Note how the <code>resolve</code> method deals with operator overloading. Note how <code>int</code>s are wrapped as <code>Integer</code>s then in arrays to be passed to <code>addOne</code>, and returned from <code>result</code>.</p> <blockquote> <pre>AggregatorClass sum = new MySum(); AggregationExtender intSum = sum.resolve(new Class[] {Integer.TYPE}); Object total = intSum.start(null); intSum.addOne(total, new Object[] {new Integer(2)}); intSum.addOne(total, new Object[] {new Integer(4)}); Object total2 = intSum.start(null); intSum.addOne(total2, new Object[] {new Integer(3)}); intSum.merge(total, total2); System.out.println("Result is " + ((Integer) intSum.result(total)).intValue());</pre> </blockquote> <p>Examples of more complex aggregations:</p> <ul> <li>The <code>median</code> aggregation ({@link saffron.ext.Median}) requires a variable-size workspace. A typical implementation would use an {@link java.util.ArrayList} to collect the values, sort them, then return the middle one.</li> <li>The <code>nth</code> aggregation {@link saffron.ext.Nth}) returns the <i>n</i>th item of a set. <i>n</i> is passed to the <code>start</code> method, so must be independent of any particular row. Example:<blockquote> <pre><code>select</code> new Nth(3).aggregate(name) <code>from</code> (<code>select from</code> emps <code>order by</code> salary)</pre> </blockquote> </li> <li>The <code>closest</code> aggregation computes the nearest of a set of points to a given point. Its <code>start</code> method receives the latitude and longitude of the initial point (as <code>new Object[] {new Double(42), new Double(129)}</code>), and the <code>addOne</code> method receives the latitude and longitude of each subsequent point.<blockquote> <pre><code>select</code> closest(42, -129, office.latitude, office.longitude) <code>from</code> offices <code>as</code> office</pre> </blockquote> </li> </ul> <p>Notes & issues:</p> <ol> <li>Arguments are zero or more primitive types or objects. Primitives are passed wrapped as objects.</li> <li>A robust aggregation of median (not possible without invoking custom code-generation) would internally expand it to<blockquote> <pre><code>select </code>nth(count() / 2, sorted) <code>from</code> (<code>select from new int</code>[] {2, 3, 4} <code>as</code> i <code>order by</code> i) <code>as </code>sorted</pre> </blockquote> </li> <li>Distinct aggregations. Syntax? Implementation?</li> <li>Could we use aggregations as rolling averages? (I guess the <code>{@link saffron.AggregationExtender}.result()</code> method would have to be non-destructive.)</li> </ol> <h2>Type-checking</h2> <p>One of Java's most important features is its strong type system. To take advantage of this, every relational expression is always has a result which is an array type.</p> <p>This doesn't mean that the array is actually created. If you use the <code>for (... in ...)</code> construct, for example, the system will implement the loop using an iterator, and is unlikely to construct the result as an array. And the optimizer tries hard not to convert intermediate sets, such as that created by an <code>in </code>operator, into arrays.</p> <p>Regular java expressions can also be used as relational expressions:</p> <ul> <li>arrays</li> <li>collections (<code>Collection</code>, including <code>Vector</code>)</li> <li>iterators (<code>Enumeration</code>, <code>Iterator</code>)</li> </ul> <p>The problem with collections and iterators (until generics arrive in Java 1.4) is that we don't know what type of objects they contain. We allow the programmer to specify that type information using by a dummy cast to an array type. For example:</p> <blockquote> <pre>Vector fruit = new Vector(); fruit.addElement("apple"); fruit.addElement("banana"); fruit.addElement("orange"); fruit.addElement("mango"); String[] tropicalFruit = (select fruit from (String[]) fruit where fruit.indexOf("n") >= 0);</pre> </blockquote> <p>You can cast a collection or iterator to an array of primitive types, provided that the collection does not contain null values.</p> <h3>Casting input datasets</h3> <p>Some data sources, such as vectors, iterators and external data sources such as JDBC tables, contain very little type information, and this makes it difficult to write concise queries. For example,</p> <blockquote> <pre>Vector emps = new Vector(); emps.addElement(fred); emps.addElement(barney); for (emp in select from emps as emp where ((Emp) emp).deptno == 20) { print(((Emp) emp).name); }</pre> </blockquote> <p>Better to tell the system just once that <code>emps </code>is a collection of <code>Emp </code>objects:</p> <blockquote> <pre>for (emp in select from (Emp[]) emps as emp where emp.deptno == 20) { print(emp.name); }</pre> </blockquote> <p>The cast does not necessarily cause the vector to be converted to an array: we are simply providing the Saffron compiler with more type information, and allowing it to choose the most efficient plan.</p> <h3>Determining the output type</h3> <p>By default, a relational expression returns an array. You can produce results in a different format by using a dummy cast, like this:</p> <blockquote> <pre>Iterator emps; Iterator femaleEmps = (Iterator) (select from (Emp[]) emps as emp where emp.gender.equals("F"));</pre> </blockquote> <p>This is particularly useful where the output is large (even infinite!) and you to iterate over the first few rows without computing the whole result.</p> <p>The available types are {@link java.util.Iterator}, {@link java.util.Enumeration}, {@link java.util.Collection}, {@link java.util.Vector}, and <code>type[]</code>. In future, we may add {@link java.sql.ResultSet}.</p> <h3><a name="Tables_and_Schemas">Tables and Schemas</a></h3> <p>The interfaces Table, Schema and Connection allow you to access data sources which have arbitrary access methods. A table (<code>interface {@link saffron.Table}</code>) represents a set of relational data, and it provides methods to describe the data source, discover its access methods, and so forth. Schemas (<code>interface {@link saffron.Schema}</code>) make it easy to expose a set of tables and the information necessary to compile programs which use them. If you can to access data in a JDBC database, derive a class from <code>class {@link saffron.ext.ReflectSchema}</code> and create a member of type {@link saffron.ext.JdbcTable} for each table your wish to access (<code>class {@link sales.Sales}</code> is an example of this.)</p> <p>Let's look at what happens when you compile a program which uses a schema.</p> <ol> <li>The compiler notices a piece of code (<code>sales.emps</code>) which accesses a member or method of an object (<code>sales</code>) which implements <code> interface {@link saffron.Connection}</code>.</li> <li>The compiler calls the static method <code>getSchema</code> on that object. (Why call a static method, not a method in the interface? Well, remember that this is happening at compile time. The real object does not exist yet.)</li> <li>The compiler invokes the schema's <code>getMemberTable</code> or <code> getMethodTable</code> method, to retrieve a <code>Table</code>. If there is no such table, these methods may throw <code>openjava.mop.NoSuchMemberException</code>, which will be reported as a compilation error.</li> <li>The compiler invokes the table's <code>toRel</code> method, passing the piece of parse tree which represents the expression which will yields the schema. (The <i>real </i>schema object only gets created at run time, remember.)</li> <li>The compiler calls schema's getTable method with the parse object be for example, class Sales.</li> </ol> <p>In particular, we provide <code>abstract class {@link saffron.ext.JdbcTable} implements {@link saffron.Table}</code> to make it easy to access tables via JDBC. We recommend that you group tables into a schema class, which extends <code>class {@link saffron.ext.ReflectSchema}</code>. For example,</p><blockquote> <pre>class SalesConnection extends saffron.ext.JdbcConnection implements saffron.Connection { public SalesConnection(String connect) { super(connect); } // required for saffron.Connection public static saffron.Schema getSchema() { return new SalesSchema(); } }; class SalesSchema extends saffron.ext.ReflectSchema implements saffron.Schema { public static class Emp { public int empno; public String name; public int deptno; }; public static class Dept { public int deptno; public String name; }; public saffron.Table emp = new saffron.ext.JdbcTable( this, "EMP", Emp.class); public saffron.Table depts = new saffron.ext.JdbcTable( this, "DEPT", Dept.class); }; SalesConnection sales = new SalesConnection("jdbc:odbc:MyDatabase"); for (emp in sales.emps) { <i>some code</i> }</pre> </blockquote> <p>The Saffron compiler creates an instance of the schema, and interrogates it regarding the objects it contains. When it sees the expression <code>sales.emps</code>, it invokes <code>SalesConnection.getSchema().getTableForMember("emps")</code> to get a {@link saffron.Table} which describes <code>emps</code> and, in particular, that it returns rows of type <code>Emp</code>. The translator then calls the table's <code>toRel()</code> method to create a relational expression node; in this case a {@link saffron.rel.TableAccess}. (The translator actually converts <code> sales.emp</code> into a brief intermediate expression <code>(Emp[]) sales.contentsAsArray("emps")</code>. The <code>contentsAsArray</code> function is a placeholder which has special meaning to the translator. Note how an array cast is used to provide type information.)</p> <p>As well as type information, the schema could in future provide information about volumetrics, indexes, and even custom rules for optimizing the query.</p> <p><i>todo</i>: Expose <code>Map</code>s (including <code>Hashtable</code>s) as relations with 2 columns.</p> <h3>Non-relational schemas</h3> <p>A schema could be used to present non-relational data in a relational format. It would know that 'getDept(40)' was the same as 'select from dept where deptno = 40' and 'dept.getEmployees()' is the same as 'select from emp where emp.deptno == dept.deptno'. We would need a mechanism to construct an <dfn> extent</dfn> (the set of all instances of a given entity). The schema could declare an explicit method to find the instances of an entity; another way is to use mandatory relationships (for example, since every employee must work in a department, given all departments, we can find all employees by traversing the 'works in' relationship). It helps if the relationship is many- (or one-) to-one, since we don't need to eliminate duplicates. There could be more than one access path to an entity.</p> <p>How to represent such mappings? Views: dept.getEmployees() is equivalent to select from emp where emp.deptno == dept.deptno. Better, parameterized views. Example {@link sales.Reflect} presents Java's type system ({@link java.lang.Class}, {@java.lang.reflect.Field}, <i>et cetera</i>) as relations.</p> <p><i>todo: </i>Could allow a type (e.g. <code>boolean</code>, <code>int</code>) to serve as a relation. Could implement by iterating over all possible values of that type (rarely practical) or joining to an attribute of a more finite relation. Example: <code>select from int as i where i in (select emp.deptno from emp)</code>.</p> <h2>Null semantics</h2> <p>Saffron uses Java, not SQL, null semantics. <code>null</code> is always the Java null value. Every boolean expression evaluates to either <code>true </code>or <code>false</code>. And yes, <code>null == null</code> evaluates to <code>true</code>.</p> <h2>Comparison semantics</h2> <p>In Saffron, as in Java, the <code>==</code> operator compares addresses, not content. The expression <code>"abc" == "a" + "bc"</code> may, on some Java systems, evaluate to false, so you should write <code>"abc".equals("a" + "bc")</code> instead.</p> <p>The <code>in </code>operator expands to expressions which use the <code> equals()</code> and <code>hashCode()</code> methods, so you should only uses it on classes for which these methods are well-behaved. And the expression</p> <blockquote> <pre>null in new String[] {"apple", "orange"}</pre> </blockquote> <p>will give a <code>NullPointerException</code>.</p> <p>The <code>order by</code> operator requires that objects implement the <code> interface java.lang.Comparable</code>. (<i>todo: Compile- or run-time error if not?</i>)</p> <p>Primitive types are compared using the primitive operators <code>==</code>, <code><</code> etc.</p> <p>Two instances of a synthetic classes are equal if and only if each of their components are equal. Synthetic classes are compared lexicographically. Example:</p> <blockquote> <pre>select from (select from twos join threes) as i join (select from twos join threes) as j where i < j</pre> </blockquote> <h2>Primitive values</h2> <p>Saffron accepts primitive values, such as <code>int</code>, <code>char</code>, <code>double</code>, in most places that it accepts objects. A couple of exceptions:</p> <ol> <li>Outer joins need to produce null values. For example,<pre>select {twos, threes} from new int[] {2, 4, 6, 8} as twos left join new int[] {3, 6, 9} as threes on twos == threes</pre> would produce<pre>2, 0 4, 0 6, 6 8, 0 </pre> </li> <li>Saffron uses Java collections to implement certain algorithms. For example, in the following expression, it uses a Hashtable to detect values which have already been emitted. <pre>select from new int[] {2, 4, 6, 8} as twos minus select from new int[] {3, 6, 9} as threes</pre> It automatically boxes and unboxes primitive values.</li> </ol> <h2>Planning and evaluation</h2> <p>The Saffron system runs in static and dynamic mode.</p> <p>In static mode, relational expressions are embedded into Java programs. A pre-processor checks that the relational expressions are valid, and converts them into regular java. There are many ways to evaluate a relational expression, </p> <p>A query optimizer chooses an evaluation plan.</p> <h2>Evaluation model</h2> <h2>Implementation issues</h2> <h3>Composite rows</h3> <p>If a relational expression has more than one expression in its <code>select </code>clause, it produces rows which are not atomic. For example,</p> <blockquote> <pre>int[] twos = new int[] {2,4,6,8}; int[] threes = new int[] {3,6,9}; for (i in (select twos, threes twos cross join threes)) { ... }</pre> </blockquote> <p>creates rows which are pairs of <code>int</code>s. It is translated to</p> <blockquote> <pre>int[] twos = new int[] {2,4,6,8}; int[] threes = new int[] {3,6,9}; for (int x = 0; x < twos.length; x++) { for (int y = 0; y < threes.length; y++) { Synth1 i = new Synth1(twos[x], threes[y]); ... } }</pre> </blockquote> <p>This illustrates two issues with composite rows. First, these rows need a type. The Saffron compiler solves this issue by generating a synthetic class (in this case, <code>class Synth1</code>). We don't know what this class will be called, so we use <code>for (<i>variable</i> in ...)</code> syntax, which lets the system infer the type for us.</p> <p>Second, the system creates a new object for every row, which might be expensive. In this case, they are emitted from the query, so we can't avoid creating them. If the composite rows are not emitted, the Saffron optimizer will do its best to avoid creating them.</p> <h2>How the optimizer works</h2> <p>The job of the optimizer is to convert a parse tree containing Saffron expressions into a regular parse tree. Its input and output, therefore, are both parse trees.</p> <p>In between times, it searches for Saffron expressions, and converts them to tree of intermediate relational expressions (<code>class {@link saffron.rel.Rel}</code>).</p> <p>These relational expressions still contain fragments of regular Java expressions for select expressions, method calls, and so forth. The Java expressions are translated into internal form. Except for references to variables outside the Saffron expression, variables tend to disappear, to be replaced by mysterious-looking expressions. For example, in the expression</p> <blockquote> <pre>select from location join (select from emp join dept on emp.deptno == dept.deptno) on location.state.equals(dept.state) </pre> </blockquote> <p>the join condition <code>location.state.equals(dept.state)</code> becomes <code>$input0.state.equals($input1.$f1.state)</code>. Here, <code>$input0</code> and <code>$input1</code> represent the 2 inputs to the join node (which is the context for this Java fragment); <code>$f1</code> is the first field of the composite row produced by the inner query. That field's type is Dept, and therefore it has a <code>state</code> field.</p> <p>The optimizer uses a dynamic programming approach based upon that used by the Volcano optimizer. Starting with the initial expression, it applies transformation rules to convert it, and its sub-expressions, into semantically equivalent expressions. It estimates the cost of each expression, and cultivates variants of expressions until it finds cheap implementations of them.</p> <p>An expression can be implemented using several <dfn>calling conventions</dfn>. Calling conventions include array, vector, iterator. The most important (and efficient) calling convention is inline. Inline means that a piece of code will be executed for every row in the expression; no collection of objects is created, just <i>stuff happens</i>. An inline implementation of</p> <blockquote> <pre>for (o in (select from emp join dept on emp.deptno == dept.deptno)) { System.out.println(o.emp.ename + " works in " + o.dept.dname); }</pre> </blockquote> <p>would be</p> <blockquote> <pre>for (int i = 0; i < emp.length; i++) { for (int j = 0; j < dept.length; j++) { if (emp[i].deptno == dept[j].deptno) { Synth1 o = new Synth1(emp[i], dept[j]); System.out.println(o.emp.ename + " works in " + o.dept.dname); } } }</pre> </blockquote> <p>This is rather tricky to implement, because the stuff in the middle of the for loop corresponds to stuff at the top of Rel tree. Each Rel node's implement method has to handle a callback from a child saying 'I have placed a row in variable <i>xyz</i>; please generate code to do something with it.'</p> <p>If two expressions have the same meaning but different calling conventions, they are not immediately interchangeable. It may be cheaper to evaluate a particular expression as an array, but another algorithm may find it more convenient if that expression is a vector. For this reason, there are special kinds of rule called <dfn>conversion rules</dfn>. They can convert from one convention to another, but they have a non-zero cost. The optimizer may decide that it is cheaper to build a vector in the first place, rather than paying the cost of converting it to a vector later.</p> <p>There is a special kind of node called a <dfn>Terminator</dfn> (<code>class {@link saffron.rel.java.Terminator}</code>). After a tree has been optimized, a Terminator is placed on the root to drive code-generation. It behaves somewhat like a Rel node (in that it has inputs, and can be called-back to generate code), but it does not actually produce any result. (<i>todo</i>: Maybe Terminator should be a calling convention?)</p> <h1>Appendix 1: Design notes</h1> <h2>Syntax of <code>group by</code></h2> <p>There are several problems with a SQL-like aggregation:</p> <ol> <li>If a <code>select</code> statement contains a <code>group by</code> clause, the validation rules for expressions within that statement are altered: only expressions which are in the <code>group by</code> can be used outside of an aggregate function.</li> <li>Even if there is no <code>group by</code> clause present, aggregation implicitly occurs if an aggregate function occurs anywhere in the <code>select</code> statement. For example, every expression in<blockquote> <pre>select emp.gender.toLowerCase(), emp.name from emp where emp.name.startsWith("P")</pre> </blockquote> <p>becomes illegal if an aggregation is added:</p> <blockquote> <pre>select emp.gender.toLowerCase(), emp.name from emp where emp.name.startsWith("P") && min(emp.salary) > 50000</pre> </blockquote> <p>This syntax problem could be fixed fairly easily, by requiring an empty <code> group by</code> clause in such cases.</p> </li> <li>If we allow functions on sets to be used as aggregation functions, then certain statements are ambiguous. Consider the following:<blockquote> <pre>int count(Object[] a) { int n = 0; for (int i = 0; i < a.length; i++) { if (a[i] != null) { n++; } } return n; } class Emp { String name; Person[] dependents; }; select count(emp.dependents) from emp;</pre> </blockquote> <p>It is not clear whether Saffron should return a row for each employee with the number of the dependents that employee has, or a single number, the number of employees whose <code>dependents</code> field is not null.</li> </ol> <p>The underlying problem is that SQL aggregation makes a scalar expression behave like a set, and this undermines the usual rules of expression validation and type-checking. We can make the set explicit, but the resulting expression</p> <blockquote> <pre>select deptno, sum(select emp.sal from emps as emp where emp.deptno == deptno) as sumSal, min(select emp.sal from emps as emp where emp.deptno == deptno) as minSal from (select distinct deptno from emps) as deptno</pre> </blockquote> <p>is more verbose -- the underlying set, <code>emps</code>, is referenced three times -- and perhaps less efficient.</p> <p>We need to make grouping an aggregation explicit.</p> <p>QUEL aggregations are more like scalar sub-queries. They are more verbose. They are also more powerful: note how we can sum only female employees' salaries. In this syntax, the set of values to be aggregated over (in this case, <code>select ... from emp</code>) is formed for each aggregate expression; the optimizer recognizes common expressions and evaluates them only once. Example:</p> <blockquote> <pre><code>select distinct </code>emp.deptno, min(<code>select</code> emp.sal <code>from </code>emps <code>as </code>emp2 <code>where </code>emp2.deptno == emp.deptno) <code>as </code>minSal sum(<code>select</code> emp.sal <code>from </code>emps <code>as </code>emp2 <code>where </code>emp2.deptno == emp.deptno && gender.equals("F")) <code>as </code>sumFemaleSal <code>from </code>emps <code>as </code>emp</pre> </blockquote> <h2>Aggregation types</h2> <p>Jim Gray et al. [<a href="#GBLP96">GBLP96</a>] classify aggregation operators into 3 categories, in increasing order of complexity:</p> <ul> <li><dfn>Distributive</dfn> aggregations requires fixed storage, and two totals can be merged. Example: <code>sum</code>, <code>count</code>, <code>min</code>, <code>max</code>. To merge two partial sums, take <code>sum(S1 u S2) = sum({sum(S1), sum(S2)})</code>. (Here, <code>S1</code> and <code>S2</code> are the sets of values which formed the partial sums, and <code>S1 u S2</code> is their union, the set of all values.) <code>min</code> and <code>max</code> work the same. <code>count</code> is different, in that it cannot be used to merge two sums; instead, you use <code>+</code>: <code>count(S1 u S2) = count(S1) + count(S2)</code>.</li> <li><dfn>Algebraic</dfn> aggregations also require fixed storage, but two totals cannot be merged. Example: <code>avg</code>, <code>stddev</code>.</li> <li><dfn>Holistic</dfn> aggregations require variable storage. Example: <code>median</code>, <code>countDistinct</code>.</li> </ul> <p>Some algebraic aggregates can be decomposed into distributive aggregates, as follows: avg(S) = sum(S) / count(S). We could provide better support for these; maybe a new method <code>Expression {@link saffron.rel.Aggregation}.preExpand()</code>, which is called before the parse tree is converted into a {@link saffron.rel.Aggregate} relational operator.</p> <p>It is hard to implement holistic aggregations efficiently. For example, <code> median</code> can be translated into an efficient relational expressions by hand:</p> <blockquote> <pre>select {nth(countEmp / 2, salary) as medianSalary} group by {emp.deptno} from (select from emp order by emp.deptno) empSorted join (select {emp.deptno, count() as empCount} group by {emp.deptno} from emp) empCounted on empSorted.deptno == empCounted.deptno</pre> </blockquote> <p>Similarly distinct <code>count</code>:</p> <blockquote> <pre>select { empGrouped.deptno, count(empGrouped.gender) as countDistinctGender} group by {empGrouped.deptno} from ( select distinct {emp.deptno, emp.gender} from emps as emp) as empGrouped</pre> </blockquote> <p>We should supply an extender which can transform either the parse tree or the relational operator tree in such a way.</p> <h2>Start parameters for aggregations</h2> <p>Aggregations can take parameters per query, per group, and per row. Examples:</p> <ul> <li>Conventional aggregations take parameters only per row.</li> <li>The nth(count, value), count is per-group, and value is per-row.</li> <li>In min(collator, value), collator is per-query, and value is per-row.</li> </ul> <p>One way to provide per-group parameters is for aggregations to have two sets of parameters: the first passed at start time, the other passed per row.</p> <p>But custom aggregations can also do the job. They are a little less efficient, and the syntax is a little more verbose, but what's going on is clearer. </p> <p>Custom aggregation to do locale-specific min:</p> <blockquote> <pre>select {emp.dept.deptno, new LocaleMin(emp.dept.locale).apply(name) minName} group by {emp.dept} from emps as emp</pre> </blockquote> <h2>Problems with iterators</h2> <p><i>The ideas in this section are not implemented. So, if your iterator is infinite, the optimizer may generate a plan which never finishes. And it may generate a plan which requires iterators to be restarted.</i></p> <p>Iterators (and enumerations) present peculiarities that finite data sources such as JDBC tables, arrays, and collections don't present.</p> <p><b>Restarting an iterator</b></p> <p>First, they need to be restarted. Suppose the optimizer implemented</p> <blockquote> <pre>Iterator nouns, verbs; for (i in (String[]) verbs intersect (String[]) nouns) { print(verb + " is both a noun and a verb"); }</pre> </blockquote> <p>as a nested loops join,</p> <blockquote> <pre>while (nouns.hasNext()) { String noun = (String) nouns.next(); while (verbs.hasNext()) { String verb = (String) verbs.next(); if (verb.equals(noun)) { print(verb " is both a noun and a verb"); } } }</pre> </blockquote> <p>This doesn't work. The problem is, the second time around the outer loop, <code>verbs.hasNext()</code> has already returned <code>false</code>.</p> <p>If possible, you should use a class which implements {@link java.util.Collection} or {@link saffron.runtime.Iterable}; the optimizer calls the <code>iterate()</code> method each time it needs to start an iteration.</p> <p>If you supply an {@link java.util.Iterator}, the optimizer wraps it in a {@link saffron.runtime.BufferedIterator}.</p> <p><i>Enhancement</i>: If the plan never requires that the iterator is restarted, the optimizer should leave the iterator unwrapped.</p> <p><b>Infinite iterators</b></p> <p>The second problem with iterators is that they can be infinite. This is a bad idea if you're copying the results to an array, but you can achieve spectacular results if you're prepared to go with the whole <i>streaming thing</i>. Here's an example:</p> <blockquote> <pre>class MultiplesOf implements Iterator { int i, n; MultiplesOf(int n) { this.n = n; this.i = 0; } public boolean hasNext() { return true; } public Object next() { return i += n; } public void remove() { throw new UnsupportedOperationException(); } } Iterator twos = new MultiplesOf(2); Iterator tens = (Iterator) (select from twos where twos % 5 == 0);</pre> </blockquote> <p>We have created one infinite iterator from another!</p> <p>The problem is that the optimizer might be tempted to do stupid things, like storing the results in an intermediate array. We tell it not to by implementing an empty interface, {@link saffron.runtime.Infinite}.</p> <p><i>We have not implemented support for infinite iterators. Expressions involving infinite iterators might just go on for ever.</i></p> <h1>Appendix 2: Futures</h1> <p>List:</p> <ol> <li>Aggregation<ol> <li>Allow built-in aggregations in regular java (that is, outside <code> select</code> statements with a <code>group by</code> clause). Example #1: <code>select {deptno, count(select from emps as e2 where e2.deptno == e1.deptno) as countEmp} from emps as e1</code>. Example #2: <code>if (count(select from emps as e2 where e2.deptno == e1.deptno</code>.</li> </ol> </li> <li>Schema on top of persistence layer -- JDO say. Or JAXB. Talk to Castor folks about it.</li> <li>Refactoring.<ol> <li>search for obsolete & deprecated</li> </ol> </li> <li>Demos<ol> <li>Swing application where you can enter expressions or statements, and execute them. Can create a set of named connections.</li> </ol> </li> <li>Environment<ol> <li>log</li> <li>hsqldb</li> </ol> </li> <li>OpenJava<ol> <li>Obsolete MSJavaCompiler maybe</li> <li>Drive compiler from system properties</li> <li>Drive parser from system properties (or only have one parser, and turn off saffron features programmatically).</li> </ol> </li> </ol> <h1>Appendix 3: References</h1> <dl> <dt><a name="GBLP96">GBLP96</a></dt> <dd>J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In <em> IEEE 12th International Conference on Data Engineering</em>, pages 152—159, 1996 (<a href="http://research.microsoft.com/~gray/DataCube.doc">doc</a>).</dd> </dl> <table border="1" width="100%" class="clsStd"> <tr> <td>End <i>$Id: //guest/julian_hyde/saffron/doc/overview.html#5 $</i></td> </tr> </table> </html> </body>
# | Change | User | Description | Committed | |
---|---|---|---|---|---|
#5 | 1748 | Julian Hyde |
saffron: convert unit tests to JUnit; add CallingConvention.ITERABLE; lots of other stuff; release 0.1. |
||
#4 | 1485 | Julian Hyde | Saffron/Mondrian: John Sichi's feedback. | ||
#3 | 1474 | Julian Hyde |
saffron: Aggregations are working. Renamed 'aggregator' to 'aggregation'. |
||
#2 | 1467 | Julian Hyde |
saffron: First saffron check-in; incorporate my changes to openjava. |
||
#1 | 1461 | Julian Hyde |
saffron: First check in. Just documents, and the unmodified OpenJava 20001010 source files. |