Saffron overview

Top \| Web home \| SourceForge home
$Id: //guest/julian_hyde/saffron/doc/overview.html#5 $
(C) Copyright 2002, Julian Hyde
Author	Julian Hyde (julian.hyde@mail.com)
Created	November 5^th, 2002

Saffron overview

Julian Hyde, November 5th, 2001.

Overview

Saffron is a seamless integration of SQL-like relational expressions into Java. As we shall see, the tight integration means that relational expressions are much more powerful than those offered by SQL.

How is Saffron different from SQL?

You can use Saffron expressions directly from Java code. For example, an exists condition:

if (exists (select from emp where emp.salary > 50000)) {
  ...
}

An in condition:

if ("Fred" in (select emp.ename from emp where emp.salary > 50000)) {
   ...
}

A for loop:

for (emp in (select from emp order by emp.salary)) {
   System.out.println(emp.ename + " earns " + emp.salary);
}

And an array initialized using a relational expression:

Emp[] programmers = (select from emp
        where emp.title.equals("Programmer"));

Saffron uses Java expression syntax. You can call java methods, construct objects and arrays, and access java objects just as you would in regular java code. For example,
```
select emp.ename.substring(ename.indexOf(".")) from emp
```
Saffron relational expressions use Java's type system. Runtime type conversions (of, say, a DECIMAL(8,2) column to a java floatvariable) are inefficient, error-prone, and can lose information. The names of tables and columns are checked at compile time, so your programs are more maintainable.
Saffron doesn't use 3-valued logic. A condition in Saffron always evaluates to true or false, whereas in SQL a condition can sometimes evaluates to NULL.
Saffron expressions are evaluated efficiently. If an expression involves only Java objects (no tables), the Saffron compiler will generate Java code which is almost as efficient as hand-written code.
Saffron accesses heterogeneous data better. Consider the following query, which accesses data from an in-memory array (manufacturers), and tables from two different relational databases (orderDb and productDb).
```
String[] manufacturers = new String[] {"Heinz", "Guinness"};
int[] customers = (
  select order.customerId
  from productDb.products as product
  join orderDb.orders as order on order.productId == productId
  where product.manufacturer in manufacturers);
```
Should the query be executed on the orders database or the products database? And how should the manufacturer list be sent over: as a hard-coded in-list, or by restarting the query for each manufacturer in the list? It really doesn't matter: the Saffron optimizer will decide for you.

How does Saffron improve Java?

We have already discussed how Saffron makes your life easier if you are embedding SQL statements in Java using JDBC. But Saffron can help even if you are not accessing data stored in a relational database. This is because in-memory data is often relational, and the algorithms to deal with it are often the familiar relational operations: sort, join, filter, union, intersect, minus.

Suppose I have two lists of customers, and I want to create a list of both, with duplicates eliminated. An experienced Java programmer would probably write something like this:

Customer[] addCustomerLists(Customer[] list1, Customer[] list2) {
  Hashtable hashtable = new Hashtable();
  for (int i = 0; i < list1.length; i++) {
    Customer c = list1[i];
    hashtable.put(c, c);
  }
  for (int i = 0; i < list2.length; i++) {
    Customer c = list2[i];
    hashtable.put(c, c);
  }
  Customer[] list3 = new Customer[hashtable.size()];
  hashtable.asList().copyInto(list3);
  return list3;
}

They would test it, notice that the customers are now in arbitrary order, and add

  java.util.Arrays.sort(list3);

just before returning the result. In Saffron, a programmer would write

Customer[] list3 = select from (list1 union list2) c order by c;

They don't need to know Hashtable is the 'correct' data structure to use. They tell the system what they want, and the system figures how to compute it efficiently.

This is a rather simple example. It gets better, because Saffron is a closed language. This means that you can use a Saffron expression anywhere

Grammar

Definition. A set expression is an expression which returns more than one item; namely, a query, an array, an iterator or collection, or a Table.

Relational expressions

select [{[expression [as alias]], ...} | expression]
  [from from-expression]
  [where expression]
  [order by [expression [asc | desc]], ...]

expression union expression

expression intersect expression

expression minus expression

The syntax for the from list from-expression is as follows:

expression [as alias]

from-expression [left | right | full | ] join
   from-expression [on condition]

Notes:

The select list can have 3 forms.
1. An empty select list means return everything from the from clause. If the from clause has precisely one item, the result is the same type as that item (for example, each row from select from emp has type Emp); otherwise, the result is a record, with one field per from item (for example, each row from select from emp join dept has type {Emp emp; Dept dept}; each row from select has type {}).
2. A single-item select list means return rows without boxing them into records. So, for example, each row from select three from new int[] {1, 2, 3} as three is an int.
3. A boxed select list (enclosed in { ... }) means create a record for each row. The optimizer will create a synthetic class to describe the type of the row. Two rows with the same types and aliases, in the same order, will have the same synthetic class (and are therefore union-compatible). For example, each row from select {three} from new int[] {1, 2, 3} as three is class Synthetic$124 { int three; }; the type of select {emp.empno, emp.ename as name} from emp is class Synthetic$527 { int empno; String name; }.
Aliases of items in from and select lists default to variable, field, and method names. The aliases of the fields in Emp emp; select emp, dept.deptno, dept.dname.substring(3) from dept are emp, deptno, and substring, respectively. Other kinds of expression have no default alias, and therefore an alias must be specified. Whether implied or explicit, the aliases in a select or from list must be distinct.
If the from clause is missing, the input to the select statement is a table with one row and no columns.
Expressions in the from list must be set expressions. Any kind of expression is allowed in the select list; queries are returned as arrays.
If on condition is omitted from a from-expression it defaults to true (hence, a cartesian product). The keywords left, right and full indicate that an outer join is to be performed.
No plans to do aggregation.

Boolean expressions

We add the following boolean expressions to Java:

expression in expression

exists (expression)

The right-hand expressions must be set expressions.

These expressions can, of course, occur inside relational expressions. The sub-query so created is said to be correlated if it refers to rows from the outer query. For example, the following query is correlated on dept.deptno:

select
from depts as dept
where exists (
   select from emps as emp
   where emp.deptno == dept.deptno &&
      emp.gender.equals("F"))

We do not provide a not operator to negate these. Use the regular Java ! operator, for example, !("apple" in fruit) or !exists (select from fruit where fruit.equals("apple").

Statements

for ([type] variable in expression) {
   statement list
}

Aggregation

Simple example:

select emp.deptno, sum(emp.sal) as sumSal
group by emp.deptno
from emps as emp

Unlike SQL, the group by clause comes before the from clause, and is mandatory if you want to aggregate. The equivalent of the SQL select sum(emp.sal) from emp has an empty group by clause in Saffron: select sum(emp.sal) group by from emp.

In Saffron, the where clause is applied after aggregation has been performed, more like a SQL having clause than a where clause. Expressions in the where clause can only reference aggregations and grouping expressions, not ungrouped expressions. If you want to filter the rows going into the aggregation, use a subquery in the from list:

select femp.deptno, sum(femp.sal) as sumSal
group by femp.deptno
from (
    select
    from emps as emp
    where emp.gender.equals("F")) as femp
where avg(femp.sal) > 50000

There are 3 kinds of aggregation function:

Builtin aggregation functions (such as count) are treated specially by the system, and generate optimized code.
A regular java function which takes a set (array, collection, or iterator) will be called if its argument matches the element type in the actual usage. For example, if emp.sal is of type double, then double sum(double[] a) would match. Likewise, int count(Iterator iter) would match select count(emp) from emps as emp, because Emp can be up-cast to Object, the element type of Iterator.
User-defined aggregations.

interface {@link saffron.AggregationExtender}. If you invoke the {@link saffron.AggregationExtender#aggregate} pseudo-method, the system generates the right code to calculate the aggregation.

Now let's see how the Saffron system invokes a user-defined aggregation while evaluating a query. Suppose we have written class MySum, which can total both int and double values, and evaluate the query

select MySum(i)
from new int[] {2, 3, 4} as i

The Saffron system makes following calls. Note how the resolve method deals with operator overloading. Note how ints are wrapped as Integers then in arrays to be passed to addOne, and returned from result.

AggregatorClass sum = new MySum();
AggregationExtender intSum = sum.resolve(new Class[] {Integer.TYPE});
Object total = intSum.start(null);
intSum.addOne(total, new Object[] {new Integer(2)});
intSum.addOne(total, new Object[] {new Integer(4)});
Object total2 = intSum.start(null);
intSum.addOne(total2, new Object[] {new Integer(3)});
intSum.merge(total, total2);
System.out.println("Result is " +
    ((Integer) intSum.result(total)).intValue());

Examples of more complex aggregations:

The median aggregation ({@link saffron.ext.Median}) requires a variable-size workspace. A typical implementation would use an {@link java.util.ArrayList} to collect the values, sort them, then return the middle one.
The nth aggregation {@link saffron.ext.Nth}) returns the nth item of a set. n is passed to the start method, so must be independent of any particular row. Example:
```
select new Nth(3).aggregate(name) from (select from emps order by salary)
```
The closest aggregation computes the nearest of a set of points to a given point. Its start method receives the latitude and longitude of the initial point (as new Object[] {new Double(42), new Double(129)}), and the addOne method receives the latitude and longitude of each subsequent point.
```
select closest(42, -129, office.latitude, office.longitude)
from offices as office
```

Notes & issues:

Arguments are zero or more primitive types or objects. Primitives are passed wrapped as objects.
A robust aggregation of median (not possible without invoking custom code-generation) would internally expand it to
```
select nth(count() / 2, sorted)
from (select from new int[] {2, 3, 4} as i order by i) as sorted
```
Distinct aggregations. Syntax? Implementation?
Could we use aggregations as rolling averages? (I guess the {@link saffron.AggregationExtender}.result() method would have to be non-destructive.)

Type-checking

One of Java's most important features is its strong type system. To take advantage of this, every relational expression is always has a result which is an array type.

This doesn't mean that the array is actually created. If you use the for (... in ...) construct, for example, the system will implement the loop using an iterator, and is unlikely to construct the result as an array. And the optimizer tries hard not to convert intermediate sets, such as that created by an in operator, into arrays.

Regular java expressions can also be used as relational expressions:

arrays
collections (Collection, including Vector)
iterators (Enumeration, Iterator)

The problem with collections and iterators (until generics arrive in Java 1.4) is that we don't know what type of objects they contain. We allow the programmer to specify that type information using by a dummy cast to an array type. For example:

Vector fruit = new Vector();
fruit.addElement("apple");
fruit.addElement("banana");
fruit.addElement("orange");
fruit.addElement("mango");
String[] tropicalFruit = (select fruit
                          from (String[]) fruit
                          where fruit.indexOf("n") >= 0);

You can cast a collection or iterator to an array of primitive types, provided that the collection does not contain null values.

Casting input datasets

Some data sources, such as vectors, iterators and external data sources such as JDBC tables, contain very little type information, and this makes it difficult to write concise queries. For example,

Vector emps = new Vector();
emps.addElement(fred);
emps.addElement(barney);
for (emp in select from emps as emp where ((Emp) emp).deptno == 20) {
  print(((Emp) emp).name);
}

Better to tell the system just once that emps is a collection of Emp objects:

for (emp in select from (Emp[]) emps as emp where emp.deptno == 20) {
  print(emp.name);
}

The cast does not necessarily cause the vector to be converted to an array: we are simply providing the Saffron compiler with more type information, and allowing it to choose the most efficient plan.

Determining the output type

By default, a relational expression returns an array. You can produce results in a different format by using a dummy cast, like this:

Iterator emps;
Iterator femaleEmps = (Iterator)
    (select from (Emp[]) emps as emp
     where emp.gender.equals("F"));

This is particularly useful where the output is large (even infinite!) and you to iterate over the first few rows without computing the whole result.

The available types are {@link java.util.Iterator}, {@link java.util.Enumeration}, {@link java.util.Collection}, {@link java.util.Vector}, and type[]. In future, we may add {@link java.sql.ResultSet}.

Tables and Schemas

The interfaces Table, Schema and Connection allow you to access data sources which have arbitrary access methods. A table (interface {@link saffron.Table}) represents a set of relational data, and it provides methods to describe the data source, discover its access methods, and so forth. Schemas (interface {@link saffron.Schema}) make it easy to expose a set of tables and the information necessary to compile programs which use them. If you can to access data in a JDBC database, derive a class from class {@link saffron.ext.ReflectSchema} and create a member of type {@link saffron.ext.JdbcTable} for each table your wish to access (class {@link sales.Sales} is an example of this.)

Let's look at what happens when you compile a program which uses a schema.

The compiler notices a piece of code (sales.emps) which accesses a member or method of an object (sales) which implements interface {@link saffron.Connection}.
The compiler calls the static method getSchema on that object. (Why call a static method, not a method in the interface? Well, remember that this is happening at compile time. The real object does not exist yet.)
The compiler invokes the schema's getMemberTable or getMethodTable method, to retrieve a Table. If there is no such table, these methods may throw openjava.mop.NoSuchMemberException, which will be reported as a compilation error.
The compiler invokes the table's toRel method, passing the piece of parse tree which represents the expression which will yields the schema. (The real schema object only gets created at run time, remember.)
The compiler calls schema's getTable method with the parse object be for example, class Sales.

In particular, we provide abstract class {@link saffron.ext.JdbcTable} implements {@link saffron.Table} to make it easy to access tables via JDBC. We recommend that you group tables into a schema class, which extends class {@link saffron.ext.ReflectSchema}. For example,

class SalesConnection
   extends saffron.ext.JdbcConnection
   implements saffron.Connection
{
   public SalesConnection(String connect) {
      super(connect);
   }
   // required for saffron.Connection
   public static saffron.Schema getSchema() {
      return new SalesSchema();
   }
};

class SalesSchema
   extends saffron.ext.ReflectSchema
   implements saffron.Schema
{
   public static class Emp {
      public int empno;
      public String name;
      public int deptno;
   };
   public static class Dept {
      public int deptno;
      public String name;
   };

   public saffron.Table emp = new saffron.ext.JdbcTable(
      this, "EMP", Emp.class);
   public saffron.Table depts = new saffron.ext.JdbcTable(
      this, "DEPT", Dept.class);
};

SalesConnection sales = new SalesConnection("jdbc:odbc:MyDatabase");
for (emp in sales.emps) {
  some code
}

The Saffron compiler creates an instance of the schema, and interrogates it regarding the objects it contains. When it sees the expression sales.emps, it invokes SalesConnection.getSchema().getTableForMember("emps") to get a {@link saffron.Table} which describes emps and, in particular, that it returns rows of type Emp. The translator then calls the table's toRel() method to create a relational expression node; in this case a {@link saffron.rel.TableAccess}. (The translator actually converts sales.emp into a brief intermediate expression (Emp[]) sales.contentsAsArray("emps"). The contentsAsArray function is a placeholder which has special meaning to the translator. Note how an array cast is used to provide type information.)

As well as type information, the schema could in future provide information about volumetrics, indexes, and even custom rules for optimizing the query.

todo: Expose Maps (including Hashtables) as relations with 2 columns.

Non-relational schemas

A schema could be used to present non-relational data in a relational format. It would know that 'getDept(40)' was the same as 'select from dept where deptno = 40' and 'dept.getEmployees()' is the same as 'select from emp where emp.deptno == dept.deptno'. We would need a mechanism to construct an extent (the set of all instances of a given entity). The schema could declare an explicit method to find the instances of an entity; another way is to use mandatory relationships (for example, since every employee must work in a department, given all departments, we can find all employees by traversing the 'works in' relationship). It helps if the relationship is many- (or one-) to-one, since we don't need to eliminate duplicates. There could be more than one access path to an entity.

How to represent such mappings? Views: dept.getEmployees() is equivalent to select from emp where emp.deptno == dept.deptno. Better, parameterized views. Example {@link sales.Reflect} presents Java's type system ({@link java.lang.Class}, {@java.lang.reflect.Field}, et cetera) as relations.

todo: Could allow a type (e.g. boolean, int) to serve as a relation. Could implement by iterating over all possible values of that type (rarely practical) or joining to an attribute of a more finite relation. Example: select from int as i where i in (select emp.deptno from emp).

Null semantics

Saffron uses Java, not SQL, null semantics. null is always the Java null value. Every boolean expression evaluates to either true or false. And yes, null == null evaluates to true.

Comparison semantics

In Saffron, as in Java, the == operator compares addresses, not content. The expression "abc" == "a" + "bc" may, on some Java systems, evaluate to false, so you should write "abc".equals("a" + "bc") instead.

The in operator expands to expressions which use the equals() and hashCode() methods, so you should only uses it on classes for which these methods are well-behaved. And the expression

null in new String[] {"apple", "orange"}

will give a NullPointerException.

The order by operator requires that objects implement the interface java.lang.Comparable. (todo: Compile- or run-time error if not?)

Primitive types are compared using the primitive operators ==, < etc.

Two instances of a synthetic classes are equal if and only if each of their components are equal. Synthetic classes are compared lexicographically. Example:

select
from (select from twos join threes) as i
join (select from twos join threes) as j
where i < j

Primitive values

Saffron accepts primitive values, such as int, char, double, in most places that it accepts objects. A couple of exceptions:

Outer joins need to produce null values. For example,

select {twos, threes}
from new int[] {2, 4, 6, 8} as twos
left join new int[] {3, 6, 9} as threes
on twos == threes

would produce

2, 0
4, 0
6, 6
8, 0

Saffron uses Java collections to implement certain algorithms. For example, in the following expression, it uses a Hashtable to detect values which have already been emitted.
```
select from new int[] {2, 4, 6, 8} as twos
minus
select from new int[] {3, 6, 9} as threes
```
It automatically boxes and unboxes primitive values.

Planning and evaluation

The Saffron system runs in static and dynamic mode.

In static mode, relational expressions are embedded into Java programs. A pre-processor checks that the relational expressions are valid, and converts them into regular java. There are many ways to evaluate a relational expression,

A query optimizer chooses an evaluation plan.

Evaluation model

Implementation issues

Composite rows

If a relational expression has more than one expression in its selectclause, it produces rows which are not atomic. For example,

int[] twos = new int[] {2,4,6,8};
int[] threes = new int[] {3,6,9};
for (i in (select twos, threes twos cross join threes)) {
   ...
}

creates rows which are pairs of ints. It is translated to

int[] twos = new int[] {2,4,6,8};
int[] threes = new int[] {3,6,9};
for (int x = 0; x < twos.length; x++) {
   for (int y = 0; y < threes.length; y++) {
      Synth1 i = new Synth1(twos[x], threes[y]);
      ...
   }
}

This illustrates two issues with composite rows. First, these rows need a type. The Saffron compiler solves this issue by generating a synthetic class (in this case, class Synth1). We don't know what this class will be called, so we use for (variable in ...) syntax, which lets the system infer the type for us.

Second, the system creates a new object for every row, which might be expensive. In this case, they are emitted from the query, so we can't avoid creating them. If the composite rows are not emitted, the Saffron optimizer will do its best to avoid creating them.

How the optimizer works

The job of the optimizer is to convert a parse tree containing Saffron expressions into a regular parse tree. Its input and output, therefore, are both parse trees.

In between times, it searches for Saffron expressions, and converts them to tree of intermediate relational expressions (class {@link saffron.rel.Rel}).

These relational expressions still contain fragments of regular Java expressions for select expressions, method calls, and so forth. The Java expressions are translated into internal form. Except for references to variables outside the Saffron expression, variables tend to disappear, to be replaced by mysterious-looking expressions. For example, in the expression

select
from location
join (select
      from emp
      join dept on emp.deptno == dept.deptno)
on location.state.equals(dept.state)

the join condition location.state.equals(dept.state) becomes $input0.state.equals($input1.$f1.state). Here, $input0 and $input1 represent the 2 inputs to the join node (which is the context for this Java fragment); $f1 is the first field of the composite row produced by the inner query. That field's type is Dept, and therefore it has a state field.

The optimizer uses a dynamic programming approach based upon that used by the Volcano optimizer. Starting with the initial expression, it applies transformation rules to convert it, and its sub-expressions, into semantically equivalent expressions. It estimates the cost of each expression, and cultivates variants of expressions until it finds cheap implementations of them.

An expression can be implemented using several calling conventions. Calling conventions include array, vector, iterator. The most important (and efficient) calling convention is inline. Inline means that a piece of code will be executed for every row in the expression; no collection of objects is created, just stuff happens. An inline implementation of

for (o in (select from emp join dept on emp.deptno == dept.deptno)) {
  System.out.println(o.emp.ename + " works in " + o.dept.dname);
}

would be

for (int i = 0; i < emp.length; i++) {
   for (int j = 0; j < dept.length; j++) {
      if (emp[i].deptno == dept[j].deptno) {
         Synth1 o = new Synth1(emp[i], dept[j]);
         System.out.println(o.emp.ename + " works in " + o.dept.dname);
      }
   }
}

This is rather tricky to implement, because the stuff in the middle of the for loop corresponds to stuff at the top of Rel tree. Each Rel node's implement method has to handle a callback from a child saying 'I have placed a row in variable xyz; please generate code to do something with it.'

If two expressions have the same meaning but different calling conventions, they are not immediately interchangeable. It may be cheaper to evaluate a particular expression as an array, but another algorithm may find it more convenient if that expression is a vector. For this reason, there are special kinds of rule called conversion rules. They can convert from one convention to another, but they have a non-zero cost. The optimizer may decide that it is cheaper to build a vector in the first place, rather than paying the cost of converting it to a vector later.

There is a special kind of node called a Terminator (class {@link saffron.rel.java.Terminator}). After a tree has been optimized, a Terminator is placed on the root to drive code-generation. It behaves somewhat like a Rel node (in that it has inputs, and can be called-back to generate code), but it does not actually produce any result. (todo: Maybe Terminator should be a calling convention?)

Appendix 1: Design notes

Syntax of `group by`

There are several problems with a SQL-like aggregation:

If a select statement contains a group by clause, the validation rules for expressions within that statement are altered: only expressions which are in the group by can be used outside of an aggregate function.
Even if there is no group by clause present, aggregation implicitly occurs if an aggregate function occurs anywhere in the select statement. For example, every expression in
```
select emp.gender.toLowerCase(), emp.name
from emp
where emp.name.startsWith("P")
```
becomes illegal if an aggregation is added:
```
select emp.gender.toLowerCase(), emp.name
from emp
where emp.name.startsWith("P") && min(emp.salary) > 50000
```
This syntax problem could be fixed fairly easily, by requiring an empty group by clause in such cases.
If we allow functions on sets to be used as aggregation functions, then certain statements are ambiguous. Consider the following:
```
int count(Object[] a) {
    int n = 0;
    for (int i = 0; i < a.length; i++) {
        if (a[i] != null) {
            n++;
        }
    }
    return n;
}
class Emp {
    String name;
    Person[] dependents;
};
select count(emp.dependents) from emp;
```
It is not clear whether Saffron should return a row for each employee with the number of the dependents that employee has, or a single number, the number of employees whose dependents field is not null.

The underlying problem is that SQL aggregation makes a scalar expression behave like a set, and this undermines the usual rules of expression validation and type-checking. We can make the set explicit, but the resulting expression

select deptno,
    sum(select emp.sal from emps as emp
           where emp.deptno == deptno) as sumSal,
    min(select emp.sal from emps as emp
           where emp.deptno == deptno) as minSal
from (select distinct deptno from emps) as deptno

is more verbose -- the underlying set, emps, is referenced three times -- and perhaps less efficient.

We need to make grouping an aggregation explicit.

QUEL aggregations are more like scalar sub-queries. They are more verbose. They are also more powerful: note how we can sum only female employees' salaries. In this syntax, the set of values to be aggregated over (in this case, select ... from emp) is formed for each aggregate expression; the optimizer recognizes common expressions and evaluates them only once. Example:

select distinct emp.deptno,
  min(select emp.sal from emps as emp2
      where emp2.deptno == emp.deptno) as minSal
  sum(select emp.sal from emps as emp2
      where emp2.deptno == emp.deptno &&
          gender.equals("F")) as sumFemaleSal
from emps as emp

Aggregation types

Jim Gray et al. [GBLP96] classify aggregation operators into 3 categories, in increasing order of complexity:

Distributive aggregations requires fixed storage, and two totals can be merged. Example: sum, count, min, max. To merge two partial sums, take sum(S1 u S2) = sum({sum(S1), sum(S2)}). (Here, S1 and S2 are the sets of values which formed the partial sums, and S1 u S2 is their union, the set of all values.) min and max work the same. count is different, in that it cannot be used to merge two sums; instead, you use +: count(S1 u S2) = count(S1) + count(S2).
Algebraic aggregations also require fixed storage, but two totals cannot be merged. Example: avg, stddev.
Holistic aggregations require variable storage. Example: median, countDistinct.

Some algebraic aggregates can be decomposed into distributive aggregates, as follows: avg(S) = sum(S) / count(S). We could provide better support for these; maybe a new method <code>Expression {@link saffron.rel.Aggregation}.preExpand()</code>, which is called before the parse tree is converted into a {@link saffron.rel.Aggregate} relational operator.

It is hard to implement holistic aggregations efficiently. For example, median can be translated into an efficient relational expressions by hand:

select {nth(countEmp / 2, salary) as medianSalary}
group by {emp.deptno}
from (select from emp order by emp.deptno) empSorted
join (select {emp.deptno, count() as empCount}
      group by {emp.deptno} from emp) empCounted
on empSorted.deptno == empCounted.deptno

Similarly distinct count:

select {
    empGrouped.deptno,
    count(empGrouped.gender) as countDistinctGender}
group by {empGrouped.deptno}
from (
    select distinct {emp.deptno, emp.gender}
    from emps as emp) as empGrouped

We should supply an extender which can transform either the parse tree or the relational operator tree in such a way.

Start parameters for aggregations

Aggregations can take parameters per query, per group, and per row. Examples:

Conventional aggregations take parameters only per row.
The nth(count, value), count is per-group, and value is per-row.
In min(collator, value), collator is per-query, and value is per-row.

One way to provide per-group parameters is for aggregations to have two sets of parameters: the first passed at start time, the other passed per row.

But custom aggregations can also do the job. They are a little less efficient, and the syntax is a little more verbose, but what's going on is clearer.

Custom aggregation to do locale-specific min:

select {emp.dept.deptno,
   new LocaleMin(emp.dept.locale).apply(name) minName}
group by {emp.dept}
from emps as emp

Problems with iterators

The ideas in this section are not implemented. So, if your iterator is infinite, the optimizer may generate a plan which never finishes. And it may generate a plan which requires iterators to be restarted.

Iterators (and enumerations) present peculiarities that finite data sources such as JDBC tables, arrays, and collections don't present.

Restarting an iterator

First, they need to be restarted. Suppose the optimizer implemented

Iterator nouns, verbs;
for (i in (String[]) verbs intersect (String[]) nouns) {
   print(verb + " is both a noun and a verb");
}

as a nested loops join,

while (nouns.hasNext()) {
   String noun = (String) nouns.next();
   while (verbs.hasNext()) {
      String verb = (String) verbs.next();
      if (verb.equals(noun)) {
         print(verb " is both a noun and a verb");
      }
   }
}

This doesn't work. The problem is, the second time around the outer loop, verbs.hasNext() has already returned false.

If possible, you should use a class which implements {@link java.util.Collection} or {@link saffron.runtime.Iterable}; the optimizer calls the iterate() method each time it needs to start an iteration.

If you supply an {@link java.util.Iterator}, the optimizer wraps it in a {@link saffron.runtime.BufferedIterator}.

Enhancement: If the plan never requires that the iterator is restarted, the optimizer should leave the iterator unwrapped.

Infinite iterators

The second problem with iterators is that they can be infinite. This is a bad idea if you're copying the results to an array, but you can achieve spectacular results if you're prepared to go with the whole streaming thing. Here's an example:

class MultiplesOf implements Iterator {
    int i, n;
    MultiplesOf(int n) {
        this.n = n;
        this.i = 0;
    }
    public boolean hasNext() {
        return true;
    }
    public Object next() {
        return i += n;
    }
    public void remove() {
        throw new UnsupportedOperationException();
    }
}

Iterator twos = new MultiplesOf(2);
Iterator tens = (Iterator) (select from twos where twos % 5 == 0);

We have created one infinite iterator from another!

The problem is that the optimizer might be tempted to do stupid things, like storing the results in an intermediate array. We tell it not to by implementing an empty interface, {@link saffron.runtime.Infinite}.

We have not implemented support for infinite iterators. Expressions involving infinite iterators might just go on for ever.

Appendix 2: Futures

List:

Aggregation
1. Allow built-in aggregations in regular java (that is, outside select statements with a group by clause). Example #1: select {deptno, count(select from emps as e2 where e2.deptno == e1.deptno) as countEmp} from emps as e1. Example #2: if (count(select from emps as e2 where e2.deptno == e1.deptno.
Schema on top of persistence layer -- JDO say. Or JAXB. Talk to Castor folks about it.
Refactoring.
1. search for obsolete & deprecated
Demos
1. Swing application where you can enter expressions or statements, and execute them. Can create a set of named connections.
Environment
1. log
2. hsqldb
OpenJava
1. Obsolete MSJavaCompiler maybe
2. Drive compiler from system properties
3. Drive parser from system properties (or only have one parser, and turn off saffron features programmatically).

Appendix 3: References

GBLP96: J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. In IEEE 12th International Conference on Data Engineering, pages 152—159, 1996 (doc).

End $Id: //guest/julian_hyde/saffron/doc/overview.html#5 $

Saffron overview

Overview

How is Saffron different from SQL?

How does Saffron improve Java?

Grammar

Relational expressions

Boolean expressions

Statements

Aggregation

Type-checking

Casting input datasets

Determining the output type

Tables and Schemas

Non-relational schemas

Null semantics

Comparison semantics

Primitive values

Planning and evaluation

Evaluation model

Implementation issues

Composite rows

How the optimizer works

Appendix 1: Design notes

Syntax of group by

Aggregation types

Start parameters for aggregations

Problems with iterators

Appendix 2: Futures

Appendix 3: References

Syntax of `group by`