Top | Web home | SourceForge home | |
$Id: //guest/julian_hyde/saffron/doc/overview.html#2 $ | |
(C) Copyright 2002, Julian Hyde | |
Author | Julian Hyde (julian.hyde@mail.com) |
---|---|
Created | November 5th, 2002 |
Julian Hyde, November 5th, 2001.
Saffron is a seamless integration of SQL-like relational expressions into Java. As we shall see, the tight integration means that relational expressions are much more powerful than those offered by SQL.
exists
condition:if (exists (select from emp where emp.salary > 50000)) { ... }
An in
condition:
if ("Fred" in (select emp.ename from emp where emp.salary > 50000)) { ... }
A for
loop:
for (emp in (select from emp order by emp.salary)) { System.out.println(emp.ename + " earns " + emp.salary); }
And an array initialized using a relational expression:
Emp[] programmers = (select from emp where emp.title.equals("Programmer"));
select emp.ename.substring(ename.indexOf(".")) from emp
DECIMAL(8,2)
column to a java float
variable) are inefficient, error-prone, and can lose information.
The names of tables and columns are checked at compile time, so your programs
are more maintainable.true
or false
, whereas in SQL a
condition can sometimes evaluates to NULL.manufacturers
), and tables from two
different relational databases (orderDb
and productDb
).Should the query be executed on the orders database or the products database? And how should the manufacturer list be sent over: as a hard-coded in-list, or by restarting the query for each manufacturer in the list? It really doesn't matter: the Saffron optimizer will decide for you.String[] manufacturers = new String[] {"Heinz", "Guinness"}; int[] customers = ( select order.customerId from productDb.products as product join orderDb.orders as order on order.productId == productId where product.manufacturer in manufacturers);
We have already discussed how Saffron makes your life easier if you are embedding SQL statements in Java using JDBC. But Saffron can help even if you are not accessing data stored in a relational database. This is because in-memory data is often relational, and the algorithms to deal with it are often the familiar relational operations: sort, join, filter, union, intersect, minus.
Suppose I have two lists of customers, and I want to create a list of both, with duplicates eliminated. An experienced Java programmer would probably write something like this:
Customer[] addCustomerLists(Customer[] list1, Customer[] list2) { Hashtable hashtable = new Hashtable(); for (int i = 0; i < list1.length; i++) { Customer c = list1[i]; hashtable.put(c, c); } for (int i = 0; i < list2.length; i++) { Customer c = list2[i]; hashtable.put(c, c); } Customer[] list3 = new Customer[hashtable.size()]; hashtable.asList().copyInto(list3); return list3; }
They would test it, notice that the customers are now in arbitrary order, and add
java.util.Arrays.sort(list3);
just before returning the result. In Saffron, a programmer would write
Customer[] list3 = select from (list1 union list2) c order by c;
They don't need to know Hashtable is the 'correct' data structure to use. They tell the system what they want, and the system figures how to compute it efficiently.
This is a rather simple example. It gets better, because Saffron is a closed language. This means that you can use a Saffron expression anywhere
Definition. A set expression is
an expression which returns more than one item; namely, a query, an array, an
iterator or collection, or a Table
.
select
[{
[expression [as
alias]], ...}
| expression] [from
from-expression] [where
expression] [order by
[expression [asc
|desc
]], ...]expressionunion
expressionexpressionintersect
expressionexpressionminus
expression
The syntax for the from list from-expression is as follows:
expression [as alias]from-expression [left
|right
|full
| ]join
from-expression [on
condition]
Notes:
select from emp
has type
Emp
); otherwise, the result is a record, with one field per from item
(for example, each row from select from emp join dept
has type
{Emp emp; Dept dept}
; each row
from select
has type {}
).select three from new int[] {1, 2, 3} as three
is an int
. {
... }
) means create a record for each row. The
optimizer will create a synthetic class to describe the type of the row. Two
rows with the same types and aliases, in the same order, will have the same
synthetic class (and are therefore union-compatible). For example, each
row from select {three} from new int[] {1, 2, 3} as three
is
class Synthetic$124 { int three; }
; the type of select {emp.empno,
emp.ename as name} from emp
is class Synthetic$527 { int empno;
String name; }
.Emp emp; select emp,
dept.deptno, dept.dname.substring(3) from dept
are emp
,
deptno
, and substring
, respectively. Other
kinds of expression have no default alias, and therefore an alias must be
specified. Whether implied or explicit, the aliases in a select or from
list must be distinct.on
condition is omitted from a from-expression
it defaults to true
(hence, a cartesian product). The keywords
left
, right
and full
indicate that an
outer join is to be performed.We add the following boolean expressions to Java:
expressionin
expressionexists
(expression)
The right-hand expressions must be set expressions.
These expressions can, of course, occur inside relational expressions. The
sub-query so created is said to be correlated
if it refers to rows from the outer query. For example, the
following query
is correlated on dept.deptno
:
select from
deptsas
deptwhere exists
(select from
empsas
empwhere
emp.deptno == dept.deptno && emp.gender.equals("F"))
We do not provide a not
operator to negate these. Use the
regular Java !
operator, for example, !("apple" in fruit)
or !exists (select from fruit where fruit.equals("apple")
.
for
([type] variablein
expression) { statement list }
Simple example:
select
emp.deptno, sum(emp.sal)as
sumSalgroup by
emp.deptnofrom
empsas
emp
Unlike SQL, the group by
clause comes before the from
clause, and is mandatory if you want to aggregate. The equivalent of the SQL select
sum(emp.sal) from emp
has an empty group by
clause in Saffron: select sum(emp.sal)
group by from emp
.
In Saffron, the where
clause is applied after aggregation
has been performed, more like a SQL having
clause than a
where
clause. Expressions in the where
clause can only
reference aggregations and grouping expressions, not ungrouped expressions. If
you want to filter the rows going into the aggregation, use a subquery in the
from list:
select
femp.deptno, sum(femp.sal)as
sumSalgroup by
femp.deptnofrom
(select
from
empsas
empwhere
emp.gender.equals("F"))as
fempwhere
avg(femp.sal) > 50000
There are 3 kinds of aggregation function:
count
) are treated
specially by the system, and generate optimized code.emp.sal
is of type double
, then
double sum(double[] a)
would match. Likewise, int
count(Iterator iter)
would match select count(emp) from emps as
emp
, because Emp can be up-cast to Object, the element type of Iterator.A user-defined aggregator is a class which implements interface
{@link saffron.AggregatorClass}
. If the name of the class is used as a
function (no new operator), the aggregator is automatically invoked. (Aggregator
classes are generally given names which match function, not class,
naming-conventions.) The class must be in scope: in the current package, fully
qualified, or imported. The {@link saffron.AggregatorClass#resolve} method
determines whether the aggregator can support the argument types. If returns
null
, a compile-time error is issued.
Now let's see how the Saffron system invokes a user-defined aggregator while
evaluating a query. Suppose we have written class MySum
, which can total both
int
and double
values, and evaluate the query
select
MySum(i)from new int
[] {2, 3, 4}as
i
The Saffron system makes following calls. Note how the resolve
method deals with operator overloading. Note how int
s are wrapped
as Integer
s then in arrays to be passed to addOne
, and returned from
result
.
AggregatorClass sum = new MySum(); AggregatorExtender intSum = sum.resolve(new Class[] {Integer.TYPE}); Object total = intSum.start(null); intSum.addOne(total, new Object[] {new Integer(2)}); intSum.addOne(total, new Object[] {new Integer(4)}); Object total2 = intSum.start(null); intSum.addOne(total2, new Object[] {new Integer(3)}); intSum.merge(total, total2); System.out.println("Result is " + ((Integer) intSum.result(total)).intValue());
Examples of more complex aggregators:
median
aggregator ({@link saffron.runtime.median}) requires a
variable-size workspace. A typical implementation would use an {@link java.util.ArrayList} to collect the values, sort them, then return the middle one.nth
aggregator {@link saffron.runtime.Nth}) returns the nth item of a set. n
is passed to the start
method, so must be independent of any
particular row. Example:select
new Nth(3).aggregate(name)from
(select from
empsorder by
salary)
closest
aggregator computes the nearest of a set of
points to a given point. Its start
method receives the latitude
and longitude of the initial point (as new Object[] {new Double(42), new
Double(129)}
), and the addOne
method receives the latitude
and longitude of each subsequent point.select
closest(42, -129, office.latitude, office.longitude)from
officesas
office
Notes & issues:
select
nth(count() / 2, sorted)from
(select from new int
[] {2, 3, 4}as
iorder by
i)as
sorted
{@link saffron.AggregatorExtender}.result()
method would have to be non-destructive.)One of Java's most important features is its strong type system. To take advantage of this, every relational expression is always has a result which is an array type.
This doesn't mean that the array is actually created. If you use the
for (... in ...)
construct, for example, the system will implement the loop
using an iterator, and is unlikely to construct the result as an array.
And the optimizer tries hard not to convert intermediate sets, such as that
created by an in
operator, into arrays.
Regular java expressions can also be used as relational expressions:
Collection
, including Vector
)Enumeration
, Iterator
)The problem with collections and iterators (until generics arrive in Java 1.4) is that we don't know what type of objects they contain. We allow the programmer to specify that type information using by a dummy cast to an array type. For example:
Vector fruit = new Vector(); fruit.addElement("apple"); fruit.addElement("banana"); fruit.addElement("orange"); fruit.addElement("mango"); String[] tropicalFruit = (select fruit from (String[]) fruit where fruit.indexOf("n") >= 0);
You can cast a collection or iterator to an array of primitive types, provided that the collection does not contain null values.
Some data sources, such as vectors, iterators and external data sources such as JDBC tables, contain very little type information, and this makes it difficult to write concise queries. For example,
Vector emps = new Vector(); emps.addElement(fred); emps.addElement(barney); for (emp in select from emps as emp where ((Emp) emp).deptno == 20) { print(((Emp) emp).name); }
Better to tell the system just once that emps
is a collection of
Emp
objects:
for (emp in select from (Emp[]) emps as emp where emp.deptno == 20) { print(emp.name); }
The cast does not necessarily cause the vector to be converted to an array: we are simply providing the Saffron compiler with more type information, and allowing it to choose the most efficient plan.
The interfaces Table, Schema and Connection allow you to access data sources which have
arbitrary access methods. A table (interface {@link saffron.Table}
)
represents a set of relational data, and it provides methods to describe the data source,
discover its access methods, and so forth. Schemas (interface
{@link saffron.Schema}
) make it easy to expose a set of tables and the information
necessary to compile programs which use them. An abstract class
{@link saffron.ext.JdbcSchema} extends {@link saffron.Schema}
makes it easy to access data in a
JDBC database.
Let's look at what happens when you compile a program which uses a schema.
sales.emps
) which
accesses a member or method of an object (sales
) which implements
interface {@link saffron.Connection}
.getSchema
on that
object. (Why call a static method, not a method in the interface?
Well, remember that this is happening at compile time. The real object
does not exist yet.)getMemberTable
or
getMethodTable
method, to retrieve a Table
. If there is no
such table, these methods may throw openjava.mop.NoSuchMemberException
,
which will be reported as a compilation error.toRel
method, passing the
piece of parse tree which represents the expression which will yields the
schema. (The real schema object only gets created at run time,
remember.)In particular, we provide abstract class {@link saffron.ext.JdbcTable} implements
{@link saffron.Table}
to make it easy to access tables via JDBC. We recommend that
you group tables into a schema class, which extends class {@link saffron.ext.JdbcSchema}
. For example,
class SalesConnection extends saffron.ext.JdbcConnection implements saffron.Connection { public SalesConnection(String connect) { super(connect); } // required for saffron.Connection public static saffron.Schema getSchema() { return new SalesSchema(); } }; class SalesSchema extends saffron.ext.JdbcSchema implements saffron.Schema { public static class Emp { public int empno; public String name; public int deptno; }; public static class Dept { public int deptno; public String name; }; public saffron.Table emp = new saffron.ext.JdbcTable(this, "EMP", Emp.class); public saffron.Table depts = new saffron.ext.JdbcTable(this, "DEPT", Dept.class); }; SalesConnection sales = new SalesConnection("jdbc:odbc:MyDatabase"); for (emp in sales.emps) { some code }
The Saffron compiler creates an instance of the schema, and interrogates it
regarding the objects it contains. When it sees the expression sales.emps
,
it invokes SalesConnection.getSchema().getTableForMember("emps")
to
get a {@link saffron.Table} which describes emps
and, in particular,
that it returns rows of type Emp
. The translator then calls
the table's toRel()
method to create a relational expression node;
in this case a {@link saffron.rel.TableAccess}. (The translator actually converts
sales.emp
into a brief intermediate expression (Emp[])
sales.contentsAsArray("emps")
. The contentsAsArray
function
is a placeholder which has special meaning to the translator. Note how an array
cast is used to provide type information.)
As well as type information, the schema could in future provide information about volumetrics, indexes, and even custom rules for optimizing the query.
todo: Expose Map
s (including Hashtable
s) as
relations with 2 columns.
A schema could be used to present non-relational data in a relational format. It would know that 'getDept(40)' was the same as 'select from dept where deptno = 40' and 'dept.getEmployees()' is the same as 'select from emp where emp.deptno == dept.deptno'. We would need a mechanism to construct an extent (the set of all instances of a given entity). The schema could declare an explicit method to find the instances of an entity; another way is to use mandatory relationships (for example, since every employee must work in a department, given all departments, we can find all employees by traversing the 'works in' relationship). It helps if the relationship is many- (or one-) to-one, since we don't need to eliminate duplicates. There could be more than one access path to an entity.
How to represent such mappings? Views: dept.getEmployees() is equivalent to select from emp where emp.deptno == dept.deptno. Better, parameterized views. Example {@link sales.Reflect} presents Java's type system (package {@link java.lang.reflect}) as relations.
todo: Could allow a type (e.g. boolean
, int
)
to serve as a relation. Could implement by iterating over all possible values of
that type (rarely practical) or joining to an attribute of a more finite
relation. Example: select from int as i where i in (select
emp.deptno from emp)
.
Saffron uses Java, not SQL, null semantics. null
is always the Java null value. Every boolean expression evaluates to
either true
or false
. And yes, null ==
null
evaluates to true
.
In Saffron, as in Java, the ==
operator
compares addresses, not content. The expression "abc" == "a"
+ "bc"
may, on some Java systems, evaluate to false, so you should
write "abc".equals("a" + "bc")
instead.
The in
operator expands to expressions which use the
equals()
and hashCode()
methods, so you should only uses it
on classes for which these methods are well-behaved. And the expression
null in new String[] {"apple", "orange"}
will give a NullPointerException
.
The order by
operator requires that objects implement the
interface java.lang.Comparable
. (todo: Compile- or run-time error if
not?)
Primitive types are compared using the primitive operators ==
,
<
etc.
Two instances of a synthetic classes are equal if and only if each of their components are equal. Synthetic classes are compared lexicographically. Example:
select from (select from twos join threes) as i join (select from twos join threes) as j where i < j
Saffron accepts primitive values, such as int
, char
,
double
, in most places that it accepts objects. A couple of
exceptions:
select {twos, threes} from new int[] {2, 4, 6, 8} as twos left join new int[] {3, 6, 9} as threes on twos == threeswould produce
2, 0 4, 0 6, 6 8, 0
select from new int[] {2, 4, 6, 8} as twos minus select from new int[] {3, 6, 9} as threesIt automatically boxes and unboxes primitive values.
The Saffron system runs in static and dynamic mode.
In static mode, relational expressions are embedded into Java programs. A pre-processor checks that the relational expressions are valid, and converts them into regular java. There are many ways to evaluate a relational expression,
A query optimizer chooses an evaluation plan.
If a relational expression has more than one expression in its select
clause, it produces rows which are not atomic. For example,
int[] twos = new int[] {2,4,6,8}; int[] threes = new int[] {3,6,9}; for (i in (select twos, threes twos cross join threes)) { ... }
creates rows which are pairs of int
s. It is translated to
int[] twos = new int[] {2,4,6,8}; int[] threes = new int[] {3,6,9}; for (int x = 0; x < twos.length; x++) { for (int y = 0; y < threes.length; y++) { Synth1 i = new Synth1(twos[x], threes[y]); ... } }
This illustrates two issues with composite rows. First, these rows need
a type. The Saffron compiler solves this issue by generating a synthetic
class (in this case, class Synth1
). We don't know what this
class will be called, so we use for (variable in ...)
syntax, which lets the system infer the type for us.
Second, the system creates a new object for every row, which might be expensive. In this case, they are emitted from the query, so we can't avoid creating them. If the composite rows are not emitted, the Saffron optimizer will do its best to avoid creating them.
The job of the optimizer is to convert a parse tree containing Saffron expressions into a regular parse tree. Its input and output, therefore, are both parse trees.
In between times, it searches for Saffron expressions, and converts them to
tree of intermediate relational expressions (class {@link saffron.rel.Rel}
).
These relational expressions still contain fragments of regular Java expressions for select expressions, method calls, and so forth. The Java expressions are translated into internal form. Except for references to variables outside the Saffron expression, variables tend to disappear, to be replaced by mysterious-looking expressions. For example, in the expression
select from location join (select from emp join dept on emp.deptno == dept.deptno) on location.state.equals(dept.state)
the join condition location.state.equals(dept.state)
becomes $input0.state.equals($input1.$f1.state)
.
Here, $input0
and $input1
represent the 2
inputs to the join node (which is the context for this Java fragment); $f1
is the first field of the composite row produced by the inner query. That
field's type is Dept, and therefore it has a state
field.
The optimizer uses a dynamic programming approach based upon that used by the Volcano optimizer. Starting with the initial expression, it applies transformation rules to convert it, and its sub-expressions, into semantically equivalent expressions. It estimates the cost of each expression, and cultivates variants of expressions until it finds cheap implementations of them.
An expression can be implemented using several calling conventions. Calling conventions include array, vector, iterator. The most important (and efficient) calling convention is inline. Inline means that a piece of code will be executed for every row in the expression; no collection of objects is created, just stuff happens. An inline implementation of
for (o in (select from emp join dept on emp.deptno == dept.deptno)) { System.out.println(o.emp.ename + " works in " + o.dept.dname); }
would be
for (int i = 0; i < emp.length; i++) { for (int j = 0; j < dept.length; j++) { if (emp[i].deptno == dept[j].deptno) { Synth1 o = new Synth1(emp[i], dept[j]); System.out.println(o.emp.ename + " works in " + o.dept.dname); } } }
This is rather tricky to implement, because the stuff in the middle of the for loop corresponds to stuff at the top of Rel tree. Each Rel node's implement method has to handle a callback from a child saying 'I have placed a row in variable xyz; please generate code to do something with it.'
If two expressions have the same meaning but different calling conventions, they are not immediately interchangeable. It may be cheaper to evaluate a particular expression as an array, but another algorithm may find it more convenient if that expression is a vector. For this reason, there are special kinds of rule called conversion rules. They can convert from one convention to another, but they have a non-zero cost. The optimizer may decide that it is cheaper to build a vector in the first place, rather than paying the cost of converting it to a vector later.
There is a special kind of node called a Terminator
(class {@link saffron.rel.Terminator}
).
After a tree has been optimized, a Terminator is placed on the root to drive
code-generation. It behaves somewhat like a Rel node (in that it has
inputs, and can be called-back to generate code), but it does not actually
produce any result. (todo: Maybe Terminator should be a calling
convention?)
group by
There are several problems with a SQL-like aggregation:
select
statement contains a group by
clause,
the validation rules for expressions within that statement are altered: only
expressions which are in the group by
can be used outside of an
aggregate function.group by
clause present, aggregation
implicitly occurs if an aggregate function occurs anywhere in the select
statement. For example, every expression inselect emp.gender.toLowerCase(), emp.name from emp where emp.name.startsWith("P")
becomes illegal if an aggregation is added:
select emp.gender.toLowerCase(), emp.name from emp where emp.name.startsWith("P") && min(emp.salary) > 50000
This syntax problem could be fixed fairly easily, by requiring an empty
group by
clause in such cases.
int count(Object[] a) { int n = 0; for (int i = 0; i < a.length; i++) { if (a[i] != null) { n++; } } return n; } class Emp { String name; Person[] dependents; }; select count(emp.dependents) from emp;
It is not clear whether Saffron should return a row for each employee with
the number of the dependents that employee has, or a single number, the number
of employees whose dependents
field is not null.
The underlying problem is that SQL aggregation makes a scalar expression behave like a set, and this undermines the usual rules of expression validation and type-checking. We can make the set explicit, but the resulting expression
select deptno, sum(select emp.sal from emps as emp where emp.deptno == deptno) as sumSal, min(select emp.sal from emps as emp where emp.deptno == deptno) as minSal from (select distinct deptno from emps) as deptno
is more verbose -- the underlying set, emps
, is referenced three
times -- and perhaps less efficient.
We need to make grouping an aggregation explicit.
QUEL aggregations are more like scalar sub-queries. They are more verbose. They are also more powerful: note how we can sum only female employees'
salaries. In this syntax, the set of values to be aggregated over (in this case,
select ... from emp
) is formed for each aggregate expression; the
optimizer recognizes common expressions and evaluates them only once. Example:
select distinct
emp.deptno, min(select
emp.salfrom
empsas
emp2where
emp2.deptno == emp.deptno)as
minSal sum(select
emp.salfrom
empsas
emp2where
emp2.deptno == emp.deptno && gender.equals("F"))as
sumFemaleSalfrom
empsas
emp
group by
I have heard (todo: find reference, and fix terminology) aggregators classified into 3 sorts:
sum
, count
, min
,
max
. To merge two partial sums, take sum(S1 u S2) =
sum({sum(S1), sum(S2)})
. (Here, S1
and S2
are
the sets of values which formed the partial sums, and S1 u S2
is
their union, the set of all values.) min
and max
work the same. count
is different, in that it cannot be used to
merge two sums; instead, you use +
: count(S1 u S2) =
count(S1) + count(S2)
.avg
,
stddev
.median
, countDistinct
.Some compound aggregates can be decomposed into simple aggregates, as follows: avg(S) = sum(S) / count(S); stddev(S) = todo:. We could provide better support for these; maybe a new method <code>Expression {@link Aggregator}.preExpand()</code>, which is called before the parse tree is converted into a {@link saffron.Aggregate} relational operator.
It is hard to implement heuristic aggregators efficiently. For example,
median
can be translated into an efficient relational expressions by
hand:
select {nth(countEmp / 2, salary) as medianSalary} group by {emp.deptno} from (select from emp order by emp.deptno) empSorted join (select {emp.deptno, count() as empCount} group by {emp.deptno} from emp) empCounted on empSorted.deptno == empCounted.deptno
Similarly distinct count
:
select { empGrouped.deptno, count(empGrouped.gender) as countDistinctGender} group by {empGrouped.deptno} from ( select distinct {emp.deptno, emp.gender} from emps as emp) as empGrouped
We should supply an extender which can transform either the parse tree or the relational operator tree in such a way.
Aggregators can take parameters per query, per group, and per row. Examples:
One way to provide per-group parameters is for aggregators to have two sets of parameters: the first passed at start time, the other passed per row.
But custom aggregators can also do the job. They are a little less efficient, and the syntax is a little more verbose, but what's going on is clearer.
Custom aggregator to do locale-specific min:
select {emp.dept.deptno, new LocaleMin(emp.dept.locale).apply(name) minName} group by {emp.dept} from emps as emp
List:
select
statements with a group by
clause). Examples:
select {deptno, count(select from emps as e2 where e2.deptno ==
e1.deptno) as countEmp} from emps as e1
; if (count(select from
emps as e2 where e2.deptno == e1.deptno
End $Id: //guest/julian_hyde/saffron/doc/overview.html#2 $ |