Table Aggregation in MDELite
MDELite supports a standard form of relational table aggregation that is based on the idea of relational data cubes. Start with a relational table T, Fig (a) below. A groupBy operation partitions this table into a 3D array of tables T[r][d][y], where r is a value from the Region column of T, d is a value from the Dept column of T, and y is a value from the Year column of T. All tuples t of T with Region=r, Dept=d, and Year=y are tuples of T[r][d][s].
Next, each distinct table of T[r][d][y] is aggregated into a single tuple a in table A, which has 4 columns, the groupBy columns Region, Dept, and Year and, in this case, a single aggregation value/statistic of table T[r][d][y], like what is the average of the Sales column, what is the largest value in the sales column, how many rows are there in the T[r][d][y], etc. All T[r][d][y] tables are aggregated in exactly the same manner. The resulting table A can then be aggregated, printed, etc. like the orginal table T. Some notes:
- There is no need to use groupBy -- aggergation can be performed on a single table. The result will be a table A with precisely one row.
- The number of columns in table A equals the number of groupBy columns (which could be zero), followed by the number of aggregations produced.
The basic call to group by is:
- groupBy( String[] fields,Reducer... reds ) -- where fields is an array of one or more column names, and reds is an array of reducers (a.k.a. aggregate functions).
- aggregate( Reducer... reds ) -- where reds is an array of reducers (a.k.a. aggregate functions).
The typical use of group by is:
- create the table to be "grouped"
- "groupBy" it and produce the aggregate table
- print or further aggregate the aggregate table
In the following, I present a series of aggregation and groupBy examples. The database is taken from (a database text to be specified and linked). It is complicated enough to provide a wealth of examples for aggregation (and without).
1. How many employees are from wichita?
Join the employees relation with a shortened zipcode relation and group by. One could also, in this case, just count the number of employees in the given zipcode. String[] fields = {"zipcodes.city"};
employees.join("zip", zipcodes.select(z->z.is("city","wichita")), "zip")
.groupBy(fields,new Count("cnt"))
.print();
// or
int i = zipcodes.select(z->z.is("city","wichita")).rightSemiJoin("zip",employees,"zip").count();
System.out.println("count = " + i);
2. How many versions of "Land Before Time" have been sold?
The printing of the computed relation, below, is overkill. Table t = parts.select(p->p.get("pname").startsWith("land before time"))
.rightSemiJoin("pno", odetails, "pno")
.aggregate(new IntSum("sum","qty"));
t.print();
System.out.println("Number of LBT sold = " + t.getFirst().get("sum"));
3. How many parts has each employee sold?
Table t = employees.join("eno", orders, "eno").join("orders.ono", odetails, "ono");
String[] fields = {"employees.eno","employees.ename"};
Table r = t.groupBy(fields, new IntSum("Nsold","odetails.qty"));
r.print();
4. For each part, how much money is in our warehouse?
This is a bit harder: The DoubleAgg reducer computes a field, called "money", and needs a function lambda(int i, Tuple tpl)=i + tpl.qoh*tpl*price; to compute the sum. DoubleAgg da = new DoubleAgg("money",(i,tpl) -> i + tpl.getInt("qoh")*tpl.getDouble("price"));
parts.aggregate(da).print();
5. What is the cost per order?
Table j = orders.join("ono",odetails,"ono")
.join("odetails.pno",parts,"pno");
String[] fields = {"odetails.ono"};
Table r = j.groupBy(fields,
new DoubleAgg("cost",
(d,t)-> d + t.getInt("odetails.qty")*t.getDouble("parts.price") ));
r.print();
6. What customers ordered dirty harry or dr, zhivago?
Not really an aggregation, but just for fun. parts.select(p->p.is("pname","dirty harry","dr. zhivago"))
.rightSemiJoin("pno", odetails, "pno")
.rightSemiJoin("ono", orders, "ono")
.rightSemiJoin("cno", customers, "cno")
.print();