To efficiently query MacroBase, we've extended the ANSI SQL standard with our MacroBase operators, which we call MacroBase SQL. Finding explanations for anomalies in your data is as simple as issuing a SQL query from the command line using our custom shell, which enables users to easily explore their data in an interactive way.
New SQL Operator: DIFF
To extend MacroBase queries in SQL, we've extend the standard SQL syntax to
include our own operator, called DIFF.
DIFF Inputs
-
DIFFtakes in two relations as arguments: the first relation is the outliers relation, while the second is the inliers relation. This represents the classification stage of the MacroBase pipeline. Important: both relations must share the same schema! -
The
DIFFoperator also requires you to specify which columns in the given table you want to consider for explanations. This is done using theONkeyword in SQL;ON location, version, hw_modelmeans that MacroBase will use those three columns for explanation generation. If you want to consider all possible columns, useON *; MacroBase SQL will intelligently select the proper columns to explore for an initial query. -
You can also specify several optional keywords to customize your query:
-
COMPARE BY, specifies a ratio metric function, such asglobal_ratio,risk_ratio, orprevalence_ratio. (global_ratiois the default.) Users must also specify an aggregation function (such asCOUNT, orAVG) that takes a column (or multiple columns) as an argument. -
WITH MIN RATIOorWITH MIN SUPPORT, which specifies the minimum ratio and support thresholds for your query. (By default, the minimum ratio is 1.5, and the minimum support is 0.2) -
MAX COMBO, which specifies the maximum order you want for your generated explanations (e.g.,MAX COMBO [number])
-
Remember: a DIFF query is just SQL! So you can include any other standard
SQL clause: you can add WHERE clauses, ORDER BYs, and LIMITs, for
example. (GROUP BY and HAVING is not yet supported.) For example, if you
want to prune out results with low support (e.g., 0.2), simply add WHERE support > 0.2
to your SQL query. (By default, DIFF queries prune out all
results with support less than 0.2, and ratios less than 1.5.)
New SQL Operator: SPLIT
Manually specifying the outliers and inliers for your DIFF query can be a pain for users,
especially if the inlier and outlier subqueries are quite redundant. To simplify things, we've
introduced an additional operator called SPLIT, which segments a single relation in SQL
into two distinct relations.
The SPLIT clause has the following template:
SPLIT <relation> WHERE <boolean_expression>
We can use the SPLIT clause in our DIFF query by passing it as a single argument to our
DIFF operator:
SELECT * FROM DIFF (SPLIT <relation> WHERE <boolean_expression>)
...
...
Here, tuples where <boolean_expression> evaluates to true will be placed in
the outlier table, while ones that evaluate to false will be placed in the
inlier table.
You can also specify a subquery as the argument to your SPLIT clause:
SELECT * FROM DIFF
(SPLIT
(SELECT * FROM <relation> WHERE ...)
WHERE <boolean_expression>)
...
...
Summary
Overall, the formal definition of a DIFF query looks something like this:
SELECT <column_name>,..., <column_name>
FROM DIFF ([ <relation> | <subquery> ], [ <relation> | <subquery> ])
ON { <column_name>,..., <column_name> | * }
[ WITH MIN SUPPORT <decimal> | MIN RATIO <decimal> ]
[ COMPARE BY { <ratio_metric_fn>(<aggregation_fn>(<column_name> | *)) } ]
[ MAX COMBO <number> ]
[ WHERE <boolean_expression> ]
[ ORDER BY <column_name> ( ASC | DESC) }
[ LIMIT { <number> | ALL } ];
or using a SPLIT clause:
SELECT <column_name>,..., <column_name>
FROM DIFF (SPLIT [ <relation> | <subquery> ] WHERE <boolean_expression> )
ON { <column_name>,..., <column_name> | * }
[ WITH MIN SUPPORT <decimal> | MIN RATIO <decimal> ]
[ COMPARE BY { <ratio_metric_fn>(<aggregation_fn>(<column_name> | *)) } ]
[ MAX COMBO <number> ]
[ WHERE <boolean_expression> ]
[ ORDER BY <column_name> ( ASC | DESC) }
[ LIMIT { <number> | ALL } ];
DIFF Output
In order for MacroBase to compatible with traditional SQL queries, the DIFF
operator has to be composable -- it has to seamlessly integrate with other SQL
commands. (This is because SQL is fundamentally a relational
algebra, except that it
operates on multi-sets instead of sets.)
This means that the DIFF operator has to output a relation, just like the
output of any other SQL operator: a SELECT, a WHERE, a JOIN, or a GROUP BY
always outputs a relation, so that users can layer additional queries or
clauses downstream. The DIFF operator does, in fact, output a relation---it
outputs the columns specified in the ON clause, plus three additional
columns: the ratio column, the support column, and the outlier counts
column.
For example, suppose we ran the following SQL query in MacroBase SQL:
SELECT * FROM
DIFF
(SELECT * FROM sample WHERE usage > 1000.0) outliers,
(SELECT * FROM sample WHERE usage < 1000.0) inliers
ON
location, version
COMPARE BY
risk_ratio(COUNT(*)) ORDER BY support;
Then, the output of that query will look something like this:
-------------------------------------------------------------------------------------------
| location | version | risk_ratio | support | outlier_count |
-------------------------------------------------------------------------------------------
| CAN | v2 | 1.5173 | .210526 | 8.0 |
| null | v1 | 10.456989 | .789474 | 30.0 |
| CAN | v1 | 46.424051 | .789474 | 30.0 |
| CAN | null | ∞ | 1.0 | 38.0 |
-------------------------------------------------------------------------------------------
Here, the null values correspond to an explanation that does not include a
value for that particular column; if you've written GROUP BY or CUBE
queries in SQL, these nullls mean the exact same thing. (In database
parlance, the inclusion of these nulls means we are returning a normalized
relation.) For example, the last row in the output above is examining the
explanation results for location=CAN; the row immediately above examines the
results for location=CAN && version=v1.