Setup

Before going through this tutorial, follow the setup instructions to make sure MacroBase is correctly installed on your machine.

Analyzing Wikipedia Edits with MacroBase SQL

In this tutorial, we're going to analyze a sample of Wikipedia edits from September 12, 2015. You can download the sample data here: wikiticker.csv. (Make sure to download the file to your top-level MacroBase directory.)

Using MacroBase parlance, we can classify the schema of this dataset into metrics (measurements in the dataset that capture user's interests) and attributes (dimensions that could possibly explain why a metric is behaving unusually).

Besides "time", here are the attributes present in the data:

  • user
  • page
  • channel
  • namespace
  • comment
  • metroCode
  • cityName
  • regionName
  • regionIsoCode
  • countryName
  • countryIsoCode
  • isAnonymous
  • isMinor
  • isNew
  • isRobot
  • isUnpatrolled

And here are the three metrics that are present:

  • added (number of lines added in the edit)
  • deleted (number of lines deleted)
  • delta (number of lines that changed)

Getting Started

From the top-level directory, run bin/macrobase-sql to start MacroBase SQL shell -- you should see this:

Welcome to
    __  ___                      ____                
   /  |/  /___ _______________  / __ )____ _________ 
  / /|_/ / __ `/ ___/ ___/ __ \/ __  / __ `/ ___/ _ \
 / /  / / /_/ / /__/ /  / /_/ / /_/ / /_/ (__  )  __/
/_/  /_/\__,_/\___/_/   \____/_____/\__,_/____/\___/ 

macrobase-sql>

Next, let's load the CSV file into MacroBase SQL:

IMPORT FROM CSV FILE 'wikiticker.csv' INTO wiki(time string, user string, page
  string, channel string, namespace string, comment string, metroCode string,
  cityName string, regionName string, regionIsoCode string, countryName string,
  countryIsoCode string, isAnonymous string, isMinor string, isNew string,
  isRobot string, isUnpatrolled string, delta double, added double, deleted
  double);

MacroBase SQL is just SQL: you can add projections in the SELECT clause, predicates in the WHERE clause, an ORDER BY clause (on single columns only -- for now), and a LIMIT clause.

SELECT comment, channel FROM wiki
  WHERE countryIsoCode is not NULL
  ORDER BY channel DESC LIMIT 15;

To save the output of any query in MacroBase SQL to a file, use the INTO OUTFILE syntax found in MySQL

SELECT comment, channel FROM wiki
  WHERE countryIsoCode is not NULL
  ORDER BY channel DESC
  INTO OUTFILE 'comments.csv' FIELDS TERMINATED BY '\t';
  -- FIELDS TERMINATED BY clause is optional; default is ','

Now, let's try out the features that are unique to MacroBase SQL, such as the DIFF operator. Here's a simple DIFF query with two arguments: outliers and inliers.

SELECT * FROM
  DIFF
    (SELECT * FROM wiki WHERE deleted > 0.0) outliers,
    (SELECT * FROM wiki WHERE deleted <= 0.0) inliers
  ON *;

Notice the ON * at the end: MacroBase will automatically try to find attribute columns that are good candidates for explanation. As the query is run, you'll see which columns are selected. For example, in the query above you should see something like this in the log output:

0    [main] INFO  QueryEngine  - Using channel, cityName, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, regionIsoCode, regionName as candidate attributes for explanation

In this case, the time, user, and page columns were ignored, since they have a unique value per row.

You should also see additional metadata about the query in the log output, like the minimum support (default: 0.2), minimum ratio (default: 1.5), and ratio metric (default: global_ratio) used in the query:

4    [main] INFO  APriori  - Min Support Ratio: 0.2
5    [main] INFO  APriori  - Min Ratio Metric: 1.5
5    [main] INFO  APriori  - Using Ratio of: GlobalRatioMetric

You can also write a DIFF query using our SPLIT operator. The SPLIT operator takes in a relation and a WHERE clause, and it effectively splits the input relation into two output relations: one in which the WHERE clause always evaluates to "true", and the other in which the WHERE clause always evaluates to "false". We can rewrite our initial DIFF query much more concisely and get the exact same result:

SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0) ON *;

Instead of a table name, you can also pass a subquery as the input relation to a SPLIT operator. This again yields the exact same result:

SELECT * FROM DIFF (SPLIT (SELECT * FROM wiki) WHERE deleted > 0.0) ON *;

Note: A SPLIT clause can't be the top-level query in MacroBase SQL; the query below, for example, will result in a parsing error:

SPLIT wiki WHERE deleted > 0.0;

Many of the columns in ON * didn't yield any explanations (e.g., "countryName", "regionName"); we can filter these out by modifying the ON clause to include only the columns we care about (which will also improve query performance).

SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
  ON isRobot, channel, isUnpatrolled, isNew, isMinor, isAnonymous, namespace;

Maybe our original query (with ON *) didn't yield explanations in "countryName" and "regionName" because our minimum support or minimum ratio were too high; we can tweak either using WITH MIN RATIO and/or MIN SUPPORT:

SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
  ON *
  WITH MIN SUPPORT 0.10;

SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
  ON *
  WITH MIN RATIO 1.25;

SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
  ON *
  WITH MIN SUPPORT 0.10 MIN RATIO 1.25;
  -- WITH MIN RATIO 1.25 MIN SUPPORT 0.10 also works

We also have support for user-defined functions (UDFs) that can be applied to individual columns in MacroBase SQL; these come in handy for more complicated SPLIT clauses. For example, we can apply a percentile UDF to the "deleted" column:

SELECT percentile(deleted) FROM wiki;
SELECT deleted, percentile(deleted) as percentile FROM wiki;
SELECT *, percentile(deleted) as percentile FROM wiki;

We can also execute UDFs in the WHERE clause to apply custom predicates to our data:

SELECT deleted FROM wiki WHERE percentile(deleted) > 0.95;

In the SPLIT clause, you can treat the UDF column as any other column:

SELECT * FROM DIFF
    (SPLIT (
      SELECT *, percentile(deleted) as percentile FROM wiki)
    WHERE percentile > 0.95)
  ON isRobot, channel, isUnpatrolled, isNew, isMinor, isAnonymous, namespace
  WITH MIN SUPPORT 0.10;

Your Turn

There's plenty more to explore in this dataset! For example, try writing a few queries using the DIFF and SPLIT operators to analyze the "added" and "delta" metrics, which we haven't done yet. Are there any interesting outliers that you can find?

Here's a sample query to get you started:

SELECT * FROM DIFF
  (SPLIT (
    SELECT *, percentile(delta) as percentile FROM wiki
    )
  WHERE percentile > 0.99)
ON *;