Setup
Before starting this tutorial, we recommend going through the SQL tutorial to better understand MacroBase-SQL.
To install MacroBase-Spark, download the latest release from our GitHub page and build from source; make sure the release you download is a MacroBase-Spark release.
Building MacroBase-SQL-Spark requires Apache Maven 3.3.9
and Java 8+. Once Maven and Java 8 are installed, simply run mvn package
in the top-level directory, and MacroBase-SQL-Spark should successfully build.
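For example, from the top-level MacroBase directory (adding Maven's standard -DskipTests flag is optional but speeds up the build by skipping the test suite):
mvn package -DskipTests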
In order to run MacroBase-SQL-Spark, your computer must be connected to an existing Spark cluster with HDFS set up. Your $SPARK_HOME environment variable must be set to your top-level Spark directory, and the $SPARK_HOME/bin directory must be on your system PATH.
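For example, if Spark were installed at /opt/spark (a placeholder path; substitute your own installation directory), you could add the following to your shell profile:
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"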
Running Spark Locally
This tutorial will use the same dataset as the SQL tutorial, a sample of Wikipedia edits from September 12, 2015. You can download the sample data here: wikiticker.csv. (Make sure to download the file to your top-level MacroBase directory.)
First, we will discuss how to run MacroBase-SQL-Spark locally, then explain how to run it on a cluster. To start up a local MacroBase-SQL-Spark shell, run the following command from MacroBase's top-level directory:
spark-submit --master "local[2]" --class edu.stanford.futuredata.macrobase.sql.MacroBaseSQLRepl sql/target/macrobase-sql-1.0-SNAPSHOT.jar -d -n 2
We will explain how this command was constructed. spark-submit is Spark's built-in submission script, documented here. The --master switch specifies the master of the Spark cluster to submit to; since we are running locally, we tell Spark to use local[2], or two cores of our local computer. --class tells Spark where to begin execution. The jar is the application jar. All arguments following the jar are passed to the application directly. In this case, the -d flag tells MacroBase-SQL to distribute its work using Spark, and the -n flag tells MacroBase-SQL-Spark how many partitions to make when distributing computation. Since we have only two cores to distribute over, we use two partitions.
Once MacroBase-SQL-Spark is running, it takes in the same commands as MacroBase-SQL. Upon launching the shell, you should see:
Welcome to MacroBase!
macrobase-sql>
As in the previous tutorial, let's load in our CSV file:
IMPORT FROM CSV FILE 'wikiticker.csv' INTO wiki(time string, user string, page
string, channel string, namespace string, comment string, metroCode string,
cityName string, regionName string, regionIsoCode string, countryName string,
countryIsoCode string, isAnonymous string, isMinor string, isNew string,
isRobot string, isUnpatrolled string, delta double, added double, deleted
double);
This command, like most others, behaves identically to the way it does in MacroBase-SQL, but distributes its work across a Spark cluster. Let's try a command that's unique to MacroBase-SQL, like a DIFF:
SELECT * FROM
DIFF
(SELECT * FROM wiki WHERE deleted > 0.0) outliers,
(SELECT * FROM wiki WHERE deleted <= 0.0) inliers
ON channel, namespace, comment, metroCode, cityName, regionName, regionIsoCode,
countryName, countryIsoCode, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled;
Unlike MacroBase-SQL, MacroBase-SQL-Spark does not yet support ON *, so we must explicitly specify all of the attributes to DIFF over.
As in MacroBase-SQL, you can also write a DIFF
query using our SPLIT
operator.
We can rewrite our initial DIFF
query more concisely
and get the exact same result:
SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
ON channel, namespace, comment, metroCode, cityName, regionName, regionIsoCode,
countryName, countryIsoCode, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled;
We can also still tweak parameters using WITH MIN RATIO and/or MIN SUPPORT:
SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
ON channel, namespace, comment, metroCode, cityName, regionName, regionIsoCode,
countryName, countryIsoCode, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled
WITH MIN SUPPORT 0.10;
SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
ON channel, namespace, comment, metroCode, cityName, regionName, regionIsoCode,
countryName, countryIsoCode, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled
WITH MIN RATIO 1.25;
SELECT * FROM DIFF (SPLIT wiki WHERE deleted > 0.0)
ON channel, namespace, comment, metroCode, cityName, regionName, regionIsoCode,
countryName, countryIsoCode, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled
WITH MIN SUPPORT 0.10 MIN RATIO 1.25;
-- WITH MIN RATIO 1.25 MIN SUPPORT 0.10 also works
Running Spark on a Cluster
Running MacroBase-SQL-Spark on a cluster is largely the same as running it locally. The major differences are in ingest and in the launch command. Ingest on a cluster is done through HDFS, so we must first put our file in our cluster's HDFS system. Here's one way to do it:
hadoop fs -put FILE.csv /user/USERNAME
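Concretely, for this tutorial's dataset you would copy wikiticker.csv into your HDFS home directory (substitute your own HDFS username for USERNAME), and you can confirm the upload with hadoop fs -ls:
hadoop fs -put wikiticker.csv /user/USERNAME
hadoop fs -ls /user/USERNAME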
Afterwards, we launch MacroBase-SQL-Spark as before, but substituting the actual
master URL of the cluster for local[2]
in the supplied command. Depending
on the cluster, we may want to tune other properties, such as the number
of executors to use or the amount of memory to use on each executor. Full
documentation is here.
In addition to this, we may want to tune the number of partitions made by MacroBase using the -n flag; at a minimum, this should be set to the number of executor cores in the cluster.
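As a sketch, a launch against a standalone Spark cluster with eight executor cores might look like the following; the master URL, memory setting, and core counts are placeholders to adapt to your own cluster (--executor-memory and --total-executor-cores are standard spark-submit options):
spark-submit --master spark://MASTER_HOST:7077 --executor-memory 8G --total-executor-cores 8 --class edu.stanford.futuredata.macrobase.sql.MacroBaseSQLRepl sql/target/macrobase-sql-1.0-SNAPSHOT.jar -d -n 8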