Updates for accuracy, typos, and to avail discussion.
DennisDawson committed Feb 19, 2015
1 parent 2e84ec3 commit 6f4ca09
Showing 1 changed file with 69 additions and 50 deletions.
119 changes: 69 additions & 50 deletions spark/README.md
This module provides an example of processing event data using Apache Spark.

## Getting started

This example assumes that you're running a CDH5.1 or later cluster (such as the [Cloudera Quickstart VM][getvm]) that has Spark configured. This example requires the `spark-submit` command to execute the Spark job on the cluster. If you're using the Quickstart VM, run this example from the VM rather than the host computer.

[getvm]: http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html

On the cluster, check out a copy of the code and navigate to the `spark` directory using the following commands in a terminal window.

```bash
git clone https://github.com/kite-sdk/kite-examples.git
cd kite-examples
cd spark
```

## Building the Application

To build the project, enter the following command in a terminal window.

```bash
mvn install
```

## Creating and Populating the Events Dataset

In this example, you store raw events in a Hive-backed dataset so that you can process the results using Hive. Use the `CreateEvents` tool, provided with the demo, to create the `events` dataset and populate it with random event records. Execute the following command from a terminal window in the `kite-examples/spark` directory.


@DennisDawson (Feb 19, 2015): I can't view the dataset in Impala: It tells me it doesn't support map types for events, or record types for correlated events. This does mean that I have to view it in Hive, which is slow.

@joey (Member, Feb 20, 2015): Yes, that's a limitation of Impala.


```bash
mvn exec:java -Dexec.mainClass="org.kitesdk.examples.spark.CreateEvents"
```

You can browse the generated events using [Hue on the Quickstart VM](http://localhost:8888/metastore/table/default/events/read).

## Using Spark to Correlate Events

In this example, you use Spark to correlate events generated from the same IP address within a five-minute window. Begin by configuring Spark to use the Kryo serialization library.

Register your Avro classes with the following Scala class to use Avro's specific binary serialization for both the `StandardEvent` and `CorrelatedEvents` classes.

### AvroKyroRegistrator.scala

@DennisDawson (Feb 19, 2015): I moved the scala class above the discussion of the Java class. It was confusing to me to jump back and forth.


```scala
class AvroKyroRegistrator extends KryoRegistrator {
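  // The method body is collapsed in this diff view. Based on the description
  // above, it registers the example's Avro classes with Kryo for specific
  // binary serialization, roughly along these lines (the chill-avro
  // serializer shown here is an assumption, not taken from this diff):
  //
  //   override def registerClasses(kryo: Kryo) {
  //     kryo.register(classOf[StandardEvent],
  //       AvroSerializer.SpecificRecordBinarySerializer[StandardEvent])
  //     kryo.register(classOf[CorrelatedEvents],
  //       AvroSerializer.SpecificRecordBinarySerializer[CorrelatedEvents])
  //   }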
}
```

@DennisDawson (Feb 20, 2015), on the removed sentence "This will register the use of Avro's specific binary serialization for bot the `StandardEvent` and `CorrelatedEvents` classes": "bot" should be "both".

### Highlights from CorrelateEventsTask.class

The following snippets show examples of code you use to configure and invoke Spark tasks.

Configure Kryo to automatically serialize Avro objects.

```java
// Create the Spark configuration and get a Java context
SparkConf sparkConf = new SparkConf()
.setAppName("Correlate Events")
// Configure the use of Kryo serialization including the Avro registrator
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "org.kitesdk.examples.spark.AvroKyroRegistrator");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
```

To access Hive-backed datasets from remote Spark tasks,
register JARs in the Spark equivalent of the Hadoop DistributedCache:

```java
// Register classes needed for remote Spark tasks
addJarFromClass(sparkContext, getClass());
addJars(sparkContext, System.getenv("HIVE_HOME"), "lib");
sparkContext.addFile(System.getenv("HIVE_HOME")+"/conf/hive-site.xml");
```

@joey (Member, Feb 20, 2015): We should also update these comments in the code itself.

@DennisDawson (Feb 19, 2015): I don't see these lines in the Java class, or anywhere else in the examples directory. I couldn't find them using grep. There is no HIVE_HOME environment variable defined on the VM that I could find. Are these still part of the application?

@joey (Member, Feb 20, 2015): That was probably OBE at some point, since it's not in the code it's safe to remove them.


@DennisDawson (Feb 20, 2015), on the removed sentence "Now we're ready to read from the events dataset by configuring the MapReduce `DatasetKeyInputFormat` and then using Spark's built-in support to generate an RDD form an `InputFormat`": "form" should be "from" - this sentence took me a few minutes to parse due to that little transposition.

Configure the MapReduce `DatasetKeyInputFormat` to enable the application to read from the `events` dataset. Use Spark's built-in support to generate an RDD (Resilient Distributed Dataset) from the input format.

```java
Configuration conf = new Configuration();
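// (Some lines are collapsed in this diff view. Presumably the input is
// configured here, mirroring the output configuration shown further below,
// along the lines of:
//   DatasetKeyInputFormat.configure(conf).readFrom(eventsUri).withType(StandardEvent.class);
// where eventsUri is assumed to be defined elsewhere in the task.)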
JavaPairRDD<StandardEvent, Void> events = sparkContext.newAPIHadoopRDD(conf,
DatasetKeyInputFormat.class, StandardEvent.class, Void.class);
```

The application can now process events as needed. Using your RDD, configure `DatasetKeyOutputFormat` the same way and use `saveAsNewAPIHadoopFile` to store data in an output dataset.

```java
DatasetKeyOutputFormat.configure(conf).writeTo(correlatedEventsUri).withType(CorrelatedEvents.class);
matches.saveAsNewAPIHadoopFile("dummy", CorrelatedEvents.class, Void.class,
DatasetKeyOutputFormat.class, conf);
```
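
The correlation step itself, which turns the `events` RDD into the `matches` RDD used above, is not shown in these highlights. The sketch below is an illustration only, not the code from `CorrelateEventsTask`: it groups events from the same IP address into fixed five-minute buckets using the Spark Java API, and it assumes Avro-generated accessors named `getIp()` and `getTimestamp()` on `StandardEvent`.

```java
// Illustrative sketch only; the real correlation logic lives in CorrelateEventsTask.
// Requires org.apache.spark.api.java.JavaPairRDD, org.apache.spark.api.java.JavaRDD,
// org.apache.spark.api.java.function.PairFunction, and scala.Tuple2.
final long FIVE_MINUTES_MS = 5 * 60 * 1000;

// Key each event by its IP address plus the five-minute window it falls in,
// so events from the same IP and window end up under the same key.
JavaPairRDD<String, StandardEvent> keyed = events.keys().mapToPair(
    new PairFunction<StandardEvent, String, StandardEvent>() {
      @Override
      public Tuple2<String, StandardEvent> call(StandardEvent event) {
        long window = event.getTimestamp() / FIVE_MINUTES_MS;
        return new Tuple2<String, StandardEvent>(
            event.getIp() + "/" + window, event);
      }
    });

// Events sharing a key arrived from the same IP within the same window.
JavaPairRDD<String, Iterable<StandardEvent>> grouped = keyed.groupByKey();
```

Note that fixed bucketing like this misses pairs of events that fall within five minutes of each other but straddle a window boundary, so treat it only as a starting point for the real windowing logic.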

In a terminal window, run the Spark job using the following command.

```bash
spark-submit --class org.kitesdk.examples.spark.CorrelateEvents --jars $(mvn dependency:build-classpath | grep -v '^\[' | sed -e 's/:/,/g') target/kite-spark-demo-*.jar
```

@DennisDawson (Feb 19, 2015): Tom mentioned that it would be good to make this command less complicated.

@joey (Member, Feb 20, 2015): Yeah, I don't know how to make it less complicated other than adding a shell script that does this for us. For a real application, you'd probably need to have a Maven assembly that would copy all of the jars to a directory and a shell script for launching the job. I didn't do that here because it's a bit of overkill and harder to run from the project directory. Thoughts?


You can browse the correlated events using [Hue on the Quickstart VM](http://localhost:8888/metastore/table/default/correlated_events/read).

@DennisDawson (Feb 19, 2015): The results are arcane and not very interesting. You essentially have some huge horizontal scrolly records where you have to examine the timestamps and compare them to see that they're all within 300000 milliseconds (5 minutes) of one another. This might be terribly interesting to a certain type of customer, but I strongly urge anyone coming up with examples in the future to please have something that very quickly returns the answer "42" so that customers say "wow, that's cool!"

Meanwhile, seriously, can someone suggest a SQL query that someone might use against the correlated_events dataset that returns useful information in easily digestible form?

@joey (Member, Feb 20, 2015): This example isn't a complete analytic, it's the first step in doing correlated event analysis. What you'd want to do to "finish" the analysis is something like:

```sql
SELECT ip, SUM(numCorrelations) AS numCorrelations FROM
  (SELECT event.ip AS ip, size(correlated) AS numCorrelations
    FROM correlated_events) correlated_counts
  GROUP BY ip
  SORT BY numCorrelations DESC
  LIMIT 10;
```

which will give you the top IPs correlated with alerts:

```
ip               numCorrelations
192.168.121.116  32
192.168.106.157  30
192.168.64.16    28
192.168.148.78   26
192.168.28.19    26
192.168.89.91    24
192.168.128.124  24
192.168.137.101  24
192.168.137.188  24
192.168.161.107  24
```

## Deleting the datasets

When you're done, or if you want to run the example again, delete the datasets using the Kite CLI `delete` command.

```bash
curl http://central.maven.org/maven2/org/kitesdk/kite-tools/0.17.0/kite-tools-0.17.0-binary.jar -o kite-dataset
chmod +x kite-dataset
./kite-dataset delete events
./kite-dataset delete correlated_events
```

## Troubleshooting

The following are known issues and their solutions.

@DennisDawson (Feb 19, 2015): Obviously, it would be better if CorrelatedEventsTask.java compiled and ran the first time.

@joey (Member, Feb 20, 2015): Ok, I figured out the cause of this problem. Unfortunately, the solution is to make the ugly command line more ugly. If we get to a consensus on how we want to handle that (script versus one-liner) we can get the right solution that doesn't require running it twice.

### ClassNotFoundException

The first time you execute `spark-submit`, the process might not find `CorrelateEvents`.

```
java.lang.ClassNotFoundException: org.kitesdk.examples.spark.CorrelateEvents
```

Execute the command a second time to get past this exception.

### AccessControlException

@DennisDawson (Feb 19, 2015): The access control issue is something in the VM, so I don't think we can do anything to fix it, unless we ran it as part of the mvn install. I don't think it hurts to run it if it's already configured, but I don't know.

@joey (Member, Feb 20, 2015): This should be fixed in the 5.3 version of the VM. I think that we only require the 5.2 VM, so we should just add this as a prep-step for this rather than as a troubleshooting step.

On some VMs, you might receive the following exception.

```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): \
Permission denied: user=cloudera, access=EXECUTE, inode="/user/spark":spark:spark:drwxr-x---
```

In a terminal window, update permissions using the following commands.

```bash
$ sudo su - hdfs
$ hadoop fs -chmod -R 777 /user/spark
```
