Spark-Cassandra-Notes

Cassandra data computing with Apache Spark

Scala Apps Template

Directory template

This page describes the file and folder layout required to compile Scala applications that will run on Spark.

Spark Submit

To launch Scala applications on a Spark cluster you can use the spark-submit utility, which is part of the Spark distribution. It is the alternative to spark-shell, which is better suited to interactive use.

spark-submit runs an entry-point class with a main method, selected with the --class option. If the application has dependencies, it must be packaged together with them, normally as a jar for Java and Scala.
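
A sketch of a typical invocation (angle brackets mark values to fill in; the --jars option attaches extra dependency jars such as the Spark-Cassandra Connector):

spark-submit \
  --class <main_class> \
  --master <master_url> \
  --jars lib/spark-cassandra-connector.jar \
  <path_to_app_jar>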

Compilation

A build tool such as sbt is needed to produce jars that can be submitted. The build requires some settings and a particular folder structure for the application.

Folder tree:

|--lib
|  |... (e.g. unmanaged dependency jars)
|--project (generated automatically by sbt)
|--src
|  |--main
|  |  |--scala
|  |  |  |... (.scala source files)
|--target (compilation output)
|--build.sbt (sbt configuration)

Contents of build.sbt:
name := "<app_name>"

version := "<app_version>"

scalaVersion := "<scala_version>"

// Spark version compatible with the application
val sparkVersion = "<spark_version>"

// unmanaged dependencies, e.g. the Spark-Cassandra Connector jar
unmanagedJars in Compile += file("lib/spark-cassandra-connector.jar")

resolvers ++= Seq(
  "apache-snapshots" at "http://repository.apache.org/snapshots/"
)

// managed dependencies (ivy, maven, etc.)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-hive" % sparkVersion
)
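
To compile, run sbt package from the project root; the resulting jar lands under target/scala-<scala_version>/ and can be passed to spark-submit. As a minimal sketch of a source file to place under src/main/scala, assuming Spark 2.x or later where SparkSession is the entry point (the object name ExampleApp and its logic are only illustrative):

import org.apache.spark.sql.SparkSession

object ExampleApp {
  def main(args: Array[String]): Unit = {
    // The master URL is normally supplied by spark-submit, not hard-coded here
    val spark = SparkSession.builder()
      .appName("ExampleApp")
      .getOrCreate()

    // Trivial job: count a generated range of numbers
    val count = spark.range(100).count()
    println(s"count = $count")

    spark.stop()
  }
}

Note that sbt package only bundles your own classes. If you instead build a fat jar with a plugin such as sbt-assembly, the Spark modules above are usually scoped with % "provided", since spark-submit already puts them on the cluster classpath.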