Spark-Cassandra-Notes

Cassandra data computing with Apache Spark

Mock Data

Simple data extraction from Apache Cassandra using RDD

GitHub repository path: examples/mock-example. Language: Scala v2.11.

This example retrieves data from the table mock_data in the Cassandra keyspace examples. The data is loaded into an RDD and filtered. Only the “gender” and “first_name” columns are selected, as a pair RDD. The example groups the male records by first name, counts them, and takes the five most repeated male first names. It does the same for female names and prints each list to standard output.

import com.datastax.spark.connector._ // adds cassandraTable to the SparkContext

val record_names = sc.cassandraTable[(String,String)]("examples","mock_data")
                    .select("gender","first_name") // pair RDD of (gender, first_name)
                    .cache
//Male
val male_names = record_names.where("gender = 'Male'") // gender filtering
val male_names_c = male_names.map{ case (k,v) => (v,1) } // associate 1 point to each male first name
val males_result = male_names_c.reduceByKey(_ + _) // sum the points per first name

At this point, males_result holds the male first names with their counts, as an RDD of pairs: [(<first_name>, n), ...]

The same for female names.
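The listing above stops at the per-name counts, while the description also mentions taking the five most repeated names for each gender. As a minimal sketch of that count-and-rank logic, the block below uses plain Scala collections instead of an RDD so it can run without a Spark cluster or a Cassandra instance. The object TopNames, the helper topNames, and the sample rows are hypothetical names and made-up data for illustration; they are not taken from the repository.

```scala
object TopNames {
  // Hypothetical helper: counts first names for one gender and returns the
  // n most frequent ones. Ties are broken alphabetically so the result is
  // deterministic.
  def topNames(records: Seq[(String, String)], gender: String, n: Int): Seq[(String, Int)] =
    records
      .collect { case (g, firstName) if g == gender => (firstName, 1) }  // filter by gender, pair each name with 1
      .groupBy { case (firstName, _) => firstName }                      // group the pairs by first name
      .map { case (firstName, ones) => (firstName, ones.map(_._2).sum) } // sum the 1s per name
      .toSeq
      .sortBy { case (firstName, count) => (-count, firstName) }         // highest count first
      .take(n)

  // Made-up sample rows of (gender, first_name); not the real mock_data content
  val sample: Seq[(String, String)] = Seq(
    ("Male", "Mario"), ("Female", "Ellyn"), ("Male", "Mario"),
    ("Male", "Clarke"), ("Female", "Sonya"), ("Female", "Ellyn")
  )

  def main(args: Array[String]): Unit = {
    println("Five highest repeated male names:")
    topNames(sample, "Male", 5).foreach(println)
    println("Five highest repeated female names:")
    topNames(sample, "Female", 5).foreach(println)
  }
}
```

On the actual RDD, one way to express the ranking step would be males_result.sortBy(_._2, ascending = false).take(5), which returns the five pairs with the highest counts to the driver. Whether the repository does it this way is an assumption.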

Printing the five most repeated names for each gender gives output like:

Five highest repeated male names:
(Mario,2)
(Rabbi,2)
(Clarke,2)
(Claudell,2)
(Brnaba,2)

Five highest repeated female names:
(Ellyn,2)
(Margret,2)
(Sonya,2)
(Natala,2)
(Barbara-anne,2)