Mock Data
Simple data extraction from Apache Cassandra using RDD
GitHub repository
Path: examples/mock-example
Language: Scala v2.11
- Previous Requirements
- Data sources
This example retrieves data from the Cassandra keyspace examples, table mock_data. The data is loaded into an RDD and filtered. Only the “gender” and “first_name” columns are selected, as a pair RDD. The example then groups the records by first name and counts them, taking the five most repeated male first names. It does the same for female names and prints each list to standard output.
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
val record_names = sc.cassandraTable[(String,String)]("examples","mock_data")
.select("gender","first_name") // pair RDD with the gender and first_name columns
.cache
//Male
val male_names = record_names.where("gender = 'Male'") // gender filtering
- Once the gender is filtered, pair each name with 1 in a Tuple2 (<first_name>,1); then reduceByKey counts the occurrences of each first_name.
val male_names_c = male_names.map{ case (k,v) => (v,1) } // associate 1 point to each male first name
val males_result = male_names_c.reduceByKey(_ + _) // sum the 1s per first name to get its count
So at last we have an RDD of male first names with their counts: RDD[(<first_name>, n)]
The same for female names.
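The counting step above can be tried without a Spark cluster or a Cassandra instance. The following is only a minimal sketch on plain Scala collections: the sample rows and the names CountNamesSketch and countNames are hypothetical and not part of the example repository, and groupBy followed by a sum stands in for what reduceByKey does on an RDD.

```scala
object CountNamesSketch {
  // Hypothetical sample rows, shaped like the (gender, first_name) pairs read from Cassandra
  val records: Seq[(String, String)] = Seq(
    ("Male", "Mario"), ("Male", "Clarke"), ("Male", "Mario"),
    ("Female", "Ellyn"), ("Female", "Sonya"), ("Female", "Ellyn")
  )

  // Keep one gender, pair each first name with 1, then sum the 1s per name
  def countNames(gender: String): Map[String, Int] =
    records
      .filter { case (g, _) => g == gender }
      .map { case (_, name) => (name, 1) }
      .groupBy { case (name, _) => name }
      .map { case (name, pairs) => (name, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    println(countNames("Male"))
    println(countNames("Female"))
  }
}
```

Each name must end up with the sum of its 1s, so the combining function has to add both of its arguments; adding one argument to itself would double a partial count instead of accumulating it.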
- The last step is to take the five most repeated names recorded for each gender:
println("Ordered Female Names count list:")
// RDD ordered by female names
val females_result_az = females_result.sortByKey() // keys (female names) are sorted in ascending order
females_result_az.collect.foreach(println) // print result records to stdout
// take the 5 most repeated female names
println("Five highest repeated female names:")
val females_result_high = females_result.sortBy(_._2,false).take(5)
females_result_high.foreach(println)
The same for male records. The output would be:
Five highest repeated male names:
(Mario,2)
(Rabbi,2)
(Clarke,2)
(Claudell,2)
(Brnaba,2)
Five highest repeated female names:
(Ellyn,2)
(Margret,2)
(Sonya,2)
(Natala,2)
(Barbara-anne,2)
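In the output above every listed name has the same count, so which five names appear depends on how ties happen to be ordered. As a hedged illustration only, the sketch below shows one way to make the selection repeatable on plain Scala collections, sorting by count in descending order and breaking ties alphabetically. The sample counts and the names TopNamesSketch and topN are hypothetical and not taken from the example repository.

```scala
object TopNamesSketch {
  // Hypothetical counts, standing in for the collected (first_name, n) result
  val counts: Seq[(String, Int)] = Seq(
    ("Sonya", 2), ("Ellyn", 2), ("Margret", 2), ("Ada", 1), ("Natala", 2), ("Zoe", 3)
  )

  // Highest count first; ties are broken alphabetically so the result is repeatable
  def topN(n: Int): Seq[(String, Int)] =
    counts.sortBy { case (name, count) => (-count, name) }.take(n)

  def main(args: Array[String]): Unit =
    topN(5).foreach(println)
}
```

The same composite key could be passed to sortBy on the RDD before take(5), assuming a stable choice among equally frequent names matters for the report.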