My colleague Justin Risch recently obtained some data from the popular game Pokemon GO. He cleaned the data up into a much more usable CSV format, and I decided to use it to get some practice with Apache Spark.
The result is a fairly simple Spark job, written in Scala using the Eclipse Scala IDE.
Here is the code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import scala.io.Source

object MostSpawnedPokemon {
  // Load the ID -> Pokemon name lookup table (first column: ID, second column: name)
  def loadNames(): Map[Int, String] = {
    val source = Source.fromFile("../Data/pokemon.csv")
    try {
      source.getLines()
        .map(_.split(','))
        .filter(_.length > 1)
        .map(fields => fields(0).toInt -> fields(1))
        .toMap
    } finally {
      // Close the file handle once the map has been built
      source.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)
    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "MostSpawnedPokemon")
    // Create a broadcast variable of our ID -> Pokemon name map
    val nameDict = sc.broadcast(loadNames())
    // Read in each spawn line
    val lines = sc.textFile("../Data/AllData.csv")
    // Map to (spawnedPokemonId, 1) tuples; the ID is the third column
    val spawned = lines.map(x => (x.split(",")(2).toInt, 1))
    // Count up all the 1's for each Pokemon
    val spawnedCounts = spawned.reduceByKey((x, y) => x + y)
    // Flip (spawnedPokemonId, count) to (count, spawnedPokemonId)
    val flipped = spawnedCounts.map(x => (x._2, x._1))
    // Sort by count, ascending, so the rarest Pokemon come first
    val sortedCount = flipped.sortByKey()
    // Fold in the Pokemon names from the broadcast variable
    val sortedCountWithNames = sortedCount.map(x => (nameDict.value(x._2), x._1))
    // Collect and print results
    val results = sortedCountWithNames.collect()
    results.foreach(println)
    // Shut down the SparkContext
    sc.stop()
  }
}
I walk through each step of the code in the comments. This ended up being a simple yet solid example of using broadcast variables in Spark. A broadcast variable lets the programmer keep a read-only value cached on each machine rather than shipping a copy of it with every task. Here the ID-to-name map is broadcast once and then used to turn each Pokemon ID into a name. The output is each Pokemon's name paired with its number of spawn occurrences, sorted from least to most common.
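To see the broadcast step in isolation, here is a minimal sketch that runs on a tiny in-memory dataset instead of the CSV files. The object name `BroadcastSketch`, the helper `countByName`, and the sample IDs are all made up for illustration and are not part of the job above. It also uses `getOrElse` for the lookup, which is my own assumption about how to handle an ID that is missing from the name map; the original job simply calls `nameDict.value(id)`.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object BroadcastSketch {
  // Hypothetical helper: counts spawns per Pokemon name,
  // resolving IDs through a broadcast lookup table.
  def countByName(sc: SparkContext,
                  names: Map[Int, String],
                  spawnIds: Seq[Int]): Map[String, Int] = {
    // Send the lookup table to the executors once, as a read-only value
    val nameDict: Broadcast[Map[Int, String]] = sc.broadcast(names)

    sc.parallelize(spawnIds)
      .map(id => (id, 1))
      .reduceByKey(_ + _)
      // Tasks read the cached copy through .value;
      // getOrElse labels IDs that are missing from the table
      .map { case (id, count) =>
        (nameDict.value.getOrElse(id, s"Unknown($id)"), count)
      }
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "BroadcastSketch")
    try {
      // Made-up sample data: ID 7 is deliberately absent from the name map
      val names = Map(1 -> "Bulbasaur", 4 -> "Charmander")
      val spawnIds = Seq(1, 4, 1, 1, 7)
      countByName(sc, names, spawnIds).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With a lookup table this small, the closure could capture the plain `Map` directly and the result would be the same. Broadcasting starts to pay off as the table grows or as more tasks reuse it.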
Here is a snippet from the output of the full job, showing the least common Pokemon:
This is a fun dataset to work with, and I am going to keep using it as I learn more advanced Spark programming. Knowing which Pokemon were the least common also ended up helping Justin in some work he did visualizing the data.