Ippon Blog

Finding the most spawned Pokemon in Pokemon GO using Spark & visualizing the Data

Written by Nicholas Peterson | Sep 6, 2017 3:25:00 PM

My colleague Justin Risch recently obtained some data from the popular game Pokemon GO. He cleaned it up into a much more usable CSV format, and I decided to use it to get some practice with Apache Spark.

The result is a fairly simple Spark job, written in Scala using the Eclipse Scala IDE.

Here is the code:

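import scala.io.Source

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object MostSpawnedPokemon {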

  def loadNames() : Map[Int, String] = {
    Source.fromFile("../Data/pokemon.csv")
      .getLines()
      .map(_.split(','))
      .filter(_.length > 1)
      .map(fields => fields(0).toInt -> fields(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {

    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "MostSpawnedPokemon")

    // Create a broadcast variable of our ID -> Pokemon Name map
    val nameDict = sc.broadcast(loadNames())

    // Read in each Spawn line
    val lines = sc.textFile("../Data/AllData.csv")

    // Map to (spawnedPokemonId, 1) tuples
    val spawned = lines.map(x => (x.split(",")(2).toInt, 1))

    // Count up all the 1's for each Pokemon
    val spawnedCounts = spawned.reduceByKey( (x, y) => x + y )

    // Flip (spawnedPokemonId, count) to (count, spawnedPokemonId)
    val flipped = spawnedCounts.map( x => (x._2, x._1) )

    // Sort ascending by count, so the least spawned Pokemon come first
    val sortedCount = flipped.sortByKey()

    // Fold in the Pokemon names from the broadcast variable
    val sortedCountWithNames = sortedCount.map( x  => (nameDict.value(x._2), x._1) )

    // Collect and print results
    val results = sortedCountWithNames.collect()
    results.foreach(println)
  }

}

The comments walk through each step of the code. It ended up being a simple yet solid example of using broadcast variables in Spark. A broadcast variable lets the programmer keep a read-only value cached on each machine rather than shipping a copy of it with every task. Here, the ID-to-name map is broadcast once and looked up in the final map step. The output pairs each Pokemon's name with its number of spawn occurrences, sorted from fewest to most.
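To isolate the broadcast pattern from the file handling, here is a minimal sketch that runs on a small in-memory sample. It assumes Spark is on the classpath; the object name `BroadcastSketch`, the helper `countByName`, and the sample IDs and names are hypothetical and are not taken from the dataset.

```scala
import org.apache.spark.SparkContext

object BroadcastSketch {

  // Hypothetical helper: counts spawns per Pokemon name.
  // The ID -> name map is broadcast once, then read inside the map step.
  def countByName(sc: SparkContext,
                  names: Map[Int, String],
                  spawnedIds: Seq[Int]): Map[String, Int] = {
    val nameDict = sc.broadcast(names)
    sc.parallelize(spawnedIds)
      .map(id => (nameDict.value(id), 1)) // look up the name on the executor
      .reduceByKey(_ + _)                 // sum the 1's for each name
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "BroadcastSketch")
    try {
      // Made-up sample data, for illustration only
      val names = Map(16 -> "pidgey", 25 -> "pikachu")
      println(countByName(sc, names, Seq(16, 25, 16, 16)))
    } finally {
      sc.stop()
    }
  }
}
```

With a lookup table this small the gain is negligible, but the same pattern applies when the table is large and used by many tasks.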

Here is a snippet from the output showing the least common Pokemon:

  • (machamp,8)
  • (kabutops,14)
  • (charizard,14)
  • (farfetchd,14)
  • (muk,15)
  • (gyarados,20)
  • (raichu,23)
  • (omastar,25)
  • (alakazam,27)
  • (ninetales,27)
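The listing above starts with the rarest Pokemon because `sortByKey()` sorts in ascending order by default. To see the most spawned Pokemon first instead, one option is to sort in descending order and take the first few rows. The following is only a sketch, assuming Spark is on the classpath; the object name `TopSpawnsSketch`, the helper `mostSpawned`, and the sample counts are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TopSpawnsSketch {

  // Hypothetical helper: returns the n (count, name) pairs with the
  // highest counts, largest count first.
  def mostSpawned(counts: RDD[(Int, String)], n: Int): Array[(Int, String)] =
    counts.sortByKey(ascending = false).take(n)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "TopSpawnsSketch")
    try {
      // Made-up (count, name) pairs, not taken from the real dataset
      val counts = sc.parallelize(Seq((3, "a"), (120, "b"), (45, "c")))
      mostSpawned(counts, 2).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

In the original job, the equivalent change would be passing `ascending = false` to the `sortByKey` call on the flipped RDD.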

This is a fun dataset to work with, and I plan to keep using it as I learn more advanced Spark programming. Even this simple result, a list of the least common Pokemon, ended up helping Justin with his work on visualizing the data.