Finding the most spawned Pokemon in Pokemon GO using Spark & visualizing the Data

My colleague Justin Risch recently obtained some data from the popular game Pokemon GO. He cleaned the data into a much more usable CSV format, and I decided to use it to get some practice with Apache Spark.

The result is a fairly simple Spark job written in Scala using the Eclipse Scala IDE.

Here is the code:

import scala.io.Source
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object MostSpawnedPokemon {

  // Load a map of Pokemon ID -> name from pokemon.csv
  def loadNames(): Map[Int, String] = {
    Source.fromFile("../Data/pokemon.csv")
      .getLines()
      .map(_.split(','))
      .filter(_.length > 1)
      .map(fields => fields(0).toInt -> fields(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {

    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "MostSpawnedPokemon")

    // Create a broadcast variable of our ID -> Pokemon name map
    val nameDict = sc.broadcast(loadNames())

    // Read in each spawn line
    val lines = sc.textFile("../Data/AllData.csv")

    // Map to (spawnedPokemonId, 1) tuples
    val spawned = lines.map(x => (x.split(",")(2).toInt, 1))

    // Count up all the 1's for each Pokemon
    val spawnedCounts = spawned.reduceByKey( (x, y) => x + y )

    // Flip (spawnedPokemonId, count) to (count, spawnedPokemonId)
    val flipped = spawnedCounts.map( x => (x._2, x._1) )

    // Sort ascending by spawn count
    val sortedCount = flipped.sortByKey()

    // Fold in the Pokemon names from the broadcast variable
    val sortedCountWithNames = sortedCount.map( x => (nameDict.value(x._2), x._1) )

    // Collect and print results
    val results = sortedCountWithNames.collect()
    results.foreach(println)

    sc.stop()
  }

}

I go through each step of the code in the comments. This ended up being a simple yet solid example of using broadcast variables in Spark. Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task. The output is each Pokemon's name paired with its number of spawn occurrences.
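The count → flip → sort → name-lookup pipeline itself doesn't depend on Spark, so here is a minimal sketch of the same logic using plain Scala collections (Scala 2.13+ for `groupMapReduce`). The spawn IDs and the name map are made-up stand-ins for `AllData.csv` and `pokemon.csv`:

```scala
object SpawnCountSketch {

  // Count spawns per ID, sort ascending by count, and attach names
  def rankSpawns(spawnIds: Seq[Int], nameDict: Map[Int, String]): Seq[(String, Int)] = {
    // Map to (id, 1) and sum the 1's per key, like reduceByKey
    val counts = spawnIds.map(id => (id, 1)).groupMapReduce(_._1)(_._2)(_ + _)

    // Flip to (count, id) and sort ascending, like sortByKey
    val sorted = counts.toSeq.map { case (id, n) => (n, id) }.sortBy(_._1)

    // Fold in the names, like the broadcast-variable lookup
    sorted.map { case (n, id) => (nameDict(id), n) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical ID -> name map standing in for pokemon.csv
    val nameDict = Map(6 -> "charizard", 129 -> "magikarp")

    // Hypothetical spawn IDs standing in for the third column of AllData.csv
    val spawnIds = Seq(129, 6, 129, 129)

    rankSpawns(spawnIds, nameDict).foreach(println)
  }
}
```

The real job does the same thing distributed across partitions, which is exactly why the name map is broadcast: every executor needs it for the final lookup.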

Here is a snippet from the output showing the most uncommon Pokemon:

  • (machamp,8)
  • (kabutops,14)
  • (charizard,14)
  • (farfetchd,14)
  • (muk,15)
  • (gyarados,20)
  • (raichu,23)
  • (omastar,25)
  • (alakazam,27)
  • (ninetales,27)

This is a fun dataset to work with, and I am going to continue using it as I learn more advanced Spark programming. This simple ranking of the most uncommon Pokemon ended up helping Justin with some of his work visualizing the data.

