Total 10 Posts

Apache Spark

Transient Cluster on AWS

This post demonstrates a cost-effective, automated solution for running Spark jobs daily on an EMR cluster using CloudWatch, Lambda, EMR, S3, and SNS.…

Performance Tweaking Apache Spark

Apache Spark Streaming applications need to be monitored frequently to be certain that they are…

Incrementally loaded Parquet files

In this post, I explore how you can leverage Parquet when you need to load…

MongoDB and Apache Spark - Getting started tutorial

MongoDB and Apache Spark are two popular Big Data technologies. In my previous post, I…

May 03, 2017

Raphael Brugier

Big Data

Introduction to the MongoDB connector for Apache Spark

MongoDB is one of the most popular NoSQL databases. Its unique capabilities to store document-oriented…

Spark Summit East 2017 - A summary

I attended Spark Summit East 2017 last week. This two-day conference - February 8th…

A tour of Databricks Community Edition: a hosted Spark service

With the recent announcement of the Community Edition, it’s time to have a look…

Testing strategy for Spark Streaming - Part 2 of 2

In a previous post, we’ve seen why it’s important to test your Spark…

Testing strategy for Apache Spark jobs - Part 1 of 2

Like any other application, Apache Spark jobs deserve good testing practices and coverage. Indeed, the…

Applying Data Science with Apache Spark Coding Dojo

This week, at the power plant (Ippon Technologies USA headquarters), we had the pleasure of…
