Strata+Hadoop World is probably the most important conference on data science and data engineering. The latest edition took place in New York last week and gathered more than 2,000 attendees. With 15 parallel tracks over two days, that’s more than 160 sessions you could attend! Here is my take.
As an introduction to the technical talks, the keynote speakers did a great job of showing how our industry fits into our real lives, and how data can be used effectively to contribute to better healthcare, better education, better lives…
To name only one, Mar Cabra, head of the Data and Research unit at the International Consortium of Investigative Journalists, explained which techniques her team used to analyze the huge amount of data from the Panama Papers leak.
The least I can say is that these talks were inspiring.
A few suggested talks
I obviously cannot give a comprehensive transcript of what was presented at Strata. Instead, let me give you a summary of a few sessions I liked.
Making on-demand grocery delivery profitable with data science – Jeremy Stanley, VP of Data Science at Instacart, explained how they try to optimize their online grocery deliveries. Customers order groceries online, and shoppers pick the items up from local grocery stores and deliver them to the customers. To keep customers happy, you need to deliver quickly, which means you need shoppers to be available. However, if you have too many shoppers, they are under-used and become a cost for the company. Jeremy showed how, version after version, they get better at routing the shoppers. A very complex problem.
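To make the staffing trade-off concrete, here is a toy sketch in Python. It is not Instacart's model and nothing in it comes from the talk: the cost figures, the demand samples and the function names are all hypothetical, chosen only to illustrate why both too few and too many shoppers cost money.

```python
# Toy illustration only: every number and name below is hypothetical,
# not taken from the talk or from Instacart.

IDLE_COST = 15.0   # assumed cost of one scheduled shopper who gets no order
LATE_COST = 40.0   # assumed penalty for one order with no shopper available


def staffing_cost(shoppers, demand):
    """Cost of scheduling `shoppers` when `demand` orders arrive (one order per shopper)."""
    idle = max(shoppers - demand, 0)
    unserved = max(demand - shoppers, 0)
    return idle * IDLE_COST + unserved * LATE_COST


def expected_cost(shoppers, demand_samples):
    """Average cost over a list of equally likely demand scenarios."""
    total = sum(staffing_cost(shoppers, d) for d in demand_samples)
    return total / len(demand_samples)


def best_staffing(demand_samples, max_shoppers):
    """Brute-force search for the staffing level with the lowest expected cost."""
    return min(range(max_shoppers + 1),
               key=lambda s: expected_cost(s, demand_samples))


if __name__ == "__main__":
    samples = [8, 10, 12, 15, 20]  # made-up order counts for one time slot
    best = best_staffing(samples, max_shoppers=30)
    print(best, expected_cost(best, samples))
```

With these made-up numbers the search settles on 15 shoppers: because a missed order is assumed to cost more than an idle shopper, the best level sits above the median demand. The real problem adds routing, travel times and demand forecasting on top of this, which is what makes it so hard.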
Apache Kafka: The rise of real-time data and stream processing – Neha Narkhede, cofounder and head of engineering at Confluent, gave an overview of the three products developed and supported by Confluent: Kafka, Kafka Connect and Kafka Streams. I was already a big fan of Kafka and Kafka Streams, but this talk showed me that Confluent has an extensive roadmap for Kafka Streams. Neha also described how the three tools can be assembled to make up the “modern ETL”.
Parquet performance tuning: The missing guide – Ryan Blue, an engineer on Netflix’s Big Data Platform team, described the columnar storage format Apache Parquet. He gave a highly technical talk about how to take advantage of the features of this format and how to avoid some pitfalls. This shows that choosing a storage format is only the first part of the developer’s job; there’s a lot of tuning to be done after that step.
Twitter’s real-time stack: Processing billions of events with Heron and DistributedLog – Karthik Ramasamy, tech lead for real-time analytics at Twitter, went through the challenges of setting up an analytics platform at the scale of Twitter. He described how they created a high-performance replicated log service, DistributedLog, and how they created Heron, a processing engine designed to be “a better Storm”. Heron has been in production at Twitter for 2+ years now and is 4-5 times faster than Storm.
A torrent of technologies
Finally, if there is one thing to remember from this edition, it is that the industry is very active in creating new technologies. When you think you know the “Data ecosystem” quite well, Strata shows you that you’re wrong! Here is a selection of the technologies that were mentioned. Some I already knew about, some I discovered during the conference:
- Processing: Apache Spark, Apache Apex, Apache Beam, Heron
- Data stores: Pinot (OLAP), Druid, Apache Kudu (analytics)
- Pub-sub: Kafka, Google Cloud Pub/Sub
- Threat detection: Apache Spot
How to see the presentations
Most of the talks have been recorded and will soon be available on O’Reilly’s platform, Safari.
Since some speakers have already given the same talks at other conferences, you might also be able to find the recordings from these events or see one of the speakers at one of your local meetups.
Finally, note that MapR has made six free ebooks by Ted Dunning and Ellen Friedman available.