Delta Lake 2.0
The Delta format is based on Apache Parquet and is the standard storage layer for your Databricks Lakehouse. It provides features such as ACID transactions, audit history, and time travel (check this article for more details).
By open sourcing it, Databricks lets you produce Delta files that other applications can read and use! You keep all the features mentioned above while enjoying up to 4.3x faster processing compared to other storage layers.
Project Lightspeed
With more and more real-time data pipelines and infrastructures, streaming data is becoming technically challenging: its needs are very different from, and harder to meet than, those of event-driven applications and batch processing.
Databricks is releasing "Project Lightspeed," a framework that advances Spark Structured Streaming into the real-time era as more and more new use cases and workloads migrate to streaming (more details here).
Spark Connect
Spark Connect is a simple client-server protocol that will let you run Spark anywhere! The demo on stage was done using an iPad, but the possibilities are endless (phones, embedded systems, etc.).
Once configured, the Spark client generates a query plan from your Spark code and sends it to your Spark cluster, which does the heavy lifting for you. You will be able to run Spark directly from your machine without a local cluster using up all your resources.
Unity Catalog
The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance, and performance of data warehouses with the openness, flexibility, and machine learning support of data lakes.
Unity Catalog is the unified governance solution for your Lakehouse, with key features like:
- Automated Data Lineage
- Built-in Data Search and Discovery
- Simplified Access Controls
This blog will give you a good overview of all the new features introduced in Unity Catalog.
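Those access controls are expressed as plain SQL GRANT statements. A small sketch, where the catalog, schema, table, and group names are all illustrative:

```python
# Sketch of Unity Catalog's SQL-based access controls. Catalog, schema,
# table, and group names are illustrative.
grants = [
    # Let the `analysts` group use the catalog and schema...
    "GRANT USE CATALOG ON CATALOG main TO `analysts`",
    "GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`",
    # ...and read (but not modify) one table inside it.
    "GRANT SELECT ON TABLE main.sales.orders TO `analysts`",
]

def apply_grants(spark, statements):
    """Run the GRANT statements on a Unity Catalog-enabled session."""
    for statement in statements:
        spark.sql(statement)

# In a Databricks notebook, `spark` is predefined:
# apply_grants(spark, grants)
```

Because the grants live in the catalog rather than in each workspace, the same permissions apply wherever the table is accessed.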
MLflow 2.0
Machine Learning is becoming more widely used in production, and with MLflow 2.0 you will have access to MLflow Pipelines to build production-grade ML pipelines.
Working with ML is never easy and comes with its own set of challenges, which can quickly fester in your organization. Automating and scaling your pipelines should be relatively easy with the first release of MLflow Pipelines (see this article).
Databricks Marketplace
Powered by Delta Sharing, the Databricks Marketplace will let you discover and use a wide variety of data products (datasets, notebooks, dashboards, etc.) from third-party vendors. In a matter of seconds, you will be able to obtain the datasets of your choice and have them in your own Lakehouse.
Please read this introduction for more details.
Databricks SQL Serverless
The beauty of the Lakehouse is that you can combine Spark and SQL code, so your Data Engineers, Data Scientists, and Data Analysts can each use the language with which they are most familiar.
To reduce infrastructure costs and provide a more elastic approach, you can now have a Serverless SQL Warehouse! For now, it is only available on Amazon Web Services (AWS), but support for Microsoft Azure and Google Cloud Platform (GCP) is coming soon (full article).
I am already looking forward to next year's Data + AI Summit by Databricks!
Need help using Databricks? Need help designing and implementing your future Lakehouse? Ippon can help! Drop us a line at firstname.lastname@example.org.