Databricks Data + AI Summit 2023: Key Announcements

Sandip Roy
Jul 10, 2023

For the last few years Databricks focused primarily on enhancing its Lakehouse and warehousing capabilities, but with the resurgence of AI it has shifted its focus toward Lakehouse AI as the premier platform for accelerating the journey of generative AI into production.

Databricks' open-source, open-ended architecture is paying off. Not only is the Databricks platform built on MLflow, Apache Parquet and Delta Lake, but it extends further into data sharing (a.k.a. Delta Sharing) and into compatibility between Apache Hudi, Apache Iceberg and its own Linux Foundation Delta Lake.

Unity Catalog, released last year as a data governance tool, is steadily proving to be the nucleus of Databricks' strategy for unifying data and AI, federation, security and governance.

Data Lake

Unity Catalog

Unity Catalog acts as a central metadata store to secure and govern access, applying policies at the table, row and column level. Today, Unity Catalog does not yet let you define common data-access governance policies across platforms, but Databricks plans to add those capabilities in the future.
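As an illustration, table-level and row-level access in Unity Catalog is expressed in plain SQL. The catalog, schema, table and group names below are hypothetical; the `GRANT` and row-filter syntax follows Databricks SQL.

```sql
-- Grant read access on a table to an account-level group
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;

-- A row-filter function: admins see everything, others only EMEA rows
CREATE OR REPLACE FUNCTION main.sales.region_filter(region STRING)
RETURNS BOOLEAN
RETURN IS_ACCOUNT_GROUP_MEMBER('admins') OR region = 'EMEA';

-- Attach the filter so it is applied on every read of the table
ALTER TABLE main.sales.orders
  SET ROW FILTER main.sales.region_filter ON (region);
```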

Delta 3.0

UniForm, or unified format, was one of the more impactful announcements, as it provides compatibility between Apache Hudi, Apache Iceberg and Delta Lake. Behind the scenes, Databricks generates metadata for all three formats over a single copy of the underlying data. This allows Delta tables to be read as if they were Hudi- or Iceberg-formatted. This feature is a key part of the open-source Linux Foundation Delta Lake 3.0 release.
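In practice, UniForm is switched on per table via a Delta table property; the table name below is hypothetical.

```sql
-- Enabling UniForm asks Delta to also maintain Iceberg-compatible
-- metadata over the same underlying Parquet data files
CREATE TABLE main.default.events (id BIGINT, ts TIMESTAMP)
TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg');
```

Iceberg-aware engines can then read the table through its Iceberg metadata while Delta clients keep writing as usual.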

Other announcements include Delta Kernel, a simplified connector-development kit that protects connectors against version changes, and Liquid Clustering, which makes partitioning more cost-efficient and lowers latency for read and write operations.
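Liquid Clustering replaces rigid hive-style partition columns with a `CLUSTER BY` clause that Databricks can re-optimize over time; the table below is a hypothetical example.

```sql
-- No PARTITIONED BY: clustering keys can evolve without rewriting paths
CREATE TABLE main.default.trips (
  pickup_date DATE,
  vendor_id   INT,
  fare        DOUBLE
)
CLUSTER BY (pickup_date, vendor_id);
```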

The Hive Metastore Interface allows any software compatible with Apache Hive, such as Amazon EMR, Apache Spark, Amazon Athena, Presto and Trino, to connect to Unity Catalog. This will further boost openness and consistent data governance, provide an easy path to modernize legacy workloads, and ultimately optimize cost.
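Conceptually, an external engine connects the same way it would to any Hive metastore, by pointing its metastore client at the workspace's HMS endpoint. The values below are placeholders, not the actual Unity Catalog endpoint details.

```
# Hypothetical values: point a Hive-compatible engine's metastore
# client at the workspace's Unity Catalog HMS interface
hive.metastore.uris=thrift://<workspace-host>:9083
spark.sql.catalogImplementation=hive
```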

Lakehouse Federation

Capabilities in Unity Catalog allow you to discover, query, and govern data across data platforms including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse, Google’s BigQuery, and more from within Databricks without moving or copying the data, all within a simplified and unified experience. This means Unity Catalog’s advanced security features such as row and column level access controls, discovery features like tags, and data lineage will be available across these external data sources, ensuring consistent governance.
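The federation flow can be sketched in Databricks SQL: first register a connection to the external system, then surface it as a foreign catalog. The connection, catalog and table names below are hypothetical.

```sql
-- Hypothetical connection to an external PostgreSQL database;
-- credentials come from a secret scope rather than plain text
CREATE CONNECTION pg_sales TYPE postgresql
OPTIONS (
  host 'pg.example.com',
  port '5432',
  user secret('sales_scope', 'pg_user'),
  password secret('sales_scope', 'pg_password')
);

-- Expose it as a foreign catalog; the data stays in PostgreSQL
CREATE FOREIGN CATALOG sales_ext
USING CONNECTION pg_sales
OPTIONS (database 'sales');

-- Query it through Unity Catalog without moving or copying data
SELECT * FROM sales_ext.public.orders LIMIT 10;
```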

Data Warehouse and Streaming

Materialized views

Databricks introduced materialized views on streaming data, built on Delta Live Tables, along with "volumes" to store unstructured data.
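A materialized view is declared like an ordinary view but keeps its results precomputed and incrementally refreshed; the source table and view below are hypothetical.

```sql
-- Hypothetical incremental aggregate over an orders table
CREATE MATERIALIZED VIEW main.default.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM main.default.orders
GROUP BY order_date;
```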

Indexless Index

Rather than requiring users to define and maintain indexes, Databricks uses AI to predict them. AI is also used to determine data layout and clustering, which has resulted in lower storage costs.

LakehouseIQ

LakehouseIQ uses generative AI to understand jargon, data-usage patterns, organizational structure and more, so that it can answer questions within the context of a business. This metadata is stored in Unity Catalog, and natural language can be used to query and understand data.

The Databricks Assistant, powered by LakehouseIQ, is in preview.

Generative AI and Machine Learning

Lakehouse AI includes capabilities that let users build generative AI applications, manage the entire AI lifecycle, and monitor and govern the process. Databricks is fundamentally aligned to support all three approaches to LLMs: foundation models from the Databricks Marketplace, fine-tuned models, and custom-trained models.

Vector Search

Databricks Vector Search enables developers to improve the accuracy of their generative AI responses through embeddings search. It fully manages and automatically creates vector embeddings from files in Unity Catalog, Databricks' flagship solution for unified search and governance across data, analytics and AI, and keeps them updated automatically through seamless integration with Databricks Model Serving. Additionally, developers can add query filters to provide even better outcomes for their users.

It is integrated with Databricks Model Serving and Auto ML.
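Databricks has not published the internals, but the core idea of embeddings search can be sketched with plain cosine similarity. The vectors and document ids below are toy stand-ins; a real system would use learned embeddings and an approximate-nearest-neighbour index.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(query_vec, index, top_k=2):
    """Return the top_k document ids most similar to query_vec."""
    scored = sorted(index,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy "embeddings" for three documents
index = [
    ("refund_policy",  [0.9, 0.1, 0.0]),
    ("shipping_times", [0.1, 0.9, 0.2]),
    ("privacy_notice", [0.0, 0.2, 0.9]),
]

print(search([0.8, 0.2, 0.1], index, top_k=1))  # → ['refund_policy']
```

A query vector close to a document's embedding ranks that document first, which is exactly how a retrieval step feeds better context to a generative model.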

MLflow 2.5

MLflow 2.5 adds AI Gateway, Prompt Tools and Monitoring. AI Gateway is interesting: not only does it handle access control and rate limiting, it also caches predictions to serve repeated prompts.
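The prediction-caching behaviour can be illustrated with a small stdlib sketch; the `complete` function here is a hypothetical stand-in for a routed model call, not the actual AI Gateway API.

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how many "real" model invocations happen

@lru_cache(maxsize=1024)
def complete(prompt: str) -> str:
    """Hypothetical stand-in for a gateway-routed LLM call, keyed by prompt."""
    CALLS["count"] += 1          # simulate one billable model invocation
    return f"answer to: {prompt}"

complete("What is Delta Lake?")
complete("What is Delta Lake?")  # repeated prompt is served from the cache
print(CALLS["count"])            # → 1
```

Serving repeated prompts from a cache avoids re-billing the upstream model for identical requests, which is the win the gateway is after.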

MosaicML Acquisition

Databricks announced it will pay $1.3 billion to acquire MosaicML, an open-source startup with neural-network expertise that has built a platform for organizations to train large language models and deploy generative AI tools based on them. MosaicML's MPT-7B is the most-downloaded LLM in history so far.

English SDK for Spark

Inspired by how GitHub Copilot has revolutionized AI-assisted code development, and because state-of-the-art large language models already know Spark well, Databricks introduced the English SDK: users simply ask questions in English, and the generative AI engine compiles them into PySpark and SQL code. How cool is that?

Marketplace

Lakehouse Apps

It didn’t get as much limelight, but it is akin to the biggest announcement of Snowflake Summit 2023: Snowpark Container Services.

Lakehouse Apps allow third-party products to run in containers inside a user’s Databricks instance. This approach removes the risk of sending data across different products running in their own security environments. Lakehouse Apps build on the open Delta Sharing protocol, which a few partners such as Oracle Cloud Infrastructure (OCI), Dell and Twilio are already using with zero egress cost.

Marketplace scope is vastly expanded. It includes data, AI/ML models, applications, notebooks, schemas, documentation, queries, and volumes. Databricks is now adding monetization capabilities. New data providers, like the London Stock Exchange Group, IQVIA, LiveRamp, and many others have been added.


Sandip Roy

Big Data and Databricks Practice Lead at Wipro Ltd