With Microsoft announcing the public preview of its unified data and analytics platform, Fabric, the obvious questions are: how does Databricks compare with Fabric, and how do the two integrate, given that both describe themselves as unified data analytics platforms?
We are not quite comparing apples with apples yet. Databricks is narrower and deeper in focus, whereas Fabric starts with a much broader vision: a feature set spanning descriptive, diagnostic, predictive, and prescriptive analytics, from batch through streaming workloads, built on OneLake, an open, governed, unified SaaS data lake that serves as a single place to store organizational data.
Before we discuss how Fabric and Databricks work together, it is critical to understand what OneLake is and how it addresses data silos:
- One data lake for the entire organization
- One copy of data for use with multiple analytical engines
Virtualization in OneLake is realized through an entity called a shortcut. A shortcut is a OneLake feature that enables data to be reused without copying it. Shortcuts function like symbolic links, providing a live connection to target data in another location. Shortcuts can point to any data within OneLake, or to external data lakes such as Azure Data Lake Storage Gen2 (ADLS Gen2) or Amazon S3. So essentially, OneLake data can be accessed in one of two ways: use OneLake with existing data lakes, or use data landed directly in OneLake.
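Shortcuts are usually created through the Fabric UI, but they can also be created programmatically. The sketch below builds the request body for an ADLS Gen2 shortcut; the endpoint and payload shape are assumptions based on the Fabric REST "Create Shortcut" API, and the names and connection GUID are placeholders, so verify the field names against the current documentation before use.

```python
# Hedged sketch: building the body for the Fabric "Create Shortcut" REST call.
# The payload shape is an assumption based on the documented API; names,
# URLs, and the connection id below are placeholders.
import json

def adls_shortcut_payload(name: str, adls_url: str, subpath: str, connection_id: str) -> dict:
    """Build a request body that links an ADLS Gen2 folder into a lakehouse's Files area."""
    return {
        "path": "Files",               # where the shortcut appears inside the lakehouse
        "name": name,                  # display name of the shortcut
        "target": {
            "adlsGen2": {
                "location": adls_url,      # e.g. https://<account>.dfs.core.windows.net
                "subpath": subpath,        # container and folder being linked
                "connectionId": connection_id,
            }
        },
    }

payload = adls_shortcut_payload(
    "SalesData", "https://myacct.dfs.core.windows.net", "/sales/raw", "<connection-guid>"
)
print(json.dumps(payload, indent=2))

# Hypothetical call (requires an Azure AD bearer token):
# POST https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{itemId}/shortcuts
```

Once the shortcut exists, the external data appears under the lakehouse as if it lived in OneLake, with no data movement.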
Also, since OneLake exposes the same APIs as ADLS Gen2, any application that already talks to ADLS, such as Databricks, can start using OneLake data with nothing more than an endpoint change.
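As a minimal sketch of that endpoint change, the helpers below build the two abfss paths side by side; the OneLake URI follows Microsoft's documented `onelake.dfs.fabric.microsoft.com` scheme, and the account, workspace, and lakehouse names are placeholders.

```python
# Sketch: pointing a Databricks notebook at OneLake instead of ADLS Gen2.
# Only the endpoint and naming convention change; the read code stays the same.

def adls_path(account: str, container: str, subpath: str) -> str:
    """Classic ADLS Gen2 abfss path."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{subpath}"

def onelake_path(workspace: str, lakehouse: str, subpath: str) -> str:
    """Equivalent OneLake abfss path for a lakehouse item."""
    return f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/{lakehouse}.Lakehouse/{subpath}"

old = adls_path("mystorageacct", "data", "sales/orders")
new = onelake_path("MyWorkspace", "MyLakehouse", "Files/sales/orders")

# In a Databricks notebook the read itself is unchanged:
# df = spark.read.format("delta").load(new)
```

The swap is deliberately symmetric: because both endpoints speak the ADLS Gen2 API, no connector or library change is needed on the Databricks side.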
OneLake with existing Data Lakes
This scenario is most common when your Databricks instance is already integrated with ADLS Gen2. In that case, you just create a shortcut to the storage account in OneLake. Because OneLake uses the same APIs as ADLS Gen2 and supports the same Delta/Parquet format for data storage, Azure Databricks notebooks can be seamlessly updated to use the OneLake endpoints for the data. This keeps paths consistent across experiences, whether the data consumer is querying through a warehouse in Microsoft Fabric or a notebook in Azure Databricks.
Leverage data landed directly in OneLake
In this case, data is landed directly in OneLake and processed by Databricks in a standard medallion architecture: data is cleaned and refined into gold-layer data products that can be served to any other Fabric component without being copied multiple times. This is especially beneficial for Power BI, which can now connect to live data (courtesy of Direct Lake mode) while still achieving performance close to what import mode delivers.
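The medallion flow described above can be sketched as follows. The workspace, lakehouse, and table names are placeholders, and the Spark calls are shown as comments since they require a Databricks session with OneLake authentication configured.

```python
# Sketch: Databricks refines silver data into a gold Delta table stored in
# OneLake, where Fabric engines (including Power BI in Direct Lake mode)
# can read it without further copies. All names below are placeholders.

ONELAKE = "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse"

silver_path = f"{ONELAKE}/Tables/silver_orders"
gold_path = f"{ONELAKE}/Tables/gold_daily_revenue"

# In a Databricks notebook:
# silver = spark.read.format("delta").load(silver_path)
# gold = (silver.groupBy("order_date")
#               .agg({"amount": "sum"})
#               .withColumnRenamed("sum(amount)", "daily_revenue"))
# gold.write.format("delta").mode("overwrite").save(gold_path)
```

Because the gold table lands under the lakehouse's Tables area as Delta, Power BI can attach to it in Direct Lake mode with no import step.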
With the "Direct Lake" paradigm in play, Fabric and Databricks can coexist to simplify analytics workloads. Whether customers have a well-established Azure Databricks practice or are just getting started, Microsoft Fabric and OneLake bring rich data management features that can only increase the value of Azure Databricks usage.
The bottom line: keep running your current data platform as you do today, while adding Fabric into the mix without creating a strong dependency. Of course, over time, feedback and learnings will accumulate that may lead you to re-evaluate your position, moving either closer to or further from Fabric.
Thanks for reading. If you want to share your case studies or connect, please ping me on LinkedIn.