Thread 01 – Data Engineering with Fabric


Managed Vs Unmanaged Tables

Microsoft Fabric was released to general availability on November 15th, 2023. I will be writing a quick post periodically in 2024 to get you up to speed on how to manipulate data in the lakehouse using Spark. I really like the speed of the starter pools in Microsoft Fabric. A one- to ten-node pool is available for consumption in less than 10 seconds. Read all about this new compute on this Microsoft Learn page.

Business Problem

Our manager has given us weather data to load into Microsoft Fabric. How do we accomplish this task?

Technical Solution

This is a very simple use case for our engineers to get used to the platform. The following topics will be explored in this article (thread).

  1. create account
  2. create lake house
  3. create directories
  4. upload files
  5. create notebook
  6. create unmanaged table
  7. create managed table
  8. query both tables

Architectural Overview

The architectural diagram shows how information flows from a source system into a delta lakehouse, is transformed by programs, and is used by end users. To get source data into the lake, we can use any of three methods to retrieve the data as files: pipelines – traditional Azure Data Factory components; dataflows – wrangling data flows based on Power Query; and shortcuts – the ability to link external storage to the lake. Once the data is in the lake, there are two types of programs that can transform the data files: Spark notebooks and dataflows.

Fabric Data Engineering

The articles in this thread will focus on how to use Microsoft Fabric’s version of Apache Spark with the Python language to transform data. However, the code samples are more than likely cross-cloud and/or cross-platform ready with minor changes. Once you learn Apache Spark, you can use that knowledge anywhere it is supported.

Lakehouse Setup

The first thing we need is a Microsoft Fabric account. You can get a free trial account here. You must use a domain account. Any free public account from @outlook.com or @gmail.com will not work. I am using my john@craftydba.com user account below.

The user interface for Microsoft Fabric reminds me of Power BI. I have a default workspace all to myself.

If we select the create option to the left, we get to pick and choose which object we want to create. Since we have no place to put our data, please choose the lakehouse object.

Naming conventions are important for organizing objects. The image below shows that I am prefixing the name of the lakehouse with “lh4”.

I uploaded the weather data files into the raw directory. Please see the medallion architecture for details on naming conventions for folders. See the thread zip file at the end of the article to obtain a copy of the files.

What is a lakehouse in Microsoft Fabric? It is a collection of data files that can be organized by folders. However, we really cannot analyze data files directly. The real power comes when we transform the data into a delta table. Delta tables support the ACID properties of a database and can be queried with Spark SQL.

Spark Notebook

Jupyter notebooks have been incorporated into many different development tools such as Azure Data Studio and Synapse Analytics. It is not surprising that we will be creating Spark notebooks to write our data engineering pipelines in Microsoft Fabric.

Just because we are breaking Python programs into cells does not mean you should not comment your code. At the top of the notebook, I show which topics will be covered in this lesson. Jump to the end of the article to download the thread zip file that has the sample code.

The code below reads the comma separated values (CSV) file for low temperature readings into a Spark DataFrame. Because the column names might be the same between the two data sets, the withColumnRenamed method is used to make the column names unique.
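A minimal sketch of that read is shown below; the file path Files/raw/low_temps.csv and the column names date and temp are assumptions for illustration, not the exact names from the notebook.

    # the Fabric notebook provides a ready-made SparkSession named spark
    # read the low temperature CSV file into a DataFrame
    df_low = (
        spark.read.format("csv")
        .option("header", "true")
        .load("Files/raw/low_temps.csv")
        .withColumnRenamed("temp", "low_temp")
    )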

A similar code snippet is used to read in the high temperature data file. Two fields are renamed in preparation for joining the DataFrames.
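A similar sketch for the high temperature file, again with assumed path and column names; both the date and temp columns are renamed so the upcoming join does not produce ambiguous names.

    # read the high temperature CSV file and rename both columns
    df_high = (
        spark.read.format("csv")
        .option("header", "true")
        .load("Files/raw/high_temps.csv")
        .withColumnRenamed("date", "date2")
        .withColumnRenamed("temp", "high_temp")
    )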

To get our final dataset, we join the two DataFrames on the date columns, drop the duplicate column, and save the result set as a new DataFrame.
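Continuing the sketch with the assumed column names from above:

    # join on the two date columns, then drop the duplicate date column
    df_temps = (
        df_low.join(df_high, df_low["date"] == df_high["date2"], "inner")
        .drop("date2")
    )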

If you executed the code up to this point in the notebook, the Spark Cluster will have a Spark Session that contains our transformed data in the df_temps DataFrame. The last step is to save the data from memory to the data lake.

Hive Tables

There are two ways to create a table in the Hive catalog: managed and unmanaged tables. To create a managed table, we just use the saveAsTable method. One major drawback of this table type is that the storage is controlled by the service. Thus, dropping a table removes both the schema (metadata) and the data (files).
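A one-line sketch of a managed table, assuming the table name weather_managed:

    # managed table: the service controls the storage location, so dropping
    # the table removes both the metadata and the underlying delta files
    df_temps.write.mode("overwrite").saveAsTable("weather_managed")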

Creating an unmanaged table is a two-step process.

First, we need to write the data to storage. Since we did not mount or link any storage to the lakehouse, we will use the local storage under the “Files” directory. I am going to save the data in the delta file format. The second step is to use Spark SQL to create (register) the table over the storage location. The major benefit of unmanaged tables is that a drop table action only gets rid of the metadata, not the data files.
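The first step might look like the sketch below; the Files/bronze/weather_temps folder is an assumption that follows the medallion-style folder naming used in this thread.

    # step 1: write the DataFrame to the Files area in delta format
    df_temps.write.format("delta").mode("overwrite").save("Files/bronze/weather_temps")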

The code above uses the write method with the format option to create a delta file. The code below uses a special command at the top. It is called a magic command. Please see this extensive list of commands. In our notebook cell, the %%sql is replaced with spark.sql(“”) code at execution time.
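A sketch of the second step using the %%sql magic; the table name is illustrative, and the relative Files path may need to be replaced with the full abfss:// path to the lakehouse.

    %%sql
    -- step 2: register an unmanaged (external) table over the delta folder;
    -- dropping this table removes only the metadata, not the data files
    CREATE TABLE IF NOT EXISTS weather_unmanaged
    USING DELTA
    LOCATION 'Files/bronze/weather_temps'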

The image below shows that the two tables have been created in the hive database. Unlike Azure Synapse, which supports the create database command, Microsoft Fabric automatically creates a database when you deploy the lakehouse object. Additionally, the image shows the two-partition delta file in the bronze quality zone.

Testing

The easiest way to test our hive tables is to view them visually. If we click a table, a preview window with the data is shown. Please note that the data is stored as strings. To be more accurate, we should fix this by casting the appropriate columns to dates and double-precision floating-point numbers.
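A hedged sketch of that cast, assuming the table and column names used earlier in this article:

    from pyspark.sql.functions import col

    # cast the string columns to proper date and double types
    df_typed = spark.table("weather_managed").select(
        col("date").cast("date"),
        col("low_temp").cast("double"),
        col("high_temp").cast("double"),
    )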

When working with big data, how can we determine if table A is a copy of table B? The first step is to check row counts.
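One way to do this with Spark SQL, assuming the two table names from the sketches above:

    %%sql
    -- compare row counts between the managed and unmanaged tables
    SELECT 'managed' AS table_name, COUNT(*) AS row_count FROM weather_managed
    UNION ALL
    SELECT 'unmanaged' AS table_name, COUNT(*) AS row_count FROM weather_unmanaged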

The output from the Spark SQL query shows matching row counts.

The cool thing about PySpark is that we can use Spark SQL and/or DataFrame methods whenever we want. The code below filters the imaginary big data sets by a key value such as year. Then, once we determine that the data sets have the same number of rows, we can use the subtract method to validate equality.
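A sketch of that pattern, with an illustrative year value and the assumed table names:

    # filter both tables to one year, then check counts and use subtract
    # to confirm the filtered result sets are identical
    df_a = spark.table("weather_managed").filter("year(date) = 2015")
    df_b = spark.table("weather_unmanaged").filter("year(date) = 2015")

    if df_a.count() == df_b.count() and df_a.subtract(df_b).count() == 0:
        print("the filtered data sets match")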

Summary

Is the above code ready for deployment to production? The answer to that question is no. We did not test whether we can run the same code more than once without errors. When creating a data lake, capturing before and after record counts is very important. This audit trail will help you when debugging issues.

The main component that is missing from Spark notebooks is scheduling. This can be achieved by creating a Data Factory pipeline with a scheduled trigger. Additionally, we can simplify the process by using Spark SQL to expose the CSV files in the raw quality zone.
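As a rough illustration of that idea, an unmanaged table can be registered directly over a raw CSV file; the table name, header option, and relative path below are assumptions and may need adjusting for your lakehouse.

    %%sql
    -- expose the raw CSV file as a queryable table without copying it
    CREATE TABLE IF NOT EXISTS raw_low_temps
    USING CSV
    OPTIONS (path 'Files/raw/low_temps.csv', header 'true')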

What is currently missing in Microsoft Fabric?

Right now, you can create views in the hive catalog but you cannot see them. Additionally, unmanaged tables are not supported by the SQL endpoint. I can understand why views are not supported in the SQL endpoint, since the Spark SQL syntax is a lot larger than the T-SQL syntax; however, they should be shown in the lakehouse explorer. I am hoping the Microsoft Fabric product team puts both of these items on the backlog to be fixed.

What do I like about Microsoft Fabric?

I love the fact that starter pools are blindingly fast. This is a big deal over the major competitor, Databricks. Next time, we will work on streamlining and scheduling the data engineering Spark notebook for weather data. Enclosed is the zip file with the data files and Spark notebook.
