{"id":9797,"date":"2024-02-28T23:32:21","date_gmt":"2024-02-28T23:32:21","guid":{"rendered":"https:\/\/craftydba.com\/?p=9797"},"modified":"2024-12-26T21:02:42","modified_gmt":"2024-12-26T21:02:42","slug":"thread-05-data-engineering-with-fabric","status":"publish","type":"post","link":"https:\/\/craftydba.com\/?p=9797","title":{"rendered":"Thread 05 \u2013 Data Engineering with Fabric"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/microsoft-fabric-logo.png\" alt=\"Microsoft Fabric Logo\" width=\"100\" height=\"100\" align=\"\u201dleft\u201d\" hspace=\"\u201d50\u201d\" vspace=\"\u201d50\u201d\" \/><\/p>\n<h2 style=\"color: green;\">Data Presentation Layer<\/h2>\n<p>Microsoft Fabric allows the developer to create delta tables in the lake house.  The bronze tables contain multiple versions of the truth and the silver tables a cleaned up, single version of the truth.  How can we combine the silver tables into a relational model for consumption from the gold layer?<\/p>\n<h4 style=\"color: brown;\">Business Problem<\/h4>\n<p>Our manager at <b>adventure works<\/b> has asked us to use a metadata driven solution to ingest CSV files from external storage into a Microsoft Fabric.  A typical medallion architecture will be used in the design.  The final goal of the project is to have end users access the gold layer tables using their favorite SQL tool.<\/p>\n<h4 style=\"color: brown;\">Technical Solution<\/h4>\n<p>Microsoft Fabric supports two data warehouse <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-warehouse\/data-warehousing\">components<\/a>:  SQL Analytics Endpoint and Synapse Data Warehouse.  The image below shows that the endpoint is <b>read-only<\/b> while the warehouse supports both <b>read<\/b> and <b>write<\/b> operations.  
It makes sense to use the SQL Analytics Endpoint since we do not have to write to the tables.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-000.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The following topics will be covered in this article (thread).<\/p>\n<ol>\n<li>create new lake house<\/li>\n<li>create + populate metadata table<\/li>\n<li>import + update child pipeline<\/li>\n<li>create + execute parent pipeline<\/li>\n<li>review bronze + silver tables<\/li>\n<li>use warehouse view to combine tables<\/li>\n<li>use warehouse view to aggregate table<\/li>\n<li>connect to warehouse via SSMS<\/li>\n<\/ol>\n<h4 style=\"color: brown;\">Architectural Overview<\/h4>\n<p>The architectural diagram for the metadata driven design is seen below.  This is the <b>child pipeline<\/b> that I have been talking about.  Please remember that the data pipeline uses a full refresh pattern.  The bronze tables will contain all processed files and the silver tables will contain the most recent file (data).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/architecture-diagram.png\" width=\"600\" height=\"600\" ><br \/>\n<\/p>\n<p>The <b>parent pipeline<\/b> will call the child pipeline for each meta data entry.  At the end of a successful execution, the lake house will contain both bronze and silver tables for all <b>saleslt<\/b> source files.  We will be working on creating the parent pipeline in this article.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/presentation-layer.png\" width=\"605\" height=\"605\" ><br \/>\n<\/p>\n<p>The above diagram shows how the presentation layer works.  By default, all tables in the lake house (spark engine) are available to the SQL Analytics Endpoint as read only.  To date, spark views are <font style=\"color: brown;\">not supported<\/font>.  
Do not fret.  We can combine and aggregate data using SQL views in the data warehouse.  After completing the warehouse or gold layer, we will retrieve some data using my favorite tool (SQL Server Management Studio).<\/p>\n<h4 style=\"color: brown;\">Meta Data Tables<\/h4>\n<p>A meta data driven design allows the developer to automate the copying of data into the lake (<b>raw<\/b> files) as well as building the <b>bronze<\/b> and <b>silver<\/b> tables.  The image below shows the meta data table viewed from the Lake House explorer.  Please note, only the meta data table exists at this time and the local raw folder is empty.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-010.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Let&#8217;s review the meta data table right now; it contains the following fields.<\/p>\n<ol>\n<li>pipeline id &#8211; just an identifier<\/li>\n<li>container name &#8211; same path in source and destination storage<\/li>\n<li>file name &#8211; same file in source and destination storage<\/li>\n<li>header flag &#8211; does the data file have a header line<\/li>\n<li>delimiter string &#8211; how is the tuple divided into columns<\/li>\n<li>table name &#8211; the root name of the table<\/li>\n<li>schema string &#8211; infer schema or use this definition<\/li>\n<\/ol>\n<p>Each piece of information in the delta table is used by either the pipeline activities or spark notebooks to dictate which action to perform.  Of course, there is a one-to-one mapping from source data files to rows of data in the meta data table.  Please see the folder listing of the source storage account as shown below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-005.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Please look at the <b>nb-create-meta-data<\/b> notebook for more details on how to reload the meta data table from scratch.  
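<\/p>\n<p>To make the field list above concrete, here is a hedged Python sketch of what one metadata row might look like and how a loop can drive the loads.  The column and file names are illustrative placeholders, not the actual contents of the <b>nb-create-meta-data<\/b> notebook.<\/p>

```python
# Hypothetical sketch of a metadata-driven load list.
# Column names mirror the article's field list; actual names may differ.
meta_rows = [
    {
        'pipeline_id': 1,                 # just an identifier
        'container_name': 'sc-adv-wrks',  # placeholder container path
        'file_name': 'DimProduct.csv',    # same file in source + destination
        'header_flag': True,              # data file has a header line
        'delimiter_str': ',',             # how a tuple is divided into columns
        'table_name': 'dim_product',      # root name of the bronze and silver tables
        'schema_str': None,               # None means infer the schema
    },
]

# One row per source data file; each row dictates the action to perform.
for row in meta_rows:
    mode = 'infer schema' if row['schema_str'] is None else 'apply supplied schema'
    print(row['file_name'], row['table_name'], mode)
```

<p>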
Just download the zip file at the end of the article and find the file under the code folder.<\/p>\n<h4 style=\"color: brown;\">Parent Pipeline<\/h4>\n<p>If you did not notice, I created a new lake house named <b>lh_adv_wrks<\/b>.  Since this task is so easy, I did not include any instructions.  See the MS Learn <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/onelake\/create-lakehouse-onelake\">webpage<\/a> for more details.  The data pipeline named <b>pl-refresh-advwrks<\/b> is the parent pipeline we are building out today.  It calls the child pipeline named <b>pl-delimited-full-load<\/b> with the pipeline id ranging from 1 to 11.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-012.png\" width=\"600\" height=\"600\" ><\/p>\n<p>To write a while loop, we usually need two variables in a programming language like Python.  The first variable is a counter and the second is the dynamic limit.<\/p>\n<pre class=\"lang:python theme:familiar mark:1,2-4\" title=\"python language - while loop\">\ncnt = 1\nlmt = 5\nwhile cnt < lmt:\n  print(f\"Number is {cnt}!\")\n  cnt = cnt + 1\n<\/pre>\n<p>Data Factory does not allow the developer to set a variable to itself.  It is just a limitation of the tool.  The image below shows the error when trying to set the variable named <b>var_cnt<\/b> to itself plus one.  
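<\/p>\n<p>Here is a hedged Python analogy of the restriction: pretend an assignment may not reference the variable it sets, so the increment is split across a temporary variable.  The variable names match the pipeline, but this is ordinary Python, not pipeline code.<\/p>

```python
# Python analogy of the until-loop increment without self-assignment.
var_cnt = 1   # counter, one per metadata row
var_lmt = 11  # pipeline ids range from 1 to 11

while var_cnt <= var_lmt:
    # ... call the child pipeline pl-delimited-full-load for id var_cnt ...
    var_cur = var_cnt + 1  # Set Variable 1: temporary holds counter plus one
    var_cnt = var_cur      # Set Variable 2: copy temporary back to the counter
```

<p>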
The solution to this problem is to use another variable.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-015.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The image below shows the test condition for the <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/data-factory\/control-flow-until-activity\">until loop<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-013.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The variable named <b>var_cur<\/b> is used as a temporary variable since we cannot use self-assignment.  Thus, we set the current value equal to the counter plus one.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-014.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The last step of the process is to set the counter to the current value.  See the assignment of the <b>var_cnt<\/b> variable below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-016.png\" width=\"600\" height=\"600\" ><\/p>\n<p>After successful execution of the parent pipeline, the lake house will contain refreshed data.<\/p>\n<h4 style=\"color: brown;\">System Review<\/h4>\n<p>The workspace is where all the objects in Fabric are shown.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-040.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The list below shows the objects that are part of the Adventure Works lake house solution.<\/p>\n<ol>\n<li>nb-create-meta-data - this notebook builds the delta table that describes how to load the system<\/li>\n<li>nb-delimited-full-load - read raw files, refresh bronze + silver tables<\/li>\n<li>nb-test-schema - this Spark SQL code tests the schema used in the meta data table<\/li>\n<li>pl-delimited-full-load - copy file 
from source to raw, execute spark notebook<\/li>\n<li>pl-refresh-advwrks - call the full load pipeline for all source files<\/li>\n<li>lh_adv_wrks - the lake house<\/li>\n<ul>\n<li>SQL analytics endpoint - external programs can interact with lake via TDS<\/li>\n<li>semantic model - the default Power BI data model<\/li>\n<\/ul>\n<\/ol>\n<p>OneLake is where all the action happens.  If we right click on the lake house object and select open, the lake house explorer will be shown.  The image below shows the versioning of data <b>Files<\/b> in the raw zone for the adventure works dataset.  Each time we execute the pipeline, a CSV file is copied to a sub-folder that has today's date in sortable format.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-045.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The image below shows a bronze and silver table for each of the source files.  There are a total of 23 delta <b>Tables<\/b> shown by the explorer.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-050.png\" width=\"600\" height=\"600\" ><\/p>\n<p>In the next section, we will be using the SQL analytics endpoint to create views for our presentation layer.<\/p>\n<h4 style=\"color: brown;\">Data Warehouse<\/h4>\n<p>Make sure the keyword <b>warehouses<\/b> is at the top of the explorer.  In the lake house explorer, we have only tables and files.  In the warehouse explorer, we can create other objects such as views, functions and stored procedures.  Today, we are focusing on views for our presentation layer.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-055.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The first view we are going to create is named <b>silver_dim_products<\/b>.  
The SQL code below creates the view that combines the three product-related tables into one dataset.  It is a coin toss on what prefix to use on the view.  I chose to use <b>silver<\/b> since we did not perform any calculations.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-10\" title=\"SQL endpoint - combine tables\">\n-- ******************************************************\n-- *\n-- * Name:         sql_silver_dim_products.txt\n-- *     \n-- * Design Phase:\n-- *     Author:   John Miner\n-- *     Date:     03-01-2024\n-- *     Purpose:  Combine 3 tables into one view.\n-- * \n-- ******************************************************\n\nCREATE VIEW [silver_dim_products]\nAS \nSELECT\n  p.ProductKey, \n  p.ProductAlternateKey, \n  p.WeightUnitMeasureCode, \n  p.SizeUnitMeasureCode, \n  p.EnglishProductName,\n  p.StandardCost, \n  p.FinishedGoodsFlag, \n  p.Color, \n  p.SafetyStockLevel, \n  p.ReorderPoint, \n  p.ListPrice, \n  p.Size,  \n  p.SizeRange, \n  p.Weight, \n  p.DaysToManufacture, \n  p.ProductLine, \n  p.DealerPrice, \n  p.Class, \n  p.Style, \n  p.ModelName, \n  p.StartDate, \n  p.EndDate, \n  p.Status,\n  sc.ProductSubcategoryKey, \n  sc.ProductSubcategoryAlternateKey, \n  sc.EnglishProductSubcategoryName,\n  pc.ProductCategoryKey, \n  pc.EnglishProductCategoryName\nFROM \n  silver_dim_product_full as p\nJOIN \n  silver_dim_product_subcategory_full as sc \nON \n  p.ProductSubcategoryKey = sc.ProductSubcategoryKey\nJOIN \n  silver_dim_product_category_full as pc\nON \n  pc.ProductCategoryKey = sc.ProductCategoryKey;\n<\/pre>\n<p>The output from previewing the SQL View is shown below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-060.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The view named <b>gold_report_data<\/b> combines data from six tables into one flattened dataset ready for reporting.  See the code below for details.  
This view is truly a <b>gold layer<\/b> object.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-10\" title=\"SQL endpoint - combine tables + calculate columns\">\n-- ******************************************************\n-- *\n-- * Name:         sql_gold_report_data.txt\n-- *     \n-- * Design Phase:\n-- *     Author:   John Miner\n-- *     Date:     03-01-2024\n-- *     Purpose:  Combine tables for reporting.\n-- * \n-- ******************************************************\n\nCREATE VIEW gold_report_data\nAS\nSELECT\n   p.EnglishProductCategoryName\n  ,Coalesce(p.ModelName, p.EnglishProductName) AS Model\n  ,c.CustomerKey\n  ,s.SalesTerritoryGroup AS Region\n  ,ABS(CAST((datediff(d, getdate(), c.BirthDate) \/ 365.25) AS INT)) as Age\n  ,CASE\n      WHEN c.YearlyIncome < 40000 THEN 'Low'\n      WHEN c.YearlyIncome > 60000 THEN 'High'\n      ELSE 'Moderate'\n  END AS IncomeGroup\n  ,d.CalendarYear\n  ,d.FiscalYear\n  ,d.MonthNumberOfYear AS Month\n  ,f.SalesOrderNumber AS OrderNumber\n  ,f.SalesOrderLineNumber AS LineNumber\n  ,f.OrderQuantity AS Quantity\n  ,f.ExtendedAmount AS Amount\nFROM\n  silver_fact_internet_sales_full as f\nINNER JOIN \n  silver_dim_date_full as d\nON \n  f.OrderDateKey = d.DateKey\n\nINNER JOIN \n  silver_dim_products as p\nON \n  f.ProductKey = p.ProductKey\n    \nINNER JOIN \n  silver_dim_customer_full as c\nON \n  f.CustomerKey = c.CustomerKey\n\nINNER JOIN \n  silver_dim_geography_full as g\nON \n  c.GeographyKey = g.GeographyKey\n\nINNER JOIN \n  silver_dim_sales_territory_full as s\nON \n  g.SalesTerritoryKey = s.SalesTerritoryKey;\n<\/pre>\n<p>The image below shows sample data from the view.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-065.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The last step is to use this flattened data in the gold view to figure out what sold best in December 2010 in North America.  
The bike model named \"Road-150\" was the top seller.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-070.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The SQL query seen below was used to generate the previous result set.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-10\" title=\"SQL endpoint - aggregate report data\">\n-- ******************************************************\n-- *\n-- * Name:         sql_analyze_sales_by_time.txt\n-- *     \n-- * Design Phase:\n-- *     Author:   John Miner\n-- *     Date:     03-01-2024\n-- *     Purpose:  Aggregate sales by time.\n-- * \n-- ******************************************************\n\nSELECT \n  CalendarYear as RptYear,\n  Month as RptMonth,\n  Region as RptRegion,\n  Model as ModelNo,\n  SUM(Quantity) as TotalQty,\n  SUM(Amount) as TotalAmt\nFROM \n  gold_report_data \nGROUP BY\n  CalendarYear,\n  Month,\n  Region,\n  Model\nORDER BY\n  CalendarYear,\n  Month,\n  Region;\n<\/pre>\n<p>What I did not cover is the fact that any SQL scripts that are created by hand or by right click are stored in the section called <b>My Queries<\/b>.  These queries can be pulled up and re-executed any time you want.  Just like a SQL database, an object has to be dropped before being created again.<\/p>\n<h4 style=\"color: brown;\">Using the SQL endpoint<\/h4>\n<p>We need to retrieve the fully qualified name of the SQL analytics endpoint.  Use the copy icon to retrieve this long string.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-075.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Fabric only supports <b>Microsoft Entra<\/b>, formerly known as Azure Active Directory, with multi-factor authentication (MFA).  The image below shows the most recent SQL Server Management Studio making a connection to the SQL endpoint.  
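<\/p>\n<p>For developers who prefer code over SSMS, a similar connection can be sketched in Python.  This is a hedged example, not a tested recipe: the server name is a placeholder for the endpoint string copied above, and it assumes the pyodbc package and the Microsoft ODBC Driver 18 are installed on the client machine.<\/p>

```python
# Build a connection string for the Fabric SQL analytics endpoint.
# The Authentication keyword triggers the browser-based Entra MFA prompt.
server = 'REPLACE-WITH-COPIED-ENDPOINT.datawarehouse.fabric.microsoft.com'
database = 'lh_adv_wrks'

conn_str = (
    'DRIVER={ODBC Driver 18 for SQL Server};'
    + 'SERVER=' + server + ';'
    + 'DATABASE=' + database + ';'
    + 'Authentication=ActiveDirectoryInteractive;'
    + 'Encrypt=yes;'
)

# Uncomment to run the gold layer query interactively:
# import pyodbc
# with pyodbc.connect(conn_str) as conn:
#     for row in conn.execute('SELECT TOP 5 * FROM gold_report_data;'):
#         print(row)
```

<p>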
The user named john@craftydba.com is signing into the service.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-080.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The login action from SSMS will open a web browser to allow the end user to enter credentials.  Since I work with many different accounts, I have to choose the correct user name. <\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-082.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The authenticator application is used to verify that the correct user is signing into the system.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-085.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Once authentication is complete, we can close the web browser window.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-090.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The last step is to make sure the result set that was created in our Fabric Data Warehouse matches the result set retrieved by SSMS using the SQL endpoint.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-005-100.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The real power of OneLake is the ability to mesh data sets from a variety of locations into one place.  The SQL analytics endpoint allows users to leverage existing report and\/or developer tools to look at the data in the warehouse.<\/p>\n<h4 style=\"color: brown;\">Summary<\/h4>\n<p>One thing I really like about Microsoft Fabric is the fact that all the services are in one place.  Today, we covered the parent\/child pipeline design.  The parent pipeline retrieves the list of meta data from the delta table.  
For each row, data is copied from the source location to the raw layer.  Additionally, a bronze table showing history is created as well as a silver table showing the most recent data.  This is a <b>full refresh<\/b> load pattern with history for <b>auditing<\/b>.<\/p>\n<p>Once the data is in the lake house, we can use the data warehouse section of Fabric to create views that combine tables, aggregate rows and calculate values.  This data engineering is used to clean the data for reporting or machine learning.<\/p>\n<p>Finally, there are two ways to use the data.  The <b>semantic model<\/b> is used by Power BI reports that are built into Fabric.  The SQL <b>analytics endpoint<\/b> is used by external reporting and\/or developer tools to access the data.  The steps to access the warehouse are pretty similar to any relational database management system that we might access.<\/p>\n<p>Next time, I will be talking about how end users can update files in the Lake House using the OneLake explorer. Enclosed is the <a href=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/bundle-05.zip\">zip file<\/a> with the CSV data files, Data Pipelines in JSON format, Spark notebooks and SQL notebooks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data Presentation Layer Microsoft Fabric allows the developer to create delta tables in the lake house. The bronze tables contain multiple versions of the truth and the silver tables contain a cleaned-up, single version of the truth. How can we combine the silver tables into a relational model for consumption from the gold layer? Business Problem Our manager at adventure works has asked us to use a metadata driven solution to ingest CSV files from external storage into Microsoft Fabric. 
A typical medallion architecture will be used in the&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1132],"tags":[1136],"class_list":["post-9797","post","type-post","status-publish","format-standard","hentry","category-microsoft-fabric","tag-john-f-miner-iii-microsoft-fabric-data-engineering-apache-spark-data-factory-full-load-code-meta-data-driven-design-learn"],"_links":{"self":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9797","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9797"}],"version-history":[{"count":169,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9797\/revisions"}],"predecessor-version":[{"id":10059,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9797\/revisions\/10059"}],"wp:attachment":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9797"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9797"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9797"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}