{"id":9447,"date":"2024-02-23T18:47:36","date_gmt":"2024-02-23T18:47:36","guid":{"rendered":"https:\/\/craftydba.com\/?p=9447"},"modified":"2024-12-26T21:02:55","modified_gmt":"2024-12-26T21:02:55","slug":"thread-04-data-engineering-with-fabric","status":"publish","type":"post","link":"https:\/\/craftydba.com\/?p=9447","title":{"rendered":"Thread 04 \u2013 Data Engineering with Fabric"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/microsoft-fabric-logo.png\" alt=\"Microsoft Fabric Logo\" width=\"100\" height=\"100\" align=\"\u201dleft\u201d\" hspace=\"\u201d50\u201d\" vspace=\"\u201d50\u201d\" \/><\/p>\n<h2 style=\"color: green;\">Metadata Driven Pipelines<\/h2>\n<p><font style=\"color: green;\">What is a metadata driven pipeline?<\/font>  Wikipedia defines <a href = \"https:\/\/en.wikipedia.org\/wiki\/Metadata\">metadata<\/a> as &#8220;data that provides information about other data&#8221;.  As a developer, we can create a non parameterized pipeline and\/or notebook to solve a business problem.  However, if we have to solve the same problem a hundred times, the amount of code can get unwieldly.  A better way to solve this problem is to store metadata in the delta lake.  This data will drive how the Azure Data Factory and Spark Notebooks execute.<\/p>\n<h4 style=\"color: brown;\">Business Problem<\/h4>\n<p>Our manager has asked us to create a metadata driven solution to ingest CSV files from external storage into a Microsoft Fabric one lake.  How can we complete this task?<\/p>\n<h4 style=\"color: brown;\">Technical Solution<\/h4>\n<p>We are going to go back to the drawing board for this solution.  First, we have determined that the CSV files stored in the ADLS container are small to medium in size.  Second, we are going to keep historical files around for auditing in the bronze layer.  
Third, the silver layer will present end users with the most recent version of the truth (data).  Since the datasets are small, we are going to use a full load process.<\/p>\n<p>The following topics will be covered in this article (thread).<\/p>\n<ol>\n<li>create metadata table<\/li>\n<li>populate metadata table<\/li>\n<li>create data pipeline<\/li>\n<li>create credentials for storage<\/li>\n<li>create parameters and variables<\/li>\n<li>code for copy activity<\/li>\n<li>code for notebook activity<\/li>\n<li>test for 3 use cases<\/li>\n<li>any future enhancements<\/li>\n<\/ol>\n<h4 style=\"color: brown;\">Architectural Overview<\/h4>\n<p>The architectural diagram for this metadata driven design is seen below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/architecture-diagram.png\" width=\"600\" height=\"600\" ><br \/>\n<br \/>There are three major components:  Azure Data Lake Storage, which is external to the Fabric service; Data Factory Pipeline Activities that read\/write files; and the Lake House storage with its medallion layers. Please note that medallion layers are either folders and\/or naming conventions to organize the data in logical buckets.<br \/>\n<br \/>\nA popular use case of a data lake is to collect data from various disconnected systems (<b>producers<\/b>) and place the data into a single location (<b>hub<\/b>). However, there are always users (<b>consumers<\/b>) that want to access the data using their favorite tools.  That is why a data lake is considered a <b>hub<\/b> and <b>spoke<\/b> design.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/spoke-n-hub.png\" width=\"300\" height=\"300\" ><\/p>\n<p>As a Microsoft Fabric designer, we can use either <b>data pipelines<\/b> or <b>data flows<\/b> to move the data into the lake.  I chose pipelines since they can be easily parameterized.  
At the transformation layer, we must choose between <b>Spark notebooks<\/b> and <b>data flows<\/b>.  Again, I decided to pick notebooks since the code is portable to other platforms.  Since the SQL endpoint does not support Spark views, we are going to have to add this gold layer using <b>data warehouse views<\/b>.<\/p>\n<h4 style=\"color: brown;\">Meta Data Tables<\/h4>\n<p>A meta data driven design allows the developer to automate the copying of data into the lake (<b>raw<\/b> files) as well as building the <b>bronze<\/b> and <b>silver<\/b> tables.  The image below shows the meta data table viewed from the SQL endpoint.<\/p>\n<p><b>This is a spoiler alert.<\/b>  Since the <b>gold<\/b> tables are usually custom for each lake, we will be talking about Synapse Data Engineering and Data Warehouse design in our next article.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-500.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The above image shows the end state of our article.  The Netflix, high temperature, and low temperature raw files have been converted to both bronze and silver tables.<\/p>\n<p>Let&#8217;s focus on the meta data table, which has the following fields.<\/p>\n<ol>\n<li>pipeline id &#8211; just an identifier<\/li>\n<li>container name &#8211; same path in source and destination storage<\/li>\n<li>file name &#8211; same file in source and destination storage<\/li>\n<li>header flag &#8211; does the data file have a header line<\/li>\n<li>delimiter string &#8211; how is the tuple divided into columns<\/li>\n<li>table name &#8211; the root name of the table<\/li>\n<li>schema string &#8211; infer schema or use this definition<\/li>\n<\/ol>\n<p>The purpose of this section is to show the developer how to drop a table, create a table, and insert data.  
These actions are required to create our meta data table and populate it.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"hive database - drop table\">\n%%sql\n--\n--  remove table\n--\n\ndrop table if exists meta_data;\n<\/pre>\n<p>The above Spark SQL drops the meta data table and the below SQL code creates the table.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"hive database - create table\">\n%%sql\n--\n--  create table\n--\n\ncreate table meta_data \n(\n    pipeline_id int,\n    container_nm string,\n    file_nm string,\n    header_flg boolean,\n    delimiter_str string,\n    table_nm string,\n    schema_str string\n);\n<\/pre>\n<p>Insert meta data for the Netflix dataset.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"hive database - netflix meta data\">\n%%sql\n--\n--  netflix dataset\n--\n\ninsert into meta_data values (1, 'raw\/netflix', 'netflix_titles.csv', true, ',', 'netflix', '');\n<\/pre>\n<p>Insert meta data for the high temperature dataset.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"hive database - high temp meta data\">\n%%sql\n--\n--  weather - high temp dataset\n--\n\ninsert into meta_data values (2, 'raw\/weather', 'high_temps.csv', true, ',', 'high_temps', 'date timestamp, temp float');\n<\/pre>\n<p>Insert meta data for the low temperature dataset.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"hive database - low temp data\">\n%%sql\n--\n--  weather - low temp dataset\n--\n\ninsert into meta_data values (3, 'raw\/weather', 'low_temps.csv', true, ',', 'low_temps', 'date timestamp, temp float');\n<\/pre>\n<p>Right away, we can see that we will want a gold view to <b>join<\/b> the silver high and low tables into one logical dataset.  This is not an uncommon request.  I have a current client that has an old and a new payroll system.  In the data lake, they want one view of the payroll register.  
Thus, the view <b>unions<\/b> several tables into one logical one.<br \/>\n<\/p>\n<h4 style=\"color: brown;\">Azure Data Lake Storage<\/h4>\n<p>At this time, Microsoft Fabric does not support an <b>on-premises data gateway<\/b>.  Use this <a href = \"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/compare-fabric-data-factory-and-azure-data-factory\">link<\/a> to find out the major differences between Azure Data Factory and Microsoft Fabric Data Factory.  In our proof of concept example, datasets will be dropped into an external storage account.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-100.png\" width=\"600\" height=\"600\" ><\/p>\n<p>To make life easy, the container and path will be the same between the source and destination.  The above image shows two temperature files and the below image shows one Netflix file.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-105.png\" width=\"600\" height=\"600\" ><\/p>\n<p>To simplify the example, we are going to use an account key to access the Azure Storage Account.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-120.png\" width=\"600\" height=\"600\" ><\/p>\n<p>For a production system, I would use a service principal since we can restrict access using both RBAC and ACL security.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-095.png\" width=\"600\" height=\"600\" ><\/p>\n<p>In the next section, we are going to start building our data pipeline.<\/p>\n<h4 style=\"color: brown;\">Data Pipeline &#8211; Copy Activity<\/h4>\n<p>Click the create icon on the left menu to start our Data Factory adventure.  Both Data Flows and Data Pipelines are available to the developer.  
Please create a <b>new<\/b> Data Pipeline.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-110.png\" width=\"600\" height=\"600\" ><\/p>\n<p>I am going to highlight the two activities that are key to our program.  The <b>copy activity<\/b> allows a developer to read data from the source and write data to the destination. The <b>notebook activity<\/b> allows us to call a Spark notebook that will tear down and rebuild the bronze and silver tables.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-230.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Let&#8217;s work on the connection next.  In prior versions of Data Factory, the developer created a <b>linked service<\/b> that contained the connection information and a <b>dataset<\/b> that described the data.  For example, a linked service named <b>ls_asql_hris<\/b> has the connection to an Azure SQL database and a dataset named <b>ds_asql_employee<\/b> describes the fields in the table.<\/p>\n<p>These two objects are replaced by one connection object in Fabric.  The image below shows the support for many types of connections.  
We want to set one up for Azure Data Lake Storage.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-115.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Supply the fully qualified location of the storage account as well as the account key.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-125.png\" width=\"600\" height=\"600\" ><\/p>\n<p>It is important to follow your company&#8217;s naming conventions for activities and to add detailed comments when possible.<br \/>\n The image below shows the copy activity we are about to configure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-235.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Since we are only copying the file from the source to the destination storage, please choose a <b>binary copy<\/b>.  The image below shows that the container name and file name are parameters to the source definition.  <\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-240.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The same parameters are used for the destination with one addition.  We are going to add the current date formatted as &#8220;YYYYMMDD&#8221; as a subfolder.  That way, we can have multiple versions of the source files for a given day.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-245.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The item() variable is exposed to the activity when a <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/foreach-activity\">for-each activity<\/a> is used.  
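<\/p>\n<p>To make the wiring concrete, the dynamic content expressions below show how a metadata row can drive the copy activity.  The variable name <font style=\"color: green;\">var_folder_date<\/font> is my own placeholder for the date variable set earlier in the pipeline; substitute whatever name you used.<\/p>\n<pre class=\"theme:familiar\" title=\"copy activity - sample dynamic content\">\nSource directory:       @item().container_nm\nSource file name:       @item().file_nm\nDestination directory:  @concat(item().container_nm, '\/', variables('var_folder_date'))\nDestination file name:  @item().file_nm\n<\/pre>\n<p>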
We will talk about working with meta data using the lookup activity next.<\/p>\n<h4 style=\"color: brown;\">Data Pipeline &#8211; Lookup &#038; ForEach Activities<\/h4>\n<p>The image below shows the top three activities used by our data pipeline.  The for-each activity has two levels of nesting.  Please note I renamed the pipeline to <font style=\"color: green;\">pl-delimited-full-load<\/font> and gave it an appropriate description.  I defined a pipeline parameter named <font style=\"color: green;\">pipeline_id<\/font> which tells the pipeline which row of meta data to use.  <\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-200.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The variable shown below was used in the destination section of the copy activity.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-210.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Let&#8217;s review the details of the three topmost activities.  The image below shows the <a href =\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/lookup-activity\">lookup activity<\/a> named <font style=\"color: green;\">lkp-meta-data<\/font>, which reads the meta_data table from the lake house and returns a JSON object. 
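<\/p>\n<p>With the &#8220;First row only&#8221; option disabled, the lookup activity typically returns a count plus an array of rows under a value property.  The Python sketch below is my own illustration of what the for-each plus if condition combination effectively does; the sample rows and the exact JSON shape are assumptions, not captured pipeline output.<\/p>\n

```python
# Hypothetical sample of the lookup activity output; the "count"/"value"
# shape is an assumption based on typical lookup activity results.
lookup_output = {
    "count": 3,
    "value": [
        {"pipeline_id": 1, "container_nm": "raw/netflix", "file_nm": "netflix_titles.csv"},
        {"pipeline_id": 2, "container_nm": "raw/weather", "file_nm": "high_temps.csv"},
        {"pipeline_id": 3, "container_nm": "raw/weather", "file_nm": "low_temps.csv"},
    ],
}

def rows_to_process(output: dict, pipeline_id: int) -> list:
    """Mimic the for-each + if condition: keep only matching metadata rows."""
    return [row for row in output["value"] if row["pipeline_id"] == pipeline_id]

# only the high temperature row survives for pipeline id 2
print(rows_to_process(lookup_output, 2))
```

\n<p>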
<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-005.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The <a href = \"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/set-variable-activity\">set variable activity<\/a> named <font style=\"color: green;\">set-folder-var<\/font> converts the current date\/time into a pattern for the creation of the sub-folder in our destination storage.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-010.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/foreach-activity\">for-each activity<\/a> traverses the array of meta data records.  The JSON <b>output<\/b> from the lookup activity is fed to the for-each activity as <b>input<\/b>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-015.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The simplest way to execute the copy and notebook activity for the correct pipeline id is to use an <a href = \"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/if-condition-activity\">if condition activity<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-220.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The details of the condition are shown below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-225.png\" width=\"600\" height=\"600\" ><\/p>\n<p>I usually do not repeat screenshots; however, this new feature in Data Factory is <b>awesome!<\/b><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-230.png\" width=\"600\" height=\"600\" ><\/p>\n<p>When testing each component of a 
pipeline, the developer might want to <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/deactivate-activity\">deactivate<\/a> or <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/deactivate-activity\">activate<\/a> a component.  I learned about this feature from the <a href = \"https:\/\/learn.microsoft.com\/en-us\/sql\/integration-services\/control-flow\/sequence-container?view=sql-server-ver16\">sequence container<\/a> when developing packages in SSIS 2005.  I am glad the feature made it to Data Factory.<\/p>\n<p>To sum up this section, the lookup and for-each activities are really made for each other.  We could make this code more complex by figuring out if the JSON output from the lookup activity has an array element that matches our pipeline id.  However, all if condition tests are being done in parallel and it&#8217;s not worth the effort to save a few seconds.<\/p>\n<h4 style=\"color: brown;\">Data Pipeline \u2013 Notebook Activity<\/h4>\n<p>The <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-factory\/notebook-activity\">notebook activity<\/a> just calls the Spark notebook with the parameters we supply.  Please use a naming convention and enter detailed comments.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-250.png\" width=\"600\" height=\"600\" ><\/p>\n<p>On the settings page of the activity, we browse the workspace for the notebook we want to run.  In our case, the name of the notebook is lesson-04-full-load.  
The parameters match one for one with the fields in the meta data table.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-255.png\" width=\"600\" height=\"600\" ><\/p>\n<p>Before we execute the finalized pipeline, let&#8217;s look at the revised Spark notebook.<\/p>\n<h4 style=\"color: brown;\">Spark Notebook<\/h4>\n<p>At the top of the notebook we need a header to describe the program as well as a parameter cell.  The code below shows the parameter cell.<\/p>\n<pre class=\"lang:Python theme:familiar mark:1,2-4\" title=\"full load - parameter cell\">\nvar_path = 'raw\/netflix'\nvar_table = 'netflix'\nvar_delimiter = ','\nvar_header = 'true'\nvar_schema = ''\n<\/pre>\n<p>The first step of the notebook is to remove the existing bronze and silver tables for a given parameterized table name.<\/p>\n<pre class=\"lang:Python theme:familiar mark:1,2-4\" title=\"full load - remove tables\">\n#\n#  F1 - remove existing tables\n#\n\n# drop bronze table\nstmt = f'drop table if exists bronze_{var_table}_full;'\nret = spark.sql(stmt)\n\n# drop silver table\nstmt = f'drop table if exists silver_{var_table}_full;'\nret = spark.sql(stmt)\n<\/pre>\n<p>The second step of the notebook is to read the CSV file into a DataFrame and produce a temporary view as the final output.  Two tests are performed on the schema variable.  
If the variable is None or an empty string after trimming, then we want to infer the schema.<\/p>\n<pre class=\"lang:Python theme:familiar mark:1,2-4\" title=\"full load - read file + create temp view\">\n#\n#  F2 - load data frame\n#\n\n# root path\npath = 'Files\/' + var_path\n\n# tmp table\ntable = 'tmp_' + var_table.strip() + '_full'\n\n# use the supplied schema only when it is a non-empty string\nschema_flag = bool(var_schema and var_schema.strip())\n\n# load all files w\/o schema (let spark infer one)\nif not schema_flag:\n    df = spark.read.format(\"csv\") \\\n        .option(\"header\", var_header) \\\n        .option(\"delimiter\", var_delimiter) \\\n        .option(\"inferSchema\", \"true\") \\\n        .option(\"recursiveFileLookup\", \"true\") \\\n        .load(path)\n\n# load all files w\/ schema\nelse:\n    df = spark.read.format(\"csv\") \\\n        .schema(var_schema) \\\n        .option(\"header\", var_header) \\\n        .option(\"delimiter\", var_delimiter) \\\n        .option(\"recursiveFileLookup\", \"true\") \\\n        .load(path)\n\n# convert to view\ndf.createOrReplaceTempView(table)\n\n# debugging\nprint(f\"data lake path - {path}\")\nprint(f\"temporary view - {table}\")\n<\/pre>\n<p>The third step in the program creates the bronze table while adding a load date, a folder name, and a file name.  
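<\/p>\n<p>Note that split_part is 1-indexed and the part numbers 9 and 10 depend on how many path segments precede the date sub-folder in the full OneLake path.  The sketch below checks the indexing with a hypothetical path; the path prefix, and therefore the indices, are assumptions, so adjust them to your own folder depth.<\/p>\n

```python
# Hypothetical file path as returned by input_file_name(); the prefix and
# therefore the part numbers are assumptions - adjust to your folder depth.
path = "abfss://ws@onelake.dfs.fabric.microsoft.com/lh/demo/Files/raw/netflix/20240223/netflix_titles.csv?etag=abc"

def split_part(value: str, delimiter: str, part: int) -> str:
    """Python stand-in for Spark SQL's 1-indexed split_part()."""
    pieces = value.split(delimiter)
    return pieces[part - 1] if 0 < part <= len(pieces) else ""

folder_name = split_part(path, "/", 9)                     # date sub-folder
file_name = split_part(split_part(path, "/", 10), "?", 1)  # strip query string
print(folder_name, file_name)
```

\n<p>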
The split_part function is called with parameters specific to our directory structure.<\/p>\n<pre class=\"lang:Python theme:familiar mark:1,2-4\" title=\"full load - create bronze table\">\n#\n#  F3 - create bronze table (all files)\n#\n  \n# spark sql - assume 1 level nesting on dir\nstmt = f\"\"\"\n  create table bronze_{var_table}_full as\n  select \n    *, \n    current_timestamp() as _load_date,\n    split_part(input_file_name(), '\/', 9) as _folder_name,\n    split_part(split_part(input_file_name(), '\/', 10), '?', 1) as _file_name\n  from \n    tmp_{var_table}_full \n\"\"\"\n\n# create table\nret = spark.sql(stmt)\n\n# debugging\nprint(f\"execute spark sql - \\n {stmt}\")\n<\/pre>\n<p>The fourth step creates the silver table by only showing the data from the newest sub-directory by date string.<\/p>\n<pre class=\"lang:Python theme:familiar mark:1,2-4\" title=\"full load - create silver table\">\n#\n#  F4 - create silver table (latest file)\n#\n  \n# spark sql\nstmt = f\"\"\"\n  create table silver_{var_table}_full as\n  with cte_{var_table} as\n  (\n    select * \n    from bronze_{var_table}_full as l\n    where l._folder_name = (select max(_folder_name) from bronze_{var_table}_full)\n  )\n  select \n    *\n  from \n    cte_{var_table}\n\"\"\"\n\n# create table\nret = spark.sql(stmt)\n\n# debugging\nprint(f\"execute spark sql - \\n {stmt}\")\n<\/pre>\n<p>If this were a production system, I would be logging the record counts into an audit table.  
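<\/p>\n<p>One lightweight approach is to build the audit insert statement in Python and run it with spark.sql.  The table and column names below are hypothetical, not part of the original solution.<\/p>\n

```python
# Hedged sketch of audit logging; the lake_audit table and its columns are
# hypothetical names, not part of the original solution.
def audit_insert_sql(table_nm: str, layer_nm: str, record_cnt: int) -> str:
    """Build an insert statement for one per-run audit row."""
    return (
        "insert into lake_audit "
        f"values ('{table_nm}', '{layer_nm}', {record_cnt}, current_timestamp())"
    )

# in the notebook this would run as:
#   spark.sql(audit_insert_sql('netflix', 'bronze', bronze_recs))
print(audit_insert_sql("netflix", "bronze", 8807))
```

\n<p>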
Here is code to print record counts for the bronze table.<\/p>\n<pre class=\"lang:Python theme:familiar mark:1,2-4\" title=\"full load - bronze table count\">\n#\n#  Grab bronze count\n#\n\ntry:\n    sql_stmt = f\"select count(*) from bronze_{var_table}_full\"\n    bronze_recs = spark.sql(sql_stmt).first()[0]\nexcept:\n    bronze_recs = 0\n\n# show values\nprint(f\"The bronze_{var_table}_full record count is {bronze_recs}\")\n<\/pre>\n<p>Of course, we want to show record counts for the silver table.<\/p>\n<pre class=\"lang:Python theme:familiar mark:1,2-4\" title=\"full load - silver table count\">\n#\n#  Grab silver count\n#\n\ntry:\n    sql_stmt = f\"select count(*) from silver_{var_table}_full\"\n    silver_recs = spark.sql(sql_stmt).first()[0]\nexcept:\n    silver_recs = 0\n\n# show values\nprint(f\"The silver_{var_table}_full record count is {silver_recs}\")\n<\/pre>\n<p>Now that we have a complete data pipeline (copy data into the lake) and Spark notebook (transform files into tables), let&#8217;s execute the pipeline three times, once with each pipeline id, to load the CSV files into our one lake (lake house).<\/p>\n<h4 style=\"color: brown;\">Testing<\/h4>\n<p>Please set the pipeline parameter named pipeline_id to one.  Then trigger the pipeline.  The output below shows a successful execution of the code.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-265.png\" width=\"600\" height=\"600\" ><\/p>\n<p>If we click on the output of the Spark notebook, we get a link to the notebook snapshot. <\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-304-27-.png\" width=\"600\" height=\"600\" ><\/p>\n<p>This snapshot was taken from the high temperature execution.  
We can tell by the debugging statements and record counts.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-270.png\" width=\"600\" height=\"600\" ><\/p>\n<p>At the end of the first execution, we have created a bronze and a silver table for the Netflix titles.  The image below shows a grid with the data.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-260.png\" width=\"600\" height=\"600\" ><\/p>\n<p>At this time, please repeat the execution for each source.  This can be done by incrementing the pipeline id and executing the pipeline.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-285.png\" width=\"300\" height=\"300\" ><\/p>\n<p>While creating a meta data driven design takes some time, it is well worth the investment.  Now, I can load hundreds of files into the lake by adding meta data to the table and scheduling the pipeline with the correct id.<\/p>\n<h4 style=\"color: brown;\">Summary<\/h4>\n<p>Overall, I really do like the <b>one stop shop<\/b> idea of Fabric.  Everything you need is in one place with the Microsoft Fabric service.  Like any development team, Microsoft had to push the product out the door with a few rough edges. <\/p>\n<p>The image below shows the three items that might make larger customers wait on adoption.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-004-600.png\" width=\"600\" height=\"600\" ><\/p>\n<p><font style=\"color: brown;\">First<\/font>, many companies still have a good percentage of their data on premises.  Without a data gateway, older technologies such as SSIS will have to be used to land the data into a storage container.  
<\/p>\n<p><font style=\"color: brown;\">Second<\/font>, some companies like hospitals and financial institutions regard their data as highly sensitive. The use of virtual private networks and private endpoints has secured Azure components in the past.  This support is not available in Fabric today.<\/p>\n<p><font style=\"color: brown;\">Third<\/font>, most companies have at least two environments:  Development and Production.  How do you move code from one environment to another?  How do you keep track of versions?  Spark notebooks are easy since you can just download and upload files and manually check the code into git.  <\/p>\n<p>Data Factory is more complicated.  Its definition has always been a JSON export file with many inter-dependencies.  But some of that has gone away.  A linked service and dataset have been replaced with a connection, which is part of the pipeline.  Therefore, it might be possible to download and upload pipelines in the future.  For now, we can move code by creating a pipeline with the same name as the demo &#8211; <font style=\"color: green;\">pl-delimited-full-load<\/font>.  Then hit the {} button to modify the JSON with the supplied definition in the zip file. <\/p>\n<p>In short, the data gateway, private endpoints, and continuous integration \/ continuous deployment features are required to make this product shine. Enclosed is the <a href=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/bundle-04.zip\">zip file<\/a> with the CSV data files, Data Pipeline in JSON format and Spark notebooks.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Metadata Driven Pipelines What is a metadata driven pipeline? Wikipedia defines metadata as &#8220;data that provides information about other data&#8221;. As developers, we can create a non-parameterized pipeline and\/or notebook to solve a business problem. However, if we have to solve the same problem a hundred times, the amount of code can get unwieldy. 
A better way to solve this problem is to store metadata in the delta lake. This data will drive how the Azure Data Factory and Spark Notebooks execute. Business Problem Our manager has asked&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1132],"tags":[1136],"class_list":["post-9447","post","type-post","status-publish","format-standard","hentry","category-microsoft-fabric","tag-john-f-miner-iii-microsoft-fabric-data-engineering-apache-spark-data-factory-full-load-code-meta-data-driven-design-learn"],"_links":{"self":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9447","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9447"}],"version-history":[{"count":212,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9447\/revisions"}],"predecessor-version":[{"id":9796,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9447\/revisions\/9796"}],"wp:attachment":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9447"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9447"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9447"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}