{"id":9193,"date":"2024-02-21T16:07:28","date_gmt":"2024-02-21T16:07:28","guid":{"rendered":"https:\/\/craftydba.com\/?p=9193"},"modified":"2024-12-26T21:03:20","modified_gmt":"2024-12-26T21:03:20","slug":"thread-03-data-engineering-with-fabric","status":"publish","type":"post","link":"https:\/\/craftydba.com\/?p=9193","title":{"rendered":"Thread 03 \u2013 Data Engineering with Fabric"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/microsoft-fabric-logo.png\" alt=\"Microsoft Fabric Logo\" width=\"100\" height=\"100\" align=\"\u201dleft\u201d\" hspace=\"\u201d50\u201d\" vspace=\"\u201d50\u201d\" \/><\/p>\n<h2 style=\"color: green;\">Full versus Incremental Loads<\/h2>\n<p>The loading of data from a source system to target system has been well <a href = \"https:\/\/en.wikipedia.org\/wiki\/Data_loading\">documented<\/a> over the years.  My first introduction to an Extract, Transform and Load program was <a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_Transformation_Services\">DTS<\/a> for SQL Server 7.0 in 1998.<\/p>\n<p>  In a data lake, we have a bronze quality zone that supposed to represent the raw data in a delta file format.  This might include versions of the files for auditing.  In the silver quality zone, we have a single version of truth.  The data is de-duplicated and cleaned up.  How can we achieve these goals using the Apache Spark engine in Microsoft Fabric? <\/p>\n<h4 style=\"color: brown;\">Business Problem<\/h4>\n<p>Our manager has given us weather data to load into Microsoft Fabric.  In the last article, we re-organized the <b>full<\/b> load sub-directory to have two complete data files (high temperatures and low temperatures) for each day of the sample week.  As a result, we have 14 files.  The <b>incremental<\/b> load sub-directory has two files per each day for 1369 days.  The <b>high<\/b> and <b>low<\/b> temperature files for a given day each have a single reading.  There is a total of 2,738 files.  How can we create a notebook to erase existing tables if needed, rebuild the full load tables, and rebuild the incremental tables?<\/p>\n<h4 style=\"color: brown;\">Technical Solution<\/h4>\n<p>This use case allows data engineers to learn how to transform data using both Spark SQL and Spark DataFrames. The following topics will be explored in this article (thread).<\/p>\n<ol>\n<li>passing parameters<\/li>\n<li>show existing tables<\/li>\n<li>erase tables<\/li>\n<li>full load pattern<\/li>\n<li>testing full load tables<\/li>\n<li>incremental load pattern<\/li>\n<li>testing incremental load tables<\/li>\n<li>future enhancements<\/li>\n<\/ol>\n<h4 style=\"color: brown;\">Architectural Overview<\/h4>\n<p>I borrowed the image from <a href = \"https:\/\/www.emilyriederer.com\/post\/data-error-gen\/\">Emily Riederers<\/a> blog since I really liked it. <font style=\"color: brown;\">What is the difference between a full and incremental load?<\/font><br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/incr-full-good.png\" width=\"600\" height=\"600\" ><br \/>\nWe can see that a <b>incremental load<\/b> may start with a full load.This load brings over any historical data and is depicted by the blue box.  However, going forward we just bring over a new incremental file.  This is depicted by the green box.  Sometimes you might be collecting data from scratch.  Thus, there is no history.  
This is what we are doing with the sample incremental weather data.<\/p>\n<p>A <b>full load<\/b> means that all data is copied over every scheduled execution of the notebook.  The main question to ask the business is &#8220;do you need auditing of the data?&#8221;  If so, we need to keep the full files for each processing day.  One great use case for this pattern is critical business process data.  Let&#8217;s make believe we are a chocolate maker and we have a set of recipes for creating our chocolate items.  We might want to know when the recipe for the almond clusters was changed.<\/p>\n<p>A full load is great for small to medium datasets that do not often change.  For extremely large datasets, start with a multi-file full load of the historical data, given the retention period you are looking for.  Then, going forward, use incremental files to keep the dataset up to date.  In our proof of concept today, we never go back and update history.  In a future thread, we will talk about merging datasets during incremental loading, which will allow for this use case.<\/p>\n<h4 style=\"color: brown;\">Erase Tables<\/h4>\n<p>Fabric supplies the data engineer with a <a href = \"https:\/\/learn.microsoft.com\/en-us\/azure\/synapse-analytics\/synapse-notebook-activity\">parameter cell<\/a>.  We can see that the <b><font style=\"color: brown;\">process<\/font><\/b> variable is currently set to <b>full<\/b>.  However, the Python code in the notebook was built to support <b>erase<\/b>, <b>incr<\/b> and <b>full<\/b> operations (values).  Additionally, we have a <b><font style=\"color: brown;\">debug<\/font><\/b> variable that can be used to optionally display sample data and row counts.  This variable supports both <b>true<\/b> and <b>false<\/b> values.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-01.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The above image shows the parameter cell we are using in this notebook.  The image below shows the end state of the hive catalog.  It will have four published tables.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-05.png\" width=\"250\" height=\"250\" ><\/p>\n<p>The <b><font style=\"color: brown;\">show tables<\/font><\/b> Spark SQL command lists both permanent and temporary tables.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"show tables - hive database\">\n%%sql\n--\n--  E2 - show hive + temp tables\n--\n\nshow tables;\n\n<\/pre>\n<p>After executing the above Spark SQL statement, we see there are three tables in the Spark session for each load type:  a bronze table, a silver table and a temporary view.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-10.png\" width=\"600\" height=\"600\" ><\/p>\n<p>We are going to use an if statement to optionally execute code.  If our process variable is set to <b>erase<\/b>, the following code will execute.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"drop table - all tables\">\n#\n#  E1 - remove all hive tables\n#\n\nif process == 'erase':\n    tables = spark.sql('show tables;').collect()\n    for table in tables:\n        stmt = \"drop table \" + table.tableName\n        ret = spark.sql(stmt)\n<\/pre>\n<p>Here is a quick overview of the above code.  
First, we grab a DataFrame containing all the lakehouse tables by calling the <a href = \"https:\/\/spark.apache.org\/docs\/3.1.1\/api\/python\/reference\/api\/pyspark.sql.SparkSession.sql.html#pyspark.sql.SparkSession.sql\">spark.sql<\/a> method with the correct Spark SQL statement.  The <a href = \"https:\/\/spark.apache.org\/docs\/3.1.1\/api\/python\/reference\/api\/pyspark.sql.DataFrame.collect.html#pyspark.sql.DataFrame.collect\">collect<\/a> method of the DataFrame returns a list.  Then, for each table in the list, we execute a <a href = \"https:\/\/spark.apache.org\/docs\/latest\/sql-ref-syntax-ddl-drop-table.html\">drop table<\/a> command to remove the table from the hive catalog.  The <b>spark<\/b> variable is just a reference to the Spark session.  We will be using this method many times in the future.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-15.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The above image shows that all the tables have been removed from the hive catalog.<\/p>\n<h4 style=\"color: brown;\">Full Load<\/h4>\n<p>In this section, we are going over the code that will be used to create bronze and silver tables using a full load pattern.   If we try to create a table when it already exists, the program will error out.  The <b>first step<\/b> in our full load process is to remove the existing bronze and silver tables.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"drop tables - full load\">\n#\n#  F1 - remove existing tables\n#\n\nif process == 'full':\n    ret = spark.sql('drop table if exists bronze_weather_full;')\n    ret = spark.sql('drop table if exists silver_weather_full;')\n<\/pre>\n<p>The <b>second step<\/b> in our full load process is to read in the data files as a Spark DataFrame.  To fix the data type issue we had in the prior article, we are going to pass a schema definition to the <a href = \"https:\/\/spark.apache.org\/docs\/3.1.1\/api\/python\/reference\/api\/pyspark.sql.SparkSession.read.html#pyspark.sql.SparkSession.read\">read<\/a> method.  The <a href=\"https:\/\/spark.apache.org\/docs\/3.1.1\/api\/python\/reference\/api\/pyspark.sql.DataFrame.createOrReplaceTempView.html#pyspark.sql.DataFrame.createOrReplaceTempView\">createOrReplaceTempView<\/a> method is very powerful.  It registers the Spark DataFrame as a temporary view.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"load data as view - full load\">\n#\n#  F2 - load data frame\n#\n\n# load sub folders into df\nif process == 'full':\n    \n    # root path\n    path = \"Files\/raw\/weather\/full\/\"\n\n    # file schema\n    layout = \"date timestamp, temp float\"\n\n    # load all files\n    df = spark.read.format(\"csv\") \\\n        .schema(layout) \\\n        .option(\"header\",\"true\") \\\n        .option(\"recursiveFileLookup\", \"true\") \\\n        .load(path)\n\n    # convert to view\n    df.createOrReplaceTempView('tmp_weather_full')\n<\/pre>\n<p>The <b>third step<\/b> is to use Spark SQL to transform the data.  The <a href = \"https:\/\/spark.apache.org\/docs\/latest\/sql-ref-syntax-ddl-create-table-datasource.html\">create table as<\/a> command allows us to create a hive table with the resulting data from the <a href=\"https:\/\/spark.apache.org\/docs\/latest\/sql-ref-syntax-qry-select.html\">select<\/a> statement.  Another useful Spark SQL function is <a href = \"https:\/\/spark.apache.org\/docs\/latest\/api\/sql\/index.html#input_file_name\">input_file_name<\/a>.  
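<\/p>\n<p>Before we look at the full create table statement, here is a quick illustration of how these two functions work together.  The path below is made up purely for demonstration; in the lakehouse, <b>input_file_name<\/b> returns a much longer URL-style path, which is why the real code uses higher position numbers.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"split_part - illustration only\">\n%%sql\n--\n--  illustration - parse a made-up file path with split_part\n--\n\nselect\n  split_part('Files\/raw\/weather\/full\/20240101\/high_temps.csv', '\/', 5) as folder_name,\n  split_part('Files\/raw\/weather\/full\/20240101\/high_temps.csv', '\/', 6) as file_name;\n\n<\/pre>\n<p>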
The image below shows the full path of the processed file with some additional URL info.  The <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/sql\/index.html#split_part\">split_part<\/a> function is very useful when parsing a delimited string and selecting a particular item from the resulting array.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-70.png\" width=\"800\" height=\"800\" ><\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"create bronze table - full load\">\n#\n#  F3 - create bronze table (all files)\n#\n  \n# spark sql\nstmt = \"\"\"\n  create table bronze_weather_full as\n  select \n    date as _reading_date,\n    temp as _reading_temp,\n    current_timestamp() as _load_date,\n    split_part(input_file_name(), '\/', 9) as _folder_name,\n    split_part(split_part(input_file_name(), '\/', 10), '?', 1) as _file_name\n  from \n    tmp_weather_full \n\"\"\"\n\n# create table?\nif process == 'full':\n    ret = spark.sql(stmt)\n<\/pre>\n<p>Other than the additional fields such as folder name, file name, and load date, the data in the bronze delta table reflects the raw data stored as files and folders.  That is about to change when we create the silver delta table.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"create silver table - full load\">\n#\n#  F4 - create silver table (latest files)\n#\n  \n# spark sql\nstmt = \"\"\"\n  create table silver_weather_full as\n  with cte_high_temps as\n  (\n    select * \n    from bronze_weather_full as h \n    where h._file_name = 'high_temps.csv' and\n    h._folder_name = (select max(_folder_name) from bronze_weather_full)\n  )\n  ,\n  cte_low_temps as\n  (\n    select * \n    from bronze_weather_full as l\n    where l._file_name = 'low_temps.csv' and\n    l._folder_name = (select max(_folder_name) from bronze_weather_full)\n  )\n  select \n    h._reading_date,\n    h._reading_temp as _high_temp,\n    l._reading_temp as _low_temp,\n    h._load_date,\n    h._folder_name,\n    h._file_name as _high_file,\n    l._file_name as _low_file\n  from \n    cte_high_temps as h \n  join \n    cte_low_temps as l\n  on \n    h._reading_date = l._reading_date\n  order by\n    h._reading_date desc\n\"\"\"\n\n# create table?\nif process == 'full':\n    ret = spark.sql(stmt)\n<\/pre>\n<p>We are going to use common table expressions to create two derived tables: <b>cte_high_temps<\/b> and <b>cte_low_temps<\/b>.  I purposely created the folders using a year, month, and day format.  This makes the directory names sortable.  To find the most recent data, we just need to select the data that has the maximum folder value.  <\/p>\n<p>Additionally, we want to combine the high temperature and low temperature readings for a given day.  I am keeping the source folder name, the high temp file name and the low temp file name as lineage.<\/p>\n<p>The image below shows the new tables in the hive catalog when executing the Spark Notebook with process equal to <b>full<\/b>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-20.png\" width=\"250\" height=\"250\" ><\/p>\n<h4 style=\"color: brown;\">Full Table Testing<\/h4>\n<p>I am going to skip over the code snippets since the complete notebook is enclosed at the end of the article.  
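<\/p>\n<p>That said, the validation queries in the notebook follow the general shape of the example below.  This is my own sketch, not the exact cell from the enclosed notebook.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"sample check - bronze full load\">\n%%sql\n--\n--  sample check - row counts by reading date and file\n--\n\nselect\n  _reading_date,\n  _file_name,\n  count(*) as _row_count\nfrom\n  bronze_weather_full\ngroup by\n  _reading_date,\n  _file_name\norder by\n  _reading_date,\n  _file_name;\n\n<\/pre>\n<p>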
The first image shows that for a given day, we have 7 entries for high temperatures and 7 entries for low temperatures in the bronze table.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-25.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The second image shows the record count from the bronze table.  We have a lot of duplicate data.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-30.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The third image shows that the silver table has one row per day.  That row contains both high and low temperature readings.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-35.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The fourth image shows that the number of rows in the silver table equals the number of distinct days.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-40.png\" width=\"600\" height=\"600\" ><\/p>\n<p>In summary, the bronze table for the full load pattern can be quite large if we decide to keep a lot of historical files.<\/p>\n<h4 style=\"color: brown;\">Incremental Load<\/h4>\n<p>In this section, we are going over the code that will be used to create bronze and silver tables using an incremental load pattern.  There is no historical component since we have data for each day.  If we try to create a table when it already exists, the program will error out.  The <b>first step<\/b> in our incremental load process is to remove the existing bronze and silver tables.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"drop tables - incremental load\">\n#\n#  I1 - remove existing tables\n#\n\nif process == 'incr':\n    ret = spark.sql('drop table if exists bronze_weather_incr;')\n    ret = spark.sql('drop table if exists silver_weather_incr;')\n<\/pre>\n<p>The <b>second step<\/b> in our incremental load process is to read in the data files as a Spark DataFrame.  A schema is used to correctly type the data, and a temporary view is created over the DataFrame.  This view allows us to use Spark SQL in steps three and four to transform and create the medallion tables.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"load data as view - incremental load\">\n#\n#  I2 - load data frame\n#\n\n# load sub folders into df\nif process == 'incr':\n    \n    # root path\n    path = \"Files\/raw\/weather\/incremental\/\"\n\n    # file schema\n    layout = \"date timestamp, temp float\"\n\n    # load all files\n    df = spark.read.format(\"csv\") \\\n        .schema(layout) \\\n        .option(\"header\",\"true\") \\\n        .option(\"recursiveFileLookup\", \"true\") \\\n        .load(path)\n\n    # convert to view\n    df.createOrReplaceTempView('tmp_weather_incr')\n<\/pre>\n<p>The <b>third step<\/b> creates the bronze table.  
Since no filtering and\/or transformations are being performed, the code almost matches the full load bronze table code.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"create bronze table - incremental load\">\n#\n#  I3 - create bronze table (all files)\n#\n  \n# spark sql\nstmt = \"\"\"\n  create table bronze_weather_incr as\n  select \n    date as _reading_date,\n    temp as _reading_temp,\n    current_timestamp() as _load_date,\n    split_part(input_file_name(), '\/', 9) as _folder_name,\n    split_part(split_part(input_file_name(), '\/', 10), '?', 1) as _file_name\n  from \n    tmp_weather_incr\n\"\"\"\n\n# create table?\nif process == 'incr':\n    ret = spark.sql(stmt)\n<\/pre>\n<p>The <b>fourth step<\/b> creates our silver table.  Again, we are using two common table expressions to separate and combine the high and low temperature data.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"create silver table - incremental load\">\n#\n#  I4 - create silver table (all files)\n#\n  \n# spark sql\nstmt = \"\"\"\n  create table silver_weather_incr as\n  with cte_high_temps as\n  (\n    select * \n    from bronze_weather_incr as h \n    where h._file_name = 'high_temps.csv'\n  )\n  ,\n  cte_low_temps as\n  (\n    select * \n    from bronze_weather_incr as l\n    where l._file_name = 'low_temps.csv'\n  )\n  select \n    h._reading_date,\n    h._reading_temp as _high_temp,\n    l._reading_temp as _low_temp,\n    h._load_date,\n    h._folder_name,\n    h._file_name as _high_file,\n    l._file_name as _low_file\n  from \n    cte_high_temps as h \n  join \n    cte_low_temps as l\n  on \n    h._reading_date = l._reading_date\n  order by\n    h._reading_date desc\n\"\"\"\n\n# create table?\nif process == 'incr':\n    ret = spark.sql(stmt)\n<\/pre>\n<p>The image below shows two new tables for the incremental data.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-75.png\" width=\"250\" height=\"250\" ><\/p>\n<h4 style=\"color: brown;\">Incremental Table Testing<\/h4>\n<p>I am going to skip over the code snippets since the complete notebook is enclosed at the end of the article.  The first image shows that for a given day, we have a sub-folder and two files.  One file is for a high temperature reading and one file is for a low temperature reading.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-45.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The second image shows that the number of files equals the number of days times two.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-50.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The third image shows that the two temperatures have been paired together as one data point.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-55.png\" width=\"600\" height=\"600\" ><\/p>\n<p>The fourth image shows that the number of rows is equal to the number of days, or 1,369.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-60.png\" width=\"600\" height=\"600\" ><\/p>\n<p>If we preview the data in the silver incremental table, we can see that the grid looks like the Power Query interface.  
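<\/p>\n<p>If you prefer to confirm the schema from a notebook cell rather than the preview grid, a describe statement like the one below works.  This is an optional check, not part of the load logic.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"optional check - column data types\">\n%%sql\n--\n--  optional check - column data types of the silver table\n--\n\ndescribe table silver_weather_incr;\n\n<\/pre>\n<p>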
The correct data types are assigned to each column (field).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/thread-003-65.png\" width=\"600\" height=\"600\" ><\/p>\n<p>As a consultant, I find a lot of companies skip steps such as system design, code commenting, peer review and code testing.  Skipping these steps might result in a poor end product.<\/p>\n<p>Going back to the medallion architecture, we can modify the diagram to leave just the files as the data source.  This is the only step I did not do in this article.  In the next article, we are going to work on scheduling the Spark Notebook with a data pipeline.  The SQL analytics endpoint will allow us to expose the existing bronze and silver tables as well as create gold views.  This same endpoint will allow other tools like SSMS to connect to OneLake and use the data.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/onelake-medallion-lakehouse-architecture-example.png\" width=\"600\" height=\"600\" ><\/p>\n<h4 style=\"color: brown;\">Summary<\/h4>\n<p>Fabric supplies the data engineer with access to the <a href=\"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/index.html\">Apache Spark<\/a> engine to read, transform and write files.  Once the files are read into a DataFrame, we can publish a temporary view and perform all the transformation logic using Spark SQL.  Yes, you do need to know a language like Python or Scala, but remember that you can test your SQL statements with the %%sql magic command.<\/p>\n<p><b><font style=\"color: brown;\">Is this code ready for production?<\/font><\/b>  The answer is <font style=\"color: green;\"><b>it all depends<\/b><\/font>.  If you are looking for a full load and\/or an incremental append pattern, you are all set.  This code will work with <b>small<\/b> to <b>medium<\/b> datasets.  We are rebuilding the delta tables each time from scratch.  The amount of processing time will increase linearly with the number of files.  For large datasets, there is always the option of optimizing via <a href=\"https:\/\/spark.apache.org\/docs\/3.0.2\/sql-ref-syntax-ddl-create-table-datasource.html\">partitioning<\/a> or <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-engineering\/delta-optimization-and-v-order?tabs=sparksql\">v-order<\/a>.  <\/p>\n<p>In a future article, I will talk about how to handle <b>large<\/b> datasets with partitioning and v-order.  Additionally, for large datasets we do not want to rebuild the delta table each time.  The <a href = \"https:\/\/docs.delta.io\/latest\/delta-update.html#upsert-into-a-table-using-merge\">MERGE<\/a> command can be used to update the data on a given key.<\/p>\n<p>Let&#8217;s think about the sample dataset we were given.  It splits the data into separate high and low temperature files, which is crazy.  But in real life, a file with longitude, latitude, date, time, and temperature is more appropriate.  The sample rate does not matter as long as it is consistent between files.  With this new file layout, we can collect data from multiple GPS locations.  A simple min and max query by date gives us our final data. 
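<\/p>\n<p>As a rough sketch, that aggregation might look like the query below.  The table and column names are hypothetical since this combined layout does not exist in our sample data.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-4\" title=\"sketch - daily high and low temps\">\n%%sql\n--\n--  sketch - derive daily high and low temps from a combined readings table\n--  note:  bronze_weather_readings is a hypothetical table for the new layout\n--\n\nselect\n  to_date(reading_time) as reading_date,\n  max(temp) as high_temp,\n  min(temp) as low_temp\nfrom\n  bronze_weather_readings\ngroup by\n  to_date(reading_time)\norder by\n  reading_date desc;\n\n<\/pre>\n<p>With one reading per location and timestamp, the same query just picks up additional group by columns for longitude and latitude.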
<\/p>\n<p>Enclosed is the <a href=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/bundle-03.zip\">zip file<\/a> with the data files and Spark notebook.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Full versus Incremental Loads The loading of data from a source system to target system has been well documented over the years. My first introduction to an Extract, Transform and Load program was DTS for SQL Server 7.0 in 1998. In a data lake, we have a bronze quality zone that supposed to represent the raw data in a delta file format. This might include versions of the files for auditing. In the silver quality zone, we have a single version of truth. The data is de-duplicated and cleaned up.&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1132],"tags":[1135],"class_list":["post-9193","post","type-post","status-publish","format-standard","hentry","category-microsoft-fabric","tag-john-f-miner-iii-microsoft-fabric-data-engineering-apache-spark-full-load-code-incremental-load-code-erase-hive-tables-learn"],"_links":{"self":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9193","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9193"}],"version-history":[{"count":158,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9193\/revisions"}],"predecessor-version":[{"id":9445,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/9193\/revisions\/9445"}],"wp:attachment":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9193"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9193"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9193"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}