{"id":10060,"date":"2024-03-03T15:49:24","date_gmt":"2024-03-03T15:49:24","guid":{"rendered":"https:\/\/craftydba.com\/?p=10060"},"modified":"2024-12-26T20:43:25","modified_gmt":"2024-12-26T20:43:25","slug":"test","status":"publish","type":"post","link":"https:\/\/craftydba.com\/?p=10060","title":{"rendered":"Thread 06 \u2013 Data Engineering with Fabric"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/02\/microsoft-fabric-logo.png\" alt=\"Microsoft Fabric Logo\" width=\"100\" height=\"100\" align=\"left\" hspace=\"50\" vspace=\"50\" \/><\/p>\n<h2 style=\"color: green;\">Empowering Fabric Lakehouse Users<\/h2>\n<p>Microsoft Fabric allows developers to create delta tables in the Lakehouse.  However, automating the copying, updating, and deleting of data files in the Lakehouse might not be possible.  How can we empower end users to make these changes?<\/p>\n<h4 style=\"color: brown;\">Business Problem<\/h4>\n<p>Today, we are going to cover two new tools from Microsoft that are in preview.  First, the <a href = \"https:\/\/learn.microsoft.com\/en-us\/fabric\/onelake\/onelake-file-explorer\">OneLake file explorer<\/a> allows users to access files in the Lakehouse as if they were in Windows Explorer.  Second, the <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-science\/data-wrangler\">Data Wrangler<\/a> extension for Visual Studio Code allows users to read, modify, and write parquet files.<\/p>\n<h4 style=\"color: brown;\">OneLake File Explorer<\/h4>\n<p>The OneLake file explorer can be downloaded from this <a href = \"https:\/\/www.microsoft.com\/en-us\/download\/details.aspx?id=105222\">URL<\/a>.  Once you have downloaded the installer, please execute the program.  
Click the install button to start the process.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-000.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Once the program is installed, it will ask you to log into Microsoft Fabric using the credentials you want to associate with the OneLake file explorer.  In my case, I will be using the john@craftydba.com account.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-005.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Of course, I have to enter the correct password.  However, I will not tell you what mine is.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-010.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Multi-factor authentication is enforced by the Microsoft Azure subscription.  At this time, I have to grab my Android phone and enter the number shown below into the Microsoft Authenticator application.  If you have not used the application before, here is a Microsoft <a href=\"https:\/\/support.microsoft.com\/en-us\/account-billing\/download-and-install-the-microsoft-authenticator-app-351498fc-850a-45da-b7b6-27e523b8702a\">web page<\/a> to learn more about installation.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-015.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Since I work with multiple accounts every day, I do not want this account associated with everything.  
Therefore, I am going to associate the OneLake file explorer application only with the john@craftydba.com Azure Active Directory (<a href = \"https:\/\/learn.microsoft.com\/en-us\/entra\/\">Microsoft Entra ID<\/a>) user.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-020.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The image below shows the &#8220;One Lake &#8211; Microsoft (preview)&#8221; folder appearing in Windows Explorer.  The most important part of using this tool is synchronizing the data between our laptop and Microsoft Fabric.  This can be easily done with a right-click action.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-025.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Please note that the explorer is pointing to the Adventure Works Lakehouse we created last time.<\/p>\n<h4 style=\"color: brown;\">Load New Lakehouse<\/h4>\n<p>The first dataset that we are going to work with today is the <b>S&#038;P 500 stocks<\/b> files for the years 2013 to 2017.  I downloaded this data from Yahoo Finance a long time ago.  The dataset has been used to demonstrate how to use <a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/relational-databases\/polybase\/polybase-versioned-feature-summary?view=sql-server-ver16\">PolyBase<\/a> and Azure SQL Data Warehouse.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-030.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The above image shows the creation of the new Lakehouse named <b>lh_snp_500<\/b>, and the image below shows it contains no files and\/or tables.  
It is important to know that Lakehouse names do not support many special characters, which is why the new name avoids the ampersand.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-035.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>If we refresh the OneLake file explorer, the new Lakehouse will show up.  Now, please navigate to the <b>Files<\/b> directory.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-040.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>We are going to reuse the Spark notebook that creates the bronze and silver tables for a given directory.  Therefore, please create the following nested sub-directories:  <b>RAW<\/b>, <b>STOCKS<\/b>, and <b>S&#038;P2013<\/b>.  <\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-050.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Please jump to the end of the article to get the zip file that has the data and code files.  Please copy over the CSV files for the year 2013 into the appropriate directory. <\/p>\n<h4 style=\"color: brown;\">Initial Table Creation<\/h4>\n<p>I made two changes to the <b>nb-delimited-full-load<\/b> Spark notebook.  <\/p>\n<p>First, I added support for unlimited nested folders.  This is important if we want to capture the folder name (leaf level) and file name.  Before, we hard-coded the indexing to match the folder nesting.  
The <a href=\"https:\/\/spark.apache.org\/docs\/3.1.1\/api\/python\/reference\/api\/pyspark.sql.functions.reverse.html\">reverse<\/a>, <a href =\"https:\/\/spark.apache.org\/docs\/3.1.2\/api\/python\/reference\/api\/pyspark.sql.functions.split.html\">split<\/a>, and <a href = \"https:\/\/spark.apache.org\/docs\/latest\/api\/python\/reference\/pyspark.sql\/api\/pyspark.sql.functions.split_part.html\">split_part<\/a> Spark functions convert the full path string into an array, which is then manipulated to return the folder and file names.  <\/p>\n<pre class=\"lang:python theme:familiar mark:1,2-4\" title=\"python language - create bronze table\">\n#\n#  F3 - create bronze table (all files)\n#\n  \n# spark sql - works with any level of dir nesting\nstmt = f\"\"\"\n  create table bronze_{var_table}_full as\n  select \n    *, \n    current_timestamp() as _load_date,\n    reverse(split(input_file_name(), '\/'))[1] as _folder_name,\n    split_part(reverse(split(input_file_name(), '\/'))[0], '?', 1) as _file_name\n  from \n    tmp_{var_table}_full \n\"\"\"\n\n# create table\nret = spark.sql(stmt)\n\n# debugging\nprint(f\"execute spark sql - \\n {stmt}\")\n<\/pre>\n<p>Second, we are not keeping an audit trail for the stock data.  Thus, the bronze and silver tables match exactly.  To accomplish this logic, we just need to remove the where clause from the common table expression.  This clause was selecting the folders that had the maximum date by ASCII sorting.  Better yet, we can just use a simple select statement as seen below.<\/p>\n<pre class=\"lang:python theme:familiar mark:1,2-4\" title=\"python language - create silver table\">\n#\n#  F4 - create silver table (no history)\n#\n  \n# spark sql\nstmt = f\"\"\"\n  create table silver_{var_table}_full as\n  select * \n  from bronze_{var_table}_full\n\"\"\"\n\n# create table\nret = spark.sql(stmt)\n\n# debugging\nprint(f\"execute spark sql - \\n {stmt}\")\n<\/pre>\n<p>The new notebook has been named <b>nb-delimited-full-no-history<\/b>.  
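<\/p>\n<p>The array logic above can be sketched in plain Python; the path below is a made-up example, not a real OneLake path.<\/p>

```python
# mimic reverse(split(input_file_name(), '/')) from the Spark SQL above
path = 'abfss://ws@onelake/lh_snp_500/Files/RAW/STOCKS/SNP2013/MSFT-2013.CSV?secret=123'
parts = list(reversed(path.split('/')))

# index [1] is the leaf folder, just like the Spark expression
folder_name = parts[1]

# split_part(..., '?', 1) keeps the text before any query string
file_name = parts[0].split('?')[0]

print(folder_name, file_name)  # -> SNP2013 MSFT-2013.CSV
```

<p>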
There are 125,216 rows of data in the delta tables.  These tables are built from the 508 CSV files in the directory named 2013.  How do we know this data is accurate?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-055.png\" width=\"600\" height=\"600\" \/><\/p>\n<h4 style=\"color: brown;\">Managing CSV Files<\/h4>\n<p>The user will have to know how to <b>insert<\/b>, <b>update<\/b>, and <b>delete<\/b> files in the Lakehouse storage.  We are going to cover those actions shortly.  In addition, it is always good to profile your data.  Any issues with the files can be corrected at that time.  The delta tables can be torn down and rebuilt.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-060.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The image above shows the completion of creating five directories, one for each year of captured stock data.  Behind the scenes, each folder has been loaded with files representing the five hundred plus companies.  The image below shows the completion of rebuilding the delta tables.  Right now, we have 625,833 rows of data from loading 2,533 files into our delta table.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-065.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The following images use the SQL analytics endpoint to query the delta tables.  If we group by the string representing the year, we should only have 5 years.  However, that is not the case.  The first two rows in the image below represent bad data.  The null values are from an unwanted file that was placed into the directory for use with PolyBase.  
The one row from 2012 is a dividend that has the wrong year.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-075.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The union of the good and bad CSV files has resulted in bad rows.  This data integrity issue can be solved by just deleting the bad files.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-070.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The dividend that was recorded on 12-31-2012 has no valuable data.  Thus, we can edit the offending file and delete the row.  This is an update action to the file.  <\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-080.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>I left these same issues in the sample data files I gave you.  Just use the OneLake file explorer to find the files with a *.txt extension and delete them.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-085.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The DFS stock from 2013 has a bad row of data.  Use Notepad++ to edit the file and remove the line.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-090.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>If we re-execute the Apache Spark notebook, we will rebuild the bronze and silver delta tables.  Right now, the total number of records is 635,816.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-100.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>To double-check our data integrity, run the grouping-by-year query as seen below.  
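<\/p>\n<p>Such a query can be generated with the same f-string pattern used in the notebook cells.  This is only a sketch: the table suffix and the idea of pulling the year from the <b>_folder_name<\/b> column are assumptions based on the screenshots, and in Fabric the statement would be handed to spark.sql().<\/p>

```python
# build a group-by-year statement in the notebook's f-string style
# (table name and _folder_name layout are assumptions, not confirmed)
var_table = 'stocks'
stmt = f'''
  select
    substring(_folder_name, -4, 4) as trade_year,
    count(*) as total_rows
  from
    silver_{var_table}_full
  group by
    substring(_folder_name, -4, 4)
  order by
    trade_year
'''

# in Fabric: ret = spark.sql(stmt)
print(stmt)
```

<p>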
The sum of the yearly numbers matches the total row count above.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-105.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Caution: I did notice that the synchronization of files from the desktop to OneLake does take time.  If you run the Spark notebook right away, you might not capture all the files into the delta table.<\/p>\n<h4 style=\"color: brown;\">Weak vs Strong File Types<\/h4>\n<p>The <a href = \"https:\/\/en.wikipedia.org\/wiki\/Comma-separated_values\">comma separated values<\/a> (CSV) file format is considered weak in nature due to the following facts:<\/p>\n<ol>\n<li>viewable in any text editor<\/li>\n<li>the header is optional<\/li>\n<li>data types must be inferred<\/li>\n<li>no support for compression<\/li>\n<li>no support for partitioning<\/li>\n<\/ol>\n<p>There is a place for CSV files since they can be opened and manipulated by Microsoft Excel.  This makes the format attractive to business users.<\/p>\n<p>The delta file format is based upon Apache Parquet files with the addition of a transaction log.  The <a href = \"https:\/\/en.wikipedia.org\/wiki\/Apache_Parquet\">parquet file format<\/a> is considered strong in nature since it has solved all these issues.  How can we insert, update, and delete records from OneLake if a parquet file format is used?<\/p>\n<h4 style=\"color: brown;\">New Delta Table<\/h4>\n<p>The second dataset that we are going to work with today is the <b>Titanic passenger list<\/b>.  The automation of steps is important when a company has hundreds of files and\/or tables.  However, all steps can be done via the Microsoft Fabric graphical user interface.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-110.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The above image shows the parquet file being placed into the OneLake file explorer folder.  
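<\/p>\n<p>As a small illustration of the CSV weaknesses listed earlier, the standard-library round trip below shows that every value comes back as a string; the passenger row is made-up sample data.<\/p>

```python
import csv
import io

# write and re-read a tiny passenger sample as CSV (sample data is hypothetical)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['PassengerId', 'Name', 'Age'])
writer.writerow([6, 'Moran, Mr. James', 25])
buffer.seek(0)

header, row = list(csv.reader(buffer))

# CSV stores no schema, so the integer age comes back as the string '25'
print(type(row[2]).__name__)  # -> str
```

<p>A parquet file, by contrast, carries the column names and data types in its metadata, so no inference is needed.<\/p>\n<p>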
The image below shows that a delta table can be created from the parquet file by right-clicking in the Lakehouse explorer.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-115.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The last step is to preview the data.  In the final section of this article, we will learn how to read, update, and write parquet files.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-120.png\" width=\"600\" height=\"600\" \/><\/p>\n<h4 style=\"color: brown;\">Visual Studio Code<\/h4>\n<p>Microsoft has the <a href=\"https:\/\/marketplace.visualstudio.com\/items?itemName=ms-toolsai.datawrangler\">Data Wrangler<\/a> extension (plugin) for Visual Studio Code that can be used to profile and clean up data files.  Any file type supported by pandas appears to work.  The image below shows details on the extension.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-125.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The following are required installations for this to work properly.<\/p>\n<ol>\n<li>Visual Studio Code<\/li>\n<li>Anaconda Python<\/li>\n<li>Microsoft Data Wrangler<\/li>\n<\/ol>\n<p>The image below shows the data has been loaded into a pandas DataFrame.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-130.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>We noticed that Mr. James Moran had a missing age.  Our research department has found a long-lost record stating James&#8217;s age to be 25.  
How can we update this row?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-135.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>The <a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.at.html\">at<\/a> accessor of a pandas DataFrame allows the user to update the in-memory copy given a row index of 5 and a column name of &#8216;Age&#8217;.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-140.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>We must write out a file to persist the change.  I could have overwritten the original file but decided to preserve lineage.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-145.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Please sync the OneLake file explorer with the Microsoft Fabric Lakehouse.  The above and below images show the two new files.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-150.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>Please manually drop the delta table named <b>bronze_titanic_full<\/b> and rebuild it with the parquet file named <b>titanic2.parquet<\/b>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/thread-006-155.png\" width=\"600\" height=\"600\" \/><\/p>\n<p>In a nutshell, working with parquet files is a lot harder than working with CSV files.  However, features like column names, data types, and compression make this file format very attractive.<\/p>\n<h4 style=\"color: brown;\">Summary<\/h4>\n<p>Today, we covered two new tools from Microsoft.  
First, the <a href = \"https:\/\/learn.microsoft.com\/en-us\/fabric\/onelake\/onelake-file-explorer\">OneLake file explorer<\/a> allows users to access files in the Lakehouse as if they were in Windows Explorer.  Second, the <a href=\"https:\/\/learn.microsoft.com\/en-us\/fabric\/data-science\/data-wrangler\">Data Wrangler<\/a> extension for Visual Studio Code allows users to read, modify, and write parquet files.<\/p>\n<p>Adoption by end users is very important when it comes to data lakes.  If no one from your organization uses Microsoft OneLake, why build it?  Many business users still use Microsoft Excel.  With a little training, they can manage files in the lake using the new explorer.  The ADF and\/or Spark jobs that build the delta tables can be set to run periodically during working hours.  This will empower the end users to manage the information in the lake and have it refresh automatically.<\/p>\n<p>What we have not talked about is reporting.  The ultimate goal is to have the stock data flow into a Power BI report.  Again, this is a made-up use case.  However, thousands of rows of data are not useful to an end user.  But a report showing how Microsoft&#8217;s stock has increased over the last five years is.<\/p>\n<p>Enclosed is the <a href=\"https:\/\/craftydba.com\/wp-content\/uploads\/2024\/03\/bundle-06.zip\">zip file<\/a> with the data files and Spark notebook.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Empowering Fabric Lakehouse Users Microsoft Fabric allows the developers to create delta tables in the Lakehouse. However, the automation of copying, updating, and deleting of data files in the Lakehouse might not be possible. How can we empower the end users to make these changes? Business Problem Today, we are going to cover two new tools from Microsoft that are in preview. First, the one lake explorer allows users to access files in the Lakehouse like they were in windows explorer. 
Second, the data wrangler extension for Visual Studio Code&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1132],"tags":[1147,1139,1143,1145,1150,1140,15,1142,1137,1144,1141,1148,1146,1149],"class_list":["post-10060","post","type-post","status-publish","format-standard","hentry","category-microsoft-fabric","tag-anaconda-python","tag-apache-spark","tag-csv-files","tag-data-wrangler-extension","tag-empowering-users","tag-full-load-code","tag-john-f-miner-iii","tag-learn","tag-microsoft-fabric","tag-one-lake-explorer","tag-parquet-files","tag-strong-file-format","tag-vs-code","tag-weak-file-format"],"_links":{"self":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/10060","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=10060"}],"version-history":[{"count":155,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/10060\/revisions"}],"predecessor-version":[{"id":10316,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/10060\/revisions\/10316"}],"wp:attachment":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=10060"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=10060"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=10060"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}