The SELECT reserved word is part of the Data Manipulation Language (DML) defined by Codd and is used to query data from the database. Because of its popularity, there are many clauses and functions that can be used to construct a query to retrieve data.
Today, I am continuing our exploration by reviewing SELECT queries that aggregate data. Data aggregation is the process of converting many records into a few records with special meaning. Many of the functions used in aggregation have statistical meaning such as minimum value, maximum value, etc.
I will be using the Adventure Works 2012 Sample database supplied by Microsoft during this talk. You should down load the Entity Relationship Diagram (ERD) so that you have an overall view of how the database enforces referential integrity, relationships.
Today, I am going to be introducing a new operator and two new functions. Let us start off by getting the business requirements for the query from the Human Resources manager.
(S)he wants to get salary statistics of employees broken down by department and by job title. The basic statistics are MIN, MAX, and AVG salary per employee as well as the COUNT employees in that job title. Sub total and grand total numbers would be an added plus.
Listed below is my TSQL solution to this business problem. I will be going over the query in detail.
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- Use Adventure Works
-- For each department & job title, get min, max, avg pay,
-- # of full time employees. Optionally add sub & grand totals.
WITH Cte_LastRateChange(BusinessEntityID, RateChangeDate) AS
SELECT PAY1.BusinessEntityID, MAX(PAY1.RateChangeDate)
FROM [HumanResources].[EmployeePayHistory] PAY1
GROUP BY PAY1.BusinessEntityID
Cte_CurrentPayRate(BusinessEntityID, Rate) AS
SELECT PAY2.BusinessEntityID, PAY2.Rate
FROM [HumanResources].[EmployeePayHistory] PAY2
INNER JOIN Cte_LastRateChange PAY3
ON PAY2.BusinessEntityID = PAY3.BusinessEntityID
AND PAY2.RateChangeDate = PAY3.RateChangeDate
, GROUPING(EMP1.JobTitle) AS SumInfo
, FORMAT(MIN(PAY4.Rate * 2080), 'C', 'en-us') AS MinPay
, FORMAT(MAX(PAY4.Rate * 2080), 'C', 'en-us') AS MaxPay
, FORMAT(AVG(PAY4.Rate * 2080), 'C', 'en-us') AS AvgPay
, COUNT(*) AS FteNo
[HumanResources].[Employee] AS EMP1
LEFT JOIN [HumanResources].[EmployeeDepartmentHistory] AS EDH1
ON EMP1.BusinessEntityID = EDH1.BusinessEntityID
LEFT JOIN [HumanResources].[Department] AS DEP1
ON EDH1.DepartmentID = DEP1.DepartmentID
LEFT JOIN Cte_CurrentPayRate AS PAY4
ON EMP1.BusinessEntityID = PAY4.BusinessEntityID
EDH1.EndDate IS NULL
The [EmployeePayHistory] table contains a record for every pay increase that an employee has every had. Because we want the most recent pay data, we need to use Common Table Expressions (CTE) to get it.
First, we want the most recent rate change date by employee, [Cte_LastRateChange]. Second, we want to join this result set to the [EmployeePayHistory] table to extract the most current pay history by employee [Cte_CurrentPayRate].
That was the hard part!
Now we join the [Employee], [EmployeeDepartmentHistory], [Cte_CurrentPayRate] and [Department] tables on the correct key columns to get our final result set. Employees usually move around departments over a period of time. To remove old records, we want to filter out [EmployeeDepartmentHistory] records that have a NULL [EndDate] by using the WHERE clause.
Since we want to have aggregated information, use the GROUP BY clause on [GroupName] and [JobTitle] to summarize the information.
It is never to late to teach an old IT professional new tricks. By having a comma at the beginning of a field name, it is very easy to remove a field from the query by commenting it out. Having the commas at the end of the line may force you to change the code.
COMMENTS come in two different flavors: one line ‘- -‘ or multiple lines ‘/*’ followed by ‘*/’. Both types come in handy and which one you use is up to you. A close look at the query shows a order by clause which is commented out.
Getting back to our topic, the two new functions are used in the COLUMN / EXPRESSION list of the SELECT statement. The FORMAT() function is new in 2012. I assumed that all employees are full time and on salary. Multiplying the hourly rate times 2080 hour per year calculates the yearly pay. The GROUPING() function that was introduced in 2005 and is only relevant when used with the ROLLUP clause. It returns a 1 for sub total and grand total row, 0 for all others.
The ROLLUP clause creates simple aggregate rows, plus subtotal or super-aggregate rows, and also a grand total row. This is just want we want to use for our query. The screen shot below shows the grouping of jobs in the sales department with a sub-total and a grand-total for all departments.
Please note that ANSI SQL is a living work of art. The syntax has been changed to be ISO compliant so that the WITH ROLLUP clause might not be available in future versions of SQL Server. The new syntax combines both the clauses into one.
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- WITH ROLLUP is depreciated!
GROUP BY ROLLUP (DEP1.GroupName, EMP1.JobTitle).
In a nutshell, the TSQL syntax has changed over the years with the introduction of new CLAUSES and FUNCTIONS for aggregating and partitioning data. We saw how the ROLLUP clause can be used to add sub totals and grand totals to a result set. Next time, I will be talking about the CUBE clause.