The SELECT reserved word is part of the Data Manipulation Language (DML) defined by Codd and is used to query data from the database. Because of its popularity, there are many clauses and functions that can be used to construct a query to retrieve data.
Today, I am continuing our exploration by reviewing SELECT queries that aggregate data. Data aggregation is the process of converting many records into a few records with special meaning. Many of the functions used in aggregation have statistical meaning such as minimum value, maximum value, etc.
I will be using the Adventure Works Sample database supplied by Microsoft during this talk.
We will start by looking at the the sales people and their year-to-date sales numbers. The following query joins the SalesPerson and Person tables to retrieve our results.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- Use the correct database USE AdventureWorks2008R2 GO -- List existing sales force SELECT P.BusinessEntityID as Id, cast(P.LastName as char(20)) as LastName, cast(P.FirstName as char(20)) as FirstName, Str(S.SalesYTD, 12, 2) as SalesYtd FROM Sales.SalesPerson S INNER JOIN Person.Person P ON s.BusinessEntityID = P.BusinessEntityID WHERE TerritoryID IS NOT NULL GO -- Output Id LastName FirstName SalesYtd ----------- -------------------- -------------------- ------------ 275 Blythe Michael 3763178.18 276 Mitchell Linda 4251368.55 277 Carson Jillian 3189418.37 278 Vargas Garrett 1453719.47 279 Reiter Tsvi 2315185.61 280 Ansman-Wolfe Pamela 1352577.13 281 Ito Shu 2458535.62 282 Saraiva José 2604540.72 283 Campbell David 1573012.94 284 Mensa-Annan Tete 1576562.20 286 Tsoflias Lynn 1421810.92 288 Valdez Rachel 1827066.71 289 Pak Jae 4116871.23 290 Varkey Chudukatil Ranjit 3121616.32 </span> |
One nice feature of Microsoft Access is the FIRST and LAST functions are part of the aggregation set. We can recreate these features by using the ORDER BY and TOP clauses. The queries below return the first and last sales person by total sales.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- Top sales person (LAST) SELECT TOP 1 P.BusinessEntityID as Id, cast(P.LastName as char(20)) as LastName, cast(P.FirstName as char(20)) as FirstName, Str(S.SalesYTD, 12, 2) as SalesYtd FROM Sales.SalesPerson S INNER JOIN Person.Person P ON s.BusinessEntityID = P.BusinessEntityID WHERE TerritoryID IS NOT NULL ORDER BY SalesYTD DESC -- Output Id LastName FirstName SalesYtd ----------- -------------------- -------------------- ------------ 276 Mitchell Linda 4251368.55 -- Bottom sales person (FIRST) SELECT TOP 1 P.BusinessEntityID as Id, cast(P.LastName as char(20)) as LastName, cast(P.FirstName as char(20)) as FirstName, Str(S.SalesYTD, 12, 2) as SalesYtd FROM Sales.SalesPerson S INNER JOIN Person.Person P ON s.BusinessEntityID = P.BusinessEntityID WHERE TerritoryID IS NOT NULL ORDER BY SalesYTD ASC -- Output Id LastName FirstName SalesYtd ----------- -------------------- -------------------- ------------ 280 Ansman-Wolfe Pamela 1352577.13 </span> |
The power of aggregation is achieved by using the GROUP BY and HAVING clauses. Basically, GROUP BY takes all values that are the same for a given column list and returns one unique row. The HAVING clause is a secondary filter like the WHERE clause that can specify some aggregate condition needs to be meet.
The query below returns the number of sales people by country. The COUNT function returns a integer value.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- Count sales people by country SELECT T.CountryRegionCode AS Country, COUNT(*) AS SalesPeople FROM Sales.SalesPerson P INNER JOIN Sales.SalesTerritory T ON P.TerritoryID = T.TerritoryID GROUP BY T.CountryRegionCode -- Output Country SalesPeople ------- ----------- AU 1 CA 2 DE 1 FR 1 GB 1 US 8 </span> |
We can rewite this query with a HAVING clause to return all countries that have multiple sales people.
1 2 3 4 5 6 7 8 9 10 11 12 |
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- -- Show countries that have multiple sales people SELECT T.CountryRegionCode AS Country, COUNT(*) AS SalesPeople FROM Sales.SalesPerson P INNER JOIN Sales.SalesTerritory T ON P.TerritoryID = T.TerritoryID GROUP BY T.CountryRegionCode HAVING COUNT(*) > 1 --Output Country SalesPeople ------- ----------- CA 2 US 8 </span> |
Common day statistics such as MIN, MAX and AVG can be calculated by selecting the correct function. The example below shows the minimum, maximum, and average bonus for a sales person in each country.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- Show the min, max, & avg bonus by country SELECT T.CountryRegionCode AS Country, MIN(P.Bonus) AS SmallBonus, MAX(P.Bonus) AS BigBonus, AVG(P.Bonus) AS AvgBonus FROM Sales.SalesPerson P INNER JOIN Sales.SalesTerritory T ON P.TerritoryID = T.TerritoryID GROUP BY T.CountryRegionCode -- Output Country SmallBonus BigBonus AvgBonus ------- --------------------- --------------------- --------------------- AU 5650.00 5650.00 5650.00 CA 500.00 5000.00 2750.00 DE 75.00 75.00 75.00 FR 985.00 985.00 985.00 GB 5150.00 5150.00 5150.00 US 2000.00 6700.00 3906.25 </span> |
The SUM function can be used to totaled up all values in a particular column. The example below looks at the average and total sales by country regardless of person.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- Show avg and total sales by sales person for a particular country SELECT T.CountryRegionCode AS TheCountry, AVG(P.SalesYTD) AS TheAvg, SUM(P.SalesYTD) AS TheSales FROM Sales.SalesPerson P INNER JOIN Sales.SalesTerritory T ON P.TerritoryID = T.TerritoryID GROUP BY T.CountryRegionCode -- Output TheCountry TheAvg TheSales ---------- --------------------- --------------------- AU 1421810.9242 1421810.9242 CA 2029130.0912 4058260.1825 DE 1827066.7118 1827066.7118 FR 3121616.3202 3121616.3202 GB 4116871.2277 4116871.2277 US 2559979.8237 20479838.5899 </span> |
Last but not least, advance statistics such as standard deviation using the STDEV function or variance using the VAR fucntion can be calculated. The example below goes to the Sales Order Header level and returns statistics by country for the
2008 year.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
<span style="color: #008000; font-family: Lucida Console; font-size: small;">-- Get statistics on sales detail by country for 2008 SELECT YEAR(H.OrderDate) AS TheYear, T.CountryRegionCode AS TheCountry, STR(AVG(H.TotalDue), 12, 2) AS TheAvg, STR(STDEV(H.TotalDue), 12, 3) AS TheStdDev, STR(VAR(H.TotalDue), 12, 3) AS TheVar FROM Sales.SalesOrderHeader H INNER JOIN Sales.SalesTerritory T ON H.TerritoryID = T.TerritoryID WHERE YEAR(H.OrderDate) = 2008 GROUP BY YEAR(H.OrderDate), T.CountryRegionCode ORDER BY YEAR(H.OrderDate), T.CountryRegionCode -- Output TheYear TheCountry TheAvg TheStdDev TheVar ----------- ---------- ------------ ------------ ------------ 2008 AU 1254.82 3111.270 9680003.033 2008 CA 1800.43 7957.318 63318908.399 2008 DE 1764.02 6209.980 38563854.668 2008 FR 2110.00 8906.554 79326696.856 2008 GB 1888.58 7335.738 53813050.772 2008 US 2748.81 9902.666 98062789.479 </span> |
In summary, aggregation is a great way to summarize data for organizational decision makers.
The following functions are part of TSQL aggregation with the except of FIRST and LAST which can be
emulated.
- AVG – average of values in a column.
- COUNT – counts how many rows.
- FIRST – returns the value of the first record in a field.
- LAST – returns the value of the last record in a field.
- MAX – maximum value in a column.
- MIN – minimum value in a column.
- STDEV – sample standard deviation of the values in a column.
- STDEVP – standard deviation of the values in a column.
- SUM – adds the values in a column.
- VAR – sample variance of the values in a column.
- VARP – variance of the values in a column.
There are some exciting new aggregation features in TSQL 2008 that I will talk in the future about.
Knowing how to summarize data is key to using the information that is stored in the database.
Much appreciated for the information and share!
Many thanks for your review! Frankly speaking I have never come across anything that cool.
Hey! Someone in my Myspace group shared this site with us so I came to take a look. I’m definitely loving the information. I’m bookmarking and will be tweeting this to my followers! Outstanding blog and outstanding design and style.