Joining Multiple Tables, Grouping And Evaluating Totals
This is a continuation to the Join-Object blog post considering multiple joins and SQL style grouping and total evaluation.
Problem:
In the Join-Object blog we’ve seen that relational data naturally organizes in multiple tables. Once we have those tables, we frequently want to see the information back together, so we need something like Join-Object. Sometimes we also need to group the information and evaluate totals inside the groups. For instance, what are the total sales for each customer?
How can we use Join-Object in multiple tables and group information from the tables to report totals?
Joining multiple tables:
Consider the following tables describing sales at a fruit stand:
Customers |
||
Id |
Name |
Phone |
1 |
John Smith |
(206)555-8899 |
2 |
Mark Stone |
(425)999-8811 |
3 |
Ruth Moore |
(425)888-3355 |
Products |
||
Id |
Name |
Price |
1 |
Apple |
10 |
2 |
Orange |
12 |
3 |
Kiwi |
15 |
Orders |
||
Id |
Date |
CustomerId |
1 |
6/6/2012 9:12:44 AM |
1 |
2 |
6/7/2012 6:12:10 PM |
1 |
3 |
6/5/2012 12:05:03 PM |
2 |
OrderItems |
|||
Id |
Quantity |
ProductId |
OrderId |
1 |
2 |
3 |
1 |
2 |
1 |
2 |
1 |
3 |
4 |
5 |
1 |
4 |
5 |
1 |
2 |
5 |
4 |
2 |
2 |
6 |
1 |
1 |
3 |
As we saw in the Join-Object blog, we need this many tables to deal with the simple conceptual problem of selling fruit!
One might question the need for the OrderItems table. An order item is a line in the receipt for an order that lists a product and its quantity. Since there can be many items per order, it needs its own table as mentioned in the Join-Object blog.
As we can see from the data, we have 3 customers, 3 products, 3 orders, and 6 order items. The three first order items are for the Order with Id 1.
Good data design forces us to split the tables, but after this is done, we frequently need to see the data together.
For example, suppose I want a report with all basic receipt information, including customer name, order date, all products bought, their quantity, and price. Here is the SQL statement that produces the report:
SELECT Customer.Name, [Order].Date, OrderItem.Quantity,
Product.Name AS [Product Name], Product.Price
FROM OrderItem INNER JOIN Product ON OrderItem.ProductId = Product.id INNER JOIN
Customer INNER JOIN [Order] ON Customer.id = [Order].CustomerId ON OrderItem.OrderId = [Order].id
Here is the result:
Joining All Tables |
||||
Name |
Date |
Quantity |
Product Name |
Price |
John Smith |
6/6/2012 9:12:44 AM |
2 |
Apple |
10 |
John Smith |
6/6/2012 9:12:44 AM |
1 |
Orange |
12 |
John Smith |
6/6/2012 9:12:44 AM |
4 |
Kiwi |
5 |
John Smith |
6/7/2012 6:12:10 PM |
5 |
Apple |
10 |
John Smith |
6/7/2012 6:12:10 PM |
4 |
Orange |
12 |
Mark Stone |
6/5/2012 12:00:00 AM |
1 |
Apple |
10 |
A little explanation of the SQL statement
The “AS [Product Name]” segment sets the column name in the output. Both product name and customer name are called “Name” in their own tables, so we need to differentiate them in the output.
The brackets around “[Order]” differentiate it from the SQL "Order" keyword.
The Inner Join shows only the items present in both tables.
All 4 tables are “inner joined” in the statement.
Putting the data in Windows PowerShell
From here on, we will use Windows PowerShell to manipulate this data, so let’s start by putting this data in PowerShell.
The following script creates four CSV files and then imports them into four variables that contain our data.
@"
Id,Name,Phone
1,John Smith,(206)555-8899
2,Mark Stone,(425)999-8811
3,Ruth Moore,(425)888-3355
"@ > c:\temp\customers.csv
$customers=import-csv c:\temp\customers.csv
@"
Id,Name,Price
1,Apple,10
2,Orange,12
3,Kiwi,15
"@ > c:\temp\products.csv
$products=import-csv c:\temp\products.csv
@"
Id,Date,CustomerId
1,6/6/2012 9:12:44 AM,1
2,6/7/2012 6:12:10 PM,1
3,6/5/2012 12:05:03 PM,2
"@ > c:\temp\orders.csv
$orders=import-csv c:\temp\orders.csv
@"
Id,Quantity,ProductId,OrderId
1,2,1,1
2,1,2,1
3,4,3,1
4,5,1,2
5,4,2,2
6,1,1,3
"@ > c:\temp\orderItems.csv
$orderItems=import-csv c:\temp\orderItems.csv
"Customers ";$customers | format-table -AutoSize
"`r`nProducts ";$products | format-table -AutoSize
"`r`nOrders ";$orders | format-table -AutoSize
"`r`nOrder Items ";$orderItems | format-table -AutoSize
Here is the output:
Customers
Id Name Phone
-- ---- -----
1 John Smith (206)555-8899
2 Mark Stone (425)999-8811
3 Ruth Moore (425)888-3355
Products
Id Name Price
-- ---- -----
1 Apple 10
2 Orange 12
3 Kiwi 15
Orders
Id Date CustomerId
-- ---- ----------
1 6/6/2012 9:12:44 AM 1
2 6/7/2012 6:12:10 PM 1
3 6/5/2012 12:05:03 PM 2
Order Items
Id Quantity ProductId OrderId
-- -------- --------- -------
1 2 1 1
2 1 2 1
3 4 3 1
4 5 1 2
5 4 2 2
6 1 1 3
The equivalent joins in PowerShell with the help of Join-Object are as follows:
# Joins the customer to his orders and places Name(from Customer),
# Date(from Order) and Id (from Order)
$customerOrders=Join-Object -Left $customers -Right $orders -LeftProperties Name -RightProperties Date,Id -Where {$args[0].Id -eq $args[1].CustomerId}
# Adds the OrderItems to the data already gathered. The output objects will
# have Name(from Customer), Date(from Order), Quantity(from OrderItem)
# and ProductId (from OrderItem).
$customerOrderItems=Join-Object -Left $customerOrders -Right $orderItems -LeftProperties Name,Date -RightProperties Quantity,ProductId -Where {$args[0].Id -eq $args[1].OrderId}
# Adds the Products to the data already gathered. The output objects will
# have Name (from Customer), Date(from Order), Quantity (from OrderItem)
# Product Name (from Product) and Price from Product
# We use a Hashtable instead of simply "Name" for Product Name because
# "Name" is already present in the output object for the customer name,
# so we need to differentiate the Product name from that.
# If you did not know the syntax for the Hashtable Expression, you will
# be interested in the fact it also works for Select-Object properties
$customerProducts=Join-Object -Left $customerOrderItems -Right $products -LeftProperties Name,Date,Quantity -RightProperties @{Name="Product Name";Expression={$_.Name}},Price -Where {$args[0].ProductId -eq $args[1].Id}
$customerProducts | format-table
Here is the output:
Name Date Quantity Product Name Price
---- ---- -------- ------------ -----
John Smith 6/6/2012 9:12:44 AM 2 Apple 10
John Smith 6/6/2012 9:12:44 AM 1 Orange 12
John Smith 6/6/2012 9:12:44 AM 4 Kiwi 15
John Smith 6/7/2012 6:12:10 PM 5 Apple 10
John Smith 6/7/2012 6:12:10 PM 4 Orange 12
Mark Stone 6/5/2012 12:05:03 PM 1 Apple 10
So, the following three lines of code …
$customerOrders=Join-Object -Left $customers -Right $orders -LeftProperties Name -RightProperties Date,Id -Where {$args[0].Id -eq $args[1].CustomerId}
$customerOrderItems=Join-Object -Left $customerOrders -Right $orderItems -LeftProperties Name,Date -RightProperties Quantity,ProductId -Where {$args[0].Id -eq $args[1].OrderId}
$customerProducts=Join-Object -Left $customerOrderItems -Right $products -LeftProperties Name,Date,Quantity -RightProperties @{Name="Product Name";Expression={$_.Name}},Price -Where {$args[0].ProductId -eq $args[1].Id}
… do the same job as the SQL statement:
SELECT Customer.Name, [Order].Date, OrderItem.Quantity,
Product.Name AS [Product Name], Product.Price
FROM OrderItem INNER JOIN Product ON OrderItem.ProductId = Product.id INNER JOIN
Customer INNER JOIN [Order] ON Customer.id = [Order].CustomerId ON OrderItem.OrderId = [Order].id
There are some advantages to the SQL statement:
· It is a single statement that is somewhat shorter than the three statements in PowerShell.
· It has a better syntax for aliasing a property in the output (“as ProductName” as opposed to “@{Name="Product Name";Expression={$_.Name}}”).
In my opinion, those are small advantages. Each one of the PowerShell join statements is easy to understand in isolation. The hashtable syntax used in the third command (to change a property name) is a bit long, but very common to PowerShell users, because it is useful for other cmdlets including Select-Object.
I considered changing Join-Object to take all the lists and Where clauses in a single cmdlet call, but the need to pair each of two lists with a Where clause (and a Join type) makes the command a bit too cryptic for my taste. I prefer three statements that are easy to understand to one that is not. Performance-wise, there is no difference, because each Join has to be considered in isolation.
Now that the data is joined, what about getting the total money spent by each customer, the total money spent in each product and the total money spent?
Total Money Spent by Each Customer
Let’s start with the whole command:
$customerProducts | Group-Object name | `
Select-Object Name,@{Name='Total';
Expression={$_.Group | ForEach-Object `
-Begin {$total=0;} `
-Process {$total+=[int]$_.Price*$_.Quantity} `
-End {$total}}} | format-table
It produces this output:
Name Sum
---- ---
John Smith 190
Mark Stone 10
The first line:
$customerProducts | Group-Object name
Produces two groups:
Count Name Group
----- ---- -----
5 John Smith {@{Name=John Smith; Date=6/6/2012 9:12:44 AM;...
1 Mark Stone {@{Name=Mark Stone; Date=6/5/2012 12:05:03 PM;...
In order to better see what is in each group, we run:
($customerProducts | Group-Object name)[0].Group
Which produces:
Name : John Smith
Date : 6/6/2012 9:12:44 AM
Quantity : 2
ProductName : Apple
Price : 10
Name : John Smith
Date : 6/6/2012 9:12:44 AM
Quantity : 1
ProductName : Orange
Price : 12
Name : John Smith
Date : 6/6/2012 9:12:44 AM
Quantity : 4
ProductName : Kiwi
Price : 15
Name : John Smith
Date : 6/7/2012 6:12:10 PM
Quantity : 5
ProductName : Apple
Price : 10
Name : John Smith
Date : 6/7/2012 6:12:10 PM
Quantity : 4
ProductName : Orange
Price : 12
And:
($customerProducts | Group-Object name)[1].Group
Which produces:
Name : Mark Stone
Date : 6/5/2012 12:05:03 PM
Quantity : 1
ProductName : Apple
Price : 10
So, inside the “Group” property of each object in the output of group-object is the Price and Quantity we want to multiply in order to get a total. If we wanted just the total quantity of fruit we could do:
$customerProducts | Group-Object name | `
Select-Object Name, `
@{Name='Total';Expression={($_.Group | Measure-Object Quantity -Sum).Sum}} `
| format-table
With output:
Name Total
---- -----
John Smith 16
Mark Stone 1
Name is the first column in the output. The second column is given by:
@{Name='Total'; Expression={($_.Group | Measure-Object Quantity -Sum).Sum}}
This hashtable is a syntax for select object which means a column with a Name = Total and a value which is the Expression. The Expression gets the value of the Group property and pipes to Measure-Object to sum the ‘Quantity’ property of each member of the Group. Finally the Sum property of the Measure-Object result is retrieved. That property contains the actual sum.
Since we want to get Price * Quantity and measure-object only supports plain properties we end up with the full command repeated here:
$customerProducts | Group-Object name | `
Select-Object Name,@{Name='Total';
Expression={$_.Group | ForEach-Object `
-Begin {$total=0;} `
-Process {$total+=[int]$_.Price*$_.Quantity} `
-End {$total}}} | format-table
Instead of | measure-object we have a | ForEach-Object where we use –Begin, -Process and –End to evaluate the total. This could be easily modified to calculate anything else, including averages, standard deviations, etc.
Note: The [int] cast in front of $_.Price is required because the price is a string out of our CSV file. This could be avoided by using Export-CliXml and Import-CliXml for our data, whioch would preserve the data type.
Here is the corresponding SQL statement:
SELECT Customer.Name, SUM(OrderItem.Quantity * Product.Price) AS SUM
FROM OrderItem INNER JOIN
Product ON OrderItem.ProductId = Product.id INNER JOIN
Customer INNER JOIN
[Order] ON Customer.id = [Order].CustomerId ON OrderItem.OrderId = [Order].id
GROUP BY Customer.Name
The SQL statement combines the joining, grouping and total evaluation(Price * Quantity) operations.
The Powershell statement is only doing the grouping and total evaluation, since the $customerProducts variable has the result of the joins performed in the previous section of this blog post.
The bold segments of the SQL statement are performing the grouping and the PowerShell statement.
SQL is designed specifically for this kind of data, so it provides a shorter syntax overall, but I believe the PowerShell equivalent is simple enough to understand.
Total Money Spent in Each Product
The only thing that changes from the previous section is that the property we’re passing to Group-Object is “ProductName”:
$customerProducts | Group-Object ProductName | `
Select-Object Name,@{Name='Total';
Expression={$_.Group | ForEach-Object `
-Begin {$total=0;} `
-Process {$total+=[int]$_.Price*$_.Quantity} `
-End {$total}}} | format-table
Name Sum
---- ---
Apple 80
Orange 60
Kiwi 60
The corresponding SQL is:
SELECT Product.Name, SUM(OrderItem.Quantity * Product.Price) AS Sum
FROM OrderItem INNER JOIN
Product ON OrderItem.ProductId = Product.id
GROUP BY Product.Name
Total Money Spent
Here is the command that produces this total:
$customerProducts | ForEach-Object `
-Begin {$total=0;} `
-Process {$total+=[int]$_.Price*$_.Quantity} `
-End {$total}
Here is the corresponding SQL:
SELECT SUM(OrderItem.Quantity * Product.Price) AS Sum
FROM OrderItem INNER JOIN
Product ON OrderItem.ProductId = Product.id
Lucio Silveira[MSFT]