Problem:
How do I join two lists of objects into a single list? Or the equivalent: How do I join two CSV files into a single one?
Note that joining makes sense only when the two lists or CSV files have something in common.
Why do we need Join?
If you know why you need join would not enjoy a review of basic database concepts, feel free to skip this section.
Say you own a company and want to keep track of the time when your employees arrive. A simplified table of employees could look like this:
Employee | |
Id | Name |
1 | John |
2 | Mark |
3 | Hanna |
The id uniquely identifies an employee because two employees can have the same name. A full table would have more employees. It would also have last name, first name, the date the employee joined the company, and other columns.
The Entrance table that contains each employee's arrival time might look like this:
Entrance | |
Employee Id | When |
1 | 6/12/2012 08:05:01 AM |
1 | 6/13/2012 07:59:12 AM |
1 | 6/14/2012 07:49:10 AM |
2 | 6/12/2012 10:33:00 AM |
2 | 6/13/2012 10:15:00 AM |
44 | 2/29/2012 01:00:00 AM |
The “When” column has both the date and the time. Types like DateTime are common for .Net Objects and databases.
We can conclude the following from the data:
- John (the employee with id 1) seems to arrive early; at around 8.
- Mark (the employee with id 2) is not a morning person (I like him already).
- Mary is more of a “work from home” kind of person. She has never entered the company headquarters.
- What about the Employee with id 44? There is no such employee in our table! And what nefarious thing was he doing at 1:00 am on February 29? I wonder if he will be coming back in 4 years… Employee 44 seems to be a mistake. Maybe the data was entered incorrectly. In some databases, you can use “referential integrity” to make sure that such mistakes never make into your table data. Obviously, this system doesn't have a check like that.
Wait a minute, why do we have two tables instead of one?
A simple one-table solution might look like this:
Employee Entrance | |
Employee | When |
John | 6/12/2012 08:05:01 AM |
John | 6/13/2012 07:59:12 AM |
John | 6/14/2012 07:49:10 AM |
Mark | 6/12/2012 10:33:00 AM |
Mark | 6/13/2012 10:15:00 AM |
Before throwing stones at this solution, I should say this could work for the simplest of cases. As we are going to see below, it doesn’t work if things get a bit more complex, so it is not future-proof. Now, for more specific stones:
In the table above, John’s 3 entrances and Mark’s 2 entrances are recorded, but I omitted the Entrance for id 44 that we used to have, because he doesn’t have a name. If we consider that we can have 2 Johns working for the company, we will need an ID, as seen below. We can also have more data for each employee, like their Last Name:
Employee Entrance | |||
Id | First Name | Last Name | When |
1 | John | Smith | 6/12/2012 08:05:01 AM |
1 | John | Smith | 6/13/2012 07:59:12 AM |
1 | John | Smith | 6/14/2012 07:49:10 AM |
2 | Mark | West | 6/12/2012 10:33:00 AM |
2 | Mark | West | 6/13/2012 10:15:00 AM |
This repetition of first name and last name is one of the reasons we have two tables. Having two tables allows us to have an Entrance Table, where we repeat just the Employee Id, a small amount of data. This way, the tables can evolve separately. If we start adding columns for Employees, like Address, Phone, etc., nothing changes in the Entrance table.
There is one more fundamental reason to have two tables. Say we want to start keeping track of employee payments. If the employees are in the same table as their entrance, how do add the payments? It clearly does not fit in the same table, because there would be multiple payments per employee and there are already multiple employee entries for their entrances.
The proper way to express the concept of one of Employee and multiple entrances is to have two tables. Then, it is going to be simple to add columns to the Employee table or add another table for their multiple payments. Another way to say this is that a one-to-many relationship, like one employee to many payments, or one employee to many entrances, needs a separate table.
It turns out that problems we can easily describe conceptually can often require many tables.
For example, let’s consider keeping track of sales at a fruit stand. That seems simple enough conceptually, but let’s see what is involved in this task. We have customers, fruit (our product), and sales. Each sale receipt has one or more lines, in which each line describes a product and its quantity. The tables we have are:
- The customer table with name, delivery address, phone, etc.
- The product table with name, price, etc.
- The sales table with date and customer.
- The sale item table with the sale (id), product (id) and quantity.
The information on a customer receipt would be gathered from all 4 tables and we would need to join this information from multiple tables.
If even simple conceptual problems require many tables, joining tables becomes a fundamental need, which brings us to the next section.
The different types of Joins
Let’s look at our entrance data again.
|
|
Now that the tables are separate and future-proof, we might want a view of the data that joins the name and entrance date/time for each employee. This can be done in 4 different ways:
1) Only the Employees with Entrances and Entrances with an Employee (SQL Inner Join)
This join shows all the employee entrances except for the Entrance for Employee 44 that has no corresponding Id. It also does not show a line for Hanna because there is no entry for her in the Entrance table.
Name | When |
John | 6/12/2012 08:05:01 AM |
John | 6/13/2012 07:59:12 AM |
John | 6/14/2012 07:49:10 AM |
Mark | 6/12/2012 10:33:00 AM |
Mark | 6/13/2012 10:15:00 AM |
An example of a SQL statement that would perform this join looks like this:
SELECT Employee.Name,Entrance.[When]
FROM Employee
INNER JOINEntrance ON Employee.id = Entrance.EmployeeId
Select enumerates the columns, which are Name from the Employee table, and When from the Entrance table. The square brackets surrounding “[When]” distinguish the “When” column name from the "When" SQL command. This is required only for SQL language keywords.
Employee is “inner joined” to Entrance by the statement: Employee.id = Entrance.EmployeeId.
Inner Join is the name SQL gives to this kind of join where an item appears in the result only if the join statement (Employee.id = Entrance.EmployeeId) is true in both tables being joined. Hanna has her ID of 3, but there is no 3 in the Entrance table to make this statement true. Employee 44 has his Entrance but there is no ID in the Employee table corresponding to 44 that would make the statement true.
2) All Employees (SQL Left Outer Join)
This join shows all employees, including Hanna who has no corresponding entry in the Entrance table.
Name | When |
John | 6/12/2012 08:05:01 AM |
John | 6/13/2012 07:59:12 AM |
John | 6/14/2012 07:49:10 AM |
Mark | 6/12/2012 10:33:00 AM |
Mark | 6/13/2012 10:15:00 AM |
Hanna |
|
The SQL Statement looks like:
SELECT Employee.Name, Entrance.[When]
FROM Employee
LEFT OUTER JOINEntrance ON Employee.id = Entrance.EmployeeId
The only change in this second SQL statement is the use of “LEFT OUTER JOIN” (instead of inner join).
Left refers to the fact that all elements of the left table in “Employee LEFT OUTER JOIN Entrance” (which is Employee) will appear in the result. “
OUTER is in opposition to the INNER join and it allows Hanna to be listed even though her Employee.id is not in any of the Entrance’s EmployeeIds.
Note: Some SQL variants will not use “OUTER”.
3) All Entrances (SQL Right Outer Join)
This join has all of the Entrances, including the entry for Employee 44 even though there is no corresponding employee Id for that in the Employee’s table.
Name | When |
John | 6/12/2012 08:05:01 AM |
John | 6/13/2012 07:59:12 AM |
John | 6/14/2012 07:49:10 AM |
Mark | 6/12/2012 10:33:00 AM |
Mark | 6/13/2012 10:15:00 AM |
| 2/29/2012 01:00:00 AM |
The SQL Statement looks like:
SELECT Employee.Name, Entrance.[When]
FROM Employee
RIGHT OUTER JOINEntrance ON Employee.id = Entrance.EmployeeId
The only change to the previous join SQL statement is the use of “RIGHT” instead of “LEFT”.
Right refers to the fact that all elements of the left table in “Employee RIGHT OUTER JOIN Entrance” (which is Entrance) will appear in the result. “
Note: Some SQL variants will not use “OUTER”.
4) All Entrances and Employees (SQL Full Join)
The join below shows all employees and all entrances, including Employees with no corresponding entrances (Hanna) or entrances with no corresponding employee (the entrance for Employee 44).
Name | When |
John | 6/12/2012 08:05:01 AM |
John | 6/13/2012 07:59:12 AM |
John | 6/14/2012 07:49:10 AM |
Mark | 6/12/2012 10:33:00 AM |
Mark | 6/13/2012 10:15:00 AM |
Hanna |
|
| 2/29/2012 01:00:00 AM |
The SQL Statement looks like:
SELECT Employee.Name, Entrance.[When]
FROM Employee
FULL OUTER JOINEntrance ON Employee.id = Entrance.EmployeeId
The only change to the previous join the use of “FULL” instead of “RIGHT” or “LEFT”.
Full refers to the fact that all elements of both join sides (Employee and Entrance specified in “Employee FULL OUTER JOIN Entrance”) will appear in the result.
Note: Some SQL variants will not use “OUTER”.
Join in Windows PowerShell
There is no built-in cmdlet for joining in Windows PowerShell, so the remainder of this post will be about building a cmdlet called Join-Object that performs the four join operations in lists of objects in memory. Those objects can come from different places, including csv files.
For example, lets create a couple of CSV files that contain the sample data we have been using:
@"
Id,Name
1,John
2,Mark
3,Hanna
"@ > c:\temp\employee.csv
@"
EmployeeId,When
1,6/12/2012 08:05:01 AM
1,6/13/2012 07:59:12 AM
1,6/14/2012 07:49:10 AM
2,6/12/2012 10:33:00 AM
2,6/13/2012 10:15:00 AM
44,2/29/2012 01:00:00 AM
"@ > c:\temp\entrance.csv
$employee = Import-Csv c:\temp\employee.csv
$entrance = Import-Csv c:\temp\entrance.csv
After running these commands, we have:
PS C:\temp> $employee
Id Name
-- ----
1 John
2 Mark
3 Hanna
And:
PS C:\temp> $entrance
EmployeeId When
---------- ----
1 6/12/2012 08:05:01 AM
1 6/13/2012 07:59:12 AM
1 6/14/2012 07:49:10 AM
2 6/12/2012 10:33:00 AM
2 6/13/2012 10:15:00 AM
44 2/29/2012 01:00:00 AM
I chose CSV because you can find data to be joined in CSV files and also because it’s a compact way to show the sample data.
Below are 4 calls to the Join-Object that produce the equivalent of the 4 types of join.
PS C:\temp> Join-Object -Left $employee -Right $entrance -Where {$args[0].Id -eq $args[1].EmployeeId} –LeftProperties "Name" –RightProperties "When" -Type OnlyIfInBoth
Name When
---- ----
John 6/12/2012 08:05:01 AM
John 6/13/2012 07:59:12 AM
John 6/14/2012 07:49:10 AM
Mark 6/12/2012 10:33:00 AM
Mark 6/13/2012 10:15:00 AM
Join-Object’s first two parameters are the left and right lists of objects.
The “Where” parameter is the join condition, and it uses $args[0] to represent an item from the left list and $args[1] for an item in the right list.
The RightProperties and LeftProperties parameters are the names of the properties to appear in the output from the left list and right list.
The Type parameter is one of four names: AllInLeft, AllInRight, OnlyIfInBoth and AllInBoth. I thought those four names were simpler to relate to their functionality than SQL’s full, inner and outer names, but here is a mapped correspondence:
Join-Object | SQL |
AllInLeft | Left Outer |
AllIInRight | Right Outer |
OnlyIfInBoth | Inner |
AllInBoth | Full Outer |
Here is an example of AllInLeft:
PS C:\temp> Join-Object -Left $employee -Right $entrance -Where {$args[0].Id -eq $args[1].EmployeeId} -LeftProperties "Name" -RightProperties "When" -Type AllInLeft
Name When
---- ----
John 6/12/2012 08:05:01 AM
John 6/13/2012 07:59:12 AM
John 6/14/2012 07:49:10 AM
Mark 6/12/2012 10:33:00 AM
Mark 6/13/2012 10:15:00 AM
Hanna
Here is the output of AllInRight:
PS C:\temp> Join-Object -Left $employee -Right $entrance -Where {$args[0].Id -eq $args[1].EmployeeId} -LeftProperties "Name" -RightProperties "When" -Type AllInRight
Name When
---- ----
John 6/12/2012 08:05:01 AM
John 6/13/2012 07:59:12 AM
John 6/14/2012 07:49:10 AM
Mark 6/12/2012 10:33:00 AM
Mark 6/13/2012 10:15:00 AM
2/29/2012 01:00:00 AM
And finally, the output of AllInBoth:
PS C:\temp> Join-Object –Left $employee –Right $entrance –Where {$args[0].Id -eq $args[1].EmployeeId} –LeftProperties "Name" –RightProperties "When" -Type AllInBoth
Name When
---- ----
John 6/12/2012 08:05:01 AM
John 6/13/2012 07:59:12 AM
John 6/14/2012 07:49:10 AM
Mark 6/12/2012 10:33:00 AM
Mark 6/13/2012 10:15:00 AM
Hanna
2/29/2012 01:00:00 AM
Conclusions
Join is an essential tool for data centric problems because, usually, multiple tables are necessary.
Data centric problems are normally handled by database tools, but if you happen to have CSV files with this kind of data, Join-Object might be useful.
Lucio Silveira [MSFT]
The Join-Object Script
function AddItemProperties($item, $properties, $output)
{
if($item -ne $null)
{
foreach($property in $properties)
{
$propertyHash =$property -as [hashtable]
if($propertyHash -ne $null)
{
$hashName=$propertyHash["name"] -as [string]
if($hashName -eq $null)
{
throw "there should be a string Name"
}
$expression=$propertyHash["expression"] -as [scriptblock]
if($expression -eq $null)
{
throw "there should be a ScriptBlock Expression"
}
$_=$item
$expressionValue=& $expression
$output | add-member -MemberType "NoteProperty" -Name $hashName -Value $expressionValue
}
else
{
# .psobject.Properties allows you to list the properties of any object, also known as "reflection"
foreach($itemProperty in $item.psobject.Properties)
{
if ($itemProperty.Name -like $property)
{
$output | add-member -MemberType "NoteProperty" -Name $itemProperty.Name -Value $itemProperty.Value
}
}
}
}
}
}
function WriteJoinObjectOutput($leftItem, $rightItem, $leftProperties, $rightProperties, $Type)
{
$output = new-object psobject
if($Type -eq "AllInRight")
{
# This mix of rightItem with LeftProperties and vice versa is due to
# the switch of Left and Right arguments for AllInRight
AddItemProperties $rightItem $leftProperties $output
AddItemProperties $leftItem $rightProperties $output
}
else
{
AddItemProperties $leftItem $leftProperties $output
AddItemProperties $rightItem $rightProperties $output
}
$output
}
<#
.Synopsis
Joins two lists of objects
.DESCRIPTION
Joins two lists of objects
.EXAMPLE
Join-Object $a $b "Id" ("Name","Salary")
#>
function Join-Object
{
[CmdletBinding()]
[OutputType([int])]
Param
(
# List to join with $Right
[Parameter(Mandatory=$true,
Position=0)]
[object[]]
$Left,
# List to join with $Left
[Parameter(Mandatory=$true,
Position=1)]
[object[]]
$Right,
# Condition in which an item in the left matches an item in the right
# typically something like: {$args[0].Id -eq $args[1].Id}
[Parameter(Mandatory=$true,
Position=2)]
[scriptblock]
$Where,
# Properties from $Left we want in the output.
# Each property can:
# - Be a plain property name like "Name"
# - Contain wildcards like "*"
# - Be a hashtable like @{Name="Product Name";Expression={$_.Name}}. Name is the output property name
# and Expression is the property value. The same syntax is available in select-object and it is
# important for join-object because joined lists could have a property with the same name
[Parameter(Mandatory=$true,
Position=3)]
[object[]]
$LeftProperties,
# Properties from $Right we want in the output.
# Like LeftProperties, each can be a plain name, wildcard or hashtable. See the LeftProperties comments.
[Parameter(Mandatory=$true,
Position=4)]
[object[]]
$RightProperties,
# Type of join.
# AllInLeft will have all elements from Left at least once in the output, and might appear more than once
# if the where clause is true for more than one element in right, Left elements with matches in Right are
# preceded by elements with no matches. This is equivalent to an outer left join (or simply left join)
# SQL statement.
# AllInRight is similar to AllInLeft.
# OnlyIfInBoth will cause all elements from Left to be placed in the output, only if there is at least one
# match in Right. This is equivalent to a SQL inner join (or simply join) statement.
# AllInBoth will have all entries in right and left in the output. Specifically, it will have all entries
# in right with at least one match in left, followed by all entries in Right with no matches in left,
# followed by all entries in Left with no matches in Right.This is equivallent to a SQL full join.
[Parameter(Mandatory=$false,
Position=5)]
[ValidateSet("AllInLeft","OnlyIfInBoth","AllInBoth", "AllInRight")]
[string]
$Type="OnlyIfInBoth"
)
Begin
{
# a list of the matches in right for each object in left
$leftMatchesInRight = new-object System.Collections.ArrayList
# the count for all matches
$rightMatchesCount = New-Object "object[]" $Right.Count
for($i=0;$i -lt $Right.Count;$i++)
{
$rightMatchesCount[$i]=0
}
}
Process
{
if($Type -eq "AllInRight")
{
# for AllInRight we just switch Left and Right
$aux = $Left
$Left = $Right
$Right = $aux
}
# go over items in $Left and produce the list of matches
foreach($leftItem in $Left)
{
$leftItemMatchesInRight = new-object System.Collections.ArrayList
$null = $leftMatchesInRight.Add($leftItemMatchesInRight)
for($i=0; $i -lt $right.Count;$i++)
{
$rightItem=$right[$i]
if($Type -eq "AllInRight")
{
# For AllInRight, we want $args[0] to refer to the left and $args[1] to refer to right,
# but since we switched left and right, we have to switch the where arguments
$whereLeft = $rightItem
$whereRight = $leftItem
}
else
{
$whereLeft = $leftItem
$whereRight = $rightItem
}
if(Invoke-Command -ScriptBlock $where -ArgumentList $whereLeft,$whereRight)
{
$null = $leftItemMatchesInRight.Add($rightItem)
$rightMatchesCount[$i]++
}
}
}
# go over the list of matches and produce output
for($i=0; $i -lt $left.Count;$i++)
{
$leftItemMatchesInRight=$leftMatchesInRight[$i]
$leftItem=$left[$i]
if($leftItemMatchesInRight.Count -eq 0)
{
if($Type -ne "OnlyIfInBoth")
{
WriteJoinObjectOutput $leftItem $null $LeftProperties $RightProperties $Type
}
continue
}
foreach($leftItemMatchInRight in $leftItemMatchesInRight)
{
WriteJoinObjectOutput $leftItem $leftItemMatchInRight $LeftProperties $RightProperties $Type
}
}
}
End
{
#produce final output for members of right with no matches for the AllInBoth option
if($Type -eq "AllInBoth")
{
for($i=0; $i -lt $right.Count;$i++)
{
$rightMatchCount=$rightMatchesCount[$i]
if($rightMatchCount -eq 0)
{
$rightItem=$Right[$i]
WriteJoinObjectOutput $null $rightItem $LeftProperties $RightProperties $Type
}
}
}
}
}