This is the second post in a three-part series.
- Part 1:
  - Useful methods on the String class
  - Introduction to Regular Expressions
  - The Select-String cmdlet
- Part 2:
  - The -split operator
  - The -match operator
  - The switch statement
  - The Regex class
- Part 3:
  - A complete, slightly bigger, real-world example of a switch-based parser
The -split operator
The -split operator splits one or more strings into substrings.
A common use is parsing name-value pairs.
Note the use of the Max-substrings parameter (the 2 in the code below) to the -split operator.
It ensures that it doesn’t matter if the value contains the character to split on.
$text = "Description=The '=' character is used for assigning values to a variable"
$name, $value = $text -split "=", 2
@"
Name = $name
Value = $value
"@
Name = Description
Value = The '=' character is used for assigning values to a variable
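The same idea extends to several lines at once. The following is a minimal sketch, assuming hypothetical input lines and a hypothetical $settings variable (neither is from the original example), that collects name-value pairs into an ordered dictionary:

```powershell
# Hypothetical input: each line is a Name=Value pair, and values may contain '='.
$lines = @(
    'Description=The = character assigns values'
    'Author=Staffan'
    'Expression=a=b'
)

$settings = [ordered]@{}
foreach ($line in $lines) {
    # Limit the split to two substrings so only the first '=' acts as the separator.
    $name, $value = $line -split '=', 2
    $settings[$name] = $value
}

$settings['Expression']
```

Because of the limit of 2, the last entry keeps its full value, a=b, instead of being cut at the second '='.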
When the line to parse contains fields separated by a well-known separator that is never part of the field values, we can use the -split operator in combination with multiple assignment to get the fields into variables.
$name, $location, $occupation = "Spider Man,New York,Super Hero" -split ','
If only the location is of interest, the unwanted items can be assigned to $null.
$null, $location, $null = "Spider Man,New York,Super Hero" -split ','
$location
New York
If there are many fields, assigning to null doesn’t scale well. Indexing can be used instead, to get the fields of interest.
$inputText = "x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,11,x,x,x,x"
$name, $location, $age = ($inputText -split ',')[1,12,21]
$name
$location
$age
Staffan
Stockholm
11
It is almost always a good idea to create an object that gives context to the different parts.
$inputText = "x,Steve,x,x,x,x,x,x,x,x,x,x,Seattle,x,x,x,x,x,x,x,x,22,x,x,x,x"
$name, $location, $age = ($inputText -split ',')[1,12,21]
[PSCustomObject] @{
    Name = $name
    Location = $location
    Age = [int] $age
}
Name Location Age
---- -------- ---
Steve Seattle 22
Instead of creating a PSCustomObject, we can create a class. It’s a bit more to type, but we can get more help from the engine, for example with tab completion.
The example below also shows a case of type conversion where the default string-to-number conversion doesn’t work.
The Age field is handled by PowerShell’s built-in type conversion. It is of type [int], and PowerShell will handle the conversion from string to int, but in some cases we need to help out a bit. The ShoeSize field is also an [int], but the data is hexadecimal, and without the hex specifier (‘0x’), this conversion fails for some values and provides incorrect results for the others.
class PowerSheller {
    [string] $Name
    [string] $Location
    [int] $Age
    [int] $ShoeSize
}
$inputText = "x,Staffan,x,x,x,x,x,x,x,x,x,x,Stockholm,x,x,x,x,x,x,x,x,33,x,11d,x,x"
$name, $location, $age, $shoeSize = ($inputText -split ',')[1,12,21,23]
[PowerSheller] @{
    Name = $name
    Location = $location
    Age = $age
    # ShoeSize is expressed in hex, with no '0x' because reasons :)
    # And yes, it's in millimeters.
    ShoeSize = [Convert]::ToInt32($shoeSize, 16)
}
Name Location Age ShoeSize
---- -------- --- --------
Staffan Stockholm 33 285
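To make the two failure modes concrete, here is a small sketch using two hypothetical hex strings, 'ff' and '33' (chosen for illustration, not taken from the data above). One cannot be converted by the built-in cast at all, and the other converts silently to the wrong number:

```powershell
# '33' looks like a valid decimal number, so the built-in conversion succeeds...
$asDecimal = [int] '33'
# ...but read as hexadecimal, the same digits mean something else.
$asHex = [Convert]::ToInt32('33', 16)

# 'ff' is not a decimal number, so the built-in conversion throws.
$castFailed = $false
try { $null = [int] 'ff' } catch { $castFailed = $true }

# With an explicit base, the conversion works.
$ffAsHex = [Convert]::ToInt32('ff', 16)

"Decimal: $asDecimal, Hex: $asHex, Cast failed: $castFailed, ff as hex: $ffAsHex"
```

The silent case is the more dangerous one, since nothing signals that 33 should have been 51.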
The -split operator’s delimiter argument is treated as a regex by default (this can be changed with options).
I use this on long command lines in log files (like those given to compilers) where there can be hundreds of options specified. This makes it hard to see if a certain option is specified or not, but when split into their own lines, it becomes trivial.
The pattern below uses a positive lookahead assertion.
It can be very useful to make patterns match only in a given context, like if they are, or are not, preceded or followed by another pattern.
$cmdline = "cl.exe /D Bar=1 /I SomePath /D Foo /O2 /I SomeOtherPath /Debug a1.cpp a3.cpp a2.cpp"
$cmdline -split "\s+(?=[-/])"
cl.exe
/D Bar=1
/I SomePath
/D Foo
/O2
/I SomeOtherPath
/Debug a1.cpp a3.cpp a2.cpp
Breaking down the regex, by rewriting it with the x option:
(?x) # ignore whitespace in the pattern, and enable comments after '#'
\s+ # one or more whitespace characters
(?=[-/]) # only match the preceding whitespace if it is followed by '-' or '/'.
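Since the delimiter is a regex by default, characters such as '.' need care. The sketch below, which uses a hypothetical version string, compares three ways of splitting on a dot: unescaped, escaped, and with the SimpleMatch option that turns off regex interpretation:

```powershell
$version = '1.2.3'   # hypothetical input

# As a regex, '.' matches any character, so every character becomes a
# delimiter and only empty strings remain.
$asRegex = $version -split '.'

# Escaping the dot makes it a literal character.
$escaped = $version -split '\.'

# The SimpleMatch option treats the delimiter as plain text.
# A Max-substrings value of 0 means "return all substrings".
$simple = $version -split '.', 0, 'SimpleMatch'

$escaped -join ' | '
$simple -join ' | '
```

When the delimiter comes from user input or a variable, [regex]::Escape() is another way to get a literal match.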
Splitting with a scriptblock
The -split operator also comes in another form, where you can pass it a scriptblock instead of a regular expression.
This allows for more complicated logic that can be hard or impossible to express as a regular expression.
The scriptblock accepts two parameters: the text to split and the current index. $_ is bound to the character at the current index.
function SplitWhitespaceInMiddleOfText {
    param(
        [string]$Text,
        [int] $Index
    )
    if ($Index -lt 10 -or $Index -gt 40){
        return $false
    }
    $_ -match '\s'
}
$inputText = "Some text that only needs splitting in the middle of the text"
$inputText -split $function:SplitWhitespaceInMiddleOfText
Some text that
only
needs
splitting
in
the middle of the text
The $function:SplitWhitespaceInMiddleOfText syntax is a way to get to the content of the function (the scriptblock that implements it), just as $env:UserName gets the content of an item in the env: drive.
It provides a way to document and/or reuse the scriptblock.
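For simple conditions, the scriptblock can also be written inline. A minimal sketch, assuming a hypothetical input that mixes two delimiters:

```powershell
$mixed = 'alpha,beta;gamma'   # hypothetical input

# Split wherever the current character ($_) is a comma or a semicolon.
$parts = $mixed -split { $_ -eq ',' -or $_ -eq ';' }

$parts
```

This particular case could just as well be written as a regex ('[,;]'); the scriptblock form pays off when the decision depends on state that a pattern cannot easily express, such as the index check in the function above.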
The -match operator
The -match operator works in conjunction with the $matches automatic variable.
Each time a -match or a -notmatch succeeds, the $matches variable is populated so that each capture group gets its own entry.
If the capture group is named, the key will be the name of the group, otherwise it will be the index.
As an example:
if ('a b c' -match '(\w) (?<named>\w) (\w)'){
    $matches
}
Name Value
---- -----
named b
2 c
1 a
0 a b c
Notice that the indices only increase for groups without names; that is, the indices of later groups change when a group is named.
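One consequence of $matches only being populated on a successful match is worth a small sketch. To the best of my knowledge, a failed -match leaves the previous content of $matches in place, which is why the result of the operator should always be checked before reading from it. The strings and the group name below are hypothetical:

```powershell
# A successful match populates $matches.
$first = 'abc' -match '(?<letter>b)'
$captured = $matches['letter']

# A failed match does not touch $matches, so the old capture is still there.
$second = '123' -match '(?<letter>[a-z])'
$stale = $matches['letter']

"First: $first ($captured), Second: $second (stale value: $stale)"
```

Wrapping the use of $matches in an if statement, as the examples in this post do, avoids reading such stale values.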
Armed with the regex knowledge from the earlier post, we can write the following:
PS> " 10,Some text" -match '^\s+(\d+),(.+)'
True
PS> $matches
Name Value
---- -----
2 Some text
1 10
0 10,Some text
or with named groups
PS> " 10,Some text" -match '^\s+(?<num>\d+),(?<text>.+)'
True
PS> $matches
Name Value
---- -----
num 10
text Some text
0 10,Some text
The important thing here is that the parts of the pattern that we want to extract have parentheses around them.
That is what creates the capture groups that allow us to reference those parts of the matching text, either by name or by index.
Combining this into a function makes it easy to use:
function ParseMyString($text){
    if ($text -match '^\s+(\d+),(.+)') {
        [PSCustomObject] @{
            Number = [int] $matches[1]
            Text = $matches[2]
        }
    }
    else {
        # Single quotes inside a double-quoted string still expand $text,
        # so the warning shows the actual input.
        Write-Warning "ParseMyString: Input '$text' doesn't match pattern"
    }
}
ParseMyString " 10,Some text"
Number Text
------- ----
10 Some text
Notice the type conversion when assigning the Number property. As long as the number is in the range of an integer, this will always succeed, since we have made a successful match in the if statement above. ([long] or [bigint] could be used for larger values. In this case I provide the input, and I have promised myself to stick to a range that fits in a 32-bit integer.)
Now we will be able to sort or do numerical operations on the Number property, and it will behave like we want it to – as a number, not as a string.
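The difference shows up as soon as the parsed objects are sorted. A minimal sketch, using hypothetical input lines and the same pattern as above, compares sorting on the [int] property with sorting on its string form:

```powershell
# Hypothetical input lines in the same format as above.
$lines = ' 10,ten', ' 9,nine', ' 100,hundred'

$parsed = foreach ($line in $lines) {
    if ($line -match '^\s+(\d+),(.+)') {
        [PSCustomObject] @{
            Number = [int] $matches[1]
            Text   = $matches[2]
        }
    }
}

# Sorting on the [int] property gives numeric order: nine, ten, hundred.
$numericOrder = ($parsed | Sort-Object Number).Text

# Sorting on the same values as strings gives text order: ten, hundred, nine.
$textOrder = ($parsed | Sort-Object { [string] $_.Number }).Text

$numericOrder -join ', '
$textOrder -join ', '
```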
The switch statement
Now we’re at the big guns.
The switch statement in PowerShell has been given special functionality for parsing text.
It has two flags that are useful for parsing text and files with text in them: -regex and -file.
When specifying -regex, the match clauses that are strings are treated as regular expressions. The switch statement also sets the $matches automatic variable.
When specifying -file, PowerShell treats the input as a file name to read input from, rather than as a value expression.
Note the use of a ScriptBlock instead of a string as the match clause to determine if we should skip preamble lines.
class ParsedOutput {
    [int] $Number
    [string] $Text
    [string] ToString() { return "{0} ({1})" -f $this.Text, $this.Number }
}
$inputData =
    "Preamble line",
    "LastLineOfPreamble",
    " 10,Some Text",
    " Some other text,20"
$inPreamble = $true
switch -regex ($inputData) {
    {$inPreamble -and $_ -eq 'LastLineOfPreamble'} { $inPreamble = $false; continue }
    "^\s+(?<num>\d+),(?<text>.+)" { # this matches the first line of non-preamble input
        [ParsedOutput] @{
            Number = $matches.num
            Text = $matches.text
        }
        continue
    }
    "^\s+(?<text>[^,]+),(?<num>\d+)" { # this matches the second line of non-preamble input
        [ParsedOutput] @{
            Number = $matches.num
            Text = $matches.text
        }
        continue
    }
}
Number Text
------ ----
10 Some Text
20 Some other text
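In the example above, the preamble lines are dropped because they do not match either pattern. If a preamble line could look like data, a stricter variant is to skip every line until the marker has been seen. The following is a sketch of that variation, not the original author's code; the input lines and the $result variable are hypothetical:

```powershell
$inputData =
    ' 99,Preamble that looks like data',
    'LastLineOfPreamble',
    ' 10,Some Text'

$inPreamble = $true
$result = switch -regex ($inputData) {
    { $inPreamble } {
        # Leave preamble mode on the marker line; skip the line either way.
        if ($_ -eq 'LastLineOfPreamble') { $inPreamble = $false }
        continue
    }
    '^\s+(?<num>\d+),(?<text>.+)' {
        [PSCustomObject] @{
            Number = [int] $matches.num
            Text   = $matches.text
        }
    }
}

$result
```

Only the line after the marker produces an object, even though the first preamble line matches the data pattern.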
The pattern [^,]+ in the text group in the code above is useful. It means match anything that is not a comma (,). We are using the any-of construct [], and within those brackets, a leading ^ changes meaning from "the beginning of the line" to "anything but".
That is useful when we are matching delimited fields. A requirement is that the delimiter cannot be part of the set of allowed field values.
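The same negated-set technique combines well with the -file flag described earlier. The sketch below writes a small, hypothetical settings file to a temporary path, then reads it back with switch -regex -file, using [^=]+ to capture everything up to the first '=':

```powershell
# Create a temporary file with hypothetical content to parse.
$path = [System.IO.Path]::GetTempFileName()
Set-Content -Path $path -Value '# settings', 'Name=Staffan', 'City=Stockholm'

$settings = @{}
switch -regex -file $path {
    '^\s*#' { continue }                  # skip comment lines
    '^(?<key>[^=]+)=(?<value>.*)$' {      # key is anything up to the first '='
        $settings[$matches.key] = $matches.value
    }
}

Remove-Item -Path $path

$settings
```

With -file, the switch statement processes the file one line at a time, so there is no need to call Get-Content first.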
The regex class
regex is a type accelerator for System.Text.RegularExpressions.Regex. It can be useful when porting code from C#, and sometimes when we want more control in situations where we have many matches of a capture group. It also allows us to pre-create the regular expressions, which can matter in performance-sensitive scenarios, and to specify a timeout.
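As a minimal sketch of pre-creating a regular expression with options and a timeout, the following uses the Regex constructor that takes a pattern, RegexOptions and a TimeSpan. The input lines and the one-second timeout are arbitrary choices for illustration, and [regex]::new() assumes PowerShell 5 or later:

```powershell
$options = [System.Text.RegularExpressions.RegexOptions]::Compiled
$timeout = [timespan]::FromSeconds(1)   # arbitrary limit for this sketch

# Create the regex once, outside the loop, and reuse it for every line.
$pattern = [regex]::new('^\s+(?<num>\d+),(?<text>.+)', $options, $timeout)

$numbers = foreach ($line in ' 10,Some text', ' 20,Other text') {
    $m = $pattern.Match($line)
    if ($m.Success) {
        [int] $m.Groups['num'].Value
    }
}

$numbers
```

If a match takes longer than the timeout, .NET throws a RegexMatchTimeoutException, which can be useful protection when the pattern or the input is not under your control.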
One instance where the regex class is needed is when you have multiple captures of a group.
Consider the following:
Text | Pattern |
---|---|
a,b,c, | (\w,)+ |
If the match operator is used, $matches will contain
Name Value
---- -----
1 c,
0 a,b,c,
The group matched three times, capturing 'a,', 'b,' and 'c,'. However, only the last capture is preserved in the $matches dictionary.
The following, on the other hand, allows us to get to all the captures of the group:
[regex]::match('a,b,c,', '(\w,)+').Groups[1].Captures
Index Length Value
----- ------ -----
0 2 a,
2 2 b,
4 2 c,
Below is an example that uses the members of the Regex class to parse input data:
class ParsedOutput {
    [int] $Number
    [string] $Text
    [string] ToString() { return "{0} ({1})" -f $this.Text, $this.Number }
}
$inputData =
    " 10,Some Text",
    " Some other text,20" # this text will not match
[regex] $pattern = "^\s+(\d+),(.+)"
foreach($d in $inputData){
    $match = $pattern.Match($d)
    if ($match.Success){
        $number, $text = $match.Groups[1,2].Value
        [ParsedOutput] @{
            Number = $number
            Text = $text
        }
    }
    else {
        Write-Warning "regex: '$d' did not match pattern '$pattern'"
    }
}
WARNING: regex: ' Some other text,20' did not match pattern '^\s+(\d+),(.+)'
Number Text
------ ----
10 Some Text
It may surprise you that the warning appears before the output. PowerShell has a quite complex formatting system at the end of the pipeline, which treats pipeline output differently from other streams. Among other things, it buffers output at the beginning of a pipeline to calculate sensible column widths. This works well in practice, but sometimes gives strange reordering of output on different streams.
Summary
In this post we have looked at how the -split operator can be used to split a string into parts, how the -match operator can be used to extract different patterns from some text, and how the powerful switch statement can be used to match against multiple patterns.
We ended by looking at the regex class, which in some cases provides a bit more control, but at the expense of ease of use.
This concludes the second part of this series. Next time, we will look at a complete, real-world example of a switch-based parser.
Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.
Staffan Gustafsson, @StaffanGson, powercode@github
Staffan works at DICE in Stockholm, Sweden, as a Software Engineer and has been using PowerShell since the first public beta.
He was most seriously pleased when PowerShell was open sourced, and has since contributed bug fixes, new features and performance improvements.
Staffan is a speaker at PSConfEU and is always happy to talk PowerShell.