Quantcast
Channel: PowerShell
Viewing all articles
Browse latest Browse all 1519

Parsing Text with PowerShell (1/3)

$
0
0

This is the first post in a three part series.

  • Part 1:
    • Useful methods on the String class
    • Introduction to Regular Expressions
    • The Select-String cmdlet
  • Part 2:
    • The -split operator
    • The -match operator
    • The switch statement
    • The Regex class
  • Part 3:
    • A real world, complete and slightly bigger, example of a switch-based parser

A task that appears regularly in my workflow is text parsing. It may be about getting a token from a single line of text or about turning the text output of native tools into structured objects so I can leverage the power of PowerShell.

I always strive to create structure as early as I can in the pipeline, so that later on I can reason about the content as properties on objects instead of as text at some offset in a string. This also helps with sorting, since the properties can have their correct type, so that numbers, dates etc. are sorted as such and not as text.

There are a number of options available to a PowerShell user, and I’m giving an overview here of the most common ones.

This is not a text about how to create a high performance parser for a language with a structured EBNF grammar. There are better tools available for that, for example ANTLR.

.Net methods on the string class

Any treatment of string parsing in PowerShell would be incomplete if it didn’t mention the methods on the string class.
There are a few methods that I’m using more often than others when parsing strings:

Name Description
Substring(int startIndex) Retrieves a substring from this instance. The substring starts at a specified character position and continues to the end of the string.
Substring(int startIndex, int length) Retrieves a substring from this instance. The substring starts at a specified character position and has a specified length.
IndexOf(string value) Reports the zero-based index of the first occurrence of the specified string in this instance.
IndexOf(string value, int startIndex) Reports the zero-based index of the first occurrence of the specified string in this instance. The search starts at a specified character position.
LastIndexOf(string value) Reports the zero-based index of the last occurrence of the specified string in this instance. Often used together with Substring.
LastIndexOf(string value, int startIndex) Reports the zero-based index position of the last occurrence of a specified string within this instance. The search starts at a specified character position and proceeds backward toward the beginning of the string.

This is a minor subset of the available functions. It may be well worth your time to read up on the string class since it is so fundamental in PowerShell.
Docs are found here.

As an example, this can be useful when we have very large input data of comma-separated input with 15 columns and we are only interested in the third column from the end. If we were to use the -split ',' operator, we would create 15 new strings and an array for each line. On the other hand, using LastIndexOf on the input string a few times and then SubString to get the value of interest is faster and results in just one new string.

function parseThirdFromEnd([string]$line){
    $i = $line.LastIndexOf(",")             # get the last separator
    $i = $line.LastIndexOf(",", $i - 1)     # get the second to last separator, also the end of the column we are interested in
    $j = $line.LastIndexOf(",", $i - 1)     # get the separator before the column we want
    $j++                                    # more forward past the separator
    $line.SubString($j,$i-$j)               # get the text of the column we are looking for
}

In this sample, I ignore that the IndexOf and LastIndexOf returns -1 if they cannot find the text to search for. From experience, I also know that it is easy to mess up the index arithmetics.
So while using these methods can improve performance, it is also more error prone and a lot more to type. I would only resort to this when I know the input data is very large and performance is an issue. So this is not a recommendation, or a starting point, but something to resort to.

On rare occasions, I write the whole parser in C#. An example of this is in a module wrapping the Perforce version control system, where the command line tool can output python dictionaries. It is a binary format, and the use case was complicated enough that I was more comfortable with a compiler checked implementation language.

Regular Expressions

Almost all of the parsing options in PowerShell make use of regular expressions, so I will start with a short intro of some regular expression concepts that are used later in these posts.

Regular expressions are very useful to know when writing simple parsers since they allow us to express patterns of interest and to capture text that matches those patterns.

It is a very rich language, but you can get quite a long way by learning a few key parts. I’ve found regular-expressions.info to be a good online resource for more information. It is not written directly for the .net regex implementation, but most of the information is valid across the different implementations.

Regex Description
* Zero or more of the preceding character. a* matches the empty string, a, aa, etc, but not b.
+ One or more of the preceding character. a+ matches a, aa, etc, but not the empty string or b.
. Matches any character
[ax1] Any of a,x,1
a-d matches any of a,b,c,d
\w The \w meta character is used to find a word character. A word character is a character from a-z, A-Z, 0-9, including the _ (underscore) character. It also matches variants of the characters such as ??? and ???.
\W The inversion of \w. Matches any non-word character
\s The \s meta character is used to find white space
\S The inversion of \s. Matches any non-whitespace character
\d Matches digits
\D The inversion of \d. Matches non-digits
\b Matches a word boundary, that is, the position between a word and a space.
\B The inversion of \b. . er\B matches the er in verb but not the er in never.
^ The beginning of a line
$ The end of a line
(<expr>) Capture groups

Combining these, we can create a pattern like below to match a text like:

Text Pattern
" 42,Answer" ^\s+\d+,.+

The above pattern can be written like this using the x (ignore pattern whitespace) modifier.

Starting the regex with (?x) ignores whitespace in the pattern (it has to be specified explicitly, with \s) and also enables the comment character #.

(?x)  # this regex ignores whitespace in the pattern. Makes it possible do document a regex with comments.
^     # the start of the line
\s+   # one or more whitespace character
\d+   # one or more digits
,     # a comma
.+    # one or more characters of any kind

By using capture groups, we make it possible to refer back to specific parts of a matched expression.

Text Pattern
" 42,Answer" ^\s+(\d+),(.+)
(?x)  # this regex ignores whitespace in the pattern. Makes it possible to document a regex with comments.
^     # the start of the line
\s+   # one or more whitespace character
(\d+) # capture one or more digits in the first group (index 1)
,     # a comma
(.+)  # capture one or more characters of any kind in the second group (index 2)

Naming regular expression groups

There is a construct called named capturing groups, (?<group_name>pattern), that will create a capture group with a designated name.

The regex above can be rewritten like this, which allows us to refer to the capture groups by name instead of by index.

^\s+(?<num>\d+),(?<text>.+)

Different languages have implementation specific solutions to accessing the values of the captured groups. We will see later on in this series how it is done in PowerShell.

The Select-String cmdlet

The Select-String command is a work horse, and is very powerful when you understand the output it produces.
I use it mainly when searching for text in files, but occasionally also when looking for something in command output and similar.

The key to being efficient with Select-String is to know how to get to the matched patterns in the output. In its internals, it uses the same regex class as the -match and -split operator, but instead of populating a global variable with the resulting groups, as -match does, it writes an object to the pipeline, with a Matches property that contains the results of the match.

Set-Content twitterData.txt -value @"
Lee, Steve-@Steve_MSFT,2992
Lee Holmes-13000 @Lee_Holmes
Staffan Gustafsson-463 @StaffanGson
Tribbiani, Joey-@Matt_LeBlanc,463400
"@

# extracting captured groups
Get-ChildItem twitterData.txt |
    Select-String -Pattern "^(\w+) ([^-]+)-(\d+) (@\w+)" |
    Foreach-Object {
        $first, $last, $followers, $handle = $_.Matches[0].Groups[1..4].Value   # this is a common way of getting the groups of a call to select-string
        [PSCustomObject] @{
            FirstName = $first
            LastName = $last
            Handle = $handle
            TwitterFollowers = [int] $followers
        }
    }
FirstName LastName   Handle       TwitterFollowers
--------- --------   ------       ----------------
Lee       Holmes     @Lee_Holmes             13000
Staffan   Gustafsson @StaffanGson              463

Support for Multiple Patterns

As we can see above, only half of the data matched the pattern to Select-String.

A technique that I find useful is to take advantage of the fact that Select-String supports the use of multiple patterns.

The lines of input data in twitterData.txt contain the same type of information, but they’re formatted in slightly different ways.
Using multiple patterns in combination with named capture groups makes it a breeze to extract the groups even when the positions of the groups differ.

$firstLastPattern = "^(?<first>\w+) (?<last>[^-]+)-(?<followers>\d+) (?<handle>@.+)"
$lastFirstPattern = "^(?<last>[^\s,]+),\s+(?<first>[^-]+)-(?<handle>@[^,]+),(?<followers>\d+)"
Get-ChildItem twitterData.txt |
     Select-String -Pattern $firstLastPattern, $lastFirstPattern |
    Foreach-Object {
        # here we access the groups by name instead of by index
        $first, $last, $followers, $handle = $_.Matches[0].Groups['first', 'last', 'followers', 'handle'].Value
        [PSCustomObject] @{
            FirstName = $first
            LastName = $last
            Handle = $handle
            TwitterFollowers = [int] $followers
        }
    }
FirstName LastName   Handle        TwitterFollowers
--------- --------   ------        ----------------
Steve     Lee        @Steve_MSFT               2992
Lee       Holmes     @Lee_Holmes              13000
Staffan   Gustafsson @StaffanGson               463
Joey      Tribbiani  @Matt_LeBlanc           463400

Breaking down the $firstLastPattern gives us

(?x)                # this regex ignores whitespace in the pattern. Makes it possible do document a regex with comments.
^                   # the start of the line
(?<first>\w+)       # capture one or more of any word characters into a group named 'first'
\s                  # a space
(?<last>[^-]+)      # capture one of more of any characters but `-` into a group named 'last'
-                   # a '-'
(?<followers>\d+)   # capture 1 or more digits into a group named 'followers'
\s                  # a space
(?<handle>@.+)      # capture a '@' followed by one or more characters into a group named 'handle'

The second regex is similar, but with the groups in different order. But since we retrieve the groups by name, we don’t have to care about the positions of the capture groups, and multiple assignment works fine.

Context around Matches

Select-String also has a Context parameter which accepts an array of one or two numbers specifying the number of lines before and after a match that should be captured. All text parsing techniques in this post can be used to parse information from the context lines.
The result object has a Context property, that returns an object with PreContext and PostContext properties, both of the type string[].

This can be used to get the second line before a match:

# using the context property
Get-ChildItem twitterData.txt |
    Select-String -Pattern "Staffan" -Context 2,1 |
    Foreach-Object { $_.Context.PreContext[1], $_.Context.PostContext[0] }
Lee Holmes-13000 @Lee_Holmes
Tribbiani, Joey-@Matt_LeBlanc,463400

To understand the indexing of the Pre- and PostContext arrays, consider the following:

Lee, Steve-@Steve_MSFT,2992                  <- PreContext[0]
Lee Holmes-13000 @Lee_Holmes                 <- PreContext[1]
Staffan Gustafsson-463 @StaffanGson          <- Pattern matched this line
Tribbiani, Joey-@Matt_LeBlanc,463400         <- PostContext[0]

The pipeline support of Select-String makes it different from the other parsing tools available in PowerShell, and makes it the undisputed king of one-liners.

I would like stress how much more useful Select-String becomes once you understand how to get to the parts of the matches.

Summary

We have looked at useful methods of the string class, especially how to use Substring to get to text at a specific offset. We also looked at regular expression, a language used to describe patterns in text, and on the Select-String cmdlet, which makes heavy use of regular expression.

Next time, we will look at the operators -split and -match, the switch statement (which is surprisingly useful for text parsing), and the regex class.

Staffan Gustafsson, @StaffanGson, github

Thanks to Jason Shirk, Mathias Jessen and Steve Lee for reviews and feedback.


Viewing all articles
Browse latest Browse all 1519

Trending Articles