
Using AngleSharp in PowerShell 7 to Parse Webpages


AngleSharp is a .NET library that makes parsing and working with HTML content quick and easy. Because it is a .NET library, you can load it in PowerShell and work with its output directly. Combining the two lets you quickly script against HTML content. In this article, we will set up AngleSharp, retrieve a weather page, and convert the data into a PowerShell object.

Installing and Loading AngleSharp

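Installing AngleSharp is easy using the Install-Package command. You can even install the package into the CurrentUser scope, which means you do not need administrative rights to use this library. The package is hosted on NuGet, so the command below assumes a package source named Nuget is registered on your system (Get-PackageSource lists the sources that are available).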

Install-Package 'AngleSharp' -Scope 'CurrentUser' -Source 'Nuget'

Next, we need to load AngleSharp for use in our PowerShell script. To do this, we use the Add-Type cmdlet to load the library’s DLL directly. The code below locates the .NET Standard build of the assembly inside the package folder and loads it, if the library isn’t already loaded in your session.

If ( -Not ([System.Management.Automation.PSTypeName]'AngleSharp.Html.Parser.HtmlParser').Type ) {
    $standardAssemblyFullPath = (Get-ChildItem -Filter '*.dll' -Recurse (Split-Path (Get-Package -Name 'AngleSharp').Source)).FullName | Where-Object {$_ -Like "*standard*"} | Select-Object -Last 1

    Add-Type -Path $standardAssemblyFullPath -ErrorAction 'SilentlyContinue'
} # Terminate If - Not Loaded
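To make the selection step in that pipeline easier to follow, here is a minimal sketch that runs the same filter over a handful of made-up paths instead of a real package folder. The paths and the $CandidatePaths and $Selected variable names are hypothetical; the folder names in a real AngleSharp package depend on the version installed. The sketch also adds a Sort-Object step, which the original does not use, so that the "last" item is predictable.

```powershell
# Hypothetical paths standing in for the output of Get-ChildItem;
# real folder names vary by package version.
$CandidatePaths = @(
    'C:\Packages\AngleSharp\lib\net461\AngleSharp.dll'
    'C:\Packages\AngleSharp\lib\netstandard1.3\AngleSharp.dll'
    'C:\Packages\AngleSharp\lib\netstandard2.0\AngleSharp.dll'
)

# Keep only the .NET Standard builds, sort them so the order is predictable,
# and take the last one (the highest name alphabetically).
$Selected = $CandidatePaths |
    Where-Object { $_ -like '*standard*' } |
    Sort-Object |
    Select-Object -Last 1

$Selected
```

Note that the sort is alphabetical rather than version-aware, which is good enough for folder names such as these but is not a general version comparison.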

Read on to discover how to parse the webpage content and create a useful PowerShell object!

Parsing a Webpage

Of course, the whole point of this is to actually parse a web page. In this example, we will retrieve a page with Invoke-WebRequest and then parse the returned content with AngleSharp. We are going to use a local 7-day forecast from the National Weather Service to pull in weather data and convert it to an object. First, let’s retrieve the weather data.

$Request = Invoke-WebRequest -Uri "https://forecast.weather.gov/MapClick.php?lat=40.48675500000007&lon=-88.99177999999995"

The data for the site that we are interested in is in the Content property, but it is the full HTML source, which is a lot to process. Often, it is easiest to use Chrome Developer Tools to locate the section of the HTML source that we want to use (F12 in Chrome for the site you want to inspect).

 

The HTML structure of the NWS weather page.

Thankfully, there is a div container with an unordered list that we can parse. The next step is to actually load the retrieved content into AngleSharp.

$Parser = New-Object AngleSharp.Html.Parser.HtmlParser
$Parsed = $Parser.ParseDocument($Request.Content)

Now that we have the parsed content available in our $Parsed variable, we can start to narrow this data down to just the section we want. Very conveniently, the NWS site gives this unordered list its own ID, seven-day-forecast-list. Since an ID should be unique within an HTML page, this makes the list easy to target. Using the All property on our parsed content, we can retrieve just the element with the ID of seven-day-forecast-list.

$ForecastList = $Parsed.All | Where-Object ID -EQ 'seven-day-forecast-list'
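Filtering the All collection works, but AngleSharp documents also expose the usual DOM lookup methods. The sketch below is an alternative, not the approach this article uses: it parses a tiny stand-in HTML string (an assumption, far smaller than the real page) and looks up the list with GetElementById and QuerySelector. It assumes the AngleSharp assembly has already been loaded as shown earlier, and the $Sample*, $ById, and $BySelector variable names are illustrative.

```powershell
# Assumption: the AngleSharp assembly is already loaded via Add-Type.
if ('AngleSharp.Html.Parser.HtmlParser' -as [type]) {
    # A tiny stand-in document; the real forecast page is far larger.
    $SampleHtml = '<div><ul id="seven-day-forecast-list"><li>Today</li><li>Tonight</li></ul></div>'

    $SampleParser = New-Object AngleSharp.Html.Parser.HtmlParser
    $SampleParsed = $SampleParser.ParseDocument($SampleHtml)

    # Two equivalent lookups: by ID directly, or with a CSS ID selector.
    $ById       = $SampleParsed.GetElementById('seven-day-forecast-list')
    $BySelector = $SampleParsed.QuerySelector('#seven-day-forecast-list')

    # Show the ID found and how many child elements the list holds.
    $ById.Id
    $ById.Children.Length
}
else {
    Write-Warning 'AngleSharp is not loaded; run the Add-Type step first.'
}
```

Either lookup avoids enumerating every element in the document, which may matter on large pages.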

This will result in a lot of different properties, but we are focused on the ChildNodes property as it will contain each li containing the data we need. To get an idea of what we are looking to target in our object, let’s take a look at an individual li. There are a handful of elements with classes that we can target.

  • period-name – The relative time period.
  • short-desc – A condensed description of the weather.
  • temp temp-high – The high temperature.

 

HTML structure of a single tombstone-container element.

You may notice that the img tag contains an alt attribute with a lot of useful information. Finding the class to target is straightforward, as it is stored in the ClassName property of each child node. To target the alt attribute, we have to rely on a slightly different method, QuerySelectorAll, which uses standard CSS selectors to make complex targeting easy.

$ForecastList.ChildNodes | ForEach-Object {
  # Retrieve just the tombstone-container div underneath the forecast-tombstone li element.
  $Node = $_.ChildNodes | Where-Object ClassName -EQ 'tombstone-container'

  [PSCustomObject]@{
    # Find the child node with the class period-name and take its InnerHTML, which holds the text value.
    # The text includes a <br> element that we don't need, so replace it with a space.
    "Period" = $Node.ChildNodes.Where({ $_.ClassName -EQ 'period-name'}).InnerHTML -Replace "<br>"," "
    # -Match is used here because the class name is a compound value such as 'temp temp-high'.
    "Temp"   = $Node.ChildNodes.Where({ $_.ClassName -Match 'temp'}).InnerHTML
    "Short"  = $Node.ChildNodes.Where({ $_.ClassName -EQ 'short-desc'}).InnerHTML -Replace "<br>"," "
    # The img has no class to target, so use the CSS selector p > img, which matches an img element
    # that is a direct child of a p element. Then find the attribute named alt and return its value.
    "Alt"    = $Node.QuerySelectorAll("p > img").Attributes.Where({$_.Name -EQ 'alt'}).Value
  }
}
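The -Replace "<br>"," " calls above match only the exact text <br>. As a hedged variation, the sketch below uses a slightly more tolerant pattern that also accepts <br/> and <br /> and then tidies up any doubled spaces. The sample strings are hypothetical stand-ins for InnerHTML values, not output captured from the NWS page.

```powershell
# Hypothetical InnerHTML values, similar in shape to what forecast markup might contain.
$Samples = @(
    'This<br>Afternoon'
    'Tonight'
    'Mostly Sunny<br/>then Rain'
    'Partly<BR />Cloudy'
)

$Cleaned = $Samples | ForEach-Object {
    # -replace is a case-insensitive regex operator, so this pattern matches
    # <br>, <br/>, and <br /> in any letter case. The second -replace collapses
    # runs of whitespace into a single space.
    ($_ -replace '<br\s*/?>', ' ' -replace '\s+', ' ').Trim()
}

$Cleaned
```

Because -replace treats its first argument as a regular expression, any pattern containing characters such as . or ( would need escaping; the plain <br> pattern used in the article needs none.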

 

Output of the parsed web page from AngleSharp.
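The Temp value comes back as text, so filtering or sorting on it numerically needs one more step. The following is a minimal sketch of that conversion, assuming the text looks like "High: 75 °F" or "Low: 52 °F"; that format, the sample rows, and the Kind and Degrees property names are all assumptions for illustration rather than guaranteed output from the page.

```powershell
# Hypothetical rows shaped like the objects built above; the Temp text format is an assumption.
$Forecast = @(
    [PSCustomObject]@{ Period = 'Today';   Temp = 'High: 75 °F' }
    [PSCustomObject]@{ Period = 'Tonight'; Temp = 'Low: 52 °F' }
)

$Typed = $Forecast | ForEach-Object {
    # Named capture groups pull out the label and the number; rows that do not match are skipped.
    if ($_.Temp -match '^(?<Kind>High|Low):\s*(?<Degrees>-?\d+)') {
        [PSCustomObject]@{
            Period  = $_.Period
            Kind    = $Matches.Kind
            Degrees = [int]$Matches.Degrees
        }
    }
}

# With Degrees stored as an integer, numeric comparisons behave as expected.
$Typed | Where-Object Degrees -GT 60
```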

Although we have to iterate over a few elements to ultimately get to just the ones we want, we can walk through the HTML document structure and get to just what we need. It can be a bit tricky to understand the structures, but ultimately what AngleSharp is doing is creating objects for each DOM element. Once you figure out the best way to target the elements you need, extracting the content is not difficult.

Conclusion

AngleSharp offers an excellent programmatic interface for parsing and interacting with HTML content on webpages. This can open the door to using PowerShell to retrieve content that may otherwise be inaccessible. Taking this content, storing it, and using it in scripts is extremely useful and can aid system integration!

Adam Bertram
Adam Bertram is a 20+ year veteran of IT and an experienced online business professional. He’s a consultant, Microsoft MVP, blogger, trainer, published author and content marketer for multiple technology companies. Catch up on Adam’s articles at adamtheautomator.com, connect on LinkedIn, or follow him on Twitter at @adbertram.
