Date: 2020-12-03
I have a process where I need to copy all the images from a web page. I used to run this process with
xmllint, which processes an XML or HTML file and prints the entries you specify. But when my hosting provider upgraded their systems, they didn’t include
xmllint. So I had to find another way to extract a list of images from an HTML page. It turns out you can do this in Bash.
The read Statement
You may not think Bash can parse data files, but it can with some clever thinking. Bash, like other UNIX shells before it, can parse lines one at a time from a file via the built-in
read statement.
By default, the
read statement scans a line of data and splits it into fields. Usually,
read splits fields using spaces and tabs, with newlines ending each line, but you can change this behavior by setting the Internal Field Separator (
IFS) value and the end-of-line delimiter (
-d).
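As a small illustration of those two settings on non-HTML data, here is a minimal sketch. The sample string and the variable names `data`, `first`, and `second` are made up for this example and are not part of the original script.

```bash
# Hypothetical input: two comma-separated fields per record, records end with ';'
data='alpha,beta;gamma,delta;'

# IFS=',' applies only to this one read; -d ';' makes read stop at the first ';'
IFS=',' read -d ';' first second <<< "$data"

echo "first=$first second=$second"
```

Only the first record is consumed here. Calling `read` again on the same input stream would pick up the next record.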
To parse an HTML file using
read, set the
IFS to a greater-than symbol (
>) and the delimiter to a less-than symbol (
<). Each time Bash scans a line, it parses up to the next
< (the start of an HTML tag) then splits that data at each
> (the end of an HTML tag). This sample code takes a line of input and splits the data into the
TAG and
VALUE variables:
local IFS='>'
read -d '<' TAG VALUE
Let’s explore how this works. Consider this simple HTML file:
<img src="logo.png"
alt="My logo" />
<p>some text</p>
The first time
read parses this file, it stops at the first
< symbol. Since
< is the first character of this sample input, that means Bash finds an empty string. The resulting
TAG and
VALUE strings are also empty. But that’s fine for my use case.
The next time Bash reads the input, it gets
img src="logo.png"↲alt="My logo" />↲ with a newline right before the alt, and stops before the
< symbol on the next line. Then
read splits the line at the
> symbol, which leaves
TAG with
img src="logo.png"↲alt="My logo" / and
VALUE with just a newline.
The third time
read parses the HTML file, it gets
p>some text. Bash splits the string at the
>, resulting in
TAG containing
p and
VALUE with
some text.
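To watch those three reads happen, the walkthrough can be reproduced with a short sketch. It assumes the sample HTML is stored in a variable named `html` (a name chosen for this example) and sets `IFS` as a prefix on the `read` command rather than with `local`, since this code runs outside a function.

```bash
# Sample input from the walkthrough, with the newline before alt preserved
html='<img src="logo.png"
alt="My logo" />
<p>some text</p>'

# All three reads share one input stream, so each read continues
# where the previous one stopped.
{
   for n in 1 2 3; do
      IFS='>' read -d '<' TAG VALUE
      printf 'read %d: TAG=[%s] VALUE=[%s]\n' "$n" "$TAG" "$VALUE"
   done
} <<< "$html"
```

The brackets in the output make the empty strings and the embedded newlines visible.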
A Simple Parser
Now that you understand how to use
read, it’s easy to parse a longer HTML file with Bash. Start with a Bash function called
xmlgetnext to parse the data using
read, since you’ll be doing this again and again in the script. I named my function
xmlgetnext to remind me this is a replacement for the Linux
xmllint program, but I could have just as easily named it
htmlgetnext.
xmlgetnext () {
   local IFS='>'
   read -d '<' TAG VALUE
}
Now call that
xmlgetnext function to parse the HTML file. This is my complete
htmltags script:
#!/bin/bash
# print a list of all html tags
# (local and read -d are Bash features, so this needs bash rather than plain sh)

xmlgetnext () {
   local IFS='>'
   read -d '<' TAG VALUE
}

cat "$1" | while xmlgetnext ; do echo $TAG ; done
The last line is the key. It loops through the file using
xmlgetnext to parse the HTML, and prints out only the
TAG entries. And because $TAG is unquoted, the shell splits it on the standard field separators before
echo prints the pieces separated by single spaces, so any lines like
img src="logo.png"↲alt="My logo" / that contain a newline get printed on a single line, as
img src="logo.png" alt="My logo" /.
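The single-line output depends on leaving the variable unquoted. The sketch below compares the two forms, with `TAG` set by hand to the value `read` would produce for the sample image tag. The variable names `unquoted` and `quoted` are made up for this illustration.

```bash
# TAG as read would leave it for the sample image tag, embedded newline included
TAG='img src="logo.png"
alt="My logo" /'

unquoted=$(echo $TAG)     # the shell splits on whitespace; echo rejoins with spaces
quoted=$(echo "$TAG")     # the newline is passed through untouched

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

Quoting is normally the safer habit in shell scripts. Here the unquoted form is a deliberate shortcut to flatten each tag onto one line.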
To fetch just the list of images, I run the output of this script through
grep to only print the lines that have an
img tag at the start of the line.
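Put together, the filtering step might look like the sketch below. The sample page held in `page` and the `images` variable are assumptions made for this example, and the pattern `'^img '` is one reasonable way to match tags that begin with `img`. The original does not show the exact `grep` command.

```bash
xmlgetnext () {
   local IFS='>'
   read -d '<' TAG VALUE
}

# Hypothetical sample page with two images
page='<html><body>
<img src="logo.png"
alt="My logo" />
<p>some text</p>
<img src="photo.jpg" alt="A photo" />
</body></html>'

# Print every tag on its own line, then keep only the img tags
images=$(printf '%s\n' "$page" | while xmlgetnext ; do echo $TAG ; done | grep '^img ')

printf '%s\n' "$images"
```

With the script saved as `htmltags`, the equivalent command would be `./htmltags page.html | grep '^img '`, where `page.html` stands in for whatever file you are processing.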