How to use Bash associative arrays

Command interpreters and scripting languages like the Bash shell are essential tools of any operating system. Here's how to use powerful data structures in Bash called associative arrays or hashes.

In Bash, a hash is a data structure that can contain many sub-variables, of the same or different kinds, but indexes them with user-defined text strings, or keys, instead of fixed numeric identifiers. Besides being extremely flexible, hashes also make scripts more readable. If you need to process the areas of certain countries, for example, a syntax like:

print area_of('Germany')

would be as self-documenting as it can be, right?

How to create and fill Bash hashes

Bash hashes must be declared with the uppercase A switch (meaning Associative Array), and can then be filled by listing all their key/value pairs with this syntax:

# Country areas, in square miles
declare -A area_of
area_of=( [Italy]="116347" [Germany]="137998" [France]="213011" [Poland]="120728" [Spain]="192476" )

The first thing to notice here is that the order in which the elements are declared is irrelevant. The shell will just ignore it, and store everything according to its own internal algorithms. As proof, this is what happens when you retrieve those data as they were stored:

echo "${area_of[*]}"
213011 120728 137998 192476 116347
echo "${!area_of[*]}"
France Poland Germany Spain Italy

By default, the asterisk inside the square brackets extracts all and only the values of a hash. Adding the exclamation mark, instead, retrieves the hash keys. But in both cases there is no easily recognizable order.
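A hash does not have to be filled in one go. The following is a minimal sketch, assuming Bash 4 or later, that reuses the same countries and areas but assigns them one element at a time, and then lists keys and values with the quoted [@] form:

```shell
#!/usr/bin/env bash
# Sketch: fill the same hash one element at a time (needs Bash 4 or later).
declare -A area_of
area_of[Italy]=116347
area_of[Germany]=137998
area_of[France]=213011
# The += operator adds further pairs without touching the existing ones.
area_of+=( [Poland]=120728 [Spain]=192476 )

# Quoted [@] expansions keep every key or value as a separate word,
# which matters as soon as a key contains spaces.
printf 'key: %s\n' "${!area_of[@]}"
printf 'value: %s\n' "${area_of[@]}"
```

The order of the printed lines is still decided by the shell, exactly as with the asterisk form.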

You may also populate a hash dynamically, by calling other programs. If you, for example, had another shell script called hash-generator, that outputs all the pairs as one properly formatted string:

#! /bin/bash
printf '[Italy]="116347" [Germany]="137998" [France]="213011" [Poland]="120728" [Spain]="192476"'

calling hash-generator in this way from the script that actually uses the area_of hash:

VALS=$( hash-generator )
eval "declare -A area_of=( $VALS )"

would fill that hash with exactly the same keys and values. Of course, the message here is that "hash-generator" can be any program, maybe much more powerful than Bash, as long as it can output data in that format. To fill a hash with the content of an already existing plain text file, instead, follow these suggestions from Stack Overflow.
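For the plain text file case, one possible approach (a sketch, not necessarily the one suggested in that discussion) is a simple read loop, which also avoids eval. The file layout used here, one "key:value" pair per line, and the names data_file and area_from_file are assumptions made only for this example:

```shell
#!/usr/bin/env bash
# Sketch: load a hash from a plain text file, one "key:value" pair per line.
# The temporary file and the "key:value" layout are assumptions for illustration.
data_file=$(mktemp)
printf '%s\n' 'Italy:116347' 'Germany:137998' 'France:213011' > "$data_file"

declare -A area_from_file
while IFS=: read -r country area
do
  # Skip empty lines and lines without a key.
  [ -n "$country" ] || continue
  area_from_file[$country]=$area
done < "$data_file"
rm -f "$data_file"

echo "Loaded ${#area_from_file[@]} countries"
echo "France: ${area_from_file[France]}"
```

Because the values are never passed through eval, the content of the file cannot be executed as shell code.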

How to process hashes

The exact syntax to refer to a specific element of a hash, or delete it, is this:

echo "${area_of[Germany]}"
unset 'area_of[Germany]'

To erase a whole hash, pass just its name to unset, and then re-declare it:

unset area_of
declare -A area_of
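The operations above can be tried together in a short, self-contained sketch. It uses a reduced copy of the area_of hash, and checks whether a key still exists with the ${variable+word} expansion, which is empty only when the element is unset:

```shell
#!/usr/bin/env bash
# Sketch: read, count, delete and test elements of a small hash.
declare -A area_of=( [Italy]=116347 [Germany]=137998 [France]=213011 )

echo "Germany: ${area_of[Germany]}"    # read one element
echo "Pairs stored: ${#area_of[@]}"    # prints 3

# Quote the argument of unset, so the shell cannot treat [..] as a glob pattern.
unset 'area_of[Germany]'

# The +isset expansion yields nothing when the key is not in the hash.
if [ -z "${area_of[Germany]+isset}" ]
then
  echo "Germany is no longer in the hash"
fi
echo "Pairs stored: ${#area_of[@]}"    # prints 2
```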

The number of key/value pairs stored in a hash is returned by the expansion "${#HASHNAME[@]}" (don't look at me, I did not invent this syntax). But if all you need is to process all the elements of a hash, regardless of their number or internal order, just follow this example:

for country in "${!area_of[@]}"
do
echo "Area of $country: ${area_of[$country]} square miles"
done

whose output starts like this:

Area of France: 213011 square miles
Area of Poland: 120728 square miles
Area of Germany: 137998 square miles

You can use basically the same procedure to create a "mirror" hash, with keys and values inverted:

declare -A country_whose_area_is
for country in "${!area_of[@]}"; do
country_whose_area_is[${area_of[$country]}]=$country
done

Among other things, this "mirroring" may be the easiest way to process the original hash looking at its values, instead of keys.
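One caveat is that a mirror hash only works cleanly when every value is unique, because two countries with the same area would end up under the same key. Here is a minimal sketch, on a reduced copy of area_of, that builds the mirror, warns about such collisions and then looks up a country by its area:

```shell
#!/usr/bin/env bash
# Sketch: build a mirror hash and look up a country by its area.
declare -A area_of=( [Italy]=116347 [Germany]=137998 [France]=213011 )
declare -A country_whose_area_is

for country in "${!area_of[@]}"
do
  area=${area_of[$country]}
  # Two countries with the same area would overwrite each other in the
  # mirror, so warn when a value has already been used as a key.
  if [ -n "${country_whose_area_is[$area]+isset}" ]
  then
    echo "Warning: area $area is shared by more than one country" >&2
  fi
  country_whose_area_is[$area]=$country
done

echo "137998 square miles is the area of ${country_whose_area_is[137998]}"
```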

How to sort hashes

If hash elements are stored in semi-random sequences, what is the most efficient way to handle them in any alphanumerical order? The answer is that it depends on what exactly should be ordered, and when. In the many cases where only the final output of a loop must be sorted, all that is needed is a sort command right after the closing statement:

for country in "${!area_of[@]}"
do
  echo "$country: ${area_of[$country]} square miles"
done | sort

This sorts the output by key (even though the keys were not retrieved in that order!):

France: 213011 square miles
Germany: 137998 square miles
Italy: 116347 square miles

Sorting the same lines numerically, by country area, is almost as easy. Prepending the areas at the beginning of each line:

for aa in "${!area_of[@]}"
do
  printf "%s|%s = %s square miles\n" "${area_of[$aa]}" "$aa" "${area_of[$aa]}"
done

yields lines like these:

213011|France = 213011 square miles
120728|Poland = 120728 square miles
137998|Germany = 137998 square miles

that, while still unsorted, now start with just the strings on which we want to sort. Therefore, using sort again, but piped to the cut command with "|" as column separator:

for aa in "${!area_of[@]}"
do
  printf "%s|%s = %s square miles\n" "${area_of[$aa]}" "$aa" "${area_of[$aa]}"
done | sort -n | cut -d'|' -f2-

will sort by areas and then remove them, to finally produce the desired result:

Italy = 116347 square miles
Poland = 120728 square miles
Germany = 137998 square miles
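When the sorted order is needed inside the loop, and not just in its final output, another option is to sort the keys first and store them in a normal array. The sketch below does this with the mapfile builtin; the name sorted_countries is made up for this example, and the approach assumes that no key contains a newline:

```shell
#!/usr/bin/env bash
# Sketch: iterate over a hash in alphabetical key order.
declare -A area_of=( [Italy]=116347 [Germany]=137998 [France]=213011 [Poland]=120728 [Spain]=192476 )

# Copy the keys into a regular array, already sorted alphabetically.
# LC_ALL=C makes the ordering independent of the user's locale.
mapfile -t sorted_countries < <(printf '%s\n' "${!area_of[@]}" | LC_ALL=C sort)

for country in "${sorted_countries[@]}"
do
  echo "$country: ${area_of[$country]} square miles"
done
```

The hash itself stays untouched: only the auxiliary array decides the order in which its elements are visited.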

Multi-level hashes

While Bash does not support nested, multi-level hashes, it is possible to emulate them with some auxiliary arrays. Consider this code, that stores the areas of European regions, while also cataloging them by country:

 1  declare -a european_regions=('Bavaria' 'Lazio' 'Saxony' 'Tuscany')
 2  declare -a european_countries=('Italy' 'Germany')
 3  declare -A area_of_country_regions
 4  area_of_country_regions=( [Lazio in Italy]="5000" [Tuscany in Italy]="6000" [Bavaria in Germany]="9500" [Saxony in Germany]="7200" )
 5  
 6  for country in "${european_countries[@]}"
 7  do
 8   for region in "${european_regions[@]}"
 9     do
10       cr="$region in $country"
11       if test "${area_of_country_regions[$cr]+isset}"
12         then
13         printf "Area of %-20.20s: %s\n" "$cr" "${area_of_country_regions[$cr]}"
14         fi
15     done
16  done

The code creates two normal arrays, one for countries and one for regions, plus one hash with composite keys that associate each region to its country and emulate a two-level hash. The code then generates all possible combinations of regions and countries, but only processes existing elements of area_of_country_regions, recognizing them with the +isset test of line 11. Rough, but effective, isn't it?
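A possible variation, shown here only as a sketch, walks the composite keys directly and splits each one back into region and country with parameter expansion, so the two auxiliary arrays are not needed. It assumes that no country name contains the text " in ", and the total_area_of hash is a name invented for this example:

```shell
#!/usr/bin/env bash
# Sketch: split composite "region in country" keys and sum areas per country.
declare -A area_of_country_regions=( ["Lazio in Italy"]=5000 ["Tuscany in Italy"]=6000 ["Bavaria in Germany"]=9500 ["Saxony in Germany"]=7200 )
declare -A total_area_of

for cr in "${!area_of_country_regions[@]}"
do
  region=${cr% in *}      # text before the last " in "
  country=${cr##* in }    # text after the last " in "
  echo "$region belongs to $country"
  # Start from 0 the first time a country is seen, then add the region's area.
  total_area_of[$country]=$(( ${total_area_of[$country]:-0} + ${area_of_country_regions[$cr]} ))
done

for country in "${!total_area_of[@]}"
do
  echo "Total area listed for $country: ${total_area_of[$country]}"
done
```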
