Given the following input file.

file2.txt

-123456.12 -1234567.12 912300001     1000 123.1 123.1 ab 1 3
-123466.12 -1234667.12 912300001     1001 133.1 124.1 ab 2 2
-123476.12 -1234767.12 912300001     1002 143.1 125.1 ab 3 1
-123486.12 -1234867.12 922300001     1000 153.1 126.1 ab 3 1
-123496.12 -1234967.12 922300001     1001 163.1 127.1 ab 2 2
-123446.12 -1234367.12 922300001     1002 173.1 128.1 ab 1 3

The 3rd column (e.g. 912300001) is the base of a filename which exists in the same directory which we need to lookup using the 4th column (e.g. 1000)

Here is an example of the structure of the file we need to lookup.

912300001.file1.txt

VALUE   1000    -21 -21
VALUE   1001    -23 -22
VALUE   1002    -23 -21

We need to match on the 2nd column of this file and if there is a match we want to take the 3rd and 4th columns and merge/join with the original “row” from file2.txt resulting in the following output.

-123456.12 -1234567.12 912300001     1000 123.1 123.1 ab 1 3 -21 -21
-123466.12 -1234667.12 912300001     1001 133.1 124.1 ab 2 2 -23 -22
-123476.12 -1234767.12 912300001     1002 143.1 125.1 ab 3 1 -23 -21

We would like to complete this task using bash if possible.

Code

We will start with the output.

$ cat 912300001.file1.txt 
VALUE   1000    -21 -21
VALUE   1001    -23 -22
VALUE   1002    -23 -21
$ cat 922300001.file1.txt 
VALUE   1000    Y Y
VALUE   1001    X X
VALUE   1002    Z Z
$ bash filemerge
-123456.12 -1234567.12 912300001     1000 123.1 123.1 ab 1 3 -21 -21
-123466.12 -1234667.12 912300001     1001 133.1 124.1 ab 2 2 -23 -22
-123476.12 -1234767.12 912300001     1002 143.1 125.1 ab 3 1 -23 -21
-123486.12 -1234867.12 922300001     1000 153.1 126.1 ab 3 1 Y Y
-123496.12 -1234967.12 922300001     1001 163.1 127.1 ab 2 2 X X
-123446.12 -1234367.12 922300001     1002 173.1 128.1 ab 1 3 Z Z

Here is the code we used.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#!/bin/bash while read -r line do read -r _ _ name value _ <<< "$line" [[ $name != $prev_name ]] && { declare -A lookup=() while read -r _ key value do lookup[$key]=$value done < "$name.file1.txt" } prev_name=$name echo "$line ${lookup[$value]}" done < file2.txt

We start by reading file2.txt line-by-line.

The usual way to read a file line-by-line in the shell is by using a while read loop e.g.

1 2 3 4
while read -r line do # do something with line done < filename

line here is the name of a variable we have chosen which will be set by bash and populated by each line of filename

You can omit a variable name and read will populate the default variable REPLY instead.

read -r

The -r option of read has to do with backslash handling.

$ read -r foo bar baz <<< 'one two three'
$ declare -p foo bar baz
declare -- foo="one"
declare -- bar="two"
declare -- baz="three"

(note: declare -p is just being used to show us the contents of the variables.)

When read is used without -r backslashes are treated “specially” e.g.

$ read foo bar baz <<< 'one\ two three'
$ declare -p foo bar baz
declare -- foo="one two"
declare -- bar="three"
declare -- baz=""

Because we have quoted the space using a backslash the foo bar is treated as a single “word” meaning we have a total of 2 words as opposed to 3 which results in the baz variable being empty.

<<<

The <<< construct used here is called a Here String which sends a string to the stdin of a command.

$ sed 's/foo/bar/' <<< foobar
barbar
$ echo foobar | sed 's/foo/bar/'
barbar

If you look closely at the input file file2.txt you may notice that the “filename” and “value” columns are sorted. This means that we only ever need to have 1 lookup file stored in an array at any given time.

Each time we encounter a “new” filename we can load that file into an array to use for our lookups.

In order to do so we need to extract the “filename” and “value” columns from each line in file2.txt which we do on line 5 of the code.

Let’s take a closer look at how that works with an example.

$ line='-123456.12 -1234567.12 912300001     1000 123.1 123.1 ab 1 3'
$ read -r _ _ name value _ <<< "$line"
$ declare -p name value
declare -- name="912300001"
declare -- value="1000"

It’s common to use _ as a “placeholder” (or “throwaway”) variable name if you do not intend to use its value.

Another option for “splitting a row into columns” could be to use the -a option of read which creates an array variable.

$ read -r -a columns <<< "$line"
$ declare -p columns
declare -a columns='([0]="-123456.12" [1]="-1234567.12" [2]="912300001" [3]="1000" 
[4]="123.1" [5]="123.1" [6]="ab" [7]="1" [8]="3")'

After we extract the “filename” we need to test if it is a “new” filename which brings us to line 6

[[ $name ne $prev_name ]] && {

prev_name has not yet been defined in the code which means that the first time this test executes $prev_name will default to an “empty string” i.e. [[ $name ne '' ]] which will return True.

declare -A

When this test returns True it means we have encountered a “new” filename in which case we create our array (named lookup) and process $name.file1.txt line-by-line extracting the “value” column as our array “key” and columns 3 and 4 as our array “value”.

1 2 3 4 5
declare -A lookup=() while read -r _ key value do lookup[$key]=$value done < "$name.file1.txt"

With declare -A lookup=() we are declaring lookup as an empty associative array each time as we don’t want to remember values from any previous files.

The declare -A part is what specifies that lookup is an associative array but what exactly is an “associative array” and why are we using one?

In the example files shown the “value” column (e.g. 1000) is an integer which is a valid array index. If however, we had a value of 0030 we would get some unexpected behaviour.

$ lookup=()
$ lookup["0030"]=moo
$ declare -p lookup
declare -a lookup='([24]="moo")'

bash treats numbers (even though it is quoted here) with leading zeros as octal values and it converts 30 into octal 24 which may cause some surprise.

bash added associative arrays in version 4 and they can be created using the declare -A name syntax.

An associative array allows us to use strings as the keys as opposed to just “integers” and as such there will be no attempted conversion on the key.

$ declare -A lookup=()
$ lookup["0030"]=moo
$ declare -p lookup
declare -A lookup='([0030]="moo" )'

It also means that we can use strings as mentioned.

$ lookup[hello]=hi
$ declare -p lookup
declare -A lookup='([0030]="moo" [hello]="hi" )'

After the array has been populated we use prev_name=$name to save the filename which we will test against when the next line of input is processed.

${var}

The $name variable inside $name.file1.txt expands correctly because . characters are not allowed in variable names meaning bash knows where the variable name ends.

$ my.var=1
-bash: my.var=1: command not found

If this were not the case we could use surrounding braces i.e. ${name} to give bash a hint.

$ name=filename
$ echo "$name.ext"
filename.ext
$ echo "$nameext"

$ echo "${name}ext"
filenameext

The $nameext example produces an empty value because bash thinks we’re using the variable nameext which does not exist.

Finally, we echo "$line ${lookup[$value]}" to print out the merged/joined result.

This code assumes there is a matching value in the lookup and there is no error checking implemented.

Perl

If you do not have Bash 4 another option could be to use perl and here is a possible solution written as a one-liner.

$ perl -lape '%lookup = do { open fh, "<", "$F[2].file1.txt"; map { /\S+\s+(\S+)\s+(.+)/ } readline fh }, $name = $F[2] if $F[2] ne $name; $_ = "$_ $lookup{$F[3]}"' file2.txt 
-123456.12 -1234567.12 912300001     1000 123.1 123.1 ab 1 3 -21 -21
-123466.12 -1234667.12 912300001     1001 133.1 124.1 ab 2 2 -23 -22
-123476.12 -1234767.12 912300001     1002 143.1 125.1 ab 3 1 -23 -21
-123486.12 -1234867.12 922300001     1000 153.1 126.1 ab 3 1 Y Y
-123496.12 -1234967.12 922300001     1001 163.1 127.1 ab 2 2 X X
-123446.12 -1234367.12 922300001     1002 173.1 128.1 ab 1 3 Z Z

Let’s rewrite it on multiple lines to try to explain what’s going on.

1 2 3 4 5 6 7
$ perl -lape ' %lookup = do { open fh, "<", "$F[2].file1.txt"; map { /\S+\s+(\S+)\s+(.+)/ } readline fh }, $name = $F[2] if $F[2] ne $name; $_ = "$_ $lookup{$F[3]}" ' file2.txt

First the options (or switches):

  • -l removes (or chomps) the line endings from each line of input
  • -a auto-splits the line into “columns” and stores them in the array named @F
  • -p adds a while (readline) { ...; print } loop around our code

These are all documented inside perlrun.pod which you can read using the perldoc perlrun command or in a browser on metacpan.org.

We can use -MO=Deparse to tell perl to show us what the “compiled” code looks like.

$ perl -MO=Deparse -lape ''
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
    chomp $_;
    our(@F) = split(' ', $_, 0);
}
continue {
    die "-p destination: $!\n" unless print $_;
}
-e syntax OK

So onto the rest of our code:

  • %lookup = do { ... }, $name = $F[2] if $F[2] ne $name

This declares a hash (which is what Perl calls an associative array) as the result of a do block which inside the we open $F[2].file1.txt and use map { } readline fh to process each line of the file.

$F[2] is the 3rd column of the line i.e. the “filename” (arrays are indexed starting from 0).

/\S+\s+(\S+)\s+(.+)/ is a regex match which extracts column 2 along with columns 3 and 4 which are then returned by the map as key-value pairs and are stored into the %lookup hash.

$ perl -MData::Dumper -e 'print Dumper { map { /\S+\s+(\S+)\s+(.+)/ } readline }' 912300001.file1.txt 
$VAR1 = {
          '1001' => '-23 -22',
          '1002' => '-23 -21',
          '1000' => '-21 -21'
        };

In %lookup = do { ... }, $name = $F[2] we are using the Comma Operator to chain them together meaning they are both executed only if $F[2] ne $name returns True as if we had written:

1 2 3 4
if ($F[2] ne $name) { %lookup = do { ... }; $name = $F[2] }

This means that we only populate %lookup if we have “found a new filename” similar to how the [[ $name ne $prev_name ]] test works from the bash version.

Finally we have $_ = "$_ $lookup{$F[3]}" which makes use of $_ which is called the Default Variable.

If you recall from earlier we said -p added a while (readline) { ...; print } loop around our code. print is the same as print $_ i.e. it operates on the default variable (as do many other things in Perl).

while (readline) also populates the $_ variable which means in our code $_ contains the current line so we are simply adding columns 3 and 4 onto the end of the line and the implicit print call will print them out.

We could have also used .= to concatenate e.g. $_ .= " $lookup{$F[3]}" which would have saved 1 character if playing golf is your thing.