Given file-a, I need to find all files recursively (“using bash”)
that contain all of the lines from
file-a, regardless of order.
How can I do this?
Usually “using bash” just means “from the command-line”, so
we’re just going to skip ahead directly to Python, as
doing this “in bash” is a world of pain.
Let’s create some example files for testing.
    $ cat all-lines/file-a
    a
    b
    c
    $ cat all-lines/dir/file1
    c
    b
    a
    $ cat all-lines/dir/file2
    a
    b
    c
    d
    $ cat all-lines/dir/file3
    d
    c
    e
So we have 2 matching files here: file1 and file2.
First we have the output.
    $ python find-files.py
    all-lines/dir/file1
    all-lines/dir/file2
The code that generated it.
    import os

    file_a = 'all-lines/file-a'
    start_dir = 'all-lines/dir'

    with open(file_a) as f:
        lines_a = set(line for line in f)

    for root, dirs, files in os.walk(start_dir):
        for filename in files:
            path = os.path.join(root, filename)
            with open(path) as f:
                lines_b = set()
                for line in f:
                    lines_b.add(line)
                    #if lines_a.issubset(lines_b):
                    if lines_a <= lines_b:
                        print(path)
                        break
If you’ve not seen the with statement before,
it’s being used here because when the block exits it will
call close() on the file for us.
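As a small illustration of that behaviour, here is a minimal sketch that checks the file's closed attribute inside and after the with block. The temporary file and the variable names inside and after are only there for the demonstration and are not part of the original script.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, 'w') as f:
    f.write('a\nb\nc\n')

with open(path) as f:
    first_line = f.readline()
    inside = f.closed    # the file is still open inside the block

after = f.closed         # close() has been called for us on exit

os.remove(path)
print(inside, after)
```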
We’re creating the set
lines_a to store all the lines from
file-a, which we will compare against the lines of each candidate file.
Two things to note:
- this assumes file-a fits into memory
- line endings are not being removed
If we wanted to ignore line endings we could instead store
line.rstrip('\r\n') in our sets.
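To see why this matters, here is a rough sketch comparing the raw and stripped approaches. It uses plain lists of strings in place of real files, and the names unix_lines and dos_lines are made up for the example.

```python
# Lines as they might be read from two files with different line endings.
unix_lines = ['a\n', 'b\n', 'c\n']
dos_lines = ['c\r\n', 'b\r\n', 'a']   # CRLF endings, no newline after the last line

# Comparing the raw lines fails because 'a\n' != 'a' and 'b\n' != 'b\r\n'.
raw_match = set(unix_lines) <= set(dos_lines)

# Stripping trailing '\r' and '\n' characters first makes the contents comparable.
stripped_a = {line.rstrip('\r\n') for line in unix_lines}
stripped_b = {line.rstrip('\r\n') for line in dos_lines}
stripped_match = stripped_a <= stripped_b

print(raw_match, stripped_match)
```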
A set is an unordered collection with no duplicate elements. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.
    >>> x = set('abcd')
    >>> y = set('abcdef')
    >>>
    >>> x <= y
    True
    >>> x.issubset(y)
    True
<= here is the same as using issubset(),
which provides a simple way to test if all the elements in
x are contained in y.
To process a directory structure recursively in Python we can use os.walk(),
which would be like using the find command.
os.walk() yields a 3-tuple
(dirpath, dirnames, filenames):
- dirpath being the path to the current directory being processed
- dirnames being a list of directory names contained in dirpath
- filenames being a list of file names contained in dirpath
As filenames only gives us the names and not the full path, we use
os.path.join()
to build the full path to the file in our code. Also note that
root appears to be
the name most commonly used in examples to store the dirpath value.
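Here is a minimal, self-contained sketch of that pattern. It builds a small temporary directory tree (the names sub, file1 and file2 are invented for the example) and collects each file's path relative to the starting directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as start_dir:
    # Build a tiny tree: start_dir/file1 and start_dir/sub/file2
    os.makedirs(os.path.join(start_dir, 'sub'))
    for rel in ('file1', os.path.join('sub', 'file2')):
        with open(os.path.join(start_dir, rel), 'w') as f:
            f.write('a\n')

    found = []
    for root, dirs, files in os.walk(start_dir):
        for filename in files:
            # filename is just the name, so join it with root for the full path
            full_path = os.path.join(root, filename)
            found.append(os.path.relpath(full_path, start_dir))

found.sort()
print(found)
```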
So for each file we create our
lines_b set and process the file
line-by-line adding each line to the set.
The reason we’re processing each
file-b line-by-line and adding to the set as we go
is that this allows us to stop processing the file as soon as there
is a match. If there is a match we use
break, which exits the inner for loop.
If we had a large file that contained a match in the first few lines, this would save us from having to process the full file. It could also allow us to process files that are “too large to fit into memory”, provided the match comes before the set of lines seen so far grows too large.
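One possible variation on this idea, which is not the approach used above, is to track which lines from file-a are still missing rather than storing every line seen so far. Memory use is then bounded by the size of file-a, however large the other file is. The sketch below assumes a hypothetical helper named contains_all_lines and recreates the example files in a temporary directory so it can run on its own.

```python
import os
import tempfile


def contains_all_lines(path, wanted):
    """Return True if every line in `wanted` appears somewhere in the file at `path`."""
    remaining = set(wanted)        # copy, so the caller's set is left untouched
    if not remaining:
        return True                # nothing to look for is trivially satisfied
    with open(path) as f:
        for line in f:
            remaining.discard(line)
            if not remaining:      # every wanted line has been seen: stop early
                return True
    return False


with tempfile.TemporaryDirectory() as top:
    # Recreate the example files from the start of the post.
    samples = {
        'file-a': 'a\nb\nc\n',
        os.path.join('dir', 'file1'): 'c\nb\na\n',
        os.path.join('dir', 'file2'): 'a\nb\nc\nd\n',
        os.path.join('dir', 'file3'): 'd\nc\ne\n',
    }
    os.makedirs(os.path.join(top, 'dir'))
    for rel, text in samples.items():
        with open(os.path.join(top, rel), 'w') as f:
            f.write(text)

    with open(os.path.join(top, 'file-a')) as f:
        lines_a = set(f)

    matches = []
    for root, dirs, files in os.walk(os.path.join(top, 'dir')):
        for filename in files:
            if contains_all_lines(os.path.join(root, filename), lines_a):
                matches.append(filename)

matches.sort()
print(matches)
```

As with the original script, line endings are compared as-is, so a final line without a trailing newline would not match its newline-terminated counterpart.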