We are given the following “CSV” data which we are asked to “fix” using Python:
10AT&T$146,801 11General Electric$140,389 12AmerisourceBergen$135,962 13Verizon$131,620 14Chevron$131,118 15Costco$116,199 16Fannie Mae$110,359 17Kroger$109,830 18Amazon.com$107,006 19Walgreens Boots Alliance$103,444 20HP$103,355
As it turns out it’s not really CSV data at all as the columns are all merged into one. The data looks like it came from the Fortune 500 list albeit “incorrectly”.
So what do we know? We know the rank is going to be the first sequence of digits on the line. The name of the company will then be everything up to but not including the dollar sign. The revenue will be everything after the dollar sign from which we want to remove the comma.
This sounds like a job for a regular expression.
1 2 3 4 5 6
import fileinput, re for line in fileinput.input(): rank, name, revenue = re.search(r'(\d+)(.+)\$(.+)', line).groups() revenue = revenue.replace(',', '') print(','.join([rank, name, revenue]))
fileinput module is a simple way to write a filter-like program that
processes input line by line. It handles reading data from filenames passed
as arguments and data given on stdin. It can also be used to overwrite
files with the new data like how the
-i option works on
perl and some
So for the regex we have
(\d+) which matches and captures
1 or more digits i.e. the rank number.
() is used to
create a capture group. We could have used
^ (which matches the start
of the string) in our pattern but all of the lines we’re working with
start with digits so it’s not needed.
(.+)\$ matches and captures everything up to the occurrence of a dollar sign.
matches the end of the string in regex so it needs to be escaped to match literally.
(.+) captures everything else on the line i.e. the revenue amount.
We then remove any commas from the revenue amount with
.replace(',', '') and join
the values back using comma as the delimiter.
Here is a sample run:
$ python not-quite-csv.py not-quite.csv 10,AT&T,146801 11,General Electric,140389 12,AmerisourceBergen,135962 13,Verizon,131620 14,Chevron,131118 15,Costco,116199 16,Fannie Mae,110359 17,Kroger,109830 18,Amazon.com,107006 19,Walgreens Boots Alliance,103444 20,HP,103355
','.join() we could use the
csv module which would handle things
like automatic quoting of fields that contain the delimiter but the sample data
we were given was “simple” meaning that it was not possible for the company
names to contain a comma.
If you do not have to use Python specifically you may also consider using
perl to solve this problem:
$ perl -pe 's/(\d+)(.+)\$(.+)/"$1,$2,".$3=~s\/,\/\/gr/e' not-quite.csv 10,AT&T,146801 11,General Electric,140389 12,AmerisourceBergen,135962 13,Verizon,131620 14,Chevron,131118 15,Costco,116199 16,Fannie Mae,110359 17,Kroger,109830 18,Amazon.com,107006 19,Walgreens Boots Alliance,103444 20,HP,103355
-p option adds a loop like the
for line in fileinput.input() loop
from the Python example. The
s/// command is the same as
takes a pattern and a replacement. The pattern here is the same as we used
in the Python code so it needs no explanation.
So we’re using the
e modifier on the
s command i.e.
modifier evaluates the replacement part as perl code.
In the replacement
"$1,$2," is just creating a string of the first 2
capture groups followed by commas.
. is the string concatenation operator like
"foo" + "bar" in Python.
To the string we are adding
$3=~s\/,\/\/gr which is actually another
s/// command embedded inside. This is why the
/ are escaped although
we could have chosen a different delimiter e.g.
This removes all commas from
$3 which is the revenue amount. Without
g modifier it would just remove the first similar to
.replace(',', '', 1) in Python.
By default the
s/// command does not return the modified string. In
order to do that you must use the
So this means the final string we have in our replacement is “rank,name,revenue with commas removed”.
If we take another look at an input line can we see another way to achieve the intended output?
If we removed the last comma in the string replace the
with a comma and added a comma after the rank number we would
get the current expected output:
$ perl -pe 's/.+\K,//; s/\$/,/; s/\d+\K/,/' not-quite.csv 10,AT&T,146801 11,General Electric,140389 12,AmerisourceBergen,135962 13,Verizon,131620 14,Chevron,131118 15,Costco,116199 16,Fannie Mae,110359 17,Kroger,109830 18,Amazon.com,107006 19,Walgreens Boots Alliance,103444 20,HP,103355
s/.+\K,// removes the last comma in the string. Using capture
groups you could write it as
s/(.+),/$1/ which may look more
Anything to the left of
\K is not included in the match which
can be useful in simplifying patterns.
s/\$/,/ replaces dollar sign with a comma. We could be more strict
and use the
.+\K\$ approach to stipulate it’s the last dollar sign
but there are no other instances of it in our data.
s/\d+\K/,/ adds a comma after the first sequence of
This approach makes assumptions about the data as it only strips the last comma. If the revenue amount was larger and contained multiple commas it would break. It does work for the data we were given and serves as an example approaching the problem from a different perspective.
None of these approaches modify the input file. I find it’s usually simpler to just use shell redirection to output the result to a new file e.g.
$ perl -pe 's/.+\K,//; s/\$/,/; s/\d+\K/,/' not-quite.csv > fortune500.csv
Do note that redirections are set up before command execution so the output filename must differ from the input filename.
You could however use perl’s
-i option and pass
fileinput.input() in the Python example if you wished to do so.