Comparing CSV files with the Linux shell


My heart sinks whenever I witness people resorting to Excel to compare large files. It's kind of OK when the files to be compared are below 10K rows in size... anything bigger than that and the time it takes to select the rows to compare (and the comparison itself) becomes too much of a frustration.

The alternative is to use the command line of course.

Here is how to compare two CSV files, file1.csv and file2.csv, column by column, on a Linux system.

step 1. copy all lines containing the search string from the first input file
grep 'searchString' file1.csv > f1.csv
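
If the search string contains characters that grep treats as regular-expression syntax (dots, brackets and so on), adding -F makes it match the text literally instead:

grep -F 'searchString' file1.csv > f1.csv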


step 2. extract the relevant columns (here columns 3, 4 and 5)
cut -d',' -f3,4,5 f1.csv  > cols_f1.csv
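
Since the columns are consecutive, cut also accepts a range, which saves a little typing. Note that cut splits on every comma, so this only works cleanly if no field contains an embedded comma inside quotes:

cut -d',' -f3-5 f1.csv > cols_f1.csv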


step 3. sort on the 2nd column (for example)
sort -k2,2 -t"," cols_f1.csv > sorted_f1.csv
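
The -k2,2 key specification restricts the sort key to the second field only (a bare -k2 would sort on everything from the second field to the end of the line). If the ordering of a particular column is not important, a plain whole-line sort is simpler, and it is exactly the order that comm expects in the final step:

sort cols_f1.csv > sorted_f1.csv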


step 4. remove duplicates
uniq sorted_f1.csv > uniq_f1.csv
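
uniq only removes adjacent duplicate lines, which is why the sort in step 3 has to come first. If the column-2 ordering is not important, sort -u does the sorting and the de-duplication in one go (sorting on the whole line):

sort -u cols_f1.csv > uniq_f1.csv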




Quite a bit of typing here... and that's only to prepare the data from the first file. Fortunately, all of these commands can be piped together.

steps 1, 2, 3 and 4, piped into one command, for the second file:

grep 'searchString' file2.csv | cut -d',' -f3,4,5 | sort -k2,2 -t"," | uniq > uniq_f2.csv
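
And the same one-liner, pointed at the first file, regenerates uniq_f1.csv:

grep 'searchString' file1.csv | cut -d',' -f3,4,5 | sort -k2,2 -t"," | uniq > uniq_f1.csv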



And finally, the last step: show all lines unique to file1, all lines unique to file2, and all lines common to both files, arranged in a three-column output. Note that comm expects both input files to be sorted in the same whole-line order; if you sorted on a specific column instead, comm may complain that its input is not in sorted order (a plain sort, or sort -u as mentioned above, avoids this).

comm uniq_f1.csv uniq_f2.csv
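
comm can also filter its output instead of printing all three columns: the -1, -2 and -3 flags each suppress one column, and they can be combined to isolate a single set of lines:

comm -23 uniq_f1.csv uniq_f2.csv    # lines only in file1
comm -13 uniq_f1.csv uniq_f2.csv    # lines only in file2
comm -12 uniq_f1.csv uniq_f2.csv    # lines common to both files

And for the truly typing-averse, the whole recipe can be wrapped in a small shell function. This is just a rough sketch (the name csvcmp is made up, and it assumes bash for the <(...) process substitution); it takes the search string, the column list and the two files as arguments, and uses sort -u so that comm gets input sorted the way it likes:

# usage: csvcmp 'searchString' 3,4,5 file1.csv file2.csv
csvcmp() {
    pattern=$1; cols=$2; f1=$3; f2=$4
    comm <(grep "$pattern" "$f1" | cut -d',' -f"$cols" | sort -u) \
         <(grep "$pattern" "$f2" | cut -d',' -f"$cols" | sort -u)
}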
