Comparing CSV files with the Linux shell


My heart sinks whenever I witness people resorting to Excel to compare large files. It's kind of OK when the files to be compared are below 10K rows in size... anything bigger than that and the time it takes to select the rows to compare (and the comparison itself) becomes too much of a frustration.

The alternative is to use the command line of course.

Here is how to compare two CSV files, file1.csv and file2.csv, column by column, on a Linux system.

step 1. copy all lines containing the search string from the first input file
grep 'searchString' file1.csv > f1.csv
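
If the search string contains characters that grep treats as regular-expression syntax (dots, brackets and so on), adding -F makes it match the text literally instead:

grep -F 'searchString' file1.csv > f1.csv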


step 2. extract the relevant columns (here columns 3, 4 and 5)
cut -d',' -f3,4,5 f1.csv  > cols_f1.csv
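
Since the columns are consecutive, cut also accepts a range, which saves a little typing. Note that cut splits on every comma, so this only works cleanly if no field contains an embedded comma inside quotes:

cut -d',' -f3-5 f1.csv > cols_f1.csv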


step 3. sort on the 2nd column (for example)
sort -k2,2 -t"," cols_f1.csv > sorted_f1.csv
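
The -k2,2 key specification restricts the sort key to the second field only (a bare -k2 would sort on everything from the second field to the end of the line). If the ordering of a particular column is not important, a plain whole-line sort is simpler, and it is exactly the order that comm expects in the final step:

sort cols_f1.csv > sorted_f1.csv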


step 4. remove duplicates
uniq sorted_f1.csv > uniq_f1.csv
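
uniq only removes adjacent duplicate lines, which is why the sort in step 3 has to come first. If the column-2 ordering is not important, sort -u does the sorting and the de-duplication in one go (sorting on the whole line):

sort -u cols_f1.csv > uniq_f1.csv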




Quite a bit of typing here... and that's only to prepare the data from the first file. Fortunately, all of these commands can be piped together.

steps 1, 2, 3 and 4, piped into one command, for the second file:

grep 'searchString' file2.csv | cut -d',' -f3,4,5 | sort -k2,2 -t"," | uniq > uniq_f2.csv
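
And the same one-liner, pointed at the first file, regenerates uniq_f1.csv:

grep 'searchString' file1.csv | cut -d',' -f3,4,5 | sort -k2,2 -t"," | uniq > uniq_f1.csv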



And finally, the last step: show all lines unique to file1, all lines unique to file2, and all lines common to both files, arranged in a three-column output. Note that comm expects both input files to be sorted in the same whole-line order; if you sorted on a specific column instead, comm may complain that its input is not in sorted order (a plain sort, or sort -u as mentioned above, avoids this).

comm uniq_f1.csv uniq_f2.csv
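
comm can also filter its output instead of printing all three columns: the -1, -2 and -3 flags each suppress one column, and they can be combined to isolate a single set of lines:

comm -23 uniq_f1.csv uniq_f2.csv    # lines only in file1
comm -13 uniq_f1.csv uniq_f2.csv    # lines only in file2
comm -12 uniq_f1.csv uniq_f2.csv    # lines common to both files

And for the truly typing-averse, the whole recipe can be wrapped in a small shell function. This is just a rough sketch (the name csvcmp is made up, and it assumes bash for the <(...) process substitution); it takes the search string, the column list and the two files as arguments, and uses sort -u so that comm gets input sorted the way it likes:

# usage: csvcmp 'searchString' 3,4,5 file1.csv file2.csv
csvcmp() {
    pattern=$1; cols=$2; f1=$3; f2=$4
    comm <(grep "$pattern" "$f1" | cut -d',' -f"$cols" | sort -u) \
         <(grep "$pattern" "$f2" | cut -d',' -f"$cols" | sort -u)
}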
