Comparing csv files with the linux shell


My heart sinks whenever I witness people resorting to Excel to compare large files. It's kind of ok when the files to be compared are below 10K rows in size... anything bigger than that and the time it takes to select the rows to compare (and the comparison in itself) becomes too much of a frustration.

The alternative is to use the command line of course.

To compare two csv files file1.csv and file2.csv, columns to columns, on a Linux operating system.

step 1. copy all lines containing the string to search from the first input file
grep 'searchString' file1.csv > f1.csv


step 2. extract the relevant columns (here columns 3,4 and 5)
cut -d',' -f3,4,5 f1.csv  > cols_f1.csv


step 3. sort on the 2nd column (for example)
sort -k2  -t"," cols_f1.csv > sorted_f1.csv


step 4. remove duplicates
uniq sorted_f1.csv > uniq_f1.csv




Quite a bit of typing here... and that's only to extract the columns from the first file. Fortunately all of these commands can be piped.

steps 1,2,3 and 4 for the second file.

grep 'searchString' file2.csv | cut -d',' -f3,4,5 | sort -k2 -t"," | uniq > uniq_f2.csv



And finally the last step.  Show all lines unique to file1, all lines unique to file2 and all lines common to both files, arranged in a 3-columns output

comm uniq_f1.csv uniq_f2.csv

The Guava Joiner - an example


Google Guava ships a nice utility to transform a list of strings into a single string, with invidual elements separated by a separator character.

import com.google.common.base.Joiner;
List words = newArrayList("abc","def","ghj");
System.out.println(Joiner.on(",").join(words));

is a replacement for the rather verbose:

List words = newArrayList("abc","def","ghj");
StringBuilder sb = new StringBuilder();
Iterator iterator = words.iterator();
if (iterator.hasNext()) {
   sb.append(iterator.next());
   while (iterator.hasNext()) {
      sb.append(","); 
      sb.append(iterator.next());
   }
}
System.out.println(sb);

Unsprung! moving away from Spring DI


Advantages of Spring (when used for dependency injection):

- separates configuration from the code. Thus the wiring of the application can be modified without recompiling, in theory. Not sure how often this happens in practice.

... and the cons:

-   Xml config files are easy to get wrong (no compile time checks obviously)
- Annotations spread through out the code are not terribly visible
- Error messages thrown by the framework can be cryptic at times

There is a simpler alternative. Do away with the Spring container and inject the dependencies manually. All that is needed is a class to inject the dependency, a context which creates the appropriate dependency, and a bit of application code to wire the two together.


A DbReader uses a constructor-injected datasource to retrieve database results.
class DbReader{

  DataSource dataSource;

   DbReader (DataSource dataSource){
     this.dataSource= dataSource;
   }

   public  Object fetch(){
    //use the data source to execute a database query
   }

A Context  holds a reference to a data source (mock or real)
class Context {

  DataSource dataSource;

  static Context liveContext(){
     //get production db data source
     DataSource dataSource = ...;
     return new Context(dataSource);
  }

  static Context mockContext(){
     // get in memory test db data source (or mock)
     DataSource dataSource = ...; 
     return new Context(dataSource);
  }

  Context(DataSource dataSource){
      this.dataSource = dataSource;
  }

  DataSource getDataSource(){
     return dataSource;
  }

}


Project classes inject the datasource associated with a live context in the DbReader constructor.

   public static void main(String args[]){
      Context ctx = Context.liveContext();
      //fetch data from a prod database
      Object result = new DbReader(ctx.getDataSource()).fetch();
   }


while integration tests inject a mock datasource from a mock context.:

   @Test
   public void Test(){
     Context ctx = Context.mockContext();
     //fetch data from  a mock or in-memory db
     Object result = new DbReader(ctx.getDataSource()).fetch();
     //assert that result is as expected...
    }


Simple, easy to understand (and to debug), compile-time checks in place, no messing around with xml configuration files (or annotations).

...and it's even simpler with the Unsprung project which can help generate the Context class above from a Spring configuration file.