Data cleaning from the command line

If you’ve been reading John D. Cook’s blog recently, you know he’s been writing about “computational survivalism,” using the basic Unix text manipulation tools to process data. I know Kieran Healy has also written about using these tools instead of his beloved R for preprocessing data; this article talks a little about it, but I feel certain there’s a longer discussion somewhere else. You should probably just read all his posts. And John Cook’s, too. I’ll be here when you’re done.

I use Unix tools to manipulate data all the time. Even though a lot of my analysis is done via Python and Pandas, those old command-line programs just can’t be beat when it comes to wrangling a mass of text-formatted data into a convenient shape for analysis. I have a particular example from work I was doing last week.

I needed to analyze dozens of data files from a series of tests run on a piece of equipment. The data consisted of strain gauge and pressure gauge readings taken as the machine was run. After a brief but painful stint of having the test data provided as a set of Excel files, I got the technician in charge of the data acquisition to send me the results as CSV. That’s where we’ll start.

Each data file starts with 17 lines of preamble information about the software, the data acquisition channels, the calibration, and so on. This is important information for documenting the tests—and I make sure the raw files are saved for later reference—but it gets in the way of importing the data as a table.

I figured a simple sed command would delete these preamble lines, but I figured wrong. For God knows what reason, the CSV files that came from the computer that collected the data were in UTF-16 format (is this common in Windows?), even though there wasn’t a single non-ASCII character in them. UTF-16 is not something sed likes.
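
If you want to check what you’ve been handed before converting, the file command will tell you:

file *.csv

For UTF-16 files, it reports something like “Little-endian UTF-16 Unicode text,” which is a pretty good hint about why sed is choking.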

So I took a quick look at the iconv man page and wrote this one-liner to get the files into a format I could use:

for f in *.csv; do iconv -f UTF-16 -t UTF-8 "$f" > "$f.new"; done

I suppose I could have chosen ASCII as the “to” (-t) format, but I’m in the habit of calling my files UTF-8 even when there’s nothing outside of the ASCII range.

After a quick check with BBEdit to confirm that the files had indeed converted, I got rid of the UTF-16 versions and replaced them with the UTF-8 versions:

rm *.csv
rename 's/\.new//' *.new

The rename command I use is my adaptation of Larry Wall’s old Perl script. The first argument is a Perl-style substitution command.
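
If you don’t have a rename of any flavor installed, a short loop with mv does the same job:

for f in *.new; do mv "$f" "${f%.new}"; done

The ${f%.new} expansion strips the .new suffix off each filename.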

With the files in the proper format, it was time to delete the 17-line preamble. The easiest way to do this is with the GNU version of sed, which can be installed through Homebrew:

gsed -i '1,17d' *.csv

The sed that comes with macOS requires a bit more typing:

for f in *.csv; do sed -i.bak '1,17d' "$f"; done

The built-in sed forces you to do two things:

  1. Include an extension with the -i switch so you have backups of the original files.
  2. Use a loop to go through the files. A command like

    sed -i.bak '1,17d' *.csv
    

    would treat all the CSV files as one continuous stream and delete the first 17 lines of that stream, not the first 17 lines of each file. The upshot is that only the first CSV file would have its preamble removed.

Hence my preference for gsed.
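
If you do use the built-in sed, you’ll also end up with a pile of .bak copies of the original files. Once you’re satisfied the preambles are gone, they can be cleared out with

rm *.bak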

With the preambles gone, I now had a set of real CSV files. Each one had a header line followed by many lines of data. I could have stopped the preprocessing here, but there were a couple more things I wanted to change.

First, the data acquisition software inserts an “Alarm” item after every data channel, effectively giving me twice as many columns as necessary. To get rid of the Alarm columns, I needed to know which ones they were. I could have opened one of the files in BBEdit and started counting through the header line, but it was more conveniently done via John Cook’s header numbering one-liner:

head -1 example.csv | gsed 's/,/\n/g' | nl

Because all the CSVs have the same header line, I could run this on any one of them. The head -1 part extracts just the first line of the file. That gets piped to the sed substitution command, which converts every comma into a newline. Finally, nl prefixes each of the lines sent to it with its line number. What I got was this:

     1  Scan
     2  Time
     3  101 <SG1> (VDC)
     4  Alarm 101
     5  102 <SG2> (VDC)
     6  Alarm 102
     7  103 <SG3> (VDC)
     8  Alarm 103
     9  104 <SG4> (VDC)
    10  Alarm 104
    11  105 <SG5> (VDC)
    12  Alarm 105
    13  106 <SG6> (VDC)
    14  Alarm 106
    15  107 <SG7> (VDC)
    16  Alarm 107
    17  108 <SG8> (VDC)
    18  Alarm 108
    19  109 <Pressure> (VDC)
    20  Alarm 109

With the headers laid out in numbered rows, it was easy to construct a cut command to pull out only the columns I needed:

for f in *.csv; do cut -d, -f 1-3,5,7,9,11,13,15,17,19 "$f" > "$f.new"; done

I’d like to tell you I did something clever like

seq -s, 3 2 19 | pbcopy

to get the list of odd numbers from 3 to 19 and then pasted them into the cut command, but I just typed them out like an animal.
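
Whichever way you build the field list, the header numbering one-liner from before is a quick way to confirm that the Alarm columns really are gone from one of the new files:

head -1 example.csv.new | gsed 's/,/\n/g' | nl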

Once again, I ran

rm *.csv
rename 's/\.new//' *.new

to get rid of the old versions of the files.

The last task was to change the header line to a set of short names that I’d find more convenient when processing the data in Pandas. For that, I used sed’s “change” command:

gsed -i '1cScan,Datetime,G1,G2,G3,G4,G5,G6,G7,G8,P' *.csv

With this change, I could access the data in Pandas using simple references like

df.G1

instead of long messes like

df['101 <SG1> (VDC)']

All told, I used nine commands to clean up the data. If I had just a few data files, I might have done the editing by hand. It is, after all, quite easy to delete rows and columns of a spreadsheet. But I had well over a hundred of these files, and even if I had the patience to edit them all by hand, there’s no way I could have done it without making mistakes here and there.

Which isn’t to say I wrote each command correctly on the first try. But after testing each command on a single file, I could then apply it with confidence to the full set.1 The tests were run in several batches over a few days. Once I had the commands set up, I could clean each new set of data in no time.


  1. And it took a lot less time to develop and test the commands than it did to write this post.