Problem
I had 38 separate .csv files, each consisting of at least 40K lines of data. Overall there were ≥ 2.6 million rows of information. The problem was that the data was not in the ideal form.
Each file had 5 columns. One of the columns was called birthday. It had a conventional (at least for most of the world) format: DD/MM/YYYY. However, I was going to put this data in a DB (more on that later!) and I needed to have 3 separate properties (i.e. { day: DD, month: MM, year: YYYY }). Also, some of the columns were not useful, so I had to get rid of them. In addition, I wanted to reduce the number of files to one.
Approach
I know some Python, and from a couple of data science courses I was introduced to Pandas, a well-known tool for data analysis built on top of Python. It has a straightforward way of reading, manipulating, and writing csv files. I was going to use it, but I didn't really want to download a whole tool just to use one of its features and never touch it again. So I chose something that comes with nearly all modern unix-like systems. AWK is designed for text processing, and much like other programs in the unix-like world, it does one thing well and produces fascinating results in terms of simplicity and performance when composed with other unix tools.
Solution
Now we need to make AWK process .csv files. AWK uses field separators to split an input into fields. A separator may be either a character or a RegExp. So all we need to do to turn AWK into a .csv processor is to set FS (the field separator) to ",".
We can achieve this using AWK's Startup and Cleanup Actions:
```awk
# clean_csv.awk
BEGIN { FS = "," }
{ print $1 }
END {}
```
Then if we pass an input such as one, two, three to the script, it will print one:

```shell
$ echo "one, two, three" | awk -f clean_csv.awk
one
```
Now that we have this, we can finally solve our problems: splitting the DD/MM/YYYY column into three separate columns and removing the useless column. AWK has a split function that takes a string, an array, and a separator, and puts the separated values into the array.
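As a quick illustration of split (using a made-up date, not the real data), it breaks the string on the given separator, fills the array, and returns the number of pieces:

```shell
# Demonstration of AWK's split(): break a DD/MM/YYYY string on "/"
# and print the pieces plus the count that split() returns.
echo "14/07/1990" | awk '{ n = split($1, dob, "/"); print dob[1], dob[2], dob[3], n }'
# → 14 07 1990 3
```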
The birthday column was #4, so to create the separate fields we need to make the following modifications:
```awk
# clean_csv.awk
BEGIN { FS = "," }
{
  # split DD/MM/YYYY into an array of values
  split($4, dob, "/")
  # turn these values into separate columns
  total = dob[1] "," dob[2] "," dob[3]
  # finalize the row, omitting the useless #3 column
  row = $1 "," $2 "," total "," $5
  print row
}
END {}
```
This solves most of our problems, except for the first row of the input: .csv files have headers as a set of labels on the first row, so we need to skip the first line of every .csv file.
In AWK, NR and FNR represent the record number. The latter shows the record number within the current file, while the former holds the total count across all files. Since we are going to use this script on more than one file, we'll stick with FNR.
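A small sketch (with two throwaway files, not the real data) makes the difference visible: NR keeps a running total across inputs, while FNR resets at each file boundary.

```shell
# Two tiny files to show NR vs FNR side by side.
printf 'a\nb\n' > one.txt
printf 'c\n' > two.txt
# NR keeps counting; FNR restarts at 1 when two.txt begins.
awk '{ print FILENAME, NR, FNR }' one.txt two.txt
# → one.txt 1 1
# → one.txt 2 2
# → two.txt 3 1
```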
So the final script looks like so:
```awk
# clean_csv.awk
BEGIN {
  FS = ","
  # create a new header with correct labels
  print "last_name,first_name,birth_day,birth_month,birth_year,country"
}
{
  if (FNR > 1) { # skip every first line
    # split DD/MM/YYYY into an array of values
    split($4, dob, "/")
    # turn these values into separate columns
    total = dob[1] "," dob[2] "," dob[3]
    # finalize the row, omitting the useless #3 column
    row = $1 "," $2 "," total "," $5
    print row
  }
}
END {}
```
At last, to execute this on our 38 .csv files we need to run this command:

```shell
awk -f clean_csv.awk *.csv > clean.csv
```
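To see the whole pipeline work end to end, here is a miniature dry run on two made-up files with the same 5-column layout (names, dates, and countries are invented for illustration; the real data had 38 files):

```shell
# Two sample inputs mimicking the original layout:
# last name, first name, useless column, birthday, country.
printf 'last,first,junk,birthday,country\nDoe,Jane,x,14/07/1990,NL\n' > a.csv
printf 'last,first,junk,birthday,country\nRoe,Rick,y,01/02/1985,US\n' > b.csv

# Same logic as clean_csv.awk, inlined: one fresh header,
# input headers skipped via FNR, birthday split into three columns.
awk 'BEGIN {
  FS = ","
  print "last_name,first_name,birth_day,birth_month,birth_year,country"
}
FNR > 1 {
  split($4, dob, "/")
  print $1 "," $2 "," dob[1] "," dob[2] "," dob[3] "," $5
}' a.csv b.csv
# → last_name,first_name,birth_day,birth_month,birth_year,country
# → Doe,Jane,14,07,1990,NL
# → Roe,Rick,01,02,1985,US
```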
Conclusion
We read and manipulated over 30 .csv files with more than 2.6 million rows of data combined and reduced them to a single clean file without installing any software. So if you're faced with a problem, you may not need to install bloated software to solve it when the solution comes pre-installed with your system.