Break Large CSV into Parts

Posted by ChenRiang on May 2, 2021

Oftentimes, processing or reading a large CSV file is quite troublesome. With the concept of parallelism, we can break the file into smaller chunks and create multiple processes to work on each of these smaller chunks of records simultaneously (a sketch of this is shown at the end of the post).

But how do we chunk the CSV into equal parts according to our needs?

e.g.
We have a CSV file named application.csv which consists of 20,000 records, and we need to chunk it into part files of 10,000 records each.

We can achieve this by creating a Linux shell function.

  1. Paste the following function into your terminal (a note on macOS portability follows after step 3).
    
     splitCsv() {
         # Capture the header row so it can be re-added to every part file
         HEADER=$(head -1 "$1")
         # Default to chunks of 1000 records when no size is given
         if [ -n "$2" ]; then
             CHUNK=$2
         else
             CHUNK=1000
         fi
         # Skip the header, then split the records into fixed-size part files
         tail -n +2 "$1" | split -l "$CHUNK" - "${1}_split_"
         # Prepend the header back to the top of each part file
         for i in "${1}"_split_*; do
             sed -i -e "1i$HEADER" "$i"
         done
     }
    
  2. Run the following command.
    
     # e.g. splitCsv application.csv 10000
     splitCsv <csv file> <chunk size>
    
  3. You will notice the part files are created with the following naming format:
    
     <csv file>_split_<part>
        
     e.g.
     application.csv_split_aa
     application.csv_split_ab
    
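A quick note on how the function works: head -1 captures the header row, tail -n +2 streams the records (minus the header) into split -l, which writes them out as fixed-size part files, and the final loop re-inserts the header at the top of each part with sed.

One caveat for macOS users: the sed -i -e "1i$HEADER" syntax is GNU-specific, and BSD sed (the default on macOS) handles both the -i flag and the i command differently. A portable alternative, sketched below, swaps the sed loop inside splitCsv for printf and cat:

     # Drop-in replacement for the sed loop: prepend the header via a temp file
     for i in "${1}"_split_*; do
         { printf '%s\n' "$HEADER"; cat "$i"; } > "$i.tmp" && mv "$i.tmp" "$i"
     done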

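To sanity-check the result, count the lines in each part file. For the 20,000 record example above, each part should hold 10,001 lines (10,000 records plus the re-added header):

     wc -l application.csv_split_*
     #  10001 application.csv_split_aa
     #  10001 application.csv_split_ab
     #  20002 total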


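Coming back to the parallelism idea from the start of the post: once the part files exist, they can be fanned out to multiple worker processes. Below is a minimal sketch using xargs -P; process_chunk.sh is a hypothetical placeholder for whatever script consumes a single CSV chunk.

     # Hypothetical example: run up to 4 workers, one part file per process
     printf '%s\n' application.csv_split_* | xargs -n 1 -P 4 ./process_chunk.sh
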
Reference

StackOverflow