Oftentimes, processing or reading a large CSV file is quite troublesome. With the concept of parallelism, we can break the file into smaller chunks and create multiple processes to work on each of these smaller chunks of records simultaneously.
But how do we chunk the CSV equally according to our needs?
e.g.
We have a CSV file named application.csv
which consists of 20,000 records.
We need to chunk it into part files of 10,000 records each.
We can achieve this by creating a Linux shell function.
- Paste the following function definition into your terminal.
```bash
splitCsv() {
  # Save the header row so it can be re-added to every part file
  HEADER=$(head -1 "$1")
  # Default chunk size is 1000 records if none is given
  if [ -n "$2" ]; then
    CHUNK=$2
  else
    CHUNK=1000
  fi
  # Split everything after the header into part files of $CHUNK records
  tail -n +2 "$1" | split -l "$CHUNK" - "$1"_split_
  # Prepend the header to each part file
  for i in "$1"_split_*; do
    sed -i -e "1i$HEADER" "$i"
  done
}
```
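The function works in three steps: it saves the header row, splits everything after the header into part files of the requested size (defaulting to 1,000 records when no chunk size is passed), and then prepends the saved header to each part file so every chunk remains a valid standalone CSV.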
- Run the following command.
```bash
splitCsv <csv file> <chunk size>
# e.g. splitCsv application.csv 10000
```
- You will notice that the part files are created with the following naming format:
```
<csv file>_split_<part>

e.g.
application.csv_split_aa
application.csv_split_ab
```
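You can quickly verify the split with `wc -l application.csv_split_*`; in the example above, each part file should contain 10,001 lines (10,000 records plus the header row).

Once the part files exist, the parallel processing mentioned in the introduction can be done with standard tools. Below is a minimal sketch using `xargs -P`; the `process_chunk.sh` script is a hypothetical placeholder for whatever per-chunk processing you actually need.

```bash
# Run a (hypothetical) process_chunk.sh on each part file,
# with at most 4 chunks being processed at the same time.
# Adjust -P to match the number of CPU cores available.
ls application.csv_split_* | xargs -n 1 -P 4 ./process_chunk.sh
```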