in2csv: the Excel killer is part of csvkit the command-line spreadsheet

By | October 5, 2020

Summary

Note: I wrote a different version of this on this post elsewhere: csvkit command-line spreadsheet can convert and compute multiple Excel files

How it all started

It all started as I had 26 different .xlsx Excel files from which I wanted to extract just the one column with the email address. Of course it could be done by hand, one file at a time. But that’s not an efficient way of using a computer! There had to be a better way!

My first idea was going to turn to R and Tidyverse as there is a package that can read .xlsx files called readxl, but the xlsx package might have worked too.

However, I decided to first do a quick check online to see if I could fine an even better solution. And I did find it in this Stack Overflow posting: Convert xlsx file to csv using batch!

There was a LibreOffice option that was tempting:

libreoffice --headless --convert-to csv *

But in there was another answer that peeked my curiosity. First this answer:

Try in2csv!

Usage: in2csv file.xlsx > file.csv

And then this more elaborate version of the same suggestion:

Adding to @marbel’s answer (which is a great suggestion!), here’s the script that worked for me on Mac OS X El Captain’s Terminal, for batch conversion (since that’s what the OP asked). I thought it would be trivial to do a for loop but it wasn’t! (had to change the extension by string manipulation and it looks like Mac’s bash is a bit different also)

for x in $(ls *.xlsx); do x1=${x%".xlsx"}; in2csv $x > $x1.csv; echo "$x1.csv done."; done

So this is what I used… However, I had to slightly modify as detailed in my own comment on this page:

I had contained 2 blank lines at the top. Modifying the command to remove the first 2 lines (needs to be before the redirect): for x in $(ls *.xlsx); do x1=${x%".xlsx"}; in2csv $x | sed 1,2d > $x1.csv; echo "$x1.csv done."; done Thank you for a great option, including the bash removal of .xlsx avoiding the need to use bash base option!

Docker

The software in2csv is in fact part of a set of commands written in python called csvkit:

Since I always have trouble with python installations, I decided I’d run it with Docker. I found a well documented docker option to run in2csvhttps://hub.docker.com/r/thomasleplus/csv

On a Mac or Linux the command to launch the docker container with sharing the current directory an making it the default directory would be:

docker run --rm -it -v $(pwd):/data -w /data thomasleplus/csv

Note: on Windows the command would be

For cmd:

docker run --rm -it -v "%cd%:/data" -w /data thomasleplus/csv

For PowerShell:

docker run --rm -it -v "${PWD}:/data" -w /data thomasleplus/csv

Just placing the files in this directory would make them available. Then I just had to run the above loop that converted the .xlsx files into .csv files and removing the first 2 lines. The first line was then just the column header.

Reading just a little bit on the documentation I found that I could print just the column with the emails.

Step 1: find out the column number that contains the emails. All 26 files have the same format.

csvcut -n Registrations01.csv

The output is a numbered list of column headers

 17: Email
 18: Title
 19: Classification

The for each file we can just “extract” the 17th column and redirect all into a file, for example:

for f in Registrations* do; csvcut -c 17 $f >> x; done

Now the file x contains all emails and can be sorted as unique with bash command.

This kit has many more commands in the categories: Input, Processing, Output and Analysis (see Reference page).

This is certainly a very nice tool!

 

 

Share this:

One thought on “in2csv: the Excel killer is part of csvkit the command-line spreadsheet

  1. Pingback: csvkit command-line spreadsheet can convert and compute multiple Excel files – Biochemistry Computational Research Facility (BCRF) – UW–Madison

Leave a Reply