Summary
csvkit
is a suite of command-line tools for converting to and working with CSV, the king of tabular file formats. (csvkit
can convert XLSX files to CSV.)- A good docker container for
csvkit
: thomasleplus/csv - Inspiration: stack overflow article: convert-xlsx-file-to-csv-using-batch
Note: I wrote a different version of this on this post elsewhere: csvkit command-line spreadsheet can convert and compute multiple Excel files
How it all started
It all started as I had 26 different .xlsx
Excel files from which I wanted to extract just the one column with the email address. Of course it could be done by hand, one file at a time. But that’s not an efficient way of using a computer! There had to be a better way!
My first idea was going to turn to R and Tidyverse as there is a package that can read .xlsx
files called readxl, but the xlsx package might have worked too.
However, I decided to first do a quick check online to see if I could fine an even better solution. And I did find it in this Stack Overflow posting: Convert xlsx file to csv using batch!
There was a LibreOffice option that was tempting:
libreoffice --headless --convert-to csv *
But in there was another answer that peeked my curiosity. First this answer:
Try in2csv!
Usage:
in2csv file.xlsx > file.csv
And then this more elaborate version of the same suggestion:
Adding to @marbel’s answer (which is a great suggestion!), here’s the script that worked for me on Mac OS X El Captain’s Terminal, for batch conversion (since that’s what the OP asked). I thought it would be trivial to do a
for
loop but it wasn’t! (had to change the extension by string manipulation and it looks like Mac’s bash is a bit different also)for x in $(ls *.xlsx); do x1=${x%".xlsx"}; in2csv $x > $x1.csv; echo "$x1.csv done."; done
So this is what I used… However, I had to slightly modify as detailed in my own comment on this page:
I had contained 2 blank lines at the top. Modifying the command to remove the first 2 lines (needs to be before the redirect):
for x in $(ls *.xlsx); do x1=${x%".xlsx"}; in2csv $x | sed 1,2d > $x1.csv; echo "$x1.csv done."; done
Thank you for a great option, including the bash removal of.xlsx
avoiding the need to use bashbase
option!
Docker
The software in2csv
is in fact part of a set of commands written in python called csvkit
:
Since I always have trouble with python installations, I decided I’d run it with Docker. I found a well documented docker option to run in2csv
: https://hub.docker.com/r/thomasleplus/csv
On a Mac or Linux the command to launch the docker container with sharing the current directory an making it the default directory would be:
docker run --rm -it -v $(pwd):/data -w /data thomasleplus/csv
Note: on Windows the command would be
For
cmd
:
docker run --rm -it -v "%cd%:/data" -w /data thomasleplus/csv
For PowerShell:
docker run --rm -it -v "${PWD}:/data" -w /data thomasleplus/csv
Just placing the files in this directory would make them available. Then I just had to run the above loop that converted the .xlsx
files into .csv
files and removing the first 2 lines. The first line was then just the column header.
Reading just a little bit on the documentation I found that I could print just the column with the emails.
Step 1: find out the column number that contains the emails. All 26 files have the same format.
csvcut -n Registrations01.csv
The output is a numbered list of column headers
17: Email
18: Title
19: Classification
The for each file we can just “extract” the 17th column and redirect all into a file, for example:
for f in Registrations* do; csvcut -c 17 $f >> x; done
Now the file x
contains all emails and can be sorted as unique with bash command.
This kit has many more commands in the categories: Input, Processing, Output and Analysis (see Reference page).
This is certainly a very nice tool!
Pingback: csvkit command-line spreadsheet can convert and compute multiple Excel files – Biochemistry Computational Research Facility (BCRF) – UW–Madison