Reading and exploring CSV files
#csv #today-i-learned in posts
Here are a few way I explore CSV files, with DuckDB being a new entrant.
CSVKit
I find myself working with CSV files these days. I sometimes need to look around inside. For 10s of rows, a text editor will do - I just use VS Code. csvlook from csvkit is great too.
The following command will allow me to scroll through the first 500 lines of a file. The --no-inference
flag informs csvlook to not infer data type and format values accordingly.
head -n 500 yourfile.csv | csvlook --no-inference | less -S
Sometimes I find myself needing to filter out rows. The following command will filter rows based on the presence of \\N
in any of the columns between 1 and 12, and output the inverse (-i
) results (good data) to a file.
csvgrep -c 1-12 -m "\\N" -i -a file_with_unwanted_data.csv > file-with_unwanted_rows_removed.csv
SQLite
Sometimes I want to explore the file a bit and csvkit tools becoming a bit unwieldy. SQLlite is perfect for the job. Remember, SQLite is included in macOS and Mac OS X by default, so sqlite-utils will just work once installed.
This command switches to CSV mode and imports the CSV file into a table called prices
:
sqlite>.mode csv
sqlite>.import prices.csv prices
Julia Evans mentioned sqlite-utils in her blog post sqlite-utils: a nice way to import data into SQLite for analysis It's another tool from the prolific Simon Willison.
The following command installs sqlite-utils
and uses it to create a new databases called prices.db
(sqlite is a file-based database) with a table called prices with records inserted into the table from the CSV file:
brew install sqlite-utils
Once installed, you can import with this command:
sqlite-utils insert prices.db prices prices.csv --csv
With large files, Simon notes sqlite-utils can be slower compared to using SQLite directly. You can import JSON and other formats just as easily.
Once you have a SQLite database, use standard SQL to query.
DuckDB
I just listed to a podcast on Software Engineering Daily about DuckDB. The founder, Hannes Mühleisen, makes a claim that the DuckDB team have inadvertently built the worlds best CSV reader.
duckdb -c "select * from read_csv('prices.csv')"
DuckDB doesn't need to ingest the data into its database engine. There are clients for various programming languages. And you can read from many different data formats beyond CSV. Paul Gross has started to use DuckDB as a replacement for jq
.