Juha-Matti Santala
Community Builder. Dreamer. Adventurer.

Use SQL to query your CSV files

csvkit is a magnificent toolbox of different tooling to manage CSV data on the command line.

Today I’m focusing on one tool in it: csvsql. It allows you to run SQL queries to query your CSV data, without having to set up any databases in the middle.

The Data

Let’s start with the data. I created a CSV file from Wikipedia’s entry on olympic medals by country and cleaned it up a bit (removed rank, asterisks, country codes and so on) and added headers. I saved it as olympic-medals.csv and for the example’s sake, I trimmed it to top 10 entries to give you and idea of the shape of the data:

country,gold,silver,bronze,total
United States,1174,952,833,2959
Soviet Union,473,376,355,1204
Germany,305,305,312,922
Great Britain,296,323,331,950
China,285,231,197,713
France,264,293,332,889
Italy,259,231,269,759
Sweden,212,228,239,679
Norway,209,186,173,568
Russia,194,165,186,545

Running SQL queries

We can now query it with SQL using csvsql.

Let’s start by finding the top 10 best countries:

csvsql --query \
"SELECT country, total FROM stdin ORDER BY total DESC LIMIT 10" \
< olympic-medals.csv

and the output is

country,total
United States,2959.0
Soviet Union,1204.0
Great Britain,950.0
Germany,922.0
France,889.0
Italy,759.0
China,713.0
Sweden,679.0
Japan,573.0
Norway,568.0

The engine used by csvsql uses floating point numbers instead of integers so by default you get decimal numbers as an output. If you want the final output to have integers instead, you need to cast them:

SELECT cast(total as integer) as total FROM stdin

I wouldn’t stress about it though since at least for me, the use case with this is usually explorative to see what’s happening in the data and doing some simple analysis rather than building it into a production pipeline.

You can find more documentation for csvsql in this tutorial or its docs page.

JOIN two CSV files for querying

csvsql provides the basic SQL query functionality so if you want to run queries for multiple CSV files joined together (like with SQL’s JOIN command), you need to rely on another tool in csvkit: csvjoin.

In this example, I’m continuing with the above medals data but add another CSV file called countries-gdp.csv with GDP data from Wikipedia. Here’s first 9 entries as an example.

country,region,imf-forecast,imf-year,wb-estimate,wb-year,un-estimate,un-year
United States,Americas,26949643,2023,25462700,2022,23315081,2021
China,Asia,17700899,2023,17963171,2022,17734131,2021
Germany,Europe,4429838,2023,4072192,2022,4259935,2021
Japan,Asia,4230862,2023,4231141,2022,4940878,2021
India,Asia,3732224,2023,3385090,2022,3201471,2021
United Kingdom,Europe,3332059,2023,3070668,2022,3131378,2021
France,Europe,3049016,2023,2782905,2022,2957880,2021
Italy,Europe,2186082,2023,2010432,2022,2107703,2021
Brazil,Americas,2126809,2023,1920096,2022,1608981,2021

we can combine these two CSV datasets with csvjoin:

csvjoin olympic-medals.csv countries-gdp.csv -c country

and get a result like (a trimmed result)

country,gold,silver,bronze,total,region,imf-forecast,imf-year,wb-estimate,wb-year,un-estimate,un-year
United States,1174,952,833,2959,Americas,26949643,2023,25462700,2022,23315081,2021
Germany,305,305,312,922,Europe,4429838,2023,4072192,2022,4259935,2021
China,285,231,197,713,Asia,17700899,2023,17963171,2022,17734131,2021
France,264,293,332,889,Europe,3049016,2023,2782905,2022,2957880,2021
Italy,259,231,269,759,Europe,2186082,2023,2010432,2022,2107703,2021
Sweden,212,228,239,679,Europe,597110,2023,585939,2022,635664,2021
Norway,209,186,173,568,Europe,546768,2023,579267,2022,482175,2021
Russia,194,165,186,545,Europe,1862470,2023,2240422,2022,1778782,2021
Japan,186,178,209,573,Asia,4230862,2023,4231141,2022,4940878,2021
Hungary,183,156,182,521,Europe,203829,2023,178789,2022,181848,2021
Australia,170,180,216,566,Oceania,1687713,2023,1675419,2022,1734532,2021
Canada,148,182,221,551,Americas,2117805,2023,2139840,2022,1988336,2021
Netherlands,148,154,167,469,Europe,1092748,2023,991115,2022,1012847,2021
Finland,146,150,184,480,Europe,305689,2023,280826,2022,297302,2021

Now we could for example run a query against this new data to aggregate average medal counts by UN regions:

csvjoin olympic-medals.csv countries-gdp.csv -c country | \
csvsql --query \
"SELECT region, avg(gold) as gold, avg(silver) as silver,\
avg(bronze) as bronze FROM stdin GROUP BY region ORDER BY gold DESC"

with output of

region,gold,silver,bronze
Europe,71.44736842105263,75.97368421052632,84.15789473684211
Americas,57.03703703703704,51.25925925925926,50.74074074074074
Oceania,45.4,43.4,54.0
Asia,20.945945945945947,20.16216216216216,22.864864864864863
Africa,4.5,5.214285714285714,6.0

Combining these different csvkit tools is straight-forward as they (mostly) print the output to standard out and accept input from standard in.

I recommend checking out the full csvkit tutorial to see what else it offers and how you can use it to explore your CSV data without dumping it to a database.

Comments

Comment by replying to this post in Mastodon.

Loading comments...

Continue discussion in Mastodon »