Use SQL to query your CSV files
csvkit is a magnificent toolbox of different tooling to manage CSV data on the command line.
Today I’m focusing on one tool in it: csvsql. It allows you to run SQL queries to query your CSV data, without having to set up any databases in the middle.
Let’s start with the data. I created a CSV file from
Wikipedia’s entry on olympic medals by country
and cleaned it up a bit (removed rank, asterisks, country codes and so on) and
added headers. I saved it as
olympic-medals.csv and for the example’s
sake, I trimmed it to top 10 entries to give you and idea of the shape of the
country,gold,silver,bronze,total United States,1174,952,833,2959 Soviet Union,473,376,355,1204 Germany,305,305,312,922 Great Britain,296,323,331,950 China,285,231,197,713 France,264,293,332,889 Italy,259,231,269,759 Sweden,212,228,239,679 Norway,209,186,173,568 Russia,194,165,186,545
Running SQL queries
We can now query it with SQL using csvsql.
Let’s start by finding the top 10 best countries:
csvsql --query \ "SELECT country, total FROM stdin ORDER BY total DESC LIMIT 10" \ < olympic-medals.csv
and the output is
country,total United States,2959.0 Soviet Union,1204.0 Great Britain,950.0 Germany,922.0 France,889.0 Italy,759.0 China,713.0 Sweden,679.0 Japan,573.0 Norway,568.0
The engine used by csvsql uses floating point numbers instead of integers so by default you get decimal numbers as an output. If you want the final output to have integers instead, you need to cast them:
SELECT cast(total as integer) as total FROM stdin
I wouldn’t stress about it though since at least for me, the use case with this is usually explorative to see what’s happening in the data and doing some simple analysis rather than building it into a production pipeline.
JOIN two CSV files for querying
csvsql provides the basic SQL query functionality so if you want to run
queries for multiple CSV files joined together (like with SQL’s
JOIN command), you need to rely on
another tool in csvkit:
In this example, I’m continuing with the above medals data but add another CSV
GDP data from Wikipedia. Here’s first 9 entries as an example.
country,region,imf-forecast,imf-year,wb-estimate,wb-year,un-estimate,un-year United States,Americas,26949643,2023,25462700,2022,23315081,2021 China,Asia,17700899,2023,17963171,2022,17734131,2021 Germany,Europe,4429838,2023,4072192,2022,4259935,2021 Japan,Asia,4230862,2023,4231141,2022,4940878,2021 India,Asia,3732224,2023,3385090,2022,3201471,2021 United Kingdom,Europe,3332059,2023,3070668,2022,3131378,2021 France,Europe,3049016,2023,2782905,2022,2957880,2021 Italy,Europe,2186082,2023,2010432,2022,2107703,2021 Brazil,Americas,2126809,2023,1920096,2022,1608981,2021
we can combine these two CSV datasets with csvjoin:
csvjoin olympic-medals.csv countries-gdp.csv -c country
and get a result like (a trimmed result)
country,gold,silver,bronze,total,region,imf-forecast,imf-year,wb-estimate,wb-year,un-estimate,un-year United States,1174,952,833,2959,Americas,26949643,2023,25462700,2022,23315081,2021 Germany,305,305,312,922,Europe,4429838,2023,4072192,2022,4259935,2021 China,285,231,197,713,Asia,17700899,2023,17963171,2022,17734131,2021 France,264,293,332,889,Europe,3049016,2023,2782905,2022,2957880,2021 Italy,259,231,269,759,Europe,2186082,2023,2010432,2022,2107703,2021 Sweden,212,228,239,679,Europe,597110,2023,585939,2022,635664,2021 Norway,209,186,173,568,Europe,546768,2023,579267,2022,482175,2021 Russia,194,165,186,545,Europe,1862470,2023,2240422,2022,1778782,2021 Japan,186,178,209,573,Asia,4230862,2023,4231141,2022,4940878,2021 Hungary,183,156,182,521,Europe,203829,2023,178789,2022,181848,2021 Australia,170,180,216,566,Oceania,1687713,2023,1675419,2022,1734532,2021 Canada,148,182,221,551,Americas,2117805,2023,2139840,2022,1988336,2021 Netherlands,148,154,167,469,Europe,1092748,2023,991115,2022,1012847,2021 Finland,146,150,184,480,Europe,305689,2023,280826,2022,297302,2021
Now we could for example run a query against this new data to aggregate average medal counts by UN regions:
csvjoin olympic-medals.csv countries-gdp.csv -c country | \ csvsql --query \ "SELECT region, avg(gold) as gold, avg(silver) as silver,\ avg(bronze) as bronze FROM stdin GROUP BY region ORDER BY gold DESC"
with output of
region,gold,silver,bronze Europe,71.44736842105263,75.97368421052632,84.15789473684211 Americas,57.03703703703704,51.25925925925926,50.74074074074074 Oceania,45.4,43.4,54.0 Asia,20.945945945945947,20.16216216216216,22.864864864864863 Africa,4.5,5.214285714285714,6.0
Combining these different csvkit tools is straight-forward as they (mostly) print the output to standard out and accept input from standard in.
I recommend checking out the full csvkit tutorial to see what else it offers and how you can use it to explore your CSV data without dumping it to a database.
Sign up for Syntax Error, a monthly newsletter that helps developers turn a stressful debugging situation into a joyful exploration.