Use SQL to query your CSV files
csvkit is a magnificent toolbox of different tooling to manage CSV data on the command line.
Today I’m focusing on one tool in it: csvsql. It allows you to run SQL queries to query your CSV data, without having to set up any databases in the middle.
The Data
Let’s start with the data. I created a CSV file from
Wikipedia’s entry on olympic medals by country
and cleaned it up a bit (removed rank, asterisks, country codes and so on) and
added headers. I saved it as
olympic-medals.csv
and for the example’s
sake, I trimmed it to top 10 entries to give you and idea of the shape of the
data:
country,gold,silver,bronze,total
United States,1174,952,833,2959
Soviet Union,473,376,355,1204
Germany,305,305,312,922
Great Britain,296,323,331,950
China,285,231,197,713
France,264,293,332,889
Italy,259,231,269,759
Sweden,212,228,239,679
Norway,209,186,173,568
Russia,194,165,186,545
Running SQL queries
We can now query it with SQL using csvsql.
Let’s start by finding the top 10 best countries:
csvsql --query \
"SELECT country, total FROM stdin ORDER BY total DESC LIMIT 10" \
< olympic-medals.csv
and the output is
country,total
United States,2959.0
Soviet Union,1204.0
Great Britain,950.0
Germany,922.0
France,889.0
Italy,759.0
China,713.0
Sweden,679.0
Japan,573.0
Norway,568.0
The engine used by csvsql uses floating point numbers instead of integers so by default you get decimal numbers as an output. If you want the final output to have integers instead, you need to cast them:
SELECT cast(total as integer) as total FROM stdin
I wouldn’t stress about it though since at least for me, the use case with this is usually explorative to see what’s happening in the data and doing some simple analysis rather than building it into a production pipeline.
You can find more documentation for csvsql in this tutorial or its docs page.
JOIN
two CSV files for querying
csvsql provides the basic SQL query functionality so if you want to run
queries for multiple CSV files joined together (like with SQL’s
JOIN
command), you need to rely on
another tool in csvkit:
csvjoin.
In this example, I’m continuing with the above medals data but add another CSV
file called countries-gdp.csv
with
GDP data from Wikipedia. Here’s first 9 entries as an example.
country,region,imf-forecast,imf-year,wb-estimate,wb-year,un-estimate,un-year
United States,Americas,26949643,2023,25462700,2022,23315081,2021
China,Asia,17700899,2023,17963171,2022,17734131,2021
Germany,Europe,4429838,2023,4072192,2022,4259935,2021
Japan,Asia,4230862,2023,4231141,2022,4940878,2021
India,Asia,3732224,2023,3385090,2022,3201471,2021
United Kingdom,Europe,3332059,2023,3070668,2022,3131378,2021
France,Europe,3049016,2023,2782905,2022,2957880,2021
Italy,Europe,2186082,2023,2010432,2022,2107703,2021
Brazil,Americas,2126809,2023,1920096,2022,1608981,2021
we can combine these two CSV datasets with csvjoin:
csvjoin olympic-medals.csv countries-gdp.csv -c country
and get a result like (a trimmed result)
country,gold,silver,bronze,total,region,imf-forecast,imf-year,wb-estimate,wb-year,un-estimate,un-year
United States,1174,952,833,2959,Americas,26949643,2023,25462700,2022,23315081,2021
Germany,305,305,312,922,Europe,4429838,2023,4072192,2022,4259935,2021
China,285,231,197,713,Asia,17700899,2023,17963171,2022,17734131,2021
France,264,293,332,889,Europe,3049016,2023,2782905,2022,2957880,2021
Italy,259,231,269,759,Europe,2186082,2023,2010432,2022,2107703,2021
Sweden,212,228,239,679,Europe,597110,2023,585939,2022,635664,2021
Norway,209,186,173,568,Europe,546768,2023,579267,2022,482175,2021
Russia,194,165,186,545,Europe,1862470,2023,2240422,2022,1778782,2021
Japan,186,178,209,573,Asia,4230862,2023,4231141,2022,4940878,2021
Hungary,183,156,182,521,Europe,203829,2023,178789,2022,181848,2021
Australia,170,180,216,566,Oceania,1687713,2023,1675419,2022,1734532,2021
Canada,148,182,221,551,Americas,2117805,2023,2139840,2022,1988336,2021
Netherlands,148,154,167,469,Europe,1092748,2023,991115,2022,1012847,2021
Finland,146,150,184,480,Europe,305689,2023,280826,2022,297302,2021
Now we could for example run a query against this new data to aggregate average medal counts by UN regions:
csvjoin olympic-medals.csv countries-gdp.csv -c country | \
csvsql --query \
"SELECT region, avg(gold) as gold, avg(silver) as silver,\
avg(bronze) as bronze FROM stdin GROUP BY region ORDER BY gold DESC"
with output of
region,gold,silver,bronze
Europe,71.44736842105263,75.97368421052632,84.15789473684211
Americas,57.03703703703704,51.25925925925926,50.74074074074074
Oceania,45.4,43.4,54.0
Asia,20.945945945945947,20.16216216216216,22.864864864864863
Africa,4.5,5.214285714285714,6.0
Combining these different csvkit tools is straight-forward as they (mostly) print the output to standard out and accept input from standard in.
I recommend checking out the full csvkit tutorial to see what else it offers and how you can use it to explore your CSV data without dumping it to a database.
Comments
Loading comments...
Continue discussion in Mastodon »