Benchmarking DuckDB vs SQLite for Simple Queries

I'm currently building a piece of software that will need to run a lot of bog-standard SQL queries against an in-memory database. I was (and am) going with SQLite, but a good friend of mine who Knows About Databases™ recently told me that DuckDB is currently the hot stuff among in-memory databases - but mainly for complex (read: OLAP) queries.

In this article I explore whether DuckDB or SQLite is the better (read: faster) alternative for my simple queries.

Disclaimer: I’m not saying that the results in this article mean that “X is better than Y” or “X is bad”. I have a special use case - I only use relatively simple queries, and my data set is comparatively small. I am fully aware that my use case is not what DuckDB is made for. I think both DuckDB and SQLite are really great products.

Reproducing the Results: I publish my code and some additional data so that you can reproduce my results; see below.

Benchmark Data

Since I had that data lying around anyway, I am using the GTFS data for Germany as test data. The data is compiled by DELFI e.V. and can be downloaded for free (after registration) at www.opendata-oepnv.de[1]. It is not provided under an open license, so I cannot redistribute it. However, it is the largest GTFS data set that I could find.

I only use the trips and stop_times data from the GTFS data set. The trips table contains all public transit trips in Germany, and the stop_times table contains all the calls that vehicles make during those trips. The data set contains about 1.7 million trips and about 35 million stop_times.

Benchmark Setup

I use a small Python script (which is also available, see below) that drives SQLite v. 3.41.2 and DuckDB v. 0.9.1.

For the SQLite database, I prepare the database outside of the benchmark script using an SQL script. The SQL script creates the trips and stop_times tables with appropriate primary and foreign keys set (thus creating the respective indices).
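For illustration, the schema looks roughly like this (a minimal sketch via Python's sqlite3 module; the real SQL script also imports the CSV data, and the column lists here are trimmed to the fields the benchmarks touch, so treat them as assumptions):

import sqlite3

con = sqlite3.connect("data/gtfs.sqlite3")
con.executescript("""
CREATE TABLE trips (
    route_id     TEXT,
    trip_id      TEXT PRIMARY KEY,  -- implicitly creates the index used by 'indexed'
    direction_id INTEGER
);
CREATE TABLE stop_times (
    trip_id       TEXT REFERENCES trips(trip_id),
    stop_sequence INTEGER,
    stop_id       TEXT,
    PRIMARY KEY (trip_id, stop_sequence)  -- the composite index
);
""")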

The DuckDB database is initialized directly in the benchmark script using DuckDB’s read_csv_auto method, and respective indices (see below) are created afterwards.
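In Python, that initialization looks roughly like this (a sketch; the file paths are assumptions):

import duckdb

# Load the two GTFS CSV files straight into in-memory tables; DuckDB
# infers column names and types from the files.
duckdb.sql("CREATE TABLE trips AS SELECT * FROM read_csv_auto('data/trips.txt')")
duckdb.sql("CREATE TABLE stop_times AS SELECT * FROM read_csv_auto('data/stop_times.txt')")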

I want the benchmarks to run deterministically, so I create lists of values to query for a priori and store them in two files. When I run n queries for some value in the benchmarks, I use the first n values from the appropriate file.
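Loading those values then boils down to something like this (a sketch; the file name is an assumption):

def load_values(path, n):
    # Take the first n pre-generated IDs, so every run queries the same rows.
    with open(path) as f:
        return [line.strip() for line in f][:n]

trip_ids = load_values("data/trip_ids.txt", 500)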

For each of the benchmarks mentioned below, I create a list of statements to be run (and timed). I run each such list of statements twice: The first run warms up the caches, and the second run is timed. The benchmarks are run on my 2016 Thinkpad T460s laptop, which has an i5-6300U CPU and 20 GB of RAM.
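The core of the timing logic looks roughly like this (a minimal sketch, not the actual script; it works for both the sqlite3 and the duckdb connection objects, since both expose execute()):

import time

def run_benchmark(con, statements):
    # First pass: warm up the caches; results are discarded.
    for stmt, params in statements:
        con.execute(stmt, params).fetchall()
    # Second pass: the one that is actually timed.
    start = time.perf_counter()
    for stmt, params in statements:
        con.execute(stmt, params).fetchall()
    elapsed = time.perf_counter() - start
    return elapsed / len(statements) * 1000  # average milliseconds per query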

Benchmarks

I run a total of six benchmarks, which I split into two groups: queries which can be served by looking at the built indices, and queries which cannot (i.e., where the database systems need to iterate through the tables).

Querying the Indices

The simplest benchmark is the indexed benchmark, where each query simply retrieves a trip by its primary key, i.e., a query of the form SELECT * FROM trips WHERE trip_id = <ID>. The second benchmark, called composite-indexed, does the same for the stop_times table, which has a composite primary key (consisting of the trip_id of the respective trip and a stop_sequence integer, which just orders the calls within a trip). The third benchmark in this group is the join benchmark, which retrieves stop_times joined with the data from their respective trip, i.e., queries of the form

SELECT *
FROM stop_times JOIN trips ON trips.trip_id = stop_times.trip_id
WHERE stop_times.trip_id = <Trip ID> AND stop_sequence = <Sequence Number>;

The last benchmark in this group, called indexed-multiple, does the same as indexed but queries for 50 rows per query instead of one.

For each of these benchmarks, I run 500 queries and average the query times.
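Spelled out as parameterized statements, the four query forms look like this (a sketch with ? placeholders; in particular, that indexed-multiple uses an IN list rather than 50 separate lookups is my assumption):

# The four statement forms of the "with-index" group.
indexed           = "SELECT * FROM trips WHERE trip_id = ?"
composite_indexed = "SELECT * FROM stop_times WHERE trip_id = ? AND stop_sequence = ?"
join              = ("SELECT * FROM stop_times JOIN trips ON trips.trip_id = stop_times.trip_id "
                     "WHERE stop_times.trip_id = ? AND stop_sequence = ?")
# 'indexed-multiple' retrieves 50 rows per query (assumed to be an IN list):
indexed_multiple  = "SELECT * FROM trips WHERE trip_id IN ({})".format(",".join(["?"] * 50))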

Querying Without an Index

The first benchmark in this group, nonindexed, retrieves trips by their ‘route ID’. In GTFS, every trip is associated with a route. There are about 27000 routes in the GTFS data set, so we expect on average about 63 trips per route. The route_id field is not indexed. The next benchmark, called many, does the same but with the direction_id field. There are only two direction IDs: 0 and 1, so we expect each query to return half of the trips, about 850000.
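The two statement forms are as simple as it gets (a sketch):

nonindexed = "SELECT * FROM trips WHERE route_id = ?"      # ~63 rows per query
many       = "SELECT * FROM trips WHERE direction_id = ?"  # ~850000 rows per query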

For these benchmarks, I only run 100 queries each (because they take a lot longer…).

Results

Let’s look at the results from the “with-index” benchmarks first, which are best looked at in a table.[2] The reported values are milliseconds per query:

Benchmark           DuckDB (ms)   SQLite (ms)   Factor SQLite-to-DuckDB
indexed                   0.927         0.063                      14.7
composite-indexed         9.68          0.078                     124.1
join                     13.3           0.118                     112.7
indexed-multiple          1.00          0.068                      14.7

Or, if you like plots more than tables:

Figure 1: Average query times for the four indexed benchmarks, in milliseconds. The plot is probably not very useful, but otherwise this article would have had zero graphics. :(

We see that in this group, SQLite consistently outperforms DuckDB by one to two orders of magnitude. Keep in mind that the composite-indexed and join benchmarks both query the larger stop_times table, not the smaller trips table. It looks to me as if DuckDB somehow does not profit from the indexing.

For the “without-index” benchmarks, we get a different picture (note the inverse factor):

Benchmark    DuckDB (ms)   SQLite (ms)   Factor DuckDB-to-SQLite
nonindexed          7.70           182                      23.6
many                2.90          2722                     938.6

Again as the obligatory plot:

Figure 2: Average query times for the two nonindexed benchmarks, in milliseconds. Yes, DuckDB is really that much faster - doesn't make this plot more useful than the previous one.

Here, DuckDB clearly outperforms SQLite, by up to three orders of magnitude. I also think it’s very interesting that DuckDB actually gets faster in the many case, where the output size is larger than in the nonindexed case.

Interpretation

The interpretation seems to be clear: DuckDB does not seem to work with indices as well as SQLite does (see also the section below) but is blazingly fast when scanning large parts of the table. SQLite is great at using indices but takes its time when it has to look at the whole table.

Since my workload mainly consists of simple queries which should be well answerable from an index (and it is feasible to just create as many indices as it takes…), I will stay with SQLite for the project.

Validating the DuckDB Results

The results make me very sceptical about whether I did the right thing to create indices in DuckDB. I only found a short page about indices in the DuckDB documentation, but if I read it correctly, it states that I should just use the usual CREATE INDEX SQL command. The benchmark script therefore does this:

duckdb.sql("CREATE UNIQUE INDEX trips_id_idx ON trips(trip_id)")
duckdb.sql("CREATE UNIQUE INDEX stop_times_id_idx ON stop_times(trip_id,stop_sequence)")

To test my hypothesis, I removed these two lines (i.e., did not create any indices in DuckDB) and re-ran the benchmarks. The average query time of the indexed benchmark went up from 0.9 ms to 2 ms, so those CREATE INDEX lines are not without effect. However, the composite-indexed, join and indexed-multiple benchmarks remain largely unaffected.

The documentation does say:

Unidimensional indexes are supported, while multidimensional indexes are not yet supported.

So composite-indexed being unaffected is not a big surprise. However, the index giving a speedup of only about 2× seems weird, and I would expect join and indexed-multiple to also show at least a small speedup.
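One way to dig deeper would be to ask DuckDB for its query plan and check whether the index actually shows up in it; a sketch (the literal trip ID is a placeholder):

import duckdb

# Print the query plan; an index-backed lookup should appear as an
# index scan rather than a sequential scan over the whole table.
duckdb.sql("EXPLAIN SELECT * FROM trips WHERE trip_id = 'some-trip-id'").show()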

Reproducing the Benchmarks

I publish my benchmark code, which you can download as a tar archive here. The archive contains:

  • The Python script used to run the benchmarks[3]
  • An SQL script to convert a set of GTFS CSV files into the necessary SQLite database
  • A little Bash script used to pre-create the set of IDs to be queried during the benchmarks

Reproducing my results should be as easy as this:

  • Obtain the GTFS data from https://www.opendata-oepnv.de/ht/en/organisation/delfi/start[1]. Note that you can also reproduce these results with other GTFS data sets, but you might need to adapt the gtfs_to_sqlite.sql script.
  • Make sure your Python has SQLite and DuckDB available.
  • Download and unpack the archive.
  • Put the unpacked GTFS data into the data/ subfolder. It should at least contain the trips.txt and stop_times.txt files.
  • Run cd data; sqlite3 gtfs.sqlite3 < gtfs_to_sqlite.sql to create the SQLite database.
  • Run bash ./prepare.sh to create the two random lists of IDs.
  • Run cd ..; python3 ./bench.py

It should output a table with the measured values and place the plots at /tmp/dbench_plot_indexed.pdf and /tmp/dbench_plot_nonindexed.pdf, respectively.


  1. Sorry, parts of that page may be in German, even though the link goes to the English website…

  2. Honestly: the plots are only there so that this post has a nice image that can appear when the article is shared… The differences between DuckDB and SQLite are just always so large that plotting them together completely hides all other detail.

  3. Note that the script is not necessarily a shining example of clean, well-documented Python code. However, it is short, does nothing fancy, and should be manageable.
