DuckDB Internals: Why Is DuckDB Fast? (Part 1)

smithclay · 2026-06-19T05:21:52 1781846512

If you're reading this and curious: consider writing a duckdb community extension* or contributing to an existing one*

duckdb is becoming a kind of data superglue between a lot of data ecosystems (GIS, observability, analytics, lakehouses, object storage, etc) that don't talk to each other typically, and it's worth checking out in 2026.

* https://github.com/duckdb/extension-template * https://duckdb.org/community_extensions/

aleda145 · 2026-06-19T07:16:57 1781853417

I just started doing this last week!

I'm not very good at C++, but coupled with the extension template and codex I got a basic version of my extension working within an hour. Go for it!

pknerd · 2026-06-19T05:51:49 1781848309

Just curious whether one can earn money making these exts?

faangguyindia · 2026-06-19T06:40:10 1781851210

You can definitely offer consultation or custom integration.

0xferruccio · 2026-06-19T05:35:30 1781847330

DuckDB is amazing for any sort of fast data analysis when the data is small enough that it can fit on your laptop

Recently at work I've been using it to analyse the Claude code sessions of every engineer at our company (that we upload to S3) and it's been extremely helpful to help us find gaps in devex and have clear metrics to back up the impact of fixing them

Another thing it's been really useful for has been getting metrics on Claude skills usage and then dive into use-cases by looking at the transcripts

Other engineers that had never touched DuckDB were so impressed with how easy it is for AI agents to write queries on our dataset

zurfer · 2026-06-19T07:29:10 1781854150

Like sqlite, duckdb is underappreciated as a production database. You can totally run it on servers or even "serverless" and do some heavy data transformations or with the right server size work with large scale datasets (up to a TB compressed seems fine).

ndr · 2026-06-19T08:54:55 1781859295

This. I've recently used both duckdb and sqlite to power a dashboard for a small restaurant of a family member. It converts all their sales to very tiny parquet files, daily.

The file fits in memory, and all sorts of computation can be done in the browser itself. The backend is extremely simple, it just loads the JS and serves the parquet files.

It was also trivial to let the owner do their own queries, just give the schema to an LLM and let it use the charting library, no data hallucinations. If they need it in the dashboard they can either use that one or ask me to review that query.

To be honest, given how simple some things became, it's been really fun to work on.

tomnipotent · 2026-06-19T07:41:58 1781854918

Not to mention it can query across heterogeneous sources, so the same query can use a duckdb table, sqlite, csv, and parquet (including predicate pushdown).

cyanide911 · 2026-06-19T07:36:23 1781854583

>Recently at work I've been using it to analyse the Claude code sessions of every engineer at our company (that we upload to S3) and it's been extremely helpful to help us find gaps in devex and have clear metrics to back up the impact of fixing them

Nice! How do you set things up so that your engineers' Claude Code sessions upload to S3? Thanks for the help in advance

anitil · 2026-06-19T04:48:13 1781844493

DuckDb makes so much of my life easier, though I've never used it for large problems. The ability to run `select * from 'data.json'` is just lovely. The fact that it's also a powerhouse is so impressive, I'd usually expect a project to be good at small problems (like mine) xor large problems, but not both

medvezhenok · 2026-06-19T05:49:12 1781848152

Yup. And an extra benefit that you can treat any file like a table, so you can also do something like

  UPDATE my_table
  SET x = file1.x,
      y = file2.y
  FROM 'first_file.csv' file1
  LEFT JOIN 's3://my_bucket/second_file.parquet' file2
    ON file1.id = file2.id
  WHERE my_table.id = file1.id;

steve_adams_86 · 2026-06-19T04:37:18 1781843838

> DuckDB has received widespread adoption because it's just so damn easy to use.

This was a major factor in my initial adoption. Since then it has stuck because it’s also absurdly capable, versatile, and fast.

If it wasn’t so easy to use I suspect I wouldn’t have adopted it when I did. The ergonomics are crazy. It still impresses me regularly.

jkubicek · 2026-06-19T04:44:42 1781844282

What do you use it for? I’m perpetually interested in using DuckDB, but it doesn’t seem to do anything I need.

raihansaputra · 2026-06-19T10:39:15 1781865555

throwing in my 2 cents: It just replaced pandas for me. It's just so much easier to write sql against csv/json/whatever format data in jupyter/marimo notebooks through duckdb rather than reasoning through pandas. SQL is far more natural for me, and agents also work through it easily.

orthoxerox · 2026-06-19T05:46:16 1781847976

All kinds of data processing. For example, you download a million rows of metrics and load them in Excel to build pivot tables. It works, but now it's a billion rows. If you know SQL, it's a snap to point DuckDB at the source CSV or JSON and get the results in a second.

medvezhenok · 2026-06-19T05:43:17 1781847797

Basically like a locally hosted Snowflake - it only shines if you have enough data to analyze (100 MB - 100 GB is probably the sweet-spot range - less than that and the benefits are small, more than that and you risk flying too close to the sun with memory usage).

It has connectors for Postgres & other stores, so I find it faster to connect to a Postgres instance, pull all of the data from a table (even if the table is like 50GB - if you have 30 cores on the machine it will pull from Postgres using 30 cores in parallel, so it will only take a minute or two) - and then any analytical queries on the data are 10+ times faster in DuckDB over native Postgres (GROUP BY, regexp_replace, count(distinct...) etc).

steve_adams_86 · 2026-06-19T05:50:24 1781848224

The most interesting use case lately has been using it as the transformation and validation engine for a CLI that handles scientific data. Some datasets are small and could have been handled at the application layer, but some are quite massive (especially genomic data). DuckDB bundles with the CLI and travels around any platform, is super lightweight, allows for easily running in CI, on a user’s machine, against datasets of all sizes, and so on.

There are other embeddable options out there but I found DuckDb fit better for the potentially massive datasets, and also because of how naturally it ingests the types of data we work with, some of its unique features, and how trivial it was to learn and integrate with the project.

Otherwise I use it almost daily for doing guardrailed data exploration with LLMs. I prefer SQL over random DSLs in AWS or Sentry or what have you. I’ll ingest the data I need and just run SQL against it. I mentioned in another comment that I’ll tend to store more useful data (especially data I export routinely, like infra cost reports) on S3 and use a Rill instance to do basic exploration in a GUI (it will query remote parquet files).

edweis · 2026-06-19T05:19:16 1781846356

I personally find it useful to search logs with AI

steve_adams_86 · 2026-06-19T05:43:26 1781847806

Yes, it’s amazing for giving rails and structure to data so you can be sure an LLM is making more sense than it might with grep and jq. It also allows a little more sanity at scale with jobs like this. You can get pretty crazy with parquet in S3 with an engine like duckdb. And it’s dirt cheap to keep that stuff hanging around for future reference and sanity checking your understanding of things.

For data I reference frequently, and especially which I know will grow over time, I’ve started using Rill because it makes ad-hoc exploration very smooth and low-friction.

My process tends to be something like:

1. Explore logs or some other at least somewhat structured dataset

2. Use Claude to find useful patterns and determine how I might benefit from this data in ways I wasn’t yet aware

3. See how often it’s useful for decision making

4. If it’s frequently useful, formalize it as a view in my Rill instance and refine the models to maximize their utility

mcv · 2026-06-19T09:08:39 1781860119

Is everything becoming columnar? Parquet stores data per column instead of per row because it improves compression. I get that. Arrow apparently is columnar, and now DuckDB also gets its efficiency by treating data as columns instead of rows?

I still need to wrap my head around how that works, but it's a fascinating development.

levanten · 2026-06-19T09:27:23 1781861243

It depends on your task. In analytics, where you need to scan lots of data points within a few columns, columnar storage is very much the best. But for transactional workloads, where you have to deal with specific entities, row-based storage is more advantageous. There are hybrid systems that try to be both at the same time, but in my experience they end up not doing either very well.
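
A toy sketch of that trade-off in plain Python. This is only an illustration of the two layouts, not DuckDB's actual storage format, and the sample data and function names are made up:

```python
# Toy illustration of row vs. column layout (not DuckDB's real storage format).
# Hypothetical data: three sales records.
rows = [
    {"id": 1, "item": "tea", "price": 3.0},
    {"id": 2, "item": "cake", "price": 4.5},
    {"id": 3, "item": "tea", "price": 3.0},
]

# The same data stored per column: one list per attribute.
columns = {
    "id": [1, 2, 3],
    "item": ["tea", "cake", "tea"],
    "price": [3.0, 4.5, 3.0],
}

def avg_price_row_store(rows):
    # Analytical scan: every whole record is visited even though
    # only one field is needed.
    return sum(r["price"] for r in rows) / len(rows)

def avg_price_column_store(columns):
    # Analytical scan: only the "price" list is read.
    prices = columns["price"]
    return sum(prices) / len(prices)

def fetch_record_column_store(columns, idx):
    # Transactional-style lookup: the record has to be reassembled
    # from every column.
    return {name: values[idx] for name, values in columns.items()}

print(avg_price_row_store(rows))
print(avg_price_column_store(columns))
print(fetch_record_column_store(columns, 1))
```

The aggregate only touches one list in the columnar layout, while fetching a single entity touches all of them, which is roughly why each layout suits a different workload.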

willtemperley · 2026-06-19T08:02:17 1781856137

One huge caveat: for anyone who cannot use dynamic linking, e.g. in an App Store context, DuckDB isn’t a great choice. It’s very hard to statically link extensions.

This is where Arrow wins I think. Arrow CPP for example has very portable builds and the C interface is very usable for building bindings.

DuckDB is excellent, but it’s more a black box than a library.

Edit: after a conversation with a robot, it would seem that the DuckDB and Arrow CPP C APIs are complementary, so it's very possible for Arrow CPP and DuckDB to coexist in an app, each with its own strengths. Arrow CPP doesn't have a simple SQL story, for example.

tobilg · 2026-06-19T09:09:50 1781860190

I can't confirm this, I have several instances which have statically linked extensions...

tobyhinloopen · 2026-06-19T08:23:55 1781857435

The only reason I know and use DuckDB is because my (internal, private-use-only, experimental) vibe coded projects use it a ton. I didn't pick it - LLMs did. Until this article, I wasn't aware of what it actually is capable of.

Most of these projects use JSON(L) files for storage, and duckdb to process them.

mootothemax · 2026-06-19T08:37:52 1781858272

If you haven’t investigated storing in parquet format - and it doesn’t break other consumers that need your jsonl formatted files - it could be worth trialling for your use case. You’ll see vastly smaller file sizes (even more so if you use zstd compression), and query speed will shoot up.

Usual caveats apply, but as a general rule it’s held up well for me. Only downside is that inspecting the results moves from vi on the output file to duckdb and a select * from.

tobyhinloopen · 2026-06-19T10:01:31 1781863291

I'll 100% try DuckDB in more serious projects where I would normally use Sqlite.

ai_fry_ur_brain · 2026-06-19T08:26:27 1781857587

What an incredible way to build software

snissn · 2026-06-19T05:56:05 1781848565

I'm just curious - is duckdb too slow for people? This benchmark from clickhouse shows it being fairly slow compared to some options: https://jsonbench.com/

conradkay · 2026-06-19T08:26:06 1781857566

That's for their `JSON` data types. In DuckDB it's just a string, meaning lots of queries will have to do JSON parsing on every row, but the inserts are very fast. Definitely a bit of a footgun when you actually just need STRUCT or MAP.

There's a talk about ClickHouse's approach from its creator: https://www.youtube.com/watch?v=xHj9mysh0GI , but the gist is that it maintains (sub)columns to store different paths in the JSON

In other ways DuckDB has very good JSON support, like you can do `CREATE TABLE name AS SELECT * FROM 'data.json';` and it'll infer the schema when possible.
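
A rough picture of the trade-off in plain Python. It is not how either engine is implemented, and the sample rows and helper names are hypothetical:

```python
import json

# Hypothetical raw rows, each stored as a JSON string.
raw = [
    '{"user": "a", "latency_ms": 120}',
    '{"user": "b", "latency_ms": 80}',
    '{"user": "a", "latency_ms": 100}',
]

def total_latency_from_strings(raw_rows):
    # String storage: every query re-parses every row's JSON text.
    return sum(json.loads(r)["latency_ms"] for r in raw_rows)

def shred_into_columns(raw_rows):
    # Parse once at load time, keeping one list per JSON path.
    # Assumes every row has the same keys as the first one.
    parsed = [json.loads(r) for r in raw_rows]
    return {key: [p[key] for p in parsed] for key in parsed[0]}

def total_latency_from_columns(cols):
    # Later queries only touch the already-typed column.
    return sum(cols["latency_ms"])

columns = shred_into_columns(raw)
print(total_latency_from_strings(raw))
print(total_latency_from_columns(columns))
```

Loading is cheap in the first form and querying is cheap in the second, which matches the fast inserts and slower queries described above.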

jdw64 · 2026-06-19T05:14:19 1781846059

The data scientists I work with use this. Why do they use it? I don't really know much about it, but I've noticed they use it quite often. I mainly use MySQL or PostgreSQL. What are the advantages of DuckDB? It seems like they usually use it as an alternative to Pandas.

medvezhenok · 2026-06-19T05:32:13 1781847133

DuckDB has been probably my most used tool in 2026 - if you're comfortable with SQL it's incredible at quickly prototyping and slicing / dicing data.

I do a lot of experiments with regexes, and if you get used to the RE2 syntax that DuckDB uses, you can see up to 10-100x uplift in terms of speed compared to Postgres on things like regexp_matches(), regexp_extract(), etc (depending on query/table/machine specifics). It has quite powerful scripting with custom Macros, fixes a lot of annoyances of SQL for me compared to Postgres.

I think if you have access to a machine with a lot of RAM / cores and a beefy data set, then it's basically like a RAMdisk version of Snowflake running locally on your machine.

(and of course the fact that it makes it convenient to read CSV/parquet, read/write from S3, etc) - it's a very ergonomic tool.

jdw64 · 2026-06-19T05:36:37 1781847397

Thank you for your kind reply. I should look into it too. In my case, knowing various libraries is directly related to my livelihood. Have a good day.

Demiurge · 2026-06-19T05:31:47 1781847107

Here is the thing, it’s a write only single file format. If you need to run analytical queries it’s optimized for reading, you just open a file and query for the parts you want. If you have multiple clients that read and write data to the database, you should use postgresql.

It’s not really a database in the traditional sense, there is no ACID complexity, it’s a library that lets you write SQL to query a tabular data file.

bdcravens · 2026-06-19T05:20:04 1781846404

Primarily the ability to work directly with data in its native format (CSV for example) without needing ETL.

throwaway7783 · 2026-06-19T05:26:26 1781846786

How does this work in a production setup? Can this be set up like a server, or is it mostly for individual users to play around with data?

DanielHB · 2026-06-19T08:24:16 1781857456

In my previous job (working with electric vehicles) we had an AWS Batch job that pulled all data from S3[1] into containers (one container per vehicle), pushed that data into duckdb, and then ran some basic queries and data analysis.

The key thing is that this scaled horizontally pretty much forever, since each vehicle had a fixed amount of data per year we could tightly control the performance characteristics of the analysis. Adding more vehicles didn't make things slower, just linearly more expensive.

I vaguely remember the data from those containers also being used for some aggregate analysis (like each vehicle container would output some data that would be consumed by another job that did aggregates). But I don't remember the specifics.

[1]: I believe we used JSONL or parquet format, but I didn't work in that part of the stack directly

orthoxerox · 2026-06-19T05:41:02 1781847662

The idea is that you treat data storage and data processing as two distinct tasks. You have your data in S3 or HDFS or a local directory and you run DuckDB on whatever single-node compute you have: a local machine or a container in a cluster.

There are companies that write cluster computing engines with duckdb as the byte-cruncher at their heart, but usually it's more like NumPy, Pandas or Polars on steroids. Or SQLite, but for running OLAP queries.

blackoil · 2026-06-19T05:30:24 1781847024

It is an OLAP db. So you can have a pipeline storing data in parquet files in S3. And then use DuckDB to directly query on it.

jdw64 · 2026-06-19T05:25:28 1781846728

Then it definitely makes sense. Scientists usually handle a lot of CSV files. Thank you

bunsenhoneydew · 2026-06-19T07:34:43 1781854483

DuckDB is a fantastic piece of tech. One of the best, if not the best, I’ve found in several years.

Panzerschrek · 2026-06-19T05:49:06 1781848146

If DuckDB is so fast and has no data transfer overheads, does it need all this typical SQL machinery with filtering and joining via SELECT queries? Wouldn't it be simpler and faster to return all data to the caller code (all table rows, but only requested columns) and let it perform all other necessary data processing logic?

jauco · 2026-06-19T06:20:47 1781850047

You’d end up implementing your own home grown version of hash join and query pushdown (skipping parquet row groups entirely) etc and your own home grown heuristics in selecting the right one (planning)

Which can outperform a generic solution like this of course, but it’s not less work to make faster for most cases.

Also duckdb can give you access to an in memory representation (e.g. `fetch_arrow_table()`) so you have less “language data structure wrapping” overhead. And you can do filtering yourself on that. In most cases the “where” statements will win though.
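
To give a feel for what you would be re-implementing, here is a minimal plain-Python sketch of a hash join and of min/max-based row group skipping. It is a toy under stated assumptions, not DuckDB's code: the row-group dictionaries and function names are invented for illustration.

```python
from collections import defaultdict

def hash_join(left, right, key):
    # Build phase: index the (ideally smaller) right side by join key.
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    # Probe phase: one dictionary lookup per left row (inner join).
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def scan_with_skipping(row_groups, column, lower_bound):
    # Each row group carries max statistics; groups whose max is below
    # the bound are skipped without looking at their rows.
    out, groups_read = [], 0
    for group in row_groups:
        if group["max"][column] < lower_bound:
            continue
        groups_read += 1
        out.extend(r for r in group["rows"] if r[column] >= lower_bound)
    return out, groups_read

# Hypothetical sample data.
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [{"uid": 1, "total": 10}, {"uid": 1, "total": 25}, {"uid": 3, "total": 5}]
joined = hash_join(orders, users, "uid")

row_groups = [
    {"max": {"total": 9}, "rows": [{"total": 5}, {"total": 9}]},
    {"max": {"total": 40}, "rows": [{"total": 10}, {"total": 40}]},
]
matches, groups_read = scan_with_skipping(row_groups, "total", 20)
print(joined)
print(matches, groups_read)
```

Even this toy leaves out the hard part, which is deciding which side to build on and when skipping pays off.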

pknerd · 2026-06-19T05:48:53 1781848133

FTA:

> ..In-process means there's no server. You don't connect to DuckDB; you load it as a library inside your program, the same way you'd load NumPy or Polars

Does it mean it can perform all statistical computations as well if I want to use for algo trading?

holografix · 2026-06-19T05:48:39 1781848119

Why is DuckDB so popular when one can use Python + Pandas?

Better perf + SQL is that mostly it?

refactor_master · 2026-06-19T06:21:31 1781850091

The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:

  WITH lagged AS (
      SELECT 
          *, 
          LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
      FROM events
  ),
  sessions AS (
      SELECT 
          *, 
          SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1)) 
              OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
      FROM lagged
  )
  SELECT
      user_id,
      session_id,
      MIN(event_time) AS session_start,
      MAX(event_time) AS session_end,
      COUNT(*) AS event_count
  FROM sessions
  GROUP BY ALL
  ORDER BY user_id, session_start;

vs

  result = (
      df.sort(["user_id", "event_time"])
      .with_columns(
          session_id=(
              pl.when(pl.col("event_time").diff().is_null())
              .then(1)
              .when(pl.col("event_time").diff().dt.total_minutes() > 30)
              .then(1)
              .otherwise(0)
              .cum_sum()
              .over("user_id")
          )
      )
      .group_by(["user_id", "session_id"])
      .agg(
          session_start=pl.col("event_time").min(),
          session_end=pl.col("event_time").max(),
          event_count=pl.col("event_time").count(),
      )
      .sort(["user_id", "session_start"])
  )
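
Both snippets aim at the same rule: a new session starts at a user's first event, or after a gap of more than 30 minutes. A plain-Python sketch of that rule can be handy for checking either version against a small hand-made input. The `sessionize` helper and the sample events are hypothetical:

```python
from datetime import datetime, timedelta
from itertools import groupby

def sessionize(events, gap=timedelta(minutes=30)):
    """Summarize sessions from (user_id, event_time) tuples.

    A new session starts at a user's first event or when the gap
    since that user's previous event exceeds `gap`.
    """
    sessions = []
    # Sorting tuples orders by user_id first, then event_time.
    for user_id, group in groupby(sorted(events), key=lambda e: e[0]):
        session_id, prev = 0, None
        for _, ts in group:
            if prev is None or ts - prev > gap:
                session_id += 1
                sessions.append({
                    "user_id": user_id,
                    "session_id": session_id,
                    "session_start": ts,
                    "session_end": ts,
                    "event_count": 0,
                })
            sessions[-1]["session_end"] = ts
            sessions[-1]["event_count"] += 1
            prev = ts
    return sessions

# Hypothetical sample data.
t0 = datetime(2026, 1, 1, 9, 0)
events = [
    ("u1", t0),
    ("u1", t0 + timedelta(minutes=10)),
    ("u1", t0 + timedelta(minutes=50)),  # 40-minute gap: new session
    ("u2", t0 + timedelta(minutes=5)),
]
sessions = sessionize(events)
for s in sessions:
    print(s)
```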

coldtea · 2026-06-19T10:41:30 1781865690

Precisely to avoid the custom NIH Polars API, and use SQL which works everywhere.

brikym · 2026-06-19T07:31:45 1781854305

Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.

porridgeraisin · 2026-06-19T06:58:04 1781852284

I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.

IshKebab · 2026-06-19T08:42:16 1781858536

That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)

homebessguy · 2026-06-19T07:11:00 1781853060

"Languages come and go, but SQL is forever"

paytonjjones · 2026-06-19T05:57:24 1781848644

Pandas has lots and lots of problems.

Performance is definitely one of them, but it also has inconsistent and duplicated methods, inconsistent defaults (e.g. some methods are inplace by default), copy by reference issues, I could go on.

It was an early winner in an extremely popular language. That's really the main thing going for it, but alternatives have been a long time coming.

estetlinus · 2026-06-19T06:13:11 1781849591

Why would you prefer Python and Pandas over good old SQL? Pandas is so verbose and hard to debug, and most of the time it struggles to be performant even on small datasets.

SQL has been around since the dawn of databases. I am happy to see a trend away from pandas.

RobinL · 2026-06-19T05:56:40 1781848600

I wrote a blog post a while back to address this question here: https://www.robinlinacre.com/recommend_duckdb/

codingbear · 2026-06-19T05:24:01 1781846641

duckdb is so nice coupled with claude code. Its extensive file support and some very interesting decisions on locally caching data (like from S3 or snowflake) make it easy to slice and dice almost any kind of tabular data.

blackoil · 2026-06-19T05:32:12 1781847132

> duckdb is so nice coupled with claude code

Can you expand upon it? You mean claude code uses it to store its memory/state, or that it can do business queries using DuckDB?

medvezhenok · 2026-06-19T05:45:30 1781847930

Claude code can write exploratory queries for you to give you a quick rundown on the shape of the data set, frequencies, missing values, etc etc (without having to load it into a more persistent data store or writing custom python scripts). I also find SQL snippets inherently more re-usable than custom python code.

You can also write a skill that CC can re-use if you're analyzing a lot of similar data sets with minor variance.

thefourthchime · 2026-06-19T04:42:42 1781844162

I’m a huge fan, I’ve been wanting to know more about the internals. Look forward to digging in.

f311a · 2026-06-19T06:04:27 1781849067

I wish this article was not LLM written

pknerd · 2026-06-19T05:52:42 1781848362

umm can we say it can replace SQLite?

3eb7988a1663 · 2026-06-19T05:54:19 1781848459

OLAP vs OLTP. Sure you could use one for the other, but they have ideal use cases. Updating a single record in SQLite is going to be more efficient than doing the same in DuckDB.

steve_adams_86 · 2026-06-19T05:59:00 1781848740

They seem similar at a glance but they’re quite different. You can think of SQLite as a transactional database while DuckDB is better used as an analytical database.

I can see applications having valid reasons to use both. You can use SQLite as the catalog in duck lake systems, for example. SQLite is your metadata record, DuckDB is your ingestion/scanning/aggregating/joining engine.