Pitfalls of JSONB indexes in PostgreSQL
PostgreSQL has supported JSON fields for a long time, and it can even index them. This functionality was added in versions 9.2 and 9.4, which are unsupported by now. It's older than the oldest JavaScript frameworks that are still actively used; we can even call it ancient by current standards.
I remember the world before PostgreSQL had JSON support perfectly well, because 9.2 was released in 2012, and before that I worked at a company that used MongoDB (we suffered greatly[1]). That was a major marketing point for Mongo back then: "you can store any document without tediously defining an ungodly schema! You gain so much flexibility!"
Little did we know back then that the world does not work that way, and relational SQL databases are actually far more flexible than document-oriented DBs, columnar DBs, or whatever.[2] Often we don't know exactly what we are going to do with the data, and with relational DBs we can lay out the data however seems reasonable and then add indexes to support our use cases.
In document-oriented DBs you need to lay out your data exactly the way you're going to query it later. Or else you'll need to migrate data inside your schema-less database to another layout, which is way more cumbersome and error-prone than adding some indexes and JOINs.
Don't take my word for it - there is an exceptional talk on Advanced Design Patterns for DynamoDB by Rick Houlihan, a Principal Technologist at AWS. He explains this and much more - it's a very information-dense presentation with interesting ideas. I found it useful even though I don't plan to use DynamoDB or MongoDB in the near future.
Anyway, JSON support was added to PostgreSQL a long time ago, because sometimes it is useful to store some documents in the database. And it can be indexed in two different ways - a full GIN index and a special jsonb_path_ops one that supports only the @> operator. It means "contains" and can be used like this:
SELECT * FROM table WHERE jsonb_field @> '{"employer": {"country": "ZA"}}';
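For reference, the two flavors differ in what they can index. A rough sketch, with a hypothetical table and column:

-- full GIN index: supports @> as well as the key-existence operators ?, ?| and ?&
CREATE INDEX docs_payload_idx ON docs USING gin (payload);

-- jsonb_path_ops GIN index: supports only @>, but is smaller and faster to search
CREATE INDEX docs_payload_path_idx ON docs USING gin (payload jsonb_path_ops);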
Let me tell you a story about how I cleverly used this feature and it bit me in the ass.
Story time
I am a co-founder of www.prophy.science, a product that can understand, search and recommend scientific papers and experts. To do that well, we need a collection of all scientific papers, and papers are often provided by many different providers with different ids. There are PubMed (30M+ articles), PubMed Central (6M+ articles), Crossref (80-100M+), INSPIRE, and there are preprint servers like arXiv, bioRxiv, medRxiv and many others.
There is a widespread system of DOIs that are used to persistently identify journal articles, research reports and data sets. It was introduced in the year 2000, and, as many of these bibliographic databases predate the DOI standard, they have their own identifiers. Sometimes they even cross-link their IDs between different services, and sometimes they cross-link the wrong articles.
Some monitoring services download data from Crossref, PubMed, PMC and some other sources, add them up and report that they have 180 million articles, 220 million, or some other bullshit. We strive to merge the same article from different sources into one entity with many external identifiers. We called these identifiers "origin ids" and stored them in a special jsonb column, so one row could have a record like this:
{"pubmed": "3782696", "pmc": "24093010", "doi": "10.3389/fnhum.2013.00489"}
It was a simple key-value document with a jsonb_path_ops index on it. And whenever we needed to fetch an article by an origin id, we queried it with the @> operator like this:
SELECT id FROM articles WHERE origin_ids @> '{"pubmed": "123456"}';
It is a bit easier to store ids this way, no need to maintain a separate table with hundreds of millions of rows.
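To make it concrete, the relevant part of the schema looked roughly like this (a sketch; the index name is invented and the real table has many more columns):

CREATE TABLE articles (
    id         bigserial PRIMARY KEY,
    origin_ids jsonb NOT NULL DEFAULT '{}'
    -- ...plus many other columns
);

-- jsonb_path_ops index that backs the @> lookups above
CREATE INDEX articles_origin_ids_idx
    ON articles USING gin (origin_ids jsonb_path_ops);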
One problem arose when we tried to query the index with many different origin ids. There is no IN nor ANY() equivalent for @>, so we stitched lots of ORs together:
SELECT id FROM articles WHERE
origin_ids @> '{"pubmed": "123456"}' OR
origin_ids @> '{"pubmed": "654321"}' OR
origin_ids @> '{"pubmed": "123321"}' OR
origin_ids @> '{"pubmed": "456654"}';
Explain everything
And with enough ORs the query gets really slow. Why? EXPLAIN helpfully says that it becomes a sequential scan (I shortened the output for clarity):
EXPLAIN
SELECT id, origin_ids
FROM articles
WHERE origin_ids @> '{"pubmed": "123456"}' OR
origin_ids @> '{"pubmed": "654321"}' OR
....; -- x200
QUERY PLAN
------------------------------------------------------------
Seq Scan on articles (rows=7805036)
Filter: ((origin_ids @> '{"pubmed": "123456"}') OR
(origin_ids @> '{"pubmed": "654321"}') OR ...x200)
Why? For some reason it thinks that this query will return millions of rows. But one origin id can match at most one article if my data is correct, so 200 filters should only match 0..200 rows. Let's look at EXPLAIN ANALYZE to check:
EXPLAIN ANALYZE
SELECT id, origin_ids
FROM articles
WHERE origin_ids @> '{"pubmed": "123456"}' OR
origin_ids @> '{"pubmed": "654321"}' OR
....; -- x200
QUERY PLAN
------------------------------------------------------------
Seq Scan on articles (rows=7805036) (actual rows=200)
Filter: ((origin_ids @> '{"pubmed": "123456"}') OR
(origin_ids @> '{"pubmed": "654321"}') OR ...x200)
It does indeed return only 200 rows. Hmmm... Let's check one row:
EXPLAIN ANALYZE
SELECT id, origin_ids
FROM articles
WHERE origin_ids @> '{"pubmed": "123456"}';
QUERY PLAN
------------------------------------------------------------
Bitmap Heap Scan on articles (rows=43038) (actual rows=1)
Recheck Cond: (origin_ids @> '{"pubmed": "123456"}')
-> Bitmap Index Scan on ... (rows=43038) (actual rows=1)
Index Cond: (origin_ids @> '{"pubmed": "123456"}')
Supposedly 43 thousand rows for only one filter! And 7.8 million rows is 39 thousand times more than 200, which is pretty close. At the time I fired these queries we had only 43 million articles. PostgreSQL gathers statistics about the values in each column to be able to produce reasonable query plans, and it looks like it's shooting blanks for this column.
What's the simplest fix? Oftentimes ANALYZE on a table is enough to fix broken statistics, but this time it didn't help at all. Sometimes it's useful to adjust how many rows are sampled to gather statistics, and that can be tuned down to per-column granularity with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, but here it had no effect either.
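Concretely, those attempts looked something like this (100 is the default statistics target, 10000 the maximum):

-- re-gather statistics for the table
ANALYZE articles;

-- sample more rows for this particular column, then re-analyze
ALTER TABLE articles ALTER COLUMN origin_ids SET STATISTICS 10000;
ANALYZE articles;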
Since version 10 PostgreSQL supports CREATE STATISTICS to gather complex statistics for inter-column dependencies and whatnot, but our filter is single-column, so no luck here either.
Contsel
So I dug some more, and more... And found that the @> operator uses something called contsel. It was mentioned on the PostgreSQL mailing list in 2010. I tried to decipher what contsel means, and I think it stands for "contains selectivity". Then I searched the PostgreSQL sources for mentions of contsel and found exactly one place in the C code that mentions it:
Datum
contsel(PG_FUNCTION_ARGS)
{
    PG_RETURN_FLOAT8(0.001);
}
0.001? That looks exactly like the ratio between 43 million rows in the table and an estimated 43 thousand rows in the result. However, if we just multiply 43 thousand by 200 filters we should get 8.6 million, and PostgreSQL estimated only 7.8M. This discrepancy bothered me for a minute because I like to understand things completely, so they won't set me up for an unpleasant surprise later[3].
After a minute of contemplating the difference I realized that it's probability at play - PostgreSQL thinks that every filter can match 0.1% of the total number of rows, and those matches can overlap. The actual math is:
1 - 0.999 ** 200 = 1 - 0.819 = 0.181
18.1% of 43 million is 7.8 million (I'm rounding numbers here). Itch scratched successfully.
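You can sanity-check the arithmetic right in psql:

SELECT 1 - power(0.999, 200)                        AS match_fraction, -- ~0.181
       round((1 - power(0.999, 200)) * 43000000)    AS estimated_rows; -- ~7.8 million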
And, depending on the relative costs of various factors in the config, Postgres will pick either a sequential scan or an index scan. Our first solution was to slice these filters into batches of no more than 150 per query. It worked quite well for a couple of years.
Domain modeling failure
Until we learned that one article could have more than one such external identifier per type. For example, some preprint services grant a new DOI for each version: 10.26434/chemrxiv.11938173.v8 has eight of them at the time of writing. On top of that it has the main DOI without a version, 10.26434/chemrxiv.11938173, and will get yet another one if it is published after peer review. There are other such cases for other identifier types (we call these types "origin name").
We had two options:

- Store origin ids in a separate table with columns article_id, origin_name and origin_id, with two indexes - one on article_id and the other on (origin_name, origin_id);
- Accommodate many values per key in jsonb. Two more possible options here:
  - Many values per key: {"doi": ["10.26434/3", "10.26434/3.v1"]}
  - List of pairs: [["doi", "10.26434/3"], ["doi", "10.26434/3.v1"]]
  Both can be queried with @>, but it's getting even uglier than it was.
We ended up doing kind of both - we created a separate table that's much easier to query with many origin ids at once, and we store a list of pairs in a separate non-indexed column so it's convenient to query.
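The separate table is nothing fancy - roughly this shape (a sketch; the index names are invented):

CREATE TABLE articles_origin_ids (
    article_id  bigint NOT NULL REFERENCES articles (id),
    origin_name text   NOT NULL,
    origin_id   text   NOT NULL
);

-- look up all external ids of a given article
CREATE INDEX articles_origin_ids_article_idx
    ON articles_origin_ids (article_id);

-- resolve external ids back to articles
CREATE INDEX articles_origin_ids_lookup_idx
    ON articles_origin_ids (origin_name, origin_id);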
Separate table speed-up
As a bonus, it's much, much faster to query a btree index with lots of filters than a GIN one. With a GIN index every @> turns into a separate Bitmap Index Scan that costs approximately a millisecond each (0.7-1.2 ms if in cache). With a btree index on two columns we construct a query that looks like this:
SELECT article_id FROM articles_origin_ids WHERE
(origin_name = $1 AND origin_id = ANY($2)) OR
(origin_name = $3 AND origin_id = ANY($4)) OR
(origin_name = $5 AND origin_id = ANY($6));
Accessing a btree index is faster even by itself: I get 0.07 ms for the Bitmap Index Scan node in EXPLAIN ANALYZE for a single origin_name, origin_id pair. And when we fit many origins into one query, the cost of accessing the index (each AND group turns into a separate Bitmap Index Scan) gets amortized across tens to hundreds of ids with the same origin_name. It can go as low as 0.01 ms (10 µs) per origin_id.
That's two orders of magnitude! Thank you very much, I'll take it. If I had read something like that in the docs, I would have gone with a separate table right from the start.
Additional pitfalls
Recent versions of PostgreSQL add new features that can make these bad query plans even more costly.
For example, when a query is estimated to return millions of rows, it makes complete sense to fire up parallel workers, but for a mere 150 rows - not so much. For a test query that I ran to get concrete numbers for this post, parallelization with 6 workers only brings the time down from 143 ms to 120 ms.
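If you suspect parallelism is being wasted on a misestimated query, a quick session-level experiment is to compare timings with workers disabled:

-- temporarily disable parallel workers for this session
SET max_parallel_workers_per_gather = 0;
EXPLAIN ANALYZE
SELECT id FROM articles
WHERE origin_ids @> '{"pubmed": "123456"}' OR
      origin_ids @> '{"pubmed": "654321"}'; -- ...and the rest of the ORs
RESET max_parallel_workers_per_gather;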
JIT compiling bad plans
The other pitfall is new as of PostgreSQL 12 (I have a post on how we upgraded to it), and it's called query JIT compilation - it's enabled by default there.
The JIT compiler decides the query is massive enough to be worth JITting and spends additional time on it. For my particular test query, it spends 150-1500 ms every time the query is fired (depending on optimization and inlining). That's 1-10 times slower than the slow query itself, and up to 1000 times slower than a fast query against the separate table. 1000 times!
Thankfully, we migrated origin ids to a separate table before PostgreSQL 12 even came out.
My friends had the same problem[4]: JIT spent more than 13 seconds compiling a query that usually executes in 30 ms. The upgrade to PostgreSQL 12 brought their site down until they turned JIT off. They also had a jsonb column with an index on it, which inflated the estimated rows and cost. However, even without that part, the query was big enough to trigger JIT compilation.
I found the same problem in my own code just yesterday: the query was big and complex enough to trigger JIT compilation, but the number of results wasn't big enough, so it slowed things down instead of speeding them up.
I tried to make PostgreSQL cache the JITted code with prepared statements, but I wasn't successful. It looked to me like JIT was compiling the query every time I fired EXPLAIN ANALYZE, even after more than five executions to let the query plan stabilize. I tried to force a generic plan so the JITted code would be cached, but that didn't help either.
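For the record, the experiment went roughly along these lines (PostgreSQL 12 syntax; the statement name is invented):

-- prepare the statement once...
PREPARE find_article (jsonb) AS
    SELECT id FROM articles WHERE origin_ids @> $1;

-- ...ask for a generic plan instead of re-planning per parameter set...
SET plan_cache_mode = force_generic_plan;

-- ...and execute it repeatedly; JIT still seemed to compile on every execution
EXECUTE find_article('{"pubmed": "123456"}');
EXECUTE find_article('{"pubmed": "654321"}');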
Possible fix
I don't know PostgreSQL internals well enough to say how hard it would be to add statistics for jsonb fields. Maybe it's possible to somehow extend CREATE STATISTICS, or to make @> respect n_distinct somehow.
With JIT triggering a very expensive compilation for complex queries that return only a few rows, I think it's best to penalize the decision to enable JIT based on how many nodes[5] there are in the query plan. There are already settings like jit_above_cost, and with an additional setting like jit_node_cost the decision to JIT could look like this:

query_plan_cost > jit_above_cost + jit_node_cost * query_plan_node_count
For now, I'll just turn JIT off completely in postgresql.conf, and enable it with SET jit = 'on' only where I know it helps.
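One way to do that:

-- disable JIT cluster-wide (same effect as jit = off in postgresql.conf)
ALTER SYSTEM SET jit = off;
SELECT pg_reload_conf();

-- then opt back in only where JIT is known to help
SET jit = on;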
Notes
[1] Some say they've almost completed migration to PostgreSQL and do not suffer anymore. Some say they still suffer with what's left almost a decade later.
[2] Datomic claims that it's even more flexible than relational DBs, and having played with it a bit I tend to think it really is. However, their main focus is Datomic Cloud on AWS, and I need an on-premise setup with no additional Cassandra.
[3] After a discussion on lobste.rs I understood that it's possible to query PostgreSQL directly to find out what @> is using:
SELECT oprname, typname, oprrest FROM pg_operator op
JOIN pg_type typ ON op.oprleft = typ.oid WHERE oprname = '@>';
You'll also see that for anyarray types it uses a different selectivity function, arraycontsel. PostgreSQL does gather statistics for arrays, so @> can be a viable choice there. Also, arrays support the && overlap operator, which can be handy.
[4] The author of that post is not only a friend of mine, but also my brother.
[5] Are they actually called "nodes"? I'm not sure.