Unexpected indexes on PostgreSQL 12

I recently wrote on Twitter that we've upgraded our database from PostgreSQL 11 to 12. For context, the database itself is 1.5 TB in size, and the biggest table has more than a billion rows and weighs 300+ GB alone, without indexes. Many other tables have 100M+ rows too.

The main benefit I was expecting from the new version was REINDEX CONCURRENTLY. We have a couple of places where we do it manually every day: CREATE INDEX CONCURRENTLY, drop the old index, rename the new one. We need to do it often because we implemented a task queue in PostgreSQL using SELECT ... FOR UPDATE SKIP LOCKED, and it thrashes the index and bloats it a bit too much. It's a small price to pay to have that particular task queue directly in our database, with transactional integrity, JOINs, history and everything.
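For illustration, the old dance and its PostgreSQL 12 replacement look roughly like this (table, column and index names are made up):

-- The kind of dequeue that churns the index:
SELECT id FROM task_queue
WHERE status = 'pending'
ORDER BY priority, id
FOR UPDATE SKIP LOCKED
LIMIT 1;

-- PostgreSQL 11 and earlier: rebuild the bloated index by hand
CREATE INDEX CONCURRENTLY task_queue_pending_idx_new
    ON task_queue (priority, id) WHERE status = 'pending';
DROP INDEX CONCURRENTLY task_queue_pending_idx;
ALTER INDEX task_queue_pending_idx_new RENAME TO task_queue_pending_idx;

-- PostgreSQL 12: the same swap in one statement
REINDEX INDEX CONCURRENTLY task_queue_pending_idx;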

When I say we do it manually, of course I mean that we have a task that's run daily. I'm not a total maniac.

We also have a couple of materialized views that we update concurrently in the same way: CREATE TABLE from a query, drop the old table, rename the new one, continue. PostgreSQL 13 will probably get something on that front too, but for now we have to go the extra mile. In PostgreSQL 11 it took 22-25 minutes to refresh each of them (this one runs weekly). PostgreSQL 12 has a JIT compiler, so I expected some speedups here. In my testing it got approximately 10% faster (about 2 minutes).
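A sketch of that swap, with hypothetical names:

-- Build the new "materialized view" as a plain table
CREATE TABLE sales_summary_new AS
    SELECT product_id,
           date_trunc('day', created_at) AS day,
           sum(amount) AS total
    FROM sales
    GROUP BY 1, 2;
CREATE INDEX ON sales_summary_new (product_id, day);

-- Atomically swap it in
BEGIN;
DROP TABLE sales_summary;
ALTER TABLE sales_summary_new RENAME TO sales_summary;
COMMIT;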

The other thing I remembered was that B-tree indexes should have gotten better. I found this article to be a pretty good explanation. In short, a new on-disk format was introduced that's more compact and less prone to bloat, hence faster. And we need to REINDEX old indexes after pg_upgrade to get these goodies. Ok, I thought.
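If I understand it correctly, you can check which format a given index is on via the pageinspect extension; indexes carried over by pg_upgrade keep the old version until rebuilt ('my_index' is a placeholder):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- 4 is the new PostgreSQL 12 format; upgraded indexes report 2 or 3
SELECT version FROM bt_metap('my_index');

-- Rebuild to get the new format, without blocking writes
REINDEX INDEX CONCURRENTLY my_index;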

After the upgrade, one of my developers suggested trying to convert one hash index to B-tree on our biggest table. We've had quite a few hash indexes since PG10 made them reliable, and when we tested them back then, they were definitely smaller. The results were astounding: the index shrank from 54 GB to 25 GB. More than a two-fold reduction in size is too good to be true! And indeed it was: when we executed REINDEX on the hash index itself, it shrank from 54 to 38 GB, so much of the difference was just bloat. Still, I'll take a 25 GB index over a 38 GB one every day of the week, thank you very much.
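A hypothetical replay of the experiment: build the B-tree alongside the existing hash index, then compare on-disk sizes before dropping the loser:

CREATE INDEX CONCURRENTLY big_table_account_id_btree
    ON big_table USING btree (account_id);

SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE relname = 'big_table';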

Granted, this column has low cardinality: for 1.2 billion rows it has only 30 million distinct values (a 40x difference). Maybe hash is inefficient with lots of duplicate values that end up in the same hash bucket? I tested it on another table, on a column with much higher cardinality: only 2.3 rows per unique value. The B-tree was still much smaller, only 63% of the hash index (4823 MB vs. 7655 MB).
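One cheap way to eyeball cardinality without scanning a billion rows is the planner's own statistics (column and table names are again made up):

-- n_distinct > 0 is an absolute count; negative values are a fraction
-- of the row count (e.g. -1 means all values are unique)
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'big_table' AND attname = 'account_id';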

Okay, what if I create a hash index on some primary key column and then compare it with a btree? Still a 25% difference in favor of the B-tree. What the heck? Was our testing back in the day sloppy, did we compare apples to oranges, or is this new B-tree just so much better?

Do we really need hash indexes at all when B-tree indexes are more compact, faster, and have more features? Well, all the previous testing was done on bigint columns; maybe text columns have something else to say?

They definitely do: for example, it's impossible to have a btree index on a text column with a value longer than a third of a page. If you try, PostgreSQL will helpfully complain:

ERROR:  index row size 2744 exceeds btree version 4 maximum 2704
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
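The workaround the hint suggests is a functional index over a hash of the value; a minimal sketch with a hypothetical documents table:

-- Index the MD5 of the value instead of the value itself
CREATE INDEX documents_body_md5_idx ON documents (md5(body));

-- Lookups must use the same expression to hit the index
SELECT * FROM documents WHERE md5(body) = md5('some long value');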

Looks like hash indexes do not have such a limit at all. I got curious and tried to create a hash index on a text column with a value 15 million characters long, and it worked with no problems at all. I guess the hash index in PostgreSQL is true to its name and doesn't store the indexed value within itself.
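You can try this in a scratch database; a value far beyond the btree limit indexes fine with hash:

CREATE TABLE blobs (body text);
INSERT INTO blobs VALUES (repeat('x', 15000000));  -- 15 million characters

CREATE INDEX ON blobs USING hash (body);   -- works
CREATE INDEX ON blobs USING btree (body);  -- index row too large, even compressed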

What else? For some random table and random column I tried it on, the hash index is almost half the size of the btree (44%) and much faster to construct (by 33%). It's also much faster to query: for the same 1000 random values the hash index was three times faster (5 ms vs. 15 ms) with everything in cache, and it hit only 2577 shared buffers against the btree's 6100.
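The buffer numbers come straight out of EXPLAIN; something like this per value (names hypothetical), summed over the 1000 lookups:

-- "Buffers: shared hit=N" in the output is what we compared
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE session_token = 'd41d8cd98f00b204';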

Maybe hash indexes are also more efficient with some other field types like UUID, but they can't back UNIQUE constraints, which further limits their usefulness.
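(If you really need hash-backed uniqueness, an exclusion constraint over the equality operator is the usual workaround, though I haven't measured how it compares:)

-- Not a UNIQUE constraint, but enforces the same thing via a hash index
ALTER TABLE users
    ADD CONSTRAINT users_token_no_dupes EXCLUDE USING hash (token WITH =);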

The end of the story: we converted almost all our hash indexes to btree.