
So you’re starting a new project – or company – and you’ve decided that you’re going to run on Postgres. The hard part (picking your database) is over, and now the fun part begins: making sure that you don’t need to think about it again for another few years. This post will run through some lesser-known tips for configuring Postgres for the long haul, from connection pooling to security basics to extensions and indices.

Set Up Connection Pooling for Postgres With PGBouncer

By default, Postgres forks a separate backend process off the main postmaster process for each client connection. At low volume, the time it takes to create and destroy those processes – plus the fact that they’re never reused – won’t matter. You can raise max_connections manually, but eventually, as you scale up, the overhead of all those short-lived processes will catch up with you. Connection pooling helps you essentially “cache” these server connections and reuse them as clients connect and disconnect from your database. Though you can build connection pooling into your application logic, most opt for a third-party tool, and in Postgres’s case, that’s PGBouncer. It’s an open source, lightweight connection pooler that you can install on either your database server or your application server. You’ve got three pooling modes to choose from:

  • Session pooling: stays true to the “client connections are indefinite” model and keeps a connection open for the entire time a client is connected.
  • Transaction pooling: connections last for a single transaction, after which they’re sent back to the pool.
  • Statement pooling: connections last for just a single query, so multi-statement transactions won’t work at all in this mode.

Most choose session pooling – it’s the most conservative mode and the least likely to break anything by dropping connections – but every app is different, and you’ll need to figure out the right mode for your constraints.
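To make that concrete, here’s a minimal pgbouncer.ini sketch. The database name, addresses, credentials file, and pool sizes are placeholders rather than recommendations – tune them for your own workload:

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; session, transaction, or statement
pool_mode = session
; client connections PGBouncer will accept
max_client_conn = 500
; server connections kept open per database/user pair
default_pool_size = 20

Point your application at port 6432 instead of 5432 and PGBouncer will hand it a connection from the pool.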

Connection Pooling’s Performance Impact

The million-dollar question, though: does this actually work? Percona ran a series of benchmarks to figure out how PGBouncer impacts performance. With a small number of concurrent clients (fewer than 100 or so), pooling doesn’t buy you much; beyond that, you start to see meaningful performance benefits from PGBouncer. So do you need a connection pooler right away to support your first few users? Probably not. But using PGBouncer will help you once you reach even low to moderate traffic.
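If you’d rather measure the impact on your own workload than rely on someone else’s benchmark, pgbench (which ships with Postgres) makes the comparison easy. The database name, client counts, and durations below are arbitrary examples:

# Initialize a pgbench test database (scale factor 50 is arbitrary)
pgbench -i -s 50 mydb

# 200 clients hitting Postgres directly on 5432...
pgbench -c 200 -j 4 -T 60 -p 5432 mydb

# ...then the same load through PGBouncer on its default port, 6432
pgbench -c 200 -j 4 -T 60 -p 6432 mydb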

Postgres Security for Dummies

For the first few weeks you’re working on your project, it’s usually just one or two developers working on an empty database; security is not top of mind. But as you launch your app to the world, it’s going to need to be. And with databases, it sometimes seems like there are a million different ways to lock things down.

Restricting Access at the Host or User Level

Let’s start with access. Postgres restricts access in two ways:

  • At the host level – defining IP addresses and domains with access rights
  • At the user level – defining database users and their permissions

The pg_hba.conf file in the PGDATA directory is where you define who can connect to which databases. If a client doesn’t have a matching entry in there, it won’t be able to access the database. Here’s an entry that trusts any TCP/IP connection from the local machine; for an application server running somewhere else, you’d list that server’s address instead:

# Trust any connection via TCP/IP from this machine
host    all    all    127.0.0.1    255.255.255.255    trust

Outside of just “trust any connections from this machine,” there are tons of different ways to authenticate your client with the database server, from password to ident to certificates. And if you’ve eschewed the great comforts of RDS (or Kinsta) and are running your backend on the same server as your database, you can connect via Unix sockets instead of TCP/IP.
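For illustration, here’s what a slightly more locked-down pg_hba.conf might look like. The subnet, role name, and authentication methods are placeholders standing in for whatever your setup actually needs:

# Local connections over a Unix socket, authenticated by OS username
local     all    all                           peer

# An application server on a private subnet, using password auth
host      all    app_user    10.0.0.0/24       scram-sha-256

# Everything else must come in over SSL with a client certificate
hostssl   all    all         0.0.0.0/0         cert

Entries are matched top to bottom, and the first line that matches a connection attempt decides how (or whether) it gets authenticated.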

Authorization and Privileges

Once your client itself is authenticated, you need to deal with the question of authorization. The SQL standard defines a privilege system, and each object in Postgres (a table, schema, database, etc.) has privileges relating to it that can be assigned to users: things like SELECT and UPDATE, but also TRUNCATE, REFERENCES, TRIGGER, and so on. You bestow privileges upon users with the GRANT command. Best practice is to follow the principle of least privilege, so the database user that you create for your client(s) should only be able to access whatever it actually needs to.
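As a sketch of what least privilege might look like for a web app’s database user (the role, password, and table names here are made up for the example):

-- A dedicated role for the application, with login rights only
CREATE ROLE app_user LOGIN PASSWORD 'use-a-real-secret-here';

-- The app can read and write the tables it actually uses...
GRANT SELECT, INSERT, UPDATE ON orders, customers TO app_user;

-- ...and nothing else: no TRUNCATE, no DDL, no other schemas.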

Row Level Security

The last thing to cover here is row level security. RLS exists from the perspective of the table (not the user) and restricts which rows can be accessed, updated, etc. By default, tables do not have RLS enabled, so users can do whatever their table-level privileges allow. To enable RLS for a table, you’d start with:

ALTER TABLE [table_name] ENABLE ROW LEVEL SECURITY;

And then add a policy. Let’s say you want to restrict read access to your lightsaber_internals table to trustworthy individuals, who are already defined in the jedi user group, such that only a lightsaber’s owner (stored as a database username in the lightsaber_jedi column) can see its internal details. Here’s how you’d do it:

ALTER TABLE lightsaber_internals ENABLE ROW LEVEL SECURITY;

CREATE POLICY jedi_only ON lightsaber_internals
    FOR SELECT
    TO jedi
    USING (lightsaber_jedi = current_user);

RLS policies like this are useful when you need security at a more granular level than just tables (situations with PII, etc.).

Think in Advance About Scaling Problems

At every startup I’ve ever worked at, there has been some degree of manual scaling when it comes to the database.

Set Up Database Monitoring

Most of the companies I’ve worked at use Datadog for their database monitoring. If you’re using a managed database service, you can probably get by for a while on its built-in monitoring. Datadog has a good post on their blog covering the major metrics you should be watching, like read and write throughput, sequential scans, data written to disk, etc.
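Even without a dedicated tool, Postgres’s own statistics views expose a lot of this. As one example, a query along these lines (the threshold is arbitrary) surfaces tables that are being read mostly via sequential scans and might be missing an index:

-- Tables with lots of sequential scans relative to index scans
SELECT relname, seq_scan, idx_scan, n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > 1000
ORDER BY seq_scan DESC;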

Put Together Guidelines for Scaling Vertically

When your team gets paged – and it will happen – the last thing you want is an all-hands scramble to resolve the issue when, in most cases, a simple scale-up solves the problem. It’s good to put together a basic plan for your team covering what’s within scope when you’re running out of space or compute.

Vacuuming and Tuning Your Autovacuum

When you DELETE data in Postgres or UPDATE data (which is functionally equivalent to deleting and inserting), Postgres doesn’t actually delete that data right away (😱). Instead, it’s “marked” as deleted by storing the transaction ID of the delete in an xmax header; the reason for this is that it makes MVCC in Postgres more straightforward. But if these rows aren’t really deleted eventually, they’ll start to waste disk space and give you problems. The easiest way to get rid of these rows is using the VACUUM command. You could run a vacuum manually whenever dead rows build up or even just set it up to run every x minutes, but a better strategy is to autovacuum based on how many dead rows have accumulated. Tuning your autovacuum is a nuanced topic beyond the scope of this post: I’d highly recommend reading 2ndQuadrant’s post about it.
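Autovacuum can also be tuned per table, which is usually where you start if a handful of hot tables accumulate dead rows much faster than the rest. As a rough sketch (the table name and numbers are placeholders, not recommendations):

-- Vacuum the orders table once dead rows exceed 1% of the table
-- (the default scale factor is 20%) plus a 1,000-row floor
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000
);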

Set Up a Read Replica (or Two)

This one is easy. If you anticipate a meaningful increase in traffic (a launch coming up, etc.), you can easily create read-only replicas (or at least one); they’ll help offload some work from the main DB instance. If you opt for multiple replicas, you’ll get the added benefit of improved availability if any one of them goes down for any reason. Adding replicas is pretty straightforward in most DBaaS providers; just keep an eye on cost: they are often priced at the same level as a main DB instance despite being read-only.
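If you’re self-hosting instead, the usual starting point is streaming replication: take a base backup of the primary and start it as a standby. A rough sketch, where the hostname, replication role, and data directory are placeholders:

# On the replica: clone the primary and write standby configuration
# (-R creates standby.signal and sets primary_conninfo automatically)
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/data -R -X stream -P

Start Postgres on the replica and it will stream WAL from the primary while serving read-only queries.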

Add Indexes to Your (Anticipated) Largest Tables

Database indexes help speed up read queries by creating ancillary data structures that make your scans faster. For many use cases, adding an index to a table or two is basically a no-brainer. In Postgres, you can create an index with the CREATE INDEX command:
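For example (the table and column names are hypothetical), one statement gives you a B-tree index – the default index type – on a column you expect to filter or join on frequently:

CREATE INDEX idx_users_email ON users (email);

On a table that’s already serving traffic, CREATE INDEX CONCURRENTLY builds the index without blocking writes, at the cost of a slower build.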
