How to set up PostgreSQL replication to BigQuery on Google Cloud Platform?
You are on Google Cloud Platform using Cloud SQL for PostgreSQL, and your analytic queries are starting to struggle. You attempt to optimize the queries, but soon you realize you are spending more time on optimization than you’d like, while the analytics team keeps asking for more. The load caused by the analytics starts to cause hiccups on your main database, impacting the performance of your application. You spin up a read replica to offload the analytic traffic. That works for a while, but what then?
Fortunately, there are alternative storage solutions for your data, such as BigQuery - a serverless data warehouse. With BigQuery, you can transition seamlessly and benefit from its similar SQL syntax, eliminating the need for a complete rewrite of your queries. That sounds great, but how do you get the data from PostgreSQL to BigQuery? To do so, we need to familiarize ourselves with PostgreSQL Logical Replication.
Logical who?
Easy, tiger, you're not in a rodeo! Before we even touch the concept of logical replication, we need to go through a few other terms first.
- WAL - Write-ahead log. Every change is recorded in this log before it is applied to the database itself. This ensures durability and allows crash recovery.
- Logical Decoding - Process of extracting changes to a database's tables into an easy-to-understand format, abstracted from database internals.
- Replication slot - A named stream of changes for a consumer. A slot retains the WAL data and prevents it from being removed until it has been consumed. There are two types of slots - logical and physical. We are going to use a logical replication slot in this guide, but if you are using a replica, you already have a replication slot and may not even know about it.
- Source and destination nodes - Postgres instances where the producer of changes, commonly called the source node, streams its changes to the consumer, commonly referred to as the destination node.
- Publication - A Postgres resource responsible for publishing changes. It can be configured to publish some or all tables and is read through a replication slot.
- Subscription - A Postgres resource responsible for receiving the published changes on the destination node.
All tables that should be logically replicated need a replica identity—a way to tell which row is which. By default, this identity is a primary key. See the following image for a better understanding of how these all work together.
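The replica identity can be checked and, for tables without a primary key, set explicitly. A minimal sketch, assuming a hypothetical table named orders:
-- Check the current replica identity:
-- 'd' = default (primary key), 'n' = nothing, 'f' = full, 'i' = index
SELECT relname, relreplident FROM pg_class WHERE relname = 'orders';
-- If the table has no primary key, fall back to using the full row as its identity
-- (this writes more WAL, but keeps every row identifiable):
ALTER TABLE orders REPLICA IDENTITY FULL;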
Setting up the Postgres source node
If we were to set up a Postgres-to-Postgres replication, we’d create all the replication slots, publications, and subscriptions using built-in functions. A common best practice is to use the pglogical plugin, which abstracts things away and smooths some transitions when dealing with different Postgres versions. However, our goal is a Postgres-to-BigQuery replication, which leaves Postgres only as the source node. Let’s start by enabling logical replication.
Usually, you would enable logical replication by setting wal_level=logical. On CloudSQL you cannot manipulate this setting directly, so you need to set a specific CloudSQL flag instead: cloudsql.logical_decoding=on. To manipulate and use any of the replication resources, your user needs the REPLICATION attribute. Go ahead and grant it now:
ALTER USER myuser WITH REPLICATION;
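To double-check that the attribute took effect, you can query pg_roles (myuser being the user from above):
-- rolreplication should now be true for the user
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'myuser';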
The next item on the list is the logical replication slot. I will name it “bigquery” because it will represent the data source for the BigQuery destination. The second argument is the plugin name. This plugin is responsible for transforming the internal representation of WAL data into a more readable format - look at e.g. the wal2json plugin to better understand what such a plugin might do. I use the pgoutput plugin for this scenario, the standard PostgreSQL output plugin, which is required for the next step.
SELECT PG_CREATE_LOGICAL_REPLICATION_SLOT('bigquery', 'pgoutput');
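Datastream will also ask for a publication (it is listed among the source details in the next section). As a sketch, assuming hypothetical tables public.orders and public.customers are the ones you want to replicate:
-- Create a publication covering only the tables you plan to replicate;
-- the name 'bigquery_publication' is just an example
CREATE PUBLICATION bigquery_publication FOR TABLE public.orders, public.customers;
-- Verify that the slot and the publication exist
SELECT slot_name, plugin, slot_type FROM pg_replication_slots;
SELECT pubname FROM pg_publication;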
Stream data to BigQuery with Datastream
With the source node set up, it is time to create the consumer. Google Cloud Platform (GCP) provides a product for exactly that - Datastream, a serverless change data capture (CDC) and replication service that lets you synchronize data from one source to another.
Creating the stream from Postgres to BigQuery with Datastream is pretty straightforward. Postgres is already set up, and the stream itself has very few options: the source, an id, a name, a region, and the destination.
Allow Datastream to read data from the source by providing
- the database name, the replication slot name, and a publication,
- a Postgres user with replication rights (a sketch of the grants follows this list),
- and network access via either an IP allowlist or a VPC.
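Here is a minimal sketch of those grants, assuming a dedicated user named datastream_user (you can equally reuse myuser from earlier) and the same hypothetical tables as before:
-- A dedicated user for Datastream (name and password are placeholders)
CREATE USER datastream_user WITH PASSWORD 'change-me';
ALTER USER datastream_user WITH REPLICATION;
-- Allow it to read only the tables included in the publication
GRANT USAGE ON SCHEMA public TO datastream_user;
GRANT SELECT ON public.orders, public.customers TO datastream_user;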
The BigQuery destination offers some flexibility: you can opt in to a separate BigQuery dataset for each Postgres schema and set a data freshness that suits your requirements. Data freshness, ranging from 0 seconds to 1 day, tells Datastream how often it needs to query for data changes. More frequent querying means more resources are used, which negatively affects the cost of the solution. GCP suggests not using a smaller value for this setting than your application actually needs.
When creating the Datastream stream in the Console, you can choose to initiate a backfill. Besides the CDC mode, a stream also offers a backfill operation, which reads the entire source and runs it through the stream. This is useful for the initial data synchronization. After successfully creating the stream, we should see our BigQuery dataset filled with Postgres records.
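As a quick sanity check, you can count the replicated rows directly in BigQuery. A sketch, assuming the dataset is named after the Postgres schema (public) and contains the hypothetical orders table:
-- Run in the BigQuery console; dataset and table names depend on your stream configuration
SELECT COUNT(*) AS row_count FROM `public.orders`;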
In the logs, we can see when the data is being written.
Closing notes
Notes on security:
- The Datastream connection to CloudSQL can be set up via IP allowlisting or private VPC connectivity.
- Be sure to grant the Datastream Postgres user SELECT permissions only on the tables you want it to access, and keep those grants in sync with the tables in the publication (see the queries after this list).
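One way to review what the publication exposes and what the user can read, assuming the names used earlier in this guide:
-- Tables currently covered by the publication
SELECT schemaname, tablename FROM pg_publication_tables WHERE pubname = 'bigquery_publication';
-- Tables the Datastream user can actually SELECT from
SELECT table_schema, table_name
FROM information_schema.table_privileges
WHERE grantee = 'datastream_user' AND privilege_type = 'SELECT';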
Notes on performance:
- The Postgres source has logical decoding enabled and provides the logical replication slot used by Datastream. Replication slots are also used by read replicas, and while logical decoding and slots add a bit of overhead, this remains the preferred way of doing CDC.
- The backfill will be noticeable on the source database, so plan ahead if additional load on the source is a concern for you.
- The same goes for a short downtime, because enabling logical decoding on CloudSQL requires a restart.
- Be sure to consume your replication slots to avoid storage issues. Remember that Postgres keeps all the changes a slot has not yet delivered so the consumer won’t miss any data (see the monitoring query after this list).
- Only include tables in the publication that you will actually consume with Datastream. This will save you data transfer overhead and costs. There might be something surprising about the replication with Datastream, so I recommend going through the FAQ first.
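For the slot monitoring mentioned above, a simple query shows roughly how much WAL each slot is holding back; a steadily growing number means the consumer is not keeping up:
-- Approximate amount of WAL retained by each replication slot
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;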
Conclusion
Setting up PostgreSQL replication to BigQuery on Google Cloud Platform allows you to keep your central database running smoothly while giving your analytics team the powerful tools they need. With logical replication and Datastream, transferring data changes becomes a breeze, ensuring your analytics are always up-to-date. This means less stress on your primary database, cost-effective data management, and more time for you to focus on gaining valuable insights. Embrace this seamless integration and let your data work harder for you!