Jun 6, 2024
diff --git a/pgml-cms/blog/our-migration-from-aws-to-gcp-with-minimal-downtime.md b/pgml-cms/blog/our-migration-from-aws-to-gcp-with-minimal-downtime.md
@@ -17,15 +17,15 @@ Lev Kokotov

 June 6, 2024

-From the beginning, our plan for PostgresML was to be cloud-agnostic. Since we are an infrastructure provider, we have to deploy our code where our customers are. Like most startups, we started on AWS, because that's what we knew best. After over 10 years of AWS experience, and its general dominance in the market, it seemed right to build something we've done before, this time in Rust of course.
+From the beginning, our plan for PostgresML was to be cloud-agnostic. Since we are an infrastructure provider, we have to deploy our code where our customers are. Like most startups, we started on AWS, because that is what we knew best. After over 10 years of AWS experience, and its general dominance in the market, it seemed right to build something we have done before, this time in Rust of course.

-After talking to several customers, we've noticed a pattern: most of them were using either Azure or GCP. So we had to go back to our original plan. Our platform manages all infrastructure internally, by representing common concepts like hosts, networking rules, open ports, and domains as first class entities in our codebase. To add additional cloud vendors, we just had to write integrations with their APIs.
+After talking to several customers, we have noticed a pattern: most of them were using either Azure or GCP. So we had to go back to our original plan. Our platform manages all infrastructure internally, by representing common concepts like virtual machines, networking rules, and DNS as first class entities in our codebase. To add additional cloud vendors, we just had to write integrations with their APIs.

 ## Cloud-agnostic from the start

 PostgresML, much like Postgres itself, can run on a variety of platforms. Our operating system of choice, **Ubuntu**, is available on all clouds, and comes with good support for GPUs. We therefore had no trouble spinning up machines on Azure and GCP with identical software to match our AWS deployments.

-Since we're first and foremost a database company, data integrity and security are extremely important. To achieve that goal, and to be independent from any cloud-specific storage solutions, we are using **ZFS** as our filesystem to store Postgres data.
+Since we are first and foremost a database company, data integrity and security are extremely important. To achieve that goal, and to be independent from any cloud-specific storage solutions, we are using **ZFS** as our filesystem to store Postgres data.

 Moving ZFS filesystems between machines is a solved problem, or so we thought.

@@ -35,9 +35,9 @@ Our primary Serverless deployment was in Oregon, AWS us-west-2 region. We were

 ### Moving data is hard

-Moving data is hard. Moving terabytes of data between machines in the same cloud can be achieved with volume snapshots, and the hard part of ensuring data integrity is delegated to the cloud vendor. Of course, that's not always guaranteed, and you can still corrupt your data if you're not careful, but that's a story for another time.
+Moving data is hard. Moving terabytes of data between machines in the same cloud can be achieved with volume snapshots, and the hard part of ensuring data integrity is delegated to the cloud vendor. Of course, that is not always guaranteed, and you can still corrupt your data if you are not careful, but that is a story for another time.

-That being said, to move data between clouds, one has to rely on your own tooling. Since we use ZFS, our original plan was to just send a ZFS snapshot across the country and synchronize later with Postgres replication. To make sure the data isn't intercepted by nefarious entities while in transit, the typical recommendation is to pipe it through SSH:
+That being said, to move data between clouds, one has to rely on your own tooling. Since we use ZFS, our original plan was to just send a ZFS snapshot across the country and synchronize later with Postgres replication. To make sure the data is not intercepted by nefarious entities while in transit, the typical recommendation is to pipe it through SSH:

 ```bash
 zfs send tank/pgdata@snapshot | ssh ubuntu@machine \
 zfs recv tank/pgdata@snapshot
 ```

 #### First attempt

-Our filesystem was multiple terabytes, but both machines had 10Gbit NICs, so we expected this to take just a few hours. To our surprise, the transfer speed wouldn't go higher than 30MB/second. At that rate, the migration would take days. Since we had to setup Postgres replication afterwards, we had to keep a replication slot open to prevent WAL cleanup on the primary.
+Our filesystem was multiple terabytes, but both machines had 100Gbit NICs, so we expected this to take just a few hours. To our surprise, the transfer speed would not go higher than 30MB/second. At that rate, the migration would take days. Since we had to setup Postgres replication afterwards, we had to keep a replication slot open to prevent WAL cleanup on the primary.
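
For a rough sense of the numbers in this paragraph, the arithmetic can be sketched in shell. The 4 TB dataset size and the `transfer_hours` helper are assumptions for illustration; the post only says the filesystem was multiple terabytes. The 30 MB/second and 200 MB/second rates are the ones quoted in the post.

```bash
#!/usr/bin/env bash
# Estimate wall-clock hours needed to move a dataset at a sustained rate.
# Decimal units are used throughout: 1 TB = 1,000,000 MB.
transfer_hours() {
  local size_tb=$1 rate_mb_per_s=$2
  local seconds=$(( size_tb * 1000000 / rate_mb_per_s ))
  # Round up to the next whole hour using integer arithmetic.
  echo $(( (seconds + 3599) / 3600 ))
}

# 4 TB is a hypothetical size chosen only to make the comparison concrete.
echo "at 30 MB/s:  $(transfer_hours 4 30) hours"
echo "at 200 MB/s: $(transfer_hours 4 200) hours"
```

Under those assumptions the slow rate lands well over a day, while the fast rate fits within a working day, before accounting for any compression.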

 A dangling replication slot left unattended for days would accumulate terabytes of write-ahead log and eventually run our filesystem out of space and shut down the database. To make things harder, _zfs send_ is an all or nothing operation: if interrupted for any reason, e.g. network errors, one would have to start over from scratch.

-So realistically, a multi-day operation was out of the question. At this point, we were stuck and a realization loomed: there is a good reason why most organizations don't attempt a cloud migration.
+So realistically, a multi-day operation was out of the question. At this point, we were stuck and a realization loomed: there is a good reason why most organizations do not attempt a cloud migration.

 #### Trial and error

@@ -64,7 +64,7 @@ So we had a thought: why not upload our ZFS filesystem to S3 first, transfer it

 As of this writing, we could not find any existing tools to send a ZFS file system to S3 and download it from Cloud Storage, in real time. Most tools like [z3](https://github.com/presslabs/z3) are used for backup purposes, but we needed to transfer filesystem chunks as quickly as possible.

-So just like with anything else, we decided to write our own, in Rust. After days of digging through Tokio documentation and networking theory blog posts to understand how to move bytes as fast as possible between the filesystem and an HTTP endpoint, we had a pretty basic application that could chunk a byte stream, send it to an object storage service as separate files, download those files as they are being created in real time, re-assemble and pipe them into a ZFS snapshot.
+So just like with everything else, we decided to write our own, in Rust. After days of digging through Tokio documentation and networking theory blog posts to understand how to move bytes as fast as possible between the filesystem and an HTTP endpoint, we had a pretty basic application that could chunk a byte stream, send it to an object storage service as separate files, download those files as they are being created in real time, re-assemble and pipe them into a ZFS snapshot.
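
The chunk-and-reassemble idea described above can be sketched with standard tools. This is not the Rust application from the post: it runs sequentially rather than in real time, a local directory stands in for the object storage bucket, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Split stdin into fixed-size chunk files whose names sort in stream order.
# The target directory is a local stand-in for an object storage bucket.
chunk_stream() {
  local dir=$1 chunk_bytes=$2
  mkdir -p "$dir"
  # -a 6 gives six-letter suffixes (chunk-aaaaaa, chunk-aaaaab, ...).
  split -b "$chunk_bytes" -a 6 - "$dir/chunk-"
}

# Concatenate the chunks back into one stream, relying on name order.
reassemble_stream() {
  local dir=$1
  cat "$dir"/chunk-*
}

# Round trip 5,000,000 random bytes through 1 MiB chunks.
workdir=$(mktemp -d)
head -c 5000000 /dev/urandom > "$workdir/stream.bin"
chunk_stream "$workdir/bucket" 1048576 < "$workdir/stream.bin"
if reassemble_stream "$workdir/bucket" | cmp -s - "$workdir/stream.bin"; then
  echo "round trip OK"
fi
```

In a real transfer the input would be the output of `zfs send` and the reassembled stream would be piped into `zfs recv`, with upload and download running concurrently.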

 This was an exciting moment. We created something new and were going to open source it once we made sure it worked well, increasing our contribution to the community. The moment arrived and we started our data transfer. After a few minutes, our measured transfer speed was: roughly 30MB/second.

@@ -74,15 +74,15 @@ Was there a conspiracy afoot? We thought so. We even tried using S3 Transfer Acc

 Something was clearly wrong. Our migration plans were at risk and since we wanted to move our Serverless cloud to GCP, we were pretty concerned. Were we trapped on AWS forever?

-Something stood out though after trying so many different approaches. Why 30MB/second? That seems like a made up number, and on two separate clouds too? Clearly, it wasn't an issue with the network or our tooling, but with how we used it.
+Something stood out though after trying so many different approaches. Why 30MB/second? That seems like a made up number, and on two separate clouds too? Clearly, it was not an issue with the network or our tooling, but with how we used it.

 #### Buffer and compress

-After researching a bit about how other people migrated filesystems (it is quite common in the ZFS community, since it makes it quite convenient, our issues notwithstanding), the issue emerged: _zfs send_ and _zfs recv_ do not buffer data. For each chunk of data they send and receive, they issue separate `write(2)` and `read(2)` calls to the kernel, and process whatever data they get.
+After researching a bit about how other people migrated filesystems (it is quite common in the ZFS community, since it makes it convenient, our problems notwithstanding), the issue emerged: _zfs send_ and _zfs recv_ do not buffer data. For each chunk of data they send and receive, they issue separate `write(2)` and `read(2)` calls to the kernel, and process whatever data they get.

 In case of a network transfer, these kernel calls propagate all the way to the network stack, and like any experienced network engineer would tell you, makes things very slow.
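
The effect of re-blocking a stream can be sketched with `dd`, which reads in one block size and writes in another. This is a toy stand-in, not `mbuffer`, which additionally holds a large in-memory buffer between its reader and writer. The `rebuffer` name and the 4 KiB and 1 MiB sizes are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Read whatever arrives on stdin in small pieces and emit it to stdout
# in large fixed-size writes. Only the final write may be shorter.
rebuffer() {
  local out_block_bytes=${1:-1048576}
  # With ibs and obs set separately, dd collects input into full
  # obs-sized blocks before writing them out.
  dd ibs=4096 obs="$out_block_bytes" 2>/dev/null
}

# A producer of many short lines, re-blocked into 1 MiB writes.
# The byte stream itself is unchanged; only the write sizes differ.
seq 1 200000 | rebuffer 1048576 | wc -c
```

The downstream process (SSH or `zfs recv` in the post) then sees fewer, larger reads, which is the behaviour the paragraph above attributes to buffering.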

-In comes `mbuffer(1)`. If you're not familiar with it, mbuffer is a tool that _buffers_ whatever data it receives and sends it in larger chunks to its destination, in our case SSH on the sender side and ZFS on the receiver side. Combined with a multi-threaded stream compressor, `pbzip2(1)`, which cut our data size in half, we were finally in business, transferring our data at over 200 MB/second which cut our migration time from days to just a few hours, all with just one command:
+In comes `mbuffer(1)`. If you are not familiar with it, mbuffer is a tool that _buffers_ whatever data it receives and sends it in larger chunks to its destination, in our case SSH on the sender side and ZFS on the receiver side. Combined with a multi-threaded stream compressor, `pbzip2(1)`, which cut our data size in half, we were finally in business, transferring our data at over 200 MB/second which cut our migration time from days to just a few hours, all with just one command:

 ```bash
 zfs send tank/pgdata@snapshot | pbzip2 | mbuffer -s 12M -m 2G | ssh ubuntu@gcp \
 mbuffer -s 12M -m 2G | pbzip2 -d | zfs recv tank/pgdata@snapshot
 ```

 ### Double check everything

-Once the ZFS snapshot finally made it from the East coast to the Midwest, we configured Postgres streaming replication, which went as you'd expect, and we had a live hot standby in GCP, ready to go. Before cutting the AWS cord, we wanted to double check that everything was okay. We were moving customer data after all, and losing data is bad for business — especially for a database company.
+Once the ZFS snapshot finally made it from the West Coast to the Midwest, we configured Postgres streaming replication, which went as you would expect, and we had a live hot standby in GCP, ready to go. Before cutting the AWS cord, we wanted to double check that everything was okay. We were moving customer data after all, and losing data is bad for business — especially for a database company.

 #### The case of the missing bytes

 ZFS is a reliable and battle tested filesystem, so we were not worried, but there is nothing wrong with a second opinion. The naive way to check that all your data is still there is to compare the size of the filesystems. Not a terrible place to start, so we ran `df -h` and immediately our jaws dropped: only half the data made it over to GCP.

-After days of roadblocks, this was not a good sign, and there was no reasonable explanation for what happened. ZFS checksums every single block, mbuffer is a simple tool, pbzip definitely decompressed the stream and SSH hasn't lost a byte since the 1990s.
+After days of roadblocks, this was not a good sign, and there was no reasonable explanation for what happened. ZFS checksums every single block, mbuffer is a simple tool, pbzip definitely decompressed the stream and SSH has not lost a byte since the 1990s.

-In addition, just to make things even weirder, Postgres replication didn't complain and the data was, seemingly, all there. We checked by running your typical `SELECT COUNT(*) FROM a_few_tables` and everything added up: as the data was changing in AWS, it was updating in GCP.
+In addition, just to make things even weirder, Postgres replication did not complain and the data was, seemingly, all there. We checked by running your typical `SELECT COUNT(*) FROM a_few_tables` and everything added up: as the data was changing in AWS, it was updating in GCP.
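
Sizes and row counts are coarse checks. For data that is not changing, such as two copies of a static directory tree, a content-level comparison is a stronger test. The sketch below is an illustration under stated assumptions, not something the post describes: `tree_manifest` and `same_contents` are hypothetical helpers, and `cksum` (a CRC) catches accidental differences, not deliberate tampering. The data directories of a live primary and its replica may not match byte for byte, so this applies to static copies only.

```bash
#!/usr/bin/env bash
# Print "crc size ./relative/path" for every regular file under a directory,
# sorted by path so that two manifests can be compared directly.
tree_manifest() {
  local root=$1
  ( cd "$root" && find . -type f -exec cksum {} + | LC_ALL=C sort -k3 )
}

# Succeed only if both trees hold the same paths with the same contents.
same_contents() {
  [ "$(tree_manifest "$1")" = "$(tree_manifest "$2")" ]
}

# Build a small tree, copy it, and compare the two copies.
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'hello\n' > "$src/a.txt"
mkdir "$src/sub"
printf 'world\n' > "$src/sub/b.txt"
cp -R "$src/." "$dst/"
if same_contents "$src" "$dst"; then
  echo "trees match"
fi
```

Unlike `df -h`, this compares what the files contain rather than how much space the filesystem reports for them.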

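 #### (File)systems are virtual

@@ -130,5 +130,5 @@ Migrating between clouds is hard, but not impossible. The key is to understand h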
 3. Advanced filesystems are complex
 3. You can solve hard problems, just take it one step at time

-At PostgresML, we're excited to solve hard problems. If you are too, feel free to explore [career opportunities](/careers) with us, or check out our [open-source docs](/docs) and contribute to our project.
+At PostgresML, we are excited to solve hard problems. If you are too, feel free to explore [career opportunities](/careers) with us, or check out our [open-source docs](/docs) and contribute to our project.