Google Cloud SQL – Availability for PostgreSQL – Part II (Read Replicas)

In this post we will look at read replicas as an additional method to achieve multi-zone availability for Cloud SQL, which in turn gives us the ability to offload (potentially expensive) IO operations – such as user-created backups or read-heavy queries – without adding load to the master instance.

In the previous post in this series we looked at Regional availability for PostgreSQL HA using Cloud SQL:

Recall that this option was simple to implement and worked relatively seamlessly and transparently with respect to zonal failover.

Now let’s look at read replicas in Cloud SQL as an additional measure for availability.

Deploying Read Replica(s)

Deploying read replicas is slightly more involved than simple regional (high) availability, as you need to define each replica as a separate Cloud SQL instance that acts as a slave to the primary (master) instance.

An example using Terraform is provided here, starting by creating the master instance:
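
The full configuration is in the tutorial repository linked at the end of this post; the block below is only a minimal sketch of what the master instance definition could look like (the resource name postgres_master and the variables used are illustrative, following the same conventions as the Terraform example in Part I):

resource "google_sql_database_instance" "postgres_master" {
  # Illustrative values – adjust names, project and variables to your environment
  provider         = google-beta
  project          = var.project
  region           = var.region
  name             = "postgresql-master-${random_id.instance_suffix.hex}"
  database_version = "POSTGRES_9_6"

  settings {
    tier              = var.tier
    disk_size         = var.disk_size
    disk_type         = "PD_SSD"
    disk_autoresize   = true
    activation_policy = "ALWAYS"

    # Automated backups and maintenance are configured on the master only
    backup_configuration {
      enabled    = true
      start_time = "00:00"
    }

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.private_network.self_link
    }

    maintenance_window {
      day          = 7
      hour         = 0
      update_track = "stable"
    }
  }
}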

Next you would specify one or more read replicas (typically in a zone other than the zone the master is in):
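
A corresponding read replica definition might look like the following sketch (again, the resource name, the var.replica_zone variable and the zone selection are illustrative assumptions):

resource "google_sql_database_instance" "postgres_replica" {
  # Illustrative values – the key argument is master_instance_name
  provider             = google-beta
  project              = var.project
  region               = var.region
  name                 = "postgresql-replica-${random_id.instance_suffix.hex}"
  database_version     = "POSTGRES_9_6"
  master_instance_name = google_sql_database_instance.postgres_master.name

  settings {
    tier              = var.tier
    disk_size         = var.disk_size
    disk_type         = "PD_SSD"
    disk_autoresize   = true
    activation_policy = "ALWAYS"

    # Place the replica in a different zone to the master
    location_preference {
      zone = var.replica_zone
    }

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.private_network.self_link
    }
  }
}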

Note that several of the options supplied are omitted when creating a read replica database instance, such as the backup and maintenance options – as these operations cannot be performed on a read replica as we will see later.

Cloud SQL Instances – showing master and replica
Cloud SQL Master Instance

Voila! You have just set up a master instance (the primary instance your application and/or users will connect to) along with a read replica in a different zone which will be asynchronously updated as changes occur on the master instance.

Read Replicas in Action

Now that we have created a read replica, let's see it in action. After connecting to the read replica (as you would any other instance), attempt to access a table that has not yet been created on the master, as shown here:

SELECT operation from the replica instance

Now create the table and insert some data on the master instance:

Create a table and insert a record on the master instance

Now try the select operation on the replica instance:

SELECT operation from the replica instance (after changes have been made on the master)

It works!

Some Points to Note about Cloud SQL Read Replicas

  • Users connect to a read replica as a normal database connection (as shown above)
  • Google managed backups (using the console or gcloud sql backups create .. ) can NOT be performed against replica instances
  • Read replicas can be used to offload IO-intensive operations from the master instance – such as user-managed backup operations (e.g. pg_dump)
pg_dump operation against a replica instance
  • BE CAREFUL: despite their name, read replicas are NOT read only – updates can be made against them which will NOT propagate back to the master instance. You could get yourself into an awful mess if you allow users to perform INSERT, UPDATE, DELETE, CREATE or DROP operations against replica instances.

Promoting a Read Replica

If required, a read replica can be promoted to a standalone Cloud SQL instance, which is another DR option. Keep in mind, however, that because the read replica is updated asynchronously, promoting it may result in a loss of data (hopefully not much, but a loss nonetheless). Your application's RPO will dictate whether or not this is acceptable.

Promotion of a read replica is reasonably straightforward as demonstrated here using the console:

Promoting a read replica using the console

You can also use the following gcloud command:

gcloud sql instances promote-replica <replica_instance_name>

Once you click on the Promote Replica button you will see the following warning:

Promoting a read replica using the console

This simply states that once you promote the replica, it becomes an independent instance with no further relationship to the master instance. Once accepted and the promotion process is complete, you can see that you now have two independent Cloud SQL instances (as advertised!):

Promoted Cloud SQL instance

Some of the options you would normally configure with a master instance would need to be configured on the promoted replica instance – such as high availability, maintenance and scheduled backups – but in the event of a zonal failure you would be back up and running with virtually no data loss!

Full source code for this article is available at: https://github.com/gamma-data/cloud-sql-postgres-availability-tutorial

Google Cloud SQL – Availability, Replication, Failover for PostgreSQL – Part I

In this multi part blog we will explore the features available in Google Cloud SQL for High Availability, Backup and Recovery, Replication and Failover, and Security (at rest and in transit) for the PostgreSQL DBMS engine. Some of these features are relatively hot off the press and still in Beta – but they are nonetheless available for general use.

We will start by looking at the High Availability (HA) options available to you when using the PostgreSQL engine in Google Cloud SQL.

Most of you would be familiar with the concepts of High Availability, Redundancy, Fault Tolerance, etc but let’s start with a definition of HA anyway:

High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

Wikipedia

“Higher than normal” is quite subjective; typically this is quantified as a percentage expressed in a number of “9s” – for example 99.99% (quoted as “four nines”) would allow you 52.60 minutes of downtime over a one-year period.
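
As a quick sanity check on that figure: a year has roughly 365.25 × 24 × 60 ≈ 525,960 minutes, and 525,960 × (1 − 0.9999) ≈ 52.6 minutes of allowable downtime.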

Essentially the number of 9’s required will drive your bias towards the options available to you for Cloud SQL HA.

We will start with Cloud SQL HA in its simplest form, Regional Availability.

Regional Availability

Knowing what we know about the Google Cloud Platform, regional availability means that our application or service (in this case Cloud SQL) should be resilient to a failure of any one zone in our region. In fact, as all GCP regions have at least 3 zones – two zones could fail, and our application would still be available.

Regional availability for Cloud SQL (which Google refers to as High Availability) creates a standby instance in addition to the primary instance, and uses a regional Persistent Disk resource to store the database instance data, transaction logs and other state files. This disk is synchronously replicated between the two zones in which the primary and standby instances are located.

A shared IP address (like a Virtual IP) is used to serve traffic to the healthy (normally primary) Cloud SQL instance.

An overview of Cloud SQL HA is shown here:

Cloud SQL High Availability

Implementing High Availability for Cloud SQL

Implementing Regional Availability for Cloud SQL is dead simple – it is one argument:

availability_type = "REGIONAL"

Using the gcloud command line utility, this would be:

gcloud sql instances create postgresql-instance-1234 \
  --availability-type=REGIONAL \
  --database-version=POSTGRES_9_6

Using Terraform (with a complete set of options) it would look like:

resource "google_sql_database_instance" "postgres_ha" {
  provider = google-beta
  region = var.region
  project = var.project
  name = "postgresql-instance-${random_id.instance_suffix.hex}"
  database_version = "POSTGRES_9_6"
  settings {
   tier = var.tier
   disk_size = var.disk_size
   activation_policy = "ALWAYS"
   disk_autoresize = true
   disk_type = "PD_SSD"
   availability_type = "REGIONAL"
   backup_configuration {
     enabled = true
     start_time = "00:00"
   }
   ip_configuration  {
     ipv4_enabled = false
     private_network = google_compute_network.private_network.self_link
   }
   maintenance_window  {
     day = 7
     hour = 0
     update_track = "stable"
   }
  }
 } 

Once deployed you will notice a few different items in the console. First, from the instance overview page you can see that the High Availability option is ENABLED for your instance.

Second, you will see a Failover button enabled on the detailed management view for this instance.

Failover

Failovers and failbacks can be initiated manually or automatically (should the primary become unresponsive). A manual failover can be invoked by executing the command:

gcloud sql instances failover postgresql-instance-1234

There is an --async option which will return immediately, invoking the failover operation asynchronously.

Failover can also be invoked from the Cloud Console using the Failover button shown previously. As an example I have created a connection to a regionally available Cloud SQL instance and started a command which runs a loop and prints out a counter:

Now using the gcloud command shown earlier, I have invoked a manual failover of the Cloud SQL instance.

Once the failover is initiated, the client connection is dropped (as the server is momentarily unavailable):

The connection can be re-established immediately afterwards; the state of the running query is lost, but importantly no data is lost. If your application clients have retry logic in their code and they weren't executing a long-running query, chances are no one would notice! Once reconnected, normal database activities can be resumed:

A quick check of the instance logs will show that the failover event has occurred:

Now when you return to the instance page in the console you will see a Failback button, which indicates that your instance is being served by the standby:

Note that there may be a slight delay in the availability of this option while the replica is still being synced.

It is worth noting that nothing comes for free! When you run in REGIONAL (High Availability) mode you are effectively paying double the cost compared to running in ZONAL mode. However, availability and cost have always been trade-offs against one another – you get what you pay for…

More information can be found at: https://cloud.google.com/sql/docs/postgres/high-availability

Next up we will look at read replicas (and their ability to be promoted) as another high availability alternative in Cloud SQL.

The Ultimate AWS to GCP Thesaurus

There are many posts available which map analogous services between the different cloud providers, but this post attempts to go a step further and map additional concepts, terms, and configuration options to be the definitive thesaurus for cloud practitioners familiar with AWS looking to fast track their familiarisation with GCP.

It should be noted that AWS and GCP are fundamentally different platforms, and nowhere is this more apparent than in the way networking is implemented between the two providers – see GCP Networking for AWS Professionals later in this post.

This post is focused on the core infrastructure, networking and security services offered by the two major cloud providers; I will do a future post on higher level services such as the ML/AI offerings from the respective providers.

Furthermore, this will be a living post which I will continue to update. I encourage comments from readers on additional mappings, which I will incorporate into the post as well.

I have broken this down into sections based upon the layout of the AWS Console.

Compute

EC2 (Elastic Compute Cloud) → GCE (Google Compute Engine)
Availability Zone → Zone
Instance → VM Instance
Instance Family → Machine Family
Instance Type → Machine Type
Amazon Machine Image (AMI) → Image
IAM Role (for an EC2 Instance) → Service Account
Security Groups → VPC Firewall Rules (ALLOW)
Tag → Label
Termination Protection → Deletion Protection
Reserved Instances → Committed Use Discounts
Capacity Reservation → Reservation
User Data → Startup Script
Spot Instances → Preemptible VMs
Dedicated Instances → Sole Tenancy
EBS Volume → Persistent Disk
Auto Scaling Group → Managed Instance Group
Launch Configuration → Instance Template
ELB Listener → URL Map (Load Balancer)
ELB Target Group → Backend / Instance Group
Instance Storage (ephemeral) → Local SSDs
EBS Snapshots → Snapshots
Keypair → SSH Keys
Elastic IP → External IP
Lambda → Google Cloud Functions
Elastic Beanstalk → Google App Engine
Elastic Container Registry (ECR) → Google Container Registry (GCR)
Elastic Container Service (ECS) → Google Kubernetes Engine (GKE)
Elastic Kubernetes Service (EKS) → Google Kubernetes Engine (GKE)
AWS Fargate → Cloud Run
AWS Service Quotas → Allocation Quotas
Account (within an Organisation)† → Project
Region → Region
AWS CloudFormation → Cloud Deployment Manager

Storage

Simple Storage Service (S3) → Google Cloud Storage (GCS)
Standard Storage Class → Standard Storage Class
Infrequent Access Storage Class → Nearline Storage Class
Amazon Glacier → Coldline Storage Class
Lifecycle Policy → Retention Policy
Tags → Labels
Snowball → Transfer Appliance
Requester Pays → Requester Pays
Region → Location Type/Location
Object Lock → Hold
Vault Lock (Glacier) → Bucket Lock
Multi Part Upload → Parallel Composite Transfer
Cross-Origin Resource Sharing (CORS) → Cross-Origin Resource Sharing (CORS)
Static Website Hosting → Bucket Website Configuration
S3 Access Points → VPC Service Controls
Object Notifications → Pub/Sub Notifications for Cloud Storage
Presigned URL → Signed URL
Transfer Acceleration → Storage Transfer Service
Elastic File System (EFS) → Cloud Filestore
AWS DataSync → Transfer Service for on-premises data
ETag → ETag
Bucket → Bucket
aws s3 → gsutil

Database

Relational Database Service (RDS) → Cloud SQL
DynamoDB → Cloud Datastore
ElastiCache → Cloud Memorystore
Table (DynamoDB) → Kind (Cloud Datastore)
Item (DynamoDB) → Entity (Cloud Datastore)
Partition Key (DynamoDB) → Key (Cloud Datastore)
Attributes (DynamoDB) → Properties (Cloud Datastore)
Local Secondary Index (DynamoDB) → Composite Index (Cloud Datastore)
Elastic Map Reduce (EMR) → Cloud Dataproc
Athena → BigQuery
AWS Glue → Cloud Dataflow
Glue Catalog → Data Catalog
Amazon Simple Notification Service (SNS) → Cloud Pub/Sub (push subscription)
Amazon Kinesis → Cloud Pub/Sub
Amazon Simple Queue Service (SQS) → Cloud Pub/Sub (poll and pull mode)

Networking & Content Delivery

Virtual Private Cloud (VPC) (Regional) → VPC Network (Global or Regional)
Subnet (Zonal) → Subnet (Regional)
Route Tables → Routes
Network ACLs (NACLs) → VPC Firewall Rules (ALLOW or DENY)
CloudFront → Cloud CDN
Route 53 → Cloud DNS/Google Domains
Direct Connect → Dedicated (or Partner) Interconnect
Virtual Private Network (VPN) → Cloud VPN
AWS PrivateLink → Private Google Access
NAT Gateway → Cloud NAT
Elastic Load Balancer → Load Balancer
AWS WAF → Cloud Armor
VPC Peering Connection → VPC Network Peering
Amazon API Gateway → Apigee API Gateway
Amazon API Gateway → Cloud Endpoints

Security, Identity, & Compliance

Root Account → Super Admin
IAM User → Member
IAM Policy → Role (Collection of Permissions)
IAM Policy Attachment → IAM Role Binding (or IAM Binding)
Key Management Service (KMS) → Cloud KMS
CloudHSM → Cloud HSM
Amazon Inspector (agent based) → Cloud Security Scanner (scan based)
AWS Security Hub → Cloud Security Command Center (SCC)
Secrets Manager → Secret Manager
Amazon Macie → Cloud Data Loss Prevention (DLP)
AWS WAF → Cloud Armor
AWS Shield → Cloud Armor

† No direct equivalent, this is the closest equivalent

Google Cloud Storage Object Notifications using Slack

This article describes the steps to integrate Slack with Google Cloud Functions to get notified about object events within a specified Google Cloud Storage bucket.

Google Cloud Storage Object Notifications using Slack

Events could include the creation of new objects, as well as delete, archive or metadata operations performed on a given bucket.

This pattern could be easily extended to other event sources supported by Cloud Functions including:

  • Cloud Pub/Sub messages
  • Cloud Firestore and Firebase events
  • Stackdriver log entries

More information can be found at https://cloud.google.com/functions/docs/concepts/events-triggers.

The prerequisite steps to configure Slack are provided here:

1. First you will need to create a Slack app (assuming you have already set up an account and a workspace). The following screenshots demonstrate this process:

Create a Slack app
Give the app a name and associate it with an existing Slack workspace

2. Next you need to Enable and Activate Incoming Webhooks to your app and add this to your workspace. The following screenshots demonstrate this process:

Enable Incoming Web Hooks for the app
Activate incoming webhooks
Add the webhook to your workspace

3. Next you need to specify a channel for notifications generated from object events.

Select a channel for the webhook

4. Now you need to copy the webhook URL provided; you will use this later in your Cloud Function.

Copy the webhook URL to the clipboard

Treat your webhook URL as a secret; do not upload it to a public source code repository.

Next you need to create your Cloud Function. This example uses Python, but you can use an alternative runtime such as Node.js or Go.

This example templates the source code using the Terraform template_file data source. The function source code is shown here:

Within your Terraform code you need to render your Cloud Function code, substituting the slack_webhook_url placeholder for its value, which you will supply as a Terraform variable. The rendered template file is then placed in a local directory along with a requirements.txt file and zipped up. The resulting Zip archive is uploaded to a specified bucket, from where it will be sourced to create the Cloud Function.
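
As a sketch of those templating and packaging steps (file paths, resource names and the var.slack_webhook_url / var.function_source_bucket variables here are assumptions rather than the repository's exact code):

data "template_file" "main_py" {
  # Render the function source, injecting the webhook URL supplied as a variable
  template = file("${path.module}/templates/main.py.tpl")
  vars = {
    slack_webhook_url = var.slack_webhook_url
  }
}

resource "local_file" "main_py" {
  # Write the rendered source next to requirements.txt in a local staging directory
  content  = data.template_file.main_py.rendered
  filename = "${path.module}/function_source/main.py"
}

data "archive_file" "function_source" {
  # Zip the staging directory so it can be uploaded as the function source
  type        = "zip"
  source_dir  = "${path.module}/function_source"
  output_path = "${path.module}/function_source.zip"
  depends_on  = [local_file.main_py]
}

resource "google_storage_bucket_object" "function_source" {
  name   = "gcs-notify-${data.archive_file.function_source.output_md5}.zip"
  bucket = var.function_source_bucket
  source = data.archive_file.function_source.output_path
}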

Now you need to create the Cloud Function; the following HCL snippet demonstrates this:
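
A sketch of the function resource is shown below (the entry point, runtime version and the var.watched_bucket variable are assumptions):

resource "google_cloudfunctions_function" "gcs_notify" {
  name                  = "gcs-object-notifications"
  project               = var.project
  region                = var.region
  runtime               = "python37"
  entry_point           = "process_event"   # the handler function defined in main.py (illustrative name)
  source_archive_bucket = google_storage_bucket_object.function_source.bucket
  source_archive_object = google_storage_bucket_object.function_source.name

  # Fire the function whenever a new object is created in the watched bucket
  event_trigger {
    event_type = "google.storage.object.finalize"
    resource   = var.watched_bucket
  }
}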

The event_trigger block in particular specifies which GCS bucket to watch and what events will trigger invocation of the function. Bucket events include:

  • google.storage.object.finalize (the creation of a new object)
  • google.storage.object.delete
  • google.storage.object.archive
  • google.storage.object.metadataUpdate

You could add additional logic to the Cloud Function code to look for specific object names or naming patterns, but keep in mind the function will fire upon every event matching the event_type and resource criteria.

To deploy the function, you would simply run:

terraform apply -var="slack_webhook_url=https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX"

Now once you upload a file named test-object.txt, voilà!:

Slack notification for a new object created

Full source code is available at: https://github.com/gamma-data/gcs-object-notifications-using-slack

Scalable, Secure Application Load Balancing with VPC Native GKE and Istio

At the time of this writing, GCP does not have a generally available non-public facing Layer 7 load balancer. While this is sure to change in the future, this article outlines a design pattern which has been proven to provide scalable and extensible application load balancing services for multiple applications running in Kubernetes pods on GKE.

When you create a service of type LoadBalancer in GKE, Kubernetes hooks into the provider (GCP in this case) on your behalf to create a Google load balancer. While this load balancer may be specified as INTERNAL, there are two issues:

Issue #1:

The GCP load balancer created for you is a Layer 4 TCP load balancer.

Issue #2:

The normal behaviour is for Google to enumerate all of the node pools in your GKE cluster and “automagically” create corresponding GCE instance groups for each node pool, for each zone the instances are deployed in. This means the entire surface area of your cluster is exposed to the external network – which may not be optimal for internal applications on a multi-tenanted cluster.

The Solution:

Using Istio deployed on GKE, with the Istio Ingress Gateway fronted by an externally created load balancer, it is possible to get scalable HTTP load balancing along with all the normal ALB goodness (stickiness, path-based routing, host-based routing, health checks, TLS offload, etc.).

An abstract depiction of this architecture is shown here:

Istio Ingress Design Pattern for VPC Native GKE Clusters

This can be deployed with a combination of Terraform and kubectl. The steps to deploy at a high level are:

  1. Create a GKE cluster with at least two node pools: ingress-nodepool and service-nodepool. Ideally create these node pools as multi-zonal for availability. You could create additional node pools as well, for example for your Egress Gateway or an operations-nodepool to host Istio components.
  2. Deploy Istio.
  3. Deploy the Istio Ingress Gateway service on the ingress-nodepool using Service type NodePort.
  4. Create an associated Certificate Gateway using server certificates and private keys for TLS offload.
  5. Create a service in the service-nodepool.
  6. Reserve an unallocated static IP address from the node network range.
  7. Create an internal TCP load balancer (a Terraform sketch covering steps 6 and 7 follows this list):
    1. Specify the frontend as the IP address reserved in step 6.
    2. Specify the backend as the managed instance groups created during the node pool creation for the ingress-nodepool (ingress-nodepool-ig-a, ingress-nodepool-ig-b, ingress-nodepool-ig-c).
    3. Specify ports 80 and 443.
  8. Create a GCP Firewall Rule to allow traffic from authorized sources (network tags or CIDR ranges) to a target of the ingress-nodepool network tag.
  9. Create a Cloud DNS A Record for your managed zone as *.namespace.zone pointing to the IP Address assigned to the load balancer frontend in step 7.1.
  10. Enable Health Checks through the GCP firewall to reach the ingress-nodepool network tag at a minimum – however there is no harm in allowing these to all node pools.
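
A hedged Terraform sketch of steps 6 and 7 is shown below; the resource names are arbitrary and the ingress instance group self links are passed in as variables (in practice you would look up the instance groups GKE created for the ingress-nodepool):

# Step 6: reserve an internal IP address from the node subnet
resource "google_compute_address" "istio_ingress" {
  name         = "istio-ingress-ilb"
  region       = var.region
  subnetwork   = var.subnetwork
  address_type = "INTERNAL"
}

resource "google_compute_health_check" "istio_ingress" {
  name = "istio-ingress-hc"
  tcp_health_check {
    port = 443
  }
}

# Step 7: internal TCP load balancer backed by the ingress node pool instance groups
resource "google_compute_region_backend_service" "istio_ingress" {
  name          = "istio-ingress-backend"
  region        = var.region
  protocol      = "TCP"
  health_checks = [google_compute_health_check.istio_ingress.self_link]

  backend {
    group = var.ingress_instance_group_a  # e.g. ingress-nodepool-ig-a
  }
  backend {
    group = var.ingress_instance_group_b  # e.g. ingress-nodepool-ig-b
  }
  backend {
    group = var.ingress_instance_group_c  # e.g. ingress-nodepool-ig-c
  }
}

resource "google_compute_forwarding_rule" "istio_ingress" {
  name                  = "istio-ingress-fr"
  region                = var.region
  load_balancing_scheme = "INTERNAL"
  network               = var.network
  subnetwork            = var.subnetwork
  ip_address            = google_compute_address.istio_ingress.address
  backend_service       = google_compute_region_backend_service.istio_ingress.self_link
  ports                 = ["80", "443"]
}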

The service should then be resolvable and routable from authorized internal networks (peered private VPCs or internal networks connected via VPN or Dedicated Interconnect) as:

https://<service>.<namespace>.<zone>/<endpoint>

The advantages of this design pattern are…

  1. The Ingress Gateway provides fully functional application load balancing services.
  2. Istio provides service discovery and routing using names and namespaces.
  3. The Ingress Gateway service and ingress gateway node pool can be scaled as required to meet demand.
  4. The Ingress Gateway is multi-zonal for greater availability.

GCP Networking for AWS Professionals

GCP and AWS share many similarities, they both provide similar services and both leverage containerization, virtualization and software defined networking.

There are some significant differences when it comes to their respective implementations, networking is a key example of this.

Before we compare and contrast the two different approaches to networking, it is worthwhile noting the genesis of the two major cloud providers.

Google was born to be global, Amazon became global

By no means am I suggesting that Amazon didn't have designs on going global from its beginnings, but AWS was driven (entirely at the beginning) by the needs of the Amazon eCommerce business. Amazon started in the US before expanding into other regions (such as Europe and Australia); in some cases the expansion took decades (Amazon only entered Australia as a retailer in 2018).

Google, by contrast, was providing application, search and marketing services worldwide from its very beginning. GCP, which was used as the vector to deliver these services and applications, was architected around this global model – even though its actual data centre expansion may not have been as rapid as AWS's (for example, GCP opened its Australia region 5 years after AWS).

Their respective networking implementations reflect how their respective companies evolved.

AWS is a leader in IaaS, GCP is a leader in PaaS

This is only an opinion and may be argued; however, if you look at the chronology of the two platforms, consider this:

  • The first services released by AWS (simultaneously for all intents and purposes) were S3, SQS and EC2
  • The first service released by Google was AppEngine (a pure PaaS offering)

Google has since launched and matured its IaaS offerings, as AWS has done with its PaaS offerings, but they started from much different places.

With all of that said, here are the key differences when it comes to networking between the two major cloud providers:

GCP VPCs are Global by default, AWS VPCs are Regional only

This is the first fundamental difference between the two providers. Each GCP project is allocated one default VPC network, with Subnets in each of the 18 GCP Regions. Each AWS account, by contrast, is allocated one default VPC in each AWS Region, with a Subnet in each Availability Zone for that Region – that is, one default VPC in each of the 17 Regions (excluding GovCloud regions).

Default Global VPC Network in GCP

It is entirely possible to restrict a GCP VPC network to regional (dynamic) routing, but VPC networks are Global by default.

This global tenancy can be advantageous in many cases but limiting in others; for instance, there is a limit of 25 peering connections to any one VPC network in GCP, whereas the limit in AWS is 125.

GCP Subnets are Regional, AWS Subnets are Zonal

Subnets in GCP automatically span all Zones in a Region, whereas AWS VPC Subnets are assigned to a single Availability Zone in a Region. This means you are abstracted from some of the networking and zonal complexity, but you have less control over the specific network placement of instances and endpoints. You can infer from this design that Zones are replicated or synchronised within a Region, making them less of a direct consideration for High Availability (or at least not as much of a concern as they otherwise would be).
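
To illustrate (a sketch only – the names, region and CIDR range here are arbitrary), a custom mode VPC network and a regional subnet in Terraform look something like this:

resource "google_compute_network" "custom_vpc" {
  name                    = "custom-vpc"
  auto_create_subnetworks = false        # the VPC network itself is a global resource
  routing_mode            = "REGIONAL"   # dynamic routing can be REGIONAL or GLOBAL
}

resource "google_compute_subnetwork" "sydney" {
  name          = "sydney-subnet"
  network       = google_compute_network.custom_vpc.self_link
  region        = "australia-southeast1"  # subnets are regional and span every zone in the region
  ip_cidr_range = "10.1.0.0/20"
}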

All GCP Firewall Rules are Stateful

AWS Security Groups are stateful firewall rules – meaning they maintain connection state for inbound connections – while AWS also has Network ACLs (NACLs), which are stateless firewall rules. GCP has no direct equivalent of NACLs; however, GCP Firewall Rules are more configurable than their AWS counterparts. For instance, GCP Firewall Rules can include Deny actions, which is not an option with AWS Security Group rules.
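
For example, a GCP firewall rule with a Deny action in Terraform (a sketch – the rule name, priority and tag are illustrative) looks like this; note that it is attached to the network and targeted via network tags rather than to individual instances:

resource "google_compute_firewall" "deny_ssh_from_internet" {
  name      = "deny-ssh-from-internet"
  network   = google_compute_network.custom_vpc.self_link
  direction = "INGRESS"
  priority  = 900             # lower number = higher precedence

  deny {
    protocol = "tcp"
    ports    = ["22"]
  }

  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["no-external-ssh"]  # applies only to instances carrying this tag
}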

Load Balancers in GCP are layer 4 (TCP/UDP) unless they are public facing

AWS Application Load Balancers can be deployed in private VPCs with no external IPs attached to them. GCP has Application Load Balancers (Layer 7 load balancers), but only for public facing applications; internal facing load balancers in GCP are Network Load Balancers. This presents some challenges with application level load balancing functionality such as stickiness. There are potential workarounds, however, such as running NGINX in GKE behind an internal TCP load balancer (a pattern similar to the Istio approach described earlier in this post).

Firewall rules are at the Network Level not at the Instance or Service Level

There are simple firewall settings available at the instance level, but these are limited to allowing HTTP and HTTPS traffic to the instance and don't allow you to specify sources. Detailed firewall rules are set at the GCP VPC Network level and are not attached to or associated with instances as they are in AWS.

Hopefully this is helpful for AWS engineers and architects being exposed to GCP for the first time!