18 posts tagged with "gcp"

· 4 min read
Tom Klimovski

So you're using BigQuery (BQ). It's all set up and humming perfectly. Maybe now, you want to run an ELT job whenever a new table partition is created, or maybe you want to retrain your ML model whenever new rows are inserted into the BQ table.

In my previous article on EventArc, we went through how Logging can help us create eventing-type functionality in your application. Let's take it a step further and walk through how we can couple BigQuery and Cloud Run.

In this article you will learn how to:

  • Tie together BigQuery and Cloud Run
  • Use BigQuery's audit log to trigger Cloud Run
  • With those triggers, run your required code

Let's go!

Let's create a temporary dataset within BigQuery named tmp_bq_to_cr.

In that same dataset, let's create a table in which we will insert some rows to test our BQ audit log. Let's grab some rows from a BQ public dataset to create this table:

CREATE OR REPLACE TABLE tmp_bq_to_cr.cloud_run_trigger AS
SELECT
date, country_name, new_persons_vaccinated, population
from `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name='Australia'
AND
date > '2021-05-31'
LIMIT 100

Following this, let's run an insert query that will help us build our mock database trigger:

INSERT INTO tmp_bq_to_cr.cloud_run_trigger
VALUES('2021-06-18', 'Australia', 3, 1000)

Now, in another browser tab let's navigate to BQ Audit Events and look for our INSERT INTO event:

BQ-insert-event

There will be several audit logs for any given BQ action. Only after a query is parsed does BQ know which table we want to interact with, so the initial log, for example, will not include the table name.

We don't want any old audit log, so we need to ensure we look for a unique set of attributes that clearly identify our action, such as in the diagram above.

In the case of inserting rows, the attributes are a combination of

  • The method is google.cloud.bigquery.v2.JobService.InsertJob
  • The name of the table being inserted to is the protoPayload.resourceName
  • The dataset id is available as resource.labels.dataset_id
  • The number of inserted rows is protoPayload.metadata.tableDataChange.insertedRowsCount
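
Put together, the shape of the log entry our Cloud Run service will receive looks roughly like this (a heavily trimmed sketch; values are illustrative and only the fields we filter on are shown):

# illustrative, trimmed audit log event - only the fields the service inspects
audit_log_event = {
    "resource": {
        "labels": {
            "project_id": "my-project",  # placeholder project id
            "dataset_id": "tmp_bq_to_cr",
        }
    },
    "protoPayload": {
        "methodName": "google.cloud.bigquery.v2.JobService.InsertJob",
        "resourceName": "projects/my-project/datasets/tmp_bq_to_cr/tables/cloud_run_trigger",
        "metadata": {
            "tableDataChange": {"insertedRowsCount": "1"}
        },
    },
}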

Time for some code

Now that we've identified the payload that we're looking for, we can write the action for Cloud Run. We've picked Python and Flask to help us in this instance. (full code is on GitHub).

First, let's filter out the noise and find the event we want to process:

@app.route('/', methods=['POST'])
def index():
    # Gets the Payload data from the Audit Log
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        if ds == 'tmp_bq_to_cr' and \
                tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            query = create_agg()
            return "table created", 200
    except:
        # if these fields are not in the JSON, ignore
        pass
    return "ok", 200

Now that we've found the event we want, let's execute the action we need. In this example, we'll aggregate and write out to a new table created_by_trigger:

def create_agg():
    client = bigquery.Client()
    query = """
    CREATE OR REPLACE TABLE tmp_bq_to_cr.created_by_trigger AS
    SELECT
      country_name, SUM(new_persons_vaccinated) AS n
    FROM tmp_bq_to_cr.cloud_run_trigger
    GROUP BY country_name
    """
    client.query(query)
    return query

The Dockerfile for the container is simply a basic Python container into which we install Flask and the BigQuery client library:

FROM python:3.9-slim
RUN pip install Flask==1.1.2 gunicorn==20.0.4 google-cloud-bigquery
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY *.py ./
CMD exec gunicorn --bind :$PORT main:app

Now we Cloud Run

Build the container and deploy it using a couple of gcloud commands:

SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image $CONTAINER --platform managed

I always forget about the permissions

In order for the trigger to work, the Cloud Run service account will need the following permissions:

gcloud projects add-iam-policy-binding $PROJECT \
--member="serviceAccount:service-${PROJECT_NO}@gcp-sa-pubsub.iam.gserviceaccount.com" \
--role='roles/iam.serviceAccountTokenCreator'

gcloud projects add-iam-policy-binding $PROJECT \
--member=serviceAccount:${SVC_ACCOUNT} \
--role='roles/eventarc.admin'

Finally, the event trigger

gcloud eventarc triggers create ${SERVICE}-trigger \
--location ${REGION} --service-account ${SVC_ACCOUNT} \
--destination-run-service ${SERVICE} \
--event-filters type=google.cloud.audit.log.v1.written \
--event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
--event-filters serviceName=bigquery.googleapis.com

Important to note here is that we're triggering on any insert log created by BQ. That's why, in this action, we had to filter these events based on the payload.

Take it for a spin

Now, try out the BigQuery -> Cloud Run trigger and action. Go to the BigQuery console and insert a row or two:

INSERT INTO tmp_bq_to_cr.cloud_run_trigger
VALUES('2021-06-18', 'Australia', 5, 25000)

Watch as a new table called created_by_trigger gets created! You have successfully triggered a Cloud Run action on a database event in BigQuery.

Enjoy!

· 10 min read
Tom Klimovski

Metadata Hub (MDH) is intended to be the source of truth for metadata around the Company’s platform. It has the ability to load metadata configuration from yaml, and serve that information up via API. It will also be the store of information for pipeline information while ingesting files into the platform.

Key philosophies:

Config-Driven. Anyone who has been authorized to do so should be able to add another ‘table-info.yaml’ into MDH without the need to update any code in the system.

Here’s how table information makes its way into MDH:

Metadata Hub

Paths

paths:
  /tables:
    get:
      summary: All tables in MDH
      description: get the title of all tables that exist in MDH
    post:
      summary: Creates a new table in MDH
      description: Creates a new table in MDH
  /tables/{id}:
    get:
      summary: Obtain information about specific table
  /tables/{id}/columns:
    get:
      summary: All columns for a particular table
      description: Obtain information on columns for a particular table
  /run:
    get:
      summary: All information about a particular end-to-end batch run of file ingestion
    post:
      summary: Update metadata on a batch load
      description: Update metadata on a batch load
  /calendar:
    get:
      summary: Use this to save on calculation of business days.
      description: This base response gives you today's date in a string
  /calendar/previousBusinessDay:
    get:
      summary: Will return a string of the previous business day
      description: Will return a string of the previous business day, based on the date when it's called
  /calendar/nextBusinessDay:
    get:
      summary: Will return a string of the next business day
      description: Will return a string of the next business day, based on the date when it's called
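
For example, a client might consume these endpoints as follows (a sketch only; the base URL and request body fields are placeholders for wherever and however MDH is hosted):

import requests

MDH_BASE_URL = "https://mdh.example.internal"  # placeholder base URL

# list all tables registered in MDH
tables = requests.get(f"{MDH_BASE_URL}/tables").json()

# fetch the columns for a specific table
columns = requests.get(f"{MDH_BASE_URL}/tables/broker/columns").json()

# update metadata on a batch load (body fields are illustrative only)
requests.post(f"{MDH_BASE_URL}/run", json={"run_id": "20210618-001", "status": "complete"})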

Yaml to Datastore - Entity/Kind design

Datastore Primer

Before we jump right into Entity Groups in Datastore, it is important to first go over the basics and establish a common vocabulary. Datastore holds entities, which are objects, that can contain various key/value pairs, called properties. Each entity must contain a unique identifier, known as a key. When creating an entity, a user can choose to specify a custom key or let Datastore create a key. If a user decides to specify a custom key, it will contain two fields: a kind, which represents a category such as ‘Toy’ or ‘Marital Status’, and a name, which is the identifying value. If a user decides to only specify a kind when creating a key, and does not specify a unique identifier, Datastore automatically generates an ID behind the scenes. Below is an example of a Python3 script which illustrates this identifier concept.

from google.cloud import datastore

client = datastore.Client()
# Custom key - specify kind=table and a unique id of "broker"
custom_key_entry = datastore.Entity(client.key("table","broker"))
client.put(custom_key_entry)

# Only specify kind=table, let Datastore generate the unique id
datastore_gen_key_entry = datastore.Entity(client.key("table"))
client.put(datastore_gen_key_entry)

In your GCP Console under Datastore, you will then see your two entities of kind “table”. One will contain your custom key and one will contain the automatically generated key.

Ancestors and Entity Groups

For highly related or hierarchical data, Datastore allows entities to be stored in a parent/child relationship. This is known as an entity group or ancestor/descendent relationship.

Entity Group

erd

This is an example of an entity group with kinds of types: table, column, and classification. The ‘Grandparent’ in this relationship is the ‘table’. In order to configure this, one must first create the table entity. Then, a user can create a column, and specify that the parent is a table key. In order to create the grandchild, a user then creates a classification and sets its parent to be a column key. To further add customizable attributes, a user can specify additional key-value pairs such as pii and data_type. These key-value pairs are stored as properties. We model this diagram in Datastore in our working example below.

One can create entity groups by setting the ‘parent’ parameter while creating an entity key for a child. This command adds the parent key to be part of the child entity key. The child’s key is represented as a tuple (‘parent_key’, ‘child_key’), such that the parent’s key is the prefix of the key, followed by its own unique identifier. For example, following the diagram above:

table_key = datastore_client.key("table","broker")
column_key = datastore_client.key("column","broker_legal_name", parent=table_key)

Printing the variable column_key will display: ("table", "broker", "column", "broker_legal_name")

Datastore also supports chaining of parents, which can lead to very large keys for descendants with a long lineage of ancestors. Additionally, parents can have multiple children (representing a one-to-many relationship). However, there is no native support for entities to have multiple parents (representing a many-to-many relationship). Once you have configured this ancestral hierarchy, it is easy to retrieve all descendants for a given parent. You can do this by querying on the parent key using the ‘ancestor’ parameter. For example, given the entity table_key created above, I can query for all of the table’s columns:

my_query = client.query(kind="column", ancestor=table_key)

A Full Working Example for MDH

As per our Key Philosophies - Config-Driven - anyone should be able to add a new table to be processed and landed in a target-table somewhere within MDH with our yaml syntax. Below is a full working python3 example of the table/column/classification hierarchical model described above.

from google.cloud import datastore

datastore_client = datastore.Client()

# Entities with kinds- table, column, classification
my_entities = [
    {"kind": "table", "table_id": "broker", "table_type": "snapshot",
     "notes": "describes mortgage brokers"},
    {"kind": "column", "column_id": "broker_legal_name", "table_id": "broker",
     "data_type": "string", "size": 20, "nullable": 1},
    {"kind": "column", "column_id": "broker_short_code", "table_id": "broker",
     "data_type": "string", "size": 3, "nullable": 1},
    {"kind": "classification", "classification_id": "classif_id_REQ_01",
     "restriction_level": "public", "pii": 0, "if": "greater than 90 days",
     "column_id": "broker_legal_name", "table_id": "broker"},
    {"kind": "classification", "classification_id": "classif_id_REQ_03",
     "restriction_level": "restricted", "pii": 0, "if": "less than 90 days",
     "column_id": "broker_legal_name", "table_id": "broker"},
    {"kind": "classification", "classification_id": "classif_id_REQ_214",
     "restriction_level": "public", "pii": 0, "column_id": "broker_short_code",
     "table_id": "broker"},
]


# traverse my_entities, set parents and add those to Datastore
for entity in my_entities:
    kind = entity['kind']
    parent_key = None
    if kind == "column":
        parent_key = datastore_client.key("table", entity["table_id"])
    elif kind == "classification":
        parent_key = datastore_client.key("table", entity["table_id"],
                                          "column", entity["column_id"])

    key = datastore_client.key(kind, entity[kind + "_id"],
                               parent=parent_key)
    datastore_entry = datastore.Entity(key)
    datastore_entry.update(entity)

    print("Saving: {}".format(entity))

    datastore_client.put(datastore_entry)

The code above assumes that you’ve set yourself up with a working Service Account or authorised yourself with gcloud, and that your GCP project has been set.

Now let’s do some digging around our newly minted Datastore model. Let’s grab the column ‘broker_legal_name’

query1 = datastore_client.query(kind="column")
query1.add_filter("column_id", "=", "broker_legal_name")

Now that we have the column entity, let’s locate its parent id.

column = list(query1.fetch())[0]
print("This column belongs to: " +str(column.key.parent.id_or_name))

Further to this, we can also get all data classification elements attributed to a single column using the ancestor clause query.

query2 = datastore_client.query(kind="classification", ancestor=column.key)
for classification in list(query2.fetch()):
    print(classification.key)
    print(classification["restriction_level"])

For more complex queries, Datastore has the concept of indexes, which are usually set via its index.yaml configuration. The following is an example of an index.yaml file:

indexes:
- kind: Cat
  ancestor: no
  properties:
  - name: name
  - name: age
    direction: desc

- kind: Cat
  properties:
  - name: name
    direction: asc
  - name: whiskers
    direction: desc

- kind: Store
  ancestor: yes
  properties:
  - name: business
    direction: asc
  - name: owner
    direction: asc

Indexes are important when attempting to add filters on more than one particular attribute within a Datastore entity. For example, the following code will fail:

# Adding a '>' filter will cause this to fail. Side note: it will work
# without an index if you add another '=' filter.
query2 = datastore_client.query(kind="classification", ancestor=column.key)
query2.add_filter("pii", ">", 0)
for classification in list(query2.fetch()):
    print(classification.key)
    print(classification["classification_id"])

To rectify this issue, you need to create an index.yaml that looks like the following:

indexes:
- kind: classification
  ancestor: yes
  properties:
  - name: pii

You would usually upload the yaml file using the gcloud command:

gcloud datastore indexes create path/to/index.yaml

However, let’s do this programmatically.

The official pypi package for google-cloud-datastore can be found here: https://pypi.org/project/google-cloud-datastore/. At the time of writing, Firestore in Datastore-mode will be the way forward, as per the release note from January 31, 2019.

Cloud Firestore is now Generally Available. Cloud Firestore is the new version of Cloud Datastore and includes a backwards-compatible Datastore mode.

If you intend to use the Cloud Datastore API in a new project, use Cloud Firestore in Datastore mode. Existing Cloud Datastore databases will be automatically upgraded to Cloud Firestore in Datastore mode.

Except where noted, the Cloud Datastore documentation now describes behavior for Cloud Firestore in Datastore mode.

We’ve purposefully created MDH in Datastore to show you how it was done originally, and we’ll be migrating the Datastore code to Firestore in an upcoming post.

Creating and deleting indexes within Datastore will need to be done through the REST API via googleapiclient.discovery, as this function doesn’t exist via the google-cloud-datastore API. Working with the discovery api client can be a bit daunting for a first-time user, so here’s the code to add an index on Datastore:

import os
from google.oauth2 import service_account
from googleapiclient.discovery import build
from google.cloud import datastore


SCOPES = ['https://www.googleapis.com/auth/cloud-platform']

SERVICE_ACCOUNT_FILE = os.getenv('GOOGLE_APPLICATION_CREDENTIALS')
PROJECT_ID = os.getenv("PROJECT_ID")

credentials = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE, scopes=SCOPES)

datastore_api = build('datastore', 'v1', credentials=credentials)

body = {
    'ancestor': 'ALL_ANCESTORS',
    'kind': 'classification',
    'properties': [{
        'name': 'pii',
        'direction': 'DESCENDING'
    }]
}

response = (datastore_api.projects()
            .indexes()
            .create(projectId=PROJECT_ID, body=body)
            .execute())

How did we craft this API request? We can use the Google API Discovery Service to build client libraries, IDE plugins, and other tools that interact with Google APIs. The Discovery API provides a list of Google APIs and a machine-readable "Discovery Document" for each API. Features of the Discovery API:

  • A directory of supported APIs.
  • A machine-readable "Discovery Document" for each of the supported APIs. Each document contains:
      • Schemas based on JSON Schema.
      • A list of API methods and available parameters for each method.
      • A list of available OAuth 2.0 scopes.
      • Inline documentation of methods, parameters, and available parameter values.

Navigating to the API reference page for Datastore and going to the ‘Datastore Admin’ API page, we can see references to the Indexes and RESTful endpoints we can hit for those Indexes. Therefore, looking at the link for the Discovery document for Datastore:

https://datastore.googleapis.com/$discovery/rest?version=v1

From this, we can build out our instantiation for the google api discovery object build('datastore', 'v1', credentials=credentials)

With respect to building out the body aspect of the request, I’ve found crafting that part within the ‘Try this API’ section of https://cloud.google.com/datastore/docs/reference/admin/rest/v1/projects.indexes/create pretty valuable.

With this code, your index should show up in your Datastore console! You can also retrieve them within gcloud with gcloud datastore indexes list if you’d like to verify the indexes outside our python code. So there you have it: a working example of entity groups, ancestors, indexes and Metadata within Datastore. Have fun coding!

· One min read
Jeffrey Aven

Following on from the recent post GCP Templates for C4 Diagrams using PlantUML, cloud architects are often challenged with producing diagrams for architectures spanning multiple cloud providers, particularly as you elevate to enterprise level diagrams.

In this post, with the magic of !includeurl we have brought PlantUML template libraries together for AWS, Azure and GCP icon sets, allowing us to produce multi cloud C4 diagrams using PlantUML like this one:

Multi Cloud Architecture Diagram using PlantUML

Creating a multi cloud diagram is simple, start by adding the following include statements after the @startuml label in a new PlantUML C4 diagram:

Then add references to the required services from different providers…

Then include the predefined resources from your different cloud providers in your diagram as shown here (describing a client server application over a cloud to cloud VPN between Azure and GCP)...

Happy multi-cloud diagramming!

Full source code is available at:

https://github.com/gamma-data/plantuml-multi-cloud-diagrams

· 4 min read
Jeffrey Aven

This is a follow up to the original Cloud Bigtable primer where we discussed the basics of Cloud Bigtable:

Cloud Bigtable Primer - Part I

In this article we will cover schema design and row key selection in Bigtable – arguably the most critical design decision to make when employing Bigtable in a cloud data architecture.

Quick Review

Recall from the previous post where the Bigtable data model was introduced that tables in Bigtable are comprised of rows and columns - much like a table in any other RDBMS. Every row is uniquely identified by a rowkey – again akin to a primary key in a table in an RDBMS. But this is where the similarities end.

Unlike a table in an RDBMS, columns only ever exist when they are inserted, and NULLs are not stored. See the illustration below:

Row Key Selection

Data in Bigtable is distributed by row keys. Row keys are physically stored in tablets in lexicographic order. Recall that row keys are your ONLY indexes to data in Bigtable.

Selection Considerations

As row keys are your only indexes to retrieve or update rows in Bigtable, row key design must take into consideration the access patterns for the data to be stored and served via Bigtable. Specifically, the following must be considered when designing a Bigtable application:

  • Search patterns (returning data for a specific entity)
  • Scan patterns (returning batches of data)

Queries that use the row key, a row prefix, or a row range are the most efficient. Queries that do not include a row key will typically scan GB or TB of data and would not be suitable for operational use cases.
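
For instance, a bounded scan over a single key prefix using the Python client library might look like the sketch below (project, instance and table names are placeholders):

from google.cloud import bigtable

client = bigtable.Client(project="my-project")  # placeholder project
table = client.instance("my-instance").table("events")

# efficient: a contiguous range read over one row key prefix
rows = table.read_rows(start_key=b"customer123#", end_key=b"customer123#\xff")
for row in rows:
    print(row.row_key, row.cells)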

Row Key Performance

Row key performance will be biased towards your specific access patterns and application functional requirements. For example, if you are performing sequential reads or scan operations then sequential keys will perform best; however, their write performance will not be optimal. Conversely, random keys (such as a UUID) will perform best for writes but poorly for scan or sequential read operations.

Adding salts to keys (additional data, similar to the use of salts in cryptography), as well as promoting other fields to be part of a composite row key, can help achieve a “Goldilocks” scenario for both reads and writes; see the diagram below:

Using Reverse Timestamps

Use reverse timestamps when your most common query is for the latest values. Typically you would append the reverse timestamp to the key; this ensures that related records are grouped together. For instance, if you are storing events for a customer, using the customer id along with an appended reverse timestamp (for example <customer_id>#<reverse_ts>) would allow you to quickly serve the latest events for that customer in descending order, as within each group (customer_id) rows are sorted so that the most recent insert is located at the top.
A reverse timestamp can be generalised as:

Long.MAX_VALUE - System.currentTimeMillis()
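
A minimal Python sketch of constructing such a key (the customer id is a placeholder; zero-padding keeps lexicographic order aligned with numeric order):

import time

JAVA_LONG_MAX = 2**63 - 1  # Long.MAX_VALUE

def reverse_timestamp_ms() -> int:
    return JAVA_LONG_MAX - int(time.time() * 1000)

customer_id = "customer123"  # placeholder entity id
row_key = f"{customer_id}#{reverse_timestamp_ms():019d}"
# within the customer123 prefix, the most recent event now sorts first
print(row_key)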

Schema Design Tips

Some general tips for good schema design using Bigtable are summarised below:

  • Group related data for more efficient reads using column families
  • Distribute data evenly for more efficient writes
  • Place identical values in the adjoining rows for more efficient compression using row keys

Following these tips will give you the best possible performance using Bigtable.

Use the Key Visualizer to profile performance

Google provides a neat tool to visualize your row key distribution in Cloud Bigtable. You need to have at least 30 GB of data in your table to enable this feature.

The Key Visualizer is shown here:

Bigtable Key Visualizer

The Key Visualizer will help you find and prevent hotspots, find rows with too much data and show if your key schema is balanced.

Summary

Bigtable is one of the original and best (massively) distributed NoSQL platforms available. Schema and moreover row key design play a massive part in ensuring low latency and query performance. Go forth and conquer with Cloud Bigtable!

· 2 min read
Jeffrey Aven

I am a believer in the mantra of “Everything-as-Code”, this includes diagrams and other architectural artefacts. Enter PlantUML…

PlantUML

PlantUML is an open-source tool which allows users to create UML diagrams from an intuitive DSL (Domain Specific Language). PlantUML is built on top of Graphviz and enables software architects and designers to use code to create Sequence Diagrams, Use Case Diagrams, Class Diagrams, State and Activity Diagrams and much more.

C4 Diagrams

PlantUML can be extended to support the C4 model for visualising software architecture, which describes software at different levels of detail including Context, Container, Component and Code diagrams.

GCP Architecture Diagramming using C4

PlantUML and C4 can be used to produce cloud architecture diagrams. There are official libraries available through PlantUML for Azure and AWS service icons; however, these do not yet exist for GCP. There are several open source libraries available, but I have made an attempt to simplify the implementation.

The code below can be used to generate a C4 diagram describing a GCP architecture including official GCP service icons:

@startuml
!define GCPPuml https://raw.githubusercontent.com/gamma-data/GCP-C4-PlantUML/master/templates

!includeurl GCPPuml/C4_Context.puml
!includeurl GCPPuml/C4_Component.puml
!includeurl GCPPuml/C4_Container.puml
!includeurl GCPPuml/GCPC4Integration.puml
!includeurl GCPPuml/GCPCommon.puml

!includeurl GCPPuml/Networking/CloudDNS.puml
!includeurl GCPPuml/Networking/CloudLoadBalancing.puml
!includeurl GCPPuml/Compute/ComputeEngine.puml
!includeurl GCPPuml/Storage/CloudStorage.puml
!includeurl GCPPuml/Databases/CloudSQL.puml

title Sample C4 Diagram with GCP Icons

Person(publisher, "Publisher")
System_Ext(device, "User")

Boundary(gcp,"gcp-project") {
CloudDNS(dns, "Managed Zone", "Cloud DNS")
CloudLoadBalancing(lb, "L7 Load Balancer", "Cloud Load Balancing")
CloudStorage(bucket, "Static Content Bucket", "Cloud Storage")
Boundary(region, "gcp-region") {
Boundary(zonea, "zone a") {
ComputeEngine(gcea, "Content Server", "Compute Engine")
CloudSQL(csqla, "Dynamic Content", "Cloud SQL")
}
Boundary(zoneb, "zone b") {
ComputeEngine(gceb, "Content Server", "Compute Engine")
CloudSQL(csqlb, "Dynamic Content\\n(Read Replica)", "Cloud SQL")
}
}
}

Rel(device, dns, "resolves name")
Rel(device, lb, "makes request")
Rel(lb, gcea, "routes request")
Rel(lb, gceb, "routes request")
Rel(gcea, bucket, "get static content")
Rel(gceb, bucket, "get static content")
Rel(gcea, csqla, "get dynamic content")
Rel(gceb, csqla, "get dynamic content")
Rel(csqla, csqlb, "replication")
Rel(publisher,bucket,"publish static content")

@enduml

The preceding code generates the diagram below:

Additional services can be added and used in your diagrams by adding them to your includes, such as:

!includeurl GCPPuml/DataAnalytics/BigQuery.puml
!includeurl GCPPuml/DataAnalytics/CloudDataflow.puml
!includeurl GCPPuml/AIandMachineLearning/AIHub.puml
!includeurl GCPPuml/AIandMachineLearning/CloudAutoML.puml
!includeurl GCPPuml/DeveloperTools/CloudBuild.puml
!includeurl GCPPuml/HybridandMultiCloud/Stackdriver.puml
!includeurl GCPPuml/InternetofThings/CloudIoTCore.puml
!includeurl GCPPuml/Migration/TransferAppliance.puml
!includeurl GCPPuml/Security/CloudIAM.puml
' and more…

The complete template library is available at:

https://github.com/gamma-data/GCP-C4-PlantUML

· 7 min read
Jeffrey Aven

Bigtable is one of the foundational services in the Google Cloud Platform and to this day one of the greatest contributions to the big data ecosystem at large. It is also one of the least known services available, with all the headlines and attention going to more widely used services such as BigQuery.

Background

In 2006 (pre Google Cloud Platform), Google released a white paper called “Bigtable: A Distributed Storage System for Structured Data”, which set out the reference architecture for what was to become Cloud Bigtable. This followed several other whitepapers, including the GoogleFS and MapReduce papers released in 2003 and 2004, which provided abstract reference architectures for the Google File System (now known as Colossus) and the MapReduce algorithm. These whitepapers inspired a generation of open source distributed processing systems, including Hadoop. Google has long had a pattern of publicising a generalized overview of their approach to solving different storage and processing challenges at scale through white papers.

Bigtable Whitepaper 2006

The Bigtable white paper inspired a wave of open source distributed key/value oriented NoSQL data stores including Apache HBase and Apache Cassandra.

What is Bigtable?

Bigtable is a distributed, petabyte scale NoSQL database. More specifically, Bigtable is…

a map

At its core Bigtable is a distributed map or an associative array indexed by a row key, with values in columns which are created only when they are referenced. Each value is an uninterpreted byte array.

sorted

Row keys are stored in lexicographic order, akin to a clustered index in a relational database.

sparse

A given row can have any number of columns, not all columns must have values and NULLs are not stored. There may also be gaps between keys.

multi-dimensional

All values are versioned with a timestamp (or configurable integer). Data is not updated in place, it is instead superseded with another version.

When (and when not) to use Bigtable

Use Bigtable if…

  • You need to do many thousands of operations per second on TB+ scale data
  • Your access patterns are well known and simple
  • You need to support random write or random read operations (or sequential reads) - each using a row key as the primary identifier

Don’t use Bigtable if…

  • You need explicit JOIN capability, that is joining one or more tables
  • You need to do ad-hoc analytics
  • Your access patterns are unknown or not well defined

Bigtable vs Relational Database Systems

The following table compares and contrasts Bigtable against relational databases (both transaction oriented and analytic oriented databases):

|                       | Bigtable               | RDBMS (OLTP)    | RDBMS (DSS/MPP)          |
| --------------------- | ---------------------- | --------------- | ------------------------ |
| Data Layout           | Column Family Oriented | Row Oriented    | Column Oriented          |
| Transaction Support   | Single Row Only        | Yes             | Depends (but usually no) |
| Query DSL             | get/put/scan/delete    | SQL             | SQL                      |
| Indexes               | Row Key Only           | Yes             | Yes (typically PI based) |
| Max Data Size         | PB+                    | '00s GB to TB   | TB+                      |
| Read/Write Throughput | '000,000s queries/s    | '000s queries/s |                          |

Bigtable Data Model

Tables in Bigtable are comprised of rows and columns (sounds familiar so far..). Every row is uniquely identified by a rowkey (like a primary key..again sounds familiar so far).

Columns belong to Column Families and only exist when inserted, NULLs are not stored - this is where it starts to differ from a traditional RDBMS. The following image demonstrates the data model for a fictitious table in Bigtable.

Bigtable Data Model

In the previous example, we created two Column Families (cf1 and cf2). These are created during table definition or update operations (akin to DDL operations in the relational world). In this case, we have chosen to store primary attributes, like name, etc in cf1 and features (or derived attributes) in cf2 like indicators.

Cell versioning

Each cell has a timestamp/version associated with it, multiple versions of a row can exist. Versions are naturally stored in descending order.

Properties such as the max age for a cell or the maximum number of versions to be stored for any given cell are set on the Column Family. Versions are compacted through a process called Garbage Collection - not to be confused with Java Garbage Collection (albeit same idea).

| Row Key | Column     | Value    | Timestamp               |
| ------- | ---------- | -------- | ----------------------- |
| 123     | cf1:status | ACTIVE   | 2020-06-30T08.58.27.560 |
| 123     | cf1:status | PENDING  | 2020-06-28T06.20.18.330 |
| 123     | cf1:status | INACTIVE | 2020-06-27T07.59.20.460 |
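
As a sketch, those column family properties could be set at table creation time with the Python client library like this (project, instance and table names are placeholders):

import datetime
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)  # placeholder project
table = client.instance("my-instance").table("customer")

# keep at most 3 versions of a cell, and nothing older than 30 days, in cf1
gc_rule = column_family.GCRuleUnion(rules=[
    column_family.MaxVersionsGCRule(3),
    column_family.MaxAgeGCRule(datetime.timedelta(days=30)),
])
table.create(column_families={"cf1": gc_rule})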

Bigtable Instances, Clusters, Nodes and Tables

Bigtable is a "no-ops" service, meaning you do not need to configure machine types or details about the underlying infrastructure, save a few sizing or performance options - such as the number of nodes in a cluster or whether to use solid state hard drives (SSD) or the magnetic alternative (HDD). The following diagram shows the relationships and cardinality for Cloud Bigtable.

Bigtable Instances, Clusters and Nodes

Clusters and nodes are the physical compute layer for Bigtable. These are zonal assets; zonal and regional availability can be achieved through replication, which we will discuss later in this article.

Instances are a virtual abstraction for clusters, and tables belong to instances (not clusters). This is due to Bigtable's underlying architecture, which is based upon a separation of storage and compute as shown below.

Bigtable Separation of Storage and Compute

Bigtable's separation of storage and compute allows it to scale horizontally; as nodes are stateless, they can be added to increase query performance. The underlying storage system is inherently scalable.

Physical Storage & Column Families

Data (Columns) for Bigtable is stored in Tablets (as shown in the previous diagram), which store "regions" of row keys for a particular Column Family. Columns consist of a column family prefix and qualifier, for instance:

cf1:col1

A table can have one or more Column Families. Column families must be declared at schema definition time (could be a create or alter operation). A cell is an intersection of a row key and a version of a column within a column family.

Storage settings (such as the compaction/garbage collection properties mentioned before) can be specified for each Column Family - which can differ from other column families in the same table.

Bigtable Availability and Replication

Replication is used to increase availability and durability for Cloud Bigtable – this can also be used to segregate read and write operations for the same table.

Data and changes to tables are replicated across multiple regions or multiple zones within the same region, this replication can be blocking (single row transactions) or non blocking (eventually consistent). However all clusters within a Bigtable instance are considered primary (writable).

Requests are routed using Application Profiles; a single-cluster routing policy can be used for manual failover, whereas multi-cluster routing is used for automatic failover.

Backup and Export Options for Bigtable

Managed backups can be taken at a table level, new tables can be created from backups. The backups cannot be exported, however table level export and import operations are available via pre-baked Dataflow templates for data stored in GCS in the following formats:

  • SequenceFiles
  • Avro Files
  • Parquet Files
  • CSV Files

Accessing Bigtable

Bigtable data and admin functions are available via:

  • cbt (optional component of the Google SDK)
  • hbase shell (REPL shell)
  • Happybase API (Python API for Hbase)
  • SDK libraries for:
    • Golang
    • Python
    • Java
    • Node.js
    • Ruby
    • C#, C++, PHP, and more

As Bigtable is not a cheap service, there is a local emulator available which is great for application development. This is part of the Cloud SDK, and can be started using the following command:

gcloud beta emulators bigtable start
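
With the emulator running and BIGTABLE_EMULATOR_HOST exported, the Python client library listed above can be pointed at it without real credentials. A small sketch (project, instance and table names are arbitrary placeholders against the emulator):

# assumes: export BIGTABLE_EMULATOR_HOST=localhost:8086
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="local-project", admin=True)
table = client.instance("local-instance").table("customer")
if not table.exists():
    table.create(column_families={"cf1": column_family.MaxVersionsGCRule(1)})

row = table.direct_row(b"customer123")
row.set_cell("cf1", b"name", b"Jane Citizen")
row.commit()

print(table.read_row(b"customer123").cells["cf1"][b"name"][0].value)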

In the next article in this series we will demonstrate admin and data functions as well as the local emulator.

Next Up : Part II - Row Key Selection and Schema Design in Bigtable

· 3 min read
Jeffrey Aven

This is a follow up to a previous blog, Google Cloud Storage Object Notifications using Slack in which we used Slack to notify us of new objects being uploaded to GCS.

In this article we will take things a step further, where uploading an object to a GCS bucket will trigger a DLP inspection of the object and if any preconfigured info types (such as credit card numbers or API credentials) are present in the object, a Slack notification will be generated.

As DLP scans are “jobs”, meaning they run asynchronously, we will need to trigger scans and inspect results using two separate Cloud Functions (one for triggering a scan [gcs-dlp-scan-trigger] and one for inspecting the results of the scan [gcs-dlp-evaluate-results]) and a Cloud PubSub topic [dlp-scan-topic] which is used to hold the reference to the DLP job.

The process is described using the sequence diagram below:

The Code

The gcs-dlp-scan-trigger Cloud Function fires when a new object is created in a specified GCS bucket. This function configures the DLP scan to be executed, including the DLP info types (for instance CREDIT_CARD_NUMBER, EMAIL_ADDRESS, ETHNIC_GROUP, PHONE_NUMBER, etc) and the likelihood of that info type existing (for instance LIKELY). DLP scans determine the probability of an info type occurring in the data; they do not scan every object in its entirety, as this would be too expensive.

The primary function executed in the gcs-dlp-scan-trigger Cloud Function is named inspect_gcs_file. This function configures and submits the DLP job, supplying a PubSub topic to which the DLP Job Name will be written; the code for inspect_gcs_file is shown here:
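
The original function is in the repository linked at the end of this post; a minimal sketch of the idea using the google-cloud-dlp client (info types, names and thresholds here are illustrative) looks like this:

from google.cloud import dlp_v2

def inspect_gcs_file(project, bucket, filename, topic_id):
    """Submit an asynchronous DLP inspection job for a GCS object and
    publish the job name to a Pub/Sub topic when it completes."""
    client = dlp_v2.DlpServiceClient()
    inspect_job = {
        "inspect_config": {
            "info_types": [{"name": "CREDIT_CARD_NUMBER"}, {"name": "EMAIL_ADDRESS"}],
            "min_likelihood": dlp_v2.Likelihood.LIKELY,
        },
        "storage_config": {
            "cloud_storage_options": {"file_set": {"url": f"gs://{bucket}/{filename}"}}
        },
        "actions": [{"pub_sub": {"topic": f"projects/{project}/topics/{topic_id}"}}],
    }
    response = client.create_dlp_job(
        request={"parent": f"projects/{project}", "inspect_job": inspect_job}
    )
    return response.name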

At this stage the DLP job is created and running asynchronously. The next Cloud Function, gcs-dlp-evaluate-results, fires when a message is sent to the PubSub topic defined in the DLP job. gcs-dlp-evaluate-results reads the DLP Job Name from the PubSub message, connects to the DLP service and queries the job status. When the job is complete, this function checks the results of the scan; if the min_likelihood threshold is met for any of the specified info types, a Slack message is generated. The code for the main method in the gcs-dlp-evaluate-results function is shown here:
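
Again, the full function is in the linked repository; an abridged sketch (assuming the job name arrives via the DlpJobName message attribute, and using the send_slack_notification helper shown next) could look like:

from google.cloud import dlp_v2

def evaluate_results(event, context):
    """Triggered by Pub/Sub; check a completed DLP job and notify Slack on findings."""
    job_name = event["attributes"]["DlpJobName"]  # published by the DLP Pub/Sub action
    client = dlp_v2.DlpServiceClient()
    job = client.get_dlp_job(request={"name": job_name})
    if job.state != dlp_v2.DlpJob.JobState.DONE:
        return
    findings = job.inspect_details.result.info_type_stats
    if findings:
        summary = ", ".join(f"{f.info_type.name}: {f.count}" for f in findings)
        send_slack_notification(f"Sensitive data detected by {job_name}: {summary}")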

Finally, a Slack webhook is used to send the message to a specified Slack channel in a workspace; this is done using the send_slack_notification function shown here:
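
A sketch of that helper, assuming the incoming webhook URL is supplied via an environment variable:

import os
import requests

def send_slack_notification(message):
    """Post a simple text message to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed environment variable
    requests.post(webhook_url, json={"text": message})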

An example Slack message is shown here:

Slack Notification for Sensitive Data Detected in a Newly Created GCS Object

Full source code can be found at: https://github.com/gamma-data/automated-gcs-object-scanning-using-dlp-with-notifications-using-slack

· 9 min read
Daniel Hussey

Terraform is a powerful tool for managing your Infrastructure as Code. Declare your resources once, define their variables per environment and sleep easy knowing your CI pipeline will take care of the rest.

But… one night you wake up in a sweat. The details are fuzzy but you were browsing your favourite cloud provider’s console - probably GCP ;) - and thought you saw a bucket had been created outside of your allowed locations! Maybe it even had risky access controls.

You go brush it off and try to fall back to sleep, but you can’t quite push the thought from your mind that somewhere in all that Terraform code, someone could be declaring resources in unapproved locations, and your CICD pipeline would do nothing to stop it. Oh the regulatory implications.

Enter Terraform Validator by Forseti

Terraform Validator by Forseti allows you to declare your Policy as Code, check compliance of your Terraform plans against said Policy, and automatically fail violating plans in a CI step. All without setting up servers or agents.

You’re going to learn how to enforce policy on GCP resources like BigQuery, IAM, networks, MySQL, Google Kubernetes Engine (GKE) and more. If you’re particularly crafty, you may be able to go beyond GCP.

Forseti’s suite of solutions is GCP focused and allows a wide range of live config validation, monitoring and more using the Policy Library we’re going to set up. These additional capabilities require additional infrastructure. But we’re going one step at a time, starting with enforcing policy during deployment.

Getting Started

Let’s assume you already have an established CICD pipeline that uses Terraform, or that you are content to validate your Terraform plans locally for now. In that case, we need just two things:

  1. A Policy Library
  2. Terraform Validator

It’s that simple! No new servers, agents, firewall rules, extra service accounts or other nonsense. Just add Policy Library, the Validator tool and you can enforce policy on your Terraform deployments.

We’re going to tinker with some existing GCP-focused sample policies (aka Constraints) that Forseti makes available. These samples cover a wide range of resources and use cases, so it is easy to adjust what’s provided to define your own Constraints.

Policy Library

First let's open up some of Forseti's pre-defined constraints. We’ll copy them into our own Git repository and adjust to create policies that match our needs. Repeatable and configurable - that’s Policy as Code at work.

Concepts

In the world of Forseti, and in particular Terraform Validator, policies are defined and understood via easy to read YAML files known as Constraints.

There is just enough information in a Constraint file to make clear its purpose and effect, and by tinkering lightly with a pre-written Constraint you can achieve a lot without looking too deeply into the inner workings. But there’s more happening than meets the eye.

Constraints are built on Templates - which are like forms with some extra bits waiting to be completed to make a Constraint. Except there’s a lot more hidden away that’s pretty cool if you want to understand it.

Think of a Template as a ‘Class’ in the OOP sense, and of a Constraint as an instantiated Template with all the key attributes defined.

E.g. A generic Template for policy on bucket locations and a Constraint to specify which locations are relevant in a given instance. Again, buckets and locations are just the basic example - the potential applications are far greater.

Now the real magic is that just like a ‘Class’, a Template contains logic that makes everything abstracted away in the Constraint possible. Templates contain inline Rego (ray-go), borrowed lovingly by Forseti from the Open Policy Agent (OPA) team.

Learn more about Rego and OPA here to understand the relationship to our Terraform Validator.

But let’s begin.

Set up your Policies

Create your Policy Library repository

Create your Policy Library repository by cloning https://github.com/forseti-security/policy-library into your own VCS.

This repo contains templates and sample constraints which will form the basis of your policies. So get it into your Git environment and clone it to local for the next step.

Customise sample constraints to fit your needs

As discussed in Concepts, Constraints are defined Templates, which make use of Rego policy language. Nice. So let’s take a sample Constraint, put it in our Policy Library and set the values to what we need. It’s that easy - no need to write new templates or learn Rego if your use case is covered.

In a new branch…

  1. Copy the sample Constraint storage_location.yaml to your Constraints folder.

     $ cp policy-library/samples/storage_location.yaml policy-library/policies/constraints/storage_location.yaml

  2. Replace the sample location (asia-southeast1) in storage_location.yaml with australia-southeast1.

     spec:
       severity: high
       match:
         target: ["organization/*"]
       parameters:
         mode: "allowlist"
         locations:
         - australia-southeast1
         exemptions: []

  3. Push back to your repo - not Forseti’s!

     $ git push https://github.com/<your-repository>/policy-library.git

Policy review

There you go - you’ve customised a sample Constraint. Now you have your own instance of version controlled Policy-as-Code and are ready to apply the power of OPA’s Rego policy language that lies within the parent Template. Impressively easy right?

That’s a pretty simple example. You can browse the rest of Forseti’s Policy Library to view other sample Constraints, Templates and the Rego logic that makes all of this work. These can be adjusted to cover all kinds of use cases across GCP resources.

I suggest working with and editing the sample Constraints before making any changes to Templates.

If you were to write Rego and Templates from scratch, you might even be able to enforce Policy as Code against non-GCP Terraform code.

Terraform Validator

Now, let’s set up the Terraform Validator tool and have it compare a sample piece of Terraform code against the Constraint we configured above. Keep in mind you’ll want to translate what’s done here into steps in your CICD pipeline.

Once the tool is in place, we really just run terraform plan and feed the output into Terraform Validator. The Validator compares it to our Constraints, runs all the abstracted logic we don’t need to worry about and returns 0 or 2 when done for pass / fail respectively. Easy.

So using Terraform if I try to make a bucket in australia-southeast1 it should pass, if I try to make one in the US it should fail. Let’s set up the tool, write some basic Terraform and see how we go.

Setup Terraform Validator

Check for the latest version of terraform-validator from the official terraform-validator GCS bucket.

Very important when using tf version 0.12 or greater. This is the easy way - you can pull from the Terraform Validator Github and make it yourself too.

$ gsutil ls -r gs://terraform-validator/releases

Copy the latest version to the working dir

$ gsutil cp gs://terraform-validator/releases/2020-03-05/terraform-validator-linux-amd64 .

Make it executable

$ chmod 755 terraform-validator-linux-amd64

Ready to go!

Review your Terraform code

We’re going to make a ridiculously simple piece of Terraform that tries to create one bucket in our project to keep things simple.

# main.tf

resource "google_storage_bucket" "tf-validator-demo-bucket" {  
name          = "tf-validator-demo-bucket"
  location      = "US"
  force_destroy = true

  lifecycle_rule {
    condition {
      age = "3"
    }
    action {
      type = "Delete"
    }
  }
}

This is a pretty standard bit of Terraform for a GCS bucket, but made very simple with all the values defined directly in main.tf. Note the location of the bucket - it violates our Constraint that was set to the australia-southeast1 region.

Make the Terraform plan

Warm up Terraform.
If there are any hiccups, double check your Terraform code.

$ terraform init

Make the Terraform plan and store output to file.

$ terraform plan --out=terraform.tfplan

Convert the plan to JSON

$ terraform show -json ./terraform.tfplan > ./terraform.tfplan.json

Validate the non-compliant Terraform plan against your Constraints, for example

$ ./terraform-validator-linux-amd64 validate ./terraform.tfplan.json --policy-path=../repos/policy-library/

TA-DA!

Found Violations:

Constraint allow_some_storage_location on resource //storage.googleapis.com/tf-validator-demo-bucket: //storage.googleapis.com/tf-validator-demo-bucket is in a disallowed location.

Validate the compliant Terraform plan against your Constraints

Let’s see what happens if we repeat the above, changing the location of our GCS bucket to australia-southeast1.

$ ./terraform-validator-linux-amd64 validate ./terraform.tfplan.json --policy-path=../repos/policy-library/

Results in..

No violations found.

Success!!!

Now all that’s left to do for your Policy as Code CICD pipeline is to configure the rest of your Constraints and run this check before you go ahead and terraform apply. Be sure to make the apply step dependent on the outcome of the Validator.

Wrap Up

We’ve looked at how to apply Policy as Code to validate our Infrastructure as Code. Sounds pretty modern and DevOpsy, doesn’t it?

To recap, we learned about Constraints, which are fully defined instances of Policy as Code. They’re based on YAML Templates that refer to the OPA policy language Rego, but we didn’t have to learn it :)

We created our own version controlled Policy Library.

Using the above learning and some handy pre-existing samples, we wrote policies (Constraints) for GCP infrastructure, specifying a whitelist for locations in which GCS buckets could be deployed.

As mentioned there are dozens upon dozens of samples across BigQuery, IAM, networks, MySQL, Google Kubernetes Engine (GKE) and more to work with.

Of course, we stored these configured Constraints in our version-controlled Policy Library.

  • We looked at a simple set of Terraform code to define a GCS bucket, and stored the Terraform plan to a file before applying it.
  • We ran Forseti’s Terraform Validator against the Terraform plan file, and had the Validator compare the plan to our Policy Library.
  • We saw that the results matched our expectations! Compliance with the location specified in our Constraint passed the Validator’s checks, and non-compliance triggered a violation.

Awesome. And the best part is that all this required no special permissions, no infrastructure for servers or agents and no networking.

All of that comes with the full Forseti suite, which adds inventory taking and config validation of already deployed resources. We might get to that next time.

References:

https://github.com/GoogleCloudPlatform/terraform-validator
https://github.com/forseti-security/policy-library
https://www.openpolicyagent.org/docs/latest/policy-language/
https://cloud.google.com/blog/products/identity-security/using-forseti-config-validator-with-terraform-validator
https://forsetisecurity.org/docs/latest/concepts/

· 12 min read
Jeffrey Aven

This article demonstrates creating a site to site IPSEC VPN connection between a GCP VPC network and an Azure Virtual Network, enabling private RFC1918 network connectivity between virtual networks in both clouds. This is done using a single PowerShell script leveraging Azure PowerShell and gcloud commands in the Google SDK.

Additionally, we will use Azure Private DNS to enable private access between Azure hosts and GCP APIs (such as Cloud Storage or Big Query).

An overview of the solution is provided here:

Azure to GCP VPN Design

One note before starting - site to site VPN connections between GCP and Azure currently do not support dynamic routing using BGP, however creating some simple routes on either end of the connection will be enough to get going.

Let’s go through this step by step:

Step 1 : Authenticate to Azure

Azure’s account equivalent is a subscription; the following command from Azure PowerShell is used to authenticate a user to one or more subscriptions.

Connect-AzAccount

This command will open a browser window prompting you for Microsoft credentials, once authenticated you will be returned to the command line.

Step 2 : Create a Resource Group (Azure)

A resource group is roughly equivalent to a project in GCP. You will need to supply a Location (equivalent to a GCP region):

New-AzResourceGroup `
-Name "azure-to-gcp" `
-Location "Australia Southeast"

Step 3 : Create a Virtual Network with Subnets and Routes (Azure)

An Azure Virtual Network is the equivalent of a VPC network in GCP (or AWS), you must define subnets before creating a Virtual Network. In this example we will create two subnets, one Gateway subnet (which needs to be named accordingly) where the VPN gateway will reside, and one subnet named ‘default’ where we will host VMs which will connect to GCP services over the private VPN connection.

Before defining the default subnet we must create and attach a Route Table (equivalent of a Route in GCP), this particular route will be used to route ‘private’ requests to services in GCP (such as Big Query).

# define route table and route to GCP private access
$azroutecfg = New-AzRouteConfig `
-Name "google-private" `
-AddressPrefix "199.36.153.4/30" `
-NextHopType "VirtualNetworkGateway"

$azrttbl = New-AzRouteTable `
-ResourceGroupName "azure-to-gcp" `
-Name "google-private" `
-Location "Australia Southeast" `
-Route $azroutecfg

# define gateway subnet
$gatewaySubnet = New-AzVirtualNetworkSubnetConfig `
-Name "GatewaySubnet" `
-AddressPrefix "10.1.2.0/24"

# define default subnet
$defaultSubnet = New-AzVirtualNetworkSubnetConfig `
-Name "default" `
-AddressPrefix "10.1.1.0/24" `
-RouteTable $azrttbl

# create virtual network and subnets
$vnet = New-AzVirtualNetwork `
-Name "azure-to-gcp-vnet" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-AddressPrefix "10.1.0.0/16" `
-Subnet $gatewaySubnet,$defaultSubnet

Step 4 : Create Network Security Groups (Azure)

Network Security Groups in Azure are stateful firewalls, much like Firewall Rules in VPC networks in GCP. As with GCP, rules with a lower priority number override rules with higher priority numbers.

In the example we will create several rules to allow inbound ICMP, TCP and UDP traffic from our Google VPC and RDP traffic from the Internet (which we will use to logon to a VM in Azure to test private connectivity between the two clouds):

# create network security group
$rule1 = New-AzNetworkSecurityRuleConfig `
-Name rdp-rule `
-Description "Allow RDP" `
-Access Allow `
-Protocol Tcp `
-Direction Inbound `
-Priority 100 `
-SourceAddressPrefix Internet `
-SourcePortRange * `
-DestinationAddressPrefix * `
-DestinationPortRange 3389

$rule2 = New-AzNetworkSecurityRuleConfig `
-Name icmp-rule `
-Description "Allow ICMP" `
-Access Allow `
-Protocol Icmp `
-Direction Inbound `
-Priority 101 `
-SourceAddressPrefix * `
-SourcePortRange * `
-DestinationAddressPrefix * `
-DestinationPortRange *

$rule3 = New-AzNetworkSecurityRuleConfig `
-Name gcp-rule `
-Description "Allow GCP" `
-Access Allow `
-Protocol Tcp `
-Direction Inbound `
-Priority 102 `
-SourceAddressPrefix "10.2.0.0/16" `
-SourcePortRange * `
-DestinationAddressPrefix * `
-DestinationPortRange *

$nsg = New-AzNetworkSecurityGroup `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-Name "nsg-vm" `
-SecurityRules $rule1,$rule2,$rule3

Step 5 : Create Public IP Addresses (Azure)

We need to create two Public IP Addresses (the equivalent of External IPs in GCP), which will be used for our VPN gateway and for the VM we will create:

# create public IP address for VM
$vmpip = New-AzPublicIpAddress `
-Name "vm-ip" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-AllocationMethod Dynamic

# create public IP address for NW gateway
$ngwpip = New-AzPublicIpAddress `
-Name "ngw-ip" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-AllocationMethod Dynamic

Step 6 : Create Virtual Network Gateway (Azure)

The Virtual Network Gateway is Azure's equivalent of a VPN Gateway, and will be used to create a VPN tunnel between Azure and a GCP VPN Gateway. This gateway will be placed in the Gateway subnet created previously, and one of the Public IP addresses created in the previous step will be assigned to it.

# create virtual network gateway
$ngwipconfig = New-AzVirtualNetworkGatewayIpConfig `
-Name "ngw-ipconfig" `
-SubnetId $gatewaySubnet.Id `
-PublicIpAddressId $ngwpip.Id

# use the AsJob switch as this is a long running process
$job = New-AzVirtualNetworkGateway -Name "vnet-gateway" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-IpConfigurations $ngwipconfig `
-GatewayType "Vpn" `
-VpnType "RouteBased" `
-GatewaySku "VpnGw1" `
-VpnGatewayGeneration "Generation1" `
-AsJob

$vnetgw = Get-AzVirtualNetworkGateway `
-Name "vnet-gateway" `
-ResourceGroupName "azure-to-gcp"

Step 7 : Create a VPC Network and Subnetwork(s) (GCP)

A VPC network and subnet need to be created in GCP, the subnet defines the VPC address space. This address space must not overlap with the Azure Virtual Network CIDR. For all GCP steps it is assumed that the project is set for client config (e.g. gcloud config set project your_project) so it does not need to be specified for each operation. Private Google access should be enabled on all subnets created.

# creating VPC network and subnets
gcloud compute networks create "azure-to-gcp-vpc" `
--subnet-mode=custom `
--bgp-routing-mode=regional

gcloud compute networks subnets create "aus-subnet" `
--network "azure-to-gcp-vpc" `
--range "10.2.1.0/24" `
--region "australia-southeast1" `
--enable-private-ip-google-access

Step 8 : Create an External IP (GCP)

An external IP address will need to be created in GCP which will be used for the external facing interface of the VPN Gateway.

# create external IP
gcloud compute addresses create "ext-gw-ip" `
--region "australia-southeast1"

$gcp_ipaddr_obj = gcloud compute addresses describe "ext-gw-ip" `
--region "australia-southeast1" `
--format json | ConvertFrom-Json

$gcp_ipaddr = $gcp_ipaddr_obj.address

Step 9 : Create Firewall Rules (GCP)

VPC firewall rules will need to be created in GCP to allow VPN traffic as well as SSH traffic from the internet (which allows you to SSH into VM instances using Cloud Shell).

# create VPN firewall rules
gcloud compute firewall-rules create "vpn-rule1" `
--network "azure-to-gcp-vpc" `
--allow tcp,udp,icmp `
--source-ranges "10.1.0.0/16"

gcloud compute firewall-rules create "ssh-rule1" `
--network "azure-to-gcp-vpc" `
--allow tcp:22

Step 10 : Create VPN Gateway and Forwarding Rules (GCP)

Create a VPN Gateway and Forwarding Rules in GCP which will be used to create a tunnel between GCP and Azure.

# create cloud VPN 
gcloud compute target-vpn-gateways create "vpn-gw" `
--network "azure-to-gcp-vpc" `
--region "australia-southeast1" `
--project "azure-to-gcp-project"

# create forwarding rule ESP
gcloud compute forwarding-rules create "fr-gw-name-esp" `
--ip-protocol ESP `
--address "ext-gw-ip" `
--target-vpn-gateway "vpn-gw" `
--region "australia-southeast1" `
--project "azure-to-gcp-project"

# creating forwarding rule UDP500
gcloud compute forwarding-rules create "fr-gw-name-udp500" `
--ip-protocol UDP `
--ports 500 `
--address "ext-gw-ip" `
--target-vpn-gateway "vpn-gw" `
--region "australia-southeast1" `
--project "azure-to-gcp-project"

# creating forwarding rule UDP4500
gcloud compute forwarding-rules create "fr-gw-name-udp4500" `
--ip-protocol UDP `
--ports 4500 `
--address "ext-gw-ip" `
--target-vpn-gateway "vpn-gw" `
--region "australia-southeast1" `
--project "azure-to-gcp-project"

Step 11 : Create VPN Tunnel (GCP Side)

Now we will create the GCP side of our VPN tunnel using the Public IP Address of the Azure Virtual Network Gateway created in a previous step. As this example uses a route-based VPN, the traffic selector values need to be set to 0.0.0.0/0. A PSK (Pre Shared Key) needs to be supplied, which will be the same key used when we configure a VPN Connection on the Azure side of the tunnel.

# get peer public IP address of Azure gateway
$azpubip = Get-AzPublicIpAddress `
-Name "ngw-ip" `
-ResourceGroupName "azure-to-gcp"

# create VPN tunnel
gcloud compute vpn-tunnels create "vpn-tunnel-to-azure" `
--peer-address $azpubip.IpAddress `
--local-traffic-selector "0.0.0.0/0" `
--remote-traffic-selector "0.0.0.0/0" `
--ike-version 2 `
--shared-secret << Pre-Shared Key >> `
--target-vpn-gateway "vpn-gw" `
--region "australia-southeast1" `
--project "azure-to-gcp-project"

Step 12 : Create Static Routes (GCP Side)

As we are using static routing (as opposed to dynamic routing) we will need to define all of the specific routes on the GCP side. We will need to set up routes both for outgoing traffic to the Azure network and for incoming traffic for the restricted Google API range (199.36.153.4/30).

# create static route (VPN)
gcloud compute routes create "route-to-azure" `
--destination-range "10.1.0.0/16" `
--next-hop-vpn-tunnel "vpn-tunnel-to-azure" `
--network "azure-to-gcp-vpc" `
--next-hop-vpn-tunnel-region "australia-southeast1" `
--project "azure-to-gcp-project"

# create static route (Restricted APIs)
gcloud compute routes create apis `
--network "azure-to-gcp-vpc" `
--destination-range "199.36.153.4/30" `
--next-hop-gateway default-internet-gateway `
--project "azure-to-gcp-project"

Step 13 : Create a Local Gateway (Azure)

A Local Gateway in Azure is an object that represents the remote gateway (GCP VPN gateway).

# create local gateway
$azlocalgw = New-AzLocalNetworkGateway `
-Name "local-gateway" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-GatewayIpAddress $gcp_ipaddr `
-AddressPrefix "10.2.0.0/16"

Step 14 : Create a VPN Connection (Azure)

Now we can set up the Azure side of the VPN Connection, which is accomplished by associating the Azure Virtual Network Gateway with the Local Network Gateway. A PSK (Pre Shared Key) needs to be supplied which is the same key used for the GCP VPN Tunnel in step 11.

# create connection
$azvpnconn = New-AzVirtualNetworkGatewayConnection `
-Name "vpn-connection" `
-ResourceGroupName "azure-to-gcp" `
-VirtualNetworkGateway1 $vnetgw `
-LocalNetworkGateway2 $azlocalgw `
-Location "Australia Southeast" `
-ConnectionType IPsec `
-SharedKey << Pre-Shared Key >> `
-ConnectionProtocol "IKEv2"

VPN Tunnel Established!

At this stage we have created an end-to-end connection between the virtual networks in both clouds. You should see this reflected in the respective consoles of each provider.

GCP VPN Tunnel to an Azure Virtual Network
Azure VPN Connection to a GCP VPC Network

Congratulations! You have just set up a multi-cloud environment using private networking. Now let's set up Google Private Access for Azure hosts and create VMs on each side to test our setup.

Step 15 : Create a Private DNS Zone for googleapis.com (Azure)

We will now need to create a Private DNS zone in Azure for the googleapis.com domain which will host records to redirect Google API requests to the Restricted API range.

# create private DNS zone
New-AzPrivateDnsZone `
-ResourceGroupName "azure-to-gcp" `
-Name "googleapis.com"

# Add A Records
$Records = @()
$Records += New-AzPrivateDnsRecordConfig `
-IPv4Address 199.36.153.4
$Records += New-AzPrivateDnsRecordConfig `
-IPv4Address 199.36.153.5
$Records += New-AzPrivateDnsRecordConfig `
-IPv4Address 199.36.153.6
$Records += New-AzPrivateDnsRecordConfig `
-IPv4Address 199.36.153.7

New-AzPrivateDnsRecordSet `
-Name "restricted" `
-RecordType A `
-ResourceGroupName "azure-to-gcp" `
-TTL 300 `
-ZoneName "googleapis.com" `
-PrivateDnsRecords $Records

# Add CNAME Records
$Records = @()
$Records += New-AzPrivateDnsRecordConfig `
-Cname "restricted.googleapis.com."

New-AzPrivateDnsRecordSet `
-Name "*" `
-RecordType CNAME `
-ResourceGroupName "azure-to-gcp" `
-TTL 300 `
-ZoneName "googleapis.com" `
-PrivateDnsRecords $Records

# Create VNet Link
New-AzPrivateDnsVirtualNetworkLink `
-ResourceGroupName "azure-to-gcp" `
-ZoneName "googleapis.com" `
-Name "dns-zone-link" `
-VirtualNetworkId $vnet.Id

Step 16 : Create a Virtual Machine (Azure)

We will create a VM in Azure which we can use to test the VPN tunnel as well as to test Private Google Access over our VPN tunnel.

# create VM
$az_vmlocaladminpwd = ConvertTo-SecureString << Password Param >> `
-AsPlainText -Force
$Credential = New-Object System.Management.Automation.PSCredential ("LocalAdminUser", $az_vmlocaladminpwd);

$nic = New-AzNetworkInterface `
-Name "vm-nic" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-SubnetId $defaultSubnet.Id `
-NetworkSecurityGroupId $nsg.Id `
-PublicIpAddressId $vmpip.Id `
-EnableAcceleratedNetworking `
-Force

$VirtualMachine = New-AzVMConfig `
-VMName "windows-desktop" `
-VMSize "Standard_D4_v3"

$VirtualMachine = Set-AzVMOperatingSystem `
-VM $VirtualMachine `
-Windows `
-ComputerName "windows-desktop" `
-Credential $Credential `
-ProvisionVMAgent `
-EnableAutoUpdate

$VirtualMachine = Add-AzVMNetworkInterface `
-VM $VirtualMachine `
-Id $nic.Id

$VirtualMachine = Set-AzVMSourceImage `
-VM $VirtualMachine `
-PublisherName 'MicrosoftWindowsDesktop' `
-Offer 'Windows-10' `
-Skus 'rs5-pro' `
-Version latest

New-AzVM `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-VM $VirtualMachine `
-Verbose

Step 17 : Create a VM Instance (GCP)

We will create a Linux VM in GCP to test connectivity to hosts in Azure using the VPN tunnel we have established.

# create VM instance
gcloud compute instances create "gcp-instance" `
--zone "australia-southeast1-b" `
--machine-type "f1-micro" `
--subnet "aus-subnet" `
--network-tier PREMIUM `
--maintenance-policy MIGRATE `
--image=debian-9-stretch-v20200309 `
--image-project=debian-cloud `
--boot-disk-size 10GB `
--boot-disk-type pd-standard `
--boot-disk-device-name instance-1 `
--reservation-affinity any

Test Connectivity

Now we are ready to test connectivity from both sides of the tunnel.

Azure to GCP

Establish a remote desktop (RDP) connection to the Azure VM created in Step 16. Ping the GCP VM instance using its private IP address.

Test Private IP Connectivity from Azure to GCP

GCP to Azure

Now SSH into the GCP Linux VM instance and ping the Azure host using its private IP address.

Test Private IP Connectivity from GCP to Azure

Test Private Google Access from Azure

Now that we have established bi-directional connectivity between the two clouds, we can test private access to Google APIs from our Azure host. Follow the steps below to test private access:

  1. RDP into the Azure VM
  2. Install the Google Cloud SDK ( https://cloud.google.com/sdk/)
  3. Perform an nslookup to ensure that calls to googleapis.com resolve to the restricted API range (e.g. nslookup storage.googleapis.com). You should see a response showing the A records from the googleapis.com Private DNS Zone created in step 15.
  4. Now test connectivity to Google APIs, for example test access to Google Cloud Storage using gsutil, or to Big Query using the bq command, as shown in the example below
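A couple of quick smoke tests, assuming the Cloud SDK is installed and authenticated on the Azure VM (both are read-only list operations):

# list GCS buckets in the project over the private connection
gsutil ls

# list Big Query datasets in the project over the private connection
bq ls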

Congratulations! You are now a multi-cloud ninja!

· 5 min read
Jeffrey Aven

In the previous post in this series Spark in the Google Cloud Platform Part 1, we started to explore the various ways in which we could deploy Apache Spark applications in GCP. The first option we looked at was deploying Spark using Cloud DataProc, a managed Hadoop cluster with various ecosystem components included.

In this post, we will look at another option for deploying Spark in GCP – a Spark Standalone cluster running on GKE.

Spark Standalone refers to the in-built cluster manager provided with each Spark release. Standalone can be a bit of a misnomer as it sounds like a single instance – which it is not; standalone simply refers to the fact that it is not dependent upon any other projects or components, such as Apache YARN, Mesos, etc.

A Spark Standalone cluster consists of a Master node or instance and one or more Worker nodes. The Master node serves as both a master and a cluster manager in the Spark runtime architecture.

The Master process is responsible for marshalling resource requests on behalf of applications and monitoring cluster resources.

The Worker nodes host one or many Executor instances which are responsible for carrying out tasks.

Deploying a Spark Standalone cluster on GKE is reasonably straightforward. In the example provided in this post we will set up a private network (VPC), create a GKE cluster, and deploy a Spark Master pod and two Spark Worker pods (in a real scenario you would typically have many Worker pods).

Once the network and GKE cluster have been deployed, the first step is to create Docker images for both the Master and Workers.

The Dockerfile below can be used to create an image capable of running either the Worker or Master daemons:
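A minimal sketch of what such a Dockerfile might look like, assuming a Spark 2.4.x binary release and wrapper scripts named spark-master and spark-worker in the build context (versions and paths are indicative only):

FROM openjdk:8-jdk-slim

ENV SPARK_VERSION=2.4.5
ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# procps provides ps, which the Spark scripts expect to be present
RUN apt-get update && apt-get install -y curl procps && rm -rf /var/lib/apt/lists/*

# download and unpack the Spark binary release
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz \
    | tar -xz -C /opt \
    && ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop2.7 ${SPARK_HOME}

# wrapper scripts used by the K8S deployments to start the Master or Worker daemon
COPY spark-master spark-worker /
RUN chmod +x /spark-master /spark-worker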

Note the shell scripts included in the Dockerfile: spark-master and spark-worker. These will be used later on by K8S deployments to start the relative Master and Worker daemon processes in each of the pods.

Next, we will use Cloud Build to build an image using the Dockerfile and store this in GCR (Google Container Registry); from the Cloud Build directory in our project we will run:

gcloud builds submit --tag gcr.io/spark-demo-266309/spark-standalone

Next, we will create Kubernetes deployments for our Master and Worker pods.

Firstly, we need to get cluster credentials for our GKE cluster named ‘spark-cluster’:

gcloud container clusters get-credentials spark-cluster --zone australia-southeast1-a --project spark-demo-266309

Now from within the k8s-deployments\deploy folder of our project we will use the kubectl command to deploy the Master pod and service, and the Worker pods.

Starting with the Master deployment, this will deploy our Spark Standalone image into a container running the Master daemon process:
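A minimal sketch of such a deployment manifest, assuming the image built earlier and the /spark-master wrapper script as the entrypoint (names and labels are indicative only):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark
      role: master
  template:
    metadata:
      labels:
        app: spark
        role: master
    spec:
      containers:
      - name: spark-master
        image: gcr.io/spark-demo-266309/spark-standalone
        command: ["/spark-master"]
        ports:
        - containerPort: 7077
        - containerPort: 8080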

To deploy the Master, run the following:

kubectl create -f spark-master-deployment.yaml

The Master will expose a web UI on port 8080 and an RPC service on port 7077, so we will need to deploy a K8S service for this; the YAML required to do this is shown here:
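A sketch of such a service manifest might look like this (the ClusterIP type and the spark-master name are assumptions; the selector must match the labels used in the Master deployment):

apiVersion: v1
kind: Service
metadata:
  name: spark-master
spec:
  type: ClusterIP
  selector:
    app: spark
    role: master
  ports:
  - name: rpc
    port: 7077
    targetPort: 7077
  - name: webui
    port: 8080
    targetPort: 8080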

To deploy the Master service, run the following:

kubectl create -f spark-master-service.yaml

Now that we have a Master pod and service up and running, we need to deploy our Workers which are preconfigured to communicate with the Master service.

The YAML required to deploy the two Worker pods is shown here:
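As a sketch, the Worker deployment might look like the following, assuming the /spark-worker wrapper script reads the Master URL from an environment variable (the variable name and ports are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spark
      role: worker
  template:
    metadata:
      labels:
        app: spark
        role: worker
    spec:
      containers:
      - name: spark-worker
        image: gcr.io/spark-demo-266309/spark-standalone
        command: ["/spark-worker"]
        env:
        - name: SPARK_MASTER_URL
          value: spark://spark-master:7077
        ports:
        - containerPort: 8081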

To deploy the Worker pods, run the following:

kubectl create -f spark-worker-deployment.yaml

You can now inspect the Spark processes running on your GKE cluster.

kubectl get deployments

Shows...

NAME           READY   UP-TO-DATE   AVAILABLE   AGE
spark-master   1/1     1            1           7m45s
spark-worker   2/2     2            2           9s
kubectl get pods

Shows...

NAME                            READY   STATUS    RESTARTS   AGE
spark-master-f69d7d9bc-7jgmj    1/1     Running   0          8m
spark-worker-55965f669c-rm59p   1/1     Running   0          24s
spark-worker-55965f669c-wsb2f   1/1     Running   0          24s

Next, as we need to expose the Web UI for the Master process, we will create a LoadBalancer resource. The YAML used to do this is provided here:
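A minimal sketch of such a LoadBalancer service (the name is assumed; the selector must match the Master deployment labels):

apiVersion: v1
kind: Service
metadata:
  name: spark-master-ui
spec:
  type: LoadBalancer
  selector:
    app: spark
    role: master
  ports:
  - name: webui
    port: 8080
    targetPort: 8080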

To deploy the LB, you would run the following:

kubectl create -f spark-ui-lb.yaml

NOTE This is just an example; for simplicity we are creating an external LoadBalancer with a public IP. This configuration is likely not appropriate in most real scenarios; alternatives would include an internal LoadBalancer, restricting Authorized Networks, a jump host, SSH tunnelling or IAP.

Now you’re up and running!

You can access the Master web UI from the Google Console link shown here:

Accessing the Spark Master UI from the Google Cloud Console

The Spark Master UI should look like this:

Spark Master UI

Next we will exec into a Worker pod, get a shell:

kubectl exec -it spark-worker-55965f669c-rm59p -- sh

Now, from within the shell environment of a Worker (which includes all of the Spark client libraries), we will submit a simple Spark application:

spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://10.11.250.98:7077 \
/opt/spark/examples/jars/spark-examples*.jar 10000

You can see the results in the shell, as shown here:

Spark Pi Estimator Example

Additionally, as all of the container logs go to Stackdriver, you can view the application logs there as well:

Container Logs in StackDriver

This is a simple way to get a Spark cluster running; it is not without its downsides and shortcomings however, which include the limited security mechanisms available (SASL, network security, shared secrets).

In the final post in this series we will look at Spark on Kubernetes, using Kubernetes as the Spark cluster manager and interacting with Spark using the Kubernetes API and control plane, see you then.

Full source code for this article is available at: https://github.com/gamma-data/spark-on-gcp

The infrastructure coding for this example uses PowerShell and Terraform, and is deployed as follows:

PS > .\run.ps1 private-network apply <gcp-project> <region>
PS > .\run.ps1 gke apply <gcp-project> <region>

· 6 min read
Jeffrey Aven

I have been an avid Spark enthusiast since 2014 (the early days..). Spark has featured heavily in every project I have been involved with from data warehousing, ETL, feature extraction, advanced analytics to event processing and IoT applications. I like to think of it as a Swiss army knife for distributed processing.

Curiously enough, the first project I had been involved with for some years that did not feature the Apache Spark project was a green field GCP project which got me thinking… where does Spark fit into the GCP landscape?

Unlike the other major providers, who use Spark as the backbone of their managed distributed ETL services with examples such as AWS Glue or the Spark integration runtime option in Azure Data Factory, Google's managed ETL solution is Cloud DataFlow. Cloud DataFlow, which is a managed Apache Beam service, does not use a Spark runtime (there is a Spark Runner, however this is not an option when using CDF). So where does this leave Spark?

My summation is that although Spark is not a first-class citizen in GCP (as far as managed ETL), it is not a second-class citizen either. This article will discuss the various ways Spark clusters and applications can be deployed within the GCP ecosystem.

Quick Primer on Spark

Every Spark application contains several components regardless of deployment mode, the components in the Spark runtime architecture are:

  • the Driver
  • the Master
  • the Cluster Manager
  • the Executor(s), which run on worker nodes or Workers

Each component has a specific role in executing a Spark program and all of the Spark components run in Java virtual machines (JVMs).

Spark Runtime Architecture

Cluster Managers schedule and manage distributed resources (compute and memory) across the nodes of the cluster. Cluster Managers available for Spark include:

  • Standalone
  • YARN (Hadoop)
  • Mesos
  • Kubernetes

Spark on DataProc

This is perhaps the simplest and most integrated approach to using Spark in the GCP ecosystem.

DataProc is GCP's managed Hadoop Service (akin to AWS EMR or HDInsight on Azure). DataProc uses Hadoop/YARN as the Cluster Manager. DataProc clusters can be deployed on a private network (VPC using RFC1918 address space), support encryption at rest using Google-managed or customer-managed keys in KMS, support autoscaling and the use of Preemptible Workers, and can be deployed in an HA config.

Furthermore, DataProc clusters can enforce strong authentication using Kerberos which can be integrated into other directory services such as Active Directory through the use of cross realm trusts.

Deployment

DataProc clusters can be deployed using the gcloud dataproc clusters create command or using IaC solutions such as Terraform. For this article I have included an example in the source code using the gcloud command to deploy a DataProc cluster on a private network which was created using Terraform.
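As an indicative sketch only (cluster name, subnet and machine types are hypothetical), such a gcloud command might look something like this:

gcloud dataproc clusters create spark-dataproc-cluster \
  --region australia-southeast1 \
  --subnet private-subnet \
  --no-address \
  --master-machine-type n1-standard-4 \
  --num-workers 2 \
  --worker-machine-type n1-standard-4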

Integration

The beauty of DataProc is its native integration into IAM and the GCP service plane. Having been a long-time user of AWS EMR, I have found that the usability and integration are in many ways superior in GCP DataProc. Let’s look at some examples...

IAM and IAP (TCP Forwarding)

DataProc is integrated into Cloud IAM using various coarse-grained permissions such as dataproc.clusters.use and simplified IAM Roles such as dataproc.editor or dataproc.admin. Members with bindings to these roles can perform tasks such as submitting jobs and creating workflow templates (which we will discuss shortly), as well as accessing instances such as the master node instance or instances in the cluster using IAP (TCP Forwarding) without requiring a public IP address or a bastion host.

DataProc Jobs and Workflows

Spark jobs can be submitted using the console or via gcloud dataproc jobs submit as shown here:

Submitting a Spark Job using gcloud dataproc jobs submit
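For illustration, a submission might look like this (the cluster name and region are hypothetical; the SparkPi example jar path is the one typically found on DataProc images):

gcloud dataproc jobs submit spark \
  --cluster spark-dataproc-cluster \
  --region australia-southeast1 \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000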

Cluster logs are natively available in StackDriver and standard out from the Spark Driver is visible from the console as well as via gcloud commands.

Complex Workflows can be created by adding Jobs as Steps in Workflow Templates using the following command:

gcloud dataproc workflow-templates add-job spark
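A fuller invocation might look something like this (the template name, step id and job arguments are hypothetical):

gcloud dataproc workflow-templates add-job spark \
  --workflow-template etl-template \
  --step-id compute-pi \
  --region australia-southeast1 \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000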

Optional Components and the Component Gateway

DataProc provides you with a Hadoop cluster including YARN and HDFS, and a Spark runtime – which includes Spark SQL and SparkR. DataProc also supports several optional components including Anaconda, Jupyter, Zeppelin, Druid, Presto, and more.

Web interfaces to some of these components as well as the management interfaces such as the Resource Manager UI or the Spark History Server UI can be accessed through the Component Gateway.

This is a Cloud IAM integrated gateway (much like IAP) which can allow access through an authenticated and authorized console session to web UIs in the cluster – without the need for SSH tunnels, additional firewall rules, bastion hosts, or public IPs. Very cool.

Links to the component UIs, as well as built-in UIs like the YARN Resource Manager UI, are available directly from the console.

Jupyter

Jupyter is a popular notebook application in the data science and analytics communities used for reproducible research. DataProc's Jupyter component provides a ready-made Spark application vector using PySpark. If you have also installed the Anaconda component you will have access to the full complement of scientific and mathematic Python packages such as Pandas and NumPy which can be used in Jupyter notebooks as well. Using the Component Gateway, Jupyter notebooks can be accessed directly from the Google console as shown here:

Jupyter Notebooks using DataProc

From this example you can see that I accessed source data from a GCS bucket and used HDFS as local scratch space.

Furthermore, notebooks are automagically saved in your integrated Cloud Storage DataProc staging bucket and can be shared amongst analysts or accessed at a later time. These notebooks also persist beyond the lifespan of the cluster.

Next up we will look at deploying a Spark Standalone cluster on a GKE cluster, see you then!

· 4 min read
Jeffrey Aven

This article demonstrates Cloud SQL federated queries for Big Query, a neat and simple to use feature.

Connecting to Cloud SQL

One of the challenges presented when using Cloud SQL on a private network (VPC) is providing access to users. There are several ways to accomplish this which include:

  • open the database port on the VPC Firewall (5432 for example for Postgres) and let users access the database using a command line or locally installed GUI tool (may not be allowed in your environment)
  • provide a web based interface deployed on your VPC such as PGAdmin deployed on a GCE instance or GKE pod (adds security and management overhead)
  • use the Cloud SQL proxy (requires additional software to be installed and configured)

In addition, all of the above solutions require direct IP connectivity to the instance, which may not always be available. Furthermore, each of these options requires the user to present some form of authentication – in many cases the database user and password, which then must be managed at an individual level.

Enter Cloud SQL federated queries for Big Query…

Big Query Federated Queries for Cloud SQL

Big Query allows you to query tables and views in Cloud SQL (currently MySQL and Postgres) using the Federated Queries feature. The queries could be authorized views in Big Query datasets for example.

This has the following advantages:

  • Allows users to authenticate and use the GCP console to query Cloud SQL
  • Does not require direct IP connectivity to the user or additional routes or firewall rules
  • Leverages Cloud IAM as the authorization mechanism – rather than unmanaged db user accounts and object level permissions
  • External queries can be executed against a read replica of the Cloud SQL instance to offload query IO from the master instance

Setting it up

Setting up big query federated queries for Cloud SQL is exceptionally straightforward, a summary of the steps are provided below:

Step 1. Enable a Public IP on the Cloud SQL instance

This sounds bad, but it isn’t really that bad. You need to enable a public interface for Big Query to be able to establish a connection to Cloud SQL, however this is not accessed through the actual public internet – rather it is accessed through the Google network using the back end of the front end if you will.

Furthermore, you configure an empty list of authorized networks, which effectively shields the instance from the public network; this can be configured in Terraform as shown here:
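A minimal sketch of the relevant Terraform configuration (the instance name, region and tier are hypothetical) might look like this:

resource "google_sql_database_instance" "replica" {
  name             = "cloud-sql-read-replica"
  region           = "australia-southeast1"
  database_version = "POSTGRES_9_6"
  settings {
    tier = "db-custom-2-7680"
    ip_configuration {
      ipv4_enabled = true
      # no authorized_networks blocks are defined, so the public
      # interface accepts no connections from the internet
    }
  }
}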

This configuration change can be made to a running instance as well as during the initial provisioning of the instance.

As shown below you will get a warning dialog in the console saying that you have no authorized networks - this is by design.

Cloud SQL Public IP Enabled with No Authorized Networks

Step 2. Create a Big Query dataset which will be used to execute the queries to Cloud SQL

Connections to Cloud SQL are defined in a Big Query dataset; this can also be used to control access to Cloud SQL using authorized views controlled by IAM roles.
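For example, a dataset could be created with the bq tool as follows (the project, dataset and location values are hypothetical; the dataset should be in the same location as the Cloud SQL instance):

bq --location=australia-southeast1 mk --dataset my-project:cloudsql_queries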

Step 3. Create a connection to Cloud SQL

To create a connection to Cloud SQL from Big Query you must first enable the BigQuery Connection API, this is done at a project level.

As this is a fairly recent feature, there isn't great coverage with either the bq tool or any of the Big Query client libraries to do this, so we will need to use the console for now...

Under the Resources -> Add Data link in the left hand panel of the Big Query console UI, select Create Connection. You will see a side info panel with a form to enter connection details for your Cloud SQL instance.

In this example I will setup a connection to a Cloud SQL read replica instance I have created:

Creating a Big Query Connection to Cloud SQL

More information on the Big Query Connections API can be found at: https://cloud.google.com/bigquery/docs/reference/bigqueryconnection/rest

The following permissions are associated with connections in Big Query:

bigquery.connections.create  
bigquery.connections.get
bigquery.connections.list
bigquery.connections.use
bigquery.connections.update
bigquery.connections.delete

These permissions are conveniently combined into the following predefined roles:

roles/bigquery.connectionAdmin    (BigQuery Connection Admin)         
roles/bigquery.connectionUser (BigQuery Connection User)

Step 4. Query away!

Now the connection to Cloud SQL can be accessed using the EXTERNAL_QUERY function in Big Query, as shown here:

Querying Cloud SQL from Big Query
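As a sketch (the connection id, schema and column names are hypothetical), an external query looks like this:

SELECT *
FROM EXTERNAL_QUERY(
  "my-project.australia-southeast1.my-cloudsql-connection",
  "SELECT id, customer_name, created_date FROM public.customers;"
);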

· 5 min read
Jeffrey Aven

In this post we will look at read replicas as an additional method to achieve multi zone availability for Cloud SQL, which gives us - in turn - the ability to offload (potentially expensive) IO operations such as user created backups or read operations without adding load to the master instance.

In the previous post in this series we looked at Regional availability for PostgreSQL HA using Cloud SQL:

Google Cloud SQL – Availability, Replication, Failover for PostgreSQL – Part I

Recall that this option was simple to implement and worked relatively seamlessly and transparently with respect to zonal failover.

Now let's look at read replicas in Cloud SQL as an additional measure for availability.

Deploying Read Replica(s)

Deploying read replicas is slightly more involved than simple regional (high) availability, as you will need to define each replica or replicas as a separate Cloud SQL instance which is a slave to the primary instance (the master instance).

An example using Terraform is provided here, starting by creating the master instance:
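A condensed sketch of the master instance definition is shown below (names, region and tier are hypothetical, and many options are omitted for brevity):

resource "google_sql_database_instance" "master" {
  name             = "postgresql-master"
  region           = "australia-southeast1"
  database_version = "POSTGRES_9_6"
  settings {
    tier              = "db-custom-2-7680"
    availability_type = "ZONAL"
    backup_configuration {
      enabled    = true
      start_time = "00:00"
    }
    maintenance_window {
      day  = 7
      hour = 0
    }
  }
}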

Next you would specify one or more read replicas (typically in a zone other than the zone the master is in):
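A sketch of a read replica definition might look like this (again, names and tier are hypothetical):

resource "google_sql_database_instance" "read_replica" {
  name                 = "postgresql-replica-1"
  region               = "australia-southeast1"
  database_version     = "POSTGRES_9_6"
  master_instance_name = google_sql_database_instance.master.name
  replica_configuration {
    failover_target = false
  }
  settings {
    tier              = "db-custom-2-7680"
    availability_type = "ZONAL"
    # note: no backup_configuration or maintenance_window blocks -
    # these operations are not applicable to read replicas
  }
}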

Note that several of the options supplied are omitted when creating a read replica database instance, such as the backup and maintenance options - as these operations cannot be performed on a read replica as we will see later.

Cloud SQL Instances - showing master and replica
Cloud SQL Master Instance

Voila! You have just set up a master instance (the primary instance your application and/or users will connect to) along with a read replica in a different zone which will be asynchronously updated as changes occur on the master instance.

Read Replicas in Action

Now that we have created a read replica, let's see it in action. After connecting to the read replica (like you would any other instance), attempt to access a table that has not been created on the master as shown here:

SELECT operation from the replica instance

Now create the table and insert some data on the master instance:

Create a table and insert a record on the master instance

Now try the select operation on the replica instance:

SELECT operation from the replica instance (after changes have been made on the master)

It works!

Some Points to Note about Cloud SQL Read Replicas

  • Users connect to a read replica as a normal database connection (as shown above)
  • Google managed backups (using the console or gcloud sql backups create .. ) can NOT be performed against replica instances
  • Read replicas can be used to offload IO-intensive operations from the master instance - such as user-managed backup operations (e.g. pg_dump)
pg_dump operation against a replica instance
  • BE CAREFUL Despite their name, read replicas are NOT read-only; updates can be made which will NOT propagate back to the master instance - you could get yourself in an awful mess if you allow users to perform INSERT, UPDATE, DELETE, CREATE or DROP operations against replica instances.

Promoting a Read Replica

If required, a read replica can be promoted to a standalone Cloud SQL instance, which is another DR option. Keep in mind however that as the read replica is updated in an asynchronous manner, promotion of a read replica may result in a loss of data (hopefully not much, but a loss nonetheless). Your application RPO will dictate if this is acceptable or not.

Promotion of a read replica is reasonably straightforward as demonstrated here using the console:

Promoting a read replica using the console

You can also use the following gcloud command:

gcloud sql instances promote-replica <replica_instance_name>

Once you click on the Promote Replica button you will see the following warning:

Promoting a read replica using the console

This simply states that once you promote the replica instance your instance will become an independent instance with no further relationship with the master instance. Once accepted and the promotion process is complete, you can see that you now have two independent Cloud SQL instances (as advertised!):

Promoted Cloud SQL instance

Some of the options you would normally configure with a master instance would need to be configured on the promoted replica instance - such as high availability, maintenance and scheduled backups - but in the event of a zonal failure you would be back up and running with virtually no data loss!

Full source code for this article is available at: https://github.com/gamma-data/cloud-sql-postgres-availability-tutorial

· 5 min read
Jeffrey Aven

In this multi part blog we will explore the features available in Google Cloud SQL for High Availability, Backup and Recovery, Replication and Failover and Security (at rest and in transit) for the PostgreSQL DBMS engine. Some of these features are relatively hot off the press and in Beta – which still makes them available for general use.

We will start by looking at the High Availability (HA) options available to you when using the PostgreSQL engine in Google Cloud SQL.

Most of you would be familiar with the concepts of High Availability, Redundancy, Fault Tolerance, etc but let’s start with a definition of HA anyway:

High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

Wikipedia

"Higher than a normal period" is quite subjective; typically this is quantified by a percentage represented by a number of "9s". For example, 99.99% (which would be quoted as "four nines") would allot you 52.60 minutes of downtime over a one-year period.

Essentially the number of 9’s required will drive your bias towards the options available to you for Cloud SQL HA.

We will start with Cloud SQL HA in its simplest form, Regional Availability.

Regional Availability

Knowing what we know about the Google Cloud Platform, regional availability means that our application or service (in this case Cloud SQL) should be resilient to a failure of any one zone in our region. In fact, as all GCP regions have at least 3 zones – two zones could fail, and our application would still be available.

Regional availability for Cloud SQL (which Google refers to as High Availability), creates a standby instance in addition to the primary instance and uses a regional Persistent Disk resource to store the database instance data, transaction log and other state files, which is synchronously replicated to a Persistent Disk resource local to the zones that the primary and standby instances are located in.

A shared IP address (like a Virtual IP) is used to serve traffic to the healthy (normally primary) Cloud SQL instance.

An overview of Cloud SQL HA is shown here:

Cloud SQL High Availability

Implementing High Availability for Cloud SQL

Implementing Regional Availability for Cloud SQL is dead simple, it is one argument:

availability_type = "REGIONAL"

Using the gcloud command line utility, this would be:

gcloud sql instances create postgresql-instance-1234 \
--availability-type=REGIONAL \
--database-version=POSTGRES_9_6

Using Terraform (with a complete set of options) it would look like:

resource "google_sql_database_instance" "postgres_ha" {
provider = google-beta
region = var.region
project = var.project
name = "postgresql-instance-${random_id.instance_suffix.hex}"
database_version = "POSTGRES_9_6"
settings {
tier = var.tier
disk_size = var.disk_size
activation_policy = "ALWAYS"
disk_autoresize = true
disk_type = "PD_SSD"
**availability_type = "REGIONAL"**
backup_configuration {
enabled = true
start_time = "00:00"
}
ip_configuration {
ipv4_enabled = false
private_network = google_compute_network.private_network.self_link
}
maintenance_window {
day = 7
hour = 0
update_track = "stable"
}
}
}

Once deployed you will notice a few different items in the console, first from the instance overview page you can see that the High Availability option is ENABLED for your instance.

Second, you will see a Failover button enabled on the detailed management view for this instance.

Failover

Failovers and failbacks can be initiated manually or automatically (should the primary be unresponsive). A manual failover can be invoked by executing the command:

gcloud sql instances failover postgresql-instance-1234

There is an --async option which will return immediately, invoking the failover operation asynchronously.

Failover can also be invoked from the Cloud Console using the Failover button shown previously. As an example I have created a connection to a regionally available Cloud SQL instance and started a command which runs a loop and prints out a counter:

Now using the gcloud command shown earlier, I have invoked a manual failover of the Cloud SQL instance.

Once the failover is initiated, the client connection is dropped (as the server is momentarily unavailable):

The connection can be immediately re-established afterwards; the state of the running query is lost - importantly, no data is lost. If your application clients had retry logic in their code and they weren't executing a long-running query, chances are no one would notice! Once reconnected, normal database activities can be resumed:

A quick check of the instance logs will show that the failover event has occurred:

Now when you return to the instance page in the console you will see a Failback button, which indicates that your instance is being served by the standby:

Note that there may be a slight delay in the availability of this option as the replica is still being synched.

It is worth noting that nothing comes for free! When you run in REGIONAL or High Availability mode - you are effectively paying double the costs as compared to running in ZONAL mode. However availability and cost have always been trade-offs against one another - you get what you pay for...

More information can be found at: https://cloud.google.com/sql/docs/postgres/high-availability

Next up we will look at read replicas (and their ability to be promoted) as another high availability alternative in Cloud SQL.

· 5 min read
Jeffrey Aven

There are many posts available which map analogous services between the different cloud providers, but this post attempts to go a step further and map additional concepts, terms, and configuration options to be the definitive thesaurus for cloud practitioners familiar with AWS looking to fast track their familiarisation with GCP.

It should be noted that AWS and GCP are fundamentally different platforms, nowhere is this more apparent than in the way networking is implemented between the two providers, see: GCP Networking for AWS Professionals

This post is focused on the core infrastructure, networking and security services offered by the two major cloud providers; I will do a future post on higher level services such as the ML/AI offerings from the respective providers.

Furthermore, this will be a living post which I will continue to update; I encourage comments from readers on additional mappings, which I will incorporate into the post as well.

I have broken this down into sections based upon the layout of the AWS Console.

Compute

AWS | GCP
EC2 (Elastic Compute Cloud) | GCE (Google Compute Engine)
Availability Zone | Zone
Instance | VM Instance
Instance Family | Machine Family
Instance Type | Machine Type
Amazon Machine Image (AMI) | Image
IAM Role (for an EC2 Instance) | Service Account
Security Groups | VPC Firewall Rules (ALLOW)
Tag | Label
Termination Protection | Deletion Protection
Reserved Instances | Committed Use Discounts
Capacity Reservation | Reservation
User Data | Startup Script
Spot Instances | Preemptible VMs
Dedicated Instances | Sole Tenancy
EBS Volume | Persistent Disk
Auto Scaling Group | Managed Instance Group
Launch Configuration | Instance Template
ELB Listener | URL Map (Load Balancer)
ELB Target Group | Backend/Instance Group
Instance Storage (ephemeral) | Local SSDs
EBS Snapshots | Snapshots
Keypair | SSH Keys
Elastic IP | External IP
Lambda | Google Cloud Functions
Elastic Beanstalk | Google App Engine
Elastic Container Registry (ECR) | Google Container Registry (GCR)
Elastic Container Service (ECS) | Google Kubernetes Engine (GKE)
Elastic Kubernetes Service (EKS) | Google Kubernetes Engine (GKE)
AWS Fargate | Cloud Run
AWS Service Quotas | Allocation Quotas
Account (within an Organisation) | Project
Region | Region
AWS CloudFormation | Cloud Deployment Manager

Storage

AWS | GCP
Simple Storage Service (S3) | Google Cloud Storage (GCS)
Standard Storage Class | Standard Storage Class
Infrequent Access Storage Class | Nearline Storage Class
Amazon Glacier | Coldline Storage Class
Lifecycle Policy | Retention Policy
Tags | Labels
Snowball | Transfer Appliance
Requester Pays | Requester Pays
Region | Location Type/Location
Object Lock | Hold
Vault Lock (Glacier) | Bucket Lock
Multi Part Upload | Parallel Composite Transfer
Cross-Origin Resource Sharing (CORS) | Cross-Origin Resource Sharing (CORS)
Static Website Hosting | Bucket Website Configuration
S3 Access Points | VPC Service Controls
Object Notifications | Pub/Sub Notifications for Cloud Storage
Presigned URL | Signed URL
Transfer Acceleration | Storage Transfer Service
Elastic File System (EFS) | Cloud Filestore
AWS DataSync | Transfer Service for on-premises data
ETag | ETag
Bucket | Bucket
aws s3 | gsutil

Database

AWS | GCP
Relational Database Service (RDS) | Cloud SQL
DynamoDB | Cloud Datastore
ElastiCache | Cloud Memorystore
Table (DynamoDB) | Kind (Cloud Datastore)
Item (DynamoDB) | Entity (Cloud Datastore)
Partition Key (DynamoDB) | Key (Cloud Datastore)
Attributes (DynamoDB) | Properties (Cloud Datastore)
Local Secondary Index (DynamoDB) | Composite Index (Cloud Datastore)
Elastic Map Reduce (EMR) | Cloud DataProc
Athena | Big Query
AWS Glue | Cloud DataFlow
Glue Catalog | Data Catalog
Amazon Simple Notification Service (SNS) | Cloud PubSub (push subscription)
Amazon Kinesis | Cloud PubSub
Amazon Simple Queue Service (SQS) | Cloud PubSub (poll and pull mode)

Networking & Content Delivery

AWS | GCP
Virtual Private Cloud (VPC) (Regional) | VPC Network (Global or Regional)
Subnet (Zonal) | Subnet (Regional)
Route Tables | Routes
Network ACLs (NACLS) | VPC Firewall Rules (ALLOW or DENY)
CloudFront | Cloud CDN
Route 53 | Cloud DNS/Google Domains
Direct Connect | Dedicated (or Partner) Interconnect
Virtual Private Network (VPN) | Cloud VPN
AWS PrivateLink | Google Private Access
NAT Gateway | Cloud NAT
Elastic Load Balancer | Load Balancer
AWS WAF | Cloud Armour
VPC Peering Connection | VPC Network Peering
Amazon API Gateway | Apigee API Gateway
Amazon API Gateway | Cloud Endpoints

Security, Identity, & Compliance

AWS | GCP
Root Account | Super Admin
IAM User | Member
IAM Policy | Role (Collection of Permissions)
IAM Policy Attachment | IAM Role Binding (or IAM Binding)
Key Management Service (KMS) | Cloud KMS
CloudHSM | Cloud HSM
Amazon Inspector (agent based) | Cloud Security Scanner (scan based)
AWS Security Hub | Cloud Security Command Center (SCC)
Secrets Manager | Secret Manager
Amazon Macie | Cloud Data Loss Prevention (DLP)
AWS WAF | Cloud Armour
AWS Shield | Cloud Armour

† No direct equivalent, this is the closest equivalent

· 3 min read
Jeffrey Aven

This article describes the steps to integrate Slack with Google Cloud Functions to get notified about object events within a specified Google Cloud Storage bucket.

Google Cloud Storage Object Notifications using Slack

Events could include the creation of new objects, as well as delete, archive or metadata operations performed on a given bucket.

This pattern could be easily extended to other event sources supported by Cloud Functions including:

  • Cloud Pub/Sub messages
  • Cloud Firestore and Firebase events
  • Stackdriver log entries

More information can be found at https://cloud.google.com/functions/docs/concepts/events-triggers.

The prerequisite steps to configure Slack are provided here:

  1. First you will need to create a Slack app (assuming you have already set up an account and a workspace). The following screenshots demonstrate this process:
Create a Slack app
Give the app a name and associate it with an existing Slack workspace
  2. Next you need to Enable and Activate Incoming Webhooks to your app and add this to your workspace. The following screenshots demonstrate this process:
Enable Incoming Web Hooks for the app
Activate incoming webhooks
Add the webhook to your workspace
  3. Next you need to specify a channel for notifications generated from object events.
Select a channel for the webhook
  4. Now you need to copy the Webhook URL provided; you will use this later in your Cloud Function.
Copy the webhook URL to the clipboard

Treat your webhook URL as a secret; do not upload it to a public source code repository

Next you need to create your Cloud Function, this example uses Python but you can use an alternative runtime including Node.js or Go.

This example templates the source code using the Terraform template_file data source. The function source code is shown here:
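A minimal sketch of what such a function might look like is shown below, assuming the entry point is named gcs_notify, the webhook URL is injected as a template variable, and requests is listed in requirements.txt (all of these names are assumptions for illustration):

import json
import requests

# substituted by the Terraform template_file data source at render time
SLACK_WEBHOOK_URL = "${slack_webhook_url}"

def gcs_notify(data, context):
    # background Cloud Function triggered by a GCS object event
    message = {
        "text": "GCS event '{}' on object gs://{}/{}".format(
            context.event_type, data["bucket"], data["name"])
    }
    response = requests.post(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message),
        headers={"Content-Type": "application/json"})
    response.raise_for_status()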

Within your Terraform code you need to render your Cloud Function code, substituting the slack_webhook_url for its value, which you will supply as a Terraform variable. The rendered template file is then placed in a local directory along with a requirements.txt file and zipped up. The resulting Zip archive is uploaded to a specified bucket where it will be sourced to create the Cloud Function.

Now you need to create the Cloud Function; the following HCL snippet demonstrates this:
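A sketch of the resource is shown here (the function name, the referenced source bucket and archive object resources, and the watched bucket variable are hypothetical):

resource "google_cloudfunctions_function" "gcs_to_slack" {
  name                  = "gcs-object-notifications"
  runtime               = "python37"
  entry_point           = "gcs_notify"
  available_memory_mb   = 128
  source_archive_bucket = google_storage_bucket.function_source.name
  source_archive_object = google_storage_bucket_object.function_archive.name

  event_trigger {
    event_type = "google.storage.object.finalize"
    resource   = var.watched_bucket
  }
}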

The event_trigger block in particular specifies which GCS bucket to watch and what events will trigger invocation of the function. Bucket events include:

  • google.storage.object.finalize (the creation of a new object)
  • google.storage.object.delete
  • google.storage.object.archive
  • google.storage.object.metadataUpdate

You could add additional logic to the Cloud Function code to look for specific object names or naming patterns, but keep in mind the function will fire upon every event matching the event_type and resource criteria.

To deploy the function, you would simply run:

terraform apply -var="slack_webhook_url=https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX"

Now once you upload a file named test-object.txt, voilà!:

Slack notification for a new object created

Full source code is available at: https://github.com/gamma-data/gcs-object-notifications-using-slack

· 3 min read
Jeffrey Aven

At the time of this writing, GCP does not have a generally available non-public facing Layer 7 load balancer. While this is sure to change in the future, this article outlines a design pattern which has been proven to provide scalable and extensible application load balancing services for multiple applications running in Kubernetes pods on GKE.

When you create a service of type LoadBalancer in GKE, Kubernetes hooks into the provider (GCP in this case) on your behalf to create a Google Load Balancer, while this may be specified as INTERNAL, there are two issues:

Issue #1:

The GCP load balancer created for you is a Layer 4 TCP load balancer.

Issue #2:

The normal behaviour is for Google to enumerate all of the node pools in your GKE cluster and "automagically" create mapping GCE instance groups for each node pool for each zone the instances are deployed in. This means the entire surface area of your cluster is exposed to the external network – which may not be optimal for internal applications on a multi-tenanted cluster.

The Solution:

Using Istio deployed on GKE, the Istio Ingress Gateway, and an externally created load balancer, it is possible to get scalable HTTP load balancing along with all the normal ALB goodness (stickiness, path-based routing, host-based routing, health checks, TLS offload, etc.).

An abstract depiction of this architecture is shown here:

Istio Ingress Design Pattern for VPC Native GKE Clusters

This can be deployed with a combination of Terraform and kubectl. The steps to deploy at a high level are:

  1. Create a GKE cluster with at least two node pools: ingress-nodepool and service-nodepool. Ideally create these node pools as multi-zonal for availability. You could create additional node pools for your Egress Gateway or an operations-nodepool to host Istio, etc as well.
  2. Deploy Istio.
  3. Deploy the Istio Ingress Gateway service on the ingress-nodepool using Service type NodePort.
  4. Create an associated Certificate Gateway using server certificates and private keys for TLS offload.
  5. Create a service in the service-nodepool.
  6. Reserve an unallocated static IP address from the node network range.
  7. Create an internal TCP load balancer (an indicative gcloud sketch for this step is shown after this list):
    1. Specify the frontend as the IP address reserved in step 6.
    2. Specify the backend as the managed instance groups created during the node pool creation for the ingress-nodepool (ingress-nodepool-ig-a, ingress-nodepool-ig-b, ingress-nodepool-ig-c).
    3. Specify ports 80 and 443.
  8. Create a GCP Firewall Rule to allow traffic from authorized sources (network tags or CIDR ranges) to a target of the ingress-nodepool network tag.
  9. Create a Cloud DNS A Record for your managed zone as *.namespace.zone pointing to the IP Address assigned to the load balancer frontend in step 7.1.
  10. Enable Health Checks through the GCP firewall to reach the ingress-nodepool network tag at a minimum – however there is no harm in allowing these to all node pools.
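The internal TCP load balancer in step 7 could be created with gcloud along these lines (all names, the reserved IP, the network and the subnet are hypothetical, and the health check is simplified; the remaining instance groups would be added in the same way):

gcloud compute health-checks create tcp ingress-hc \
  --region australia-southeast1 \
  --port 443

gcloud compute backend-services create ingress-backend \
  --load-balancing-scheme internal \
  --protocol tcp \
  --health-checks ingress-hc \
  --health-checks-region australia-southeast1 \
  --region australia-southeast1

gcloud compute backend-services add-backend ingress-backend \
  --instance-group ingress-nodepool-ig-a \
  --instance-group-zone australia-southeast1-a \
  --region australia-southeast1

gcloud compute forwarding-rules create ingress-forwarding-rule \
  --load-balancing-scheme internal \
  --address 10.10.0.10 \
  --ports 80,443 \
  --network my-vpc \
  --subnet my-subnet \
  --backend-service ingress-backend \
  --region australia-southeast1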

The service should then be resolvable and routable from authorized internal networks (peered private VPCs or internal networks connected via VPN or Dedicated Interconnect) as:

https://service.namespace.zone/endpoint

The advantages of this design pattern are...

  1. The Ingress Gateway provides fully functional application load balancing services.
  2. Istio provides service discovery and routing using names and namespaces.
  3. The Ingress Gateway service and ingress gateway node pool can be scaled as required to meet demand.
  4. The Ingress Gateway is multi-zonal for greater availability.

· 4 min read
Jeffrey Aven

GCP and AWS share many similarities, they both provide similar services and both leverage containerization, virtualization and software defined networking.

There are some significant differences when it comes to their respective implementations, networking is a key example of this.

Before we compare and contrast the two different approaches to networking, it is worthwhile noting the genesis of the two major cloud providers.

Google was born to be global, Amazon became global

By no means am I suggesting that Amazon didn't have designs on going global from its beginnings, but AWS was driven (entirely at the beginning) by the needs of the Amazon eCommerce business. Amazon started in the US before expanding into other regions (such as Europe and Australia). In some cases the expansion took decades (Amazon only entered Australia as a retailer in 2018).

Google, by contrast, was providing application, search and marketing services worldwide from its very beginning. GCP, which was used as the vector to deliver these services and applications, was architected around this global model, even though Google's actual data centre expansion may not have been as rapid as AWS's (for example, GCP opened its Australia region 5 years after AWS).

Their respective networking implementations reflect how their respective companies evolved.

AWS is a leader in IaaS, GCP is a leader in PaaS

This is only an opinion and may be argued, however if you look at the chronology of the two platforms, consider this:

  • The first services released by AWS (simultaneously for all intents and purposes) were S3, SQS and EC2
  • The first service released by Google was AppEngine (a pure PaaS offering)

Google has launched and matured their IaaS offerings since as AWS has done similarly with their PaaS offerings, but they started from much different places.

With all of that said, here are the key differences when it comes to networking between the two major cloud providers:

GCP VPCs are Global by default, AWS VPCs are Regional only

This is the first fundamental difference between the two providers. Each GCP project is allocated one VPC network with Subnets in each of the 18 GCP Regions, whereas each AWS Account is allocated one Default VPC in each AWS Region with a Subnet in each AWS Availability Zone for that Region - that is, each account has 17 Default VPCs, one in each of the 17 Regions (excluding GovCloud regions).

Default Global VPC Network in GCP

It is entirely possible to create VPCs in GCP which are Regional, but they are Global by default.

This global tenancy can be advantageous in many cases, but can be limiting in others; for instance, there is a limit of 25 peering connections to any one VPC in GCP, whereas the limit in AWS is 125.

GCP Subnets are Regional, AWS Subnets are Zonal

Subnets in GCP automatically span all Zones in a Region, whereas AWS VPC Subnets are assigned to Availability Zones in a Region. This means you are abstracted from some of the networking and zonal complexity, but you have less control over specific network placement of instances and endpoints. You can infer from this design that Zones are replicated or synchronised within a Region, making them less of a direct consideration for High Availability (or at least not as much of your concern as they otherwise would be).

All GCP Firewall Rules are Stateful

AWS Security Groups are stateful firewall rules – meaning they maintain connection state for inbound connections; AWS also has Network ACLs (NACLs) which are stateless firewall rules. GCP has no direct equivalent of NACLs, however GCP Firewall Rules are more configurable than their AWS counterparts. For instance, GCP Firewall Rules can include Deny actions, which is not an option with AWS Security Group Rules.
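For illustration only (the network name, port and priority are hypothetical), a deny rule might be created as follows:

# deny inbound telnet from anywhere, evaluated before lower-priority allow rules
gcloud compute firewall-rules create deny-telnet \
  --network my-vpc \
  --direction INGRESS \
  --action DENY \
  --rules tcp:23 \
  --source-ranges 0.0.0.0/0 \
  --priority 900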

Load Balancers in GCP are layer 4 (TCP/UDP) unless they are public facing

AWS Application Load Balancers can be deployed in private VPCs with no external IPs attached to them. GCP has Application Load Balancers (Layer 7 load balancers) but only for public facing applications; internal facing load balancers in GCP are Network Load Balancers. This presents some challenges with application level load balancing functionality such as stickiness. There are potential workarounds however, such as running NGINX in GKE behind an internal TCP load balancer.

Firewall rules are at the Network Level not at the Instance or Service Level

There are simple firewall settings available at the instance level; these are limited to allowing HTTP and HTTPS traffic to the instance only and don't allow you to specify sources. Detailed Firewall Rules are set at the GCP VPC Network level and are not attached or associated with instances as they are in AWS.

Hopefully this is helpful for AWS engineers and architects being exposed to GCP for the first time!