
· 2 min read
Jeffrey Aven

I started this blog a few years back to chronicle my journeys through building cloud data platforms, and along the way I gathered some friends to share their experiences as well. The easiest platform to start this blog on was WordPress. This worked, but it wasn't really aligned with the way my colleagues and I worked, and it didn't really align with the types of things we were writing about in blog articles or embracing as general principles, e.g. 'everything-as-code', 'gitops', etc.

Enter Static Site Generators (SSGs) and the Jamstack architecture. Not only does a Jamstack/SSG architecture for a blog site (or docs site, or any other site) allow you to manage every aspect of your web property as code, but a static site has several other benefits including increased performance, easier distribution (using CDNs) and better security (no origin server required), all while being SEO friendly (and optimised in many cases).

But moving a site from WordPress to an SSG must be an onerous task... wrong.

I moved this blog over a weekend, which was quite simple in the end; here are the steps:

  1. Export your WordPress site (Tools -> Export), making sure to select All Content.

  2. Use wordpress-export-to-markdown to convert your posts to a basic Markdown format with frontmatter; it does a pretty good job.

  3. Choose and deploy a Static Site Generator (I chose Docusaurus, but there are several other alternatives available such as VuePress, Jekyll, etc)

  4. Drop your Markdown docs into your SSG content (blogs) directory (converted in step 2)

  5. You will probably need to do some fine tuning as some things may not export cleanly, but 99% of the content will be fine

  6. Deploy your new blog site. I am using GitHub Pages, but you could use anything similar - Netlify, Vercel, DigitalOcean, Azure Static Web Apps, etc. - or implement your own custom CI routine to build your project and push it to an object storage bucket configured to serve a static website (such as Google Cloud Storage or AWS S3); a sketch of one such workflow is shown below this list.
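As an example of such a CI routine, here is a minimal sketch of a GitHub Actions workflow that builds a Docusaurus site and publishes it to GitHub Pages. This is an illustration only - the Node version, build command and the peaceiris/actions-gh-pages publish action are assumptions you would adapt to your own project:

name: deploy-blog

on:
  push:
    branches: [ main ]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: 14
      - name: Build site
        run: |
          npm ci
          npm run build            # Docusaurus writes the static site to ./build
      - name: Publish to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./build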

That's it!

· 3 min read
Mark Stella

Every time I start a new project I try to optimise how the application can work across multiple environments. Those who don't have the luxury of developing everything in Docker containers or isolated spaces will know my pain. How do I write code that can run on my local dev environment, migrate to the shared test and CI environment, and ultimately still work in production?

In the past I tried exotic options like dynamically generating YAML or JSON using Jinja. I then graduated to HOCON, which made my life so much easier. This was until I stumbled across Jsonnet. For those who have not seen this in action, think JSON meets Jinja meets HOCON (a Frankenstein creation that I have actually built in the past).

To get a feel for how it looks, below is a contrived example where I require 3 environments (dev, test and production) that have different paths, databases and vault configuration.

Essentially, when this config is run through the Jsonnet templating engine, it will expect a variable 'ENV' to ultimately refine the environment entry to the one we specifically want to use.

A helpful thing I like to do with my programs is give users a bit of information as to what environments can be used. For me, running a cli that requires args should be as informative as possible - so listing out all the environments is mandatory. I achieve this with a little trickery and a lot of help from the click package!

local exe = "application.exe";

local Environment(prefix) = {
  root: "/usr/" + prefix + "/app",
  path: self.root + "/bin/" + exe,
  database: std.asciiUpper(prefix) + "_DB",
  tmp_dir: "/tmp/" + prefix
};

local Vault = {
  local uri = "http://127.0.0.1:8200/v1/secret/app",
  _: {},
  dev: {
    secrets_uri: uri,
    approle: "local"
  },
  tst: {
    secrets_uri: uri,
    approle: "local"
  },
  prd: {
    secrets_uri: "https://vsrvr:8200/v1/secret/app",
    approle: "sa_user"
  }
};

{
  environments: {
    _: {},
    dev: Environment("dev") + Vault[std.extVar("ENV")],
    tst: Environment("tst") + Vault[std.extVar("ENV")],
    prd: Environment("prd") + Vault[std.extVar("ENV")]
  },
  environment: $["environments"][std.extVar("ENV")],
}

The trick I perform is to have a placeholder entry '_' that I use to initially render the template. I then use the generated JSON file and get all the environment keys so I can feed that directly into click.

from typing import Any, Dict

import click
import json
import _jsonnet
from pprint import pprint

ENV_JSONNET = 'environment.jsonnet'
ENV_PFX_PLACEHOLDER = '_'

def parse_environment(prefix: str) -> Dict[str, Any]:
    _json_str = _jsonnet.evaluate_file(ENV_JSONNET, ext_vars={'ENV': prefix})
    return json.loads(_json_str)

_config = parse_environment(prefix=ENV_PFX_PLACEHOLDER)
_env_prefixes = [k for k in _config['environments'].keys() if k != ENV_PFX_PLACEHOLDER]

@click.command(name="EnvMgr")
@click.option(
    "-e",
    "--environment",
    required=True,
    type=click.Choice(_env_prefixes, case_sensitive=False),
    help="Which environment this is executing on",
)
def cli(environment: str) -> None:
    config = parse_environment(environment)
    pprint(config['environment'])

if __name__ == "__main__":
    cli()

This now allows me to execute the application with both list checking (has the user selected an allowed environment?) and the autogenerated help that click provides.

Below shows running the cli with no arguments:

$> python cli.py
Usage: cli.py [OPTIONS]
Try 'cli.py --help' for help.

Error: Missing option '-e' / '--environment'. Choose from:
        dev,
        prd,
        tst

Executing the application with a valid environment:

$> python cli.py -e dev
{'approle': 'local',
 'database': 'DEV_DB',
 'path': '/usr/dev/app/bin/application.exe',
 'root': '/usr/dev/app',
 'secrets_uri': 'http://127.0.0.1:8200/v1/secret/app',
 'tmp_dir': '/tmp/dev'}

Executing the application with an invalid environment:

$> python cli.py -e prd3
Usage: cli.py [OPTIONS]
Try 'cli.py --help' for help.
Error: Invalid value for '-e' / '--environment': 'prd3' is not one of 'dev', 'prd', 'tst'.

This is only the tip of what Jsonnet can provide; I am continually learning more about the templating engine and the tool.

· 4 min read
Tom Klimovski

So you're using BigQuery (BQ). It's all set up and humming perfectly. Maybe now, you want to run an ELT job whenever a new table partition is created, or maybe you want to retrain your ML model whenever new rows are inserted into the BQ table.

In my previous article on EventArc, we went through how Logging can help us create eventing-type functionality in your application. Let's take it a step further and walk through how we can couple BigQuery and Cloud Run.

In this article you will learn how to

  • Tie together BigQuery and Cloud Run
  • Use BigQuery's audit log to trigger Cloud Run
  • With those triggers, run your required code

Let's go!#

Let's create a temporary dataset within BigQuery named tmp_bq_to_cr.

In that same dataset, let's create a table in which we will insert some rows to test our BQ audit log. Let's grab some rows from a BQ public dataset to create this table:

CREATE OR REPLACE TABLE tmp_bq_to_cr.cloud_run_trigger AS
SELECT date, country_name, new_persons_vaccinated, population
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name='Australia' AND date > '2021-05-31'
LIMIT 100

Following this, let's run an insert query that will help us build our mock database trigger:

INSERT INTO tmp_bq_to_cr.cloud_run_trigger
VALUES('2021-06-18', 'Australia', 3, 1000)

Now, in another browser tab let's navigate to BQ Audit Events and look for our INSERT INTO event:

BQ-insert-event

There will be several audit logs for any given BQ action. Only after a query is parsed does BQ know which table we want to interact with, so the initial log will, for example, not have the table name.

We don't want any old audit log, so we need to ensure we look for a unique set of attributes that clearly identify our action, such as in the diagram above.

In the case of inserting rows, the attributes are a combination of the following (a pared-down example payload is sketched after this list):

  • The method is google.cloud.bigquery.v2.JobService.InsertJob
  • The name of the table being inserted to is the protoPayload.resourceName
  • The dataset id is available as resource.labels.dataset_id
  • The number of inserted rows is protoPayload.metadata.tableDataChange.insertedRowsCount
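Putting those together, here is a pared-down sketch of the shape of the audit log entry we are looking for. The field names come from the list above; the project id and values are illustrative, and real log entries carry many more fields:

event = {
    "resource": {
        "labels": {
            "project_id": "my-project",   # illustrative
            "dataset_id": "tmp_bq_to_cr",
        }
    },
    "protoPayload": {
        "methodName": "google.cloud.bigquery.v2.JobService.InsertJob",
        "resourceName": "projects/my-project/datasets/tmp_bq_to_cr/tables/cloud_run_trigger",
        "metadata": {
            "tableDataChange": {"insertedRowsCount": "1"}
        },
    },
}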

Time for some code#

Now that we've identified the payload that we're looking for, we can write the action for Cloud Run. We've picked Python and Flask to help us in this instance (full code is on GitHub).

First, let's filter out the noise and find the event we want to process

from flask import Flask, request

app = Flask(__name__)

@app.route('/', methods=['POST'])
def index():
    # Gets the Payload data from the Audit Log
    content = request.json
    try:
        ds = content['resource']['labels']['dataset_id']
        proj = content['resource']['labels']['project_id']
        tbl = content['protoPayload']['resourceName']
        rows = int(content['protoPayload']['metadata']
                   ['tableDataChange']['insertedRowsCount'])
        # only act on inserts into the dataset/table we created earlier
        if ds == 'tmp_bq_to_cr' and \
           tbl.endswith('tables/cloud_run_trigger') and rows > 0:
            query = create_agg()
            return "table created", 200
    except:
        # if these fields are not in the JSON, ignore
        pass
    return "ok", 200

Now that we've found the event we want, let's execute the action we need. In this example, we'll aggregate and write out to a new table created_by_trigger:

from google.cloud import bigquery

def create_agg():
    client = bigquery.Client()
    query = """
    CREATE OR REPLACE TABLE tmp_bq_to_cr.created_by_trigger AS
    SELECT
      country_name, SUM(new_persons_vaccinated) AS n
    FROM tmp_bq_to_cr.cloud_run_trigger
    GROUP BY country_name
    """
    client.query(query)
    return query

The Dockerfile for the container is simply a basic Python container into which we install Flask and the BigQuery client library:

FROM python:3.9-slim

RUN pip install Flask==1.1.2 gunicorn==20.0.4 google-cloud-bigquery

ENV APP_HOME /app
WORKDIR $APP_HOME
COPY *.py ./

CMD exec gunicorn --bind :$PORT main:app

Now we Cloud Run#

Build the container and deploy it using a couple of gcloud commands:

SERVICE=bq-cloud-run
PROJECT=$(gcloud config get-value project)
CONTAINER="gcr.io/${PROJECT}/${SERVICE}"
gcloud builds submit --tag ${CONTAINER}
gcloud run deploy ${SERVICE} --image $CONTAINER --platform managed

I always forget about the permissions#

In order for the trigger to work, the Cloud Run service account will need the following permissions:

gcloud projects add-iam-policy-binding $PROJECT \
    --member="serviceAccount:service-${PROJECT_NO}@gcp-sa-pubsub.iam.gserviceaccount.com" \
    --role='roles/iam.serviceAccountTokenCreator'

gcloud projects add-iam-policy-binding $PROJECT \
    --member=serviceAccount:${SVC_ACCOUNT} \
    --role='roles/eventarc.admin'

Finally, the event trigger#

gcloud eventarc triggers create ${SERVICE}-trigger \
  --location ${REGION} --service-account ${SVC_ACCOUNT} \
  --destination-run-service ${SERVICE} \
  --event-filters type=google.cloud.audit.log.v1.written \
  --event-filters methodName=google.cloud.bigquery.v2.JobService.InsertJob \
  --event-filters serviceName=bigquery.googleapis.com

Important to note here is that we're triggering on any insert log created by BQ. That's why, in this action, we had to filter these events based on the payload.

Take it for a spin

Now, try out the BigQuery -> Cloud Run trigger and action. Go to the BigQuery console and insert a row or two:

INSERT INTO tmp_bq_to_cr.cloud_run_trigger
VALUES('2021-06-18', 'Australia', 5, 25000)

Watch as a new table called created_by_trigger gets created! You have successfully triggered a Cloud Run action on a database event in BigQuery.

Enjoy!

· 3 min read
Jeffrey Aven

Azure Static Web Apps is a relatively new feature in the Azure estate which has recently become generally available, so I thought I would take it for a test drive and discuss my findings.

I am a proponent of the JAMStack architecture for front end applications and a user of CD enabled CDN services like Netlify, so this Azure feature was naturally appealing to me.

Azure SWAs allow you to serve static assets (like JavaScript) without an origin server, meaning you don't need a web server, can streamline content distribution and web app performance, and can reduce the attack surface area of your application.

The major advantage to using SWAs is simplicity: there are no scaffolding or infrastructure requirements, and the service is seamlessly integrated into your CI/CD processes (natively if you are using GitHub).

Deploying Static Web Apps in Azure#

Setup is pretty simple; aside from a name and a resource group, you just need to supply:

  • a location (the Azure region used for serverless back-end APIs via Azure Function Apps); note that this is not necessarily the location where the static web app is served from
  • a GitHub or Azure DevOps repo URL
  • the branch you wish to use to trigger production deployments (e.g. main)
  • a path to your app code within your repo (e.g. where your package.json file is located)
  • an output folder (e.g. dist); this should not exist in your repo
  • a project or personal access token for your GitHub account (alternatively you can perform an interactive OAuth 2.0 consent if using the portal)

An example is shown here:

GitHub Actions#

Using the consent provided (either using the OAuth flow or by providing a token), Azure Static Web Apps will automagically create the GitHub Actions workflow to deploy your application on a push or merge event to your repo. This includes providing scoped API credentials to Azure to allow access to the Static Web App resource using secrets in GitHub (which are created automagically as well). An example workflow is shown here:
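For reference, a trimmed-down sketch of what the generated workflow typically looks like is shown below. Your generated file will differ - the secret name, branch, app_location and output_location values here are illustrative:

name: Azure Static Web Apps CI/CD

on:
  push:
    branches: [ main ]
  pull_request:
    types: [opened, synchronize, reopened, closed]
    branches: [ main ]

jobs:
  build_and_deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and Deploy
        uses: Azure/static-web-apps-deploy@v1
        with:
          # scoped deployment token created by Azure and stored as a repo secret
          azure_static_web_apps_api_token: ${{ secrets.AZURE_STATIC_WEB_APPS_API_TOKEN }}
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          action: "upload"
          app_location: "/"         # path to your app code (illustrative)
          output_location: "dist"   # build output folder (illustrative)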

Preview or Staging Releases#

Similar to the functionality in analogous services like Netlify, you can configure preview releases of your application to be deployed from specified branches on pull request events.

Routes and Authorization#

Routes (for SPAs) need to be provided to Azure by using a file named staticwebapp.config.json located in the application root of your repo (at the same level as your package.json file). You can also specify response codes and whether the route requires authentication, as shown here:
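As a rough illustration, a staticwebapp.config.json might look something like the following; the routes, roles and status codes here are hypothetical, so check the Azure docs for the full schema:

{
  "routes": [
    { "route": "/admin/*", "allowedRoles": ["administrators"] },
    { "route": "/api/*", "allowedRoles": ["authenticated"] },
    { "route": "/legacy", "redirect": "/", "statusCode": 301 }
  ],
  "navigationFallback": {
    "rewrite": "/index.html",
    "exclude": ["/images/*", "/css/*"]
  },
  "responseOverrides": {
    "404": { "rewrite": "/404.html" }
  }
}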

Pros#

  • Globally distributed CDN
  • Increased security posture, reduced attack surface area
  • Simplified architecture and deployment
  • No App Service Plan required – cost reduction
  • Enables Continuous Deployment – incl preview/staging environments
  • TLS and DNS can be easily configured for your app

Cons#

  • Serverless API locations are limited
  • Integration with other VCS/CI/CD systems like GitLab would need to be custom built (GitHub and Azure DevOps are integrated)

Overall, this is a good feature for deploying SPAs or PWAs in Azure.

· 10 min read
Tom Klimovski

Metadata Hub (MDH) is intended to be the source of truth for metadata around the Company’s platform. It has the ability to load metadata configuration from yaml and serve that information up via API. It will also store pipeline information while files are ingested into the platform.

Key philosophies:#

Config-driven: anyone who has been authorized to do so should be able to add another ‘table-info.yaml’ into MDH without the need to update any code in the system.
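As an illustration of that philosophy, a hypothetical ‘table-info.yaml’ could look something like the following. The field names mirror the table/column/classification attributes used in the working example later in this post; the actual MDH schema may differ:

table:
  table_id: broker
  table_type: snapshot
  notes: describes mortgage brokers
  columns:
    - column_id: broker_legal_name
      data_type: string
      size: 20
      nullable: 1
      classifications:
        - classification_id: classif_id_REQ_01
          restriction_level: public
          pii: 0
          if: greater than 90 days
    - column_id: broker_short_code
      data_type: string
      size: 3
      nullable: 1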

Here’s how table information makes its way into MDH:

Metadata Hub

Paths#

/tables
  get:
    summary: All tables in MDH
    description: get the title of all tables that exist in MDH
  post:
    summary: Creates a new table in MDH
    description: Creates a new table in MDH
/tables/{id}
  get:
    summary: Obtain information about specific table
/tables/{id}/columns
  get:
    summary: All columns for a particular table
    description: Obtain information on columns for a particular table
/run
  get:
    summary: All information about a particular end-to-end batch run of file ingestion
  post:
    summary: Update metadata on a batch load
    description: Update metadata on a batch load
/calendar
  get:
    summary: Use this to save on calculation of business days.
    description: This base response gives you today's date in a string
/calendar/previousBusinessDay
  get:
    summary: Will return a string of the previous business day
    description: Will return a string of the previous business day, based on the date on when it's called
/calendar/nextBusinessDay
  get:
    summary: Will return a string of the next business day
    description: Will return a string of the next business day, based on the date on when it's called

Yaml to Datastore - Entity/Kind design

Datastore Primer#

Before we jump right into Entity Groups in Datastore, it is important to first go over the basics and establish a common vocabulary. Datastore holds entities, which are objects that can contain various key/value pairs, called properties. Each entity must contain a unique identifier, known as a key. When creating an entity, a user can choose to specify a custom key or let Datastore create a key. If a user decides to specify a custom key, it will contain two fields: a kind, which represents a category such as ‘Toy’ or ‘Marital Status’, and a name, which is the identifying value. If a user decides to only specify a kind when creating a key, and does not specify a unique identifier, Datastore automatically generates an ID behind the scenes. Below is an example of a Python3 script which illustrates this identifier concept.

from google.cloud import datastore
client = datastore.Client()

# Custom key - specify my kind=table and a unique_id of broker
custom_key_entry = datastore.Entity(client.key("table", "broker"))
client.put(custom_key_entry)

# Only specify kind=table, let Datastore generate the unique_id
datastore_gen_key_entry = datastore.Entity(client.key("table"))
client.put(datastore_gen_key_entry)

In your GCP Console under Datastore, you will then see your two entities of kind “table”. One will contain your custom key and one will contain the automatically generated key.

Ancestors and Entity Groups

For highly related or hierarchical data, Datastore allows entities to be stored in a parent/child relationship. This is known as an entity group or ancestor/descendent relationship.

Entity Group#

erd

This is an example of an entity group with kinds of types: table, column, and classification. The ‘Grandparent’ in this relationship is the ‘table’. In order to configure this, one must first create the table entity. Then, a user can create a column, and specify that the parent is a table key. In order to create the grandchild, a user then creates a classification and sets its parent to be a column key. To further add customizable attributes, a user can specify additional key-value pairs such as pii and data_type. These key-value pairs are stored as properties. We model this diagram in Datastore in our working example below.

One can create entity groups by setting the ‘parent’ parameter while creating an entity key for a child. This command adds the parent key to be part of the child entity key. The child’s key is represented as a tuple (‘parent_key’, ‘child_key’), such that the parent’s key is the prefix of the key, which is followed by its own unique identifier. For example, following the diagram above:

table_key = datastore_client.key("table", "broker")
column_key = datastore_client.key("column", "broker_legal_name", parent=table_key)

Printing the variable column_key will display: ("table", "broker", "column", "broker_legal_name")

Datastore also supports chaining of parents, which can lead to very large keys for descendants with a long lineage of ancestors. Additionally, parents can have multiple children (representing a one-to-many relationship). However, there is no native support for entities to have multiple parents (representing a many-to-many relationship). Once you have configured this ancestral hierarchy, it is easy to retrieve all descendants for a given parent. You can do this by querying on the parent key using the ‘ancestor’ parameter. For example, given the entity table_key created above, I can query for all of the table’s columns:

my_query = client.query(kind="column", ancestor=table_key)

A Full Working Example for MDH

As per our Key Philosophies - Config-Driven - anyone should be able to add a new table to be processed and landed in a target-table somewhere within MDH with our yaml syntax. Below is a full working python3 example of the table/column/classification hierarchical model described above.

from google.cloud import datastore
datastore_client = datastore.Client()
# Entities with kinds - table, column, classification
my_entities = [
    {"kind": "table", "table_id": "broker", "table_type": "snapshot",
     "notes": "describes mortgage brokers"},
    {"kind": "column", "column_id": "broker_legal_name", "table_id": "broker",
     "data_type": "string", "size": 20, "nullable": 1},
    {"kind": "column", "column_id": "broker_short_code", "table_id": "broker",
     "data_type": "string", "size": 3, "nullable": 1},
    {"kind": "classification", "classification_id": "classif_id_REQ_01",
     "restriction_level": "public", "pii": 0, "if": "greater than 90 days",
     "column_id": "broker_legal_name", "table_id": "broker"},
    {"kind": "classification", "classification_id": "classif_id_REQ_03",
     "restriction_level": "restricted", "pii": 0, "if": "less than 90 days",
     "column_id": "broker_legal_name", "table_id": "broker"},
    {"kind": "classification", "classification_id": "classif_id_REQ_214",
     "restriction_level": "public", "pii": 0, "column_id": "broker_short_code",
     "table_id": "broker"},
]

# traverse my_entities, set parents and add those to datastore
for entity in my_entities:
    kind = entity['kind']
    parent_key = None
    if kind == "column":
        parent_key = datastore_client.key("table", entity["table_id"])
    elif kind == "classification":
        parent_key = datastore_client.key("table", entity["table_id"],
                                          "column", entity["column_id"])

    key = datastore_client.key(kind, entity[kind + "_id"], parent=parent_key)
    datastore_entry = datastore.Entity(key)
    datastore_entry.update(entity)

    print("Saving: {}".format(entity))

    datastore_client.put(datastore_entry)

The code above assumes that you’ve set yourself up with a working Service Account or have authorised yourself another way, and that your GCP project has been set.

Now let’s do some digging around our newly minted Datastore model. Let’s grab the column ‘broker_legal_name’

query1 = datastore_client.query(kind="column")
query1.add_filter("column_id", "=", "broker_legal_name")

Now that we have the column entity, let’s locate its parent id.

column = list(query1.fetch())[0]
print("This column belongs to: " + str(column.key.parent.id_or_name))

Further to this, we can also get all data classification elements attributed to a single column using the ancestor clause query.

query2 = datastore_client.query(kind="classification", ancestor=column.key)
for classification in list(query2.fetch()):
    print(classification.key)
    print(classification["restriction_level"])

For more complex queries, Datastore has the concept of indexes being set, usually via its index.yaml configuration. The following is an example of an index.yaml file:

indexes:
  - kind: Cat
    ancestor: no
    properties:
      - name: name
      - name: age
        direction: desc

  - kind: Cat
    properties:
      - name: name
        direction: asc
      - name: whiskers
        direction: desc

  - kind: Store
    ancestor: yes
    properties:
      - name: business
        direction: asc
      - name: owner
        direction: asc

Indexes are important when attempting to add filters on more than one particular attribute within a Datastore entity. For example, the following code will fail:

# Adding a '>' filter will cause this to fail. Sidenote; it will work
# without an index if you add another '=' filter.
query2 = datastore_client.query(kind="classification", ancestor=column.key)
query2.add_filter("pii", ">", 0)
for classification in list(query2.fetch()):
    print(classification.key)
    print(classification["classification_id"])

To rectify this issue, you need to create an index.yaml that looks like the following:

indexes:
  - kind: classification
    ancestor: yes
    properties:
      - name: pii

You would usually upload the yaml file using the gcloud command:

gcloud datastore indexes create path/to/index.yaml

However, let’s do this programmatically.

The official pypi package for google-cloud-datastore can be found here: https://pypi.org/project/google-cloud-datastore/. At the time of writing, Firestore in Datastore-mode will be the way forward, as per the release note from January 31, 2019.

Cloud Firestore is now Generally Available. Cloud Firestore is the new version of Cloud Datastore and includes a backwards-compatible Datastore mode.

If you intend to use the Cloud Datastore API in a new project, use Cloud Firestore in Datastore mode. Existing Cloud Datastore databases will be automatically upgraded to Cloud Firestore in Datastore mode.

Except where noted, the Cloud Datastore documentation now describes behavior for Cloud Firestore in Datastore mode.

We’ve purposefully created MDH in Datastore to show you how it was done originally, and we’ll be migrating the Datastore code to Firestore in an upcoming post.

Creating and deleting indexes within Datastore will need to be done through the REST API via googleapiclient.discovery, as this function doesn’t exist via the google-cloud-datastore API. Working with the discovery api client can be a bit daunting for a first-time user, so here’s the code to add an index on Datastore:

import os

from google.oauth2 import service_account
from googleapiclient.discovery import build
from google.cloud import datastore

SCOPES = ['https://www.googleapis.com/auth/cloud-platform']

SERVICE_ACCOUNT_FILE = os.getenv('GOOGLE_APPLICATION_CREDENTIALS')
PROJECT_ID = os.getenv("PROJECT_ID")

credentials = (service_account
               .Credentials
               .from_service_account_file(SERVICE_ACCOUNT_FILE, scopes=SCOPES))

datastore_api = build('datastore', 'v1', credentials=credentials)

body = {
    'ancestor': 'ALL_ANCESTORS',
    'kind': 'classification',
    'properties': [{
        'name': 'pii',
        'direction': 'DESCENDING'
    }]
}

response = (datastore_api.projects()
            .indexes()
            .create(projectId=PROJECT_ID, body=body)
            .execute())

How did we craft this API request? We can use the Google API Discovery Service to build client libraries, IDE plugins, and other tools that interact with Google APIs. The Discovery API provides a list of Google APIs and a machine-readable "Discovery Document" for each API. Features of the Discovery API:

  • A directory of supported APIs schemas based on JSON Schema.
  • A machine-readable "Discovery Document" for each of the supported APIs. Each document contains:
      • A list of API methods and available parameters for each method.
      • A list of available OAuth 2.0 scopes.
      • Inline documentation of methods, parameters, and available parameter values.

Navigating to the API reference page for Datastore and going to the ‘Datastore Admin’ API page, we can see references to the Indexes and RESTful endpoints we can hit for those Indexes. Therefore, looking at the link for the Discovery document for Datastore:

https://datastore.googleapis.com/$discovery/rest?version=v1

From this, we can build out our instantiation of the Google API discovery object: build('datastore', 'v1', credentials=credentials).
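If you want to poke around the Discovery Document itself, the following small sketch (standard library only) fetches it and lists the methods exposed under projects.indexes; the nesting of 'resources' and 'methods' follows the Discovery document format:

import json
import urllib.request

# Fetch the machine-readable Discovery Document for the Datastore API
DISCOVERY_URL = "https://datastore.googleapis.com/$discovery/rest?version=v1"

with urllib.request.urlopen(DISCOVERY_URL) as resp:
    discovery_doc = json.loads(resp.read())

# Drill into the nested resources to list the index-related methods
index_methods = discovery_doc["resources"]["projects"]["resources"]["indexes"]["methods"]
print(list(index_methods.keys()))   # e.g. ['create', 'delete', 'get', 'list']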

With respect to building out the body aspect of the request, I’ve found crafting that part within the ‘Try this API’ section of https://cloud.google.com/datastore/docs/reference/admin/rest/v1/projects.indexes/create pretty valuable.

With this code, your index should show up in your Datastore console! You can also retrieve them within gcloud with gcloud datastore indexes list if you’d like to verify the indexes outside our python code. So there you have it: a working example of entity groups, ancestors, indexes and Metadata within Datastore. Have fun coding!