Enumerating all roles for a user in Snowflake

Snowflake allows roles to be assigned to other roles, so when a user is assigned to a role, they may inherit the ability to use countless other roles.

Challenge: recursively enumerate all roles for a given user

One solution would be to create a complex query on the “SNOWFLAKE"."ACCOUNT_USAGE"."GRANTS_TO_ROLES" object.

An easier solution is to use a stored procedure to recurse through grants for a given user and return an ARRAY of roles for that user.

This is a good programming exercise in tail call recursion (sort of) in JavaScript. Here is the code:

To call the stored proc, execute:

One drawback of stored procedures in Snowflake is that they can only have scalar or array return types and cannot be used directly in a SQL query, however you can use the table(result_scan(last_query_id())) trick to get around this, as shown below where we will pivot the ARRAY into a record set with the array elements as rows:

IMPORTANT

This query must be the next statement run immediately after the CALL statement and cannot be run again until you run another CALL statement.

More adventures with Snowflake soon!

EventArc: The state of eventing in Google Cloud

When defining event-driven architectures, it’s always good to keep up with how the landscape is changing. How do you connect microservices in your architecture? Is Pub/Sub the end-game for all events? To dive a bit deeper, let’s talk through the benefits of having a single orchestrator, or perhaps a choreographer is better?

Orchestration versus choreography refresher

My colleague @jeffreyaven did a recent post explaining this concept in simple terms, which is worth reviewing, see:

Should there really be a central orchestrator controlling all interactions between services…..or, should each service work independently and only interact through events?

  • Orchestration is usually viewed as a domain-wide central service that defines the flow and control of communication between services. In this paradigm, in becomes easier to change and ultimately monitor policies across your org.
  • Choreography has each service registering and emitting events as they need to. It doesn’t direct or define the flow of communication, but using this method usually has a central broker passing around messages and allows services to be truly independent.

Enter Workflows, which is suited for centrally orchestrated services. Not only Google Cloud service such as Cloud Functions and Cloud Run, but also external services.

How about choreography? Pub/Sub and Eventarc are both suited for this. We all know and love Pub/Sub, but how do I use EventArc?

What is Eventarc?

Announced in October-2020, it was introduced as eventing functionality that enables you, the developer, to send events to Cloud Run from more than 60 Google Cloud sources.

But how does it work?

Eventing is done by reading those sweet sweet Audit Logs, from various sources, and sending them to Cloud Run services as events in Cloud Events format. Quick primer on Cloud Events: its a specification for describing event data in a common way. The specification is now under the Cloud Native Computing Foundation. Hooray! It can also read events from Pub/Sub topics for custom applications. Here’s a diagram I graciously ripped from Google Cloud Blog:

Eventarc

Why do I need Eventarc? I have the Pub/Sub

Good question. Eventarc provides and easier path to receive events not only from Pub/Sub topics but from a number of Google Cloud sources with its Audit Log and Pub/Sub integration. Actually, any service that has Audit Log integration can be an event source for Eventarc. Beyond easy integration, it provides consistency and structure to how events are generated, routed and consumed. Things like:

Triggers

They specify routing rules from events sources, to event sinks. Listen for new object creation in GCS and route that event to a service in Cloud Run by creating an Audit-Log-Trigger. Create triggers that also listen to Pub/Sub. Then list all triggers in one, central place in Eventarc:

gcloud beta eventarc triggers list

Consistency with eventing format and libraries

Using the CloudEvent-compliant specification will allow for event data in a common way, increasing the movement towards the goal of consistency, accessibility and portability. Makes it easier for different languages to read the event and Google Events Libraries to parse fields.

This means that the long-term vision of Eventarc to be the hub of events, enabling a unified eventing story for Google Cloud and beyond.

Eventarc producers and consumers

In the future, you can excpect to forego Audit Log and read these events directly and send these out to even more sinks within GCP and any HTTP target.


This article written on inspiration from https://cloud.google.com/blog/topics/developers-practitioners/eventarc-unified-eventing-experience-google-cloud. Thanks Mete Atamel!

Microservices Concepts: Orchestration versus Choreography

One of the foundational concepts in microservices architecture and design patterns is the concept of Orchestration versus Choreography. Before we look at a reference implementation of each of these patterns, it is worthwhile starting with an analogy.

This is often likened to a Jazz band versus a Symphony Orchestra.

A modern symphony orchestra is normally comprised of sections such as strings, brass, woodwind and percussion sections. The sections are orchestrated by a conductor, usually placed at a central point with respect to each of the sections. The conductor instructs each section to perform their components of the overall symphony.

By contrast, a Jazz band does not have a conductor and also features improvisation, with different musicians improvising based upon each other. Choreography, although more aligned to dance, can involve improvisation. In both cases there is still an intended output and a general framework as to how the composition will be executed, however unlike a symphony orchestra there is a degree of spontaneity.

Now back to technology and microservices…

In the Orchestration model, there is a central orchestration service which controls the interactions between other services, in other words the flow and control of communication and/or message passing between services is controlled by an orchestrator (much like the conductor in a symphony orchestra). On the plus side, this model enables easier monitoring and policy enforcement across the system. A generalisation of the Orchestration model is shown below:

Orchestration model

By contrast, in the Choreography model, each service works independently and interacts with other services through events. In this model each service registers and emits events as they need to. The flow (of communication between services) is not predefined, much like a Jazz band. This model often includes a central broker for message passing between services, but the services operate independently of each other and are not controlled by a central service (an orchestrator). A generalisation of the Choreography model is shown below:

Choreography model

We will post subsequent articles with implementations of these patterns, but it is worthwhile getting a foundational understanding first.

Using the Azure CLI to Create an API using a Function App within API Management

Function Apps, Logic Apps and App Services can be used to expose APIs within Azure API Management which is an easy way to deploy serverless microservices. You can see this capability in the Azure portal below within API Management:

Add a new API using a Function App as a back end

Like most readers, I like to script everything, so I was initially frustrated when I couldn’t replicate this operation in the Azure cli, REST, PowerShell, or any of the other SDKs or IaC tools. Others shared my frustration as seen here.

I was nearly resigned to using click ops in the portal (arrrgh) before I worked out this workaround.

The Solution

There is a bit more prep work required to automate this process, but it is well worth it.

1. Create an OpenApi (or Swagger spec or WADL) specification document, as seen below (use the absolute URL for your Function App in the url parameter):

2. Use the az apim api import function (not the az apim api create function), as shown here:

3. Associate the API with a product (which is how you can rate limit APIs)

That’s it! You can now access your function via the API gateway using the gateway url or via the developer portal as seen below:

Function App API in API Management in the Azure Portal
Function App API in the Dev Portal

Great Expectations (for your data…)

This article provides an introduction to the Great Expectations Python library for data quality management (https://github.com/great-expectations/great_expectations).

So what are expectations when it comes to data (and data quality)…

An expectation is a falsifiable, verifiable statement about data. Expectations provide a language to talk about data characteristics and data quality – humans to humans, humans to machines and machines to machines.

The great expectations project includes predefined, codified expectations such as:

expect_column_to_exist 
expect_table_row_count_to_be_between 
expect_column_values_to_be_unique 
expect_column_values_to_not_be_null 
expect_column_values_to_be_between 
expect_column_values_to_match_regex 
expect_column_mean_to_be_between 
expect_column_kl_divergence_to_be_less_than

… and many more

Expectations are both data tests and docs! Expectations can be presented in a machine-friendly JSON, for example:

Great Expectations provides validation results of defined expectations, which can dramatically shorten your development cycle.

validation results in great expectations

Nearly 50 built in expectations allow you to express how you understand your data, and you can add custom expectations if you need a new one. A machine can test if a dataset conforms to the expectation.

OK, enough talk, let’s go!

pyenv virtualenv 3.8.2 ge38
pip install great-expectations

tried with Python 3.7.2, but had issues with library lgzm on my local machine

once installed, run the following in the python repl shell:

showing the data in the dataframe should give you the following:

as can be seen, a collection of random integers in each column for our initial testing. Let’s pipe this data in to great-expectations…

yields the following output…

this shows that there are 0 unexpected items in the data we are testing. Great!

Now let’s have a look at a negative test. Since we’ve picked the values at random, there are bound to be duplicates. Let’s test that:

yields…

The JSON schema has metadata information about the result, of note is the result section which is specific to our query, and shows the percentage that failed the expectation.

Let’s progress to something more real-world, namely creating exceptions that are run on databases. Armed with our basic understanding of great-expectations, let’s…

  • set up a postgres database
  • initiate a new Data Context within great-expectations
  • write test-cases for the data
  • group those test-cases and
  • run it

Setting up a Database

if you don’t have it installed,

wait 15 minutes for download the internet. Verify postgres running with docker ps, then connect with

Create some data

Take data for a spin

should yield

Now time for great-expectations

Great Expectations relies on the library sqlalchemy and psycopg2 to connect to your data.

once done, let’s set up great-expectations

should look like below:

let’s set up a few other goodies while we’re here

Congratulations! Great Expectations is now set up

You should see a file structure as follows:

great expectations tree structure

If you didn’t generate a suite during the set up based on app.order, you can do so now with

great_expectations suite new

when created, looking at great_expectations/expectations/app/order/warning.json should yield the following:

as noted in the content section, this expectation config is created by the tool by looking at 1000 rows of the data. We also have access to the data-doc site which we can open in the browser at great_expectations/uncommitted/data_docs/local_site/index.html

great expectations index page

Clicking on app.order.warning, you’ll see the sample expectation shown in the UI

great expectations app order screen

Now, let’s create our own expectation file and take it for a spin. We’ll call this one error.

great expectations new suite

This should also start a jupyter notebook. If for some reason you need to start it back up again, you can do so with

Go ahead and hit run on your first cell.

Editing a suite with Jupyter

Let’s keep it simple and test the customer_order_id column is in a set with the values below:

using the following expectations function in your Table Expectation(s). You may need to click the + sign in the toolbar to insert a new cell, as below:

Adding table expectation

As we can see, appropriate json output that describes the output of our expectation. Go ahead and run the final cell, which will save our work and open a newly minted data documentation UI page, where you’ll see the expectations you defined in human readable form.

Saved suite

Running the test cases

In Great Expectations, running a set of expectations (test cases) is called a checkpoint. Let’s create a new checkpoint called first_checkpoint for our app.order.error expectation as shown below:

Let’s take a look at our checkpoint definition.

Above you can see the validation_operator_name which points to a definition in great_expectations.yml, and the batches where we defined the data source and what expectations to run against.

Let’s have a look at great_expectations.yml. We can see the action_list_operator defined and all the actions it contains:

List operators

Let’s run our checkpoint using

Validate checkpoint

Okay cool, we’ve set up an expectation, a checkpoint and shown a successful status! But what does a failure look like? We can introduce a failure by logging in to postgres and inserting a customer_11 that we’ll know will fail, as we’ve specific our expectation that customer_id should only have two values..

Here are the commands to make that happen, as well as the command to re-run our checkpoint:

Run checkpoint again, this time it should fail

Failed checkpoint

As expected, it failed.

Supported Databases

In it’s current implementation version 0.12.9, the supported databases our of the box are:

It’s great to be BigQuery supported out of the box, but what about Google Spanner and Google BigTable? Short-answer; currently not supported. See tickets https://github.com/googleapis/google-cloud-python/issues/3022.

With respect to BigTable, it may not be possible as SQLAlchemy can only manage SQL-based RDBSM-type systems, while BigTable (and HBase) are NoSQL non-relational systems.

Scheduling

Now that we have seen how to run tests on our data, we can run our checkpoints from bash or a python script(generated using great_expectations checkpoint script first_checkpoint). This lends itself to easy integration with scheduling tools like airflow, cron, prefect, etc.

Production deployment

When deploying in production, you can store any sensitive information(credentials, validation results, etc) which are part of the uncommitted folder in cloud storage systems or databases or data stores depending on your infratructure setup. Great Expectations has a lot of options

When not to use a data quality framework

This tool is great and provides a lot of advanced data quality validation functions, but it adds another layer of complexity to your infrastructure that you will have to maintain and trouble shoot in case of errors. It would be wise to use it only when needed.

In general

Do not use a data quality framework, if simple SQL based tests at post load time works for your use case. Do not use a data quality framework, if you only have a few (usually < 5) simple data pipelines.

Do use it when you have data that needs to be tested in an automated and a repeatable fashion. As shown in this article, Great Expectations has a number of options that can be toggled to suit your particular use-case.

Conclusion

Great Expectations shows a lot of promise, and it’s an active project so expect to see features roll out frequently. It’s been quite easy to use, but I’d like to see all it’s features work in a locked-down enterprise environment.

Tom Klimovski
Principal Consultant, Gamma Data
tom.klimovski@gammadata.io

Multi Cloud Diagramming with PlantUML

Following on from the recent post GCP Templates for C4 Diagrams using PlantUML, cloud architects are often challenged with producing diagrams for architectures spanning multiple cloud providers, particularly as you elevate to enterprise level diagrams.

In this post, with the magic of !includeurl we have brought PlantUML template libraries together for AWS, Azure and GCP icon sets, allowing us to produce multi cloud C4 diagrams using PlantUML like this one:

Multi Cloud Architecture Diagram using PlantUML

Creating a multi cloud diagram is simple, start by adding the following include statements after the @startuml label in a new PlantUML C4 diagram:

Then add references to the required services from different providers…

Then include the predefined resources from your different cloud providers in your diagram as shown here (describing a client server application over a cloud to cloud VPN between Azure and GCP)…

Happy multi-cloud diagramming!

Full source code is available at:

https://github.com/gamma-data/plantuml-multi-cloud-diagrams

Cloud Bigtable Primer Part II – Row Key Selection and Schema Design

This is a follow up to the original Cloud Bigtable primer where we discussed the basics of Cloud Bigtable:

In this article we will cover schema design and row key selection in Bigtable – arguably the most critical design decision to make when employing Bigtable in a cloud data architecture.

Quick Review

Recall from the previous post where the Bigtable data model was introduced that tables in Bigtable are comprised of rows and columns – much like a table in any other RDBMS. Every row is uniquely identified by a rowkey – again akin to a primary key in a table in an RDBMS. But this is where the similarities end.

Unlike a table in an RDBMS, columns only ever exist when they are inserted, and NULLs are not stored. See the illustration below:

Row Key Selection

Data in Bigtable is distributed by row keys. Row keys are physically stored in tablets in lexographic order. Recall that row keys are your ONLY indexes to data in Bigtable.

Selection Considerations

As row keys are your only indexes to retrieve or update rows in Bigtable, row key design must take the access patterns for the data to be stored and served via Bigtable into consideration, specifically the following must be considered when designing a Bigtable application:

  • Search patterns (returning data for a specific entity)
  • Scan patterns (returning batches of data)

Queries that use the row key, a row prefix, or a row range are the most efficient. Queries that do not include a row key will typically scan GB or TB of data and would not be suitable for operational use cases.

Row Key Performance

Row key performance will be biased towards your specific access patterns and application functional requirements. For example if you are performing sequential reads or scan operations then sequential keys will perform the best, however their write performance will not be optimal. Conversely, random keys (such as a uuid) will perform best for writes but poor for scan or sequential read operations.

Adding salts to keys (or additional data), similar to the use of salts in cryptography as well as promoting other field keys to be part of a composite row key can help achieve a “Goldilocks” scenario for both reads and writes, see the diagram below:

Using Reverse Timestamps

Use reverse timestamps when your most common query is for the latest values. Typically you would append the reverse timestamp to the key, this will ensure that the same related records are grouped together, for instance if you are storing events for a customer using the customer id along with an appended reverse timestamp (for example <customer_id>#<reverse_ts>) would allow you to quickly serve the latest events for a customer in descending order as within each group (customer_id), rows will be sorted so most recent insert will be located at the top.
A reverse timestamp can be generalised as:

Long.MAX_VALUE - System.currentTimeMillis()

Schema Design Tips

Some general tips for good schema design using Bigtable are summarised below:

  • Group related data for more efficient reads using column families
  • Distribute data evenly for more efficient writes
  • Place identical values in the adjoining rows for more efficient compression using row keys

Following these tips will give you the best possible performance using Bigtable.

Use the Key Visualizer to profile performance

Google provides a neat tool to visualize your row key distribution in Cloud Bigtable. You need to have at least 30 GB of data in your table to enable this feature.

The Key Visualizer is shown here:

Bigtable Key Visualizer

The Key Visualizer will help you find and prevent hotspots, find rows with too much data and show if your key schema is balanced.

Summary

Bigtable is one of the original and best (massively) distributed NoSQL platforms available. Schema and moreover row key design play a massive part in ensuring low latency and query performance. Go forth and conquer with Cloud Bigtable!

GCP Templates for C4 Diagrams using PlantUML

I am a believer in the mantra of “Everything-as-Code”, this includes diagrams and other architectural artefacts. Enter PlantUML…

PlantUML

PlantUML is an open-source tool which allows users to create UML diagrams from an intuitive DSL (Domain Specific Language). PlantUML is built on top of Graphviz and enables software architects and designers to use code to create Sequence Diagrams, Use Case Diagrams, Class Diagrams, State and Activity Diagrams and much more.

C4 Diagrams

PlantUML can be extended to support the C4 model for visualising software architecture. Which describes software in different layers including Context, Container, Component and Code diagrams.

GCP Architecture Diagramming using C4

PlantUML and C4 can be used to produce cloud architectures, there are official libraries available through PlantUML for Azure and AWS service icons, however these do not exist for GCP yet. There are several open source libraries available, however I have made an attempt to simplify the implementation.

The code below can be used to generate a C4 diagram describing a GCP architecture including official GCP service icons:

@startuml
!define GCPPuml https://raw.githubusercontent.com/gamma-data/GCP-C4-PlantUML/master/templates

!includeurl GCPPuml/C4_Context.puml
!includeurl GCPPuml/C4_Component.puml
!includeurl GCPPuml/C4_Container.puml
!includeurl GCPPuml/GCPC4Integration.puml
!includeurl GCPPuml/GCPCommon.puml

!includeurl GCPPuml/Networking/CloudDNS.puml
!includeurl GCPPuml/Networking/CloudLoadBalancing.puml
!includeurl GCPPuml/Compute/ComputeEngine.puml
!includeurl GCPPuml/Storage/CloudStorage.puml
!includeurl GCPPuml/Databases/CloudSQL.puml

title Sample C4 Diagram with GCP Icons

Person(publisher, "Publisher")
System_Ext(device, "User")

Boundary(gcp,"gcp-project") {
  CloudDNS(dns, "Managed Zone", "Cloud DNS")
  CloudLoadBalancing(lb, "L7 Load Balancer", "Cloud Load Balancing")
  CloudStorage(bucket, "Static Content Bucket", "Cloud Storage")
  Boundary(region, "gcp-region") {
    Boundary(zonea, "zone a") {
      ComputeEngine(gcea, "Content Server", "Compute Engine")
      CloudSQL(csqla, "Dynamic Content", "Cloud SQL")
    }
    Boundary(zoneb, "zone b") {
      ComputeEngine(gceb, "Content Server", "Compute Engine")
      CloudSQL(csqlb, "Dynamic Content\n(Read Replica)", "Cloud SQL")
    }
  }
}

Rel(device, dns, "resolves name")
Rel(device, lb, "makes request")
Rel(lb, gcea, "routes request")
Rel(lb, gceb, "routes request")
Rel(gcea, bucket, "get static content")
Rel(gceb, bucket, "get static content")
Rel(gcea, csqla, "get dynamic content")
Rel(gceb, csqla, "get dynamic content")
Rel(csqla, csqlb, "replication")
Rel(publisher,bucket,"publish static content")

@enduml

The preceding code generates the diagram below:

Additional services can be added and used in your diagrams by adding them to your includes, such as:

!includeurl GCPPuml/DataAnalytics/BigQuery.puml
!includeurl GCPPuml/DataAnalytics/CloudDataflow.puml
!includeurl GCPPuml/AIandMachineLearning/AIHub.puml
!includeurl GCPPuml/AIandMachineLearning/CloudAutoML.puml
!includeurl GCPPuml/DeveloperTools/CloudBuild.puml
!includeurl GCPPuml/HybridandMultiCloud/Stackdriver.puml
!includeurl GCPPuml/InternetofThings/CloudIoTCore.puml
!includeurl GCPPuml/Migration/TransferAppliance.puml
!includeurl GCPPuml/Security/CloudIAM.puml
' and more…

The complete template library is available at:

https://github.com/gamma-data/GCP-C4-PlantUML

Cloud Bigtable Primer – Part I

Bigtable is one of the foundational services in the Google Cloud Platform and to this day one of the greatest contributions to the big data ecosystem at large. It is also one of the least known services available, with all the headlines and attention going to more widely used services such as BigQuery.

Background

In 2006 (pre Google Cloud Platform), Google released a white paper called “Bigtable: A Distributed Storage System for Structured Data”, this paper set out the reference architecture for what was to become Cloud Bigtable. This followed several other whitepapers including the GoogleFS and MapReduce whitepapers released in 2003 and 2004 which provided abstract reference architectures for the Google File System (now known as Colossus) and the MapReduce algorithm. These whitepapers inspired a generation of open source distributed processing systems including Hadoop. Google has long had a pattern of publicising a generalized overview of their approach to solving different storage and processing challenges at scale through white papers.

Bigtable Whitepaper 2006

The Bigtable white paper inspired a wave of open source distributed key/value oriented NoSQL data stores including Apache HBase and Apache Cassandra.

What is Bigtable?

Bigtable is a distributed, petabyte scale NoSQL database. More specifically, Bigtable is…

a map

At its core Bigtable is a distributed map or an associative array indexed by a row key, with values in columns which are created only when they are referenced. Each value is an uninterpreted byte array.

sorted

Row keys are stored in lexographic order akin to a clustered index in a relational database.

sparse

A given row can have any number of columns, not all columns must have values and NULLs are not stored. There may also be gaps between keys.

multi-dimensional

All values are versioned with a timestamp (or configurable integer). Data is not updated in place, it is instead superseded with another version.

When (and when not) to use Bigtable

  • You need to do many thousands of operations per second on TB+ scale data
  • Your access patterns are well known and simple
  • You need to support random write or random read operations (or sequential reads) – each using a row key as the primary identifier

Don’t use Bigtable if…

  • You need explicit JOIN capability, that is joining one or more tables
  • You need to do ad-hoc analytics
  • Your access patterns are unknown or not well defined

Bigtable vs Relational Database Systems

The following table compares and contrasts Bigtable against relational databases (both transaction oriented and analytic oriented databases):

 BigtableRDBMS (OLTP)RDBMS (DSS/MPP)
Data LayoutColumn Family OrientedRow OrientedColumn Oriented
Transaction SupportSingle Row OnlyYesDepends (but usually no)
Query DSLget/put/scan/deleteSQLSQL
IndexesRow Key OnlyYesYes (typically PI based)
Max Data SizePB+'00s GB to TBTB+
Read/Write Throughput'000,000s queries/s'000s queries/s'000s queries/s

Bigtable Data Model

Tables in Bigtable are comprised of rows and columns (sounds familiar so far..). Every row is uniquely identified by a rowkey (like a primary key..again sounds familiar so far).

Columns belong to Column Families and only exist when inserted, NULLs are not stored – this is where it starts to differ from a traditional RDBMS. The following image demonstrates the data model for a fictitious table in Bigtable.

Bigtable Data Model

In the previous example, we created two Column Families (cf1 and cf2). These are created during table definition or update operations (akin to DDL operations in the relational world). In this case, we have chosen to store primary attributes, like name, etc in cf1 and features (or derived attributes) in cf2 like indicators.

Cell versioning

Each cell has a timestamp/version associated with it, multiple versions of a row can exist. Versions are naturally stored in descending order.

Properties such as the max age for a cell or the maximum number of versions to be stored for any given cell are set on the Column Family. Versions are compacted through a process called Garbage Collection – not to be confused with Java Garbage Collection (albeit same idea).

Row KeyColumnValueTimestamp
123cf1:statusACTIVE2020-06-30T08.58.27.560
123cf1:statusPENDING2020-06-28T06.20.18.330
123cf1:statusINACTIVE2020-06-27T07.59.20.460

Bigtable Instances, Clusters, Nodes and Tables

Bigtable is a “no-ops” service, meaning you do not need to configure machine types or details about the underlying infrastructure, save a few sizing or performance options – such as the number of nodes in a cluster or whether to use solid state hard drives (SSD) or the magnetic alternative (HDD). The following diagram shows the relationships and cardinality for Cloud Bigtable.

Bigtable Instances, Clusters and Nodes

Clusters and nodes are the physical compute layer for Bigtable, these are zonal assets, zonal and regional availability can be achieved through replication which we will discuss later in this article.

Instances are a virtual abstraction for clusters, Tables belong to instances (not clusters). This is due to Bigtables underlying architecture which is based upon a separation of storage and compute as shown below.

Bigtable Separation of Storage and Compute

Bigtables separation of storage and compute allow it to scale horizontally, as nodes are stateless they can be increased to increase query performance. The underlying storage system in inherently scalable.

Physical Storage & Column Families

Data (Columns) for Bigtable is stored in Tablets (as shown in the previous diagram), which store “regions” of row keys for a particular Column Family. Columns consist of a column family prefix and qualifier, for instance:

cf1:col1

A table can have one or more Column Families. Column families must be declared at schema definition time (could be a create or alter operation). A cell is an intersection of a row key and a version of a column within a column family.

Storage settings (such as the compaction/garbage collection properties mentioned before) can be specified for each Column Family – which can differ from other column families in the same table.

Bigtable Availability and Replication

Replication is used to increase availability and durability for Cloud Bigtable – this can also be used to segregate read and write operations for the same table.

Data and changes to tables are replicated across multiple regions or multiple zones within the same region, this replication can be blocking (single row transactions) or non blocking (eventually consistent). However all clusters within a Bigtable instance are considered primary (writable).

Requests are routed using Application Profiles, a single-cluster routing policy can be used for manual failover, whereas a multi-cluster routing is used for automatic failover.

Backup and Export Options for Bigtable

Managed backups can be taken at a table level, new tables can be created from backups. The backups cannot be exported, however table level export and import operations are available via pre-baked Dataflow templates for data stored in GCS in the following formats:

  • SequenceFiles
  • Avro Files
  • Parquet Files
  • CSV Files

Accessing Bigtable

Bigtable data and admin functions are available via:

  • cbt (optional component of the Google SDK)
  • hbase shell (REPL shell)
  • Happybase API (Python API for Hbase)
  • SDK libraries for:
    • Golang
    • Python
    • Java
    • Node.js
    • Ruby
    • C#, C++, PHP, and more

As Bigtable is not a cheap service, there is a local emulator available which is great for application development. This is part of the Cloud SDK, and can be started using the following command:

gcloud beta emulators bigtable start

In the next article in this series we will demonstrate admin and data functions as well as the local emulator.

Next Up : Part II – Row Key Selection and Schema Design in Bigtable

Automated GCS Object Scanning Using DLP with Notifications Using Slack

This is a follow up to a previous blog, Google Cloud Storage Object Notifications using Slack in which we used Slack to notify us of new objects being uploaded to GCS.

In this article we will take things a step further, where uploading an object to a GCS bucket will trigger a DLP inspection of the object and if any preconfigured info types (such as credit card numbers or API credentials) are present in the object, a Slack notification will be generated.

As DLP scans are “jobs”, meaning they run asynchronously, we will need to trigger scans and inspect results using two separate Cloud Functions (one for triggering a scan [gcs-dlp-scan-trigger] and one for inspecting the results of the scan [gcs-dlp-evaluate-results]) and a Cloud PubSub topic [dlp-scan-topic] which is used to hold the reference to the DLP job.

The process is described using the sequence diagram below:

The Code

The gcs-dlp-scan-trigger Cloud Function fires when a new object is created in a specified GCS bucket. This function configures the DLP scan to be executed, including the DLP info types (for instance CREDIT_CARD_NUMBER, EMAIL_ADDRESS, ETHNIC_GROUP, PHONE_NUMBER, etc) a and likelihood of that info type existing (for instance LIKELY). DLP scans determine the probability of an info type occurring in the data, they do not scan every object in its entirety as this would be too expensive.

The primary function executed in the gcs-dlp-scan-trigger Cloud Function is named inspect_gcs_file. This function configures and submits the DLP job, supplying a PubSub topic to which the DLP Job Name will be written, the code for the inspect_gcs_file is shown here:

At this stage the DLP job is created an running asynchronously, the next Cloud Function, gcs-dlp-evaluate-results, fires when a message is sent to the PubSub topic defined in the DLP job. The gcs-dlp-evaluate-results reads the DLP Job Name from the PubSub topic, connects to the DLP service and queries the job status, when the job is complete, this function checks the results of the scan, if the min_likliehood threshold is met for any of the specified info types, a Slack message is generated. The code for the main method in the gcs-dlp-evaluate-results function is shown here:

Finally, a Slack webhook is used to send the message to a specified Slack channel in a workspace, this is done using the send_slack_notification function shown here:

An example Slack message is shown here:

839Slack Notification for Sensitive Data Detected in a Newly Created GCS Object

Full source code can be found at: https://github.com/gamma-data/automated-gcs-object-scanning-using-dlp-with-notifications-using-slack