Using Jsonnet to Configure Multiple Environments

Every time I start a new project I try to optimise how the application can work across multiple environments. For those who don’t have the luxury of developing everything in Docker containers or isolated spaces, you will know my pain: how do I write code that can run on my local dev environment, migrate to the shared test and CI environment, and ultimately still work in production?

In the past I tried exotic options like dynamically generating YAML or JSON using Jinja. I then graduated to HOCON, which made my life so much easier – that is, until I stumbled across Jsonnet. For those who have not seen it in action, think JSON meets Jinja meets HOCON (a Frankenstein creation I have actually built in the past).

To get a feel for how it looks, below is a contrived example where I require 3 environments (dev, test and production) that have different paths, databases and vault configuration.

Essentially, when this config is run through the Jsonnet templating engine, it expects an external variable ‘ENV’ that resolves the environment entry down to the one we specifically want to use.

A helpful thing I like to do with my programs is give users a bit of information about which environments can be used. For me, running a CLI that requires args should be as informative as possible – so listing out all the environments is mandatory. I achieve this with a little trickery and a lot of help from the click package!

local exe = "application.exe";

local Environment(prefix) = {
  root: "/usr/" + prefix + "/app",
  path: self.root + "/bin/" + exe,
  database: std.asciiUpper(prefix) + "_DB",
  tmp_dir: "/tmp/" + prefix
};

local Vault = {
  local uri = "http://127.0.0.1:8200/v1/secret/app",
  _: {},
  dev: {
      secrets_uri: uri,
      approle: "local"
  },
  tst: {
      secrets_uri: uri,
      approle: "local"
  },
  prd: {
      secrets_uri: "https://vsrvr:8200/v1/secret/app",
      approle: "sa_user"
  }
};

{

  environments: {
    _: {},
    dev: Environment("dev") + Vault[std.extVar("ENV")],
    tst: Environment("tst") + Vault[std.extVar("ENV")],
    prd: Environment("prd") + Vault[std.extVar("ENV")]
  },

  environment: $["environments"][std.extVar("ENV")],
}
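
As an aside, you can render the template outside of the application with the standalone jsonnet CLI (assuming it is installed and the config is saved as environment.jsonnet, as in the Python example further down):

$> jsonnet --ext-str ENV=dev environment.jsonnet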

The trick I perform is to have a placeholder entry ‘_’ that I use to initially render the template. I then take the generated JSON, grab all the environment keys and feed them directly into click.

from typing import Any, Dict
import click
import json
import _jsonnet
from pprint import pprint

ENV_JSONNET = 'environment.jsonnet'
ENV_PFX_PLACEHOLDER = '_'

def parse_environment(prefix: str) -> Dict[str, Any]:
    _json_str = _jsonnet.evaluate_file(ENV_JSONNET, ext_vars={'ENV': prefix})
    return json.loads(_json_str)

_config = parse_environment(prefix=ENV_PFX_PLACEHOLDER)

_env_prefixes = [k for k in _config['environments'].keys() if k != ENV_PFX_PLACEHOLDER]


@click.command(name="EnvMgr")
@click.option(
    "-e",
    "--environment",
    required=True,
    type=click.Choice(_env_prefixes, case_sensitive=False),
    help="Which environment this is executing on",
)
def cli(environment: str) -> None:
    config = parse_environment(environment)
    pprint(config['environment'])


if __name__ == "__main__":
    cli()

This now allows me to execute the application with both list checking (has the user selected an allowed environment?) and the autogenerated help that click provides.

Below shows running the cli with no arguments:

$> python cli.py

Usage: cli.py [OPTIONS]
Try 'cli.py --help' for help.

Error: Missing option '-e' / '--environment'. Choose from:
        dev,
        prd,
        tst

Executing the application with a valid environment:

$> python cli.py -e dev

{'approle': 'local',
 'database': 'DEV_DB',
 'path': '/usr/dev/app/bin/application.exe',
 'root': '/usr/dev/app',
 'secrets_uri': 'http://127.0.0.1:8200/v1/secret/app',
 'tmp_dir': '/tmp/dev'}

Executing the application with an invalid environment:

$> python cli.py -e prd3

Usage: cli.py [OPTIONS]
Try 'cli.py --help' for help.

Error: Invalid value for '-e' / '--environment': 'prd3' is not one of 'dev', 'prd', 'tst'.

This is only the tip of what Jsonnet can provide; I am continually learning more about the templating engine and the tool.

Simple Tasker: Configuration driven orchestration

Recently I found myself at a client that was using a third-party tool to scan all their enterprise applications in order to collate their data lineage. They had spent two years onboarding applications to the tool, resulting in a large technical mess that was hard to debug and impossible to extend. As new applications were integrated onto the platform, developers were forced to think of new ways of connecting and transforming the data so it could be consumed.

The general approach was: setup scanner -> scan application -> modify results -> upload results -> backup results -> cleanup workspace -> delete anything older than 'X' days

Each developer had their own style of doing this – involving shell scripts, Python scripts, SQL and everything in between. Worse, there were slabs of code replicated across the entire repository, with variables and paths changed depending on the use case.

My task was to create a framework that could orchestrate the scanning and adhere to the following philosophies:

  • DRY (Don’t Repeat Yourself)
  • Config driven
  • Version controlled
  • Simple to extend
  • Idempotent

It also had to be written in Python as that was all the client was skilled in.

After looking at what was on the market (Airflow and Prefect being the main contenders) I decided to roll my own simplified orchestrator that required as little actual coding as possible and could be set up through configuration.

In choosing a configuration format, I settled on HOCON as it closely resembles JSON but has advanced features such as interpolation, substitutions and the ability to include other HOCON files – this would drastically reduce the amount of boilerplate configuration required.
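
As a quick illustration of what that buys you, here is a minimal sketch using the pyhocon package (the same library the framework uses later via ConfigFactory); the paths and keys are made up, and the substitutions are resolved at parse time:

from pyhocon import ConfigFactory

# a hypothetical fragment showing HOCON substitution and value concatenation
conf = ConfigFactory.parse_string("""
    base_dir = /opt/app
    scripts_dir = ${base_dir}/scripts
    task {
        name = archive
        workspace = ${base_dir}/work
    }
""")

print(conf["scripts_dir"])     # /opt/app/scripts
print(conf["task.workspace"])  # /opt/app/work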

Because I had focused so heavily on being configuration driven, I also needed to deliver the following characteristics:

  • Self discovery of task types (more on this later)
  • Configuration validation at startup

Tasks and self discovery

As I wanted anyone to be able to rapidly extend the framework by adding tasks, I needed to reduce as much repetition and boilerplate as possible. Ideally, I wanted a developer to just have to think about writing code and not have to deal with how to integrate this.

To achieve this, we needed a way of registering new ‘tasks’ that would become available to the framework. I wanted a developer to simply have to subclass the main Task class and implement a run function – the rest would be taken care of.

from typing import List


class TaskRegistry:

    def __init__(self) -> None:
        self._registry = {}

    def register(self, cls: type) -> None:
        n = getattr(cls, 'task_name', cls.__name__).lower()
        self._registry[n] = cls

    def registered(self) -> List[str]:
        return list(self._registry.keys())

    def has(self, name: str) -> bool:
        return name in self._registry

    def get(self, name: str) -> type:
        return self._registry[name]

    def create(self, name: str, *args, **kwargs) -> object:
        try:
            return self._registry[name](*args, **kwargs)
        except KeyError:
            # ClassNotRegisteredException is a custom exception defined elsewhere in the project
            raise ClassNotRegisteredException(name)


registry = TaskRegistry()

Once the registry was instantiated, any new tasks that inherited from ‘Task’ would automatically be added to the registry. We could then use the create(name) function to instantiate any class – essentially a Pythonic factory method.

from abc import ABC, abstractmethod
import logging


class Task(ABC):

    def __init__(self) -> None:
        self.logger = logging.getLogger(self.__class__.__name__)

    def __init_subclass__(cls) -> None:
        registry.register(cls)

    @abstractmethod
    def run(self, **kwargs) -> bool:
        raise NotImplementedError
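
To illustrate the effect, here is a minimal, hypothetical task (not part of the framework itself) showing that subclassing Task is all that is needed for a class to land in the registry and become creatable by name:

class Hello(Task):

    task_name = "hello"

    def run(self, name: str) -> bool:
        self.logger.info(f"Hello {name}")
        return True


print(registry.registered())     # ['hello'] – registered via __init_subclass__
task = registry.create("hello")  # factory lookup by name
task.run(name="world")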

For the framework to automatically register the classes, it was important to follow the project structure. As long as the task resided in the ‘tasks’ module, we could scan this at runtime and register each task.

└── simple_tasker
    ├── __init__.py
    ├── cli.py
    └── tasks
        ├── __init__.py
        ├── archive.py
        └── shell_script.py

This was achieved with a simple dynamic module importer

# presumably lives in simple_tasker/tasks/__init__.py, where Task is in scope
import glob
from os.path import basename, dirname, isfile, join

modules = glob.glob(join(dirname(__file__), "*.py"))

for f in modules:
    if isfile(f) and not f.endswith("__init__.py"):
        __import__(f"{Task.__module__}.{basename(f)[:-3]}")

The configuration

In designing how the configuration would bind to the tasks, I needed to capture the name (which object to instantiate) and the args to pass to the instantiated run function. I decided to model it as below, with everything under a ‘tasks’ array:

tasks: [
    {
        name: shell_script
        args: {
            script_path: uname
            script_args: -a
        }
    },
    {
        name: shell_script
        args: {
            script_path: find
            script_args: [${CWD}/simple_tasker/tasks, -name, "*.py"]
        }
    },
    {
        name: archive
        args: {
            input_directory_path: ${CWD}/simple_tasker/tasks
            target_file_path: /tmp/${PLATFORM}_${TODAY}.tar.gz
        }
    }
]

Orchestration and validation

As mentioned previously, one of the goals was to ensure the configuration was valid prior to any execution. This meant that the framework needed to validate whether the task name referred to a registered task, and that all mandatory arguments were addressed in the configuration. Determining whether the task was registered was a simple key check; validating the arguments to run, however, required some inspection – I needed to get all args for the run function and filter out ‘self’ and any asterisk args (*args, **kwargs).

import inspect
from typing import List


def get_mandatory_args(func) -> List[str]:

    mandatory_args = []
    for k, v in inspect.signature(func).parameters.items():
        if (
            k != "self"
            and v.default is inspect.Parameter.empty
            and not str(v).startswith("*")
        ):
            mandatory_args.append(k)

    return mandatory_args
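
For example, run against the task classes shown later in this post, it behaves like this:

print(get_mandatory_args(ShellScript.run))
# ['script_path'] – script_args and working_directory_path have defaults
print(get_mandatory_args(Archive.run))
# ['input_directory_path', 'target_file_path']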

And finally onto the actual execution bit. The main functionality required here is to validate that the config was defined correctly, then loop through all tasks and execute them – passing in any args.

import logging
from pathlib import Path
from typing import Dict

from pyhocon import ConfigFactory


class Tasker:

    def __init__(self, path: Path, env: Dict[str, str] = None) -> None:

        self.logger = logging.getLogger(self.__class__.__name__)
        self._tasks = []

        # wrap_environment is a helper (not shown here) that presumably applies the
        # supplied env vars while the config is parsed, so HOCON substitutions
        # such as ${CWD} can resolve
        with wrap_environment(env):
            self._config = ConfigFactory.parse_file(path)

    def __validate_config(self) -> bool:

        error_count = 0

        for task in self._config.get("tasks", []):
            name, args = task["name"].lower(), task.get("args", {})

            if registry.has(name):
                for arg in get_mandatory_args(registry.get(name).run):
                    if arg not in args:
                        print(f"Missing arg '{arg}' for task '{name}'")
                        error_count += 1
            else:
                print(f"Unknown tasks '{name}'")
                error_count += 1

            self._tasks.append((name, args))

        return error_count == 0

    def run(self) -> bool:

        if self.__validate_config():

            for name, args in self._tasks:
                exe = registry.create(name)
                self.logger.info(f"About to execute: '{name}'")
                if not exe.run(**args):
                    self.logger.error(f"Failed tasks '{name}'")
                    return False

            return True
        return False
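
Wiring it all together looks something like the sketch below – the config path and environment values here are purely illustrative:

from pathlib import Path

tasker = Tasker(
    Path("configs/scan_application.conf"),
    env={"CWD": "/opt/simple_tasker", "PLATFORM": "myapp", "TODAY": "20210101"},
)

if not tasker.run():
    raise SystemExit(1)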

Putting it together – sample tasks

Below are two examples of how easy it is to extend the framework. First, a simple folder archiver that will tar/gz a directory based on two input parameters.

import os
import tarfile


class Archive(Task):

    def __init__(self) -> None:
        super().__init__()

    def run(self, input_directory_path: str, target_file_path: str) -> bool:

        self.logger.info(f"Archiving '{input_directory_path}' to '{target_file_path}'")

        with tarfile.open(target_file_path, "w:gz") as tar:
            tar.add(
                input_directory_path,
                arcname=os.path.basename(input_directory_path)
            )
        return True

A more complex example is the ability to execute shell scripts (or OS commands), with optional arguments that can be either a string or a list.

import subprocess
from typing import List, Union


class ShellScript(Task):

    task_name = "shell_script"

    def __init__(self) -> None:
        super().__init__()

    def run(
        self,
        script_path: str,
        script_args: Union[str, List[str]] = None,
        working_directory_path: str = None
    ) -> bool:

        cmd = [script_path]

        if isinstance(script_args, str):
            cmd.append(script_args)
        elif script_args:
            # script_args is optional, so only extend the command when a list was supplied
            cmd += script_args

        try:

            result = subprocess.check_output(
                cmd,
                stderr=subprocess.STDOUT,
                cwd=working_directory_path
            ).decode("utf-8").splitlines()

            for o in result:
                self.logger.info(o)

        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            self.logger.error(e)
            return False

        return True

You can view the entire implementation here

Masking Private Keys in CI/CD Pipelines in GitLab

Big fan of GitLab (and GitLab CI in particular). I had a recent requirement to push changes to a wiki repo associated with a GitLab project through a GitLab CI pipeline (using the SaaS version of GitLab) and ran into a conundrum…

Using the GitLab SaaS version, deploy tokens can’t have write API access, so the next best solution is to use deploy keys. Adding your public key as a deploy key and granting this key write access to repositories is relatively straightforward.

The issue is that when you attempt to create a masked GitLab CI variable using the private key from your keypair, you get this…

I was a bit astonished to see this, to be honest… It looks like it has been raised as an issue several times over the last few years but never resolved (the root cause is something to do with newline characters, base64 encoding or the overall length of the string).

I came up with a solution! It’s not pretty but it’s effective: it masks the variable so that it cannot be printed in CI logs, as shown here:

Setup

Add a masked and protected GitLab variable for each line in the private key, for example:

The Code

Add the following block to your .gitlab-ci.yml file:

Now, within jobs in your pipeline, you can simply do this to clone, push or pull from a remote GitLab repo:

As mentioned, it’s not pretty, but it’s effective – and there were no cleaner options as far as I could see…

Enumerating all roles for a user in Snowflake

Snowflake allows roles to be assigned to other roles, so when a user is assigned to a role, they may inherit the ability to use countless other roles.

Challenge: recursively enumerate all roles for a given user

One solution would be to create a complex query on the "SNOWFLAKE"."ACCOUNT_USAGE"."GRANTS_TO_ROLES" object.

An easier solution is to use a stored procedure to recurse through grants for a given user and return an ARRAY of roles for that user.

This is a good programming exercise in tail call recursion (sort of) in JavaScript. Here is the code:

To call the stored proc, execute:

One drawback of stored procedures in Snowflake is that they can only have scalar or array return types and cannot be used directly in a SQL query. However, you can use the table(result_scan(last_query_id())) trick to get around this, as shown below, where we pivot the ARRAY into a record set with the array elements as rows:

IMPORTANT

This query must be the next statement run immediately after the CALL statement and cannot be run again until you run another CALL statement.

More adventures with Snowflake soon!

EventArc: The state of eventing in Google Cloud

When defining event-driven architectures, it’s always good to keep up with how the landscape is changing. How do you connect microservices in your architecture? Is Pub/Sub the end-game for all events? To dive a bit deeper, let’s talk through the benefits of having a single orchestrator, or perhaps a choreographer is better?

Orchestration versus choreography refresher

My colleague @jeffreyaven did a recent post explaining this concept in simple terms, which is worth reviewing, see:

Should there really be a central orchestrator controlling all interactions between services…..or, should each service work independently and only interact through events?

  • Orchestration is usually viewed as a domain-wide central service that defines the flow and control of communication between services. In this paradigm, it becomes easier to change and ultimately monitor policies across your org.
  • Choreography has each service registering and emitting events as it needs to. It doesn’t direct or define the flow of communication, but this method usually involves a central broker passing messages around and allows services to be truly independent.

Enter Workflows, which is suited to centrally orchestrated services – not only Google Cloud services such as Cloud Functions and Cloud Run, but also external services.

How about choreography? Pub/Sub and Eventarc are both suited for this. We all know and love Pub/Sub, but how do I use Eventarc?

What is Eventarc?

Announced in October 2020, Eventarc was introduced as eventing functionality that enables you, the developer, to send events to Cloud Run from more than 60 Google Cloud sources.

But how does it work?

Eventing is done by reading those sweet, sweet Audit Logs from various sources and sending them to Cloud Run services as events in CloudEvents format. Quick primer on CloudEvents: it’s a specification for describing event data in a common way. The specification is now under the Cloud Native Computing Foundation. Hooray! Eventarc can also read events from Pub/Sub topics for custom applications. Here’s a diagram I graciously ripped from the Google Cloud Blog:

Eventarc

Why do I need Eventarc? I have the Pub/Sub

Good question. Eventarc provides an easier path to receive events, not only from Pub/Sub topics but from a number of Google Cloud sources, with its Audit Log and Pub/Sub integration. In fact, any service that has Audit Log integration can be an event source for Eventarc. Beyond easy integration, it provides consistency and structure to how events are generated, routed and consumed. Things like:

Triggers

Triggers specify routing rules from event sources to event sinks. Listen for new object creation in GCS and route that event to a service in Cloud Run by creating an Audit Log trigger. Create triggers that also listen to Pub/Sub. Then list all triggers in one central place in Eventarc:

gcloud beta eventarc triggers list

Consistency with eventing format and libraries

Using the CloudEvents specification allows event data to be described in a common way, moving towards the goal of consistency, accessibility and portability. It also makes it easier for different languages to read the event, and for the Google Events libraries to parse fields.

This supports the long-term vision of Eventarc: to be the hub of events, enabling a unified eventing story for Google Cloud and beyond.

Eventarc producers and consumers

In the future, you can expect to forego Audit Logs, read these events directly, and send them out to even more sinks within GCP and to any HTTP target.


This article was written with inspiration from https://cloud.google.com/blog/topics/developers-practitioners/eventarc-unified-eventing-experience-google-cloud. Thanks Mete Atamel!

Microservices Concepts: Orchestration versus Choreography

One of the foundational concepts in microservices architecture and design patterns is the concept of Orchestration versus Choreography. Before we look at a reference implementation of each of these patterns, it is worthwhile starting with an analogy.

This is often likened to a Jazz band versus a Symphony Orchestra.

A modern symphony orchestra normally comprises sections such as strings, brass, woodwind and percussion. The sections are orchestrated by a conductor, usually placed at a central point with respect to each of the sections. The conductor instructs each section to perform their components of the overall symphony.

By contrast, a Jazz band does not have a conductor and also features improvisation, with different musicians improvising based upon each other. Choreography, although more aligned to dance, can involve improvisation. In both cases there is still an intended output and a general framework as to how the composition will be executed, however unlike a symphony orchestra there is a degree of spontaneity.

Now back to technology and microservices…

In the Orchestration model, there is a central orchestration service which controls the interactions between other services, in other words the flow and control of communication and/or message passing between services is controlled by an orchestrator (much like the conductor in a symphony orchestra). On the plus side, this model enables easier monitoring and policy enforcement across the system. A generalisation of the Orchestration model is shown below:

Orchestration model

By contrast, in the Choreography model, each service works independently and interacts with other services through events. In this model each service registers and emits events as they need to. The flow (of communication between services) is not predefined, much like a Jazz band. This model often includes a central broker for message passing between services, but the services operate independently of each other and are not controlled by a central service (an orchestrator). A generalisation of the Choreography model is shown below:

Choreography model

We will post subsequent articles with implementations of these patterns, but it is worthwhile getting a foundational understanding first.

Great Expectations (for your data…)

This article provides an introduction to the Great Expectations Python library for data quality management (https://github.com/great-expectations/great_expectations).

So what are expectations when it comes to data (and data quality)…

An expectation is a falsifiable, verifiable statement about data. Expectations provide a language to talk about data characteristics and data quality – humans to humans, humans to machines and machines to machines.

The Great Expectations project includes predefined, codified expectations such as:

expect_column_to_exist 
expect_table_row_count_to_be_between 
expect_column_values_to_be_unique 
expect_column_values_to_not_be_null 
expect_column_values_to_be_between 
expect_column_values_to_match_regex 
expect_column_mean_to_be_between 
expect_column_kl_divergence_to_be_less_than

… and many more

Expectations are both data tests and docs! Expectations can be presented in a machine-friendly JSON, for example:
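
As a rough illustration (the column name here is hypothetical), a single expectation serialised to JSON looks something like this:

{
  "expectation_type": "expect_column_values_to_not_be_null",
  "kwargs": {
    "column": "customer_id"
  }
}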

Great Expectations provides validation results of defined expectations, which can dramatically shorten your development cycle.

validation results in great expectations

Nearly 50 built-in expectations allow you to express how you understand your data, and you can add custom expectations if you need a new one. A machine can test whether a dataset conforms to the expectations.

OK, enough talk, let’s go!

pyenv virtualenv 3.8.2 ge38
pip install great-expectations

I tried with Python 3.7.2, but had issues with the lgzm library on my local machine.

once installed, run the following in the Python REPL shell:

showing the data in the dataframe should give you the following:

as can be seen, we have a collection of random integers in each column for our initial testing. Let’s pipe this data into Great Expectations…
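
A minimal sketch of what this can look like (the column names and value ranges here are assumptions):

import great_expectations as ge
import numpy as np
import pandas as pd

# build a dataframe of random integers, then wrap it as a Great Expectations dataset
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)), columns=["a", "b", "c"])
ge_df = ge.from_pandas(df)

# a passing expectation: every value in column 'a' sits within the generated range
result = ge_df.expect_column_values_to_be_between("a", min_value=0, max_value=100)
print(result)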

yields the following output…

this shows that there are 0 unexpected items in the data we are testing. Great!

Now let’s have a look at a negative test. Since we’ve picked the values at random, there are bound to be duplicates. Let’s test that:
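
Something along these lines, assuming the ge_df dataset from the sketch above:

result = ge_df.expect_column_values_to_be_unique("a")
print(result)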

yields…

The JSON output has metadata information about the result; of note is the result section, which is specific to our query and shows the percentage of values that failed the expectation.

Let’s progress to something more real-world, namely creating expectations that run against databases. Armed with our basic understanding of Great Expectations, let’s…

  • set up a postgres database
  • initiate a new Data Context within great-expectations
  • write test-cases for the data
  • group those test-cases and
  • run it

Setting up a Database

if you don’t have it installed,

wait 15 minutes to download the internet. Verify postgres is running with docker ps, then connect with

Create some data

Take data for a spin

should yield

Now time for great-expectations

Great Expectations relies on the sqlalchemy and psycopg2 libraries to connect to your data.

once done, let’s set up great-expectations

should look like below:

let’s set up a few other goodies while we’re here

Congratulations! Great Expectations is now set up

You should see a file structure as follows:

great expectations tree structure

If you didn’t generate a suite based on app.order during the setup, you can do so now with

great_expectations suite new

when created, looking at great_expectations/expectations/app/order/warning.json should yield the following:

as noted in the content section, this expectation config is created by the tool by looking at 1000 rows of the data. We also have access to the data-doc site which we can open in the browser at great_expectations/uncommitted/data_docs/local_site/index.html

great expectations index page

Clicking on app.order.warning, you’ll see the sample expectation shown in the UI

great expectations app order screen

Now, let’s create our own expectation file and take it for a spin. We’ll call this one error.

great expectations new suite

This should also start a jupyter notebook. If for some reason you need to start it back up again, you can do so with

Go ahead and hit run on your first cell.

Editing a suite with Jupyter

Let’s keep it simple and test the customer_order_id column is in a set with the values below:

using the following expectations function in your Table Expectation(s). You may need to click the + sign in the toolbar to insert a new cell, as below:

Adding table expectation

As we can see, we get appropriate JSON output that describes the result of our expectation. Go ahead and run the final cell, which will save our work and open a newly minted data documentation UI page, where you’ll see the expectations you defined in human-readable form.

Saved suite

Running the test cases

In Great Expectations, running a set of expectations (test cases) is called a checkpoint. Let’s create a new checkpoint called first_checkpoint for our app.order.error expectation as shown below:

Let’s take a look at our checkpoint definition.

Above you can see the validation_operator_name which points to a definition in great_expectations.yml, and the batches where we defined the data source and what expectations to run against.

Let’s have a look at great_expectations.yml. We can see the action_list_operator defined and all the actions it contains:

List operators

Let’s run our checkpoint using

Validate checkpoint

Okay cool, we’ve set up an expectation and a checkpoint and shown a successful status! But what does a failure look like? We can introduce a failure by logging in to postgres and inserting a customer_11 that we know will fail, as we’ve specified in our expectation that customer_id should only have two values.

Here are the commands to make that happen, as well as the command to re-run our checkpoint:

Run checkpoint again, this time it should fail

Failed checkpoint

As expected, it failed.

Supported Databases

In its current implementation (version 0.12.9), the supported databases out of the box are:

It’s great to see BigQuery supported out of the box, but what about Google Spanner and Google BigTable? Short answer: currently not supported. See https://github.com/googleapis/google-cloud-python/issues/3022.

With respect to BigTable, it may not be possible, as SQLAlchemy can only manage SQL-based RDBMS-type systems, while BigTable (and HBase) are NoSQL, non-relational systems.

Scheduling

Now that we have seen how to run tests on our data, we can run our checkpoints from bash or a Python script (generated using great_expectations checkpoint script first_checkpoint). This lends itself to easy integration with scheduling tools like Airflow, cron, Prefect, etc.
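
As an illustration, a minimal Airflow sketch might look like the following – the DAG name and schedule are assumptions, and the BashOperator simply shells out to the Great Expectations CLI:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# run the checkpoint daily; the CLI exits non-zero if the checkpoint fails, failing this task
with DAG(
    dag_id="great_expectations_checkpoints",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_first_checkpoint = BashOperator(
        task_id="run_first_checkpoint",
        bash_command="great_expectations checkpoint run first_checkpoint",
    )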

Production deployment

When deploying in production, you can store any sensitive information (credentials, validation results, etc.) that is part of the uncommitted folder in cloud storage systems, databases or data stores, depending on your infrastructure setup. Great Expectations has a lot of options.

When not to use a data quality framework

This tool is great and provides a lot of advanced data quality validation functions, but it adds another layer of complexity to your infrastructure that you will have to maintain and troubleshoot in case of errors. It would be wise to use it only when needed.

In general

Do not use a data quality framework if simple SQL-based tests at post-load time work for your use case. Do not use a data quality framework if you only have a few (usually < 5) simple data pipelines.

Do use it when you have data that needs to be tested in an automated and repeatable fashion. As shown in this article, Great Expectations has a number of options that can be toggled to suit your particular use case.

Conclusion

Great Expectations shows a lot of promise, and it’s an active project, so expect to see features roll out frequently. It’s been quite easy to use, but I’d like to see all its features work in a locked-down enterprise environment.

Tom Klimovski
Principal Consultant, Gamma Data
tom.klimovski@gammadata.io

JSON Wrangling with Go

Golang is a fantastic language, but at first glance it is a bit clumsy when it comes to JSON compared with other languages such as Python or JavaScript. Having said that, once you master the concepts involved in JSON wrangling with Go, it is equally functional – with added type safety and performance.

In this article we will build a program in Golang to parse a JSON file containing a collection held in a named key. Without knowing the structure of this object, we will expose its schema, including data types, and recurse the object for its values.

This example uses a great Go package called tablewriter to render the output of these operations using a table style result set.

The program has describe and select verbs as operation types; describe shows the column names in the collection and their respective data types, select prints the keys and values as a tabular result set with column headers for the keys and rows containing their corresponding values.

Starting with this:

We will end up with this when performing a describe operation:

And this when performing a select operation:

Now let’s talk about how we got there…

The JSON package

Support for JSON in Go is provided by the encoding/json package, which needs to be imported in your program of course… You will also need to import the reflect package – more on this later. io/ioutil is required to read the data from a file input; other packages included in the program are omitted for brevity:

Reading the data…

We will read the data from the JSON file into a variable called body; note that we are not attempting to deserialize the data at this point. This is also a good opportunity to handle any runtime or IO errors that occur.

The interface…

We will declare an empty interface called data, which will be used to decode the JSON object (whose structure is not known). We will also create an abstract interface called colldata to hold the contents of the collection contained inside the JSON object that we are specifically looking for:

Validating…

Next we need to validate that the input is a valid JSON object; we can use the json.Valid(body) function to do this:

Unmarshalling…

Now for the interesting bits: we will deserialize the JSON object into the empty data interface we created earlier using the json.Unmarshal() function:

Note that this operation is another opportunity to catch unexpected errors and handle them accordingly.

Checking the type of the object using reflection…

Now that we have serialized the JSON object into the data interface, there are several ways we can inspect the type of the object (which could be a map or an array). One such way is to use reflection. Reflection is the ability of a program to inspect itself at runtime. An example is shown here:

This instruction would produce the following output for our zones.json file:

The type switch…

Another method to determine the type of the data object (and any objects nested as elements or keys within it) is to use a type switch; an example of a type switch function is shown here:

Finding the nested collection and recursing it…

The aim of the program is to find a collection (an array of maps) nested in a JSON object. The maps within each element of the array are unknown at runtime and are discovered through recursion.

If we are performing a describe operation, we only need to parse the first element of the collection to get the key names and the data type of the values (for which we will use the same getObjectType function to perform a type switch).

If we are performing a select operation, we need to parse the first element to get the column names (the keys in the map) and then we need to recurse each element to get the values for each key.

If the element contains a key named id or name, we will place this at the beginning of the resultant record, as maps are unordered by definition.

The output…

As mentioned, we are using the tablewriter package to render the output of the collection as a pretty printed table in our terminal. As wraparound can get pretty ugly, an additional maxfieldlen argument is provided to truncate the values if needed.

In summary…

Although it is a bit more involved than some other languages, once you get your head around processing JSON in Go, the possibilities are endless!

Full source code can be found at: https://github.com/gamma-data/json-wrangling-with-golang

Forseti Terraform Validator: Enforcing resource policy compliance in your CI pipeline

Terraform is a powerful tool for managing your Infrastructure as Code. Declare your resources once, define their variables per environment and sleep easy knowing your CI pipeline will take care of the rest.

But… one night you wake up in a sweat. The details are fuzzy but you were browsing your favourite cloud provider’s console – probably GCP 😉 – and thought you saw a bucket had been created outside of your allowed locations! Maybe it even had risky access controls.

You brush it off and try to fall back to sleep, but you can’t quite push the thought from your mind that somewhere in all that Terraform code, someone could be declaring resources in unapproved locations, and your CICD pipeline would do nothing to stop it. Oh, the regulatory implications.

Enter Terraform Validator by Forseti

Terraform Validator by Forseti allows you to declare your Policy as Code, check compliance of your Terraform plans against said Policy, and automatically fail violating plans in a CI step. All without setting up servers or agents.

You’re going to learn how to enforce policy on GCP resources like BigQuery, IAM, networks, MySQL, Google Kubernetes Engine (GKE) and more. If you’re particularly crafty, you may be able to go beyond GCP.

Forseti’s suite of solutions is GCP-focused and allows a wide range of live config validation, monitoring and more using the Policy Library we’re going to set up. These additional capabilities require additional infrastructure. But we’re going one step at a time, starting with enforcing policy during deployment.

Getting Started

Let’s assume you already have an established CICD pipeline that uses Terraform, or that you are content to validate your Terraform plans locally for now. In that case, we need just two things:

  1. A Policy Library
  2. Terraform Validator

It’s that simple! No new servers, agents, firewall rules, extra service accounts or other nonsense. Just add Policy Library, the Validator tool and you can enforce policy on your Terraform deployments.

We’re going to tinker with some existing GCP-focused sample policies (aka Constraints) that Forseti makes available. These samples cover a wide range of resources and use cases, so it is easy to adjust what’s provided to define your own Constraints.

Policy Library

First let’s open up some of Forseti’s pre-defined constraints. We’ll copy them into our own Git repository and adjust to create policies that match our needs. Repeatable and configurable – that’s Policy as Code at work.

Concepts

In the world of Forseti, and in particular Terraform Validator, policies are defined and understood via easy-to-read YAML files known as Constraints.

There is just enough information in a Constraint file to make its purpose and effect clear, and by tinkering lightly with a pre-written Constraint you can achieve a lot without looking too deeply into the inner workings. But there’s more happening than meets the eye.

Constraints are built on Templates – which are like forms with some extra bits waiting to be completed to make a Constraint. Except there’s a lot more hidden away that’s pretty cool if you want to understand it.

Think of a Template as a ‘Class’ in the OOP sense, and of a Constraint as an instantiated Template with all the key attributes defined.

E.g. A generic Template for policy on bucket locations and a Constraint to specify which locations are relevant in a given instance. Again, buckets and locations are just the basic example – the potential applications are far greater.

Now the real magic is that just like a ‘Class’, a Template contains logic that makes everything abstracted away in the Constraint possible. Templates contain inline Rego (ray-go), borrowed lovingly by Forseti from the Open Policy Agent (OPA) team.

Learn more about Rego and OPA here to understand the relationship to our Terraform Validator.

But let’s begin.

Set up your Policies

Create your Policy Library repository

Create your Policy Library repository by cloning https://github.com/forseti-security/policy-library into your own VCS.

This repo contains templates and sample constraints which will form the basis of your policies. So get it into your Git environment and clone it to local for the next step.

Customise sample constraints to fit your needs

As discussed in Concepts, Constraints are defined from Templates, which make use of the Rego policy language. Nice. So let’s take a sample Constraint, put it in our Policy Library and set the values to what we need. It’s that easy – no need to write new templates or learn Rego if your use case is covered.

In a new branch…

  1. Copy the sample Constraint storage_location.yaml to your Constraints folder.
    $ cp policy-library/samples/storage_location.yaml policy-library/policies/constraints/storage_location.yaml
  2. Replace the sample location (asia-southeast1) in storage_location.yaml with australia-southeast1.
    spec:
      severity: high
      match:
        target: ["organization/*"]
      parameters:
        mode: "allowlist"
        locations:
        - australia-southeast1
        exemptions: []
  3. Push back to your repo – not Forseti’s!
    $ git push https://github.com/<your-repository>/policy-library.git

Policy review

There you go – you’ve customised a sample Constraint. Now you have your own instance of version controlled Policy-as-Code and are ready to apply the power of OPA’s Rego policy language that lies within the parent Template. Impressively easy right?

That’s a pretty simple example. You can browse the rest of Forseti’s Policy Library to view other sample Constraints, Templates and the Rego logic that makes all of this work. These can be adjusted to cover all kinds of use cases across GCP resources.

I suggest working with and editing the sample Constraints before making any changes to Templates.

If you were to write Rego and Templates from scratch, you might even be able to enforce Policy as Code against non-GCP Terraform code.

Terraform Validator

Now, let’s set up the Terraform Validator tool and have it compare a sample piece of Terraform code against the Constraint we configured above. Keep in mind you’ll want to translate what’s done here into steps in your CICD pipeline.

Once the tool is in place, we really just run terraform plan and feed the output into Terraform Validator. The Validator compares it to our Constraints, runs all the abstracted logic we don’t need to worry about and returns 0 or 2 when done for pass / fail respectively. Easy.

So using Terraform if I try to make a bucket in australia-southeast1 it should pass, if I try to make one in the US it should fail. Let’s set up the tool, write some basic Terraform and see how we go.

Setup Terraform Validator

Check for the latest version of terraform-validator from the official terraform-validator GCS bucket.

This is very important when using Terraform version 0.12 or greater. This is the easy way – you can also pull from the Terraform Validator GitHub repo and build it yourself.

$ gsutil ls -r gs://terraform-validator/releases

Copy the latest version to the working dir

$ gsutil cp gs://terraform-validator/releases/2020-03-05/terraform-validator-linux-amd64 .

Make it executable

$ chmod 755 terraform-validator-linux-amd64

Ready to go!

Review your Terraform code

We’re going to write a ridiculously simple piece of Terraform that tries to create one bucket in our project.

main.tf

resource "google_storage_bucket" "tf-validator-demo-bucket" {  
  name          = "tf-validator-demo-bucket"
  location      = "US"
  force_destroy = true

  lifecycle_rule {
    condition {
      age = "3"
    }
    action {
      type = "Delete"
    }
  }
}

This is a pretty standard bit of Terraform for a GCS bucket, but made very simple with all the values defined directly in main.tf. Note the location of the bucket – it violates our Constraint that was set to the australia-southeast1 region.

Make the Terraform plan

Warm up Terraform and double check your Terraform code for any hiccups.

$ terraform init

Make the Terraform plan and store output to file.

$ terraform plan --out=terraform.tfplan

Convert the plan to JSON

$ terraform show -json ./terraform.tfplan > ./terraform.tfplan.json

Validate the non-compliant Terraform plan against your Constraints, for example

$ ./terraform-validator-linux-amd64 validate ./terraform.tfplan.json --policy-path=../repos/policy-library/

TA-DA!

Found Violations:

Constraint allow_some_storage_location on resource //storage.googleapis.com/tf-validator-demo-bucket: //storage.googleapis.com/tf-validator-demo-bucket is in a disallowed location.

Validate the compliant Terraform plan against your Constraints

Let’s see what happens if we repeat the above, changing the location of our GCS bucket to australia-southeast1.

$ ./terraform-validator-linux-amd64 validate ./terraform.tfplan.json --policy-path=../repos/policy-library/

Results in..

No violations found.

Success!!!

Now all that’s left to do for your Policy as Code CICD pipeline is to configure the rest of your Constraints and run this check before you go ahead and terraform apply. Be sure to make the apply step dependent on the outcome of the Validator.

Wrap Up

We’ve looked at how to apply Policy as Code to validate our Infrastructure as Code. Sounds pretty modern and DevOpsy, doesn’t it?

To recap, we learned about Constraints, which are fully defined instances of Policy as Code. They’re based on YAML Templates that refer to the OPA policy language Rego, but we didn’t have to learn it 🙂

We created our own version controlled Policy Library.

Using the above learning and some handy pre-existing samples, we wrote policies (Constraints) for GCP infrastructure, specifying a whitelist for locations in which GCS buckets could be deployed.

As mentioned there are dozens upon dozens of samples across BigQuery, IAM, networks, MySQL, Google Kubernetes Engine (GKE) and more to work with.

Of course, we stored these configured Constraints in our version-controlled Policy Library.

  • We looked at a simple set of Terraform code to define a GCS bucket, and stored the Terraform plan to a file before applying it.
  • We ran Forseti’s Terraform Validator against the Terraform plan file, and had the Validator compare the plan to our Policy Library.
  • We saw that the results matched our expectations! Compliance with the location specified in our Constraint passed the Validator’s checks, and non-compliance triggered a violation.

Awesome. And the best part is that all this required no special permissions, no infrastructure for servers or agents and no networking.

All of that comes with the full Forseti suite, with its inventory taking and config validation of already deployed resources. We might get to that next time.

References:

https://github.com/GoogleCloudPlatform/terraform-validator
https://github.com/forseti-security/policy-library
https://www.openpolicyagent.org/docs/latest/policy-language/
https://cloud.google.com/blog/products/identity-security/using-forseti-config-validator-with-terraform-validator
https://forsetisecurity.org/docs/latest/concepts/