Really Simple Terraform – S3 Object Notifications using Lambda and SES

Following on from the previous post in the Really Simple Terraform series, simple-lambda-ec2-scheduler, where we used Terraform to deploy a Lambda function (packaging the Python source into a ZIP archive and creating all supporting objects such as roles, policies and permissions), this post takes things a step further by using templating to substitute parameters into the Lambda function code before the function is packaged and created.

S3 event notifications can be published directly to an SNS topic to which you could add an email subscription; this is quite straightforward. However, the email notifications you get look something like this:

Email Notification sent via an SNS Topic Subscription

There is very little you can do about this.

However, if you take a slightly different approach and trigger a Lambda function that sends an email via SES, you have much more control over the content and formatting. Using this approach you could get an email notification that looks like this:

Email Notification sent using Lambda and SES

Much easier on the eye!

Prerequisites

You will need verified AWS SES (Simple Email Service) email addresses for both the sender and recipient addresses used for your object notification emails. This can be done via the console as shown here:

SES Email Address Verification

Note that SES is not available in every AWS region; pick one that is generally closest to your region (although for this purpose it really doesn't matter).
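If you would rather manage this in code, the aws_ses_email_identity resource in the AWS provider can register the addresses as part of the deployment. A minimal sketch (the variable names are placeholders, and verification still requires clicking the link in the confirmation email SES sends):

resource "aws_ses_email_identity" "sender" {
  email = var.sender_email      # placeholder variable for the verified sender address
}

resource "aws_ses_email_identity" "recipient" {
  email = var.recipient_email   # placeholder variable for the verified recipient address
}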

Deployment

The Terraform module creates an IAM Role and associated policy for the Lambda function as shown here:
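Something along these lines, sketched with illustrative resource names and current HCL syntax (the module's actual naming and policy JSON may differ):

# Role the Lambda function will assume
resource "aws_iam_role" "s3_notification_lambda" {
  name = "s3-notification-lambda-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

# Allow the function to send email via SES and write CloudWatch logs
resource "aws_iam_role_policy" "s3_notification_lambda" {
  name = "s3-notification-lambda-policy"
  role = aws_iam_role.s3_notification_lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ses:SendEmail", "ses:SendRawEmail"]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}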

Variables in the module are substituted into the function code template, the rendered template file is then packaged as a ZIP archive to be uploaded as the Lambda function source as shown here:
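A sketch of that templating and packaging step, assuming template, variable and file names of my own choosing (sender, recipient and SES region being the values substituted into the function source):

# Render the function source, substituting module variables into the template
data "template_file" "notification_function" {
  template = file("${path.module}/templates/s3_notification.py.tpl")

  vars = {
    sender    = var.sender_email
    recipient = var.recipient_email
    region    = var.ses_region
  }
}

# Package the rendered source as a ZIP archive for upload to Lambda
data "archive_file" "notification_package" {
  type        = "zip"
  output_path = "${path.module}/build/s3_notification.zip"

  source {
    content  = data.template_file.notification_function.rendered
    filename = "s3_notification.py"
  }
}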

As in the previous post, I will reiterate that although Terraform is technically not a build tool, it can be used for simple build operations such as this.

The Lambda function is deployed using the following code:
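A minimal version of that resource, reusing the names from the sketches above (the handler, runtime and timeout values are assumptions):

resource "aws_lambda_function" "s3_notification" {
  function_name    = "s3-object-notification"
  filename         = data.archive_file.notification_package.output_path
  source_code_hash = data.archive_file.notification_package.output_base64sha256
  role             = aws_iam_role.s3_notification_lambda.arn
  handler          = "s3_notification.lambda_handler"
  runtime          = "python3.8"
  timeout          = 30
}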

Finally the S3 object notification events are configured as shown here:
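Roughly as follows, assuming a bucket_name variable for the monitored bucket and notifications on object creation (the events actually configured in the repository may differ):

# Allow S3 to invoke the Lambda function
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowExecutionFromS3"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.s3_notification.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = "arn:aws:s3:::${var.bucket_name}"
}

# Trigger the function whenever an object is created in the bucket
resource "aws_s3_bucket_notification" "object_notifications" {
  bucket = var.bucket_name

  lambda_function {
    lambda_function_arn = aws_lambda_function.s3_notification.arn
    events              = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_lambda_permission.allow_s3]
}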

Use the following commands to run this example (I have created a default credentials profile, but you could supply your API credentials directly, use STS, etc):

cd simple-notifications-with-lambda-and-ses
terraform init
terraform apply

Full source code for this article can be found at: https://github.com/avensolutions/simple-notifications-with-lambda-and-ses

Really Simple Terraform – Infrastructure Automation using AWS Lambda

There are many blog posts and examples available for scheduling infrastructure tasks (such as starting or stopping EC2 instances) or for deploying a Lambda function using Terraform. However, I have found many of the other examples to be unnecessarily complicated, so I have put together a very simple example that does both.

The function itself could be easily adapted to take other actions including interacting with other AWS services using the boto3 library (the Python AWS SDK). The data payload could be modified to pass different data to the function as well.

The script only requires input variables for schedule_expression (a cron schedule, based on GMT, for triggering the function, which could also be expressed as a rate, e.g. rate(5 minutes)) and environment (a value passed to the function on each invocation). In this example the input data is the value of the “Environment” key of an EC2 instance tag, a user-defined tag associating the instance with a particular environment (e.g. Dev, Test, Prod). The key could be changed as required; for instance, if you wanted to stop instances based upon their given name (or part thereof), you could change the tag key to “Name”.
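Declared as variables, they might look like this (the defaults shown are illustrative only):

variable "schedule_expression" {
  description = "Cron or rate expression (GMT) that triggers the function"
  default     = "cron(0 18 ? * MON-FRI *)"   # e.g. 18:00 GMT on weekdays
}

variable "environment" {
  description = "Value of the Environment tag identifying the instances to stop"
  default     = "Dev"
}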

When triggered, the function will stop all running EC2 instances with the given Environment tag.

The Terraform script creates:

  • an IAM Role and associated policy for the Lambda Function
  • the Lambda function
  • a Cloudwatch event rule and trigger

The IAM role and policies required for the Lambda function are deployed as shown here:
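In outline, that looks something like the following, with illustrative resource names (the role needs permission to describe and stop instances, plus write CloudWatch logs):

resource "aws_iam_role" "ec2_scheduler" {
  name = "ec2-scheduler-lambda-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "ec2_scheduler" {
  name = "ec2-scheduler-lambda-policy"
  role = aws_iam_role.ec2_scheduler.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["ec2:DescribeInstances", "ec2:StopInstances"]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
        Resource = "arn:aws:logs:*:*:*"
      }
    ]
  })
}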

The function source code is packaged into a ZIP archive and deployed using Terraform as follows:
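A sketch of the packaging and deployment, with file, handler and runtime names chosen for illustration (the repository's actual names may differ):

# Zip up the function source
data "archive_file" "scheduler_package" {
  type        = "zip"
  source_file = "${path.module}/ec2_stop_function.py"
  output_path = "${path.module}/ec2_stop_function.zip"
}

resource "aws_lambda_function" "ec2_scheduler" {
  function_name    = "ec2-scheduler"
  filename         = data.archive_file.scheduler_package.output_path
  source_code_hash = data.archive_file.scheduler_package.output_base64sha256
  role             = aws_iam_role.ec2_scheduler.arn
  handler          = "ec2_stop_function.lambda_handler"
  runtime          = "python3.8"
  timeout          = 60
}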

Admittedly Terraform is an infrastructure automation tool and not a build/packaging tool (such as Jenkins, etc), but in this case the packaging only involves zipping up the function source code, so Terraform can be used as a ‘one stop shop’ to keep things simple.

The Cloudwatch schedule trigger is deployed as follows:
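Something like the following, reusing the variables and resource names above (the shape of the input payload is an assumption; the function in the repository may expect a different key):

resource "aws_cloudwatch_event_rule" "scheduler" {
  name                = "ec2-scheduler-trigger"
  schedule_expression = var.schedule_expression
}

resource "aws_cloudwatch_event_target" "scheduler" {
  rule = aws_cloudwatch_event_rule.scheduler.name
  arn  = aws_lambda_function.ec2_scheduler.arn

  # Pass the Environment tag value to the function on each invocation
  input = jsonencode({ Environment = var.environment })
}

# Allow CloudWatch Events to invoke the function
resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ec2_scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.scheduler.arn
}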

Use the following commands to run this example (I have created a default credentials profile, but you could supply your API credentials directly, use STS, etc):

cd simple-lambda-ec2-scheduler
terraform init
terraform apply
Terraform output

Full source code is available at: https://github.com/avensolutions/simple-lambda-ec2-scheduler

Stay tuned for more simple Terraform deployment recipes in coming posts…

Multi Stage ETL Framework using Spark SQL

Most traditional data warehouse or data mart ETL routines consist of multi-stage SQL transformations, often a series of CTAS (CREATE TABLE AS SELECT) statements that create transient or temporary tables, such as volatile tables in Teradata, or Common Table Expressions (CTEs).

The initial challenge when moving from a SQL/MPP-based ETL framework built on Oracle, Teradata, SQL Server, etc. to a Spark-based ETL framework is what to do with this…

Multi Stage SQL Based ETL

One approach is to use the lightweight, configuration driven, multi stage Spark SQL based ETL framework described in this post.

This framework is driven by a YAML configuration document. YAML was preferred over JSON as the document format because it allows multi-line statements (SQL statements) as well as comments, which are very useful as SQL can sometimes be undecipherable even to the person who wrote it.

The YAML config document has three main sections: sources, transforms and targets.

The sources section is used to configure the input data source(s), including optional column and row filters. In this case the data sources are tables available in the Spark catalog (for instance the AWS Glue Catalog or a Hive metastore); this could easily be extended to read from other data sources using the Spark DataFrameReader API.

The transforms section contains the SQL statements to be run in sequence, where each statement creates a temporary view that can reference objects created by preceding statements.

Finally the targets section writes out the final object or objects to a specified destination (S3, HDFS, etc).
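To make the structure concrete, here is an illustrative config (the table, column and path names are made up, and the exact keys used in the repository's config may differ):

sources:
  customers:
    table: sales_db.customers
    columns: [customer_id, name, region]   # optional column filter
    filter: "region = 'APAC'"              # optional row filter
  orders:
    table: sales_db.orders

transforms:
  - name: customer_orders
    sql: >
      SELECT c.customer_id, c.name, o.order_id, o.amount
      FROM customers c
      JOIN orders o ON o.customer_id = c.customer_id
  - name: customer_totals
    sql: >
      SELECT customer_id, name, SUM(amount) AS total_amount
      FROM customer_orders
      GROUP BY customer_id, name

targets:
  - view: customer_totals
    format: parquet
    path: s3://my-bucket/output/customer_totals/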

The process_sql_statements.py script that executes the framework is very simple (around 30 lines of code, not including comments). It loads the sources into Spark DataFrames, creates temporary views so the transforms can reference these datasets, and then sequentially executes the SQL statements in the list of transforms. Lastly, the script writes out the final view or views to the desired destination; in this case Parquet files stored in S3 were used as the target.
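The gist of that driver, as a rough sketch against the illustrative config above (not the exact code in the repository):

import sys
import yaml
from pyspark.sql import SparkSession

config = yaml.safe_load(open(sys.argv[1]))
spark = SparkSession.builder.appName("sql_etl").getOrCreate()

# Load each source table and expose it as a temporary view
for name, source in config["sources"].items():
    df = spark.table(source["table"])
    if "columns" in source:
        df = df.select(*source["columns"])
    if "filter" in source:
        df = df.filter(source["filter"])
    df.createOrReplaceTempView(name)

# Run the transforms in order; each creates a view usable by later statements
for transform in config["transforms"]:
    spark.sql(transform["sql"]).createOrReplaceTempView(transform["name"])

# Write the final view(s) to the configured destination
for target in config["targets"]:
    spark.table(target["view"]).write.mode("overwrite") \
        .format(target["format"]).save(target["path"])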

You could implement an object naming convention such as prefixing object names with sv_, iv_, fv_ (for source view, intermediate view and final view respectively) if this helps you differentiate between the different objects.

To use this framework you would simply use spark-submit as follows:

spark-submit process_sql_statements.py config.yml

Full source code can be found at: https://github.com/avensolutions/spark-sql-etl-framework

The Cost of Future Change: What we should really be focused on (but no one is…)

In my thirty-year career in technology I can’t think of a more profound period of change. It is not as if change never existed previously; it did. Innovation and Moore’s law have been constant features in technology. However, for decades we had experienced a long period of relative stability. That is, the rate of change was somewhat fixed. This was largely because change, for the most part, was dictated almost entirely by a handful of large companies (such as Oracle, Microsoft, IBM, SAP, etc.). We, the technology community, were exclusively at the mercy of the product management teams at these companies dreaming up the next new feature, and we were subject to the tech giants’ product release cycles as to when we could use it. The rate of change was therefore artificially suppressed to a degree.

Then along came the open source software revolution. I can recall being fortunate enough to be in a meeting with Steve Ballmer along with a small group of Microsoft partners in Sydney, Australia around 2002. I can remember him as a larger than life figure with a booming voice and an intense presence. I can vividly remember him going on a long diatribe about Open Source Software, primarily focused around Linux. To put some historical context around it, Linux was a fringe technology at best at that point in time – nobody was taking it seriously in the enterprise. The Apache Software Foundation wasn’t a blip on the radar at the time, Software-as-a-Service was in its infancy, and Platform-as-a-Service and cloud were non-factors – so what was he worried about?

From the fear instilled in his rant, you would have thought this was going to be the end of humanity as we knew it. Well it was the beginning of the end, the end of the halcyon days of the software industry as we knew it. The early beginnings of the end of the monopoly on change held by a select few software vendors. The beginning of the democratisation of change.

Fast forward fifteen years, and now everyone is a (potential) software publisher. Real-world problems are solved in real companies, and in many cases software products are created, published and made (freely) available to other companies. We have seen countless examples of this pattern: Hadoop at Yahoo!, Kafka at LinkedIn, Airflow at Airbnb, to name a few. Then there are companies created with scant capital or investment, driven by a small group of smart people focused on solving a problem or streamlining a process they have encountered in the real world. Many of these companies have grown to become globally recognised names, such as Atlassian or HashiCorp.

The rate of change is no longer controlled by a privileged few; in fact, we have an explosion of change. No longer do we have a handful of technology options to achieve a desired outcome or solve a particular problem; we have a constellation of options.

Now the bad news: the force multiplier in change created by open source communities cuts both ways. When the community is behind a project it gets hyper-charged; conversely, when the community drops off, functionality and stability drop off just as quickly.

This means there are components, projects, modules, services we are using today that won’t be around in the same format in 2-3 years’ time. There are products that don’t exist today that will be integral (and potentially critical) to our operations in 1-2 years’ time.

This brings me to my final point: the cost of future change. As leaders and custodians of technology in this period, we should factor in not only current-day run costs or the cost of change we know about (the cost of current change), but also the hidden cost of future change. We need to accept that whatever we are doing now is not what we will be doing in a year’s time; we will need to accommodate future change.

We need to be thinking about the cost of future change and what we can do to minimise this. This means not just swapping out one system for another, but thinking about how you will change this system or component again in the near future. We need to take the opportunity that is in front of us to replace monoliths with modular systems, move from tightly coupled to loosely coupled components, from proprietary systems to open systems, from rigid architectures to extensible architectures.

When you’re planning a change today, consider the cost of future change…