One of the foundational concepts in microservices architecture and design patterns is the concept of Orchestration versus Choreography. Before we look at a reference implementation of each of these patterns, it is worthwhile starting with an analogy.
This is often likened to a Jazz band versus a Symphony Orchestra.
A modern symphony orchestra is normally comprised of sections such as strings, brass, woodwind and percussion sections. The sections are orchestrated by a conductor, usually placed at a central point with respect to each of the sections. The conductor instructs each section to perform their components of the overall symphony.
By contrast, a Jazz band does not have a conductor and also features improvisation, with different musicians improvising based upon each other. Choreography, although more aligned to dance, can involve improvisation. In both cases there is still an intended output and a general framework as to how the composition will be executed, however unlike a symphony orchestra there is a degree of spontaneity.
Now back to technology and microservices…
In the Orchestration model, there is a central orchestration service which controls the interactions between other services, in other words the flow and control of communication and/or message passing between services is controlled by an orchestrator (much like the conductor in a symphony orchestra). On the plus side, this model enables easier monitoring and policy enforcement across the system. A generalisation of the Orchestration model is shown below:
By contrast, in the Choreography model, each service works independently and interacts with other services through events. In this model each service registers and emits events as they need to. The flow (of communication between services) is not predefined, much like a Jazz band. This model often includes a central broker for message passing between services, but the services operate independently of each other and are not controlled by a central service (an orchestrator). A generalisation of the Choreography model is shown below:
We will post subsequent articles with implementations of these patterns, but it is worthwhile getting a foundational understanding first.
Function Apps, Logic Apps and App Services can be used to expose APIs within Azure API Management which is an easy way to deploy serverless microservices. You can see this capability in the Azure portal below within API Management:
Like most readers, I like to script everything, so I was initially frustrated when I couldn’t replicate this operation in the Azure cli, REST, PowerShell, or any of the other SDKs or IaC tools. Others shared my frustration as seen here.
I was nearly resigned to using click ops in the portal (arrrgh) before I worked out this workaround.
There is a bit more prep work required to automate this process, but it is well worth it.
1. Create an OpenApi (or Swagger spec or WADL) specification document, as seen below (use the absolute URL for your Function App in the url parameter):
2. Use the az apim api import function (not the az apim api create function), as shown here:
3. Associate the API with a product (which is how you can rate limit APIs)
That’s it! You can now access your function via the API gateway using the gateway url or via the developer portal as seen below:
Following on from the recent post GCP Templates for C4 Diagrams using PlantUML, cloud architects are often challenged with producing diagrams for architectures spanning multiple cloud providers, particularly as you elevate to enterprise level diagrams.
In this post, with the magic of !includeurl we have brought PlantUML template libraries together for AWS, Azure and GCP icon sets, allowing us to produce multi cloud C4 diagrams using PlantUML like this one:
Creating a multi cloud diagram is simple, start by adding the following include statements after the @startuml label in a new PlantUML C4 diagram:
Then add references to the required services from different providers…
Then include the predefined resources from your different cloud providers in your diagram as shown here (describing a client server application over a cloud to cloud VPN between Azure and GCP)…
This is a follow up to the original Cloud Bigtable primer where we discussed the basics of Cloud Bigtable:
In this article we will cover schema design and row key selection in Bigtable – arguably the most critical design decision to make when employing Bigtable in a cloud data architecture.
Recall from the previous post where the Bigtable data model was introduced that tables in Bigtable are comprised of rows and columns – much like a table in any other RDBMS. Every row is uniquely identified by a rowkey – again akin to a primary key in a table in an RDBMS. But this is where the similarities end.
Unlike a table in an RDBMS, columns only ever exist when they are inserted, and NULLs are not stored. See the illustration below:
Row Key Selection
Data in Bigtable is distributed by row keys. Row keys are physically stored in tablets in lexographic order. Recall that row keys are your ONLY indexes to data in Bigtable.
As row keys are your only indexes to retrieve or update rows in Bigtable, row key design must take the access patterns for the data to be stored and served via Bigtable into consideration, specifically the following must be considered when designing a Bigtable application:
Search patterns (returning data for a specific entity)
Scan patterns (returning batches of data)
Queries that use the row key, a row prefix, or a row range are the most efficient. Queries that do not include a row key will typically scan GB or TB of data and would not be suitable for operational use cases.
Row Key Performance
Row key performance will be biased towards your specific access patterns and application functional requirements. For example if you are performing sequential reads or scan operations then sequential keys will perform the best, however their write performance will not be optimal. Conversely, random keys (such as a uuid) will perform best for writes but poor for scan or sequential read operations.
Adding salts to keys (or additional data), similar to the use of salts in cryptography as well as promoting other field keys to be part of a composite row key can help achieve a “Goldilocks” scenario for both reads and writes, see the diagram below:
Using Reverse Timestamps
Use reverse timestamps when your most common query is for the latest values. Typically you would append the reverse timestamp to the key, this will ensure that the same related records are grouped together, for instance if you are storing events for a customer using the customer id along with an appended reverse timestamp (for example <customer_id>#<reverse_ts>) would allow you to quickly serve the latest events for a customer in descending order as within each group (customer_id), rows will be sorted so most recent insert will be located at the top. A reverse timestamp can be generalised as:
Long.MAX_VALUE - System.currentTimeMillis()
Schema Design Tips
Some general tips for good schema design using Bigtable are summarised below:
Group related data for more efficient reads using column families
Distribute data evenly for more efficient writes
Place identical values in the adjoining rows for more efficient compression using row keys
Following these tips will give you the best possible performance using Bigtable.
Use the Key Visualizer to profile performance
Google provides a neat tool to visualize your row key distribution in Cloud Bigtable. You need to have at least 30 GB of data in your table to enable this feature.
The Key Visualizer is shown here:
The Key Visualizer will help you find and prevent hotspots, find rows with too much data and show if your key schema is balanced.
Bigtable is one of the original and best (massively) distributed NoSQL platforms available. Schema and moreover row key design play a massive part in ensuring low latency and query performance. Go forth and conquer with Cloud Bigtable!
I am a believer in the mantra of “Everything-as-Code”, this includes diagrams and other architectural artefacts. Enter PlantUML…
PlantUML is an open-source tool which allows users to create UML diagrams from an intuitive DSL (Domain Specific Language). PlantUML is built on top of Graphviz and enables software architects and designers to use code to create Sequence Diagrams, Use Case Diagrams, Class Diagrams, State and Activity Diagrams and much more.
PlantUML can be extended to support the C4 model for visualising software architecture. Which describes software in different layers including Context, Container, Component and Code diagrams.
GCP Architecture Diagramming using C4
PlantUML and C4 can be used to produce cloud architectures, there are official libraries available through PlantUML for Azure and AWS service icons, however these do not exist for GCP yet. There are several open source libraries available, however I have made an attempt to simplify the implementation.
The code below can be used to generate a C4 diagram describing a GCP architecture including official GCP service icons:
Bigtable is one of the foundational services in the Google Cloud Platform and to this day one of the greatest contributions to the big data ecosystem at large. It is also one of the least known services available, with all the headlines and attention going to more widely used services such as BigQuery.
In 2006 (pre Google Cloud Platform), Google released a white paper called “Bigtable: A Distributed Storage System for Structured Data”, this paper set out the reference architecture for what was to become Cloud Bigtable. This followed several other whitepapers including the GoogleFS and MapReduce whitepapers released in 2003 and 2004 which provided abstract reference architectures for the Google File System (now known as Colossus) and the MapReduce algorithm. These whitepapers inspired a generation of open source distributed processing systems including Hadoop. Google has long had a pattern of publicising a generalized overview of their approach to solving different storage and processing challenges at scale through white papers.
The Bigtable white paper inspired a wave of open source distributed key/value oriented NoSQL data stores including Apache HBase and Apache Cassandra.
What is Bigtable?
Bigtable is a distributed, petabyte scale NoSQL database. More specifically, Bigtable is…
At its core Bigtable is a distributed map or an associative array indexed by a row key, with values in columns which are created only when they are referenced. Each value is an uninterpreted byte array.
Row keys are stored in lexographic order akin to a clustered index in a relational database.
A given row can have any number of columns, not all columns must have values and NULLs are not stored. There may also be gaps between keys.
All values are versioned with a timestamp (or configurable integer). Data is not updated in place, it is instead superseded with another version.
When (and when not) to use Bigtable
You need to do many thousands of operations per second on TB+ scale data
Your access patterns are well known and simple
You need to support random write or random read operations (or sequential reads) – each using a row key as the primary identifier
Don’t use Bigtable if…
You need explicit JOIN capability, that is joining one or more tables
You need to do ad-hoc analytics
Your access patterns are unknown or not well defined
Bigtable vs Relational Database Systems
The following table compares and contrasts Bigtable against relational databases (both transaction oriented and analytic oriented databases):
Column Family Oriented
Single Row Only
Depends (but usually no)
Row Key Only
Yes (typically PI based)
Max Data Size
'00s GB to TB
Bigtable Data Model
Tables in Bigtable are comprised of rows and columns (sounds familiar so far..). Every row is uniquely identified by a rowkey (like a primary key..again sounds familiar so far).
Columns belong to Column Families and only exist when inserted, NULLs are not stored – this is where it starts to differ from a traditional RDBMS. The following image demonstrates the data model for a fictitious table in Bigtable.
In the previous example, we created two Column Families (cf1 and cf2). These are created during table definition or update operations (akin to DDL operations in the relational world). In this case, we have chosen to store primary attributes, like name, etc in cf1 and features (or derived attributes) in cf2 like indicators.
Each cell has a timestamp/version associated with it, multiple versions of a row can exist. Versions are naturally stored in descending order.
Properties such as the max age for a cell or the maximum number of versions to be stored for any given cell are set on the Column Family. Versions are compacted through a process called Garbage Collection – not to be confused with Java Garbage Collection (albeit same idea).
Bigtable Instances, Clusters, Nodes and Tables
Bigtable is a “no-ops” service, meaning you do not need to configure machine types or details about the underlying infrastructure, save a few sizing or performance options – such as the number of nodes in a cluster or whether to use solid state hard drives (SSD) or the magnetic alternative (HDD). The following diagram shows the relationships and cardinality for Cloud Bigtable.
Clusters and nodes are the physical compute layer for Bigtable, these are zonal assets, zonal and regional availability can be achieved through replication which we will discuss later in this article.
Instances are a virtual abstraction for clusters, Tables belong to instances (not clusters). This is due to Bigtables underlying architecture which is based upon a separation of storage and compute as shown below.
Bigtables separation of storage and compute allow it to scale horizontally, as nodes are stateless they can be increased to increase query performance. The underlying storage system in inherently scalable.
Physical Storage & Column Families
Data (Columns) for Bigtable is stored in Tablets (as shown in the previous diagram), which store “regions” of row keys for a particular Column Family. Columns consist of a column family prefix and qualifier, for instance:
A table can have one or more Column Families. Column families must be declared at schema definition time (could be a create or alter operation). A cell is an intersection of a row key and a version of a column within a column family.
Storage settings (such as the compaction/garbage collection properties mentioned before) can be specified for each Column Family – which can differ from other column families in the same table.
Bigtable Availability and Replication
Replication is used to increase availability and durability for Cloud Bigtable – this can also be used to segregate read and write operations for the same table.
Data and changes to tables are replicated across multiple regions or multiple zones within the same region, this replication can be blocking (single row transactions) or non blocking (eventually consistent). However all clusters within a Bigtable instance are considered primary (writable).
Requests are routed using Application Profiles, a single-cluster routing policy can be used for manual failover, whereas a multi-cluster routing is used for automatic failover.
Backup and Export Options for Bigtable
Managed backups can be taken at a table level, new tables can be created from backups. The backups cannot be exported, however table level export and import operations are available via pre-baked Dataflow templates for data stored in GCS in the following formats:
Bigtable data and admin functions are available via:
cbt (optional component of the Google SDK)
hbase shell (REPL shell)
Happybase API (Python API for Hbase)
SDK libraries for:
C#, C++, PHP, and more
As Bigtable is not a cheap service, there is a local emulator available which is great for application development. This is part of the Cloud SDK, and can be started using the following command:
gcloud beta emulators bigtable start
In the next article in this series we will demonstrate admin and data functions as well as the local emulator.
In this article we will take things a step further, where uploading an object to a GCS bucket will trigger a DLP inspection of the object and if any preconfigured info types (such as credit card numbers or API credentials) are present in the object, a Slack notification will be generated.
As DLP scans are “jobs”, meaning they run asynchronously, we will need to trigger scans and inspect results using two separate Cloud Functions (one for triggering a scan [gcs-dlp-scan-trigger] and one for inspecting the results of the scan [gcs-dlp-evaluate-results]) and a Cloud PubSub topic [dlp-scan-topic] which is used to hold the reference to the DLP job.
The process is described using the sequence diagram below:
The gcs-dlp-scan-trigger Cloud Function fires when a new object is created in a specified GCS bucket. This function configures the DLP scan to be executed, including the DLP info types (for instance CREDIT_CARD_NUMBER, EMAIL_ADDRESS, ETHNIC_GROUP, PHONE_NUMBER, etc) a and likelihood of that info type existing (for instance LIKELY). DLP scans determine the probability of an info type occurring in the data, they do not scan every object in its entirety as this would be too expensive.
The primary function executed in the gcs-dlp-scan-trigger Cloud Function is named inspect_gcs_file. This function configures and submits the DLP job, supplying a PubSub topic to which the DLP Job Name will be written, the code for the inspect_gcs_file is shown here:
At this stage the DLP job is created an running asynchronously, the next Cloud Function, gcs-dlp-evaluate-results, fires when a message is sent to the PubSub topic defined in the DLP job. The gcs-dlp-evaluate-results reads the DLP Job Name from the PubSub topic, connects to the DLP service and queries the job status, when the job is complete, this function checks the results of the scan, if the min_likliehood threshold is met for any of the specified info types, a Slack message is generated. The code for the main method in the gcs-dlp-evaluate-results function is shown here:
Finally, a Slack webhook is used to send the message to a specified Slack channel in a workspace, this is done using the send_slack_notification function shown here:
In this article we will build a program in Golang to parse a JSON file containing a collection held in a named key – without knowing the structure of this object, we will expose the schema for the object including data types and recurse the object for its values.
This example uses a great Go package called tablewriter to render the output of these operations using a table style result set.
The program has describe and select verbs as operation types; describe shows the column names in the collection and their respective data types, select prints the keys and values as a tabular result set with column headers for the keys and rows containing their corresponding values.
Starting with this:
We will end up with this when performing a describe operation:
And this when performing a select operation:
Now let’s talk about how we got there…
The JSON package
Support for JSON in Go is provided using the encoding/json package, this needs to be imported in your program of course… You will also need to import the reflect package – more on this later. io/ioutil is required to read the data from a file input, there are other packages included in the program that are removed for brevity:
Reading the data…
We will read the data from the JSON file into a variable called body, note that we are not attempting to deserialize the data at this point. This is also a good opportunity to handle any runtime or IO errors that occur here as well.
We will declare an empty interface called data which will be used to decode the json object (of which the structure is not known), we will also create an abstract interface called colldata to hold the contents of the collection contained inside the JSON object that we are specifically looking for:
Next we need to validate that the input is a valid JSON object, we can use the json.Valid(body) method to do this:
Now the interesting bits, we will deserialize the JSON object to the empty data interface we created earlier using the json.Unmarshal() method:
Note that this operation is another opportunity to catch unexpected errors and handle them accordingly.
Checking the type of the object using reflection…
Now that we have serialized the JSON object into the data interface, there are several ways we can inspect the type of the object (which could be a map or an array). One such way is to use reflection. Reflection is the ability of a program to inspect itself at runtime. An example is shown here:
This instruction would produce the following output for our zones.json file:
The type switch…
Another method to decode the type of the data object (and any objects nested as elements or keys within the data object), is to use the type switch, an example of a type switch function is shown here:
Finding the nested collection and recursing it…
The aim of the program is to find a collection (an array of maps) nested in a JSON object. The maps with each element of the array are unknown at runtime and are discovered through recursion.
If we are performing a describe operation, we only need to parse the first element of the collection to get the key names and the data type of the values (for which we will use the same getObjectType function to perform a type switch.
If we are performing a select operation, we need to parse the first element to get the column names (the keys in the map) and then we need to recurse each element to get the values for each key.
If the element contains a key named id or name, we will place this at the beginning of the resultant record, as maps are unordered by definition.
As mentioned, we are using the tablewriter package to render the output of the collection as a pretty printed table in our terminal. As wrap around can get pretty ugly an additional maxfieldlen argument is provided to truncate the values if needed.
Although it is a bit more involved than some other languages, once you get your head around processing JSON in Go, the possibilities are endless!
This article demonstrates creating a site to site IPSEC VPN connection between a GCP VPC network and an Azure Virtual Network, enabling private RFC1918 network connectivity between virtual networks in both clouds. This is done using a single PowerShell script leveraging Azure PowerShell and gcloud commands in the Google SDK.
Additionally, we will use Azure Private DNS to enable private access between Azure hosts and GCP APIs (such as Cloud Storage or Big Query).
An overview of the solution is provided here:
One note before starting – site to site VPN connections between GCP and Azure currently do not support dynamic routing using BGP, however creating some simple routes on either end of the connection will be enough to get going.
Let’s go through this step by step:
Step 1 : Authenticate to Azure
Azure’s account equivalent is a subscription, the following command from Azure Powershell is used to authenticate a user to one or more subscriptions.
This command will open a browser window prompting you for Microsoft credentials, once authenticated you will be returned to the command line.
Step 2 : Create a Resource Group (Azure)
A resource group is roughly equivalent to a project in GCP. You will need to supply a Location (equivalent to a GCP region):
Step 3 : Create a Virtual Network with Subnets and Routes (Azure)
An Azure Virtual Network is the equivalent of a VPC network in GCP (or AWS), you must define subnets before creating a Virtual Network. In this example we will create two subnets, one Gateway subnet (which needs to be named accordingly) where the VPN gateway will reside, and one subnet named ‘default’ where we will host VMs which will connect to GCP services over the private VPN connection.
Before defining the default subnet we must create and attach a Route Table (equivalent of a Route in GCP), this particular route will be used to route ‘private’ requests to services in GCP (such as Big Query).
Network Security Groups in Azure are stateful firewalls much like Firewall Rules in VPC networks in GCP. Like GCP, the lower priority overrides higher priority rules.
In the example we will create several rules to allow inbound ICMP, TCP and UDP traffic from our Google VPC and RDP traffic from the Internet (which we will use to logon to a VM in Azure to test private connectivity between the two clouds):
We need to create two Public IP Address (equivalent of an External IP in GCP) which will be used for our VPN gateway and for the VM we will create:
# create public IP address for VM
$vmpip = New-AzPublicIpAddress `
-Name "vm-ip" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
# create public IP address for NW gateway
$ngwpip = New-AzPublicIpAddress `
-Name "ngw-ip" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
Step 6 : Create Virtual Network Gateway (Azure)
The Virtual Network Gateway in Azure is the VPN Gateway equivalent in Azure which will be used to create a VPN tunnel between Azure and a GCP VPN Gateway. This gateway will be placed in the Gateway subnet created previously and one of the Public IP addresses created in the previous step will be assigned to this gateway.
# create virtual network gateway
$ngwipconfig = New-AzVirtualNetworkGatewayIpConfig `
-Name "ngw-ipconfig" `
-SubnetId $gatewaySubnet.Id `
# use the AsJob switch as this is a long running process
$job = New-AzVirtualNetworkGateway -Name "vnet-gateway" `
-ResourceGroupName "azure-to-gcp" `
-Location "Australia Southeast" `
-IpConfigurations $ngwipconfig `
-GatewayType "Vpn" `
-VpnType "RouteBased" `
-GatewaySku "VpnGw1" `
-VpnGatewayGeneration "Generation1" `
$vnetgw = Get-AzVirtualNetworkGateway `
-Name "vnet-gateway" `
Step 7 : Create a VPC Network and Subnetwork(s) (GCP)
A VPC network and subnet need to be created in GCP, the subnet defines the VPC address space. This address space must not overlap with the Azure Virtual Network CIDR. For all GCP steps it is assumed that the project is set for client config (e.g. gcloud config set project <>) so it does not need to be specified for each operation. Private Google access should be enabled on all subnets created.
Now we will create the GCP side of our VPN tunnel using the Public IP Address of the Azure Virtual Network Gateway created in a previous step. As this example uses a route based VPN the traffic selector values need to be set at 0.0.0.0/0. A PSK (Pre Shared Key) needs to be supplied which will be the same key used when we configure a VPN Connection on the Azure side of the tunnel.
As we are using static routing (as opposed to dynamic routing) we will need to define all of the specific routes on the GCP side. We will need to setup routes for both outgoing traffic to the Azure network as well as incoming traffic for the restricted Google API range (188.8.131.52/30).
Now we can setup the Azure side of the VPN Connection which is accomplished by associating the Azure Virtual Network Gateway with the Local Network Gateway. A PSK (Pre Shared Key) needs to be supplied which is the same key used for the GCP VPN Tunnel in step 10.
Perform an nslookup to ensure that calls to googleapis.com resolve to the restricted API range (e.g. nslookup storage.googleapis.com). You should see a response showing the A records from the googleapis.com Private DNS Zone created in step 14.
Now test connectivity to Google APIs, for example to test access to Google Cloud Storage using gsutil, or test access to Big Query using the bq command
In the previous post in this series Spark in the Google Cloud Platform Part 1, we started to explore the various ways in which we could deploy Apache Spark applications in GCP. The first option we looked at was deploying Spark using Cloud DataProc, a managed Hadoop cluster with various ecosystem components included.
Spark Training Courses from the AlphaZetta Academy
In this post, we will look at another option for deploying Spark in GCP – a Spark Standalone cluster running on GKE.
Spark Standalone refers to the in-built cluster manager provided with each Spark release. Standalone can be a bit of a misnomer as it sounds like a single instance – which it is not, standalone simply refers to the fact that it is not dependent upon any other projects or components – such as Apache YARN, Mesos, etc.
A Spark Standalone cluster consists of a Master node or instance and one of more Worker nodes. The Master node serves as both a master and a cluster manager in the Spark runtime architecture.
The Master process is responsible for marshalling resource requests on behalf of applications and monitoring cluster resources.
The Worker nodes host one or many Executor instances which are responsible for carrying out tasks.
Deploying a Spark Standalone cluster on GKE is reasonably straightforward. In the example provided in this post we will set up a private network (VPC), create a GKE cluster, and deploy a Spark Master pod and two Spark Worker pods (in a real scenario you would typically have many Worker pods).
Once the network and GKE cluster have been deployed, the first step is to create Docker images for both the Master and Workers.
The Dockerfile below can be used to create an image capable or running either the Worker or Master daemons:
Note the shell scripts included in the Dockerfile: spark-master and spark-worker. These will be used later on by K8S deployments to start the relative Master and Worker daemon processes in each of the pods.
Next, we will use Cloud Build to build an image using the Dockerfile are store this in GCR (Google Container Registry), from the Cloud Build directory in our project we will run:
Now from within the k8s-deployments\deploy folder of our project we will use the kubectl command to deploy the Master pod, service and the Worker pods
Starting with the Master deployment, this will deploy our Spark Standalone image into a container running the Master daemon process:
To deploy the Master, run the following:
kubectl create -f spark-master-deployment.yaml
The Master will expose a web UI on port 8080 and an RPC service on port 7077, we will need to deploy a K8S service for this, the YAML required to do this is shown here:
To deploy the Master service, run the following:
kubectl create -f spark-master-service.yaml
Now that we have a Master pod and service up and running, we need to deploy our Workers which are preconfigured to communicate with the Master service.
The YAML required to deploy the two Worker pods is shown here:
To deploy the Worker pods, run the following:
kubectl create -f spark-worker-deployment.yaml
You can now inspect the Spark processes running on your GKE cluster.
kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
spark-master 1/1 1 1 7m45s
spark-worker 2/2 2 2 9s
kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-master-f69d7d9bc-7jgmj 1/1 Running 0 8m
spark-worker-55965f669c-rm59p 1/1 Running 0 24s
spark-worker-55965f669c-wsb2f 1/1 Running 0 24s
Next, as we need to expose the Web UI for the Master process we will create a LoadBalancer resource. The YAML used to do this is provided here:
To deploy the LB, you would run the following:
kubectl create -f spark-ui-lb.yaml
NOTE This is just an example, for simplicity we are creating an external LoadBalancer with a public IP, this configuration is likely not be appropriate in most real scenarios, alternatives would include an internal LoadBalancer, retraction of Authorized Networks, a jump host, SSH tunnelling or IAP.
Now you’re up and running!
You can access the Master web UI from the Google Console link shown here:
The Spark Master UI should look like this:
Next we will exec into a Worker pod, get a shell:
kubectl exec -it spark-worker-55965f669c-rm59p -- sh
Now from within the shell environment of a Worker – which includes all of the Spark client libraries, we will submit a simple Spark application:
You can see the results in the shell, as shown here:
Additionally, as all of the container logs go to Stackdriver, you can view the application logs there as well:
This is a simple way to get a Spark cluster running, it is not without its downsides and shortcomings however, which include the limited security mechanisms available (SASL, network security, shared secrets).
In the final post in this series we will look at Spark on Kubernetes, using Kubernetes as the Spark cluster manager and interacting with Spark using the Kubernetes API and control plane, see you then.