IT automation performance insights with OpenCensus and Google Cloud StackDriver
Overview
Wouldn’t it be great if you could easily identify the bottleneck in your application or processes? Even better, what if you were alerted when the baseline performance changes?
This article will demonstrate how you can achieve this, using OpenCensus and StackDriver tracing within Google Cloud.
What is a trace?
A trace is a dataset which describes how a request flows through the various components of a system. It typically includes data such as processing time (latency) at various stages. A waterfall view can be used to visualize processing times, similar to that provided by developer tools within a web browser like Google Chrome, but using a different source of data.
The waterfall chart in Chrome developer tools is visually similar
What components are involved in tracing?
Tracing is typically enabled within the source code of an application, or activated via middleware. A popular tracing library is OpenCensus, which originated from a Google internal library called Census. OpenCensus integrates with various programming languages and can export traces and metrics to external systems including Google StackDriver, Prometheus, SignalFx and Zipkin.
Sample architecture diagram for tracing
A trace represents a single request as it flows through a system. It includes one parent span and, optionally, one or more child spans. A span represents a measurement of one or more operations in a trace; for example, a span could encompass a database query, an HTTP request to an API, or a function within the source code.
Example trace for a web request, which includes multiple spans for various operations
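The parent/child relationship can be sketched as a small data structure. This is a minimal illustration, not how OpenCensus stores spans internally; the Span class, its fields, and the timings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for one timed operation in a trace."""
    name: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def waterfall(span: Span, depth: int = 0) -> List[str]:
    """Return one indented line per span, parent first, children in order."""
    lines = ["{}{} ({:.0f} ms)".format("  " * depth, span.name, span.duration_ms)]
    for child in span.children:
        lines.extend(waterfall(child, depth + 1))
    return lines


# Made-up timings for a web request with two child operations.
trace = Span("/checkout", 0, 120, [
    Span("db-query", 5, 45),
    Span("payment-api", 50, 115),
])
print("\n".join(waterfall(trace)))
```

Each level of indentation in the output corresponds to one level of nesting in a waterfall view.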
Enabling traces for IT automation processes
I frequently develop software solutions to automate IT processes in the domain of computer networking. This led me to explore how I can gain visibility into the performance and reliability of automation.
A typical automation process may involve several operations, for example:
- Retrieving data from a master data source
- Retrieving data from an IT system
- Comparing data in both systems
- Performing multiple create/read/update/delete (CRUD) operations in an IT system
- Generating a report of changes
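The comparison step in the list above can be sketched as a dictionary diff. This is a minimal sketch, assuming both systems can be reduced to a mapping of record key to value; plan_changes and the sample hostnames are hypothetical and not part of the original script.

```python
from typing import Dict, List, Tuple


def plan_changes(master: Dict[str, str],
                 it_system: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (operation, key) pairs needed to make it_system match master."""
    changes = []
    for key in sorted(master.keys() | it_system.keys()):
        if key not in it_system:
            changes.append(("create", key))
        elif key not in master:
            changes.append(("delete", key))
        elif master[key] != it_system[key]:
            changes.append(("update", key))
    return changes


# Hypothetical records: hostname -> IP address.
master = {"sw1": "10.0.0.1", "sw2": "10.0.0.2", "sw3": "10.0.0.3"}
it_system = {"sw1": "10.0.0.1", "sw2": "10.0.0.9", "sw4": "10.0.0.4"}
print(plan_changes(master, it_system))
```

Each planned change would then map to one CRUD call against the IT system, which is a natural unit to wrap in its own span.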
The following section demonstrates how to enable tracing within a simple Python script, including exporting traces to StackDriver in Google Cloud.
Prerequisites
The following items are required to implement the solution:
- Google Cloud project, with service account (Cloud Trace Agent IAM role assigned)
- Python libraries for OpenCensus, including the exporter for StackDriver
- Source code (the process we’ll enable traces within)
Google Cloud Project and service account
The Google Cloud project is used to store and visualize trace data sent by the OpenCensus libraries. You can sign up for a personal Google Cloud account, which includes a free credit. Create a service account and assign the ‘Cloud Trace Agent’ Identity and Access Management (IAM) role. Then create and download a key; this JSON file will be used by OpenCensus to authenticate to the Google Cloud project. Refer to the document ‘Getting Started with Authentication’ for more information about authenticating to Google Cloud.
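Google's client libraries conventionally locate the key through the GOOGLE_APPLICATION_CREDENTIALS environment variable. The sketch below sets it from Python after a basic sanity check; the helper name use_service_account_key and the file path are hypothetical, and exporting the variable in your shell before running the script works just as well.

```python
import json
import os


def use_service_account_key(path: str) -> str:
    """Check that path looks like a service account key, then point
    Application Default Credentials at it. Returns the key's project_id."""
    with open(path) as handle:
        key = json.load(handle)
    if key.get("type") != "service_account":
        raise ValueError("{} is not a service account key file".format(path))
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath(path)
    return key["project_id"]


# Example usage (the path is a placeholder):
# project_id = use_service_account_key("my-service-account-key.json")
```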
Python libraries for OpenCensus
Install the Python libraries using pip:
pip install opencensus
pip install opencensus-ext-stackdriver
Sample source code
Below is an example of how OpenCensus has been integrated into an existing script. A key point is that any code nested under the following statement contributes to one span measurement:
with tracer.span(name="name of span") as span:
As the code is executed by the Python interpreter, the first span opened becomes the parent, and any span opened inside it is recorded as a child span. The name of the span is arbitrary, but I chose to use a URI convention to illustrate the structure/flow of the automation process.
Annotations can be added to spans, enabling additional context when viewing the waterfall chart.
span.add_annotation("processing item {}".format(item))
Labels (called attributes in OpenCensus) can be created to provide more information when viewing the details of a span.
tracer.add_attribute_to_current_span(attribute_key='Operation', attribute_value='Delete')
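To make the parent/child behaviour concrete, here is a toy tracer built only on the standard library. It is not OpenCensus code: ToyTracer is a hypothetical class that mimics the nesting rule, where a span opened while another is active becomes its child.

```python
from contextlib import contextmanager


class ToyTracer:
    """Hypothetical tracer that records (span name, parent name) pairs."""

    def __init__(self):
        self._stack = []      # names of spans that are currently open
        self.recorded = []    # (name, parent) in the order spans were opened

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self.recorded.append((name, parent))
        self._stack.append(name)
        try:
            yield name
        finally:
            self._stack.pop()  # leaving the with block ends the span


toy = ToyTracer()
with toy.span("/it-process"):
    with toy.span("/it-process/get-master-data"):
        pass
    with toy.span("/it-process/sync-data"):
        pass
print(toy.recorded)
```

The first span has no parent, and both inner spans report /it-process as their parent because they were opened while it was still active.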
Sample Python script with traces enabled
# import OpenCensus modules
from opencensus.common.transports.async_ import AsyncTransport
from opencensus.ext.stackdriver import trace_exporter as stackdriver_exporter
from opencensus.trace import tracer as tracer_module

# The some_function_to_* calls below are placeholders for your own logic.
# `tracer` is created in the __main__ block and used here as a module-level global.


def get_master_data():
    # create the first child span, everything executing within the block is measured
    with tracer.span(name="/it-process/get-master-data") as span:
        data = some_function_to_get_master_data()
    return data


def get_it_system_data():
    # create the second child span
    with tracer.span(name="/it-process/get-it-system-data") as span:
        data = some_function_to_get_it_system_data()
    return data


def sync_data(master_data, it_system_data):
    results = []
    for item in master_data:
        # create a further child span for each item
        # everything executed in the block below is measured
        with tracer.span(name="/it-process/sync-data") as span:
            # let's create an annotation showing what item is being processed
            span.add_annotation("processing item {}".format(item))
            response = some_function_to_sync_data(item, it_system_data)
            results.append(response)
    return results


def generate_report(results):
    # create the final child span
    with tracer.span(name="/it-process/generate-report") as span:
        report = some_function_to_generate_report(results)
        # let's add some additional information about the span
        tracer.add_attribute_to_current_span(attribute_key='Total Report Items',
                                             attribute_value=len(report))
    return report


if __name__ == '__main__':
    # set up the OpenCensus tracer and exporter
    exporter = stackdriver_exporter.StackdriverExporter(project_id='my-google-project-id',
                                                        transport=AsyncTransport)
    tracer = tracer_module.Tracer(exporter=exporter)
    # create a parent span, everything which executes inside this block is included in the span
    with tracer.span(name="/it-process") as span:
        master_data = get_master_data()
        it_system_data = get_it_system_data()
        results = sync_data(master_data, it_system_data)
        generate_report(results)
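If wrapping every function body in a with block becomes repetitive, a decorator is one possible alternative. This is a sketch, not part of OpenCensus: traced is a hypothetical helper that works with any object exposing a span(name=...) context manager, which is the interface the script above relies on. FakeTracer exists only so the example runs without a Google Cloud project.

```python
import functools
from contextlib import contextmanager


def traced(get_tracer, name):
    """Run the decorated function inside a span called name.

    get_tracer is a zero-argument callable so the tracer can be created
    after the functions are defined, as in the script above.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with get_tracer().span(name=name):
                return func(*args, **kwargs)
        return wrapper
    return decorator


class FakeTracer:
    """Stand-in tracer that only remembers which spans were opened."""
    def __init__(self):
        self.names = []

    @contextmanager
    def span(self, name):
        self.names.append(name)
        yield self


fake = FakeTracer()


@traced(lambda: fake, "/it-process/get-master-data")
def get_master_data():
    return ["item-1", "item-2"]


print(get_master_data(), fake.names)
```

The trade-off is that a decorator measures the whole function, so annotations on individual steps still need an explicit with block.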
Google Cloud StackDriver Trace
Waterfall chart
After executing the script, the trace should be visible in the StackDriver Trace section of Google Cloud within a short time. As you can see, the parent span is named ‘/it-process’, followed by the child spans ‘/it-process/get-master-data’, ‘/it-process/get-it-system-data’, etc.
Chart highlighting each operation in the script, with processing times
Scatter graph
The scatter graph is useful for spotting outliers, where an operation has taken longer than usual to process. In the graph there is clearly an issue at 9:54 pm. Selecting a data point will display the waterfall chart, along with the operation which introduced the delay.
Scatter graph showing requests by response time
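The same idea of spotting outliers can be applied to latency numbers you have collected yourself. The sketch below flags values far above the median; the function name, the threshold factor of 3, and the sample latencies are assumptions for illustration and do not reflect how StackDriver draws its graph.

```python
from statistics import median
from typing import List


def find_outliers(latencies_ms: List[float], factor: float = 3.0) -> List[float]:
    """Return latencies greater than factor times the median latency."""
    if not latencies_ms:
        return []
    baseline = median(latencies_ms)
    return [value for value in latencies_ms if value > factor * baseline]


# Made-up trace latencies in milliseconds; one run is clearly slower.
samples = [410, 395, 402, 388, 2950, 415, 399]
print(find_outliers(samples))
```

Using the median rather than the mean keeps the baseline from being dragged upwards by the very outliers you are trying to find.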
Analysis Report
You can create customized reports to gain insights into trace data, for example, viewing the percentage of traces in each response time range.
Screenshot of custom trace analysis report
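A report of this kind boils down to bucketing latencies and expressing each bucket as a share of the total. A rough sketch follows, with hypothetical bucket boundaries and made-up latencies; it is independent of the StackDriver API.

```python
from bisect import bisect_right
from typing import Dict, List


def latency_distribution(latencies_ms: List[float],
                         bounds_ms: List[float]) -> Dict[str, float]:
    """Map each latency bucket to the percentage of traces that fall in it.

    bounds_ms must be sorted; a value equal to a bound goes in the higher bucket.
    """
    labels = ["< {} ms".format(bounds_ms[0])]
    labels += ["{}-{} ms".format(lo, hi) for lo, hi in zip(bounds_ms, bounds_ms[1:])]
    labels.append(">= {} ms".format(bounds_ms[-1]))
    counts = [0] * len(labels)
    for value in latencies_ms:
        counts[bisect_right(bounds_ms, value)] += 1
    total = len(latencies_ms) or 1  # avoid dividing by zero on empty input
    return {label: 100.0 * count / total for label, count in zip(labels, counts)}


samples = [120, 480, 510, 950, 1000, 2400, 300, 700]
print(latency_distribution(samples, [500, 1000]))
```

Comparing these percentages between two time windows is one simple way to notice a shift in baseline performance.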
I hope this article was useful and provided some ideas for how you could leverage traces to get insights into your processes and applications.