AWS X-Ray with Lambda: Distributed Tracing for the Win

I’m going to take you through why we use X-Ray, how to use X-Ray to gain valuable insights and debug errors, and point out things to watch out for.

FloQast Engineering

By Jess Perina

AWS X-Ray with Lambda is the easiest way to get started with tracing a simple function or distributed services. Why? Because it’s built right in and it’s free-ish*!

I’m going to take you through why we use X-Ray with our Lambdas at FloQast. After that, I’ll show you how we use the X-Ray console to gain valuable information and debug issues. Finally, I’ll finish by pointing out some things to watch out for and sharing resources I’ve found helpful.

*AWS X-Ray has a perpetual free tier. The first 100,000 traces recorded and 1,000,000 traces retrieved or scanned each month are free. You can use sampling rules to control the number of traces AWS records. Trace data is saved for 30 days.
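To get a feel for whether you’ll stay inside that free tier, here is a rough back-of-the-envelope sketch. It assumes a sampling rule shaped like X-Ray’s default (trace the first request each second, plus 5% of any additional requests), perfectly steady traffic, and a 30-day month. The function names are my own for illustration and are not part of any AWS SDK, so treat the numbers as a ballpark rather than a bill.

```javascript
// Free tier from the footnote above: first 100,000 recorded traces per month.
const FREE_TRACES_RECORDED_PER_MONTH = 100000;

// Assumption: steady traffic over a 30-day month.
const SECONDS_PER_MONTH = 30 * 24 * 60 * 60;

/**
 * Estimate traces recorded per month under a "reservoir + fixed rate" rule:
 * up to `reservoirPerSecond` requests each second are always traced, and
 * `fixedRate` of the remaining requests is traced on top of that.
 * The defaults (1 per second, 5%) are assumed to mirror the default rule.
 */
function estimateMonthlyTraces({ requestsPerSecond, reservoirPerSecond = 1, fixedRate = 0.05 }) {
  const tracedByReservoir = Math.min(requestsPerSecond, reservoirPerSecond);
  const tracedByRate = Math.max(requestsPerSecond - reservoirPerSecond, 0) * fixedRate;
  return Math.round((tracedByReservoir + tracedByRate) * SECONDS_PER_MONTH);
}

/** Number of recorded traces that fall outside the monthly free tier. */
function tracesBeyondFreeTier(recordedTraces) {
  return Math.max(recordedTraces - FREE_TRACES_RECORDED_PER_MONTH, 0);
}

// A quiet function (about one request every 100 seconds) and a busy one.
const quiet = estimateMonthlyTraces({ requestsPerSecond: 0.01 });
const busy = estimateMonthlyTraces({ requestsPerSecond: 21 });
console.log('quiet:', quiet, 'beyond free tier:', tracesBeyondFreeTier(quiet));
console.log('busy:', busy, 'beyond free tier:', tracesBeyondFreeTier(busy));
```

If the busy case worries you, lowering the fixed rate in a custom sampling rule is the lever to reach for.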

Why bother with X-Ray?

We know that utilizing “the cloud” for your application can save you time, money, and headaches. However, when using this strategy you give up some control when it comes to seeing the full request cycle and how a request travels between the different services in your app. As a result, you can find it difficult to track down issues when they arise.

Would you like to be able to see the health of your application at a glance? Maybe you want to track down and troubleshoot those errors you’re serving up. Or, perhaps you’re more concerned about fine tuning your app for performance. I don’t like digging through multiple server logs. Do you? Instead, you can use AWS X-Ray with your Lambdas to visualize the components of your application, identify performance bottlenecks, and troubleshoot requests that resulted in an error.

Our teams have learned the hard way that the time to implement this kind of monitoring is NOT when you realize you need data to make a business critical decision or track down a tricky error. Get X-Ray set up before you think you need it. You’ll be happy you did. My colleague Trung has written about how his team has implemented AWS’s X-Ray in our main app and uses it to solve some of the problems mentioned above. He goes into detail about the code his team wrote to get the most out of X-Ray. In addition, he highlights how it’s helping them to make better decisions that are data driven and trackable.

How to get that X-Ray trace data flowing?

Let’s get into the set up.

1. Go to the page that lists your Lambda functions.

2. Select the function for which you would like to activate X-Ray.

3. Scroll past your function code to the box labeled AWS X-Ray and click the “Active tracing” box.

The above steps are great for a personal project. However, in a professional setting, when working at scale with multiple environments, the last thing you should be doing is clicking a bunch of buttons in AWS. FloQast has an amazing DevOps team that’s all about those efficiencies and setting up infrastructure as code (IaC). We use Terraform to handle the creation of our AWS services and the activation of options like active tracing. Setting `mode` to “Active” inside the `tracing_config` block of your Terraform should take care of the above steps without a single click. Take a look at the example below from one of our Lambda Terraform configs, or check out the Terraform docs.

resource "aws_lambda_function" "gl_core" {
  function_name = "gl_core-${var.environment}"
  filename      = "example.zip"
  role          = "${aws_iam_role.lambda_role.arn}"
  handler       = "index.handler"
  runtime       = "nodejs10.x"
  memory_size   = "1024"
  timeout       = "900"

  description = "GL Core Lambda - Managed by Terraform"

  vpc_config = {
    subnet_ids         = ["${data.terraform_remote_state.core-base.private_subnets}"]
    security_group_ids = ["${aws_security_group.gl-service_lambda.id}"]
  }

  environment {
    variables = {
      AWS_ACCOUNT = "${var.aws_account}"
      RUN_ENV     = "${var.environment}"
      NODE_ENV    = "${var.environment}"
    }
  }

  layers = ["${data.terraform_remote_state.gl_service_layer.layer_arn}"]

  tags {
    Name = "fq-gl_core-lambda-${var.environment}"
  }

  lifecycle {
    ignore_changes = [
      "filename"
    ]
  }

  # This block is what turns on X-Ray active tracing for the function.
  tracing_config {
    mode = "Active"
  }
}

If you’re curious and want to read more about IaC, we’ve got you covered.

You have now activated X-Ray for your Lambda function. (You may see a message that your function’s execution role doesn’t have permission to send trace data to X-Ray, but AWS will attempt to add this permission to the role for you automatically.) Clicking the “View traces in X-Ray” button will take you to the X-Ray console.
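If neither the console nor Terraform fits your situation, the same switch can be flipped from a script. The following is a minimal sketch, assuming the v2 `aws-sdk` for Node.js and its `updateFunctionConfiguration` call on the Lambda client. The helper name `enableActiveTracing` is mine, and the client is passed in as a parameter so the function can be tried against a stub before pointing it at a real account.

```javascript
/**
 * Turn on active tracing for one Lambda function.
 *
 * `lambdaClient` is assumed to behave like `new AWS.Lambda()` from the
 * v2 aws-sdk, where request methods return an object with `.promise()`.
 * Resolves with the tracing mode echoed back in the response, if any.
 */
async function enableActiveTracing(lambdaClient, functionName) {
  const result = await lambdaClient
    .updateFunctionConfiguration({
      FunctionName: functionName,
      TracingConfig: { Mode: 'Active' },
    })
    .promise();

  // The response describes the updated function, including its tracing config.
  return result.TracingConfig ? result.TracingConfig.Mode : undefined;
}

// Hypothetical usage (the function name is only an example):
//   const AWS = require('aws-sdk');
//   enableActiveTracing(new AWS.Lambda(), 'gl_core-dev').then(console.log);
```

The execution role still needs permission to send trace data to X-Ray, exactly as with the console route.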

You can stop here if you want. If you do nothing else you will still have unlocked the ability to gain valuable insights into your application via the X-Ray console. X-Ray will also continue these traces through other AWS services your Lambda interacts with that have X-Ray support.

Want to take X-Ray and Lambda even further?

While it’s helpful to know when your app is serving up errors it can still be hard to figure out what’s going on if issues are intermittent. Out of the box, X-Ray with Lambda gives you the basics, but it’s extensible so you can suit it to your needs. For example, in your Lambda, you can use the X-Ray SDK to extend the main Invocation subsegment, which is set up for you, with additional subsegments for downstream calls, annotations, and metadata.

In other words, you can tell X-Ray to track and trace whatever data you want. Curious about how long a series of calls is taking? Wrap it in a subsegment and X-Ray will let you know. Wanna see if all those errors are related to the same customer? Create a subsegment that includes a customer ID as an annotation. Annotations are searchable in the X-Ray console and can be used to group your traces. Want to keep track of some general request info that you don’t need to be searchable in X-Ray? Add a subsegment with some metadata. Allow me to show you some examples of how my team is using X-Ray with our Lambdas.
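To make those three ideas concrete, here is a small sketch of a wrapper that opens a subsegment, attaches annotations and metadata, and always closes the subsegment when the work finishes. It is not the code we run in production. `withSubsegment` is a name I made up, and the parent segment is passed in (in a Lambda you would typically get it from `AWSXRay.getSegment()` in `aws-xray-sdk-core`) so the helper has no hard dependency on the SDK.

```javascript
/**
 * Run `work` inside a new subsegment of `parent` and time it.
 *
 * `parent` is assumed to be an X-Ray segment or subsegment, meaning it
 * exposes addNewSubsegment(), and the subsegment it returns exposes
 * addAnnotation(), addMetadata() and close().
 */
async function withSubsegment(parent, name, { annotations = {}, metadata = {} } = {}, work) {
  const subsegment = parent.addNewSubsegment(name);

  // Annotations are indexed, so they can be used to search and group traces.
  for (const [key, value] of Object.entries(annotations)) {
    subsegment.addAnnotation(key, value);
  }

  // Metadata travels with the trace but is not searchable.
  for (const [key, value] of Object.entries(metadata)) {
    subsegment.addMetadata(key, value);
  }

  try {
    const result = await work(subsegment);
    subsegment.close();
    return result;
  } catch (err) {
    // Closing with the error records it on the subsegment before rethrowing.
    subsegment.close(err);
    throw err;
  }
}

// Hypothetical usage inside a handler (loadAccounts and customerId are examples):
//   const AWSXRay = require('aws-xray-sdk-core');
//   const accounts = await withSubsegment(
//     AWSXRay.getSegment(),
//     'load accounts',
//     { annotations: { customerId }, metadata: { source: 'nightly-sync' } },
//     () => loadAccounts(customerId)
//   );
```

Closing in both branches matters: a subsegment that never closes can leave the timing and error details you were after missing from the trace.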

Now that you’ve got it, how do you use it?

Below you can see the X-Ray console and two Lambda based services we have with tracing enabled. One is looking healthy. The other, not so much. Let’s go over both.

The top service lives almost entirely in the world of AWS and utilizes Lambdas, SSM, and DynamoDB. The second service uses a Lambda to make requests to a single third party API. Let’s start with the healthy service. When you select a node on the service map you’ll get a popup with a graph and some options.

Everything looks good here

The graph shows latency: the amount of time between when a request starts and when it completes. Most of the requests to this particular Lambda take around 5.7 seconds to complete. From here we can click the view traces button to see individual traces.

The traces page gives you a list of sampled requests with a bunch of additional information. You may be wondering why the average response time here is 45ms when I just told you that first Lambda node takes around five and a half seconds. The shorter response time is from the Lambda service node and shows how long the Lambda service took to respond and kick off our Lambda function. Lambda will always add two nodes to your service map: one for the Lambda service and one for your function. The first one will show how long it took to fire up the service, the time it spent waiting around (dwell time), and how long your Lambda function took to execute. This is also where you’ll see the number of attempts your Lambda made, because retry is built in.

Additionally, if we look further down this trace we can see each step in the request cycle, a small sample of which you can see below.

As I mentioned earlier, we get tracing of the interactions between most AWS services for free, but what about stuff that isn’t AWS? We wanted to get a better idea of how long our third party API calls were taking and which customers they were for without digging through our logs. We utilized the X-Ray SDK and subsegments to accomplish this. Right before we make the API call we set up a subsegment, name it, then add a customer ID as an annotation.

module.exports.getFilesInTable = async (tableId, ****Token, tlcId) => {
  const subSegment = Xray.getSegment().addNewSubsegment('sending **** Request: getFilesInTable');
  subSegment.addAnnotation('tlcId', tlcId);
  ...

If we select that section of the trace we can see more information about it, including our ID annotation. Very handy if we start seeing any issues.

The red ring of despair

Bugs, defects, edge cases, user errors… all of these will pop up in your app at some point, so be ready for them. X-Ray can help you pinpoint issues and debug faster. Let’s take a look at the second, not so healthy service. Right away you can see something is wrong here. The colored rings indicate the rate of errors or faults at each node. The service details allow you to select a response status to filter by before you click view traces, so you can home in on the trouble spots.

If we dig into one of these traces we can see the fault is coming from a call to our third party API, which is throwing an error in our Lambda. If we hover over the fault icon we can see the error message “Error Unauthorized”. The section labeled accounts is another custom subsegment set up to help us time not just the API call and the resulting data retrieval, but also the processing we do on the data when we get it. If we click on it we can get more details about the error.

X-Ray’s overview tab gives us some general information about the selected node, most of which we can already see from the trace details. It is worth noting the segment and parent IDs. These are how X-Ray tracks all the calls made between services and creates traces and service maps. The information we need, however, is in the exceptions tab.

X-Ray’s exceptions tab gives us the whole stack trace and we didn’t have to search a single server log! With this many errors coming through, our next step would be to check the annotations and metadata for any helpful information added to the subsegment, such as a customer ID. In addition, in the main traces page for this service we could even group the traces by a customer ID annotation. Then we can see if all our issues are coming from the same source.
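The same kind of search can be scripted. Below is a rough sketch that builds an X-Ray filter expression for an annotation and asks for matching trace summaries. It assumes the `annotation.KEY = "value"` filter syntax with the `fault` keyword, and the v2 `aws-sdk` X-Ray client’s `getTraceSummaries` call. The helper names are mine, the client is injected so it can be stubbed, only string annotation values are handled, and pagination is ignored, so take it as a starting point rather than a finished tool.

```javascript
/**
 * Build a filter expression that matches traces carrying a given
 * string annotation, optionally limited to traces that ended in a fault.
 */
function annotationFilter(key, value, { faultsOnly = false } = {}) {
  const text = String(value);
  // Escaping rules for quotes are not covered here, so refuse such values.
  if (text.includes('"')) {
    throw new Error('annotation values containing double quotes are not supported');
  }
  const parts = [`annotation.${key} = "${text}"`];
  if (faultsOnly) {
    parts.push('fault');
  }
  return parts.join(' AND ');
}

/**
 * Fetch the IDs of recent traces matching `filterExpression`.
 * `xrayClient` is assumed to behave like `new AWS.XRay()` from aws-sdk v2.
 * Only the first page of results is read.
 */
async function findTraceIds(xrayClient, filterExpression, minutesBack = 30, now = new Date()) {
  const result = await xrayClient
    .getTraceSummaries({
      StartTime: new Date(now.getTime() - minutesBack * 60 * 1000),
      EndTime: now,
      FilterExpression: filterExpression,
    })
    .promise();
  return (result.TraceSummaries || []).map((summary) => summary.Id);
}

// Hypothetical usage, reusing the tlcId annotation from earlier:
//   const AWS = require('aws-sdk');
//   const filter = annotationFilter('tlcId', 'abc123', { faultsOnly: true });
//   findTraceIds(new AWS.XRay(), filter).then(console.log);
```

The expression string on its own is also handy: you can paste it into the search bar on the traces page.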

Things to note and watch out for

You may not need to activate X-Ray for your Lambda. If you’ve enabled X-Ray tracing in an AWS service that invokes your function, Lambda will send trace data to X-Ray automatically. The upstream service, such as API Gateway or an application hosted on EC2 that is instrumented with the X-Ray SDK, samples incoming requests and adds a tracing header that tells Lambda whether or not to send traces. However, if you want to trace something specific to your Lambda via a subsegment, you will still need to use the X-Ray SDK. Similarly, the X-Ray SDK is also required for AWS services that do not have native support for X-Ray, such as CloudWatch and S3. To accomplish this, utilize the captureAWS method to trace all AWS services.

const AWSXRay = require('aws-xray-sdk-core');
AWSXRay.captureAWS(require('aws-sdk'));

OR

const AWS = require('aws-sdk');
const AWSXRay = require('aws-xray-sdk-core');
AWSXRay.captureAWS(AWS);

Conversely, you also have the option of capturing a single AWS service, using the captureAWSClient method, if you want to be mindful of how much trace data you're generating.

const s3 = AWSXRay.captureAWSClient(new AWS.S3());

AWS has extended support for X-Ray to many of its services. So, as noted above, if your request starts in a service that has X-Ray tracing enabled, trace data will also be sent for the other X-Ray supported services your request touches. With the addition of the capture methods you can extend your traces to all AWS services. However, you cannot add annotations and metadata from these captured services. Notably, there are exceptions to this support in some features of S3. Due to how the S3 SDK is written, there are some methods and connections that are not traced (see interesting issues below).
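If you ever want to check what an upstream service decided, the tracing header mentioned earlier is easy to read by hand. This sketch parses a header of the form `Root=...;Parent=...;Sampled=...`, which is my understanding of the `X-Amzn-Trace-Id` format. Inside a Lambda function the same value should be available in the `_X_AMZN_TRACE_ID` environment variable. `parseTraceHeader` is a hypothetical helper, not an SDK function, and the sample value is only an illustration.

```javascript
/**
 * Parse an X-Ray tracing header into its parts.
 * Unknown fields are ignored and missing fields come back as undefined.
 */
function parseTraceHeader(header) {
  const fields = {};
  for (const part of header.split(';')) {
    const index = part.indexOf('=');
    if (index === -1) continue; // skip anything that is not key=value
    fields[part.slice(0, index).trim()] = part.slice(index + 1).trim();
  }

  // Trace IDs look like 1-<8 hex digits of epoch seconds>-<24 hex digits>.
  const match = /^1-([0-9a-f]{8})-[0-9a-f]{24}$/i.exec(fields.Root || '');

  return {
    traceId: fields.Root,
    parentId: fields.Parent,
    // Sampled=1 means trace data should be sent for this request.
    sampled: fields.Sampled === '1',
    startedAt: match ? new Date(parseInt(match[1], 16) * 1000) : undefined,
  };
}

const example = 'Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1';
console.log(parseTraceHeader(example));

// Inside a Lambda function (assumed environment variable):
//   console.log(parseTraceHeader(process.env._X_AMZN_TRACE_ID || ''));
```

Logging the `sampled` flag is a quick way to confirm why a particular request did or did not show up in the console.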

Additional Resources

If you’re interested in falling further down the AWS X-Ray with Lambda rabbit hole, or prefer to learn in a different way, I’ve included many of the resources I’ve found helpful.

Videos

Articles

Docs

Interesting issues

Originally published at https://floqast.com on June 10, 2020.
