Testing Amazon EventBridge events using AWS Step Functions

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/testing-amazon-eventbridge-events-using-aws-step-functions/

This post is written by Siarhei Kazhura, Solutions Architect and Riaz Panjwani, Solutions Architect.

Amazon EventBridge is a serverless event bus that can be used to ingest and process events from a variety of sources, such as AWS services and SaaS applications. With EventBridge, developers can build loosely coupled and independently scalable event-driven applications.

It can be useful to know with EventBridge when events are not able to reach the desired destination. This can be caused by multiple factors, such as:

  1. Event pattern does not match the event rule
  2. Event transformer failure
  3. Event destination expects a different payload (for example, API destinations) and returns an error

EventBridge sends metrics to Amazon CloudWatch, which allows for the detection of failed invocations on a given event rule. You can also use EventBridge rules with a dead-letter queue (DLQ) to identify any failed event deliveries. The messages delivered to the queue contain additional metadata such as error codes, error messages, and the target ARN for debugging.

However, understanding why events fail to deliver is still a manual process. Checking CloudWatch metrics for failures, and then the DLQ takes time. This is evident when developing new functionality, when you must constantly update the event matching patterns and event transformers, and run tests to see if they provide the desired effect. EventBridge sandbox functionality can help with manual testing but this approach does not scale or help with automated event testing.

This post demonstrates how to automate testing for EventBridge events. It uses AWS Step Functions for orchestration, along with Amazon DynamoDB and Amazon S3 to capture the results of your events, Amazon SQS for the DLQ, and AWS Lambda to invoke the workflows and processing.

Overview

Using the solution provided in this post, users can track events from its inception to delivery and identify where any issues or errors are occurring. This solution is also customizable, and can incorporate integration tests against events to test pattern matching and transformations.

Reference architecture

At a high level:

  1. The event testing workflow is exposed via an API Gateway endpoint, and users can send a request.
  2. This request is validated and routed to a Step Functions EventTester workflow, which performs the event test.
  3. The EventTester workflow creates a sample event based on the received payload, and performs multiple tests on the sample event.
  4. The sample event is matched against the rule that is being tested. The results are stored in an Amazon DynamoDB EventTracking table, and the transformed event payload is stored in the TransformedEventPayload Amazon S3 bucket.
  5. The EventTester workflow has an embedded AWS Step Functions workflow called EventStatusPoller. The EventStatusPoller workflow polls the EventTracking table.
  6. The EventStatusPoller workflow has a customizable 10-second timeout. If the timeout is reached, this may indicate that the event pattern does not match. EventBridge tests if the event does not match against a given pattern, using the AWS SDK for EventBridge.
  7. After completing the tests, the response is formatted and sent back to the API Gateway. By default, the timeout is set to 15 seconds.
  8. API Gateway processes the response, strips the unnecessary elements, and sends the response back to the issuer. You can use this response to verify if the test event delivery is successful, or identify the reason a failure occurred.

EventTester workflow

After an API call, this event is sent to the EventTester express workflow. This orchestrates the automated testing, and returns the results of the test.

EventTester workflow

In this workflow:

1. The test event is sent to EventBridge to see if the event matches the rule and can be transformed. The result is stored in a DynamoDB table.
2. The PollEventStatus synchronous Express Workflow is invoked. It polls the DynamoDB table until a record with the event ID is found or it reaches the timeout. The configurable timeout is 15 seconds by default.
3. If a record is found, it checks the event status.

From here, there are three possible states. In the first state, if the event status has succeeded:

4. The response from the PollEventStatus workflow is parsed and the payload is formatted.
5. The payload is stored in an S3 bucket.
6. The final response is created, which includes the payload, the event ID, and the event status.
7. The execution is successful, and the final response is returned to the user.

In the second state, if no record is found in the table and the PollEventStatus workflow reaches the timeout:

8. The most likely explanation for reaching the timeout is that the event pattern does not match the rule, so the event is not processed. You can build a test to verify if this is the issue.
9. From the EventBridge SDK, the TestEventPattern call is made to see if the event pattern matches the rule.
10. The results of the TestEventPattern call are checked.
11. If the event pattern does not match the rule, then the issue has been successfully identified and the response is created to be sent back to the user. If the event pattern matches the rule, then the issue has not been identified.
12. The response shows that this is an unexpected error.

In the third state, this acts as a catch-all to any other errors that may occur:

13. The response is created with the details of the unexpected error.
14. The execution has failed, and the final response is sent back to the user.

Event testing process

The following diagram shows how events are sent to EventBridge and their results are captured in S3 and DynamoDB. This is the first step of the EventTester workflow:

Event testing process

When the event is tested:

  1. The sample event is received and sent to the EventBridge custom event bus.
  2. A CatchAll rule is triggered, which captures all events on the custom event bus.
  3. All events from the CatchAll rule are sent to a CloudWatch log group, which allows for an original payload inspection.
  4. The event is also propagated to the EventTesting rule. The event is matched against the rule pattern, and if successful the event is transformed based on the transformer provided.
  5. If the event is matched and transformed successfully, the Lambda function EventProcessor is invoked to process the transformed event payload. You can add additional custom code to this function for further testing of the event (for example, API integration with the transformed payload).
  6. The event status is updated to SUCCESS and the event metadata is saved to the EventTracking DynamoDB table.
  7. The transformed event payload is saved to the TransformedEventPayload S3 bucket.
  8. If there’s an error, EventBridge sends the event to the SQS DLQ.
  9. The Lambda function ErrorHandler polls the DLQ and processes the errors in batches.
  10. The event status is updated to ERROR and the event metadata is saved to the EventTracking DynamoDB table.
  11. The event payload is saved to the TransformedEventPayload S3 bucket.

EventStatusPoller workflow

EventStatusPoller workflow

When the poller runs:

  1. It checks the DynamoDB table to see if the event has been processed.
  2. The result of the poll is checked.
  3. If the event has not been processed, the workflow loops and polls the DynamoDB table again.
  4. If the event has been processed, the results of the event are passed to next step in the Event Testing workflow.

Visit Composing AWS Step Functions to abstract polling of asynchronous services for additional details.

Testing at scale

Testing at scale

The EventTester workflow uses Express Workflows, which can handle testing high volume event workloads. For example, you can run the solution against large volumes of historical events stored in S3 or CloudWatch.

This can be achieved by using services such as Lambda or AWS Fargate to read the events in batches and run tests simultaneously. To achieve optimal performance, some performance tuning may be required depending on the scale and events that are being tested.

To minimize the cost of the demo, the DynamoDB table is provisioned with 5 read capacity units and 5 write capacity units. For a production system, consider using on-demand capacity, or update the provisioned table capacity.

Event sampling

Event sampling

In this implementation, the EventBridge EventTester can be used to periodically sample events from your system for testing:

  1. Any existing rules that must be tested are provisioned via the AWS CDK.
  2. The sampling rule is added to an existing event bus, and has the same pattern as the rule that is tested. This filters out events that are not processed by the tested rule.
  3. SQS queue is used for buffering.
  4. Lambda function processes events in batches, and can optionally implement sampling. For example, setting a 10% sampling rate will take one random message out of 10 messages in a given batch.
  5. The event is tested against the endpoint provided. Note that the EventTesting rule is also provisioned via AWS CDK from the same code base as the tested rule. The tested rule is replicated into the EventTesting workflow.
  6. The result is returned to a Lambda function, and is then sent to CloudWatch Logs.
  7. A metric is set based on the number of ERROR responses in the logs.
  8. An alarm is configured when the ERROR metric crosses a provided threshold.

This sampling can complement existing metrics exposed for EventBridge via CloudWatch.

Solution walkthrough

To follow the solution walkthrough, visit the solution repository. The walkthrough explains:

  1. Prerequisites required.
  2. Detailed solution deployment walkthrough.
  3. Solution customization and testing.
  4. Cleanup process.
  5. Cost considerations.

Conclusion

This blog post outlines how to use Step Functions, Lambda, SQS, DynamoDB, and S3 to create a workflow that automates the testing of EventBridge events. With this example, you can send events to the EventBridge Event Tester endpoint to verify that event delivery is successful or identify the root cause for event delivery failures.

For more serverless learning resources, visit Serverless Land.