All posts by Ivo Pinto

Decreasing incident response time for OutSystems with AWS serverless technology

Post Syndicated from Ivo Pinto original https://aws.amazon.com/blogs/architecture/decreasing-incident-response-time-for-outsystems-with-aws-serverless-technology/

Leading modern application platform space OutSystems is a low-code platform that provides tools for companies to develop, deploy, and manage omnichannel enterprise applications.

Security is a top priority at OutSystems. Their Security Operations Center (SOC) deals with thousands of incidents a year, each with a set of response actions that need to be executed as quickly as possible. Providing security at such large scale is a challenge, even for the most well-prepared organizations. Manual and repetitive tasks account for the majority of the response time involved in this process, and decreasing this key metric requires orchestration and automation.

Security orchestration, automation, and response (SOAR) systems are designed to translate security analysts’ manual procedures into automated actions, making them faster and more scalable.

In this blog post, we’ll explore how OutSystems lowered their incident response time by 99 percent by designing and deploying a custom SOAR using Serverless services on AWS.

Solution architecture

Security incidents happen with unknown frequency, making serverless services a natural fit to boost security at OutSystems because of their increased agility and capability to scale to zero.

There are two ways to trigger SOAR actions in this architecture:

  1. Automatically through Security Information and Event Management (SIEM) security incident findings
  2. On-demand through chat application

Using the first method, when a security incident is detected by the SIEM, an event is published to Amazon Simple Notification Service (Amazon SNS). This triggers an AWS Lambda function that creates a ticket in an internal ticketing system. Then the Lambda Playbooks function triggers to decide which playbook to run depending on the incident details.

Each playbook is a set of actions that are executed in response to a trigger. Playbooks are the key component behind automated tasks. OutSystems uses AWS Step Functions to orchestrate the actions and Lambda functions to execute them.

But this solution does not exist in isolation. Depending on the playbook, Step Functions interacts with other components such as AWS Secrets Manager or external APIs.

Using the second method, the on-demand trigger for OutSystems SOAR relies on a chat application. This application calls a Lambda function URL that interacts with the playbooks we just discussed.

Figure 1 represents the high-level architecture of OutSystems’ custom SOAR.

SOAR architecture for AWS

Figure 1. SOAR architecture for AWS

This architecture was deployed with Infrastructure as Code (IaC) using AWS CloudFormation and AWS CodePipeline.

This same IaC architecture is used when new playbooks or updates to existing ones are made. Code changes that are committed to a source control repository trigger the CodePipeline which uses AWS CodeBuild and CloudFormation change sets to deploy the updates to the affected resources.

Use cases

The use cases that OutSystems has deployed playbooks for to date include:

  • SQL injection
  • Unauthorized access to credentials
  • Issuance of new certificates
  • Login brute forces
  • Impossible travel

Let’s explore the Impossible travel use case. Impossible travel happens when a user logs in from one location, and then later logs in from a different location that would be impossible to travel between within the elapsed time.

When the SIEM identifies this behavior, it triggers an alert and the following actions are performed:

  1. A ticket is created
  2. An IP address check is performed in reputation databases, such as AbuseIPDB or VirusTotal
  3. An IP address check is performed in the internal database, and the IP address is added if it is not found
  4. A search is performed for past events with the same IP address
  5. A WHOIS is performed on the IP address
  6. Recent logins of the user are identified in the SIEM, along with all related information
  7. All of this information is automatically added to the ticket. Every step listed here was previously performed manually; a task that took an average of 15 minutes. Now, the process takes just 8 seconds—a 99.1% incident response time improvement.

The following remediation actions can also be automated, along with many others:

Some of these remediation actions are already in place, while others are in development.

Conclusion

At OutSystems, much like at AWS, security is considered “job zero.” It is not only important to be proactive in preventing security incidents, but when they happen, the response must be quick, effective, and as immune to human error as possible.

With the implementation of this custom SOAR, OutSystems reduced the average response time to security incidents by 99%. Tasks that previously took 76 hours of analysts’ time are now accomplished automatically within 31 minutes.

During the evaluation period, SOAR addressed hundreds of real-world incidents with some threat intel use cases being executed thousands of times.

An architecture composed of serverless services ensures OutSystems does not pay for systems that are standing by waiting for work, and at the same time, not compromising on performance.

If you are interested in this topic—how to respond to security incidents using AWS serverless services—be sure you also read the Orchestrating a security incident response with AWS Step Functions and How to get started with security response automation on AWS blog posts.

How Launchmetrics improves fashion brands performance using Amazon EC2 Spot Instances

Post Syndicated from Ivo Pinto original https://aws.amazon.com/blogs/architecture/how-launchmetrics-improves-fashion-brands-performance-using-amazon-ec2-spot-instances/

Launchmetrics offers its Brand Performance Cloud tools and intelligence to help fashion, luxury, and beauty retail executives optimize their global strategy. Launchmetrics initially operated their whole infrastructure on-premises; however, they wanted to scale their data ingestion while simultaneously providing improved and faster insights for their clients. These business needs led them to build their architecture in AWS cloud.

In this blog post, we explain how Launchmetrics’ uses Amazon Web Services (AWS) to crawl the web for online social and print media. Using the data gathered, Launchmetrics is able to provide prescriptive analytics and insights to their clients. As a result, clients can understand their brand’s momentum and interact with their audience, successfully launching their products.

Architecture overview

Launchmetrics’ platform architecture is represented in Figure 1 and composed of three tiers:

  1. Crawl
  2. Data Persistence
  3. Processing
Launchmetrics backend architecture

Figure 1. Launchmetrics backend architecture

The Crawl tier is composed of several Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances launched via Auto Scaling groups. Spot Instances take advantage of unused Amazon EC2 capacity at a discounted rate compared with On-Demand Instances, which are compute instances that are billed per-hour or -second with no long-term commitments. Launchmetrics heavily leverages Spot Instances. The Crawl tier is responsible for retrieving, processing, and storing data from several media sources (represented in Figure 1 with the number 1).

The Data Persistence tier consists of two components: Amazon Kinesis Data Streams and Amazon Simple Queue Service (Amazon SQS). Kinesis Data Streams stores data that the Crawl tier collects, while Amazon SQS stores the metadata of the whole process. In this context, metadata helps Launchmetrics gain insight into when the data is collected and if it has started processing. This is key information if a Spot Instance is interrupted, which we will dive deeper into later.

The third tier, Processing, also makes use of Spot Instances and is responsible for pulling data from the Data Persistence tier (represented in Figure 1 with the number 2). It then applies proprietary algorithms, both analytics and machine learning models, to create consumer insights. These insights are stored in a data layer (not depicted) that consists of an Amazon Aurora cluster and an Amazon OpenSearch Service cluster.

By having this separation of tiers, Launchmetrics is able to use a decoupled architecture, where each component can scale independently and is more reliable. Both the Crawl and the Data Processing tiers use Spot Instances for up to 90% of their capacity.

Data processing using EC2 Spot Instances

When Launchmetrics decided to migrate their workloads to the AWS cloud, Spot Instances were one of the main drivers. As Spot Instances offer large discounts without commitment, Launchmetrics was able to track more than 1200 brands, translating to 1+ billion end users. Daily, this represents tracking upwards of 500k influencer profiles, 8 million documents, and around 70 million social media comments.

Aside from the cost-savings with Spot Instances, Launchmetrics incurred collateral benefits in terms of architecture design: building stateless, decoupled, elastic, and fault-tolerant applications. In turn, their stack architecture became more loosely coupled, as well.

All Launchmetrics Auto Scaling groups have the following configuration:

  • Spot allocation strategy: cost-optimized
  • Capacity rebalance: true
  • Three availability zones
  • A diversified list of instance types

By using Auto Scaling groups, Launchmetrics is able to scale worker instances depending on how many items they have in the SQS queue, increasing the instance efficiency. Data processing workloads like the ones Launchmetrics’ platform have, are an exemplary use of multiple instance types, such as M5, M5a, C5, and C5a. When adopting Spot Instances, Launchmetrics considered other instance types to have access to spare capacity. As a result, Launchmetrics found out that workload’s performance improved, as they use instances with more resources at a lower cost.

By decoupling their data processing workload using SQS queues, processes are stopped when an interruption arrives. As the Auto Scaling group launches a replacement Spot Instance, clients are not impacted and data is not lost. All processes go through a data checkpoint, where a new Spot Instance resumes processing any pending data. Spot Instances have resulted in a reduction of up to 75% of related operational costs.

To increase confidence in their ability to deal with Spot interruptions and service disruptions, Launchmetrics is exploring using AWS Fault Injection Simulator to simulate faults on their architecture, like a Spot interruption. Learn more about how this service works on the AWS Fault Injection Simulator now supports Spot Interruptions launch page.

Reporting data insights

After processing data from different media sources, AWS aided Launchmetrics in producing higher quality data insights, faster: the previous on-premises architecture had a time range of 5-6 minutes to run, whereas the AWS-driven architecture takes less than 1 minute.

This is made possible by elasticity and availability compute capacity that Amazon EC2 provides compared with an on-premises static fleet. Furthermore, offloading some management and operational tasks to AWS by using AWS managed services, such as Amazon Aurora or Amazon OpenSearch Service, Launchmetrics can focus on their core business and improve proprietary solutions rather than use that time in undifferentiated activities.

Building continuous delivery pipelines

Let’s discuss how Launchmetrics makes changes to their software with so many components.

Both of their computing tiers, Crawl and Processing, consist of standalone EC2 instances launched via Auto Scaling groups and EC2 instances that are part of an Amazon Elastic Container Service (Amazon ECS) cluster. Currently, 70% of Launchmetrics workloads are still running with Auto Scaling groups, while 30% are containerized and run on Amazon ECS. This is important because for each of these workload groups, the deployment process is different.

For workloads that run on Auto Scaling groups, they use an AWS CodePipeline to orchestrate the whole process, which includes:

I.  Creating a new Amazon Machine Image (AMI) using AWS CodeBuild
II. Deploying the newly built AMI using Terraform in CodeBuild

For containerized workloads that run on Amazon ECS, Launchmetrics also uses a CodePipeline to orchestrate the process by:

III. Creating a new container image, and storing it in Amazon Elastic Container Registry
IV. Changing the container image in the task definition, and updating the Amazon ECS service using CodeBuild

Conclusion

In this blog post, we explored how Launchmetrics is using EC2 Spot Instances to reduce costs while producing high-quality data insights for their clients. We also demonstrated how decoupling an architecture is important for handling interruptions and why following Spot Instance best practices can grant access to more spare capacity.

Using this architecture, Launchmetrics produced faster, data-driven insights for their clients and increased their capacity to innovate. They are continuing to containerize their applications and are projected to have 100% of their workloads running on Amazon ECS with Spot Instances by the end of 2023.

To learn more about handling EC2 Spot Instance interruptions, visit the AWS Best practices for handling EC2 Spot Instance interruptions blog post. Likewise, if you are interested in learning more about AWS Fault Injection Simulator and how it can benefit your architecture, read Increase your e-commerce website reliability using chaos engineering and AWS Fault Injection Simulator.