
Journey to Cloud-Native Architecture Series #6: Improve cost visibility and re-architect for cost optimization

Post Syndicated from Anuj Gupta original https://aws.amazon.com/blogs/architecture/journey-to-cloud-native-architecture-series-6-improve-cost-visibility-and-re-architect-for-cost-optimization/

After we improved our security posture in Part 5 of this series, we discovered that operational costs were growing disproportionately faster than revenue because the number of users on our e-commerce platform had grown more than 10 times.

To address this, we created a plan to better understand our AWS spend and identify cost savings. In this post, Part 6, we’ll show you how we improved cost visibility, rearchitected for cost optimization, and applied cost control measures to encourage innovation while ensuring improved return on investment.

Identifying areas to improve costs with Cost Explorer

Using AWS Cost Explorer, we identified the following areas for improvement:

  • Improve cost visibility and establish FinOps culture:
    • Provide visibility and cost insights for untagged resources
    • Provide cost visibility into shared AWS resources
    • Better understand gross margin
  • Use Service Control Policies (SCPs) for cost control
  • Optimize costs for data transfer
  • Reduce storage costs:
    • Update objects to use a cost-appropriate S3 storage class
    • Adopt Amazon EBS gp3
  • Optimize compute costs:
    • Adjust idle and under-utilized resources
    • Migrate to latest generation Graviton2 instances

The following sections provide more information on the methods we used to improve these areas.

Improve cost visibility and establish FinOps culture

Provide visibility and cost insights for untagged resources

To improve our organization’s FinOps culture and help teams better understand their costs, we needed a tool that provides visibility into untagged resources and enables engineering teams to act on cost optimization.

We used CloudZero to automate the categorization of our spend across all AWS accounts and provide our teams the ability to see cost insights. It imports metadata from AWS resources along with AWS tags, which makes it easier to allocate cost to different categories.

Provide cost visibility into shared AWS resources

We created Dimensions such as Development, Test, and Production in CloudZero to easily group costs by environment. We also defined rules in CostFormation that split shared costs, which helps us understand the cost of running a new feature.

Understand gross margin

To better understand how our growing AWS bill relates to the value we deliver to our customers, we used the guidance in Unit Metric – The Touchstone of your IT Planning and Evaluation to identify a demand driver (in our case, the number of orders). Evaluating the number of orders against AWS spend gave us insight into cost KPIs, such as cost per order, which helped us better understand the gross margin of our business.
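
As a simple illustration of the unit metric, here is a minimal Python sketch (with made-up example values; in practice the spend would come from Cost Explorer or a CloudZero export and the order counts from the order service) that computes cost per order from daily figures:

```python
# Minimal sketch: compute a "cost per order" unit metric from daily AWS spend
# and order counts. All values below are illustrative examples.
daily_spend = {"2023-05-01": 4120.50, "2023-05-02": 4388.10}   # USD per day
daily_orders = {"2023-05-01": 18400, "2023-05-02": 19950}      # orders per day

cost_per_order = {
    day: daily_spend[day] / orders
    for day, orders in daily_orders.items()
    if orders
}

for day, unit_cost in sorted(cost_per_order.items()):
    print(f"{day}: ${unit_cost:.4f} per order")
```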

Use Service Control Policies for cost control

Following the guidance in the Control developer account costs with AWS budgets blog post, we applied SCPs to control costs and restrict which AWS services, resources, and individual API actions the users and roles in each member account of an OU can access.

As shown in Figure 1, we applied the following cost control SCPs:

  • SCP-3 on Sandbox OU prevents modification of billing settings and limits the allowable EC2 instance types to only general purpose instances up to 4xl.
  • SCP-2 on Workload SDLC OU denies access to EC2 instances larger than 4xl. AWS Budgets sends alerts to a Slack channel when spend reaches beyond a defined threshold.
  • SCP-1 on Workload PROD OU denies access to any operations outside of the specified AWS Regions and prevents member accounts from leaving the organization.

Figure 1. Applying SCPs on different environments for cost control
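
To give a feel for what such a policy looks like, the sketch below shows a cost-control SCP in the spirit of SCP-2/SCP-3 (not the exact policies from this post), expressed as a Python dict and attached to an OU with boto3. The OU ID and the allowed instance-type patterns are placeholders to adjust to your standards.

```python
# A minimal sketch of a cost-control SCP that denies launching EC2 instance
# types outside an allowed set, attached to an OU with boto3.
import json
import boto3

scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyLargeInstanceTypes",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                # Allow only general purpose families up to 4xlarge; everything
                # else is denied. Adjust the patterns to your own standards.
                "StringNotLike": {
                    "ec2:InstanceType": [
                        "t3.*", "t4g.*",
                        "m5.large", "m5.xlarge", "m5.2xlarge", "m5.4xlarge",
                    ]
                }
            },
        }
    ],
}

org = boto3.client("organizations")
policy = org.create_policy(
    Content=json.dumps(scp_document),
    Description="Limit EC2 instance types in sandbox accounts",
    Name="sandbox-cost-control",
    Type="SERVICE_CONTROL_POLICY",
)
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-xxxxxxxx",  # placeholder OU ID (for example, the Sandbox OU)
)
```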

Optimize costs for data transfer

Data transfer was a major category of our overall AWS cost in the Cost Explorer report, so we used CloudZero’s Networking Sub-category Dimension to get insights into AWS outbound, intra-Region (Availability Zone (AZ) to AZ), NAT gateway, and S3 outbound costs.

To get more insights, we also set up a temporary Data Transfer dashboard in Amazon QuickSight using the guidance in the AWS Well-Architected Cost Optimization lab. It showed us public IP charges for applications, NAT gateway charges for traffic between Amazon EC2 and Amazon S3 within the same Region, inter-AZ data transfer for the Development and Test environments, and cross-AZ data transfer through the NAT gateway.

Figure 2 shows how we used Amazon S3 Gateway endpoints (continuous line) instead of an S3 public endpoint (dotted line) to reduce NAT gateway charges. For our Development and Test environments, we created application-database partitions to reduce inter-AZ data transfer.


Figure 2. Data transfer charges across AZs and AWS services
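
A minimal sketch of the change in Figure 2, assuming placeholder VPC and route table IDs: creating a Gateway VPC endpoint for Amazon S3 so traffic from private subnets reaches S3 through the endpoint instead of the NAT gateway.

```python
# Create a Gateway VPC endpoint for Amazon S3 and associate it with the
# route tables of the private subnets. IDs and Region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",   # S3 service name for the Region
    RouteTableIds=["rtb-0123456789abcdef0"],    # private subnets' route tables
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```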

Reduce storage costs

Update objects to use cost appropriate S3 storage class

In our review of the Cost Explorer report, we noticed that all objects were stored in Amazon S3 using the Standard storage class. To address this, we used guidance from the Amazon S3 cost optimization for predictable and dynamic access patterns blog post to identify predictable data access patterns using Amazon S3 Storage Lens.

The number of GET requests, download bytes, and retrieval rate for Amazon S3 prefixes told us how often datasets are accessed over time and which datasets are infrequently accessed. About 40% of our objects in Amazon S3 had a dynamic access pattern. Storing this data in S3 Standard-Infrequent Access could lead to unnecessary retrieval fees, so we transitioned these objects to Amazon S3 Intelligent-Tiering and updated our applications to select S3 Intelligent-Tiering when uploading such objects. For infrequently accessed objects, we created Amazon S3 Lifecycle policies to automatically transition objects to Amazon S3 Standard-Infrequent Access, Amazon S3 One Zone-Infrequent Access, and/or Amazon S3 Glacier storage classes.
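
A minimal sketch of these two changes follows; the bucket name, prefix, and transition days are illustrative assumptions, not values from the post.

```python
# 1) Add a lifecycle rule that moves infrequently accessed objects to colder
#    storage classes over time.
# 2) Upload objects with dynamic access patterns directly to S3
#    Intelligent-Tiering.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-assets-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-infrequently-accessed",
                "Filter": {"Prefix": "reports/"},  # infrequently accessed prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)

# Objects with dynamic access patterns are uploaded to Intelligent-Tiering.
s3.upload_file(
    Filename="catalog.json",
    Bucket="example-assets-bucket",
    Key="catalog/catalog.json",
    ExtraArgs={"StorageClass": "INTELLIGENT_TIERING"},
)
```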

Adopt Amazon EBS gp3

Using guidance from a re:Invent talk on Optimizing resource efficiency with AWS Compute Optimizer, we identified EBS volumes that were over-provisioned by more than 30%. AWS Compute Optimizer automatically analyzed utilization patterns and metrics such as VolumeReadBytes, VolumeWriteBytes, VolumeReadOps, and VolumeWriteOps for all EBS volumes in our AWS account to provide recommendations on migrating from gp2 to gp3 volumes.

The migrate your Amazon EBS volumes from gp2 to gp3 blog post helped us identify baseline throughput and IOPS requirements for our workload, calculate cost savings with the cost savings calculator, and follow the steps to migrate to gp3.
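
A minimal sketch of the in-place migration, assuming a placeholder volume ID; the IOPS and throughput values below are the gp3 baselines and should be derived from your workload as described in the blog post.

```python
# Convert a gp2 volume to gp3 with ModifyVolume and check progress.
import boto3

ec2 = boto3.client("ec2")

ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
    VolumeType="gp3",
    Iops=3000,        # gp3 baseline; raise only if the workload needs more
    Throughput=125,   # MiB/s, gp3 baseline
)

# The modification completes in the background; progress can be tracked with
# describe_volumes_modifications.
state = ec2.describe_volumes_modifications(VolumeIds=["vol-0123456789abcdef0"])
print(state["VolumesModifications"][0]["ModificationState"])
```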

Optimize compute costs

Adjust idle and under-utilized resources

Deploying Instance Scheduler on AWS helped us further optimize the cost of Amazon EC2 and Amazon Relational Database Service (Amazon RDS) resources in our Development, Test, and Pre-production environments. This way, we only pay for 40-60 hours per week instead of the full 168 hours, which provides 64-76% cost savings.
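
The sketch below is a much-simplified illustration of the idea behind Instance Scheduler on AWS, not the solution itself (the real solution supports time zones, per-resource schedules, and Amazon RDS): a scheduled function that stops instances tagged for office-hours operation outside a working window. The tag key and schedule are assumptions.

```python
# Illustrative only: stop/start EC2 instances tagged Schedule=office-hours
# based on a fixed UTC working window. Run on a schedule (e.g., hourly via
# EventBridge).
from datetime import datetime, timezone

import boto3

ec2 = boto3.client("ec2")
WORKING_HOURS = range(8, 18)  # 08:00-17:59 UTC, Monday-Friday


def handler(event, context):
    now = datetime.now(timezone.utc)
    in_window = now.weekday() < 5 and now.hour in WORKING_HOURS

    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},
            {"Name": "instance-state-name", "Values": ["running", "stopped"]},
        ]
    )
    instance_ids = [
        i["InstanceId"]
        for page in pages
        for r in page["Reservations"]
        for i in r["Instances"]
    ]
    if not instance_ids:
        return

    if in_window:
        ec2.start_instances(InstanceIds=instance_ids)
    else:
        ec2.stop_instances(InstanceIds=instance_ids)
```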

Migrate to latest generation Graviton2 instances

As user traffic grew, application throughput requirements changed significantly, which led to more compute cost. We migrated to the newest generation Graviton2 instances with similar memory and CPU configurations, achieving higher performance at reduced cost. We moved Amazon RDS, Amazon ElastiCache, and Amazon OpenSearch Service to Graviton2-based instances for low-effort cost savings. The following table shows the cost comparison after we migrated to Graviton2 instances.

| Service | Previous instance | On-Demand cost per hour (us-east-1) | New instance | On-Demand cost per hour (us-east-1) | Cost savings |
| --- | --- | --- | --- | --- | --- |
| Amazon RDS (PostgreSQL) | r5.4xlarge | $1.008 | r6g.4xlarge | $0.8064 | 20.00% |
| Amazon ElastiCache | cache.r5.2xlarge | $0.862 | cache.r6g.xlarge | $0.411 | 52.32% |
| Amazon OpenSearch (data nodes) | r5.xlarge.search | $0.372 | r6g.xlarge.search | $0.335 | 9.95% |

After that, we tested our Java-based applications on the arm64 processor using the guidance in the Graviton GitHub repository and the AWS Graviton2 for Independent Software Vendors whitepaper. We conducted functional and non-functional tests on the application to ensure that it provides the same experience for users with improved performance.

Load testing for cost optimization

We included load testing in our CI/CD pipeline to avoid over-provisioning and to identify resource bottlenecks before our application goes into production. To do this, we used the Serverless Artillery workshop to set up a load testing environment in a separate AWS account. With this setup, we were able to simulate production traffic at the required scale at a much lower cost than using EC2 instances.

Conclusion

In this blog post, we discussed how observations in Cost Explorer helped us identify opportunities for cost management and optimization. We showed how to get better cost visibility using CloudZero and apply cost control measures using SCPs. We also covered how to save on data transfer, storage, and compute costs with low effort.


Journey to Adopt Cloud-Native Architecture Series #5 – Enhancing Threat Detection, Data Protection, and Incident Response

Post Syndicated from Anuj Gupta original https://aws.amazon.com/blogs/architecture/journey-to-adopt-cloud-native-architecture-series-5-enhancing-threat-detection-data-protection-and-incident-response/

In Part 4 of this series, Governing Security at Scale and IAM Baselining, we discussed building a multi-account strategy and improving access management and least privilege to prevent unwanted access and to enforce security controls.

As a refresher from previous posts in this series, our example e-commerce company’s “Shoppers” application runs in the cloud. The company experienced hypergrowth, which posed a number of platform and technology challenges, including enforcing security and governance controls to mitigate security risks.

With the pace of new infrastructure and software deployments, we had to ensure we maintain strong security. This post, Part 5, shows how we detect security misconfigurations, indicators of compromise, and other anomalous activity. We also show how we developed and iterated on our incident response processes.

Threat detection and data protection

With our newly acquired customer base from hypergrowth, we had to make sure we maintained customer trust. We also needed to detect and respond to security events quickly to reduce the scope and impact of any unauthorized activity. We were concerned about vulnerabilities on our public-facing web servers, accidental sensitive data exposure, and other security misconfigurations.

Prior to hypergrowth, application teams scanned for vulnerabilities and maintained the security of their applications. After hypergrowth, we established a dedicated security team and identified tools to simplify the management of our cloud security posture. This allowed us to easily identify and prioritize security risks.

Use AWS security services to detect threats and misconfigurations

We use the following AWS security services to simplify the management of cloud security risks and reduce the burden of third-party integrations. This also minimizes the amount of engineering work required by our security team.

Detect threats with Amazon GuardDuty

We use Amazon GuardDuty to keep up with the newest threat actor tactics, techniques, and procedures (TTPs) and indicators of compromise (IOCs).

GuardDuty saves us time and reduces complexity because we don’t have to continuously engineer detections for new TTPs and IOCs ourselves; its threat-intelligence-based and machine-learning-based detections are kept up to date for us. This allows our security analysts to focus on building runbooks and quickly responding to security findings.

Discover sensitive data with Amazon Macie for Amazon S3

To host our external website, we use a few public Amazon Simple Storage Service (Amazon S3) buckets with static assets. We don’t want developers to accidentally put sensitive data in these buckets, and we wanted to understand which S3 buckets contain sensitive information, such as financial or personally identifiable information (PII).

We explored building a custom scanner to search for sensitive data, but maintaining the search patterns was too complex, and continuously re-scanning files each month was costly. Therefore, we use Amazon Macie to continuously scan our S3 buckets for sensitive data. After Macie makes its initial scan, it only scans new or updated objects in those S3 buckets, which reduces our costs significantly. We added filter rules to exclude larger files, scoped scans to the S3 prefixes that contain the objects we need to examine, and set a sampling rate to further optimize the cost of scanning large S3 buckets (in our case, buckets larger than 1 TB).
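
A minimal sketch of such a job using the Macie CreateClassificationJob API follows; the account ID, bucket name, prefix, size threshold, and sampling percentage are placeholders, not the values we used.

```python
# Create a scheduled Macie classification job that scans only a chosen prefix,
# skips very large objects, and samples a percentage of eligible objects to
# keep scanning costs down on big buckets.
import boto3

macie = boto3.client("macie2")

macie.create_classification_job(
    name="weekly-sensitive-data-scan",
    jobType="SCHEDULED",
    scheduleFrequency={"weeklySchedule": {"dayOfWeek": "SUNDAY"}},
    samplingPercentage=30,  # sample 30% of eligible objects
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "111122223333", "buckets": ["example-website-assets"]}
        ],
        "scoping": {
            "includes": {
                "and": [
                    {
                        "simpleScopeTerm": {
                            "comparator": "STARTS_WITH",
                            "key": "OBJECT_KEY",
                            "values": ["uploads/"],
                        }
                    }
                ]
            },
            "excludes": {
                "and": [
                    {
                        "simpleScopeTerm": {
                            "comparator": "GT",
                            "key": "OBJECT_SIZE",
                            "values": ["104857600"],  # skip objects > 100 MB
                        }
                    }
                ]
            },
        },
    },
)
```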

Scan for vulnerabilities with Amazon Inspector

Because we use a wide variety of operating systems and software, we must scan our Amazon Elastic Compute Cloud (Amazon EC2) instances for known software vulnerabilities, such as Log4J.

We use Amazon Inspector to run continuous vulnerability scans on our EC2 instances and Amazon Elastic Container Registry (Amazon ECR) container images. With Amazon Inspector, we can continuously detect if our developers are deploying and releasing vulnerable software on our EC2 instances and ECR images without setting up a third-party vulnerability scanner and installing additional endpoint agents.
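
A minimal sketch of turning on this scanning with boto3, assuming it is run in the account (or by the delegated administrator) where scanning should be enabled:

```python
# Enable Amazon Inspector scanning for EC2 instances and ECR container images
# in the current account.
import boto3

inspector = boto3.client("inspector2")
inspector.enable(resourceTypes=["EC2", "ECR"])
```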

Aggregate security findings with AWS Security Hub

We don’t want our security analysts to arbitrarily act on one security finding over another. This is time-consuming and does not properly prioritize the highest risks to address. We also need to track ownership, view progress of various findings, and build consistent responses for common security findings.

With AWS Security Hub, our analysts can seamlessly prioritize findings from GuardDuty, Macie, Amazon Inspector, and many other AWS services. Our analysts also use Security Hub’s built-in security checks and insights to identify AWS resources and accounts that have a high number of findings and act on them.

Setting up the threat detection services

We set up these services with a delegated administrator in a central security tooling account and aggregated their findings in AWS Security Hub (Figures 1 and 2).

Our security analysts use Security Hub-generated Jira tickets to view, prioritize, and respond to all security findings and misconfigurations across our AWS environment.

Through this configuration, our analysts no longer need to pivot between various AWS accounts, security tool consoles, and Regions, which makes the day-to-day management and operations much easier. Figure 1 depicts the data flow to Security Hub.


Figure 1. Aggregation of security services in security tooling account


Figure 2. Delegated administrator setup
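
A rough sketch of the kind of setup Figures 1 and 2 describe is shown below; the account ID and Region are placeholders, and it assumes Security Hub is already enabled in the organization.

```python
# Designate a security tooling account as the Security Hub delegated
# administrator for the organization, then aggregate findings from all
# Regions into one aggregation Region.
import boto3

securityhub = boto3.client("securityhub", region_name="us-east-1")

# Run from the Organizations management account.
securityhub.enable_organization_admin_account(AdminAccountId="111122223333")

# Run from the security tooling (delegated administrator) account, in the
# Region chosen as the aggregation Region.
securityhub.create_finding_aggregator(RegionLinkingMode="ALL_REGIONS")
```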

Incident response

Before hypergrowth, there was no formal way to respond to security incidents. To address potential security incidents quickly and minimize their impact and exposure, we built incident response plans and processes. Following the AWS Security Incident Response Guide and the NIST framework, we adopted the following best practices.

Playbooks and runbooks for repeatability

We developed incident response playbooks and runbooks for repeatable responses for security events that include:

  • Playbooks for more strategic scenarios and responses based on some of the sample playbooks found here.
  • Runbooks that provide step-by-step guidance for our security analysts to follow in case an event occurs. We used Amazon SageMaker notebooks and AWS Systems Manager Incident Manager runbooks to develop repeatable responses for pre-identified incidents, such as suspected command and control activity on an EC2 instance.

Automation for quicker response time

After developing our repeatable processes, we identified areas where we could accelerate responses to security threats by automating the response. We used the AWS Security Hub Automated Response and Remediation solution as a starting point.

By using this solution, we didn’t need to build our own automated response and remediation workflow. The code is also easy to read, repeat, and centrally deploy through AWS CloudFormation StackSets. We used some of the built-in remediations like disabling active keys that have not been rotated for more than 90 days, making all Amazon Elastic Block Store (Amazon EBS) snapshots private, and many more. With automatic remediation, our analysts can respond quicker and in a more holistic and repeatable way.
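
The pattern this relies on is routing specific Security Hub findings to a remediation target through Amazon EventBridge. The sketch below illustrates that pattern only; the rule name, the matched finding title, and the Lambda ARN are assumptions, not artifacts of the solution.

```python
# Route matching Security Hub findings to a remediation Lambda function.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="remediate-unrotated-access-keys",
    EventPattern=json.dumps({
        "source": ["aws.securityhub"],
        "detail-type": ["Security Hub Findings - Imported"],
        "detail": {
            "findings": {
                # Placeholder title of the control to remediate.
                "Title": ["IAM user access keys should be rotated every 90 days or less"],
                "Workflow": {"Status": ["NEW"]},
            }
        },
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="remediate-unrotated-access-keys",
    Targets=[{
        "Id": "remediation-function",
        # Placeholder ARN; the function also needs a resource-based policy
        # that allows EventBridge to invoke it.
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:rotate-access-keys",
    }],
)
```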

Simulations to improve incident response capabilities

We implemented quarterly incident response simulations. These simulations test how well prepared our people, processes, and technologies are for an incident. We included some cloud-specific simulations like an S3 bucket exposure and an externally shared Amazon Relational Database Service (Amazon RDS) snapshot to ensure our security staff are prepared for an incident in the cloud. We use the results of the simulations to iterate on our incident response processes.

Conclusion

In this blog post, we discussed how to prepare for, detect, and respond to security events in an AWS environment. We identified security services to detect security events, vulnerabilities, and misconfigurations. We then discussed how to develop incident response processes through building playbooks and runbooks, performing simulations, and automation. With these new capabilities, we can detect and respond to a security incident throughout hypergrowth.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Other blog posts in this series

Journey to Adopt Cloud-Native Architecture Series: #4 – Governing Security at Scale and IAM Baselining

Post Syndicated from Anuj Gupta original https://aws.amazon.com/blogs/architecture/journey-to-adopt-cloud-native-architecture-series-4-governing-security-at-scale-and-iam-baselining/

In Part 3 of this series, Improved Resiliency and Standardized Observability, we talked about design patterns that you can adopt to improve resiliency, achieve minimum business continuity, and scale applications with lengthy transactions (more than 3 minutes).

As a refresher from previous blogs in this series, our example ecommerce company’s “Shoppers” application runs in the cloud. The company experienced hypergrowth, which posed a number of platform and technology challenges, namely, they needed to scale on the backend without impacting users.

Because of this hypergrowth, distributed denial of service (DDoS) attacks on the ecommerce company’s services increased 10 times in 6 months. Some of these attacks led to downtime and loss of revenue. This blog post shows you how we addressed these threats by implementing a multi-account strategy and applying AWS Identity and Access Management (IAM) best practices.

A multi-account strategy ensures security at scale

Originally, the company’s production and non-production services were running in a single account. This meant non-production vulnerabilities like frequently changing code or privileged access could impact the production environment. Additionally, the application experienced issues due to unexpectedly reaching service quotas. These include (but are not limited to) the number of read replicas per master and total storage for all DB instances in Amazon Relational Database Service (Amazon RDS), and Auto Scaling service quotas for Amazon Elastic Compute Cloud (Amazon EC2).

To address these issues, we followed multi-account strategy best practices. We established the multi-account hierarchy shown in Figure 1 that includes the following eight organizational units (OUs) to meet business requirements:

  1. Security PROD OU
  2. Security SDLC OU
  3. Infrastructure PROD OU
  4. Infrastructure SDLC OU
  5. Workload PROD OU
  6. Workload SDLC OU
  7. Sandbox OU
  8. Transitional OU

To identify the right fit for our needs, we evaluated AWS Landing Zone and AWS Control Tower. To reduce the operational overhead of maintaining a solution, we chose AWS Control Tower to deploy guardrails as service control policies (SCPs). These guardrails were then separated into production and non-production environments, creating the hierarchy shown in Figure 1.

We created a new Payer (or Management) Account with Sandbox OU and Transitional OU under Root OU. We then moved existing AWS accounts under the Transitional OU and Sandbox OU. We provisioned new accounts with Account Factory and gradually migrated services from existing AWS accounts into the newly formed Log Archive Account, Security Account, Network Account, and Shared Services Account and applied appropriate guardrails. We then registered Sandbox OU with Control Tower. Additionally, we migrated the centralized logging solution from Part 3 of this blog series to the Security Account. We moved non-production applications into the Dev and Test Accounts, respectively, to isolate workloads. We then moved existing accounts that had production services from the Transitional OU to Workload PROD OU.

Multi-account hierarchy

Figure 1. Multi-account hierarchy

Implementing a multi-account strategy alleviated service quota challenges. It isolated variable demand non-production environments from more consistent production environments, which reduced the downtime caused by unplanned scaling events. The multi-account strategy enforces governance at scale, but also promotes innovation by allocating separate accounts with distinct security requirements for proof of concepts and experimentation. This reduces impact risks to production accounts and allows the required guardrails to be automatically applied.

Improving access management and least privilege access

When the company experienced hypergrowth, they not only had to scale their application’s infrastructure, but also increase how often they released code. They also hired and onboarded new internal teams.

To strengthen new and existing employees’ credentials, we used the AWS Trusted Advisor check for IAM Access Key Rotation. This check identifies IAM users whose access keys have not been rotated for more than 90 days, and we created an automated way to rotate them. We then generated an IAM credential report to identify IAM users that don’t need console access or access keys. We gradually moved these users to role-based access instead of IAM access keys.
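
The post uses the Trusted Advisor check for this; the sketch below is only a simplified illustration of the same idea with the IAM API, listing active access keys older than 90 days so they can be rotated or deactivated.

```python
# Flag active IAM access keys older than 90 days.
from datetime import datetime, timedelta, timezone

import boto3

iam = boto3.client("iam")
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
        for key in keys:
            if key["Status"] == "Active" and key["CreateDate"] < cutoff:
                print(f"{user['UserName']}: key {key['AccessKeyId']} is older than 90 days")
```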

During a Well-Architected Security Pillar review, we identified some applications that used hardcoded passwords that hadn’t been updated for more than 90 days. We refactored these applications to retrieve passwords from AWS Secrets Manager and followed its best practices for performance.

Additionally, we set up a system to automatically change passwords for RDS databases and wrote an AWS Lambda function to update passwords for third-party integrations. Some applications on Amazon EC2 were using IAM access keys to access AWS services. We refactored them to get permissions from the IAM role attached to their EC2 instances, which reduced the operational burden of rotating access keys.
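
A minimal sketch of the refactor: the application reads its database password from AWS Secrets Manager at startup instead of using a hardcoded value. The secret name and its JSON field names are placeholders; caching the value (for example, with the AWS Secrets Manager caching library) avoids calling the API on every request.

```python
# Retrieve database credentials from AWS Secrets Manager.
import json

import boto3

secrets = boto3.client("secretsmanager")

secret = secrets.get_secret_value(SecretId="prod/shoppers/db-credentials")  # placeholder name
credentials = json.loads(secret["SecretString"])

db_user = credentials["username"]
db_password = credentials["password"]
```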

Using IAM Access Analyzer, we analyzed AWS CloudTrail logs and generated policies for IAM roles. This helped us determine the least privilege permissions required for the roles as mentioned in the IAM Access Analyzer makes it easier to implement least privilege permissions by generating IAM policies based on access activity blog.

To streamline access for internal users, we migrated users to AWS Single Sign-On (AWS SSO) federated access. We enabled all features in AWS Organizations to use AWS SSO and created permission sets to define access boundaries for different functions. We assigned permission sets to different user groups and assigned users to user groups based on their job function. This allowed us to reduce the number of IAM policies and use tag-based control when defining AWS SSO permissions policies.

We followed the guidance in the Attribute-based Access Control with AWS SSO blog post to map user attributes and use tags to define permissions boundaries for user groups. This allowed us to provide access to users based on specific teams, projects, and departments. We enforced multi-factor authentication (MFA) for all AWS SSO users by configuring MFA settings to allow sign in only when an MFA device has been registered.
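
The sketch below shows the kind of tag-based (ABAC) statement used in a permission set’s policy: access is allowed only when the user’s team attribute, passed as a principal tag through AWS SSO, matches the resource’s team tag. The tag key and the actions are illustrative assumptions.

```python
# An illustrative ABAC policy statement, expressed as a Python dict, that
# scopes EC2 start/stop permissions to resources tagged with the caller's team.
abac_statement = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowTeamScopedAccess",
            "Effect": "Allow",
            "Action": ["ec2:StartInstances", "ec2:StopInstances"],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/team": "${aws:PrincipalTag/team}"
                }
            },
        }
    ],
}
```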

These improvements ensure that only the right people have access to the required resources for the right amount of time. They reduce the risk of compromised security credentials by using AWS Security Token Service (AWS STS) to generate temporary credentials when needed. System passwords are better protected from unwanted access and automatically rotated for improved security. AWS SSO also allows us to enforce permissions at scale when people’s job functions change within or across teams.

Conclusion

In this blog post, we described design patterns we used to implement security governance at scale using multi-account strategy and AWS SSO integrations. We also talked about patterns you can adopt for IAM baselining that allow least privilege access, checking for IAM best practices, and proactively detecting unwanted access.

This blog post also covers why you need to refresh your threat model during hyperscale growth and how different services can make it easier to enforce security controls. In the next blog, we will talk about more security design patterns to improve infrastructure security and incident response during hyperscale.
