SonarCloud, a software-as-a-service (SaaS) product developed by Sonar, seamlessly integrates into developers’ CI/CD workflows to increase code quality and identify vulnerabilities. Over the last few months, Sonar’s cloud engineers have worked on modernizing SonarCloud to reduce the lead time to production.
Following Domain Driven Design principles, Sonar split the application into multiple business domains, each owned by independent teams. They also built a unified API to expose these domains publicly.
This solution isn’t exclusive to Sonar; it’s a blueprint for organizations modernizing their applications towards domain-driven design or microservices with public service exposure.
Introduction
SonarCloud’s core was initially built as a monolithic application on AWS, managed by a single team. Over time, it gained widespread adoption among thousands of organizations, leading to the introduction of new features and contributions from multiple teams.
In response to this growth, Sonar recognized the need to modernize its architecture. The decision was made to transition to domain-driven design, aligning with the team’s structure. New functionalities are now developed within independent domains, managed by dedicated teams, while existing components are gradually refactored using the strangler pattern.
This transformation resulted in SonarCloud being composed of multiple domains, and securely exposing them to customers became a key challenge. To address this, Sonar’s engineers built a unified API, a solution we’ll explore in the following section.
Solution overview
Figure 1 illustrates the architecture of the unified API, the gateway through which end-users access SonarCloud services. It is built on an Application Load Balancer (ALB) and Amazon API Gateway private APIs.
Figure 1. Unified API architecture
The VPC endpoint for API Gateway spans three Availability Zones (AZs), providing an Elastic Network Interface (ENI) in each private subnet. Meanwhile, the ALB is configured with an HTTPS listener, linked to a target group containing the IP addresses of the ENIs.
To streamline access, we’ve established an API Gateway custom domain at api.example.com. Within this custom domain, we’ve created an API mapping for each business domain. This setup allows for seamless routing, with paths like /domain1 leading directly to the corresponding domain1 private API in API Gateway.
Here is how it works:
The user makes a request to api.example.com/domain1, which is routed to the ALB using Amazon Route53 for DNS resolution.
The ALB terminates the connection, decrypts the request and sends it to one of the VPC endpoint ENIs. At this point, the domain name and the path of the request respectively match our custom domain name, api.example.com, and our API mapping for /domain1.
Based on the custom domain name and API mapping, the API Gateway service routes the request to the domain1 private API.
In this solution, we leverage the following two features of Amazon API Gateway:
Private REST APIs in Amazon API Gateway can only be accessed from your virtual private cloud by using an interface VPC endpoint. This is an ENI that you create in your VPC.
API Gateway custom domains allow you to set up your API’s hostname. The default base URL for an API is: https://api-id.execute-api.region.amazonaws.com/stage
With custom domains you can define a more intuitive URL, such as https://api.example.com/domain1. This is not supported for private REST APIs by default, so we are using a workaround documented in https://github.com/aws-samples/.
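To make the custom domain and API mapping setup concrete, here is a minimal sketch using boto3. It assumes the custom domain api.example.com already exists and is fronted by the ALB as described above; the API ID and stage name are placeholders.

import boto3

apigw = boto3.client("apigateway")

# Map the /domain1 path of the custom domain to the domain1 private REST API.
# The API ID and stage below are placeholders for illustration only.
apigw.create_base_path_mapping(
    domainName="api.example.com",  # existing API Gateway custom domain
    basePath="domain1",            # requests to api.example.com/domain1 ...
    restApiId="abc123def4",        # ... are routed to this private API
    stage="prod",                  # deployed stage of the private API
)

Each team can own a mapping like this for its own domain, which keeps routing changes independent of the other domains.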
Conclusion
In this post, we described the architecture of a unified API built by Sonar to securely expose multiple domains through a single API endpoint. To conclude, let’s review how this solution is aligned with the best practices of the AWS Well-Architected Framework.
Security
The unified API approach improves the security of the application by reducing the attack surface compared with exposing a public API per domain. AWS WAF (Web Application Firewall), used on the ALB, protects the application from common web exploits. AWS Shield Standard, enabled by default on Amazon CloudFront, provides network and transport layer protection against DDoS attacks.
Operational Excellence
The design allows each team to independently deploy application and infrastructure changes behind a dedicated private API Gateway. This keeps operational overhead for the platform team minimal, which was a requirement. In addition, the architecture is based on managed services, which scale automatically as SonarCloud usage evolves.
Reliability
The solution is built using AWS services that provide high availability by default across Availability Zones (AZs) in the AWS Region. Request throttling can be configured on each private API Gateway to protect the underlying resources from being overwhelmed.
Performance
Amazon CloudFront increases the performance of the API, especially for users located far from the deployment AWS Region. The traffic flows through the AWS network backbone which offers superior performance for accessing the ALB.
Cost
The ALB used as the single entry point brings an extra cost compared with exposing multiple public API Gateways. This is a trade-off for enhanced security and customer experience.
Sustainability
By using serverless managed services, Sonar is able to match the provisioned infrastructure with the customer demand. This avoids overprovisioning resources and reduces the environmental impact of the solution.
Designing a system to be either stateful or stateless is an important choice with tradeoffs regarding its performance and scalability. In a stateful system, data from one session is carried over to the next. A stateless system doesn’t preserve data between sessions and depends on external entities such as databases or cache to manage state.
Stateful and stateless architectures are both widely adopted.
Stateful applications are typically simple to deploy. Stateful applications save client session data on the server, allowing for faster processing and improved performance. Stateful applications excel in predictable workloads and offer consistent user experiences.
Stateless architectures typically align with the demands of dynamic workloads and changing business requirements. Stateless application design can increase flexibility with horizontal scaling and dynamic deployment. This flexibility helps applications handle sudden spikes in traffic, maintain resilience to failures, and optimize cost.
Figure 1 provides a conceptual comparison of stateful and stateless architectures.
Figure 1. Conceptual diagram for stateful vs stateless architectures
For example, an eCommerce application accessible from web and mobile devices manages several aspects of the customer transaction life cycle. This lifecycle starts with account creation, then moves to placing items in the shopping cart, and proceeds through checkout. Session and user profile data provide session persistence and cart management, which retain the cart’s contents and render the latest updated cart from any device. A stateless architecture is preferable for this application because it decouples user data and offloads the session data. This provides the flexibility to scale each component independently to meet varying workloads and optimize resource utilization.
In this blog, we outline the process and benefits of converting from a stateful to stateless architecture.
Solution overview
This section walks you through the steps for converting stateful to stateless architecture:
Identifying and understanding the stateful components
Decoupling user profile data
Offloading session data
Scaling each component dynamically
Designing a stateless architecture
Step 1: Identifying and understanding the stateful components
Transforming a stateful architecture to a stateless architecture starts with reviewing the overall architecture and source code of the application, and then analyzing dataflow and dependencies.
Review the architecture and source code
It’s important to understand how your application accesses and shares data. Pay attention to components that persist state data and retain state information. Examples include user credentials, user profiles, session tokens, and data specific to sessions (such as shopping carts). Identifying how this data is handled serves as the foundation for planning the conversion to a stateless architecture.
Analyze dataflow and dependencies
Analyze and understand the components that maintain state within the architecture. This helps you assess the potential impact of transitioning to a stateless design.
You can use the following questionnaire to assess the components. Customize the questions according to your application.
What data is specific to a user or session?
How is user data stored and managed?
How is the session data accessed and updated?
Which components rely on the user and session data?
Are there any shared or centralized data stores?
How does the state affect scalability and fault tolerance?
Can the stateful components be decoupled or made stateless?
Step 2: Decoupling user profile data
Decoupling user data involves separating and managing user data from the core application logic. Delegate responsibilities for user management and secrets, such as application programming interface (API) keys and database credentials, to a separate service that can be resilient and scale independently. For example, you can use:
AWS Secrets Manager to decouple user data by storing secrets in a secure, centralized location. This means that the application code doesn’t need to store secrets, which makes it more secure.
Amazon S3 to store large, unstructured data, such as images and documents. Your application can retrieve this data when required, eliminating the need to store it in memory.
Amazon DynamoDB to store information such as user profiles. Your application can query this data in near-real time.
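As a hedged sketch of what this decoupling can look like in application code, the snippet below reads a database credential from Secrets Manager and a user profile from DynamoDB instead of keeping either in the application itself; the secret name, table name, and key schema are illustrative assumptions.

import json
import boto3

secrets = boto3.client("secretsmanager")
dynamodb = boto3.resource("dynamodb")

def get_db_credentials(secret_name="app/db-credentials"):  # assumed secret name
    # Secrets stay in Secrets Manager; the application only reads them at runtime.
    response = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

def get_user_profile(user_id, table_name="UserProfiles"):  # assumed table name
    # Profile data lives in DynamoDB rather than in server memory or on local disk.
    table = dynamodb.Table(table_name)
    return table.get_item(Key={"user_id": user_id}).get("Item")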
Step 3: Offloading session data
Offloading session data refers to the practice of storing and managing session-related data outside the stateful components of an application. This involves separating the state from the business logic. You can offload session data to a database, cache, or external files. When choosing where to offload it, consider how frequently the session data is accessed, how quickly it must be retrieved, and how long it needs to persist.
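One common way to offload session state, shown here only as a sketch, is to keep it in a DynamoDB table with a time-to-live (TTL) attribute so that expired sessions are removed automatically; the table name, key, and TTL attribute are assumptions for illustration.

import time
import uuid
import boto3

# Assumed table with partition key "session_id" and TTL enabled on "expires_at".
sessions = boto3.resource("dynamodb").Table("Sessions")

def create_session(user_id, cart=None, ttl_seconds=3600):
    # Session state lives outside the application servers, so any instance can serve the user.
    session_id = str(uuid.uuid4())
    sessions.put_item(Item={
        "session_id": session_id,
        "user_id": user_id,
        "cart": cart or [],
        "expires_at": int(time.time()) + ttl_seconds,  # DynamoDB TTL attribute
    })
    return session_id

def get_session(session_id):
    return sessions.get_item(Key={"session_id": session_id}).get("Item")

An in-memory cache such as ElastiCache can serve the same purpose when lower read latency is required.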
Step 4: Scaling each component dynamically
A stateless architecture gives you the flexibility to scale each component independently, allowing the application to meet varying workloads and optimize resource utilization. While planning for scaling, consider using:
AWS Auto Scaling, which supports automatic scaling of resources based on predefined policies and metrics.
Elastic Load Balancing, which distributes incoming traffic across healthy targets and, combined with Auto Scaling, supports dynamic scaling by automatically adding or removing instances based on the configured scaling policies and health checks.
Step 5: Designing a stateless architecture
After you identify which state and user data need to be persisted and choose your storage solution, you can begin designing the stateless architecture. This involves:
Understanding how the application interacts with the storage solution.
Planning how session creation, retrieval, and expiration logic work with the overall session management.
Refactoring application logic to remove references to the state information that’s stored on the server.
Rearchitecting the application into smaller, independent services, as described in steps 2, 3, and 4.
Performing thorough testing to ensure that all functionalities produce the desired results after the conversion.
The following figure is an example of a stateless architecture on AWS. This architecture separates the user interface, application logic, and data storage into distinct layers, allowing for scalability, modularity, and flexibility in designing and deploying applications. The tiers interact through well-defined interfaces and APIs, ensuring that each component focuses on its specific responsibilities.
Figure 2. Example of a stateless architecture
Benefits
Benefits of adopting a stateless architecture include:
Scalability: Stateless components don’t maintain a local state. Typically, you can easily replicate and distribute them to handle increasing workloads. This supports horizontal scaling, making it possible to add or remove capacity based on fluctuating traffic and demand.
Reliability and fault tolerance: Stateless architectures are inherently resilient to failures. If a stateless component fails, it can be replaced or restarted without affecting the overall system. Because stateless applications don’t have a shared state, failures in one component don’t impact other components. This helps ensure continuity of user sessions, minimizes disruptions, and improves fault tolerance and overall system reliability.
Cost-effectiveness: By leveraging on-demand scaling capabilities, your application can dynamically adjust resources based on actual demand, avoiding overprovisioning of infrastructure. Stateless architectures also lend themselves to serverless computing models, where you pay only for the actual run time, resulting in cost savings.
Performance: Externalizing session data by using services optimized for high-speed access, such as in-memory caches, can reduce the latency compared to maintaining session data internally.
Flexibility and extensibility: Stateless architectures provide flexibility and agility in application development. Offloaded session data provides more flexibility to adopt different technologies and services within the architecture. Applications can easily integrate with other AWS services for enhanced functionality, such as analytics, near real-time notifications, or personalization.
Conclusion
Converting stateful applications to stateless applications requires careful planning, design, and implementation. Your choice of architecture depends on your application’s specific needs. If an application is simple to develop and debug, then a stateful architecture might be a good choice. However, if an application needs to be scalable and fault tolerant, then a stateless architecture might be a better choice. It’s important to understand the current application thoroughly before embarking on a refactoring journey.
In the software development process, adopting developer tools makes it easier for developers to write code, build applications, and test more efficiently. As a developer, you can use various AWS developer tools for code editing, code quality, code completion, and so on. These tools include Amazon CodeGuru for code analysis and Amazon CodeWhisperer for coding recommendations powered by machine learning.
In this edition of Let’s Architect!, we’ll show you some tools that every developer should consider including in their toolkit.
This blog post shares several prompts to enhance your programming experience with Amazon CodeWhisperer.
Why is this important to developers? By default, CodeWhisperer gives you code recommendations in real time — this example shows you how to make the best use of these recommendations. You’ll see the different dimensions of writing a simple application, but most importantly, you’ll learn how to resolve problems you could face in development workflows. Even if you’re just a beginner, you’ll be able to use this example to leverage AI to increase productivity.
Code quality is important in software development. It’s essential for resilient, cost-effective, and enduring software systems. It not only helps ensure performance efficiency and satisfy functional requirements, but also supports long-term maintainability.
In this blog post, the authors talk about the advantages offered by CodeGuru automated code reviews, which allow you to proactively identify and address potential issues before they find their way into the main branches of your repository. CodeGuru not only streamlines your development pipeline, but also fortifies the integrity of your codebase, ensuring that only the highest quality code makes its way into your production environment.
AWS provides various tools for developers. You can access the complete list here. One in particular, Powertools for AWS Lambda, is designed to implement serverless best practices and increase developer velocity. Powertools for AWS Lambda (Python) is a library of observability best practices and solutions to common problems like implementing idempotency or handling batch errors. It supports several languages, including Python, Java, TypeScript, and .NET, and lets you choose your favorite(s). There is also a roadmap available, so you can see upcoming features.
Developers test their code in an AWS account to see if their changes are working successfully, especially when developing new infrastructure workloads programmatically or provisioning new services. The AWS Cloud Development Kit (AWS CDK) CLI has a flag called hotswap that helps speed up your deployments by swapping specific resources without going through the whole AWS CloudFormation process.
Not all changes can be hotswapped, though. When hotswapping isn’t possible, cdk watch falls back to a full CloudFormation deployment. NOTE: This command deliberately introduces drift in CloudFormation to speed up deployments. For this reason, only use it for development purposes. Never use hotswap for your production deployments!
CodeGuru implemented in an end-to-end CI/CD pipeline
See you next time!
Thanks for reading! This is the last post for 2023. We hope you enjoyed our work this year and we look forward to seeing you in 2024.
To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page. Thank you for being a part of our community, and we look forward to bringing you more insightful content in the future. Happy re:Invent, everybody!
With the launch of Amazon Security Lake, it’s now simpler and more convenient to access security-related data in a single place. Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account, and removes the overhead related to building and scaling your infrastructure as your data volumes increase. With Security Lake, you can get a more complete understanding of your security data across your entire organization. You can also improve the protection of your workloads, applications, and data.
Security Lake has adopted the Open Cybersecurity Schema Framework (OCSF), an open standard. With OCSF support, the service can normalize and combine security data from AWS and a broad range of enterprise security data sources. Whether you use the service’s native ingestion capabilities to pull in AWS CloudTrail, Amazon Route 53, VPC Flow Logs, or AWS Security Hub findings, ingest supported third-party partner findings, or ingest your own security-related logs, Security Lake provides an environment in which you can correlate events and findings by using a broad range of tools from the AWS and APN partner community.
Many customers have already deployed and maintain a centralized logging solution using services such as Amazon OpenSearch Service or a third-party security information and event management (SIEM) tool, and often use business intelligence (BI) tools such as Amazon QuickSight to gain insights into their data. With Security Lake, you have the freedom to choose how you analyze this data. In some cases, it may be from a centralized team using OpenSearch or a SIEM tool, and in other cases it may be that you want the ability to give your teams access to QuickSight dashboards or provide specific teams access to a single data source with Amazon Athena.
Before you get started
To follow along with this post, you must have:
A basic understanding of Security Lake, Athena, and QuickSight
Security Lake already deployed and accepting data sources
An existing QuickSight deployment that can be used to visualize Security Lake data, or an account where you can sign up for QuickSight to create visualizations
Accessing data
Security Lake uses the concept of data subscribers when it comes to accessing your data. A subscriber consumes logs and events from Security Lake, and supports two types of access:
Data access — Subscribers can directly access Amazon Simple Storage Service (Amazon S3) objects and receive notifications of new objects through a subscription endpoint or by polling an Amazon Simple Queue Service (Amazon SQS) queue. This is the architecture typically used by tools such as OpenSearch Service and partner SIEM solutions.
Query access — Subscribers with query access can directly query AWS Lake Formation tables in your S3 bucket by using services like Athena. Although the primary query engine for Security Lake is Athena, you can also use other services that integrate with AWS Glue, such as Amazon Redshift Spectrum and Spark SQL.
In the sections that follow, we walk through how to configure cross-account sharing from Security Lake to visualize your data with QuickSight, and the associated Athena queries that are used. It’s a best practice to isolate log data from visualization workloads, and we recommend using a separate AWS account for QuickSight visualizations. A high-level overview of the architecture is shown in Figure 1.
Figure 1: Security Lake visualization architecture overview
In Figure 1, Security Lake data is being cataloged by AWS Glue in account A. This catalog is then shared to account B by using AWS Resource Access Manager. Users in account B are then able to directly query the cataloged Security Lake data using Athena, or get visualizations by accessing QuickSight dashboards that use Athena to query the data.
Configure a Security Lake subscriber
The following steps guide you through configuring a Security Lake subscriber using the delegated administrator account.
To configure a Security Lake subscriber
Sign in to the AWS Management Console and navigate to the Amazon Security Lake console in the Security Lake delegated administrator account. In this post, we’ll call this Account A.
Go to Subscribers and choose Create subscriber.
On the Subscriber details page, enter a Subscriber name. For example, cross-account-visualization.
For Log and event sources, select All log and event sources. For Data access method, select Lake Formation.
Add the Account ID for the AWS account that you’ll use for visualizations. In this post, we’ll call this Account B.
Security Lake creates a resource share in your visualizations account using AWS Resource Access Manager (AWS RAM). You can view the configuration of the subscriber from Security Lake by selecting the subscriber you just created from the main Subscribers page. It should look like Figure 2.
Figure 2: Subscriber configuration
Note: your configuration might be slightly different, based on what you’ve named your subscriber, the AWS Region you’re using, the logs being ingested, and the external ID that you created.
Configure Athena to visualize your data
Now that the subscriber is configured, you can move on to the next stage, where you configure Athena and QuickSight to visualize your data.
Note: In the following example, queries will be against Security Hub findings, using the Security Lake table in the ap-southeast-2 Region. If necessary, change the table name in your queries to match the Security Lake Region you use in the following configuration steps.
To configure Athena
Sign in to your QuickSight visualization account (Account B).
Navigate to the AWS Resource Access Manager (AWS RAM) console. You’ll see a Resource share invitation under Shared with me in the menu on the left-hand side of the screen. Choose Resource shares to go to the invitation.
Figure 3: RAM menu
On the Resource shares page, select the name of the resource share starting with LakeFormation-V3, and then choose Accept resource share. The Security Lake Glue catalog is now available to Account B to query.
For cross-account access, you should create a database to link the shared tables. Navigate to Lake Formation, and then under the Data catalog menu option, select Databases, then select Create database.
Enter a name, for example security_lake_visualization, and keep the defaults for all other settings. Choose Create database.
Figure 4: Create database
After you’ve created the database, you need to create resource links from the shared tables into the database. Select Tables under the Data catalog menu option. Select one of the tables shared by Security Lake by selecting the table’s name. You can identify the shared tables by looking for the ones that start with amazon_security_lake_table_.
From the Actions dropdown list, select Create resource link.
Figure 5: Creating a resource link
Enter the name for the resource link, for example amazon_security_lake_table_ap_southeast_2_sh_findings_1_0, and then select the security_lake_visualization database created in the previous steps.
Choose Create. After the links have been created, the names of the resource links will appear in italics in the list of tables.
You can now select the radio button next to the resource link, select Actions, and then select View data under Table. This takes you to the Athena query editor, where you can now run queries on the shared Security Lake tables.
Figure 6: Viewing data to query
To use Athena for queries, you must configure an S3 bucket to store query results. If this is the first time Athena is being used in your account, you’ll receive a message saying that you need to configure an S3 bucket. To do this, choose Edit settings in the information notice and follow the instructions.
In the Editor configuration, select AwsDataCatalog from the Data source options. The Database should be the database you created in the previous steps, for example security_lake_visualization.
After selecting the database, copy the query that follows and paste it into your Athena query editor, and then choose Run. This runs your first query to list 10 Security Hub findings:
SELECT * FROM "AwsDataCatalog"."security_lake_visualization"."amazon_security_lake_table_ap_southeast_2_sh_findings_1_0" limit 10;
Figure 7: Athena data query editor
This queries Security Hub data in Security Lake from the Region you specified, and outputs the results in the Query results section on the page. For a list of example Security Lake specific queries, see the AWS Security Analytics Bootstrap project, where you can find example queries specific to each of the Security Lake natively ingested data sources.
To build advanced dashboards, you can create views using Athena. The following is an example of a view that lists 100 findings with failed checks sorted by created_time of the findings.
CREATE VIEW security_hub_new_failed_findings_by_created_time AS
SELECT
finding.title, cloud.account_uid, compliance.status, metadata.product.name
FROM "security_lake_visualization"."amazon_security_lake_table_ap_southeast_2_sh_findings_1_0"
WHERE compliance.status = 'FAILED'
ORDER BY finding.created_time
limit 100;
You can now query the view to list the first 10 rows using the following query.
SELECT * FROM
"security_lake_visualization"."security_hub_new_failed_findings_by_created_time" limit 10;
Create a QuickSight dataset
Now that you’ve done a sample query and created a view, you can use Athena as the data source to create a dataset in QuickSight.
To create a QuickSight dataset
Sign in to your QuickSight visualization account (also known as Account B), and open the QuickSight console.
When using cross-account configuration with AWS Glue Catalog, you also need to configure permissions on tables that are shared through Lake Formation. For a detailed deep dive, see Use Amazon Athena and Amazon QuickSight in a cross-account environment. For the use case highlighted in this post, use the following steps to grant access on the cross-account tables in the Glue Catalog.
In the AWS Lake Formation console, navigate to the Tables section and select the resource link for the table, for example amazon_security_lake_table_ap_southeast_2_sh_findings_1_0.
Select Actions. Under Permissions, select Grant on target.
For the LF-Tags or catalog resources section, use the default settings.
For Table permissions, choose Select for both Table Permissions and Grantable Permissions.
Choose Grant.
Figure 8: Granting permissions in Lake Formation
After permissions are in place, you can create datasets. You should also verify that you’re using QuickSight in the same Region where Lake Formation is sharing the data. The simplest way to determine your Region is to check the QuickSight URL in your web browser. The Region will be at the beginning of the URL. To change the Region, select the settings icon in the top right of the QuickSight screen and select the correct Region from the list of available Regions in the drop-down menu.
Select Datasets, and then select New dataset. Select Athena from the list of available data sources.
Enter a Data source name, for example security_lake_visualizations, and leave the Athena workgroup as [primary]. Then select Create data source.
Select the tables to build your dashboards. On the Choose your table prompt, for Catalog, select AwsDataCatalog. For Database, select the database you created in the previous steps, for example security_lake_visualization. For Table, select the table with the name starting with amazon_security_lake_table_. Choose Select.
Figure 9: Selecting the table for a new dataset
On the Finish dataset creation prompt, select Import to SPICE for quicker analytics. Choose Visualize.
In the left-hand menu in QuickSight, you can choose attributes from the data set to add analytics and widgets.
After you’re familiar with how to use QuickSight to visualize data from Security Lake, you can create additional datasets and add other widgets to create dashboards that are specific to your needs.
AWS pre-built QuickSight dashboards
So far, you’ve seen how to use Athena manually to query your data and how to use QuickSight to visualize it. AWS Professional Services is excited to announce the publication of the Data Visualization framework to help customers quickly visualize their data using QuickSight. The repository contains a combination of CDK tools and scripts that can be used to create the required AWS objects and deploy basic data sources, datasets, analysis, dashboards, and the required user groups to QuickSight with respect to Security Lake. The framework includes three pre-built dashboards based on the following personas.
Persona: CISO/Executive Stakeholder
Role description: Owns and operates, with their support staff, all security-related activities within a business; total financial and risk accountability.
Challenges: Difficult to assess organizational aggregated security risk; burdened by license costs of security tooling; less agility in security programs due to mounting cost and complexity.
Goals: Reduce risk; reduce cost; improve metrics (MTTD/MTTR and others).

Persona: Security Data Custodian
Role description: Aggregates all security-related data sources while managing cost, access, and compliance.
Challenges: Writes new custom extract, transform, and load (ETL) code every time a new data source shows up, which is difficult to maintain; manually provisions access for users to view security data; constrained by cost and performance limitations in tools depending on licenses and hardware.
Goals: Reduce overhead to integrate new data; improve data governance; streamline access.

Persona: Security Operator/Analyst
Role description: Uses security tooling to monitor, assess, and respond to security-related events. Might perform incident response (IR), threat hunting, and other activities.
Challenges: Moves between many security tools to answer questions about data; lacks substantive automated analytics, manually processing and analyzing data; can’t look historically due to limitations in tools (licensing, storage, scalability).
Goals: Reduce total number of tools; increase observability; decrease time to detect and respond; decrease “alert fatigue”.
After deploying through the CDK, you will have three pre-built dashboards configured and available to view. Once deployed, each of these dashboards can be customized according to your requirements. The Data Lake Executive dashboard provides a high-level overview of security findings, as shown in Figure 10.
Figure 10: Example QuickSight dashboard showing an overview of findings in Security Lake
The Security Lake custodian role will have visibility of security-related data sources, as shown in Figure 11.
Figure 11: Security Lake custodian dashboard
And the Security Lake operator will have a view of security-related events, as shown in Figure 12.
Figure 12: Security Operator dashboard
Conclusion
In this post, you learned about Security Lake, and how you can use Athena to query your data and QuickSight to gain visibility of your security findings stored within Security Lake. When using QuickSight to visualize your data, it’s important to remember that the data remains in your S3 bucket within your own environment. However, if you have other use cases or wish to use other analytics tools such as OpenSearch, Security Lake gives you the freedom to choose how you want to interact with your data.
We also introduced the Data Visualization framework that was created by AWS Professional Services. The framework uses the CDK to deploy a set of pre-built dashboards to help get you up and running quickly.
With the announcement of AWS AppFabric, we’re making it even simpler to ingest data directly into Security Lake from leading SaaS applications without building and managing custom code or point-to-point integrations, enabling quick visualization of your data from a single place, in a common format.
For additional information on using Athena to query Security Lake, have a look at the AWS Security Analytics Bootstrap project, where you can find queries specific to each of the Security Lake natively ingested data sources. If you want to learn more about how to configure and use QuickSight to visualize findings, we have hands-on QuickSight workshops to help you configure and build QuickSight dashboards for visualizing your data.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
In our previous Journey to Cloud-Native blog posts, we talked about evolving our architecture to become more scalable, secure, and cost effective to handle hyperscale requirements. In this post, we take the next steps: 1/ containerizing our applications to improve resource efficiency, and 2/ using cell-based design to improve resiliency and time to production.
Containerize applications for standardization and scale
Standardize container orchestration tooling
Selecting the right service for your use case requires considering your organizational needs and skill sets for managing container orchestrators at scale. Our team chose Amazon Elastic Kubernetes Service (EKS), because we have the skills and experience working with Kubernetes. In addition, our leadership was committed to open source, and the topology awareness feature aligned with our resiliency requirements.
For containerizing our Java and .NET applications, we used AWS App2Container (A2C) to create deployment artifacts. AWS A2C reduced the time needed for dependency analysis and created artifacts that we could plug into our deployment pipeline. A2C supports EKS Blueprints with GitOps, which helps reduce the time to container adoption. Blueprints also improve consistency and follow security best practices. If you are unsure about the right tooling for running containerized workloads, you can refer to Choosing an AWS container service.
Identify the right tools for logging and monitoring
Logging in a container environment adds complexity due to the dynamic number of short-lived log sources. For proper tracing of events, we needed a way to collect logs from all of the system components and applications. We set up the Fluent Bit plugin to collect logs and send them to Amazon CloudWatch.
We used CloudWatch Container Insights for Prometheus for scraping Prometheus metrics from containerized applications. It allowed us to use existing tools (Amazon CloudWatch and Prometheus) to build purpose-built dashboards for different teams and applications. We created dashboards in Amazon Managed Grafana by using the native integration with Amazon CloudWatch. These tools took away the heavy lifting of managing logging and container monitoring from our teams.
Managing resource utilization
In hyperscale environments, a noisy neighbor container can consume all the resources for an entire cluster. Amazon EKS provides the ability to define and apply requests and limits for pods and containers. These values determine the minimum and maximum amount of a resource that a container can have. We also used resource quotas to configure the total amount of memory and CPU that can be used by all pods running in a namespace.
With Amazon EKS, the Kubernetes Metrics Server collects resource metrics from kubelets and exposes them in the Kubernetes API server through the Metrics API, for use by the Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). The HPA consumes these metrics to provide horizontal scaling by increasing the number of replicas to distribute your workloads. The VPA uses these metrics to dynamically adjust pod resources such as CPU and memory reservations.
With metrics server integration, Amazon EKS takes away the operational complexity of scaling applications and provides more granular controls on scaling applications dynamically. We recommend that hyperscale customers consider HPA as their preferred application autoscaler because it provides resiliency (increased number of replicas) in addition to scaling. On the cluster level, Karpenter provides rapid provisioning and de-provisioning of large numbers of diverse compute resources, in addition to providing cost effective usage. It helps applications hyperscale to meet business growth needs.
Build rollback strategies for failure management
In hyperscale environments, deployment rollouts typically have a low margin for errors and require close to zero downtime. We implemented progressive rollouts, canary deployments, and automated rollbacks to reduce risk for production deployments. We used key performance indicators (KPIs) like application response time and error rates to determine whether to continue or to rollback our deployment. We leveraged the integration with Prometheus to collect metrics and measure KPIs.
Improving resilience and scale using cell-based design
We still needed to handle black swan events and minimize the impact of unexpected failures or scaling events to make our applications more resilient. We came up with a design that creates independently deployable units of applications with contained fault isolation boundaries. The new design uses a cell-based approach, along with a shuffle sharding technique, to further improve resiliency. Peter Vosshall’s re:Invent talk details this approach. The design has two main components:
A lightweight routing layer that manages the traffic routing of incoming requests to the cells.
The cells themselves, which are independent units with isolated boundaries to prevent system wide impact.
Figure 1. High-level cell-based architecture
We used best practices from Guidance for Cell-based Architecture on AWS for implementation of our cell-based architectures and routing layer. In order to minimize the risk of failure in the routing layer, we made it the thinnest layer and tested robustness for all possible scenarios. We used a hash function to map cells to customers and stored the mapping in a highly scaled and resilient data store, Amazon DynamoDB. This layer eases the addition of new cells, and provides for a horizontal-scaled application environment to gracefully handle the hypergrowth of customers or orders.
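The following is a simplified sketch, not the production routing layer, of how a hash can map customers to a small set of cells (a shuffle shard) and how that mapping can be stored and looked up in DynamoDB; the table name, number of cells, and shard size are illustrative assumptions.

import hashlib
import random
import boto3

CELLS = [f"cell-{i}" for i in range(8)]   # assumed number of cells
SHARD_SIZE = 2                            # each customer maps to 2 cells

cell_map = boto3.resource("dynamodb").Table("CustomerCellMap")  # assumed mapping table

def assign_cells(customer_id):
    # Seed a PRNG from a hash of the customer ID so the same customer always gets
    # the same pseudo-random combination of cells (its shuffle shard).
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return random.Random(seed).sample(CELLS, SHARD_SIZE)

def route(customer_id):
    # The routing layer looks up (or creates) the mapping, then forwards the request to a cell.
    item = cell_map.get_item(Key={"customer_id": customer_id}).get("Item")
    if item is None:
        item = {"customer_id": customer_id, "cells": assign_cells(customer_id)}
        cell_map.put_item(Item=item)
    return item["cells"][0]  # in practice, pick a healthy cell from the shard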
In Figure 2, we revisit the architecture as mentioned on Blog #3 of our series. To manage AWS service limits and reduce blast radius, we spread our applications into multiple AWS accounts.
Figure 2. Architecture from Blog #3
Here’s how the new design looks when implemented on AWS:
Figure 3a. Cell-based architecture
Figure 3a uses Amazon EKS instead of Amazon EC2 for deploying applications, and uses a lightweight routing layer for incoming traffic. An Amazon Route 53 configuration was deployed in a networking shared services AWS account. We have multiple isolated boundaries in our design: an Availability Zone, an AWS Region, and an AWS account. This makes the architecture more resilient and flexible during hypergrowth.
We used our AWS availability zones as a cell boundary. We implemented shuffle sharding to map customers to multiple cells to reduce impact in case one of the cells goes down. We also used shuffle sharding to scale the number of cells that can process a customer request, in case we have unprecedented requests from a customer.
Black swan event mitigation
Let’s discuss how our cell-based application will react to black swan events. First, we need to align the pods into cell groups in the Amazon EKS cluster. Now we can observe how each deployment is isolated from the other deployments. Only the routing layer, which includes Amazon Route 53, Amazon DynamoDB, and the load balancer, is common across the system.
Figure 3b. Cell-based architecture zoomed in on EKS
Without a cell-based design, a black swan event could take down our application in the initial attack. Now that our application is spread over three cells, the worst case for an initial attack is 33% of our application’s capacity. We now have resiliency boundaries at the node level and availability zones.
Conclusion
In this blog post, we discussed how containerizing applications can improve resource efficiency and help standardize tooling and processes. This reduces engineering overhead of scaling applications. We talked about how adopting cell-based design and shuffle sharding further improves the resilience posture of our applications, our ability to manage failures, and handle unexpected large scaling events.
From customer interactions on e-commerce platforms to social media trends and from sensor data in internet of things (IoT) devices to financial market updates, streaming data encompasses a vast array of information. This ability to handle real-time flow often distinguishes successful organizations from their competitors. Harnessing the potential of streaming data processing offers organizations an opportunity to stay at the forefront of their industries, make data-informed decisions with unprecedented agility, and gain invaluable insights into customer behavior and operational efficiency.
AWS provides a foundation for building robust and reliable data pipelines that efficiently transport streaming data, eliminating the intricacies of infrastructure management. This shift empowers engineers to focus their talents and energies on creating business value, rather than spending their time managing infrastructure.
In a world of exploding data, traditional on-premises analytics struggle to scale and become cost-prohibitive. Modern data architecture on AWS offers a solution. It lets organizations easily access, analyze, and break down data silos, all while ensuring data security. This empowers real-time insights and versatile applications, from live dashboards to data lakes and warehouses, transforming the way we harness data.
This whitepaper guides you through implementing this architecture, focusing on streaming technologies. It simplifies data collection, management, and analysis, offering three movement patterns to glean insights from near real-time data using AWS’s tailored analytics services. The future of data analytics has arrived.
In this workshop, you’ll see how to process data in real-time, using streaming and micro-batching technologies in the context of anomaly detection. You will also learn how to integrate Apache Kafka on Amazon Managed Streaming for Apache Kafka (Amazon MSK) with an Apache Flink consumer to process and aggregate the events for reporting purposes.
Streaming architectures built on Apache Kafka follow the publish/subscribe paradigm: producers publish events to topics via a write operation and the consumers read the events.
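To make the publish/subscribe flow concrete, here is a minimal sketch using the kafka-python client (one of several Kafka clients; an assumption, not a recommendation). The bootstrap broker and topic name are placeholders, and a real Amazon MSK cluster would also need TLS or IAM authentication configured.

import json
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = "broker1:9092"   # placeholder; use your MSK bootstrap brokers
TOPIC = "orders"             # placeholder topic

# Producer: publish an event to a topic (a write operation).
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": "123", "amount": 42.5})
producer.flush()

# Consumer: subscribe to the topic and read events as they arrive.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BOOTSTRAP,
    group_id="reporting",
    auto_offset_reset="earliest",
)
for message in consumer:
    event = json.loads(message.value)
    print(event)  # hand off to aggregation or reporting logic here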
This video describes how to offer a real-time financial data feed as a service on AWS. By using Amazon MSK, you can work with Kafka to allow consumers to subscribe to message topics containing the data of interest. The sessions drills down into the best design practices for working with Kafka and the techniques for establishing hybrid connectivity for working at a global scale.
The Samsung SmartThings story is a compelling case study in how businesses can modernize and optimize their streaming data analytics, relieve the burden of infrastructure management, and embrace a future of real-time insights. After Samsung migrated to Amazon Managed Service for Apache Flink, the development team’s focus shifted from the tedium of infrastructure upkeep to the realm of delivering tangible business value. This change enabled them to harness the full potential of a fully managed stream-processing platform.
Unstructured data is information that doesn’t conform to a predefined schema or isn’t organized according to a preset data model. Unstructured information may have a little or a lot of structure, but in ways that are unexpected or inconsistent. Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. By some estimates, unstructured data can make up 80–90% of all new enterprise data and is growing many times faster than structured data. After decades of digitizing everything in your enterprise, you may have an enormous amount of data, but with dormant value. However, with the help of AI and machine learning (ML), new software tools are now available to unearth the value of unstructured data.
In this post, we discuss how AWS can help you successfully address the challenges of extracting insights from unstructured data. We discuss various design patterns and architectures for extracting and cataloging valuable insights from unstructured data using AWS. Additionally, we show how to use AWS AI/ML services for analyzing unstructured data.
Why it’s challenging to process and manage unstructured data
Unstructured data makes up a large proportion of the data in the enterprise that can’t be stored in a traditional relational database management system (RDBMS). Understanding the data, categorizing it, storing it, and extracting insights from it can be challenging. In addition, identifying incremental changes requires specialized patterns, and detecting sensitive data and meeting compliance requirements calls for sophisticated functions. It can be difficult to integrate unstructured data with structured data from existing information systems. Some view structured and unstructured data as apples and oranges, instead of being complementary. But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to be able to analyze and extract value from the data economically and flexibly.
Solution overview
Data and metadata discovery is one of the primary requirements in data analytics, where data consumers explore what data is available and in what format, and then consume or query it for analysis. If you can apply a schema on top of the dataset, then it’s straightforward to query because you can load the data into a database or impose a virtual table schema for querying. But in the case of unstructured data, metadata discovery is challenging because the raw data isn’t easily readable.
You can integrate different technologies or tools to build a solution. In this post, we explain how to integrate different AWS services to provide an end-to-end solution that includes data extraction, management, and governance.
The solution integrates data in three tiers. The first is the raw input data that gets ingested by source systems, the second is the output data that gets extracted from input data using AI, and the third is the metadata layer that maintains a relationship between them for data discovery.
The following is a high-level architecture of the solution we can build to process the unstructured data, assuming the input data is being ingested to the raw input object store.
The steps of the workflow are as follows:
Integrated AI services extract data from the unstructured data.
These services write the output to a data lake.
A metadata layer helps build the relationship between the raw data and AI extracted output. When the data and metadata are available for end-users, we can break the user access pattern into additional steps.
In the metadata catalog discovery step, we can use query engines to access the metadata for discovery and apply filters as per our analytics needs. Then we move to the next stage of accessing the actual data extracted from the raw unstructured data.
The end-user accesses the output of the AI services and uses the query engines to query the structured data available in the data lake. We can optionally integrate additional tools that help control access and provide governance.
There might be scenarios where, after accessing the AI extracted output, the end-user wants to access the original raw object (such as media files) for further analysis. Additionally, we need to make sure we have access control policies so the end-user has access only to the respective raw data they want to access.
Now that we understand the high-level architecture, let’s discuss what AWS services we can integrate in each step of the architecture to provide an end-to-end solution.
The following diagram is the enhanced version of our solution architecture, where we have integrated AWS services.
Let’s understand how these AWS services are integrated in detail. We have divided the steps into two broad user flows: data processing and metadata enrichment (Steps 1–3) and end-users accessing the data and metadata with fine-grained access control (Steps 4–6).
Various AI services (which we discuss in the next section) extract data from the unstructured datasets.
The output is written to an Amazon Simple Storage Service (Amazon S3) bucket (labeled Extracted JSON in the preceding diagram). Optionally, we can restructure the input raw objects for better partitioning, which can help while implementing fine-grained access control on the raw input data (labeled as the Partitioned bucket in the diagram).
After the initial data extraction phase, we can apply additional transformations to enrich the datasets using AWS Glue. We also build an additional metadata layer, which maintains a relationship between the raw S3 object path, the AI extracted output path, the optional enriched version S3 path, and any other metadata that will help the end-user discover the data.
The AI extracted output is expected to be available as a delimited file or in JSON format. We can create an AWS Glue Data Catalog table for querying using Athena or Redshift Spectrum. Like the previous step, we can use Lake Formation policies for fine-grained access control.
Lastly, the end-user accesses the raw unstructured data available in Amazon S3 for further analysis. We have proposed integrating Amazon S3 Access Points for access control at this layer. We explain this in detail later in this post.
Now let’s expand the following parts of the architecture to understand the implementation better:
Using AWS AI services to process unstructured data
Using S3 Access Points to integrate access control on raw S3 unstructured data
Process unstructured data with AWS AI services
As we discussed earlier, unstructured data can come in a variety of formats, such as text, audio, video, and images, and each type of data requires a different approach for extracting metadata. AWS AI services are designed to extract metadata from different types of unstructured data. The following are the most commonly used services for unstructured data processing:
Amazon Comprehend – This natural language processing (NLP) service uses ML to extract metadata from text data. It can analyze text in multiple languages, detect entities, extract key phrases, determine sentiment, and more. With Amazon Comprehend, you can easily gain insights from large volumes of text data such as extracting product entity, customer name, and sentiment from social media posts.
Amazon Transcribe – This speech-to-text service uses ML to convert speech to text and extract metadata from audio data. It can recognize multiple speakers, transcribe conversations, identify keywords, and more. With Amazon Transcribe, you can convert unstructured data such as customer support recordings into text and further derive insights from it.
Amazon Rekognition – This image and video analysis service uses ML to extract metadata from visual data. It can recognize objects, people, faces, and text, detect inappropriate content, and more. With Amazon Rekognition, you can easily analyze images and videos to gain insights such as identifying entity type (human or other) and identifying if the person is a known celebrity in an image.
Amazon Textract – You can use this ML service to extract metadata from scanned documents and images. It can extract text, tables, and forms from images, PDFs, and scanned documents. With Amazon Textract, you can digitize documents and extract data such as customer name, product name, product price, and date from an invoice.
Amazon SageMaker – This service enables you to build and deploy custom ML models for a wide range of use cases, including extracting metadata from unstructured data. With SageMaker, you can build custom models that are tailored to your specific needs, which can be particularly useful for extracting metadata from unstructured data that requires a high degree of accuracy or domain-specific knowledge.
Amazon Bedrock – This fully managed service offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon with a single API. It also offers a broad set of capabilities to build generative AI applications, simplifying development while maintaining privacy and security.
With these specialized AI services, you can efficiently extract metadata from unstructured data and use it for further analysis and insights. It’s important to note that each service has its own strengths and limitations, and choosing the right service for your specific use case is critical for achieving accurate and reliable results.
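As a small, hedged example of calling one of these services, the snippet below sends a made-up piece of text to Amazon Comprehend to detect sentiment and entities.

import boto3

comprehend = boto3.client("comprehend")

text = "The new SmartWidget arrived quickly and works great."  # example input

sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
entities = comprehend.detect_entities(Text=text, LanguageCode="en")

print(sentiment["Sentiment"])                     # e.g. POSITIVE
print([e["Text"] for e in entities["Entities"]])  # detected entity mentions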
AWS AI services are available via various APIs, which enables you to integrate AI capabilities into your applications and workflows. AWS Step Functions is a serverless workflow service that allows you to coordinate and orchestrate multiple AWS services, including AI services, into a single workflow. This can be particularly useful when you need to process large amounts of unstructured data and perform multiple AI-related tasks, such as text analysis, image recognition, and NLP.
With Step Functions and AWS Lambda functions, you can create sophisticated workflows that include AI services and other AWS services. For instance, you can use Amazon S3 to store input data, invoke a Lambda function to trigger an Amazon Transcribe job to transcribe an audio file, and use the output to trigger an Amazon Comprehend analysis job to generate sentiment metadata for the transcribed text. This enables you to create complex, multi-step workflows that are straightforward to manage, scalable, and cost-effective.
The following is an example architecture that shows how Step Functions can help invoke AWS AI services using Lambda functions.
The workflow steps are as follows:
Unstructured data, such as text files, audio files, and video files, are ingested into the S3 raw bucket.
A Lambda function is triggered to read the data from the S3 bucket and call Step Functions to orchestrate the workflow required to extract the metadata.
The Step Functions workflow checks the type of file, calls the corresponding AWS AI service APIs, checks the job status, and performs any postprocessing required on the output.
AWS AI services can be accessed via APIs and invoked as batch jobs. To extract metadata from different types of unstructured data, you can use multiple AI services in sequence, with each service processing the corresponding file type.
After the Step Functions workflow completes the metadata extraction process and performs any required postprocessing, the resulting output is stored in an S3 bucket for cataloging.
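To ground the workflow above, here is a hedged sketch of the Lambda function in step 2: it reacts to a new object in the raw S3 bucket and starts a Step Functions execution that selects the right AI service for the file type. The state machine ARN is a placeholder, and the state machine definition itself (written in Amazon States Language) is not shown.

import json
import urllib.parse
import boto3

sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:ExtractMetadata"  # placeholder

def handler(event, context):
    # Triggered by an S3 event notification on the raw bucket (step 2 of the workflow).
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # The state machine inspects the file type and calls the matching AI service
        # (Transcribe, Rekognition, Textract, Comprehend, ...) before writing output to S3.
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"bucket": bucket, "key": key}),
        )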
Next, let’s understand how we can implement security and access control on both the extracted output and the raw input objects.
Implement access control on raw and processed data in Amazon S3
Let’s consider access controls for three types of data when managing unstructured data: the AI-extracted semi-structured output, the metadata, and the raw unstructured original files. The AI-extracted output is in JSON format and can be restricted via Lake Formation and Amazon DataZone. We recommend keeping the metadata (information that captures which unstructured datasets are already processed by the pipeline and available for analysis) open to your organization, which enables metadata discovery across the organization.
To control access to raw unstructured data, you can use S3 Access Points, and explore additional support in the future as AWS services evolve. S3 Access Points simplify data access for any AWS service or customer application that stores data in Amazon S3. Access points are named network endpoints attached to buckets that you can use to perform S3 object operations. Each access point has distinct permissions and network controls that Amazon S3 applies to any request made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy attached to the underlying bucket. With S3 Access Points, you can create unique access control policies for each access point to easily control access to specific datasets within an S3 bucket. This works well in multi-tenant or shared bucket scenarios where users or teams are assigned to unique prefixes within one S3 bucket.
An access point can support a single user or application, or groups of users or applications within and across accounts, allowing separate management of each access point. Every access point is associated with a single bucket and contains a network origin control and a Block Public Access control. For example, you can create an access point with a network origin control that only permits storage access from your virtual private cloud (VPC), a logically isolated section of the AWS Cloud. You can also create an access point with the access point policy configured to only allow access to objects with a defined prefix or to objects with specific tags. You can also configure custom Block Public Access settings for each access point.
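As an illustration, the following boto3 sketch creates a VPC-restricted access point and attaches a prefix-scoped policy to it. The account ID, access point name, bucket, VPC ID, and role are hypothetical placeholders.

import json
import boto3

s3control = boto3.client("s3control")
ACCOUNT_ID = "111122223333"  # placeholder account ID

# Create an access point that only accepts requests originating from a specific VPC
s3control.create_access_point(
    AccountId=ACCOUNT_ID,
    Name="raw-data-team-a",                               # hypothetical access point name
    Bucket="unstructured-data-lake-bucket",               # hypothetical shared bucket
    VpcConfiguration={"VpcId": "vpc-0abc1234567890def"},  # hypothetical VPC
)

# Attach a policy that limits the access point to one team's prefix and IAM role
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{ACCOUNT_ID}:role/TeamARole"},  # hypothetical role
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:us-east-1:{ACCOUNT_ID}:accesspoint/raw-data-team-a/object/team-a/*",
    }],
}
s3control.put_access_point_policy(AccountId=ACCOUNT_ID, Name="raw-data-team-a", Policy=json.dumps(policy))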
The following architecture provides an overview of how an end-user can get access to specific S3 objects by assuming a specific AWS Identity and Access Management (IAM) role. If you have a large number of S3 objects to control access, consider grouping the S3 objects, assigning them tags, and then defining access control by tags.
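For tag-based control, a role policy can allow reads only on objects carrying a specific tag. The following is a hedged example built as a Python dictionary; the bucket name, tag key, and tag value are illustrative.

import json

# Hypothetical policy for a data-consumer role: reads are allowed only on objects tagged classification=public.
# Attach it to the role with IAM (for example, iam.put_role_policy) or your infrastructure-as-code tool.
tag_based_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::unstructured-data-lake-bucket/*",  # hypothetical bucket
        "Condition": {"StringEquals": {"s3:ExistingObjectTag/classification": "public"}},
    }],
}
print(json.dumps(tag_based_read_policy, indent=2))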
This post explained how you can use AWS AI services to extract readable data from unstructured datasets, build a metadata layer on top of them to allow data discovery, and build an access control mechanism on top of the raw S3 objects and extracted data using Lake Formation, Amazon DataZone, and S3 Access Points.
In addition to AWS AI services, you can also integrate large language models with vector databases to enable semantic or similarity search on top of unstructured datasets. To learn more about how to enable semantic search on unstructured data by integrating Amazon OpenSearch Service as a vector database, refer to Try semantic search with the Amazon OpenSearch Service vector engine.
As of writing this post, S3 Access Points is one of the best solutions to implement access control on raw S3 objects using tagging, but as AWS service features evolve in the future, you can explore alternative options as well.
About the Authors
Sakti Mishra is a Principal Solutions Architect at AWS, where he helps customers modernize their data architecture and define their end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.
Bhavana Chirumamilla is a Senior Resident Architect at AWS with a strong passion for data and machine learning operations. She brings a wealth of experience and enthusiasm to help enterprises build effective data and ML strategies. In her spare time, Bhavana enjoys spending time with her family and engaging in various activities such as traveling, hiking, gardening, and watching documentaries.
Sheela Sonone is a Senior Resident Architect at AWS. She helps AWS customers make informed choices and trade-offs about accelerating their data, analytics, and AI/ML workloads and implementations. In her spare time, she enjoys spending time with her family—usually on tennis courts.
Daniel Bruno is a Principal Resident Architect at AWS. He has been building analytics and machine learning solutions for over 20 years and splits his time between helping customers build data science programs and designing impactful ML products.
When adding AI into products, you need to design and implement robust data pipelines to build datasets and generate reports for your business. But, data pipelines for batch processing present common challenges: you have to guarantee data quality to make sure the downstream systems receive good data. You also need orchestrators to coordinate different big data jobs, and the architecture should be scalable to process terabytes of data.
With this edition of Let’s Architect!, we’ll cover important things to keep in mind while working in the area of data engineering. Most of these concepts come directly from the principles of system design and software engineering. We’ll show you how to extend beyond the basics to ensure you can handle datasets of any size — including for training AI models.
In software engineering, building robust and stable applications tends to have a direct correlation with overall organization performance. Data engineering and machine learning add extra complexity: they not only have to manage software, but they also involve datasets, data and training pipelines, as well as models.
The data community is incorporating the core concepts of engineering best practices found in software communities, but there is still space for improvement. This video covers ways to leverage software engineering practices for data engineering and demonstrates how measuring key performance metrics can help build more robust and reliable data pipelines. You will learn from the direct experience of engineering teams to understand how they built their mental models.
In a data architecture like data mesh, ensuring data quality is critical because data is a key product shared with multiple teams and stakeholders.
Data quality is a fundamental requirement for data pipelines to make sure downstream data consumers can run successfully and produce the expected output. For example, machine learning models are subject to garbage-in, garbage-out effects: if we train a model on a corrupted dataset, the model learns from inaccurate or incomplete data and may give incorrect predictions that impact your business.
Checking data quality is fundamental to make sure the jobs in our pipeline produce the right output. Deequ is a library built on top of Apache Spark that defines “unit tests for data” to find errors early, before the data gets fed to consuming systems or machine learning algorithms. Check it out on GitHub. To find out more, read Test data quality at scale with Deequ.
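The following is a minimal PyDeequ sketch (the Python interface to Deequ) showing how such unit tests for data might look, assuming a Spark session and a hypothetical events dataset with event_id and amount columns; exact package coordinates and column names will differ in your environment.

import os
from pyspark.sql import SparkSession
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

os.environ["SPARK_VERSION"] = "3.3"  # required by recent PyDeequ releases

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical dataset

check = Check(spark, CheckLevel.Error, "Basic event checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .isComplete("event_id")      # no missing identifiers
                    .isUnique("event_id")        # identifiers are unique
                    .isNonNegative("amount"))    # no negative amounts
          .run())

# Inspect the outcome before letting downstream jobs consume the data
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)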
Big data pipelines are often built on frameworks like Apache Spark for transforming and joining datasets for machine learning. This session explains Amazon EMR, a managed service to run compute jobs at scale on managed clusters, an excellent fit for running Apache Spark in production.
In this session, you’ll discover how to process over 250 billion events from broker-dealers and over 1.7 trillion events from exchanges within 4 hours. FINRA shares how they designed their system to improve the SLA for data processing and how they optimized their platform for cost and performance.
Apache Airflow is an open-source workflow management platform for data engineering pipelines: you can define your workflows as a sequence of tasks and let the framework orchestrate their execution.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow in the AWS cloud. This workshop is a great starting point to learn more about Apache Airflow, understand how you can take advantage of it for your data pipelines, and get hands-on experience to run it on AWS.
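As a starting point, an Airflow DAG for Amazon MWAA is plain Python. The following minimal sketch defines two dependent tasks with placeholder logic; the DAG ID, schedule, and task bodies are illustrative.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw events")            # placeholder task logic

def transform():
    print("cleaning and joining datasets")  # placeholder task logic

with DAG(
    dag_id="daily_events_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task    # transform runs only after extract succeeds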
Internet Travel Solutions, LLC (ITS) is a travel management company that develops and maintains smart products and services for the corporate, commercial, and cargo sectors. ITS streamlines travel bookings for companies of any size around the world. It provides an intuitive consumer site with an integrated view of your travel and expenses.
ITS had been using monolithic architectures to host travel applications for years. As demand grew, applications became more complex, difficult to scale, and challenging to update over time. This slowed down deployment cycles.
Building a microservices-based air travel search engine
Typically, when a customer accesses the search widget on the consumer site, they select their origin, destination, and travel dates. Then, flights matching these search criteria are displayed. Data is retrieved from the backend database, and multiple calls are made to the Global Distribution System and external partner’s APIs, which typically takes 10-15 seconds. ITS then uses proprietary logic combined with business policies to curate the best results for the user. The existing monolith system worked well for normal workloads. However, when the number of concurrent user requests increased, overall performance of the application degraded.
In order to enhance the user experience, significantly accelerate search speed, and advance ITS’ modernization initiative, ITS chose to restructure their air travel application into microservices. The key goals in rearchitecting the application are:
To break down search components into logical units
To reduce database load by serving transient requests through memory-based storage
To decrease application logic processing on ITS’ side to under 3 seconds
Overview of the solution
To begin, we decompose our air travel search engine into microservices (for example, search, list, PriceGraph, and more). Next, we containerize the application to simplify and optimize system utilization by running these microservices using AWS Fargate, a serverless compute option on Amazon ECS.
Every search call processes about 30-60 MB of data in varying formats from different data stores. We use a new JSON-based data format to streamline varying data formats and store this data in Amazon ElastiCache for Redis, an in-memory data store that provides sub-millisecond latency and data structure flexibility. Additionally, some of the static data used by our air travel search application was moved to Amazon DynamoDB for faster retrieval speeds.
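Conceptually, the search microservices follow a cache-aside pattern against ElastiCache for Redis. The following Python sketch illustrates the idea with the redis-py client; the endpoint, key scheme, TTL, and the call_gds_and_partners backend function are hypothetical.

import json
import redis  # redis-py; ElastiCache for Redis is protocol compatible

cache = redis.Redis(host="my-cluster.xxxxxx.use1.cache.amazonaws.com", port=6379)  # hypothetical endpoint

def call_gds_and_partners(origin, destination, date):
    # Placeholder for the slow GDS and partner API calls described above
    return {"flights": []}

def search_flights(origin, destination, date):
    key = f"search:{origin}:{destination}:{date}"
    cached = cache.get(key)
    if cached:                                   # cache hit: serve from memory
        return json.loads(cached)
    results = call_gds_and_partners(origin, destination, date)
    cache.setex(key, 900, json.dumps(results))   # keep results for 15 minutes for repeat searches
    return results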
Figure 1. ITS’ microservice architecture, using AWS
ITS’ modernized architecture has several benefits beyond reducing operational expenses (OpEx). Some of these advantages include:
Agility. This architecture streamlines development, testing, and deploying changes on individual components, leading to faster iterations and shorter time-to-market (TTM).
Scalability. The managed scaling feature of AWS Fargate eliminates the need to worry about cluster autoscaling when setting up capacity providers. Amazon ECS actively oversees the task lifecycle and health status, responding to unexpected occurrences like crashes or freezes by initiating tasks as necessary to fulfill our service demands. This capability enhances resource utilization, ensures business continuity, and lowers overall total cost of ownership (TCO), letting the application owner focus on business needs.
Improved performance. Integrating Amazon ElastiCache for Redis with Amazon ECS on AWS Fargate to cache frequently accessed data significantly improves search response times and lowers load on backend services.
Centralized configuration management. Decoupling configuration parameters like database connection strings and environment variables from application code by integrating AWS Systems Manager Parameter Store also provides consistency across tasks; see the sketch after this list.
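The following sketch shows how a task might read such parameters at startup with boto3; the parameter names are hypothetical.

import boto3

ssm = boto3.client("ssm")

def get_config(name):
    # Fetch a configuration value (for example, a connection string) from Parameter Store
    return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]

# Hypothetical parameter names shared by all Fargate tasks
redis_endpoint = get_config("/its/search/redis-endpoint")
db_connection_string = get_config("/its/search/db-connection-string")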
Results and metrics
ITS designed, tested, and implemented this architecture in their production environment. ITS benchmarked this solution against their monolith application under varying factors for four months and noticed a significant improvement in air travel search speeds and overall performance. Here are the results:
Single User | Non-cloud airlist page round trip (RT): Leg 1 | Non-cloud airlist page RT: Leg 2 | Cloud airlist page RT: Leg 1 | Cloud airlist page RT: Leg 2
Test 1 | 29 secs | 17 secs | 11 secs | 2 secs
Test 2 | 24 secs | 11 secs | 11.8 secs | 1 sec
Test 3 | 24 secs | 12 secs | 14 secs | 1 sec
Table 1. Monolithic versus modernized architecture response times
Searching round trip (RT) flights in the old system resulted in an average runtime of 27 seconds for the first leg and 12 seconds for the return leg. With the new system, the average time is 12 seconds for the first leg and 1.3 seconds for the return leg. This is a combined improvement of 72%.
Note that this time includes the trip time for our calls to reach an external vendor and receive inventory back. This usually ranges from 6 to 17 seconds, depending on the third-party system performance. Leg 2 performance for our new system is significantly faster (between 1-2 seconds). This is because search results are served directly from the Amazon ElastiCache for Redis in-memory datastore, rather than querying backend databases. This decreases load on the database, enabling it to handle more complex and resource-intensive operations efficiently.
Table 2 shows the results of endurance tests:
Endurance Test | Cloud airlist page RT: Leg 1 | Cloud airlist page RT: Leg 2
50 Users in 10 minutes | 14.01 secs | 4.48 secs
100 Users in 15 minutes | 14.47 secs | 13.31 secs
Table 2. Endurance test
Table 3 shows the results of spike tests:
Spike Test | Cloud airlist page RT: Leg 1 | Cloud airlist page RT: Leg 2
10 Users | 12.34 secs | 9.41 secs
20 Users | 11.97 secs | 10.55 secs
30 Users | 15 secs | 7.75 secs
Table 3. Spike test
Conclusion
In this blog post, we explored how Internet Travel Solutions, LLC (ITS) is using Amazon ECS on AWS Fargate, Amazon ElastiCache for Redis, and other services to containerize microservices, reduce costs, and increase application performance. The result is vastly improved search speed. ITS overcame many technical complexities and design considerations to modernize its air travel search engine.
Data governance is the process of ensuring the integrity, availability, usability, and security of an organization’s data. Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance. Data confidentiality refers to the protection and control of sensitive and private information to prevent unauthorized access, especially when dealing with personally identifiable information (PII). Data quality focuses on maintaining accurate, reliable, and consistent data across the organization. Poor data quality can lead to erroneous decisions, inefficient operations, and compromised business performance.
Companies need to ensure data confidentiality is maintained throughout the data pipeline and that high-quality data is available to consumers in a timely manner. A lot of this effort is manual, where data owners and data stewards define and apply the policies statically up front for each dataset in the lake. This gets tedious and delays the data adoption across the enterprise.
Let’s consider a fictional company, OkTank. OkTank has multiple ingestion pipelines that populate multiple tables in the data lake. OkTank wants to ensure the data lake is governed with data quality rules and access policies in place at all times.
Multiple personas consume data from the data lake, such as business leaders, data scientists, data analysts, and data engineers. For each set of users, a different level of governance is needed. For example, business leaders need top-quality and highly accurate data, data scientists cannot see PII data and need data within an acceptable quality range for their model training, and data engineers can see all data except PII.
Currently, these requirements are hard-coded and managed manually for each set of users. OkTank wants to scale this and is looking for ways to control governance in an automated way. Primarily, they are looking for the following features:
When new data and tables get added to the data lake, the governance policies (data quality checks and access controls) get automatically applied for them. Unless the data is certified to be consumed, it shouldn’t be accessible to the end-users. For example, they want to ensure basic data quality checks are applied on all new tables and provide access to the data based on the data quality score.
Due to changes in source data, the existing data profile of data lake tables may drift. It’s required to ensure the governance is met as defined. For example, the system should automatically mark columns as sensitive if sensitive data is detected in a column that was earlier marked as public and was available publicly for users. The system should hide the column from unauthorized users accordingly.
For the purpose of this post, the following governance policies are defined:
No PII data should exist in tables or columns tagged as public.
If a column has any PII data, the column should be marked as sensitive. The table should then also be marked sensitive.
The following data quality rules should be applied on all tables (a ruleset sketch follows this list):
All tables should have a minimum set of columns: data_key, data_load_date, and data_location.
data_key is a key column and should meet key requirements of being unique and complete.
data_location should match with locations defined in a separate reference (base) table.
The data_load_date column should be complete.
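The rules above map naturally onto AWS Glue Data Quality's rule language (DQDL). The following boto3 sketch registers an illustrative ruleset against the customer table; the rule syntax should be validated against the DQDL reference, and the referential-integrity check on data_location against the base table is omitted here for brevity.

import boto3

glue = boto3.client("glue")

# Illustrative DQDL ruleset for the checks listed above; validate syntax against the DQDL reference
ruleset = """
Rules = [
    ColumnExists "data_key",
    ColumnExists "data_load_date",
    ColumnExists "data_location",
    IsUnique "data_key",
    IsComplete "data_key",
    IsComplete "data_load_date"
]
"""

glue.create_data_quality_ruleset(
    Name="basic-governance-checks",  # hypothetical ruleset name
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "oktank_autogov_temp", "TableName": "customer"},
)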
User access to tables is controlled as per the following table.
User Description | Can Access Sensitive Tables | Can Access Sensitive Columns | Min Data Quality Threshold Needed to Consume Data
Category 1 | Yes | Yes | 100%
Category 2 | Yes | No | 50%
Category 3 | No | No | 0%
In this post, we use AWS Glue Data Quality and sensitive data detection features. We also use Lake Formation tag-based access control to manage access at scale.
The following diagram illustrates the solution architecture.
The governance requirements highlighted in the previous table are translated to the following Lake Formation LF-Tags.
IAM User | LF-Tag: tbl_class | LF-Tag: col_class | LF-Tag: dq_tag
Category 1 | sensitive, public | sensitive, public | DQ100
Category 2 | sensitive, public | public | DQ100, DQ90, DQ50_80, DQ80_90
Category 3 | public | public | DQ90, DQ100, DQ_LT_50, DQ50_80, DQ80_90
This post uses AWS Step Functions to orchestrate the governance jobs, but you can use any other orchestration tool of choice. To simulate data ingestion, we manually place the files in an Amazon Simple Storage Service (Amazon S3) bucket. In this post, we trigger the Step Functions state machine manually for ease of understanding. In practice, you can integrate or invoke the jobs as part of a data ingestion pipeline, via event triggers like AWS Glue crawler or Amazon S3 events, or schedule them as needed.
In this post, we use an AWS Glue database named oktank_autogov_temp and a target table named customer on which we apply the governance rules. We use AWS CloudFormation to provision the resources. AWS CloudFormation lets you model, provision, and manage AWS and third-party resources by treating infrastructure as code.
Prerequisites
Complete the following prerequisite steps:
Identify an AWS Region in which you want to create the resources and ensure you use the same Region throughout the setup and verifications.
Have a Lake Formation administrator role to run the CloudFormation template and grant permissions.
Sign in to the Lake Formation console and add yourself as a Lake Formation data lake administrator if you aren’t already an admin. If you are setting up Lake Formation for the first time in your Region, you can do this in the pop-up window that appears when you connect to the Lake Formation console and select the desired Region.
Otherwise, you can add data lake administrators by choosing Administrative roles and tasks in the navigation pane on the Lake Formation console and choosing Add administrators. Then select Data lake administrator, identify your users and roles, and choose Confirm.
You need to provide a unique bucket name and specify passwords for the three users reflecting three different user personas (Category 1, Category 2, and Category 3) that we use for this post.
The stack provisions an S3 bucket to store the dummy data, AWS Glue scripts, results of sensitive data detection, and Amazon Athena query results in their respective folders.
The stack copies the AWS Glue scripts into the scripts folder and creates two AWS Glue jobs Data-Quality-PII-Checker_Job and LF-Tag-Handler_Job pointing to the corresponding scripts.
The AWS Glue job Data-Quality-PII-Checker_Job applies the data quality rules and publishes the results. It also checks for sensitive data in the columns. In this post, we check for the PERSON_NAME and EMAIL data types. If any columns with sensitive data are detected, it persists the sensitive data detection results to the S3 bucket.
The following screenshot shows a sample result from the job after it runs. You can see this after you trigger the Step Functions workflow in subsequent steps. To check the results, on the AWS Glue console, choose ETL jobs and choose the job called Data-Quality-PII-Checker_Job. Then navigate to the Data quality tab to view the results.
The AWS Glue job LF-Tag-Handler_Job fetches the data quality metrics published by Data-Quality-PII-Checker_Job. It checks the status of the DataQuality_PIIColumns result. It gets the list of sensitive column names from the sensitive data detection file created by Data-Quality-PII-Checker_Job and tags the columns as sensitive. The rest of the columns are tagged as public. It also tags the table as sensitive if sensitive columns are detected. The table is marked as public if no sensitive columns are detected.
The job also checks the data quality score for the DataQuality_BasicChecks result set. It maps the data quality score into tags as shown in the following table and applies the corresponding tag on the table.
Data Quality Score | Data Quality Tag
100% | DQ100
90-100% | DQ90
80-90% | DQ80_90
50-80% | DQ50_80
Less than 50% | DQ_LT_50
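To illustrate how a handler job might translate a score into the tags above and apply them, here is a hedged Python sketch using the Lake Formation API; the thresholds mirror the table, and the score value shown is only an example (the actual job provisioned by the stack implements its own logic).

import boto3

lakeformation = boto3.client("lakeformation")

def dq_tag_for_score(score):
    # Map a data quality score (0.0-1.0) to the dq_tag values from the table above
    if score >= 1.0:
        return "DQ100"
    if score >= 0.9:
        return "DQ90"
    if score >= 0.8:
        return "DQ80_90"
    if score >= 0.5:
        return "DQ50_80"
    return "DQ_LT_50"

def apply_dq_tag(database, table, score):
    lakeformation.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": database, "Name": table}},
        LFTags=[{"TagKey": "dq_tag", "TagValues": [dq_tag_for_score(score)]}],
    )

apply_dq_tag("oktank_autogov_temp", "customer", 0.75)  # example score; would tag the table DQ50_80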
The CloudFormation stack copies some mock data to the data folder and registers this location under AWS Lake Formation Data lake locations so Lake Formation can govern access on the location using service-linked role for Lake Formation.
The customer subfolder contains the initial customer dataset for the table customer. The base subfolder contains the base dataset, which we use to check referential integrity as part of the data quality checks. The column data_location in the customer table should match with locations defined in this base table.
The stack also copies some additional mock data to the bucket under the data-v1 folder. We use this data to simulate data quality issues.
It also creates the following resources:
An AWS Glue database called oktank_autogov_temp and two tables under the database:
customer – This is our target table on which we will be governing the access based on data quality rules and PII checks.
base – This is the base table that has the reference data. One of the data quality rules checks that the customer data always adheres to locations present in the base table.
DataLakeUser_Category1 – The data lake user corresponding to the Category 1 user. This user should be able to access sensitive data but needs 100% accurate data.
DataLakeUser_Category2 – The data lake user corresponding to the Category 2 user. This user should not be able to access sensitive columns in the table. It needs more than 50% accurate data.
DataLakeUser_Category3 – The data lake user corresponding to the Category 3 user. This user should not be able to access tables containing sensitive data. Data quality can be 0%.
GlueServiceDQRole – The role for the data quality and sensitive data detection job.
GlueServiceLFTaggerRole – The role for the LF-Tags handler job for applying the tags to the table.
StepFunctionRole – The Step Functions role for triggering the AWS Glue jobs.
A Step Functions state machine named AutoGovMachine that you use to trigger the runs for the AWS Glue jobs to check data quality and update the LF-Tags.
Athena workgroups named auto_gov_blog_workgroup_temporary_user1, auto_gov_blog_workgroup_temporary_user2, and auto_gov_blog_workgroup_temporary_user3. These workgroups point to different Athena query result locations for each user. Each user is granted access to the corresponding query result location only. This ensures a specific user doesn’t access the query results of other users. You should switch to a specific workgroup to run queries in Athena as part of the test for the specific user.
The CloudFormation stack generates the following outputs. Take note of the values of the IAM users to use in subsequent steps.
Grant permissions
After you launch the CloudFormation stack, complete the following steps:
On the Lake Formation console, under Permissions choose Data lake permissions in the navigation pane.
Search for the database oktank_autogov_temp and table customer.
If IAMAllowedPrincipals access is present, select it and choose Revoke.
Choose Revoke again to revoke the permissions.
Category 1 users can access all data unless the data quality score of the table is below 100%. Therefore, we grant the user the necessary permissions; a boto3 equivalent of the following console steps is sketched after the list.
Under Permissions in the navigation pane, choose Data lake permissions.
Search for database oktank_autogov_temp and table customer.
Choose Grant
Select IAM users and roles and choose the value for UserCategory1 from your CloudFormation stack output.
Under LF-Tags or catalog resources, choose Add LF-Tag key-value pair.
Add the following key-value pairs:
For the col_class key, add the values public and sensitive.
For the tbl_class key, add the values public and sensitive.
For the dq_tag key, add the value DQ100.
For Table permissions, select Select.
Choose Grant.
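For reference, the same grant can be expressed programmatically. The following boto3 sketch grants SELECT to the Category 1 user through an LF-Tag expression; the principal ARN is a placeholder for the UserCategory1 output of the stack.

import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder for the UserCategory1 ARN from the CloudFormation stack output
principal_arn = "arn:aws:iam::111122223333:user/DataLakeUser_Category1"

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": principal_arn},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "tbl_class", "TagValues": ["public", "sensitive"]},
                {"TagKey": "col_class", "TagValues": ["public", "sensitive"]},
                {"TagKey": "dq_tag", "TagValues": ["DQ100"]},
            ],
        }
    },
    Permissions=["SELECT"],
)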
Category 2 users can’t access sensitive columns. They can access tables with a data quality score above 50%.
Repeat the preceding steps to grant the appropriate permissions in Lake Formation to UserCategory2:
For the col_class key, add the value public.
For the tbl_class key, add the values public and sensitive.
For the dq_tag key, add the values DQ50_80, DQ80_90, DQ90, and DQ100.
For Table permissions, select Select.
Choose Grant.
Category 3 users can’t access tables that contain any sensitive columns. Such tables are marked as sensitive by the system. They can access tables with any data quality score.
Repeat the preceding steps to grant the appropriate permissions in Lake Formation to UserCategory3:
For the col_class key, add the value public.
For the tbl_class key, add the value public.
For the dq_tag key, add the values DQ_LT_50, DQ50_80, DQ80_90, DQ90, and DQ100.
For Table permissions, select Select.
Choose Grant.
You can verify the LF-Tag permissions assigned in Lake Formation by navigating to the Data lake permissions page and searching for the Resource type LF-Tag expression.
Test the solution
Now we can test the workflow. We test three different use cases in this post. You will notice how the permissions to the tables change based on the values of LF-Tags applied to the customer table and the columns of the table. We use Athena to query the tables.
Use case 1
In this first use case, a new table was created on the lake and new data was ingested to the table. The data file cust_feedback_v0.csv was copied to the data/customer location in the S3 bucket. This simulates new data ingestion on a new table called customer.
Lake Formation doesn’t allow any users to access this table currently. To test this scenario, complete the following steps:
Sign in to the Athena console with the UserCategory1 user.
Switch the workgroup to auto_gov_blog_workgroup_temporary_user1 in the Athena query editor.
Choose Acknowledge to accept the workgroup settings.
Run the following query in the query editor:
select * from "oktank_autogov_temp"."customer" limit 10
On the Step Functions console, run the AutoGovMachine state machine.
In the Input – optional section, use the following JSON and replace the BucketName value with the bucket name you used for the CloudFormation stack earlier (for this post, we use auto-gov-blog):
{
"Comment": "Auto Governance with AWS Glue and AWS LakeFormation",
"BucketName": "<Replace with your bucket name>"
}
The state machine triggers the AWS Glue jobs to check data quality on the table and apply the corresponding LF-Tags.
You can check the LF-Tags applied on the table and the columns. To do so, when the state machine is complete, sign in to Lake Formation with the admin role used earlier to grant permissions.
Navigate to the table customer under the oktank_autogov_temp database and choose Edit LF-Tags to validate the tags applied on the table.
You can also validate that columns customer_email and customer_name are tagged as sensitive for the col_class LF-Tag.
To check this, choose Edit Schema for the customer table.
Select the two columns and choose Edit LF-Tags.
You can check the tags on these columns.
The rest of the columns are tagged as public.
Sign in to the Athena console with UserCategory1 and run the same query again:
select * from "oktank_autogov_temp"."customer" limit 10
This time, the user is able to see the data. This is because the LF-Tag permissions we applied earlier are in effect.
Sign in as UserCategory2 user to verify permissions.
Switch to workgroup auto_gov_blog_workgroup_temporary_user2 in Athena.
This user can access the table but can only see public columns. Therefore, the user shouldn’t be able to see the customer_email and customer_phone columns because these columns contain sensitive data as identified by the system.
Run the same query again:
select * from "oktank_autogov_temp"."customer" limit 10
Sign in to Athena and verify the permissions for DataLakeUser_Category3.
Switch to workgroup auto_gov_blog_workgroup_temporary_user3 in Athena.
This user can’t access the table because the table is marked as sensitive due to the presence of sensitive data columns in the table.
Run the same query again:
select * from "oktank_autogov_temp"."customer" limit 10
Use case 2
Let’s ingest some new data on the table.
Sign in to the Amazon S3 console with the admin role used earlier to grant permissions.
Copy the file cust_feedback_v1.csv from the data-v1 folder in the S3 bucket to the data/customer folder in the S3 bucket using the default options.
This new data file has data quality issues because the column data_location breaks referential integrity with the base table. This data also introduces some sensitive data in column comment1. This column was earlier marked as public because it didn’t have any sensitive data.
The following screenshot shows what the customer folder should look like now.
Run the AutoGovMachine state machine again and use the same JSON as the StartExecution input you used earlier:
{
"Comment": "Auto Governance with AWS Glue and AWS LakeFormation",
"BucketName": "<Replace with your bucket name>"
}
The job classifies column comment1 as sensitive on the customer table. It also updates the dq_tag value on the table because the data quality has changed due to the breaking referential integrity check.
You can verify the new tag values via the Lake Formation console as described earlier. The dq_tag value was DQ100. The value is changed to DQ50_80, reflecting the data quality score for the table.
Also, earlier the value for the col_class tag for the comment1 column was public. The value is now changed to sensitive because sensitive data is detected in this column.
Category 2 users shouldn’t be able to access sensitive columns in the table.
Sign in with UserCategory2 to Athena and rerun the earlier query:
select * from "oktank_autogov_temp"."customer" limit 10
The column comment1 is now not available for UserCategory2 as expected. The access permissions are handled automatically.
Also, because the data quality score goes down below 100%, this new dataset is now not available for the Category1 user. This user should have access to data only when the score is 100% as per our defined rules.
Sign in with UserCategory1 to Athena and rerun the earlier query:
select * from "oktank_autogov_temp"."customer" limit 10
You will see the user is not able to access the table now. The access permissions are handled automatically.
Use case 3
Let’s fix the invalid data and remove the data quality issue.
Delete the cust_feedback_v1.csv file from the data/customer Amazon S3 location.
Copy the file cust_feedback_v1_fixed.csv from the data-v1 folder in the S3 bucket to the data/customer S3 location. This data file fixes the data quality issues.
Rerun the AutoGovMachine state machine.
When the state machine is complete, the data quality score goes up to 100% again and the tag on the table gets updated accordingly. You can verify the new tag as shown earlier via the Lake Formation console.
The Category1 user can access the table again.
Clean up
To avoid incurring further charges, delete the CloudFormation stack to delete the resources provisioned as part of this post.
Conclusion
This post covered AWS Glue Data Quality and sensitive detection features and Lake Formation LF-Tag based access control. We explored how you can combine these features and use them to build a scalable automated data governance capability on your data lake. We explored how user permissions changed when data was initially ingested to the table and when data drift was observed as part of subsequent ingestions.
For further reading, refer to the following resources:
Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.
In this release, we have made the implementation guidance for the new and updated best practices more prescriptive, including enhanced recommendations and steps on reusable architecture patterns targeting specific business outcomes in the Amazon Web Services (AWS) Cloud.
A brief history
The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.
In 2020, Well-Architected Framework guidance had a new release, along with more lenses, as well as API integration with the AWS Well-Architected Tool. The sixth pillar, Sustainability, was added in 2021. In 2022, dedicated pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance. By April 2023, more than 50% of the Framework’s best practices had had their prescriptive guidance improved.
A brief history of the AWS Well-Architected Framework
What’s new
As customers mature in their journey, they are seeking guidance to achieve accurate solutions that is prescriptive to their business, environments, and workloads. AWS Well-Architected is committed to providing such information to customers by continually evolving and updating our guidance.
Operational Excellence
The Operational Excellence Pillar has received updates to two of the five Design Principles and has a new Design Principle on observability, which highlights its importance and relevance throughout the pillar content. All 10 best practices in OPS05 have been updated, and we have consolidated 28 best practices into 16 across four questions (OPS04, OPS06, OPS08, and OPS09), as well as improving prescriptive guidance.
Security
In the Security Pillar, the Incident response in SEC10 underwent an update to align with the AWS Security Incident Response Guide, while introducing one new best practice, and improving the prescriptive guidance for others. Two best practices in SEC08 and SEC09 have received improved prescriptive guidance on securing workloads at rest and in transit.
Reliability
The Reliability Pillar has received prescriptive guidance improvements to one best practice in REL06, and six best practices in REL11, focused on how to best monitor, failover, remediate, and limit impacts of failures. The update addresses a wide variety of managed services and designs, including multi-Region-based resilience.
Performance Efficiency
The Performance Efficiency Pillar has been completely restructured, consolidating and merging guidance to reduce the number of best practices by 10 and the number of questions by three. We have added best practices around efficient caching and optimizing hardware acceleration. We have also improved the implementation guidance in all 32 best practices of the newly restructured Pillar.
Cost Optimization
The Cost Optimization Pillar has 10 best practices with improved implementation prescriptive guidance.
Sustainability
The Sustainability Pillar has received updates to the risk levels of seven best practices.
Conclusion
This Well-Architected release includes updates and improvements to 90 best practices: Operational Excellence (26), Security (8), Reliability (7), Performance Efficiency (32), Cost Optimization (10), and Sustainability (7). These changes are in addition to the 151 improved best practices released in 2023 (127 on April 10, 2023, and 24 on July 13, 2023), resulting in more than 73% of the existing Framework best practices updated at least once in the last year.
As of this release, 100% of Performance Efficiency, Cost Optimization, and Sustainability; 63% of Operational Excellence; 60% of Security; and 50% of Reliability Pillar content have been refreshed at least once since October 2022.
The content is available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.
Updates in this release are also available in the AWS Well-Architected Tool, which can be used to review your workloads, address important design considerations, and help ensure that you follow the best practices and guidance of the AWS Well-Architected Framework.
SQL databases in Amazon Web Services (AWS), using services like Amazon Relational Database Service (Amazon RDS) and Amazon Aurora, offer software architects scalability, automated management, robust security, and cost-efficiency. This combination simplifies database management, improves performance, enhances security, and allows architects to create efficient and scalable software systems.
In this post, we introduce caching strategies and continue with real case studies that use services like Amazon ElastiCache or Amazon MemoryDB in real workloads where customers share the reasoning behind their approaches. It’s very important to understand the context for leveraging a specific solution or pattern, and these resources answer many commonly asked questions.
For software architects and developers, striking the right balance between operational complexity and cost efficiency is a perpetual challenge. Often, provisioning a separate database for each workload is the gold standard, offering unmatched isolation and granular operational controls. However, it’s not always the most cost-effective or operationally manageable approach. Through a real-world success story, we explore how Aurora played a pivotal role in helping VMware Aria Cost, powered by CloudHealth, consolidate a staggering 166 self-managed MySQL databases onto 62 Aurora clusters.
Amazon RDS Blue/Green Deployments revolutionizes the way you handle database updates, ensuring safety and simplicity, often achieving rapid updates in just a minute, with zero data loss. Meanwhile, Amazon RDS Optimized Writes turbocharges write transaction throughput by as much as double, without any additional cost. Amazon RDS Optimized Reads steps in to deliver a significant boost to database performance, processing queries up to 50% faster.
Discover how to leverage these capabilities of Amazon RDS in this one-hour video from re:Invent 2022.
In the world of mission-critical workloads, the importance of a robust disaster recovery (DR) strategy cannot be overstated. It’s the lifeline that ensures databases stay operational, even in the face of unexpected events. Discover the intricacies of crafting a dependable, cross-Region DR strategy tailored to Amazon RDS for SQL Server.
In this AWS Developers session, we uncover the best practices for efficiently managing and monitoring these cross-Region read replicas. From proactive monitoring to fine-tuning, you’ll gain the insights needed to keep your DR strategy finely tuned.
Aurora represents a paradigm shift in relational databases, boasting an architecture that decouples computational processes from data storage. It introduces advanced features, such as Global Database and low-latency read replicas, redefining the landscape of database management.
This modern database service excels in performance, scalability, and high availability on a large scale, offering compatibility with both MySQL and PostgreSQL open-source editions. Additionally, it provides an array of developer tools tailored for serverless and machine learning-driven applications.
This re:Invent 2022 session is an in-depth exploration of some of Aurora’s most compelling features, including Aurora Serverless v2 and Global Database. We also share the most recent innovations aimed at enhancing performance, scalability, and security while streamlining operational processes.
Oracle WebLogic Server is used by enterprises to power production workloads, including Oracle E-Business Suite (EBS) and Oracle Fusion Middleware applications.
Customer applications are deployed to WebLogic Server instances (managed servers) and managed using an administration server (admin server) within a logical organization unit, called a domain. Clusters of managed servers provide application availability and horizontal scalability, while the single-instance admin server does not host applications.
There are various architectures detailing WebLogic-managed server high availability (HA). In this post, we demonstrate using Availability Zones (AZ) and a floating IP address to achieve a “stretch cluster” (Oracle’s terminology).
Figure 1. Overview of a WebLogic domain
Overview of problem
The WebLogic admin server is important for domain configuration, management, and monitoring both application performance and system health. Historically, WebLogic was configured using IP addresses, with managed servers caching the admin server IP to reconnect if the connection was lost.
This can cause issues in a dynamic Cloud setup, as replacing the admin server from a template changes its IP address, causing two connectivity issues:
Communication within the domain: the admin and managed servers communicate via the T3 protocol, which is based on Java RMI.
Remote access to the admin server console: whether to allow internet admin access, and what additional security controls may be required, is beyond the scope of this post.
Here, we will explore how to minimize downtime and achieve HA for your admin server.
Solution overview
For this solution, there are three approaches customers tend to follow:
Use a floating virtual IP to keep the address static. This solution is familiar to WebLogic administrators because it replicates historical on-premises HA implementations. The remainder of this post dives into this practical implementation.
Use DNS to resolve the admin server IP address. This is also a supported configuration.
Run in a “headless configuration” and not (normally) run the admin server.
Use WebLogic Scripting Tool to issue commands
Collect and observe metrics through other tools
Running “headless” requires a high level of operational maturity. It may not be compatible with certain vendor-packaged applications deployed to WebLogic.
Using a floating IP address for WebLogic admin server
Here, we discuss the reference WebLogic deployment architecture on AWS, as depicted in Figure 2.
Figure 2. Reference WebLogic deployment with multi-AZ admin HA capability
In this example, a WebLogic domain resides in a virtual private cloud’s (VPC) private subnet. The admin server is on its own Amazon Elastic Compute Cloud (Amazon EC2) instance. It’s bound to the private IP 10.0.11.8 that floats across AZs within the VPC. There are two ways to achieve this:
Create a “dummy” subnet in the VPC (in any AZ) with the smallest allowed subnet size of /28. Excluding the first four and the last IP addresses of the subnet (which AWS reserves), choose an address. For a 10.0.11.0/28 subnet, we use 10.0.11.8 and configure the WebLogic admin server to bind to it.
Use an IP outside of the VPC. We discuss this second way and compare both processes in the later section “Alternate solution for multi-AZ floating IP”.
This example shows an AWS stretch architecture with one WebLogic domain and one admin server:
Create a VPC across two or more AZs, with one private subnet in each AZ for managed servers and an additional “dummy” subnet.
Create two EC2 instances, one for each of the WebLogic Managed Servers (distributed across the private subnets).
Use an Auto Scaling group to ensure a single admin server is running.
Create an Amazon EC2 launch template for the admin server.
Associate the launch template with an Auto Scaling group with minimum, maximum, and desired capacity of 1. The Auto Scaling group (ASG) detects EC2 and/or AZ degradation and launches a new instance in a different AZ if the current one fails.
Create an AWS Lambda function (example to follow) to be called by the Auto Scaling group lifecycle hook to update the route tables.
Update the user data commands (example to follow) of the launch template to:
Add the floating IP address to the network interface
Start the admin server using the floating IP
To route traffic to the floating IP, we update route tables for both public and private subnets.
We create a Lambda function that is launched by the Auto Scaling group lifecycle hook (pending:InService) when a new admin instance is created. This Lambda code updates routing rules in both route tables, mapping the dummy subnet CIDR (10.0.11.0/28) of the “floating” IP to the admin Amazon EC2 instance. This updates routes in both the public and private subnets for the dynamically launched admin server, enabling managed servers to connect.
Enabling internet access to the admin server
If enabling internet access to the admin server, create an internet-facing Application Load Balancer (ALB) attached to the public subnets. With the route to the admin server, the ALB can forward traffic to it.
Create an IP-based target group that points to the floating IP.
Add a forwarding rule in the ALB to route WebLogic admin traffic to the admin server.
User data commands in the launch template to make admin server accessible upon ASG scale out
In the admin server EC2 launch template, add user data code to monitor the ASG lifecycle state. When it reaches InService state, a Lambda function is invoked to update route tables. Then, the script starts the WebLogic admin server Java process (and associated NodeManager, if used).
The admin server instance’s SourceDestCheck attribute needs to be set to false, enabling it to bind to the logical IP. This change can also be done in the Lambda function.
When a user accesses the admin server from the internet:
Traffic flows to the elastic IP address associated to the internet-facing ALB.
The ALB forwards to the configured target group.
The ALB uses the updated routes to reach 10.0.11.8 (admin server).
When managed servers communicate with the admin server, they use the updated route table to reach 10.0.11.8 (admin server).
The Lambda function
Here, we present a Lambda function example that sets the EC2 instance SourceDestCheck attribute to false and updates the route rules for the dummy subnet CIDR (the “floating” IP on the admin server EC2 instance) in both the public and private route tables.
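The original function is not reproduced here; the following is a hedged Python sketch of what such a handler could look like, assuming the lifecycle hook notification arrives via EventBridge and that the route table IDs are passed in through an environment variable.

import os
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

FLOATING_CIDR = "10.0.11.0/28"
ROUTE_TABLE_IDS = os.environ["ROUTE_TABLE_IDS"].split(",")  # assumption: public and private route table IDs

def handler(event, context):
    # Assumption: invoked through EventBridge for the "EC2 Instance-launch Lifecycle Action" event
    detail = event["detail"]
    instance_id = detail["EC2InstanceId"]

    # Allow the instance to receive traffic addressed to the floating IP
    ec2.modify_instance_attribute(InstanceId=instance_id, SourceDestCheck={"Value": False})

    # Point the dummy subnet CIDR at the newly launched admin server in both route tables
    for rtb in ROUTE_TABLE_IDS:
        try:
            ec2.replace_route(RouteTableId=rtb, DestinationCidrBlock=FLOATING_CIDR, InstanceId=instance_id)
        except ClientError:
            ec2.create_route(RouteTableId=rtb, DestinationCidrBlock=FLOATING_CIDR, InstanceId=instance_id)

    # Let the Auto Scaling group move the instance toward InService
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionToken=detail["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )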
The following code in Amazon EC2 user data shows how to add logical secondary IP address to the Amazon EC2 primary ENI, keep polling the ASG lifecycle state, and start the admin server Java process upon Amazon EC2 entering the InService state.
# Add the floating IP as a secondary address on the primary ENI
ip addr add 10.0.11.8/28 brd 10.0.11.255 dev eth0

# Fetch an IMDSv2 token for the instance metadata service
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Poll the ASG target lifecycle state and start the admin server once the instance is InService
for x in {1..30}
do
  target_state=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state)
  if [ "$target_state" = "InService" ]; then
    su -c 'nohup /mnt/efs/wls/fmw/install/Oracle/Middleware/Oracle_Home/user_projects/domains/domain1/bin/startWebLogic.sh &' ec2-user
    break
  fi
  sleep 10
done
Alternate solution for multi-AZ floating IP
An alternative solution for the floating IP is to use an IP external to the VPC. The configurations for ASG, Amazon EC2 launch template, and ASG lifecycle hook Lambda function remain the same. However, the ALB cannot access the WebLogic admin console webapp from the internet due to its requirement for a VPC-internal subnet. To access the webapp in this scenario, stand up a bastion host in a public subnet.
While this approach “saves” 16 VPC IP addresses by avoiding a dummy subnet, there are disadvantages:
Bastion hosts are not resilient to AZ failure.
The approach lacks the true multi-AZ resilience of the first solution.
It adds cost and complexity to manage multiple bastion hosts across AZs or a VPN.
Conclusion
AWS has a track record of efficiently running Oracle applications, Oracle EBS, PeopleSoft, and mission-critical JEE workloads. In this post, we delved into an HA solution using a multi-AZ floating IP for the WebLogic admin server, using an ASG to ensure a single admin server. We showed how to use ASG lifecycle hooks and Lambda to automate route updates for the floating IP, and how to configure an ALB to allow internet access to the admin server. This solution achieves multi-AZ resilience for the WebLogic admin server with automated recovery, transforming a traditional WebLogic admin server from a pet to cattle.
In-memory databases play a critical role in modern computing, particularly in reducing the strain on existing resources, scaling workloads efficiently, and minimizing the cost of infrastructure. The advanced performance capabilities of in-memory databases make them vital for demanding applications characterized by voluminous data, real-time analytics, and rapid response requirements.
In this edition of Let’s Architect!, we introduce caching strategies and examine case studies that use AWS services, like Amazon ElastiCache or Amazon MemoryDB for Redis, in real workloads where customers share the reasoning behind their approaches. It is very important to understand the context for leveraging a specific solution or pattern, and many common questions can be answered with these resources.
Many services built at Amazon rely on caching systems in the background to speed up performance, deal with low latency requirements, and avoid overloading source databases and other microservices. Operating caches and adding caches into our systems may present complex challenges in terms of monitoring, data consistency, and load on the other components of the system. Indeed, a cache can give big benefits, but it’s also a new component to run and keep healthy. Furthermore, engineers may need to use empirical methods to choose the cache size, expiration policy, and eviction policy: we always have to perform tests and use the metrics to tune the setup.
With this Amazon Builder’s Library resource, you can learn strategies for using caching in your architecture and best practices directly from Amazon’s engineers.
Discover how Yahoo effectively leverages the power of Amazon ElastiCache and data tiering to process an astounding 1.3 million advertising data events per second, all while generating savings of up to 50% on their overall bill.
Data tiering is an ingenious method to scale up to hundreds of terabytes of capacity by intelligently managing data. It achieves this by automatically shifting the least-recently accessed data between RAM and high-performance SSDs.
In this video, you will gain insights into how data tiering operates and how you can unlock ultra-fast speeds and seamless scalability for your workloads in a cost-efficient manner. Furthermore, you can also learn how it’s implemented under the hood.
MemoryDB is a robust, durable database marked by microsecond reads, low single-digit millisecond writes, scalability, and fortified enterprise security. It guarantees an impressive 99.99% availability, coupled with instantaneous recovery without any data loss.
In this session, we explore multiple use cases across sectors, such as Financial Services, Retail, and Media & Entertainment, like payment processing, message brokering, and durable session store applications. Moreover, through a practical demonstration, you can learn how to utilize MemoryDB to establish a microservices message broker for a Media & Entertainment application.
MemoryDB offers the kind of ultra-fast performance that only an in-memory database can deliver, curtailing latency to microseconds and processing 160+ million requests per second, without data loss. In this re:Invent 2022 session, you will understand why Samsung SmartThings selected MemoryDB as the engine to power the next generation of their IoT device connectivity platform, one that processes millions of events every day.
You can also discover the intricate design of MemoryDB and how it ensures data durability without compromising the performance of in-memory operations, thanks to the utilization of a multi-AZ transactional log. This session is an enlightening deep-dive into durable, in-memory data operations.
In this edition of AWS Online Tech Talks, explore Amazon ElastiCache, a managed service that facilitates the seamless setup, operation, and scaling of widely used, open-source–compatible, in-memory datastores in the cloud environment. This service positions you to develop data-intensive applications or enhance the performance of your existing databases through high-throughput, low-latency, in-memory datastores. Learn how it is leveraged for caching, session stores, gaming, geospatial services, real-time analytics, and queuing functionalities.
This course can help cultivate a deeper understanding of Amazon ElastiCache, and how it can be used to accelerate your data processing while maintaining robustness and reliability.
For most organizations, protecting their high value assets is a top priority. AWS Web Application Firewall (AWS WAF) is an industry leading solution that protects web applications from the evolving threat landscape, which includes common web exploits and bots. These threats affect availability, compromise security, or can consume excessive resources. Though AWS WAF is a managed service, the operating model of this critical first layer of defence is often overlooked.
Operating models for a core service like AWS WAF differ depending on your company’s technology footprint, and use cases are dependent on workloads. While some businesses were born in the public cloud and have modern applications, many large established businesses have classic and legacy workloads across their business units. We will examine three distinct operating models using AWS WAF, AWS Firewall Manager service (AWS FMS), AWS Organizations, and other AWS services.
Operating Models
I. Centralized
The centralized model works well for organizations where the applications to be protected by AWS WAF are similar, and rules can be consistent. With multi-tenant environments (where tenants share the same infrastructure or application), AWS WAF can be deployed with the same web access control lists (web ACLs) and rules for consistent security. Content management systems (CMS) also benefit from this model, since consistent web ACL and rules can protect multiple websites hosted on their CMS platform. This operating model provides uniform protection against web-based attacks and centralized administration across multiple AWS accounts. For managing all your accounts and applications in AWS Organizations, use AWS Firewall Manager.
AWS Firewall Manager simplifies your AWS WAF administration and helps you enforce AWS WAF rules on the resources in all accounts in an AWS Organization, by using AWS Config in the background. The compliance dashboard gives you a simplified view of the security posture. A centralized information security (IS) team can configure and manage AWS WAF’s managed and custom rules.
AWS Managed Rules are designed to protect against common web threats, providing an additional layer of security for your applications. By leveraging AWS Managed Rules and their pre-configured rule groups, you can streamline the management of WAF configurations. This reduces the need for specialized teams to handle these complex tasks and thereby alleviates undifferentiated heavy lifting.
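As a concrete, hedged illustration of attaching an AWS Managed Rules rule group, the following AWS CDK (TypeScript) sketch creates a regional web ACL with the Core rule set. The stack, ACL, and metric names are hypothetical; in the centralized model an equivalent configuration would typically be defined once in an AWS Firewall Manager policy and rolled out across accounts.
import { Stack, StackProps, aws_wafv2 as wafv2 } from "aws-cdk-lib";
import { Construct } from "constructs";

export class BaselineWafStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Regional web ACL with a single AWS Managed Rules rule group
    new wafv2.CfnWebACL(this, "BaselineWebAcl", {
      name: "baseline-web-acl",        // hypothetical name
      scope: "REGIONAL",               // use CLOUDFRONT for CloudFront distributions
      defaultAction: { allow: {} },
      visibilityConfig: {
        cloudWatchMetricsEnabled: true,
        metricName: "baseline-web-acl",
        sampledRequestsEnabled: true,
      },
      rules: [
        {
          name: "AWSManagedRulesCommonRuleSet",
          priority: 0,
          overrideAction: { none: {} }, // managed rule groups use overrideAction, not action
          statement: {
            managedRuleGroupStatement: {
              vendorName: "AWS",
              name: "AWSManagedRulesCommonRuleSet",
            },
          },
          visibilityConfig: {
            cloudWatchMetricsEnabled: true,
            metricName: "CommonRuleSet",
            sampledRequestsEnabled: true,
          },
        },
      ],
    });
  }
}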
A centralized operating pattern (see Figure 1) requires IS teams to construct an AWS WAF policy by using AWS FMS and then implement it at scale in each and every account. Keeping current on the constantly changing threat landscape can be time-consuming and expensive. Security teams will have the option of selecting one or more rule groups from AWS Managed Rules or an AWS Marketplace subscription for each web ACL, along with any custom rule needed.
Figure 1. Centralized operating model for AWS WAF
AWS Config managed rules verify that AWS WAF logging is enabled and that rule groups, web ACLs, and regional and global AWS WAF deployments contain no empty rule sets. Managed rule sets simplify compliance monitoring and reporting, while assuring security and compliance. AWS CloudTrail monitors changes to AWS WAF configurations, providing valuable auditing capability for your operating environment.
This model places the responsibility for defining, enforcing, and reviewing security policies, as well as remediating any issues, squarely on the security administrator and IS team. While comprehensive, this approach may require careful management to avoid potential bottlenecks, especially in larger-scale operations.
II. Distributed
Many organizations start their IT operations on AWS from their inception. These organizations typically have multi-skilled infrastructure and development teams and a lean operating model. The distributed model shown in Figure 2 is a good fit for them. In this case, the application team understands the underlying infrastructure components and the Infrastructure as Code (IaC) that provisions them. It makes sense for these development teams to also manage the interconnected application security components, like AWS WAF.
The application teams own the deployment of AWS WAF and the setup of the Web ACLs for their respective applications. Typically, the Web ACL will be a combination of baseline rule groups and use case specific rule groups, both deployed and managed by the application team.
One challenge of the distributed model is inconsistent rule deployment, which can result in varying levels of protection across applications. Conflicting priorities within application teams can sometimes compromise the focus on security, for example by prioritizing feature rollouts over comprehensive risk mitigation. A strong governance model is very helpful in situations like these, where the security team might not be responsible for deploying the AWS WAF rules but does need visibility into the security posture. AWS security services such as AWS Security Hub and AWS Config rules can provide this visibility. For example, some of the managed Config rules and Security Hub controls check whether AWS WAF is enabled for Application Load Balancer (ALB) and Amazon API Gateway, and also whether the associated web ACL is empty (a minimal example of one such guardrail follows Figure 2).
Figure 2. Distributed operating model for AWS WAF
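As a minimal sketch of such a guardrail (stack and construct names are hypothetical), the AWS Config managed rule below flags Application Load Balancers that have no AWS WAF web ACL associated:
import { Stack, StackProps, aws_config as config } from "aws-cdk-lib";
import { Construct } from "constructs";

export class WafGovernanceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // AWS Config managed rule "alb-waf-enabled": reports ALBs without an associated web ACL
    new config.ManagedRule(this, "AlbWafEnabled", {
      identifier: "ALB_WAF_ENABLED",
    });
  }
}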
III. Hybrid
An organization that has a diverse range of customer-facing applications hosted in a number of different AWS accounts can benefit from a hybrid operating model. Organizations whose infrastructure is managed by a combination of an in-house security team, third-party vendors, contractors, and a managed cybersecurity operations center (CSOC) can also use this model. In this case, the security team can build and enforce a core AWS WAF rule set using AWS Firewall Manager, while application teams build and manage additional rules based on the requirements of the application. For example, use case-specific rule groups will differ between PHP applications and WordPress-based applications.
Information security teams can specify how core rule groups are ordered. The application administrator has the ability to add rules and rule groups that will be executed between the two rule group sets. This approach ensures that adequate security is applied to all legacy and modern applications, and developers can still write and manage custom rules for enhanced protection.
Organizations should adopt a collaborative DevSecOps model of development, where both the security team and the application development teams will build, manage, and deploy security rules. This can also be considered a hybrid approach combining the best of the central and distributed models, as shown in Figure 3.
Figure 3. Hybrid operating model for AWS WAF
Governance is shared between the centralized security team, responsible for baseline rule sets deployed across all AWS accounts, and the individual application teams, responsible for AWS WAF custom rule sets. To maintain security and compliance, AWS Config checks Amazon CloudFront, AWS AppSync, Amazon API Gateway, and ALB for AWS WAF association with managed rule sets. AWS Security Hub combines and prioritizes AWS Firewall Manager security findings, enabling visibility into AWS WAF rule conformance across AWS accounts and resources. This model requires close coordination between the two teams to ensure that security policies are consistent and all security issues are effectively addressed.
The AWS WAF incident response strategy includes detecting, investigating, containing, and documenting incidents, alerting personnel, developing response plans, implementing mitigation measures, and continuous improvement based on lessons learned. Threat modelling for AWS WAF involves identifying assets, assessing threats and vulnerabilities, defining security controls, testing and monitoring, and staying updated on threats and AWS WAF updates.
Conclusion
Using the appropriate operating model is key to ensuring that the right web application security controls are implemented. It accounts for the needs of both business and application owners. In the majority of implementations, the centralized and hybrid models work well, providing stratified policy enforcement, while the distributed model can be used for specific use cases. AWS Firewall Manager can be used to streamline the management of centralized and hybrid operating models across AWS Organizations.
Customers often need to architect solutions to support components across multiple cloud service providers, a need which may arise if they have acquired a company running on another cloud, or for functional purposes where specific services provide a differentiated capability. In this post, we will show you how to use the AWS Cloud Development Kit (AWS CDK) to create a single pane of glass for managing your multicloud resources.
AWS CDK is an open-source framework that builds on the underlying functionality provided by AWS CloudFormation. It allows developers to define cloud resources using common programming languages and an abstraction model based on reusable components called constructs. There is a misconception that CloudFormation and CDK can only be used to provision resources on AWS, but this is not the case. The CloudFormation registry, with support for third-party resource types, along with custom resource providers, allows any resource that can be configured via an API to be created and managed, regardless of where it is located.
Multicloud solution design paradigm
Multicloud solutions are often designed with services grouped and separated by cloud, creating a segregation of resources and functions within the design. This approach leads to duplicated layers of the solution, most commonly a duplication of resources and of the deployment processes for each environment. This duplication increases cost and management complexity, multiplying the potential break points within the solution or practice.
As customer needs have grown more complex, so has the need for IaC solutions capable of deploying resources across hybrid or multicloud environments. In meeting this need, the proliferation of supported tools, frameworks, languages, and practices has created “choice overload”. At worst, this scares the non-cloud-savvy away from adopting an IaC solution that would benefit their cloud journey; at best, it confuses the very reason for adopting an IaC practice.
A single pane of glass
Systems thinking is a holistic approach that focuses on the way a system’s constituent parts interrelate and how systems work as a whole, especially over time and within the context of larger systems. It is commonly accepted as the backbone of a successful systems engineering approach. Designing solutions from a full systems view, based on each component’s function and interrelation within the system across environments, aligns naturally with deploying each cloud-specific resource from a single control plane.
While AWS provides a list of services that can be used to help design, manage, and operate hybrid and multicloud solutions, with AWS as the primary cloud you can go beyond just using services to support multicloud. CloudFormation registry resource types model and provision resources using custom logic, as components of CloudFormation stacks. Public extensions are provided not only by AWS; third-party publishers also make extensions available for general use, meaning customers can create their own extensions and publish them for anyone to use.
The AWS CDK, which has a 1:1 mapping of all AWS CloudFormation resources, as well as a library of abstracted constructs, supports the ability to import custom AWS CloudFormation extensions, enabling customers and partners to create custom AWS CDK constructs for their extensions. The chosen programming language can be used to inherit and abstract the custom resource into reusable AWS CDK constructs, allowing developers to create solutions that contain native AWS extensions along with secondary hybrid or alternate cloud resources.
Providing the ability to integrate mixed resources in the same stack more closely aligns with the functional design and often diagrammatic depiction of the solution. In essence, we are creating a single IaC pane of glass over the entire solution, deployed through a single control plane. This lowers the complexity and the cost of maintaining separate modules and deployment pipelines across multiple cloud providers.
A common use case for a multicloud: disaster recovery
One of the most common use cases of the requirement for using components across different cloud providers is the need to maintain data sovereignty while designing disaster recovery (DR) into a solution.
Data sovereignty is the idea that data is subject to the laws of where it is physically located, and in some countries extends to regulations that if data is collected from citizens of a geographical area, then the data must reside in servers located in jurisdictions of that geographical area or in countries with a similar scope and rigor in their protection laws.
This requires organizations to remain in compliance with the data sovereignty regulations of their host country and, in cases such as state government agencies, with the stricter scope of state boundaries. Unfortunately, not all countries, and especially not all states, have multiple AWS Regions to select from when designing where their primary and recovery data backups will reside. Therefore, the DR solution needs to take advantage of multiple cloud providers in the same geography, and such a solution must be designed to back up or replicate data across providers.
The multicloud solution
A multicloud solution to the proposed use case backs up data from an AWS resource, such as an Amazon S3 bucket, to another cloud provider within the same geography, such as an Azure Blob Storage container, using AWS event-driven behavior to trigger the copying of data from the primary AWS resource to the secondary Azure backup resource.
Following the IaC single pane of glass approach, the Azure Blob Storage container is created as a resource type in the CloudFormation registry and imported into the AWS CDK to be used as a construct in the solution. However, before the extension resource type can be used effectively in the CDK as a reusable construct and added to your private library, you will first need to go through the CDK import process for creating constructs.
There are three different levels of constructs, beginning with low-level constructs, which are called CFN Resources (or L1, short for “layer 1”). These constructs directly represent all resources available in AWS CloudFormation. They are named CfnXyz, where Xyz is the name of the resource.
Layer 1 Construct
In this example, an L1 construct named CfnAzureBlobStorage represents an Azure::BlobStorage AWS CloudFormation extension. Here you also explicitly expose the ref property, so that higher-level constructs can access the output value, which will be the URL of the provisioned Azure blob container.
As with every CDK Construct, the constructor arguments are scope, id and props. scope and id are propagated to the cdk.Construct base class. The props argument is of type CfnAzureBlobStorageProps which includes four properties all of type string. This is how the Azure credentials are propagated down from upstream constructs.
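The original listing for this construct is not reproduced here, so the following is a minimal sketch of what the L1 construct could look like, assuming the extension is registered under the type name Azure::BlobStorage and that the four string props carry the Azure subscription ID, client ID, tenant ID, and client secret name:
import { CfnResource } from "aws-cdk-lib";
import { Construct } from "constructs";

// Hypothetical props: four string values carrying the Azure credentials
export interface CfnAzureBlobStorageProps {
  readonly subscriptionId: string;
  readonly clientId: string;
  readonly tenantId: string;
  readonly clientSecretName: string;
}

// L1 construct wrapping the Azure::BlobStorage CloudFormation extension
export class CfnAzureBlobStorage extends Construct {
  // Ref exposes the blob container URL returned by the resource provider
  public readonly ref: string;

  constructor(scope: Construct, id: string, props: CfnAzureBlobStorageProps) {
    super(scope, id);
    const resource = new CfnResource(this, "Resource", {
      // Type name as described in the post; actual registry names are
      // typically three-part (for example, <Org>::Azure::BlobStorage)
      type: "Azure::BlobStorage",
      properties: {
        SubscriptionId: props.subscriptionId,
        ClientId: props.clientId,
        TenantId: props.tenantId,
        ClientSecretName: props.clientSecretName,
      },
    });
    this.ref = resource.ref;
  }
}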
Layer 2 Construct
The next level of constructs, L2, also represent AWS resources, but with a higher-level, intent-based API. They provide similar functionality, but incorporate the defaults, boilerplate, and glue logic you’d be writing yourself with a CFN Resource construct. They also provide convenience methods that make it simpler to work with the resource.
In this example, an L2 construct is created to abstract the CfnAzureBlobStorage L1 construct and provides additional properties and methods.
The custom L2 construct class is declared as AzureBlobStorage, without the Cfn prefix used by L1 constructs. Its constructor arguments include the Azure credentials and client secret name, and the ref from the L1 construct is exposed as the public property blobContainerUrl.
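A minimal sketch of that L2 construct, consistent with how it is consumed by the L3 construct later in this post (the file layout and import path are assumptions):
import { Construct } from "constructs";
import { CfnAzureBlobStorage } from "./cfn-azure-blob-storage";

// L2 construct: abstracts the L1 resource and exposes the container URL
export class AzureBlobStorage extends Construct {
  public readonly blobContainerUrl: string;

  constructor(
    scope: Construct,
    id: string,
    subscriptionId: string,
    clientId: string,
    tenantId: string,
    clientSecretName: string
  ) {
    super(scope, id);

    const blobStorage = new CfnAzureBlobStorage(this, "Resource", {
      subscriptionId,
      clientId,
      tenantId,
      clientSecretName,
    });

    // Surface the Ref (the provisioned blob container URL) to consumers
    this.blobContainerUrl = blobStorage.ref;
  }
}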
As an L2 construct, the AzureBlobStorage construct could be used in CDK Apps along with AWS Resource Constructs in the same Stack, to be provisioned through AWS CloudFormation creating the IaC single pane of glass for a multicloud solution.
Layer 3 Construct
The true value of the CDK construct programming model is in the ability to extend L2 constructs, which represent a single resource, into a composition of multiple constructs that provide a solution for a common task. These are Layer 3 (L3) constructs, also known as patterns.
In this example, the L3 construct represents the solution architecture to backup objects uploaded to an Amazon S3 bucket into an Azure Blob Storage container in real-time, using AWS Lambda to process event notifications from Amazon S3.
import { RemovalPolicy, Duration, CfnOutput } from "aws-cdk-lib";
import { Bucket, BlockPublicAccess, EventType } from "aws-cdk-lib/aws-s3";
import { DockerImageFunction, DockerImageCode } from "aws-cdk-lib/aws-lambda";
import { PolicyStatement, Effect } from "aws-cdk-lib/aws-iam";
import { LambdaDestination } from "aws-cdk-lib/aws-s3-notifications";
import { IStringParameter, StringParameter } from "aws-cdk-lib/aws-ssm";
import { Secret, ISecret } from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";
import { AzureBlobStorage } from "./azure-blob-storage";
// L3 Construct
export class S3ToAzureBackupService extends Construct {
  constructor(
    scope: Construct,
    id: string,
    azureSubscriptionIdParamName: string,
    azureClientIdParamName: string,
    azureTenantIdParamName: string,
    azureClientSecretName: string
  ) {
    super(scope, id);

    // Note: getSSMParameter, getSecret, and createStringSSMParameter are
    // helper methods of this construct (not shown here).

    // Retrieve existing SSM Parameters
    const azureSubscriptionIdParameter = this.getSSMParameter("AzureSubscriptionIdParam", azureSubscriptionIdParamName);
    const azureClientIdParameter = this.getSSMParameter("AzureClientIdParam", azureClientIdParamName);
    const azureTenantIdParameter = this.getSSMParameter("AzureTenantIdParam", azureTenantIdParamName);

    // Retrieve existing Azure Client Secret
    const azureClientSecret = this.getSecret("AzureClientSecret", azureClientSecretName);

    // Create an S3 bucket
    const sourceBucket = new Bucket(this, "SourceBucketForAzureBlob", {
      removalPolicy: RemovalPolicy.RETAIN,
      blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
    });

    // Create a corresponding Azure Blob Storage account and a Blob Container
    const azureBlobStorage = new AzureBlobStorage(
      this,
      "MyCustomAzureBlobStorage",
      azureSubscriptionIdParameter.stringValue,
      azureClientIdParameter.stringValue,
      azureTenantIdParameter.stringValue,
      azureClientSecretName
    );

    // Create a Lambda function that will receive notifications from the S3 bucket
    // and copy each newly uploaded object to Azure Blob Storage
    const copyObjectToAzureLambda = new DockerImageFunction(
      this,
      "CopyObjectsToAzureLambda",
      {
        timeout: Duration.seconds(60),
        code: DockerImageCode.fromImageAsset("copy_s3_fn_code", {
          buildArgs: {
            "--platform": "linux/amd64"
          }
        }),
      },
    );

    // Allow the Lambda function to read from the S3 bucket
    sourceBucket.grantRead(copyObjectToAzureLambda);

    // Add an IAM policy statement to allow the Lambda function to get the contents
    // of an S3 object
    copyObjectToAzureLambda.addToRolePolicy(
      new PolicyStatement({
        effect: Effect.ALLOW,
        actions: ["s3:GetObject"],
        resources: [`arn:aws:s3:::${sourceBucket.bucketName}/*`],
      })
    );

    // Set up an S3 bucket notification to trigger the Lambda function
    // when an object is uploaded
    sourceBucket.addEventNotification(
      EventType.OBJECT_CREATED,
      new LambdaDestination(copyObjectToAzureLambda)
    );

    // Grant the Lambda function read access to the existing SSM Parameters
    azureSubscriptionIdParameter.grantRead(copyObjectToAzureLambda);
    azureClientIdParameter.grantRead(copyObjectToAzureLambda);
    azureTenantIdParameter.grantRead(copyObjectToAzureLambda);

    // Put the Azure Blob Container URL into SSM Parameter Store
    this.createStringSSMParameter(
      "AzureBlobContainerUrl",
      "Azure blob container URL",
      "/s3toazurebackupservice/azureblobcontainerurl",
      azureBlobStorage.blobContainerUrl,
      copyObjectToAzureLambda
    );

    // Grant the Lambda function read access to the secret
    azureClientSecret.grantRead(copyObjectToAzureLambda);

    // Output the S3 bucket ARN
    new CfnOutput(this, "sourceBucketArn", {
      value: sourceBucket.bucketArn,
      exportName: "sourceBucketArn",
    });

    // Output the Blob Container URL
    new CfnOutput(this, "azureBlobContainerUrl", {
      value: azureBlobStorage.blobContainerUrl,
      exportName: "azureBlobContainerUrl",
    });
  }
}
The custom L3 construct can be used in larger IaC solutions by instantiating the S3ToAzureBackupService class and providing the names of the SSM parameters holding the Azure credentials, together with the client secret name, as constructor arguments.
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import { S3ToAzureBackupService } from "./s3-to-azure-backup-service";
export class MultiCloudBackupCdkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const s3ToAzureBackupService = new S3ToAzureBackupService(
      this,
      "MyMultiCloudBackupService",
      "/s3toazurebackupservice/azuresubscriptionid",
      "/s3toazurebackupservice/azureclientid",
      "/s3toazurebackupservice/azuretenantid",
      "s3toazurebackupservice/azureclientsecret"
    );
  }
}
Solution Diagram
Diagram 1: IaC Single Control Plane, demonstrates the concept of the Azure Blob Storage extension being imported from the AWS CloudFormation registry into AWS CDK as an L1 CfnResource, wrapped into an L2 construct, and used in an L3 pattern alongside AWS resources to perform the specific task of backing up from an Amazon S3 bucket into an Azure Blob Storage container.
Diagram 1: IaC Single Control Plane
The CDK application is then synthesized into one or more AWS CloudFormation Templates, which result in the CloudFormation service deploying AWS resource configurations to AWS and Azure resource configurations to Azure.
This solution demonstrates not only how to consolidate the management of secondary cloud resources into a unified infrastructure stack in AWS, but also the productivity gained by eliminating the complexity and cost of operating multiple deployment mechanisms across multiple public cloud environments.
The following video demonstrates an example in real-time of the end-state solution:
Next Steps
While this was a straightforward example, the same approach can be applied to far more complex scenarios where AWS CDK serves as a single pane of glass for IaC to manage multicloud and hybrid solutions.
To get started with the solution discussed in this post, this workshop provides the instructions you need to create the S3ToAzureBackupService.
Once you have learned how to create AWS CloudFormation extensions and develop them into AWS CDK Constructs, you will learn how, with just a few lines of code, you can develop reusable multicloud unified IaC solutions that deploy through a single AWS control plane.
Conclusion
By adopting AWS CloudFormation extensions and AWS CDK, deployed through a single AWS control plane, the cost and complexity of maintaining deployment pipelines across multiple cloud providers is reduced to a single holistic solution-focused pipeline. The techniques demonstrated in this post and the related workshop provide a capability to simplify the design of complex systems, improve the management of integration, and more closely align the IaC and deployment management practices with the design.
As competition grows fiercer, marketers need ways to reach each user with personalized content on their most critical channels. Short Message Service (SMS) is a key part of that effort, reaching more than 5 billion people worldwide with an impressive 82% open rate. However, SMS lacks the built-in engagement metrics supported by other channels.
To bridge this gap, leading customer engagement platform, Braze, recently built an in-house SMS link shortening solution using Amazon DynamoDB and Amazon DynamoDB Accelerator (DAX). It’s designed to handle up to 27 billion redirects per month, allowing marketers to automatically shorten SMS-related URLs. Alongside the Braze Intelligence Suite, you can use SMS click data in reporting functions and retargeting actions. Read on to learn how Braze created this feature and the impact it’s having on marketers and consumers alike.
SMS link shortening approach
Many Braze customers have used third-party SMS link shortening solutions in the past. However, this approach complicates the SMS composition process and isolates click metrics from Braze analytics. This makes it difficult to get a full picture of SMS performance.
Figure 1. Multiple approaches for shortening URLs
The following table compares the pros and cons of the three approaches.
Scenario | #1 – Unshortened URL in SMS | #2 – 3rd Party Shortener | #3 – Braze Link Shortening & Click Tracking
Low Character Count | X | ✓ | ✓
Total Clicks | X | ✓ | ✓
Ability to Retarget Users | X | X | ✓
Ability to Trigger Subsequent Messages | X | X | ✓
With link shortening built in-house and more tightly integrated into the Braze platform, Braze can maintain more control over their roadmap priority. By developing the tool internally, Braze achieved a 90% reduction in ongoing expenses compared with the $400,000 annual expense associated with using an outside solution.
Braze SMS link shortening: Flow and architecture
Figure 2. SMS link shortening architecture
The following steps explain the link shortening architecture:
First, customers initiate campaigns via the Braze Dashboard. Using this interface, they can also make requests to shorten URLs.
The URL registration process is managed by a Kubernetes-deployed Go-based service. This service not only shortens the provided URL but also maintains reference data in Amazon DynamoDB.
After processing, the dashboard receives the generated campaign details alongside the shortened URL.
The fully refined campaign can be efficiently distributed to intended recipients through SMS channels.
Upon a user’s interaction with the shortened URL, the message gets directed to the URL redirect service. This redirection occurs through an Application Load Balancer.
The redirect service processes links in messages, calls the service, and replaces links before sending to carriers.
Asynchronous calls feed data to a Kafka queue for metrics, using the HTTP sink connector integrated with Braze systems.
The registration and redirect services are decoupled from the Braze platform to enable independent deployment and scaling due to different requirements. Both the services are running the same code, but with different endpoints exposed, depending on the functionality of a given Kubernetes pod. This restricts internal access to the registration endpoint and permits independent scaling of the services, while still maintaining a fast response time.
Braze SMS link shortening: Scale
Right now, our customers use the Braze platform to send about 200 million SMS messages each month, with peak speeds of around 2,000 messages per second. Many of these messages contain one or more URLs that need to be shortened. In order to support the scalability of the link shortening feature and give us room to grow, we designed the service to handle 33 million URLs sent per month, and 3.25 million redirects per month. We assumed that we’d see up to 65 million database writes per month and 3.25 million reads per month in connection with the redirect service. This would require storage of 65 GB per month, with peaks of ~2,000 writes and 100 reads per second.
With these needs in mind, we carried out testing and determined that Amazon DynamoDB made the most sense as the backend database for the redirect service. To determine this, we tested read and write performance and found that it exceeded our needs. Additionally, it was fully managed, thus requiring less maintenance expertise, and included DAX out of the box. Most clicks happen close to send, so leveraging DAX helps us smooth out the read and write load associated with the SMS link shortener.
Because we know how long we must keep the relevant written elements at write time, we’re able to use DynamoDB Time to Live (TTL) to effectively manage their lifecycle. Finally, we’re careful to evenly distribute partition keys to avoid hot partitions, and DynamoDB’s autoscaling capabilities make it possible for us to respond more efficiently to spikes in demand.
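To make that lifecycle handling concrete, here is a hedged TypeScript sketch (AWS SDK v3) of writing a shortened-link record with a TTL attribute. The table name, attribute names, and retention period are hypothetical, and in production the reads and writes would go through the DAX client rather than directly to DynamoDB:
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";

const client = new DynamoDBClient({});

// Store a shortened-link record that DynamoDB will expire automatically.
// The TTL attribute (here "expiresAt", in epoch seconds) must be enabled on the table.
async function putShortLink(urlId: string, destinationUrl: string, campaignId: string, ttlDays: number): Promise<void> {
  const expiresAt = Math.floor(Date.now() / 1000) + ttlDays * 24 * 60 * 60;
  await client.send(
    new PutItemCommand({
      TableName: "sms-short-links",
      Item: {
        urlId: { S: urlId },                 // partition key
        destinationUrl: { S: destinationUrl },
        campaignId: { S: campaignId },
        expiresAt: { N: String(expiresAt) }, // TTL attribute
      },
    })
  );
}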
Braze SMS link shortening: Flow
Figure 3. Braze SMS link shortening flow
When the marketer initiates an SMS send, Braze checks its primary datastore (a MongoDB collection) to see if the link has already been shortened (see Figure 3). If it has, Braze re-uses that shortened link and continues the send. If it hasn’t, the registration process is initiated to generate a new site identifier that encodes the generation date and saves campaign information in DynamoDB via DAX.
The response from the registration service is used to generate a short link (1a) for the SMS.
A recipient gets an SMS containing a short link (2).
Recipient decides to tap it (3). Braze smoothly redirects them to the destination URL, and updates the campaign statistics to show that the link was tapped.
Using Amazon Route 53’s latency-based routing, Braze directs the recipient to the nearest endpoint (Braze currently has North America and EU deployments), then inspects the link to ensure validity and that it hasn’t expired. If it passes those checks, the redirect service queries DynamoDB via DAX for information about the redirect (3a). Initial redirects are cached at send time, while later requests query the DAX cache.
The user is redirected with a P99 redirect latency of less than 10 milliseconds (3b).
Emit campaign-level metrics on redirects.
Braze generates URL identifiers, which serve as the partition key to the DynamoDB collection, by generating a random number. We concatenate the generation date timestamp to the number, then Base66 encode the value. This results in a generated URL that looks like https://brz.ai/5xRmz, with “5xRmz” being the encoded URL identifier. The use of randomized partition keys helps avoid hot, overloaded partitions. Embedding the generation date lets us see when a given link was generated without querying the database. This helps us maintain performance and reduce costs by removing old links from the database. Other cost control measures include autoscaling and the use of DAX to avoid repeat reads of the same data. We also query DynamoDB directly against a hash key, avoiding scatter-gather queries.
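As an illustration of that identifier scheme, the sketch below generates a random component, appends a generation-date timestamp, and encodes the result with a 66-character URL-safe alphabet. The exact alphabet, timestamp resolution, and random range are assumptions for the example, not Braze’s implementation:
// Hypothetical 66-character alphabet: digits, letters, and the URL-safe characters - . _ ~
const ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-._~";

// Encode a non-negative bigint using the Base66 alphabet
function base66Encode(value: bigint): string {
  if (value === 0n) return ALPHABET[0];
  const base = BigInt(ALPHABET.length);
  let out = "";
  let n = value;
  while (n > 0n) {
    out = ALPHABET[Number(n % base)] + out;
    n = n / base;
  }
  return out;
}

// Build a short-link identifier: random component + generation-date timestamp,
// mirroring the scheme described in the post (details are assumptions)
function generateUrlIdentifier(now: Date = new Date()): string {
  const random = BigInt(Math.floor(Math.random() * 1_000_000)); // random partition component
  const dayStamp = BigInt(Math.floor(now.getTime() / 86_400_000)); // generation date, in days
  const combined = BigInt(`${random}${dayStamp}`); // concatenate, then encode
  return base66Encode(combined);
}

console.log(`https://brz.ai/${generateUrlIdentifier()}`);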
Braze link shortening feature results
Since its launch, SMS link shortening has been used by over 300 Braze customer companies in more than 700 million SMS messages. This includes 50% of the total SMS volume sent by Braze during last year’s Black Friday period. There has been a tangible reduction in the time it takes to build and send SMS. “The Motley Fool”, a financial media company, saved up to four hours of work per month while driving click rates of up to 15%. Another Braze client utilized multimedia messaging service (MMS) and link shortening to encourage users to shop during their “Smart Investment” campaign, rewarding users with additional store credit. Using the engagement data collected with Braze link shortening, they were able to offer engaged users unique messaging and follow-up offers. They retargeted users who did not interact with the message via other Braze messaging channels.
Conclusion
The Braze platform is designed to be both accessible to marketers and capable of supporting best-in-class cross-channel customer engagement. Our SMS link shortening feature, supported by AWS, enables marketers to provide an exceptional user experience and save time and money.
Every software component built by engineers and architects is designed with a purpose: to offer particular functionality and, ultimately, contribute to the generation of business value. We should consider fundamental factors, such as the scalability of the software and the ease of evolution as the business changes. However, performance and cost are important factors as well, since they can impact business profitability.
This edition of Let’s Architect! follows a similar series post from 2022, which discusses optimizing the cost of an architecture. Today, we focus on architectural patterns, services, and best practices to design cost-optimized cloud workloads. We also want to identify solutions, such as the use of Graviton processors, for increased performance at a lower price. Cost optimization is a continuous process that requires the identification of the right tools for each job, as well as the adoption of efficient designs for your system.
Govern cloud usage and avoid cost surprises without slowing down innovation within your organization. In this re:Invent 2022 session, you can learn how to set up guardrails and operationalize cost control within your organizations using services, such as AWS Budgets and AWS Cost Anomaly Detection, and explore the latest enhancements in the AWS cost control space. Additionally, Mercado Libre shares how they automate their cloud cost control through central management and automated algorithms.
Work backwards from team needs to define/deploy cloud governance in AWS environments
Compute optimization
When it comes to optimizing compute workloads, there are many tools available, such as AWS Compute Optimizer, Amazon EC2 Spot Instances, Amazon EC2 Reserved Instances, and Graviton instances. Modernizing your applications can also lead to cost savings, but you need to know how to use the right tools and techniques in an effective and efficient way.
For AWS Lambda functions, you can use the AWS Lambda Cost Optimization video to learn how to optimize your costs. The video covers topics, such as understanding and graphing performance versus cost, code optimization techniques, and avoiding idle wait time. If you are using Amazon Elastic Container Service (Amazon ECS) and AWS Fargate, you can watch a Twitch video on cost optimization using Amazon ECS and AWS Fargate to learn how to adjust your costs. The video covers topics like using spot instances, choosing the right instance type, and using Fargate Spot.
The choice of hardware is a fundamental driver of the performance, cost, and resource consumption of the systems we build. Graviton is a family of processors designed by AWS to support cloud-based workloads and deliver improvements in both performance and cost. This re:Invent 2022 presentation introduces Graviton and addresses the problems it can solve, how the underlying CPU architecture is designed, and how to get started with it. Furthermore, you can learn the journey to move different types of workloads to this architecture, such as containers, Java applications, and C applications.
The Cost Optimization section of the AWS Well Architected Workshop helps you learn how to optimize your AWS costs by using features, such as AWS Compute Optimizer, Spot Instances, and Reserved Instances. The workshop includes hands-on labs that walk you through the process of optimizing costs for different types of workloads and services, such as Amazon Elastic Compute Cloud, Amazon ECS, and Lambda.
Security is fundamental to every product and service you build. Whether you are working on the back end or on the data and machine learning components of a system, the solution should be securely built.
In 2022, we discussed security in our post Let’s Architect! Architecting for Security. Today, we take a closer look at general security practices for your cloud workloads to secure both networks and applications, with a mix of resources to show you how to architect for security using the services offered by Amazon Web Services (AWS).
In this edition of Let’s Architect!, we share some practices for protecting your workloads from the most common attacks, introduce the Zero Trust principle (you can learn how AWS itself is implementing it!), plus how to move to containers and/or alternative approaches for managing your secrets.
In this session from AWS re:Invent, security engineers guide you through the most common threat vectors and vulnerabilities that AWS customers faced in 2022. For each threat, you can learn how it is carried out by attackers, the weaknesses attackers tend to leverage, and the solutions offered by AWS to avert these security issues. We describe this as fundamental architecting for security: it implies adopting suitable services to protect your workloads, as well as following architectural practices for security.
What is Zero Trust? It is a security model that produces higher security outcomes compared with the traditional network perimeter model.
How does Zero Trust work in practice, and how can you start adopting it? This AWS re:Invent 2022 session defines the Zero Trust models and explains how to implement one. You can learn how it is used within AWS, as well as how any architecture can be built with these pillars in mind. Furthermore, there is a practical use case to show you how Delphix put Zero Trust into production.
Nowadays, it’s vital to have a thorough understanding of a container’s underlying security layers. AWS services, like Amazon Elastic Kubernetes Service and Amazon Elastic Container Service, have harnessed these Linux security-layer protections, keeping a sharp focus on the principle of least privilege. This approach significantly minimizes the potential attack surface by limiting the permissions and privileges of processes, thus upholding the integrity of the system.
This re:Inforce 2023 session discusses best practices for securing containers for your distributed systems.
Secrets play a critical role in providing access to confidential systems and resources. Ensuring the secure and consistent management of these secrets, however, presents a challenge for many organizations.
Anti-patterns observed in numerous organizational secrets management systems include sharing plaintext secrets via unsecured means, such as emails or messaging apps, allowing application developers to view secrets in plaintext, and neglecting to rotate secrets regularly. This detailed guidance walks you through the steps of discovering and classifying secrets, plus explains the implementation and migration processes involved in transferring secrets to AWS Secrets Manager.
SeatGeek is a ticketing platform for web and mobile users, offering ticket purchase and reselling for sports games, concerts, and theatrical productions. In 2022, SeatGeek had an average of 47 million daily tickets available, and their mobile app was downloaded 33+ million times.
Historically, SeatGeek used multiple identity and access tools internally. Applications were individually managing authorization, leading to increased overhead and a need for more standardization. SeatGeek sought to simplify the API provided to customers and partners by abstracting and standardizing the authorization layer. They were also looking to introduce centralized API rate-limiting to prevent noisy neighbor problems in their multi-tenant SaaS application.
In this blog, we will take you through SeatGeek’s journey and explore the solution architecture they’ve implemented. As of the publication of this post, many B2B customers have adopted this solution to query terabytes of business data.
Building multi-tenant SaaS environments
Multi-tenant SaaS environments allow highly performant and cost-efficient applications by sharing underlying resources across tenants. While this is a benefit, it is important to implement cross-tenant isolation practices to adhere to security, compliance, and performance objectives. With that, each tenant should only be able to access their authorized resources. Another consideration is the noisy neighbor problem that occurs when one of the tenants monopolizes excessive shared capacity, causing performance issues for other tenants.
Authentication, authorization, and rate-limiting are critical components of a secure and resilient multi-tenant environment. Without these mechanisms in place, there is a risk of unauthorized access, resource-hogging, and denial-of-service attacks, which can compromise the security and stability of the system. Validating access early in the workflow can help eliminate the need for individual applications to implement similar heavy-lifting validation techniques.
SeatGeek had several criteria for addressing these concerns:
SeatGeek did not want to introduce any additional infrastructure management overhead; plus, they preferred to use serverless services to “stitch” managed components together (with minimal effort) to implement their business requirements.
They wanted this solution to scale as seamlessly as possible with demand and adoption increases; concurrently, SeatGeek did not want to pay for idle or over-provisioned resources.
Exploring the solution
The SeatGeek team used a combination of Amazon Web Services (AWS) serverless services to address the aforementioned criteria and achieve the desired business outcome. Amazon API Gateway was used to serve APIs at the entry point to SeatGeek’s cloud environment. API Gateway allowed SeatGeek to use a custom AWS Lambda authorizer for integration with Auth0 and defining throttling configurations for their tenants. Since all the services used in the solution are fully serverless, they do not require infrastructure management, are scaled up and down automatically on-demand, and provide pay-as-you-go pricing.
SeatGeek created a set of tiered usage plans in API Gateway (bronze, silver, and gold) to introduce rate-limiting. Each usage plan had a pre-defined request-per-second rate limit. A unique API key was created by API Gateway for each tenant, and Amazon DynamoDB was used to store the association between existing tenant IDs (managed by Auth0) and API keys (managed by API Gateway). This kept API key management transparent to SeatGeek’s tenants.
Each new tenant goes through an onboarding workflow. This is an automated process managed with Terraform. During new tenant onboarding, SeatGeek creates a new tenant ID in Auth0, a new API key in API Gateway, and stores association between them in DynamoDB. Each API key is also associated with one of the usage plans.
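SeatGeek automates this onboarding with Terraform; purely as an illustration, and to stay consistent with the CDK examples elsewhere in this document, a hedged AWS CDK (TypeScript) sketch of the API Gateway side of that workflow, with hypothetical names and limits, could look like this:
import { Stack, StackProps, aws_apigateway as apigateway } from "aws-cdk-lib";
import { Construct } from "constructs";

export class TenantOnboardingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const api = new apigateway.RestApi(this, "TenantApi", {
      deployOptions: { stageName: "prod" },
    });
    // Require an API key on the method so usage plans can be enforced
    api.root.addMethod("GET", undefined, { apiKeyRequired: true });

    // Tiered usage plan with a requests-per-second rate limit (values are illustrative)
    const goldPlan = api.addUsagePlan("GoldPlan", {
      name: "gold",
      throttle: { rateLimit: 100, burstLimit: 200 },
    });
    goldPlan.addApiStage({ stage: api.deploymentStage });

    // One API key per tenant; the tenant-ID-to-key association would be stored in DynamoDB
    const tenantKey = api.addApiKey("TenantApiKey", { apiKeyName: "tenant-1234" });
    goldPlan.addApiKey(tenantKey);
  }
}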
Once onboarding completes, the new tenant can start invoking SeatGeek APIs (Figure 1).
Tenant authenticates with Auth0 using machine-to-machine authorization. Auth0 returns a JSON web token representing tenant authentication success. The token includes claims required for downstream authorization, such as tenant ID, expiration date, scopes, and signature.
Tenant sends a request to the SeatGeek API. The request includes the token obtained in Step 1 and application-specific parameters, for example, retrieving the last 12 months of booking data.
Lambda authorizer retrieves the token validation keys from Auth0. The keys are cached in the authorizer, so this happens only once for each authorizer launch environment. This allows token validation locally without calling Auth0 each time, reducing latency and preventing an excessive number of requests to Auth0.
Lambda authorizer performs token validation, checking tokens’ structure, expiration date, signature, audience, and subject. In case validation succeeds, Lambda authorizer extracts the tenant ID from the token.
Lambda authorizer uses the tenant ID extracted in the previous step to retrieve the associated API key from DynamoDB and returns it to API Gateway (a minimal sketch of this authorizer flow follows the list).
API Gateway uses the API key to check whether the client making this request is above the rate-limit threshold, based on the usage plan associated with that key. If the rate limit is exceeded, HTTP 429 ("Too Many Requests") is returned to the client. Otherwise, the request is forwarded to the backend for further processing.
Optionally, the backend can perform additional application-specific token validations.
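The following is a minimal sketch of the authorizer flow described in the steps above, written as a TypeScript Lambda token authorizer. The JWT library choice, environment variables, and DynamoDB table shape are assumptions; production code would also cache the Auth0 signing keys and verify the issuer and audience:
import { APIGatewayTokenAuthorizerEvent, APIGatewayAuthorizerResult } from "aws-lambda";
import { DynamoDBClient, GetItemCommand } from "@aws-sdk/client-dynamodb";
import { verify } from "jsonwebtoken"; // hypothetical choice of JWT library

const ddb = new DynamoDBClient({});

export const handler = async (
  event: APIGatewayTokenAuthorizerEvent
): Promise<APIGatewayAuthorizerResult> => {
  // 1. Validate the Auth0-issued JWT (AUTH0_PUBLIC_KEY is a PEM-encoded key in this sketch;
  //    signing keys would normally be fetched from Auth0 once and cached)
  const token = event.authorizationToken.replace("Bearer ", "");
  const claims = verify(token, process.env.AUTH0_PUBLIC_KEY!) as { tenantId: string };

  // 2. Look up the API key associated with this tenant (table name and shape are assumptions)
  const item = await ddb.send(
    new GetItemCommand({
      TableName: process.env.TENANT_TABLE!,
      Key: { tenantId: { S: claims.tenantId } },
    })
  );
  const apiKey = item.Item?.apiKey?.S;
  if (!apiKey) throw new Error("Unauthorized");

  // 3. Allow the request and hand the API key back so API Gateway can apply the usage plan
  return {
    principalId: claims.tenantId,
    usageIdentifierKey: apiKey,
    policyDocument: {
      Version: "2012-10-17",
      Statement: [{ Action: "execute-api:Invoke", Effect: "Allow", Resource: event.methodArn }],
    },
  };
};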
Architecture benefits
The architecture implemented by SeatGeek provides several benefits:
Centralized authorization: Using Auth0 with API Gateway and a Lambda authorizer standardizes API authentication and removes the burden of individual applications having to implement authorization.
Multiple levels of caching: Each Lambda authorizer launch environment caches token validation keys in memory to validate tokens locally. This reduces token validation time and helps to avoid excessive traffic to Auth0. In addition, API Gateway can be configured with up to 5 minutes of caching for Lambda authorizer response, so the same token will not be revalidated in that timespan. This reduces overall cost and load on Lambda authorizer and DynamoDB.
Noisy neighbor prevention: Usage plans and rate limits prevent any particular tenant from monopolizing the shared resources and causing a negative performance impact for other tenants.
Simple management and reduced total cost of ownership: Using AWS serverless services removed the infrastructure maintenance overhead and allowed SeatGeek to deliver business value faster. It also ensured they didn’t pay for over-provisioned capacity, and their environment could scale up and down automatically and on demand.
Conclusion
In this blog, we explored how SeatGeek used AWS serverless services, such as API Gateway, Lambda, and DynamoDB, to integrate with external identity provider Auth0, and implemented per-tenant rate limits with multi-tiered usage plans. Using AWS serverless services allowed SeatGeek to avoid undifferentiated heavy-lifting of infrastructure management and accelerate efforts to build a solution addressing business requirements.