Tag Archives: Thought Leadership

How AWS threat intelligence deters threat actors

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/how-aws-threat-intelligence-deters-threat-actors/

Every day across the Amazon Web Services (AWS) cloud infrastructure, we detect and successfully thwart hundreds of cyberattacks that might otherwise be disruptive and costly. These important but mostly unseen victories are achieved with a global network of sensors and an associated set of disruption tools. Using these capabilities, we make it more difficult and expensive for cyberattacks to be carried out against our network, our infrastructure, and our customers. But we also help make the internet as a whole a safer place by working with other responsible providers to take action against threat actors operating within their infrastructure. Turning our global-scale threat intelligence into swift action is just one of the many steps that we take as part of our commitment to security as our top priority. Although this is a never-ending endeavor and our capabilities are constantly improving, we’ve reached a point where we believe customers and other stakeholders can benefit from learning more about what we’re doing today, and where we want to go in the future.

Global-scale threat intelligence using the AWS Cloud

With the largest public network footprint of any cloud provider, our scale at AWS gives us unparalleled insight into certain activities on the internet, in real time. Some years ago, leveraging that scale, AWS Principal Security Engineer Nima Sharifi Mehr started looking for novel approaches for gathering intelligence to counter threats. Our teams began building an internal suite of tools, given the moniker MadPot, and before long, Amazon security researchers were successfully finding, studying, and stopping thousands of digital threats that might have affected its customers.

MadPot was built to accomplish two things: first, discover and monitor threat activities and second, disrupt harmful activities whenever possible to protect AWS customers and others. MadPot has grown to become a sophisticated system of monitoring sensors and automated response capabilities. The sensors observe more than 100 million potential threat interactions and probes every day around the world, with approximately 500,000 of those observed activities advancing to the point where they can be classified as malicious. That enormous amount of threat intelligence data is ingested, correlated, and analyzed to deliver actionable insights about potentially harmful activity happening across the internet. The response capabilities automatically protect the AWS network from identified threats, and generate outbound communications to other companies whose infrastructure is being used for malicious activities.

Systems of this sort are known as honeypots—decoys set up to capture threat actor behavior—and have long served as valuable observation and threat intelligence tools. However, the approach we take through MadPot produces unique insights resulting from our scale at AWS and the automation behind the system. To attract threat actors whose behaviors we can then observe and act on, we designed the system so that it looks like it’s composed of a huge number of plausible innocent targets. Mimicking real systems in a controlled and safe environment provides observations and insights that we can often immediately use to help stop harmful activity and help protect customers.

Of course, threat actors know that systems like this are in place, so they frequently change their techniques—and so do we. We invest heavily in making sure that MadPot constantly changes and evolves its behavior, continuing to have visibility into activities that reveal the tactics, techniques, and procedures (TTPs) of threat actors. We put this intelligence to use quickly in AWS tools, such as AWS Shield and AWS WAF, so that many threats are mitigated early by initiating automated responses. When appropriate, we also provide the threat data to customers through Amazon GuardDuty so that their own tooling and automation can respond.

Three minutes to exploit attempt, no time to waste

Within approximately 90 seconds of launching a new sensor within our MadPot simulated workload, we can observe that the workload has been discovered by probes scanning the internet. From there, it takes only three minutes on average before attempts are made to penetrate and exploit it. This is an astonishingly short amount of time, considering that these workloads aren’t advertised or part of other visible systems that would be obvious to threat actors. This clearly demonstrates the voracity of scanning taking place and the high degree of automation that threat actors employ to find their next target.

As these attempts run their course, the MadPot system analyzes the telemetry, code, attempted network connections, and other key data points of the threat actor’s behavior. This information becomes even more valuable as we aggregate threat actor activities to generate a more complete picture of available intelligence.

Disrupting attacks to maintain business as usual

In-depth threat intelligence analysis also happens in MadPot. The system launches the malware it captures in a sandboxed environment, connects information from disparate techniques into threat patterns, and more. When the gathered signals provide high enough confidence in a finding, the system acts to disrupt threats whenever possible, such as disconnecting a threat actor’s resources from the AWS network. Or, it could entail preparing that information to be shared with the wider community, such as a computer emergency response team (CERT), internet service provider (ISP), a domain registrar, or government agency so that they can help disrupt the identified threat.

As a major internet presence, AWS takes on the responsibility to help and collaborate with the security community when possible. Information sharing within the security community is a long-standing tradition and an area where we’ve been an active participant for years.

In the first quarter of 2023:

  • We used 5.5B signals from our internet threat sensors and 1.5B signals from our active network probes in our anti-botnet security efforts.
  • We stopped over 1.3M outbound botnet-driven DDoS attacks.
  • We shared our security intelligence findings, including nearly a thousand botnet C2 hosts, with relevant hosting providers and domain registrars.
  • We traced back and worked with external parties to dismantle the sources of 230k L7/HTTP(S) DDoS attacks.

Three examples of MadPot’s effectiveness: Botnets, Sandworm, and Volt Typhoon

Recently, MadPot detected, collected, and analyzed suspicious signals that uncovered a distributed denial of service (DDoS) botnet that was using the domain free.bigbots.[tld] (the top-level domain is omitted) as a command and control (C2) domain. A botnet is made up of compromised systems that belong to innocent parties—such as computers, home routers, and Internet of Things (IoT) devices—that have been previously compromised, with malware installed that awaits commands to flood a target with network packets. Bots under this C2 domain were launching 15–20 DDoS attacks per hour at a rate of about 800 million packets per second.

As MadPot mapped out this threat, our intelligence revealed a list of IP addresses used by the C2 servers corresponding to an extremely high number of requests from the bots. Our systems blocked those IP addresses from access to AWS networks so that a compromised customer compute node on AWS couldn’t participate in the attacks. AWS automation then used the intelligence gathered to contact the company that was hosting the C2 systems and the registrar responsible for the DNS name. The company whose infrastructure was hosting the C2s took them offline in less than 48 hours, and the domain registrar decommissioned the DNS name in less than 72 hours. Without the ability to control DNS records, the threat actor could not easily resuscitate the network by moving the C2s to a different network location. In less than three days, this widely distributed malware and the C2 infrastructure required to operate it was rendered inoperable, and the DDoS attacks impacting systems throughout the internet ground to a halt.

MadPot is effective in detecting and understanding the threat actors that target many different kinds of infrastructure, not just cloud infrastructure, including the malware, ports, and techniques that they may be using. Thus, through MadPot we identified the threat group called Sandworm—the cluster associated with Cyclops Blink, a piece of malware used to manage a botnet of compromised routers. Sandworm was attempting to exploit a vulnerability affecting WatchGuard network security appliances. With close investigation of the payload, we identified not only IP addresses but also other unique attributes associated with the Sandworm threat that were involved in an attempted compromise of an AWS customer. MadPot’s unique ability to mimic a variety of services and engage in high levels of interaction helped us capture additional details about Sandworm campaigns, such as services that the actor was targeting and post-exploitation commands initiated by that actor. Using this intelligence, we notified the customer, who promptly acted to mitigate the vulnerability. Without this swift action, the actor might have been able to gain a foothold in the customer’s network and gain access to other organizations that the customer served.

For our final example, the MadPot system was used to help government cyber and law enforcement authorities identify and ultimately disrupt Volt Typhoon, the widely-reported state-sponsored threat actor that focused on stealthy and targeted cyber espionage campaigns against critical infrastructure organizations. Through our investigation inside MadPot, we identified a payload submitted by the threat actor that contained a unique signature, which allowed identification and attribution of activities by Volt Typhoon that would otherwise appear to be unrelated. By using the data lake that stores a complete history of MadPot interactions, we were able to search years of data very quickly and ultimately identify other examples of this unique signature, which was being sent in payloads to MadPot as far back as August 2021. The previous request was seemingly benign in nature, so we believed that it was associated with a reconnaissance tool. We were then able to identify other IP addresses that the threat actor was using in recent months. We shared our findings with government authorities, and those hard-to-make connections helped inform the research and conclusions of the Cybersecurity and Infrastructure Security Agency (CISA) of the U.S. government. Our work and the work of other cooperating parties resulted in their May 2023 Cybersecurity advisory. To this day, we continue to observe the actor probing U.S. network infrastructure, and we continue to share details with appropriate government cyber and law enforcement organizations.

Putting global-scale threat intelligence to work for AWS customers and beyond

At AWS, security is our top priority, and we work hard to help prevent security issues from causing disruption to your business. As we work to defend our infrastructure and your data, we use our global-scale insights to gather a high volume of security intelligence—at scale and in real time—to help protect you automatically. Whenever possible, AWS Security and its systems disrupt threats where that action will be most impactful; often, this work happens largely behind the scenes. As demonstrated in the botnet case described earlier, we neutralize threats by using our global-scale threat intelligence and by collaborating with entities that are directly impacted by malicious activities. We incorporate findings from MadPot into AWS security tools, including preventative services, such as AWS WAF, AWS Shield, AWS Network Firewall, and Amazon Route 53 Resolver DNS Firewall, and detective and reactive services, such as Amazon GuardDuty, AWS Security Hub, and Amazon Inspector, putting security intelligence when appropriate directly into the hands of our customers, so that they can build their own response procedures and automations.

But our work extends security protections and improvements far beyond the bounds of AWS itself. We work closely with the security community and collaborating businesses around the world to isolate and take down threat actors. In the first half of this year, we shared intelligence of nearly 2,000 botnet C2 hosts with relevant hosting providers and domain registrars to take down the botnets’ control infrastructure. We also traced back and worked with external parties to dismantle the sources of approximately 230,000 Layer 7 DDoS attacks. The effectiveness of our mitigation strategies relies heavily on our ability to quickly capture, analyze, and act on threat intelligence. By taking these steps, AWS is going beyond just typical DDoS defense, and moving our protection beyond our borders.

We’re glad to be able to share information about MadPot and some of the capabilities that we’re operating today. For more information, see this presentation from our most recent re:Inforce conference: How AWS threat intelligence becomes managed firewall rules, as well as an overview post published today, Meet MadPot, a threat intelligence tool Amazon uses to protect customers from cybercrime, which includes some good information about the AWS security engineer behind the original creation of MadPot. Going forward, you can expect to hear more from us as we develop and enhance our threat intelligence and response systems, making both AWS and the internet as a whole a safer place.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry, and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization, and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Optimize checkpointing in your Amazon Managed Service for Apache Flink applications with buffer debloating and unaligned checkpoints – Part 2

Post Syndicated from Lorenzo Nicora original https://aws.amazon.com/blogs/big-data/optimize-checkpointing-in-your-amazon-managed-service-for-apache-flink-applications-with-buffer-debloating-and-unaligned-checkpoints-part-2/

This post is a continuation of a two-part series. In the first part, we delved into Apache Flink‘s internal mechanisms for checkpointing, in-flight data buffering, and handling backpressure. We covered these concepts in order to understand how buffer debloating and unaligned checkpoints allow us to enhance performance for specific conditions in Apache Flink applications.

In Part 1, we introduced and examined how to use buffer debloating to improve in-flight data processing. In this post, we focus on unaligned checkpoints. This feature has been available since Apache Flink 1.11 and has received many improvements since then. Unaligned checkpoints help, under specific conditions, to reduce checkpointing time for applications suffering temporary backpressure, and can be now enabled in Amazon Managed Service for Apache Flink applications running Apache Flink 1.15.2 through a support ticket.

Even though this feature might improve performance for your checkpoints, if your application is constantly failing because of checkpoints timing out, or is suffering from having constant backpressure, you may require a deeper analysis and redesign of your application.

Aligned checkpoints

As discussed in Part 1, Apache Flink checkpointing allows applications to record state in case of failure. We’ve already discussed how checkpoints, when triggered by the job manager, signal all source operators to snapshot their state, which is then broadcasted as a special record called a checkpoint barrier. This process achieves exactly-once consistency for state in a distributed streaming application through the alignment of these barriers.

Let’s walk through the process of aligned checkpoints in a standard Apache Flink application. Remember that Apache Flink distributes the workload horizontally: each operator (a node in the logical flow of your application, including sources and sinks) is split into multiple sub-tasks based on its parallelism.

Barrier alignment

The alignment of checkpoint barriers is crucial for achieving exactly-once consistency in Apache Flink applications during checkpoint runs. To recap, when a job manager triggers a checkpoint, all sub-tasks of source operators receive a signal to initiate the checkpoint process. Each sub-task independently snapshots its state to the state backend and broadcasts a special record known as a checkpoint barrier to all outgoing streams.

When an application operates with a parallelism higher than 1, multiple instances of each task—referred to as sub-tasks—enable parallel message consumption and processing. A sub-task can receive distinct partitions of the same stream from different upstream sub-tasks, such as after a stream repartitioning with keyBy or rebalance operations. To maintain exactly-once consistency, all sub-tasks must wait for the arrival of all checkpoint barriers before taking a snapshot of the state. The following diagram illustrates the checkpoint barriers flow.

Checkpoint Barriers flow in the Buffer Queues

This phase is called checkpoint alignment. During alignment, the sub-task stops processing records from the partitions from which it has already received barriers, as shown in the following figure.

The first Barrier reaches the sub-task: Checkpointing Alignment begins

However, it continues to process partitions that are behind the barrier.

Processing continues only for partitions behind the barrier

When barriers from all upstream partitions have arrived, the sub-task takes a snapshot of its state.

Barrier alignment complete: snapshot state

Then it broadcasts the barrier downstream.

Emit Barriers downstream, and continue processing

The time a sub-task spends waiting for all barriers to arrive is measured by the checkpoint Alignment Duration metric, which can be observed in the Apache Flink UI.

If the application experiences backpressure, an increase in this metric could lead to longer checkpoint durations and even checkpoint failures due to timeouts. This is where unaligned checkpoints become a viable option to potentially enhance checkpointing performance.

Unaligned checkpoints

Unaligned checkpoints address situations where backpressure is not just a temporary spike, but results in timeouts for aligned checkpoints, due to barrier queuing within the stream. As discussed in Part 1, checkpoint barriers can’t overtake regular records. Therefore, significant backpressure can slow down the movement of barriers across the application, potentially causing checkpoint timeouts.

The objective of unaligned checkpoints is to enable barrier overtaking, allowing barriers to move swiftly from source to sink even when the data flow is slower than anticipated.

Building on what we saw in Part 1 concerning checkpoints and what aligned checkpoints are, let’s explore how unaligned checkpoints modify the checkpointing mechanism.

Upon emission, each source’s checkpoint barrier is injected into the stream flowing across sub-tasks. It travels from the source output network buffer queue into the input network buffer queue of the subsequent operator.

Upon the arrival of the first barrier in the input network buffer queue, the operator initially waits for barrier alignment. If the specified alignment timeout expires because not all barriers have reached the end of the input network buffer queue, the operator switches to unaligned checkpoint mode.

The alignment timeout can be set programmatically by env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30)), but modifying the default is not recommended in Apache Flink 1.15.

Checkpoint barriers flow in the buffer queues

The operator waits until all checkpoint barriers are present in the input network buffer queue before triggering the checkpoint. Unlike aligned checkpoints, the operator doesn’t need to wait for all barriers to reach the queue’s end, allowing the operator to have in-flight data from the buffer that hasn’t been processed before checkpoint initiation.

All barriers are in the input queues

After all barriers have arrived in the input network buffer queue, the operator advances the barrier to the end of the output network buffer queue. This enhances checkpointing speed because the barrier can smoothly traverse the application from source to sink, independent of the application’s end-to-end latency.

Barriers can overtake in-flight messages

After forwarding the barrier to the output network buffer queue, the operator initiates the snapshot of in-flight data between the barriers in the input and output network buffer queues, along with the snapshot of the state.

Although processing is momentarily paused during this process, the actual writing to the remote persistent state storage occurs asynchronously, preventing potential bottlenecks.

Snapshot state and in-flight messages

The local snapshot, encompassing in-flight messages and state, is saved asynchronously in the remote persistent state store, while the barrier continues its journey through the application.

Processing continues

When to use unaligned checkpoints

Remember, barrier alignment only occurs between partitions coming from different sub-tasks of the same operator. Therefore, if an operator is experiencing temporary backpressure, enabling unaligned checkpoints may be beneficial. This way, the application doesn’t have to wait for all barriers to reach the operator before performing the snapshot of state or moving the barrier forward.

Temporary backpressure could arise from the following:

  • A surge in data ingestion
  • Backfilling or catching up with historical data
  • Increased message processing time due to delayed external systems

Another scenario where unaligned checkpoints prove advantageous is when working with exactly-once sinks. Utilizing the two-phase commit sink function for exactly-once sinks, unaligned checkpoints can expedite checkpoint runs, thereby reducing end-to-end latency.

When not to use unaligned checkpoints

Unaligned checkpoints won’t reduce the time required for savepoints (called snapshots in the Amazon Managed Service for Apache Flink implementation) because savepoints exclusively utilize aligned checkpoints. Furthermore, because Apache Flink doesn’t permit concurrent unaligned checkpoints, savepoints won’t occur simultaneously with unaligned checkpoints, potentially elongating savepoint durations.

Unaligned checkpoints won’t fix any underlying issue in your application design. If your application is suffering from persistent backpressure or constant checkpointing timeouts, this might indicate data skewness or underprovisioning, which may require improving and tuning the application.

Using unaligned checkpoints with buffer debloating

One alternative for reducing the risks associated with an increased state size is to combine unaligned checkpoints with buffer debloating. This approach results in having less in-flight data to snapshot and store in the state, along with less data to be used for recovery in case of failure. This synergy facilitates enhanced performance and efficient checkpoint runs, leading to smaller checkpointing sizes and faster recovery times. When testing the use of unaligned checkpoints, we recommend doing so with buffer debloating to prevent the state size from increasing.

Limitations

Unaligned checkpoints are subject to the following limitations:

  • They provide no benefit for operators with a parallelism of 1.
  • They only improve performance for operators where barrier alignment would have occurred. This alignment happens only if records are coming from different sub-tasks of the same operator, for example, through repartitioning or keyBy operations.
  • Operators receiving input from multiple sources or participating in joins might not experience improvements, because the operator would be receiving data from different operators in those cases.
  • Although checkpoint barriers can surpass records in the network’s buffer queue, this won’t occur if the sub-task is currently processing a message. If processing a message takes too much time (for example, a flat-map operation emitting numerous records for each input record), barrier handling will be delayed.
  • As we have seen, savepoints always use aligned checkpoints. If the savepoints of your applications are slow due to barrier alignment, unaligned checkpoints will not help.
  • Additional limitations affect watermarks, message ordering, and broadcast state in recovery. For more details, refer to Limitations.

Considerations

Considerations for implementing unaligned checkpoints:

  • Unaligned checkpoints introduce additional I/O to checkpoint storage
  • Checkpoints encompass not only operator state but also in-flight data within network buffer queues, leading to increased state size

Recommendations

We offer the following recommendations:

  • Consider enabling unaligned checkpoints only if both of the following conditions are true:
  • Checkpoints are timing out.
  • The average checkpoint Async Duration of any operator is more than 50% of the total checkpoint duration for the operator (sum of Sync Duration + Async Duration).
  • Consider enabling buffer debloating first, and evaluate whether it solves the problem of checkpoints timing out.
  • If buffer debloating doesn’t help, consider enabling unaligned checkpoints along with buffer debloating. Buffer debloating mitigates the drawbacks of unaligned checkpoints, reducing the amount of in-flight data.
  • If unaligned checkpoints and buffer debloating together don’t improve checkpoint alignment duration, consider testing unaligned checkpoints alone.

Decision flow

Finally, but most importantly, always test unaligned checkpoints in a non-production environment first, running some comparative performance testing with a realistic workload, and verify that unaligned checkpoints actually reduce checkpoint duration.

Conclusion

This two-part series explored advanced strategies for optimizing checkpointing within your Amazon Managed Service for Apache Flink applications. By harnessing the potential of buffer debloating and unaligned checkpoints, you can unlock significant performance improvements and streamline checkpoint processes. However, it’s important to understand when these techniques will provide improvements and when they will not. If you believe your application may benefit from checkpoint performance improvement, you can enable these features in your Amazon Managed Service For Apache Flink version 1.15 applications. We recommend first enabling buffer debloating and testing the application. If you are still not seeing the expected outcome, enable buffer debloating with unaligned checkpoints. This way, you can immediately reduce the state size and the additional I/O to state backends. Lastly, you may try using unaligned checkpoints by itself, bearing in mind the considerations we’ve mentioned.

With a deeper understanding of these techniques and their applicability, you are better equipped to maximize the efficiency of checkpoints and mitigate the effect of backpressure in your Apache Flink application.


About the Authors

Lorenzo NicoraLorenzo Nicora works as Senior Streaming Solution Architect helping customers across EMEA. He has been building cloud-native, data-intensive systems for over 25 years, working in the finance industry both through consultancies and for FinTech product companies. He has leveraged open-source technologies extensively and contributed to several projects, including Apache Flink.

Francisco MorilloFrancisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers helping them design real-time analytics architectures using AWS services, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and AWS’s managed offering for Apache Flink.

Optimize checkpointing in your Amazon Managed Service for Apache Flink applications with buffer debloating and unaligned checkpoints – Part 1

Post Syndicated from Lorenzo Nicora original https://aws.amazon.com/blogs/big-data/part-1-optimize-checkpointing-in-your-amazon-managed-service-for-apache-flink-applications-with-buffer-debloating-and-unaligned-checkpoints/

This post is the first of a two-part series regarding checkpointing mechanisms and in-flight data buffering. In this first part, we explain some of the fundamental Apache Flink internals and cover the buffer debloating feature. In the second part, we focus on unaligned checkpoints.

Apache Flink is an open-source distributed engine for stateful processing over unbounded datasets (streams) and bounded datasets (batches). Amazon Managed Service for Apache Flink, formerly known as Amazon Kinesis Data Analytics, is the AWS service offering fully managed Apache Flink.

Apache Flink is designed for stateful processing at scale, for high throughput and low latency. It scales horizontally, distributing processing and state across multiple nodes, and is designed to withstand failures without compromising the exactly-once consistency it provides.

Internally, Apache Flink uses clever mechanisms to maintain exactly-once state consistency, while also optimizing for throughput and reduced latency. The default behavior works well for most use cases. Recent versions introduced two functionalities that can be optionally enabled to improve application performance under particular conditions: buffer debloating and unaligned checkpoints.

Buffer debloating and unaligned checkpoints can be enabled on Amazon Managed Service for Apache Flink version 1.15.

To understand how these functionalities can help and when to use them, we need to dive deep into some of the fundamental internal mechanisms of Apache Flink: checkpointing, in-flight data buffering, and backpressure.

Maintaining state consistency through failures with checkpointing

Apache Flink checkpointing periodically saves the internal application state for recovering in case of failure. Each of the distributed components of an application asynchronously snapshots its state to an external persistent datastore. The challenge is taking snapshots guaranteeing exactly-once consistency. A naïve “stop-the-world, take a snapshot” implementation would never meet the high throughput and low latency goals Apache Flink has been designed for.

Let’s walk through the process of checkpointing in a simple streaming application.

As shown in the following figure, Apache Flink distributes the work horizontally. Each operator (a node in the logical flow of your application, including sources and sinks) is split into multiple sub-tasks, based on its parallelism. The application is coordinated by a job manager. Checkpoints are periodically initiated by the job manager, sending a signal to all source operators’ sub-tasks.

Checkpoint initiated by the Job Manager

On receiving the signal, each source sub-task independently snapshots its state (for example, the offsets of the Kafka topic it is consuming) to a persistent storage, and then broadcasts a special record called checkpoint barrier (“CB” in the following diagrams) to all outgoing streams. Checkpoint barriers work similarly to watermarks in Apache Flink, flowing in-bands, along with normal records. A barrier does not overtake normal records and is not overtaken.

Source operators emit checkpoint bariers

When a downstream operator’s sub-task receives all checkpoint barriers from all input channels, it starts snapshotting its state.

A sub-task does not pause processing while saving its state to the remote, persistent state backend. This is a two-phase operation. First, the sub-task takes a snapshot of the state, on the local file system or in memory, depending on application configuration. This operation is blocking but very fast. When the snapshot is complete, it restarts processing records, while the state is asynchronously saved to the external, persistent state store. When the state is successfully saved to the state store, the sub-task acknowledges to the job manager that its checkpointing is complete.

The time a sub-task spends on the synchronous and asynchronous parts of the checkpoint is measured by Sync Duration and Async Duration metrics, shown by the Apache Flink UI. It is then asynchronously sent to the backend. After the fast snapshot, the sub-task restarts processing messages. The backend notifies the sub-task when the state has been successfully saved. The sub-task, in turn, sends an acknowledgment to the job manager that checkpointing is complete.

Sub-tasks acknowledge checkpoint completion

Checkpoint barriers propagate through all operators, down to the sinks. When all sink sub-tasks have acknowledged the checkpoint to the job manager, the checkpoint is declared complete and can be used to recover the application, for example in case of failure.

Sink operators acknowledge checkpoint is complete

Checkpoint barrier alignment

A sub-task may receive different partitions of the same stream from different upstream sub-tasks, for example when a stream is repartitioned with a keyBy or a rebalance. Each upstream sub-task will emit a checkpoint barrier independently. To maintain exactly-once consistency, the sub-task must wait for the barriers to arrive on all input partitions before taking a snapshot of its state.

This phase is called checkpoint alignment. During the alignment, the sub-task stops processing records from the partitions it already received the barrier from and continues processing the partitions that are behind the barrier.

After the barriers from all upstream partitions have arrived, the sub-task takes the snapshot of its state and then broadcasts the barrier downstream.

The time spent by a sub-task while aligning barriers is measured by the Checkpoint Alignment Duration metric, shown by the Apache Flink UI.

Checkpoint barrier alignment

In-flight data buffering

To optimize for throughput, Apache Flink tries to keep each sub-task always busy. This is achieved by transmitting records over the network in blocks and by buffering in-flight data. Note that this is data transmission optimization; Flink operators always process records one at the time.

Data is handed over between sub-tasks in units called network buffers. A network buffer has a fixed size, in bytes.

Sub-tasks also buffer in-flight input and output data. These buffers are called network buffer queues. Each queue is composed of multiple network buffers. Each sub-task has an input network buffer queue for each upstream sub-task and an output network buffer queue for each downstream sub-task.

Each record emitted by the sub-task is serialized, put into network buffers, and published to the output network buffer queue. To use all the available space, multiple messages can be packed into a single network buffer or split across subsequent network buffers.

A separate thread sends full network buffers over the network, where they are stored in the destination sub-task’s input network buffer queue.

When the destination sub-task thread is free, it deserializes the network buffers, rebuilds the records, and processes them one at a time.

Network Buffer Queue

Backpressure

If a sub-task can’t keep up with processing records at the same pace they are received, the input queue fills up. When the input queue is full, the upstream sub-task stops sending data.

Data accumulates in the sender’s output queue. When this is also full, the sender sub-task stops processing records, accumulating received data in its own input queue, and the effects propagates upstream.

This is the backpressure that Apache Flink uses to control the internal flow, preventing slow operators from being overwhelmed by slowing down the upstream flow. Backpressure is a safety mechanism to maximize the application throughput. It can be temporary, in case of an unexpected peak of ingested data, for example. If not temporary, it is usually the symptom—not the cause—that the application is not designed correctly or it has insufficient resources to process the workload.

Full Network Buffer Queue generates backpressure

In-flight buffering and checkpoint barriers

As checkpoint barriers flow with normal records, they also flow in the network buffers, through the input and output queues. In normal conditions, barriers don’t overtake records, and they are never overtaken. If records are queueing up due to backpressure, checkpoint barriers are also stuck in the queue, taking longer time to propagate from the sources to the sinks, delaying the completion of the checkpoint.

In the second part of this series, we will see how unaligned checkpoints can let barriers overtake records under specific conditions. For now, let’s see how we can optimize the size of input and output queues with buffer debloating.

Buffer debloating to optimize in-flight data

The default network buffer queue size is a good compromise for most applications. You can modify this size, but it applies to all sub-tasks, and it may be difficult to optimize this one-size-fits-all across different operators.

Longer queues support bigger throughout, but they may slow down checkpoint barriers that have to go through longer queues, causing longer End to End Checkpoint Duration. Ideally, the amount of in-flight data should be adjusted based on the actual throughput.

In version 1.14, Apache Flink introduced buffer debloating, which can be enabled to adjust in-flight data of each sub-task, based on the current throughput the sub-task is processing, and periodically reassess and readjust it.

How buffer debloating helps your application

Consider a streaming application, ingesting records from a streaming source and publishing the results to a streaming destination after some transformations. Under normal conditions, the application is sized to process the incoming throughput smoothly. Our destination has limited capacity, for example a Kafka topic throttled via quotas, sufficient to handle the normal throughput, with some margin.

In-flight data buffering under normal throughput

Imagine that the ingestion throughput has occasional peaks. These peaks exceed the limits of the streaming destination (throughput quota of the Kafka topic), which starts throttling.

Full in-flight data buffer to the sink backpressure the preceding operator

Because the sink can’t process the full throughput, in-flight data accumulates upstream of the sink, causing backpressure on the upstream operator. The effect eventually propagates up to the source, and the source starts lagging behind the most recent record in the source stream.

Backpressure propagates upstream, up to the source operator

As long this is a temporary condition, backpressure and lagging are not a problem per se, as long as the application is able to catch up when the peak has finished.

Unfortunately, accumulating in-flight data also slows down the propagation of the checkpoint barriers. Checkpoint End to End Duration goes up, and checkpoints may eventually time out.

Full in-flight data buffers slow down checkpoint barrier propagation, under backpressure

The situation is even worse if the sink uses two-phase commit for exactly-once guarantees. For example, KafkaSink uses Kafka transactions committed on checkpoints. If checkpoints become too slow, transactions are committed later, significantly increasing the latency of any downstream consumer using a read-committed isolation level.

Slow checkpoints under backpressure may also cause a vicious cycle. A slowed-down application eventually crashes, and recovers from the last checkpoint that is quite old. This causes a long reprocessing that, in turn, induces more backpressure and even slower checkpoints.

In this case, buffer debloating can help by adjusting the amount of in-flight data based on the throughput each sub-task is actually processing. When a sub-task is throttled by backpressure, the amount of in-flight data is reduced, also reducing the time checkpoint barriers take to go through all operators. Checkpoint End to End Duration goes down, and checkpoints do not time out.

Buffer debloating internals

Buffer debloating estimates the throughput a sub-task is capable of processing, assuming no idling, and limits the upstream in-flight data buffers to contain just enough data to be processed in 1 second (by default).

For efficiency, network buffers in the queues are fixed. Buffer debloating caps the usable size of each network buffer, making it smaller when the sub-task is processing slowly.

Buffer debloating speed up barrier propagation, reducing the volume of in-flight data

The benefits of less in-flight data depends on whether Apache Flink is using standard checkpoint alignment, the default behavior described so far, or unaligned checkpoints. We will examine unaligned checkpoints in the second part of this series, but let’s see the effect of buffer debloating, briefly.

  • With aligned checkpoints (default behavior) – Less in-flight data makes checkpoint barrier propagation faster, ultimately reducing the end-to-end checkpoint duration but also making it more predictable
  • With unaligned checkpoints (optional) – Less in-flight data reduces the amount of in-flight records stored with the checkpoint, ultimately reducing the checkpoint size

What buffer debloating does not do

Note that the problem we are trying to solve is slow checkpointing (or excessive checkpointing size, with unaligned checkpoints). Buffer debloating helps making checkpointing faster.

Buffer debloating does not remove backpressure. Backpressure is the internal protective mechanism that Apache Flink uses when some part of the application is not able to cope with the incoming throughput. To reduce backpressure, you have to work on other aspects of the application. When backpressure is only temporary, for example under peak conditions, the only way of removing it would be sizing the end-to-end system for the peak, rather than normal workload. But this could be impossible or too expensive.

Buffer debloating helps reduce and keep checkpoint duration stable under exceptional and temporary conditions. If an application experiences backpressure under its normal workload, or checkpoints are too slow under normal conditions, you should investigate the implementation of your application to understand the root cause.

When the automatic throughput prediction fails

Buffer debloating doesn’t have any particular drawback, but in corner cases, the mechanism may incorrectly estimate the throughput, and the resulting amount of in-flight data may not be optimal.

Estimating the throughput is complex when an operator receives data from multiple upstream operators, connected streams or unions, with very different throughput. It may also take time to adjust to a sudden spike, causing a temporary suboptimal buffering.

  • Too small in-flight data may reduce the throughput the sub-task can process (it will be idling), causing more backpressure upstream
  • Too large buffers may slow down checkpointing and increase the checkpoint size (with unaligned checkpoints)

Conclusion

The checkpointing mechanism makes Apache Flink fault tolerant, providing exactly-once state consistency. In-flight data buffering and backpressure control the data flow within the distributed streaming application maximize the throughput. Apache Flink default behaviors and configurations are good for most workloads.

The effectiveness of buffer debloating depends on the characteristics of the workload and the application. The general recommendation is to test the functionality in a non-production environment with a realistic workload to verify it actually helps with your use case.

You can request to enable buffer debloating on your Amazon Managed Service for Apache Flink application.

Under particular conditions, the combined effect of backpressure and in-flight data buffering may slow down checkpointing, increase checkpointing size (with unaligned checkpoints), and even cause checkpoints to fail. In these cases, enabling unaligned checkpointing may help reduce checkpoint duration or size.

In the second part of this series, we will understand better unaligned checkpoints and how they can help your application checkpointing efficiently in presence of backpressure, especially in combination with buffer debloating.


About the Authors

Lorenzo NicoraLorenzo Nicora works as Senior Streaming Solution Architect at AWS, helping customers across EMEA. He has been building cloud-native, data-intensive systems for over 25 years, working in the finance industry both through consultancies and for FinTech product companies. He has leveraged open-source technologies extensively and contributed to several projects, including Apache Flink.

Francisco MorilloFrancisco Morillo is a Streaming Solutions Architect at AWS. Francisco works with AWS customers helping them design real-time analytics architectures using AWS services, supporting Amazon Managed Streaming for Apache Kafka (Amazon MSK) and AWS’s managed offering for Apache Flink.

Let’s Architect! Leveraging in-memory databases

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-leveraging-in-memory-databases/

In-memory databases play a critical role in modern computing, particularly in reducing the strain on existing resources, scaling workloads efficiently, and minimizing the cost of infrastructure. The advanced performance capabilities of in-memory databases make them vital for demanding applications characterized by voluminous data, real-time analytics, and rapid response requirements.

In this edition of Let’s Architect!, we are introducing caching strategies and, further, examining case studies that use Amazon Web Services (AWS), like Amazon ElastiCache or Amazon MemoryDB for Redis, in real workloads where customers share the reasoning behind their approaches. It is very important understanding the context for leveraging a specific solution or pattern, and many common questions can be answered with these resources.

Caching challenges and strategies

Many services built at Amazon rely on caching systems in the background to speed up performance, deal with low latency requirements, and avoid overloading on source databases and other microservices. Operating caches and adding caches into our systems may present complex challenges in terms of monitoring, data consistency, and load on the other components of the system. Indeed, a cache can give big benefits, but it’s also a new component to run and keep healthy. Furthermore, engineers may need to use empirical methods to choose the cache size, expiration policy, and eviction policy: we always have to perform tests and use the metrics to tune the setup.

With this Amazon Builder’s Library resource, you can learn strategies for using caching in your architecture and best practices directly from Amazon’s engineers.

Take me to this Amazon Builder’s Library article!

Strategies applied in Amazon applications at scale, explained and contextualized by Amazon engineers

Strategies applied in Amazon applications at scale, explained and contextualized by Amazon engineers

How Yahoo cost optimizes their in-memory workloads with AWS

Discover how Yahoo effectively leverages the power of Amazon ElastiCache and data tiering to process an astounding 1.3 million advertising data events per second, all while generating savings of up to 50% on their overall bill.

Data tiering is an ingenious method to scale up to hundreds of terabytes of capacity by intelligently managing data. It achieves this by automatically shifting the least-recently accessed data between RAM and high-performance SSDs.

In this video, you will gain insights into how data tiering operates and how you can unlock ultra-fast speeds and seamless scalability for your workloads in a cost-efficient manner. Furthermore, you can also learn how it’s implemented under the hood.

Take me to this re:Invent 2022 video!

A snapshot of how Yahoo architecture leverages Amazon ElastiCache

A snapshot of how Yahoo architecture leverages Amazon ElastiCache

Use MemoryDB to build real-time applications for performance and durability

MemoryDB is a robust, durable database marked by microsecond reads, low single-digit millisecond writes, scalability, and fortified enterprise security. It guarantees an impressive 99.99% availability, coupled with instantaneous recovery without any data loss.

In this session, we explore multiple use cases across sectors, such as Financial Services, Retail, and Media & Entertainment, like payment processing, message brokering, and durable session store applications. Moreover, through a practical demonstration, you can learn how to utilize MemoryDB to establish a microservices message broker for a Media & Entertainment application.

Take me to this AWS Online Tech Talks video!

A sample use case for retail application

A sample use case for retail application

Samsung SmartThings powers home automation with Amazon MemoryDB

MemoryDB offers the kind of ultra-fast performance that only an in-memory database can deliver, curtailing latency to microseconds and processing 160+ million requests per second —without data loss. In this re:Invent 2022 session, you will understand why Samsung SmartThings selected MemoryDB as the engine to power the next generation of their IoT device connectivity platform, one that processes millions of events every day.

You can also discover the intricate design of MemoryDB and how it ensures data durability without compromising the performance of in-memory operations, thanks to the utilization of a multi-AZ transactional log. This session is an enlightening deep-dive into durable, in-memory data operations.

Take me to this re:Invent 2022 video!

The architecture leveraged by Samsung SmartThings using Amazon MemoryDB for Redis

The architecture leveraged by Samsung SmartThings using Amazon MemoryDB for Redis

Amazon ElastiCache: In-memory datastore fundamentals, use cases and examples

In this edition of AWS Online Tech Talks, explore Amazon ElastiCache, a managed service that facilitates the seamless setup, operation, and scaling of widely used, open-source–compatible, in-memory datastores in the cloud environment. This service positions you to develop data-intensive applications or enhance the performance of your existing databases through high-throughput, low-latency, in-memory datastores. Learn how it is leveraged for caching, session stores, gaming, geospatial services, real-time analytics, and queuing functionalities.

This course can help cultivate a deeper understanding of Amazon ElastiCache, and how it can be used to accelerate your data processing while maintaining robustness and reliability.

Take me to this AWS Online Tech Talks course!

A free training course to increase your skills and leverage better in-memory databases

A free training course to increase your skills and leverage better in-memory databases

See you next time!

Thanks for joining us to discuss in-memory databases! In 2 weeks, we’ll talk about SQL databases.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

Using AWS CloudFormation and AWS Cloud Development Kit to provision multicloud resources

Post Syndicated from Aaron Sempf original https://aws.amazon.com/blogs/devops/using-aws-cloudformation-and-aws-cloud-development-kit-to-provision-multicloud-resources/

Customers often need to architect solutions to support components across multiple cloud service providers, a need which may arise if they have acquired a company running on another cloud, or for functional purposes where specific services provide a differentiated capability. In this post, we will show you how to use the AWS Cloud Development Kit (AWS CDK) to create a single pane of glass for managing your multicloud resources.

AWS CDK is an open source framework that builds on the underlying functionality provided by AWS CloudFormation. It allows developers to define cloud resources using common programming languages and an abstraction model based on reusable components called constructs. There is a misconception that CloudFormation and CDK can only be used to provision resources on AWS, but this is not the case. The CloudFormation registry, with support for third party resource types, along with custom resource providers, allow for any resource that can be configured via an API to be created and managed, regardless of where it is located.

Multicloud solution design paradigm

Multicloud solutions are often designed with services grouped and separated by cloud, creating a segregation of resource and functions within the design. This approach leads to a duplication of layers of the solution, most commonly a duplication of resources and the deployment processes for each environment. This duplication increases cost, and leads to a complexity of management increasing the potential break points within the solution or practice. 

Along with simplifying resource deployments, and the ever-increasing complexity of customer needs, so too has the need increased for the capability of IaC solutions to deploy resources across hybrid or multicloud environments. Through meeting this need, a proliferation of supported tools, frameworks, languages, and practices has created “choice overload”. At worst, this scares the non-cloud-savvy away from adopting an IaC solution benefiting their cloud journey, and at best confuses the very reason for adopting an IaC practice.

A single pane of glass

Systems Thinking is a holistic approach that focuses on the way a system’s constituent parts interrelate and how systems work as a whole especially over time and within the context of larger systems. Systems thinking is commonly accepted as the backbone of a successful systems engineering approach. Designing solutions taking a full systems view, based on the component’s function and interrelation within the system across environments, more closely aligns with the ability to handle the deployment of each cloud-specific resource, from a single control plane.

While AWS provides a list of services that can be used to help design, manage and operate hybrid and multicloud solutions, with AWS as the primary cloud you can go beyond just using services to support multicloud. CloudFormation registry resource types model and provision resources using custom logic, as a component of stacks in CloudFormation. Public extensions are not only provided by AWS, but third-party extensions are made available for general use by publishers other than AWS, meaning customers can create their own extensions and publish them for anyone to use.

The AWS CDK, which has a 1:1 mapping of all AWS CloudFormation resources, as well as a library of abstracted constructs, supports the ability to import custom AWS CloudFormation extensions, enabling customers and partners to create custom AWS CDK constructs for their extensions. The chosen programming language can be used to inherit and abstract the custom resource into reusable AWS CDK constructs, allowing developers to create solutions that contain native AWS extensions along with secondary hybrid or alternate cloud resources.

Providing the ability to integrate mixed resources in the same stack more closely aligns with the functional design and often diagrammatic depiction of the solution. In essence, we are creating a single IaC pane of glass over the entire solution, deployed through a single control plane. This lowers the complexity and the cost of maintaining separate modules and deployment pipelines across multiple cloud providers.

A common use case for a multicloud: disaster recovery

One of the most common use cases of the requirement for using components across different cloud providers is the need to maintain data sovereignty while designing disaster recovery (DR) into a solution.

Data sovereignty is the idea that data is subject to the laws of where it is physically located, and in some countries extends to regulations that if data is collected from citizens of a geographical area, then the data must reside in servers located in jurisdictions of that geographical area or in countries with a similar scope and rigor in their protection laws. 

This requires organizations to remain in compliance with their host country, and in cases such as state government agencies, a stricter scope of within state boundaries, data sovereignty regulations. Unfortunately, not all countries, and especially not all states, have multiple AWS regions to select from when designing where their primary and recovery data backups will reside. Therefore, the DR solution needs to take advantage of multiple cloud providers in the same geography, and as such a solution must be designed to backup or replicate data across providers.

The multicloud solution

A multicloud solution to the proposed use case would be the backup of data from an AWS resource such as an Amazon S3 bucket to another cloud provider within the same geography, such as an Azure Blob Storage container, using AWS event driven behaviour to trigger the copying of data from the primary AWS resource to the secondary Azure backup resource.

Following the IaC single pane of glass approach, the Azure Blob Storage container is created as a resource type in the CloudFormation Registry, and imported into the AWS CDK to be used as a construct in the solution. However, before the extension resource type can be used effectively in the CDK as a reusable construct and added to your private library, you will first need to go through the import into CDK process for creating Constructs.

There are three different levels of constructs, beginning with low-level constructs, which are called CFN Resources (or L1, short for “layer 1”). These constructs directly represent all resources available in AWS CloudFormation. They are named CfnXyz, where Xyz is name of the resource.

Layer 1 Construct

In this example, an L1 construct named CfnAzureBlobStorage represents an Azure::BlobStorage AWS CloudFormation extension. Here you also explicitly configure the ref property, in order for higher level constructs to access the Output value which will be the Azure blob container url being provisioned.

import { CfnResource } from "aws-cdk-lib";
import { Secret, ISecret } from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";

export interface CfnAzureBlobStorageProps {
  subscriptionId: string;
  clientId: string;
  tenantId: string;
  clientSecretName: string;
}

// L1 Construct
export class CfnAzureBlobStorage extends Construct {
  // Allows accessing the ref property
  public readonly ref: string;

  constructor(scope: Construct, id: string, props: CfnAzureBlobStorageProps) {
    super(scope, id);

    const secret = this.getSecret("AzureClientSecret", props.clientSecretName);
    
    const azureBlobStorage = new CfnResource(
      this,
      "ExtensionAzureBlobStorage",
      {
        type: "Azure::BlobStorage",
        properties: {
          AzureSubscriptionId: props.subscriptionId,
          AzureClientId: props.clientId,
          AzureTenantId: props.tenantId,
          AzureClientSecret: secret.secretValue.unsafeUnwrap()
        },
      }
    );

    this.ref = azureBlobStorage.ref;
  }

  private getSecret(id: string, secretName: string) : ISecret {  
    return Secret.fromSecretNameV2(this, secretName.concat("Value"), secretName);
  }
}

As with every CDK Construct, the constructor arguments are scope, id and props. scope and id are propagated to the cdk.Construct base class. The props argument is of type CfnAzureBlobStorageProps which includes four properties all of type string. This is how the Azure credentials are propagated down from upstream constructs.

Layer 2 Construct

The next level of constructs, L2, also represent AWS resources, but with a higher-level, intent-based API. They provide similar functionality, but incorporate the defaults, boilerplate, and glue logic you’d be writing yourself with a CFN Resource construct. They also provide convenience methods that make it simpler to work with the resource.

In this example, an L2 construct is created to abstract the CfnAzureBlobStorage L1 construct and provides additional properties and methods.

import { Construct } from "constructs";
import { CfnAzureBlobStorage } from "./cfn-azure-blob-storage";

// L2 Construct
export class AzureBlobStorage extends Construct {
  public readonly blobContainerUrl: string;

  constructor(
    scope: Construct,
    id: string,
    subscriptionId: string,
    clientId: string,
    tenantId: string,
    clientSecretName: string
  ) {
    super(scope, id);

    const azureBlobStorage = new CfnAzureBlobStorage(
      this,
      "CfnAzureBlobStorage",
      {
        subscriptionId: subscriptionId,
        clientId: clientId,
        tenantId: tenantId,
        clientSecretName: clientSecretName,
      }
    );

    this.blobContainerUrl = azureBlobStorage.ref;
  }
}

The custom L2 construct class is declared as AzureBlobStorage, this time without the Cfn prefix to represent an L2 construct. This time the constructor arguments include the Azure credentials and client secret, and the ref from the L1 construct us output to the public variable AzureBlobContainerUrl.

As an L2 construct, the AzureBlobStorage construct could be used in CDK Apps along with AWS Resource Constructs in the same Stack, to be provisioned through AWS CloudFormation creating the IaC single pane of glass for a multicloud solution.

Layer 3 Construct

The true value of the CDK construct programming model is in the ability to extend L2 constructs, which represent a single resource, into a composition of multiple constructs that provide a solution for a common task. These are Layer 3, L3, Constructs also known as patterns.

In this example, the L3 construct represents the solution architecture to backup objects uploaded to an Amazon S3 bucket into an Azure Blob Storage container in real-time, using AWS Lambda to process event notifications from Amazon S3.

import { RemovalPolicy, Duration, CfnOutput } from "aws-cdk-lib";
import { Bucket, BlockPublicAccess, EventType } from "aws-cdk-lib/aws-s3";
import { DockerImageFunction, DockerImageCode } from "aws-cdk-lib/aws-lambda";
import { PolicyStatement, Effect } from "aws-cdk-lib/aws-iam";
import { LambdaDestination } from "aws-cdk-lib/aws-s3-notifications";
import { IStringParameter, StringParameter } from "aws-cdk-lib/aws-ssm";
import { Secret, ISecret } from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";
import { AzureBlobStorage } from "./azure-blob-storage";

// L3 Construct
export class S3ToAzureBackupService extends Construct {
  constructor(
    scope: Construct,
    id: string,
    azureSubscriptionIdParamName: string,
    azureClientIdParamName: string,
    azureTenantIdParamName: string,
    azureClientSecretName: string
  ) {
    super(scope, id);

    // Retrieve existing SSM Parameters
    const azureSubscriptionIdParameter = this.getSSMParameter("AzureSubscriptionIdParam", azureSubscriptionIdParamName);
    const azureClientIdParameter = this.getSSMParameter("AzureClientIdParam", azureClientIdParamName);
    const azureTenantIdParameter = this.getSSMParameter("AzureTenantIdParam", azureTenantIdParamName);    
    
    // Retrieve existing Azure Client Secret
    const azureClientSecret = this.getSecret("AzureClientSecret", azureClientSecretName);

    // Create an S3 bucket
    const sourceBucket = new Bucket(this, "SourceBucketForAzureBlob", {
      removalPolicy: RemovalPolicy.RETAIN,
      blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
    });

    // Create a corresponding Azure Blob Storage account and a Blob Container
    const azurebBlobStorage = new AzureBlobStorage(
      this,
      "MyCustomAzureBlobStorage",
      azureSubscriptionIdParameter.stringValue,
      azureClientIdParameter.stringValue,
      azureTenantIdParameter.stringValue,
      azureClientSecretName
    );

    // Create a lambda function that will receive notifications from S3 bucket
    // and copy the new uploaded object to Azure Blob Storage
    const copyObjectToAzureLambda = new DockerImageFunction(
      this,
      "CopyObjectsToAzureLambda",
      {
        timeout: Duration.seconds(60),
        code: DockerImageCode.fromImageAsset("copy_s3_fn_code", {
          buildArgs: {
            "--platform": "linux/amd64"
          }
        }),
      },
    );

    // Add an IAM policy statement to allow the Lambda function to access the
    // S3 bucket
    sourceBucket.grantRead(copyObjectToAzureLambda);

    // Add an IAM policy statement to allow the Lambda function to get the contents
    // of an S3 object
    copyObjectToAzureLambda.addToRolePolicy(
      new PolicyStatement({
        effect: Effect.ALLOW,
        actions: ["s3:GetObject"],
        resources: [`arn:aws:s3:::${sourceBucket.bucketName}/*`],
      })
    );

    // Set up an S3 bucket notification to trigger the Lambda function
    // when an object is uploaded
    sourceBucket.addEventNotification(
      EventType.OBJECT_CREATED,
      new LambdaDestination(copyObjectToAzureLambda)
    );

    // Grant the Lambda function read access to existing SSM Parameters
    azureSubscriptionIdParameter.grantRead(copyObjectToAzureLambda);
    azureClientIdParameter.grantRead(copyObjectToAzureLambda);
    azureTenantIdParameter.grantRead(copyObjectToAzureLambda);

    // Put the Azure Blob Container Url into SSM Parameter Store
    this.createStringSSMParameter(
      "AzureBlobContainerUrl",
      "Azure blob container URL",
      "/s3toazurebackupservice/azureblobcontainerurl",
      azurebBlobStorage.blobContainerUrl,
      copyObjectToAzureLambda
    );      

    // Grant the Lambda function read access to the secret
    azureClientSecret.grantRead(copyObjectToAzureLambda);

    // Output S3 bucket arn
    new CfnOutput(this, "sourceBucketArn", {
      value: sourceBucket.bucketArn,
      exportName: "sourceBucketArn",
    });

    // Output the Blob Conatiner Url
    new CfnOutput(this, "azureBlobContainerUrl", {
      value: azurebBlobStorage.blobContainerUrl,
      exportName: "azureBlobContainerUrl",
    });
  }

}

The custom L3 construct can be used in larger IaC solutions by calling the class called S3ToAzureBackupService and providing the Azure credentials and client secret as properties to the constructor.

import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import { S3ToAzureBackupService } from "./s3-to-azure-backup-service";

export class MultiCloudBackupCdkStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const s3ToAzureBackupService = new S3ToAzureBackupService(
      this,
      "MyMultiCloudBackupService",
      "/s3toazurebackupservice/azuresubscriptionid",
      "/s3toazurebackupservice/azureclientid",
      "/s3toazurebackupservice/azuretenantid",
      "s3toazurebackupservice/azureclientsecret"
    );
  }
}

Solution Diagram

Diagram 1: IaC Single Control Plane, demonstrates the concept of the Azure Blob Storage extension being imported from the AWS CloudFormation Registry into AWS CDK as an L1 CfnResource, wrapped into an L2 Construct and used in an L3 pattern alongside AWS resources to perform the specific task of backing up from and Amazon s3 Bucket into an Azure Blob Storage Container.

Multicloud IaC with CDK

Diagram 1: IaC Single Control Plan

The CDK application is then synthesized into one or more AWS CloudFormation Templates, which result in the CloudFormation service deploying AWS resource configurations to AWS and Azure resource configurations to Azure.

This solution demonstrates not only how to consolidate the management of secondary cloud resources into a unified infrastructure stack in AWS, but also the improved productivity by eliminating the complexity and cost of operating multiple deployment mechanisms into multiple public cloud environments.

The following video demonstrates an example in real-time of the end-state solution:

Next Steps

While this was just a straightforward example, with the same approach you can use your imagination to come up with even more and complex scenarios where AWS CDK can be used as a single pane of glass for IaC to manage multicloud and hybrid solutions.

To get started with the solution discussed in this post, this workshop will provide you with the instructions you need to understand the steps required to create the S3ToAzureBackupService.

Once you have learned how to create AWS CloudFormation extensions and develop them into AWS CDK Constructs, you will learn how, with just a few lines of code, you can develop reusable multicloud unified IaC solutions that deploy through a single AWS control plane.

Conclusion

By adopting AWS CloudFormation extensions and AWS CDK, deployed through a single AWS control plane, the cost and complexity of maintaining deployment pipelines across multiple cloud providers is reduced to a single holistic solution-focused pipeline. The techniques demonstrated in this post and the related workshop provide a capability to simplify the design of complex systems, improve the management of integration, and more closely align the IaC and deployment management practices with the design.

About the authors:

Aaron Sempf

Aaron Sempf is a Global Principal Partner Solutions Architect, in the Global Systems Integrators team. With over twenty years in software engineering and distributed system, he focuses on solving for large scale integration and event driven systems. When not working with AWS GSI partners, he can be found coding prototypes for autonomous robots, IoT devices, and distributed solutions.

 
Puneet Talwar

Puneet Talwar

Puneet Talwar is a Senior Solutions Architect at Amazon Web Services (AWS) on the Australian Public Sector team. With a background of over twenty years in software engineering, he particularly enjoys helping customers build modern, API Driven software architectures at scale. In his spare time, he can be found building prototypes for micro front ends and event driven architectures.

Embracing our broad responsibility for securing digital infrastructure in the European Union

Post Syndicated from Frank Adelmann original https://aws.amazon.com/blogs/security/embracing-our-broad-responsibility-for-securing-digital-infrastructure-in-the-european-union/

Over the past few decades, digital technologies have brought tremendous benefits to our societies, governments, businesses, and everyday lives. However, the more we depend on them for critical applications, the more we must do so securely. The increasing reliance on these systems comes with a broad responsibility for society, companies, and governments.

At Amazon Web Services (AWS), every employee, regardless of their role, works to verify that security is an integral component of every facet of the business (see Security at AWS). This goes hand-in-hand with new cybersecurity-related regulations, such as the Directive on Measures for a High Common Level of Cybersecurity Across the Union (NIS 2), formally adopted by the European Parliament and the Counsel of the European Union (EU) in December 2022. NIS 2 will be transposed into the national laws of the EU Member States by October 2024, and aims to strengthen cybersecurity across the EU.

AWS is excited to help customers become more resilient, and we look forward to even closer cooperation with national cybersecurity authorities to raise the bar on cybersecurity across Europe. Building society’s trust in the online environment is key to harnessing the power of innovation for social and economic development. It’s also one of our core Leadership Principles: Success and scale bring broad responsibility.

Compliance with NIS 2

NIS 2 seeks to ensure that entities mitigate the risks posed by cyber threats, minimize the impact of incidents, and protect the continuity of essential and important services in the EU.

Besides increased cooperation between authorities and support for enhanced information sharing amongst covered entities, NIS 2 includes minimum requirements for cybersecurity risk management measures and reporting obligations, which are applicable to a broad range of AWS customers based on their sector. Examples of sectors that must comply with NIS 2 requirements are energy, transport, health, public administration, and digital infrastructures. For the full list of covered sectors, see Annexes I and II of NIS 2. Generally, the NIS 2 Directive applies to a wider pool of entities than those currently covered by the NIS Directive, including medium-sized enterprises, as defined in Article 2 of the Annex to Recommendation 2003/361/EC (over 50 employees or an annual turnover over €10 million).

In several countries, aspects of the AWS service offerings are already part of the national critical infrastructure. For example, in Germany, Amazon Elastic Compute Cloud (Amazon EC2) and Amazon CloudFront are in scope for the KRITIS regulation. For several years, AWS has fulfilled its obligations to secure these services, run audits related to national critical infrastructure, and have established channels for exchanging security information with the German Federal Office for Information Security (BSI) KRITIS office. AWS is also part of the UP KRITIS initiative, a cooperative effort between industry and the German Government to set industry standards.

AWS will continue to support customers in implementing resilient solutions, in accordance with the shared responsibility model. Compliance efforts within AWS will include implementing the requirements of the act and setting out technical and methodological requirements for cloud computing service providers, to be published by the European Commission, as foreseen in Article 21 of NIS 2.

AWS cybersecurity risk management – Current status

Even before the introduction of NIS 2, AWS has been helping customers improve their resilience and incident response capacities. Our core infrastructure is designed to satisfy the security requirements of the military, global banks, and other highly sensitive organizations.

AWS provides information and communication technology services and building blocks that businesses, public authorities, universities, and individuals use to become more secure, innovative, and responsive to their own needs and the needs of their customers. Security and compliance remain a shared responsibility between AWS and the customer. We make sure that the AWS cloud infrastructure complies with applicable regulatory requirements and good practices for cloud providers, and customers remain responsible for building compliant workloads in the cloud.

In total, AWS supports or has obtained over 143 security standards compliance certifications and attestations around the globe, such as ISO 27001, ISO 22301, ISO 20000, ISO 27017, and System and Organization Controls (SOC) 2. The following are some examples of European certifications and attestations that we’ve achieved:

  • C5 — provides a wide-ranging control framework for establishing and evidencing the security of cloud operations in Germany.
  • ENS High — comprises principles for adequate protection applicable to government agencies and public organizations in Spain.
  • HDS — demonstrates an adequate framework for technical and governance measures to secure and protect personal health data, governed by French law.
  • Pinakes — provides a rating framework intended to manage and monitor the cybersecurity controls of service providers upon which Spanish financial entities depend.

These and other AWS Compliance Programs help customers understand the robust controls in place at AWS to help ensure the security and compliance of the cloud. Through dedicated teams, we’re prepared to provide assurance about the approach that AWS has taken to operational resilience and to help customers achieve assurance about the security and resiliency of their workloads. AWS Artifact provides on-demand access to these security and compliance reports and many more.

For security in the cloud, it’s crucial for our customers to make security by design and security by default central tenets of product development. To begin with, customers can use the AWS Well-Architected tool to help build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Customers that use the AWS Cloud Adoption Framework (AWS CAF) can improve cloud readiness by identifying and prioritizing transformation opportunities. These foundational resources help customers secure regulated workloads. AWS Security Hub provides customers with a comprehensive view of their security state on AWS and helps them check their environments against industry standards and good practices.

With regards to the cybersecurity risk management measures and reporting obligations that NIS 2 mandates, existing AWS service offerings can help customers fulfill their part of the shared responsibility model and comply with future national implementations of NIS 2. For example, customers can use Amazon GuardDuty to detect a set of specific threats to AWS accounts and watch out for malicious activity. Amazon CloudWatch helps customers monitor the state of their AWS resources. With AWS Config, customers can continually assess, audit, and evaluate the configurations and relationships of selected resources on AWS, on premises, and on other clouds. Furthermore, AWS Whitepapers, such as the AWS Security Incident Response Guide, help customers understand, implement, and manage fundamental security concepts in their cloud architecture.

NIS 2 foresees the development and implementation of comprehensive cybersecurity awareness training programs for management bodies and employees. At AWS, we provide various training programs at no cost to the public to increase awareness on cybersecurity, such as the Amazon cybersecurity awareness training, AWS Cloud Security Learning, AWS re/Start, and AWS Ramp-Up Guides.

AWS cooperation with authorities

At Amazon, we strive to be the world’s most customer-centric company. For AWS Security Assurance, that means having teams that continuously engage with authorities to understand and exceed regulatory and customer obligations on behalf of customers. This is just one way that we raise the security bar in Europe. At the same time, we recommend that national regulators carefully assess potentially conflicting, overlapping, or contradictory measures.

We also cooperate with cybersecurity agencies around the globe because we recognize the importance of their role in keeping the world safe. To that end, we have built the Global Cybersecurity Program (GCSP) to provide agencies with a direct and consistent line of communication to the AWS Security team. Two examples of GCSP members are the Dutch National Cyber Security Centrum (NCSC-NL), with whom we signed a cooperation in May 2023, and the Italian National Cybersecurity Agency (ACN). Together, we will work on cybersecurity initiatives and strengthen the cybersecurity posture across the EU. With the war in Ukraine, we have experienced how important such a collaboration can be. AWS has played an important role in helping Ukraine’s government maintain continuity and provide critical services to citizens since the onset of the war.

The way forward

At AWS, we will continue to provide key stakeholders with greater insights into how we help customers tackle their most challenging cybersecurity issues and provide opportunities to deep dive into what we’re building. We very much look forward to continuing our work with authorities, agencies and, most importantly, our customers to provide for the best solutions and raise the bar on cybersecurity and resilience across the EU and globally.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Frank Adelmann

Frank Adelmann

Frank is the Regulated Industry and Security Engagement Lead for Regulated Commercial Sectors in Europe. He joined AWS in 2022 after working as a regulator in the European financial sector, technical advisor on cybersecurity matters in the International Monetary Fund, and Head of Information Security in the European Commodity Clearing AG. Today, Frank is passionately engaging with European regulators to understand and exceed regulatory and customer expectations.

Let’s Architect! Cost-optimizing AWS workloads

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-cost-optimizing-aws-workloads/

Every software component built by engineers and architects is designed with a purpose: to offer particular functionalities and, ultimately, contribute to the generation of business value. We should consider fundamental factors, such as the scalability of the software and the ease of evolution during times of business changes. However, performance and cost are important factors as well since they can impact the business profitability.

This edition of Let’s Architect! follows a similar series post from 2022, which discusses optimizing the cost of an architecture. Today, we focus on architectural patterns, services, and best practices to design cost-optimized cloud workloads. We also want to identify solutions, such as the use of Graviton processors, for increased performance at lower price. Cost optimization is a continuous process that requires the identification of the right tools for each job, as well as the adoption of efficient designs for your system.

AWS re:Invent 2022 – Manage and control your AWS costs

Govern cloud usage and avoid cost surprises without slowing down innovation within your organization. In this re:Invent 2022 session, you can learn how to set up guardrails and operationalize cost control within your organizations using services, such as AWS Budgets and AWS Cost Anomaly Detection, and explore the latest enhancements in the AWS cost control space. Additionally, Mercado Libre shares how they automate their cloud cost control through central management and automated algorithms.

Take me to this re:Invent 2022 video!

Work backwards from team needs to define/deploy cloud governance in AWS environments

Work backwards from team needs to define/deploy cloud governance in AWS environments

Compute optimization

When it comes to optimizing compute workloads, there are many tools available, such as AWS Compute Optimizer, Amazon EC2 Spot Instances, Amazon EC2 Reserved Instances, and Graviton instances. Modernizing your applications can also lead to cost savings, but you need to know how to use the right tools and techniques in an effective and efficient way.

For AWS Lambda functions, you can use the AWS Lambda Cost Optimization video to learn how to optimize your costs. The video covers topics, such as understanding and graphing performance versus cost, code optimization techniques, and avoiding idle wait time. If you are using Amazon Elastic Container Service (Amazon ECS) and AWS Fargate, you can watch a Twitch video on cost optimization using Amazon ECS and AWS Fargate to learn how to adjust your costs. The video covers topics like using spot instances, choosing the right instance type, and using Fargate Spot.

Finally, with Amazon Elastic Kubernetes Service (Amazon EKS), you can use Karpenter, an open-source Kubernetes cluster auto scaler to help optimize compute workloads. Karpenter can help you launch right-sized compute resources in response to changing application load, help you adopt spot and Graviton instances. To learn more about Karpenter, read the post How CoStar uses Karpenter to optimize their Amazon EKS Resources on the AWS Containers Blog.

Take me to Cost Optimization using Amazon ECS and AWS Fargate!
Take me to AWS Lambda Cost Optimization!
Take me to How CoStar uses Karpenter to optimize their Amazon EKS Resources!

Karpenter launches and terminates nodes to reduce infrastructure costs

Karpenter launches and terminates nodes to reduce infrastructure costs

AWS Lambda general guidance for cost optimization

AWS Lambda general guidance for cost optimization

AWS Graviton deep dive: The best price performance for AWS workloads

The choice of the hardware is a fundamental driver for performance, cost, as well as resource consumption of the systems we build. Graviton is a family of processors designed by AWS to support cloud-based workloads and give improvements in terms of performance and cost. This re:Invent 2022 presentation introduces Graviton and addresses the problems it can solve, how the underlying CPU architecture is designed, and how to get started with it. Furthermore, you can learn the journey to move different types of workloads to this architecture, such as containers, Java applications, and C applications.

Take me to this re:Invent 2022 video!

AWS Graviton processors are specifically designed by AWS for cloud workloads to deliver the best price performance

AWS Graviton processors are specifically designed by AWS for cloud workloads to deliver the best price performance

AWS Well-Architected Labs: Cost Optimization

The Cost Optimization section of the AWS Well Architected Workshop helps you learn how to optimize your AWS costs by using features, such as AWS Compute Optimizer, Spot Instances, and Reserved Instances. The workshop includes hands-on labs that walk you through the process of optimizing costs for different types of workloads and services, such as Amazon Elastic Compute Cloud, Amazon ECS, and Lambda.

Take me to this AWS Well-Architected lab!

Savings Plans is a flexible pricing model that can help reduce expenses compared with on-demand pricing

Savings Plans is a flexible pricing model that can help reduce expenses compared with on-demand pricing

See you next time!

Thanks for joining us to discuss cost optimization! In 2 weeks, we’ll talk about in-memory databases and caching systems.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

AWS Digital Sovereignty Pledge: Announcing new dedicated infrastructure options

Post Syndicated from Matt Garman original https://aws.amazon.com/blogs/security/aws-digital-sovereignty-pledge-announcing-new-dedicated-infrastructure-options/

At AWS, we’re committed to helping our customers meet digital sovereignty requirements. Last year, I announced the AWS Digital Sovereignty Pledge, our commitment to offering all AWS customers the most advanced set of sovereignty controls and features available in the cloud. Our approach is to continue to make AWS sovereign-by-design—as it has been from day one.

I promised that our pledge was just the start, and that we would continue to innovate to meet the needs of our customers. As part of our promise, we pledged to invest in an ambitious roadmap of capabilities on data residency, granular access restriction, encryption, and resilience. Today, I’d like to update you on another milestone on our journey to continue to help our customers address their sovereignty needs.

Further control over the location of your data

Customers have always controlled the location of their data with AWS. For example, in Europe, customers have the choice to deploy their data into any of eight existing AWS Regions. These AWS Regions provide the broadest set of cloud services and features, enabling our customers to run the majority of their workloads. Customers can also use AWS Local Zones, a type of infrastructure deployment that makes AWS services available in more places, to help meet latency and data residency requirements, without having to deploy self-managed infrastructure. Customers who must comply with data residency regulations can choose to run their workloads in specific geographic locations where AWS Regions and Local Zones are available.

Announcing AWS Dedicated Local Zones

Our public sector and regulated industry customers have told us they want dedicated infrastructure for their most critical workloads to help meet regulatory or other compliance requirements. Many of these customers manage their own infrastructure on premises for workloads that require isolation. This forgoes the performance, innovation, elasticity, scalability, and resiliency benefits of the cloud.

To help our customers address these needs, I’m excited to announce AWS Dedicated Local Zones. Dedicated Local Zones are a type of AWS infrastructure that is fully managed by AWS, built for exclusive use by a customer or community, and placed in a customer-specified location or data center to help comply with regulatory requirements. Dedicated Local Zones can be operated by local AWS personnel and offer the same benefits of Local Zones, such as elasticity, scalability, and pay-as-you-go pricing, with added security and governance features. These features include data access monitoring and audit programs, controls to limit infrastructure access to customer-selected AWS accounts, and options to enforce security clearance or other criteria on local AWS operating personnel. With Dedicated Local Zones, we work with customers to configure their own Local Zones with the services and capabilities they need to meet their regulatory requirements.

AWS Dedicated Local Zones meet the same high AWS security standards that apply to AWS Regions and Local Zones. They also come with the same AWS Nitro System that powers all modern Amazon Elastic Compute Cloud (Amazon EC2) instances to help ensure confidentiality and integrity of customer data. With Dedicated Local Zones, customers can use the multitenancy features of the cloud to efficiently enable adoption across multiple AWS accounts created by a customer’s community of agencies and business units, and reduce the operational overhead of managing on-premises infrastructure. Customers can deploy multiple Dedicated Local Zones for resiliency and simplify their applications’ architecture by using consistent AWS infrastructure, APIs, and tools across different classifications of applications running in AWS Regions and Dedicated Local Zones. AWS services, such as Amazon EC2, Amazon Virtual Private Cloud (Amazon VPC), Amazon Elastic Block Store (Amazon EBS), Elastic Load Balancing (ELB), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Direct Connect, will be available in Dedicated Local Zones.

Innovating with the Singapore Government’s Smart Nation and Digital Government Group

At AWS, we work closely with customers to understand their requirements for their most critical workloads. Our work with the Singapore Government’s Smart Nation and Digital Government Group (SNDGG) to build a Smart Nation for their citizens and businesses illustrates this approach. This group spearheads Singapore’s digital government transformation and development of the public sector’s engineering capabilities. SNDGG is the first customer to deploy Dedicated Local Zones.

“AWS is a strategic partner and has been since the beginning of our cloud journey. SNDGG collaborated with AWS to define and build Dedicated Local Zones to help us meet our stringent data isolation and security requirements, enabling Singapore to run more sensitive workloads in the cloud securely,” said Chan Cheow Hoe, Government Chief Digital Technology Officer of Singapore. “In addition to helping the Singapore government meet its cybersecurity requirements, the Dedicated Local Zones enable us to offer its agencies a seamless and consistent cloud experience.”

Our commitments to our customers

We remain committed to helping our customers meet evolving sovereignty requirements. We continue to innovate sovereignty features, controls, and assurances globally with AWS, without compromising on the full power of AWS.

To get started, you can visit the Dedicated Local Zones website, where you can contact AWS specialists to learn more about how Dedicated Local Zones can be configured to meet your regulatory needs.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Matt Garman

Matt Garman

Matt is currently the Senior Vice President of AWS Sales, Marketing and Global Services at AWS, and also sits on Amazon’s executive leadership S-Team. Matt joined Amazon in 2006, and has held several leadership positions in AWS over that time. Matt previously served as Vice President of the Amazon EC2 and Compute Services businesses for AWS for over 10 years. Matt was responsible for P&L, product management, and engineering and operations for all compute and storage services in AWS. He started at Amazon when AWS first launched in 2006 and served as one of the first product managers, helping to launch the initial set of AWS services. Prior to Amazon, he spent time in product management roles at early stage Internet startups. Matt earned a BS and MS in Industrial Engineering from Stanford University, and an MBA from the Kellogg School of Management at Northwestern University.

How AWS built the Security Guardians program, a mechanism to distribute security ownership

Post Syndicated from Ana Malhotra original https://aws.amazon.com/blogs/security/how-aws-built-the-security-guardians-program-a-mechanism-to-distribute-security-ownership/

Product security teams play a critical role to help ensure that new services, products, and features are built and shipped securely to customers. However, since security teams are in the product launch path, they can form a bottleneck if organizations struggle to scale their security teams to support their growing product development teams. In this post, we will share how Amazon Web Services (AWS) developed a mechanism to scale security processes and expertise by distributing security ownership between security teams and development teams. This mechanism has many names in the industry — Security Champions, Security Advocates, and others — and it’s often part of a shift-left approach to security. At AWS, we call this mechanism Security Guardians.

In many organizations, there are fewer security professionals than product developers. Our experience is that it takes much more time to hire a security professional than other technical job roles, and research conducted by (ISC)2 shows that the cybersecurity industry is short 3.4 million workers. When product development teams continue to grow at a faster rate than security teams, the disparity between security professionals and product developers continues to increase as well. Although most businesses understand the importance of security, frustration and tensions can arise when it becomes a bottleneck for the business and its ability to serve customers.

At AWS, we require the teams that build products to undergo an independent security review with an AWS application security engineer before launching. This is a mechanism to verify that new services, features, solutions, vendor applications, and hardware meet our high security bar. This intensive process impacts how quickly product teams can ship to customers. As shown in Figure 1, we found that as the product teams scaled, so did the problem: there were more products being built than the security teams could review and approve for launch. Because security reviews are required and non-negotiable, this could potentially lead to delays in the shipping of products and features.

Figure 1: More products are being developed than can be reviewed and shipped

Figure 1: More products are being developed than can be reviewed and shipped

How AWS builds a culture of security

Because of its size and scale, many customers look to AWS to understand how we scale our own security teams. To tell our story and provide insight, let’s take a look at the culture of security at AWS.

Security is a business priority

At AWS, security is a business priority. Business leaders prioritize building products and services that are designed to be secure, and they consider security to be an enabler of the business rather than an obstacle.

Leaders also strive to create a safe environment by encouraging employees to identify and escalate potential security issues. Escalation is the process of making sure that the right people know about the problem at the right time. Escalation encompasses “Dive Deep”, which is one of our corporate values at Amazon, because it requires owners and leaders to dive into the details of the issue. If you don’t know the details, you can’t make good decisions about what’s going on and how to run your business effectively.

This aspect of the culture goes beyond intention — it’s embedded in our organizational structure:

CISOs and IT leaders play a key role in demystifying what security and compliance represent for the business. At AWS, we made an intentional choice for the security team to report directly to the CEO. The goal was to build security into the structural fabric of how AWS makes decisions, and every week our security team spends time with AWS leadership to ensure we’re making the right choices on tactical and strategic security issues.

– Stephen Schmidt, Chief Security Officer, Amazon, on Building a Culture of Security

Everyone owns security

Because our leadership supports security, it’s understood within AWS that security is everyone’s job. Security teams and product development teams work together to help ensure that products are built and shipped securely. Despite this collaboration, the product teams own the security of their product. They are responsible for making sure that security controls are built into the product and that customers have the tools they need to use the product securely.

On the other hand, central security teams are responsible for helping developers to build securely and verifying that security requirements are met before launch. They provide guidance to help developers understand what security controls to build, provide tools to make it simpler for developers to implement and test controls, provide support in threat modeling activities, use mechanisms to help ensure that customers’ security expectations are met before launch, and so on.

This responsibility model highlights how security ownership is distributed between the security and product development teams. At AWS, we learned that without this distribution, security doesn’t scale. Regardless of the number of security experts we hire, product teams always grow faster. Although the culture around security and the need to distribute ownership is now well understood, without the right mechanisms in place, this model would have collapsed.

Mechanisms compared to good intentions

Mechanisms are the final pillar of AWS culture that has allowed us to successfully distribute security across our organization. A mechanism is a complete process, or virtuous cycle, that reinforces and improves itself as it operates. As shown in Figure 2, a mechanism takes controllable inputs and transforms them into ongoing outputs to address a recurring business challenge. At AWS, the business challenge that we’re facing is that security teams create bottlenecks for the business. The culture of security at AWS provides support to help address this challenge, but we needed a mechanism to actually do it.

Figure 2: AWS sees mechanisms as a complete process, or virtuous cycle

Figure 2: AWS sees mechanisms as a complete process, or virtuous cycle

“Often, when we find a recurring problem, something that happens over and over again, we pull the team together, ask them to try harder, do better – essentially, we ask for good intentions. This rarely works… When you are asking for good intentions, you are not asking for a change… because people already had good intentions. But if good intentions don’t work, what does? Mechanisms work.

 – Jeff Bezos, February 1, 2008 All Hands.

At AWS, we’ve learned that we can help solve the challenge of scaling security by distributing security ownership with a mechanism we call the Security Guardians program. Like other mechanisms, it has inputs and outputs, and transforms over time.

AWS distributes security ownership with the Security Guardians program

At AWS, the Security Guardians program trains, develops, and empowers developers to be security ambassadors, or Guardians, within the product teams. At a high level, Guardians make sure that security considerations for a product are made earlier and more often, helping their peers build and ship their product faster. They also work closely with the central security team to help ensure that the security bar at AWS is rising and the Security Guardians program is improving over time. As shown in Figure 3, embedding security expertise within the product teams helps products with Guardian involvement move through security review faster.

Figure 3: Security expertise is embedded in the product teams by Guardians

Figure 3: Security expertise is embedded in the product teams by Guardians

Guardians are informed, security-minded product builders who volunteer to be consistent champions of security on their teams and are deeply familiar with the security processes and tools. They provide security guidance throughout the development lifecycle and are stakeholders in the security of the products being shipped, helping their teams make informed decisions that lead to more secure, on-time launches. Guardians are the security points-of-contact for their product teams.

In this distributed security ownership model, accountability for product security sits with the product development teams. However, the Guardians are responsible for performing the first evaluation of a development team’s security review submission. They confirm the quality and completeness of the new service’s resources, design documents, threat model, automated findings, and penetration test readiness. The development teams, supported by the Guardian, submit their security review to AWS Application Security (AppSec) engineers for the final pre-launch review.

In practice, as part of this development journey, Guardians help ensure that security considerations are made early, when teams are assessing customer requests and the feature or product design. This can be done by starting the threat modeling processes. Next, they work to make sure that mitigations identified during threat modeling are developed. Guardians also play an active role in software testing, including security scans such as static application security testing (SAST) and dynamic application security testing (DAST). To close out the security review, security engineers work with Guardians to make sure that findings are resolved and the product is ready to ship.

Figure 4: Expedited security review process supported by Guardians

Figure 4: Expedited security review process supported by Guardians

Guardians are, after all, Amazonians. Therefore, Guardians exemplify a number of the Amazon Leadership Principles and often have the following characteristics:

  • They are exemplary practitioners for security ownership and empower their teams to own the security of their service.
  • They hold a high security bar and exercise strong security judgement, don’t accept quick or easy answers, and drive continuous improvement.
  • They advocate for security needs in internal discussions with the product team.
  • They are thoughtful yet assertive to make customer security a top priority on their team.
  • They maintain and showcase their security knowledge to their peers, continuously building knowledge from many different sources to gain perspective and to stay up to date on the constantly evolving threat landscape.
  • They aren’t afraid to have their work independently validated by the central security team.

Expected outcomes

AWS has benefited greatly from the Security Guardians program. We’ve had 22.5 percent fewer medium and high severity security findings generated during the security review process and have taken about 26.9 percent less time to review a new service or feature. This data demonstrates that with Guardians involved we’re identifying fewer issues late in the process, reducing remediation work, and as a result securely shipping services faster for our customers. To help both builders and Guardians improve over time, our security review tool captures feedback from security engineers on their inputs. This helps ensure that our security ownership mechanism reinforces and improves itself over time.

AWS and other organizations have benefited from this mechanism because it generates specialized security resources and distributes security knowledge that scales without needing to hire additional staff.

A program such as this could help your business build and ship faster, as it has for AWS, while maintaining an appropriately high security bar that rises over time. By training builders to be security practitioners and advocates within your development cycle, you can increase the chances of identifying risks and security findings earlier. These findings, earlier in the development lifecycle, can reduce the likelihood of having to patch security bugs or even start over after the product has already been built. We also believe that a consistent security experience for your product teams is an important aspect of successfully distributing your security ownership. An experience with less confusion and friction will help build trust between the product and security teams.

To learn more about building positive security culture for your business, watch this spotlight interview with Stephen Schmidt, Chief Security Officer, Amazon.

If you’re an AWS customer and want to learn more about how AWS built the Security Guardians program, reach out to your local AWS solutions architect or account manager for more information.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ana Malhotra

Ana Malhotra

Ana is a Security Specialist Solutions Architect and the Healthcare and Life Sciences (HCLS) Security Lead for AWS Industry, based in Seattle, Washington. As a former AWS Application Security Engineer, Ana loves talking all things AppSec, including people, process, and technology. In her free time, she enjoys tapping into her creative side with music and dance.

Mitch Beaumont

Mitch Beaumont

Mitch is a Principal Solutions Architect for Amazon Web Services, based in Sydney, Australia. Mitch works with some of Australia’s largest financial services customers, helping them to continually raise the security bar for the products and features that they build and ship. Outside of work, Mitch enjoys spending time with his family, photography, and surfing.

Derive operational insights from application logs using Automated Data Analytics on AWS

Post Syndicated from Aparajithan Vaidyanathan original https://aws.amazon.com/blogs/big-data/derive-operational-insights-from-application-logs-using-automated-data-analytics-on-aws/

Automated Data Analytics (ADA) on AWS is an AWS solution that enables you to derive meaningful insights from data in a matter of minutes through a simple and intuitive user interface. ADA offers an AWS-native data analytics platform that is ready to use out of the box by data analysts for a variety of use cases. With ADA, teams can ingest, transform, govern, and query diverse datasets from a range of data sources without requiring specialist technical skills. ADA provides a set of pre-built connectors to ingest data from a wide range of sources including Amazon Simple Storage Service (Amazon S3), Amazon Kinesis Data Streams, Amazon CloudWatch, Amazon CloudTrail, and Amazon DynamoDB as well as many others.

ADA provides a foundational platform that can be used by data analysts in a diverse set of use cases including IT, finance, marketing, sales, and security. ADA’s out-of-the-box CloudWatch data connector allows data ingestion from CloudWatch logs in the same AWS account in which ADA has been deployed, or from a different AWS account.

In this post, we demonstrate how an application developer or application tester is able to use ADA to derive operational insights of applications running in AWS. We also demonstrate how you can use the ADA solution to connect to different data sources in AWS. We first deploy the ADA solution into an AWS account and set up the ADA solution by creating data products using data connectors. We then use the ADA Query Workbench to join the separate datasets and query the correlated data, using familiar Structured Query Language (SQL), to gain insights. We also demonstrate how ADA can be integrated with business intelligence (BI) tools such as Tableau to visualize the data and to build reports.

Solution overview

In this section, we present the solution architecture for the demo and explain the workflow. For the purposes of demonstration, the bespoke application is simulated using an AWS Lambda function that emits logs in Apache Log Format at a preset interval using Amazon EventBridge. This standard format can be produced by many different web servers and be read by many log analysis programs. The application (Lambda function) logs are sent to a CloudWatch log group. The historical application logs are stored in an S3 bucket for reference and for querying purposes. A lookup table with a list of HTTP status codes along with the descriptions is stored in a DynamoDB table. These three serve as sources from which data is ingested into ADA for correlation, query, and analysis. We deploy the ADA solution into an AWS account and set up ADA. We then create the data products within ADA for the CloudWatch log group, S3 bucket, and DynamoDB. As the data products are configured, ADA provisions data pipelines to ingest the data from the sources. With the ADA Query Workbench, you can query the ingested data using plain SQL for application troubleshooting or issue diagnosis.

The following diagram provides an overview of the architecture and workflow of using ADA to gain insights into application logs.

The workflow includes the following steps:

  1. A Lambda function is scheduled to be triggered at 2-minute intervals using EventBridge.
  2. The Lambda function emits logs that are stored at a specified CloudWatch log group under /aws/lambda/CdkStack-AdaLogGenLambdaFunction. The application logs are generated using the Apache Log Format schema but stored in the CloudWatch log group in JSON format.
  3. The data products for CloudWatch, Amazon S3, and DynamoDB are created in ADA. The CloudWatch data product connects to the CloudWatch log group where the application (Lambda function) logs are stored. The Amazon S3 connector connects to an S3 bucket folder where the historical logs are stored. The DynamoDB connector connects to a DynamoDB table where the status codes that are referred by the application and historical logs are stored.
  4. For each of the data products, ADA deploys the data pipeline infrastructure to ingest data from the sources. When the data ingestion is complete, you can write queries using SQL via the ADA Query Workbench.
  5. You can log in to the ADA portal and compose SQL queries from the Query Workbench to gain insights in to the application logs. You can optionally save the query and share the query with other ADA users in the same domain. The ADA query feature is powered by Amazon Athena, which is a serverless, interactive analytics service that provides a simplified, flexible way to analyze petabytes of data.
  6. Tableau is configured to access the ADA data products via ADA egress endpoints. You then create a dashboard with two charts. The first chart is a heat map that shows the prevalence of HTTP error codes correlated with the application API endpoints. The second chart is a bar chart that shows the top 10 application APIs with a total count of HTTP error codes from the historical data.

Prerequisites

For this post, you need to complete the following prerequisites:

  1. Install the AWS Command Line Interface (AWS CLI), AWS Cloud Development Kit (AWS CDK) prerequisites, TypeScript-specific prerequisites, and git.
  2. Deploy the ADA solution in your AWS account in the us-east-1 Region.
    1. Provide an admin email while launching the ADA AWS CloudFormation stack. This is needed for ADA to send the root user password. An admin phone number is required to receive a one-time password message if multi-factor authentication (MFA) is enabled. For this demo, MFA is not enabled.
  3. Build and deploy the sample application (available on the GitHub repo) solution so that the following resources can be provisioned in your account in the us-east-1 Region:
    1. A Lambda function that simulates the logging application and an EventBridge rule that invokes the application function at 2-minute intervals.
    2. An S3 bucket with the relevant bucket policies and a CSV file that contains the historical application logs.
    3. A DynamoDB table with the lookup data.
    4. Relevant AWS Identity and Access Management (IAM) roles and permissions required for the services.
  4. Optionally, install Tableau Desktop, a third-party BI provider. For this post, we use Tableau Desktop version 2021.2. There is a cost involved in using a licensed version of the Tableau Desktop application. For additional details, refer to the Tableau licensing information.

Deploy and set up ADA

After ADA is deployed successfully, you can log in using the admin email provided during the installation. You then create a domain named CW_Domain. A domain is a user-defined collection of data products. For example, a domain might be a team or a project. Domains provide a structured way for users to organize their data products and manage access permissions.

  1. On the ADA console, choose Domains in the navigation pane.
  2. Choose Create domain.
  3. Enter a name (CW_Domain) and description, then choose Submit.

Set up the sample application infrastructure using AWS CDK

The AWS CDK solution that deploys the demo application is hosted on GitHub. The steps to clone the repo and to set up the AWS CDK project are detailed in this section. Before you run these commands, be sure to configure your AWS credentials. Create a folder, open the terminal, and navigate to the folder where the AWS CDK solution needs to be installed. Run the following code:

gh repo clone aws-samples/operational-insights-with-automated-data-analytics-on-aws
cd operational-insights-with-automated-data-analytics-on-aws
npm install
npm run build
cdk synth
cdk deploy

These steps perform the following actions:

  • Install the library dependencies
  • Build the project
  • Generate a valid CloudFormation template
  • Deploy the stack using AWS CloudFormation in your AWS account

The deployment takes about 1–2 minutes and creates the DynamoDB lookup table, Lambda function, and S3 bucket containing the historical log files as outputs. Copy these values to a text editing application, such as Notepad.

Create ADA data products

We create three different data products for this demo, one for each data source that you’ll be querying to gain operational insights. A data product is a dataset (a collection of data such as a table or a CSV file) that has been successfully imported into ADA and that can be queried.

Create a CloudWatch data product

First, we create a data product for the application logs by setting up ADA to ingest the CloudWatch log group for the sample application (Lambda function). Use the CdkStack.LambdaFunction output to get the Lambda function ARN and locate the corresponding CloudWatch log group ARN on the CloudWatch console.

Then complete the following steps:

  1. On the ADA console, navigate to the ADA domain and create a CloudWatch data product.
  2. For Name¸ enter a name.
  3. For Source type, choose Amazon CloudWatch.
  4. Disable Automatic PII.

ADA has a feature that automatically detects personally identifiable information (PII) data during import that is enabled by default. For this demo, we disable this option for the data product because the discovery of PII data is not in the scope of this demo.

  1. Choose Next.
  2. Search for and choose the CloudWatch log group ARN copied from the previous step.
  3. Copy the log group ARN.
  4. On the data product page, enter the log group ARN.
  5. For CloudWatch Query, enter a query that you want ADA to get from the log group.

In this demo, we query the @message field because we’re interested in getting the application logs from the log group.

  1. Select how the data updates are triggered after initial import.

ADA can be configured to ingest the data from the source at flexible intervals (up to 15 minutes or later) or on demand. For the demo, we set the data updates to run hourly.

  1. Choose Next.

Next, ADA will connect to the log group and query the schema. Because the logs are in Apache Log Format, we transform the logs into separate fields so that we can run queries on the specific log fields. ADA provides four default transformations and supports custom transformation through a Python script. In this demo, we run a custom Python script to transform the JSON message field into Apache Log Format fields.

  1. Choose Transform schema.
  2. Choose Create new transform.
  3. Upload the apache-log-extractor-transform.py script from the /asset/transform_logs/ folder.
  4. Choose Submit.

ADA will transform the CloudWatch logs using the script and present the processed schema.

  1. Choose Next.
  2. In the last step, review the steps and choose Submit.

ADA will start the data processing, create the data pipelines, and prepare the CloudWatch log groups to be queried from the Query Workbench. This process will take a few minutes to complete and will be shown on the ADA console under Data Products.

Create an Amazon S3 data product

We repeat the steps to add the historical logs from the Amazon S3 data source and look up reference data from the DynamoDB table. For these two data sources, we don’t create custom transforms because the data formats are in CSV (for historical logs) and key attributes (for reference lookup data).

  1. On the ADA console, create a new data product.
  2. Enter a name (hist_logs) and choose Amazon S3.
  3. Copy the Amazon S3 URI (the text after arn:aws:s3:::) from the CdkStack.S3 output variable and navigate to the Amazon S3 console.
  4. In the search box, enter the copied text, open the S3 bucket, select the /logs folder, and choose Copy S3 URI.

The historical logs are stored in this path.

  1. Navigate back to the ADA console and enter the copied S3 URI for S3 location.
  2. For Update Trigger, select On Demand because the historical logs are updated at an unspecified frequency.
  3. For Update Policy, select Append to append newly imported data to the existing data.
  4. Choose Next.

ADA processes the schema for the files in the selected folder path. Because the logs are in CSV format, ADA is able to read the column names without requiring additional transformations. However, the columns status_code and request_size are inferred as long type by ADA. We want to keep the column data types consistent among the data products so that we can join the data tables and query the data. The column status_code will be used to create joins across the data tables.

  1. Choose Transform schema to change the data types of the two columns to string data type.

Note the highlighted column names in the Schema preview pane prior to applying the data type transformations.

  1. In the Transform plan pane, under Built-in transforms, choose Apply Mapping.

This option allows you to change the data type from one type to another.

  1. In the Apply Mapping section, deselect Drop other fields.

If this option is not disabled, only the transformed columns will be preserved and all other columns will be dropped. Because we want to retain all the columns, we disable this option.

  1. Under Field Mappings¸ for Old name and New name, enter status_code and for New type, enter string.
  2. Choose Add Item.
  3. For Old name and New name¸ enter request_size and for New data type, enter string.
  4. Choose Submit.

ADA will apply the mapping transformation on the Amazon S3 data source. Note the column types in the Schema preview pane.

  1. Choose View sample to preview the data with the transformation applied.

ADA will display the PII data acknowledgement to ensure that either only authorized users can view the data or that the dataset doesn’t contain any PII data.

  1. Choose Agree to continue to view the sample data.

Note that the schema is identical to the CloudWatch log group schema because both the current application and historical application logs are in Apache Log Format.

  1. In the final step, review the configuration and choose Submit.

ADA starts processing the data from the Amazon S3 source, creates the backend infrastructure, and prepares the data product. This process takes a few minutes depending upon the size of the data.

Create a DynamoDB data product

Lastly, we create a DynamoDB data product. Complete the following steps:

  1. On the ADA console, create a new data product.
  2. Enter a name (lookup) and choose Amazon DynamoDB.
  3. Enter the Cdk.DynamoDBTable output variable for DynamoDB Table ARN.

This table contains key attributes that will be used as a lookup table in this demo. For the lookup data, we are using the HTTP codes and long and short descriptions of the codes. You can also use PostgreSQL, MySQL, or a CSV file source as an alternative.

  1. For Update Trigger, select On-Demand.

The updates will be on demand because the lookup is mostly for reference purpose while querying and any updates to the lookup data can be updated in ADA using on-demand triggers.

  1. Choose Next.

ADA reads the schema from the underlying DynamoDB schema and presents the column name and type for optional transformation. We will proceed with the default schema selection because the column types are consistent with the types from the CloudWatch log group and Amazon S3 CSV data source. Having data types that are consistent across the data sources allows us to write queries to fetch records by joining the tables using the column fields. For example, the column key in the DynamoDB schema corresponds to the status_code in the Amazon S3 and CloudWatch data products. We can write queries that can join the three tables using the column name key. An example is shown in the next section.

  1. Choose Continue with current schema.
  2. Review the configuration and choose Submit.

ADA will process the data from the DynamoDB table data source and prepare the data product. Depending upon the size of the data, this process takes a few minutes.

Now we have all the three data products processed by ADA and available for you to run queries.

Use the Query Workbench to query the data

ADA allows you to run queries against the data products while abstracting the data source and making it accessible using SQL (Structured Query Language). You can write queries and join the tables just as you would query against tables in a relational database. We demonstrate ADA’s querying capability via two user scenarios. In both the scenarios, we join an application log dataset to the error codes lookup table. In the first use case, we query the current application logs to identify the top 10 most accessed application endpoints along with the corresponding HTTP status codes:

--Query the top 10 Application endpoints along with the corresponding HTTP request type and HTTP status code.

SELECT logs.endpoint AS Application_EndPoint, logs.http_request AS REQUEST, count(logs.endpoint) as Endpoint_Count, ref.key as HTTP_Status_Code, ref.short as Description
FROM cw_domain.cloud_watch_application_logs logs
INNER JOIN cw_domain.lookup ref ON logs.status_code = ref.key
where logs.status_code LIKE '4%%' OR logs.status_code LIKE '5%%' -- = '/v1/server'
GROUP BY logs.endpoint, logs.http_request, ref.key, ref.short
ORDER BY Endpoint_Count DESC
LIMIT 10

In the second example, we query the historical logs table to get the top 10 application endpoints with the most errors to understand the endpoint call pattern:

-- Query Historical Logs to get the top 10 Application Endpoints with most number of errors along with an explanation of the error code.

SELECT endpoint as Application_EndPoint, count(status_code) as Error_Count, ref.long as Description FROM cw_domain.hist_logs hist
INNER JOIN cw_domain.lookup ref ON hist.status_code = ref.key
WHERE hist.status_code LIKE '4%%' OR hist.status_code LIKE '5%%'
GROUP BY endpoint, status_code, ref.long
ORDER BY Error_Count desc
LIMIT 10

In addition to querying, you can optionally save the query and share the saved query with other users in the same domain. The shared queries are accessible directly from the Query Workbench. The query results can also be exported to CSV format.

Visualize ADA data products in Tableau

ADA offers the ability to connect to third-party BI tools to visualize data and create reports from the ADA data products. In this demo, we use ADA’s native integration with Tableau to visualize the data from the three data products we configured earlier. Using Tableau’s Athena connector and following the steps in Tableau configuration, you can configure ADA as a data source in Tableau. After a successful connection has been established between Tableau and ADA, Tableau will populate the three data products under the Tableau catalog cw_domain.

We then establish a relationship across the three databases using the HTTP status code as the joining column, as shown in the following screenshot. Tableau allows us to work in online and offline mode with the data sources. In online mode, Tableau will connect to ADA and query the data products live. In offline mode, we can use the Extract option to extract the data from ADA and import the data in to Tableau. In this demo, we import the data in to Tableau to make the querying more responsive. We then save the Tableau workbook. We can inspect the data from the data sources by choosing the database and Update Now.

With the data source configurations in place in Tableau, we can create custom reports, charts, and visualizations on the ADA data products. Let’s consider two use cases for visualizations.

As shown in the following figure, we visualized the frequency of the HTTP errors by application endpoints using Tableau’s built-in heat map chart. We filtered out the HTTP status codes to only include error codes in the 4xx and 5xx range.

We also created a bar chart to depict the application endpoints from the historical logs ordered by the count of HTTP error codes. In this chart, we can see that the /v1/server/admin endpoint has generated the most HTTP error status codes.

Clean up

Cleaning up the sample application infrastructure is a two-step process. First, to remove the infrastructure provisioned for the purposes of this demo, run the following command in the terminal:

cdk destroy

For the following question, enter y and AWS CDK will delete the resources deployed for the demo:

Are you sure you want to delete: CdkStack (y/n)? y

Alternatively, you can remove the resources via the AWS CloudFormation console by navigating to the CdkStack stack and choosing Delete.

The second step is to uninstall ADA. For instructions, refer to Uninstall the solution.

Conclusion

In this post, we demonstrated how to use the ADA solution to derive insights from application logs stored across two different data sources. We demonstrated how to install ADA on an AWS account and deploy the demo components using AWS CDK. We created data products in ADA and configured the data products with the respective data sources using the ADA’s built-in data connectors. We demonstrated how to query the data products using standard SQL queries and generate insights on the log data. We also connected the Tableau Desktop client, a third-party BI product, to ADA and demonstrated how to build visualizations against the data products.

ADA automates the process of ingesting, transforming, governing, and querying diverse datasets and simplifying the lifecycle management of data. ADA’s pre-built connectors allow you to ingest data from diverse data sources. Software teams with basic knowledge of AWS products and services will be able to set up an operational data analytics platform in a few hours and provide secure access to the data. The data can then be easily and quickly queried using an intuitive and standalone web user interface.

Try out ADA today to easily manage and gain insights from data.


About the authors

Aparajithan Vaidyanathan is a Principal Enterprise Solutions Architect at AWS. He supports enterprise customers migrate and modernize their workloads on AWS cloud. He is a Cloud Architect with 23+ years of experience designing and developing enterprise, large-scale and distributed software systems. He specializes in Machine Learning & Data Analytics with focus on Data and Feature Engineering domain. He is an aspiring marathon runner and his hobbies include hiking, bike riding and spending time with his wife and two boys.

Rashim Rahman is a Software Developer based out of Sydney, Australia with 10+ years of experience in software development and architecture. He works primarily on building large scale open-source AWS solutions for common customer use cases and business problems. In his spare time, he enjoys sports and spending time with friends and family.

Hafiz Saadullah is a Principal Technical Product Manager at Amazon Web Services. Hafiz focuses on AWS Solutions, designed to help customers by addressing common business problems and use cases.

Let’s Architect! Security in software architectures

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-security-in-software-architectures/

Security is fundamental for each product and service you are building with. Whether you are working on the back-end or the data and machine learning components of a system, the solution should be securely built.

In 2022, we discussed security in our post Let’s Architect! Architecting for Security. Today, we take a closer look at general security practices for your cloud workloads to secure both networks and applications, with a mix of resources to show you how to architect for security using the services offered by Amazon Web Services (AWS).

In this edition of Let’s Architect!, we share some practices for protecting your workloads from the most common attacks, introduce the Zero Trust principle (you can learn how AWS itself is implementing it!), plus how to move to containers and/or alternative approaches for managing your secrets.

A deep dive on the current security threat landscape with AWS

This session from AWS re:Invent, security engineers guide you through the most common threat vectors and vulnerabilities that AWS customers faced in 2022. For each possible threat, you can learn how it’s implemented by attackers, the weaknesses attackers tend to leverage, and the solutions offered by AWS to avert these security issues. We describe this as fundamental architecting for security: this implies adopting suitable services to protect your workloads, as well as follow architectural practices for security.

Take me to this re:Invent 2022 session!

Statistics about common attacks and how they can be launched

Statistics about common attacks and how they can be launched

Zero Trust: Enough talk, let’s build better security

What is Zero Trust? It is a security model that produces higher security outcomes compared with the traditional network perimeter model.

How does Zero Trust work in practice, and how can you start adopting it? This AWS re:Invent 2022 session defines the Zero Trust models and explains how to implement one. You can learn how it is used within AWS, as well as how any architecture can be built with these pillars in mind. Furthermore, there is a practical use case to show you how Delphix put Zero Trust into production.

Take me to this re:Invent 2022 session!

AWS implements the Zero Trust principle for managing interactions across different services

AWS implements the Zero Trust principle for managing interactions across different services

A deep dive into container security on AWS

Nowadays, it’s vital to have a thorough understanding of a container’s underlying security layers. AWS services, like Amazon Elastic Kubernetes Service and Amazon Elastic Container Service, have harnessed these Linux security-layer protections, keeping a sharp focus on the principle of least privilege. This approach significantly minimizes the potential attack surface by limiting the permissions and privileges of processes, thus upholding the integrity of the system.

This re:Inforce 2023 session discusses best practices for securing containers for your distributed systems.

Take me to this re:Inforce 2023 session!

Fundamentals and best practices to secure containers

Fundamentals and best practices to secure containers

Migrating your secrets to AWS Secrets Manager

Secrets play a critical role in providing access to confidential systems and resources. Ensuring the secure and consistent management of these secrets, however, presents a challenge for many organizations.

Anti-patterns observed in numerous organizational secrets management systems include sharing plaintext secrets via unsecured means, such as emails or messaging apps, which can allow application developers to view these secrets in plaintext or even neglect to rotate secrets regularly. This detailed guidance walks you through the steps of discovering and classifying secrets, plus explains the implementation and migration processes involved in transferring secrets to AWS Secrets Manager.

Take me to this AWS Security Blog post!

An organization's perspectives and responsibilities when building a secrets management solution

An organization’s perspectives and responsibilities when building a secrets management solution

Conclusion

We’re glad you joined our conversation on building secure architectures! Join us in a couple of weeks when we’ll talk about cost optimization on AWS.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

Cost considerations and common options for AWS Network Firewall log management

Post Syndicated from Sharon Li original https://aws.amazon.com/blogs/security/cost-considerations-and-common-options-for-aws-network-firewall-log-management/

When you’re designing a security strategy for your organization, firewalls provide the first line of defense against threats. Amazon Web Services (AWS) offers AWS Network Firewall, a stateful, managed network firewall that includes intrusion detection and prevention (IDP) for your Amazon Virtual Private Cloud (VPC).

Logging plays a vital role in any firewall policy, as emphasized by the National Institute of Standards and Technology (NIST) Guidelines on Firewalls and Firewall Policy. Logging enables organizations to take proactive measures to help prevent and recover from failures, maintain proper firewall security configurations, and gather insights for effectively responding to security incidents.

Determining the optimal logging approach for your organization should be approached on a case-by-case basis. It involves striking a balance between your security and compliance requirements and the costs associated with implementing solutions to meet those requirements.

This blog post walks you through logging configuration best practices, discusses three common architectural patterns for Network Firewall logging, and provides guidelines for optimizing the cost of your logging solution. This information will help you make a more informed choice for your organization’s use case.

Stateless and stateful rules engines logging

When discussing Network Firewall best practices, it’s essential to understand the distinction between stateful and stateless rules. Note that stateless rules don’t support firewall logging, which can make them difficult to work with in use cases that depend on logs.

To verify that traffic is forwarded to the stateful inspection engine that generates logs, you can add a custom-defined stateless rule group that covers the traffic you need to monitor, or you can set a default action for stateless traffic to be forwarded to stateful rule groups in the firewall policy, as shown in the following figure.

Figure 1: Set up stateless default actions to forward to stateful rule groups

Figure 1: Set up stateless default actions to forward to stateful rule groups

Alert logs and flow logs

Network Firewall provides two types of logs:

  • Alert — Sends logs for traffic that matches a stateful rule whose action is set to Alert or Drop.
  • Flow — Sends logs for network traffic that the stateless engine forwards to the stateful rules engine.

To grasp the use cases of alert and flow logs, let’s begin by understanding what a flow is from the view of the firewall. For the network firewall, network flow is a one-way series of packets that share essential IP header information. It’s important to note that the Network Firewall flow log differs from the VPC flow log, as it captures the network flow from the firewall’s perspective and it is summarized in JSON format.

For example, the following sequence shows how an HTTP request passes through the Network Firewall.

Figure 2: HTTP request passes through Network Firewall

Figure 2: HTTP request passes through Network Firewall

When you’re using a stateful rule to block egress HTTP traffic, the TCP connection will be established initially. When an HTTP request comes in, it will be evaluated by the stateful rule. Depending on the rule’s action, the firewall may send a TCP reset to the sender when a Reject action is configured, or it may drop the packets to block them if a Drop action is configured. In the case of a Drop action, shown in Figure 3, the Network Firewall decides not to forward the packets at the HTTP layer, and the closure of the connection is determined by the TCP timers on both the client and server sides.

Figure 3: HTTP request blocked by Network Firewall

Figure 3: HTTP request blocked by Network Firewall

In the given example, the Network Firewall generates a flow log that provides information like IP addresses, port numbers, protocols, timestamps, number of packets, and bytes of the traffic. However, it doesn’t include details about the stateful inspection, such as whether the traffic was blocked or allowed.

Figure 4 shows the inbound flow log.

Figure 4: Inbound flow log

Figure 4: Inbound flow log

Figure 5 shows the outbound flow log.

Figure 5: Outbound flow log

Figure 5: Outbound flow log

The alert log entry complements the flow log by containing stateful inspection details. The entry includes information about whether the traffic was allowed or blocked and also provides the hostname associated with the traffic. This additional information enhances the understanding of network activities and security events, as shown in Figure 6.

Figure 6: Alert log

Figure 6: Alert log

In summary, flow logs provide stateless information and are valuable for identifying trends, like monitoring IP addresses that transmit the most data over time in your network. On the other hand, alert logs contain stateful inspection details, making them helpful for troubleshooting and threat hunting purposes.

Keep in mind that flow logs can become excessive. When you’re forwarding traffic to a stateful inspection engine, flow logs capture the network flows crossing your Network Firewall endpoints. Because log volume affects overall costs, it’s essential to choose the log type that suits your use case and security needs. If you don’t need flow logs for traffic flow trends, consider only enabling alert logs to help reduce expenses.

Effective logging with alert rules

When you write stateful rules using the Suricata format, set the alert rule to be evaluated before the pass rule to log allowed traffic. Be aware that:

  • You must enable strict rule evaluation order to allow the alert rule to be evaluated before the pass rule. Otherwise the order of evaluation by default is pass rules first, then drop, then alert. The engine stops processing rules when it finds a match.
  • When you use pass rules, it’s recommended to add a message to remind anyone looking at the policy that these rules do not generate messages. This will help when developing and troubleshooting your rules.

For example, the rules below will allow traffic to a target with a specific Server Name Indication (SNI) and log the traffic that was allowed. As you can see in the pass rule, it includes a message to remind the firewall policy maker that pass rules don’t alert. The alert rule evaluated before the pass rule logs a message to tell the log viewer which rule allows the traffic. This way you can see allowed domains in the logs.


alert tls $HOME_NET any -> $EXTERNAL_NET any (tls.sni; content:"www.example.com"; nocase; startswith; endswith; msg:"Traffic allowed by rule 72912"; flow:to_server, established; sid:82912;)
pass tls $HOME_NET any -> $EXTERNAL_NET any (tls.sni; content:"www.example.com"; nocase; startswith; endswith; msg:"Pass rules don't alert"; flow:to_server, established; sid:72912;)

This way you can see allowed domains in the alert logs.

Figure 7: Allowed domain in the alert log

Figure 7: Allowed domain in the alert log

Log destination considerations

Network Firewall supports the following log destinations:

You can select the destination that best fits your organization’s processes. In the next sections, we review the most common pattern for each log destination and walk you through the cost considerations, assuming a scenario in which you generate 15 TB Network Firewall logs in us-east-1 Region per month.

Amazon S3

Network Firewall is configured to inspect traffic and send logs to an S3 bucket in JSON format using Amazon CloudWatch vended logs, which are logs published by AWS services on behalf of the customer. Optionally, logs in the S3 bucket can then be queried using Amazon Athena for monitoring and analysis purposes. You can also create Amazon QuickSight dashboards with an Athena-based dataset to provide additional insight into traffic patterns and trends, as shown in Figure 8.

Figure 8: Architecture diagram showing AWS Network Firewall logs going to S3

Figure 8: Architecture diagram showing AWS Network Firewall logs going to S3

Cost considerations

Note that Network Firewall logging charges for the pattern above are the combined charges for CloudWatch Logs vended log delivery to the S3 buckets and for using Amazon S3.

CloudWatch vended log pricing can influence overall costs significantly in this pattern, depending on the amount of logs generated by Network Firewall, so it’s recommended that your team be aware of the charges described in Amazon CloudWatch Pricing – Amazon Web Services (AWS). From the CloudWatch pricing page, navigate to Paid Tier, choose the Logs tab, select your Region and then under Vended Logs, see the information for Delivery to S3.

For Amazon S3, go to Amazon S3 Simple Storage Service Pricing – Amazon Web Services, choose the Storage & requests tab, and view the information for your Region in the Requests & data retrievals section. Costs will be dependent on storage tiers and usage patterns and the number of PUT requests to S3.

In our example, 15 TB is converted and compressed to approximately 380 GB in the S3 bucket. The total monthly cost in the us-east-1 Region is approximately $3800.

Long-term storage

There are additional features in Amazon S3 to help you save on storage costs:

Analytics and reporting

Athena and QuickSight can be used for analytics and reporting:

  • Athena can perform SQL queries directly against data in the S3 bucket where Network Firewall logs are stored. In the Athena query editor, a single query can be run to set up the table that points to the Network Firewall logging bucket.
  • After data is available in Athena, you can use Athena as a data source for QuickSight dashboards. You can use QuickSight to visualize data from your Network Firewall logs, taking advantage of AWS serverless services.
  • Please note that using Athena to scan firewall data in S3 might increase costs, as can the number of authors, users, reports, alerts, and SPICE data used in QuickSight.

Amazon CloudWatch Logs

In this pattern, shown in Figure 9, Network Firewall is configured to send logs to Amazon CloudWatch as a destination. Once the logs are available in CloudWatch, CloudWatch Log Insights can be used to search, analyze, and visualize your logs to generate alerts, notifications, and alarms based on specific log query patterns.

Figure 9: Architecture diagram using CloudWatch for Network Firewall Logs

Figure 9: Architecture diagram using CloudWatch for Network Firewall Logs

Cost considerations

Configuring Network Firewall to send logs to CloudWatch incurs charges based on the number of metrics configured, metrics collection frequency, the number of API requests, and the log size. See Amazon CloudWatch Pricing for additional details.

In our example of 15 TB logs, this pattern in the us-east-1 Region results in approximately $6900.

CloudWatch dashboards offers a mechanism to create customized views of the metrics and alarms for your Network Firewall logs. These dashboards incur an additional charge of $3 per month for each dashboard.

Contributor Insights and CloudWatch alarms are additional ways that you can monitor logs for a pre-defined query pattern and take necessary corrective actions if needed. Contributor Insights are charged per Contributor Insights rule. To learn more, go to the Amazon CloudWatch Pricing page, and under Paid Tier, choose the Contributor Insights tab. CloudWatch alarms are charged based on the number of metric alarms configured and the number of CloudWatch Insights queries analyzed. To learn more, navigate to the CloudWatch pricing page and navigate to the Metrics Insights tab.

Long-term storage

CloudWatch offers the flexibility to retain logs from 1 day up to 10 years. The default behavior is never expire, but you should consider your use case and costs before deciding on the optimal log retention period. For cost optimization, the recommendation is to move logs that need to be preserved long-term or for compliance from CloudWatch to Amazon S3. Additional cost optimization can be achieved through S3 tiering. To learn more, see Managing your storage lifecycle in the S3 User Guide.

AWS Lambda with Amazon EventBridge, as shown in the following sample code, can be used to create an export task to send logs from CloudWatch to Amazon S3 based on an event rule, pattern matching rule, or scheduled time intervals for long-term storage and other use cases.

import boto3
import os
import datetime


GROUP_NAME = "/AnfwDemo/Anfw/Alert"
DESTINATION_BUCKET = "cwexportlogs-blog"
PREFIX = "network-logs"
NDAYS = 1
nDays = int(NDAYS)

currentTime = datetime.datetime.now()
StartDate = currentTime - datetime.timedelta(days=nDays)
EndDate = currentTime - datetime.timedelta(days=nDays - 1)


fromDate = int(StartDate.timestamp() * 1000)
toDate = int(EndDate.timestamp() * 1000)

BUCKET_PREFIX = os.path.join(PREFIX, StartDate.strftime('%Y{0}%m{0}%d').format(os.path.sep))

def lambda_handler(event, context):
    client = boto3.client('logs')
    response = client.create_export_task(
         logGroupName=GROUP_NAME,
         fromTime=fromDate,
         to=toDate,
         destination=DESTINATION_BUCKET,
         destinationPrefix=BUCKET_PREFIX
        )
    print(response)

Figure 10 shows how EventBridge is configured to trigger the Lambda function periodically.

Figure 10: EventBridge scheduler for daily export of CloudWatch logs

Figure 10: EventBridge scheduler for daily export of CloudWatch logs

Analytics and reporting

CloudWatch Insights offers a rich query language that you can use to perform complex searches and aggregations on your Network Firewall log data stored in log groups as shown in Figure 11.

The query results can be exported to CloudWatch dashboard for visualization and operational decision making. This will help you quickly identify patterns, anomalies, and trends in the log data to create the alarms for proactive monitoring and corrective actions.

Figure 11: Network Firewall logs ingested into CloudWatch and analyzed through CloudWatch Logs Insights

Figure 11: Network Firewall logs ingested into CloudWatch and analyzed through CloudWatch Logs Insights

Amazon Kinesis Data Firehose

For this destination option, Network Firewall sends logs to Amazon Kinesis Data Firehose. From there, you can choose the destination for your logs, including Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and an HTTP endpoint that’s owned by you or your third-party service providers. The most common approach for this option is to deliver logs to OpenSearch, where you can index log data, visualize, and analyze using dashboards as shown in Figure 12.

In the blog post How to analyze AWS Network Firewall logs using Amazon OpenSearch Service, you learn how to build network analytics and visualizations using OpenSearch in detail. Here, we discuss only some cost considerations of using this pattern.

Figure 12: Architecture diagram showing AWS Network Firewall logs going to OpenSearch

Figure 12: Architecture diagram showing AWS Network Firewall logs going to OpenSearch

Cost considerations

The charge when using Kinesis Data Firehose as a log destination is for CloudWatch Logs vended log delivery. Ingestion pricing is tiered and billed per GB ingested in 5 KB increments. See Amazon Kinesis Data Firehose Pricing under Vended Logs as source. There are no additional Kinesis Data Firehose charges for delivery unless optional features are used.

For 15 TB of log data, the cost of CloudWatch delivery and Kinesis Data Firehose ingestion is approximately $5400 monthly in the us-east-1 Region.

The cost for Amazon OpenSearch Service is based on three dimensions:

  • Instance hours, which are the number of hours that an instance is available to you for use
  • The amount of storage you request
  • The amount of data transferred in and out of OpenSearch Service

Storage pricing depends on the storage tier and type of instance that you choose. See pricing examples of using OpenSearch Service. When creating your OpenSearch domain, see Sizing Amazon OpenSearch Service domains to help you right-size your OpenSearch domain. Other cost optimization best practices include choosing the right storage tier and using AWS Graviton2 instances to improve performance.

For instance, allocating approximately 15 TB of UltraWarm storage in the us-east-1 Region will result in a monthly cost of $4700. Keep in mind that in addition to storage costs, you should also account for compute instances and hot storage.

In short, the estimated total cost for log ingestion and storage in the us-east-1 Region for this pattern is at least $10,100.

Leveraging OpenSearch will enable you to promptly investigate, detect, analyze, and respond to security threats.

Summary

The following table shows a summary of the expenses and advantages of each solution. Since storing logs is a fundamental aspect of log management, we use the monthly cost of using Amazon S3 as the log delivery destination as our baseline when making these comparisons.

Pattern Log delivery and storage cost as a multiple of the baseline cost Functionalities Dependencies
Amazon S3, Athena, QuickSight 1 The most economical option for log analysis. The solution requires security engineers to have a good analytics skillset. Familiarity with Athena query and query running time will impact the incident response time and the cost.
Amazon CloudWatch 1.8 Log analysis, dashboards, and reporting can be implemented from the CloudWatch console. No additional service is needed. The solution requires security engineers to be comfortable with CloudWatch Logs Insights query syntax. The CloudWatch Logs Insights query will impact the incident response time and the cost.
Amazon Kinesis Data Firehose, OpenSearch 2.7+ Investigate, detect, analyze, and respond to security threats quickly with OpenSearch. The solution requires you to invest in managing the OpenSearch cluster.

You have the flexibility to select distinct solutions for flow logs and alert logs based on your requirements. For flow logs, opting for Amazon S3 as the destination offers a cost-effective approach. On the other hand, for alert logs, using the Kinesis Data Firehose and OpenSearch solution allows for quick incident response. Minimizing the time required to address ongoing security challenges can translate to reduced business risk at different costs.

Conclusion

This blog post has explored various patterns for Network Firewall log management, highlighting the cost considerations associated with each approach. While cost is a crucial factor in designing an efficient log management solution, it’s important to consider other factors such as real-time requirements, solution complexity, and ownership. Ultimately, the key is to adopt a log management pattern that aligns with your operational needs and budgetary constraints. Network security is an iterative practice, and by optimizing your log management strategy, you can enhance your overall security posture while effectively managing costs.

For more information about working with Network Firewall, see What is AWS Network Firewall?

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Sharon Li

Sharon Li

Sharon is an Enterprise Solutions Architect at Amazon Web Services based in Boston, with a passion for designing and building secure workloads on AWS. Prior to her current role at AWS, Sharon worked as a software development engineer at Amazon, where she played a key role in bringing security into the development process.

Larry Tewksbury

Larry Tewksbury

Larry is an AWS Technical Account Manager based in New Hampshire. He works with enterprise customers in the Northeast to understand, scale, and optimize their cloud operations. Outside of work, he enjoys spending time with his family, hiking, and tech-based hobbies.

Shashidhar Makkapati

Shashidhar Makkapati

Shashidhar is an Enterprise Solutions Architect at Amazon Web Services, based in Charlotte, NC. With over two decades of experience as an enterprise architect, he has a keen focus on cloud adoption and digital transformation in the financial services industry. Shashidhar supports enterprise customers in the US Northeast. In his free time, he enjoys reading, traveling, and spending time with his family.

The art and science of data product portfolio management

Post Syndicated from Faris Haddad original https://aws.amazon.com/blogs/big-data/the-art-and-science-of-data-product-portfolio-management/

This post is the first in a series dedicated to the art and science of practical data mesh implementation (for an overview of data mesh, read the original whitepaper The data mesh shift). The series attempts to bridge the gap between the tenets of data mesh and its real-life implementation by deep-diving into the functional and non-functional capabilities essential to a working operating model, laying out the decisions that need to be made for each capability, and describing the key business and technical processes required to implement them. Taken together, the posts in this series lay out some possible operating models for data mesh within an organization.

Kudzu

Kudzu—or kuzu (クズ)—is native to Japan and southeast China. First introduced to the southeastern United States in 1876 as a promising solution for erosion control, it now represents a cautionary tale about unintended consequences, as Kudzu’s speed of growth outcompetes everything from native grasses to tree systems by growing over and shading them from the sunlight they need to photosynthesize—eventually leading to species extinction and loss of biodiversity. The story of Kudzu offers a powerful analogy to the dangers and consequences of implementing data mesh architectures without fully understanding or appreciating how they are intended to be used. When the “Kudzu” of unmanaged pseudo-data products (methods of sharing data that masquerade as data products while failing to fulfill the myriad obligations associated with them) has overwhelmed the local ecosystem of true data products, eradication is costly and prone to failure, and can represent significant wasted effort and resources, as well as lost time.

Desert

While Kudzu was taking over the south in the 1930s, desertification caused by extensive deforestation was overwhelming the Midwest, with large tracts of land becoming barren and residents forced to leave and find other places to make a living. In the same way, overly restrictive data governance practices that either prevent data products from taking root at all, or pare them back too aggressively (deforestation), can over time create “data deserts” that drive both the producers and consumers of data within an organization to look elsewhere for their data needs. At the same time, unstructured approaches to data mesh management that don’t have a vision for what types of products should exist and how to ensure they are developed are at high risk of creating the same effect through simple neglect. This is due to a  common misconception about data mesh as a data strategy, which is that it is effectively self-organizing—meaning that once presented with the opportunity, data owners within the organization will spring to the responsibilities and obligations associated with publishing high-quality data products. In reality, the work of a data producer is often thankless, and without clear incentive strategies, organizations may end up with data deserts that create more data governance issues as producers and consumers go elsewhere to seek out the data they need to perform work.

Bonsai

Bonsai (盆栽) is an art form originating from an ancient Chinese tradition called penjing (盆景), and later shaped by the minimalist teachings of Zen Buddhism into the practice we know and recognize today. The patient practice of Bonsai offers useful analogies to the concepts and processes required to avoid the chaos of Kudzu as well as the specter of organizational data deserts. Bonsai artists carefully observe the naturally occurring buds that are produced by the tree and encourage those that add to the overall aesthetics of the tree, while pruning those that don’t work well with their neighbors. The same ideas apply equally well to data products within a data mesh—by encouraging the growth and adoption of those data products that add value to our data mesh, and continuously pruning those that do not, we maximize the value and sustainability of our data mesh implementations. In a similar vein, Bonsai artists must balance their vision for the shape of the tree with a respect for the natural characteristics and innate structure of the species they have chosen to work with—to ignore the biology of the tree would be disastrous to the longevity of the tree, as well as to the quality of the art itself. In the same way, organizations seeking to implement successful data mesh strategies must respect the nature and structure (legal, political, commercial, technology) of their organizations in their implementation.

Of the key capabilities proposed for the implementation of a sustainable data mesh operating model, the one that is most relevant to the problems we’ve described—and explore later in this post—is data product portfolio management.

Overview of data product portfolio management

Data mesh architectures are, by their nature, ideal for implementation within federated organizations, with decentralized ownership of data and clear legal, regulatory, or commercial boundaries between entities or lines of business. The same organizational characteristics that make data mesh architectures valuable, however, also put them at risk of turning into one of the twin nightmares of Kudzu or data deserts.

To define the shape and nature of an organizational data mesh, a number of key questions need to be answered, including but not limited to:

  • What are the key data domains within the organization? What are the key data products within these domains needed to solve current business problems? How do we iterate on this discovery process to add value while we are mapping our domains?
  • Who are the consumers in our organization, and what logical, regulatory, physical, or commercial boundaries might separate them from producers and their data products?
  • How do we encourage the development and maintenance of key data products in a decentralized organization?
  • How do we monitor data products against their SLAs, and ensure alerting and escalation on failure so that the organization is protected from bad data?
  • How do we enable those we see as being autonomous producers and consumers with the right skills, the right tools, and the right mindset to actually want to (and be able to) take more ownership of independently publishing data as a product and consuming it responsibly?
  • What is the lifecycle of a data product? When do new data products get created, and who is allowed to create them? When are data products deprecated, and who is accountable for the consequences to their consumers?
  • How do we define “risk” and “value” in the context of data products, and how can we measure this? Whose responsibility is it to justify the existence of a given data product?

To answer questions such as these and plan accordingly, organizations must implement data product portfolio management (DPPM). DPPM does not exist in a vacuum—by its nature, DPPM is closely related to and interdependent with enterprise architecture practices like business capability management and project portfolio management. DPPM itself may therefore also be considered, in part, an enterprise architecture practice.

As an enterprise architecture practice, DPPM is responsible for its implementation, which should reside within a function whose remit is appropriately global and cross-functional. This may be within the CDO office for those organizations that have a CDO or equivalent central data function, or the enterprise architecture team in organizations that do not.

Goals of DPPM

The goals of DPPM can be summarized as follows:

  • Protect value – DPPM protects the value of the organizational data strategy by developing, implementing, and enforcing frameworks to measure the contribution of data products to organizational goals in objective terms. Examples may include associated revenue, savings, or reductions in operational losses. Earlier in their lifecycle, data products may be measured by alternative metrics, including adoption (number of consumers) and level of activity (releases, interaction with consumers, and so on). In the pursuit of this goal, the DPPM capability is accountable for engaging with the business to continuously priorities where data as a product can add value and align delivery priority accordingly. Strategies for measuring value and prioritizing data products are explored later in this post.
  • Manage risk – All data products introduce risk to the organization—risk of wasted money and effort through non-adoption, risk of operational loss associated with improper use, and risk of failure on the part of the data product to meet requirements on availability, completeness, or quality. These risks are exacerbated in the case of proliferation of low-quality or unsupervised data products. DPPM seeks to understand and measure these risks on an individual and aggregated basis. This is a particularly challenging goal because what constitutes risk associated with the existence of a particular data product is determined largely by its consumers and is likely to change over time (though like entropy, is only ever likely to increase).
  • Guide evolution – The final goal of DPPM is to guide the evolution of the data product landscape to meet overarching organizational data goals, such as mutually exclusive or collectively exhaustive domains and data products, the identification and enablement of single-threaded ownership of product definitions, or the agile inclusion of new sources of data and creation of products to serve tactical or strategic business goals. Some principles for the management of data mesh evolution, and the evaluation of data products against organizational goals, are explored later in this post.

Challenges of DPPM

In this section, we explore some of the challenges of DPPM, and the pragmatic ways some of these challenges could be addressed.

Infancy

Data mesh as a concept is still relatively new. As such, there is little standardization associated with practical operating models for building and managing data mesh architectures, and no access to fully fledged out-of-the-box reference operating models, frameworks, or tools to support the practice of DPPM.

Some elements of DPPM are supported in disparate tools (for example, some data catalogs include basic community features that contribute to measuring value), but not in a holistic way. Over time, standardization of the processes associated with DPPM will likely occur as a side-effect of commoditization, driven by the popularity and adoption of new services that take on and automate more of the undifferentiated heavy lifting associated with mesh supervision. In the meantime, however, organizations adopting data mesh architectures are left largely to their own devices around how to operate them effectively.

Resistance

The purest expression of democracy is anarchy, and the more federated an organization is (itself a supporting factor in choosing data mesh architectures), the more resistance may be observed to any forms of centralized governance. This is a challenge for DPPM, because in some way it must come together in one place. Just as the Bonsai artist knows the vision for the entire tree, there must be a cohesive vision for and ability to guide the evolution of a data mesh, no matter how broadly federated and autonomous individual domains or data products might be.

Balancing this with the need to respect the natural shape (and culture) of an organization, however, requires organizations that implement DPPM to think about how to do so in a way that doesn’t conflict with the reality of the organization. This might mean, for example, that DPPM may need to happen at several layers—at minimum within data domains, possibly within lines of business, and then at an enterprise level through appropriate data committees, guilds, or other structures that bring stakeholders together. All of this complicates the processes and collaboration needed to perform DPPM effectively.

Maturity

Data mesh architectures, and therefore DPPM, presume relatively high levels of data maturity within an organization—a clear data strategy, understanding of data ownership and stewardship, principles and policies that govern the use of data, and a moderate-to-high level of education and training around data within the organization. A lack of data maturity within the organization, or a weak or immature enterprise architecture function, will face significant hurdles in the implementation of any data mesh architecture, let alone a strong and useful DPPM practice.

In reality, however, data maturity is not uniform across organizations. Even in seemingly low-maturity organizations, there are often teams who are more mature and have a higher appetite to engage. By leaning into these teams and showing value through them first, then using them as evangelists, organizations can gain maturity while benefitting earlier from the advantages of data mesh strategies.

The following sections explore the implementation of DPPM along the lines of people, process, and technology, as well as describing the key characteristics of data products—scope, value, risk, uniqueness, and fitness—and how they relate to data mesh practices.

People

To implement DPPM effectively, a wide variety of stakeholders in the organization may need to be involved in one capacity or another. The following table suggests some key roles, but it’s up to an individual organization to determine how and if these map to their own roles and functions.

Function RACI Role Responsibility
Senior Leadership A Chief Data Officer Ultimately accountable for organizational data strategy and implementation. Approves changes to DPPM principles and operating model. Acts as chair of, and appoints members to, the data council.
. R Data Council** Stakeholder body representing organizational governance around data strategy. Acts as steering body for the governance of DPPM as a practice (KPI monitoring, maturity assessments, auditing, and so on). Approves changes to guidelines and methodologies. Approves changes to data product portfolio (discussed later in this post). Approves and governs centrally funded and prioritized data product development activities.
Enterprise Architecture AR Head of Enterprise Architecture Responsible for development and enforcement of data strategy. Accountable and responsible for the design and implementation of DPPM as an organizational capability.
. R Domain Architect Responsible for the implementing screening, data product analysis, periodic evaluation, and optimal portfolio selection practices. Responsible for the development of methodologies and their selection criteria.
Legal & Compliance C Legal & Compliance Officer Consults on permissibility of data products with reference to local regulation. Consults on permissibility of data sharing with reference to local regulation or commercial agreements.
. C Data Privacy Officer Consults on permissibility of data use with reference to local data privacy law. Consults on permissibility of cross-entity or border data sharing with reference to data privacy law.
Information Security RC Information Security Officer Consults on maturity assessments (discussed later in this post) for information security-relevant data product capabilities. Approves changes to data product technology architecture. Approves changes to IAM procedures relating to data products.
Business Functions A Data Domain Owner Ultimately accountable for the appropriate use of domain data, as well as its quality and availability. Accountable for domain data products. Approves changes to the domain data model and domain data product portfolio.
c R Data Domain Steward Responsible for implementing data domain responsibilities, including operational (day-to-day) governance of domain data products. Approves use of domain data in new data products, and performs regular (such as yearly) attestation of data products using domain data.
. A Data Owner Ultimately accountable for the appropriate use of owned data (for example, CRM data), as well as its quality and availability.
. R Data Steward Responsible for implementing data responsibilities. Approves use of owned data in new data products, and performs regular (such as yearly) attestation of data products using owned data.
. AR Data Product Owner Accountable and responsible for the design, development, and delivery of data products against their stated SLOs. Contributes to data product analysis and portfolio adjustment practices for own data products.

** The data council typically consists of permanent representatives from each function (data domain owners), enterprise architecture, and the chief data officer or equivalent.

Process

The following diagram illustrates the strategic, tactical, and operational practices associated with DPPM. Some considerations for the implementation of these practices is explored in more detail in this post, though their specific interpretation and implementation is dependent on the individual organization.

Boundaries

When reading this section, it’s important to bear in mind the impact of boundaries—although strategy development may be established as a global practice, other practices within DPPM must respect relevant organizational boundaries (which may be physical, geographical, operational, legal, commercial, or regulatory in nature). In some cases, the existence of boundaries may require some or all tactical and operational practices to be duplicated within each associated boundary. For example, an insurance company with a property and casualty legal entity in North America and a life entity in Germany may need to implement DPPM separately within each entity.

Strategy development

This practice deals with answering questions associated with the overall data mesh strategy, including the following:

  • The overall scope (data domains, participating entities, and so on) of the data mesh
  • The degree of freedom of participating entities in their definition and implementation of the data mesh (for example, a mesh of meshes vs. a single mesh)
  • The distribution of responsibilities for activities and capabilities associated with the data mesh (degree of democratization)
  • The definition and documentation of key performance indicators (KPIs) against which the data mesh should be governed (such as risk and value)
  • The governance operating model (including this practice)

Key deliverables include the following:

  • Organizational guidelines for operational processes around pre-screening and screening of data products
  • Well-defined KPIs that guide methodology development and selection for practices like data product analysis, screening, and optimal portfolio selection
  • Allocation of organizational resources (people, budget, time) to the implementation of tactical processes around methodology development, optimal portfolio selection, and portfolio adjustment

Key considerations

In this section, we discuss some key considerations for strategy development.

Data mesh structure

This diagram illustrates the analogous relationship between data products in a data mesh, and the structure of the mesh itself.

The following considerations relate to screening, data product analysis, and optimal portfolio selection.

  • Trunk (core data products) – Core data products are those that are central to the organization’s ability to function, and from which the majority of other data products are derived. These may be data products consumed in the implementation of key business activities, or associated with critical processes such as regulatory reporting and risk management. Organizational governance for these data products typically favors availability and data accuracy over agility.
  • Branch (cross-domain data products) – Cross-domain data products represent the most common cross-domain use cases for data (for example, joining customer data with product data). These data products may be widely used across business functions to support reporting and analytics, and—to a lesser extent—operational processes. Because these data products may consume a variety of sources, organizational governance may favor a balanced view on agility vs. reliability, accepting some degree of risk in return for being able to adapt to changes in data sources. Data product versioning can offer mitigation of risks associated with change.
  • Leaf (everything else) – These are the myriad data products that may arise within a data mesh, either as permanent additions to support individual teams and use cases or as temporary data products to fill data gaps or support time-limited initiatives. Because the number of these data products may be high and risks are typically limited to a single process or a small part of the organization, organizational governance typically favors a light touch and may prefer to govern through guidelines and best practices, rather than through active participation in the data product lifecycle.

Data products vs. data definitions

The following figure illustrates how data definitions are defined and inherited throughout the lineage of data products.

In a data mesh architecture, data products may inherit data from each other (one data product consumes another in its data pipeline) or independently publish data within (or related to) the same domain. For example, a customer data product may be inherited by a customer support data product, while another the customer journey data product may directly publish customer-relevant data from independent sources. When no standards are applied to how domain data attributes are used and published, data products even within the same data domain may lose interoperability because it becomes difficult or impossible to join them together for reporting or analytics purposes.

To prevent this, it can be useful to distinguish between data products and data definitions. Typically, organizations will select a single-threaded owner (often a data owner or steward, or a domain data owner or steward) who is responsible for defining minimal data definitions for common and reusable data entities within data domains. For example, a data owner responsible for the sales and marketing data domain may identify a customer data product as a reusable data entity within the domain and publish a minimal data definition that all producers of customer-relevant data must incorporate within their data products, to ensure that all data products associated with customer data are interoperable.

DPPM can assist in the identification and production of data definitions as part of its data product analysis activities, as well as enforce their incorporation as part of oversight of data product development.

Service management thinking

These considerations relate to data product analysis, periodic evaluation, and methodology selection.

Data products are services provided to the organization or externally to customers and partners. As such, it may make sense to adapt a service management framework like ITIL, in combination with the ITIL Maturity Model, for use in evaluating the fitness of data products for their scope and audience, as well as in describing the roles, processes, and acceptable technologies that should form the operating model for any data product.

At the operational level, the stakeholders required to implement each practice may change depending on the scope of the data product. For example, the release management practice for a core data product may require involvement of the data council, whereas the same practice for a team data product may only involve the team or functional head. To avoid creating decision-making bottlenecks, organizations should aim to minimize the number of stakeholders in each case and focus on single-threaded owners wherever possible.

The following table proposes a subset of capabilities and how they might be applied to data products of different scopes. Suggested target maturity levels, between 1 and 5, are included for each scope. (1= Initial, 5= Optimizing)

Target Maturity Data Product Scope.
4 – 5 3 – 4 2 – 3 2
Capability    Core
  Cross-Domain
  Function / Team
  Personal
Information Security Management X X X X
Knowledge Management X X X .
Release Management X X X .
Service-Level Management X X X .
Measurement and Reporting X X . .
Availability Management X X . .
Capacity and Performance Management X X . .
Incident Management X X . .
Monitoring and Event Management X X . .
Service Validation and Testing X X . .

Methodology development

This practice deals with the development of concrete, objective frameworks, metrics, and processes for the measurement of data product value and risk. Because the driving factors behind risk and value are not necessarily the same between products, it may be necessary to develop several methodologies or variants thereof.

Key deliverables include the following:

  • Well-defined frameworks for measuring risk and value of data products, as well as for determining the optimal portfolio of data products
  • Operationally feasible, measurable metrics associated with value and risk

Key considerations

A key consideration for assessing data products is that of consumer value or risk vs. uniqueness. The following diagram illustrates how value and risk of a data product are driven by its consumers.

Data products don’t inherently present risk or add value, but rather indirectly pose—in an aggregated fashion—the risk and value created by their consumers.

In a consumer-centric value and risk model, governance of consumers ensures that all data use meets the following requirements:

  • Is associated with a business case justifying the use of data (for example, new business, cost reduction through business process automation, and so on)
  • Is regularly evaluated with reference to the risk associated with the use case (for example, regulatory reporting

The value and risk associated with the linked data products are then calculated as an aggregation. Where organizations already track use cases associated with data, either as part of data privacy governance or as a by-product of the access approval process, these existing systems and databases can be reused or extended.

Conversely, where data products overlap with each other, their value to the organization is reduced accordingly, because redundancies between data products represent an inefficient use of resources and increase organizational complexity associated with data quality management.

To ensure that the model is operationally feasible (see the key deliverables of methodology development), it may be sufficient to consider simple aggregations, rather than attempting to calculate value and risk attribution at a product or use case level.

Optimal portfolio selection

This practice deals with the determination of which combination of data products (existing, new, or potential) would best meet the organization’s current and known future needs. This practice takes input from data product analysis and data product proposals, as well as other enterprise architecture practices (for example, business architecture), and considers trade-offs between data-debt and time-to-value, as well as other considerations such as redundancy between data products to determine the optimal mix of permanent and temporary data products at any given point in time.

Because the number of data products in an organization may become significant over time, it may be useful to apply heuristics to the problem of optimal portfolio selection. For example, it may be sufficient to consider core and cross-domain data products (trunk and branches) during quarterly portfolio reviews, with other data products (leaves) audited on a yearly basis.

Key deliverables include the following:

  • A target state definition for the data mesh, including all relevant data products
  • An indication of organizational priorities for use by the portfolio adjustment practice

Key considerations

The following are key considerations regarding the data product half-life:

  • Long-term or strategic data products – These data products fill a long-term organizational need, are often associated with key source systems in various domains, and anchor the overall data strategy. Over time, as an organization’s data mesh matures, long-term data products should form the bulk of the mesh.
  • Time-bound data products – These data products fill a gap in data strategy and allow the organization to move on data opportunities until core data products can be updated. An example of this might be data products created and used in the context of mergers and acquisitions transactions and post-acquisition, to provide consistent data for reporting and business intelligence until mid-term and long-term application consolidation has taken place. Time-bound data products are considered as data-debt and should be managed accordingly.
  • Purpose-driven data products – These data products serve a narrow, finite purpose. Purpose-driven data products may or may not be time-bound, but are characterized primarily by a strict set of consumers known in advance. Examples of this might include:
    • Data products developed to support system-of-record harmonization between lines of business (for example, deduplication of customer records between insurance lines of business using separate CRM systems
    • Data products created explicitly for the monitoring of other data products (data quality, update frequency, and so on)

Portfolio adjustment

This practice implements the feasibility analysis, planning and project management, as well as communication and organizational change management activities associated with changes to the optimal portfolio. As part of this practice, a gap analysis is conducted between the current and target data product portfolio, and a set of required actions and estimated time and effort prepared for review by the organization. During such a period, data products may be marked for development (new data products to fill a need), changes, consolidation (merging two or more data products into a single data product), or deprecation. Several iterations of optimal portfolio selection and portfolio adjustment may be required to find an appropriate balance between optimality and feasibility of implementation.

Key deliverables include the following:

  • A gap analysis between the current and target data product portfolio, as well as proposed remediation activities
  • High-level project plans and effort or budget assessments associated with required changes, for approval by relevant stakeholders (such as the data council)

Data product proposals

This practice organizes the collection and prioritization of requests for new, or changes to existing, data products within the organization. Its implementation may be adapted from or managed by existing demand management processes within the organization.

Key deliverables include a registry of demand against new or existing data products, including metadata on source systems, attributes, known use cases, proposed data product owners, and suggested organizational priority.

Methodology selection

This practice is associated with the identification and application of the most appropriate methodologies (such as value and risk) during data product analysis, screening, and optimal portfolio selection. The selection of an appropriate methodology for the type, maturity, and scope of a data product (or an entire portfolio) is a key element in avoiding either a “Kudzu” mesh or a “data desert.”

Key deliverables include reusable selection criteria for mapping methodologies to data products during data product analysis, screening, and optimal portfolio selection.

Pre-screening

This optional practice is primarily a mechanism to avoid unnecessary time and effort in the comparatively expensive practice of data product analysis by offering simple self-service applications of guidelines to the evaluation of data products. An example might include the automated approval of data products that fall under the classification of personal data products, requiring only attestation on the part of the requester that they will uphold the relevant portions of the guideline that governs such data products.

Key deliverables include tools and checklists for the self-service evaluation of data products against guidelines and automated registration of approved data products.

Data product analysis

This practice incorporates guidelines, methodologies, as well as (where available) metadata relating to data products (performance against SLOs, service management metrics, degree of overlap with other data products) to establish an understanding of the value and risk associated with individual data products, as well as gaps between current and target capability maturities, and compliance with published product definitions and standards.

Key deliverables include a summary of findings for a particular data product, including scores for relevant value, risk, and maturity metrics, as well as operational gaps requiring remediation and recommendations on next steps (repair, enhance, decommission, and so on).

Screening

This optional practice is a mechanism to reduce complexity in optimal portfolio selection by ensuring the early removal of data products from consideration that fail to meet value or risk targets, or have been identified as redundant to other data products already available in the organization.

Key deliverables include a list of data products that should be slated for removal (direct-to-decommissioning).

Data product development

This practice is not performed directly under DPPM, but is managed in part by the portfolio adjustment practice, and may be governed by standards that are developed as part of DPPM. In the context of DPPM, this practice is primarily associated with ensuring that data products are developed according to the specifications agreed as part of portfolio adjustment.

Key deliverables include project management and software or service development deliverables and artefacts.

Data product decommissioning

This practice manages the decommissioning of data products and the migration of affected consumers to new or other data products where relevant. Unmanaged decommissioning of data products, especially those with many downstream consumers, can threaten the stability of the entire data mesh, as well as have significant consequences to business functions.

Key deliverables include a decommissioning plan, including stakeholder assessment and sign-off, timelines, migration plans for affected consumers, and back-out strategies.

Periodic evaluation

This practice manages the calendar and implementation of periodic reviews of the data mesh, both in its entirety as well as at the data product level, and is primarily an exercise in project management.

Key deliverables include the following:

  • yearly review calendar, published and made available to all data product owners and affected stakeholders
  • Project management deliverables and artefacts, including evidence of evaluations having been performed against each data product

Technology

Although most practices within DPPM don’t rely heavily on technology and automation, some key supporting applications and services are required to implement DPPM effectively:

  • Data catalog – Core to the delivery of DPPM is the organizational data catalog. Beyond providing transparency into what data products exist within an organization, a data catalog can provide key insights into data lineage between data products (key to the implementation of portfolio adjustment) and adoption of data products by the organization. The data catalog can also be used to capture and make available both the documented as well as the realized SLO for any given data product, and—through the use of a business glossary—assist in the identification of redundancy between data products.
  • Service management – Service management solutions (such as ServiceNOW) used in the context of data product management offer important insights into the fitness of data products by capturing and tracking incidents, problems, requests, and other metrics against data products.
  • Demand management – A demand management solution supports self-service implementation and automation of data product proposal and pre-screening activities, as well as prioritization activities associated with selection and development of data products.

Conclusion

Although this post focused on implementing DPPM in the context of a data mesh, this capability—like data product thinking—is not exclusive to data mesh architectures. The practices outlined here can be practiced at any scale to ensure that the production and use of data within the organization is always in line with its current and future needs, that governance is implemented in a consistent way, and that the organization can have Bonsai, not Kudzu.

For more information about data mesh and data management, refer to the following:

In upcoming posts, we will cover other aspects of data mesh operating models, including data mesh supervision and service management models for data product owners.


About the Authors


Maximilian Mayrhofer
is a Principal Solutions Architect working in the AWS Financial Services EMEA Go-to-Market team. He has over 12 years experience in digital transformation within private banking and asset management. In his free time, he is an avid reader of science fiction and enjoys bouldering.


Faris Haddad
is the Data & Insights Lead in the AABG Strategic Pursuits team. He helps enterprises successfully become data-driven.

AWS Security Profile: Get to know the AWS Identity Solutions team

Post Syndicated from Maddie Bacon original https://aws.amazon.com/blogs/security/aws-security-profile-get-to-know-the-aws-identity-solutions-team/

Remek Hetman, Principal Solutions Architect on the Identity Solutions team

Remek Hetman, Principal Solutions Architect on the Identity Solutions team

In this profile, I met with Ilya Epshteyn, Senior Manager of the AWS Identity Solutions team, to chat about his team and what they’re working on.


Let’s start with the basics. What does the Identity Solutions team do?
We are a team of specialist solutions architects (SAs) who are in the AWS Identity organization. At AWS, we have SAs who directly support customers, and we have SAs who are embedded in internal engineering teams—we are the latter. As SAs, we work on complex customer scenarios, build solutions, and create deep technical content on identity topics, including identity and access management—like blog posts, workshops, and sessions at our global events. A significant portion of our time is spent working internally with AWS product and engineering teams to help bring the customer experience perspective. Identity touches everything — it’s the fabric of every AWS service — and we want to help achieve a consistent identity experience for customers. To help do this, we use different tooling to proactively identify challenges in customers’ experience with identity across AWS.

What is the mission of the Identity Solutions team?
Our mission is to make it easier for customers to implement access controls that protect their data in a straightforward and consistent manner across AWS services. A consistent experience simplifies the implementation and validation of security controls. We help identify customer’s pain points and work with our service teams to improve their experiences. We also provide highly prescriptive guidance to customers around identity. We don’t want to just say, “here’s an option.” Our guidance comes from a place of knowing how it will be operationalized and implemented. We won’t recommend something to customers unless we’ve tried it ourselves.

In order to literally “try it ourselves,” we built and operate a large-scale AWS environment called Mirror World, in which we use AWS services from the perspective of an AWS customer. The environment allows us to create different controls and use them in conjunction with other tools and services, truly putting ourselves in the shoes of the customer. This is in line with our mission of “active empathy,” our #1 team tenet.

Interesting! Tell us more about Mirror World.
There are three main use cases for Mirror World:

  • We use it to understand and proactively identify challenges with the customer experience for existing and new AWS services and features. As new features are launched, we get early access and test them out so that we can improve the documentation and prescriptive guidance that we provide to customers.
  • We vend accounts in it. Internal field teams can request accounts and get their hands on a large-scale AWS environment with real customer setups, including organization-wide security controls and networking.
  • AWS service teams use this environment to see how customers experience their AWS service.

What are your other major focus areas right now?
Data perimeters — set of preventive guardrails in your AWS environment to help ensure that only your trusted identities are accessing trusted resources from expected networks — are a big focus for us. Because data perimeters touch so many different aspects of identity and access management, our team is helping to organize what the user experience will look like, and helping to define the future state of data perimeters. Team members Tatyana Yatskevich and Matt Luttrell went into more detail about this in their profiles.

What are some of the common questions you hear from customers?
Customers who have already been operating in the cloud for several years often tell us that they’re looking for opportunities to optimize their environment at scale. They’re maturing and managing hundreds or even thousands of accounts, so they commonly ask us for ways to simplify and scale their environment. For customers earlier in their journey, a common question is what lessons we have learned while working with more experienced customers so that they can benefit from their journey. Like Andy Jassy says, “There is no compression algorithm for experience.”

What do you wish customers would ask about more?
How to get rid of their long-term credentials to significantly reduce the chances of credentials becoming compromised. We realize that for some customers it’s an effort to move away from IAM users and long-term credentials. We’d love to hear from more customers how they’re moving away from them or what’s stopping them from doing so. We’ve done a better job setting newer customers on the right path with short-term credentials and IAM roles instead of users, but for more tenured customers, there’s still an opportunity to improve in this area.

Looking ahead, what are your goals for the team?
We’re lucky that our team has individuals with diverse backgrounds and skillsets that have enabled us to deliver on our mission. But if we want to make a bigger impact, we need to scale. We will continue to utilize Mirror World, do more with automation, and expand our team collaboration to further the consistent identity experience for our customers. We also recently launched a repo containing recommended service control policies, which we plan to continue expanding. And we’re going to continue to build end-to-end solutions for identity use cases, such as IAM Policy Validator for AWS CloudFormation. We will also continue identity enablement on complex topics, such as the data perimeter blog series and workshop, so that we can reach even more customers with prescriptive guidance. Stay tuned for more blog posts from our team coming soon here! If you’re interested in any of the topics mentioned in this post and would like to start a conversation, please reach out to your account team.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Maddie Bacon

Maddie (she/her) is a technical writer for Amazon Security with a passion for creating meaningful content that focuses on the human side of security and encourages a security-first mindset. She previously worked as a reporter and editor, and has a BA in Mathematics. In her spare time, she enjoys reading, traveling, and staunchly defending the Oxford comma.

Author

Ilya Epshteyn

Ilya is a Senior Manager of Identity Solutions in AWS Identity. He helps customers to innovate on AWS by building highly secure, available, and scalable architectures. He enjoys spending time outdoors and building Lego creations with his kids.

Let’s Architect! Resiliency in architectures

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-resiliency-in-architectures/

What is “resiliency”, and why does it matter? When we discussed this topic in an early 2022 edition of Let’s Architect!, we referenced the AWS Well-Architected Framework, which defines resilience as having “the capability to recover when stressed by load, accidental or intentional attacks, and failure of any part in the workload’s components.” Businesses rely heavily on the availability and performance of their digital services. Resilience has emerged as critical for any efficiently architected system, which is why it is a fundamental role in ensuring the reliability and availability of workloads hosted on the AWS Cloud platform.

In this newer edition of Let’s Architect!, we share some best practices for putting together resilient architectures, focusing on providing continuous service and avoiding disruptions. Ensuring uninterrupted operations is likely a primary objective when it comes to building a resilient architecture.

Understand resiliency patterns and trade-offs to architect efficiently in the cloud

In this AWS Architecture Blog post, the authors introduce five resilience patterns. Each of these patterns comes with specific strengths and trade-offs, allowing architects to personalize their resilience strategies according to the unique requirements of their applications and business needs. By understanding these patterns and their implications, organizations can design resilient cloud architectures that deliver high availability and efficient recovery from potential disruptions.

Take me to this Architecture Blog post!

Resilience patterns and tradeoffs

Resilience patterns and tradeoffs

Timeouts, retries, and backoff with jitter

Marc Broker discusses the inevitability of failures and the importance of designing systems to withstand them. He highlights three essential tools for building resilience: timeouts, retries, and backoff. By embracing these three techniques, we can create robust systems that maintain high availability in the face of failures. Timeouts, backoff, and jitter are fundamental to spread the traffic coming from clients and avoid overloading your systems. Building resilience is a fundamental aspect of ensuring the reliability and performance of AWS services in the ever-changing and dynamic technological landscape.

Take me to the Amazon Builders’ Library!

The Amazon Builder’s Library is a collection of technical resources produced by engineers at Amazon

The Amazon Builder’s Library is a collection of technical resources produced by engineers at Amazon

Prepare & Protect Your Applications From Disruption With AWS Resilience Hub

The AWS Resilience Hub not only protects businesses from potential downtime risks but also helps them build a robust foundation for their applications, ensuring uninterrupted service delivery to customers and users.

In this AWS Online Tech Talk, led by the Principal Product Manager of AWS Resilience Hub, the importance of a resilience hub to protect mission-critical applications from downtime risks is emphasized. The AWS Resilience Hub is showcased as a centralized platform to define, validate, and track application resilience. The talk includes strategies to avoid disruptions caused by software, infrastructure, or operational issues, plus there’s also a demo demonstrating how to apply these techniques effectively.

If you are interested in delving deeper into the services discussed in the session, AWS Resilience Hub is a valuable resource for monitoring and implementing resilient architectures.

Take me to this AWS Online Tech Talk!

AWS Resilience Hub recommendations

AWS Resilience Hub recommendations

Data resiliency design patterns with AWS

In this re:Invent 2022 session, data resiliency, why it matters to customers, and how you can incorporate it into your application architecture is discussed in depth. This session kicks off with the comprehensive overview of data resiliency, breaking down its core components and illustrating its critical role in modern application development. It, then, covers application data resiliency and protection designs, plus extending from the native data resiliency capabilities of AWS storage through DR solutions using AWS Elastic Disaster Recovery.

Take me to this re:Invent 2022 video!

Asynchronous cross-region replication

Asynchronous cross-region replication

See you next time!

Thanks for joining our discussion on architecture resiliency! See you in two weeks when we’ll talk about security on AWS.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

How Amazon CodeGuru Security helps you effectively balance security and velocity

Post Syndicated from Leo da Silva original https://aws.amazon.com/blogs/security/how_amazon_codeguru_security_helps_effectively_balance_security_and_velocity/

Software development is a well-established process—developers write code, review it, build artifacts, and deploy the application. They then monitor the application using data to improve the code. This process is often repeated many times over. As Amazon Web Services (AWS) customers embrace modern software development practices, they sometimes face challenges with the use of third-party code security tools, such as an overwhelming number of findings, high rates of false positives among those findings, and the logistics of tracking open issues across code versions.

Customers tell us they need help to identify the top risks in their application code as it is being built and to receive actionable recommendations to mitigate these risks. In this blog post, we demonstrate how the new Amazon CodeGuru Security service and its fully managed, machine learning (ML)-powered code security analysis capabilities provide intelligent recommendations to improve code security and quality. Amazon CodeGuru Security enhances the overall security posture of applications that are deployed in your environment while reducing the time to deploy in production.

Amazon CodeGuru Security is a managed static application security tool (SAST) service that is also available through Amazon CodeGuru Reviewer, Amazon CodeWhisperer security scanning, the Amazon SageMaker Studio CodeGuru extension, and Amazon Inspector code scanning.

Solution overview

In this blog post, we introduce you to the features and capabilities of Amazon CodeGuru Security. Amazon CodeGuru Security helps you focus on security risks that are relevant to your environment, along with contextually relevant remediation suggestions (provided as code diffs). Integration, centralization, and scalability of the service are facilitated by using an API-based design, plus bug tracking to automatically detect code fixes and close findings without user intervention. Amazon CodeGuru Security currently supports applications that are written in Python, Java, and JavaScript, along with associated artifacts like scripts, configuration, and documentation files.

Created to improve the security posture of applications that were built for the cloud, Amazon CodeGuru Security rules are developed in partnership with Amazon application security teams, applying learnings and adhering to best practices that govern the development of Amazon internal systems and services.

Amazon CodeGuru Security offers multiple integration points:

In Figure 1, you can see one of the proposed architecture patterns that supports the integration of Amazon CodeGuru Security into your existing application deployment pipeline. In this scenario, developers write application code and get it committed into Amazon CodeCommit. This event causes AWS CodeBuild to start building the application and the static security code analysis of the application code, using a pre-build hook. The code and build artifacts are copied to a local Amazon S3 bucket within your account, and Amazon CodeGuru Security scans the application assets.

Figure 1: Example of CodeGuru Security integration with deployment pipeline

Figure 1: Example of CodeGuru Security integration with deployment pipeline

Amazon CodeGuru detection engine

At the core of the CodeGuru detector design is the idea of user action in response to findings. Detectors flag security risks or quality issues with a high degree of precision, such that action can be taken directly to remediate the finding. With this goal in mind, we have designed the Guru Query Language (GQL) toolkit. GQL enables precise expression of scenario-centric micro-analyzers that check specific properties (for example, misuse of a particular Java cryptography library or API) through a wide range of analysis constructs (more than 200 at the time of publication).

Among these constructs are capabilities such as type inference (determining the precise types of variables and fields), inter-procedural analysis (analyzing across function boundaries), and advanced taint tracking capabilities, where untrusted data (from taint sources) is tracked through the application to determine whether it reaches security-sensitive operations (known as taint sinks) without being sanitized.

By using GQL, the rule author can combine constructs as building blocks to precisely match the vulnerable patterns that are being targeted. As an example, you can specify taint sources and sinks in a contextual way so that only data read from remote (as opposed to local) files is considered untrusted.

We benchmark detectors against ever-growing datasets, and improve them based on feedback from our partner security teams and customers, as well as metrics that we collect. Detectors are subjected to a rigorous quality control process. Starting from the detector specification, we work closely with subject matter experts (SMEs) to make sure that the suggestions cover the most important application surfaces and are not overly defensive in the warnings they raise. Moving from specification to implementation, detections are reviewed and sampled from shadow runs on live codebases with the same SMEs as well as internal CodeGuru users. If detectors meet an internal performance bar, they are launched internally at AWS. After they are launched, the detectors are monitored by using weekly metrics. A detector graduates into the commercial CodeGuru service only if it meets a high quality bar for several weeks.

Amazon CodeGuru Security uses a detection engine to find security issues in the application code that is scanned. The engine uses a Detector Library, which is a resource that contains detailed information about the CodeGuru security and code quality detectors, to help you build secure and efficient applications. Each detection page within the Detector Library contains descriptions, compliant and non-compliant example code snippets, severities, and additional information that helps you mitigate risks (such as Common Weakness Enumeration (CWE) numbers). The materials presented in the Amazon CodeGuru Detector Library are intended to be a high-level summary of the service’s capabilities, but might not be inclusive of all detectors or their functionality.

Bug Fix Tracking and code fixes

With user action as the ultimate goal, an important metric to us is whether code fixes are made in response to our recommendations. As such, AWS has designed a novel Bug Fix Tracking (BFT) algorithm, whose key functionality is to relate CodeGuru findings across revisions of a given codebase or application. If, for example, CodeGuru reports misuse of a cryptographic API on version V1 of codebase C, then BFT detects whether that misuse issue is still present when version V2 of C is scanned.

Tracking bugs and bug fixes are nontrivial. Code can be refactored into different locations within a file, and sometimes also into different files. In addition, syntax may be adjusted in ways that are orthogonal to fixing an issue (for example, if variables are renamed). The CodeGuru BFT algorithm constructs a bi-partite graph to relate a pair of findings across revisions, or otherwise declare a finding as either closed (no match in V2) or new (no match in V1).

Figure 2 shows the process that is used by BFT in tracking application bugs. After the application version being scanned is identified and the bug detection verification starts, BFT updates the database with its findings, validating the existing issues with findings uncovered in version N-1.

Figure 2: Overview of the Bug Fix Tracking algorithm

Figure 2: Overview of the Bug Fix Tracking algorithm

The algorithm is staged, starting from the simple case of 1:1 correspondence between findings, through cases where findings might have drifted to a new location but are otherwise the same. For the final, most complex scenario of fuzzy matching, we use advanced hashing techniques to establish the mapping.

BFT provides a metric that guides our own rule development and tuning process on an ongoing basis. Data about BFT findings is available to our customers through the CodeGuru Security API. With gathered data about fixes, security engineers and leaders can measure exposure to security risks, quantify the lifetime of high and critical security issues, monitor burn rate for security issues, and form other insights from the raw data.

Actionable recommendations and concrete remediation

To align with our goal of encouraging user action in response to our recommendations, we’ve added a feature powered by automated reasoning for including concrete remediation advice as part of CodeGuru recommendations. This comes in the form of a code diff, which you can apply mechanically by using standard utilities like patch.

The screenshot in Figure 3 shows how this functionality creates an important bridge between security engineers and software engineers—the former have the necessary security expertise, while the latter are often responsible for carrying out the code fix. Recommendations that are accompanied by concrete fix suggestions can cut through multiple correspondences, alignment issues, and validation cycles, which can help accelerate remediation.

Figure 3: Example of recommendation showing difference between compliant and non-compliant code

Figure 3: Example of recommendation showing difference between compliant and non-compliant code

To enable the reasoning illustrated in Figure 3, where the data reaching the addObject call goes through sanitization in the form of an HtmlUtils::htmlEscape call, the underlying algorithm performs several steps. First, a formal representation of the code, known as its Abstract Syntax Tree (AST), is constructed. The AST is then visited by one or more transformation “recipes,” whose goal is to manipulate the program such that the vulnerability is mitigated.

Code transformation is done in a contextual manner, so that syntax (for example, variable names) and formatting (for example, indentation levels) are preserved. To verify that the transformation is valid, the algorithm further runs post-processing checks on the resulting code structure and syntax.

An important refinement of the remediation capability is that Amazon CodeGuru Security performs pre-analysis ahead of running the security scan to classify code artifacts into application- versus library-dependencies. It’s more feasible to take action on a recommendation for code owned by you, compared to code in a third-party library. The classification algorithm has been trained on hundreds of thousands of open-source libraries to disassemble code artifacts, including bundling application and library content in the same file, and focus downstream analysis on the most pertinent scanning surfaces.

Critical security issues have been shown to sometimes take hundreds of days to address (as discussed in this study). Internal studies that look at use of CodeGuru have seen a steep drop in time to fix issues thanks to concrete fix suggestions, which is value that the service excited to share with you.

Conclusion

Amazon CodeGuru Security is a static application security testing (SAST) tool that combines ML and automated reasoning to identify security issues in your code. Amazon CodeGuru detection capabilities that use GQL (Guru Query Language), Bug Fix Tracking (BFT), and efficacy mechanisms and AppSec expertise can help you precisely identify code security issues with a low rate of false positives. High signal-to-noise ratio is a key enabler in integrating SAST into the daily work of security engineers and software developers.

In addition, Amazon CodeGuru Security provides thorough fix recommendations, which your development teams can use to improve the overall time to remediate application security issues. At the same time, the recommendations can help you to implement security best practices based on an ML model that was trained on millions of lines of code and vulnerability assessments performed within Amazon. Get started with Amazon CodeGuru Security.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Leo da Silva

Leo da Silva

Leo is a Security Specialist Solutions Architect at AWS who uses his knowledge to help customers better utilize cloud services and technologies securely. Over the years, Leo has had the opportunity to work in large, complex environments, designing, architecting, and implementing highly scalable and secure solutions for global companies. He is passionate about football, BBQ, and Jiu Jitsu—the Brazilian version of them all.

Omer Tripp

Omer Tripp

Omer is a Principal Applied Scientist on the Amazon CodeGuru team. His research work is at the intersection of programming languages, machine learning, and security. Outside of work, Omer likes to stay physically active (through tennis, basketball, skiing, and various other activities), as well as tour the US and the world with his family.

Let’s Architect! DevOps Best Practices on AWS

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-devops-best-practices-on-aws/

DevOps has revolutionized software development and operations by fostering collaboration, automation, and continuous improvement. By bringing together development and operations teams, organizations can accelerate software delivery, enhance reliability, and achieve faster time-to-market.

In this blog post, we will explore the best practices and architectural considerations for implementing DevOps with Amazon Web Services (AWS), enabling you to build efficient and scalable systems that align with DevOps principles. The Let’s Architect! team wants to share useful resources that help you to optimize your software development and operations.

DevOps revolution

Distributed systems are adopted from enterprises more frequently now. When an organization wants to leverage distributed systems’ characteristics, it requires a mindset and approach shift, akin to a new model for software development lifecycle.

In this re:Invent 2021 video, Emily Freeman, now Head of Community Engagement at AWS, shares with us the insights gained in the trenches when adapting a new software development lifecycle that will help your organization thrive using distributed systems.

Take me to this re:Invent 2021 video!

Operationalizing the DevOps revolution

Operationalizing the DevOps revolution

My CI/CD pipeline is my release captain

Designing effective DevOps workflows is necessary for achieving seamless collaboration between development and operations teams. The Amazon Builders’ Library offers a wealth of guidance on designing DevOps workflows that promote efficiency, scalability, and reliability. From continuous integration and deployment strategies to configuration management and observability, this resource covers various aspects of DevOps workflow design. By following the best practices outlined in the Builders’ Library, you can create robust and scalable DevOps workflows that facilitate rapid software delivery and smooth operations.

Take me to this resource!

A pipeline coordinates multiple inflight releases and promotes them through three stages

A pipeline coordinates multiple inflight releases and promotes them through three stages

Using Cloud Fitness Functions to Drive Evolutionary Architecture

Cloud fitness functions provide a powerful mechanism for driving evolutionary architecture within your DevOps practices. By defining and measuring architectural fitness goals, you can continuously improve and evolve your systems over time.

This AWS Architecture Blog post delves into how AWS services, like AWS Lambda, AWS Step Functions, and Amazon CloudWatch can be leveraged to implement cloud fitness functions effectively. By integrating these services into your DevOps workflows, you can establish an architecture that evolves in alignment with changing business needs: improving system resilience, scalability, and maintainability.

Take me to this AWS Architecture Blog post!

Fitness functions provide feedback to engineers via metrics

Fitness functions provide feedback to engineers via metrics

Multi-Region Terraform Deployments with AWS CodePipeline using Terraform Built CI/CD

Achieving consistent deployments across multiple regions is a common challenge. This AWS DevOps Blog post demonstrates how to use Terraform, AWS CodePipeline, and infrastructure-as-code principles to automate Multi-Region deployments effectively. By adopting this approach, you can demonstrate the consistent infrastructure and application deployments, improving the scalability, reliability, and availability of your DevOps practices.

The post also provides practical examples and step-by-step instructions for implementing Multi-Region deployments with Terraform and AWS services, enabling you to leverage the power of infrastructure-as-code to streamline DevOps workflows.

Take me to this AWS DevOps Blog post!

Multi-Region AWS deployment with IaC and CI/CD pipelines

Multi-Region AWS deployment with IaC and CI/CD pipelines

See you next time!

Thanks for joining our discussion on DevOps best practices! Next time we’ll talk about how to create resilient workloads on AWS.

To find all the blogs from this series, check out the Let’s Architect! list of content on the AWS Architecture Blog. See you soon!

Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/build-a-serverless-log-analytics-pipeline-using-amazon-opensearch-ingestion-with-managed-amazon-opensearch-service/

In this post, we show how to build a log ingestion pipeline using the new Amazon OpenSearch Ingestion, a fully managed data collector that delivers real-time log and trace data to Amazon OpenSearch Service domains. OpenSearch Ingestion is powered by the open-source data collector Data Prepper. Data Prepper is part of the open-source OpenSearch project. With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.

For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open-source project, visit Data Prepper.

In this post, we explore the logging infrastructure for a fictitious company, AnyCompany. We explore the components of the end-to-end solution and then show how to configure OpenSearch Ingestion’s main parameters and how the logs come in and out of OpenSearch Ingestion.

Solution overview

Consider a scenario in which AnyCompany collects Apache web logs. They use OpenSearch Service to monitor web access and identify possible root causes to error logs of type 4xx and 5xx. The following architecture diagram outlines the use of every component used in the log analytics pipeline: Fluent Bit collects and forwards logs; OpenSearch Ingestion processes, routes, and ingests logs; and OpenSearch Service analyzes the logs.

The workflow contains the following stages:

  1. Generate and collectFluent Bit collects the generated logs and forwards them to OpenSearch Ingestion. In this post, you create fake logs that Fluent Bit forwards to OpenSearch Ingestion. Check the list of supported clients to review the required configuration for each client supported by OpenSearch Ingestion.
  2. Process and ingest – OpenSearch Ingestion filters the logs based on response value, processes the logs using a grok processor, and applies conditional routing to ingest the error logs to an OpenSearch Service index.
  3. Store and analyze – We can analyze the Apache httpd error logs using OpenSearch Dashboards.

Prerequisites

To implement this solution, make sure you have the following prerequisites:

Configure OpenSearch Ingestion

First, you define the appropriate AWS Identity and Access Management (IAM) permissions to write to and from OpenSearch Ingestion. Then you set up the pipeline configuration in the OpenSearch Ingestion. Let’s explore each step in more detail.

Configure IAM permissions

OpenSearch Ingestion works with IAM to secure communications into and out of OpenSearch Ingestion. You need two roles, authenticated using AWS Signature V4 (SigV4) signed requests. The originating entity requires permissions to write to OpenSearch Ingestion. OpenSearch Ingestion requires permissions to write to your OpenSearch Service domain. Finally, you must create an access policy using OpenSearch Service’s fine-grained access control, which allows OpenSearch Ingestion to create indexes and write to them in your domain.

The following diagram illustrates the IAM permissions to allow OpenSearch Ingestion to write to an OpenSearch Service domain. Refer to Setting up roles and users in Amazon OpenSearch Ingestion to get more details on roles and permissions required to use OpenSearch Ingestion.

In the demo, you use the AWS Cloud9 EC2 instance profile’s credentials to sign requests sent to OpenSearch Ingestion. You use Fluent Bit to fetch the credentials and assume the role you pass in the aws_role_arn you configure later.

  1. Create an ingestion role (called IngestionRole) to allow Fluent Bit to ingest the logs into your pipeline.

Create a trust relationship to allow Fluent Bit to assume the ingestion role, as shown in the following code. Fluent Bit attempts to fetch the credentials in the following order. In configuring the access policy for this role, you grant permission for the osis:Ingest.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "{your-account-id}"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
  1. Create a pipeline role (called PipelineRole) with a trust relationship for OpenSearch Ingestion to assume that role. The domain-level access policy of the OpenSearch domain grants the pipeline role access to the domain.
  1. Finally, configure your domain’s security plugin to enable OpenSearch Ingestion’s assumed role to create indexes and write data to the domain.

In this demo, the OpenSearch Service domain uses fine-grained access control for authentication, so you need to map the OpenSearch Ingestion pipeline role to the OpenSearch backend role all_access. For instructions, refer to Step 2: Include the pipeline role in the domain access policy page.

Create the pipeline in OpenSearch Ingestion

To create an OpenSearch Ingestion pipeline, complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines in the navigation pane.
  2. Choose Create pipeline.
  3. For Pipeline name, enter a name.

  1. Input the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs). In this example, we use the default pipeline capacity settings of minimum 1 Ingestion OCU and maximum 4 Ingestion OCUs.

Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.

  1. In the Pipeline configuration section, configure Data Prepper to process your data by choosing the appropriate blueprint configuration template on the Configuration blueprints menu. For this post, we choose AWS-LogAggregationWithConditionalRouting.

The OpenSearch Ingestion pipeline configuration consists of four sections:

  • Source – This is the input component of a pipeline. It defines the mechanism through which a pipeline consumes records. In this post, you use the http_source plugin and provide the Fluent Bit output URI value within the path attribute.
  • Processors – This represents an intermediate processing to filter, transform, and enrich your input data. Refer to Supported plugins for more details on the list of operations supported in OpenSearch Ingestion. In this post, we use the grok processor COMMONAPACHELOG, which matches input logs against the common Apache log pattern and makes it easy to query in OpenSearch Service.
  • Sink – This is the output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. In this post, you define an OpenSearch Service domain and index as sink.
  • Route – This is the part of a processor that allows the pipeline to route the data into different sinks based on specific conditions. In this example, you create four routes based in the response field value of the log. If the response field value of the log line matches 2xx or 3xx, the log is sent to the OpenSearch Service index aggregated_2xx_3xx. If the response field value matches 4xx, the log is sent to the index aggregated_4xx. If the response field value matches 5xx, the log is sent to the index aggregated_5xx.
  1. Update the blueprint based on your use case. The following code shows an example of the pipeline configuration YAML file:
version: "2"
log-aggregate-pipeline:
  source:
    http:
      # Provide the FluentBit output URI value.
      path: "/log/ingest"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG_DATATYPED}" ]
  route:
    - 2xx_status: "/response >= 200 and /response < 300"
    - 3xx_status: "/response >= 300 and /response < 400"
    - 4xx_status: "/response >= 400 and /response < 500"
    - 5xx_status: "/response >= 500 and /response < 600"
  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_2xx_3xx"
        routes:
          - 2xx_status
          - 3xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_4xx"
        routes:
          - 4xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_5xx"
        routes:
          - 5xx_status

Provide the relevant values for your domain endpoint, account ID, and Region related to your configuration.

  1. Check the health of your configuration setup by choosing Validate pipeline when you finish the update.

When designing a production workload, deploy your pipeline within a VPC. For instructions, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.

  1. For this post, select Public access under Network.

  1. In the Log publishing options section, select Publish to CloudWatch logs and Create new group.

OpenSearch Ingestion uses the log levels of INFO, WARN, ERROR, and FATAL. Enabling log publishing helps you monitor your pipelines in production.

  1. Choose Next and Create pipeline.
  2. Select the pipeline and choose View details to see the progress of the pipeline creation.

Wait until the status changes to Active to start using the pipeline.

Send logs to the OpenSearch Ingestion pipeline

To start sending logs to the OpenSearch Ingestion pipeline, complete the following steps:

  1. On the AWS Cloud9 console, create a Fluent Bit configuration file and update the following attributes:
    • Host – Enter the ingestion URL of your OpenSearch Ingestion pipeline.
    • aws_service – Enter osis.
    • aws_role_arn – Enter the ARN of the IAM role IngestionRole.

The following code shows an example of the Fluent-bit.conf file:

[SERVICE]
    parsers_file          ./parsers.conf
    
[INPUT]
    name                  tail
    refresh_interval      5
    path                  /var/log/*.log
    read_from_head        true
[FILTER]
    Name parser
    Key_Name log
    Parser apache
[OUTPUT]
    Name http
    Match *
    Host {Ingestion URL}
    Port 443
    URI /log/ingest
    format json
    aws_auth true
    aws_region {AWS_region}
    aws_role_arn arn:aws:iam::{your-account-id}:role/IngestionRole
    aws_service osis
    Log_Level trace
    tls On
  1. In the AWS Cloud9 environment, create a docker-compose YAML file to deploy Fluent Bit and Flog containers:
version: '3'
services:
  fluent-bit:
    container_name: fluent-bit
    image: docker.io/amazon/aws-for-fluent-bit
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf
      - ./apache-logs:/var/log
  flog:
    container_name: flog
    image: mingrammer/flog
    command: flog -t log -f apache_common -o web/log/test.log -w -n 100000 -d 1ms -p 1000
    volumes:
      - ./apache-logs:/web/log

Before you start the Docker containers, you need to update the IAM EC2 instance role in AWS Cloud9 so it can sign the requests sent to OpenSearch Ingestion.

  1. For demo purposes, create an IAM service-linked role and choose EC2 under Use case to allow the AWS Cloud9 EC2 instance to call OpenSearch Ingestion on your behalf.
  2. Add the OpenSearch Ingestion policy, which is the same policy you used with IngestionRole.
  3. Add the AdministratorAccess permission policy to the role as well.

Your role definition should look like the following screenshot.

  1. After you create the role, go back to AWS Cloud9, select your demo environment, and choose View details.
  2. On the EC2 instance tab, choose Manage EC2 instance to view the details of the EC2 instance attached to your AWS Cloud9 environment.

  1. On the Amazon EC2 console, replace the IAM role of your AWS Cloud9 EC2 instance with the new role.
  2. Open a terminal in AWS Cloud9 and run the command docker-compose up.

Check the output in the terminal—if everything is working correctly, you get status 200.

Fluent Bit collects logs from the /var/log repository in the container and pushes the data to the OpenSearch Ingestion pipeline.

  1. Open OpenSearch Dashboards, navigate to Dev Tools, and run the command GET _cat/indices to validate that the data has been delivered by OpenSearch Ingestion to your OpenSearch Service domain.

You should see the three indexes created: aggregated_2xx_3xx, aggregated_4xx, and aggregated_5xx.

Now you can focus on analyzing your log data and reinvent your business without having to worry about any operational overhead regarding your ingestion pipeline.

Best practices for monitoring

You can monitor the Amazon CloudWatch metrics made available to you to maintain the right performance and availability of your pipeline. Check the list of available pipeline metrics related to the source, buffer, processor, and sink plugins.

Navigate to the Metrics tab for your specific OpenSearch Ingestion pipeline to explore the graphs available to each metric, as shown in the following screenshot.

In your production workloads, make sure to configure the following CloudWatch alarms to notify you when the pipeline metrics breach a specific threshold so you can promptly remediate each issue.

Managing cost

While OpenSearch Ingestion automatically provisions and scales the OCUs for your spiky workloads, you only pay for the compute resources actively used by your pipeline to ingest, process, and route data. Therefore, setting up a maximum capacity of Ingestion OCUs allows you to handle your workload peak demand while controlling cost.

For production workloads, make sure to configure a minimum of 2 Ingestion OCUs to ensure 99.9% availability for the ingestion pipeline. Check the sizing recommendations and learn how OpenSearch Ingestion responds to workload spikes.

Clean up

Make sure you clean up unwanted AWS resources created during this post in order to prevent additional billing for these resources. Follow these steps to clean up your AWS account:

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Select the environment you want to delete and choose Delete.
  3. On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
  4. Select the domain you want to delete and choose Delete.
  5. Select Pipelines under Ingestion in the navigation pane.
  6. Select the pipeline you want to delete and on the Actions menu, choose Delete.

Conclusion

In this post, you learned how to create a serverless ingestion pipeline to deliver Apache access logs to an OpenSearch Service domain using OpenSearch Ingestion. You learned the IAM permissions required to start using OpenSearch Ingestion and how to use a pipeline blueprint instead of creating a pipeline configuration from scratch.

You used Fluent Bit to collect and forward Apache logs, and used OpenSearch Ingestion to process and conditionally route the log data to different indexes in OpenSearch Service. For more examples about writing to OpenSearch Ingestion pipelines, refer to Sending data to Amazon OpenSearch Ingestion pipelines.

Finally, the post provided you with recommendations and best practices to deploy OpenSearch Ingestion pipelines in a production environment while controlling cost.

Follow this post to build your serverless log analytics pipeline, and refer to Top strategies for high volume tracing with Amazon OpenSearch Ingestion to learn more about high volume tracing with OpenSearch Ingestion.


About the authors

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Francisco Losada is an Analytics Specialist Solutions Architect based out of Madrid, Spain. He works with customers across EMEA to architect, implement, and evolve analytics solutions at AWS. He advocates for OpenSearch, the open-source search and analytics suite, and supports the community by sharing code samples, writing content, and speaking at conferences. In his spare time, Francisco enjoys playing tennis and running.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Automated Code Review on Pull Requests using AWS CodeCommit and AWS CodeBuild

Post Syndicated from Verinder Singh original https://aws.amazon.com/blogs/devops/automated-code-review-on-pull-requests-using-aws-codecommit-and-aws-codebuild/

Pull Requests play a critical part in the software development process. They ensure that a developer’s proposed code changes are reviewed by relevant parties before code is merged into the main codebase. This is a standard procedure that is followed across the globe in different organisations today. However, pull requests often require code reviewers to read through a great deal of code and manually check it against quality and security standards. These manual reviews can lead to problematic code being merged into the main codebase if the reviewer overlooks any problems.

To help solve this problem, we recommend using Amazon CodeGuru Reviewer to assist in the review process. CodeGuru Reviewer identifies critical defects and deviation from best practices in your code. It provides recommendations to remediate its findings as comments in your pull requests, helping reviewers miss fewer problems that may have otherwise made into production. You can easily integrate your repositories in AWS CodeCommit with Amazon CodeGuru Reviewer following these steps.

The purpose of this post isn’t, however, to show you CodeGuru Reviewer. Instead, our aim is to help you achieve automated code reviews with your pull requests if you already have a code scanning tool and need to continue using it. In this post, we will show you step-by-step how to add automation to the pull request review process using your code scanning tool with AWS CodeCommit (as source code repository) and AWS CodeBuild (to automatically review code using your code reviewer). After following this guide, you should be able to give developers automatic feedback on their code changes and augment manual code reviews so fewer problems make it into your main codebase.

Solution Overview

The solution comprises of the following components:

  1. AWS CodeCommit: AWS service to host private Git repositories.
  2. Amazon EventBridge: AWS service to receive pullRequestCreated and pullRequestSourceBranchUpdated events and trigger Amazon EventBridge rule.
  3. AWS CodeBuild: AWS service to perform code review and send the result to AWS CodeCommit repository as pull request comment.

The following diagram illustrates the architecture:

Figure 1: This architecture diagram illustrates the workflow where developer raises a Pull Request and receives automated feedback on the code changes using AWS CodeCommit, AWS CodeBuild and Amazon EventBridge rule

Figure 1. Architecture Diagram of the proposed solution in the blog

  1. Developer raises a pull request against the main branch of the source code repository in AWS CodeCommit.
  2. The pullRequestCreated event is received by the default event bus.
  3. The default event bus triggers the Amazon EventBridge rule which is configured to be triggered on pullRequestCreated and pullRequestSourceBranchUpdated events.
  4. The EventBridge rule triggers AWS CodeBuild project.
  5. The AWS CodeBuild project runs the code quality check using customer’s choice of tool and sends the results back to the pull request as comments. Based on the result, the AWS CodeBuild project approves or rejects the pull request automatically.

Walkthrough

The following steps provide a high-level overview of the walkthrough:

  1. Create a source code repository in AWS CodeCommit.
  2. Create and associate an approval rule template.
  3. Create AWS CodeBuild project to run the code quality check and post the result as pull request comment.
  4. Create an Amazon EventBridge rule that reacts to AWS CodeCommit pullRequestCreated and pullRequestSourceBranchUpdated events for the repository created in step 1 and set its target to AWS CodeBuild project created in step 3.
  5. Create a feature branch, add a new file and raise a pull request.
  6. Verify the pull request with the code review feedback in comment section.

1. Create a source code repository in AWS CodeCommit

Create an empty test repository in AWS CodeCommit by following these steps. Once the repository is created you can add files to your repository following these steps. If you create or upload the first file for your repository in the console, a branch is created for you named main. This branch is the default branch for your repository. If you are using a Git client instead, consider configuring your Git client to use main as the name for the initial branch. This blog post assumes the default branch is named as main.

2. Create and associate an approval rule template

Create an AWS CodeCommit approval rule template and associate it with the code repository created in step 1 following these steps.

3. Create AWS CodeBuild project to run the code quality check and post the result as pull request comment

This blog post is based on the assumption that the source code repository has JavaScript code in it, so it uses jshint as a code analysis tool to review the code quality of those files. However, users can choose a different tool as per their use case and choice of programming language.

Create an AWS CodeBuild project from AWS Management Console following these steps and using below configuration:

  • Source: Choose the AWS CodeCommit repository created in step 1 as the source provider.
  • Environment: Select the latest version of AWS managed image with operating system of your choice. Choose New service role option to create the service IAM role with default permissions.
  • Buildspec: Use below build specification. Replace <NODEJS_VERSION> with the latest supported nodejs runtime version for the image selected in previous step. Replace <REPOSITORY_NAME> with the repository name created in step 1. The below spec installs the jshint package, creates a jshint config file with a few sample rules, runs it against the source code in the pull request commit, posts the result as comment to the pull request page and based on the results, approves or rejects the pull request automatically.
version: 0.2
phases:
  install:
    runtime-versions:
      nodejs: <NODEJS_VERSION>
    commands:
      - npm install jshint --global
  build:
    commands:
      - echo \{\"esversion\":6,\"eqeqeq\":true,\"quotmark\":\"single\"\} > .jshintrc
      - CODE_QUALITY_RESULT="$(echo \`\`\`) $(jshint .)"; EXITCODE=$?
      - aws codecommit post-comment-for-pull-request --pull-request-id $PULL_REQUEST_ID --repository-name <REPOSITORY_NAME> --content "$CODE_QUALITY_RESULT" --before-commit-id $DESTINATION_COMMIT_ID --after-commit-id $SOURCE_COMMIT_ID --region $AWS_REGION	
      - |
        if [ $EXITCODE -ne 0 ]
        then
          PR_STATUS='REVOKE'
        else
          PR_STATUS='APPROVE'
        fi
      - REVISION_ID=$(aws codecommit get-pull-request --pull-request-id $PULL_REQUEST_ID | jq -r '.pullRequest.revisionId')
      - aws codecommit update-pull-request-approval-state --pull-request-id $PULL_REQUEST_ID --revision-id $REVISION_ID --approval-state $PR_STATUS --region $AWS_REGION

Once the AWS CodeBuild project has been created successfully, modify its IAM service role by following the below steps:

  • Choose the CodeBuild project’s Build details tab.
  • Choose the Service role link under the Environment section which should navigate you to the CodeBuild’s IAM service role in IAM console.
  • Expand the default customer managed policy and choose Edit.
  • Add the following actions to the existing codecommit actions:
"codecommit:CreatePullRequestApprovalRule",
"codecommit:GetPullRequest",
"codecommit:PostCommentForPullRequest",
"codecommit:UpdatePullRequestApprovalState"

  • Choose Next.
  • On the Review screen, choose Save changes.

4. Create an Amazon EventBridge rule that reacts to AWS CodeCommit pullRequestCreated and pullRequestSourceBranchUpdated events for the repository created in step 1 and set its target to AWS CodeBuild project created in step 3

Follow these steps to create an Amazon EventBridge rule that gets triggered whenever a pull request is created or updated using the following event pattern. Replace the <REGION>, <ACCOUNT_ID> and <REPOSITORY_NAME> placeholders with the actual values. Select target of the event rule as AWS CodeBuild project created in step 3.

Event Pattern

{
    "detail-type": ["CodeCommit Pull Request State Change"],
    "resources": ["arn:aws:codecommit:<REGION>:<ACCOUNT_ID>:<REPOSITORY_NAME>"],
    "source": ["aws.codecommit"],
    "detail": {
      "isMerged": ["False"],
      "pullRequestStatus": ["Open"],
      "repositoryNames": ["<REPOSITORY_NAME>"],
      "destinationReference": ["refs/heads/main"],
      "event": ["pullRequestCreated", "pullRequestSourceBranchUpdated"]
    },
    "account": ["<ACCOUNT_ID>"]
  }

Follow these steps to configure the target input using the below input path and input template.

Input transformer – Input path

{
    "detail-destinationCommit": "$.detail.destinationCommit",
    "detail-pullRequestId": "$.detail.pullRequestId",
    "detail-sourceCommit": "$.detail.sourceCommit"
}

Input transformer – Input template

{
    "sourceVersion": <detail-sourceCommit>,
    "environmentVariablesOverride": [
        {
            "name": "DESTINATION_COMMIT_ID",
            "type": "PLAINTEXT",
            "value": <detail-destinationCommit>
        },
        {
            "name": "SOURCE_COMMIT_ID",
            "type": "PLAINTEXT",
            "value": <detail-sourceCommit>
        },
        {
            "name": "PULL_REQUEST_ID",
            "type": "PLAINTEXT",
            "value": <detail-pullRequestId>
        }
    ]
}

5. Create a feature branch, add a new file and raise a pull request

Create a feature branch following these steps. Push a new file called “index.js” to the root of the repository with the below content.

function greet(dayofweek) {
  if (dayofweek == "Saturday" || dayofweek == "Sunday") {
    console.log("Have a great weekend");
  } else {
    console.log("Have a great day at work");
  }
}

Now raise a pull request using the feature branch as source and main branch as destination following these steps.

6. Verify the pull request with the code review feedback in comment section

As soon as the pull request is created, the AWS CodeBuild project created in step 3 above will be triggered which will run the code quality check and post the results as a pull request comment. Navigate to the AWS CodeCommit repository pull request page in AWS Management Console and check under the Activity tab to confirm the automated code review result being displayed as the latest comment.

The pull request comment submitted by AWS CodeBuild highlights 6 errors in the JavaScript code. The errors on lines first and third are based on the jshint rule “eqeqeq”. It recommends to use strict equality operator (“===”) instead of the loose equality operator (“==”) to avoid type coercion. The errors on lines second, fourth and fifth are based on jshint rule “quotmark” which recommends to use single quotes with strings instead of double quotes for better readability. These jshint rules are defined in AWS CodeBuild project’s buildspec in step 3 above.

Figure 2: The image shows the AWS CodeCommit pull request's Activity tab with code review results automatically posted by the automated code reviewer

Figure 2. Pull Request comments updated with automated code review results.

Conclusion

In this blog post we’ve shown how using AWS CodeCommit and AWS CodeBuild services customers can automate their pull request review process by utilising Amazon EventBridge events and using their own choice of code quality tool. This simple solution also makes it easier for the human reviewers by providing them with automated code quality results as input and enabling them to focus their code review more on business logic code changes rather than static code quality issues.

About the authors

Blog post's primary author's image

Verinder Singh

Verinder Singh is an experienced Solution’s Architect based out of Sydney, Australia with 16+ years of experience in software development and architecture. He works primarily on building large scale open-source AWS solutions for common customer use cases and business problems. In his spare time, he enjoys vacationing and watching movies with his family.

Blog post's secondary author's image

Deenadayaalan Thirugnanasambandam

Deenadayaalan Thirugnanasambandam is a Principal Cloud Architect at AWS. He provides prescriptive architectural guidance and consulting that enable and accelerate customers’ adoption of AWS.

Let’s Architect! Governance best practices

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-governance-best-practices/

Governance plays a crucial role in AWS environments, as it ensures compliance, security, and operational efficiency.

In this Let’s Architect!, we aim to provide valuable insights and best practices on how to configure governance appropriately within a company’s AWS infrastructure. By implementing these best practices, you can establish robust controls, enhance security, and maintain compliance, enabling your organization to fully leverage the power of AWS services while mitigating risks and maximizing operational efficiency.

If you are hungry for more information on governance, check out the Architecture Center’s management and governance page, where you can find a collection of AWS solutions, blueprints, and best practices on this topic.

How Global Payments scales on AWS with governance and controls

As global financial and regulated industry organizations increasingly turn to AWS for scaling their operations, they face the critical challenge of balancing growth with stringent governance and control regulatory requirements.

During this re:Invent 2022 session, Global Payments sheds light on how they leverage AWS cloud operations services to address this challenge head-on. By utilizing AWS Service Catalog, they streamline the deployment of pre-approved, compliant resources and services across their AWS accounts. This not only expedites the provisioning process but also ensures that all resources meet the required regulatory standards.

Take me to this re:Invent 2022 video!

The combination of AWS Service Catalog and AWS Organizations empowers Global Payments to establish robust governance and control mechanisms

The combination of AWS Service Catalog and AWS Organizations empowers Global Payments to establish robust governance and control mechanisms

Governance and security with infrastructure as code

Maintaining security and compliance throughout the entire deployment process is critical.

In this video, you will discover how cfn-guard can be utilized to validate your deployment pipelines built using AWS CloudFormation. By defining and applying custom rules, cfn-guard empowers you to enforce security policies, prevent misconfigurations, and ensure compliance with regulatory requirements. Moreover, by leveraging cdk-nag, you can catch potential security vulnerabilities and compliance risks early in the development process.

Take me to this governance video!

Learn how to use AWS CloudFormation and the AWS Cloud Development Kit to deploy cloud applications in regulated environments while enforcing security controls

Learn how to use AWS CloudFormation and the AWS Cloud Development Kit to deploy cloud applications in regulated environments while enforcing security controls

Get more out of service control policies in a multi-account environment

AWS customers often utilize AWS Organizations to effectively manage multiple AWS accounts. There are numerous advantages to employing this approach within an organization, including grouping workloads with shared business objectives, ensuring compliance with regulatory frameworks, and establishing robust isolation barriers based on ownership. Customers commonly utilize separate accounts for development, testing, and production purposes. However, as the number of these accounts grows, the need arises for a centralized approach to establish control mechanisms and guidelines.

In this AWS Security Blog post, we will guide you through various techniques that can enhance the utilization of AWS Organizations’ service control policies (SCPs) in a multi-account environment.

Take me to this AWS Security Blog post!

A sample organization showing the maximum number of SCPs applicable at each level (root, OU, account)

A sample organization showing the maximum number of SCPs applicable at each level (root, OU, account)

Centralized Logging on AWS

Having a single pane of glass where all Amazon CloudWatch Logs are displayed is crucial for effectively monitoring and understanding the overall performance and health of a system or application.

The AWS Centralized Logging solution facilitates the aggregation, examination, and visualization of CloudWatch Logs through a unified dashboard.

This solution streamlines the consolidation, administration, and analysis of log files originating from diverse sources, including access audit logs, configuration change records, and billing events. Furthermore, it enables the collection of CloudWatch Logs from numerous AWS accounts and Regions.

Take me to this AWS solution!

The Centralized Logging on AWS solution contains the following components log ingestion, indexing, and visualization

The Centralized Logging on AWS solution contains the following components log ingestion, indexing, and visualization

See you next time!

Thanks for joining our discussion on governance best practices! We’ll be back in 2 weeks, when we’ll explore DevOps best practices.

To find all the blogs from this series, you can check the Let’s Architect! list of content on the AWS Architecture Blog.