Create, Use, and Troubleshoot Launch Scripts on Amazon Lightsail

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/create-use-and-troubleshoot-launch-scripts-on-amazon-lightsail/

This blog post is written by Brian Graf, Senior Developer Advocate, Amazon Lightsail and Sophia Parafina, Senior Developer Advocate. 

Amazon Lightsail is a virtual private server (VPS) for deploying both operating systems (OS) and pre-packaged applications, such as WordPress, Plesk, cPanel, PrestaShop, and more. When deploying these instances, you can run launch scripts with additional commands, such as installing applications, configuring system files, or installing prerequisites for your application.

Where do I add a launch script?

If you’re deploying an instance with the Lightsail console, launch scripts can be added to an instance at deployment. They are added in the ‘deploy instance’ page:

Image of Amazon Lightsail deploy an instance page

The launch script must be added before the instance is deployed, because launch scripts can’t retroactively run after deployment.

Anatomy of a Windows Launch Script

When deploying a Lightsail Windows instance, you can use a batch script or a PowerShell script in the ‘launch script’ textbox.  Of the two options, PowerShell is more extensible and provides greater flexibility for configuration and control.

If you choose to write your launch script as a batch file, you must add <script> </script> tags at the beginning and end of your code respectively. Alternatively, a launch script in PowerShell must use the <powershell></powershell> tags in a similar fashion.

After the closing </script> or </powershell> tag, you must add a <persist></persist> tag on the following line. The persist tag is used to determine if this is a run-once command or if it should run every time your instance is rebooted or changed from the ‘Stop’ to ‘Start’ state. If you want your script to run every time the instance is rebooted or started, then you must set the persist tag to ‘true’. If you want your launch script to just run once, then you would set your persist tag to ‘false’.

Anatomy of a Linux Launch Script

Like a Windows launch script, a Linux launch script requires specific code on the first line of the textbox to successfully execute during deployment. You must place ‘#!/bin/bash’ as the first line of code to set the shell that executes the rest of the script. After the first line of code, you can add further commands to achieve the results you want.
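
As a minimal sketch of that structure, the following script starts with the required shebang and then records a timestamp so you can confirm later that it ran. The marker file path /tmp/launch-script-ran is an assumption chosen for illustration; Lightsail does not create or read it.

```shell
#!/bin/bash
# The shebang above must be the very first line of the launch script.

# Hypothetical marker file; the path is an assumption for illustration.
MARKER="${MARKER:-/tmp/launch-script-ran}"

# Record when the script started so you can confirm that it executed.
date -u +"%Y-%m-%dT%H:%M:%SZ launch script started" > "$MARKER"

# Additional setup commands would follow here.

# Record that the script reached its last line.
echo "launch script finished" >> "$MARKER"
```

After the instance is running, the contents of the marker file show whether the script reached its final line.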

How do I know if my Launch Script ran successfully?

Although launch scripts are a convenient way to create a baseline instance, it’s possible that your instance doesn’t reach the desired end state because of an error in your script or a permissions issue. In that case, you must troubleshoot to see why the launch script didn’t complete successfully. To find out whether the launch script ran successfully, refer to the instance logs.

For Windows, the launch log can be found in: C:\ProgramData\Amazon\EC2-Windows\launch\Log\UserdataExecution.log. Note that ProgramData is a hidden folder, so unless you access the file from PowerShell or Command Prompt, you must enable hidden items in Windows File Explorer (View > Show > Hidden items) to see it.

For Linux, the launch log can be found in: /var/log/cloud-init-output.log and can be monitored after your instance launches by tailing the log by typing the following in the terminal:

tail -f /var/log/cloud-init-output.log

If you want to see the entire log file including commands that have already run before you opened the log file, then you can type the following in the terminal:

less +F /var/log/cloud-init-output.log
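
If you prefer a quick summary over reading the whole log, a small helper can scan it for common failure words. This is only a sketch: the function name summarize_log and the search pattern are assumptions, and the pattern is a heuristic, not an official cloud-init format.

```shell
#!/bin/bash
# Path of the cloud-init output log named above; override LOG to scan another file.
LOG="${LOG:-/var/log/cloud-init-output.log}"

summarize_log() {
  # $1: log file to scan
  if [ ! -r "$1" ]; then
    echo "cannot read $1"
    return 2
  fi
  # Print matching lines with their line numbers. The word list is a
  # heuristic assumption and may produce false positives.
  if grep -inE 'error|failed|not found|permission denied' "$1"; then
    return 1
  fi
  echo "no obvious errors in $1"
  return 0
}

summarize_log "$LOG" || echo "review the log manually (exit status $?)"
```

An exit status of 1 means at least one suspicious line was printed, and 2 means the log could not be read (for example, because you need sudo).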

On a Windows instance, an easy way to monitor the UserdataExecution.log is to add the following code in your launch script, which creates a shortcut to tail or watch the log as commands are executing:

# Create a log-monitoring script to monitor the progress of the launch script execution

$monitorlogs = @"
get-content C:\ProgramData\Amazon\EC2-Windows\launch\Log\UserdataExecution.log -wait
"@

# Save the log-monitoring script to the desktop for the user

$monitorlogs | out-file -FilePath C:\Users\Administrator\Desktop\MonitorLogs.ps1 -Encoding utf8 -Force

</powershell>
<persist>false</persist>

If the script was executed, then the last line of the log should say ‘{Timestamp}: User data script completed’.

However, if you want more detail, you can build the logging into your launch script. For example, you can append a text or log file with each command so that you can read the output in an easy-to-access location:

<powershell>
# Set the location for the log file. In this case,
# it will appear on the desktop of your Lightsail instance
$loc = "c:\Users\Administrator\Desktop\mylog.txt"

# Write text to the log file
Write-Output "Starting Script" >> $loc

# Download and install Chocolatey to do unattended installations of the rest of the apps.
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

# You could run commands like this to output the progress to the log file:

# Install vscode and all dependencies
choco install -y vscode --force --force-dependencies --verbose >> $loc

# Install git and all dependencies
choco install -y git --force --force-dependencies --verbose >> $loc

# Completed
Write-Output "Completed" >> $loc
</powershell>
<persist>false</persist>

This code creates a log file, outputs data, and appends it along the way. If there is an issue, then you can see where the logs stopped or errors appeared.

For Ubuntu and Amazon Linux 2

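If the cloud-init-output.log isn’t comprehensive enough, then you can redirect the output from your commands to a log file of your choice. In this example, we create a log file in the /tmp/ directory and append the output from our commands to this file. The commands use apt, so they apply to Ubuntu as written; on Amazon Linux 2, you would use yum with the equivalent package names instead.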

#!/bin/bash

# Create the log file
touch /tmp/launchscript.log

# Add text to the log file if you so choose
echo 'Starting' >> /tmp/launchscript.log

# Update package index
sudo apt update >> /tmp/launchscript.log

# Install software to manage independent software vendor sources
sudo apt -y install software-properties-common >> /tmp/launchscript.log

# Add the repository for all PHP versions
sudo add-apt-repository -y ppa:ondrej/php >> /tmp/launchscript.log

# Install Web server, mySQL client, PHP (and packages), unzip, and curl
sudo apt -y install apache2 mysql-client-core-8.0 php8.0 libapache2-mod-php8.0 php8.0-common php8.0-imap php8.0-mbstring php8.0-xmlrpc php8.0-soap php8.0-gd php8.0-xml php8.0-intl php8.0-mysql php8.0-cli php8.0-bcmath php8.0-ldap php8.0-zip php8.0-curl unzip curl >> /tmp/launchscript.log

# Any final text you want to include
echo 'Completed' >> /tmp/launchscript.log

It’s possible to check the logs before the launch script has finished executing. One way to follow along is to ‘tail’ the log file. This lets you stream all updates as they occur. You can monitor the log using:

tail -f /tmp/launchscript.log
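
Appending >> to a command captures standard output only, so error messages may still be missing from your log. One possible refinement, sketched below, wraps each command in a helper that also captures standard error and records the exit status. The helper name run_logged and the two demo commands are illustrative assumptions; replace them with your own installation steps.

```shell
#!/bin/bash
LOG_FILE="${LOG_FILE:-/tmp/launchscript.log}"

run_logged() {
  # Log the command line, run it with stdout and stderr appended to the
  # log, then log and return its exit status.
  echo "+ $*" >> "$LOG_FILE"
  "$@" >> "$LOG_FILE" 2>&1
  local status=$?
  echo "exit status: $status" >> "$LOG_FILE"
  return "$status"
}

echo 'Starting' >> "$LOG_FILE"

# A command that succeeds.
run_logged echo 'hello from the launch script'

# A command that fails; its error message lands in the log as well.
run_logged ls /nonexistent-path-for-demo || echo 'a step failed; see above' >> "$LOG_FILE"

echo 'Completed' >> "$LOG_FILE"
```

With this pattern, the log shows which command failed and why, not just where the output stopped.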

Using Launch Scripts from AWS Command Line Interface (AWS CLI)

You can deploy your Lightsail instances from the AWS Command Line Interface (AWS CLI) instead of the Lightsail console. You can add a launch script to the AWS CLI command as a parameter, either by creating a variable that holds the script and referencing the variable, or by saving the launch script as a file and referencing the file’s location on your computer.
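
As a sketch of the variable approach, the following keeps a short Linux launch script in a shell variable and passes it to --user-data. The instance name, blueprint ID, and bundle ID are assumptions for illustration; check the values available to you with aws lightsail get-blueprints and aws lightsail get-bundles. The run helper is also hypothetical: it prints the command by default, so nothing is created until you set DRY_RUN=0.

```shell
#!/bin/bash
# Keep the launch script in a variable. The quoted heredoc delimiter
# prevents the local shell from expanding anything inside it.
USER_DATA=$(cat <<'EOF'
#!/bin/bash
echo 'Starting' >> /tmp/launchscript.log
EOF
)

# Print the command unless DRY_RUN=0, to avoid creating a billable
# instance by accident.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: $*"
  else
    "$@"
  fi
}

# Instance name, blueprint ID, and bundle ID are illustrative assumptions.
run aws lightsail create-instances \
  --instance-names "my-linux-instance-1" \
  --availability-zone us-west-2a \
  --blueprint-id ubuntu_22_04 \
  --bundle-id medium_2_0 \
  --region us-west-2 \
  --user-data "$USER_DATA"
```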

The launch script is still written the same way as the previous examples. For a Windows instance with a PowerShell launch script, you can deploy a Lightsail instance with a launch script with the following code:

# PowerShell script saved in the Downloads folder:
$loc = "c:\Users\Administrator\Desktop\mylog.txt"

# Write text to the log file
Write-Output "Starting Script" >> $loc

# Download and install Chocolatey to do unattended installations of the rest of the apps.
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

# You could run commands like this to output the progress to the log file:

# Install vscode and all dependencies
choco install -y vscode --force --force-dependencies --verbose >> $loc

# Install git and all dependencies
choco install -y git --force --force-dependencies --verbose >> $loc

# Completed
Write-Output "Completed" >> $loc

AWS CLI code to deploy a Windows Server 2019 medium instance in the us-west-2a Availability Zone:

aws lightsail create-instances \
--instance-names "my-windows-instance-1" \
--availability-zone us-west-2a \
--blueprint-id windows_server_2019 \
--bundle-id medium_win_2_0 \
--region us-west-2 \
--user-data file://~/Downloads/powershell_script.ps1
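
The saved PowerShell script in this section does not include the <powershell> and <persist> tags described in the Windows anatomy section. Assuming the same tags are expected when the script is passed through --user-data, a small shell helper can add them before you run the AWS CLI command. The function name wrap_powershell and the temporary files are hypothetical and used only for this sketch.

```shell
#!/bin/bash
wrap_powershell() {
  # $1: path to a plain PowerShell script
  # $2: persist value, true or false (defaults to false)
  printf '<powershell>\n'
  cat "$1"
  printf '\n</powershell>\n<persist>%s</persist>\n' "${2:-false}"
}

# Demo with a one-line script written to a temporary file.
SRC=$(mktemp)
printf 'Write-Output "Starting Script"\n' > "$SRC"

# Write the wrapped version to a second temporary file and show it.
WRAPPED=$(mktemp)
wrap_powershell "$SRC" false > "$WRAPPED"
cat "$WRAPPED"
```

You could then reference the wrapped file in the command, for example with --user-data "file://$WRAPPED".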

Clean up

Remember to delete resources when you are finished using them to avoid incurring future costs.

Conclusion

You now have the understanding and examples you need to create and troubleshoot Lightsail launch scripts through both the Lightsail console and the AWS CLI. As demonstrated in this post, launch scripts can increase your productivity and reduce the time it takes to deploy and configure your applications. For more examples of using launch scripts, check out the aws-samples GitHub repository. To learn more about Lightsail, visit the Lightsail service page.

The art and science of data product portfolio management

Post Syndicated from Faris Haddad original https://aws.amazon.com/blogs/big-data/the-art-and-science-of-data-product-portfolio-management/

This post is the first in a series dedicated to the art and science of practical data mesh implementation (for an overview of data mesh, read the original whitepaper The data mesh shift). The series attempts to bridge the gap between the tenets of data mesh and its real-life implementation by deep-diving into the functional and non-functional capabilities essential to a working operating model, laying out the decisions that need to be made for each capability, and describing the key business and technical processes required to implement them. Taken together, the posts in this series lay out some possible operating models for data mesh within an organization.

Kudzu

Kudzu—or kuzu (クズ)—is native to Japan and southeast China. First introduced to the southeastern United States in 1876 as a promising solution for erosion control, it now represents a cautionary tale about unintended consequences, as Kudzu’s speed of growth outcompetes everything from native grasses to tree systems by growing over and shading them from the sunlight they need to photosynthesize—eventually leading to species extinction and loss of biodiversity. The story of Kudzu offers a powerful analogy to the dangers and consequences of implementing data mesh architectures without fully understanding or appreciating how they are intended to be used. When the “Kudzu” of unmanaged pseudo-data products (methods of sharing data that masquerade as data products while failing to fulfill the myriad obligations associated with them) has overwhelmed the local ecosystem of true data products, eradication is costly and prone to failure, and can represent significant wasted effort and resources, as well as lost time.

Desert

While Kudzu was taking over the south in the 1930s, desertification caused by extensive deforestation was overwhelming the Midwest, with large tracts of land becoming barren and residents forced to leave and find other places to make a living. In the same way, overly restrictive data governance practices that either prevent data products from taking root at all, or pare them back too aggressively (deforestation), can over time create “data deserts” that drive both the producers and consumers of data within an organization to look elsewhere for their data needs. At the same time, unstructured approaches to data mesh management that don’t have a vision for what types of products should exist and how to ensure they are developed are at high risk of creating the same effect through simple neglect. This is due to a common misconception about data mesh as a data strategy: that it is effectively self-organizing, meaning that once presented with the opportunity, data owners within the organization will spring to the responsibilities and obligations associated with publishing high-quality data products. In reality, the work of a data producer is often thankless, and without clear incentive strategies, organizations may end up with data deserts that create more data governance issues as producers and consumers go elsewhere to seek out the data they need to perform work.

Bonsai

Bonsai (盆栽) is an art form originating from an ancient Chinese tradition called penjing (盆景), and later shaped by the minimalist teachings of Zen Buddhism into the practice we know and recognize today. The patient practice of Bonsai offers useful analogies to the concepts and processes required to avoid the chaos of Kudzu as well as the specter of organizational data deserts. Bonsai artists carefully observe the naturally occurring buds that are produced by the tree and encourage those that add to the overall aesthetics of the tree, while pruning those that don’t work well with their neighbors. The same ideas apply equally well to data products within a data mesh—by encouraging the growth and adoption of those data products that add value to our data mesh, and continuously pruning those that do not, we maximize the value and sustainability of our data mesh implementations. In a similar vein, Bonsai artists must balance their vision for the shape of the tree with a respect for the natural characteristics and innate structure of the species they have chosen to work with—to ignore the biology of the tree would be disastrous to the longevity of the tree, as well as to the quality of the art itself. In the same way, organizations seeking to implement successful data mesh strategies must respect the nature and structure (legal, political, commercial, technology) of their organizations in their implementation.

Of the key capabilities proposed for the implementation of a sustainable data mesh operating model, the one that is most relevant to the problems we’ve described—and explore later in this post—is data product portfolio management.

Overview of data product portfolio management

Data mesh architectures are, by their nature, ideal for implementation within federated organizations, with decentralized ownership of data and clear legal, regulatory, or commercial boundaries between entities or lines of business. The same organizational characteristics that make data mesh architectures valuable, however, also put them at risk of turning into one of the twin nightmares of Kudzu or data deserts.

To define the shape and nature of an organizational data mesh, a number of key questions need to be answered, including but not limited to:

  • What are the key data domains within the organization? What are the key data products within these domains needed to solve current business problems? How do we iterate on this discovery process to add value while we are mapping our domains?
  • Who are the consumers in our organization, and what logical, regulatory, physical, or commercial boundaries might separate them from producers and their data products?
  • How do we encourage the development and maintenance of key data products in a decentralized organization?
  • How do we monitor data products against their SLAs, and ensure alerting and escalation on failure so that the organization is protected from bad data?
  • How do we enable those we see as being autonomous producers and consumers with the right skills, the right tools, and the right mindset to actually want to (and be able to) take more ownership of independently publishing data as a product and consuming it responsibly?
  • What is the lifecycle of a data product? When do new data products get created, and who is allowed to create them? When are data products deprecated, and who is accountable for the consequences to their consumers?
  • How do we define “risk” and “value” in the context of data products, and how can we measure this? Whose responsibility is it to justify the existence of a given data product?

To answer questions such as these and plan accordingly, organizations must implement data product portfolio management (DPPM). DPPM does not exist in a vacuum—by its nature, DPPM is closely related to and interdependent with enterprise architecture practices like business capability management and project portfolio management. DPPM itself may therefore also be considered, in part, an enterprise architecture practice.

Because DPPM is an enterprise architecture practice, responsibility for its implementation should reside within a function whose remit is appropriately global and cross-functional. This may be the CDO office for those organizations that have a CDO or equivalent central data function, or the enterprise architecture team in organizations that do not.

Goals of DPPM

The goals of DPPM can be summarized as follows:

  • Protect value – DPPM protects the value of the organizational data strategy by developing, implementing, and enforcing frameworks to measure the contribution of data products to organizational goals in objective terms. Examples may include associated revenue, savings, or reductions in operational losses. Earlier in their lifecycle, data products may be measured by alternative metrics, including adoption (number of consumers) and level of activity (releases, interaction with consumers, and so on). In the pursuit of this goal, the DPPM capability is accountable for engaging with the business to continuously prioritize where data as a product can add value and to align delivery priority accordingly. Strategies for measuring value and prioritizing data products are explored later in this post.
  • Manage risk – All data products introduce risk to the organization—risk of wasted money and effort through non-adoption, risk of operational loss associated with improper use, and risk of failure on the part of the data product to meet requirements on availability, completeness, or quality. These risks are exacerbated in the case of proliferation of low-quality or unsupervised data products. DPPM seeks to understand and measure these risks on an individual and aggregated basis. This is a particularly challenging goal because what constitutes risk associated with the existence of a particular data product is determined largely by its consumers and is likely to change over time (though like entropy, is only ever likely to increase).
  • Guide evolution – The final goal of DPPM is to guide the evolution of the data product landscape to meet overarching organizational data goals, such as mutually exclusive or collectively exhaustive domains and data products, the identification and enablement of single-threaded ownership of product definitions, or the agile inclusion of new sources of data and creation of products to serve tactical or strategic business goals. Some principles for the management of data mesh evolution, and the evaluation of data products against organizational goals, are explored later in this post.

Challenges of DPPM

In this section, we explore some of the challenges of DPPM, and the pragmatic ways some of these challenges could be addressed.

Infancy

Data mesh as a concept is still relatively new. As such, there is little standardization associated with practical operating models for building and managing data mesh architectures, and no access to fully fledged out-of-the-box reference operating models, frameworks, or tools to support the practice of DPPM.

Some elements of DPPM are supported in disparate tools (for example, some data catalogs include basic community features that contribute to measuring value), but not in a holistic way. Over time, standardization of the processes associated with DPPM will likely occur as a side-effect of commoditization, driven by the popularity and adoption of new services that take on and automate more of the undifferentiated heavy lifting associated with mesh supervision. In the meantime, however, organizations adopting data mesh architectures are left largely to their own devices around how to operate them effectively.

Resistance

The purest expression of democracy is anarchy, and the more federated an organization is (itself a supporting factor in choosing data mesh architectures), the more resistance may be observed to any forms of centralized governance. This is a challenge for DPPM, because in some way it must come together in one place. Just as the Bonsai artist knows the vision for the entire tree, there must be a cohesive vision for and ability to guide the evolution of a data mesh, no matter how broadly federated and autonomous individual domains or data products might be.

Balancing this with the need to respect the natural shape (and culture) of an organization, however, requires organizations that implement DPPM to think about how to do so in a way that doesn’t conflict with the reality of the organization. This might mean, for example, that DPPM may need to happen at several layers—at minimum within data domains, possibly within lines of business, and then at an enterprise level through appropriate data committees, guilds, or other structures that bring stakeholders together. All of this complicates the processes and collaboration needed to perform DPPM effectively.

Maturity

Data mesh architectures, and therefore DPPM, presume relatively high levels of data maturity within an organization: a clear data strategy, an understanding of data ownership and stewardship, principles and policies that govern the use of data, and a moderate-to-high level of education and training around data within the organization. Organizations that lack data maturity, or that have a weak or immature enterprise architecture function, will face significant hurdles in the implementation of any data mesh architecture, let alone a strong and useful DPPM practice.

In reality, however, data maturity is not uniform across organizations. Even in seemingly low-maturity organizations, there are often teams who are more mature and have a higher appetite to engage. By leaning into these teams and showing value through them first, then using them as evangelists, organizations can gain maturity while benefitting earlier from the advantages of data mesh strategies.

The following sections explore the implementation of DPPM along the lines of people, process, and technology, as well as describing the key characteristics of data products—scope, value, risk, uniqueness, and fitness—and how they relate to data mesh practices.

People

To implement DPPM effectively, a wide variety of stakeholders in the organization may need to be involved in one capacity or another. The following table suggests some key roles, but it’s up to an individual organization to determine how and if these map to their own roles and functions.

| Function | RACI | Role | Responsibility |
| Senior Leadership | A | Chief Data Officer | Ultimately accountable for organizational data strategy and implementation. Approves changes to DPPM principles and operating model. Acts as chair of, and appoints members to, the data council. |
| | R | Data Council** | Stakeholder body representing organizational governance around data strategy. Acts as steering body for the governance of DPPM as a practice (KPI monitoring, maturity assessments, auditing, and so on). Approves changes to guidelines and methodologies. Approves changes to data product portfolio (discussed later in this post). Approves and governs centrally funded and prioritized data product development activities. |
| Enterprise Architecture | AR | Head of Enterprise Architecture | Responsible for development and enforcement of data strategy. Accountable and responsible for the design and implementation of DPPM as an organizational capability. |
| | R | Domain Architect | Responsible for implementing screening, data product analysis, periodic evaluation, and optimal portfolio selection practices. Responsible for the development of methodologies and their selection criteria. |
| Legal & Compliance | C | Legal & Compliance Officer | Consults on permissibility of data products with reference to local regulation. Consults on permissibility of data sharing with reference to local regulation or commercial agreements. |
| | C | Data Privacy Officer | Consults on permissibility of data use with reference to local data privacy law. Consults on permissibility of cross-entity or cross-border data sharing with reference to data privacy law. |
| Information Security | RC | Information Security Officer | Consults on maturity assessments (discussed later in this post) for information security-relevant data product capabilities. Approves changes to data product technology architecture. Approves changes to IAM procedures relating to data products. |
| Business Functions | A | Data Domain Owner | Ultimately accountable for the appropriate use of domain data, as well as its quality and availability. Accountable for domain data products. Approves changes to the domain data model and domain data product portfolio. |
| | R | Data Domain Steward | Responsible for implementing data domain responsibilities, including operational (day-to-day) governance of domain data products. Approves use of domain data in new data products, and performs regular (such as yearly) attestation of data products using domain data. |
| | A | Data Owner | Ultimately accountable for the appropriate use of owned data (for example, CRM data), as well as its quality and availability. |
| | R | Data Steward | Responsible for implementing data responsibilities. Approves use of owned data in new data products, and performs regular (such as yearly) attestation of data products using owned data. |
| | AR | Data Product Owner | Accountable and responsible for the design, development, and delivery of data products against their stated SLOs. Contributes to data product analysis and portfolio adjustment practices for own data products. |

** The data council typically consists of permanent representatives from each function (data domain owners), enterprise architecture, and the chief data officer or equivalent.

Process

The following diagram illustrates the strategic, tactical, and operational practices associated with DPPM. Some considerations for the implementation of these practices are explored in more detail in this post, though their specific interpretation and implementation depend on the individual organization.

Boundaries

When reading this section, it’s important to bear in mind the impact of boundaries—although strategy development may be established as a global practice, other practices within DPPM must respect relevant organizational boundaries (which may be physical, geographical, operational, legal, commercial, or regulatory in nature). In some cases, the existence of boundaries may require some or all tactical and operational practices to be duplicated within each associated boundary. For example, an insurance company with a property and casualty legal entity in North America and a life entity in Germany may need to implement DPPM separately within each entity.

Strategy development

This practice deals with answering questions associated with the overall data mesh strategy, including the following:

  • The overall scope (data domains, participating entities, and so on) of the data mesh
  • The degree of freedom of participating entities in their definition and implementation of the data mesh (for example, a mesh of meshes vs. a single mesh)
  • The distribution of responsibilities for activities and capabilities associated with the data mesh (degree of democratization)
  • The definition and documentation of key performance indicators (KPIs) against which the data mesh should be governed (such as risk and value)
  • The governance operating model (including this practice)

Key deliverables include the following:

  • Organizational guidelines for operational processes around pre-screening and screening of data products
  • Well-defined KPIs that guide methodology development and selection for practices like data product analysis, screening, and optimal portfolio selection
  • Allocation of organizational resources (people, budget, time) to the implementation of tactical processes around methodology development, optimal portfolio selection, and portfolio adjustment

Key considerations

In this section, we discuss some key considerations for strategy development.

Data mesh structure

This diagram illustrates the analogous relationship between data products in a data mesh, and the structure of the mesh itself.

The following considerations relate to screening, data product analysis, and optimal portfolio selection.

  • Trunk (core data products) – Core data products are those that are central to the organization’s ability to function, and from which the majority of other data products are derived. These may be data products consumed in the implementation of key business activities, or associated with critical processes such as regulatory reporting and risk management. Organizational governance for these data products typically favors availability and data accuracy over agility.
  • Branch (cross-domain data products) – Cross-domain data products represent the most common cross-domain use cases for data (for example, joining customer data with product data). These data products may be widely used across business functions to support reporting and analytics, and—to a lesser extent—operational processes. Because these data products may consume a variety of sources, organizational governance may favor a balanced view on agility vs. reliability, accepting some degree of risk in return for being able to adapt to changes in data sources. Data product versioning can offer mitigation of risks associated with change.
  • Leaf (everything else) – These are the myriad data products that may arise within a data mesh, either as permanent additions to support individual teams and use cases or as temporary data products to fill data gaps or support time-limited initiatives. Because the number of these data products may be high and risks are typically limited to a single process or a small part of the organization, organizational governance typically favors a light touch and may prefer to govern through guidelines and best practices, rather than through active participation in the data product lifecycle.

Data products vs. data definitions

The following figure illustrates how data definitions are defined and inherited throughout the lineage of data products.

In a data mesh architecture, data products may inherit data from each other (one data product consumes another in its data pipeline) or independently publish data within (or related to) the same domain. For example, a customer data product may be inherited by a customer support data product, while another, the customer journey data product, may directly publish customer-relevant data from independent sources. When no standards are applied to how domain data attributes are used and published, data products even within the same data domain may lose interoperability because it becomes difficult or impossible to join them together for reporting or analytics purposes.

To prevent this, it can be useful to distinguish between data products and data definitions. Typically, organizations will select a single-threaded owner (often a data owner or steward, or a domain data owner or steward) who is responsible for defining minimal data definitions for common and reusable data entities within data domains. For example, a data owner responsible for the sales and marketing data domain may identify a customer data product as a reusable data entity within the domain and publish a minimal data definition that all producers of customer-relevant data must incorporate within their data products, to ensure that all data products associated with customer data are interoperable.

DPPM can assist in the identification and production of data definitions as part of its data product analysis activities, as well as enforce their incorporation as part of oversight of data product development.
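As a rough illustration of how a minimal data definition could be enforced, the following sketch checks whether a data product's schema incorporates every attribute of a shared definition. The attribute names, types, and the `missing_attributes` helper are hypothetical and are not part of DPPM or any AWS service.

```python
# Hypothetical minimal data definition for the "customer" entity.
# Attribute names and types are illustrative, not taken from the post.
CUSTOMER_DEFINITION = {
    "customer_id": "string",
    "country_code": "string",
    "created_at": "timestamp",
}


def missing_attributes(definition, product_schema):
    """Return definition attributes that the product omits or types differently."""
    return sorted(
        name
        for name, expected_type in definition.items()
        if product_schema.get(name) != expected_type
    )


# A customer support data product that inherits the shared attributes.
support_schema = {
    "customer_id": "string",
    "country_code": "string",
    "created_at": "timestamp",
    "open_tickets": "int",
}

# A customer journey data product published from independent sources,
# using its own key name instead of the shared one.
journey_schema = {
    "cust_key": "string",
    "country_code": "string",
    "created_at": "timestamp",
}

print(missing_attributes(CUSTOMER_DEFINITION, support_schema))
print(missing_attributes(CUSTOMER_DEFINITION, journey_schema))
```

In this sketch the journey data product would fail the check because it cannot be joined to other customer data products on the shared key.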

Service management thinking

These considerations relate to data product analysis, periodic evaluation, and methodology selection.

Data products are services provided to the organization or externally to customers and partners. As such, it may make sense to adapt a service management framework like ITIL, in combination with the ITIL Maturity Model, for use in evaluating the fitness of data products for their scope and audience, as well as in describing the roles, processes, and acceptable technologies that should form the operating model for any data product.

At the operational level, the stakeholders required to implement each practice may change depending on the scope of the data product. For example, the release management practice for a core data product may require involvement of the data council, whereas the same practice for a team data product may only involve the team or functional head. To avoid creating decision-making bottlenecks, organizations should aim to minimize the number of stakeholders in each case and focus on single-threaded owners wherever possible.

The following table proposes a subset of capabilities and how they might be applied to data products of different scopes. Suggested target maturity levels, between 1 and 5, are included for each scope. (1= Initial, 5= Optimizing)

Data product scope and target maturity

Capability                             Core     Cross-Domain   Function / Team   Personal
Target maturity                        4 – 5    3 – 4          2 – 3             2
Information Security Management        X        X              X                 X
Knowledge Management                   X        X              X                 -
Release Management                     X        X              X                 -
Service-Level Management               X        X              X                 -
Measurement and Reporting              X        X              -                 -
Availability Management                X        X              -                 -
Capacity and Performance Management    X        X              -                 -
Incident Management                    X        X              -                 -
Monitoring and Event Management        X        X              -                 -
Service Validation and Testing         X        X              -                 -
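The table above can also be kept as a small lookup structure, for example to drive checklists during data product analysis. This is only a sketch: the scope keys and the `required_capabilities` helper are names chosen for illustration, while the capability names and maturity ranges are transcribed from the table.

```python
# Scopes ordered from broadest to narrowest, as in the table columns.
SCOPES = ["core", "cross-domain", "function/team", "personal"]

# Suggested target maturity range (low, high) per scope.
TARGET_MATURITY = {
    "core": (4, 5),
    "cross-domain": (3, 4),
    "function/team": (2, 3),
    "personal": (2, 2),
}

# Number of leading scopes (left to right) each capability applies to.
CAPABILITY_BREADTH = {
    "Information Security Management": 4,
    "Knowledge Management": 3,
    "Release Management": 3,
    "Service-Level Management": 3,
    "Measurement and Reporting": 2,
    "Availability Management": 2,
    "Capacity and Performance Management": 2,
    "Incident Management": 2,
    "Monitoring and Event Management": 2,
    "Service Validation and Testing": 2,
}


def required_capabilities(scope):
    """List the capabilities marked for the given data product scope."""
    index = SCOPES.index(scope)
    return [name for name, breadth in CAPABILITY_BREADTH.items() if index < breadth]


print(required_capabilities("function/team"))
print(TARGET_MATURITY["function/team"])
```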

Methodology development

This practice deals with the development of concrete, objective frameworks, metrics, and processes for the measurement of data product value and risk. Because the driving factors behind risk and value are not necessarily the same between products, it may be necessary to develop several methodologies or variants thereof.

Key deliverables include the following:

  • Well-defined frameworks for measuring risk and value of data products, as well as for determining the optimal portfolio of data products
  • Operationally feasible, measurable metrics associated with value and risk

Key considerations

A key consideration for assessing data products is that of consumer value or risk vs. uniqueness. The following diagram illustrates how value and risk of a data product are driven by its consumers.

Data products don’t inherently present risk or add value, but rather indirectly pose—in an aggregated fashion—the risk and value created by their consumers.

In a consumer-centric value and risk model, governance of consumers ensures that all data use meets the following requirements:

  • Is associated with a business case justifying the use of data (for example, new business, cost reduction through business process automation, and so on)
  • Is regularly evaluated with reference to the risk associated with the use case (for example, regulatory reporting)

The value and risk associated with the linked data products are then calculated as an aggregation. Where organizations already track use cases associated with data, either as part of data privacy governance or as a by-product of the access approval process, these existing systems and databases can be reused or extended.

Conversely, where data products overlap with each other, their value to the organization is reduced accordingly, because redundancies between data products represent an inefficient use of resources and increase organizational complexity associated with data quality management.

To ensure that the model is operationally feasible (see the key deliverables of methodology development), it may be sufficient to consider simple aggregations, rather than attempting to calculate value and risk attribution at a product or use case level.
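A simple aggregation of this kind might look like the following sketch. The scoring scale, the `UseCase` fields, and the overlap discount are all assumptions made for illustration; the post does not prescribe a formula. Each use case's full value and risk are added to every data product it consumes, and a product's value is then reduced in proportion to its overlap with other products.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    value: float      # hypothetical business value score
    risk: float       # hypothetical risk score
    products: tuple   # names of the data products this use case consumes


def aggregate(use_cases, overlap=None):
    """Sum consumer value and risk per data product.

    `overlap` maps a product name to the fraction (0.0 to 1.0) of the
    product that is redundant with other products; value is discounted
    by that fraction, risk is left unchanged.
    """
    overlap = overlap or {}
    totals = {}
    for use_case in use_cases:
        for product in use_case.products:
            value, risk = totals.get(product, (0.0, 0.0))
            totals[product] = (value + use_case.value, risk + use_case.risk)
    return {
        product: (value * (1.0 - overlap.get(product, 0.0)), risk)
        for product, (value, risk) in totals.items()
    }


use_cases = [
    UseCase("regulatory reporting", 40.0, 8.0, ("customer", "transactions")),
    UseCase("churn model", 25.0, 2.0, ("customer",)),
]
scores = aggregate(use_cases, overlap={"transactions": 0.5})
for product, (value, risk) in sorted(scores.items()):
    print(product, value, risk)
```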

Optimal portfolio selection

This practice deals with the determination of which combination of data products (existing, new, or potential) would best meet the organization’s current and known future needs. This practice takes input from data product analysis and data product proposals, as well as other enterprise architecture practices (for example, business architecture), and considers trade-offs between data-debt and time-to-value, as well as other considerations such as redundancy between data products to determine the optimal mix of permanent and temporary data products at any given point in time.

Because the number of data products in an organization may become significant over time, it may be useful to apply heuristics to the problem of optimal portfolio selection. For example, it may be sufficient to consider core and cross-domain data products (trunk and branches) during quarterly portfolio reviews, with other data products (leaves) audited on a yearly basis.

Key deliverables include the following:

  • A target state definition for the data mesh, including all relevant data products
  • An indication of organizational priorities for use by the portfolio adjustment practice

Key considerations

The following are key considerations regarding the data product half-life:

  • Long-term or strategic data products – These data products fill a long-term organizational need, are often associated with key source systems in various domains, and anchor the overall data strategy. Over time, as an organization’s data mesh matures, long-term data products should form the bulk of the mesh.
  • Time-bound data products – These data products fill a gap in data strategy and allow the organization to move on data opportunities until core data products can be updated. An example of this might be data products created and used in the context of mergers and acquisitions transactions and post-acquisition, to provide consistent data for reporting and business intelligence until mid-term and long-term application consolidation has taken place. Time-bound data products are considered as data-debt and should be managed accordingly.
  • Purpose-driven data products – These data products serve a narrow, finite purpose. Purpose-driven data products may or may not be time-bound, but are characterized primarily by a strict set of consumers known in advance. Examples of this might include:
    • Data products developed to support system-of-record harmonization between lines of business (for example, deduplication of customer records between insurance lines of business using separate CRM systems)
    • Data products created explicitly for the monitoring of other data products (data quality, update frequency, and so on)

Portfolio adjustment

This practice implements the feasibility analysis, planning and project management, as well as communication and organizational change management activities associated with changes to the optimal portfolio. As part of this practice, a gap analysis is conducted between the current and target data product portfolio, and a set of required actions and estimated time and effort prepared for review by the organization. During such a period, data products may be marked for development (new data products to fill a need), changes, consolidation (merging two or more data products into a single data product), or deprecation. Several iterations of optimal portfolio selection and portfolio adjustment may be required to find an appropriate balance between optimality and feasibility of implementation.

Key deliverables include the following:

  • A gap analysis between the current and target data product portfolio, as well as proposed remediation activities
  • High-level project plans and effort or budget assessments associated with required changes, for approval by relevant stakeholders (such as the data council)
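The core of the gap analysis can be sketched as a set comparison between the current and target portfolios. The product names and the `gap_analysis` helper below are hypothetical, and the sketch covers only development, consolidation, and deprecation; changes to retained data products would need richer metadata than product names.

```python
def gap_analysis(current, target, merges=None):
    """Compare current and target portfolios given as sets of product names.

    `merges` maps a new target product to the set of current products
    that are to be consolidated into it.
    """
    merges = merges or {}
    merged_sources = {src for sources in merges.values() for src in sources}
    return {
        "develop": sorted(target - current - set(merges)),
        "consolidate": {new: sorted(srcs) for new, srcs in sorted(merges.items())},
        "deprecate": sorted(current - target - merged_sources),
        "retain": sorted(current & target),
    }


current = {"customer", "customer_legacy", "crm_extract", "orders", "fax_leads"}
target = {"customer", "orders", "customer_360", "returns"}
merges = {"customer_360": {"customer_legacy", "crm_extract"}}

plan = gap_analysis(current, target, merges)
for action, products in plan.items():
    print(action, products)
```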

Data product proposals

This practice organizes the collection and prioritization of requests for new, or changes to existing, data products within the organization. Its implementation may be adapted from or managed by existing demand management processes within the organization.

Key deliverables include a registry of demand against new or existing data products, including metadata on source systems, attributes, known use cases, proposed data product owners, and suggested organizational priority.

Methodology selection

This practice is associated with the identification and application of the most appropriate methodologies (such as value and risk) during data product analysis, screening, and optimal portfolio selection. The selection of an appropriate methodology for the type, maturity, and scope of a data product (or an entire portfolio) is a key element in avoiding either a “Kudzu” mesh or a “data desert.”

Key deliverables include reusable selection criteria for mapping methodologies to data products during data product analysis, screening, and optimal portfolio selection.

Pre-screening

This optional practice is primarily a mechanism to avoid unnecessary time and effort in the comparatively expensive practice of data product analysis by offering simple self-service applications of guidelines to the evaluation of data products. An example might include the automated approval of data products that fall under the classification of personal data products, requiring only attestation on the part of the requester that they will uphold the relevant portions of the guideline that governs such data products.

Key deliverables include tools and checklists for the self-service evaluation of data products against guidelines and automated registration of approved data products.
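A self-service pre-screening rule such as the personal data product example could be as small as the following sketch. The `Proposal` fields and the outcome labels are assumptions for illustration; a real implementation would sit inside the organization's demand management tooling.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    scope: str        # for example "personal", "function/team", "core"
    attested: bool    # requester attests to uphold the relevant guideline


def pre_screen(proposal):
    """Auto-approve attested personal data products; route the rest onward."""
    if proposal.scope == "personal":
        return "approved" if proposal.attested else "attestation required"
    return "data product analysis"


print(pre_screen(Proposal("my-sandbox-extract", "personal", attested=True)))
print(pre_screen(Proposal("sales-pipeline", "cross-domain", attested=True)))
```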

Data product analysis

This practice incorporates guidelines, methodologies, as well as (where available) metadata relating to data products (performance against SLOs, service management metrics, degree of overlap with other data products) to establish an understanding of the value and risk associated with individual data products, as well as gaps between current and target capability maturities, and compliance with published product definitions and standards.

Key deliverables include a summary of findings for a particular data product, including scores for relevant value, risk, and maturity metrics, as well as operational gaps requiring remediation and recommendations on next steps (repair, enhance, decommission, and so on).

Screening

This optional practice is a mechanism to reduce complexity in optimal portfolio selection by ensuring the early removal of data products from consideration that fail to meet value or risk targets, or have been identified as redundant to other data products already available in the organization.

Key deliverables include a list of data products that should be slated for removal (direct-to-decommissioning).

Data product development

This practice is not performed directly under DPPM, but is managed in part by the portfolio adjustment practice, and may be governed by standards that are developed as part of DPPM. In the context of DPPM, this practice is primarily associated with ensuring that data products are developed according to the specifications agreed as part of portfolio adjustment.

Key deliverables include project management and software or service development deliverables and artefacts.

Data product decommissioning

This practice manages the decommissioning of data products and the migration of affected consumers to new or other data products where relevant. Unmanaged decommissioning of data products, especially those with many downstream consumers, can threaten the stability of the entire data mesh, as well as have significant consequences to business functions.

Key deliverables include a decommissioning plan, including stakeholder assessment and sign-off, timelines, migration plans for affected consumers, and back-out strategies.

Periodic evaluation

This practice manages the calendar and implementation of periodic reviews of the data mesh, both in its entirety as well as at the data product level, and is primarily an exercise in project management.

Key deliverables include the following:

  • A yearly review calendar, published and made available to all data product owners and affected stakeholders
  • Project management deliverables and artefacts, including evidence of evaluations having been performed against each data product

Technology

Although most practices within DPPM don’t rely heavily on technology and automation, some key supporting applications and services are required to implement DPPM effectively:

  • Data catalog – Core to the delivery of DPPM is the organizational data catalog. Beyond providing transparency into what data products exist within an organization, a data catalog can provide key insights into data lineage between data products (key to the implementation of portfolio adjustment) and adoption of data products by the organization. The data catalog can also be used to capture and make available both the documented as well as the realized SLO for any given data product, and—through the use of a business glossary—assist in the identification of redundancy between data products.
  • Service management – Service management solutions (such as ServiceNow) used in the context of data product management offer important insights into the fitness of data products by capturing and tracking incidents, problems, requests, and other metrics against data products.
  • Demand management – A demand management solution supports self-service implementation and automation of data product proposal and pre-screening activities, as well as prioritization activities associated with selection and development of data products.

Conclusion

Although this post focused on implementing DPPM in the context of a data mesh, this capability, like data product thinking, is not exclusive to data mesh architectures. The practices outlined here can be applied at any scale to ensure that the production and use of data within the organization is always in line with its current and future needs, that governance is implemented in a consistent way, and that the organization can have Bonsai, not Kudzu.

For more information about data mesh and data management, refer to the following:

In upcoming posts, we will cover other aspects of data mesh operating models, including data mesh supervision and service management models for data product owners.


About the Authors


Maximilian Mayrhofer
is a Principal Solutions Architect working in the AWS Financial Services EMEA Go-to-Market team. He has over 12 years of experience in digital transformation within private banking and asset management. In his free time, he is an avid reader of science fiction and enjoys bouldering.


Faris Haddad
is the Data & Insights Lead in the AABG Strategic Pursuits team. He helps enterprises successfully become data-driven.

How Ontraport reduced data processing cost by 80% with AWS Glue

Post Syndicated from Elijah Ball original https://aws.amazon.com/blogs/big-data/how-ontraport-reduced-data-processing-cost-by-80-with-aws-glue/

This post is written in collaboration with Elijah Ball from Ontraport.

Customers are implementing data and analytics workloads in the AWS Cloud to optimize cost. When implementing data processing workloads in AWS, you have the option to use technologies like Amazon EMR or serverless technologies like AWS Glue. Both options minimize undifferentiated heavy lifting, such as managing servers, performing upgrades, and deploying security patches, and allow you to focus on what is important: meeting core business objectives. The differences between the two approaches can play a critical role in enabling your organization to be more productive and innovative, while also saving money and resources.

Services like Amazon EMR focus on offering you flexibility to support data processing workloads at scale using frameworks you’re accustomed to. For example, with Amazon EMR, you can choose from multiple open-source data processing frameworks such as Apache Spark, Apache Hive, and Presto, and fine-tune workloads by customizing things such as cluster instance types on Amazon Elastic Compute Cloud (Amazon EC2) or use containerized environments running on Amazon Elastic Kubernetes Service (Amazon EKS). This option is best suited when migrating workloads from big data environments like Apache Hadoop or Spark, or when used by teams that are familiar with open-source frameworks supported on Amazon EMR.

Serverless services like AWS Glue minimize the need to think about servers and focus on offering additional productivity and DataOps tooling for accelerating data pipeline development. AWS Glue is a serverless data integration service that helps analytics users discover, prepare, move, and integrate data from multiple sources via a low-code or no-code approach. This option is best suited when organizations are resource-constrained and need to build data processing workloads at scale with limited expertise, allowing them to expedite development and reduce total cost of ownership (TCO).

In this post, we show how our AWS customer Ontraport evaluated the use of AWS Glue and Amazon EMR to reduce TCO, and how they reduced their storage cost by 92% and their processing cost by 80% with only one full-time developer.

Ontraport’s workload and solution

Ontraport is a CRM and automation service that powers businesses’ marketing, sales and operations all in one place—empowering businesses to grow faster and deliver more value to their customers.

Log processing and analysis is critical to Ontraport. It allows them to provide better services and insight to customers, such as email campaign optimization. For example, email logs alone record 3–4 events for every one of the 15–20 million messages Ontraport sends on behalf of their clients each day. Analysis of email transactions with providers such as Google and Microsoft allows Ontraport’s delivery team to optimize open rates for the campaigns of clients with big contact lists.

Some of the big log contributors are web server and CDN events, email transaction records, and custom event logs within Ontraport’s proprietary applications. The following is a sample breakdown of their daily log contributions:

Log source                  Daily records
Cloudflare request logs     75 million
CloudFront request logs     2 million
Nginx/Apache logs           20 million
Email logs                  50 million
General server logs         50 million
Ontraport app logs          6 million

Ontraport’s solution uses Amazon Kinesis and Amazon Kinesis Data Firehose to ingest log data and write recent records into an Amazon OpenSearch Service database, from where analysts and administrators can analyze the last 3 months of data. Custom application logs record interactions with the Ontraport CRM so client accounts can be audited or recovered by the customer support team. Originally, all logs were retained back to 2018. Retention is multi-leveled by age:

  • Less than 1 week – OpenSearch hot storage
  • Between 1 week and 3 months – OpenSearch cold storage
  • More than 3 months – Extract, transform, and load (ETL) processed in Amazon Simple Storage Service (Amazon S3), available through Amazon Athena
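The age-based retention tiers above can be expressed as a small lookup. This is a sketch only: the `storage_tier` helper is hypothetical, and "3 months" is approximated as 90 days because the post does not give an exact cutoff.

```python
from datetime import date, timedelta


def storage_tier(record_date, today):
    """Map the age of a log record to the retention tier described above."""
    age = today - record_date
    if age < timedelta(weeks=1):
        return "OpenSearch hot storage"
    if age <= timedelta(days=90):  # assumption: 3 months treated as 90 days
        return "OpenSearch cold storage"
    return "Amazon S3 (queried through Athena)"


today = date(2023, 6, 30)
for record_date in (date(2023, 6, 28), date(2023, 5, 1), date(2022, 1, 1)):
    print(record_date, "->", storage_tier(record_date, today))
```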

The following diagram shows the architecture of their log processing and analytics data pipeline.

Evaluating the optimal solution

In order to optimize storage and analysis of their historical records in Amazon S3, Ontraport implemented an ETL process to transform and compress TSV and JSON files into Parquet files with partitioning by the hour. The compression and transformation helped Ontraport reduce their S3 storage costs by 92%.

In phase 1, Ontraport implemented an ETL workload with Amazon EMR. Given the scale of their data (hundreds of billions of rows) and only one developer, Ontraport’s first attempt at the Apache Spark application required a 16-node EMR cluster with r5.12xlarge core and task nodes. The configuration allowed the developer to process 1 year of data and minimize out-of-memory issues with a rough version of the Spark ETL application.

To help optimize the workload, Ontraport reached out to AWS for optimization recommendations. There were a considerable number of options to optimize the workload within Amazon EMR, such as right-sizing Amazon EC2 instance types based on the workload profile, modifying the Spark YARN memory configuration, and rewriting portions of the Spark code. Considering the resource constraints (only one full-time developer), the AWS team recommended exploring similar logic with AWS Glue Studio.

Some of the initial benefits with using AWS Glue for this workload include the following:

  • AWS Glue has the concept of crawlers that provide a no-code approach to catalog data sources and identify schema from multiple data sources, in this case, Amazon S3.
  • AWS Glue provides built-in data processing capabilities with abstract methods on top of Spark that reduce the overhead required to develop efficient data processing code. For example, AWS Glue supports a DynamicFrame class corresponding to a Spark DataFrame that provides additional flexibility when working with semi-structured datasets and can be quickly transformed into a Spark DataFrame. DynamicFrames can be generated directly from crawled tables or directly from files in Amazon S3. See the following example code:
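    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())
    # Read JSON files from Amazon S3 directly into a DynamicFrame
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={'paths': ['s3://<bucket/paths>']},
        format='json')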

  • It minimizes the need for Ontraport to right-size instance types and auto scaling configurations.
  • Using AWS Glue Studio interactive sessions allows Ontraport to quickly iterate when code changes were needed after detecting schema evolution in historical logs.

Ontraport had to process 100 terabytes of log data. The cost of processing each terabyte with the initial configuration was approximately $500. That cost came down to approximately $100 per terabyte after using AWS Glue. By using AWS Glue and AWS Glue Studio, Ontraport’s cost of processing the jobs was reduced by 80%.
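The arithmetic behind the 80% figure can be checked with a few lines. The totals below assume that the approximate per-terabyte costs apply uniformly across all 100 terabytes, which the post implies but does not state.

```python
DATA_TB = 100
INITIAL_COST_PER_TB = 500  # approximate, initial Amazon EMR configuration
GLUE_COST_PER_TB = 100     # approximate, after moving to AWS Glue

initial_total = DATA_TB * INITIAL_COST_PER_TB
glue_total = DATA_TB * GLUE_COST_PER_TB
reduction_pct = 100 * (initial_total - glue_total) / initial_total

print(f"Initial configuration: ${initial_total:,}")
print(f"AWS Glue:              ${glue_total:,}")
print(f"Reduction:             {reduction_pct:.0f}%")
```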

Diving deep into the AWS Glue workload

Ontraport’s first AWS Glue application was a PySpark workload that ingested data from TSV and JSON files in Amazon S3, performed basic transformations on timestamp fields, and converted the data types of a couple of fields. Finally, it wrote the output data into a curated S3 bucket as compressed Parquet files of approximately 1 GB in size, partitioned in 1-hour intervals to optimize for queries with Athena.
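To illustrate what partitioning in 1-hour intervals means for the output layout, the following sketch derives a partition prefix from an event timestamp. The Hive-style `year=/month=/day=/hour=` naming and the `hourly_partition` helper are assumptions; the post does not describe Ontraport's actual key layout.

```python
from datetime import datetime, timezone


def hourly_partition(epoch_seconds):
    """Return a Hive-style partition prefix for the hour containing the event."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}"


# 1672579800 is 2023-01-01 13:30:00 UTC
print(hourly_partition(1672579800))
```

With a layout like this, a query engine can skip every prefix outside the requested time range instead of scanning the whole dataset.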

With an AWS Glue job configured with 10 workers of type G.2X, Ontraport was able to process approximately 500 million records in less than 60 minutes. When processing 10 billion records, they were able to increase the job configuration to a maximum of 100 workers with auto scaling enabled to complete the job within 1 hour.

What’s next?

Ontraport has been able to process logs dating back to 2018. The team is updating the processing code to handle schema evolution (such as new fields) and parameterizing some components to fully automate the batch processing. They are also looking to fine-tune the number of provisioned AWS Glue workers to obtain optimal price-performance.

Conclusion

In this post, we showed you how Ontraport used AWS Glue to help reduce development overhead and simplify development efforts for their ETL workloads with only one full-time developer. Although services like Amazon EMR offer great flexibility and optimization, the ease of use and simplification in AWS Glue often offer a faster path for cost-optimization and innovation for small and medium businesses. For more information about AWS Glue, check out Getting Started with AWS Glue.


About the Authors

Elijah Ball has been a Sys Admin at Ontraport for 12 years. He is currently working to move Ontraport’s production workloads to AWS and develop data analysis strategies for Ontraport.

Pablo Redondo is a Principal Solutions Architect at Amazon Web Services. He is a data enthusiast with over 16 years of FinTech and healthcare industry experience and is a member of the AWS Analytics Technical Field Community (TFC). Pablo has been leading the AWS Gain Insights Program to help AWS customers achieve better insights and tangible business value from their data analytics initiatives.

Vikram Honmurgi is a Customer Solutions Manager at Amazon Web Services. With over 15 years of software delivery experience, Vikram is passionate about assisting customers and accelerating their cloud journey, delivering frictionless migrations, and ensuring our customers capture the full potential and sustainable business advantages of migrating to the AWS Cloud.

Configure fine-grained access to your resources shared using AWS Resource Access Manager

Post Syndicated from Fabian Labat original https://aws.amazon.com/blogs/security/configure-fine-grained-access-to-your-resources-shared-using-aws-resource-access-manager/

You can use AWS Resource Access Manager (AWS RAM) to securely, simply, and consistently share supported resource types within your organization or organizational units (OUs) and across AWS accounts. This means you can provision your resources once and use AWS RAM to share them with accounts. With AWS RAM, the accounts that receive the shared resources can list those resources alongside the resources they own.

When you share your resources by using AWS RAM, you can specify the actions that an account can perform and the access conditions on the shared resource. AWS RAM provides AWS managed permissions, which are created and maintained by AWS and which grant permissions for common customer scenarios. Now, you can further tailor resource access by authoring and applying fine-grained customer managed permissions in AWS RAM. A customer managed permission is a managed permission that you create to precisely specify who can do what under which conditions for the resource types included in your resource share.

This blog post walks you through how to use customer managed permissions to tailor your resource access to meet your business and security needs. Customer managed permissions help you follow the best practice of least privilege for your resources that are shared using AWS RAM.

Considerations

Before you start, review the considerations for using customer managed permissions for supported resource types in the AWS RAM User Guide.

Solution overview

Many AWS customers share infrastructure services with accounts in an organization from a centralized infrastructure OU. The networking account in the infrastructure OU follows the best practice of least privilege and grants only the permissions that accounts receiving these resources, such as development accounts, require to perform a specific task. The solution in this post demonstrates how you can share an Amazon Virtual Private Cloud (Amazon VPC) IP Address Manager (IPAM) pool with the accounts in a Development OU. IPAM makes it simpler for you to plan, track, and monitor IP addresses for your AWS workloads.

You’ll use a networking account that owns an IPAM pool to share the pool with the accounts in a Development OU. You’ll do this by creating a resource share and a customer managed permission through AWS RAM. In this example, shown in Figure 1, both the networking account and the Development OU are in the same organization. The accounts in the Development OU only need the permissions that are required to allocate a classless inter-domain routing (CIDR) range and not to view the IPAM pool details. You’ll further refine access to the shared IPAM pool so that only AWS Identity and Access Management (IAM) users or roles tagged with team = networking can perform actions on the IPAM pool that’s shared using AWS RAM.

Figure 1: Multi-account diagram for sharing your IPAM pool from a networking account in the Infrastructure OU to accounts in the Development OU


Prerequisites

For this walkthrough, you must have the following prerequisites:

  • An AWS account (the networking account) with an IPAM pool already provisioned. For this example, create an IPAM pool in a networking account named ipam-vpc-pool-use1-dev. Because you share resources across accounts in the same AWS Region using AWS RAM, provision the IPAM pool in the same Region where your development accounts will access the pool.
  • An AWS OU with the associated development accounts to share the IPAM pool with. In this example, these accounts are in your Development OU.
  • An IAM role or user with permissions to perform IPAM and AWS RAM operations in the networking account and the development accounts.

Share your IPAM pool with your Development OU with least privilege permissions

In this section, you share an IPAM pool from your networking account to the accounts in your Development OU and grant least-privilege permissions. To do that, you create a resource share that contains your IPAM pool, your customer managed permission for the IPAM pool, and the OU principal you want to share the IPAM pool with. A resource share contains resources you want to share, the principals you want to share the resources with, and the managed permissions that grant resource access to the account receiving the resources. You can add the IPAM pool to an existing resource share, or you can create a new resource share. Depending on your workflow, you can start creating a resource share either in the Amazon VPC IPAM or in the AWS RAM console.

To initiate a new resource share from the Amazon VPC IPAM console

  1. Sign in to the AWS Management Console as your networking account. For Features, select Amazon VPC IP Address Manager console.
  2. Select ipam-vpc-pool-use1-dev, which was provisioned as part of the prerequisites.
  3. On the IPAM pool detail page, choose the Resource sharing tab.
  4. Choose Create resource share.
     
Figure 2: Create resource share to share your IPAM pool


Alternatively, you can initiate a new resource share from the AWS RAM console.

To initiate a new resource share from the AWS RAM console

  1. Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
  2. Choose Create resource share.

Next, specify the resource share details, including the name, the resource type, and the specific resource you want to share. Note that the steps of the resource share creation process are located on the left side of the AWS RAM console.

To specify the resource share details

  1. For Name, enter ipam-shared-dev-pool.
  2. For Select resource type, choose IPAM pools.
  3. For Resources, select the Amazon Resource Name (ARN) of the IPAM pool you want to share from a list of the IPAM pool ARNs you own.
  4. Choose Next.
     
Figure 3: Specify the resources to share in your resource share


Configure customer managed permissions

In this example, the accounts in the Development OU need the permissions required to allocate a CIDR range, but not the permissions to view the IPAM pool details. The existing AWS managed permission grants both read and write permissions. Therefore, you need to create a customer managed permission to refine the resource access permissions for your accounts in the Development OU. With a customer managed permission, you can select and tailor the actions that the development accounts can perform on the IPAM pool, such as write-only actions.

In this section, you create a customer managed permission, configure the managed permission name, select the resource type, and choose the actions that are allowed with the shared resource.

To create and author a customer managed permission

  1. On the Associate managed permissions page, choose Create customer managed permission. This will bring up a new browser tab with a Create a customer managed permission page.
  2. On the Create a customer managed permission page, enter my-ipam-cmp for the Customer managed permission name.
  3. Confirm the Resource type as ec2:IpamPool.
  4. On the Visual editor tab of the Policy template section, select the Write checkbox only. This will automatically check all the available write actions.
  5. Choose Create customer managed permission.
     
Figure 4: Create a customer managed permission with only write actions


Now that you’ve created your customer managed permission, you must associate it to your resource share.

To associate your customer managed permission

  1. Go back to the previous Associate managed permissions page. This is most likely located in a separate browser tab.
  2. Choose the refresh icon.
  3. Select my-ipam-cmp from the dropdown menu.
  4. Review the policy template, and then choose Next.

Next, select the IAM roles, IAM users, AWS accounts, AWS OUs, or organization you want to share your IPAM pool with. In this example, you share the IPAM pool with an OU in your organization.

To grant access to principals

  1. On the Grant access to principals page, select Allow sharing only with your organization.
  2. For Select principal type, choose Organizational unit (OU).
  3. Enter the Development OU’s ID.
  4. Select Add, and then choose Next.
  5. Choose Create resource share to complete creation of your resource share.
     
Figure 5: Grant access to principals in your resource share


Verify the customer managed permissions

Now let’s verify that the customer managed permission is working as expected. In this section, you verify that the development account cannot view the details of the IPAM pool and that you can use that same account to create a VPC with the IPAM pool.

To verify that an account in your Development OU can’t view the IPAM pool details

  1. Sign in to the AWS Management Console as an account in your Development OU. For Features, select Amazon VPC IP Address Manager console.
  2. In the left navigation pane, choose Pools.
  3. Select ipam-shared-dev-pool. You won’t be able to view the IPAM pool details.

To verify that an account in your Development OU can create a new VPC with the IPAM pool

  1. Sign in to the AWS Management Console as an account in your Development OU. For Services, select VPC console.
  2. On the VPC dashboard, choose Create VPC.
  3. On the Create VPC page, select VPC only.
  4. For name, enter my-dev-vpc.
  5. Select IPAM-allocated IPv4 CIDR block.
  6. Choose the ARN of the IPAM pool that’s shared with your development account.
  7. For Netmask, select /24 256 IPs.
  8. Choose Create VPC. You’ve successfully created a VPC with the IPAM pool shared with your account in your Development OU.
     
Figure 6: Create a VPC


Update customer managed permissions

You can create a new version of your customer managed permission to rescope and update the access granularity of your resources that are shared using AWS RAM. For example, you can add a condition in your customer managed permissions so that only IAM users or roles tagged with a particular principal tag can access and perform the actions allowed on resources shared using AWS RAM. If you need to update your customer managed permission (for example, after testing or as your business and security needs evolve), you can create and save a new version of the same customer managed permission rather than creating an entirely new customer managed permission. For example, you might want to adjust your access configurations to read-only actions for your development accounts and to rescope to read-write actions for your testing accounts. The new version of the permission won’t apply automatically to your existing resource shares, and you must explicitly apply it to those shares for it to take effect.

To create a version of your customer managed permission

  1. Sign in to the AWS Management Console as your networking account. For Services, select Resource Access Manager console.
  2. In the left navigation pane, choose Managed permissions library.
  3. For Filter by text, enter my-ipam-cmp and select my-ipam-cmp. You can also select the Any type dropdown menu and then select Customer managed to narrow the list of managed permissions to only your customer managed permissions.
  4. On the my-ipam-cmp page, choose Create version.
  5. You can make the customer managed permission more fine-grained by adding a condition. On the Create a customer managed permission for my-ipam-cmp page, under the Policy template section, choose JSON editor.
  6. Add a condition with aws:PrincipalTag that allows only the users or roles tagged with team = networking to access the shared IPAM pool.
    "Condition": {
                    "StringEquals": {
                        "aws:PrincipalTag/team": "networking"
                    }
                }

  7. Choose Create version. This new version will be automatically set as the default version of your customer managed permission. As a result, new resource shares that use the customer managed permission will use the new version.
     
Figure 7: Update your customer managed permissions and add a condition statement with aws:PrincipalTag
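
The condition from step 6 sits next to the Effect and Action elements of the policy template. The following Python sketch shows one way such a template could be assembled before pasting it into the JSON editor. The action names are illustrative assumptions, not necessarily the exact set the console selects when you choose Write, and with_principal_tag is a hypothetical helper.

```python
import copy
import json

# Hypothetical starting template. The action list is an assumption for
# illustration; use the actions shown in your own JSON editor.
BASE_TEMPLATE = {
    "Effect": "Allow",
    "Action": [
        "ec2:AllocateIpamPoolCidr",
        "ec2:AssociateVpcCidrBlock",
        "ec2:CreateVpc",
    ],
}


def with_principal_tag(template, tag_key, tag_value):
    """Return a copy of the template restricted to principals with the tag."""
    restricted = copy.deepcopy(template)
    restricted["Condition"] = {
        "StringEquals": {f"aws:PrincipalTag/{tag_key}": tag_value}
    }
    return restricted


policy = with_principal_tag(BASE_TEMPLATE, "team", "networking")
print(json.dumps(policy, indent=4))
```

Because the helper copies the template, the unconditioned version stays available if you later want to compare the two permission versions.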


Note: Now that you have the new version of your customer managed permission, you must explicitly apply it to your existing resource shares for it to take effect.

To apply the new version of the customer managed permission to existing resource shares

  1. On the my-ipam-cmp page, under the Managed permission versions, select Version 1.
  2. Choose the Associated resource shares tab.
  3. Find ipam-shared-dev-pool and next to the current version number, select Update to default version. This will update your ipam-shared-dev-pool resource share with the new version of your my-ipam-cmp customer managed permission.

To verify your updated customer managed permission, see the Verify the customer managed permissions section earlier in this post. Make sure that you sign in with an IAM role or user tagged with team = networking, and then repeat the steps of that section to verify your updated customer managed permission. If you use an IAM role or user that is not tagged with team = networking, you won’t be able to allocate a CIDR from the IPAM pool and you won’t be able to create the VPC.

Cleanup

To remove the resources created by the preceding example:

  1. Delete the resource share from the AWS RAM console.
  2. Deprovision the CIDR from the IPAM pool.
  3. Delete the IPAM pool you created.

Summary

This blog post presented an example of using customer managed permissions in AWS RAM. AWS RAM brings simplicity, consistency, and confidence when sharing your resources across accounts. In the example, you used AWS RAM to share an IPAM pool to accounts in a Development OU, configured fine-grained resource access controls, and followed the best practice of least privilege by granting only the permissions required for the accounts in the Development OU to perform a specific task with the shared IPAM pool. In the example, you also created a new version of your customer managed permission to rescope the access granularity of your resources that are shared using AWS RAM.

To learn more about AWS RAM and customer managed permissions, see the AWS RAM documentation and watch the AWS RAM Introduces Customer Managed Permissions demo.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Fabian Labat


Fabian is a principal solutions architect based in New York, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 25 years of technology experience in system design and IT infrastructure.

Nini Ren


Nini is the product manager for AWS Resource Access Manager (RAM). He enjoys working closely with customers to develop solutions that not only meet their needs, but also create value for their businesses. Nini holds an MBA from The Wharton School, a masters of computer and information technology from the University of Pennsylvania, and an AB in chemistry and physics from Harvard College.

How FIS ingests and searches vector data for quick ticket resolution with Amazon OpenSearch Service

Post Syndicated from Rupesh Tiwari original https://aws.amazon.com/blogs/big-data/how-fis-ingests-and-searches-vector-data-for-quick-ticket-resolution-with-amazon-opensearch-service/

This post was co-written by Sheel Saket, Senior Data Science Manager at FIS, and Rupesh Tiwari, Senior Architect at Amazon Web Services.

Do you ever find yourself grappling with multiple defect logging mechanisms, scattered project management tools, and fragmented software development platforms? Have you experienced the frustration of lacking a unified view, hindering your ability to efficiently manage and identify common trending issues within your enterprise? Are you constantly facing challenges when it comes to addressing defects and their impact, causing disruptions in your production cycles?

If these questions resonate with you, then you’re not alone. FIS, a leading technology and services provider, has encountered these very challenges. In their quest for a solution, they teamed up with AWS to tackle these obstacles head-on. In this post, we take you on a journey through their collaborative project, exploring how they used Amazon OpenSearch Service to transform their operations, enhance efficiency, and gain valuable insights.

This post shares FIS’s journey in overcoming challenges and provides step-by-step instructions for provisioning the solution architecture in your AWS account. You’ll learn how to implement a transformative solution that empowers your organization with near-real-time data indexing and visualization capabilities.

In the following sections, we dive into the details of FIS’s journey and discover how they overcame these challenges, revolutionizing their approach to defect management and software development.

Challenges for near-real-time ticket visualization and search

FIS faced several challenges in achieving near-real-time ticket visualization and search capabilities, including the following:

  • Integrating ticket data from tens of different third-party systems
  • Overcoming API call thresholds and limitations from various systems
  • Implementing an efficient KNN vector search algorithm for resolving issues and performing trend analysis
  • Establishing a robust data ingestion and indexing process for real-time updates from 15,000 tickets per day
  • Ensuring unified access to ticket information across 20 development teams
  • Providing secure and scalable access to ticket data for up to 250 teams

Despite these challenges, FIS successfully enhanced their operational efficiency, enabled quick ticket resolution, and gained valuable insights through the integration of OpenSearch Service.

Let’s delve into the technical walkthrough of the architecture diagram and mechanisms. The following section provides step-by-step instructions for provisioning and implementing the solution on your AWS Management Console, along with a helpful video tutorial.

Solution overview

The architecture diagram of FIS’s near-real-time data indexing and visualization solution incorporates various AWS services for specific functions. The solution uses GitHub as the data source, employs Amazon Simple Storage Service (Amazon S3) for scalable storage, manages APIs with Amazon API Gateway, performs serverless computing using AWS Lambda, and facilitates data streaming and ETL (extract, transform, and load) processes through Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose. OpenSearch Service is employed for analytics and application monitoring. This architecture ensures a robust and scalable solution, enabling FIS to efficiently index and visualize data in near-real time. With these AWS services, FIS effectively manages their data pipeline and gains valuable insights for their business processes.

The following diagram illustrates the solution architecture.

Architecture Diagram

The workflow includes the following steps:

  1. GitHub webhook events stream data to both Amazon S3 and OpenSearch Service, facilitating real-time data analysis.
  2. A Lambda function connects to an API Gateway REST API, processing and structuring the received payloads.
  3. The Lambda function adds the structured data to a Kinesis data stream, enabling immediate data streaming and quick ticket insights.
  4. Kinesis Data Firehose streams the records from the Kinesis data stream to an S3 bucket, simultaneously creating an index in OpenSearch Service.
  5. OpenSearch Service uses the indexed data to provide near-real-time visualization and enable efficient ticket analysis through K-Nearest Neighbor (KNN) search, enhancing productivity and optimizing data operations.
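
To make step 5 more concrete, the following Python sketch builds the request body for an approximate k-NN query of the kind OpenSearch Service accepts on a knn_vector field. The field name ticket_embedding and the example vector are hypothetical, and this post doesn't describe how FIS generates its ticket embeddings.

```python
import json


def build_knn_query(field, vector, k=5):
    """Build an OpenSearch k-NN search body for a knn_vector field."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }


# Hypothetical field name and a toy 4-dimensional embedding.
body = build_knn_query("ticket_embedding", [0.12, -0.40, 0.88, 0.05], k=3)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against an index whose mapping defines the field as a knn_vector with a matching dimension.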

The following sections provide step-by-step instructions for setting up the solution. Additionally, we have created a video guide that demonstrates each step in detail. You are welcome to watch the video and follow along with this post if you prefer.

Prerequisites

You should have the following prerequisites:

Implement the solution

Complete the following steps to implement the solution:

  1. Create an OpenSearch Service domain.
  2. Create an S3 bucket named git-data.
  3. Create a Kinesis data stream named git-data-stream.
  4. Create a Firehose delivery stream named git-data-delivery-stream with git-data-stream as the source and git-data as the destination, and a buffer interval of 60 seconds.
  5. Create a Lambda function named git-webhook-handler with a timeout of 5 minutes. Add code to add data to the Kinesis data stream.
  6. Grant the Lambda function’s execution role permission to call PutRecord (kinesis:PutRecord) on the Kinesis data stream.
  7. Create a REST API in API Gateway named git-webhook-handler-api. Create a resource named git-data with a POST method, integrate it with the Lambda function git-webhook-handler created in the previous step, and deploy the REST API.
  8. Create a second Firehose delivery stream with the Kinesis data stream as the source and OpenSearch Service as the destination. Give the AWS Identity and Access Management (IAM) role for Kinesis Data Firehose the permissions it needs to create an index in OpenSearch Service. Finally, map that IAM role as a backend role in OpenSearch Service security.
  9. Navigate to your GitHub repository and create a webhook. Copy the invoke URL of the deployed REST API and enter it as the payload URL of the new webhook.
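
The code for step 5 isn't included in this post. As a minimal sketch, a handler along the following lines could forward each webhook payload to the stream. It assumes the API Gateway method uses Lambda proxy integration (so the payload arrives as a string in event["body"]), that the repository name makes a reasonable partition key, and that the STREAM_NAME environment variable is a hypothetical override. The client parameter exists only so the function can be exercised without AWS credentials.

```python
import json
import os

# Stream name from step 3; STREAM_NAME is a hypothetical override.
STREAM_NAME = os.environ.get("STREAM_NAME", "git-data-stream")


def build_record(event):
    """Turn an API Gateway proxy event into put_record keyword arguments."""
    payload = json.loads(event.get("body") or "{}")
    # Assumption: partition by repository so one repo's events stay ordered.
    repo = (payload.get("repository") or {}).get("full_name", "unknown")
    return {
        "StreamName": STREAM_NAME,
        "Data": json.dumps(payload).encode("utf-8"),
        "PartitionKey": repo,
    }


def lambda_handler(event, context, client=None):
    """Forward a GitHub webhook payload to the Kinesis data stream."""
    if client is None:
        import boto3  # provided by the Lambda Python runtime

        client = boto3.client("kinesis")
    response = client.put_record(**build_record(event))
    # Echo the shard and sequence number so they show up in GitHub's
    # recent deliveries view.
    return {
        "statusCode": 200,
        "body": json.dumps(
            {
                "ShardId": response["ShardId"],
                "SequenceNumber": response["SequenceNumber"],
            }
        ),
    }
```

Returning the ShardId and SequenceNumber in the response body is a design choice that makes the delivery easy to confirm from the GitHub side.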

Test the solution

To test the solution, complete the following steps:

  1. Go to your GitHub repository and choose the Star button, and verify that you receive a response with a status code of 200.
  2. Also, check for the ShardId and SequenceNumber in the recent deliveries to confirm successful event addition to the Kinesis data stream.

Kinesis data stream

  3. On the Kinesis console, use the Data Viewer to confirm the arrival of data records.

kinesis record data

  4. Navigate to OpenSearch Dashboards and choose Dev Tools.
  5. Search for the records and observe that all the Git events are displayed in the result pane.

opensearch devtool

  6. On the Amazon S3 console, open the bucket and view the data records.

s3 bucket records

Security

We adhere to IAM best practices to uphold security:

  1. Craft a Lambda execution role for read/write operations on the Kinesis data stream.
  2. Generate an IAM role for Kinesis Data Firehose to manage Amazon S3 and OpenSearch Service access.
  3. Link this IAM role in OpenSearch Service security to confer backend user privileges.

Clean up

To avoid incurring future charges, delete all the resources you created.

Benefits of near-real-time ticket visualization and search

During our demonstration, we showcased the utilization of GitHub as the streaming data source. However, it’s important to note that the solution we presented has the flexibility to scale and incorporate multiple data sources from various services. This allows for the consolidation and visualization of diverse data in near-real time, using the capabilities of OpenSearch Service.

With the implementation of the solution described in this post, FIS effectively overcame all the challenges they faced.

In this section, we delve into the details of the challenges and benefits they achieved:

  • Integrating ticket data from multiple third-party systems – Near-real-time data streaming ensures an up-to-date information flow from third-party providers for timely insights
  • Overcoming API call thresholds and limitations imposed by different systems – Unrestricted data flow with no threshold or rate limiting enables seamless integration and continuous updates
  • Accommodating scalability requirements for up to 250 teams – The asynchronous, serverless architecture effortlessly scales more than 250 times larger without infrastructure modifications
  • Efficiently resolving tickets and performing trend analysis – OpenSearch Service semantic KNN search identifies duplicates and defects, and optimizes operations for improved efficiency
  • Gaining valuable insights for business processes – Artificial intelligence (AI) and machine learning (ML) analytics use the data stored in the S3 bucket, empowering deeper insights and informed decision-making
  • Ensuring secure access to ticket data and regulatory compliance – Secure data access and compliance with data protection regulations ensure data privacy and regulatory compliance

Conclusion

FIS, in collaboration with AWS, successfully addressed several challenges to achieve near-real-time ticket visualization and search capabilities. With OpenSearch Service, FIS enhanced operational efficiency by resolving tickets and performing trend analysis more quickly. With their data ingestion and indexing process, FIS processed 15,000 tickets per day in real time. The solution provided secure and scalable access to ticket data for more than 250 teams, enabling unified collaboration. FIS experienced a remarkable 30% reduction in ticket resolution time, empowering teams to quickly address issues.

As Sheel Saket, Senior Data Science Manager at FIS, states, “Our near-real-time solution transformed how we identify and resolve tickets, improving our overall productivity.”

Organizations can further improve the solution by adopting Amazon OpenSearch Ingestion for data ingestion, which offers cost savings and out-of-the-box data processing capabilities. By embracing this transformative solution, organizations can optimize their ticket management, drive productivity, and deliver exceptional experiences to customers.

Want to know more? You can reach out to FIS through their official FIS contact page, follow FIS on Twitter, and visit the FIS LinkedIn page.


About the Authors

Rupesh Tiwari is a Senior Solutions Architect at AWS in New York City, with a focus on Financial Services. He has over 18 years of IT experience in the finance, insurance, and education domains, and specializes in architecting large-scale applications and cloud-native big data workloads. In his spare time, Rupesh enjoys singing karaoke, watching comedy TV series, and creating joyful moments with his family.

Sheel Saket is a Senior Data Science Manager at FIS in Chicago, Illinois. He has over 11 years of IT experience in the finance, insurance, and e-commerce domains, and specializes in architecting large-scale AI solutions and cloud MLOps. In his spare time, Sheel enjoys listening to audiobooks, podcasts, and watching movies with his family.

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

Post Syndicated from Kishore Tata original https://aws.amazon.com/blogs/big-data/five-actionable-steps-to-gdpr-compliance-right-to-be-forgotten-with-amazon-redshift/

The GDPR (General Data Protection Regulation) right to be forgotten, also known as the right to erasure, gives individuals the right to request the deletion of their personally identifiable information (PII) data held by organizations. This means that individuals can ask companies to erase their personal data from their systems and any third parties with whom the data was shared. Organizations must comply with these requests provided that there are no legitimate grounds for retaining the personal data, such as legal obligations or contractual requirements.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is designed for analyzing large volumes of data and performing complex queries on structured and semi-structured data. Many customers are looking for best practices to keep their Amazon Redshift analytics environment compliant and to be able to respond to GDPR right to be forgotten requests.

In this post, we discuss the challenges associated with implementation, along with architectural patterns and actionable best practices that help organizations respond to right to be forgotten requests under the GDPR for data stored in Amazon Redshift.

Who does GDPR apply to?

The GDPR applies to all organizations established in the EU and to organizations, whether or not established in the EU, that process the personal data of EU individuals in connection with either the offering of goods or services to data subjects in the EU or the monitoring of behavior that takes place within the EU.

The following are key terms we use when discussing the GDPR:

  • Data subject – An identifiable living person and resident in the EU or UK, on whom personal data is held by a business or organization or service provider
  • Processor – The entity that processes the data on the instructions of the controller (for example, AWS)
  • Controller – The entity that determines the purposes and means of processing personal data (for example, an AWS customer)
  • Personal data – Information relating to an identified or identifiable person, including names, email addresses, and phone numbers

Implementing the right to be forgotten can include the following challenges:

  • Data identification – One of the main challenges is identifying all instances of personal data across various systems, databases, and backups. Organizations need to have a clear understanding of where personal data is being stored and how it is processed to effectively fulfill the deletion requests.
  • Data dependencies – Personal data can be interconnected and intertwined with other data systems, making it challenging to remove specific data without impacting the integrity or functionality of other systems or processes. It requires careful analysis to identify data dependencies and mitigate any potential risks or disruptions.
  • Data replication and backups – Personal data can exist in multiple copies due to data replication and backups. Ensuring the complete removal of data from all these copies and backups can be challenging. Organizations need to establish processes to track and manage data copies effectively.
  • Legal obligations and exemptions – The right to be forgotten is not absolute and may be subject to legal obligations or exemptions. Organizations need to carefully assess requests, considering factors such as legal requirements, legitimate interests, freedom of expression, or public interest to determine if the request can be fulfilled or if any exceptions apply.
  • Data archiving and retention – Organizations may have legal or regulatory requirements to retain certain data for a specific period. Balancing the right to be forgotten with the obligation to retain data can be a challenge. Clear policies and procedures need to be established to manage data retention and deletion appropriately.

Architecture patterns

Organizations are generally required to respond to right to be forgotten requests within 30 days from when the individual submits a request. This deadline can be extended by a maximum of 2 months taking into account the complexity and the number of the requests, provided that the data subject has been informed about the reasons for the delay within 1 month of the receipt of the request.

The following sections discuss a few commonly referenced architecture patterns, best practices, and options supported by Amazon Redshift to support your data subject’s GDPR right to be forgotten request in your organization.

Actionable Steps

Data management and governance

Addressing the challenges mentioned requires a combination of technical, operational, and legal measures. Organizations need to develop robust data governance practices, establish clear procedures for handling deletion requests, and maintain ongoing compliance with GDPR regulations.

Large organizations usually have multiple Redshift environments, databases, and tables spread across multiple Regions and accounts. To successfully respond to a data subject’s requests, organizations should have a clear strategy to determine how data is forgotten, flagged, anonymized, or deleted, and they should have clear guidelines in place for data audits.

Data mapping involves identifying and documenting the flow of personal data in an organization. It helps organizations understand how personal data moves through their systems, where it is stored, and how it is processed. By creating visual representations of data flows, organizations can gain a clear understanding of the lifecycle of personal data and identify potential vulnerabilities or compliance gaps.

Note that putting a comprehensive data strategy in place is not in scope for this post.

Audit tracking

Organizations must maintain proper documentation and audit trails of the deletion process to demonstrate compliance with GDPR requirements. A typical audit control framework should record the data subject requests (who is the data subject, when was it requested, what data, approver, due date, scheduled ETL process if any, and so on). This will help with your audit requests and provide the ability to roll back in case of accidental deletions observed during the QA process. It’s important to maintain the list of users and systems who may get impacted during this process to ensure effective communication.
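
As an illustration of what one entry in such an audit trail might hold, the following Python sketch models a request record with the fields listed above and derives a due date from the 30-day response window mentioned earlier in this post. The class and field names are hypothetical, and the sketch ignores the possible extension of the deadline.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

RESPONSE_WINDOW_DAYS = 30  # response window cited in this post


@dataclass
class ErasureRequest:
    """One audit-trail entry for a right to be forgotten request."""

    data_subject_id: str
    requested_on: date
    data_in_scope: List[str]
    approver: str
    scheduled_etl_job: Optional[str] = None
    impacted_users_and_systems: List[str] = field(default_factory=list)

    @property
    def due_date(self) -> date:
        """Date by which the request should be answered."""
        return self.requested_on + timedelta(days=RESPONSE_WINDOW_DAYS)

    def is_overdue(self, today: date) -> bool:
        """True once the due date has passed."""
        return today > self.due_date


request = ErasureRequest(
    data_subject_id="cust-1001",
    requested_on=date(2023, 9, 1),
    data_in_scope=["Customer_PII"],
    approver="privacy-office",
)
print(request.due_date)  # 2023-10-01
```

Keeping the impacted users and systems on the same record makes it easier to notify them when the deletion is scheduled.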

Data discovery and findability

Findability is an important step of the process. Organizations need to have mechanisms to find the data under consideration in an efficient and quick manner for timely response. The following are some patterns and best practices you can employ to find the data in Amazon Redshift.

Tagging

Consider tagging your Amazon Redshift resources to quickly identify which clusters and snapshots contain the PII data, the owners, the data retention policy, and so on. Tags provide metadata about resources at a glance. Redshift resources, such as namespaces, workgroups, snapshots, and clusters can be tagged. For more information about tagging, refer to Tagging resources in Amazon Redshift.

Naming conventions

As a part of the modeling strategy, name the database objects (databases, schemas, tables, columns) with an indicator that they contain PII so that they can be queried using system tables (for example, make a list of the tables and columns where PII data is involved). Identifying the list of tables and users or the systems that have access to them will help streamline the communication process. The following sample SQL can help you find the columns (with their schemas and tables) and the databases with a name that contains PII:

-- Columns whose names contain PII, in the current database
SELECT
    current_database() AS database_name,
    n.nspname AS schema_name,
    c.relname AS table_name,
    a.attname AS column_name
FROM pg_catalog.pg_namespace n
JOIN pg_catalog.pg_class c ON n.oid = c.relnamespace
JOIN pg_catalog.pg_attribute a ON c.oid = a.attrelid
WHERE a.attnum > 0
  AND a.attname ILIKE '%PII%';

-- Databases whose names contain PII
SELECT datname
FROM pg_database
WHERE datname ILIKE '%PII%';

-- The same column search through the information schema
SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE column_name ILIKE '%PII%';
Separate PII and non-PII

Whenever possible, keep the sensitive data in a separate table, database, or schema. Isolating the data in a separate database may not always be possible. However, you can separate the non-PII columns in a separate table, for example, Customer_NonPII and Customer_PII, and then join them with an unintelligent key. This helps identify the tables that contain non-PII columns. This approach is straightforward to implement and keeps non-PII data intact, which can be useful for analysis purposes. The following figure shows an example of these tables.

PII-Non PII Example Tables

Flag columns

In the preceding tables, rows in bold are marked with Forgotten_flag=Yes. You can maintain a Forgotten_flag column with a default value of No and update the value to Yes whenever a request to be forgotten is received. Also, as a best practice from HIPAA, do a batch deletion once a month. The downstream and upstream systems need to respect this flag and include it in their processing. This helps identify the rows that need to be deleted. For our example, we can use the following code:

DELETE FROM Customer_PII WHERE forgotten_flag = 'Yes';
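
To show the flag-then-purge flow end to end, here is a small Python sketch that uses an in-memory SQLite database as a local stand-in for Amazon Redshift. The column names customer_key and email, and the two helper functions, are hypothetical; only the Customer_PII table name and the forgotten_flag column come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in, not Amazon Redshift
conn.execute(
    "CREATE TABLE Customer_PII ("
    " customer_key INTEGER PRIMARY KEY,"
    " email TEXT,"
    " forgotten_flag TEXT NOT NULL DEFAULT 'No')"
)
conn.executemany(
    "INSERT INTO Customer_PII (customer_key, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")],
)


def flag_forgotten(connection, customer_key):
    """Record a request by flipping the flag; the row stays until the batch run."""
    connection.execute(
        "UPDATE Customer_PII SET forgotten_flag = 'Yes' WHERE customer_key = ?",
        (customer_key,),
    )


def purge_forgotten(connection):
    """Scheduled batch delete of flagged rows; returns the number removed."""
    cursor = connection.execute(
        "DELETE FROM Customer_PII WHERE forgotten_flag = 'Yes'"
    )
    connection.commit()
    return cursor.rowcount


flag_forgotten(conn, 2)
deleted = purge_forgotten(conn)
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT customer_key FROM Customer_PII ORDER BY customer_key"
    )
]
print(deleted, remaining)  # 1 [1, 3]
```

Separating the flag update from the purge mirrors the process described above: requests are recorded as they arrive, and the physical delete runs on its own schedule.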

Use a master data management system

Organizations that maintain a master data management system maintain a golden record for a customer, which acts as a single version of truth from multiple disparate systems. These systems also contain crosswalks with several peripheral systems that contain the natural key of the customer and golden record. This technique helps find customer records and related tables. The following is a representative example of a crosswalk table in a master data management system.

Example of MDM records

Use AWS Lake Formation

Some organizations share data across multiple departments and business units by using Amazon Redshift data sharing. In these cases, you can use AWS Lake Formation tags to tag the database objects and columns and define fine-grained access controls on who can access the data. Organizations can have a dedicated resource with access to all tagged resources. With Lake Formation, you can centrally define and enforce database-, table-, column-, and row-level access permissions of Redshift data shares and restrict user access to objects within a data share.

By sharing data through Lake Formation, you can define permissions in Lake Formation and apply those permissions to data shares and their objects. For example, if you have a table containing employee information, you can use column-level filters to help prevent employees who don’t work in the HR department from seeing sensitive information. Refer to AWS Lake Formation-managed Redshift shares for more details on the implementation.

Use Amazon DataZone

Amazon DataZone introduces a business metadata catalog. Business metadata provides information authored or used by businesses and gives context to organizational data. Data discovery is a key task that business metadata can support. Data discovery uses centrally defined corporate ontologies and taxonomies to classify data sources and allows you to find relevant data objects. You can add business metadata in Amazon DataZone to support data discovery.

Data erasure

By using the approaches we’ve discussed, you can find the clusters, databases, tables, columns, and snapshots that contain the data to be deleted. The following are some methods and best practices for data erasure.

Restricted backup

In some use cases, you may have to keep data backed up for a certain period of time to align with government regulations. It’s a good idea to take a backup of the data objects before deletion and keep it for an agreed-upon retention time. You can use AWS Backup to take automatic or manual backups. AWS Backup allows you to define a central backup policy to manage the data protection of your applications. For more information, refer to New – Amazon Redshift Support in AWS Backup.

Physical deletes

After we find the tables that contain the data, we can delete the data using the following code (using the flagging technique discussed earlier):

DELETE FROM Customer_PII WHERE forgotten_flag = 'Yes';

It’s a good practice to delete data at a specified schedule, such as once every 25–30 days, so that it is simpler to maintain the state of the database.
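As a minimal sketch of the flag-based delete above, the following Python snippet runs the same statement against an in-memory SQLite database. SQLite only stands in for Amazon Redshift here so the example is runnable anywhere; the sample rows, the `customer_id` and `name` columns, and the function names are assumptions made for illustration.

```python
import sqlite3


def purge_forgotten(conn):
    """Delete every row flagged for erasure and return how many were removed."""
    with conn:  # commits on success, rolls back if the statement fails
        cur = conn.execute(
            "DELETE FROM Customer_PII WHERE forgotten_flag = ?", ("Yes",)
        )
    return cur.rowcount


def demo():
    """Build a tiny sample table, purge flagged rows, and return the outcome."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Customer_PII "
        "(customer_id INTEGER, name TEXT, forgotten_flag TEXT)"
    )
    # Hypothetical sample data: two customers have asked to be forgotten.
    conn.executemany(
        "INSERT INTO Customer_PII VALUES (?, ?, ?)",
        [(1, "Ana", "No"), (2, "Ben", "Yes"), (3, "Cy", "Yes")],
    )
    deleted = purge_forgotten(conn)
    remaining = conn.execute(
        "SELECT customer_id, name, forgotten_flag FROM Customer_PII"
    ).fetchall()
    conn.close()
    return deleted, remaining
```

A scheduled job could call a function like `purge_forgotten` at the chosen interval and log the returned row count for audit purposes.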

Logical deletes

You may need to keep data in a separate environment for audit purposes. You can employ Amazon Redshift row access policies and conditional dynamic masking policies to filter and anonymize the data.

You can use row access policies on Forgotten_flag=No on the tables that contain PII data so that the designated users can only see the necessary data. Refer to Achieve fine-grained data security with row-level access control in Amazon Redshift for more information about how to implement row access policies.

You can use conditional dynamic data masking policies so that designated users can see the redacted data. With dynamic data masking (DDM) in Amazon Redshift, you can help protect sensitive data in your data warehouse. You can manipulate how Amazon Redshift shows sensitive data to the user at query time without transforming it in the database. You control access to data through masking policies that apply custom obfuscation rules to a given user or role. That way, you can respond to changing privacy requirements without altering the underlying data or editing SQL queries.

Dynamic data masking policies hide, obfuscate, or pseudonymize data that matches a given format. When attached to a table, the masking expression is applied to one or more of its columns. You can further modify masking policies to only apply them to certain users or user-defined roles that you can create with role-based access control (RBAC). Additionally, you can apply DDM on the cell level by using conditional columns when creating your masking policy.

Organizations can use conditional dynamic data masking to redact sensitive columns (for example, names) where the forgotten flag column value is TRUE, and the other columns display the full values.
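To make the idea of conditional masking concrete, the following Python sketch emulates the effect on a single row: sensitive columns are redacted only when the forgotten flag is set. This is plain Python for illustration, not Amazon Redshift masking policy syntax, and the column names, the `REDACTED` placeholder, and the function name are assumptions.

```python
def mask_row(row, sensitive=("name",), flag="forgotten_flag", mask="REDACTED"):
    """Return a copy of the row with sensitive columns redacted when flagged.

    Rows whose flag is falsy are returned unchanged, which mirrors the
    "other rows display the full values" behavior described above.
    """
    if not row.get(flag):
        return dict(row)
    return {
        column: (mask if column in sensitive else value)
        for column, value in row.items()
    }
```

For example, `mask_row({"name": "Ana", "city": "Lyon", "forgotten_flag": True})` keeps `city` but replaces `name` with the placeholder, while the stored row itself is never modified.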

Backup and restore

Data from Redshift clusters can be transferred, exported, or copied to different AWS services or outside of the cloud. Organizations should have an effective governance process to detect and remove data to align with the GDPR compliance requirement. However, this is beyond the scope of this post.

Amazon Redshift offers backups and snapshots of the data. After deleting the PII data, organizations should also purge the data from their backups. To do so, you need to restore the snapshot to a new cluster, remove the data, and take a fresh backup. The following figure illustrates this workflow.

It’s good practice to keep the retention period at 29 days (if applicable) so that the backups are cleared after 30 days. Organizations can also set the backup schedule to a certain date (for example, the first of every month).

Backup and Restore

Communication

It’s important to communicate to the users and processes who may be impacted by this deletion. The following query helps identify the list of users and groups who have access to the affected tables:

SELECT
nspname AS schema_name,
relname AS table_name,
attname AS column_name,
usename AS user_name,
groname AS group_name
FROM pg_namespace
JOIN pg_class ON pg_namespace.oid = pg_class.relnamespace
JOIN pg_attribute ON pg_class.oid = pg_attribute.attrelid
LEFT JOIN pg_group ON pg_attribute.attacl::text LIKE '%' || groname || '%'
LEFT JOIN pg_user ON pg_attribute.attacl::text LIKE '%' || usename || '%'
WHERE
pg_attribute.attname ILIKE '%PII%'
AND (usename IS NOT NULL OR groname IS NOT NULL);

Security controls

Maintaining security is of great importance in GDPR compliance. By implementing robust security measures, organizations can help protect personal data from unauthorized access, breaches, and misuse, thereby helping maintain the privacy rights of individuals. Security plays a crucial role in upholding the principles of confidentiality, integrity, and availability of personal data. AWS offers a comprehensive suite of services and features that can support GDPR compliance and enhance security measures.

The GDPR does not change the AWS shared responsibility model, which continues to be relevant for customers. The shared responsibility model is a useful approach to illustrate the different responsibilities of AWS (as a data processor or subprocessor) and customers (as either data controllers or data processors) under the GDPR.

Under the shared responsibility model, AWS is responsible for securing the underlying infrastructure that supports AWS services (“Security of the Cloud”), and customers, acting either as data controllers or data processors, are responsible for personal data they upload to AWS services (“Security in the Cloud”).

AWS offers a GDPR-compliant AWS Data Processing Addendum (AWS DPA), which enables you to comply with GDPR contractual obligations. The AWS DPA is incorporated into the AWS Service Terms.

Article 32 of the GDPR requires that organizations must “…implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk, including …the pseudonymization and encryption of personal data[…].” In addition, organizations must “safeguard against the unauthorized disclosure of or access to personal data.” Refer to the Navigating GDPR Compliance on AWS whitepaper for more details.

Conclusion

In this post, we delved into the significance of GDPR and its impact on safeguarding privacy rights. We discussed five commonly followed best practices that organizations can reference for responding to GDPR right to be forgotten requests for data that resides in Redshift clusters. We also highlighted that the GDPR does not change the AWS shared responsibility model.

We encourage you to take charge of your data privacy today. Prioritizing GDPR compliance and data privacy will not only strengthen trust, but also build customer loyalty and safeguard personal information in the digital era. If you need assistance or guidance, reach out to an AWS representative. AWS has teams of Enterprise Support Representatives, Professional Services Consultants, and other staff to help with GDPR questions. You can contact us with questions. To learn more about GDPR compliance when using AWS services, refer to the General Data Protection Regulation (GDPR) Center. To learn more about the right to be forgotten, refer to Right to Erasure.

Disclaimer: The information provided above is not legal advice. It is intended to showcase commonly followed best practices. It is crucial to consult with your organization’s privacy officer or legal counsel and determine appropriate solutions.


About the Authors

Yadukishore Tatavarthi is a Senior Partner Solutions Architect supporting healthcare and life science customers at Amazon Web Services. Over the last 20 years, he has helped customers build enterprise data strategies, advising them on cloud implementations, migrations, reference architecture creation, data modeling best practices, data lake/warehouse architecture, and other technical processes.

Sudhir Gupta is a Principal Partner Solutions Architect, Analytics Specialist at AWS with over 18 years of experience in Databases and Analytics. He helps AWS partners and customers design, implement, and migrate large-scale data & analytics (D&A) workloads. As a trusted advisor to partners, he enables partners globally on AWS D&A services, builds solutions/accelerators, and leads go-to-market initiatives.

Deepak Singh is a Senior Solutions Architect at Amazon Web Services with 20+ years of experience in Data & AIA. He enjoys working with AWS partners and customers on building scalable analytical solutions for their business outcomes. When not at work, he loves spending time with family or exploring new technologies in the analytics and AI space.

Configure monitoring, limits, and alarms in Amazon Redshift Serverless to keep costs predictable

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/configure-monitoring-limits-and-alarms-in-amazon-redshift-serverless-to-keep-costs-predictable/

Amazon Redshift Serverless makes it simple to run and scale analytics in seconds. It automatically provisions and intelligently scales data warehouse compute capacity to deliver fast performance, and you pay only for what you use. Just load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs), and you can configure base RPUs anywhere between 8–512. You can start with your preferred RPU capacity or defaults and adjust anytime later.

In this post, we share how you can monitor your workloads running on Redshift Serverless through three approaches: the Redshift Serverless console, Amazon CloudWatch, and system views. We also show how to set up guardrails via alerts and limits for Redshift Serverless to keep your costs predictable.

Method 1: Monitor through the Redshift Serverless console

You can view all user queries, including Data Manipulation Language (DML), Data Definition Language (DDL), and Data Control Language (DCL) statements, through the Redshift Serverless console. On the same page, you can view the RPU consumption of these workloads and apply filters based on time, database, users, and type of queries.

Prerequisites for monitoring access

A superuser has access to monitor all workloads and resource consumption by default. If other users need monitoring access through the Redshift Serverless console, then the superuser can provide necessary access by performing the following steps:

  1. Create a policy with necessary privileges and assign this policy to required users or roles.
  2. Grant query monitoring permission to the user or role.

For more information, refer to Granting access to monitor queries.

Query monitoring

In this section, we walk through the Redshift Serverless console to see query history, database performance, and resource usage. We also go through monitoring options and how to set filters to narrow down results using filter attributes.

  1. On the Redshift Serverless console, under Monitoring in the navigation pane, choose Query and database monitoring.
  2. Open the workgroup you want to monitor.
  3. In the Metric filters section, expand Additional filtering options.
  4. You can set filters for time range, aggregation time interval, database, query category, SQL, and users.

Query and database monitoring

Two tabs are available, Query history and Database performance. Use the Query history tab for obtaining details at a per-query level, and the Database performance tab for reviewing performance aggregated across queries. Both tabs are filtered based on the selections you made.

Under Query history, you will see the Query runtime graph. Use this graph to look into query concurrency (queries that are running in the same time frame). You can choose a query to view more query run details, for example, queries that took longer to run than you expected.

Query runtime monitoring dashboard

In the Queries and loads section, you can see all queries by default, but you can also filter by status to view completed, running, and failed queries.

Query history screen

Navigate to the Database Performance tab in the Query and database monitoring section to view the following:

  • Queries completed per second – Average number of queries completed per second
  • Queries duration – Average amount of time to complete a query
  • Database connections – Number of active database connections
  • Running and Queued queries – Total number of running and queued queries at a given time

Resource monitoring

To monitor your resources, complete the following steps:

  1. On the Redshift Serverless console, choose Resource monitoring under Monitoring in the navigation pane.

The default workgroup is selected initially, but you can choose the workgroup you would like to monitor.

  2. In the Metric filters section, expand Additional filtering options.
  3. Choose a 1-minute time interval (for example) and review the results.

You can also try different ranges to see the results.

Screen to apply metric filters

On the RPU capacity used graph, you can see how Redshift Serverless is able to scale RPUs in a matter of minutes. This gives a visual representation of peaks and lows in your consumption over your chosen period of time.

RPU capacity consumption

You also see the actual compute usage in terms of RPU-seconds for the workload you ran.

RPU Seconds consumed

Method 2: Monitor metrics in CloudWatch

Redshift Serverless publishes serverless endpoint performance metrics to CloudWatch. The Amazon Redshift CloudWatch metrics are data points for operational monitoring. These metrics enable you to monitor performance of your serverless workgroups (compute) and usage of namespaces (data). CloudWatch allows you to centrally monitor your serverless endpoints in one AWS account, as well as across accounts and Regions.

  • On the CloudWatch console, under Metrics in the navigation pane, choose All metrics.
  • On the Browse tab, choose AWS/Redshift-Serverless to get to a collection of metrics for Redshift Serverless usage.

Redshift Serverless in Amazon CloudWatch

  • Choose Workgroup to view workgroup-related metrics.

Workgroups and Namespaces

From the list, you can check your particular workgroup and the metrics available (in this example, ComputeSeconds and ComputeCapacity). You should see the graph is updated and charting your data.

Redshift Serverless Workgroup Metrics

  • To name the graph, choose the pencil icon next to the graph title and enter a graph name (for example, dataanalytics-serverless), then choose Apply.

Rename CloudWatch Graph

  • On the Browse tab, choose AWS/Redshift-Serverless and choose Namespace this time.
  • Select the namespace you want to monitor and the metrics of interest.

Redshift Serverless Namespace Metrics

You can add additional metrics to your graph. To centralize monitoring, you can add these metrics to an existing CloudWatch dashboard or a new dashboard.

  • On the Actions menu, choose Add to dashboard.

Redshift Serverless Namespace Metrics

Method 3: Granular monitoring using system views

System views in Redshift Serverless are used to monitor workload performance and RPU usage at a granular level over a period of time. These query monitoring system views have been simplified to include monitoring for DDL, DML, COPY, and UNLOAD queries. For a complete list of system views and their uses, refer to Monitoring views.

SQL Notebook

You can download the SQL notebook with the most used system view queries. These queries help answer the most frequently asked monitoring questions, listed below.

  • How to monitor queries based on status?
  • How to monitor specific query elapsed time breakdown details?
  • How to monitor workload breakdown by query count and percentile run time?
  • How to monitor detailed steps involved in query execution?
  • How to monitor Redshift Serverless usage cost by day?
  • How to monitor data loads (copy commands)?
  • How to monitor number of sessions and connections?

You can import this notebook into Query Editor V2.0 and run the queries while connected to the Redshift Serverless workgroup you would like to monitor.

Set limits to control costs

When you create your serverless endpoint, the base capacity defaults to 128 RPUs. However, you can change it at creation time or later via the Redshift Serverless console.

  1. On the details page of your serverless workgroup, choose the Limits tab.
  2. In the Base capacity section, choose Edit.
  3. You can specify Base capacity from 8–512 RPUs, in increments of 8.

Each RPU provides 16 GB of memory, so the lowest base capacity of 8 RPUs provides 128 GB of memory, and the highest base capacity of 512 RPUs provides 8 TB of memory.

Edit base RPU capacity
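The relationship between base capacity and memory can be sketched in a few lines of Python. The constants come from the figures stated above (8–512 RPUs in increments of 8, 16 GB per RPU); the function and constant names are assumptions for illustration and are not part of any AWS SDK.

```python
GB_PER_RPU = 16  # memory per RPU, as stated above
MIN_RPU, MAX_RPU, RPU_STEP = 8, 512, 8  # allowed base capacity range and increment


def base_capacity_memory_gb(rpus):
    """Return the memory (in GB) for a valid base capacity, else raise ValueError."""
    if not MIN_RPU <= rpus <= MAX_RPU or rpus % RPU_STEP != 0:
        raise ValueError(
            f"base capacity must be {MIN_RPU}-{MAX_RPU} RPUs "
            f"in increments of {RPU_STEP}, got {rpus}"
        )
    return rpus * GB_PER_RPU
```

Under these assumptions, the default of 128 RPUs corresponds to 2,048 GB (2 TB) of memory.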

Usage limits

To configure usage capacity limits to limit your overall Redshift Serverless bill, complete the following steps:

  1. In the Usage limits section, choose Manage usage limits.
  2. To control RPU usage, set the maximum RPU-hours by frequency. You can set Frequency to Daily, Weekly, or Monthly.
  3. For Usage limit (RPU hours), enter your preferred value.
  4. For Action, choose Alert, Log to system table, or Turn off user queries.

Set RPU usage limit

Optionally, you can select an existing Amazon Simple Notification Service (Amazon SNS) topic or create a new SNS topic, and subscribe via email to this SNS topic to be notified when usage limits have been met.
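When choosing a value for the usage limit, it can help to convert observed RPU-seconds into RPU-hours and compare the total with a candidate limit. The following is a rough sketch under that assumption; the function names and sample numbers are hypothetical, and the service itself enforces the real limit.

```python
SECONDS_PER_HOUR = 3600


def rpu_hours(rpu_seconds):
    """Convert a total of RPU-seconds into RPU-hours."""
    return rpu_seconds / SECONDS_PER_HOUR


def check_usage(rpu_second_samples, limit_rpu_hours):
    """Return (used RPU-hours, whether the candidate limit would be reached)."""
    used = rpu_hours(sum(rpu_second_samples))
    return used, used >= limit_rpu_hours
```

For example, two workloads that each consumed 28,800 RPU-seconds add up to 16 RPU-hours, which would reach a daily limit of 16 RPU-hours.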

Query monitoring rules for Redshift Serverless

To prevent wasteful resource utilization and runaway costs caused by poorly written queries, you can implement query monitoring rules via query limits on your Redshift Serverless workgroup. For more information, refer to WLM query monitoring rules. The query monitoring rules in Redshift Serverless stop queries that meet the limit that has been set up in the rule. To automate notifications on Slack, refer to Automate notifications on Slack for Amazon Redshift query monitoring rule violations.

To set up query limits, complete the following steps:

  1. On the Redshift Serverless console, choose Workgroup configuration in the navigation pane.
  2. Choose a workgroup to monitor.
  3. On the workgroup details page, under Query monitoring rules, choose Manage query limits.

You can add up to 10 query monitoring rules to each serverless workgroup.

Set query limits

The serverless workgroup will go to a Modifying state each time you add or remove a limit.

Let’s take an example where you have to create a serverless workgroup for your dashboards. You know that dashboard queries typically complete in under a minute. If any dashboard query takes more than a minute, it could indicate a poorly written query or a query that hasn’t been tested well, and has incorrectly been released to production.

For this use case, we set a rule with Limit type as Query execution time and Limit (seconds) as 60.

Set required limit
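The effect of this rule can be pictured with a small Python sketch that flags dashboard queries whose run time exceeds the 60-second limit. It only illustrates the decision the rule makes; Redshift Serverless applies the rule itself, and the function name, query IDs, and timings here are made up for the example.

```python
def queries_over_limit(elapsed_by_query, limit_seconds=60):
    """Return the IDs of queries whose elapsed time exceeds the limit, sorted."""
    return sorted(
        query_id
        for query_id, elapsed_seconds in elapsed_by_query.items()
        if elapsed_seconds > limit_seconds
    )
```

Running the same check against historical query durations before enabling a rule is one way to estimate how many queries a given limit would have stopped.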

The following screenshot shows the Redshift Serverless metrics available for setting up query monitoring rules.

Query Monitoring Metrics on CloudWatch

Configure alarms

Alarms are very useful because they enable you to make proactive decisions about your Redshift Serverless endpoint. Any usage limits that you set up will automatically show as alarms on the Redshift Serverless console, and are created as CloudWatch alarms.

Additionally, you can set up one or more CloudWatch alarms on any of the metrics listed in Amazon Redshift Serverless metrics.

For example, setting an alarm for DataStorage over a threshold value would keep track of the storage space that your serverless namespace is using for your data.

To create an alarm for your Redshift Serverless instance, complete the following steps:

  1. On the Redshift Serverless console, under Monitoring in the navigation pane, choose Alarms.
  2. Choose Create alarm.

Set Alarms from console

  3. Choose your level of metrics to monitor:
    • Workgroup
    • Namespace
    • Snapshot storage

If we select Workgroup, we can choose from the workgroup-level metrics shown in the following screenshot.

Workgroup Level Metrics

The following screenshot shows how we can set up alarms at the namespace level along with various metrics that are available to use.

Namespace Level Metrics

The following screenshot shows the metrics available at the snapshot storage level.

Snapshot level metrics

If you are just getting started, begin with the most commonly used metrics listed below. Also refer to Create a billing alarm to monitor your estimated AWS charges.

  • ComputeSeconds
  • ComputeCapacity
  • DatabaseConnections
  • EstimatedCharges
  • DataStorage
  • QueriesFailed

Notifications

After you define your alarm, provide a name and a description, and choose to enable notifications.

Amazon Redshift uses an SNS topic to send alarm notifications. For instructions to create an SNS topic, refer to Creating an Amazon SNS topic. You must subscribe to the topic to receive the messages published to it. For instructions, refer to Subscribing to an Amazon SNS topic.

You can also monitor event notifications to be aware of the changes in your Redshift Serverless data warehouse. Refer to Amazon Redshift Serverless event notifications with Amazon EventBridge for further details.

Clean up

To clean up your resources, delete the workgroup and namespace you used for trying the monitoring approaches discussed in this post.

Cleanup

Conclusion

In this post, we covered how to perform monitoring activities on Redshift Serverless through the Redshift Serverless console, system views, and CloudWatch, and how to keep costs predictable. Try the monitoring approaches discussed in this post and let us know your feedback in the comments.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 17 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Harshida Patel is a Specialist Principal Solutions Architect, Analytics with AWS.

Raghu Kuppala is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Ashish Agrawal is a Sr. Technical Product Manager with Amazon Redshift, building cloud-based data warehouses and analytics cloud services. Ashish has over 24 years of experience in IT. Ashish has expertise in data warehouses, data lakes, and platform as a service. Ashish has been a speaker at worldwide technical conferences.

Migrating your secrets to AWS Secrets Manager, Part 2: Implementation

Post Syndicated from Adesh Gairola original https://aws.amazon.com/blogs/security/migrating-your-secrets-to-aws-secrets-manager-part-2-implementation/

In Part 1 of this series, we provided guidance on how to discover and classify secrets and design a migration solution for customers who plan to migrate secrets to AWS Secrets Manager. We also mentioned steps that you can take to enable preventative and detective controls for Secrets Manager. In this post, we discuss how teams should approach the next phase, which is implementing the migration of secrets to Secrets Manager. We also provide a sample solution to demonstrate migration.

Implement secrets migration

Application teams lead the effort to design the migration strategy for their application secrets. Once you’ve made the decision to migrate your secrets to Secrets Manager, there are two potential options for migration implementation. One option is to move the application to AWS in its current state and then modify the application source code to retrieve secrets from Secrets Manager. Another option is to update the on-premises application to use Secrets Manager for retrieving secrets. You can use features such as AWS Identity and Access Management (IAM) Roles Anywhere to make the application communicate with Secrets Manager even before the migration, which can simplify the migration phase.

If the application code contains hardcoded secrets, the code should be updated so that it references Secrets Manager. A good interim state would be to pass these secrets as environment variables to your application. Using environment variables helps in decoupling the secrets retrieval logic from the application code and allows for a smooth cutover and rollback (if required).
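The interim state described above can be sketched as a small lookup helper: the application asks for a secret by name, reads it from an environment variable when present, and otherwise falls back to a pluggable fetch function. This is a minimal sketch, not the sample solution's code; `get_secret` and the `fetch` callback are hypothetical names, and in practice the callback would wrap a call to Secrets Manager.

```python
import os


def get_secret(name, env=None, fetch=None):
    """Look up a secret by name without hardcoding it in application code.

    env   -- mapping to read from (defaults to the process environment)
    fetch -- optional callable taking the secret name, used as a fallback
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if value is not None:
        return value
    if fetch is not None:
        return fetch(name)
    raise KeyError(f"secret {name!r} is not configured")
```

Because callers only depend on `get_secret`, cutover and rollback amount to changing where the value comes from, not changing the application logic.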

Perform the cutover to Secrets Manager in a maintenance window to minimize downtime and impact on production.

Before you perform the cutover procedure, verify the following:

  • Application components can access Secrets Manager APIs. Based on your environment, this connectivity might be provisioned through interface virtual private cloud (VPC) endpoints or over the internet.
  • Secrets exist in Secrets Manager and have the correct tags. This is important if you are using attribute-based access control (ABAC).
  • Applications that integrate with Secrets Manager have the required IAM permissions.
  • A well-documented cutover and rollback plan exists that describes the changes that will be made to the application during cutover. These would include steps like updating the code to use environment variables and updating the application to use IAM roles or instance profiles (for apps that are being migrated to Amazon Elastic Compute Cloud (Amazon EC2)).

After the cutover, verify that Secrets Manager integration was successful. You can use AWS CloudTrail to confirm that application components are using Secrets Manager.

We recommend that you further optimize your integration by enabling automatic secrets rotation. If your secrets were previously widely accessible (for example, they were stored in your Git repositories), we recommend rotating them as soon as possible when migrating.

Sample application to demo integration with Secrets Manager

In the next sections, we present a sample AWS Cloud Development Kit (AWS CDK) solution that demonstrates the implementation of the previously discussed guardrails, design, and migration strategy. You can use the sample solution as a starting point and expand upon it. It includes components that environment teams may deploy to help provide potentially secure access for application teams to migrate their secrets to Secrets Manager. The solution uses ABAC, a tagging scheme, and IAM Roles Anywhere to demonstrate regulated access to secrets for application teams. Additionally, the solution contains client-side utilities to assist application and migration teams in updating secrets. Teams with on-premises applications that are seeking integration with Secrets Manager before migration can use the client-side utility for access through IAM Roles Anywhere.

The sample solution is hosted on the aws-secrets-manager-abac-authorization-samples GitHub repository and is made up of the following components:

  • A common environment infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • A sample VPC created with Amazon Virtual Private Cloud (Amazon VPC), with PUBLIC, PRIVATE_WITH_NAT, and PRIVATE_ISOLATED subnet types.
    • VPC endpoints for the AWS Key Management Service (AWS KMS) and Secrets Manager services to the sample VPC. The use of VPC endpoints means that calls to AWS KMS and Secrets Manager are not made over the internet and remain internal to the AWS backbone network.
    • An empty shell secret, tagged with the supplied attributes and an IAM managed policy that uses attribute-based access control conditions. This means that the secret is managed in code, but the actual secret value is not visible in version control systems like GitHub or in AWS CloudFormation parameter inputs. 
  • An IAM Roles Anywhere infrastructure stack (created and owned by environment teams). This stack provisions the following resources:
    • An AWS Certificate Manager Private Certificate Authority (AWS Private CA).
    • An IAM Roles Anywhere public key infrastructure (PKI) trust anchor that uses AWS Private CA.
    • An IAM role for the on-premises application that uses the common environment infrastructure stack.
    • An IAM Roles Anywhere profile.

    Note: You can choose to use your existing CAs as trust anchors. If you do not have a CA, the stack described here provisions a PKI for you. IAM Roles Anywhere allows migration teams to use Secrets Manager before the application is moved to the cloud. Post migration, you could consider updating the applications to use native IAM integration (like instance profiles for EC2 instances) and revoking IAM Roles Anywhere credentials.

  • A client-side utility (primarily used by application or migration teams). This is a shell script that does the following:
    • Assists in provisioning a certificate by using OpenSSL.
    • Uses aws_signing_helper (Credential Helper) to set up AWS CLI profiles by using the credential_process for IAM Roles Anywhere.
    • Assists application teams to access and update their application secrets after assuming an IAM role by using IAM Roles Anywhere.
  • A sample application stack (created and owned by the application/migration team). This is a sample serverless application that demonstrates the use of the solution. It deploys the following components, which indicate that your ABAC-based IAM strategy is working as expected and is effectively restricting access to secrets:
    • The sample application stack uses a VPC-deployed common environment infrastructure stack.
    • It deploys an Amazon Aurora MySQL serverless cluster in the PRIVATE_ISOLATED subnet and uses the secret that is created through a common environment infrastructure stack.
    • It deploys a sample Lambda function in the PRIVATE_WITH_NAT subnet.
    • It deploys two IAM roles for testing:
      • allowedRole (default role): When the application uses this role, it is able to use the GET action to get the secret and open a connection to the Aurora MySQL database.
      • Not allowedRole: When the application uses this role, it is unable to use the GET action to get the secret and open a connection to the Aurora MySQL database.
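The difference between the two test roles comes down to whether the role's tags match the secret's tags. The following Python sketch shows that matching decision in isolation. It is an illustration of the ABAC idea only: real enforcement happens in IAM policy conditions, and the tag keys (`appname`, `environment`) and the function name are assumptions, not the keys used in the sample repository.

```python
def is_access_allowed(principal_tags, secret_tags, required_keys):
    """Allow access only if every required tag exists on both sides and matches."""
    return all(
        key in principal_tags
        and key in secret_tags
        and principal_tags[key] == secret_tags[key]
        for key in required_keys
    )
```

A role tagged for one application is therefore denied access to a secret tagged for another, even when both live in the same account.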

Prerequisites to deploy the sample solution

The following software packages need to be installed in your development environment before you deploy this solution:

Note: In this section, we provide examples of AWS CLI commands and configuration for Linux or macOS operating systems. For instructions on using AWS CLI on Windows, refer to the AWS CLI documentation.

Before deployment, make sure that the correct AWS credentials are configured in your terminal session. The credentials can be either in the environment variables or in ~/.aws. For more details, see Configuring the AWS CLI.

Next, use the following commands to set your AWS credentials to deploy the stack:

export AWS_ACCESS_KEY_ID=<>
export AWS_SECRET_ACCESS_KEY=<>
export AWS_REGION=<>
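Before deploying, a quick check that these variables are present can catch a misconfigured session early. The following Python sketch reports which of the three variables are missing or empty. The function name is an assumption, and the check only applies when credentials are supplied through environment variables rather than through ~/.aws.

```python
import os

REQUIRED_VARIABLES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]
```

An empty list means all three variables are set; it does not confirm that the credentials are valid, which is what aws sts get-caller-identity verifies.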

You can view the IAM credentials that are being used by your session by running the command aws sts get-caller-identity. If you are running the cdk command for the first time in your AWS account, you will need to run the following cdk bootstrap command to provision a CDK Toolkit stack that will manage the resources necessary to enable deployment of cloud applications with the AWS CDK.

cdk bootstrap aws://<AWS account number>/<Region> # Bootstrap CDK in the specified account and AWS Region

Select the applicable archetype and deploy the solution

This section outlines the design and deployment steps for two archetypes:

Archetype 1: Application is currently on premises

Archetype 1 has the following requirements:

  • The application is currently hosted on premises.
  • The application will consume API keys, stored credentials, and other secrets stored in Secrets Manager.

The application, environment and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack (as described earlier in this post) to bootstrap the AWS account with secrets and IAM policy by using the supplied tagging requirement.
  2. Additionally, the environment engineer deploys the IAM Roles Anywhere infrastructure stack.
  3. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  4. The application developer uses the client-side utility to update the AWS CLI profile to consume the IAM Roles Anywhere role from the on-premises servers.

    Figure 1 shows the workflow for Archetype 1.

    Figure 1: Application on premises connecting to Secrets Manager


To deploy Archetype 1

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Do not modify the tag keys (attribute names); modify only the values.

  2. (Actions by the environment team persona) Run the following command to deploy the common environment infrastructure stack.
    ./helper.sh prepare
    Then, run the following command to deploy the IAM Roles Anywhere infrastructure stack.
    ./helper.sh on-prem
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, by using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it’s still using the dummy value.

    Then, run the following command to set up the client and server on premises.
    ./helper.sh client-profile-setup

    Follow the command prompt. It will help you request a client certificate and update the AWS CLI profile.

    Important: When you request a client certificate, make sure to supply at least one distinguished name, like CommonName.

The sample output should look like the following.


--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile developer

At this point, the client-side utility (helper.sh client-profile-setup) should have updated the AWS CLI configuration file with the following profile.

[profile developer]
region = <aws-region>
credential_process = /Users/<local-laptop-user>/.aws/aws_signing_helper credential-process
    --certificate /Users/<local-laptop-user>/.aws/client_cert.pem
    --private-key /Users/<local-laptop-user>/.aws/my_private_key.clear.key
    --trust-anchor-arn arn:aws:rolesanywhere:<aws-region>:444455556666:trust-anchor/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111
    --profile-arn arn:aws:rolesanywhere:<aws-region>:444455556666:profile/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222
    --role-arn arn:aws:iam::444455556666:role/RolesanywhereabacStack-onPremAppRole-1234567890ABC
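
As a rough illustration of how such a profile entry is put together, the following Python sketch joins the signing helper arguments into a single command string and renders the profile section. The function names and the /path/to/... placeholders are hypothetical and are not part of helper.sh; only the flag names are taken from the profile shown above.

```python
def build_credential_process(helper_path, certificate, private_key,
                             trust_anchor_arn, profile_arn, role_arn):
    """Join the signing helper invocation into one command string."""
    args = [
        helper_path, "credential-process",
        "--certificate", certificate,
        "--private-key", private_key,
        "--trust-anchor-arn", trust_anchor_arn,
        "--profile-arn", profile_arn,
        "--role-arn", role_arn,
    ]
    for arg in args:
        # Quoting rules are out of scope for this sketch, so reject whitespace.
        if any(ch.isspace() for ch in arg):
            raise ValueError(f"argument contains whitespace: {arg!r}")
    return " ".join(args)


def render_profile(name, region, credential_process):
    """Render a profile section in the layout of an AWS CLI config file."""
    return (
        f"[profile {name}]\n"
        f"region = {region}\n"
        f"credential_process = {credential_process}\n"
    )


# Placeholder values only; substitute the paths and ARNs from your own setup.
command = build_credential_process(
    helper_path="/path/to/aws_signing_helper",
    certificate="/path/to/client_cert.pem",
    private_key="/path/to/my_private_key.clear.key",
    trust_anchor_arn="arn:aws:rolesanywhere:<aws-region>:444455556666:trust-anchor/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    profile_arn="arn:aws:rolesanywhere:<aws-region>:444455556666:profile/a1b2c3d4-5678-90ab-cdef-EXAMPLE22222",
    role_arn="arn:aws:iam::444455556666:role/RolesanywhereabacStack-onPremAppRole-1234567890ABC",
)
print(render_profile("developer", "<aws-region>", command))
```

The sketch only formats text; it does not request a certificate or contact AWS.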

To test Archetype 1 deployment

  • The application team can verify that the AWS CLI profile has been properly set up and is capable of retrieving secrets from Secrets Manager by running the following client-side utility command.
    ./helper.sh on-prem-test

This client-side utility (helper.sh) command verifies that the AWS CLI profile (for example, developer) has been set up for IAM Roles Anywhere and can run the GetSecretValue API action to retrieve the value of the secret stored in Secrets Manager.

The sample output should look like the following.

--> Checking credentials ...
{
    "UserId": "AKIAIOSFODNN7EXAMPLE:EXAMPLE11111EXAMPLEEXAMPLE111111",
    "Account": "444455556666",
    "Arn": "arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC"
}
--> Assume role worked for:
arn:aws:sts::444455556666:assumed-role/RolesanywhereabacStack-onPremAppRole-1234567890ABC
--> This role can be used by the application by using the AWS CLI profile 'developer'.
--> For instance, the following output illustrates how to access secret values by using the AWS CLI profile 'developer'.
--> Sample AWS CLI: aws secretsmanager get-secret-value --secret-id $SECRET_ARN --profile $PROFILE_NAME
-------Output-------
{
  "password": "randomuniquepassword",
  "servertype": "testserver1",
  "username": "testuser1"
}
-------Output-------
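
An application could perform the same retrieval programmatically. The following is a minimal Python sketch, not the code behind helper.sh: read_secret is a hypothetical helper that calls get_secret_value on whatever client it is given and parses the JSON secret string. A stub client stands in for the real service so the example runs without AWS access.

```python
import json


def read_secret(client, secret_id):
    """Fetch a secret and parse its JSON payload into a dict."""
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class StubSecretsClient:
    """Stand-in for a Secrets Manager client so the sketch runs offline."""

    def __init__(self, secrets):
        self._secrets = secrets

    def get_secret_value(self, SecretId):
        # Mirrors only the response field that read_secret relies on.
        return {"SecretString": json.dumps(self._secrets[SecretId])}


# "example-secret" is a hypothetical identifier used only for this demo.
client = StubSecretsClient({
    "example-secret": {
        "password": "randomuniquepassword",
        "servertype": "testserver1",
        "username": "testuser1",
    }
})
secret = read_secret(client, "example-secret")
print(secret["username"], secret["servertype"])
```

Assuming boto3 is installed, a real client created with boto3.Session(profile_name="developer").client("secretsmanager") could be passed in place of the stub.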

Archetype 2: Application has migrated to AWS

Archetype 2 has the following requirement:

  • Deploy a sample application to demonstrate how ABAC authorization works for Secrets Manager APIs.

The application, environment, and security teams work together to define a tagging strategy that will be used to restrict access to secrets. After this, the proposed workflow for each persona is as follows:

  1. The environment engineer deploys a common environment infrastructure stack to bootstrap the AWS account with secrets and an IAM policy by using the supplied tagging requirement.
  2. The application developer updates the secrets required by the application by using the client-side utility (helper.sh).
  3. The application developer tests the sample application to confirm operability of ABAC.

Figure 2 shows the workflow for Archetype 2.

Figure 2: Sample migrated application connecting to Secrets Manager


To deploy Archetype 2

  1. (Actions by the application team persona) Clone the repository and update the tagging details at configs/tagconfig.json.

    Note: Don’t modify the tag keys (attribute names); modify only the values.

  2. (Actions by the environment team persona) Run the following command to deploy the common environment infrastructure stack.
    ./helper.sh prepare
  3. (Actions by the application team persona) Update the secret value of the dummy secrets provided by the environment team, using the following command.
    ./helper.sh update-secret

    Note: This command will only update the secret if it is still using the dummy value.

    Then, run the following command to deploy a sample app stack.
    ./helper.sh on-aws

    Note: If your secrets were migrated from a system that did not have the correct access controls, as a best security practice, you should rotate them at least once manually.

At this point, the client-side utility should have deployed a sample application Lambda function. This function connects to a MySQL database by using credentials stored in Secrets Manager. It retrieves the secret values, validates them, and establishes a connection to the database. The function returns a message that indicates whether the connection to the database is working or not.
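
The flow described above (retrieve, validate, connect, report) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the deployed function's source: check_connection, missing_keys, and the required key list are hypothetical, and the database driver is replaced by an injected connect callable so the example runs without a database.

```python
REQUIRED_KEYS = ("username", "password")  # assumed minimum set of fields


def missing_keys(secret):
    """Return the required keys that are absent or empty in the secret."""
    return [key for key in REQUIRED_KEYS if not secret.get(key)]


def check_connection(fetch_secret, connect):
    """Retrieve the secret, validate it, try to connect, and report the result."""
    secret = fetch_secret()
    missing = missing_keys(secret)
    if missing:
        return "Secret is missing required keys: " + ", ".join(missing)
    try:
        connection = connect(secret["username"], secret["password"])
        connection.close()
    except Exception as error:
        return f"Connection to the DB failed: {error}"
    return "Connection to the DB is working."


class FakeConnection:
    """Placeholder for a database connection object."""

    def close(self):
        pass


def fake_connect(username, password):
    # Stands in for a real driver call; accepts only the sample password.
    if password != "randomuniquepassword":
        raise RuntimeError("access denied")
    return FakeConnection()


good_secret = {"username": "testuser1", "password": "randomuniquepassword"}
print(check_connection(lambda: good_secret, fake_connect))
```

Injecting fetch_secret and connect keeps the validation logic testable in isolation from Secrets Manager and the database.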

To test Archetype 2 deployment

  • The application team can use the following client-side utility (helper.sh) to invoke the Lambda function and verify whether the connection is functional or not.
    ./helper.sh on-aws-test

The sample output should look like the following.

--> Check if AWS CLI is installed
--> AWS CLI found
--> Using tags to create Lambda function name and invoking a test
--> Checking the Lambda invoke response.....
--> The status code is 200
--> Reading response from test function:
"Connection to the DB is working."
--> Response shows database connection is working from Lambda function using secret.

Conclusion

Building an effective secrets management solution requires careful planning and implementation. AWS Secrets Manager can help you effectively manage the lifecycle of your secrets at scale. We encourage you to take an iterative approach to building your secrets management solution, starting by focusing on core functional requirements like managing access, defining audit requirements, and building preventative and detective controls for secrets management. In future iterations, you can improve your solution by implementing more advanced functionalities like automatic rotation or resource policies for secrets.

To read Part 1 of this series, go to Migrating your secrets to AWS, Part I: Discovery and design.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Adesh Gairola


Adesh Gairola is a Senior Security Consultant at Amazon Web Services in Sydney, Australia. Adesh is eager to help customers build robust defenses, and design and implement security solutions that enable business transformations. He is always looking for new ways to help customers improve their security posture.

Eric Swamy


Eric is a Senior Security Consultant working in the Professional Services team in Sydney, Australia. He is passionate about helping customers build the confidence and technical capability to move their most sensitive workloads to cloud. When not at work, he loves to spend time with his family and friends outdoors, listen to music, and go on long walks.

Migrating your secrets to AWS Secrets Manager, Part I: Discovery and design

Post Syndicated from Eric Swamy original https://aws.amazon.com/blogs/security/migrating-your-secrets-to-aws-secrets-manager-part-i-discovery-and-design/

“An ounce of prevention is worth a pound of cure.” – Benjamin Franklin

A secret can be defined as sensitive information that is not intended to be known or disclosed to unauthorized individuals, entities, or processes. Secrets like API keys, passwords, and SSH keys provide access to confidential systems and resources, but it can be a challenge for organizations to maintain secure and consistent management of these secrets. Commonly observed anti-patterns in organizational secrets management systems include sharing plaintext secrets in emails or messaging apps, allowing application developers to view secrets in plaintext, hard-coding secrets into applications and storing them in version control systems, failing to rotate secrets regularly, and not logging and monitoring access to secrets.

We have created a two-part Amazon Web Services (AWS) blog post that provides prescriptive guidance on how you can use AWS Secrets Manager to help you achieve a cloud-based and modern secrets management system. In this first blog post, we discuss approaches to discover and classify secrets. In Part 2 of this series, we elaborate on the implementation phase and discuss migration techniques that will help you migrate your secrets to AWS Secrets Manager.

Managing secrets: Best practices and personas

A secret’s lifecycle comprises four phases: create, store, use, and destroy. An effective secrets management solution protects the secret in each of these phases from unauthorized access. Besides being secure, robust, scalable, and highly available, the secrets management system should integrate closely with other tools, solutions, and services that are being used within the organization. Legacy secret stores may lack integration with privileged access management (PAM), logging and monitoring, DevOps, configuration management, and encryption and auditing, which leads to teams not having uniform practices for consuming secrets and creates discrepancies from organizational policies.

Secrets Manager is a secrets management service that helps you protect access to your applications, services, and IT resources. This is a non-exhaustive list of features that AWS Secrets Manager offers:

  • Access control through AWS Identity and Access Management (IAM) — Secrets Manager offers built-in integration with the AWS Identity and Access Management (IAM) service. You can attach access control policies to IAM principals or to secrets themselves (by using resource-based policies).
  • Logging and monitoring — Secrets Manager integrates with AWS logging and monitoring services such as AWS CloudTrail and Amazon CloudWatch. This means that you can use your existing AWS logging and monitoring stack to log access to secrets and audit their usage.
  • Integration with other AWS services — Secrets Manager can store and manage the lifecycle of secrets created by other AWS services like Amazon Relational Database Service (Amazon RDS), Amazon Redshift, and Amazon QuickSight. AWS is constantly working on integrating more services with Secrets Manager.
  • Secrets encryption at rest — Secrets Manager integrates with AWS Key Management Service (AWS KMS). Secrets are encrypted at rest by using an AWS-managed key or customer-managed key.
  • Framework to support the rotation of secrets securely — Rotation helps limit the scope of a compromise and should be an integral part of a modern approach to secrets management. You can use Secrets Manager to schedule automatic database credentials rotation for Amazon RDS, Amazon Redshift, and Amazon DocumentDB. You can use customized AWS Lambda functions to extend the Secrets Manager rotation feature to other secret types, such as API keys and OAuth tokens for on-premises and cloud resources.

Security, cloud, and application teams within an organization need to work together cohesively to build an effective secrets management solution. Each of these teams has unique perspectives and responsibilities when it comes to building an effective secrets management solution, as shown in the following table.

Persona | Responsibilities | What they want | What they don’t want
Security teams/security architect | Define control objectives and requirements from the secrets management system | Least privileged short-lived access, logging and monitoring, and rotation of secrets | Secrets sprawl
Cloud team/environment team | Implement controls, create guardrails, detect events of interest | Scalable, robust, and highly available secrets management infrastructure | Application teams reaching out to them to provision or manage app secrets
Developer/migration engineer | Migrate applications and their secrets to the cloud | Independent control and management of their app secrets | Dependency on external teams

To sum up the requirements from all the personas mentioned here: The approach to provision and consume secrets should be secure, governed, easily scalable, and self-service.

We’ll now discuss how to discover and classify secrets and design the migration in a way that helps you to meet these varied requirements.

Discovery — Assess and categorize existing secrets

The initial discovery phase involves running sessions aimed at discovering, assessing, and categorizing secrets. Migrating applications and associated infrastructure to the cloud requires a strategic and methodical approach to progressively discover and analyze IT assets. This analysis can be used to create high-confidence migration wave plans. You should treat secrets as IT assets and include them in the migration assessment planning.

For application-related secrets, arguably the most appropriate time to migrate a secret is when the application that uses the secret is being migrated itself. This lets you track and report the use of secrets as soon as the application begins to operate in the cloud. If secrets are left on-premises during an application migration, this often creates a risk to the availability of the application. The migrated application ends up having a dependency on the connectivity and availability of the on-premises secrets management system.

The activities performed in this phase are often handled by multiple teams. Depending on the purpose of the secret, this can be a mix of application developers, migration teams, and environment teams.

Following are some common secret types you might come across while migrating applications.

Type | Description
Application secrets | Secrets specific to an application
Client credentials | Cloud to on-premises credentials or OAuth tokens (such as Okta, Google APIs, and so on)
Database credentials | Credentials for cloud-hosted databases, for example, Amazon Redshift, Amazon RDS or Amazon Aurora, Amazon DocumentDB
Third-party credentials | Vendor application credentials or API keys
Certificate private keys | Custom applications or infrastructure that might require programmatic access to the private key
Cryptographic keys | Cryptographic keys used for data encryption or digital signatures
SSH keys | Centralized management of SSH keys can potentially make it easier to rotate, update, and track keys
AWS access keys | On-premises to cloud credentials (IAM)

Creating an inventory for secrets becomes simpler when organizations have an IT asset management (ITAM) or Identity and Access Management (IAM) tool to manage their IT assets (such as secrets) effectively. For organizations that don’t have an on-premises secrets management system, creating an inventory of secrets is a combination of manual and automated efforts. Application subject matter experts (SMEs) should be engaged to find the location of secrets that the application uses. In addition, you can use commercial tools to scan endpoints and source code and detect secrets that might be hardcoded in the application. Amazon CodeGuru is a service that can detect secrets in code. It also provides an option to migrate these secrets to Secrets Manager.

AWS has previously described seven common migration strategies for moving applications to the cloud. These strategies are refactor, replatform, repurchase, rehost, relocate, retain, and retire. For the purposes of migrating secrets, we recommend condensing these seven strategies into three: retire, retain, and relocate. You should evaluate every secret that is being considered for migration against a decision tree to determine which of these three strategies to use. The decision tree evaluates each secret against key business drivers like cost reduction, risk appetite, and the need to innovate. This allows teams to assess if a secret can be replaced by native AWS services, needs to be retained on-premises, migrated to Secrets Manager, or retired. Figure 1 shows this decision process.

Figure 1: Decision tree for assessing a secret for migration


Capture the associated details for secrets that are marked as RELOCATE. This information is essential and must remain confidential. Some secret metadata is transitive and can be derived from related assets, including details such as itsm-tier, sensitivity-rating, cost-center, deployment pipeline, and repository name. With Secrets Manager, you will use resource tags to bind this metadata with the secret.

You should gather at least the following information for the secrets that you plan to relocate and migrate to AWS Secrets Manager.

Metadata about secrets | Rationale for gathering data
Secrets team name or owner | Gathering the name or email address of the individual or team responsible for managing secrets can aid in verifying that they are maintained and updated correctly.
Secrets application name or ID | To keep track of which applications use which secrets, it is helpful to collect application details that are associated with these secrets.
Secrets environment name or ID | Gathering information about the environment to which secrets belong, such as “prod,” “dev,” or “test,” can assist in the efficient management and organization of your secrets.
Secrets data classification | Understanding your organization’s data classification policy can help you identify secrets that contain sensitive or confidential information. It is recommended to handle these secrets with extra care. This information, which may be labeled “confidential,” “proprietary,” or “personally identifiable information (PII),” can indicate the level of sensitivity associated with a particular secret according to your organization’s data classification policy or standard.
Secrets function or usage | If you want to quickly find the secrets you need for a specific task or project, consider documenting their usage. For example, you can document secrets related to “backup,” “database,” “authentication,” or “third-party integration.” This approach can allow you to identify and retrieve the necessary secrets within your infrastructure without spending a lot of time searching for them.
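
One way to carry this inventory metadata into Secrets Manager is to convert each inventory record into resource tags. The Python sketch below assumes a hypothetical SecretRecord structure and hypothetical tag key names; the output uses the list of Key/Value pairs that tagging operations in the Secrets Manager API accept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRecord:
    """One row of a secrets inventory (field names are illustrative)."""
    owner: str
    application: str
    environment: str
    data_classification: str
    usage: str


def to_resource_tags(record):
    """Convert an inventory record into a list of Key/Value tag pairs."""
    pairs = {
        "owner": record.owner,
        "application": record.application,
        "environment": record.environment,
        "dataclassification": record.data_classification,
        "usage": record.usage,
    }
    return [{"Key": key, "Value": value} for key, value in pairs.items()]


# Sample values are invented for the demo.
record = SecretRecord(
    owner="orders-team@example.com",
    application="ordersapp",
    environment="prod",
    data_classification="confidential",
    usage="database",
)
for tag in to_resource_tags(record):
    print(tag["Key"], "=", tag["Value"])
```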

This is also a good time to decide on the rotation strategy for each secret. When you rotate a secret, you update the credentials in both Secrets Manager and the service to which that secret provides access (in other words, the resource). Secrets Manager supports automatic rotation of secrets based on a schedule.

Design the migration solution

In this phase, security and environment teams work together to onboard the Secrets Manager service to their organization’s cloud environment. This involves defining access controls, guardrails, and logging capabilities so that the service can be consumed in a regulated and governed manner.

As a starting point, use the following design principles from the Security Pillar of the AWS Well-Architected Framework to design a migration solution:

  • Implement a strong identity foundation
  • Enable traceability
  • Apply security at all layers
  • Automate security best practices
  • Protect data at rest and in transit
  • Keep people away from data
  • Prepare for security events

The design considerations covered in the rest of this section will help you prepare your AWS environment to host production-grade secrets. This phase can be run in parallel with the discovery phase.

Design your access control system to establish a strong identity foundation

In this phase, you define and implement the strategy to restrict access to secrets stored in Secrets Manager. You can use the AWS Identity and Access Management (IAM) service to specify that identities (human and non-human IAM principals) are only able to access and manage secrets that they own. Organizations that organize their workloads and environments by using separate AWS accounts should consider using a combination of role-based access control (RBAC) and attribute-based access control (ABAC) to restrict access to secrets depending on the granularity of access that’s required.

You can use a scalable automation to deploy and update key IAM roles and policies, including the following:

  • Pipeline deployment policies and roles — This refers to IAM roles for CI/CD pipelines. These pipelines should be the primary mechanism for creating, updating, and deleting secrets in the organization.
  • IAM Identity Center permission sets — These allow human identities access to the Secrets Manager API. We recommend that you provision secrets by using infrastructure as code (IaC). However, there are instances where users need to interact directly with the service. This can be for initial testing, troubleshooting purposes, or updating a secret value when automatic rotation fails or is not enabled.
  • IAM permissions boundary — Boundary policies allow application teams to create IAM roles in a self-serviced, governed, and regulated manner.

Most organizations have Infrastructure, DevOps, or Security teams that deploy baseline configurations into AWS accounts. These solutions help these teams govern the AWS account and often have their own secrets. IAM policies should be created such that the IAM principals created by the application teams are unable to access secrets that are owned by the environment team, and vice versa. To enforce this logical boundary, you can use tagging and naming conventions on your secrets by using IAM.

A sample scheme for tagging your secrets can look like the following.

Tag key | Tag value | Notes | Policy elements | Secret tags
appname | Lowercase; alphanumeric only; user friendly; quickly identifiable | A user-friendly name for the application | PrincipalTag/appname=<value> (applies to role); RequestTag/appname=<value> (applies to caller); secretsmanager:ResourceTag/appname=<value> (applies to the secret) | appname:<value>
appid | Lowercase; alphanumeric only; unique across the organization; fixed length (5–7 characters) | Uniquely identifies the application among other cloud-hosted apps | PrincipalTag/appid=<value>; RequestTag/appid=<value>; secretsmanager:ResourceTag/appid=<value> | appid:<value>
appfunc | Lowercase; fixed values (for example, web, msg, dba, api, storage, container, middleware, tool, service) | Used to describe the function of a particular target that the secret material is associated with (for example, web server, message broker, database) | PrincipalTag/appfunc=<value>; RequestTag/appfunc=<value>; secretsmanager:ResourceTag/appfunc=<value> | appfunc:<value>
appenv | Lowercase; fixed values (for example, dev, test, nonp, prod) | An identifier for the secret usage environment | PrincipalTag/appenv=<value>; RequestTag/appenv=<value>; secretsmanager:ResourceTag/appenv=<value> | appenv:<value>
dataclassification | Lowercase; fixed values (for example, protected, confidential) | Use your organization’s data classification standards to classify the secrets | PrincipalTag/dataclassification=<value>; RequestTag/dataclassification=<value>; secretsmanager:ResourceTag/dataclassification=<value> | dataclassification:<value>
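
The value rules in the table lend themselves to an automated check. The following Python sketch validates a set of tags against those rules. The allowed value sets are copied from the examples in the table and would need to reflect your own standards; validate_tags and RULES are hypothetical names.

```python
import re

# Allowed values taken from the examples in the table; adjust to your standards.
ALLOWED_APPFUNC = {"web", "msg", "dba", "api", "storage", "container",
                   "middleware", "tool", "service"}
ALLOWED_APPENV = {"dev", "test", "nonp", "prod"}
ALLOWED_CLASSIFICATION = {"protected", "confidential"}

RULES = {
    # Lowercase, alphanumeric only.
    "appname": lambda value: re.fullmatch(r"[a-z0-9]+", value) is not None,
    # Lowercase, alphanumeric only, 5 to 7 characters.
    "appid": lambda value: re.fullmatch(r"[a-z0-9]{5,7}", value) is not None,
    "appfunc": lambda value: value in ALLOWED_APPFUNC,
    "appenv": lambda value: value in ALLOWED_APPENV,
    "dataclassification": lambda value: value in ALLOWED_CLASSIFICATION,
}


def validate_tags(tags):
    """Return a list of problems; an empty list means the tags conform."""
    errors = []
    for key, rule in RULES.items():
        if key not in tags:
            errors.append(f"missing tag: {key}")
        elif not rule(tags[key]):
            errors.append(f"invalid value for {key}: {tags[key]!r}")
    return errors


# Sample tag values are invented for the demo.
good = {"appname": "ordersapp", "appid": "ord01", "appfunc": "api",
        "appenv": "prod", "dataclassification": "confidential"}
bad = {"appname": "Orders App", "appid": "ord", "appfunc": "api",
       "appenv": "prod"}
print(validate_tags(good))
print(validate_tags(bad))
```

The uniqueness requirement for appid cannot be checked from a single tag set; it would need a lookup against an application registry.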

If you maintain a registry that documents details of your cloud-hosted applications, most of these tags can be derived from the registry.

It’s common to apply different security and operational policies for the non-production and production environments of a given workload. Although production environments are generally deployed in a dedicated account, it’s common to have less critical non-production apps and environments coexisting in the same AWS account. For operation and governance at scale in these multi-tenanted accounts, you can use attribute-based access control (ABAC) to manage secure access to secrets. ABAC enables you to grant permissions based on tags. The main benefits of using tag-based access control are its scalability and operational efficiency.

Figure 2 shows an example of ABAC in action, where an IAM policy allows access to a secret only if the appfunc, appenv, and appid tags on the secret match the tags on the IAM principal that is trying to access the secrets.

Figure 2: ABAC access control


ABAC works as follows:

  • Tags on a resource define who can access the resource. It is therefore important that resources are tagged upon creation.
  • For a create secret operation, IAM verifies whether the Principal tags on the IAM identity that is making the API call match the request tags in the request.
  • For an update, delete, or read operation, IAM verifies that the Principal tags on the IAM identity that is making the API call match the resource tags on the secret.
  • Regardless of the number of workloads or environments that coexist in the same account, you only need to create one ABAC-based IAM policy. This policy is the same for different kinds of accounts and can be deployed by using a capability like AWS CloudFormation StackSets. This is the reason that ABAC scales well for scenarios where multiple applications and environments are deployed in the same AWS account.
  • IAM roles can use a common IAM policy, such as the one described in the previous bullet point. You need to verify that the roles have the correct tags set on them, according to your tagging convention. This will automatically grant the roles access to the secrets that have the same resource tags.
  • Note that with this approach, tagging secrets and IAM roles becomes the most critical component for controlling access. For this reason, all tags on IAM roles and secrets on Secrets Manager must follow a standard naming convention at all times.
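
The matching rules in the list above can be modeled in a few lines of Python. This is a simplified simulation for reasoning about ABAC, not how IAM evaluates policies internally. It assumes the three tag keys named in the Figure 2 description, and is_allowed and tags_match are hypothetical names.

```python
ABAC_TAG_KEYS = ("appfunc", "appenv", "appid")  # keys named for Figure 2


def tags_match(principal_tags, other_tags, keys=ABAC_TAG_KEYS):
    """True only if every key is set on the principal and has the same value."""
    return all(
        key in principal_tags and principal_tags[key] == other_tags.get(key)
        for key in keys
    )


def is_allowed(operation, principal_tags, request_tags=None, resource_tags=None):
    """Simplified model of the tag checks described in the list above."""
    if operation == "create":
        # Create: principal tags must match the tags supplied in the request.
        return tags_match(principal_tags, request_tags or {})
    if operation in {"read", "update", "delete"}:
        # Other operations: principal tags must match the tags on the secret.
        return tags_match(principal_tags, resource_tags or {})
    return False


# Sample tag values are invented for the demo.
role_tags = {"appfunc": "api", "appenv": "prod", "appid": "ord01"}
secret_tags = {"appfunc": "api", "appenv": "prod", "appid": "ord01"}
other_secret_tags = {"appfunc": "api", "appenv": "dev", "appid": "ord01"}

print(is_allowed("read", role_tags, resource_tags=secret_tags))
print(is_allowed("read", role_tags, resource_tags=other_secret_tags))
print(is_allowed("create", role_tags, request_tags=secret_tags))
```

Because the decision depends only on tag values, the same function covers any number of applications and environments, which mirrors why a single ABAC policy can serve a multi-tenanted account.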

The following is an ABAC-based IAM policy that allows creation, updates, and deletion of secrets based on the tagging scheme described in the preceding table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Condition": {
                "StringEquals": {
                    "secretsmanager:ResourceTag/appfunc": "${aws:PrincipalTag/appfunc}",
                    "secretsmanager:ResourceTag/appenv": "${aws:PrincipalTag/appenv}",
                    "secretsmanager:ResourceTag/name": "${aws:PrincipalTag/name}",
                    "secretsmanager:ResourceTag/appid": "${aws:PrincipalTag/appid}"
                }
            },
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue",
                "secretsmanager:UpdateSecret",
                "secretsmanager:DeleteSecret"
            ],
            "Resource": "arn:aws:secretsmanager:ap-southeast-2:*:secret:${aws:PrincipalTag/name}/${aws:PrincipalTag/appid}/${aws:PrincipalTag/appfunc}/${aws:PrincipalTag/appenv}*",
            "Effect": "Allow",
            "Sid": "AccessBasedOnResourceTags"
        },
        {
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/appfunc": "${aws:PrincipalTag/appfunc}",
                    "aws:RequestTag/appid": "${aws:PrincipalTag/appid}",
                    "aws:RequestTag/name": "${aws:PrincipalTag/name}",
                    "aws:RequestTag/appenv": "${aws:PrincipalTag/appenv}"
                }
            },
            "Action": [
                "secretsmanager:TagResource",
                "secretsmanager:CreateSecret"
            ],
            "Resource": "arn:aws:secretsmanager:ap-southeast-2:*:secret:${aws:PrincipalTag/name}/${aws:PrincipalTag/appid}/${aws:PrincipalTag/appfunc}/${aws:PrincipalTag/appenv}*",
            "Effect": "Allow",
            "Sid": "AccessBasedOnRequestTags"
        }
    ]
}

In addition to controlling access, this policy also enforces a naming convention. IAM principals will only be able to create a secret that matches the following naming scheme.

Secret name = value of tag-key (appid + appfunc + appenv + name)
For example, /ordersapp/api/prod/logisticsapi
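
To see how the Resource element of the policy constrains secret names, the Python sketch below substitutes principal tag values into the Resource string from the sample policy and tests a secret ARN against the result. It is an approximation: fnmatchcase stands in for IAM wildcard matching, the tag values and account ID are hypothetical, and the segment order follows the policy's Resource element (name, appid, appfunc, appenv).

```python
import re
from fnmatch import fnmatchcase

# Resource string copied from the sample policy.
RESOURCE_TEMPLATE = (
    "arn:aws:secretsmanager:ap-southeast-2:*:secret:"
    "${aws:PrincipalTag/name}/${aws:PrincipalTag/appid}/"
    "${aws:PrincipalTag/appfunc}/${aws:PrincipalTag/appenv}*"
)


def resolve_template(template, principal_tags):
    """Replace each ${aws:PrincipalTag/key} variable with the principal's tag value."""
    def substitute(match):
        return principal_tags[match.group(1)]  # KeyError if the tag is missing
    return re.sub(r"\$\{aws:PrincipalTag/([^}]+)\}", substitute, template)


def arn_matches(secret_arn, principal_tags):
    """Approximate the Resource match; a missing principal tag never matches."""
    try:
        pattern = resolve_template(RESOURCE_TEMPLATE, principal_tags)
    except KeyError:
        return False
    # fnmatchcase also gives '?' and '[' special meaning; fine for this sketch.
    return fnmatchcase(secret_arn, pattern)


# Hypothetical principal tags and secret ARNs for the demo.
tags = {"name": "logisticsapi", "appid": "ordersapp",
        "appfunc": "api", "appenv": "prod"}
matching_arn = ("arn:aws:secretsmanager:ap-southeast-2:444455556666:"
                "secret:logisticsapi/ordersapp/api/prod-AbCdEf")
other_arn = ("arn:aws:secretsmanager:ap-southeast-2:444455556666:"
             "secret:logisticsapi/ordersapp/api/dev-AbCdEf")

print(arn_matches(matching_arn, tags))
print(arn_matches(other_arn, tags))
```

The trailing wildcard in the policy accommodates the random suffix that Secrets Manager appends to secret ARNs.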

You can choose to implement ABAC so that the resource name matches the principal tags or the resource tags match the principal tags, or both. These are just different types of ABAC. The sample policy provided here implements both types. It’s important to note that because ABAC-based IAM policies are shared across multiple workloads, potential misconfigurations in the policies will have a wider scope of impact.

For more information about building your ABAC strategy, refer to the blog post Working backward: From IAM policies and principal tags to standardized names and tags for your AWS resources.

You can also add checks in your pipeline to provide early feedback for developers. These checks may potentially assist in verifying whether appropriate tags have been set up in IaC resources prior to their creation. Your pipeline-based controls provide an additional layer of defense and complement or extend restrictions enforced by IAM policies.
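
A pipeline check of this kind could look like the following Python sketch, which scans a CloudFormation-style template (already parsed into a dictionary) for AWS::SecretsManager::Secret resources that lack required tag keys. The required key set and the function name are assumptions for illustration; a real pipeline might instead rely on a policy-as-code tool.

```python
REQUIRED_TAG_KEYS = {"appid", "appfunc", "appenv"}  # assumed minimum tag set


def find_untagged_secrets(template):
    """Map each secret's logical ID to the required tag keys it is missing."""
    findings = {}
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::SecretsManager::Secret":
            continue
        tags = resource.get("Properties", {}).get("Tags", [])
        present = {tag["Key"] for tag in tags}
        missing = REQUIRED_TAG_KEYS - present
        if missing:
            findings[logical_id] = sorted(missing)
    return findings


# Hypothetical template fragment for the demo.
template = {
    "Resources": {
        "GoodSecret": {
            "Type": "AWS::SecretsManager::Secret",
            "Properties": {"Tags": [
                {"Key": "appid", "Value": "ord01"},
                {"Key": "appfunc", "Value": "api"},
                {"Key": "appenv", "Value": "prod"},
            ]},
        },
        "BadSecret": {
            "Type": "AWS::SecretsManager::Secret",
            "Properties": {"Tags": [{"Key": "appid", "Value": "ord01"}]},
        },
        "Bucket": {"Type": "AWS::S3::Bucket"},
    }
}
print(find_untagged_secrets(template))
```

A non-empty result would fail the pipeline stage and tell the developer which tags to add before deployment.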

Resource-based policies

Resource-based policies are a flexible and powerful mechanism to control access to secrets. They are directly associated with a secret and allow specific principals mentioned in the policy to have access to the secret. You can use these policies to grant identities (internal or external to the account) access to a secret.

If your organization uses resource policies, security teams should come up with control objectives for these policies. Controls should be set so that only resource-based policies that meet your organization’s requirements are created. Control objectives for resource policies may be set as follows:

  • Allow statements in the policy grant access to the secret only from the same application.
  • Allow statements in the policy grant access from organization-owned cross-account identities only if they belong to the same environment.

Controls that meet these objectives can be preventative (checks in the pipeline) or responsive (AWS Config rules and Lambda functions invoked by Amazon EventBridge).

Environment teams can also choose to provision resource-based policies for application teams. The provisioning process can be manual, but is preferably automated. For example, these teams can allow application teams to tag secrets with specific values, like the Amazon Resource Name (ARN) of a cross-account IAM role that needs access. An automation invoked by EventBridge rules then asserts that the cross-account principal in the tag belongs to the organization and is in the same environment, and then provisions a resource-based policy for the application team. Using such mechanisms creates a self-service way for teams to create safe resource policies that meet common use cases.
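
The core of such an automation might look like the following Python sketch. It assumes the automation has already looked up which accounts belong to the organization and which environment each one serves (the org_accounts mapping). The function name, the mapping, and the single GetSecretValue action are illustrative assumptions, not a prescribed design.

```python
import re

# Matches an IAM role ARN and captures the 12-digit account ID.
ROLE_ARN = re.compile(r"^arn:aws:iam::(\d{12}):role/[\w+=,.@/-]+$")


def build_resource_policy(principal_arn: str, secret_env: str, org_accounts: dict) -> dict:
    """Return a resource policy for the secret, or raise if the principal is not allowed.

    org_accounts maps account ID -> environment name for accounts in the
    organization; how it is populated is outside this sketch.
    """
    match = ROLE_ARN.match(principal_arn)
    if match is None:
        raise ValueError(f"Tag value is not an IAM role ARN: {principal_arn!r}")
    account_id = match.group(1)
    if org_accounts.get(account_id) != secret_env:
        raise PermissionError(
            f"Account {account_id} is not an organization account in {secret_env!r}"
        )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "secretsmanager:GetSecretValue",
                # In a secret's resource policy, "*" refers to the secret itself.
                "Resource": "*",
            }
        ],
    }
```

Rejecting the request with an exception keeps the failure visible to the automation, which can then notify the application team instead of silently skipping the policy.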

Resource-based policies for Secrets Manager can be a helpful tool for controlling access to secrets, but it is important to consider specific situations where alternative access control mechanisms might be more appropriate. For example, if your access control requirements for secrets involve complex conditions or dependencies that cannot be easily expressed using the resource-based policy syntax, it may be challenging to manage and maintain the policies effectively. In such cases, you may want to consider using a different access control mechanism that better aligns with your requirements. For help determining which type of policy to use, see Identity-based policies and resource-based policies.

Design detective controls to achieve traceability, monitoring, and alerting

Prepare your environment to record and flag events of interest when Secrets Manager is used to store and update secrets. We recommend that you start by identifying risks and then formulate objectives and devise control measures for each identified risk, as follows:

  • Control objectives: What does the control evaluate, and how is it configured? Controls can be configured by using Lambda functions invoked by CloudTrail events, AWS Config rules, or CloudWatch alarms. Controls can evaluate a misconfigured property in a secrets resource or report on an event of interest.
  • Target audience: Identify teams that should be notified if the event occurs. This can be a combination of the environment, security, and application teams.
  • Notification type: SNS, email, Slack channel notifications, or an ITIL ticket.
  • Criticality: Low, medium, or high, based on the criticality of the event.

The following is a sample matrix that can serve as a starting point for documenting detective controls for Secrets Manager. The column titled AWS services in the table offers some suggestions for implementation to help you meet your control objectives.

| Risk | Control objective | Criticality | AWS services |
|---|---|---|---|
| A secret is created without tags that match naming and tagging schemes | Enforce least privilege; Establish logging and monitoring; Manage secrets | HIGH (if using ABAC) | CloudTrail invoked Lambda function or custom AWS Config rule |
| IAM related tags on a secret are updated or removed | Manage secrets; Enforce least privilege | HIGH (if using ABAC) | CloudTrail invoked Lambda function or custom AWS Config rule |
| A resource policy is created when resource policies have not been onboarded to the environment | Manage secrets; Enforce least privilege | HIGH | Pipeline or CloudTrail invoked Lambda function or custom AWS Config rule |
| A secret is marked for deletion from an unusual source (root user or admin break glass role) | Improve availability; Protect configurations; Prepare for incident response; Manage secrets | HIGH | CloudTrail invoked Lambda function |
| A non-compliant resource policy was created, for example, to provide secret access to a foreign account | Enforce least privilege; Manage secrets | HIGH | CloudTrail invoked Lambda function or custom AWS Config rule |
| An AWS KMS key for secrets encryption is marked for deletion | Manage secrets; Protect configurations | HIGH | CloudTrail invoked Lambda function |
| A secret rotation failed | Manage secrets; Improve availability | MEDIUM | Managed AWS Config rule |
| A secret is inactive and is not being accessed for x number of days | Optimize costs | LOW | Managed AWS Config rule |
| Secrets are created that do not use a KMS key | Encrypt data at rest | LOW | Managed AWS Config rule |
| Automatic rotation is not enabled | Manage secrets | LOW | Managed AWS Config rule |
| Successful create, update, and read events for secrets | Establish logging and monitoring | LOW | CloudTrail logs |

We suggest that you deploy these controls in your AWS accounts by using a scalable mechanism, such as CloudFormation StackSets.
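
To make one row of the matrix concrete, here is a minimal Python sketch of the evaluation logic a CloudTrail invoked Lambda function could apply to the "secret is marked for deletion from an unusual source" risk. It assumes the function receives the EventBridge form of a CloudTrail event (fields under detail). The break glass role name and the shape of the returned finding are hypothetical.

```python
# Hypothetical name of the admin break glass role.
BREAK_GLASS_ROLE = "admin-break-glass"


def evaluate_delete_secret(event: dict):
    """Return a finding dict if a secret was deleted by an unusual source, else None."""
    detail = event.get("detail", {})
    if detail.get("eventSource") != "secretsmanager.amazonaws.com":
        return None
    if detail.get("eventName") != "DeleteSecret":
        return None

    identity = detail.get("userIdentity", {})
    is_root = identity.get("type") == "Root"
    # Assumed-role session ARNs contain ":assumed-role/<role name>/<session name>".
    is_break_glass = f":assumed-role/{BREAK_GLASS_ROLE}/" in identity.get("arn", "")
    if not (is_root or is_break_glass):
        return None

    params = detail.get("requestParameters") or {}
    return {
        "risk": "Secret marked for deletion from an unusual source",
        "criticality": "HIGH",
        "secret": params.get("secretId"),
        "actor": identity.get("arn"),
    }
```

The returned finding could then be routed to the target audience through the notification type chosen for that control.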

For more details, see the following topics:

Design for additional protection at the network layer

You can use the guiding principles for Zero Trust networking to add additional mechanisms to control access to secrets. The best security doesn’t come from making a binary choice between identity-centric and network-centric controls, but by using both effectively in combination with each other.

VPC endpoints allow you to provide a private connection between your VPC and Secrets Manager API endpoints. They also provide the ability to attach a policy that allows you to enforce identity-centric rules at a logical network boundary. You can use global context keys like aws:PrincipalOrgID in VPC endpoint policies to allow requests to Secrets Manager service only from identities that belong to the same AWS organization. You can also use aws:sourceVpce and aws:sourceVpc IAM conditions to allow access to the secret only if the request originates from a specific VPC endpoint or VPC, respectively.
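
As an illustration of the aws:sourceVpce condition, the following Python sketch builds a resource policy for a secret that denies GetSecretValue unless the request arrives through a given VPC endpoint. The function name, the Sid, and the choice of a Deny statement with StringNotEquals are assumptions for this sketch; a statement like this also blocks callers that do not use the endpoint, so test it carefully before applying it.

```python
import json


def deny_outside_vpce_policy(vpce_id: str) -> dict:
    """Build a resource policy that denies reads not made through the given VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "*",
                "Condition": {"StringNotEquals": {"aws:SourceVpce": vpce_id}},
            }
        ],
    }


if __name__ == "__main__":
    # The endpoint ID below is a placeholder.
    print(json.dumps(deny_outside_vpce_policy("vpce-1234567890abcdef0"), indent=4))
```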

For more details on VPC endpoints, see Using an AWS Secrets Manager VPC endpoint.

Design for least privileged access to encryption keys

To reduce unauthorized access, secrets should be encrypted at rest. Secrets Manager integrates with AWS KMS and uses envelope encryption. Every secret in Secrets Manager is encrypted with a unique data key. Each data key is protected by a KMS key. Whenever the secret value inside a secret changes, Secrets Manager generates a new data key to protect it. The data key is encrypted under a KMS key and stored in the metadata of the secret. To decrypt the secret, Secrets Manager first decrypts the encrypted data key by using the KMS key in AWS KMS.

The following is a sample AWS KMS policy that permits cryptographic operations to a KMS key only from the Secrets Manager service within an AWS account, and allows the AWS KMS decrypt action from a specific IAM principal throughout the organization.

{
    "Version": "2012-10-17",
    "Id": "secrets_manager_encrypt_org",
    "Statement": [
        {
            "Sid": "Root Access",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::444455556666:root"
            },
            "Action": "kms:*",
            "Resource": "*"
        },
        {
            "Sid": "Allow access for Key Administrators",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
             "arn:aws:iam::444455556666:role/platformRoles/KMS-key-admin-role",                    "arn:aws:iam::444455556666:role/platformRoles/KMS-key-automation-role"
                ]
            },
            "Action": [
                "kms:CancelKeyDeletion",
                "kms:Create*",
                "kms:Delete*",
                "kms:Describe*",
                "kms:Disable*",
                "kms:Enable*",
                "kms:Get*",
                "kms:List*",
                "kms:Put*",
                "kms:Revoke*",
                "kms:ScheduleKeyDeletion",
                "kms:TagResource",
                "kms:UntagResource",
                "kms:Update*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow Secrets Manager use of the KMS key for a specific account",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:DescribeKey"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "kms:CallerAccount": "444455556666",
                    "kms:ViaService": "secretsmanager.us-east-1.amazonaws.com"
                }
            }
        },
        {
            "Sid": "Allow use of Secrets Manager secrets from a specific IAM role (service account) throughout your org",
            "Effect": "Allow",
            "Principal": {
                "AWS": "*"
            },
            "Action": "kms:Decrypt",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-exampleorgid"
                },
                "StringLike": {
                    "aws:PrincipalArn": "arn:aws:iam::*:role/platformRoles/secretsAccessRole"
                }
            }
        }
    ]
}

Additionally, you can use the secretsmanager:KmsKeyId IAM condition key to allow secrets creation only when AWS KMS encryption is enabled for the secret. You can also add checks in your pipeline that allow the creation of a secret only when a KMS key is associated with the secret.
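
A pipeline check of this kind can be quite small. The following Python sketch, with a hypothetical function name, lists the secrets in a parsed CloudFormation template that do not set the KmsKeyId property.

```python
def secrets_without_kms_key(template: dict) -> list:
    """Return logical IDs of AWS::SecretsManager::Secret resources with no KmsKeyId."""
    return sorted(
        logical_id
        for logical_id, resource in template.get("Resources", {}).items()
        if resource.get("Type") == "AWS::SecretsManager::Secret"
        and not resource.get("Properties", {}).get("KmsKeyId")
    )
```

The pipeline could reject the change when the returned list is not empty.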

Design or update applications for efficient retrieval of secrets

In applications, you can retrieve your secrets by calling the GetSecretValue function in the available AWS SDKs. However, we recommend that you cache your secret values by using client-side caching. Caching secrets can improve speed, help to prevent throttling by limiting calls to the service, and potentially reduce your costs.
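
To show the idea behind client-side caching, here is a minimal time-to-live cache in Python. It is a sketch rather than the AWS-provided caching client: the class name, the five-minute default, and the injected fetch function are assumptions. With boto3, the fetch function could wrap get_secret_value and return its SecretString field.

```python
import time
from typing import Callable, Dict, Tuple


class SecretCache:
    """Cache secret values in memory and refetch them after ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps secret ID -> (value, expiry time on the clock).
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, secret_id: str) -> str:
        now = self._clock()
        cached = self._entries.get(secret_id)
        if cached is not None and now < cached[1]:
            return cached[0]  # Still fresh: no call to the service.
        value = self._fetch(secret_id)
        self._entries[secret_id] = (value, now + self._ttl)
        return value
```

A shorter time-to-live picks up rotated secrets sooner at the cost of more API calls, so the value is a trade-off to tune for each workload.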

Secrets Manager integrates with the following AWS services to provide efficient retrieval of secrets:

  • For Amazon RDS, you can integrate with Secrets Manager to simplify managing master user passwords for Amazon RDS database instances. Amazon RDS can manage the master user password and stores it securely in Secrets Manager, which may eliminate the need for custom AWS Lambda functions to manage password rotations. The integration can help you secure your database by encrypting the secrets, using your own managed key or an AWS KMS key provided by Secrets Manager. As a result, the master user password is not visible in plaintext during the database creation workflow. This feature is available for the Amazon RDS and Aurora engines, and more information can be found in the Amazon RDS and Aurora User Guides.
  • For Amazon Elastic Kubernetes Service (Amazon EKS), you can use the AWS Secrets and Configuration Provider (ASCP) for the Kubernetes Secrets Store CSI Driver. This open-source project enables you to mount Secrets Manager secrets as Kubernetes secrets. The driver translates Kubernetes secret objects into Secrets Manager API calls, allowing you to access and manage secrets from within Kubernetes. After you configure the Kubernetes Secrets Store CSI Driver, you can create Kubernetes secrets backed by Secrets Manager secrets. These secrets are securely stored in Secrets Manager and can be accessed by your applications that are running in Amazon EKS.
  • For Amazon Elastic Container Service (Amazon ECS), sensitive data can be securely stored in Secrets Manager secrets and then accessed by your containers through environment variables or as part of the log configuration. This provides a simple way to inject sensitive data into your containers without hardcoding it.
  • For AWS Lambda, you can use the AWS Parameters and Secrets Lambda Extension to retrieve and cache Secrets Manager secrets in Lambda functions without the need for an AWS SDK. Retrieving a cached secret is faster than retrieving it from Secrets Manager, and using a cache can reduce costs, because there is a charge for calling Secrets Manager APIs. For more details, see the Secrets Manager User Guide.
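
For the Lambda case, the extension exposes a local HTTP endpoint that the function queries instead of calling the SDK. The following Python sketch only builds that request. It assumes the extension's default port of 2773 and that the function passes its AWS_SESSION_TOKEN value in the X-Aws-Parameters-Secrets-Token header; the function name is hypothetical, and you should confirm the details in the Secrets Manager User Guide.

```python
import urllib.parse
import urllib.request


def build_extension_request(secret_id: str, session_token: str, port: int = 2773):
    """Build (but do not send) a request to the local extension endpoint."""
    query = urllib.parse.urlencode({"secretId": secret_id})
    url = f"http://localhost:{port}/secretsmanager/get?{query}"
    return urllib.request.Request(
        url, headers={"X-Aws-Parameters-Secrets-Token": session_token}
    )


# Inside a Lambda function, the request could be sent like this (not run here):
#   req = build_extension_request("my-secret", os.environ["AWS_SESSION_TOKEN"])
#   with urllib.request.urlopen(req) as resp:
#       secret = json.loads(resp.read())["SecretString"]
```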

For additional information on how to use Secrets Manager secrets with AWS services, refer to the following resources:

Develop an incident response plan for security events

We recommend that you prepare for unforeseeable incidents such as unauthorized access to your secrets. Developing an incident response plan can help minimize the impact of a security event, facilitate a prompt and effective response, and help protect your organization’s assets and reputation. The traceability and monitoring controls we discussed in the previous section can be used both during and after the incident.

The Computer Security Incident Handling Guide SP 800-61 Rev. 2, which was created by the National Institute of Standards and Technology (NIST), can help you create an incident response plan for specific incident types. It provides a thorough and organized approach to incident response, covering everything from initial preparation and planning to detection and analysis, containment, eradication, recovery, and follow-up. The framework emphasizes the importance of continual improvement and learning from past incidents to enhance the overall security posture of the organization.

Refer to the following documentation for further details and sample playbooks:

Conclusion

In this post, we discussed how organizations can take a phased approach to migrate their secrets to AWS Secrets Manager. Your teams can use the thought exercises mentioned in this post to decide if they would like to rehost, replatform, or retire secrets. We discussed what guardrails should be enabled for application teams to consume secrets in a safe and regulated manner. We also touched upon ways organizations can discover and classify their secrets.

In Part 2 of this series, we go into the details of the migration implementation phase and walk you through a sample solution that you can use to integrate on-premises applications with Secrets Manager.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.


Eric Swamy

Eric is a Senior Security Consultant working in the Professional Services team in Sydney, Australia. He is passionate about helping customers build the confidence and technical capability to move their most sensitive workloads to cloud. When not at work, he loves to spend time with his family and friends outdoors, listen to music, and go on long walks.

Adesh Gairola

Adesh Gairola is a Senior Security Consultant at Amazon Web Services in Sydney, Australia. Adesh is eager to help customers build robust defenses, and design and implement security solutions that enable business transformations. He is always looking for new ways to help customers improve their security posture.

Protect APIs with Amazon API Gateway and perimeter protection services

Post Syndicated from Pengfei Shao original https://aws.amazon.com/blogs/security/protect-apis-with-amazon-api-gateway-and-perimeter-protection-services/

As Amazon Web Services (AWS) customers build new applications, APIs have been key to driving the adoption of these offerings. APIs simplify client integration and provide for efficient operations and management of applications by offering standard contracts for data exchange. APIs are also the front door to hosted applications that need to be effectively secured, monitored, and metered to provide resilient infrastructure.

In this post, we will discuss how to help protect your APIs by building a perimeter protection layer with Amazon CloudFront, AWS WAF, and AWS Shield and putting it in front of Amazon API Gateway endpoints. Amazon API Gateway is a fully managed AWS service that you can use to create, publish, maintain, monitor, and secure REST, HTTP, and WebSocket APIs at any scale.

Solution overview

CloudFront, AWS WAF, and Shield provide a layered security perimeter that co-resides at the AWS edge and provides scalable, reliable, and high-performance protection for applications and content. For more information, see the AWS Best Practices for DDoS Resiliency whitepaper.

By using CloudFront as the front door to APIs that are hosted on API Gateway, globally distributed API clients can get accelerated API performance. API Gateway endpoints that are hosted in an AWS Region gain access to scaled distributed denial of service (DDoS) mitigation capacity across the AWS global edge network.

When you protect CloudFront distributions with AWS WAF, you can protect your API Gateway API endpoints against common web exploits and bots that can affect availability, compromise security, or consume excessive resources. AWS Managed Rules for AWS WAF help provide protection against common application vulnerabilities or other unwanted traffic, without the need for you to write your own rules. AWS WAF rate-based rules automatically block traffic from source IPs when they exceed the thresholds that you define, which helps to protect your application against web request floods, and alerts you to sudden spikes in traffic that might indicate a potential DDoS attack.

Shield mitigates infrastructure layer DDoS attacks against CloudFront distributions in real time, without observable latency. When you protect a CloudFront distribution with Shield Advanced, you gain additional detection and mitigation against large and sophisticated DDoS attacks, near real-time visibility into attacks, and integration with AWS WAF. When you configure Shield Advanced automatic application layer DDoS mitigation, Shield Advanced responds to application layer (layer 7) attacks by creating, evaluating, and deploying custom AWS WAF rules.

To take advantage of the perimeter protection layer built with CloudFront, AWS WAF, and Shield, and to help avoid exposing API Gateway endpoints directly, you can use the following approaches to restrict API access through CloudFront only. For more information about these approaches, see the Security Overview of Amazon API Gateway whitepaper.

  1. CloudFront can insert the X-API-Key header before it forwards the request to API Gateway, and API Gateway validates the API key when receiving the requests. For more information, see Protecting your API using Amazon API Gateway and AWS WAF — Part 2.
  2. CloudFront can insert a custom header (not X-API-Key) with a known secret that is shared with API Gateway. An AWS Lambda custom request authorizer that is configured in API Gateway validates the secret. For more information, see Restricting access on HTTP API Gateway Endpoint with Lambda Authorizer.
  3. CloudFront can sign the request with AWS Signature Version 4 by using Lambda@Edge before it sends the request to API Gateway. Configured AWS Identity and Access Management (IAM) authorization in API Gateway validates the signature and verifies the identity of the requester.

Although the X-API-Key header approach is straightforward to implement at a lower cost, it’s only applicable to customers who are using REST API endpoints. If the X-API-Key header already exists, CloudFront will overwrite it. The custom header approach addresses this limitation, but it has an additional cost due to the use of the Lambda authorizer. With both approaches, there is an operational overhead for managing keys and rotating the keys periodically. Also, it isn’t a security best practice to use long-term secrets for authorization.
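
To illustrate the custom header approach, the following Python sketch shows a Lambda authorizer that compares a shared secret header against an expected value. The header name x-origin-verify and the ORIGIN_VERIFY_SECRET environment variable are hypothetical, and the return value assumes an HTTP API authorizer configured for simple responses; a REST API authorizer would return an IAM policy document instead.

```python
import hmac
import os

# Hypothetical header that CloudFront adds to origin requests.
HEADER_NAME = "x-origin-verify"


def handler(event, context):
    """Authorize the request only if the shared secret header matches."""
    expected = os.environ.get("ORIGIN_VERIFY_SECRET", "")
    headers = event.get("headers") or {}
    supplied = headers.get(HEADER_NAME, "")
    # compare_digest avoids leaking how many leading characters matched.
    authorized = bool(expected) and hmac.compare_digest(
        supplied.encode("utf-8"), expected.encode("utf-8")
    )
    return {"isAuthorized": authorized}
```

In practice the expected value would be loaded from a secret store and rotated, which is the operational overhead described above.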

By using the AWS Signature Version 4 approach, you can minimize this type of operational overhead through the use of requests signed with Signature Version 4 in Lambda@Edge. The signing uses temporary credentials that AWS Security Token Service (AWS STS) provides, and built-in API Gateway IAM authorization performs the request signature validation. There is an additional Lambda@Edge cost in this approach. This approach supports the three API endpoint types available in API Gateway — REST, HTTP, and WebSocket — and it helps secure requests by verifying the identity of the requester, protecting data in transit, and protecting against potential replay attacks. We describe this approach in detail in the next section.

Solution architecture

Figure 1 shows the architecture of the Signature Version 4 solution.

Figure 1: High-level flow of a client request with sequence of events

The sequence of events that occurs when the client sends a request is as follows:

  1. A client sends a request to an API endpoint that is fronted by CloudFront.
  2. AWS WAF inspects the request at the edge location according to the web access control list (web ACL) rules that you configured. With Shield Advanced automatic application-layer mitigation enabled, when Shield Advanced detects a DDoS attack and identifies the attack signatures, Shield Advanced creates AWS WAF rules inside an associated web ACL to mitigate the attack.
  3. CloudFront handles the request and invokes the Lambda@Edge function before sending the request to API Gateway.
  4. The Lambda@Edge function signs the request with Signature Version 4 by adding the necessary headers.
  5. API Gateway verifies that the Lambda@Edge function has the necessary permissions, and then sends the request to the backend.
  6. If an unauthorized client sends a request directly to an API Gateway endpoint, it receives an HTTP 403 Forbidden message.

Solution deployment

The sample solution contains the following main steps:

  1. Preparation
  2. Deploy the CloudFormation template
  3. Enable IAM authorization in API Gateway
  4. Confirm successful viewer access to the CloudFront URL
  5. Confirm that direct access to the API Gateway API URL is blocked
  6. Review the CloudFront configuration
  7. Review the Lambda@Edge function and its IAM role
  8. Review the AWS WAF web ACL configuration
  9. (Optional) Protect the CloudFront distribution with Shield Advanced

Step 1: Preparation

Before you deploy the solution, you will first need to create an API Gateway endpoint.

To create an API Gateway endpoint

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Select this image to open a link that starts building the CloudFormation stack

    Note: The stack will launch in the US East (N. Virginia) Region (us-east-1). To deploy the solution to another Region, download the solution’s CloudFormation template, and deploy it to the selected Region.

    When you launch the stack, it creates an API called PetStoreAPI that is deployed to the prod stage.

  2. In the Stages navigation pane, expand the prod stage, select GET on /pets/{petId}, and then copy the Invoke URL value of https://api-id.execute-api.region.amazonaws.com/prod/pets/{petId}. {petId} stands for a path variable.
  3. In the address bar of a browser, paste the Invoke URL value. Make sure to replace {petId} with your own information (for example, 1), and press Enter to submit the request. A 200 OK response should return with the following JSON payload:
    {
      "id": 1,
      "type": "dog",
      "price": 249.99
    }

In this post, we will refer to this API Gateway endpoint as the CloudFront origin.

Step 2: Deploy the CloudFormation template

The next step is to deploy the CloudFormation template of the solution.

The CloudFormation template includes the following:

  • A CloudFront distribution that uses an API Gateway endpoint as the origin
  • An AWS WAF web ACL that is associated with the CloudFront distribution
  • A Lambda@Edge function that signs the request with Signature Version 4; the CloudFront distribution invokes it before forwarding the request to the origin
  • An IAM role for the Lambda@Edge function

To deploy the CloudFormation template

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Select this image to open a link that starts building the CloudFormation stack

    Note: The stack will launch in the US East (N. Virginia) Region (us-east-1). To deploy the solution to another Region, download the solution’s CloudFormation template, provide the required parameters, and deploy it to the selected Region.

  2. On the Specify stack details page, update with the following:
    1. For Stack name, enter APIProtection
    2. For the parameter APIGWEndpoint, enter the API Gateway endpoint in the following format. Make sure to replace {api-id} and <Region> with your own information.

    {api-id}.execute-api.<Region>.amazonaws.com

  3. Choose Next to continue the stack deployment.
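
Because the APIGWEndpoint parameter expects a bare host name rather than the full Invoke URL, a quick local check can catch a pasted scheme or path before you launch the stack. The following Python sketch is an illustration only; the pattern is an assumption about what a valid value looks like, not the template's own validation.

```python
import re

# Assumed shape: <api-id>.execute-api.<region>.amazonaws.com, with no scheme or path.
ENDPOINT_PATTERN = re.compile(r"[a-z0-9]+\.execute-api\.[a-z0-9-]+\.amazonaws\.com")


def is_valid_apigw_endpoint(value: str) -> bool:
    """Return True if value looks like a bare API Gateway endpoint host name."""
    return ENDPOINT_PATTERN.fullmatch(value.strip()) is not None
```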

It takes a couple of minutes to finish the deployment. After it finishes, the Outputs tab lists the CloudFront domain URL, as shown in Figure 2.

Figure 2: CloudFormation template output

Step 3: Enable IAM authorization in API Gateway

Before you verify the solution, you will enable IAM authorization on the API endpoint first, which enforces Signature Version 4 verification at API Gateway. The following steps are applied for a REST API; you could also enable IAM authorization on an HTTP API or WebSocket API.

To enable IAM authorization in API Gateway

  1. In the API Gateway console, choose the name of your API.
  2. In the Resources pane, choose the GET method for the resource /pets. In the Method Execution pane, choose Method Request.
  3. Under Settings, for Authorization, choose the pencil icon (Edit). Then, in the dropdown list, choose AWS_IAM, and choose the check mark icon (Update).
  4. Repeat steps 2 and 3 for the resource /pets/{petId}.
  5. Deploy your API so that the changes take effect. When deploying, choose prod as the stage.
Figure 3: Enable IAM authorization in API Gateway

Step 4: Confirm successful viewer access to the CloudFront URL

Now that you’ve deployed the setup, you can verify that you are able to access the API through the CloudFront distribution.

To confirm viewer access through CloudFront

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Outputs tab, copy the value for the CFDistribution entry and append /prod/pets to it, then open the URL in a new browser tab or window. The result should look similar to the following, which confirms successful viewer access through CloudFront.
    Figure 4: Successful API response when accessing API through CloudFront distribution

Step 5: Confirm that direct access to the API Gateway API URL is blocked

Next, verify whether direct access to the API Gateway API endpoint is blocked.

Copy your API Gateway endpoint URL and append /prod/pets to it, then open the URL in a new browser tab or window. The result should look similar to the following, which confirms that direct viewer access through API Gateway is blocked.

Figure 5: API error response when attempting to access API Gateway directly

Step 6: Review CloudFront configuration

Now that you’ve confirmed that access to the API Gateway endpoint is restricted to CloudFront only, you will review the CloudFront configuration that enables this restriction.

To review the CloudFront configuration

  1. In the CloudFormation console, choose the APIProtection stack. On the stack Resources tab, under the CFDistribution entry, copy the distribution ID.
  2. In the CloudFront console, select the distribution that has the distribution ID that you noted in the preceding step. On the Behaviors tab, select the behavior with path pattern Default (*).
  3. Choose Edit and scroll to the Cache key and origin requests section. You can see that Origin request policy is set to AllViewerExceptHostHeader, which allows CloudFront to forward viewer headers, cookies, and query strings to origins except the Host header. This policy is intended for use with the API Gateway origin.
  4. Scroll down to the Function associations – optional section.
    Figure 6: CloudFront configuration – Function association with origin request

    You can see that a Lambda@Edge function is associated with the origin request event; CloudFront invokes this function before forwarding requests to the origin. You can also see that the Include body option is selected, which exposes the request body to Lambda@Edge for HTTP methods like POST/PUT, and the request payload hash will be used for Signature Version 4 signing in the Lambda@Edge function.

Step 7: Review the Lambda@Edge function and its IAM role

In this step, you will review the Lambda@Edge function code and its IAM role, and learn how the function signs the request with Signature Version 4 before forwarding to API Gateway.

To review the Lambda@Edge function code

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Resources tab, choose the Sigv4RequestLambdaFunction link to go to the Lambda function, and review the function code. You can see that it follows the Signature Version 4 signing process and uses an AWS access key to calculate the signature. The AWS access key is a temporary security credential provided when the IAM role for Lambda is being assumed.
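
For readers who want to see the signing steps in one place, the following Python sketch walks through the Signature Version 4 calculation for a request to API Gateway (service name execute-api). It is not the solution's function code. It assumes the path and query string are already in canonical (URI-encoded, sorted) form and that only the host, x-amz-date, and x-amz-security-token headers are signed; the function names are hypothetical.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the signing key: a chain of HMACs over date, Region, and service."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def sign_request(method, host, path, query, body: bytes, amz_date,
                 region, access_key, secret_key, session_token) -> dict:
    """Return the headers to add to the origin request, including Authorization."""
    date_stamp = amz_date[:8]  # amz_date looks like 20240101T000000Z
    payload_hash = hashlib.sha256(body).hexdigest()
    headers = {
        "host": host,
        "x-amz-date": amz_date,
        "x-amz-security-token": session_token,  # temporary credentials
    }
    signed_headers = ";".join(sorted(headers))
    canonical_headers = "".join(f"{name}:{headers[name]}\n" for name in sorted(headers))
    canonical_request = "\n".join(
        [method, path, query, canonical_headers, signed_headers, payload_hash]
    )
    scope = f"{date_stamp}/{region}/execute-api/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    key = signing_key(secret_key, date_stamp, region, "execute-api")
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()
    headers["authorization"] = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
    return headers
```

Because the payload hash is part of the canonical request, the function needs access to the request body, which is why the Include body option matters for methods like POST and PUT.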

To review the IAM role for Lambda

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Resources tab, choose the Sigv4RequestLambdaFunctionExecutionRole link to go to the IAM role. Expand the permission policy to review the permissions. You can see that the policy allows the API Gateway endpoint to be invoked.
            {
                "Action": [
                    "execute-api:Invoke"
                ],
                "Resource": [
                    "arn:aws:execute-api:<region>:<account-id>:<api-id>/*/*/*"
                ],
                "Effect": "Allow"
            }

Because IAM authorization is enabled, when API Gateway receives the request, it checks whether the client has execute-api:Invoke permission for the API and route before handling the request.
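
The resource in that policy follows the execute-api ARN layout of API ID, stage, HTTP method, and resource path. As a small illustration, the following Python helper (a hypothetical name) builds such an ARN, which can be handy if you want to scope the permission more narrowly than the wildcard form shown above.

```python
def execute_api_arn(region: str, account_id: str, api_id: str,
                    stage: str = "*", method: str = "*", resource_path: str = "*") -> str:
    """Build an execute-api ARN: api-id/stage/method/resource-path."""
    # The resource path in the ARN has no leading slash.
    path = resource_path.lstrip("/")
    return f"arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/{method}/{path}"
```

For example, passing stage "prod", method "GET", and resource path "/pets/*" limits the permission to GET requests under /pets in the prod stage.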

Step 8: Review AWS WAF web ACL configuration

In this step, you will review the web ACL configuration in AWS WAF.

AWS Managed Rules for AWS WAF help provide protection against common application vulnerabilities or other unwanted traffic. The web ACL for this solution includes several AWS managed rule groups as an example. The Amazon IP reputation list managed rule group helps to mitigate bots and reduce the risk of threat actors by blocking problematic IP addresses. The Core rule set (CRS) managed rule group helps provide protection against exploitation of a wide range of vulnerabilities, including some of the high risk and commonly occurring vulnerabilities described in the OWASP Top 10. The Known bad inputs managed rule group helps to reduce the risk of threat actors by blocking request patterns that are known to be invalid and that are associated with exploitation or discovery of vulnerabilities, like Log4j.

AWS WAF supports rate-based rules to block requests originating from IP addresses that exceed the set threshold per 5-minute time span, until the rate of requests falls below the threshold. We have used one such rule in the following example, but you could layer the rules for better security posture. You can configure multiple rate-based rules, each with a different threshold and scope (like URI, IP list, or country) for better protection. For more information on best practices for AWS WAF rate-based rules, see The three most important AWS WAF rate-based rules.
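
As a sketch of how layered rate-based rules could be expressed, the following Python function builds a rule definition in the shape used by the AWS WAFV2 API, optionally scoped down to a URI prefix. The function name, rule names, and thresholds are illustrative assumptions and are not the values used by the solution's web ACL.

```python
def rate_based_rule(name: str, limit: int, priority: int, uri_prefix=None) -> dict:
    """Build a WAFV2 rule that blocks source IPs exceeding limit requests per 5 minutes."""
    statement = {"Limit": limit, "AggregateKeyType": "IP"}
    if uri_prefix is not None:
        # Count only requests whose URI path starts with the given prefix.
        statement["ScopeDownStatement"] = {
            "ByteMatchStatement": {
                "SearchString": uri_prefix,
                "FieldToMatch": {"UriPath": {}},
                "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                "PositionalConstraint": "STARTS_WITH",
            }
        }
    return {
        "Name": name,
        "Priority": priority,
        "Statement": {"RateBasedStatement": statement},
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }


# Example layering: a blanket rule plus a stricter rule for one path (values are placeholders).
rules = [
    rate_based_rule("BlanketRateLimit", limit=2000, priority=1),
    rate_based_rule("PetsRateLimit", limit=200, priority=2, uri_prefix="/prod/pets"),
]
```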

To review the web ACL configuration

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Outputs tab, choose the EdgeLayerWebACL link to go to the web ACL configuration, and then choose the Rules tab to review the rules for this web ACL. On the Rules tab, you can see that the web ACL includes the following rule and rule groups.
    Figure 7: AWS WAF web ACL configuration


  3. Choose the Associated AWS resources tab. You should see that the CloudFront distribution is associated with this web ACL.

Step 9: (Optional) Protect the CloudFront distribution with Shield Advanced

In this optional step, you will protect your CloudFront distribution with Shield Advanced. This adds additional protection on top of the protection provided by AWS WAF managed rule groups and rate-based rules in the web ACL that is associated with the CloudFront distribution.

Note: Proceed with this step only if you have an annual subscription to Shield Advanced.

AWS Shield is a managed DDoS protection service that is offered in two tiers: AWS Shield Standard and AWS Shield Advanced. All AWS customers benefit from the automatic protection of Shield Standard, at no additional cost. Shield Standard helps defend against the most common, frequently occurring network and transport layer DDoS attacks that target your website or applications. AWS Shield Advanced is a paid service that requires a 1-year commitment—you pay one monthly subscription fee, plus usage fees based on gigabytes (GB) of data transferred out. Shield Advanced provides expanded DDoS attack protection for your applications.

Besides providing visibility and additional detection and mitigation against large and sophisticated DDoS attacks, Shield Advanced also gives you 24/7 access to the Shield Response Team (SRT) and cost protection against spikes in your AWS bill that might result from a DDoS attack against your protected resources. When you use both Shield Advanced and AWS WAF to help protect your resources, AWS waives the basic AWS WAF fees for web ACLs, rules, and web requests for your protected resources. You can grant permission to the SRT to act on your behalf, and also configure proactive engagement so that SRT contacts you directly when the availability and performance of your application is impacted by a possible DDoS attack.

Shield Advanced automatic application-layer DDoS mitigation compares current traffic patterns to historic traffic baselines to detect deviations that might indicate a DDoS attack. When you enable automatic application-layer DDoS mitigation, if your protected resource doesn’t yet have a history of normal application traffic, we recommend that you set the AWS WAF rule action to Count until a history of normal application traffic has been established. Shield Advanced establishes baselines that represent normal traffic patterns after protecting resources for at least 24 hours, and the baselines are most accurate after 30 days. To mitigate application layer attacks automatically, change the AWS WAF rule action to Block after you’ve established a normal traffic baseline.

To help protect your CloudFront distribution with Shield Advanced

  1. In the WAF & Shield console, in the AWS Shield section, choose Protected Resources, and then choose Add resources to protect.
  2. For Resource type, select CloudFront distribution, and then choose Load resources.
  3. In the Select resources section, select the CloudFront distribution that you used in Step 6 of this post. Then choose Protect with Shield Advanced.
  4. In the Automatic application layer DDoS mitigation section, choose Enable. Leave the AWS WAF rule action as Count, and then choose Next.
  5. (Optional, but recommended) Under Associated health check, choose one Amazon Route 53 health check to associate with the protection, and then choose Next. The Route 53 health check is used to enable health-based detection, which can improve responsiveness and accuracy in attack detection and mitigation. Associating the protected resource with a Route 53 health check is also one of the prerequisites to be protected with proactive engagement. You can create the health check by following these best practices.
  6. (Optional) In the Select SNS topic to notify for DDoS detected alarms section, select the SNS topic that you want to use for notification for DDoS detected alarms, then choose Next.
  7. Choose Finish configuration.

With automatic application-layer DDoS mitigation configured, Shield Advanced creates a rule group in the web ACL that you have associated with your resource. Shield Advanced depends on the rule group for automatic application-layer DDoS mitigation.

To review the rule group created by Shield Advanced

  1. In the CloudFormation console, choose the APIProtection stack. On the stack Outputs tab, look for the EdgeLayerWebACL entry.
  2. Choose the EdgeLayerWebACL link to go to the web ACL configuration.
  3. Choose the Rules tab, and look for the rule group with the name that starts with ShieldMitigationRuleGroup, at the bottom of the rule list. This rule group is managed by Shield Advanced, and is not viewable.
    Figure 8: Shield Advanced created rule group for DDoS mitigation


Considerations

Here are some further considerations as you implement this solution:

Conclusion

In this blog post, we introduced managing public-facing APIs through API Gateway, and helping protect API Gateway endpoints by using CloudFront and AWS perimeter protection services (AWS WAF and Shield Advanced). We walked through the steps to add Signature Version 4 authentication information to the CloudFront originated API requests, providing trusted access to the APIs. Together, these actions present a best practice approach to build a DDoS-resilient architecture that helps protect your application’s availability by preventing many common infrastructure and application layer DDoS attacks.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Pengfei Shao


Pengfei is a Senior Technical Account Manager at AWS based in Stockholm, with more than 20 years of experience in the telecom and IT industries. His main focus is to help AWS Enterprise Support customers remain operationally healthy, secure, and cost efficient in AWS. He also focuses on the AWS Edge Services domain, and loves to work with customers to solve their technical challenges.

Manoj Gupta


Manoj is a Senior Solutions Architect at AWS. He’s passionate about building well-architected cloud-focused solutions by using AWS services with security, networking, and serverless as his primary focus areas. Before AWS, he worked in application and system architecture roles, building solutions across various industries. Outside of work, when he gets free time, he enjoys the outdoors and walking trails with his family.

Optimize AWS Config for AWS Security Hub to effectively manage your cloud security posture

Post Syndicated from Nicholas Jaeger original https://aws.amazon.com/blogs/security/optimize-aws-config-for-aws-security-hub-to-effectively-manage-your-cloud-security-posture/

AWS Security Hub is a cloud security posture management service that performs security best practice checks, aggregates security findings from Amazon Web Services (AWS) and third-party security services, and enables automated remediation. Most of the checks Security Hub performs on AWS resources happen as soon as there is a configuration change, giving you nearly immediate visibility of non-compliant resources in your environment, compared to checks that run on a periodic basis. This near real-time finding and reporting of non-compliant resources helps you to quickly respond to infrastructure misconfigurations and reduce risk. Security Hub offers these continuous security checks through its integration with the AWS Config configuration recorder.

By default, AWS Config enables recording for more than 300 resource types in your account. Today, Security Hub has controls that cover approximately 60 of those resource types. If you’re using AWS Config only for Security Hub, you can optimize the configuration of the configuration recorder to track only the resources you need, helping to reduce the costs related to monitoring those resources in AWS Config and the amount of data produced, stored, and analyzed by AWS Config. This blog post walks you through how to set up and optimize the AWS Config recorder when it is used for controls in Security Hub.

Using AWS Config and Security Hub for continuous security checks

When you enable Security Hub, you’re alerted to first enable resource recording in AWS Config, as shown in Figure 1. AWS Config continually assesses, audits, and evaluates the configurations and relationships of your resources on AWS, on premises, and in other cloud environments. Security Hub uses this capability to perform change-initiated security checks. Security Hub checks that use periodic rules don’t depend on the AWS Config recorder. You must enable AWS Config resource recording for all the accounts and in all AWS Regions where you plan to enable Security Hub standards and controls. AWS Config charges for the configuration items that are recorded, separately from Security Hub.

Figure 1: Security Hub alerts you to first enable resource recording in AWS Config


When you get started with AWS Config, you’re prompted to set up the configuration recorder, as shown in Figure 2. AWS Config uses the configuration recorder to detect changes in your resource configurations and capture these changes as configuration items. Using the AWS Config configuration recorder not only allows for continuous security checks, it also minimizes the need to query for the configurations of the individual services, saving your service API quotas for other use cases. By default, the configuration recorder records the supported resources in the Region where the recorder is running.

Note: While AWS Config supports the configuration recording of more than 300 resource types, some Regions support only a subset of those resource types. To learn more, see Supported Resource Types and Resource Coverage by Region Availability.

Figure 2: Default AWS Config settings


Optimizing AWS Config for Security Hub

Recording global resources as well as current and future resources in AWS Config is more than what is necessary to enable Security Hub controls. If you’re using the configuration recorder only for Security Hub controls, and you want to cost optimize your use of AWS Config or reduce the amount of data produced, stored, and analyzed by AWS Config, you only need to record the configurations of approximately 60 resource types, as described in AWS Config resources required to generate control findings.
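As a sketch of what a narrowed recorder looks like, the following Python builds a configuration recorder structure that records only named resource types. The three resource types are an illustrative subset, not the full list from the Security Hub documentation. The role name and helper function are hypothetical, and the dictionary shape is assumed to match the AWS Config PutConfigurationRecorder request.

```python
import json

# Illustrative subset only; consult the Security Hub documentation for the
# complete list of resource types required to generate control findings.
SECURITY_HUB_RESOURCE_TYPES = [
    "AWS::S3::Bucket",
    "AWS::EC2::SecurityGroup",
    "AWS::IAM::User",
]


def build_recorder(role_arn, resource_types, name="default"):
    """Return a recorder definition that records only the given resource types."""
    return {
        "name": name,
        "roleARN": role_arn,
        "recordingGroup": {
            # Turn off "record everything" so only the listed types are recorded.
            "allSupported": False,
            "includeGlobalResourceTypes": False,
            "resourceTypes": sorted(set(resource_types)),
        },
    }


# Hypothetical role ARN for illustration.
recorder = build_recorder(
    "arn:aws:iam::123456789012:role/ConfigRecorderRole",
    SECURITY_HUB_RESOURCE_TYPES,
)
print(json.dumps(recorder, indent=2))
```

A structure like this could be passed to an SDK call or translated into infrastructure-as-code. If you need additional resource types for other use cases, you would append them to the list before building the recorder.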

Set up AWS Config, optimized for Security Hub

We’ve created an AWS CloudFormation template that you can use to set up AWS Config to record only what’s needed for Security Hub. You can download the template from GitHub.

This template can be used in any Region that supports AWS Config (see AWS Services by Region). Although resource coverage varies by Region (Resource Coverage by Region Availability), you can still use this template in every Region. If a resource type is supported by AWS Config in at least one Region, you can enable the recording of that resource type in all Regions supported by AWS Config. For the Regions that don’t support the specified resource type, the recorder will be enabled but will not record any configuration items until AWS Config supports the resource type in the Region.

Security Hub regularly releases new controls that might rely on recording additional resource types in AWS Config. When you use this template, you can subscribe to Security Hub announcements with Amazon Simple Notification Service (SNS) to get information about newly released controls that might require you to update the resource types recorded by AWS Config (and listed in the CloudFormation template). The CloudFormation template receives periodic updates in GitHub, but you should validate that it’s up to date before using it. You can also use AWS CloudFormation StackSets to deploy, update, or delete the template across multiple accounts and Regions with a single operation. If you don’t enable the recording of all resources in AWS Config, the Security Hub control Config.1 (AWS Config should be enabled) will fail. If you take this approach, you can disable the Config.1 control or suppress its findings by using the automation rules feature in Security Hub.

Customizing for your use cases

You can modify the CloudFormation template depending on your use cases for AWS Config and Security Hub. If your use case for AWS Config extends beyond Security Hub controls, consider which additional resource types you need to record. For example, AWS Firewall Manager, AWS Backup, AWS Control Tower, AWS Marketplace, and AWS Trusted Advisor require AWS Config recording. Additionally, if you use other features of AWS Config, such as custom rules that depend on recording specific resource types, you can add these resource types to the CloudFormation template. You can see the results of AWS Config rule evaluations as findings in Security Hub.

Another customization example is related to the AWS Config configuration timeline. By default, resources evaluated by Security Hub controls include links to the associated AWS Config rule and configuration timeline in AWS Config for that resource, as shown in Figure 3.

Figure 3: Link from Security Hub control to the configuration timeline for the resource in AWS Config


The AWS Config configuration timeline, as illustrated in Figure 4, shows you the history of compliance changes for the resource, but it requires the AWS::Config::ResourceCompliance resource type to be recorded. If you need to track changes in compliance for resources and use the configuration timeline in AWS Config, you must add the AWS::Config::ResourceCompliance resource type to the CloudFormation template provided in the preceding section. In this case, Security Hub may change the compliance of the Security Hub managed AWS Config rules, which are recorded as configuration items for the AWS::Config::ResourceCompliance resource type, incurring additional AWS Config recorder charges.

Figure 4: Config resource timeline


Summary

You can use the CloudFormation template provided in this post to optimize the AWS Config configuration recorder for Security Hub to reduce your AWS Config costs and to reduce the amount of data produced, stored, and analyzed by AWS Config. Alternatively, you can run AWS Config with the default settings or use the AWS Config console or scripts to further customize your configuration to fit your use case. Visit Getting started with AWS Security Hub to learn more about managing your security alerts.

 

Nicholas Jaeger


Nicholas is a Senior Security Specialist Solutions Architect at AWS. His background includes software engineering, teaching, solutions architecture, and AWS security. Today, he focuses on helping as many customers operate as securely as possible on AWS. Nicholas also hosts AWS Security Activation Days to provide customers with prescriptive guidance while using AWS security services to increase visibility and reduce risk.

Dora Karali


Dora is a Senior Manager of Product Management at AWS External Security Services. She is currently responsible for Security Hub and has previously worked on GuardDuty, too. Dora has more than 15 years of experience in cybersecurity. She has defined strategy for and created, managed, positioned, and sold cybersecurity cloud and on-premises products and services in multiple enterprise and consumer markets.

Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/build-a-serverless-log-analytics-pipeline-using-amazon-opensearch-ingestion-with-managed-amazon-opensearch-service/

In this post, we show how to build a log ingestion pipeline using the new Amazon OpenSearch Ingestion, a fully managed data collector that delivers real-time log and trace data to Amazon OpenSearch Service domains. OpenSearch Ingestion is powered by the open-source data collector Data Prepper. Data Prepper is part of the open-source OpenSearch project. With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.

For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open-source project, visit Data Prepper.

In this post, we explore the logging infrastructure for a fictitious company, AnyCompany. We explore the components of the end-to-end solution and then show how to configure OpenSearch Ingestion’s main parameters and how the logs come in and out of OpenSearch Ingestion.

Solution overview

Consider a scenario in which AnyCompany collects Apache web logs. They use OpenSearch Service to monitor web access and identify possible root causes of errors with 4xx and 5xx status codes. The following architecture diagram outlines the role of each component in the log analytics pipeline: Fluent Bit collects and forwards logs; OpenSearch Ingestion processes, routes, and ingests logs; and OpenSearch Service analyzes the logs.

The workflow contains the following stages:

  1. Generate and collect – Fluent Bit collects the generated logs and forwards them to OpenSearch Ingestion. In this post, you create fake logs that Fluent Bit forwards to OpenSearch Ingestion. Check the list of supported clients to review the required configuration for each client supported by OpenSearch Ingestion.
  2. Process and ingest – OpenSearch Ingestion filters the logs based on response value, processes the logs using a grok processor, and applies conditional routing to ingest the error logs to an OpenSearch Service index.
  3. Store and analyze – We can analyze the Apache httpd error logs using OpenSearch Dashboards.

Prerequisites

To implement this solution, make sure you have the following prerequisites:

Configure OpenSearch Ingestion

First, you define the appropriate AWS Identity and Access Management (IAM) permissions to write to and from OpenSearch Ingestion. Then you set up the pipeline configuration in OpenSearch Ingestion. Let’s explore each step in more detail.

Configure IAM permissions

OpenSearch Ingestion works with IAM to secure communications into and out of OpenSearch Ingestion. You need two roles, authenticated using AWS Signature V4 (SigV4) signed requests. The originating entity requires permissions to write to OpenSearch Ingestion. OpenSearch Ingestion requires permissions to write to your OpenSearch Service domain. Finally, you must create an access policy using OpenSearch Service’s fine-grained access control, which allows OpenSearch Ingestion to create indexes and write to them in your domain.
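For readers unfamiliar with SigV4, the following Python sketch shows only the signing-key derivation step, using the service name osis that this post configures later in Fluent Bit. In practice Fluent Bit or an AWS SDK performs the whole signing process for you. The function names and the secret, date, and Region values are placeholders, and the canonical request and final signature steps are intentionally left out.

```python
import hashlib
import hmac


def _hmac_sha256(key, message):
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key, date_stamp, region, service="osis"):
    """Derive the SigV4 signing key through the chained HMAC steps."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


# Placeholder inputs for illustration; never hard-code real credentials.
signing_key = sigv4_signing_key("EXAMPLE-SECRET-KEY", "20230801", "us-east-1")
print(len(signing_key), "byte signing key derived for service 'osis'")
```

Because the service name is part of the derivation, a key derived for a different service cannot be used to sign requests to OpenSearch Ingestion.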

The following diagram illustrates the IAM permissions to allow OpenSearch Ingestion to write to an OpenSearch Service domain. Refer to Setting up roles and users in Amazon OpenSearch Ingestion to get more details on roles and permissions required to use OpenSearch Ingestion.

In the demo, you use the AWS Cloud9 EC2 instance profile’s credentials to sign requests sent to OpenSearch Ingestion. You use Fluent Bit to fetch the credentials and assume the role you pass in the aws_role_arn you configure later.

  1. Create an ingestion role (called IngestionRole) to allow Fluent Bit to ingest the logs into your pipeline.

Fluent Bit attempts to fetch credentials from several sources in a fixed order. In the access policy for this role, grant permission for the osis:Ingest action. Create a trust relationship that allows Fluent Bit to assume the ingestion role, as shown in the following code.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "{your-account-id}"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
  2. Create a pipeline role (called PipelineRole) with a trust relationship for OpenSearch Ingestion to assume that role. The domain-level access policy of the OpenSearch domain grants the pipeline role access to the domain.
  3. Finally, configure your domain’s security plugin to enable OpenSearch Ingestion’s assumed role to create indexes and write data to the domain.

In this demo, the OpenSearch Service domain uses fine-grained access control for authentication, so you need to map the OpenSearch Ingestion pipeline role to the OpenSearch backend role all_access. For instructions, refer to Step 2: Include the pipeline role in the domain access policy page.
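The two role documents described in the steps above can be sketched in Python as follows. The helper names and the pipeline name are hypothetical, and the pipeline ARN layout is an assumption to confirm in the OpenSearch Ingestion documentation. The service principal osis-pipelines.amazonaws.com is the one named in the pipeline configuration later in this post.

```python
import json


def ingestion_permissions(region, account_id, pipeline_name):
    """Permissions policy for IngestionRole: allow writing to one pipeline."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "osis:Ingest",
                # Assumed ARN layout for an OpenSearch Ingestion pipeline.
                "Resource": f"arn:aws:osis:{region}:{account_id}:pipeline/{pipeline_name}",
            }
        ],
    }


def pipeline_trust_policy():
    """Trust policy for PipelineRole: let OpenSearch Ingestion assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "osis-pipelines.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder identifiers for illustration.
print(json.dumps(ingestion_permissions("us-east-1", "123456789012", "log-pipeline"), indent=2))
print(json.dumps(pipeline_trust_policy(), indent=2))
```

Keeping the ingest permission scoped to a single pipeline ARN limits what a compromised log shipper could write to.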

Create the pipeline in OpenSearch Ingestion

To create an OpenSearch Ingestion pipeline, complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines in the navigation pane.
  2. Choose Create pipeline.
  3. For Pipeline name, enter a name.

  4. Input the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs). In this example, we use the default pipeline capacity settings of minimum 1 Ingestion OCU and maximum 4 Ingestion OCUs.

Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.

  5. In the Pipeline configuration section, configure Data Prepper to process your data by choosing the appropriate blueprint configuration template on the Configuration blueprints menu. For this post, we choose AWS-LogAggregationWithConditionalRouting.

The OpenSearch Ingestion pipeline configuration consists of four sections:

  • Source – This is the input component of a pipeline. It defines the mechanism through which a pipeline consumes records. In this post, you use the http_source plugin and provide the Fluent Bit output URI value within the path attribute.
  • Processors – This represents intermediate processing to filter, transform, and enrich your input data. Refer to Supported plugins for more details on the list of operations supported in OpenSearch Ingestion. In this post, we use the grok processor with a common Apache log pattern (COMMONAPACHELOG_DATATYPED in the configuration that follows), which matches input logs against the common Apache log format and makes the fields easy to query in OpenSearch Service.
  • Sink – This is the output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. In this post, you define an OpenSearch Service domain and index as sink.
  • Route – This is the part of a processor that allows the pipeline to route the data into different sinks based on specific conditions. In this example, you create four routes based on the response field value of the log. If the response field value of the log line matches 2xx or 3xx, the log is sent to the OpenSearch Service index aggregated_2xx_3xx. If the response field value matches 4xx, the log is sent to the index aggregated_4xx. If the response field value matches 5xx, the log is sent to the index aggregated_5xx.
  6. Update the blueprint based on your use case. The following code shows an example of the pipeline configuration YAML file:
version: "2"
log-aggregate-pipeline:
  source:
    http:
      # Provide the FluentBit output URI value.
      path: "/log/ingest"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
    - grok:
        match:
          log: [ "%{COMMONAPACHELOG_DATATYPED}" ]
  route:
    - 2xx_status: "/response >= 200 and /response < 300"
    - 3xx_status: "/response >= 300 and /response < 400"
    - 4xx_status: "/response >= 400 and /response < 500"
    - 5xx_status: "/response >= 500 and /response < 600"
  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}" ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_2xx_3xx"
        routes:
          - 2xx_status
          - 3xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_4xx"
        routes:
          - 4xx_status
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "{your-domain-endpoint}"  ]
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "arn:aws:iam::{your-account-id}:role/PipelineRole"
          # Provide the region of the domain.
          region: "{AWS_Region}"
        index: "aggregated_5xx"
        routes:
          - 5xx_status

Provide the relevant values for your domain endpoint, account ID, and Region related to your configuration.

  7. Check the health of your configuration setup by choosing Validate pipeline when you finish the update.

When designing a production workload, deploy your pipeline within a VPC. For instructions, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.

  8. For this post, select Public access under Network.

  9. In the Log publishing options section, select Publish to CloudWatch logs and Create new group.

OpenSearch Ingestion uses the log levels of INFO, WARN, ERROR, and FATAL. Enabling log publishing helps you monitor your pipelines in production.

  10. Choose Next and Create pipeline.
  11. Select the pipeline and choose View details to see the progress of the pipeline creation.

Wait until the status changes to Active to start using the pipeline.
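To see what the grok and route sections of the pipeline configuration do to a single log line, consider the following Python approximation. The regular expression is a simplified stand-in for the grok pattern, not its actual definition, and the function and constant names are hypothetical. The route conditions and index names are taken from the YAML shown earlier.

```python
import re

# Simplified stand-in for a common Apache log pattern (not the real grok definition).
COMMON_LOG = re.compile(
    r'^(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)$'
)

# Route names and conditions mirror the route section of the pipeline YAML.
ROUTES = {
    "2xx_status": range(200, 300),
    "3xx_status": range(300, 400),
    "4xx_status": range(400, 500),
    "5xx_status": range(500, 600),
}

# Each sink lists the routes it accepts, as in the sink section of the YAML.
SINKS = {
    "aggregated_2xx_3xx": {"2xx_status", "3xx_status"},
    "aggregated_4xx": {"4xx_status"},
    "aggregated_5xx": {"5xx_status"},
}


def target_index(log_line):
    """Return the index a log line would be routed to, or None if nothing matches."""
    match = COMMON_LOG.match(log_line)
    if match is None:
        return None
    response = int(match.group("response"))
    for route_name, codes in ROUTES.items():
        if response in codes:
            for index, accepted in SINKS.items():
                if route_name in accepted:
                    return index
    return None


sample = '192.0.2.10 - alice [10/Oct/2023:13:55:36 +0000] "GET /missing.html HTTP/1.1" 404 512'
print(target_index(sample))
```

A line that does not parse, or whose status code falls outside 200 to 599, matches no route in this model and returns None.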

Send logs to the OpenSearch Ingestion pipeline

To start sending logs to the OpenSearch Ingestion pipeline, complete the following steps:

  1. On the AWS Cloud9 console, create a Fluent Bit configuration file and update the following attributes:
    • Host – Enter the ingestion URL of your OpenSearch Ingestion pipeline.
    • aws_service – Enter osis.
    • aws_role_arn – Enter the ARN of the IAM role IngestionRole.

The following code shows an example of the fluent-bit.conf file:

[SERVICE]
    parsers_file          ./parsers.conf
    
[INPUT]
    name                  tail
    refresh_interval      5
    path                  /var/log/*.log
    read_from_head        true
[FILTER]
    Name parser
    Key_Name log
    Parser apache
[OUTPUT]
    Name http
    Match *
    Host {Ingestion URL}
    Port 443
    URI /log/ingest
    format json
    aws_auth true
    aws_region {AWS_region}
    aws_role_arn arn:aws:iam::{your-account-id}:role/IngestionRole
    aws_service osis
    Log_Level trace
    tls On
  2. In the AWS Cloud9 environment, create a docker-compose YAML file to deploy Fluent Bit and Flog containers:
version: '3'
services:
  fluent-bit:
    container_name: fluent-bit
    image: docker.io/amazon/aws-for-fluent-bit
    volumes:
      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf
      - ./apache-logs:/var/log
  flog:
    container_name: flog
    image: mingrammer/flog
    command: flog -t log -f apache_common -o web/log/test.log -w -n 100000 -d 1ms -p 1000
    volumes:
      - ./apache-logs:/web/log

Before you start the Docker containers, you need to update the IAM EC2 instance role in AWS Cloud9 so it can sign the requests sent to OpenSearch Ingestion.

  3. For demo purposes, create an IAM service-linked role and choose EC2 under Use case to allow the AWS Cloud9 EC2 instance to call OpenSearch Ingestion on your behalf.
  4. Add the OpenSearch Ingestion policy, which is the same policy you used with IngestionRole.
  5. Add the AdministratorAccess permission policy to the role as well.

Your role definition should look like the following screenshot.

  6. After you create the role, go back to AWS Cloud9, select your demo environment, and choose View details.
  7. On the EC2 instance tab, choose Manage EC2 instance to view the details of the EC2 instance attached to your AWS Cloud9 environment.

  8. On the Amazon EC2 console, replace the IAM role of your AWS Cloud9 EC2 instance with the new role.
  9. Open a terminal in AWS Cloud9 and run the command docker-compose up.

Check the output in the terminal—if everything is working correctly, you get status 200.

Fluent Bit collects logs from the /var/log directory in the container and pushes the data to the OpenSearch Ingestion pipeline.

  10. Open OpenSearch Dashboards, navigate to Dev Tools, and run the command GET _cat/indices to validate that the data has been delivered by OpenSearch Ingestion to your OpenSearch Service domain.

You should see the three indexes created: aggregated_2xx_3xx, aggregated_4xx, and aggregated_5xx.

Now you can focus on analyzing your log data and reinvent your business without having to worry about any operational overhead regarding your ingestion pipeline.

Best practices for monitoring

You can monitor the Amazon CloudWatch metrics made available to you to maintain the right performance and availability of your pipeline. Check the list of available pipeline metrics related to the source, buffer, processor, and sink plugins.

Navigate to the Metrics tab for your specific OpenSearch Ingestion pipeline to explore the graphs available to each metric, as shown in the following screenshot.

In your production workloads, make sure to configure the following CloudWatch alarms to notify you when the pipeline metrics breach a specific threshold so you can promptly remediate each issue.

Managing cost

OpenSearch Ingestion automatically provisions and scales the OCUs for your spiky workloads, and you pay only for the compute resources actively used by your pipeline to ingest, process, and route data. Setting a maximum capacity of Ingestion OCUs therefore lets you handle your peak workload demand while controlling cost.

For production workloads, make sure to configure a minimum of 2 Ingestion OCUs to ensure 99.9% availability for the ingestion pipeline. Check the sizing recommendations and learn how OpenSearch Ingestion responds to workload spikes.
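As a back-of-the-envelope aid for choosing a maximum capacity, the following Python sketch applies the figures quoted in this post: roughly 8 GiB per hour per OCU, at most 96 OCUs, and at least 2 OCUs for production. The function name is hypothetical and the per-OCU throughput is only an estimate, so treat the result as a starting point to validate against the sizing recommendations and your own metrics.

```python
import math

GIB_PER_HOUR_PER_OCU = 8   # estimate quoted in this post
MAX_OCUS = 96              # upper limit quoted in this post
PRODUCTION_MIN_OCUS = 2    # minimum recommended here for production workloads


def estimate_max_ocus(peak_gib_per_hour, production=True):
    """Estimate the maximum Ingestion OCUs to configure for a given peak rate."""
    floor = PRODUCTION_MIN_OCUS if production else 1
    needed = math.ceil(peak_gib_per_hour / GIB_PER_HOUR_PER_OCU)
    if needed > MAX_OCUS:
        raise ValueError("peak rate exceeds what a single pipeline is estimated to handle")
    # Never go below the floor, even for very small workloads.
    return max(floor, needed)


print(estimate_max_ocus(30))                    # ceil(30 / 8) = 4
print(estimate_max_ocus(5))                     # raised to the production floor of 2
print(estimate_max_ocus(5, production=False))   # 1
```

Under these assumptions, a peak of 30 GiB per hour fits within the default maximum of 4 Ingestion OCUs used earlier in this post.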

Clean up

Make sure you clean up unwanted AWS resources created during this post in order to prevent additional billing for these resources. Follow these steps to clean up your AWS account:

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Select the environment you want to delete and choose Delete.
  3. On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
  4. Select the domain you want to delete and choose Delete.
  5. Select Pipelines under Ingestion in the navigation pane.
  6. Select the pipeline you want to delete and on the Actions menu, choose Delete.

Conclusion

In this post, you learned how to create a serverless ingestion pipeline to deliver Apache access logs to an OpenSearch Service domain using OpenSearch Ingestion. You learned the IAM permissions required to start using OpenSearch Ingestion and how to use a pipeline blueprint instead of creating a pipeline configuration from scratch.

You used Fluent Bit to collect and forward Apache logs, and used OpenSearch Ingestion to process and conditionally route the log data to different indexes in OpenSearch Service. For more examples about writing to OpenSearch Ingestion pipelines, refer to Sending data to Amazon OpenSearch Ingestion pipelines.

Finally, the post provided you with recommendations and best practices to deploy OpenSearch Ingestion pipelines in a production environment while controlling cost.

Follow this post to build your serverless log analytics pipeline, and refer to Top strategies for high volume tracing with Amazon OpenSearch Ingestion to learn more about high volume tracing with OpenSearch Ingestion.


About the authors

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Francisco Losada is an Analytics Specialist Solutions Architect based out of Madrid, Spain. He works with customers across EMEA to architect, implement, and evolve analytics solutions at AWS. He advocates for OpenSearch, the open-source search and analytics suite, and supports the community by sharing code samples, writing content, and speaking at conferences. In his spare time, Francisco enjoys playing tennis and running.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Build and share a business capability model with Amazon QuickSight

Post Syndicated from Abdul Qadir original https://aws.amazon.com/blogs/big-data/build-and-share-a-business-capability-model-with-amazon-quicksight/

The technology landscape has been evolving rapidly, with waves of change impacting IT from every angle. It is causing a ripple effect across IT organizations and shifting the way IT delivers applications and services.

The change factors impacting IT organizations include:

  • The shift from a traditional application model to a services-based application model (SaaS, PaaS)
  • The shift from a traditional infrastructure and hardware costing model to cloud-based containers (private and public clouds) with metered usage for resources (IaaS)
  • The shift from the lengthy traditional development and delivery cycles to continuous development and integration (DevOps)
  • The shift in application architecture from N-Tier to loosely coupled services

The portfolio of services delivering business capabilities are the new assets of IT organizations that need to be cataloged in a repository. The system must follow a well-defined business taxonomy that enhances discovery, analysis, and reuse by potential consumers, and avoids building redundant services. The traditional portfolio management tools within the organization need to be augmented with additional components that can manage the complexity of the services ecosystem.

This post provides a simple and quick way of building an extendable analytical system using Amazon QuickSight to better manage lines of business (LOBs) with a detailed list of business capabilities and APIs, deep analytical insights, and desired graphical visualizations from different dimensions. In addition, this tool enhances the discovery and reuse of existing business capabilities, avoids duplication of services, and shortens time-to-market.

Use case overview

Bob is a Senior Enterprise Architect. He recently joined a Tier 1 bank. His first assignment is to assess the bank’s capabilities to offer new financial products to its high-value retail clients. The only material given to Bob was a set of PowerPoint slides and the names of the department heads to contact for more information. The PowerPoint presentation provided high-level information, but it didn’t give any insight into how capable each department is of providing the required data through APIs for the new products. To collect that information, Bob gets in touch with the head of each department, who in turn refer him to their development leads, who in turn give him a set of technical documents that explain how the APIs are being used.

Relevance

Business analysts are familiar with business terminology and taxonomy, and often depend on the technology team to explain the technical assets associated with business capabilities. The business capabilities are the assets of the IT organization that need to be cataloged in a repository. The catalog must follow a well-defined business taxonomy that enhances discovery and reuse by consumers, and avoids building redundant services.

The better organized the catalog is, the higher the potential for reuse and the return on investment for the services transformation strategy. The catalog needs to be organized using some business functions taxonomy with a detailed list of capabilities and sub-capabilities. The following diagram illustrates an example of services information and interdependencies.

Example of services information and interdependencies

Defining and capturing a business capability model

If an enterprise doesn’t have a system to capture the business capability model, consider defining and finding a way to capture the model for better insight and visibility, and then map it with digital assets like APIs. The model should show each LOB its categories and capabilities. The following table includes some sample LOBs and their associations for a business that sells services.

LOB          | Category                    | Capability
Recruitment  | Manage Applicant Experience | Manage Application Activities
Recruitment  | Process Application         | Follow-Ups
Recruitment  | Process Application         | Pursue Automated Leads
Sale Service | Engage Customer             | Provide Needs Assessment Tools
Sale Service | Engage Customer             | Provide Service Information

After the map is defined and captured, each business capability can be mapped to APIs that are implemented for it. Each business capability then has visibility into all the associated digital assets and mapped metadata of the services, such as consumers of the API.

To capture the model, you can define a simple table to capture the information, and then you can perform further analysis on it with an analytical tool such as QuickSight.

In the following sample data model, each business LOB has several business categories and capabilities, and each capability can be mapped to multiple APIs. Also note that there’s not always a 1:1 mapping between a business capability, an API, and a service.

  • Business LOB – Recruitment, Sale Service
  • Business category – Process Application, Engage Customer
  • Business capabilities – Complete an Application, Follow-Ups
  • Digital assets – Recruitment API, Sale Service API

There are sets of other standard information that you can include in a data model, such as API consumers.

The following example shows a table structure to capture this information.

LOB table structure
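
As a rough illustration of the data model described above, the following Python sketch groups flat rows into an LOB, category, capability, and digital asset hierarchy. The class name, field names, and the specific row pairings are assumptions made for this example; they are not taken from the table structure in the post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityRecord:
    """One flat row: a capability mapped to one digital asset (hypothetical shape)."""
    lob: str
    category: str
    capability: str
    digital_asset: str


def build_hierarchy(records):
    """Group rows into LOB -> category -> capability -> sorted list of assets."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for r in records:
        tree[r.lob][r.category][r.capability].add(r.digital_asset)
    # Convert the nested defaultdicts and sets into plain dicts and lists.
    return {
        lob: {
            cat: {cap: sorted(assets) for cap, assets in caps.items()}
            for cat, caps in cats.items()
        }
        for lob, cats in tree.items()
    }


# Illustrative rows only; which API backs which capability is an assumption.
SAMPLE_RECORDS = [
    CapabilityRecord("Recruitment", "Process Application", "Follow-Ups", "Recruitment API"),
    CapabilityRecord("Recruitment", "Process Application", "Complete an Application", "Recruitment API"),
    CapabilityRecord("Sale Service", "Engage Customer", "Provide Service Information", "Sale Service API"),
]

if __name__ == "__main__":
    import json
    print(json.dumps(build_hierarchy(SAMPLE_RECORDS), indent=2))
```

Keeping the stored rows flat and deriving the hierarchy on demand mirrors the single-table approach described here, where one capability can appear in several rows when it maps to several APIs.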

The following figure visualizes the business capabilities and associated APIs.

Visualization of business capabilities and associated APIs

The remainder of the post highlights the key components to build the full solution end to end. The UI captures the business capabilities and associated APIs, and publishes the service information through a DevOps process. The solution also includes storage and a reporting tool that complement the applications portfolio management capability in place and expand its capabilities with the services portfolio.

Aligning APIs to a business capability model

To align APIs to a business capability model, you can follow these steps:

  1. Understand the business capabilities – Identify the key business capabilities of your organization and understand how they support the overall business strategy.
  2. Map the APIs to the capabilities – Review the existing APIs and map them to the corresponding business capabilities. This will help identify any gaps in the capabilities that can be addressed through new or updated APIs.
  3. Prioritize the APIs – Prioritize the development of new or updated APIs based on their importance to the business capabilities. This will ensure that the most critical capabilities are supported by the APIs.
  4. Implement governance – Implement a governance process to ensure that the APIs are aligned with the business capabilities and are used correctly. This can include setting standards for how the APIs are designed, developed, and deployed.
  5. Monitor and measure – Monitor the usage and performance of the APIs to measure their impact on the business capabilities. Use this information to make decisions about changes to the APIs over time.
  6. Regularly review and update – Review and update the mapping of the APIs to the business capabilities on a regular basis to ensure they remain aligned with the organization’s goals and objectives.
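
The mapping step above can be sketched in a few lines of Python. This is a minimal illustration, assuming the mapping is available as (capability, API) pairs; the function names and the sample pairs are hypothetical.

```python
def find_gaps(capabilities, api_mappings):
    """Return capabilities that have no API mapped to them."""
    mapped = {capability for capability, _api in api_mappings}
    return sorted(set(capabilities) - mapped)


def find_shared_apis(api_mappings):
    """Return APIs that serve more than one capability (candidates for reuse)."""
    capabilities_by_api = {}
    for capability, api in api_mappings:
        capabilities_by_api.setdefault(api, set()).add(capability)
    return {
        api: sorted(caps)
        for api, caps in capabilities_by_api.items()
        if len(caps) > 1
    }


# Illustrative data; the pairings are assumptions for this example.
CAPABILITIES = ["Follow-Ups", "Pursue Automated Leads", "Provide Service Information"]
API_MAPPINGS = [
    ("Follow-Ups", "Recruitment API"),
    ("Provide Service Information", "Sale Service API"),
    ("Follow-Ups", "Sale Service API"),
]

if __name__ == "__main__":
    print("Capabilities without an API:", find_gaps(CAPABILITIES, API_MAPPINGS))
    print("APIs shared across capabilities:", find_shared_apis(API_MAPPINGS))
```

Gaps point at capabilities that may need a new API, while shared APIs show where an existing service is already reused.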

Maintenance and evolution of a business capability model

Building a business capability model is not a one-time exercise. It keeps evolving with business requirements and usage. Data management best practices should be followed as per your company’s guidelines to have consistent data end to end.

Solution overview

In this section, we introduce the ability to capture the business capabilities and associated APIs and make them available using the QuickSight business intelligence (BI) tool, and highlight its features.

The following approach provides the ability to manage business capability models and to link business capabilities with enterprise digital assets, including services, APIs, and IT systems. This solution enables IT and business teams to further drill down into the model to see what has been implemented. These details help architects and analysts assess which services can be combined to provide new offerings and shorten time-to-market, enable reusability by consumers, and avoid building redundant services.

The following key components are required:

Organizations can use their existing UI framework (if available) to capture the information, or they can use one of the open-source services available in the market. Depending on the selection and capability of the open-source product, a user interface can be generated and customized.

Let’s look at each service in our solution in more detail:

  • Amplify – Amplify is a set of tools and services that can be used together or on their own, to help front-end web and mobile developers build scalable full stack applications, powered by AWS. With Amplify, you can configure app backends and connect your app in minutes, deploy static web apps in a few clicks, and easily manage app content outside the AWS Management Console. Amplify supports popular web frameworks including JavaScript, React, Angular, Vue, and Next.js, and mobile platforms including Android, iOS, React Native, Ionic, and Flutter. Get to market faster with AWS Amplify.
  • AppSync – AWS AppSync simplifies application development by creating a universal API for securely accessing, modifying, and combining data from multiple sources. AWS AppSync is a managed service that uses GraphQL so that applications can easily get only the data they need.
  • Athena – Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. In this solution, we use Athena as a data source for QuickSight.
  • Amazon Cognito – Amazon Cognito delivers frictionless customer identity and access management (CIAM) with a cost-effective and customizable platform. It easily connects the web application to the backend resources and web services.
  • DynamoDB – DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools.
  • QuickSight – QuickSight is a serverless, cloud-based BI and reporting service that brings data insights to your teams and end-users through machine learning (ML)-powered dashboards and data visualizations, which can be accessed via QuickSight or embedded in apps and portals that your users access.

The following diagram illustrates the solution architecture.

Business capabilities insights solution architecture

In the following sections, we walk through the implementation and end-to-end integration steps.

Build a serverless web application with Amplify

The open-source Amplify provides a CLI, libraries, UI components and Amplify hosting to build full stack iOS, Android, Flutter, Web, and React Native apps. For instructions on building a serverless web application, refer to the following tutorial. For this post, we created the following GraphQL schema with amplify add api:

type BusinessCapability @model {
  company_id: ID!
  company_name: String!
  company_desc: String!
  lob_name: String!
  category: String!
  capability: String!
  digital_asset_type: String!
  digital_asset_name: String!
  digital_asset_info: String!
}

After we use Amplify to deploy the API in the cloud, a corresponding AppSync API and a DynamoDB table are created automatically.

You can use the Amplify UI library to generate a business capability intake form and bind the fields to your front-end code.

Amplify studio generated form

You can add authentication to your application using Amazon Cognito by running amplify add auth.

With that, you are now hosting a serverless web application for your business capabilities securely and at scale.

Set up Athena and the Athena DynamoDB data connector

The DynamoDB table generated by Amplify stores all the business capabilities. You can set up Athena and the Athena DynamoDB data connector so that you can query your tables with SQL. For more information, refer to Amazon Athena DynamoDB connector.
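
To give a feel for the kind of SQL you might run once the connector is in place, the following sketch counts the distinct APIs mapped to each capability. It uses Python's built-in sqlite3 module as a local stand-in for Athena so the query can be tried without any AWS resources. The table name business_capability, the 'API' value for digital_asset_type, and the sample rows are assumptions; the real catalog, database, and table names depend on your connector setup.

```python
import sqlite3

# Hypothetical table name; in Athena it comes from the connector's catalog.
TABLE = "business_capability"

# Count the distinct APIs mapped to each capability, per LOB.
QUERY = f"""
SELECT lob_name, capability, COUNT(DISTINCT digital_asset_name) AS api_count
FROM {TABLE}
WHERE digital_asset_type = 'API'
GROUP BY lob_name, capability
ORDER BY lob_name, capability
"""

# Illustrative rows; asset names other than the two APIs named in the post are made up.
SAMPLE_ROWS = [
    ("Recruitment", "Follow-Ups", "API", "Recruitment API"),
    ("Recruitment", "Follow-Ups", "API", "Notification API"),
    ("Sale Service", "Provide Service Information", "API", "Sale Service API"),
    ("Sale Service", "Provide Service Information", "IT system", "CRM"),
]


def run_demo():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} ("
        "lob_name TEXT, capability TEXT, "
        "digital_asset_type TEXT, digital_asset_name TEXT)"
    )
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for lob, capability, api_count in run_demo():
        print(lob, capability, api_count)
```

The column names follow the GraphQL schema shown earlier, so a similar query could serve as the basis for a QuickSight dataset.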

Enable QuickSight

Enable QuickSight in your AWS account and create the datasets. The source dataset is the Athena database and table that you created earlier. To connect, you need to allow access to query Athena and Amazon S3 via the admin user interface in QuickSight. Refer to accessing AWS resources for access requirements.

Sample reports

When all the components are up and running, you can design analyses and generate reports. For more information about gathering insights from the captured data, refer to Tutorial: Create an Amazon QuickSight analysis. You can export reports in PDF, and share analyses and reports with other users. The following screenshots are reports that reflect the relationship among LOBs, business capabilities, and APIs.

The first screenshot visualizes the capabilities and associated APIs. This enables the user to identify a set of APIs, and use the same API in new similar business functions.

Business Capability Visualization 1

The following screenshot visualizes LOBs, category, and capabilities. This enables the user to easily gain insights on these relationships.

Business Capabilities Visualization 2

Best practices

The following are some best practices for business capability modeling:

  • Define clear and measurable capabilities – Each capability should be defined in a way that is clear and measurable, so that it can be tracked and improved over time.
  • Involve key stakeholders – Involve key stakeholders in the modeling process to ensure that the capabilities accurately reflect the needs of the organization.
  • Use a consistent framework – Use a consistent framework to ensure that capabilities are defined and organized in a way that makes sense for the organization.
  • Regularly review and update – Review and update the capabilities regularly to ensure they remain relevant and aligned with the organization’s goals and objectives.
  • Use visual representations – Use visual representations, like diagrams or models, to help stakeholders understand and communicate the capabilities.
  • Implement a governance process – Implement a governance process to ensure that the capabilities are being used correctly and to make decisions about changes to the capabilities over time.

Conclusion

In this post, you learned how to build a system to manage a business capability model, and discover and visualize the results in QuickSight.

We hope that companies can use this solution to manage their enterprise capability model and enable users to explore business functions available for them to use within the organization. Business users and technical architects can now easily discover business capabilities and APIs, helping accelerate the creation and orchestration of new features. With the QuickSight web interface, you can filter through thousands of business capabilities, analyze the data for your business needs, and understand the technical requirements and how to combine existing technical capabilities into a new business capability.

Furthermore, you can use your data source to gain further insights from your data by setting up ML Insights in QuickSight and creating graphical representations of your data using QuickSight visuals.

To learn more about how you can create, schedule, and share reports and data exports, see Amazon QuickSight Paginated Reports.


About the authors

Abdul Qadir is an AWS Solutions Architect based in New Jersey. He works with independent software vendors in the Northeast and provides customer guidance to build well-architected solutions on the AWS cloud platform.

Sharon Li is a solutions architect at AWS, based in the Boston, MA area. She works with enterprise customers, helping them solve difficult problems and build on AWS. Outside of work, she likes to spend time with her family and explore local restaurants.

Mixing AWS Graviton with x86 CPUs to optimize cost and resiliency using Amazon EKS

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/mixing-aws-graviton-with-x86-cpus-to-optimize-cost-and-resilience-using-amazon-eks/

This post is written by Yahav Biran, Principal SA, and Yuval Dovrat, Israel Head Compute SA.

This post shows you how to integrate AWS Graviton-based Amazon EC2 instances into an existing Amazon Elastic Kubernetes Service (Amazon EKS) environment running on x86-based Amazon EC2 instances. Customers use mixed-CPU architectures to enable their application to utilize a wide selection of Amazon EC2 instance types and improve overall application resilience. In order to successfully run a mixed-CPU application, it is strongly recommended that you test application performance in a test environment before running production applications on Graviton-based instances. You can follow AWS’ transition guide to learn more about porting your application to AWS Graviton.

This example shows how you can use KEDA for controlling application capacity across CPU types in EKS. KEDA will trigger a deployment based on the application’s response latency as measured by the Application Load Balancer (ALB). To simplify resource provisioning, Karpenter, an open-source Kubernetes node provisioning software, and AWS Load Balancer Controller, are shown as well.

Solution Overview

There are two solutions that this post covers to test a mixed-CPU application. The first configuration (shown in Figure 1 below) is the “A/B Configuration”. It uses an Application Load Balancer (ALB)-based Ingress to control traffic flowing to x86-based and Graviton-based node pools. You use this configuration to gradually migrate a live application from x86-based instances to Graviton-based instances, while validating the response time with Amazon CloudWatch.

A/B Configuration, with ALB ingress for gradual transition between CPU types

Figure 1, config 1: A/B Configuration

In the second configuration, the “Karpenter Controlled Configuration” (shown in Figure 2 below as Config 2), Karpenter automatically controls the instance blend. Karpenter is configured to use weighted provisioners with values that prioritize AWS Graviton-based Amazon EC2 instances over x86-based Amazon EC2 instances.

Karpenter Controlled Configuration, with Weighting provisioners topology

Figure 2, config 2: Karpenter Controlled Configuration, with Weighting provisioners topology

It is recommended that you start with the “A/B” configuration to measure the response time of live requests. Once your workload is validated on Graviton-based instances, you can build the second configuration to simplify the deployment configuration and increase resiliency. This enables your application to automatically utilize x86-based instances if needed, for example, during an unplanned large-scale event.

You can find the step-by-step guide on GitHub to help you to examine and try the example app deployment described in this post. The following provides an overview of the step-by-step guide.

Code Migration to AWS Graviton

The first step is migrating your code from x86-based instances to Graviton-based instances. AWS has multiple resources to help you migrate your code. These include AWS Graviton Fast Start Program, AWS Graviton Technical Guide GitHub Repository, AWS Graviton Transition Guide, and Porting Advisor for Graviton.

After making any required changes, you might need to recompile your application for the Arm64 architecture. This is necessary if your application is written in a language that compiles to machine code, such as Golang and C/C++, or if you need to rebuild native-code libraries for interpreted/JIT compiled languages such as the Python/C API or Java Native Interface (JNI).

To allow your containerized application to run on both x86 and Graviton-based nodes, you must build OCI images for both the x86 and Arm64 architectures, push them to your image repository (such as Amazon ECR), and stitch them together by creating and pushing an OCI multi-architecture manifest list. You can find an overview of these steps in this AWS blog post. You can also find the AWS Cloud Development Kit (CDK) construct on GitHub to help get you started.
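
To make the manifest list idea more concrete, the sketch below builds a minimal OCI image index as a Python dictionary and shows how a client could pick the entry that matches a node's architecture. The digests and sizes are placeholders, and the helper names are hypothetical; in practice your build tooling creates and pushes the index for you.

```python
import json


def build_image_index(manifests):
    """Build an OCI image index (multi-architecture manifest list) as a dict."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.index.v1+json",
        "manifests": [
            {
                "mediaType": "application/vnd.oci.image.manifest.v1+json",
                "digest": m["digest"],
                "size": m["size"],
                "platform": {"architecture": m["architecture"], "os": "linux"},
            }
            for m in manifests
        ],
    }


def select_manifest(index, architecture, os_name="linux"):
    """Return the digest of the image built for the given platform."""
    for entry in index["manifests"]:
        platform = entry["platform"]
        if platform["architecture"] == architecture and platform["os"] == os_name:
            return entry["digest"]
    raise LookupError(f"no image for {os_name}/{architecture}")


# Placeholder digests and sizes, not real images.
SAMPLE_MANIFESTS = [
    {"architecture": "amd64", "digest": "sha256:" + "a" * 64, "size": 1000},
    {"architecture": "arm64", "digest": "sha256:" + "b" * 64, "size": 1000},
]

if __name__ == "__main__":
    index = build_image_index(SAMPLE_MANIFESTS)
    print(json.dumps(index, indent=2))
    print("Graviton nodes pull:", select_manifest(index, "arm64"))
```

Because both architectures sit behind one image tag, the same Kubernetes Deployment manifest can run unchanged on x86-based and Graviton-based nodes.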

To simplify the process, you can use a Linux distribution package manager that supports cross-platform packages and avoid platform-specific software package names in the Linux distribution wherever possible. For example, use:

RUN yum install -y httpd

instead of:

# ARCH has to be set per platform: aarch64 or amd64
ARG ARCH=aarch64
RUN yum install -y httpd.${ARCH}

This blog post shows you how to automate multi-arch OCI image building in greater depth.

Application Deployment

Config 1 – A/B controlled topology

This topology allows you to migrate to Graviton while validating the application’s response time (approximately 300ms) on both x86 and Graviton-based instances. As shown in Figure 1, this design has a single Listener that forwards incoming requests to two Target Groups. One Target Group is associated with Graviton-based instances, while the other Target Group is associated with x86-based instances. The traffic ratio associated with each target group is defined in the Ingress configuration.

Here are the steps to create Config 1:

  1. Create two KEDA ScaledObjects that scale the number of pods based on the latency metric (AWS/ApplicationELB-TargetResponseTime) that matches the target group (triggers.metadata.dimensionValue). Declare the maximum acceptable latency in targetMetricValue:0.3.
    Below is the ScaledObject for the Graviton deployment (spec.scaleTargetRef). The comment shows the corresponding value for the x86 deployment's ScaledObject:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
…
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: armsimplemultiarchapp #amdsimplemultiarchapp
…
  triggers:                 
    - type: aws-cloudwatch
      metadata:
        namespace: "AWS/ApplicationELB"
        dimensionName: "LoadBalancer"
        dimensionValue: "app/simplemultiarchapp/xxxxxx"
        metricName: "TargetResponseTime"
        targetMetricValue: "0.3"
  2. Once the topology has been created, add Amazon CloudWatch Container Insights to measure CPU, network throughput, and instance performance.
  3. To simplify testing and control for potential performance differences in instance generations, create two dedicated Karpenter provisioners and Kubernetes Deployments (replica sets) and specify the instance generation, CPU count, and CPU architecture for each one. This example uses c7g (Graviton3) and c6i (Intel). You will remove these constraints in the next topology to allow more allocation flexibility.

The x86-based instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: x86provisioner
spec:
  requirements:
  - key: karpenter.k8s.aws/instance-generation
    operator: In
    values:
    - "6"
  - key: karpenter.k8s.aws/instance-cpu
    operator: In 
    values:
    - "2"
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64

The Graviton-based instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: arm64provisioner
spec:
  requirements:
  - key: karpenter.k8s.aws/instance-generation
    operator: In
    values:
    - "7"
  - key: karpenter.k8s.aws/instance-cpu
    operator: In
    values:
    - "2"
  - key: kubernetes.io/arch
    operator: In
    values:
    - arm64
  4. Create two Kubernetes Deployment resources, one per CPU architecture, that use nodeSelector to schedule one Deployment on Graviton-based instances and the other on x86-based instances. Similarly, create two NodePort Service resources, where each Service points to its architecture-specific ReplicaSet.
  5. Create an Application Load Balancer using the AWS Load Balancer Controller to distribute incoming requests among the different pods. Control the traffic routing in the ingress by adding an alb.ingress.kubernetes.io/actions.weighted-routing annotation. You can adjust the weight in the example below to meet your needs. This migration example started with a 100%-to-0% x86-to-Graviton ratio, adjusting over time by 10% increments until it reached a 0%-to-100% x86-to-Graviton ratio.
…
alb.ingress.kubernetes.io/actions.weighted-routing: | 
{
…
  "targetGroups":[
    {
      "serviceName":"armsimplemultiarchapp-svc",
      "servicePort":"80","weight":50
    },
    {
      "serviceName":"amdsimplemultiarchapp-svc",
      "servicePort":"80","weight":50
    }]
  }
}

spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: weighted-routing
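
The gradual shift described in the last step can be sketched as a small Python helper that produces the sequence of weight pairs and the matching targetGroups fragment for each stage. The service names come from the annotation above; the function names are hypothetical, and applying each stage to the ingress is left out.

```python
import json


def weight_schedule(step=10):
    """Yield (x86_weight, graviton_weight) pairs from 100/0 down to 0/100."""
    if step <= 0 or 100 % step != 0:
        raise ValueError("step must be a positive divisor of 100")
    for graviton_weight in range(0, 101, step):
        yield 100 - graviton_weight, graviton_weight


def target_groups(x86_weight, graviton_weight):
    """Build the targetGroups list for one stage of the migration."""
    return [
        {
            "serviceName": "armsimplemultiarchapp-svc",
            "servicePort": "80",
            "weight": graviton_weight,
        },
        {
            "serviceName": "amdsimplemultiarchapp-svc",
            "servicePort": "80",
            "weight": x86_weight,
        },
    ]


if __name__ == "__main__":
    for x86_weight, graviton_weight in weight_schedule():
        print(json.dumps(target_groups(x86_weight, graviton_weight)))
```

Generating the stages up front makes it easier to review the whole migration plan before changing any live traffic.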

You can simulate live user requests to an example application ALB endpoint. Amazon CloudWatch populates ALB Target Group request/second metrics, dimensioned by HTTP response code, to help assess the application throughput and CPU usage.

During the simulation, you will need to verify the following:

  • Pods on both Graviton-based and x86-based instances process a variable amount of traffic.
  • The application response time (p99) meets the performance requirements (300ms).

The orange (Graviton) and blue (x86) curves of HTTP 2xx responses (Figure 3) show the application throughput (HTTP requests/second) for each CPU architecture during the migration.

Gradual transition from x86 to Graviton using ALB ingress

Figure 3: HTTP 2XX per CPU architecture

Figure 4 shows an example of application response time during the transition from x86-based instances to Graviton-based instances. The latency associated with each instance family grows and shrinks as the live request simulation changes the load on the application. In this example, the latency on x86 instances (until 07:00) grew up to 300ms because most of the request load was directed to x86-based pods. It began to converge at around 08:00 when more pods were powered by Graviton-based instances. Finally, after 15:00, the request load was processed by Graviton-based instances entirely.

Two curves with different colors indicate p99 application targets response time. Graviton-based pods have a response time (between 150 and 300ms) similar to x86-based pods.

Figure 4: Target Response Time p99
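
To build intuition for how a latency target such as 300ms can translate into pod counts, the following sketch applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with changes inside a 10% tolerance ignored. This is a simplification and an assumption about this setup; the behavior KEDA actually produces depends on the metric type and the autoscaler settings in your cluster. The function name and the replica bounds are hypothetical.

```python
def desired_replicas(current_replicas, current_ms, target_ms,
                     min_replicas=1, max_replicas=10):
    """Estimate the replica count from observed and target latency (milliseconds)."""
    if current_replicas < 1 or current_ms < 0 or target_ms <= 0:
        raise ValueError("replicas must be >= 1, latencies must be positive")
    # Within 10% of the target: treat as on target and keep the current count.
    if abs(current_ms - target_ms) * 10 <= target_ms:
        return current_replicas
    # Integer ceiling of current_replicas * current_ms / target_ms.
    wanted = -((-current_replicas * current_ms) // target_ms)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Latency is 50% over the 300ms target, so the estimate grows by 50%.
    print(desired_replicas(4, 450, 300))
    # Latency is half the target, so the estimate halves.
    print(desired_replicas(6, 150, 300))
```

Working in integer milliseconds avoids floating-point rounding surprises in the ceiling calculation.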

Config 2 – Karpenter Controlled Configuration

After fully testing the application on Graviton-based EC2 instances, you are ready to simplify the deployment topology with weighted provisioners while preserving the ability to launch x86-based instances as needed.

Here are the steps to create Config 2:

  1. Reuse the CPU-based provisioners from the previous topology, but assign a higher .spec.weight to the Graviton-based instances provisioner. The x86 provisioner is still deployed in case x86-based instances are required. The karpenter.k8s.aws/instance-family requirement can be expanded beyond the values set in Config 1, or specific families can be excluded by switching the operator to NotIn.

The x86-based Amazon EC2 instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: x86provisioner
spec:
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values: [amd64]

The Graviton-based Amazon EC2 instances Karpenter provisioner:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: priority-arm64provisioner
spec:
  weight: 10
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values: [arm64]
  2. Next, merge the two Kubernetes deployments into one deployment similar to the original before migration (i.e., no specific nodeSelector that points to a CPU-specific provisioner).

The two services are also combined into a single Kubernetes service and the actions.weighted-routing annotation is removed from the ingress resources:

spec:
  rules:
    - http:
        paths:
          - path: /app
            pathType: Prefix
            backend:
              service:
                name: simplemultiarchapp-svc
  3. Unite the two KEDA ScaledObject resources from the first configuration and point them to a single deployment, e.g., simplemultiarchapp. The new KEDA ScaledObject will be:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: simplemultiarchapp-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simplemultiarchapp
…

Two curves with different colors to indicate HTTP request/sec count. The curves show Graviton (Blue) as baseline and bursting with x86 (Orange).

Figure 5: Config 2 – Weighting provisioners results

A synthetic limit on Graviton CPU capacity is set to illustrate the scaling to x86_64 CPUs (Provisioner.limits.resources.cpu). The total application throughput (Figure 5) is shown by aarch64_200 (blue) and x86_64_200 (orange). Mixing CPUs did not impact the target response time (Figure 6). Karpenter behaved as expected: prioritizing Graviton-based instances, and bursting to x86-based Amazon EC2 instances when CPU limits were crossed.

Mixing CPUs did not impact the application latency when x86 instances were added

Figure 6: Config 2 – HTTP response time p99 with mixed-CPU provisioner
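
The prioritize-then-burst behavior described above can be modeled with a short Python sketch. It is a deliberately simplified model and not Karpenter's actual scheduling logic: it tries provisioners in descending weight order and skips any whose CPU limit would be exceeded. The provisioner names and the weight of 10 come from the examples above; the CPU limit of 8, the node size, and treating an unset weight as 0 are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Provisioner:
    name: str
    weight: int = 0                  # unset weight is treated as 0 in this sketch
    cpu_limit: Optional[int] = None  # None means no CPU limit
    cpu_used: int = 0


def pick_provisioner(provisioners, cpu_request):
    """Pick the highest-weight provisioner that still has CPU headroom."""
    for p in sorted(provisioners, key=lambda p: p.weight, reverse=True):
        if p.cpu_limit is None or p.cpu_used + cpu_request <= p.cpu_limit:
            p.cpu_used += cpu_request
            return p.name
    raise RuntimeError("no provisioner has remaining CPU capacity")


def simulate(node_count, cpus_per_node=2, graviton_cpu_limit=8):
    """Request node_count nodes and record which provisioner serves each one."""
    provisioners = [
        Provisioner("x86provisioner"),
        Provisioner("priority-arm64provisioner", weight=10,
                    cpu_limit=graviton_cpu_limit),
    ]
    return [pick_provisioner(provisioners, cpus_per_node) for _ in range(node_count)]


if __name__ == "__main__":
    for i, name in enumerate(simulate(6), start=1):
        print(f"node {i}: {name}")
```

With these assumed numbers, the first four 2-CPU nodes land on the Graviton provisioner and later requests burst to x86 once the synthetic limit is reached.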

Conclusion

The use of a mixed-CPU architecture enables your application to utilize a wide selection of Amazon EC2 instance types and improves your application’s resilience while meeting your service-level objectives. Application metrics can be used to control the migration with AWS ALB Ingress, Karpenter, and KEDA. Moreover, AWS Graviton-based Amazon EC2 instances can deliver up to 40% better price performance than x86-based Amazon EC2 instances. Learn more about this example on GitHub, and see more announcements about Graviton.

Consolidating controls in Security Hub: The new controls view and consolidated findings

Post Syndicated from Emmanuel Isimah original https://aws.amazon.com/blogs/security/consolidating-controls-in-security-hub-the-new-controls-view-and-consolidated-findings/

In this blog post, we focus on two recently released features of AWS Security Hub: the consolidated controls view and consolidated control findings. You can use these features to manage controls across standards and to consolidate findings, which can help you significantly reduce finding noise and administrative overhead.

Security Hub is a cloud security posture management service that you can use to apply security best practice controls, such as “EC2 instances should not have a public IP address.” With Security Hub, you can check that your environment is properly configured and that your existing configurations don’t pose a security risk. Security Hub has more than 200 controls that cover more than 30 AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and AWS Lambda. In addition, Security Hub has integrations with more than 85 partner products. Security Hub can centralize findings across your AWS accounts and AWS Regions into a single delegated administrator account in your aggregation Region of choice, creating a single pane of glass to view findings. This can help you to triage, investigate, and respond to findings in a simpler way and improve your security posture.

The Security Hub controls are grouped into the following security standards:

With the new features (consolidated controls view and consolidated control findings), you can now do the following:

  • Enable or disable controls across standards in a single action. Previously, if you wanted to maintain the same enablement status of controls between standards, you had to take the same action across multiple standards (up to six times!).
  • If you choose to turn on consolidated control findings, you will receive only a single finding for a security check, even if the security check is enabled across several standards. This reduces the number of findings and helps you focus on the most important misconfigured resources in your AWS environment. It allows you to apply actions and updates (such as suppressing the finding or changing its severity) one time rather than having to do so multiple times across non-consolidated findings.

Overview of new features

Now we’ll discuss some of the details of how you can use the two new features to streamline the management of controls.

The new consolidated controls view

On the new Controls page, now available in the Security Hub console as shown in Figure 1, you can view and configure security controls across standards from one central location.

Figure 1: Security Hub Controls page

Before this release, controls had to be managed within the context of individual security standards. Even if the same control was part of multiple standards, the control had different IDs in each of them. With this recent release, Security Hub now assigns controls a unique security control ID across standards, so that it’s simpler for you to reference the controls and view their findings. Following the current naming convention of the AWS FSBP standard, the consolidated control IDs start with the relevant service in scope for the control. In fact, whenever possible, the new consolidated control ID is the same as the previous FSBP control ID.

For example, before this release, control IAM.6 in FSBP was also referenced as 1.14 in CIS 1.2, as 1.6 in CIS 1.4, as PCI.IAM.4, and as CT.IAM.6. After the release, the control is referenced as IAM.6 in all of the Security Hub standards. This change does not affect the pre-existing API calls for Security Hub, such as UpdateStandardsControl, where you can still provide the previous StandardsControlArn in order to make the call.

By using the new Controls view, you can understand the status of controls across your system, view control findings, and prioritize next steps without context switching. The following information is available on the Controls page of the Security Hub console:

  • An overall security score, which is based on the proportion of passed controls to the total number of enabled controls.
  • A breakdown of security checks across controls, with the percentage of failed security checks highlighted. Because many controls can contain multiple security checks and multiple findings, this value might be different from the security score, which considers controls as a single object. You can use this metric, as well as your security score, to monitor your progress as you work to remediate findings.
  • A list of controls that are categorized into different tabs based on enablement and compliance status. If you are an administrator of an organization within Security Hub, the enablement and compliance status will reflect the aggregate status of the entire organization. In your finding aggregation Region, the status will also be aggregated across linked Regions.
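The security score in the first bullet can be read as a simple ratio of passed controls to enabled controls. The following Python sketch illustrates that arithmetic only. The function name and the sample control statuses are hypothetical, and the score that Security Hub actually displays may handle some cases (for example, controls that have no data yet) differently.

```python
def security_score(control_statuses):
    """Return the percentage of enabled controls that passed, as a whole number.

    control_statuses maps a control ID to an (is_enabled, compliance_status) tuple.
    Returns None when no control is enabled, because no ratio can be computed.
    """
    enabled = [status for is_enabled, status in control_statuses.values() if is_enabled]
    if not enabled:
        return None
    passed = sum(1 for status in enabled if status == "PASSED")
    return round(100 * passed / len(enabled))


# Illustrative data only; these statuses are made up for the example.
controls = {
    "IAM.6": (True, "PASSED"),
    "CloudTrail.2": (True, "FAILED"),
    "EC2.9": (True, "PASSED"),
    "S3.1": (False, "FAILED"),  # disabled controls are ignored
}

print(security_score(controls))  # 2 of the 3 enabled controls passed
```

Because the score counts each control once, a control with many failed checks lowers it by the same amount as a control with a single failed check, which is why the failed-checks percentage can differ from the score.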

From the controls page, you can select a control to view its details (including its title and the standards it belongs to), and view and act on the findings generated by the control.

Security Hub also offers new API operations that match the capabilities of the controls page. Unlike the pre-existing API operations, these new API operations use the consolidated control IDs (also known as security control IDs) to provide a way to know and manage the relationship between controls and standards. You can use these API operations to manage each Security Hub control across standards, to make sure that the status of controls in the standards is aligned. The new API operations include the following:

We also provide an example script that makes use of these API calls and applies them across accounts and Regions so that your configuration is consistent. You can use our script to enable or disable Security Hub controls across your various accounts or Regions.

Consolidating control findings between standards

Before we released the consolidated control findings feature, Security Hub generated separate findings per standard for each related control. Now, you can turn on consolidated control findings, and after doing so, Security Hub will produce a single finding per security check, even when the underlying control is shared across multiple standards. Having a single finding per check across standards will help you investigate, update, and remediate failed findings more quickly, while also reducing finding noise.

As an example, we can look at control CloudTrail.2, which is shared between standards supported by Security Hub. Before you turn on this capability, you might receive up to six findings for each security check generated by this control, with one finding for each security standard. After you turn on consolidated control findings, these older findings will be archived and Security Hub will generate one finding per security check in this control, regardless of how many security standards you have enabled. For an example of how the standard-specific findings compare to the new consolidated finding, see Sample control findings. The following is an example of a consolidated finding for the CloudTrail.2 control; we’ve highlighted the part that shows this finding is shared across standards.

{
  "SchemaVersion": "2018-10-08",
  "Id": "arn:aws:securityhub:us-east-2:123456789012:security-control/CloudTrail.2/finding/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
  "ProductArn": "arn:aws:securityhub:us-east-2::product/aws/securityhub",
  "ProductName": "Security Hub",
  "CompanyName": "AWS",
  "Region": "us-east-2",
  "GeneratorId": "security-control/CloudTrail.2",
  "AwsAccountId": "123456789012",
  "Types": [
    "Software and Configuration Checks/Industry and Regulatory Standards"
  ],
  "FirstObservedAt": "2022-10-06T02:18:23.076Z",
  "LastObservedAt": "2022-10-28T16:10:06.956Z",
  "CreatedAt": "2022-10-06T02:18:23.076Z",
  "UpdatedAt": "2022-10-28T16:10:00.093Z",
  "Severity": {
    "Label": "MEDIUM",
    "Normalized": "40",
    "Original": "MEDIUM"
  },
  "Title": "CloudTrail should have encryption at-rest enabled",
  "Description": "This AWS control checks whether AWS CloudTrail is configured to use the server-side encryption (SSE) AWS Key Management Service (AWS KMS) customer master key (CMK) encryption. The check will pass if the KmsKeyId is defined.",
  "Remediation": {
    "Recommendation": {
      "Text": "For directions on how to correct this issue, consult the AWS Security Hub controls documentation.",
      "Url": "https://docs.aws.amazon.com/console/securityhub/CloudTrail.2/remediation"
    }
  },
  "ProductFields": {
    "RelatedAWSResources:0/name": "securityhub-cloud-trail-encryption-enabled-fe95bf3f",
    "RelatedAWSResources:0/type": "AWS::Config::ConfigRule",
    "aws/securityhub/ProductName": "Security Hub",
    "aws/securityhub/CompanyName": "AWS",
    "Resources:0/Id": "arn:aws:cloudtrail:us-east-2:123456789012:trail/AWSMacieTrail-DO-NOT-EDIT",
    "aws/securityhub/FindingId": "arn:aws:securityhub:us-east-2::product/aws/securityhub/arn:aws:securityhub:us-east-2:123456789012:security-control/CloudTrail.2/finding/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"
  },
  "Resources": [
    {
      "Type": "AwsCloudTrailTrail",
      "Id": "arn:aws:cloudtrail:us-east-2:123456789012:trail/AWSMacieTrail-DO-NOT-EDIT",
      "Partition": "aws",
      "Region": "us-east-2"
    }
  ],
  "Compliance": {
    "Status": "FAILED",
    "RelatedRequirements": [
        "PCI DSS v3.2.1/3.4",
        "CIS AWS Foundations Benchmark v1.2.0/2.7",
        "CIS AWS Foundations Benchmark v1.4.0/3.7"
    ],
    "SecurityControlId": "CloudTrail.2",
    "AssociatedStandards": [
      { "StandardsId": "standards/aws-foundational-security-best-practices/v/1.0.0"},
      { "StandardsId": "standards/pci-dss/v/3.2.1"},
      { "StandardsId": "ruleset/cis-aws-foundations-benchmark/v/1.2.0"},
      { "StandardsId": "standards/cis-aws-foundations-benchmark/v/1.4.0"},
      { "StandardsId": "standards/service-managed-aws-control-tower/v/1.0.0"}
    ]
  },
  "WorkflowState": "NEW",
  "Workflow": {
    "Status": "NEW"
  },
  "RecordState": "ACTIVE",
  "FindingProviderFields": {
    "Severity": {
      "Label": "MEDIUM",
      "Normalized": "40",
      "Original": "MEDIUM"
    },
    "Types": [
      "Software and Configuration Checks/Industry and Regulatory Standards"
    ]
  }
}
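If your automations consume findings like the one above, the fields of most interest are under Compliance. The following Python sketch pulls them out of a trimmed-down copy of the sample finding. The function name is hypothetical, and treating a GeneratorId that starts with security-control/ as the marker of a consolidated finding is an assumption based on this sample, not a documented rule.

```python
import json

# A trimmed-down copy of the sample finding, keeping only the fields used below.
finding_json = """
{
  "GeneratorId": "security-control/CloudTrail.2",
  "Compliance": {
    "Status": "FAILED",
    "SecurityControlId": "CloudTrail.2",
    "AssociatedStandards": [
      {"StandardsId": "standards/aws-foundational-security-best-practices/v/1.0.0"},
      {"StandardsId": "standards/pci-dss/v/3.2.1"}
    ]
  }
}
"""


def summarize_finding(finding):
    """Extract the control ID, status, and associated standards from a finding."""
    compliance = finding.get("Compliance", {})
    standards = [s["StandardsId"] for s in compliance.get("AssociatedStandards", [])]
    return {
        "control_id": compliance.get("SecurityControlId"),
        "status": compliance.get("Status"),
        "standards": standards,
        # Assumption: consolidated findings use a "security-control/" generator ID.
        "is_consolidated": finding.get("GeneratorId", "").startswith("security-control/"),
    }


summary = summarize_finding(json.loads(finding_json))
print(summary["control_id"], summary["status"], len(summary["standards"]))
```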

To turn on consolidated control findings

  1. Open the Security Hub console.
  2. In the left navigation pane, choose Settings, and then choose the General tab.
  3. Under Controls, turn on Consolidated control findings, and then choose Save.

Figure 2: Turn on consolidated control findings

If you are using the Security Hub integration with AWS Organizations or have invited member accounts through a manual invitation process, consolidated control findings can only be turned on by the administrator account. When this action is taken in the administrator account, the action will also be reflected in each member account in the current Region. It can take up to 18 hours for Security Hub to archive existing standard-specific findings and generate the new, standard-agnostic, findings.

You can also enable consolidated control findings by using the API (calling the UpdateSecurityHubConfiguration API with the ControlFindingGenerator parameter equal to SECURITY_CONTROL), or by using the AWS CLI (running the update-security-hub-configuration command with control-finding-generator equal to SECURITY_CONTROL), as in the following example.

aws securityhub --region <Region of choice> update-security-hub-configuration --control-finding-generator SECURITY_CONTROL

Much like the console behavior, if you have an organizational setup in Security Hub, this API action can only be taken by the administrator, and it will be reflected in each member account in the same Region.
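For readers who script this with the AWS SDK for Python (Boto3), the same API call could look like the following sketch. The helper function and the RecordingClient stand-in are hypothetical and exist only so that the example runs without AWS credentials; against a real account you would pass a Boto3 Security Hub client for the administrator account instead.

```python
def enable_consolidated_findings(securityhub_client):
    """Ask Security Hub to generate one finding per security check."""
    securityhub_client.update_security_hub_configuration(
        ControlFindingGenerator="SECURITY_CONTROL"
    )


class RecordingClient:
    """Hypothetical stand-in for a Boto3 client; it only records the call."""

    def __init__(self):
        self.calls = []

    def update_security_hub_configuration(self, **kwargs):
        self.calls.append(kwargs)


# Dry run against the stand-in client.
client = RecordingClient()
enable_consolidated_findings(client)
print(client.calls)

# With Boto3 (not executed here; the Region is a placeholder):
#   import boto3
#   enable_consolidated_findings(boto3.client("securityhub", region_name="us-east-2"))
```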

What to expect when you enable consolidated control findings

To allow for these new capabilities to be launched, changes to the AWS Security Finding Format (ASFF) are required. This format is used by Security Hub for findings it generates from its controls or ingests from external providers. When you turn on finding consolidation, Security Hub will archive old standard-specific findings and generate standard-agnostic findings instead. This action will only affect control findings that Security Hub generates, and it will not affect findings ingested from partner products. However, in Security Hub findings, turning on consolidated control findings might cause some updates that you previously made to findings to be archived. Despite this one-time change, after the migration is complete (it can take up to 18 hours), you will be able to update finding fields in a single action and the updates will apply across standards, without the need to make multiple updates.

One field affected by the new capabilities is the Workflow field, which provides information about the status of the investigation into a finding. Manipulating this field can also update the overall compliance status of the control that the finding is related to. For example, if you have a control with one failed finding (and the rest have passed), and the failed finding comes from a resource for which you’d like to make an exception, you can decide to suppress that failed finding by updating the Workflow field. If you suppress failed findings in a control, its compliance status can change to pass.

Before turning on consolidated control findings, if you want to maintain an aligned compliance status in controls that belong to multiple standards, you have to update the Workflow status of findings in each standard. After turning on finding consolidation, you will only have to update the Workflow status once, and the suppression will be applied across standards, helping you to reduce the number of steps needed to suppress the same findings across standards.
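As a rough illustration of that single update, the following Python sketch builds the request for the BatchUpdateFindings API that sets the Workflow status of one finding to SUPPRESSED. The helper name, the note text, and the UpdatedBy value are hypothetical; the finding identifiers are taken from the sample finding shown earlier in this post.

```python
def build_suppress_request(finding, reason, updated_by):
    """Build keyword arguments for BatchUpdateFindings that suppress one finding."""
    return {
        "FindingIdentifiers": [
            {"Id": finding["Id"], "ProductArn": finding["ProductArn"]}
        ],
        "Workflow": {"Status": "SUPPRESSED"},
        "Note": {"Text": reason, "UpdatedBy": updated_by},
    }


# Identifiers copied from the sample consolidated finding.
sample_finding = {
    "Id": "arn:aws:securityhub:us-east-2:123456789012:security-control/CloudTrail.2/finding/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
    "ProductArn": "arn:aws:securityhub:us-east-2::product/aws/securityhub",
}

request = build_suppress_request(
    sample_finding,
    reason="Approved exception for this trail",  # hypothetical note text
    updated_by="security-team",                  # hypothetical value
)
print(request["Workflow"])

# With Boto3 (not executed here):
#   import boto3
#   boto3.client("securityhub").batch_update_findings(**request)
```

With consolidated control findings turned on, this one request covers every standard that includes the control, because there is only one finding to update.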

As mentioned earlier, when you turn on this new capability, some updates made to the previous, standard-specific findings will be archived and will not be included in the new consolidated control findings generated by Security Hub. In the case of the Workflow status, the new consolidated findings will be created with a value of NEW (for failed findings) or RESOLVED (for passed findings) in the Workflow field. However, after you have onboarded to the new finding format, you can update the value of the Workflow field, as well as other fields, and this value will be maintained without requiring you to make continuous updates. For the full list of fields that can be affected by the migration to the consolidated finding format, see Consolidated control findings – ASFF changes. Before you turn on finding consolidation, we suggest that you check if your custom automations refer to those affected fields. If they do, you can update your automations and test them by using the Sample control findings in the documentation.

Conclusion

This blog post covers new Security Hub features that make it simpler for you to manage controls across standards. With the new consolidated control findings feature, you can focus on the most relevant findings and reduce noise, which is why we recommend that you review the new feature and its associated changes and turn it on at your earliest convenience.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Security Hub forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Emmanuel Isimah

Emmanuel is a solutions architect covering hypergrowth customers in the Digital Native Business sector. He has a background in networking, security, and containers. Emmanuel helps customers build and secure innovative cloud solutions, solving their business problems by using data-driven approaches. Emmanuel’s areas of depth include security and compliance, cloud operations, and containers.

Automated Code Review on Pull Requests using AWS CodeCommit and AWS CodeBuild

Post Syndicated from Verinder Singh original https://aws.amazon.com/blogs/devops/automated-code-review-on-pull-requests-using-aws-codecommit-and-aws-codebuild/

Pull Requests play a critical part in the software development process. They ensure that a developer’s proposed code changes are reviewed by relevant parties before code is merged into the main codebase. This is a standard procedure that is followed across the globe in different organisations today. However, pull requests often require code reviewers to read through a great deal of code and manually check it against quality and security standards. These manual reviews can lead to problematic code being merged into the main codebase if the reviewer overlooks any problems.

To help solve this problem, we recommend using Amazon CodeGuru Reviewer to assist in the review process. CodeGuru Reviewer identifies critical defects and deviation from best practices in your code. It provides recommendations to remediate its findings as comments in your pull requests, helping reviewers miss fewer problems that may have otherwise made it into production. You can easily integrate your repositories in AWS CodeCommit with Amazon CodeGuru Reviewer following these steps.

The purpose of this post isn’t, however, to show you CodeGuru Reviewer. Instead, our aim is to help you achieve automated code reviews with your pull requests if you already have a code scanning tool and need to continue using it. In this post, we will show you step-by-step how to add automation to the pull request review process using your code scanning tool with AWS CodeCommit (as source code repository) and AWS CodeBuild (to automatically review code using your code reviewer). After following this guide, you should be able to give developers automatic feedback on their code changes and augment manual code reviews so fewer problems make it into your main codebase.

Solution Overview

The solution comprises the following components:

  1. AWS CodeCommit: AWS service to host private Git repositories.
  2. Amazon EventBridge: AWS service to receive pullRequestCreated and pullRequestSourceBranchUpdated events and trigger Amazon EventBridge rule.
  3. AWS CodeBuild: AWS service to perform code review and send the result to AWS CodeCommit repository as pull request comment.

The following diagram illustrates the architecture:

Figure 1: This architecture diagram illustrates the workflow where developer raises a Pull Request and receives automated feedback on the code changes using AWS CodeCommit, AWS CodeBuild and Amazon EventBridge rule

Figure 1. Architecture Diagram of the proposed solution in the blog

  1. Developer raises a pull request against the main branch of the source code repository in AWS CodeCommit.
  2. The pullRequestCreated event is received by the default event bus.
  3. The default event bus triggers the Amazon EventBridge rule which is configured to be triggered on pullRequestCreated and pullRequestSourceBranchUpdated events.
  4. The EventBridge rule triggers AWS CodeBuild project.
  5. The AWS CodeBuild project runs the code quality check using customer’s choice of tool and sends the results back to the pull request as comments. Based on the result, the AWS CodeBuild project approves or rejects the pull request automatically.

Walkthrough

The following steps provide a high-level overview of the walkthrough:

  1. Create a source code repository in AWS CodeCommit.
  2. Create and associate an approval rule template.
  3. Create AWS CodeBuild project to run the code quality check and post the result as pull request comment.
  4. Create an Amazon EventBridge rule that reacts to AWS CodeCommit pullRequestCreated and pullRequestSourceBranchUpdated events for the repository created in step 1 and set its target to AWS CodeBuild project created in step 3.
  5. Create a feature branch, add a new file and raise a pull request.
  6. Verify the pull request with the code review feedback in comment section.

1. Create a source code repository in AWS CodeCommit

Create an empty test repository in AWS CodeCommit by following these steps. Once the repository is created you can add files to your repository following these steps. If you create or upload the first file for your repository in the console, a branch is created for you named main. This branch is the default branch for your repository. If you are using a Git client instead, consider configuring your Git client to use main as the name for the initial branch. This blog post assumes the default branch is named main.

2. Create and associate an approval rule template

Create an AWS CodeCommit approval rule template and associate it with the code repository created in step 1 following these steps.

3. Create AWS CodeBuild project to run the code quality check and post the result as pull request comment

This blog post is based on the assumption that the source code repository has JavaScript code in it, so it uses jshint as a code analysis tool to review the code quality of those files. However, users can choose a different tool as per their use case and choice of programming language.

Create an AWS CodeBuild project from the AWS Management Console following these steps and using the configuration below:

  • Source: Choose the AWS CodeCommit repository created in step 1 as the source provider.
  • Environment: Select the latest version of AWS managed image with operating system of your choice. Choose New service role option to create the service IAM role with default permissions.
  • Buildspec: Use below build specification. Replace <NODEJS_VERSION> with the latest supported nodejs runtime version for the image selected in previous step. Replace <REPOSITORY_NAME> with the repository name created in step 1. The below spec installs the jshint package, creates a jshint config file with a few sample rules, runs it against the source code in the pull request commit, posts the result as comment to the pull request page and based on the results, approves or rejects the pull request automatically.
version: 0.2
phases:
  install:
    runtime-versions:
      nodejs: <NODEJS_VERSION>
    commands:
      - npm install jshint --global
  build:
    commands:
      - echo \{\"esversion\":6,\"eqeqeq\":true,\"quotmark\":\"single\"\} > .jshintrc
      - CODE_QUALITY_RESULT="$(echo \`\`\`) $(jshint .)"; EXITCODE=$?
      - aws codecommit post-comment-for-pull-request --pull-request-id $PULL_REQUEST_ID --repository-name <REPOSITORY_NAME> --content "$CODE_QUALITY_RESULT" --before-commit-id $DESTINATION_COMMIT_ID --after-commit-id $SOURCE_COMMIT_ID --region $AWS_REGION
      - |
        if [ $EXITCODE -ne 0 ]
        then
          PR_STATUS='REVOKE'
        else
          PR_STATUS='APPROVE'
        fi
      - REVISION_ID=$(aws codecommit get-pull-request --pull-request-id $PULL_REQUEST_ID | jq -r '.pullRequest.revisionId')
      - aws codecommit update-pull-request-approval-state --pull-request-id $PULL_REQUEST_ID --revision-id $REVISION_ID --approval-state $PR_STATUS --region $AWS_REGION

Once the AWS CodeBuild project has been created successfully, modify its IAM service role by following the below steps:

  • Choose the CodeBuild project’s Build details tab.
  • Choose the Service role link under the Environment section which should navigate you to the CodeBuild’s IAM service role in IAM console.
  • Expand the default customer managed policy and choose Edit.
  • Add the following actions to the existing codecommit actions:
"codecommit:CreatePullRequestApprovalRule",
"codecommit:GetPullRequest",
"codecommit:PostCommentForPullRequest",
"codecommit:UpdatePullRequestApprovalState"

  • Choose Next.
  • On the Review screen, choose Save changes.

4. Create an Amazon EventBridge rule that reacts to AWS CodeCommit pullRequestCreated and pullRequestSourceBranchUpdated events for the repository created in step 1 and set its target to AWS CodeBuild project created in step 3

Follow these steps to create an Amazon EventBridge rule that gets triggered whenever a pull request is created or updated using the following event pattern. Replace the <REGION>, <ACCOUNT_ID> and <REPOSITORY_NAME> placeholders with the actual values. Select target of the event rule as AWS CodeBuild project created in step 3.

Event Pattern

{
    "detail-type": ["CodeCommit Pull Request State Change"],
    "resources": ["arn:aws:codecommit:<REGION>:<ACCOUNT_ID>:<REPOSITORY_NAME>"],
    "source": ["aws.codecommit"],
    "detail": {
      "isMerged": ["False"],
      "pullRequestStatus": ["Open"],
      "repositoryNames": ["<REPOSITORY_NAME>"],
      "destinationReference": ["refs/heads/main"],
      "event": ["pullRequestCreated", "pullRequestSourceBranchUpdated"]
    },
    "account": ["<ACCOUNT_ID>"]
}
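To build intuition for what this pattern accepts, the following Python sketch applies a simplified version of the matching rules to a sample event. It is an approximation for illustration, not the EventBridge implementation: it handles only exact-value matching on nested objects and arrays. The Region, account ID, repository name, and event field values are hypothetical placeholders.

```python
def matches(pattern, event):
    """Simplified pattern matching: every key in the pattern must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # Leaf pattern: a list of accepted values. An event array matches
            # when any of its elements is one of the accepted values.
            actual_values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in actual_values):
                return False
    return True


# Placeholder values stand in for <REGION>, <ACCOUNT_ID>, and <REPOSITORY_NAME>.
REPO_ARN = "arn:aws:codecommit:us-east-1:111122223333:my-repo"

pattern = {
    "detail-type": ["CodeCommit Pull Request State Change"],
    "resources": [REPO_ARN],
    "source": ["aws.codecommit"],
    "detail": {
        "isMerged": ["False"],
        "pullRequestStatus": ["Open"],
        "repositoryNames": ["my-repo"],
        "destinationReference": ["refs/heads/main"],
        "event": ["pullRequestCreated", "pullRequestSourceBranchUpdated"],
    },
    "account": ["111122223333"],
}

sample_event = {
    "detail-type": "CodeCommit Pull Request State Change",
    "resources": [REPO_ARN],
    "source": "aws.codecommit",
    "account": "111122223333",
    "detail": {
        "isMerged": "False",
        "pullRequestStatus": "Open",
        "repositoryNames": ["my-repo"],
        "destinationReference": "refs/heads/main",
        "event": "pullRequestCreated",
        "pullRequestId": "1",
    },
}

print(matches(pattern, sample_event))
```

Fields that appear in the event but not in the pattern (such as pullRequestId here) are ignored, so the rule fires for any open, unmerged pull request that targets main in the named repository.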

Follow these steps to configure the target input using the below input path and input template.

Input transformer – Input path

{
    "detail-destinationCommit": "$.detail.destinationCommit",
    "detail-pullRequestId": "$.detail.pullRequestId",
    "detail-sourceCommit": "$.detail.sourceCommit"
}

Input transformer – Input template

{
    "sourceVersion": <detail-sourceCommit>,
    "environmentVariablesOverride": [
        {
            "name": "DESTINATION_COMMIT_ID",
            "type": "PLAINTEXT",
            "value": <detail-destinationCommit>
        },
        {
            "name": "SOURCE_COMMIT_ID",
            "type": "PLAINTEXT",
            "value": <detail-sourceCommit>
        },
        {
            "name": "PULL_REQUEST_ID",
            "type": "PLAINTEXT",
            "value": <detail-pullRequestId>
        }
    ]
}
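The input path and input template together turn a pull request event into the build parameters that AWS CodeBuild receives. The following Python sketch mimics that transformation so you can see the resulting payload. The function name and the sample commit IDs are hypothetical, and the real substitution is performed by EventBridge, not by your code.

```python
def build_start_build_input(event):
    """Mimic the input transformer: map event fields to CodeBuild build overrides."""
    detail = event["detail"]
    return {
        # Build the commit at the tip of the pull request's source branch.
        "sourceVersion": detail["sourceCommit"],
        "environmentVariablesOverride": [
            {
                "name": "DESTINATION_COMMIT_ID",
                "type": "PLAINTEXT",
                "value": detail["destinationCommit"],
            },
            {
                "name": "SOURCE_COMMIT_ID",
                "type": "PLAINTEXT",
                "value": detail["sourceCommit"],
            },
            {
                "name": "PULL_REQUEST_ID",
                "type": "PLAINTEXT",
                "value": detail["pullRequestId"],
            },
        ],
    }


# Hypothetical event detail; real commit IDs are 40-character hashes.
sample_event = {
    "detail": {
        "pullRequestId": "1",
        "sourceCommit": "aaaa1111",
        "destinationCommit": "bbbb2222",
    }
}

payload = build_start_build_input(sample_event)
for variable in payload["environmentVariablesOverride"]:
    print(variable["name"], "=", variable["value"])
```

These three environment variables are the ones that the buildspec in step 3 reads when it posts the comment and updates the approval state.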

5. Create a feature branch, add a new file and raise a pull request

Create a feature branch following these steps. Push a new file called “index.js” to the root of the repository with the below content.

function greet(dayofweek) {
  if (dayofweek == "Saturday" || dayofweek == "Sunday") {
    console.log("Have a great weekend");
  } else {
    console.log("Have a great day at work");
  }
}

Now raise a pull request using the feature branch as source and main branch as destination following these steps.

6. Verify the pull request with the code review feedback in comment section

As soon as the pull request is created, the AWS CodeBuild project created in step 3 above will be triggered which will run the code quality check and post the results as a pull request comment. Navigate to the AWS CodeCommit repository pull request page in AWS Management Console and check under the Activity tab to confirm the automated code review result being displayed as the latest comment.

The pull request comment submitted by AWS CodeBuild highlights 6 errors in the JavaScript code. The first and third errors are based on the jshint rule “eqeqeq”, which recommends using the strict equality operator (“===”) instead of the loose equality operator (“==”) to avoid type coercion. The second, fourth and fifth errors are based on the jshint rule “quotmark”, which recommends using single quotes with strings instead of double quotes for better readability. These jshint rules are defined in the AWS CodeBuild project’s buildspec in step 3 above.

Figure 2: The image shows the AWS CodeCommit pull request's Activity tab with code review results automatically posted by the automated code reviewer

Figure 2. Pull Request comments updated with automated code review results.

Conclusion

In this blog post, we’ve shown how customers can automate their pull request review process with AWS CodeCommit and AWS CodeBuild by utilising Amazon EventBridge events and their own choice of code quality tool. This simple solution also makes it easier for the human reviewers by providing them with automated code quality results as input and enabling them to focus their code review more on business logic code changes rather than static code quality issues.

About the authors

Blog post's primary author's image

Verinder Singh

Verinder Singh is an experienced Solutions Architect based out of Sydney, Australia with 16+ years of experience in software development and architecture. He works primarily on building large scale open-source AWS solutions for common customer use cases and business problems. In his spare time, he enjoys vacationing and watching movies with his family.

Blog post's secondary author's image

Deenadayaalan Thirugnanasambandam

Deenadayaalan Thirugnanasambandam is a Principal Cloud Architect at AWS. He provides prescriptive architectural guidance and consulting that enable and accelerate customers’ adoption of AWS.

IAM Policies and Bucket Policies and ACLs! Oh, My! (Controlling Access to S3 Resources)

Post Syndicated from Kai Zhao original https://aws.amazon.com/blogs/security/iam-policies-and-bucket-policies-and-acls-oh-my-controlling-access-to-s3-resources/

Updated on July 6, 2023: This post has been updated to reflect the current guidance around the usage of S3 ACL and to include S3 Access Points and the Block Public Access for accounts and S3 buckets.

Updated on April 27, 2023: Amazon S3 now automatically enables S3 Block Public Access and disables S3 access control lists (ACLs) for all new S3 buckets in all AWS Regions.

Updated on January 8, 2019: Based on customer feedback, we updated the third paragraph in the “What about S3 ACLs?” section to clarify permission management.


In this post, we will discuss Amazon S3 bucket policies and IAM policies and their different use cases. This post will assist you in distinguishing between the usage of IAM policies and S3 bucket policies. We will also discuss how these policies integrate with some default S3 bucket security settings like automatically enabling S3 Block Public Access and disabling S3 access control lists (ACLs).

IAM policies vs. S3 bucket policies

AWS access is managed by setting IAM policies and linking them to IAM identities (users, groups of users, or roles) or AWS resources. A policy is an object in AWS that when associated with an identity or resource, defines their permissions. IAM policies specify what actions are allowed or denied on what AWS resources (e.g. user Alice can read objects from the “Production” bucket but can’t write objects in the “Dev” bucket whereas user Bob can have full access to S3).

S3 bucket policies, on the other hand, are resource-based policies that you can use to grant access permissions to your Amazon S3 buckets and the objects in them. S3 bucket policies can allow or deny requests based on the elements in the policy (e.g. allow user Alice to PUT but not DELETE objects in the bucket).

Note: You attach S3 bucket policies at the bucket level (i.e. you can’t attach a bucket policy to an S3 object), but the permissions specified in the bucket policy apply to all the objects in the bucket. You can also specify permissions at the object level by putting an object as the resource in the Bucket policy.

IAM policies and S3 bucket policies are both used for access control and they’re both written in JSON using the AWS access policy language. Let’s look at an example policy of each type:

Sample S3 Bucket Policy

This S3 bucket policy enables any IAM principal (user or role) in account 111122223333 to use the Amazon S3 GET Bucket (List Objects) operation.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::111122223333:root"]
      },
      "Action": "s3:ListBucket",
      "Resource": ["arn:aws:s3:::my_bucket"]
    }
  ]
}

This S3 bucket policy enables the IAM role ‘Role-name’ under the account 111122223333 to use the Amazon S3 GET Bucket (List Objects) operation.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/Role-name"
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my_bucket"
    }
  ]
}

Sample IAM Policy

This IAM policy grants the IAM principal it is attached to permission to perform any S3 operation on the contents of the bucket named “my_bucket”.

{
  "Version": "2012-10-17",
  "Statement":[{
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my_bucket/*"]
    }
  ]
}

Note that the S3 bucket policy includes a “Principal” element, which lists the principals that bucket policy controls access for. The “Principal” element is unnecessary in an IAM policy, because the principal is by default the entity that the IAM policy is attached to.

S3 bucket policies (as the name would imply) only control access to S3 resources for the bucket they’re attached to, whereas IAM policies can specify nearly any AWS action. One of the neat things about AWS is that you can actually apply both IAM policies and S3 bucket policies simultaneously, with the ultimate authorization being the least-privilege union of all the permissions (more on this in the section below titled “How does authorization work with multiple access control mechanisms?”).

When to use IAM policies vs. S3 policies

Use IAM policies if:

  • You need to control access to AWS services other than S3. IAM policies will be easier to manage since you can centrally manage all of your permissions in IAM, instead of spreading them between IAM and S3.
  • You have numerous S3 buckets each with different permissions requirements. IAM policies will be easier to manage since you don’t have to define a large number of S3 bucket policies and can instead rely on fewer, more detailed IAM policies.
  • You prefer to keep access control policies in the IAM environment.

Use S3 bucket policies if:

  • You want a simple way to grant cross-account access to your S3 environment, without using IAM roles.
  • Your IAM policies bump up against the size limit (up to 2 KB for users, 5 KB for groups, and 10 KB for roles). S3 supports bucket policies of up to 20 KB.
  • You prefer to keep access control policies in the S3 environment.
  • You want to apply common security controls to all principals who interact with S3 buckets, such as restricting the IP addresses or VPC a bucket can be accessed from.

If you’re still unsure of which to use, consider which audit question is most important to you:

  • If you’re more interested in “What can this user do in AWS?” then IAM policies are probably the way to go. You can easily answer this by looking up an IAM user and then examining their IAM policies to see what rights they have.
  • If you’re more interested in “Who can access this S3 bucket?” then S3 bucket policies will likely suit you better. You can easily answer this by looking up a bucket and examining the bucket policy.

Whichever method you choose, we recommend staying as consistent as possible. Auditing permissions becomes more challenging as the number of IAM policies and S3 bucket policies grows.

What about S3 ACLs?

An S3 ACL is a sub-resource that’s attached to every S3 bucket and object. It defines which AWS accounts or groups are granted access and the type of access. You can attach S3 ACLs to both buckets and individual objects within a bucket to manage permissions for those objects. As a general rule, AWS recommends using S3 bucket policies or IAM policies for access control. S3 ACLs are a legacy access control mechanism that predates IAM. By default, Object Ownership is set to the Bucket owner enforced setting and all ACLs are disabled.

A majority of modern use cases in Amazon S3 no longer require the use of ACLs, and we recommend that you keep ACLs disabled by applying the Bucket owner enforced setting. This approach simplifies permissions management: you can use policies to more easily control access to every object in your bucket, regardless of who uploaded the objects in your bucket. When ACLs are disabled, the bucket owner owns all the objects in the bucket and manages access to data exclusively using access management policies.

S3 bucket policies and IAM policies define object-level permissions by providing those objects in the Resource element in your policy statements. The statement will apply to those objects in the bucket. Consolidating object-specific permissions into one policy (as opposed to multiple S3 ACLs) makes it simpler for you to determine effective permissions for your users and roles.

You can disable ACLs on both newly created and already existing buckets. For newly created buckets, ACLs are disabled by default. In the case of an existing bucket that already has objects in it, after you disable ACLs, the object and bucket ACLs are no longer part of an access evaluation, and access is granted or denied on the basis of policies.

S3 Access Points and S3 Access

Some customers have use cases with complex entitlements: Amazon S3 is used to store shared datasets where data is aggregated and accessed by different applications, individuals, or teams for different purposes. Managing access to this shared bucket requires a single bucket policy that controls access for dozens to hundreds of applications with different permission levels. As an application set grows, the bucket policy becomes more complex and time consuming to manage, and it needs to be audited to make sure that changes don’t have an unexpected impact on another application.

These customers need additional policy space for controlling access to their data and buckets. To support these use cases, Amazon S3 provides a feature called Amazon S3 Access Points. Amazon S3 access points simplify data access for any AWS service or customer application that stores data in S3.

Access points are named network endpoints that are attached to buckets that you can use to perform S3 object operations, such as GetObject and PutObject. Each access point has distinct permissions and network controls that S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket.

Amazon S3 access points support AWS Identity and Access Management (IAM) resource policies that allow you to control the use of the access point by resource, user, or other conditions. For an application or user to be able to access objects through an access point, both the access point and the underlying bucket must permit the request.

Note that adding an S3 access point to a bucket doesn’t change the bucket’s behavior when the bucket is accessed directly through the bucket’s name or Amazon Resource Name (ARN). All existing operations against the bucket will continue to work as before. Restrictions that you include in an access point policy apply only to requests made through that access point.

Sample Access point policy

This access point policy grants the IAM user Alice permissions to GET and PUT objects through the access point ‘my-access-point’ in account 111122223333.

{
  "Version": "2012-10-17",
  "Statement":[{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::111122223333:user/Alice" },
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:us-west-2:111122223333:accesspoint/my-access-point/object/*"
    }
  ]
}

Blocking Public Access for accounts and buckets

Public access is granted to buckets and objects through access control lists (ACLs), bucket policies, access point policies, or a combination of these. To ensure that public access to a bucket and its objects is blocked, you can turn on Block all public access at either the bucket level or the account level.

The Amazon S3 Block Public Access feature provides settings for access points, buckets, and accounts to help you manage public access to Amazon S3 resources. By default, new buckets, access points, and objects don’t allow public access. However, users can modify bucket policies, access point policies, or object permissions to allow public access. S3 Block Public Access settings override these policies and permissions so that you can limit public access to these resources.

With S3 Block Public Access, account administrators and bucket owners can easily set up centralized controls to limit public access to their Amazon S3 resources that are enforced regardless of how the resources are created.

If you apply a setting to an account, it applies to all buckets and access points that are owned by that account. Similarly, if you apply a setting to a bucket, it applies to all access points associated with that bucket.

Block Public Access for buckets

These settings apply only to this bucket and its access points. AWS recommends that you turn on Block all public access, but before applying any of these settings, ensure that your applications will work correctly without public access. If you require some level of public access to this bucket or objects within, you can customize the individual settings below to suit your specific storage use cases.

You can use the S3 console, AWS CLI, AWS SDKs, and REST API to grant public access to one or more buckets. The Block all public access setting is on by default at account creation, as can be seen below (using the S3 console).

Turning off this setting will create a warning in the account, as AWS recommends keeping this setting turned on unless public access is required for specific and verified use cases such as static website hosting.

This setting can also be turned on for existing buckets. In the AWS Management Console, this is done by opening the Amazon S3 console at https://console.aws.amazon.com/s3/, choosing the name of the bucket you want, choosing the Permissions tab, and then choosing Edit to change the public access settings for the bucket.

Block Public Access for accounts

In order to ensure that public access to all your S3 buckets and objects is blocked, turn on Block all public access. These settings apply account-wide for all current and future buckets and access points. AWS recommends that you turn on Block all public access, but before applying any of these settings, ensure that your applications will work correctly without public access. If you require some level of public access to your buckets or objects, you can customize the individual settings below to suit your specific storage use cases.

You can use the S3 console, AWS CLI, AWS SDKs, and REST API to configure block public access settings for all the buckets in your account. This setting can be turned on in the AWS Management Console by opening the Amazon S3 console at https://console.aws.amazon.com/s3/, clicking Block Public Access settings for this account on the left panel, and then choosing Edit to change the public access settings for the account.

When working with AWS Organizations, you can prevent people from modifying the Block Public Access settings at the account level by adding a service control policy (SCP) that denies editing them. An example of such an SCP can be seen below:

{
  "Version": "2012-10-17",
  "Statement":[{
    "Sid": "DenyTurningOffBlockPublicAccessForThisAccount",
    "Effect": "Deny",
    "Action": "s3:PutAccountPublicAccessBlock",
    "Resource": "arn:aws:s3:::*"
    }
  ]
}

How does authorization work with multiple access control mechanisms?

Whenever an AWS principal issues a request to S3, the authorization decision depends on the union of all the IAM policies, S3 bucket policies, and S3 ACLs that apply, as well as on whether Block Public Access is enabled on the account, bucket, or access point.

In accordance with the principle of least-privilege, decisions default to DENY and an explicit DENY always trumps an ALLOW. For example, if an IAM policy grants access to an object, the S3 bucket policy denies access to that object, and there is no S3 ACL, then access will be denied. Similarly, if no method specifies an ALLOW, then the request will be denied by default. Only if no method specifies a DENY and one or more methods specify an ALLOW will the request be allowed.
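The rule in the previous paragraph can be sketched in a few lines of C. This is a deliberately simplified model, not the actual S3 evaluation engine: the `policy_effect` type and the `is_request_allowed` function are hypothetical names introduced only for illustration, and each array element stands for the outcome of one mechanism (IAM policy, bucket policy, or ACL) for a single request. It ignores Block Public Access, cross-account rules, and service control policies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of evaluating ONE mechanism (IAM policy, bucket
 * policy, or ACL) against a request. Illustrative names only. */
typedef enum { NO_MATCH = 0, EXPLICIT_ALLOW, EXPLICIT_DENY } policy_effect;

/* Combine the per-mechanism results:
 *   - any explicit DENY ends the evaluation with a denial,
 *   - otherwise at least one ALLOW is required,
 *   - otherwise the request is denied by default.
 * Returns 1 when the request is allowed and 0 when it is denied. */
int is_request_allowed(const policy_effect *effects, size_t count) {
    int allowed = 0;                     /* decisions default to DENY */
    for (size_t i = 0; i < count; i++) {
        if (effects[i] == EXPLICIT_DENY)
            return 0;                    /* explicit DENY trumps any ALLOW */
        if (effects[i] == EXPLICIT_ALLOW)
            allowed = 1;
    }
    return allowed;
}
```

With the example from the text (IAM policy allows, bucket policy denies, no ACL), the inputs would be `EXPLICIT_ALLOW`, `EXPLICIT_DENY`, and `NO_MATCH`, and the function returns 0.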

When Amazon S3 receives a request to access a bucket or an object, it determines whether the bucket or the bucket owner’s account has a block public access setting applied. If the request was made through an access point, Amazon S3 also checks for block public access settings for the access point. If there is an existing block public access setting that prohibits the requested access, Amazon S3 rejects the request.

This diagram illustrates the authorization process.

We hope that this post clarifies some of the confusion around the various ways you can control access to your S3 environment.

Using IAM Access Analyzer for S3 to review bucket access

You can also use IAM Access Analyzer for S3 to review buckets with bucket ACLs, bucket policies, or access point policies that grant public access. IAM Access Analyzer for S3 alerts you to buckets that are configured to allow access to anyone on the internet or other AWS accounts, including AWS accounts outside of your organization. For each public or shared bucket, you receive findings that report the source and level of public or shared access.

In IAM Access Analyzer for S3, you can block all public access to a bucket with a single click. You can also drill down into bucket-level permission settings to configure granular levels of access. For specific and verified use cases that require public or shared access, you can acknowledge and record your intent for the bucket to remain public or shared by archiving the findings for the bucket.

Additional Resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Laura Verghote

Laura is a Territory Solutions Architect for Public Sector customers in the Benelux. She works together with customers to design and build solutions in the AWS cloud. She joined AWS as a technical trainer through a graduate program and has wide experience delivering training content to developers, administrators, architects, and partners in EMEA.

Gautam Kumar

Gautam is a Solution Architect at AWS. Gautam helps various Enterprise customers to design and architect innovative solutions on AWS and is especially passionate about building secure workloads on AWS. Outside work, he enjoys traveling and spending time with family.

Validating attestation documents produced by AWS Nitro Enclaves

Post Syndicated from maceneff original https://aws.amazon.com/blogs/compute/validating-attestation-documents-produced-by-aws-nitro-enclaves/

This blog post is written by Paco Gonzalez, Senior EMEA IoT Specialist SA.

AWS Nitro Enclaves offers an isolated, hardened, and highly constrained environment to host security-critical applications. Think of AWS Nitro Enclaves as regular Amazon Elastic Compute Cloud (Amazon EC2) virtual machines (VMs) but with the added benefit of the environment being highly constrained.

A great benefit of using AWS Nitro Enclaves is that you can run your software as if it were a regular EC2 instance, but with no persistent storage and limited access to external systems. The only way to communicate with AWS Nitro Enclaves is using a VSOCK socket. This special type of communication mechanism acts as an isolated communication channel between the parent EC2 instance and AWS Nitro Enclaves.

 Fig 1 – AWS Nitro Enclaves uses the proven isolation of the Nitro Hypervisor to further isolate the CPU and memory of the Nitro Enclaves from users, applications, and libraries on the parent instance.

AWS Nitro Enclaves comes with a custom Linux device called the Nitro Security Module (NSM), which is accessible via /dev/nsm. This device provides attestation capability to the Nitro Enclaves. The attestation comes in the form of an attestation document. The attestation document makes it easy and safe to build trust between systems that interact with the Nitro Enclaves. The external system must have a mechanism to process the attestation document to determine the validity of the attestation document.

In this post, I go through the anatomy of an attestation document produced by the NSM API. I then show you an example of how to perform different validations that help determine the accuracy of an attestation document produced by the AWS Nitro Enclaves Security Module. I use syntactic and semantic validations to check for the attestation document’s correctness before proceeding with a cryptographic validation of the contents of the document’s payload. The examples used in this post use the C language. Look at the companion repository available in GitHub for access to all the source code used in this post.

Anatomy of an attestation document produced by AWS Nitro Enclaves

The attestation document uses the Concise Binary Object Representation (CBOR) format to encode the data. The CBOR object is wrapped using the CBOR Object Signing and Encryption (COSE) protocol. The COSE format used is a single-signer data structure called “COSE_Sign1”. The object is comprised of headers, the payload, and a signature.

For more information about COSE, see RFC 8152: CBOR Object Signing and Encryption (COSE). For more information about CBOR, see RFC 8949: Concise Binary Object Representation (CBOR).

We published a library to make it easy to interact with the NSM. The library contains helpers which your application, running on the Nitro Enclaves, can use to communicate with the NSM device.

Here is the minimum code needed to generate an attestation document:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>   // uint8_t, uint32_t
#include <nsm.h>

#define NSM_MAX_ATTESTATION_DOC_SIZE (16 * 1024)

int main(void) {

    /// NSM library initialization function.  
    /// *Returns*: A descriptor for the opened device file.

    int nsm_fd = nsm_lib_init();
    if (nsm_fd < 0) {
        exit(1);
    }

    /// NSM `GetAttestationDoc` operation for non-Rust callers.  
    /// *Argument 1 (input)*: The descriptor to the NSM device file.  
    /// *Argument 2 (input)*: User data.  
    /// *Argument 3 (input)*: The size of the user data buffer.  
    /// *Argument 4 (input)*: Nonce data.  
    /// *Argument 5 (input)*: The size of the nonce data buffer.  
    /// *Argument 6 (input)*: Public key data.  
    /// *Argument 7 (input)*: The size of the public key data buffer.  
    /// *Argument 8 (output)*: The obtained attestation document.  
    /// *Argument 9 (input / output)*: The document buffer capacity (as input)
    /// and the size of the received document (as output).  
    /// *Returns*: The status of the operation.

    int status;
    uint8_t att_doc_buff[NSM_MAX_ATTESTATION_DOC_SIZE];
    uint32_t att_doc_cap_and_size = NSM_MAX_ATTESTATION_DOC_SIZE;

    status = nsm_get_attestation_doc(nsm_fd, NULL, 0, NULL, 0, NULL, 0, att_doc_buff, 
                                    &att_doc_cap_and_size);
    if (status != ERROR_CODE_SUCCESS) {
        printf("[Error] Request::Attestation got invalid response: %d\n", status); // status is an int
        exit(1);
    }

    printf("########## attestation_document_buff ##########\r\n");
    for(uint32_t i=0; i<att_doc_cap_and_size; i++)
        fprintf(stdout, "%02X", att_doc_buff[i]);

    exit(0);
}

To produce a sample attestation document, initialize the device, call the function ‘nsm_get_attestation_doc’ inside the AWS Nitro Enclaves, and dump the contents. The library is written using Rust, but it contains bindings for C. You can read more about the library and some of the other relevant capabilities here.
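The program above prints the document as hexadecimal text, while the validation example later in this post loads a binary file. A minimal sketch of a converter from the hex dump back to raw bytes could look like the following. The function names `hex_nibble` and `hex_to_bytes` are hypothetical helpers introduced for illustration and are not part of the NSM library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one hexadecimal character to its value, or -1 if it is invalid. */
static int hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode hex_len hex characters into out (capacity out_cap bytes).
 * Returns the number of bytes written, or -1 when the input has an odd
 * length, contains a non-hex character, or does not fit in out. */
long hex_to_bytes(const char *hex, size_t hex_len, uint8_t *out, size_t out_cap) {
    if (hex_len % 2 != 0 || hex_len / 2 > out_cap)
        return -1;
    for (size_t i = 0; i < hex_len / 2; i++) {
        int hi = hex_nibble(hex[2 * i]);
        int lo = hex_nibble(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (uint8_t)(hi << 4 | lo);
    }
    return (long)(hex_len / 2);
}
```

The decoded bytes can then be written to a file with `fwrite` and passed to a validator as its first argument.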

The COSE headers contain a protected and an un-protected data section. The cryptographic algorithm used for the signature is specified inside the protected area. AWS Nitro Enclaves use a 384-bit elliptic curve algorithm (P-384) to sign attestation documents. AWS Nitro Enclaves do not use the unprotected data field so it is always left blank.

The payload contains the following fixed parameters:

  • Information about the issuing NSM.
  • A timestamp of the issuing event.
  • A map of all the locked Platform Configuration Registers (PCRs) at the moment the attestation document was generated.
  • The hashing algorithm used to produce the digest from which the PCR values were calculated. AWS Nitro Enclaves use a 384-bit secure hashing algorithm (SHA384).
  • An x509 certificate signed by AWS Nitro Enclaves’ Private Public Key Infrastructure (PKI). An AWS Nitro Enclaves certificate expires three hours after it has been issued, and its common name (CN) contains information about the issuing NSM.
  • The issuing Certificate Authority (CA) bundle.

The payload also contains optional parameters that a third-party application can use to create custom authentication and authorization workflows. The optional parameters are: a public key, a cryptographic nonce, and additional arbitrary data.

Finally, the signature is the result of a signing operation using the private key related to the public key contained inside the certificate that is part of the payload.

Diagram that illustrates the components of an attestation document produced by a Nitro Enclave

Fig 2. An attestation document is generated and signed by the Nitro Hypervisor. It contains information about the Nitro Enclaves and it can be used by an external service to verify the identity of Nitro Enclaves and to establish trust. You can use the attestation document to build your own cryptographic attestation mechanisms.

Syntactical validation

Early validation of the attestation document format makes sure that only documents that conform to the expected structure are processed in subsequent steps.

I start by attempting to decode the CBOR object and testing to see if it corresponds to a COSE object signed with one signer, or ‘COSE_Sign1’ structure. This can be easily done by looking at the three most significant bits (MSB) of the first byte, where I am expecting a CBOR tag (major type 6). Then, I take the remaining five least significant bits (LSB) of the first byte, where I am expecting the tag value that tells me it is a COSE_Sign1 object (decimal 18).

assert(att_doc_buff[0] == (6<<5 | 18)); // 0xD2 *

* Note that at the time of writing, the NSM does not include the COSE tag, so this validation cannot be made and is mentioned in this post for informational purposes only. However, it is important to keep this in mind, as the tag is part of the standard, and the NSM device or library could include it in the future.

The next step is to parse the actual CBOR object. A COSE_Sign1 object is an array of size 4 (protected headers, unprotected headers, payload, and signature). Therefore, I must check that the three MSB of the first byte correspond to Type 4 (array) and that the size in the remaining five bits is exactly 4.

assert(att_doc_buff[0] == (4<<5 | 4)); // 0x84
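The split between the three type bits and the five remaining bits can be sketched as two small helpers. The names `hdr_major_type` and `hdr_additional_info` are hypothetical and chosen only for this illustration; they are not part of libcbor or the NSM library.

```c
#include <assert.h>
#include <stdint.h>

/* The three most significant bits of a CBOR initial byte hold the
 * major type (for example 2 = byte string, 4 = array, 5 = map, 6 = tag). */
uint8_t hdr_major_type(uint8_t initial_byte) {
    return (uint8_t)(initial_byte >> 5);
}

/* The five least significant bits hold the additional information:
 * a small count or value, or a marker (24, 25, ...) saying that the
 * count follows in the next 1, 2, ... bytes. */
uint8_t hdr_additional_info(uint8_t initial_byte) {
    return (uint8_t)(initial_byte & 0x1F);
}
```

For the byte 0x84, the helpers return 4 (array) and 4 (four items). When writing the comparison inline instead, keep in mind that in C the `==` operator binds more tightly than `|`, so the right-hand side needs parentheses, as in `byte == (4<<5 | 4)`.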

The next byte determines what the first CBOR item of the array looks like. I am expecting the protected COSE header as the first item of the array. The CBOR field should indicate that the contents of the item are of a Type 2 (raw bytes) and the size should be exactly 4.

assert(att_doc_buff[1] == (2<<5 | 4)); // 0x44

The next four bytes represent the protected header. The contents of this item are a regular CBOR object. The object should be a Type 5 (map) with a single item (1). The item’s key is expected to be the number 1. For the item’s value, the three MSB of the first byte should be a Type 1 (negative integer), and the remaining five LSB should indicate that the value is an 8-bit number (decimal 24). The last byte should encode negative 35, as it maps to the P-384 curve that Nitro Enclaves use. Note that CBOR stores a negative number n as -1 - n, so -35 is stored as 34 (0x22).

assert(att_doc_buff[2] == (5<<5 | 1)); // 0xA1
assert(att_doc_buff[3] == 0x01); // 0x01
assert(att_doc_buff[4] == (1<<5 | 24)); // 0x38
assert(att_doc_buff[5] == 35-1); // 0x22
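To make the “stored minus 1” rule concrete, here is a minimal sketch that decodes a CBOR negative integer carried in a one-byte argument. The function name `decode_nint8` is a hypothetical helper used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a CBOR negative integer whose argument is one byte long:
 * initial byte 0x38 (major type 1, additional information 24)
 * followed by a byte n. The encoded value is -1 - n.
 * Returns 0 on success, or -1 if the initial byte does not match. */
int decode_nint8(const uint8_t bytes[2], int *value) {
    if (bytes[0] != (1 << 5 | 24))   /* 0x38 */
        return -1;
    *value = -1 - (int)bytes[1];
    return 0;
}
```

Applied to the two bytes 0x38 0x22 from the protected header, the function yields -35.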

The next byte corresponds to the unprotected header. AWS Nitro Enclaves do not use unprotected headers. Therefore, I expect a Type 5 (map) with zero items.

assert(att_doc_buff[6] == (5<<5 | 0)); // 0xA0

Now that I am done inspecting the headers, I can move onto the payload. The CBOR object used for the payload is Type 2 (raw bytes). This time I am expecting a large stream of bytes. The remaining five LSB indicate the data type used to store the size of the byte stream (e.g. 8-bit, 16-bit). AWS Nitro Enclaves attestation documents are about 5 KiB without using any of the three optional parameters. The optional parameters have a size limit of 1 KiB each. This means that it would be highly unlikely for the size to need more than a 16-bit number (CBOR short count: 25).

assert(att_doc_buff[7] == (2<<5 | 25)); // 0x59

The next two bytes represent the size of the payload. For now, I use them only to skip over the payload, as its contents are validated in subsequent steps. I’ll move onto the final portion of the attestation document: the signature. The signature has to be a Type 2 (raw bytes) of exactly 96 bytes.

    uint16_t payload_size = att_doc_buff[8] << 8 | att_doc_buff[9];
    assert(att_doc_buff[9+payload_size+1] == (2<<5 | 24));   // 0x58
    assert(att_doc_buff[9+payload_size+1+1] == 96);         // 0x60

At this point, I have validated that the data produced by the NSM looks the way it should. My application is ready to start looking into the contents of the attestation document.
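The individual checks above index into the buffer without first confirming that it is long enough. As a sketch under that observation, the same byte-level checks can be gathered into one function that verifies the length before each access. The function name `check_cose_sign1_layout` and its return convention are assumptions made for this illustration, and the expected byte values are the ones described in this section for documents without the COSE tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Check that buf looks like an untagged COSE_Sign1 attestation document:
 *   0x84            array of 4 items
 *   0x44            byte string of 4 bytes (protected header)
 *   0xA1 0x01 0x38 0x22   map {1: -35}
 *   0xA0            empty map (unprotected header)
 *   0x59 hi lo      byte string with a 16-bit length (payload)
 *   ...payload...
 *   0x58 0x60       byte string of 96 bytes (signature)
 * On success returns 0 and reports where the payload starts and how long
 * it is. Returns -1 on any mismatch or if the buffer is too short.
 * Trailing bytes after the signature are not rejected here. */
int check_cose_sign1_layout(const uint8_t *buf, size_t len,
                            size_t *payload_off, size_t *payload_len) {
    static const uint8_t prefix[8] = { 0x84, 0x44, 0xA1, 0x01,
                                       0x38, 0x22, 0xA0, 0x59 };
    if (len < 10)                       /* prefix plus two length bytes */
        return -1;
    for (size_t i = 0; i < sizeof prefix; i++)
        if (buf[i] != prefix[i])
            return -1;

    size_t plen = (size_t)buf[8] << 8 | buf[9];   /* big-endian length */
    size_t sig_hdr = 10 + plen;                   /* first byte after payload */
    if (len < sig_hdr + 2 + 96)                   /* header bytes + signature */
        return -1;
    if (buf[sig_hdr] != 0x58 || buf[sig_hdr + 1] != 96)
        return -1;

    *payload_off = 10;
    *payload_len = plen;
    return 0;
}
```

Returning an error code instead of calling `assert` lets the caller reject a malformed document gracefully, and it keeps the checks active in builds compiled with `NDEBUG`.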

I want to make sure that the document contains all mandatory fields and I can check that the fields have the right structure and their sizes are within the expected boundaries. I have evidence that the data looks the way it should, so I am ready to use an off-the-shelf CBOR library to make the validation process easier instead of doing it by hand.

Here is an example of how to load a CBOR object using libcbor and standard C libraries to check the contents. I am showing just one example to illustrate the process. Refer to the section ‘Verifying the root of trust’ in the AWS Nitro Enclaves User Guide for a detailed description of each parameter and the validations that your application should perform to make sure that the document is valid.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>   // uint8_t, uint16_t
#include <string.h>
#include <assert.h>

#include <cbor.h>
#include <openssl/ssl.h>

#define APP_X509_BUFF_LEN                   (1024*2)
#define APP_ATTDOC_BUFF_LEN                 (1024*10)

void output_handler(char * msg){
    fprintf(stdout, "\r\n%s\r\n", msg);
}

void output_handler_bytes(uint8_t * buffer, int buffer_size){    
    for(int i=0; i<buffer_size; i++)
        fprintf(stdout, "%02X", buffer[i]);
    fprintf(stdout, "\r\n");
}

int read_file( unsigned char * file, char * file_name, size_t elements) {
    FILE * fp; size_t file_len = 0;
    fp = fopen(file_name, "rb"); // binary mode: the attestation document is raw CBOR
    if (fp == NULL) {
        fputs("Error opening file", stderr);
        exit(1);
    }
    file_len = fread(file, sizeof(char), elements, fp);
    if (ferror(fp) != 0 ) {
        fputs("Error reading file", stderr);
    } 
    fclose(fp);
    return file_len; 
}  

int main(int argc, char* argv[]) {

    // STEP 0 - LOAD ATTESTATION DOCUMENT

    // Check inputs, expect two
    if (argc != 3) { 
        fprintf(stderr, "%s\r\n", "ERROR: usage: ./main {att_doc_sample.bin} {AWS_NitroEnclaves_Root-G1.pem}"); exit(1);
    }

    // Load file into buffer, use 1st argument
    unsigned char * att_doc_buff = malloc(APP_ATTDOC_BUFF_LEN);
    int att_doc_len = read_file(att_doc_buff, argv[1], APP_ATTDOC_BUFF_LEN );

    // STEP 1 - SYNTACTIC VALIDATION

    // Check COSE TAG (skipping - not currently implemented by AWS Nitro Enclaves)
    // assert(att_doc_buff[0] == (6<<5 | 18)); // 0xD2
    // Check if this is an array of exactly 4 items
    assert(att_doc_buff[0] == (4<<5 | 4));      // 0x84
    // Check if next item is a byte stream of 4 bytes
    assert(att_doc_buff[1] == (2<<5 | 4));      // 0x44
    // Check if the first item of the byte stream is a map with 1 item
    assert(att_doc_buff[2] == (5<<5 | 1));      // 0xA1
    // Check that the first key of the map is 0x01
    assert(att_doc_buff[3] == 0x01);            // 0x01
    // Check that the value of the first key of the map is -35 (P-384 curve)
    assert(att_doc_buff[4] == (1 <<5 | 24));    // 0x38
    assert(att_doc_buff[5] == 35-1);            // 0x22
    // Check that next item is a map of 0 items
    assert(att_doc_buff[6] == (5<<5 | 0));      // 0xA0
    // Check that the next item is a byte stream and the size is a 16-bit number (dec. 25)
    assert(att_doc_buff[7] == (2<<5 | 25));     // 0x59
    // Cast the 16-bit number
    uint16_t payload_size = att_doc_buff[8] << 8 | att_doc_buff[9];
    // Check that the item after the payload is a byte stream and the size is 8-bit number (dec. 24)
    assert(att_doc_buff[9+payload_size+1] == (2<<5 | 24));   // 0x58
    // Check that the size of the signature is exactly 96 bytes
    assert(att_doc_buff[9+payload_size+1+1] == 96);         // 0x60

    // Parse buffer using library
    struct cbor_load_result ad_result;
    cbor_item_t * ad_item = cbor_load(att_doc_buff, att_doc_len, &ad_result);
    free(att_doc_buff); // not needed anymore

    // Parse protected header -> item 0 
    cbor_item_t * ad_pheader = cbor_array_get(ad_item, 0); 
    size_t ad_pheader_len = cbor_bytestring_length(ad_pheader);

    // Parse signed bytes -> item 2 (skip un-protected headers as they are always empty)
    cbor_item_t * ad_signed = cbor_array_get(ad_item, 2);
    size_t ad_signed_len = cbor_bytestring_length(ad_signed);

    // Load signed bytes as a new CBOR object
    unsigned char * ad_signed_d = cbor_bytestring_handle(ad_signed);
    struct cbor_load_result ad_signed_result;
    cbor_item_t * ad_signed_item = cbor_load(ad_signed_d, ad_signed_len, &ad_signed_result);

    // Create the pair structure
    struct cbor_pair * ad_signed_item_pairs = cbor_map_handle(ad_signed_item);

    // Parse signature -> item 3
    cbor_item_t * ad_sig = cbor_array_get(ad_item, 3); 
    size_t ad_sig_len = cbor_bytestring_length(ad_sig);
    unsigned char * ad_sig_d = cbor_bytestring_handle(ad_sig);

    // Example 01: Check that the first item's key is the string "module_id" and that is not empty
    size_t module_k_len = cbor_string_length(ad_signed_item_pairs[0].key);
    unsigned char * module_k_str = realloc(cbor_string_handle(ad_signed_item_pairs[0].key), module_k_len+1); //null char
    module_k_str[module_k_len] = '\0';
    size_t module_v_len = cbor_string_length(ad_signed_item_pairs[0].value);
    unsigned char * module_v_str = realloc(cbor_string_handle(ad_signed_item_pairs[0].value), module_v_len+1); //null char
    module_v_str[module_v_len] = '\0';
    assert(module_k_len != 0);
    assert(module_v_len != 0);

    // Example 02: Check that the module id key is actually the string "module_id"
    assert(!strcmp("module_id",(const char *)module_k_str));

    // Example 03: Check that the signature is exactly 96 bytes long
    assert(ad_sig_len == 96);

    // Example 04: Check that the protected header is exactly 4 bytes long
    assert(ad_pheader_len == 4);

Semantic validation

The next step is to look at the data contained in the attestation document and check if it conforms to pre-defined business rules. The attestation document contains a certificate that was signed by the AWS Nitro Enclaves’ PKI. This validation is important, as it proves that the document was signed by the AWS Nitro Enclaves’ PKI.

The signature of an x509 certificate is based on the certificate’s payload digest. Validating this signature means that I trust the information contained within the certificate, including the public key which I can later use to validate the attestation document itself. Furthermore, the information in the document contains details about the NSM module and a timestamp. Passing this check provides the assurances I need to trust that the document originated from my software running on AWS Nitro Enclaves at a specific time.

Diagram that illustrates the components of an x.509 certificate, part of the payload of an attestation document produced by AWS Nitro Enclaves.

Fig 3. The attestation document contains an x.509 certificate that was signed by the AWS Nitro Enclaves’ PKI.

Here is an example of how I load the AWS Nitro Enclaves’ Private PKI root certificate from an external file and then use the CA bundle contained in the attestation document to validate the authenticity of the certificate contained in the document. In this example, I am using the OpenSSL library.

// STEP 2 -  SEMANTIC VALIDATION

    // Load AWS Nitro Enclave's Private PKI root certificate
    unsigned char * x509_root_ca = malloc(APP_X509_BUFF_LEN);
    int x509_root_ca_len = read_file(x509_root_ca, argv[2], APP_X509_BUFF_LEN );
    BIO * bio = BIO_new_mem_buf((void*)x509_root_ca, x509_root_ca_len);
    X509 * caX509 = PEM_read_bio_X509(bio, NULL, NULL, NULL);
    if (caX509 == NULL) {
        fprintf(stderr, "%s\r\n", "ERROR: PEM_read_bio_X509 failed"); exit(1);
    }
    // A BIO must be released with BIO_free, not free
    BIO_free(bio); free(x509_root_ca);
    // Create CA_STORE
    X509_STORE * ca_store = NULL;
    ca_store = X509_STORE_new();
    /* ADD X509_V_FLAG_NO_CHECK_TIME FOR TESTING! TODO REMOVE */
    X509_STORE_set_flags (ca_store, X509_V_FLAG_NO_CHECK_TIME);
    if (X509_STORE_add_cert(ca_store, caX509) != 1) {
        fprintf(stderr, "%s\r\n", "ERROR: X509_STORE_add_cert failed"); exit(1);
    }
    // Add certificates to CA_STORE from cabundle
    // Skip the first one [0] as that is the Root CA and we want to read it from an external source
    for (int i = 1; i < cbor_array_size(ad_signed_item_pairs[5].value); ++i){ 
        cbor_item_t * ad_cabundle = cbor_array_get(ad_signed_item_pairs[5].value, i); 
        size_t ad_cabundle_len = cbor_bytestring_length(ad_cabundle);
        unsigned char * ad_cabundle_d = cbor_bytestring_handle(ad_cabundle);
        X509 * cabnX509 = X509_new();
        cabnX509 = d2i_X509(&cabnX509, (const unsigned char **)&ad_cabundle_d, ad_cabundle_len);
        if (cabnX509 == NULL) {
            fprintf(stderr, "%s\r\n", "ERROR: d2i_X509 failed"); exit(1);
        }
        if (X509_STORE_add_cert(ca_store, cabnX509) != 1) {
            fprintf(stderr, "%s\r\n", "ERROR: X509_STORE_add_cert failed"); exit(1);
        }
    }

    // Load certificate from attestation document - this is a certificate that we don't trust (yet)
    size_t ad_signed_cert_len = cbor_bytestring_length(ad_signed_item_pairs[4].value);
    // Read the libcbor-owned buffer in place; d2i_X509 only advances this local copy of the pointer
    const unsigned char * ad_signed_cert_d = cbor_bytestring_handle(ad_signed_item_pairs[4].value);
    X509 * pX509 = d2i_X509(NULL, &ad_signed_cert_d, ad_signed_cert_len);
    if (pX509 == NULL) {
        fprintf(stderr, "%s\r\n", "ERROR: d2i_X509 failed"); exit(1);
    }
    // Initialize X509 store context and verify untrusted certificate
    STACK_OF(X509) * ca_stack = NULL;
    X509_STORE_CTX * store_ctx = X509_STORE_CTX_new();
    if (X509_STORE_CTX_init(store_ctx, ca_store, pX509, ca_stack) != 1) {
        fprintf(stderr, "%s\r\n", "ERROR: X509_STORE_CTX_init failed"); exit(1);
    }
    if (X509_verify_cert(store_ctx) != 1) {
        fprintf(stderr, "%s\r\n", "ERROR: X509_verify_cert failed"); exit(1);
    }
    fprintf(stdout, "%s\r\n", "OK: ########## Root of Trust Verified! ##########");

Having proof that the certificate was signed by the expected CA is just the beginning. I also want to make sure that the contents of the certificate are correct. This involves checking that the certificate has not expired and that the critical extensions contain correct information, to name a few.

Cryptographic validation

The syntactic validation helped me determine that the attestation document has the right shape, and the semantic validation helped me determine if the document meets my business rules. However, I still don’t know for sure if the document is valid.

The attestation document contains critical information, such as PCRs and the AWS Identity and Access Management (IAM) role, among other details. I can safely use these two values in my authentication or authorization workflows if I can prove that they are trustworthy.

The attestation document was signed using a private key that is never exposed. However, the corresponding public key is contained within the certificate that was issued and stored within the attestation document. I know I can trust the contents of this certificate because I have proof that the certificate was signed by an entity that I trust.

Here is an example where I cryptographically prove that all the protected contents of the attestation document are related to the public key contained in the certificate. To validate the COSE signature, I must first recreate the original message that was used during the signature operation, because COSE signs over a specific format. Then, I use OpenSSL to check if there is a match between the message, signature, and public key. If the signature checks out, then I can trust the contents of the already semantically-verified payload.

 // STEP 3 - CRYPTOGRAPHIC VALIDATION

    #define SIG_STRUCTURE_BUFFER_S (1024*10)
    // Load the public key from the certificate in the attestation document (we trust it now).
    // X509_get_pubkey returns a new reference, so no EVP_PKEY or EC_KEY has to be created beforehand.
    EVP_PKEY * pkey = X509_get_pubkey(pX509);
    if (pkey == NULL) {
        fprintf(stderr, "%s\r\n", "ERROR: X509_get_pubkey failed"); exit(1);
    }
    // Allocate, initialize and return a digest context
    EVP_MD_CTX * ctx = EVP_MD_CTX_create();
    // Set up verification context
    if (EVP_DigestVerifyInit(ctx, NULL, EVP_sha384(), NULL, pkey) <= 0) {
        fprintf(stderr, "%s\r\n", "ERROR: EVP_DigestVerifyInit failed"); exit(1);
    }
    // Recreate the structure that COSE_Sign1 signs over, and serialize it into a buffer
    cbor_item_t * cose_sig_arr = cbor_new_definite_array(4);
    cbor_item_t * cose_sig_arr_0_sig1 = cbor_build_string("Signature1"); 
    cbor_item_t * cose_sig_arr_2_empty = cbor_build_bytestring(NULL, 0);

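    // Keep the pushes outside assert(): with NDEBUG defined, assert() and its argument are compiled out
    if (!cbor_array_push(cose_sig_arr, cose_sig_arr_0_sig1) ||
        !cbor_array_push(cose_sig_arr, ad_pheader) ||
        !cbor_array_push(cose_sig_arr, cose_sig_arr_2_empty) ||
        !cbor_array_push(cose_sig_arr, ad_signed)) {
        fprintf(stderr, "%s\r\n", "ERROR: cbor_array_push failed"); exit(1);
    }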

    unsigned char sig_struct_buffer[SIG_STRUCTURE_BUFFER_S];
    size_t sig_struct_buffer_len = cbor_serialize(cose_sig_arr, sig_struct_buffer, SIG_STRUCTURE_BUFFER_S);
    // Hash message and load it into the verification context
    if (EVP_DigestVerifyUpdate(ctx, sig_struct_buffer, sig_struct_buffer_len) <= 0) {
        fprintf(stderr, "%s\r\n", "ERROR: EVP_DigestVerifyUpdate failed"); exit(1);
    }
    // Create R and S BIGNUM structures (the raw signature is R || S, 48 bytes each for P-384)
    BIGNUM * sig_r = BN_new(); BIGNUM * sig_s = BN_new();
    BN_bin2bn(ad_sig_d, 48, sig_r); BN_bin2bn(ad_sig_d + 48, 48, sig_s);
    // Allocate an empty ECDSA_SIG structure
    ECDSA_SIG * ec_sig = ECDSA_SIG_new();
    // Set R and S values (ec_sig takes ownership of both BIGNUMs)
    ECDSA_SIG_set0(ec_sig, sig_r, sig_s);
    // Convert R and S values into DER format
    int sig_size = i2d_ECDSA_SIG(ec_sig, NULL);
    unsigned char * sig_bytes = malloc(sig_size); unsigned char * p;
    memset(sig_bytes, 0xFF, sig_size); // plain memset: memset_s (C11 Annex K) is not available on every platform
    p = sig_bytes;
    sig_size = i2d_ECDSA_SIG(ec_sig, &p);
    // Verify the data in the context against the signature and get final result
    if (EVP_DigestVerifyFinal(ctx, sig_bytes, sig_size) != 1) {
        fprintf(stderr, "%s\r\n", "ERROR: EVP_DigestVerifyFinal failed"); exit(1);
    } else {
        fprintf(stdout, "%s\r\n", "OK: ########## Message Verified! ##########"); 
        free(sig_bytes);
        exit(0);
    }
    //#endif

    exit(1);

}
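The "specific format" that COSE signs over is the Sig_structure described in the COSE specification (RFC 9052, section 4.4): a four-element CBOR array holding the context string "Signature1", the protected header bytes, an empty external data byte string, and the payload. The listing above builds it with libcbor. As a rough illustration of the bytes that end up in the buffer, the following sketch serializes the same structure by hand in plain C. The helper names (`put_head`, `put_item`, `build_sig_structure`) are hypothetical and not part of the original sample, and the sketch assumes every item is shorter than 65,536 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write a CBOR head (major type plus length). Returns bytes written, or 0 if it does not fit. */
static size_t put_head(uint8_t *out, size_t cap, uint8_t major, size_t len)
{
    uint8_t mt = (uint8_t)(major << 5);
    if (len < 24) {
        if (cap < 1) return 0;
        out[0] = (uint8_t)(mt | len);
        return 1;
    }
    if (len <= 0xFF) {
        if (cap < 2) return 0;
        out[0] = (uint8_t)(mt | 24);
        out[1] = (uint8_t)len;
        return 2;
    }
    if (len <= 0xFFFF) {
        if (cap < 3) return 0;
        out[0] = (uint8_t)(mt | 25);
        out[1] = (uint8_t)(len >> 8);
        out[2] = (uint8_t)(len & 0xFF);
        return 3;
    }
    return 0; /* larger items are out of scope for this sketch */
}

/* Write a byte string (major type 2) or text string (major type 3). Returns bytes written, or 0. */
static size_t put_item(uint8_t *out, size_t cap, uint8_t major,
                       const uint8_t *data, size_t len)
{
    size_t h = put_head(out, cap, major, len);
    if (h == 0 || cap - h < len) return 0;
    if (len > 0) memcpy(out + h, data, len);
    return h + len;
}

/* Serialize ["Signature1", protected, h'', payload]. Returns total length, or 0 on overflow. */
size_t build_sig_structure(uint8_t *out, size_t cap,
                           const uint8_t *prot, size_t prot_len,
                           const uint8_t *payload, size_t payload_len)
{
    static const uint8_t context[] = "Signature1";
    size_t off = 0, n;

    if (cap < 1) return 0;
    out[off++] = 0x84; /* array of 4 items */

    n = put_item(out + off, cap - off, 3, context, sizeof context - 1);
    if (n == 0) return 0;
    off += n;
    n = put_item(out + off, cap - off, 2, prot, prot_len);       /* protected header */
    if (n == 0) return 0;
    off += n;
    n = put_item(out + off, cap - off, 2, NULL, 0);              /* empty external data */
    if (n == 0) return 0;
    off += n;
    n = put_item(out + off, cap - off, 2, payload, payload_len); /* signed payload */
    if (n == 0) return 0;
    off += n;
    return off;
}
```

The returned buffer plays the same role as `sig_struct_buffer` in the listing above: it is the message that gets hashed and checked against the signature. In practice a CBOR library such as libcbor is the safer choice; the hand-rolled version is only meant to make the wire format visible.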

Conclusion

In this post, I went through a detailed examination of attestation documents produced by the AWS Nitro Enclaves. Then, I went over different types of validations (syntactic, semantic, and cryptographic) that safely help determine if an attestation document should be trusted. I’ve also included access to a public repository that contains the source code used in this post. New AWS Nitro Enclaves users can use it as a starting point when looking to integrate their applications with AWS Nitro Enclaves and build highly secure and confidential solutions.

Three ways to accelerate incident response in the cloud: insights from re:Inforce 2023

Post Syndicated from Anne Grahn original https://aws.amazon.com/blogs/security/three-ways-to-accelerate-incident-response-in-the-cloud-insights-from-reinforce-2023/

AWS re:Inforce took place in Anaheim, California, on June 13–14, 2023. AWS customers, partners, and industry peers participated in hundreds of technical and non-technical security-focused sessions across six tracks, an Expo featuring AWS experts and AWS Security Competency Partners, and keynote and leadership sessions.

The threat detection and incident response track showcased how AWS customers can get the visibility they need to help improve their security posture, identify issues before they impact business, and investigate and respond quickly to security incidents across their environment.

With dozens of service and feature announcements—and innumerable best practices shared by AWS experts, customers, and partners—distilling highlights is a challenge. From an incident response perspective, three key themes emerged.

Proactively detect, contextualize, and visualize security events

When it comes to effectively responding to security events, rapid detection is key. Among the launches announced during the keynote was the expansion of Amazon Detective finding groups to include Amazon Inspector findings in addition to Amazon GuardDuty findings.

Detective, GuardDuty, and Inspector are part of a broad set of fully managed AWS security services that help you identify potential security risks, so that you can respond quickly and confidently.

Using machine learning, Detective finding groups can help you conduct faster investigations, identify the root cause of events, and map to the MITRE ATT&CK framework to quickly run security issues to ground. The finding group visualization panel shown in the following figure displays findings and entities involved in a finding group. This interactive visualization can help you analyze, understand, and triage the impact of finding groups.

Figure 1: Detective finding groups visualization panel

With the expanded threat and vulnerability findings announced at re:Inforce, you can prioritize where to focus your time by answering questions such as “was this EC2 instance compromised because of a software vulnerability?” or “did this GuardDuty finding occur because of unintended network exposure?”

In the session Streamline security analysis with Amazon Detective, AWS Principal Product Manager Rich Vorwaller, AWS Senior Security Engineer Rima Tanash, and AWS Program Manager Jordan Kramer demonstrated how to use graph analysis techniques and machine learning in Detective to identify related findings and resources, and investigate them together to accelerate incident analysis.

In addition to Detective, you can also use Amazon Security Lake to contextualize and visualize security events. Security Lake became generally available on May 30, 2023, and several re:Inforce sessions focused on how you can use this new service to assist with investigations and incident response.

As detailed in the following figure, Security Lake automatically centralizes security data from AWS environments, SaaS providers, on-premises environments, and cloud sources into a purpose-built data lake stored in your account. Security Lake makes it simpler to analyze security data, gain a more comprehensive understanding of security across an entire organization, and improve the protection of workloads, applications, and data. Security Lake automates the collection and management of security data from multiple accounts and AWS Regions, so you can use your preferred analytics tools while retaining complete control and ownership over your security data. Security Lake has adopted the Open Cybersecurity Schema Framework (OCSF), an open standard. With OCSF support, the service normalizes and combines security data from AWS and a broad range of enterprise security data sources.

Figure 2: How Security Lake works

To date, 57 AWS security partners have announced integrations with Security Lake, and we now have more than 70 third-party sources, 16 analytics subscribers, and 13 service partners.

In Gaining insights from Amazon Security Lake, AWS Principal Solutions Architect Mark Keating and AWS Security Engineering Manager Keith Gilbert detailed how to get the most out of Security Lake. Addressing questions such as, “How do I get access to the data?” and “What tools can I use?,” they demonstrated how analytics services and security information and event management (SIEM) solutions can connect to and use data stored within Security Lake to investigate security events and identify trends across an organization. They emphasized how bringing together logs in multiple formats and normalizing them into a single format empowers security teams to gain valuable context from security data, and more effectively respond to events. Data can be queried with Amazon Athena, or pulled by Amazon OpenSearch Service or your SIEM system directly from Security Lake.

Build your security data lake with Amazon Security Lake featured AWS Product Manager Jonathan Garzon, AWS Product Solutions Architect Ross Warren, and Global CISO of Interpublic Group (IPG) Troy Wilkinson demonstrating how Security Lake helps address common challenges associated with analyzing enterprise security data, and detailing how IPG is using the service. Wilkinson noted that IPG’s objective is to bring security data together in one place, improve searches, and gain insights from their data that they haven’t been able to before.

“With Security Lake, we found that it was super simple to bring data in. Not just the third-party data and Amazon data, but also our on-premises data from custom apps that we built.” — Troy Wilkinson, global CISO, Interpublic Group

Use automation and machine learning to reduce mean time to response

Incident response automation can help free security analysts from repetitive tasks, so they can spend their time identifying and addressing high-priority security issues.

In How LLA reduces incident response time with AWS Systems Manager, telecommunications provider Liberty Latin America (LLA) detailed how they implemented a security framework to detect security issues and automate incident response in more than 180 AWS accounts accessed by internal stakeholders and third-party partners by using AWS Systems Manager Incident Manager, AWS Organizations, Amazon GuardDuty, and AWS Security Hub.

LLA operates in over 20 countries across Latin America and the Caribbean. After completing multiple acquisitions, LLA needed a centralized security operations team to handle incidents and notify the teams responsible for each AWS account. They used GuardDuty, Security Hub, and Systems Manager Incident Manager to automate and streamline detection and response, and they configured the services to initiate alerts whenever there was an issue requiring attention.

Speaking alongside AWS Principal Solutions Architect Jesus Federico and AWS Principal Product Manager Sarah Holberg, LLA Senior Manager of Cloud Services Joaquin Cameselle noted that when GuardDuty identifies a critical issue, it generates a new finding in Security Hub. This finding is then forwarded to Systems Manager Incident Manager through an Amazon EventBridge rule. This configuration helps ensure the involvement of the appropriate individuals associated with each account.

“We have deployed a security framework in Liberty Latin America to identify security issues and streamline incident response across over 180 AWS accounts. The framework that leverages AWS Systems Manager Incident Manager, Amazon GuardDuty, and AWS Security Hub enabled us to detect and respond to incidents with greater efficiency. As a result, we have reduced our reaction time by 90%, ensuring prompt engagement of the appropriate teams for each AWS account and facilitating visibility of issues for the central security team.” — Joaquin Cameselle, senior manager, cloud services, Liberty Latin America

How Citibank (Citi) advanced their containment capabilities through automation outlined how the National Institute of Standards and Technology (NIST) Incident Response framework is applied to AWS services, and highlighted Citi’s implementation of a highly scalable cloud incident response framework designed to support the 28 AWS services in their cloud environment.

After describing the four phases of the incident response process (preparation and prevention; detection and analysis; containment, eradication, and recovery; and post-incident activity), AWS ProServe Global Financial Services Senior Engagement Manager Harikumar Subramonion noted that, to fully benefit from the cloud, you need to embrace automation. Automation benefits the third phase of the incident response process by speeding up containment, and reducing mean time to response.

Citibank Head of Cloud Security Operations Elvis Velez and Vice President of Cloud Security Damien Burks described how Citi built the Cloud Containment Automation Framework (CCAF) from the ground up by using AWS Step Functions and AWS Lambda, enabling them to respond to events 24/7 without human error, and reduce the time it takes to contain resources from 4 hours to 15 minutes. Velez described how Citi uses adversary emulation exercises that use the MITRE ATT&CK Cloud Matrix to simulate realistic attacks on AWS environments, and continuously validate their ability to effectively contain incidents.

Innovate and do more with less

Security operations teams are often understaffed, making it difficult to keep up with alerts. According to data from CyberSeek, there are currently 69 workers available for every 100 cybersecurity job openings.

Effectively evaluating security and compliance posture is critical, despite resource constraints. In Centralizing security at scale with Security Hub and Intuit’s experience, AWS Senior Solutions Architect Craig Simon, AWS Senior Security Hub Product Manager Dora Karali, and Intuit Principal Software Engineer Matt Gravlin discussed how to ease security management with Security Hub. Fortune 500 financial software provider Intuit has approximately 2,000 AWS accounts, 10 million AWS resources, and receives 20 million findings a day from AWS services through Security Hub. Gravlin detailed Intuit’s Automated Compliance Platform (ACP), which combines Security Hub and AWS Config with an internal compliance solution to help Intuit reduce audit timelines, effectively manage remediation, and make compliance more consistent.

“By using Security Hub, we leveraged AWS expertise with their regulatory controls and best practice controls. It helped us keep up to date as new controls are released on a regular basis. We like Security Hub’s aggregation features that consolidate findings from other AWS services and third-party providers. I personally call it the super aggregator. A key component is the Security Hub to Amazon EventBridge integration. This allowed us to stream millions of findings on a daily basis to be inserted into our ACP database.” — Matt Gravlin, principal software engineer, Intuit

At AWS re:Inforce, we launched a new Security Hub capability for automating actions to update findings. You can now use rules to automatically update various fields in findings that match defined criteria. This allows you to automatically suppress findings, update the severity of findings according to organizational policies, change the workflow status of findings, and add notes. With automation rules, Security Hub provides you a simplified way to build automations directly from the Security Hub console and API. This reduces repetitive work for cloud security and DevOps engineers and can reduce mean time to response.

In Continuous innovation in AWS detection and response services, AWS Worldwide Security Specialist Senior Manager Himanshu Verma and GuardDuty Senior Manager Ryan Holland highlighted new features that can help you gain actionable insights that you can use to enhance your overall security posture. After mapping AWS security capabilities to the core functions of the NIST Cybersecurity Framework, Verma and Holland provided an overview of AWS threat detection and response services that included a technical demonstration.

Bolstering incident response with AWS Wickr enterprise integrations highlighted how incident responders can collaborate securely during a security event, even on a compromised network. AWS Senior Security Specialist Solutions Architect Wes Wood demonstrated an innovative approach to incident response communications by detailing how you can integrate the end-to-end encrypted collaboration service AWS Wickr Enterprise with GuardDuty and AWS WAF. Using Wickr Bots, you can build integrated workflows that incorporate GuardDuty and third-party findings into a more secure, out-of-band communication channel for dedicated teams.

Evolve your incident response maturity

AWS re:Inforce featured many more highlights on incident response, including How to run security incident response in your Amazon EKS environment and Investigating incidents with Amazon Security Lake and Jupyter notebooks code talks, as well as the announcement of our Cyber Insurance Partners program. Content presented throughout the conference made one thing clear: AWS is working harder than ever to help you gain the insights that you need to strengthen your organization’s security posture, and accelerate incident response in the cloud.

To watch AWS re:Inforce sessions on demand, see the AWS re:Inforce playlists on YouTube.

 
Anne Grahn

Anne is a Senior Worldwide Security GTM Specialist at AWS based in Chicago. She has more than a decade of experience in the security industry, and focuses on effectively communicating cybersecurity risk. She maintains a Certified Information Systems Security Professional (CISSP) certification.

Himanshu Verma

Himanshu is a Worldwide Specialist for AWS Security Services. In this role, he leads the go-to-market creation and execution for AWS Security Services, field enablement, and strategic customer advisement. Prior to AWS, he held several leadership roles in Product Management, engineering and development, working on various identity, information security, and data protection technologies. He obsesses over brainstorming disruptive ideas, venturing outdoors, photography, and trying various “hole in the wall” food and drinking establishments around the globe.

Jesus Federico

Jesus is a Principal Solutions Architect for AWS in the telecommunications vertical, working to provide guidance and technical assistance to communication service providers on their cloud journey. He supports CSPs in designing and implementing secure, resilient, scalable, and high-performance applications in the cloud.

How to Manage Global Sending of SMS with Amazon Pinpoint

Post Syndicated from Tyler Holmes original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-manage-global-sending-of-sms-with-amazon-pinpoint/

Amazon Pinpoint has a global SMS reach of 240 countries and regions, enabling companies of all sizes to send SMS around the world. Unlike the process of sending a personal message from your phone to someone in another country, sending Application to Person (A2P) messages, also known as bulk SMS, involves many more regulations and requirements that vary from country to country. In this post, we review best practices for sending SMS globally and share a selection of AWS resources to help you do so.

The first thing to understand about delivering SMS around the world is that it takes a vast network of components working together seamlessly. The image below gives a simple example of delivering an SMS in the United States. Mobile devices are at the center of this, connecting to mobile carriers or operators, who operate the infrastructure necessary for SMS transmission. Once you hit that send button from AWS, your message travels to an Aggregator, who has connections to Operators, Partners, and/or other Aggregators. The reason for this is that no single vendor delivers globally. AWS uses many Aggregators, which both enables us to send globally and improves the resiliency and deliverability of your messages. The last stop on the journey is the Short Message Service Center (SMSC), a central hub that receives, stores, and forwards text messages. The SMSC acts as a gateway, routing your message to the recipient’s carrier or operator through a series of interconnected networks, thanks to agreements between different carriers known as interconnection agreements. The entire process is facilitated by the Signaling System 7 (SS7), a set of protocols that enables the exchange of information between telecommunication networks, ensuring messages reach their intended recipients.

Diagram showing how SMS is delivered using aggregators

Every country has its own regulations and processes that you need to comply with in order to successfully deliver SMS to handsets that are registered to a particular country. There are some countries with little regulation and others that will block all SMS traffic unless it has been registered with the proper authorities.

Each country’s requirements include the origination identities (OIDs) that its networks support. These include long codes (standard phone numbers that typically have 10 or more digits), short codes (phone numbers that contain between four and seven digits), and Sender IDs (names that contain 6–11 alphanumeric characters). Each type of origination identity has unique benefits and drawbacks, and you will need one for each use case and country you plan on supporting. Here is a list of the countries that AWS currently sends to and the OIDs that are supported.
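The length and character rules above are simple enough to check when you fill in a planning sheet. The following is a minimal sketch, in C, of a hypothetical `classify_oid` helper that sorts a candidate string into one of the three OID types. It is not part of any AWS SDK. It assumes that a Sender ID contains at least one letter (so that it can be told apart from a number) and that a leading `+` is only valid on a long code. Actual per-country rules can be stricter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { OID_UNKNOWN, OID_SHORT_CODE, OID_LONG_CODE, OID_SENDER_ID } oid_type;

/* Classify a candidate origination identity by its shape only. */
oid_type classify_oid(const char *oid)
{
    size_t len, digits = 0, letters = 0;
    int has_plus;

    if (oid == NULL) return OID_UNKNOWN;
    has_plus = (oid[0] == '+');
    if (has_plus) oid++;              /* skip the international prefix */
    len = strlen(oid);

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)oid[i];
        if (isdigit(c)) digits++;
        else if (isalpha(c)) letters++;
        else return OID_UNKNOWN;      /* only alphanumeric characters are accepted */
    }

    if (letters == 0) {
        /* Short codes: four to seven digits, no '+' prefix */
        if (!has_plus && digits >= 4 && digits <= 7) return OID_SHORT_CODE;
        /* Long codes: standard phone numbers, typically 10 or more digits */
        if (digits >= 10) return OID_LONG_CODE;
        return OID_UNKNOWN;
    }
    /* Sender IDs: 6-11 alphanumeric characters (assumed here to include a letter) */
    if (!has_plus && len >= 6 && len <= 11) return OID_SENDER_ID;
    return OID_UNKNOWN;
}
```

For example, `classify_oid("12345")` yields `OID_SHORT_CODE` and `classify_oid("+12065550100")` yields `OID_LONG_CODE`. A shape check like this only catches typos; whether a given country supports the resulting type still has to be looked up in the supported-countries list.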

Pre-Planning and Country Selection

The first step in planning a global rollout of SMS is to know which countries you want to send to and what your use cases are. Put together a spreadsheet (Download Here: Global SMS Planning Sheet) for each unique use case you have and the countries you plan on sending to, with the following key details:

  • The volumes you expect to send to each country
  • The throughput (also referred to as messages per second, MPS, or transactions per second, TPS) at which you expect to deliver these messages
  • Whether your use case is one-way or two-way
    • Not all countries support two-way communication, which is the ability to have the recipient send a message back to the OID. Sender ID also does not support two-way communication, so if you plan to use Sender ID you will need to account for how to opt recipients out of future communications.
  • Leave a column for the Origination Identity you will use for each country
  • Leave a column for whether this country requires advanced registration
  • Leave a column for any country specific limitations or requirements such as language limitations
  • Leave a column for the estimated time it takes to register
    • This chart has estimates for common countries, but others also have lead time in procuring an OID, so please open a support case for review

Selecting an Origination Identity

Now that you have these details all in one place, consult this table to determine what OIDs each country supports and, if your use case requires it, which countries support two-way.

In countries where there are multiple options for OIDs there are several guidelines to consider when you’re deciding what type of origination identity to use:

  • Sender IDs are a great option for one-way use cases. However, they’re not available in all countries, and because they are one-way only, you will need to provide another way for your customers to opt out.
    • In some countries (such as India and Saudi Arabia), long codes can be used to receive incoming messages, but can’t be used to send outgoing messages. You can use these inbound-only long codes to provide your recipients with a way to opt out of messages that you send using a Sender ID.
  • Short codes are a great option for two-way use cases and have the highest throughput of all OIDs.
    • While short codes have a higher throughput, they also come at a much higher cost than other OIDs, so weigh the cost against your use case requirements.
  • In some countries, we maintain a pool of shared origination identities. If you send messages to recipients in a particular country, but you don’t have a dedicated origination identity in that country, we make an effort to deliver your message using one of these shared identities.
    • Shared identities are unavailable in some countries, including the United States and China.
    • Shared identities cannot be 2-way, so make sure you have a way of opting customers out of communication.

With these in mind, consult this guide to help you decide which OID to use for each country and use case. Update your sheet as you review each country. Many of our customers opt for a phased rollout: they enable SMS for the countries that do not require registration and can be put into production swiftly, while working through the registration process for the countries that require it and bringing those to production as they are approved. A phased approach is also preferred because it allows customers to monitor for deliverability problems with a smaller volume than their full production workload.

Procurement and Registration of Origination Identities

In countries where registration is onerous it is important to have a few things about your process all in one place. Some registrations are very similar in the information that they ask for while others have special processes that you need to follow. Examples include:

Once you have decided on your OIDs for each of your countries, you can begin the process of procuring them. Depending on where you plan on sending, you may need to open a case to procure them. For short codes you also need to open a case, but the process is slightly different, so review the documentation here. If you are having trouble deciding on OIDs, you may have the option of engaging with AWS Support or your Account Manager, depending on the support level you have opted for on your account.

Testing SMS Sending

Once you have procured OIDs and are ready to begin testing, it is essential that you set up a way of monitoring the events that Pinpoint generates. Pay attention to the Delivery Receipts (DLRs) that are returned into the event stream. These give you details on the success or failure of your sends. Pinpoint delivers all events via Amazon Kinesis, which needs to be enabled within each Project you are using. This is a common solution among our customers. It enables the stream, sends it to a user-specified S3 bucket, and sets up tables and views within Amazon Athena, our serverless SQL query engine. Kinesis can stream to many different destinations, including Redshift and HTTP endpoints, among many others. This gives you flexibility in how you deliver the events to their required locations. Monitoring SMS events is an important part of sending globally; these are the SMS Events that are possible to receive in your stream.

TPS limits can vary depending on the countries you’re sending to and the OIDs you’re using. If there’s a risk of exceeding these limits and triggering rate limiting errors, it’s crucial to devise a strategy for queuing your messages. Keep in mind, Amazon Pinpoint doesn’t offer queueing capabilities. Therefore, message queueing must be incorporated at your application level or by leveraging AWS services. For instance, you could deploy this commonly used architecture that’s adjustable according to your specific use case.
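One common way to stay under a TPS limit at the application level is a token bucket placed in front of the send call. The sketch below is a minimal, single-threaded illustration in C. The type `tps_limiter` and its functions are hypothetical names, not part of any AWS SDK, and the TPS and burst values you configure would come from the limits of your own OIDs. The caller supplies the current time in milliseconds, which keeps the logic independent of any particular clock API.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t tps;          /* allowed messages per second */
    uint32_t burst;        /* maximum messages that may be sent back to back */
    uint64_t millitokens;  /* available budget, in thousandths of a message */
    uint64_t last_ms;      /* time of the last refill, in milliseconds */
} tps_limiter;

/* Start with a full bucket. */
void tps_limiter_init(tps_limiter *l, uint32_t tps, uint32_t burst, uint64_t now_ms)
{
    l->tps = tps;
    l->burst = burst;
    l->millitokens = (uint64_t)burst * 1000;
    l->last_ms = now_ms;
}

/* Returns 1 if one message may be sent now, 0 if it should stay in the queue. */
int tps_limiter_try_send(tps_limiter *l, uint64_t now_ms)
{
    uint64_t cap = (uint64_t)l->burst * 1000;

    if (now_ms > l->last_ms) {
        /* elapsed milliseconds x messages per second = thousandths of a message earned */
        l->millitokens += (now_ms - l->last_ms) * l->tps;
        if (l->millitokens > cap) l->millitokens = cap;
        l->last_ms = now_ms;
    }
    if (l->millitokens < 1000) return 0;
    l->millitokens -= 1000;
    return 1;
}
```

A worker loop would call `tps_limiter_try_send` before each send and leave the message in its queue when the call returns 0. One limiter per OID and destination country is a reasonable starting point, since limits differ along both dimensions.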

Once you have your monitoring solution in place, you are ready to begin testing sends to real destination phone numbers. Keep in mind that at this point you are likely still in the Sandbox for SMS. This means you have much lower quotas for sending and can only send to verified phone numbers or the SMS Simulator numbers. Pinpoint includes an SMS simulator, which you can use to send text messages to simulated phone numbers in 51 commonly used destination countries and receive realistic event records. Messages sent to these destination phone numbers are not sent over the carrier network but do incur the standard outbound SMS messaging rate for the country that the simulated phone number is based in.

Best Practices for Sending

There are two common ways of sending SMS via Pinpoint. The first option is the Pinpoint API using the SendMessages Action, with which you can send a direct message to as many as 100 recipients at a time. The second option is to use the SMS and Voice v2 API and the SendTextMessage Action, which has more options available to configure your sends and sends to a single recipient with each call. The V2 API is the preferred way of sending, as it allows for more fine-grained control over your messages and is the API upon which new functionality will be built. Keep in mind that sending via the API does not attribute any metrics back to an endpoint unless you specify an endpoint ID in your call. If you are using other features of Pinpoint such as campaigns or journeys, or sending via other channels such as email, you will need to consider your strategy for measuring success and how you will tie all of your communication efforts together.
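For the first option, a recipient list longer than 100 numbers has to be split across several SendMessages calls. The sketch below shows one way to do that with the boto3 Pinpoint client. The helper names build_message_requests and send_batches are hypothetical, and the client is passed in rather than created here, so nothing is sent until you supply a real client and Project (application) ID.

```python
from typing import Dict, Iterator, List

# Limit of 100 recipients per SendMessages call, as described above.
MAX_ADDRESSES_PER_CALL = 100


def build_message_requests(
    numbers: List[str], body: str, message_type: str = "TRANSACTIONAL"
) -> Iterator[Dict]:
    """Yield one MessageRequest dict per batch of up to 100 phone numbers."""
    for start in range(0, len(numbers), MAX_ADDRESSES_PER_CALL):
        batch = numbers[start:start + MAX_ADDRESSES_PER_CALL]
        yield {
            # Each destination number is a key with its channel type.
            "Addresses": {number: {"ChannelType": "SMS"} for number in batch},
            "MessageConfiguration": {
                "SMSMessage": {"Body": body, "MessageType": message_type}
            },
        }


def send_batches(pinpoint_client, application_id: str, numbers: List[str], body: str) -> int:
    """Call SendMessages once per batch and return the number of calls made."""
    calls = 0
    for request in build_message_requests(numbers, body):
        pinpoint_client.send_messages(
            ApplicationId=application_id, MessageRequest=request
        )
        calls += 1
    return calls
```

In real use, the client would come from boto3.client("pinpoint"). Because the example makes no attempt at pacing, combine it with whatever queueing strategy you use to stay within your TPS limits.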

When sending SMS, Pinpoint includes logic for selecting the best OID to send from based on the country code. If there are multiple OIDs available to send to a particular country, Pinpoint will default to the highest throughput OID available in your Account/Region. If there are no OIDs specific to the country being sent to, Pinpoint will default to SenderID or to a shared OID owned by Pinpoint, in that order, if the country allows these OIDs to be used. Given this functionality, the best practice for sending SMS is to not specify the OID needed to send to a specific country and to allow Pinpoint to select. You can restrict Pinpoint to send only to those countries that you have OIDs for by using Pools and turning off Shared Routes. More on this below.
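The selection order described above can be pictured with a small model. This is only an illustration of the order stated in the text, not Pinpoint's actual implementation. The OriginationIdentity data class, its fields, and the kind labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OriginationIdentity:
    """Hypothetical record describing one OID available for a country."""
    identity: str
    country: str   # ISO country code the identity can send to
    kind: str      # "dedicated", "sender_id", or "shared" (illustrative labels)
    tps: int = 0   # throughput, used only to rank dedicated identities


def select_oid(
    candidates: List[OriginationIdentity], country: str
) -> Optional[OriginationIdentity]:
    """Pick an identity for a country following the order described in the text."""
    usable = [c for c in candidates if c.country == country]
    # 1. Prefer the highest-throughput country-specific identity.
    dedicated = [c for c in usable if c.kind == "dedicated"]
    if dedicated:
        return max(dedicated, key=lambda c: c.tps)
    # 2. Otherwise fall back to a Sender ID, then to a shared identity.
    for kind in ("sender_id", "shared"):
        for candidate in usable:
            if candidate.kind == kind:
                return candidate
    # 3. Nothing can send to this country.
    return None
```

The model makes the fallback visible: a destination without a dedicated OID can still receive messages through a Sender ID or a shared route unless that fallback is disabled.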

If you have multiple use cases and need to specify the correct OID for each, this is where the V2 API is useful. OIDs can be attached to Pools, which can be configured to serve a particular use case, and the pool can be specified in your SendTextMessage call. Sending using a PoolID and allowing Pinpoint to select the right OID from that pool for the destination phone number simplifies your sending process. This blogpost details the process for creating Pools and using them to send SMS.
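A minimal sketch of sending through a pool with the V2 API might look like the following. It assumes the boto3 pinpoint-sms-voice-v2 client, whose send_text_message call accepts a PoolId in the OriginationIdentity parameter. The helper names and the example pool ID are hypothetical.

```python
from typing import Any, Dict, Optional


def build_send_text_args(
    destination: str,
    body: str,
    pool_id: str,
    message_type: str = "TRANSACTIONAL",
    configuration_set: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for SendTextMessage using a pool as the origination identity."""
    args: Dict[str, Any] = {
        "DestinationPhoneNumber": destination,  # E.164 format, for example +14255550100
        "OriginationIdentity": pool_id,         # a pool lets the service choose the OID
        "MessageBody": body,
        "MessageType": message_type,
    }
    if configuration_set:
        # Optional: route events for this send through a configuration set.
        args["ConfigurationSetName"] = configuration_set
    return args


def send_via_pool(sms_client, destination: str, body: str, pool_id: str) -> str:
    """Send one message through the pool and return the MessageId from the response."""
    response = sms_client.send_text_message(
        **build_send_text_args(destination, body, pool_id)
    )
    return response["MessageId"]
```

In real use, sms_client would be created with boto3.client("pinpoint-sms-voice-v2"), and a call might look like send_via_pool(sms_client, "+14255550100", "Your code is 123456", "pool-otp-example"). Keeping one pool per use case means the calling code only needs to know which pool to name.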

As mentioned above, Pools also serve an additional use case, which is to limit message sending to specific countries. Some countries allow messages without an OID. If you don’t modify your settings to disable this feature, Pinpoint will attempt to deliver messages to these countries, even if you don’t have an explicit OID for them. Restricting SMS sends only to countries that you have OIDs for can be accomplished by using Pools and configuring “SharedRoutesEnabled” to false by using the UpdatePool Action. Once configured, you will receive an error back if you attempt to send to a destination phone number that you do not have an OID for in the Pool. This configuration gives you the ability to control your costs while simplifying your process.
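Turning off shared routes for a pool could be sketched as follows, assuming the boto3 pinpoint-sms-voice-v2 client and its update_pool call. The function name disable_shared_routes is hypothetical, and the client is passed in so the example does not contact AWS on its own.

```python
def disable_shared_routes(sms_client, pool_id: str) -> bool:
    """Set SharedRoutesEnabled to false on a pool and report the resulting setting."""
    response = sms_client.update_pool(PoolId=pool_id, SharedRoutesEnabled=False)
    # The response is expected to echo the pool's current configuration.
    # Treat a missing key as "still enabled" so the caller notices.
    return bool(response.get("SharedRoutesEnabled", True))
```

A return value of False indicates that the pool now reports shared routes as disabled. In real use, sms_client would come from boto3.client("pinpoint-sms-voice-v2").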

Managing Opt-Outs

As we have seen, managing SMS in an environment of increasing global regulation is challenging. One important area to configure is how recipients can opt out of your communications. Pinpoint can automatically opt your customers out of SMS communications using predefined keywords such as “stop” or “unsubscribe.” However, this results in an account-wide opt-out, which is not ideal for customers that have multiple use cases such as OTP and Marketing communications. This blogpost details the process of managing opt-outs for multiple use cases. The configuration is enabled through the V2 API and is another reason to standardize your process on this API.

Monitoring Sending

The last step in ensuring success for SMS sending is having a solid platform for monitoring your sending. SMS is not a guaranteed delivery channel. You will always receive an event for a successful send in the event stream, but if the carrier does not send a DLR, there is no guarantee of a return status event. A list of SMS Events and possible statuses can be found here.

The first Event you should see returned when watching the Event Stream for an SMS send activity is the “PENDING” event. This means we’ve sent the message to the carrier, where it’s buffered, and we’re waiting for the carrier to return a status message. There are no status messages between the “PENDING” state and the “whatever happens next” state, so if the carrier is retrying, we simply stay in PENDING and do not create more events. If a message is successfully delivered and a DLR is sent back from the carrier then a new event will be generated with a status of “SUCCESSFUL/DELIVERED.”

Make sure to review all of the possible values for the record_status attribute so that you are aware of varying issues with your sending that can arise. For example, statuses such as “Blocked,” “Spam,” and “Carrier_Blocked” can indicate systemic issues that should be investigated.
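As a sketch of how these statuses might be triaged, the function below sorts a single event record into a coarse bucket. It assumes the event is a JSON object that carries record_status inside an attributes object, as in the event stream documentation listed under Resources. The bucket names and the function name classify_sms_event are hypothetical.

```python
import json

# Statuses the text singles out as possible signs of a systemic problem.
SYSTEMIC_STATUSES = {"BLOCKED", "SPAM", "CARRIER_BLOCKED"}
# Statuses that indicate the message was accepted or delivered.
DELIVERED_STATUSES = {"SUCCESSFUL", "DELIVERED"}


def classify_sms_event(raw_event: str) -> str:
    """Map one SMS event record (a JSON string) to a coarse triage bucket."""
    event = json.loads(raw_event)
    # Assumption: record_status sits under "attributes"; a missing value becomes "".
    status = str(event.get("attributes", {}).get("record_status", "")).upper()
    if status == "PENDING":
        return "pending"
    if status in DELIVERED_STATUSES:
        return "delivered"
    if status in SYSTEMIC_STATUSES:
        return "investigate"
    return "other"
```

Counting the "investigate" bucket per country or per OID over time is one simple way to spot the systemic issues mentioned above.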

Updates sent from a carrier via a DLR can be delayed for up to 72 hours or never sent at all. This varies based on the carrier and the country being sent to. Should you require a higher level of reliability, you need to establish business logic around monitoring SMS messages. If messages remain in a PENDING status longer than your business requirements permit, you must make a decision on how to handle them. You need to consider whether missed or duplicated messages are acceptable, or if it’s preferable to retry messages that are stuck in pending. The following is an example architecture for failed SMS retries that you can adjust to your needs.
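The business logic for messages stuck in PENDING can start from something as small as the tracker below, which remembers when each message first entered PENDING and reports those that have waited longer than a chosen limit. The class name PendingTracker and the max_wait parameter are hypothetical, and what to do with a stale message (retry, alert, or ignore) is left to your requirements.

```python
from datetime import datetime, timedelta
from typing import Dict, List


class PendingTracker:
    """Hypothetical tracker for messages that are still awaiting a final status."""

    def __init__(self, max_wait: timedelta) -> None:
        self._max_wait = max_wait
        self._pending: Dict[str, datetime] = {}

    def record(self, message_id: str, status: str, at: datetime) -> None:
        """Update the tracker from one event taken off the event stream."""
        if status.upper() == "PENDING":
            # Keep the first time the message was seen as PENDING.
            self._pending.setdefault(message_id, at)
        else:
            # Any other status is treated as final for tracking purposes.
            self._pending.pop(message_id, None)

    def stale(self, now: datetime) -> List[str]:
        """Return IDs of messages that have been PENDING longer than max_wait."""
        return sorted(
            message_id
            for message_id, since in self._pending.items()
            if now - since > self._max_wait
        )
```

Because a retried message can still be delivered late by the carrier, choose max_wait with the acceptable risk of duplicates in mind.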

Conclusion

This post covers the general process for getting started with Global SMS but, as you have learned, each country presents a different challenge and the regulatory environment is constantly evolving. It’s important to make sure that you are receiving messages from AWS that detail new regulations, new feature launches, and other major announcements so that you can continually improve your process and make sure your SMS messages are delivered at the highest rate possible.

Take the time to plan out your approach, follow the steps outlined in this blog, and take advantage of any resources available to you within your support tier.

Decide what origination IDs you will need here
Review the documentation for the V2 SMS and Voice API here
Review the Pinpoint API and SendMessages here
Check out the support tiers comparison here

Resources:
https://docs.aws.amazon.com/pinpoint/latest/userguide/channels-sms-countries.html
https://aws.amazon.com/blogs/messaging-and-targeting/how-to-utilise-amazon-pinpoint-to-retry-unsuccessful-sms-delivery/
https://datatracker.ietf.org/doc/html/draft-wilde-sms-uri-20#section-4
https://docs.aws.amazon.com/pinpoint/latest/developerguide/event-streams-data-sms.html
https://docs.aws.amazon.com/pinpoint/latest/userguide/channels-sms-limitations-opt-out.html
https://docs.aws.amazon.com/pinpoint/latest/userguide/channels-sms-simulator.html