Tag Archives: open source

Using AWS security services to protect against, detect, and respond to the Log4j vulnerability

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/using-aws-security-services-to-protect-against-detect-and-respond-to-the-log4j-vulnerability/

January 7, 2022: The blog post has been updated to include using Network ACL rules to block potential log4j-related outbound traffic.

January 4, 2022: The blog post has been updated to suggest using WAF rules when the correct HTTP Host Header FQDN value is not provided in the request.

December 31, 2021: We made a minor update to the second paragraph in the Amazon Route 53 Resolver DNS Firewall section.

December 29, 2021: A paragraph under the Detect section has been added to provide guidance on validating if log4j exists in an environment.

December 23, 2021: The GuardDuty section has been updated to describe new threat labels added to specific finding types to give log4j context.

December 21, 2021: The post includes more info about Route 53 Resolver DNS query logging.

December 20, 2021: The post has been updated to include Amazon Route 53 Resolver DNS Firewall info.

December 17, 2021: The post has been updated to include using Athena to query VPC flow logs.

December 16, 2021: The Respond section of the post has been updated to include IMDSv2 and container mitigation info.

This blog post was first published on December 15, 2021.


Overview

In this post we will provide guidance to help customers who are responding to the recently disclosed log4j vulnerability. This covers what you can do to limit the risk of the vulnerability, how you can try to identify if you are susceptible to the issue, and then what you can do to update your infrastructure with the appropriate patches.

The log4j vulnerability (CVE-2021-44228, CVE-2021-45046) is a critical vulnerability (CVSS 3.1 base score of 10.0) in the ubiquitous logging platform Apache Log4j. This vulnerability allows an attacker to perform a remote code execution on the vulnerable platform. Version 2 of log4j, between versions 2.0-beta-9 and 2.15.0, is affected.

The vulnerability abuses the Java Naming and Directory Interface (JNDI), which a Java program uses to look up data, typically through a directory service, most commonly an LDAP directory in the case of this vulnerability.

Figure 1, below, highlights the log4j JNDI attack flow.


Figure 1. Log4j attack progression. Source: GovCERT.ch, the Computer Emergency Response Team (GovCERT) of the Swiss government

As an immediate response, follow this blog and use the tool designed to hotpatch a running JVM that uses any version of log4j 2.0 or later. Steve Schmidt, Chief Information Security Officer for AWS, also discussed this hotpatch.

Protect

You can use multiple AWS services to help limit your risk and exposure from the log4j vulnerability. You can build a layered control approach, or pick and choose the controls identified below to help limit your exposure.

AWS WAF

Use AWS WAF (Web Application Firewall), with AWS Managed Rules for AWS WAF, to help protect your Amazon CloudFront distribution, Amazon API Gateway REST API, Application Load Balancer, or AWS AppSync GraphQL API resources.

  • AWSManagedRulesKnownBadInputsRuleSet, especially the Log4JRCE rule, which helps inspect the request for the presence of Log4j exploit strings. Example patterns include ${jndi:ldap://example.com/}.
  • AWSManagedRulesAnonymousIpList, especially the AnonymousIPList rule, which helps inspect requests from IP addresses of sources known to anonymize client information.
  • AWSManagedRulesCommonRuleSet, especially the SizeRestrictions_BODY rule, to verify that the request body size is at most 8 KB (8,192 bytes).
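If you manage web ACLs programmatically, here is a minimal, hypothetical sketch of attaching these managed rule groups with boto3. The web ACL name, scope, and metric names are placeholder assumptions, not values from this post.

# Hypothetical sketch: create a web ACL that attaches the AWS managed rule
# groups mentioned above, using boto3. Names and scope are placeholders.
import boto3

wafv2 = boto3.client("wafv2", region_name="us-east-1")

def managed_rule(name, priority):
    # Helper to reference an AWS managed rule group without overriding its actions
    return {
        "Name": name,
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {"VendorName": "AWS", "Name": name}
        },
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
    }

wafv2.create_web_acl(
    Name="log4j-protection",          # placeholder name
    Scope="REGIONAL",                 # use "CLOUDFRONT" for CloudFront distributions
    DefaultAction={"Allow": {}},
    Rules=[
        managed_rule("AWSManagedRulesKnownBadInputsRuleSet", 0),
        managed_rule("AWSManagedRulesAnonymousIpList", 1),
        managed_rule("AWSManagedRulesCommonRuleSet", 2),
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "log4j-protection",
    },
)

After the web ACL exists, you would associate it with the CloudFront distribution, API Gateway stage, Application Load Balancer, or AppSync API you want to protect.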

You should also consider implementing WAF rules that deny access if the correct HTTP Host header FQDN value is not provided in the request. This can help reduce the likelihood that scanners sweeping the internet IP address space reach your WAF-protected resources with a request that has an incorrect Host header, such as an IP address instead of an FQDN. It’s also possible to use custom Application Load Balancer listener rules to achieve this.

If you’re using AWS WAF Classic, you will need to migrate to AWS WAF or create custom regex match conditions.

Have multiple accounts? Follow these instructions to use AWS Firewall Manager to deploy AWS WAF rules centrally across your AWS organization.

Amazon Route 53 Resolver DNS Firewall

You can use Route 53 Resolver DNS Firewall, following AWS Managed Domain Lists, to help proactively protect resources with outbound public DNS resolution. We recommend associating Route 53 Resolver DNS Firewall with a rule configured to block domains on the AWSManagedDomainsMalwareDomainList, which has been updated in all supported AWS regions with domains identified as hosting malware used in conjunction with the log4j vulnerability. AWS will continue to deliver domain updates for Route 53 Resolver DNS Firewall through this list.

Also, you should consider blocking outbound port 53 to prevent the use of external untrusted DNS servers. This helps force all DNS queries through DNS Firewall and ensures DNS traffic is visible for GuardDuty inspection. Using DNS Firewall to block DNS resolution of certain country code top-level domains (ccTLDs) that your VPC resources have no legitimate reason to connect out to may also help. Examples of ccTLDs you may want to block may be included in the known log4j callback domain IOCs.
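As a rough illustration of the DNS Firewall recommendation above, the following boto3 sketch creates a rule group that blocks the AWSManagedDomainsMalwareDomainList and associates it with a VPC. The request IDs, names, priorities, and VPC ID are placeholder assumptions, and large accounts may need pagination when listing domain lists.

# Hypothetical sketch: block the AWS-managed malware domain list with
# Route 53 Resolver DNS Firewall using boto3.
import boto3

resolver = boto3.client("route53resolver")

# Find the ID of the AWS-managed malware domain list
domain_lists = resolver.list_firewall_domain_lists()["FirewallDomainLists"]
malware_list = next(
    d for d in domain_lists if d["Name"] == "AWSManagedDomainsMalwareDomainList"
)

rule_group = resolver.create_firewall_rule_group(
    CreatorRequestId="log4j-dnsfw-1", Name="block-malware-domains"
)["FirewallRuleGroup"]

resolver.create_firewall_rule(
    CreatorRequestId="log4j-dnsfw-rule-1",
    FirewallRuleGroupId=rule_group["Id"],
    FirewallDomainListId=malware_list["Id"],
    Priority=100,
    Action="BLOCK",
    BlockResponse="NODATA",
    Name="block-malware-domain-list",
)

# Associate the rule group with a VPC (placeholder VPC ID)
resolver.associate_firewall_rule_group(
    CreatorRequestId="log4j-dnsfw-assoc-1",
    FirewallRuleGroupId=rule_group["Id"],
    VpcId="vpc-0123456789abcdef0",
    Priority=101,
    Name="log4j-dnsfw-association",
)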

We also recommend that you enable DNS query logging, which allows you to identify and audit potentially impacted resources within your VPC, by inspecting the DNS logs for the presence of blocked outbound queries due to the log4j vulnerability, or to other known malicious destinations. DNS query logging is also useful in helping identify EC2 instances vulnerable to log4j that are responding to active log4j scans, which may be originating from malicious actors or from legitimate security researchers. In either case, instances responding to these scans potentially have the log4j vulnerability and should be addressed. GreyNoise is monitoring for log4j scans and sharing the callback domains here. Some notable domains customers may want to examine log activity for, but not necessarily block, are: *interact.sh, *leakix.net, *canarytokens.com, *dnslog.cn, *.dnsbin.net, and *cyberwar.nl. It is very likely that instances resolving these domains are vulnerable to log4j.

AWS Network Firewall

Customers can use Suricata-compatible IDS/IPS rules in AWS Network Firewall to deploy network-based detection and protection. While Suricata doesn’t have a protocol detector for LDAP, it is possible to detect these LDAP calls with Suricata. Open-source Suricata rules addressing Log4j are available from Corelight, NCC Group, ET Labs, and CrowdStrike. These rules can help identify scanning, as well as post-exploitation of the log4j vulnerability. Because there is a large amount of benign scanning happening now, we recommend customers focus their time first on potential post-exploitation activities, such as outbound LDAP traffic from their VPC to untrusted internet destinations.

We also recommend customers consider implementing outbound port/protocol enforcement rules that monitor or prevent protocols like LDAP from using non-standard LDAP ports such as 53, 80, 123, and 443. Monitoring or preventing usage of port 1389 outbound may be particularly helpful in identifying systems that have been triggered by internet scanners to make command and control calls outbound. We also recommend that systems without a legitimate business need to initiate network calls out to the internet not be given that ability by default. Outbound network traffic filtering and monitoring is not only very helpful with log4j, but also with identifying other classes of vulnerabilities.
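As a hedged sketch of how this could be wired up, the following boto3 snippet creates a stateful Network Firewall rule group containing two illustrative Suricata rules that alert on outbound traffic to ports commonly seen in log4j callbacks. The rule group name, capacity, and SIDs are placeholders; the open-source rule sets mentioned above are far more complete than these two example rules.

# Hypothetical sketch: a stateful AWS Network Firewall rule group with
# illustrative Suricata rules alerting on outbound traffic to callback ports.
import boto3

network_firewall = boto3.client("network-firewall")

suricata_rules = """
alert tcp $HOME_NET any -> $EXTERNAL_NET 1389 (msg:"Possible log4j LDAP callback on 1389"; sid:1000001; rev:1;)
alert tcp $HOME_NET any -> $EXTERNAL_NET 1388 (msg:"Possible log4j LDAP callback on 1388"; sid:1000002; rev:1;)
"""

network_firewall.create_rule_group(
    RuleGroupName="log4j-callback-alerts",       # placeholder name
    Type="STATEFUL",
    Capacity=10,                                  # placeholder capacity
    Description="Alert on outbound traffic to ports associated with log4j callbacks",
    RuleGroup={"RulesSource": {"RulesString": suricata_rules}},
)

The rule group would then be referenced from your firewall policy; switching the action from alert to drop is a separate decision best made after reviewing the alerts.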

Network Access Control Lists

Customers may be able to use network access control list (NACL) rules to block some of the known log4j-related outbound ports to help limit further compromise of successfully exploited systems. We recommend customers consider blocking ports 1389, 1388, 1234, 12344, 9999, 8085, and 1343 outbound. Because NACLs block traffic at the subnet level, careful consideration should be given to ensure any new rules do not block legitimate communications using these outbound ports across internal subnets. Blocking ports 389 and 88 outbound can also be helpful in mitigating log4j, but those ports are commonly used for legitimate applications, especially in a Windows Active Directory environment. See the VPC flow logs section below for details on how you can validate any ports being considered.
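The sketch below shows one hedged way to add outbound deny entries for those ports to an existing NACL with boto3. The NACL ID and starting rule number are placeholders, and as noted above, you should validate the ports against VPC flow logs before blocking broadly.

# Hypothetical sketch: add outbound TCP deny entries to an existing network ACL.
import boto3

ec2 = boto3.client("ec2")

NACL_ID = "acl-0123456789abcdef0"   # placeholder NACL ID
PORTS = [1389, 1388, 1234, 12344, 9999, 8085, 1343]

for offset, port in enumerate(PORTS):
    ec2.create_network_acl_entry(
        NetworkAclId=NACL_ID,
        RuleNumber=100 + offset,     # must not collide with existing rule numbers
        Protocol="6",                # TCP
        RuleAction="deny",
        Egress=True,                 # outbound
        CidrBlock="0.0.0.0/0",
        PortRange={"From": port, "To": port},
    )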

Use IMDSv2

Through the early days of the log4j vulnerability, researchers have noted that, once a host has been compromised with the initial JNDI vulnerability, attackers sometimes try to harvest credentials from the host and send them out via some mechanism such as LDAP, HTTP, or DNS lookups. We recommend customers use IAM roles instead of long-term access keys, and not store sensitive information such as credentials in environment variables. Customers can also leverage AWS Secrets Manager to store and automatically rotate database credentials instead of storing long-term database credentials in a host’s environment variables. See prescriptive guidance here and here on how to implement Secrets Manager in your environment.

To help guard against such attacks in AWS when EC2 Roles may be in use — and to help keep all IMDS data private for that matter — customers should consider requiring the use of Instance MetaData Service version 2 (IMDSv2). Since IMDSv2 is enabled by default, you can require its use by disabling IMDSv1 (which is also enabled by default). With IMDSv2, requests are protected by an initial interaction in which the calling process must first obtain a session token with an HTTP PUT, and subsequent requests must contain the token in an HTTP header. This makes it much more difficult for attackers to harvest credentials or any other data from the IMDS. For more information about using IMDSv2, please refer to this blog and documentation. While all recent AWS SDKs and tools support IMDSv2, as with any potentially non-backwards compatible change, test this change on representative systems before deploying it broadly.
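If you want to enforce IMDSv2 on existing instances programmatically, a minimal sketch with boto3 is shown below. The instance IDs are placeholders, and as noted above, test on representative systems before rolling this out broadly.

# Hypothetical sketch: require IMDSv2 session tokens on existing instances.
import boto3

ec2 = boto3.client("ec2")

for instance_id in ["i-0123456789abcdef0"]:   # placeholder instance IDs
    ec2.modify_instance_metadata_options(
        InstanceId=instance_id,
        HttpTokens="required",    # reject IMDSv1 requests that lack a session token
        HttpEndpoint="enabled",   # keep the metadata service itself reachable
    )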

Detect

This post has covered how to potentially limit the ability to exploit this vulnerability. Next, we’ll shift our focus to which AWS services can help to detect whether this vulnerability exists in your environment.


Figure 2. Log4j finding in the Inspector console

Amazon Inspector

As shown in Figure 2, the Amazon Inspector team has created coverage for identifying the existence of this vulnerability in your Amazon EC2 instances and Amazon Elastic Container Registry (Amazon ECR) container images. With the new Amazon Inspector, scanning is automated and continual. Continual scanning is driven by events such as new software packages, new instances, and new common vulnerabilities and exposures (CVEs) being published.

For example, once the Inspector team added support for the log4j vulnerability (CVE-2021-44228 & CVE-2021-45046), Inspector immediately began looking for this vulnerability for all supported AWS Systems Manager managed instances where Log4j was installed via OS package managers and where this package was present in Maven-compatible Amazon ECR container images. If this vulnerability is present, findings will begin appearing without any manual action. If you are using Inspector Classic, you will need to ensure you are running an assessment against all of your Amazon EC2 instances. You can follow this documentation to ensure you are creating an assessment target for all of your Amazon EC2 instances. Here are further details on container scanning updates in Amazon ECR private registries.

GuardDuty

In addition to finding the presence of this vulnerability through Inspector, the Amazon GuardDuty team has also begun adding indicators of compromise associated with exploiting the Log4j vulnerability, and will continue to do so. GuardDuty will monitor for attempts to reach known-bad IP addresses or DNS entries, and can also find post-exploitation activity through anomaly-based behavioral findings. For example, if an Amazon EC2 instance starts communicating on unusual ports, GuardDuty would detect this activity and create the finding Behavior:EC2/NetworkPortUnusual. This activity is not limited to the NetworkPortUnusual finding, though. GuardDuty has a number of different findings associated with post-exploitation activity, such as credential compromise, that might be seen in response to a compromised AWS resource. For a list of GuardDuty findings, please refer to this GuardDuty documentation.

To further help you identify and prioritize issues related to CVE-2021-44228 and CVE-2021-45046, the GuardDuty team has added threat labels to the finding detail for the following finding types:

Backdoor:EC2/C&CActivity.B
If the IP queried is Log4j-related, then fields of the associated finding will include the following values:

  • service.additionalInfo.threatListName = Amazon
  • service.additionalInfo.threatName = Log4j Related

Backdoor:EC2/C&CActivity.B!DNS
If the domain name queried is Log4j-related, then the fields of the associated finding will include the following values:

  • service.additionalInfo.threatListName = Amazon
  • service.additionalInfo.threatName = Log4j Related

Behavior:EC2/NetworkPortUnusual
If the EC2 instance communicated on port 389 or port 1389, then the associated finding severity will be modified to High, and the finding fields will include the following value:

  • service.additionalInfo.context = Possible Log4j callback

Figure 3. GuardDuty finding with log4j threat labels
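To review these findings outside the console, here is a hedged boto3 sketch that pulls recent GuardDuty findings of the types listed above and prints their additional-info fields, where the Log4j threat labels appear. Treat this as a starting point; large environments will need pagination, and GetFindings accepts only a limited batch of finding IDs per call.

# Hypothetical sketch: list and inspect GuardDuty findings of the types above.
import boto3

guardduty = boto3.client("guardduty")

finding_types = [
    "Backdoor:EC2/C&CActivity.B",
    "Backdoor:EC2/C&CActivity.B!DNS",
    "Behavior:EC2/NetworkPortUnusual",
]

for detector_id in guardduty.list_detectors()["DetectorIds"]:
    finding_ids = guardduty.list_findings(
        DetectorId=detector_id,
        FindingCriteria={"Criterion": {"type": {"Eq": finding_types}}},
    )["FindingIds"]
    if not finding_ids:
        continue
    findings = guardduty.get_findings(
        DetectorId=detector_id, FindingIds=finding_ids
    )["Findings"]
    for finding in findings:
        # Inspect the additional info for the Log4j-related threat labels
        print(finding["Type"], finding.get("Service", {}).get("AdditionalInfo"))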

Security Hub

Many customers today also use AWS Security Hub with Inspector and GuardDuty to aggregate alerts and enable automatic remediation and response. In the short term, we recommend that you use Security Hub to set up alerting through AWS Chatbot, Amazon Simple Notification Service, or a ticketing system for visibility when Inspector finds this vulnerability in your environment. In the long term, we recommend you use Security Hub to enable automatic remediation and response for security alerts when appropriate. Here are ideas on how to set up automatic remediation and response with Security Hub.
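One hedged way to get short-term visibility is an EventBridge rule that forwards imported Security Hub findings to an SNS topic, sketched below with boto3. The topic ARN is a placeholder, and the topic's resource policy must already allow EventBridge to publish to it; you would likely also narrow the event pattern to the finding types you care about.

# Hypothetical sketch: route Security Hub findings to SNS via EventBridge.
import json
import boto3

events = boto3.client("events")

TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:security-alerts"  # placeholder

events.put_rule(
    Name="securityhub-findings-to-sns",
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.securityhub"],
        "detail-type": ["Security Hub Findings - Imported"],
    }),
)

events.put_targets(
    Rule="securityhub-findings-to-sns",
    Targets=[{"Id": "sns-alerts", "Arn": TOPIC_ARN}],
)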

VPC flow logs

Customers can use Athena or CloudWatch Logs Insights queries against their VPC flow logs to help identify VPC resources associated with log4j post exploitation outbound network activity. Version 5 of VPC flow logs is particularly useful, because it includes the “flow-direction” field. We recommend customers start by paying special attention to outbound network calls using destination port 1389 since outbound usage of that port is less common in legitimate applications. Customers should also investigate outbound network calls using destination ports 1388, 1234, 12344, 9999, 8085, 1343, 389, and 88 to untrusted internet destination IP addresses. Free-tier IP reputation services, such as VirusTotal, GreyNoise, NOC.org, and ipinfo.io, can provide helpful insights related to public IP addresses found in the logged activity.

Note: If you have a Microsoft Active Directory environment in the captured VPC flow logs being queried, you might see false positives due to its use of port 389.
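For illustration, here is a hedged sketch that runs an Athena query over version 5 VPC flow logs to surface outbound flows to the ports discussed above. The database, table, column names, and S3 output location are placeholder assumptions that depend on how your flow log table was defined.

# Hypothetical sketch: query v5 VPC flow logs in Athena for outbound callback ports.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT srcaddr, dstaddr, dstport, count(*) AS flows
FROM vpc_flow_logs
WHERE flow_direction = 'egress'
  AND dstport IN (1389, 1388, 1234, 12344, 9999, 8085, 1343, 389, 88)
GROUP BY srcaddr, dstaddr, dstport
ORDER BY flows DESC
"""

athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "flow_logs_db"},                  # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # placeholder
)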

Validation with open-source tools

With the evolving nature of the different log4j vulnerabilities, it’s important to validate that upgrades, patches, and mitigations in your environment are indeed working to mitigate potential exploitation of the log4j vulnerability. You can use open-source tools, such as aws_public_ips, to get a list of all your current public IP addresses for an AWS Account, and then actively scan those IPs with log4j-scan using a DNS Canary Token to get notification of which systems still have the log4j vulnerability and can be exploited. We recommend that you run this scan periodically over the next few weeks to validate that any mitigations are still in place, and no new systems are vulnerable to the log4j issue.

Respond

The first two sections have discussed ways to help prevent potential exploitation attempts, and how to detect the presence of the vulnerability and potential exploitation attempts. In this section, we will focus on steps that you can take to mitigate this vulnerability. As we noted in the overview, the immediate response recommended is to follow this blog and use the tool designed to hotpatch a running JVM that uses any version of log4j 2.0 or later. Steve Schmidt, Chief Information Security Officer for AWS, also discussed this hotpatch.


Figure 4. Systems Manager Patch Manager patch baseline approving critical patches immediately

AWS Patch Manager

If you use AWS Systems Manager Patch Manager, and you have critical patches set to install immediately in your patch baseline, your EC2 instances will already have the patch. It is important to note that you’re not done at this point. Next, you will need to update the class path wherever the library is used in your application code, to ensure you are using the most up-to-date version. You can use AWS Patch Manager to patch managed nodes in a hybrid environment. See here for further implementation details.

Container mitigation

To install the hotpatch noted in the overview onto EKS cluster worker nodes, AWS has developed an RPM that performs a JVM-level hotpatch to disable JNDI lookups from the log4j2 library. The Apache Log4j2 node agent is an open-source project built by the Kubernetes team at AWS. To learn more about how to install this node agent, please visit this GitHub page.

Once identified, ECR container images will need to be updated to use the patched log4j version. Downstream, you will need to ensure that any containers built with a vulnerable ECR container image are updated to use the new image as soon as possible. This can vary depending on the service you are using to deploy these images. For example, if you are using Amazon Elastic Container Service (Amazon ECS), you might want to update the service to force a new deployment, which will pull down the image using the new log4j version. Check the documentation that supports the method you use to deploy containers.

If you’re running Java-based applications on Windows containers, follow Microsoft’s guidance here.

We recommend you vend new application credentials and revoke existing credentials immediately after patching.

Mitigation strategies if you can’t upgrade

If you either can’t upgrade to a patched version, which disables access to JNDI by default, or are still determining your strategy for how you are going to patch your environment, you can mitigate this vulnerability by changing your log4j configuration. To implement this mitigation in releases >=2.10, you will need to remove the JndiLookup class from the classpath: zip -q -d log4j-core-*.jar org/apache/logging/log4j/core/lookup/JndiLookup.class.
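As a complementary local check of this mitigation, the minimal Python sketch below walks a directory tree and reports any jar that still contains the JndiLookup class. It only inspects top-level jar entries (not jars nested inside other jars or WAR files), so treat a clean result as a starting point rather than proof.

# Minimal sketch: report jars that still contain JndiLookup.class.
import sys
import zipfile
from pathlib import Path

TARGET = "org/apache/logging/log4j/core/lookup/JndiLookup.class"

def scan(root):
    for jar in Path(root).rglob("*.jar"):
        try:
            with zipfile.ZipFile(jar) as zf:
                if TARGET in zf.namelist():
                    print(f"JndiLookup still present: {jar}")
        except (zipfile.BadZipFile, OSError) as err:
            print(f"Could not read {jar}: {err}", file=sys.stderr)

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else ".")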

For a more comprehensive list about mitigation steps for specific versions, refer to the Apache website.

Conclusion

In this blog post, we outlined key AWS security services that enable you to adopt a layered approach to help protect against, detect, and respond to your risk from the log4j vulnerability. We urge you to continue to monitor our security bulletins; we will continue updating our bulletins with our remediation efforts for our side of the shared-responsibility model.

Given the criticality of this vulnerability, we urge you to pay close attention to the vulnerability, and appropriately prioritize implementing the controls highlighted in this blog.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

Marshall Jones

Marshall is a Worldwide Security Specialist Solutions Architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he is focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Syed Shareef

Syed is a Senior Security Solutions Architect at AWS. He works with large financial institutions to help them achieve their business goals with AWS, whilst being compliant with regulatory and security requirements.

Open source hotpatch for Apache Log4j vulnerability

Post Syndicated from Steve Schmidt original https://aws.amazon.com/blogs/security/open-source-hotpatch-for-apache-log4j-vulnerability/

At Amazon Web Services (AWS), security remains our top priority. As we addressed the Apache Log4j vulnerability this weekend, I’m pleased to note that our team created and released a hotpatch as an interim mitigation step. This tool may help you mitigate the risk when updating is not immediately possible.

It’s important that you review, patch, or mitigate this vulnerability as soon as possible. We still recommend that you update Log4j to version 2.15 as a mitigation, but we know that can take some time, depending on your resources. To take immediate action, we recommend that you implement this newly created tool to hotpatch your Log4j deployments. A huge thanks to the Amazon Corretto team for spending days, nights, and the weekend to write, harden, and ship this code. This tool is available now at GitHub.

Caveats

As with all open source software, you’re using this at your own risk. Note that the hotpatch has been tested with JDK8 and JDK11 on Linux. On JDK17, only the static agent mode works. A full list of caveats can be found in the README.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Steve Schmidt

Steve is Vice President and Chief Information Security Officer for AWS. His duties include leading product design, management, and engineering development efforts focused on bringing the competitive, economic, and security benefits of cloud computing to business and government customers. Prior to AWS, he had an extensive career at the Federal Bureau of Investigation, where he served as a senior executive and section chief. He currently holds 11 patents in the field of cloud security architecture. Follow Steve on Twitter.

Introducing stack graphs

Post Syndicated from Douglas Creager original https://github.blog/2021-12-09-introducing-stack-graphs/

Today, we announced the general availability of precise code navigation for all public and private Python repositories on GitHub.com. Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job. In this post, I’ll dig into how stack graphs work, and how they achieve these results.

(This post is a condensed version of a talk that I gave at Strange Loop in October 2021. Please check out the video of that talk if you’d like to learn even more!)

What is code navigation?

Code navigation is a family of features that let you explore the relationships in your code and its dependencies at a deep level. The most basic code navigation features are “jump to definition” and “find all references.” Both build on the fact that names are pervasive in the code that we write. Programming languages let us define things — functions, classes, modules, methods, variables, and more. Those things have names so that we can refer back to them in other parts of our code.

A picture (even a simple one) is worth a thousand words:

a simple Python module

In this Python module, the reference to broil at the end of the file refers to the function definition earlier in the file. (Throughout this post, I’ll highlight definitions in red and references in blue.)

Our goal, then, is to collect information about the lists of definitions and references, and to be able to determine which definitions each reference maps to, for all of the code hosted on GitHub.

Why is this hard?

In the above example, the definition and reference were close to each other, and it was easy to visually see the relationship between them. But it won’t always be that easy!

names can shadow each other in Python but not in Rust

For instance, what if there are multiple definitions with the same name? In Python, names can shadow each other, which means that the broil reference should refer to the latter of the two definitions.

But these rules are language-specific! In Rust, top-level definitions are not allowed to shadow each other, but local variables are. So, this transliteration of my example from Python to Rust is an error according to the Rust language spec. If we were writing a Rust compiler, we would want to surface this error for the programmer to fix. But what about for an exploration feature like code navigation? We might want to show some result even for erroneous code. We’re only human, after all!

code can live in multiple packages

Up to now, I’ve only shown you examples consisting of a single file. But when was the last time you worked on a software project consisting of a single file? It’s much more likely that your code will be split across multiple files, multiple packages, and multiple repositories. Programming languages give us the ability to refer to definitions that might be quite far away. But as you might expect, the rules for how you refer to things in other files are different for different languages.

In the above example, I’ve split everything up into three files living in two separate packages or repositories. (I’m using emoji to represent the package names.) In Python, import statements let us refer to names defined in other modules, and the name of a module is determined by the name of the file containing its code. Together, this lets us see that the broil reference in chef.py in the “chef” package refers to the broil definition in stove.py in the “frying pan” package.

code can change

Code changes and evolves over time. What happens when one of your dependencies changes the implementation of a function that you’re calling? Here, the maintainers of the “frying pan” package have added some logging to the broil function. As a result, the broil reference in chef.py now refers to a different definition. Insidiously, it was an intermediate file that changed — not the file containing the reference, nor the file containing the original definition! If we’re not careful, we’ll have to reanalyze every file in the repository, and in all its dependencies, whenever any file changes! This makes the amount of work we must do quadratic in the number of changed files, rather than linear, which is especially problematic at GitHub’s scale.

Our last difficulty is one of scale. As mentioned above, we want to provide this feature for all of the code hosted on GitHub. Moreover, we don’t want to require any manual configuration on the part of each repository owner. You shouldn’t have to figure out how to produce code navigation data for your language and project, or have to configure a CI build to generate that data. Code navigation should Just Work.

At GitHub’s scale, this poses two problems. The first is the sheer amount of code that comes in every minute of every day. In each commit that we receive, it’s very likely that only a small number of files have been modified. We must be able to rely on incremental processing and storage, reusing the results that we’ve already calculated and saved for the files that haven’t changed.

The second challenge is the number of programming languages that we need to (eventually) support. GitHub hosts code written in every programming language imaginable. Git itself doesn’t care what language you use for your project — to Git, everything is just bytes. But for a feature like code navigation, where the name binding rules are different for each language, we must know how to parse and interpret the content of those files. To support this at scale, it must be as easy as possible for GitHub engineers and external language communities to describe the name binding rules for a language.

To summarize:

  • Different languages have different name binding rules.
  • Some of those rules can be quite complex.
  • The result might depend on intermediate files.
  • We don’t want to require manual per-repository configuration.
  • We need incremental processing to handle our scale.

Stack graphs

After examining the problem space, we created stack graphs to tackle these challenges, based on the scope graphs framework from Eelco Visser’s research group at TU Delft. Below I’ll discuss what stack graphs are and how they work.

Because we must rely on incremental results, it’s important that at index time (that is, when we receive pushes containing new commits), we look at each file completely in isolation. Our goal is to extract “facts” about each file that describe the definitions and references in the file, and all possible things that each reference could resolve to.

For instance, consider this example:

two Python files

Our final result must be able to encode the fact that the broil reference and definition live in different files. But to be incremental, our analysis must look at each file separately. I’m going to step into each file to show you what information GitHub can extract in isolation.

the stack graph for stove.py

Looking first at stove.py, we can see that it contains a definition of broil. From the name of the file, we know that this definition lives in a module called stove, giving a fully qualified name of stove.broil. We can create a graph structure representing this fact (along with information about the other symbols in the file). Each definition (including the module itself) gets a red, double-bordered definition node. The other nodes, and the pattern of how we’ve connected these nodes with edges, define the scoping and shadowing rules for these symbols.

the stack graph for kitchen.py

We can do the same thing for kitchen.py. The broil reference is represented by a blue, single-bordered reference node. The import statement also appears in the graph, as a gadget of nodes involving the broil and stove symbols.

Because we are looking at this file in isolation, we don’t yet know what the broil reference resolves to. The import statement means that it might resolve to stove.broil, defined in some other file — but that depends on whether there is a file defining that symbol. This example does in fact contain such a file (we just looked at it!), but we must ignore that while extracting incremental facts about kitchen.py.

At query time, however, we’re able to bring together the data from all files in the commit that you’re looking at. We can load the graphs for each of the files, producing a single “merged” graph for the entire commit:

the merged stack graph

Within this merged graph, every valid name binding is represented by a path from a reference node to a definition node.

However, not every path in the graph represents a valid name binding! For instance, looking only at the graph structure, there are perfectly fine paths from the broil reference node to the saute and bake definition nodes. To rule out those paths, we also maintain a symbol stack while searching for paths. Each blue node pushes a symbol onto the stack, and each red node pops a symbol from the stack. Importantly, we are not allowed to move into a “pop” node if its symbol does not match the top of the stack.

We’ve shown the contents of the symbol stack at a handful of places in the path that’s highlighted above. Most importantly, when we reach the portion of the graph containing the saute, broil, and bake definition nodes, the symbol stack contains ⟨broil⟩, ensuring that the only valid path that we discover is the one that ends at the broil definition.
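To make the symbol-stack idea concrete, here is a toy Python sketch of the search described above; it is not the real stack graphs implementation (which also handles precedences, partial paths, and much more), and all node names are invented for the single-file example. Each node either pushes a symbol (references), pops a symbol (definitions), or is a plain scope node; a reference resolves to any definition reachable along a path that leaves the symbol stack empty.

# Toy sketch of symbol-stack-constrained path finding over a stack-graph-like graph.
from collections import namedtuple

Node = namedtuple("Node", ["name", "kind", "symbol"])  # kind: push | pop | scope

def resolve(graph, edges, start):
    """Return the names of definition nodes that `start` (a push node) resolves to."""
    results = []

    def walk(node, stack, visited):
        if node.kind == "push":
            stack = stack + [node.symbol]
        elif node.kind == "pop":
            if not stack or stack[-1] != node.symbol:
                return                      # pop node doesn't match top of stack
            stack = stack[:-1]
            if not stack:                   # fully resolved: record the definition
                results.append(node.name)
                return
        for succ in edges.get(node.name, []):
            if succ not in visited:
                walk(graph[succ], stack, visited | {succ})

    walk(start, [], {start.name})
    return results

# The single-file example: a `broil` reference and three definitions.
graph = {n.name: n for n in [
    Node("ref_broil", "push", "broil"),
    Node("file_scope", "scope", None),
    Node("def_saute", "pop", "saute"),
    Node("def_broil", "pop", "broil"),
    Node("def_bake", "pop", "bake"),
]}
edges = {"ref_broil": ["file_scope"],
         "file_scope": ["def_saute", "def_broil", "def_bake"]}

print(resolve(graph, edges, graph["ref_broil"]))   # ['def_broil']

The paths toward saute and bake are abandoned as soon as their pop nodes fail to match the broil symbol on top of the stack, which is exactly the pruning described above.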

We can also use different graph structures to handle my other examples. For example:

the stack graph for shadowed Python definitions

In this graph, we annotate some of the graph edges with a precedence value. Paths that include edges with a higher precedence value are preferred over those with lower precedences. This lets us correctly handle Python’s shadowing behavior.

For other programming languages, which don’t implement the same shadowing behavior as Python, we’d use a different pattern of edges to connect everything. For instance, the stack graph for my Rust example from earlier would be:

the stack graph for conflicting Rust definitions

To model Rust’s rule that top-level definitions with the same name are conflicts, we have a single node that all definitions hang off of. We can use precedences to choose whether to show all conflicting definitions (by giving them all the same precedence value), or just the first one (by assigning precedences sequentially).

With a stack graph available to us, we can implement “jump to definition:”

  1. The user clicks on a reference.
  2. We load in the stack graphs for each file in the commit, and merge them
    together.
  3. We perform a path-finding search starting from the reference node
    corresponding to the symbol that the user clicked on, considering
    symbol stacks and precedences to ensure that we don’t create any invalid
    paths.
  4. Any valid paths that we find represent the definitions that the reference
    refers to. We display those in a hover card.

Creating stack graphs using Tree-sitter

I’ve described how to use stack graphs to perform code navigation lookups, but I haven’t mentioned how to create stack graphs from the source code that you push to GitHub.

For that, we turned to Tree-sitter, an open source parsing framework. The Tree-sitter community has already written parsers for a wide variety of programming languages, and we already use Tree-sitter in many places across GitHub. This makes it a natural choice to build stack graphs on.

Tree-sitter’s parsers already let us efficiently parse the code that our users upload. For instance, the Tree-sitter parser for Python produces a concrete syntax tree (CST) for our stove.py example file:

$ tree-sitter parse stove.py
(module [0, 0] - [10, 0]
  (function_definition [0, 0] - [1, 8]
    name: (identifier [0, 4] - [0, 8])
    parameters: (parameters [0, 8] - [0, 10])
    body: (block [1, 4] - [1, 8]
      (pass_statement [1, 4] - [1, 8])))
  (function_definition [3, 0] - [4, 8]
    name: (identifier [3, 4] - [3, 9])
    parameters: (parameters [3, 9] - [3, 11])
    body: (block [4, 4] - [4, 8]
      (pass_statement [4, 4] - [4, 8])))
  (function_definition [6, 0] - [7, 8]
    name: (identifier [6, 4] - [6, 9])
    parameters: (parameters [6, 9] - [6, 11])
    body: (block [7, 4] - [7, 8]
      (pass_statement [7, 4] - [7, 8]))))

Tree-sitter also provides a query language that lets us look for patterns within the CST:

(function_definition
  name: (identifier) @name) @function

This query would locate all three of our example method definitions, annotating each definition as a whole with a @function label and the name of each method with a @name label.
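For readers who want to experiment, here is a hedged sketch of running the same query from Python with the py-tree-sitter bindings. The binding API has changed across releases; this follows the 0.20-era API and assumes you have built a Python grammar into the placeholder shared library path shown below.

# Hedged sketch: run the Tree-sitter query above from Python.
from tree_sitter import Language, Parser

PY_LANGUAGE = Language("build/languages.so", "python")   # placeholder build artifact
parser = Parser()
parser.set_language(PY_LANGUAGE)

source = b"def broil():\n    pass\n"
tree = parser.parse(source)

query = PY_LANGUAGE.query(
    "(function_definition name: (identifier) @name) @function"
)

for node, capture_name in query.captures(tree.root_node):
    if capture_name == "name":
        print(source[node.start_byte:node.end_byte].decode())   # prints: broil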

As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere. For instance, the following snippet would create the stack graph definition node for my example Python method definitions:

(function_definition
  name: (identifier) @name) @function
{
    node @function.def
    attr (@function.def) kind = "definition"
    attr (@function.def) symbol = @name
    edge @function.containing_scope -> @function.def
}

This approach lets us create stack graphs incrementally for each source file that we receive, while only having to analyze the source code content, and without having to invoke any language-specific tooling or build systems. (The only language-specific part is the set of graph construction rules for that language!)

But wait, there’s more!

This post is already quite long, and I’ve only scratched the surface. You might be wondering:

  • Performing a full path-finding search for every “jump to definition” query seems wasteful. Can we precalculate more information at index time while still being incremental?

  • All the examples we’ve shown are pretty trivial. Can we handle more complex examples?

    For instance, how about the following Python file, where we need to use dataflow to trace what particular value was passed in as a parameter to passthrough to correctly resolve the reference to one on the final line?

    def passthrough(x):
      return x
    
    class A:
      one = 1
    
    passthrough(A).one
    

    Or the following Java file, where we have to trace inheritance and generic type parameters to see that the reference to length should resolve to String.length from the Java standard library?

    import java.util.HashMap;
    
    class MyMap extends HashMap<String, String> {
      int firstLength() {
          return this.entrySet().iterator().next().getKey().length();
      }
    }
    
  • Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas. And in the meantime, keep navigating!

New – FreeRTOS Extended Maintenance Plan for Up to 10 Years

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-freertos-extended-maintenance-plan-for-up-to-10-years/

At AWS re:Invent 2020, we announced FreeRTOS Long Term Support (LTS), which offers a more stable foundation than standard releases as manufacturers deploy and later update devices in the field. FreeRTOS is an open source, real-time operating system for microcontrollers that makes small, low-power edge devices easy to program, deploy, secure, connect, and manage.

In 2021, the FreeRTOS 202012.01 LTS release added the AWS IoT Over-the-Air (OTA) update, AWS IoT Device Defender, and AWS IoT Jobs libraries, providing feature stability, security patches, and critical bug fixes for the next two years.

Today, I am happy to announce FreeRTOS Extended Maintenance Plan (EMP), which allows embedded developers to receive critical bug fixes and security patches on their chosen FreeRTOS LTS version for up to 10 years beyond the expiry of the initial LTS period. FreeRTOS EMP lets developers improve device security (or helps keep devices secure) for years, save on operating system upgrade costs, and reduce the risks associated with patching their devices.

FreeRTOS EMP applies to libraries covered by FreeRTOS LTS. Developers whose device lifecycles are longer than the two-year LTS period can therefore continue using a version that provides feature stability, security patches, and critical bug fixes, all without having to plan a costly version upgrade.

Here are the main features of FreeRTOS EMP:

  • Feature stability: Get FreeRTOS libraries that maintain the same set of features for years.
  • API stability: Get FreeRTOS libraries that have stable APIs for years. Together with feature stability, this helps you save upgrade costs by using a stable FreeRTOS codebase for your product lifecycle.
  • Critical fixes: Receive security patches and critical bug* fixes on your chosen FreeRTOS libraries. Security patches help keep your IoT devices secure for the product lifecycle.
  • Notification of patches: Receive timely notification of upcoming patches. Timely awareness of security patches helps you proactively plan the deployment of patches.
  • Flexible subscription plan: Extend maintenance by a year or longer. You can renew your annual subscription for a longer period to keep the same version for the entire device lifecycle, or for a shorter period to buy time before upgrading to the latest FreeRTOS version.

* A critical bug is a defect determined by AWS to impact the functionality of the affected library and that has no reasonable workaround.

Getting Started with FreeRTOS EMP
To get started, subscribe to the plan using your AWS account, and renew the subscription annually, either for as long as needed to cover your product lifecycle or until you are ready to transition to a new FreeRTOS LTS release.

Before the end of the current LTS period, you will be able to use your AWS account to complete the FreeRTOS EMP registration on the FreeRTOS console, review and agree to the associated terms and conditions, select the LTS version, and buy an annual subscription. You will then gain access to the private repository where you’ll receive .zip files containing a git repo with chosen libraries, patches, and related notifications.

Under NDA, AWS will notify you via official AWS Security channels of an upcoming patch and its timelines (if AWS is reasonably able to do so and deems it appropriate). Patches will be sent to your private repository within three business days of successfully implementing and getting AWS Security approval for our mitigation.

AWS will provide technical support for FreeRTOS EMP customers via separate subscriptions to AWS Support; AWS Support is not included in FreeRTOS EMP subscriptions. Based on your AWS Support plan, you can track issues such as AWS accounts, billing, and bugs, or get access to technical experts for issues such as patch integration.

Available Now
FreeRTOS EMP will be available for the current and all previous FreeRTOS LTS releases. Subscriptions can be renewed annually for up to 10 years from the end of the chosen LTS version’s support period. For example, a subscription for FreeRTOS 202012.01 LTS, whose LTS period ends March 2023, may be renewed annually for up to 10 years (i.e., March 2033).

You can find more information on the FreeRTOS feature page. Please send us feedback on the FreeRTOS forums or through AWS Support.

Sign up to get periodic updates on when and how you can subscribe to FreeRTOS EMP.

Channy

Introducing Karpenter – An Open-Source High-Performance Kubernetes Cluster Autoscaler

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/introducing-karpenter-an-open-source-high-performance-kubernetes-cluster-autoscaler/

Today we are announcing that Karpenter is ready for production. Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built with AWS. It helps improve your application availability and cluster efficiency by rapidly launching right-sized compute resources in response to changing application load. Karpenter also provides just-in-time compute resources to meet your application’s needs and will soon automatically optimize a cluster’s compute resource footprint to reduce costs and improve performance.

Before Karpenter, Kubernetes users needed to dynamically adjust the compute capacity of their clusters to support applications using Amazon EC2 Auto Scaling groups and the Kubernetes Cluster Autoscaler. Nearly half of Kubernetes customers on AWS report that configuring cluster auto scaling using the Kubernetes Cluster Autoscaler is challenging and restrictive.

When Karpenter is installed in your cluster, Karpenter observes the aggregate resource requests of unscheduled pods and makes decisions to launch new nodes and terminate them to reduce scheduling latencies and infrastructure costs. Karpenter does this by observing events within the Kubernetes cluster and then sending commands to the underlying cloud provider’s compute service, such as Amazon EC2.

Karpenter is an open-source project licensed under the Apache License 2.0. It is designed to work with any Kubernetes cluster running in any environment, including all major cloud providers and on-premises environments. We welcome contributions to build additional cloud providers or to improve core project functionality. If you find a bug, have a suggestion, or have something to contribute, please engage with us on GitHub.

Getting Started with Karpenter on AWS
To get started with Karpenter in any Kubernetes cluster, ensure there is some compute capacity available, and install it using the Helm charts provided in the public repository. Karpenter also requires permissions to provision compute resources on the provider of your choice.

Once installed in your cluster, the default Karpenter provisioner observes incoming Kubernetes pods that cannot be scheduled due to insufficient compute resources in the cluster, and automatically launches new resources to meet their scheduling and resource requirements.

I want to show a quick start using Karpenter in an Amazon EKS cluster based on Getting Started with Karpenter on AWS. It requires the installation of AWS Command Line Interface (AWS CLI), kubectl, eksctl, and Helm (the package manager for Kubernetes). After setting up these tools, create a cluster with eksctl. This example configuration file specifies a basic cluster with one initial node.

cat <<EOF > cluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-karpenter-demo
  region: us-east-1
  version: "1.20"
managedNodeGroups:
  - instanceType: m5.large
    amiFamily: AmazonLinux2
    name: eks-karpenter-demo-ng
    desiredCapacity: 1
    minSize: 1
    maxSize: 5
EOF
$ eksctl create cluster -f cluster.yaml

Karpenter itself can run anywhere, including on self-managed node groups, managed node groups, or AWS Fargate. Karpenter will provision EC2 instances in your account.

Next, you need to create the necessary AWS Identity and Access Management (IAM) resources using the AWS CloudFormation template and IAM Roles for Service Accounts (IRSA) so that the Karpenter controller gets permissions such as launching instances, following the documentation. You also need to install the Helm chart to deploy Karpenter to your cluster.

$ helm repo add karpenter https://charts.karpenter.sh
$ helm repo update
$ helm upgrade --install --skip-crds karpenter karpenter/karpenter --namespace karpenter \
  --create-namespace --set serviceAccount.create=false --version 0.5.0 \
  --set controller.clusterName=eks-karpenter-demo \
  --set controller.clusterEndpoint=$(aws eks describe-cluster --name eks-karpenter-demo --query "cluster.endpoint" --output text) \
  --wait # for the defaulting webhook to install before creating a Provisioner

Karpenter provisioners are Kubernetes resources that enable you to configure the behavior of Karpenter in your cluster. When you create a default provisioner, without further customization besides what is needed for Karpenter to provision compute resources in your cluster, Karpenter automatically discovers node properties such as instance types, zones, architectures, operating systems, and purchase types of instances. You don’t need to define these spec.requirements if there is no explicit business requirement.

cat <<EOF | kubectl apply -f -
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
#Requirements that constrain the parameters of provisioned nodes. 
#Operators { In, NotIn } are supported to enable including or excluding values
  requirements:
    - key: node.k8s.aws/instance-type #If not included, all instance types are considered
      operator: In
      values: ["m5.large", "m5.2xlarge"]
    - key: "topology.kubernetes.io/zone" #If not included, all zones are considered
      operator: In
      values: ["us-east-1a", "us-east-1b"]
    - key: "kubernetes.io/arch" #If not included, all architectures are considered
      operator: In
      values: ["arm64", "amd64"]
    - key: "karpenter.sh/capacity-type" #If not included, the webhook for the AWS cloud provider will default to on-demand
      operator: In
      values: ["spot", "on-demand"]
  provider:
    instanceProfile: KarpenterNodeInstanceProfile-eks-karpenter-demo
  ttlSecondsAfterEmpty: 30  
EOF

The ttlSecondsAfterEmpty value configures Karpenter to terminate empty nodes. If this value is disabled, nodes will never scale down due to low utilization. To learn more, see Provisioner custom resource definitions (CRDs) on the Karpenter site.

Karpenter is now active and ready to begin provisioning nodes in your cluster. Create some pods using a deployment, and watch Karpenter provision nodes in response.

$ kubectl create deployment inflate \
          --image=public.ecr.aws/eks-distro/kubernetes/pause:3.2

Let’s scale the deployment and check out the logs of the Karpenter controller.

$ kubectl scale deployment inflate --replicas 10
$ kubectl logs -f -n karpenter $(kubectl get pods -n karpenter -l karpenter=controller -o name)
2021-11-23T04:46:11.280Z        INFO    controller.allocation.provisioner/default       Starting provisioning loop      {"commit": "abc12345"}
2021-11-23T04:46:11.280Z        INFO    controller.allocation.provisioner/default       Waiting to batch additional pods        {"commit": "abc123456"}
2021-11-23T04:46:12.452Z        INFO    controller.allocation.provisioner/default       Found 9 provisionable pods      {"commit": "abc12345"}
2021-11-23T04:46:13.689Z        INFO    controller.allocation.provisioner/default       Computed packing for 10 pod(s) with instance type option(s) [m5.large]  {"commit": " abc123456"}
2021-11-23T04:46:16.228Z        INFO    controller.allocation.provisioner/default       Launched instance: i-01234abcdef, type: m5.large, zone: us-east-1a, hostname: ip-192-168-0-0.ec2.internal    {"commit": "abc12345"}
2021-11-23T04:46:16.265Z        INFO    controller.allocation.provisioner/default       Bound 9 pod(s) to node ip-192-168-0-0.ec2.internal  {"commit": "abc12345"}
2021-11-23T04:46:16.265Z        INFO    controller.allocation.provisioner/default       Watching for pod events {"commit": "abc12345"}

The provisioner’s controller listens for pod events; here it launched a new instance and bound the provisionable pods to the new node.

Now, delete the deployment. After 30 seconds (ttlSecondsAfterEmpty = 30), Karpenter should terminate the empty nodes.

$ kubectl delete deployment inflate
$ kubectl logs -f -n karpenter $(kubectl get pods -n karpenter -l karpenter=controller -o name)
2021-11-23T04:46:18.953Z        INFO    controller.allocation.provisioner/default       Watching for pod events {"commit": "abc12345"}
2021-11-23T04:49:05.805Z        INFO    controller.Node Added TTL to empty node ip-192-168-0-0.ec2.internal {"commit": "abc12345"}
2021-11-23T04:49:35.823Z        INFO    controller.Node Triggering termination after 30s for empty node ip-192-168-0-0.ec2.internal {"commit": "abc12345"}
2021-11-23T04:49:35.849Z        INFO    controller.Termination  Cordoned node ip-192-168-116-109.ec2.internal   {"commit": "abc12345"}
2021-11-23T04:49:36.521Z        INFO    controller.Termination  Deleted node ip-192-168-0-0.ec2.internal    {"commit": "abc12345"}

If you delete a node with kubectl, Karpenter will gracefully cordon, drain, and shut down the corresponding instance. Under the hood, Karpenter adds a finalizer to the node object, which blocks deletion until all pods are drained, and the instance is terminated.

Things to Know
Here are a couple of things to keep in mind about Karpenter features:

Accelerated Computing: Karpenter works with all kinds of Kubernetes applications, but it performs particularly well for use cases that require rapidly provisioning and deprovisioning large numbers of diverse compute resources. For example, this includes batch jobs to train machine learning models, run simulations, or perform complex financial calculations. You can leverage custom resources of nvidia.com/gpu, amd.com/gpu, and aws.amazon.com/neuron for use cases that require accelerated EC2 instances.

Provisioners Compatibility: Karpenter provisioners are designed to work alongside static capacity management solutions like Amazon EKS managed node groups and EC2 Auto Scaling groups. You may choose to manage the entirety of your capacity using provisioners, a mixed model with both dynamic and statically managed capacity, or a fully static approach. We recommend not using Kubernetes Cluster Autoscaler at the same time as Karpenter because both systems scale up nodes in response to unschedulable pods. If configured together, both systems will race to launch or terminate instances for these pods.

Join our Karpenter Community
Karpenter’s community is open to everyone. Give it a try, and join our working group meeting, or follow our roadmap for future releases that interest you. As I said, we welcome any contributions such as bug reports, new features, corrections, or additional documentation.

To learn more about Karpenter, see the documentation and demo video from AWS Container Day.

Channy

New – AWS Marketplace for Containers Anywhere to Deploy Your Kubernetes Cluster in Any Environment

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-aws-marketplace-for-containers-anywhere-to-deploy-your-kubernetes-cluster-in-any-environment/

More than 300,000 customers use AWS Marketplace today to find, subscribe to, and deploy third-party software packaged as Amazon Machine Images (AMIs), software-as-a-service (SaaS), and containers. Customers can find and subscribe to containerized third-party applications from AWS Marketplace and deploy them in Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS).

Many customers that run Kubernetes applications on AWS want to deploy them on-premises due to constraints, such as latency and data governance requirements. Also, once they have deployed the Kubernetes application, they need additional tools to govern the application through license tracking, billing, and upgrades.

Today, we announce AWS Marketplace for Containers Anywhere, a set of capabilities that allows AWS customers to find, subscribe to, and deploy third-party Kubernetes applications from AWS Marketplace on any Kubernetes cluster in any environment. This capability makes the AWS Marketplace more useful for customers who run containerized workloads.

With this launch, you can deploy third-party Kubernetes applications to on-premises environments using Amazon EKS Anywhere, or to any self-managed Kubernetes cluster on premises or in Amazon Elastic Compute Cloud (Amazon EC2), enabling you to use a single catalog to find container images regardless of where you eventually plan to deploy them.

With AWS Marketplace for Containers Anywhere, you get the same benefits as with any other products in AWS Marketplace, including consolidated billing, flexible payment options, and lower pricing for long-term contracts. You can find vetted, security-scanned, third-party Kubernetes applications, manage upgrades with a few clicks, and track all licenses and bills. You can migrate applications between environments without purchasing duplicate licenses. After you have subscribed to an application using this feature, you can migrate your Kubernetes applications to AWS by deploying the independent software vendor (ISV) provided Helm charts onto your Kubernetes clusters on AWS without changing your licenses.

Getting Started with AWS Marketplace for Containers Anywhere
You can get started by visiting AWS Marketplace. Search across all products and filter by the Helm chart delivery method in the catalog to find Kubernetes-based applications that you can deploy on AWS and on premises.

If you choose to subscribe to a product, select Continue to Subscribe.

Once you accept the seller’s end user license agreement (EULA), select Create Contract and Continue to Configuration.

You can configure the software deployment using the dropdowns. Once Fulfillment option and Software Version are selected, choose Continue to Launch.

To deploy on Amazon EKS, you have the option to deploy the application on a new EKS cluster or copy and paste commands into existing clusters. You can also deploy into self-managed Kubernetes in EC2 by clicking on the self-managed Kubernetes option in the supported services.

To deploy on-premises or in EC2, you can select EKS Anywhere and then take an additional step to request a license token on the AWS Marketplace launch page. You will then use commands provided by AWS Marketplace to download container images and Helm charts from the AWS Marketplace Elastic Container Registry (ECR), create the service account, and apply the token for IAM Roles for Service Accounts on your EKS cluster.

To upgrade or renew your existing software licenses, you can go to the AWS Marketplace website for a self-service upgrade or renewal experience. You can also negotiate a private offer directly with ISVs to upgrade and renew the application. After you subscribe to the new offer, the license is automatically updated in AWS License Manager. You can view all the licenses you have purchased from AWS Marketplace using AWS License Manager, including the application capabilities you’re entitled to and the expiration date.
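
If you prefer the command line, the same license information can also be listed with the AWS CLI. For example, the following returns the licenses you have received through AWS Marketplace, along with their entitlements and expiration dates:

$ aws license-manager list-received-licenses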

Launch Partners of AWS Marketplace for Containers Anywhere
Here is the list of launch partners that support the on-premises deployment option. Try them out today!

  • D2iQ delivers the leading independent platform for enterprise-grade Kubernetes implementations at scale and across environments, including cloud, hybrid, edge, and air-gapped.
  • HAProxy Technologies offers widely used software load balancers to deliver websites and applications with the utmost performance, observability, and security at any scale and in any environment.
  • Isovalent builds open-source software and enterprise solutions such as Cilium and eBPF solving networking, security, and observability needs for modern cloud-native infrastructure.
  • JFrog’s “liquid software” mission is to power the world’s software updates through the seamless, secure flow of binaries from developers to the edge.
  • Kasten by Veeam provides Kasten K10, a data management platform purpose-built for Kubernetes, an easy-to-use, scalable, and secure system for backup and recovery, disaster recovery, and application mobility.
  • Nirmata, the creator of Kyverno, provides open source and enterprise solutions for policy-based security and automation of production Kubernetes workloads and clusters.
  • Palo Alto Networks, the global cybersecurity leader, is shaping the cloud-centric future with technology that is transforming the way people and organizations operate.
  • Prosimo’s SaaS combines cloud networking, performance, security, AI-powered observability, and cost management to reduce enterprise cloud deployment complexity and risk.
  • Solodev is an enterprise CMS and digital ecosystem for building custom cloud apps, from content to crypto. Get access to DevOps, training, and 24/7 support—powered by AWS.
  • Trilio, a leader in cloud-native data protection for Kubernetes, OpenStack, and Red Hat Virtualization environments, offers solutions for backup and recovery, migration, and application mobility.

If you are interested in offering your Kubernetes application on AWS Marketplace, register and modify your product to integrate with AWS License Manager APIs using the provided AWS SDK. Integrating with AWS License Manager will allow the application to check licenses procured through AWS Marketplace.

Next, you would create a new container product on AWS Marketplace with a contract offer by submitting details of the listing, including the product information, license options, and pricing. The details would be reviewed, approved, and published by AWS Marketplace Technical Account Managers. You would then submit the new container image to AWS Marketplace ECR and add it to a newly created container product through the self-service Marketplace Management Portal. All container images are scanned for Common Vulnerabilities and Exposures (CVEs).

Finally, the product listing and container images would be published and accessible by customers on AWS Marketplace’s customer website. To learn more details about creating container products on AWS Marketplace, visit Getting started as a seller and Container-based products in the AWS documentation.

Available Now
AWS Marketplace for Containers Anywhere is available now in all Regions that support AWS Marketplace. You can start using the feature today with the products from our launch partners.

Give it a try, and please send us feedback either in the AWS forum for AWS Marketplace or through your usual AWS support contacts.

Channy

An Open-Source CMS on the Cloudflare Stack: Introductory Post

Post Syndicated from Luke Edwards original https://blog.cloudflare.com/production-saas-intro/

The Cloudflare documentation is a great resource when learning concepts, reviewing API usage notes, or when you’re in need of a concise snippet to illustrate those APIs or concepts. But, as comprehensive as it is, new users of the Cloudflare Workers platform must bridge a large gap to go from the introductory example snippets to a real, production-ready application. While some of this may be specific to Workers (as with any platform), developers everywhere are figuring out how applications should be built in a serverless world. Building large serverless applications entails a learning curve, regardless of a developer’s experience level.

At Cloudflare, we’re intimately aware of this because we also had to go through the same transition. Our engineers are world-class and expertly design and craft products that complement the distributed paradigm… but experts aren’t born overnight! We have been there, and we want to help jumpstart others’ understanding.

With this in mind, we decided to do something unique to the industry: we are developing an example feature-complete SaaS application that will be built entirely on the Cloudflare stack. It is and will continue to be completely free, open-sourced on GitHub, and developed in public. This is exciting because it can be used as a template for launching your own SaaS applications, too! In fact, you can clone the GitHub repository, update a few service tokens, and deploy the pre-built application to your own Cloudflare account within minutes!

Of course, examples and templates are great, but technologies and best practices never stop evolving. Cloudflare is no exception and is constantly iterating and introducing new products and product features. By extension, this requires the SaaS application to be a living example that evolves alongside the Workers platform — and this is part of our commitment.

|| Don’t miss out! Watch the project on GitHub to track its development progress and stay current with our latest changes and recommendations.

Application Overview

Now, aside from actually building the application, we needed to pick an example SaaS application that is complex enough to serve as a convincing case study, yet simple (or self-contained) enough that developers can quickly dive in, follow the source, and understand the components involved and the reasons why and how they are used.

Ultimately we decided to build an example content management system (CMS) which, as an application archetype, has also transformed over the years. Traditionally, a CMS operated on rented hardware, which was home to a long-lived server that handled incoming requests and queried an SQL-like database in order to retrieve the requested content, render it to an HTML page, and repeat the process over and over again. WordPress was — and still is — a very common example of this approach.

Naturally, this application architecture was improved over the years: layers of caching were introduced, database schemas were redesigned to minimize the number of rows processed, and some frameworks began to skip the database entirely, preferring a build-step to render all content upfront as static HTML pages. (This is now known as “static-site generation” and is still a very popular approach.)

Today, in the serverless era, there are a number of “headless CMS” options available. These are made “headless” because they are not monolithic web servers that render HTML for each request. Instead, they offer API endpoints that will return the content as raw JSON data. This allows web developers to build completely custom templates for their website using whatever tools and/or frameworks they prefer. This approach grants an enormous amount of flexibility to the developer without losing the ability to organize their content, image assets, etc. WordPress, the seasoned veteran, is one of few that is able to offer a headless and a “headful” mode. Other headless choices, like Sanity.io and Contentful, are also quite popular.

The CMS application model is a great case study for our open-source example. One of the primary tenets of an edge-first design is that content should be made available as close as physically possible to the users asking for it. And the serverless architecture means that there’s no longer — or should not be — a single point of failure. These both directly benefit the CMS archetype and, when implemented, will yield clear performance gains.

Current Progress

Before diving into the roadmap and explaining how this project will progress, it’s important to call out that this project has already been — and will continue to be — an ongoing effort! Today, you can find the project on GitHub and inspect the work that’s already been completed. As of now, the application already combines Workers, Workers KV, Cloudflare for SaaS, and Rate Limiting, with Pages and Durable Objects additions to come in later milestones.

Phase 1 (see below) is nearing completion and, when finished, will mark the end of a very significant milestone. A new update to this blog post series will be issued, covering the highlights and technical overview of the project so far. This is important and immediately useful because, on its own, Phase 1 qualifies as a complete, full-stack application.

Development Milestones

The CMS application will progress in milestones. We have already released the project and will continue to build upon it in accordance with the roadmap (below). GitHub stargazers will be able to keep tabs on its progress or, at the very least, subscribe only to updates for the milestones they’re interested in following.

Each milestone is a sizable checkpoint on its own. As you’ll see below, the project roadmap is planned in a way such that each phase adds a considerable amount of new features and/or integrates with a new Cloudflare product. At every point, the application will remain functional and maintain a live, interactive demo to immediately demonstrate the latest functionality to passersby.

This format is chosen because it’s how real applications — and real products — are developed. Our goal is to ensure that the GitHub repository is never out of date. And, because of the development structure, one may always traverse the list of past milestones to review the changes that were necessary to migrate X or how product Y was integrated.

|| Note: Visit the GitHub milestones to view more details and to subscribe for updates. There is so much more than can be listed here.

Phase 1 – JSON API

The project must begin with some API endpoints to start managing and manipulating data. Using Workers and Workers KV, the work within this milestone will focus on building a robust JSON API that handles the core functionality that the rest of the application will need.

There is no HTML, CSS, or client-side JavaScript involved in this phase. Instead, work here should focus solely on the data: how it’s accessed, how models relate to one another, and how best to structure and store these relationships within Workers KV. For example, individuals should be able to create and manage workspaces that belong to their personal user accounts or to the organization(s) that they belong to.

Additionally, when creating content, the document should be validated against an existing schema that the document was assigned. This feature is critical in any CMS platform that plans to handle thousands of documents within a workspace. Without it, there’s very little confidence that your contents’ JSON representation is consistently structured.

A number of other features are planned — subscription management and invoicing through Stripe, sending transactional emails through SendGrid, and assigning vanity domains to workspaces through Cloudflare for SaaS. Finally, of course, the standard house-keeping tasks will be set up. This includes continuous integration (CI) with API testing and automated, continuous deployments (CD).

By the end of this phase, the project will exist as a collection of API endpoints that, on its own, is a complete application. While it may only be accessible through curl commands (or any other preferred method of manually constructing HTTP requests), the completion of Phase 1 already qualifies the project as a full-stack application that could power a real-world SaaS product.
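
For illustration only (the host name, routes, and payload shape below are hypothetical placeholders rather than the project's actual API), interacting with such an API could look something like this:

$ curl -X POST 'https://cms-api.example.workers.dev/workspaces' \
    -H 'Authorization: Bearer <token>' \
    -H 'Content-Type: application/json' \
    -d '{"name": "my-workspace"}'
$ curl 'https://cms-api.example.workers.dev/workspaces/<workspace-id>/documents'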

Additionally, the repository will include all the best practices for writing tests, automating deployments, and organizing the source for long-term growth and maintenance. And, because we started with the JSON API, the project is immediately useful and capable of integrating with your existing build tools and frameworks. In other words, stargazers could deploy the project to their own accounts as their own personalized Headless CMS. Perhaps some of you will build Gatsby or Eleventy plugins — if you do, please let us know!

Phase 2 — Dashboard UI

As fun as curl may be, most people prefer some form of visual interface they can interact with. This phase will be all about assembling a frontend to serve as the CMS application’s dashboard.

We will use Svelte, a JavaScript framework for building user interfaces. While not everyone may enjoy or agree with this decision, the templating syntax resembles standard HTML markup, which will allow non-frontend developers to follow along and gauge what’s going on.

Svelte will be paired with Tailwind CSS for the design system. Tailwind is a very popular, utility-first CSS framework that allows developers to compose styles through predefined, reusable HTML “class” names.

The result will be a single-page application (SPA) and will be hosted on Cloudflare Pages. This means that, out of the box, the dashboard will be able to take advantage of Access-protected preview deployments, instant rollbacks, automated deployments, comprehensive analytics, and more.

An Open-Source CMS on the Cloudflare Stack: Introductory Post

Finally, now that Pages integrates with Workers directly, the JSON API from Phase 1 will migrate into a new repository structure. While this may seem like an innocent refactor, it actually unlocks an incredible set of features for the JSON API: Access-protected preview deployments, instant rollbacks, and automated deployments. Yes — these are the same Pages features mentioned above! This is amazing because it means that our API is continuously and atomically versioned, allowing its development to continue safely alongside the client dashboard that depends on it. In other words, there is zero risk of the API and the dashboard diverging and their expectations of one another falling out of sync. Instant rollbacks will also apply to the API, since the entire application operates as a single Pages unit.

The previous phase will have built the core SaaS product functionality, but completing this phase will make it feel like a real-world product that can be launched and used on a daily basis. In fact, the end of Phase 2 marks the application as a possible contender in the Headless CMS service space.

Phase 3 — Article Edge-Rendering

The previous phases are focused on assembling a minimum viable Headless CMS product, but Phase 3 looks to grow outside this archetypal box. This will happen by allowing the application to render HTML web pages by injecting the JSON content into predefined templates.

Like WordPress, the CMS application should allow its users to choose whether they want to continue using the “headless” feature or enlist the complete template engine. Should they opt for HTML output, the Cloudflare project will only include a few premade templates that a user may select from — but, of course, this can be customized in your own projects.

Even though this phase reintroduces the monolithic CMS archetype, it’s a significantly safer, faster, and more resilient architecture than the single, all-in-one server of yesteryear. The CMS contents will still be distributed around the world, close to the customers’ readers — but now, the content can be rendered from anywhere in the world at extremely low latencies, too.

Phase 4 — Feature Upgrades

At this stage, the application is — for the most part — complete. It’s functional, looks nice, performs well globally, and can be used in two very distinct ways.

In the context of a real SaaS product, development begins to shift towards adding new features that excite users or towards the ongoing maintenance and health of the project. For example, Phase 4 will utilize Durable Objects to introduce a document editor that allows multiple users to edit the same document in a real-time, collaborative environment.

It’s also very likely that Cloudflare R2 Storage will be introduced as a backend for media assets, allowing users to upload and manage images within a workspace. Or perhaps we decide to use Cloudflare Images for this and R2 is used for importing and exporting content backups.

As you may expect, this milestone is full of unknowns, but that’s because the future holds unlimited possibilities. The project will continue to evolve and expand with Cloudflare and with time.

Of course, if you have ideas or suggestions for features, start a discussion with us on GitHub. We would love to hear from you!

Next Steps

This was the introductory post of (what will be) an ongoing series. When each milestone is completed, we will publish a new post in this series with a retrospective and with technical walkthroughs of key aspects from that chapter’s work.

We’re at the beginning of an exciting journey, and we hope you’re as interested as we are!

You can show your support by starring or following the project on GitHub. All releases, discussions, and milestone tracking will reside within the repository. The next generation of SaaS applications will be built on Cloudflare — subscribe and dive in early!

Highlights from Git 2.34

Post Syndicated from Taylor Blau original https://github.blog/2021-11-15-highlights-from-git-2-34/

The open source Git project just released Git 2.34 with features and bug fixes from over 109 contributors, 29 of them new. We last caught up with you on the latest in Git back when 2.33 was released. To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Sparse index

In the past, we’ve talked about new Git features to make it possible to work with large repositories, like partial clones and sparse-checkout. For a complete description, check out the linked blog posts. But as a refresher, these two features work together to allow you to:

  • Fetch or clone only part of a repository’s objects, and
  • Only populate part of your working copy, typically scoped to a set of
    sub-directories.

This pair of features is designed to create the illusion that you are working in a much smaller repository than you actually are. For instance, if your work takes place in an all-encompassing monorepo, your local copy only needs to contain the parts of the repository that you frequently work in.

But often, this illusion falls short. Why? The answer is the index. The index is the data structure Git uses to track what will be written the next time you run git commit, as well as to track the state of every file in your repository at the current point in history.

As you can imagine, even if you are working in a small corner of a large repository, the index still has to keep track of the repository’s entire contents, not just the parts that you are working in. Unfortunately, that overhead adds up: every time Git needs to work with the index, it has to parse and write out a lot of data that doesn’t affect the parts of your repository outside of your sparse checkout.

That’s changing in this release with the addition of a sparse-enabled index. Unlike the index of previous versions, this release enables the index to only track the parts of your repository that you care about. Specifically, it only contains entries for parts of your repository that are either in your sparse checkout, or at the boundary between your sparse checkout and the rest of the repository.

Collapsing to a sparse index

Triangles represent trees and boxes represent blobs. Left: a representation of a non-sparse index’s contents. Right: a sparse-ified index.

The high-level details here are that the index format now understands that specially marked directories indicate the boundary between the contents of your sparse checkout and the parts of your repository that you don’t have checked out. But the process of implementing this new format, teaching sub-commands how to use it, and making sure that the sparse index can be expanded to a full index is much more detailed.

For all of the details behind this exciting new feature, check out a comprehensive blog post published by Derrick Stolee last week: Making your monorepo feel small with Git’s sparse index.

[source, source, source, source, source, source, source, source]

Multi-pack reachability bitmaps

In a previous blog post, we talked about a new feature to enable reachability bitmaps to keep track of objects stored in multiple packs within your object store.

This release of Git contains the remaining components described in that blog post. If you haven’t read it, here’s a summary. When serving a fetch, a Git server needs to send the client everything reachable from the set of objects they want, less anything reachable from the set that they already have. (You can think of a clone as a “special case” fetch where the client wants everything and has nothing).

In order to compute this set efficiently, Git can use reachability bitmaps. One of these .bitmap files stores a set of bitmaps, each corresponding to some commit. Each individual bitmap is a string of bits, one per object, indicating which objects are reachable from that commit.

In the past, the contents of a reachability bitmap were tied to the order of objects within a single packfile. This meant that a bitmap could only cover objects in one packfile. In other words, bitmaps were only useful if you could efficiently pack the entire contents of your repository down into a single packfile.

For many repositories, writing all objects into the same pack is completely feasible. But the effort it takes to write a pack (including searching for deltas between objects, compressing individual objects, and I/O cost) scales with the size of the pack you’re writing.

Git 2.34 introduces a new bitmap format that is instead tied to the contents of the multi-pack index file. This means that a bitmap can now flexibly represent objects in multiple packs, and server operators no longer need to repack their biggest repositories into a single pack in order to take full advantage of reachability bitmaps.
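
If you operate a Git server and want to experiment with the new format, one way to write a multi-pack reachability bitmap (assuming a Git 2.34 installation and a repository that already contains multiple packs) is:

$ git multi-pack-index write --bitmap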

For more details, including some of the steps required to make this new feature work, see the aforementioned blog post.

[source, source, source]

A new default merge strategy

In an earlier blog post, we explained Git’s newest merge strategy: ort. Here are some of the basics:

When Git needs to merge two branches, it uses one of several “strategy” backends in order to resolve the changes or emit conflicts when two changes cannot be reconciled.

For years, Git has used a strategy called “recursive”. If you have ever done a merge in Git without passing -s <strategy>, then you have almost certainly used the recursive engine. Recursive behaves mostly like a standard three-way merge, with one exception. In the case of “criss-cross” merges (where there isn’t a single merge base), recursive merges multiple bases together in pairs (recursively) in order to produce a single tree which is then treated as the new merge base. This makes it possible to resolve cases where a traditional three-way merge might produce a conflict.

In recent versions of Git, there has been an ongoing effort to replace the recursive strategy with a new one called ort (short for “ostensibly recursive’s twin”). Why do this? There are a few reasons, but perhaps the most compelling is that a rewrite allowed Git to implement a merge strategy that doesn’t operate on the index (that same one we talked about a couple of sections ago)!

ort does just that: it’s a full-blown rewrite of the merge strategy that aims to emulate the same concepts behind recursive while avoiding many of its long-standing performance and correctness problems. In a merge containing many renames, ort outperforms recursive by 500x. For a series of similar merges (like in a rebase operation), the speedup is over 9000x, in part due to ort’s ability to cache and reuse results from previous merges.

These numbers show off some of the worst-case scenarios for recursive, but in testing, ort consistently outperforms recursive with much less variance. In Git 2.34, ort is now the default merge strategy, so you should notice faster merges with fewer bugs just by upgrading.
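
As a quick illustration (the branch name below is just a placeholder), nothing changes in day-to-day usage, and you can still request the strategy explicitly on releases where ort is available but not yet the default:

$ git merge topic          # in Git 2.34, this uses ort by default
$ git merge -s ort topic   # request ort explicitly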

For more details about the ort merge strategy, see our earlier blog post, or any one of a six-part series of posts written by ort’s creator, Elijah Newren: part one, part two, part three, part four, part five, and part six.

[source]

Tidbits

Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.

  • You might be aware that Git allows you to sign your work by attaching your PGP signature to certain objects. For example, the Git project itself publishes tags signed by the maintainer in order to verify that each release comes from someone trustworthy.

    But the experience of using GPG and maintaining keys can be somewhat
    cumbersome. One alternative is to use a new feature of OpenSSH (released
    back in OpenSSH 8.0) that allows using the SSH key you likely already have as a signing key.

    Git 2.34 includes support to take advantage of this feature and allows you to sign your work using SSH keys. To try it out, you can either set user.signingKey to the SSH key you want to use (for example, by asking your ssh-agent for a list with ssh-add -L), or set gpg.format to ssh and gpg.ssh.defaultKeyCommand to ssh-add -L in order to automatically use the first SSH key available.
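
    As a concrete sketch (picking the first key offered by your agent is just one choice; substitute whichever key you prefer):

    $ git config --global gpg.format ssh
    $ git config --global user.signingKey "$(ssh-add -L | head -n 1)"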

    After configuring Git to sign objects using your SSH keys, you can use git commit -S, git merge -S, and git tag -s as usual, and they will automatically use your SSH key.

    For more information about the new configuration options, including information about how to verify SSH signatures with an “allowed signers” file, check out the documentation.

    [source, source, source]

  • If you’ve ever accidentally typed git psuh when you meant push, you
    might have seen this message:

    $ git psuh
    git: 'psuh' is not a git command. See 'git --help'.
    
    The most similar command is
      push
    

    You have always been able to control this behavior by setting the
    help.autoCorrect configuration. You can hide this advice by setting that
    configuration to never, or let Git automatically rerun the most similar
    command for you immediately or with a delay (by setting immediate, or a
    real number of seconds to wait before rerunning your command).

    In Git 2.34, you can now configure Git to ask you interactively whether you
    want to rerun your last operation with the suggested command by setting
    help.autoCorrect to prompt.
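
    For example, enabling the new mode is a single configuration change (the mistyped command below is just for illustration):

    $ git config --global help.autoCorrect prompt
    $ git psuh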

    [source]

In Git 2.34, a handful of patch series were focused on improving the performance of interacting with other repositories. Here’s a pair of tidbits that improves the performance of git fetch and git push:

  • When fetching from a remote, your client needs to do some bookkeeping before
    and after it receives a set of objects from the remote.

    Before anything happens, your client needs to figure out what it has in common with the remote it’s fetching from, and what commits it wants as a result. Previously, this process was somewhat wasteful: Git used to load commit objects directly when they could instead have been read from the commit-graph. In Git 2.34, commits loaded in this code path use the commit-graph when possible, resulting in much improved performance. The effect of this scales with the number of references in your repository: in an example repository with over 2 million references, it cuts the time it takes to fetch a single commit by more than half.

    [source]

  • Another patch series made a handful of improvements to updating local references when fetching, along with some changes to improve fetch negotiation, as well as skipping the connectivity check (which I’ll talk about in more detail in the next tidbit) when the receiving end had already verified the connectedness of the new objects. These changes together contributed similarly impressive performance improvements to the git fetch command.

    [source]

You might have heard of “submodules,” the Git feature that allows combining multiple repositories by storing links to other repositories. Submodules have been somewhat neglected over the years, but this release brought renewed attention to the feature. Here are just some of the changes that enhance submodules:

  • It might be a surprise to learn that, though the majority of Git is written in C, the original git submodule command is actually a shell script!

    The Git project has been converting many of its subcommands written in other languages into C. Reimplementing subcommands as C programs means that
    they can be read and written more easily, take advantage of Git’s comprehensive libraries, and avoid the overhead of spawning many processes, especially on platforms where the new process overhead is rather costly.

    In Git 2.34, many parts of the git submodule command were rewritten in C.
    This project was completed by Atharva Raykar, who is a Google Summer of Code
    student. You can check out their final report here, along with Git’s other GSoC participant ZheNing Hu’s report here.

    [source, source, source]

  • While we’re on the topic of submodules, one thing you might not know is that
    when using commands that deal with objects from both the submodule and the
    repository containing it, the submodule is temporarily added as an alternate
    object store of the other repository!

    Alternates are Git’s object borrowing mechanism, which allow you to in effect link multiple object stores together. When using a repository with alternates, any object lookups that fail to find an object are retried in that repository’s alternate.

    In order to make both the objects in a submodule and the objects in the repository that contains that submodule available to git grep (among a select set of
    other commands), the submodule would temporarily be added as an alternate for the duration of that command.

    If you’re thinking to yourself, “this is a hack”, then you’re not alone. Git has made internal changes to parameterize many functions in terms of a repository (which is usually the global the_repository). This allowed Git to avoid combining multiple repositories via alternates and instead make function calls by passing two (or more) separate repository instances. As a result, Git no longer has to lean on the alternates mechanism as a hack here, and the resulting code is less confusing and less error-prone.

    [source, source, source]

  • One last submodule-related topic (though there are more we couldn’t fit here!). If you are cloning a repository that you know to contain submodules, it is often useful to pass the --recurse-submodules flag, which will cause that repository’s submodules to be cloned and initialized, too.

    But other commands that can optionally recurse into submodules (like git diff, for example) don’t themselves recurse into submodules by default, even when you cloned with --recurse-submodules. In Git 2.34, this is no longer the case, with one caveat: when cloning with --recurse-submodules, other commands only recurse into submodules if the submodule.stickyRecursiveClone configuration is set, to prevent commands from unintentionally running in submodules.
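
    One way to opt in at clone time might look like the following (the URL is a placeholder, and the exact workflow may differ in your setup):

    $ git clone --recurse-submodules \
          --config submodule.stickyRecursiveClone=true \
          https://example.com/repo-with-submodules.git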

    [source]

Now that I’ve listed out a few of the submodule-related changes, let’s get back
to the rest of the tidbits:

  • If you’ve ever scripted around Git, you have almost certainly run into Git’s cat-file plumbing command. This tool can be used to print out a single object (by providing the object name as an argument), a stream of objects (by providing line-delimited object names over stdin), or all objects in your repository (with --batch-all-objects).

    This low-level command accidentally took into account replace refs, which produced confusing results when combined with --batch-all-objects, resulting in it not actually showing all objects in your repository if some were hidden by refs/replace.

    Dropping support for replacement refs made it possible for cat-file to reuse some information when it is given --batch-all-objects. Namely, to populate the list of objects, it iterates each object in each pack and therefore knows the byte offset within each pack where each object can be found. Previous versions of Git did not reuse this information when looking up objects to parse them, but Git 2.34 retains this information.

    This makes it possible to process an object’s metadata much more quickly by avoiding having to locate it twice. In a copy of torvalds/linux, the time it takes to print the name and type of each object (for the curious, that’s git cat-file --batch-check='%(objectname) %(objecttype)' --batch-all-objects --unordered) dropped from 8.1 seconds to just 4.3 seconds.

    [source]

  • There has been a recent concerted effort to remove some memory leaks from Git’s code. Unlike library code, Git typically has a very short runtime. This makes the need to free allocated memory much less urgent, since if a process is about to exit, all memory allocated to it will be “freed” by the operating system.

    A recent patch has made it so that Git’s integration tests can be run in a mode that ensures no memory is leaked (by setting GIT_TEST_PASSING_SANITIZE_LEAK=true in the environment). Since Git’s test suite still contains memory leaks in some tests, a new mode was added to run only tests that have been specifically marked as being leak-free. That way, when Git is compiled with leak detection (by running make SANITIZE=leak), you can easily spot regressions in tests that were supposedly leak-free.
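
    Roughly, building Git with leak detection and running the leak-checked portion of the test suite from a copy of Git's source tree might look like this:

    $ make SANITIZE=leak
    $ GIT_TEST_PASSING_SANITIZE_LEAK=true make test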

    Building off this new infrastructure, there have been many patch series that remove leaks from the code in various places.

    [source, source, source, source, source, source, source, source, source, source, source]

  • When you need to get some debugging information out of a Git process, like what version you’re running, or how much time it spent in a particular region, the trace2 mechanism is a good choice. Often, looking at these logs is like looking at a piece of a puzzle. For example, when you run git fetch, you actually run git fetch-pack, which then invokes git upload-pack on the remote, which itself invokes git pack-objects.

    Trace2 output includes information about when child processes are started and stopped (and consequently, how long they took to run), but what if you’re trying to figure out something more basic than that, like what process you were started by? In other words, if you’re stuck looking at output from a slow git pack-objects, how do you figure out whether it was a fetch (in which case it would have been started by upload-pack) or part of a repository repack (which here would be started by git repack)?

    Git 2.34 includes additional debugging information in trace2 output to indicate the full ancestry of a process, so you can easily read out the name of the program a process was started by, like so:

    $ cat trace2.log
    21:14:38.170730 common-main.c:48                  version 2.34.0.rc1.14.g88d915a634
    21:14:38.170810 common-main.c:49                  start /home/ttaylorr/src/git/git pack-objects git pack-objects --revs --thin --stdout --progress --delta-base-offset
    21:14:38.174325 compat/linux/procinfo.c:170       cmd_ancestry sh <- git-upload-pack <- sh <- git <- zsh <- sshd <- systemd
    

    (Above, you can see that pack-objects was run by git upload-pack, which was run by sh (that’s where we inserted the trace point via uploadpack.packObjectsHook), which was run by git, in my shell, over sshd, which was started by systemd.)

    [source, source]

  • In a previous post, we talked about the background maintenance daemon, which can be used to perform routine repository maintenance in the background (like pre-fetching, or repacking the objects in your repository).

    When this feature was first released back in Git 2.31, it had support for cron on Linux, launchctl on macOS, and schtasks on Windows. Git 2.34 brings support for systemd-based timers on Linux. This has a few benefits over cron: cron may not be available everywhere, and using systemd isolates each service into its own cgroup and writes its logs separately.

    If you want to use systemd instead of the default scheduler, you can run:

    $ git maintenance start --scheduler=systemd
    

    [source]

  • In a previous blog post, we talked about how git rebase works, and how to move a complicated branching structure elsewhere in your repository’s history.

    The brief history is that this used to be done with the --preserve-merges option, which attempted to replay merges elsewhere in history. Confusingly, this mode uses rebase’s interactive machinery internally, so attempting to manually edit the rebase sequence (with git rebase -i) often produced counterintuitive results.

    The --rebase-merges option fixed many of these issues and has been the recommended replacement of --preserve-merges for some time now. In Git 2.34, the --preserve-merges option is now gone for good.

    [source]

  • You might have used git grep to quickly search through your code. But you might not have known that git log has a --grep=<expression> option, which allows you to filter through commits produced by git log to only show ones whose commit messages match the provided expression.

    In previous versions, the --grep option only filtered down which results were presented in the output of git log. But in Git 2.34, git log now knows how to colorize the parts of its output that matched the provided expression.
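
    For example, a search like the following will highlight the matching words of each commit message when the output is sent to a terminal (the search term here is arbitrary):

    $ git log --grep='sparse index'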

    [source]

  • Last but not least, if you’re using Git in a terminal on Windows, you might have noticed that your terminal is left in a weird state after running git commit, or git rebase, like in this issue.

    This was because Git shares its terminal with any child processes it spawns, including your $EDITOR. If your editor sets special terminal settings but does not clear them upon exiting, it can leave your terminal in a broken state.

    Git 2.34 introduces functionality to save and restore the terminal settings before and after launching your editor. That means that even misbehaving editors cannot corrupt your terminal since it will always be restored to the state it was in before launching the editor.

    [source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.34, or any previous version in the Git repository.

Make your monorepo feel small with Git’s sparse index

Post Syndicated from Derrick Stolee original https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/

One way that Git scales to the largest monorepos is the sparse-checkout feature, which allows you to focus on a subset of the files. This is supposed to make it feel like you are actually in a small repository, even though you are contributing to a large repository.

There’s only one problem: the Git index is still large in a monorepo, and users can feel it. Until now.

The Git index is a critical data structure in Git. It serves as the “staging area” between the files you have on your filesystem and your commit history. When you run git add, the files from your working directory are hashed and stored as objects in the index, leading them to be “staged changes”. When you run git commit, the staged changes as stored in the index are used to create that new commit. When you run git checkout, Git takes the data from a commit and writes it to the working directory and the index.

The working directory, index, and commit history

In addition to storing your staged changes, the index also stores filesystem information about your working directory. This helps Git report changed files more quickly. One problem is that the index stores this information for every file at HEAD, even if those files are outside of the sparse-checkout definition. This means that the index can be much larger in a monorepo than it would be if your important subset of files was in its own repository.

Throughout the past year, the Git Fundamentals team at GitHub contributed a new feature to Git called the sparse index, which allows the index to focus on the files within the sparse-checkout cone. If you are in a repository that can use sparse-checkout, then you can enable the sparse index using these commands:

git sparse-checkout init --cone --sparse-index
git sparse-checkout set <dir1> <dir2> ... <dirN>

The size of the sparse index will scale with the number of files within your chosen directories, instead of the full size of your repository. When enabled with a number of other performance features, this can have a dramatic performance impact.

As we built the sparse index, we tested its performance against a large monorepo that has over two million files at HEAD, and with a sparse-checkout definition that populated the working directory with about 100,000 of those files. We then compared the performance between having a normal “full” index, a sparse index, and a repository that only contained the files matching the sparse-checkout definition.

Command performance by index and repository type

The chart above demonstrates the significant performance improvements enabled by the sparse index. The bottom bars for each Git command show the runtime without the sparse index. The middle bars show the runtime of the same commands with the sparse index enabled. The top bars show the runtime of these commands, except in a repository that only contained the files within the sparse-checkout cone, representing the theoretical optimum. Since the sparse index still contains pointers to the rest of the monorepo, there is still some overhead. This overhead is hardly noticeable, since the difference is at most 60 milliseconds, even in the worst case above.

Today, I will go deep into the design and implementation of the sparse index. In particular, I’ll focus on how the Git community made such a significant change to a critical data structure in a safe way. I will include links to the actual changes in the Git codebase as I go.

This post is going to go deep into the guts of Git. If you are unfamiliar with Git’s object model, then learn about commits, trees, and blobs before continuing.

You can also get an overview of the sparse index alongside several other advanced Git features in this presentation I gave with colleague Lessley Dennington at the GitHub Nova event.

First, let’s dig into the Git index to understand its structure and purpose. I’ll use the derrickstolee/trace2-flamechart repository as a concrete example (and to generate some of the figures). If you want to follow along with the Git commands shown, then clone that repository.

The Git index

The index file stores a list of every file at HEAD, along with the object ID for its blob and some metadata. This list of files is stored as a flat list and Git parses the index into an array.

You can expose the list of files in the index using the git ls-files command:

$ git ls-files
LICENSE
README.md
bin/index.js
examples/fetch/git-fetch-after.png
examples/fetch/git-fetch-after.svg
examples/fetch/git-fetch-after.txt
examples/fetch/git-fetch-before.png
examples/fetch/git-fetch-before.svg
examples/fetch/git-fetch-before.txt
examples/fetch/git-fetch-combined.png
examples/fetch/git-fetch-combined.svg
examples/maintenance/trace.png
examples/maintenance/trace.svg
examples/maintenance/trace.txt
package.json

In this repository, many of the files live in an examples/ directory, but you don’t actually need them for the functionality of the code, which lives in the bin/ directory. You can focus the repository only on the necessary files using the git sparse-checkout command:

$ ls
LICENSE     README.md     bin       examples   package.json
$ git sparse-checkout init --cone --sparse-index
$ git sparse-checkout set bin
$ ls
LICENSE     README.md     bin       package.json

Even though I used the sparse-checkout command to reduce the size of the working directory, my git ls-files command will return the same set of files as before. In fact, I can dig in a little more and expose some more information using git ls-files --debug.

$ git ls-files --debug
LICENSE
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 0
README.md
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 0
bin/index.js
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 0
examples/fetch/git-fetch-after.png
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000
(...)

The above output is truncated, but it shows that each index entry contains additional filesystem information for each path. The last entry listed shows what happens for a file that is outside of the sparse-checkout definition: all of the filesystem information is removed and the flags entry has some bits enabled. These bits include a SKIP_WORKTREE bit that signifies that Git will not write that file to the working directory.

If these files are not written to disk, then why are they listed in the index at all? The reason is that Git still needs to understand the content that would be there if the index was expanded. Further, that information is used to generate a commit with the git commit command.

In the Git codebase, there is a test helper that can show additional information from the index: test-tool read-cache. (You won’t have this command if you just have normal Git installed.) Running it here, you can see that the index also stores the object IDs for every file:

$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
100644 blob e0f88531b916b92821476760672e8161b9954898    examples/fetch/git-fetch-after.png
100644 blob f4a523cd1acb0a9d2620970ad7a43405d6e305dc    examples/fetch/git-fetch-after.svg
100644 blob fc4e30dca5fcb0c3d2031dc82a43d5d644e26b41    examples/fetch/git-fetch-after.txt
100644 blob 15dc889965617df3b5a30cf01e52c491e41c59c1    examples/fetch/git-fetch-before.png
100644 blob 602bd5bcbd815914a035d0d4f0d2a3896f600de2    examples/fetch/git-fetch-before.svg
100644 blob bc40a8e4658d17c35de996f3655e737b85ce7ad9    examples/fetch/git-fetch-before.txt
100644 blob 356cdd36e0d78a62af8b010d25d658054bb6fdc7    examples/fetch/git-fetch-combined.png
100644 blob cc0c23f2c8a822c51a17c46268f38c2268b400ae    examples/fetch/git-fetch-combined.svg
100644 blob dfc0893d172d841d971e206461466db935b7c192    examples/maintenance/trace.png
100644 blob a10a876472e46c6ae58e6fc6e2adc64d4dae809b    examples/maintenance/trace.svg
100644 blob 8f5e8bfbc44674feb3aa96e0b7bf1bf717495658    examples/maintenance/trace.txt
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Notice in particular how every single row lists the object type as blob, meaning a file. Also, the bin/index.js file has executable permissions, so the file permission column shows 100755 instead of 100644 like the others. This is all important metadata to store for each index entry.

To visualize the index, the diagram below displays our blobs as boxes in a line in the order given above, but it represents the trees that connect the root tree to those blobs as triangles. Thus, the root tree has two subtrees for bin/ and examples/, and the examples/ subtree has two subtrees for fetch/ and maintenance/.

full index

This figure represents all of the links between the trees and blobs. However, the core index data structure stores only the list of blobs as a flat list. The nesting tree structure does not exist in the core of the file.

However, there is an extension to the format that includes the information of the nesting directories: the cache-tree extension. Each node of the cache-tree stores a list of sub-nodes and a range of index entries that are covered by the current node. Each node stores the object ID for the tree it represents.

The index and the cache tree

The root node always covers all of the blobs in the index. The contained nodes have ranges contained within that range.

Git commands such as git add update the cache-tree extension in order to make the next git commit command very fast. To create the new commit, Git can use the tree from the root of the cache-tree extension.

Many Git commands use the index in different ways. Some commands compare the working directory to the index and update one or the other when there is a difference. Others compare the index and a commit. Some compare multiple indexes together. Some use the cache-tree extension to navigate the nested tree structure, but most scan the flat list of files in the form of an array.

The index affects Git performance at scale

The index can be very large in monorepos. I will show Git performance data from an example monorepo that has over two million files at HEAD. Even using the latest compression techniques available, the index file is over 180 MB in this monorepo.

This has a significant effect on normal Git commands. Presented below is an annotated flamechart of a git status command with one of these large indexes. The x-axis represents time since the start of the command, and each rectangle represents a region of Git’s execution that is marked by its trace2 logging library.

`git status` with full index

Three regions are annotated here:

  1. The index is read from disk and parsed into memory.
  2. The working directory is compared to the index. This triggers a lazy initialization of some hash tables that are required for this effort.
  3. The modified index is written to disk.

Parsing is multi-threaded, but writes are not. This explains some of the differences in how long those actions take.

Clearly, the amount of data in the index is a significant portion of this command. This also affects other commands such as git add and git commit, which are expected to be fast.

This performance concern became abundantly clear when our monorepo customer wanted to group more dependencies within the monorepo. Some teams had isolated Git repositories that were hundreds of times smaller than the monorepo, but these repositories created packages that were consumed by the monorepo, causing complications in tracking dependencies. The hope was that they could merge into the monorepo and rely on sparse-checkout to make it feel like they were still working in a small repository. The user experience was actually much worse, and the root cause was the time it took to read and write the index.

The main culprit is that there are millions of index entries corresponding to files these users do not care about for their daily work. When they push to the server and create a pull request, the build machines can handle the massive scale of building the entire tree and verifying that the small change works within the larger whole of the monorepo. Users should not need to pay that cost.

As members of the Git Fundamentals team, we are very focused on Git performance, and the index has been on our minds for years. For example, the index is a bottleneck for the VFS for Git environment, but that environment has particular needs that prevent improvements in this area.

The biggest thing that has changed recently is the creation of “cone mode” sparse-checkout patterns. These use directory-based pattern matching instead of file-based pattern matching. While cone mode was originally designed to speed up pattern matching in the sparse-checkout feature, it has now unlocked a new way to shrink the index.

The sparse index

The sparse index differs from a normal “full” index in one aspect: it can store directory paths with the object ID for its tree object. This is in addition to the file paths which are paired with blob objects. Since the cone mode sparse-checkout patterns match on a directory level, we can determine that an entire directory is out of the sparse-checkout cone and replace all of its contained file paths with a single directory path.

Back in my example derrickstolee/trace2-flamegraph repository, you can enable the sparse index and then use the test-tool read-cache tool to show the contents of the index.

$ git sparse-checkout init --cone --sparse-index
$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
040000 tree b395192a7adbf21793f9489f3623c117802b2043    examples/
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Similar to my previous visualizations, you can now see how the index contains a directory entry in addition to the four blobs.

sparse index

The sparse directory entries correspond to directories that are just outside of the sparse-checkout definition. These directories also have a cache-tree node whose range is only one entry: that sparse directory entry.

I can even display the full details of the --debug output for git ls-files. This currently requires a --sparse flag that I have implemented in my personal fork of Git, but a similar feature will eventually be available in the core Git client.

$ git ls-files --debug --sparse
LICENSE
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 200000
README.md
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 200000
bin/index.js
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 200000
examples/
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000
package.json
  ctime: 1634910503:288676330
  mtime: 1634910503:288676330
  dev: 16777220 ino: 119325321
  uid: 501  gid: 20
  size: 680 flags: 200000

This output is not truncated as it was before, and you can see that the sparse directory entry for examples/ is the only one with blank filesystem data. It also has the same flags value as the sparse file entries did before.

By removing the number of index entries as well as reducing the average path length, you can shrink the index size significantly. In our example monorepo, most users will reduce their index size from 180 MB to less than 10 MB!

Back to our monorepo, let’s try that git status example again and create a new flamechart. Here, I compare the flamechart for git status with a full index followed by one with a sparse index.

Annotated `git status` flame chart

With the sparse index, the git status command drops from 1.3 seconds to under 200 milliseconds! In the flame chart above I highlighted some regions that have similar appearance in each run. These represent the parts of the git status command that are actually walking the working directory and doing work independent of the index size. Everything else is slower in the full index case entirely because of the size of the index!

Building the sparse index safely

Pruning the index at the directory level is a relatively simple idea with a rather complicated result: our flat list of paths now contains two types of Git objects! There are dozens of places in the Git codebase that interact directly with the index in subtly different ways, and all of them are expecting every index entry to point to a blob object.

In order to make such a change to a critical data structure, we needed to first create a compatibility layer. To safely interact with a sparse index, we needed a way to expand a sparse index to an equivalent full index. This way, code paths that have not been integrated and tested with a sparse index can still be used, even if the on-disk format is sparse.

In the Git codebase, we started by creating the ensure_full_index() method, which converts a sparse index into a full one. This method inspects the list for directory entries and replaces them with their contained file entries. Since the directory is outside of the sparse-checkout cone, we could ignore all of the filesystem metadata and populate the list by traversing the tree objects under that directory. The ensure_full_index() method is called immediately after parsing the index, so no interactions with the index happen until the sparse directories are removed.

Expanding to a full index

When Git expands a sparse index to a full one, it scans the entries in lexicographic order. If the entry is a file, then Git copies it to the new list. If the entry is a directory, then the tree at that location is passed to the read_tree_at() method to iterate over all contained blobs. For each contained blob, Git generates an index entry for the corresponding file. Finally, the index entry list is copied back into the index and the index is no longer sparse.
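To make that expansion concrete, here is a small, self-contained Python sketch of the idea. The real logic lives in C inside ensure_full_index() and read_tree_at(); the IndexEntry class and the toy tree table below are hypothetical stand-ins used purely for illustration:

from dataclasses import dataclass

@dataclass
class IndexEntry:
    path: str        # file path, or directory path ending in "/"
    oid: str         # object ID of the blob (file) or tree (directory)
    is_tree: bool    # True for a sparse directory entry

# Toy object store: tree OID -> list of (relative path, blob OID) pairs.
# In Git this data comes from parsing tree objects on disk.
TREES = {
    "b395192a": [("1.js", "aaaa1111"), ("2.js", "bbbb2222")],
}

def expand_to_full_index(entries):
    """Replace each sparse directory entry with index entries for the
    blobs reachable from its tree, keeping file entries unchanged."""
    full = []
    for entry in sorted(entries, key=lambda e: e.path):
        if not entry.is_tree:
            full.append(entry)              # regular file entry: copy as-is
            continue
        # Sparse directory entry: enumerate the contained blobs from the tree.
        for rel_path, blob_oid in TREES[entry.oid]:
            full.append(IndexEntry(entry.path + rel_path, blob_oid, False))
    return full

sparse = [
    IndexEntry("LICENSE", "646521d0", False),
    IndexEntry("examples/", "b395192a", True),
]
for e in expand_to_full_index(sparse):
    print(e.path, e.oid)

The important property is that the resulting entry list is indistinguishable from one read from a full index, so code paths that have not yet been integrated behave exactly as before.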

Once that protection was in place, we extended Git to write the sparse index format. When writing the index, a full index is converted to a sparse one in-memory using the convert_to_sparse() method.

Collapsing to a sparse index

To convert from a full index to a sparse one, Git uses the cache-tree extension to find the object ID for our new sparse directory entries. The existing file entries are copied, and Git inserts the directory entries as needed.

Once these steps were built, we could verify that the index size was shrinking to the scale we expected. Both of these steps were included in a single series that introduced the format and implemented the conversions.

While it is nice that the index size has shrunk, we couldn’t stop there. The index is a very compact data structure, so it is more efficient to read it from disk than to recreate it by parsing trees. The ensure_full_index() method takes noticeably longer to expand a sparse index to a full one than it would take to read a full one from disk. In order to gain the performance benefits of a sparse index, we needed to teach Git what to do when encountering sparse directory entries.

Before embarking on these integrations, we first set up more guardrails. A new setting, command_requires_full_index, was created and enabled by default. This setting triggers ensure_full_index() upon parsing a sparse index unless the Git command being run has explicitly disabled the setting. This allowed us to integrate with Git commands one-by-one without disrupting the behavior of other Git commands. In addition, we inserted calls to ensure_full_index() before most index interactions to make sure that we were operating on a full index in any code path that might iterate over the index entries. This allowed us to find which code paths needed integration: we could debug a Git command with a breakpoint on ensure_full_index() and see the call stack that led to that expansion.

The first command to integrate with the sparse index was git status. In hindsight, this was a challenging command to use as a starting point, because it performs multiple index operations that are common to other commands. This became clear when integrating with git checkout and git commit because most of the work was already done in the git status integration.

Let’s explore some of the smaller interactions that needed special care with directory entries.

Example implementation detail: git diff

The git diff command can show what is different between different representations of a working directory. There are two interesting cases that involve the index: comparing the working directory to the index, and comparing the index to a commit.

With no other arguments, git diff shows the differences in the working directory compared to what is staged in the index. This algorithm is mostly simple to integrate with the sparse index: while walking the working directory, Git drills into a directory only if it exists. As long as the sparse directory entries in the sparse index do not appear as directories in the working directory, Git never tries to drill into them. If Git finds that a sparse directory entry does exist in the filesystem as a real directory, then the ensure_full_index() method expands the index and Git continues as normal. That expansion is not desired, so we did everything possible to make sure these directories do not exist, including updating the sparse-checkout feature to delete ignored files outside the cone.

The git diff --cached command compares the files staged in the index with the commit at HEAD. Here, it is easier to have differences outside the sparse-checkout cone, such as when using git reset --soft to change the HEAD commit without changing the working directory or index. In this case, the git diff --cached command wants to compare the root tree for the HEAD commit to the files in the index. This can proceed normally for the files that exist in the sparse index, but when we reach a sparse directory entry, we see the tree object staged in the index as well as a tree object from the tree walk. At this point, we shift from a tree-vs-index comparison to a tree-vs-tree comparison of those subtrees. When that diff is complete, we can continue with the larger comparison.

One major benefit to the tree-vs-tree comparison is that it is easier to compare the same type of objects. The recursive comparison can also prune the walk when it finds two subtrees with the same object ID, as all of their content is the same at that point.

This change allows us to report these differences not only in the git diff command, but also in the diffs written during git status, git checkout, and git commit.

Implementation detail: ORT merge strategy

When beginning the sparse index work, there was a huge question that we did not know how to tackle: three-way merges. The default merge strategy, recursive, uses the index as a data structure during its computation. It was going to be difficult to reconcile that algorithm to work within the confines of a sparse index. In fact, many merges that a monorepo user runs would need to resolve merges outside of the sparse-checkout cone.

Luckily, another contributor, Elijah Newren, announced that he was creating a new merge strategy, named the ORT strategy, that did not use the index. We prioritized reviewing and testing that strategy so that we could take advantage of it. It turns out that it is also a better algorithm in general, so it will become the new default strategy with Git 2.34.

The critical feature of the ORT strategy was its replacement of the index with a recursive tree-like structure. That structure is built from the root tree and only creates subtrees for paths that have changed since the merge base. At the end, the ORT strategy creates an index to match its representation of the resulting merge commit.

Because of the ORT merge strategy, integrating the sparse index into git merge, git rebase, git cherry-pick, and git revert was very simple. We just needed to make sure the index that was created at the end of the merge was sparse from its original creation.

Different merge strategies and their performance

As we reported earlier, the ORT strategy improves over the recursive strategy in the typical case, and the recursive strategy also has significant outliers, as shown in the box plot above. Enabling the sparse index on top of the ORT strategy provides even more improvements.

Without the ORT merge strategy, the sparse index work could have easily doubled in scope. For a detailed look at the ORT strategy and its many optimizations, take a look at Elijah's six-part blog series.

Testing the sparse index

The sparse index was touching critical code and doing so in interesting ways. We needed a way to carefully test that these changes were as correct as possible. The Git test suite is substantial and has excellent coverage of most index operations. However, almost all of those tests do not use sparse-checkout, so we couldn’t immediately gain value in checking the sparse index by enabling it globally.

We created a test script that focused on testing the sparse index in a new way. The test starts by creating a repository with some interesting data shapes in it. Then, each test case starts by copying that repository into three new repositories. Those three repositories have different configurations:

  1. The repository as-is, without sparse-checkout.
  2. The repository with cone mode sparse-checkout enabled.
  3. The repository with cone mode sparse-checkout and sparse index enabled.

Then, each test case runs a number of Git commands against all three repositories, expecting the same output and results in the working directory. This allowed us to be confident that the changes we were making to enable the sparse index would have identical behavior with the other two cases.
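As a rough sketch of that comparison idea (the real tests are shell scripts in Git's test suite; the repository names and command list below are placeholders):

import subprocess

# Hypothetical clones prepared ahead of time, matching the three configurations above.
REPOS = ["repo-full", "repo-sparse", "repo-sparse-index"]
COMMANDS = [
    ["git", "status", "--porcelain=v2"],
    ["git", "add", "."],
    ["git", "commit", "-m", "test"],
]

def run(repo, cmd):
    return subprocess.run(cmd, cwd=repo, capture_output=True, text=True)

for cmd in COMMANDS:
    results = [run(repo, cmd) for repo in REPOS]
    baseline = results[0]
    for repo, result in zip(REPOS[1:], results[1:]):
        # Every configuration must agree with the full checkout on both
        # exit code and output for the behavior to count as identical.
        assert result.returncode == baseline.returncode, (repo, cmd)
        assert result.stdout == baseline.stdout, (repo, cmd)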

Along the way, we found some interesting differences between sparse-checkout and full repositories. Several of these bugs have been fixed since. Sometimes, it was unclear whether the sparse-checkout feature should do the same thing as a normal repository, specifically when interacting with files outside of the sparse-checkout cone. This led to changing how some commands interact with sparse entries. In Git 2.34, some commands will need a --sparse flag in order to modify paths outside of the sparse-checkout cone.

In addition to these test scripts, we routinely ran the Scalar functional tests against our development branches, since many of those tests focus on special circumstances around the sparse-checkout feature. If a Scalar test would fail when the Git tests did not, then we would create a similar test in Git to prevent such a bug in the future.

Once we had integrated with a core set of Git commands, we also created an experimental release that contained early versions of these integrations. We provided this version to a subset of monorepo users to evaluate the performance. We found some interesting data from some of the users, but overall the results confirmed that the sparse index was going to significantly improve the user experience in the monorepo. The most important thing we discovered is that the sparse-checkout feature should remove ignored files outside of the cone, as those files cause the sparse index to expand to a full one, negating the performance benefits. There is an additional benefit in that the working directory shrinks even more by deleting these files.

The current state of the sparse index

Not all Git commands understand the sparse index. Those that have not been integrated trigger a compatibility check that converts a sparse index into a full one during the first index read. The integrated commands are ones that have been carefully tested with the sparse index. They likely received some code updates in order to properly handle a sparse index. These integrations were dispersed across the last few Git versions, and some only exist in the microsoft/git fork until we can complete contributing them to the core Git project.

scope

In June, Git 2.32.0 was released with an understanding of the sparse index format. In August, Git 2.33.0 included integrations with git status, git commit, and git checkout. Git 2.34.0 is slated for a November release with integrations for git add, git merge, git rebase, git cherry-pick, and git reset. In order to serve our monorepo users, we fast-tracked some integrations into a pre-release in our fork including integrations with git diff, git blame, git clean, git sparse-checkout, and git stash.

Based on our understanding of how users interact with a monorepo, the commands listed here are sufficient to cover almost all users' needs. We expect that users who adopt the sparse index with these integrations will have a significantly improved experience compared to before, though the exact benefit depends on the size of their monorepo and the size of their sparse-checkout cone.

We will release this feature widely to our monorepo users with the sparse index on by default in the 2.34 release of our Git fork. The core Git project will keep the sparse index off by default until it has all of these features and the implementation has been stable for a few versions.

Looking to the future

At this point, we have covered all of the integrations we need to have a successful monorepo experience. There is more work to be done. In particular, we need to finish contributing the final integrations in our list. Upstream progress takes time and we are grateful for all of the feedback we have received from the community so far. There are more commands that could use integrations, as well.

For now, we are focused on ensuring that monorepo users transition to the sparse index without incident. If you are interested in the sparse index, then we believe it is in an excellent state to start using it. If you have any issues with its performance or stability, then please do not hesitate to create an issue or start a discussion. We will be there to help!

Through the sparse index, we broke through a significant barrier to monorepo scale. We continue to seek the next innovation that will help Git scale to the largest repos out there.

Metasploit Wrap-Up

Post Syndicated from Spencer McIntyre original https://blog.rapid7.com/2021/11/05/metasploit-wrap-up-137/

GitLab RCE

Metasploit Wrap-Up

New Rapid7 team member jbaines-r7 wrote an exploit targeting GitLab via the ExifTool command. Exploiting this vulnerability results in unauthenticated remote code execution as the git user. What makes this module extra neat is the fact that it chains two vulnerabilities together to achieve the desired effect. The first vulnerability is in GitLab itself, which can be leveraged to pass invalid image files to the ExifTool parser; the parser contained the second vulnerability, whereby a specially constructed image could be used to execute code. For even more information on these vulnerabilities, check out Rapid7’s post.

Less Than BulletProof

This week community member h00die submitted another WordPress module. This one leverages an information disclosure vulnerability in the WordPress BulletProof Security plugin that can disclose user credentials from a backup file. These credentials could then be used by a malicious attacker to log in to WordPress if the hashed password can be cracked in an offline attack.

Metasploit Masterfully Manages Meterpreter Metadata

Each Meterpreter implementation is a unique snowflake that often incorporates API commands that others may not. A great example of this is the set of Kiwi commands missing from the Linux Meterpreter. Metasploit now has much better support for modules to identify the functionality they require a Meterpreter session to have in order to run. This will help alleviate the frustration users encounter when they try to run a post module with a Meterpreter type that doesn’t offer the needed functionality. It furthers the Metasploit project’s goal of providing more meaningful error information about post module incompatibilities, which has been an ongoing effort this year.

New module content (3)

  • WordPress BulletProof Security Backup Disclosure by Ron Jost (Hacker5preme) and h00die, which exploits CVE-2021-39327 – This adds an auxiliary module that leverages an information disclosure vulnerability in the BulletproofSecurity plugin for WordPress. This vulnerability is identified as CVE-2021-39327. The module retrieves a backup file, which is publicly accessible, and extracts user credentials from the database backup.
  • GitLab Unauthenticated Remote ExifTool Command Injection by William Bowling and jbaines-r7, which exploits CVE-2021-22204 and CVE-2021-22205 – This adds an exploit for an unauthenticated remote command injection in GitLab via a separate vulnerability within ExifTool. The vulnerabilities are identified as CVE-2021-22204 and CVE-2021-22205.
  • WordPress Plugin Pie Register Auth Bypass to RCE by Lotfi13-DZ and h00die – This exploits an authentication bypass that leads to arbitrary code execution in versions 3.7.1.4 and below of the WordPress plugin pie-register. Supplying a valid admin id to the user_id_social_site parameter in a POST request returns a valid session cookie. With that session cookie, a PHP payload is uploaded as a plugin and then requested, resulting in code execution.

Enhancements and features

  • #15665 from adfoster-r7 – This adds additional metadata to exploit modules to specify Meterpreter command requirements. Metadata information is used to add a descriptive warning when running modules with a Meterpreter implementation that doesn’t support the required command functionality.
  • #15782 from k0pak4 – This updates the iis_internal_ip module to include coverage for the PROPFIND internal IP address disclosure as described by CVE-2002-0422.

Bugs fixed

  • #15805 from timwr – This bumps the metasploit-payloads version to include two bug fixes for the Python Meterpreter.

Get it

As always, you can update to the latest Metasploit Framework with msfupdate, and you can get more details on the changes since the last blog post on GitHub.

If you are a git user, you can clone the Metasploit Framework repo (master branch) for the latest. To install fresh without using git, you can use the open-source-only Nightly Installers or the binary installers (which also include the commercial edition).

Open-Sourcing a Monitoring GUI for Metaflow

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/open-sourcing-a-monitoring-gui-for-metaflow-75ff465f0d60

Open-Sourcing a Monitoring GUI for Metaflow, Netflix’s ML Platform

tl;dr Today, we are open-sourcing a long-awaited GUI for Metaflow. The Metaflow GUI allows data scientists to monitor their workflows in real-time, track experiments, and see detailed logs and results for every executed task. The GUI can be extended with plugins, allowing the community to build integrations to other systems, custom visualizations, and embed upcoming features of Metaflow directly into its views.

Metaflow is a full-stack framework for data science that we started developing at Netflix over four years ago and which we open-sourced in 2019. It allows data scientists to define ML workflows, test them locally, scale out to the cloud, and deploy to production in idiomatic Python code. Since open-sourcing, the Metaflow community has been growing quickly: it is now the 7th most starred active project on Netflix’s GitHub account with nearly 4800 stars. Outside Netflix, Metaflow is used to power machine learning in production by hundreds of companies across industries from bioinformatics to real estate.

Since its inception, Metaflow has been a command-line-centric tool. It makes it easy for data scientists to express even complex machine learning applications in idiomatic Python, test them locally, or scale them out in the cloud — all using their favorite IDEs and terminals. Following our culture of freedom and responsibility, Metaflow grants data scientists the freedom to choose the right modeling approach, handle data and features flexibly, and construct workflows easily while ensuring that the resulting project executes responsibly and robustly on the production infrastructure.

As the number and criticality of projects running on Metaflow increased, some of them central to our business, our ML platform team started receiving an increasing number of support requests. Frequently, the questions were of the nature “can you help me understand why my flow takes so long to execute” or “how can I find the logs for a model that failed last night.” Technically, Metaflow provides a Python API that allows the user to inspect all details, e.g., in a notebook, but writing code in a notebook to answer basic questions like this felt like overkill and unnecessarily tedious. After observing the situation for months, we started forming an understanding of the kind of new user interface that could address the growing needs of our users.
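For example, answering the “model that failed last night” question with the Python client API might look roughly like this in a notebook (a sketch; the flow name is a hypothetical placeholder):

from metaflow import Flow

# Inspect the most recent run of a (hypothetical) flow and locate failed tasks.
run = Flow("MyTrainingFlow").latest_run
print(run.id, run.successful, run.finished_at)

for step in run:
    for task in step:
        if not task.successful:
            print(f"Failed task: {task.pathspec}")
            print(task.stderr)  # captured logs for that task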

Requirements for a Metaflow GUI

Metaflow is a human-centered system by design. We consider our Python API and the CLI to be integral parts of the overall user interface and user experience, which singularly focuses on making it easier to build production-ready ML projects from scratch. In our approach, Python code provides a highly expressive and productive user interface for expressing complex business logic, such as ML models and workflows. At the same time, the CLI allows users to execute specific commands quickly and even automate common actions. When it comes to complex, real-life development work like this, it would be hard to achieve the same level of productivity on a graphical user interface.

However, textual UIs are quite lacking when it comes to discoverability and getting a holistic understanding of the system’s state. The questions we were hearing reflected this gap: we were lacking a user interface that would allow the users, quite simply, to figure out quickly what is happening in their Metaflow projects.

Netflix has a long history of developing innovative tools for observability, so when we began to specify requirements for the new GUI, we were able to leverage experiences from the previous GUIs built for other use cases, as well as real-life user stories from Metaflow users. We wanted to scope the GUI tightly, focusing on a specific gap in the Metaflow experience:

  1. The GUI should allow the users to see what flows and tasks are executing and what is happening inside them. Notably, we didn’t want to replace any of the functionality in the Metaflow APIs or CLI with the GUI — just to complement them. This meant that the GUI would be read-only: all actions like writing code and starting executions should happen on the users’ IDE and terminal as before. We also had no need to build a model-monitoring GUI yet, which is a wholly separate problem domain.
  2. The GUI would be targeted at professional data scientists. Instead of a fancy GUI for demos and presentations, we wanted a serious productivity tool with carefully thought-out user workflows that would fit seamlessly into our toolchain of data science. This requires attention to small details: for instance, users should be able to copy a link to any view in the GUI and share it e.g., on Slack, for easy collaboration and support (or to integrate with the Metaflow Slack bot). And, there should be natural affordances for navigating between the CLI, the GUI, and notebooks.
  3. The GUI should be scalable and snappy: it should handle our existing repository consisting of millions of runs, some of which contain tens of thousands of tasks, without hiccups. Based on our experiences with other GUIs operating at Netflix-scale, this is not a trivial requirement: scalability needs to be baked into the design from the very beginning. Sluggish GUIs are hard to debug and fix afterwards, and they can have a significantly negative impact on productivity.
  4. The GUI should integrate well with other GUIs. A modern ML stack consists of many independent systems like data warehouses, compute layers, model serving systems, and, in particular, notebooks. It should be possible to find runs and tasks of interest in the Metaflow GUI and use a task-specific view to jump to other GUIs for further information. Our landscape of tools is constantly evolving, so we didn’t want to hardcode these links and views in the GUI itself. Instead, following the integration-friendly ethos of Metaflow, we want to embed relevant information in the GUI as plugins.
  5. Finally, we wanted to minimize the operational overhead of the GUI. In particular, under no circumstances should the GUI impact Metaflow executions. The GUI backend should be a simple service, optionally sitting alongside the existing Metaflow metadata service, providing a read-only, real-time view to the stored state. The frontend side should be easily extensible and maintainable, suggesting that we wanted a modern React app.

Monitoring GUI for Metaflow

As our ML Platform team had limited frontend resources, we reached out to Codemate to help with the implementation. As often happens in software engineering projects, the project took longer than expected to finish, mostly because tracking and visualizing thousands of concurrent objects in real time in a highly distributed environment is a surprisingly non-trivial problem (duh!). After countless iterations, we are finally very happy with the outcome, which we have now used in production for a few months.

When you open the GUI, you see an overview of all flows and runs, both current and historical, which you can group and filter in various ways:

Runs Grouped by flows

We can use this view for experiment tracking: Metaflow records every execution automatically, so data scientists can track all their work using this view. Naturally, the view can be grouped by user. They can also tag their runs and filter the view by tags, allowing them to focus on particular subsets of experiments.

After you click a specific run, you see all its tasks on a timeline:

Timeline view for a run

The timeline view is extremely useful in understanding performance bottlenecks, distribution of task runtimes, and finding failed tasks. At the top, you can see global attributes of the run, such as its status, start time, parameters etc. You can click a specific task to see more details:

Task view

This task view shows logs produced by a task, its results, and optionally links to other systems that are relevant to the task. For instance, if the task had deployed a model to a model serving platform, the view could include a link to a UI used for monitoring microservices.

As specified in our requirements, the GUI should work well with Metaflow CLI. To facilitate this, the top bar includes a navigation component where the user can copy-paste any pathspec, i.e., a path to any object in the Metaflow universe, which are prominently shown in the CLI output. This way, the user can easily move from the CLI to the GUI to observe runs and tasks in detail.

While the CLI is great, it is challenging to visualize flows in plain text. Each flow can be represented as a Directed Acyclic Graph (DAG), so the GUI provides a much better way to visualize a flow. The DAG view presents all the steps of a flow and how they are related. Each step may have developer comments, and steps are colored to indicate their current state. Split steps are grouped by shaded boxes, while steps that participated in a foreach are grouped by a double-shaded box. Clicking on a step will take you to the Task view.

DAG View

Users at different organizations will likely have some special use cases that are not directly supported. The Metaflow GUI is extensible through its plugin API. For example, Netflix has its own container orchestration platform called Titus. Users can configure tasks to utilize Titus to scale up or out. When failures happen, users need to access their Titus containers for more information, and within the task view, a simple plugin provides a link for further troubleshooting.

Example task-level plugin

Try it at home!

We know that our user stories and requirements for a Metaflow GUI are not unique to Netflix. A number of companies in the Metaflow community have requested a GUI for Metaflow in the past. To support the thriving community and invite third-party contributions to the GUI, we are open-sourcing our Monitoring GUI for Metaflow today!

You can find detailed instructions for how to deploy the GUI here. If you want to see the GUI in action before deploying it, Outerbounds, a new startup founded by our ex-colleagues, has deployed a public demo instance of the GUI. Outerbounds also hosts an active Slack community of Metaflow users where you can find support for GUI-related issues and share feedback and ideas for improvement.

With the new GUI, data scientists don’t have to fly blind anymore. Instead of reaching out to a platform team for support, they can easily see the state of their workflows on their own. We hope that Metaflow users outside Netflix will find the GUI equally beneficial, and companies will find creative ways to improve the GUI with new plugins.

For more context on the development process and motivation for the GUI, you can watch this recording of the GUI launch meetup.



CAMBI, a banding artifact detector

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/cambi-a-banding-artifact-detector-96777ae12fe2

by Joel Sole, Mariana Afonso, Lukas Krasula, Zhi Li, and Pulkit Tandon

Introducing the banding artifact detector developed by Netflix, aimed at further improving the delivered video quality

Banding artifacts can be pretty annoying. But, first of all, you may wonder, what is a banding artifact?

Banding artifact?

You are at home enjoying a show on your brand-new TV. Great content delivered at excellent quality. But then, you notice some bands in an otherwise beautiful sunset scene. What was that? A sci-fi plot twist? Some device glitch? More likely, banding artifacts, which appear as false staircase edges in what should be smoothly varying image areas.

Bands can show up in the sky in that sunset scene, in dark scenes, in flat backgrounds, etc. In any case, we don’t like them, nor should anybody be distracted from the storyline by their presence.

Just a subtle change in the video signal can cause banding artifacts. This slight variation in the value of some pixels disproportionately impacts the perceived quality. Bands are more visible (and annoying) when the viewing conditions are right: large TV with good contrast and a dark environment without screen reflections.

Some examples below. Since we don’t know where and when you are reading this blog post, we exaggerate the banding artifacts, so you get the gist. The first example is from the opening scene of one of our first shows. Check out the sky. Do you see the bands? The viewing environment (background brightness, ambient lighting, screen brightness, contrast, viewing distance) influences the bands’ visibility. You may play with those factors and observe how the perception of banding is affected.

Banding artifacts are also found in compressed images, as in this one we have often used to illustrate the point:

Even the Voyager encountered banding along the way; xkcd 🙂

How annoying is it?

We set up an experiment to measure the perceived quality in the presence of banding artifacts. We asked participants to rate the impact of the banding artifacts on a scale from 0 (unwatchable) to 100 (imperceptible) for a range of videos with different resolutions, bit-rates, and dithering. Participants rated 86 videos in total. Most of the content was banding-prone, while some not. The collected mean opinion scores (MOS) covered the entire scale.

According to usual metrics, the videos in the experiment with perceptible banding should be mid to high-quality (i.e., PSNR>40dB and VMAF>80). However, the experiment scores show something entirely different, as we’ll see below.

You can’t fix it if you don’t know it’s there

Netflix encodes video at scale. Likewise, video quality is assessed at scale within the encoding pipeline, not by an army of humans rating each video. This is where objective video quality metrics come in, as they automatically provide actionable insights into the actual quality of an encode.

PSNR has been the primary video quality metric for decades: it is based on the average pixel distance of the encoded video to the source video. In the case of banding, this distance is tiny compared to its perceptual impact. Consequently, there is little information about banding in the PSNR numbers. The data from the subjective experiment confirms this lack of correlation between PSNR and MOS:

Another video quality metric is VMAF, which Netflix jointly developed with several collaborators and open-sourced on Github. VMAF has become a de facto standard for evaluating the performance of encoding systems and driving encoding optimizations, being a crucial factor for the quality of Netflix encodes. However, VMAF does not specifically target banding artifacts. It was designed with our streaming use case in mind, in particular, to capture the video quality of movies and shows in the presence of encoding and scaling artifacts. VMAF works exceptionally well in the general case, but, like PSNR, lacks correlation with MOS in the presence of banding:

VMAF, PSNR, and other commonly used video quality metrics don’t detect banding artifacts properly and, if we can’t catch the issue, we cannot take steps to fix it. Ideally, our wish list for a banding detector would include the following items:

  • High correlation with MOS for content distorted with banding artifacts
  • Simple, intuitive, distortion-specific, and based on human visual system principles
  • Consistent performance across the different resolutions, qualities, and bit-depths delivered in our service
  • Robust to dithering, which video pipelines commonly introduce

We didn’t find any algorithm in the literature that fit our purposes. So we set out to develop one.

CAMBI

We hand-crafted an algorithm to meet our requirements in a traditional NNN (non-neural network) way: a white-box solution derived from first principles with just a few visually motivated parameters, the contrast-aware multiscale banding index (CAMBI).

A block diagram describing the steps involved in CAMBI is shown below. CAMBI operates as a no-reference banding detector taking a (distorted) video as an input and producing a banding visibility score as the output. The algorithm extracts pixel-level maps at multiple scales for frames of the encoded video. Subsequently, it combines these maps into a single index motivated by the human contrast sensitivity function (CSF).

Pre-processing

Each input frame goes through up to three pre-processing steps.

The first step extracts the luma component: although chromatic banding exists, like most past works, we assume that most of the banding can be captured in the luma channel. The second step is converting the luma channel to 10-bit (if the input is 8-bit).

Third, we account for the presence of dithering in the frame. Dithering is intentionally applied noise that randomizes quantization error and has been shown to reduce banding visibility. To account for both dithered and non-dithered content, we use a 2×2 filter to smooth the intensity values, replicating the low-pass filtering done by the human visual system.

Multiscale Banding Confidence

We consider banding detection a contrast-detection problem, and hence banding visibility is chiefly governed by the CSF. The CSF itself largely depends on the perceived contrast across a step and the spatial frequency of the steps. CAMBI explicitly accounts for the contrast across pixels by looking at the differences in pixel intensity, and it does this at multiple scales to account for spatial frequency. This is done by calculating pixel-wise banding confidence at different contrasts and scales, each referred to as a CAMBI map for the frame. The banding confidence computation also considers the sensitivity to change in brightness depending on the local brightness. At the end of this process, twenty CAMBI maps are obtained per frame, capturing banding across four contrast steps and five scales.

Spatio-Temporal Pooling

CAMBI maps are spatiotemporally pooled to obtain the final banding index. Spatial pooling is done based on the observation that CAMBI maps belong to the initial linear phase of the CSF. First, pooling is applied in the contrast dimension by keeping the maximum weighted contrast for each position. The result is five maps, one per scale. There is an example of such maps further down in this post.

Since regions with the poorest quality dominate the perceived quality of the video, only a percentage of the pixels, those with the most banding, is considered during spatial pooling for the maps at each scale. The resulting scores per scale are linearly combined with CSF-based weights to derive the CAMBI for each frame.

According to our experiments, CAMBI is temporally stable within a single video shot, so a simple average suffices as a temporal pooling mechanism across frames. However, note that this assumption breaks down for videos with multiple shots with different characteristics.
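The pooling structure described above can be pictured with a short NumPy sketch. This is only an illustration of the steps, not the libvmaf implementation: the map shapes, the CSF-based weights, and the fraction of worst pixels kept are made-up placeholders.

import numpy as np

rng = np.random.default_rng(0)

n_contrasts, n_scales, h, w = 4, 5, 90, 160
# Per-frame CAMBI maps: banding confidence per pixel, contrast step, and scale.
cambi_maps = rng.random((n_contrasts, n_scales, h, w))

# Illustrative placeholders: per-scale CSF-based weights and the fraction of
# worst pixels kept during spatial pooling.
csf_weights = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
worst_fraction = 0.1

def cambi_frame_score(maps):
    # 1. Contrast pooling: keep the maximum contrast contribution per position.
    per_scale = maps.max(axis=0)                      # shape (n_scales, h, w)
    # 2. Spatial pooling: average only the worst `worst_fraction` of pixels.
    k = int(worst_fraction * h * w)
    pooled = np.array([np.sort(m.ravel())[-k:].mean() for m in per_scale])
    # 3. Combine the per-scale scores with CSF-based weights.
    return float(np.dot(csf_weights, pooled))

# 4. Temporal pooling: a simple average over the frames of a shot.
frame_scores = [cambi_frame_score(cambi_maps) for _ in range(3)]
print(np.mean(frame_scores))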

CAMBI agrees with the subjective assessments

Our results show that CAMBI provides a high correlation with MOS while, as illustrated above, VMAF and PSNR have very little correlation. The table reports two correlation coefficients, namely Spearman Rank Order Correlation (SROCC) and Pearson’s Linear Correlation (PLCC):

The following plot visualizes that CAMBI correlates well with subjective scores and that a CAMBI of around 5 is where banding starts to be slightly annoying. Note that, unlike the two quality metrics, CAMBI correlates inversely with MOS: the higher the CAMBI score is, the more perceptible the banding is, and thus the quality is lower.

Staring at the sunset

We use this sunset as an example of banding and how CAMBI scores it. Below we also show the same sunset with fake colors, so bands pop up even more.

There is no banding on the sea part of the image. In the sky, the size of the bands increases as the distance from the sun increases. The five maps below, one per scale, capture the confidence of banding at different spatial frequencies. These maps are further spatially pooled, accounting for the CSF, giving a CAMBI score of 19 for the frame, which perceptually corresponds to somewhere between ‘annoying’ to ‘very annoying’ banding according to the MOS data.

Open-source and next steps

A banding detection mechanism robust to multiple encoding parameters can help identify the onset of banding in videos and serve as the first step towards its mitigation. In the future, we hope to leverage CAMBI to develop a new version of VMAF that can account for banding artifacts.

We open-sourced CAMBI as a new standalone feature in libvmaf. Similar to VMAF, CAMBI is an organic project expected to be gradually improved over time. We welcome any feedback and contributions.

Acknowledgments

We want to thank Christos Bampis, Kyle Swanson, Andrey Norkin, and Anush Moorthy for the fruitful discussions and all the participants in the subjective tests that made this work possible.



Automated security and compliance remediation at HDI

Post Syndicated from Uladzimir Palkhouski original https://aws.amazon.com/blogs/devops/automated-security-and-compliance-remediation-at-hdi/

with Dr. Malte Polley (HDI Systeme AG – Cloud Solutions Architect)

At HDI, one of the largest European insurance groups, we use AWS to build new services and capabilities and delight our customers. Working in the financial services industry, the company has to comply with numerous regulatory requirements in the areas of data protection and FSI regulations, such as GDPR, the German Supervisory Requirements for IT (VAIT), and the Supervision of Insurance Undertakings (VAG). At the same time, the security and compliance assessment process in the cloud has to support development productivity and organizational agility, and help our teams innovate at a high pace and meet the growing demands of our internal and external customers.

In this post, we explore how HDI adopted AWS security and compliance best practices. We describe implementation of automated security and compliance monitoring of AWS resources using a combination of AWS and open-source solutions. We also go through the steps to implement automated security findings remediation and address continuous deployment of new security controls.

Background

Data analytics is the key capability for understanding our customers’ needs, driving business operations improvement, and developing new services, products, and capabilities for our customers. We needed a cloud-native data platform of virtually unlimited scale that offers descriptive and prescriptive analytics capabilities to internal teams with a high innovation pace and short experimentation cycles. One of the success metrics in our mission is time to market, therefore it’s important to provide flexibility to internal teams to quickly experiment with new use cases. At the same time, we’re vigilant about data privacy. Having a secure and compliant cloud environment is a prerequisite for every new experiment and use case on our data platform.

Security and compliance implementation in the cloud is a shared effort between the Cloud Center of Competence team (C3), the Network Operation Center (NoC), and the product and platform teams. The C3 team is responsible for new AWS account provisioning, account security, and compliance baseline setup. Cross-account networking configuration is established and managed by the NoC team. Product teams are responsible for configuring AWS services to meet their requirements in the most efficient way; typically, they deploy and configure infrastructure and application stacks.

We were looking for a security controls model that would allow us to continuously monitor the infrastructure and application components set up by all the teams. The model also needed to provide guardrails that let product teams focus on implementing new use cases while inheriting the security and compliance best practices promoted and ensured within our company.

Security and compliance baseline definition

We started with the AWS Well-Architected Framework Security Pillar whitepaper, which provides implementation guidance on the essential areas of security and compliance in the cloud, including identity and access management, infrastructure security, data protection, detection, and incident response. Although all five elements are equally important for implementing enterprise-grade security and compliance in the cloud, we saw an opportunity to improve on the controls of our on-premises environments by automating the detection and incident response elements. The continuous monitoring of AWS infrastructure and application changes, complemented by automated incident response against the security baseline, helps us foster security best practices and allows for a high innovation pace. Manual security reviews are no longer required to assess security posture.

Our security and compliance controls framework is based on GDPR and several standards and programs, including ISO 27001 and C5. Translating the controls framework into a security and compliance baseline definition in the cloud isn’t always straightforward, so we use a number of guidelines. As a starting point, we use the CIS Amazon Web Services benchmarks, because they are prescriptive recommendations and their controls cover multiple AWS security areas, including identity and access management, logging and monitoring configuration, and network configuration. CIS benchmarks are industry-recognized cyber security best practices and recommendations that cover a wide range of technology families, and are used by enterprise organizations around the world. We also apply the GDPR compliance on AWS recommendations and the AWS Foundational Security Best Practices, extending the controls recommended by the CIS AWS Foundations Benchmark in multiple control areas: inventory, logging, data protection, access management, and more.

Security controls implementation

AWS provides multiple services that help implement security and compliance controls:

  • AWS CloudTrail provides a history of events in an AWS account, including those originating from command line tools, AWS SDKs, AWS APIs, or the AWS Management Console. In addition, it allows exporting event history for further analysis and subscribing to specific events to implement automated remediation.
  • AWS Config allows you to monitor AWS resource configuration, and automatically evaluate and remediate incidents related to unexpected resources configuration. AWS Config comes with pre-built conformance pack sample templates designed to help you meet operational best practices and compliance standards.
  • Amazon GuardDuty provides threat detection capabilities that continuously monitor network activity, data access patterns, and account behavior.

With multiple AWS services to use as building blocks for continuous monitoring and automation, there is a strong need for a consolidated findings overview and a unified remediation framework. This is where AWS Security Hub comes into play. Security Hub provides built-in security standards and controls that make it easy to enable foundational security controls. It integrates with CloudTrail, AWS Config, GuardDuty, and other AWS services out of the box, which eliminates the need to develop and maintain integration code, and it also accepts findings from third-party partner products and provides APIs for custom product integration. Security Hub significantly reduces the effort to consolidate audit information coming from multiple AWS-native and third-party channels. Its API and supported partner products ecosystem gave us confidence that we can adhere to changes in security and compliance standards with low effort.

While AWS provides a rich set of services to manage risk within the Three Lines Model, we were looking for wider community support in maintaining and extending security controls beyond those defined by CIS benchmarks and the compliance and best practices recommendations on AWS. We came across Prowler, an open-source tool focused on AWS security assessment, auditing, and infrastructure hardening. Prowler implements the CIS AWS benchmark controls and has over 100 additional checks. We appreciated that Prowler provides checks that specifically helped us meet GDPR and ISO 27001 requirements. Prowler delivers assessment reports in multiple formats, which makes it easy to archive reports for future auditing needs. In addition, Prowler integrates well with Security Hub, which allows us to use a single service for consolidating security and compliance incidents across a number of channels.

We came up with the solution architecture depicted in the following diagram.

Automated remediation solution architecture HDI

Let’s look closely into the most critical components of this solution.

Prowler is a command line tool implemented as a bash script that uses the AWS Command Line Interface (AWS CLI). Individual Prowler checks are bash scripts organized into groups by compliance standard or AWS service. By supplying the corresponding command line arguments, we can run Prowler against a specific AWS Region or multiple Regions at the same time. Prowler can be run in multiple ways; we chose to run it as an AWS Fargate task for Amazon Elastic Container Service (Amazon ECS). Fargate is a serverless compute engine that runs Docker-compatible containers. Scheduled ECS Fargate tasks make it easy to perform periodic assessments of an AWS account and export findings. We configured Prowler to run every 7 days in every account and Region it’s deployed into.

Security Hub acts as a single place for consolidating security findings from multiple sources. When Security Hub is enabled in a given Region, the CIS AWS Foundations Benchmark and Foundational Security Best Practices standards are enabled as well. Enabling these standards also configures integration with AWS Config and GuardDuty. Integration with Prowler requires enabling product integration on the Security Hub side by calling the EnableImportFindingsForProduct API action for the given product. Because Prowler supports integration with Security Hub out of the box, posting security findings is a matter of passing the right command line arguments: -M json-asff to format reports in the AWS Security Finding Format (ASFF) and -S to ship findings to Security Hub.
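Enabling that product integration is a one-time call per account and Region. With boto3 it looks roughly like the following; treat the Prowler product ARN format as an assumption and verify it against the Security Hub integrations catalog for your Region:

import boto3

securityhub = boto3.client("securityhub", region_name="eu-central-1")

# One-time setup: allow Prowler findings to be imported into Security Hub.
# The ARN format below is an assumption; confirm it in the integrations catalog.
securityhub.enable_import_findings_for_product(
    ProductArn="arn:aws:securityhub:eu-central-1::product/prowler/prowler"
)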

Automated security findings remediation is implemented using AWS Lambda functions and the AWS SDK for Python (Boto3). The remediation function can be triggered in two ways: automatically in response to a new security finding, or by a security engineer from the Security Hub findings page. In both cases, the same Lambda function is used. Remediation functions implement security standards in accordance with recommendations, whether they’re CIS AWS Foundations Benchmark and Foundational Security Best Practices standards, or others.

The exact activities performed depend on the security findings type and its severity. Examples of activities performed include deleting non-rotated AWS Identity and Access Management (IAM) access keys, enabling server-side encryption for S3 buckets, and deleting unencrypted Amazon Elastic Block Store (Amazon EBS) volumes.

To trigger the Lambda function, we use Amazon EventBridge, which makes it easy to build an event-driven remediation engine and allows us to define Lambda functions as targets for Security Hub findings and custom actions. EventBridge allows us to define filters for security findings and therefore map finding types to specific remediation functions. Upon successfully performing security remediation, each function updates one or more Security Hub findings by calling the BatchUpdateFindings API and passing the corresponding finding ID.
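In our solution this wiring is generated by the AWS CDK (more on that below), but the shape of such a rule can be illustrated with plain boto3. The rule name, the finding title used as the filter, and the Lambda ARN are placeholders:

import json

import boto3

events = boto3.client("events", region_name="eu-central-1")

# Match Security Hub findings of one specific type that are in a FAILED state.
event_pattern = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {
        "findings": {
            "Title": ["1.5 Ensure IAM password policy requires at least one uppercase letter"],
            "Compliance": {"Status": ["FAILED"]},
        }
    },
}

events.put_rule(
    Name="remediate-iam-password-policy",
    EventPattern=json.dumps(event_pattern),
)
events.put_targets(
    Rule="remediate-iam-password-policy",
    Targets=[{
        "Id": "remediation-lambda",
        "Arn": "arn:aws:lambda:eu-central-1:111122223333:function:iam-password-policy-remediation",
    }],
)
# Note: the Lambda function also needs a resource-based permission that allows
# events.amazonaws.com to invoke it (lambda add-permission).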

The following example code shows a function enforcing an IAM password policy:

import boto3
import os
import logging
from botocore.exceptions import ClientError

iam = boto3.client("iam")
securityhub = boto3.client("securityhub")

log_level = os.environ.get("LOG_LEVEL", "INFO")
logging.root.setLevel(logging.getLevelName(log_level))
logger = logging.getLogger(__name__)


def lambda_handler(event, context, iam=iam, securityhub=securityhub):
    """Remediate findings related to cis15 and cis11.

    Params:
        event: Lambda event object
        context: Lambda context object
        iam: iam boto3 client
        securityhub: securityhub boto3 client
    Returns:
        No returns
    """
    finding_id = event["detail"]["findings"][0]["Id"]
    product_arn = event["detail"]["findings"][0]["ProductArn"]
    lambda_name = os.environ["AWS_LAMBDA_FUNCTION_NAME"]
    try:
        iam.update_account_password_policy(
            MinimumPasswordLength=14,
            RequireSymbols=True,
            RequireNumbers=True,
            RequireUppercaseCharacters=True,
            RequireLowercaseCharacters=True,
            AllowUsersToChangePassword=True,
            MaxPasswordAge=90,
            PasswordReusePrevention=24,
            HardExpiry=True,
        )
        logger.info("IAM Password Policy Updated")
    except ClientError as e:
        logger.exception(e)
        raise e
    try:
        securityhub.batch_update_findings(
            FindingIdentifiers=[{"Id": finding_id, "ProductArn": product_arn},],
            Note={
                "Text": "Changed non compliant password policy",
                "UpdatedBy": lambda_name,
            },
            Workflow={"Status": "RESOLVED"},
        )
    except ClientError as e:
        logger.exception(e)
        raise e

A key aspect in developing remediation Lambda functions is testability. To quickly iterate through testing cycles, we cover each remediation function with unit tests, in which necessary dependencies are mocked and replaced with stub objects. Because no Lambda deployment is required to check remediation logic, we can test newly developed functions and ensure reliability of existing ones in seconds.

Each Lambda function we develop is accompanied by an event.json document containing an example EventBridge event for a given security finding. A security finding event allows us to verify the remediation logic precisely, including the deletion or suspension of non-compliant resources, the finding status update in Security Hub, and the response returned. Unit tests cover both successful and erroneous remediation logic. We use pytest to develop unit tests, and botocore.stub and moto to replace runtime dependencies with mocks and stubs.
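A unit test for the password policy function above could look roughly like this, using botocore’s Stubber to replace the injected clients. The module name is a hypothetical placeholder, and the event is trimmed to the fields the handler actually reads:

import os

os.environ.setdefault("AWS_DEFAULT_REGION", "eu-central-1")  # needed for the module-level clients

import boto3
from botocore.stub import Stubber

# Hypothetical module name for the handler shown above.
from password_policy_remediation import lambda_handler


def test_password_policy_remediation(monkeypatch):
    monkeypatch.setenv("AWS_LAMBDA_FUNCTION_NAME", "test-remediation")

    iam = boto3.client("iam")
    securityhub = boto3.client("securityhub")
    iam_stub = Stubber(iam)
    hub_stub = Stubber(securityhub)

    # The handler is expected to update the account password policy ...
    iam_stub.add_response("update_account_password_policy", {})
    # ... and then mark the finding as RESOLVED in Security Hub.
    hub_stub.add_response(
        "batch_update_findings",
        {"ProcessedFindings": [], "UnprocessedFindings": []},
    )

    # Trimmed event containing only the fields the handler reads.
    event = {
        "detail": {
            "findings": [
                {"Id": "finding-1",
                 "ProductArn": "arn:aws:securityhub:eu-central-1::product/prowler/prowler"}
            ]
        }
    }

    with iam_stub, hub_stub:
        lambda_handler(event, context=None, iam=iam, securityhub=securityhub)

    iam_stub.assert_no_pending_responses()
    hub_stub.assert_no_pending_responses()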

Automated security findings remediation

The following diagram illustrates our security assessment and automated remediation process.

Automated remediation flow HDI

The workflow includes the following steps:

  1. An existing Security Hub integration performs periodic resource audits. The integration posts new security findings to Security Hub.
  2. Security Hub reports the security incident to the company’s centralized Service Now instance by using the Service Now ITSM Security Hub integration.
  3. Security Hub triggers automated remediation:
    1. Security Hub triggers the remediation function by sending an event to EventBridge. The event has a source field equal to aws.securityhub, a filter ID corresponding to the specific finding type, and a compliance status of FAILED. The combination of these fields allows us to map the event to a particular remediation function.
    2. The remediation function starts processing the security finding event.
    3. The function calls the BatchUpdateFindings Security Hub API to update the security finding status upon completing remediation.
    4. Security Hub updates the corresponding security incident status in Service Now (Step 2)
  4. Alternatively, the security operations engineer resolves the security incident in Service Now:
    1. The engineer reviews the current security incident in Service Now.
    2. The engineer manually resolves the security incident in Service Now.
    3. Service Now updates the finding status by calling the UpdateFindings Security Hub API. Service Now uses the AWS Service Management Connector.
  5. Alternatively, the platform security engineer triggers remediation:
    1. The engineer reviews the currently active security findings on the Security Hub findings page.
    2. The engineer triggers remediation from the security findings page by selecting the appropriate action.
    3. Security Hub triggers the remediation function by sending an event with the source aws.securityhub to EventBridge. The automated remediation flow continues as described in Step 3.

Deployment automation

Due to legal requirements, HDI follows the infrastructure as code (IaC) principle when defining and deploying AWS infrastructure. We started with AWS CloudFormation templates defined in YAML or JSON. The templates are static by nature and define resources in a declarative way. We found that as our solution’s complexity grew, the CloudFormation templates also grew in size and complexity, because every deployed resource has to be explicitly defined. We wanted a solution that would increase our development productivity and simplify infrastructure definition.

The AWS Cloud Development Kit (AWS CDK) helped us in two ways:

  • The AWS CDK provides ready-to-use building blocks called constructs. These constructs include pre-configured AWS services following best practices. For example, a Lambda function always gets an IAM role with an IAM policy to be able to write logs to CloudWatch Logs.
  • The AWS CDK allows us to use high-level programming languages to define configuration of all AWS services. Imperative definition allows us to build our own abstractions and reuse them to achieve concise resource definition.

We found that implementing IaC with the AWS CDK is faster and less error-prone. At HDI, we use Python to build application logic and define AWS infrastructure. The imperative nature of the AWS CDK is truly a turning point in fulfilling legal requirements and achieving high developer productivity at the same time.

One of the AWS CDK constructs we use is the AWS CDK pipeline. This construct creates a customizable continuous integration and continuous delivery (CI/CD) pipeline implemented with AWS CodePipeline. The source action is based on AWS CodeCommit. The synth action is responsible for creating a CloudFormation template from the AWS CDK project; it also runs unit tests on the remediation functions. The pipeline actions are connected via artifacts. Lastly, the AWS CDK pipeline construct offers a self-mutating feature, which allows us to maintain the AWS CDK project as well as the pipeline in a single code repository. Changes to the pipeline definition as well as to the automated remediation solution are deployed seamlessly. The actual solution deployment is also implemented as a CI/CD stage. Stages can eventually be deployed in cross-Region and cross-account patterns. To use cross-account deployments, the AWS CDK provides a bootstrap capability to create a trust relationship between AWS accounts.

The AWS CDK project is broken down into multiple stacks. To deploy the CI/CD pipeline, we run the cdk deploy cicd-4-securityhub command. To add a new Lambda remediation function, we must add the remediation code, optional unit tests, and finally the Lambda remediation configuration object. This configuration object defines the Lambda function’s environment variables, necessary IAM policies, and external dependencies. See the following example code of this configuration:

prowler_729_lambda = {
    "name": "Prowler 7.29",
    "id": "prowler729",
    "description": "Remediates Prowler 7.29 by deleting/terminating unencrypted EC2 instances/EBS volumes",
    "policies": [
        _iam.PolicyStatement(
            effect=_iam.Effect.ALLOW,
            actions=["ec2:TerminateInstances", "ec2:DeleteVolume"],
            resources=["*"])
        ],
    "path": "delete_unencrypted_ebs_volumes",
    "environment_variables": [
        {"key": "ACCOUNT_ID", "value": core.Aws.ACCOUNT_ID}
    ],
    "filter_id": ["prowler-extra729"],
}

Remediation functions are organized in accordance with the security and compliance frameworks they belong to. The AWS CDK code iterates over remediation definition lists and synthesizes corresponding policies and Lambda functions to be deployed later. Committing Git changes and pushing them triggers the CI/CD pipeline, which deploys the newly defined remediation function and adjusts the configuration of Prowler.
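
A sketch of that iteration could look like the following; it assumes CDK v2 Python syntax, a local remediations/ source folder, and a list named remediation_definitions that contains objects like the one shown above, so treat all names as illustrative.

from aws_cdk import Duration, aws_lambda as _lambda

def add_remediation_functions(stack, remediation_definitions):
    # Synthesize one Lambda function per remediation definition
    for definition in remediation_definitions:
        fn = _lambda.Function(
            stack, definition["id"],
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="index.handler",
            code=_lambda.Code.from_asset(f"remediations/{definition['path']}"),
            timeout=Duration.minutes(1),
            environment={e["key"]: e["value"] for e in definition["environment_variables"]},
        )
        # Attach the IAM policy statements declared in the definition
        for statement in definition["policies"]:
            fn.add_to_role_policy(statement)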

We are working on publishing the source code discussed in this blog post.

Looking forward

As we keep introducing new use cases in the cloud, we plan to improve our solution in the following ways:

  • Continuously add new controls based on our own experience and improving industry standards
  • Introduce cross-account security and compliance assessment by consolidating findings in a central security account
  • Improve automated remediation resiliency by introducing remediation failure notifications and retry queues
  • Run a Well-Architected review to identify and address possible areas of improvement

Conclusion

Working on the solution described in this post helped us improve our security posture and meet compliance requirements in the cloud. Specifically, we were able to achieve the following:

  • Gain a shared understanding of security and compliance controls implementation as well as shared responsibilities in the cloud between multiple teams
  • Speed up security reviews of cloud environments by implementing continuous assessment and minimizing manual reviews
  • Provide product and platform teams with secure and compliant environments
  • Lay a foundation for future requirements and improvement of security posture in the cloud

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

About the Authors

Dr. Malte Polley – Cloud Solutions Architect

Dr. Malte Polley is a Cloud Solutions Architect of Modern Data Platform (MDP) at HDI Germany. MDP focuses on DevSecOps practices applied to data analytics and provides secure and compliant environments for every data product at HDI Germany. As a cloud enthusiast, Malte runs the AWS Hannover user group. When not working, Malte enjoys hiking with his family and improving his backyard vegetable garden.

Uladzimir Palkhouski – Sr. Solutions Architect

Uladzimir Palkhouski is a Sr. Solutions Architect at Amazon Web Services. Uladzimir supports German financial services industry customers on their cloud journey. He helps find practical, forward-looking solutions to complex technical and business challenges.

Building an InnerSource ecosystem using AWS DevOps tools

Post Syndicated from Debashish Chakrabarty original https://aws.amazon.com/blogs/devops/building-an-innersource-ecosystem-using-aws-devops-tools/

InnerSource is the term for the emerging practice of organizations adopting the open source methodology, albeit to develop proprietary software. This blog discusses the building of a model InnerSource ecosystem that leverages multiple AWS services, such as CodeBuild, CodeCommit, CodePipeline, CodeArtifact, and CodeGuru, along with other AWS services and open source tools.

What is InnerSource and why is it gaining traction?

Most software companies leverage open source software (OSS) in their products, as it is a great mechanism for standardizing software and bringing in cost-effectiveness via the reuse of high-quality, time-tested code. Some organizations may allow its use as-is, while others may utilize a vetting mechanism to ensure that the OSS adheres to the organization’s standards of security, quality, and so on. This confidence in OSS stems from how these community projects are managed and sustained, as well as the culture of openness, collaboration, and creativity that they nurture.

Many organizations building closed source software are now trying to imitate these development principles and practices. This approach, which has been perhaps more discussed than adopted, is popularly called “InnerSource”. InnerSource serves as a great tool for collaborative software development within the organization’s perimeter, while keeping concerns about intellectual property and legality in check. It provides collaboration and innovation avenues beyond the confines of organizational silos through knowledge and talent sharing. Organizations reap the benefits of better code quality and faster time-to-market, yet at only a fraction of the cost.

What constitutes an InnerSource ecosystem?

Infrastructure and processes that foster collaboration stand at the heart of an InnerSource ecosystem. These systems (refer to Figure 1) typically include tools supporting features such as code hosting, peer reviews, Pull Request (PR) approval flow, issue tracking, documentation, communication & collaboration, continuous integration, and automated testing, among others. Another major component of this system is an entry portal that enables employees to discover InnerSource projects and join the community, beginning as ordinary users of the reusable code and later graduating to contributors and committers.

Figure 1: A typical InnerSource ecosystem

More to InnerSource than meets the eye

This blog focuses on detailing a technical solution for establishing the required tools for an InnerSource system primarily to enable a development workflow and infrastructure. But the secret sauce of an InnerSource initiative in an enterprise necessitates many other ingredients.

Figure 2: InnerSource Roles & Use Cases

InnerSource thrives on community collaboration and a low entry barrier to enable adoption. In turn, that demands a cultural makeover. While strategically deciding on the projects that can be inner sourced as well as the appropriate licensing model, enterprises should bootstrap the initiative with a seed product that draws the community, with maintainers and the first set of contributors. Many of these users would eventually be promoted, through a meritocracy-based system, to become the trusted committers.

Over a set period, the organization should plan to move from an infrastructure-specific model to a project-specific model. In a project-specific InnerSource model, the responsibility for a particular software asset is owned by a dedicated team funded by other business units. In the infrastructure-based InnerSource model, by contrast, the organization provides the necessary infrastructure to create the ecosystem with code & document repositories, communication tools, etc. This enables anybody in the organization to create a new InnerSource project, although each project initiator maintains their own projects. They could begin by establishing a community of practice, and designating a core team that would provide continuing support to the InnerSource projects’ internal customers. Having a team of dedicated resources would clearly indicate the organization’s long-term commitment to sustaining the initiative. The organization should promote this culture through regular boot camps, trainings, and a recognition program.

Lastly, the significance of having a modular architecture in the InnerSource projects cannot be overstated. This architecture helps developers understand the code better, as well as aids code reuse and parallel development, where multiple contributors could work on different code modules while avoiding conflicts during code merges.

A model InnerSource solution using AWS services

This blog discusses a solution that weaves various services together to create the necessary infrastructure for an InnerSource system. While it is not a full-blown solution, and it may lack some other components that an organization may desire in its own system, it can provide you with a good head start.

The ultimate goal of the model solution is to enable a developer workflow as depicted in Figure 3.

Figure 3: Typical developer workflow at InnerSource

At the core of the InnerSource-verse is the distributed version control (AWS CodeCommit in our case). To maintain system transparency, openness, and participation, we must have a discovery mechanism where users could search for the projects and receive encouragement to contribute to the one they prefer (Step 1 in Figure 4).

Figure 4: Architecture diagram for the model InnerSource system

For this purpose, the model solution utilizes an open source reference implementation of the InnerSource Portal. The portal indexes data from AWS CodeCommit by using a crawler, and it lists available projects with associated metadata, such as the skills required, number of active branches, and average number of commits. For CodeCommit, you can use the crawler implementation that we created in the open source code repo at https://github.com/aws-samples/codecommit-crawler-innersource.

The major portal feature is providing an option to contribute to a project by using a “Contribute” link. This can present a pop-up form to “apply as a contributor” (Step 2 in Figure 4), which when submitted sends an email (or creates a ticket) to the project maintainer/committer, who can create an IAM user (Step 3 in Figure 4) with access to the particular repository. Note that the pop-up form functionality is not built into the open source version of the portal; however, it would be trivial to add one with an associated action (send an email, cut a ticket, etc.).
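
As a rough sketch of that onboarding action, the following Python snippet (using boto3) creates an IAM user and grants Git access to a single CodeCommit repository; the user name, policy name, and repository ARN are illustrative assumptions, not part of the portal.

import json
import boto3

iam = boto3.client("iam")

def grant_contributor_access(user_name: str, repo_arn: str) -> None:
    # Create the contributor's IAM user
    iam.create_user(UserName=user_name)
    # Inline policy allowing Git pull/push only on the requested repository
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["codecommit:GitPull", "codecommit:GitPush"],
            "Resource": repo_arn,
        }],
    }
    iam.put_user_policy(
        UserName=user_name,
        PolicyName="innersource-contributor",
        PolicyDocument=json.dumps(policy),
    )

grant_contributor_access("new-contributor", "arn:aws:codecommit:us-east-1:123456789012:my-innersource-repo")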

Figure 5: InnerSource portal indexes CodeCommit repos and provides a bird’s eye view

The contributor, upon receiving access, logs in to CodeCommit, clones the mainline branch of the InnerSource project (Step 4 in Figure 4) into a fix or feature branch, and starts altering or adding code. Once the work is complete, the contributor commits the code to the branch and raises a PR (Step 5 in Figure 4). A pull request is a mechanism for offering code to an existing repository, where it is peer-reviewed and tested before acceptance for inclusion.

The PR triggers a CodeGuru review (Step 6 in Figure 4) that adds its recommendations as comments on the PR. Furthermore, it triggers a CodeBuild process (Steps 7 to 10 in Figure 4) that logs the build result in the PR. At this point, the code can be peer-reviewed by Trusted Committers or Owners of the project repository. The number of approvals required depends on the approval rule template configured in CodeCommit. The Committer(s) can approve the PR (Step 12 in Figure 4) and merge the code to the mainline branch once they verify that the code serves its purpose, has passed the required tests, and doesn’t break the build. They could also rely on the approval vote from a sanity test conducted by a CodeBuild process. Optionally, a build process could deploy the latest mainline code (Step 14 in Figure 4) on the PR merge commit.
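
For reference, an approval rule template could be created and associated with the repository as sketched below in Python with boto3; the template content follows the CodeCommit approval rule schema, and the branch, approver pool, and names are illustrative assumptions.

import json
import boto3

codecommit = boto3.client("codecommit")

# Require two approvals from a trusted-committer role for PRs targeting the mainline branch
template_content = {
    "Version": "2018-11-08",
    "DestinationReferences": ["refs/heads/mainline"],
    "Statements": [{
        "Type": "Approvers",
        "NumberOfApprovalsNeeded": 2,
        "ApprovalPoolMembers": ["arn:aws:sts::123456789012:assumed-role/TrustedCommitters/*"],
    }],
}

codecommit.create_approval_rule_template(
    approvalRuleTemplateName="innersource-two-approvals",
    approvalRuleTemplateDescription="Require two trusted committer approvals on mainline PRs",
    approvalRuleTemplateContent=json.dumps(template_content),
)

codecommit.associate_approval_rule_template_with_repository(
    approvalRuleTemplateName="innersource-two-approvals",
    repositoryName="my-innersource-repo",
)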

To maintain transparency in all communications related to progress, bugs, and feature requests to downstream users and contributors, a communication tool may be needed. This solution does not show integration with any issue/bug tracking tool out of the box. However, several such tools are available in the AWS Marketplace, with some offering forum and wiki add-ons to facilitate discussions. Standard project documentation can be kept within the repository by using the constructs of the README.md file to provide project mission details and the CONTRIBUTING.md file to guide potential code contributors.

An overview of the AWS services used in the model solution

The model solution employs the following AWS services:

  • AWS CodeCommit: a fully managed source control service to host secure and highly scalable private Git repositories.
  • AWS CodeBuild: a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy.
  • AWS CodeDeploy: a service that automates code deployments to any instance, including EC2 instances and instances running on-premises.
  • Amazon CodeGuru: a developer tool providing intelligent recommendations to improve code quality and identify an application’s most expensive lines of code.
  • AWS CodePipeline: a fully managed continuous delivery service that helps automate release pipelines for fast and reliable application and infrastructure updates.
  • AWS CodeArtifact: a fully managed artifact repository service that makes it easy to securely store, publish, and share software packages used in your software development process.
  • Amazon S3: an object storage service that offers industry-leading scalability, data availability, security, and performance.
  • Amazon EC2: a web service providing secure, resizable compute capacity in the cloud. It is designed to ease web-scale computing for developers.
  • Amazon EventBridge: a serverless event bus that eases the building of event-driven applications at scale by using events generated from applications and AWS services.
  • AWS Lambda: a serverless compute service that lets you run code without provisioning or managing servers.

The journey of a thousand miles begins with a single step

InnerSource might not be the right fit for every organization, but it is a great step for those wanting to encourage a culture of quality and innovation, as well as break down silos through enhanced collaboration. It requires backing from leadership to sponsor the engineering initiatives, as well as to champion the establishment of an open and transparent culture that grants developers across the organization the autonomy to contribute to projects outside of their teams. The organizations best suited for InnerSource have already participated in open source initiatives, have engineering teams that are adept with CI/CD tools, and are willing to adopt OSS practices. They should start small with a pilot and build upon their successes.

Conclusion

Ever more enterprises are adopting the open source culture to develop proprietary software by establishing an InnerSource program. This instills innovation, transparency, and collaboration that result in cost-effective, quality software development. This blog discussed a model solution to build the developer workflow inside an InnerSource ecosystem, from project discovery to PR approval and deployment. Additional features, like an integrated issue tracker, real-time chat, and a wiki/forum, can further enrich this solution.

If you need helping hands, AWS Professional Services can help adapt and implement this model InnerSource solution in your enterprise. Moreover, our Advisory services can help establish the governance model to accelerate OSS culture adoption through Experience Based Acceleration (EBA) parties.

About the authors

Debashish Chakrabarty

Debashish is a Senior Engagement Manager at AWS Professional Services, India, managing complex projects on DevOps, security, and modernization, and helping ProServe customers accelerate their adoption of AWS services. He loves to stay connected to his technical roots. Outside of work, Debashish is a Hindi podcaster and blogger. He also loves binge-watching on Amazon Prime and spending time with family.

Akash Verma

Akash works as a Cloud Consultant for AWS Professional Services, India. He enjoys learning new technologies and helping customers solve complex technical problems and drive business outcomes by providing solutions using AWS products and services. Outside of work, Akash loves to travel, interact with new people, and try different cuisines. He also enjoys gardening, watching stand-up comedy, and listening to poetry.

Amazon Managed Service for Prometheus Is Now Generally Available with Alert Manager and Ruler

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-managed-service-for-prometheus-is-now-generally-available-with-alert-manager-and-ruler/

At AWS re:Invent 2020, we introduced the preview of Amazon Managed Service for Prometheus, an open source Prometheus-compatible monitoring service that makes it easy to monitor containerized applications at scale. With Amazon Managed Service for Prometheus, you can use the Prometheus query language (PromQL) to monitor the performance of containerized workloads without having to manage the underlying infrastructure required to scale and secure the ingestion, storage, alert, and querying of operational metrics.

Amazon Managed Service for Prometheus automatically scales as your monitoring needs grow. It is a highly available service deployed across multiple Availability Zones (AZs) that integrates AWS security and compliance capabilities. The service offers native support for PromQL as well as the ability to ingest Prometheus metrics from over 150 Prometheus exporters maintained by the open source community.

With Amazon Managed Service for Prometheus, you can collect Prometheus metrics from Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), and Amazon Elastic Kubernetes Service (Amazon EKS) environments using AWS Distro for OpenTelemetry (ADOT) or Prometheus servers as collection agents.

During the preview, we contributed the high-availability alert manager to the open source Cortex project, a project providing a horizontally scalable, highly available, multi-tenant, long-term store for Prometheus. Also, we reduced the price of metric samples ingested by up to 84 percent, and added support for collecting metrics from AWS Lambda applications via ADOT.

Today, I am happy to announce the general availability of Amazon Managed Service for Prometheus with new features such as alert manager and ruler, which support Amazon Simple Notification Service (Amazon SNS) as a receiver destination for notifications from the alert manager. Through Amazon SNS, you can route these notifications to destinations such as email, webhooks, Slack, PagerDuty, OpsGenie, or VictorOps.

Getting Started with Alert Manager and Ruler
To get started in the AWS Management Console, you can simply create a workspace, a logical space dedicated to the storage, alerting, and querying of metrics from one or more Prometheus servers. You can set up the ingestion of Prometheus metrics to this workspace using Helm and query those metrics. To learn more, see Getting started in the Amazon Managed Service for Prometheus User Guide.

At general availability, we added new alert manager and rules management features. The service supports two types of rules: recording rules and alerting rules. These rules files use the same YAML format as standalone Prometheus, and they are evaluated at regular intervals.

To configure your workspace with a set of rules, choose Add namespace in Rules management and select a YAML format rules file.

An example rules file would record CPU usage metrics for container workloads and trigger an alert if CPU usage stays high for five minutes.
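
A minimal sketch of such a rules file might look like the following; the metric expression, threshold, and rule names are illustrative assumptions rather than values from a real workload.

groups:
  - name: container-cpu
    rules:
      # Recording rule: pre-compute the 5-minute CPU usage rate per job (metric name assumed from container exporters)
      - record: job:container_cpu_usage:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (job)
      # Alerting rule: fire if the recorded CPU usage stays above the threshold for 5 minutes
      - alert: HighContainerCPU
        expr: job:container_cpu_usage:rate5m > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage is high"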

Next, you can create a new Amazon SNS topic, or reuse an existing one, to which the alerts will be routed. The alert manager routes the alerts to SNS, and SNS routes them to downstream locations. Configured alerting rules fire alerts to the alert manager, which deduplicates, groups, and routes them to Amazon SNS via the SNS receiver. If you’d like to receive email notifications for your alerts, configure an email subscription for the SNS topic you created.
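
For instance, the email subscription can be created with a single API call; the sketch below uses boto3, the topic ARN matches the example used later in this post, and the email address is a placeholder.

import boto3

sns = boto3.client("sns")

# Subscribe an email address to the topic that the alert manager publishes to
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:Notifyme",
    Protocol="email",
    Endpoint="oncall@example.com",
)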

To give Amazon Managed Service for Prometheus permission to send messages to your SNS topic, select the topic you plan to send to, and add the access policy block:

{
    "Sid": "Allow_Publish_Alarms",
    "Effect": "Allow",
    "Principal": {
        "Service": "aps.amazonaws.com"
    },
    "Action": [
        "sns:Publish",
        "sns:GetTopicAttributes"
    ],
    "Resource": "arn:aws:sns:us-east-1:123456789012:Notifyme"
}

If you have a topic to receive alerts, you can configure the SNS receiver in the alert manager configuration. The config file uses the same format as Prometheus, but you have to provide the config underneath an alertmanager_config: block for the service’s alert manager. For more information about the alert manager config, visit Alerting Configuration in the Prometheus guide.

alertmanager_config:
  route:
    receiver: default
    repeat_interval: 5m
  receivers:
    - name: default
      sns_configs:
        - topic_arn: "arn:aws:sns:us-east-1:123456789012:Notifyme"
          sigv4:
            region: us-west-2
          attributes:
            key: severity
            value: "warning"

You can replace the topic_arn with the ARN of the topic that you created while setting up the SNS connection. To learn more about the SNS receiver in the alert manager config, visit Prometheus SNS receiver on the Prometheus GitHub page.

To configure the Alert Manager, open the Alert Manager and choose Add definition, then select a YAML format alert config file.

When an alert is created by Prometheus and sent to the Alert Manager, it can be queried by hitting the ListAlerts endpoint to see all the active alerts in the system. After hitting the endpoint, you can see alerts in the list of actively firing alerts.

$ curl https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-0123456/alertmanager/api/v2/alerts
GET /workspaces/ws-0123456/alertmanager/api/v2/alerts HTTP/1.1
Host: aps-workspaces.us-east-1.amazonaws.com
X-Amz-Date: 20210628T171924Z
...
[
  {
    "receivers": [
      {
        "name": "default"
      }
    ],
    "startsAt": "2021-09-24T01:37:42.393Z",
    "updatedAt": "2021-09-24T01:37:42.393Z",
    "endsAt": "2021-09-24T01:37:42.393Z",
    "status": {
      "state": "unprocessed",
    },
    "labels": {
      "severity": "warning"
    }
  }
]

A successful notification will result in an email received from your SNS topic with the alert details. Also, you can output messages in JSON format to be easily processed downstream of SNS by AWS Lambda or other APIs and webhook receiving endpoints. For example, you can connect SNS with a Lambda function for message transformation or triggering automation. To learn more, visit Configuring Alertmanager to output JSON to SNS in the Amazon Managed Service for Prometheus User Guide.
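
A minimal Lambda handler for such downstream processing could look like the sketch below; it assumes the alert manager is configured to publish JSON messages containing an alerts list, and the field names and actions are illustrative.

import json

def handler(event, context):
    # Each SNS delivery wraps the published message in Records[].Sns.Message
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        for alert in message.get("alerts", []):
            severity = alert.get("labels", {}).get("severity", "unknown")
            print(f"Alert (severity={severity}): {alert.get('annotations', {})}")
        # Forward to a chat webhook, open a ticket, or trigger automation here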

Sending from Amazon SNS to Other Notification Destinations
You can connect Amazon SNS to a variety of outbound destinations such as email, webhook (HTTP), Slack, PagerDuty, and OpsGenie.

  • Webhook – To configure a preexisting SNS topic to output messages to a webhook endpoint, first create a subscription to an existing topic. Once active, your HTTP endpoint should receive SNS notifications.
  • Slack – You can either integrate with Slack’s email-to-channel integration where Slack has the ability to accept an email and forward it to a Slack channel, or you can utilize a Lambda function to rewrite the SNS notification to Slack. To learn more, see forwarding emails to Slack channels and AWS Lambda function to convert SNS messages to Slack.
  • PagerDuty – To customize the payload sent to PagerDuty, customize the template used to generate the message sent to SNS by adjusting or updating the template_files block in your alert manager definition.

Available Now
Amazon Managed Service for Prometheus is available today in nine AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (Stockholm), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo).

You pay only for what you use, based on metrics ingested, queried, and stored. As part of the AWS Free Tier, you can get started with Amazon Managed Service for Prometheus for 40 million metric samples ingested and 10 GB metrics stored per month. To learn more, visit the pricing page.

If you want to learn more about observability on AWS, visit the One Observability Workshop, which provides a hands-on experience with the wide variety of toolsets AWS offers to set up monitoring and observability for your applications.

Please send feedback to the AWS forum for Amazon Managed Service for Prometheus or through your usual AWS support contacts.

Channy

New – Amazon Genomics CLI Is Now Open Source and Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-genomics-cli-is-now-open-source-and-generally-available/

Less than 70 years separate us from one of the greatest discoveries of all time: the double helix structure of DNA. We now know that DNA is a sort of a twisted ladder composed of four types of compounds, called bases. These four bases are usually identified by an uppercase letter: adenine (A), guanine (G), cytosine (C), and thymine (T). One of the reasons for the double helix structure is that when these compounds are at the two sides of the ladder, A always bonds with T, and C always bonds with G.

If we unroll the ladder on a table, we’d see two sequences of “letters”, and each of the two sides would carry the same genetic information. For example, here are two series (AGCT and TCGA) bound together:

A – T
G – C
C – G
T – A

These series of letters can be very long. For example, the human genome is composed of over 3 billion letters of code and acts as the biological blueprint of every cell in a person. The information in a person’s genome can be used to create highly personalized treatments to improve the health of individuals and even the entire population. Similarly, genomic data can be used to track infectious diseases, improve diagnosis, and even track epidemics, food pathogens, and toxins. This is the emerging field of environmental genomics.

Accessing genomic data requires genome sequencing, which with recent advances in technology, can be done for large groups of individuals, quickly and more cost-effectively than ever before. In the next five years, genomics datasets are estimated to grow and contain more than a billion sequenced genomes.

How Genomics Data Analysis Works
Genomics data analysis uses a variety of tools that need to be orchestrated as a specific sequence of steps, or a workflow. To facilitate developing, sharing, and running workflows, the genomics and bioinformatics communities have developed specialized workflow definition languages like WDL, Nextflow, CWL, and Snakemake.

However, this process generates petabytes of raw genomic data and experts in genomics and life science struggle to scale compute and storage resources to handle data at such massive scale.

To process data and provide answers quickly, cloud resources like compute, storage, and networking need to be configured to work together with analysis tools. As a result, scientists and researchers often have to spend valuable time deploying infrastructure and modifying open-source genomics analysis tools instead of making contributions to genomics innovations.

Introducing Amazon Genomics CLI
A couple of months ago, we shared the preview of Amazon Genomics CLI, a tool that makes it easier to process genomics data at petabyte scale on AWS. I am excited to share that the Amazon Genomics CLI is now an open source project and is generally available today. You can use it with publicly available workflows as a starting point and develop your analysis on top of these.

Amazon Genomics CLI simplifies and automates the deployment of cloud infrastructure, providing you with an easy-to-use command line interface to quickly set up and run genomics workflows on AWS. By removing the heavy lifting from setting up and running genomics workflows in the cloud, software developers and researchers can automatically provision, configure, and scale cloud resources to enable faster and more cost-effective population-level genetics studies, drug discovery cycles, and more.

Amazon Genomics CLI lets you run your workflows on an optimized cloud infrastructure. More specifically, the CLI:

  • Includes improvements to genomics workflow engines to make them integrate better with AWS, removing the burden to manually modify open-source tools and tune them to run efficiently at scale. These tools work seamlessly across Amazon Elastic Container Service (Amazon ECS), Amazon DynamoDB, Amazon Elastic File System (Amazon EFS), and Amazon Simple Storage Service (Amazon S3), helping you to scale compute and storage and at the same time optimize your costs using features like EC2 Spot Instances.
  • Eliminates the most time-consuming tasks like provisioning storage and compute capacities, deploying the genomics workflow engines, and tuning the clusters used to execute workflows.
  • Automatically increases or decreases cloud resources based on your workloads, which eliminates the risk of buying too much or too little capacity.
  • Tags resources so that you can use tools like AWS Cost & Usage Report to understand the costs related to your genomics data analysis across multiple AWS services.

The use of Amazon Genomics CLI is based on these three main concepts:

Workflow – These are bioinformatics workflows written in languages like WDL or Nextflow. They can be either single script files or packages of multiple files. These workflow script files are workflow definitions and combined with additional metadata, like the workflow language the definition is written in, form a workflow specification that is used by the CLI to execute workflows on appropriate compute resources.

Context – A context encapsulates and automates time-consuming tasks to configure and deploy workflow engines, create data access policies, and tune compute clusters (managed using AWS Batch) for operation at scale.

Project – A project links together workflows, datasets, and the contexts used to process them. From a user perspective, it handles resources related to the same problem or used by the same team.

Let’s see how this works in practice.

Using Amazon Genomics CLI
I follow the instructions to install Amazon Genomics CLI on my laptop. Now, I can use the agc command to manage genomic workloads. I see the available options with:

$ agc --help

The first time I use it, I activate my AWS account:

$ agc account activate

This creates the core infrastructure that Amazon Genomics CLI needs to operate, which includes an S3 bucket, a virtual private cloud (VPC), and a DynamoDB table. The S3 bucket is used for durable metadata, and the VPC is used to isolate compute resources.

Optionally, I can bring my own VPC. I can also use one of my named profiles for the AWS Command Line Interface (CLI). In this way, I can customize the AWS Region and the AWS account used by the Amazon Genomics CLI.

I configure my email address in the local settings. This will be used to tag resources created by the CLI:

$ agc configure email [email protected]

There are a few demo projects in the examples folder included by the Amazon Genomics CLI installation. These projects use different engines, such as Cromwell or Nextflow. In the demo-wdl-project folder, the agc-project.yaml file describes the workflows, the data, and the contexts for the Demo project:

---
name: Demo
schemaVersion: 1
workflows:
  hello:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/hello
  read:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/read
  haplotype:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/haplotype
  words-with-vowels:
    type:
      language: wdl
      version: 1.0
    sourceURL: workflows/words
data:
  - location: s3://gatk-test-data
    readOnly: true
  - location: s3://broad-references
    readOnly: true
contexts:
  myContext:
    engines:
      - type: wdl
        engine: cromwell

  spotCtx:
    requestSpotInstances: true
    engines:
      - type: wdl
        engine: cromwell

For this project, there are four workflows (hello, read, words-with-vowels, and haplotype). The project has read-only access to two S3 buckets and can run workflows using two contexts. Both contexts use the Cromwell engine. One context (spotCtx) uses Amazon EC2 Spot Instances to optimize costs.

In the demo-wdl-project folder, I use the Amazon Genomics CLI to deploy the spotCtx context:

$ agc context deploy -c spotCtx

After a few minutes, the context is ready, and I can execute the workflows. Once started, a context incurs about $0.40 per hour of baseline costs. These costs don’t include the resources created to execute workflows. Those resources depend on your specific use case. Contexts have the option to use spot instances by adding the requestSpotInstances flag to their configuration.

I use the CLI to see the status of the contexts of the project:

$ agc context status

INSTANCE spotCtx STARTED true

Now, let’s look at the workflows included in this project:

$ agc workflow list

2021-09-24T11:15:29+01:00 𝒊  Listing workflows.
WORKFLOWNAME haplotype
WORKFLOWNAME hello
WORKFLOWNAME read
WORKFLOWNAME words-with-vowels

The simplest workflow is hello. The content of the hello.wdl file is quite understandable if you know any programming language:

version 1.0
workflow hello_agc {
    call hello {}
}
task hello {
    command { echo "Hello Amazon Genomics CLI!" }
    runtime {
        docker: "ubuntu:latest"
    }
    output { String out = read_string( stdout() ) }
}

The hello workflow defines a single task (hello) that prints the output of a command. The task is executed on a specific container image (ubuntu:latest). The output is taken from standard output (stdout), the default file descriptor where a process can write output.

Running workflows is an asynchronous process. After submitting a workflow from the CLI, it is handled entirely in the cloud. I can run multiple workflows at a time. The underlying compute resources will automatically scale and I will be charged only for what I use.

Using the CLI, I start the hello workflow:

$ agc workflow run hello -c spotCtx

2021-09-24T13:03:47+01:00 𝒊  Running workflow. Workflow name: 'hello', Arguments: '', Context: 'spotCtx'
fcf72b78-f725-493e-b633-7dbe67878e91

The workflow was successfully submitted, and the last line is the workflow execution ID. I can use this ID to reference a specific workflow execution. Now, I check the status of the workflow:

$ agc workflow status

2021-09-24T13:04:21+01:00 𝒊  Showing workflow run(s). Max Runs: 20
WORKFLOWINSTANCE	spotCtx	fcf72b78-f725-493e-b633-7dbe67878e91	true	RUNNING	2021-09-24T12:03:53Z	hello

The hello workflow is still running. After a few minutes, I check again:

$ agc workflow status

2021-09-24T13:12:23+01:00 𝒊  Showing workflow run(s). Max Runs: 20
WORKFLOWINSTANCE	spotCtx	fcf72b78-f725-493e-b633-7dbe67878e91	true	COMPLETE	2021-09-24T12:03:53Z	hello

The workflow has terminated and is now complete. I look at the workflow logs:

$ agc logs workflow hello

2021-09-24T13:13:08+01:00 𝒊  Showing the logs for 'hello'
2021-09-24T13:13:12+01:00 𝒊  Showing logs for the latest run of the workflow. Run id: 'fcf72b78-f725-493e-b633-7dbe67878e91'
Fri, 24 Sep 2021 13:07:22 +0100	download: s3://agc-123412341234-eu-west-1/scripts/1a82f9a96e387d78ae3786c967f97cc0 to tmp/tmp.498XAhEOy/batch-file-temp
Fri, 24 Sep 2021 13:07:22 +0100	*** LOCALIZING INPUTS ***
Fri, 24 Sep 2021 13:07:23 +0100	download: s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/script to agc-024700040865-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/script
Fri, 24 Sep 2021 13:07:23 +0100	*** COMPLETED LOCALIZATION ***
Fri, 24 Sep 2021 13:07:23 +0100	Hello Amazon Genomics CLI!
Fri, 24 Sep 2021 13:07:23 +0100	*** DELOCALIZING OUTPUTS ***
Fri, 24 Sep 2021 13:07:24 +0100	upload: ./hello-rc.txt to s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-rc.txt
Fri, 24 Sep 2021 13:07:25 +0100	upload: ./hello-stderr.log to s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-stderr.log
Fri, 24 Sep 2021 13:07:25 +0100	upload: ./hello-stdout.log to s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-stdout.log
Fri, 24 Sep 2021 13:07:25 +0100	*** COMPLETED DELOCALIZATION ***
Fri, 24 Sep 2021 13:07:25 +0100	*** EXITING WITH RETURN CODE ***
Fri, 24 Sep 2021 13:07:25 +0100	0

In the logs, I find, as expected, the Hello Amazon Genomics CLI! message printed by the workflow.

I can also look at the content of hello-stdout.log on S3 using the information in the log above:

aws s3 cp s3://agc-123412341234-eu-west-1/project/Demo/userid/danilop20tbvT/context/spotCtx/cromwell-execution/hello_agc/fcf72b78-f725-493e-b633-7dbe67878e91/call-hello/hello-stdout.log -

Hello Amazon Genomics CLI!

It worked! Now, let’s look at more complex workflows. Before I change projects, I destroy the context for the Demo project:

$ agc context destroy -c spotCtx

In the gatk-best-practices-project folder, I list the available workflows for the project:

$ agc workflow list

2021-09-24T11:41:14+01:00 𝒊  Listing workflows.
WORKFLOWNAME	bam-to-unmapped-bams
WORKFLOWNAME	cram-to-bam
WORKFLOWNAME	gatk4-basic-joint-genotyping
WORKFLOWNAME	gatk4-data-processing
WORKFLOWNAME	gatk4-germline-snps-indels
WORKFLOWNAME	gatk4-rnaseq-germline-snps-indels
WORKFLOWNAME	interleaved-fastq-to-paired-fastq
WORKFLOWNAME	paired-fastq-to-unmapped-bam
WORKFLOWNAME	seq-format-validation

In the agc-project.yaml file, the gatk4-data-processing workflow points to a local directory with the same name. This is the content of that directory:

$ ls gatk4-data-processing

MANIFEST.json
processing-for-variant-discovery-gatk4.hg38.wgs.inputs.json
processing-for-variant-discovery-gatk4.wdl

This workflow processes high-throughput sequencing data with GATK4, a genomic analysis toolkit focused on variant discovery.

The directory contains a MANIFEST.json file. The manifest file describes which file contains the main workflow to execute (there can be more than one WDL file in the directory) and where to find input parameters and options. Here’s the content of the manifest file:

{
  "mainWorkflowURL": "processing-for-variant-discovery-gatk4.wdl",
  "inputFileURLs": [
    "processing-for-variant-discovery-gatk4.hg38.wgs.inputs.json"
  ],
  "optionFileURL": "options.json"
}

In the gatk-best-practices-project folder, I create a context to run the workflows:

$ agc context deploy -c spotCtx

Then, I start the gatk4-data-processing workflow:

$ agc workflow run gatk4-data-processing -c spotCtx

2021-09-24T12:08:22+01:00 𝒊  Running workflow. Workflow name: 'gatk4-data-processing', Arguments: '', Context: 'spotCtx'
630e2d53-0c28-4f35-873e-65363529c3de

After a couple of hours, the workflow has terminated:

$ agc workflow status

2021-09-24T14:06:40+01:00 𝒊  Showing workflow run(s). Max Runs: 20
WORKFLOWINSTANCE	spotCtx	630e2d53-0c28-4f35-873e-65363529c3de	true	COMPLETE	2021-09-24T11:08:28Z	gatk4-data-processing

I look at the logs:

$ agc logs workflow gatk4-data-processing

...
Fri, 24 Sep 2021 14:02:32 +0100	*** DELOCALIZING OUTPUTS ***
Fri, 24 Sep 2021 14:03:45 +0100	upload: ./NA12878.hg38.bam to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/NA12878.hg38.bam
Fri, 24 Sep 2021 14:03:46 +0100	upload: ./NA12878.hg38.bam.md5 to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/NA12878.hg38.bam.md5
Fri, 24 Sep 2021 14:03:47 +0100	upload: ./NA12878.hg38.bai to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/NA12878.hg38.bai
Fri, 24 Sep 2021 14:03:48 +0100	upload: ./GatherBamFiles-rc.txt to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/GatherBamFiles-rc.txt
Fri, 24 Sep 2021 14:03:49 +0100	upload: ./GatherBamFiles-stderr.log to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/GatherBamFiles-stderr.log
Fri, 24 Sep 2021 14:03:50 +0100	upload: ./GatherBamFiles-stdout.log to s3://agc-123412341234-eu-west-1/project/GATK/userid/danilop20tbvT/context/spotCtx/cromwell-execution/PreProcessingForVariantDiscovery_GATK4/630e2d53-0c28-4f35-873e-65363529c3de/call-GatherBamFiles/GatherBamFiles-stdout.log
Fri, 24 Sep 2021 14:03:50 +0100	*** COMPLETED DELOCALIZATION ***
Fri, 24 Sep 2021 14:03:50 +0100	*** EXITING WITH RETURN CODE ***
Fri, 24 Sep 2021 14:03:50 +0100	0

Results have been written to the S3 bucket created during the account activation. The name of the bucket is in the logs but I can also find it stored as a parameter by AWS Systems Manager. I can save it in an environment variable with the following command:

$ export AGC_BUCKET=$(aws ssm get-parameter \
  --name /agc/_common/bucket \
  --query 'Parameter.Value' \
  --output text)

Using the AWS Command Line Interface (CLI), I can now explore the results on the S3 bucket and get the outputs of the workflow.
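
For example, a recursive listing of the run prefix (the path below is illustrative and follows the structure shown in the logs above) shows the BAM files and logs produced by the workflow:

$ aws s3 ls s3://${AGC_BUCKET}/project/GATK/ --recursive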

Before looking at the results, I remove the resources that I don’t need by destroying the context. This destroys all compute resources but retains the data in S3.

$ agc context destroy -c spotCtx

Additional examples on configuring different contexts and running additional workflows are provided in the documentation on GitHub.

Availability and Pricing
Amazon Genomics CLI is an open source tool, and you can use it today in all AWS Regions with the exception of AWS GovCloud (US) and Regions located in China. There is no cost for using the Amazon Genomics CLI. You pay for the AWS resources created by the CLI.

With the Amazon Genomics CLI, you can focus on science instead of architecting infrastructure. This gets you up and running faster, enabling research, development, and testing workloads. For production workloads that scale to several thousand parallel workflows, we can provide recommended ways to leverage additional Amazon services, like AWS Step Functions; just reach out to our account teams for more information.

Danilo

New for AWS Distro for OpenTelemetry – Tracing Support is Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-distro-for-opentelemetry-tracing-support-is-now-generally-available/

Last year before re:Invent, we introduced the public preview of AWS Distro for OpenTelemetry, a secure distribution of the OpenTelemetry project supported by AWS. OpenTelemetry provides tools, APIs, and SDKs to instrument, generate, collect, and export telemetry data to better understand the behavior and the performance of your applications. Yesterday, upstream OpenTelemetry announced the tracing stability milestone for its components. Today, I am happy to share that support for traces is now generally available in AWS Distro for OpenTelemetry.

Using OpenTelemetry, you can instrument your applications just once and then send traces to multiple monitoring solutions.

You can use AWS Distro for OpenTelemetry to instrument your applications running on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (EKS), and AWS Lambda, as well as on premises. Containers running on AWS Fargate and orchestrated via either ECS or EKS are also supported.

You can send tracing data collected by AWS Distro for OpenTelemetry to AWS X-Ray, as well as to partner destinations.

You can use auto-instrumentation agents to collect traces without changing your code. Auto-instrumentation is available today for Java and Python applications. Auto-instrumentation support for Python currently only covers the AWS SDK. You can instrument your applications using other programming languages (such as Go, Node.js, and .NET) with the OpenTelemetry SDKs.

Let’s see how this works in practice for a Java application.

Visualizing Traces for a Java Application Using Auto-Instrumentation
I create a simple Java application that shows the list of my Amazon Simple Storage Service (Amazon S3) buckets and my Amazon DynamoDB tables:

package com.example.myapp;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;
import software.amazon.awssdk.services.dynamodb.model.DynamoDbException;
import software.amazon.awssdk.services.dynamodb.model.ListTablesResponse;
import software.amazon.awssdk.services.dynamodb.model.ListTablesRequest;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

import java.util.List;

/**
 * Hello world!
 *
 */
public class App {

    public static void listAllTables(DynamoDbClient ddb) {

        System.out.println("DynamoDB Tables:");

        boolean moreTables = true;
        String lastName = null;

        while (moreTables) {
            try {
                ListTablesResponse response = null;
                if (lastName == null) {
                    ListTablesRequest request = ListTablesRequest.builder().build();
                    response = ddb.listTables(request);
                } else {
                    ListTablesRequest request = ListTablesRequest.builder().exclusiveStartTableName(lastName).build();
                    response = ddb.listTables(request);
                }

                List<String> tableNames = response.tableNames();

                if (tableNames.size() > 0) {
                    for (String curName : tableNames) {
                        System.out.format("* %s\n", curName);
                    }
                } else {
                    System.out.println("No tables found!");
                    System.exit(0);
                }

                lastName = response.lastEvaluatedTableName();
                if (lastName == null) {
                    moreTables = false;
                }
            } catch (DynamoDbException e) {
                System.err.println(e.getMessage());
                System.exit(1);
            }
        }

        System.out.println("Done!\n");
    }

    public static void listAllBuckets(S3Client s3) {

        System.out.println("S3 Buckets:");

        ListBucketsRequest listBucketsRequest = ListBucketsRequest.builder().build();
        ListBucketsResponse listBucketsResponse = s3.listBuckets(listBucketsRequest);
        listBucketsResponse.buckets().stream().forEach(x -> System.out.format("* %s\n", x.name()));

        System.out.println("Done!\n");
    }

    public static void listAllBucketsAndTables(S3Client s3, DynamoDbClient ddb) {
        listAllBuckets(s3);
        listAllTables(ddb);
    }

    public static void main(String[] args) {

        Region region = Region.EU_WEST_1;

        S3Client s3 = S3Client.builder().region(region).build();
        DynamoDbClient ddb = DynamoDbClient.builder().region(region).build();

        listAllBucketsAndTables(s3, ddb);

        s3.close();
        ddb.close();
    }
}

I package the application using Apache Maven. Here’s the Project Object Model (POM) file managing dependencies such as the AWS SDK for Java 2.x that I use to interact with S3 and DynamoDB:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <groupId>com.example.myapp</groupId>
  <artifactId>myapp</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>myapp</name>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>bom</artifactId>
        <version>2.17.38</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>s3</artifactId>
    </dependency>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>dynamodb</artifactId>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <source>8</source>
          <target>8</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.example.myapp.App</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

I use Maven to create an executable Java Archive (JAR) file that includes all dependencies:

$ mvn clean compile assembly:single

To run the application and get tracing data, I need two components: the AWS Distro for OpenTelemetry Collector and the auto-instrumentation Java agent.

In one terminal, I run the AWS Distro for OpenTelemetry Collector using Docker:

$ docker run --rm -p 4317:4317 -p 55680:55680 -p 8889:8888 \
         -e AWS_REGION=eu-west-1 \
         -e AWS_PROFILE=default \
         -v ~/.aws:/root/.aws \
         --name awscollector public.ecr.aws/aws-observability/aws-otel-collector:latest

The collector is now ready to receive traces and forward them to a monitoring platform. By default, the AWS Distro for OpenTelemetry Collector sends traces to AWS X-Ray. I can change the exporter or add more exporters by editing the collector configuration. For example, I can follow the documentation to configure OTLP exporters to send telemetry data using the OTLP protocol. In the documentation, I also find how to configure other partner destinations.
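
As an illustration, a collector configuration that keeps the awsxray exporter and adds an OTLP exporter might look like the following sketch; the endpoint, Region, and pipeline layout are placeholder assumptions, not a recommended setup.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  awsxray:
    region: eu-west-1
  otlp:
    endpoint: otel-backend.example.com:4317   # illustrative partner/backend endpoint
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray, otlp]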

I download the latest version of the AWS Distro for OpenTelemetry Auto-Instrumentation Java Agent. Now, I run my application and use the agent to capture telemetry data without having to add any specific instrumentation to the code. In the OTEL_RESOURCE_ATTRIBUTES environment variable, I set a name and a namespace for the service:

$ OTEL_RESOURCE_ATTRIBUTES=service.name=MyApp,service.namespace=MyTeam \
  java -javaagent:otel/aws-opentelemetry-agent.jar \
       -jar myapp/target/myapp-1.0-SNAPSHOT-jar-with-dependencies.jar

As expected, I get the list of my S3 buckets globally and of the DynamoDB tables in the Region.

To generate more tracing data, I run the previous command a few times. Each time I run the application, telemetry data is collected by the agent and sent to the collector. The collector buffers the data and then sends it to the configured exporters. By default, it is sending traces to X-Ray.

Now, I look at the service map in the AWS X-Ray console to see my application’s interactions with other services:

Console screenshot.

And there they are! Without any change in the code, I see my application’s calls to the S3 and DynamoDB APIs. There were no errors, and all the circles are green. Inside the circles, I find the average latency of the invocations and the number of transactions per minute.

Adding Spans to a Java Application
The information automatically collected can be improved by providing more information with the traces. For example, I might have interactions with the same service in different parts of my application, and it would be useful to separate those interactions in the service map. In this way, if there is an error or high latency, I would know which part of my application is affected.

One way to do so is to use spans or segments. A span represents a group of logically related activities. For example, the listAllBucketsAndTables method is performing two operations, one with S3 and one with DynamoDB. I’d like to group them together in a span. The quickest way with OpenTelemetry is to add the @WithSpan annotation to the method. Because the result of a method usually depends on its arguments, I also use the @SpanAttribute annotation to describe which arguments in the method invocation should be automatically added as attributes to the span.

@WithSpan
    public static void listAllBucketsAndTables(@SpanAttribute("title") String title, S3Client s3, DynamoDbClient ddb) {

        System.out.println(title);

        listAllBuckets(s3);
        listAllTables(ddb);
    }

To be able to use the @WithSpan and @SpanAttribute annotations, I need to import them into the code and add the necessary OpenTelemetry dependencies to the POM. All these changes are based on the OpenTelemetry specifications and don’t depend on the actual implementation that I am using, or on the tool that I will use to visualize or analyze the telemetry data. I have only to make these changes once to instrument my application. Isn’t that great?

To better see how spans work, I create another method that is running the same operations in reverse order, first listing the DynamoDB tables, then the S3 buckets:

    @WithSpan
    public static void listTablesFirstAndThenBuckets(@SpanAttribute("title") String title, S3Client s3, DynamoDbClient ddb) {

        System.out.println(title);

        listAllTables(ddb);
        listAllBuckets(s3);
    }

The application is now running the two methods (listAllBucketsAndTables and listTablesFirstAndThenBuckets) one after the other. For simplicity, here’s the full code of the instrumented application:

package com.example.myapp;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;
import software.amazon.awssdk.services.dynamodb.model.DynamoDbException;
import software.amazon.awssdk.services.dynamodb.model.ListTablesResponse;
import software.amazon.awssdk.services.dynamodb.model.ListTablesRequest;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

import java.util.List;

import io.opentelemetry.extension.annotations.SpanAttribute;
import io.opentelemetry.extension.annotations.WithSpan;

/**
 * Hello world!
 *
 */
public class App {

    public static void listAllTables(DynamoDbClient ddb) {

        System.out.println("DynamoDB Tables:");

        boolean moreTables = true;
        String lastName = null;

        while (moreTables) {
            try {
                ListTablesResponse response = null;
                if (lastName == null) {
                    ListTablesRequest request = ListTablesRequest.builder().build();
                    response = ddb.listTables(request);
                } else {
                    ListTablesRequest request = ListTablesRequest.builder().exclusiveStartTableName(lastName).build();
                    response = ddb.listTables(request);
                }

                List<String> tableNames = response.tableNames();

                if (tableNames.size() > 0) {
                    for (String curName : tableNames) {
                        System.out.format("* %s\n", curName);
                    }
                } else {
                    System.out.println("No tables found!");
                    System.exit(0);
                }

                lastName = response.lastEvaluatedTableName();
                if (lastName == null) {
                    moreTables = false;
                }
            } catch (DynamoDbException e) {
                System.err.println(e.getMessage());
                System.exit(1);
            }
        }

        System.out.println("Done!\n");
    }

    public static void listAllBuckets(S3Client s3) {

        System.out.println("S3 Buckets:");

        ListBucketsRequest listBucketsRequest = ListBucketsRequest.builder().build();
        ListBucketsResponse listBucketsResponse = s3.listBuckets(listBucketsRequest);
        listBucketsResponse.buckets().stream().forEach(x -> System.out.format("* %s\n", x.name()));

        System.out.println("Done!\n");
    }

    @WithSpan
    public static void listAllBucketsAndTables(@SpanAttribute("title") String title, S3Client s3, DynamoDbClient ddb) {

        System.out.println(title);

        listAllBuckets(s3);
        listAllTables(ddb);

    }

    @WithSpan
    public static void listTablesFirstAndThenBuckets(@SpanAttribute("title") String title, S3Client s3, DynamoDbClient ddb) {

        System.out.println(title);

        listAllTables(ddb);
        listAllBuckets(s3);

    }

    public static void main(String[] args) {

        Region region = Region.EU_WEST_1;

        S3Client s3 = S3Client.builder().region(region).build();
        DynamoDbClient ddb = DynamoDbClient.builder().region(region).build();

        listAllBucketsAndTables("My S3 buckets and DynamoDB tables", s3, ddb);
        listTablesFirstAndThenBuckets("My DynamoDB tables first and then S3 bucket", s3, ddb);

        s3.close();
        ddb.close();
    }
}

And here’s the updated POM that includes the additional OpenTelemetry dependencies:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <groupId>com.example.myapp</groupId>
  <artifactId>myapp</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>myapp</name>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>bom</artifactId>
        <version>2.16.60</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>s3</artifactId>
    </dependency>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>dynamodb</artifactId>
    </dependency>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-extension-annotations</artifactId>
      <version>1.5.0</version>
    </dependency>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-api</artifactId>
      <version>1.5.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <source>8</source>
          <target>8</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.example.myapp.App</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

I compile my application with these changes and run it again a few times:

$ mvn clean compile assembly:single

$ OTEL_RESOURCE_ATTRIBUTES=service.name=MyApp,service.namespace=MyTeam \
  java -javaagent:otel/aws-opentelemetry-agent.jar \
       -jar myapp/target/myapp-1.0-SNAPSHOT-jar-with-dependencies.jar

Now, let’s look at the X-Ray service map, computed using the additional information provided by those annotations.

Console screenshot.

Now I see the two methods and the other services they invoke. If there are errors or high latency, I can easily understand how the two methods are affected.

In the Traces section of the X-Ray console, I look at the Raw data for some of the traces. Because the title argument was annotated with @SpanAttribute, each trace has the value of that argument in the metadata section.

Console screenshot.

Collecting Traces from Lambda Functions
The previous steps work on premises, on EC2, and with applications running in containers. To collect traces and use auto-instrumentation with Lambda functions, you can use the AWS managed OpenTelemetry Lambda Layers (a few examples are included in the repository).

After you add the Lambda layer to your function, you can use the environment variable OPENTELEMETRY_COLLECTOR_CONFIG_FILE to pass your own configuration to the collector. More information on using AWS Distro for OpenTelemetry with AWS Lambda is available in the documentation.
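
For example, here's a minimal sketch of wiring this up with the AWS CLI. The function name, layer ARN, and configuration file path are placeholders; the actual layer ARN depends on your Region and runtime, and update-function-configuration replaces any environment variables already set on the function.

$ aws lambda update-function-configuration \
    --function-name my-function \
    --layers arn:aws:lambda:eu-west-1:123456789012:layer:aws-otel-java-agent:1 \
    --environment "Variables={OPENTELEMETRY_COLLECTOR_CONFIG_FILE=/var/task/collector.yaml}"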

Availability and Pricing
You can use AWS Distro for OpenTelemetry to get telemetry data from your application running on premises and on AWS. There are no additional costs for using AWS Distro for OpenTelemetry. Depending on your configuration, you might pay for the AWS services that are destinations for OpenTelemetry data, such as AWS X-Ray, Amazon CloudWatch, and Amazon Managed Service for Prometheus (AMP).

To learn more, you are invited to this webinar on Thursday, October 7 at 10:00 am PT / 1:00 pm EDT / 7:00 pm CEST.

Simplify the instrumentation of your applications and improve their observability using AWS Distro for OpenTelemetry today.

Danilo

Amazon EKS Anywhere – Now Generally Available to Create and Manage Kubernetes Clusters on Premises

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-eks-anywhere-now-generally-available-to-create-and-manage-kubernetes-clusters-on-premises/

At AWS re:Invent 2020, we preannounced new deployment options of Amazon Elastic Container Service (Amazon ECS) Anywhere and Amazon Elastic Kubernetes Service (Amazon EKS) Anywhere in your own data center.

Today, I am happy to announce the general availability of Amazon EKS Anywhere, a deployment option for Amazon EKS that enables you to easily create and operate Kubernetes clusters on premises using VMware vSphere. EKS Anywhere provides an installable software package for creating and operating Kubernetes clusters on premises and automation tooling for cluster lifecycle support.

EKS Anywhere brings a consistent AWS management experience to your data center, building on the strengths of Amazon EKS Distro, an open-source distribution for Kubernetes used by Amazon EKS.

EKS Anywhere is also Open Source. You can reduce the complexity of buying or building your own management tooling to create EKS Distro clusters, configure the operating environment, and update software. EKS Anywhere enables you to automate cluster management, reduce support costs, and eliminate the redundant effort of using multiple open-source or third-party tools for operating Kubernetes clusters. EKS Anywhere is fully supported by AWS. In addition, you can leverage the EKS console to view all your Kubernetes clusters, running anywhere.

We provide several deployment options for your Kubernetes cluster:

  • Hardware – Amazon EKS and EKS on Outposts: managed by AWS; EKS Anywhere and EKS Distro: managed by customer
  • Deployment types – Amazon EKS: Amazon EC2, AWS Fargate (Serverless); EKS on Outposts: EC2 on Outposts; EKS Anywhere and EKS Distro: customer infrastructure
  • Control plane management – Amazon EKS and EKS on Outposts: managed by AWS; EKS Anywhere and EKS Distro: managed by customer
  • Control plane location – Amazon EKS and EKS on Outposts: AWS cloud; EKS Anywhere and EKS Distro: customer's on-premises or data center
  • Cluster updates – Amazon EKS and EKS on Outposts: managed in-place update process for control plane and data plane; EKS Anywhere and EKS Distro: CLI (Flux-supported rolling update for data plane, manual update for control plane)
  • Networking and security – Amazon EKS and EKS on Outposts: Amazon VPC Container Network Interface (CNI) and other compatible 3rd-party CNI plugins; EKS Anywhere: Cilium CNI; EKS Distro: 3rd-party CNI plugins
  • Console support – Amazon EKS and EKS on Outposts: Amazon EKS console; EKS Anywhere: EKS console using EKS Connector; EKS Distro: self-service
  • Support – Amazon EKS and EKS on Outposts: AWS Support; EKS Anywhere: EKS Anywhere support subscription; EKS Distro: self-service

EKS Anywhere integrates with a variety of products from our partners to help customers take advantage of EKS Anywhere and provide additional functionality. This includes Flux for cluster updates, Flux Controller for GitOps, eksctl – a simple CLI tool for creating and managing clusters on EKS, and Cilium for networking and security.

We also provide flexibility for you to integrate with your choice of tools in other areas. To add integrations to your EKS Anywhere cluster, see this list of suggested third-party tools for your consideration.

Get Started with Amazon EKS Anywhere
To get started with EKS Anywhere, you can create a bootstrap cluster on your machine for local development and test purposes. Currently, it allows you to create clusters in a VMware vSphere environment for production workloads.

Let’s create a cluster on your desktop machine using eksctl! You can install eksctl and eksctl-anywhere with homebrew on Mac. Optionally, you can install some additional tools you may want for your EKS Anywhere clusters, such as kubectl. To learn more on Linux, see the installation guide in EKS Anywhere documentation.

$ brew install aws/tap/eks-anywhere
$ eksctl anywhere version
0.63.0

Generate a cluster config and create a cluster.

$ CLUSTER_NAME=dev-cluster
$ eksctl anywhere generate clusterconfig $CLUSTER_NAME \
    --provider docker > $CLUSTER_NAME.yaml
$ eksctl anywhere create cluster -f $CLUSTER_NAME.yaml
[i] Performing setup and validations
[v] validation succeeded {"validation": "docker Provider setup is valid"}
[i] Creating new bootstrap cluster
[i] Installing cluster-api providers on bootstrap cluster
[i] Provider specific setup
[i] Creating new workload cluster
[i] Installing networking on workload cluster
[i] Installing cluster-api providers on workload cluster
[i] Moving cluster management from bootstrap to workload cluster
[i] Installing EKS-A custom components (CRD and controller) on workload cluster
[i] Creating EKS-A CRDs instances on workload cluster
[i] Installing AddonManager and GitOps Toolkit on workload cluster
[i] GitOps field not specified, bootstrap flux skipped
[i] Deleting bootstrap cluster
[v] Cluster created!

Once your workload cluster is created, a KUBECONFIG file is stored on your admin machine with admin permissions for the workload cluster. You’ll be able to use that file with kubectl to set up and deploy workloads.

$ export KUBECONFIG=${PWD}/${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
$ kubectl get ns
NAME                                STATUS   AGE
capd-system                         Active   21m
capi-kubeadm-bootstrap-system       Active   21m
capi-kubeadm-control-plane-system   Active   21m
capi-system                         Active   21m
capi-webhook-system                 Active   21m
cert-manager                        Active   22m
default                             Active   23m
eksa-system                         Active   20m
kube-node-lease                     Active   23m
kube-public                         Active   23m
kube-system                         Active   23m

You can deploy a simple test application to verify that your cluster is working properly. Deploy it, see a new pod running in your cluster, and forward the deployment port to your local machine with the following commands:

$ kubectl apply -f "https://anywhere.eks.amazonaws.com/manifests/hello-eks-a.yaml"
$ kubectl get pods -l app=hello-eks-a
NAME                                     READY   STATUS    RESTARTS   AGE
hello-eks-a-745bfcd586-6zx6b   1/1     Running   0          22m
$ kubectl port-forward deploy/hello-eks-a 8000:80
$ curl localhost:8000
⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢

Thank you for using

███████╗██╗  ██╗███████╗
██╔════╝██║ ██╔╝██╔════╝
█████╗  █████╔╝ ███████╗
██╔══╝  ██╔═██╗ ╚════██║
███████╗██║  ██╗███████║
╚══════╝╚═╝  ╚═╝╚══════╝

 █████╗ ███╗   ██╗██╗   ██╗██╗    ██╗██╗  ██╗███████╗██████╗ ███████╗
██╔══██╗████╗  ██║╚██╗ ██╔╝██║    ██║██║  ██║██╔════╝██╔══██╗██╔════╝
███████║██╔██╗ ██║ ╚████╔╝ ██║ █╗ ██║███████║█████╗  ██████╔╝█████╗  
██╔══██║██║╚██╗██║  ╚██╔╝  ██║███╗██║██╔══██║██╔══╝  ██╔══██╗██╔══╝  
██║  ██║██║ ╚████║   ██║   ╚███╔███╔╝██║  ██║███████╗██║  ██║███████╗
╚═╝  ╚═╝╚═╝  ╚═══╝   ╚═╝    ╚══╝╚══╝ ╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚══════╝

You have successfully deployed the hello-eks-a pod hello-eks-a-c5b9bc9d8-qp6bg

For more information check out
https://anywhere.eks.amazonaws.com

⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢⬡⬢

EKS Anywhere also supports VMware vSphere version 7.0 or higher for production clusters. To create a production cluster, see the requirements for VMware vSphere deployment and follow Create production cluster in the EKS Anywhere documentation. It's almost the same process as creating a test cluster on your machine.

A production-grade EKS Anywhere cluster should include at least three control plane nodes and three worker nodes on vSphere for high availability and rolling upgrades. See Cluster management in the EKS Anywhere documentation for more information on common operational tasks like scaling, updating, and deleting the cluster.
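
As a rough sketch, creating a production cluster follows the same flow as the Docker-based test cluster above, with the vSphere provider and an edited configuration file. The cluster name below is a placeholder, and the generated file still needs your vSphere datacenter, network, and credential details before it will work.

$ CLUSTER_NAME=prod-cluster
$ eksctl anywhere generate clusterconfig $CLUSTER_NAME \
    --provider vsphere > $CLUSTER_NAME.yaml
# Edit $CLUSTER_NAME.yaml: set the control plane count and the worker
# node group count to 3, and fill in the vSphere-specific fields.
$ eksctl anywhere create cluster -f $CLUSTER_NAME.yaml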

EKS Connector – Public Preview
EKS Connector is a new capability that allows you to connect any Kubernetes cluster to the EKS console. You can connect any Kubernetes cluster, including self-managed clusters on EC2, EKS Anywhere clusters running on premises, and other Kubernetes clusters running outside of AWS, to the EKS console. It makes it easy for you to view all connected clusters centrally.

To connect your EKS Anywhere cluster, visit the Clusters section in EKS console and select Register in the Add cluster drop-down menu.

Define a name for your cluster and select the Provider (if you don’t find an appropriate provider, select Other).

After registering the cluster, you will be redirected to the Cluster Overview page. Select Download YAML file to get the Kubernetes configuration file to deploy all the necessary infrastructure to connect your cluster to EKS.

Apply the downloaded eks-connector.yaml and the role binding file eks-connector-binding.yaml from the EKS Connector section of our documentation. EKS Connector acts as a proxy and forwards the EKS console requests to the Kubernetes API server on your cluster, so you need to associate the connector's service account with an EKS Connector Role, which gives permission to impersonate AWS IAM entities.

$ kubectl apply -f eks-connector.yaml
$ kubectl apply -f eks-connector-binding.yaml

After completing the registration, the cluster should be in the ACTIVE state.

$ aws eks describe-cluster --name "my-first-registered-cluster" --region ${AWS_REGION}

Here is the expected output:

{
    "cluster": {
        "name": "my-first-registered-cluster",
        "arn": "arn:aws:eks:{EKS-REGION}:{ACCOUNT-ID}:cluster/my-first-registered-cluster",
        "createdAt": 1627672425.765,
        "connectorConfig": {
            "activationId": "xxxxxxxxACTIVATION_IDxxxxxxxx",
            "activationExpiry": 1627676019.0,
            "provider": "OTHER",
            "roleArn": "arn:aws:iam::{ACCOUNT-ID}:role/eks-connector-agent"
        },
        "status": "ACTIVE",
        "tags": {}
    }
}

EKS Connector is now in public preview in all AWS Regions where Amazon EKS is available. Please choose a region that’s closest to your cluster location to minimize latency. To learn more, visit EKS Connector in the Amazon EKS User Guide.

Things to Know
Here are a couple of things to keep in mind about EKS Anywhere:

Connectivity: There are three connectivity options: fully connected, partially disconnected, and fully disconnected. For fully connected and partially disconnected connectivity, you can connect your EKS Anywhere clusters to the EKS console via the EKS Connector and see the cluster configuration and workload status. You can leverage AWS services through AWS Controllers for Kubernetes (ACK). You can connect EKS Anywhere infrastructure resources using AWS Systems Manager agents and view them using the Systems Manager console.

Security Model: AWS follows the Shared Responsibility Model, where AWS is responsible for the security of the cloud, while the customer is responsible for security in the cloud. However, EKS Anywhere is an open-source tool, and the distribution of responsibility differs from that of a managed cloud service like Amazon EKS. AWS is responsible for building and delivering a secure tool. This tool will provision an initially secure Kubernetes cluster. To learn more, see Security Best Practices in EKS Anywhere documentation.

AWS Support: AWS Enterprise Support is a prerequisite for purchasing an Amazon EKS Anywhere Support subscription. If you would like business support for your EKS Anywhere clusters, please contact your Technical Account Manager (TAM) for details. Also, EKS Anywhere is supported by the open-source community. If you have a problem, open an issue and someone will get back to you as soon as possible.

Available Now
Amazon EKS Anywhere is now generally available, so you can leverage EKS features on your on-premises infrastructure and accelerate adoption with partner integrations, managed add-ons, and curated open-source tools.

To learn more with a live demo and Q&A, join us for Containers from the Couch on September 13. You can see full demos to create a cluster and show admin workflows for scaling, upgrading the cluster version, and GitOps management.

Please send us feedback either through your usual AWS Support contacts, on the AWS Forum for Amazon EKS, or on the container roadmap on GitHub.

Channy

Security at Scale in the Open-Source Supply Chain

Post Syndicated from Aaron Wells original https://blog.rapid7.com/2021/09/08/security-at-scale-in-the-open-source-supply-chain/

Security at Scale in the Open-Source Supply Chain

“We’ve all heard of paying it forward, but this is ridiculous!” That’s probably what most of us think when one of our partners or vendors inadvertently leaves an open door into our shared supply-chain network; an attacker can enter at any time. Well, we probably think in slightly more expletive-laden terms, but nonetheless, no organization or company wants to be the focal point of blame from a multitude of (formerly) trusting partners or vendors.

Open-source software (OSS) is particularly susceptible to these vulnerabilities. OSS is simultaneously incredible and incredibly vulnerable. In fact, there are so many risks that can result from largely structuring operations on OSS that vendors may not prioritize patching a vulnerability once their security team is alerted. And can we blame them? They want to continue operations and feed the bottom line, not put a pause on operations to forever chase vulnerabilities and patch them one-by-one. But that leaves all of their supply-chain partners open to exploitation. What to do?

The supply-chain scene

Throughout a 12-month timeframe spanning 2019-2020, attacks aimed at OSS increased 430%, according to a study by Sonatype. It’s not quite as simple as “gain access to one, gain access to all,” but if a bad actor is properly motivated, this is exactly what can happen. In terms of motivation, supply-chain attackers can fall into 2 groups:

  • Bandwagoners: Attackers falling into this group will often wait for public disclosure of supply-chain vulnerabilities.
  • Ahead-of-the-curvers: Attackers falling into this group will actively hunt for and exploit vulnerabilities, saddling the unfortunate organization with malware and threatening its entire supply chain.

To add to attackers' advantage, the same Sonatype study also found that shockingly few security organizations learn of new open-source vulnerabilities in the short term after they're disclosed. Sure, everyone's busy and has their priorities. But that ethos persists while these vulnerabilities are being exploited. Perhaps the project was shipped on time, but malicious code was simultaneously being injected somewhere along the line. Then, instead of continuing with forward progress, remediation becomes the name of the game.

According to the Sonatype report, there were more than a trillion open-source component and container download requests in 2020 alone. The most important aspects to consider then are the security history of your component(s) and how dependents along your supply chain are using them. Obviously, this can be overwhelming to think about, but with researchers increasingly focused on remediation at scale, the future of supply-chain security is starting to look brighter.

Securing at scale

Instead of the one-by-one approach to patching, security professionals need to start thinking about securing entire classes of vulnerabilities. It’s true that there is no current catch-all mechanism for such efficient action. But researchers can begin to work together to create methodologies that enable security organizations to better prioritize vulnerability risk management (VRM) instead of filing each one away to patch at a later date.

Of course, preventive security measures — inclusive of our shift-left culture — can help to mitigate the need to scale such remediation actions; the fact remains though that bad actors will always find a way. Therefore, until there are effective ways to eliminate large swaths of vulnerabilities at once, there is a growing need for teams to adhere to current best practices and measures like:  

  • Dedicating time and resources to help ensure code is secure all along the chain
  • Thinking holistically about the security of open-source code with regard to the CI/CD lifecycle and the entire stack
  • Being willing to pitch in and develop coordinated, industry-wide efforts to improve the security of OSS at scale
  • Educating outside stakeholders on just how interdependent supply-chain-linked organizations are

As supply-chain attackers refine their methods to target ever-larger companies, the pressure is on developers to refine their understanding of how each and every contributor on a team can expose the organization and its partners along the chain, as The Linux Foundation points out. However, is this too much to put on the shoulders of DevOps? Shifting left to a DevSecOps culture is great and all, but teams are now being asked to think in the context of securing an entire supply chain’s worth of output.

This is why the industry at large must continue the push for research into new ways to eliminate entire classes of vulnerabilities. That’s a seismic shift left that will only help developers — and really, everyone — put more energy into things other than security.

Monitoring mindfully

While a proliferation of OSS components — as advantageous as they are for collaboration at scale — can make a supply chain vulnerable, the power of one open-source community can help monitor another open-source community. Velociraptor by Rapid7 is an open-source digital forensics and incident response (DFIR) platform.

This powerful DFIR tool thrives in loaded conditions. It can quickly scale incident response and monitoring and help security organizations to better prioritize remediation — actions well-suited to address the scale of modern supply-chain attacks. How quickly organizations choose to respond to incidents or vulnerabilities is, of course, up to them.

Supply chain security is ever-evolving

If one link in the chain is attacked via a long-languishing vulnerability whose risk has become increasingly hard to manage, it almost goes without saying that the company's partners and vendors immediately lose confidence in it, because the entire chain is now at risk. The public's confidence will likely follow.

There are any number of preventive measures an interdependent security organization can implement. However, the need for further research into scaling security for whole classes of vulnerabilities comes at a crucial time as global supply-chain attacks more frequently occur in all shapes and sizes.

Want to contribute to a more secure open-source future?

Submit to the 2021 Velociraptor Contributor Competition

Top 5 reasons to choose Zabbix for network monitoring

Post Syndicated from Dmitry Lambert original https://blog.zabbix.com/top-5-reasons-to-choose-zabbix-for-network-monitoring/15247/

There are many monitoring solutions and tools that you can use for different tasks. But in this post and video, we will focus only on Zabbix and the top five features that make Zabbix the best choice for monitoring anything from your home office to enterprise instances or projects.

Contents

I. Free and open-source solution (0:44)
II. Wide functionality (1:43)
III. No access to your data (03:53)
IV. Balance of flexibility and simplicity (5:35)
V. Commercial services (7:55)

Free and open-source solution

First, Zabbix is a free and open-source solution covered by General Public License (GPL) v2. This means that the Zabbix source code is readily available and can be redistributed or modified. With this in mind, you can always create your own version of Zabbix, if you’re willing to play around with the source code or have a great idea on how to improve the product.

Zabbix software properties

There are no paid versions of Zabbix, no paid functionality, and no hidden costs. You can monitor any number of devices and define your own data retention policies at no cost at all.

All of the latest features are absolutely free and available in the latest version of Zabbix. You can visit zabbix.com, click the Download button, choose the platform that is best for you and install Zabbix packages on it. Zabbix can be deployed on any kind of environment, be it a virtual machine, physical servers, cloud environments, or even a Docker container. After you have downloaded Zabbix, you are ready to go ahead with the latest Zabbix feature set.

Selecting a platform to install Zabbix on zabbix.com/download

Rich feature set

Zabbix is a fully enterprise-ready product with a wide set of features that you can use to achieve any of your monitoring tasks. As a tool, Zabbix is not focused on any single thing, offering users extreme flexibility. For instance, you can monitor Windows or Linux machines agentlessly or opt to install a Zabbix agent on them. On the other hand, to monitor network devices, SNMP monitoring might be the easiest approach. All it takes to start monitoring your endpoints is creating an item and specifying the metric that you want to monitor together with the data collection interval, and you are good to go.

Configuring SNMP monitoring parameters

After we have collected the data, we can configure our problem thresholds (also known as triggers in Zabbix) by navigating to Configuration > Hosts > Triggers. Triggers define our problem thresholds: when a metric is considered to be in a problem state, and how the problem is recovered from.

There is a wide array of so-called trigger functions – these allow us to define thresholds in different ways. For example, we can analyze the last received value, averages, minimum and maximum values over a period of time, look for a specific string in a value, and much, much more!
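
As an illustration only – the host and item key here are hypothetical, and the syntax shown is the expression style used in Zabbix 5.4 and later – a trigger that fires when the average CPU load over the last five minutes exceeds 2 could be written as:

avg(/Linux server/system.cpu.load,5m)>2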

 

We also need to define a way of reacting to a problem – should we receive an e-mail if something goes wrong? Or maybe we want to try and remediate the issue automatically by executing a command or a script? This is where the so-called actions come in. Actions are based on AND/OR conditions that allow you to very granularly define how you're going to react to a particular problem.

For example, you might define an action that states: “If the trigger name contains ‘SNMPSim’, send an email or a mobile text message to our network administrator. If the problem still persists after 10 minutes, execute a locally stored script that should fix the problem.”

Trigger actions

Once you have defined your items, triggers, and actions, it's time to present this information in a user-friendly fashion. For this, you can create a set of multi-page dashboards where you can see all of your collected data by utilizing different dashboard widgets: show a list of active problems, display the collected metrics on graphs, and provide an overview of your infrastructure state on network maps.

The dashboard is completely dynamic and interactive — you can zoom in on any point in time in your graphs, create interactive map hierarchies, navigate to different sections of Zabbix and much more.

Zabbix also supports SLA monitoring for your IT business services. You can define your IT service trees, link them to existing triggers, and have access to different SLA-related views.

Configuring and monitoring SLA

Inventory collection and storage are also natively supported by Zabbix. You can collect any inventory information from your devices – device serial numbers, locations, software versions, and much more. This can be done in many different ways – the inventory information can be captured from the collected metrics, populated manually, or updated by using the Zabbix API. This information can be used to access different inventory views and group your devices based on the collected inventory data.
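
As a sketch of the API route – the frontend URL, host ID, inventory values, and API token below are placeholders – updating a host's inventory over the JSON-RPC API could look like this:

$ curl -s -X POST https://zabbix.example.com/api_jsonrpc.php \
    -H 'Content-Type: application/json-rpc' \
    -d '{
          "jsonrpc": "2.0",
          "method": "host.update",
          "params": {
            "hostid": "10084",
            "inventory_mode": 0,
            "inventory": {"location": "Riga, rack 12", "serialno_a": "SN-12345"}
          },
          "auth": "<your API token>",
          "id": 1
        }'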

You own your data

There are many different ways to deploy Zabbix. You can navigate to the zabbix.com/download page and select the installation method that fits your requirements. For Zabbix packages, you have the option to choose the required Zabbix version, select from multiple operating systems – from Red Hat Enterprise Linux to Raspberry Pi – as well as the specific OS version, the database backend, and the web server backend. After everything is selected, you will be presented with a comprehensive list of commands that you can use to get Zabbix up and running in minutes.

Zabbix Packages

If you are interested in cloud deployments then you can also run Zabbix in many different cloud environments, such as AWS, Microsoft Azure, Google Cloud, DigitalOcean, Linode, Red Hat OpenShift, Oracle Cloud or Yandex Cloud. All of these options offer full Zabbix functionality with native cloud images.

Zabbix Cloud images

Docker images are also available for different Zabbix components. You can run a single component in a container or deploy the whole Zabbix architecture in a containerized environment. The Docker hub page contains a comprehensive list of environment variables and examples of how to deploy container images with a single command.
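
For instance, a minimal sketch of running a single component – here the Zabbix agent, with a placeholder server address and hostname – could look like this:

$ docker run --name zabbix-agent \
    -e ZBX_SERVER_HOST=192.168.1.10 \
    -e ZBX_HOSTNAME=my-host \
    -p 10050:10050 \
    -d zabbix/zabbix-agent:latest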

 

The quickest way to deploy Zabbix, especially in a PoC environment, would be the Zabbix Appliance. Zabbix Appliance is a virtual machine image with all of the Zabbix components already pre-configured for you. Simply download the image for the hypervisor of your choice, deploy it and you are good to go.

 

The Zabbix source code is also available for download for different Zabbix versions. This approach is useful for more exotic environments, where installing via packages is not an option.

Zabbix Sources

Agents are available for download via packages, but if packages are not an option, you can always download the precompiled agents for many different operating systems, including Windows.

Zabbix agents

No matter which option you choose, Zabbix LLC never has any access to your configuration or history data. You are fully in control of your deployment and the data in it. This way you have the guarantee that your data belongs only to you.

Balance of flexibility and simplicity

A good monitoring solution should be simple and approachable even for users who are not experts in monitoring, Linux systems, scripting, or other DevOps-related skills. However, simplicity usually comes at a cost in functionality. If a tool focuses too much on simplicity, it inevitably restricts the set of functionality available to its end users.

Zabbix provides a balance of simplicity and flexibility. While the number of features in Zabbix may seem daunting at first, the flexibility it provides is the key benefit of Zabbix. With Zabbix, you can easily extend the out-of-the-box monitoring approaches with your custom monitoring methods. If you have in-house applications specific to your company, you can always extend the Zabbix monitoring functionality and create your own custom checks or use scripts or commands to collect the data, as shown in the sketch below. This way you can define custom methods not only for data collection, but also use scripting for robust automatic issue remediation.
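
As a hypothetical sketch of such a custom check – the item key and script path are made up – a UserParameter line in the agent configuration maps a key of your choosing to a local command whose output becomes the item value:

# /etc/zabbix/zabbix_agentd.d/userparameter_myapp.conf (hypothetical file)
# Maps the custom item key myapp.queue.length to a local script.
UserParameter=myapp.queue.length,/usr/local/bin/check_queue_length.sh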

This gives you a huge variety of features that you can utilize in Zabbix either by using out-of-the-box templates or defining your own custom checks. All of this can be done within a single central frontend.

Even if you’re monitoring 50 branches in different countries within one Zabbix installation, a Zabbix administrator will be able to maintain and change the configuration — add new items, triggers, etc. Zabbix is also a great fit for multi-tenant environments. The robust permission and role schema enable you to define multiple Zabbix administrators that can have granular access to monitoring entities within their organization.

Commercial services

Open-source solutions are great because you can download the product at any time, irrespective of your goals, and use it for both small home lab environments and large enterprise infrastructures. Similar to many open-source solutions, Zabbix also has a large and passionate international community of users ready to help you out on the official forums, different social networks, the Zabbix subreddit, and other communication channels.

If this is not sufficient and you're still feeling overwhelmed by all of the available features and require additional help to deploy your environment with best practices in mind – this is where the Zabbix commercial services come into play.

Commercial services

  • The Zabbix team offers multiple commercial services, starting with multi-tier technical support. With technical support services, Zabbix experts will have your back, help you fix any issues, and answer all of your questions 24/7.
  • The Zabbix team also offers consulting services, where you can bring any topic that you wish to discuss to Zabbix experts — how to deploy Zabbix and start monitoring your infrastructure, whether Zabbix is able to cover all of your needs, help with tuning your Zabbix configuration, and much more.
  • Turnkey solutions allow you to engage Zabbix professionals and build everything from scratch with best practices and scalability in mind.
  • Zabbix team can lend you a hand with Template building services for your custom in-house application.
  • The Zabbix team will document all of the performed steps, so you can have a clear view of what has been done and what was the reason behind it. You can utilize this knowledge down the line to learn and be able to follow the best practice approaches on your own.
  • Upgrade procedures can be extremely stressful – you may be worried about minimizing downtime, following your organizational SLAs, or maybe you simply aren't sure how to properly perform an upgrade. Once again, the Zabbix team can do this for you, document it, and guide you through the process so you can learn from it and do it yourself for later versions.
  • Need help with troubleshooting an ongoing issue? Then the remote troubleshooting services are for you. Zabbix team can help you get to the bottom of any issues you may have with your Zabbix architecture.

Zabbix is an extremely fast-growing, enterprise-ready project with a vast set of functionality trusted by global brands, capable of collecting hundreds of thousands of metrics with real-time 24/7 data analysis, powerful visualization options, a robust permission schema, out-of-the-box reports, and the ability to tailor the tool to your specific needs.

If you have never tried Zabbix, this might be the perfect moment to visit zabbix.com, click Download, install Zabbix in a local test environment, and monitor a couple of hosts to get acquainted with the product. I am sure that you're going to be more than satisfied with the results!