CIDABM

2024-10-09 xkcd.com

Post Syndicated from xkcd.com original https://xkcd.com/2996/

There's a heated debate over whether the big island of Tierra del Fuego should qualify for membership.

Securing communications at the edge with AWS Wickr

2024-10-09 Erik Iwanski

Post Syndicated from Erik Iwanski original https://aws.amazon.com/blogs/messaging-and-targeting/securing-communications-at-the-edge-with-aws-wickr/

Organizations that are looking to establish secure communication networks at the edge often encounter challenges. The use of disparate collaboration tools on personal and government-issued devices can make it difficult to protect sensitive data and avoid communication gaps.

This blog post highlights four common communication issues that customers encounter when operating in disconnected (or intermittently connected) environments, and how end-to-end encrypted messaging and collaboration service AWS Wickr can help you address them.

Issue 1: Seamless communication—multiple agencies and partners need to collaborate effectively.

Federal, state, and local organizations tend to use different means and mechanisms to communicate both internally and externally with third parties, which often leads to interoperability challenges. They need to seamlessly coordinate and connect with mission partners—including government agencies, military teams, medical professionals, and first responders—even in disconnected environments in order to work together effectively.

Issue 2: Out-of-band communication—teams need a way to ensure that communication is possible when primary channels are down or compromised.

Network disruptions, security events, and system failures can impact communication channels. The use of a separate, secure, out-of-band communication tool that can be used as a backup when primary channels are unavailable or compromised is critical to protecting sensitive information, maintaining business continuity, and coordinating incident response activities.

Issue 3: Data retention—messages and files need to be retained to help meet recordkeeping requirements, and facilitate after-action reports.

Virtually all federal, state, and local government agencies must adhere to various data retention and records management policies, regulations, and laws. Many are subject to Federal Records Act (FRA) and National Archives and Records Administration (NARA) regulations that require them to collect, store, and manage federal records that are created, received, and used in daily operations. For those subject to Freedom of Information Act (FOIA) requests and U.S. Department of Defense (DOD) Instruction 8170.01—which prescribes procedures for the collection, distribution, storage, and processing of DOD information through electronic messaging services—effectively retaining messages is about more than supporting security and compliance; it’s about maintaining public trust.

Issue 4: Security and control—communications must be adequately protected and administrative control must be maintained, no matter the environment.

The transmission of sensitive and mission-critical data through messaging apps and collaboration tools that lack critical encryption and security protocols increases the likelihood of a security incident. Popular consumer messaging apps don’t provide controls that allow for individual devices or accounts to be suspended or removed, increasing the threat of data exposure stemming from a lost or stolen device. Enterprise collaboration apps lack the advanced security provided by end-to-end encryption.

How AWS Wickr can help

AWS Wickr is a secure messaging and collaboration service that protects one-to-one and group messaging, voice and video calling, file sharing, screen sharing, and location sharing with 256-bit encryption.

Wickr combines the security and privacy of end-to-end encryption with the data retention and administrative controls you need to accelerate collaboration, even in disconnected environments.

Wickr provides the following capabilities to help you address common communication challenges:

Seamless communication: Federation and guest access features allow you to exchange sensitive information with mission partners, without the need to connect to a virtual private network (VPN). You can assign groups of users to specific federation rules, restrict access to select agencies and partners, and allow or disable the guest user access feature for individual security groups.
Out-of-band communication: Wickr provides a communication channel outside of existing systems that can help you keep teams connected and protect sensitive information, even when primary channels are down or compromised. The user interface is intuitive; response teams can simply open the application on their device and start collaborating, without special software or training.
Data retention: Wickr network administrators can configure and apply data retention to both internal and external communications in a Wickr network. This includes conversations with guest users, external teams, and other partner networks, so you can retain messages and files sent to and from the organization to help meet requirements. Data retention is implemented as an always-on recipient that is added to conversations, similar to the blind carbon copy (BCC) feature in email. The data retention process can run anywhere Docker workloads are supported: on-premises, on an Amazon Elastic Compute Cloud (Amazon EC2) virtual machine, or at a location of your choice.
Security and control: With Wickr, each message gets a unique Advanced Encryption Standard (AES) private encryption key, and a unique Elliptic-curve Diffie–Hellman (ECDH) public key to negotiate the key exchange with recipients. Message content—including text, files, audio, or video—is encrypted on the sending device using the message-specific AES key. This key is then exchanged via the ECDH key exchange mechanism so that no one other than intended recipients can decrypt the content (not even AWS). Fine-grained administrative controls allow you to organize users into security groups with restricted access to features and content at their level. You can apply policies to each group that are custom-tailored to meet desired outcomes. Wickr app data can be deleted remotely both by administrators, and end users.

Communicating at the edge

Wickr is available in two deployment models: cloud-native AWS Wickr and AWS WickrGov, which are available through the AWS Management Console, and self-hosted Wickr Enterprise. Wickr Enterprise offers the same secure collaboration features as AWS Wickr and AWS WickrGov, but can be self-hosted on any private on-premises infrastructure (such as an AWS Outpost or Snowball Edge device), private cloud infrastructure, or in a multi-cloud deployment. Wickr Enterprise can maintain secure communications when internet access (via broadband, mobile, 5G, or satellite) to cloud-based networks fails. You can run Wickr Enterprise without any internet connectivity and it supports architectural resiliency, such as deploying a fully managed network backhaul that is capable of federating with AWS Wickr users when internet connectivity is available.

Figure 1 illustrates a hybrid architecture that combines AWS Wickr and Wickr Enterprise. The Snowball Edge device running Wickr allows disconnected communications at the edge between Wickr Enterprise users. When internet connectivity becomes available, Wickr Enterprise users can federate with AWS Wickr users and send data retention logs to Amazon S3 or any customer-defined storage.

Figure 1: Hybrid of Wickr Enterprise self-hosted on Snowball Edge and AWS Wickr in the Cloud. A hybrid solution federates AWS Wickr in the cloud with a local deployment of Wickr Enterprise for extended resilience and redundancy.

Collaborate with confidence

Securing communications at the edge is critical to protecting sensitive data and maintaining operational resilience. AWS Wickr offers a secure, simple-to-use, reliable solution that can help you address common challenges and collaborate effectively, even in the harshest environments. By choosing the features and deployment options that meet your needs, you can facilitate secure and compliant communications everywhere, and seamlessly collaborate with mission partners.

AWS Wickr has been authorized for Department of Defense Cloud Computing Security Requirements Guide Impact Level 4 and 5 (DoD CC SRG IL4 and IL5) in the AWS GovCloud (US-West) Region. It is also Federal Risk and Authorization Management Program (FedRAMP) authorized at the Moderate impact level in the AWS US East (N. Virginia) Region, FedRamp High authorized in the AWS GovCloud (US-West) Region, and meets compliance programs and standards such as Health Insurance Portability and Accountability Act (HIPAA) eligibility, International Organization for Standardization (ISO) 27001, and System and Organization Controls (SOC) 1,2, and 3.

For more information, please visit the AWS Wickr webpage, or email [email protected].

About the Authors

Erik Iwanski

Erik is a Principal Worldwide Go-to-Market (GTM) Specialist for Amazon Web Services (AWS) and is based in Montana. He focuses on global customers and leads the global GTM plan for AWS Wickr. Erik has 15-plus years of experience working across industries from national security, federal/SLED sales, healthcare, and technology. He holds a master’s degree in microbiology from California State University Long Beach and a bachelor’s degree in Biological Sciences from the University of California Irvine.

Anne Grahn

Anne is a Senior Worldwide Security GTM Specialist at AWS, based in Chicago. She has more than 13 years of experience in the security industry, and focuses on effectively communicating cybersecurity risk. She maintains a Certified Information Systems Security Professional (CISSP) certification.

Patch Tuesday – October 2024

2024-10-09 Adam Barnett

Post Syndicated from Adam Barnett original https://blog.rapid7.com/2024/10/08/patch-tuesday-october-2024/

Patch Tuesday - October 2024

Microsoft is addressing 118 vulnerabilities this October 2024 Patch Tuesday. Microsoft has evidence of in-the-wild exploitation and/or public disclosure for five of the vulnerabilities published today, although it does not rate any of these as critical (yet). Of those five, Microsoft lists two as exploited in the wild, and both of these are now listed on CISA KEV. Microsoft is also patching three further critical remote code execution (RCE) vulnerabilities today. Three browser vulnerabilities have already been published separately this month, and are not included in the total.

Somewhat unusually, we’ll take a look at two of the three critical RCEs published today — CVE-2024-43468 and CVE-2024-43582 — before moving on to the arguably somewhat-less- threatening zero-day vulnerabilities patched today.

Microsoft Configuration Manager: pre-auth RCE

Microsoft Configuration Manager receives a patch for the only vulnerability published by Microsoft today with a CVSS base score of 9.8. Although Microsoft doesn’t tag it as either publicly disclosed or exploited-in-the-wild, the advisory for CVE-2024-43468 appears to describe a no-interaction, low complexity, unauthenticated network RCE against Microsoft Configuration Manager. Exploitation is achieved by sending specially-crafted malicious requests, and leads to code execution in the context of the Configuration Manager server or its underlying database. The relevant update is installed within the Configuration Manager console, and requires specific administrator actions that Microsoft describes in detail in a generic series of articles. Further information and several specific required steps are described in KB29166583.

Confusingly, this KB29166583 was first published over a month ago on 2024-09-04, and was then subsequently unpublished and republished on 2024-09-18, all without any mention of CVE-2024-43468, which was published only today and which KB29166583 apparently remediates. Defenders should read the available documentation carefully, and then probably read it again for good measure.

RPD RPC: pre-auth RCE

Any RDP Server critical RCE is worth patching quickly. CVE-2024-43582 is a pre-auth critical RCE in the Remote Desktop Protocol Server. Exploitation requires an attacker to send deliberately-malformed packets to a Windows RPC host, and leads to code execution in the context of the RPC service, although what this means in practice may depend on factors including RPC Interface Restriction configuration on the target asset. One silver lining: attack complexity is high, since the attacker must win a race condition to access memory improperly.

Winlogon: zero-day EoP

Who doesn’t love a good elevation of privilege vulnerability? Weary blue teamers who see the words “publicly disclosed” on a brand-new advisory know the answer. CVE-2024-43583 describes a flaw in Winlogon which gets an attacker all the way to SYSTEM via abuse of a third-party Input Method Editor (IME) during the sign-on process. The supplementary KB5046254 article explains that the 2024-10-08 patches disable non-Microsoft IME during the sign-in process. On that basis, outright removal of third-party IME is a mitigation available to anyone who is not able to apply today’s patches immediately.

Attack surface reduction is always worth considering, and removal of third-party IMEs certainly accomplishes that. Anyone who needs to keep a third-party IME can still do so, but once today’s patches are applied, that third-party IME will be disabled — only in the context of the sign-in process — to prevent exploitation of CVE-2024-43583. Although Microsoft doesn’t quite spell it out, the only reasonable interpretation of the available information is that an asset with no first-party/Microsoft IME installed would remain vulnerable after patching, since otherwise no IME would be available when attempting to sign in. Use of third-party IME is more likely to be a concern in mixed-language or non-English-speaking contexts. The disclosure process around this vulnerability may not have been entirely smooth; back in September, one of the researchers credited with the discovery expressed discontent with MSRC via X-formerly-known-as-Twitter.

Hyper-V: zero-day container escape

CVE-2024-20659 describes a publicly-disclosed security feature bypass in Hyper-V. Microsoft describes exploitation as both less likely and highly complex. An attacker must be both lucky and resourceful, since only UEFI-enabled hypervisors with certain unspecified hardware are vulnerable, and exploitation requires coordination of a number of factors followed by a well-timed reboot. All this after first achieving a foothold on the same network — although in this context, this likely means access to a VM on the target hypervisor, rather than some other location on the same subnet. The prize for successful exploitation is compromise of the hypervisor kernel.

MSHTML: zero-day XSS

CVE-2024-43573 is an exploited-in-the-wild spoofing vulnerability in MSHTML for which Microsoft is also aware of functional public exploit code; the advisory lists CWE-79 as the weakness, which translates to cross-site scripting (XSS). The advisory is sparse on further detail, although Windows Server 2012/2012 R2 admins who typically install Security Only updates should note that Microsoft is encouraging installation of the Monthly Rollups to ensure remediation in this case. The low CVSSv3 base score of 6.5 reflects the requirement for user interaction and the lack of impact to integrity or availability; a reasonable assumption might be that exploitation leads to improper disclosure of sensitive data, but no other direct effect on the target asset.

cURL: zero-day RCE

Microsoft is most famous for its closed source products, but has cautiously softened its stance on open source considerably in the past quarter century or so. Windows has included components of cURL for almost seven years at this point, along with various other open source components; Microsoft does patch these from time to time, although not always as quickly as defenders might like. Today’s patches for CVE-2024-6197, a publicly-disclosed RCE vulnerability in cURL, continue that trend.

The Microsoft advisory for CVE-2024-6197 clarifies that Windows does not ship libcurl, only the curl command line, but that’s still vulnerable and thus in scope for a fix. Exploitation requires that the user connect to a malicious server controlled by the attacker, and code execution is presumably in the context of the user launching the curl CLI tool on the Windows asset. The cURL project advisory for CVE-2024-6197 was originally published on 2024-07-24, and offers further detail from their perspective. Interestingly, the cURL project describes the most likely outcome of exploitation as a crash, and does not specifically mention RCE, although it is careful not to exclude the possibility of unspecified “more serious results,” which could well mean RCE. Microsoft rates this vulnerability as important, which is on track with the CVSS base score of 8.8.

Management Console: zero-day RCE

CVE-2024-43572 rounds out today’s five zero-day vulnerabilities, and describes a low-complexity, no-user-interaction RCE in Microsoft Management Console. Microsoft is aware of both public functional exploit code and in-the-wild exploitation. The vulnerability is exploited when a user downloads and opens a specially-crafted malicious Microsoft Saved Console (MSC) file, so there’s no suggestion here that the Management Console is vulnerable via network attack. Today’s patches prevent untrusted MSC files from being opened, although the advisory does not describe how Windows will know what’s trusted and what isn’t. Microsoft has chosen to map CVE-2024-43572 to CWE-70, which is a very broad category, the use of which is explicitly discouraged by MITRE.

VS Code Arduino extension: cloud critical RCE

A third critical RCE patched today is hopefully less concerning than its siblings. CVE-2024-43488 is in the Visual Studio Code extension for Arduino, and Microsoft notes that the vulnerability documented by this CVE requires no customer action to resolve. A reasonable question is: what does “no action required” really mean here? Within the advisory, Microsoft both claims to have fully mitigated the vulnerability, and also that there is no plan to fix the vulnerability. As confusing as that all sounds, perhaps the most important takeaway here is that Microsoft is now issuing cloud service CVEs in a stated effort to improve transparency. It’s not clear when the vulnerability was first introduced or when it was remediated, but nevertheless the recent expansion into a whole new class of CVEs is a welcome step by Microsoft.

SharePoint: EoP to SYSTEM

A sparse advisory for CVE-2024-43503, which is an elevation of privilege vulnerability which leads to SYSTEM. Advisories for similar vulnerabilities typically describe the specific SharePoint privileges required, but this one does not, so a reasonable assumption might be that the requirement here is simply minimal Site Member privileges.

Microsoft lifecycle update

Today sees the end of support for Windows 11 22H2 for Home, Pro, Pro Education, Pro for Workstations, and SE editions, as well as for Windows 11 21H2 for Education, Enterprise, and Enterprise multi-session editions. Server 2012 and Server 2012 R2 pass into Year 2 of ESU. Windows Embedded POSReady — the POS stands for Point-of-Sale — receives its final ESU updates today, and that might just be the last gasp for Windows 7 as a whole. As well as patching today’s critical RCE CVE-2024-43468, Intune admins still using Configuration Manager 2303 should look to upgrade to a newer version immediately, because support ends (somewhat unusually) on Thursday this week.

Summary charts

Summary tables

Apps vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-43604	Outlook for Android Elevation of Privilege Vulnerability	No	No	5.7

Azure vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-38179	Azure Stack Hyperconverged Infrastructure (HCI) Elevation of Privilege Vulnerability	No	No	8.8
CVE-2024-43591	Azure Command Line Integration (CLI) Elevation of Privilege Vulnerability	No	No	8.7
CVE-2024-38097	Azure Monitor Agent Elevation of Privilege Vulnerability	No	No	7.1
CVE-2024-43480	Azure Service Fabric for Linux Remote Code Execution Vulnerability	No	No	6.6

Browser vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-9370	Chromium: CVE-2024-9370 Inappropriate implementation in V8	No	No	N/A
CVE-2024-9369	Chromium: CVE-2024-9369 Insufficient data validation in Mojo	No	No	N/A
CVE-2024-7025	Chromium: CVE-2024-7025 Integer overflow in Layout	No	No	N/A

Developer Tools vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-43488	Visual Studio Code extension for Arduino Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43497	DeepSpeed Remote Code Execution Vulnerability	No	No	8.4
CVE-2024-38229	.NET and Visual Studio Remote Code Execution Vulnerability	No	No	8.1
CVE-2024-43590	Visual C++ Redistributable Installer Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43483	.NET, .NET Framework, and Visual Studio Denial of Service Vulnerability	No	No	7.5
CVE-2024-43484	.NET, .NET Framework, and Visual Studio Denial of Service Vulnerability	No	No	7.5
CVE-2024-43485	.NET and Visual Studio Denial of Service Vulnerability	No	No	7.5
CVE-2024-43601	Visual Studio Code for Linux Remote Code Execution Vulnerability	No	No	7.1
CVE-2024-43603	Visual Studio Collector Service Denial of Service Vulnerability	No	No	5.5

ESU Windows vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-38124	Windows Netlogon Elevation of Privilege Vulnerability	No	No	9
CVE-2024-43518	Windows Telephony Server Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43608	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43607	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-38265	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43453	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-38212	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43549	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43564	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43589	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43592	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43593	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43611	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43532	Remote Registry Service Elevation of Privilege Vulnerability	No	No	8.8
CVE-2024-43599	Remote Desktop Client Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43519	Microsoft WDAC OLE DB provider for SQL Server Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43517	Microsoft ActiveX Data Objects Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43583	Winlogon Elevation of Privilege Vulnerability	No	Yes	7.8
CVE-2024-38261	Windows Routing and Remote Access Service (RRAS) Remote Code Execution Vulnerability	No	No	7.8
CVE-2024-43514	Windows Resilient File System (ReFS) Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43509	Windows Graphics Component Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43556	Windows Graphics Component Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43501	Windows Common Log File System Driver Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43563	Windows Ancillary Function Driver for WinSock Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43560	Microsoft Windows Storage Port Driver Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43572	Microsoft Management Console Remote Code Execution Vulnerability	Yes	Yes	7.8
CVE-2024-38262	Windows Remote Desktop Licensing Service Remote Code Execution Vulnerability	No	No	7.5
CVE-2024-43545	Windows Online Certificate Status Protocol (OCSP) Server Denial of Service Vulnerability	No	No	7.5
CVE-2024-43521	Windows Hyper-V Denial of Service Vulnerability	No	No	7.5
CVE-2024-43567	Windows Hyper-V Denial of Service Vulnerability	No	No	7.5
CVE-2024-43541	Microsoft Simple Certificate Enrollment Protocol Denial of Service Vulnerability	No	No	7.5
CVE-2024-43544	Microsoft Simple Certificate Enrollment Protocol Denial of Service Vulnerability	No	No	7.5
CVE-2024-43515	Internet Small Computer Systems Interface (iSCSI) Denial of Service Vulnerability	No	No	7.5
CVE-2024-43506	BranchCache Denial of Service Vulnerability	No	No	7.5
CVE-2024-38149	BranchCache Denial of Service Vulnerability	No	No	7.5
CVE-2024-43550	Windows Secure Channel Spoofing Vulnerability	No	No	7.4
CVE-2024-43553	NT OS Kernel Elevation of Privilege Vulnerability	No	No	7.4
CVE-2024-43535	Windows Kernel-Mode Driver Elevation of Privilege Vulnerability	No	No	7
CVE-2024-37976	Windows Resume Extensible Firmware Interface Security Feature Bypass Vulnerability	No	No	6.7
CVE-2024-37982	Windows Resume Extensible Firmware Interface Security Feature Bypass Vulnerability	No	No	6.7
CVE-2024-37983	Windows Resume Extensible Firmware Interface Security Feature Bypass Vulnerability	No	No	6.7
CVE-2024-37979	Windows Kernel Elevation of Privilege Vulnerability	No	No	6.7
CVE-2024-43512	Windows Standards-Based Storage Management Service Denial of Service Vulnerability	No	No	6.5
CVE-2024-43573	Windows MSHTML Platform Spoofing Vulnerability	Yes	Yes	6.5
CVE-2024-43547	Windows Kerberos Information Disclosure Vulnerability	No	No	6.5
CVE-2024-43534	Windows Graphics Component Information Disclosure Vulnerability	No	No	6.5
CVE-2024-43570	Windows Kernel Elevation of Privilege Vulnerability	No	No	6.4
CVE-2024-43513	BitLocker Security Feature Bypass Vulnerability	No	No	6.4
CVE-2024-43520	Windows Kernel Denial of Service Vulnerability	No	No	5
CVE-2024-43456	Windows Remote Desktop Services Tampering Vulnerability	No	No	4.8

Mariner Windows vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-6197	Open Source Curl Remote Code Execution Vulnerability	No	Yes	8.8

Microsoft Office vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-43503	Microsoft SharePoint Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43505	Microsoft Office Visio Remote Code Execution Vulnerability	No	No	7.8
CVE-2024-43576	Microsoft Office Remote Code Execution Vulnerability	No	No	7.8
CVE-2024-43616	Microsoft Office Remote Code Execution Vulnerability	No	No	7.8
CVE-2024-43504	Microsoft Excel Remote Code Execution Vulnerability	No	No	7.8
CVE-2024-43609	Microsoft Office Spoofing Vulnerability	No	No	6.5

SQL Server vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-43612	Power BI Report Server Spoofing Vulnerability	No	No	6.9
CVE-2024-43481	Power BI Report Server Spoofing Vulnerability	No	No	6.5

System Center vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-43468	Microsoft Configuration Manager Remote Code Execution Vulnerability	No	No	9.8
CVE-2024-43614	Microsoft Defender for Endpoint for Linux Spoofing Vulnerability	No	No	5.5

Windows vulnerabilities

CVE	Title	Exploited?	Publicly disclosed?	CVSSv3 base score
CVE-2024-43533	Remote Desktop Client Remote Code Execution Vulnerability	No	No	8.8
CVE-2024-43574	Microsoft Speech Application Programming Interface (SAPI) Remote Code Execution Vulnerability	No	No	8.3
CVE-2024-43582	Remote Desktop Protocol Server Remote Code Execution Vulnerability	No	No	8.1
CVE-2024-30092	Windows Hyper-V Remote Code Execution Vulnerability	No	No	8
CVE-2024-43551	Windows Storage Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43516	Windows Secure Kernel Mode Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43528	Windows Secure Kernel Mode Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43527	Windows Kernel Elevation of Privilege Vulnerability	No	No	7.8
CVE-2024-43584	Windows Scripting Engine Security Feature Bypass Vulnerability	No	No	7.7
CVE-2024-43562	Windows Network Address Translation (NAT) Denial of Service Vulnerability	No	No	7.5
CVE-2024-43565	Windows Network Address Translation (NAT) Denial of Service Vulnerability	No	No	7.5
CVE-2024-38129	Windows Kerberos Elevation of Privilege Vulnerability	No	No	7.5
CVE-2024-43575	Windows Hyper-V Denial of Service Vulnerability	No	No	7.5
CVE-2024-38029	Microsoft OpenSSH for Windows Remote Code Execution Vulnerability	No	No	7.5
CVE-2024-43552	Windows Shell Remote Code Execution Vulnerability	No	No	7.3
CVE-2024-43529	Windows Print Spooler Elevation of Privilege Vulnerability	No	No	7.3
CVE-2024-43502	Windows Kernel Elevation of Privilege Vulnerability	No	No	7.1
CVE-2024-20659	Windows Hyper-V Security Feature Bypass Vulnerability	No	Yes	7.1
CVE-2024-43581	Microsoft OpenSSH for Windows Remote Code Execution Vulnerability	No	No	7.1
CVE-2024-43615	Microsoft OpenSSH for Windows Remote Code Execution Vulnerability	No	No	7.1
CVE-2024-43522	Windows Local Security Authority (LSA) Elevation of Privilege Vulnerability	No	No	7
CVE-2024-43511	Windows Kernel Elevation of Privilege Vulnerability	No	No	7
CVE-2024-43525	Windows Mobile Broadband Driver Remote Code Execution Vulnerability	No	No	6.8
CVE-2024-43526	Windows Mobile Broadband Driver Remote Code Execution Vulnerability	No	No	6.8
CVE-2024-43543	Windows Mobile Broadband Driver Remote Code Execution Vulnerability	No	No	6.8
CVE-2024-43523	Windows Mobile Broadband Driver Remote Code Execution Vulnerability	No	No	6.8
CVE-2024-43524	Windows Mobile Broadband Driver Remote Code Execution Vulnerability	No	No	6.8
CVE-2024-43536	Windows Mobile Broadband Driver Remote Code Execution Vulnerability	No	No	6.8
CVE-2024-43537	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43538	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43540	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43542	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43555	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43557	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43558	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43559	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43561	Windows Mobile Broadband Driver Denial of Service Vulnerability	No	No	6.5
CVE-2024-43546	Windows Cryptographic Information Disclosure Vulnerability	No	No	5.6
CVE-2024-43571	Sudo for Windows Spoofing Vulnerability	No	No	5.6
CVE-2024-43500	Windows Resilient File System (ReFS) Information Disclosure Vulnerability	No	No	5.5
CVE-2024-43554	Windows Kernel-Mode Driver Information Disclosure Vulnerability	No	No	5.5
CVE-2024-43508	Windows Graphics Component Information Disclosure Vulnerability	No	No	5.5
CVE-2024-43585	Code Integrity Guard Security Feature Bypass Vulnerability	No	No	5.5

The Strange Case of Oliver Cromwell’s Head

2024-10-09 The History Guy: History Deserves to Be Remembered

Post Syndicated from The History Guy: History Deserves to Be Remembered original https://www.youtube.com/watch?v=kk4fNyQUJIM

[$] The Open Source Pledge: peer pressure to pay maintainers

2024-10-08 jzb

Post Syndicated from jzb original https://lwn.net/Articles/993073/

In the early days of open source, it was a struggle to get companies
to accept the concept and trust its development model.
Now, companies have few qualms about using it, but do tend to take open source and
those who maintain it for granted. The struggle now is to find ways
to compensate producers of the software, sustain the open‑source
commons, and avoid burning out maintainers. The Open Source Pledge project is
an effort to persuade companies to pay maintainers by making it a social
norm. On October 8, the project is launching a marketing campaign to raise
awareness and try to get a larger conversation started around paying
maintainers.

This is the New Astera Labs Scorpio PCIe Switch Targeting Broadcom in AI Servers

2024-10-08 John Lee

Post Syndicated from John Lee original https://www.servethehome.com/this-is-the-new-astera-labs-scorpio-pcie-switch-targeting-broadcom-in-ai-servers/

This is the new Astera Labs Scorpio PCIe switch designed to displace Broadcom in AI servers connecting CPU, GPU, NIC, and SSDs together

The post This is the New Astera Labs Scorpio PCIe Switch Targeting Broadcom in AI Servers appeared first on ServeTheHome.

We Tested It! Starlink Mini in Motion and EMF Levels

2024-10-08 Crosstalk Solutions

Post Syndicated from Crosstalk Solutions original https://www.youtube.com/watch?v=X4W88flGWmA

Introducing Netflix TimeSeries Data Abstraction Layer

2024-10-08 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/introducing-netflix-timeseries-data-abstraction-layer-31552f6326f8

Rajiv Shringi Vinay Chella Kaidan Fullerton Oleksii Tkachuk Joey Lynch

Introduction

As Netflix continues to expand and diversify into various sectors like Video on Demand and Gaming, the ability to ingest and store vast amounts of temporal data — often reaching petabytes — with millisecond access latency has become increasingly vital. In previous blog posts, we introduced the Key-Value Data Abstraction Layer and the Data Gateway Platform, both of which are integral to Netflix’s data architecture. The Key-Value Abstraction offers a flexible, scalable solution for storing and accessing structured key-value data, while the Data Gateway Platform provides essential infrastructure for protecting, configuring, and deploying the data tier.

Building on these foundational abstractions, we developed the TimeSeries Abstraction — a versatile and scalable solution designed to efficiently store and query large volumes of temporal event data with low millisecond latencies, all in a cost-effective manner across various use cases.

In this post, we will delve into the architecture, design principles, and real-world applications of the TimeSeries Abstraction, demonstrating how it enhances our platform’s ability to manage temporal data at scale.

Note: Contrary to what the name may suggest, this system is not built as a general-purpose time series database. We do not use it for metrics, histograms, timers, or any such near-real time analytics use case. Those use cases are well served by the Netflix Atlas telemetry system. Instead, we focus on addressing the challenge of storing and accessing extremely high-throughput, immutable temporal event data in a low-latency and cost-efficient manner.

Challenges

At Netflix, temporal data is continuously generated and utilized, whether from user interactions like video-play events, asset impressions, or complex micro-service network activities. Effectively managing this data at scale to extract valuable insights is crucial for ensuring optimal user experiences and system reliability.

However, storing and querying such data presents a unique set of challenges:

High Throughput: Managing up to 10 million writes per second while maintaining high availability.
Efficient Querying in Large Datasets: Storing petabytes of data while ensuring primary key reads return results within low double-digit milliseconds, and supporting searches and aggregations across multiple secondary attributes.
Global Reads and Writes: Facilitating read and write operations from anywhere in the world with adjustable consistency models.
Tunable Configuration: Offering the ability to partition datasets in either a single-tenant or multi-tenant datastore, with options to adjust various dataset aspects such as retention and consistency.
Handling Bursty Traffic: Managing significant traffic spikes during high-demand events, such as new content launches or regional failovers.
Cost Efficiency: Reducing the cost per byte and per operation to optimize long-term retention while minimizing infrastructure expenses, which can amount to millions of dollars for Netflix.

TimeSeries Abstraction

The TimeSeries Abstraction was developed to meet these requirements, built around the following core design principles:

Partitioned Data: Data is partitioned using a unique temporal partitioning strategy combined with an event bucketing approach to efficiently manage bursty workloads and streamline queries.
Flexible Storage: The service is designed to integrate with various storage backends, including Apache Cassandra and Elasticsearch, allowing Netflix to customize storage solutions based on specific use case requirements.
Configurability: TimeSeries offers a range of tunable options for each dataset, providing the flexibility needed to accommodate a wide array of use cases.
Scalability: The architecture supports both horizontal and vertical scaling, enabling the system to handle increasing throughput and data volumes as Netflix expands its user base and services.
Sharded Infrastructure: Leveraging the Data Gateway Platform, we can deploy single-tenant and/or multi-tenant infrastructure with the necessary access and traffic isolation.

Let’s dive into the various aspects of this abstraction.

Data Model

We follow a unique event data model that encapsulates all the data we want to capture for events, while allowing us to query them efficiently.

Let’s start with the smallest unit of data in the abstraction and work our way up.

Event Item: An event item is a key-value pair that users use to store data for a given event. For example: {“device_type”: “ios”}.
Event: An event is a structured collection of one or more such event items. An event occurs at a specific point in time and is identified by a client-generated timestamp and an event identifier (such as a UUID). This combination of event_time and event_id also forms part of the unique idempotency key for the event, enabling users to safely retry requests.
Time Series ID: A time_series_id is a collection of one or more such events over the dataset’s retention period. For instance, a device_id would store all events occurring for a given device over the retention period. All events are immutable, and the TimeSeries service only ever appends events to a given time series ID.
Namespace: A namespace is a collection of time series IDs and event data, representing the complete TimeSeries dataset. Users can create one or more namespaces for each of their use cases. The abstraction applies various tunable options at the namespace level, which we will discuss further when we explore the service’s control plane.

API

The abstraction provides the following APIs to interact with the event data.

WriteEventRecordsSync: This endpoint writes a batch of events and sends back a durability acknowledgement to the client. This is used in cases where users require a guarantee of durability.
WriteEventRecords: This is the fire-and-forget version of the above endpoint. It enqueues a batch of events without the durability acknowledgement. This is used in cases like logging or tracing, where users care more about throughput and can tolerate a small amount of data loss.

{
  "namespace": "my_dataset",
  "events": [
    {
      "timeSeriesId": "profile100",
      "eventTime": "2024-10-03T21:24:23.988Z",
      "eventId": "550e8400-e29b-41d4-a716-446655440000",
      "eventItems": [
        {
          "eventItemKey": "ZGV2aWNlVHlwZQ==",  
          "eventItemValue": "aW9z"
        },
        {
          "eventItemKey": "ZGV2aWNlTWV0YWRhdGE=",
          "eventItemValue": "c29tZSBtZXRhZGF0YQ=="
        }
      ]
    },
    {
      "timeSeriesId": "profile100",
      "eventTime": "2024-10-03T21:23:30.000Z",
      "eventId": "123e4567-e89b-12d3-a456-426614174000",
      "eventItems": [
        {
          "eventItemKey": "ZGV2aWNlVHlwZQ==",  
          "eventItemValue": "YW5kcm9pZA=="
        }
      ]
    }
  ]
}

ReadEventRecords: Given a combination of a namespace, a timeSeriesId, a timeInterval, and optional eventFilters, this endpoint returns all the matching events, sorted descending by event_time, with low millisecond latency.

{
  "namespace": "my_dataset",
  "timeSeriesId": "profile100",
  "timeInterval": {
    "start": "2024-10-02T21:00:00.000Z",
    "end":   "2024-10-03T21:00:00.000Z"
  },
  "eventFilters": [
    {
      "matchEventItemKey": "ZGV2aWNlVHlwZQ==",
      "matchEventItemValue": "aW9z"
    }
  ],
  "pageSize": 100,
  "totalRecordLimit": 1000
}

SearchEventRecords: Given a search criteria and a time interval, this endpoint returns all the matching events. These use cases are fine with eventually consistent reads.

{
  "namespace": "my_dataset",
  "timeInterval": {
    "start": "2024-10-02T21:00:00.000Z",
    "end": "2024-10-03T21:00:00.000Z"
  },
  "searchQuery": {
    "booleanQuery": {
      "searchQuery": [
        {
          "equals": {
            "eventItemKey": "deviceType",
            "eventItemValue": "aW9z"
          }
        },
        {
          "equals": {
            "eventItemKey": "deviceType",
            "eventItemValue": "YW5kcm9pZA=="
          }
        }
      ],
      "operator": "OR"
    }
  },
  "pageSize": 100,
  "totalRecordLimit": 1000
}

AggregateEventRecords: Given a search criteria and an aggregation mode (e.g. DistinctAggregation) , this endpoint performs the given aggregation within a given time interval. Similar to the Search endpoint, users can tolerate eventual consistency and a potentially higher latency (in seconds).

{
  "namespace": "my_dataset",
  "timeInterval": {
    "start": "2024-10-02T21:00:00.000Z",
    "end": "2024-10-03T21:00:00.000Z"
  },
  "searchQuery": {...some search criteria...},
  "aggregationQuery": {
    "distinct": {
      "eventItemKey": "deviceType",
      "pageSize": 100
    }
  }
}

In the subsequent sections, we will talk about how we interact with this data at the storage layer.

Storage Layer

The storage layer for TimeSeries comprises a primary data store and an optional index data store. The primary data store ensures data durability during writes and is used for primary read operations, while the index data store is utilized for search and aggregate operations. At Netflix, Apache Cassandra is the preferred choice for storing durable data in high-throughput scenarios, while Elasticsearch is the preferred data store for indexing. However, similar to our approach with the API, the storage layer is not tightly coupled to these specific data stores. Instead, we define storage API contracts that must be fulfilled, allowing us the flexibility to replace the underlying data stores as needed.

Primary Datastore

In this section, we will talk about how we leverage Apache Cassandra for TimeSeries use cases.

Partitioning Scheme

At Netflix’s scale, the continuous influx of event data can quickly overwhelm traditional databases. Temporal partitioning addresses this challenge by dividing the data into manageable chunks based on time intervals, such as hourly, daily, or monthly windows. This approach enables efficient querying of specific time ranges without the need to scan the entire dataset. It also allows Netflix to archive, compress, or delete older data efficiently, optimizing both storage and query performance. Additionally, this partitioning mitigates the performance issues typically associated with wide partitions in Cassandra. By employing this strategy, we can operate at much higher disk utilization, as it reduces the need to reserve large amounts of disk space for compactions, thereby saving costs.

Here is what it looks like :

Time Slice: A time slice is the unit of data retention and maps directly to a Cassandra table. We create multiple such time slices, each covering a specific interval of time. An event lands in one of these slices based on the event_time. These slices are joined with no time gaps in between, with operations being start-inclusive and end-exclusive, ensuring that all data lands in one of the slices.

Why not use row-based Time-To-Live (TTL)?

Using TTL on individual events would generate a significant number of tombstones in Cassandra, degrading performance, especially during range scans. By employing discrete time slices and dropping them, we avoid the tombstone issue entirely. The tradeoff is that data may be retained slightly longer than necessary, as an entire table’s time range must fall outside the retention window before it can be dropped. Additionally, TTLs are difficult to adjust later, whereas TimeSeries can extend the dataset retention instantly with a single control plane operation.

Time Buckets: Within a time slice, data is further partitioned into time buckets. This facilitates effective range scans by allowing us to target specific time buckets for a given query range. The tradeoff is that if a user wants to read the entire range of data over a large time period, we must scan many partitions. We mitigate potential latency by scanning these partitions in parallel and aggregating the data at the end. In most cases, the advantage of targeting smaller data subsets outweighs the read amplification from these scatter-gather operations. Typically, users read a smaller subset of data rather than the entire retention range.

Event Buckets: To manage extremely high-throughput write operations, which may result in a burst of writes for a given time series within a short period, we further divide the time bucket into event buckets. This prevents overloading the same partition for a given time range and also reduces partition sizes further, albeit with a slight increase in read amplification.

Note: With Cassandra 4.x onwards, we notice a substantial improvement in the performance of scanning a range of data in a wide partition. See Future Enhancements at the end to see the Dynamic Event bucketing work that aims to take advantage of this.

Storage Tables

We use two kinds of tables

Data tables: These are the time slices that store the actual event data.
Metadata table: This table stores information about how each time slice is configured per namespace.

Data tables

The partition key enables splitting events for a time_series_id over a range of time_bucket(s) and event_bucket(s), thus mitigating hot partitions, while the clustering key allows us to keep data sorted on disk in the order we almost always want to read it. The value_metadata column stores metadata for the event_item_value such as compression.

Writing to the data table:

User writes will land in a given time slice, time bucket, and event bucket as a factor of the event_time attached to the event. This factor is dictated by the control plane configuration of a given namespace.

For example:

Reading from the data table:

The below illustration depicts at a high-level on how we scatter-gather the reads from multiple partitions and join the result set at the end to return the final result.

Metadata table

This table stores the configuration data about the time slices for a given namespace.

Note the following:

No Time Gaps: The end_time of a given time slice overlaps with the start_time of the next time slice, ensuring all events find a home.
Retention: The status indicates which tables fall inside and outside of the retention window.
Flexible: This metadata can be adjusted per time slice, allowing us to tune the partition settings of future time slices based on observed data patterns in the current time slice.

There is a lot more information that can be stored into the metadata column (e.g., compaction settings for the table), but we only show the partition settings here for brevity.

Index Datastore

To support secondary access patterns via non-primary key attributes, we index data into Elasticsearch. Users can configure a list of attributes per namespace that they wish to search and/or aggregate data on. The service extracts these fields from events as they stream in, indexing the resultant documents into Elasticsearch. Depending on the throughput, we may use Elasticsearch as a reverse index, retrieving the full data from Cassandra, or we may store the entire source data directly in Elasticsearch.

Note: Again, users are never directly exposed to Elasticsearch, just like they are not directly exposed to Cassandra. Instead, they interact with the Search and Aggregate API endpoints that translate a given query to that needed for the underlying datastore.

In the next section, we will talk about how we configure these data stores for different datasets.

Control Plane

The data plane is responsible for executing the read and write operations, while the control plane configures every aspect of a namespace’s behavior. The data plane communicates with the TimeSeries control stack, which manages this configuration information. In turn, the TimeSeries control stack interacts with a sharded Data Gateway Platform Control Plane that oversees control configurations for all abstractions and namespaces.

Separating the responsibilities of the data plane and control plane helps maintain the high availability of our data plane, as the control plane takes on tasks that may require some form of schema consensus from the underlying data stores.

Namespace Configuration

The below configuration snippet demonstrates the immense flexibility of the service and how we can tune several things per namespace using our control plane.

"persistence_configuration": [
  {
    "id": "PRIMARY_STORAGE",
    "physical_storage": {
      "type": "CASSANDRA",                  // type of primary storage
      "cluster": "cass_dgw_ts_tracing",     // physical cluster name
      "dataset": "tracing_default"          // maps to the keyspace
    },
    "config": {
      "timePartition": {
        "secondsPerTimeSlice": "129600",    // width of a time slice
        "secondPerTimeBucket": "3600",      // width of a time bucket
        "eventBuckets": 4                   // how many event buckets within
      },
      "queueBuffering": {
        "coalesce": "1s",                   // how long to coalesce writes
        "bufferCapacity": 4194304           // queue capacity in bytes
      },
      "consistencyScope": "LOCAL",          // single-region/multi-region
      "consistencyTarget": "EVENTUAL",      // read/write consistency
      "acceptLimit": "129600s"              // how far back writes are allowed
    },
    "lifecycleConfigs": {
      "lifecycleConfig": [                  // Primary store data retention
        {
          "type": "retention",
          "config": {
            "close_after": "1296000s",      // close for reads/writes
            "delete_after": "1382400s"      // drop time slice
          }
        }
      ]
    }
  },
  {
    "id": "INDEX_STORAGE",
    "physicalStorage": {
      "type": "ELASTICSEARCH",              // type of index storage
      "cluster": "es_dgw_ts_tracing",       // ES cluster name
      "dataset": "tracing_default_useast1"  // base index name
    },
    "config": {
      "timePartition": {
        "secondsPerSlice": "129600"         // width of the index slice
      },
      "consistencyScope": "LOCAL",
      "consistencyTarget": "EVENTUAL",      // how should we read/write data
      "acceptLimit": "129600s",             // how far back writes are allowed
      "indexConfig": {
        "fieldMapping": {                   // fields to extract to index
          "tags.nf.app": "KEYWORD",
          "tags.duration": "INTEGER",
          "tags.enabled": "BOOLEAN"
        },
        "refreshInterval": "60s"            // Index related settings
      }
    },
    "lifecycleConfigs": {
      "lifecycleConfig": [
        {
          "type": "retention",              // Index retention settings
          "config": {
            "close_after": "1296000s",
            "delete_after": "1382400s"
          }
        }
      ]
    }
  }
]

Provisioning Infrastructure

With so many different parameters, we need automated provisioning workflows to deduce the best settings for a given workload. When users want to create their namespaces, they specify a list of workload desires, which the automation translates into concrete infrastructure and related control plane configuration. We highly encourage you to watch this ApacheCon talk, by one of our stunning colleagues Joey Lynch, on how we achieve this. We may go into detail on this subject in one of our future blog posts.

Once the system provisions the initial infrastructure, it then scales in response to the user workload. The next section describes how this is achieved.

Scalability

Our users may operate with limited information at the time of provisioning their namespaces, resulting in best-effort provisioning estimates. Further, evolving use-cases may introduce new throughput requirements over time. Here’s how we manage this:

Horizontal scaling: TimeSeries server instances can auto-scale up and down as per attached scaling policies to meet the traffic demand. The storage server capacity can be recomputed to accommodate changing requirements using our capacity planner.
Vertical scaling: We may also choose to vertically scale our TimeSeries server instances or our storage instances to get greater CPU, RAM and/or attached storage capacity.
Scaling disk: We may attach EBS to store data if the capacity planner prefers infrastructure that offers larger storage at a lower cost rather than SSDs optimized for latency. In such cases, we deploy jobs to scale the EBS volume when the disk storage reaches a certain percentage threshold.
Re-partitioning data: Inaccurate workload estimates can lead to over or under-partitioning of our datasets. TimeSeries control-plane can adjust the partitioning configuration for upcoming time slices, once we realize the nature of data in the wild (via partition histograms). In the future we plan to support re-partitioning of older data and dynamic partitioning of current data.

Design Principles

So far, we have seen how TimeSeries stores, configures and interacts with event datasets. Let’s see how we apply different techniques to improve the performance of our operations and provide better guarantees.

Event Idempotency

We prefer to bake in idempotency in all mutation endpoints, so that users can retry or hedge their requests safely. Hedging is when the client sends an identical competing request to the server, if the original request does not come back with a response in an expected amount of time. The client then responds with whichever request completes first. This is done to keep the tail latencies for an application relatively low. This can only be done safely if the mutations are idempotent. For TimeSeries, the combination of event_time, event_id and event_item_key form the idempotency key for a given time_series_id event.

SLO-based Hedging

We assign Service Level Objectives (SLO) targets for different endpoints within TimeSeries, as an indication of what we think the performance of those endpoints should be for a given namespace. We can then hedge a request if the response does not come back in that configured amount of time.

"slos": {
  "read": {               // SLOs per endpoint
    "latency": {
      "target": "0.5s",   // hedge around this number
      "max": "1s"         // time-out around this number
    }
  },
  "write": {
    "latency": {
      "target": "0.01s",
      "max": "0.05s"
    }
  }
}

Partial Return

Sometimes, a client may be sensitive to latency and willing to accept a partial result set. A real-world example of this is real-time frequency capping. Precision is not critical in this case, but if the response is delayed, it becomes practically useless to the upstream client. Therefore, the client prefers to work with whatever data has been collected so far rather than timing out while waiting for all the data. The TimeSeries client supports partial returns around SLOs for this purpose. Importantly, we still maintain the latest order of events in this partial fetch.

Adaptive Pagination

All reads start with a default fanout factor, scanning 8 partition buckets in parallel. However, if the service layer determines that the time_series dataset is dense — i.e., most reads are satisfied by reading the first few partition buckets — then it dynamically adjusts the fanout factor of future reads in order to reduce the read amplification on the underlying datastore. Conversely, if the dataset is sparse, we may want to increase this limit with a reasonable upper bound.

Limited Write Window

In most cases, the active range for writing data is smaller than the range for reading data — i.e., we want a range of time to become immutable as soon as possible so that we can apply optimizations on top of it. We control this by having a configurable “acceptLimit” parameter that prevents users from writing events older than this time limit. For example, an accept limit of 4 hours means that users cannot write events older than now() — 4 hours. We sometimes raise this limit for backfilling historical data, but it is tuned back down for regular write operations. Once a range of data becomes immutable, we can safely do things like caching, compressing, and compacting it for reads.

Buffering Writes

We frequently leverage this service for handling bursty workloads. Rather than overwhelming the underlying datastore with this load all at once, we aim to distribute it more evenly by allowing events to coalesce over short durations (typically seconds). These events accumulate in in-memory queues running on each instance. Dedicated consumers then steadily drain these queues, grouping the events by their partition key, and batching the writes to the underlying datastore.

The queues are tailored to each datastore since their operational characteristics depend on the specific datastore being written to. For instance, the batch size for writing to Cassandra is significantly smaller than that for indexing into Elasticsearch, leading to different drain rates and batch sizes for the associated consumers.

While using in-memory queues does increase JVM garbage collection, we have experienced substantial improvements by transitioning to JDK 21 with ZGC. To illustrate the impact, ZGC has reduced our tail latencies by an impressive 86%:

Because we use in-memory queues, we are prone to losing events in case of an instance crash. As such, these queues are only used for use cases that can tolerate some amount of data loss .e.g. tracing/logging. For use cases that need guaranteed durability and/or read-after-write consistency, these queues are effectively disabled and writes are flushed to the data store almost immediately.

Dynamic Compaction

Once a time slice exits the active write window, we can leverage the immutability of the data to optimize it for read performance. This process may involve re-compacting immutable data using optimal compaction strategies, dynamically shrinking and/or splitting shards to optimize system resources, and other similar techniques to ensure fast and reliable performance.

The following section provides a glimpse into the real-world performance of some of our TimeSeries datasets.

Real-world Performance

The service can write data in the order of low single digit milliseconds

while consistently maintaining stable point-read latencies:

At the time of writing this blog, the service was processing close to 15 million events/second across all the different datasets at peak globally.

Time Series Usage @ Netflix

The TimeSeries Abstraction plays a vital role across key services at Netflix. Here are some impactful use cases:

Tracing and Insights: Logs traces across all apps and micro-services within Netflix, to understand service-to-service communication, aid in debugging of issues, and answer support requests.
User Interaction Tracking: Tracks millions of user interactions — such as video playbacks, searches, and content engagement — providing insights that enhance Netflix’s recommendation algorithms in real-time and improve the overall user experience.
Feature Rollout and Performance Analysis: Tracks the rollout and performance of new product features, enabling Netflix engineers to measure how users engage with features, which powers data-driven decisions about future improvements.
Asset Impression Tracking and Optimization: Tracks asset impressions ensuring content and assets are delivered efficiently while providing real-time feedback for optimizations.
Billing and Subscription Management: Stores historical data related to billing and subscription management, ensuring accuracy in transaction records and supporting customer service inquiries.

and more…

Future Enhancements

As the use cases evolve, and the need to make the abstraction even more cost effective grows, we aim to make many improvements to the service in the upcoming months. Some of them are:

Tiered Storage for Cost Efficiency: Support moving older, lesser-accessed data into cheaper object storage that has higher time to first byte, potentially saving Netflix millions of dollars.
Dynamic Event Bucketing: Support real-time partitioning of keys into optimally-sized partitions as events stream in, rather than having a somewhat static configuration at the time of provisioning a namespace. This strategy has a huge advantage of not partitioning time_series_ids that don’t need it, thus saving the overall cost of read amplification. Also, with Cassandra 4.x, we have noted major improvements in reading a subset of data in a wide partition that could lead us to be less aggressive with partitioning the entire dataset ahead of time.
Caching: Take advantage of immutability of data and cache it intelligently for discrete time ranges.
Count and other Aggregations: Some users are only interested in counting events in a given time interval rather than fetching all the event data for it.

Conclusion

The TimeSeries Abstraction is a vital component of Netflix’s online data infrastructure, playing a crucial role in supporting both real-time and long-term decision-making. Whether it’s monitoring system performance during high-traffic events or optimizing user engagement through behavior analytics, TimeSeries Abstraction ensures that Netflix operates seamlessly and efficiently on a global scale.

As Netflix continues to innovate and expand into new verticals, the TimeSeries Abstraction will remain a cornerstone of our platform, helping us push the boundaries of what’s possible in streaming and beyond.

Stay tuned for Part 2, where we’ll introduce our Distributed Counter Abstraction, a key element of Netflix’s Composite Abstractions, built on top of the TimeSeries Abstraction.

Acknowledgments

Special thanks to our stunning colleagues who contributed to TimeSeries Abstraction’s success: Tom DeVoe Mengqing Wang, Kartik Sathyanarayanan, Jordan West, Matt Lehman, Cheng Wang, Chris Lohfink .

Introducing Netflix TimeSeries Data Abstraction Layer was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Fine Print: How Minimum Data Retention Fees Affect Cloud Costs

2024-10-08 Kari Rivas

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/the-fine-print-how-minimum-data-retention-fees-affect-cloud-costs/

A decorative image showing a stylize image of an invoice with the phrase "minimum retention fees," as a line item.

You probably won’t notice a little asterisked footnote tucked at the bottom of the page the first time you read through a cloud storage vendor’s pricing tables. You probably won’t notice it the second or third time either. But you’ll definitely notice it when your bill comes in with charges for data you thought you deleted weeks ago.

That footnote explains an often overlooked challenge to your budget: minimum data retention periods. These policies, used by cloud providers like AWS, Azure, Google Cloud, and Wasabi, can lead to unexpected cost increases and complicated data management strategies.

Today, I’m breaking down cloud storage retention minimums and common scenarios where they directly impact storage budgets and data management policies.

What are minimum data retention periods?

Retention minimums specify the minimum amount of time that data must be stored before it can be deleted, overwritten, or moved to a different storage tier without incurring additional charges.

Cloud storage providers with multiple tiers like AWS or Google Cloud use minimum retention policies to ensure that customers cannot frequently move data between storage tiers to exploit lower-cost storage classes for short-term storage. For cloud providers that have a single class of storage, these policies allow providers to stabilize their resource usage and maintain predictable pricing structures.

Minimum retention periods can vary significantly between providers, and even between different storage tiers offered by the same provider. For example, AWS S3 Standard has no minimum retention period, but S3 Standard-IA has a 30 day minimum, Glacier has a 90 day minimum, and Deep Archive has a 180 day minimum.

Despite their significance, information about these retention periods is often buried in the fine print of service agreements or technical documentation.

What are delete fees?

Delete fees are a direct consequence of deleting or moving files before the retention minimum is met. Cloud providers charge these fees to ensure that the infrastructure allocated for the data is compensated for the resources it would have otherwise used during the retention period. This fee is typically prorated, representing the remaining days in the retention period that the data was meant to occupy in a storage class.

The terms “delete fees,” “minimum storage duration,” and “minimum retention fees” all refer to a similar policy.

How are delete fees incurred?

Early deletion fees can be triggered by various actions, not just the obvious deletion of files. Some examples include:

Moving data from a higher-cost tier to a lower-cost tier before the minimum retention period has been met: This scenario often catches organizations off guard when they attempt to optimize costs by transferring infrequently accessed data to a cheaper storage class.
Overwriting existing files: When a file is overwritten, the cloud provider typically treats this as a delete operation followed by a new write operation. If the original file hasn’t met its minimum retention period, the organization may be charged for the remaining time, even though they’re still using the same amount of storage space.

A decorative image showing three bars, one that represents the stored object, and two that represent what duration of days you might be charged for.

Implementation of automated lifecycle policies: Many organizations set up rules to automatically move or delete data based on its age or access patterns. However, if these policies don’t account for minimum retention periods, they can inadvertently trigger early delete fees on a large scale.
Renaming files or folders: Even seemingly benign actions like renaming files or folders can sometimes be interpreted as delete-and-rewrite operations by certain cloud storage systems, potentially triggering these fees.

Additionally, in multi-user or multi-team environments, lack of communication about retention policies can lead to unexpected charges. One team might delete or move data without realizing the financial implications for the entire organization.

The financial impact of minimum data retention periods

Minimum data retention periods, particularly in cold storage tiers, can have significant impacts on IT budgets. What may have seemed like a cost-saving storage tier can actually increase expenses when operations require frequent deletions or movements of data before the minimum retention period is over. But even in hot storage, these policies can unexpectedly inflate overall costs.

To illustrate the real-world impact of retention minimums, let’s examine a few common scenarios:

1. Backup strategy

Let’s say you have a 30 day backup strategy for your critical infrastructure, and you opt for Wasabi object storage to save costs vs. AWS. You plan to keep a month’s worth of backups in the cloud and will then replace them with the newer backups.

Wasabi’s minimum retention policy is 90 days for its Pay as You Go storage (and 30 days for its Reserved Capacity Storage).

You store an initial 50TB of backups in Wasabi on Day 1. On Day 31, the older backup is deleted and replaced with the newer backup. So, you incur costs for 30 days of Timed Active Storage (50TB) and 60 days of Timed Deleted Storage (50TB). These charges are incurred every time the backup is replaced.

With Wasabi’s Pay as You Go storage, your monthly bill will look like this:

50TB x $6.99/TB/month x 3= $1048.50

We multiply by 3 because the 90 day minimum retention policy equals three months’ time. One of those you’ve actually stored for, and the other two are because you’ve replaced your backups with the new ones.

Compare this to Backblaze B2 Cloud Storage, which has no minimum retention policy and costs $6 per TB/month for its Pay as You Go storage:

50TB x $6/TB/month = $300

The minimum retention policy effectively triples the anticipated storage expenses. When scaled across multiple backup sets or extended periods, the impact on the IT budget can be substantial.

Delete fees in the real world: California university switches to Backblaze to eliminate surprise bills from Wasabi

Cal Poly Humboldt thought they understood cloud storage provider Wasabi’s pricing, but each month brought unexpected charges for deleted data due to Wasabi’s minimum storage retention policies. This, in turn, caused a chain reaction of calls from the procurement office, buying extra capacity, and then modifying the system to try to avoid further bills. To silence the monthly fire alarms, they switched to Backblaze.

With no retention minimums, Cal Poly Humboldt now knows exactly what their Backblaze costs will be up front. The move was so smooth that they migrated another 100TB from Google’s no-longer-free tier for educational institutions and plan to scale their storage to over a petabyte to back up and safeguard research data.

2. Application storage

In application storage use cases, retention minimums can impact cloud spend significantly when the data has a short lifecycle. Applications with high transaction volumes—such as e-commerce, user-generated content applications, or surveillance platforms—frequently upload and delete as part of their daily operations.

For example, most video surveillance platforms may only need 30 days of history for footage that’s been uploaded and processed, so something like a 90-day retention period doesn’t make financial or operational sense. E-commerce customers can also be affected; these businesses have users that frequently upload and delete content to manage storefronts, creating unpredictable data usage patterns. In these cases, you are at the mercy of your end users—if users churn through files quickly you will pay the retention penalties.

3. Video production

Retention minimums also affect video production workflows particularly when you need to make revisions once a project has been archived in cold storage—a common workflow many studios and broadcasting agencies use to get more affordable storage rates for seldom-accessed data.

Whether due to last minute changes in branding, edits to visuals, or adjustments to sound, the project needs to be pulled from storage for further modification. Because the files were moved to colder storage under a 90 day retention policy, accessing and modifying them before that period ends can trigger significant early delete fees.

If you routinely archive files immediately after a project completes anticipating that no further changes will be required, these early delete fees can add up quickly.

The hidden complexities of minimum data retention periods

Retention minimums can significantly impact your bottom line. These policies, often buried in the fine print, can lead to unexpected costs and complicate data management strategies across various industries.

Understanding the nuances of minimum data retention periods and their associated costs is crucial for developing an effective and economically sound cloud storage strategy. It enables organizations to make more informed decisions, avoid unexpected expenses, and better align their storage choices with their specific data management needs and budget constraints.

The post The Fine Print: How Minimum Data Retention Fees Affect Cloud Costs appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

Access private code repositories for installing Python dependencies on Amazon MWAA

2024-10-08 Tim Wilhoit

Post Syndicated from Tim Wilhoit original https://aws.amazon.com/blogs/big-data/access-private-code-repositories-for-installing-python-dependencies-on-amazon-mwaa/

Customers who use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) often need Python dependencies that are hosted in private code repositories. Many customers opt for public network access mode for its ease of use and ability to make outbound Internet requests, all while maintaining secure access. However, private code repositories may not be accessible via the Internet. It’s also a best practice to only install Python dependencies where they are needed. You can use Amazon MWAA startup scripts to selectively install Python dependencies required for running code on workers, while avoiding issues due to web server restrictions.

This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your virtual private cloud (VPC).

Solution overview

This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to Amazon MWAA with AWS CodeArtifact for Python dependencies.

The Amazon MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, scheduler, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment such as the one shown in the following diagram, where the environment is using public network access mode for the web servers, and the Apache Airflow workers and schedulers don’t have a route to the internet from your VPC.

mwaa-architecture

There are up to four potential networking configurations for an Amazon MWAA environment:

Public routing and public web server access mode
Private routing and public web server access mode (pictured in the preceding diagram)
Public routing and private web server access mode
Private routing and private web server access mode

We focus on one networking configuration for this post, but the fundamental concepts are applicable for any networking configuration.

The solution we walk through relies on the fact that Amazon MWAA runs a startup script (startup.sh) during startup on every individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. This startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.

The following steps allow us to accomplish this:

Create and install the startup script (startup.sh) in the Amazon MWAA environment. This script sets the environment variable for selectively installing dependencies.
Create and install global Python dependencies (requirements.txt) in the Amazon MWAA environment. This file contains the global dependencies required by all Amazon MWAA components.
Create and install component-specific Python dependencies in the Amazon MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the necessary dependencies.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An Amazon MWAA environment deployed with public access mode for the web server
Versioning enabled for your Amazon MWAA environment’s Amazon Simple Storage Service (Amazon S3) bucket
Amazon CloudWatch logging enabled at the INFO level for worker and web server
A Git repository accessible from within your VPC

Additionally, we upload a sample Python package to the Git repository:

git clone https://github.com/scrapy/scrapy
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy scrapylocal
rm -rf ./scrapy/.git*
cp -r ./scrapy/* ./scrapylocal
cd scrapylocal
git add --all
git commit -m "first commit"
git push

Create and install the startup script in the Amazon MWAA environment

Create the startup.sh file using the following example code:

#!/bin/sh

echo "Printing Apache Airflow component"
echo $MWAA_AIRFLOW_COMPONENT

if [[ "${MWAA_AIRFLOW_COMPONENT}" != "webserver" ]]
then
sudo yum -y install libaio
fi
if [[ "${MWAA_AIRFLOW_COMPONENT}" == "webserver" ]]
then
echo "Setting extended python requirements for webservers"
export EXTENDED_REQUIREMENTS="webserver_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "worker" ]]
then
echo "Setting extended python requirements for workers"
export EXTENDED_REQUIREMENTS="worker_reqs.txt"
fi

if [[ "${MWAA_AIRFLOW_COMPONENT}" == "scheduler" ]]
then
echo "Setting extended python requirements for schedulers"
export EXTENDED_REQUIREMENTS="scheduler_reqs.txt"
fi

Upload startup.sh to the S3 bucket for your Amazon MWAA environment:

aws s3 cp startup.sh s3://[mwaa-environment-bucket]
aws mwaa update-environment --startup-script-s3-path s3://[mwaa-environment-bucket]/startup.sh

Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.

log-startup-script

Create and install global Python dependencies in the Amazon MWAA environment

Your requirements file must include a –constraint statement to make sure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type.

The following code is an example of the requirements.txt file:

--constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt
-r /usr/local/airflow/dags/${EXTENDED_REQUIREMENTS}

Upload the requirements.txt file to the Amazon MWAA environment S3 bucket:

aws s3 cp requirements.txt s3://[mwaa-environment-bucket]

Create and install component-specific Python dependencies in the Amazon MWAA environment

For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install pprintpp on the web server from the public Python packages indexes. To accomplish that, we need to create the following files (we provide example code):

webserver_reqs.txt:

prettyprint

worker_reqs.txt:

git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy

scheduler_reqs.txt:

git+https://[user]:[password]@git-codecommit.us-east-1.amazonaws.com/v1/repos/scrapy#egg=scrapy

Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the Amazon MWAA environment:

aws s3 cp webserver_reqs.txt s3://mwaa-environment/dags
aws s3 cp scheduler_reqs.txt s3://mwaa-environment/dags
aws s3 cp worker_reqs.txt s3://mwaa-environment/dags

Update the environment for the new requirements file and observe the results

Get the latest object version for the requirements file:

aws s3api list-object-versions --bucket [mwaa-environment-bucket]

Update the Amazon MWAA environment to use the new requirements.txt file:

aws mwaa update-environment --name [mwaa-environment-name] --requirements-s3-object-version [s3-object-version]

Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice the startup script is now running and setting the environment variable.

log-requirements

log-git

Conclusion

In this post, we demonstrated a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.

We hope this post provided you with a better understanding of how startup scripts and Python dependency management work in an Amazon MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.

About the Author

Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time at the lake and rooting on the Oklahoma State Cowboys. Go Pokes!

These 4 Sony Cameras need an UPGRADE!

2024-10-08 Matt Granger

Post Syndicated from Matt Granger original https://www.youtube.com/watch?v=ekf50CE1jT4

Tamron Hall & Lish Steiling | A Confident Cook | Talks at Google

2024-10-08 Talks at Google

Post Syndicated from Talks at Google original https://www.youtube.com/watch?v=cDKC9-eldjM

New Microsoft Azure NVIDIA GB200 Systems Shown as Two-Thirds Cooling

2024-10-08 Cliff Robinson

Post Syndicated from Cliff Robinson original https://www.servethehome.com/new-microsoft-azure-nvidia-gb200-systems-shown/

Microsoft Azure NVIDIA GB200 systems are huge, with two thirds of the aisle space being dedicated to cooling the NVIDIA rack

The post New Microsoft Azure NVIDIA GB200 Systems Shown as Two-Thirds Cooling appeared first on ServeTheHome.

Aqara Hub M3 and Smart Video Doorbell G4 – are they any good?

2024-10-08 BeardedTinker

Post Syndicated from BeardedTinker original https://www.youtube.com/watch?v=54eUA5EyMqQ

Amazon Prime Big Deals Day Price Tracking – October 2024

2024-10-08 The Hook Up

Post Syndicated from The Hook Up original https://www.youtube.com/watch?v=nTlN9Yogzdc

[$] Efficient Rust tracepoints

2024-10-08 daroc

Post Syndicated from daroc original https://lwn.net/Articles/992455/

Alice Ryhl has been working to enable

tracepoints — which are widely used
throughout the kernel — to be seamlessly placed in Rust code as well. She spoke
about her approach at Kangrejos. Her

patch set
enables efficient use of static
tracepoints, but supporting dynamic tracepoints will take some additional effort.

Security updates for Tuesday

2024-10-08 corbet

Post Syndicated from corbet original https://lwn.net/Articles/993276/

Security updates have been issued by Debian (kernel), Fedora (webkitgtk), Mageia (cups), Oracle (e2fsprogs, kernel, and kernel-container), Red Hat (buildah, container-tools:rhel8, containernetworking-plugins, git-lfs, go-toolset:rhel8, golang, grafana-pcp, podman, and skopeo), SUSE (Mesa, mozjs115, podofo, and redis7), and Ubuntu (cups and cups-filters).

Improve security incident response times by using AWS Service Catalog to decentralize security notifications

2024-10-08 Cheng Wang

Post Syndicated from Cheng Wang original https://aws.amazon.com/blogs/security/improve-security-incident-response-times-by-using-aws-service-catalog-to-decentralize-security-notifications/

Many organizations continuously receive security-related findings that highlight resources that aren’t configured according to the organization’s security policies. The findings can come from threat detection services like Amazon GuardDuty, or from cloud security posture management (CSPM) services like AWS Security Hub, or other sources. An important question to ask is: How, and how soon, are your teams notified of these findings?

Often, security-related findings are streamed to a single centralized security team or Security Operations Center (SOC). Although it’s a best practice to capture logs, findings, and metrics in standardized locations, the centralized team might not be the best equipped to make configuration changes in response to an incident. Involving the owners or developers of the impacted applications and resources is key because they have the context required to respond appropriately. Security teams often have manual processes for locating and contacting workload owners, but they might not be up to date on the current owners of a workload. Delays in notifying workload owners can increase the time to resolve a security incident or a resource misconfiguration.

This post outlines a decentralized approach to security notifications, using a self-service mechanism powered by AWS Service Catalog to enhance response times. With this mechanism, workload owners can subscribe to receive near real-time Security Hub notifications for their AWS accounts or workloads through email. The notifications include those from Security Hub product integrations like GuardDuty, AWS Health, Amazon Inspector, and third-party products, as well as notifications of non-compliance with security standards. These notifications can better equip your teams to configure AWS resources properly and reduce the exposure time of unsecured resources.

End-user experience

After you deploy the solution in this post, users in assigned groups can access a least-privilege AWS IAM Identity Center permission set, called SubscribeToSecurityNotifications, for their AWS accounts (Figure 1). The solution can also work with existing permission sets or federated IAM roles without IAM Identity Center.

Figure 1: IAM Identity Center portal with the permission set to subscribe to security notifications

After the user chooses SubscribeToSecurityNotifications, they are redirected to an AWS Service Catalog product for subscribing to security notifications and can see instructions on how to proceed (Figure 2).

Figure 2: AWS Service Catalog product view

The user can then choose the Launch product utton and enter one or more email addresses and the minimum severity level for notifications (Critical, High, Medium, or Low). If the AWS account has multiple workloads, they can choose to receive only the notifications related to the applications they own by specifying the resource tags. They can also choose to restrict security notifications to include or exclude specific security products (Figure 3).

Figure 3: Service Catalog security notifications product parameters

You can update the Service Catalog product configurations after provisioning by doing the following:

In the Service Catalog console, in the left navigation menu, choose Provisioned products.
Select the provisioned product, choose Actions, and then choose Update.
Update the parameters you want to change.

For accounts that have multiple applications, each application owner can set up their own notifications by provisioning an additional Service Catalog product. You can use the Filter findings by tag parameters to receive notifications only for a specific application. The example shown in Figure 3 specifies that the user will receive notifications only from resources with the tag key app and the tag value BigApp1 or AnotherApp.

After confirming the subscription, the user starts to receive email notifications for new Security Hub findings in near real-time. Each email contains a summary of the finding in the subject line, the account details, the finding details, recommendations (if any), the list of resources affected with their tags, and an IAM Identity Center shortcut link to the Security Hub finding (Figure 4). The email ends with the raw JSON of the finding.

Figure 4: Sample email showing details of the security notification

Choosing the link in the email takes the user directly to the AWS account and the finding in Security Hub, where they can see more details and search for related findings (Figure 5).

Figure 5: Security Hub finding detail page, linked from the notification email

Solution overview

We’ve provided two deployment options for this solution; a simpler option and one that is more advanced.

Figure 6 shows the simpler deployment option of using the requesting user’s IAM permissions to create the resources required for notifications.

Figure 6: Architecture diagram of the simpler configuration of the solution

The solution involves the following steps:

Create a central Subscribe to AWS Security Hub notifications Service Catalog product in an AWS account which is shared with the entire organization in AWS Organizations or with specific organizational units (OUs). Configure the product with the names of IAM roles or IAM Identity Center permission sets that can launch the product.
Users who sign in through the designated IAM roles or permission sets can access the shared Service Catalog product from the AWS Management Console and enter the required parameters such as their email address and the minimum severity level for notifications.
The Service Catalog product creates an AWS CloudFormation stack, which creates an Amazon Simple Notification Service (Amazon SNS) topic and an Amazon EventBridge rule that filters new Security Hub finding events that match the user’s parameters, such as minimum severity level. The rule then formats the Security Hub JSON event message to make it human-readable by using native EventBridge input transformers. The formatted message is then sent to SNS, which emails the user.

We also provide a more advanced and recommended deployment option, shown in Figure 7. This option involves using an AWS Lambda function to enhance messages by doing conversions from UTC to your selected time zone, setting the email subject to the finding summary, and including an IAM Identity Center shortcut link to the finding. To not require your users to have permissions for creating Lambda functions and IAM roles, a Service Catalog launch role is used to create resources on behalf of the user, and this role is restricted by using IAM permissions boundaries.

Figure 7: Architecture diagram of the solution when using the calling user’s permissions

The architecture is similar to the previous option, but with the following changes:

Create a CloudFormation StackSet in advance to pre-create an IAM role and an IAM permissions boundary policy in every AWS account. The IAM role is used by the Service Catalog product as a launch role. It has permissions to create CloudFormation resources such as SNS topics, as well as to create IAM roles that are restricted by the IAM permissions boundary policy that allows only publishing SNS messages and writing to Amazon CloudWatch Logs.
Users who want to subscribe to security notifications require only minimal permissions; just enough to access Service Catalog and to pass the pre-created role (from the preceding step) to Service Catalog. This solution provides a sample AWS Identity Center permission set with these minimal permissions.
The Service Catalog product uses a Lambda function to format the message to make it human-readable. The stack creates an IAM role, limited by the permissions boundary, and the role is assumed by the Lambda function to publish the SNS message.

Prerequisites

The solution installation requires the following:

Administrator-level access to AWS Organizations. AWS Organizations must have all features
Security Hub enabled in the accounts you are monitoring.
An AWS account to host this solution, for example the Security Hub administrator account or a shared services account. This cannot be the management account.
One or more AWS accounts to consume the Service Catalog product.
Authentication that uses AWS IAM Identity Center or federated IAM role names in every AWS account for users accessing the Service Catalog product.
(Optional, only required when you opt to use Service Catalog launch roles) CloudFormation StackSet creation access to either the management account or a CloudFormation delegated administrator account.
This solution supports notifications coming from multiple AWS Regions. If you are operating Security Hub in multiple Regions, for a simplified deployment evaluate the Security Hub cross-Region aggregation feature and enable it for the applicable Regions.

Walkthrough

There are four steps to deploy this solution:

Configure AWS Organizations to allow Service Catalog product sharing.
(Optional, recommended) Use CloudFormation StackSets to deploy the Service Catalog launch IAM role across accounts.
Service Catalog product creation to allow users to subscribe to Security Hub notifications. This needs to be deployed in the specific Region you want to monitor your Security Hub findings in, or where you enabled cross-Region aggregation.
(Optional, recommended) Provision least-privileged IAM Identity Center permission sets.

Step 1: Configure AWS Organizations

Service Catalog organizations sharing in AWS Organizations must be enabled, and the account that is hosting the solution must be one of the delegated administrators for Service Catalog. This allows the Service Catalog product to be shared to other AWS accounts in the organization.

To enable this configuration, sign in to the AWS Management Console in the management AWS account, launch the AWS CloudShell service, and enter the following commands. Replace the <Account ID> variable with the ID of the account that will host the Service Catalog product.

# Enable AWS Organizations integration in Service Catalog
aws servicecatalog enable-aws-organizations-access

# Nominate the account to be one of the delegated administrators for Service Catalog
aws organizations register-delegated-administrator --account-id <Account ID> --service-principal servicecatalog.amazonaws.com

Step 2: (Optional, recommended) Deploy IAM roles across accounts with CloudFormation StackSets

The following steps create a CloudFormation StackSet to deploy a Service Catalog launch role and permissions boundary across your accounts. This is highly recommended if you plan to enable Lambda formatting, because if you skip this step, only users who have permissions to create IAM roles will be able to subscribe to security notifications.

To deploy IAM roles with StackSets

Sign in to the AWS Management Console from the management AWS account, or from a CloudFormation delegated administrator
Download the CloudFormation template for creating the StackSet.
Navigate to the AWS CloudFormation page.
Choose Create stack, and then choose With new resources (standard).
Choose Upload a template file and upload the CloudFormation template that you downloaded earlier:SecurityHub_notifications_IAM_role_stackset.yaml. Then choose Next.
Enter the stack name SecurityNotifications-IAM-roles-StackSet.
Enter the following values for the parameters:
1. AWS Organization ID: Start AWS CloudShell and enter the command provided in the parameter description to get the organization ID.
2. Organization root ID or OU ID(s): To deploy the IAM role and permissions boundary to every account, enter the organization root ID using CloudShell and the command in the parameter description. To deploy to specific OUs, enter a comma-separated list of OU IDs. Make sure that you include the OU of the account that is hosting the solution.
3. Current Account Type: Choose either Management account or Delegated administrator account, as needed.
4. Formatting method: Indicate whether you plan to use the Lambda formatter for Security Hub notifications, or native EventBridge formatting with no Lambda functions. If you’re unsure, choose Lambda.
Choose Next, and then optionally enter tags and choose Submit. Wait for the stack creation to finish.

Step 3: Create Service Catalog product

Next, run the included installation script that creates the CloudFormation templates that are required to deploy the Service Catalog product and portfolio.

To run the installation script

In the terminal, enter the following commands:

git clone https://github.com/aws-samples/improving-security-incident-response-times-by-decentralizing-notifications.git

cd improving-security-incident-response-times-by-decentralizing-notifications

./install.sh

The script will ask for the following information:

Whether you will be using the Lambda formatter (as opposed to the native EventBridge formatter).
The timezone to use for displaying dates and times in the email notifications, for example Australia/Melbourne. The default is UTC.
The Service Catalog provider display name, which can be your company, organization, or team name.
The Service Catalog product version, which defaults to v1. Increment this value if you make a change in the product CloudFormation template file.
Whether you deployed the IAM role StackSet in Step 2, earlier.
The principal type that will use the Service Catalog product. If you are using IAM Identity Center, enter IAM_Identity_Center_Permission_Set. If you have federated IAM roles configured, enter IAM role name.
If you entered IAM_Identity_Center_Permission_Set in the previous step, enter the IAM Identity Center URL subdomain. This is used for creating a shortcut URL link to Security Hub in the email. For example, if your URL looks like this: https://d-abcd1234.awsapps.com/start/#/, then enter d-abcd1234.
The principals that will have access to the Service Catalog product across the AWS accounts. If you’re using IAM Identity Center, this will be a permission set name. If you plan to deploy the provided permission set in the next step (Step 4), press enter to accept the default value SubscribeToSecurityNotifications. Otherwise, enter an appropriate permission set name (for example AWSPowerUserAccess) or IAM role name that users use.

The script creates the following CloudFormation stacks:

SecurityHub_notifications_SC-Bucket.yaml: This stack creates an Amazon Simple Storage (Amazon S3) bucket that contains the file SecurityHub-Notifications.yaml, which is the CloudFormation template file associated with the Service Catalog product. The script modifies the Mappings section of the template file that has the configuration details depending on the answers to the installation script questions, and then uploads the file to the bucket.
SecurityHub_notifications_ServiceCatalog_Portfolio.yaml: This stack creates a Service Catalog portfolio and product using the Amazon S3 bucket from the previous step and gives permissions to the required principals to launch the product.

After the script finishes the installation, it outputs the Service Catalog Product ID, which you will need in the next step. The script then asks whether it should automatically share this Service Catalog portfolio with the entire organization or a specific account, or whether you will configure sharing to specific OUs manually.

(Optional) To manually configure sharing with an OU

In the Service Catalog console, choose Portfolios.
Choose Subscribe to AWS Security Hub notifications.
On the Share tab, choose Add a share.
Choose AWS Organization, and then select the OU. The product will be shared to the accounts and child OUs within the selected OU.
Select Principal sharing, and then choose Share.

To expand this solution across Regions, enable Security Hub cross-Region aggregation. This results in the email notifications coming from the linked Regions that are configured in Security Hub, even though the Service Catalog product is instantiated in a single Region. If cross-Region aggregation isn’t enabled and you want to monitor multiple Regions, you must repeat the preceding steps in all the Regions you are monitoring.

Step 4: (Optional, recommended) Provision IAM Identity Center permission sets

This step requires you to have completed Step 2 (Deploy IAM roles across accounts with CloudFormation StackSets).

If you’re using IAM Identity Center, the following steps create a custom permission set, SubscribeToSecurityNotifications, that provides least-privileged access for users to subscribe to security notifications. The permission set redirects to the Service Catalog page to launch the product.

To provision Identity Center permission sets

Sign in to the AWS Management Console from the management AWS account, or from an IAM Identity Center delegated administrator
Download the CloudFormation template for creating the permission set.
Navigate to the AWS CloudFormation page.
Choose Create stack, and then choose With new resources (standard).
Choose Upload a template file and upload the CloudFormation template you downloaded earlier: SecurityHub_notifications_PermissionSets.yaml. Then choose Next.
Enter the stack name SecurityNotifications-PermissionSet.
Enter the following values for the parameters:
1. AWS IAM Identity Center Instance ARN: Use the AWS CloudShell command in the parameter description to get the IAM Identity Center ARN.
2. Permission set name: Use the default value SubscribeToSecurityNotifications.
3. Service Catalog product ID: Use the last output line of the install.sh script in Step 3, or alternatively get the product ID from the Service Catalog console for the product account.
Choose Next. Then optionally enter tags and choose Next Wait for the stack creation to finish.

Next, go to the IAM Identity Center console, select your AWS accounts, and assign access to the SubscribeToSecurityNotifications permission set for your users or groups.

Testing

To test the solution, sign in to an AWS account, making sure to sign in with the designated IAM Identity Center permission set or IAM role. Launch the product in Service Catalog to subscribe to Security Hub security notifications.

Wait for a Security Hub notification. For example, if you have the AWS Foundational Security Best Practices (FSBP) standard enabled, creating an S3 bucket with no server access logging enabled should generate a notification within a few minutes.

Additional considerations

Keep in mind the following:

There is a cost for each SNS email notification sent out, as well as for Service Catalog API calls and execution of Lambda functions (if enabled).
Consider enabling Security Hub consolidated control findings so you don’t receive multiple email notifications for a control that applies to multiple standards.
The blog post Considerations for security operations in the cloud compares and contrasts the centralized, decentralized, and hybrid models for security operations.
The Initiate remediation for non-compliant resources and Incident response sections of the Security Pillar of the AWS Well-Architected Framework walk through best practices for remediation and incident response.

Cleanup

To remove unneeded resources after testing the solution, follow these steps:

In the workload account or accounts where the product was launched:
1. Go to the Service Catalog provisioned products page and terminate each associated provisioned product. This stops security notifications from being sent to the email address associated with the product.
In the AWS account that is hosting the directory:
1. In the Service Catalog console, choose Portfolios, and then choose Subscribe to AWS Security Hub notifications. On the Share tab, select the items in the list and choose Actions, then choose Unshare.
2. In the CloudFormation console, delete the SecurityNotifications-Service-Catalog stack.
3. In the Amazon S3 console, for the two buckets starting with securitynotifications-sc-bucket, select the bucket and choose Empty to empty the bucket.
4. In the CloudFormation console, delete the SecurityNotifications-SC-Bucket stack.
If applicable, go to the management account or the CloudFormation delegated administrator account and delete the SecurityNotifications-IAM-roles-StackSet stack.
If applicable, go to the management account or the IAM Identity Center delegated administrator account and delete the SecurityNotifications-PermissionSet stack.

Conclusion

This solution described in this blog post enables you to set up a self-service standardized mechanism that application or workload owners can use to get security notifications within minutes through email, as opposed to being contacted by a security team later. This can help to improve your security posture by reducing the incident resolution time, which reduces the time that a security issue remains active.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Leveraging Kubernetes virtual machines at Cloudflare with KubeVirt

2024-10-08 Justin Cichra

Post Syndicated from Justin Cichra original https://blog.cloudflare.com/leveraging-kubernetes-virtual-machines-with-kubevirt

Cloudflare runs several multi-tenant Kubernetes clusters across our core data centers. These general-purpose clusters run on bare metal and power our control plane, analytics, and various engineering tools such as build infrastructure and continuous integration.

Kubernetes is a container orchestration platform. It enables software engineers to deploy containerized applications to a cluster of machines. This enables teams to build highly-available software on a scalable and resilient platform.

In this blog post we discuss our Kubernetes architecture, why we needed virtualization, and how we’re using it today.

Multi-tenant clusters

Multi-tenancy is a concept where one system can share its resources among a wide range of customers. This model allows us to build and manage a small number of general purpose Kubernetes clusters for our internal application teams. Keeping the number of clusters small reduces our operational toil. This model shrinks costs and increases computational efficiency by sharing hardware. Multi-tenancy also allows us to scale more efficiently. Scaling is done at either a cluster or application level. Cluster operators scale the platform by adding more hardware. Teams scale their applications by updating their Kubernetes manifests. They can scale vertically by increasing their resource requests or horizontally by increasing the number of replicas.

All of our Kubernetes clusters are multi-tenant with various components enabled for a secure and resilient platform.

Pods are secured using the latest standards recommended by the Kubernetes project. We use Pod Security Admission (PSA) and Pod Security Standards to ensure all workloads are following best practices. By default, all namespaces use the most restrictive profile, and only a few Kubernetes control plane namespaces are granted privileged access. For additional policies not covered by PSA, we built custom Validating Webhooks on top of the controller-runtime framework. PSA and our custom policies ensure clusters are secure and workloads are isolated.

Our need for virtualization

A select number of teams needed tight integration with the Linux kernel. Examples include Docker daemons for build infrastructure and the ability to simulate servers running the software and configuration of our global network. With our pod security requirements, these workloads are not permitted to interface with the host kernel at a deep level (e.g. no iptables or sysctls). Doing so may disrupt other tenants sharing the node and open additional attack vectors if an application was compromised. A virtualization platform would enable these workloads to interact with their own kernel within a secured Kubernetes cluster.

We considered various different virtualization solutions. Running a separate virtualization platform outside of Kubernetes would have worked, but would not tightly integrate containerized workloads with virtual machines. It would also be an additional operational burden on our team, as backups, alerting, and fleet management would have to exist for both our Kubernetes and virtual machine clusters.

We then looked for solutions that run virtual machines within Kubernetes. Teams could already manually deploy QEMU pods, but this was not an elegant solution. We needed a better way. There were several other options, but KubeVirt was the tool that met the majority of our requirements. Other solutions required a privileged container to run a virtual machine, but KubeVirt did not – this was a crucial requirement in our goal of creating a more secure multi-tenant cluster. KubeVirt also uses a feature of the Kubernetes API called Custom Resource Definitions (CRDs), which extends the Kubernetes API with new objects, increasing the flexibility of Kubernetes beyond its built-in types. For KubeVirt, this includes objects such as VirtualMachine and VirtualMachineInstanceReplicaSet. We felt the use of CRDs would allow KubeVirt to grow as more features were added.

What is KubeVirt?

KubeVirt is a virtualization platform that enables users to run virtual machines within Kubernetes. With KubeVirt, virtual machines run alongside containerized workloads on the same platform. Kubernetes primitives such as network policies, configmaps, and services all integrate with virtual machines. KubeVirt scales with our needs and is successfully running hundreds of virtual machines across several clusters. We frequently remediate Kubernetes nodes, so virtual machines and pods are always exercising their startup/shutdown processes.

How Cloudflare uses KubeVirt

There are a number of internal projects leveraging virtual machines at Cloudflare. We’ll touch on a few of our more popular use cases:

Kubernetes scalability testing
Development environments
Kernel and iPXE testing
Build pipelines

Kubernetes scalability testing

Setup process

Our staging clusters are much smaller than our largest production clusters. They also run on bare metal and mirror the configuration we have for each production cluster. This is extremely useful when rolling out new software, operating systems, or kernel changes; however, they miss bugs that only surface at scale. We use KubeVirt to bridge this gap and virtualize Kubernetes clusters with hundreds of nodes and thousands of pods.

The setup process for virtualized clusters differs from our bare metal provisioning steps. For bare metal, we use Salt to provision clusters from start to finish. For our virtualized clusters we use Ansible and kubeadm. Our bare metal staging clusters are responsible for testing and validating our Salt configuration. The virtualized clusters give us a vanilla Kubernetes environment without any Cloudflare customizations. Having a stock environment in addition to our Salt environment helps us isolate bugs down to a Kubernetes change, a kernel change, or a Cloudflare-specific configuration change.

Our virtualized clusters consist of a KubeVirt VirtualMachine object per node. We create three control-plane nodes and any number of worker nodes. Each virtual machine starts out as a vanilla Debian generic cloud image. Using KubeVirt’s cloud-init support, the virtual machine downloads an internal Ansible playbook which installs a recent kernel, cri-o (the container runtime we use), and kubeadm.

- name: Add the Kubernetes gpg key
  apt_key:
    url: https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    state: present

- name: Add the Kubernetes repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/ /" | tee /etc/apt/sources.list.d/kubernetes.list

- name: Add the CRI-O gpg key
  apt_key:
    url: https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/cri-o-apt-keyring.gpg
    state: present

- name: Add the CRI-O repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/ /" | tee /etc/apt/sources.list.d/cri-o.list

- name: Install CRI-O and Kubernetes packages
  apt:
    name:
      - cri-o
      - kubelet
      - kubeadm
      - kubectl
    update_cache: yes
    state: present

- name: Enable and start CRI-O service
  service:
    state: started
    enabled: yes
    name: crio.service

^{Ansible playbook steps to download and install Kubernetes tooling}

Once each node has completed its individual playbook, we can initialize and join nodes to the cluster using another playbook that runs kubeadm. From there the cluster can be accessed by logging into a control plane node using kubectl.

Simulating at scale

When losing 10s or 100s of nodes at once, Kubernetes needs to act quickly to minimize downtime. The sooner it recognizes node failure, the faster it can reroute traffic to healthy pods.

Using Kubernetes in KubeVirt we are able to simulate a large cluster undergoing a network cut and observe how Kubernetes reacts. The KubeVirt Kubernetes cluster allows us to rapidly iterate on configuration changes and code patches.

The following Ansible playbook task simulates a network segmentation failure where only the control-plane nodes remain online.

- name: Disable network interfaces on all workers
  command: ifconfig enp1s0 down
  async: 5
  poll: 0
  ignore_errors: yes
  when: inventory_hostname in groups['kube-node']

^{An Ansible role which disables the network on all worker nodes simultaneously.}

This framework allows us to exercise the code in controller-manager, Kubernetes’s daemon that reconciles the fundamental state of the system (Nodes, Pods, etc). Our simulation platform helped us drastically shorten full traffic recovery time when a large number of Kubernetes nodes become unreachable. We upstreamed our changes to Kubernetes and more controller-manager speed improvements are coming soon.

Development environments

Compiling code on your laptop can be slow. Perhaps you’re working on a patch for a large open-source project (e.g. V8 or Clickhouse) or need more bandwidth to upload and download containers. With KubeVirt, we enable our developers to rapidly iterate on software development and testing on powerful server hardware. KubeVirt integrates with Kubernetes Persistent Volumes, which enables teams to persist their development environment across restarts.

There are a number of teams at Cloudflare using KubeVirt for a variety of development and testing environments. Most notably is a project called Edge Test Fleet, which emulates a physical server and all the software that runs Cloudflare’s global network. Teams can test their code and configuration changes against the entire software stack without reserving dedicated hardware. Cloudflare uses Salt to provision systems. It can be difficult to iterate and test Salt changes without a complete virtual environment. Edge Test Fleet makes iterating on Salt easier, ensuring states compile and render the right output. With Edge Test Fleet, new developers can better understand how Cloudflare’s global network works without touching staging or production.

Additionally, one Cloudflare team developed a framework that allows users to build and test changes to Clickhouse using a VSCode environment. This framework is generally applicable to all teams requiring a development environment. Once a template environment is provisioned, CSI Volume Cloning can duplicate a golden volume, separating persistent environments for each developer.

apiVersion: v1
kind: PersistentVolumeClaim
name: devspace-jcichra-rootfs
namespace: dev-clickhouse-vms
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: rook-ceph-nvme
  dataSource:
    kind: PersistentVolumeClaim
    name: dev-rootfs
  resources:
    requests:
      storage: 500Gi

^{A PersistentVolumeClaim that clones data from another volume using CSI Volume Cloning}

Kernel and iPXE testing

Unlike user space software development, when a kernel crashes, the entire system crashes. The kernel team uses KubeVirt for development. KubeVirt gives all kernel engineers, regardless of laptop OS or architecture, the same x86 environment and hypervisor. Virtual machines on server hardware can be scaled up to more cores and memory than on laptops. The Cloudflare kernel team has also found low-level issues which only surface in environments with many CPUs.

To make testing fast and easy, the kernel team serves iPXE images via an nginx Pod and Service adjacent to the virtual machine. A recent kernel and Debian image are copied to the nginx pod via kubectl cp. The iPXE file can then be referenced in the KubeVirt virtual machine definition via the DNS name for the Kubernetes Service.

interfaces:
  name: default
  masquerade: {}
  model: e1000e
  ports:
    - port: 22
      dhcpOptions:
        bootFileName: http://httpboot.u-$K8S_USER.svc.cluster.local/boot.ipxe

When the virtual machine boots, it will get an IP address on the default interface behind NAT due to our masquerade setting. Then it will download boot.ipxe, which describes what additional files should be downloaded to start the system. In this case, the kernel (vmlinuz-amd64), Debian (baseimg-amd64.img) and additional kernel modules (modules-amd64.img) are downloaded.

^{UEFI iPXE boot connecting and downloading files from nginx pod in user’s namespace}

Once the system is booted, a developer can log in to the system for testing:

linux login: root
Password: 
Linux linux 6.6.35-cloudflare-2024.6.7 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@linux:~#

Custom kernels can be copied to the nginx pod via kubectl cp. Restarting the virtual machine will load that new kernel for testing. When a kernel panic occurs, the virtual machine can quickly be restarted with virtctl restart linux and it will go through the iPXE boot process again.

Build pipelines

Cloudflare leverages KubeVirt to build a majority of software at Cloudflare. Virtual machines give build system users full control over their pipeline. For example, Debian packages can easily be installed and separate container daemons (such as Docker) can run all within a Kubernetes namespace using the restricted Pod Security Standard. KubeVirt’s VirtualMachineReplicaSet concept allows us to quickly scale up and down the number of build agents to match demand. We can roll out different sets of virtual machines with varying sizes, kernels, and operating systems.

To scale efficiently, we leverage container disks to store our agent virtual machine images. Container disks allow us to store the virtual machine image (for example, a qcow image) in our container registry. This strategy works well when the state in virtual machines is ephemeral. Liveness probes detect unhealthy or broken agents, shutting down the virtual machine and replacing them with a fresh instance. Other automation limits virtual machine uptime, capping it to 3–4 hours to keep build agents fresh.

Next steps

We’re excited to expand our use of KubeVirt and unlock new capabilities for our internal users. KubeVirt’s Linux ARM64 support will allow us to build ARM64 packages in-cluster and simulate ARM64 systems.

Projects like KubeVirt CDI (Containerized Data Importer) will streamline our user’s virtual machine experience. Instead of users manually building container disks, we can provide a catalog of virtual machine images. It also allows us to copy virtual machine disks between namespaces.

Conclusion

KubeVirt has proven to be a great tool for virtualization in our Kubernetes-first environment. We’ve unlocked the ability to support more workloads with our multi-tenant model. The KubeVirt platform allows us to offer a single compute platform supporting containers and virtual machines. Managing it has been simple, and upgrades have been straightforward and non-disruptive. We’re exploring additional features KubeVirt offers to improve the experience for our users.

Finally, our team is expanding! We’re looking for more people passionate about Kubernetes to join our team and help us push Kubernetes to the next level.

Cloudflare acquires Kivera to add simple, preventive cloud security to Cloudflare One

2024-10-08 Noelle Kagan

Post Syndicated from Noelle Kagan original https://blog.cloudflare.com/cloudflare-acquires-kivera

We’re excited to announce that Kivera, a cloud security, data protection, and compliance company, has joined Cloudflare. This acquisition extends our SASE portfolio to incorporate inline cloud app controls, empowering Cloudflare One customers with preventative security controls for all their cloud services.

In today’s digital landscape, cloud services and SaaS (software as a service) apps have become indispensable for the daily operation of organizations. At the same time, the amount of data flowing between organizations and their cloud providers has ballooned, increasing the chances of data leakage, compliance issues, and worse, opportunities for attackers. Additionally, many companies — especially at enterprise scale — are working directly with multiple cloud providers for flexibility based on the strengths, resiliency against outages or errors, and cost efficiencies of different clouds.

Security teams that rely on Cloud Security Posture Management (CSPM) or similar tools for monitoring cloud configurations and permissions and Infrastructure as code (IaC) scanning are falling short due to detecting issues only after misconfigurations occur with an overwhelming volume of alerts. The combination of Kivera and Cloudflare One puts preventive controls directly into the deployment process, or ‘inline’, blocking errors before they happen. This offers a proactive approach essential to protecting cloud infrastructure from evolving cyber threats, maintaining data security, and accelerating compliance.

An early warning system for cloud security risks

In a significant leap forward in cloud security, the combination of Kivera’s technology and Cloudflare One adds preventive, inline controls to enforce secure configurations for cloud resources. By inspecting cloud API traffic, these new capabilities equip organizations with enhanced visibility and granular controls, allowing for a proactive approach in mitigating risks, managing cloud security posture, and embracing a streamlined DevOps process when deploying cloud infrastructure.

Kivera will add the following capabilities to Cloudflare’s SASE platform:

One-click security: Customers benefit from immediate prevention of the most common cloud breaches caused by misconfigurations, such as accidentally allowing public access or policy inconsistencies.
Enforced cloud tenant control: Companies can easily draw boundaries around their cloud resources and tenants to ensure that sensitive data stays within their organization.
Prevent data exfiltration: Easily set rules to prevent data being sent to unauthorized locations.
Reduce ‘shadow’ cloud infrastructure: Ensure that every interaction between a customer and their cloud provider is in line with preset standards.
Streamline cloud security compliance: Customers can automatically assess and enforce compliance against the most common regulatory frameworks.
Flexible DevOps model: Enforce bespoke controls independent of public cloud setup and deployment tools, minimizing the layers of lock-in between an organization and a cloud provider.
Complementing other cloud security tools: Create a first line of defense for cloud deployment errors, reducing the volume of alerts for customers also using CSPM tools or Cloud Native Application Protection Platforms (CNAPPs).

_{An intelligent proxy that uses a policy-based approach to

enforce secure configuration of cloud resources.}

Better together with Cloudflare One

As a SASE platform, Cloudflare One ensures safe access and provides data controls for cloud and SaaS apps. This integration broadens the scope of Cloudflare’s SASE platform beyond user-facing applications to incorporate increased cloud security through proactive configuration management of infrastructure services, beyond what CSPM and CASB solutions provide. With the addition of Kivera to Cloudflare One, customers now have a unified platform for all their inline protections, including cloud control, access management, and threat and data protection. All of these features are available with single-pass inspection, which is 50% faster than Secure Web Gateway (SWG) alternatives.

With the earlier acquisition of BastionZero, a Zero Trust infrastructure access company, Cloudflare One expanded the scope of its VPN replacement solution to cover infrastructure resources as easily as it does apps and networks. Together Kivera and BastionZero enable centralized security management across hybrid IT environments, and provide a modern DevOps-friendly way to help enterprises connect and protect their hybrid infrastructure with Zero Trust best practices.

Beyond its SASE capabilities, Cloudflare One is integral to Cloudflare’s connectivity cloud, enabling organizations to consolidate IT security tools on a single platform. This simplifies secure access to resources, from developer privileged access to technical infrastructure and expanding cloud services. As Forrester echoes, “Cloudflare is a good choice for enterprise prospects seeking a high-performance, low-maintenance, DevOps-oriented solution.”

The growing threat of cloud misconfigurations

The cloud has become a prime target for cyberattacks. According to the 2023 Cloud Risk Report, CrowdStrike observed a 95% increase in cloud exploitation from 2021 to 2022, with a staggering 288% jump in cases involving threat actors directly targeting the cloud.

Misconfigurations in cloud infrastructure settings, such as improperly set security parameters and default access controls, provide adversaries with an easy path to infiltrate the cloud. According to the 2023 Thales Global Cloud Security Study, which surveyed nearly 3,000 IT and security professionals from 18 countries, 44% of respondents reported experiencing a data breach, with misconfigurations and human error identified as the leading cause, accounting for 31% of the incidents.

Further, according to Gartner^Ⓡ, “Through 2027, 99% of records compromised in cloud environments will be the result of user misconfigurations and account compromise, not the result of an issue with the cloud provider.”¹

Several factors contribute to the rise of cloud misconfigurations:

Rapid adoption of cloud services: Leaders are often driven by the scalability, cost-efficiency, and ability to support remote work and real-time collaboration that cloud services offer. These factors enable rapid adoption of cloud services which can lead to unintentional misconfigurations as IT teams struggle to keep up with the pace and complexity of these services.
Complexity of cloud environments: Cloud infrastructure can be highly complex with multiple services and configurations to manage. For example, AWS alone offers 373 services with 15,617 actions and 140,000+ parameters, making it challenging for IT teams to manage settings accurately.
Decentralized management: In large organizations, cloud infrastructure resources are often managed by multiple teams or departments. Without centralized oversight, inconsistent security policies and configurations can arise, increasing the risk of misconfigurations.
Continuous Integration and Continuous Deployment (CI/CD): CI/CD pipelines promote the ability to rapidly deploy, change and frequently update infrastructure. With this velocity comes the increased risk of misconfigurations when changes are not properly managed and reviewed.
Insufficient training and awareness: Employees may lack the cross-functional skills needed for cloud security, such as understanding networks, identity, and service configurations. This knowledge gap can lead to mistakes and increases the risk of misconfigurations that compromise security.

Common exploitation methods

Threat actors exploit cloud services through various means, including targeting misconfigurations, abusing privileges, and bypassing encryption. Misconfigurations such as exposed storage buckets or improperly secured APIs offer attackers easy access to sensitive data and resources. Privilege abuse occurs when attackers gain unauthorized access through compromised credentials or poorly managed identity and access management (IAM) policies, allowing them to escalate their access and move laterally within the cloud environment. Additionally, unencrypted data enables attackers to intercept and decrypt data in transit or at rest, further compromising the integrity and confidentiality of sensitive information.

Here are some other vulnerabilities that organizations should address:

Unrestricted access to cloud tenants: Allowing unrestricted access exposes cloud platforms to data exfiltration by malicious actors. Limiting access to approved tenants with specific IP addresses and service destinations helps prevent unauthorized access.
Exposed access keys: Exposed access keys can be exploited by unauthorized parties to steal or delete data. Requiring encryption for the access keys and restricting their usage can mitigate this risk.
Excessive account permissions: Granting excessive privileges to cloud accounts increases the potential impact of security breaches. Limiting permissions to necessary operations helps prevent lateral movement and privilege escalation by threat actors.
Inadequate network segmentation: Poorly managed network security groups and insufficient segmentation practices can allow attackers to move freely within cloud environments. Drawing boundaries around your cloud resources and tenants ensures that data stays within your organization.
Improper public access configuration: Incorrectly exposing critical services or storage resources to the internet increases the likelihood of unauthorized access and data compromise. Preventing public access drastically reduces risk.
Shadow cloud infrastructure: Abandoned or neglected cloud instances are often left vulnerable to exploitation, providing attackers with opportunities to access sensitive data left behind. Preventing untagged or unapproved cloud resources to be created can reduce the risk of exposure.

Limitations of existing tools

Many organizations turn to CSPM tools to give them more visibility into cloud misconfigurations. These tools often alert teams after an issue occurs, putting security teams in a reactive mode. Remediation efforts require collaboration between security teams and developers to implement changes, which can be time-consuming and resource-intensive. This approach not only delays issue resolution but also exposes companies to compliance and legal risks, while failing to train employees on secure cloud practices. On average, it takes 207 days to identify these breaches and an additional 70 days to contain them.

Addressing the growing threat of cloud misconfigurations requires proactive security measures and continuous monitoring. Organizations must adopt proactive security solutions that not only detect and alert but also prevent misconfigurations from occuring in the first place and enforce best practices. Creating a first line of defense for cloud deployment errors reduces the volume of alerts for customers, especially those also using CSPM tools or CNAPPs.

By implementing these proactive strategies, organizations can safeguard their cloud environments against the evolving landscape of cyber threats, ensuring robust security and compliance while minimizing risks and operational disruptions.

What’s next for Kivera

The Kivera product will not be a point solution add-on. We’re making it a core part of our Cloudflare One offering because integrating features from products like our Secure Web Gateway give customers a comprehensive solution that works better together.

We’re excited to welcome Kivera to the Cloudflare team. Through the end of 2024 and into early 2025, Kivera’s team will focus on integrating their preventive inline cloud app controls directly into Cloudflare One. We are looking for early access testers and teams to provide feedback about what they would like to see. If you’d like early access, please join the waitlist.

_{[1] Source: Outcome-Driven Metrics You Can Use to Evaluate Cloud Security Controls, Gartner, Charlie Winckless, Paul Proctor, Manuel Acosta, 09/28/2023}