Patch Tuesday – August 2022

Post Syndicated from Greg Wiseman original https://blog.rapid7.com/2022/08/09/patch-tuesday-august-2022/

It’s the week of Hacker Summer Camp in Las Vegas, and Microsoft has published fixes for 141 separate vulnerabilities in their swath of August updates. This is a new monthly record by raw CVE count, but from a patching perspective, the numbers are slightly less dire. 20 CVEs affect their Chromium-based Edge browser, and 34 affect Azure Site Recovery (up from 32 CVEs affecting that product last month). As usual, OS-level updates will address a lot of these, but note that some extra configuration is required to fully protect Exchange Server this month.

There is one zero-day being patched this month. CVE-2022-34713 is a remote code execution (RCE) vulnerability affecting the Microsoft Windows Support Diagnostic Tool (MSDT) – it carries a CVSSv3 base score of 7.8, as it requires convincing a potential victim to open a malicious file. The advisory indicates that this CVE is a variant of the “DogWalk” vulnerability, which made news alongside Follina (CVE-2022-30190) back in May.

Publicly disclosed, but not (yet) exploited is CVE-2022-30134, an Information Disclosure vulnerability affecting Exchange Server. In this case, simply patching is not sufficient to protect against attackers being able to read targeted email messages. Administrators should enable Extended Protection in order to fully remediate this vulnerability, as well as the five other vulnerabilities affecting Exchange this month. Details about how to accomplish this are available via the Exchange Blog.

Microsoft also patched several flaws affecting Remote Access Server (RAS). The most severe of these (CVE-2022-30133 and CVE-2022-35744) are related to the Windows Point-to-Point Protocol (PPP) and could allow RCE simply by sending a malicious connection request to a server. Seven CVEs affecting the Windows Secure Socket Tunneling Protocol (SSTP) on RAS were also fixed this month: six RCEs and one Denial of Service. If you have RAS in your environment but are unable to patch immediately, consider blocking traffic on port 1723 from your network.
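If patching must wait, it can help to first find hosts that are actually exposing PPTP. Below is a minimal sketch of a connect scan for TCP port 1723 (this is our illustration, not part of the advisory; the CIDR range is a placeholder for your own address space):

    import ipaddress
    import socket

    PPTP_PORT = 1723  # PPTP control channel; block or audit this at the perimeter

    def pptp_listeners(cidr, timeout=0.5):
        """Yield addresses in `cidr` that accept a TCP connection on port 1723."""
        for host in ipaddress.ip_network(cidr).hosts():
            try:
                with socket.create_connection((str(host), PPTP_PORT), timeout=timeout):
                    yield str(host)
            except OSError:
                continue

    # Placeholder range; substitute the subnets where RAS might be running.
    for addr in pptp_listeners("192.0.2.0/24"):
        print(f"PPTP exposed: {addr}")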

Vulnerabilities affecting Windows Network File System (NFS) have been trending in recent months, and today sees Microsoft patching CVE-2022-34715 (RCE, CVSS 9.8) affecting NFSv4.1 on Windows Server 2022.

That covers the worst of it. One last vulnerability to highlight: CVE-2022-35797 is a Security Feature Bypass in Windows Hello – Microsoft’s biometric authentication mechanism for Windows 10. Successful exploitation requires physical access to a system, but would allow an attacker to bypass a facial recognition check.

Summary charts

Summary tables

Azure vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-35802 Azure Site Recovery Elevation of Privilege Vulnerability No No 8.1 Yes
CVE-2022-30175 Azure RTOS GUIX Studio Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-30176 Azure RTOS GUIX Studio Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-34687 Azure RTOS GUIX Studio Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-35773 Azure RTOS GUIX Studio Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-35779 Azure RTOS GUIX Studio Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-35806 Azure RTOS GUIX Studio Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-35772 Azure Site Recovery Remote Code Execution Vulnerability No No 7.2 Yes
CVE-2022-35824 Azure Site Recovery Remote Code Execution Vulnerability No No 7.2 Yes
CVE-2022-33646 Azure Batch Node Agent Elevation of Privilege Vulnerability No No 7 Yes
CVE-2022-35780 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35781 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35799 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35775 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35801 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35807 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35808 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35782 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35809 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35784 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35810 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35811 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35785 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35786 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35813 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35788 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35814 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35789 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35815 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35790 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35816 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35817 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35791 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35818 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35819 Azure Site Recovery Elevation of Privilege Vulnerability No No 6.5 Yes
CVE-2022-35776 Azure Site Recovery Denial of Service Vulnerability No No 6.2 Yes
CVE-2022-34685 Azure RTOS GUIX Studio Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-34686 Azure RTOS GUIX Studio Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-35774 Azure Site Recovery Elevation of Privilege Vulnerability No No 4.9 Yes
CVE-2022-35800 Azure Site Recovery Elevation of Privilege Vulnerability No No 4.9 Yes
CVE-2022-35787 Azure Site Recovery Elevation of Privilege Vulnerability No No 4.9 Yes
CVE-2022-35821 Azure Sphere Information Disclosure Vulnerability No No 4.4 Yes
CVE-2022-35783 Azure Site Recovery Elevation of Privilege Vulnerability No No 4.4 Yes
CVE-2022-35812 Azure Site Recovery Elevation of Privilege Vulnerability No No 4.4 Yes

Browser vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-33649 Microsoft Edge (Chromium-based) Security Feature Bypass Vulnerability No No 9.6 Yes
CVE-2022-33636 Microsoft Edge (Chromium-based) Remote Code Execution Vulnerability No No 8.3 Yes
CVE-2022-35796 Microsoft Edge (Chromium-based) Elevation of Privilege Vulnerability No No 7.5 Yes
CVE-2022-2624 Chromium: CVE-2022-2624 Heap buffer overflow in PDF No No N/A Yes
CVE-2022-2623 Chromium: CVE-2022-2623 Use after free in Offline No No N/A Yes
CVE-2022-2622 Chromium: CVE-2022-2622 Insufficient validation of untrusted input in Safe Browsing No No N/A Yes
CVE-2022-2621 Chromium: CVE-2022-2621 Use after free in Extensions No No N/A Yes
CVE-2022-2619 Chromium: CVE-2022-2619 Insufficient validation of untrusted input in Settings No No N/A Yes
CVE-2022-2618 Chromium: CVE-2022-2618 Insufficient validation of untrusted input in Internals No No N/A Yes
CVE-2022-2617 Chromium: CVE-2022-2617 Use after free in Extensions API No No N/A Yes
CVE-2022-2616 Chromium: CVE-2022-2616 Inappropriate implementation in Extensions API No No N/A Yes
CVE-2022-2615 Chromium: CVE-2022-2615 Insufficient policy enforcement in Cookies No No N/A Yes
CVE-2022-2614 Chromium: CVE-2022-2614 Use after free in Sign-In Flow No No N/A Yes
CVE-2022-2612 Chromium: CVE-2022-2612 Side-channel information leakage in Keyboard input No No N/A Yes
CVE-2022-2611 Chromium: CVE-2022-2611 Inappropriate implementation in Fullscreen API No No N/A Yes
CVE-2022-2610 Chromium: CVE-2022-2610 Insufficient policy enforcement in Background Fetch No No N/A Yes
CVE-2022-2606 Chromium: CVE-2022-2606 Use after free in Managed devices API No No N/A Yes
CVE-2022-2605 Chromium: CVE-2022-2605 Out of bounds read in Dawn No No N/A Yes
CVE-2022-2604 Chromium: CVE-2022-2604 Use after free in Safe Browsing No No N/A Yes
CVE-2022-2603 Chromium: CVE-2022-2603 Use after free in Omnibox No No N/A Yes

Developer Tools vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-35777 Visual Studio Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-35825 Visual Studio Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-35826 Visual Studio Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-35827 Visual Studio Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-34716 .NET Spoofing Vulnerability No No 5.9 Yes

ESU Windows vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-30133 Windows Point-to-Point Protocol (PPP) Remote Code Execution Vulnerability No No 9.8 Yes
CVE-2022-35744 Windows Point-to-Point Protocol (PPP) Remote Code Execution Vulnerability No No 9.8 Yes
CVE-2022-34691 Active Directory Domain Services Elevation of Privilege Vulnerability No No 8.8 Yes
CVE-2022-34714 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-35745 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-35752 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-35753 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-34702 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-35767 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-34706 Windows Local Security Authority (LSA) Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-34707 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35768 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35756 Windows Kerberos Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35751 Windows Hyper-V Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35795 Windows Error Reporting Service Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35820 Windows Bluetooth Driver Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35750 Win32k Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-34713 Microsoft Windows Support Diagnostic Tool (MSDT) Remote Code Execution Vulnerability Yes Yes 7.8 Yes
CVE-2022-35743 Microsoft Windows Support Diagnostic Tool (MSDT) Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-35760 Microsoft ATA Port Driver Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-30194 Windows WebBrowser Control Remote Code Execution Vulnerability No No 7.5 Yes
CVE-2022-35769 Windows Point-to-Point Protocol (PPP) Denial of Service Vulnerability No No 7.5 No
CVE-2022-35793 Windows Print Spooler Elevation of Privilege Vulnerability No No 7.3 Yes
CVE-2022-34690 Windows Fax Service Elevation of Privilege Vulnerability No No 7.1 Yes
CVE-2022-35759 Windows Local Security Authority (LSA) Denial of Service Vulnerability No No 6.5 No
CVE-2022-35747 Windows Point-to-Point Protocol (PPP) Denial of Service Vulnerability No No 5.9 Yes
CVE-2022-35758 Windows Kernel Memory Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-34708 Windows Kernel Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-34701 Windows Secure Socket Tunneling Protocol (SSTP) Denial of Service Vulnerability No No 5.3 No

Exchange Server vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-21980 Microsoft Exchange Server Elevation of Privilege Vulnerability No No 8 Yes
CVE-2022-24516 Microsoft Exchange Server Elevation of Privilege Vulnerability No No 8 Yes
CVE-2022-24477 Microsoft Exchange Server Elevation of Privilege Vulnerability No No 8 Yes
CVE-2022-30134 Microsoft Exchange Information Disclosure Vulnerability No Yes 7.6 Yes
CVE-2022-34692 Microsoft Exchange Information Disclosure Vulnerability No No 5.3 Yes
CVE-2022-21979 Microsoft Exchange Information Disclosure Vulnerability No No 4.8 Yes

Microsoft Office vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-34717 Microsoft Office Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-33648 Microsoft Excel Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-35742 Microsoft Outlook Denial of Service Vulnerability No No 7.5 Yes
CVE-2022-33631 Microsoft Excel Security Feature Bypass Vulnerability No No 7.3 Yes

System Center Azure vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-33640 System Center Operations Manager: Open Management Infrastructure (OMI) Elevation of Privilege Vulnerability No No 7.8 Yes

Windows vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-34715 Windows Network File System Remote Code Execution Vulnerability No No 9.8 Yes
CVE-2022-35804 SMB Client and Server Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-35761 Windows Kernel Elevation of Privilege Vulnerability No No 8.4 Yes
CVE-2022-35766 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-35794 Windows Secure Socket Tunneling Protocol (SSTP) Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-34699 Windows Win32k Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-33670 Windows Partition Management Driver Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-34703 Windows Partition Management Driver Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-34696 Windows Hyper-V Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-35746 Windows Digital Media Receiver Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35749 Windows Digital Media Receiver Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-34705 Windows Defender Credential Guard Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35771 Windows Defender Credential Guard Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35762 Storage Spaces Direct Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35763 Storage Spaces Direct Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35764 Storage Spaces Direct Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35765 Storage Spaces Direct Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-35792 Storage Spaces Direct Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-30144 Windows Bluetooth Service Remote Code Execution Vulnerability No No 7.5 Yes
CVE-2022-35748 HTTP.sys Denial of Service Vulnerability No No 7.5 Yes
CVE-2022-35755 Windows Print Spooler Elevation of Privilege Vulnerability No No 7.3 Yes
CVE-2022-35757 Windows Cloud Files Mini Filter Driver Elevation of Privilege Vulnerability No No 7.3 Yes
CVE-2022-35754 Unified Write Filter Elevation of Privilege Vulnerability No No 6.7 Yes
CVE-2022-35797 Windows Hello Security Feature Bypass Vulnerability No No 6.1 Yes
CVE-2022-34709 Windows Defender Credential Guard Security Feature Bypass Vulnerability No No 6 Yes
CVE-2022-30197 Windows Kernel Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-34710 Windows Defender Credential Guard Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-34712 Windows Defender Credential Guard Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-34704 Windows Defender Credential Guard Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-34303 CERT/CC: CVE-2022-34303 Crypto Pro Boot Loader Bypass No No N/A Yes
CVE-2022-34302 CERT/CC: CVE-2022-34302 New Horizon Data Systems Inc Boot Loader Bypass No No N/A Yes
CVE-2022-34301 CERT/CC: CVE-2022-34301 Eurosoft Boot Loader Bypass No No N/A Yes

How NerdWallet uses AWS and Apache Hudi to build a serverless, real-time analytics platform

Post Syndicated from Kevin Chun original https://aws.amazon.com/blogs/big-data/how-nerdwallet-uses-aws-and-apache-hudi-to-build-a-serverless-real-time-analytics-platform/

This is a guest post by Kevin Chun, Staff Software Engineer in Core Engineering at NerdWallet.

NerdWallet’s mission is to provide clarity for all of life’s financial decisions. This covers a diverse set of topics: from choosing the right credit card, to managing your spending, to finding the best personal loan, to refinancing your mortgage. As a result, NerdWallet offers powerful capabilities that span across numerous domains, such as credit monitoring and alerting, dashboards for tracking net worth and cash flow, machine learning (ML)-driven recommendations, and many more for millions of users.

To build a cohesive and performant experience for our users, we need to be able to use large volumes of varying user data sourced by multiple independent teams. This requires a strong data culture along with a set of data infrastructure and self-serve tooling that enables creativity and collaboration.

In this post, we outline a use case that demonstrates how NerdWallet is scaling its data ecosystem by building a serverless pipeline that enables streaming data from across the company. We iterated on two different architectures. We explain the challenges we ran into with the initial design and the benefits we achieved by using Apache Hudi and additional AWS services in the second design.

Problem statement

NerdWallet captures a sizable amount of spending data. This data is used to build helpful dashboards and actionable insights for users. The data is stored in an Amazon Aurora cluster. Even though the Aurora cluster works well as an Online Transaction Processing (OLTP) engine, it’s not suitable for large, complex Online Analytical Processing (OLAP) queries. As a result, we can’t expose direct database access to analysts and data engineers. The data owners have to field requests by building new data derivations on read replicas. As the data volume and the diversity of data consumers and requests grow, this process gets more difficult to maintain. In addition, data scientists mostly need file-based access to data in an object store like Amazon Simple Storage Service (Amazon S3).

We decided to explore alternatives where all consumers can independently fulfill their own data requests safely and scalably using open-standard tooling and protocols. Drawing inspiration from the data mesh paradigm, we designed a data lake based on Amazon S3 that decouples data producers from consumers while providing a self-serve, security-compliant, and scalable set of tooling that is easy to provision.

Initial design

The following diagram illustrates the architecture of the initial design.

The design included the following key components:

  1. We chose AWS Data Migration Service (AWS DMS) because it’s a managed service that facilitates the movement of data from various data stores such as relational and NoSQL databases into Amazon S3. AWS DMS allows one-time migration and ongoing replication with change data capture (CDC) to keep the source and target data stores in sync.
  2. We chose Amazon S3 as the foundation for our data lake because of its scalability, durability, and flexibility. You can seamlessly increase storage from gigabytes to petabytes, paying only for what you use. It’s designed to provide 11 9s of durability. It supports structured, semi-structured, and unstructured data, and has native integration with a broad portfolio of AWS services.
  3. AWS Glue is a fully managed data integration service. AWS Glue makes it easier to categorize, clean, transform, and reliably transfer data between different data stores.
  4. Amazon Athena is a serverless interactive query engine that makes it easy to analyze data directly in Amazon S3 using standard SQL. Athena scales automatically—running queries in parallel—so results are fast, even with large datasets, high concurrency, and complex queries.
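As a rough sketch of the consumption side (our illustration, not NerdWallet’s code), a consumer can submit SQL to Athena programmatically; the database name, table, and results bucket below are placeholders:

    import time
    import boto3

    athena = boto3.client("athena")

    def run_query(sql, database, output_s3):
        """Start an Athena query and poll until it reaches a terminal state."""
        qid = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_s3},
        )["QueryExecutionId"]
        while True:
            status = athena.get_query_execution(QueryExecutionId=qid)
            state = status["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                return qid, state
            time.sleep(1)

    print(run_query("SELECT COUNT(*) FROM transactions",
                    database="analytics_db",
                    output_s3="s3://example-athena-results/"))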

This architecture works fine with small testing datasets. However, the team quickly ran into complications with the production datasets at scale.

Challenges

The team encountered the following challenges:

  • Long batch processing time and complex transformation logic – A single run of the Spark batch job took 2–3 hours to complete, and we ended up getting a fairly large AWS bill when testing against billions of records. The core problem was that we had to reconstruct the latest state and rewrite the entire set of records per partition for every job run, even if the incremental changes were a single record of the partition. When we scaled that to thousands of unique transactions per second, we quickly saw the degradation in transformation performance.
  • Increased complexity with a large number of clients – This workload contained millions of clients, and one common query pattern was to filter by single client ID. There were numerous optimizations that we were forced to tack on, such as predicate pushdowns, tuning the Parquet file size, using a bucketed partition scheme, and more. As more data owners adopted this architecture, we would have to customize each of these optimizations for their data models and consumer query patterns.
  • Limited extensibility for real-time use cases – This batch extract, transform, and load (ETL) architecture wasn’t going to scale to handle thousands of record upserts per second with near-real-time freshness. In addition, it would be challenging for the data platform team to keep up with the diverse real-time analytical needs. Incremental queries, time-travel queries, improved latency, and so on would require heavy investment over a long period of time. Improving on this issue would open up possibilities like near-real-time ML inference and event-based alerting.

With all these limitations of the initial design, we decided to go all-in on a real incremental processing framework.

Solution

The following diagram illustrates our updated design. To support real-time use cases, we added Amazon Kinesis Data Streams, AWS Lambda, Amazon Kinesis Data Firehose and Amazon Simple Notification Service (Amazon SNS) into the architecture.

The updated components are as follows:

  1. Amazon Kinesis Data Streams is a serverless streaming data service that makes it easy to capture, process, and store data streams. We set up a Kinesis data stream as a target for AWS DMS. The data stream collects the CDC logs.
  2. We use a Lambda function to transform the CDC records. We apply schema validation and data enrichment at the record level in the Lambda function. The transformed results are published to a second Kinesis data stream for the data lake consumption and an Amazon SNS topic so that changes can be fanned out to various downstream systems.
  3. Downstream systems can subscribe to the Amazon SNS topic and take real-time actions (within seconds) based on the CDC logs. This can support use cases like anomaly detection and event-based alerting.
  4. To solve the problem of long batch processing time, we use the Apache Hudi file format to store the data and perform streaming ETL using AWS Glue streaming jobs. Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. Hudi allows you to build streaming data lakes with incremental data pipelines, with support for transactions, record-level updates, and deletes on data stored in data lakes. Hudi integrates well with various AWS analytics services such as AWS Glue, Amazon EMR, and Athena, which makes it a straightforward extension of our previous architecture. While Apache Hudi solves the record-level update and delete challenges, AWS Glue streaming jobs convert the long-running batch transformations into low-latency micro-batch transformations. We use the AWS Glue Connector for Apache Hudi to import the Apache Hudi dependencies in the AWS Glue streaming job and write transformed data to Amazon S3 continuously. Hudi does all the heavy lifting of record-level upserts, while we simply configure the writer and write the data out as a Hudi Copy-on-Write table (a simplified writer sketch follows this list). With Hudi on AWS Glue streaming jobs, we reduce the data freshness latency for our core datasets from hours to under 15 minutes.
  5. To solve the partition challenges for high cardinality UUIDs, we use the bucketing technique. Bucketing groups data based on specific columns together within a single partition. These columns are known as bucket keys. When you group related data together into a single bucket (a file within a partition), you significantly reduce the amount of data scanned by Athena, thereby improving query performance and reducing cost. Our existing queries are filtered on the user ID already, so we significantly improve the performance of our Athena usage without having to rewrite queries by using bucketed user IDs as the partition scheme. For example, the following code shows total spending per user in specific categories:
    SELECT ID, SUM(AMOUNT) SPENDING
    FROM "{{DATABASE}}"."{{TABLE}}"
    WHERE CATEGORY IN (
    'ENTERTAINMENT',
    'SOME_OTHER_CATEGORY')
    AND ID_BUCKET ='{{ID_BUCKET}}'
    GROUP BY ID;

  7. Our data scientist team can access the dataset and perform ML model training using Amazon SageMaker.
  8. We maintain a copy of the raw CDC logs in Amazon S3 via Amazon Kinesis Data Firehose.
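The following sketch illustrates the Hudi write path described in step 4. It is an illustration rather than NerdWallet’s actual job code: the table name, key fields, and S3 path are placeholders, and a real Glue streaming job would wire process_batch into the stream (for example, via GlueContext.forEachBatch):

    # Hudi writer options for a Copy-on-Write table with record-level upserts.
    hudi_options = {
        "hoodie.table.name": "user_transactions",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "transaction_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.partitionpath.field": "id_bucket",
    }

    def process_batch(batch_df, batch_id):
        """Apply one micro-batch of CDC records as an incremental Hudi upsert."""
        (batch_df.write.format("hudi")
            .options(**hudi_options)
            .mode("append")
            .save("s3://example-data-lake/user_transactions/"))

    # In the Glue streaming job, something like:
    # glueContext.forEachBatch(frame=cdc_stream, batch_function=process_batch,
    #                          options={"windowSize": "60 seconds"})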

Conclusion

In the end, we landed on a serverless stream processing architecture that can scale to thousands of writes per second within minutes of freshness on our data lakes. We’ve rolled out to our first high-volume team! At our current scale, the Hudi job is processing roughly 1.75 MiB per second per AWS Glue worker, which can automatically scale up and down (thanks to AWS Glue auto scaling). Compared with our first attempt, we’ve also observed an outstanding improvement in end-to-end freshness, now under 5 minutes, thanks to Hudi’s incremental upserts.

With Hudi on Amazon S3, we’ve built a high-leverage foundation to personalize our users’ experiences. Teams that own data can now share their data across the organization with reliability and performance characteristics built into a cookie-cutter solution. This enables our data consumers to build more sophisticated signals to provide clarity for all of life’s financial decisions.

We hope that this post will inspire your organization to build a real-time analytics platform using serverless technologies to accelerate your business goals.


About the authors

Kevin Chun is a Staff Software Engineer in Core Engineering at NerdWallet. He builds data infrastructure and tooling to help NerdWallet provide clarity for all of life’s financial decisions.

Dylan Qu is a Specialist Solutions Architect focused on big data and analytics with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

The mechanics of a sophisticated phishing scam and how we stopped it

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/2022-07-sms-phishing-attacks/

Yesterday, August 8, 2022, Twilio shared that they’d been compromised by a targeted phishing attack. Around the same time as Twilio was attacked, we saw an attack with very similar characteristics also targeting Cloudflare’s employees. While individual employees did fall for the phishing messages, we were able to thwart the attack through our own use of Cloudflare One products, and physical security keys issued to every employee that are required to access all our applications.

We have confirmed that no Cloudflare systems were compromised. Our Cloudforce One threat intelligence team was able to perform additional analysis to further dissect the mechanism of the attack and gather critical evidence to assist in tracking down the attacker.

This was a sophisticated attack targeting employees and systems in such a way that we believe most organizations would be likely to be breached. Given that the attacker is targeting multiple organizations, we wanted to share here a rundown of exactly what we saw in order to help other companies recognize and mitigate this attack.

Targeted Text Messages

On July 20, 2022, the Cloudflare Security team received reports of employees receiving legitimate-looking text messages pointing to what appeared to be a Cloudflare Okta login page. The messages began at 2022-07-20 22:50 UTC. Over the course of less than 1 minute, at least 76 employees received text messages on their personal and work phones. Some messages were also sent to employees’ family members. We have not yet been able to determine how the attacker assembled the list of employees’ phone numbers, but we have reviewed access logs to our employee directory services and found no sign of compromise.

Cloudflare runs a 24×7 Security Incident Response Team (SIRT). Every Cloudflare employee is trained to report anything that is suspicious to the SIRT. More than 90 percent of the reports to SIRT turn out to not be threats. Employees are encouraged to report anything and never discouraged from over-reporting. In this case, however, the reports to SIRT were a real threat.

The text messages received by employees looked like this:

[Image: the phishing text message received by employees]

They came from four phone numbers associated with T-Mobile-issued SIM cards: (754) 268-9387, (205) 946-7573, (754) 364-6683 and (561) 524-5989. They pointed to an official-looking domain: cloudflare-okta.com. That domain had been registered via Porkbun, a domain registrar, at 2022-07-20 22:13:04 UTC — less than 40 minutes before the phishing campaign began.

Cloudflare built our secure registrar product in part to be able to monitor when domains using the Cloudflare brand were registered and get them shut down. However, because this domain was registered so recently, it had not yet been published as a new .com registration, so our systems did not detect its registration and our team had not yet moved to terminate it.

If you clicked on the link it took you to a phishing page. The phishing page was hosted on DigitalOcean and looked like this:

[Image: the phishing page impersonating Cloudflare’s Okta login]

Cloudflare uses Okta as our identity provider. The phishing page was designed to look identical to a legitimate Okta login page. The phishing page prompted anyone who visited it for their username and password.

Real-Time Phishing

We were able to analyze the payload of the phishing attack based on what our employees received as well as its content being posted to services like VirusTotal by other companies that had been attacked. When the phishing page was completed by a victim, the credentials were immediately relayed to the attacker via the messaging service Telegram. This real-time relay was important because the phishing page would also prompt for a Time-based One Time Password (TOTP) code.

Presumably, the attacker would receive the credentials in real-time, enter them in a victim company’s actual login page, and, for many organizations that would generate a code sent to the employee via SMS or displayed on a password generator. The employee would then enter the TOTP code on the phishing site, and it too would be relayed to the attacker. The attacker could then, before the TOTP code expired, use it to access the company’s actual login page — defeating most two-factor authentication implementations.
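To see why a real-time relay is enough, here is a compact sketch of RFC 6238 TOTP generation using only the Python standard library (the base32 secret is a placeholder). Every code remains valid until the current 30-second time step rolls over, which is ample time for an automated relay:

    import base64
    import hmac
    import struct
    import time

    def totp(secret_b32, step=30, digits=6, at=None):
        """Generate an RFC 6238 TOTP code for the given time."""
        key = base64.b32decode(secret_b32, casefold=True)
        counter = int((time.time() if at is None else at) // step)
        mac = hmac.digest(key, struct.pack(">Q", counter), "sha1")
        offset = mac[-1] & 0x0F  # dynamic truncation (RFC 4226)
        code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10**digits
        return str(code).zfill(digits)

    # The same code verifies anywhere inside the 30-second window.
    t0 = (int(time.time()) // 30) * 30
    assert totp("JBSWY3DPEHPK3PXP", at=t0) == totp("JBSWY3DPEHPK3PXP", at=t0 + 29)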

Protected Even If Not Perfect

We confirmed that three Cloudflare employees fell for the phishing message and entered their credentials. However, Cloudflare does not use TOTP codes. Instead, every employee at the company is issued a FIDO2-compliant security key from a vendor like YubiKey. Since the hard keys are tied to users and implement origin binding, even a sophisticated, real-time phishing operation like this cannot gather the information necessary to log in to any of our systems. While the attacker attempted to log in to our systems with the compromised username and password credentials, they could not get past the hard key requirement.
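A simplified sketch of that origin binding is below. The expected origin is a placeholder, and a production WebAuthn verifier also checks the challenge, RP ID hash, and signature; the point here is only that the browser, not the user, writes the page origin into the signed client data, so credentials phished on a lookalike domain can never verify:

    import json

    EXPECTED_ORIGIN = "https://example.okta.com"  # placeholder for the real IdP origin

    def origin_ok(client_data_json):
        """Check the origin embedded in WebAuthn clientDataJSON."""
        data = json.loads(client_data_json)
        return data.get("type") == "webauthn.get" and data.get("origin") == EXPECTED_ORIGIN

    # An assertion produced on the phishing page carries the phishing origin
    # and is rejected, regardless of what the victim typed or approved.
    phished = json.dumps({"type": "webauthn.get",
                          "origin": "https://cloudflare-okta.com",
                          "challenge": "..."}).encode()
    assert not origin_ok(phished)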

But this phishing page was not simply after credentials and TOTP codes. If someone made it past those steps, the phishing page then initiated the download of a phishing payload which included AnyDesk’s remote access software. That software, if installed, would allow an attacker to control the victim’s machine remotely. We confirmed that none of our team members got to this step. If they had, however, our endpoint security would have stopped the installation of the remote access software.

How Did We Respond?

The main response actions we took for this incident were:

1. Block the phishing domain using Cloudflare Gateway

Cloudflare Gateway is a Secure Web Gateway solution providing threat and data protection with DNS / HTTP filtering and natively-integrated Zero Trust. We use this solution internally to proactively identify malicious domains and block them. Our team added the malicious domain to Cloudflare Gateway to block all employees from accessing it.

Gateway’s automatic detection of malicious domains also identified the domain and blocked it, but the fact that it was registered and messages were sent within such a short interval of time meant that the system hadn’t automatically taken action before some employees had clicked on the links. Given this incident we are working to speed up how quickly malicious domains are identified and blocked. We’re also implementing controls on access to newly registered domains which we offer to customers but had not implemented ourselves.

2. Identify all impacted Cloudflare employees and reset compromised credentials

We were able to compare recipients of the phishing texts to login activity and identify threat-actor attempts to authenticate to our employee accounts. We identified login attempts blocked due to the hard key (U2F) requirements, indicating that the correct password was used but the second factor could not be verified. For the three employees whose credentials were leaked, we reset their credentials and any active sessions and initiated scans of their devices.

3. Identify and take down threat-actor infrastructure

The threat actor’s phishing domain was newly registered via Porkbun, and hosted on DigitalOcean. The phishing domain used to target Cloudflare was set up less than an hour before the initial phishing wave. The site had a Nuxt.js frontend, and a Django backend. We worked with DigitalOcean to shut down the attacker’s server. We also worked with Porkbun to seize control of the malicious domain.

From the failed sign-in attempts we were able to determine that the threat actor was leveraging Mullvad VPN software and distinctively using the Google Chrome browser on a Windows 10 machine. The VPN IP addresses used by the attacker were 198.54.132.88 and 198.54.135.222. Those IPs are assigned to Tzulo, a US-based dedicated server provider whose website claims they have servers located in Los Angeles and Chicago. It appears that the first was actually running on a server in the Toronto area and the latter on a server in the Washington, DC area. We blocked these IPs from accessing any of our services.

4. Update detections to identify any subsequent attack attempts

With what we were able to uncover about this attack, we incorporated additional signals to our already existing detections to specifically identify this threat-actor. At the time of writing we have not observed any additional waves targeting our employees. However, intelligence from the server indicated the attacker was targeting other organizations, including Twilio. We reached out to these other organizations and shared intelligence on the attack.

5. Audit service access logs for any additional indications of attack

Following the attack, we screened all our system logs for any additional fingerprints from this particular attacker. Given Cloudflare Access serves as the central control point for all Cloudflare applications, we can search the logs for any indication the attacker may have breached any systems. Given employees’ phones were targeted, we also carefully reviewed the logs of our employee directory providers. We did not find any evidence of compromise.

Lessons Learned and Additional Steps We’re Taking

We learn from every attack. Even though the attacker was not successful, we are making additional adjustments from what we’ve learned. We’re adjusting the settings for Cloudflare Gateway to restrict or sandbox access to sites running on domains that were registered within the last 24 hours. We will also run any non-whitelisted sites containing terms such as “cloudflare”, “okta”, “sso”, and “2fa” through our browser isolation technology. We are also increasingly using Area 1’s phish-identification technology to scan the web and look for any pages that are designed to target Cloudflare. Finally, we’re tightening up our Access implementation to prevent any logins from unknown VPNs, residential proxies, and infrastructure providers. All of these are standard features of the same products we offer to customers.

The attack also reinforced the importance of three things we’re doing well. First, requiring hard keys for access to all applications. Like Google, we have not seen any successful phishing attacks since rolling hard keys out. Tools like Cloudflare Access made it easy to support hard keys even across legacy applications. If you’re an organization interested in how we rolled out hard keys, reach out to [email protected] and our security team would be happy to share the best practices we learned through this process.

Second, using Cloudflare’s own technology to protect our employees and systems. Cloudflare One’s solutions like Access and Gateway were critical to staying ahead of this attack. We configured our Access implementation to require hard keys for every application. It also creates a central logging location for all application authentications. And, if ever necessary, a place from which we can kill the sessions of a potentially compromised employee. Gateway gives us the ability to shut down malicious sites like this one quickly and understand what employees may have fallen for the attack. These are all functionalities that we make available to Cloudflare customers as part of our Cloudflare One suite, and this attack demonstrates how effective they can be.

Third, having a paranoid but blame-free culture is critical for security. The three employees who fell for the phishing scam were not reprimanded. We’re all human and we make mistakes. It’s critically important that when we do, we report them and don’t cover them up. This incident provided another example of why security is part of every team member at Cloudflare’s job.

Detailed Timeline of Events

2022-07-20 22:49 UTC Attacker sends out 100+ SMS messages to Cloudflare employees and their families.
2022-07-20 22:50 UTC Employees begin reporting SMS messages to Cloudflare Security team.
2022-07-20 22:52 UTC Verify that the attacker’s domain is blocked in Cloudflare Gateway for corporate devices.
2022-07-20 22:58 UTC Warning communication sent to all employees across chat and email.
2022-07-20 22:50 to 23:26 UTC Monitor telemetry in the Okta System log & Cloudflare Gateway HTTP logs to locate credential compromise. Clear login sessions and suspend accounts on discovery.
2022-07-20 23:26 UTC Phishing site is taken down by the hosting provider.
2022-07-20 23:37 UTC Reset leaked employee credentials.
2022-07-21 00:15 UTC Deep dive into attacker infrastructure and capabilities.

Indicators of compromise

Value Type Context and MITRE Mapping
cloudflare-okta[.]com hosted on 147[.]182[.]132[.]52 Phishing URL T1566.002: Phishing: Spear Phishing Link sent to users.
64547b7a4a9de8af79ff0eefadde2aed10c17f9d8f9a2465c0110c848d85317a SHA-256 T1219: Remote Access Software being distributed by the threat actor

What You Can Do

If you are seeing similar attacks in your environment, please don’t hesitate to reach out to [email protected]; we’re happy to share best practices on how to keep your business secure. Finally, do you want to work on detecting and mitigating the next attacks with us? We’re hiring on our Detection and Response team, so come join us!

Building AWS Lambda governance and guardrails

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-aws-lambda-governance-and-guardrails/

When building serverless applications using AWS Lambda, there are a number of considerations regarding security, governance, and compliance. This post highlights how Lambda, as a serverless service, simplifies cloud security and compliance so you can concentrate on your business logic. It covers controls that you can implement for your Lambda workloads to ensure that your applications conform to your organizational requirements.

The Shared Responsibility Model

The AWS Shared Responsibility Model distinguishes between what AWS is responsible for and what customers are responsible for with cloud workloads. AWS is responsible for “Security of the Cloud” where AWS protects the infrastructure that runs all the services offered in the AWS Cloud. Customers are responsible for “Security in the Cloud”, managing and securing their workloads. When building traditional applications, you take on responsibility for many infrastructure services, including operating systems and network configuration.

Traditional application shared responsibility

One major benefit when building serverless applications is shifting more responsibility to AWS so you can concentrate on your business applications. AWS handles managing and patching the underlying servers, operating systems, and networking as part of running the services.

Serverless application shared responsibility

For Lambda, AWS manages the application platform where your code runs, which includes patching and updating the managed language runtimes. This reduces the attack surface while making cloud security simpler. You are responsible for the security of your code and AWS Identity and Access Management (IAM) to the Lambda service and within your function.

Lambda is SOC, HIPAA, PCI, and ISO-compliant. For more information, see Compliance validation for AWS Lambda and the latest Lambda certification and compliance readiness services in scope.

Lambda isolation

Lambda functions run in separate isolated AWS accounts that are dedicated to the Lambda service. Lambda invokes your code in a secure and isolated runtime environment within the Lambda service account. A runtime environment is a collection of resources running in a dedicated hardware-virtualized micro virtual machine (MVM) on a Lambda worker node.

Lambda workers are bare metal EC2 Nitro instances, which are managed and patched by the Lambda service team. They have a maximum lease lifetime of 14 hours to keep the underlying infrastructure secure and fresh. MVMs are created by Firecracker, an open source virtual machine monitor (VMM) that uses Linux’s Kernel-based Virtual Machine (KVM) to create and manage MVMs securely at scale.

MVMs maintain a strong separation between runtime environments at the virtual machine hardware level, which increases security. Runtime environments are never reused across functions, function versions, or AWS accounts.

Isolation model for AWS Lambda workers

Network security

Lambda functions always run inside secure Amazon Virtual Private Clouds (VPCs) owned by the Lambda service. This gives the Lambda function access to AWS services and the public internet. There is no direct network inbound access to Lambda workers, runtime environments, or Lambda functions. All inbound access to a Lambda function only comes via the Lambda Invoke API, which sends the event object to the function handler.

You can configure a Lambda function to connect to private subnets in a VPC in your account if necessary, which you can control with IAM condition keys. The Lambda function still runs inside the Lambda service VPC but sends all network traffic through your VPC. Function outbound traffic comes from your own network address space.

AWS Lambda service VPC with VPC-to-VPC NAT to customer VPC

To give your VPC-connected function access to the internet, route outbound traffic to a NAT gateway in a public subnet. Connecting a function to a public subnet doesn’t give it internet access or a public IP address, as the function is still running in the Lambda service VPC and then routing network traffic into your VPC.

All internal AWS traffic uses the AWS Global Backbone rather than traversing the internet. You do not need to connect your functions to a VPC to avoid connectivity to AWS services over the internet. VPC connected functions allow you to control and audit outbound network access.

You can use security groups to control outbound traffic for VPC-connected functions and network ACLs to block access to CIDR IP ranges or ports. VPC endpoints allow you to enable private communications with supported AWS services without internet access.

You can use VPC Flow Logs to audit traffic going to and from network interfaces in your VPC.

Runtime environment re-use

Each runtime environment processes a single request at a time. After Lambda finishes processing the request, the runtime environment is ready to process an additional request for the same function version. For more information on how Lambda manages runtime environments, see Understanding AWS Lambda scaling and throughput.

Data can persist in the local temporary filesystem path, in globally scoped variables, and in environment variables across subsequent invocations of the same function version. Ensure that you only handle sensitive information within individual invocations of the function by processing it in the function handler, or using local variables. Do not re-use files in the local temporary filesystem to process unencrypted sensitive data. Do not put sensitive or confidential information into Lambda environment variables, tags, or other freeform fields such as Name fields.
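A short sketch of this behavior in a Python function (names are illustrative): module-level objects survive between invocations of a reused runtime environment, which is ideal for caching clients but not for secrets:

    import boto3

    # Safe to cache at module scope: reused across invocations of the same
    # function version, saving initialization time on warm starts.
    s3 = boto3.client("s3")

    def handler(event, context):
        # Keep sensitive values in handler-local scope so they do not
        # outlive this invocation via globals or temp files.
        account_number = event["account_number"]
        s3.put_object(Bucket="example-bucket", Key="receipt.txt",
                      Body=render_receipt(account_number))
        return {"ok": True}

    def render_receipt(account_number):
        return f"receipt for ...{account_number[-4:]}".encode()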

For more Lambda security information, see the Lambda security whitepaper.

Multiple accounts

AWS recommends using multiple accounts to isolate your resources because they provide natural boundaries for security, access, and billing. Use AWS Organizations to manage and govern individual member accounts centrally. You can use AWS Control Tower to automate many of the account build steps and apply managed guardrails to govern your environment. These include preventative guardrails to limit actions and detective guardrails to detect and alert on non-compliance resources for remediation.

Lambda access controls

Lambda permissions define what a Lambda function can do, and who or what can invoke the function. Consider the following areas when applying access controls to your Lambda functions to ensure least privilege:

Execution role

Lambda functions have permission to access other AWS resources using execution roles. This is an AWS principal that the Lambda service assumes, which grants permissions using identity policy statements assigned to the role. The Lambda service uses this role to fetch and cache temporary security credentials, which are then available as environment variables during a function’s invocation. It may re-use them across different runtime environments that use the same execution role.

Ensure that each function has its own unique role with the minimum set of permissions.
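A hypothetical boto3 sketch of such a role (the role name, table ARN, and account ID are placeholders; grant only what the function’s code actually needs):

    import json
    import boto3

    iam = boto3.client("iam")

    trust = {  # allows only the Lambda service to assume this role
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow",
                       "Principal": {"Service": "lambda.amazonaws.com"},
                       "Action": "sts:AssumeRole"}],
    }
    policy = {  # read-only access to a single table
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow",
                       "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                       "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders"}],
    }

    iam.create_role(RoleName="orders-reader-fn-role",
                    AssumeRolePolicyDocument=json.dumps(trust))
    iam.put_role_policy(RoleName="orders-reader-fn-role",
                        PolicyName="orders-read-only",
                        PolicyDocument=json.dumps(policy))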

Identity/user policies

IAM identity policies are attached to IAM users, groups, or roles. These policies allow users or callers to perform operations on Lambda functions. You can restrict who can create functions, or control what functions particular users can manage.

Resource policies

Resource policies define what identities have fine-grained inbound access to managed services. For example, you can restrict which Lambda function versions can add events to a specific Amazon EventBridge event bus. You can use resource-based policies on Lambda resources to control what AWS IAM identities and event sources can invoke a specific version or alias of your function. You also use a resource-based policy to allow an AWS service to invoke your function on your behalf. To see which services support resource-based policies, see “AWS services that work with IAM”.
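For instance, here is a hedged boto3 sketch granting a single event source permission to invoke one alias of a function (names, ARNs, and the account ID are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.add_permission(
        FunctionName="process-uploads",
        Qualifier="prod",                          # restrict to one alias
        StatementId="allow-s3-invoke",
        Action="lambda:InvokeFunction",
        Principal="s3.amazonaws.com",
        SourceArn="arn:aws:s3:::example-upload-bucket",
        SourceAccount="123456789012",              # guards against confused-deputy access
    )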

Attribute-based access control (ABAC)

With attribute-based access control (ABAC), you can use tags to control access to your Lambda functions. With ABAC, you can scale an access control strategy by setting granular permissions with tags without requiring permissions updates for every new user or resource as your organization scales. You can also use tag policies with AWS Organizations to standardize tags across resources.

Permissions boundaries

Permissions boundaries are a way to delegate permission management safely. The boundary places a limit on the maximum permissions that a policy can grant. For example, you can use boundary permissions to limit the scope of the execution role to allow only read access to databases. A builder with permission to manage a function or with write access to the application’s code repository cannot escalate the permissions beyond the boundary to allow write access.

Service control policies

When using AWS Organizations, you can use Service control policies (SCPs) to manage permissions in your organization. These provide guardrails for what actions IAM users and roles within the organization root or OUs can do. For more information, see the AWS Organizations documentation, which includes example service control policies.

Code signing

As you are responsible for the code that runs in your Lambda functions, you can ensure that only trusted code runs by using code signing with the AWS Signer service. AWS Signer digitally signs your code packages and Lambda validates the code package before accepting the deployment, which can be part of your automated software deployment process.
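A hypothetical boto3 sketch of enforcing this (the signing profile version ARN and function name are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")

    # Only artifacts signed by this profile may be deployed; unsigned or
    # invalidly signed packages are rejected outright.
    csc = lambda_client.create_code_signing_config(
        AllowedPublishers={"SigningProfileVersionArns": [
            "arn:aws:signer:us-east-1:123456789012:/signing-profiles/release/abc123",
        ]},
        CodeSigningPolicies={"UntrustedArtifactOnDeployment": "Enforce"},
    )

    lambda_client.put_function_code_signing_config(
        CodeSigningConfigArn=csc["CodeSigningConfig"]["CodeSigningConfigArn"],
        FunctionName="process-uploads",
    )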

Auditing Lambda configuration, permissions and access

You should audit access and permissions regularly to ensure that your workloads are secure. Use the IAM console to view when an IAM role was last used.

IAM last used

IAM access advisor

Use IAM access advisor on the Access Advisor tab in the IAM console to review when an AWS service was last used by a specific IAM user or role. You can use this to remove unused IAM policies and access from your IAM roles.

AWS CloudTrail

AWS CloudTrail helps you monitor, log, and retain account activity to provide a complete event history of actions across your AWS infrastructure. You can monitor Lambda API actions to ensure that only appropriate actions are made against your Lambda functions. These include CreateFunction, DeleteFunction, CreateEventSourceMapping, AddPermission, UpdateEventSourceMapping, UpdateFunctionConfiguration, and UpdateFunctionCode.

IAM Access Analyzer

You can validate policies using IAM Access Analyzer, which provides over 100 policy checks with security warnings for overly permissive policies. To learn more about policy checks provided by IAM Access Analyzer, see “IAM Access Analyzer policy validation”.

You can also generate IAM policies based on access activity from CloudTrail logs, which contain the permissions that the role used in your specified date range.
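As a brief sketch, the ValidatePolicy API can be called from boto3 to lint a policy before it is attached (the document below is deliberately over-broad so that it triggers findings):

    import json
    import boto3

    analyzer = boto3.client("accessanalyzer")

    too_broad = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
    }

    resp = analyzer.validate_policy(
        policyDocument=json.dumps(too_broad),
        policyType="IDENTITY_POLICY",
    )
    for finding in resp["findings"]:
        print(finding["findingType"], "-", finding["issueCode"])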

AWS Config

AWS Config provides you with a record of the configuration history of your AWS resources. AWS Config monitors the resource configuration and includes rules to alert when they fall into a non-compliant state.

For Lambda, you can track and alert on changes to your function configuration, along with the IAM execution role. This allows you to gather Lambda function lifecycle data for potential audit and compliance requirements. For more information, see the Lambda Operators Guide.

AWS Config includes Lambda managed config rules such as lambda-concurrency-check, lambda-dlq-check, lambda-function-public-access-prohibited, lambda-function-settings-check, and lambda-inside-vpc. You can also write your own rules.

There are a number of other AWS services to help with security compliance.

  1. AWS Audit Manager: Collect evidence to help you audit your use of cloud services.
  2. Amazon GuardDuty: Detect unexpected and potentially unauthorized activity in your AWS environment.
  3. Amazon Macie: Evaluate your content to identify business-critical or potentially confidential data.
  4. AWS Trusted Advisor: Identify opportunities to improve stability, save money, or help close security gaps.
  5. AWS Security Hub: Provide security checks and recommendations across your organization.

Conclusion

Lambda makes cloud security simpler by taking on more responsibility using the AWS Shared Responsibility Model. Lambda implements strict workload security at scale to isolate your code and prevent network intrusion to your functions. This post provides guidance on assessing and implementing best practices and tools for Lambda to improve your security, governance, and compliance controls. These include permissions, access controls, multiple accounts, and code security. Learn how to audit your function permissions, configuration, and access to ensure that your applications conform to your organizational requirements.

For more serverless learning resources, visit Serverless Land.

6 Reasons Managed Detection and Response Is Hitting Its Stride

Post Syndicated from Mikayla Wyman original https://blog.rapid7.com/2022/08/09/6-reasons-managed-detection-and-response-is-hitting-its-stride/

Cyber threats have risen to the #1 concern of CEOs, which means security teams — in the hot seat for years — are really feeling it now. Files and data live in the cloud. Work is hybrid or remote. There’s turmoil around the world. Cyberattacks are not just a distant boogieman – they’re here and happening every day.

As companies try to make sure their existing security infrastructure can keep up, they confront the skills gap, a 0% industry unemployment rate, and no room for mistakes. Managed Detection and Response (MDR) is having a moment.

According to a recent ESG study, MDR is one of the fastest growing areas of cybersecurity today. A whopping 85% of surveyed organizations currently use or plan to use managed services for their security operations. And 88% say they will increase their use of managed services in the next 1-2 years.

What’s driving this move to MDR? Let’s take a look at six main factors.

1. Focus

Augmenting an internal security team means internal security personnel can focus on more strategic security initiatives rather than day-to-day operational tasks. In fact, 55% of surveyed organizations want to focus their internal security teams on more strategic initiatives rather than spend time on daily basics, the ESG study found.

When you partner with an MDR provider, alert triage and investigations are generally taken care of by the external team. Of course, your organization still has some things you’ll need to do – partnership is the name of the game. But by working with an MDR service, security teams suddenly have more time and bandwidth to work strategically.

2. Services

ESG reports that 52% of companies surveyed believe managed service providers can do a better job with security operations than they can.

What you would once have to train your detection and response team to do, MDR providers take over. That means they’re able to detect active attackers within your environment, contain threats, analyze incidents and provide recommendations for remediation, and apply learnings from other environments they manage to make sure you’re protected from the latest attacker behaviors. Finally, good MDR providers are able to pivot into breach response if an attacker is live within your network.

To learn more about how to evaluate MDR providers on eight core capabilities, read the MDR Buyers Guide here.

3. Augmentation

About half of organizations (49%) believe a service provider can augment their security operations center (SOC) team with additional support.

Most companies that are able to build internal SOCs are generally well-funded, can afford roughly 10-12 full-time personnel, have a large array of security tools at their disposal, and have extensive processes already outlined. Sound doable? Great! If not, augmentation by way of an MDR provider is your tall glass of water.

Sign on with an MDR provider, get deployed, and your team is instantly extended. Benefits include time savings, cost savings, and experience level that most companies can’t afford to hire at scale.

4. Skills

No surprise, 42% of surveyed organizations in the ESG study believe they don’t have adequate skills for security operations in-house.

MDR is more than outsourced 24x7x365 monitoring. It's a partnership that helps you move toward a more secure posture with guidance and expertise.

This type of partnership allows teams to contextualize metrics and reports, get a better understanding of investigations that take place within their environment, and have someone to walk through processes should an attack take place. You also have an expert in your corner during CISO, board, or executive meetings.

5. Price

40% of surveyed organizations did a cost analysis and found that it would cost less to use a service provider than to do it themselves.

We won’t sugar-coat it – partnering with an MDR service provider is expensive. But so is building out an internal team that can actually monitor and investigate within an organization’s environment round the clock.

The cost of partnering with an MDR provider pales in comparison to the cost of employing 10-12 security personnel that operate an around-the-clock SOC, and it can offer ROI much more quickly.

Check out this recent Forrester study to learn more about cost-saving outcomes of partnering with Rapid7’s MDR team.

6. Staff

Finally, ESG tells us that 35% of surveyed organizations don’t have an adequately sized staff for security operations.

Even with unlimited budget to hire a full team, it would be an incredibly labor-intensive and time-consuming process. It would be nearly impossible for most organizations to accomplish. Not only is finding qualified candidates and hiring a huge pain point, but the resources needed to onboard and train staff often aren’t there.

Of course, not all MDR services are the same

Keep these three things in mind:

  • Forrester found Rapid7 MDR reduced breaches by 90%
  • Forrester found Rapid7 MDR delivered 549% ROI
  • In the event of a breach, Rapid7 MDR pivots to full-on digital forensics and incident response, no delay, no limits

Check out our full MDR Buyer’s Guide for 2022 here.


Introducing new Cloudflare for SaaS documentation

Post Syndicated from Mia Malden original https://blog.cloudflare.com/introducing-new-cloudflare-for-saas-documentation/

Introducing new Cloudflare for SaaS documentation


As a SaaS provider, you're juggling many challenges while building your application, whether it's custom domain support, protection from attacks, or maintaining an origin server. In 2021, we were proud to announce Cloudflare for SaaS for Everyone, which allows anyone to use Cloudflare to cover those challenges, so they can focus on other aspects of their business. This product has a variety of potential implementations; now, we are excited to announce a new section in our Developer Docs specifically devoted to Cloudflare for SaaS documentation to allow you to take full advantage of its product suite.

Cloudflare for SaaS solution

You may remember, from our October 2021 blog post, all the ways that Cloudflare provides solutions for SaaS providers:

  • Set up an origin server
  • Encrypt your customers’ traffic
  • Keep your customers online
  • Boost the performance of global customers
  • Support custom domains
  • Protect against attacks and bots
  • Scale for growth
  • Provide insights and analytics

However, we received feedback from customers indicating confusion around actually using the capabilities of Cloudflare for SaaS because there are so many features! With the existing documentation, it wasn’t 100% clear how to enhance security and performance, or how to support custom domains. Now, we want to show customers how to use Cloudflare for SaaS to its full potential by including more product integrations in the docs, as opposed to only focusing on the SSL/TLS piece.

Bridging the gap

Cloudflare for SaaS can be overwhelming with so many possible add-ons and configurations. That's why the new docs are organized into six main categories, housing a number of new, detailed guides (for example, WAF for SaaS and Regional Services for SaaS).


Once you get your SaaS application up and running with the Get Started page, you can find which configurations are best suited to your needs based on your priorities as a provider. Even if you aren't sure what your goals are, this setup outlines the possibilities much more clearly through a number of new documents and product guides.

Instead of pondering over vague subsection titles, you can peruse with purpose in mind. The advantages and possibilities of Cloudflare for SaaS are highlighted instead of hidden.

Possible configurations

This setup makes it much easier to find and apply the configurations that meet your goals as a SaaS provider.

For example, consider performance. Previously, there was no documentation surrounding reduced latency for SaaS providers. Now, the Performance section explains the automatic performance benefits of onboarding with Cloudflare for SaaS, and it offers three options for reducing latency even further through brand-new docs.

Similarly, the new organization offers WAF for SaaS as a previously hidden security solution, extending providers the ability to enable automatic protection from vulnerabilities and the flexibility to create custom rules. This is conveniently accompanied by a step-by-step tutorial using Cloudflare Managed Rulesets.

What’s next

While this transition represents an improvement in the Cloudflare for SaaS docs, we’re going to expand its accessibility even more. Some tutorials, such as our Managed Ruleset Tutorial, are already live within the tile. However, more step-by-step guides for Cloudflare for SaaS products and add-ons will further enable our customers to take full advantage of the available product suite. In particular, keep an eye out for expanding documentation around using Workers for Platforms.

Check it out

Visit the new Cloudflare for SaaS tile to see the updates. If you are a SaaS provider interested in extending Cloudflare benefits to your customers through Cloudflare for SaaS, visit our Cloudflare for SaaS overview and our Plans page.

Introducing AWS Glue Flex jobs: Cost savings on ETL workloads

Post Syndicated from Aniket Jiddigoudar original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-flex-jobs-cost-savings-on-etl-workloads/

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Typically, these data integration jobs can have varying degrees of priority and time sensitivity. For example, non-urgent workloads such as pre-production, testing, and one-time data loads often don’t require fast job startup times or consistent runtimes via dedicated resources.

Today, we are pleased to announce the general availability of a new AWS Glue job run class called Flex. Flex allows you to optimize your costs on your non-urgent or non-time sensitive data integration workloads such as pre-production jobs, testing, and one-time data loads. With Flex, AWS Glue jobs run on spare compute capacity instead of dedicated hardware. The start and runtimes of jobs using Flex can vary because spare compute resources aren't readily available and can be reclaimed during the run of a job.

Regardless of the run option used, AWS Glue jobs have the same capabilities, including access to custom connectors, a visual authoring interface, job scheduling, and Glue Auto Scaling. With the Flex execution option, customers can optimize the costs of their data integration workloads by configuring the execution option based on the workloads' requirements, using the standard execution class for time-sensitive workloads and Flex for non-urgent workloads. The Flex execution class is available for AWS Glue 3.0 Spark jobs.


In this post, we provide more details about AWS Glue Flex jobs and how to enable Flex capacity.

How do you use Flexible capacity?

The AWS Glue jobs API now supports an additional parameter called execution-class, which lets you choose STANDARD or FLEX when running the job. To use Flex, you simply set the parameter to FLEX.

To enable Flex via the AWS Glue Studio console, complete the following steps:

  1. On the AWS Glue Studio console, while authoring a job, navigate to the Job details tab
  2. Select Flex Execution.
  3. Set an appropriate value for the Job Timeout parameter (defaults to 120 minutes for Flex jobs).
  4. Save the job.
  5. After finalizing all other details, choose Run to run the job with Flex capacity.

On the Runs tab, you should be able to see FLEX listed under Execution class.

You can also enable Flex via the AWS Command Line Interface (AWS CLI).

You can set the --execution-class setting in the start-job-run API, which lets you run a particular AWS Glue job’s run with Flex capacity:

aws glue start-job-run --job-name my-job \
    --execution-class FLEX \
    --timeout 300

You can also set the --execution-class during the create-job API. This sets the default run class of all the runs of this job to FLEX:

aws glue create-job \
    --name flexCLI \
    --role AWSGlueServiceRoleDefault \
    --command "Name=glueetl,ScriptLocation=s3://mybucket/myfolder/" \
    --region us-east-2 \
    --execution-class FLEX \
    --worker-type G.1X \
    --number-of-workers 10 \
    --glue-version 3.0

The following are additional details about the relevant parameters:

  • --execution-class – The enum string that specifies whether a job run uses FLEX or STANDARD capacity. The default is STANDARD.
  • --timeout – Specifies the time (in minutes) the job will run before it’s moved into a TIMEOUT state.
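
If you manage job runs from code rather than the AWS CLI, the same settings are exposed through the AWS SDKs. The following is a minimal sketch using boto3 in Python; the job name is a placeholder, and the job is assumed to already exist:

import boto3

glue = boto3.client("glue")

# Start a run of an existing job on spare capacity.
# ExecutionClass mirrors the CLI's --execution-class flag.
response = glue.start_job_run(
    JobName="my-job",       # placeholder: an existing AWS Glue 3.0 Spark job
    ExecutionClass="FLEX",
    Timeout=300,            # minutes before the run moves to a TIMEOUT state
)
print(response["JobRunId"])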

When should you use Flexible capacity?

The Flex execution class is ideal for reducing the costs of time-insensitive workloads. For example:

  • Nightly ETL jobs, or jobs that run over weekends for processing workloads
  • One-time bulk data ingestion jobs
  • Jobs running in test environments or pre-production workloads
  • Time-insensitive workloads where it’s acceptable to have variable start and end times

In comparison, the standard execution class is ideal for time-sensitive workloads that require fast job startup and dedicated resources. In addition, jobs that have downstream dependencies are better served by the standard execution class.

What is the typical life-cycle of a Flexible capacity Job?

When a start-job-run API call is issued, with the execution-class set to FLEX, AWS Glue will begin to request compute resources. If no resources are available immediately upon issuing the API call, the job will move into a WAITING state. No billing occurs at this point.

As soon as the job is able to acquire compute resources, the job moves to a RUNNING state. At this point, even if all of the requested compute capacity isn’t available, the job begins running on whatever hardware is present. As more Flex capacity becomes available, AWS Glue adds it to the job, up to a maximum value specified by Number of workers.

At this point, billing begins. You’re charged only for the compute resources that are running at any given time, and only for the duration that they ran for.

While the job is running, if Flex capacity is reclaimed, AWS Glue continues running the job on the existing compute resources while it tries to meet the shortfall by requesting more resources. If capacity is reclaimed, billing for that capacity is halted as well. Billing for new capacity will start when it is provisioned again. If the job completes successfully, the job’s state moves to SUCCEEDED. If the job fails due to various user or system errors, the job’s state transitions to FAILED. If the job is unable to complete before the time specified by the --timeout parameter, whether due to a lack of compute capacity or due to issues with the AWS Glue job script, the job goes into a TIMEOUT state.

Flexible job runs rely on the availability of non-dedicated compute capacity in AWS, which in turn depends on several factors, such as the Region and Availability Zone, time of day, day of the week, and the number of DPUs required by a job.

A parameter of particular importance for Flex Jobs is the --timeout value. It’s possible for Flex jobs to take longer to run than standard jobs, especially if capacity is reclaimed while the job is running. As a result, selecting the right timeout value that’s appropriate for your workload is critical. Choose a timeout value such that the total cost of the Flex job run doesn’t exceed a standard job run. If the value is set too high, the job can wait for too long, trying to acquire capacity that isn’t available. If the value is set too low, the job times out, even if capacity is available and the job execution is proceeding correctly.

How are Flex capacity jobs billed?

Flex jobs are billed per worker at the Flex DPU-hour rates. This means that you’re billed only for the capacity that actually ran during the execution of the job, for the duration that it ran.

For example, if you ran an AWS Glue Flex job for 10 workers, and AWS Glue was only able to acquire 5 workers, you’re only billed for five workers, and only for the duration that those workers ran. If, during the job run, two out of those five workers are reclaimed, then billing for those two workers is stopped, while billing for the remaining three workers continues. If provisioning for the two reclaimed workers is successful during the job run, billing for those two will start again.
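
To make that arithmetic concrete, here is a small illustrative sketch in Python. The per-DPU-hour rate and the worker timeline are assumptions for illustration only (check the AWS Glue pricing page for current Flex rates), and each G.1X worker is assumed to count as 1 DPU:

# Assumed Flex rate per DPU-hour, for illustration; verify against AWS Glue pricing.
FLEX_RATE_PER_DPU_HOUR = 0.29
DPU_PER_WORKER = 1  # G.1X workers

# (workers, hours actually run) segments for a hypothetical one-hour run:
# 3 workers run the whole hour, 2 workers are reclaimed after 30 minutes,
# and the same 2 are re-provisioned for the final 15 minutes.
segments = [(3, 1.0), (2, 0.5), (2, 0.25)]

cost = sum(workers * hours * DPU_PER_WORKER * FLEX_RATE_PER_DPU_HOUR
           for workers, hours in segments)
print(f"Estimated cost for the run: ${cost:.3f}")  # only capacity that ran is billed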

For more information on Flex pricing, refer to AWS Glue pricing.

Conclusion

This post discusses the new AWS Glue Flex job execution class, which allows you to optimize costs for non-time-sensitive ETL workloads and test environments.

You can start using Flex capacity for your existing and new workloads today. However, note that the Flex class is not supported for Python Shell jobs, AWS Glue streaming jobs, or AWS Glue ML jobs.

For more information on AWS Glue Flex jobs, refer to their latest documentation.

Special thanks to everyone who contributed to the launch: Parag Shah, Sampath Shreekantha, Yinzhi Xi, and Jessica Cheng.


About the authors

Aniket Jiddigoudar is a Big Data Architect on the AWS Glue team.

Vaibhav Porwal is a Senior Software Development Engineer on the AWS Glue team.

Sriram Ramarathnam is a Software Development Manager on the AWS Glue team.

AWS Week in Review – August 8, 2022

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/aws-week-in-review-august-8-2022/

As an ex-.NET developer, and now Developer Advocate for .NET at AWS, I’m excited to bring you this week’s Week in Review post, for reasons that will quickly become apparent! There are several updates, customer stories, and events I want to bring to your attention, so let’s dive straight in!

Last Week’s launches
.NET developers, here are two new updates to be aware of—and be sure to check out the events section below for another big announcement:

Tiered pricing for AWS Lambda will interest customers running large workloads on Lambda. The tiers, based on compute duration (measured in GB-seconds), help you save on monthly costs—automatically. Find out more about the new tiers, and see some worked examples showing just how they can help reduce costs, in this AWS Compute Blog post by Heeki Park, a Principal Solutions Architect for Serverless.

Amazon Relational Database Service (RDS) released updates for several popular database engines:

  • RDS for Oracle now supports the April 2022 patch.
  • RDS for PostgreSQL now supports new minor versions. Besides the version upgrades, there are also updates for the PostgreSQL extensions pglogical, pg_hint_plan, and hll.
  • RDS for MySQL can now enforce SSL/TLS for client connections to your databases to help enhance transport layer security. You can enforce SSL/TLS by simply enabling the require_secure_transport parameter (disabled by default) via the Amazon RDS Management console, the AWS Command Line Interface (AWS CLI), AWS Tools for PowerShell, or using the API. When you enable this parameter, clients will only be able to connect if an encrypted connection can be established; a minimal sketch of the API route follows this list.
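
As a hedged illustration of that API route, the parameter can be enabled on a custom DB parameter group using boto3; the parameter group name below is hypothetical, and your DB instance must be associated with that group for the change to apply:

import boto3

rds = boto3.client("rds")

# require_secure_transport is a dynamic parameter, so applying it immediately
# should not require an instance reboot.
rds.modify_db_parameter_group(
    DBParameterGroupName="my-mysql-params",  # hypothetical custom parameter group
    Parameters=[
        {
            "ParameterName": "require_secure_transport",
            "ParameterValue": "1",  # 1 enables it; the parameter is disabled (0) by default
            "ApplyMethod": "immediate",
        }
    ],
)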

Amazon Elastic Compute Cloud (Amazon EC2) expanded availability of the latest generation storage-optimized Is4gen and Im4gn instances to the Asia Pacific (Sydney), Canada (Central), Europe (Frankfurt), and Europe (London) Regions. Built on the AWS Nitro System and powered by AWS Graviton2 processors, these instance types feature up to 30 TB of storage using the new custom-designed AWS Nitro System SSDs. They’re ideal for maximizing the storage performance of I/O intensive workloads that continuously read and write from the SSDs in a sustained manner, for example SQL/NoSQL databases, search engines, distributed file systems, and data analytics.

Lastly, there’s a new URL to use when you need to access the AWS Support Center console. I recommend bookmarking the new URL, https://support.console.aws.amazon.com/, which the team built using the latest architectural standards for high availability and Region redundancy to ensure you’re always able to contact AWS Support via the console.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items and customer stories that you may find interesting:

AWS Open Source News and Updates – Catch up on all the latest open-source projects, tools, and demos from the AWS community in installment #123 of the weekly open source newsletter.

In one recent AWS on Air livestream segment from AWS re:MARS, discussing the increasing scale of machine learning (ML) models, our guests mentioned billion-parameter ML models which quite intrigued me. As an ex-developer, my mental model of parameters is a handful of values, if that, supplied to methods or functions—not billions. Of course, I’ve since learned they’re not the same thing! As I continue my own ML learning journey I was particularly interested in reading this Amazon Science blog on 20B-parameter Alexa Teacher Models (AlexaTM). These large-scale multilingual language models can learn new concepts and transfer knowledge from one language or task to another with minimal human input, given only a few examples of a task in a new language.

When developing games intended to run fully in the cloud, what benefits might there be in going fully cloud-native and moving the entire process into the cloud? Find out in this customer story from Return Entertainment, who did just that to build a cloud-native gaming infrastructure in a few months, reducing time and cost with AWS services.

Upcoming events
Check your calendar and sign up for these online and in-person AWS events:

AWS Storage Day: On August 10, tune into this virtual event on twitch.tv/aws, 9:00 AM–4:30 PM PT, where we’ll be diving into building data resiliency into your organization, and how to put data to work to gain insights and realize its potential, while also optimizing your storage costs. Register for the event here.

AWS Global Summits: These free events bring the cloud computing community together to connect, collaborate, and learn about AWS. Registration is open for a number of AWS Summits taking place in August.

AWS .NET Enterprise Developer Days 2022 – North America: Registration for this free, 2-day, in-person event and follow-up 2-day virtual event opened this past week. The in-person event runs September 7–8, at the Palmer Events Center in Austin, Texas. The virtual event runs September 13–14. AWS .NET Enterprise Developer Days (.NET EDD) runs as a mini-conference within the DeveloperWeek Cloud conference (also in-person and virtual). Anyone registering for .NET EDD is eligible for a free pass to DeveloperWeek Cloud, and vice versa! I’m super excited to be helping organize this third .NET event from AWS, our first that has an in-person version. If you’re a .NET developer working with AWS, I encourage you to check it out!

That’s all for this week. Be sure to check back next Monday for another Week in Review roundup!

— Steve
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Estimating cost for Amazon SQS message processing using AWS Lambda

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/estimating-cost-for-amazon-sqs-message-processing-using-aws-lambda/

This post was written by Sabha Parameswaran, Senior Solutions Architect.

AWS Lambda enables fully managed asynchronous messaging processing through integration with Amazon SQS. This blog post helps estimate the cost and performance benefits when using Lambda to handle millions of messages per day by using a simulated setup.

Overview

Lambda supports asynchronous handling of messages using SQS integration as an event source and can scale for handling millions of messages per day. Customers often ask about the cost of implementing a Lambda-based messaging solution.

There are multiple variables like Lambda function runtime, individual message size, batch size for consuming from SQS, processing latency per message (depending on the backend services invoked), and function memory size settings. These can determine the overall performance and associated cost of a Lambda-based messaging solution.

This post provides cost estimation using these variables, along with guidance around optimization. The estimates focus on consuming from standard queues and not FIFO queues.

SQS event source

The Lambda event source mapping supports integration for SQS. Lambda users specify the SQS queue to consume messages. Lambda internally polls the queue and invokes the function synchronously with an event containing the queue messages.

The configuration controls in Lambda for consuming messages from an SQS queue are:

  • Batch size: The maximum number of records that can be batched as one event delivered to the consuming Lambda function. The maximum batch size is 10,000 records.
  • Batch window: The maximum time (in seconds) to gather records as a single batch. A larger batch window size means waiting longer for a larger SQS batch of messages before passing to the Lambda function.
  • SQS content filtering: Selecting only the messages that match defined content criteria. This can reduce cost by removing unwanted or irrelevant messages. Lambda now supports content filtering (for SQS, Kinesis, and DynamoDB), and developers can use the filtering capabilities to avoid processing unneeded SQS messages, reducing unnecessary invocations and associated cost.

Lambda sends as many records in a single batch as the batch size allows, provided the batch window hasn’t yet elapsed and the total payload is smaller than the maximum of 6 MB. Having large batch sizes means that a single Lambda invocation can handle more messages, rather than multiple Lambda invocations handling smaller batches (which translates to setting higher concurrency limits).

The cost and time to process might vary based on the actual number of messages in the batch. A larger batch size can imply longer processing but requires lower concurrency (number of concurrent Lambda invocations).
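
For reference, these controls map onto the Lambda event source mapping API. The following boto3 sketch is illustrative only; the queue ARN, function name, and filter pattern are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Connect an SQS queue to a consuming function with batching and content filtering.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:my-queue",  # placeholder
    FunctionName="my-processor",                                   # placeholder
    BatchSize=1000,                     # up to 10,000 records per invocation
    MaximumBatchingWindowInSeconds=30,  # wait up to 30s to gather a batch
    FilterCriteria={
        "Filters": [
            # Deliver only messages whose JSON body matches this pattern (placeholder).
            {"Pattern": '{"body": {"type": ["order"]}}'}
        ]
    },
)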

Lambda configurations

Lambda function costs are calculated based on memory used and time spent (in GB-second) in execution of a function. Aside from the event source configuration, there are several other Lambda function configurations that impact cost and performance:

  • Processor type: Lambda functions provide options to choose between x86 and Arm/Graviton processors. The newer Arm/Graviton processors can yield higher performance and lower cost compared to x86, depending on the workload. Compare the options and run tests before selecting.
  • Memory allotted: This is directly proportional to the CPU allotted to the function and translates to price for each invocation. Higher memory can lead to faster execution but also higher cost. The optimal memory required for a small batch versus a large batch can vary based on the workload, the size of incoming messages, transformations, and requirements to store intermediate or final results. Optimal tuning of the memory configuration is key to ensuring the right cost versus performance. See the AWS Lambda Power Tuning documentation for more details on identifying the optimal memory-versus-performance setting for a fixed batch size, and then extrapolate the memory settings for larger batch sizes.
  • Lambda function runtime: Some runtimes have a smaller memory footprint and may be more cost effective than others that are memory intensive. Choosing the runtime affects the memory allocation.
  • Function performance: This can be considered as TPS (total number of requests completed per second) or, conversely, as the time to complete one request. The time to finish a function execution can depend on the event containing the batch of messages (bigger batches mean more time to complete an event) and on the complexity and dependencies of the message processing (such as the performance of the backend services that are invoked). The calculations are based on the assumption that the Lambda function and related dependencies have been optimized and tuned to scale linearly with various batch sizes and numbers of invocations.
  • Concurrency: Number of concurrent Lambda function executions. Concurrency is important for scaling of Lambda functions, allowing users to delegate the capacity planning and scaling to the Lambda service.

The higher the concurrency, the more workloads can be processed in a shorter time, allowing better performance, but this does not change the overall cost. Concurrency is not equivalent to TPS: it is more of a scaling factor in overall TPS. For example, a workload comprised of a set of messages takes 20 seconds to complete. 100 workloads would mean 2000 seconds to complete. With a concurrency of 10, it takes 200 seconds. With a concurrency of 100, the time drops to 20 seconds as each of the 100 workloads is handled concurrently. But each function essentially runs for the same duration and memory, regardless of concurrency. So the cost remains the same, as it is measured in GB-seconds (memory multiplied by time), though the performance view differs. For this reason, the cost estimations do not consider the concurrency settings of Lambda functions, since the workloads cost the same whether processed sequentially or concurrently.

Assumptions

The cost estimation tool presented helps users estimate monthly Lambda function costs for processing SQS standard queue messages based on the following assumptions:

  • The system has reached steady state and has millions of messages available to be consumed per day in standard queues. The number of messages per day remains constant throughout the entire month.
  • Since it’s a steady state, there are no associated Lambda function cold start delays.
  • All SQS messages that need to be processed successfully have already met the filter criteria. Also, there are no poison messages that have to be retried repeatedly. Messages are not rejected, unacknowledged, or reprocessed.
  • The workload scales linearly in performance versus batch size. All the associated dependencies can scale linearly, and a batch of N messages should take the same time as N x a single message plus a fixed overhead per function invocation, irrespective of the batch size. For example, if a function’s overhead is 50 ms irrespective of the batch size and processing a single message takes 20 ms, then a batch of 20 messages should take 450 ms (50 + 20*20), versus 150 ms (50 + 5*20) for a batch of 5 messages.
  • Function memory increases in steps, based on increasing the batch size. For example, 100 messages use 256 MB of baseline memory. Every additional 500 messages require an additional 128 MB of memory. A sliding window of memory to batch size:
Batch size Memory
1–100 256 MB
100–600 384 MB
600–1100 512 MB
1100–1600 640 MB

Lambda uses SQS APIs internally to poll and dequeue the messages. The costs for the polling and dequeue operations using SQS APIs are not included as part of the estimations. The internal SQS dequeue portion is outside the control of the Lambda developer and the cost estimates only cover the message processing using Lambda. Also, the tool does not consider any reprocessing or duplicate processing of messages due to exceptions or errors that can vary the cost.
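
Putting these assumptions together, the heart of the estimate comes down to a few lines of arithmetic. The sketch below is not the estimator tool itself, only an illustration of its model; the per-GB-second price is an assumption based on published x86 Lambda pricing at the time of writing, and request charges and the free tier are ignored for brevity:

import math

PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 rate; check current Lambda pricing

def duration_ms(batch_size, overhead_ms=50, per_message_ms=20):
    # Linear-scaling assumption: fixed invocation overhead plus per-message time.
    return overhead_ms + per_message_ms * batch_size

def memory_mb(batch_size, base_mb=256, step_mb=128, step_size=500):
    # Stepped-memory assumption: base memory covers the first 100 messages,
    # then each additional step_size messages add step_mb of memory.
    extra_steps = max(0, math.ceil((batch_size - 100) / step_size))
    return base_mb + step_mb * extra_steps

def monthly_cost_usd(messages_per_day, batch_size):
    invocations = messages_per_day * 30 / batch_size
    gb_seconds = (memory_mb(batch_size) / 1024) * (duration_ms(batch_size) / 1000)
    return invocations * gb_seconds * PRICE_PER_GB_SECOND

# Example: 10 million messages per day, processed in batches of 500.
print(f"${monthly_cost_usd(10_000_000, 500):,.2f} per month")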

Using the cost estimation tool

The estimator tool is a Python-based command line program that reads an input properties file specifying the various input parameters, then produces Lambda function cost versus performance estimations for various batch sizes, messages per day, and so on. The tool takes into account the eligible monthly free tier for Lambda function executions.

Pre-requisites: Running the tool requires Python 3.9 and the Plotly package (5.7+), or building and using the provided Docker image.

To run the tool:

  1. Clone the repo:
    git clone https://github.com/aws-samples/aws-lambda-sqs-cost-estimator
  2. Install the tool:
    cd aws-lambda-sqs-cost-estimator/code
    pip3 install -r requirements.txt
  3. Edit the input.prop file and run the tool to generate cost estimations:
    python3 LambdaPlotly.py

This shows the cost estimates on a local browser instance. Running the code as a Docker image is also supported. Refer to the GitHub repo for additional instructions.

  1. Clone the repo and build the Docker container:
    git clone https://github.com/aws-samples/aws-lambda-sqs-cost-estimator
    cd aws-lambda-sqs-cost-estimator/code
    docker build -t lambda-dash .
  2. Edit the input.prop file and run the tool to generate cost estimations:
    docker run -it -v `pwd`:/app -p 8080:8080 lambda-dash
  3. Navigate to http://0.0.0.0:8080/app in a browser to view the generated cost estimate plot.

There are various input parameters for the cost estimations specified inside the input.prop file. Tune the input parameters as needed:

  • base_lambda_memory_mb – Baseline memory for the Lambda function (in MB). Sample value: 128
  • warm_latency_ms – Invocation time for the Lambda handler method (warm start), irrespective of the batch size in the incoming event payload (in ms). Sample value: 20
  • process_per_message_ms – Time to process a single message; scales linearly with the number of messages per batch in the event payload (in ms). Sample value: 10
  • max_batch_size – Maximum batch size per event payload processed by a single Lambda instance. Sample value: 1000 (max is 10,000)
  • batch_memory_overhead_mb – Additional memory for processing increments in batch size (in MB). Sample value: 128
  • batch_increment – Increments of batch size for increased memory. Sample value: 300

The following is sample input.prop file content:

base_lambda_memory_mb=128

# Total process time for N messages in batch = warm_latency_ms + (process_per_message_ms * N)

# Time spent in function initialization/warm-up

warm_latency_ms=20

# Time spent for processing each message in milliseconds

process_per_message_ms=10

# Max batch size

max_batch_size=1000

# Additional lambda memory X mb required for managing/parsing/processing N additional messages processed when using variable batch sizes

#batch_memory_overhead_mb=X

#batch_increment=N

batch_memory_overhead_mb=128

batch_increment=300

The tool generates a page with plot graphs and tables in three sections:

Cost example

There is an accompanying interactive legend showing cost and batch size. The top section shows a graph of cost versus message volume versus batch size. The second section shows the actual cost variation for different batch sizes for 10 million messages. The third section shows the memory and time required to process with different batch sizes.

The various control input parameters used for graph generation are shown at the bottom of the page. Double-clicking on a specific batch size or line on the right-hand legend displays that specific plot with its pricing details.

You can modify the input parameters with different settings for memory, batch sizes, memory for increased batches and rerun the program to create different cost estimations. You can also export the generated graphs as PNG image files for reference.

Conclusion

You can use Lambda functions to handle fully managed asynchronous processing of SQS messages. Estimating the cost and optimal setup depends on leveraging the various configurations of SQS and Lambda functions. The cost estimator tool presented in this blog should help you understand these configurations and their impact on the overall cost and performance of the Lambda function-based messaging solutions.

For more serverless learning resources, visit Serverless Land.

Forwood Safety uses Amazon QuickSight Q to extend life-saving safety analytics to larger audiences

Post Syndicated from Faye Crompton original https://aws.amazon.com/blogs/big-data/forwood-safety-uses-amazon-quicksight-q-to-extend-life-saving-safety-analytics-to-larger-audiences/

This is a guest post by Faye Crompton from Forwood Safety. Forwood provides fatality prevention solutions to organizations across the globe.

At Forwood Safety, we have a laser focus on saving lives. Our solutions, which provide full content and proven methodology via verification tools and analytical capabilities, have one purpose: eliminating fatalities in the workplace. We recently realized an ambition to provide interactive, dynamic data visualization tools that enable our end users to access safety data in the field, regardless of their experience with analytics and data reporting.

In this post, I’ll talk about how Amazon QuickSight Q solved these challenges by giving users fast data insights through natural language querying capabilities.

Driving data insights with QuickSight

Forwood’s Critical Risk Management (CRM) solution provides organizations with globally benchmarked and comprehensive critical control checklists and verification controls that are proven to prevent fatalities in the workplace. CRM protects frontline workers from serious harm by helping change the culture of risk management for companies. In addition, our Forwood Analytical Self-Service Tool (FAST) enables our customers to use self-service reporting to get updated dashboards that display key safety and fatality prevention metrics.

For several years, we used Amazon QuickSight to provide data visualization for our CRM and FAST reporting products, with great success. Most of our technology stack was already based on AWS, so we knew QuickSight would be easy to integrate. QuickSight is agnostic in terms of data sources and types, and it’s a very flexible tool. It’s also an open data technology, so it can accept most of the data sources that we throw at it. Most importantly, it ties in seamlessly with our own architecture and data pipelines in a way that our previous BI tools couldn’t. After we implemented QuickSight, we started using it to power both CRM and FAST, visualizing risk data and serving it back to our customers.

Using QuickSight Q to help site supervisors get answers quickly

Furthering our focus on innovation and usability, we identified a common challenge that we believed QuickSight could solve through our FAST application on behalf of our clients: we needed to make risk data more accessible for those of our clients who aren’t data analysts. We recognize that not everyone is an analyst. We also have mining industry customers who are not frequently accessing our applications via desktop. For example, mining site Supervisors and Operators working deep underground typically have access only via their mobile devices. For these users, it’s easier to ask the questions relevant to their specific use cases as needed at point of use, rather than filter and search through a dashboard to find the answers ahead of time.

QuickSight Q was the perfect solution to this challenge. QuickSight Q is a feature within QuickSight that uses machine learning to understand questions and how they relate to business data. The feature provides data insights and visualizations within seconds. With this capability, users can simply type in questions in natural language to access data insights about risk and compliance. Mining site workers, for example, can ask if the site is safe or if the right verification processes are in place. Health and safety teams and mining site supervisors can ask questions such as “Which sites should I verify today?” or “Which risk will be highest next week?” and receive a chart with the relevant data.

Making data more accessible to everyone

QuickSight Q gives our on-site customers near-real-time risk and compliance data from their mobile devices in a way they couldn’t before. With QuickSight Q, we can give our FAST users the opportunity to quickly visualize any fatality risks at their sites based on updated fatality prevention data. All users, not just analysts, can identify worksites that have a higher fatality risk because the data can show trends in non-compliance with safety standards. Our clients no longer have to look in a dashboard for the answers to their questions; those looking at a dashboard can go beyond the dashboard and ask deeper questions.

QuickSight Q solved one of our main BI challenges: how to make risk data more accessible to more people without extensive user training and technical understanding. Soon, we hope to use QuickSight Q as part of a multidimensional predictive dataset using deep learning models to deliver even more insights to our customers.

We look forward to extending our use of QuickSight. When we first started using it, it was strictly for analytics on our existing data. More recently, we started using API deployments for QuickSight. We have many different clients, and we use the API feature to maintain master versions of all 30+ standard reports, and then deploy those dashboards to as many clients as we need to via code. Previously, we saw QuickSight as a function of our analytics products; now we see it as a powerful and flexible toolkit of analytics features that our developers can build with.

Additionally, we look forward to relying on QuickSight Q to bring life-saving safety analytics to more people. QuickSight Q bridges the gap between the data a company has and the decisions that company needs to make, and that’s very powerful for our clients. Forwood Safety is driven to eradicate workplace fatalities, and by getting data to more people and making it easy to access, we can make our solutions more effective, saving more lives.


About the author

Faye Crompton is Head of Analytics, Safety Applications and Computer Vision at Forwood Safety. She leads work on analytics and safety products that reduce fatality risk in mining and other high-risk industries.

How One Engineer Upskilled Into a Salesforce Engineering Role at Rapid7

Post Syndicated from Rapid7 original https://blog.rapid7.com/2022/08/08/how-one-engineer-upskilled-into-a-salesforce-engineering-role-at-rapid7/

How One Engineer Upskilled Into a Salesforce Engineering Role at Rapid7

At Rapid7, we believe the growth and development of our people enables us to better serve customers who depend on us. When our Engineering team was searching for candidates to help with our Salesforce ecosystem, John Millar demonstrated many of our core values – most importantly, the appetite to learn and grow his career as part of our commitment to “Never Done.” Through his own grit and determination – and support from his team – he transitioned into a new role and acquired a new set of skills along the way.

Here’s a closer look at that journey, told in John’s own words.

Celebrating a new path

Coming up on nearly two years at Rapid7, I am over the moon with what I have achieved personally and professionally. Before joining the company, I was a Q developer working with KDB+ systems. Now, I am an Engineer working in our Salesforce ecosystem in Belfast.

Getting up to speed with our Salesforce system, becoming a valuable member of our development team, and helping to knock out some big projects in that time period have made me incredibly proud of how my career has grown in under two years. I have also become the team’s SME for an integrated software tool that is connected with Salesforce and have completed my first Salesforce certification, with more planned before the end of the year. These certifications are funded by Rapid7 as part of their core value of “Never Done.”

Creating a new direction for my career and having the opportunity to grow has certainly paid off, but it didn’t happen overnight.

Jumping into something new

Rewinding back to 2020 – I had been working for over two years as part of a periodic low-frequency development team for a Tier 1 bank. We were responsible for the maintenance and development of the low frequency components of the plant. This role revolved around a holistic time series database system built on kdb+ (q language), containing a wide range of data covering both periodic and aperiodic frequencies and all asset classes.

I felt like I wanted a new challenge and was interested in moving back into a role based around an object oriented language, similar to what I had been working with throughout University. I had heard of and researched Rapid7, so when they contacted me and outlined their goals, objectives, and culture along with the specific role I would be applying for, I knew it was for me and wanted to make that jump.

Supporting new skills and growth

One of the core values of Rapid7 is “Never Done,” which encourages employees to constantly learn and improve their knowledge stack. I believe this was pivotal in my upskilling process, as the support needed was very accessible.

Rapid7 was invested in my growth from the moment I joined. As a candidate, I didn’t fit 100% of the requirements at the time. I understood the fundamentals and met the core criteria, but I didn’t have a ton of experience in Java and had no experience with Salesforce. Rapid7 recognized my potential and was invested in helping me grow my skills and become a great Salesforce developer for the team.  

When upskilling to Salesforce, the main resource I used was Trailhead, a free program provided by Salesforce. These exercises and learning modules are very detailed, interesting, and interactive. They really help with absorbing and understanding the information while actively completing tasks in parallel. Additionally, I was supported and mentored by colleagues from Rapid7, who were equally invested in my growth. Whether it was through formal 1:1s or just making themselves available for advice and questions, I felt supported throughout the process.

Creating impact

Making the transition was not easy, and it took a lot of time and effort. I had to be self-motivated and determined to get up to speed with the Salesforce CRM and Salesforce Apex. Having completed this transition journey into Salesforce, it is all the more satisfying when completing and planning work, knowing that it has paid dividends in terms of my career growth.

Our team is making an impact by enabling the Salesforce ecosystem to operate more efficiently. We do this by analyzing and debugging issues, identifying opportunities, and improving our integration capabilities. This means the Rapid7 team is better positioned to support and protect our customers against outside threats to their business, as well as protect the personal information and data of their customers.

I have great confidence and pride in the work that I complete and feel I play a vital role in our team. I would highly recommend anyone thinking of making that jump to something new, to go for it. I know I haven’t looked back.


How to prepare your application to scale reliably with Amazon EC2

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/how-to-prepare-your-application-to-scale-reliably-with-amazon-ec2/

This blog post is written by, Gabriele Postorino, Senior Technical Account Manager, and Giorgio Bonfiglio, Principal Technical Account Manager

In this post, we’ll discuss how you can prepare for planned and unplanned scaling events with Amazon EC2.

Most of the challenges related to horizontal scaling can be mitigated by optimizing the architectural implementation and applying improvements in operational processes.

In the following sections, we’ll explore this in depth. Recommendations can be applied partially or fully – they come with different complexities, and each one will help you reduce the risk of facing insufficient capacity errors or scaling delays, as well as deliver enhancements in areas such as fault tolerance, elasticity, and cost optimization.

Architectural best practices

Instance capacity can be regarded as being divided into “pools” defined by AZ (such as us-east-1a), instance type (for example m5.xlarge), and tenancy. Combining the following two guidelines will widen the capacity pools available to scale out your fleets of instances. This will help you reduce costs, transparently recover from failures, and increase your application scalability.

Instance flexibility

Whether you’re migrating a new workload to the cloud, or tuning an existing workload, you’ll likely evaluate which compute configuration options are available and determine the right configuration for your application.

If your workload is already running on EC2 instances, you might already be aware of the instance type that it runs best on. Let’s say that your application is RAM intensive, and you found that r6i.4xlarge instances are best suited for it.

However, relying on a single instance type might result in artificially limiting your ability to scale compute resources for your workload when needed. It’s always a good idea to explore how your workload behaves when running on other instance types: you might find that your application can serve double the number of requests served by one r6i.4xlarge instance when using one r6i.8xlarge instance or four r6i.2xlarge instances.

Furthermore, there’s no reason to limit your options to a single instance family, generation, or processor type. For example, m6a.8xlarge instances offer the same amount of RAM of r6i.4xlarge and might be used to run your application if needed.

Amazon EC2 Auto Scaling helps you make sure that you have the right number of EC2 instances available to handle the load for your application.

Auto Scaling groups can be configured to respond to scaling events by selecting the type of instance to launch among a list of instance types. You can statically populate the list in advance, as in the following screenshot,

The Instance type requirements section of the Auto Scaling Wizard instance launch options step is shown with the option “Manually add instance types” selected.

or dynamically define it by a set of instance attributes as shown in the subsequent screenshot:

The Instance type requirements section of the Auto Scaling Wizard instance launch options step is shown with the option “Specify instance attribute” selected.

For example, by setting the requirements to a minimum of 8 vCPUs, 64GiB of Memory, and a RAM/CPU ratio of 8 (just like r6i.2xlarge instances), up to 73 instance types can be included in the list of suitable instances. They will be selected for launch starting from the lowest priced instance types. If the request can’t be fulfilled in full by the lowest priced instance type, then additional instances will be launched from the second lowest instance type pool, and so on.
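
As a hedged sketch, the same attribute-based selection can be expressed through the Auto Scaling API with boto3. The group name, launch template, and subnets below are placeholders:

import boto3

autoscaling = boto3.client("autoscaling")

# Attribute-based instance type selection: any instance type with at least
# 8 vCPUs, at least 64 GiB of memory, and 8 GiB of memory per vCPU qualifies.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-flexible-asg",       # placeholder
    MinSize=1,
    MaxSize=10,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",    # placeholder subnets in two AZs
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "my-template",  # placeholder launch template
                "Version": "$Latest",
            },
            "Overrides": [
                {
                    "InstanceRequirements": {
                        "VCpuCount": {"Min": 8},
                        "MemoryMiB": {"Min": 65536},            # 64 GiB
                        "MemoryGiBPerVCpu": {"Min": 8, "Max": 8},
                    }
                }
            ],
        },
        "InstancesDistribution": {
            "OnDemandAllocationStrategy": "lowest-price"
        },
    },
)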

Instance distribution

Each AWS Region consists of multiple, isolated Availability Zones (AZ), interconnected with high-bandwidth, low-latency networking. Spreading a workload across AZs is a well-established resiliency best practice. It will make sure that your end users aren’t impacted in the case of a single AZ, data center, or rack failures, as each AZ has its own distinct instance capacity pools that you can leverage to scale your application fleets.

EC2 Auto Scaling can manage the optimal distribution of EC2 instances in a group across all AZs in a Region automatically, as well as deal with temporary failures transparently. To do so, it must be configured to use at least one subnet in each AZ. Then, it will attempt to distribute instances evenly across AZs and automatically cycle through AZs in case of temporary launch failures.

Diagram showing a VPC with subnets in 2 Availability Zones and an Autoscaling group managing groups of instances of different types

Operational best practices

The way that your workload is operated also impacts your ability to scale it when needed. Failure management and appropriate scaling techniques will help you maximize the availability of your environment.

Failure management

On-Demand capacity isn’t guaranteed to always be available. There might be short windows of time when AWS doesn’t have enough available On-Demand capacity to fulfill your specific request: as the availability of On-Demand capacity changes frequently, it’s important that your launch processes implement retry mechanisms.

Retries and fallbacks are managed automatically by EC2 Auto Scaling. But if you have a custom workflow to launch instances, it should be able to work with server error codes, in particular InsufficientInstanceCapacity or InternalError, by retrying the launch request. For a complete list of error codes for the EC2 API, please refer to our documentation.
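
As an illustration of such a retry mechanism, the sketch below wraps RunInstances with simple exponential backoff; the AMI ID and instance type are placeholders, and production code would also implement fallbacks across instance types and AZs:

import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")
RETRYABLE_ERRORS = {"InsufficientInstanceCapacity", "InternalError"}

def launch_with_retries(ami_id, instance_type, attempts=5):
    # Retry transient capacity errors with exponential backoff.
    for attempt in range(attempts):
        try:
            return ec2.run_instances(
                ImageId=ami_id,            # placeholder AMI ID
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] not in RETRYABLE_ERRORS:
                raise                      # surface non-capacity errors immediately
            time.sleep(2 ** attempt)       # back off before retrying
    raise RuntimeError("Capacity not available after retries")

# launch_with_retries("ami-0123456789abcdef0", "r6i.4xlarge")  # placeholder values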

EC2 also provides EC2 Fleet, a feature that helps implement instance flexibility best practices. Instead of calling RunInstances with one instance type and retrying, EC2 Fleet in Instant mode considers all provided instance types, using either a list of instances or attribute-based instance selection, and provisions capacity from the configured pools where capacity is available.

Scaling technique

Launching EC2 instances as soon as you have an initial indication of increased load, in smaller batches and over a longer time span, helps increase your application performance and reliability while reducing costs and minimizing disruptions.

In the graph above, two different scaling techniques that follow the increase in load are depicted. Scaling approach #1 adds a large number of instances less frequently, while approach #2 launches a smaller batch of instances more frequently. Adopting the first approach risks your application not being able to sustain the increase in load in a timely manner. This will potentially cause an impact on end users and leave the operations team with little time to resolve.

Effective capacity planning

On-Demand Instances are best suited for applications with irregular, uninterruptible workloads. Interruptible workloads can avail of Spot Instances that pick from spare EC2 capacity. They cost less than On-Demand Instances but can be interrupted with a two-minute warning.

If your workload has a stable baseline utilization that hardly changes over time, then you can reserve capacity for your baseline usage of EC2 instances using open On-Demand Capacity Reservations and cover them with Savings Plans to get discounted rates with a one-year or three-year commitment, with the latter offering the bigger discounts.

Open On-Demand Capacity Reservations and Savings Plans aren’t tightly related to the EC2 instances that they cover at a certain point in time. Rather they shift to other usage, matching all of the parameters of the respective On-Demand Capacity Reservation or Savings Plan (e.g., Instance Type, Operating System, AZ, tenancy) in your account or across accounts for which you have sharing enabled. This lets you be dynamic even with your stable baseline. For example, during a rolling update or a blue/green deployment, On-Demand Capacity Reservations and Savings Plans will automatically cover any instances that match the respective criteria.

ODCR Fleets

There are times when you can’t apply all of the recommended mitigating actions in anticipation of a planned event. In those cases, you might want to use On-Demand Capacity Reservation Fleets to reserve capacity in advance for additional peace of mind. Capacity reservation fleets let you define capacity requests across multiple instance types, up to a target capacity that you specify. They can be created and managed using the AWS Command Line Interface (AWS CLI) and the AWS APIs.

Key concepts of Capacity Reservation Fleets are the total target capacity and the instance type weight. The instance type weight expresses the number of capacity units that each instance of a specific instance type counts toward the total target capacity.

Let’s say your workload is memory-bound, you expect to need 1.6 TiB of RAM, and you want to use r6i instances. You can create a Capacity Reservation Fleet for r6i instances, defining weights for each instance type in the family based on the relative amount of memory they have, in an instance type specification JSON file.

instanceTypeSpecification.json:
[
    {             
        "InstanceType": "r6i.2xlarge",                       
        "InstancePlatform":"Linux/UNIX",            
        "Weight": 1,
        "AvailabilityZone":"eu-west-1a",        
        "EbsOptimized": true,           
        "Priority" : 3
    },
    { 
        "InstanceType": "r6i.4xlarge",                        
        "InstancePlatform":"Linux/UNIX",            
        "Weight": 2,
        "AvailabilityZone":"eu-west-1a",        
        "EbsOptimized": true,            
        "Priority" : 2
    },
    {             
        "InstanceType": "r6i.8xlarge",                        
        "InstancePlatform":"Linux/UNIX",           
        "Weight": 4,
        "AvailabilityZone":"eu-west-1a",       
        "EbsOptimized": true,            
        "Priority" : 1
    }
]

Then, you want to use this specification to create a Capacity Reservation Fleet that takes care of the underlying Capacity Reservations needed to fulfill your request:

$ aws ec2 create-capacity-reservation-fleet \
--total-target-capacity 25 \
--allocation-strategy prioritized \
--instance-match-criteria open \
--tenancy default \
--end-date 2022-05-31T00:00:00.000Z \
--instance-type-specifications file://instanceTypeSpecification.json

In this example, I set the target capacity to 25, which is the number of r6i.2xlarge instances needed to get 1.6 TiB of total memory across the fleet. As you might have noticed, Capacity Reservation Fleets can be created with an end date. They will automatically cancel themselves and the Capacity Reservations that they created when the end date is reached, so that you don’t need to.

AWS Infrastructure Event Management

Last but not least, our teams can offer the AWS Infrastructure Event Management (IEM) program. Part of select AWS Support offerings, the IEM program has been designed to help you with planning and executing events that impact your infrastructure on AWS. By requesting an IEM engagement, you will be supported by AWS experts during all of the phases of your event.

Flow chart showing the steps an IEM is usually made of: 1. Event is planned 2. IEM is initiated 6-8 weeks in advance of the event 3. Infrastructure readiness is assessed and mitigations are applied 4. The event 5. Post-event review

Starting from your business outcomes and success criteria, we’ll assess your infrastructure readiness for the event, evaluate risks, and recommend specific actions to mitigate them. The AWS experts will focus on your application architecture as a whole and dive deep into each of its components with your respective teams. They might also engage with other AWS teams to notify them of the upcoming event, and get specific prescriptive guidance when needed. During the event, AWS experts will have the context needed to help you resolve any issue that might arise as quickly as possible. The program is included in the Enterprise and Enterprise On-Ramp Support plans and is available to Business Support customers for an additional fee.

Conclusion

Whether you’re planning for a big future event, or you want to make sure that your application can withstand unexpected increases in traffic, it’s important that you consider what we discussed in this article:

  • Use as many instance types as you can; don’t limit your workload to a single instance type when it could run on many more
  • Distribute your EC2 instances across all AZs in the Region
  • Expect failures: manage retries and fallback options
  • Make use of EC2 Auto Scaling and EC2 Fleet whenever possible
  • Avoid scaling spikes: start scaling earlier, in smaller chunks, and more frequently (see the sketch after this list)
  • Reserve capacity only when you really need to
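
One way to scale earlier and in smaller chunks is a target tracking scaling policy, which continuously adjusts an Auto Scaling group toward a metric target instead of reacting to threshold breaches with large jumps. Here is a minimal sketch using the AWS CLI; the group name, policy name, and the 50% average-CPU target are placeholder assumptions to adapt to your workload:

$ aws autoscaling put-scaling-policy \
--auto-scaling-group-name my-asg \
--policy-name cpu-target-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-configuration '{
    "TargetValue": 50.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ASGAverageCPUUtilization"
    }
}'

Because the policy tracks the metric continuously, capacity is added in small increments as load approaches the target, rather than all at once after an alarm fires.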

For further study, we recommend the Well-Architected Framework Reliability and Operational Excellence pillars as starting points. Moreover, if you have an event coming up, talk to your Technical Account Manager, your Account Team, or contact us to find out how we can help!

The 3-2-1 Backup Strategy

Post Syndicated from original https://www.backblaze.com/blog/the-3-2-1-backup-strategy/

Backing up your computer files is like buying comprehensive car insurance, i.e., something you hope you never have to use but are sure glad you have when you need it. Of course, you want to protect your personal information, photos, work files, and other important data from hard drive crashes, accidental deletions, drink spills, theft, or malware. While Backblaze can’t help with power outages, computer encryption, or anti-theft technologies (though we can locate a computer), we can help make backing up your files a no-brainer. And with only 10% of computer-owning respondents backing up daily (at least according to our most recent survey), folks need the help!

Warning

Some cloud service providers suggest that encouraging people to keep multiple copies of their data reflects a lack of faith in our product. Still, redundant failsafes should always be part of your plan. It’s like investing: diversification is key. The truth is that anything and everything can fail. Hard drives fail. Good employees make bad mistakes. Bad employees make worse mistakes. And whether it was the Amazon Web Services outage that took down a large swath of the internet or the Google Cloud Storage outage that affected platforms like Snapchat, Shopify, and Discord, even the biggest providers can let you down in your time of need. That’s why the 3-2-1 strategy exists.

What’s Changed About the 3-2-1 Backup Strategy?

You may have heard of the 3-2-1 backup strategy. It means having at least three copies of your data, two local (on-site) but on different media (read: devices), and at least one copy off-site.

We’ll use “socialsecurity.jpg” as an example for this scenario. Socialsecurity.jpg lives on your computer at home; let’s say you took a picture of it for your tax accountant years ago for some tax-related stuff (as tax accountants are wont to do). That’s one copy of the data.

You also have an external hard drive to back up your computer; if you’re on a Mac, you might use it as a Time Machine drive (and Backblaze loves Time Machine). That external hard drive will back up socialsecurity.jpg as part of its backup process. That’s a second copy on a different device or medium.

In addition to that external hard drive, you also have an online backup solution (we recommend Backblaze, go figure!). The online backup continuously scans your computer and uploads your data to an off-site data center. Socialsecurity.jpg is included in this upload, becoming the third copy of your data.

Oh! And your paper social security card is hopefully stored in a fireproof safe (not your wallet). Does that sound pretty airtight to you? It is, or at least it used to be.

The rise in ransomware attacks calls for strengthening the basic principles of the 3-2-1 strategy—redundancy, geographic distance, and access—with added protections. Cybercrimes that target networked machines and capture all data, including backups, are a growing problem. Data geeks who know about backup and recovery are going “comprehensive” with their backup “insurance.” New versions of the tried-and-true backup strategy have emerged, such as 3-2-1-1-0 or 4-3-2 backups. Sounds like overkill? It isn’t. The good news is that companies like Backblaze exist to make at least the off-site component less stressful: they do the work and keep up with security best practices for you.

Why Is It Important to Back Up On-site and Off-site?

Whether you are interested in backing up a Mac or a PC, an on-site backup is a simple way to access your data quickly should anything happen to your computer. If your laptop or desktop’s hard drive crashes, and you have an up-to-date external hard drive available, you can quickly get most of your data back or use the external drive on another computer while yours gets fixed or replaced. If you remember to keep that external hard drive fairly up to date, the exposure for data loss is negligible, as you might only lose the uncopied files on your laptop. Most external hard drives even come with software to ensure they’re readily updated.

Having an on-site backup is a great start, but having an off-site backup in the cloud is the key component of a complete backup strategy. The newer backup strategies build on the cloud’s strengths:

  • Convenience: Backing up large volumes of data in the cloud is fast.
  • Durability and reliability: Cloud storage centers protect against fires, natural disasters, and more.
  • Collaboration: Sharing with permissions is intuitive and effortless in the cloud.

With millions working in the cloud, the three copies in a 3-2-1-1-0 backup are separated by media: on-site, off-site, and offline. This strategy exceeds the original model with its extra digits: one copy kept offline or immutable, and zero errors after backup verification. That fidelity to your files’ durability and reliability is possible with the help of a backup protection tool like Object Lock, which makes it impossible to modify or delete data (for a certain amount of time) because of its Write Once, Read Many (WORM) model. If you’re using cloud storage, consider a backup strategy that uses the principles of redundancy, distance, access, and immutability, like 3-2-1-1-0. And in the case of Backblaze, retention history (like our Extended Version History feature of Computer Backup) adds a further layer of protection by controlling how long those copies are kept should anything happen to your physical devices.

Is 3-2-1 Perfect?

There is no such thing as a perfect backup system, but the 3-2-1 approach is a great start for most people and businesses. Even the United States government recommends this approach: in a 2012 paper for US-CERT (United States Computer Emergency Readiness Team), Data Backup Options, Carnegie Mellon recommended the 3-2-1 method.

Backing Up Is the Best Insurance

The 3-2-1-1-0 plan is great for getting your files backed up. If you view the strategy as an insurance policy, you want one that provides the coverage needed should the unthinkable happen. Service also matters: having local, off-site, and offline backups gives you more options for recovery, and a zero error policy for recoverability is the equivalent of a “no questions asked” claims process. One can dream! We did.

That’s how, and why, Backblaze created the world’s easiest cloud backup.

The post The 3-2-1 Backup Strategy appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

[$] An io_uring-based user-space block driver

Post Syndicated from original https://lwn.net/Articles/903855/

The addition of the ublk driver during the 6.0 merge window would have been easy to miss; it was buried deeply within an io_uring pull request and is entirely devoid of any sort of documentation that might indicate why it merits a closer look. Ublk is intended to facilitate the implementation of high-performance block drivers in user space; to that end, it uses io_uring for its communication with the kernel. This driver is considered experimental for now; if it is successful, it might just be a harbinger of more significant changes to come to the kernel in the future.

Coordinating large messages across accounts and Regions with Amazon SNS and SQS

Post Syndicated from Mrudhula Balasubramanyan original https://aws.amazon.com/blogs/architecture/coordinating-large-messages-across-accounts-and-regions-with-amazon-sns-and-sqs/

Many organizations have applications distributed across various business units. Teams in these business units may develop their applications independent of each other to serve their individual business needs. Applications can reside in a single Amazon Web Services (AWS) account or be distributed across multiple accounts. Applications may be deployed to a single AWS Region or span multiple Regions.

Irrespective of how the applications are owned and operated, these applications need to communicate with each other. Within an organization, applications tend to be part of a larger system; therefore, communication and coordination among these individual applications are critical to its overall operation.

There are a number of ways to enable coordination among component applications. It can be done either synchronously or asynchronously:

  • Synchronous communication uses a traditional request-response model, in which the applications exchange information in a tightly coupled fashion, introducing multiple points of potential failure.
  • Asynchronous communication uses an event-driven model, in which the applications exchange messages as events or state changes and are loosely coupled. Loose coupling allows applications to evolve independently of each other, increasing scalability and fault-tolerance in the overall system.

Event-driven architectures use a publisher-subscriber model, in which events are emitted by the publisher and consumed by one or more subscribers.

A key consideration when implementing an event-driven architecture is the size of the messages or events that are exchanged. How can you implement an event-driven architecture for messages larger than the services’ default maximums? How can you architect messaging and automation of applications across AWS accounts and Regions?

This blog presents architectures for enhancing event-driven models to exchange large messages. These architectures depict how to coordinate applications across AWS accounts and Regions.

Challenge with application coordination

A challenge with application coordination is exchanging large messages. For the purposes of this post, a large message is defined as an event payload between 256 KB and 2 GB. This stems from the fact that Amazon Simple Notification Service (Amazon SNS) and Amazon Simple Queue Service (Amazon SQS) currently have a maximum event payload size of 256 KB. To exchange messages larger than 256 KB, an intermediate data store must be used.

To exchange messages across AWS accounts and Regions, set up the publisher access policy to allow subscriber applications in other accounts and Regions. In the case of large messages, also set up a central data repository and provide access to subscribers.

Figure 1 depicts a basic schematic of applications distributed across accounts communicating asynchronously as part of a larger enterprise application.

Figure 1. Asynchronous communication across applications

Architecture overview

The overview covers two scenarios:

  1. Coordination of applications distributed across AWS accounts and deployed in the same Region
  2. Coordination of applications distributed across AWS accounts and deployed to different Regions

Coordination across accounts and single AWS Region

Figure 2 represents an event-driven architecture, in which applications are distributed across AWS Accounts A, B, and C. The applications are all deployed to the same AWS Region, us-east-1. A single Region simplifies the architecture, so you can focus on application coordination across AWS accounts.

Figure 2. Application coordination across accounts and single AWS Region

The application in Account A (Application A) is implemented as an AWS Lambda function. This application communicates with the applications in Accounts B and C. The application in Account B (Application B) is built with AWS Step Functions, and the application in Account C (Application C) runs on Amazon Elastic Container Service (Amazon ECS).

In this scenario, Applications B and C need information from upstream Application A. Application A publishes this information as an event, and Applications B and C subscribe to an SNS topic to receive the events. However, since they are in other accounts, you must define an access policy to control who can access the SNS topic. You can use sample Amazon SNS access policies to craft your own.
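
For illustration, a minimal sketch of such a topic policy might look like the following; the account IDs, Region, and topic name are hypothetical placeholders:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCrossAccountSubscribe",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root"
                ]
            },
            "Action": "sns:Subscribe",
            "Resource": "arn:aws:sns:us-east-1:111111111111:large-message-topic"
        }
    ]
}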

If the event payload is in the 256 KB to 2 GB range, you can use Amazon Simple Storage Service (Amazon S3) as the intermediate data store for your payload. Application A uses the Amazon SNS Extended Client Library for Java to upload the payload to an S3 bucket and publish a message to an SNS topic, with a reference to the stored S3 object. The message containing the metadata must be within the SNS maximum message limit of 256 KB. Amazon EventBridge is used for routing events and handling authentication.
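
As a rough sketch of the publishing side, the following assumes the Amazon SNS Extended Client Library for Java; the bucket name and topic ARN are hypothetical placeholders:

LargeMessagePublisher.java:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;
import com.amazonaws.services.sns.model.PublishRequest;
import software.amazon.sns.AmazonSNSExtendedClient;
import software.amazon.sns.SNSExtendedClientConfiguration;

public class LargeMessagePublisher {
    // Hypothetical names; replace with your own bucket and topic ARN.
    private static final String BUCKET_NAME = "payload-offload-bucket";
    private static final String TOPIC_ARN =
            "arn:aws:sns:us-east-1:111111111111:large-message-topic";

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Payloads above the SNS limit are written to the S3 bucket; the
        // published SNS message carries only a reference to the stored object.
        SNSExtendedClientConfiguration config = new SNSExtendedClientConfiguration()
                .withPayloadSupportEnabled(s3, BUCKET_NAME);

        AmazonSNS snsExtendedClient =
                new AmazonSNSExtendedClient(AmazonSNSClientBuilder.defaultClient(), config);

        String largePayload = buildLargePayload(); // e.g., between 256 KB and 2 GB
        snsExtendedClient.publish(new PublishRequest(TOPIC_ARN, largePayload));
    }

    private static String buildLargePayload() {
        // Stand-in for your application's real payload.
        return new String(new char[512 * 1024]).replace('\0', 'x');
    }
}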

The subscriber Applications B and C need to de-reference and retrieve the payloads from Amazon S3. The SQS queue in Account B and Lambda function in Account C subscribe to the SNS topic in Account A. In Account B, a Lambda function is used to poll the SQS queue and read the message with the metadata. The Lambda function uses the Amazon SQS Extended Client Library for Java to retrieve the S3 object referenced in the message.
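
A minimal sketch of that consumer, assuming the Amazon SQS Extended Client Library for Java and raw message delivery enabled on the SNS subscription (so the payload reference attributes pass through to SQS); the bucket name and queue URL are hypothetical placeholders:

LargeMessageConsumer.java:

import com.amazon.sqs.javamessaging.AmazonSQSExtendedClient;
import com.amazon.sqs.javamessaging.ExtendedClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class LargeMessageConsumer {
    // Hypothetical names; replace with your own bucket and queue URL.
    private static final String BUCKET_NAME = "payload-offload-bucket";
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/222222222222/large-message-queue";

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // With payload support enabled, receiveMessage transparently fetches
        // the S3 object referenced in the message and returns the full body.
        ExtendedClientConfiguration config = new ExtendedClientConfiguration()
                .withPayloadSupportEnabled(s3, BUCKET_NAME);

        AmazonSQS sqsExtendedClient =
                new AmazonSQSExtendedClient(AmazonSQSClientBuilder.defaultClient(), config);

        for (Message message : sqsExtendedClient.receiveMessage(QUEUE_URL).getMessages()) {
            processPayload(message.getBody()); // the full large payload, not the S3 pointer
            sqsExtendedClient.deleteMessage(QUEUE_URL, message.getReceiptHandle());
        }
    }

    private static void processPayload(String payload) {
        // Stand-in for your application's real processing logic.
        System.out.println("Received payload of " + payload.length() + " characters");
    }
}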

The Lambda function in Account C uses the Payload Offloading Java Common Library for AWS to get the referenced S3 object.

Once the S3 object is retrieved, the Lambda functions in Accounts B and C process the data and pass on the information to downstream applications.

This architecture uses Amazon SQS and Lambda as subscribers because they provide libraries that support offloading large payloads to Amazon S3. However, you can use any Java-enabled endpoint, such as an HTTPS endpoint that uses Payload Offloading Java Common Library for AWS to de-reference the message content.

Coordination across accounts and multiple AWS Regions

Sometimes applications are spread across AWS Regions, leading to increased latency in coordination. For existing applications, it could take substantial effort to consolidate them into a single Region. Hence, asynchronous coordination is a good fit for this scenario. Figure 3 expands on the architecture presented earlier to include multiple AWS Regions.

Figure 3. Application coordination across accounts and multiple AWS Regions

The Lambda function in Account C is in the same Region as the upstream application in Account A, but the Lambda function in Account B is in a different Region. These functions must retrieve the payload from the S3 bucket in Account A.

To provide access, configure the AWS Lambda execution role with the appropriate permissions. Make sure that the S3 bucket policy allows access to the Lambda functions from Accounts B and C.
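
A minimal sketch of such a bucket policy follows; the role names, account IDs, and bucket name are hypothetical placeholders:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSubscriberPayloadReads",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::222222222222:role/application-b-lambda-role",
                    "arn:aws:iam::333333333333:role/application-c-lambda-role"
                ]
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::payload-offload-bucket/*"
        }
    ]
}

Note that cross-account S3 access requires an allow on both sides: the subscribers’ Lambda execution roles also need a matching s3:GetObject statement in their identity policies.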

Considerations

For variable message sizes, you can specify that payloads are always stored in Amazon S3 regardless of their size, which can help simplify the design.
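
In the Java extended client libraries, this appears to be a single configuration switch; the following is a sketch under that assumption:

// Assumed flag: offload every payload to S3, even those under 256 KB.
ExtendedClientConfiguration config = new ExtendedClientConfiguration()
        .withPayloadSupportEnabled(s3, BUCKET_NAME);
config.setAlwaysThroughS3(true);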

If an application that publishes or subscribes to large messages is implemented using the AWS Java SDK, it must use Java 8 or higher. Service-specific client libraries are also available in Python, C#, and Node.js.

An Amazon S3 Multi-Region Access Point can be an alternative to a centralized bucket for the payloads. It has not been explored in this post due to the asynchronous nature of cross-region replication.

In general, retrieval of data across Regions is slower than in the same Region. For faster retrieval, workloads should be run in the same AWS Region.

Conclusion

This post demonstrates how to use event-driven architectures for coordinating applications that need to exchange large messages across AWS accounts and Regions. The messaging and automation are enabled by the Payload Offloading Java Common Library for AWS and use Amazon S3 as the intermediate data store. These components can simplify the solution implementation and improve scalability, fault-tolerance, and performance of your applications.

Ready to get started? Explore SQS Large Message Handling.

Security updates for Monday

Post Syndicated from original https://lwn.net/Articles/904191/

Security updates have been issued by Debian (chromium, libtirpc, and xorg-server), Fedora (giflib, mingw-giflib, and teeworlds), Mageia (chromium-browser-stable, kernel, kernel-linus, mingw-giflib, osmo, python-m2crypto, and sqlite3), Oracle (httpd, php, vim, virt:ol and virt-devel:ol, and xorg-x11-server), SUSE (caddy, crash, dpkg, fwupd, python-M2Crypto, and trivy), and Ubuntu (gdk-pixbuf, libjpeg-turbo, and phpliteadmin).
