[$] Using low-cost wireless sensors in the unlicensed bands

Post Syndicated from original https://lwn.net/Articles/921497/

When it comes to home automation, people often end up with devices
supporting the Zigbee or Z-Wave protocols, but those devices are
relatively expensive. When I was looking for a way to keep an eye on the
temperature at home a few years ago, I bought a bunch of cheap
temperature and humidity sensors emitting radio signals in the unlicensed
ISM (Industrial, Scientific, and Medical) frequency bands instead. Thanks to rtl_433 and, more recently, rtl_433_ESP and OpenMQTTGateway,
I was able to integrate their measurements easily into my home-automation
system.

Year in Review: Rapid7 Threat Intelligence

Post Syndicated from Stacy Moran original https://blog.rapid7.com/2023/01/31/year-in-review-rapid7-threat-intelligence/

Year in Review: Rapid7 Threat Intelligence

In an evolving threat landscape, non-stop alerts and more IOC feeds don’t guarantee better protection. Security teams are overwhelmed and struggle to identify relevant threat information.

Thankfully, Threat Command delivers highly contextual alerts and integration across your environment to help you cut through the noise, enable prioritization, streamline operations, and reduce brand exposure. Threat Command external threat intelligence protects organizations in every industry from targeted threats across the clear, deep, and dark web.

As we forge into 2023, we remain laser-focused and committed to addressing the critical needs of resource-constrained security operations teams:

  • Accessible and actionable external threat intelligence
  • Better visibility for faster decisions
  • Greater relevance, less noise
  • Simplified security workflows
  • Accelerated response
  • Faster time-to-value

But first, let’s take a look at the ways we improved Threat Command in 2022.

Executing on Our Promise of Value
2022 Product Feature Introductions and Enhancements

Throughout 2022, we continuously iterated and improved upon the capabilities of Threat Command, making it an even more effective resource to keep your organization safe from external threats. Here is a rundown of some of the most important improvements we made last year.

First Half 2022

In our blog Threat Intel Enhances Rapid7 XDR With Improved Visibility and Context”, we summarize the unmistakable value threat intelligence brings to the Rapid7 solution portfolio in year one following the IntSights acquisition. Highlights include:

  • Threat Command + InsightIDR integration: The only 360-degree XDR solution in the market that infuses generic threat intelligence (IOCs) and customized digital risk protection coverage. Unlock a comprehensive view of your external and internal attack surface by seeing Threat Command alerts alongside IDR detections.
  • Threat Command Vulnerability Risk Analyzer + InsightVM integration: Rely on threat intelligence vulnerability context and risk prioritization that eliminates the guesswork of manual patch management.
  • Twitter Chatter: Know when your company is mentioned in negative discourse on Twitter.
  • Information Stealers: Get alerted when employees have been compromised by malware that gathers leaked credentials and private data from infected devices. In many cases, this scenario plays out on employee-owned personal devices, drastically amplifying potential risk to the organization.
  • Asset Management: Track your most targeted digital assets for a more proactive defense. Categorize your assets using tags and comments, and automatically generate policy conditions and bulk actions for alerts.
  • Strategic Intelligence: The first strategic dashboard for CISOs delivers visualization of threats specifically targeting the organization – critical input for assessing, planning, and budgeting for future security investments. This is the threat intelligence market’s only comprehensive view of an organization’s external threat landscape (aligned to the MITRE ATT&CK framework).
Year in Review: Rapid7 Threat Intelligence

Second Half 2022

Rapid7 + ServiceNow: In the second half of the year, we released Threat Command for ServiceNow ITSM. Users of both platforms now have access to an end-to-end integration for managing security incidents:

  • Quickly and easily create ServiceNow incidents based on Threat Command alert data for streamlined incident response from a single pane of glass within ServiceNow.
  • Create incidents in your ServiceNow instance based on Threat Command alert data and assign ITSM tickets to specific users or groups.

Customers can install the app now from the ServiceNow store.

Learn more: Threat Command ServiceNow ITSM Integration Brief

Year in Review: Rapid7 Threat Intelligence

Rapid7 + MISP: Our Threat Intelligence Platform (TIP) now integrates with MISP (Malware Information Sharing Platform), an open-source TI platform that collects and shares indicators of compromise related to security incidents. This integration allows users to ingest enriched IOCs from our TIP and create events in MISP cloud devices.

Year in Review: Rapid7 Threat Intelligence

TIP Investigation Enhancements

  • Filterable user events now appear in the IOC Timeline for improved visibility and investigation efficiency. Users can view events related to specific IOCs, sorted by date.
  • See the relation types between related IOCs on the Investigation map for 360-degree visibility and faster investigations.
  • View Threat Command alert indications on IOC nodes in the Investigation map for additional visibility.

Leaked Credentials Enhancements

  • Our Leaked Credentials coverage now supports a wide variety of additional database formats, allowing broader visibility into the ever-expanding threat of leaked credentials detected in various breaches and hacker campaigns across the clear, deep, and dark web.

Looking Ahead

Lots happening in 2023! Look for our new Forrester Total Economic Impact of Rapid7 Threat Command for Digital Risk Protection and Threat Intelligence in early Q2 (sneak peak: our ROI number surpasses that of our primary competitors!) and new solutions packages that scale with customer needs across the maturity spectrum and offer opportunities to maximize ROI.

Stay tuned!

There are many more exciting feature enhancements and new releases planned throughout the year. A big thank you to all of our customers and partners. We look forward to delivering even more value to you in 2023!

Learn more about how Threat Command simplifies threat intelligence, delivering instant value for organizations of any size or maturity, while reducing risk exposure. Watch an on-demand demo to see how Threat Command takes the complexity out of threat intelligence with an intuitive platform that prioritizes the most critical threats to your organization.

Want to find out where and how your organization is being targeted? Get a free threat report now.

Year in Review: Rapid7 Threat Intelligence

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/921765/

Security updates have been issued by CentOS (bind, firefox, java-1.8.0-openjdk, java-11-openjdk, kernel, libXpm, pki-core, sssd, sudo, thunderbird, tigervnc, and xorg-x11-server), Debian (cinder, glance, libarchive, libhtml-stripscripts-perl, modsecurity-crs, node-moment, node-qs, nova, ruby-git, ruby-rack, and tiff), Fedora (java-17-openjdk, rust-bat, rust-cargo-c, rust-git-delta, rust-gitui, rust-pore, rust-silver, rust-tokei, and seamonkey), Oracle (libksba), Red Hat (kernel, kernel-rt, kpatch-patch, libksba, and pcs), Scientific Linux (libksba), SUSE (apache2-mod_auth_openidc, ghostscript, libarchive, nginx, python, vim, and xen), and Ubuntu (cinder, glance, linux-raspi, nova, python-future, and sudo).

Backblaze Drive Stats for 2022

Post Syndicated from original https://www.backblaze.com/blog/backblaze-drive-stats-for-2022/

As of December 31, 2022, we had 235,608 drives under management. Of that number, there were 4,299 boot drives and 231,309 data drives. This report will focus on our data drives. We’ll review the hard drive failure rates for 2022, compare those rates to previous years, and present the lifetime failure statistics for all the hard drive models active in our data center as of the end of 2022. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

2022 Hard Drive Failure Rates

At the end of 2022, Backblaze was monitoring 231,309 hard drives used to store data. For our evaluation, we removed 388 drives from consideration which were used for either testing purposes or drive models for which we did not have at least 60 drives. This leaves us with 230,921 hard drives to analyze for this report.

Observations and Notes

One Zero for the Year

In 2022, only one drive had zero failures, the 8TB Seagate (model: ST8000NM000A). That “zero” does come with some caveats: We have only 79 drives in service and the drive has a limited number of drive days—22,839. These drives are used as spares to replace 8TB drives that have failed.

What About the Old Guys?

  • The 6TB Seagate (model: ST6000DX000) drive is the oldest in our fleet with an average age of 92.5 months. In 2021, it had an annualized failure rate (AFR) of just 0.11%, but has slipped a bit to 0.68% for 2022. A very respectable number any time, but especially after nearly eight years of duty.
  • The 4TB Toshiba (model: MD04ABA400V) drives have an average age of 91.3 months. In 2021, this drive has an AFR of 2.04% and that has jumped to 3.13% for 2022, which included three drive failures. Given the limited number of drives and drive days for this model, if there were only two drive failures in 2022, the AFR would be 2.08%, or nearly the same as 2021.
  • Both of these drive models have a relatively small number of drive days, so confidence in the AFR numbers is debatable. That said, both drives have performed well over their lifespan.

New Models

In 2021, we added five new models while retiring zero, giving us a total of 29 different models we are tracking. Here are the five new models:

  1. HUH728080ALE604–8TB
  2. ST8000NM000A–8TB
  3. ST16000NM002J–16TB
  4. MG08ACA16TA–16TB
  5. WUH721816ALE6L4–16TB

The two 8TB drive models are being used to replace failed 8TB drives. The three 16TB drive models are additive to the inventory.

Comparing Drive Stats for 2020, 2021, and 2022

The chart below compares the AFR for each of the last three years. The data for each year is inclusive of that year only and the operational drive models present at the end of each year.

Drive Failure Was Up in 2022

After a slight increase in AFR from 2020 to 2021, there was a more notable increase in AFR in 2022 from 1.01% in 2021 to 1.37%. What happened? In our Q2 2022 and Q3 2022 quarterly Drive Stats reports, we noted an increase in the overall AFR from the previous quarter and attributed it to the aging fleet of drives. But, is that really the case? Let’s take a look at some of the factors at play that could cause the rise in AFR for 2022. We’ll start with drive size.

Drive Size and Drive Failure

The chart below compares 2021 and 2022 AFR for our large drives (which we’ve defined as 12TB, 14TB, and 16TB drives) to our smaller drives (which we’ve defined as 4TB, 6TB , 8TB, and 10TB drives).

With the exception of the 16TB drives, every drive size had an increase in their AFR from 2021 to 2022. In the case of the small drives, the increase was pronounced, and at 2.12% is well above the 1.37% AFR for 2022 for all drives.

In addition, while the small drive cohort represents only 28.7% of the drive days in 2022, they account for 44.5% of the drive failures. Our smaller drives are failing more often, but they are also older, so let’s take a closer look at that.

Drive Age and Drive Failure

When examining the correlation of drive age to drive failure we should start with our previous look at the hard drive failure bathtub curve. There we concluded that drives generally fail more often as they age. To see if that matters here, we’ll start with the table below which shows the average age of each drive model of drives by size.

With the exception of the 8TB Seagate (model: ST8000NM000A), which we recently purchased as replacements for failed 8TB drives, the drives fall neatly into our two groups noted above—10TB and below and 12TB and up.

Now let’s group the individual drive models into cohorts defined by drive size. But before we do, we should remember that the 6TB and 10TB drive models have a relatively small number of drives and drive days in comparison to the remaining drive groups. In addition, the 6TB and 10TB drive cohorts consist of one drive model, while the other drive groups have at least four different drive models. Still, leaving them out seems incomplete, so we’ve included tables with and without the 6TB and 10TB drive cohorts.

Each table shows the relationship for each drive size, between the average age of the drives and their associated AFR. The chart on the right (V2) clearly shows that the older drives, when grouped by size, fail more often. This increase as a drive model ages follows the bathtub curve we spoke of earlier.

So, What Caused the Increase in Drive Failure and Does it Matter?

The aging of our fleet of hard drives does appear to be the most logical reason for the increased AFR in 2022. We could dig in further, but that is probably moot at this point. You see, we spent 2022 building out our presence in two new data centers, the Nautilus facility in Stockton, California and the CoreSite facility in Reston, Virginia. In 2023, our focus is expected to be on replacing our older drives with 16TB and larger hard drives. The 4TB drives and yes, even our O.G. 6TB Seagate drives could go. We’ll keep you posted.

Drive Failures by Manufacturer

We’ve looked at drive failure by drive age and drive size, so it’s only right to look at drive failure by manufacturer. Below we have plotted the quarterly AFR over the last three years by manufacturer.

Starting in Q1 of 2021 and continuing to the end of 2022, we can see that the overall rise in the overall AFR over that time seems to be driven by Seagate and, to a lesser degree, Toshiba, although HGST contributes heavily to the Q1 2022 rise. In the case of Seagate, this makes sense as most of our Seagate drives are significantly older than any of the other manufacturers’ drives.

Before you throw your Seagate and Toshiba drives in the trash, you might want to consider the lifecycle cost of a given hard drive model versus its failure rate. We looked at this in our Q3 2022 Drive Stats report, and outlined the trade-offs between drive cost and failure rates. For example, in general, Seagate drives are less expensive and their failure rates are typically higher in our environment. But, their failure rates are typically not high enough to make them less cost effective over their lifetime. You could make a good case that for us, many Seagate drive models are just as cost effective as more expensive drives. It helps that our B2 Cloud Storage platform is built with drive failure in mind, but we’ll admit that fewer drive failures is never a bad thing.

Lifetime Hard Drive Stats

The table below is the lifetime AFR of all the drive models in production as of December 31, 2022.

The current lifetime AFR is 1.39%, which is down from a year ago (1.40%) and also down from last quarter (1.41%). The lifetime AFR is less prone to rapid changes due to temporary fluctuations in drive failures and is a good indicator of a drive model’s AFR. But it takes a fair amount of observations (in our case, drive days) to be confident in that number. To that end, the table below shows only those drive models which have accumulated one million drive days or more in their lifetime. We’ve ordered the list by drive days.

Finally, we are going to open up a bit here and share the results of the 388 drives we removed from our analysis because they were test drives or drive models with 60 or fewer drives. These drives are divided amongst 20 different drive models and the table below lists those drive models which were operational in our data centers as of December 31, 2022. Big caveat here: these are just test drives and so on, so be gentle. We usually ignore them in the reports, so this is their chance to shine, or not. We look forward to seeing your comments.

There are many reasons why these drives got to this point in their Backblaze career, but we’ll save those stories for another time. At this point, we’re just sharing to be forthright about the data, but there are certainly tales to be told. Stay tuned.

Our Annual Drive Stats Webinar

Join me on Tuesday, February 7 at 10 a.m. PT to review the results of the 2022 report. You’ll get a look behind the scenes at the data and the process we use to create the annual report.

Sign Up for the Webinar

The Hard Drive Stats Data

The complete data set used to create the tables and charts in this report is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data itself to anyone; it is free.

If you just want the data used to create the tables and charts in this blog post you can download the ZIP file containing the CSV files for each chart.

Good luck and let us know if you find anything interesting.

Want More Insights?

Check out our take on Hard Drive Cost per Gigabyte and Hard Drive Life Expectancy.

Interested in the SSD Data?

Read the most recent SSD edition of our Drive Stats Report.

The post Backblaze Drive Stats for 2022 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

CVE-2022-47929: traffic control noqueue no problem?

Post Syndicated from Frederick Lawler original https://blog.cloudflare.com/cve-2022-47929-traffic-control-noqueue-no-problem/

CVE-2022-47929: traffic control noqueue no problem?

CVE-2022-47929: traffic control noqueue no problem?

USER namespaces power the functionality of our favorite tools such as docker, podman, and kubernetes. We wrote about Linux namespaces back in June and explained them like this:

Most of the namespaces are uncontroversial, like the UTS namespace which allows the host system to hide its hostname and time. Others are complex but straightforward – NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is this very special, very curious USER namespace. USER namespace is special since it allows the – typically unprivileged owner to operate as “root” inside it. It’s a foundation to having tools like Docker to not operate as true root, and things like rootless containers.

Due to its nature, allowing unprivileged users access to USER namespace always carried a great security risk. With its help the unprivileged user can in fact run code that typically requires root. This code is often under-tested and buggy. Today we will look into one such case where USER namespaces are leveraged to exploit a kernel bug that can result in an unprivileged denial of service attack.

Enter Linux Traffic Control queue disciplines

In 2019, we were exploring leveraging Linux Traffic Control’s queue discipline (qdisc) to schedule packets for one of our services with the Hierarchy Token Bucket (HTB) classful qdisc strategy. Linux Traffic Control is a user-configured system to schedule and filter network packets. Queue disciplines are the strategies in which packets are scheduled. In particular, we wanted to filter and schedule certain packets from an interface, and drop others into the noqueue qdisc.

noqueue is a special case qdisc, such that packets are supposed to be dropped when scheduled into it. In practice, this is not the case. Linux handles noqueue such that packets are passed through and not dropped (for the most part). The documentation states as much. It also states that “It is not possible to assign the noqueue queuing discipline to physical devices or classes.” So what happens when we assign noqueue to a class?

Let’s write some shell commands to show the problem in action:

1. $ sudo -i
2. # dev=enp0s5
3. # tc qdisc replace dev $dev root handle 1: htb default 1
4. # tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
5. # tc qdisc add dev $dev parent 1:1 handle 10: noqueue

  1. First we need to log in as root because that gives us CAP_NET_ADMIN to be able to configure traffic control.
  2. We then assign a network interface to a variable. These can be found with ip a. Virtual interfaces can be located by calling ls /sys/devices/virtual/net. These will match with the output from ip a.
  3. Our interface is currently assigned to the pfifo_fast qdisc, so we replace it with the HTB classful qdisc and assign it the handle of 1:. We can think of this as the root node in a tree. The “default 1” configures this such that unclassified traffic will be routed directly through this qdisc which falls back to pfifo_fast queuing. (more on this later)
  4. Next we add a class to our root qdisc 1:, assign it to the first leaf node 1 of root 1: 1:1, and give it some reasonable configuration defaults.
  5. Lastly, we add the noqueue qdisc to our first leaf node in the hierarchy: 1:1. This effectively means traffic routed here will be scheduled to noqueue

Assuming our setup executed without a hitch, we will receive something similar to this kernel panic:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
...
Call Trace:
<TASK>
htb_enqueue+0x1c8/0x370
dev_qdisc_enqueue+0x15/0x90
__dev_queue_xmit+0x798/0xd00
...
</TASK>

We know that the root user is responsible for setting qdisc on interfaces, so if root can crash the kernel, so what? We just do not apply noqueue qdisc to a class id of a HTB qdisc:

# dev=enp0s5
# tc qdisc replace dev $dev root handle 1: htb default 1
# tc class add dev $dev parent 1: classid 1:2 htb rate 10mbit // A
// B is missing, so anything not filtered into 1:2 will be pfifio_fast

Here, we leveraged the default case of HTB where we assign a class id 1:2 to be rate-limited (A), and implicitly did not set a qdisc to another class such as id 1:1 (B). Packets queued to (A) will be filtered to HTB_DIRECT and packets queued to (B) will be filtered into pfifo_fast.

Because we were not familiar with this part of the codebase, we notified the mailing lists and created a ticket. The bug did not seem all that important to us at that time.

Fast-forward to 2022, we are pushing USER namespace creation hardening. We extended the Linux LSM framework with a new LSM hook: userns_create to leverage eBPF LSM for our protections, and encourage others to do so as well. Recently while combing our ticket backlog, we rethought this bug. We asked ourselves, “can we leverage USER namespaces to trigger the bug?” and the short answer is yes!

Demonstrating the bug

The exploit can be performed with any classful qdisc that assumes a struct Qdisc.enqueue function to not be NULL (more on this later), but in this case, we are demonstrating just with HTB.

$ unshare -rU –net
$ dev=lo
$ tc qdisc replace dev $dev root handle 1: htb default 1
$ tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
$ tc qdisc add dev $dev parent 1:1 handle 10: noqueue
$ ping -I $dev -w 1 -c 1 1.1.1.1

We use the “lo” interface to demonstrate that this bug is triggerable with a virtual interface. This is important for containers because they are fed virtual interfaces most of the time, and not the physical interface. Because of that, we can use a container to crash the host as an unprivileged user, and thus perform a denial of service attack.

Why does that work?

To understand the problem a bit better, we need to look back to the original patch series, but specifically this commit that introduced the bug. Before this series, achieving noqueue on interfaces relied on a hack that would set a device qdisc to noqueue if the device had a tx_queue_len = 0. The commit d66d6c3152e8 (“net: sched: register noqueue qdisc”) circumvents this by explicitly allowing noqueue to be added with the tc command without needing to get around that limitation.

The way the kernel checks for whether we are in a noqueue case or not, is to simply check if a qdisc has a NULL enqueue() function. Recall from earlier that noqueue does not necessarily drop packets in practice? After that check in the fail case, the following logic handles the noqueue functionality. In order to fail the check, the author had to cheat a reassignment from noop_enqueue() to NULL by making enqueue = NULL in the init which is called way after register_qdisc() during runtime.

Here is where classful qdiscs come into play. The check for an enqueue function is no longer NULL. In this call path, it is now set to HTB (in our example) and is thus allowed to enqueue the struct skb to a queue by making a call to the function htb_enqueue(). Once in there, HTB performs a lookup to pull in a qdisc assigned to a leaf node, and eventually attempts to queue the struct skb to the chosen qdisc which ultimately reaches this function:

include/net/sch_generic.h

static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
				struct sk_buff **to_free)
{
	qdisc_calculate_pkt_len(skb, sch);
	return sch->enqueue(skb, sch, to_free); // sch->enqueue == NULL
}

We can see that the enqueueing process is fairly agnostic from physical/virtual interfaces. The permissions and validation checks are done when adding a queue to an interface, which is why the classful qdics assume the queue to not be NULL. This knowledge leads us to a few solutions to consider.

Solutions

We had a few solutions ranging from what we thought was best to worst:

  1. Follow tc-noqueue documentation and do not allow noqueue to be assigned to a classful qdisc
  2. Instead of checking for NULL, check for struct noqueue_qdisc_ops, and reset noqueue to back to noop_enqueue
  3. For each classful qdisc, check for NULL and fallback

While we ultimately went for the first option: “disallow noqueue for qdisc classes”, the third option creates a lot of churn in the code, and does not solve the problem completely. Future qdiscs implementations could forget that important check as well as the maintainers. However, the reason for passing on the second option is a bit more interesting.

The reason we did not follow that approach is because we need to first answer these questions:

Why not allow noqueue for classful qdiscs?

This contradicts the documentation. The documentation does have some precedent for not being totally followed in practice, but we will need to update that to reflect the current state. This is fine to do, but does not address the behavior change problem other than remove the NULL dereference bug.

What behavior changes if we do allow noqueue for qdiscs?

This is harder to answer because we need to determine what that behavior should be. Currently, when noqueue is applied as the root qdisc for an interface, the path is to essentially allow packets to be processed. Claiming a fallback for classes is a different matter. They may each have their own fallback rules, and how do we know what is the right fallback? Sometimes in HTB the fallback is pass-through with HTB_DIRECT, sometimes it is pfifo_fast. What about the other classes? Perhaps instead we should fall back to the default noqueue behavior as it is for root qdiscs?

We felt that going down this route would only add confusion and additional complexity to queuing. We could also make an argument that such a change could be considered a feature addition and not necessarily a bug fix. Suffice it to say, adhering to the current documentation seems to be the more appealing approach to prevent the vulnerability now, while something else can be worked out later.

Takeaways

First and foremost, apply this patch as soon as possible. And consider hardening USER namespaces on your systems by setting sysctl -w kernel.unprivileged_userns_clone=0, which only lets root create USER namespaces in Debian kernels, sysctl -w user.max_user_namespaces=[number] for a process hierarchy, or consider backporting these two patches: security_create_user_ns() and the SELinux implementation  (now in Linux 6.1.x) to allow you to protect your systems with either eBPF or SELinux. If you are sure you’re not using USER namespaces and in extreme cases, you might consider turning the feature off with CONFIG_USERNS=n. This is just one example of many where namespaces are leveraged to perform an attack, and more are surely to crop up in varying levels of severity in the future.

Special thanks to Ignat Korchagin and Jakub Sitnicki for code reviews and helping demonstrate the bug in practice.

Ransomware Payments Are Down

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/01/ransomware-payments-are-down.html

Chainalysis reports that worldwide ransomware payments were down in 2022.

Ransomware attackers extorted at least $456.8 million from victims in 2022, down from $765.6 million the year before.

As always, we have to caveat these findings by noting that the true totals are much higher, as there are cryptocurrency addresses controlled by ransomware attackers that have yet to be identified on the blockchain and incorporated into our data. When we published last year’s version of this report, for example, we had only identified $602 million in ransomware payments in 2021. Still, the trend is clear: Ransomware payments are significantly down.

However, that doesn’t mean attacks are down, or at least not as much as the drastic drop-off in payments would suggest. Instead, we believe that much of the decline is due to victim organizations increasingly refusing to pay ransomware attackers.

Къде сме в индекса за възприятие на корупцията на Transparency International? Корупция без граници. България – по-корумпирана от Армения, стои до Гана, Бенин, Сенегал

Post Syndicated from Николай Марченко original https://bivol.bg/bulgaria-transparency-index.html

вторник 31 януари 2023


Промяната у нас за последните две години на практика не се е състояла, ако гледаме по нивото на корупцията в държавата. България е подобрила позицията си в световния индекс за…

Create more partitions and retain data for longer in your MSK Serverless clusters

Post Syndicated from Usama Naseem original https://aws.amazon.com/blogs/big-data/create-more-partitions-and-retain-data-for-longer-in-your-msk-serverless-clusters/

In April 2022, Amazon Managed Streaming for Apache Kafka (Amazon MSK) launched an exciting new capability, Amazon MSK Serverless. Amazon MSK is a fully managed service for Apache Kafka that makes it easier for developers to build and run highly available, secure, and scalable applications based on Apache Kafka. With MSK Serverless, developers can run their applications without having to provision, configure, or optimize their Apache Kafka clusters. MSK Serverless automatically provisions and scales compute and storage resources, so developers have access to on-demand streaming capacity and storage.

Over the remainder of 2022, the team collected customer feedback and worked backward from customer requirements to add new capabilities that made MSK Serverless even better. In this post, we discuss a few of these enhancements in detail and provide an example use case.

Higher default quota for partitions in a cluster

Data in Apache Kafka is written to topics, which can be partitioned into multiple log files called partitions. When a producer application writes data to a topic, it is appended to one of these partitions. MSK Serverless launched with a maximum quota of 120 partitions per cluster. However, our customers told us that they needed more partitions per cluster for a variety of use cases, ranging from change data capture (CDC) to faster real-time data processing.

In December 2022, we increased the default quota for partitions for MSK Serverless clusters. With the increased quota, you can create up to 2,400 partitions per cluster. The 20-fold increase in the number of partitions you can have per cluster lets you create more topics per cluster and have more applications consume data in parallel. You can also implement better isolation of data with fine-grained access control. More partitions are particularly useful for CDC use cases where each table in the database has hundreds of unqiue keys, which are each mapped to a unique partition. With more partitions, you can use MSK Serverless for capturing changes in larger databases with lots of tables and hundreds of keys. Note that the 2,400 limit only applies to leader partitions. MSK Serverless creates two replicas of each partition by default at no additional cost that don’t count towards this limit.

Unlimited data retention duration

The data you produce to your topics can be retained in Apache Kafka for a configurable duration, depending on how long you need to access data using Apache Kafka consumer APIs. Typically, customers retain data for short periods of time, ranging from a few hours to a few days. Previously, MSK Serverless limited data retention to a maximum of 24 hours (1 day), which is sufficient for most popular Apache Kafka use cases. However, some use cases require customers to retain data for longer, such as retaining data for audit purposes or maintaining application recovery SLAs.

Now, with the increase in the data retention duration quota, you can retain data for as long as you need in your MSK Serverless clusters. Longer data retention is particularly useful for use cases where your consumer applications need quick access to older data. For instance, in the case of a failure, the application may need to access data from the start of the topic to reconstruct its state. Because you can now retain data in your topics for longer durations, you can restore your application’s state by accessing older data using Kafka’s consumer API, making it easier to recover from such failures. After the application recovers, you can configure your application to start consuming the data from the earliest timestamp you need to reestablish your application’s state. Note that you can only retain up to 250 GB of data per partition. As long as your partition doesn’t reach 250 GB in size, you may retain it for as long as you wish. You may create more partitions if you need more storage for a given topic.

These new quotas are available in all Regions where MSK Serverless is available. For more information, navigate to the MSK Serverless tab on the Amazon MSK pricing page and choose the Region drop-down menu.

You can also request an increase to the maximum number of partitions quota by contacting AWS Support if you need more than 2,400 partitions in a cluster. The quotas for more partitions and longer retention are applied to both existing and new clusters.

Getting started: Create a topic with 1,000 partitions and 7-day retention

In this section, we demonstrate how to create a topic in MSK Serverless, specify the number of partitions, and set its retention duration.

As a prerequisite, you must have an MSK Serverless cluster and an Apache Kafka client. Refer to Getting started using MSK Serverless clusters for step-by-step instructions.

  1. On your client machine, access kafka_2.12-2.8.1/bin and run the following export command (replace the ‘my-endpoint’ with the bootstrap server string of your MSK Serverless cluster):
    export BS=my-endpoint

  2. Run the following command to create a topic called msk-sample-topic with 1,000 partitions and 7-day data retention (604,800,000 milliseconds):
    <path-to-your-kafka-installation>/bin/kafka-topics.sh 
    --bootstrap-server $BS 
    --command-config client.properties 
    --create 
    --topic msk-sample-topic 
    --partitions 1000 
    --config retention.ms=604800000

  3. (Optional) Run the following command to view the details of the topic you created in step 2 above:
    <path-to-your-kafka-installation>/bin/kafka-topics.sh 
    --bootstrap-server $BS
    --command-config client.properties 
    --topic msk-sample-topic 
    --describe | head -n1

    You will see the following result:

    Topic: msk-sample-topic TopicId: Ze76LY9EQuiH0xOIenx_HA PartitionCount: 1000ReplicationFactor: 3    Configs: min.insync.replicas=2,segment.bytes=134217728,retention.ms=604800000,message.format.version=2.8-IV2,unclean.leader.election.enable=false,retention.bytes=268435456000

Clean up

To avoid incurring charges on the AWS resources created in this post, delete the MSK Serverless cluster and the Amazon Elastic Compute Cloud (Amazon EC2) instance for your client machine.

  1. On the Amazon MSK console, select the MSK Serverless cluster you used for this solution.
  2. Choose Actions, then choose Delete.
  3. On the Amazon EC2 console, select the instance that you created for your Apache Kafka client machine.
  4. Choose Instance state, then choose Terminate instance.

Conclusion

This post demonstrated how to create an MSK Serverless cluster topic with 1,000 partitions and 7-day retention. With the new quota increases, you can create up to 2,400 partitions per cluster and retain data for as long as you need. If you have comments or feedback, please feel free to leave them in the comments.


About the author

Usama Naseem is a Senior Product Manager for Amazon MSK and focuses on MSK Serverless. Previously, he held product management roles for AWS Lambda and Amazon Fresh. He is passionate about giving customers the tools to build real-time applications in the cloud. Outside of work, he continues to be under the delusion that he will be the best squash player in the world one day.

New – Deployment Pipelines Reference Architecture and Reference Implementations

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new_deployment_pipelines_reference_architecture_and_-reference_implementations/

Today, we are launching a new reference architecture and a set of reference implementations for enterprise-grade deployment pipelines. A deployment pipeline automates the building, testing, and deploying of applications or infrastructures into your AWS environments. When you deploy your workloads to the cloud, having deployment pipelines is key to gaining agility and lowering time to market.

When I talk with you at conferences or on social media, I frequently hear that our documentation and tutorials are good resources to get started with a new service or a new concept. However, when you want to scale your usage or when you have complex or enterprise-grade use cases, you often lack resources to dive deeper.

This is why we have created over the years hundreds of reference architectures based on real-life use cases and also the security reference architecture. Today, we are adding a new reference architecture to this collection.

We used the best practices and lessons learned at Amazon and with hundreds of customer projects to create this deployment pipeline reference architecture and implementations. They go well beyond the typical “Hello World” example: They document how to architect and how to implement complex deployment pipelines with multiple environments, multiple AWS accounts, multiple Regions, manual approval, automated testing, automated code analysis, etc. When you want to increase the speed at which you deliver software to your customers through DevOps and continuous delivery, this new reference architecture shows you how to combine AWS services to work together. They document the mandatory and optional components of the architecture.

Having an architecture document and diagram is great, but having an implementation is even better. Each pipeline type in the reference architecture has at least one reference implementation. One of the reference implementations uses an AWS Cloud Development Kit (AWS CDK) application to deploy the reference architecture on your accounts. It is a good starting point to study or customize the reference architecture to fit your specific requirements.

You will find this reference architecture and its implementations at https://pipelines.devops.aws.dev.

Deployment pipeline reference architecture

Let’s Deploy a Reference Implementation
The new deployment pipeline reference architecture demonstrates how to build a pipeline to deploy a Java containerized application and a database. It comes with two reference implementations. We are working on additional pipeline types to deploy Amazon EC2 AMIs, manage a fleet of accounts, and manage dynamic configuration for your applications.

The sample application is developed with SpringBoot. It runs on top of Corretto, the Amazon-provided distribution of the OpenJDK. The application is packaged with the CDK and is deployed on AWS Fargate. But the application is not important here; you can substitute your own application. The important parts are the infrastructure components and the pipeline to deploy an application. For this pipeline type, we provide two reference implementations. One deploys the application using Amazon CodeCatalyst, the new service that we announced at re:Invent 2022, and one uses AWS CodePipeline. This is the one I choose to deploy for this blog post.

The pipeline starts building the applications with AWS CodeBuild. It runs the unit tests and also runs Amazon CodeGuru to review code quality and security. Finally, it runs Trivy to detect additional security concerns, such as known vulnerabilities in the application dependencies. When the build is successful, the pipeline deploys the application in three environments: beta, gamma, and production. It deploys the application in the beta environment in a single Region. The pipeline runs end-to-end tests in the beta environment. All the tests must succeed before the deployment continues to the gamma environment. The gamma environment uses two Regions to host the application. After deployment in the gamma environment, the deployment into production is subject to manual approval. Finally, the pipeline deploys the application in the production environment in six Regions, with three waves of deployments made of two Regions each.

Deployment Pipelines Reference Architecture

I need four AWS accounts to deploy this reference implementation: one to deploy the pipeline and tooling and one for each environment (beta, gamma, and production). At a high level, there are two deployment steps: first, I bootstrap the CDK for all four accounts, and then I create the pipeline itself in the toolchain account. You must plan for 2-3 hours of your time to prepare your accounts, create the pipeline, and go through a first deployment.

Once the pipeline is created, it builds, tests, and deploys the sample application from its source in AWS CodeCommit. You can commit and push changes to the application source code and see it going through the pipeline steps again.

My colleague Irshad Buch helped me try the pipeline on my account. He wrote a detailed README with step-by-step instructions to let you do the same on your side. The reference architecture that describes this implementation in detail is available on this new web page. The application source code, the AWS CDK scripts to deploy the application, and the AWS CDK scripts to create the pipeline itself are all available on AWS’s GitHub. Feel free to contribute, report issues or suggest improvements.

Available Now
The deployment pipeline reference architecture and its reference implementations are available today, free of charge. If you decide to deploy a reference implementation, we will charge you for the resources it creates on your accounts. You can use the provided AWS CDK code and the detailed instructions to deploy this pipeline on your AWS accounts. Try them today!

— seb

Reduce risk by implementing HttpOnly cookie authentication in Amazon API Gateway

Post Syndicated from Marc Borntraeger original https://aws.amazon.com/blogs/security/reduce-risk-by-implementing-httponly-cookie-authentication-in-amazon-api-gateway/

Some web applications need to protect their authentication tokens or session IDs from cross-site scripting (XSS). It’s an Open Web Application Security Project (OWASP) best practice for session management to store secrets in the browsers’ cookie store with the HttpOnly attribute enabled. When cookies have the HttpOnly attribute set, the browser will prevent client-side JavaScript code from accessing the value. This reduces the risk of secrets being compromised.

In this blog post, you’ll learn how to store access tokens and authenticate with HttpOnly cookies in your own workloads when using Amazon API Gateway as the client-facing endpoint. The tutorial in this post will show you a solution to store OAuth2 access tokens in the browser cookie store, and verify user authentication through Amazon API Gateway. This post describes how to use Amazon Cognito to issue OAuth2 access tokens, but the solution is not limited to OAuth2. You can use other kinds of tokens or session IDs.

The solution consists of two decoupled parts:

  1. OAuth2 flow
  2. Authentication check

Note: This tutorial takes you through detailed step-by-step instructions to deploy an example solution. If you prefer to deploy the solution with a script, see the api-gw-http-only-cookie-auth GitHub repository.

Prerequisites

No costs should incur when you deploy the application from this tutorial because the services you’re going to use are included in the AWS Free Tier. However, be aware that small charges may apply if you have other workloads running in your AWS account and exceed the free tier. Make sure to clean up your resources from this tutorial after deployment.

Solution architecture

This solution uses Amazon Cognito, Amazon API Gateway, and AWS Lambda to build a solution that persists OAuth2 access tokens in the browser cookie store. Figure 1 illustrates the solution architecture for the OAuth2 flow.

Figure 1: OAuth2 flow solution architecture

Figure 1: OAuth2 flow solution architecture

  1. A user authenticates by using Amazon Cognito.
  2. Amazon Cognito has an OAuth2 redirect URI pointing to your API Gateway endpoint and invokes the integrated Lambda function oAuth2Callback.
  3. The oAuth2Callback Lambda function makes a request to the Amazon Cognito token endpoint with the OAuth2 authorization code to get the access token.
  4. The Lambda function returns a response with the Set-Cookie header, instructing the web browser to persist the access token as an HttpOnly cookie. The browser will automatically interpret the Set-Cookie header, because it’s a web standard. HttpOnly cookies can’t be accessed through JavaScript—they can only be set through the Set-Cookie header.

After the OAuth2 flow, you are set up to issue and store access tokens. Next, you need to verify that users are authenticated before they are allowed to access your protected backend. Figure 2 illustrates how the authentication check is handled.

Figure 2: Authentication check solution architecture

Figure 2: Authentication check solution architecture

  1. A user requests a protected backend resource. The browser automatically attaches HttpOnly cookies to every request, as defined in the web standard.
  2. The Lambda function oAuth2Authorizer acts as the Lambda authorizer for HTTP APIs. It validates whether requests are authenticated. If requests include the proper access token in the request cookie header, then it allows the request.
  3. API Gateway only passes through requests that are authenticated.

Amazon Cognito is not involved in the authentication check, because the Lambda function can validate the OAuth2 access tokens by using a JSON Web Token (JWT) validation check.

1. Deploying the OAuth2 flow

In this section, you’ll deploy the first part of the solution, which is the OAuth2 flow. The OAuth2 flow is responsible for issuing and persisting OAuth2 access tokens in the browser’s cookie store.

1.1. Create a mock protected backend

As shown in in Figure 2, you need to protect a backend. For the purposes of this post, you create a mock backend by creating a simple Lambda function with a default response.

To create the Lambda function

  1. In the Lambda console, choose Create function.

    Note: Make sure to select your desired AWS Region.

  2. Choose Author from scratch as the option to create the function.
  3. In the Basic information section as shown in , enter or select the following values:
  4. Choose Create function.
    Figure 3: Configuring the getProtectedResource Lambda function

    Figure 3: Configuring the getProtectedResource Lambda function

The default Lambda function code returns a simple Hello from Lambda message, which is sufficient to demonstrate the concept of this solution.

1.2. Create an HTTP API in Amazon API Gateway

Next, you create an HTTP API by using API Gateway. Either an HTTP API or a REST API will work. In this example, choose HTTP API because it’s offered at a lower price point (for this tutorial you will stay within the free tier).

To create the API Gateway API

  1. In the API Gateway console, under HTTP API, choose Build.
  2. On the Create and configure integrations page, as shown in Figure 4, choose Add integration, then enter or select the following values:
    • Select Lambda.
    • For Lambda function, select the getProtectedResource Lambda function that you created in the previous section.
    • For API name, enter a name. In this example, I used MyApp.
    • Choose Next.
    Figure 4: Configuring API Gateway integrations and API name

    Figure 4: Configuring API Gateway integrations and API name

  3. On the Configure routes page, as shown in Figure 5, enter or select the following values:
    • For Method, select GET.
    • For Resource path, enter / (a single forward slash).
    • For Integration target, select the getProtectedResource Lambda function.
    • Choose Next.
    Figure 5: Configuring API Gateway routes

    Figure 5: Configuring API Gateway routes

  4. On the Configure stages page, keep all the default options, and choose Next.
  5. On the Review and create page, choose Create.
  6. Note down the value of Invoke URL, as shown in Figure 6.
    Figure 6: Note down the invoke URL

    Figure 6: Note down the invoke URL

Now it’s time to test your API Gateway API. Paste the value of Invoke URL into your browser. You’ll see the following message from your Lambda function: Hello from Lambda.

1.3. Use Amazon Cognito

You’ll use Amazon Cognito user pools to create and maintain a user directory, and add sign-up and sign-in to your web application.

To create an Amazon Cognito user pool

  1. In the Amazon Cognito console, choose Create user pool.
  2. On the Authentication providers page, as shown in Figure 7, for Cognito user pool sign-in options, select Email, then choose Next.
    Figure 7: Configuring authentication providers

    Figure 7: Configuring authentication providers

  3. In the Multi-factor authentication pane of the Configure Security requirements page, as shown in Figure 8, choose your MFA enforcement. For this example, choose No MFA to make it simpler for you to test your solution. However, in production for data sensitive workloads you should choose Require MFA – Recommended. Choose Next.
    Figure 8: Configuring MFA

    Figure 8: Configuring MFA

  4. On the Configure sign-up experience page, keep all the default options and choose Next.
  5. On the Configure message delivery page, as shown in Figure 9, choose your email provider. For this example, choose Send email with Cognito to make it simple to test your solution. In production workloads, you should choose Send email with Amazon SES – Recommended. Choose Next.
    Figure 9: Configuring email

    Figure 9: Configuring email

  6. In the User pool name section of the Integrate your app page, as shown in Figure 10, enter or select the following values:
    1. For User pool name, enter a name. In this example, I used MyUserPool.
      Figure 10: Configuring user pool name

      Figure 10: Configuring user pool name

    2. In the Hosted authentication pages section, as shown in Figure 11, select Use the Cognito Hosted UI.
      Figure 11: Configuring hosted authentication pages

      Figure 11: Configuring hosted authentication pages

    3. In the Domain section, as shown in Figure 12, for Domain type, choose Use a Cognito domain. For Cognito domain, enter a domain name. Note that domains in Cognito must be unique. Make sure to enter a unique name, for example by appending random numbers at the end of your domain name. For this example, I used https://http-only-cookie-secured-app.
      Figure 12: Configuring an Amazon Cognito domain

      Figure 12: Configuring an Amazon Cognito domain

    4. In the Initial app client section, as shown in Figure 13, enter or select the following values:
      • For App type, keep the default setting Public client.
      • For App client name, enter a friendly name. In this example, I used MyAppClient.
      • For Client secret, keep the default setting Don’t generate a client secret.
      • For Allowed callback URLs, enter <API_GW_INVOKE_URL>/oauth2/callback, replacing <API_GW_INVOKE_URL> with the invoke URL you noted down from API Gateway in the previous section.
        Figure 13: Configuring the initial app client

        Figure 13: Configuring the initial app client

    5. Choose Next.
  7. Choose Create user pool.

Next, you need to retrieve some Amazon Cognito information for later use.

To note down Amazon Cognito information

  1. In the Amazon Cognito console, choose the user pool you created in the previous steps.
  2. Under User pool overview, make note of the User pool ID value.
  3. On the App integration tab, under Cognito Domain, make note of the Domain value.
  4. Under App client list, make note of the Client ID value.
  5. Under App client list, choose the app client name you created in the previous steps.
  6. Under Hosted UI, make note of the Allowed callback URLs value.

Next, create the user that you will use in a later section of this post to run your test.

To create a user

  1. In the Amazon Cognito console, choose the user pool you created in the previous steps.
  2. Under Users, choose Create user.
  3. For Email address, enter [email protected]. For this tutorial, you don’t need to send out actual emails, so the email address does not need to actually exist.
  4. Choose Mark email address as verified.
  5. For password, enter a password you can remember (or even better: use a password generator).
  6. Remember the email and password for later use.
  7. Choose Create user.

1.4. Create the Lambda function oAuth2Callback

Next, you create the Lambda function oAuth2Callback, which is responsible for issuing and persisting the OAuth2 access tokens.

To create the Lambda function oAuth2Callback

  1. In the Lambda console, choose Create function.

    Note: Make sure to select your desired Region.

  2. For Function name, enter oAuth2Callback.
  3. For Runtime, select Node.js 16.x.
  4. For Architecture, select arm64.
  5. Choose Create function.

After you create the Lambda function, you need to add the code. Create a new folder on your local machine and open it with your preferred integrated development environment (IDE). Add the package.json and index.js files, as shown in the following examples.

package.json

{
  "name": "oAuth2Callback",
  "version": "0.0.1",
  "dependencies": {
    "axios": "^0.27.2",
    "qs": "^6.11.0"
  }
}

In a terminal at the root of your created folder, run the following command.

$ npm install

In the index.js example code that follows, be sure to replace the placeholders with your values.

index.js

const qs = require("qs");
const axios = require("axios").default;
exports.handler = async function (event) {
  const code = event.queryStringParameters?.code;
  if (code == null) {
    return {
      statusCode: 400,
      body: "code query param required",
    };
  }
  const data = {
    grant_type: "authorization_code",
    client_id: "<your client ID from Cognito>",
    // The redirect has already happened, but you still need to pass the URI for validation, so a valid oAuth2 access token can be generated
    redirect_uri: encodeURI("<your callback URL from Cognito>"),
    code: code,
  };
  // Every Cognito instance has its own token endpoints. For more information check the documentation: https://docs.aws.amazon.com/cognito/latest/developerguide/token-endpoint.html
  const res = await axios.post(
    "<your App Client Cognito domain>/oauth2/token",
    qs.stringify(data),
    {
      headers: {
        "Content-Type": "application/x-www-form-urlencoded",
      },
    }
  );
  return {
    statusCode: 302,
    // These headers are returned as part of the response to the browser.
    headers: {
      // The Location header tells the browser it should redirect to the root of the URL
      Location: "/",
      // The Set-Cookie header tells the browser to persist the access token in the cookie store
      "Set-Cookie": `accessToken=${res.data.access_token}; Secure; HttpOnly; SameSite=Lax; Path=/`,
    },
  };
};

Along with the HttpOnly attribute, you pass along two additional cookie attributes:

  • Secure – Indicates that cookies are only sent by the browser to the server when a request is made with the https: scheme.
  • SameSite – Controls whether or not a cookie is sent with cross-site requests, providing protection against cross-site request forgery attacks. You set the value to Lax because you want the cookie to be set when the user is forwarded from Amazon Cognito to your web application (which runs under a different URL).

For more information, see Using HTTP cookies on the MDN Web Docs site.

Afterwards, upload the code to the oAuth2Callback Lambda function as described in Upload a Lambda Function in the AWS Toolkit for VS Code User Guide.

1.5. Configure an OAuth2 callback route in API Gateway

Now, you configure API Gateway to use your new Lambda function through a Lambda proxy integration.

To configure API Gateway to use your Lambda function

  1. In the API Gateway console, under APIs, choose your API name. For me, the name is MyApp.
  2. Under Develop, choose Routes.
  3. Choose Create.
  4. Enter or select the following values:
    • For method, select GET.
    • For path, enter /oauth2/callback.
  5. Choose Create.
  6. Choose GET under /oauth2/callback, and then choose Attach integration.
  7. Choose Create and attach an integration.
    • For Integration type, choose Lambda function.
    • For Lambda function, choose oAuth2Callback from the last step.
  8. Choose Create.

Your route configuration in API Gateway should now look like Figure 14.

Figure 14: Routes for API Gateway

Figure 14: Routes for API Gateway

2. Testing the OAuth2 flow

Now that you have the components in place, you can test your OAuth2 flow. You test the OAuth2 flow by invoking the login on your browser.

To test the OAuth2 flow

  1. In the Amazon Cognito console, choose your user pool name. For me, the name is MyUserPool.
  2. Under the navigation tabs, choose App integration.
  3. Under App client list, choose your app client name. For me, the name is MyAppClient.
  4. Choose View Hosted UI.
  5. In the newly opened browser tab, open your developer tools, so you can inspect the network requests.
  6. Log in with the email address and password you set in the previous section. Change your password, if you’re asked to do so. You can also choose the same password as you set in the previous section.
  7. You should see your Hello from Lambda message.

To test that the cookie was accurately set

  1. Check your browser network tab in the browser developer settings. You’ll see the /oauth2/callback request, as shown in Figure 15.
    Figure 15: Callback network request

    Figure 15: Callback network request

    The response headers should include a set-cookie header, as you specified in your Lambda function. With the set-cookie header, your OAuth2 access token is set as an HttpOnly cookie in the browser, and access is prohibited from any client-side code.

  2. Alternatively, you can inspect the cookie in the browser cookie storage, as shown in Figure 16.

  3. If you want to retry the authentication, navigate in your browser to your Amazon Cognito domain that you chose in the previous section and clear all site data in the browser developer tools. Do the same with your API Gateway invoke URL. Now you can restart the test with a clean state.

3. Deploying the authentication check

In this section, you’ll deploy the second part of your application: the authentication check. The authentication check makes it so that only authenticated users can access your protected backend. The authentication check works with the HttpOnly cookie, which is stored in the user’s cookie store.

3.1. Create the Lambda function oAuth2Authorizer

This Lambda function checks that requests are authenticated.

To create the Lambda function

  1. In the Lambda console, choose Create function.

    Note: Make sure to select your desired Region.

  2. For Function name, enter oAuth2Authorizer.
  3. For Runtime, select Node.js 16.x.
  4. For Architecture, select arm64.
  5. Choose Create function.

After you create the Lambda function, you need to add the code. Create a new folder on your local machine and open it with your preferred IDE. Add the package.json and index.js files as shown in the following examples.

package.json

{
  "name": "oAuth2Authorizer",
  "version": "0.0.1",
  "dependencies": {
    "aws-jwt-verify": "^3.1.0"
  }
}

In a terminal at the root of your created folder, run the following command.

$ npm install

In the index.js example code, be sure to replace the placeholders with your values.

index.js

const { CognitoJwtVerifier } = require("aws-jwt-verify");
function getAccessTokenFromCookies(cookiesArray) {
  // cookieStr contains the full cookie definition string: "accessToken=abc"
  for (const cookieStr of cookiesArray) {
    const cookieArr = cookieStr.split("accessToken=");
    // After splitting you should get an array with 2 entries: ["", "abc"] - Or only 1 entry in case it was a different cookie string: ["test=test"]
    if (cookieArr[1] != null) {
      return cookieArr[1]; // Returning only the value of the access token without cookie name
    }
  }
  return null;
}
// Create the verifier outside the Lambda handler (= during cold start),
// so the cache can be reused for subsequent invocations. Then, only during the
// first invocation, will the verifier actually need to fetch the JWKS.
const verifier = CognitoJwtVerifier.create({
  userPoolId: "<your user pool ID from Cognito>",
  tokenUse: "access",
  clientId: "<your client ID from Cognito>",
});
exports.handler = async (event) => {
  if (event.cookies == null) {
    console.log("No cookies found");
    return {
      isAuthorized: false,
    };
  }
  // Cookies array looks something like this: ["accessToken=abc", "otherCookie=Random Value"]
  const accessToken = getAccessTokenFromCookies(event.cookies);
  if (accessToken == null) {
    console.log("Access token not found in cookies");
    return {
      isAuthorized: false,
    };
  }
  try {
    await verifier.verify(accessToken);
    return {
      isAuthorized: true,
    };
  } catch (e) {
    console.error(e);
    return {
      isAuthorized: false,
    };
  }
};

After you add the package.json and index.js files, upload the code to the oAuth2Authorizer Lambda function as described in Upload a Lambda Function in the AWS Toolkit for VS Code User Guide.

3.2. Configure the Lambda authorizer in API Gateway

Next, you configure your authorizer Lambda function to protect your backend. This way you control access to your HTTP API.

To configure the authorizer Lambda function

  1. In the API Gateway console, under APIs, choose your API name. For me, the name is MyApp.
  2. Under Develop, choose Routes.
  3. Under / (a single forward slash) GET, choose Attach authorization.
  4. Choose Create and attach an authorizer.
  5. Choose Lambda.
  6. Enter or select the following values:
    • For Name, enter oAuth2Authorizer.
    • For Lambda function, choose oAuth2Authorizer.
    • Clear Authorizer caching. For this tutorial, you disable authorizer caching to make testing simpler. See the section Bonus: Enabling authorizer caching for more information about enabling caching to increase performance.
    • Under Identity sources, choose Remove.

      Note: Identity sources are ignored for your Lambda authorizer. These are only used for caching.

    • Choose Create and attach.
  7. Under Develop, choose Routes to inspect all routes.

Now your API Gateway route /oauth2/callback should be configured as shown in Figure 17.

Figure 17: API Gateway route configuration

Figure 17: API Gateway route configuration

4. Testing the OAuth2 authorizer

You did it! From your last test, you should still be authenticated. So, if you open the API Gateway Invoke URL in your browser, you’ll be greeted from your protected backend.

In case you are not authenticated anymore, you’ll have to follow the steps again from the section Testing the OAuth2 flow to authenticate.

When you inspect the HTTP request that your browser makes in the developer tools as shown in Figure 18, you can see that authentication works because the HttpOnly cookie is automatically attached to every request.

Figure 18: Browser requests include HttpOnly cookies

Figure 18: Browser requests include HttpOnly cookies

To verify that your authorizer Lambda function works correctly, paste the same Invoke URL you noted previously in an incognito window. Incognito windows do not share the cookie store with your browser session, so you see a {"message":"Forbidden"} error message with HTTP response code 403 – Forbidden.

Cleanup

Delete all unwanted resources to avoid incurring costs.

To delete the Amazon Cognito domain and user pool

  1. In the Amazon Cognito console, choose your user pool name. For me, the name is MyUserPool.
  2. Under the navigation tabs, choose App integration.
  3. Under Domain, choose Actions, then choose Delete Cognito domain.
  4. Confirm by entering your custom Amazon Cognito domain, and choose Delete.
  5. Choose Delete user pool.
  6. Confirm by entering your user pool name (in my case, MyUserPool), and then choose Delete.

To delete your API Gateway resource

  1. In the API Gateway console, select your API name. For me, the name is MyApp.
  2. Under Actions, choose Delete and confirm your deletion.

To delete the AWS Lambda functions

  1. In the Lambda console, select all three of the Lambda functions you created.
  2. Under Actions, choose Delete and confirm your deletion.

Bonus: Enabling authorizer caching

As mentioned earlier, you can enable authorizer caching to help improve your performance. When caching is enabled for an authorizer, API Gateway uses the authorizer’s identity sources as the cache key. If a client specifies the same parameters in identity sources within the configured Time to Live (TTL), then API Gateway uses the cached authorizer result, rather than invoking your Lambda function.

To enable caching, your authorizer must have at least one identity source. To cache by the cookie request header, you specify $request.header.cookie as the identity source. Be aware that caching will be affected if you pass along additional HttpOnly cookies apart from the access token.

For more information, see Working with AWS Lambda authorizers for HTTP APIs in the Amazon API Gateway Developer Guide.

Conclusion

In this blog post, you learned how to implement authentication by using HttpOnly cookies. You used Amazon API Gateway and AWS Lambda to persist and validate the HttpOnly cookies, and you used Amazon Cognito to issue OAuth2 access tokens. If you want to try an automated deployment of this solution with a script, see the api-gw-http-only-cookie-auth GitHub repository.

The application of this solution to protect your secrets from potential cross-site scripting (XSS) attacks is not limited to OAuth2. You can protect other kinds of tokens, sessions, or tracking IDs with HttpOnly cookies.

In this solution, you used NodeJS for your Lambda functions to implement authentication. But HttpOnly cookies are widely supported by many programing frameworks. You can find more implementation options on the OWASP Secure Cookie Attribute page.

Although this blog post gives you a tutorial on how to implement HttpOnly cookie authentication in API Gateway, it may not meet all your security and functional requirements. Make sure to check your business requirements and talk to your stakeholders before you adopt techniques from this blog post.

Furthermore, it’s a good idea to continuously test your web application, so that cookies are only set with your approved security attributes. For more information, see the OWASP Testing for Cookies Attributes page.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon API Gateway re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Marc Borntraeger

Marc Borntraeger

Marc is a Solutions Architect in healthcare, based in Zurich, Switzerland. He helps security-sensitive customers such as hospitals to re-innovate themselves with AWS.

A 16x NVIDIA GPU 128 Core Arm Server Supermicro ARS-210M-NR with Ampere Altra

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/a-16x-nvidia-gpu-128-core-arm-server-supermicro-ars-210m-nr-with-ampere-altra/

We review the Supermicro ARS-210M-NR with 16x NVIDIA Ampere 16GB GPUs and 128 Ampere Altra Max Arm cores designed for cloud deployments

The post A 16x NVIDIA GPU 128 Core Arm Server Supermicro ARS-210M-NR with Ampere Altra appeared first on ServeTheHome.

AWS Week in Review – January 30, 2023

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/aws-week-in-review-january-30-2023/

This week’s review post comes to you from the road, having just wrapped up sponsorship of NDC London. While there we got to speak to many .NET developers, both new and experienced with AWS, and all eager to learn more. Thanks to everyone who stopped by our expo booth to chat or ask questions to the team!

.NET on AWS booth, NDC London 2023.NET on AWS booth, NDC London 2023

Last Week’s Launches
My team will be back on the road to our next events soon, but first, here are just some launches that caught my attention while I was at the expo booth last week:

General availability of Porting Advisor for Graviton: AWS Graviton2 processors are custom designed, Arm64, processors, that deliver increased price performance over comparable x86-64 processors. They’re suitable for a wide range of compute workloads on Amazon Elastic Compute Cloud (Amazon EC2) including application servers, microservices, high-performance computing (HPC), CPU-based ML inference, gaming, any many more. They’re also available in other AWS services such as AWS Lambda, AWS Fargate, to name just a few. The new Porting Advisor for Graviton is a freely available, open-source command line tool for analyzing compatibility of applications you want to run on Graviton-based compute environments. It provides a report that highlights missing or outdated libraries, and code, that you may need to update in order to port your application to run on Graviton processors.

Runtime management controls for AWS Lambda: Automated feature updates, performance improvements, and security patches to runtime environments for Lambda functions is popular with many customers. However, some customers have asked for increased visibility into when these updates occur, and control over when they’re applied. The new runtime management controls for Lambda provide optional capabilities for those customers that require more control over runtime changes. The new controls are optional; by default, all your Lambda functions will continue to receive automatic updates. But, if you wish, you can now apply a runtime management configuration with your functions that specifies how you want updates to be applied. You can find full details on the new runtime management controls in this blog post on the AWS Compute Blog.

General availability of Amazon OpenSearch Serverless: OpenSearch Serverless was one of the livestream segments in the recent AWS on Air re:Invent Recap of previews that were announced at the conference last December. OpenSearch Serverless is now generally available. As a serverless option for Amazon OpenSearch Service, it removes the need to configure, manage, or scale OpenSearch clusters, offering automatic provisioning and scaling of resources to enable fast ingestion and query responses.

Additional connectors for Amazon AppFlow: At AWS re:Invent 2023, I blogged about a release of new data connectors enabling data transfer from a variety of Software-as-a-Service (SaaS) applications to Amazon AppFlow. An additional set of 10 connectors, enabling connectivity from Asana, Google Calendar, JDBC, PayPal, and more, are also now available. Check out the full list of additional connectors launched this past week in this What’s New post.

AWS open-source news and updates: As usual, there’s a new edition of the weekly open-source newsletter highlighting new open-source projects, tools, and demos from the AWS Community. Read edition #143 here – LINK TBD.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS Innovate Data and AI/ML edition: AWS Innovate is a free online event to learn the latest from AWS experts and get step-by-step guidance on using AI/ML to drive fast, efficient, and measurable results.

  • AWS Innovate Data and AI/ML edition for Asia Pacific and Japan is taking place on February 22, 2023. Register here.
  • Registrations for AWS Innovate EMEA (March 9, 2023) and the Americas (March 14, 2023) will open soon. Check the AWS Innovate page for updates.

You can find details on all upcoming events, in-person or virtual, here.

And finally, if you’re a .NET developer, my team will be at Swetugg, in Sweden, February 8-9, and DeveloperWeek, Oakland, California, February 15-17. If you’re in the vicinity at these events, be sure to stop by and say hello!

That’s all for this week. Check back next Monday for another Week in Review!

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS achieves ISO 20000-1:2018 certification for 109 services

Post Syndicated from Rodrigo Fiuza original https://aws.amazon.com/blogs/security/aws-achieves-iso-20000-12018-certification-for-109-services/

We continue to expand the scope of our assurance programs at Amazon Web Services (AWS) and are pleased to announce that AWS Regions and AWS Edge locations are now certified by the International Organization for Standardization (ISO) 20000-1:2018 standard. This certification demonstrates our continuous commitment to adhere to the heightened expectations for cloud service providers.

Published by the International Organization for Standardization (ISO), ISO 20000-1:2018 helps organizations specify requirements for establishing, implementing, maintaining, and continually improving a Service Management System (SMS).

AWS was evaluated by EY CertifyPoint, an independent third-party auditor. The Certificate of Compliance illustrating the AWS compliance status is available through AWS Artifact. AWS Artifact is a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact.

As of this writing, 109 services offered globally are in scope of this certification. For up-to-date information, including when additional services are added, see the AWS ISO 20000-1:2018 certification webpage.

AWS strives to continuously bring services into scope of its compliance programs to help you meet your architectural and regulatory needs. Reach out to your AWS account team if you have questions or feedback about ISO 20000-1:2018 compliance.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; you can reach out to the AWS Compliance team through the Contact Us page.

 
If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Rodrigo Fiuza

Rodrigo is a Security Audit Manager at AWS, based in São Paulo. He leads audits, attestations, certifications, and assessments across Latin America, Caribbean and Europe. Rodrigo has previously worked in risk management, security assurance, and technology audits for the past 12 years.

Run Apache Spark workloads 3.5 times faster with Amazon EMR 6.9

Post Syndicated from Sekar Srinivasan original https://aws.amazon.com/blogs/big-data/run-apache-spark-workloads-3-5-times-faster-with-amazon-emr-6-9/

The Amazon EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open-source Apache Spark. With Amazon EMR release 6.9.0, the EMR runtime for Apache Spark supports equivalent Spark version 3.3.0.

With Amazon EMR 6.9.0, you can now run your Apache Spark 3.x applications faster and at lower cost without requiring any changes to your applications. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we found the EMR runtime for Apache Spark 3.3.0 provides a 3.5 times (using total runtime) performance improvement on average over open-source Apache Spark 3.3.0.

In this post, we analyze the results from our benchmark tests running a TPC-DS application on open-source Apache Spark and then on Amazon EMR 6.9, which comes with an optimized Spark runtime that is compatible with open-source Spark. We walk through a detailed cost analysis and finally provide step-by-step instructions to run the benchmark.

Results observed

To evaluate the performance improvements, we used an open-source Spark performance test utility that is derived from the TPC-DS performance test toolkit. We ran the tests on a seven-node (six core nodes and one primary node) c5d.9xlarge EMR cluster with the EMR runtime for Apache Spark, and a second seven-node self-managed cluster on Amazon Elastic Compute Cloud (Amazon EC2) with the equivalent open-source version of Spark. We ran both the tests with data in Amazon Simple Storage Service (Amazon S3).

Dynamic Resource Allocation (DRA) is a great feature to use for varying workloads. However, for a benchmarking exercise where we compare two platforms purely on performance, and test data volumes don’t change (3 TB in our case), we believe it’s best to avoid variability in order to run an apples-to-apples comparison. In our tests in both open-source Spark and Amazon EMR, we disabled DRA while running the benchmarking application.

The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between Amazon EMR version 6.9.0 and open-source Spark version 3.3.0. We observed that our TPC-DS tests had a total job runtime on Amazon EMR on Amazon EC2 that was 3.5 times faster than that using an open-source Spark cluster of the same configuration.

The per-query speedup on Amazon EMR 6.9 with and without the EMR runtime for Apache Spark is illustrated in the following chart. The horizontal axis shows each query in the 3 TB benchmark. The vertical axis shows the speedup of each query due to the EMR runtime. Notable performance gains are over 10 times faster for TPC-DS queries 24b, 72, 95, and 96.

Cost analysis

The performance improvements of the EMR runtime for Apache Spark directly translate to lower costs. We were able to realize a 67% cost savings running the benchmark application on Amazon EMR in comparison with the cost incurred to run the same application on open-source Spark on Amazon EC2 with the same cluster sizing due to reduced hours of Amazon EMR and Amazon EC2 usage. Amazon EMR pricing is for EMR applications running on EMR clusters with EC2 instances. The Amazon EMR price is added to the underlying compute and storage prices such as EC2 instance price and Amazon Elastic Block Store (Amazon EBS) cost (if attaching EBS volumes). Overall, the estimated benchmark cost in the US East (N. Virginia) Region is $27.01 per run for the open-source Spark on Amazon EC2 and $8.82 per run for Amazon EMR.

Benchmark Job Runtime (Hour) Estimated Cost Total EC2 Instance Total vCPU Total Memory (GiB) Root Device (Amazon EBS)

Open-source Spark on Amazon EC2

(1 primary and 6 core nodes)

2.23 $27.01 7 252 504 20 GiB gp2

Amazon EMR on Amazon EC2

(1 primary and 6 core nodes)

0.63 $8.82 7 252 504 20 GiB gp2

Cost breakdown

The following is the cost breakdown for the open-source Spark on Amazon EC2 job ($27.01):

  • Total Amazon EC2 cost – (7 * $1.728 * 2.23) = (number of instances * c5d.9xlarge hourly rate * job runtime in hour) = $26.97
  • Amazon EBS cost – ($0.1/730 * 20 * 7 * 2.23) = (Amazon EBS per GB-hourly rate * root EBS size * number of instances * job runtime in hour) = $0.042

The following is the cost breakdown for the Amazon EMR on Amazon EC2 job ($8.82):

  • Total Amazon EMR cost – (7 * $0.27 * 0.63) = ((number of core nodes + number of primary nodes)* c5d.9xlarge Amazon EMR price * job runtime in hour) = $1.19
  • Total Amazon EC2 cost – (7 * $1.728 * 0.63) = ((number of core nodes + number of primary nodes)* c5d.9xlarge instance price * job runtime in hour) = $7.62
  • Amazon EBS cost – ($0.1/730 * 20 GiB * 7 * 0.63) = (Amazon EBS per GB-hourly rate * EBS size * number of instances * job runtime in hour) = $0.012

Set up OSS Spark benchmarking

In the following sections, we provide a brief outline of the steps involved in setting up the benchmarking. For detailed instructions with examples, refer to the GitHub repo.

For our OSS Spark benchmarking, we use the open-source tool Flintrock to launch our Amazon EC2-based Apache Spark cluster. Flintrock provides a quick way to launch an Apache Spark cluster on Amazon EC2 using the command line.

Prerequisites

Complete the following prerequisite steps:

  1. Have Python 3.7.x or above.
  2. Have Pip3 22.2.2 or above.
  3. Add the Python bin directory to your environment path. The Flintrock binary will be installed in this path.
  4. Run aws configure to configure your AWS Command Line Interface (AWS CLI) shell to point to the benchmarking account. Refer to Quick configuration with aws configure for instructions.
  5. Have a key pair with restrictive file permissions to access the OSS Spark primary node.
  6. Create a new S3 bucket in your test account if needed.
  7. Copy the TPC-DS source data as input to your S3 bucket.
  8. Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application. Alternatively, you can download a pre-built spark-benchmark-assembly-3.3.0.jar if you want a Spark 3.3.0-based application.

Deploy the Spark cluster and run the benchmark job

Complete the following steps:

  1. Install the Flintrock tool via pip as shown in Steps to setup OSS Spark Benchmarking.
  2. Run the command flintrock configure, which pops up a default configuration file.
  3. Modify the default config.yaml file based on your needs. Alternatively, copy and paste the config.yaml file content to the default configure file. Then save the file to where it was.
  4. Finally, launch the 7-node Spark cluster on Amazon EC2 via Flintrock.

This should create a Spark cluster with one primary node and six worker nodes. If you see any error messages, double-check the config file values, especially the Spark and Hadoop versions and the attributes of download-source and the AMI.

The OSS Spark cluster doesn’t come with YARN resource manager. To enable it, we need to configure the cluster.

  1. Download the yarn-site.xml and enable-yarn.sh files from the GitHub repo.
  2. Replace <private ip of primary node> with the IP address of the primary node in your Flintrock cluster.

You can retrieve the IP address from the Amazon EC2 console.

  1. Upload the files to all the nodes of the Spark cluster.
  2. Run the enable-yarn script.
  3. Enable Snappy support in Hadoop (the benchmark job reads Snappy compressed data).
  4. Download the benchmark utility application JAR file spark-benchmark-assembly-3.3.0.jar to your local machine.
  5. Copy this file to the cluster.
  6. Log in to the primary node and start YARN.
  7. Submit the benchmark job on the open-source Spark cluster as shown in Submit the benchmark job.

Summarize the results

Download the test result file from the output S3 bucket s3://$YOUR_S3_BUCKET/EC2_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv. (Replace $YOUR_S3_BUCKET with your S3 bucket name.) You can use the Amazon S3 console and navigate to the output S3 location or use the AWS CLI.

The Spark benchmark application creates a timestamp folder and writes a summary file inside a summary.csv prefix. Your timestamp and file name will be different from the one shown in the preceding example.

The output CSV files have four columns without header names. They are:

  • Query name
  • Median time
  • Minimum time
  • Maximum time

The following screenshot shows a sample output. We have manually added column names. The way we calculate the geomean and the total job runtime is based on arithmetic means. We first take the mean of the med, min, and max values using the formula AVERAGE(B2:D2). Then we take a geometric mean of the Avg column using the formula GEOMEAN(E2:E105).

Set up Amazon EMR benchmarking

For detailed instructions, see Steps to setup EMR Benchmarking.

Prerequisites

Complete the following prerequisite steps:

  1. Run aws configure to configure your AWS CLI shell to point to the benchmarking account. Refer to Quick configuration with aws configure for instructions.
  2. Upload the benchmark application to Amazon S3.

Deploy the EMR cluster and run the benchmark job

Complete the following steps:

  1. Spin up Amazon EMR in your AWS CLI shell using command line as shown in Deploy EMR Cluster and run benchmark job.
  2. Configure Amazon EMR with one primary (c5d.9xlarge) and six core (c5d.9xlarge) nodes. Refer to create-cluster for a detailed description of AWS CLI options.
  3. Store the cluster ID from the response. You need this in the next step.
  4. Submit the benchmark job in Amazon EMR using add-steps in the AWS CLI.

Summarize the results

Summarize the results from the output bucket s3://$YOUR_S3_BUCKET/blog/EMRONEC2_TPCDS-TEST-3T-RESULT in the same manner as we did for the OSS results and compare.

Clean up

To avoid incurring future charges, delete the resources you created using the instructions in the Cleanup section of the GitHub repo.

  1. Stop the EMR and OSS Spark clusters. You may also delete them if you don’t want to retain the content. You can delete these resources by running the script cleanup-benchmark-env.sh from a terminal in your benchmark environment.
  2. If you used AWS Cloud9 as your IDE for building the benchmark application JAR file using Steps to build spark-benchmark-assembly application, you may want to delete the environment as well.

Conclusion

You can run your Apache Spark workloads 3.5 times (based on total runtime) faster and at lower cost without making any changes to your applications by using Amazon EMR 6.9.0.

To keep up to date, subscribe to the Big Data Blog’s RSS feed to learn more about the EMR runtime for Apache Spark, configuration best practices, and tuning advice.

For past benchmark tests, see Run Apache Spark 3.0 workloads 1.7 times faster with Amazon EMR runtime for Apache Spark. Note that the past benchmark result of 1.7 times performance was based on geometric mean. Based on geometric mean, the performance in Amazon EMR 6.9 was two times faster.


About the authors

Sekar Srinivasan is a Sr. Specialist Solutions Architect at AWS focused on Big Data and Analytics. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions modernizing their architecture and generating insights from their data. In his spare time he likes to work on non-profit projects, especially those focused on underprivileged Children’s education.

Prabu Ravichandran is a Senior Data Architect with Amazon Web Services, focussed on Analytics, data Lake architecture and implementation. He helps customers architect and build scalable and robust solutions using AWS services. In his free time, Prabu enjoys traveling and spending time with family.

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

Post Syndicated from Praveen Allam original https://aws.amazon.com/blogs/big-data/handle-upsert-data-operations-using-open-source-delta-lake-and-aws-glue/

Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. Delta Lake framework provides these two capabilities. In this post, we discuss how to handle UPSERTs (updates and inserts) of the operational data using natively integrated Delta Lake with AWS Glue, and query the Delta Lake using Amazon Athena.

We examine a hypothetical insurance organization that issues commercial policies to small- and medium-scale businesses. The insurance prices vary based on several criteria, such as where the business is located, business type, earthquake or flood coverage, and so on. This organization is planning to build a data analytical platform, and the insurance policy data is one of the inputs to this platform. Because the business is growing, hundreds and thousands of new insurance policies are being enrolled and renewed every month. Therefore, all this operational data needs to be sent to Delta Lake in near-real time so that the organization can perform various analytics, and build machine learning (ML) models to serve their customers in a more efficient and cost-effective way.

Solution overview

The data can originate from any source, but typically customers want to bring operational data to data lakes to perform data analytics. One of the solutions is to bring the relational data by using AWS Database Migration Service (AWS DMS). AWS DMS tasks can be configured to copy the full load as well as ongoing changes (CDC). The full load and CDC load can be brought into the raw and curated (Delta Lake) storage layers in the data lake. To keep it simple, in this post we opt out of the data sources and ingestion layer; the assumption is that the data is already copied to the raw bucket in the form of CSV files. An AWS Glue ETL job does the necessary transformation and copies the data to the Delta Lake layer. The Delta Lake layer ensures ACID compliance of the source data.

The following diagram illustrates the solution architecture.
Architecture diagram

The use case we use in this post is about a commercial insurance company. We use a simple dataset that contains the following columns:

  • Policy – Policy number, entered as text
  • Expiry – Date that policy expires
  • Location – Location type (Urban or Rural)
  • State – Name of state where property is located
  • Region – Geographic region where property is located
  • Insured Value – Property value
  • Business Type – Business use type for property, such as Farming or Retail
  • Earthquake – Is earthquake coverage included (Y or N)
  • Flood – Is flood coverage included (Y or N)

The dataset contains a sample of 25 insurance policies. In the case of a production dataset, it may contain millions of records.

policy_id,expiry_date,location_name,state_code,region_name,insured_value,business_type,earthquake,flood
200242,2023-01-02,Urban,NY,East,1617630,Retail,N,N
200314,2023-01-02,Urban,NY,East,8678500,Apartment,Y,Y
200359,2023-01-02,Rural,WI,Midwest,2052660,Farming,N,N
200315,2023-01-02,Urban,NY,East,17580000,Apartment,Y,Y
200385,2023-01-02,Urban,NY,East,1925000,Hospitality,N,N
200388,2023-01-04,Urban,IL,Midwest,12934500,Apartment,Y,Y
200358,2023-01-05,Urban,WI,Midwest,928300,Office Bldg,N,N
200264,2023-01-07,Rural,NY,East,2219900,Farming,N,N
200265,2023-01-07,Urban,NY,East,14100000,Apartment,Y,Y
100582,2023-03-25,Urban,NJ,East,4651680,Apartment,Y,Y
100487,2023-03-25,Urban,NY,East,5990067,Apartment,N,N
100519,2023-03-25,Rural,NY,East,4102500,Farming,N,N
100462,2023-03-25,Urban,NY,East,3400000,Construction,Y,Y
100486,2023-03-26,Urban,NY,East,9973900,Apartment,Y,Y
100463,2023-03-27,Urban,NY,East,15480000,Office Bldg,Y,Y
100595,2023-03-27,Rural,NY,East,2446600,Farming,N,N
100617,2023-03-27,Urban,VT,Northeast,8861500,Office Bldg,N,N
100580,2023-03-30,Urban,NH,Northeast,97920,Office Bldg,Y,Y
100581,2023-03-30,Urban,NY,East,5150000,Apartment,Y,Y
100475,2023-03-31,Rural,WI,Midwest,1451662,Farming,N,N
100503,2023-03-31,Urban,NJ,East,1761960,Office Bldg,N,N
100504,2023-03-31,Rural,NY,East,1649105,Farming,N,N
100616,2023-03-31,Urban,NY,East,2329500,Apartment,N,N
100611,2023-04-25,Urban,NJ,East,1595500,Office Bldg,Y,Y
100621,2023-04-25,Urban,MI,Central,394220,Retail,N,N

In the following sections, we walk through the steps to perform the Delta Lake UPSERT operations. We use the AWS Management Console to perform all the steps. However, you can also automate these steps using tools like AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), Terraforms, and so on.

Prerequisites

This post is focused towards architects, engineers, developers, and data scientists who build, design, and build analytical solutions on AWS. We expect a basic understanding of the console, AWS Glue, Amazon Simple Storage Service (Amazon S3), and Athena. Additionally, the persona is able to create AWS Identity and Access Management (IAM) policies and roles, create and run AWS Glue jobs and crawlers, and is able work with the Athena query editor.

Use Athena query engine version 3 to query delta lake tables, later in the section “Query the full load using Athena”.

Athena QE V3

Set up an S3 bucket for full and CDC load data feeds

To set up your S3 bucket, complete the following steps:

  1. Log in to your AWS account and choose a Region nearest to you.
  2. On the Amazon S3 console, create a new bucket. Make sure the name is unique (for example, delta-lake-cdc-blog-<some random number>).
  3. Create the following folders:
    1. $bucket_name/fullload – This folder is used for a one-time full load from the upstream data source
    2. $bucket_name/cdcload – This folder is used for copying the upstream data changes
    3. $bucket_name/delta – This folder holds the Delta Lake data files
  4. Copy the sample dataset and save it in a file called full-load.csv to your local machine.
  5. Upload the file using the Amazon S3 console into the folder $bucket_name/fullload.

s3 folders

Set up an IAM policy and role

In this section, we create an IAM policy for the S3 bucket access and a role for AWS Glue jobs to run, and also use the same role for querying the Delta Lake using Athena.

  1. On the IAM console, choose Polices in the navigation pane.
  2. Choose Create policy.
  3. Select JSON tab and paste the following policy code. Replace the {bucket_name} you created in the earlier step.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListingOfFolders",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::{bucket_name}"
            ]
        },
        {
            "Sid": "ObjectAccessInBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::{bucket_name}/*"
        }
    ]
}
  1. Name the policy delta-lake-cdc-blog-policy and select Create policy.
  2. On the IAM console, choose Roles in the navigation pane.
  3. Choose Create role.
  4. Select AWS Glue as your trusted entity and choose Next.
  5. Select the policy you just created, and with two additional AWS managed policies:
    1. delta-lake-cdc-blog-policy
    2. AWSGlueServiceRole
    3. CloudWatchFullAccess
  1. Choose Next.
  2. Give the role a name (for example, delta-lake-cdc-blog-role).

IAM role

Set up AWS Glue jobs

In this section, we set up two AWS Glue jobs: one for full load and one for the CDC load. Let’s start with the full load job.

  1. On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs. AWS Glue Studio opens in a new tab.
  2. Select Spark script editor and choose Create.

Glue Studio Editor

  1. In the script editor, replace the code with the following code snippet
import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','s3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

#Define the table schema
schema = StructType() \
      .add("policy_id",IntegerType(),True) \
      .add("expiry_date",DateType(),True) \
      .add("location_name",StringType(),True) \
      .add("state_code",StringType(),True) \
      .add("region_name",StringType(),True) \
      .add("insured_value",IntegerType(),True) \
      .add("business_type",StringType(),True) \
      .add("earthquake_coverage",StringType(),True) \
      .add("flood_coverage",StringType(),True) 

# Read the full load
sdf = spark.read.format("csv").option("header",True).schema(schema).load("s3://"+ args['s3_bucket']+"/fullload/")
sdf.printSchema()

# Write data as DELTA TABLE
sdf.write.format("delta").mode("overwrite").save("s3://"+ args['s3_bucket']+"/delta/insurance/")
  1. Navigate to the Job details tab.
  2. Provide a name for the job (for example, Full-Load-Job).
  3. For IAM Role¸ choose the role delta-lake-cdc-blog-role that you created earlier.
  4. For Worker type¸ choose G 2X.
  5. For Job bookmark, choose Disable.
  6. Set Number of retries to 0.
  7. Under Advanced properties¸ keep the default values, but provide the delta core JAR file path for Python library path and Dependent JARs path.
  8. Under Job parameters:
    1. Add the key --s3_bucket with the bucket name you created earlier as the value.
    2. Add the key --datalake-formats  and give the value delta
  9. Keep the remaining default values and choose Save.

Job details

Now let’s create the CDC load job.

  1. Create a second job called CDC-Load-Job.
  2. Follow the steps on the Job details tab as with the previous job.
  3. Alternatively, you may choose “Clone job” option from the Full-Load-Job, this will carry all the job details from the full load job.
  4. In the script editor, enter the following code snippet for the CDC logic:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import expr

## For Delta lake
from delta.tables import DeltaTable


## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','s3_bucket'])

# Initialize Spark Session with Delta Lake
spark = SparkSession \
.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()

# Read the CDC load
cdc_df = spark.read.csv("s3://"+ args['s3_bucket']+"/cdcload")
cdc_df.show(5,True)

# now read the full load (latest data) as delta table
delta_df = DeltaTable.forPath(spark, "s3://"+ args['s3_bucket']+"/delta/insurance/")
delta_df.toDF().show(5,True)

# UPSERT process if matches on the condition the update else insert
# if there is no keyword then create a data set with Insert, Update and Delete flag and do it separately.
# for delete it has to run in loop with delete condition, this script do not handle deletes.
    
final_df = delta_df.alias("prev_df").merge( \
source = cdc_df.alias("append_df"), \
#matching on primarykey
condition = expr("prev_df.policy_id = append_df._c1"))\
.whenMatchedUpdate(set= {
    "prev_df.expiry_date"           : col("append_df._c2"), 
    "prev_df.location_name"         : col("append_df._c3"),
    "prev_df.state_code"            : col("append_df._c4"),
    "prev_df.region_name"           : col("append_df._c5"), 
    "prev_df.insured_value"         : col("append_df._c6"),
    "prev_df.business_type"         : col("append_df._c7"),
    "prev_df.earthquake_coverage"   : col("append_df._c8"), 
    "prev_df.flood_coverage"        : col("append_df._c9")} )\
.whenNotMatchedInsert(values =
#inserting a new row to Delta table
{   "prev_df.policy_id"             : col("append_df._c1"),
    "prev_df.expiry_date"           : col("append_df._c2"), 
    "prev_df.location_name"         : col("append_df._c3"),
    "prev_df.state_code"            : col("append_df._c4"),
    "prev_df.region_name"           : col("append_df._c5"), 
    "prev_df.insured_value"         : col("append_df._c6"),
    "prev_df.business_type"         : col("append_df._c7"),
    "prev_df.earthquake_coverage"   : col("append_df._c8"), 
    "prev_df.flood_coverage"        : col("append_df._c9")
})\
.execute()

Run the full load job

On the AWS Glue console, open full-load-job and choose Run. The job takes about 2 minutes to complete, and the job run status changes to Succeeded. Go to $bucket_name and open the delta folder, which contains the insurance folder. You can note the Delta Lake files in it. Delta location on S3

Create and run the AWS Glue crawler

In this step, we create an AWS Glue crawler with Delta Lake as the data source type. After successfully running the crawler, we inspect the data using Athena.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. Provide a name (for example, delta-lake-crawler) and choose Next.
  4. Choose Add a data source and choose Delta Lake as your data source.
  5. Copy your delta folder URI (for example, s3://delta-lake-cdc-blog-123456789/delta/insurance) and enter the Delta Lake table path location.
  6. Keep the default selection Create Native tables, and choose Add a Delta Lake data source.
  7. Choose Next.
  8. Choose the IAM role you created earlier, then choose Next.
  9. Select the default target database, and provide delta_ for the table name prefix. If no default database exist, you may create one.
  10. Choose Next.
  11. Choose Create crawler.
  12. Run the newly created crawler. After the crawler is complete, the delta_insurance table is available under Databases/Tables.
  13. Open the table to check the table overview.

You can observe nine columns and their data types. Glue table

Query the full load using Athena

In the earlier step, we created the delta_insurance table by running a crawler against the Delta Lake location. In this section, we query the delta_insurance table using Athena. Note that if you’re using Athena for the first time, set the query output folder to store the Athena query results (for example, s3://<your-s3-bucket>/query-output/).

  1. On the Athena console, open the query editor.
  2. Keep the default selections for Data source and Database.
  3. Run the query SELECT * FROM delta_insurance;. This query returns a total of 25 rows, the same as what was in the full load data feed.
  4. For the CDC comparison, run the following query and store the results in a location where you can compare these results later:
SELECT * FROM delta_insurance
WHERE policy_id IN (100462,100463,100475,110001,110002)
order by policy_id;

The following screenshot shows the Athena query result.

Query results from full load

Upload the CDC data feed and run the CDC job

In this section, we update three insurance policies and insert two new policies.

  1. Copy the following insurance policy data and save it locally as cdc-load.csv:
U,100462,2024-12-31,Urban,NY,East,3400000,Construction,Y,Y
U,100463,2023-03-27,Urban,NY,East,1000000,Office Bldg,Y,Y
U,100475,2023-03-31,Rural,WI,Midwest,1451662,Farming,N,Y
I,110001,2024-03-31,Urban,CA,WEST,210000,Office Bldg,N,N
I,110002,2024-03-31,Rural,FL,East,975000,Retail,N,Y

The first column in the CDC feed describes the UPSERT operations. U is for updating an existing record, and I is for inserting a new record.

  1. Upload the cdc-load.csv file to the $bucket_name/cdcload/ folder.
  2. On the AWS Glue console, run CDC-Load-Job. This job takes care of updating the Delta Lake accordingly.

The change details are as follows:

  • 100462 – Expiry date changes to 12/31/2024
  • 100463 – Insured value changes to 1 million
  • 100475 – This policy is now under a new flood zone
  • 110001 and 110002 – New policies added to the table
  1. Run the query again:
SELECT * FROM delta_insurance
WHERE policy_id IN (100462, 100463,100475,110001,110002)
order by policy_id;

As shown in the following screenshot, the changes in the CDC data feed are reflected in the Athena query results.
Athena query results

Clean up

In this solution, we used all managed services, and there is no cost if AWS Glue jobs aren’t running. However, if you want to clean up the tasks, you can delete the two AWS Glue jobs, AWS Glue table, and S3 bucket.

Conclusion

Organizations are continuously looking at high performance, cost-effective, and scalable analytical solutions to extract the value of their operational data sources in near-real time. The analytical platform should be ready to receive changes in the operational data as soon as they occur. Typical data lake solutions face challenges to handle the changes in source data; the Delta Lake framework can close this gap. This post demonstrated how to build data lakes for UPSERT operations using AWS Glue and native Delta Lake tables, and how to query AWS Glue tables from Athena. You can implement your large scale UPSERT data operations using AWS Glue, Delta Lake and perform analytics using Amazon Athena.

References


About the Authors

 Praveen Allam is a Solutions Architect at AWS. He helps customers design scalable, better cost-perfromant enterprise-grade applications using the AWS Cloud. He builds solutions to help organizations make data-driven decisions.

Vivek Singh is Senior Solutions Architect with the AWS Data Lab team. He helps customers unblock their data journey on the AWS ecosystem. His interest areas are data pipeline automation, data quality and data governance, data lakes, and lake house architectures.

За руснака, загубил гражданството си у нас и мястото си във Forbes Роман Доброхотов: Санкциите “Магнитски” са катастрофа за Сергей Адониев

Post Syndicated from Николай Марченко original https://bivol.bg/dobrokhotov-adoniev-magnitsky.html

понеделник 30 януари 2023


Още двама български граждани са внесени в списъка на санкционираните лица от Министерството на финансите на САЩ по Закона “Магнитски”. Става дума за живеещите в Лондон руснаци с двойно гражданство…

The collective thoughts of the interwebz