CVE-2021-3779: Ruby-MySQL Gem Client File Read (FIXED)

Post Syndicated from Tod Beardsley original https://blog.rapid7.com/2022/06/28/cve-2021-3779-ruby-mysql-gem-client-file-read-fixed/

CVE-2021-3779: Ruby-MySQL Gem Client File Read (FIXED)

The ruby-mysql Ruby gem prior to version 2.10.0 maintained by Tomita Masahiro is vulnerable to an instance of CWE-610: Externally Controlled Reference to a Resource in Another Sphere, wherein a malicious MySQL server can request local file content from a client without explicit authorization from the user. The initial CVSSv3 estimate for this issue is 6.5. Note that this issue does not affect the much more popular mysql2 gem. This issue was fixed in ruby-mysql 2.10.0 on October 23, 2021, and users of ruby-mysql are urged to update.

Product description

The ruby-mysql Ruby gem is an implementation of a MySQL client. While it is far less popular than the mysql2 gem, it serves a particular niche audience of users that desire a pure Ruby implementation of MySQL client functionality without linking to an external library (as mysql2 does).

Credit

This issue was reported to Rapid7 by Hans-Martin Münch of MOGWAI LABS GmbH, initially as a Metasploit issue, and is being disclosed in accordance with Rapid7’s vulnerability disclosure policy after coordination with the upstream maintainer of this library, as well as JPCERT/CC and CERT/CC.

Exploitation

A malicious actor can read arbitrary files from a client that uses ruby-mysql to communicate to a rogue MySQL server and issue database queries. In these cases, the server has the option to create a database reply using the LOAD DATA LOCAL statement, which instructs the client to provide additional data from a local file readable by the client (and not a “local” file on the server). The easiest way to demonstrate this issue is to run an instance of Rogue-MySql-Server by Gifts and perform any database query using the vulnerable version of the mysql gem.

Note that this behavior is a defined and expected option for servers and is described in the documentation, quoted below:

Because LOAD DATA LOCAL is an SQL statement, parsing occurs on the server side, and transfer of the file from the client host to the server host is initiated by the MySQL server, which tells the client the file named in the statement. In theory, a patched server could tell the client program to transfer a file of the server’s choosing rather than the file named in the statement. Such a server could access any file on the client host to which the client user has read access. (A patched server could in fact reply with a file-transfer request to any statement, not just LOAD DATA LOCAL, so a more fundamental issue is that clients should not connect to untrusted servers.) [emphasis added]

So, the vulnerability is not so much a MySQL server or protocol issue, but a vulnerability in a client that does not at least provide an option to disable LOAD DATA LOCAL queries; this is the situation with version 2.9.14 and earlier versions of ruby-mysql.

There is also prior work on this type of issue, and interested readers should refer to Knownsec 404 Team‘s article describing the issue for a thorough understanding of the dangers of LOAD DATA LOCAL and untrusted MySQL servers.

Impact

As stated, this issue only affects Ruby-based MySQL clients that connect to malicious MySQL servers. The vast majority of clients already know who they’re connecting to, and while an attacker could poison DNS records or otherwise intercede in network traffic to capture unwitting clients, such network shenanigans will be foiled by routine security controls like SSL certificates. The true risk is posed only to those people who connect to random and unknown MySQL servers in unfamiliar environments.

In other words, penetration testers and other opportunistic MySQL attackers are most at risk from this kind of vulnerability. CVE-2021-3779 fits squarely in the category of “hacking the hackers,” where an aggressive honeypot is designed to lie in wait for wandering MySQL scanners and attackers and steal data local to those connecting clients.

This is the reason why Hans-Martin Münch of MOGWAI LABS GmbH first brought this to Rapid7’s attention as an issue in Metasploit. While Metasploit users are indeed the most at risk to falling victim to an exploit for this vulnerability, the underlying issue was quickly identified as one in the shared open-source library code that Metasploit depends on for managing MySQL connections to remote servers. (One such example is the MySQL hashdump auxiliary module.)

Remediation

Users who implement ruby-mysql should update their packaged gem with the latest version of ruby-mysql, as it has been fixed in version 2.10.0. The current version (as of this writing) is 3.0.0 and was released in November of 2021.

Users unable to update can patch around the issue by ensuring that CLIENT_LOCAL_FILES is disallowed by the client, similarly to how Metasploit Framework initially remediated this issue while waiting on a fix from the upstream maintainer.

Disclosure timeline

The astute reader will note a significant gap of several months between the fix release and this disclosure. This was a failure on my, Tod Beardsley’s, part, since I was handling this issue.

For the record, there was no intention to bury this vulnerability — after all, we communicated it to the Tomita (the maintainer), RubyGems (who pointed us in the direction of Rubysec, thanks André), CERT/CC, and JPCERT/CC, so hopefully the intention to disclose in a timely manner was and is obvious.

But a confluence of family tragedies and home-office technical disasters conspired with the usual complications of a multi-stakeholder, multi-continent effort to coordinate disclosure in open-source library code.

I am also acutely aware of the irony of this delay in light of my recent post on silent patches, and I offer apologies for that delay. I am committed to being better with backups, both of the data and human varieties.

Note that all dates are local to the United States (some dates may differ in Japan and Germany depending on the time of day).

  • August, 2021: Issue discovered by Hans-Martin Münch of MOGWAI LABS GmbH.
  • Thu, Sep 2, 2021: Issue reported to Rapid7’s security contact as a Metasploit issue, #9286.
  • Tue, Sep 7, 2021: Rapid7 validated the issue, reserved CVE-2021-3779, and contacted the vulnerable gem maintainer, Tomita Masahiro.
  • Tue, Sep 8, 2021: Metasploit Framework temporary remediation committed.
  • Tue, Sep 8, 2021: Notified CERT/CC and RubyGems for disclosure coordination, as the gem appeared to be abandoned by the maintainer given no updates in several years.
  • Tue, Sep 9, 2021: Notified JPCERT/CC through VINCE on CERT/CC’s advice, as VU#541053.
  • Thu, Sep 10, 2021: JPCERT/CC acknowledged the issue and attempted to contact the gem maintainer.
  • Mon, Oct 18, 2021: Maintainer responded to JPCERT/CC, acknowledging the issue.
  • Fri, Oct 22, 2021: Fixed version 2.10.0 released, Rapid7 notified Hans-Martin of the fix.
  • Wed, Feb 16, 2022: CERT/CC asks for an update on the issue, Rapid7 communicates the fix to CERT/CC and JPCERT/CC.
  • Tue, Jun 6, 2022: CERT/CC asks for an update, Rapid7 commits to sharing disclosure documentation.
  • Tue, Jun 14, 2022: Rapid7 shares disclosure details with CERT/CC and Hans-Martin, and asks JPCERT/CC to communicate this document to Tomita.
  • Tue, June 28, 2022: This public disclosure

NEVER MISS A BLOG

Get the latest stories, expertise, and news about security today.

Additional reading:

How do we create engaging online courses for computing educators?

Post Syndicated from Dan Fisher original https://www.raspberrypi.org/blog/creating-free-online-courses-training-computing-computer-science-teachers-educators/

With our online courses programme, launched in 2017, we made it our mission to provide computing educators with the best possible free training we can design. Five years on, here are some of the key stats about the courses’ impact:

  • We’ve produced and launched 35 free online courses 
  • We’ve created over 650 educational course videos 
  • More than 234,000 learners have participated in the courses
  • Over 19,000 teachers in England have participated through the National Centre for Computing Education
A teacher attending Picademy laughs as she works through an activity

Designed and created in-house, each and every course is a real cross-team effort that involves a lot of careful planning and a number of different stages. Here we’re taking you behind the scenes to show you how we make our courses, introduce you to the people involved, and explain how we ensure our courses are of high quality.

But first, here’s some quick answers to questions you may have:

Our free online courses — key questions answered

What are the courses? 

They are online training courses to help you learn about computing and computing education. The courses are hosted on the FutureLearn website. They are asynchronous, meaning you can take them whenever and wherever you want.

Are the courses free?

Yes! All our courses are free when you sign up for time-limited access, which gives you full access to the learning materials for the complete course duration. FutureLearn also has a paid-for ‘unlimited’ option, where you receive a certificate for each course you take.

alt=""

Are the courses right for me? 

They are aimed at educators, particularly classroom teachers, but they are also beneficial to anyone who wants to learn more about computing.

How long does a course take?

To help you structure your learning, our courses are divided into three or four weeks, but it’s up to you how quickly you work through them. You can complete a course in one afternoon, or spread your learning out and study for 30 minutes a day over three or four weeks. This flexibility makes it easy to fit a course into a busy schedule. 

How can I access the courses?

""

What goes into creating an engaging online course?

Creating our online courses is a team effort involving writers, videographers, illustrators, animators, copy editors, presenters, and subject matter experts working together over months of production. The entire process is guided by our online course producers, Martin O’Hanlon, Ross Exton, and Michael Conterio, who know a thing or two about creating high-quality learning experiences. We spoke to them about what it takes to create an engaging course. 

The educators working at the Raspberry Pi Foundation.
The educators at the Raspberry Pi Foundation. On screen: Ross Exton. Left to right in person: Michael Conterio, Martin O’Hanlon.

Hi guys. You’ve created courses on a wide range of computing subjects. How do you decide what the focus of your next course is going to be?

Martin: We are driven by the needs of teachers. “What are teachers telling us they want to learn? Or what are the gaps in the curriculum where our learners need additional support?”

For example, our Introduction to Machine Learning and AI course was introduced as a result of feedback from teachers that while the subject wasn’t necessarily on the curriculum, they felt underprepared to answer questions from students or provide context when teaching other topics.

A woman holds up a Raspberry Pi computer in front of a laptop screen.

How do you then go about planning it out and turning that plan into an actual course structure?

Michael: Working with the course authors, we’ll generally agree on the big topics we want to cover or questions that we want to answer. We’ll often also have individual elements that we want to fit in somewhere, for example an activity involving making a learning resource more accessible. From there it’s a case of taking the bigger topics and working out how we can split them up into smaller chunks, until we get down to individual learning activities.

Ross: But then we’ll end up shuffling things around until we are happy — not only that we’ve got everything that we wanted to cover, but that the overall structure makes sense. We often talk about the ‘narrative’ of a course.

What is your approach to pedagogy in online courses?

Martin: At the Raspberry Pi Foundation we have a set of 12 pedagogy principles that we use through our learning resources (including online courses). We take particular care to lead with concepts, model processes, and activities; add variety for our learners; and include opportunities to create projects. 

The Raspberry Pi Foundation's 12 principles of computing pedagogy: lead with concepts; structure lessons; make concrete; unplug, unpack, repack; work together; read and explore code first; foster program comprehension; model everything; challenge misconceptions; create projects; get hands-on; add variety.
Learn more about the 12 principles in the free special edition of Hello World, The Big book of Computing Pedagogy, downloadable in PDF format.

Can you tell us about some of the pitfalls with course writing that you’ve learned along the way?

Michael: Because the learner is not present, you have to be incredibly precise with instructions as you can’t help learners directly as they are working through the content. And even if you think something is obvious, it’s easy for learners to accidentally miss an instruction, so it’s generally good to try to keep them together rather than spread out.

Martin: Luckily, it is often possible to tell from comments that learners have shared when something is hard to understand so we can improve future runs of the course.

How important is the media you add to the courses, like animations and videos? What is the process for creating this type of content?

Ross: It’s essential! It brings the abstract concepts of computing to life. The media in our courses helps our learners to visualise the ideas we’re presenting in ways that are engaging and relatable. 

alt=""

As we’re writing the course, we capture every creative idea that will best support our learners in gaining the knowledge and skills that they need. From ‘how-to’ guides with live coding, to physical computing demonstrations, or animations of robots, we think carefully about each image and video and how we’re not just telling the learner something, but showing them.

We then work with a brilliantly talented team of illustrators, animators, videographers, and presenters to create all of that media. 

A videographer preparing to film a course presenter.
And… action! We film all the video content for courses in-house, working closely with the educators who present the content.

There are lots of opportunities for social learning within the courses. Can you explain more about its importance and how we integrate it?

Ross: Social learning is a really important part of our online courses experience. Over the past year we have made significant investment to make it easier for participants to share programs they’ve written as part of their learning, for example, and for facilitators to provide support.

Martin: It is important people have the opportunity to share their learning with others. This is something often lost when taking an online course and it can feel like you are ‘on your own’. 

In the Raspberry Pi Foundation’s online courses learners are given the opportunity to ask questions, share what they have created, and provide their own insight in the comments. Educators from the Foundation facilitate the courses — responding to comments and providing advice is a big part of what they do.

Thank you Martin, Michael, and Ross. 

What new online course would you like us to create? Tell us in the comments below.

The post How do we create engaging online courses for computing educators? appeared first on Raspberry Pi.

Hertzbleed explained

Post Syndicated from Yingchen Wang original https://blog.cloudflare.com/hertzbleed-explained/

Hertzbleed explained

Hertzbleed explained

You may have heard a bit about the Hertzbleed attack that was recently disclosed. Fortunately, one of the student researchers who was part of the team that discovered this vulnerability and developed the attack is spending this summer with Cloudflare Research and can help us understand it better.

The first thing to note is that Hertzbleed is a new type of side-channel attack that relies on changes in CPU frequency. Hertzbleed is a real, and practical, threat to the security of cryptographic software.

Should I be worried?

From the Hertzbleed website,

“If you are an ordinary user and not a cryptography engineer, probably not: you don’t need to apply a patch or change any configurations right now. If you are a cryptography engineer, read on. Also, if you are running a SIKE decapsulation server, make sure to deploy the mitigation described below.”

Notice: As of today, there is no known attack that uses Hertzbleed to target conventional and standardized cryptography, such as the encryption used in Cloudflare products and services. Having said that, let’s get into the details of processor frequency scaling to understand the core of this vulnerability.

In short, the Hertzbleed attack shows that, under certain circumstances, dynamic voltage and frequency scaling (DVFS), a power management scheme of modern x86 processors, depends on the data being processed. This means that on modern processors, the same program can run at different CPU frequencies (and therefore take different wall-clock times). For example, we expect that a CPU takes the same amount of time to perform the following two operations because it uses the same algorithm for both. However, there is an observable time difference between them:

Hertzbleed explained

Trivia: Could you guess which operation runs faster?

Before giving the answer we will explain some details about how Hertzbleed works and its impact on SIKE, a new cryptographic algorithm designed to be computationally infeasible for an adversary to break, even for an attacker with a quantum computer.

Frequency Scaling

Suppose a runner is in a long distance race. To optimize the performance, the heart monitors the body all the time. Depending on the input (such as distance or oxygen absorption), it releases the appropriate hormones that will accelerate or slow down the heart rate, and as a result tells the runner to speed up or slow down a little. Just like the heart of a runner, DVFS (dynamic voltage and frequency scaling) is a monitor system for the CPU. It helps the CPU to run at its best under present conditions without being overloaded.

Hertzbleed explained

Just as a runner’s heart causes a runner’s pace to fluctuate throughout a race depending on the level of exertion, when a CPU is running a sustained workload, DVFS modifies the CPU’s frequency from the so-called steady-state frequency. DVFS causes it to switch among multiple performance levels (called P-states) and oscillate among them. Modern DVFS gives the hardware almost full control to adjust the P-states it wants to execute in and the duration it stays at any P-state. These modifications are totally opaque to the user, since they are controlled by hardware and the operating system provides limited visibility and control to the end-user.

The ACPI specification defines P0 state as the state the CPU runs at its maximum performance capability. Moving to higher P-states makes the CPU less performant in favor of consuming less energy and power.

Hertzbleed explained
Suppose a CPU’s steady-state frequency is 4.0 GHz. Under DVFS, frequency can oscillate between 3.9-4.1 GHz.

How long does the CPU stay at each P-state? Most importantly, how can this even lead to a vulnerability? Excellent questions!

Modern DVFS is designed this way because CPUs have a Thermal Design Point (TDP), indicating the expected power consumption at steady state under a sustained workload. For a typical computer desktop processor, such as a Core i7-8700, the TDP is 65 W.

To continue our human running analogy: a typical person can sprint only short distances, and must run longer distances at a slower pace. When the workload is of short duration, DVFS allows the CPU to enter a high-performance state, called Turbo Boost on Intel processors. In this mode, the CPU can temporarily execute very quickly while consuming much more power than TDP allows. But when running a sustained workload, the CPU average power consumption should stay below TDP to prevent overheating. For example, as illustrated below, suppose the CPU has been free of any task for a while, the CPU runs extra hard (Turbo Boost on) when it just starts running the workload. After a while, it realizes that this workload is not a short one, so it slows down and enters steady-state. How much does it slow down? That depends on the TDP. When entering steady-state, the CPU runs at a certain speed such that its current power consumption is not above TDP.

Hertzbleed explained
CPU entering steady state after running at a higher frequency.

Beyond protecting CPUs from overheating, DVFS also wants to maximize the performance. When a runner is in a marathon, she doesn’t run at a fixed pace but rather her pace floats up and down a little. Remember the P-state we mentioned above? CPUs oscillate between P-states just like runners adjust their pace slightly over time. P-states are CPU frequency levels with discrete increments of 100 MHz.

Hertzbleed explained
CPU frequency levels with discrete increments

The CPU can safely run at a high P-state (low frequency) all the time to stay below TDP, but there might be room between its power consumption and the TDP. To maximize CPU performance, DVFS utilizes this gap by allowing the CPU to oscillate between multiple P-states. The CPU stays at each P-state for only dozens of milliseconds, so that its temporary power consumption might exceed or fall below TDP a little, but its average power consumption is equal to TDP.

To understand this, check out this figure again.

Hertzbleed explained

If the CPU only wants to protect itself from overheating, it can run at P-state 3.9 GHz safely. However, DVFS wants to maximize the CPU performance by utilizing all available power allowed by TDP. As a result, the CPU oscillates around the P-state 4.0 GHz. It is never far above or below. When at 4.1 GHz, it overloads itself a little, it then drops to a higher P-state. When at 3.9 GHz, it recovers itself, it quickly climbs to a lower P-state. It may not stay long in any P-state, which avoids overheating when at 4.1 GHz and keeps the average power consumption near the TDP.

This is exactly how modern DVFS monitors your CPU to help it optimize power consumption while working hard.

Again, how can DVFS and TDP lead to a vulnerability? We are almost there!

Frequency Scaling vulnerability

The design of DVFS and TDP can be problematic because CPU power consumption is data-dependent! The Hertzbleed paper gives an explicit leakage model of certain operations identifying two cases.

First, the larger the number of bits set (also known as the Hamming weight) in the operands, the more power an operation takes. The Hamming weight effect is widely observed with no known explanation of its root cause. For example,

Hertzbleed explained

The addition on the left will consume more power compared to the one on the right.

Similarly, when registers change their value there are power variations due to transistor switching. For example, a register switching its value from A to B (as shown in the left) requires flipping only one bit because the Hamming distance of A and B is 1. Meanwhile, switching from C to D will consume more energy to perform six bit transitions since the Hamming distance between C and D is 6.

Hertzbleed explained
Hamming distance

Now we see where the vulnerability is! When running sustained workloads, CPU overall performance is capped by TDP. Under modern DVFS, it maximizes its performance by oscillating between multiple P-states. At the same time, the CPU power consumption is data-dependent. Inevitably, workloads with different power consumption will lead to different CPU P-state distribution. For example, if workload w1 consumes less power than workload w2, the CPU will stay longer in lower P-state (higher frequency) when running w1.

Hertzbleed explained
Different power consumption leads to different P-state distribution

As a result, since the power consumption is data-dependent, it follows that CPU frequency adjustments (the distribution of P-states) and execution time (as 1 Hertz = 1 cycle per second) are data-dependent too.

Consider a program that takes five cycles to finish as depicted in the following figure.

Hertzbleed explained
CPU frequency directly translate to running time

As illustrated in the table below, f the program with input 1 runs at 4.0 GHz (red) then it takes 1.25 nanoseconds to finish. If the program consumes more power with input 2, under DVFS, it will run at a lower frequency, 3.5 GHz (blue). It takes more time, 1.43 nanoseconds, to finish. If the program consumes even more power with input 3, under DVFS, it will run at an even lower frequency of 3.0 GHz (purple). Now it takes 1.67 nanoseconds to finish. This program always takes five cycles to finish, but the amount of power it consumes depends on the input. The power influences the CPU frequency, and CPU frequency directly translates to execution time. In the end, the program’s execution time becomes data-dependent.

Execution time of a five cycles program
Frequency 4.0 GHz 3.5 GHz 3.0 GHz
Execution Time 1.25 ns 1.43 ns 1.67 ns

To give you another concrete example: Suppose we have a sustained workload Foo. We know that Foo consumes more power with input data 1, and less power with input data 2. As shown on the left in the figure below, if the power consumption of Foo is below the TDP, CPU frequency as well as running time stays the same regardless of the choice of input data. However, as shown in the middle, if we add a background stressor to the CPU, the combined power consumption will exceed TDP. Now we are in trouble. CPU overall performance is monitored by DVFS and capped by TDP. To prevent itself from overheating, it dynamically adjusts its P-state distribution when running workload with various power consumption. P-state distribution of Foo(data 1) will have a slight right shift compared to that of Foo(data 2). As shown on the right, CPU running Foo(data 1) results in a lower overall frequency and longer running time. The observation here is that, if data is a binary secret, an attacker can infer data by simply measuring the running time of Foo!

Hertzbleed explained
Complete recap of Hertzbleed. Figure taken from Intel’s documentation.

This observation is astonishing because it conflicts with our expectation of a CPU. We expect a CPU to take the same amount of time computing these two additions.

Hertzbleed explained

However, Hertzbleed tells us that just like a person doing math on paper, a CPU not only takes more power to compute more complicated numbers but also spends more time as well! This is not what a CPU should do while performing a secure computation! Because anyone that measures the CPU execution time should not be able to infer the data being computed on.

This takeaway of Hertzbleed creates a significant problem for cryptography implementations because an attacker shouldn’t be able to infer a secret from program’s running time. When developers implement a cryptographic protocol out of mathematical construction, a goal in common is to ensure constant-time execution. That is, code execution does not leak secret information via a timing channel. We have witnessed that timing attacks are practical: notable examples are those shown by Kocher, Brumley-Boneh, Lucky13, and many others. How to properly implement constant-time code is subject of extensive study.

Historically, our understanding of which operations contribute to time variation did not take DVFS into account. The Hertzbleed vulnerability derives from this oversight: any workload which differs by significant power consumption will also differ in timing. Hertzbleed proposes a new perspective on the development of secure programs: any program vulnerable to power analysis becomes potentially vulnerable to timing analysis!

Which cryptographic algorithms are vulnerable to Hertzbleed is unclear. According to the authors, a systematic study of Hertzbleed is left as future work. However, Hertzbleed was exemplified as a vector for attacking SIKE.

Brief description of SIKE

The Supersingular Isogeny Key Encapsulation (SIKE) protocol is a Key Encapsulation Mechanism (KEM) finalist of the NIST Post-Quantum Cryptography competition (currently at Round 3). The building block operation of SIKE is the calculation of isogenies (transformations) between elliptic curves. You can find helpful information about the calculation of isogenies in our previous blog post. In essence, calculating isogenies amounts to evaluating mathematical formulas that take as inputs points on an elliptic curve and produce other different points lying on a different elliptic curve.

Hertzbleed explained

SIKE bases its security on the difficulty of computing a relationship between two elliptic curves. On the one hand, it’s easy computing this relation (called an isogeny) if the points that generate such isogeny (called the kernel of the isogeny) are known in advance. On the other hand, it’s difficult to know the isogeny given only two elliptic curves, but without knowledge of the kernel points. An attacker has no advantage if the number of possible kernel points to try is large enough to make the search infeasible (computationally intractable) even with the help of a quantum computer.

Similarly to other algorithms based on elliptic curves, such as ECDSA or ECDH, the core of SIKE is calculating operations over points on elliptic curves. As usual, points are represented by a pair of coordinates (x,y) which fulfill the elliptic curve equation

$ y^2= x^3 + Ax^2 +x $

where A is a parameter identifying different elliptic curves.

For performance reasons, SIKE uses one of the fastest elliptic curve models: the Montgomery curves. The special property that makes these curves fast is that it allows working only with the x-coordinate of points. Hence, one can express the x-coordinate as a fraction x = X / Z, without using the y-coordinate at all. This representation simplifies the calculation of point additions, scalar multiplications, and isogenies between curves. Nonetheless, such simplicity does not come for free, and there is a price to be paid.

The formulas for point operations using Montgomery curves have some edge cases. More technically, a formula is said to be complete if for any valid input a valid output point is produced. Otherwise, a formula is not complete, meaning that there are some exceptional inputs for which it cannot produce a valid output point.

Hertzbleed explained

In practice, algorithms working with incomplete formulas must be designed in such a way that edge cases never occur. Otherwise, algorithms could trigger some undesired effects. Let’s take a closer look at what happens in this situation.

A subtle yet relevant property of some incomplete formulas is the nature of the output they produce when operating on points in the exceptional set. Operating with anomalous inputs, the output has both coordinates equal to zero, so X=0 and Z=0. If we recall our basics on fractions, we can figure out that there is something odd in a fraction X/Z = 0/0; furthermore it was always regarded as something not well-defined. This intuition is not wrong, something bad just happened. This fraction does not represent a valid point on the curve. In fact, it is not even a (projective) point.

The domino effect

Hertzbleed explained

Exploiting this subtlety of mathematical formulas makes a case for the Hertzbleed side-channel attack. In SIKE, whenever an edge case occurs at some point in the middle of its execution, it produces a domino effect that propagates the zero coordinates to subsequent computations, which means the whole algorithm is stuck on 0. As a result, the computation gets corrupted obtaining a zero at the end, but what is worse is that an attacker can use this domino effect to make guesses on the bits of secret keys.

Trying to guess one bit of the key requires the attacker to be able to trigger an exceptional case exactly at the point in which the bit is used. It looks like the attacker needs to be super lucky to trigger edge cases when it only has control of the input points. Fortunately for the attacker, the internal algorithm used in SIKE has some invariants that can help to hand-craft points in such a way that triggers an exceptional case exactly at the right point. A systematic study of all exceptional points and edge cases was, independently, shown by De Feo et al. as well as in the Hertzbleed article.

With these tools at hand, and using the DVFS side channel, the attacker can now guess bit-by-bit the secret key by passing hand-crafted invalid input points. There are two cases an attacker can observe when the SIKE algorithm uses the secret key:

  • If the bit of interest is equal to the one before it, no edge cases are present and computation proceeds normally, and the program will take the expected amount of wall-time since all the calculations are performed over random-looking data.
  • On the other hand, if the bit of interest is different from the one before it, the algorithm will enter the exceptional case, triggering the domino effect for the rest of the computation, and the DVFS will make the program run faster as it automatically changes the CPU’s frequency.

Using this oracle, the attacker can query it, learning bit by bit the secret key used in SIKE.

Ok, let’s recap.

SIKE uses special formulas to speed up operations, but if these formulas are forced to hit certain edge cases then they will fail. Failing due to these edge cases not only corrupts the computation, but also makes the formulas output coordinates with zeros, which in machine representation amount to several registers all loaded with zeros. If the computation continues without noticing the presence of these edge cases, then the processor registers will be stuck on 0 for the rest of the computation. Finally, at the hardware level, some instructions can consume fewer resources if operands are zeroed. Because of that, the DVFS behind CPU power consumption can modify the CPU frequency, which alters the steady-state frequency. The ultimate result is a program that runs faster or slower depending on whether it operates with all zeros versus with random-looking data.

Hertzbleed explained

Hertzbleed’s authors contacted Cloudflare Research because they showed a successful attack on CIRCL, our optimized Go cryptographic library that includes SIKE. We worked closely with the authors to find potential mitigations in the early stages of their research. While the embargo of the disclosure was in effect, another research group including De Feo et al. independently described a systematic study of the possible failures of SIKE formulas, including the same attack found by the Hertzbleed team, and pointed to a proper countermeasure. Hertzbleed borrows such a countermeasure.

What countermeasures are available for SIKE?

Hertzbleed explained

The immediate action specific for SIKE is to prevent edge cases from occurring in the first place. Most SIKE implementations provide a certain amount of leeway, assuming that inputs will not trigger exceptional cases. This is not a safe assumption. Instead, implementations should be hardened and should validate that inputs and keys are well-formed.

Enforcing a strict validation of untrusted inputs is always the recommended action. For example, a common check on elliptic curve-based algorithms is to validate that inputs correspond to points on the curve and that their coordinates are in the proper range from 0 to p-1 (as described in Section 3.2.2.1 of SEC 1). These checks also apply to SIKE, but they are not sufficient.

What malformed inputs have in common in the case of SIKE is that input points could have arbitrary order—that is, in addition to checking that points must lie on the curve, they must also have a prescribed order, so they are valid. This is akin to small subgroup attacks for the Diffie-Hellman case using finite fields. In SIKE, there are several overlapping groups in the same curve, and input points having incorrect order should be detected.

The countermeasure, originally proposed by Costello, et al., consists of verifying whether the input points are of the right full-order. To do so, we check whether an input point vanishes only when multiplied by its expected order, and not before when multiplied by smaller scalars. By doing so, the hand-crafted invalid points will not pass this validation routine, which prevents edge cases from appearing during the algorithm execution. In practice, we observed around a 5-10% performance overhead on SIKE decapsulation. The ciphertext validation is already available in CIRCL as of version v1.2.0. We strongly recommend updating your projects that depend on CIRCL to this version, so you can make sure that strict validation on SIKE is in place.

Hertzbleed explained

Closing comments

Hertzbleed shows that certain workloads can induce changes on the frequency scaling of the processor, making programs run faster or slower. In this setting, small differences on the bit pattern of data result in observable differences on execution time. This puts a spotlight on the state-of-the-art techniques we know so far used to protect against timing attacks, and makes us rethink the measures needed to produce constant-time code and secure implementations. Defending against features like DVFS seems to be something that programmers should start to consider too.

Although SIKE was the victim this time, it is possible that other cryptographic algorithms may expose similar symptoms that can be leveraged by Hertzbleed. An investigation of other targets with this brand-new tool in the attacker’s portfolio remains an open problem.

Hertzbleed allowed us to learn more about how the machines we have in front of us work and how the processor constantly monitors itself, optimizing the performance of the system. Hardware manufacturers have focused on performance of processors by providing many optimizations, however, further study of the security of computations is also needed.

If you are excited about this project, at Cloudflare we are working on raising the bar on the production of code for cryptography. Reach out to us if you are interested in high-assurance tools for developers, and don’t forget our outreach programs whether you are a student, a faculty member, or an independent researcher.

Level Up Your Backup Game With Backblaze and Veritas

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/level-up-your-backup-game-with-backblaze-and-veritas/

If you’re using Veritas’ Backup Exec, you’re already ahead of the game when it comes to your data backup and recovery approach. Now, you can power up your backup strength by adding Backblaze B2 Cloud Storage as a destination for your Backup Exec data.

The joint solution offers easy, affordable, S3-compatible object storage to customers who use Backup Exec to streamline their data backup and recovery approach. Read on to learn more about this partnership and how it can benefit your business.

Backblaze and Veritas: In Person at VMware Explore 2022

Want to learn more? Stop by booth 1501 at VMware Explore 2022, August 29–September 1. You can visit with our technical experts, see demos, and learn more about how Backblaze B2 Cloud Storage seamlessly integrates with Veritas Backup Exec. Or, schedule a meeting using this link to talk about solutions tailored to your business needs.

What Is Veritas Backup Exec?

The Veritas Backup Exec service helps businesses protect almost any data on any storage device—tape, servers, or in the cloud. By unifying backup management in one panel, it creates a simple, customizable approach to managing backups and orchestrating recovery that can benefit everyone from the local bakery, to a regional school district, to large government units.

How Does Backblaze Integrate With Backup Exec?

Customers have regularly asked for Backblaze to be accepted into the Veritas Technology Partner Program. Being able to seamlessly integrate Veritas with Backblaze B2 is a big win for these customers who’ve long been looking for an easy, affordable solution that helps them manage infrastructure costs.

Beginning today, any business or institution using Backup Exec can configure their data to back up to B2 Cloud Storage. IT teams can rest easy, knowing that remote offices, Linux and Unix workloads, virtual or Microsoft workloads, and more, are protected and immediately available if they need them—all configurable in a few clicks and at one-fifth the price of traditional storage vendors, with no data retention minimums or other lock-in fees.

“The Backup Exec service empowers small and medium-sized businesses to do more with less, and to minimize their daily lift around data management and protection. This aligns perfectly with our mission to make storing and using data astonishingly easy—especially for IT teams that already have a ton on their plate.”
—Nilay Patel, VP of Sales & Partnerships, Backblaze

How Does This Partnership Benefit Joint Customers?

The partnership delivers in three key value areas:

  • Simplification: Backup Exec provides a simple, unified user interface that removes complexity from data protection. Backblaze B2 is configurable in a few easy steps.
  • Hybrid-cloud adoption: The capital and personnel expense of purchasing and managing on-prem data storage can be challenging. Veritas and Backblaze put cloud adoption within reach—for backups, workloads, infrastructure, recovery, or everything together.
  • Cost reduction: Backup Exec makes configuring backups easy, which means that businesses can fine-tune their storage bill in line with their budgets. With B2 Cloud Storage priced at one-fifth of traditional cloud vendors—the combination offers a powerful tool to businesses that want to optimize their spend.

“Our goal with Backup Exec is to provide a simple, powerful solution that frees business owners from concerns about data loss. Backblaze is a natural partner in this effort and we’re excited to work with them to bring the value of cloud backup to a larger group of businesses and institutions.”
—Jason Von Eberstein, Senior Principal Product Manager of Veritas

Bonus Points: Veritas + Carahsoft + Backblaze = Public Sector Success

Backblaze recently announced a partnership with Carahsoft to help public sector CIOs optimize their cloud spend. Carahsoft is the Master Government Aggregator™ for the IT industry. The partnership—which was enabled by our recent launch of a capacity-based pricing bundle, Backblaze B2 Reserve—solves both the budgeting and procurement challenges public sector CIOs are facing. And Carahsoft is a strategic distributor for Veritas, so public sector customers can take advantage of the joint solution as well.

About Veritas

Veritas enables organizations to harness the power of their information, with solutions designed to serve the world’s largest and most complex heterogeneous environments. Veritas’s industry-leading solutions cover all platforms with backup and recovery, business continuity, software-defined storage, and information governance.

Getting Started With Backblaze B2 and Veritas

Ready to take your backup game to the next level? Click here to get started today and check out our Knowledge Base article for detailed instructions.

The post Level Up Your Backup Game With Backblaze and Veritas appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

When Security Locks You Out of Everything

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/06/__trashed-2.html

Thought experiment story of someone who lost everything in a house fire, and now can’t log into anything:

But to get into my cloud, I need my password and 2FA. And even if I could convince the cloud provider to bypass that and let me in, the backup is secured with a password which is stored in—you guessed it—my Password Manager.

I am in cyclic dependency hell. To get my passwords, I need my 2FA. To get my 2FA, I need my passwords.

It’s a one-in-a-million story, and one that’s hard to take into account in system design.

This is where we reach the limits of the “Code Is Law” movement.

In the boring analogue world—I am pretty sure that I’d be able to convince a human that I am who I say I am. And, thus, get access to my accounts. I may have to go to court to force a company to give me access back, but it is possible.

But when things are secured by an unassailable algorithm—I am out of luck. No amount of pleading will let me without the correct credentials. The company which provides my password manager simply doesn’t have access to my passwords. There is no-one to convince. Code is law.

Of course, if I can wangle my way past security, an evil-doer could also do so.

So which is the bigger risk?

  • An impersonator who convinces a service provider that they are me?
  • A malicious insider who works for a service provider?
  • Me permanently losing access to all of my identifiers?

I don’t know the answer to that.

Those risks are in the order of most common to least common, but that doesn’t necessarily mean that they are in risk order. They probably are, but then we’re left with no good way to handle someone who has lost all their digital credentials—computer, phone, backup, hardware token, wallet with ID cards—in a catastrophic house fire.

I want to remind readers that this isn’t a true story. It didn’t actually happen. It’s a thought experiment.

Безвремието на съвремието

Post Syndicated from original https://bivol.bg/%D0%B1%D0%B5%D0%B7%D0%B2%D1%80%D0%B5%D0%BC%D0%B8%D0%B5%D1%82%D0%BE-%D0%BD%D0%B0-%D1%81%D1%8A%D0%B2%D1%80%D0%B5%D0%BC%D0%B8%D0%B5%D1%82%D0%BE.html

вторник 28 юни 2022


Честно да си призная, хич не ми е мъчно за правителството на Киро. Не, не искам да се отнасям обидно към него, използвайки умалителни имена. Той не ми е неприятен,…

Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR

Post Syndicated from Sekar Srinivasan original https://aws.amazon.com/blogs/big-data/build-a-high-performance-acid-compliant-evolving-data-lake-using-apache-iceberg-on-amazon-emr/

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto.

Apache Iceberg is an open table format for huge analytic datasets. Table formats typically indicate the format and location of individual table files. Iceberg adds functionality on top of that to help manage petabyte-scale datasets as well as newer data lake requirements such as transactions, upsert/merge, time travel, and schema and partition evolution. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, and Hive using a high-performance table format that works just like a SQL table.

Amazon EMR release 6.5.0 and later includes Apache Iceberg so you can reliably work with huge tables with full support for ACID (Atomic, Consistent, Isolated, Durable) transactions in a highly concurrent and performant manner without getting locked into a single file format.

In this post, we discuss the modern data lake requirements and the challenges—including support for ACID transactions and concurrent writers, partition and schema evolution—that come with these. We also discuss how Iceberg solves these challenges. Additionally, we provide a step-by-step guide on how to get started with an Iceberg notebook in Amazon EMR Studio. You can access this sample notebook from the GitHub repo. You can also find this notebook in your EMR Studio workspace under Notebook Examples.

Modern data lake challenges

Amazon EMR integrates with Amazon Simple Storage Service (Amazon S3) natively for persistent data storage, and allows you to independently scale your data in Amazon S3 and compute on your EMR cluster. This enables you to bring in data from multiple sources (for example, transactional data from operational databases, social media feeds, and SaaS data sources) using different tools, and each data source has its own transient EMR cluster to perform transformation and ingestion in parallel. You can now keep one central copy of your data and share it with multiple user groups that run analytics and even make in-place updates on a data lake. We’re increasingly seeing the following requirements (and challenges) emerge as mainstream:

  • Consistent reads and writes across multiple concurrent users – There are two primary concerns:
    • Reader-writer isolation – When a job is updating a huge dataset, another job accessing the same data frequently works on a partially updated dataset, leaving the data in an inconsistent state.
    • Concurrent writes on the same dataset – Table formats relying on coarse-grained locks slow down the system. This limitation is even more telling in real-time streaming workloads.
  • Consistent table updates across multiple files or partitions – With Hive tables, writing to multiple partitions at once isn’t an atomic operation. If you’re overwriting a partition, for instance, you might delete several files across partitions without having any guarantees that you will replace them, potentially resulting in data loss. For huge tables, it’s not practical to use global locks and keep the readers and writers waiting. Common workarounds (such as rewriting all the data in all the partitions that need to be changed at the same time and then pointing to the new locations) cause huge data duplication and redundant extract, transform, and load (ETL) jobs.
  • Continuous schema evolution – Simple DDL commands often render the data unusable. For instance, say a data engineer renames a column and writes some data. The consuming analytics tool now can’t read it because the metastore can’t track former names for columns. That rename operation has effectively dropped a column and added a new column. Now there is data written in both schemas. Historically, schema changes required expensive backfills and redundant ETL processes.
  • Different query patterns on the same data – If you change the partitioning to optimize your query after a year, say from daily to hourly, you have to rewrite the table with the new hour column as the partition. In addition, you have to rewrite queries to use the new partition column in your table.
  • ACID transactions, streaming upserts, file size optimization, and data snapshots – Existing tools that support these features lock you into specific file formats, complicating interoperability across the analytics ecosystem.
  • Support for mixed file formats – With existing solutions, if you rename a column in one file format (say Parquet, ORC, or Avro), you get a different behavior than if you rename a column in a different file format. There is inconsistency in data types supported by different file formats. These limitations necessitate additional ETL steps.

The problem

When multiple users share the same data, varied requirements ensue. The data platform needs to be transactional to handle concurrent upserts and reads.

Table formats such as Hive track a list of partitions inside the table within a data catalog. However, the underlying files are still not tracked transactionally, because we’re relying on an immutable object storage that is just not designed to be transactional. After the specific partitions to be updated or inserted have been identified, we still need to list all the files in those partitions at the leaf level of the partition hierarchy before we can filter out which of those files are relevant. For huge analytic datasets with thousands of files in each partition, listing all those files each time you run a query slows it down considerably. Furthermore, doing atomic commits—getting thousands of files in the table live in exactly the same moment—becomes impractical.

Apache Iceberg on Amazon EMR

Iceberg development was started by Netflix in December 2017 and was donated to the Apache software foundation in November 2018 as an incubator project. In May 2020, it graduated from the incubator.

Iceberg on Amazon EMR comes completely integrated and tested for running in production backed by Enterprise Support. This means you get 24/7 technical support from Amazon EMR experts, tools and technology to automatically manage the health of your environment, and consultative architectural, performance, and troubleshooting guidance on Iceberg issues.

Iceberg has integrations with other AWS services. For example, you can use the AWS Glue Data Catalog as the metastore for Iceberg tables. Iceberg also supports other catalog types such as Hive, Hadoop, Amazon DynamoDB, Amazon Relational Database Service (Amazon RDS), and other custom implementations. When using AWS Glue as the data catalog, the AWS Glue database serves as your Iceberg namespace. Similarly, the AWS Glue table and AWS Glue TableVersion serve as the Iceberg table and table version, respectively. Your AWS Glue Data Catalog could be in the same or different account or even a different Region, making multi-account, multi-Region pipelines easily deployable. Amazon Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue Data Catalog for their metastore.

How Iceberg addresses these challenges

Iceberg tracks individual data files in a table instead of simply maintaining a pointer to high-level table or partition locations. This allows writers to create data files in-place and only adds files to the table in an explicit commit. Every time new datasets are ingested into this table, a new point-in-time snapshot gets created. At query time, there is no need to list a directory to find the files we need to work with, because the snapshot already has that information pre-populated during the write time. Because of this design, Iceberg solves the problems listed earlier in the following ways:

  • Consistent reads and writes across multiple concurrent users – Iceberg relies on optimistic concurrency to support concurrent reads and writes from multiple user groups. If two operations are running at the same time, only one of them will be successful. The other job will retry, but that retry will be implicit to the user and that will be done at the metadata level. If Iceberg detects that the second update is not in conflict, it will commit it successfully.
  • Consistent table updates across multiple partitions – In Iceberg, the partition of a file isn’t determined by the physical location of the files within directories or prefixes. Instead, Iceberg stores partition information within manifests of the data files. Therefore, updates across multiple partitions entail a simple, atomic metadata change.
  • Continuous schema evolution – Iceberg tracks columns by using unique IDs and not by the column name, which enables easy schema evolution. You can safely add, drop, rename, or even reorder columns. You can also update column data types if the update is safe (such as widening from INT to BIGINT or float to double)
  • Different query patterns on the same data – Iceberg keeps track of the relationship between partitioning values and the column that they came from. Logical data is decoupled from physical layout, which enables easy partition evolution as well. Partition values can be implicitly derived using a transform such as day(timestamp) or hour(timestamp) of an existing column.
  • ACID transactions, streaming upserts, file size optimization, and data snapshots – Iceberg supports ACID transactions with serializable isolation. Furthermore, Iceberg supports deletes, upserts, change data capture (CDC), time travel (getting the state of the data from a past time regardless of the current state of the data), and compaction (consolidating small files into larger files to reduce metadata overhead and improve query speed). Table changes are atomic, and readers never see partial or uncommitted changes.
  • Support for mixed file formats – Because schema fields are tracked by unique IDs independent of the underlying file format, you can have consistent queries across file formats such as Avro, Parquet, and ORC.

Using Apache Iceberg with Amazon EMR

In this post, we demonstrate creating an Amazon EMR cluster that supports Iceberg using the AWS Command Line Interface (AWS CLI). You can also create the cluster from the Amazon EMR console. We use Amazon EMR Studio to run notebook code on our EMR cluster. To set up an EMR Studio, refer to Set up an EMR Studio. First, we note down the subnets that we specified when we created our EMR Studio. Now we launch our EMR cluster using the AWS CLI:

aws emr create-cluster \
--name iceberg-emr-cluster \
--use-default-roles \
--release-label emr-6.6.0 \
--instance-count 1 \
--instance-type r5.4xlarge \
--applications Name=Hadoop Name=Livy Name=Spark Name=JupyterEnterpriseGateway \
--ec2-attributes SubnetId=<EMR-STUDIO-SUBNET>\
--configurations '[{"Classification":"iceberg-
defaults","Properties":{"iceberg.enabled":"true"}},{"Classification":"spark-hive-
site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.met
astore.AWSGlueDataCatalogHiveClientFactory"}}]'

We choose emr-6.6.0 as the release label. This release comes with Iceberg version 0.13.1 pre-installed. We launch a single-node EMR cluster with the instance type R5.4xlarge and with the following applications installed: Hadoop, Spark, Livy, and Jupyter Enterprise Gateway. Make sure that you replace <EMR-STUDIO-SUBNET> with a subnet ID from the list of EMR Studio’s subnets you noted earlier. We need to enable Iceberg and the AWS Glue Data Catalog on our cluster. To do this, we use the following configuration classifications:

[
  {
    "Classification": "iceberg-defaults ",
    "Properties": {
      "iceberg.enabled":"true"
    }
  },
  {
    "Classification": "spark-hive-site ",
    "Properties": {
      "hive.metastore.client.factory.class":        
         "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]

Initial setup

Let’s first create an S3 bucket location in the same Region as the EMR cluster to save a sample dataset that we’re going to create and work with. In this post, we use the placeholder bucket name YOUR-BUCKET-NAME. Remember to replace this with a globally unique bucket name when testing this out in your environment. From our EMR Studio workspace, we attach our cluster and use the PySpark kernel.

You can upload the sample notebook from the GitHub repo or use the Iceberg example under Notebook Examples in your own EMR Studio workspace and run the cells following the instructions in the notebook.

Configure a Spark session

In this command, we set our AWS Glue Data Catalog name as glue_catalog1. You can replace it with a different name. But if you do so, remember to change the Data Catalog name throughout this example, because we use the fully qualified table name including the Data Catalog name in all of our commands going forward. In the following command, remember to replace YOUR-BUCKET-NAME with your own bucket name:

%%configure -f
{
    "conf":  {
             "spark.sql.catalog.glue_catalog1": "org.apache.iceberg.spark.SparkCatalog",
             "spark.sql.catalog.glue_catalog1.warehouse": 
                   "s3://YOUR-BUCKET-NAME/iceberg/glue_catalog1/tables/",
             "spark.sql.catalog.glue_catalog1.catalog-impl":    "org.apache.iceberg.aws.glue.GlueCatalog",
             "spark.sql.catalog.glue_catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
             "spark.sql.catalog.glue_catalog1.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
             "spark.sql.catalog.glue_catalog1.lock.table": "myGlueLockTable",
  "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
           } 
}

Let’s assume that the name of your catalog is glue_catalog1. The preceding code has the following components:

  • glue_catalog1.warehouse points to the Amazon S3 path where you want to store your data and metadata.
  • To make the catalog an AWS Glue Data Catalog, set glue_catalog1.catalog-impl to org.apache.iceberg.aws.glue.GlueCatalog. This key is required to point to an implementation class for any custom catalog implementation.
  • Use org.apache.iceberg.aws.s3.S3FileIO as the glue_catalog1.io-impl in order to take advantage of Amazon S3 multipart upload for high parallelism.
  • We use an Amazon DynamoDB table for lock implementation. This is optional, and is recommended for high concurrency workloads. To do that, we set lock-impl for our catalog to org.apache.iceberg.aws.glue.DynamoLockManager and we set lock.table to myGlueLockTable as the table name so that for every commit, the Data Catalog first obtains a lock using this table and then tries to safely modify the AWS Glue table. If you choose this option, the table gets created in your own account. Note that you need to have the necessary access permissions to create and use a DynamoDB table. Furthermore, additional DynamoDB charges apply.

Now that you’re all set with your EMR cluster for compute, S3 bucket for data, and AWS Glue Data Catalog for metadata, you can start creating a table and running the DML statements.

For all commands going forward, we use the %%sql cell magic to run Spark SQL commands in our EMR Studio notebook. However, for brevity, we don’t show the cell magic command. But you may need to use that in your Studio notebook for the SQL commands to work.

Create an Iceberg table in the AWS Glue Data Catalog

The default catalog is the AwsDataCatalog. Let’s switch to our AWS Glue catalog glue_catalog1, which has support for Iceberg tables. There are no namespaces as yet. A namespace in Iceberg is the same thing as a database in AWS Glue.

%%sql
use glue_catalog1

Let’s create a table called orders. The DDL syntax looks the same as creating a Hive table, for example, except that we include USING iceberg:

CREATE TABLE glue_catalog1.salesdb.orders
    (
      order_id              int,
      product_name          string,
      product_category      string,
      qty                   int,
      unit_price            decimal(7,2),
      order_datetime        timestamp
    )
USING iceberg
PARTITIONED BY (days(order_datetime))

Note that we’re also partitioning this table by extracting the day out of the order_datetime column. We don’t have to create a separate column for the partition.

DML statements

We then insert records to our table. Here is an example:

INSERT INTO glue_catalog1.salesdb.orders VALUES 
    (
        1, 
        'Harry Potter and the Prisoner of Azkaban',
        'Books',
        2,
        7.99,
        current_timestamp()
    )

DML statements result in snapshots getting created. Note the snapshot_id and the timestamp column called committed_at:

SELECT * FROM glue_catalog1.salesdb.orders.snapshots;

We now insert four more records and then query the orders table and confirm that the five records are present:

SELECT * FROM glue_catalog1.salesdb.orders

Querying from Athena

Because Iceberg on Amazon EMR comes pre-integrated with the AWS Glue Data Catalog, we can now query the Iceberg tables from AWS analytics services that support Iceberg. Let’s query the salesdb/orders table from Athena as shown in the following screenshot.

Upserts

The notebook then gives examples for updates and deletes, and even upserts. We use the MERGE INTO statement for upserts, which uses the source table orders_update with new and updated records:

MERGE INTO glue_catalog1.salesdb.orders target 
USING glue_catalog1.salesdb.orders_update source          
ON target.order_id = source.order_id              
WHEN MATCHED THEN 
    UPDATE SET
        order_id = source.order_id,
        product_name = source.product_name,
        product_category = source.product_category,
        qty = source.qty,
        unit_price = source.unit_price,
        order_datetime = source.order_datetime
WHEN NOT MATCHED THEN
    INSERT *
select * from glue_catalog1.salesdb.orders;

Schema evolution

We then walk through schema evolution using simple ALTER TABLE commands to add, rename, and drop columns. The following example how simple it is to rename a column:

ALTER TABLE glue_catalog1.salesdb.orders RENAME COLUMN qty TO quantity
DESC table glue_catalog1.salesdb.orders

Time travel

Iceberg also allows us to travel backward or forward by storing point-in-time snapshots. We can travel using timestamps when the snapshots were created or directly using the snapshot_id. The following is an example of a CALL statement that uses rollback_to_snapshot:

%%sql
CALL glue_catalog1.system.rollback_to_snapshot('salesdb.orders', 8008410363488501197)

We then travel forward in time by calling set_current_snapshot:

%%sql
CALL glue_catalog1.system.set_current_snapshot('salesdb.orders', 8392090950225782953)

Partition evolution

The notebook ends with an example that shows how partition evolution works in Iceberg. Iceberg stores partition information as part of the metadata. Because there is no separate partition column in the data itself, changing the partitioning scheme to hourly partitions for example is just a matter of calling a different partition transform hours(…) on an existing column order_datetime as shown in the following example:

%%sql
ALTER TABLE glue_catalog1.salesdb.orders ADD PARTITION FIELD hours(order_datetime)

You can continue to use the old partition on the old data. New data is written using the new spec in a new layout. Metadata for each of the partition versions is kept separately.

The notebook shows how you can query the table using the new hourly partition:

%%sql
SELECT * FROM glue_catalog1.salesdb.orders where hour(order_datetime)=1

You can continue to query your old data using the day() transform. There is only the original order_datetime column in the table.

%%sql
SELECT * FROM glue_catalog1.salesdb.orders where day(order_datetime)>=14

You don’t have to store additional columns to accommodate multiple partitioning schemes. The partition definitions are in the metadata, providing the flexibility to evolve and change the partition definitions in the future.

Conclusion

In this post, we introduced Apache Iceberg and explained how Iceberg solves some challenges in modern data lakes. We then walked you through how to run Iceberg on Amazon EMR using the AWS Glue Data Catalog as the metastore, and query the data using Athena. You can also run upserts on this data from Athena. There is no additional cost to using Iceberg with Amazon EMR.

For more information about Iceberg, refer to How Iceberg works. Iceberg on Amazon EMR, with its integration with AWS Analytics services, can simplify the way you process, upsert, and delete data, with full support for ACID transactions in Amazon S3. You can also implement schema evolution, partition evolution, time travel, and compaction of data.


About the Author

Sekar Srinivasan is a Sr. Specialist Solutions Architect at AWS focused on Big Data and Analytics. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions modernizing their architecture and generating insights from their data. In his spare time he likes to work on non-profit projects, especially those focused on underprivileged Children’s education.

AWS Week in Review – June 27, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-june-27-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

It’s the beginning of a new week, and I’d like to start with a recap of the most significant AWS news from the previous 7 days. Last week was special because I had the privilege to be at the very first EMEA AWS Heroes Summit in Milan, Italy. It was a great opportunity of mutual learning as this community of experts shared their thoughts with AWS developer advocates, product managers, and technologists on topics such as containers, serverless, and machine learning.

Participants at the EMEA AWS Heroes Summit 2022

Last Week’s Launches
Here are the launches that got my attention last week:

Amazon Connect Cases (available in preview) – This new capability of Amazon Connect provides built-in case management for your contact center agents to create, collaborate on, and resolve customer issues. Learn more in this blog post that shows how to simplify case management in your contact center.

Many updates for Amazon RDS and Amazon AuroraAmazon RDS Custom for Oracle now supports Oracle database 12.2 and 18c, and Amazon RDS Multi-AZ deployments with one primary and two readable standby database instances now supports M5d and R5d instances and is available in more Regions. There is also a Regional expansion for RDS Custom. Finally, PostgreSQL 14, a new major version, is now supported by Amazon Aurora PostgreSQL-Compatible Edition.

AWS WAF Captcha is now generally available – You can use AWS WAF Captcha to block unwanted bot traffic by requiring users to successfully complete challenges before their web requests are allowed to reach resources.

Private IP VPNs with AWS Site-to-Site VPN – You can now deploy AWS Site-to-Site VPN connections over AWS Direct Connect using private IP addresses. This way, you can encrypt traffic between on-premises networks and AWS via Direct Connect connections without the need for public IP addresses.

AWS Center for Quantum Networking – Research and development of quantum computers have the potential to revolutionize science and technology. To address fundamental scientific and engineering challenges and develop new hardware, software, and applications for quantum networks, we announced the AWS Center for Quantum Networking.

Simpler access to sustainability data, plus a global hackathon – The Amazon Sustainability Data Initiative catalog of datasets is now searchable and discoverable through AWS Data Exchange. As part of a new collaboration with the International Research Centre in Artificial Intelligence, under the auspices of UNESCO, you can use the power of the cloud to help the world become sustainable by participating to the Amazon Sustainability Data Initiative Global Hackathon.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A couple of takeaways from the Amazon re:MARS conference:

Amazon CodeWhisperer (preview) – Amazon CodeWhisperer is a coding companion powered by machine learning with support for multiple IDEs and languages.

Synthetic data generation with Amazon SageMaker Ground TruthGenerate labeled synthetic image data that you can combine with real-world data to create more complete training datasets for your ML models.

Some other updates you might have missed:

AstraZeneca’s drug design program built using AWS wins innovation award – AstraZeneca received the BioIT World Innovative Practice Award at the 20th anniversary of the Bio-IT World Conference for its novel augmented drug design platform built on AWS. More in this blog post.

Large object storage strategies for Amazon DynamoDB – A blog post showing different options for handling large objects within DynamoDB and the benefits and disadvantages of each approach.

Amazon DevOps Guru for RDS under the hoodSome details of how DevOps Guru for RDS works, with a specific focus on its scalability, security, and availability.

AWS open-source news and updates – A newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more.

Upcoming AWS Events
It’s AWS Summits season and here are some virtual and in-person events that might be close to you:

On June 30, the AWS User Group Ukraine is running an AWS Tech Conference to discuss digital transformation with AWS. Join to learn from many sessions including a fireside chat with Dr. Werner Vogels, CTO at Amazon.com.

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

Highlights from Git 2.37

Post Syndicated from Taylor Blau original https://github.blog/2022-06-27-highlights-from-git-2-37/

The open source Git project just released Git 2.37, with features and bug fixes from over 75 contributors, 20 of them new. We last caught up with you on the latest in Git back when 2.36 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Before we get into the details of Git 2.37.0, we first wanted to let you know that Git Merge is returning this September. The conference features talks, workshops, and more all about Git and the Git ecosystem. There is still time to submit a proposal to speak. We look forward to seeing you there!

A new mechanism for pruning unreachable objects

In Git, we often talk about classifying objects as either “reachable” or “unreachable”. An object is “reachable” when there is at least one reference (a branch or a tag) from which you can start an object walk (traversing from commits to their parents, from trees into their sub-trees, and so on) and end up at your destination. Similarly, an object is “unreachable” when no such reference exists.

A Git repository needs all of its reachable objects to ensure that the repository is intact. But it is free to discard unreachable objects at any time. And it is often desirable to do just that, particularly when many unreachable objects have piled up, you’re running low on disk space, or similar. In fact, Git does this automatically when running garbage collection.

But observant readers will notice the gc.pruneExpire configuration. This setting defines a “grace period” during which unreachable objects which are not yet old enough to be removed from the repository completely are left alone. This is done in order to mitigate a race condition where an unreachable object that is about to be deleted becomes reachable by some other process (like an incoming reference update or a push) before then being deleted, leaving the repository in a corrupt state.

Setting a small, non-zero grace period makes it much less likely to encounter this race in practice. But it leads us to another problem: how do we keep track of the age of the unreachable objects which didn’t leave the repository? We can’t pack them together into a single packfile; since all objects in a pack share the same modification time, updating any object drags them all forward. Instead, prior to Git 2.37, each surviving unreachable object was written out as a loose object, and the mtime of the individual objects was used to store their age. This can lead to serious problems when there are many unreachable objects which are too new and can’t be pruned.

Git 2.37 introduces a new concept, cruft packs, which allow unreachable objects to be stored together in a single packfile by writing the ages of individual objects in an auxiliary table stored in an *.mtimes file alongside the pack.

While cruft packs don’t eliminate the data race we described earlier, in practice they can help make it much less likely by allowing repositories to prune with a much longer grace period, without worrying about the potential to create many loose objects. To try it out yourself, you can run:

$ git gc --cruft --prune=1.day.ago

and notice that your $GIT_DIR/objects/pack directory will have an additional .mtimes file, storing the ages of each unreachable object written within the last 24 hours

$ ls -1 .git/objects/pack
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.idx
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.mtimes
pack-243103d0f640e0096edb3ef0c842bc5534a9f9a4.pack
pack-5a827af6f1a793a45c816b05d40dfd4d5f5edf28.idx
pack-5a827af6f1a793a45c816b05d40dfd4d5f5edf28.pack

There’s a lot of detail we haven’t yet covered on cruft packs, so expect a more comprehensive technical overview in a separate blog post soon.

[source]

A builtin filesystem monitor for Windows and macOS

As we have discussed often before, one of the factors that significantly impact Git’s performance is the size of your working directory. When you run git status, for example, Git has to crawl your entire working directory (in the worst case) in order to figure out which files have been modified.

Git has its own cached understanding of the filesystem to avoid this whole-directory traversal in many cases. But it can be expensive for Git to update its cached understanding of the filesystem with the actual state of the disk while you work.

In the past, Git has made it possible to integrate with tools like Watchman via a hook, making it possible to replace Git’s expensive refreshing process with a long-running daemon which tracks the filesystem state more directly.

But setting up this hook and installing a third-party tool can be cumbersome. In Git 2.37, this functionality is built into Git itself on Windows and macOS, removing the need to install an external tool and configure the hook.

You can enable this for your repository by enabling the core.fsmonitor config setting.

$ git config core.fsmonitor true

After setting up the config, an initial git status will take the normal amount of time, but subsequent commands will take advantage of the monitored data and run significantly faster.

The full implementation is impossible to describe completely in this post. Interested readers can follow along later this week with a blog post written by Jeff Hostetler for more information. We’ll be sure to add a link here when that post is published.

[source, source, source, source]

The sparse index is ready for wide use

We previously announced Git’s sparse index feature, which helps speed up Git commands when using the sparse-checkout feature in a large repository.

In case you haven’t seen our earlier post, here’s a brief refresher. Often when working in an extremely large repository, you don’t need the entire contents of your repository present locally in order to contribute. For example, if your company uses a single monorepo, you may only be interested in the parts of that repository that correspond to the handful of products you work on.

Partial clones make it possible for Git to only download the objects that you care about. The sparse index is an equally important component of the equation. The sparse index makes it possible for the index (a key data structure which tracks the content of your next commit, which files have been modified, and more) to only keep track of the parts of your repository that you’re interested in.

When we originally announced the sparse index, we explained how different Git subcommands would have to be updated individually to take advantage of the sparse index. With Git 2.37.0, all of those integrations are now included in the core Git project and available to all users.

In this release, the final integrations were for git show, git sparse-checkout, and git stash. In particular, git stash has the largest performance boost of all of the integrations so far because of how the command reads and writes indexes multiple times in a single process, achieving a near 80% speed-up in certain cases (though see this thread for all of the details).

[source, source, source]

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.37, or any previous version in the Git repository.

Tidbits

Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.

  • Speaking of sparse checkouts, this release deprecates the non---cone-mode style of sparse checkout definitions.

    For the uninitiated, the git sparse-checkout command supports two kinds of patterns which dictate which parts of your repository should be checked out: “cone” mode, and “non-cone” mode. The latter, which allows specifying individual files with a .gitignore-style syntax, can be confusing to use correctly, and has performance problems (namely that in the worst case all patterns must try to be matched with all files, leading to slow-downs). Most importantly, it is incompatible with the sparse-index, which brings the performance enhancements of using a sparse checkout to all of the Git commands you’re familiar with.

    For these reasons (and more!), the non---cone mode style of patterns is discouraged, and users are instead encouraged to use --cone mode.

    [source]

  • In our highlights from the last Git release, we talked about more flexible fsync configuration, which made it possible to more precisely define what files Git would explicitly synchronize with fsync() and what strategy it would use to do that synchronization.

    This release brings a new strategy to the list supported by core.fsyncMethod: “batch”, which can provide significant speed-ups on supported filesystems when writing many individual files. This new mode works by staging many updates to the disk’s writeback cache before preforming a single fsync() causing the disk to flush its writeback cache. Files are then atomically moved into place, guaranteeing that they are fsync()-durable by the time they enter the object directory.

    For now, this mode only supports batching loose object writes, and will only be enabled when core.fsync includes the loose-objects value. On a synthetic test of adding 500 files to the repository with git add (each resulting in a new loose object), the new batch mode imposes only a modest penalty over not fsyncing at all.

    On Linux, for example, adding 500 files takes .06 seconds without any fsync() calls, 1.88 seconds with an fsync() after each loose object write, and only .15 seconds with the new batched fsync(). Other platforms display similar speed-ups, with a notable example being Windows, where the numbers are .35 seconds, 11.18 seconds, and just .41 seconds, respectively.

    [source]

  • If you’ve ever wondered, “what’s changed in my repository since yesterday?”, one way you can figure that out is with the --since option, which is supported by all standard revision-walking commands, like log and rev-list.

    This option works by starting with the specified commits, and walking recursively along each commit’s parents, stopping the traversal as soon as it encounters a commit older than the --since date. But in occasional circumstances (particularly when there is) clock skew this can produce confusing results.

    For example, suppose you have three commits, C1, C2, and C3, where C2 is the parent of C3, and C1 is the parent of C2. If both C1 and C3 were written in the last hour, but C2 is a day old (perhaps because the committer’s clock is running slow), then a traversal with --since=1.hour.ago will only show C3, since seeing C2 causes Git to halt its traversal.

    If you expect your repository’s history has some amount of clock skew, then you can use --since-as-filter in place of --since, which only prints commits newer than the specified date, but does not halt its traversal upon seeing an older one.

    [source]

  • If you work with partial clones, and have a variety of different Git remotes, it can be confusing to remember which partial clone filter is attached to which remote.

    Even in a simple example, trying to remember what object filter was used to clone your repository requires this incantation:

    $ git config remote.origin.partialCloneFilter
    

    In Git 2.37, you can now access this information much more readily behind the -v flag of git remote, like so:

    $ git remote -v
    origin    [email protected]:git/git.git (fetch) [tree:0]
    origin    [email protected]:git/git.git (push)
    

    Here, you can easily see between the square-brackets that the remote origin uses a tree:0 filter.

    This work was contributed by Abhradeep Chakraborty, a Google Summer of Code student, who is one of three students participating this year and working on Git.

    [source]

  • Speaking of remote configuration, Git 2.37 ships with support for warning or exiting when it encounters plain-text credentials stored in your configuration with the new transfer.credentialsInUrl setting.

    Storing credentials in plain-text in your repository’s configuration is discouraged, since it forces you to ensure you have appropriately restrictive permissions on the configuration file. Aside from storing the data unencrypted at rest, Git often passes the full URL (including credentials) to other programs, exposing them on systems where other processes have access to arguments list of sensitive processes. In most cases, it is encouraged to use Git’s credential mechanism, or tools like GCM.

    This new setting allows Git to either ignore or halt execution when it sees one of these credentials by setting the transfer.credentialsInUrl to “warn” or “die” respectively. The default, “allow”, does nothing.

    [source, source]

  • If you’ve ever used git add -p to stage the contents of your working tree incrementally, then you may be familiar with git add‘s “interactive mode”, or git add -i, of which git add -p is a sub-mode.

    In addition to “patch” mode, git add -i supports “status”, “update”, “revert”, “add untracked”, “patch”, and “diff”. Until recently, this mode of git add -i was actually written in Perl. This command has been the most recent subject of a long-running effort to port Git commands written in Perl into C. This makes it possible to use Git’s libraries without spawning sub-processes, which can be prohibitively expensive on certain platforms.

    The C reimplementation of git add -i has shipped in releases of Git as early as v2.25.0. In more recent versions, this reimplementation has been in “testing” mode behind an opt-in configuration. Git 2.37 promotes the C reimplementation by default, so Windows users should notice a speed-up when using git add -p.

    [source, source, source, source, source, source, source]

  • Last but not least, there is a lot of exciting work going on for Git developers, too, like improving the localization workflow, improving CI output with GitHub Actions, and reducing memory leaks in internal APIs.

    If you’re interested in contributing to Git, now is a more exciting time than ever to start. Check out this guide for some tips on getting started.

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.37 or any previous version in the Git repository.

Building a low-code speech “you know” counter using AWS Step Functions

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-a-low-code-speech-you-know-counter-using-aws-step-functions/

This post is written by Doug Toppin, Software Development Engineer, and Kishore Dhamodaran, Solutions Architect.

In public speaking, filler phrases can distract the audience and reduce the value and impact of what you are telling them. Reviewing recordings of presentations can be helpful to determine whether presenters are using filler phrases. Instead of manually reviewing prior recordings, automation can process media files and perform a speech-to-text function. That text can then be processed to report on the use of filler phrases.

This blog explains how to use AWS Step Functions, Amazon EventBridge, Amazon Transcribe and Amazon Athena to report on the use of the common phrase “you know” in media files. These services can automate and reduce the time required to find the use of filler phrases.

Step Functions can automate and chain together multiple activities and other Amazon services. Amazon Transcribe is a speech to text service that uses media files as input and produces textual transcripts from them. Athena is an interactive query service that makes it easier to analyze data in Amazon S3 using standard SQL. Athena enables the use of standard SQL to query data in S3.

This blog shows a low-code, configuration driven approach to implementing this solution. Low-code means writing little or no custom software to perform a function. Instead, you use a configuration drive approach using service integrations where state machine tasks call AWS services using existing SDKs, APIs, or interfaces. A configuration driven approach in this example is using Step Functions’ Amazon States Language (ASL) to tie actions together rather than writing traditional code. This requires fewer details for data management and error handling combined with a visual user interface for composing the workflow. As the actions and logic are clearly defined with the visual workflow, this reduces maintenance.

Solution overview

The following diagram shows the solution architecture.

SolutionOverview

Solution Overview

  1. You upload a media file to an Amazon S3 Media bucket.
  2. The media file upload to S3 triggers an EventBridge rule.
  3. The EventBridge rule starts the Step Functions state machine execution.
  4. The state machine invokes Amazon Transcribe to process the media file.
  5. The transcription output file is stored in the Amazon S3 Transcript bucket.
  6. The state machine invokes Athena to query the textual transcript for the filler phrase. This uses the AWS Glue table to describe the format of the transcription results file.
  7. The filler phrase count determined by Athena is returned and stored in the Amazon S3 Results bucket.

Prerequisites

  1. An AWS account and an AWS user or role with sufficient permissions to create the necessary resources.
  2. Access to the following AWS services: Step Functions, Amazon Transcribe, Athena, and Amazon S3.
  3. Latest version of the AWS Serverless Application Model (AWS SAM) CLI, which helps developers create and manage serverless applications in the AWS Cloud.
  4. Test media files (for example, the Official AWS Podcast).

Example walkthrough

  1. Clone the GitHub repository to your local machine.
  2. git clone https://github.com/aws-samples/aws-stepfunctions-examples.git
  3. Deploy the resources using AWS SAM. The deploy command processes the AWS SAM template file to create the necessary resources in AWS. Choose you-know as the stack name and the AWS Region that you want to deploy your solution to.
  4. cd aws-stepfunctions-examples/sam/app-low-code-you-know-counter/
    sam deploy --guided

Use the default parameters or replace with different values if necessary. For example, to get counts of a different filler phrase, replace the FillerPhrase parameter.

GlueDatabaseYouKnowP Name of the AWS Glue database to create.
AthenaTableName Name of the AWS Glue table that is used by Athena to query the results.
FillerPhrase The filler phrase to check.
AthenaQueryPreparedStatementName Name of the Athena prepared statement used to run SQL queries on.
AthenaWorkgroup Athena workgroup to use
AthenaDataCatalog The data source for running the Athena queries
SAM Deploy

SAM Deploy

Running the filler phrase counter

  1. Navigate to the Amazon S3 console and upload an mp3 or mp4 podcast recording to the bucket named bucket-{account number}-{Region}-you-know-media.
  2. Navigate to the Step Functions console. Choose the running state machine, and monitor the execution of the transcription state machine.
  3. State Machine Execution

    State Machine Execution

  4. When the execution completes successfully, select the QueryExecutionSuccess task to examine the output and see the filler phrase count.
  5. State Machine Output

    State Machine Output

  6. Amazon Transcribe produces the transcript text of the media file. You can examine the output in the Results bucket. Using the S3 console, navigate to the bucket, choose the file matching the media file name and use ‘Query with S3 Select’ to view the content.
  7. If the transcription job does not execute, the state machine reports the failure and exits.
  8. State Machine Fail

    State Machine Fail

Exploring the state machine

The state machine orchestrates the transcription processing:

State Machine Explore

State Machine Explore

The StartTranscriptionJob task starts the transcription job. The Wait state adds a 60-second delay before checking the status of the transcription job. Until the status of the job changes to FAILED or COMPLETED, the choice state continues.

When the job successfully completes, the AthenaStartQueryExecutionUsingPreparedStatement task starts the Athena query, and stores the results in the S3 results bucket. The AthenaGetQueryResults task retrieves the count from the resultset.

The TranscribeMediaBucket holds the media files to be uploaded. The configuration sends the upload notification event to EventBridge:

      
   NotificationConfiguration:
     EventBridgeConfiguration:
       EventBridgeEnabled: true
	  

The TranscribeResultsBucket has an associated policy to provide access to Amazon Transcribe. Athena stores the output from the queries performed by the state machine in the AthenaQueryResultsBucket .

When a media upload occurs, the YouKnowTranscribeStateMachine uses Step Functions’ native event integration to trigger an EventBridge rule. This contains an event object similar to:

{
  "version": "0",
  "id": "99a0cb40-4b26-7d74-dc59-c837f5346ac6",
  "detail-type": "Object Created",
  "source": "aws.s3",
  "account": "012345678901",
  "time": "2022-05-19T22:21:10Z",
  "region": "us-east-2",
  "resources": [
    "arn:aws:s3:::bucket-012345678901-us-east-2-you-know-media"
  ],
  "detail": {
    "version": "0",
    "bucket": {
      "name": "bucket-012345678901-us-east-2-you-know-media"
    },
    "object": {
      "key": "Podcase_Episode.m4a",
      "size": 202329,
      "etag": "624fce93a981f97d85025e8432e24f48",
      "sequencer": "006286C2D604D7A390"
    },
    "request-id": "B4DA7RD214V1QG3W",
    "requester": "012345678901",
    "source-ip-address": "172.0.0.1",
    "reason": "PutObject"
  }
}

The state machine allows you to prepare parameters and use the direct SDK integrations to start the transcription job by calling the Amazon Transcribe service’s API. This integration means you don’t have to write custom code to perform this function. The event triggering the state machine execution contains the uploaded media file location.


  StartTranscriptionJob:
	Type: Task
	Comment: Start a transcribe job on the provided media file
	Parameters:
	  Media:
		MediaFileUri.$: States.Format('s3://{}/{}', $.detail.bucket.name, $.detail.object.key)
	  TranscriptionJobName.$: "$.detail.object.key"
	  IdentifyLanguage: true
	  OutputBucketName: !Ref TranscribeResultsBucket
	Resource: !Sub 'arn:${AWS::Partition}:states:${AWS::Region}:${AWS::AccountId}:aws-sdk:transcribe:startTranscriptionJob'

The SDK uses aws-sdk:transcribe:getTranscriptionJob to get the status of the job.


  GetTranscriptionJob:
	Type: Task
	Comment: Retrieve the status of an Amazon Transcribe job
	Parameters:
	  TranscriptionJobName.$: "$.TranscriptionJob.TranscriptionJobName"
	Resource: !Sub 'arn:${AWS::Partition}:states:${AWS::Region}:${AWS::AccountId}:aws-sdk:transcribe:getTranscriptionJob'
	Next: TranscriptionJobStatus

The state machine uses a polling loop with a delay to check the status of the transcription job.


  TranscriptionJobStatus:
	Type: Choice
	Choices:
	- Variable: "$.TranscriptionJob.TranscriptionJobStatus"
	  StringEquals: COMPLETED
	  Next: AthenaStartQueryExecutionUsingPreparedStatement
	- Variable: "$.TranscriptionJob.TranscriptionJobStatus"
	  StringEquals: FAILED
	  Next: Failed
	Default: Wait

When the transcription job completes successfully, the filler phrase counting process begins.

An Athena prepared statement performs the query with the transcription job name as a runtime parameter. The AWS SDK starts the query and the state machine execution pauses, waiting for the results to return before progressing to the next state:

athena:startQueryExecution.sync

When the query completes, Step Functions uses the SDK integration to retrieve the results using athena:getQueryResults:

athena:getQueryResults

It creates an Athena prepared statement to pass the transcription jobname as a parameter for the query execution:

  ResultsQueryPreparedStatement:
    Type: AWS::Athena::PreparedStatement
    Properties:
      Description: Create a statement that allows the use of a parameter for specifying an Amazon Transcribe job name in the Athena query
      QueryStatement: !Sub >-
        select cardinality(regexp_extract_all(results.transcripts[1].transcript, '${FillerPhrase}')) AS item_count from "${GlueDatabaseYouKnow}"."${AthenaTableName}" where jobname like ?
      StatementName: !Ref AthenaQueryPreparedStatementName
      WorkGroup: !Ref AthenaWorkgroup

There are several opportunities to enhance this tool. For example, adding support for multiple filler phrases. You could build a larger application to upload media and retrieve the results. You could take advantage of Amazon Transcribe’s real-time transcription API to display the results while a presentation is in progress to provide immediate feedback to the presenter.

Cleaning up

  1. Navigate to the Amazon Transcribe console. Choose Transcription jobs in the left pane, select the jobs created by this example, and choose Delete.
  2. Cleanup Delete

    Cleanup Delete

  3. Navigate to the S3 console. In the Find buckets by name search bar, enter “you-know”. This shows the list of buckets created for this example. Choose each of the radio buttons next to the bucket individually and choose Empty.
  4. Cleanup S3

    Cleanup S3

  5. Use the following command to delete the stack, and confirm the stack deletion.
  6. sam delete

Conclusion

Low-code applications can increase developer efficiency by reducing the amount of custom code required to build solutions. They can also enable non-developer roles to create automation to perform business functions by providing drag-and-drop style user interfaces.

This post shows how a low-code approach can build a tool chain using AWS services. The example processes media files to produce text transcripts and count the use of filler phrases in those transcripts. It shows how to process EventBridge data and how to invoke Amazon Transcribe and Athena using Step Functions state machines.

For more serverless learning resources, visit Serverless Land.

[$] Two memory-tiering patch sets

Post Syndicated from original https://lwn.net/Articles/898766/

Once upon a time, computers just had one type memory, so memory within
a given system was interchangeable. The arrival of non-uniform
memory access (NUMA) systems complicated the situation significantly; now
some memory was faster to access than the rest, and memory-management
algorithms had to adapt or performance would suffer. But NUMA was just the
start; today’s tiered-memory systems, which may include several tiers of
memory with different performance characteristics, are adding new
challenges. A couple of relevant patch sets currently under review help to
illustrate the types of problems that will have to be solved.

KDE Apps Mid-Year Update (KDE.news)

Post Syndicated from original https://lwn.net/Articles/899167/

Here’s an
update on recent KDE application development
on KDE.news:

KStars is probably the most feature-rich free astronomy software around and the 3.5.9
release adds some exciting new features.

HiPS (Hierarchical Progressive Surveys) is a technology that provides
progressive high resolution images of the sky at different zoom
levels. KStars fully supports online HiPS where data is downloaded from
online servers and cached to be displayed on the Sky Map.

Ojeda: Memory Safety for the World’s Largest Software Project

Post Syndicated from original https://lwn.net/Articles/899164/

Miguel Ojeda has posted an
update on the Rust-for-Linux project
.

This second year since the RFC we are looking forward to several
milestones which hopefully we will achieve:

  • More users or use cases inside the kernel, including example drivers – this is pretty important to get merged into the kernel.
  • Splitting the kernel crate and managing dependencies to allow better development.
  • Extending the current integration of the kernel documentation, testing and other tools.
  • Getting more subsystem maintainers, companies and researchers involved.
  • Seeing most of the remaining Rust features stabilized.
  • Possibly being able to start compiling the Rust code in the kernel with GCC.
  • And, of course, getting merged into the mainline kernel, which
    should make everything else easier!

API Security: Best Practices for a Changing Attack Surface

Post Syndicated from Ray Cochrane original https://blog.rapid7.com/2022/06/27/api-security-best-practices-for-a-changing-attack-surface/

API Security: Best Practices for a Changing Attack Surface

API usage is skyrocketing. According to the latest State of the API Report, API requests increased by 56% last year to a total of 855 million, and Google says the growth isn’t expected to slow any time soon.

APIs – short for application programming interfaces – are a critical component of how applications are built. They control the type of requests that occur between programs, how requests are made, and the format of those requests.

The huge increase in usage stems from the important role APIs – and web applications more broadly – play in digital transformation. APIs have helped facilitate the transition from monolithic applications to microservices. They’ve enabled businesses to provide user-oriented API-based services for B2B use cases, including automation and integration. And they’re integral to modern web applications, which are no longer just HTML with links but rich user interfaces, built as single-page apps with REST API backends. Nearly every modern application utilizes – or is – an API.

Today, it’s almost impossible to do anything online without interacting with an API. That’s why cyberattacks are increasingly targeting APIs, and they’ve become a large part of the application attack surface.

Why securing APIs is important

APIs are a lucrative target that can allow hackers to gain access to an otherwise secure system and exploit vulnerabilities. Not only do APIs often suffer from the same vulnerabilities as web applications – like broken access controls, injections, security misconfigurations, and vulnerabilities inherited from other dependent code libraries – but they are also more susceptible to resource consumption and rate limiting issues due to the automated nature of their users.

Due to a lack of knowledge in the market, it’s also common that legacy issues from early APIs are carried forward. For example, not all APIs will be fronted by an API gateway, with older APIs sitting in the background with little or no protection simply due to a lack of awareness of them. Many unused APIs will also not have been decommissioned, as newer APIs are produced and replace them as a product evolves, due to a lack of standard process and bad practice. This can leave legacy APIs vulnerable to attacks.

How to secure an API

The first step in securing your APIs is to audit your environment and/or applications to take an inventory of what APIs you have and which ones you’re actually utilizing. Then, you must understand the purpose of each individual API to allow you to validate that it is working as expected. You must also understand the expected behavior of the API to allow identification of threats more readily by being able to capture abnormal activity. Once you have a firm understanding of what the API’s functionality is and what expected behavior is, you can then both manage and test your API more effectively and efficiently.

API management is a key element for API security. APIs not only require the same controls as web apps but also additional controls specific to the API’s unique function. Documentation and version control of APIs is of vital importance, as one product can have multiple APIs – even hundreds or thousands.

Poor management can lead to issues with legacy and defunct APIs, as you will often find that only a small portion of APIs pass through an API gateway. Meanwhile, older APIs – which haven’t been decommissioned, or which teams simply aren’t aware of – can sit in the background with no protection. The probability of known vulnerabilities with older APIs is also significantly higher, which amplifies the risk profile.

The same legacy issues can also lead to coverage gaps, and calls that are outside of an API gateway could leave a blind spot when it comes to intra-API calls. Publishing and clearly defining your API will simplify users’ understanding of the API, allowing them to connect in the most appropriate and effective way. Ensuring your API is appropriately monitored is a key management technique. Continuously undertaking performance checks will allow you to understand if the API is under stress from being overloaded. It can also provide an indication of traffic volumes to monitor usage, potentially gauge malicious activity (via audit logs), and judge whether you need to scale up your operation. Lastly, having a response plan in place for attacks is a vital control in API security, allowing for a rapid but controlled response to potential threats.

There have been many recent examples of API-based attacks, such as those experienced by WordPress – and even on the dating scene with Bumble’s recent vulnerability issues. Some simple but effective steps you can take to secure your API and reduce the risk of such exposures include:

  • Authentication: Do you have a control in place to understand who’s calling your API?
  • Authorization: Should the person calling be able to access this data?
  • Encryption: Have you encrypted your network traffic?
  • Traffic management: Have you set rate limits or thresholds to keep a customer from pulling too much data or running scripts to tie up an API?
  • Audit logging: Effective logging ensures you can understand what normal traffic looks like and allows you to identify abnormal activity.

How to test your API

API testing is still evolving to keep up with the increase in volume and complexity. While manual API security testing can be done with traditional testing tools, and fully automated API security testing is partially supported by most major DAST solutions, there are many open-source tools written for guided API security testing. API testing used in conjunction with proper API management will increase API security.

API testing is most effective when you have a full risk profile of your business – i.e. you are fully aware of all of your APIs (including legacy or defunct APIs) to ensure you have no blindspots that could be exposed or manipulated. Taking time to identify vulnerabilities in API frameworks, your network, configuration, and policy all enhance your API security.

Anticipating threats by understanding expected behavior and having adequate testing in place will allow for proactive coverage and enhanced protection and threat identification.

Finally, you must continuously test your endpoint to ensure protection is maintained at all times and optimum security is in place. The ability to identify and block security risks before they occur is vital in the fight to provide the best protection against threats to your API.

Additional reading:

NEVER MISS A BLOG

Get the latest stories, expertise, and news about security today.

The collective thoughts of the interwebz

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close