Ukraine Intercepting Russian Soldiers’ Cell Phone Calls

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/12/ukraine-intercepting-russian-soldiers-cell-phone-calls.html

They’re using commercial phones, which go through the Ukrainian telecom network:

“You still have a lot of soldiers bringing cellphones to the frontline who want to talk to their families and they are either being intercepted as they go through a Ukrainian telecommunications provider or intercepted over the air,” said Alperovitch. “That doesn’t pose too much difficulty for the Ukrainian security services.”

[…]

“Security has always been a mess, both in the army and among defence officials,” the source said. “For example, in 2013 they tried to get all the staff at the ministry of defence to replace our iPhones with Russian-made Yoto smartphones.

“But everyone just kept using the iPhone as a second mobile because it was much better. We would just keep the iPhone in the car’s glove compartment for when we got back from work. In the end, the ministry gave up and stopped caring. If the top doesn’t take security very seriously, how can you expect any discipline in the regular army?”

This isn’t a new problem and it isn’t a Russian problem. Here’s a more general article on the problem from 2020.

What’s Up, Home? – Catching the Northern Lights

Post Syndicated from Janne Pikkarainen original https://blog.zabbix.com/whats-up-home-catching-the-northern-lights/24836/

Can you monitor Northern Lights with Zabbix? Of course, you can! By day, I am a monitoring tech lead in a global cyber security company. By night, I monitor my home with Zabbix & Grafana and do some weird experiments with them. Welcome to my blog about this project.

Christmas is coming, and (at least if you believe Hollywood movies) part of that magic would be staring at the sky and marvelling at the Northern Lights. Well, in practice you probably won’t see them, as even if the Northern Lights were up there, a thick layer of clouds would probably prevent you from seeing them. Or you live in an area with so many street lights that you can’t see the sky properly.

We have tried to watch them several times with my wife, but our attempts all over the years and all the seasons have failed so far. But, for the sake of the Christmas spirit, let’s imagine you could actually see the lights.

Getting the data

There are probably actual APIs for getting the data — at first, I went to NASA’s open data site but then quickly gave up; there’s so much data that I would not have an actual idea how to start parsing this beautiful sky flames phenomenon.

Admitting my lameness, I next came up with plan B. The Finnish Meteorological Institute has this page for space weather & Northern Lights predictions. Sorry, the page is all in Finnish, so it likely looks like an alien language to you. Anyway, there’s this snippet that shows the probability of Northern Lights tonight (“Tänä yönä”), tomorrow (“Huomenna”) and the day after tomorrow (“Ylihuomenna”).

Is that some kind of advanced form of encryption? No, that’s just the Finnish language for you.

Making it work

But how to parse that? Well, of course, with Zabbix, that is easy with the HTTP Agent item type. It allows you to grab website content and then perform all the advanced processing for the data you would expect from Zabbix item preprocessing.

Then, using dependent items — one for tonight, one for tomorrow, one for the day after tomorrow — and item preprocessing we can extract the interesting bits.

And see, it works!
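Outside Zabbix, the same extraction can be sketched in a few lines of Python. This is only an illustration of the idea behind the dependent items and their preprocessing: the HTML fragment below is made up (the post does not show the real page markup), and the function and dictionary key names are assumptions.

```python
import re

# Hypothetical HTML fragment; the real page's markup is not shown in the post.
SAMPLE_HTML = """
<div class="forecast">
  <span>Tänä yönä</span> <b>pieni</b>
  <span>Huomenna</span> <b>pienehkö</b>
  <span>Ylihuomenna</span> <b>pieni</b>
</div>
"""

# The three Finnish labels mentioned in the post, one per dependent item.
LABELS = {
    "tonight": "Tänä yönä",
    "tomorrow": "Huomenna",
    "day_after_tomorrow": "Ylihuomenna",
}


def extract_probabilities(html):
    """Return the first word that follows each label once tags are stripped."""
    text = re.sub(r"<[^>]+>", " ", html)  # drop HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse whitespace runs
    result = {}
    for key, label in LABELS.items():
        match = re.search(re.escape(label) + r"\s+(\w+)", text)
        result[key] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    print(extract_probabilities(SAMPLE_HTML))
```

In Zabbix itself, the master HTTP Agent item would fetch the page once, and each dependent item would apply its own regular expression in preprocessing.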

I also created a (still boring-looking) dashboard, which shows me the current values.

The problem I now have is that I don’t know all the values the page could contain — when I created this blog post, the chances of seeing the Northern Lights were small (“pieni”) or smallish (“pienehkö”). Well, I keep checking my dashboard from now on! For now, I could create triggers that would alert me if the values would be something else than “pieni” or “pienehkö”, but did not have time for that yet.
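The trigger idea can be expressed as a simple membership test. The following is a rough Python sketch of that logic, not a Zabbix trigger expression; the function name is an assumption, and the only known values are the two seen so far.

```python
# The only values observed on the page so far, according to the post.
KNOWN_VALUES = {"pieni", "pienehkö"}


def needs_alert(value):
    """Return True when the page shows anything other than the known values."""
    return value.strip().lower() not in KNOWN_VALUES


if __name__ == "__main__":
    for sample in ("pieni", "pienehkö", "something else"):
        print(sample, "->", needs_alert(sample))
```

Alerting on "anything unexpected" rather than on a specific word avoids having to know the page's full vocabulary in advance.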

I have been working at Forcepoint since 2014 and I bring many Nordic values to the company, even though I’m not lucky with the Northern Lights. — Janne Pikkarainen

This post was originally published on the author’s LinkedIn account.

MariaDB plc has gone public on NYSE!

Post Syndicated from original http://monty-says.blogspot.com/2022/12/mariadb-plc-has-gone-public-on-nyse.html

Yesterday was a big day for me and everyone involved in MariaDB when MariaDB Corporation (now MariaDB plc) was listed on the New York Stock Exchange! (NYSE:MRDB)

This has been a long journey, starting from the exodus of MySQL developers from Sun Microsystems to Monty Program in 2009 as part of the announcement of Oracle acquiring Sun.

One year later, there was a second exodus of MySQL personnel from Oracle, consisting of sales people, support engineers and support, to SkySQL Ab. SkySQL Ab was founded by Open Ocean Capital.

Monty Program Ab and SkySQL worked together to ensure that SkySQL customers and MariaDB users would have the best possible experience using MariaDB. In 2013 this resulted in a merger of the two companies into the resulting company, MariaDB Corporation Ab. Since then a lot of other very talented people have joined MariaDB.

During the above time, I also was part of creating MariaDB Foundation to ensure that the MariaDB server would always remain free software.

Now going public, together with Angel Pond Holdings Corporation, is the next step on the journey. This will enable us to put more resources into developing MariaDB server to solve even harder problems for more demanding customers and MariaDB users. I am looking forward to spending a lot more years working on the MariaDB server.

Lastly, I would like to give great thanks to everyone who has been part of this incredible journey and to our customers whose trust in MariaDB has made this next step possible!

How we use GitHub to be more productive, collaborative, and secure

Post Syndicated from Mike Hanley original https://github.blog/2022-12-20-how-we-use-github-to-be-more-productive-collaborative-and-secure/

It’s that time of year where we’re all looking back at what we’ve accomplished and thinking ahead to goals and plans for the calendar year to come. As part of GitHub Universe, I shared some numbers that provided a window into the work our engineering and security teams drive each day on behalf of our community, customers, and Hubbers. As someone who loves data, it’s not just fun to see how we operate GitHub at scale, but it’s also rewarding to see how this work contributes to our vision to be the home for all developers–which includes our own engineering and security teams.

Over the course of the past year¹, GitHub staff made millions of commits across all of our internal repositories. That’s a ton of branches, pull requests, Issues, and more. We processed billions of API requests daily. And we ran tens of thousands of production deployments across the internal apps that power GitHub’s services. If you do the math, that’s hundreds of deploys per day.

GitHub is big. But the reality is, no matter your size, your scale, or your stage, we’re all dealing with the same questions. Those questions boil down to how to optimize for productivity, collaboration, and, of course, security.

It’s a running joke internally that you have to type “GitHub” three times to get to the monolith. So, let’s take a look at how we at GitHub (1) use GitHub (2) to build the GitHub (3) you rely on.

Productivity

GitHub’s cloud-powered experiences, namely Codespaces and GitHub Copilot, have been two of the biggest game changers for us in the past few years.

Codespaces

It’s no secret that local development hasn’t evolved much in the past decade. The github/github repository, where much of what you experience on GitHub.com lives, is fairly large and takes several minutes to clone even on a good network connection. Combined with setting up dependencies and getting your environment the way you like it, going from checkout to a built local developer environment used to take 45 minutes.

But now, with Codespaces, a few clicks and less than 60 seconds later, you’re in a working development environment that’s running on faster hardware than the MacBook I use daily.

Heating my home office in the chilly Midwest with my laptop doing a local build was nice, but it’s a thing of the past. Moving to Codespaces last year has truly impacted our day-to-day developer experience, and we’re not looking back.

GitHub Copilot

We’ve been using GitHub Copilot for more than a year internally, and it still feels like magic to me every day. We recently published a study that looked at GitHub Copilot performance across two groups of developers–one that used GitHub Copilot and one that didn’t. To no one’s surprise, the group that used GitHub Copilot was able to complete the same task 55% faster than the group that didn’t have GitHub Copilot.

Getting the job done faster is great, but the data also provided incredible insight into developer satisfaction. Almost three-quarters of the developers surveyed said that GitHub Copilot helped them stay in the flow and spend more time focusing on the fun parts of their jobs. When was the last time you adopted an experience that made you love your job more? It’s an incredible example of putting developers first that has completely changed how we build here at GitHub.

Collaboration

At GitHub, we’re remote-first and we have highly distributed teams, so we prioritize discoverability and how we keep teams up-to-date across our work. That’s where tools like Issues and projects come into play. They allow us to plan, track, and collaborate in a centralized place that’s right next to the code we’re working on.

Incorporating projects across our security team has made it easier for us to not only track our work, but also to help people understand how their work fits into the company’s broader mission and supports our customers.

Projects gives us a big picture view of our work, but what about the more tactical discovery of a file, function, or new feature another team is building? When you’re working on a massive 15-year-old codebase (looking at you, GitHub), sometimes you need to find code that was written well before you even joined the company, and that can feel like trying to find a needle in a haystack.

So, we’ve adopted the new code search and code view, which has helped our developers quickly find what they need without losing velocity. This improved discoverability, along with the enhanced organization offered by Issues and projects, has had huge implications for our teams in terms of how we’ve been able to collaborate across groups.

Shifting security left

Like we saw when we looked at local development environments, the security industry still struggles with the same issues that have plagued us for more than a decade. Exposed credentials, as an example, are still the root cause for more than half of all data breaches today². Phishing is still the best, and cheapest, way for an adversary to get into organizations and wreak havoc. And we’re still pleading with organizations to implement multi-factor authentication to keep the most basic techniques from bad actors at bay.

It’s time to build security into everything we do across the developer lifecycle.

The software supply chain starts with the developer. Normalizing the use of strong authentication is one of the most important ways that we at GitHub, the home of open source, can help defend the entire ecosystem against supply chain attacks. We enforce multi-factor authentication with security keys for our internal developers, and we’re requiring that every developer who contributes software on GitHub.com enable 2FA by the end of next year. The closer we can bring our security and engineering teams together, the better the outcomes and security experiences we can create together.

Another way we do that is by scaling the knowledge of our security teams with tools like CodeQL to create checks that are deployed for all our developers, protecting all our users. And because the CodeQL queries are open source, the vulnerability patterns shared by security teams at GitHub or by our customers end up as CodeQL queries that are then available for everyone. This acts like a global force multiplier for security knowledge in the developer and security communities.

Security shouldn’t be gatekeeping your teams from shipping. It should be the process that enables them to ship quickly–remember our hundreds of production deployments per day?–and with confidence.

Big, small, or in-between

As you see, GitHub has the same priorities as any other development team out there.

It doesn’t matter if you’re processing billions of API requests a day, like we are, or if you’re just starting on that next idea that will be launched into the world.

These are just a few ways over the course of the last year that we’ve used GitHub to build our own platform securely and improve our own developer experiences, not only to be more productive, collaborative, and secure, but to be creative, to be happier, and to build the best work of our lives.

To learn more about how we use GitHub to build GitHub, and to see demos of the features highlighted here, take a look at this talk from GitHub Universe 2022.

Notes


  1. Data collected January-October 2022 
  2. Verizon DBIR 

[$] Beyond microblogging with ActivityPub

Post Syndicated from original https://lwn.net/Articles/918224/

ActivityPub-enabled microblogs are gaining
popularity as a replacement for Twitter, but ActivityPub is for more than
just microblogging. Many other popular services also have open-source
alternatives that speak ActivityPub. Proprietary services operated by
commercial interests usually deliberately limit interoperability, but users
of any ActivityPub-enabled service should be able to communicate with each
other, even if they are using different services. This promise of
interoperability is often limited in practice, though; while ActivityPub
specifies how multiple types of content
can be published, the kinds of content that can be
displayed or interacted with vary from project to project.

Stream VPC flow logs to Amazon OpenSearch Service via Amazon Kinesis Data Firehose

Post Syndicated from Chaitanya Shah original https://aws.amazon.com/blogs/big-data/stream-vpc-flow-logs-to-amazon-opensearch-service-via-amazon-kinesis-data-firehose/

Amazon Virtual Private Cloud (Amazon VPC) flow logs enable you to track the IP traffic going to and from the network interfaces in your VPC for your workloads. Analyzing VPC logs helps you understand how your applications are communicating over your VPC network with log records and acts as a main source of information to the network in your VPC. After collecting the flow logs, the next step is performing log analysis to understand user or application behavior and patterns to make informed decisions. You can analyze logs using log analytics tools such as Amazon OpenSearch Service.

Amazon Kinesis Data Firehose is a fully managed service for delivering near real-time streaming data to various destinations for storage and performing near real-time analytics. With its extensible data transformation capabilities, you can also streamline log processing and log delivery pipelines into a single Firehose delivery stream.

Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. Amazon OpenSearch is an open source, distributed search and analytics suite. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), as well as visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). Amazon OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management processing trillions of requests per month.

In this post, you will learn how to ingest VPC flow logs with Kinesis Data Firehose and deliver them to an Amazon OpenSearch Service for analysis using OpenSearch Service Dashboards.

Overview of solution

This solution uses native integration of VPC flow logs streaming to Kinesis Data Firehose. We use a Firehose delivery stream to buffer the streamed VPC flow logs, and deliver those to an OpenSearch Service destination endpoint. We use Amazon OpenSearch Service Dashboards to create an index pattern for the VPC flow logs to analyze and visualize the logs in near-real time. The following diagram illustrates this architecture.

Solution Architecture

We walk you through the following high-level steps:

  1. Create an OpenSearch Service domain for storing and analyzing the VPC flow logs.
  2. Create a Firehose delivery stream to deliver the flow logs to the OpenSearch Service domain.
  3. Create a VPC flow log subscription to the delivery stream.
  4. Explore VPC flow logs in OpenSearch Service Dashboards
    • Create role mapping with an OpenSearch Service user to the Kinesis Data Firehose service role. Because we’re using a public access domain for OpenSearch Service, we have to map the delivery stream AWS Identity and Access Management (IAM) role to the OpenSearch Service primary user to deliver logs in bulk to the OpenSearch Service domain.
    • Create an index pattern in OpenSearch Service Dashboards to enable analysis and visualization of VPC logs.

Prerequisites

As a prerequisite, you need to create an Amazon Simple Storage Service (Amazon S3) bucket to store the Firehose delivery stream backups and failed logs.

Create an Amazon OpenSearch Service domain

For demonstration purposes, and to limit the costs, we create an OpenSearch Service domain with the Development and testing deployment type and public access to the dashboard. For instructions, refer to Create an Amazon OpenSearch Service domain. Note that we select Public access only for demo purposes. For production, we recommend using VPC access for security reasons.

When it’s complete, the OpenSearch Service domain shows as Active.

OpenSearch Domain

Create a Kinesis Data Firehose delivery stream

Now that your Amazon OpenSearch Service domain is active, you can create a Firehose delivery stream where VPC flow logs are streamed.

  1. On the Amazon Kinesis console, choose Kinesis Data Firehose in the navigation pane, then choose Create delivery stream.
  2. Choose Direct PUT as the source and set the destination as Amazon OpenSearch Service.
  3. For Delivery stream name, enter PUT-OPENSEARCH-STREAM-DEMO.
  4. In the Destination settings section, choose Browse and choose the previously created Amazon OpenSearch Service domain.
  5. For Index name, enter vpcflowlogs.
  6. For Index rotation, choose Every day.
  7. For this post, we set Buffer size to 5 and Buffer interval to 900. You can modify these settings to optimize ingestion throughput and near-real-time behavior.
  8. In the Backup settings section, for Source record backup in Amazon S3, select Failed events only so you only save the data that fails to deliver to Amazon OpenSearch Service.
  9. For S3 bucket, choose Browse and choose the S3 bucket you created to store failed logs and backups.
  10. Optionally, you can input a prefix for backup files and error files.
  11. Select GZIP for Compression for data records.
  12. For Encryption for data records, select Disabled.
  13. Expand Advanced settings, and for Amazon CloudWatch error logging, select Enabled.
  14. Choose Create delivery stream.

When the delivery stream is active, proceed to the next step.
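The console choices above map fairly directly onto the parameters of the Firehose `CreateDeliveryStream` API. The following is a minimal sketch, assuming you would pass the resulting dictionary to boto3's `create_delivery_stream`; the function name, the ARNs in the usage example, and the CloudWatch log group and stream names are placeholders, not values from this post.

```python
def delivery_stream_request(domain_arn, role_arn, bucket_arn):
    """Build keyword arguments mirroring the console settings chosen above."""
    return {
        "DeliveryStreamName": "PUT-OPENSEARCH-STREAM-DEMO",
        "DeliveryStreamType": "DirectPut",  # Direct PUT source
        "AmazonopensearchserviceDestinationConfiguration": {
            "RoleARN": role_arn,
            "DomainARN": domain_arn,
            "IndexName": "vpcflowlogs",
            "IndexRotationPeriod": "OneDay",  # "Every day" in the console
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 900},
            "S3BackupMode": "FailedDocumentsOnly",  # failed events only
            "S3Configuration": {
                "RoleARN": role_arn,
                "BucketARN": bucket_arn,
                "CompressionFormat": "GZIP",
            },
            "CloudWatchLoggingOptions": {
                "Enabled": True,
                # Hypothetical log group and stream names.
                "LogGroupName": "/aws/kinesisfirehose/PUT-OPENSEARCH-STREAM-DEMO",
                "LogStreamName": "DestinationDelivery",
            },
        },
    }


if __name__ == "__main__":
    request = delivery_stream_request(
        domain_arn="arn:aws:es:us-east-1:123456789012:domain/demo-domain",
        role_arn="arn:aws:iam::123456789012:role/firehose-demo-role",
        bucket_arn="arn:aws:s3:::demo-firehose-backup-bucket",
    )
    # With credentials configured, this could be sent with:
    #   boto3.client("firehose").create_delivery_stream(**request)
    print(request["DeliveryStreamName"])
```

Keeping the request as plain data makes it easy to review or diff the settings before anything is created.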

Create a VPC flow logs subscription

Now you create a VPC flow logs subscription for the Firehose delivery stream you created in the previous step.

  1. On the Amazon VPC console, choose Your VPCs.
  2. Select the VPC for which to create the flow log.
  3. On the Actions menu, choose Create flow log.
  4. For Filter, select All to send all flow log records to Amazon OpenSearch Service.

If you want to filter the flow logs, you can select either Accept or Reject.

  5. For Maximum aggregation interval, select 10 minutes or the minimum setting of 1 minute if you need the flow log data to be available for near-real-time analysis in Amazon OpenSearch Service.
  6. For Destination, select Send to Kinesis Firehose in the same account if the delivery stream is set up on the same account where you create the VPC flow logs.
  7. For Log record format, if you leave it at AWS default format, the flow logs are sent as version 2 format.

Alternatively, you can specify which fields you need the flow logs to capture and send to an Amazon OpenSearch Service. For more information on log format and available fields, refer to Flow log records.

  8. Choose Create flow log.
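For readers who script their setup, the same subscription can be described as parameters for the EC2 `CreateFlowLogs` API. This is a sketch under assumptions: the function name and the IDs in the usage example are placeholders, and the dictionary is meant to be passed to boto3's `create_flow_logs`.

```python
def flow_log_request(vpc_id, delivery_stream_arn, near_real_time=False):
    """Build keyword arguments mirroring the console steps above."""
    return {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": "ALL",  # or "ACCEPT" / "REJECT" to filter
        "LogDestinationType": "kinesis-data-firehose",
        "LogDestination": delivery_stream_arn,
        # Aggregation interval in seconds: 60 (1 minute) or 600 (10 minutes).
        "MaxAggregationInterval": 60 if near_real_time else 600,
    }


if __name__ == "__main__":
    request = flow_log_request(
        vpc_id="vpc-0123456789abcdef0",
        delivery_stream_arn=(
            "arn:aws:firehose:us-east-1:123456789012:"
            "deliverystream/PUT-OPENSEARCH-STREAM-DEMO"
        ),
        near_real_time=True,
    )
    # With credentials configured, this could be sent with:
    #   boto3.client("ec2").create_flow_logs(**request)
    print(request)
```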

Now let’s explore the VPC flow logs in Amazon OpenSearch Service.
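Before looking at the dashboards, it can help to know what a raw record contains. The sketch below splits one space-separated record in the AWS default (version 2) format into named fields; the sample line and function name are illustrative, and how the record is wrapped when it arrives in OpenSearch Service is not covered here.

```python
# Field order of the AWS default (version 2) flow log format.
FIELDS = (
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
)

# Illustrative record: SSH traffic (port 22, protocol 6 = TCP) that was accepted.
SAMPLE_RECORD = (
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)


def parse_record(line):
    """Map a default-format record onto field names; values stay strings."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


if __name__ == "__main__":
    record = parse_record(SAMPLE_RECORD)
    print(record["srcaddr"], "->", record["dstaddr"], record["action"])
```

Values are left as strings because fields can contain `-` when no data was recorded for the interval.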

Explore VPC flow logs in Amazon OpenSearch Service Dashboards

In the final step, we set up OpenSearch Service Dashboards to explore the VPC flow logs.

  1. On the OpenSearch Service console, choose Domains in the navigation pane.
  2. Choose the domain you created.
  3. Under OpenSearch Dashboards URL, choose the link to open a new tab.
  4. Log in with the user you created during OpenSearch Service domain setup.
  5. Select Private for Select your tenant, then choose Confirm.

Because we used a public access domain for OpenSearch Service, you need to map the role created for the Firehose delivery stream to the OpenSearch Service Dashboards user, so that the delivery stream can deliver logs in bulk to the OpenSearch Service domain.

  1. On the menu icon, choose Security.
  2. Choose Roles.
  3. Choose the all_access role.
  4. On the Mapped users tab, choose Manage mapping.
  5. For Backend roles, enter the IAM role ARN created for the Firehose delivery stream.
  6. Choose Map.
  7. Now that mapping is complete, choose the menu icon, then choose Stack management.
  8. Choose Index Patterns, then choose Create index pattern.
  9. For Index pattern name, enter vpcflowlogs*.
  10. Choose Next step.
  11. Navigate to the Discover menu option. You can see the VPC flow logs from your VPC in this dashboard. Now you can search and visualize the flow logs that are being streamed in near-real time to the OpenSearch Service domain.
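The role mapping done in the Security menu above can also be expressed as a request to the OpenSearch security plugin's REST API (`PUT _plugins/_security/api/rolesmapping/all_access`). The sketch below only builds the request body; the function name and the ARN are placeholders. Because a PUT replaces the whole mapping, the sketch merges the new backend role into whatever mapping already exists rather than overwriting it.

```python
import json

# REST path of the all_access role mapping in the OpenSearch security plugin.
ROLE_MAPPING_PATH = "_plugins/_security/api/rolesmapping/all_access"


def merged_role_mapping(existing, firehose_role_arn):
    """Return a mapping body that keeps existing entries and adds the ARN."""
    backend_roles = list(existing.get("backend_roles", []))
    if firehose_role_arn not in backend_roles:
        backend_roles.append(firehose_role_arn)
    return {
        "backend_roles": backend_roles,
        "hosts": list(existing.get("hosts", [])),
        "users": list(existing.get("users", [])),
    }


if __name__ == "__main__":
    # Hypothetical current mapping: only the primary user is mapped.
    current = {"users": ["admin"], "backend_roles": []}
    body = merged_role_mapping(
        current, "arn:aws:iam::123456789012:role/firehose-demo-role"
    )
    print("PUT", ROLE_MAPPING_PATH)
    print(json.dumps(body, indent=2))
```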

Clean up

After you test out this solution, remember to delete all the resources you created to avoid incurring future charges:

  1. Delete your Amazon OpenSearch Service domain.
  2. Delete the VPC flow logs subscription.
  3. Delete the Firehose delivery stream.
  4. Delete the S3 bucket for the VPC flow logs backup and failed logs.
  5. If you created a new VPC and new resources in the VPC, delete the resources and VPC.

Conclusion

In this post, we walked through how to integrate VPC flow logs with a Kinesis Data Firehose delivery stream, deliver them to an Amazon OpenSearch Service destination with no code, and visualize them in OpenSearch Service Dashboards.

Try this new quick and hassle-free way of sending your VPC flow logs to an Amazon OpenSearch Service using Kinesis Data Firehose.


About the Author

Chaitanya Shah is a Sr. Technical Account Manager with AWS, based out of New York. He has over 22 years of experience working with enterprise customers. He loves to code and actively contributes to the AWS solutions labs to help customers solve complex problems. He provides guidance to AWS customers on best practices for their AWS Cloud migrations. He is also specialized in AWS data transfer and the data and analytics domain.

How to get best price performance from your Amazon Redshift Data Sharing deployment

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/how-to-get-best-price-performance-from-your-amazon-redshift-data-sharing-deployment/

Amazon Redshift is a fast, scalable, secure, and fully-managed data warehouse that enables you to analyze all of your data using standard SQL easily and cost-effectively. Amazon Redshift Data Sharing allows customers to securely share live, transactionally consistent data in one Amazon Redshift cluster with another Amazon Redshift cluster across accounts and regions without needing to copy or move data from one cluster to another.

Amazon Redshift Data Sharing was initially launched in March 2021, and support for cross-account data sharing was added in August 2021. The cross-region support became generally available in February 2022. This provides full flexibility and agility to share data across Redshift clusters in the same AWS account, different accounts, or different regions.

Amazon Redshift Data Sharing is used to fundamentally redefine Amazon Redshift deployment architectures into a hub-spoke, data mesh model to better meet performance SLAs, provide workload isolation, perform cross-group analytics, easily onboard new use cases, and most importantly do all of this without the complexity of data movement and data copies. Some of the most common questions asked during data sharing deployment are, “How big should my consumer clusters and producer clusters be?”, and “How do I get the best price performance for workload isolation?”. As workload characteristics like data size, ingestion rate, query pattern, and maintenance activities can impact data sharing performance, a continuous strategy to size both consumer and producer clusters to maximize the performance and minimize cost should be implemented. In this post, we provide a step-by-step approach to help you determine your producer and consumer clusters sizes for the best price performance based on your specific workload.

Generic consumer sizing guidance

The following steps show the generic strategy to size your producer and consumer clusters. You can use it as a starting point and modify it accordingly to cater to your specific use case scenario.

Size your producer cluster

You should always make sure that you properly size your producer cluster to get the performance that you need to meet your SLA. You can leverage the sizing calculator from the Amazon Redshift console to get a recommendation for the producer cluster based on the size of your data and query characteristic. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started, and you should test running your full workload on the initial size cluster and elastic resize the cluster up and down accordingly to get the best price performance.

Size and set up the initial consumer cluster

You should always size your consumer cluster based on your compute needs. One way to get started is to follow the generic cluster sizing guide similar to the producer cluster above.

Set up Amazon Redshift data sharing

Set up data sharing from producer to consumer once you have both the producer and consumer clusters set up. Refer to this post for guidance on how to set up data sharing.

Test consumer only workload on initial consumer cluster

Test consumer only workload on the new initial consumer cluster. This can be done by pointing consumer applications, for example ETL tools, BI applications, and SQL clients, to the new consumer cluster and rerunning the workload to evaluate the performance against your requirements.

Test consumer only workload on different consumer cluster configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either continue to use this cluster configuration or you can test on smaller configurations to see if you can further reduce the cost and still get the performance that you need.

On the other hand, if the initial size consumer cluster fails to meet your workload performance requirements, then you can further test larger configurations to get the configuration that meets your SLA.

As a rule of thumb, size up the consumer cluster by 2x the initial cluster configuration incrementally until it meets your workload requirements.

Once you plan out what configuration you want to test, use elastic resize to resize the initial cluster to the target cluster configuration. After elastic resize is completed, perform the same workload test and evaluate the performance against your SLA. Select the configuration that meets your price performance target.
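The 2x rule of thumb amounts to a simple doubling search. The following is a minimal sketch of that loop, assuming a hypothetical `run_workload` callback that resizes the cluster to a given node count, replays the consumer workload, and returns its runtime in seconds; none of these names come from Amazon Redshift itself.

```python
def smallest_doubling_size(initial_nodes, max_nodes, run_workload, sla_seconds):
    """Double the node count until the workload meets the SLA.

    Returns the first node count that meets the SLA, or None if even
    max_nodes is not enough.
    """
    nodes = initial_nodes
    while nodes <= max_nodes:
        if run_workload(nodes) <= sla_seconds:
            return nodes
        nodes *= 2  # rule of thumb: size up by 2x each round
    return None


if __name__ == "__main__":
    # Toy stand-in for a real test run: runtime shrinks linearly with nodes.
    def fake_run(nodes):
        return 3600 / nodes

    print(smallest_doubling_size(2, 32, fake_run, sla_seconds=500))
```

In practice each call to `run_workload` is an elastic resize followed by a full workload test, so the number of rounds matters; doubling keeps that number small.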

Test producer only workload on different producer cluster configurations

Once you move your consumer workload to the consumer cluster with the optimum price performance, there might be an opportunity to reduce the compute resource on the producer to save on costs.

To achieve this, you can rerun the producer only workload on 1/2x of the original producer size and evaluate the workload performance. Depending on the result, resize the cluster up or down, then select the minimum producer configuration that meets your workload performance requirements.

Re-evaluate after a full workload run over time

As Amazon Redshift continues evolving, and there are continuous performance and scalability improvement releases, data sharing performance will continue improving. Furthermore, numerous variables might impact the performance of data sharing queries. The following are just some examples:

  • Ingestion rate and amount of data change
  • Query pattern and characteristic
  • Workload changes
  • Concurrency
  • Maintenance activities, for example vacuum, analyze, and ATO

This is why you must re-evaluate the producer and consumer cluster sizing using the strategy above on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Automated sizing solutions

If your environment involves a more complex architecture, for example with multiple tools or applications (BI, ingestion or streaming, ETL, data science), then it might not be feasible to use the manual method from the generic guidance above. Instead, you can leverage solutions in this section to automatically replay the workload from your production cluster on the test consumer and producer clusters to evaluate the performance.

Simple Replay utility will be leveraged as the automated solution to guide you through the process of getting the right producer and consumer clusters size for the best price performance.

Simple Replay is a tool for conducting a what-if analysis and evaluating how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. It also includes enhanced support for replaying data ingestion and export pipelines with COPY and UNLOAD statements. To get started and replay your workloads, download the tool from the Amazon Redshift GitHub repository.

Here we walk through the steps to extract your workload logs from the source production cluster and replay them in an isolated environment. This lets you perform a direct comparison between these Amazon Redshift clusters seamlessly and select the cluster configuration that best meets your price performance target.

The following diagram shows the solution architecture.

Architecture for testing Simple Replay

Solution walkthrough

Follow these steps to go through the solution to size your consumer and producer clusters.

Size your production cluster

You should always make sure to properly size your existing production cluster to get the performance that you need to meet your workload requirements. You can use the sizing calculator from the Amazon Redshift console to get a recommendation on the production cluster based on the size of your data and your query characteristics. Look for Help me choose on the console in AWS Regions that support RA3 node types to use this sizing calculator. Note that this is just an initial recommendation to get started. You should test running your full workload on the initially sized cluster and use elastic resize to scale the cluster up or down accordingly to get the best price performance.

Identify the workload to be isolated

You might have different workloads running on your original cluster, but the first step is to identify the most critical workload to the business that we want to isolate. This is because we want to make sure that the new architecture can meet your workload requirements. This post is a good reference on a data sharing workload isolation use case that can help you decide which workload can be isolated.

Set up Simple Replay

Once you know your critical workload, you must enable audit logging in the production cluster where that workload is running to capture query activities and store them in Amazon Simple Storage Service (Amazon S3). Note that it may take up to three hours for the audit logs to be delivered to Amazon S3. Once the audit log is available, set up Simple Replay and then extract the critical workload from the audit log. Note that you can use start_time and end_time as parameters to select only the critical workload if it runs in a certain time period, for example 9am to 11am. Otherwise, Simple Replay extracts all of the logged activities.
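The effect of a start_time and end_time window can be pictured with a small sketch. It does not call Simple Replay; the record layout, the sample data, and the function name filter_by_window are assumptions made purely for illustration.

```python
from datetime import datetime, timezone


def filter_by_window(records, start_time=None, end_time=None):
    """Keep records whose timestamp lies in [start_time, end_time].

    A bound of None leaves that side of the window open, so passing
    neither bound returns every record (the "extract everything" case).
    """
    kept = []
    for rec in records:
        ts = rec["recorded_at"]
        if start_time is not None and ts < start_time:
            continue
        if end_time is not None and ts > end_time:
            continue
        kept.append(rec)
    return kept


def utc(hour, minute=0):
    # Hypothetical helper: a timestamp on an arbitrary sample day.
    return datetime(2022, 12, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical audit-log entries; real audit logs have a different layout.
SAMPLE_LOG = [
    {"recorded_at": utc(8, 30), "user": "etl_user", "sql": "SELECT 1"},
    {"recorded_at": utc(9, 15), "user": "bi_user", "sql": "SELECT 2"},
    {"recorded_at": utc(10, 45), "user": "bi_user", "sql": "SELECT 3"},
    {"recorded_at": utc(12, 5), "user": "etl_user", "sql": "SELECT 4"},
]

critical = filter_by_window(SAMPLE_LOG, start_time=utc(9), end_time=utc(11))
```

Here critical holds only the two entries logged between 9am and 11am, which mirrors how a time window narrows the extraction to the critical workload.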

Baseline workload

Create a baseline cluster with the same configuration as the producer cluster by restoring from the production snapshot. The purpose of starting with the same configuration is to baseline the performance with an isolated environment.

Once the baseline cluster is available, replay the extracted workload in the baseline cluster. The output from this replay will be the baseline used to compare against subsequent replays on different consumer configurations.

Set up initial producer and consumer test clusters

Create a producer cluster with the same configuration as the production cluster by restoring from the production snapshot. Create a consumer cluster with the recommended initial consumer size from the previous guidance. Then set up data sharing between the producer and consumer.

Replay workload on initial producer and consumer

Replay the producer only workload on the initial size producer cluster. This can be achieved using the “Exclude” filter parameter to exclude consumer queries, for example the user that runs consumer queries.

Replay the consumer only workload on the initial size consumer cluster. This can be achieved using the “Include” filter parameter to include only consumer queries, for example by specifying the user that runs consumer queries.

Evaluate the performance of these replays against the baseline and workload performance requirements.
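One way to make this evaluation concrete is to compare per-query elapsed times from a replay against the baseline replay. The sketch below assumes you have already collected elapsed seconds per query into dictionaries; the function name compare_to_baseline, the input layout, and the 10% slowdown threshold are illustrative assumptions, not part of Simple Replay.

```python
from statistics import mean


def compare_to_baseline(baseline_secs, replay_secs, max_slowdown=1.10):
    """Compare two replays query by query.

    baseline_secs / replay_secs: dict of query id -> elapsed seconds.
    max_slowdown: largest acceptable replay/baseline ratio (assumed SLA).
    """
    common = sorted(set(baseline_secs) & set(replay_secs))
    if not common:
        raise ValueError("no queries in common between the two replays")

    # Ratio above 1.0 means the query ran slower than in the baseline.
    ratios = {q: replay_secs[q] / baseline_secs[q] for q in common}
    regressions = {q: r for q, r in ratios.items() if r > max_slowdown}

    return {
        "mean_ratio": mean(ratios.values()),
        "regressions": regressions,
        "meets_target": not regressions,
    }


report = compare_to_baseline(
    baseline_secs={"q1": 10.0, "q2": 20.0},
    replay_secs={"q1": 9.0, "q2": 30.0},
)
```

In this made-up example q2 runs 1.5 times slower than the baseline, so the report flags it as a regression even though q1 improved. A real evaluation would use whatever metric your workload requirements are stated in, such as percentile latency or total runtime.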

Replay consumer workload on different configurations

If the initial size consumer cluster meets or exceeds your workload performance requirements, then you can either use this cluster configuration or you can follow these steps to test on smaller configurations to see if you can further reduce costs and still get the performance that you need.

Compare initial consumer performance results against your workload requirements:

  1. If the result exceeds your workload performance requirements, then you can reduce the size of the consumer cluster incrementally, starting with 1/2x, retry the replay and evaluate the performance, then resize up or down accordingly based on the result until it meets your workload requirements. The purpose is to get a sweet spot where you’re comfortable with the performance requirements and get the lowest price possible.
  2. If the result fails to meet your workload performance requirements, then you can increase the size of the cluster incrementally, starting with 2x the original size, retry the replay and evaluate the performance until it meets your workload performance requirements.
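The two cases above can be sketched as a simple search loop. This is a simplified illustration that only halves or doubles the node count; the callback meets_requirements stands in for "resize the cluster, replay the workload, and check the results", and its name, along with the min_nodes and max_nodes bounds, is a hypothetical choice.

```python
def find_passing_size(initial_nodes, meets_requirements,
                      min_nodes=2, max_nodes=128):
    """Return the smallest tested node count that passes, or None.

    Only sizes reached by repeated halving or doubling are tested, so
    the result is a starting point for finer tuning, not a guaranteed
    optimum.
    """
    nodes = initial_nodes

    if meets_requirements(nodes):
        # Case 1: requirements met, so try smaller clusters to cut cost.
        best = nodes
        while nodes // 2 >= min_nodes:
            nodes //= 2
            if not meets_requirements(nodes):
                break  # too small; keep the last size that passed
            best = nodes
        return best

    # Case 2: requirements missed, so grow until the replay passes.
    while nodes * 2 <= max_nodes:
        nodes *= 2
        if meets_requirements(nodes):
            return nodes
    return None  # nothing within max_nodes met the requirements


# Stand-in check: pretend the workload needs at least 3 nodes.
chosen = find_passing_size(8, lambda n: n >= 3)
```

With the stand-in check, the search tries 8, 4, and 2 nodes and settles on 4. After such a coarse pass you could test sizes between the last failing and last passing configuration to narrow in on the sweet spot.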

Replay producer workload on different configurations

Once you split your workloads out to consumer clusters, the load on the producer cluster should be reduced and you should evaluate your producer cluster’s workload performance to seek the opportunity to downsize to save on costs.

The steps are similar to the consumer replay. Use elastic resize to change the producer cluster size incrementally, starting with 1/2x the original size, replay the producer only workload and evaluate the performance, and then resize up or down until it meets your workload performance requirements. The purpose is to find a sweet spot where you’re comfortable with the workload performance and get the lowest price possible. Once you have the desired producer cluster configuration, replay the consumer workloads on the consumer cluster again to make sure that the performance wasn’t impacted by the producer cluster configuration changes. Finally, you should replay both producer and consumer workloads concurrently to make sure that the performance is achieved in a full workload scenario.

Re-evaluate after a full workload run over time

Similar to the generic guidance, you should re-evaluate the producer and consumer cluster sizing using the previous strategy on occasion, especially after a full workload deployment, to gain the new best price performance from your cluster’s configuration.

Clean up

Running these sizing tests in your AWS account may have some cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save on costs. We also recommend pausing your clusters when they’re not in use.
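As a rough sketch of this cleanup, the helper below pauses or deletes a list of test clusters. It expects a client object exposing pause_cluster and delete_cluster with the keyword arguments shown, which is how the boto3 Amazon Redshift client is called; the helper name and the cluster identifiers are hypothetical.

```python
def clean_up_clusters(redshift_client, cluster_ids, delete=False):
    """Pause (default) or delete each test cluster.

    Returns a list of (cluster_id, action) tuples for logging.
    """
    actions = []
    for cluster_id in cluster_ids:
        if delete:
            # Skips the final snapshot: only do this for throwaway
            # test clusters restored from an existing snapshot.
            redshift_client.delete_cluster(
                ClusterIdentifier=cluster_id,
                SkipFinalClusterSnapshot=True,
            )
            actions.append((cluster_id, "deleted"))
        else:
            redshift_client.pause_cluster(ClusterIdentifier=cluster_id)
            actions.append((cluster_id, "paused"))
    return actions


# Example usage (requires boto3 and AWS credentials; identifiers are
# made up):
#   import boto3
#   client = boto3.client("redshift")
#   clean_up_clusters(client, ["test-producer", "test-consumer"],
#                     delete=True)
```

Passing the client in as an argument keeps the helper easy to dry-run with a stub before pointing it at a real account.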

Applying Amazon Redshift and data sharing best practices

Proper sizing of both your producer and consumer clusters will give you a good start to get the best price performance from your Amazon Redshift deployment. However, sizing isn’t the only factor that can maximize your performance. In this case, understanding and following best practices are equally important.

General Amazon Redshift performance tuning best practices are applicable to data sharing deployment. Make sure that your deployment follows these best practices.

There are numerous data sharing specific best practices that you should follow to make sure that you maximize the performance. Refer to this post for more details.

Summary

There is no one-size-fits-all recommendation on producer and consumer cluster sizes. It varies by workload and by your performance SLA. The purpose of this post is to provide you with guidance for how you can evaluate your specific data sharing workload performance to determine both consumer and producer cluster sizes to get the best price performance. Consider testing your workloads on producer and consumer clusters using Simple Replay before adopting a configuration in production.


About the Authors

BP Yau is a Sr Product Manager at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Sidhanth Muralidhar is a Principal Technical Account Manager at AWS. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them architect workloads for costs, reliability, performance and operational excellence at scale in their cloud journey. He has a keen interest in Data Analytics as well.

GnuPG 2.4.0 released

Post Syndicated from original https://lwn.net/Articles/918269/

Version 2.4.0 of the GNU Privacy Guard has been released. “Exactly 25 years ago the very first release of GnuPG was published. We are pleased to take this opportunity to announce the availability of a new stable GnuPG release: version 2.4.0.” Changes in this release include full support for the key database daemon, some performance improvements, a change to AES256 as the default cipher, and much more.

Combining research and practice to evaluate and improve computing education in non-formal settings

Post Syndicated from Bonnie Sheppard original https://www.raspberrypi.org/blog/research-practice-evaluate-improve-computing-education-non-formal-settings-seminar/

In the final seminar in our series on cross-disciplinary computing, Dr Tracy Gardner and Rebecca Franks, who work here at the Foundation, described the framework underpinning the Foundation’s non-formal learning pathways. They also shared insights from our recently published literature review about the impact that non-formal computing education has on learners.

Tracy and Rebecca both have extensive experience in teaching computing, and they are passionate about inspiring young learners and broadening access to computing education. In their work here, they create resources and content for learners in coding clubs and young people at home.

How non-formal learning creates opportunities for computing education

UNESCO defines non-formal learning as “institutionalised, intentional, and planned… an addition, alternative, and/or complement to formal education within the process of life-long learning of individuals”. In terms of computing education, this kind of learning happens in after-school programmes or children’s homes as they engage with materials that have been carefully designed by education providers.

At the Raspberry Pi Foundation, we support two global networks of free, volunteer-led coding clubs where regular non-formal learning takes place: Code Club, teacher- and volunteer-led coding clubs for 9- to 13-year-olds taking place in schools in more than 160 countries; and CoderDojo, volunteer-led programming clubs for young people aged 7–17 taking place in community venues and offices in 100 countries. Through free learning resources and other support, we enable volunteers to run their club sessions, offering versatile opportunities and creative, inclusive spaces for young people to learn about computing outside of the school curriculum. Volunteers who run Code Clubs or CoderDojos report that participating in the club sessions positively impacts participants’ programming skills and confidence.

Rebecca and Tracy are part of the team here that writes the learning resources young people in Code Clubs and CoderDojos (and beyond) use to learn to code and create technology. 

Helping learners make things that matter to them

Rebecca started the seminar by describing how the team reviewed existing computing pedagogy research into non-formal learning, as well as large amounts of website visitor data and feedback from volunteers, to establish a new framework for designing and creating coding resources in the form of learning paths.

What the Raspberry Pi Foundation takes into account when creating non-formal learning resources: what young people are making, young people's interests, research, user data, our own experiences as educators, the Foundation's other educational offers, ideas of purpose-driven computing.
What the Raspberry Pi Foundation takes into account when creating non-formal learning resources. Click to enlarge.

As Rebecca explained, non-formal learning paths should be designed to bridge the so-called ‘Turing tar-pit’: the gap between what learners want to do, and what they have the knowledge and resources to achieve.

The Raspberry Pi Foundation's non-formal learning resources bridge the so-called Turing tar pit, in which learners get stuck when they feel everything is possible to create, but nothing is easy.

To prevent learners from getting frustrated and ultimately losing interest in computing, learning paths need to:

  • Be beginner-friendly
  • Include scaffolding
  • Support learners’ design skills
  • Relate to things that matter to learners

When Rebecca and Tracy’s team create new learning paths, they first focus on the things that learners want to make. Then they work backwards to bridge the gap between learners’ big ideas and the knowledge and skills needed to create them. To do this, they use the 3…2…1…Make! framework they’ve developed.

An illustration of the 3-2-1 structure of the new Raspberry Pi Foundation coding project paths.
An illustration of the 3…2…1…Make! structure of the new Raspberry Pi Foundation non-formal learning paths.

Learning paths designed according to the framework are made up of three different types of project in a 3-2-1 structure:

  • Three Explore projects to introduce creators to a set of skills and provide step-by-step instructions to help them develop initial confidence
  • Two Design projects to allow creators to practise the skills they learned in the previous Explore projects, and to express themselves creatively while they grow in independence
  • One Invent project where creators use their skills to meet a project brief for a particular audience

You can learn more about the framework in this blog post and this guide for adults who run sessions with young people based on the learning paths. And you can explore the learning paths yourself too.

Rebecca and Tracy’s team have created several new learning pathways based on the 3…2…1…Make! framework and received much positive feedback on them. They are now looking to develop more tools and libraries to support learners, to increase the accessibility of the paths, and also to conduct research into the impact of the framework. 

New literature review of non-formal computing education showcases its positive impact

In the second half of the seminar, Tracy shared what the research literature says about the impact of non-formal learning. She and researchers at the Foundation particularly wanted to find out what the research says about computing education for K–12 in non-formal settings. They systematically reviewed 421 papers, identifying 88 papers from the last seven years that related to empirical research on non-formal computing education for young learners. Based on these 88 papers, they summarised the state of the field in a literature review.

So far, most studies of non-formal computing education have looked at knowledge and skill development in computing, as well as affective factors such as interest and perception. The cognitive impact of non-formal education has been generally positive. The papers Tracy and the research team reviewed suggested that regular learning opportunities, such as weekly Code Clubs, were beneficial for learners’ knowledge development, and that active teaching of problem-solving skills can lead to learners’ independence.

In the literature review the Raspberry Pi Foundation team conducted, most research studies were university-organised on projects to broaden participation and interest development in immersive multi-day settings.

Non-formal computing education also seems to be beneficial in terms of affective factors (although it is unclear yet whether the benefits remain long-term, since most existing research studies conducted have been short-term ones). For example, out-of-school programmes can lead to more positive perception and increased awareness of computing for learners, and also boost learners’ confidence and self-efficacy if they have had little prior experience of computing. The social aspects of participating in coding clubs should not be underestimated, as learners can develop a sense of belonging and support as they work with their peers and mentors.

The affordances of non-formal computing activities that complement formal education: access and awareness, cultural relevance and equity, practice and personalisation, fun and engagement, community and identity, immediate impact.

The literature review showed that non-formal computing complements formal in-school education in many ways. Code Clubs and CoderDojos can be accessible and equitable spaces for all young people, because the people who run them can tailor learning to the individuals. Coding clubs such as these also succeed in making computing fun and engaging by enabling a community to form and allowing learners to make things that are meaningful to them.

What existing studies in non-formal computing aren’t telling us

Another thing the literature review made obvious is that there are big gaps in the existing understanding of non-formal computing education that need to be researched in more detail. For example, most of the studies the papers in the literature review described took place with female students in middle schools in the US.

That means the existing research tells us little about non-formal learning:

  • In other geographic locations
  • In other educational settings, such as primary schools or after-school programmes
  • For a wider spectrum of learners

We would also love to see studies that home in on:

  • The long-term impact of non-formal learning
  • Which specific factors contribute to positive outcomes
  • Non-formal learning about aspects of computing beyond programming

3…2…1…research!

We’re excited to continue collaborating within the Foundation so that our researchers and our team creating non-formal learning content can investigate the impact of the 3…2…1…Make! framework.

At Coolest Projects, a group of people explore a coding project.
The aim of the 3…2…1…Make! framework is to enable young people to create things and solve problems that matter to them using technology.

This collaboration connects two of our long-term strategic goals: to engage millions of young people in learning about computing and how to create with digital technologies outside of school, and to deepen our understanding of how young people learn about computing and how to create with digital technologies, and to use that knowledge to increase the impact of our work and advance the field of computing education. Based on our research, we will iterate and improve the framework, in order to enable even more young people to realise their full potential through the power of computing and digital technologies. 

Join our seminar series on primary computing education

From January, you can join our new monthly seminar series on primary (K–5) teaching and learning. In this series, we’ll hear insights into how our youngest learners develop their computing knowledge, so whether you’re a volunteer in a coding club, a teacher, a researcher, or simply interested in the topic, we’d love to see you at one of these monthly online sessions.

The first seminar, on Tuesday 10 January at 5pm UK time, will feature researchers and educators Dr Katie Rich and Carla Strickland. They will share findings on how to teach children about variables, one of the most difficult aspects of computing for young learners. Sign up now, and we will send you notifications and joining links for each seminar session.

We look forward to seeing you soon, and to discussing with you how we can apply research results to better support all our learners.

The post Combining research and practice to evaluate and improve computing education in non-formal settings appeared first on Raspberry Pi.

Cengage LTI Session Management Leakage

Post Syndicated from Tod Beardsley original https://blog.rapid7.com/2022/12/20/cengage-lti-session-management-leakage/

Prior to December 10, 2022, Cengage, an education technology provider in use in many higher education environments, primarily in the United States, had two issues in the way it handled session management over its Learning Tools Interoperability (LTI) pipeline.

The first issue involves leaving unexpectedly long-lived sessions and accompanying login links in the end user’s browser history as well as via cached GET requests, which could be used by unauthenticated attackers to impersonate the user. This appears to be an instance of CWE-525, "Use of Web Browser Cache Containing Sensitive Information." This issue is estimated to have a CVSSv3 score of 4.5 (Medium). A fix for this issue is expected in March of 2023.

The second issue involves a failure to check the LTI launch signature from connected applications, which could allow an authenticated attacker to impersonate another user. This appears to be an instance of CWE-347, "Improper Verification of Cryptographic Signature." This issue is estimated to have a CVSSv3 score of 6.5 (Medium). Note, this issue has been fixed by the vendor.

Product Description

Cengage is an education technology provider offering digital products including eTextbooks, homework tools and online learning platforms (such as WebAssign). Cengage’s online learning platforms integrate with Learning Management Systems (LMS). For more information about Cengage’s LMS integrations, please visit the vendor’s website.

Credit

This issue was discovered by Tony Porterfield, Principal Cloud Solutions Architect at Rapid7, while observing a family member use the application as an end-user. It is being disclosed in accordance with Rapid7’s vulnerability disclosure policy.

Exploitation

For the CWE-525 issue, it was observed that the "Cengage Single Sign-On" link was usable in the browser history, even though the user had already logged out of the application:

Cengage LTI Session Management Leakage

Clicking the circled link would log the user back in, and was active for at least an hour post-logout.

For the CWE-347 issue, it was observed that the signature check on an LTI launch request to https://gateway.cengage.com/rest/launchBasicLTI responds with a hidden form post containing the LTI parameters from https://gateway.cengage.com/launchBasicLTI/smartlink/basicLTILaunch.gwy as well as a field signatureVerified that is set to false if the signature is invalid. An end user could alter this response by setting signatureVerified to true, as shown below:

Cengage LTI Session Management Leakage

Once modified, the LTI session context would then be accepted by the server as authentic.
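The underlying principle for CWE-347 is that the server must recompute and check the signature itself and never rely on a flag that has passed through the client. The sketch below illustrates that principle only. It uses a simplified HMAC-SHA256 over sorted parameters, which is an assumption for brevity and not the signing scheme LTI specifies, and it does not represent Cengage's implementation; all names and parameters are hypothetical.

```python
import hashlib
import hmac
from urllib.parse import quote

# Fields the client can set freely; they must never influence the check.
CLIENT_CONTROLLED_FIELDS = {"signature", "signatureVerified"}


def canonical_message(params):
    """Build a deterministic byte string from the signed parameters."""
    items = sorted(
        (k, v) for k, v in params.items()
        if k not in CLIENT_CONTROLLED_FIELDS
    )
    encoded = "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in items
    )
    return encoded.encode("utf-8")


def sign(params, shared_secret):
    """Compute the expected signature with the server-held secret."""
    return hmac.new(
        shared_secret, canonical_message(params), hashlib.sha256
    ).hexdigest()


def launch_is_authentic(params, shared_secret):
    """Verify server-side; ignore any client-supplied 'verified' flag."""
    claimed = params.get("signature", "")
    expected = sign(params, shared_secret)
    # Constant-time comparison avoids leaking how many characters match.
    return hmac.compare_digest(claimed.encode(), expected.encode())


secret = b"hypothetical-shared-secret"
launch = {"user_id": "alice", "roles": "Learner"}
launch["signature"] = sign(launch, secret)

# An attacker swaps the user and claims the signature was verified.
tampered = dict(launch, user_id="bob", signatureVerified="true")
```

With this design, launch_is_authentic(launch, secret) accepts the untouched launch, while the tampered copy is rejected regardless of what signatureVerified says, because the flag is never consulted.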

Impact

For the CWE-525 issue involving cached credentials, an attacker wishing to impersonate an authenticated user would either need to have access to the browser session of the targeted user, or access to network proxy logs which can cache these tokens (thus, implying a privileged position either locally or on the local network). Assuming this is the case, the attacker could go on to read and alter personal information involving the student. It appears possible to similarly hijack the sessions of instructors or administrators, but this hasn’t been tested or confirmed directly.

For the CWE-347 issue involving signature verification, an authenticated attacker may be able to assume sessions belonging to other users, possibly including other students, instructors, and administrators.

Vendor Statement

Cengage is continuously implementing and refining measures aimed to protect the privacy and security of the customer information entrusted to us. We value the contributions of security researchers like you, as reviews like this help us strengthen our security posture. We use multiple tools and processes, including DAST, SAST, penetration testing and VDP to identify and address security defects in our software.

[The CWE-347 issue] was fixed on December 9, 2022. The second issue is currently in remediation, and we plan to launch a fix as soon as possible.

We continually update and regularly identify ways we can improve our products both to better the learning experience for students and instructors and to ensure information remains secure. If you have found something you would like to report, please submit at https://bugcrowd.com/cengage-vdp.

Remediation

In December of 2022, Cengage released an update to its webassign.net service to address the CWE-347 (signature verification) issue, and is developing a fix for the CWE-525 issue, which we expect to see in March of 2023. Since this is a SaaS-based/cloud-hosted solution, end users, implementers, and integrators should not need to do anything to update or patch locally to address the signature verification issue — the latest version of the LTI implementation will already be available. Beyond this fix, end users have nothing to do to ensure they’re safe from impersonation, as they’re reliant on the provider to properly verify signatures.

While the fix for the SSO issue is being developed, end users would be wise to completely sign out of the local computer when finished with whatever academic tasks they were performing, as this would prevent opportunistic attackers from using stored session tokens locally. This is generally good advice anyway, even after the CWE-525 issue is resolved.

Preventing session tokens exposed through GET requests from being cached on a web proxy is more difficult for the average user, but as long as no untrusted users have access to proxy logs, the risk from exploiting proxy-cached session tokens is remote (an attacker who does have access to web proxies used by students, instructors, and administrators tends to already have more expansive offensive options).

Disclosure Timeline

  • September, 2022: Issue discovered by Tony Porterfield
  • Mon, Sep 12, 2022: Contacted Bugcrowd, Cengage’s VDP provider, to work out a disclosure agreement
  • Mon, Sep 19, 2022: Contact attempted to alternate contacts at Cengage
  • Wed, Sep 21, 2022: Agreement reached on disclosure terms
  • Thu, Sep 22, 2022: Vulnerability details communicated to Cengage via Bugcrowd with a public disclosure goal of November 15, 2022.
  • Fri, Sep 23, 2022: Triage begun at Bugcrowd (Issue 4464ed3d-3fb6-4287-ba46-786d21bebad0)
  • Sep 23 – Oct 25, 2022: Triage and reproduction work continued
  • Oct 26, 2022: Bugcrowd verified reproduction of the report
  • Nov 1, 2022: Extended disclosure time by approximately 30 days
  • Thu, Dec 15, 2022: Vendor provided update on fix status
  • Tue, Dec 20, 2022: This public disclosure

Linux Mint 21.1 (“Vera”) released

Post Syndicated from original https://lwn.net/Articles/918251/

Linux Mint has announced the release of version 21.1 of the distribution in three editions: Cinnamon (what’s new), MATE (what’s new), and Xfce (what’s new).
Mint 21.1 is based on Ubuntu 22.04 and uses kernel version 5.15.

Linux Mint 21.1 is a long term support release which will be supported until 2027. It comes with updated software and brings refinements and many new features to make your desktop even more comfortable to use.

Trojaned Windows Installer Targets Ukraine

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/12/trojaned-windows-installer-targets-ukraine.html

Mandiant is reporting on a trojaned Windows installer that targets Ukrainian users. The installer was left on various torrent sites, presumably ensnaring people downloading pirated copies of the operating system:

Mandiant uncovered a socially engineered supply chain operation focused on Ukrainian government entities that leveraged trojanized ISO files masquerading as legitimate Windows 10 Operating System installers. The trojanized ISOs were hosted on Ukrainian- and Russian-language torrent file sharing sites. Upon installation of the compromised software, the malware gathers information on the compromised system and exfiltrates it. At a subset of victims, additional tools are deployed to enable further intelligence gathering. In some instances, we discovered additional payloads that were likely deployed following initial reconnaissance including the STOWAWAY, BEACON, and SPAREPART backdoors.

One obvious solution would be for Microsoft to give the Ukrainians Windows licenses, so they don’t have to get their software from sketchy torrent sites.

ICYMI: 10 cybersecurity acronyms you should know in 2023

Post Syndicated from Drew Burton original https://blog.rapid7.com/2022/12/20/icymi-10-cybersecurity-acronyms-you-should-know-in-2023/

Cybersecurity is acronym-heavy to say the least. If you’re reading this, you already know. From CVE to FTP, we in IT love our abbreviations, FR FR. Truthfully though, it can be a bit much, and even the nerdiest among us miss a few. So, In Case You Missed It, here are 10 cybersecurity acronyms you should know IRL, err in 2023.

HUMINT

Peppermint on a sticky day? How dare you. HUMINT is short for Human Intelligence. This abbreviation refers to information collected by threat researchers from sources across the clear, deep and dark web. Real people doing real things, you might say. These folks are out there hunting down potential threats and stopping them before they occur. Pretty cool stuff, TBH.

CSPM

Cloud Security Posture Management tools include use cases for compliance assessment, operational monitoring, DevOps integrations, incident response, risk identification, and risk visualization. Good posture: so hot RN.

IAM

Not the guy with the green eggs, this IAM stands for Identity and Access Management. CSO Online says IAM is a “set of processes, policies, and tools for defining and managing the roles and access privileges of individual network entities (users and devices) to a variety of cloud and on-premises applications”. Green Eggs and Ham didn’t age well IMO, Sam was kind of a bully. JK JK.

XDR

AKA Extended Detection and Response. Forrester calls XDR the “evolution of endpoint detection and response”. Gartner says it’s integrating “multiple security products into a cohesive security operations system”. Essentially, XDR is about taking a holistic approach to more efficient, effective detection and response. It’s definitely not an Xtreme Dude Ranch. That’s just absurd.

XSPM

According to Hacker News, “Extended Security Posture Management is a multilayered process combining the capabilities of Attack Surface Management (ASM), Breach and Attack Simulation (BAS), Continuous Automated Red Teaming (CART), and Purple Teaming to continuously evaluate and score the infrastructure’s overall cyber resiliency.” Yes, that definition includes three additional acronyms. Plus, one of them is CART, SMH.

RASP

Runtime application self-protection tools can block malicious activity while an application is in production. If RASP detects a security event such as an attempt to run a shell, open a file, or call a database, it will automatically attempt to terminate that action, NBD.

MDR

Managed Detection and Response providers deliver technology and human expertise to perform threat hunting, monitoring, and response. The main benefit of MDR is that it helps organizations limit the impact of threats without the need for additional staffing. In other words, they are free to TCB instead of worrying about security stuff.

MSSP

A Managed Security Service Provider provides outsourced monitoring and management of security devices and systems. MSSPs deliver managed firewall, intrusion detection, virtual private network, vulnerability scanning, and other services. Oh BTW, sometimes MSSPs partner with MDR vendors to deliver services to their customers.

DAST

Dynamic Application Security Testing is the process of analyzing a web application to find vulnerabilities through simulated attacks. DAST is all about finding vulnerabilities in web applications and correcting them before they can be exploited by threat actors. A dastardly deed conducted with no ill will … if you will.

WAF

A Web Application Firewall is a type of firewall that filters, monitors, and blocks HTTP traffic to and from a web service. It is designed to prevent attacks exploiting a web application’s known vulnerabilities, such as SQL injection, cross-site scripting, file inclusion, and improper system configuration. Proper WAF definition there, zero Cardi B jokes. Those are NSFW.

Run fault tolerant and cost-optimized Spark clusters using Amazon EMR on EKS and Amazon EC2 Spot Instances

Post Syndicated from Kinnar Kumar Sen original https://aws.amazon.com/blogs/big-data/run-fault-tolerant-and-cost-optimized-spark-clusters-using-amazon-emr-on-eks-and-amazon-ec2-spot-instances/

Amazon EMR on EKS is a deployment option in Amazon EMR that allows you to run Spark jobs on Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances save you up to 90% over On-Demand Instances, and are a great way to cost optimize the Spark workloads running on Amazon EMR on EKS. Because Spot is an interruptible service, moving or reusing the intermediate shuffle files after an interruption improves the overall stability and SLA of the job. The latest versions of Amazon EMR on EKS have integrated Spark features to enable this capability.

In this post, we discuss these features—Node Decommissioning and Persistent Volume Claim (PVC) reuse—and their impact on increasing the fault tolerance of Spark jobs on Amazon EMR on EKS when cost optimizing using EC2 Spot Instances.

Amazon EMR on EKS and Spot

EC2 Spot Instances are spare EC2 capacity provided at a steep discount of up to 90% over On-Demand prices. Spot Instances are a great choice for stateless and flexible workloads. The caveat with this discount and spare capacity is that Amazon EC2 can interrupt an instance with a proactive or reactive (2-minute) warning when it needs the capacity back. You can provision compute capacity in an EKS cluster with Spot Instances through a managed or self-managed node group to cost optimize your workloads.

Amazon EMR on EKS uses Amazon EKS to run jobs with the EMR runtime for Apache Spark, which can be cost optimized by running the Spark executors on Spot. It provides up to 61% lower costs and up to 68% performance improvement for Spark workloads on Amazon EKS. The Spark application launches a driver and executors to run the computation. Spark is a semi-fault tolerant framework that is resilient to executor loss due to an interruption and therefore can run on EC2 Spot. On the other hand, when the driver is interrupted, the job fails. Hence, we recommend running drivers on On-Demand Instances. Some of the best practices for running Spark on Amazon EKS also apply to Amazon EMR on EKS.
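One way to express the driver/executor split is with role-specific node selectors. The sketch below is not from the original post. It assumes that the Spark version in use supports the `spark.kubernetes.driver.node.selector.*` and `spark.kubernetes.executor.node.selector.*` properties, and that nodes carry the `eks.amazonaws.com/capacityType` label applied by EKS managed node groups. Other provisioners label nodes differently, so treat the label key and values as assumptions to check against your cluster. The helper name is hypothetical.

```python
def capacity_selectors(
    label_key="eks.amazonaws.com/capacityType",  # assumed node label key
    driver_value="ON_DEMAND",                    # assumed label value for On-Demand nodes
    executor_value="SPOT",                       # assumed label value for Spot nodes
):
    """Return Spark properties that pin the driver and executors to different capacity types."""
    return {
        f"spark.kubernetes.driver.node.selector.{label_key}": driver_value,
        f"spark.kubernetes.executor.node.selector.{label_key}": executor_value,
    }


if __name__ == "__main__":
    for key, value in capacity_selectors().items():
        print(f'"{key}": "{value}"')
```

A pod template with a node selector or affinity rule is another way to get the same placement; the properties above are only one option.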

EC2 Spot Instances also help with cost optimization by improving the overall throughput of the job. This can be achieved by auto scaling the cluster using Cluster Autoscaler (for managed node groups) or Karpenter.

Though Spark executors are resilient to Spot interruptions, the shuffle files and RDD data are lost when the executor is killed. The lost shuffle files need to be recomputed, which increases the overall runtime of the job. Apache Spark has released two features (in versions 3.1 and 3.2) that address this issue. Amazon EMR on EKS released features such as node decommissioning (version 6.3) and PVC reuse (version 6.8) to simplify recovery and reuse shuffle files, which increases the overall resiliency of your application.

Node decommissioning

The node decommissioning feature works by preventing scheduling of new jobs on the nodes that are to be decommissioned. It also moves any shuffle files or cache present in those nodes to other executors (peers). If there are no other available executors, the shuffle files and cache are moved to a remote fallback storage.

Fig 1: Node Decommissioning

Let’s look at the decommission steps in more detail.

If one of the nodes that is running executors is interrupted, the executor starts the process of decommissioning and sends the message to the driver:

21/05/05 17:41:41 WARN KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor 7 decommissioned message
21/05/05 17:41:41 DEBUG TaskSetManager: Valid locality levels for TaskSet 2.0: NO_PREF, ANY
21/05/05 17:41:41 INFO KubernetesClusterSchedulerBackend: Decommission executors: 7
21/05/05 17:41:41 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_2.0, runningTasks: 10
21/05/05 17:41:41 INFO BlockManagerMasterEndpoint: Mark BlockManagers (BlockManagerId(7, 192.168.82.107, 39007, None)) as being decommissioning.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Decommission executor 1.
21/05/05 20:22:17 INFO CoarseGrainedExecutorBackend: Will exit when finished decommissioning
21/05/05 20:22:17 INFO BlockManager: Starting block manager decommissioning process...
21/05/05 20:22:17 DEBUG FileSystem: Looking for FS supporting s3a

The executor looks for RDD or shuffle files and tries to replicate or migrate those files. It first tries to find a peer executor. If successful, it will move the files to the peer executor:

22/06/07 20:41:38 INFO ShuffleStatus: Updating map output for 46 to BlockManagerId(4, 192.168.13.235, 34737, None)
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle data block update for 0 46, ignore.
22/06/07 20:41:38 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 46, updating.

However, if it is not able to find a peer executor, it tries to move the files to fallback storage, if available.

Fig 2: Fallback Storage

The executor is then decommissioned. When a new executor comes up, the shuffle files are reused:

22/06/07 20:42:50 INFO BasicExecutorFeatureStep: Adding decommission script to lifecycle
22/06/07 20:42:50 DEBUG ExecutorPodsAllocator: Requested executor with id 19 from Kubernetes.
22/06/07 20:42:50 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-bfd0a5813fd1b80f-exec-19, action ADDED
22/06/07 20:42:50 DEBUG BlockManagerMasterEndpoint: Received shuffle index block update for 0 52, updating.
22/06/07 20:42:50 INFO ShuffleStatus: Recover 52 BlockManagerId(fallback, remote, 7337, None)

The key advantage of this process is that it migrates blocks and shuffle data, thereby reducing recomputation, which adds to the overall resiliency of the system and reduces runtime. This process can be triggered by a Spot interruption signal (SIGTERM) and by node draining. Node draining may happen due to high-priority task scheduling or independently.
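The migration order described above (peer executors first, fallback storage second) can be sketched as a toy model. This is not Spark's implementation: the `Executor` class, the `migrate_blocks` function, and the round-robin placement across peers are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Executor:
    """Hypothetical stand-in for an executor and the blocks it holds."""
    executor_id: str
    blocks: List[str] = field(default_factory=list)
    decommissioning: bool = False


def migrate_blocks(
    leaving: Executor,
    peers: List[Executor],
    fallback: Optional[List[str]],
) -> Dict[str, str]:
    """Move blocks off `leaving` and return a block -> destination map.

    Peers that are not themselves decommissioning are preferred. If none
    exist, blocks go to `fallback` (a list standing in for remote storage).
    If there is no fallback either, the block is marked "lost" and would
    have to be recomputed.
    """
    live_peers = [p for p in peers if not p.decommissioning]
    placement: Dict[str, str] = {}
    for i, block in enumerate(leaving.blocks):
        if live_peers:
            target = live_peers[i % len(live_peers)]  # round-robin is an assumption
            target.blocks.append(block)
            placement[block] = target.executor_id
        elif fallback is not None:
            fallback.append(block)
            placement[block] = "fallback"
        else:
            placement[block] = "lost"
    leaving.blocks = []
    leaving.decommissioning = True
    return placement


if __name__ == "__main__":
    remote: List[str] = []
    print(migrate_blocks(Executor("7", ["shuffle_0_46"]), [Executor("4")], remote))
    print(migrate_blocks(Executor("1", ["shuffle_0_52"]), [], remote))
```

The first call places the block on peer executor 4. The second has no peers, so the block lands in the fallback list.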

When you use Amazon EMR on EKS with managed node groups or Karpenter, Spot interruption handling is automated: Amazon EKS gracefully drains and rebalances the Spot nodes to minimize application disruption when a Spot node is at elevated risk of interruption. In this case, the decommission is triggered when the nodes are drained, and because it's proactive, it gives you more time (at least 2 minutes) to move the files. In the case of self-managed node groups, we recommend installing the AWS Node Termination Handler to handle the interruption, and the decommission is triggered when the reactive (2-minute) notification is received. We recommend using Karpenter with Spot Instances because it has faster node scheduling with early pod binding and binpacking to optimize resource utilization.
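To get a feel for what the 2-minute window allows, a back-of-the-envelope check can help. The sketch below is plain arithmetic, not an AWS or Spark API, and the function name, sizes, and throughput figures are hypothetical.

```python
def fits_in_window(shuffle_bytes, throughput_bytes_per_s, window_s=120.0):
    """Return True if the data can be moved within the warning window.

    window_s defaults to the 2-minute reactive Spot notice. The throughput
    has to come from your own measurements; nothing here estimates it.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return shuffle_bytes / throughput_bytes_per_s <= window_s


if __name__ == "__main__":
    GiB = 1024 ** 3
    MiB = 1024 ** 2
    # Hypothetical numbers: 10 GiB at 100 MiB/s takes 102.4 s.
    print(fits_in_window(10 * GiB, 100 * MiB))
    # 20 GiB at the same rate takes 204.8 s, longer than the window.
    print(fits_in_window(20 * GiB, 100 * MiB))
```

If the data does not fit, whatever is not migrated in time has to be recomputed, which is where the extra time from proactive draining helps.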

The following code enables this configuration; more details are available on GitHub:

"spark.decommission.enabled": "true"
"spark.storage.decommission.rddBlocks.enabled": "true"
"spark.storage.decommission.shuffleBlocks.enabled": "true"
"spark.storage.decommission.enabled": "true"
"spark.storage.decommission.fallbackStorage.path": "s3://<<bucket>>"
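As a minimal sketch, the properties above could be packaged as a `spark-defaults` classification in a job's configuration overrides. The helper name and the example bucket path are illustrative assumptions; only the property names come from the list above.

```python
import json


def decommission_overrides(fallback_path):
    """Build a configuration-overrides payload carrying the decommission settings.

    fallback_path is the remote location used when no peer executor is
    available, for example a hypothetical "s3://example-bucket/fallback/".
    """
    properties = {
        "spark.decommission.enabled": "true",
        "spark.storage.decommission.enabled": "true",
        "spark.storage.decommission.rddBlocks.enabled": "true",
        "spark.storage.decommission.shuffleBlocks.enabled": "true",
        "spark.storage.decommission.fallbackStorage.path": fallback_path,
    }
    return {
        "applicationConfiguration": [
            {"classification": "spark-defaults", "properties": properties}
        ]
    }


if __name__ == "__main__":
    print(json.dumps(decommission_overrides("s3://example-bucket/fallback/"), indent=2))
```

Keeping the properties in one function makes it easy to reuse the same settings across jobs and to change only the fallback location.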

PVC reuse

Apache Spark enabled dynamic PVC in version 3.1, which is useful with dynamic allocation because we don’t have to pre-create the claims or volumes for the executors and delete them after completion. PVC enables true decoupling of data and processing when we’re running Spark jobs on Kubernetes, because we can also use it as local storage to spill in-process files. Amazon EMR 6.8 has integrated the PVC reuse feature of Spark: if an executor is terminated due to an EC2 Spot interruption or for any other reason (such as a JVM failure), the PVC is not deleted but persisted and reattached to another executor. If there are shuffle files in that volume, they are reused.

As with node decommissioning, this reduces the overall runtime because we don’t have to recompute the shuffle files. We also save the time required to request a new volume for an executor, and shuffle files can be reused without moving them around.

The following diagram illustrates this workflow.

Fig 3: PVC Reuse

Let’s look at the steps in more detail.

If one or more of the nodes that are running executors is interrupted, the underlying pods are terminated and the driver gets the update. Note that the driver, not the executor pods, owns the executors’ PVCs, so the PVCs are not deleted when the pods terminate. See the following log output:

22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action MODIFIED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-6, action DELETED
22/06/15 23:25:07 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-3, action MODIFIED

The ExecutorPodsAllocator tries to allocate new executor pods to replace the ones terminated due to interruption. During the allocation, it figures out how many of the existing PVCs have files and can be reused:

22/06/15 23:25:23 INFO ExecutorPodsAllocator: Found 2 reusable PVCs from 10 PVCs

The ExecutorPodsAllocator requests a pod, and when the pod launches, the PVC is reused. In the following example, the PVC from executor 6 is reused for new executor pod 11:

22/06/15 23:25:23 DEBUG ExecutorPodsAllocator: Requested executor with id 11 from Kubernetes.
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action ADDED
22/06/15 23:25:24 INFO KubernetesClientUtils: Spark configuration files loaded from Some(/usr/lib/spark/conf) : log4j.properties,spark-env.sh,hive-site.xml,metrics.properties
22/06/15 23:25:24 INFO BasicExecutorFeatureStep: Decommissioning not enabled, skipping shutdown script
22/06/15 23:25:24 DEBUG ExecutorPodsWatchSnapshotSource: Received executor pod update for pod named amazon-reviews-word-count-9ee82b8169a75183-exec-11, action MODIFIED
22/06/15 23:25:24 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim amazon-reviews-word-count-9ee82b8169a75183-exec-6-pvc-0

The shuffle files, if present in the PVC, are reused.

The key advantage of this technique is that it allows us to reuse pre-computed shuffle files in their original location, thereby reducing the time of the overall job run.
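The allocation step in the logs above ("Found 2 reusable PVCs from 10 PVCs") can be illustrated with a toy model. This is a simplification, not the ExecutorPodsAllocator code: it assumes a PVC is reusable when no running pod mounts it, and the function names and PVC names are hypothetical.

```python
def reusable_pvcs(all_pvcs, running_pod_claims):
    """Return the PVCs that no running pod currently mounts.

    all_pvcs: list of PVC names owned by the driver.
    running_pod_claims: dict mapping pod name -> list of PVC names it mounts.
    """
    in_use = {claim for claims in running_pod_claims.values() for claim in claims}
    return [pvc for pvc in all_pvcs if pvc not in in_use]


def assign_pvcs(new_executor_ids, reusable):
    """Hand out reusable PVCs to new executors; None means a fresh volume is needed."""
    pool = list(reusable)
    return {eid: (pool.pop(0) if pool else None) for eid in new_executor_ids}


if __name__ == "__main__":
    pvcs = [f"exec-{i}-pvc-0" for i in range(1, 11)]
    # Executors 3 and 6 were interrupted; the other eight are still running.
    running = {f"exec-{i}": [f"exec-{i}-pvc-0"] for i in range(1, 11) if i not in (3, 6)}
    free = reusable_pvcs(pvcs, running)
    print(f"Found {len(free)} reusable PVCs from {len(pvcs)} PVCs")
    print(assign_pvcs([11, 12, 13], free))
```

In this model the third new executor gets `None`, which corresponds to requesting a new volume because no reusable PVC is left.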

This works for both static and dynamic PVCs. Amazon EKS offers three different storage offerings, which can be encrypted too: Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. We recommend using dynamic PVCs with Amazon EBS because with static PVCs, you would need to create multiple PVCs.

The following code enables this configuration; more details are available on GitHub:

"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true"

For this to work, we need to enable PVC with Amazon EKS and mention the details in the Spark runtime configuration. For instructions, refer to How do I use persistent storage in Amazon EKS? The following code contains the Spark configuration details for using PVC as local storage; other details are available on GitHub:

"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "spark-sc"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "10Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/var/data/spill"
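Because the five volume properties share a long prefix, generating them from a few parameters can reduce copy-paste mistakes. The sketch below is one way to do that, with the two reuse flags from the previous listing added optionally. The function name is hypothetical, and the name check reflects Spark's convention that volumes named with the `spark-local-dir-` prefix are used as local storage.

```python
def pvc_local_dir_properties(volume_name, storage_class, size_limit, mount_path, reuse=True):
    """Build Spark properties for an on-demand PVC used as executor local storage."""
    if not volume_name.startswith("spark-local-dir-"):
        raise ValueError("volume name should start with 'spark-local-dir-'")
    prefix = f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{volume_name}"
    props = {
        f"{prefix}.options.claimName": "OnDemand",  # ask Spark to create the claim per executor
        f"{prefix}.options.storageClass": storage_class,
        f"{prefix}.options.sizeLimit": size_limit,
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
    }
    if reuse:
        # The driver owns the claims so they outlive executor pods and can be reattached.
        props["spark.kubernetes.driver.ownPersistentVolumeClaim"] = "true"
        props["spark.kubernetes.driver.reusePersistentVolumeClaim"] = "true"
    return props


if __name__ == "__main__":
    example = pvc_local_dir_properties("spark-local-dir-1", "spark-sc", "10Gi", "/var/data/spill")
    for key, value in example.items():
        print(f'"{key}": "{value}"')
```

Called with the values from the listing above, this reproduces the same five volume properties plus the two reuse flags.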

Conclusion

With Amazon EMR on EKS (6.9) and the features discussed in this post, you can further reduce the overall runtime for Spark jobs when running with Spot Instances. This also improves the overall resiliency and flexibility of the job while cost optimizing the workload on EC2 Spot.

Try out the EMR on EKS workshop for improved performance when running Spark workloads on Kubernetes and cost optimize using EC2 Spot Instances.


About the Author

Kinnar Kumar Sen is a Sr. Solutions Architect at Amazon Web Services (AWS) focusing on Flexible Compute. As a part of the EC2 Flexible Compute team, he works with customers to guide them to the most elastic and efficient compute options that are suitable for their workload running on AWS. Kinnar has more than 15 years of industry experience working in research, consultancy, engineering, and architecture.

Introducing the Security Design of the AWS Nitro System whitepaper

Post Syndicated from J.D. Bean original https://aws.amazon.com/blogs/security/introducing-the-security-design-of-the-aws-nitro-system-whitepaper/

AWS recently released a whitepaper on the Security Design of the AWS Nitro System. The Nitro System is a combination of purpose-built server designs, data processors, system management components, and specialized firmware that serves as the underlying virtualization technology that powers all Amazon Elastic Compute Cloud (Amazon EC2) instances launched since early 2018. With the Nitro System, AWS undertook an effort to reimagine the architecture of virtualization to deliver security, isolation, performance, cost savings, and a pace of innovation that our customers require.

This whitepaper is a detailed design document on the inner workings of the AWS Nitro System, and how we use it to help secure your most critical workloads. This is the first time that AWS has provided such a detailed design document on the Nitro System and how it offers a no-operator access design and strong tenant isolation. The whitepaper describes the security design of the Nitro System in detail to help you evaluate Amazon EC2 for your sensitive workloads.

Three key components of the Nitro System are used to implement this design:

  • Purpose-built Nitro Cards – Hardware devices designed by AWS that provide overall system control and I/O virtualization that is independent of the main system board with its CPUs and memory.
  • Nitro Security Chip – Enables a secure boot process for the overall system based on a hardware root of trust, the ability to offer bare metal instances, and defense-in-depth that offers protection to the server from unauthorized modification of system firmware.
  • Nitro Hypervisor – A deliberately minimized and firmware-like hypervisor designed to provide strong resource isolation, and performance that is nearly indistinguishable from a bare metal server.

The whitepaper describes the fundamental architectural change introduced by the Nitro System compared to previous approaches to virtualization. It discusses the three key components of the Nitro System, and provides a demonstration of how these components work together by walking through what happens when a new Amazon Elastic Block Store (Amazon EBS) volume is added to a running EC2 instance. The whitepaper also discusses how the Nitro System is designed to eliminate the possibility of administrator access to an EC2 server, the overall passive communications design of the Nitro System, and the Nitro System change management process. Finally, the paper surveys important aspects of the EC2 system design that provide mitigations against potential side-channel issues that can arise in compute environments.

The whitepaper dives deep into each of these considerations, offering a detailed picture of the Nitro System security design. For more information about cloud security at AWS, contact us.


J.D. Bean

J.D. is a Principal Security Architect for Amazon EC2 based out of New York City. His interests include security, privacy, and compliance. He is passionate about his work enabling AWS customers’ successful cloud journeys. J.D. holds a Bachelor of Arts from The George Washington University and a Juris Doctor from New York University School of Law.
