Does free software benefit from ML models being derived works of training data?

Post Syndicated from original https://mjg59.dreamwidth.org/57615.html

Github recently announced Copilot, a machine learning system that makes suggestions for you when you’re writing code. It’s apparently trained on all public code hosted on Github, which means there’s a lot of free software in its training set. Github assert that the output of Copilot belongs to the user, although they admit that it may occasionally produce output that is identical to content from the training set.

Unsurprisingly, this has led to a number of questions along the lines of “If Copilot embeds code that is identical to GPLed training data, is my code now GPLed?”. This is extremely understandable, but the underlying issue is actually more general than that. Even code under permissive licenses like BSD requires retention of copyright notices and disclaimers, and failing to include them is just as much a copyright violation as incorporating GPLed code into a work without abiding by the terms of the GPL.

But free software licenses only have power to the extent that copyright permits them to. If your code isn’t a derived work of GPLed material, you have no obligation to follow the terms of the GPL. Github clearly believe that Copilot’s output doesn’t count as a derived work as far as US copyright law goes, and as a result the licenses on the training data don’t apply to the output. Some people have interpreted this as an attack on free software – Copilot may insert code that’s either identical or extremely similar to GPLed code, and claim that there are no license obligations created as a result, effectively allowing the laundering of GPLed code into proprietary software.

I’m completely unqualified to hold a strong opinion on whether Github’s legal position is justifiable or not, and right now I’m also not interested in thinking about it too much. What I think is more interesting is what the impact of either position has on free software. Do we benefit more from a future where the output of Copilot (or similar projects) is considered a derived work of the training data, or one where it isn’t? Having been involved in a bunch of GPL enforcement activities, it’s very easy to think of this as something that weakens the GPL and, as a result, weakens free software. That was my initial reaction, but that’s shifted over the past few days.

Let’s look at the GNU manifesto, specifically this section:

The fact that the easiest way to copy a program is from one neighbor to another, the fact that a program has both source code and object code which are distinct, and the fact that a program is used rather than read and enjoyed, combine to create a situation in which a person who enforces a copyright is harming society as a whole both materially and spiritually; in which a person should not do so regardless of whether the law enables him to.

The GPL makes use of copyright law to ensure that GPLed work can’t be taken from the commons. Anyone who produces a derived work of GPLed code is obliged to provide that work under the same terms. If software weren’t copyrightable, the GPL would have no power. But this is the outcome Stallman wanted! The GPL doesn’t exist because copyright is good, it exists because software being copyrightable is what enables the concept of proprietary software in the first place.

The powers that the GPL uses to enforce sharing of code are used by the authors of proprietary software to reduce that sharing. They attempt to forbid us from examining their code to determine how it works – they argue that anyone who does so is tainted, unable to contribute similar code to free software projects in case they produce a derived work of the original. Broadly speaking, the further the definition of a derived work reaches, the greater the power of proprietary software authors. If Oracle’s argument that APIs are copyrightable had prevailed, it would have been disastrous for free software. If the Apple look and feel suit had established that Microsoft infringed Apple’s copyright, we might be living in a future where we had no free software desktop environments.

When we argue for an interpretation of copyright law that enhances the power of the GPL, we’re also enhancing the power of giant corporations with a lot of lawyers on hand. So let’s look at this another way. If Github’s interpretation of copyright law holds, we can train a model on proprietary code and extract concepts without having to worry about being tainted. The proprietary code itself won’t enter the commons, but the ideas it embodies will. No more worries about whether you’re literally copying the code that implements an algorithm you want to duplicate – simply start typing and let the model remove the risk for you.

There’s a reasonable counter argument about equality here. How much GPL-influenced code is going to end up in proprietary projects when compared to the reverse? It’s not an easy question to answer, but we should bear in mind that the majority of public repositories on Github aren’t under an open source license. Copilot is already claiming to give us access to the concepts embodied in those repositories. Do these provide more value than is given up? I honestly don’t know how to measure that. But what I do know is that free software was founded in a belief that software shouldn’t be constrained by copyright, and our default stance shouldn’t be to argue against the idea that copyright is weaker than we imagined.

App Modularisation at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/app-modularisation-at-scale

Grab a coffee ☕️, sit back and enjoy reading. 😃

Wanna know how we improved our app’s build time performance and developer experience at Grab? Continue reading…

Where it all began

Imagine you are working on an app that grows continuously as more and more features are added to it. At some point, it becomes challenging to manage the code: code conflicts increase due to coupling, development slows down, releases take longer to ship, collaboration becomes difficult, and so on.

The Grab superapp is one such app, offering many services like booking taxis, ordering food, paying with an e-wallet, transferring money to friends and family, paying at merchants, and many more, across Southeast Asia.

The Grab app initially followed a monolithic architecture, where the entire codebase lived in a single module containing all the UI and business logic for almost all of its features. But as the app grew, new developers were hired, and more features were built, it became difficult to work on the codebase. We had to think of better ways to maintain it, and that’s when the team decided to modularise the app to solve the issues we faced.

What is Modularisation?

Modularisation means breaking the monolithic app module into smaller, independent, and interchangeable modules that segregate functionality, so that every module is responsible for a specific piece of functionality and contains everything necessary to execute it.

Modularising the Grab app was not an easy task: the high amount of code coupling made its structure complicated and brought many challenges along the way.

Approach and Design

We divided the task into the following sub-tasks to ensure that only one out of many functionalities in the app was impacted at a time.

  • Setting up the infrastructure by creating Base/Core modules for Networking, Analytics, Experimentation, Storage, Config, and so on.
  • Building Shared Library modules for Styling, Common-UI, Utils, etc.
  • Incrementally building Feature modules for user-facing features like Payments Home, Wallet Top Up, Peer-to-Merchant (P2M) Payments, GrabCard and many others.
  • Creating Kit modules for feature-to-feature module communication. This step helped us build the feature modules in parallel.
  • Finally, the App module is used as a hub to connect all the other modules together using dependency injection (Dagger).
Modularised app structure

In the above diagram, payments-home, wallet top-up, and grabcard are different features provided by the Grab app. top-up-kit and grabcard-kit are bridges that expose functionalities from topup and grabcard modules to the payments-home module, respectively.

In the process of modularising the Grab app, we ensured that a feature module did not directly depend on other feature modules so that they could be built in parallel using the available CPU cores of the machine, hence reducing the overall build time of the app.

With the Kit module approach, we separated our code into independent layers by depending only on abstractions instead of concrete implementation.

Modularisation Benefits

  • Faster build times and hence faster CI: The Gradle build system compiles only the changed modules and reuses cached binaries for all unaffected modules, so compilation becomes faster. Moreover, independent modules are built in parallel on different threads.
  • Fine dependency graph: Dependencies of a module are well defined.
  • Reusability across other apps: Modules can be used across different apps by converting them into an AAR SDK.
  • Scale and maintainability: Teams can work independently on the modules owned by them without blocking each other.
  • Well-defined code ownership: Clear responsibility on who owns which code.

Limitations

  • Requires more effort and time to modularise an app.
  • Separate configuration files to be maintained for each module.
  • Gradle sync time starts to grow.
  • IDE becomes very slow and its memory usage goes up a lot.
  • Parallel execution of the module depends on the machine’s capabilities.

Where we are now

There are more than 1,000 modules in the Grab app, and counting.

At Grab, we have many sub-teams which take care of different features available in the app. Grab Financial Group (GFG) is one such sub-team that handles everything related to payments in the app. For example: P2P & P2M money transfers, e-Wallet activation, KYC, and so on.

We started modularising payments further in July 2020, as the payments module had accumulated too many features and it was difficult for the team to work on a single module. The result of payments modularisation is shown in the following chart.

Build time graph of payments module

As of today, we have more than 200 modules in GFG, and more than 95% of the modules take less than 15s to build.

Conclusion

Modularisation has helped us a lot in reducing the overall build time of the app and in improving the developer experience by breaking dependencies and allowing us to define code ownership. Having said that, modularisation is not an easy or a small task, especially for large projects with legacy code. However, with careful planning and the right design, modularisation can help in forming a well-structured and maintainable project.

Hope you enjoyed reading. Don’t forget to 👏.

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

SolarWinds Serv-U FTP and Managed File Transfer CVE-2021-35211: What You Need to Know

Post Syndicated from Erick Galinkin original https://blog.rapid7.com/2021/07/12/solarwinds-serv-u-ftp-and-managed-file-transfer-cve-2021-35211-what-you-need-to-know/

On July 12, 2021, SolarWinds confirmed an actively exploited zero-day vulnerability, CVE-2021-35211, in the Serv-U FTP and Managed File Transfer component of SolarWinds Serv-U 15.2.3 HF1 (released May 5, 2021) and all prior versions. Successful exploitation of CVE-2021-35211 could enable an attacker to gain remote code execution on a vulnerable target system. The vulnerability only exists when SSH is enabled in the Serv-U environment.

A hotfix for the vulnerability is available, and we recommend all customers of SolarWinds Serv-U FTP and Managed File Transfer install this hotfix immediately (or, at minimum, disable SSH for a temporary mitigation). SolarWinds has emphasized that CVE-2021-35211 only affects Serv-U Managed File Transfer and Serv-U Secure FTP and does not affect any other SolarWinds or N-able (formerly SolarWinds MSP) products. For further details, see SolarWinds’s advisory.

Details

The SolarWinds advisory cites threat intelligence provided by Microsoft. According to Microsoft, a single threat actor unrelated to this year’s earlier SUNBURST intrusions has exploited the vulnerability against a limited, targeted population of SolarWinds customers. The vulnerability exists in all versions of Serv-U 15.2.3 HF1 and earlier. Though Microsoft provided a proof-of-concept exploit to SolarWinds, there are no public proofs-of-concept as of July 12, 2021.

The vulnerability appears to be in the exception handling functionality in a portion of the software related to processing connections on open sockets. Successful exploitation of the vulnerability will cause the Serv-U product to throw an exception, then will overwrite the exception handler with the attacker’s code, causing remote code execution.

Detection

Since the vulnerability is in the exception handler, looking for exceptions in the DebugSocketLog.txt file may help identify exploitation attempts. Note, however, that exceptions can be thrown for many reasons and the presence of an exception in the log does not guarantee that there has been an exploitation attempt.

IP addresses used by the threat actor include:

98.176.196.89 
68.235.178.32 
208.113.35.58
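
Neither indicator is conclusive on its own, but the two can be combined into a quick triage script. The following Python sketch is illustrative only: the log path is an assumption (point it at your Serv-U DebugSocketLog.txt), and a match simply means the line deserves a closer look.

import re
import sys

# IPs reported as used by the threat actor (listed in the advisory above).
SUSPECT_IPS = {"98.176.196.89", "68.235.178.32", "208.113.35.58"}

# Path is an assumption; point this at your Serv-U DebugSocketLog.txt.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "DebugSocketLog.txt"

ip_pattern = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

with open(LOG_PATH, errors="replace") as log:
    for lineno, line in enumerate(log, 1):
        # Exceptions thrown by the socket-handling code are the primary hint.
        if "EXCEPTION" in line.upper():
            print(f"{lineno}: possible exploitation-related exception: {line.strip()}")
        # Any connection involving a known-bad IP deserves review regardless.
        if SUSPECT_IPS.intersection(ip_pattern.findall(line)):
            print(f"{lineno}: connection involving a reported threat-actor IP: {line.strip()}")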

Rapid7 does not use SolarWinds Serv-U FTP products anywhere in our environment and is not affected by CVE-2021-35211.

For further information, see SolarWinds’s FAQ here.

[$] The conclusion of the 5.14 merge window

Post Syndicated from original https://lwn.net/Articles/861695/rss

The 5.14 merge window closed with the 5.14-rc1 release on July 11. By that time, some 12,981 non-merge changesets had been pulled into the mainline repository; nearly 8,000 of those arrived after the first LWN 5.14 merge-window summary was written. This merge window has thus seen fewer commits than its predecessor, which saw 14,231 changesets before the 5.13-rc1 release. That said, there is still a lot of interesting work that has found its way into the kernel this time around.

Should I Run my Containers on AWS Fargate, AWS Lambda, or Both?

Post Syndicated from Rob Solomon original https://aws.amazon.com/blogs/architecture/should-i-run-my-containers-on-aws-fargate-aws-lambda-or-both/

Containers have transformed how companies build and operate software. Bundling both application code and dependencies into a single container image improves agility and reduces deployment failures. But what compute platform should you choose to be most efficient, and what factors should you consider in this decision?

With the release of container image support for AWS Lambda functions (December 2020), customers now have an additional option for building serverless applications using their existing container-oriented tooling and DevOps best practices. In addition, a single container image can be configured to run on both of these compute platforms: AWS Lambda (using serverless functions) or AWS Fargate (using containers).

Three key factors can influence the decision of what platform you use to deploy your container: startup time, task runtime, and cost. That decision may vary each time a task is initiated, as shown in the three scenarios that follow.

Design considerations for deploying a container

Total task duration consists of startup time and runtime. The startup time of a containerized task is the time required to provision the container compute resource and deploy the container. Task runtime is the time it takes for the application code to complete.

Startup time: Some tasks must complete quickly. For example, when a user waits for a web response, or when a series of tasks is completed in sequential order. In those situations, the total duration time must be minimal. While the application code may be optimized to run faster, startup time depends on the chosen compute platform as well. AWS Fargate container startup time typically takes from 60 to 90 seconds. AWS Lambda initial cold start can take up to 5 seconds. Following that first startup, the same containerized function has negligible startup time.

Task runtime: The amount of time it takes for a task to complete is influenced by the compute resources allocated (vCPU and memory) and the application code. AWS Fargate lets you select both vCPU and memory size. With AWS Lambda, you define the amount of allocated memory, and Lambda provisions a proportional quantity of vCPU. With both AWS Fargate and AWS Lambda, increasing the amount of compute resources may result in faster completion time, although this depends on the application. While the additional compute resources incur greater cost, the total duration may be shorter, so the overall cost may also be lower.

AWS Lambda has a maximum runtime limit of 15 minutes. Tasks that may run longer than this shouldn’t use Lambda, to avoid the likelihood of timeout errors.

Figure 1 illustrates the proportion of startup time to total duration. The initial steepness of each line shows a rapid decrease in startup overhead. This is followed by a flattening out, showing a diminishing rate of efficiency. Startup time delay becomes less impactful as the total job duration increases. Other factors (such as cost) become more significant.

Figure 1. Ratio of startup time as a function of overall job duration for each service

Cost: When making the choice between Fargate and Lambda, it is important to understand the different pricing models. This way, you can make the appropriate selection for your needs.

Figure 2 shows a cost analysis of Lambda vs Fargate. This is for the entire range of configurations for a runtime task. For most of the range of configurable memory, AWS Lambda is more expensive per second than even the most expensive configuration of Fargate.

Figure 2. Total cost for both AWS Lambda and AWS Fargate based on task duration

From a cost perspective, AWS Fargate is more cost-effective for tasks running for several seconds or longer. If cost is the only factor at play, then Fargate would be the better choice. But the savings gained by using Fargate may be offset by the business value gained from the shorter Lambda function startup time.
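
To make the trade-off concrete, the comparison can be sketched in a few lines of Python. The per-second rates below are illustrative us-east-1 figures and are assumptions that will drift over time, so substitute current pricing; the sketch also ignores the Lambda per-request charge and any free tier.

# Rough per-task cost comparison (rates are assumptions; check current pricing).
LAMBDA_GB_SECOND = 0.0000166667        # USD per GB-second of Lambda duration
FARGATE_VCPU_HOUR = 0.04048            # USD per vCPU-hour
FARGATE_GB_HOUR = 0.004445             # USD per GB-hour of Fargate memory

def lambda_cost(duration_s: float, memory_gb: float) -> float:
    """Cost of one Lambda invocation, ignoring the per-request charge."""
    return duration_s * memory_gb * LAMBDA_GB_SECOND

def fargate_cost(duration_s: float, vcpu: float, memory_gb: float) -> float:
    """Cost of one Fargate task run (Fargate bills a minimum of one minute)."""
    hours = max(duration_s, 60) / 3600
    return hours * (vcpu * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR)

for seconds in (1, 10, 60, 300, 900):
    print(seconds,
          round(lambda_cost(seconds, memory_gb=2), 6),
          round(fargate_cost(seconds, vcpu=1, memory_gb=2), 6))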

Dynamically choose your compute platform

In the following scenarios, we show how a single container image can serve multiple use cases. The decision to run a given containerized application on either AWS Lambda or AWS Fargate can be determined at runtime. This decision depends on whether cost, speed, or duration are the priority.

In Figure 3, an image-processing AWS Batch job runs on a nightly schedule, processing tens of thousands of images to extract location information. When run as a batch job, image processing may take 1–2 hours. The job pulls images stored in Amazon Simple Storage Service (S3) and writes the location metadata to Amazon DynamoDB. In this case, AWS Fargate provides a good combination of compute and cost efficiency. An added benefit is that it also supports tasks that exceed 15 minutes. If a single image is submitted for real-time processing, response time is critical. In that case, the same image-processing code can be run on AWS Lambda, using the same container image. Rather than waiting for the next batch process to run, the image is processed immediately.

Figure 3. One-off invocation of a typically long-running batch job

In Figure 4, a SaaS application uses an AWS Lambda function to allow customers to submit complex text search queries for files stored in an Amazon Elastic File System (EFS) volume. The task should return results quickly, which is an ideal condition for AWS Lambda. However, a small percentage of jobs run much longer than the average, exceeding the maximum duration of 15 minutes.

A straightforward approach to avoid job failure is to initiate an Amazon CloudWatch alarm when the Lambda function times out. CloudWatch alarms can automatically retry the job using Fargate. An alternate approach is to capture historical data and use it to create a machine learning model in Amazon SageMaker. When a new job is initiated, the SageMaker model can predict the time it will take the job to complete. Lambda can use that prediction to route the job to either AWS Lambda or AWS Fargate.

Figure 4. Short duration tasks with occasional outliers running longer than 15 minutes
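
The routing decision itself can live in a small Lambda function. The following Python sketch is illustrative rather than part of the original architecture: the SageMaker endpoint, function, cluster, and task definition names are placeholders, and the threshold mirrors the 15-minute Lambda limit discussed earlier.

import json
import boto3

sagemaker = boto3.client("sagemaker-runtime")
lambda_client = boto3.client("lambda")
ecs = boto3.client("ecs")

LAMBDA_LIMIT_SECONDS = 15 * 60

def handler(event, context):
    # Ask the (hypothetical) SageMaker endpoint for a predicted duration.
    response = sagemaker.invoke_endpoint(
        EndpointName="job-duration-predictor",     # placeholder name
        ContentType="application/json",
        Body=json.dumps(event),
    )
    predicted_seconds = float(response["Body"].read())

    if predicted_seconds < LAMBDA_LIMIT_SECONDS:
        # Short job: invoke the container image as a Lambda function.
        lambda_client.invoke(
            FunctionName="search-worker",          # placeholder name
            InvocationType="Event",
            Payload=json.dumps(event),
        )
    else:
        # Long job: run the same container image as a Fargate task.
        ecs.run_task(
            cluster="search-cluster",              # placeholder name
            launchType="FARGATE",
            taskDefinition="search-worker-task",   # placeholder name
            networkConfiguration={
                "awsvpcConfiguration": {"subnets": ["subnet-placeholder"]}
            },
        )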

In Figure 5, a customer runs a containerized legacy application that encompasses many different kinds of functions, all related to a recurring data processing workflow. Each function performs a task of varying complexity and duration. These can range from processing data files, updating a database, or submitting machine learning jobs.

Using a container image, one code base can be configured to contain all of the individual functions. Longer running functions, such as data preparation and big data analytics, are routed to Fargate. Shorter duration functions like simple queries can be configured to run using the container image in AWS Lambda. By using AWS Step Functions as an orchestrator, the process can be automated. In this way, a monolithic application can be broken up into a set of “Units of Work” that operate independently.

Figure 5. Heterogeneous function orchestration

Conclusion

If your job lasts milliseconds and requires a fast response to provide a good customer experience, use AWS Lambda. If your function is not time-sensitive and runs on the scale of minutes, use AWS Fargate. For tasks that have a total duration of under 15 minutes, customers must decide based on impacts to both business and cost. Select the service that is the most effective serverless compute environment to meet your requirements. The choice can be made manually when a job is scheduled or by using retry logic to switch to the other compute platform if the first option fails. The decision can also be based on a machine learning model trained on historical data.

How to create auto-suppression rules in AWS Security Hub

Post Syndicated from BK Das original https://aws.amazon.com/blogs/security/how-to-create-auto-suppression-rules-in-aws-security-hub/

AWS Security Hub gives you a comprehensive view of your security alerts and security posture across your AWS accounts. With Security Hub, you have a single place that aggregates, organizes, and prioritizes your security alerts, or findings, from multiple AWS services. Security Hub lets you assign workflow statuses to these findings, which are NEW, NOTIFIED, SUPPRESSED, or RESOLVED. These statuses allow you to categorize which findings are open and need your attention.

In this blog post, we show how you can create automated suppression rules for specific types of findings in AWS Security Hub, such as ones that are an accepted risk by design, or have a compensating control. By automatically suppressing these findings that don’t require follow-up action from your security team, you can concentrate on investigating and remediating findings that are not yet resolved.

As an example of a finding that you may want to suppress, suppose that your development environment doesn’t need to have Amazon Virtual Private Cloud (VPC) Flow Logs enabled because it does not contain any sensitive data (that is, it is an accepted risk). However, your production environment must have VPC Flow Logs enabled. You can use this solution to automatically suppress the development environment findings regarding VPC Flow Logs not being enabled. Then, you can focus on responding to and remediating findings regarding the production environment VPC Flow Logs that are not enabled.

This solution uses an Amazon EventBridge rule to evaluate Security Hub findings based on predefined filters. An AWS Lambda function is the target of the rule, and is triggered to perform the suppression. The Lambda function calls the Security Hub BatchUpdateFindings API action to set the finding of interest to the SUPPRESSED status.

Prerequisites

This solution assumes that you have Security Hub and AWS Config enabled in your administrator and member AWS accounts. AWS Config is required to execute the rules that will generate the findings. You will also need to enable the AWS Foundational Security Best Practices standard, because the examples in this post rely on those findings. You should ensure that you have configured your administrator account to aggregate your Security Hub findings from across your AWS accounts.

Solution overview

In Security Hub, the status of an investigation of a finding is tracked using the workflow status attribute. The workflow status for new findings is initially set to NEW. You can change the workflow status of a finding either by selecting it in the AWS Security Hub console, or by automating the change of workflow status by using AWS CLI or Security Hub API. After the owner of the finding’s resource is notified to take action, you can set the workflow status to NOTIFIED. After a finding is remediated, you can set the workflow status to RESOLVED. If the finding is not a concern for your given environment and does not require any action, then you can set the workflow status to SUPPRESSED.

In this solution, we show you how to automatically set the workflow status to SUPPRESSED for expected findings, by using EventBridge event patterns that trigger on Security Hub findings that match your defined criteria. The event pattern can match on fields of the findings such as account number, AWS Region, and Amazon Resource Names (ARNs). The Lambda function triggers on findings that match all defined criteria, and then sets the workflow status to SUPPRESSED for all matched findings using the BatchUpdateFindings Security Hub API action.

Solution architecture

Figure 1: Solution architecture overview

Figure 1 shows the administrator account aggregating the Security Hub findings from the member accounts.

  1. Security Hub generates findings in the member accounts, then forwards the findings to the administrator account to be evaluated.
  2. In the administrator account, Security Hub evaluates every finding (whether generated or forwarded) against EventBridge rules.
  3. If a finding satisfies any of the defined EventBridge rule conditions, EventBridge triggers a Lambda function in the same Region. The EventBridge event bus delivers the finding to the Lambda function.
  4. The Lambda function in the administrator account performs the finding suppression evaluation, and sets the Security Hub workflow status of the finding to SUPPRESSED.

This architecture uses one Lambda function per Region. You can group together multiple suppression rules into the same EventBridge pattern when they apply to the same group of AWS accounts. You can also configure multiple separate EventBridge event patterns when a suppression rule shouldn’t apply to an account.

Implementation

First, we show how to write the EventBridge event pattern. You use the CDK to define the event rule and pattern. The following example code will suppress Security Hub findings that originate in the development accounts for VPC flow logs that aren’t enabled. The solution will filter new findings only.

In the following example, replace <account-id-1> and <account-id-2> with your own information.

# Assumes AWS CDK v1 imports, e.g. `from aws_cdk import aws_events as events`
event_pattern_obj = events.EventPattern(
    source=["aws.securityhub"],
    detail_type=["Security Hub Findings - Imported"],
    detail={
        "findings": {
            # EC2.6: VPC flow logging should be enabled in all VPCs
            "GeneratorId": [
                "aws-foundational-security-best-practices/v/1.0.0/EC2.6"
            ],
            # Development accounts whose findings are an accepted risk
            "AwsAccountId": [
                "<account-id-1>",
                "<account-id-2>"
            ],
            # Only match findings that are still in the NEW workflow state
            "Workflow": {
                "Status": [
                    "NEW"
                ]
            }
        }
    }
)

Second, you define the EventBridge rule that will match on the defined pattern.

vpc_flow_log_dev_account_event_rule = events.Rule(
    self,
    'vpc-flow-logs-development-account-eventbridge-rule',
    description='VPC flow logs in development account finding suppression',
    rule_name='vpc-flow-logs-development-account-sechub-rule',
    event_pattern=event_pattern_obj
)

Finally, the EventBridge rule triggers the suppression Lambda function.

# Assumes `from aws_cdk import aws_events_targets as lambda_targets`
vpc_flow_log_dev_account_event_rule.add_target(
    lambda_targets.LambdaFunction(security_hub_suppression_lambda)
)
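
The suppression Lambda function itself is not listed in this post. A minimal sketch of its handler, following the BatchUpdateFindings flow described above, could look like the following (the note text is illustrative):

import boto3

securityhub = boto3.client("securityhub")

def handler(event, context):
    # The "Security Hub Findings - Imported" event carries the matched
    # findings in event["detail"]["findings"].
    identifiers = [
        {"Id": finding["Id"], "ProductArn": finding["ProductArn"]}
        for finding in event["detail"]["findings"]
    ]
    if identifiers:
        securityhub.batch_update_findings(
            FindingIdentifiers=identifiers,
            Workflow={"Status": "SUPPRESSED"},
            Note={
                "Text": "Suppressed automatically: accepted risk by design.",
                "UpdatedBy": "sechub-finding-suppression",
            },
        )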

Solution deployment

You can deploy the solution through either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

To deploy the solution by using the AWS Management Console

In your security account, launch the template by choosing the following Launch Stack button.

To deploy the solution by using the AWS CDK

You can find the latest code on GitHub, where you can also contribute to the sample code. The following commands show how to deploy the solution by using the AWS CDK. First, the CDK initializes your environment and uploads the Lambda assets to Amazon Simple Storage Service (Amazon S3). Then, you can deploy the solution to your account. For <generator_ids>, specify the Security Hub generator IDs (such as the EC2.6 control shown earlier) that you want to suppress. For <account_ids>, specify the account number, or comma-separated list of account numbers, that you want the suppression rule to apply to.

cdk bootstrap

cdk deploy sechub-finding-suppression --parameters GeneratorIds=<generator_ids> --parameters AccountNumbers=<account_ids>

To test the solution

  1. Create a VPC that does not have flow logs enabled. We have included a test VPC that you can deploy with the following command:
    cdk deploy vpc-test-suppression
    

  2. Verify that the Security Hub finding EC2.6 has been suppressed in the parent account and the target account. You might need to wait a few minutes for the AWS Config recorder to detect the newly created resource and then to manually trigger the following AWS Config rule:
    securityhub-vpc-flow-logs-enabled-* 
    

  3. After verifying the suppression, delete the test VPC you created to test the suppression rule:
    cdk destroy vpc-test-suppression
    

Next steps

You can configure EventBridge rules and patterns to suppress all of your findings that are accepted risk, by design, or that have a compensating control. For example, if you are performing IAM authentication by using Amazon RDS Proxy, you could consider suppressing the control [RDS.10] IAM authentication should be configured for RDS instances. You can also consider creating event patterns that filter based on resource tags, such as filtering VPCs based on tags rather than account numbers for [EC2.6] VPC flow logging should be enabled in all VPCs.

Summary

In this blog post, we showed how you can automatically suppress specific findings by using the Security Hub BatchUpdateFindings API action. We showed you how to configure EventBridge patterns and rules in order to trigger a Lambda function that calls this API action to suppress your expected findings. After you follow the steps in this blog post for automatic Security Hub suppression, your console view in Security Hub will only show findings that are not suppressed.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

BK Das

BK works as a Senior Security Architect with AWS Professional Services. He loves to solve security problems for his customers and help them feel comfortable within AWS. Outside of work, BK loves to play computer games and go on long drives.

Author

Josh Joy

Josh is a Senior Security Consultant with the AWS Global Security Practice, a part of our Worldwide Professional Services Organization. Josh helps customers improve their security posture as they migrate their most sensitive workloads to AWS. Josh enjoys diving deep and working backwards in order to help customers achieve positive outcomes.

Author

Moumita Saha

Moumita is a Security Consultant with AWS Professional Services working to help enterprise customers secure their workloads in the cloud. She assists customers in secure cloud migration, designing automated solutions to protect against cyber threats in the cloud. She is passionate about cyber security, data privacy, and new, emerging cloud-security technologies.

Create a secure data lake by masking, encrypting data, and enabling fine-grained access with AWS Lake Formation

Post Syndicated from Shekar Tippur original https://aws.amazon.com/blogs/big-data/create-a-secure-data-lake-by-masking-encrypting-data-and-enabling-fine-grained-access-with-aws-lake-formation/

You can build data lakes with millions of objects on Amazon Simple Storage Service (Amazon S3) and use AWS native analytics and machine learning (ML) services to process, analyze, and extract business insights. You can use a combination of our purpose-built databases and analytics services like Amazon EMR, Amazon Elasticsearch Service (Amazon ES), and Amazon Redshift as the right tool for your specific job and benefit from optimal performance, scale, and cost.

In this post, you learn how to create a secure data lake using AWS Lake Formation for processing sensitive data. The data (simulated patient metrics) is ingested through a serverless pipeline to identify, mask, and encrypt sensitive data before storing it securely in Amazon S3. After the data has been processed and stored, you use Lake Formation to define and enforce fine-grained access permissions to provide secure access for data analysts and data scientists.

Target personas

The proposed solution focuses on the following personas, with each one having a different level of access:

  • Cloud engineer – As the cloud infrastructure engineer, you implement the architecture but may not have access to the data itself or to define access permissions
  • secure-lf-admin – As a data lake administrator, you configure the data lake settings and assign data stewards
  • secure-lf-business-analyst – As a business analyst, you shouldn’t be able to access sensitive information
  • secure-lf-data-scientist – As a data scientist, you shouldn’t be able to access sensitive information

Solution overview

We use the following AWS services for ingesting, processing, and analyzing the data:

  • Amazon Athena is an interactive query service that can query data in Amazon S3 using standard SQL, based on tables defined in an AWS Glue Data Catalog. The data can be accessed via JDBC for further processing, such as displaying in business intelligence (BI) dashboards.
  • Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and more. The logs from AWS Glue jobs and AWS Lambda functions are saved in CloudWatch logs.
  • Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover information in unstructured data.
  • Amazon DynamoDB is a NoSQL database that delivers single-digit millisecond performance at any scale and is used here to avoid processing duplicate files.
  • AWS Glue is a serverless data preparation service that makes it easy to extract, transform, and load (ETL) data. An AWS Glue job encapsulates a script that reads, processes, and writes data to a new schema. This solution uses Python 3.6 AWS Glue jobs for ETL processing.
  • AWS IoT provides the cloud services that connect your internet of things (IoT) devices to other devices and AWS Cloud services.
  • Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services.
  • AWS Lake Formation makes it easy to set up, secure, and manage your data lake. With Lake Formation, you can discover, cleanse, transform, and ingest data into your data lake from various sources; define fine-grained permissions at the database, table, or column level; and share controlled access across analytic, ML, and ETL services.
  • Amazon S3 is a scalable object storage service that hosts the raw data files and processed files in the data lake for millisecond access.

You can enhance the security of your sensitive data with the following methods:

  • Implement encryption at rest using AWS Key Management Service (AWS KMS) and customer managed encryption keys
  • Instrument AWS CloudTrail and audit logging
  • Restrict access to AWS resources based on the least privilege principle

Architecture overview

The solution emulates diagnostic devices sending Message Queuing Telemetry Transport (MQTT) messages onto an AWS IoT Core topic. We use Kinesis Data Firehose to preprocess and stage the raw data in Amazon S3. We then use AWS Glue for ETL to further process the data by calling Amazon Comprehend to identify any sensitive information. Finally, we use Lake Formation to define fine-grained permissions that restrict access to business analysts and data scientists who use Athena to query the data.

The following diagram illustrates the architecture for our solution.

Prerequisites

To follow the deployment walkthrough, you need an AWS account. Use us-east-1 or us-west-2 as your Region.

For this post, make sure you don’t have Lake Formation enabled in your AWS account.

Stage the data

Download the zipped archive file to use for this solution and unzip the files locally. The patient.csv file contains dummy data created to help demonstrate masking, encryption, and granting fine-grained access. The send-messages.sh script randomly generates simulated diagnostic data to represent body vitals. The AWS Glue job uses the glue-script.py script to perform ETL that detects sensitive information, masks and encrypts data, and populates the curated table in the AWS Glue Data Catalog.

Create an S3 bucket called secure-datalake-scripts-<ACCOUNT_ID> via the Amazon S3 console. Upload the scripts and CSV files to this location.

Deploy your resources

For this post, we use AWS CloudFormation to create our data lake infrastructure.

  1. Choose Launch Stack:
  2. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names before deploying.

The stack takes approximately 5 minutes to complete.

The following screenshot shows the keys and values the stack created. We use the TestUserPassword parameter for the Lake Formation personas to sign in to the AWS Management Console.

Load the simulation data

Sign in to the AWS CloudShell console and wait for the terminal to start.

Stage the send-messages.sh script by running the Amazon S3 copy command:

aws s3 cp s3://secure-datalake-scripts-<ACCOUNT_ID>/send-messages.sh .

Run your script by using the following command:

sh send-messages.sh

The script runs for a few minutes and emits 300 messages. This sends MQTT messages to the secure_iot_device_analytics topic, filtered using IoT rules, processed using Kinesis Data Firehose, and converted to Parquet format. After a minute, data starts showing up in the raw bucket.

Run the AWS Glue ETL pipeline

Run the AWS Glue workflow (secureGlueWorkflow) from the AWS Glue console; you can also schedule it to run using CloudWatch. It takes approximately 10 minutes to complete.

The AWS Glue job that is triggered as part of the workflow (ProcessSecureData) joins the patient metadata and patient metrics data. See the following code:

# Join Patient metadata and patient metrics dataframe
combined_df=Join.apply(patient_metadata, patient_metrics, 'PatientId', 'pid', transformation_ctx = "combined_df")

The ensuing dataframe contains sensitive information like FirstName, LastName, DOB, Address1, Address2, and AboutYourself. AboutYourself is freeform text entered by the patient during registration. In the following code snippet, the detect_sensitive_info function calls the Amazon Comprehend API to identify personally identifiable information (PII):

# Apply groupBy to get unique AboutYourself records
group_df = combined_df.toDF().groupBy("pid", "DOB", "FirstName", "LastName", "Address1", "Address2", "AboutYourself").count()
# Apply detect_sensitive_info to get the redacted string after masking PII data
df_with_about_yourself = Map.apply(frame = group_df, f = detect_sensitive_info)
# Apply encryption to the identified fields
df_with_about_yourself_encrypted = Map.apply(frame = group_df, f = encrypt_rows)

Amazon Comprehend returns an object that has information about the entity name and entity type. Based on your needs, you can filter the entity types that need to be masked.
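
As an illustration of that call, a simplified detect_sensitive_info could use Comprehend’s DetectPiiEntities API and redact only selected entity types. This is a sketch under assumptions (field name, entity types, and masking string), not the exact script from the archive:

import boto3

comprehend = boto3.client("comprehend")
# Entity types to mask are an assumption; adjust to your requirements.
ENTITY_TYPES_TO_MASK = {"NAME", "ADDRESS", "DATE_TIME", "PHONE", "EMAIL"}

def detect_sensitive_info(record):
    """Redact selected PII entity types found in the AboutYourself field."""
    text = record.get("AboutYourself", "")
    if not text:
        return record
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    # Replace matches from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Type"] in ENTITY_TYPES_TO_MASK:
            text = text[:entity["BeginOffset"]] + "*****" + text[entity["EndOffset"]:]
    record["AboutYourself"] = text
    return record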

These fields are masked, encrypted, and written to their respective S3 buckets where fine-grained access controls are applied via Lake Formation:

  • Masked data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data/
  • Encrypted data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data/
  • Curated data – s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data/
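
For the encryption step, a customer managed AWS KMS key can be used. Again, this is only a sketch of what encrypt_rows might look like; the key alias, field list, and base64 encoding of the ciphertext are assumptions rather than the exact glue-script.py implementation:

import base64
import boto3

kms = boto3.client("kms")
KMS_KEY_ALIAS = "alias/secure-data-lake-key"   # assumed key alias
FIELDS_TO_ENCRYPT = ["FirstName", "LastName", "DOB", "Address1", "Address2"]

def encrypt_rows(record):
    """Encrypt selected fields with KMS and store them as base64 strings."""
    for field in FIELDS_TO_ENCRYPT:
        value = record.get(field)
        if value:
            ciphertext = kms.encrypt(
                KeyId=KMS_KEY_ALIAS,
                Plaintext=str(value).encode("utf-8"),
            )["CiphertextBlob"]
            # Lowercased names match the *_encrypted columns granted later.
            record[field.lower() + "_encrypted"] = base64.b64encode(ciphertext).decode("utf-8")
    return record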

Now that the tables have been defined, we review permissions using Lake Formation.

Enable Lake Formation fine-grained access

To enable fine-grained access, we first add a Lake Formation admin user.

  1. On the Lake Formation console, select Add other AWS users or roles.
  2. On the drop-down menu, choose secure-lf-admin.
  3. Choose Get started.
  4. In the navigation pane, choose Settings.
  5. On the Data Catalog Settings page, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  6. Choose Save.

Grant access to different personas

Before we grant permissions to different user personas, let’s register the S3 locations in Lake Formation so these personas can access S3 data without granting access through AWS Identity and Access Management (IAM).

  1. On the Lake Formation console, choose Register and ingest in the navigation pane.
  2. Choose Data lake locations.
  3. Choose Register location.
  4. Find and select each of the following S3 buckets and choose Register location:
    1. s3://secure-raw-bucket-<ACCOUNT_ID>/temp-raw-table
    2. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data
    3. s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data
    4. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data

We’re now ready to grant access to our different users.

Grant read-only access to all the tables to secure-lf-admin

First, we grant read-only access to all the tables for the user secure-lf-admin.

  1. Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
  2. Navigate to the AWS Lake Formation console.
  3. Under Data Catalog, choose Databases.
  4. Select the database secure-db.
  5. On the Actions drop-down menu, choose Grant.
  6. Select IAM users and roles.
  7. Choose the role secure-lf-admin.
  8. Under Policy tags or catalog resources, select Named data catalog resources.
  9. For Database, choose the database secure-db.
  10. For Tables, choose All tables.
  11. Under Permissions, select Table permissions.
  12. For Table permissions, select Super.
  13. Choose Grant.
  14. Choose the secure_dl_curated_data table.
  15. On the Actions drop-down menu, choose View permissions.
  16. Select IAMAllowedPrincipals, choose Revoke, and then choose the Revoke button.

You can confirm your user permissions on the Data Permissions page.

Grant read-only access to secure-lf-business-analyst

Now we grant read-only access to certain encrypted columns to the user secure-lf-business-analyst.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Select the database secure-db and choose View tables.
  3. Select the table secure_dl_encrypted_data.
  4. On the Actions drop-down menu, choose Grant.
  5. Select IAM users and roles.
  6. Choose the role secure-lf-business-analyst.
  7. Under Permissions, select Column-based permissions.
  8. Choose the following columns:
    1. count
    2. address1_encrypted
    3. firstname_encrypted
    4. address2_encrypted
    5. dob_encrypted
    6. lastname_encrypted
  9. For Grantable permissions, select Select.
  10. Choose Grant.
  11. Choose the secure_dl_encrypted_data table.
  12. On the Actions drop-down menu, choose View permissions.
  13. Select IAMAllowedPrincipals, choose Revoke, and then choose the Revoke button.

You can confirm your user permissions on the Data Permissions page.

Grant read-only access to secure-lf-data-scientist

Lastly, we grant read-only access to masked data to the user secure-lf-data-scientist.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Select the database secure-db and choose View tables.
  3. Select the table secure_dl_masked_data.
  4. On the Actions drop-down menu, choose Grant.
  5. Select IAM users and roles.
  6. Choose the role secure-lf-data-scientist.
  7. Under Permissions, select Table permissions.
  8. For Table permissions, select Select.
  9. Choose Grant.
  10. Under Data Catalog, choose Tables.
  11. Choose the secure_dl_masked_data table.
  12. On the Actions drop-down menu, choose View permissions.
  13. Select IAMAllowedPrincipals, choose Revoke, and then choose the Revoke button.

You can confirm your user permissions on the Data Permissions page.

Query the data lake using Athena from different personas

To validate the permissions of different personas, we use Athena to query against the S3 data lake.

Make sure you set the query result location to the location created as part of the CloudFormation stack (secure-athena-query-<ACCOUNT_ID>). The following screenshot shows the location information in the Settings section on the Athena console.

You can see all the tables listed under secure-db.

  1. Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
  2. Navigate to the Athena console.
  3. Run a SELECT query against the secure_dl_curated_data table.

The user secure-lf-admin should see all the columns with encryption or masking.

Now let’s validate the permissions of secure-lf-business-analyst user.

  1. Sign in to the console with secure-lf-business-analyst.
  2. Navigate to the Athena console.
  3. Run a SELECT query against the secure_dl_encrypted_data table.

The secure-lf-business-analyst user can only view the selected encrypted columns.

Lastly, let’s validate the permissions of secure-lf-data-scientist.

  1. Sign in to the console with secure-lf-data-scientist.
  2. Run a SELECT query against the secure_dl_masked_data table.

The secure-lf-data-scientist user can only view the selected masked columns.

If you try to run a query on different tables, such as secure_dl_curated_data, you get an error message for insufficient permissions.

Clean up

To avoid unexpected future charges, delete the CloudFormation stack.

Conclusion

In this post, we presented a potential solution for processing and storing sensitive data workloads in an S3 data lake. We demonstrated how to build a data lake on AWS to ingest, transform, aggregate, and analyze data from IoT devices in near-real time. This solution also demonstrates how you can mask and encrypt sensitive data, and use fine-grained column-level security controls with Lake Formation, which benefits those with a higher level of security needs.

Lake Formation recently announced the preview for row-level access, and you can sign up for the preview now!


About the Authors

Shekar Tippur is an AWS Partner Solutions Architect. He specializes in machine learning and analytics workloads. He has been helping partners and customers adopt best practices and discover insights from data.

 

 

Ramakant Joshi is an AWS Solution Architect, specializing in the analytics and serverless domain. He has over 20 years of software development and architecture experience, and is passionate about helping customers in their cloud journey.

 

 

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

The UEFA EURO 2020 final as seen online by Cloudflare Radar

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/the-uefa-euro-2020-final-as-seen-online-by-cloudflare-radar/

Last night’s Italy-England match was a nail-biter. 1-1 at full time, 1-1 at the end of extra time, and then an amazing penalty shootout with incredible goalkeeping by Pickford and Donnarumma.

Cloudflare has been publishing statistics about all the teams involved in EURO 2020 and traffic to betting websites, sports newspapers, streaming services and sponsors. Here’s a quick look at some specific highlights from England’s and Italy’s EURO 2020.

Two interesting peaks show up in UK visits to sports newspapers: the day after England-Germany and today after England’s defeat. Looks like fans are hungry for analysis and news beyond the goals. You can see all the data on the dedicated England EURO 2020 page on Cloudflare Radar.

But it was a quiet morning for the websites of the England team’s sponsors.

Turning to the winners, we can see that Italian readers are even more interested in knowing more about their team’s success.

And this enthusiasm spills over into visits to the Italian team’s sponsors.

You can follow along on the dedicated Cloudflare Radar page for Italy in EURO 2020.

Visit Cloudflare Radar for information on global Internet trends, trending domains, attacks and usage statistics.

Adding support for cross-cluster associations to Rails 7

Post Syndicated from Eileen M. Uchitelle original https://github.blog/2021-07-12-adding-support-cross-cluster-associations-rails-7/

Ever since we made the leap at GitHub to upgrade off our fork of Rails and worked hard to stay up to date with the latest releases, we’ve consistently looked for ways to improve the Rails framework upstream. We do this in many ways – running GitHub off of Rails main, reporting and fixing bugs we find, and most importantly pushing functionality upstream that the entire Ruby community can benefit from.

Most recently, we extracted internal functionality to disable joining queries when an association crosses multiple databases. Prior to our work in this area, Rails had no support for handling associations that spanned clusters; teams had to write SQL to achieve this.

Background

At GitHub, we have 30 databases configured in our Rails monolith—15 primaries and 15 replicas. We use “functional partitioning” to split up our data, which means that each of those 15 primaries has a different schema. In contrast, a “horizontal sharding” approach would have 15 shards with the same schema.

While there are some workarounds for joining across clusters in MySQL, they are usually not performant or else they require additional setup. Without these workarounds, attempting to join from a table in cluster A to a table in cluster B would result in an error. To work around this limitation, teams had to write SQL, selecting IDs from the first table to then use in the second query to find the appropriate records. This was extra work and could be error-prone. We had an opportunity to make this process smoother by implementing non-join queries in Rails.

Let’s look at some code to see how this works:

Let’s say we have three models: Dog, Human, and Treat.

# table dogs in database animals
class Dog < AnimalsRecord  
  has_many :treats, through: :humans
  has_many :humans
end

# table humans in database people
class Human < PeopleRecord
  has_many :treats
  has_many :dogs
end

# table treats in database default
class Treat < ApplicationRecord
  has_many :dogs, through: :humans
  has_many :humans
end

If our Rails application code loaded the dog.treats association, usually that would automatically perform a join query:

SELECT treats.* FROM treats INNER JOIN humans ON treats.human_id = humans.id WHERE humans.dog_id = 2

Looking at the inheritance chain, we can see that Dog, Treat, and Human all inherit from different base classes. Each of these base classes belongs to a different database connection, which means that records for all three models are stored in different databases.

Since the data is stored across multiple primaries, when the join on dog.treats is run we’ll see an application error:

ActiveRecord::StatementInvalid (Table 'people_db_cluster.humans' doesn't exist)

One of the best features Rails provides out of the box is generating SQL for you. But since GitHub’s data lives in different databases, we could no longer take advantage of this. We had an opportunity to improve Rails in a way that benefited our engineers and everyone else in the Rails community who uses multiple databases.

Implementation

Prior to our work in this area, engineers working on any associations that crossed database boundaries would be forced to manually query IDs rather than using Active Record’s association APIs. Writing SQL can be error prone and defeats the purpose of Active Record’s convenience methods like dog.treats.

A little over two years ago, we started experimenting with an internal gem to disable joins for cross-database associations. We chose to implement this outside of Rails first so that we could work out the majority of bugs before merging to Rails. We wanted to be sure that we could use it successfully in production and that it didn’t cause any significant friction in development or any performance concerns in production. This is how many of Rails’ popular features get developed. We often extract implementations from large production applications – if it’s something we need and something a lot of applications can benefit from, we make it stable first, then upstream it to Rails.

The overall implementation is relatively small. To accomplish disabling joins, we added an option to has_many :through associations called disable_joins. When set to true for an association, Rails will generate separate queries for each database rather than a join query.

This needed to be an option on the association rather than performed at runtime because Rails associations are lazily loaded – the SQL is generated when the association objects are created, which means that by the time Rails runs the SQL to load dog.treats the join will already be generated. After adding the option in Rails, we implemented a new scoping class that would handle the order, limit, scopes, and other options.

Now applications can add the following to their associations to make sure Rails generates two or more queries instead of joins:

class Dog < AnimalsRecord
  has_many :treats, through: :humans, disable_joins: true
  has_many :humans
end

And that’s all that’s needed to disable generating joins for associations that cross database servers!

Now, calls to dog.treats will generate the following SQL:

SELECT "humans"."id" FROM "humans" WHERE "humans"."dog_id" = ?  [["dog_id", 1]]
SELECT "treats".* FROM "treats" WHERE "treats"."human_id" IN (?, ?, ?)  [["human_id", 1], ["human_id", 2], ["human_id", 3]]

Caveats

There are a couple of important caveats to keep in mind when using this new feature. Applications that need to disable joins may see that those associations have slower database performance. This would be true whether you wrote the SQL manually or use Rails’ new disable_joins feature. Fundamentally, if you’re performing multiple queries across multiple databases, that can be slower than performing a single join query on one database. It’s really important to make sure that your queries are efficient and that proper indexes are in place before using this feature. And as always, when making major changes to your database queries, it’s important to benchmark and understand how those changes will affect your application.

Additionally, if your queries rely on an order and limit from the join database you may see a performance impact on requests. When two queries are joined, MySQL can perform the order based on the table that’s joined (i.e., order by humans.human_id which would order the returned treats by the human ID). However, when you’re splitting queries the order can’t be applied by the database. To solve this, Rails orders the records in-memory based on the order they would have been returned if there was a join. This preserves the expected behavior but since the order and limit are performed in-memory, you’ll usually want to avoid performing these actions on hundreds of thousands of records.

Conclusion

Four years ago, contributing to Rails at this level was just something we’d hoped to be able to do one day. We were so far behind in our upgrades that it was difficult to contribute changes to the framework. This might look like a small change but it’s a clear demonstration of the hard work we’ve done to improve the technical debt in our application and ensure that we’re giving back to the community whenever we can.

By adding support for handling associations across databases, we help empower other applications to scale as their traffic and data grow. Additionally, by pushing this code into Rails and out of our private internal gem we’ll find even more improvements and edge cases in applications that aren’t GitHub. As we continue to grow and improve Rails for us at GitHub, we’ll continue to improve it for the entire community. This pull request is just one example of how we intend to do that for years to come.

Coming soon: Expansion of AWS Lambda states to all functions

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/coming-soon-expansion-of-aws-lambda-states-to-all-functions/

In November of 2019, we announced AWS Lambda function state attributes, a capability to track the current “state” of a function throughout its lifecycle.

Since launch, states have been used in two primary use-cases. First, to move the blocking setup of VPC resources out of the path of function invocation. Second, to allow the Lambda service to optimize new or updated container images for container-image based functions, also before invocation. By moving this additional work out of the path of the invocation, customers see lower latency and better consistency in their function performance. Soon, we will be expanding states to apply to all Lambda functions.

This post outlines the upcoming change, any impact, and actions to take during the roll out of function states to all Lambda functions. Most customers experience no impact from this change.

As functions are created or updated, or potentially fall idle due to low usage, they can transition to a state associated with that lifecycle event. Previously any function that was zip-file based and not attached to a VPC would only show an Active state. Updates to the application code and modifications of the function configuration would always show the Successful value for the LastUpdateStatus attribute. Now all functions will follow the same function state lifecycles described in the initial announcement post and in the documentation for Monitoring the state of a function with the Lambda API.

All AWS CLIs and SDKs have supported monitoring Lambda function states transitions since the original announcement in 2019. Infrastructure as code tools such as AWS CloudFormation, AWS SAM, Serverless Framework, and Hashicorp Terraform also already support states. Customers using these tools do not need to take any action as part of this, except for one recommended service role policy change for AWS CloudFormation customers (see Updating CloudFormation’s service role below).

However, there are some customers using SDK-based automation workflows, or calling Lambda’s service APIs directly, that must update those workflows for this change. To allow time for testing this change, we are rolling it out in a phased model, much like the initial rollout for VPC attached functions. We encourage all customers to take this opportunity to move to the latest SDKs and tools available.

Change details

Nothing is changing about how functions are created, updated, or operate as part of this. However, this change may impact certain workflows that attempt to invoke or modify a function shortly after a create or an update action. Before making API calls to a function that was recently created or modified, confirm it is first in the Active state, and that the LastUpdateStatus is Successful.

For a full explanation of both the create and update lifecycles, see Tracking the state of AWS Lambda functions.

Create function state lifecycle

Update function state lifecycle

Change timeframe

We are rolling out this change in multiple phases, starting with the Begin Testing phase today, July 12, 2021. The phases allow you to update tooling for deploying and managing Lambda functions to account for this change. By the end of the update timeline, all accounts will have transitioned to using the create/update Lambda lifecycle.

July 12, 2021 – Begin Testing: You can now begin testing and updating any deployment or management tools you have to account for the upcoming lifecycle change. You can also use this time to update your function configuration to delay the change until the End of Delayed Update.

September 6, 2021 – General Update (with optional delayed update configuration): All customers without the delayed update configuration begin seeing functions transition through the lifecycles for create and update. Customers that have used the delayed update configuration as described below will not see any change.

October 1, 2021 – End of Delayed Update: The delay mechanism expires, and customers now see the Lambda states lifecycle applied during function create or update.

Opt-in and delayed update configurations

Starting today, we are providing a mechanism for an opt-in. This allows you to update and test your tools and developer workflow processes for this change. We are also providing a mechanism to delay this change until the End of Delayed Update date. After the End of Delayed Update date, all functions will begin using the Lambda states lifecycle.

This mechanism operates on a function-by-function basis, so you can test and experiment individually without impacting your whole account. Once the General Update phase begins, all functions in an account that do not have the delayed update mechanism in place see the new lifecycle for their functions.

Both mechanisms work by adding a special string to the “Description” parameter of Lambda functions. You can add this string anywhere in the parameter, as a prefix or a suffix, or you can set the entire contents of the field. The parameter is processed at create or update in accordance with the requested action.

To opt in:

aws:states:opt-in

To delay the update:

aws:states:opt-out

NOTE: The delay configuration mechanism has no impact after the End of Delayed Update date.
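
If you manage functions with the SDK rather than the console, one way to set the opt-in flag is sketched below with boto3; the function name is a placeholder, and the snippet simply appends the marker string to the existing description.

import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "MY_FUNCTION_NAME"  # placeholder

# Read the current description and append the opt-in marker, preserving
# whatever description text is already there.
config = lambda_client.get_function_configuration(FunctionName=FUNCTION_NAME)
description = config.get("Description", "")
if "aws:states:opt-in" not in description:
    lambda_client.update_function_configuration(
        FunctionName=FUNCTION_NAME,
        Description=(description + " aws:states:opt-in").strip(),
    )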

Here is how this looks in the console:

I add the opt-in configuration to my function’s Description. You can find this under Configuration -> General Configuration in the Lambda console. Choose Edit to change the value.

Edit basic settings

After choosing Save, you can see the value in the console:

Opt-in flag set

Once the opt-in is set for a function, then updates on that function go through the preceding update flow.

Checking a function’s state

With this in place, you can now test your development workflow ahead of the General Update phase. Download the latest AWS CLI (version 2.2.18 or greater) or SDKs to see function state and related attribute information.

You can confirm the current state of a function by using the AWS APIs or AWS CLI to call the GetFunction or GetFunctionConfiguration API or command for a specified function:

$ aws lambda get-function --function-name MY_FUNCTION_NAME --query 'Configuration.[State, LastUpdateStatus]'
[
    "Active",
    "Successful"
]

This returns the State and LastUpdateStatus in order for a function.
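
For SDK-based workflows, the same check can be wrapped in a short polling loop. This is a sketch using boto3 rather than the CLI; the function name and timeout are placeholders.

import time
import boto3

lambda_client = boto3.client("lambda")

def wait_until_ready(function_name, timeout=60):
    """Poll until the function is Active and its last update was Successful."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        config = lambda_client.get_function_configuration(FunctionName=function_name)
        if config["State"] == "Active" and config.get("LastUpdateStatus") == "Successful":
            return True
        time.sleep(2)
    return False

if wait_until_ready("MY_FUNCTION_NAME"):
    print("Function is ready to invoke or modify")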

Updating CloudFormation’s service role

CloudFormation allows customers to create an AWS Identity and Access Management (IAM) service role to make calls to resources in a stack on your behalf. Customers can use service roles to allow or deny the ability to create, update, or delete resources in a stack.

As part of the rollout of function states for all functions, we recommend that customers configure CloudFormation service roles with an Allow for the “lambda:GetFunction” API. This API allows CloudFormation to get the current state of a function, which is required to assist in the creation and deployment of functions.

Conclusion

With function states, you can have better clarity on how the resources required by your Lambda function are being created. This change does not impact the way that functions are invoked or how your code is run. While this is a minor change to when resources are created for your Lambda function, the result is even better consistency of working with the service.

For more serverless learning resources, visit Serverless Land.

Security updates for Monday

Post Syndicated from original https://lwn.net/Articles/862673/rss

Security updates have been issued by Fedora (djvulibre), Gentoo (connman, gnuchess, openexr, and xen), openSUSE (arpwatch, avahi, dbus-1, dhcp, djvulibre, freeradius-server, fribidi, gstreamer, gstreamer-plugins-bad, gstreamer-plugins-base, gstreamer-plugins-good, gstreamer-plugins-ugly, gupnp, hivex, icinga2, jdom2, jetty-minimal, kernel, kubevirt, libgcrypt, libnettle, libxml2, openexr, openscad, pam_radius, polkit, postgresql13, python-httplib2, python-py, python-rsa, qemu, redis, rubygem-actionpack-5_1, salt, snakeyaml, squid, tpm2.0-tools, and xstream), Red Hat (xstream), and SUSE (bluez, csync2, dbus-1, jdom2, postgresql13, redis, slurm_20_11, and xstream).

Understanding data streaming concepts for serverless applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/understanding-data-streaming-concepts-for-serverless-applications/

Amazon Kinesis is a suite of managed services that can help you collect, process, and analyze streaming data in near-real time. It consists of four separate services designed for common tasks with streaming data; this blog post focuses on Kinesis Data Streams.

One of the main benefits of processing streaming data is that an application can react as new data is generated, instead of waiting for batches. This real-time capability enables new functionality for applications. For example, payment processors can analyze payments in real time to detect fraudulent transactions. Ecommerce websites can use streams of clickstream activity to determine site engagement metrics in near-real time.

Kinesis can be used with Amazon EC2-based and container-based workloads. However, its integration with AWS Lambda can make it a useful data source for serverless applications. Using Lambda as a stream consumer can also help minimize the amount of operational overhead for managing streaming applications.

In this post, I explain important streaming concepts and how they affect the design of serverless applications. This post references the Alleycat racing application. Alleycat is a home fitness system that allows users to compete in an intense series of 5-minute virtual bicycle races. Up to 1,000 racers at a time take the saddle and push the limits of cadence and resistance to set personal records and rank on leaderboards. The Alleycat software connects the stationary exercise bike with a backend application that processes the data from thousands of remote devices.

The Alleycat frontend allows users to configure their races and view real-time leaderboard and historical rankings. The frontend could wait until the end of each race and collect the total output from each racer. Once the batch is ready, it could rank the results and publish a leaderboard after the race is completed. However, this is not very engaging for competitors. By using streaming data instead of a batch, the application can show racers who is winning during the race. This makes the virtual environment more like a real-life cycling race.

Producers and consumers

In streaming data workloads, producers are the applications that produce data and consumers are those that process it. In a serverless streaming application, a consumer is usually a Lambda function, Amazon Kinesis Data Firehose, or Amazon Kinesis Data Analytics.

Kinesis producers and consumers

There are a number of ways to put data into a Kinesis stream in serverless applications, including direct service integrations, client libraries, and the AWS SDK.

| Producer | Kinesis Data Streams | Kinesis Data Firehose |
| --- | --- | --- |
| Amazon CloudWatch Logs | Yes, using subscription filters | Yes, using subscription filters |
| AWS IoT Core | Yes, using IoT rule actions | Yes, using IoT rule actions |
| AWS Database Migration Service | Yes – set stream as target | Not directly |
| Amazon API Gateway | Yes, via REST API direct service integration | Yes, via REST API direct service integration |
| AWS Amplify | Yes – via JavaScript library | Not directly |
| AWS SDK | Yes | Yes |

A single stream may have tens of thousands of producers, which could be web or mobile applications or IoT devices. The Alleycat application uses the AWS IoT SDK for JavaScript to publish messages to an IoT topic. An IoT rule action then uses a direct integration with Kinesis Data Streams to push the data to the stream. This configuration is ideal at the device level, especially since the device may already use AWS IoT Core to receive messages.

The Alleycat simulator uses the AWS SDK to send a large number of messages to the stream. The SDK provides two methods: PutRecord and PutRecords. The first allows you to send a single record, while the second supports up to 500 records per request (or up to 5 MB in total). The simulator uses the putRecords JavaScript API to batch messages to the stream.
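
The simulator itself is JavaScript, but the same batching pattern looks like this as a minimal boto3 sketch; the stream name and message shape here are assumptions for illustration.

import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "alleycat-races"  # hypothetical stream name

def put_batch(messages):
    """Send up to 500 records in a single PutRecords call."""
    records = [
        {
            "Data": json.dumps(msg).encode("utf-8"),
            "PartitionKey": str(msg["raceId"]),
        }
        for msg in messages
    ]
    response = kinesis.put_records(StreamName=STREAM_NAME, Records=records)
    # Individual records can fail even when the call itself succeeds.
    return response["FailedRecordCount"]

failed = put_batch([{"raceId": 1, "racerId": 7, "output": 320}])
print(f"Failed records: {failed}")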

A producer can put records directly on a stream, for example via the AWS SDK, or indirectly via other services such as Amazon API Gateway or AWS IoT Core. If direct, the producer must have appropriate permission to write data to the stream. If indirect, the producer must have permission to invoke the proxy service, and then the service must have permission to put data onto the stream.

While there may be many producers, there are comparatively fewer consumers. You can register up to 20 consumers per data stream, which share the outgoing throughput limit per shard. Consumers receive batches of records sequentially, which means processing latency increases as you add more consumers to a stream. For latency-sensitive applications, Kinesis offers enhanced fan-out, which gives each consumer 2 MB per second of dedicated throughput and uses a push model to reduce latency.

Shards, streams, and partition keys

A shard is a sequence of data records in a stream with a fixed capacity. Part of Kinesis billing is based upon the number of shards. A single shard can process up to 1 MB per second or 1,000 records of incoming data. One shard can also send up to 2 MB per second of outgoing data to downstream consumers. These are hard limits on a shard's throughput; as your application approaches them, you must add more shards to avoid exceeding them.

A stream is a collection of these shards and is often a grouping at the workload or project level. Adding another shard to a stream effectively doubles the throughput, though it also doubles the cost. When there is only one shard in a stream, all records sent to that stream are routed to the same shard. With multiple shards, the routing of incoming messages to shards is determined by a partition key.

The data producer adds the partition key before sending the record to Kinesis. The service calculates an MD5 hash of the key, which maps to one of the shards in the stream. Each shard is assigned a range of non-overlapping hash values, so each partition key maps to only one shard.

MD5 hash function
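
You can observe this mapping yourself by comparing the MD5 hash of a partition key against the hash key range assigned to each shard. The sketch below assumes a stream that has not been resharded and uses a placeholder stream name.

import hashlib
import boto3

kinesis = boto3.client("kinesis")

def shard_for_key(stream_name, partition_key):
    """Return the shard whose hash key range contains the key's MD5 hash."""
    # Kinesis interprets the MD5 digest of the partition key as a 128-bit integer.
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    shards = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"]
    for shard in shards:
        hash_range = shard["HashKeyRange"]
        if int(hash_range["StartingHashKey"]) <= key_hash <= int(hash_range["EndingHashKey"]):
            return shard["ShardId"]

print(shard_for_key("alleycat-races", "race-42"))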

The partition key exists as an alternative to specifying a shard ID directly, since it’s common in production applications to add and remove shards depending upon traffic. How you use the partition key determines the shard-mapping behavior. For example:

  • Same value: If you specify the same string as the partition key, every message is routed to a single shard, regardless of the number of shards in the stream. This is commonly referred to as a hot shard.
  • Random value: Using a pseudo-random value, such as a UUID, evenly distributes messages between all the shards available.
  • Time-based: Using a timestamp as a partition key may result in a preference for a single shard if multiple messages arrive at the same time.
  • Application-specific: The Alleycat application uses the raceId as a partition key to ensure that all messages from a single race are processed by the same shard consumer.

A Lambda function is a consumer application for a data stream and processes one batch of records for each shard. Since Alleycat uses a tumbling window to calculate aggregates between batches, this use of the partition key ensures that all messages for each raceId are processed by the same function. The downside to this architecture is that it is limited to 1,000 incoming messages per second with the same raceId since it is bound to a single shard.

Deciding on a partition key strategy depends upon the specific needs of your workload. In most cases, a random value partition key is the best approach.

Streaming payloads in serverless applications

When using the SDK to put messages to a stream, the Data attribute can be a buffer, typed array, blob, or string. Combined with the partition key value, the maximum record size is 1 MB. The Data value is base64 encoded when serialized by Kinesis and delivered as an encoded value to downstream consumers. When using a Lambda function consumer, Kinesis delivers batches in a Records array. Each record contains the encoded data attribute, partition key, and additional metadata in a JSON envelope:

JSON transformation from producer to consumer
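
A minimal Lambda consumer sketch in Python that unpacks this envelope might look like the following; the fields inside the decoded payload are assumptions.

import base64
import json

def handler(event, context):
    """Decode each Kinesis record delivered in the Records array."""
    for record in event["Records"]:
        kinesis_data = record["kinesis"]
        partition_key = kinesis_data["partitionKey"]
        # The Data value arrives base64 encoded.
        payload = json.loads(base64.b64decode(kinesis_data["data"]))
        print(partition_key, payload)
    return {"batchSize": len(event["Records"])}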

Ordering and idempotency

Records in a Kinesis stream are delivered to consuming applications in the same order that they arrive at the Kinesis service. The service assigns a sequence number to the record when it is received and this is delivered as part of the payload to a Kinesis consumer:

Sequence number in payload

When using Lambda as a consuming application for Kinesis, by default each shard has a single instance of the function processing records. In this case, ordering is guaranteed as Kinesis invokes the function serially, one batch of records at a time.

Parallelization factor of 1

You can increase the number of concurrent function invocations by setting the ParallelizationFactor on the event source mapping. This allows you to set a concurrency of between 1 and 10, which provides a way to increase Lambda throughput if the IteratorAge metric is increasing. However, one side effect is that ordering per shard is no longer guaranteed, since the shard's messages are split into multiple subgroups based upon an internal hash.

Parallelization factor is 2
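
As a sketch, raising the factor on an existing event source mapping with boto3 looks like this; the mapping UUID is a placeholder.

import boto3

lambda_client = boto3.client("lambda")

# Increase per-shard concurrency from 1 to 2. After this change, ordering
# per shard is no longer guaranteed, as described above.
lambda_client.update_event_source_mapping(
    UUID="EVENT_SOURCE_MAPPING_UUID",  # placeholder
    ParallelizationFactor=2,
)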

Kinesis guarantees that every record is delivered “at least once”, but occasionally messages are delivered more than once. This is caused by producers that retry messages, network-related timeouts, and consumer retries, which can occur when worker processes restart. These are normal activities, and you should design your application to handle infrequent duplicate records.

To prevent duplicate messages causing unintentional side effects, such as charging a payment twice, it's important to design your application with idempotency in mind. By using transaction IDs appropriately, your code can determine whether a given message has been processed previously and ignore any duplicates. In the Alleycat application, the aggregation and processing of messages is idempotent. If two identical messages are received, processing completes with the same end result.
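
One common pattern is a conditional write to a tracking table keyed on the transaction ID, so a record is processed only on first delivery. This sketch is illustrative only; the table name and ID field are hypothetical and not part of the Alleycat code.

import boto3

dynamodb = boto3.resource("dynamodb")
processed = dynamodb.Table("processed-messages")  # hypothetical table

def process_once(message):
    """Apply side effects only if this transaction ID has not been seen before."""
    try:
        processed.put_item(
            Item={"transaction_id": message["transactionId"]},
            ConditionExpression="attribute_not_exists(transaction_id)",
        )
    except processed.meta.client.exceptions.ConditionalCheckFailedException:
        return  # duplicate delivery; skip the side effects
    handle(message)  # your business logic, for example updating aggregates

def handle(message):
    print("processing", message)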

To learn more about implementing idempotency in serverless applications, read the “Serverless Application Lens: AWS Well-Architected Framework”.

Conclusion

In this post, I introduce some of the core streaming concepts for serverless applications. I explain some of the benefits of streaming architectures and how Kinesis works with producers and consumers. I compare different ways to ingest data, how streams are composed of shards, and how partition keys determine which shard is used. Finally, I explain the payload formats at the different stages of a streaming workload, how message ordering works with shards, and why idempotency is important to handle.

To learn more about building serverless web applications, visit Serverless Land.

Charge your Tesla automatically with Raspberry Pi

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/charge-your-tesla-automatically-with-raspberry-pi/

It’s the worst feeling in the world: waking up and realising you forgot to put your electric car on charge overnight. What do you do now? Dig a bike out of the shed? Wait four hours until there’s enough juice in the battery to get you where you need to be? Neither option works if you’re running late. If only there were a way to automate the process, so that when you park up, the charger finds its way to the charging port on its own. That would make life so much easier.

This is quite the build

Of course, this is all conjecture, because I drive a car made in the same year I started university. Not even the windows go up and down automatically. But I can dream, and I still love this automatic Tesla charger built with Raspberry Pi.

Wait, don’t Tesla make those already?

Back in 2015, Tesla released a video of their own prototype, which could automatically charge their cars. But things have gone quiet, and nothing seems to be coming to market any time soon – nothing directly from Tesla, anyway. And while we like the slightly odd snake-charmer vibes the Tesla prototype gives off, we really like Pat’s commitment to spending hours tinkering in order to automate a 20-second manual job. It’s how we do things around here.

This video makes me feel weird

Electric vehicle enthusiast Andrew Erickson has been keeping up with the prototype’s whereabouts, and discussed it on YouTube in 2020.

How did Pat build his home-made charger?

Tired of waiting on Tesla, Pat took matters into his own hands and developed a home-made solution with Raspberry Pi 4. Our tiny computer is the “brains of everything”, and is mounted to a carriage on Pat’s garage wall.

The entire rig mounted to Pat’s garage wall

There’s a big servo at the end of the carriage, which rotates the charging arm out when it’s needed. And an ultrasonic distance sensor ensures none of the home-made apparatus hits the car.

The big white thing on the left is the charging arm. Pat is pointing to the little green Raspberry Pi Camera Module up top, and the yellow box at the bottom is the distance sensor

How does the charger find the charging port?

A Raspberry Pi Camera Module takes photos and sends them back to a machine learning model (Pat used TensorFlow Lite) running on his Raspberry Pi 4. This is how the charging arm finds its way to the port. You can watch the model in action from this point in the build video.

“Marco!” “Polo!” “Marco!” “Polo!”

Top stuff, Pat. Now I just need to acquire a Tesla from somewhere so I can build one for my own garage. Wait, I don’t have a garage either…

The post Charge your Tesla automatically with Raspberry Pi appeared first on Raspberry Pi.
