Darktable 3.8.0 released

Post Syndicated from original https://lwn.net/Articles/879765/rss

Version 3.8.0 of the Darktable photo-processing application has been released. Significant changes include a new keyboard shortcut system, a new diffuse-or-sharpen module, a new “scene-referred” blurs module “to synthesize motion and lens blurs in a parametric and physically accurate way”, support for the Canon CR3 raw format, and more.

How fEMR Delivers Cryptographically Secure and Verifiable Medical Data with Amazon QLDB

Post Syndicated from Patrick Gryczka original https://aws.amazon.com/blogs/architecture/how-femr-delivers-cryptographically-secure-and-verifiable-emr-medical-data-with-amazon-qldb/

This post was co-written by Team fEMR’s President & Co-founder, Sarah Draugelis; CTO, Andy Mastie; Core Team Engineer & Fennel Labs Co-founder, Sean Batzel; Patrick Gryczka, AWS Solutions Architect; Mithil Prasad, AWS Senior Customer Solutions Manager.

Team fEMR is a non-profit organization that created a free, open-source Electronic Medical Records system for transient medical teams. Their system has helped bring aid and drive data-driven communication in disaster relief scenarios and low-resource settings around the world since 2014. In the past year, Team fEMR integrated Amazon Quantum Ledger Database (QLDB) as their HIPAA-compliant database solution to address their need for data integrity and verifiability.

When delivering aid to at-risk communities and within challenging social and political environments, data veracity is a critical issue. Patients need to trust that their data is confidential and follows an appropriate chain of ownership. Aid organizations, meanwhile, need to trust the demographic data provided to them in order to understand the scope of disasters and verify the use of funds. Amazon QLDB is backed by an immutable, append-only journal. This journal is verifiable, making it easier for Team fEMR to engage in external audits and deliver a trusted and verifiable EMR solution. The teams that use the new fEMR system are typically working or volunteering in post-disaster environments, refugee camps, or other mobile clinics that offer pre-hospital care.

In this blog post, we explore how Team fEMR leveraged Amazon QLDB and other AWS Managed Services to enable their relief efforts.

Background

Before the use of an electronic record-keeping system, these teams often had to use paper records, which were easily lost or destroyed. The new fEMR system allows the clinician to look up the patient’s history of illness, in order to provide the patient with seamless continuity of care between clinic visits. Additionally, the collection of health data on a more macroscopic level allows researchers and data scientists to monitor for disease. This is an especially important aspect of mobile medicine in a pandemic world.

The fEMR system has been deployed worldwide since 2014. In the original design, the system functioned solely in an on-premises environment. Clinicians were able to attach their own devices to the system, and have access to the EMR functionality and medical records. While the need for a standalone solution continues to exist for environments without data coverage and connectivity, demand for fEMR has increased rapidly and outpaced deployment capabilities as well as hardware availability. To solve for real-time deployment and scalability needs, Team fEMR migrated fEMR to the cloud and developed a HIPAA-compliant and secure architecture using Amazon QLDB.

As part of their cloud adoption strategy, Team fEMR decided to procure more managed services, to automate operational tasks and to optimize their resources.

Figure 1 – Architecture showing how fEMR delivers cryptographically secure and verifiable EMR medical data with Amazon QLDB

The team built the preceding architecture using a combination of the following AWS managed services:

1. Amazon Quantum Ledger Database (QLDB) – By using QLDB, Team fEMR was able to build an unfalsifiable, HIPAA-compliant, and cryptographically verifiable record of medical histories, as well as of clinic pharmacy usage and stocking.

2. AWS Elastic Beanstalk – Team fEMR uses Elastic Beanstalk to deploy and run their Django front end and application logic.  It allows their agile development team to focus on development and feature delivery, by offloading the operational work of deployment, patching, and scaling.

3. Amazon Relational Database Service (RDS) – Team fEMR uses RDS for ad hoc search, reporting, and analytics, workloads for which QLDB is not optimized.

4. Amazon ElastiCache – Team fEMR uses Amazon ElastiCache to cache user session data and provide near real-time response times (a minimal configuration sketch follows this list).
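
As an illustration of the session-caching pattern described in the last item, here is a minimal, hypothetical Django settings sketch that backs the session store with a Redis-compatible ElastiCache endpoint. The endpoint hostname and the use of Django 4.0's built-in Redis cache backend are assumptions for the sketch, not details taken from the fEMR codebase.

# settings.py (sketch) -- assumes Django 4.0+ and a Redis-compatible
# ElastiCache endpoint; the hostname below is a placeholder.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://sessions.example.cache.amazonaws.com:6379",
    }
}

# Store session data in the cache so lookups avoid a database round trip.
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
SESSION_CACHE_ALIAS = "default"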

Data security considerations

Data Security was a key consideration when building the fEMR EHR solution. End users of the solution are often working in environments with at-risk populations. That is, patients who may be fleeing persecution from their home country, or may be at risk due to discriminatory treatment. It is therefore imperative to secure their data. QLDB as a data storage mechanism provides a cryptographically secure history of all changes to data. This has the benefit of improved visibility and auditability and is invaluable in situations where medical data needs to be as tamper-evident as possible.
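
To make the tamper-evidence concrete, the following is a minimal boto3 sketch that requests a ledger digest, the anchor against which individual document revisions can later be cryptographically verified. The ledger name is a placeholder and error handling is omitted; this illustrates the QLDB verification API rather than code from the fEMR project.

import boto3

# Sketch only: "femr-ledger" is a placeholder ledger name.
qldb = boto3.client("qldb")

# Request the current digest of the ledger's journal. The digest is a
# SHA-256 hash covering the journal's full history, so any later tampering
# with past revisions would fail verification against it.
response = qldb.get_digest(Name="femr-ledger")

digest = response["Digest"]                  # 32-byte hash of the journal
tip_address = response["DigestTipAddress"]   # block address the digest covers

print("Digest (hex):", digest.hex())
print("Digest tip address:", tip_address["IonText"])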

Using Auto Scaling to minimize operational effort

When Team fEMR engages with disaster relief, they deal with ambiguity around both when events may occur and at what scale their effects may be felt. By leveraging managed services like QLDB, RDS, and Elastic Beanstalk, Team fEMR was able to minimize the time their technical team spends on systems operations. Instead, they can focus on optimizing and improving their technology architecture.

Use of Infrastructure as Code to enable fast global deployment

With Infrastructure as Code, Team fEMR was able to create repeatable deployments. They utilized AWS CloudFormation to deploy their ElastiCache, RDS, QLDB, and Elastic Beanstalk environment. Elastic Beanstalk was used to further automate the deployment of infrastructure for their Django stack. Repeatable deployment gives the team the flexibility they need to deploy to specific Regions to meet geographic and data sovereignty requirements.

Optimizing the architecture

The team found that simultaneously writing to two databases could cause inconsistencies if a write succeeds in one database and fails in the other. In addition, it puts the burden of identifying errors and rolling back updates on the application. Therefore, an improvement planned for this architecture is to stream successful transactions from Amazon QLDB to Amazon RDS using Amazon Kinesis Data Streams. This provides a way to replicate data from Amazon QLDB to Amazon RDS, as well as to any other databases or dashboards. QLDB remains their system of record.
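
As a rough illustration of that planned improvement, the sketch below uses boto3 to start a QLDB journal stream into a Kinesis data stream. The ledger name, role ARN, and stream ARN are placeholders, and a downstream consumer (for example, a Lambda function) would still be needed to apply the streamed records to Amazon RDS.

import boto3
from datetime import datetime, timezone

qldb = boto3.client("qldb")

# Sketch only: ledger name, IAM role ARN, and Kinesis stream ARN are
# placeholders. The role must allow QLDB to write to the Kinesis stream.
response = qldb.stream_journal_to_kinesis(
    LedgerName="femr-ledger",
    RoleArn="arn:aws:iam::123456789012:role/qldb-stream-role",
    InclusiveStartTime=datetime(2021, 1, 1, tzinfo=timezone.utc),
    KinesisConfiguration={
        "StreamArn": "arn:aws:kinesis:us-east-1:123456789012:stream/femr-journal-stream",
        "AggregationEnabled": True,
    },
    StreamName="femr-qldb-to-rds",
)

print("Journal stream started, id:", response["StreamId"])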

Figure 2 – Optimized Architecture using Amazon Kinesis Data Streams

Conclusion

As a result of migrating their system to the cloud, Team fEMR was able to deliver their EMR system with less operational overhead and instead focus on bringing their solution to the communities that need it. By using Amazon QLDB, Team fEMR made their solution easier to audit and built greater trust in their work with at-risk populations.

To learn more about Team fEMR, you can read about their efforts on their organization’s website, and explore their Open Source contributions on GitHub.

For hands-on experience with Amazon QLDB, you can reference our QLDB Workshops and explore our QLDB Developer Guide.

Systemd 250 released

Post Syndicated from original https://lwn.net/Articles/879739/rss

Systemd 250 has been released. To say that the list of new features is
long would be a severe understatement; the developers have clearly been
busy.

systemd-homed now makes use of UID mapped mounts for the home areas.
If the kernel and used file system support it, files are now
internally owned by the “nobody” user (i.e. the user typically used
for indicating “this ownership is not mapped”), and dynamically
mapped to the UID used locally on the system via the UID mapping
mount logic of recent kernels. This makes migrating home areas
between different systems cheaper because recursively chown()ing file
system trees is no longer necessary.

(See this article for a description of ID-mapped mounts.)

Deep dive into NitroTPM and UEFI Secure Boot support in Amazon EC2

Post Syndicated from Neelay Thaker original https://aws.amazon.com/blogs/compute/deep-dive-into-nitrotpm-and-uefi-secure-boot-support-in-amazon-ec2/

Contributed by Samartha Chandrashekar, Principal Product Manager Amazon EC2

At re:Invent 2021, we announced NitroTPM, a Trusted Platform Module (TPM) 2.0 and Unified Extensible Firmware Interface (UEFI) Secure Boot support in Amazon EC2. In this blog post, we’ll share additional details on how these capabilities can help further raise the security bar of EC2 deployments.

A TPM is a security device to gather and attest system state, store and generate cryptographic data, and prove platform identity. Although TPMs are traditionally discrete chips or firmware modules, their adaptation on AWS as NitroTPM preserves their security properties without affecting the agility and scalability of EC2. NitroTPM makes it possible to use TPM-dependent applications and Operating System (OS) capabilities in EC2 instances. It conforms to the TPM 2.0 specification, which makes it easy to migrate existing on-premises workloads that use TPM functionalities to EC2.

Unified Extensible Firmware Interface (UEFI) Secure Boot is a feature of UEFI that builds on EC2’s long-standing secure boot process and provides additional defense-in-depth that helps you secure software from threats that persist across reboots. It ensures that EC2 instances run authentic software by verifying the digital signature of all boot components, and halts the boot process if signature verification fails. When used with UEFI Secure Boot, NitroTPM can verify the integrity of software that boots and runs in the EC2 instance. It can measure instance properties and components as evidence that unaltered software was used, in the correct order, during boot. Features such as “Measured Boot” in Windows, and Linux Unified Key Setup (LUKS) and dm-verity in popular Linux distributions, can use NitroTPM to further secure OS launches from malware with administrative privileges that attempts to persist across reboots.

NitroTPM derives its root-of-trust from the Nitro Security Chip and performs the same functions as a physical/discrete TPM. Similar to discrete TPMs, an immutable private and public Endorsement Key (EK) is set up inside the NitroTPM by AWS during instance creation. NitroTPM can serve as a “root-of-trust” to verify the provenance of software in the instance (e.g., NitroTPM’s EKCert as the basis for SSL certificates). Sensitive information protected by NitroTPM is made available only if the OS has booted correctly (i.e., boot measurements match expected values). If the system is tampered with, keys are not released because the TPM state differs, thereby ensuring protection from malware attempting to hijack the boot process. NitroTPM can protect volume encryption keys used by full-disk encryption utilities (such as dm-crypt and BitLocker) or private keys for certificates.

NitroTPM can be used for attestation, a process to demonstrate that an EC2 instance meets pre-defined criteria, thereby allowing you to gain confidence in its integrity. It can be used to make an instance’s access to a resource (such as a service or a database) contingent on its health state (e.g., patching level, presence of mandated agents, etc.). For example, a private key can be “sealed” to a list of measurements of the specific programs allowed to “unseal” it. This makes it suited for use cases such as digital rights management, or gating LDAP login and database access on attestation. Access to AWS Key Management Service (KMS) keys used to encrypt/decrypt data accessed by the instance can be made to require affirmative attestation of instance health. Anti-malware software (e.g., Windows Defender) can initiate remediation actions if attestation fails.

NitroTPM uses Platform Configuration Registers (PCR) to store system measurements. These do not change until the next boot of the instance. PCR measurements are computed during the boot process before malware can modify system state or tamper with the measuring process. These values are compared with pre-calculated known-good values, and secrets protected by NitroTPM are released only if the sequences match. PCRs are recalculated after each reboot, which ensures protection against malware aiming to hijack the boot process or persist across reboots. For example, if malware overwrites part of the kernel, measurements change, and disk decryption keys sealed to NitroTPM are not unsealed. Trust decisions can also be made based on additional criteria such as boot integrity, patching level, etc.

The workflow below shows how UEFI Secure Boot and NitroTPM work to ensure system integrity during OS startup.

To get started, you’ll need to register an Amazon Machine Image (AMI) of an Operating System that supports TPM 2.0 and UEFI Secure Boot using the register-image primitive via the CLI, API, or console. Alternatively, you can use pre-configured AMIs from AWS for both Windows and Linux to launch EC2 instances with TPM and Secure Boot. The screenshot below shows a Windows Server 2019 instance on EC2 launched with NitroTPM using its inbox TPM 2.0 drivers to recognize a TPM device.
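
As a rough sketch of that registration step, the boto3 call below registers an EBS-backed AMI with TPM 2.0 and UEFI boot mode enabled. The snapshot ID and image name are placeholders, and the parameter names reflect the EC2 RegisterImage API as later released, so they may differ slightly from what you see at launch; treat this as an illustration rather than an official example.

import boto3

ec2 = boto3.client("ec2")

# Sketch only: snapshot ID and AMI name are placeholders. TpmSupport and
# BootMode are the parameters that opt instances launched from this AMI
# into NitroTPM and UEFI.
response = ec2.register_image(
    Name="my-nitrotpm-enabled-ami",
    Architecture="x86_64",
    RootDeviceName="/dev/sda1",
    VirtualizationType="hvm",
    EnaSupport=True,
    BootMode="uefi",       # required for UEFI Secure Boot
    TpmSupport="v2.0",     # exposes a TPM 2.0 device (NitroTPM) to the instance
    BlockDeviceMappings=[
        {
            "DeviceName": "/dev/sda1",
            "Ebs": {"SnapshotId": "snap-0123456789abcdef0"},
        }
    ],
)

print("Registered AMI:", response["ImageId"])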

NitroTPM and UEFI Secure Boot enable you to further raise the bar for running your workloads in a secure and trustworthy manner. We’re excited for you to try out NitroTPM when it becomes publicly available in 2022. Contact [email protected] for additional information.

Enrich datasets for descriptive analytics with AWS Glue DataBrew

Post Syndicated from Daniel Rozo original https://aws.amazon.com/blogs/big-data/enrich-datasets-for-descriptive-analytics-with-aws-glue-databrew/

Data analytics remains a perennially hot topic. More and more businesses are beginning to understand the potential of their data to help them serve customers more effectively and gain a competitive advantage. However, for many small to medium businesses, gaining insight from their data can be challenging because they often lack in-house data engineering skills and knowledge.

Data enrichment is another challenge. Businesses that focus on analytics using only their internal datasets miss the opportunity to gain better insights by using reliable and credible public datasets. Small to medium businesses are no exception to this shortcoming, where obstacles such as not having sufficient data diminish their ability to make well-informed decisions based on accurate analytical insights.

In this post, we demonstrate how AWS Glue DataBrew enables businesses of all sizes to get started with data analytics with no prior coding knowledge. DataBrew is a visual data preparation tool that makes it easy for data analysts and scientists to clean and normalize data in preparation for analytics or machine learning. It includes more than 350 pre-built transformations for common data preparation use cases, enabling you to get started with cleaning, preparing, and combining your datasets without writing code.

For this post, we assume the role of a fictitious small Dutch solar panel distribution and installation company named OurCompany. We demonstrate how this company can prepare, combine, and enrich an internal dataset with publicly available data from the Dutch public entity, the Centraal Bureau voor de Statistiek (CBS), or in English, Statistics Netherlands. Ultimately, OurCompany desires to know how well they’re performing compared to the official reported values by the CBS across two important key performance indicators (KPIs): the amount of solar panel installations, and total energy capacity in kilowatt (kW) per region.

Solution overview

The architecture uses DataBrew for data preparation and transformation, Amazon Simple Storage Service (Amazon S3) as the storage layer of the entire data pipeline, and the AWS Glue Data Catalog for storing the dataset’s business and technical metadata. Following the modern data architecture best practices, this solution adheres to foundational logical layers of the Lake House Architecture.

The solution includes the following steps:

  1. We set up the storage layer using Amazon S3 by creating the following folders: raw-data, transformed-data, and curated-data. We use these folders to track the different stages of our data pipeline consumption readiness.
  2. Three CSV raw data files containing unprocessed data of solar panels as well as the external datasets from the CBS are ingested into the raw-data S3 folder.
  3. This part of the architecture incorporates both processing and cataloging capabilities:
    1. We use AWS Glue crawlers to populate the initial schema definition tables for the raw dataset automatically. For the remaining two stages of the data pipeline (transformed-data and curated-data), we utilize the functionality in DataBrew to directly create schema definition tables into the Data Catalog. Each table provides an up-to-date schema definition of the datasets we store on Amazon S3.
    2. We work with DataBrew projects as the centerpiece of our data analysis and transformation efforts. Here, we set up no-code data preparation and transformation steps and visualize them through a highly interactive, intuitive user interface. Finally, we define DataBrew jobs to apply these steps and store transformation outputs on Amazon S3.
  4. To gain the benefits of granular access control and easily visualize data from Amazon S3, we take advantage of the seamless integration between Amazon Athena and Amazon QuickSight. This provides a SQL interface to query all the information we need from the curated dataset stored on Amazon S3 without the need to create and maintain manifest files.
  5. Finally, we construct an interactive dashboard with QuickSight to depict the final curated dataset alongside our two critical KPIs.

Prerequisites

Before beginning this tutorial, make sure you have the required Identity and Access Management (IAM) permissions to create the resources required as part of the solution. Your AWS account should also have an active subscription to QuickSight to create the visualization on processed data. If you don’t have a QuickSight account, you can sign up for an account.

The following sections provide a step-by-step guide to create and deploy the entire data pipeline for OurCompany without the use of code.

Data preparation steps

We work with the following files:

  • CBS Dutch municipalities and provinces (Gemeentelijke indeling op 1 januari 2021) – Holds the names and codes of all municipalities and provinces of the Netherlands. Download the file gemeenten alfabetisch 2021. Open the file and save it as cbs_regions_nl.csv. Remember to change the format to CSV (comma-delimited).
  • CBS Solar power dataset (Zonnestroom; vermogen bedrijven en woningen, regio, 2012-2018) – This file contains the installed capacity in kilowatts and total number of installations for businesses and private homes across the Netherlands from 2012–2018. To download the file, go to the dataset page, choose the Onbewerkte dataset, and download the CSV file. Rename the file to cbs_sp_cap_nl.csv.
  • OurCompany’s solar panel historical data – Contains the reported energy capacity from all solar panel installations of OurCompany across the Netherlands from 2012 until 2018. Download the file.

As a result, the following are the expected input files we use to work with the data analytics pipeline:

  • cbs_regions_nl.csv
  • cbs_sp_cap_nl.csv
  • sp_data.csv

Set up the storage layer

We first need to create the storage layer for our solution to store all raw, transformed, and curated datasets. We use Amazon S3 as the storage layer of our entire data pipeline. The console steps are below; a minimal boto3 sketch of the same setup follows the list.

  1. Create an S3 bucket in the AWS Region where you want to build this solution. In our case, the bucket is named cbs-solar-panel-data. You can use the same name followed by a unique identifier.
  2. Create the following three prefixes (folders) in your S3 bucket by choosing Create folder:
    1. curated-data/
    2. raw-data/
    3. transformed-data/

  3. Upload the three raw files to the raw-data/ prefix.
  4. Create two prefixes within the transformed-data/ prefix named cbs_data/ and sp_data/.
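
If you prefer to script this setup, the following is a minimal boto3 sketch of the same steps. The bucket name, Region, and local file paths are placeholders and assumptions; adjust them to your environment.

import boto3

s3 = boto3.client("s3", region_name="eu-west-1")  # placeholder Region

bucket = "cbs-solar-panel-data-<unique-suffix>"   # placeholder bucket name

# Create the bucket (LocationConstraint is required outside us-east-1).
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Create the stage prefixes, including the two transformed-data subfolders.
for prefix in [
    "raw-data/",
    "transformed-data/",
    "transformed-data/cbs_data/",
    "transformed-data/sp_data/",
    "curated-data/",
]:
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the three raw input files into raw-data/ (local paths assumed).
for filename in ["cbs_regions_nl.csv", "cbs_sp_cap_nl.csv", "sp_data.csv"]:
    s3.upload_file(filename, bucket, f"raw-data/{filename}")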

Create a Data Catalog database

After we set up the storage layer of our data pipeline, we need to create the Data Catalog to store all the metadata of the datasets hosted in Amazon S3. To do so, follow these steps (a boto3 equivalent follows the list):

  1. Open the AWS Glue console in the same Region of your newly created S3 bucket.
  2. In the navigation pane, choose Databases.
  3. Choose Add database.
  4. Enter the name for the Data Catalog to store all the dataset’s metadata.
  5. Name the database sp_catalog_db.
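
The same database can also be created programmatically. Here is a small boto3 sketch using the database name from the steps above; the description is illustrative only.

import boto3

glue = boto3.client("glue")

# Create the Data Catalog database that will hold the raw, transformed,
# and curated table definitions.
glue.create_database(
    DatabaseInput={
        "Name": "sp_catalog_db",
        "Description": "Metadata for the solar panel analytics pipeline",
    }
)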

Create AWS Glue data crawlers

Now that we’ve created the catalog database, it’s time to crawl the raw-data prefix to automatically retrieve the metadata associated with each input file.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Add a crawler with the name crawler_raw and choose Next.
  3. For S3 path, select the raw-data folder of the cbs-solar-panel-data bucket.
  4. Create an IAM role and name it AWSGlueServiceRole-cbsdata.
  5. Leave the frequency as Run on demand.
  6. Choose the sp_catalog_db database created in the previous section, and enter the prefix raw_ to identify the tables that belong to the raw data folder.
  7. Review the parameters of the crawler and then choose Finish.
  8. After the crawler is created, select it and choose Run crawler.

After the crawler runs successfully, three tables are created in the sp_catalog_db database: raw_sp_data_csv, raw_cbs_regions_nl_csv, and raw_cbs_sp_cap_nl_csv.
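
For completeness, here is a hedged boto3 sketch of the same crawler setup and run. The role ARN and bucket name are placeholders; the table prefix and database match the console steps above.

import boto3

glue = boto3.client("glue")

# Sketch only: the role ARN and bucket name are placeholders.
glue.create_crawler(
    Name="crawler_raw",
    Role="arn:aws:iam::123456789012:role/AWSGlueServiceRole-cbsdata",
    DatabaseName="sp_catalog_db",
    TablePrefix="raw_",
    Targets={
        "S3Targets": [
            {"Path": "s3://cbs-solar-panel-data-<unique-suffix>/raw-data/"}
        ]
    },
)

# No Schedule is configured, so the crawler runs on demand; start it once.
glue.start_crawler(Name="crawler_raw")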

Create DataBrew raw datasets

To utilize the power of DataBrew, we need to connect datasets that point to the Data Catalog S3 tables we just created. Follow these steps to connect the datasets:

  1. On the DataBrew console, choose Datasets in the navigation pane.
  2. Choose Connect new dataset.
  3. Name the dataset cbs-sp-cap-nl-dataset.
  4. For Connect to new dataset, choose Data Catalog S3 tables.
  5. Select the sp_catalog_db database and the raw_cbs_sp_cap_nl_csv table.
  6. Choose Create dataset.

We need to create two more datasets following the same process. The following table summarizes the dataset names and the catalog tables they point to; a boto3 sketch of dataset creation follows the table.

Dataset name – Data Catalog table
sp-dataset – raw_sp_data_csv
cbs-regions-nl-dataset – raw_cbs_regions_nl_csv
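
The following boto3 sketch connects the same two Data Catalog-backed DataBrew datasets; names are taken from the table above and error handling is omitted.

import boto3

databrew = boto3.client("databrew")

datasets = {
    "sp-dataset": "raw_sp_data_csv",
    "cbs-regions-nl-dataset": "raw_cbs_regions_nl_csv",
}

# Each DataBrew dataset points at an existing Data Catalog table rather
# than directly at the S3 files, so the crawled schema is reused.
for name, table in datasets.items():
    databrew.create_dataset(
        Name=name,
        Input={
            "DataCatalogInputDefinition": {
                "DatabaseName": "sp_catalog_db",
                "TableName": table,
            }
        },
    )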

Import DataBrew recipes

A recipe is a set of data transformation steps. These transformations are applied to one or multiple datasets of your DataBrew project. For more information about recipes, see Creating and using AWS Glue DataBrew recipes.

We have prepared three DataBrew recipes, which contain the set of data transformation steps we need for this data pipeline. Some of these transformation steps include: renaming columns (from Dutch to English), removing null or missing values, aggregating rows based on specific attributes, and combining datasets in the transformation stage.

To import the recipes, follow these instructions:

  1. On the DataBrew console, choose Recipes in the navigation pane.
  2. Choose Upload recipe.
  3. Enter the name of the recipe: recipe-1-transform-cbs-data.
  4. Upload the following JSON recipe.
  5. Choose Create recipe.

Now we need to upload two more recipes that we use for transformation and aggregation projects in DataBrew.

  1. Follow the same procedure to import the remaining two recipes from their downloaded source files: recipe-2-transform-sp-data and recipe-3-curate-sp-cbs-data (a boto3 sketch for uploading a recipe follows this list).
  2. Make sure the recipes are listed in the Recipes section filtered by All recipes.
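
If you prefer to script the upload, here is a hedged boto3 sketch that reads one of the downloaded recipe files and registers it. It assumes the downloaded JSON is a list of steps in the format DataBrew expects, and the local filename is a placeholder.

import json
import boto3

databrew = boto3.client("databrew")

# Placeholder local path for a downloaded recipe definition.
with open("recipe-1-transform-cbs-data.json") as f:
    steps = json.load(f)  # assumed to be a list of DataBrew recipe steps

# Create a working version of the recipe, then publish it so that
# projects and jobs can reference a published revision.
databrew.create_recipe(Name="recipe-1-transform-cbs-data", Steps=steps)
databrew.publish_recipe(Name="recipe-1-transform-cbs-data")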

Set up DataBrew projects and jobs

After we successfully create the Data Catalog database, crawlers, DataBrew datasets, and import the DataBrew recipes, we need to create the first transformation project.

CBS external data transformation project

The first project takes care of transforming, cleaning, and preparing cbs-sp-cap-nl-dataset. To create the project, follow these steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Create a new project with the name 1-transform-cbs-data.
  3. In the Recipe details section, choose Edit existing recipe and choose the recipe recipe-1-transform-cbs-data.
  4. Select the newly created cbs-sp-cap-nl-dataset under Select a dataset.
  5. In the Permissions section, choose Create a new IAM role.
  6. As suffix, enter sp-project.
  7. Choose Create project.

After you create the project, a preview dataset is displayed as a result of applying the selected recipe. When you choose 10 more recipe steps, the service shows the entire set of transformation steps.

After you create the project, you need to grant put and delete S3 object permissions to the created role AWSGlueDataBrewServiceRole-sp-project on IAM. Add an inline policy using the following JSON and replace the resource with your S3 bucket name:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<your-S3-bucket-name>/*"
        }
    ]
}

This role also needs permissions to access the Data Catalog. To grant these permissions, add the managed policy AWSGlueServiceRole to the role.
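
As a convenience, both permission changes can also be applied with boto3. This sketch reuses the inline policy JSON shown above and assumes the role was created with the sp-project suffix.

import json
import boto3

iam = boto3.client("iam")
role_name = "AWSGlueDataBrewServiceRole-sp-project"

# Inline policy granting put/delete on the pipeline bucket (the JSON above).
inline_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::<your-S3-bucket-name>/*",
        }
    ],
}

iam.put_role_policy(
    RoleName=role_name,
    PolicyName="databrew-s3-write",
    PolicyDocument=json.dumps(inline_policy),
)

# Managed policy granting access to the AWS Glue Data Catalog.
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)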

CBS external data transformation job

After we define the project, we need to configure and run a job to apply the transformation across the entire raw dataset stored in the raw-data folder of your S3 bucket. To do so, you need to do the following:

  1. On the DataBrew project page, choose Create job.
  2. For Job name, enter 1-transform-cbs-data-job.
  3. For Output to, choose Data Catalog S3 tables.
  4. For File type, choose Parquet.
  5. For Database name, choose sp_catalog_db.
  6. For Table name, choose Create new table.
  7. For Catalog table name, enter transformed_cbs_data.
  8. For S3 location, enter s3://<your-S3-bucket-name>/transformed-data/cbs_data/.
  9. In the job output settings section, choose Settings.
  10. Select Replace output files for each job run and then choose Save.
  11. In the permissions section, choose the automatically created role with the sp-project suffix; for example, AWSGlueDataBrewServiceRole-sp-project.
  12. Review the job details once more and then choose Create and run job.
  13. Back in the main project view, choose Job details.

After a few minutes, the job status changes from Running to Successful. Choose the output to go to the S3 location where all the generated Parquet files are stored.
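
For readers scripting the pipeline, here is a hedged boto3 sketch of an equivalent job definition and run. The role ARN and bucket name are placeholders, and the settings mirror the console choices above; note that polling immediately after starting the job will usually report a RUNNING state.

import boto3

databrew = boto3.client("databrew")

# Sketch only: the role ARN and bucket name are placeholders.
databrew.create_recipe_job(
    Name="1-transform-cbs-data-job",
    ProjectName="1-transform-cbs-data",
    RoleArn="arn:aws:iam::123456789012:role/AWSGlueDataBrewServiceRole-sp-project",
    DataCatalogOutputs=[
        {
            "DatabaseName": "sp_catalog_db",
            "TableName": "transformed_cbs_data",
            "S3Options": {
                "Location": {
                    "Bucket": "<your-S3-bucket-name>",
                    "Key": "transformed-data/cbs_data/",
                }
            },
            "Overwrite": True,  # replace output files for each job run
        }
    ],
)

# Start the job and report the state of the new run.
run_id = databrew.start_job_run(Name="1-transform-cbs-data-job")["RunId"]
run = databrew.describe_job_run(Name="1-transform-cbs-data-job", RunId=run_id)
print("Job run state:", run["State"])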

Solar panels data transformation stage

We now create the second phase of the data pipeline. We create a project and a job using the same procedure described in the previous section.

  1. Create a DataBrew project with the following parameters:
    1. Project name – 2-transform-sp-data
    2. Imported recipe – recipe-2-transform-sp-data
    3. Dataset – sp-dataset
    4. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  2. Create and run another DataBrew job with the following parameters:
    1. Job name – 2-transform-sp-data-job
    2. Output to – Data Catalog S3 tables
    3. File type – Parquet
    4. Database name – sp_catalog_db
    5. Create new table with table name – transformed_sp_data
    6. S3 location – s3://<your-S3-bucket-name>/transformed-data/sp_data/
    7. Settings – Replace output files for each job run
    8. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  3. After the job is complete, create the DataBrew datasets with the following parameters:
Dataset name – Data Catalog table
transformed-cbs-dataset – awsgluedatabrew_transformed_cbs_data
transformed-sp-dataset – awsgluedatabrew_transformed_sp_data

You should now see five items as part of your DataBrew dataset.

Data curation and aggregation stage

We now create the final DataBrew project and job.

  1. Create a DataBrew project with the following parameters:
    1. Project name – 3-curate-sp-cbs-data
    2. Imported recipe – recipe-3-curate-sp-cbs-data
    3. Dataset – transformed-sp-dataset
    4. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  2. Create a DataBrew job with the following parameters:
    1. Job name – 3-curate-sp-cbs-data-job
    2. Output to – Data Catalog S3 tables
    3. File type – Parquet
    4. Database name – sp_catalog_db
    5. Create new table with table name – curated_data
    6. S3 location – s3://<your-S3-bucket-name>/curated-data/
    7. Settings – Replace output files for each job run
    8. Permissions role – AWSGlueDataBrewServiceRole-sp-project

The last project defines a single transformation step: the join between the transformed-cbs-dataset and the transformed-sp-dataset, based on the municipality code and the year.

The DataBrew job should take a few minutes to complete.

Next, check your sp_catalog_db database. You should now have raw, transformed, and curated tables in your database. DataBrew automatically adds the prefix awsgluedatabrew_ to both the transformed and curated tables in the catalog.
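
At this point you can sanity-check the curated output with a quick Athena query from Python. This is a hedged sketch: the query result location is a placeholder, and the exact table name in your catalog may or may not carry the awsgluedatabrew_ prefix mentioned above.

import time
import boto3

athena = boto3.client("athena")

# Placeholder output location for Athena query results; the table may
# appear as awsgluedatabrew_curated_data in your catalog.
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS row_count FROM curated_data",
    QueryExecutionContext={"Database": "sp_catalog_db"},
    ResultConfiguration={"OutputLocation": "s3://<your-S3-bucket-name>/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # The first row holds column headers; the second holds the count.
    print("Curated table row count:", rows[1]["Data"][0]["VarCharValue"])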

Consume curated datasets for descriptive analytics

We’re now ready to build the consumption layer for descriptive analytics with QuickSight. In this section, we build a business intelligence dashboard that reflects OurCompany’s share of solar panel energy capacity and installations in contrast to the values reported by the CBS from 2012–2018.

To complete this section, you need to have the default primary workgroup already set up on Athena in the same Region where you implemented the data pipeline. If it’s your first time setting up workgroups on Athena, follow the instructions in Setting up Workgroups.

Also make sure that QuickSight has the right permissions to access Athena and your S3 bucket. Then complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose Create a new dataset.
  3. Select Athena as the data source.
  4. For Data source name, enter sp_data_source.
  5. Choose Create data source.
  6. Choose AWSDataCatalog as the catalog and sp_catalog_db as the database.
  7. Select the table curated_data.
  8. Choose Select.
  9. In the Finish dataset creation section, choose Directly query your data and choose Visualize.
  10. Choose the clustered bar combo chart from the Visual types list.
  11. Expand the field wells section and then drag and drop the following fields into each section as shown in the following screenshot.
  12. Rename the visualization as you like, and optionally filter the report by sp_year using the Filter option.

From this graph, we can already benchmark OurCompany against the regional values reported by the CBS across two dimensions: the total number of installations and the total kW capacity generated by solar panels.

We went one step further and created two KPI visualizations to empower our descriptive analytics capabilities. The following is our final dashboard that we can use to enhance our decision-making process.

Clean up resources

To clean all the resources we created for the data pipeline, complete the following steps:

  1. Remove the QuickSight analyses you created.
  2. Delete the dataset curated_data.
  3. Delete all the DataBrew projects with their associated recipes.
  4. Delete all the DataBrew datasets.
  5. Delete all the AWS Glue crawlers you created.
  6. Delete the sp_catalog_db catalog database; this removes all the tables.
  7. Empty the contents of your S3 bucket and delete it.

Conclusion

In this post, we demonstrated how you can begin your data analytics journey. With DataBrew, you can prepare and combine the data you already have with publicly available datasets such as those from the Dutch CBS (Centraal Bureau voor de Statistiek) without needing to write a single line of code. Start using DataBrew today and enrich key datasets in AWS for enhanced descriptive analytics capabilities.


About the Authors

Daniel Rozo is a Solutions Architect with Amazon Web Services based out of Amsterdam, The Netherlands. He is devoted to working with customers and engineering simple data and analytics solutions on AWS. In his free time, he enjoys playing tennis and taking tours around the beautiful Dutch canals.

Maurits de Groot is an intern Solutions Architect at Amazon Web Services. He does research on startups with a focus on FinTech. Besides working, Maurits enjoys skiing and playing squash.


Terms of use: Gemeentelijke indeling op 1 januari 2021, Zonnestroom; vermogen bedrijven en woningen, regio (indeling 2018), 2012-2018, and copies of these datasets redistributed by AWS, are licensed under the Creative Commons 4.0 license (CC BY 4.0), sourced from Centraal Bureau voor de Statistiek (CBS). The datasets used in this solution are modified to rename columns from Dutch to English, remove null or missing values, aggregate rows based on specific attributes, and combine the datasets in the final transformation. Refer to the CC BY 4.0 use, adaptation, and attribution requirements for additional information.

The Case for Backup Over Sync

Post Syndicated from Lora Maslenitsyna original https://www.backblaze.com/blog/the-case-for-backup-over-sync/

We hear it all the time: “I don’t need to back up my data, it’s already synced.” But backing up your data and syncing your data are two different animals—only a backup service actually protects all of your data while also making it accessible to you even when you’re away from your computer.

Are you using a sync service like Dropbox or OneDrive without a backup solution? If so, we’ll make the case for why you should use backup over sync, including the Backblaze features you won’t find from a sync service.

Read on for a refresher on the difference between backup and sync, and find out why choosing Backblaze over a sync service could be more beneficial to you.

Review: What’s the Difference Between Backup and Sync?

With the myriad of cloud services available, many people don’t understand the difference between sync and backup. You can read more about the difference between the two services here, but here’s a brief refresher:

  • Sync: These services allow you to access your files across different devices. You can also use sync services to share files with other users, where they can make changes from their computer that will be visible to you from your device.
  • Backup: These services usually work automatically in the background of your computer, backing up a copy of your new or changed data to another location (e.g., the cloud). Most backup services catalog and save the most recent version of your data, and some now offer features like Extended Version History, which you can use to recover files from even farther back in time than the standard 30 days.
Backup Pro Tip: Managing Your Devices

How many devices do you use to store and access your data on a given day? Between phones, tablets, laptops, and external hard drives, it can be a lot. We’ve created a few guides to help you make sure the data on your phone, computer, and hard drive is backed up or secured for whenever you plan to upgrade.

The Disadvantages of Sync Services

While sync tools are great for collaboration and 24/7 access to your data, they are not a viable backup solution and relying on them to protect your data can lead to trouble. If you or someone you shared a file with deletes that file, you are at risk of losing it forever unless the sync service you’re using has a version history feature. Sync services do not create a copy of your files for backup, and require additional setup to make sure you have some data protection enabled.

Data in sync services is also vulnerable to corruption by bad actors or malware, because a sync service does not keep a backup of your uncorrupted files. If your computer is hit with a ransomware attack and automatically synchronizes your data afterwards, all of your synced files will be corrupted.

Lastly, many people choose not to pay for a sync service, instead opting to use the free tier. For the most part, the free tiers of sync services have a cap on the amount of data you’re able to sync, meaning there will still be a portion of your data on your computer left unsynced, neither accessible by the service nor protected in any capacity. Paying for more data in a sync service can become costly over time, and still does not offer protection against data loss.

The Backblaze Features You Won’t Get With a Sync Service

Sync and backup shouldn’t be thought of as opposing services—they’re better together. However, if your budget only allows for one, backup is the way to go.

Now, hear us out—as a backup provider, we may seem biased, but in reality, the benefits of using a backup service speak for themselves. Here are the Backblaze features you won’t get with a sync service:

  • Automatic, comprehensive data protection: Backblaze protects all of the data on your computer, not just the files in your synced folders. Think about all the things you save on your computer, but not in OneDrive or Dropbox. I imagine that might include important confidential documents like taxes, financial information, or legal documents, or just random stuff that doesn’t get saved to your sync service. With Backblaze, in case of potential data loss, you can find a copy of each of your files saved in the cloud. Also considering how much of our data is scattered across devices and platforms, having a backup of all of your data is valuable to keep it safe in case you can’t access a profile or device for any reason. (Check out our Backup Pro Tip below to learn more about how to back up your digital life.)
  • Fast and easy data restores: In the case that you lose your computer or it crashes and you need to restore all or some of your files, backup services like Backblaze allow you to download the important files you need via your internet connection, or opt to have all of your files sent to you on a USB hard drive. Meanwhile, downloading your data from a sync service depends on your internet bandwidth and can take days if not weeks. Also, with the Backblaze mobile apps for iOS and Android, all of your backed up data is with you, no matter where you are.
  • Extended Version History: Most backup providers offer version history for all of the data you are backing up. With this feature, you can restore your entire backup history, or just one file, from a specific point in time. Backblaze offers Extended Version History, so you can choose if you’d like to keep all versions of your data protected longer than the standard 30 days for a small additional fee. You can choose to keep versions for up to one year or forever. Not only does this feature provide better security for your data and the ability to restore files in the event of a potential cybersecurity breach, but it also gives you the ability to see changes to your edited files over time, much like with a sync service. Want to invite someone to collaborate on a file you’ve edited? You can even share files with other people by enabling Backblaze B2 Cloud Storage. Learn more about how to share files here, and more about B2 Cloud Storage here. (Note: Some sync services are catching on and starting to offer Extended Version History for customers on business or professional tiers or as add-ons. But, keep in mind, unless you’re on the highest tiers, chances are your storage is capped and you’re paying for extended versions of only some of your files.)
  • Ransomware protection: Another benefit of backup versus sync is protection against cybersecurity threats. In the case of a ransomware attack on your device, you will be able to completely restore your system from a backup that was created before the malware affected your files.
  • The ability to access your data from anywhere: While sync services are promoted as a way to access your files away from your computer, backup providers also allow you to download individual files or entire data backups from another device. Not only does this come in handy when transferring your data or restoring your old settings on an entirely new device, but also in the event that you need to access a file not covered by a sync service. Your data backup will have a copy of every single one of your files that you can access from another computer.
  • Location services: Additionally, some backup providers (Like us!) offer additional features or functionality—for example, location services like Backblaze’s Locate My Computer tool allow you to find a lost or stolen device. If you’re unable to get the device back, or you just need to access a file or folder when you’re away from your device, you can download or view data from a web browser or from the Backblaze mobile app.
Backup Pro Tip: Backing Up Your Digital Life

These days, our data is scattered across many different platforms—including social media, sync services, and more. We’ve gathered a handful of guides to help you protect your content. Read these guides to learn how to download your data and create a backup of it.

Don’t Sync—Back Up Instead

If you’re going to choose one service over the other, a backup service gives you the best of both worlds—you can make sure all of your data stays safe, you can access it from anywhere, and you can restore previous versions of your data whenever you need them. Backblaze Computer Backup lets you do all of that, for Macs or PCs—learn more about it and download a free 15-day trial.

Do you have a preference for syncing your data vs. backing it up? We’d love to hear what you think in the comments.

The post The Case for Backup Over Sync appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloudflare Radar’s 2021 Year In Review

Post Syndicated from João Tomé original https://blog.cloudflare.com/cloudflare-radar-2021-year-in-review/

In 2021, we continued to live with the effects of the COVID pandemic, and Internet traffic was impacted by it as well. Although learning and exercising may have started to get back to something close to normal (depending on the country), the effects of what started almost two years ago on the way people work and communicate seem to be here to stay, and lockdowns or restrictions continue to have an impact on where and how people go online.

So, Cloudflare Radar’s 2021 Year In Review is out with interactive maps and charts you can use to explore what changed on the Internet throughout this past year. Year In Review is part of Cloudflare Radar. We launched Radar in September 2020 to give anyone access to Internet use and abuse trends.

This year we’ve added a mobile vs. desktop traffic chart, as well as an attack-distribution view that shows how attacks evolved throughout the year — the beginning of July 2021, more than a month after the famous Colonial Pipeline cyberattack, was when attacks worldwide peaked.

There are also interesting pandemic-related trends, like the (lack of) Internet activity in Tokyo with the Summer Olympics in town and how Thanksgiving week in late November affected mobile traffic in the United States.

You can also check our Popular Domains — 2021 Year in Review where TikTok, e-commerce and space companies had a big year.

Internet: growing steadily (with lockdown bumps)

By late April 2020, we saw that the Internet had experienced incredible, sudden growth in traffic because of lockdowns, and that growth was sustained throughout the year, as we showed in our 2020 Year In Review. 2021 told a slightly different story, depending on the country.

The big March-to-May Internet traffic peak from 2020 related to the pandemic wasn’t there in the same way this year — growth was more distributed, depending on local restrictions. In 2021, Internet traffic globally continued to grow throughout the year, and it was highest at the end of the year (a normal trend, given the growth in categories like online shopping and the fact that the colder season in the Northern Hemisphere, where most Internet traffic originates, affects human behaviour).

The day of the year with the highest growth in traffic worldwide, from our standpoint, was December 2, 2021, with 20% more than the first week of the year — the Y-axis shows the percentage change in Internet traffic using a cohort of top domains from each country. But in May there was also a bump (highlighted in red as a possible pandemic-related occurrence), although not as high as we saw in the March-May period of last year.

Spikes in Internet traffic — Worldwide 2021

#1 November-December¹ (+23%)
#2 September (+20%)
#3 October (+19%)
#4 August (+16%)
#5 May (+13%)
¹ Beginning of December

When we focus on specific countries using our Year In Review 2021 page you can see that new restrictions or lockdowns affected (again) Internet traffic and, in some countries, that is more evident than others.

In the following table, we show the months with the highest traffic growth (the percentages shown focus on the spikes). From our standpoint, the last four months of the year usually have the highest growth in traffic, but Canada, the UK, Germany, France, Portugal, South Korea and Brazil seemed to show (in red) an impact of restrictions on their Internet traffic, with higher increases in the first five months of the year.

Months with the largest traffic growth — 2021

United States 

#1 November-Dec (+30%)
#2 October (+26%)
#3 September (+25%)
#4 August (+15%)
#5 May (+13%)

Canada

#1 November-Dec (+21%)
#2 October (+10%)
#3 April (+9%)
#4 May (+8%)
#5 March (+7%)

UK

#1 November-Dec (+23%)
#2 March (+13%)
#3 October (+12%)
#4 February (+7%)
#5 September (+5%)

Germany

#1 November-Dec (+25%)
#2 October (+15%)
#3 May (+7%)
#4 February (+6%)
#5 September (+5%)

France

#1 November-Dec (+24%)
#2 May (+14%)
#3 April (+13%)
#4 January (+8%)
#5 February (+7%)

Japan

#1 November-Dec (+32%)
#2 October (+28%)
#3 September (+28%)
#4 August (+24%)
#5 July (+18%)

Australia

#1 November-Dec (+42%)
#2 September (+38%)
#3 October (+37%)
#4 August (+32%)
#5 July (+27%)

Singapore

#1 November-Dec (+62%)
#2 October (+58%)
#3 September (+58%)
#4 August (+41%)
#5 July (+31%)

Portugal

#1 February (+38%)
#2 March (+23%)
#3 January (+22%)
#4 November-Dec (+18%)
#5 April (+17%)

South Korea

#1 April (+21%)
#2 May (+16%)
#3 February (+10%)
#4 August (+7%)
#5 September (+7%)

Brazil

#1 May (+25%)
#2 June (+23%)
#3 November-Dec (+22%)
#4 April (+21%)
#5 July (+21%)

India

#1 November-Dec (+24%)
#2 September (+22%)
#3 October (+21%)
#4 August (+19%)
#5 July (+10%)

When we look at those countries’ trends we can see that Canada had lockdowns at the beginning of February that went through March and May, depending on the area of the country. That is in line with what we’ve seen in 2020: when restrictions/lockdowns are up, people tend to use the Internet more to communicate, work, exercise and learn.

Most of Europe also started 2021 with lockdowns and restrictions that included schools — so online learning was back on. That’s clear in the UK, where January through March showed a high increase in traffic that went down when restrictions were relaxed.

The lines here show Internet traffic growth from our standpoint throughout 2020 and 2021 in the UK

The same happens in Portugal, where new measures on January 21, 2021, put the first three months of the year in the top three in terms of traffic growth, with April at #5.

We can also check the example of France. Lockdowns were imposed again especially during April and May 2021, and we can see the growth in Internet traffic during those months, slightly more timid than the first lockdown of 2020, but nonetheless evident in the 2021 chart.

Germany had the same situation in May (in April work from home was again the rule and the relaxation of measures for vaccinated people only began in mid-May), but in February the lockdown that started at the end of 2020 (and included schools) was also having an impact on Internet traffic.

In South Korea there was also an impact of the beginning of the year lockdown seen in spikes through February, April and May 2021.

Internet traffic growth in the United States had a very different year in 2021 than it had the year before, when the first lockdown had a major effect on Internet growth, but still, May was a month of high growth — it was in mid-May that there were new guidelines from the CDC about masks.

Mobile traffic: The Thanksgiving effect

Another worldwide trend from 2021 is the evolution of the mobile traffic percentage. From our standpoint, the most mobile-heavy months of the year — when mobile devices were used most to go online — were July and August (typical vacation months in most of the Northern Hemisphere), but January and November were also very strong.

On our Year in Review page, you can also see the new mobile vs desktop traffic chart. The evolution of the importance of mobile traffic is different depending on the country.

For example, the United States has more desktop traffic throughout the year, but in 2021, during the Thanksgiving (November 25) week, mobile traffic took the lead for the first and only time in the whole year. We can also see that mobile traffic’s share was high in July.

The UK has a similar trend, with June, July and August being the only months of the year when mobile traffic is prevalent compared to desktop.

If we go to the other side of the planet, to Singapore, the mobile percentage is usually higher than desktop, and we see a completely different trend from the US. Mobile traffic was higher in May, and desktop only went above mobile on some days in February, some in March, and especially after the end of October.

Where people accessed the Internet

Our Year in Review again gives you the possibility of selecting a city from the map and zooming in to see the change in Internet use throughout the year. Let’s zoom in on San Francisco.

The following set of maps (all available on our Year in Review site) highlights the change in Internet use, comparing the start of 2020, mid-January to mid-March — where you can still see some increase in traffic, in orange — with the total lockdown situation of April and May, which shows more blue areas (a decrease in traffic).

The red circles show San Francisco and its surroundings (home to a lot of companies) on a map that compares working-hours Internet use on a weekday between two months.

The same trend is already seen in May 2021, at a time when remote work continued to be strong — especially in tech companies (employees moved away from the Bay Area). Only in June of this year was there some increase in traffic (more orange areas), especially further away from San Francisco (in residential areas).

London: From lockdown to a Euro Championship final

London tells us a different story. Looking through the evolution since the start of 2020 we can see that in March (compared to January) we have an increase in traffic (in orange) outside London (where blue is dominant).

Internet activity only starts to get heavier in June, in time for the kick-off of the 2020 UEFA European Championship. The tournament, played in several cities across Europe, had a lot of restrictions, and a number of games were played in London at Wembley Stadium — where Italy won the final by beating England on penalties. But by the time of the final in July, and especially in August, blue was already dominant again — so people seemed to leave the London area. Only in September and October did traffic start to pick up again, but mostly outside the city centre.

The Summer Olympics impact? Tokyo with low activity

After the UEFA European Championship came the other big event postponed back in 2020, the Tokyo Summer Olympics. Our map seems to show the troubled months before the event, with pandemic numbers and restrictions rising ahead of its dates — late July and the first days of August.

There were athletes, but not fans from around the world, and even locals weren’t attending — it was largely an event held behind closed doors, with no public spectators permitted due to the declaration of a state of emergency in the Greater Tokyo Area. We can see that in our charts, especially when looking at the increase in activity in March (compared to January) and the decrease in August (compared to June), even with a global event in town (Tokyo is in the red circle).

There’s another interesting pandemic-related trend in Lisbon, Portugal. With the lockdowns in place since mid-January, the comparison with March shows the centre of the city losing Internet traffic and the residential areas outside Lisbon gaining it (in orange in the animation). But in April, activity decreased even around Lisbon and only started to get heavier in May, when restrictions were a lot more relaxed.

Lockdowns bring more traffic to Berlin

A different trend can be seen in Berlin, Germany. Internet activity in the city and its surroundings was very high in March and in April (compared to the previous two months) at a time when lockdowns were in place — nonetheless, in 2020 the activity decreased in April with the first major lockdown.

But in May and June, with the relaxation of restrictions, Internet activity decreased (blue), suggesting that people left the city or, at least, weren’t using the Internet as much. Internet activity only began to pick up again in August, but decreased once more in the colder months of November and December.

Cyberattacks: Threats that came in July

In terms of worldwide attacks, July and November (the month of Black Friday, when growth reached 78%) were the months with the highest peaks of the year. The biggest peak was at the beginning of July 2021, when it reached 82%. That was more than a month after the Colonial Pipeline ransomware cyberattack — May was also the month of attacks on part of Toshiba and, in the same week, on the Irish health system and the meat-processing company JBS.

The week of December 6 (the same week the Log4j vulnerability was disclosed) also had an increase in attacks, 42% more, and there was a clear increase (42%) at the beginning of October, around the time of the Facebook outage.

In our dedicated page you can check — for the first time this year — the attack distribution in a selection of countries.

The UK had a very noticeable peak in overall Internet attacks (150% growth) in August, and that continued through September. We already saw that the beginning of the year, with its lockdowns, had an increase in Internet traffic, and we can also see an increase in attacks in January 2021, as well as in late November — around the time of the Black Friday week.

The United States, on the other hand, saw growth in threats that was more uniform throughout the year. The biggest spike was between August and September (a time when students, depending on the state, were going back to school), with 65% growth. July also had a big spike in threats (58%), as did late May (48%) — that was the month of the Colonial Pipeline ransomware cyberattack. Late November also had a spike (29%).

Countries like France had their peak in attacks (420% growth) in late September, while Germany’s was in June (425%), with further peaks in October (380%) and November (350%).

The same trend can be seen in Singapore, but with even higher growth. Threats reached 1,000% growth in late November, and 900% earlier in the month, around the time of the famous Singles’ Day (11.11, on November 11), the main e-commerce event in the region.

Cloudflare Radar's 2021 Year In Review

In the same region, Australia also saw a big increase in attacks (more than 100%) at the beginning of September, while in Japan the rise came in late May (over 40% growth in threats).

What people did online in 2021

Last year we saw how the e-commerce category jumped in several countries after the first major lockdown — late March.

In New York, Black Friday, November 26, 2021, was the day of the whole year when e-commerce traffic peaked: it represented 31.9% of traffic, followed by Cyber Monday, November 29, with 26.6% (San Francisco shows the same trend). It's also interesting to see that in 2020 the same category peaked on Black Friday, November 27, 2020 (24.3%), but April 22, during the first lockdowns, was a close second at 23.1% (this year the category only reached ~14% in April).

Also unsurprisingly, messaging traffic peaked (20.6%) in the city that never sleeps on the first day of the year, January 1, 2021, to celebrate the New Year.

Cloudflare Radar's 2021 Year In Review

London calling (pre-Valentine messages)

But countries, cities and the people who live there have different patterns, and in London messaging traffic actually peaked at 21.5% of traffic on Friday, February 12, 2021 (two days before Valentine's Day). While we're in London, let's check whether Black Friday was also big outside the US. The answer is: yes! E-commerce traffic peaked at 20.7% of traffic precisely on Black Friday, November 26.

The pandemic also influenced the types of websites people use. In London, travel websites had their biggest share of traffic on August 8, at only 1.4%; in Munich it was 1.1% on August 11. In New York and San Francisco, on the other hand, travel websites always had less than 1% of traffic.

Going back to Europe, Paris, France, saw a different trend. Travel websites had 1.9% of traffic on June 7, 2021, precisely the week that pandemic restrictions were lifted (France opened to international travelers on June 9, 2021). The "City of Light" (and love) had its biggest day of the year for messaging websites (24.4%) on Sunday, January 31, a time when new restrictions were announced to try to avoid a total lockdown.

The hacker attack: 2021 methods

Our Year in Review site also lets you dig into which attack methods gained the most traction in 2021. It is a given that hackers continued to run their tools to attack websites, overwhelm APIs, and try to exfiltrate data — recently the Log4j vulnerability exposed the Internet to new possible exploitation.

Just to give some examples, in Paris "faking search engine bots" represented 48.3% of the attacks selected for the chart on January 14, 2021, while "SQL Injection" reached 59% on April 29.

Cloudflare Radar's 2021 Year In Review
Cyberattack distribution throughout the year in San Francisco

In London, "User-Agent Anomaly" was also relevant during parts of the year, while in San Francisco "information disclosure" was the most prevalent, especially in late November, when online shopping was booming; in December, "file inclusion" vulnerabilities accounted for a bigger percentage.

Now it’s your turn: explore more

To explore data for 2021 (but also 2020), you can check out Cloudflare Radar’s Year In Review page. To go deep into any specific country with up-to-date data about current trends, start at Cloudflare Radar’s homepage.

Security updates for Thursday

Post Syndicated from original https://lwn.net/Articles/879675/rss

Security updates have been issued by Debian (openjdk-11), Fedora (keepalived and tang), openSUSE (openssh, p11-kit, runc, and thunderbird), Oracle (postgresql:12, postgresql:13, and virt:ol and virt-devel:ol), Red Hat (rh-maven36-log4j12), and SUSE (ansible, chrony, logstash, elasticsearch, kafka, zookeeper, openstack-monasca-agent, openstack-monasca-persister-java, openstack-monasca-thresh, openssh, p11-kit, python-Babel, and thunderbird).

Optimize Cost by Automating the Start/Stop of Resources in Non-Production Environments

Post Syndicated from Ashutosh Pateriya original https://aws.amazon.com/blogs/architecture/optimize-cost-by-automating-the-start-stop-of-resources-in-non-production-environments/

Co-authored with Nirmal Tomar, Principal Consultant, Infosys Technologies Ltd.

Ease of creating on-demand resources on AWS can sometimes lead to over-provisioning or under-utilization of AWS resources like Amazon EC2 and Amazon RDS. This can lead to higher costs that can often be avoided with proper planning and monitoring. Non-critical environments, like development and test, are often not monitored on a regular basis, which can result in under-utilization of AWS resources.

In this blog, we discuss a common AWS cost optimization strategy: an automated, deployable solution to schedule the start/stop of AWS resources. For this example, we are considering non-production environments, because in most scenarios these do not need to be available at all times. By following this solution, cloud architects can automate the start/stop of services per their usage pattern and can save up to 70% of the cost of running their non-production environments.

The solution outlined here is designed to automatically stop and start the following AWS services based on your needs: Amazon RDS, Amazon Aurora, Amazon EC2, Auto Scaling groups, AWS Elastic Beanstalk, and Amazon EKS. The solution is automated using AWS Step Functions, AWS Lambda, AWS CloudFormation templates, Amazon EventBridge, and AWS Identity and Access Management (IAM).

In this solution, we also provide an option to exclude specific Amazon Resource Names (ARNs) of the aforementioned services. This helps cloud architects exclude resources from the start/stop function for various use cases, for example, when they don't want to stop Aurora in a QA environment or don't want to start RDS automatically in a development environment. The solution starts/stops the services mentioned previously on a scheduled interval, but it can also be extended to other applicable services like Amazon ECS, Amazon SageMaker notebook instances, Amazon Redshift, and many more.
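To make the moving parts concrete, here is a minimal sketch of the scheduling wiring that the CloudFormation templates create for you: an Amazon EventBridge rule that fires on a cron schedule and triggers the Step Functions state machine for the stop workflow. The rule name, state machine ARN, and role ARN below are hypothetical placeholders, not values from the templates.

import boto3

# Illustration only: the CloudFormation template provisions equivalent resources.
events = boto3.client("events")

# An EventBridge rule that fires at 8 PM UTC on weekdays.
events.put_rule(
    Name="stop-non-prod-resources",
    ScheduleExpression="cron(0 20 ? * MON-FRI *)",
    State="ENABLED",
)

# Point the rule at the Step Functions state machine that runs the stop workflow.
events.put_targets(
    Rule="stop-non-prod-resources",
    Targets=[{
        "Id": "StopWorkflowTarget",
        "Arn": "arn:aws:states:us-east-1:111111111111:stateMachine:StopNonProdResources",
        "RoleArn": "arn:aws:iam::111111111111:role/EventBridgeInvokeStepFunctions",
    }],
)

The start workflow is wired the same way, with its own rule and schedule.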

Note – Don’t set up this solution in a production or other environment where you require continuous service availability.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account with permission to spin up required resources.
  • A running Amazon Aurora instance in the source AWS account.

Walkthrough

To set up this solution, proceed with the following two steps:

  1. Set up the Step Functions workflow to stop services using a CloudFormation template. On a scheduled interval, this workflow will run and stop the chosen services.
  2. Set up the Step Functions workflow to start services using a CloudFormation template. On a scheduled interval, this workflow will run and start services as configured during the CloudFormation setup.

1. Stop services using the Step Functions workflow for a predefined duration


Figure 1 – Architecture showing the AWS Step Functions Workflow to stop services

The AWS Lambda functions involved in this workflow:

  • StopAuroraCluster: This Lambda function will stop all Aurora clusters across the Region, including read replicas.
  • StopRDSInstances: This Lambda function will stop all RDS instances (except the Aurora setup) across the Region.
  • ScaleDownEKSNodeGroups: This Lambda function will scale all nodegroups down to zero instances across the Region.
  • ScaleDownASG: This Lambda function will scale all Auto Scaling groups, including the Elastic Beanstalk Auto Scaling groups, down to zero instances across the Region. You can edit the CloudFormation template to use a custom value.
  • StopEC2Instances: This Lambda function will stop all EC2 instances across the Region (a minimal sketch of this pattern follows the list).
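As a rough illustration of how such a function can honor an exclusion list, here is a minimal Python sketch of a StopEC2Instances-style handler. This is not the template's actual code; the EXCLUDE_INSTANCE_IDS environment variable and the return shape are assumptions made for the example.

import os
import boto3

def lambda_handler(event, context):
    # Hypothetical: the exclusion parameter is passed in as a comma-separated env var.
    exclude = {i.strip() for i in os.environ.get("EXCLUDE_INSTANCE_IDS", "").split(",") if i.strip()}
    ec2 = boto3.client("ec2")

    # Find every instance in the Region that is currently running.
    paginator = ec2.get_paginator("describe_instances")
    to_stop = []
    for page in paginator.paginate(Filters=[{"Name": "instance-state-name", "Values": ["running"]}]):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["InstanceId"] not in exclude:
                    to_stop.append(instance["InstanceId"])

    # Stop everything that is not on the exclusion list.
    if to_stop:
        ec2.stop_instances(InstanceIds=to_stop)
    return {"stopped": to_stop}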

Using the following AWS CloudFormation Template, we set up the required services and workflow:

a. Launch the template in the source account and source Region:

Launch Stack

Specify stack details screenshot

b. Fill out the preceding form with the following details and select Next.

Stack name: The name of the stack you want to create.

ExcludeAuroraClusterArnListInCommaSeprated: Comma-separated list of Aurora cluster ARNs that you don't want to stop; keep the default value if there is no exclusion list.

e.g.  arn:aws:rds:us-east-1:111111111111:cluster:aurorcluster1, arn:aws:rds:us-east-2:111111111111:cluster:auroracluster2

ExcludeRDSDBInstancesArnListInCommaSeprated: Comma-separated list of RDS DB instance ARNs that you don't want to stop; keep the default value if there is no exclusion list.

e.g.   arn:aws:rds:us-east-1:111111111111:db:rds-instance-1, arn:aws:rds:us-east-2:111111111111:db:rds-instance-2

ExcludeEKSClusterNodeGroupsArnListInCommaSeprated: Comma-separated list of EKS cluster ARNs that you don't want to stop; keep the default value if there is no exclusion list.

e.g.   arn:aws:eks:us-east-2:111111111111:cluster/testcluster

ExcludeAutoScalingGroupIncludingBeanstalkListInCommaSeprated: Comma-separated list of Elastic Beanstalk and other Auto Scaling group ARNs (except Amazon EKS) that you don't want to stop; keep the default value if there is no exclusion list.

e.g.  arn:aws:autoscaling:us-east-1:111111111111:autoScalingGroup:6d5af669-eb3b-4530-894b-e314a667f2e7:autoScalingGroupName/test-0-ASG

ExcludeEC2InstancesIdListInCommaSeprated: Comma-separated list of EC2 instance IDs that you don't want to stop; keep the default value if there is no exclusion list.

e.g.  i-02185df0872f0f852, 0775f7e39513c50dd

ScheduleExpression: A cron expression that schedules when you want this workflow to run. Sample expressions are available in this guide, Schedule expressions using rate or cron.
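For example, a stop workflow might run at 8 PM UTC on weekdays with a schedule expression like the one below (hypothetical; the companion start workflow in step 2 might use cron(0 8 ? * MON-FRI *) to bring resources back in the morning):

cron(0 20 ? * MON-FRI *)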

c. Select an IAM role to launch this template. As a best practice, select the AWS CloudFormation service role to manage the AWS services and resources available to each user.

Permissions screenshot

d. Acknowledge that you want to create various resources, including IAM roles and policies, and select Create Stack.


2. Start services using the Step Functions workflow at a pre-configured time


Figure 2 – Architecture showing the AWS Step Functions Workflow to start services

The Lambda functions involved in this workflow:

  • StartAuroraCluster: This Lambda function will start all Aurora clusters across the Region, including read replicas.
  • StartRDSInstances: This Lambda function will start all RDS instances (except for the Aurora setup) across the Region.
  • ScaleUpEKSNodeGroups: This Lambda function will scale all nodegroups up to a minimum of 2 and a maximum of 4 instances across the Region (see the sketch after this list). You can edit the CloudFormation template to use custom values.
  • ScaleUpASG: This Lambda function will scale all Auto Scaling groups, including the Elastic Beanstalk Auto Scaling groups, up to a minimum of 2 and a maximum of 4 instances across the Region. You can edit the CloudFormation template to use custom values.
  • StartEC2Instances: This Lambda function will start all EC2 instances across the Region.
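The sketch below shows, in Python, roughly what a ScaleUpEKSNodeGroups-style handler could look like. It is an illustration rather than the template's actual code: pagination and exclusion-list handling are omitted, and the 2/4 sizing simply mirrors the defaults described above.

import boto3

def lambda_handler(event, context):
    eks = boto3.client("eks")
    for cluster in eks.list_clusters()["clusters"]:
        for nodegroup in eks.list_nodegroups(clusterName=cluster)["nodegroups"]:
            # Restore each managed nodegroup to min 2 / max 4 / desired 2 instances.
            eks.update_nodegroup_config(
                clusterName=cluster,
                nodegroupName=nodegroup,
                scalingConfig={"minSize": 2, "maxSize": 4, "desiredSize": 2},
            )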

Using the following AWS CloudFormation template, we set up the required services and workflow:

a. Launch the template in the source account and source Region:

Launch Stack

b. Fill out the preceding form with the following details and select Next.

Stack details screenshot

Stack name: The name of the stack you want to create.

ExcludeAuroraClusterArnListInCommaSeprated: Comma-separated list of Aurora cluster ARNs that you don't want to start; keep the default value if there is no exclusion list.

For example:  arn:aws:rds:us-east-1:111111111111:cluster:aurorcluster1, arn:aws:rds:us-east-2:111111111111:cluster:auroracluster2

ExcludeRDSDBInstancesArnListInCommaSeprated: Comma-separated list of RDS DB instance ARNs that you don't want to start; keep the default value if there is no exclusion list.

For example:   arn:aws:rds:us-east-1:111111111111:db:rds-instance-1, arn:aws:rds:us-east-2:111111111111:db:rds-instance-2

ExcludeEKSClusterNodeGroupsArnListInCommaSeprated: Comma-separated list of EKS cluster ARNs that you don't want to start; keep the default value if there is no exclusion list.

For example:   arn:aws:eks:us-east-2:111111111111:cluster/testcluster

ExcludeAutoScalingGroupIncludingBeanstalkListInCommaSeprated: Comma-separated list of Elastic Beanstalk and other Auto Scaling group ARNs (except Amazon EKS) that you don't want to start; keep the default value if there is no exclusion list.

For example:  arn:aws:autoscaling:us-east-1:111111111111:autoScalingGroup:6d5af669-eb3b-4530-894b-e314a667f2e7:autoScalingGroupName/test-0-ASG

ExcludeEC2InstancesIdListInCommaSeprated: Comma-separated list of EC2 instance IDs that you don't want to start; keep the default value if there is no exclusion list.

For example:  i-02185df0872f0f852, 0775f7e39513c50dd

ScheduleExpression: A cron expression that schedules when you want this workflow to run. Sample expressions are available in this guide, Schedule expressions using rate or cron.

c. Select an IAM role to launch this template. As a best practice, select the AWS CloudFormation service role to manage the AWS services and resources available to each user.

Permissions screenshot

d. Acknowledge that you want to create various resources including IAM roles and policies and select Create Stack.

Acknowledgement screenshot

Cleaning up

Delete any unused resources to avoid incurring future charges.

Conclusion

In this blog post, we outlined a solution to help you optimize cost by automating the stop/start of AWS services in non-production environments. Cost Optimization and Cloud Financial Management are ongoing initiatives. We hope you found this solution helpful and encourage you to explore additional ways to optimize cost on the AWS Architecture Center.


Nirmal Tomar

Nirmal is a principal consultant with Infosys, assisting vertical industries on application migration and modernization to the public cloud. He is part of the Infosys Cloud CoE team and leads various initiatives with AWS. Nirmal has a keen interest in industry and cloud trends, and in working with hyperscalers to build differentiated offerings. Nirmal also specializes in cloud native development, managed containerization solutions, data analytics, data lakes, IoT, AI, and machine learning solutions.

Test for Log4Shell With InsightAppSec Using New Functionality

Post Syndicated from Bria Grangard original https://blog.rapid7.com/2021/12/22/test-for-log4shell-with-insightappsec-using-new-functionality/


We can all agree at this point that the Log4Shell vulnerability (CVE-2021-44228) can rightfully be categorized as a celebrity vulnerability. Security teams have been working around the clock investigating whether they have instances of Log4j in their environment. You are likely very familiar with everything regarding Log4Shell, but if you are looking for more information, you can check out our Everyperson’s Guide to Log4Shell (CVE-2021-44228). In this blog, we will share how Rapid7 customers can test for Log4Shell with InsightAppSec.

Testing for Log4Shell with InsightAppSec

With InsightAppSec, our dynamic application security testing (DAST) solution, customers can assess the risk of their applications. InsightAppSec allows you to configure various attacks against your applications to identify response behaviors that make your applications more vulnerable. These attacks are run during scans that you can customize based on your needs. In this case, we’ve introduced a new default attack template for Out of Band Injection specific to Log4Shell attacks.

What does this mean? Customers can now run an Out of Band Injection attack from our default template, which includes an attack type for Log4Shell. The new default Out of Band attack template in InsightAppSec will perform sophisticated web application attacks that do not rely on traditional HTTP request-response interactions. Our Log4Shell vulnerability detection will simulate an attacker on your website. InsightAppSec will validate the exploitability of the application and the associated risk.

How to run a Log4Shell attack in InsightAppSec

You can scan for this new Out of Band attack either by using the new attack template we have created or by creating your own custom attack template and selecting this new attack module. We have added some highlights below, but you can find a detailed guide via our help docs.

Attack templates

Out of Band Injection attack template

Test for Log4Shell With InsightAppSec Using New Functionality

Out of band Log4Shell attack module

Test for Log4Shell With InsightAppSec Using New Functionality

Run a scan

Scan Config

Depending on whether you chose the new Out of Band Injection attack template or created your own custom attack module, you now need to select that template in your scan config and run a scan against your selected app(s).

Test for Log4Shell With InsightAppSec Using New Functionality

Scan results

Once your scan has run, you can review the scan results to see if your app(s) have any findings that could be exposed per the details in CVE-2021-44228.

Test for Log4Shell With InsightAppSec Using New Functionality

What’s next?

Though official mitigation steps are changing as new information arises, we recommend that Java 8+ applications upgrade their Log4j libraries to the latest version (currently 2.17.0) to fix any new issues as they are discovered. Applications using Java 7 should upgrade Log4j to at least version 2.12.2. If you’re looking to validate that fixes have been implemented, feel free to run a validation scan with InsightAppSec to verify the fixes have been made.

If you’re looking for additional information on how Rapid7 can help support you during this time, check out our Log4j resource hub.

Increasing McGraw-Hill’s Application Throughput with Amazon SQS

Post Syndicated from Vikas Panghal original https://aws.amazon.com/blogs/architecture/increasing-mcgraw-hills-application-throughput-with-amazon-sqs/

This post was co-authored by Vikas Panghal, Principal Product Mgr – Tech, AWS and Nick Afshartous, Principal Data Engineer at McGraw-Hill

McGraw-Hill’s Open Learning Solutions (OL) allow instructors to create online courses using content from various sources, including digital textbooks, instructor material, open educational resources (OER), national media, YouTube videos, and interactive simulations. The integrated assessment component provides instructors and school administrators with insights into student understanding and performance.

McGraw-Hill measures OL’s performance by observing throughput, which is the amount of work done by an application in a given period. McGraw-Hill worked with AWS to ensure OL continues to run smoothly and to allow it to scale with the organization’s growth. This blog post shows how we reviewed and refined their original architecture by incorporating Amazon Simple Queue Service (Amazon SQS) to achieve better throughput and stability.

Reviewing the original Open Learning Solutions architecture

Figure 1 shows the OL original architecture, which works as follows:

  1. The application makes a REST call to DMAPI. DMAPI is an API layer over the Datamart. The call results in a row being inserted in a job requests table in Postgres.
  2. A monitoring process called Watchdog periodically checks the database for pending requests.
  3. Watchdog spins up an Apache Spark on Databricks (Spark) cluster and passes up to 10 requests.
  4. The report is processed and output to Amazon Simple Storage Service (Amazon S3).
  5. Report status is set to completed.
  6. User can view report.
  7. The Databricks clusters shut down.

Figure 1. Original OL architecture

To help isolate longer running reports, we separated requests that have up to five schools (P1) from those having more than five (P2) by allocating a different pool of clusters. Each of the two groups can have up to 70 clusters running concurrently.

Challenges with original architecture

There are several challenges inherent in this original architecture, and we concluded that this architecture will fail under heavy load.

It takes 5 minutes to spin up a Spark cluster. After processing up to 10 requests, each cluster shuts down. Pending requests are processed by new clusters. This results in many clusters continuously being cycled.

We also identified a database resource contention problem. In testing, we couldn’t process 142 reports out of 2,030 simulated reports within the allotted 4 hours. Furthermore, the architecture cannot be scaled out beyond 70 clusters for the P1 and P2 pools. This is because adding more clusters will increase the number of database connections. Other production workloads on Postgres would also be affected.

Refining the architecture with Amazon SQS

To address the challenges with the existing architecture, we rearchitected the pipeline using Amazon SQS. Figure 2 shows the revised architecture. In addition to inserting a row to the requests table, the API call now inserts the job request Id into one of the SQS queues. The corresponding SQS consumers are embedded in the Spark clusters.


Figure 2. New OL architecture with Amazon SQS

The revised flow is as follows (a minimal producer-side sketch appears after the list):

  1. An API request results in a job request Id being inserted into one of the queues and a row being inserted into the requests table.
  2. Watchdog monitors SQS queues.
  3. Pending requests prompt Watchdog to spin up a Spark cluster.
  4. SQS consumer consumes the messages.
  5. Report data is processed.
  6. Report files are output to Amazon S3.
  7. Job status is updated in the requests table.
  8. Report can be viewed in the application.
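To illustrate step 1, here is a minimal producer-side sketch in Python. The queue URL and message shape are assumptions made for the example; the real DMAPI also inserts the corresponding row into the Postgres requests table.

import json
import boto3

sqs = boto3.client("sqs")

def enqueue_report_request(job_request_id, priority_queue_url):
    # The SQS consumer embedded in the Spark cluster reads this Id and looks up
    # the full request details in the requests table.
    sqs.send_message(
        QueueUrl=priority_queue_url,
        MessageBody=json.dumps({"jobRequestId": job_request_id}),
    )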

After deploying the Amazon SQS architecture, we reran the previous load of 2,030 reports with a configuration ceiling of up to five Spark clusters. This time all reports were completed within the 4-hour time limit, including the 142 reports that timed out previously. Not only did we achieve better throughput and stability, but we did so by running far fewer clusters.

Reducing the number of clusters reduced the number of concurrent database connections that access Postgres. Unlike the original architecture, we also now have room to scale by adding more clusters and consumers. Another benefit of using Amazon SQS is a more loosely coupled architecture. The Watchdog process now only prompts Spark clusters to spin up, whereas previously it had to extract and pass job requests Ids to the Spark job.

Consumer code and multi-threading

The following code snippet shows how we consumed the messages via Amazon SQS and performed concurrent processing. Messages are consumed and submitted to a thread pool that utilizes Java’s ThreadPoolExecutor for concurrent processing. The full source is located on GitHub.

/**
 * Main Consumer run loop performs the following steps:
 *   1. Consume messages
 *   2. Convert message to Task objects
 *   3. Submit tasks to the ThreadPool
 *   4. Sleep based on the configured poll interval.
 */
def run(): Unit = {
  while (!this.shutdownFlag) {
    val receiveMessageResult = sqsClient.receiveMessage(
      new ReceiveMessageRequest(queueURL)
        .withMaxNumberOfMessages(threadPoolSize))
    val messages = receiveMessageResult.getMessages
    val tasks = getTasks(messages.asScala.toList)

    threadPool.submitTasks(tasks, sqsConfig.requestTimeoutMinutes)
    Thread.sleep(sqsConfig.pollIntervalSeconds * 1000)
  }

  threadPool.shutdown()
}

Kafka versus Amazon SQS

We also considered routing the report requests via Kafka, because Kafka is part of our analytics platform. However, Kafka is not a queue; it is a publish-subscribe streaming system with different operational semantics. Unlike queue consumers, Kafka consumers do not remove messages when they read them. Publish-subscribe semantics can be useful for data processing scenarios, for example, when data must be reprocessed or transformed in different ways by multiple independent consumers.

In contrast, for performing tasks, the intent is to process a message exactly once. There can be multiple consumers, and with queue semantics, the consumers work together to pull messages off the queue. Because report processing is a type of task execution, we decided that SQS queue semantics better fit the use case.
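To make the queue semantics concrete, here is a hedged Python sketch of the receive/process/delete cycle. The project's actual consumer is the Scala code shown earlier; the handler and queue URL here are placeholders for illustration.

import boto3

sqs = boto3.client("sqs")

def handle_report_request(body):
    ...  # placeholder for the report-processing logic

def process_one_batch(queue_url):
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    for message in resp.get("Messages", []):
        handle_report_request(message["Body"])
        # Deleting the message removes it from the queue so no other consumer
        # processes it again; if the handler fails before this point, the message
        # becomes visible again after the visibility timeout and can be retried.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])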

Conclusion and future work

In this blog post, we described how we reviewed and revised a report processing pipeline by incorporating Amazon SQS as a messaging layer. Embedding SQS consumers in the Spark clusters resulted in fewer clusters and more efficient cluster utilization. This, in turn, reduced the number of concurrent database connections accessing Postgres.

There are still some improvements that can be made. The DMAPI call currently inserts the report request into a queue and the database. In case of an error, it’s possible for the two to become out of sync. In the next iteration, we can have the consumer insert the request into the database. Hence, the DMAPI call would only insert the SQS message.

Also, the way the Java ThreadPoolExecutor API is used in the source code exhibits a "slow poke" problem. Because the call that submits the tasks is synchronous, it will not return until all tasks have completed, so any idle threads are not utilized until the slowest task has finished. There's an opportunity for improved throughput by using a thread pool that allows idle threads to pick up new tasks, as sketched below.
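As a sketch of that idea (a pattern illustration in Python, not the project's code), messages can be submitted to the pool one at a time, bounded by a semaphore, so a worker that finishes early immediately picks up the next message instead of waiting for the slowest task in a batch. The fetch_messages and handle callables are placeholders, e.g. an SQS receive call and the report-processing logic.

import threading
from concurrent.futures import ThreadPoolExecutor

def consume_forever(fetch_messages, handle, max_workers=8):
    slots = threading.Semaphore(max_workers)  # bounds the number of in-flight tasks
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            for message in fetch_messages():
                slots.acquire()  # wait only until one worker is free
                future = pool.submit(handle, message)
                # Release the slot as soon as this task finishes, regardless of
                # how long the other tasks in flight take.
                future.add_done_callback(lambda _: slots.release())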

Ready to get started? Explore the source code illustrating how to build a multi-threaded AWS SQS consumer.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

