On the Evolution of Ransomware

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2020/12/on-the-evolution-of-ransomware.html

Good article on the evolution of ransomware:

Though some researchers say that the scale and severity of ransomware attacks crossed a bright line in 2020, others describe this year as simply the next step in a gradual and, unfortunately, predictable devolution. After years spent honing their techniques, attackers are growing bolder. They’ve begun to incorporate other types of extortion like blackmail into their arsenals, by exfiltrating an organization’s data and then threatening to release it if the victim doesn’t pay an additional fee. Most significantly, ransomware attackers have transitioned from a model in which they hit lots of individuals and accumulated many small ransom payments to one where they carefully plan attacks against a smaller group of large targets from which they can demand massive ransoms. The antivirus firm Emsisoft found that the average requested fee has increased from about $5,000 in 2018 to about $200,000 this year.

Ransomware is a decades-old idea. Today, it’s increasingly profitable and professional.

Running queries securely from the same VPC where an Amazon Redshift cluster is running

Post Syndicated from Seetha Sarma original https://aws.amazon.com/blogs/big-data/running-queries-securely-from-the-same-vpc-where-an-amazon-redshift-cluster-is-running/

Customers who don’t need to set up a VPN or a private connection to AWS often use public endpoints to access AWS. Although this is acceptable for testing out the services, most production workloads need a secure connection to their VPC on AWS. If you’re running your production data warehouse on Amazon Redshift, you can run your queries in Amazon Redshift query editor or use Amazon WorkSpaces from Amazon Virtual Private Cloud (Amazon VPC) to connect to Amazon Redshift securely and analyze and graph a data summary in your favorite business intelligence (BI) or data visualization desktop tool.

With Amazon Redshift, you can query petabytes of structured and semi-structured data across your data warehouse, operational database, and your data lake using standard SQL. Amazon WorkSpaces is a managed, secure Desktop-as-a-Service (DaaS) solution deployed within an Amazon VPC. In this post, we show how you can run SQL queries on Amazon Redshift securely without VPN using Amazon Redshift query editor and Amazon WorkSpaces. First, we discuss how to run queries that return large datasets from the Amazon Redshift query editor using the UNLOAD command. Next, we discuss how to set up Amazon WorkSpaces and use it to securely run queries on Amazon Redshift. We cover the detailed steps for setting up Amazon WorkSpaces and show different scenarios to test Amazon Redshift queries with Amazon WorkSpaces.

The following diagram illustrates these architectures.

Using the Amazon Redshift query editor with UNLOAD command

Amazon Redshift has a query editor on its console that is typically used to run short queries and explore the data in the Amazon Redshift database. You may have a production scenario with queries that return large result sets. For instance, you may want to unload CSV data for use by a downstream process. In this case, you can run your query in the query editor and use the UNLOAD command to send the output directly to Amazon Simple Storage Service (Amazon S3) and get notified when the data is uploaded.

  1. Create separate S3 buckets for each user. You can use the default configuration for the bucket.
  2. On the Amazon Redshift console, choose Editor.
  3. In the Connect to database section, enter the database connection details.
  4. Choose Connect to database.
  5. Use the following format for the queries run from the query editor using the IAM role for Amazon Redshift, so the results get uploaded to Amazon S3:
    UNLOAD
    ('select id, name
    from <your_schema>.<your_table>')
    TO 's3://<username>/<yy-mm-dd-hh24-mi-ss>/'
    FORMAT AS CSV
    iam_role 'arn:aws:iam::<myaccount>:role/MyAmazon_RedshiftUnloadRole';

This query creates multiple files in the designated user’s S3 bucket under the date/time prefix. The user can preview or download the individual files on the Amazon S3 console.
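
If you prefer the command line to the Amazon S3 console, you can inspect and fetch the unloaded files with the AWS CLI. The following sketch assumes your credentials can read the user's bucket; the bucket and prefix placeholders match the UNLOAD example above.

    # List the CSV parts written by UNLOAD under the date/time prefix
    aws s3 ls s3://<username>/<yy-mm-dd-hh24-mi-ss>/

    # Download all parts to a local directory for inspection
    aws s3 cp s3://<username>/<yy-mm-dd-hh24-mi-ss>/ ./unload-results/ --recursive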

A large unload may take some time. You can configure Amazon Simple Notification Service (Amazon SNS) to send a notification when the results are uploaded to Amazon S3.

  6. On the Amazon SNS console, choose Topics.
  7. Choose Create topic.
  8. Create an SNS topic with a meaningful description text, like Your query results are uploaded to S3.

In the next steps, you edit the access policy of the SNS topic to give permission for Amazon S3 to publish to it.

  9. Change the Principal from "AWS": "*" to "Service": "s3.amazonaws.com".
  10. Scroll down to "Action" and delete everything except "SNS:Publish". Make sure to delete the extra comma.
  11. Scroll down to "Condition" and modify the text "StringEquals": { "AWS:SourceOwner": <Your account id>} to "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }. The resulting policy should resemble the sketch below.
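
Taken together, these three edits leave the topic with an access policy similar to the following sketch. The Region, account ID, topic name, and bucket name are placeholders to substitute with your own values.

    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "AllowS3ToPublish",
                "Effect": "Allow",
                "Principal": { "Service": "s3.amazonaws.com" },
                "Action": "SNS:Publish",
                "Resource": "arn:aws:sns:<region>:<your account id>:<topic-name>",
                "Condition": {
                    "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:<user-bucket-name>" }
                }
            }
        ]
    }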

  12. In the navigation pane, choose Subscriptions.
  13. Choose Create subscription.

  14. Subscribe to the SNS notification with the user’s email address as the endpoint.

  15. Make sure the user confirms the subscription from their email.
  16. On the Amazon S3 console, choose the Properties tab of the user’s S3 bucket.
  17. Under Event Notifications, choose Create event notification.

  18. Select All object create events.

  19. For Destination, select SNS topic.
  20. For Specify SNS topic, select Choose from your SNS topics.
  21. For SNS topic, choose the topic you created.

  22. Save your settings.
  23. To test the notification, on the Amazon Redshift console, open the query editor.
  24. Edit the UNLOAD query and change the date/time prefix in the S3 path to the current date and time.
  25. Run the query and check that the user gets the email notification.
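
If you script your environment instead of using the console, the same bucket-to-topic wiring can be set up with the AWS CLI. This is a sketch; the bucket name and topic ARN are placeholders, and the events filter mirrors the All object create events selection above.

    aws s3api put-bucket-notification-configuration \
      --bucket <user-bucket-name> \
      --notification-configuration '{
        "TopicConfigurations": [
          {
            "TopicArn": "arn:aws:sns:<region>:<your account id>:<topic-name>",
            "Events": ["s3:ObjectCreated:*"]
          }
        ]
      }'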

Using Amazon WorkSpaces to run Amazon Redshift queries

In this section, we cover setting up Amazon WorkSpaces, including Amazon VPC prerequisites, creating an Amazon VPC endpoint for Amazon S3, launching Amazon WorkSpaces in the same VPC where an Amazon Redshift cluster is running, setting up an Amazon WorkSpaces client, installing PSQL or a SQL client, and connecting to the client.

When setup is complete, we show different scenarios to test with Amazon WorkSpaces, such as testing a SQL command from the Amazon WorkSpaces client, testing the SCREEN program to run SQL in the background, and testing PSQL with Amazon S3 and getting a notification through Amazon SNS.

Prerequisites

By default, AWS Identity and Access Management (IAM) users and roles can’t perform tasks using the AWS Management Console, and they don’t have permission to create or modify Amazon VPC resources. Make sure you have administrator privileges, or have an administrator create IAM policies that grant sufficient permissions to edit the route table, edit the VPC security group, and enable a DNS hostname for the VPC.
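
As a rough sketch of what such a policy could grant (the action list is illustrative; scope the resources down for your environment), an administrator might attach something like the following:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VpcPrerequisiteEdits",
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeVpcs",
                    "ec2:DescribeSubnets",
                    "ec2:DescribeRouteTables",
                    "ec2:AssociateRouteTable",
                    "ec2:DescribeSecurityGroups",
                    "ec2:AuthorizeSecurityGroupIngress",
                    "ec2:ModifyVpcAttribute"
                ],
                "Resource": "*"
            }
        ]
    }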

When you have the correct permissions, complete the following prerequisite steps:

  1. On the Amazon Redshift console, check the cluster subnet group in the cluster configuration to make sure the Amazon Redshift cluster is created in an Amazon VPC with at least two subnets that are in separate Availability Zones.
  2. On the Amazon VPC console, edit the route table and make sure to associate these two subnets.
  3. Make sure the Amazon VPC security group has a self-referencing inbound rule for the security group for all traffic (not all TCP). The self-referencing rule restricts the source to the same security group in the VPC, so it’s not open to all networks. Consider limiting the rule to just the protocol and port Amazon Redshift needs to talk to Amazon WorkSpaces.
  4. Edit the DNS hostnames attribute of the Amazon VPC and enable it (a CLI equivalent is shown after this list).
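
For reference, the DNS hostnames attribute from the last step can also be enabled from the AWS CLI; the VPC ID below is a placeholder.

    aws ec2 modify-vpc-attribute --vpc-id <vpc-id> --enable-dns-hostnames '{"Value": true}'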

Creating an Amazon VPC endpoint for Amazon S3 for software downloads

In this step, you create your Amazon VPC endpoint for Amazon S3. This gives you Amazon S3 access to download PSQL from the Amazon repository. Alternatively, you could set up a NAT Gateway and download PSQL or other SQL clients from the internet.

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. Search for Service Name: S3.
  4. Select the S3 service gateway.
  5. Select the Amazon VPC where the Amazon Redshift cluster is running.
  6. Select the route table.
  7. Enter the following custom policy for the endpoint to access the Amazon Linux AMI repositories:
    {
        "Version": "2008-10-17",
        "Statement": [
            {
                "Sid": "Amazon Linux AMI Repository Access",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": [
                    "arn:aws:s3:::*.amazonaws.com",
                    "arn:aws:s3:::*.amazonaws.com/*"
                ]
            }
        ]
    }

  8. Create the endpoint.

Launching Amazon WorkSpaces in the VPC where the Amazon Redshift cluster runs

You’re now ready to launch Amazon WorkSpaces.

  1. On the Amazon WorkSpaces console, choose Launch WorkSpaces.

  2. For Directory types, select Simple AD.

Directory Service Solutions helps you store information and manage access to resources. For this post, we choose Simple AD.

  3. For Directory size, select Small.
  4. Enter your directory details.

  5. For VPC, choose the VPC where the Amazon Redshift cluster is running.
  6. For Subnets, choose the two subnets you created.

It may take a few minutes to provision the directory. You see the status show as Active when it’s ready.

  7. When the directory is provisioned, choose the directory and subnets you created.
  8. Choose Next Step.

  9. Create and identify your users.

  10. Use the default settings for the compute bundle.
  11. For Running Mode, select AlwaysOn.

Alternatively, select AutoStop and adjust the time in order to run long-running queries.

  12. Review and launch WorkSpaces.

It may take up to 20 minutes for the WorkSpace to become available.
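
If you'd rather poll for readiness from a terminal than refresh the console, a quick check with the AWS CLI might look like the following; the WorkSpace is ready when its state reports AVAILABLE.

    aws workspaces describe-workspaces \
      --query 'Workspaces[].{Id:WorkspaceId,User:UserName,State:State}' \
      --output table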

Setting up the Amazon WorkSpaces client

In this section, you configure your Amazon WorkSpaces client.

  1. Use the link from your email to download the Amazon WorkSpaces client.
  2. Register it using the registration code from the email.
  3. Log in with the username from the email and your newly created password.
  4. In the Amazon WorkSpaces client, open the terminal.

  5. Run the following command to capture the IP address:
hostname -I | awk '{print $2}'

The following screenshot shows your output.

  6. On the Amazon Redshift console, choose Clusters.
  7. Choose your cluster.
  8. Save the endpoint information to use later.
  9. Choose the Properties tab.

  10. In the Network and security section, note the VPC security group.
  11. On the Amazon VPC console, under Security, choose Security Groups.
  12. Select the security group that the Amazon Redshift cluster uses.

  13. Add an inbound rule with the type Redshift and the source set to the IP address you captured, followed by /32.
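
As a sketch, the same rule can also be added from the command line. The security group ID and captured IP address are placeholders, and port 5439 assumes the default Amazon Redshift port; use your cluster's port if it differs.

    aws ec2 authorize-security-group-ingress \
      --group-id <redshift-vpc-security-group-id> \
      --protocol tcp \
      --port 5439 \
      --cidr <captured-ip-address>/32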

  14. On the Amazon WorkSpaces client, use the Amazon Redshift hostname from the endpoint you saved earlier and verify the VPC setup with the following code:
     nslookup <Amazon Redshift hostname>

If you see an IP address within your subnet range, the private endpoint setup for Amazon Redshift was successful.

Testing a SQL command from PSQL or a SQL client in the Amazon WorkSpaces client

To test a SQL command, complete the following steps:

  1. From the terminal in the Amazon WorkSpaces client, run the following command to install PostgreSQL:
    sudo yum install postgresql-server

Alternatively, set up a NAT Gateway and download a SQL client such as SQL Workbench on the Amazon WorkSpaces client:

sudo wget https://www.sql-workbench.eu/Workbench-Build125-with-optional-libs.zip

Then unzip the content of the downloaded file and save it to a directory:

unzip Workbench-Build125-with-optional-libs.zip -d ~/Workbench
  2. Use the Amazon Redshift hostname, port, and database names of the Amazon Redshift cluster endpoint you copied earlier and try connecting to the database:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

  3. Enter your password when prompted.

  4. Run a SQL command and check the results.

Testing the SCREEN program to run SQL in the background

You can use the SCREEN program to run the SQL command in the background and resume to see the results.

  1. From the terminal in the Amazon WorkSpaces client, install the SCREEN program:
    sudo yum install screen

  2. Run the program:
    screen

  3. Connect to PSQL:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W

  4. Enter the password when prompted.
  5. Run the SQL command.
  6. Press Ctrl+A, then D to detach from the screen.

The SQL command is now running in the background. You can check by running ps -ef | grep psql.

  7. To go back to that screen, run the following command:
    screen -r

  8. To quit SCREEN, press the following key sequence:
    Ctrl+A, then \

Testing PSQL with Amazon S3 and Amazon SNS

Similar to the UNLOAD command we used from the Amazon Redshift query editor in the beginning of this post, you can run PSQL from the Amazon WorkSpaces client, send the output to an S3 bucket, and get an Amazon SNS notification for an object create event.

  1. From the terminal in the Amazon WorkSpaces client, run aws configure to set up AWS credentials with write access to the S3 bucket.
  2. Run the following command to write a single output file to Amazon S3 and send an email notification:
    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> -W -c 'select column1, column2 from myschema.mytable' > s3://<username>/<yy-mm-dd-hh24-mi-ss>/
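
Depending on your shell, redirecting directly to an s3:// URI may simply fail or create an oddly named local file. One way to land the result in the bucket under the same date/time convention is to pipe the psql output through the AWS CLI; this is a sketch that assumes the credentials from aws configure allow s3:PutObject on the bucket and that the output file name is your choice.

    psql -h <Amazon Redshift hostname> -p <port> -d <database> -U <username> \
      -c 'select column1, column2 from myschema.mytable' \
      | aws s3 cp - s3://<username>/<yy-mm-dd-hh24-mi-ss>/query-output.txt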

Conclusion

This post explained how to securely query Amazon Redshift databases running in an Amazon VPC using the UNLOAD command with the Amazon Redshift query editor and using Amazon WorkSpaces running in the same VPC. Try out this solution to securely access Amazon Redshift databases without a VPN and run long-running queries. If you have any questions or comments, please share your thoughts in the comments section.


About the Authors

Seetha Sarma is a Senior Database Specialist Solutions Architect with Amazon Web Services. Seetha provides guidance to customers on using AWS services for distributed data processing. In her spare time she likes to go on long walks and enjoy nature.

Moiz Mian is a Solutions Architect for AWS Strategic Accounts. He focuses on enabling customers to build innovative, scalable, and secure solutions for the cloud. In his free time, he enjoys building out Smart Home systems and driving at race tracks.

My Premature Joy About the Trees of Stara Zagora

Post Syndicated from original https://yurukov.net/blog/2020/sztrees/

I recently came across the news that Stara Zagora has published a map showing the trees in the city, with information about their exact location, species, and condition. The topic has interested me for quite some time, and I had even been thinking about how such data could be crowdsourced. That is exactly what the people from ZaraLab did with a project of their own. You can find more about it, and about why it is limited to Stara Zagora, over at Justine's.

What the municipality of Stara Zagora has published, however, has nothing to do with the data collected by volunteers. What pleased me is that they have put together a decent map on which information about quite a lot of trees can be found. My first reaction was extremely positive, because I thought I would finally be able to build a visualization similar to the one New York has had for years.

That is what you see here. It shows the different tree species in different colors. The goal is to give a general impression rather than information about individual objects. The idea is similar to that of the building maps, to which I also added Stara Zagora in June. You can open an interactive version of this map by clicking on the image.

It turns out, however, that practically nothing new has been done, apart from a map of old, incomplete data. It is still commendable that this information was collected and that they tried to present it in a good way, but to a large extent it is useless.

Even at first glance, comparing against satellite imagery, more than half of the trees in the city are missing, as is an entire park. Moreover, the information for almost 38% of the trees was collected back in 2007, meaning it is 13 years old. In that time, many of them have quite likely changed condition or been cut down. Another 48% were recorded in 2014 and 2015. For 7% we have no information about when they were entered, but judging by their IDs they are also at least 5 years old. For only 106 trees, or 1.8%, has information been entered in the last 3 years.

I tried to get in touch with the people working on the project before the holidays. The contractor company did not answer its phones. The municipality promised to respond in January. The attitude of everyone I did reach felt like they were just going through the motions.

With that, my initial enthusiasm for this map evaporated. All the more so because the data was clearly not in an open format, despite the legal requirement for such systems. So I had to open it up myself. You can download it here.

I also took the trouble to build a map with options for filtering by species, condition, and so on. I think only the year of entry is missing. I am publishing the data even though I consider it not particularly useful given how old and incomplete it is. Still, much like the reports from НЦОЗА, where I posted similar warnings, it may be useful to someone.

Regardless of the criticism I have laid out here, it should still be noted that Stara Zagora is a step ahead of other cities in this respect: at least they have some information published in a somewhat meaningful form. That is without a doubt also thanks to the pressure from active citizens and the example set by private initiatives. The latter is a reminder to all of us who want this to happen elsewhere too, and for it to be useful this time.

The post My Premature Joy About the Trees of Stara Zagora first appeared on Блогът на Юруков.

Signing executables with HSM-backed certificates using multiple Windows instances

Post Syndicated from Karim Hamdy Abdelmonsif Ibrahim original https://aws.amazon.com/blogs/security/signing-executables-with-hsm-backed-certificates-using-multiple-windows-instances/

Customers use code signing certificates to digitally sign software, documents, and other certificates. Signing is a cryptographic tool that lets users verify that the code hasn’t been altered and that the software, documents or other certificates can be trusted.

This blog post shows you how to configure your applications so you can use a key pair already on your hardware security module (HSM) to generate signatures using any Windows instance. Many customers use multiple Amazon Elastic Compute Cloud (Amazon EC2) instances to sign workloads using the same key pair. You must configure these instances to use a pre-existing key pair from the HSM. In this blog post, I show you how to create a key container on a new Windows instance from an existing key pair in AWS CloudHSM, and then update the certificate store to associate the newly imported certificate with the new container. I also show you how to use a common application to sign executables with this key pair.

Every certificate is associated with a key pair, which includes a private key and a public key. You can only trust a signature if you can be sure that the private key has remained confidential and can be used only by the owner of the certificate. You achieve this goal by generating the key pair on an HSM and securely storing the private key on the HSM. Enterprise certificate authority (CA) or public key infrastructure (PKI) applications are configured to use this private key in the HSM whenever they need to use the corresponding certificate to sign. This configuration is generally handled transparently between the application and the HSM on the Windows instance your application is running on. The process gets tricky when you want to use multiple Windows instances to sign using the same key pair. This is especially true if your current EC2 instance that acts as a Windows Server CA, which you used to issue the HSM-backed certificate, is deleted and you have a backup of the HSM-backed certificate.

Before we get into the details, you need to know about a library called the key storage provider (KSP). Windows systems use KSP libraries to connect applications to an HSM. For each HSM brand, such as CloudHSM, you need a corresponding KSP to run operations that involve cryptographic keys stored on that HSM. From your application, select the KSP that corresponds with the HSM you want to use to store (or use) your keys. All KSPs associate keys on their HSM with metadata in the Microsoft ecosystem using key containers. Key containers map the metadata in certificates with metadata on the HSM, which allows the application to properly address keys. The list of certificates available for Microsoft utilities to sign with is contained in a trust store. To use the same key pair across multiple Windows instances, you must copy the key containers to each instance—or create a new key container from an existing key pair in each instance—and import the corresponding certificate into the trust store for each instance.
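
A quick way to see which KSPs are registered on a given Windows instance is certutil; the CloudHSM provider should appear in the output once its client software is installed (the exact provider name can vary by client version).

    C:\Users\Administrator>certutil -csplist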

Prerequisites

The solution in this post assumes that you’ve completed the steps in Signing executables with Microsoft SignTool.exe using AWS CloudHSM-backed certificates. You should already have your HSM-backed certificate on one Windows instance.

Before you implement the solution, you must:

  1. Install the AWS CloudHSM client on the new instance and make sure that you can interact with HSM in your CloudHSM cluster.
  2. Verify the CloudHSM KSP and CNG providers installation on your new instance.
  3. Set the login credentials for the HSM on your system. Set credentials through Windows Credentials Manager. I recommend that you reboot your instance after setting up the credentials.

Note: The login credentials identify a crypto user (CU) in the HSM that has access to the key pair in CloudHSM.

Architectural overview

 

Figure 1: Architectural overview

This diagram shows a virtual private cloud (VPC) that contains an EC2 instance running Windows Server 2016 that resides on private subnet 1. This instance will run the CloudHSM client software and will use your HSM-backed certificate with a key pair already on your HSM to sign executable files. The instance can be accessed through a VPN connection. It will also have security groups that enable RDP access for your on-premises network. Private subnet 2 hosts the elastic network interface for the CloudHSM cluster, which has a single HSM.

Out of scope

The focus of this blog post is how to use an HSM-backed certificate with a key pair already on your HSM to sign executable files from any Windows instance using Microsoft SignTool.exe. This post isn’t intended to represent any best practices for implementing code signing or Amazon EC2. For more information, see the NIST cybersecurity whitepaper Security Considerations for Code Signing and Best practices for Amazon EC2, respectively.

Deploy the solution

To deploy the solution, you use certutil, import_key, and SignTool. Certutil is a Microsoft tool that helps you examine your system for available certificates and key containers. Import_key, a tool provided by CloudHSM, generates a local key container for a key pair that’s on your HSM. To complete the process, use SignTool, a Microsoft tool that enables Windows users to digitally sign files, verify signatures, and timestamp files.

You will need the following:

Certificates or key material          Purpose
<my root certificate>.cer             Root certificate
<my signed certificate>.cer           HSM-backed signing certificate
<signed certificate in base64>.cer    HSM-backed signing certificate in base64 format
<public key handle>                   Public key handle of the signing certificate
<private key handle>                  Private key handle of the signing certificate

Import the HSM-backed certificate and its RootCA chain certificate into the new instance

Before you can use third-party tools such as SignTool to generate signatures using the HSM-backed certificate, you must move the signing certificate file to the Personal certificate store in the new Windows instance.

To do that, you copy the HSM-backed certificate that your application uses for signing operations and its root certificate chain from the original instance to the new Windows instance.

If you issued your signing certificate through a private CA (like in my example), you must deploy a copy of the root CA certificate and any intermediate certificates from the private CA to any systems you want to use to verify the integrity of your signed file.

To import the HSM-backed certificate and root certificate

  1. Sign in to the Windows Server that has the private CA that you used to issue your signing certificate. Then, run the following certutil command to export the root CA to a new file. Replace <my root certificate> with a name that you can remember easily.
    C:\Users\Administrator\Desktop>certutil -ca.cert <my root certificate>.cer
    
    CA cert[0]: 3 -- Valid
    CA cert[0]:
    
    -----BEGIN CERTIFICATE-----
    MIICiTCCAfICCQD6m7oRw0uXOjANBgkqhkiG9w0BAQUFADCBiDELMAkGA1UEBhMC
    VVMxCzAJBgNVBAgTAldBMRAwDgYDVQQHEwdTZWF0dGxlMQ8wDQYDVQQKEwZBbWF6
    b24xFDASBgNVBAsTC0lBTSBDb25zb2xlMRIwEAYDVQQDEwlUZXN0Q2lsYWMxHzAd
    BgkqhkiG9w0BCQEWEG5vb25lQGFtYXpvbi5jb20wHhcNMTEwNDI1MjA0NTIxWhcN
    MTIwNDI0MjA0NTIxWjCBiDELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAldBMRAwDgYD
    VQQHEwdTZWF0dGxlMQ8wDQYDVQQKEwZBbWF6b24xFDASBgNVBAsTC0lBTSBDb25z
    b2xlMRIwEAYDVQQDEwlUZXN0Q2lsYWMxHzAdBgkqhkiG9w0BCQEWEG5vb25lQGFt
    YXpvbi5jb20wgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGBAMaK0dn+a4GmWIWJ
    21uUSfwfEvySWtC2XADZ4nB+BLYgVIk60CpiwsZ3G93vUEIO3IyNoH/f0wYK8m9T
    rDHudUZg3qX4waLG5M43q7Wgc/MbQITxOUSQv7c7ugFFDzQGBzZswY6786m86gpE
    Ibb3OhjZnzcvQAaRHhdlQWIMm2nrAgMBAAEwDQYJKoZIhvcNAQEFBQADgYEAtCu4
    nUhVVxYUntneD9+h8Mg9q6q+auNKyExzyLwaxlAoo7TJHidbtS4J5iNmZgXL0Fkb
    FFBjvSfpJIlJ00zbhNYS5f6GuoEDmFJl0ZxBHjJnyp378OD8uTs7fLvjx79LjSTb
    NYiytVbZPQUQ5Yaxu2jXnimvw3rrszlaEXAMPLE=
    -----END CERTIFICATE-----
            
    CertUtil: -ca.cert command completed successfully.
    
    C:\Users\Administrator\Desktop>
    

  2. Copy the <my root certificate>.cer file to your new Windows instance and run the following certutil command. This moves the root certificate from the file into the Trusted Root Certification Authorities store in Windows. You can verify that it exists by running certlm.msc and viewing the Trusted Root Certification Authorities certificates.
    C:\Users\Administrator\Desktop>certutil -addstore "Root" <my root certificate>.cer
    
    Root "Trusted Root Certification Authorities"
    Signature matches Public Key
    Certificate "MYRootCA" added to store.
    CertUtil: -addstore command completed successfully.
    

  3. Copy the HSM-backed signing certificate from the original instance to the new one, and run the following certutil command. This moves the certificate from the file into the Personal certificate store in Windows.
    C:\Users\Administrator\Desktop>certutil -addstore "My" <my signed certificate>.cer
    
    My "Personal"
    Certificate "www.mydomain.com" added to store.
    CertUtil: -addstore command completed successfully.
    

  4. Verify that the certificate exists in your Personal certificate store by running the following certutil command. The following sample output from certutil shows the serial number. Take note of the certificate serial number to use later.
    C:\Users\Administrator\Desktop>certutil -store my
    
    my "Personal"
    ================ Certificate 0 ================
    Serial Number: <certificate serial number>
    Issuer: CN=MYRootCA
     NotBefore: 2/5/2020 1:38 PM
     NotAfter: 2/5/2021 1:48 PM
    Subject: CN=www.mydomain.com, OU=Certificate Management, O=Information Technology, L=Houston, S=Texas, C=US
    Non-root Certificate
    Cert Hash(sha1): 5aaef93e7e972b1187363d880cfa3f71507c2e24
    No key provider information
    Cannot find the certificate and private key for decryption.
    CertUtil: -store command completed successfully.
    

Retrieve the key handles of the RSA key pair on the HSM

In this step, you retrieve the key handles of the existing public and private key pair on your CloudHSM in order to use that key pair to create a key container on the new Windows instance.

One way to get the key handles of an existing key pair on the CloudHSM is to use the modulus value. Since the certificate and its public and private keys all must have the same modulus value and you have the signing certificate already, you view its modulus value using the OpenSSL tool. Then, you use the findKey command in key_mgmt_util to search for the public and private key handles on the HSM using the value of the certificate modulus.

To retrieve the key handles

  1. Download the OpenSSL for Windows installation package.

    Note: In my example, I downloaded Win64OpenSSL-1_1_1d.exe.

  2. Right-click on the downloaded file and choose Run as administrator.
  3. Follow the installation instructions, accepting all default settings. Then choose Install.
    1. If the error message “The Win64 Open SSL Installation Project setup has detected that the following critical component is missing…”—shown in Figure 2—appears, you need to install Microsoft Visual C++ Redistributables to complete this procedure.

      Figure 2: OpenSSL installation error message

    2. Choose Yes to download and install the required Microsoft Visual C++ package on your system.
    3. Run the OpenSSL installer again and follow the installation instructions, accepting all default settings. Then choose Install.
  4. Choose Finish when the installation is complete.

    With the installation complete, OpenSSL for Windows can be found as OpenSSL.exe in C:\Program Files\OpenSSL-Win64\bin. Always open the program as the administrator.

  5. On the new CloudHSM client instance, copy your certificate to C:\Program Files\OpenSSL-Win64\bin and run the command certutil -encode <my signed certificate>.cer <signed certificate in base64>.cer to export the certificate using base64 .cer format. This exports the certificate to a file with the name you enter in place of <signed certificate in base64>.
    C:\Program Files\OpenSSL-Win64\bin>certutil -encode <my signed certificate>.cer <signed certificate in base64>.cer
    
    Input Length = 1066
    Output Length = 1526
    CertUtil: -encode command completed successfully.
    

  6. Run the command openssl x509 -noout -modulus -in <signed certificate in base64>.cer to view the certificate modulus.
    C:\Program Files\OpenSSL-Win64\bin>openssl x509 -noout -modulus -in <signed certificate in base64>.cer
    
    Modulus=9D1D625C041F7FAF076780E486CA2DB2FB846982E88804030F9C84F6CF553925C287934C18B92606EE9A4438F80E47961D7B2CD28213EADE2078BE1A921E6D164CC07F99DA42CF6DD1767A6392FC4BC2B19592474782E1B8574F4A46A93626CD2A8D56405EA7DFCED8DA7042F6FC6D3716CC1649174E93C66F0A9EC7EEFEC9661D43FD2BC8E2E261C06A619E4AF3B5E13190215F72EE5BDE2090818031F8AAD0AA7E934894DC54DF5F1E7577645137637F400E10B9ECDC0870C78C99E8027A86807CD719AA05931D1A4326A5ED1C3687C8EA8E54DF62BFD1851A92473348C98973DEF850B8A88A443A56E93B997F3286A1DC274E6A8DD187D8C59BAB32A6919F
    

  7. Save the certificate modulus in a text file named modulus.txt.
  8. Run the key_mgmt_util command line tool, and log in as the CU, as described in Getting Started with key_mgmt_util. Replace <cu username> and <cu password> with the username and password of the CU.
    Command: loginHSM -u CU -s <CU username> -p <CU password>
    
         	Cfm3LoginHSM returned: 0x00 : HSM Return: SUCCESS
    
            Cluster Error Status
            Node id 13 and err state 0x00000000 : HSM Return: SUCCESS
            Node id 14 and err state 0x00000000 : HSM Return: SUCCESS
    

  9. Run the following findKey command to find the public key handle that has the same RSA modulus that you generated previously. Enter the path to the modulus.txt file that you created in step 7. Take note of the public key handle that’s returned so that you can use it in the following steps.
    Command: findKey -c 2 -m C:\\Users\\Administrator\\Desktop\\modulus.txt
    
            Total number of keys present: 1
    
            Number of matching keys from start index 0::0
    
            Handles of matching keys:
            <public key handle>
    
            Cluster Error Status
            Node id 13 and err state 0x00000000 : HSM Return: SUCCESS
            Node id 14 and err state 0x00000000 : HSM Return: SUCCESS
    
            Cfm3FindKey returned: 0x00 : HSM Return: SUCCESS
    

  10. Run the following findKey command to find the private key handle that has the same RSA modulus that you generated previously. Enter the path to the modulus.txt file that you created in step 7. Take note of the private key handle that’s returned so that you can use it in the following steps.
    Command: findKey -c 3 -m C:\\Users\\Administrator\\Desktop\\modulus.txt
    
            Total number of keys present: 1
    
            Number of matching keys from start index 0::0
    
            Handles of matching keys:
            <private key handle>
    
            Cluster Error Status
            Node id 13 and err state 0x00000000 : HSM Return: SUCCESS
            Node id 14 and err state 0x00000000 : HSM Return: SUCCESS
    
            Cfm3FindKey returned: 0x00 : HSM Return: SUCCESS
    

Create a new key container for the existing public and private key pair in the CloudHSM

To use the same key pair across new Windows instances, you must copy over the key containers to each instance, or create a new key container from an existing key pair in the key storage provider of each instance. In this step, you create a new key container to hold the public key of the certificate and its corresponding private key metadata. To create a new key container from an existing public and private key pair in the HSM, first make sure to start the CloudHSM client daemon. Then, use the import_key.exe utility, which is included in CloudHSM version 3.0 and later.

To create a new key container

  1. Run the following import_key.exe command, replacing <private key handle> and <public key handle> with the public and private key handles you created in the previous procedure. This creates the HSM key pair in a new key container in the key storage provider.
    C:\Program Files\Amazon\CloudHSM>import_key.exe -from HSM –privateKeyHandle <private key handle> -publicKeyHandle <public key handle>
    
    Represented 1 keypairs in Cavium Key Storage Provider.
    

    Note: If you get the error message n3fips_password is not set, make sure that you set the login credentials for the HSM on your system.

  2. You can verify the new key container by running the following certutil command to list the key containers in your key storage provider (KSP). Take note of the key container name to use in the following steps.
    C:\Program Files\Amazon\CloudHSM>certutil -key -csp "Cavium Key Storage provider"
    
    Cavium Key Storage provider:
      <key container name>
      RSA
    
    
    CertUtil: -key command completed successfully.
    

Update the certificate store

Now you have everything in place: the imported certificate in the Personal certificate store of the new Windows instance and the key container that represents the key pair in CloudHSM. In this step, you associate the certificate to the key container that you made a note of earlier.

To update the certificate store

  1. Create a file named repair.txt as shown following.

    Note: You must use the key container name of your certificate that you got in the previous step as the input for the repair.txt file.

    [Properties]
    11 = "" ; Add friendly name property
    2 = "{text}" ; Add Key Provider Information property
    _continue_="Container=<key container name>&"
    _continue_="Provider=Cavium Key Storage Provider&"
    _continue_="Flags=0&"
    _continue_="KeySpec=2"
    

  2. Make sure that the CloudHSM client daemon is still running. Then, use the certutil -repairstore verb with the certificate serial number that you took note of earlier, as shown in the following command. The following sample shows the command and output. See the Microsoft documentation for information about the -repairstore verb.
    certutil -repairstore my <certificate serial number> repair.txt
    
    C:\Users\Administrator\Desktop>certutil -repairstore my <certificate serial number> repair.txt
    
    my "Personal"
    ================ Certificate 0 ================
    Serial Number: <certificate serial number>
    Issuer: CN=MYRootCA
     NotBefore: 2/5/2020 1:38 PM
     NotAfter: 2/5/2021 1:48 PM
    Subject: CN=www.mydomain.com, OU=Certificate Management, O=Information Technology, L=Houston, S=Texas, C=US
    Non-root Certificate
    Cert Hash(sha1): 5aaef93e7e972b1187363d880cfa3f71507c2e24
    CertUtil: -repairstore command completed successfully.
    

  3. Run the following certutil command to verify that your certificate has been associated with the new key container successfully.
    C:\Users\Administrator\Desktop>certutil -store my
    
    my "Personal"
    ================ Certificate 0 ================
    Serial Number: <certificate serial number>
    Issuer: CN=MYRootCA
     NotBefore: 2/5/2020 1:38 PM
     NotAfter: 2/5/2021 1:48 PM
    Subject: CN=www.mydomain.com, OU=Certificate Management, O=Information Technology, L=Houston, S=Texas, C=US
    Non-root Certificate
    Cert Hash(sha1): 5aaef93e7e972b1187363d880cfa3f71507c2e24
      Key Container = CNGRSAPriv-3145768-3407903-26dd1d
      Provider = Cavium Key Storage Provider
    Private key is NOT exportable
    Encryption test passed
    CertUtil: -store command completed successfully.
    

Now you can use this certificate and its corresponding private key with any third-party signing tool on Windows.

Use the certificate with Microsoft SignTool

Now that you have everything in place, you can use the certificate to sign a file using the Microsoft SignTool.

To use the certificate

  1. Get the thumbprint of your certificate. To do this, right-click PowerShell and choose Run as administrator. Enter the following command:
    PS C:\>Get-ChildItem -path cert:\LocalMachine\My
    

    If successful, you should see output similar to the following.

    PSParentPath: Microsoft.PowerShell.Security\Certificate::LocalMachine\My
    
    Thumbprint                                Subject
    ----------                                -------
    <thumbprint>   CN=www.mydomain.com, OU=Certificate Management, O=Information Technology, L=Ho...
    

  2. Copy the thumbprint. You need it to perform the actual signing operation on a file.
  3. Download and install one of the following versions of the Microsoft Windows SDK on your Windows EC2 instance:
     • Microsoft Windows 10 SDK
     • Microsoft Windows 8.1 SDK
     • Microsoft Windows 7 SDK

     Install the latest applicable Windows SDK package for your operating system. For example, for Microsoft Windows 2012 R2 or later versions, you should install the Microsoft Windows 10 SDK.

  4. To open the SignTool application, navigate to the application directory within PowerShell. This is usually:
    C:\Program Files (x86)\Windows Kits\<SDK version>\bin\<version number>\<CPU architecture>\signtool.exe
    

  5. When you’ve located the directory, sign your file by running the following command. Remember to replace <thumbprint> and <test.exe> with your own values. <test.exe> can be any executable file in your directory.
    PS C:\>.\signtool.exe sign /v /fd sha256 /sha1 <thumbprint> /sm /as C:\Users\Administrator\Desktop\<test.exe>
    

    You should see a message like the following:

    Done Adding Additional Store
    Successfully signed: C:\Users\Administrator\Desktop\<test.exe>
    
    Number of files successfully Signed: 1
    Number of warnings: 0
    Number of errors: 0
    

  6. (Optional) To verify the signature on the file, you can use SignTool.exe with the verify option by using the following command.
    PS C:\>.\signtool.exe verify /v /pa C:\Users\Administrators\Desktop\<test.exe>
    

    If successful, you should see output similar to the following.

    Number of files successfully Verified: 1
    

Conclusion

In this post, I walked you through the process of using an HSM-backed certificate on a new Windows instance for signing operations. You used the import_key.exe utility to create a new key container from an existing private/public key pair in CloudHSM. Then, you updated the certificate store to associate your certificate with the key container. Finally, you saw how to use the HSM-backed certificate with the new key container to sign executable files. As you continue to use this solution, it’s important to keep Microsoft Windows SDK, CloudHSM client software, and any other installed software up-to-date.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS CloudHSM forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Karim Hamdy Abdelmonsif Ibrahim

Karim is a Cloud Support Engineer II at AWS and a subject matter expert for AWS Shield. He’s an avid pentester and security enthusiast who’s worked in IT for almost 11 years. He obtained OSCP, OSWP, CISSP, CEH, ECSA, CISM, and AWS Certified Security Specialist certifications. Outside of work, he enjoys jet skiing, hanging out with friends, and watching space documentaries.

Building a serverless data quality and analysis framework with Deequ and AWS Glue

Post Syndicated from Vara Bonthu original https://aws.amazon.com/blogs/big-data/building-a-serverless-data-quality-and-analysis-framework-with-deequ-and-aws-glue/

With ever-increasing amounts of data at their disposal, large organizations struggle to cope with not only the volume but also the quality of the data they manage. Indeed, alongside volume and velocity, veracity is an equally critical issue in data analysis, often seen as a precondition to analyzing data and guaranteeing its value. High-quality data is commonly described as fit for purpose and a fair representation of the real-world constructs it depicts. Ensuring data sources meet these requirements is an arduous task that is best addressed through an automated approach and adequate tooling.

Challenges when running data quality at scale can include choosing the right data quality tools, managing the rules and constraints to apply on top of the data, and taking on the large upfront cost of setting up infrastructure in production.

Deequ, an open-source data quality library developed internally at Amazon, addresses these requirements by defining unit tests for data that it can then scale to datasets with billions of records. It provides multiple features, like automatic constraint suggestions and verification, metrics computation, and data profiling. For more information about how Deequ is used at Amazon, see Test data quality at scale with Deequ.

You need to follow several steps to implement Deequ in production, including building the infrastructure, writing custom AWS Glue jobs, profiling the data, and generating rules before applying them. In this post, we introduce an open-source Data Quality and Analysis Framework (DQAF) that simplifies this process and its orchestration. Built on top of Deequ, this framework makes it easy to create the data quality jobs that you need, manage the associated constraints through a web UI, and run them on the data as you ingest it into your data lake.

Architecture

As illustrated in the following architecture diagram, the DQAF exclusively uses serverless AWS technology. It takes a database and tables in the AWS Glue Data Catalog as inputs to AWS Glue jobs, and outputs various data quality metrics into Amazon Simple Storage Service (Amazon S3). Additionally, it saves time by automatically generating constraints on previously unseen data. The resulting suggestions are stored in Amazon DynamoDB tables and can be reviewed and amended at any point by data owners in the AWS Amplify managed UI. Amplify makes it easy to create, configure, and implement scalable web applications on AWS. The orchestration of these operations is carried out by an AWS Step Functions workflow. The code, artifacts, and an installation guide are available in the GitHub repository.

In this post, we walk through a deployment of the DQAF using some sample data. We assume you have a database in the AWS Glue Data Catalog hosting one or more tables in the same Region where you deploy the framework. We use a legislators database with two tables (persons_json and organizations_json) referencing data about United States legislators. For more information about this database, see Code Example: Joining and Relationalizing Data.

Deploying the solution

Click on the button below to launch an AWS CloudFormation stack that deploys the solution in your AWS account in the last Region that was used:

The process takes 10–15 minutes to complete. You can verify that the framework was successfully deployed by checking that the CloudFormation stacks show the status CREATE_COMPLETE.

Testing the data quality and analysis framework

The next step is to understand (profile) your test data and set up data quality constraints. Constraints can be defined as a set of rules to validate whether incoming data meets specific requirements along various dimensions (such as completeness, consistency, or contextual accuracy). Creating these rules can be a painful process if you have lots of tables with multiple columns, but DQAF makes it easy by sampling your data and suggesting the constraints automatically.

On the Step Functions console, locate the data-quality-sm state machine, which represents the entry point to data quality in the framework. When you provide a valid input, it starts a series of AWS Glue jobs running Deequ. This step function can be called on demand, on a schedule, or based on an event. You run the state machine by entering a value in JSON format.
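
You can also trigger an execution from the command line. The following sketch assumes the state machine keeps the data-quality-sm name used by the framework and reuses the legislators input shown later in this post; the Region and account ID are placeholders.

    aws stepfunctions start-execution \
      --state-machine-arn arn:aws:states:<region>:<account-id>:stateMachine:data-quality-sm \
      --input '{"glueDatabase": "legislators", "glueTables": "persons_json, organizations_json"}'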

First pass and automatic suggestion of constraints

After the step function is triggered, it calls the AWS Glue controller job, which is responsible for determining the data quality checks to perform. Because the submitted tables were never checked before, a first step is to generate data quality constraints on attributes of the data. In Deequ, this is done through an automatic suggestion of constraints, a process where data is profiled and a set of heuristic rules is applied to suggest constraints. It’s particularly useful when dealing with large multi-column datasets. In the framework, this operation is performed by the AWS Glue suggestions job, which logs the constraints into the DataQualitySuggestions DynamoDB table and outputs preliminary quality check results based on those suggestions into Amazon S3 in Parquet file format.

AWS Glue suggestions job

The Deequ suggestions job generates constraints based on three major dimensions:

  • Completeness – Measures the presence of null values, for example isComplete("gender") or isComplete("name")
  • Consistency – Consistency of data types and value ranges, for example .isUnique("id") or isContainedIn("gender", Array("female", "male"))
  • Statistics – Univariate dimensions in the data, for example .hasMax("Salary", "90000") or .hasSize(_>=10)

The following table lists the available constraints that can be manually added in addition to the automatically suggested ones.

Constraint                Argument                 Semantics
Dimension: Completeness
isComplete                column                   Check that there are no missing values in a column
hasCompleteness           column, udf              Custom validation of missing values in a column
Dimension: Consistency
isUnique                  column                   Check that there are no duplicates in a column
hasUniqueness             column, udf              Custom validation of the unique value ratio in a column
hasDistinctness           column, udf              Custom validation of the unique row ratio in a column
isInRange                 column, value range      Validation of the fraction of values that are in a valid range
hasConsistentType         column                   Validation of the largest fraction of values that have the same type
isNonNegative             column                   Validation whether all the values in a numeric column are non-negative
isLessThan                column pair              Validation whether all the values in the first column are always less than the second column
satisfies                 predicate                Validation whether all the rows match the predicate
satisfiesIf               predicate pair           Validation whether all the rows matching the first predicate also match the second predicate
hasPredictability         column, column(s), udf   User-defined validation of the predictability of a column
Statistics (can be used to verify dimension consistency)
hasSize                   udf                      Custom validation of the number of records
hasTypeConsistency        column, udf              Custom validation of the maximum fraction of the values of the same datatype
hasCountDistinct          column                   Custom validation of the number of distinct non-null values in a column
hasApproxCountDistinct    column, udf              Custom validation of the approximate number of distinct non-null values
hasMin                    column, udf              Custom validation of the column’s minimum value
hasMax                    column, udf              Custom validation of the column’s maximum value
hasMean                   column, udf              Custom validation of the column’s mean value
hasStandardDeviation      column, udf              Custom validation of the column’s standard deviation value
hasApproxQuantile         column, quantile, udf    Custom validation of a particular quantile of a column (approximate)
hasEntropy                column, udf              Custom validation of the column’s entropy
hasMutualInformation      column pair, udf         Custom validation of the column pair’s mutual information
hasHistogramValues        column, udf              Custom validation of the column’s histogram
hasCorrelation            column pair, udf         Custom validation of the column pair’s correlation

The following screenshot shows the DynamoDB table output with suggested constraints generated by the AWS Glue job.

AWS Glue data profiler job

Deequ also supports single-column profiling of data, and its implementation scales to large datasets with billions of rows. As a result, we get a profile for each column in the data, which allows us to inspect the completeness of the column, the approximate number of distinct values, and the inferred datatype.

The controller triggers an AWS Glue data profiler job in parallel to the suggestions job. This profiler Deequ process runs three passes over the data and avoids any shuffles in order to easily scale to large datasets. Results are stored as Parquet files in the S3 data quality bucket.

When the controller job is complete, the second step in the data quality state machine is to crawl the Amazon S3 output data into a data_quality_db database in the AWS Glue Data Catalog, which is then immediately available to be queried in Amazon Athena. The following screenshot shows the list of tables created by this AWS Glue framework and a sample output from the data profiler results.

Reviewing and verifying data quality constraints

As good as Deequ is at suggesting data quality rules, the data stewards should first review the constraints before applying them in production. Because it may be cumbersome to edit large tables in DynamoDB directly, we have created a web app that enables you to add or amend the constraints. The changes are updated in the relevant DynamoDB tables in the background.

Accessing the web front end

To access the user interface, on the AWS Amplify console, choose the deequ-constraints app. Choosing the URL (listed as https://<env>.<appsync_app_id>.amplifyapp.com) opens the data quality constraints front end. After you complete the registration process with Amazon Cognito (create an account) and sign in, you see a UI similar to the following screenshot.

It lists data quality constraint suggestions produced by the AWS Glue job in the previous step. Data owners can add or remove and enable or disable these constraints at any point via the UI. Suggestions are not enabled by default. This makes sure all constraints are human reviewed before they are processed. Choosing the check box enables a constraint.

Data analyzer (metric computations)

Alongside profiling, Deequ can also generate column-level statistics called data analyzer metrics (such as completeness, maximum, and correlation). They can help uncover data quality problems, for example by highlighting the share of null values in a primary key or the correlation between two columns.

The following table lists the metrics that you can apply to any column.

Metric                 Semantics
Dimension: Completeness
Completeness           Fraction of non-missing values in a column
Dimension: Consistency
Size                   Number of records
Compliance             Ratio of columns matching predicate
Uniqueness             Unique value ratio in a column
Distinctness           Unique row ratio in a column
ValueRange             Value range verification for a column
DataType               Data type inference for a column
Predictability         Predictability of values in a column
Statistics (can be used to verify dimension consistency)
Minimum                Minimal value in a column
Maximum                Maximal value in a column
Mean                   Mean value in a column
StandardDeviation      Standard deviation of the value distribution in a column
CountDistinct          Number of distinct values in a column
ApproxCountDistinct    Approximate number of distinct values in a column
ApproxQuantile         Approximate quantile of the value in a column
Correlation            Correlation between two columns
Entropy                Entropy of the value distribution in a column
Histogram              Histogram of an optionally binned column
MutualInformation      Mutual information between two columns

In the web UI, you can add these metrics on the Analyzers tab. In the following screenshot, we add an ApproxCountDistinct metric on an id column. Choosing Create analyzer inserts the record into the DataQualityAnalyzer table in DynamoDB and enables the constraint.


AWS Glue verification job

We’re now ready to put our rules into production and can use Athena to look at the results. You can start running the step function with the same JSON as input:

{
  "glueDatabase": "legislators",
  "glueTables": "persons_json, organizations_json"
}
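
If you prefer to start the run programmatically rather than from the Step Functions console, a minimal boto3 sketch could look like this; the state machine ARN is a placeholder you would replace with the one deployed by the framework.

import boto3
import json

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    # Placeholder ARN: use the data quality state machine deployed in your account.
    stateMachineArn="arn:aws:states:eu-west-1:123456789012:stateMachine:data-quality",
    input=json.dumps({
        "glueDatabase": "legislators",
        "glueTables": "persons_json, organizations_json"
    }),
)
print(response["executionArn"])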

This time the AWS Glue verification job is triggered by the controller. This job performs two actions: it verifies the suggested constraints and performs the metric computations. You can immediately query the results in Athena under the constraints_verification_results table.
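
For a quick programmatic check, you can also submit that query through the Athena API. The sketch below is an illustration only; the results bucket is a placeholder, and the table name assumes the data_quality_db database created by the crawler earlier.

import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT * FROM constraints_verification_results LIMIT 10",
    QueryExecutionContext={"Database": "data_quality_db"},
    # Placeholder bucket for Athena query output.
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)["QueryExecutionId"]

# Poll athena.get_query_execution(QueryExecutionId=query_id) until the state is
# SUCCEEDED, then fetch rows with athena.get_query_results(QueryExecutionId=query_id).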

The following screenshot shows the verification output.


The following screenshot shows the metric computation results.


Summary

Dealing with large, real-world datasets requires a scalable and automated approach to data quality. Deequ is the tool of choice at Amazon when it comes to measuring the quality of large production datasets. It’s used to compute data quality metrics, suggest and verify constraints, and profile data.

This post introduced an open-source, serverless Data Quality and Analysis Framework that aims to simplify the process of deploying Deequ in production by setting up the necessary infrastructure and making it easy to manage data quality constraints. It enables data owners to generate automated data quality suggestions on previously unseen data that can then be reviewed and amended in a UI. These constraints serve as inputs to various AWS Glue jobs in order to produce data quality results queryable via Athena. Try this framework on your data and leave suggestions on how to improve it on our open-source GitHub repo.


About the Authors

Vara Bonthu is a Senior BigData/DevOps Architect for ProServe working with the Global Accounts team. He is passionate about big data and Kubernetes. He helps customers all over the world design, build, and migrate end-to-end data analytics and container-based solutions. In his spare time, he develops applications to help educate his 7-year-old autistic son.

 

Abdel Jaidi is a Data & Machine Learning Engineer for AWS Professional Services. He works with customers on Data & Analytics projects, helping them shorten their time to value and accelerate business outcomes. In his spare time, he enjoys participating in triathlons and walking dogs in parks in and around London.

2020 in the Rearview

Post Syndicated from original https://www.backblaze.com/blog/2020-in-the-rearview/

Looking Out for Our Team, Customers, and Community

Writing a “year in review” for 2020 feels more than a little challenging. After all, it’s the first year in memory that became its own descriptor: The phrase “because 2020” has become the lead in or blanket explanation for just about any news story we never could have predicted at the beginning of this year.

And yet, looking forward to 2021, I can’t help but feel hopeful when I think about what we did with these hard times. Families rediscovered ways to stay connected and celebrate, neighbors and communities strengthened their bonds and their empathy for one another, and all sorts of businesses and organizations reached well beyond any idea of normal operations to provide services and support despite wild headwinds. Healthcare professionals, grocery stores, poll workers, restaurants, teachers—the creativity and resilience shown in all they’ve accomplished in a matter of months is humbling. If we can do all of this and more in a year of unprecedented challenges, imagine what we can do when we’re no longer held back by a global pandemic?

Looking closer to home, at the Backblaze community—some 190 employees, as well as their families and pets, and our hundreds of thousands of customers and partners around the world—I’m similarly hopeful. In the grand scheme of the pandemic, we were lucky. Most of our work, our services, and our customers’ work, can be accomplished remotely. And yet, I can’t help but be inspired by the stories from this year.

There were Andrew Davis and Alex Acosta, two-thirds of the IT operations team at Gladstone Institutes—a leader in biomedical research that rapidly shifted many of its labs’ focus this year to studying the virus that causes COVID-19. After realizing their data was vulnerable, these two worked with our team to move petabytes of data off of tape and into the cloud, protecting all of it from ransomware and data loss.

Research in process at Gladstone Institutes. Photo Credit: Gladstone Institutes.

And then there were Cédric Pierre-Louis, Director of Programming for the African Fiction Channels at THEMA, and Gareth Howells, Director of Out Point Media, who worked with our friends at iconik to make collaboration and storytelling easier across the African Fiction Channels at THEMA—a Canal+ Group company that has more than 180 television channels in its portfolio. The creative collaboration that goes into TV might not rival the life-saving potential of Gladstone’s work, but I think everyone needed to escape through the power of media at some point this year.

Members of the Backblaze team, connecting remotely.

And if you had told me on March 7th—the day after we made the decision to shift Backblaze to mostly 100% work from home status until the COVID-19 situation resolved—that the majority of our team would work for 10 more months (and counting) from our kitchens and attics and garages…and that we’d still launch the Backblaze S3 Compatible APIs, clear an exabyte of data under management, enable Cloud to Cloud Migration, and announce so many other solutions and partnerships, I’m not sure which part would have been harder to believe. But during a year when cloud storage and computer backup became increasingly important for businesses and individuals, I’m truly proud of the way our team stepped up to support and serve our customers.

These are just a sampling of the hopeful stories from our year. There’s no question that there are still challenges in our future, but tallying what we’ve been able to achieve while our Wi-Fi cut in and out, while our pets and children rampaged through the house, and while we swapped hard drives masked and six feet distant from our coworkers, there’s little question in my mind that we can meet them. Until then, thanks for your good work, your business, and sticking with us, together, while apart.

The post 2020 in the Rearview appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/841436/rss

Security updates have been issued by Mageia (flac, graphicsmagick, jackit, kdeconnect-kde, libmaxminddb, libvirt, openjpeg2, pngcheck, python3, roundcubemail, and spice-vdagent), openSUSE (gimp), and SUSE (containerd, docker, docker-runc, golang-github-docker-libnetwork, cyrus-sasl, and gimp).

Let me subscribe – Zabbix masters IoT topics

Post Syndicated from Wolfgang Alper original https://blog.zabbix.com/let-me-subscribe-zabbix-masters-iot-topics/12710/

Zabbix 5.2 supports two important protocols used in the world of the Internet of Things — MQTT and Modbus. Now we can benefit from the newest Zabbix features and integrate Zabbix network monitoring in the world of IoT.

Contents

I. What is MQTT? (3:32:13)
II. MQTT and Zabbix integration (3:39:48)

1. MQTT setup (3:40:03)
2. Node-RED (3:42:12)
3. Splitting data (3:45:45)
4. Publishing data from Zabbix (3:52:23)

III. Questions & Answers (3:55:42)

What is MQTT?

MQTT, the Message Queuing Telemetry Transport, was invented in 1999 and designed to be bandwidth-efficient and lightweight, and thus battery-efficient. Initially, it was developed to allow for monitoring oil pipelines.

It is a well-defined ISO standard (ISO/IEC 20922), and it is getting increasingly adopted due to its suitability for the Internet of Things (IoT), sensor networks, home automation, machine-to-machine (M2M), and mobile applications. MQTT usually uses TCP/IP as the transport protocol, over port 1883, and can be encrypted using TLS, with 8883 as the default port.

There is a variation of MQTT available, MQTT-SN (MQTT for Sensor Networks), used for non-TCP/IP networks, such as Zigbee (an IEEE 802.15.4 radio-based protocol) or other UDP- or Bluetooth-based implementations.

There are 2 types of network entities available: ‘Message broker‘ and ‘Clients‘.

MQTT supports three Quality-of-Service (QoS) levels:

— 0: At most once – “Fire and forget” where you might or might not receive the message.
— 1: At least once – The message can be sent/delivered multiple times.
— 2: Exactly once – Safest and slowest service.

MQTT is based on a ‘publish’ / ‘subscribe-to-topic’ mechanism:

1. Publish/subscribe.

Publish/subscribe pattern

The MQTT Message Broker consumes messages published by clients (on the left) using two-level ‘Topics’ (such as, for instance, office temperature, office humidity, or indoor air quality). The clients on the right side act as subscribers, receiving any information published on a particular topic. Every time a message is published to the broker, the broker notifies all of the subscribers (Clients 3 and 4), and these clients get the sensor value.

2. Combined publishing/subscribing

Combined pub/sub

A client can be a subscriber and a publisher at the same time. In this example, Client 1 publishes a brightness value, and Client 3 subscribes to it. Client 3 may decide that a brightness of 1,500 is too low, so it publishes a new message to the ‘office’ topic telling the light controller to increase the brightness; Client 2, the light controller with a subscription to that topic, then changes the brightness level on receipt of the message.

3. Wildcard subscriptions

+ = single-level, # = multi-level

Wildcards in MQTT are easy. You can subscribe, for instance, to ‘office/+’, where the ‘+’ sign stands in for any single topic level (such as brightness). Because the ‘+’ sign substitutes exactly one level of the topic, it is a single-level wildcard, while the pound sign (‘#’) is a multi-level wildcard that matches any number of remaining levels.

MQTT features:

  • Clients can publish and subscribe to one or more topics.
  • One client can publish and subscribe at the same time.
  • Clients can subscribe using single/multi-level wildcards.
  • Clients can choose between three different QoS levels.

MQTT advanced features:

  • Messages can be retained by the broker for new subscribers. So, if a new client subscribes to a particular topic, then the publisher can mark its messages as ‘Retained‘ so that the new subscriber gets the last retained message.
  • Clients can provide a “last will and testament” that will be published by the broker when the client “dies”.
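
To make the publish/subscribe mechanics concrete, here is a minimal sketch using the Eclipse Paho Python client; the broker host and topics are placeholders, and the callback style shown is the paho-mqtt 1.x API.

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    # Single-level wildcard: matches office/temperature, office/brightness, etc.
    client.subscribe("office/+", qos=1)

def on_message(client, userdata, msg):
    print(f"{msg.topic}: {msg.payload.decode()}")

client = mqtt.Client()  # paho-mqtt 2.x would need mqtt.Client(mqtt.CallbackAPIVersion.VERSION1)
client.on_connect = on_connect
client.on_message = on_message
client.connect("yourbroker.io", 1883)  # placeholder broker, default unencrypted MQTT port

# Publish a retained value so that a new subscriber immediately receives the last reading.
client.publish("office/brightness", "1500", qos=1, retain=True)

client.loop_forever()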

MQTT and Zabbix integration

MQTT setup

Integrating Zabbix into the multiple-client mix

Integrated structure:

1. Four sensors:

    • Server room.
    • Training room.
    • Sales room.
    • Support room.

2. Four different topics:

    • office
    • bielefeld (home town)
    • serverroom
    • trainingroom

3. Mosquitto MQTT Message Broker, which is one of the well-known message brokers.

So, the sensors are publishing the data to the Mosquitto Message Broker, where any MQTT-enabled device or system can pick those values up. In our case, it’s the home automation system, which subscribes to the Message Broker and has access to all of the values published by the sensors.

Thanks to MQTT support in Zabbix 5.2, Zabbix can now subscribe to the Mosquitto Message Broker and immediately get access to all of the sensors publishing their values to the broker.

As we can have multiple subscribers, multiple clients can subscribe to one topic on the Message Broker. So the home automation system, as well as Zabbix, can subscribe to the same values published to the Message Broker.

Node-RED

Sooner or later, you will need Node-RED, a flow-based programming tool that lets you subscribe to the broker, publish messages to it as a client, and work with the data.

Data Processing in Node-RED

This setup might be useful if, for instance, a Zabbix trigger fires and passes the information over to MQTT, publishing the outcome of the trigger to the Message Broker, where it will then be picked up by the home automation system.

Zabbix publishes data to the broker

You can have two different Zabbix instances subscribing to the same Message Broker acting just as two different clients.

Multiple Zabbix servers sharing the same data

Node-RED:

    • Construction kit for the Internet of Things and home automation.
    • Acts as MQTT client able to publish and subscribe.
    • Flow-based tool for visual programming based on Node.js.
    • Graphical web editor.
    • Supports input, processing, and output nodes.
    • Extensible with plugins and custom function nodes.

Different types of nodes can be connected in the workspace. For instance, the nodes subscribing to a topic and transforming the data, or the nodes writing the data to a log file.

Node-RED

We can get the data from the sensors as the raw JSON string containing 20-30 metrics in a payload, and as a parsed JSON object in the Node-RED Debug node with easy-to-read metrics, such as, for instance, temperature, humidity, WiFi quality, indoor air quality, etc.

Multiple metrics in one message

Splitting data

We have different options for data splitting available:

  • Split on MQTT level: use Node-RED to split metrics and then publish them in their own topics (a good option when other clients can handle only a single metric at a time).

Splitting data in Node-RED
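
Node-RED handles this with a function node; purely as an illustration of the same splitting logic, a small Python sketch with paho-mqtt might look as follows (broker host and topic names are placeholders):

import json
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # One incoming payload carries many metrics...
    metrics = json.loads(msg.payload)
    # ...which we republish as one value per sub-topic, e.g. office/temperature.
    for name, value in metrics.items():
        client.publish(f"{msg.topic}/{name}", str(value), qos=1)

client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("yourbroker.io", 1883)  # placeholder broker
client.subscribe("office")             # the combined, multi-metric topic
client.loop_forever()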

 

  • Split on Zabbix level: set up an MQTT item as a master item and use Zabbix JSON preprocessing with corresponding dependent items. It’s more efficient because Zabbix needs only one subscription.

We can get the data with the brand-new mqtt.get item in Zabbix 5.2:

— Requires Agent 2.
— Requires active checks. Every time a client publishes a message to the topic, the broker pushes that data to us, so mqtt.get must listen on the subscription and get notified when new data comes in.
— Broker URL default is localhost.
— User name and password are optional.
— Uses Eclipse Paho Go client library.

One Zabbix agent in active mode sending data to multiple hosts

For our setup with four sensors, in the Sales Room, Server Room, Support Room, and Training Room, we need four hosts in Zabbix. Traditionally, you would need four different agents to handle them, as each agent running in active mode needs to be configured with its own hostname. In our setup, however, we need just one agent, installed once and handling the different hosts by subscribing to multiple topics.

This is possible because of the new feature in Zabbix 5.2 that allows one active agent to run checks for multiple hosts. All we need is:

—  to set up hosts in Zabbix (as usual),
—  to define our MQTT items (as usual),
—  to set up just one agent with all of the hostnames the agent should be responsible for (the new feature),
—  to set up the master item, which is our mqtt.get item,
—  to define several dependent items and preprocessing for each of the dependent items, and
—  to start preprocessing with JSONPath.

NOTE. Every time the master item gets an update, so do all of the dependent items in Zabbix.

Master item and dependent items

  • Combine both methods: let other clients subscribe to a single metric using their specific topic, but publish all sensor data for Zabbix in one topic.

NOTE. Data received and displayed on the dashboard is based on the MQTT item, the payload, and the MQTT messages received from the Message Broker.

Sensor data dashboard

Publishing data from Zabbix

Now you want to publish the outcome of a Zabbix trigger, so it can be consumed by other MQTT-enabled devices. Any MQTT subscriber, like Node-RED, should receive the alert. To do that, you need:

  • to define a new media type to send problems to the topic, that is, to pass the data over to the Message Broker;
  • to use mosquitto_pub, the command-line tool for Mosquitto, which allows us to publish the message:
#!/bin/sh
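# $1 is the rendered message body (for example, the JSON problem template),
# and $2 is the topic suffix passed in by the media type (for example, the host name).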
mosquitto_pub -h yourbroker.io -m "$1" -t "zabbix/problems/$2"

  • to make sure that the data is sent to the broker in the right format. In this case, we use JSON as transport and define a JSON problem template and a JSON problem recovery template.

 

In Zabbix, you’ll see the problem, the actions, and the media type firing using the subscription, and in the Debug node of Node-RED, you’ll see that the data is received from Zabbix.

Zabbix problems published via MQTT

This model with Node-RED can be used to create sophisticated setups. For instance, you can take the data from Zabbix, forward it by actions and media types, preprocess them in Node-RED, and transform the data in many different ways.

IoT devices and other subscribers can react to issues detected by Zabbix using Node-RED

NOTE. To try out the MQTT setup and the new Zabbix features, you can use the live broker available via IntelliTrend’s new GitHub account, which receives data from the sensors every 10 minutes. You’ll also find templates, access data, the address of the broker, etc., everything you need to get started.

Questions & Answers

Question. If the MQTT client gets overloaded due to high message frequency on subscribe topics, how will that affect Zabbix?

Answer. Here either the broker might be overloaded or the Zabbix agent might not be able to keep up. As for the broker, the MQTT protocol defines quality-of-service levels, and QoS level 2 guarantees delivery. So if QoS 2 is used, the messages won’t get lost but will be resent in case of failure.

Question. What else would you expect from the IoT side of Zabbix? What kind of protocols or things would get added? 

Answer. There’s always room for improvement. You can use third-party tools, custom scripts, or any tools to enhance Zabbix. I’m sure that using user script parameters was an excellent design decision. But the official support of MQTT is a quantum leap for Zabbix because it opens the door to most IoT infrastructures, as MQTT is the most important IoT protocol so far.

For instance, one of our customers is monitoring the infrastructure of electricity generators, production systems, etc. They use their own monitoring platform provided by vendors. The request was to integrate alerts or some metrics into Zabbix. The customer’s monitoring platform used MQTT protocol. So, all we had to do was to make their monitoring platform use external scripts and MQTT support.

Counterfeit masks bought by Bulgaria reached Germany as well, with fake certificates

Post Syndicated from Atanas Chobanov original https://bivol.bg/ryzur-masks-spiegel.html

Tuesday, December 29, 2020


Substandard Chinese respirators from the manufacturer Ryzur have been handed out to teachers in the state of Baden-Württemberg, Der Spiegel reports. According to the authoritative German outlet’s investigation, these masks carry fake certificates from the DEKRA laboratory and must be withdrawn from use immediately. The German authorities, however, were unaware of this fraud.

In Bulgaria, one million of the same masks were purchased by the government with European funds in May. At the time, Bivol revealed that these masks had received the lowest filtration rating from a reference American laboratory. The Ryzur masks performed worst among 92 tested models (see here). The maximum filter efficiency of this mask is a mere 33.9%, which is scandalously low compared to the 95% declared for this class of respirators. In practice, the mask offers no protection at all against inhaling the coronavirus. The full report from the study can be seen here.

How Bulgaria bought counterfeit Chinese masks with EU money

The same counterfeit masks reached German teachers this autumn, bearing a test stamp from the reputable DEKRA laboratory. However, DEKRA director Jörg-Tim Kilisch stated that these masks had not been tested by his laboratory and “must not be handed out under any circumstances.”

“The masks that have already been distributed must be recalled immediately,” says Kilisch.

Der Spiegel sought comment from the competent regional ministries of social affairs and of culture, which denied distributing substandard masks. The officials, however, did not know that the DEKRA stamps had been forged.

Bulgaria’s experience with these masks is even more scandalous, since they were advertised as medical devices, and this is also written into the purchase contract, which Bivol has obtained. The masks delivered by plane, however, were not medical-grade masks under the GB19083-2010 standard (see here), but ordinary respirators under the GB2626 standard (see here). They were unfit for use by healthcare workers.

The masks ordered by Bulgaria also bore fake stamps from the American drug safety agency, the FDA, and the European CE quality marking. This did not prevent the shipment from being accepted and put to use. The masks were handed out to employees of the Ministry of the Interior, the Customs Agency, and other “front line” institutions, despite the established facts about their ineffectiveness and the fake documentation.

 

[$] 5.11 Merge window, part 2

Post Syndicated from original https://lwn.net/Articles/841062/rss

Linus Torvalds released the 5.11-rc1 prepatch and closed the 5.11 merge window on December 27. By that time, 12,498 non-merge changesets had been pulled into the mainline; nearly 2,500 of those wandered in after the first merge-window summary was written. Activity slowed down in the second week, as expected, but there were still a number of interesting features that found their way into the mainline.

Security updates for Monday

Post Syndicated from original https://lwn.net/Articles/841378/rss

Security updates have been issued by Debian (horizon, kitty, python-apt, and roundcube), Fedora (libmaxminddb, mediawiki, mingw-binutils, and thunderbird), Mageia (erlang-rebar3), openSUSE (blosc, ceph, firefox, flac, kdeconnect-kde, openexr, ovmf, PackageKit, python3, thunderbird, and xen), and SUSE (thunderbird).
