[$] OpenPGP in Thunderbird

Post Syndicated from original https://lwn.net/Articles/832183/rss

It is a pretty rare event to see a nearly 21-year-old bug be addressed—many
projects are nowhere near that old for one thing—but that is just what has
occurred for the Mozilla Thunderbird email
application. An enhancement
request
filed at the end of 1999 asked for a plugin to support email encryption, but it has mostly
languished since. The Enigmail plugin did come
along to fill the gap by providing OpenPGP support using GNU Privacy Guard (GnuPG or GPG), but was
never part of Thunderbird.
As part of Thunderbird 78,
though, OpenPGP is now fully supported within the mail user agent
(MUA).

tko-subs – Detect & Takeover Subdomains With Dead DNS Records

Post Syndicated from original https://www.darknet.org.uk/2020/09/tko-subs-detect-takeover-subdomains-with-dead-dns-records/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

tko-subs – Detect & Takeover Subdomains With Dead DNS Records

tko-subs is a tool that helps you to detect & takeover subdomains with dead DNS records, this could be dangling CNAMEs point to hosting services or to nothing at all or NS records that are mistyped.

What does tko-subs – Detect & Takeover Subdomains With Dead DNS Records Do?

This tool allows you:

  • To check whether a subdomain can be taken over because it has:
    • a dangling CNAME pointing to a CMS provider (Heroku, Github, Shopify, Amazon S3, Amazon CloudFront, etc.) that can be taken over.

Read the rest of tko-subs – Detect & Takeover Subdomains With Dead DNS Records now! Only available at Darknet.

Automating bucketing of streaming data using Amazon Athena and AWS Lambda

Post Syndicated from Ahmed Saef Zamzam original https://aws.amazon.com/blogs/big-data/automating-bucketing-of-streaming-data-using-amazon-athena-and-aws-lambda/

In today’s world, data plays a vital role in helping businesses understand and improve their processes and services to reduce cost. You can use several tools to gain insights from your data, such as Amazon Kinesis Data Analytics or open-source frameworks like Structured Streaming and Apache Flink to analyze the data in real time. Alternatively, you can batch analyze the data by ingesting it into a centralized storage known as a data lake. Data lakes allow you to import any amount of data that can come in real time or batch. With Amazon Simple Storage Service (Amazon S3), you can cost-effectively build and scale a data lake of any size in a secure environment where data is protected by 99.999999999% (11 9s) of durability.

After the data lands in your data lake, you can start processing this data using any Big Data processing tool of your choice. Amazon Athena is a fully managed interactive query service that enables you to analyze data stored in an Amazon S3-based data lake using standard SQL. You can also integrate Athena with Amazon QuickSight for easy visualization of the data.

When working with Athena, you can employ a few best practices to reduce cost and improve performance. Converting to columnar formats, partitioning, and bucketing your data are some of the best practices outlined in Top 10 Performance Tuning Tips for Amazon Athena. Bucketing is a technique that groups data based on specific columns together within a single partition. These columns are known as bucket keys. By grouping related data together into a single bucket (a file within a partition), you significantly reduce the amount of data scanned by Athena, thus improving query performance and reducing cost. For example, imagine collecting and storing clickstream data. If you frequently filter or aggregate by user ID, then within a single partition it’s better to store all rows for the same user together. If user data isn’t stored together, then Athena has to scan multiple files to retrieve the user’s records. This leads to more files being scanned, and therefore, an increase in query runtime and cost.

Like partitioning, columns that are frequently used to filter the data are good candidates for bucketing. However, unlike partitioning, with bucketing it’s better to use columns with high cardinality as a bucketing key. For example, Year and Month columns are good candidates for partition keys, whereas userID and sensorID are good examples of bucket keys. By doing this, you make sure that all buckets have a similar number of rows. For more information, see Bucketing vs Partitioning.

For real-time data (such as data coming from sensors or clickstream data), streaming tools like Amazon Kinesis Data Firehose can convert the data to columnar formats and partition it while writing to Amazon S3. With Kafka, you can do the same thing with connectors. But what about bucketing? This post shows how to continuously bucket streaming data using AWS Lambda and Athena.

Overview of solution

The following diagram shows the high-level architecture of the solution.

The architecture includes the following steps:

  1. We use the Amazon Kinesis Data Generator (KDG) to simulate streaming data. Data is then written into Kinesis Data Firehose; a fully managed service that enables you to load streaming data to an Amazon S3-based data lake.
  2. Kinesis Data Firehose partitions the data by hour and writes new JSON files into the current partition in a /raw Each new partition looks like /raw/dt=<YYYY-MM-dd-HH>. Every hour, a new partition is created.
  3. Two Lambda functions are triggered on an hourly basis based on Amazon CloudWatch Events.
    • Function 1 (LoadPartition) runs every hour to load new /raw partitions to Athena SourceTable, which points to the /raw prefix.
    • Function 2 (Bucketing) runs the Athena CREATE TABLE AS SELECT (CTAS) query.
  4. The CTAS query copies the previous hour’s data from /raw to /curated and buckets the data while doing so. It loads the new data as a new partition to TargetTable, which points to the /curated prefix.

Overview of walkthrough

In this post, we cover the following high-level steps:

  1. Install and configure the KDG.
  2. Create a Kinesis Data Firehose delivery stream.
  3. Create the database and tables in Athena.
  4. Create the Lambda functions and schedule them.
  5. Test the solution.
  6. Create view that the combines data from both tables.
  7. Clean up.

Installing and configuring the KDG

First, we need to install and configure the KDG in our AWS account. To do this, we use the following AWS CloudFormation template.

For more information about installing the KDG, see the KDG Guide in GitHub.

To configure the KDG, complete the following steps:

  1. On the AWS CloudFormation console, locate the stack you just created.
  2. On the Outputs tab, record the value for KinesisDataGeneratorUrl.
  3. Log in to the KDG main page using the credentials created when you deployed the CloudFormation template.
  4. In the Record template section, enter the following template. Each record has three fields: sensorID, currentTemperature, and status.
    {
        "sensorId": {{random.number(4000)}},
        "currentTemperature": {{random.number(
            {
                "min":10,
                "max":50
            }
        )}},
        "status": "{{random.arrayElement(
            ["OK","FAIL","WARN"]
        )}}"
    }
    

  5. Choose Test template.

The result should look like the following screenshot.

We don’t start sending data now; we do this after creating all other resources.

Creating a Kinesis Data Firehose delivery stream

Next, we create the Kinesis Data Firehose delivery stream that is used to load the data to the S3 bucket.

  1. On the Amazon Kinesis console, choose Kinesis Data Firehose.
  2. Choose Create delivery stream.
  3. For Delivery stream name, enter a name, such as AutoBucketingKDF.
  4. For Source, select Direct PUT or other sources.
  5. Leave all other settings at their default and choose Next.
  6. On Process Records page, leave everything at its default and choose Next.
  7. Choose Amazon S3 as the destination and choose your S3 bucket from the drop-down menu (or create a new one). For this post, I already have a bucket created.
  8. For S3 Prefix, enter the following prefix:
    raw/dt=!{timestamp:yyyy}-!{timestamp:MM}-!{timestamp:dd}-!{timestamp:HH}/

We use custom prefixes to tell Kinesis Data Firehose to create a new partition every hour. Each partition looks like this: dt=YYYY-MM-dd-HH. This partition-naming convention conforms to the Hive partition-naming convention, <PartitionKey>=<PartitionKey>. In this case, <PartitionKey> is dt and <PartitionValue> is YYYY-MM-dd-HH. By doing this, we implement a flat partitioning model instead of hierarchical (year=YYYY/month=MM/day=dd/hour=HH) partitions. This model can be much simpler for end-users to work with, and you can use a single column (dt) to filter the data. For more information on flat vs. hierarchal partitions, see Data Lake Storage Foundation on GitHub.

  1. For S3 error prefix, enter the following code:
    myFirehoseFailures/!{firehose:error-output-type}/

  2. On the Settings page, leave everything at its default.
  3. Choose Create delivery stream.

Creating an Athena database and tables

In this solution, the Athena database has two tables: SourceTable and TargetTable. Both tables have identical schemas and will have the same data eventually. However, each table points to a different S3 location. Moreover, because data is stored in different formats, Athena uses a different SerDe for each table to parse the data. SourceTable uses JSON SerDe and TargetTable uses Parquet SerDe. One other difference is that SourceTable’s data isn’t bucketed, whereas TargetTable’s data is bucketed.

In this step, we create both tables and the database that groups them.

  1. On the Athena console, create a new database by running the following statement:
    CREATE DATABASE mydatabase

  2. Choose the database that was created and run the following query to create SourceTable. Replace <s3_bucket_name> with the bucket name you used when creating the Kinesis Data Firehose delivery stream.
    CREATE EXTERNAL TABLE mydatabase.SourceTable(
      sensorid string, 
      currenttemperature int, 
      status string)
    PARTITIONED BY ( 
      dt string)
    ROW FORMAT SERDE 
      'org.openx.data.jsonserde.JsonSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
      's3://<s3_bucket_name>/raw/'
    

  3. Run the following CTAS statement to create TargetTable:
    CREATE TABLE TargetTable
    WITH (
          format = 'PARQUET', 
          external_location = 's3://<s3_bucket_name>/curated/', 
          partitioned_by = ARRAY['dt'], 
          bucketed_by = ARRAY['sensorID'], 
          bucket_count = 3) 
    AS SELECT *
    FROM SourceTable

SourceTable doesn’t have any data yet. However, the preceding query creates the table definition in the Data Catalog. We configured this data to be bucketed by sensorID (bucketing key) with a bucket count of 3. Ideally, the number of buckets should be so that the files are of optimal size.

Creating Lambda functions

The solution has two Lambda functions: LoadPartiton and Bucketing. We use an AWS Serverless Application Model (AWS SAM) template to create, deploy, and schedule both functions.

Follow the instructions in the GitHub repo to deploy the template. When deploying the template, it asks you for some parameters. You can use the default parameters, but you have to change S3BucketName and AthenaResultLocation. For more information, see Parameter Details in the GitHub repo.

LoadPartition function

The LoadPartiton function is scheduled to run the first minute of every hour. Every time Kinesis Data Firehose creates a new partition in the /raw folder, this function loads the new partition to the SourceTable. This is crucial because the second function (Bucketing) reads this partition the following hour to copy the data to /curated.

Bucketing function

The Bucketing function is scheduled to run the first minute of every hour. It copies the last hour’s data from SourceTable to TargetTable. It does so by creating a tempTable using a CTAS query. This tempTable points to the new date-hour folder under /curated; this folder is then added as a single partition to TargetTable.

To implement this, the function runs three queries sequentially. The queries use two parameters:

  • <s3_bucket_name> – Defined by an AWS SAM parameter and should be the same bucket used throughout this solution
  • <last_hour_partition> – Is calculated by the function depending on which hour it’s running

The function first creates TempTable as the result of a SELECT statement from SourceTable. It stores the results in a new folder under /curated. The results are bucketed and stored in Parquet format. See the following code:

CREATE TABLE TempTable
    WITH (
      format = 'PARQUET', 
      external_location = 's3://<s3_bucket_name>/curated/dt=<last_hour_partition>/', 
      bucketed_by = ARRAY['sensorID'], 
      bucket_count = 3) 
    AS SELECT *
    FROM SourceTable
    WHERE dt='<last_hour_partiton>';

We create a new subfolder in /curated, which is new partition for TargetTable. So, after the TempTable creation is complete, we load the new partition to TargetTable:

ALTER TABLE TargetTable
                ADD IF NOT EXISTS
                PARTITION ('<last_hour_partiton>');

Finally, we delete tempTable from the Data Catalog:

DROP TABLE TempTable

Testing the solution

Now that we have created all resources, it’s time to test the solution. We start by generating data from the KDG and waiting for an hour to start querying data in TargetTable (the bucketed table).

  1. Log in to the KDG. You should find the template you created earlier. For the configuration, choose the following:
    1. The Region used.
    2. For the delivery stream, choose the Kinesis Data Firehose you created earlier.
    3. For records/sec, enter 3000.
  2. Choose Send data.

The KDG starts sending simulated data to Kinesis Data Firehose. After 1 minute, a new partition should be created in Amazon S3.

The Lambda function that loads the partition to SourceTable runs on the first minute of the hour. If you started sending data after the first minute, this partition is missed because the next run loads the next hour’s partition, not this one. To mitigate this, run MSCK REPAIR TABLE SourceTable only for the first hour.

  1. To benchmark the performance between both tables, wait for an hour so that the data is available for querying in TargetTable.
  2. When the data is available, choose one sensorID and run the following query on SourceTable and TargetTable.
    SELECT sensorID, avg(currenttemperature) as AverageTempreture 
    FROM <TableName>
    WHERE dt='<YYYY-MM-dd-HH>' AND sensorID ='<sensorID_selected>'
    GROUP BY 1

The following screenshot shows the query results for SourceTable. It shows the runtime in seconds and amount of data scanned.

The following screenshot shows the query results for TargetTable.

If you look at these results, you don’t see a huge difference in runtime for this specific query and dataset; for other datasets, this difference should be more significant. However, from a data scanning perspective, after bucketing the data, we reduced the data scanned by approximately 98%. Therefore, for this specific use case, bucketing the data lead to a 98% reduction in Athena costs because you’re charged based on the amount of data scanned by each query.

Querying the current hour’s data

Data for the current hour isn’t available immediately in TargetTable. It’s available for querying after the first minute of the following hour. To query this data immediately, we have to create a view that UNIONS the previous hour’s data from TargetTable with the current hour’s data from SourceTable. If data is required for analysis after an hour of its arrival, then you don’t need to create this view.

To create this view, run the following query in Athena:

CREATE OR REPLACE VIEW combined AS

SELECT *, "$path" AS file
FROM SourceTable
WHERE dt >= date_format(date_trunc('hour', (current_timestamp)), '%Y-%m-%d-%H')

UNION ALL 

SELECT *, "$path" AS file
FROM TargetTable
WHERE dt < date_format(date_trunc('hour', (current_timestamp)), '%Y-%m-%d-%H')

Cleaning up

Delete the resources you created if you no longer need them.

  1. Delete the Kinesis Data Firehose delivery stream.
  2. In Athena, run the following statements
    1. DROP DATABASE mydatabase
    2. DROP TABLE SourceTable
    3. DROP TABLE TargetTable
  3. Delete the AWS SAM template to delete the Lambda functions.
  4. Delete the CloudFormation stack for the KDG. For more information, see Deleting a stack on the AWS CloudFormation console.

Conclusion

Bucketing is a powerful technique and can significantly improve performance and reduce Athena costs. In this post, we saw how to continuously bucket streaming data using Lambda and Athena. We used a simulated dataset generated by Kinesis Data Generator. The same solution can apply to any production data, with the following changes:

  • DDL statements
  • Functions used can work with data that is partitioned by hour with the partition key ‘dt’ and partition value <YYYY-MM-dd-HH>. If your data is partitioned in a different way, edit the Lambda functions accordingly.
  • Frequency of Lambda triggers.

About the Author

Ahmed Zamzam is a Solutions Architect with Amazon Web Services. He supports SMB customers in the UK in their digital transformation and their cloud journey to AWS, and specializes in Data Analytics. Outside of work, he loves traveling, hiking, and cycling.

 

 

 

 

Best practices using AWS SCT and AWS Snowball to migrate from Teradata to Amazon Redshift

Post Syndicated from Ajinkya Puranik original https://aws.amazon.com/blogs/big-data/best-practices-using-aws-sct-and-aws-snowball-to-migrate-from-teradata-to-amazon-redshift/

This is a guest post from ZS. In their own words, “ZS is a professional services firm that works closely with companies to help develop and deliver products and solutions that drive customer value and company results. ZS engagements involve a blend of technology, consulting, analytics, and operations, and are targeted toward improving the commercial experience for clients.”

This blog is about the approaches evaluated and eventually chosen for ZS’s cloud transformation journey specifically for adoption of Amazon Redshift from prior Teradata based data warehousing solution.

ZS, a professional services firm that works side by side with companies to help develop and deliver products that drive customer value and company results. We leverage our extensive industry expertise, leading-edge analytics, technology and strategies to create solutions that work in the real world. With more than 35 years of experience and over 7,500 ZS employees in 24 offices worldwide, we are passionately committed to helping companies and their customers thrive.

ZS used Teradata as the primary data warehouse solution for several years. Part due to high ownership and operating cost, we started looking for an optimal solution which could provide scaling flexibility, lower maintenance liability and access accelerated innovation in the industry. This was achievable through solutions hosted on a cloud platform like AWS which ZS has already been using for numerous business workloads over the years.

Considerations for migration

Following were the three key areas which were critical for our Teradata to Amazon Redshift migration planning.

Table structures

The process included migrating the database schema first and then migrating the actual data from the databases. The schema on Amazon Redshift needed to be ready before loading the data from Amazon Simple Storage Service (Amazon S3).

AWS Schema Conversion Tool (SCT) helped in migrating table structures to Amazon Redshift, which converted the data types used for columns in Teradata tables into the corresponding Amazon Redshift data types. The AWS SCT tool also helped convert the table definition from Teradata to Amazon Redshift to include the appropriate keys, such as the Distribution Key/Sort Key. How to use the AWS SCT has been explained in the later sections of this blog

Database objects and data types

Teradata databases can hold a variety of database objects apart from tables like views, stored procedures, macros, User Defined Functions (UDF) and so on. The data types of the columns used in Teradata tables needed to be converted into the appropriate data types on Amazon Redshift. For other objects like views, stored procedures, the definitions from Teradata were exported and fresh objects were created in Amazon Redshift with appropriate changes in the new definitions. The AWS SCT can help in identifying the objects that need rework while migrating to Amazon Redshift.

Transferring data to AWS

Third and one of the major considerations was migrating the actual data to AWS. ZS’s use cases and isolation requirements were such that neither was Direct connect used in general nor were all AWS VPCs connected to corporate / on-premises network via VPN Tunnels. Data once exported out of Teradata gets uncompressed and expands approximately 4x resulting in requirement for data storage on local staging servers. Each ZS client workload had its respective warehouse on source and destination which also varied in size and had respective isolate change management timelines. Given these considerations we designed two use case specific approaches for transferring the exported data from the Teradata database to Amazon S3:

  • AWS Snowball – For databases larger than 4TB, we chose to transfer the data using AWS Snowball. Once the data was exported out of Teradata it was pushed to AWS Snowball periodically in batches. Resulting in optimal use of the storage space on the staging servers.
  • AWS CLI upload – For databases smaller than 4TB, data sets were exported from Teradata to staging servers and uploaded to Amazon S3 over the internet using AWS Command Line Interface (AWS CLI). These data sets were uploaded during non-business hours to minimize the impact on the ZS on-premises data center network bandwidth

The following diagram illustrates this architecture

Challenges and constraints

Exporting the data

The amount of data that had to be exported from the Teradata systems was 100+ TB (compressed). When exported, this would potentially expand to 500+TB. We needed a solution that could export this scale efficiently. Staging such large data volumes before migrating to AWS was a challenge due to limited on-premises SAN storage capacity. The mitigation chosen was to export in batches such that the exported data could be moved away from the staging server in a rolling fashion thus keeping space available throughout the migration. For certain datasets, due to volumes we further re-compressed the exported data before migration to Amazon S3.

Transferring the data

ZS had 150+ databases within our Teradata systems used across numerous ZS client initiatives. For certain projects, the data even had to be transferred to the client’s AWS account requiring respective unique processes while technology foundation was reusable. As alluded to earlier, due to varying dataset sizes per client workload, respective nuanced approaches were designed.

Initial approach for the solution

A cross functional team comprising of expertise across data warehousing, storage, network, cloud native technologies, business was formed at ZS which was also supported by AWS experts brought in via ZS’s AWS Partnership.

Primary focus at beginning was placed on finalizing data migration approaches. One such method that we tried was to use the AWS SCT to copy the schema onto Amazon Redshift and transfer the data to Amazon S3 using SCT Migration mode extract and upload. We also looked at file interface of AWS Snowball Edge to eliminate the need of having local storage for migration and directly exporting the Teradata exports on AWS Snowball Edge.

Approach constraints

While choosing a final approach, we came across the following challenges:

  1. Data export speeds were a major factor, considering the huge amount of data to migrate. We adopted the Teradata Parallel Transporter (TPT) approach because it showed better runtimes.
  2. Teradata holds up to 4X compressed data, which gets uncompressed post export. Holding such large datasets on a staging server was not feasible due to storage constraints.
  3. AWS Snowball Edge was evaluated instead of AWS Snowball to test the advantages of attaching it as a direct NFS to staging servers. However, since maximum file size supported by snowball edge NFS interface is 150 GB, we decided to continue with AWS Snowball.

TPT scripts method

Teradata Parallel Transporter (TPT) scripts were leveraged to export the data since it provided faster export speeds from Teradata servers compared to alternatives. We prepared the Teradata Parallel Transporter (TPT) scripts and launched these through Linux servers. Before starting the export, we had to ensure that enough free space was available on the server(s) to accommodate the export dumps.

The advantages of using TPT scripts to export data from Teradata tables were as follows:

  • Parallel processing to export data, which provided faster runtimes
  • Exporting varied data types into text format, which could be loaded into Amazon Redshift

Then the data was exported on the same servers where the TPT scripts were run. From here the data was copied either to the Amazon S3 bucket through the AWS CLI that was installed on the same server or to the Snowball device.

Final architecture

The hybrid cloud architecture we zeroed in on is depicted in picture below comprising of ZS’s on-premises data center hosting Teradata appliance, AWS destination environments and intermediary staging as well as shipping and data transfer networks. AWS SCT was leveraged for Schema migration and TPT exports for the data migration. The TPT export scripts were executed on the staging servers and the data was exported onto shared storage which was attached to staging servers. After the exports were completed the data was copied to AWS S3 using either AWS CLI for S3 or was pushed to AWS Snowball depending on the data size. The Snowball device was configured within the same network as the staging servers to ensure optimal transfer latency. Once data was copied completely onto AWS Snowball, it was shipped to AWS where data was transferred into the corresponding Amazon S3 bucket. On the AWS side, we had the S3 bucket for the corresponding Amazon Redshift cluster that held the data before loading into it.

Exporting the data

The TPT script is very effective when exporting huge amounts of data from the proprietary Teradata systems. You can prepare and deploy export scripts on a server within the same network as the Teradata appliance, which enables high export speeds.

The TPT export script is a combination of 1) Declaration section 2) Loop with built-in commands. Export dump with logs are generated as outputs.

Declaration section

The declaration section is where we initialize all the parameters, like the system identifier known as the tdpid, login user name, and delimiter that are used in the output files. See the following code that sets up shell variables:

#!/bin/ksh  
 split_file_no=3  
 SourceTdpId=<cop alias entries from hosts file or IP>  
 SourceUserName=<user id having read access on the DB tables>  
 SourceUserPassword=  
 DDLPrivateLogName=ddlprivate.log  
 ExportPrivateLogName=exportprivate.log  
 TargetErrorList=3807  
 TargetFormat=delimited  
 TargetTextDelimiter=^ (can be decided based on the column values)  
 TargetOpenMode=write  
 SpoolMode=NoSpool  
 MaxDecimalDigits=31  

Loop with built-in commands:

The values for the required variables were passed from three input files:

  • <databasename>.<tablename>
  • Definition of the TPT export operator
  • Job variables file (this file gets removed at the end of export)

See the following shell script that uses shell and TPT utility commands:

  fn_read_table_schema()   
  {   
  while read database table   
  do   
    SourceWorkingDatabase=${database}   
    TargetFileName=${database}.${table}   
    echo "SourceTdpId = ""'"${SourceTdpId}"'" $'\n' ",""SourceLogonMech = ""'"${SourceLogonMech}"'" $'\n' ","   
       "SourceUserName = ""'"${SourceUserName}"'" $'\n' "," "SourceUserPassword = ""'"${SourceUserPassword}"'" $'\n' ","   
       "SourceWorkingDatabase = ""'"${SourceWorkingDatabase}"'" $'\n' "," "DDLPrivateLogName = ""'"${DDLPrivateLogName}"'" $'\n' ","   
       "ExportPrivateLogName = ""'"${ExportPrivateLogName}"'" $'\n' "," "TargetErrorList = ""[""'"${TargetErrorList}"'""]" $'\n' ","   
       "TargetFileName = ""'"${TargetFileName}".dat""'" $'\n' "," "TargetFormat = ""'"${TargetFormat}"'" $'\n' ","   
       "TargetTextDelimiter = ""'"${TargetTextDelimiter}"'" $'\n' "," "TargetOpenMode = ""'"${TargetOpenMode}"'" $'\n' ","   
       "SpoolMode = ""'"${SpoolMode}"'" $'\n' "," "SelectStmt = ""'""select * from ${database}""."${table}"'" $'\n' >> jobvar.txt   
    chmod 777 jobvar.txt   
    tbuild -j ${table} -f tpt2test_2.tpt -v jobvar.txt   
    rm -rf jobvar.txt   
    log_total_records "${TargetFileName}"   
  done<tablename   
  }  

Export dump and logs

The data exported from the Teradata system through the TPT scripts was placed on the staging server. To ensure the quality of the exported data, we verified that the record counts in the log file, created during the TPT export, matched with the table row counts.

Table row count in Teradata

TPT exported dataset row count

The TPT scripts generated one file for every Teradata table. The file format of these files was text with the .dat extension. See the following screenshot.

You can optimize data loading into Amazon Redshift tables by splitting the corresponding file (dataset) into subsets of equal sizes. The number of such subsets should ideally be equal to or a multiple of the number of slices for the Amazon Redshift node type configured in the cluster. We chose to split the TPT output files using the Linux split command on the TPT server:

‘split -C 20m --numeric-suffixes input_filename output_prefix’

For more information efficiently loading the Amazon Redshift tables, see Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift and Best Practices for Micro-Batch Loading on Amazon Redshift.

Transferring data to S3 buckets

ZS leveraged AWS account level isolation for many of our client solutions to align with respective compliance controls. AWS Snowball is associated with a single AWS account, and to achieve full client data isolation, separate devices were shipped for each large use case. As indicated above, we adopted two methods to transfer the data based on the export size for each client workload:

  • AWS CLI – Use when databases are smaller than 4TB.
  • Snowball – Use when databases are bigger than 4TB or when data needed to be loaded to a ZS owned Client Dedicated account.

Transferring data through the AWS CLI

Transferring the data via the AWS CLI includes the following steps:

  1. Install and configure the AWS CLI utility on ZS on-premises Linux (staging) server
  2. Export datasets out from Teradata on the staging server.
  3. Copy the exported datasets to Amazon S3 using the AWS CLI:

aws s3 cp filename.txt s3://aws-s3-bucket-name/foldername/

Transferring data through Snowball

To transfer the data with Snowball, complete the following steps:

  1. Create a Snowball job on the AWS Management Console and order the Snowball device.
  2. Configure the Snowball on ZS’s on-premises data center network and install the Snowball client on the staging server.
  3. Unlock the Snowball device by downloading the manifest file and an unlock code from the console, as shown in the following code:

snowball start -i XX.XX.XX.XX -m /home/abcd/XXXXXXXXX_manifest.bin -u XXXXXXXXXXX

  1. Use the Snowball CLI to list the S3 Bucket associated with Snowball.

snowball s3 ls

  1. Copy the files to Snowball:

snowball cp /location/of/the/exported/files s3://Bucket_name/Target/

Transferring the table structure to Amazon Redshift

There are a few differences in the table definition format between Amazon Redshift and Teradata. The AWS SCT tool helps convert the Teradata table structure into an appropriate Amazon Redshift table structure.

To transfer the Teradata table structure to Amazon Redshift, complete the following steps:

  1. Connect to the on-premises Teradata systems and the Amazon Redshift cluster endpoint.

  1. Select the specific table from Teradata and right-click the option Convert schema. This converts the table definition into the Amazon Redshift equivalent.

  1. In the Amazon Redshift section of the AWS SCT console, choose Apply to database when the table conversion is complete to create the table structure on Amazon Redshift.

Pushing the data to the tables

After you migrate the required data to the appropriate S3 bucket, convert the tables as per usability, and apply the tables to Amazon Redshift, you can push the data to these tables via the COPY command:

copy AXXXX_MAIN.table1  
 from 's3://aws-s3-bucket-name/AXXXX_MAIN.table1.dat'  
 iam_role 'arn:aws:iam::XXXXXXX:role/aws-iam-role '  
 delimiter '|'  
 region 'us-XXXX-1'; 

The naming convention we used for the exported datasets was <databasename>.<tablename>. The table structures (DDLs) were migrated through AWS SCT and the table names matched the dataset names. Therefore, when we created the COPY commands, we simply had to match the target table name in Amazon Redshift with that of the datasets on Amazon S3. For more information about this process, see Using the COPY command to load from Amazon S3.

Conclusion

In this blog, we intended to convey our journey and options evaluated before zeroing on one to transform on-premise Teradata data warehouse workloads onto Amazon Redshift at scale. Process built around multiple tools including AWS SCT, Teradata Parallel Transporter, and AWS Snowball facilitated our transformation

For more information about AWS SCT, see Introducing AWS Schema Conversion Tool Version 1.0.502. For more information about Snowball, see AWS Import/Export Snowball – Transfer 1 Petabyte Per Week Using Amazon-Owned Storage Appliances.

Disclaimer: The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Ajinkya Puranik is a Cloud Database Lead within Cloud Centre of Excellence at ZS Associates. He has years of experience managing, administrating, optimizing and adopting evolving data warehousing solutions. He played an instrumental role in ZS’s Teradata to Redshift transformation journey. His personal interests involve cricket and traveling.

 

 

 

 

Sushant Jadhav is a Senior Cloud Administrator within Cloud Center of Excellence at ZS Associates. He is a results-oriented professional with technology experience predominantly in the storage and backup industry. He has worked on many migration projects where he helped customers migrate from on-premises to AWS. Sushant enjoys working on all the AWS services and tries to bridge the gap between technology and business. He is always keen on learning new technologies and is always evolving in his role. Apart from work, he enjoys playing football.

[$] Removing run-time disabling for SELinux in Fedora

Post Syndicated from original https://lwn.net/Articles/831748/rss

Disabling SELinux
is, perhaps sadly in some ways, a time-honored tradition
for users of Fedora, RHEL, and other distributions that feature the
security mechanism. Over the years, SELinux has gotten easier to tolerate
due to the hard work of its developers and the distributions, but there are
still third-party packages that recommend or require disabling SELinux in
order to function. Up until fairly recently, the kernel has supported
disabling SELinux at run time, but that mechanism has been deprecated—in
part due to another kernel security feature. Now Fedora is planning
to eliminate the ability to disable SELinux at run time in Fedora 34, which sparked
some discussion in its devel mailing list.

Field Notes: Monitoring the Java Virtual Machine Garbage Collection on AWS Lambda

Post Syndicated from Steffen Grunwald original https://aws.amazon.com/blogs/architecture/field-notes-monitoring-the-java-virtual-machine-garbage-collection-on-aws-lambda/

When you want to optimize your Java application on AWS Lambda for performance and cost the general steps are: Build, measure, then optimize! To accomplish this, you need a solid monitoring mechanism. Amazon CloudWatch and AWS X-Ray are well suited for this task since they already provide lots of data about your AWS Lambda function. This includes overall memory consumption, initialization time, and duration of your invocations. To examine the Java Virtual Machine (JVM) memory you require garbage collection logs from your functions. Instances of an AWS Lambda function have a short lifecycle compared to a long-running Java application server. It can be challenging to process the logs from tens or hundreds of these instances.

In this post, you learn how to emit and collect data to monitor the JVM garbage collector activity. Having this data, you can visualize out-of-memory situations of your applications in a Kibana dashboard like in the following screenshot. You gain actionable insights into your application’s memory consumption on AWS Lambda for troubleshooting and optimization.

The lifecycle of a JVM application on AWS Lambda

Let’s first revisit the lifecycle of the AWS Lambda Java runtime and its JVM:

  1. A Lambda function is invoked.
  2. AWS Lambda launches an execution context. This is a temporary runtime environment based on the configuration settings you provide, like permissions, memory size, and environment variables.
  3. AWS Lambda creates a new log stream in Amazon CloudWatch Logs for each instance of the execution context.
  4. The execution context initializes the JVM and your handler’s code.

You typically see the initialization of a fresh execution context when a Lambda function is invoked for the first time, after it has been updated, or it scales up in response to more incoming events.

AWS Lambda maintains the execution context for some time in anticipation of another Lambda function invocation. In effect, the service freezes the execution context after a Lambda function completes. It thaws the execution context when the Lambda function is invoked again if AWS Lambda chooses to reuse it.

During invocations, the JVM also maintains garbage collection as usual. Outside of invocations, the JVM and its maintenance processes like garbage collection are also frozen.

Garbage collection and indicators for your application’s health

The purpose of JVM garbage collection is to clean up objects in the JVM heap, which is the space for an application’s objects. It finds objects which are unreachable and deletes them. This frees heap space for other objects.

You can make the JVM log garbage collection activities to get insights into the health of your application. One example for this is the free heap after each garbage collection. If this metric keeps shrinking, it is an indicator for a memory leak – eventually turning into an OutOfMemoryError. If there is not enough of free heap, the JVM might be too busy with garbage collection instead of running your application code. Otherwise, a heap that is too big does indicate that there’s potential to decrease the memory configuration of your AWS Lambda function. This keeps garbage collection pauses low and provides a consistent response time.

The garbage collection logging can be configured via an environment variable as part of the AWS Lambda function configuration. The environment variable JAVA_TOOL_OPTIONS is considered by both the Java 8 and 11 JVMs. You use it to pass options that you would usually add to the command line when launching the JVM. The options to configure garbage collection logging and the output is specific to the Java version.

Java 11 uses the Unified Logging System (JEP 158 and JEP 271) which has been introduced in Java 9. Logging can be configured with the environment variable:

JAVA_TOOL_OPTIONS=-Xlog:gc+metaspace,gc+heap,gc:stdout:time,tags

The Serial Garbage Collector will output the logs:

[<TIMESTAMP>][gc] GC(4) Pause Full (Allocation Failure) 9M->9M(11M) 3.941ms (D)
[<TIMESTAMP>][gc,heap] GC(3) DefNew: 3063K->234K(3072K) (A)
[<TIMESTAMP>][gc,heap] GC(3) Tenured: 6313K->9127K(9152K) (B)
[<TIMESTAMP>][gc,metaspace] GC(3) Metaspace: 762K->762K(52428K) (C)
[<TIMESTAMP>][gc] GC(3) Pause Young (Allocation Failure) 9M->9M(21M) 23.559ms (D)

Prior to Java 9, including Java 8, you configure the garbage collection logging as follows:

JAVA_TOOL_OPTIONS=-XX:+PrintGCDetails -XX:+PrintGCDateStamps

The Serial garbage collector output in Java 8 is structured differently:

<TIMESTAMP>: [GC (Allocation Failure)
    <TIMESTAMP>: [DefNew: 131042K->131042K(131072K), 0.0000216 secs] (A)
    <TIMESTAMP>: [Tenured: 235683K->291057K(291076K), 0.2213687 secs] (B)
    366725K->365266K(422148K), (D)
    [Metaspace: 3943K->3943K(1056768K)], (C)
    0.2215370 secs]
    [Times: user=0.04 sys=0.02, real=0.22 secs]
<TIMESTAMP>: [Full GC (Allocation Failure)
    <TIMESTAMP>: [Tenured: 297661K->36658K(297664K), 0.0434012 secs] (B)
    431575K->36658K(431616K), (D)
    [Metaspace: 3943K->3943K(1056768K)], 0.0434680 secs] (C)
    [Times: user=0.02 sys=0.00, real=0.05 secs]

Independent of the Java version, the garbage collection activities are logged to standard out (stdout) or standard error (stderr). Logs appear in the AWS Lambda function’s log stream of Amazon CloudWatch Logs. The log contains the size of memory used for:

  • A: the young generation
  • B: the old generation
  • C: the metaspace
  • D: the entire heap

The notation is before-gc -> after-gc (committed heap). Read the JVM Garbage Collection Tuning Guide for more details.

Visualizing the logs in Amazon Elasticsearch Service

It is hard to fully understand the garbage collection log by just reading it in Amazon CloudWatch Logs. You must visualize it to gain more insight. This section describes the solution to achieve this.

Solution Overview

Java Solution Overview

Amazon CloudWatch Logs have a feature to stream CloudWatch Logs data to Amazon Elasticsearch Service via an AWS Lambda function. The AWS Lambda function for log transformation is subscribed to the log group of your application’s AWS Lambda function. The subscription filters for a pattern that matches the one of the garbage collection log entries. The log transformation function processes the log messages and puts it to a search cluster. To make the data easy to digest for the search cluster, you add code to transform and convert the messages to JSON. Having the data in a search cluster, you can visualize it with Kibana dashboards.

Get Started

To start, launch the solution architecture described above as a prepackaged application from the AWS Serverless Application Repository. It contains all resources ready to visualize the garbage collection logs for your Java 11 AWS Lambda functions in a Kibana dashboard. The search cluster consists of a single t2.small.elasticsearch instance with 10GB of EBS storage. It is protected with Amazon Cognito User Pools so you only need to add your user(s). The T2 instance types do not support encryption of data at rest.

Read the source code for the application in the aws-samples repository.

1. Spin up the application from the AWS Serverless Application Repository:

launch stack button

2. As soon as the application is deployed completely, the outputs of the AWS CloudFormation stack provide the links for the next steps. You will find two URLs in the AWS CloudFormation console called createUserUrl and kibanaUrl.

search stack

3. Use the createUserUrl link from the outputs, or navigate to the Amazon Cognito user pool in the console to create a new user in the pool.

a. Enter an email address as username and email. Enter a temporary password of your choice with at least 8 characters.

b. Leave the phone number empty and uncheck the checkbox to mark the phone number as verified.

c. If necessary, you can check the checkboxes to send an invitation to the new user or to make the user verify the email address.

d. Choose Create user.

create user dialog of Amazon Cognito User Pools

4. Access the Kibana dashboard with the kibanaUrl link from the AWS CloudFormation stack outputs, or navigate to the Kibana link displayed in the Amazon Elasticsearch Service console.

a. In Kibana, choose the Dashboard icon in the left menu bar

b. Open the Lambda GC Activity dashboard.

You can test that new events appear by using the Kibana Developer Console:

POST gc-logs-2020.09.03/_doc
{
  "@timestamp": "2020-09-03T15:12:34.567+0000",
  "@gc_type": "Pause Young",
  "@gc_cause": "Allocation Failure",
  "@heap_before_gc": "2",
  "@heap_after_gc": "1",
  "@heap_size_gc": "9",
  "@gc_duration": "5.432",
  "@owner": "123456789012",
  "@log_group": "/aws/lambda/myfunction",
  "@log_stream": "2020/09/03/[$LATEST]123456"
}

5. When you go to the Lambda GC Activity dashboard you can see the new event. You must select the right timeframe with the Show dates link.

Lambda GC activity

The dashboard consists of six tiles:

  • In the Filters you optionally select the log group and filter for a specific AWS Lambda function execution context by the name of its log stream.
  • In the GC Activity Count by Execution Context you see a heatmap of all filtered execution contexts by garbage collection activity count.
  • The GC Activity Metrics display a graph for the metrics for all filtered execution contexts.
  • The GC Activity Count shows the amount of garbage collection activities that are currently displayed.
  • The GC Duration show the sum of the duration of all displayed garbage collection activities.
  • The GC Activity Raw Data at the bottom displays the raw items as ingested into the search cluster for a further drill down.

Configure your AWS Lambda function for garbage collection logging

1. The application that you want to monitor needs to log garbage collection activities. Currently the solution supports logs from Java 11. Add the following environment variable to your AWS Lambda function to activate the logging.

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags

The environment variables must reflect this parameter like the following screenshot:

environment variables

2. Go to the streamLogs function in the AWS Lambda console that has been created by the stack, and subscribe it to the log group of the function you want to monitor.

streamlogs function

3. Select Add Trigger.

4. Select CloudWatch Logs as Trigger Configuration.

5. Input a Filter name of your choice.

6. Input "[gc" (including quotes) as the Filter pattern to match all garbage collection log entries.

7. Select the Log Group of the function you want to monitor. The following screenshot subscribes to the logs of the application’s function resize-lambda-ResizeFn-[...].

add trigger

8. Select Add.

9. Execute the AWS Lambda function you want to monitor.

10. Refresh the dashboard in Amazon Elasticsearch Service and see the datapoint added manually before appearing in the graph.

Troubleshooting examples

Let’s look at an example function and draw some useful insights from the Java garbage collection log. The following diagrams show the Sample Amazon S3 function code for Java from the AWS Lambda documentation running in a Java 11 function with 512 MB of memory.

  • An S3 event from a new uploaded image triggers this function.
  • The function loads the image from S3, resizes it, and puts the resized version to S3.
  • The file size of the example image is close to 2.8MB.
  • The application is called 100 times with a pause of 1 second.

Memory leak

For the demonstration of a memory leak, the function has been changed to keep all source images in memory as a class variable. Hence the memory of the function keeps growing when processing more images:

GC activity metrics

In the diagram, the heap size drops to zero at timestamp 12:34:00. The Amazon CloudWatch Logs of the function reveal an error before the next call to your code in the same AWS Lambda execution context with a fresh JVM:

Java heap space: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
 at java.desktop/java.awt.image.DataBufferByte.<init>(Unknown Source)
[...]

The JVM crashed and was restarted because of the error. You leverage primarily the Amazon CloudWatch Logs of your function to detect errors. The garbage collection log and its visualization provide additional information for root cause analysis:

Did the JVM run out of memory because a single image to resize was too large?

Or was the memory issue growing over time?

The latter could be an indication that you have a memory leak in your code.

The Heap size is too small

For the demonstration of a heap that was chosen too small, the memory leak from the preceding image has been resolved, but the function was configured to 128MB of memory. From the baseline of the heap to the maximum heap size, there are only approximately 5 MB used.

GC activity metrics

This will result in a high management overhead of your JVM. You should experiment with a higher memory configuration to find the optimal performance also taking cost into account. Check out AWS Lambda power tuning open source tool to do this in an automated fashion.

Finetuning the initial heap size

If you review the development of the heap size at the start of an execution context, this indicates that the heap size is continuously increased. Each heap size change is an expensive operation consuming time of your function. Over time, the heap size is changed as well. The garbage collector logs 502 activities, which take almost 17 seconds overall.

GC activity metrics

This on-demand scaling is useful on a local workstation where the physical memory is shared with other applications. On AWS Lambda, the configured memory is dedicated to your function, so you can use it to its full extent.

You can do so by setting the minimum and maximum heap size to a fixed value by appending the -Xms and -Xmx parameters to the environment variable we introduced before.

The heap is not the only part of the JVM that consumes memory, so you must experiment with this setting and closely monitor the performance.

Start with the heap size that you observe to be working from the garbage collection log. If you set the heap size too large, your function will not initialize at all or break unexpectedly. Remember that the ability to tweak JVM parameters might change with future service features.

Let’s set 400 MB of the 512 MB memory and examine the results:

JAVA_TOOL_OPTIONS=-Xlog:gc:stderr:time,tags -Xms400m -Xmx400m

GC activity metrics

The preceding dashboard shows that the overall garbage collection duration was reduced by about 95%. The garbage collector had 80% fewer activities.

The garbage collection log entries displayed in the dashboard reveal that exclusively minor garbage collection (Pause Young) activities were triggered instead of major garbage collections (Pause Full). This is expected as the images are immediately discarded after the download, resize, upload operation. The effect on the overall function durations of 100 invocations, is a 5% decrease on average in this specific case.

Lambda duration

Cost estimation and clean up

Cost for the processing and transformation of your function’s Amazon CloudWatch Logs incurs when your function is called. This cost depends on your application and how often garbage collection activities are triggered. Read an estimate of the monthly cost for the search cluster. If you do not need the garbage collection monitoring anymore, delete the subscription filter from the log group of your AWS Lambda function(s). Also, delete the stack of the solution above in the AWS CloudFormation console to clean up resources.

Conclusion

In this post, we examined further sources of data to gain insights about the health of your Java application. We also demonstrated a pipeline to ingest, transform, and visualize this information continuously in a Kibana dashboard. As a next step, launch the application from the AWS Serverless Application Repository and subscribe it to your applications’ logs. Feel free to submit enhancements to the application in the aws-samples repository or provide feedback in the comments.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Security updates for Wednesday

Post Syndicated from original https://lwn.net/Articles/832276/rss

Security updates have been issued by openSUSE (libetpan, libqt4, lilypond, otrs, and perl-DBI), Red Hat (kernel-rt), Slackware (seamonkey), SUSE (grafana, libmspack, openldap2, ovmf, pdns, rubygem-actionpack-5_1, and samba), and Ubuntu (debian-lan-config, ldm, libdbi-perl, and netty-3.9).

Documented Death from a Ransomware Attack

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2020/09/documented-death-from-a-ransomware-attack.html

A Dusseldorf woman died when a ransomware attack against a hospital forced her to be taken to a different hospital in another city.

I think this is the first documented case of a cyberattack causing a fatality. UK hospitals had to redirect patients during the 2017 WannaCry ransomware attack, but there were no documented fatalities from that event.

The police are treating this as a homicide.

The collective thoughts of the interwebz

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close