Tag Archives: dbus

Combine Transactional and Analytical Data Using Amazon Aurora and Amazon Redshift

Post Syndicated from Re Alvarez-Parmar original https://aws.amazon.com/blogs/big-data/combine-transactional-and-analytical-data-using-amazon-aurora-and-amazon-redshift/

A few months ago, we published a blog post about capturing data changes in an Amazon Aurora database and sending them to Amazon Athena and Amazon QuickSight for fast analysis and visualization. In this post, I want to demonstrate how easy it can be to take the data in Aurora and combine it with data in Amazon Redshift using Amazon Redshift Spectrum.

With Amazon Redshift, you can build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Amazon Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat.

In this post, we describe how to combine data in Aurora with data in Amazon Redshift. Here’s an overview of the solution:

  • Use AWS Lambda functions with Amazon Aurora to capture data changes in a table.
  • Save the data in an Amazon S3 bucket.
  • Query data using Amazon Redshift Spectrum.

We use Amazon Aurora, AWS Lambda, Amazon Kinesis Data Firehose, Amazon S3, Amazon Redshift (with Redshift Spectrum), and Amazon QuickSight.

Serverless architecture for capturing and analyzing Aurora data changes

Consider a scenario in which an e-commerce web application uses Amazon Aurora for a transactional database layer. The company has a sales table that captures every single sale, along with a few corresponding data items. This information is stored as immutable data in a table. Business users want to monitor the sales data and then analyze and visualize it.

In this example, you take the changes to data in an Aurora database table and save them in Amazon S3. After the data is captured in Amazon S3, you combine it with data in your existing Amazon Redshift cluster for analysis.

By the end of this post, you will understand how to capture data events in an Aurora table and push them out to other AWS services using AWS Lambda.

The following diagram shows the flow of data as it occurs in this tutorial:

The starting point in this architecture is a database insert operation in Amazon Aurora. When the insert statement is executed, a custom trigger calls a Lambda function and forwards the inserted data. Lambda writes the data that it received from Amazon Aurora to a Kinesis data delivery stream. Kinesis Data Firehose writes the data to an Amazon S3 bucket. Once the data is in an Amazon S3 bucket, it is queried in place using Amazon Redshift Spectrum.

Creating an Aurora database

First, create a database by following these steps in the Amazon RDS console:

  1. Sign in to the AWS Management Console, and open the Amazon RDS console.
  2. Choose Launch a DB instance, and choose Next.
  3. For Engine, choose Amazon Aurora.
  4. Choose a DB instance class. This example uses a small instance class because this is not a production database.
  5. In Multi-AZ deployment, choose No.
  6. Configure DB instance identifier, Master username, and Master password.
  7. Launch the DB instance.

After you create the database, use MySQL Workbench to connect to the database using the CNAME from the console. For information about connecting to an Aurora database, see Connecting to an Amazon Aurora DB Cluster.

The following screenshot shows the MySQL Workbench configuration:

Next, create a table in the database by running the following SQL statement:

Create Table
CREATE TABLE Sales (
InvoiceID int NOT NULL AUTO_INCREMENT,
ItemID int NOT NULL,
Category varchar(255),
Price double(10,2), 
Quantity int not NULL,
OrderDate timestamp,
DestinationState varchar(2),
ShippingType varchar(255),
Referral varchar(255),
PRIMARY KEY (InvoiceID)
)

You can now populate the table with some sample data. To generate sample data in your table, copy and run the following script. Ensure that you replace the placeholder values (AURORA_CNAME, DBUSER, DBPASSWORD, and DB) with values appropriate for your environment.

#!/usr/bin/python
import MySQLdb
import random
import datetime

db = MySQLdb.connect(host="AURORA_CNAME",
                     user="DBUSER",
                     passwd="DBPASSWORD",
                     db="DB")

states = ("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
"IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
"NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA",
"WA","WV","WI","WY")

shipping_types = ("Free", "3-Day", "2-Day")

product_categories = ("Garden", "Kitchen", "Office", "Household")
referrals = ("Other", "Friend/Colleague", "Repeat Customer", "Online Ad")

for i in range(0,10):
    item_id = random.randint(1,100)
    state = states[random.randint(0,len(states)-1)]
    shipping_type = shipping_types[random.randint(0,len(shipping_types)-1)]
    product_category = product_categories[random.randint(0,len(product_categories)-1)]
    quantity = random.randint(1,4)
    referral = referrals[random.randint(0,len(referrals)-1)]
    price = random.randint(1,100)
    order_date = datetime.date(2016,random.randint(1,12),random.randint(1,28)).isoformat()  # cap the day at 28 so every month is valid

    data_order = (item_id, product_category, price, quantity, order_date, state,
    shipping_type, referral)

    add_order = ("INSERT INTO Sales "
                   "(ItemID, Category, Price, Quantity, OrderDate, DestinationState, \
                   ShippingType, Referral) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

    cursor = db.cursor()
    cursor.execute(add_order, data_order)

    db.commit()

cursor.close()
db.close() 

The following screenshot shows how the table appears with the sample data:

Sending data from Amazon Aurora to Amazon S3

There are two methods available to send data from Amazon Aurora to Amazon S3:

  • Using a Lambda function
  • Using SELECT INTO OUTFILE S3

To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function to send data to Amazon S3 using Amazon Kinesis Data Firehose.

Alternatively, you can use a SELECT INTO OUTFILE S3 statement to query data from an Amazon Aurora DB cluster and save it directly in text files that are stored in an Amazon S3 bucket. However, with this method, there is a delay between the time that the database transaction occurs and the time that the data is exported to Amazon S3 because the default file size threshold is 6 GB.
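If you want to see what the SELECT INTO OUTFILE S3 approach looks like, the following is a rough sketch issued from Python with MySQLdb. The bucket name is a placeholder, the Aurora cluster must already be associated with an IAM role that can write to that bucket, and the statement options shown are illustrative rather than exhaustive.

#!/usr/bin/python
import MySQLdb

# Sketch: export the Sales table directly to S3 using Aurora's SELECT INTO OUTFILE S3.
# Replace the connection placeholders and bucket name with your own values.
db = MySQLdb.connect(host="AURORA_CNAME", user="DBUSER", passwd="DBPASSWORD", db="DB")
cursor = db.cursor()
cursor.execute("""
    SELECT * FROM Sales
    INTO OUTFILE S3 's3://my-export-bucket/sales-export'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    OVERWRITE ON
""")
cursor.close()
db.close()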

Creating a Kinesis data delivery stream

The next step is to create a Kinesis data delivery stream, since it’s a dependency of the Lambda function.

To create a delivery stream:

  1. Open the Kinesis Data Firehose console.
  2. Choose Create delivery stream.
  3. For Delivery stream name, type AuroraChangesToS3.
  4. For Source, choose Direct PUT.
  5. For Record transformation, choose Disabled.
  6. For Destination, choose Amazon S3.
  7. In the S3 bucket drop-down list, choose an existing bucket, or create a new one.
  8. Enter a prefix if needed, and choose Next.
  9. For Data compression, choose GZIP.
  10. In IAM role, choose either an existing role that has access to write to Amazon S3, or choose to generate one automatically. Choose Next.
  11. Review all the details on the screen, and choose Create delivery stream when you’re finished.
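These console steps can also be scripted. The following boto3 sketch creates an equivalent delivery stream; the S3 bucket ARN and IAM role ARN are placeholders that you must replace with your own values.

import boto3

# Sketch: create the Firehose delivery stream programmatically.
# RoleARN and BucketARN below are placeholders.
firehose = boto3.client('firehose')
firehose.create_delivery_stream(
    DeliveryStreamName='AuroraChangesToS3',
    DeliveryStreamType='DirectPut',
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-s3-delivery-role',
        'BucketARN': 'arn:aws:s3:::my-firehose-bucket',
        'Prefix': 'CDC/',
        'CompressionFormat': 'GZIP',
    },
)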

 

Creating a Lambda function

Now you can create a Lambda function that is called every time there is a change that needs to be tracked in the database table. This Lambda function passes the data to the Kinesis data delivery stream that you created earlier.

To create the Lambda function:

  1. Open the AWS Lambda console.
  2. Ensure that you are in the AWS Region where your Amazon Aurora database is located.
  3. If you have no Lambda functions yet, choose Get started now. Otherwise, choose Create function.
  4. Choose Author from scratch.
  5. Give your function a name, and select Python 3.6 for Runtime.
  6. Choose an existing role or create a new one. The role needs permission to call firehose:PutRecord.
  7. Choose Next on the trigger selection screen.
  8. Paste the following code in the code window. Change the stream_name variable to the Kinesis data delivery stream that you created in the previous step.
  9. Choose File -> Save in the code editor and then choose Save.
import boto3
import json

firehose = boto3.client('firehose')
stream_name = 'AuroraChangesToS3'


def Kinesis_publish_message(event, context):
    
    firehose_data = (("%s,%s,%s,%s,%s,%s,%s,%s\n") %(event['ItemID'], 
    event['Category'], event['Price'], event['Quantity'],
    event['OrderDate'], event['DestinationState'], event['ShippingType'], 
    event['Referral']))
    
    firehose_data = {'Data': str(firehose_data)}
    print(firehose_data)
    
    firehose.put_record(DeliveryStreamName=stream_name,
    Record=firehose_data)

Note the Amazon Resource Name (ARN) of this Lambda function.
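Before wiring up the database trigger, you can exercise the function with a sample event to confirm that records reach Kinesis Data Firehose. The boto3 sketch below assumes the function is named CDCFromAuroraToKinesis (matching the ARN used later in this post); the field values are arbitrary test data.

import json
import boto3

# Send a test event to the Lambda function and print the invocation status.
lambda_client = boto3.client('lambda')
test_event = {
    "ItemID": 42, "Category": "Garden", "Price": 19.99, "Quantity": 2,
    "OrderDate": "2016-05-01 00:00:00", "DestinationState": "WA",
    "ShippingType": "3-Day", "Referral": "Online Ad",
}
response = lambda_client.invoke(
    FunctionName='CDCFromAuroraToKinesis',
    Payload=json.dumps(test_event).encode('utf-8'),
)
print(response['StatusCode'])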

Giving Aurora permissions to invoke a Lambda function

To give Amazon Aurora permissions to invoke a Lambda function, you must attach an IAM role with appropriate permissions to the cluster. For more information, see Invoking a Lambda Function from an Amazon Aurora DB Cluster.

Once you are finished, the Amazon Aurora database has access to invoke a Lambda function.
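If you prefer to script this step, the role association can also be done with boto3. In the sketch below, the cluster identifier and role ARN are placeholders; depending on your Aurora MySQL version, you typically also set the aws_default_lambda_role cluster parameter to the same ARN, as described in the linked documentation.

import boto3

# Associate an IAM role that permits lambda:InvokeFunction with the Aurora cluster.
# The identifiers below are placeholders.
rds = boto3.client('rds')
rds.add_role_to_db_cluster(
    DBClusterIdentifier='my-aurora-cluster',
    RoleArn='arn:aws:iam::123456789012:role/AuroraInvokeLambdaRole',
)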

Creating a stored procedure and a trigger in Amazon Aurora

Now, go back to MySQL Workbench, and run the following command to create a new stored procedure. When this stored procedure is called, it invokes the Lambda function you created. Change the ARN in the following code to your Lambda function’s ARN.

DROP PROCEDURE IF EXISTS CDC_TO_FIREHOSE;
DELIMITER ;;
CREATE PROCEDURE CDC_TO_FIREHOSE (IN ItemID VARCHAR(255), 
									IN Category varchar(255), 
									IN Price double(10,2),
                                    IN Quantity int(11),
                                    IN OrderDate timestamp,
                                    IN DestinationState varchar(2),
                                    IN ShippingType varchar(255),
                                    IN Referral  varchar(255)) LANGUAGE SQL 
BEGIN
  CALL mysql.lambda_async('arn:aws:lambda:us-east-1:XXXXXXXXXXXXX:function:CDCFromAuroraToKinesis', 
     CONCAT('{ "ItemID" : "', ItemID, 
            '", "Category" : "', Category,
            '", "Price" : "', Price,
            '", "Quantity" : "', Quantity, 
            '", "OrderDate" : "', OrderDate, 
            '", "DestinationState" : "', DestinationState, 
            '", "ShippingType" : "', ShippingType, 
            '", "Referral" : "', Referral, '"}')
     );
END
;;
DELIMITER ;

Create a trigger TR_Sales_CDC on the Sales table. When a new record is inserted, this trigger calls the CDC_TO_FIREHOSE stored procedure.

DROP TRIGGER IF EXISTS TR_Sales_CDC;
 
DELIMITER ;;
CREATE TRIGGER TR_Sales_CDC
  AFTER INSERT ON Sales
  FOR EACH ROW
BEGIN
  SELECT NEW.ItemID, NEW.Category, NEW.Price, NEW.Quantity, NEW.OrderDate
  , NEW.DestinationState, NEW.ShippingType, NEW.Referral
  INTO @ItemID, @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral;
  CALL  CDC_TO_FIREHOSE(@ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral);
END
;;
DELIMITER ;

If a new row is inserted in the Sales table, the Lambda function that is mentioned in the stored procedure is invoked.

Verify that data is being sent from the Lambda function to Kinesis Data Firehose to Amazon S3 successfully. You might have to insert a few records, depending on the size of your data, before new records appear in Amazon S3. This is due to Kinesis Data Firehose buffering. To learn more about Kinesis Data Firehose buffering, see the “Amazon S3” section in Amazon Kinesis Data Firehose Data Delivery.

Every time a new record is inserted in the Sales table, the trigger calls the stored procedure, which invokes the Lambda function to deliver the data to Amazon S3 through Kinesis Data Firehose.
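A quick way to verify delivery is to list the objects that Kinesis Data Firehose has written under your prefix. The bucket name and prefix in the boto3 sketch below are placeholders for the destination you configured earlier.

import boto3

# List delivered objects to confirm that changes are reaching Amazon S3.
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='my-firehose-bucket', Prefix='CDC/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])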

Querying data in Amazon Redshift

In this section, you use the data you produced from Amazon Aurora and consume it as-is in Amazon Redshift. In order to allow you to process your data as-is, where it is, while taking advantage of the power and flexibility of Amazon Redshift, you use Amazon Redshift Spectrum. You can use Redshift Spectrum to run complex queries on data stored in Amazon S3, with no need for loading or other data prep.

Just create a data source and issue your queries to your Amazon Redshift cluster as usual. Behind the scenes, Redshift Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your dataset grows to beyond an exabyte! Being able to query data that is stored in Amazon S3 means that you can scale your compute and your storage independently. You have the full power of the Amazon Redshift query model and all the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Amazon Redshift tables and in Amazon S3.

Redshift Spectrum supports open, common data formats, including CSV/TSV, Apache Parquet, SequenceFile, and RCFile. Files can be compressed using gzip or Snappy, with other formats and compression methods in the works.

First, create an Amazon Redshift cluster. Follow the steps in Launch a Sample Amazon Redshift Cluster.

Next, create an IAM role that has access to Amazon S3 and Athena. By default, Amazon Redshift Spectrum uses the Amazon Athena data catalog. Your cluster needs authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon S3.

In the demo setup, I attached AmazonS3FullAccess and AmazonAthenaFullAccess. In a production environment, the IAM roles should follow the standard security of granting least privilege. For more information, see IAM Policies for Amazon Redshift Spectrum.

Attach the newly created role to the Amazon Redshift cluster. For more information, see Associate the IAM Role with Your Cluster.
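If you would rather attach the role from code than from the console, a minimal boto3 sketch follows; the cluster identifier is a placeholder, and the role ARN should match the Spectrum role you created.

import boto3

# Attach the Spectrum IAM role to the Amazon Redshift cluster.
# ClusterIdentifier is a placeholder; use your own cluster name and role ARN.
redshift = boto3.client('redshift')
redshift.modify_cluster_iam_roles(
    ClusterIdentifier='my-redshift-cluster',
    AddIamRoles=['arn:aws:iam::XXXXXXXXXXXX:role/RedshiftSpectrumRole'],
)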

Next, connect to the Amazon Redshift cluster, and create an external schema and database:

create external schema if not exists spectrum_schema
from data catalog 
database 'spectrum_db' 
region 'us-east-1'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/RedshiftSpectrumRole'
create external database if not exists;

Don’t forget to replace the IAM role in the statement.

Then create an external table within the database:

 CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_sales(
  ItemID int,
  Category varchar,
  Price DOUBLE PRECISION,
  Quantity int,
  OrderDate TIMESTAMP,
  DestinationState varchar,
  ShippingType varchar,
  Referral varchar)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://{BUCKET_NAME}/CDC/'

Query the table, and it should contain data. This is a fact table.

select top 10 * from spectrum_schema.ecommerce_sales

 

Next, create a dimension table. For this example, we create a date/time dimension table. Create the table:

CREATE TABLE date_dimension (
  d_datekey           integer       not null sortkey,
  d_dayofmonth        integer       not null,
  d_monthnum          integer       not null,
  d_dayofweek                varchar(10)   not null,
  d_prettydate        date       not null,
  d_quarter           integer       not null,
  d_half              integer       not null,
  d_year              integer       not null,
  d_season            varchar(10)   not null,
  d_fiscalyear        integer       not null)
diststyle all;

Populate the table with data:

copy date_dimension from 's3://reparmar-lab/2016dates' 
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/redshiftspectrum'
DELIMITER ','
dateformat 'auto';
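The COPY command above loads a prepared file of 2016 dates from S3. If you need to generate a similar file yourself, the following sketch writes a CSV that matches the column order of the date_dimension table; the season and fiscal-year rules are simple assumptions for illustration, not values from the original dataset.

#!/usr/bin/env python
import csv
import datetime

# Map calendar months to seasons (an assumption for this example).
seasons = {12: "Winter", 1: "Winter", 2: "Winter", 3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer", 9: "Fall", 10: "Fall", 11: "Fall"}

with open("2016dates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    day = datetime.date(2016, 1, 1)
    while day.year == 2016:
        writer.writerow([
            int(day.strftime("%Y%m%d")),   # d_datekey
            day.day,                       # d_dayofmonth
            day.month,                     # d_monthnum
            day.strftime("%A"),            # d_dayofweek
            day.isoformat(),               # d_prettydate
            (day.month - 1) // 3 + 1,      # d_quarter
            1 if day.month <= 6 else 2,    # d_half
            day.year,                      # d_year
            seasons[day.month],            # d_season
            day.year,                      # d_fiscalyear (assumed equal to calendar year)
        ])
        day += datetime.timedelta(days=1)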

The date dimension table should look like the following:

Querying data in local and external tables using Amazon Redshift

Now that you have the fact and dimension tables populated with data, you can combine the two and run analysis. For example, to query the total sales amount by season, run the following:

select sum(quantity*price) as total_sales, date_dimension.d_season
from spectrum_schema.ecommerce_sales 
join date_dimension on spectrum_schema.ecommerce_sales.orderdate = date_dimension.d_prettydate 
group by date_dimension.d_season

You get the following results:

Similarly, you can replace d_season with d_dayofweek to get sales figures by weekday:

With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to use file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost.

Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Amazon Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Amazon Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of the supported compression algorithms in Amazon Redshift Spectrum, less data is scanned.

Analyzing and visualizing Amazon Redshift data in Amazon QuickSight

Modify the Amazon Redshift security group to allow an Amazon QuickSight connection. For more information, see Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters.

After modifying the Amazon Redshift security group, go to Amazon QuickSight. Create a new analysis, and choose Amazon Redshift as the data source.

Enter the database connection details, validate the connection, and create the data source.

Choose the schema to be analyzed. In this case, choose spectrum_schema, and then choose the ecommerce_sales table.

Next, we add a custom field for Total Sales = Price*Quantity. In the drop-down list for the ecommerce_sales table, choose Edit analysis data sets.

On the next screen, choose Edit.

In the data prep screen, choose New Field. Add a new calculated field Total Sales $, which is the product of the Price and Quantity fields. Then choose Create. Save and visualize it.

Next, to visualize total sales figures by month, create a graph with Total Sales on the x-axis and Order Date formatted as month on the y-axis.

After you’ve finished, you can use Amazon QuickSight to add different columns from your Amazon Redshift tables and perform different types of visualizations. You can build operational dashboards that continuously monitor your transactional and analytical data. You can publish these dashboards and share them with others.

Final notes

Amazon QuickSight can also read data in Amazon S3 directly. However, with the method demonstrated in this post, you have the option to manipulate, filter, and combine data from multiple sources or Amazon Redshift tables before visualizing it in Amazon QuickSight.

In this example, we dealt with data being inserted, but triggers can be activated in response to INSERT, UPDATE, or DELETE statements.

Keep the following in mind:

  • Be careful when invoking a Lambda function from triggers on tables that experience high write traffic. This would result in a large number of calls to your Lambda function. Although calls to the lambda_async procedure are asynchronous, triggers are synchronous.
  • A statement that results in a large number of trigger activations does not wait for the call to the AWS Lambda function to complete. But it does wait for the triggers to complete before returning control to the client.
  • Similarly, you must account for Amazon Kinesis Data Firehose limits. By default, Kinesis Data Firehose is limited to a maximum of 5,000 records/second. For more information, see Monitoring Amazon Kinesis Data Firehose.

In certain cases, it may be optimal to use AWS Database Migration Service (AWS DMS) to capture data changes in Aurora and use Amazon S3 as a target. For example, AWS DMS might be a good option if you don’t need to transform data from Amazon Aurora. The method used in this post gives you the flexibility to transform data from Aurora using Lambda before sending it to Amazon S3. Additionally, the architecture has the benefits of being serverless, whereas AWS DMS requires an Amazon EC2 instance for replication.

For design considerations while using Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data.

If you have questions or suggestions, please comment below.


Additional Reading

If you found this post useful, be sure to check out Capturing Data Changes in Amazon Aurora Using AWS Lambda and 10 Best Practices for Amazon Redshift Spectrum.


About the Authors

Re Alvarez-Parmar is a solutions architect for Amazon Web Services. He helps enterprises achieve success through technical guidance and thought leadership. In his spare time, he enjoys spending time with his two kids and exploring outdoors.


[$] Varlink: a protocol for IPC

Post Syndicated from jake original https://lwn.net/Articles/742675/rss

One of the motivations behind projects like kdbus and bus1, both of which have fallen short of mainline inclusion, is to have an interprocess communication (IPC) mechanism available early in the boot process. The D-Bus IPC mechanism has a daemon that cannot be started until filesystems are mounted and the like, but what if the early boot process wants to perform IPC? A new project, varlink, was recently announced; it aims to provide IPC from early boot onward, though it does not really address the longtime D-Bus performance complaints that also served as motivation for kdbus and bus1.

Federate Database User Authentication Easily with IAM and Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/federate-database-user-authentication-easily-with-iam-and-amazon-redshift/

Managing database users through federation allows you to manage authentication and authorization procedures centrally. Amazon Redshift now supports database authentication with IAM, enabling user authentication through enterprise federation. This removes the need to manage separate database users and passwords, further easing database administration. You can now manage users outside of AWS and authenticate them for access to an Amazon Redshift data warehouse. Do this by integrating IAM authentication and a third-party SAML 2.0 identity provider (IdP), such as AD FS, PingFederate, or Okta. In addition, database users can be created automatically at their first login based on corporate permissions.

In this post, I demonstrate how you can extend the federation to enable single sign-on (SSO) to the Amazon Redshift data warehouse.

SAML and Amazon Redshift

AWS supports Security Assertion Markup Language (SAML) 2.0, which is an open standard for identity federation used by many IdPs. SAML enables federated SSO, which enables your users to sign in to the AWS Management Console. Users can also make programmatic calls to AWS API actions by using assertions from a SAML-compliant IdP. For example, if you use Microsoft Active Directory for corporate directories, you may be familiar with how Active Directory and AD FS work together to enable federation. For more information, see the Enabling Federation to AWS Using Windows Active Directory, AD FS, and SAML 2.0 AWS Security Blog post.

Amazon Redshift now provides the GetClusterCredentials API operation, which allows you to generate temporary database user credentials for authentication. You can set up an IAM permissions policy that generates these credentials for connecting to Amazon Redshift. Extending the IAM authentication, you can configure federation of AWS access through a SAML 2.0–compliant IdP. An IAM role can be configured to permit federated users to call the GetClusterCredentials action and generate temporary credentials to log in to Amazon Redshift databases. You can also set up policies to restrict access to Amazon Redshift clusters, databases, database user names, and user groups.
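For example, an administrator testing the setup (or an application running with an appropriate IAM identity) can call GetClusterCredentials directly. The boto3 sketch below uses the cluster, database, user, and group names from the walkthrough that follows; the caller's IAM policy must allow the action.

import boto3

# Request temporary database credentials for a federated or IAM user.
redshift = boto3.client('redshift', region_name='us-west-2')
creds = redshift.get_cluster_credentials(
    ClusterIdentifier='examplecorp-dw',
    DbName='analytics',
    DbUser='nancy',
    DbGroups=['sales_grp'],
    AutoCreate=True,
    DurationSeconds=3600,
)
print(creds['DbUser'], creds['Expiration'])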

Amazon Redshift federation workflow

In this post, I demonstrate how you can use a JDBC- or ODBC-based SQL client to log in to the Amazon Redshift cluster using this feature. The SQL clients used with the Amazon Redshift JDBC or ODBC drivers automatically manage the process of calling the GetClusterCredentials action, retrieving the database user credentials, and establishing a connection to your Amazon Redshift database. You can also use your database application to programmatically call the GetClusterCredentials action, retrieve database user credentials, and connect to the database. I demonstrate these features using an example company to show how different database user accounts can be managed easily using federation.

The following diagram shows how the SSO process works:

  1. JDBC/ODBC
  2. Authenticate using Corp Username/Password
  3. IdP sends SAML assertion
  4. Call STS to assume role with SAML
  5. STS Returns Temp Credentials
  6. Use Temp Credentials to get Temp cluster credentials
  7. Connect to Amazon Redshift using temp credentials

Walkthrough

Example Corp. is using Active Directory (IdP host: demo.examplecorp.com) to manage federated access for users in its organization. It has an AWS account, 123456789012, and currently manages an Amazon Redshift cluster with the cluster ID “examplecorp-dw” and database “analytics” in the us-west-2 Region for its Sales and Data Science teams. It wants the following access:

  • Sales users can access the examplecorp-dw cluster using the sales_grp database group
  • Sales users access examplecorp-dw through a JDBC-based SQL client
  • Sales users access examplecorp-dw through an ODBC connection, for their reporting tools
  • Data Science users access the examplecorp-dw cluster using the data_science_grp database group.
  • Partners access the examplecorp-dw cluster and query using the partner_grp database group.
  • Partners are not federated through Active Directory and are provided with separate IAM user credentials (with IAM user name examplecorpsalespartner).
  • Partners can connect to the examplecorp-dw cluster programmatically, using language such as Python.
  • All users are automatically created in Amazon Redshift when they log in for the first time.
  • (Optional) Internal users do not specify database user or group information in their connection string. It is automatically assigned.
  • Data warehouse users can use SSO for the Amazon Redshift data warehouse using the preceding permissions.

Step 1:  Set up IdPs and federation

The Enabling Federation to AWS Using Windows Active Directory post demonstrated how to prepare Active Directory and enable federation to AWS. Using those instructions, you can establish trust between your AWS account and the IdP and enable user access to AWS using SSO.  For more information, see Identity Providers and Federation.

For this walkthrough, assume that this company has already configured SSO to their AWS account: 123456789012 for their Active Directory domain demo.examplecorp.com. The Sales and Data Science teams are not required to specify database user and group information in the connection string. The connection string can be configured by adding SAML Attribute elements to your IdP. Configuring these optional attributes enables internal users to conveniently avoid providing the DbUser and DbGroup parameters when they log in to Amazon Redshift.

The user-name attribute can be set up as follows, with a user ID (for example, nancy) or an email address (for example, nancy@examplecorp.com):

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbUser">  
  <AttributeValue>user-name</AttributeValue>
</Attribute>

The AutoCreate attribute can be defined as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/AutoCreate">
    <AttributeValue>true</AttributeValue>
</Attribute>

The sales_grp database group can be included as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbGroups">
    <AttributeValue>sales_grp</AttributeValue>
</Attribute>

For more information about attribute element configuration, see Configure SAML Assertions for Your IdP.

Step 2: Create IAM roles for access to the Amazon Redshift cluster

The next step is to create IAM policies with permissions to call GetClusterCredentials and provide authorization for Amazon Redshift resources. To grant a SQL client the ability to retrieve the cluster endpoint, region, and port automatically, include the redshift:DescribeClusters action with the Amazon Redshift cluster resource in the IAM role.  For example, users can connect to the Amazon Redshift cluster using a JDBC URL without the need to hardcode the Amazon Redshift endpoint:

Previous:  jdbc:redshift://endpoint:port/database

Current:  jdbc:redshift:iam://clustername:region/dbname

Use IAM to create the following policies. You can also use an existing user or role and assign these policies. For example, if you already created an IAM role for IdP access, you can attach the necessary policies to that role. Here is the policy created for sales users for this example:

Sales_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:userid": "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
            ]
        }
    ]
}

The policy uses the following parameter values:

  • Region: us-west-2
  • AWS Account: 123456789012
  • Cluster name: examplecorp-dw
  • Database group: sales_grp
  • IAM role: AIDIODR4TAW7CSEXAMPLE
Each statement in the policy is described below.
{
"Effect":"Allow",
"Action":[
"redshift:DescribeClusters"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
]
}

Allow users to retrieve the cluster endpoint, region, and port automatically for the Amazon Redshift cluster examplecorp-dw. This specification uses the resource format arn:aws:redshift:region:account-id:cluster:clustername. For example, the SQL client JDBC can be specified in the format jdbc:redshift:iam://clustername:region/dbname.

For more information, see Amazon Resource Names.

{
"Effect":"Allow",
"Action":[
"redshift:GetClusterCredentials"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
"arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
],
"Condition":{
"StringEquals":{
"aws:userid":"AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
}
}
}

Generates a temporary token to authenticate into the examplecorp-dw cluster. “arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}” restricts the corporate user name to the database user name for that user. This resource is specified using the format: arn:aws:redshift:region:account-id:dbuser:clustername/dbusername.

The Condition block enforces that the AWS user ID should match “AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com”, so that individual users can authenticate only as themselves. The AIDIODR4TAW7CSEXAMPLE role has the Sales_DW_IAM_Policy policy attached.

{
"Effect":"Allow",
"Action":[
"redshift:CreateClusterUser"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
]
}
Automatically creates database users in examplecorp-dw, when they log in for the first time. Subsequent logins reuse the existing database user.
{
"Effect":"Allow",
"Action":[
"redshift:JoinGroup"
],
"Resource":[
"arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp"
]
}
Allows sales users to join the sales_grp database group through the resource “arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp” that is specified in the format arn:aws:redshift:region:account-id:dbgroup:clustername/dbgroupname.

Similar policies can be created for Data Science users with access to join the data_science_grp group in examplecorp-dw. You can now attach the Sales_DW_IAM_Policy policy to the role that is mapped to the IdP application for SSO. For more information about how to define the claim rules, see Configuring SAML Assertions for the Authentication Response.

Because partners are not authorized using Active Directory, they are provided with IAM credentials and added to the partner_grp database group. The Partner_DW_IAM_Policy is attached to the IAM users for partners. The following policy allows partners to log in using the IAM user name as the database user name.

Partner_DW_IAM_Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:cluster:examplecorp-dw",
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ],
            "Condition": {
                "StringEquals": {
                    "redshift:DbUser": "${aws:username}"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateClusterUser"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup"
            ],
            "Resource": [
                "arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/partner_grp"
            ]
        }
    ]
}

The condition "redshift:DbUser": "${aws:username}" forces an IAM user to use the IAM user name as the database user name.

With the previous steps configured, you can now establish the connection to Amazon Redshift through JDBC- or ODBC-supported clients.

Step 3: Set up database user access

Before you start connecting to Amazon Redshift using the SQL client, set up the database groups for appropriate data access. Log in to your Amazon Redshift database as superuser to create a database group, using CREATE GROUP.

Log in to examplecorp-dw/analytics as superuser and create the following groups and users:

CREATE GROUP sales_grp;
CREATE GROUP datascience_grp;
CREATE GROUP partner_grp;

Use the GRANT command to define access permissions to database objects (tables/views) for the preceding groups.

Step 4: Connect to Amazon Redshift using the JDBC SQL client

Assume that sales user “nancy” is using the SQL Workbench client and JDBC driver to log in to the Amazon Redshift data warehouse. The following steps help set up the client and establish the connection:

  1. Download the latest Amazon Redshift JDBC driver from the Configure a JDBC Connection page
  2. Build the JDBC URL with the IAM option in the following format:
    jdbc:redshift:iam://examplecorp-dw:us-west-2/sales_db

Because the redshift:DescribeClusters action is assigned to the preceding IAM roles, it automatically resolves the cluster endpoints and the port. Otherwise, you can specify the endpoint and port information in the JDBC URL, as described in Configure a JDBC Connection.

Identify the following JDBC options for providing the IAM credentials (see the “Prepare your environment” section) and configure in the SQL Workbench Connection Profile:

plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider 
idp_host=demo.examplecorp.com (The name of the corporate identity provider host)
idp_port=443  (The port of the corporate identity provider host)
user=examplecorp\nancy(corporate user name)
password=***(corporate user password)

The SQL workbench configuration looks similar to the following screenshot:

Now, “nancy” can connect to examplecorp-dw by authenticating using the corporate Active Directory. Because the SAML attributes elements are already configured for nancy, she logs in as database user nancy and is assigned the sales_grp. Similarly, other Sales and Data Science users can connect to the examplecorp-dw cluster. A custom Amazon Redshift ODBC driver can also be used to connect using a SQL client. For more information, see Configure an ODBC Connection.

Step 5: Connecting to Amazon Redshift using JDBC SQL Client and IAM Credentials

This optional step is necessary only when you want to enable users that are not authenticated with Active Directory. Partners are provided with IAM credentials that they can use to connect to the examplecorp-dw Amazon Redshift cluster. These IAM users have the Partner_DW_IAM_Policy attached, which allows them to be assigned to the partner_grp database group in Amazon Redshift. The following JDBC URL enables them to connect to the Amazon Redshift cluster:

jdbc:redshift:iam://examplecorp-dw:us-west-2/analytics?AccessKeyID=XXX&SecretAccessKey=YYY&DbUser=examplecorpsalespartner&DbGroup=partner_grp&AutoCreate=true

The AutoCreate option automatically creates a new database user the first time the partner logs in. There are several other options available to conveniently specify the IAM user credentials. For more information, see Options for providing IAM credentials.

Step 6: Connecting to Amazon Redshift using an ODBC client for Microsoft Windows

Assume that another sales user “uma” is using an ODBC-based client to log in to the Amazon Redshift data warehouse using Example Corp Active Directory. The following steps help set up the ODBC client and establish the Amazon Redshift connection in a Microsoft Windows operating system connected to your corporate network:

  1. Download and install the latest Amazon Redshift ODBC driver.
  2. Create a system DSN entry.
    1. In the Start menu, locate the driver folder or folders:
      • Amazon Redshift ODBC Driver (32-bit)
      • Amazon Redshift ODBC Driver (64-bit)
      • If you installed both drivers, you have a folder for each driver.
    2. Choose ODBC Administrator, and then type your administrator credentials.
    3. To configure the driver for all users on the computer, choose System DSN. To configure the driver for your user account only, choose User DSN.
    4. Choose Add.
  3. Select the Amazon Redshift ODBC driver, and choose Finish. Configure the following attributes:
    Data Source Name =any friendly name to identify the ODBC connection 
    Database=analytics
    user=uma(corporate user name)
    Auth Type-Identity Provider: AD FS
    password=leave blank (Windows automatically authenticates)
    Cluster ID: examplecorp-dw
    idp_host=demo.examplecorp.com (The name of the corporate IdP host)

This configuration looks like the following:

  1. Choose OK to save the ODBC connection.
  2. Verify that uma is set up with the SAML attributes, as described in the “Set up IdPs and federation” section.

The user uma can now use this ODBC connection to establish the connection to the Amazon Redshift cluster using any ODBC-based tool or reporting tool such as Tableau. Internally, uma authenticates using the IAM role that has the Sales_DW_IAM_Policy attached and is assigned the sales_grp database group.

Step 7: Connecting to Amazon Redshift using Python and IAM credentials

To enable partners to connect to the examplecorp-dw cluster programmatically using Python, run the code on a computer such as an Amazon EC2 instance. Reuse the IAM users that are attached to the Partner_DW_IAM_Policy policy defined in Step 2.

The following steps show this set up on an EC2 instance:

  1. Launch a new EC2 instance with an IAM role that has the Partner_DW_IAM_Policy attached, as described in Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances. Alternatively, you can attach an existing IAM role to an EC2 instance.
  2. This example uses Python PostgreSQL Driver (PyGreSQL) to connect to your Amazon Redshift clusters. To install PyGreSQL on Amazon Linux, use the following command as the ec2-user:
    sudo easy_install pip
    sudo yum install postgresql postgresql-devel gcc python-devel
    sudo pip install PyGreSQL

  1. The following code snippet demonstrates programmatic access to Amazon Redshift for partner users:
    #!/usr/bin/env python
    """
    Usage:
    python redshift-unload-copy.py <config file> <region>
    
    * Copyright 2014, Amazon.com, Inc. or its affiliates. All Rights Reserved.
    *
    * Licensed under the Amazon Software License (the "License").
    * You may not use this file except in compliance with the License.
    * A copy of the License is located at
    *
    * http://aws.amazon.com/asl/
    *
    * or in the "license" file accompanying this file. This file is distributed
    * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
    * express or implied. See the License for the specific language governing
    * permissions and limitations under the License.
    """
    
    import sys
    import pg
    import boto3
    
    REGION = 'us-west-2'
    CLUSTER_IDENTIFIER = 'examplecorp-dw'
    DB_NAME = 'sales_db'
    DB_USER = 'examplecorpsalespartner'
    
    options = """keepalives=1 keepalives_idle=200 keepalives_interval=200
                 keepalives_count=6"""
    
    set_timeout_stmt = "set statement_timeout = 1200000"
    
    def conn_to_rs(host, port, db, usr, pwd, opt=options, timeout=set_timeout_stmt):
        rs_conn_string = """host=%s port=%s dbname=%s user=%s password=%s
                             %s""" % (host, port, db, usr, pwd, opt)
        print "Connecting to %s:%s:%s as %s" % (host, port, db, usr)
        rs_conn = pg.connect(dbname=rs_conn_string)
        rs_conn.query(timeout)
        return rs_conn
    
    def main():
        # describe the cluster and fetch the IAM temporary credentials
        global redshift_client
        redshift_client = boto3.client('redshift', region_name=REGION)
        response_cluster_details = redshift_client.describe_clusters(ClusterIdentifier=CLUSTER_IDENTIFIER)
        response_credentials = redshift_client.get_cluster_credentials(DbUser=DB_USER,DbName=DB_NAME,ClusterIdentifier=CLUSTER_IDENTIFIER,DurationSeconds=3600)
        rs_host = response_cluster_details['Clusters'][0]['Endpoint']['Address']
        rs_port = response_cluster_details['Clusters'][0]['Endpoint']['Port']
        rs_db = DB_NAME
        rs_iam_user = response_credentials['DbUser']
        rs_iam_pwd = response_credentials['DbPassword']
        # connect to the Amazon Redshift cluster
        conn = conn_to_rs(rs_host, rs_port, rs_db, rs_iam_user,rs_iam_pwd)
        # execute a query
        result = conn.query("SELECT sysdate as dt")
        # fetch results from the query
        for dt_val in result.getresult() :
            print dt_val
        # close the Amazon Redshift connection
        conn.close()
    
    if __name__ == "__main__":
        main()

You can save this Python program in a file (redshiftscript.py) and execute it at the command line as ec2-user:

python redshiftscript.py

Now partners can connect to the Amazon Redshift cluster using the Python script, and authentication is federated through the IAM user.

Summary

In this post, I demonstrated how to use federated access using Active Directory and IAM roles to enable single sign-on to an Amazon Redshift cluster. I also showed how partners outside an organization can be managed easily using IAM credentials.  Using the GetClusterCredentials API action, now supported by Amazon Redshift, lets you manage a large number of database users and have them use corporate credentials to log in. You don’t have to maintain separate database user accounts.

Although this post demonstrated the integration of IAM with AD FS and Active Directory, you can replicate this solution with your choice of SAML 2.0–compliant third-party identity providers (IdPs), such as PingFederate or Okta. For the different supported federation options, see Configure SAML Assertions for Your IdP.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to establish federated access to your AWS resources by using Active Directory user attributes.


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/732179/rss

Security updates have been issued by Debian (augeas, connman, fontforge, freeradius, git, mariadb-10.1, openjdk-7, php5, qemu, qemu-kvm, and tenshi), Fedora (augeas, libsndfile, thunderbird, and xen), Gentoo (AutoTrace and jbig2dec), Mageia (dbus, flash-player-plugin, groovy, groovy18, heimdal, kernel-linus, kmail(kdepimlibs4), libice, libmodplug, miniupnpc, and postgresql9.3/4/6), openSUSE (freeradius-server, gnome-shell, ImageMagick, and openvswitch), and SUSE (java-1_8_0-ibm, libzypp, and postgresql94).

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/730910/rss

Security updates have been issued by Debian (botan1.10, cvs, firefox-esr, iortcw, libgd2, libgxps, supervisor, and zabbix), Fedora (curl, firefox, git, jackson-databind, libgxps, libsoup, openjpeg2, potrace, python-dbusmock, spatialite-tools, and sqlite), Mageia (cacti, ffmpeg, git, heimdal, jackson-databind, kernel-linus, kernel-tmb, krb5, php-phpmailer, ruby-rubyzip, and supervisor), openSUSE (firefox, librsvg, libsoup, ncurses, and tcmu-runner), Oracle (firefox), Red Hat (java-1.8.0-ibm), Slackware (git, libsoup, mercurial, and subversion), and SUSE (kernel).

Continuous Delivery of Nested AWS CloudFormation Stacks Using AWS CodePipeline

Post Syndicated from Prakash Palanisamy original https://aws.amazon.com/blogs/devops/continuous-delivery-of-nested-aws-cloudformation-stacks-using-aws-codepipeline/

In CodePipeline Update – Build Continuous Delivery Workflows for CloudFormation Stacks, Jeff Barr discusses infrastructure as code and how to use AWS CodePipeline for continuous delivery. In this blog post, I discuss the continuous delivery of nested CloudFormation stacks using AWS CodePipeline, with AWS CodeCommit as the source repository and AWS CodeBuild as a build and testing tool. I deploy the stacks using CloudFormation change sets following a manual approval process.

Here’s how to do it:

In AWS CodePipeline, create a pipeline with four stages:

  • Source (AWS CodeCommit)
  • Build and Test (AWS CodeBuild and AWS CloudFormation)
  • Staging (AWS CloudFormation and manual approval)
  • Production (AWS CloudFormation and manual approval)

Pipeline stages, the actions in each stage, and transitions between stages are shown in the following diagram.

CloudFormation templates, test scripts, and the build specification are stored in AWS CodeCommit repositories. These files are used in the Source stage of the pipeline in AWS CodePipeline.

The AWS::CloudFormation::Stack resource type is used to create child stacks from a master stack. The CloudFormation stack resource requires the templates of the child stacks to be stored in an S3 bucket. The location of each template file is provided as a URL in the properties section of the resource definition.

The following template creates three child stacks:

  • Security (IAM, security groups).
  • Database (an RDS instance).
  • Web stacks (EC2 instances in an Auto Scaling group, elastic load balancer).

Description: Master stack which creates all required nested stacks

Parameters:
  TemplatePath:
    Type: String
    Description: S3Bucket Path where the templates are stored
  VPCID:
    Type: "AWS::EC2::VPC::Id"
    Description: Enter a valid VPC Id
  PrivateSubnet1:
    Type: "AWS::EC2::Subnet::Id"
    Description: Enter a valid SubnetId of private subnet in AZ1
  PrivateSubnet2:
    Type: "AWS::EC2::Subnet::Id"
    Description: Enter a valid SubnetId of private subnet in AZ2
  PublicSubnet1:
    Type: "AWS::EC2::Subnet::Id"
    Description: Enter a valid SubnetId of public subnet in AZ1
  PublicSubnet2:
    Type: "AWS::EC2::Subnet::Id"
    Description: Enter a valid SubnetId of public subnet in AZ2
  S3BucketName:
    Type: String
    Description: Name of the S3 bucket to allow access to the Web Server IAM Role.
  KeyPair:
    Type: "AWS::EC2::KeyPair::KeyName"
    Description: Enter a valid KeyPair Name
  AMIId:
    Type: "AWS::EC2::Image::Id"
    Description: Enter a valid AMI ID to launch the instance
  WebInstanceType:
    Type: String
    Description: Enter one of the possible instance type for web server
    AllowedValues:
      - t2.large
      - m4.large
      - m4.xlarge
      - c4.large
  WebMinSize:
    Type: String
    Description: Minimum number of instances in auto scaling group
  WebMaxSize:
    Type: String
    Description: Maximum number of instances in auto scaling group
  DBSubnetGroup:
    Type: String
    Description: Enter a valid DB Subnet Group
  DBUsername:
    Type: String
    Description: Enter a valid Database master username
    MinLength: 1
    MaxLength: 16
    AllowedPattern: "[a-zA-Z][a-zA-Z0-9]*"
  DBPassword:
    Type: String
    Description: Enter a valid Database master password
    NoEcho: true
    MinLength: 1
    MaxLength: 41
    AllowedPattern: "[a-zA-Z0-9]*"
  DBInstanceType:
    Type: String
    Description: Enter one of the possible instance type for database
    AllowedValues:
      - db.t2.micro
      - db.t2.small
      - db.t2.medium
      - db.t2.large
  Environment:
    Type: String
    Description: Select the appropriate environment
    AllowedValues:
      - dev
      - test
      - uat
      - prod

Resources:
  SecurityStack:
    Type: "AWS::CloudFormation::Stack"
    Properties:
      TemplateURL:
        Fn::Sub: "https://s3.amazonaws.com/${TemplatePath}/security-stack.yml"
      Parameters:
        S3BucketName:
          Ref: S3BucketName
        VPCID:
          Ref: VPCID
        Environment:
          Ref: Environment
      Tags:
        - Key: Name
          Value: SecurityStack

  DatabaseStack:
    Type: "AWS::CloudFormation::Stack"
    Properties:
      TemplateURL:
        Fn::Sub: "https://s3.amazonaws.com/${TemplatePath}/database-stack.yml"
      Parameters:
        DBSubnetGroup:
          Ref: DBSubnetGroup
        DBUsername:
          Ref: DBUsername
        DBPassword:
          Ref: DBPassword
        DBServerSecurityGroup:
          Fn::GetAtt: SecurityStack.Outputs.DBServerSG
        DBInstanceType:
          Ref: DBInstanceType
        Environment:
          Ref: Environment
      Tags:
        - Key: Name
          Value:   DatabaseStack

  ServerStack:
    Type: "AWS::CloudFormation::Stack"
    Properties:
      TemplateURL:
        Fn::Sub: "https://s3.amazonaws.com/${TemplatePath}/server-stack.yml"
      Parameters:
        VPCID:
          Ref: VPCID
        PrivateSubnet1:
          Ref: PrivateSubnet1
        PrivateSubnet2:
          Ref: PrivateSubnet2
        PublicSubnet1:
          Ref: PublicSubnet1
        PublicSubnet2:
          Ref: PublicSubnet2
        KeyPair:
          Ref: KeyPair
        AMIId:
          Ref: AMIId
        WebSG:
          Fn::GetAtt: SecurityStack.Outputs.WebSG
        ELBSG:
          Fn::GetAtt: SecurityStack.Outputs.ELBSG
        DBClientSG:
          Fn::GetAtt: SecurityStack.Outputs.DBClientSG
        WebIAMProfile:
          Fn::GetAtt: SecurityStack.Outputs.WebIAMProfile
        WebInstanceType:
          Ref: WebInstanceType
        WebMinSize:
          Ref: WebMinSize
        WebMaxSize:
          Ref: WebMaxSize
        Environment:
          Ref: Environment
      Tags:
        - Key: Name
          Value: ServerStack

Outputs:
  WebELBURL:
    Description: "URL endpoint of web ELB"
    Value:
      Fn::GetAtt: ServerStack.Outputs.WebELBURL

During the Validate stage, AWS CodeBuild checks for changes to the AWS CodeCommit source repositories. It uses the ValidateTemplate API to validate the CloudFormation template and copies the child templates and configuration files to the appropriate location in the S3 bucket.

The following AWS CodeBuild build specification validates the CloudFormation templates listed under the TEMPLATE_FILES environment variable and copies them to the S3 bucket specified in the TEMPLATE_BUCKET environment variable in the AWS CodeBuild project. Optionally, you can use the TEMPLATE_PREFIX environment variable to specify a path inside the bucket. This updates the configuration files to use the location of the child template files. The location of the template files is provided as a parameter to the master stack.

version: 0.1

environment_variables:
  plaintext:
    CHILD_TEMPLATES: |
      security-stack.yml
      server-stack.yml
      database-stack.yml
    TEMPLATE_FILES: |
      master-stack.yml
      security-stack.yml
      server-stack.yml
      database-stack.yml
    CONFIG_FILES: |
      config-prod.json
      config-test.json
      config-uat.json

phases:
  install:
    commands:
      - npm install jsonlint -g
  pre_build:
    commands:
      - echo "Validating CFN templates"
      - |
        for cfn_template in $TEMPLATE_FILES; do
          echo "Validating CloudFormation template file $cfn_template"
          aws cloudformation validate-template --template-body file://$cfn_template
        done
      - |
        for conf in $CONFIG_FILES; do
          echo "Validating CFN parameters config file $conf"
          jsonlint -q $conf
        done
  build:
    commands:
      - echo "Copying child stack templates to S3"
      - |
        for child_template in $CHILD_TEMPLATES; do
          if [ "X$TEMPLATE_PREFIX" = "X" ]; then
            aws s3 cp "$child_template" "s3://$TEMPLATE_BUCKET/$child_template"
          else
            aws s3 cp "$child_template" "s3://$TEMPLATE_BUCKET/$TEMPLATE_PREFIX/$child_template"
          fi
        done
      - echo "Updating template configurtion files to use the appropriate values"
      - |
        for conf in $CONFIG_FILES; do
          if [ "X$TEMPLATE_PREFIX" = "X" ]; then
            echo "Replacing \"TEMPLATE_PATH_PLACEHOLDER\" for \"$TEMPLATE_BUCKET\" in $conf"
            sed -i -e "s/TEMPLATE_PATH_PLACEHOLDER/$TEMPLATE_BUCKET/" $conf
          else
            echo "Replacing \"TEMPLATE_PATH_PLACEHOLDER\" for \"$TEMPLATE_BUCKET/$TEMPLATE_PREFIX\" in $conf"
            sed -i -e "s/TEMPLATE_PATH_PLACEHOLDER/$TEMPLATE_BUCKET\/$TEMPLATE_PREFIX/" $conf
          fi
        done

artifacts:
  files:
    - master-stack.yml
    - config-*.json

After the template files are copied to S3, CloudFormation creates a test stack and triggers AWS CodeBuild as a test action.

Then the AWS CodeBuild build specification executes validate-env.py, the Python script used to determine whether resources created using the nested CloudFormation stacks conform to the specifications provided in the CONFIG_FILE.

version: 0.1

environment_variables:
  plaintext:
    CONFIG_FILE: env-details.yml

phases:
  install:
    commands:
      - pip install --upgrade pip
      - pip install boto3 --upgrade
      - pip install pyyaml --upgrade
      - pip install yamllint --upgrade
  pre_build:
    commands:
      - echo "Validating config file $CONFIG_FILE"
      - yamllint $CONFIG_FILE
  build:
    commands:
      - echo "Validating resources..."
      - python validate-env.py
      - exit $?
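
The actual validate-env.py script is available in the GitHub repository referenced at the end of this post. As a rough illustration of the kind of check such a script performs, the sketch below reads an expected resource list from a config file and compares it against what CloudFormation actually created; the config file name is reused from the build specification, but its keys and format here are assumptions, not the real script's schema.

#!/usr/bin/env python
import sys
import boto3
import yaml

def main():
    # Assumed config format: a stack name plus the resource types expected in it.
    config = yaml.safe_load(open("env-details.yml"))
    stack_name = config["stack_name"]
    expected = set(config["expected_resource_types"])

    cfn = boto3.client("cloudformation")
    found = set()
    for page in cfn.get_paginator("list_stack_resources").paginate(StackName=stack_name):
        for resource in page["StackResourceSummaries"]:
            found.add(resource["ResourceType"])

    missing = expected - found
    if missing:
        print("Missing resource types: %s" % ", ".join(sorted(missing)))
        sys.exit(1)
    print("All expected resource types are present.")

if __name__ == "__main__":
    main()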

Upon successful completion of the test action, CloudFormation deletes the test stack and proceeds to the UAT stage in the pipeline.

During this stage, CloudFormation creates a change set against the UAT stack and then executes the change set. This updates the UAT environment and makes it available for acceptance testing. The process continues to a manual approval action. After the QA team validates the UAT environment and provides an approval, the process moves to the Production stage in the pipeline.

During this stage, CloudFormation creates a change set for the nested production stack and the process continues to a manual approval step. Upon approval (usually by a designated executive), the change set is executed and the production deployment is completed.
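The change set creation and execution in the UAT and Production stages are handled by the pipeline's built-in CloudFormation actions, so no custom code is required. For readers who want to see the equivalent API calls, the following boto3 sketch shows what those actions do; the stack name, change set name, template URL, and parameter name are placeholders.

import boto3

cfn = boto3.client('cloudformation')

# Create a change set against the existing UAT stack (placeholder names)
cfn.create_change_set(
    StackName='uat-master-stack',
    ChangeSetName='uat-release-changeset',
    ChangeSetType='UPDATE',
    TemplateURL='https://s3.amazonaws.com/<TEMPLATE_BUCKET>/master-stack.yml',
    Parameters=[{'ParameterKey': 'TemplatePath',
                 'ParameterValue': '<TEMPLATE_BUCKET>/<TEMPLATE_PREFIX>'}],
    Capabilities=['CAPABILITY_NAMED_IAM'],
)

# Wait until the change set has been created, then review and execute it
cfn.get_waiter('change_set_create_complete').wait(
    StackName='uat-master-stack', ChangeSetName='uat-release-changeset')
cfn.execute_change_set(
    StackName='uat-master-stack', ChangeSetName='uat-release-changeset')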
 

Setting up a continuous delivery pipeline

 
I used a CloudFormation template to set up my continuous delivery pipeline. The codepipeline-cfn-codebuild.yml template, available from GitHub, sets up a full-featured pipeline.

When I use the template to create my pipeline, I specify the following:

  • AWS CodeCommit repositories.
  • SNS topics to send approval notifications.
  • S3 bucket name where the artifacts will be stored.

The CFNTemplateRepoName parameter points to the AWS CodeCommit repository where the CloudFormation templates, configuration files, and build specification files are stored.
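If you prefer to create the pipeline stack from code rather than the console, the same parameters can be passed with boto3, as in the sketch below. CFNTemplateRepoName is the parameter described above; the other parameter names and values are placeholders, so check the codepipeline-cfn-codebuild.yml template for the exact names it expects.

import boto3

cfn = boto3.client('cloudformation')

with open('codepipeline-cfn-codebuild.yml') as f:
    template_body = f.read()

cfn.create_stack(
    StackName='nested-cfn-delivery-pipeline',
    TemplateBody=template_body,
    Parameters=[
        {'ParameterKey': 'CFNTemplateRepoName', 'ParameterValue': '<codecommit-repo-name>'},
        # Placeholder parameter names for the approval topic and artifact bucket
        {'ParameterKey': 'NotificationSNSTopic', 'ParameterValue': '<sns-topic-arn>'},
        {'ParameterKey': 'ArtifactStoreS3Bucket', 'ParameterValue': '<artifact-bucket-name>'},
    ],
    Capabilities=['CAPABILITY_NAMED_IAM'],
)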

My repo contains the following files:

The continuous delivery pipeline is ready just seconds after you choose Create Stack. After it’s created, the pipeline executes each stage. Upon manual approvals for the UAT and Production stages, the pipeline successfully enables continuous delivery.


 

Implementing a change in nested stack

 
To make changes to a child stack in a nested stack (for example, to update a parameter value or add or change resources), update the master stack. The changes must be made in the appropriate template or configuration files and then checked in to the AWS CodeCommit repository. This triggers the following deployment process:

 

Conclusion

 
In this post, I showed how you can use AWS CodePipeline, AWS CloudFormation, AWS CodeBuild, and a manual approval process to create a continuous delivery pipeline for both infrastructure as code and application deployment.

For more information about AWS CodePipeline, see the AWS CodePipeline documentation. You can get started in just a few clicks. All CloudFormation templates, AWS CodeBuild build specification files, and the Python script that performs the validation are available in the codepipeline-nested-cfn GitHub repository.


About the author

 
Prakash Palanisamy is a Solutions Architect for Amazon Web Services. When he is not working on Serverless, DevOps or Alexa, he will be solving problems in Project Euler. He also enjoys watching educational documentaries.

Konecny: Anaconda modularisation

Post Syndicated from jake original https://lwn.net/Articles/725623/rss

On his blog, Jiri Konecny writes about plans for modularizing Anaconda, which is the installer for Fedora and other Linux distributions. Anaconda is written in Python 3, but is all contained in one monolithic program.
“The current Anaconda has one significant problem: all of the code is in one place–the monolith. It is more difficult to trace bugs and to have a stable API. Implementing new features or modifying existing code in Anaconda is also more challenging. Modularisation should help with these things mainly because of isolation between the modules. It will be much easier to create tests for modules or to add new functionality.

Modularisation also opens up new possibilities to developers. They should be able to create a new user interface easily. Since developers can rely on the existing API documentation, it should not be necessary to browse the source code tree very often. Another benefit is that an addon is like another module, communicating with other modules, so it has the same capabilities. Developers can use the public API to write their addons in their favourite programming language which supports DBus.”

Security updates for Tuesday

Post Syndicated from ris original https://lwn.net/Articles/718262/rss

Security updates have been issued by Debian (eject, gst-plugins-bad1.0, gst-plugins-base1.0, gst-plugins-good1.0, gst-plugins-ugly1.0, gstreamer1.0, php5, and tiff), Fedora (kernel), Gentoo (curl, deluge, libtasn1, and xen-tools), Mageia (mbedtls, putty, and roundcubemail), openSUSE (dbus-1, gegl, mxml, open-vm-tools, partclone, qbittorrent, tcpreplay, and xtrabackup), and Ubuntu (eject, gst-plugins-base0.10, gst-plugins-base1.0, and gst-plugins-good0.10, gst-plugins-good1.0).

Security updates for Wednesday

Post Syndicated from ris original http://lwn.net/Articles/711316/rss

Debian has updated icedove (multiple vulnerabilities).

Debian-LTS has updated tomcat7 (information disclosure).

Gentoo has updated bind (denial
of service), botan (two vulnerabilities),
c-ares (code execution), dbus (denial of service), expat (multiple vulnerabilities, one from
2012), flex (code execution), nginx (privilege escalation), ntfs3g (privilege escalation from 2015), p7zip (two code execution flaws), pgbouncer (two vulnerabilities), phpBB (two vulnerabilities), phpmyadmin (multiple vulnerabilities), vim (code execution), and vzctl (insecure ploop-based containers from 2015).

openSUSE has updated jasper
(42.2, 42.1: multiple vulnerabilities).

Oracle has updated kernel (OL6: three vulnerabilities).

Red Hat has updated flash-plugin
(RHEL6: multiple vulnerabilities), kernel
(RHEL6.7: code execution), and kernel
(RHEL6: three vulnerabilities).

SUSE has updated freeradius-server (SLE12-SP1,2: insufficient
certificate verification) and LibVNCServer
(SLE11-SP4: two vulnerabilities).

Ubuntu has updated kernel (16.10; 16.04;
14.04; 12.04: multiple vulnerabilities), linux-lts-trusty (12.04: multiple
vulnerabilities), linux-lts-xenial (14.04:
three vulnerabilities), linux-raspi2 (16.10; 16.04:
two vulnerabilities), linux-snapdragon
(16.04: two vulnerabilities), linux-ti-omap4 (12.04: two vulnerabilities),
and webkit2gtk (16.04: multiple vulnerabilities).

My WATCH runs GNU/Linux And It Is Amazing (LearntEmail)

Post Syndicated from jake original http://lwn.net/Articles/710914/rss

The LearntEmail blog has a look at running AsteroidOS on the LG Watch Urbane smartwatch.
“It looks like a watch, it smells like a watch, but it runs like a normal computer. Wayland, systemd, polkit, dbus and friends look very friendly to hacking. Even Qt is better than android, but that’s debatable.

My next project – run Gtk+ on the watch 🙂” (Thanks to Paul Wise.)

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning

Post Syndicated from Radhika Ravirala original https://aws.amazon.com/blogs/big-data/powering-amazon-redshift-analytics-with-apache-spark-and-amazon-machine-learning/

Air travel can be stressful due to the many factors that are simply out of airline passengers’ control. As passengers, we want to minimize this stress as much as we can. We can do this by using past data to make predictions about how likely a flight will be delayed based on the time of day or the airline carrier.

In this post, we generate a predictive model for flight delays that can be used to help us pick the flight least likely to add to our travel stress. To accomplish this, we will use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models. This solution gives a good example of combining multiple AWS services to build a sophisticated analytical application in the AWS Cloud.

Architecture

At a high level, our solution includes the following steps:

Step 1 is to ingest datasets:

Step 2 is to enrich data by using ETL:

  • We will transform the maximum and minimum temperature columns from Celsius to Fahrenheit in the weather table in Hive by using a user-defined function in Spark.
  • We enrich the flight data in Amazon Redshift to compute and include extra features and columns (departure hour, days to the nearest holiday) that will help the Amazon Machine Learning algorithm’s learning process.
  • We then combine both datasets in the Spark environment by using the spark-redshift package to load data from the Amazon Redshift cluster into Spark running on an Amazon EMR cluster. We write the enriched data back to an Amazon Redshift table using the spark-redshift package.

Step 3 is to perform predictive analytics:

  • In this last step, we use Amazon Machine Learning to create and train an ML model using Amazon Redshift as our data source. The trained Amazon ML model is used to generate predictions for the test dataset, which are output to an S3 bucket.

o_powering_redshift_analytics_1

Building the solution

Working with datasets

In this example, we use historical weather data for Chicago’s O’Hare International Airport station from 2013 to 2014.

Let’s begin by launching an EMR cluster with Apache Hive and Apache Spark. We use the following Hive script to create the raw weather table, into which we load the weather data from an S3 bucket in the next step.

CREATE EXTERNAL TABLE IF NOT EXISTS weather_raw (
  station       string,
  station_name  string,
  elevation     string,
  latitude      string,
  longitude     string,
  wdate         string,
  prcp          decimal(5,1),
  snow          int,
  tmax          string,
  tmin          string,
  awnd          string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3://<your-bucket>/weather/'
TBLPROPERTIES("skip.header.line.count"="1");

CREATE TABLE weather AS SELECT station, station_name, elevation, latitude, longitude, 
                   cast(concat(
                                substr(wdate,1,4), '-', 
                                substr(wdate,5,2), '-', 
                                substr(wdate,7,2) 
                              ) AS date) AS dt, prcp, snow, tmax, tmin, awnd FROM weather_raw;

set hive.cli.print.header=true;

select * from weather limit 10;

Next, we create a single-node Amazon Redshift cluster and copy the flight data from the S3 bucket. Make sure to associate the Amazon Redshift cluster with an AWS Identity and Access Management (IAM) role, which will allow Amazon Redshift to assume a role when reading data from S3.

aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dc1.large \
  --cluster-type single-node \
  --iam-roles "arn:aws:iam::YOUR-AWS-ACCOUNT:role/<redshift-iam-role>" \
  --master-username master \
  --master-user-password REDSHIFT-MASTER-PASSWORD \
  --publicly-accessible \
  --port 5439

In this example, we will load flight data for flights originating in Chicago’s O’Hare International Airport from 2013 to 2014. You can use most SQL client tools with Amazon Redshift JDBC or ODBC drivers to connect to an Amazon Redshift cluster. For instructions on installing SQL Workbench/J for this use, see the Amazon Redshift documentation.
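If you would rather run the SQL statements in this section from a script instead of a GUI client, any PostgreSQL-compatible driver works against Amazon Redshift. The following minimal sketch uses psycopg2; the endpoint and password are placeholders for your own cluster, and it is not part of the original walkthrough.

import psycopg2

# Connect to the Amazon Redshift cluster (placeholders for your own values)
conn = psycopg2.connect(
    host='<redshift-cluster-endpoint>',
    port=5439,
    dbname='demo',
    user='master',
    password='<REDSHIFT-MASTER-PASSWORD>')
conn.autocommit = True

with conn.cursor() as cur:
    # Pass any of the DDL or COPY statements shown in this post to cur.execute();
    # here we just run a trivial sanity query
    cur.execute("select 1;")
    print(cur.fetchone())

conn.close()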

First, we create a table with data for all flights originating in Chicago in December 2013.

create table ord_flights
(
ID  			bigint identity(0,1),
YEAR 			smallint,
QUARTER		smallint,
MONTH 			smallint,
DAY_OF_MONTH 		smallint,
DAY_OF_WEEK		smallint,
FL_DATE		date, 
UNIQUE_CARRIER 	varchar(10),
AIRLINE_ID 		int,
CARRIER		varchar(4),
TAIL_NUM 		varchar(8),
FL_NUM 		varchar(4),
ORIGIN_AIRPORT_ID 	smallint,
ORIGIN 		varchar(5),
ORIGIN_CITY_NAME  	varchar(35),
ORIGIN_STATE_ABR 	varchar(2),
ORIGIN_STATE_NM 	varchar(50),
ORIGIN_WAC 		varchar(2),
DEST_AIRPORT_ID	smallint,
DEST 			varchar(5),
DEST_CITY_NAME 	varchar(35),
DEST_STATE_ABR 	varchar(2),
DEST_STATE_NM 	varchar(50),
DEST_WAC 		varchar(2),
CRS_DEP_TIME 		smallint,
DEP_TIME 		varchar(6),
DEP_DELAY 		numeric(22,6),
DEP_DELAY_NEW 	numeric(22,6),
DEP_DEL15 		numeric(22,6),
DEP_DELAY_GROUP	smallint,
DEP_TIME_BLK 		varchar(15),
TAXI_OUT 		numeric(22,6),
TAXI_IN 		numeric(22,6),
CRS_ARR_TIME 		numeric(22,6),
ARR_TIME 		varchar(6),
ARR_DELAY 		numeric(22,6),
ARR_DELAY_NEW 	numeric(22,6),
ARR_DEL15 		numeric(22,6),
ARR_DELAY_GROUP 	smallint,
ARR_TIME_BLK 		varchar(15),
CANCELLED 		numeric(22,6),
DIVERTED 		numeric(22,6),
CRS_ELAPSED_TIME 	numeric(22,6),
ACTUAL_ELAPSED_TIME numeric(22,6),
AIR_TIME 		numeric(22,6), 
FLIGHTS		numeric(22,6),
DISTANCE 		numeric(22,6),
DISTANCE_GROUP 	numeric(22,6),
CARRIER_DELAY 	numeric(22,6),
WEATHER_DELAY 	numeric(22,6),
NAS_DELAY 		numeric(22,6),
SECURITY_DELAY 	numeric(22,6),
LATE_AIRCRAFT_DELAY numeric(22,6),
primary key (id)
);

Next, we ensure that the flight data residing in the S3 bucket is in the same region as the Amazon Redshift cluster. We then run the following copy command.

-- Copy all flight data for Dec 2013 and 2014 from S3 bucket
copy ord_flights 
FROM 's3://<path-to-your-faa-data-bucket>/T_ONTIME.csv' credentials 'aws_iam_role=arn:aws:iam::YOUR-AWS-ACCOUNT:role/<RedshiftIAMRole>' csv IGNOREHEADER 1;

Data enrichment

Businesses have myriad data sources to integrate, normalize, and make consumable by downstream analytic processes. To add to the complications, ingested data is rarely useful in its original form for analysis. Workflows often involve using either open source tools such as Spark or commercially available tools to enrich the ingested data. In this example, we will perform simple transformations on the weather data to convert the air temperature (tmax and tmin) from Celsius to Fahrenheit using a user-defined function (UDF) in the Spark environment.

Spark

// Read weather table from hive
val rawWeatherDF = sqlContext.table("weather")

// Retrieve the header
val header = rawWeatherDF.first()

// Remove the header from the dataframe
val noHeaderWeatherDF = rawWeatherDF.filter(row => row != header)

// UDF to convert the air temperature from celsius to fahrenheit
val toFahrenheit = udf {(c: Double) => c * 9 / 5 + 32}

// Apply the UDF to maximum and minimum air temperature
val weatherDF = noHeaderWeatherDF.withColumn("new_tmin", toFahrenheit(noHeaderWeatherDF("tmin")))
                                     .withColumn("new_tmax", toFahrenheit(noHeaderWeatherDF("tmax")))
                                     .drop("tmax")
                                     .drop("tmin")
                                     .withColumnRenamed("new_tmax","tmax")
                                     .withColumnRenamed("new_tmin","tmin")

Data manipulation also consists of excluding or including data fields and columns that are meaningful to the analysis being performed. Because our analysis tries to predict whether a flight will be delayed or not, we will include only columns that help make that determination.

To that end, we will keep only a few columns from the flight table and add ‘departure hour’ and ‘days from the nearest holiday’ columns, which capture factors that might contribute to flight delay. You can compute ‘days from the nearest holiday’ by using an Amazon Redshift user-defined function (UDF), which allows you to write a scalar function in Python 2.7. We also use the pandas library’s USFederalHolidayCalendar to create a list of federal holidays for the years used in this example (see the code snippet following).

Amazon Redshift UDF

First, create a Python UDF to compute number of days before or after the nearest holiday.

create or replace function f_days_from_holiday (year int, month int, day int)
returns int
stable
as $$
  import datetime
  from datetime import date
  import dateutil
  from dateutil.relativedelta import relativedelta

  fdate = date(year, month, day)

  fmt = '%Y-%m-%d'
  s_date = fdate - dateutil.relativedelta.relativedelta(days=7)
  e_date = fdate + relativedelta(months=1)
  start_date = s_date.strftime(fmt)
  end_date = e_date.strftime(fmt)

  """
  Compute a list of holidays over a period (7 days before, 1 month after) for the flight date
  """
  from pandas.tseries.holiday import USFederalHolidayCalendar
  calendar = USFederalHolidayCalendar()
  holidays = calendar.holidays(start_date, end_date)
  days_from_closest_holiday = [(abs(fdate - hdate)).days for hdate in holidays.date.tolist()]
  return min(days_from_closest_holiday)
$$ language plpythonu;
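Because the function body is ordinary Python, you can sanity-check the logic locally before creating the UDF in Amazon Redshift. The short script below mirrors the UDF with pandas on your workstation; it is a convenience check only and not part of the original workflow.

from datetime import date
from dateutil.relativedelta import relativedelta
from pandas.tseries.holiday import USFederalHolidayCalendar

def days_from_holiday(year, month, day):
    # Local mirror of the f_days_from_holiday UDF, for testing only
    fdate = date(year, month, day)
    start_date = (fdate - relativedelta(days=7)).strftime('%Y-%m-%d')
    end_date = (fdate + relativedelta(months=1)).strftime('%Y-%m-%d')
    holidays = USFederalHolidayCalendar().holidays(start_date, end_date)
    return min(abs(fdate - hdate).days for hdate in holidays.date.tolist())

# December 20, 2013 is five days before Christmas, so this prints 5
print(days_from_holiday(2013, 12, 20))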

We are now ready to combine the weather data in Hive table with the flights data in our Amazon Redshift table using the spark-redshift package, to demonstrate the data enrichment process.

Working with the spark-redshift package

Databricks’ spark-redshift package is a library that loads data into Spark SQL DataFrames from Amazon Redshift and also saves DataFrames back into Amazon Redshift tables. The library uses the Spark SQL Data Sources API to integrate with Amazon Redshift. This approach makes it easy to integrate large datasets from an Amazon Redshift database with datasets from other data sources, and to interoperate seamlessly with disparate input sources such as Hive tables (as in this use case), columnar Parquet files on HDFS or S3, and many others.

The following is an illustration of spark-redshift’s load and save functionality:

o_powering_redshift_analytics_2

Prerequisites

To work with the spark-redshift package, you will need to download the following .jar files onto your EMR cluster running Spark. Alternatively, you can clone the Git repository and build the .jar files from the sources. For this example, we ran EMR version 5.0 with Spark 2.0. Ensure that you download the right versions of the .jar files based on the version of Spark you use.

spark-redshift jar

wget http://repo1.maven.org/maven2/com/databricks/spark-redshift_2.10/2.0.0/spark-redshift_2.10-2.0.0.jar

spark-avro jar

wget http://repo1.maven.org/maven2/com/databricks/spark-avro_2.11/3.0.0/spark-avro_2.11-3.0.0.jar

minimal-json.jar

wget https://github.com/ralfstx/minimal-json/releases/download/0.9.4/minimal-json-0.9.4.jar

In addition to the .jar files listed preceding, you will also need the Amazon Redshift JDBC driver to connect to the Amazon Redshift cluster. Fortunately, the driver is already included in EMR version 4.7.0 and above.

Connecting to Amazon Redshift cluster from your Spark environment

Launch spark-shell with the prerequisites downloaded from the previous section.

spark-shell --jars spark-redshift_2.10-2.0.0.jar,/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar,minimal-json-0.9.4.jar,spark-avro_2.11-3.0.0.jar

Once in spark-shell, we run through the following steps to connect to our Amazon Redshift cluster.

/**
  * Example to demonstrate combining tables in Hive and Amazon Redshift for data enrichment.
  */

import org.apache.spark.sql._
import com.amazonaws.auth._
import com.amazonaws.auth.AWSCredentialsProvider
import com.amazonaws.auth.AWSSessionCredentials
import com.amazonaws.auth.InstanceProfileCredentialsProvider
import com.amazonaws.services.redshift.AmazonRedshiftClient
import _root_.com.amazon.redshift.jdbc41.Driver
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SQLContext



// Instance Profile for authentication to AWS resources
val provider = new InstanceProfileCredentialsProvider();
val credentials: AWSSessionCredentials = provider.getCredentials.asInstanceOf[AWSSessionCredentials];
val token = credentials.getSessionToken;
val awsAccessKey = credentials.getAWSAccessKeyId;
val awsSecretKey = credentials.getAWSSecretKey

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._;

// Read weather table from hive
val rawWeatherDF = sqlContext.table("weather")

// Retrieve the header
val header = rawWeatherDF.first()

// Remove the header from the dataframe
val noHeaderWeatherDF = rawWeatherDF.filter(row => row != header)

// UDF to convert the air temperature from celsius to fahrenheit
val toFahrenheit = udf {(c: Double) => c * 9 / 5 + 32}

// Apply the UDF to maximum and minimum air temperature
val weatherDF = noHeaderWeatherDF.withColumn("new_tmin", toFahrenheit(noHeaderWeatherDF("tmin")))
                                 .withColumn("new_tmax", toFahrenheit(noHeaderWeatherDF("tmax")))
                                 .drop("tmax")
                                 .drop("tmin")
                                 .withColumnRenamed("new_tmax","tmax")
                                 .withColumnRenamed("new_tmin","tmin")

// Provide the jdbc url for Amazon Redshift
val jdbcURL = "jdbc:redshift://<redshift-cluster-name>:<port/<database-name>?user=<dbuser>&password=<dbpassword>"

// Create and declare an S3 bucket where the temporary files are written
val s3TempDir = "s3://<S3TempBucket>/"

Reading from Amazon Redshift

To read from Amazon Redshift, spark-redshift executes an Amazon Redshift UNLOAD command that copies an Amazon Redshift table or the results of a query to a temporary S3 bucket that you provide. Then spark-redshift reads the temporary S3 input files and generates a DataFrame instance that you can manipulate in your application.

In this case, we will issue a SQL query against the ord_flights table in the Amazon Redshift cluster from Spark to read only the columns we are interested in. Notice that during this process, we are computing the days_to_holiday column using the Amazon Redshift UDF that we discussed in the earlier section.

// Query against the ord_flights table in Amazon Redshift
val flightsQuery = """
                    select ORD_DELAY_ID, DAY_OF_MONTH, DAY_OF_WEEK, FL_DATE, f_days_from_holiday(year, month, day_of_month) as DAYS_TO_HOLIDAY, UNIQUE_CARRIER, FL_NUM, substring(DEP_TIME, 1, 2) as DEP_HOUR, cast(DEP_DEL15 as smallint),
                    cast(AIR_TIME as integer), cast(FLIGHTS as smallint), cast(DISTANCE as smallint)
                    from ord_flights where origin='ORD' and cancelled = 0
                   """

// Create a Dataframe to hold the results of the above query
val flightsDF = sqlContext.read.format("com.databricks.spark.redshift")
                                          .option("url", jdbcURL)
                                          .option("tempdir", s3TempDir)
                                          .option("query", flightsQuery)
                                          .option("temporary_aws_access_key_id", awsAccessKey)
                                          .option("temporary_aws_secret_access_key", awsSecretKey)
                                          .option("temporary_aws_session_token", token).load()

It’s join time!

// Join the two dataframes
val joinedDF = flightsDF.join(weatherDF, flightsDF("fl_date") === weatherDF("dt"))

Now that we have unified the data, we can write it back to another table in our Amazon Redshift cluster. After that is done, the table can serve as a datasource for advanced analytics, such as for developing predictive models using the Amazon Machine Learning service.

Writing to Amazon Redshift

To write to Amazon Redshift, the spark-redshift library first creates a table in Amazon Redshift using JDBC. Then it copies the partitioned DataFrame as AVRO partitions to a temporary S3 folder that you specify. Finally, it executes the Amazon Redshift COPY command to copy the S3 contents to the newly created Amazon Redshift table. You can also use the append option with spark-redshift to append data to an existing Amazon Redshift table. In this example, we will write the data to a table named ‘ord_flights’ in Amazon Redshift.

joinedDF.write
  .format("com.databricks.spark.redshift")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .option("url", jdbcURL)
  .option("dbtable", "ord_flights")
  .option("aws_iam_role", "arn:aws:iam::<AWS-account>:role/<redshift-role>")
  .option("tempdir", s3TempDir)
  .mode("error")
  .save()

Let us derive a couple of tables in Amazon Redshift from ord_flights to serve as the training and test input datasets for Amazon ML.

-- Derived table for training data
create table train_ord_flights 
as 
        (select * from ord_flights where year = 2013);

select count(*) from train_ord_flights;

-- Derived table for test data
create table test_ord_flights 
as 
        (select * from ord_flights where year = 2014);

select count(*) from test_ord_flights;

Connecting Amazon Redshift to Amazon ML

With processed data on hand, we can perform advanced analytics, such as forecasting future events or conducting what-if analysis to understand their effect on business outcomes.

We started with a goal of predicting flight delays for future flights. However, because future flight data is not available, we will use the unified flight and weather data for December 2013 to create and train a predictive model using Amazon ML. We’ll then use December 2014 data as our test dataset, on which we apply the predictive model to get the batch predictions. We can then compare the generated batch predictions against the actual delays (because we already know the 2014 delays) and determine the accuracy of the model that we have developed.

The predictive model creation here uses the AWS Management Console, but you can also use the AWS CLI or one of the AWS SDKs to accomplish the same thing.
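For example, with the AWS SDK for Python (boto3), the datasource, model, and batch prediction steps described in the following sections correspond roughly to the calls below. This is only a sketch: the IDs, credentials, role ARN, and S3 locations are placeholders, the training datasource would normally include a DataSchema that marks dep_del15 as the target attribute, and in practice you would wait for each entity to reach COMPLETED status before using it.

import boto3

ml = boto3.client('machinelearning')

# 1) Datasource backed by the Amazon Redshift query over the December 2013 training data
ml.create_data_source_from_redshift(
    DataSourceId='flight-delays-train-ds',
    DataSourceName='ORD flight delays - Dec 2013',
    DataSpec={
        'DatabaseInformation': {'DatabaseName': 'demo', 'ClusterIdentifier': 'demo'},
        'SelectSqlQuery': 'select * from train_ord_flights',
        'DatabaseCredentials': {'Username': 'master', 'Password': '<REDSHIFT-MASTER-PASSWORD>'},
        'S3StagingLocation': 's3://<S3TempBucket>/aml-staging/',
        # In practice, also supply a DataSchema that sets dep_del15 as the targetAttributeName
    },
    RoleARN='arn:aws:iam::<AWS-account>:role/<amazonml-redshift-role>',
    ComputeStatistics=True,
)

# 2) Binary classification model trained on that datasource
ml.create_ml_model(
    MLModelId='flight-delays-model',
    MLModelName='ORD flight delay classifier',
    MLModelType='BINARY',
    TrainingDataSourceId='flight-delays-train-ds',
)

# 3) Batch prediction over a second datasource created the same way as step 1,
#    but selecting from test_ord_flights (the December 2014 data)
ml.create_batch_prediction(
    BatchPredictionId='flight-delays-batch',
    BatchPredictionName='ORD flight delays - Dec 2014 predictions',
    MLModelId='flight-delays-model',
    BatchPredictionDataSourceId='flight-delays-test-ds',
    OutputUri='s3://<S3TempBucket>/aml-output/',
)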

Create an Amazon ML Datasource

The first step in developing an Amazon ML model is to create a datasource object. We create one in Amazon ML by issuing a query against the Amazon Redshift cluster that selects the December 2013 data as our training dataset. Amazon ML automatically infers the schema based on the values of the columns in the query. You can optionally specify a schema if you want.

In our case, we select departure delay (dep_del15) as the target attribute to predict. When the datasource object is created, Amazon ML shows the target distribution, missing values, and statistics for each attribute of the datasource object you just created.

o_powering_redshift_analytics_3

Create an Amazon Machine Learning Model

After creating the datasource object, we are ready to create and train an ML model. Select the datasource you just created in the previous section and follow the instructions to create the model.

When the model is successfully created, choose Latest evaluation result to learn how the model scored. Amazon ML provides an industry standard accuracy metric for binary classification models called “Area Under the (Receiver Operating Characteristic) Curve (AUC).”

o_powering_redshift_analytics_4

For insights into prediction accuracy, see the performance visualization, where you can understand the effect of choosing the cut-off score (threshold). The prediction score for each record is a numeric value between 0 and 1. The closer the score is to 1, the more likely the prediction should be set to Yes; the closer it is to 0, the more likely it should be No. A prediction falls into one of four cases: a) true-positive (TP): correctly classified as Yes; b) true-negative (TN): correctly classified as No; c) false-positive (FP): wrongly classified as Yes; d) false-negative (FN): wrongly classified as No.

In this example, a false positive means that a flight that is not delayed is reported as delayed, which may cause passengers to miss their flight. A false negative means that a flight that is delayed is classified as not delayed, which is merely an annoyance. We want to minimize false positives in this model, which we can do by adjusting the threshold to a higher value.

o_powering_redshift_analytics_5
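You can change the cut-off score from the model's evaluation page in the console, or programmatically. A small boto3 sketch of the latter follows; the model ID is the placeholder used in the earlier sketch.

import boto3

ml = boto3.client('machinelearning')

# Raise the score threshold so that fewer flights are classified as delayed,
# which reduces false positives at the cost of more false negatives
ml.update_ml_model(
    MLModelId='flight-delays-model',
    ScoreThreshold=0.7,
)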

Generate batch predictions

To generate batch predictions, take the two steps following.

Create a Datasource object for the test dataset

Repeat the datasource object creation in Amazon ML as we did in the earlier section to create a test dataset on which the model will be deployed. Make sure to use the December 2014 data for the Amazon Redshift query.

Create batch predictions

From the Amazon Machine Learning console, choose Batch Predictions and follow the instructions to create batch predictions by applying the model we created in the last section to the test dataset. The predictions are located in the “Output S3 URL” path on the summary page. They can be copied back to an Amazon Redshift table and correlated with other flight data to produce visualizations. Such visualizations provide insights such as the days and times when flights are most delayed, the carriers that have had delays during holidays, and so on.

o_powering_redshift_analytics_6

Congratulations! You have successfully generated a flight delay forecast for December 2014!

Conclusion

In this post, we built an end-to-end solution for predicting flight delays by combining multiple AWS services: Apache Spark on Amazon EMR, Amazon Redshift, and Amazon ML. We started out with raw data, combined data in Spark and Amazon Redshift using the spark-redshift package, processed the data using constructs such as UDFs, and developed a forecasting model for flight delays using Amazon ML. The ability to seamlessly integrate these services empowers users to build sophisticated analytical applications in the AWS Cloud. We encourage you to build your own and share your experience with us!


About the Authors

radhika_90
Radhika Ravirala is a solutions architect with Amazon Web Services.

 

 

wangechi_90

Wangechi Doble is a Solutions Architect with a focus on Big Data & Analytics at AWS. She helps customers architect and build out high performance, scalable and secure cloud-based solutions on AWS. In her spare time, she enjoys spending time with her family, reading and traveling to exotic places.

 

 


Related

Serving Real-Time Machine Learning Predictions on Amazon EMR

o_realtime_1_1_1

Security advisories for Wednesday

Post Syndicated from ris original http://lwn.net/Articles/706020/rss

Debian has updated libxslt (code execution).

Fedora has updated dbus (F23:
code execution), firefox (F23: two
vulnerabilities), and pacemaker (F23: privilege escalation).

openSUSE has updated mariadb
(13.2: multiple vulnerabilities) and nodejs
(Leap42.1, 13.2: code execution).

Red Hat has updated flash-plugin
(RHEL5,6: multiple vulnerabilities).

Scientific Linux has updated libgcrypt (SL6: flawed random number
generation) and pacemaker (SL6: privilege escalation).

Security advisories for Monday

Post Syndicated from ris original http://lwn.net/Articles/705814/rss

Debian has updated mysql-5.5 (multiple unspecified vulnerabilities).

Debian-LTS has updated libdatetime-timezone-perl (update tzdata), libxslt (code execution), memcached (multiple vulnerabilities, one from
2013), openjdk-7 (multiple
vulnerabilities), and tzdata (update tzdata).

Fedora has updated 389-ds-base
(F24: information leak), curl (F24:
multiple vulnerabilities), firefox (F24:
two vulnerabilities), and pacemaker (F24: privilege escalation).

Mageia has updated libtomcrypt (signature forgery), python-django (two vulnerabilities), and tomcat (multiple vulnerabilities).

openSUSE has updated chromium (SPH for SLE12; Leap42.1, 13.2: memory leak), dbus-1 (13.1: denial of service), jasper (13.1: multiple vulnerabilities), libraw (Leap42.1: memory leak), libxml2 (13.2: code execution), and firefox (13.1: two vulnerabilities).

Red Hat has updated java-1.6.0-ibm (RHEL5,6: multiple
vulnerabilities) and java-1.7.0-openjdk
(RHEL5,6,7: multiple vulnerabilities).

Security advisories for Wednesday

Post Syndicated from ris original http://lwn.net/Articles/705355/rss

Arch Linux has updated bind (denial of service).

Debian has updated bind9 (denial of service) and tar (file overwrite).

Debian-LTS has updated libwmf (denial of service), tiff (multiple vulnerabilities), and tiff3 (two vulnerabilities).

Fedora has updated ecryptfs-utils
(F23: two vulnerabilities), libass (F23:
multiple vulnerabilities), libXfixes (F23:
integer overflow), libXrandr (F23:
insufficient validation), libXrender (F23:
insufficient validation), libXtst (F23:
insufficient validation), libXv (F23:
insufficient validation), libXvMC (F23:
insufficient validation), systemd (F23:
denial of service), and tor (F23: denial of service).

Mageia has updated libtiff (two vulnerabilities).

Red Hat has updated java-1.7.0-ibm (RHEL5: multiple
vulnerabilities), java-1.7.1-ibm (RHEL6,7:
multiple vulnerabilities), and java-1.8.0-ibm (RHEL6,7: multiple vulnerabilities).

SUSE has updated bind (SLE12-SP1,2; SLES12: denial of service), curl (SLE12-SP1; SSO1.3: multiple vulnerabilities), nodejs4 (SLEM12: multiple vulnerabilities), php7 (SLEM12: many vulnerabilities), and php7 (SLEM12: three vulnerabilities in libgd).

Ubuntu has updated bind9 (denial
of service), dbus (denial of service from
2015), libgd2 (three vulnerabilities), mailman (two vulnerabilities), oxide-qt (16.10, 16.04, 14.04: multiple
vulnerabilities), and python-django (two
vulnerabilities).

The initial bus1 patch posting

Post Syndicated from corbet original http://lwn.net/Articles/704732/rss

The bus1 message-passing mechanism is the successor to the “kdbus” project;
it was covered here in August. The patches have now been posted for review.
While bus1 emerged out of the kdbus project, bus1 was started from
scratch and the concepts have little in common. In a nutshell, bus1
provides a capability-based IPC system, similar in nature to Android
Binder, Cap’n Proto, and seL4.

Security advisories for Monday

Post Syndicated from ris original http://lwn.net/Articles/704464/rss

Arch Linux has updated chromium (multiple vulnerabilities), kernel (privilege escalation), linux-lts (privilege escalation), python-django (cross-site request forgery), and python2-django (cross-site request forgery).

CentOS has updated bind (C6; C5: denial
of service) and bind97 (C5: denial of service).

Debian has updated kdepimlibs (HTML injection).

Debian-LTS has updated kdepimlibs (HTML injection).

Fedora has updated guile (F23: two vulnerabilities), kernel (F24; F23: privilege escalation), php (F24; F23: multiple vulnerabilities), and php-pecl-zip (F24; F23: multiple vulnerabilities).

Mageia has updated 389-ds-base (information disclosure), c-ares (code execution), guile (two vulnerabilities), openjpeg (denial of service), and php-ZendFramework (SQL injection).

openSUSE has updated Chromium
(Leap42.1, 13.2: multiple vulnerabilities), dbus-1 (Leap42.1: code execution), gd (13.2: denial of service), kdump (Leap42.1: denial of service), php5 (13.2: three vulnerabilities),
kernel (Leap42.1; 13.1: multiple vulnerabilities), tor (Leap42.1, 13.2: denial of service), and
X (Leap42.1: multiple vulnerabilities).

Oracle has updated bind (OL6; OL5:
denial of service), bind97 (OL5: multiple
vulnerabilities), and kernel 4.1.12 (OL7; OL6:
privilege escalation), kernel 3.8.13 (OL7; OL6:
privilege escalation), kernel 2.6.39 (OL6; OL5: privilege escalation).

Red Hat has updated kernel
(RHEL7: privilege escalation).

SUSE has updated Chromium
(SPH for SLE12: multiple vulnerabilities), qemu (SLE12-SP1: multiple vulnerabilities),
and kernel (SLE12-SP1; SLE12; SLE11-SP4; SLE11-SP3; SLE11-SP2: privilege escalation).

Friday’s security updates

Post Syndicated from jake original http://lwn.net/Articles/704247/rss

Debian-LTS has updated bind9 (denial of service).

Fedora has updated libgit2 (F23:
two vulnerabilities).

Mageia has updated kernel (three
vulnerabilities), libtiff (multiple
vulnerabilities, two from 2015), and openslp (code execution).

openSUSE has updated dbus-1
(13.2: code execution), ghostscript-library
(42.1: three vulnerabilities, one from 2013), roundcubemail (42.1: two vulnerabilities), and
squidGuard (42.1: cross-site scripting from
2015).

Red Hat has updated bind
(RHEL6&5: denial of service) and bind97
(RHEL5: denial of service).

Scientific Linux has updated bind
(SL6&5: denial of service) and bind97 (SL5: denial of service).

Ubuntu has updated bind9 (12.04: denial of service).

Friday’s security advisories

Post Syndicated from ris original http://lwn.net/Articles/703602/rss

Arch Linux has updated gdk-pixbuf2 (denial of service).

Debian has updated freeimage (two vulnerabilities).

Debian-LTS has updated libxfixes (integer overflow).

Fedora has updated dbus (F24: code execution) and xen (F24; F23: three vulnerabilities).

openSUSE has updated compat-openssl098 (Leap42.1: multiple
vulnerabilities), derby (13.2: information
leak), libreoffice (Leap42.1: code
execution), php5 (Leap42.1: multiple
vulnerabilities), go1.4
(SPH for SLE12: denial of service), systemd (Leap42.1: denial of service), and unzip (13.2: two vulnerabilities).

Oracle has updated kernel 4.1.12 (OL7; OL6: stack corruption).

Red Hat has updated mariadb-galera (RHOSP9; RHELOSP7 for RHEL7; RHELOSP6 for RHEL7; RHELOSP5 for RHEL7; RHELOSP5 for RHEL6: SQL injection/privilege escalation).

SUSE has updated xen (SLE12; SLES11-SP2: multiple vulnerabilities).

Ubuntu has updated linux-ti-omap4
(12.04: three vulnerabilities).