Tag Archives: variables

Combine Transactional and Analytical Data Using Amazon Aurora and Amazon Redshift

Post Syndicated from Re Alvarez-Parmar original https://aws.amazon.com/blogs/big-data/combine-transactional-and-analytical-data-using-amazon-aurora-and-amazon-redshift/

A few months ago, we published a blog post about capturing data changes in an Amazon Aurora database and sending it to Amazon Athena and Amazon QuickSight for fast analysis and visualization. In this post, I want to demonstrate how easy it can be to take the data in Aurora and combine it with data in Amazon Redshift using Amazon Redshift Spectrum.

With Amazon Redshift, you can build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Amazon Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat.

In this post, we describe how to combine data in Aurora in Amazon Redshift. Here’s an overview of the solution:

  • Use AWS Lambda functions with Amazon Aurora to capture data changes in a table.
  • Save data in an Amazon S3
  • Query data using Amazon Redshift Spectrum.

We use the following services:

Serverless architecture for capturing and analyzing Aurora data changes

Consider a scenario in which an e-commerce web application uses Amazon Aurora for a transactional database layer. The company has a sales table that captures every single sale, along with a few corresponding data items. This information is stored as immutable data in a table. Business users want to monitor the sales data and then analyze and visualize it.

In this example, you take the changes in data in an Aurora database table and save it in Amazon S3. After the data is captured in Amazon S3, you combine it with data in your existing Amazon Redshift cluster for analysis.

By the end of this post, you will understand how to capture data events in an Aurora table and push them out to other AWS services using AWS Lambda.

The following diagram shows the flow of data as it occurs in this tutorial:

The starting point in this architecture is a database insert operation in Amazon Aurora. When the insert statement is executed, a custom trigger calls a Lambda function and forwards the inserted data. Lambda writes the data that it received from Amazon Aurora to a Kinesis data delivery stream. Kinesis Data Firehose writes the data to an Amazon S3 bucket. Once the data is in an Amazon S3 bucket, it is queried in place using Amazon Redshift Spectrum.

Creating an Aurora database

First, create a database by following these steps in the Amazon RDS console:

  1. Sign in to the AWS Management Console, and open the Amazon RDS console.
  2. Choose Launch a DB instance, and choose Next.
  3. For Engine, choose Amazon Aurora.
  4. Choose a DB instance class. This example uses a small, since this is not a production database.
  5. In Multi-AZ deployment, choose No.
  6. Configure DB instance identifier, Master username, and Master password.
  7. Launch the DB instance.

After you create the database, use MySQL Workbench to connect to the database using the CNAME from the console. For information about connecting to an Aurora database, see Connecting to an Amazon Aurora DB Cluster.

The following screenshot shows the MySQL Workbench configuration:

Next, create a table in the database by running the following SQL statement:

Create Table
CREATE TABLE Sales (
InvoiceID int NOT NULL AUTO_INCREMENT,
ItemID int NOT NULL,
Category varchar(255),
Price double(10,2), 
Quantity int not NULL,
OrderDate timestamp,
DestinationState varchar(2),
ShippingType varchar(255),
Referral varchar(255),
PRIMARY KEY (InvoiceID)
)

You can now populate the table with some sample data. To generate sample data in your table, copy and run the following script. Ensure that the highlighted (bold) variables are replaced with appropriate values.

#!/usr/bin/python
import MySQLdb
import random
import datetime

db = MySQLdb.connect(host="AURORA_CNAME",
                     user="DBUSER",
                     passwd="DBPASSWORD",
                     db="DB")

states = ("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
"IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
"NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA",
"WA","WV","WI","WY")

shipping_types = ("Free", "3-Day", "2-Day")

product_categories = ("Garden", "Kitchen", "Office", "Household")
referrals = ("Other", "Friend/Colleague", "Repeat Customer", "Online Ad")

for i in range(0,10):
    item_id = random.randint(1,100)
    state = states[random.randint(0,len(states)-1)]
    shipping_type = shipping_types[random.randint(0,len(shipping_types)-1)]
    product_category = product_categories[random.randint(0,len(product_categories)-1)]
    quantity = random.randint(1,4)
    referral = referrals[random.randint(0,len(referrals)-1)]
    price = random.randint(1,100)
    order_date = datetime.date(2016,random.randint(1,12),random.randint(1,30)).isoformat()

    data_order = (item_id, product_category, price, quantity, order_date, state,
    shipping_type, referral)

    add_order = ("INSERT INTO Sales "
                   "(ItemID, Category, Price, Quantity, OrderDate, DestinationState, \
                   ShippingType, Referral) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

    cursor = db.cursor()
    cursor.execute(add_order, data_order)

    db.commit()

cursor.close()
db.close() 

The following screenshot shows how the table appears with the sample data:

Sending data from Amazon Aurora to Amazon S3

There are two methods available to send data from Amazon Aurora to Amazon S3:

  • Using a Lambda function
  • Using SELECT INTO OUTFILE S3

To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function to send data to Amazon S3 using Amazon Kinesis Data Firehose.

Alternatively, you can use a SELECT INTO OUTFILE S3 statement to query data from an Amazon Aurora DB cluster and save it directly in text files that are stored in an Amazon S3 bucket. However, with this method, there is a delay between the time that the database transaction occurs and the time that the data is exported to Amazon S3 because the default file size threshold is 6 GB.

Creating a Kinesis data delivery stream

The next step is to create a Kinesis data delivery stream, since it’s a dependency of the Lambda function.

To create a delivery stream:

  1. Open the Kinesis Data Firehose console
  2. Choose Create delivery stream.
  3. For Delivery stream name, type AuroraChangesToS3.
  4. For Source, choose Direct PUT.
  5. For Record transformation, choose Disabled.
  6. For Destination, choose Amazon S3.
  7. In the S3 bucket drop-down list, choose an existing bucket, or create a new one.
  8. Enter a prefix if needed, and choose Next.
  9. For Data compression, choose GZIP.
  10. In IAM role, choose either an existing role that has access to write to Amazon S3, or choose to generate one automatically. Choose Next.
  11. Review all the details on the screen, and choose Create delivery stream when you’re finished.

 

Creating a Lambda function

Now you can create a Lambda function that is called every time there is a change that needs to be tracked in the database table. This Lambda function passes the data to the Kinesis data delivery stream that you created earlier.

To create the Lambda function:

  1. Open the AWS Lambda console.
  2. Ensure that you are in the AWS Region where your Amazon Aurora database is located.
  3. If you have no Lambda functions yet, choose Get started now. Otherwise, choose Create function.
  4. Choose Author from scratch.
  5. Give your function a name and select Python 3.6 for Runtime
  6. Choose and existing or create a new Role, the role would need to have access to call firehose:PutRecord
  7. Choose Next on the trigger selection screen.
  8. Paste the following code in the code window. Change the stream_name variable to the Kinesis data delivery stream that you created in the previous step.
  9. Choose File -> Save in the code editor and then choose Save.
import boto3
import json

firehose = boto3.client('firehose')
stream_name = ‘AuroraChangesToS3’


def Kinesis_publish_message(event, context):
    
    firehose_data = (("%s,%s,%s,%s,%s,%s,%s,%s\n") %(event['ItemID'], 
    event['Category'], event['Price'], event['Quantity'],
    event['OrderDate'], event['DestinationState'], event['ShippingType'], 
    event['Referral']))
    
    firehose_data = {'Data': str(firehose_data)}
    print(firehose_data)
    
    firehose.put_record(DeliveryStreamName=stream_name,
    Record=firehose_data)

Note the Amazon Resource Name (ARN) of this Lambda function.

Giving Aurora permissions to invoke a Lambda function

To give Amazon Aurora permissions to invoke a Lambda function, you must attach an IAM role with appropriate permissions to the cluster. For more information, see Invoking a Lambda Function from an Amazon Aurora DB Cluster.

Once you are finished, the Amazon Aurora database has access to invoke a Lambda function.

Creating a stored procedure and a trigger in Amazon Aurora

Now, go back to MySQL Workbench, and run the following command to create a new stored procedure. When this stored procedure is called, it invokes the Lambda function you created. Change the ARN in the following code to your Lambda function’s ARN.

DROP PROCEDURE IF EXISTS CDC_TO_FIREHOSE;
DELIMITER ;;
CREATE PROCEDURE CDC_TO_FIREHOSE (IN ItemID VARCHAR(255), 
									IN Category varchar(255), 
									IN Price double(10,2),
                                    IN Quantity int(11),
                                    IN OrderDate timestamp,
                                    IN DestinationState varchar(2),
                                    IN ShippingType varchar(255),
                                    IN Referral  varchar(255)) LANGUAGE SQL 
BEGIN
  CALL mysql.lambda_async('arn:aws:lambda:us-east-1:XXXXXXXXXXXXX:function:CDCFromAuroraToKinesis', 
     CONCAT('{ "ItemID" : "', ItemID, 
            '", "Category" : "', Category,
            '", "Price" : "', Price,
            '", "Quantity" : "', Quantity, 
            '", "OrderDate" : "', OrderDate, 
            '", "DestinationState" : "', DestinationState, 
            '", "ShippingType" : "', ShippingType, 
            '", "Referral" : "', Referral, '"}')
     );
END
;;
DELIMITER ;

Create a trigger TR_Sales_CDC on the Sales table. When a new record is inserted, this trigger calls the CDC_TO_FIREHOSE stored procedure.

DROP TRIGGER IF EXISTS TR_Sales_CDC;
 
DELIMITER ;;
CREATE TRIGGER TR_Sales_CDC
  AFTER INSERT ON Sales
  FOR EACH ROW
BEGIN
  SELECT  NEW.ItemID , NEW.Category, New.Price, New.Quantity, New.OrderDate
  , New.DestinationState, New.ShippingType, New.Referral
  INTO @ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral;
  CALL  CDC_TO_FIREHOSE(@ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral);
END
;;
DELIMITER ;

If a new row is inserted in the Sales table, the Lambda function that is mentioned in the stored procedure is invoked.

Verify that data is being sent from the Lambda function to Kinesis Data Firehose to Amazon S3 successfully. You might have to insert a few records, depending on the size of your data, before new records appear in Amazon S3. This is due to Kinesis Data Firehose buffering. To learn more about Kinesis Data Firehose buffering, see the “Amazon S3” section in Amazon Kinesis Data Firehose Data Delivery.

Every time a new record is inserted in the sales table, a stored procedure is called, and it updates data in Amazon S3.

Querying data in Amazon Redshift

In this section, you use the data you produced from Amazon Aurora and consume it as-is in Amazon Redshift. In order to allow you to process your data as-is, where it is, while taking advantage of the power and flexibility of Amazon Redshift, you use Amazon Redshift Spectrum. You can use Redshift Spectrum to run complex queries on data stored in Amazon S3, with no need for loading or other data prep.

Just create a data source and issue your queries to your Amazon Redshift cluster as usual. Behind the scenes, Redshift Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your dataset grows to beyond an exabyte! Being able to query data that is stored in Amazon S3 means that you can scale your compute and your storage independently. You have the full power of the Amazon Redshift query model and all the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Amazon Redshift tables and in Amazon S3.

Redshift Spectrum supports open, common data types, including CSV/TSV, Apache Parquet, SequenceFile, and RCFile. Files can be compressed using gzip or Snappy, with other data types and compression methods in the works.

First, create an Amazon Redshift cluster. Follow the steps in Launch a Sample Amazon Redshift Cluster.

Next, create an IAM role that has access to Amazon S3 and Athena. By default, Amazon Redshift Spectrum uses the Amazon Athena data catalog. Your cluster needs authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon S3.

In the demo setup, I attached AmazonS3FullAccess and AmazonAthenaFullAccess. In a production environment, the IAM roles should follow the standard security of granting least privilege. For more information, see IAM Policies for Amazon Redshift Spectrum.

Attach the newly created role to the Amazon Redshift cluster. For more information, see Associate the IAM Role with Your Cluster.

Next, connect to the Amazon Redshift cluster, and create an external schema and database:

create external schema if not exists spectrum_schema
from data catalog 
database 'spectrum_db' 
region 'us-east-1'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/RedshiftSpectrumRole'
create external database if not exists;

Don’t forget to replace the IAM role in the statement.

Then create an external table within the database:

 CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_sales(
  ItemID int,
  Category varchar,
  Price DOUBLE PRECISION,
  Quantity int,
  OrderDate TIMESTAMP,
  DestinationState varchar,
  ShippingType varchar,
  Referral varchar)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://{BUCKET_NAME}/CDC/'

Query the table, and it should contain data. This is a fact table.

select top 10 * from spectrum_schema.ecommerce_sales

 

Next, create a dimension table. For this example, we create a date/time dimension table. Create the table:

CREATE TABLE date_dimension (
  d_datekey           integer       not null sortkey,
  d_dayofmonth        integer       not null,
  d_monthnum          integer       not null,
  d_dayofweek                varchar(10)   not null,
  d_prettydate        date       not null,
  d_quarter           integer       not null,
  d_half              integer       not null,
  d_year              integer       not null,
  d_season            varchar(10)   not null,
  d_fiscalyear        integer       not null)
diststyle all;

Populate the table with data:

copy date_dimension from 's3://reparmar-lab/2016dates' 
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/redshiftspectrum'
DELIMITER ','
dateformat 'auto';

The date dimension table should look like the following:

Querying data in local and external tables using Amazon Redshift

Now that you have the fact and dimension table populated with data, you can combine the two and run analysis. For example, if you want to query the total sales amount by weekday, you can run the following:

select sum(quantity*price) as total_sales, date_dimension.d_season
from spectrum_schema.ecommerce_sales 
join date_dimension on spectrum_schema.ecommerce_sales.orderdate = date_dimension.d_prettydate 
group by date_dimension.d_season

You get the following results:

Similarly, you can replace d_season with d_dayofweek to get sales figures by weekday:

With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to use file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost.

Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Amazon Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Amazon Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of the supported compression algorithms in Amazon Redshift Spectrum, less data is scanned.

Analyzing and visualizing Amazon Redshift data in Amazon QuickSight

Modify the Amazon Redshift security group to allow an Amazon QuickSight connection. For more information, see Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters.

After modifying the Amazon Redshift security group, go to Amazon QuickSight. Create a new analysis, and choose Amazon Redshift as the data source.

Enter the database connection details, validate the connection, and create the data source.

Choose the schema to be analyzed. In this case, choose spectrum_schema, and then choose the ecommerce_sales table.

Next, we add a custom field for Total Sales = Price*Quantity. In the drop-down list for the ecommerce_sales table, choose Edit analysis data sets.

On the next screen, choose Edit.

In the data prep screen, choose New Field. Add a new calculated field Total Sales $, which is the product of the Price*Quantity fields. Then choose Create. Save and visualize it.

Next, to visualize total sales figures by month, create a graph with Total Sales on the x-axis and Order Data formatted as month on the y-axis.

After you’ve finished, you can use Amazon QuickSight to add different columns from your Amazon Redshift tables and perform different types of visualizations. You can build operational dashboards that continuously monitor your transactional and analytical data. You can publish these dashboards and share them with others.

Final notes

Amazon QuickSight can also read data in Amazon S3 directly. However, with the method demonstrated in this post, you have the option to manipulate, filter, and combine data from multiple sources or Amazon Redshift tables before visualizing it in Amazon QuickSight.

In this example, we dealt with data being inserted, but triggers can be activated in response to an INSERT, UPDATE, or DELETE trigger.

Keep the following in mind:

  • Be careful when invoking a Lambda function from triggers on tables that experience high write traffic. This would result in a large number of calls to your Lambda function. Although calls to the lambda_async procedure are asynchronous, triggers are synchronous.
  • A statement that results in a large number of trigger activations does not wait for the call to the AWS Lambda function to complete. But it does wait for the triggers to complete before returning control to the client.
  • Similarly, you must account for Amazon Kinesis Data Firehose limits. By default, Kinesis Data Firehose is limited to a maximum of 5,000 records/second. For more information, see Monitoring Amazon Kinesis Data Firehose.

In certain cases, it may be optimal to use AWS Database Migration Service (AWS DMS) to capture data changes in Aurora and use Amazon S3 as a target. For example, AWS DMS might be a good option if you don’t need to transform data from Amazon Aurora. The method used in this post gives you the flexibility to transform data from Aurora using Lambda before sending it to Amazon S3. Additionally, the architecture has the benefits of being serverless, whereas AWS DMS requires an Amazon EC2 instance for replication.

For design considerations while using Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data.

If you have questions or suggestions, please comment below.


Additional Reading

If you found this post useful, be sure to check out Capturing Data Changes in Amazon Aurora Using AWS Lambda and 10 Best Practices for Amazon Redshift Spectrum


About the Authors

Re Alvarez-Parmar is a solutions architect for Amazon Web Services. He helps enterprises achieve success through technical guidance and thought leadership. In his spare time, he enjoys spending time with his two kids and exploring outdoors.

 

 

 

Random with care

Post Syndicated from Eevee original https://eev.ee/blog/2018/01/02/random-with-care/

Hi! Here are a few loose thoughts about picking random numbers.

A word about crypto

DON’T ROLL YOUR OWN CRYPTO

This is all aimed at frivolous pursuits like video games. Hell, even video games where money is at stake should be deferring to someone who knows way more than I do. Otherwise you might find out that your deck shuffles in your poker game are woefully inadequate and some smartass is cheating you out of millions. (If your random number generator has fewer than 226 bits of state, it can’t even generate every possible shuffling of a deck of cards!)

Use the right distribution

Most languages have a random number primitive that spits out a number uniformly in the range [0, 1), and you can go pretty far with just that. But beware a few traps!

Random pitches

Say you want to pitch up a sound by a random amount, perhaps up to an octave. Your audio API probably has a way to do this that takes a pitch multiplier, where I say “probably” because that’s how the only audio API I’ve used works.

Easy peasy. If 1 is unchanged and 2 is pitched up by an octave, then all you need is rand() + 1. Right?

No! Pitch is exponential — within the same octave, the “gap” between C and C♯ is about half as big as the gap between B and the following C. If you pick a pitch multiplier uniformly, you’ll have a noticeable bias towards the higher pitches.

One octave corresponds to a doubling of pitch, so if you want to pick a random note, you want 2 ** rand().

Random directions

For two dimensions, you can just pick a random angle with rand() * TAU.

If you want a vector rather than an angle, or if you want a random direction in three dimensions, it’s a little trickier. You might be tempted to just pick a random point where each component is rand() * 2 - 1 (ranging from −1 to 1), but that’s not quite right. A direction is a point on the surface (or, equivalently, within the volume) of a sphere, and picking each component independently produces a point within the volume of a cube; the result will be a bias towards the corners of the cube, where there’s much more extra volume beyond the sphere.

No? Well, just trust me. I don’t know how to make a diagram for this.

Anyway, you could use the Pythagorean theorem a few times and make a huge mess of things, or it turns out there’s a really easy way that even works for two or four or any number of dimensions. You pick each coordinate from a Gaussian (normal) distribution, then normalize the resulting vector. In other words, using Python’s random module:

1
2
3
4
5
6
def random_direction():
    x = random.gauss(0, 1)
    y = random.gauss(0, 1)
    z = random.gauss(0, 1)
    r = math.sqrt(x*x + y*y + z*z)
    return x/r, y/r, z/r

Why does this work? I have no idea!

Note that it is possible to get zero (or close to it) for every component, in which case the result is nonsense. You can re-roll all the components if necessary; just check that the magnitude (or its square) is less than some epsilon, which is equivalent to throwing away a tiny sphere at the center and shouldn’t affect the distribution.

Beware Gauss

Since I brought it up: the Gaussian distribution is a pretty nice one for choosing things in some range, where the middle is the common case and should appear more frequently.

That said, I never use it, because it has one annoying drawback: the Gaussian distribution has no minimum or maximum value, so you can’t really scale it down to the range you want. In theory, you might get any value out of it, with no limit on scale.

In practice, it’s astronomically rare to actually get such a value out. I did a hundred million trials just to see what would happen, and the largest value produced was 5.8.

But, still, I’d rather not knowingly put extremely rare corner cases in my code if I can at all avoid it. I could clamp the ends, but that would cause unnatural bunching at the endpoints. I could reroll if I got a value outside some desired range, but I prefer to avoid rerolling when I can, too; after all, it’s still (astronomically) possible to have to reroll for an indefinite amount of time. (Okay, it’s really not, since you’ll eventually hit the period of your PRNG. Still, though.) I don’t bend over backwards here — I did just say to reroll when picking a random direction, after all — but when there’s a nicer alternative I’ll gladly use it.

And lo, there is a nicer alternative! Enter the beta distribution. It always spits out a number in [0, 1], so you can easily swap it in for the standard normal function, but it takes two “shape” parameters α and β that alter its behavior fairly dramatically.

With α = β = 1, the beta distribution is uniform, i.e. no different from rand(). As α increases, the distribution skews towards the right, and as β increases, the distribution skews towards the left. If α = β, the whole thing is symmetric with a hump in the middle. The higher either one gets, the more extreme the hump (meaning that value is far more common than any other). With a little fiddling, you can get a number of interesting curves.

Screenshots don’t really do it justice, so here’s a little Wolfram widget that lets you play with α and β live:

Note that if α = 1, then 1 is a possible value; if β = 1, then 0 is a possible value. You probably want them both greater than 1, which clamps the endpoints to zero.

Also, it’s possible to have either α or β or both be less than 1, but this creates very different behavior: the corresponding endpoints become poles.

Anyway, something like α = β = 3 is probably close enough to normal for most purposes but already clamped for you. And you could easily replicate something like, say, NetHack’s incredibly bizarre rnz function.

Random frequency

Say you want some event to have an 80% chance to happen every second. You (who am I kidding, I) might be tempted to do something like this:

1
2
if random() < 0.8 * dt:
    do_thing()

In an ideal world, dt is always the same and is equal to 1 / f, where f is the framerate. Replace that 80% with a variable, say P, and every tic you have a P / f chance to do the… whatever it is.

Each second, f tics pass, so you’ll make this check f times. The chance that any check succeeds is the inverse of the chance that every check fails, which is \(1 – \left(1 – \frac{P}{f}\right)^f\).

For P of 80% and a framerate of 60, that’s a total probability of 55.3%. Wait, what?

Consider what happens if the framerate is 2. On the first tic, you roll 0.4 twice — but probabilities are combined by multiplying, and splitting work up by dt only works for additive quantities. You lose some accuracy along the way. If you’re dealing with something that multiplies, you need an exponent somewhere.

But in this case, maybe you don’t want that at all. Each separate roll you make might independently succeed, so it’s possible (but very unlikely) that the event will happen 60 times within a single second! Or 200 times, if that’s someone’s framerate.

If you explicitly want something to have a chance to happen on a specific interval, you have to check on that interval. If you don’t have a gizmo handy to run code on an interval, it’s easy to do yourself with a time buffer:

1
2
3
4
5
6
timer += dt
# here, 1 is the "every 1 seconds"
while timer > 1:
    timer -= 1
    if random() < 0.8:
        do_thing()

Using while means rolls still happen even if you somehow skipped over an entire second.

(For the curious, and the nerds who already noticed: the expression \(1 – \left(1 – \frac{P}{f}\right)^f\) converges to a specific value! As the framerate increases, it becomes a better and better approximation for \(1 – e^{-P}\), which for the example above is 0.551. Hey, 60 fps is pretty accurate — it’s just accurately representing something nowhere near what I wanted. Er, you wanted.)

Rolling your own

Of course, you can fuss with the classic [0, 1] uniform value however you want. If I want a bias towards zero, I’ll often just square it, or multiply two of them together. If I want a bias towards one, I’ll take a square root. If I want something like a Gaussian/normal distribution, but with clearly-defined endpoints, I might add together n rolls and divide by n. (The normal distribution is just what you get if you roll infinite dice and divide by infinity!)

It’d be nice to be able to understand exactly what this will do to the distribution. Unfortunately, that requires some calculus, which this post is too small to contain, and which I didn’t even know much about myself until I went down a deep rabbit hole while writing, and which in many cases is straight up impossible to express directly.

Here’s the non-calculus bit. A source of randomness is often graphed as a PDF — a probability density function. You’ve almost certainly seen a bell curve graphed, and that’s a PDF. They’re pretty nice, since they do exactly what they look like: they show the relative chance that any given value will pop out. On a bog standard bell curve, there’s a peak at zero, and of course zero is the most common result from a normal distribution.

(Okay, actually, since the results are continuous, it’s vanishingly unlikely that you’ll get exactly zero — but you’re much more likely to get a value near zero than near any other number.)

For the uniform distribution, which is what a classic rand() gives you, the PDF is just a straight horizontal line — every result is equally likely.


If there were a calculus bit, it would go here! Instead, we can cheat. Sometimes. Mathematica knows how to work with probability distributions in the abstract, and there’s a free web version you can use. For the example of squaring a uniform variable, try this out:

1
PDF[TransformedDistribution[u^2, u \[Distributed] UniformDistribution[{0, 1}]], u]

(The \[Distributed] is a funny tilde that doesn’t exist in Unicode, but which Mathematica uses as a first-class operator. Also, press shiftEnter to evaluate the line.)

This will tell you that the distribution is… \(\frac{1}{2\sqrt{u}}\). Weird! You can plot it:

1
Plot[%, {u, 0, 1}]

(The % refers to the result of the last thing you did, so if you want to try several of these, you can just do Plot[PDF[…], u] directly.)

The resulting graph shows that numbers around zero are, in fact, vastly — infinitely — more likely than anything else.

What about multiplying two together? I can’t figure out how to get Mathematica to understand this, but a great amount of digging revealed that the answer is -ln x, and from there you can plot them both on Wolfram Alpha. They’re similar, though squaring has a much better chance of giving you high numbers than multiplying two separate rolls — which makes some sense, since if either of two rolls is a low number, the product will be even lower.

What if you know the graph you want, and you want to figure out how to play with a uniform roll to get it? Good news! That’s a whole thing called inverse transform sampling. All you have to do is take an integral. Good luck!


This is all extremely ridiculous. New tactic: Just Simulate The Damn Thing. You already have the code; run it a million times, make a histogram, and tada, there’s your PDF. That’s one of the great things about computers! Brute-force numerical answers are easy to come by, so there’s no excuse for producing something like rnz. (Though, be sure your histogram has sufficiently narrow buckets — I tried plotting one for rnz once and the weird stuff on the left side didn’t show up at all!)

By the way, I learned something from futzing with Mathematica here! Taking the square root (to bias towards 1) gives a PDF that’s a straight diagonal line, nothing like the hyperbola you get from squaring (to bias towards 0). How do you get a straight line the other way? Surprise: \(1 – \sqrt{1 – u}\).

Okay, okay, here’s the actual math

I don’t claim to have a very firm grasp on this, but I had a hell of a time finding it written out clearly, so I might as well write it down as best I can. This was a great excuse to finally set up MathJax, too.

Say \(u(x)\) is the PDF of the original distribution and \(u\) is a representative number you plucked from that distribution. For the uniform distribution, \(u(x) = 1\). Or, more accurately,

$$
u(x) = \begin{cases}
1 & \text{ if } 0 \le x \lt 1 \\
0 & \text{ otherwise }
\end{cases}
$$

Remember that \(x\) here is a possible outcome you want to know about, and the PDF tells you the relative probability that a roll will be near it. This PDF spits out 1 for every \(x\), meaning every number between 0 and 1 is equally likely to appear.

We want to do something to that PDF, which creates a new distribution, whose PDF we want to know. I’ll use my original example of \(f(u) = u^2\), which creates a new PDF \(v(x)\).

The trick is that we need to work in terms of the cumulative distribution function for \(u\). Where the PDF gives the relative chance that a roll will be (“near”) a specific value, the CDF gives the relative chance that a roll will be less than a specific value.

The conventions for this seem to be a bit fuzzy, and nobody bothers to explain which ones they’re using, which makes this all the more confusing to read about… but let’s write the CDF with a capital letter, so we have \(U(x)\). In this case, \(U(x) = x\), a straight 45° line (at least between 0 and 1). With the definition I gave, this should make sense. At some arbitrary point like 0.4, the value of the PDF is 1 (0.4 is just as likely as anything else), and the value of the CDF is 0.4 (you have a 40% chance of getting a number from 0 to 0.4).

Calculus ahoy: the PDF is the derivative of the CDF, which means it measures the slope of the CDF at any point. For \(U(x) = x\), the slope is always 1, and indeed \(u(x) = 1\). See, calculus is easy.

Okay, so, now we’re getting somewhere. What we want is the CDF of our new distribution, \(V(x)\). The CDF is defined as the probability that a roll \(v\) will be less than \(x\), so we can literally write:

$$V(x) = P(v \le x)$$

(This is why we have to work with CDFs, rather than PDFs — a PDF gives the chance that a roll will be “nearby,” whatever that means. A CDF is much more concrete.)

What is \(v\), exactly? We defined it ourselves; it’s the do something applied to a roll from the original distribution, or \(f(u)\).

$$V(x) = P\!\left(f(u) \le x\right)$$

Now the first tricky part: we have to solve that inequality for \(u\), which means we have to do something, backwards to \(x\).

$$V(x) = P\!\left(u \le f^{-1}(x)\right)$$

Almost there! We now have a probability that \(u\) is less than some value, and that’s the definition of a CDF!

$$V(x) = U\!\left(f^{-1}(x)\right)$$

Hooray! Now to turn these CDFs back into PDFs, all we need to do is differentiate both sides and use the chain rule. If you never took calculus, don’t worry too much about what that means!

$$v(x) = u\!\left(f^{-1}(x)\right)\left|\frac{d}{dx}f^{-1}(x)\right|$$

Wait! Where did that absolute value come from? It takes care of whether \(f(x)\) increases or decreases. It’s the least interesting part here by far, so, whatever.

There’s one more magical part here when using the uniform distribution — \(u(\dots)\) is always equal to 1, so that entire term disappears! (Note that this only works for a uniform distribution with a width of 1; PDFs are scaled so the entire area under them sums to 1, so if you had a rand() that could spit out a number between 0 and 2, the PDF would be \(u(x) = \frac{1}{2}\).)

$$v(x) = \left|\frac{d}{dx}f^{-1}(x)\right|$$

So for the specific case of modifying the output of rand(), all we have to do is invert, then differentiate. The inverse of \(f(u) = u^2\) is \(f^{-1}(x) = \sqrt{x}\) (no need for a ± since we’re only dealing with positive numbers), and differentiating that gives \(v(x) = \frac{1}{2\sqrt{x}}\). Done! This is also why square root comes out nicer; inverting it gives \(x^2\), and differentiating that gives \(2x\), a straight line.

Incidentally, that method for turning a uniform distribution into any distribution — inverse transform sampling — is pretty much the same thing in reverse: integrate, then invert. For example, when I saw that taking the square root gave \(v(x) = 2x\), I naturally wondered how to get a straight line going the other way, \(v(x) = 2 – 2x\). Integrating that gives \(2x – x^2\), and then you can use the quadratic formula (or just ask Wolfram Alpha) to solve \(2x – x^2 = u\) for \(x\) and get \(f(u) = 1 – \sqrt{1 – u}\).

Multiply two rolls is a bit more complicated; you have to write out the CDF as an integral and you end up doing a double integral and wow it’s a mess. The only thing I’ve retained is that you do a division somewhere, which then gets integrated, and that’s why it ends up as \(-\ln x\).

And that’s quite enough of that! (Okay but having math in my blog is pretty cool and I will definitely be doing more of this, sorry, not sorry.)

Random vs varied

Sometimes, random isn’t actually what you want. We tend to use the word “random” casually to mean something more like chaotic, i.e., with no discernible pattern. But that’s not really random. In fact, given how good humans can be at finding incidental patterns, they aren’t all that unlikely! Consider that when you roll two dice, they’ll come up either the same or only one apart almost half the time. Coincidence? Well, yes.

If you ask for randomness, you’re saying that any outcome — or series of outcomes — is acceptable, including five heads in a row or five tails in a row. Most of the time, that’s fine. Some of the time, it’s less fine, and what you really want is variety. Here are a couple examples and some fairly easy workarounds.

NPC quips

The nature of games is such that NPCs will eventually run out of things to say, at which point further conversation will give the player a short brush-off quip — a slight nod from the designer to the player that, hey, you hit the end of the script.

Some NPCs have multiple possible quips and will give one at random. The trouble with this is that it’s very possible for an NPC to repeat the same quip several times in a row before abruptly switching to another one. With only a few options to choose from, getting the same option twice or thrice (especially across an entire game, which may have numerous NPCs) isn’t all that unlikely. The notion of an NPC quip isn’t very realistic to start with, but having someone repeat themselves and then abruptly switch to something else is especially jarring.

The easy fix is to show the quips in order! Paradoxically, this is more consistently varied than choosing at random — the original “order” is likely to be meaningless anyway, and it already has the property that the same quip can never appear twice in a row.

If you like, you can shuffle the list of quips every time you reach the end, but take care here — it’s possible that the last quip in the old order will be the same as the first quip in the new order, so you may still get a repeat. (Of course, you can just check for this case and swap the first quip somewhere else if it bothers you.)

That last behavior is, in fact, the canonical way that Tetris chooses pieces — the game simply shuffles a list of all 7 pieces, gives those to you in shuffled order, then shuffles them again to make a new list once it’s exhausted. There’s no avoidance of duplicates, though, so you can still get two S blocks in a row, or even two S and two Z all clumped together, but no more than that. Some Tetris variants take other approaches, such as actively avoiding repeats even several pieces apart or deliberately giving you the worst piece possible.

Random drops

Random drops are often implemented as a flat chance each time. Maybe enemies have a 5% chance to drop health when they die. Legally speaking, over the long term, a player will see health drops for about 5% of enemy kills.

Over the short term, they may be desperate for health and not survive to see the long term. So you may want to put a thumb on the scale sometimes. Games in the Metroid series, for example, have a somewhat infamous bias towards whatever kind of drop they think you need — health if your health is low, missiles if your missiles are low.

I can’t give you an exact approach to use, since it depends on the game and the feeling you’re going for and the variables at your disposal. In extreme cases, you might want to guarantee a health drop from a tough enemy when the player is critically low on health. (Or if you’re feeling particularly evil, you could go the other way and deny the player health when they most need it…)

The problem becomes a little different, and worse, when the event that triggers the drop is relatively rare. The pathological case here would be something like a raid boss in World of Warcraft, which requires hours of effort from a coordinated group of people to defeat, and which has some tiny chance of dropping a good item that will go to only one of those people. This is why I stopped playing World of Warcraft at 60.

Dialing it back a little bit gives us Enter the Gungeon, a roguelike where each room is a set of encounters and each floor only has a dozen or so rooms. Initially, you have a 1% chance of getting a reward after completing a room — but every time you complete a room and don’t get a reward, the chance increases by 9%, up to a cap of 80%. Once you get a reward, the chance resets to 1%.

The natural question is: how frequently, exactly, can a player expect to get a reward? We could do math, or we could Just Simulate The Damn Thing.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
from collections import Counter
import random

histogram = Counter()

TRIALS = 1000000
chance = 1
rooms_cleared = 0
rewards_found = 0
while rewards_found < TRIALS:
    rooms_cleared += 1
    if random.random() * 100 < chance:
        # Reward!
        rewards_found += 1
        histogram[rooms_cleared] += 1
        rooms_cleared = 0
        chance = 1
    else:
        chance = min(80, chance + 9)

for gaps, count in sorted(histogram.items()):
    print(f"{gaps:3d} | {count / TRIALS * 100:6.2f}%", '#' * (count // (TRIALS // 100)))
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
  1 |   0.98%
  2 |   9.91% #########
  3 |  17.00% ################
  4 |  20.23% ####################
  5 |  19.21% ###################
  6 |  15.05% ###############
  7 |   9.69% #########
  8 |   5.07% #####
  9 |   2.09% ##
 10 |   0.63%
 11 |   0.12%
 12 |   0.03%
 13 |   0.00%
 14 |   0.00%
 15 |   0.00%

We’ve got kind of a hilly distribution, skewed to the left, which is up in this histogram. Most of the time, a player should see a reward every three to six rooms, which is maybe twice per floor. It’s vanishingly unlikely to go through a dozen rooms without ever seeing a reward, so a player should see at least one per floor.

Of course, this simulated a single continuous playthrough; when starting the game from scratch, your chance at a reward always starts fresh at 1%, the worst it can be. If you want to know about how many rewards a player will get on the first floor, hey, Just Simulate The Damn Thing.

1
2
3
4
5
6
7
  0 |   0.01%
  1 |  13.01% #############
  2 |  56.28% ########################################################
  3 |  27.49% ###########################
  4 |   3.10% ###
  5 |   0.11%
  6 |   0.00%

Cool. Though, that’s assuming exactly 12 rooms; it might be worth changing that to pick at random in a way that matches the level generator.

(Enter the Gungeon does some other things to skew probability, which is very nice in a roguelike where blind luck can make or break you. For example, if you kill a boss without having gotten a new gun anywhere else on the floor, the boss is guaranteed to drop a gun.)

Critical hits

I suppose this is the same problem as random drops, but backwards.

Say you have a battle sim where every attack has a 6% chance to land a devastating critical hit. Presumably the same rules apply to both the player and the AI opponents.

Consider, then, that the AI opponents have exactly the same 6% chance to ruin the player’s day. Consider also that this gives them an 0.4% chance to critical hit twice in a row. 0.4% doesn’t sound like much, but across an entire playthrough, it’s not unlikely that a player might see it happen and find it incredibly annoying.

Perhaps it would be worthwhile to explicitly forbid AI opponents from getting consecutive critical hits.

In conclusion

An emerging theme here has been to Just Simulate The Damn Thing. So consider Just Simulating The Damn Thing. Even a simple change to a random value can do surprising things to the resulting distribution, so unless you feel like differentiating the inverse function of your code, maybe test out any non-trivial behavior and make sure it’s what you wanted. Probability is hard to reason about.

Simplify Querying Nested JSON with the AWS Glue Relationalize Transform

Post Syndicated from Trevor Roberts original https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/

AWS Glue has a transform called Relationalize that simplifies the extract, transform, load (ETL) process by converting nested JSON into columns that you can easily import into relational databases. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON separated by periods.

Let’s look at how Relationalize can help you with a sample use case.

An example of Relationalize in action

Suppose that the developers of a video game want to use a data warehouse like Amazon Redshift to run reports on player behavior based on data that is stored in JSON. Sample 1 shows example user data from the game. The player named “user1” has characteristics such as race, class, and location in nested JSON data. Further down, the player’s arsenal information includes additional nested JSON data. If the developers want to ETL this data into their data warehouse, they might have to resort to nested loops or recursive functions in their code.

Sample 1: Nested JSON

{
	"player": {
		"username": "user1",
		"characteristics": {
			"race": "Human",
			"class": "Warlock",
			"subclass": "Dawnblade",
			"power": 300,
			"playercountry": "USA"
		},
		"arsenal": {
			"kinetic": {
				"name": "Sweet Business",
				"type": "Auto Rifle",
				"power": 300,
				"element": "Kinetic"
			},
			"energy": {
				"name": "MIDA Mini-Tool",
				"type": "Submachine Gun",
				"power": 300,
				"element": "Solar"
			},
			"power": {
				"name": "Play of the Game",
				"type": "Grenade Launcher",
				"power": 300,
				"element": "Arc"
			}
		},
		"armor": {
			"head": "Eye of Another World",
			"arms": "Philomath Gloves",
			"chest": "Philomath Robes",
			"leg": "Philomath Boots",
			"classitem": "Philomath Bond"
		},
		"location": {
			"map": "Titan",
			"waypoint": "The Rig"
		}
	}
}

Instead, the developers can use the Relationalize transform. Sample 2 shows what the transformed data looks like.

Sample 2: Flattened JSON

{
    "player.username": "user1",
    "player.characteristics.race": "Human",
    "player.characteristics.class": "Warlock",
    "player.characteristics.subclass": "Dawnblade",
    "player.characteristics.power": 300,
    "player.characteristics.playercountry": "USA",
    "player.arsenal.kinetic.name": "Sweet Business",
    "player.arsenal.kinetic.type": "Auto Rifle",
    "player.arsenal.kinetic.power": 300,
    "player.arsenal.kinetic.element": "Kinetic",
    "player.arsenal.energy.name": "MIDA Mini-Tool",
    "player.arsenal.energy.type": "Submachine Gun",
    "player.arsenal.energy.power": 300,
    "player.arsenal.energy.element": "Solar",
    "player.arsenal.power.name": "Play of the Game",
    "player.arsenal.power.type": "Grenade Launcher",
    "player.arsenal.power.power": 300,
    "player.arsenal.power.element": "Arc",
    "player.armor.head": "Eye of Another World",
    "player.armor.arms": "Philomath Gloves",
    "player.armor.chest": "Philomath Robes",
    "player.armor.leg": "Philomath Boots",
    "player.armor.classitem": "Philomath Bond",
    "player.location.map": "Titan",
    "player.location.waypoint": "The Rig"
}

You can then write the data to a database or to a data warehouse. You can also write it to delimited text files, such as in comma-separated value (CSV) format, or columnar file formats such as Optimized Row Columnar (ORC) format. You can use either of these format types for long-term storage in Amazon S3. Storing the transformed files in S3 provides the additional benefit of being able to query this data using Amazon Athena or Amazon Redshift Spectrum. You can further extend the usefulness of the data by performing joins between data stored in S3 and the data stored in an Amazon Redshift data warehouse.

Before we get started…

In my example, I took two preparatory steps that save some time in your ETL code development:

  1. I stored my data in an Amazon S3 bucket and used an AWS Glue crawler to make my data available in the AWS Glue data catalog. You can find instructions on how to do that in Cataloging Tables with a Crawler in the AWS Glue documentation. The AWS Glue database name I used was “blog,” and the table name was “players.” You can see these values in use in the sample code that follows.
  2. I deployed a Zeppelin notebook using the automated deployment available within AWS Glue. If you already used an AWS Glue development endpoint to deploy a Zeppelin notebook, you can skip the deployment instructions. Otherwise, let’s quickly review how to deploy Zeppelin.

Deploying a Zeppelin notebook with AWS Glue

The following steps are outlined in the AWS Glue documentation, and I include a few screenshots here for clarity.

First, create two IAM roles:

Next, in the AWS Glue Management Console, choose Dev endpoints, and then choose Add endpoint.

Specify a name for the endpoint and the AWS Glue IAM role that you created.

On the networking screen, choose Skip Networking because our code only communicates with S3.

Complete the development endpoint process by providing a Secure Shell (SSH) public key and confirming your settings.

When your new development endpoint’s Provisioning status changes from PROVISIONING to READY, choose your endpoint, and then for Actions choose Create notebook server.

Enter the notebook server details, including the role you previously created and a security group with inbound access allowed on TCP port 443.

Doing this automatically launches an AWS CloudFormation template. The output specifies the URL that you can use to access your Zeppelin notebook with the username and password you specified in the wizard.

How do we flatten nested JSON?

With my data loaded and my notebook server ready, I accessed Zeppelin, created a new note, and set my interpreter to spark. I used some Python code that AWS Glue previously generated for another job that outputs to ORC. Then I added the Relationalize transform. You can see the resulting Python code in Sample 3.­

Sample 3: Python code to transform the nested JSON and output it to ORC

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
#from awsglue.transforms import Relationalize

# Begin variables to customize with your information
glue_source_database = "blog"
glue_source_table = "players"
glue_temp_storage = "s3://blog-example-edz/temp"
glue_relationalize_output_s3_path = "s3://blog-example-edz/output-flat"
dfc_root_table_name = "root" #default value is "roottable"
# End variables to customize with your information

glueContext = GlueContext(spark.sparkContext)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
blogdata = dfc.select(dfc_root_table_name)
blogdataoutput = glueContext.write_dynamic_frame.from_options(frame = blogdata, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path}, format = "orc", transformation_ctx = "blogdataoutput")

What exactly is going on in this script?

After the import statements, we instantiate a GlueContext object, which allows us to work with the data in AWS Glue. Next, we create a DynamicFrame (datasource0) from the “players” table in the AWS Glue “blog” database. We use this DynamicFrame to perform any necessary operations on the data structure before it’s written to our desired output format. The source files remain unchanged.

We then run the Relationalize transform (Relationalize.apply()) with our datasource0 as one of the parameters. Another important parameter is the name parameter, which is a key that identifies our data after the transformation completes.

The Relationalize.apply() method returns a DynamicFrameCollection, and this is stored in the dfc variable. Before we can write our data to S3, we need to select the DynamicFrame from the DynamicFrameCollection object. We do this with the dfc.select() method. The correct DynamicFrame is stored in the blogdata variable.

You might be curious why a DynamicFrameCollection was returned when we started with a single DynamicFrame. This return value comes from the way Relationalize treats arrays in the JSON document: A DynamicFrame is created for each array. Together with the root data structure, each generated DynamicFrame is added to a DynamicFrameCollection when Relationalize completes its work. Although we didn’t have any arrays in our data, it’s good to keep this in mind. Finally, we output (blogdataoutput) the root DynamicFrame to ORC files in S3.

Using the transformed data

One of the use cases we discussed earlier was using Amazon Athena or Amazon Redshift Spectrum to query the ORC files.

I used the following SQL DDL statements to create external tables in both services to enable queries of my data stored in Amazon S3.

Sample 4: Amazon Athena DDL

CREATE EXTERNAL TABLE IF NOT EXISTS blog.blog_data_athena_test (
  `characteristics_race` string,
  `characteristics_class` string,
  `characteristics_subclass` string,
  `characteristics_power` int,
  `characteristics_playercountry` string,
  `kinetic_name` string,
  `kinetic_type` string,
  `kinetic_power` int,
  `kinetic_element` string,
  `energy_name` string,
  `energy_type` string,
  `energy_power` int,
  `energy_element` string,
  `power_name` string,
  `power_type` string,
  `power_power` int,
  `power_element` string,
  `armor_head` string,
  `armor_arms` string,
  `armor_chest` string,
  `armor_leg` string,
  `armor_classitem` string,
  `map` string,
  `waypoint` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://blog-example-edz/output-flat/'
TBLPROPERTIES ('has_encrypted_data'='false');

 

Sample 5: Amazon Redshift Spectrum DDL

-- Create a Schema
-- A single schema can be used with multiple external tables.
-- This step is only required once for the external tables you create.
create external schema spectrum 
from data catalog 
database 'blog' 
iam_role 'arn:aws:iam::0123456789:role/redshift-role'
create external database if not exists;

-- Create an external table in the schema
create external table spectrum.blog(
  username VARCHAR,
  characteristics_race VARCHAR,
  characteristics_class VARCHAR,
  characteristics_subclass VARCHAR,
  characteristics_power INTEGER,
  characteristics_playercountry VARCHAR,
  kinetic_name VARCHAR,
  kinetic_type VARCHAR,
  kinetic_power INTEGER,
  kinetic_element VARCHAR,
  energy_name VARCHAR,
  energy_type VARCHAR,
  energy_power INTEGER,
  energy_element VARCHAR,
  power_name VARCHAR,
  power_type VARCHAR,
  power_power INTEGER,
  power_element VARCHAR,
  armor_head VARCHAR,
  armor_arms VARCHAR,
  armor_chest VARCHAR,
  armor_leg VARCHAR,
  armor_classItem VARCHAR,
  map VARCHAR,
  waypoint VARCHAR)
stored as orc
location 's3://blog-example-edz/output-flat';

I even ran a query, shown in Sample 6, that joined my Redshift Spectrum table (spectrum.playerdata) with data in an Amazon Redshift table (public.raids) to generate advanced reports. In the where clause, I join the two tables based on the username values that are common to both data sources.

Sample 6: Select statement with a join of Redshift Spectrum data with Amazon Redshift data

-- Get Total Raid Completions for the Hunter Class.
select spectrum.playerdata.characteristics_class as class, sum(public.raids."completions.val.raids.leviathan") as "Total Hunter Leviathan Raid Completions" from spectrum.playerdata, public.raids
where spectrum.playerdata.username = public.raids."completions.val.username"
and spectrum.playerdata.characteristics_class = 'Hunter'
group by spectrum.playerdata.characteristics_class;

Summary

This post demonstrated how simple it can be to flatten nested JSON data with AWS Glue, using the Relationalize transform to automate the conversion of nested JSON. AWS Glue also automates the deployment of Zeppelin notebooks that you can use to develop your Python automation script. Finally, AWS Glue can output the transformed data directly to a relational database, or to files in Amazon S3 for further analysis with tools such as Amazon Athena and Amazon Redshift Spectrum.

As great as Relationalize is, it’s not the only transform available with AWS Glue. You can see a complete list of available transforms in Built-In Transforms in the AWS Glue documentation. Try them out today!


Additional Reading

If you found this post useful, be sure to check out Using Amazon Redshift Spectrum, Amazon Athena and AWS Glue with Node.js in Production and Build a Data Lake Foundation with AWS Glue and Amazon S3.


About the Author

Trevor Roberts Jr is a Solutions Architect with AWS. He provides architectural guidance to help customers achieve success in the cloud. In his spare time, Trevor enjoys traveling to new places and spending time with family.

Resume AWS Step Functions from Any State

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/resume-aws-step-functions-from-any-state/


Yash Pant, Solutions Architect, AWS


Aaron Friedman, Partner Solutions Architect, AWS

When we discuss how to build applications with customers, we often align to the Well Architected Framework pillars of security, reliability, performance efficiency, cost optimization, and operational excellence. Designing for failure is an essential component to developing well architected applications that are resilient to spurious errors that may occur.

There are many ways you can use AWS services to achieve high availability and resiliency of your applications. For example, you can couple Elastic Load Balancing with Auto Scaling and Amazon EC2 instances to build highly available applications. Or use Amazon API Gateway and AWS Lambda to rapidly scale out a microservices-based architecture. Many AWS services have built in solutions to help with the appropriate error handling, such as Dead Letter Queues (DLQ) for Amazon SQS or retries in AWS Batch.

AWS Step Functions is an AWS service that makes it easy for you to coordinate the components of distributed applications and microservices. Step Functions allows you to easily design for failure, by incorporating features such as error retries and custom error handling from AWS Lambda exceptions. These features allow you to programmatically handle many common error modes and build robust, reliable applications.

In some rare cases, however, your application may fail in an unexpected manner. In these situations, you might not want to duplicate in a repeat execution those portions of your state machine that have already run. This is especially true when orchestrating long-running jobs or executing a complex state machine as part of a microservice. Here, you need to know the last successful state in your state machine from which to resume, so that you don’t duplicate previous work. In this post, we present a solution to enable you to resume from any given state in your state machine in the case of an unexpected failure.

Resuming from a given state

To resume a failed state machine execution from the state at which it failed, you first run a script that dynamically creates a new state machine. When the new state machine is executed, it resumes the failed execution from the point of failure. The script contains the following two primary steps:

  1. Parse the execution history of the failed execution to find the name of the state at which it failed, as well as the JSON input to that state.
  2. Create a new state machine, which adds an additional state to failed state machine, called "GoToState". "GoToState" is a choice state at the beginning of the state machine that branches execution directly to the failed state, allowing you to skip states that had succeeded in the previous execution.

The full script along with a CloudFormation template that creates a demo of this is available in the aws-sfn-resume-from-any-state GitHub repo.

Diving into the script

In this section, we walk you through the script and highlight the core components of its functionality. The script contains a main function, which adds a command line parameter for the failedExecutionArn so that you can easily call the script from the command line:

python gotostate.py --failedExecutionArn '<Failed_Execution_Arn>'

Identifying the failed state in your execution

First, the script extracts the name of the failed state along with the input to that state. It does so by using the failed state machine execution history, which is identified by the Amazon Resource Name (ARN) of the execution. The failed state is marked in the execution history, along with the input to that state (which is also the output of the preceding successful state). The script is able to parse these values from the log.

The script loops through the execution history of the failed state machine, and traces it backwards until it finds the failed state. If the state machine failed in a parallel state, then it must restart from the beginning of the parallel state. The script is able to capture the name of the parallel state that failed, rather than any substate within the parallel state that may have caused the failure. The following code is the Python function that does this.


def parseFailureHistory(failedExecutionArn):

    '''
    Parses the execution history of a failed state machine to get the name of failed state and the input to the failed state:
    Input failedExecutionArn = A string containing the execution ARN of a failed state machine y
    Output = A list with two elements: [name of failed state, input to failed state]
    '''
    failedAtParallelState = False
    try:
        #Get the execution history
        response = client.get\_execution\_history(
            executionArn=failedExecutionArn,
            reverseOrder=True
        )
        failedEvents = response['events']
    except Exception as ex:
        raise ex
    #Confirm that the execution actually failed, raise exception if it didn't fail.
    try:
        failedEvents[0]['executionFailedEventDetails']
    except:
        raise('Execution did not fail')
        
    '''
    If you have a 'States.Runtime' error (for example, if a task state in your state machine attempts to execute a Lambda function in a different region than the state machine), get the ID of the failed state, and use it to determine the failed state name and input.
    '''
    
    if failedEvents[0]['executionFailedEventDetails']['error'] == 'States.Runtime':
        failedId = int(filter(str.isdigit, str(failedEvents[0]['executionFailedEventDetails']['cause'].split()[13])))
        failedState = failedEvents[-1 \* failedId]['stateEnteredEventDetails']['name']
        failedInput = failedEvents[-1 \* failedId]['stateEnteredEventDetails']['input']
        return (failedState, failedInput)
        
    '''
    You need to loop through the execution history, tracing back the executed steps.
    The first state you encounter is the failed state. If you failed on a parallel state, you need the name of the parallel state rather than the name of a state within a parallel state that it failed on. This is because you can only attach goToState to the parallel state, but not a substate within the parallel state.
    This loop starts with the ID of the latest event and uses the previous event IDs to trace back the execution to the beginning (id 0). However, it returns as soon it finds the name of the failed state.
    '''

    currentEventId = failedEvents[0]['id']
    while currentEventId != 0:
        #multiply event ID by -1 for indexing because you're looking at the reversed history
        currentEvent = failedEvents[-1 \* currentEventId]
        
        '''
        You can determine if the failed state was a parallel state because it and an event with 'type'='ParallelStateFailed' appears in the execution history before the name of the failed state
        '''

        if currentEvent['type'] == 'ParallelStateFailed':
            failedAtParallelState = True

        '''
        If the failed state is not a parallel state, then the name of failed state to return is the name of the state in the first 'TaskStateEntered' event type you run into when tracing back the execution history
        '''

        if currentEvent['type'] == 'TaskStateEntered' and failedAtParallelState == False:
            failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)

        '''
        If the failed state was a parallel state, then you need to trace execution back to the first event with 'type'='ParallelStateEntered', and return the name of the state
        '''

        if currentEvent['type'] == 'ParallelStateEntered' and failedAtParallelState:
            failedState = failedState = currentEvent['stateEnteredEventDetails']['name']
            failedInput = currentEvent['stateEnteredEventDetails']['input']
            return (failedState, failedInput)
        #Update the ID for the next execution of the loop
        currentEventId = currentEvent['previousEventId']
        

Create the new state machine

The script uses the name of the failed state to create the new state machine, with "GoToState" branching execution directly to the failed state.

To do this, the script requires the Amazon States Language (ASL) definition of the failed state machine. It modifies the definition to append "GoToState", and create a new state machine from it.

The script gets the ARN of the failed state machine from the execution ARN of the failed state machine. This ARN allows it to get the ASL definition of the failed state machine by calling the DesribeStateMachine API action. It creates a new state machine with "GoToState".

When the script creates the new state machine, it also adds an additional input variable called "resuming". When you execute this new state machine, you specify this resuming variable as true in the input JSON. This tells "GoToState" to branch execution to the state that had previously failed. Here’s the function that does this:

def attachGoToState(failedStateName, stateMachineArn):

    '''
    Given a state machine ARN and the name of a state in that state machine, create a new state machine that starts at a new choice state called 'GoToState'. "GoToState" branches to the named state, and sends the input of the state machine to that state, when a variable called "resuming" is set to True.
    Input failedStateName = A string with the name of the failed state
          stateMachineArn = A string with the ARN of the state machine
    Output response from the create_state_machine call, which is the API call that creates a new state machine
    '''

    try:
        response = client.describe\_state\_machine(
            stateMachineArn=stateMachineArn
        )
    except:
        raise('Could not get ASL definition of state machine')
    roleArn = response['roleArn']
    stateMachine = json.loads(response['definition'])
    #Create a name for the new state machine
    newName = response['name'] + '-with-GoToState'
    #Get the StartAt state for the original state machine, because you point the 'GoToState' to this state
    originalStartAt = stateMachine['StartAt']

    '''
    Create the GoToState with the variable $.resuming.
    If new state machine is executed with $.resuming = True, then the state machine skips to the failed state.
    Otherwise, it executes the state machine from the original start state.
    '''

    goToState = {'Type':'Choice', 'Choices':[{'Variable':'$.resuming', 'BooleanEquals':False, 'Next':originalStartAt}], 'Default':failedStateName}
    #Add GoToState to the set of states in the new state machine
    stateMachine['States']['GoToState'] = goToState
    #Add StartAt
    stateMachine['StartAt'] = 'GoToState'
    #Create new state machine
    try:
        response = client.create_state_machine(
            name=newName,
            definition=json.dumps(stateMachine),
            roleArn=roleArn
        )
    except:
        raise('Failed to create new state machine with GoToState')
    return response

Testing the script

Now that you understand how the script works, you can test it out.

The following screenshot shows an example state machine that has failed, called "TestMachine". This state machine successfully completed "FirstState" and "ChoiceState", but when it branched to "FirstMatchState", it failed.

Use the script to create a new state machine that allows you to rerun this state machine, but skip the "FirstState" and the "ChoiceState" steps that already succeeded. You can do this by calling the script as follows:

python gotostate.py --failedExecutionArn 'arn:aws:states:us-west-2:<AWS_ACCOUNT_ID>:execution:TestMachine-with-GoToState:b2578403-f41d-a2c7-e70c-7500045288595

This creates a new state machine called "TestMachine-with-GoToState", and returns its ARN, along with the input that had been sent to "FirstMatchState". You can then inspect the input to determine what caused the error. In this case, you notice that the input to "FirstMachState" was the following:

{
"foo": 1,
"Message": true
}

However, this state machine expects the "Message" field of the JSON to be a string rather than a Boolean. Execute the new "TestMachine-with-GoToState" state machine, change the input to be a string, and add the "resuming" variable that "GoToState" requires:

{
"foo": 1,
"Message": "Hello!",
"resuming":true
}

When you execute the new state machine, it skips "FirstState" and "ChoiceState", and goes directly to "FirstMatchState", which was the state that failed:

Look at what happens when you have a state machine with multiple parallel steps. This example is included in the GitHub repository associated with this post. The repo contains a CloudFormation template that sets up this state machine and provides instructions to replicate this solution.

The following state machine, "ParallelStateMachine", takes an input through two subsequent parallel states before doing some final processing and exiting, along with the JSON with the ASL definition of the state machine.

{
  "Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "ResultPath":"$.output",
      "Next": "Parallel 2",
      "Branches": [
        {
          "StartAt": "Parallel Step 1, Process 1",
          "States": {
            "Parallel Step 1, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 1, Process 2",
          "States": {
            "Parallel Step 1, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
              "End": true
            }
          }
        }
      ]
    },
    "Parallel 2": {
      "Type": "Parallel",
      "Next": "Final Processing",
      "Branches": [
        {
          "StartAt": "Parallel Step 2, Process 1",
          "States": {
            "Parallel Step 2, Process 1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        },
        {
          "StartAt": "Parallel Step 2, Process 2",
          "States": {
            "Parallel Step 2, Process 2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
              "End": true
            }
          }
        }
      ]
    },
    "Final Processing": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
      "End": true
    }
  }
}

First, use an input that initially fails:

{
  "Message": "Hello!"
}

This fails because the state machine expects you to have a variable in the input JSON called "foo" in the second parallel state to run "Parallel Step 2, Process 1" and "Parallel Step 2, Process 2". Instead, the original input gets processed by the first parallel state and produces the following output to pass to the second parallel state:

{
"output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ],
}

Run the script on the failed state machine to create a new state machine that allows it to resume directly at the second parallel state instead of having to redo the first parallel state. This creates a new state machine called "ParallelStateMachine-with-GoToState". The following JSON was created by the script to define the new state machine in ASL. It contains the "GoToState" value that was attached by the script.

{
   "Comment":"An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
   "States":{
      "Final Processing":{
         "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaC",
         "End":true,
         "Type":"Task"
      },
      "GoToState":{
         "Default":"Parallel 2",
         "Type":"Choice",
         "Choices":[
            {
               "Variable":"$.resuming",
               "BooleanEquals":false,
               "Next":"Parallel"
            }
         ]
      },
      "Parallel":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 1, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 1"
            },
            {
               "States":{
                  "Parallel Step 1, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:LambdaA",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 1, Process 2"
            }
         ],
         "ResultPath":"$.output",
         "Type":"Parallel",
         "Next":"Parallel 2"
      },
      "Parallel 2":{
         "Branches":[
            {
               "States":{
                  "Parallel Step 2, Process 1":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 1"
            },
            {
               "States":{
                  "Parallel Step 2, Process 2":{
                     "Resource":"arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:LambdaB",
                     "End":true,
                     "Type":"Task"
                  }
               },
               "StartAt":"Parallel Step 2, Process 2"
            }
         ],
         "Type":"Parallel",
         "Next":"Final Processing"
      }
   },
   "StartAt":"GoToState"
}

You can then execute this state machine with the correct input by adding the "foo" and "resuming" variables:

{
  "foo": 1,
  "output": [
    {
      "Message": "Hello!"
    },
    {
      "Message": "Hello!"
    }
  ],
  "resuming": true
}

This yields the following result. Notice that this time, the state machine executed successfully to completion, and skipped the steps that had previously failed.


Conclusion

When you’re building out complex workflows, it’s important to be prepared for failure. You can do this by taking advantage of features such as automatic error retries in Step Functions and custom error handling of Lambda exceptions.

Nevertheless, state machines still have the possibility of failing. With the methodology and script presented in this post, you can resume a failed state machine from its point of failure. This allows you to skip the execution of steps in the workflow that had already succeeded, and recover the process from the point of failure.

To see more examples, please visit the Step Functions Getting Started page.

If you have questions or suggestions, please comment below.

Building a Multi-region Serverless Application with Amazon API Gateway and AWS Lambda

Post Syndicated from Stefano Buliani original https://aws.amazon.com/blogs/compute/building-a-multi-region-serverless-application-with-amazon-api-gateway-and-aws-lambda/

This post written by: Magnus Bjorkman – Solutions Architect

Many customers are looking to run their services at global scale, deploying their backend to multiple regions. In this post, we describe how to deploy a Serverless API into multiple regions and how to leverage Amazon Route 53 to route the traffic between regions. We use latency-based routing and health checks to achieve an active-active setup that can fail over between regions in case of an issue. We leverage the new regional API endpoint feature in Amazon API Gateway to make this a seamless process for the API client making the requests. This post does not cover the replication of your data, which is another aspect to consider when deploying applications across regions.

Solution overview

Currently, the default API endpoint type in API Gateway is the edge-optimized API endpoint, which enables clients to access an API through an Amazon CloudFront distribution. This typically improves connection time for geographically diverse clients. By default, a custom domain name is globally unique and the edge-optimized API endpoint would invoke a Lambda function in a single region in the case of Lambda integration. You can’t use this type of endpoint with a Route 53 active-active setup and fail-over.

The new regional API endpoint in API Gateway moves the API endpoint into the region and the custom domain name is unique per region. This makes it possible to run a full copy of an API in each region and then use Route 53 to use an active-active setup and failover. The following diagram shows how you do this:

Active/active multi region architecture

  • Deploy your Rest API stack, consisting of API Gateway and Lambda, in two regions, such as us-east-1 and us-west-2.
  • Choose the regional API endpoint type for your API.
  • Create a custom domain name and choose the regional API endpoint type for that one as well. In both regions, you are configuring the custom domain name to be the same, for example, helloworldapi.replacewithyourcompanyname.com
  • Use the host name of the custom domain names from each region, for example, xxxxxx.execute-api.us-east-1.amazonaws.com and xxxxxx.execute-api.us-west-2.amazonaws.com, to configure record sets in Route 53 for your client-facing domain name, for example, helloworldapi.replacewithyourcompanyname.com

The above solution provides an active-active setup for your API across the two regions, but you are not doing failover yet. For that to work, set up a health check in Route 53:

Route 53 Health Check

A Route 53 health check must have an endpoint to call to check the health of a service. You could do a simple ping of your actual Rest API methods, but instead provide a specific method on your Rest API that does a deep ping. That is, it is a Lambda function that checks the status of all the dependencies.

In the case of the Hello World API, you don’t have any other dependencies. In a real-world scenario, you could check on dependencies as databases, other APIs, and external dependencies. Route 53 health checks themselves cannot use your custom domain name endpoint’s DNS address, so you are going to directly call the API endpoints via their region unique endpoint’s DNS address.

Walkthrough

The following sections describe how to set up this solution. You can find the complete solution at the blog-multi-region-serverless-service GitHub repo. Clone or download the repository locally to be able to do the setup as described.

Prerequisites

You need the following resources to set up the solution described in this post:

  • AWS CLI
  • An S3 bucket in each region in which to deploy the solution, which can be used by the AWS Serverless Application Model (SAM). You can use the following CloudFormation templates to create buckets in us-east-1 and us-west-2:
    • us-east-1:
    • us-west-2:
  • A hosted zone registered in Amazon Route 53. This is used for defining the domain name of your API endpoint, for example, helloworldapi.replacewithyourcompanyname.com. You can use a third-party domain name registrar and then configure the DNS in Amazon Route 53, or you can purchase a domain directly from Amazon Route 53.

Deploy API with health checks in two regions

Start by creating a small “Hello World” Lambda function that sends back a message in the region in which it has been deployed.


"""Return message."""
import logging

logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Lambda handler for getting the hello world message."""

    region = context.invoked_function_arn.split(':')[3]

    logger.info("message: " + "Hello from " + region)
    
    return {
		"message": "Hello from " + region
    }

Also create a Lambda function for doing a health check that returns a value based on another environment variable (either “ok” or “fail”) to allow for ease of testing:


"""Return health."""
import logging
import os

logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Lambda handler for getting the health."""

    logger.info("status: " + os.environ['STATUS'])
    
    return {
		"status": os.environ['STATUS']
    }

Deploy both of these using an AWS Serverless Application Model (SAM) template. SAM is a CloudFormation extension that is optimized for serverless, and provides a standard way to create a complete serverless application. You can find the full helloworld-sam.yaml template in the blog-multi-region-serverless-service GitHub repo.

A few things to highlight:

  • You are using inline Swagger to define your API so you can substitute the current region in the x-amazon-apigateway-integration section.
  • Most of the Swagger template covers CORS to allow you to test this from a browser.
  • You are also using substitution to populate the environment variable used by the “Hello World” method with the region into which it is being deployed.

The Swagger allows you to use the same SAM template in both regions.

You can only use SAM from the AWS CLI, so do the following from the command prompt. First, deploy the SAM template in us-east-1 with the following commands, replacing “<your bucket in us-east-1>” with a bucket in your account:


> cd helloworld-api
> aws cloudformation package --template-file helloworld-sam.yaml --output-template-file /tmp/cf-helloworld-sam.yaml --s3-bucket <your bucket in us-east-1> --region us-east-1
> aws cloudformation deploy --template-file /tmp/cf-helloworld-sam.yaml --stack-name multiregionhelloworld --capabilities CAPABILITY_IAM --region us-east-1

Second, do the same in us-west-2:


> aws cloudformation package --template-file helloworld-sam.yaml --output-template-file /tmp/cf-helloworld-sam.yaml --s3-bucket <your bucket in us-west-2> --region us-west-2
> aws cloudformation deploy --template-file /tmp/cf-helloworld-sam.yaml --stack-name multiregionhelloworld --capabilities CAPABILITY_IAM --region us-west-2

The API was created with the default endpoint type of Edge Optimized. Switch it to Regional. In the Amazon API Gateway console, select the API that you just created and choose the wheel-icon to edit it.

API Gateway edit API settings

In the edit screen, select the Regional endpoint type and save the API. Do the same in both regions.

Grab the URL for the API in the console by navigating to the method in the prod stage.

API Gateway endpoint link

You can now test this with curl:


> curl https://2wkt1cxxxx.execute-api.us-west-2.amazonaws.com/prod/helloworld
{"message": "Hello from us-west-2"}

Write down the domain name for the URL in each region (for example, 2wkt1cxxxx.execute-api.us-west-2.amazonaws.com), as you need that later when you deploy the Route 53 setup.

Create the custom domain name

Next, create an Amazon API Gateway custom domain name endpoint. As part of using this feature, you must have a hosted zone and domain available to use in Route 53 as well as an SSL certificate that you use with your specific domain name.

You can create the SSL certificate by using AWS Certificate Manager. In the ACM console, choose Get started (if you have no existing certificates) or Request a certificate. Fill out the form with the domain name to use for the custom domain name endpoint, which is the same across the two regions:

Amazon Certificate Manager request new certificate

Go through the remaining steps and validate the certificate for each region before moving on.

You are now ready to create the endpoints. In the Amazon API Gateway console, choose Custom Domain Names, Create Custom Domain Name.

API Gateway create custom domain name

A few things to highlight:

  • The domain name is the same as what you requested earlier through ACM.
  • The endpoint configuration should be regional.
  • Select the ACM Certificate that you created earlier.
  • You need to create a base path mapping that connects back to your earlier API Gateway endpoint. Set the base path to v1 so you can version your API, and then select the API and the prod stage.

Choose Save. You should see your newly created custom domain name:

API Gateway custom domain setup

Note the value for Target Domain Name as you need that for the next step. Do this for both regions.

Deploy Route 53 setup

Use the global Route 53 service to provide DNS lookup for the Rest API, distributing the traffic in an active-active setup based on latency. You can find the full CloudFormation template in the blog-multi-region-serverless-service GitHub repo.

The template sets up health checks, for example, for us-east-1:


HealthcheckRegion1:
  Type: "AWS::Route53::HealthCheck"
  Properties:
    HealthCheckConfig:
      Port: "443"
      Type: "HTTPS_STR_MATCH"
      SearchString: "ok"
      ResourcePath: "/prod/healthcheck"
      FullyQualifiedDomainName: !Ref Region1HealthEndpoint
      RequestInterval: "30"
      FailureThreshold: "2"

Use the health check when you set up the record set and the latency routing, for example, for us-east-1:


Region1EndpointRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Region: us-east-1
    HealthCheckId: !Ref HealthcheckRegion1
    SetIdentifier: "endpoint-region1"
    HostedZoneId: !Ref HostedZoneId
    Name: !Ref MultiregionEndpoint
    Type: CNAME
    TTL: 60
    ResourceRecords:
      - !Ref Region1Endpoint

You can create the stack by using the following link, copying in the domain names from the previous section, your existing hosted zone name, and the main domain name that is created (for example, hellowordapi.replacewithyourcompanyname.com):

The following screenshot shows what the parameters might look like:
Serverless multi region Route 53 health check

Specifically, the domain names that you collected earlier would map according to following:

  • The domain names from the API Gateway “prod”-stage go into Region1HealthEndpoint and Region2HealthEndpoint.
  • The domain names from the custom domain name’s target domain name goes into Region1Endpoint and Region2Endpoint.

Using the Rest API from server-side applications

You are now ready to use your setup. First, demonstrate the use of the API from server-side clients. You can demonstrate this by using curl from the command line:


> curl https://hellowordapi.replacewithyourcompanyname.com/v1/helloworld/
{"message": "Hello from us-east-1"}

Testing failover of Rest API in browser

Here’s how you can use this from the browser and test the failover. Find all of the files for this test in the browser-client folder of the blog-multi-region-serverless-service GitHub repo.

Use this html file:


<!DOCTYPE HTML>
<html>
<head>
    <meta charset="utf-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=edge"/>
    <meta name="viewport" content="width=device-width, initial-scale=1"/>
    <title>Multi-Region Client</title>
</head>
<body>
<div>
   <h1>Test Client</h1>

    <p id="client_result">

    </p>

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <script src="settings.js"></script>
    <script src="client.js"></script>
</body>
</html>

The html file uses this JavaScript file to repeatedly call the API and print the history of messages:


var messageHistory = "";

(function call_service() {

   $.ajax({
      url: helloworldMultiregionendpoint+'v1/helloworld/',
      dataType: "json",
      cache: false,
      success: function(data) {
         messageHistory+="<p>"+data['message']+"</p>";
         $('#client_result').html(messageHistory);
      },
      complete: function() {
         // Schedule the next request when the current one's complete
         setTimeout(call_service, 10000);
      },
      error: function(xhr, status, error) {
         $('#client_result').html('ERROR: '+status);
      }
   });

})();

Also, make sure to update the settings in settings.js to match with the API Gateway endpoints for the DNS-proxy and the multi-regional endpoint for the Hello World API: var helloworldMultiregionendpoint = "https://hellowordapi.replacewithyourcompanyname.com/";

You can now open the HTML file in the browser (you can do this directly from the file system) and you should see something like the following screenshot:

Serverless multi region browser test

You can test failover by changing the environment variable in your health check Lambda function. In the Lambda console, select your health check function and scroll down to the Environment variables section. For the STATUS key, modify the value to fail.

Lambda update environment variable

You should see the region switch in the test client:

Serverless multi region broker test switchover

During an emulated failure like this, the browser might take some additional time to switch over due to connection keep-alive functionality. If you are using a browser like Chrome, you can kill all the connections to see a more immediate fail-over: chrome://net-internals/#sockets

Summary

You have implemented a simple way to do multi-regional serverless applications that fail over seamlessly between regions, either being accessed from the browser or from other applications/services. You achieved this by using the capabilities of Amazon Route 53 to do latency based routing and health checks for fail-over. You unlocked the use of these features in a serverless application by leveraging the new regional endpoint feature of Amazon API Gateway.

The setup was fully scripted using CloudFormation, the AWS Serverless Application Model (SAM), and the AWS CLI, and it can be integrated into deployment tools to push the code across the regions to make sure it is available in all the needed regions. For more information about cross-region deployments, see Building a Cross-Region/Cross-Account Code Deployment Solution on AWS on the AWS DevOps blog.

I Still Prefer Eclipse Over IntelliJ IDEA

Post Syndicated from Bozho original https://techblog.bozho.net/still-prefer-eclipse-intellij-idea/

Over the years I’ve observed an inevitable shift from Eclipse to IntelliJ IDEA. Last year they were almost equal in usage, and I have the feeling things are swaying even more towards IDEA.

IDEA is like the iPhone of IDEs – its users tell you that “you will feel how much better it is once you get used to it”, “are you STILL using Eclipse??”, “IDEA is so much better, I thought everyone has switched”, etc.

I’ve been using mostly Eclipse for the past 12 years, but in some cases I did use IDEA – when I was writing Scala, when I was writing Android, and most recently – when Eclipse failed to be ready for the Java 9 release, so after half a day of trying to get it working, I just switched to IDEA until Eclipse finally gets a working Java 9 version (with Maven and the rest of the stuff).

But I will get back to Eclipse again, soon. And I still prefer it. Not just because of all the key combinations I’ve internalized (you can reuse those in IDEA), but because there are still things I find worse in IDEA. Of course, IDEA has so much more cool features like code improvement suggestions and actually working plugins for everything. But at least some of the problems I see have to do with the more basic development workflow and experience. And you can’t compensate for those with sugarcoating. So here they are:

  • Projects are not automatically built (by default), so you can end up with compilation errors that you don’t see until you open a non-compiling file or run a build. And turning the autobild on makes my machine crawl. I know I need an upgrade, but that’s not the point – not having “build on change” was a huge surprise to me the first time I tried IDEA. I recently complained about that on twitter and it turns out “it’s a feature”. The rationale seems to be that if you use refactoring, that shouldn’t happen. Well, there are dozens of cases when it does happen. Refactoring by adding a method parameter, by changing the type of a parameter, by removing a parameter (where the IDE can’t infer which parameter is removed based on the types), by changing return types. Also, a change in maven/gradle dependencies may introduces compilation issues that you don’t get to see. This is not a reasonable default at all, and I think the performance issues are the only reason it’s still the default. I think this makes the experience much worse.
  • You can have only one project per screen. Maybe there are those small companies with greenfield projects where you only need one. But I’ve never been in a situation, where you don’t at least occasionally need a separate project. Be it an “experiments” one, a “tools” one, or whatever. And no, multi-module maven projects (which IDEA handles well) are not sufficient. So each time you need to step out of your main project, you launch another screen. Apart from the bad usability, it’s double the memory, double the fun.
  • Speaking of memory, It seems to be taking more memory than Eclipse. I don’t have representative benchmarks of that, and I know that my 8 GB RAM home machine is way to small for development nowadays, but still.
  • It feels less responsive and clunky. There is some minor delay that I can’t define well, but “I feel it”. I read somewhere that they were excessively repainting the screen elements, so that might be the explanation. Eclipse feels smoother (I know that’s not a proper argument, but I can’t be more precise)
  • Due to some extra cleverness, I have “unused methods” and “never assigned fields” all around the project. It uses spring, so these methods and fields are controller methods and autowired fields. Maybe some spring plugin would take care of that, but spring is not the only framework that uses reflection. Even getters and setters on POJOs get the unused warnings. What’s the problem with those warnings? That warnings are devalued. They don’t mean anything now. There isn’t a “yellow” indicator on the class either, so you don’t actually see the amount of warnings you have. Eclipse displays warnings better, and the false positives are much less.
  • The call hierarchy is slightly worse. But since that’s the most important IDE feature for me (alongside refactoring), it matters. It doesn’t give you the call hierarchy of default constructors that are not explicitly defined. Also, from what I’ve seen IDEA users don’t often use the call hierarchy feature. “Find usage” I think predates the call hierarchy, and is also much more visible through the UI, so some of the IDEA users don’t even know what a call hierarchy is. And repeatedly do “find usage”. That’s only partly the IDE’s fault.
  • No search in the output console. Come one, why I do I have an IDE, where I have to copy the output and paste it in a text editor in order to search. Now, to clarify, the console does have search. But when I run my (spring-boot) application, it outputs stuff in a panel at the bottom that is not the console and doesn’t have search.
  • CTRL+arrows by default jumps over whole words, and not camel cased words. This is configurable, but is yet another odd default. You almost always want to be able to traverse your variables word by word (in camel case), rather than skipping over the whole variable (method/class) name.
  • A few years ago when I used it for Scala, the project never actually compiled. But I guess that’s more Scala’s fault than of the IDE

Apart from the first two, the rest are not major issues, I agree. But they add up. Ultimately, it’s a matter of personal choice whether you can turn a blind eye to these issues. But I’m getting back to Eclipse again. At some point I will propose improvements in the IntelliJ IDEA backlog and will check it again in a few years, I guess.

The post I Still Prefer Eclipse Over IntelliJ IDEA appeared first on Bozho's tech blog.

Automating Security Group Updates with AWS Lambda

Post Syndicated from Ian Scofield original https://aws.amazon.com/blogs/compute/automating-security-group-updates-with-aws-lambda/

Customers often use public endpoints to perform cross-region replication or other application layer communication to remote regions. But a common problem is how do you protect these endpoints? It can be tempting to open up the security groups to the world due to the complexity of keeping security groups in sync across regions with a dynamically changing infrastructure.

Consider a situation where you are running large clusters of instances in different regions that all require internode connectivity. One approach would be to use a VPN tunnel between regions to provide a secure tunnel over which to send your traffic. A good example of this is the Transit VPC Solution, which is a published AWS solution to help customers quickly get up and running. However, this adds additional cost and complexity to your solution due to the newly required additional infrastructure.

Another approach, which I’ll explore in this post, is to restrict access to the nodes by whitelisting the public IP addresses of your hosts in the opposite region. Today, I’ll outline a solution that allows for cross-region security group updates, can handle remote region failures, and supports external actions such as manually terminating instances or adding instances to an existing Auto Scaling group.

Solution overview

The overview of this solution is diagrammed below. Although this post covers limiting access to your instances, you should still implement encryption to protect your data in transit.

If your entire infrastructure is running in a single region, you can reference a security group as the source, allowing your IP addresses to change without any updates required. However, if you’re going across the public internet between regions to perform things like application-level traffic or cross-region replication, this is no longer an option. Security groups are regional. When you go across regions it can be tempting to drop security to enable this communication.

Although using an Elastic IP address can provide you with a static IP address that you can define as a source for your security groups, this may not always be feasible, especially when automatic scaling is desired.

In this example scenario, you have a distributed database that requires full internode communication for replication. If you place a cluster in us-east-1 and us-west-2, you must provide a secure method of communication between the two. Because the database uses cloud best practices, you can add or remove nodes as the load varies.

To start the process of updating your security groups, you must know when an instance has come online to trigger your workflow. Auto Scaling groups have the concept of lifecycle hooks that enable you to perform custom actions as the group launches or terminates instances.

When Auto Scaling begins to launch or terminate an instance, it puts the instance into a wait state (Pending:Wait or Terminating:Wait). The instance remains in this state while you perform your various actions until either you tell Auto Scaling to Continue, Abandon, or the timeout period ends. A lifecycle hook can trigger a CloudWatch event, publish to an Amazon SNS topic, or send to an Amazon SQS queue. For this example, you use CloudWatch Events to trigger an AWS Lambda function that updates an Amazon DynamoDB table.

Component breakdown

Here’s a quick breakdown of the components involved in this solution:

• Lambda function
• CloudWatch event
• DynamoDB table

Lambda function

The Lambda function automatically updates your security groups, in the following way:

1. Determines whether a change was triggered by your Auto Scaling group lifecycle hook or manually invoked for a “true up” functionality, which I discuss later in this post.
2. Describes the instances in the Auto Scaling group and obtain public IP addresses for each instance.
3. Updates both local and remote DynamoDB tables.
4. Compares the list of public IP addresses for both local and remote clusters with what’s already in the local region security group. Update the security group.
5. Compares the list of public IP addresses for both local and remote clusters with what’s already in the remote region security group. Update the security group
6. Signals CONTINUE back to the lifecycle hook.

CloudWatch event

The CloudWatch event triggers when an instance passes through either the launching or terminating states. When the Lambda function gets invoked, it receives an event that looks like the following:

{
	"account": "123456789012",
	"region": "us-east-1",
	"detail": {
		"LifecycleHookName": "hook-launching",
		"AutoScalingGroupName": "",
		"LifecycleActionToken": "33965228-086a-4aeb-8c26-f82ed3bef495",
		"LifecycleTransition": "autoscaling:EC2_INSTANCE_LAUNCHING",
		"EC2InstanceId": "i-017425ec54f22f994"
	},
	"detail-type": "EC2 Instance-launch Lifecycle Action",
	"source": "aws.autoscaling",
	"version": "0",
	"time": "2017-05-03T02:20:59Z",
	"id": "cb930cf8-ce8b-4b6c-8011-af17966eb7e2",
	"resources": [
		"arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:d3fe9d96-34d0-4c62-b9bb-293a41ba3765:autoScalingGroupName/"
	]
}

DynamoDB table

You use DynamoDB to store lists of remote IP addresses in a local table that is updated by the opposite region as a failsafe source of truth. Although you can describe your Auto Scaling group for the local region, you must maintain a list of IP addresses for the remote region.

To minimize the number of describe calls and prevent an issue in the remote region from blocking your local scaling actions, we keep a list of the remote IP addresses in a local DynamoDB table. Each Lambda function in each region is responsible for updating the public IP addresses of its Auto Scaling group for both the local and remote tables.

As with all the infrastructure in this solution, there is a DynamoDB table in both regions that mirror each other. For example, the following screenshot shows a sample DynamoDB table. The Lambda function in us-east-1 would update the DynamoDB entry for us-east-1 in both tables in both regions.

By updating a DynamoDB table in both regions, it allows the local region to gracefully handle issues with the remote region, which would otherwise prevent your ability to scale locally. If the remote region becomes inaccessible, you have a copy of the latest configuration from the table that you can use to continue to sync with your security groups. When the remote region comes back online, it pushes its updated public IP addresses to the DynamoDB table. The security group is updated to reflect the current status by the remote Lambda function.

 

Walkthrough

Note: All of the following steps are performed in both regions. The Launch Stack buttons will default to the us-east-1 region.

Here’s a quick overview of the steps involved in this process:

1. An instance is launched or terminated, which triggers an Auto Scaling group lifecycle hook, triggering the Lambda function via CloudWatch Events.
2. The Lambda function retrieves the list of public IP addresses for all instances in the local region Auto Scaling group.
3. The Lambda function updates the local and remote region DynamoDB tables with the public IP addresses just received for the local Auto Scaling group.
4. The Lambda function updates the local region security group with the public IP addresses, removing and adding to ensure that it mirrors what is present for the local and remote Auto Scaling groups.
5. The Lambda function updates the remote region security group with the public IP addresses, removing and adding to ensure that it mirrors what is present for the local and remote Auto Scaling groups.

Prerequisites

To deploy this solution, you need to have Auto Scaling groups, launch configurations, and a base security group in both regions. To expedite this process, this CloudFormation template can be launched in both regions.

Step 1: Launch the AWS SAM template in the first region

To make the deployment process easy, I’ve created an AWS Serverless Application Model (AWS SAM) template, which is a new specification that makes it easier to manage and deploy serverless applications on AWS. This template creates the following resources:

• A Lambda function, to perform the various security group actions
• A DynamoDB table, to track the state of the local and remote Auto Scaling groups
• Auto Scaling group lifecycle hooks for instance launching and terminating
• A CloudWatch event, to track the EC2 Instance-Launch Lifecycle-Action and EC2 Instance-terminate Lifecycle-Action events
• A pointer from the CloudWatch event to the Lambda function, and the necessary permissions

Download the template from here or click to launch.

Upon launching the template, you’ll be presented with a list of parameters which includes the remote/local names for your Auto Scaling Groups, AWS region, Security Group IDs, DynamoDB table names, as well as where the code for the Lambda function is located. Because this is the first region you’re launching the stack in, fill out all the parameters except for the RemoteTable parameter as it hasn’t been created yet (you fill this in later).

Step 2: Test the local region

After the stack has finished launching, you can test the local region. Open the EC2 console and find the Auto Scaling group that was created when launching the prerequisite stack. Change the desired number of instances from 0 to 1.

For both regions, check your security group to verify that the public IP address of the instance created is now in the security group.

Local region:

Remote region:

Now, change the desired number of instances for your group back to 0 and verify that the rules are properly removed.

Local region:

Remote region:

Step 3: Launch in the remote region

When you deploy a Lambda function using CloudFormation, the Lambda zip file needs to reside in the same region you are launching the template. Once you choose your remote region, create an Amazon S3 bucket and upload the Lambda zip file there. Next, go to the remote region and launch the same SAM template as before, but make sure you update the CodeBucket and CodeKey parameters. Also, because this is the second launch, you now have all the values and can fill out all the parameters, specifically the RemoteTable value.

 

Step 4: Update the local region Lambda environment variable

When you originally launched the template in the local region, you didn’t have the name of the DynamoDB table for the remote region, because you hadn’t created it yet. Now that you have launched the remote template, you can perform a CloudFormation stack update on the initial SAM template. This populates the remote DynamoDB table name into the initial Lambda function’s environment variables.

In the CloudFormation console in the initial region, select the stack. Under Actions, choose Update Stack, and select the SAM template used for both regions. Under Parameters, populate the remote DynamoDB table name, as shown below. Choose Next and let the stack update complete. This updates your Lambda function and completes the setup process.

 

Step 5: Final testing

You now have everything fully configured and in place to trigger security group changes based on instances being added or removed to your Auto Scaling groups in both regions. Test this by changing the desired capacity of your group in both regions.

True up functionality
If an instance is manually added or removed from the Auto Scaling group, the lifecycle hooks don’t get triggered. To account for this, the Lambda function supports a “true up” functionality in which the function can be manually invoked. If you paste in the following JSON text for your test event, it kicks off the entire workflow. For added peace of mind, you can also have this function fire via a CloudWatch event with a CRON expression for nearly continuous checking.

{
	"detail": {
		"AutoScalingGroupName": "<your ASG name>"
	},
	"trueup":true
}

Extra credit

Now that all the resources are created in both regions, go back and break down the policy to incorporate resource-level permissions for specific security groups, Auto Scaling groups, and the DynamoDB tables.

Although this post is centered around using public IP addresses for your instances, you could instead use a VPN between regions. In this case, you would still be able to use this solution to scope down the security groups to the cluster instances. However, the code would need to be modified to support private IP addresses.

 

Conclusion

At this point, you now have a mechanism in place that captures when a new instance is added to or removed from your cluster and updates the security groups in both regions. This ensures that you are locking down your infrastructure securely by allowing access only to other cluster members.

Keep in mind that this architecture (lifecycle hooks, CloudWatch event, Lambda function, and DynamoDB table) requires that the infrastructure to be deployed in both regions, to have synchronization going both ways.

Because this Lambda function is modifying security group rules, it’s important to have an audit log of what has been modified and who is modifying them. The out-of-the-box function provides logs in CloudWatch for what IP addresses are being added and removed for which ports. As these are all API calls being made, they are logged in CloudTrail and can be traced back to the IAM role that you created for your lifecycle hooks. This can provide historical data that can be used for troubleshooting or auditing purposes.

Security is paramount at AWS. We want to ensure that customers are protecting access to their resources. This solution helps you keep your security groups in both regions automatically in sync with your Auto Scaling group resources. Let us know if you have any questions or other solutions you’ve come up with!

timeShift(GrafanaBuzz, 1w) Issue 18

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/20/timeshiftgrafanabuzz-1w-issue-18/

Welcome to another issue of timeShift. This week we released Grafana 4.6.0-beta2, which includes some fixes for alerts, annotations, the Cloudwatch data source, and a few panel updates. We’re also gearing up for Oredev, one of the biggest tech conferences in Scandinavia, November 7-10. In addition to sponsoring, our very own Carl Bergquist will be presenting “Monitoring for everyone.” Hope to see you there – swing by our booth and say hi!


Latest Release

Grafana 4.6-beta-2 is now available! Grafana 4.6.0-beta2 adds fixes for:

  • ColorPicker display
  • Alerting test
  • Cloudwatch improvements
  • CSV export
  • Text panel enhancements
  • Annotation fix for MySQL

To see more details on what’s in the newest version, please see the release notes.

Download Grafana 4.6.0-beta-2 Now


From the Blogosphere

Screeps and Grafana: Graphing your AI: If you’re unfamiliar with Screeps, it’s a MMO RTS game for programmers, where the objective is to grow your colony through programming your units’ AI. You control your colony by writing JavaScript, which operates 247 in the single persistent real-time world filled by other players. This article walks you through graphing all your game stats with Grafana.

ntopng Grafana Integration: The Beauty of Data Visualization: Our friends at ntop created a tutorial so that you can graph ntop monitoring data in Grafana. He goes through the metrics exposed, configuring the ntopng Data Source plugin, and building your first dashboard. They’ve also created a nice video tutorial of the process.

Installing Graphite and Grafana to Display the Graphs of Centreon: This article, provides a step-by-step guide to getting your Centreon data into Graphite and visualizing the data in Grafana.

Bit v. Byte Episode 3 – Metrics for the Win: Bit v. Byte is a new weekly Podcast about the web industry, tools and techniques upcoming and in use today. This episode dives into metrics, and discusses Grafana, Prometheus and NGINX Amplify.

Code-Quickie: Visualize heating with Grafana: With the winter weather coming, Reinhard wanted to monitor the stats in his boiler room. This article covers not only the visualization of the data, but the different devices and sensors you can use to can use in your own home.

RuuviTag with C.H.I.P – BLE – Node-RED: Following the temperature-monitoring theme from the last article, Tobias writes about his journey of hooking up his new RuuviTag to Grafana to measure temperature, relative humidity, air pressure and more.


Early Bird will be Ending Soon

Early bird discounts will be ending soon, but you still have a few days to lock in the lower price. We will be closing early bird on October 31, so don’t wait until the last minute to take advantage of the discounted tickets!

Also, there’s still time to submit your talk. We’ll accept submissions through the end of October. We’re looking for technical and non-technical talks of all sizes. Submit a CFP now.

Get Your Early Bird Ticket Now


Grafana Plugins

This week we have updates to two panels and a brand new panel that can add some animation to your dashboards. Installing plugins in Grafana is easy; for on-prem Grafana, use the Grafana-cli tool, or with 1 click if you are using Hosted Grafana.

NEW PLUGIN

Geoloop Panel – The Geoloop panel is a simple visualizer for joining GeoJSON to Time Series data, and animating the geo features in a loop. An example of using the panel would be showing the rate of rainfall during a 5-hour storm.

Install Now

UPDATED PLUGIN

Breadcrumb Panel – This plugin keeps track of dashboards you have visited within one session and displays them as a breadcrumb. The latest update fixes some issues with back navigation and url query params.

Update

UPDATED PLUGIN

Influx Admin Panel – The Influx Admin panel duplicates features from the now deprecated Web Admin Interface for InfluxDB and has lots of features like letting you see the currently running queries, which can also be easily killed.

Changes in the latest release:

  • Converted to typescript project based on typescript-template-datasource
  • Select Databases. This only works with PR#8096
  • Added time format options
  • Show tags from response
  • Support template variables in the query

Update


Contribution of the week:

Each week we highlight some of the important contributions from our amazing open source community. Thank you for helping make Grafana better!

The Stockholm Go Meetup had a hackathon this week and sent a PR for letting whitelisted cookies pass through the Grafana proxy. Thanks to everyone who worked on this PR!


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

This is awesome – we can’t get enough of these public dashboards!

We Need Your Help!

Do you have a graph that you love because the data is beautiful or because the graph provides interesting information? Please get in touch. Tweet or send us an email with a screenshot, and we’ll tell you about this fun experiment.

Tell Me More


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

Please tell us how we’re doing. Submit a comment on this article below, or post something at our community forum. Help us make these weekly roundups better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

Using AWS Step Functions State Machines to Handle Workflow-Driven AWS CodePipeline Actions

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-aws-step-functions-state-machines-to-handle-workflow-driven-aws-codepipeline-actions/

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. It offers powerful integration with other AWS services, such as AWS CodeBuildAWS CodeDeployAWS CodeCommit, AWS CloudFormation and with third-party tools such as Jenkins and GitHub. These services make it possible for AWS customers to successfully automate various tasks, including infrastructure provisioning, blue/green deployments, serverless deployments, AMI baking, database provisioning, and release management.

Developers have been able to use CodePipeline to build sophisticated automation pipelines that often require a single CodePipeline action to perform multiple tasks, fork into different execution paths, and deal with asynchronous behavior. For example, to deploy a Lambda function, a CodePipeline action might first inspect the changes pushed to the code repository. If only the Lambda code has changed, the action can simply update the Lambda code package, create a new version, and point the Lambda alias to the new version. If the changes also affect infrastructure resources managed by AWS CloudFormation, the pipeline action might have to create a stack or update an existing one through the use of a change set. In addition, if an update is required, the pipeline action might enforce a safety policy to infrastructure resources that prevents the deletion and replacement of resources. You can do this by creating a change set and having the pipeline action inspect its changes before updating the stack. Change sets that do not conform to the policy are deleted.

This use case is a good illustration of workflow-driven pipeline actions. These are actions that run multiple tasks, deal with async behavior and loops, need to maintain and propagate state, and fork into different execution paths. Implementing workflow-driven actions directly in CodePipeline can lead to complex pipelines that are hard for developers to understand and maintain. Ideally, a pipeline action should perform a single task and delegate the complexity of dealing with workflow-driven behavior associated with that task to a state machine engine. This would make it possible for developers to build simpler, more intuitive pipelines and allow them to use state machine execution logs to visualize and troubleshoot their pipeline actions.

In this blog post, we discuss how AWS Step Functions state machines can be used to handle workflow-driven actions. We show how a CodePipeline action can trigger a Step Functions state machine and how the pipeline and the state machine are kept decoupled through a Lambda function. The advantages of using state machines include:

  • Simplified logic (complex tasks are broken into multiple smaller tasks).
  • Ease of handling asynchronous behavior (through state machine wait states).
  • Built-in support for choices and processing different execution paths (through state machine choices).
  • Built-in visualization and logging of the state machine execution.

The source code for the sample pipeline, pipeline actions, and state machine used in this post is available at https://github.com/awslabs/aws-codepipeline-stepfunctions.

Overview

This figure shows the components in the CodePipeline-Step Functions integration that will be described in this post. The pipeline contains two stages: a Source stage represented by a CodeCommit Git repository and a Prod stage with a single Deploy action that represents the workflow-driven action.

This action invokes a Lambda function (1) called the State Machine Trigger Lambda, which, in turn, triggers a Step Function state machine to process the request (2). The Lambda function sends a continuation token back to the pipeline (3) to continue its execution later and terminates. Seconds later, the pipeline invokes the Lambda function again (4), passing the continuation token received. The Lambda function checks the execution state of the state machine (5,6) and communicates the status to the pipeline. The process is repeated until the state machine execution is complete. Then the Lambda function notifies the pipeline that the corresponding pipeline action is complete (7). If the state machine has failed, the Lambda function will then fail the pipeline action and stop its execution (7). While running, the state machine triggers various Lambda functions to perform different tasks. The state machine and the pipeline are fully decoupled. Their interaction is handled by the Lambda function.

The Deploy State Machine

The sample state machine used in this post is a simplified version of the use case, with emphasis on infrastructure deployment. The state machine will follow distinct execution paths and thus have different outcomes, depending on:

  • The current state of the AWS CloudFormation stack.
  • The nature of the code changes made to the AWS CloudFormation template and pushed into the pipeline.

If the stack does not exist, it will be created. If the stack exists, a change set will be created and its resources inspected by the state machine. The inspection consists of parsing the change set results and detecting whether any resources will be deleted or replaced. If no resources are being deleted or replaced, the change set is allowed to be executed and the state machine completes successfully. Otherwise, the change set is deleted and the state machine completes execution with a failure as the terminal state.

Let’s dive into each of these execution paths.

Path 1: Create a Stack and Succeed Deployment

The Deploy state machine is shown here. It is triggered by the Lambda function using the following input parameters stored in an S3 bucket.

Create New Stack Execution Path

{
    "environmentName": "prod",
    "stackName": "sample-lambda-app",
    "templatePath": "infra/Lambda-template.yaml",
    "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
    "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ"
}

Note that some values used here are for the use case example only. Account-specific parameters like revisionS3Bucket and revisionS3Key will be different when you deploy this use case in your account.

These input parameters are used by various states in the state machine and passed to the corresponding Lambda functions to perform different tasks. For example, stackName is used to create a stack, check the status of stack creation, and create a change set. The environmentName represents the environment (for example, dev, test, prod) to which the code is being deployed. It is used to prefix the name of stacks and change sets.

With the exception of built-in states such as wait and choice, each state in the state machine invokes a specific Lambda function.  The results received from the Lambda invocations are appended to the state machine’s original input. When the state machine finishes its execution, several parameters will have been added to its original input.

The first stage in the state machine is “Check Stack Existence”. It checks whether a stack with the input name specified in the stackName input parameter already exists. The output of the state adds a Boolean value called doesStackExist to the original state machine input as follows:

{
  "doesStackExist": true,
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
}

The following stage, “Does Stack Exist?”, is represented by Step Functions built-in choice state. It checks the value of doesStackExist to determine whether a new stack needs to be created (doesStackExist=true) or a change set needs to be created and inspected (doesStackExist=false).

If the stack does not exist, the states illustrated in green in the preceding figure are executed. This execution path creates the stack, waits until the stack is created, checks the status of the stack’s creation, and marks the deployment successful after the stack has been created. Except for “Stack Created?” and “Wait Stack Creation,” each of these stages invokes a Lambda function. “Stack Created?” and “Wait Stack Creation” are implemented by using the built-in choice state (to decide which path to follow) and the wait state (to wait a few seconds before proceeding), respectively. Each stage adds the results of their Lambda function executions to the initial input of the state machine, allowing future stages to process them.

Path 2: Safely Update a Stack and Mark Deployment as Successful

Safely Update a Stack and Mark Deployment as Successful Execution Path

If the stack indicated by the stackName parameter already exists, a different path is executed. (See the green states in the figure.) This path will create a change set and use wait and choice states to wait until the change set is created. Afterwards, a stage in the execution path will inspect  the resources affected before the change set is executed.

The inspection procedure represented by the “Inspect Change Set Changes” stage consists of parsing the resources affected by the change set and checking whether any of the existing resources are being deleted or replaced. The following is an excerpt of the algorithm, where changeSetChanges.Changes is the object representing the change set changes:

...
var RESOURCES_BEING_DELETED_OR_REPLACED = "RESOURCES-BEING-DELETED-OR-REPLACED";
var CAN_SAFELY_UPDATE_EXISTING_STACK = "CAN-SAFELY-UPDATE-EXISTING-STACK";
for (var i = 0; i < changeSetChanges.Changes.length; i++) {
    var change = changeSetChanges.Changes[i];
    if (change.Type == "Resource") {
        if (change.ResourceChange.Action == "Delete") {
            return RESOURCES_BEING_DELETED_OR_REPLACED;
        }
        if (change.ResourceChange.Action == "Modify") {
            if (change.ResourceChange.Replacement == "True") {
                return RESOURCES_BEING_DELETED_OR_REPLACED;
            }
        }
    }
}
return CAN_SAFELY_UPDATE_EXISTING_STACK;

The algorithm returns different values to indicate whether the change set can be safely executed (CAN_SAFELY_UPDATE_EXISTING_STACK or RESOURCES_BEING_DELETED_OR_REPLACED). This value is used later by the state machine to decide whether to execute the change set and update the stack or interrupt the deployment.

The output of the “Inspect Change Set” stage is shown here.

{
  "environmentName": "prod",
  "stackName": "sample-lambda-app",
  "templatePath": "infra/lambda-template.yaml",
  "revisionS3Bucket": "codepipeline-us-east-1-418586629775",
  "revisionS3Key": "StepFunctionsDrivenD/CodeCommit/sjcmExZ",
  "doesStackExist": true,
  "changeSetName": "prod-sample-lambda-app-change-set-545",
  "changeSetCreationStatus": "complete",
  "changeSetAction": "CAN-SAFELY-UPDATE-EXISTING-STACK"
}

At this point, these parameters have been added to the state machine’s original input:

  • changeSetName, which is added by the “Create Change Set” state.
  • changeSetCreationStatus, which is added by the “Get Change Set Creation Status” state.
  • changeSetAction, which is added by the “Inspect Change Set Changes” state.

The “Safe to Update Infra?” step is a choice state (its JSON spec follows) that simply checks the value of the changeSetAction parameter. If the value is equal to “CAN-SAFELY-UPDATE-EXISTING-STACK“, meaning that no resources will be deleted or replaced, the step will execute the change set by proceeding to the “Execute Change Set” state. The deployment is successful (the state machine completes its execution successfully).

"Safe to Update Infra?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.taskParams.changeSetAction",
          "StringEquals": "CAN-SAFELY-UPDATE-EXISTING-STACK",
          "Next": "Execute Change Set"
        }
      ],
      "Default": "Deployment Failed"
 }

Path 3: Reject Stack Update and Fail Deployment

Reject Stack Update and Fail Deployment Execution Path

If the changeSetAction parameter is different from “CAN-SAFELY-UPDATE-EXISTING-STACK“, the state machine will interrupt the deployment by deleting the change set and proceeding to the “Deployment Fail” step, which is a built-in Fail state. (Its JSON spec follows.) This state causes the state machine to stop in a failed state and serves to indicate to the Lambda function that the pipeline deployment should be interrupted in a fail state as well.

 "Deployment Failed": {
      "Type": "Fail",
      "Cause": "Deployment Failed",
      "Error": "Deployment Failed"
    }

In all three scenarios, there’s a state machine’s visual representation available in the AWS Step Functions console that makes it very easy for developers to identify what tasks have been executed or why a deployment has failed. Developers can also inspect the inputs and outputs of each state and look at the state machine Lambda function’s logs for details. Meanwhile, the corresponding CodePipeline action remains very simple and intuitive for developers who only need to know whether the deployment was successful or failed.

The State Machine Trigger Lambda Function

The Trigger Lambda function is invoked directly by the Deploy action in CodePipeline. The CodePipeline action must pass a JSON structure to the trigger function through the UserParameters attribute, as follows:

{
  "s3Bucket": "codepipeline-StepFunctions-sample",
  "stateMachineFile": "state_machine_input.json"
}

The s3Bucket parameter specifies the S3 bucket location for the state machine input parameters file. The stateMachineFile parameter specifies the file holding the input parameters. By being able to specify different input parameters to the state machine, we make the Trigger Lambda function and the state machine reusable across environments. For example, the same state machine could be called from a test and prod pipeline action by specifying a different S3 bucket or state machine input file for each environment.

The Trigger Lambda function performs two main tasks: triggering the state machine and checking the execution state of the state machine. Its core logic is shown here:

exports.index = function (event, context, callback) {
    try {
        console.log("Event: " + JSON.stringify(event));
        console.log("Context: " + JSON.stringify(context));
        console.log("Environment Variables: " + JSON.stringify(process.env));
        if (Util.isContinuingPipelineTask(event)) {
            monitorStateMachineExecution(event, context, callback);
        }
        else {
            triggerStateMachine(event, context, callback);
        }
    }
    catch (err) {
        failure(Util.jobId(event), callback, context.invokeid, err.message);
    }
}

Util.isContinuingPipelineTask(event) is a utility function that checks if the Trigger Lambda function is being called for the first time (that is, no continuation token is passed by CodePipeline) or as a continuation of a previous call. In its first execution, the Lambda function will trigger the state machine and send a continuation token to CodePipeline that contains the state machine execution ARN. The state machine ARN is exposed to the Lambda function through a Lambda environment variable called stateMachineArn. Here is the code that triggers the state machine:

function triggerStateMachine(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var s3Bucket = Util.actionUserParameter(event, "s3Bucket");
    var stateMachineFile = Util.actionUserParameter(event, "stateMachineFile");
    getStateMachineInputData(s3Bucket, stateMachineFile)
        .then(function (data) {
            var initialParameters = data.Body.toString();
            var stateMachineInputJSON = createStateMachineInitialInput(initialParameters, event);
            console.log("State machine input JSON: " + JSON.stringify(stateMachineInputJSON));
            return stateMachineInputJSON;
        })
        .then(function (stateMachineInputJSON) {
            return triggerStateMachineExecution(stateMachineArn, stateMachineInputJSON);
        })
        .then(function (triggerStateMachineOutput) {
            var continuationToken = { "stateMachineExecutionArn": triggerStateMachineOutput.executionArn };
            var message = "State machine has been triggered: " + JSON.stringify(triggerStateMachineOutput) + ", continuationToken: " + JSON.stringify(continuationToken);
            return continueExecution(Util.jobId(event), continuationToken, callback, message);
        })
        .catch(function (err) {
            console.log("Error triggering state machine: " + stateMachineArn + ", Error: " + err.message);
            failure(Util.jobId(event), callback, context.invokeid, err.message);
        })
}

The Trigger Lambda function fetches the state machine input parameters from an S3 file, triggers the execution of the state machine using the input parameters and the stateMachineArn environment variable, and signals to CodePipeline that the execution should continue later by passing a continuation token that contains the state machine execution ARN. In case any of these operations fail and an exception is thrown, the Trigger Lambda function will fail the pipeline immediately by signaling a pipeline failure through the putJobFailureResult CodePipeline API.

If the Lambda function is continuing a previous execution, it will extract the state machine execution ARN from the continuation token and check the status of the state machine, as shown here.

function monitorStateMachineExecution(event, context, callback) {
    var stateMachineArn = process.env.stateMachineArn;
    var continuationToken = JSON.parse(Util.continuationToken(event));
    var stateMachineExecutionArn = continuationToken.stateMachineExecutionArn;
    getStateMachineExecutionStatus(stateMachineExecutionArn)
        .then(function (response) {
            if (response.status === "RUNNING") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " is still " + response.status;
                return continueExecution(Util.jobId(event), continuationToken, callback, message);
            }
            if (response.status === "SUCCEEDED") {
                var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
                return success(Util.jobId(event), callback, message);
            }
            // FAILED, TIMED_OUT, ABORTED
            var message = "Execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + " has: " + response.status;
            return failure(Util.jobId(event), callback, context.invokeid, message);
        })
        .catch(function (err) {
            var message = "Error monitoring execution: " + stateMachineExecutionArn + " of state machine: " + stateMachineArn + ", Error: " + err.message;
            failure(Util.jobId(event), callback, context.invokeid, message);
        });
}

If the state machine is in the RUNNING state, the Lambda function will send the continuation token back to the CodePipeline action. This will cause CodePipeline to call the Lambda function again a few seconds later. If the state machine has SUCCEEDED, then the Lambda function will notify the CodePipeline action that the action has succeeded. In any other case (FAILURE, TIMED-OUT, or ABORT), the Lambda function will fail the pipeline action.

This behavior is especially useful for developers who are building and debugging a new state machine because a bug in the state machine can potentially leave the pipeline action hanging for long periods of time until it times out. The Trigger Lambda function prevents this.

Also, by having the Trigger Lambda function as a means to decouple the pipeline and state machine, we make the state machine more reusable. It can be triggered from anywhere, not just from a CodePipeline action.

The Pipeline in CodePipeline

Our sample pipeline contains two simple stages: the Source stage represented by a CodeCommit Git repository and the Prod stage, which contains the Deploy action that invokes the Trigger Lambda function. When the state machine decides that the change set created must be rejected (because it replaces or deletes some the existing production resources), it fails the pipeline without performing any updates to the existing infrastructure. (See the failed Deploy action in red.) Otherwise, the pipeline action succeeds, indicating that the existing provisioned infrastructure was either created (first run) or updated without impacting any resources. (See the green Deploy stage in the pipeline on the left.)

The Pipeline in CodePipeline

The JSON spec for the pipeline’s Prod stage is shown here. We use the UserParameters attribute to pass the S3 bucket and state machine input file to the Lambda function. These parameters are action-specific, which means that we can reuse the state machine in another pipeline action.

{
  "name": "Prod",
  "actions": [
      {
          "inputArtifacts": [
              {
                  "name": "CodeCommitOutput"
              }
          ],
          "name": "Deploy",
          "actionTypeId": {
              "category": "Invoke",
              "owner": "AWS",
              "version": "1",
              "provider": "Lambda"
          },
          "outputArtifacts": [],
          "configuration": {
              "FunctionName": "StateMachineTriggerLambda",
              "UserParameters": "{\"s3Bucket\": \"codepipeline-StepFunctions-sample\", \"stateMachineFile\": \"state_machine_input.json\"}"
          },
          "runOrder": 1
      }
  ]
}

Conclusion

In this blog post, we discussed how state machines in AWS Step Functions can be used to handle workflow-driven actions. We showed how a Lambda function can be used to fully decouple the pipeline and the state machine and manage their interaction. The use of a state machine greatly simplified the associated CodePipeline action, allowing us to build a much simpler and cleaner pipeline while drilling down into the state machine’s execution for troubleshooting or debugging.

Here are two exercises you can complete by using the source code.

Exercise #1: Do not fail the state machine and pipeline action after inspecting a change set that deletes or replaces resources. Instead, create a stack with a different name (think of blue/green deployments). You can do this by creating a state machine transition between the “Safe to Update Infra?” and “Create Stack” stages and passing a new stack name as input to the “Create Stack” stage.

Exercise #2: Add wait logic to the state machine to wait until the change set completes its execution before allowing the state machine to proceed to the “Deployment Succeeded” stage. Use the stack creation case as an example. You’ll have to create a Lambda function (similar to the Lambda function that checks the creation status of a stack) to get the creation status of the change set.

Have fun and share your thoughts!

About the Author

Marcilio Mendonca is a Sr. Consultant in the Canadian Professional Services Team at Amazon Web Services. He has helped AWS customers design, build, and deploy best-in-class, cloud-native AWS applications using VMs, containers, and serverless architectures. Before he joined AWS, Marcilio was a Software Development Engineer at Amazon. Marcilio also holds a Ph.D. in Computer Science. In his spare time, he enjoys playing drums, riding his motorcycle in the Toronto GTA area, and spending quality time with his family.

What’s new in HiveMQ 3.3

Post Syndicated from The HiveMQ Team original https://www.hivemq.com/whats-new-in-hivemq-3-3

We are pleased to announce the release of HiveMQ 3.3. This version of HiveMQ is the most advanced and user friendly version of HiveMQ ever. A broker is the heart of every MQTT deployment and it’s key to monitor and understand how healthy your system and your connected clients are. Version 3.3 of HiveMQ focuses on observability, usability and advanced administration features and introduces a brand new Web UI. This version is a drop-in replacement for HiveMQ 3.2 and of course supports rolling upgrades for zero-downtime.

HiveMQ 3.3 brings many features that your users, administrators and plugin developers are going to love. These are the highlights:

Web UI

Web UI
The new HiveMQ version has a built-in Web UI for advanced analysis and administrative tasks. A powerful dashboard shows important data about the health of the broker cluster and an overview of the whole MQTT deployment.
With the new Web UI, administrators are able to drill down to specific client information and can perform administrative actions like disconnecting a client. Advanced analytics functionality allows indetifying clients with irregular behavior. It’s easy to identify message-dropping clients as HiveMQ shows detailed statistics of such misbehaving MQTT participants.
Of course all Web UI features work at scale with more than a million connected MQTT clients. Learn more about the Web UI in the documentation.

Time To Live

TTL
HiveMQ introduces Time to Live (TTL) on various levels of the MQTT lifecycle. Automatic cleanup of expired messages is as well supported as the wiping of abandoned persistent MQTT sessions. In particular, version 3.3 implements the following TTL features:

  • MQTT client session expiration
  • Retained Message expiration
  • MQTT PUBLISH message expiration

Configuring a TTL for MQTT client sessions and retained messages allows freeing system resources without manual administrative intervention as soon as the data is not needed anymore.
Beside global configuration, MQTT PUBLISHES can have individual TTLs based on application specific characteristics. It’s a breeze to change the TTL of particular messages with the HiveMQ plugin system. As soon as a message TTL expires, the broker won’t send out the message anymore, even if the message was previously queued or in-flight. This can save precious bandwidth for mobile connections as unnecessary traffic is avoided for expired messages.

Trace Recordings

Trace Recordings
Debugging specific MQTT clients or groups of MQTT clients can be challenging at scale. HiveMQ 3.3 introduces an innovative Trace Recording mechanism that allows creating detailed recordings of all client interactions with given filters.
It’s possible to filter based on client identifiers, MQTT message types and topics. And the best of all: You can use regular expressions to select multiple MQTT clients at once as well as topics with complex structures. Getting detailed information about the behavior of specific MQTT clients for debugging complex issues was never easier.

Native SSL

Native SSL
The new native SSL integration of HiveMQ brings a performance boost of more than 40% for SSL Handshakes (in terms of CPU usage) by utilizing an integration with BoringSSL. BoringSSL is Google’s fork of OpenSSL which is also used in Google Chrome and Android. Besides the compute and huge memory optimizations (saves up to 60% Java Heap), additional secure state-of-the-art cipher suites are supported by HiveMQ which are not directly available for Java (like ChaCha20-Poly1305).
Most HiveMQ deployments on Linux systems are expected to see decreased CPU load on TLS handshakes with the native SSL integration and huge memory improvements.

New Plugin System Features

New Plugin System Features
The popular and powerful plugin system has received additional services and callbacks which are useful for many existing and future plugins.
Plugin developers can now use a ConnectionAttributeStore and a SessionAttributeStore for storing arbitrary data for the lifetime of a single MQTT connection of a client or for the whole session of a client. The new ClientGroupService allows grouping different MQTT client identifiers by the same key, so it’s easy to address multiple MQTT clients (with the same group) at once.

A new callback was introduced which notifies a plugin when a HiveMQ instance is ready, which means the instance is part of the cluster and all listeners were started successfully. Developers can now react when a MQTT client session is ready and usable in the cluster with a dedicated callback.

Some use cases require modifying a MQTT PUBLISH packet before it’s sent out to a client. This is now possible with a new callback that was introduced for modifying a PUBLISH before sending it out to a individual client.
The offline queue size for persistent clients is now also configurable for individual clients as well as the queue discard strategy.

Additional Features

Additional Features
HiveMQ 3.3 has many additional features designed for power users and professional MQTT deployments. The new version also has the following highlights:

  • OCSP Stapling
  • Event Log for MQTT client connects, disconnects and unusual events (e.g. discarded message due to slow consumption on the client side
  • Throttling of concurrent TLS handshakes
  • Connect Packet overload protection
  • Configuration of Socket send and receive buffer sizes
  • Global System Information like the HiveMQ Home folder can now be set via Environment Variables without changing the run script
  • The internal HTTP server of HiveMQ is now exposed to the holistic monitoring subsystem
  • Many additional useful metrics were exposed to HiveMQ’s monitoring subsystem

 

In order to upgrade to HiveMQ 3.3 from HiveMQ 3.2 or older versions, take a look at our Upgrade Guide.
Don’t forget to learn more about all the new features with our HiveMQ User Guide.

Download HiveMQ 3.3 now

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Introducing Email Templates and Bulk Sending

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/ses/introducing-email-templates-and-bulk-sending/

The Amazon SES team is excited to announce our latest update, which includes two related features that help you send personalized emails to large groups of customers. This post discusses these features, and provides examples that you can follow to start using these features right away.

Email templates

You can use email templates to create the structure of an email that you plan to send to multiple recipients, or that you will use again in the future. Each template contains a subject line, a text part, and an HTML part. Both the subject and the email body can contain variables that are automatically replaced with values specific to each recipient. For example, you can include a {{name}} variable in the body of your email. When you send the email, you specify the value of {{name}} for each recipient. Amazon SES then automatically replaces the {{name}} variable with the recipient’s first name.

Creating a template

To create a template, you use the CreateTemplate API operation. To use this operation, pass a JSON object with four properties: a template name (TemplateName), a subject line (SubjectPart), a plain text version of the email body (TextPart), and an HTML version of the email body (HtmlPart). You can include variables in the subject line or message body by enclosing the variable names in two sets of curly braces. The following example shows the structure of this JSON object.

{
  "TemplateName": "MyTemplate",
  "SubjectPart": "Greetings, {{name}}!",
  "TextPart": "Dear {{name}},\r\nYour favorite animal is {{favoriteanimal}}.",
  "HtmlPart": "<h1>Hello {{name}}</h1><p>Your favorite animal is {{favoriteanimal}}.</p>"
}

Use this example to create your own template, and save the resulting file as mytemplate.json. You can then use the AWS Command Line Interface (AWS CLI) to create your template by running the following command: aws ses create-template --cli-input-json mytemplate.json

Sending an email created with a template

Now that you have created a template, you’re ready to send email that uses the template. You can use the SendTemplatedEmail API operation to send email to a single destination using a template. Like the CreateTemplate operation, this operation accepts a JSON object with four properties. For this operation, the properties are the sender’s email address (Source), the name of an existing template (Template), an object called Destination that contains the recipient addresses (and, optionally, any CC or BCC addresses) that will receive the email, and a property that refers to the values that will be replaced in the email (TemplateData). The following example shows the structure of the JSON object used by the SendTemplatedEmail operation.

{
  "Source": "[email protected]",
  "Template": "MyTemplate",
  "Destination": {
    "ToAddresses": [ "[email protected]" ]
  },
  "TemplateData": "{ \"name\":\"Alejandro\", \"favoriteanimal\": \"zebra\" }"
}

Customize this example to fit your needs, and then save the resulting file as myemail.json. One important note: in the TemplateData property, you must use a blackslash (\) character to escape the quotes within this object, as shown in the preceding example.

When you’re ready to send the email, run the following command: aws ses send-templated-email --cli-input-json myemail.json

Bulk email sending

In most cases, you should use email templates to send personalized emails to several customers at the same time. The SendBulkTemplatedEmail API operation helps you do that. This operation also accepts a JSON object. At a minimum, you must supply a sender email address (Source), a reference to an existing template (Template), a list of recipients in an array called Destinations (within which you specify the recipient’s email address, and the variable values for that recipient), and a list of fallback values for the variables in the template (DefaultTemplateData). The following example shows the structure of this JSON object.

{
  "Source":"[email protected]",
  "ConfigurationSetName":"ConfigSet",
  "Template":"MyTemplate",
  "Destinations":[
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Anaya\", \"favoriteanimal\":\"yak\" }"
    },
    {
      "Destination":{ 
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Liu\", \"favoriteanimal\":\"water buffalo\" }"
    },
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{ \"name\":\"Shirley\", \"favoriteanimal\":\"vulture\" }"
    },
    {
      "Destination":{
        "ToAddresses":[
          "[email protected]"
        ]
      },
      "ReplacementTemplateData":"{}"
    }
  ],
  "DefaultTemplateData":"{ \"name\":\"friend\", \"favoriteanimal\":\"unknown\" }"
}

This example sends unique emails to Anaya ([email protected]), Liu ([email protected]), Shirley ([email protected]), and a fourth recipient ([email protected]), whose name and favorite animal we didn’t specify. Anaya, Liu, and Shirley will see their names in place of the {{name}} tag in the template (which, in this example, is present in both the subject line and message body), as well as their favorite animals in place of the {{favoriteanimal}} tag in the message body. The DefaultTemplateData property determines what happens if you do not specify the ReplacementTemplateData property for a recipient. In this case, the fourth recipient will see the word “friend” in place of the {{name}} tag, and “unknown” in place of the {{favoriteanimal}} tag.

Use the example to create your own list of recipients, and save the resulting file as mybulkemail.json. When you’re ready to send the email, run the following command: aws ses send-bulk-templated-email --cli-input-json mybulkemail.json

Other considerations

There are a few limits and other considerations when using these features:

  • You can create up to 10,000 email templates per Amazon SES account.
  • Each template can be up to 10 MB in size.
  • You can include an unlimited number of replacement variables in each template.
  • You can send email to up to 50 destinations in each call to the SendBulkTemplatedEmail operation. A destination includes a list of recipients, as well as CC and BCC recipients. Note that the number of destinations you can contact in a single call to the API may be limited by your account’s maximum sending rate. For more information, see Managing Your Amazon SES Sending Limits in the Amazon SES Developer Guide.

We look forward to seeing the amazing things you create with these new features. If you have any questions, please leave a comment on this post, or let us know in the Amazon SES forum.

AWS Developer Tools Expands Integration to Include GitHub

Post Syndicated from Balaji Iyer original https://aws.amazon.com/blogs/devops/aws-developer-tools-expands-integration-to-include-github/

AWS Developer Tools is a set of services that include AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy. Together, these services help you securely store and maintain version control of your application’s source code and automatically build, test, and deploy your application to AWS or your on-premises environment. These services are designed to enable developers and IT professionals to rapidly and safely deliver software.

As part of our continued commitment to extend the AWS Developer Tools ecosystem to third-party tools and services, we’re pleased to announce AWS CodeStar and AWS CodeBuild now integrate with GitHub. This will make it easier for GitHub users to set up a continuous integration and continuous delivery toolchain as part of their release process using AWS Developer Tools.

In this post, I will walk through the following:

Prerequisites:

You’ll need an AWS account, a GitHub account, an Amazon EC2 key pair, and administrator-level permissions for AWS Identity and Access Management (IAM), AWS CodeStar, AWS CodeBuild, AWS CodePipeline, Amazon EC2, Amazon S3.

 

Integrating GitHub with AWS CodeStar

AWS CodeStar enables you to quickly develop, build, and deploy applications on AWS. Its unified user interface helps you easily manage your software development activities in one place. With AWS CodeStar, you can set up your entire continuous delivery toolchain in minutes, so you can start releasing code faster.

When AWS CodeStar launched in April of this year, it used AWS CodeCommit as the hosted source repository. You can now choose between AWS CodeCommit or GitHub as the source control service for your CodeStar projects. In addition, your CodeStar project dashboard lets you centrally track GitHub activities, including commits, issues, and pull requests. This makes it easy to manage project activity across the components of your CI/CD toolchain. Adding the GitHub dashboard view will simplify development of your AWS applications.

In this section, I will show you how to use GitHub as the source provider for your CodeStar projects. I’ll also show you how to work with recent commits, issues, and pull requests in the CodeStar dashboard.

Sign in to the AWS Management Console and from the Services menu, choose CodeStar. In the CodeStar console, choose Create a new project. You should see the Choose a project template page.

CodeStar Project

Choose an option by programming language, application category, or AWS service. I am going to choose the Ruby on Rails web application that will be running on Amazon EC2.

On the Project details page, you’ll now see the GitHub option. Type a name for your project, and then choose Connect to GitHub.

Project details

You’ll see a message requesting authorization to connect to your GitHub repository. When prompted, choose Authorize, and then type your GitHub account password.

Authorize

This connects your GitHub identity to AWS CodeStar through OAuth. You can always review your settings by navigating to your GitHub application settings.

Installed GitHub Apps

You’ll see AWS CodeStar is now connected to GitHub:

Create project

You can choose a public or private repository. GitHub offers free accounts for users and organizations working on public and open source projects and paid accounts that offer unlimited private repositories and optional user management and security features.

In this example, I am going to choose the public repository option. Edit the repository description, if you like, and then choose Next.

Review your CodeStar project details, and then choose Create Project. On Choose an Amazon EC2 Key Pair, choose Create Project.

Key Pair

On the Review project details page, you’ll see Edit Amazon EC2 configuration. Choose this link to configure instance type, VPC, and subnet options. AWS CodeStar requires a service role to create and manage AWS resources and IAM permissions. This role will be created for you when you select the AWS CodeStar would like permission to administer AWS resources on your behalf check box.

Choose Create Project. It might take a few minutes to create your project and resources.

Review project details

When you create a CodeStar project, you’re added to the project team as an owner. If this is the first time you’ve used AWS CodeStar, you’ll be asked to provide the following information, which will be shown to others:

  • Your display name.
  • Your email address.

This information is used in your AWS CodeStar user profile. User profiles are not project-specific, but they are limited to a single AWS region. If you are a team member in projects in more than one region, you’ll have to create a user profile in each region.

User settings

User settings

Choose Next. AWS CodeStar will create a GitHub repository with your configuration settings (for example, https://github.com/biyer/ruby-on-rails-service).

When you integrate your integrated development environment (IDE) with AWS CodeStar, you can continue to write and develop code in your preferred environment. The changes you make will be included in the AWS CodeStar project each time you commit and push your code.

IDE

After setting up your IDE, choose Next to go to the CodeStar dashboard. Take a few minutes to familiarize yourself with the dashboard. You can easily track progress across your entire software development process, from your backlog of work items to recent code deployments.

Dashboard

After the application deployment is complete, choose the endpoint that will display the application.

Pipeline

This is what you’ll see when you open the application endpoint:

The Commit history section of the dashboard lists the commits made to the Git repository. If you choose the commit ID or the Open in GitHub option, you can use a hotlink to your GitHub repository.

Commit history

Your AWS CodeStar project dashboard is where you and your team view the status of your project resources, including the latest commits to your project, the state of your continuous delivery pipeline, and the performance of your instances. This information is displayed on tiles that are dedicated to a particular resource. To see more information about any of these resources, choose the details link on the tile. The console for that AWS service will open on the details page for that resource.

Issues

You can also filter issues based on their status and the assigned user.

Filter

AWS CodeBuild Now Supports Building GitHub Pull Requests

CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue. You can use prepackaged build environments to get started quickly or you can create custom build environments that use your own build tools.

We recently announced support for GitHub pull requests in AWS CodeBuild. This functionality makes it easier to collaborate across your team while editing and building your application code with CodeBuild. You can use the AWS CodeBuild or AWS CodePipeline consoles to run AWS CodeBuild. You can also automate the running of AWS CodeBuild by using the AWS Command Line Interface (AWS CLI), the AWS SDKs, or the AWS CodeBuild Plugin for Jenkins.

AWS CodeBuild

In this section, I will show you how to trigger a build in AWS CodeBuild with a pull request from GitHub through webhooks.

Open the AWS CodeBuild console at https://console.aws.amazon.com/codebuild/. Choose Create project. If you already have a CodeBuild project, you can choose Edit project, and then follow along. CodeBuild can connect to AWS CodeCommit, S3, BitBucket, and GitHub to pull source code for builds. For Source provider, choose GitHub, and then choose Connect to GitHub.

Configure

After you’ve successfully linked GitHub and your CodeBuild project, you can choose a repository in your GitHub account. CodeBuild also supports connections to any public repository. You can review your settings by navigating to your GitHub application settings.

GitHub Apps

On Source: What to Build, for Webhook, select the Rebuild every time a code change is pushed to this repository check box.

Note: You can select this option only if, under Repository, you chose Use a repository in my account.

Source

In Environment: How to build, for Environment image, select Use an image managed by AWS CodeBuild. For Operating system, choose Ubuntu. For Runtime, choose Base. For Version, choose the latest available version. For Build specification, you can provide a collection of build commands and related settings, in YAML format (buildspec.yml) or you can override the build spec by inserting build commands directly in the console. AWS CodeBuild uses these commands to run a build. In this example, the output is the string “hello.”

Environment

On Artifacts: Where to put the artifacts from this build project, for Type, choose No artifacts. (This is also the type to choose if you are just running tests or pushing a Docker image to Amazon ECR.) You also need an AWS CodeBuild service role so that AWS CodeBuild can interact with dependent AWS services on your behalf. Unless you already have a role, choose Create a role, and for Role name, type a name for your role.

Artifacts

In this example, leave the advanced settings at their defaults.

If you expand Show advanced settings, you’ll see options for customizing your build, including:

  • A build timeout.
  • A KMS key to encrypt all the artifacts that the builds for this project will use.
  • Options for building a Docker image.
  • Elevated permissions during your build action (for example, accessing Docker inside your build container to build a Dockerfile).
  • Resource options for the build compute type.
  • Environment variables (built-in or custom). For more information, see Create a Build Project in the AWS CodeBuild User Guide.

Advanced settings

You can use the AWS CodeBuild console to create a parameter in Amazon EC2 Systems Manager. Choose Create a parameter, and then follow the instructions in the dialog box. (In that dialog box, for KMS key, you can optionally specify the ARN of an AWS KMS key in your account. Amazon EC2 Systems Manager uses this key to encrypt the parameter’s value during storage and decrypt during retrieval.)

Create parameter

Choose Continue. On the Review page, either choose Save and build or choose Save to run the build later.

Choose Start build. When the build is complete, the Build logs section should display detailed information about the build.

Logs

To demonstrate a pull request, I will fork the repository as a different GitHub user, make commits to the forked repo, check in the changes to a newly created branch, and then open a pull request.

Pull request

As soon as the pull request is submitted, you’ll see CodeBuild start executing the build.

Build

GitHub sends an HTTP POST payload to the webhook’s configured URL (highlighted here), which CodeBuild uses to download the latest source code and execute the build phases.

Build project

If you expand the Show all checks option for the GitHub pull request, you’ll see that CodeBuild has completed the build, all checks have passed, and a deep link is provided in Details, which opens the build history in the CodeBuild console.

Pull request

Summary:

In this post, I showed you how to use GitHub as the source provider for your CodeStar projects and how to work with recent commits, issues, and pull requests in the CodeStar dashboard. I also showed you how you can use GitHub pull requests to automatically trigger a build in AWS CodeBuild — specifically, how this functionality makes it easier to collaborate across your team while editing and building your application code with CodeBuild.


About the author:

Balaji Iyer is an Enterprise Consultant for the Professional Services Team at Amazon Web Services. In this role, he has helped several customers successfully navigate their journey to AWS. His specialties include architecting and implementing highly scalable distributed systems, serverless architectures, large scale migrations, operational security, and leading strategic AWS initiatives. Before he joined Amazon, Balaji spent more than a decade building operating systems, big data analytics solutions, mobile services, and web applications. In his spare time, he enjoys experiencing the great outdoors and spending time with his family.

 

JavaScript got better while I wasn’t looking

Post Syndicated from Eevee original https://eev.ee/blog/2017/10/07/javascript-got-better-while-i-wasnt-looking/

IndustrialRobot has generously donated in order to inquire:

In the last few years there seems to have been a lot of activity with adding emojis to Unicode. Has there been an equal effort to add ‘real’ languages/glyph systems/etc?

And as always, if you don’t have anything to say on that topic, feel free to choose your own. :p

Yes.

I mean, each release of Unicode lists major new additions right at the top — Unicode 10, Unicode 9, Unicode 8, etc. They also keep fastidious notes, so you can also dig into how and why these new scripts came from, by reading e.g. the proposal for the addition of Zanabazar Square. I don’t think I have much to add here; I’m not a real linguist, I only play one on TV.

So with that out of the way, here’s something completely different!

A brief history of JavaScript

JavaScript was created in seven days, about eight thousand years ago. It was pretty rough, and it stayed rough for most of its life. But that was fine, because no one used it for anything besides having a trail of sparkles follow your mouse on their Xanga profile.

Then people discovered you could actually do a handful of useful things with JavaScript, and it saw a sharp uptick in usage. Alas, it stayed pretty rough. So we came up with polyfills and jQuerys and all kinds of miscellaneous things that tried to smooth over the rough parts, to varying degrees of success.

And… that’s it. That’s pretty much how things stayed for a while.


I have complicated feelings about JavaScript. I don’t hate it… but I certainly don’t enjoy it, either. It has some pretty neat ideas, like prototypical inheritance and “everything is a value”, but it buries them under a pile of annoying quirks and a woefully inadequate standard library. The DOM APIs don’t make things much better — they seem to be designed as though the target language were Java, rarely taking advantage of any interesting JavaScript features. And the places where the APIs overlap with the language are a hilarious mess: I have to check documentation every single time I use any API that returns a set of things, because there are at least three totally different conventions for handling that and I can’t keep them straight.

The funny thing is that I’ve been fairly happy to work with Lua, even though it shares most of the same obvious quirks as JavaScript. Both languages are weakly typed; both treat nonexistent variables and keys as simply false values, rather than errors; both have a single data structure that doubles as both a list and a map; both use 64-bit floating-point as their only numeric type (though Lua added integers very recently); both lack a standard object model; both have very tiny standard libraries. Hell, Lua doesn’t even have exceptions, not really — you have to fake them in much the same style as Perl.

And yet none of this bothers me nearly as much in Lua. The differences between the languages are very subtle, but combined they make a huge impact.

  • Lua has separate operators for addition and concatenation, so + is never ambiguous. It also has printf-style string formatting in the standard library.

  • Lua’s method calls are syntactic sugar: foo:bar() just means foo.bar(foo). Lua doesn’t even have a special this or self value; the invocant just becomes the first argument. In contrast, JavaScript invokes some hand-waved magic to set its contextual this variable, which has led to no end of confusion.

  • Lua has an iteration protocol, as well as built-in iterators for dealing with list-style or map-style data. JavaScript has a special dedicated Array type and clumsy built-in iteration syntax.

  • Lua has operator overloading and (surprisingly flexible) module importing.

  • Lua allows the keys of a map to be any value (though non-scalars are always compared by identity). JavaScript implicitly converts keys to strings — and since there’s no operator overloading, there’s no way to natively fix this.

These are fairly minor differences, in the grand scheme of language design. And almost every feature in Lua is implemented in a ridiculously simple way; in fact the entire language is described in complete detail in a single web page. So writing JavaScript is always frustrating for me: the language is so close to being much more ergonomic, and yet, it isn’t.

Or, so I thought. As it turns out, while I’ve been off doing other stuff for a few years, browser vendors have been implementing all this pie-in-the-sky stuff from “ES5” and “ES6”, whatever those are. People even upgrade their browsers now. Lo and behold, the last time I went to write JavaScript, I found out that a number of papercuts had actually been solved, and the solutions were sufficiently widely available that I could actually use them in web code.

The weird thing is that I do hear a lot about JavaScript, but the feature I’ve seen raved the most about by far is probably… built-in types for working with arrays of bytes? That’s cool and all, but not exactly the most pressing concern for me.

Anyway, if you also haven’t been keeping tabs on the world of JavaScript, here are some things we missed.

let

MDN docs — supported in Firefox 44, Chrome 41, IE 11, Safari 10

I’m pretty sure I first saw let over a decade ago. Firefox has supported it for ages, but you actually had to opt in by specifying JavaScript version 1.7. Remember JavaScript versions? You know, from back in the days when people actually suggested you write stuff like this:

1
<SCRIPT LANGUAGE="JavaScript1.2" TYPE="text/javascript">

Yikes.

Anyway, so, let declares a variable — but scoped to the immediately containing block, unlike var, which scopes to the innermost function. The trouble with var was that it was very easy to make misleading:

1
2
3
4
5
6
// foo exists here
while (true) {
    var foo = ...;
    ...
}
// foo exists here too

If you reused the same temporary variable name in a different block, or if you expected to be shadowing an outer foo, or if you were trying to do something with creating closures in a loop, this would cause you some trouble.

But no more, because let actually scopes the way it looks like it should, the way variable declarations do in C and friends. As an added bonus, if you refer to a variable declared with let outside of where it’s valid, you’ll get a ReferenceError instead of a silent undefined value. Hooray!

There’s one other interesting quirk to let that I can’t find explicitly documented. Consider:

1
2
3
4
5
6
7
let closures = [];
for (let i = 0; i < 4; i++) {
    closures.push(function() { console.log(i); });
}
for (let j = 0; j < closures.length; j++) {
    closures[j]();
}

If this code had used var i, then it would print 4 four times, because the function-scoped var i means each closure is sharing the same i, whose final value is 4. With let, the output is 0 1 2 3, as you might expect, because each run through the loop gets its own i.

But wait, hang on.

The semantics of a C-style for are that the first expression is only evaluated once, at the very beginning. So there’s only one let i. In fact, it makes no sense for each run through the loop to have a distinct i, because the whole idea of the loop is to modify i each time with i++.

I assume this is simply a special case, since it’s what everyone expects. We expect it so much that I can’t find anyone pointing out that the usual explanation for why it works makes no sense. It has the interesting side effect that for no longer de-sugars perfectly to a while, since this will print all 4s:

1
2
3
4
5
6
7
8
9
closures = [];
let i = 0;
while (i < 4) {
    closures.push(function() { console.log(i); });
    i++;
}
for (let j = 0; j < closures.length; j++) {
    closures[j]();
}

This isn’t a problem — I’m glad let works this way! — it just stands out to me as interesting. Lua doesn’t need a special case here, since it uses an iterator protocol that produces values rather than mutating a visible state variable, so there’s no problem with having the loop variable be truly distinct on each run through the loop.

Classes

MDN docs — supported in Firefox 45, Chrome 42, Safari 9, Edge 13

Prototypical inheritance is pretty cool. The way JavaScript presents it is a little bit opaque, unfortunately, which seems to confuse a lot of people. JavaScript gives you enough functionality to make it work, and even makes it sound like a first-class feature with a property outright called prototype… but to actually use it, you have to do a bunch of weird stuff that doesn’t much look like constructing an object or type.

The funny thing is, people with almost any background get along with Python just fine, and Python uses prototypical inheritance! Nobody ever seems to notice this, because Python tucks it neatly behind a class block that works enough like a Java-style class. (Python also handles inheritance without using the prototype, so it’s a little different… but I digress. Maybe in another post.)

The point is, there’s nothing fundamentally wrong with how JavaScript handles objects; the ergonomics are just terrible.

Lo! They finally added a class keyword. Or, rather, they finally made the class keyword do something; it’s been reserved this entire time.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
class Vector {
    constructor(x, y) {
        this.x = x;
        this.y = y;
    }

    get magnitude() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    }

    dot(other) {
        return this.x * other.x + this.y * other.y;
    }
}

This is all just sugar for existing features: creating a Vector function to act as the constructor, assigning a function to Vector.prototype.dot, and whatever it is you do to make a property. (Oh, there are properties. I’ll get to that in a bit.)

The class block can be used as an expression, with or without a name. It also supports prototypical inheritance with an extends clause and has a super pseudo-value for superclass calls.

It’s a little weird that the inside of the class block has its own special syntax, with function omitted and whatnot, but honestly you’d have a hard time making a class block without special syntax.

One severe omission here is that you can’t declare values inside the block, i.e. you can’t just drop a bar = 3; in there if you want all your objects to share a default attribute. The workaround is to just do this.bar = 3; inside the constructor, but I find that unsatisfying, since it defeats half the point of using prototypes.

Properties

MDN docs — supported in Firefox 4, Chrome 5, IE 9, Safari 5.1

JavaScript historically didn’t have a way to intercept attribute access, which is a travesty. And by “intercept attribute access”, I mean that you couldn’t design a value foo such that evaluating foo.bar runs some code you wrote.

Exciting news: now it does. Or, rather, you can intercept specific attributes, like in the class example above. The above magnitude definition is equivalent to:

1
2
3
4
5
6
7
Object.defineProperty(Vector.prototype, 'magnitude', {
    configurable: true,
    enumerable: true,
    get: function() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    },
});

Beautiful.

And what even are these configurable and enumerable things? It seems that every single key on every single object now has its own set of three Boolean twiddles:

  • configurable means the property itself can be reconfigured with another call to Object.defineProperty.
  • enumerable means the property appears in for..in or Object.keys().
  • writable means the property value can be changed, which only applies to properties with real values rather than accessor functions.

The incredibly wild thing is that for properties defined by Object.defineProperty, configurable and enumerable default to false, meaning that by default accessor properties are immutable and invisible. Super weird.

Nice to have, though. And luckily, it turns out the same syntax as in class also works in object literals.

1
2
3
4
5
6
Vector.prototype = {
    get magnitude() {
        return Math.sqrt(this.x * this.x + this.y * this.y);
    },
    ...
};

Alas, I’m not aware of a way to intercept arbitrary attribute access.

Another feature along the same lines is Object.seal(), which marks all of an object’s properties as non-configurable and prevents any new properties from being added to the object. The object is still mutable, but its “shape” can’t be changed. And of course you can just make the object completely immutable if you want, via setting all its properties non-writable, or just using Object.freeze().

I have mixed feelings about the ability to irrevocably change something about a dynamic runtime. It would certainly solve some gripes of former Haskell-minded colleagues, and I don’t have any compelling argument against it, but it feels like it violates some unwritten contract about dynamic languages — surely any structural change made by user code should also be able to be undone by user code?

Slurpy arguments

MDN docs — supported in Firefox 15, Chrome 47, Edge 12, Safari 10

Officially this feature is called “rest parameters”, but that’s a terrible name, no one cares about “arguments” vs “parameters”, and “slurpy” is a good word. Bless you, Perl.

1
2
3
function foo(a, b, ...args) {
    // ...
}

Now you can call foo with as many arguments as you want, and every argument after the second will be collected in args as a regular array.

You can also do the reverse with the spread operator:

1
2
3
4
5
let args = [];
args.push(1);
args.push(2);
args.push(3);
foo(...args);

It even works in array literals, even multiple times:

1
2
let args2 = [...args, ...args];
console.log(args2);  // [1, 2, 3, 1, 2, 3]

Apparently there’s also a proposal for allowing the same thing with objects inside object literals.

Default arguments

MDN docs — supported in Firefox 15, Chrome 49, Edge 14, Safari 10

Yes, arguments can have defaults now. It’s more like Sass than Python — default expressions are evaluated once per call, and later default expressions can refer to earlier arguments. I don’t know how I feel about that but whatever.

1
2
3
function foo(n = 1, m = n + 1, list = []) {
    ...
}

Also, unlike Python, you can have an argument with a default and follow it with an argument without a default, since the default default (!) is and always has been defined as undefined. Er, let me just write it out.

1
2
3
function bar(a = 5, b) {
    ...
}

Arrow functions

MDN docs — supported in Firefox 22, Chrome 45, Edge 12, Safari 10

Perhaps the most humble improvement is the arrow function. It’s a slightly shorter way to write an anonymous function.

1
2
3
(a, b, c) => { ... }
a => { ... }
() => { ... }

An arrow function does not set this or some other magical values, so you can safely use an arrow function as a quick closure inside a method without having to rebind this. Hooray!

Otherwise, arrow functions act pretty much like regular functions; you can even use all the features of regular function signatures.

Arrow functions are particularly nice in combination with all the combinator-style array functions that were added a while ago, like Array.forEach.

1
2
3
[7, 8, 9].forEach(value => {
    console.log(value);
});

Symbol

MDN docs — supported in Firefox 36, Chrome 38, Edge 12, Safari 9

This isn’t quite what I’d call an exciting feature, but it’s necessary for explaining the next one. It’s actually… extremely weird.

symbol is a new kind of primitive (like number and string), not an object (like, er, Number and String). A symbol is created with Symbol('foo'). No, not new Symbol('foo'); that throws a TypeError, for, uh, some reason.

The only point of a symbol is as a unique key. You see, symbols have one very special property: they can be used as object keys, and will not be stringified. Remember, only strings can be keys in JavaScript — even the indices of an array are, semantically speaking, still strings. Symbols are a new exception to this rule.

Also, like other objects, two symbols don’t compare equal to each other: Symbol('foo') != Symbol('foo').

The result is that symbols solve one of the problems that plauges most object systems, something I’ve talked about before: interfaces. Since an interface might be implemented by any arbitrary type, and any arbitrary type might want to implement any number of arbitrary interfaces, all the method names on an interface are effectively part of a single global namespace.

I think I need to take a moment to justify that. If you have IFoo and IBar, both with a method called method, and you want to implement both on the same type… you have a problem. Because most object systems consider “interface” to mean “I have a method called method, with no way to say which interface’s method you mean. This is a hard problem to avoid, because IFoo and IBar might not even come from the same library. Occasionally languages offer a clumsy way to “rename” one method or the other, but the most common approach seems to be for interface designers to avoid names that sound “too common”. You end up with redundant mouthfuls like IFoo.foo_method.

This incredibly sucks, and the only languages I’m aware of that avoid the problem are the ML family and Rust. In Rust, you define all the methods for a particular trait (interface) in a separate block, away from the type’s “own” methods. It’s pretty slick. You can still do obj.method(), and as long as there’s only one method among all the available traits, you’ll get that one. If not, there’s syntax for explicitly saying which trait you mean, which I can’t remember because I’ve never had to use it.

Symbols are JavaScript’s answer to this problem. If you want to define some interface, you can name its methods with symbols, which are guaranteed to be unique. You just have to make sure you keep the symbol around somewhere accessible so other people can actually use it. (Or… not?)

The interesting thing is that JavaScript now has several of its own symbols built in, allowing user objects to implement features that were previously reserved for built-in types. For example, you can use the Symbol.hasInstance symbol — which is simply where the language is storing an existing symbol and is not the same as Symbol('hasInstance')! — to override instanceof:

1
2
3
4
5
6
7
8
// oh my god don't do this though
class EvenNumber {
    static [Symbol.hasInstance](obj) {
        return obj % 2 == 0;
    }
}
console.log(2 instanceof EvenNumber);  // true
console.log(3 instanceof EvenNumber);  // false

Oh, and those brackets around Symbol.hasInstance are a sort of reverse-quoting — they indicate an expression to use where the language would normally expect a literal identifier. I think they work as object keys, too, and maybe some other places.

The equivalent in Python is to implement a method called __instancecheck__, a name which is not special in any way except that Python has reserved all method names of the form __foo__. That’s great for Python, but doesn’t really help user code. JavaScript has actually outclassed (ho ho) Python here.

Of course, obj[BobNamespace.some_method]() is not the prettiest way to call an interface method, so it’s not perfect. I imagine this would be best implemented in user code by exposing a polymorphic function, similar to how Python’s len(obj) pretty much just calls obj.__len__().

I only bring this up because it’s the plumbing behind one of the most incredible things in JavaScript that I didn’t even know about until I started writing this post. I’m so excited oh my gosh. Are you ready? It’s:

Iteration protocol

MDN docs — supported in Firefox 27, Chrome 39, Safari 10; still experimental in Edge

Yes! Amazing! JavaScript has first-class support for iteration! I can’t even believe this.

It works pretty much how you’d expect, or at least, how I’d expect. You give your object a method called Symbol.iterator, and that returns an iterator.

What’s an iterator? It’s an object with a next() method that returns the next value and whether the iterator is exhausted.

Wait, wait, wait a second. Hang on. The method is called next? Really? You didn’t go for Symbol.next? Python 2 did exactly the same thing, then realized its mistake and changed it to __next__ in Python 3. Why did you do this?

Well, anyway. My go-to test of an iterator protocol is how hard it is to write an equivalent to Python’s enumerate(), which takes a list and iterates over its values and their indices. In Python it looks like this:

1
2
3
4
5
for i, value in enumerate(['one', 'two', 'three']):
    print(i, value)
# 0 one
# 1 two
# 2 three

It’s super nice to have, and I’m always amazed when languages with “strong” “support” for iteration don’t have it. Like, C# doesn’t. So if you want to iterate over a list but also need indices, you need to fall back to a C-style for loop. And if you want to iterate over a lazy or arbitrary iterable but also need indices, you need to track it yourself with a counter. Ridiculous.

Here’s my attempt at building it in JavaScript.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
function enumerate(iterable) {
    // Return a new iter*able* object with a Symbol.iterator method that
    // returns an iterator.
    return {
        [Symbol.iterator]: function() {
            let iterator = iterable[Symbol.iterator]();
            let i = 0;

            return {
                next: function() {
                    let nextval = iterator.next();
                    if (! nextval.done) {
                        nextval.value = [i, nextval.value];
                        i++;
                    }
                    return nextval;
                },
            };
        },
    };
}
for (let [i, value] of enumerate(['one', 'two', 'three'])) {
    console.log(i, value);
}
// 0 one
// 1 two
// 2 three

Incidentally, for..of (which iterates over a sequence, unlike for..in which iterates over keys — obviously) is finally supported in Edge 12. Hallelujah.

Oh, and let [i, value] is destructuring assignment, which is also a thing now and works with objects as well. You can even use the splat operator with it! Like Python! (And you can use it in function signatures! Like Python! Wait, no, Python decided that was terrible and removed it in 3…)

1
let [x, y, ...others] = ['apple', 'orange', 'cherry', 'banana'];

It’s a Halloween miracle. 🎃

Generators

MDN docs — supported in Firefox 26, Chrome 39, Edge 13, Safari 10

That’s right, JavaScript has goddamn generators now. It’s basically just copying Python and adding a lot of superfluous punctuation everywhere. Not that I’m complaining.

Also, generators are themselves iterable, so I’m going to cut to the chase and rewrite my enumerate() with a generator.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
function enumerate(iterable) {
    return {
        [Symbol.iterator]: function*() {
            let i = 0;
            for (let value of iterable) {
                yield [i, value];
                i++;
            }
        },
    };
}
for (let [i, value] of enumerate(['one', 'two', 'three'])) {
    console.log(i, value);
}
// 0 one
// 1 two
// 2 three

Amazing. function* is a pretty strange choice of syntax, but whatever? I guess it also lets them make yield only act as a keyword inside a generator, for ultimate backwards compatibility.

JavaScript generators support everything Python generators do: yield* yields every item from a subsequence, like Python’s yield from; generators can return final values; you can pass values back into the generator if you iterate it by hand. No, really, I wasn’t kidding, it’s basically just copying Python. It’s great. You could now built asyncio in JavaScript!

In fact, they did that! JavaScript now has async and await. An async function returns a Promise, which is also a built-in type now. Amazing.

Sets and maps

MDN docs for MapMDN docs for Set — supported in Firefox 13, Chrome 38, IE 11, Safari 7.1

I did not save the best for last. This is much less exciting than generators. But still exciting.

The only data structure in JavaScript is the object, a map where the strings are keys. (Or now, also symbols, I guess.) That means you can’t readily use custom values as keys, nor simulate a set of arbitrary objects. And you have to worry about people mucking with Object.prototype, yikes.

But now, there’s Map and Set! Wow.

Unfortunately, because JavaScript, Map couldn’t use the indexing operators without losing the ability to have methods, so you have to use a boring old method-based API. But Map has convenient methods that plain objects don’t, like entries() to iterate over pairs of keys and values. In fact, you can use a map with for..of to get key/value pairs. So that’s nice.

Perhaps more interesting, there’s also now a WeakMap and WeakSet, where the keys are weak references. I don’t think JavaScript had any way to do weak references before this, so that’s pretty slick. There’s no obvious way to hold a weak value, but I guess you could substitute a WeakSet with only one item.

Template literals

MDN docs — supported in Firefox 34, Chrome 41, Edge 12, Safari 9

Template literals are JavaScript’s answer to string interpolation, which has historically been a huge pain in the ass because it doesn’t even have string formatting in the standard library.

They’re just strings delimited by backticks instead of quotes. They can span multiple lines and contain expressions.

1
2
console.log(`one plus
two is ${1 + 2}`);

Someone decided it would be a good idea to allow nesting more sets of backticks inside a ${} expression, so, good luck to syntax highlighters.

However, someone also had the most incredible idea ever, which was to add syntax allowing user code to do the interpolation — so you can do custom escaping, when absolutely necessary, which is virtually never, because “escaping” means you’re building a structured format by slopping strings together willy-nilly instead of using some API that works with the structure.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// OF COURSE, YOU SHOULDN'T BE DOING THIS ANYWAY; YOU SHOULD BUILD HTML WITH
// THE DOM API AND USE .textContent FOR LITERAL TEXT.  BUT AS AN EXAMPLE:
function html(literals, ...values) {
    let ret = [];
    literals.forEach((literal, i) => {
        if (i > 0) {
            // Is there seriously still not a built-in function for doing this?
            // Well, probably because you SHOULDN'T BE DOING IT
            ret.push(values[i - 1]
                .replace(/&/g, '&amp;')
                .replace(/</g, '&lt;')
                .replace(/>/g, '&gt;')
                .replace(/"/g, '&quot;')
                .replace(/'/g, '&apos;'));
        }
        ret.push(literal);
    });
    return ret.join('');
}
let username = 'Bob<script>';
let result = html`<b>Hello, ${username}!</b>`;
console.log(result);
// <b>Hello, Bob&lt;script&gt;!</b>

It’s a shame this feature is in JavaScript, the language where you are least likely to need it.

Trailing commas

Remember how you couldn’t do this for ages, because ass-old IE considered it a syntax error and would reject the entire script?

1
2
3
4
5
{
    a: 'one',
    b: 'two',
    c: 'three',  // <- THIS GUY RIGHT HERE
}

Well now it’s part of the goddamn spec and if there’s anything in this post you can rely on, it’s this. In fact you can use AS MANY GODDAMN TRAILING COMMAS AS YOU WANT. But only in arrays.

1
[1, 2, 3,,,,,,,,,,,,,,,,,,,,,,,,,]

Apparently that has the bizarre side effect of reserving extra space at the end of the array, without putting values there.

And more, probably

Like strict mode, which makes a few silent “errors” be actual errors, forces you to declare variables (no implicit globals!), and forbids the completely bozotic with block.

Or String.trim(), which trims whitespace off of strings.

Or… Math.sign()? That’s new? Seriously? Well, okay.

Or the Proxy type, which lets you customize indexing and assignment and calling. Oh. I guess that is possible, though this is a pretty weird way to do it; why not just use symbol-named methods?

You can write Unicode escapes for astral plane characters in strings (or identifiers!), as \u{XXXXXXXX}.

There’s a const now? I extremely don’t care, just name it in all caps and don’t reassign it, come on.

There’s also a mountain of other minor things, which you can peruse at your leisure via MDN or the ECMAScript compatibility tables (note the links at the top, too).

That’s all I’ve got. I still wouldn’t say I’m a big fan of JavaScript, but it’s definitely making an effort to clean up some goofy inconsistencies and solve common problems. I think I could even write some without yelling on Twitter about it now.

On the other hand, if you’re still stuck supporting IE 10 for some reason… well, er, my condolences.

Creating a Cost-Efficient Amazon ECS Cluster for Scheduled Tasks

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/creating-a-cost-efficient-amazon-ecs-cluster-for-scheduled-tasks/

Madhuri Peri
Sr. DevOps Consultant

When you use Amazon Relational Database Service (Amazon RDS), depending on the logging levels on the RDS instances and the volume of transactions, you could generate a lot of log data. To ensure that everything is running smoothly, many customers search for log error patterns using different log aggregation and visualization systems, such as Amazon Elasticsearch Service, Splunk, or other tool of their choice. A module needs to periodically retrieve the RDS logs using the SDK, and then send them to Amazon S3. From there, you can stream them to your log aggregation tool.

One option is writing an AWS Lambda function to retrieve the log files. However, because of the time that this function needs to execute, depending on the volume of log files retrieved and transferred, it is possible that Lambda could time out on many instances.  Another approach is launching an Amazon EC2 instance that runs this job periodically. However, this would require you to run an EC2 instance continuously, not an optimal use of time or money.

Using the new Amazon CloudWatch integration with Amazon EC2 Container Service, you can trigger this job to run in a container on an existing Amazon ECS cluster. Additionally, this would allow you to improve costs by running containers on a fleet of Spot Instances.

In this post, I will show you how to use the new scheduled tasks (cron) feature in Amazon ECS and launch tasks using CloudWatch events, while leveraging Spot Fleet to maximize availability and cost optimization for containerized workloads.

Architecture

The following diagram shows how the various components described schedule a task that retrieves log files from Amazon RDS database instances, and deposits the logs into an S3 bucket.

Amazon ECS cluster container instances are using Spot Fleet, which is a perfect match for the workload that needs to run when it can. This improves cluster costs.

The task definition defines which Docker image to retrieve from the Amazon EC2 Container Registry (Amazon ECR) repository and run on the Amazon ECS cluster.

The container image has Python code functions to make AWS API calls using boto3. It iterates over the RDS database instances, retrieves the logs, and deposits them in the S3 bucket. Many customers choose these logs to be delivered to their centralized log-store. CloudWatch Events defines the schedule for when the container task has to be launched.

Walkthrough

To provide the basic framework, we have built an AWS CloudFormation template that creates the following resources:

  • Amazon ECR repository for storing the Docker image to be used in the task definition
  • S3 bucket that holds the transferred logs
  • Task definition, with image name and S3 bucket as environment variables provided via input parameter
  • CloudWatch Events rule
  • Amazon ECS cluster
  • Amazon ECS container instances using Spot Fleet
  • IAM roles required for the container instance profiles

Before you begin

Ensure that Git, Docker, and the AWS CLI are installed on your computer.

In your AWS account, instantiate one Amazon Aurora instance using the console. For more information, see Creating an Amazon Aurora DB Cluster.

Implementation Steps

  1. Clone the code from GitHub that performs RDS API calls to retrieve the log files.
    git clone https://github.com/awslabs/aws-ecs-scheduled-tasks.git
  2. Build and tag the image.
    cd aws-ecs-scheduled-tasks/container-code/src && ls

    Dockerfile		rdslogsshipper.py	requirements.txt

    docker build -t rdslogsshipper .

    Sending build context to Docker daemon 9.728 kB
    Step 1 : FROM python:3
     ---> 41397f4f2887
    Step 2 : WORKDIR /usr/src/app
     ---> Using cache
     ---> 59299c020e7e
    Step 3 : COPY requirements.txt ./
     ---> 8c017e931c3b
    Removing intermediate container df09e1bed9f2
    Step 4 : COPY rdslogsshipper.py /usr/src/app
     ---> 099a49ca4325
    Removing intermediate container 1b1da24a6699
    Step 5 : RUN pip install --no-cache-dir -r requirements.txt
     ---> Running in 3ed98b30901d
    Collecting boto3 (from -r requirements.txt (line 1))
      Downloading boto3-1.4.6-py2.py3-none-any.whl (128kB)
    Collecting botocore (from -r requirements.txt (line 2))
      Downloading botocore-1.6.7-py2.py3-none-any.whl (3.6MB)
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3->-r requirements.txt (line 1))
      Downloading s3transfer-0.1.10-py2.py3-none-any.whl (54kB)
    Collecting jmespath<1.0.0,>=0.7.1 (from boto3->-r requirements.txt (line 1))
      Downloading jmespath-0.9.3-py2.py3-none-any.whl
    Collecting python-dateutil<3.0.0,>=2.1 (from botocore->-r requirements.txt (line 2))
      Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
    Collecting docutils>=0.10 (from botocore->-r requirements.txt (line 2))
      Downloading docutils-0.14-py3-none-any.whl (543kB)
    Collecting six>=1.5 (from python-dateutil<3.0.0,>=2.1->botocore->-r requirements.txt (line 2))
      Downloading six-1.10.0-py2.py3-none-any.whl
    Installing collected packages: six, python-dateutil, docutils, jmespath, botocore, s3transfer, boto3
    Successfully installed boto3-1.4.6 botocore-1.6.7 docutils-0.14 jmespath-0.9.3 python-dateutil-2.6.1 s3transfer-0.1.10 six-1.10.0
     ---> f892d3cb7383
    Removing intermediate container 3ed98b30901d
    Step 6 : COPY . .
     ---> ea7550c04fea
    Removing intermediate container b558b3ebd406
    Successfully built ea7550c04fea
  3. Run the CloudFormation stack and get the names for the Amazon ECR repo and S3 bucket. In the stack, choose Outputs.
  4. Open the ECS console and choose Repositories. The rdslogs repo has been created. Choose View Push Commands and follow the instructions to connect to the repository and push the image for the code that you built in Step 2. The screenshot shows the final result:
  5. Associate the CloudWatch scheduled task with the created Amazon ECS Task Definition, using a new CloudWatch event rule that is scheduled to run at intervals. The following rule is scheduled to run every 15 minutes:
    aws --profile default --region us-west-2 events put-rule --name demo-ecs-task-rule  --schedule-expression "rate(15 minutes)"

    {
        "RuleArn": "arn:aws:events:us-west-2:12345678901:rule/demo-ecs-task-rule"
    }
  6. CloudWatch requires IAM permissions to place a task on the Amazon ECS cluster when the CloudWatch event rule is executed, in addition to an IAM role that can be assumed by CloudWatch Events. This is done in three steps:
    1. Create the IAM role to be assumed by CloudWatch.
      aws --profile default --region us-west-2 iam create-role --role-name Test-Role --assume-role-policy-document file://event-role.json

      {
          "Role": {
              "AssumeRolePolicyDocument": {
                  "Version": "2012-10-17", 
                  "Statement": [
                      {
                          "Action": "sts:AssumeRole", 
                          "Effect": "Allow", 
                          "Principal": {
                              "Service": "events.amazonaws.com"
                          }
                      }
                  ]
              }, 
              "RoleId": "AROAIRYYLDCVZCUACT7FS", 
              "CreateDate": "2017-07-14T22:44:52.627Z", 
              "RoleName": "Test-Role", 
              "Path": "/", 
              "Arn": "arn:aws:iam::12345678901:role/Test-Role"
          }
      }

      The following is an example of the event-role.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                    "Service": "events.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    2. Create the IAM policy defining the ECS cluster and task definition. You need to get these values from the CloudFormation outputs and resources.
      aws --profile default --region us-west-2 iam create-policy --policy-name test-policy --policy-document file://event-policy.json

      {
          "Policy": {
              "PolicyName": "test-policy", 
              "CreateDate": "2017-07-14T22:51:20.293Z", 
              "AttachmentCount": 0, 
              "IsAttachable": true, 
              "PolicyId": "ANPAI7XDIQOLTBUMDWGJW", 
              "DefaultVersionId": "v1", 
              "Path": "/", 
              "Arn": "arn:aws:iam::123455678901:policy/test-policy", 
              "UpdateDate": "2017-07-14T22:51:20.293Z"
          }
      }

      The following is an example of the event-policy.json file used earlier:

      {
          "Version": "2012-10-17",
          "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ecs:RunTask"
                ],
                "Resource": [
                    "arn:aws:ecs:*::task-definition/"
                ],
                "Condition": {
                    "ArnLike": {
                        "ecs:cluster": "arn:aws:ecs:*::cluster/"
                    }
                }
            }
          ]
      }
    3. Attach the IAM policy to the role.
      aws --profile default --region us-west-2 iam attach-role-policy --role-name Test-Role --policy-arn arn:aws:iam::1234567890:policy/test-policy
  7. Associate the CloudWatch rule created earlier to place the task on the ECS cluster. The following command shows an example. Replace the AWS account ID and region with your settings.
    aws events put-targets --rule demo-ecs-task-rule --targets "Id"="1","Arn"="arn:aws:ecs:us-west-2:12345678901:cluster/test-cwe-blog-ecsCluster-15HJFWCH4SP67","EcsParameters"={"TaskDefinitionArn"="arn:aws:ecs:us-west-2:12345678901:task-definition/test-cwe-blog-taskdef:8"},"RoleArn"="arn:aws:iam::12345678901:role/Test-Role"

    {
        "FailedEntries": [], 
        "FailedEntryCount": 0
    }

That’s it. The logs now run based on the defined schedule.

To test this, open the Amazon ECS console, select the Amazon ECS cluster that you created, and then choose Tasks, Run New Task. Select the task definition created by the CloudFormation template, and the cluster should be selected automatically. As this runs, the S3 bucket should be populated with the RDS logs for the instance.

Conclusion

In this post, you’ve seen that the choices for workloads that need to run at a scheduled time include Lambda with CloudWatch events or EC2 with cron. However, sometimes the job could run outside of Lambda execution time limits or be not cost-effective for an EC2 instance.

In such cases, you can schedule the tasks on an ECS cluster using CloudWatch rules. In addition, you can use a Spot Fleet cluster with Amazon ECS for cost-conscious workloads that do not have hard requirements on execution time or instance availability in the Spot Fleet. For more information, see Powering your Amazon ECS Cluster with Amazon EC2 Spot Instances and Scheduled Events.

If you have questions or suggestions, please comment below.

Using Enhanced Request Authorizers in Amazon API Gateway

Post Syndicated from Stefano Buliani original https://aws.amazon.com/blogs/compute/using-enhanced-request-authorizers-in-amazon-api-gateway/

Recently, AWS introduced a new type of authorizer in Amazon API Gateway, enhanced request authorizers. Previously, custom authorizers received only the bearer token included in the request and the ARN of the API Gateway method being called. Enhanced request authorizers receive all of the headers, query string, and path parameters as well as the request context. This enables you to make more sophisticated authorization decisions based on parameters such as the client IP address, user agent, or a query string parameter alongside the client bearer token.

Enhanced request authorizer configuration

From the API Gateway console, you can declare a new enhanced request authorizer by selecting the Request option as the AWS Lambda event payload:

Create enhanced request authorizer

 

Just like normal custom authorizers, API Gateway can cache the policy returned by your Lambda function. With enhanced request authorizers, however, you can also specify the values that form the unique key of a policy in the cache. For example, if your authorization decision is based on both the bearer token and the IP address of the client, both values should be part of the unique key in the policy cache. The identity source parameter lets you specify these values as mapping expressions:

  • The bearer token appears in the Authorization header
  • The client IP address is stored in the sourceIp parameter of the request context.

Configure identity sources

 

Using enhanced request authorizers with Swagger

You can also define enhanced request authorizers in your Swagger (Open API) definitions. In the following example, you can see that all of the options configured in the API Gateway console are available as custom extensions in the API definition. For example, the identitySource field is a comma-separated list of mapping expressions.

securityDefinitions:
  IpAuthorizer:
    type: "apiKey"
    name: "IpAuthorizer"
    in: "header"
    x-amazon-apigateway-authtype: "custom"
    x-amazon-apigateway-authorizer:
      authorizerResultTtlInSeconds: 300
      identitySource: "method.request.header.Authorization, context.identity.sourceIp"
      authorizerUri: "arn:aws:apigateway:us-east-1:lambda:path/2015-03-31/functions/arn:aws:lambda:us-east-1:XXXXXXXXXX:function:py-ip-authorizer/invocations"
      type: "request"

After you have declared your authorizer in the security definitions section, you can use it in your API methods:

---
swagger: "2.0"
info:
  title: "request-authorizer-demo"
basePath: "/dev"
paths:
  /hello:
    get:
      security:
      - IpAuthorizer: []
...

Enhanced request authorizer Lambda functions

Enhanced request authorizer Lambda functions receive an event object that is similar to proxy integrations. It contains all of the information about a request, excluding the body.

{
    "methodArn": "arn:aws:execute-api:us-east-1:XXXXXXXXXX:xxxxxx/dev/GET/hello",
    "resource": "/hello",
    "requestContext": {
        "resourceId": "xxxx",
        "apiId": "xxxxxxxxx",
        "resourcePath": "/hello",
        "httpMethod": "GET",
        "requestId": "9e04ff18-98a6-11e7-9311-ef19ba18fc8a",
        "path": "/dev/hello",
        "accountId": "XXXXXXXXXXX",
        "identity": {
            "apiKey": "",
            "sourceIp": "58.240.196.186"
        },
        "stage": "dev"
    },
    "queryStringParameters": {},
    "httpMethod": "GET",
    "pathParameters": {},
    "headers": {
        "cache-control": "no-cache",
        "x-amzn-ssl-client-hello": "AQACJAMDAAAAAAAAAAAAAAAAAAAAAAAAAAAA…",
        "Accept-Encoding": "gzip, deflate",
        "X-Forwarded-For": "54.240.196.186, 54.182.214.90",
        "Accept": "*/*",
        "User-Agent": "PostmanRuntime/6.2.5",
        "Authorization": "hello"
    },
    "stageVariables": {},
    "path": "/hello",
    "type": "REQUEST"
}

The following enhanced request authorizer snippet is written in Python and compares the source IP address against a list of valid IP addresses. The comments in the code explain what happens in each step.

...
VALID_IPS = ["58.240.195.186", "201.246.162.38"]

def lambda_handler(event, context):

    # Read the client’s bearer token.
    jwtToken = event["headers"]["Authorization"]
    
    # Read the source IP address for the request form 
    # for the API Gateway context object.
    clientIp = event["requestContext"]["identity"]["sourceIp"]
    
    # Verify that the client IP address is allowed.
    # If it’s not valid, raise an exception to make sure
    # that API Gateway returns a 401 status code.
    if clientIp not in VALID_IPS:
        raise Exception('Unauthorized')
    
    # Only allow hello users in!
    if not validate_jwt(userId):
        raise Exception('Unauthorized')

    # Use the values from the event object to populate the 
    # required parameters in the policy object.
    policy = AuthPolicy(userId, event["requestContext"]["accountId"])
    policy.restApiId = event["requestContext"]["apiId"]
    policy.region = event["methodArn"].split(":")[3]
    policy.stage = event["requestContext"]["stage"]
    
    # Use the scopes from the bearer token to make a 
    # decision on which methods to allow in the API.
    policy.allowMethod(HttpVerb.GET, '/hello')

    # Finally, build the policy.
    authResponse = policy.build()

    return authResponse
...

Conclusion

API Gateway customers build complex APIs, and authorization decisions often go beyond the simple properties in a JWT token. For example, users may be allowed to call the “list cars” endpoint but only with a specific subset of filter parameters. With enhanced request authorizers, you have access to all request parameters. You can centralize all of your application’s access control decisions in a Lambda function, making it easier to manage your application security.

How Much Money Can Pirate Bay Make From a Cryptocoin Miner?

Post Syndicated from Ernesto original https://torrentfreak.com/how-much-money-can-pirate-bay-make-from-a-cryptocoin-miner-170924/

In recent years many pirate sites have struggled to make a decent income.

Not only are more people using ad-blockers now, the ad-quality is also dropping as copyright holders actively go after this revenue source, trying to dry up the funds of pirate sites.

Last weekend The Pirate Bay tested a cryptocurrency miner to see whether that could offer a viable alternative. This created quite a bit of backlash, but there were plenty of positive comments too.

The question still remains whether the mining efforts can bring in enough money to pay all the bills.

The miner is provided by Coinhive which, at the time of writing, pays out 0.00015 XMR per 1M hashes. So how much can The Pirate Bay make from this?

To get a rough idea we did some back-of-the-envelope calculations, starting with the site’s visitor numbers.

SimilarWeb estimates that The Pirate Bay has roughly 315 million visits per month. On average, users spend five minutes on the site per “visit”. While we have reason to believe that this underestimates the site’s popularity, we’ll use it as an illustration.

We spoke to Coinhive and they estimate that a user with a mid-range laptop would have a hashrate of 30 h/s.

In Pirate Bay’s case this would translate to 30 hashes * 300 seconds * 315M visits = 2,835,000M hashes per month. If the miner is throttled at 30% this would drop to 850,000M hashes.

If Coinhive pays out 0.00015 XMR per million hashes, TPB would get 127.5 XMR per month, which is roughly $12,000 at the moment. Since the miner doesn’t appear on all pages and because some may actively block it, this number will drop a bit further.

Keep in mind that this is just an illustration using several estimated variables which may vary greatly over time. Still, it gives a broad idea of the potential.

Since Pirate Bay tested the miner several other sites jumped on board as well. We’ll keep a close eye on the developments and hope we can share some real data in the future.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Using AWS CodePipeline, AWS CodeBuild, and AWS Lambda for Serverless Automated UI Testing

Post Syndicated from Prakash Palanisamy original https://aws.amazon.com/blogs/devops/using-aws-codepipeline-aws-codebuild-and-aws-lambda-for-serverless-automated-ui-testing/

Testing the user interface of a web application is an important part of the development lifecycle. In this post, I’ll explain how to automate UI testing using serverless technologies, including AWS CodePipeline, AWS CodeBuild, and AWS Lambda.

I built a website for UI testing that is hosted in S3. I used Selenium to perform cross-browser UI testing on Chrome, Firefox, and PhantomJS, a headless WebKit browser with Ghost Driver, an implementation of the WebDriver Wire Protocol. I used Python to create test cases for ChromeDriver, FirefoxDriver, or PhatomJSDriver based the browser against which the test is being executed.

Resources referred to in this post, including the AWS CloudFormation template, test and status websites hosted in S3, AWS CodeBuild build specification files, AWS Lambda function, and the Python script that performs the test are available in the serverless-automated-ui-testing GitHub repository.

S3 Hosted Test Website:

AWS CodeBuild supports custom containers so we can use the Selenium/standalone-Firefox and Selenium/standalone-Chrome containers, which include prebuild Firefox and Chrome browsers, respectively. Xvfb performs the graphical operation in virtual memory without any display hardware. It will be installed in the CodeBuild containers during the install phase.

Build Spec for Chrome and Firefox

The build specification for Chrome and Firefox testing includes multiple phases:

  • The environment variables section contains a set of default variables that are overridden while creating the build project or triggering the build.
  • As part of install phase, required packages like Xvfb and Selenium are installed using yum.
  • During the pre_build phase, the test bed is prepared for test execution.
  • During the build phase, the appropriate DISPLAY is set and the tests are executed.
version: 0.2

env:
  variables:
    BROWSER: "chrome"
    WebURL: "https://sampletestweb.s3-eu-west-1.amazonaws.com/website/index.html"
    ArtifactBucket: "codebuild-demo-artifact-repository"
    MODULES: "mod1"
    ModuleTable: "test-modules"
    StatusTable: "blog-test-status"

phases:
  install:
    commands:
      - apt-get update
      - apt-get -y upgrade
      - apt-get install xvfb python python-pip build-essential -y
      - pip install --upgrade pip
      - pip install selenium
      - pip install awscli
      - pip install requests
      - pip install boto3
      - cp xvfb.init /etc/init.d/xvfb
      - chmod +x /etc/init.d/xvfb
      - update-rc.d xvfb defaults
      - service xvfb start
      - export PATH="$PATH:`pwd`/webdrivers"
  pre_build:
    commands:
      - python prepare_test.py
  build:
    commands:
      - export DISPLAY=:5
      - cd tests
      - echo "Executing simple test..."
      - python testsuite.py

Because Ghost Driver runs headless, it can be executed on AWS Lambda. In keeping with a fire-and-forget model, I used CodeBuild to create the PhantomJS Lambda function and trigger the test invocations on Lambda in parallel. This is powerful because many tests can be executed in parallel on Lambda.

Build Spec for PhantomJS

The build specification for PhantomJS testing also includes multiple phases. It is a little different from the preceding example because we are using AWS Lambda for the test execution.

  • The environment variables section contains a set of default variables that are overridden while creating the build project or triggering the build.
  • As part of install phase, the required packages like Selenium and the AWS CLI are installed using yum.
  • During the pre_build phase, the test bed is prepared for test execution.
  • During the build phase, a zip file that will be used to create the PhantomJS Lambda function is created and tests are executed on the Lambda function.
version: 0.2

env:
  variables:
    BROWSER: "phantomjs"
    WebURL: "https://sampletestweb.s3-eu-west-1.amazonaws.com/website/index.html"
    ArtifactBucket: "codebuild-demo-artifact-repository"
    MODULES: "mod1"
    ModuleTable: "test-modules"
    StatusTable: "blog-test-status"
    LambdaRole: "arn:aws:iam::account-id:role/role-name"

phases:
  install:
    commands:
      - apt-get update
      - apt-get -y upgrade
      - apt-get install python python-pip build-essential -y
      - apt-get install zip unzip -y
      - pip install --upgrade pip
      - pip install selenium
      - pip install awscli
      - pip install requests
      - pip install boto3
  pre_build:
    commands:
      - python prepare_test.py
  build:
    commands:
      - cd lambda_function
      - echo "Packaging Lambda Function..."
      - zip -r /tmp/lambda_function.zip ./*
      - func_name=`echo $CODEBUILD_BUILD_ID | awk -F ':' '{print $1}'`-phantomjs
      - echo "Creating Lambda Function..."
      - chmod 777 phantomjs
      - |
         func_list=`aws lambda list-functions | grep FunctionName | awk -F':' '{print $2}' | tr -d ', "'`
         if echo "$func_list" | grep -qw $func_name
         then
             echo "Lambda function already exists."
         else
             aws lambda create-function --function-name $func_name --runtime "python2.7" --role $LambdaRole --handler "testsuite.lambda_handler" --zip-file fileb:///tmp/lambda_function.zip --timeout 150 --memory-size 1024 --environment Variables="{WebURL=$WebURL, StatusTable=$StatusTable}" --tags Name=$func_name
         fi
      - export PhantomJSFunction=$func_name
      - cd ../tests/
      - python testsuite.py

The list of test cases and the test modules that belong to each case are stored in an Amazon DynamoDB table. Based on the list of modules passed as an argument to the CodeBuild project, CodeBuild gets the test cases from that table and executes them. The test execution status and results are stored in another Amazon DynamoDB table. It will read the test status from the status table in DynamoDB and display it.

AWS CodeBuild and AWS Lambda perform the test execution as individual tasks. AWS CodePipeline plays an important role here by enabling continuous delivery and parallel execution of tests for optimized testing.

Here’s how to do it:

In AWS CodePipeline, create a pipeline with four stages:

  • Source (AWS CodeCommit)
  • UI testing (AWS Lambda and AWS CodeBuild)
  • Approval (manual approval)
  • Production (AWS Lambda)

Pipeline stages, the actions in each stage, and transitions between stages are shown in the following diagram.

This design implemented in AWS CodePipeline looks like this:

CodePipeline automatically detects a change in the source repository and triggers the execution of the pipeline.

In the UITest stage, there are two parallel actions:

  • DeployTestWebsite invokes a Lambda function to deploy the test website in S3 as an S3 website.
  • DeployStatusPage invokes another Lambda function to deploy in parallel the status website in S3 as an S3 website.

Next, there are three parallel actions that trigger the CodeBuild project:

  • TestOnChrome launches a container to perform the Selenium tests on Chrome.
  • TestOnFirefox launches another container to perform the Selenium tests on Firefox.
  • TestOnPhantomJS creates a Lambda function and invokes individual Lambda functions per test case to execute the test cases in parallel.

You can monitor the status of the test execution on the status website, as shown here:

When the UI testing is completed successfully, the pipeline continues to an Approval stage in which a notification is sent to the configured SNS topic. The designated team member reviews the test status and approves or rejects the deployment. Upon approval, the pipeline continues to the Production stage, where it invokes a Lambda function and deploys the website to a production S3 bucket.

I used a CloudFormation template to set up my continuous delivery pipeline. The automated-ui-testing.yaml template, available from GitHub, sets up a full-featured pipeline.

When I use the template to create my pipeline, I specify the following:

  • AWS CodeCommit repository.
  • SNS topic to send approval notification.
  • S3 bucket name where the artifacts will be stored.

The stack name should follow the rules for S3 bucket naming because it will be part of the S3 bucket name.

When the stack is created successfully, the URLs for the test website and status website appear in the Outputs section, as shown here:

Conclusion

In this post, I showed how you can use AWS CodePipeline, AWS CodeBuild, AWS Lambda, and a manual approval process to create a continuous delivery pipeline for serverless automated UI testing. Websites running on Amazon EC2 instances or AWS Elastic Beanstalk can also be tested using similar approach.


About the author

Prakash Palanisamy is a Solutions Architect for Amazon Web Services. When he is not working on Serverless, DevOps or Alexa, he will be solving problems in Project Euler. He also enjoys watching educational documentaries.