Use AWS CodeDeploy to Deploy to Amazon EC2 Instances Behind an Elastic Load Balancer

Post Syndicated from Thomas Schmitt original http://blogs.aws.amazon.com/application-management/post/Tx39X8HM93NXU47/Use-AWS-CodeDeploy-to-Deploy-to-Amazon-EC2-Instances-Behind-an-Elastic-Load-Bala

AWS CodeDeploy is a new service that makes it easy to deploy application updates to Amazon EC2 instances. CodeDeploy is targeted at customers who manage their EC2 instances directly, instead of those who use an application management service like AWS Elastic Beanstalk or AWS OpsWorks that have their own built-in deployment features. CodeDeploy allows developers and administrators to centrally control and track their application deployments across their different development, testing, and production environments.

 

Let’s assume you have an application architecture designed for high availability that includes an Elastic Load Balancer in front of multiple application servers belonging to an Auto Scaling group. Elastic Load Balancing enables you to distribute incoming traffic over multiple servers, and Auto Scaling allows you to scale your EC2 capacity up or down automatically according to your needs. In this blog post, we will show how you can use CodeDeploy to avoid downtime when updating the code running on your application servers in such an environment. We will use the CodeDeploy rolling updates feature so that a minimum capacity is always available to serve traffic, and a simple script to take each EC2 instance out of the load balancer while we deploy new code to it.

 

So let’s get started. We are going to:

Set up the environment described above

Create your artifact bundle, which includes the deployment scripts, and upload it to Amazon S3

Create an AWS CodeDeploy application and a deployment group

Start the zero-downtime deployment

Monitor your deployment

 

1. Set up the environment

Let’s get started by setting up some AWS resources.

 

To simplify the setup process, you can use a sample AWS CloudFormation template that sets up the following resources for you:

An Auto Scaling group and its launch configuration. The Auto Scaling group launches by default three Amazon EC2 instances. The AWS CloudFormation template installs Apache on each of these instances to run a sample website. It also installs the AWS CodeDeploy Agent, which performs the deployments on the instance. The template creates a service role that grants AWS CodeDeploy access to add deployment lifecycle event hooks to your Auto Scaling group so that it can kick off a deployment whenever Auto Scaling launches a new Amazon EC2 instance.

The Auto Scaling group spins up Amazon EC2 instances and monitors their health for you. The Auto Scaling Group spans all Availability Zones within the region for fault tolerance. 

An Elastic Load Balancing load balancer, which distributes the traffic across all of the Amazon EC2 instances in the Auto Scaling group.

 

Simply execute the following command using the AWS Command Line Interface (AWS CLI), or you can create an AWS CloudFormation stack with the AWS Management Console by using the value of the --template-url option shown here:

 

aws cloudformation create-stack \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --template-url "http://s3.amazonaws.com/aws-codedeploy-us-east-1/templates/latest/CodeDeploy_SampleCF_ELB_Integration.json" \
    --capabilities "CAPABILITY_IAM" \
    --parameters "ParameterKey=KeyName,ParameterValue=<my-key-pair>"

 

Note: AWS CloudFormation will change your AWS account’s security configuration by adding two roles. These roles will enable AWS CodeDeploy to perform actions on your AWS account’s behalf. These actions include identifying Amazon EC2 instances by their tags or Auto Scaling group names and deploying applications from Amazon S3 buckets to instances. For more information, see the AWS CodeDeploy service role and IAM instance profile documentation.

 

2. Create your artifact bundle, which includes the deployment scripts, and upload it to Amazon S3

You can use the following sample artifact bundle in Amazon S3, which includes everything you need: the Application Specification (AppSpec) file, deployment scripts, and a sample web page:

 

http://s3.amazonaws.com/aws-codedeploy-us-east-1/samples/latest/SampleApp_ELB_Integration.zip

 

This artifact bundle contains the deployment artifacts and a set of scripts that call the Auto Scaling EnterStandby and ExitStandby APIs to deregister an Amazon EC2 instance from the load balancer and register it again.

 

The installation scripts and deployment artifacts are bundled together with a CodeDeploy AppSpec file. The AppSpec file must be placed in the root of your archive and describes where to copy the application and how to execute installation scripts. 

 

Here is the appspec.yml file from the sample artifact bundle:

 

version: 0.0
os: linux
files:
  - source: /html
    destination: /var/www/html
hooks:
  BeforeInstall:
    - location: scripts/deregister_from_elb.sh
      timeout: 400
    - location: scripts/stop_server.sh
      timeout: 120
      runas: root
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 120
      runas: root
    - location: scripts/register_with_elb.sh
      timeout: 120

 

The commands defined in the AppSpec file are executed in the following order (see the AWS CodeDeploy AppSpec File Reference for more details):

BeforeInstall deployment lifecycle event
First, it deregisters the instance from the load balancer (deregister_from_elb.sh). I have increased the timeout for the deregistration script beyond the 300 seconds that the load balancer waits for connections to close, which is the default value when connection draining is enabled.
After that it stops the Apache Web Server (stop_server.sh).

Install deployment lifecycle event
Next, the host agent copies the HTML pages defined in the ‘files’ section from the ‘/html’ folder of the archive to ‘/var/www/html’ on the server.

ApplicationStart deployment lifecycle event
It starts the Apache Web Server (start_server.sh).
It then registers the instance with the load balancer (register_with_elb.sh).

In case you are wondering why I used BeforeInstall instead of the ApplicationStop deployment lifecycle event: ApplicationStop always executes the scripts from the previously deployed bundle, so on the very first deployment with AWS CodeDeploy there would be no such scripts and the instance would not get deregistered from the load balancer.

 

 

Here’s what the deregister script does, step by step:

The script gets the instance ID (and AWS region) from the Amazon EC2 metadata service (a minimal sketch of this step follows after this list).

It checks if the instance is part of an Auto Scaling group.

After that the script deregisters the instance from the load balancer by putting the instance into standby mode in the Auto Scaling group.

The script keeps polling the Auto Scaling API every second until the instance is in standby mode, which means it has been deregistered from the load balancer.

The deregistration might take a while if connection draining is enabled. The server has to finish processing the ongoing requests first before we can continue with the deployment.
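
Querying the instance metadata service (the first step above) boils down to a couple of HTTP calls. This is only a sketch of the idea, not the exact code of the sample script’s get_instance_id helper:

# Instance ID and region from the EC2 instance metadata service (illustrative sketch).
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AVAILABILITY_ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AVAILABILITY_ZONE%?}   # strip the trailing zone letter, e.g. us-east-1a -> us-east-1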

 

For example, the following is the section of the deregister_from_elb.sh sample script that removes the Amazon EC2 instance from the load balancer:

 

# Get this instance's ID
INSTANCE_ID=$(get_instance_id)
if [ $? != 0 -o -z "$INSTANCE_ID" ]; then
    error_exit "Unable to get this instance's ID; cannot continue."
fi

msg "Checking if instance $INSTANCE_ID is part of an AutoScaling group"
asg=$(autoscaling_group_name $INSTANCE_ID)
if [ $? == 0 -a -n "$asg" ]; then
    msg "Found AutoScaling group for instance $INSTANCE_ID: $asg"

    msg "Attempting to put instance into Standby"
    autoscaling_enter_standby $INSTANCE_ID $asg
    if [ $? != 0 ]; then
        error_exit "Failed to move instance into standby"
    else
        msg "Instance is in standby"
        exit 0
    fi
fi

 

The ‘autoscaling_enter_standby’ function is defined in the common_functions.sh sample script as follows:

 

autoscaling_enter_standby() {
    local instance_id=$1
    local asg_name=$2

    msg "Putting instance $instance_id into Standby"
    $AWS_CLI autoscaling enter-standby \
        --instance-ids $instance_id \
        --auto-scaling-group-name $asg_name \
        --should-decrement-desired-capacity
    if [ $? != 0 ]; then
        msg "Failed to put instance $instance_id into standby for ASG $asg_name."
        return 1
    fi

    msg "Waiting for move to standby to finish."
    wait_for_state "autoscaling" $instance_id "Standby"
    if [ $? != 0 ]; then
        local wait_timeout=$(($WAITER_INTERVAL * $WAITER_ATTEMPTS))
        msg "$instance_id did not make it to standby after $wait_timeout seconds"
        return 1
    fi

    return 0
}

 

The register_with_elb.sh sample script works in a similar way. It calls the ‘autoscaling_exit_standby’ from the common_functions.sh sample script to put the instance back in service in the load balancer.
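
For reference, a minimal sketch of what such an exit-standby helper could look like, assuming the same msg and wait_for_state helpers from common_functions.sh (the actual sample script may differ in its details):

autoscaling_exit_standby() {
    local instance_id=$1
    local asg_name=$2

    msg "Moving instance $instance_id out of Standby"
    $AWS_CLI autoscaling exit-standby \
        --instance-ids $instance_id \
        --auto-scaling-group-name $asg_name
    if [ $? != 0 ]; then
        msg "Failed to move instance $instance_id out of standby for ASG $asg_name."
        return 1
    fi

    msg "Waiting for exit from standby to finish."
    wait_for_state "autoscaling" $instance_id "InService"
    if [ $? != 0 ]; then
        local wait_timeout=$(($WAITER_INTERVAL * $WAITER_ATTEMPTS))
        msg "$instance_id did not make it to InService after $wait_timeout seconds"
        return 1
    fi

    return 0
}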

 

The register and deregister scripts are executed on each Amazon EC2 instance in your fleet. The instances must have access to the AutoScaling API to put themselves into standby mode and back in service. Your Amazon EC2 instance role needs the following permissions:

 

{
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "autoscaling:Describe*",
                "autoscaling:EnterStandby",
                "autoscaling:ExitStandby",
                "cloudformation:Describe*",
                "cloudformation:GetTemplate",
                "s3:Get*"
            ],
            "Resource": "*"
        }
    ]
}

 

If you use the provided CloudFormation template, an IAM instance role with the necessary permissions is automatically created for you.

 

For more details on how to create a deployment archive, see Prepare a Revision for AWS CodeDeploy.

 

3. Create an AWS CodeDeploy application and a deployment group

The next step is to create the AWS CodeDeploy resources and configure the roll-out strategy. The following commands tell AWS CodeDeploy where to deploy your artifact bundle (all instances of the given Auto Scaling group) and how to deploy it (OneAtATime). The deployment configuration ‘OneAtATime’ is the safest way to deploy, because only one instance of the Auto Scaling group is updated at a time.
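
If you want a different trade-off between deployment speed and spare capacity, you could also create your own deployment configuration instead of using one of the built-in ones. The following is only a sketch; the configuration name is made up for illustration:

# Hypothetical custom deployment configuration that keeps at least half of the fleet in service.
aws deploy create-deployment-config \
    --deployment-config-name "KeepHalfFleetInService" \
    --minimum-healthy-hosts type=FLEET_PERCENT,value=50

You would then pass that name to --deployment-config-name when creating the deployment group below.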

 

# Create a new AWS CodeDeploy application.
aws deploy create-application --application-name "SampleELBWebApp"

# Get the AWS CodeDeploy service role ARN and Auto Scaling group name
# from the AWS CloudFormation template.
output_parameters=$(aws cloudformation describe-stacks \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --output text \
    --query 'Stacks[0].Outputs[*].OutputValue')
service_role_arn=$(echo $output_parameters | awk '{print $2}')
autoscaling_group_name=$(echo $output_parameters | awk '{print $3}')

# Create an AWS CodeDeploy deployment group that uses
# the Auto Scaling group created by the AWS CloudFormation template.
# Set up the deployment group so that it deploys to
# only one instance at a time.
aws deploy create-deployment-group \
    --application-name "SampleELBWebApp" \
    --deployment-group-name "SampleELBDeploymentGroup" \
    --auto-scaling-groups "$autoscaling_group_name" \
    --service-role-arn "$service_role_arn" \
    --deployment-config-name "CodeDeployDefault.OneAtATime"

 

4. Start the zero-downtime deployment

Now you are ready to start your rolling, zero-downtime deployment. 

 

aws deploy create-deployment \
    --application-name "SampleELBWebApp" \
    --s3-location "bucket=aws-codedeploy-us-east-1,key=samples/latest/SampleApp_ELB_Integration.zip,bundleType=zip" \
    --deployment-group-name "SampleELBDeploymentGroup"

 

5. Monitor your deployment

You can see how your instances are taken out of service and back into service with the following command:

 

watch -n1 aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$autoscaling_group_name" \
    --query 'Activities[*].Description'

 

Every 1.0s: aws autoscaling describe-scaling-activities […]
[
"Moving EC2 instance out of Standby: i-d308b93c",
"Moving EC2 instance to Standby: i-d308b93c",
"Moving EC2 instance out of Standby: i-a9695458",
"Moving EC2 instance to Standby: i-a9695458",
"Moving EC2 instance out of Standby: i-2478cade",
"Moving EC2 instance to Standby: i-2478cade",
"Launching a new EC2 instance: i-d308b93c",
"Launching a new EC2 instance: i-a9695458",
"Launching a new EC2 instance: i-2478cade"
]
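
In addition to the Auto Scaling activities, you can also poll the status of the AWS CodeDeploy deployment itself. A minimal sketch, assuming you capture the deployment ID when you start the deployment in step 4:

# Start the deployment as in step 4, but capture the deployment ID from the output.
deployment_id=$(aws deploy create-deployment \
    --application-name "SampleELBWebApp" \
    --s3-location "bucket=aws-codedeploy-us-east-1,key=samples/latest/SampleApp_ELB_Integration.zip,bundleType=zip" \
    --deployment-group-name "SampleELBDeploymentGroup" \
    --query "deploymentId" --output text)

# Poll the overall deployment status until it reaches Succeeded (or Failed).
watch -n5 aws deploy get-deployment \
    --deployment-id "$deployment_id" \
    --query "deploymentInfo.status"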

 

The URL output parameter of the AWS CloudFormation stack contains the link to the website so that you are able to watch it change. The following command returns the URL of the load balancer:

 

# Get the URL output parameter of the AWS CloudFormation template.
aws cloudformation describe-stacks \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --output text \
    --query 'Stacks[0].Outputs[?OutputKey==`URL`].OutputValue'
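
If you want to watch the sample page switch from the old version to the new one during the deployment, you could simply poll it in a loop. A small sketch, where URL is the value returned by the command above:

URL=$(aws cloudformation describe-stacks \
    --stack-name "CodeDeploySampleELBIntegrationStack" \
    --output text \
    --query 'Stacks[0].Outputs[?OutputKey==`URL`].OutputValue')
watch -n1 "curl -s $URL"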

 

There are a few other points to consider in order to achieve zero-downtime deployments:

Graceful shut-down of your application
You do not want to kill a process while it still has work in flight. Make sure that running threads have enough time to finish their work before you shut down your application.

Connection draining
The AWS CloudFormation template sets up an Elastic Load Balancing load balancer with connection draining enabled. The load balancer does not send any new requests to the instance when the instance is deregistering, and it waits until any in-flight requests have finished executing. (For more information, see Enable or Disable Connection Draining for Your Load Balancer.)

Sanity test
It is important to check that the instance is healthy and the application is running before the instance is added back to the load balancer after the deployment (a minimal sketch of such a check follows after this list).

Backward-compatible changes (for example, database changes)
Both application versions must work side by side until the deployment finishes, because only a part of the fleet is updated at the same time.

Warming of the caches and service
Warm up caches and the service itself before an instance is put back behind the load balancer, so that no request suffers degraded performance after the deployment.
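
As an illustration of the sanity-test point above, a deployment could run a small health check as the last lifecycle event before the instance re-registers. The sample bundle does not ship such a script, so the following ValidateService-style hook is only a sketch, and the URL and timings are assumptions:

#!/bin/bash
# validate_service.sh - hypothetical health check; fails the deployment if Apache
# does not answer on localhost within about a minute.
for attempt in $(seq 1 12); do
    if curl -sf "http://localhost/" > /dev/null; then
        exit 0
    fi
    sleep 5
done
echo "Application did not become healthy in time" >&2
exit 1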

 

This example should help you get started with improving your deployment process. I hope that this post makes it easier to reach zero-downtime deployments with AWS CodeDeploy and helps you ship your changes continuously in order to provide a great customer experience.

 

Exponential Backoff And Jitter

Post Syndicated from Marc Brooker original https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Introducing OCC

Optimistic concurrency control (OCC) is a time-honored way for multiple writers to safely modify a single object without losing writes. OCC has three nice properties: it will always make progress as long as the underlying store is available, it’s easy to understand, and it’s easy to implement. DynamoDB’s conditional writes make OCC a natural fit for DynamoDB users, and it’s natively supported by the DynamoDBMapper client.

While OCC is guaranteed to make progress, it can still perform quite poorly under high contention. The simplest of these contention cases is when a whole lot of clients start at the same time, and try to update the same database row. With one client guaranteed to succeed every round, the time to complete all the updates grows linearly with contention.

For the graphs in this post, I used a small simulator to model the behavior of OCC on a network with delay (and variance in delay), against a remote database. In this simulation, the network introduces delay with a mean of 10ms and variance of 4ms. The first simulation shows how completion time grows linearly with contention. This linear growth is because one client succeeds every round, so it takes N rounds for all N clients to succeed.

Unfortunately, that’s not the whole picture. With N clients contending, the total amount of work done by the system increases with N².

Adding Backoff

The problem here is that N clients compete in the first round, N-1 in the second round, and so on. Having every client compete in every round is wasteful. Slowing clients down may help, and the classic way to slow clients down is capped exponential backoff. Capped exponential backoff means that clients multiply their backoff by a constant after each attempt, up to some maximum value. In our case, after each unsuccessful attempt, clients sleep for:
sleep = min(cap, base * 2 ** attempt)

Running the simulation again shows that backoff helps a small amount, but doesn’t solve the problem. Client work has only been reduced slightly.

The best way to see the problem is to look at the times these exponentially backed-off calls happen.

It’s obvious that the exponential backoff is working, in that the calls are happening less and less frequently. The problem also stands out: there are still clusters of calls. Instead of reducing the number of clients competing in every round, we’ve just introduced times when no client is competing. Contention hasn’t been reduced much, although the natural variance in network delay has introduced some spreading.

Adding Jitter

The solution isn’t to remove backoff. It’s to add jitter. Initially, jitter may appear to be a counter-intuitive idea: trying to improve the performance of a system by adding randomness. The time series above makes a great case for jitter – we want to spread out the spikes to an approximately constant rate. Adding jitter is a small change to the sleep function:
sleep = random_between(0, min(cap, base * 2 ** attempt))

That time series looks a whole lot better. The gaps are gone, and beyond the initial spike, there’s an approximately constant rate of calls. It’s also had a great effect on the total number of calls.

In the case with 100 contending clients, we’ve reduced our call count by more than half. We’ve also significantly improved the time to completion, when compared to un-jittered exponential backoff.

There are a few ways to implement these timed backoff loops. Let’s call the algorithm above “Full Jitter”, and consider two alternatives. The first alternative is “Equal Jitter”, where we always keep some of the backoff and jitter by a smaller amount:
temp = min(cap, base * 2 ** attempt)
sleep = temp / 2 + random_between(0, temp / 2)

The intuition behind this one is that it prevents very short sleeps, always keeping some of the slow down from the backoff. A second alternative is “Decorrelated Jitter”, which is similar to “Full Jitter”, but we also increase the maximum jitter based on the last random value.
sleep = min(cap, random_between(base, sleep * 3))

Which approach do you think is best?

Looking at the amount of client work, the number of calls is approximately the same for “Full” and “Equal” jitter, and higher for “Decorrelated”. Both cut down work substantially relative to both the no-jitter approaches.

The no-jitter exponential backoff approach is the clear loser. It not only takes more work, but also takes more time than the jittered approaches. In fact, it takes so much more time we have to leave it off the graph to get a good comparison of the other methods.

Of the jittered approaches, “Equal Jitter” is the loser. It does slightly more work than “Full Jitter”, and takes much longer. The decision between “Decorrelated Jitter” and “Full Jitter” is less clear. The “Full Jitter” approach uses less work, but slightly more time. Both approaches, though, present a substantial decrease in client work and server load.

It’s worth noting that none of these approaches fundamentally change the N² nature of the work to be done, but do substantially reduce work at reasonable levels of contention. The return on implementation complexity of using jittered backoff is huge, and it should be considered a standard approach for remote clients.
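
To make the idea concrete in shell terms, here is a small, hypothetical retry helper that applies capped exponential backoff with Full Jitter around an arbitrary idempotent command. The base, cap, and attempt limit are illustrative values, not numbers from the simulations above:

# Retry an idempotent command with capped exponential backoff and full jitter.
retry_full_jitter() {
    local base=1 cap=20 max_attempts=8
    local attempt=0 backoff sleep_for
    until "$@"; do
        attempt=$((attempt + 1))
        [ "$attempt" -ge "$max_attempts" ] && return 1
        backoff=$((base * (1 << attempt)))          # base * 2^attempt
        [ "$backoff" -gt "$cap" ] && backoff=$cap   # cap the backoff
        sleep_for=$((RANDOM % (backoff + 1)))       # full jitter: random_between(0, backoff)
        sleep "$sleep_for"
    done
}

# Example: retry_full_jitter aws dynamodb put-item --table-name MyTable --item file://item.json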

All of the graphs and numbers from this post were generated using a simple simulation of OCC behavior. You can get our simulator code on GitHub, in the aws-arch-backoff-simulator project.

– Marc Brooker

 

node-netflowv9 node.js module for processing of netflowv9 has been updated to 0.2.5

Post Syndicated from Delian Delchev original http://deliantech.blogspot.com/2015/03/node-netflowv9-nodejs-module-for.html

My node-netflowv9 library has been updated to version 0.2.5. There are a few new things:

Almost all of the IETF NetFlow types are decoded now, which practically means that we support IPFIX.

An unknown NetFlow v9 type no longer throws an error. It is decoded into a property named 'unknown_type_XXX', where XXX is the ID of the type.

An unknown NetFlow v9 Option Template scope no longer throws an error. It is decoded into 'unknown_scope_XXX', where XXX is the ID of the scope.

The user can overwrite how the different NetFlow types are decoded and can define decoding for new types. The same goes for scopes. And this can happen "on the fly", at any time.

The library supports multiple NetFlow collectors running at the same time.

A lot of new options and models for using the library have been introduced.

Below is the updated README.md file, describing how to use the library:

Usage

The usage of the netflowv9 collector library is very simple. You just have to do something like this:

var Collector = require('node-netflowv9');
Collector(function(flow) {
    console.log(flow);
}).listen(3000);

or you can use it as an event provider:

Collector({port: 3000}).on('data', function(flow) {
    console.log(flow);
});

The flow will be presented in a format very similar to this:

{ header:
    { version: 9, count: 25, uptime: 2452864139, seconds: 1401951592, sequence: 254138992, sourceId: 2081 },
  rinfo:
    { address: '15.21.21.13', family: 'IPv4', port: 29471, size: 1452 },
  packet: Buffer <00 00 00 00 ….>,
  flow:
    [ { in_pkts: 3, in_bytes: 144,
        ipv4_src_addr: '15.23.23.37', ipv4_dst_addr: '16.16.19.165',
        input_snmp: 27, output_snmp: 16,
        last_switched: 2452753808, first_switched: 2452744429,
        l4_src_port: 61538, l4_dst_port: 62348,
        out_as: 0, in_as: 0,
        bgp_ipv4_next_hop: '16.16.1.1',
        src_mask: 32, dst_mask: 24,
        protocol: 17, tcp_flags: 0, src_tos: 0,
        direction: 1, fw_status: 64, flow_sampler_id: 2 } ] }

There will be one callback for each packet, which may contain more than one flow.

You can also access the NetFlow decode function directly. Do something like this:

var netflowPktDecoder = require('node-netflowv9').nfPktDecode;
….
console.log(netflowPktDecoder(buffer))

Currently we support NetFlow versions 1, 5, 7 and 9.

Options

You can initialize the collector with either a callback function only or a group of options within an object. The following options are available during initialization:

port – defines the port where our collector will listen.

Collector({ port: 5000, cb: function (flow) { console.log(flow) } })

If no port is provided, then the underlying socket will not be initialized (bound to a port) until you call the listen method with a port as a parameter:

Collector(function (flow) { console.log(flow) }).listen(port)

cb – defines a callback function to be executed for every flow. If no callback function is provided, then the collector fires a 'data' event for each received flow.

Collector({ cb: function (flow) { console.log(flow) } }).listen(5000)

ipv4num – defines that we want to receive the IPv4 address as a number, instead of decoded in a readable dotted format.

Collector({ ipv4num: true, cb: function (flow) { console.log(flow) } }).listen(5000)

socketType – defines what socket type we will bind to. The default is udp4. You can change it to udp6 if you like.

Collector({ socketType: 'udp6', cb: function (flow) { console.log(flow) } }).listen(5000)

nfTypes – defines your own decoders for NetFlow v9+ types.

nfScope – defines your own decoders for NetFlow v9+ Option Template scopes.

Define your own decoders for NetFlow v9+ types

NetFlow v9 can be extended with vendor-specific types, and many vendors define their own. There is probably no NetFlow collector in the world that decodes all the vendor-specific types. By default this library decodes in a readable format all the types it recognises. All the unknown types are decoded as 'unknown_type_XXX', where XXX is the type ID, and the data is provided as a HEX string. But you can extend the library yourself. You can even replace how current types are decoded, and you can do that on the fly (you can dynamically change how a type is decoded at different points in time).

To understand how to do that, you have to learn a bit about the internals of how this module works:

When a new flowset template is received from the NetFlow Agent, this module generates and compiles (with new Function()) a decoding function.

When a NetFlow packet is received for a known flowset template (we have a compiled function for it), the function is simply executed.

This approach is quite simple and provides enormous performance. The function code is as small as possible, and on first execution Node.JS compiles it with JIT, so the result is really fast.

The function code is generated from templates that contain the JavaScript code to be added for each NetFlow type, identified by its ID. Each template consists of an object of the following form:

{ name: 'property-name', compileRule: compileRuleObject }

compileRuleObject contains rules for how that NetFlow type is to be decoded, depending on its length. The reason is that some of the NetFlow types are of variable length, and you may have to execute different code to decode them depending on the length. The compileRuleObject format is simple:

{ length: 'javascript code as a string that decodes this value', … }

There is a special length property of 0. This code will be used if there is no more specific decode defined for a length. For example:

{
    4: 'code used to decode this netflow type with length of 4',
    8: 'code used to decode this netflow type with length of 8',
    0: 'code used to decode ANY OTHER length'
}

Decoding code

The decoding code must be a string that contains JavaScript code. This code will be concatenated to the function body before compilation. If that code contains errors or simply does not work as expected, it could crash the collector, so be careful. There are a few variables you have to use:

$pos – this string is replaced with a number containing the current position of the NetFlow type within the binary buffer.

$len – this string is replaced with a number containing the length of the NetFlow type.

$name – this string is replaced with a string containing the name property of the NetFlow type (defined by you above).

buf – the Node.JS Buffer object containing the flow we want to decode.

o – the object where the decoded flow is written to.

Everything else is pure JavaScript. It is good if you know the restrictions of JavaScript and the Node.JS capabilities of the Function() method, but that is not necessary to write simple decoding yourself.

If you want to decode a string of variable length, you could write a compileRuleObject of the form:

{ 0: 'o["$name"] = buf.toString("utf8",$pos,$pos+$len)' }

The example above says that for this NetFlow type, whatever length it has, we will decode the value as a utf8 string.

Example

Let's assume you want to write your own code for decoding a NetFlow type, let's say 4444, which could be of variable length and contains an integer number. You can write code like this:

Collector({
    port: 5000,
    nfTypes: {
        4444: {  // 4444 is the NetFlow Type ID whose decoding we want to replace
            name: 'my_vendor_type4444',  // This will be the property name that contains the decoded value; it is also the value of $name
            compileRule: {
                1: "o['$name']=buf.readUInt8($pos);",  // This is how we decode type of length 1 to a number
                2: "o['$name']=buf.readUInt16BE($pos);",  // This is how we decode type of length 2 to a number
                3: "o['$name']=buf.readUInt8($pos)*65536+buf.readUInt16BE($pos+1);",  // This is how we decode type of length 3 to a number
                4: "o['$name']=buf.readUInt32BE($pos);",  // This is how we decode type of length 4 to a number
                5: "o['$name']=buf.readUInt8($pos)*4294967296+buf.readUInt32BE($pos+1);",  // This is how we decode type of length 5 to a number
                6: "o['$name']=buf.readUInt16BE($pos)*4294967296+buf.readUInt32BE($pos+2);",  // This is how we decode type of length 6 to a number
                8: "o['$name']=buf.readUInt32BE($pos)*4294967296+buf.readUInt32BE($pos+4);",  // This is how we decode type of length 8 to a number
                0: "o['$name']='Unsupported Length of $len'"
            }
        }
    },
    cb: function (flow) { console.log(flow) }
});

It looks a bit complex, but actually it is not. In most cases, you don't have to define a compile rule for each different length. The following example defines a decoding for a NetFlow type 6789 that carries a string:

var colObj = Collector(function (flow) { console.log(flow) });
colObj.listen(5000);
colObj.nfTypes[6789] = {
    name: 'vendor_string',
    compileRule: {
        0: 'o["$name"] = buf.toString("utf8",$pos,$pos+$len)'
    }
}

As you can see, we can also change the decoding on the fly, by defining a property for that NetFlow type within the nfTypes property of the colObj (the Collector object). The next time the NetFlow Agent sends us a NetFlow Template definition containing this NetFlow type, the new rule will be used (the routers usually send templates from time to time, so even currently compiled templates get recompiled).

You could also overwrite the default property names where the decoded data is written. For example:

var colObj = Collector(function (flow) { console.log(flow) });
colObj.listen(5000);
colObj.nfTypes[14].name = 'outputInterface';
colObj.nfTypes[10].name = 'inputInterface';

Logging / Debugging the module

You can use the debug module to turn on logging in order to debug how the library behaves. The following example shows you how:

require('debug').enable('NetFlowV9');
var Collector = require('node-netflowv9');
Collector(function(flow) {
    console.log(flow);
}).listen(5555);

Multiple collectors

The module allows you to define multiple collectors at the same time. For example:

var Collector = require('node-netflowv9');
Collector(function(flow) {  // Collector 1 listening on port 5555
    console.log(flow);
}).listen(5555);
Collector(function(flow) {  // Collector 2 listening on port 6666
    console.log(flow);
}).listen(6666);

NetFlowV9 Options Template

NetFlowV9 supports an Options Template, where an option Flow Set can contain data for predefined fields within a certain scope. This module supports the Options Template and provides its output like any other flow. The only difference is that there is a property isOption set to true to remind your code that this data has come from an Option Template.

Currently the following nfScope values are supported – system, interface, line_card, netflow_cache. You can overwrite the decoding of them, or add others, the same way (and using absolutely the same format) as you overwrite nfTypes.

node-netflowv9 node.js module for processing of netflowv9 has been updated to 0.2.5

Post Syndicated from Delian Delchev original http://deliantech.blogspot.com/2015/03/node-netflowv9-nodejs-module-for.html

My node-netflowv9 library has been updated to version 0.2.5There are few new things -Almost all of the IETF netflow types are decoded now. Which means practically that we support IPFIXUnknown NetFlow v9 type does not throw an error. It is decoded into property with name ‘unknown_type_XXX’ where XXX is the ID of the typeUnknown NetFlow v9 Option Template scope does not throw an error. It is decoded in ‘unknown_scope_XXX’ where XXX is the ID of the scopeThe user can overwrite how different types of NetFlow are decoded and the user can define its own decoding for new types. The same for scopes. And this can happen “on fly” – at any time.The library supports well multiple netflow collectors running at the same timeA lot of new options and models for using of the library has been introducedBellow is the updated README.md file, describing how to use the library:UsageThe usage of the netflowv9 collector library is very very simple. You just have to do something like this:var Collector = require(‘node-netflowv9’);Collector(function(flow) { console.log(flow);}).listen(3000);or you can use it as event provider:Collector({port: 3000}).on(‘data’,function(flow) { console.log(flow);});The flow will be presented in a format very similar to this:{ header: { version: 9, count: 25, uptime: 2452864139, seconds: 1401951592, sequence: 254138992, sourceId: 2081 }, rinfo: { address: ‘15.21.21.13’, family: ‘IPv4’, port: 29471, size: 1452 }, packet: Buffer <00 00 00 00 ….> flow: [ { in_pkts: 3, in_bytes: 144, ipv4_src_addr: ‘15.23.23.37’, ipv4_dst_addr: ‘16.16.19.165’, input_snmp: 27, output_snmp: 16, last_switched: 2452753808, first_switched: 2452744429, l4_src_port: 61538, l4_dst_port: 62348, out_as: 0, in_as: 0, bgp_ipv4_next_hop: ‘16.16.1.1’, src_mask: 32, dst_mask: 24, protocol: 17, tcp_flags: 0, src_tos: 0, direction: 1, fw_status: 64, flow_sampler_id: 2 } } ]There will be one callback for each packet, which may contain more than one flow.You can also access a NetFlow decode function directly. Do something like this:var netflowPktDecoder = require(‘node-netflowv9’).nfPktDecode;….console.log(netflowPktDecoder(buffer))Currently we support netflow version 1, 5, 7 and 9.OptionsYou can initialize the collector with either callback function only or a group of options within an object.The following options are available during initialization:port – defines the port where our collector will listen to.Collector({ port: 5000, cb: function (flow) { console.log(flow) } })If no port is provided, then the underlying socket will not be initialized (bind to a port) until you call listen method with a port as a parameter:Collector(function (flow) { console.log(flow) }).listen(port)cb – defines a callback function to be executed for every flow. If no call back function is provided, then the collector fires ‘data’ event for each received flowCollector({ cb: function (flow) { console.log(flow) } }).listen(5000)ipv4num – defines that we want to receive the IPv4 ip address as a number, instead of decoded in a readable dot formatCollector({ ipv4num: true, cb: function (flow) { console.log(flow) } }).listen(5000)socketType – defines to what socket type we will bind to. Default is udp4. 
You can change it to udp6 is you like.Collector({ socketType: ‘udp6’, cb: function (flow) { console.log(flow) } }).listen(5000)nfTypes – defines your own decoders to NetFlow v9+ typesnfScope – defines your own decoders to NetFlow v9+ Option Template scopesDefine your own decoders for NetFlow v9+ typesNetFlow v9 could be extended with vendor specific types and many vendors define their own. There could be no netflow collector in the world that decodes all the specific vendor types. By default this library decodes in readable format all the types it recognises. All the unknown types are decoded as ‘unknown_type_XXX’ where XXX is the type ID. The data is provided as a HEX string. But you can extend the library yourself. You can even replace how current types are decoded. You can even do that on fly (you can dynamically change how the type is decoded in different periods of time).To understand how to do that, you have to learn a bit about the internals of how this module works.When a new flowset template is received from the NetFlow Agent, this netflow module generates and compile (with new Function()) a decoding functionWhen a netflow is received for a known flowset template (we have a compiled function for it) – the function is simply executedThis approach is quite simple and provides enormous performance. The function code is as small as possible and as well on first execution Node.JS compiles it with JIT and the result is really fast.The function code is generated with templates that contains the javascript code to be add for each netflow type, identified by its ID.Each template consist of an object of the following form:{ name: ‘property-name’, compileRule: compileRuleObject }compileRuleObject contains rules how that netflow type to be decoded, depending on its length. The reason for that is, that some of the netflow types are variable length. And you may have to execute different code to decode them depending on the length. The compileRuleObject format is simple:{ length: ‘javascript code as a string that decode this value’, …}There is a special length property of 0. This code will be used, if there is no more specific decode defined for a length. For example:{ 4: ‘code used to decode this netflow type with length of 4’, 8: ‘code used to decode this netflow type with length of 8’, 0: ‘code used to decode ANY OTHER length’}decoding codeThe decoding code must be a string that contains javascript code. This code will be concatenated to the function body before compilation. If that code contain errors or simply does not work as expected it could crash the collector. So be careful.There are few variables you have to use:$pos – this string is replaced with a number containing the current position of the netflow type within the binary buffer.$len – this string is replaced with a number containing the length of the netflow type$name – this string is replaced with a string containing the name property of the netflow type (defined by you above)buf – is Node.JS Buffer object containing the Flow we want to decodeo – this is the object where the decoded flow is written to.Everything else is pure javascript. 
It is good if you know the restrictions of the javascript and Node.JS capabilities of the Function() method, but not necessary to allow you to write simple decoding by yourself.If you want to decode a string, of variable length, you could write a compileRuleObject of the form:{ 0: ‘o[“$name”] = buf.toString(“utf8”,$pos,$pos+$len)’}The example above will say that for this netfow type, whatever length it has, we will decode the value as utf8 string.ExampleLets assume you want to write you own code for decoding a NetFlow type, lets say 4444, which could be of variable length, and contains a integer number.You can write a code like this:Collector({ port: 5000, nfTypes: { 4444: { // 4444 is the NetFlow Type ID which decoding we want to replace name: ‘my_vendor_type4444’, // This will be the property name, that will contain the decoded value, it will be also the value of the $name compileRule: { 1: “o[‘$name’]=buf.readUInt8($pos);”, // This is how we decode type of length 1 to a number 2: “o[‘$name’]=buf.readUInt16BE($pos);”, // This is how we decode type of length 2 to a number 3: “o[‘$name’]=buf.readUInt8($pos)*65536+buf.readUInt16BE($pos+1);”, // This is how we decode type of length 3 to a number 4: “o[‘$name’]=buf.readUInt32BE($pos);”, // This is how we decode type of length 4 to a number 5: “o[‘$name’]=buf.readUInt8($pos)*4294967296+buf.readUInt32BE($pos+1);”, // This is how we decode type of length 5 to a number 6: “o[‘$name’]=buf.readUInt16BE($pos)*4294967296+buf.readUInt32BE($pos+2);”, // This is how we decode type of length 6 to a number 8: “o[‘$name’]=buf.readUInt32BE($pos)*4294967296+buf.readUInt32BE($pos+4);”, // This is how we decode type of length 8 to a number 0: “o[‘$name’]=’Unsupported Length of $len'” } } }, cb: function (flow) { console.log(flow) }});It looks to be a bit complex, but actually it is not. In most of the cases, you don’t have to define a compile rule for each different length. The following example defines a decoding for a netflow type 6789 that carry a string:var colObj = Collector(function (flow) { console.log(flow)});colObj.listen(5000);colObj.nfTypes[6789] = { name: ‘vendor_string’, compileRule: { 0: ‘o[“$name”] = buf.toString(“utf8”,$pos,$pos+$len)’ }}As you can see, we can also change the decoding on fly, by defining a property for that netflow type within the nfTypes property of the colObj (the Collector object). Next time when the NetFlow Agent send us a NetFlow Template definition containing this netflow type, the new rule will be used (the routers usually send temlpates from time to time, so even currently compiled templates are recompiled).You could also overwrite the default property names where the decoded data is written. For example:var colObj = Collector(function (flow) { console.log(flow)});colObj.listen(5000);colObj.nfTypes[14].name = ‘outputInterface’;colObj.nfTypes[10].name = ‘inputInterface’;Logging / Debugging the moduleYou can use the debug module to turn on the logging, in order to debug how the library behave. The following example show you how:require(‘debug’).enable(‘NetFlowV9’);var Collector = require(‘node-netflowv9’);Collector(function(flow) { console.log(flow);}).listen(5555);Multiple collectorsThe module allows you to define multiple collectors at the same time. 
For example:

var Collector = require('node-netflowv9');

Collector(function(flow) {  // Collector 1 listening on port 5555
  console.log(flow);
}).listen(5555);

Collector(function(flow) {  // Collector 2 listening on port 6666
  console.log(flow);
}).listen(6666);

NetFlowV9 Options Template

NetFlowV9 supports an Options Template, where an Options Flow Set contains data for predefined fields within a certain scope. This module supports the Options Template and provides its output like any other flow. The only difference is that the property isOption is set to true, to remind your code that this data came from an Options Template.

Currently the following nfScope values are supported – system, interface, line_card, netflow_cache. You can overwrite their decoding, or add new ones, the same way (and using absolutely the same format) as you overwrite nfTypes.
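
The documentation above does not show an nfScope override, so purely as an illustration here is a minimal sketch in the same style as the nfTypes override. The numeric key (1, which the NetFlow v9 specification assigns to the "System" scope) and the property name 'system_scope' are assumptions made for this example:

var Collector = require('node-netflowv9');

var colObj = Collector(function (flow) {
  console.log(flow);        // flows decoded from Option Templates arrive here with isOption: true
});
colObj.listen(5000);

// Override how the "system" scope is decoded (assumed here to be keyed by the numeric scope type):
colObj.nfScope[1] = {
  name: 'system_scope',     // hypothetical property name for the decoded value
  compileRule: {
    0: 'o["$name"] = buf.toString("hex", $pos, $pos + $len)'  // any length: keep the raw bytes as hex
  }
};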

Vote Karen Sandler for Red Hat’s Women In Open Source Award

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2015/02/26/award-karen.html

I know this decision is tough, as all the candidates in the list deserve
an award. However, I hope that you’ll choose to vote for my friend and
colleague, Karen Sandler, for
the 2015 Red
Hat Women in Open Source Community Award
. Admittedly, most of Karen’s
work has been for software freedom, not Open Source (i.e., her work has
been community and charity-oriented, not for-profit oriented). However,
giving her an “Open Source” award is a great way to spread the
message of software freedom to the for-profit corporate Open Source
world.

I realize that there are some amazingly good candidates, and I admit I’d
be posting a blog post to endorse someone else (no, I won’t say who 🙂) if
Karen wasn’t on the ballot for the Community Award. So, I wouldn’t say you
backed the wrong candidate if you vote for someone else. And, I’m
eminently biased, since Karen and I have worked together
on Conservancy since its
inception. But, if you can see your way through to it, I hope you’ll give
Karen your vote.

(BTW, I’m not endorsing a candidate in the Academic Award race. I am just
not familiar enough with the work of the candidates involved to make an
endorsement. I even abstained from voting in that race myself because I
didn’t want to make an uninformed vote.)

A little register here, please

Post Syndicated from Боян Юруков original http://feedproxy.google.com/~r/yurukov-blog/~3/CV0iOkTXnAU/


Last week, around the Общество hackathon, I decided to share my energy production data once again. It contains production and exports by minute and by hour for the past several years. I even pulled the latest guarantees of origin for electricity from renewable sources.
I was sent some interesting charts built on this data, and I hope you will see them very soon. While going through them, however, I noticed that some of the historical data had not been downloaded. The reason is that the website of ЕСО (the Electricity System Operator) returns errors. That made me dig further, and I found a report covering the first 45 days of 2014 and 2015 (it has since been updated up to the 22nd). I decided to compare the data in it against the electricity produced according to the figures I had been loading in real time. My assumption was that they should be the same: the average production multiplied by the number of hours should give the total production they reported, or at least something close to it.
Yes, but no.

It turns out there are discrepancies of between 2 and 18%. Even if I don't fully understand the data, the deviation across the different indicators should be roughly the same, above all between the two years. What I found instead is completely chaotic. Consumption for the first days of 2015 is 5.9 million MWh, while the real-time data from their site adds up to 5.59 million. Exports for the same period of 2014 are 1,079.89 thousand MWh according to their report, while a lookup in the register on their site shows 1,040 thousand. I have summarised all the discrepancies in this table.
It is the same story as with the Ministry of Health's birth register. On the one hand, there is no understanding of what exactly the data shows, and no documentation of what is included and how it is measured. On the other hand, it is not at all clear how accurate the figures are. For photovoltaic production, for example, it is well known that the real-time data is just an estimate based on a formula, not actual measurements. There are similar doubts about the wind farms. Even so, the figures should be as close as possible to the real values. Discrepancies this large make not only the analysis of the data but its very publication pointless.
Consumption, imports and exports between 2006 and 2012 – link
Another explanation, of course, would be that these figures point to abuses within ЕСО and in the purchasing of energy. That is far from improbable, given the abuses in the sector. In fact, the whole point of making the data public is precisely so that we can analyse production, consumption, deficits and the real state of affairs. What we get instead, however, is a single picture, a few apparently random numbers and a loud declaration of how transparent they are.
I have mentioned several times that it is important to use the available data and to show the value it brings. Otherwise the institutions will have an excuse to claim that they made an effort towards transparency but there was apparently no interest. The quality of the data, however, is of decisive importance here. Registers and documents get published with no clarity about what the numbers in them mean, what their quality is and who is responsible for it. Everything so far is an imitation of activity.


systemd: Using ExecStop to depool nodes for fun and profit

Post Syndicated from Laurie Denness original https://laur.ie/blog/2015/02/systemd-using-execstop-to-depool-nodes-for-fun-and-profit/

Preface: There is a tonne of drama about systemd on the internets; it won’t take you long to find it, if you’re curious. Despite that, I’m largely a fan, and I’m focusing on all the cool stuff I can finally do as an ops person without basically re-writing crappy bash scripts for a living (cough sys-v init).

Process Supervision

Without going into the basics about systemd too much (I quite enjoy this post as an intro), you tell systemd to run your executable using the “ExecStart” part of the config, and it will go and run that command and make sure it keeps running. Wonderful! In this case, we wanted to keep HHVM running all the time, so we told systemd to do it, in 3 lines. Waaaay easier than sys-v init.

ExecStop

By default when you tell systemd to stop a process, and you haven’t told it how to stop the process, it’s just going to gracefully kill the process and any other processes it spawned.

However, there is also the ExecStop configuration option that will be executed before systemd kills your processes, adding a new “deactivating” step to the process. It takes any executable name (or many) as an argument, so you can abuse this to do literally anything as cleanup before your processes get killed.

Systemd will also continue to do its regular killing of processes if, by the end of running your ExecStop script, the processes are not all dead.

Load balancer health checks

We have a load balancer that uses a bunch of health checks to ensure that the node that it’s asking to do work can actually still do work before it sends it there.

One of these is hitting an HTTP endpoint we set up, let’s call it “status.php”, which just contains the text “Status:OK”. This way, if the server dies, or PHP breaks, or Apache breaks, that node will be automatically depooled and we don’t serve garbage to the user. Yay!

Example: automatic depooling using ExecStop

Armed with my new ExecStop super power, I realised we were able to let the load balancer know this node was no longer available before killing the process.

I wrote a simple bash script (sketched after the list below) that:

  • Moves the status.php file to status.php.disabled
  • Starts pinging the built-in HHVM “load” endpoint (which tells you how many requests are in flight in HHVM) to see if the load has hit 0
  • If the curl to the “load” endpoint fails, we try again after 1 second
  • If we hit 30 seconds and the load isn't 0, or we still can't reach the endpoint, we just carry on anyway; something is wrong.
  • Once the load is “0”, we can continue
  • Use `pidof` to kill the HHVM process
  • Move status.php.disabled back to status.php
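
Here is a minimal sketch of what such a script might look like; the document root, the admin "load" endpoint URL and the exact timings are assumptions, and the author's actual script is linked as a gist at the end of this post:

#!/bin/bash
# hhvm_stop.sh - a minimal sketch of the depooling steps described above.
# DOCROOT and LOAD_URL are placeholders (assumptions), not values from the original script.

DOCROOT="/var/www/html"                      # where status.php lives (assumption)
LOAD_URL="http://localhost:9001/check-load"  # HHVM admin "load" endpoint (assumption)

# 1. Depool: the load balancer health check now fails for this node.
echo "Moving status.php to status.php.disabled"
mv "$DOCROOT/status.php" "$DOCROOT/status.php.disabled"

# 2. Wait up to 30 seconds for in-flight requests to drain.
for i in $(seq 1 30); do
    load=$(curl -sf "$LOAD_URL")
    if [ "$load" = "0" ]; then
        echo "Load was 0 after $i seconds, now we can kill HHVM."
        break
    fi
    echo "Waiting another second (currently up to $i) because the load is still ${load:-unreachable}"
    sleep 1
done

# 3. Kill HHVM (systemd will clean up anything we miss).
echo "Killing HHVM"
kill "$(pidof hhvm)" 2>/dev/null

# 4. Repool: put the health-check file back for the next start.
echo "Flipping status.php.disabled to status.php"
mv "$DOCROOT/status.php.disabled" "$DOCROOT/status.php"
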

And now, I can reference this in our HHVM systemd unit file:

[Unit]
Description=HHVM HipHop Virtual Machine (FCGI)

[Service]
Restart=always
ExecStart=/usr/bin/hhvm -c <snip>
ExecStop=/usr/local/bin/hhvm_stop.sh

Now when I call service hhvm stop, it takes 6-10 seconds for the stop to complete, because the traffic is gracefully removed.

Logging

Another thing I personally love about systemd is the increased visibility the operator gets into what’s going on. In sys-v, if you’re lucky, someone put a “status” action in their bash script and it might tell you if the pid exists.

In systemd, you get a tonne of information about what’s going on; the processes that have been launched (including child processes), the PIDs, logs associated with that process, and in the case of something like Apache, the process can report information back:

[Image: Apache systemd status output showing requests per second]

In this case, our ExecStop script output gets shown when you look at the status output of systemd:

[root@hhvm01 ~]# systemctl status hhvm -l
hhvm.service - HHVM HipHop Virtual Machine (FCGI)
   Loaded: loaded (/usr/lib/systemd/system/hhvm.service; enabled)
   Active: inactive (dead) since Tue 2015-02-17 22:00:52 UTC; 48s ago
  Process: 23889 ExecStop=/usr/local/bin/hhvm_stop.sh (code=exited, status=0/SUCCESS)
  Process: 37601 ExecStart=/usr/bin/hhvm <snip> (code=killed, signal=TERM)
  Main PID: 37601 (code=killed, signal=TERM)

Feb 17 22:00:45 hhvm01 hhvm_stop.sh[23889]: Moving status.php to status.php.disabled
Feb 17 22:00:47 hhvm01 hhvm_stop.sh[23889]: Waiting another second (currently up to 8) because the load is still 16
Feb 17 22:00:48 hhvm01 hhvm_stop.sh[23889]: Waiting another second (currently up to 9) because the load is still 10
Feb 17 22:00:49 hhvm01 hhvm_stop.sh[23889]: Waiting another second (currently up to 10) because the load is still 10
Feb 17 22:00:50 hhvm01 hhvm_stop.sh[23889]: Load was 0 after 11 seconds, now we can kill HHVM.
Feb 17 22:00:50 hhvm01 hhvm_stop.sh[23889]: Killing HHVM
Feb 17 22:00:52 hhvm01 hhvm_stop.sh[23889]: Flipping status.php.disabled to status.php
Feb 17 22:00:52 hhvm01 systemd[1]: Stopped HHVM HipHop Virtual Machine (FCGI).

Now all the information about what happened during the ExecStop process is captured for debugging later! No more having no idea what happened during the shutdown.

When the script is in the process of running, the systemd status output will show as “deactivating” so you know it’s still ongoing.

 

Summary

This is just one example of how you might use/abuse ExecStop to do work before killing processes. Whilst this was technically possible before, IMO the ease of use and the added introspection mean this is actually feasible for production systems.

I’ve gisted a copy of the script here, if you want to steal it and modify it for your own use.

Scheduling Automatic Deletion of Application Environments

Post Syndicated from dchetan original http://blogs.aws.amazon.com/application-management/post/Tx2MMLQMSTZYNMJ/Scheduling-Automatic-Deletion-of-Application-Environments

Have you ever set up a temporary application environment and wished you could schedule automatic deletion of the environment rather than remembering to clean it up after you are done? If the answer is yes, then this blog post is for you.

Here is an example of setting up an AWS CloudFormation stack with a configurable TTL (time-to-live). When the TTL is up, deletion of the stack is triggered automatically. You can use this idea regardless of whether you have a single Amazon EC2 instance in the stack or a complex application environment. You can even use this idea in combination with other deployment and management services such as AWS Elastic Beanstalk or AWS OpsWorks, as long as your environment is modeled inside an AWS CloudFormation stack.

In this example, first I set up a sample application on an EC2 instance and then configure a ‘TTL’:

Configuring the TTL is simple. Just schedule execution of a one-line shell script, deletestack.sh, using the ‘at’ command. The shell script uses the AWS Command Line Interface to call aws cloudformation delete-stack:
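
Here is a rough sketch of the idea; the stack name, the script path and the two-hour TTL are placeholders, not values from the original template:

# deletestack.sh - the one-line script: delete the stack this instance belongs to
aws cloudformation delete-stack --stack-name "MySampleStack" --region us-east-1

# Schedule it with 'at' so the deletion fires when the TTL expires (two hours here):
at -f /home/ec2-user/deletestack.sh now + 2 hours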

Notice that the EC2 instance requires permissions to delete all of the stack resources. The permissions are granted to the EC2 instance via an IAM role. Also, notice that for the stack deletion to succeed, the IAM role needs to be the last in the order of deletion. You can ensure that the role is the last in the order of deletion by making other resources dependent on the role. Finally, as a best practice, you should grant the least possible privilege to the role. You can do this by using a finer-grained policy document for the IAM role:
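
As a hedged sketch of those two points (deletion ordering via DependsOn plus a scoped policy), the fragment below uses resource names and an action list that are assumptions; in a real template the policy would need to cover every resource in your stack, and the instance properties are omitted here:

"DeletionRole": {
  "Type": "AWS::IAM::Role",
  "Properties": {
    "AssumeRolePolicyDocument": {
      "Statement": [{
        "Effect": "Allow",
        "Principal": { "Service": [ "ec2.amazonaws.com" ] },
        "Action": [ "sts:AssumeRole" ]
      }]
    },
    "Policies": [{
      "PolicyName": "allow-stack-deletion",
      "PolicyDocument": {
        "Statement": [{
          "Effect": "Allow",
          "Action": [ "cloudformation:DeleteStack", "ec2:TerminateInstances" ],
          "Resource": "*"
        }]
      }
    }]
  }
},

"WebServer": {
  "Type": "AWS::EC2::Instance",
  "DependsOn": "DeletionRole",
  "Properties": { }
}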

 

You can try the full sample template here: 

 

— Chetan Dandekar, Senior Product Manager, Amazon Web Services.

Use a CreationPolicy to Wait for On-Instance Configurations

Post Syndicated from Elliot Yamaguchi original http://blogs.aws.amazon.com/application-management/post/Tx3CISOFP98TS56/Use-a-CreationPolicy-to-Wait-for-On-Instance-Configurations

When you provision an Amazon EC2 instance in an AWS CloudFormation stack, you might specify additional actions to configure the instance, such as installing software packages or bootstrapping applications. Normally, CloudFormation proceeds with stack creation after the instance has been successfully created. However, you can use a CreationPolicy so that CloudFormation proceeds with stack creation only after your configuration actions are done. That way you’ll know your applications are ready to go after stack creation succeeds.

A CreationPolicy instructs CloudFormation to wait on an instance until CloudFormation receives the specified number of signals. This policy takes effect only when CloudFormation creates the instance. Here’s what a creation policy looks like:

"AutoScalingGroup": {
"Type": "AWS::AutoScaling::AutoScalingGroup",
"Properties": {

},
"CreationPolicy": {
"ResourceSignal": {
"Count": "3",
"Timeout": "PT5M"
}
}
}

A CreationPolicy must be associated with a resource, such as an EC2 instance or an Auto Scaling group. This association is how CloudFormation knows what resource to wait on. In the example policy, the CreationPolicy is associated with an Auto Scaling group. CloudFormation waits on the Auto Scaling group until CloudFormation receives three signals within five minutes. Because the Auto Scaling group’s desired capacity is set to three, the signal count is set to three (one for each instance).

If three signals are not received after five minutes, CloudFormation immediately stops the stack creation and labels the Auto Scaling group as failed to create, so make sure you specify a timeout period that gives your instances and applications enough time to be deployed.

Signaling a Resource

You can easily send signals from the instances that you’re provisioning. On those instances, you should be using the cfn-init helper script in the EC2 user data script to deploy applications. After the cfn-init script, just add a command to run the cfn-signal helper script, as in the following example:

"UserData": {
"Fn::Base64": {
"Fn::Join" [ "", [
"/opt/aws/bin/cfn-init ",

"/opt/aws/bin/cfn-signal -e $? ",
" –stack ", { "Ref": "AWS::StackName" },
" –resource AutoScalingGroup " ,
" –region ", { "Ref" : "AWS::Region" }, "n"
] ]
}
}

When you signal CloudFormation, you need to let it know what stack and what resource you’re signaling. In the example, the cfn-signal command specifies the stack that is provisioning the instance, the logical ID of the resource (AutoScalingGroup), and the region in which the stack is being created.
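
If you want to test the signaling by hand on a running instance, you can invoke the same helper directly; the stack name and region below are placeholders:

/opt/aws/bin/cfn-signal -e 0 --stack MySampleStack --resource AutoScalingGroup --region us-east-1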

With the CreationPolicy attribute and the cfn-signal helper script, you can ensure that your stacks are created successfully only when your applications are successfully deployed. For more information, you can view a complete sample template in the AWS CloudFormation User Guide.

Trade Associations Are Never Neutral

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2015/02/10/node-foundation.html

It’s amazing what we let for-profit companies and their trade associations get away with.
Today, Joyent
announced the Node.js Foundation
, in conjunction with various
for-profit corporate partners and Linux Foundation (which is a 501(c)(6)
trade association under the full control of for-profit companies).

Joyent and their corporate partners claim that the Node.js Foundation will
be neutral and provide open governance. Yet, they don’t
even say what corporate form the new organization will take, nor present
its by-laws. There’s no way that anyone can know if the organization will
be neutral and provide open governance without at least that information.

Meanwhile, I’ve spent years pointing out that what corporate form you
choose matters. In the USA, if you pick a 501(c)(6) trade association (like
Linux Foundation), the result is not a neutral non-profit
home. Rather, a trade association simply promotes the interest of the
for-profit businesses that control it. Such organizations don’t have
the community’s interests at heart, but rather the interests of the
for-profit corporate masters who control the Board of Directors. Sadly,
most people tend to think that if you put the word “Foundation”
in the name[0], you magically get a neutral home
and open governance.

Fortunately for these trade associations, they hide behind the
far-too-general term non-profit, and act as if all non-profits are equal. Why
do trade association representatives and companies ignore the differences
between charities and trade associations? Because they don’t want you to
know the real story.

Ultimately, charities serve the public good. They can do nothing else,
lest they run afoul of IRS rules. Trade associations serve the business
interests of the companies that join them. They can do nothing else, lest
they run afoul of IRS rules. I would certainly argue the Linux
Foundation has done an excellent job serving the interests of the
businesses that control it. They can be commended for meeting their
mission, but that mission is not one to serve the individual users and
developers of Linux and other Free Software. What will the mission of the
Node.js Foundation be? We really don’t know, but given who’s starting it,
I’m sure it will be to promote the businesses around Node.js, not its
users and developers.

[0] Richard Fontana recently
pointed out to me that it is extremely rare for trade associations
to call themselves foundations outside of the Open Source and Free
Software community. He found very few examples of it in the wider
world. He speculated that this may be an attempt to capitalize on
the credibility of the Free Software Foundation, which is older
than all other non-profits in this community by at least two
decades. Of course, FSF is a 501(c)(3) charity, and since there
is no IRS rule about calling a 501(c)(6) trade association by the
name “Foundation”, this is a further opportunity to
spread confusion about who these organizations serve: business
interests or the general public.

Bi-level TIFFs and the tale of the unexpectedly early patch

Post Syndicated from Michal Zalewski original http://lcamtuf.blogspot.com/2015/02/bi-level-tiffs-and-tale-of-unexpectedly.html

Today’s release of MS15-016 (CVE-2015-0061) fixes another of the series of browser memory disclosure bugs found with afl-fuzz – this time, related to the handling of bi-level (1-bpp) TIFFs in Internet Explorer (yup, MSIE displays TIFFs!). You can check out a simple proof-of-concept here, or simply enjoy this screenshot of eight subsequent renderings of the same TIFF file:

The vulnerability is conceptually similar to other previously-identified problems with GIF and JPEG handling in popular browsers (example 1, example 2), with the SOS handling bug in libjpeg, or the DHT bug in libjpeg-turbo (details here) – so I will try not to repeat the same points in this post.

Instead, I wanted to take note of what really sets this bug apart: Microsoft has addressed it in precisely 60 days, counting from my initial e-mail to the availability of a patch! This struck me as a big deal: although vulnerability research is not my full-time job, I do have a decent sample size – and I don’t think I have seen this happen for any of the few dozen MSIE bugs that I reported to MSRC over the past few years. The average patch time always seemed to be closer to 6+ months – coupled with the somewhat odd practice of withholding attribution in security bulletins and engaging in seemingly punitive PR outreach if the reporter ever went public before that.

I am very excited and hopeful that rapid patching is the new norm – and huge thanks to MSRC folks if so 🙂

Grafana 2.0 Alpha & Preview

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2015/02/10/grafana-2.0-alpha-preview/

As some of you already know, at least those who read the previous blog post on the future of Grafana,
we have been working on a backend for the Grafana 2.0 release. A backend is really essential if Grafana is to be able
to expand its capabilities to support alerting, PNG rendering of panels, and of course user management and authentication.
A backend will also make it easier to install and get started with Grafana as you will no longer need to learn how to
configure nginx or deal with complex CORS issues for Graphite or OpenTSDB, something that
many new users struggle with when setting up Grafana for the first time.

Grafana 2.0 has been long in the making but has seen accelerated progress in the last two months
thanks to my colleagues at Raintank, where we are working on building a next-gen open source
infrastructure monitoring platform.

Code & Docker image

The Grafana 2.0 backend and frontend changes are now public in the develop
branch in the main grafana repository on GitHub. The readme on that branch contains instructions for how to build and run the
new backend. You can also try it using docker with a single line like this:

docker run -i -p 3000:3000 grafana/grafana:develop

There are no official binary packages yet as there is still some work that needs
to be done before a beta release can be made. Grafana 2.0 is not ready for
production use yet. If you want to help, please check out the develop branch
or try the docker image and submit feedback and bug reports on github!

There is no updated documentation for Grafana 2.0 yet; that is one of the items we will work on
in the run up to a beta release.

You can also try out a preview build on play.grafana.org/v2; log in with
guest/guest or sign up. If you sign up, you will be automatically added as a Viewer to the main account.
This preview is automatically updated after each successful commit & build of the develop branch.

The backend

The backend is written in Go and defaults to an embedded sqlite3 database so there is no dependency on
Elasticsearch for dashboard storage. If you are already using InfluxDB or Elasticsearch for dashboard
storage, there is an import feature under the account page where you can import your existing dashboards.

Panel rendering

The backend supports rendering any panel to a PNG image; this will enable some slick integration
in the future with services like Slack and Hipchat. When we eventually tackle alerting this feature
should also come in handy as we could provide images of related graphs along with alert notifications.

Currently the PNG rendering feature is just a simple link in the share panel dialog.

User & Account model

To support SaaS and larger Grafana installations we have added a User and Account model similar to
how Google Analytics has implemented this. Dashboards, data sources, etc. are not tied to a specific user
but to an Account; users are linked to accounts with a specific Role (Admin, Editor, Viewer).

To simplify smaller Grafana setups you can configure Grafana to be in a single-account mode where new user
signups are automatically assigned to a specific account. You can also enable anonymous access if you
do not care about authentication or user management at all.
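
As a purely hypothetical sketch of what that might look like in the Grafana configuration file (the section and key names below are assumptions, not the documented format, which is still being written):

; grafana.ini (hypothetical sketch; section and key names are assumptions)

[users]
; automatically add new sign-ups to one shared account
auto_assign_account = true

[auth.anonymous]
; allow viewing dashboards without logging in
enabled = true
account_role = Viewer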

Currently users can sign up and log in with a specified email/password or via GitHub or Google OAuth integration.
LDAP integration is on the roadmap but not implemented yet.

UI Changes

It was a big challenge to modify the existing Grafana UI to accommodate more views like Profile, Account,
Datasources, Admin, etc., without sacrificing the clean, simple top navigation. We tried many alternatives
of sidebars and collapsible top navbars.

The above animated gif shows the sidebar we ended up with, as well as the new dashboard search dropdown.
We will continue to refine and polish this solution. For more on the creative process and details
on all the UI mockups and navigation alternatives we explored, read this
blog post by Matt Toback. Matt is lead
on customer experience at Raintank.

Dashboard list panel

There is currently one new panel in Grafana 2.0 that can show the current user’s starred dashboards as well
as a list of dashboards based on a search query or dashboard tags.

There is a github issue for this new panel where feedback
would be greatly appreciated.

Standalone mode

The frontend has been slightly changed to support a standalone mode where it can work mostly like Grafana has
in the past, that is, without any backend. This mode is currently broken in the develop branch but will
be made to work again before release. We believe that many Grafana users value the pure frontend (no backend)
nature of Grafana, and we would hate to abandon or disappoint this user group, so we will try to support
a standalone mode and build as long as the interest is high and the complexity of supporting both is acceptable.
It is likely, though, that this standalone mode will eventually be removed; that is all up to user feedback.

Beta

So stay tuned for a beta sometime in early or mid-March (if this goes as we hope).
Subscribe to the newsletter to get notified.

Thanks to project sponsors

UX: The long road to the shortest path

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2015/02/10/ux-the-long-road-to-the-shortest-path/

Torkel and the rest of the team have been hard at work on the next generation of Grafana, which is the foundation for what we’re building here at raintank. We’re really excited about the many new features and capabilities that Grafana 2.0 brings to the table, but they all have big implications from a UI/UX perspective.

Our first task at raintank in a post-capacity world was clear: make sure that new functionality didn’t negatively impact the UI/UX. In my opinion, one of the key reasons behind Grafana’s success has been how lean and mean it is, while still managing to look stunningly beautiful. There is minimal chrome in the interface to get in the way of the data. Therefore, our new design revisions were based on that core belief – don’t get in the way of the data.

This blog post is a behind-the-scenes peek at the path we took to deliver on these goals.

The challenge

Grafana v1 is a largely flat hierarchy. You are able to bounce around between your dashboards easily, but the interface had yet to introduce other top-level areas of the system.

Grafana v1 Hierarchy

But as we start adding in the concept of users, account level configuration and other features, we needed to introduce a hierarchy within the interface.

Grafana v2 Hierarchy

Our process

In the first prototypes of Grafana v2, Torkel added the navigation into an expandable sidebar, but it wasn’t clear that a menu was hiding behind the Grafana logo in the upper left. We agreed that we liked the concept, but wanted to explore a few different options before locking things in.

First, we would throw together some of the core concepts in Omnigraffle, mocking up the different states (open, closed, click, hover, etc.). Next, since the nav is so dependent on the interaction, we would then bring those ideas into Invision to get a sense of how it would feel to use. Finally, once we thought an idea had legs, Torkel would prototype it within the interface for testing.

Note: The mockups below are high-fidelity wireframes. None of the pros/cons listed below were driven by aesthetics – that came later.

Our thought process for this first iteration focused on clearly showing that a menu was available, featuring the concept of the account/organization, and making it easy for people to jump around the new areas of the system.

Grafana v1 Hierarchy

Pros:

  • The newly introduced header area gives us space to grow.
  • The active account/organization is prominently featured, showing off new features & functionality in v2.

Cons:

  • The introduction of a new interface element and behavior (menu trigger) felt challenging.
  • We lost the grafana logo in the upper left, which is also the main space for a user to include their own logo.
  • It was great that the top area had a lot of available space – but we couldn’t envision what would fill it. Effectively, we were creating dead space.

Conclusion:

The dead space wouldn’t stay dead for long. It would reanimate as zombie space, containing lost souls and buttons that didn’t belong anywhere else. We strive for a zombie-free interface.

Knowing that the top bar didn’t improve the interface, we dropped it, and went back to the original goal. How could we make the menu clearer while also retaining branding?

Grafana v1 Hierarchy

Pros:

  • The menu functionality is clear, and could easily translate into any language.

Cons:

  • Though it tested positively, we didn’t like the lockup of the logo and the word Menu.[1][2]
  • It didn’t seem polished.

Conclusion:

We can do better.

But wait.

There’s a bigger problem. Did you catch it?

Grafana v1 Hierarchy

The dashboard hierarchy on the sidebar is all wrong. Though the subnav would function as the settings for the active dashboard, the menu structure implies that it’s Global Dashboard Settings, Global Dashboard Templating, etc… Bad news.

So we started looking at common user behavior as a way to understand how necessary the sub-items were going to be in the primary navigation. As you can imagine, critical functionality would be switching between individual dashboards – not necessarily switching between Dashboards, Accounts and User Profile.

So we pulled out the sub-items from the nav, and got excited. We got so excited in fact, we thought this would be a good idea:

“We should do something really different”. “Yeah!”.

Grafana v1 Hierarchy

Pros:

  • We don’t have to take up screen real estate by introducing a sidebar. The options become even simpler.

Cons:

  • The trigger area felt small and awkwardly overlapped with the content below it.
  • Reducing the prominence of the menu made it feel like options within the existing page – not giving appropriate weight to the main navigation.

Conclusion:

Glad we got that out of our system.

Thinking more about behavior patterns, we know that the user will spend most of their time in the dashboards. Would the user benefit from consolidating the navigation entirely into three areas: Dashboards, Account/Organization, and User?

Grafana v1 Hierarchy

Pros:

  • No more sidebar – one simple, graceful top nav.
  • The organization and user are persistently displayed. Super helpful when bouncing around in different instances/accounts.

Cons:

  • We’ve now eaten up valuable vertical space, violating one of our main rules: don’t get in the way of data.
  • We’re restricting the way the interface can grow.

Conclusion:

A very good option, but not the best answer.

So after deliberation and experimentation, we circled back around to the sidebar – but dug deeper into why it wasn’t working for us in the first place. The menu had to be clear, unobtrusive, and needed room to grow.

Grafana v1 Hierarchy

Pros:

  • Clean, simple design with visual cues on hover.
  • The sidebar gives us space to grow, but the decision to stick to “Top Level” nav items will prevent it from turning into a junk drawer.
  • The introduction of breadcrumbs reinforces the hierarchy, and turning it into a button lets a user switch between dashboards super quickly.

Cons:

  • We’ve placed the menu in the hit area that commonly will take you “Home”. We knew we were adjusting a common web convention, but given the usage profile of our users, as well as the introduction of the hamburger on hover, we’re hoping it can remain intuitive.

Conclusion:

Winner. We love it, let’s do it.

What did we learn?

It’s early days at raintank, and we’re just beginning to find our rhythm. This has been a hugely collaborative and iterative process, and we’re just getting started. We know that it will often take a few different attempts to hone in on what works (and what doesn’t), and I’m impressed with the raintank team’s ability to act, react, and adjust.

We are placing a very high degree of importance on UI/UX and customer experience at raintank, which was one of the main reasons I was so excited to join. We’d love to hear what you think of where we’re going with all of this. Please check out a live demo of the Grafana 2.0 alpha here, and don’t be shy with your feedback.

You can reach me on twitter, github or old fashioned raintank.io email as [email protected]


  1. We re-referenced the research Hamburger vs Menu: The Final AB Test and Don’t Be Afraid of the Hamburger: A/B Test while we were experimenting, and it was the inspiration for trying out the Menu button. Ultimately, since our audience is predominantly high-frequency daily users, we thought we could take the risk and teach the behavior in a compressed space.
    [return]
  2. When this blog was first posted, I originally said: “Explicitly stating Menu felt like a cop out”. That was a dumb thing to say, and not reflective of our actual thoughts.
    [return]
