Welcome, Mandrill Customers!

Post Syndicated from Nic Webb original http://sesblog.amazon.com/post/Tx9AXP03VIZDLH/Welcome-Mandrill-Customers

After the recent announcement of changes with Mandrill, we have received many questions from Mandrill customers who are looking at options for email service providers.

Moving to a new platform can be daunting. If you want to understand what SES is and what it can do for you, check out our product detail page and Getting Started guide. We also have our re:Invent 2014 and 2015 presentations, with real-world examples on how to use the email sending and receiving capabilities of SES. Like many other AWS products, SES offers a free tier.

If you’re ready to dive in and start your migration, we have put together a list of resources to get you up and running with SES.

If you send emails through an SMTP interface, check out our SMTP endpoints.

If you send emails through a web API, take a look at the SES API. You can interact with the SES API directly through HTTPS, or use one of the many AWS SDKs that take care of the details for you. AWS SDKs are available for Android, iOS, Java, .NET, Node.js, PHP, Python, and Ruby.

If you are an API user, note that:

Mandrill’s send method corresponds to SES’s SendEmail (see the sketch after this list).

Mandrill’s send-raw method corresponds to SES’s SendRawEmail.

There is no corresponding send_at parameter for SES – messages are queued for delivery when you make an email-sending call to SES.

There is no need to specify async – SES API calls are inherently asynchronous.
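
If you are calling SES directly from code, here is a minimal sketch of SendEmail using the AWS SDK for Java from Scala. It is illustrative only: the addresses, region, and credential setup are assumptions, and SendRawEmail (for pre-built MIME messages) follows the same pattern.

import com.amazonaws.regions.Regions
import com.amazonaws.services.simpleemail.AmazonSimpleEmailServiceClientBuilder
import com.amazonaws.services.simpleemail.model.{Body, Content, Destination, Message, SendEmailRequest}

// Build an SES client in the region where your sending identity is verified (assumed here).
val ses = AmazonSimpleEmailServiceClientBuilder.standard()
  .withRegion(Regions.US_EAST_1)
  .build()

// Roughly the counterpart of Mandrill's send: one simple, non-templated message.
val request = new SendEmailRequest()
  .withSource("sender@example.com") // must be a verified SES identity
  .withDestination(new Destination().withToAddresses("recipient@example.com"))
  .withMessage(new Message()
    .withSubject(new Content("Hello from Amazon SES"))
    .withBody(new Body().withText(new Content("This message was sent through the SES API."))))

ses.sendEmail(request)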

If you process incoming email, we offer the ability to receive those emails through SES. Through our inbound processing, you can save your emails to S3, receive SNS notifications, and perform custom logic using Lambda.

To help you maintain good deliverability, we can automatically DKIM-sign your emails using Easy DKIM. (You are also free to do that manually.)

SES offers a mailbox simulator you can use to test your application’s handling of scenarios such as deliveries, bounces, and complaints.

We want your transition to be simple. If you have any questions, we welcome you to our forums, or you can review our support options.

Welcome to SES!

Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams

Post Syndicated from Rahul Bhartia original https://blogs.aws.amazon.com/bigdata/post/Tx2MQREB43K3BFK/Optimize-Spark-Streaming-to-Efficiently-Process-Amazon-Kinesis-Streams

Rahul Bhartia is a Solutions Architect with AWS

Martin Schade, a Solutions Architect with AWS, also contributed to this post.

Do you use real-time analytics on AWS to quickly extract value from large volumes of data streams? For example, have you built a recommendation engine on clickstream data to personalize content suggestions in real time using  Amazon Kinesis Streams and Apache Spark? These frameworks make it easy to implement real-time analytics, but understanding how they work together helps you optimize performance. In this post, I explain some ways to tune Spark Streaming for the best performance and the right semantics.

Amazon Kinesis integration with Apache Spark is via Spark Streaming. Spark Streaming is an extension of the core Spark framework that enables scalable, high-throughput, fault-tolerant stream processing of data streams such as Amazon Kinesis Streams. Spark Streaming provides a high-level abstraction called a Discretized Stream or DStream, which represents a continuous sequence of RDDs. Spark Streaming uses the Amazon Kinesis Client Library (KCL) to consume data from an Amazon Kinesis stream. The KCL takes care of many of the complex tasks associated with distributed computing, such as load balancing, failure recovery, and check-pointing. For more information, see the Spark Streaming + Kinesis Integration guide.

Spark Streaming receivers and KCL workers

Think of Spark Streaming as two main components:

Fetching data from the streaming sources into DStreams

Processing data in these DStreams as batches

Every input DStream is associated with a receiver, and in this case also with a KCL worker. The best way to understand this is to refer to the method createStream defined in the KinesisUtils Scala class.

Every call to KinesisUtils.createStream instantiates a Spark Streaming receiver and a KCL worker process on a Spark executor. The first time a KCL worker is created, it connects to the Amazon Kinesis stream and instantiates a record processor for every shard that it manages. For every subsequent call, a new KCL worker is created and the record processors are re-balanced among all available KCL workers. The KCL workers pull data from the shards and route it to the receivers, which in turn store it in the associated DStreams.

To understand how these extensions interface with the core Spark framework, we will need an environment that illustrates some of these concepts.

Executors, cores, and tasks

To create the environment, use an Amazon Kinesis stream named “KinesisStream” with 10 shards capable of ingesting 10 MBps or 10,000 TPS. For writing records into KinesisStream, use the AdvancedKPLClickEventsToKinesis class described in the Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library blog post. Each event here is a line from a web access log, averaging about 140 bytes in size. You can see from the screenshot below that this producer is capable of streaming over 10,000 records per second into KinesisStream by taking advantage of the PutRecords API operation.

To process the data using Spark Streaming, create an Amazon EMR cluster in the same AWS region using three m3.xlarge EC2 instances (one core and two workers). On the EMR cluster, the following default values are defined in the Spark configuration file (spark-defaults.conf):


spark.executor.instances         2
spark.executor.cores               4

This essentially means that each Spark application has two executors (an executor here is one container on YARN) with four cores each, and can run a maximum of eight tasks concurrently.

Note: By default, YARN on EMR uses the CapacityScheduler with the DefaultResourceCalculator, which does not take vCores into account when scheduling containers. To use vCores as an additional factor, you can configure the scheduler to use the DominantResourceCalculator.

In the context of your stream and cluster, this is what Spark Streaming might look like after the first invocation of KinesisUtils.createStream.

Calling KinesisUtils.createStream a second time rebalances the record processors between the two instances of KCL workers, each running on a separate executor as shown below.

Throughput

In the code snippet below from the Spark Streaming example, you can see that it’s creating as many DStreams (in turn, KCL workers) as there are shards.

// Create the Amazon Kinesis DStreams
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}

If you run a simple record count application using the logic above, you’ll see that the streaming application starts but doesn’t produce anything until you terminate it.

To see why, look at the details of the test environment. The default Spark configuration on Amazon EMR can run eight concurrent tasks, but you have created 10 receiver tasks, because the stream has 10 shards. Therefore, only eight receivers are active, and the remaining tasks are scheduled but never executed. A quick look at the application UI confirms this.

As you can observe in the screenshot, only eight of the 10 receivers created are active, and processing time is undefined because there are no slots left for other tasks to run. Re-run the record count application, but this time set the number of receivers to the number of executors (2) instead of the number of shards, as shown in the sketch below.
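
For reference, here is a hedged sketch of that change, reusing the names from the snippet above (ssc, appName, and the other stream parameters are assumed to be defined as before):

// Create one receiver per executor (2 here) rather than one per shard,
// then union the DStreams so downstream processing sees a single stream.
val numReceivers = 2
val kinesisStreams = (0 until numReceivers).map { i =>
  KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}
val unionedStream = ssc.union(kinesisStreams)

// Simple record count per batch, printed on the driver.
unionedStream.count().print()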

Now both your receivers are active, with no scheduling delay and an average processing time of 111 milliseconds. In fact, running the same application with a single executor and a single receiver produced the lowest delay, 93 ms, as displayed below. Spark Streaming was able to comfortably manage all the KCL workers plus a count task on a single executor with four vCores.

As for processing time, the best way to determine it is to put your application to the test: how much processing you can complete within a given batch interval is a function of your application logic and of how many stages and tasks Spark creates.

As demonstrated above, at a minimum you need to ensure that every executor has enough vCores left to process the data received in the defined batch interval. While the application is running, you can use the Spark Streaming UI (or look for “Total delay” in the Spark driver log4j logs) to verify whether the system is keeping up with the data rate. The system can be considered healthy if the total delay stays comparable to the batch interval of the streaming context; if the total delay is consistently greater than the batch interval, your application will not be able to keep up with the incoming velocity.
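
If you prefer to track this programmatically rather than through the UI, one option is Spark’s StreamingListener interface. A minimal sketch, assuming a StreamingContext named ssc as in the snippets above:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Log the total delay of every completed batch so it can be compared to the batch interval.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    batch.batchInfo.totalDelay.foreach { delayMs =>
      println(s"Total delay for batch ${batch.batchInfo.batchTime}: $delayMs ms")
    }
  }
})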

Another parameter that you can consider is block interval, determined by the configuration parameter spark.streaming.blockInterval. The number of tasks per receiver per batch is approximately batch interval / block interval. For example, a block interval of 1 sec creates five tasks for a 5-second batch interval. If the number of tasks is too high, this leads to a scheduling delay. If it is too low, it will be inefficient because not all available cores will be used to process the data.
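
If you build the streaming context yourself (rather than in spark-shell), the block interval can be set on the SparkConf. The values below are illustrative only:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A 5-second batch interval with a 1-second block interval yields roughly
// five tasks per receiver per batch (batch interval / block interval).
val sparkConf = new SparkConf()
  .setAppName("KinesisRecordCount") // hypothetical application name
  .set("spark.streaming.blockInterval", "1000ms")
val ssc = new StreamingContext(sparkConf, Seconds(5))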

Note: With the KCL, the interval at which the receiver reads data from a stream is set to 1 second (1,000 milliseconds) by default. This lets multiple consumer applications process a stream concurrently without hitting the Amazon Kinesis limit of five GetRecords calls per shard per second. There are currently no parameters exposed by the Spark Streaming framework to change this.

Dynamic resource allocation

A Spark cluster can also be configured to make use of “dynamic resource allocation”, which allows for a variable number of executors to be allocated to an application over time. A good use case is workloads with variable throughput. Using dynamic resource allocation, the number of executors and therefore the number of resources allocated to process a stream can be modified and matched to the actual workload.
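
As a rough sketch, dynamic resource allocation is controlled by settings such as the following in spark-defaults.conf (depending on your EMR release, these may already be enabled for you; the external shuffle service is required for dynamic allocation):

spark.dynamicAllocation.enabled    true
spark.shuffle.service.enabled      true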

For more information, see the Submitting User Applications with spark-submit blog post.

Note: Dynamic resource allocation currently works only for the processing, not for the receivers. While the KCL automatically re-balances record processors among the available workers as shards change, you need to shut down your application and restart it if you want to create more receivers.

Failure recovery

Now that you know more about the core integration of the KCL with Spark Streaming, I’ll move the focus from performance to reliability.

Checkpoint using the KCL

When you invoke the KinesisUtils.createStream method, in addition to the details about the stream—such as the name and AWS region—there are three other parameters:

Initial position – Location where your application starts reading in the stream: either the oldest record available (InitialPositionInStream.TRIM_HORIZON) or the latest record (InitialPositionInStream.LATEST).

Application name – Name of the DynamoDB table where information about checkpoints is stored.

Checkpoint interval – Interval at which the KCL workers save their position in the stream. In other words, at every checkpoint interval, the last sequence number read from each shard is written to a DynamoDB table. If you navigate to the DynamoDB console and look at the items in that table, you will see a record for every shard along with its associated information.

These parameters determine the failure recovery and checkpointing semantics as supported by the KCL.

Note: Sequence numbers are used as the checkpoints for every shard, and they are assigned after you write to the stream; for the same partition key, they generally increase over time.

How does a KCL application use this information for recovery from failure? Simply by looking up a DynamoDB table with the name provided by the parameter. If the DynamoDB table exists, then the KCL application starts reading from the sequence numbers stored in the DynamoDB table. For more information, see Kinesis Checkpointing.

Here’s an example of how this works. Stream records containing monotonically increasing numbers into an Amazon Kinesis stream, so that recovery is easy to demonstrate. Your Spark Streaming application calls print on the DStream to print out the first 10 records.

Start your application and, after it prints a batch, terminate it with Ctrl+C:

Now, start the application again using spark-shell, and within a few seconds, you see it start printing the contents of your stream again.

You can see how the second run of the application re-printed the exact same records from the last set printed just before terminating the previous run. This recovery is implemented using the DynamoDB table, and helps recover from failure of KCL-based processors.

The KCL (and Spark Streaming, by extension) creates the DynamoDB table with a provisioned throughput of 10 reads per second and 10 writes per second. If you’re processing many shards or checkpointing often, you might need more throughput. At a minimum, you need write throughput equal to the number of shards divided by the KCL checkpoint interval in seconds; for example, 10 shards checkpointed every second need at least 10 writes per second. For more information, see Provisioned Throughput in Amazon DynamoDB and Working with Tables in the Amazon DynamoDB Developer Guide.

Checkpoints using Spark Streaming

What if a failure happens after the KCL checkpoints but before the checkpointed records have actually been processed, for instance because the driver fails? On recovery, you lose those records, because the application starts processing from the record after the sequence numbers currently stored in DynamoDB.

To help recover from such failures, Spark Streaming also supports a checkpoint-based mechanism. This implementation does two things:

Metadata checkpoints – Stores the state of the Spark driver and metadata about unallocated blocks.

Processing checkpoints – Stores the state of the DStream as it is processed through the chain, including the beginning, intermediate, and ending states.

Note: Prior to Spark 1.5, you could also use write-ahead logs (WAL), which additionally saved the received data to fault-tolerant storage. As of Spark 1.5, this has been replaced by checkpoints, which can recover by re-reading from the source instead of storing the data.

For more information, see Checkpointing.

Here’s another example. This time, keep the batch interval at around 2 seconds, with both the KCL checkpoint interval and the block interval at 1 second. Also, enable the Spark-based checkpoint using the following outline:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

/* Create a streaming context (sc is the existing SparkContext, for example in spark-shell) */
def functionToCreateContext(): StreamingContext = {

  val ssc = new StreamingContext(sc, Seconds(2))

  /* Enable the Spark-based checkpoint */
  ssc.checkpoint(checkpointDirectory)

  /* KCL checkpoint interval of 1 second */
  val kinesisStreams = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl,
    awsRegion, InitialPositionInStream.LATEST, Seconds(1), StorageLevel.MEMORY_ONLY)

  /* Do the processing */

  ssc
}

/* First, recover the context from the checkpoint directory; otherwise, create a new context */
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

ssc.start()
ssc.awaitTermination()

After your application starts, fail the driver by killing the process associated with spark-shell on the master node.

Now restart your application without checkpoint recovery enabled. Before you start it, copy the contents of the DynamoDB table so that you can later re-run the application with recovery enabled from the same state.

As you noticed above, your application didn’t print record 166 and jumped directly to record 167. Because you are checkpointing in the KCL twice per batch interval, you have a 50% probability of missing one block’s worth of data for every such failure.

Now re-run the application, but this time use the checkpoint for recovery. Before you start your application, you need to restore the content of the DynamoDB table to ensure your application uses the same state.

This time you didn’t miss any records: your record set of 163-165 gets printed again, with the same time. There is also another batch, with records 166-167, that was recovered from the checkpoint, after which normal processing resumes from record 167 using the KCL checkpoint.

You might have noticed that both the KCL and Spark checkpoints help to recover from failures, which can lead to records being replayed when you use Spark Streaming with Amazon Kinesis; this implies at-least-once semantics. To learn more about the failure scenarios with Spark Streaming, see Fault-tolerance Semantics. To avoid potential side effects, ensure that the downstream processing has idempotent or transactional semantics.
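
One common way to make downstream processing idempotent (a sketch only, not the approach used in this post) is to key writes on a unique record identifier so that a replayed record simply overwrites the item it wrote before. The snippet below assumes a DStream[Array[Byte]] named unionedStream like the one built earlier, records formatted as "uniqueId,payload", and a hypothetical DynamoDB table named StreamResults:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import scala.collection.JavaConverters._

unionedStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One client per partition per batch keeps the sketch simple; reuse or pool clients in practice.
    val dynamoDB = AmazonDynamoDBClientBuilder.defaultClient()
    records.foreach { bytes =>
      // Hypothetical record format: "uniqueId,payload".
      val Array(id, payload) = new String(bytes, "UTF-8").split(",", 2)
      // PutItem with the same key overwrites the existing item, so replays do not create duplicates.
      dynamoDB.putItem("StreamResults",
        Map("recordId" -> new AttributeValue(id),
            "payload"  -> new AttributeValue(payload)).asJava)
    }
  }
}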

Summary

Below are guidelines for processing Amazon Kinesis streams using Spark Streaming:

Ensure that the number of Amazon Kinesis receivers you create is a multiple of the number of executors, so that they are load-balanced evenly across all the executors.

Ensure that the total processing time is less than the batch interval.

Use the number of executors and number of cores per executor parameters to optimize parallelism and use the available resources efficiently.

Be aware that the KCL receiver used by Spark Streaming reads data from Amazon Kinesis at the default interval of 1 second.

For reliable at-least-once semantics, enable the Spark-based checkpoints and ensure that your processing is idempotent (recommended) or you have transactional semantics around your processing.

Ensure that you’re using Spark version 1.6 or later with the EMRFS consistent view option, when using Amazon S3 as the storage for Spark checkpoints.

Ensure that there is only one instance of the application running with Spark Streaming, and that multiple applications are not using the same DynamoDB table (via the KCL).

I hope that this post helps you identify potential bottlenecks when using Amazon Kinesis with Spark Streaming, and helps you apply optimizations to leverage your computing resources effectively. Want to do more? Launch an EMR cluster and use Amazon Kinesis out of the box to get started.

Happy streaming!

If you have questions or suggestions, please leave a comment below.

—————————-

Related

Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming

Looking to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Using Amazon API Gateway as a proxy for DynamoDB

Post Syndicated from Stefano Buliani original https://aws.amazon.com/blogs/compute/using-amazon-api-gateway-as-a-proxy-for-dynamodb/

Andrew Baird, AWS Solutions Architect
Amazon API Gateway has a feature that enables customers to create their own API definitions directly in front of an AWS service API. This tutorial will walk you through an example of doing so with Amazon DynamoDB.
Why use API Gateway as a proxy for AWS APIs?
Many AWS services provide APIs that applications depend on directly for their functionality. Examples include:

Amazon DynamoDB – An API-accessible NoSQL database.
Amazon Kinesis – Real-time ingestion of streaming data via API.
Amazon CloudWatch – API-driven metrics collection and retrieval.

If AWS already exposes internet-accessible APIs, why would you want to use API Gateway as a proxy for them? Why not allow applications to just directly depend on the AWS service API itself?
Here are a few great reasons to do so:

You might want to enable your application to integrate with very specific functionality that an AWS service provides, without the need to manage access keys and secret keys that AWS APIs require.
There may be application-specific restrictions you’d like to place on the API calls being made to AWS services that you would not be able to enforce if clients integrated with the AWS APIs directly.
You may get additional value out of using a different HTTP method from the method that is used by the AWS service. For example, creating a GET request as a proxy in front of an AWS API that requires an HTTP POST so that the response will be cached.
You can accomplish the above things without having to introduce a server-side application component that you need to manage or that could introduce increased latency. Even a lightweight Lambda function that calls a single AWS service API is code that you do not need to create or maintain if you use API Gateway directly as an AWS service proxy.

Here, we will walk you through a hypothetical scenario that shows how to create an Amazon API Gateway AWS service proxy in front of Amazon DynamoDB.
The Scenario
You would like the ability to add a public Comments section to each page of your website. To achieve this, you’ll need to accept and store comments and you will need to retrieve all of the comments posted for a particular page so that the UI can display them.
We will show you how to implement this functionality by creating a single table in DynamoDB, and creating the two necessary APIs using the AWS service proxy feature of Amazon API Gateway.
Defining the APIs
The first step is to map out the APIs that you want to create. For both APIs, we’ve linked to the DynamoDB API documentation. Take note of how the API you define below differs in request/response details from the native DynamoDB APIs.
Post Comments
First, you need an API that accepts user comments and stores them in the DynamoDB table. Here’s the API definition you’ll use to implement this functionality:
Resource: /comments
HTTP Method: POST
HTTP Request Body:
{
    "pageId": "example-page-id",
    "userName": "ExampleUserName",
    "message": "This is an example comment to be added."
}
After you create it, this API becomes a proxy in front of the DynamoDB API PutItem.
Get Comments
Second, you need an API to retrieve all of the comments for a particular page. Use the following API definition:
Resource: /comments/{pageId}
HTTP Method: GET
The curly braces around {pageId} in the URI path definition indicate that pageId will be treated as a path variable within the URI.
This API will be a proxy in front of the DynamoDB API Query. Here, you will notice the benefit: your API uses the GET method, while the underlying DynamoDB Query API requires an HTTP POST and does not include any cache headers in the response.
Creating the DynamoDB Table
First, navigate to the DynamoDB console and select Create Table. Next, name the table Comments, with commentId as the Primary Key. Leave the rest of the default settings for this example, and choose Create.

After this table is populated with comments, you will want to retrieve them based on the page that they’ve been posted to. To do this, create a secondary index on an attribute called pageId. This secondary index enables you to query the table later for all comments posted to a particular page. When viewing your table, choose the Indexes tab and choose Create index.

When querying this table, you only want to retrieve the pieces of information that matter to the client: in this case, these are the pageId, the userName, and the message itself. Any other data you decide to store with each comment does not need to be retrieved from the table for the publicly accessible API. Type the following information into the form to capture this and choose Create index:

Creating the APIs
Now, using the AWS service proxy feature of Amazon API Gateway, we’ll demonstrate how to create each of the APIs you defined. Navigate to the API Gateway service console, and choose Create API. In API name, type CommentsApi and type a short description. Finally, choose Create API.

Now you’re ready to create the specific resources and methods for the new API.
Creating the Post Comments API
In the editor screen, choose Create Resource. To match the description of the Post Comments API above, provide the appropriate details and create the first API resource:

Now, with the resource created, set up what happens when the resource is called with the HTTP POST method. Choose Create Method and select POST from the drop-down list. Choose the check mark to save.
To map this API to the DynamoDB API needed, next to Integration type, choose Show Advanced and choose AWS Service Proxy.
Here, you’re presented with options that define which specific AWS service API will be executed when this API is called, and in which region. Fill out the information as shown, matching the DynamoDB table you created a moment ago. Before you proceed, create an AWS Identity and Access Management (IAM) role that has permission to call the DynamoDB API PutItem for the Comments table; this role must have a service trust relationship to API Gateway. For more information on IAM policies and roles, see the Overview of IAM Policies topic.
After inputting all of the information as shown, choose Save.

If you were to deploy this API right now, you would have a working service proxy API that only wraps the DynamoDB PutItem API. But, for the Post Comments API, you’d like the client to be able to use a more contextual JSON object structure. Also, you’d like to be sure that the DynamoDB API PutItem is called precisely the way you expect it to be called. This eliminates client-driven error responses and removes the possibility that the new API could be used to call another DynamoDB API or table that you do not intend to allow.
You accomplish this by creating a mapping template. This enables you to define the request structure that your API clients will use, and then transform those requests into the structure that the DynamoDB API PutItem requires.
From the Method Execution screen, choose Integration Request:

In the Integration Request screen expand the Mapping Templates section and choose Add mapping template. Under Content-Type, type application/json and then choose the check mark:

Next, choose the pencil icon next to Input passthrough and choose Mapping template from the dropdown. Now, you’ll be presented with a text box where you create the mapping template. For more information on creating mapping templates, see API Gateway Mapping Template Reference.
The mapping template will be as follows. We’ll walk through what’s important about it next:
{
    "TableName": "Comments",
    "Item": {
        "commentId": {
            "S": "$context.requestId"
        },
        "pageId": {
            "S": "$input.path('$.pageId')"
        },
        "userName": {
            "S": "$input.path('$.userName')"
        },
        "message": {
            "S": "$input.path('$.message')"
        }
    }
}
This mapping template creates the JSON structure required by the DynamoDB PutItem API. The entire mapping template is static. The three input variables are referenced from the request JSON using the $input variable and each comment is stamped with a unique identifier. This unique identifier is the commentId and is extracted directly from the API request’s $context variable. This $context variable is set by the API Gateway service itself. To review other parameters that are available to a mapping template, see API Gateway Mapping Template Reference. You may decide that including information like sourceIp or other headers could be valuable to you.
With this mapping template, no matter how your API is called, the only variance from the DynamoDB PutItem API call will be the values of pageId, userName, and message. Clients of your API will not be able to dictate which DynamoDB table is being targeted (because “Comments” is statically listed), and they will not have any control over the object structure that is specified for each item (each input variable is explicitly declared a string to the PutItem API).
Back in the Method Execution pane, choose TEST.
Create an example Request Body that matches the API definition documented above and then choose Test. For example, your request body could be:
{
    "pageId": "breaking-news-story-01-18-2016",
    "userName": "Just Saying Thank You",
    "message": "I really enjoyed this story!!"
}
Navigate to the DynamoDB console and view the Comments table to show that the request really was successfully processed:

Great! Try including a few more sample items in the table to further test the Get Comments API.
If you deployed this API, you would be all set with a public API that has the ability to post public comments and store them in DynamoDB. For some use cases you may only want to collect data through a single API like this: for example, when collecting customer and visitor feedback, or for a public voting or polling system. But for this use case, we’ll demonstrate how to create another API to retrieve records from a DynamoDB table as well. Many of the details are similar to the process above.
Creating the Get Comments API
Return to the Resources view, choose the /comments resource you created earlier and choose Create Resource, like before.
This time, include a request path parameter to represent the pageId of the comments being retrieved. Input the following information and then choose Create Resource:

In Resources, choose your new /{pageId} resource and choose Create Method. The Get Comments API will be retrieving data from our DynamoDB table, so choose GET for the HTTP method.
In the method configuration screen choose Show advanced and then select AWS Service Proxy. Fill out the form to match the following. Make sure to use the appropriate AWS Region and IAM execution role; these should match what you previously created. Finally, choose Save.

Modify the Integration Request and create a new mapping template. This will transform the simple pageId path parameter on the GET request to the needed DynamoDB Query API, which requires an HTTP POST. Here is the mapping template:
{
"TableName": "Comments",
"IndexName": "pageId-index",
"KeyConditionExpression": "pageId = :v1",
"ExpressionAttributeValues": {
":v1": {
"S": "$input.params(‘pageId’)"
}
}
}
Now test your mapping template. Navigate to the Method Execution pane and choose the Test icon on the left. Provide one of the pageId values that you’ve inserted into your Comments table and choose Test.

You should see a response like the following; it is directly passing through the raw DynamoDB response:

Now you’re close! All you need to do before you deploy your API is to map the raw DynamoDB response to the similar JSON object structure that you defined on the Post Comment API.
This will work very similarly to the mapping template changes you already made, but this time you’ll configure it on the Integration Response page of the console by editing the mapping template of the default integration response.
Navigate to Integration Response and expand the 200 response code by choosing the arrow on the left. In the 200 response, expand the Mapping Templates section. In Content-Type choose application/json then choose the pencil icon next to Output Passthrough.

Now, create a mapping template that extracts the relevant pieces of the DynamoDB response and places them into a response structure that matches our use case:
#set($inputRoot = $input.path('$'))
{
    "comments": [
        #foreach($elem in $inputRoot.Items) {
            "commentId": "$elem.commentId.S",
            "userName": "$elem.userName.S",
            "message": "$elem.message.S"
        }#if($foreach.hasNext),#end
        #end
    ]
}
Now choose the check mark to save the mapping template, and choose Save to save this default integration response. Return to the Method Execution page and test your API again. You should now see a formatted response.
Now you have two working APIs that are ready to deploy! See our documentation to learn about how to deploy API stages.
But, before you deploy your API, here are some additional things to consider:

Authentication: you may want to require that users authenticate before they can leave comments. Amazon API Gateway can enforce IAM authentication for the APIs you create. To learn more, see Amazon API Gateway Access Permissions.
DynamoDB capacity: you may want to provision an appropriate amount of capacity to your Comments table so that your costs and performance reflect your needs.
Commenting features: Depending on how robust you’d like commenting to be on your site, you might like to introduce changes to the APIs described here. Examples are attributes that track replies or timestamp attributes.
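
Once the API is deployed to a stage, any HTTP client can call it. Here is a hedged sketch in Scala using Java 11’s built-in HTTP client; the invoke URL, stage name, and sample values are placeholders that you would replace with your own:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

val client  = HttpClient.newHttpClient()
// Placeholder invoke URL; substitute your API ID, region, and stage after deployment.
val baseUrl = "https://abc123.execute-api.us-east-1.amazonaws.com/prod"

// Post a comment (mirrors the Post Comments request body defined above).
val postBody =
  """{"pageId": "example-page-id", "userName": "ExampleUserName", "message": "Nice article!"}"""
val post = HttpRequest.newBuilder(URI.create(s"$baseUrl/comments"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(postBody))
  .build()
println(client.send(post, HttpResponse.BodyHandlers.ofString()).body())

// Retrieve all comments for that page (Get Comments API).
val get = HttpRequest.newBuilder(URI.create(s"$baseUrl/comments/example-page-id")).GET().build()
println(client.send(get, HttpResponse.BodyHandlers.ofString()).body())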

Conclusion
Now you’ve got a fully functioning public API to post and retrieve public comments for your website. This API communicates directly with the Amazon DynamoDB API without you having to manage a single application component yourself!

Using Amazon API Gateway as a proxy for DynamoDB

Post Syndicated from Stefano Buliani original https://aws.amazon.com/blogs/compute/using-amazon-api-gateway-as-a-proxy-for-dynamodb/

Andrew Baird Andrew Baird, AWS Solutions Architect
Amazon API Gateway has a feature that enables customers to create their own API definitions directly in front of an AWS service API. This tutorial will walk you through an example of doing so with Amazon DynamoDB.
Why use API Gateway as a proxy for AWS APIs?
Many AWS services provide APIs that applications depend on directly for their functionality. Examples include:

Amazon DynamoDB – An API-accessible NoSQL database.
Amazon Kinesis – Real-time ingestion of streaming data via API.
Amazon CloudWatch – API-driven metrics collection and retrieval.

If AWS already exposes internet-accessible APIs, why would you want to use API Gateway as a proxy for them? Why not allow applications to just directly depend on the AWS service API itself?
Here are a few great reasons to do so:

You might want to enable your application to integrate with very specific functionality that an AWS service provides, without the need to manage access keys and secret keys that AWS APIs require.
There may be application-specific restrictions you’d like to place on the API calls being made to AWS services that you would not be able to enforce if clients integrated with the AWS APIs directly.
You may get additional value out of using a different HTTP method from the method that is used by the AWS service. For example, creating a GET request as a proxy in front of an AWS API that requires an HTTP POST so that the response will be cached.
You can accomplish the above things without having to introduce a server-side application component that you need to manage or that could introduce increased latency. Even a lightweight Lambda function that calls a single AWS service API is code that you do not need to create or maintain if you use API Gateway directly as an AWS service proxy.

Here, we will walk you through a hypothetical scenario that shows how to create an Amazon API Gateway AWS service proxy in front of Amazon DynamoDB.
The Scenario
You would like the ability to add a public Comments section to each page of your website. To achieve this, you’ll need to accept and store comments and you will need to retrieve all of the comments posted for a particular page so that the UI can display them.
We will show you how to implement this functionality by creating a single table in DynamoDB, and creating the two necessary APIs using the AWS service proxy feature of Amazon API Gateway.
Defining the APIs
The first step is to map out the APIs that you want to create. For both APIs, we’ve linked to the DynamoDB API documentation. Take note of how the API you define below differs in request/response details from the native DynamoDB APIs.
Post Comments
First, you need an API that accepts user comments and stores them in the DynamoDB table. Here’s the API definition you’ll use to implement this functionality:
Resource: /comments
HTTP Method: POST
HTTP Request Body:
{
"pageId": "example-page-id",
"userName": "ExampleUserName",
"message": "This is an example comment to be added."
}
After you create it, this API becomes a proxy in front of the DynamoDB API PutItem.
Get Comments
Second, you need an API to retrieve all of the comments for a particular page. Use the following API definition:
Resource: /comments/{pageId}
HTTP Method: GET
The curly braces around {pageId} in the URI path definition indicate that pageId will be treated as a path variable within the URI.
This API will be a proxy in front of the DynamoDB API Query. Here, you will notice the benefit: your API uses the GET method, while the DynamoDB GetItem API requires an HTTP POST and does not include any cache headers in the response.
Creating the DynamoDB Table
First, Navigate to the DynamoDB console and select Create Table. Next, name the table Comments, with commentId as the Primary Key. Leave the rest of the default settings for this example, and choose Create.

After this table is populated with comments, you will want to retrieve them based on the page that they’ve been posted to. To do this, create a secondary index on an attribute called pageId. This secondary index enables you to query the table later for all comments posted to a particular page. When viewing your table, choose the Indexes tab and choose Create index.

When querying this table, you only want to retrieve the pieces of information that matter to the client: in this case, these are the pageId, the userName, and the message itself. Any other data you decide to store with each comment does not need to be retrieved from the table for the publically accessible API. Type the following information into the form to capture this and choose Create index:

Creating the APIs
Now, using the AWS service proxy feature of Amazon API Gateway, we’ll demonstrate how to create each of the APIs you defined. Navigate to the API Gateway service console, and choose Create API. In API name, type CommentsApi and type a short description. Finally, choose Create API.

Now you’re ready to create the specific resources and methods for the new API.
Creating the Post Comments API
In the editor screen, choose Create Resource. To match the description of the Post Comments API above, provide the appropriate details and create the first API resource:

Now, with the resource created, set up what happens when the resource is called with the HTTP POST method. Choose Create Method and select POST from the drop down. Click the checkmark to save.
To map this API to the DynamoDB API needed, next to Integration type, choose Show Advanced and choose AWS Service Proxy.
Here, you’re presented with options that define which specific AWS service API will be executed when this API is called, and in which region. Fill out the information as shown, matching the DynamoDB table you created a moment ago. Before you proceed, create an AWS Identity and Access Management (IAM) role that has permission to call the DynamoDB API PutItem for the Comments table; this role must have a service trust relationship to API Gateway. For more information on IAM policies and roles, see the Overview of IAM Policies topic.
After inputting all of the information as shown, choose Save.

If you were to deploy this API right now, you would have a working service proxy API that only wraps the DynamoDB PutItem API. But, for the Post Comments API, you’d like the client to be able to use a more contextual JSON object structure. Also, you’d like to be sure that the DynamoDB API PutItem is called precisely the way you expect it to be called. This eliminates client-driven error responses and removes the possibility that the new API could be used to call another DynamoDB API or table that you do not intend to allow.
You accomplish this by creating a mapping template. This enables you to define the request structure that your API clients will use, and then transform those requests into the structure that the DynamoDB API PutItem requires.
From the Method Execution screen, choose Integration Request:

In the Integration Request screen expand the Mapping Templates section and choose Add mapping template. Under Content-Type, type application/json and then choose the check mark:

Next, choose the pencil icon next to Input passthrough and choose Mapping template from the dropdown. Now, you’ll be presented with a text box where you create the mapping template. For more information on creating mapping templates, see API Gateway Mapping Template Reference.
The mapping template will be as follows. We’ll walk through what’s important about it next:
{
    "TableName": "Comments",
    "Item": {
        "commentId": {
            "S": "$context.requestId"
        },
        "pageId": {
            "S": "$input.path('$.pageId')"
        },
        "userName": {
            "S": "$input.path('$.userName')"
        },
        "message": {
            "S": "$input.path('$.message')"
        }
    }
}
This mapping template creates the JSON structure required by the DynamoDB PutItem API. The entire mapping template is static. The three input variables are referenced from the request JSON using the $input variable and each comment is stamped with a unique identifier. This unique identifier is the commentId and is extracted directly from the API request’s $context variable. This $context variable is set by the API Gateway service itself. To review other parameters that are available to a mapping template, see API Gateway Mapping Template Reference. You may decide that including information like sourceIp or other headers could be valuable to you.
With this mapping template, no matter how your API is called, the only variance from the DynamoDB PutItem API call will be the values of pageId, userName, and message. Clients of your API will not be able to dictate which DynamoDB table is being targeted (because “Comments” is statically listed), and they will not have any control over the object structure that is specified for each item (each input variable is explicitly declared a string to the PutItem API).
Back in the Method Execution pane click TEST.
Create an example Request Body that matches the API definition documented above and then choose Test. For example, your request body could be:
{
    "pageId": "breaking-news-story-01-18-2016",
    "userName": "Just Saying Thank You",
    "message": "I really enjoyed this story!!"
}
Navigate to the DynamoDB console and view the Comments table to show that the request really was successfully processed:

Great! Try including a few more sample items in the table to further test the Get Comments API.
If you deployed this API, you would be all set with a public API that has the ability to post public comments and store them in DynamoDB. For some use cases you may only want to collect data through a single API like this: for example, when collecting customer and visitor feedback, or for a public voting or polling system. But for this use case, we’ll demonstrate how to create another API to retrieve records from a DynamoDB table as well. Many of the details are similar to the process above.
Creating the Get Comments API
Return to the Resources view, choose the /comments resource you created earlier and choose Create Resource, like before.
This time, include a request path parameter to represent the pageId of the comments being retrieved. Input the following information and then choose Create Resource:

In Resources, choose your new /{pageId} resource and choose Create Method. The Get Comments API will be retrieving data from our DynamoDB table, so choose GET for the HTTP method.
In the method configuration screen choose Show advanced and then select AWS Service Proxy. Fill out the form to match the following. Make sure to use the appropriate AWS Region and IAM execution role; these should match what you previously created. Finally, choose Save.

Modify the Integration Request and create a new mapping template. This will transform the simple pageId path parameter on the GET request to the needed DynamoDB Query API, which requires an HTTP POST. Here is the mapping template:
{
    "TableName": "Comments",
    "IndexName": "pageId-index",
    "KeyConditionExpression": "pageId = :v1",
    "ExpressionAttributeValues": {
        ":v1": {
            "S": "$input.params('pageId')"
        }
    }
}
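For comparison, here is roughly the same query issued directly against DynamoDB with boto3; the parameters mirror the template above, and the pageId value is just an example:

import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.query(
    TableName="Comments",
    IndexName="pageId-index",
    KeyConditionExpression="pageId = :v1",
    ExpressionAttributeValues={":v1": {"S": "breaking-news-story-01-18-2016"}},
)

# Each item comes back in DynamoDB's typed attribute format.
for item in response["Items"]:
    print(item["userName"]["S"], item["message"]["S"])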
Now test your mapping template. Navigate to the Method Execution pane and choose the Test icon on the left. Provide one of the pageId values that you’ve inserted into your Comments table and choose Test.

You should see a response like the following; it is directly passing through the raw DynamoDB response:

Now you’re close! All you need to do before you deploy your API is to map the raw DynamoDB response to the similar JSON object structure that you defined on the Post Comment API.
This will work very similarly to the mapping template changes you already made. But you’ll configure this change on the Integration Response page of the console by editing the default mapping response’s mapping template.
Navigate to Integration Response and expand the 200 response code by choosing the arrow on the left. In the 200 response, expand the Mapping Templates section. In Content-Type choose application/json then choose the pencil icon next to Output Passthrough.

Now, create a mapping template that extracts the relevant pieces of the DynamoDB response and places them into a response structure that matches our use case:
#set($inputRoot = $input.path('$'))
{
    "comments": [
        #foreach($elem in $inputRoot.Items) {
            "commentId": "$elem.commentId.S",
            "userName": "$elem.userName.S",
            "message": "$elem.message.S"
        }#if($foreach.hasNext),#end
        #end
    ]
}
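To make the transformation explicit, here is a rough Python equivalent of what that template does to the raw DynamoDB response (shown only for illustration; the actual work happens inside API Gateway):

def to_api_response(dynamodb_response):
    # Keep only the fields the public API promises to return.
    return {
        "comments": [
            {
                "commentId": item["commentId"]["S"],
                "userName": item["userName"]["S"],
                "message": item["message"]["S"],
            }
            for item in dynamodb_response.get("Items", [])
        ]
    }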
Now choose the check mark to save the mapping template, and choose Save to save this default integration response. Return to the Method Execution page and test your API again. You should now see a formatted response.
Now you have two working APIs that are ready to deploy! See our documentation to learn about how to deploy API stages.
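Once a stage is deployed, clients can call the two APIs over plain HTTPS. The following sketch uses the Python requests library; the invoke URL is a placeholder you would replace with your own stage URL:

import requests

API_BASE = "https://abc123.execute-api.us-east-1.amazonaws.com/prod"  # placeholder invoke URL

# Post a new comment.
requests.post(API_BASE + "/comments", json={
    "pageId": "breaking-news-story-01-18-2016",
    "userName": "Just Saying Thank You",
    "message": "I really enjoyed this story!!",
})

# Retrieve all comments for that page.
comments = requests.get(API_BASE + "/comments/breaking-news-story-01-18-2016").json()
print(comments["comments"])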
But, before you deploy your API, here are some additional things to consider:

Authentication: you may want to require that users authenticate before they can leave comments. Amazon API Gateway can enforce IAM authentication for the APIs you create. To learn more, see Amazon API Gateway Access Permissions.
DynamoDB capacity: you may want to provision an appropriate amount of capacity to your Comments table so that your costs and performance reflect your needs.
Commenting features: Depending on how robust you’d like commenting to be on your site, you might like to introduce changes to the APIs described here. Examples are attributes that track replies or timestamp attributes.

Conclusion
Now you’ve got a fully functioning public API to post and retrieve public comments for your website. This API communicates directly with the Amazon DynamoDB API without you having to manage a single application component yourself!

DevOps Cafe Episode 66 – Damon Interviews John

Post Syndicated from DevOpsCafeAdmin original http://devopscafe.org/show/2016/2/25/devops-cafe-episode-66-damon-interviews-john.html

Can’t contain(er) Botchagalupe

In this special episode, Damon turns the tables and interviews John. The wide-ranging conversation hits on everything from containers, to Docker, to microservices, to immutable infrastructure, to unikernels, and the impact of it all.

Direct download

Follow John Willis on Twitter: @botchagalupe
Follow Damon Edwards on Twitter: @damonedwards

Notes:

Please leave comments or questions below and we’ll read them on the show!

On Smart People, a Foolish Nation, and Reality – 2

Post Syndicated from Григор original http://www.gatchev.info/blog/?p=1916

My previous post would probably have passed unnoticed and sunk into the depths of the e-ocean if it hadn't been republished on webcafe.bg. Quite a few people read it. Some commented there, others here. One or two even wrote to me by email... One of those letters is the reason for this post.

I don't want to quote it here. For one thing, my blog is not a cesspit, and its readers come here looking for something other than filth. For another, one day its author will surely be ashamed of it, and the Net remembers... But I will answer him openly, in front of everyone.

----

Yes, man, you're right – I won't publish your letter on my blog. I explained why above. If you want a voice, start your own blog. It's cheap, and in many places even free. And enlighten the people as much as you like.

Not hard, is it? Why, then, are you foaming at the mouth that I supposedly won't give you freedom of speech? You have it. If what you mean is that people read my blog but won't read yours – that's a different matter. Mine isn't read by all that many people either. My skill at telling engaging stories and offering things of value is what it is, and that's how many readers I have attracted. Show more, and you'll attract more. If, however, you don't want to push the cart but only to ride on someone else's back, you're in the wrong. That attitude will never be the solution to Bulgaria's problems, or anyone else's. It is part of the problems – perhaps the most fundamental part.

I am not writing this post from the States, nor from England, nor from Germany. If I were writing it from there, it would contain entirely different things. Explanations of how bad it is there, how frightening, how impossible it is to live. How they beat black people and exploit the poor, and bleed the whole world dry to show off their wealth. How Arabs with turbans and scimitars blow themselves up at every step, so the streets are nothing but craters from the explosions. (Oh, and how they grope and rape the women while blowing themselves up.) How the government there has abandoned its own country so it can spend day and night plotting Bulgaria's ruin. How, the moment a Bulgarian arrives, he is arrested, locked up for life, and beaten every day...

Dissatisfied here, dissatisfied there too, right? Yes – but no. There is a difference you will never understand, no matter how many times and no matter who explains it to you. I will write it down for everyone else around... Here I am dissatisfied because I wish there were more of us who are fighting to set things right. And I am looking for kindred spirits, or for a way to support those who have despaired and given up... There, I would be content. And I would write those things so that someone like you wouldn't be tempted to come. He might just like it and decide to stay. And neither the country nor the people there are to blame, or deserve such a thing.

Yes, that is exactly what I said. Some other countries are well ordered even though the people in them are not very smart, because there are few people like you there. And Bulgaria is ruined despite its smart people, because people like you are more numerous. It is precisely you who are the cadaveric poison that turns smart people into an unfathomably foolish nation. And that turns one of the most beautiful countries in the world into one of the most wretched states.

Are you angry with me? Offended? Hold on a moment. In your email you spat far more colourful epithets at me and backed them up with nothing but curses, and I did not get angry with you. Instead, I pity you. Yes, I keep my distance from you, the way one should keep one's distance from a lice-ridden, mangy madman, but I do not hate you the way you hate me. I pity you, and I wish you would realize what you have brought yourself to. Somehow in a way that wouldn't make you want to die of shame afterwards – I don't know if that's possible, but I want it. Because the way you are now, you are ruining not only my life and the lives of all the decent people around, but your own as well... And I will back up what I wrote with arguments.

Tell me, man – do you have a cause you would support with work and action? Couch-potato heroism doesn't count, nor does anonymous bravado on the Internet. Do you have a cause other than "let's slaughter the Turks, the Jews, the foreigners, the queers, the too-clever and everyone else"? Because that is not a cause – those are the complexes of a cowardly little squirt. I have seen your kind more than once. To my shame, when I was younger and more hot-headed I have even met such people face to face and smelled the contents of their pelvic reservoirs... I suppose that even at the very back of a raging mob you still wouldn't cut down whoever crossed your path. Which you are probably secretly ashamed of, though you ought to be proud. Bullying struggling minorities is like beating an orphan – it brings nothing but disgrace.

I am asking you about something else. Tell me, if you do have another, genuine cause, have you ever supported it with work and effort? For example, if you are unhappy with a government – whichever one, and whatever your political views – have you given up your time and money to protest against it every day for months on end? Or, say, if you are a nature lover, have you sacrificed your days off to clean some beautiful spot of its rubbish, voluntarily and for free? Or, if it troubles you how many people have nothing even to eat, have you donated money, food, or labour for them? Have you collected clothes for orphanages or old people's homes? Have you dug and watered the neighbourhood garden – other than at night, in the "alternative" sense of the word? Have you sweated to contribute to Wikipedia or some other similar resource for the common good?

No, don't tell me you have no time. A three-kilobyte letter doesn't get written in an instant. And I am absolutely certain you write a great many more things like it – to other people, as comments on forums, all over the place. (I suspect that one of the commenters under the text at webcafe.bg is you, if under a pseudonym.) You probably spend hours on it every day... By the way, if you don't like my positions, who is forcing you to read them? When I don't like someone's positions, I don't read them. Why are you furious at me, when you are doing this to yourself?!

Yes, I know why. You are not the first of your kind I have seen, and you won't be the last – I know you well. And I will do you the favour of telling you. You will probably be very angry with me, because it does not sound pleasant. But if you stop and think for a moment, you may find some true things in it. If you have the dignity and self-respect to start putting them right, you may notice others too. And don't be surprised if one day you are grateful to me for words that seem so harsh at first glance.

Because subconsciously you sense that there is nothing worthwhile in you. That anonymous tough-guy posturing on the Internet is the only thing you are good at – you have no other value. And that is why you hate those who are not like you. Those who can see other people's strengths and their own shortcomings, and who look for ways to improve themselves and to give others the respect they are due. You try to prove yourself above such people in the only way you know – by cursing, abusing, and threatening. In the hope of frightening someone, so that for a moment you can feel important, noticed at all, as having left some trace. Because the only way you see to assert yourself is to be the biggest piece of trash around.

In my previous post I quoted how the Bulgarians who flee abroad are fleeing from other Bulgarians who stay here. You are the one they are fleeing from, with what you are. Holding it against them is the crowning piece of the reason they do it, and of why they have no intention of ever coming back. They will come back – and Bulgaria will bloom – when you are gone from here.

I do, however, have good news for you as well. It is that this subconscious feeling is lying to you. Yes, at the moment you may be a person of no particular value, but you can change that. It isn't hard – just start doing something that benefits people. Be politically active according to your views. Help whoever you think needs help. Contribute to people's good without seeking payment. Set aside part of your time and energy to be useful, in whatever way suits you, to whomever you find likeable or dear. If you have the strength for it to be for everyone, all the better.

Yes, there will be people who think you are mad. It may well be that most of your acquaintances are like that. Run from them as from the plague – they are exactly that, a plague of the mind. The world is full of far more decent people who deserve your support and will give you support when you need it. Look for true and worthy friends among them.

Start there. The rest will come with time on its own.

I bought some awful light bulbs so you don’t have to

Post Syndicated from Matthew Garrett original http://mjg59.dreamwidth.org/40397.html

I maintain an application for bridging various non-Hue lighting systems to something that looks enough like a Hue that an Amazon Echo will still control them. One thing I hadn’t really worked on was colour support, so I picked up some cheap bulbs and a bridge. The kit is badged as an iSuper iRainbow001, and it’s terrible.

Things seemed promising enough at first, although the bulbs were alarmingly heavy (there’s a significant chunk of heatsink built into them, which seems to get a lot warmer than I’d expect from something that claims a 7W power consumption). The app was a bit clunky, but eh – I wasn’t planning on using it for long. I pressed the button on the bridge, launched the app and could control the bulbs. The first thing I noticed was that they had a separate “white” and “colour” mode. White mode was pretty bright, but colour mode massively less so – presumably the white LEDs are entirely independent of the RGB ones, and much higher intensity. Still, potentially useful as mood lighting.

Anyway. Next step was to start playing with the protocol, which meant finding the device on my network. I checked anything that had picked up a DHCP lease recently and nmapped them. The OS detection reported Linux, which wasn’t hugely surprising – there was no GPL notice or source code included with the box, but I’m way past the point of shock at that. It also reported that there was a telnet daemon running. I connected and got a login prompt. And then I typed admin as the username and admin as the password and got a root prompt. So, there’s that. The copy of Busybox included even came with tftp, so it was easy to get copies of tcpdump and strace on there to see what was up.

So. Next step. Protocol sniffing. I wanted to know how discovery worked, so reset the device to factory and watched what happens. The app on my phone sent out a discovery packet on UDP port 18602 which looked like this:

INLAN:CLIP:23.21.212.48:CLPT:22345:MAC:02:00:00:00:00:00

The CLIP and CLPT fields refer to the cloud server that allows for management when you’re not on the local network. The mac field contains an utterly fake address. If you send out a discovery packet and your mac hasn’t been registered with the device, you get a denial back. If your mac has been (and by “mac” here I mean “utterly fake mac that’s the same for all devices”), you get back a response including the device serial number and IP address. And if you just leave out the mac field entirely, you get back a response no matter whether your address is registered or not. So, that’s a start. Once you’ve registered one copy of the app with the device, anything can communicate with it by just using the same fake mac in the discovery packets.

Onwards. The command format turns out to be simple. They start ##, are followed by two ascii digits encoding a command, four ascii digits containing a bulb id, two ascii digits containing the number of following bytes and then the command data (in ascii). An example is:

##05010002ff

which decodes as command 5 (set white intensity) on bulb 1 with two bytes of data following, each of which is an f. This translates as “Set bulb 1 to full white intensity”. I worked out the rest pretty quickly – command 03 sets the RGB colour of the bulb, 0A asks the bridge to search for new bulbs, 0B tells you which bulbs are available and their capabilities and 0E gives you the MAC addresses of the bulbs(‽). 0C crashes the server process, and 06 spews a bunch of garbage back at you and potentially crashes the bulb in a hilarious way that involves it flickering at about 15Hz. It turns out that 06 is actually the “Rename bulb” command, and if you send it less data than it’s expecting something goes hilariously wrong in string parsing and everything is awful.

Ok. Easy enough, but not hugely exciting. What about the remote protocol? This turns out to involve sending a login packet and then a wrapped command packet. The login has some length data, a header of “MQIsdp”, a long bunch of ascii-encoded hex, a username and a password.

The username is w13733 and the password is gbigbi01. These are hardcoded in the app. The ascii-encoded hex can be replaced with 0s and the login succeeds anyway.

Ok. What about the wrapping on the command? The login never indicated which device we wanted to talk to, so presumably there’s some sort of protection going on here oh wait. The command packet is a length, the serial number of the bridge and then a normal command packet. As long as you know the serial number of the device (which it tells you in response to a discovery packet, even if you’re unauthenticated), you can use the cloud service to send arbitrary commands to the device (including the one that crashes the service). One of which involves the device then doing some kind of string parsing that doesn’t appear to work that well. What could possibly go wrong?

Ok, so that seemed to be the entirety of the network protocol. Anything else to do? Some poking around on the bridge revealed (a) that it had an active wireless device and (b) a running DHCP server. They wouldn’t, would they?

Yes. Yes, they would.

The devices all have a hardcoded SSID of “iRainbow”, although they don’t broadcast it. There’s no security – anybody can associate. It’ll then hand out an IP address. It’s running telnetd on that interface as well. You can bounce through there to the owner’s internal network.

So, in summary: it’s a device that infringes my copyright, gives you root access in response to trivial credentials, has access control that depends entirely on nobody ever looking at the packets, is sufficiently poorly implemented that you can crash both it and the bulbs, has a cloud access protocol that has no security whatsoever and also acts as an easy mechanism for people to circumvent your network security. This may be the single worst device I’ve ever bought.

I called the manufacturer and they insisted that the device was designed in 2012, was no longer manufactured or supported, that they had no source code to give me and there would be no security fixes. The vendor wants me to pay for shipping back to them and reserves the right to deduct the cost of the original “free” shipping from the refund. Everything is awful, which is why I just ordered four more random bulbs to play with over the weekend.
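
For anyone curious, here is a tiny Python sketch of building one of these local command packets, based only on the framing described above. How a bulb number maps onto the four-digit bulb field isn't spelled out, and the transport for local commands isn't either, so this only constructs the string:

def build_command(command, bulb_field, payload):
    # "##" + two ascii digits of command + four ascii digits of bulb id
    # + two ascii digits giving the payload length + the payload itself (ascii)
    return "##{:02d}{}{:02d}{}".format(command, bulb_field, len(payload), payload)

# Reproduces the sniffed example: command 5 (set white intensity), bulb field "0100"
# (how bulb 1 appears in the captured packet -- an assumption), payload "ff".
print(build_command(5, "0100", "ff"))  # -> ##05010002ff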

Early Internet services considered harmful

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/02/early-internet-services-considered.html

This journalist, while writing a story on the #FBIvApple debate, got his email account hacked while on the airplane. Of course he did. His email account is with Earthlink, an early Internet services provider from the 1990s. Such early providers (AOL, Network Solutions, etc.) haven’t kept up with the times. If that’s still your email, there’s pretty much no way to secure it.

Early Internet stuff wasn’t encrypted, because encryption was hard, and it was hard for bad guys to tap into wires to eavesdrop. Now, with open WiFi hotspots at Starbucks or on the airplane, it’s easy for hackers to eavesdrop on your network traffic. Simultaneously, encryption has become a lot easier. All new companies, those still fighting to acquire new customers, have thus upgraded their infrastructure to support encryption. Stagnant old companies, who are just milking their customers for profits, haven’t upgraded their infrastructure.

You see this in the picture below. Earthlink supports older un-encrypted “POP3” (for fetching email from the server), but not the new encrypted POP3 over SSL. Conversely, GMail doesn’t support the older un-encrypted stuff (even if you wanted it to), but only the newer encrypted version.

Thus, if you are a reporter using Earthlink, of course you’ll get hacked every time you fetch your email (from your phone, or using an app like Outlook on the laptop). I point this out because the story then includes some recommendations on how to protect yourself, and they are complete nonsense. The only recommendation here is to stop using Earthlink, and other ancient email providers. Open your settings for how you get email and check the “port” number. If it’s 110, stop using that email provider (unless STARTTLS is enabled). If it’s 995, you are likely okay.

The more general lesson is that hacking doesn’t work like magic. The reporter’s email program was sending unencrypted passwords, and the solution is to stop doing that.

Update: No, Earthlink doesn’t support STARTTLS either.
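
If you want to check a provider yourself, here is a quick sketch using Python's standard poplib (the hostname is a placeholder for whatever your email settings show):

import poplib

HOST = "pop.example.com"  # replace with your provider's POP3 server

# Port 995: POP3 over SSL -- what you want to see.
try:
    poplib.POP3_SSL(HOST, 995, timeout=5).quit()
    print("Encrypted POP3 (995) is available")
except (OSError, poplib.error_proto):
    print("No POP3 over SSL on 995")

# Port 110: legacy unencrypted POP3 -- acceptable only if STLS shows up in the capabilities.
try:
    conn = poplib.POP3(HOST, 110, timeout=5)
    print("Plain POP3 (110) answered; capabilities:", conn.capa())
    conn.quit()
except (OSError, poplib.error_proto):
    print("No plain POP3 on 110")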

CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters

Post Syndicated from yahoo original https://yahooeng.tumblr.com/post/139916828451

By Andy Feng (@afeng76), Jun Shi and Mridul Jain (@mridul_jain), Yahoo Big ML Team
Introduction
Deep learning (DL) is a critical capability required by Yahoo product teams (e.g., Flickr, Image Search) to gain intelligence from massive amounts of online data. Many existing DL frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline (see Figure 1). The separate clusters require large datasets to be transferred among them, which introduces unwanted system complexity and latency for end-to-end learning.
Figure 1: ML Pipeline with multiple programs on separate clusters

As discussed in our earlier Tumblr post, we believe that deep learning should be conducted in the same cluster along with existing data processing pipelines to support feature engineering and traditional (non-deep) machine learning. We created CaffeOnSpark to allow deep learning training and testing to be embedded into Spark applications (see Figure 2). 
Figure 2: ML Pipeline with single program on one cluster

CaffeOnSpark: API & Configuration and CLI

CaffeOnSpark is designed to be a Spark deep learning package. Spark MLlib supports a variety of non-deep learning algorithms for classification, regression, clustering, recommendation, and so on. Deep learning is a key capability that Spark MLlib currently lacks, and CaffeOnSpark is designed to fill that gap. The CaffeOnSpark API supports dataframes so that you can easily interface with a training dataset that was prepared using a Spark application, and extract the predictions from the model or features from intermediate layers for results and data analysis using MLlib or SQL.
Figure 3: CaffeOnSpark as a Spark Deep Learning package

1:   def main(args: Array[String]): Unit = {
2:     val ctx = new SparkContext(new SparkConf())
3:     val cos = new CaffeOnSpark(ctx)
4:     val conf = new Config(ctx, args).init()
5:     val dl_train_source = DataSource.getSource(conf, true)
6:     cos.train(dl_train_source)
7:     val lr_raw_source = DataSource.getSource(conf, false)
8:     val extracted_df = cos.features(lr_raw_source)
9:     val lr_input_df = extracted_df.withColumn("Label", cos.floatarray2doubleUDF(extracted_df(conf.label)))
10:      .withColumn("Feature", cos.floatarray2doublevectorUDF(extracted_df(conf.features(0))))
11:    val lr = new LogisticRegression().setLabelCol("Label").setFeaturesCol("Feature")
12:    val lr_model = lr.fit(lr_input_df)
13:    lr_model.write.overwrite().save(conf.outputPath)
14:  }

Figure 4: Scala application using both CaffeOnSpark and MLlib

Scala program in Figure 4 illustrates how CaffeOnSpark and MLlib work together:
L1-L4 … You initialize a Spark context, and use it to create CaffeOnSpark and configuration object.
L5-L6 … You use CaffeOnSpark to conduct DNN training with a training dataset on HDFS.
L7-L8 …. The learned DL model is applied to extract features from a feature dataset on HDFS.
L9-L12 … MLlib uses the extracted features to perform non-deep learning (more specifically logistic regression for classification).
L13 … You could save the classification model onto HDFS.

As illustrated in Figure 4, CaffeOnSpark enables deep learning steps to be seamlessly embedded in Spark applications. It eliminates unwanted data movement in traditional solutions (as illustrated in Figure 1), and enables deep learning to be conducted on big-data clusters directly. Direct access to big-data and massive computation power are critical for DL to find meaningful insights in a timely manner.
CaffeOnSpark uses the same configuration files for solvers and neural networks as standard Caffe does. As illustrated in our example, the neural network will have a MemoryData layer with 2 extra parameters:

source_class specifying a data source class

source specifying dataset location.
The initial CaffeOnSpark release has several built-in data source classes (including com.yahoo.ml.caffe.LMDB for LMDB databases and com.yahoo.ml.caffe.SeqImageDataSource for Hadoop sequence files). Users could easily introduce customized data source classes to interact with the existing data formats.

CaffeOnSpark applications will be launched by standard Spark commands, such as spark-submit. Here are 2 examples of spark-submit commands. The first command uses CaffeOnSpark to train a DNN model saved onto HDFS. The second command is a custom Spark application that embedded CaffeOnSpark along with MLlib.
First command:
spark-submit --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt --num-executors 2 --class com.yahoo.ml.caffe.CaffeOnSpark caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -persistent -conf caffenet_train_solver.prototxt -model hdfs:///sample_images.model -devices 2
Second command:

spark-submit --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt --num-executors 2 --class com.yahoo.ml.caffe.examples.MyMLPipeline caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -features fc8 -label label -conf caffenet_train_solver.prototxt -model hdfs:///sample_images.model -output hdfs:///image_classifier_model -devices 2

System Architecture
Figure 5: System Architecture

Figure 5 describes the system architecture of CaffeOnSpark. We launch Caffe engines on GPU or CPU devices within the Spark executor by invoking a JNI layer with fine-grained memory management. Unlike traditional Spark applications, CaffeOnSpark executors communicate with each other through an MPI allreduce style interface over TCP/Ethernet or RDMA/InfiniBand. This Spark+MPI architecture enables CaffeOnSpark to achieve similar performance as dedicated deep learning clusters.
Many deep learning jobs are long running, and it is important to handle potential system failures. CaffeOnSpark enables the training state to be snapshotted periodically, so we can resume from the previous state after a failure of a CaffeOnSpark job.
Open Source
In the last several quarters, Yahoo has applied CaffeOnSpark on several projects, and we have received much positive feedback from our internal users. Flickr teams, for example, made significant improvements on image recognition accuracy with CaffeOnSpark by training with millions of photos from the Yahoo Webscope Flickr Creative Commons 100M dataset on Hadoop clusters.
CaffeOnSpark is beneficial to deep learning community and the Spark community. In order to advance the fields of deep learning and artificial intelligence, Yahoo is happy to release CaffeOnSpark at github.com/yahoo/CaffeOnSpark under Apache 2.0 license.
CaffeOnSpark can be tested on the AWS EC2 cloud or on your own Spark clusters. Please find the detailed instructions at the Yahoo GitHub repository, and share your feedback at [email protected]. Our goal is to make CaffeOnSpark widely available to deep learning scientists and researchers, and we welcome contributions from the community to make that happen.

How to Use AWS WAF to Block IP Addresses That Generate Bad Requests

Post Syndicated from Ben Potter original https://blogs.aws.amazon.com/security/post/Tx223ZW25YRPRKV/How-to-Use-AWS-WAF-to-Block-IP-Addresses-That-Generate-Bad-Requests

Internet-facing web applications are frequently scanned and probed by various sources, sometimes for good and other times to identify weaknesses. It takes some sleuthing to determine the probable intent of such exploit attempts, especially if you do not have tools in place that identify them. One way you can identify and block unwanted traffic is to use AWS WAF, a web application firewall that helps protect web applications from exploit attempts that can compromise security or place unnecessary load on your application.

Typically, exploit attempts are automated. Their intent is to collect information about your web application, such as the software version and exposed URLs. Think of these attempts as “reconnaissance missions” that gather data about where your web application might be vulnerable. To find out what is vulnerable, these exploit attempts send out a series of requests to see if they get any responses. Along the way, these attempts usually generate several error codes (HTTP 4xx error codes) as they try to determine what is exposed. Even normal requests can generate these error codes, but if you see a high number of errors coming from a single IP address, this is a good indication that somebody (or something) might not have good intentions for your web application. If you are delivering your web application with Amazon CloudFront, you can see these error codes in CloudFront access logs. Based on these error codes, you can configure AWS Lambda to update AWS WAF and block requests that generate too many error codes.

In this blog post, I show you how to create a Lambda function that automatically parses CloudFront access logs as they are delivered to Amazon S3, counts the number of bad requests from unique sources (IP addresses), and updates AWS WAF to block further requests from those IP addresses. I also provide a CloudFormation template that creates the web access control list (ACL), rule sets, Lambda function, and logging S3 bucket so that you can try this yourself.

Solution overview

This solution expands on a recent blog post by my colleague Heitor Vital who provided a comprehensive how-to guide for implementing rate-based blacklisting. I use the same concept in this post, but I block IP addresses based on the number of HTTP 4xx error codes instead of total requests. This solution assumes you have a CloudFront distribution and already are familiar with CloudFormation.

The following architecture diagram shows the flow of this solution, which works in this way:

  1. CloudFront delivers access log files for your distribution up to several times an hour to the S3 bucket you have configured. This bucket must reside in the same AWS region as the Lambda function and where you create the CloudFormation stack for this example.
  2. As new log files are delivered to the S3 bucket, the custom Lambda function is triggered. The Lambda function parses the log files and looks for requests that resulted in error codes 400, 403, 404, and 405. The function counts the number of bad requests, temporarily storing the results in current_outstanding_requesters.json in the configured S3 bucket (a simplified sketch of this parsing appears after this list).
  3. Bad requests for each IP address above the threshold that you define are blocked by updating an auto block IP match condition in AWS WAF.
  4. The Lambda function also updates CloudWatch with custom metrics for counters on number of requests and number of IP addresses blocked (you could set CloudWatch alarms on this as well).
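
To make steps 2 and 3 concrete, here is a stripped-down sketch of the counting idea. The field positions assume the standard tab-separated CloudFront web distribution log format, and the threshold value simply mirrors the stack's default:

import gzip
from collections import Counter

BAD_CODES = {"400", "403", "404", "405"}
THRESHOLD = 50  # matches the default Request Threshold described below

def ips_over_threshold(log_path):
    counts = Counter()
    with gzip.open(log_path, "rt") as log:
        for line in log:
            if line.startswith("#"):              # skip the header lines
                continue
            fields = line.split("\t")
            client_ip, status = fields[4], fields[8]   # c-ip and sc-status columns (assumed positions)
            if status in BAD_CODES:
                counts[client_ip] += 1
    return [ip for ip, n in counts.items() if n >= THRESHOLD]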

Lambda function overview

Here’s how the Lambda function is composed:

  1. The relevant Python modules (shown in the following screenshot) are imported for use.
  2. The next section (shown in the following screenshot) contains some configurable items such as error codes (line 29) for which to search, and the OUTPUT_FILE_NAME for storage of the function’s output in S3. The LINE_FORMAT parameters determine the format of the log; the settings are for CloudFront, but the function may be modified for other log formats.
  3. The following list outlines the main functions and their purpose:

    • def get_outstanding_requesters – Handles the parsing of the log file.
    • def merge_current_blocked_requesters – Handles the expiration of blocks.
    • def write_output – Writes the current IP block status to current_outstanding_requesters.json.
    • def waf_get_ip_set – Gets the current AWS WAF service IPSet.
    • def get_ip_set_already_blocked – Determines if the IPSet is already blocked.
    • def update_waf_ip_set – Performs the update to the AWS WAF IPSet (a rough boto3 sketch of this call appears after this list).
    • def lambda_handler – Reads the values configured in CloudFormation such as S3 bucket, and updates CloudWatch metrics.
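
As a point of reference, the kind of call update_waf_ip_set makes looks roughly like this with boto3 (the IPSet ID and the address are placeholders, not values from this solution):

import boto3

waf = boto3.client("waf")

def block_ip(ip_set_id, cidr):
    # Classic AWS WAF requires a fresh change token for every update.
    token = waf.get_change_token()["ChangeToken"]
    waf.update_ip_set(
        IPSetId=ip_set_id,
        ChangeToken=token,
        Updates=[{
            "Action": "INSERT",
            "IPSetDescriptor": {"Type": "IPV4", "Value": cidr},
        }],
    )

block_ip("auto-block-ip-set-id", "198.51.100.7/32")  # placeholder values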

Deploying the Auto Block Solution—Using the AWS Management Console

Let’s get started! In the CloudFormation console:

  1. Ensure you select a region where Lambda is available. See the Region Table for current availability.
  2. Click Create Stack, and then specify the template URL: https://s3.amazonaws.com/awswaf.us-east-1/block-bad-behaving-ips/block-bad-behaving-ips_template.json. Click Next.
  3. Type a name for your stack as well as the following parameters (also shown in the following screenshot):

    1. S3 Location – The stack will create a new S3 bucket where CloudFront log files are to be stored.
    2. Request Threshold – The number of bad requests in a one-minute period from a single source IP address to trigger a block condition. The default value is 50.
    3. WAF Block Period – The duration (in seconds) for which the bad IP address will be blocked. The default value is 4 hours (14400 seconds).

  4. Optional: Complete the creation of the stack by entering tag details. Click Next.
  5. Acknowledge that you are aware of the changes and associated costs with this stack creation by selecting the check box in the Capabilities section. Click Create.

    1. For more information about associated costs, see AWS WAF pricing information, Lambda pricing information, and CloudFront pricing information.

Testing the Auto Block Solution

After the stack has been created, you can test the Lambda function by copying this test file into the S3 bucket that was created. You can also test the function by copying your own CloudFront logs from your current logging bucket. In the Lambda console, click the Monitoring tab and then click through to view the logs in CloudWatch Logs. You will also notice that the S3 bucket now contains a file, current_outstanding_requesters.json, which details the IP addresses that are currently blocked. This is how the Lambda function stores state between invocations.

If you are satisfied the auto block solution is working correctly, you may want to configure a CloudFront distribution to store log files in the new bucket by using the CloudFront console. As the logs are processed, you can verify the Lambda function is running correctly in the Lambda console. Additionally you can use the AWS WAF console to check on current IP blocks. In the AWS WAF console, you will see the web ACL named Malicious Requesters, and an Auto Block Rule linked to an IP match condition called Auto Block Set. You will also see a Manual Block Rule to which you can add IP addresses you want to block manually.

Finally, to enforce the blocking, configure AWS WAF in CloudFront by associating the web ACL with your CloudFront distribution. In the CloudFront console, select Distribution Settings on the distribution you wish to enable for AWS WAF, and modify the AWS WAF Web ACL setting, as shown in the following screenshot.

Remember that 4xx errors also can be returned in response to normal requests. Determine which threshold is right for you, and ensure you do not have any missing pages or images somewhere on your site that are generating 404 errors. If you are not sure which threshold is right for you, test the rules first by changing the rule action to Count instead of Block, and viewing web request samples. When you are confident in your rules, you can change the rule action to Block.

Summary

This blog post has shown you how to configure a solution that automatically blocks IP addresses based on their error count. If you have ideas or questions, submit them in the “Comments” section below or on the AWS WAF forum.

– Ben

How to Use AWS Config to Help with Required HIPAA Audit Controls: Part 4 of the Automating HIPAA Compliance Series

Post Syndicated from Chris Crosbie original https://blogs.aws.amazon.com/security/post/Tx27GJDUUTHKRRJ/How-to-Use-AWS-Config-to-Help-with-Required-HIPAA-Audit-Controls-Part-4-of-the-A

In my previous posts in this series, I explained how to get started with the DevSecOps environment for HIPAA that is depicted in the following architecture diagram. In my second post in this series, I gave you guidance about how to set up AWS Service Catalog (#4 in the following diagram) to allow developers a way to launch healthcare web servers and release source code without the need for administrator intervention. In my third post in this series, I advised healthcare security administrators about defining AWS CloudFormation templates (#1 in the diagram) for infrastructure that must comply with the AWS Business Associate Agreement (BAA).

In today’s final post of this series, I am going to complete the explanation of the DevSecOps architecture depicted in the preceding diagram by highlighting ways you can use AWS Config (#9 in the diagram) to help meet audit controls required by HIPAA. Config is a fully managed service that provides you with an AWS resource inventory, configuration history, and configuration change notifications. This Config output, along with other audit trails, gives you the types of information you can use to meet your HIPAA auditing obligations. 

Auditing and monitoring are essential to HIPAA security. Auditing controls are a Technical Safeguard that must be addressed through the use of technical controls by anyone who wishes to store, process, or transmit electronic patient data. However, because there are no standard implementation specifications within the HIPAA law and regulations, AWS Config enables you to address audit controls by using the cloud to protect the cloud.

Because Config currently targets only AWS infrastructure configuration changes, it is unlikely that Config alone will be able to meet all of the audit control requirements laid out in Technical Safeguard 164.312, the section of the HIPAA regulations that discusses the technical safeguards such as audit controls. However, Config is a cloud-native auditing service that you should evaluate as an alternative to traditional on-premises compliance tools and procedures.

The audit controls standard found in 164.312(b)(2) of the HIPAA regulations says: “Implement hardware, software, and/or procedural mechanisms that record and examine activity in information systems that contain or use electronic health information.” Config helps achieve this because it monitors the activity of both running and deleted AWS resources across time. In a DevSecOps environment in which developers have the power to turn on and turn off infrastructure in a self-service manner, using a cloud-native monitoring tool such as Config will help ensure that you can meet your auditing requirements. Understanding what a configuration looked like and who had access to it at a point in the past is something that you will need to do in a typical HIPAA audit, and Config provides this functionality.

For more about the topic of auditing HIPAA infrastructure in the cloud, the AWS re:Invent 2015 session, Architecting for HIPAA Compliance on AWS, gives additional pointers. To supplement the monitoring provided by Config, review and evaluate the easily deployable monitoring software found in the AWS Marketplace.

Get started with AWS Config

From the AWS Management Console, under Management Tools:

Click Config.

If this is your first time using Config, click Get started.

From the Set up AWS Config page, choose which types of resources you want to track.

Config is designed to track the interaction among various AWS services. At the time of this post, you can choose to track your accounts in AWS Identity and Access Management (IAM), Amazon EC2–related services (such as Amazon Elastic Block Store, elastic network interfaces, and virtual private cloud [VPC]), and AWS CloudTrail.

All the information collected across these services is normalized into a standard format so that auditors or your compliance team may not need to understand the underlying details of how to audit each AWS service. They simply can review the Config console to ensure that your healthcare privacy standards are being met.

Because the infrastructure described in this post is designed for storing protected health information (PHI), I am going to select the check box next to All resources, as shown in the following screenshot. By choosing this option, I can ensure that not only will all the resources available for tracking be included, but also as new resource types get added to Config, they will automatically be added to my tracking as well.  

Also, be sure to select the Include global resources check box if you would like to use Config to record and govern your IAM resource configurations.

Specify where the configuration history file should be stored

Amazon S3 buckets have global naming, which makes it possible to aggregate the configuration history files across regions or send the files to a separate AWS account with limited privileges. The same consolidation can be configured for Amazon Simple Notification Service (SNS) topics, if you want to programmatically extend the information coming from Config or be immediately alerted of compliance risks.

For this example, I create a new bucket in my account, turn off the Amazon SNS topic notifications (as shown in the following screenshot), and click Continue.

On the next page, create a new IAM role in your AWS account so that the Config service has the ability to read your infrastructure’s information. You can review the permissions that will be associated with this IAM role by clicking the arrow next to View Policy Document.

After you have verified the policy, click Allow. You should now be taken to the Resource inventory page. On the right side of the page, you should see that Recording is on and that inventory is being taken about your infrastructure. When the Taking inventory label (shown in the following image) is no longer visible, you can start reviewing your healthcare infrastructure.

Review your healthcare server

For the rest of this post, I use Config to review the healthcare web server that I created with AWS Service Catalog in How to Use AWS Service Catalog for Code Deployments: Part 2 of the Automating HIPAA Compliance Series.

From the Resource inventory page, you can search based on types of resources, such as IAM user, network access control list (ACL), VPC, and instance. A resource tag is a way to categorize AWS resources, and you can search by those tags in Config. Because I used CloudFormation to enforce tagging, I can quickly find the type of resources I am interested in by setting up search for these tags.

As an example of why this is useful, consider employee turnover. Most healthcare organizations need to have processes and procedures to deal with employee turnover in a regulated environment. Because our CloudFormation template forced developers to populate a tag with their email addresses, you can easily use Config to find all the resources the employee was using, if they decided to leave the organization (or even if they didn’t leave the company).

Search on the Resource inventory page for the employee’s email address along with the tag, InstanceOwnerEmail, and then click Look up, as shown in the following screenshot.

Click the link under Resource identifier to see the Config timeline, which shows the most recent configuration recorded for the instance as well as previously recorded configurations. The timeline shows not only the configuration details of the instance itself, but also its relationships to other AWS services and an easy-to-interpret Changes section. This section gives your auditing and compliance teams the ability to quickly review and interpret changes from a single interface, without needing to understand the underlying AWS services in detail or jump between multiple AWS service pages.
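
The same timeline can also be retrieved programmatically; a rough sketch with a placeholder instance ID:

# Rough sketch: pull the configuration timeline for one EC2 instance with the Config API.
# The instance ID is a placeholder.
import boto3

config = boto3.client("config")

history = config.get_resource_config_history(
    resourceType="AWS::EC2::Instance",
    resourceId="i-0123456789abcdef0",  # placeholder
    limit=10,
)

for item in history["configurationItems"]:
    print(item["configurationItemCaptureTime"], item["configurationItemStatus"])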

Clicking View Details, as shown in the following image, will produce a JSON representation of the configuration, which you may consider including as evidence in the event of an audit.

The details contained in this JSON text will help you understand the structure of the configuration objects passed to AWS Lambda, which you interact with when writing your own Config rules. I discuss this in more detail later in this blog post.

Let’s walk through a quick example of one of the many ways an auditor or administrator might use Config. Say there was an emergency production issue that required an administrator to add SSH access to production web servers temporarily so that he or she could log in and manually install a software patch. The patch was then installed, and SSH access was revoked from all the security groups except one instance’s security group, which was mistakenly overlooked. In Config, the compliance team can review the last change to any resource type by reviewing the Config timeline (as shown in the following screenshot) and clicking Change to verify exactly what was changed.

It is clear from the following screenshot that the opening of SSH on port 22 was the last change captured, so we need to close the port on this security group to block remote access to this server.

Extend healthcare-specific compliance with Config Rules

Though the SSH configuration I just walked through provided context about how Config works, in a healthcare environment we would ideally want to automate this process. This is what AWS Config Rules can do for us.

Config Rules is a powerful rule system that can target resources and have them evaluated when they are created or changed, or on a periodic basis (hourly, daily, and so forth).

Let’s look at how we could have used Config Rules to identify the same improperly opened SSH port discussed previously in this post.

At the time of this post, AWS Config Rules is available only in the US East (N. Virginia) Region, so to follow along, be sure you have the AWS Management Console set to that region. From the same Config service that we have been using, click Rules in the left pane and then click Add Rule.

You can choose from the available managed rules. One of those rules, restricted-common-ports, fits our use case. In the Trigger section, I modify this rule so that it is limited to only those security groups I have tagged as PROD, as shown in the following screenshot.

I then override the rule’s default ports under Rule parameters and specify my own port, 22.
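
If you would rather create the rule through the API, a sketch might look like the following. The Environment tag key is an assumption for illustration; the source identifier and the blockedPort1 parameter are the managed rule’s documented values at the time of this post:

# Sketch: create the restricted-common-ports managed rule, scoped by tag, with port 22 blocked.
# The rule name and the Environment tag key are illustrative assumptions.
import boto3
import json

config = boto3.client("config")

config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "prod-restricted-ssh-port",
        "Scope": {
            "TagKey": "Environment",  # assumed tag key
            "TagValue": "PROD",
        },
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "RESTRICTED_INCOMING_TRAFFIC",  # restricted-common-ports
        },
        "InputParameters": json.dumps({"blockedPort1": "22"}),
    }
)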

Click Save, and you will be taken back to the Rules page, where the rule runs against your infrastructure. While the rule is running, you will see an Evaluating status, as shown in the following image.

When I return to my Resource inventory by clicking Resources in the left pane, I again search for all of my PROD environment resources. However, with AWS Config rules, I can quickly find which resources are noncompliant with the rule I just created. The following screenshot shows the Resource type and Resource identifier of the resource that is noncompliant with this rule.
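
You can also pull the noncompliant resources programmatically; a minimal sketch, reusing the rule name from the earlier sketch:

# Minimal sketch: list resources that are NON_COMPLIANT with a given Config rule.
import boto3

config = boto3.client("config")

details = config.get_compliance_details_by_config_rule(
    ConfigRuleName="prod-restricted-ssh-port",  # matches the earlier sketch; otherwise a placeholder
    ComplianceTypes=["NON_COMPLIANT"],
)

for result in details["EvaluationResults"]:
    qualifier = result["EvaluationResultIdentifier"]["EvaluationResultQualifier"]
    print(qualifier["ResourceType"], qualifier["ResourceId"])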

In addition to this SSH production check, for a regulated healthcare environment you should consider implementing all of the managed AWS Config rules to ensure your AWS infrastructure is meeting basic compliance requirements set by your organization. A few examples are:

Use the encrypted-volumes rule to ensure that volumes tagged as PHI=”Yes” are encrypted.

Ensure that you are always logging API activity by using the cloudtrail-enabled rule.

Ensure you do not have orphaned Elastic IP addresses with eip-attached.

Verify that all development machines can only be accessed with SSH from the development VPC by changing the defaults in restricted-ssh.

Use required-tags to ensure that you have the information you need for healthcare audits.

Ensure that only PROD resources that are hardened for exposure to the public Internet are in a VPC that has an Internet gateway attached, by taking advantage of the managed rule ec2-instances-in-vpc.

Create your own healthcare rules with Lambda

The managed rules just discussed give you a jump-start on meeting some of the minimum compliance requirements shared across many compliance frameworks, and they can be configured quickly to automate these basic checks.

However, for deep visibility into your healthcare-compliant architecture, you might want to consider developing your own custom rules to help meet your HIPAA obligations. As a trivial, yet important, example of something you might want to check for to be sure you are staying compliant with the AWS Business Associates Agreement, you could create a custom AWS Config rule to check that all of your EC2 instances are set to dedicated tenancy. This can be done by creating a new rule as shown previously in this post, except this time click Add custom rule at the top of the Config Rules page.

You are then taken to the custom rule page where you name your rule and then click Create AWS Lambda function (as shown in the following screenshot) to be taken to Lambda.

On the landing page to which you are taken (see following screenshot), choose a predefined blueprint with the name config-rule-change-triggered, which provides a sample function that is triggered when AWS resource configurations change.

Within the code blueprint provided, customize the evaluateCompliance function by changing the line

if ('AWS::EC2::Instance' !== configurationItem.resourceType)

to

if ("dedicated" === configurationItem.configuration.placement.tenancy)

This changes the function to return COMPLIANT when the EC2 instance has dedicated tenancy, instead of returning COMPLIANT whenever the resource type is simply an EC2 instance, as shown in the following screenshot.
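
If you prefer Python to the Node.js blueprint, an equivalent handler might look roughly like the sketch below. This is only an outline of the same idea; it omits the deleted-resource and oversized-item handling that the blueprint includes:

# Rough Python sketch of the same change-triggered check: COMPLIANT only for
# EC2 instances with dedicated tenancy. Error handling is intentionally minimal.
import boto3
import json

config = boto3.client("config")

def lambda_handler(event, context):
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]

    if item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif item["configuration"]["placement"]["tenancy"] == "dedicated":
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"

    config.put_evaluations(
        Evaluations=[{
            "ComplianceResourceType": item["resourceType"],
            "ComplianceResourceId": item["resourceId"],
            "ComplianceType": compliance,
            "OrderingTimestamp": item["configurationItemCaptureTime"],
        }],
        ResultToken=event["resultToken"],
    )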

After you have modified the Lambda function, create a role that gives the function permission to interact with Config. By default, Lambda suggests creating an AWS Config role for you. You can follow the default advice in the AWS console to create a role that contains the appropriate permissions.

After you have created the new role, click Next. On the next page, review the Lambda function you are about to create, and then click Create function. Now that you have created the function, copy the function’s Amazon Resource Name (ARN) from the Lambda page and return to your Config Rules setup page. Paste the ARN of the Lambda function you just created into the AWS Lambda function ARN* box.

From the Trigger options, choose Configuration changes under Trigger type, because this is the Lambda blueprint that you used. Set the Scope of changes to whichever resources you would like this rule to evaluate. In this sample, I will apply the rule to All changes.
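
The same registration can be done with the API. The sketch below uses a placeholder Lambda ARN and, purely to illustrate the Scope field, targets EC2 instances rather than All changes; note that the function must also grant config.amazonaws.com permission to invoke it, which the console handles for you:

# Sketch: register the custom rule against the Lambda function. The ARN is a placeholder.
import boto3

config = boto3.client("config")

config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "ec2-dedicated-tenancy",
        "Scope": {"ComplianceResourceTypes": ["AWS::EC2::Instance"]},
        "Source": {
            "Owner": "CUSTOM_LAMBDA",
            "SourceIdentifier": "arn:aws:lambda:us-east-1:123456789012:function:dedicatedTenancyCheck",  # placeholder
            "SourceDetails": [{
                "EventSource": "aws.config",
                "MessageType": "ConfigurationItemChangeNotification",
            }],
        },
    }
)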

After a few minutes, this rule will evaluate your infrastructure, and you can use the rule to easily audit your infrastructure to display the EC2 instances that are Compliant (in this case, that are using dedicated tenancy), as shown in the following screenshot.

For more details about working with Config Rules, see the AWS Config Developer Guide to learn how to develop your own rules.

In addition to digging deeper into the documentation, you may also want to explore the AWS Config Partners who have developed Config rules that you can simply take and use for your own AWS infrastructure. If your company has HIPAA expertise and is interested in partnering with AWS to develop HIPAA-specific Config rules, feel free to email me or leave a comment in the “Comments” section below.

Conclusion

In this blog post, I have completed my explanation of a DevSecOps architecture for the healthcare sector by looking at AWS Config Rules. I hope you have learned how compliance and auditing teams can use Config Rules to track the rapid, self-service changes developers make to cloud infrastructure, and how you can extend Config with customized compliance rules that give auditing and compliance groups deep visibility into a developer-centric AWS environment.

– Chris

Sofia’s Polluted Air

Post Syndicated from Боян Юруков original http://feedproxy.google.com/~r/yurukov-blog/~3/0VgfnzmblsM/

Exactly a month has passed since I started pulling the real-time data on air pollution in Sofia. In the meantime, I published several charts with excerpts from that data. An example is this one from January 24, looking at several indicators over four days:

The problem with these charts, and with many other comments on the topic, is that they do not take into account the exact definition of the 50 µg/m3 limit. For the air to count as polluted, the average concentration of fine particulate matter (PM10) over 24 hours must exceed 50 micrograms per cubic meter. Almost every time you read in the media or on Facebook (including, occasionally, on my own profile) that the air in Sofia is 4 or 5 times above the norm, it means someone has opened the municipality’s portal and looked at the pollution for the last hour. That, however, is wrong.
To get the real exceedances, we have to average the values over every 24-hour period. That is exactly what I did today, and here are the conclusions about particulate pollution for the period January 22 – February 21. I took the average across all stations in the capital, but I do not include the measurements from Копитото, for obvious reasons.

The PM10 level was above the norm 48% of the time
15% of the time it was 2 times above the norm, 6% of the time 3 times, 4% of the time 4 times
In total, for almost a full day out of this month, the pollution was 5 times above the norm
The average pollution for the whole period was 62 µg/m3

In the following chart you can see all the periods in which the 24-hour averages exceeded the limit. The vertical scale shows how many times the limit was exceeded.

I repeat: these figures are not for individual hours but for whole 24-hour periods. When several periods overlap, I merge them. For example, between February 2 at 00:00 and February 6 at 03:00 the averaged values are above the limit. However, I count only the time between the 2nd at 12:00 and the 5th at 15:00, because those are the midpoints of the corresponding 24-hour periods. At their beginning and end, the hourly averages drop well below 50 micrograms, and showing the whole span of the averaged 24-hour exceedance would be misleading.
At the same time, it is worth pointing out that if we count by individual hours, only 43.5% of them exceed 50 µg/m3. If we count the whole periods, rather than the convention I described above, we get not 48% but 57%. In other words, with the correct methodology the air turns out to have been polluted for an even larger share of the time (clarification in the comments). “Misinterpretation of the data” is the standard explanation ИАОС gives when commenting on media publications. Apparently, counting by periods leads to even worse conclusions.
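
For anyone who wants to reproduce the calculation from an hourly export like mine, here is a rough sketch. The file and column names are placeholders, and it uses trailing 24-hour windows rather than the midpoint convention described above:

# Rough sketch: share of time the trailing 24-hour PM10 average exceeds 50 µg/m3.
# Assumes an hourly CSV with "timestamp" and "pm10" columns (placeholder names).
import pandas as pd

df = pd.read_csv("sofia_pm10.csv", parse_dates=["timestamp"], index_col="timestamp")

daily_avg = df["pm10"].rolling("24H").mean()  # average of each trailing 24-hour window

over_limit = daily_avg > 50
print("Share of 24-hour averages above the limit:", round(over_limit.mean() * 100, 1), "%")
print("Average PM10 for the whole period:", round(df["pm10"].mean(), 1), "µg/m3")
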
Unfortunately, this data is not available in an open format. On the ИАОС page in the government’s open data portal, only tables of the exceedances are published, not the hourly values for all stations and parameters. With those we could monitor in real time and verify their calculations. In principle, publishing them is not a problem, since they are already provided freely to all municipalities. That is how the various charts on the websites of the municipalities of Sofia, Plovdiv, and Burgas are generated. As far as I understand, however, the tables are not made public because of concerns about misinterpretation and misunderstanding of the limits set by the EU. By that logic, though, no data on demographics or emigration should be released either, since misinterpretations of those abound. Still, these datasets appear in the cabinet’s new open data plan, and the hope is that we will start receiving them soon.
For Sofia specifically, I download the data through automated analysis of the charts on their website. This carries a risk of errors, but by my estimate they are below 0.1%, which would not distort the results shown here. You can download and analyze my data from the past month yourself; it includes all the parameters of the stations in Sofia.
You can read more about dirty air and its health and economic effects in Дневник, the WHO, and the Washington Post. For industrial pollution, you will find an interactive chart and explanations in my article from 2013.


Introducing On-Demand Pipeline Execution in AWS Data Pipeline

Post Syndicated from Marc Beitchman original https://blogs.aws.amazon.com/bigdata/post/Tx37EJ2IDFXITB2/Introducing-On-Demand-Pipeline-Execution-in-AWS-Data-Pipeline

Marc Beitchman is a Software Development Engineer in the AWS Database Services team

Now it is possible to trigger activation of pipelines in AWS Data Pipeline using the new on-demand schedule type. You can access this functionality through the existing AWS Data Pipeline activation API. On-demand schedules make it easy to integrate pipelines in AWS Data Pipeline with other AWS services and with on-premises orchestration engines.

For example, you can build AWS Lambda functions to activate an AWS Data Pipeline execution in response to Amazon CloudWatch cron expression events or Amazon S3 event notifications. You can also invoke the AWS Data Pipeline activation API directly from the AWS CLI and SDKs.

To get started, create a new pipeline and use the default object to specify the property "scheduleType": "ondemand". Setting this parameter enables on-demand activation of the pipeline.

Note: Activating a running on-demand pipeline cancels the current run and starts a new run. Check the state of the currently running pipeline if you do not want activation to cancel it.

Below is a simple example of a default object configured for on-demand activation.

{
  "id": "Default",
  "scheduleType": "ondemand"
}
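
Building on the note above, a minimal sketch of an on-demand activation that first checks the pipeline's state might look like this (the pipeline ID and the state value are placeholders for illustration):

# Minimal sketch: check the pipeline's state before activating it on demand,
# so an in-flight run is not cancelled. The pipeline ID is a placeholder.
import boto3

datapipeline = boto3.client("datapipeline")
PIPELINE_ID = "df-0123456789ABCDEFGHIJ"  # placeholder

description = datapipeline.describe_pipelines(pipelineIds=[PIPELINE_ID])
fields = description["pipelineDescriptionList"][0]["fields"]
state = next(f["stringValue"] for f in fields if f["key"] == "@pipelineState")

if state == "RUNNING":
    print("Pipeline is still running; skipping activation to avoid cancelling it.")
else:
    datapipeline.activate_pipeline(pipelineId=PIPELINE_ID)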

The screen shot below shows an on-demand pipeline with two Hadoop activities. The pipeline has been run three times.

Check out our samples in the AWS Data Pipeline samples GitHub repository. These samples show you how to create an AWS Lambda function that triggers an on-demand pipeline activation in response to object-creation (new file) events in Amazon S3, and how to trigger an on-demand pipeline activation in response to Amazon CloudWatch cron expression events.
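
As a quick illustration of the S3-triggered pattern, a stripped-down Lambda handler might look like the sketch below; the pipeline ID is a placeholder, and the samples repository has the complete, maintained version:

# Sketch: activate an on-demand pipeline whenever a new object lands in S3.
import boto3

datapipeline = boto3.client("datapipeline")
PIPELINE_ID = "df-0123456789ABCDEFGHIJ"  # placeholder

def lambda_handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        print("New object %s; activating pipeline %s" % (key, PIPELINE_ID))
        datapipeline.activate_pipeline(pipelineId=PIPELINE_ID)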

If you have questions or suggestions, please leave a comment below.

—————————-

Related:

How Coursera Manages Large-Scale ETL using AWS Data Pipeline and Dataduct

 

Looking to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

 


Freedom, the US Government, and why Apple are still bad

Post Syndicated from Matthew Garrett original http://mjg59.dreamwidth.org/39999.html

The US Government is attempting to force Apple to build a signed image that can be flashed onto an iPhone used by one of the San Bernardino shooters. To their credit, Apple have pushed back against this – there’s an explanation of why doing so would be dangerous here. But what’s noteworthy is that Apple are arguing that they shouldn’t do this, not that they can’t do this – Apple (and many other phone manufacturers) have designed their phones such that they can replace the firmware with anything they want.

In order to prevent unauthorised firmware being installed on a device, Apple (and most other vendors) verify that any firmware updates are signed with a trusted key. The FBI don’t have access to Apple’s firmware signing keys, and as a result they’re unable to simply replace the software themselves. That’s why they’re asking Apple to build a new firmware image, sign it with their private key and provide it to the FBI.

But what do we mean by “unauthorised firmware”? In this case, it’s “not authorised by Apple” – Apple can sign whatever they want, and your iPhone will happily accept that update. As owner of the device, there’s no way for you to reconfigure it such that it will accept your updates. And, perhaps worse, there’s no way to reconfigure it such that it will reject Apple’s.

I’ve previously written about how it’s possible to reconfigure a subset of Android devices so that they trust your images and nobody else’s. Any attempt to update the phone using the Google-provided image will fail – instead, they must be re-signed using the keys that were installed in the device. No matter what legal mechanisms were used against them, Google would be unable to produce a signed firmware image that could be installed on the device without your consent. The mechanism I proposed is complicated and annoying, but this could be integrated into the standard vendor update process such that you simply type a password to unlock a key for re-signing.

Why’s this important? Sure, in this case the government is attempting to obtain the contents of a phone that belonged to an actual terrorist. But not all cases governments bring will be as legitimate, and not all manufacturers are Apple. Governments will request that manufacturers build new firmware that allows them to monitor the behaviour of activists. They’ll attempt to obtain signing keys and use them directly to build backdoors that let them obtain messages sent to journalists. They’ll be able to reflash phones to plant evidence to discredit opposition politicians.

We can’t rely on Apple to fight every case – if it becomes politically or financially expedient for them to do so, they may well change their policy. And we can’t rely on the US government only seeking to obtain this kind of backdoor in clear-cut cases – there’s a risk that these techniques will be used against innocent people. The only way for Apple (and all other phone manufacturers) to protect users is to allow users to remove Apple’s validation keys and substitute their own. If Apple genuinely value user privacy over Apple’s control of a device, it shouldn’t be a difficult decision to make.

Register for and Attend This March 2 Webinar—Using AWS WAF and Lambda for Automatic Protection

Post Syndicated from Craig Liebendorfer original https://blogs.aws.amazon.com/security/post/Tx273MQOP5UGJWO/Register-for-and-Attend-This-March-2-Webinar-Using-AWS-WAF-and-Lambda-for-Automa

As part of the AWS Webinar Series, AWS will present Using AWS WAF and Lambda for Automatic Protection on Wednesday, March 2. This webinar will start at 10:00 A.M. and end at 11:00 A.M. Pacific Time (UTC-8).

AWS WAF Software Development Manager Nathan Dye will share AWS Lambda scripts that you can use to automate security with AWS WAF and write dynamic rules that can prevent HTTP floods, protect against badly behaving IPs, and maintain IP reputation lists. You can also learn how Brazilian retailer Magazine Luiza leveraged AWS WAF and Lambda to protect its site and run an operationally smooth Black Friday.

You will:

Learn how to use AWS WAF and Lambda together to automate security responses.

Get the Lambda scripts and AWS CloudFormation templates that prevent HTTP floods, automatically block bad-behaving IPs and bad-behaving bots, and allow you to import and maintain publicly available IP reputation lists.

Gain an understanding of strategies for protecting your web applications using AWS WAF, Amazon CloudFront, and Lambda.

The webinar is free, but space is limited and registration is required. Register today.

– Craig
