Why ninety-day lifetimes for certificates?

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org//2015/11/09/why-90-days.html

We’re sometimes asked why we only offer certificates with ninety-day lifetimes. People who ask this are usually concerned that ninety days is too short and wish we would offer certificates lasting a year or more, like some other CAs do.

Ninety days is nothing new on the Web. According to Firefox Telemetry, 29% of TLS transactions use ninety-day certificates. That’s more than any other lifetime. From our perspective, there are two primary advantages to such short certificate lifetimes:

1. They limit damage from key compromise and mis-issuance. Stolen keys and mis-issued certificates are valid for a shorter period of time.
2. They encourage automation, which is absolutely essential for ease-of-use. If we’re going to move the entire Web to HTTPS, we can’t continue to expect system administrators to manually handle renewals. Once issuance and renewal are automated, shorter lifetimes won’t be any less convenient than longer ones.

For these reasons, we do not offer certificates with lifetimes longer than ninety days. We realize that our service is young, and that automation is new to many subscribers, so we chose a lifetime that allows plenty of time for manual renewal if necessary. We recommend that subscribers renew every sixty days. Once automated renewal tools are widely deployed and working well, we may consider even shorter lifetimes.
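
To make the sixty-day renewal cadence concrete, here is a minimal sketch of an expiry check that could run from a daily scheduled job. The hostname and the renewal command are placeholders (this is not an official Let's Encrypt tool); the idea is simply to renew automatically once a certificate has less than thirty days left.

import ssl
import socket
import subprocess
import time

HOST = 'example.com'                          # placeholder: the site whose certificate we check
RENEW_CMD = ['acme-client', '--renew', HOST]  # placeholder: your ACME client's renewal command

def days_until_expiry(host, port=443):
    """Return the number of days before the host's certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()['notAfter']
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - time.time()) / 86400

if __name__ == '__main__':
    # Renewing with 30 days left matches the "renew every sixty days" advice.
    if days_until_expiry(HOST) < 30:
        subprocess.check_call(RENEW_CMD)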

systemd.conf 2015 Summary

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/systemdconf-2015-summary.html

systemd.conf 2015 is Over Now!

Last week our first systemd.conf conference
took place at betahaus, in Berlin, Germany. With almost 100 attendees,
a dense schedule of 23 high-quality talks stuffed into a single track
on just two days, a productive hackfest and numerous consumed
Club-Mates, I believe it was quite a success!

If you couldn’t attend the conference, you may watch all talks on our
YouTube Channel. The slides are available online, too.

Many photos from the conference are available on the Google Events
Page. Enjoy!

I’d specifically like to thank Daniel Mack, Chris Kühl and Nils Magnus
for running the conference, and making sure that it worked out as
smoothly as it did! Thank you very much, you did a fantastic job!

I’d also specifically like to thank the CCC Video Operation Center
folks for the excellent video coverage of the conference. Not only did
they implement a live-stream for the entire talks part of the
conference, but also cut and uploaded videos of all talks to our
YouTube Channel within the same day (in fact, within a few hours after
the talks finished). That’s quite an impressive feat!

The folks from LinuxTag e.V. put a lot of time and energy into the
organization. It was great to see how well this all worked out!
Excellent work!

(BTW, LinuxTag e.V. and the CCC Video Operation Center folks are
willing to help with the organization of Free Software community
events in Germany (and Europe?). Hence, if you need an entity that can
do the financial work and other stuff for your Free Software project’s
conference, consider pinging LinuxTag, they might be willing to
help. Similarly, if you are organizing such an event and are thinking
about providing video coverage, consider pinging the CCC VOC
folks! Both of them get our best recommendations!)

I’d also like to thank our conference sponsors!
Specifically, we’d like to thank our Gold Sponsors Red Hat and
CoreOS for their support. We’d also like to thank our Silver
Sponsor Codethink, and our Bronze Sponsors Pengutronix,
Pantheon, Collabora, Endocode, the Linux Foundation,
Samsung and Travelping, as well as our Cooperation Partners
LinuxTag and kinvolk.io, and our Media Partner Golem.de.

Last but not least I’d really like to thank our speakers and attendees
for presenting and participating in the conference. Of course, we put
the conference together specifically for you, and we really hope
you had as much fun at it as we did!

Thank you all for attending, supporting, and organizing systemd.conf
2015! We are looking forward to seeing you and working with you again
at systemd.conf 2016!

Thanks!

A rather different election

Post Syndicated from Боян Юруков original http://feedproxy.google.com/~r/yurukov-blog/~3/1tYKXeLAQWM/

On November 29, elections will be held in Frankfurt. They are not for the Bundestag, for mayor, or for any of the many levels of administration in between. The larger cities in the state have so-called foreigners’ councils, and every five years the permanent residents of the region who do not hold German citizenship can vote for them.
Frankfurt has 202,000 such voters, 7,545 of whom are Bulgarians. Although at first glance these elections may seem insignificant, several interesting parallels can be drawn between them and those in Bulgaria.
Ballots and voting
There are about 500 candidates for the council in Frankfurt alone. The requirement that all candidates and lists be printed out makes the ballots interesting.
[Image: IMG_20151101_175318]
As you can see, there is nothing scary about the large ballots used in Bulgaria’s recent local elections. I wanted to photograph it next to my daughter, because it almost hides her, but she decided it was a blanket and wrapped herself in it. A ballot like this is not an exception in Germany. You vote with preferences: you have 37 of them and can distribute them, giving up to 3 per candidate. By marking a whole list, you automatically allocate your votes to the candidates at the top of it.
Everyone eligible to vote in these elections receives a sample ballot by mail, along with a description of the vote, a photo, and directions to the nearest polling stations. I also received the famous application for voting by mail. This is possible in every election in Germany and has been cited as a working alternative for remote voting in Bulgaria. Leaving aside that it is practically impossible in Bulgaria, I have to point out that its security is practically nonexistent. I have also received such documents for the local elections in which I am entitled to vote. Anyone can reach into your mailbox, and that is not a rarity in some neighborhoods. Every election has its mini-scandals with delayed or lost documents, even with an organization as orderly as the German one.
[Image: IMG_20151101_175415] [Image: IMG_20151101_175404]
The instructions on where to go to vote are useful, though. I do not know how much money they spent on this, given that they have to print different leaflets for every address in the city. I take it as a kind of direct support for local business: print shops, photo studios, media, the postal service.
Do these elections matter?
The foreigners’ council has a mostly advisory role. It gives opinions on various municipal decisions and projects, works against discrimination, and raises issues with the local authorities. It can be compared, to a degree, with the role of an ombudsman minus the legislative initiative. The council also has a budget for sponsoring NGOs and events that support the integration of foreigners in the city. That budget is actually several times smaller than what the average municipal mayor in Bulgaria spends on his official car, but it is still something.
[Image: a7970ef06b]
With these powers, the council members can genuinely help the immigrant community. So you would not be surprised to learn that there is no shortage of candidates. As expected, the Turkish community is the strongest. As with elections in Bulgaria, their organization is tight. Because postal voting is so simple, most of them have voted that way. I have heard that the imams organized the mailing of the community’s letters. That is why the past few councils have been full of ethnic Turks. Although they make up less than 15% of the immigrants in the city, until a few years ago more than half of the candidates held Turkish passports.
This is changing, however. The Russians, Ukrainians, Italians, Greeks, and Pakistanis are gradually organizing and putting together their own lists. This year, for the first time, there is a Bulgarian list as well. Turnout, however, is very low. The analyses published after each vote show that only 6-8% of those 200 thousand actually voted. But that also means the Bulgarian candidates have a real chance of getting in, if our community mobilizes next Sunday.
How do I vote?
[Image: Liste_22-1b]
If you are a Bulgarian who was registered in Frankfurt before August 29 and you do not hold a German passport, you should already have received the brochures. They say where the polling stations are. To vote, you need to bring an ID card or passport and your Wahlbenachrichtigungskarte (the first page of the letter). See the first comment under this article if you have thrown it away. There are answers to various questions here (in Bulgarian here).
The Bulgarian list is called Bulgarische Gemeinschaft Frankfurt and is number 22. If you want to support it, you can mark the circle in front of the list directly, or distribute your 37 preferences among its candidates. More information is available on their Facebook page. If you live elsewhere in Hessen, you can check which candidates are running in your city.


Why improving kernel security is important

Post Syndicated from Matthew Garrett original http://mjg59.dreamwidth.org/38158.html

The Washington Post published an article today which describes the ongoing tension between the security community and Linux kernel developers. This has been roundly denounced as FUD, with Rob Graham going so far as to claim that nobody ever attacks the kernel. Unfortunately he’s entirely and demonstrably wrong, it’s not FUD and the state of security in the kernel is currently far short of where it should be.

An example. Recent versions of Android use SELinux to confine applications. Even if you have full control over an application running on Android, the SELinux rules make it very difficult to do anything especially user-hostile. Hacking Team, the GPL-violating Italian company who sells surveillance software to human rights abusers, found that this impeded their ability to drop their spyware onto targets’ devices. So they took advantage of the fact that many Android devices shipped a kernel with a flawed copy_from_user() implementation that allowed them to copy arbitrary userspace data over arbitrary kernel code, thus allowing them to disable SELinux.

If we could trust userspace applications, we wouldn’t need SELinux. But we assume that userspace code may be buggy, misconfigured or actively hostile, and we use technologies such as SELinux or AppArmor to restrict its behaviour. There’s simply too much userspace code for us to guarantee that it’s all correct, so we do our best to prevent it from doing harm anyway.

This is significantly less true in the kernel. The model up until now has largely been “Fix security bugs as we find them”, an approach that fails on two levels:

1) Once we find them and fix them, there’s still a window between the fixed version being available and it actually being deployed
2) The forces of good may not be the first ones to find them

This reactive approach is fine for a world where it’s possible to push out software updates without having to perform extensive testing first, a world where the only people hunting for interesting kernel vulnerabilities are nice people. This isn’t that world, and this approach isn’t fine.

Just as features like SELinux allow us to reduce the harm that can occur if a new userspace vulnerability is found, we can add features to the kernel that make it more difficult (or impossible) for attackers to turn a kernel bug into an exploitable vulnerability. The number of people using Linux systems is increasing every day, and many of these users depend on the security of these systems in critical ways. It’s vital that we do what we can to avoid their trust being misplaced.

Many useful mitigation features already exist in the Grsecurity patchset, but a combination of technical disagreements around certain features, personality conflicts and an apparent lack of enthusiasm on the side of upstream kernel developers has resulted in almost none of it landing in the kernels that most people use. Kees Cook has proposed a new project to start making a more concerted effort to migrate components of Grsecurity to upstream. If you rely on the kernel being a secure component, either because you ship a product based on it or because you use it yourself, you should probably be doing what you can to support this.

Microsoft received entirely justifiable criticism for the terrible state of security on their platform. They responded by introducing cutting-edge security features across the OS, including the kernel. Accusing anyone who says we need to do the same of spreading FUD is risking free software being sidelined in favour of proprietary software providing more real-world security. That doesn’t seem like a good outcome.

Integrating Splunk with Amazon Kinesis Streams

Post Syndicated from Prahlad Rao original https://blogs.aws.amazon.com/bigdata/post/Tx36W2CM8Y4OM2E/Integrating-Splunk-with-Amazon-Kinesis-Streams

Prahlad Rao is a Solutions Architect with AWS

It is important to not only be able to stream and ingest terabytes of data at scale, but to quickly get insights and visualize data using available tools and technologies. The Amazon Kinesis platform of managed services enables you to continuously capture and store terabytes of data per hour from hundreds or thousands of sources for real-time data processing over large distributed streams. Splunk enables data insights, transformation, and visualization. Both Splunk and Amazon Kinesis can be used for direct ingestion from your data producers.

This powerful combination lets you quickly capture, analyze, transform, and visualize streams of data without needing to write complex code using Amazon Kinesis client libraries. In this blog post, I show you how to integrate Amazon Kinesis with Splunk by taking Twitter feeds as the input data source and using Splunk to visualize the data.

Why is this architecture important?

Amazon Kinesis allows you to build a common ingestion mechanism for multiple downstream consumers (Splunk being one of them) to process without having to go back to the source multiple times. For example, you can integrate with Splunk for analytics and visualization at the same time you enable streaming data to be emitted to other data sources such as Amazon S3, Amazon Redshift, or even an AWS Lambda function for additional processing and transformation. Another common practice is to use S3 as a landing point for data after ingestion into Amazon Kinesis, which ensures that data can be stored persistently long-term. This post assumes that you have a fair understanding of Amazon Kinesis and Splunk usage and configuration.

Amazon Kinesis Streams

Amazon Kinesis Streams is a fully managed service for real-time processing of data streams at massive scale. You can configure hundreds of thousands of data producers to continuously put data into an Amazon Kinesis stream. The data from the stream is consumed by different Amazon Kinesis applications. Streams allows as many consumers of the data stream as your solution requires without a performance penalty.

Amazon Kinesis Streams

Splunk

Splunk is a platform for real-time, operational intelligence. It is an easy, fast, and secure way to analyze and visualize massive streams of data that could be generated by either IT systems or technology infrastructure. In this post, the data is being generated by Twitter feeds.

Flow of data

Here’s the data flow for this post.

The Twitter feeds related to a particular topic are captured by the Tweepy API. For this post, you capture how users are tweeting on ‘beer’, ‘wine’, and ‘whiskey’. The output is then fed into an Amazon Kinesis stream using a simple Python Boto3 script. The consumer (in this case, Splunk), installed on an Amazon EC2 instance, reads off the stream, extracts useful information, and builds a dashboard for analysis and visualization. The stream can also feed into multiple consumers, including Lambda.

Get access to the Twitter API

You need to get access to the Twitter streaming API so you can access the Twitter feeds from Python using Tweepy. For more information about how to set up and access Tweepy, see Streaming With Tweepy.

Sign in with your Twitter account at https://apps.twitter.com.

Create a new application (just a placeholder to generate access keys).

Generate the consumer key, consumer secret, access token, and access token secret.

Use OAuth and keys in the Python script.

Install Python boto3

Python was chosen as the programming language for this post, given that it’s fairly simple to set up Tweepy to access Twitter and also use boto, a Python library that provides SDK access to AWS services. AWS provides an easy-to-read guide for getting started with boto.
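
As a quick sanity check (a sketch, assuming your AWS credentials and default region are already configured, for example with aws configure), you can confirm that Boto3 can reach Amazon Kinesis before going further:

import boto3

client = boto3.client('kinesis')
print(client.list_streams())  # prints the stream names visible in the configured region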

Create an Amazon Kinesis stream

After you have completed the steps to access the Twitter API and set up Python Boto3, create an Amazon Kinesis stream to ingest Twitter data feeds using the AWS Management Console, CLI, or the Boto3 API. For this example, your stream is called ‘Kinesis_Twitter’ and has one shard. 

The unit of data stored by Amazon Kinesis is a data record, and a stream represents an ordered sequence of data records distributed into shards (or groups). When you create a stream, you specify the number of shards for each stream. A producer (in this case, Twitter feeds) puts data records into shards and a consumer (in this case, Splunk) gets data records from shards. You can dynamically resize your stream or add and remove shards after a stream is created.

aws kinesis create-stream --stream-name Kinesis_Twitter --shard-count 1
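
If you prefer the Boto3 API to the CLI, the equivalent call looks roughly like this (a sketch; the stream name and shard count mirror the CLI example above):

import boto3

client = boto3.client('kinesis')
client.create_stream(StreamName='Kinesis_Twitter', ShardCount=1)
# Block until the stream transitions to ACTIVE before writing to it.
client.get_waiter('stream_exists').wait(StreamName='Kinesis_Twitter')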

Verify the stream creation by using the CLI as follows:

aws kinesis describe-stream --stream-name Kinesis_Twitter
{
    "StreamDescription": {
        "StreamStatus": "ACTIVE",
        "StreamName": "Kinesis_Twitter",
        "StreamARN": "arn:aws:kinesis:us-west-2:904672585901:stream/Kinesis_Twitter",
        "Shards": [
            {
                "ShardId": "shardId-000000000000",
                "HashKeyRange": {
                    "EndingHashKey": "340282366920938463463374607431768211455",
                    "StartingHashKey": "0"
                },
                "SequenceNumberRange": {
                    "StartingSequenceNumber": "49554309958127460123837282410803325391382383410689343490"
                }
            }
        ]
    }
}

Set up Python for Twitter authentication

After you have registered your client application with Twitter, you should have your consumer token, access token, and secret. Tweepy supports OAuth authentication, which is handled by the tweepy.OAuthHandler class.

import tweepy
from tweepy.streaming import StreamListener
from tweepy import Stream
import json
import boto3

Set the following, replacing the placeholder values with your own keys and tokens:

client = boto3.client('kinesis')

consumer_key = '<< consumer key >>'
consumer_secret = '<< consumer secret key >>'
access_token = '<< access token >>'
access_secret = '<< access secret >>'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

Feed Twitter data into Amazon Kinesis

A small amount of code is enough to pull feeds from Twitter and write that data into Amazon Kinesis (limiting the number of records to 100 for this post). Load the Twitter data that contains keywords/track for ‘beer’, ‘wine’, and ‘whiskey’ into the stream so you can analyze data about what users are tweeting, including location and other interesting information.

You can write to Amazon Kinesis a single record at a time using the PutRecord API operation, or multiple records at one time using PutRecords. When you have more streaming data from producers, we recommend that you combine multiple records into batches and write bigger groupings of objects into the stream using PutRecords. PutRecords writes multiple data records from a producer into a stream in a single call. Each shard can support up to 1000 records written per second, up to a maximum total of 1 MB of data written per second. A stream can have as many shards as you need. Specify the name of the stream and an array of request records, with each record in the array requiring a partition key and data blob. In this post, you feed data into the stream called Kinesis_Twitter by batching multiple records (up to 10) in a container and calling PutRecords in a loop until the self-imposed limit of 100 records is reached.

For a thorough implementation of streaming large volumes of data, we recommend that you consider a producer library. The Amazon Kinesis Producer Library (KPL) provides necessary management of retries, failed records, and batch records when implementing producer applications to efficiently feed data into Amazon Kinesis at scale.  For more information, see the Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library blog post.

container = []

class StdOutListener(StreamListener):
    def __init__(self):
        super(StdOutListener, self).__init__()
        self.counter = 0
        self.limit = 100

    def on_data(self, data):
        data = json.loads(data)
        global container
        record = {'Data': json.dumps(data), 'PartitionKey': 'partition1'}
        container.append(record)
        if len(container) >= 10:
            # Write the batch of 10 records to the stream in a single call.
            client.put_records(Records=container, StreamName='Kinesis_Twitter')
            container = []
        self.counter += 1
        if self.counter < self.limit:
            return True
        else:
            # Returning False tells Tweepy to disconnect the stream.
            return False

    def on_error(self, status):
        print(status)

def main():
    out = StdOutListener()
    stream = Stream(auth, out)
    track = ['beer', 'wine', 'whiskey']
    try:
        stream.filter(track=track)
    except Exception:
        stream.disconnect()

if __name__ == '__main__':
    main()

Sample partial output of the Python script above:

[{‘PartitionKey’: ‘partition1’, ‘Data’: ‘{"contributors": null, "truncated": false, "text": "in case anyone’s wondering how walking to work is going, someone poured beer on my leg this morning. (what? http://t.co/oDwcG5Kz5k)", "in_reply_to_status_id": null, "id": 641965701952094208, "favorite_count": 0, "source": "Twitter for iPhone", "retweeted": false, "coordinates": null, "timestamp_ms": "1441891525148", "entities": {"user_mentions": [], "symbols": [], "trends": [], "hashtags": [], "urls": [{"url": "http://t.co/oDwcG5Kz5k",
……. More blob data >>

Verify and read data in Amazon Kinesis  

To read data continuously from a stream, the Amazon Kinesis API provides the getShardIterator and getRecords methods, representing a pull model that draws data directly from specified shards in the stream. You retrieve records from the stream on a per-shard basis; for each shard and each batch of records, you need to obtain a shard iterator, which specifies the position in the shard from which to start reading data records sequentially. Obtain the initial shard iterator using getShardIterator. Next, instantiate a GetRecordsRequest object and specify the iterator for the request using the setShardIterator method. Obtain shard iterators for additional batches of records using the getNextShardIterator method. To get the data records, call the getRecords method and continue to loop through the next shard iterator as follows. 

The following code specifies TRIM_HORIZON as the iterator type when you obtain the initial shard iterator, which means records should be returned beginning with the first record added to the shard. For more information about using shard iterators, see Using Shard Iterators.

import boto3
import time

client = boto3.client('kinesis')
shard_id = 'shardId-000000000000'
shard_iterator = client.get_shard_iterator(StreamName='Kinesis_Twitter', ShardId=shard_id, ShardIteratorType='TRIM_HORIZON')['ShardIterator']
i = 100
while i > 0:
    out = client.get_records(ShardIterator=shard_iterator, Limit=5)
    # Use the returned iterator to continue reading from where we left off.
    shard_iterator = out['NextShardIterator']
    i = i - 1
    print(out)
    time.sleep(0.5)

So far, you’ve set up Tweepy to access Twitter feeds, configured Amazon Kinesis to ingest Twitter feeds, and verified data in the stream.  Now, set up Splunk on an EC2 instance, connect Splunk to an Amazon Kinesis stream, and finally visualize Twitter data in a Splunk dashboard.

Install and set up Splunk

Splunk Enterprise is available as an Amazon Machine Image on the AWS Marketplace.

Splunk is available on AWS Marketplace

The latest version 6.2.1 is available on Linux as a 64-bit AMI. For more information about setup, see the Splunk documentation.  

From the AWS Marketplace, choose Splunk Enterprise HVM AMI. On the overview page, choose Continue.

On the Launch on EC2 page, enter the following:

Select the appropriate EC2 instance type.

Select the AWS region in which to set up Splunk.

Choose Launch with 1-click. For production deployments, we recommend following Splunk capacity planning guidelines and best practices.

Select appropriate VPC and security group rules for your environment, including a key pair.

Select the security group ports to be opened: TCP (554), UDP 8089 (management), 8000 (Splunkweb), 9997 (fwder), 22 (ssh), 443 (SSL/https).

After the instance launches and Splunk is running, log in to the Splunk console.

For a production deployment, you need to set up Splunk indexers and other related configuration; for this post, you can use the default values.

Set up Amazon Kinesis Modular Input

To ingest stream data into Splunk for indexing, install the free Amazon Kinesis Modular Input app.  The Amazon Kinesis Modular Input app is a type of consuming application that enables stream data to be indexed into Splunk. This is very much like a connector application between Amazon Kinesis and Splunk.

On the app home page, choose Settings and Data inputs.

Choose Settings and Data Inputs

On the Data inputs page, you should now see Amazon Kinesis listed as a local input type.  Under Actions, choose Add new.

Specify the following Amazon Kinesis and AWS configuration parameters. 

Stanza Name:  Any name associated with Amazon Kinesis data (such as Kinesis_Tweet).

Kinesis App Name: Any name associated with Amazon Kinesis data (such as Kinesis_Tweet).

Kinesis Stream Name: An Amazon Kinesis stream as configured in your Amazon Kinesis environment; you should match the exact name and repeat this configuration for each stream (such as Kinesis_Tweet).

Kinesis Endpoint: Use us-west-2.

Initial Stream Position: Defaults to TRIM_HORIZON, which causes the ShardIterator to point to the last untrimmed record in the shard (the oldest data in the shard). You can also point to and read the most recent record in the shard with LATEST. For now, use TRIM_HORIZON.  For more information, see GetShardIterator in the Amazon Kinesis Streams API Reference.

AWS Access Key ID: Your AWS access key (IAM user account).

AWS Secret: Your AWS secret key (IAM user account).

Backoff Time (Millis): Defaults to 3000.

Number of Retries: Defaults to 10.

Checkpoint Interval (Millis): Defaults to 60000.

Message Processing: This field is used for additional custom handling and formatting of messages consumed from Amazon Kinesis before they are indexed by Splunk. If this field is empty (as for this post), the default handler is used. For more information about custom message handling and examples, see the Customized Message Handling section in the Splunk documentation. You can find example message handlers in the GitHub SplunkModularInputsJavaFramework repository.

Set Source Type: Manual (allows you to specify a type for Amazon Kinesis streams).

Source Type: Kinesis.

Choose Next, verify the configuration, and enable it by choosing Data Inputs and Kinesis Records. 

Enable the configuration

Verify Amazon Kinesis stream data in Splunk

You should now be wired up end to end: Twitter data is streaming into Amazon Kinesis via Python and Tweepy, and the Amazon Kinesis stream is connected to your Splunk indexer. Now, verify the Twitter data in your Splunk indexer by using the Splunk search console: Choose Data Summary, Sources, and kinesis://kinesis_twitter. You should see Twitter records show up on the search console.

Twitter records in the search console

As the records are in JSON format, you can use Splunk rex commands to extract fields from the records, which can then be used for analysis and dashboards:

source="kinesis://kinesis_twitter" | rex "\s+record=\"(?<jsontest>[^\n]+})\"\ssequence" | spath input=jsontest

All the fields extracted from each Amazon Kinesis record are displayed to the left.  Choose the user.location field to display users tweeting by location, or user.time_zone to display users tweeting by time zone. 

Users tweeting by time zone

Build a simple dashboard and visualization of Twitter data on Splunk

Now, build a simple dashboard displaying the top locations and languages of tweeting users. The following search extracts user locations:

source="kinesis://Kinesis_Tweet" | rex "\s+record=\"(?<jsontest>[^\n]+})\"\ssequence" | spath input=jsontest | rename user.location as location | stats count by location | sort -count limit=10

After the data is extracted, choose Save As and save the search as a Report so you can reuse it later. Choose the Visualization tab next to Statistics and choose Pie as the chart type.

Pie chart visualization

Select Save As again, save as Dashboard Panel, enter a value for Dashboard Title, and choose Save.

Repeat the steps above for additional dashboard charts.

The following search extracts user language:

source="kinesis://Kinesis_Tweet" | rex "\s+record=\"(?<jsontest>[^\n]+})\"\ssequence" | spath input=jsontest | rename user.lang as language | stats count by language | sort -count limit=10

The following search extracts user hashtags:

source="kinesis://Kinesis_Tweet" | rex "\s+record=\"(?<jsontest>[^\n]+})\"\ssequence" | spath input=jsontest | rename entities.hashtags{}.text as hashtag | stats count by hashtag | sort -count limit=10

Access your dashboard by clicking on the Kinesis_TA app, and choose Default Views and Dashboards.  Select the dashboard (Twitter data dashboard) that you just created.

Here’s your simple dashboard displaying Twitter user location, language, and hashtag statistics.

Dashboard

When you finish, make sure to delete the streams and terminate the Splunk EC2 instances if you no longer need them.
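
For reference, the cleanup can also be scripted; a minimal Boto3 sketch, where the instance ID is a hypothetical placeholder for your Splunk host:

import boto3

boto3.client('kinesis').delete_stream(StreamName='Kinesis_Twitter')
# Replace with the actual instance ID of your Splunk EC2 host.
boto3.client('ec2').terminate_instances(InstanceIds=['i-0123456789abcdef0'])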

Conclusion

The combination of Amazon Kinesis and Splunk enables powerful capabilities: you can ingest massive data at scale, consume data for analytics, create visualizations or custom data processing using Splunk, and potentially tie in Lambda functions for multiple consumer needs, all while ingesting data into Amazon Kinesis one time. This is a win-win combination for customers who are already using Splunk and AWS services, or customers looking to implement scalable data ingestion and data insight mechanisms for their big data needs.

In a future post, I’ll continue from here to extract data from an Amazon Kinesis stream and store that data in a DynamoDB table or an S3 bucket using a Lambda function.

Until then, happy tweeting, streaming, and Splunk-ing all at once!

If you have questions or suggestions, please leave a comment below.

————————————

Related:

Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review

 

Signal

Post Syndicated from Йовко Ламбрев original http://yovko.net/signal/

I have decided to cut down my communication channels, above all the multitude of messengers for direct messages.
I always prefer email for primary communication, because I can prioritize (or ignore) the messages that deserve attention and possibly a reply, and I have end-to-end encryption when I need it. My current PGP key is here. And if you do not know any of my email addresses, you can always use this method.
I check my mail at least once or twice a day, except when I am on vacation, offline, or working on some urgent problem or project. But I do not get notifications for it on my smartphone; that is terribly distracting and counterproductive. My “favorite” is when someone calls me on the phone to say: I just sent you an email. Did you see it?
For direct messages I will, from now on, mainly use Signal by Open Whisper Systems, as a decent foundation for an open and secure platform that deserves to be used, popularized, and supported by its users. Temporarily, as a fallback, I am also keeping WhatsApp for the sake of a few close friends who prefer habit and do not realize the need for secure communication, so it will take time to convince them.
"I don't need privacy, I've nothing to hide" argues "I don't need free speech, I've nothing to say." Rights = Power https://t.co/AOMc79DIOS
— Edward Snowden (@Snowden) November 4, 2015

Okay, if you are on an iPhone or a Mac, you can also send me an iMessage as another fallback option, keeping in mind that security there depends on Apple.
As a rule I do not use Skype, except by prior arrangement for a specific call. Nor Viber or Facebook Messenger (I never even came to like them). I am also dropping Hangouts and Telegram, as well as everything else, because they are simply too much.
Try Signal: a simple, lightweight application for encrypted text messages and calls. Besides being free and open source, it also costs nothing. It is available for iPhone and Android, and soon for the web. Even Snowden has blessed it 😉
I use Signal every day. #notesforFBI (Spoiler: they already know) https://t.co/KNy0xppsN0
— Edward Snowden (@Snowden) November 2, 2015

Under the Hood: AWS CodeDeploy and Auto Scaling Integration

Post Syndicated from Jonathan Turpie original http://blogs.aws.amazon.com/application-management/post/Tx1NRS217K1YOPJ/Under-the-Hood-AWS-CodeDeploy-and-Auto-Scaling-Integration

Under the Hood: AWS CodeDeploy and Auto Scaling Integration

AWS CodeDeploy is a service that automates application deployments to your fleet of servers. Auto Scaling is a service that lets you dynamically scale your fleet based on load. Although these services are standalone, you can use them together for hands-free deployments! Whenever new Amazon EC2 instances are launched as part of an Auto Scaling group, CodeDeploy can automatically deploy your latest application revision to the new instances.

This blog post will cover how this integration works and conclude with a discussion of best practices. We assume you are familiar with CodeDeploy concepts and have completed the CodeDeploy walkthrough.

Configuring CodeDeploy with Auto Scaling

Configuring CodeDeploy with Auto Scaling is easy. Just go to the AWS CodeDeploy console and specify the Auto Scaling group name in your Deployment Group configuration.

In addition, you need to:

Install the CodeDeploy agent on the Auto Scaling instance. You can either bake the agent as part of the base AMI or use user data to install the agent during launch.

Make sure the service role used by CodeDeploy to interact with Auto Scaling has the correct permissions. You can use the AWSCodeDeployRole managed policy. For more information, see Create a Service Role for CodeDeploy.

For a step-by-step tutorial, see Using AWS CodeDeploy to Deploy an Application to an Auto Scaling Group.
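For reference, the same association can also be made programmatically. Below is a minimal sketch using the AWS SDK for JavaScript; the application name, deployment group name, Auto Scaling group name, and role ARN are placeholders for illustration, not values from this post.

// Sketch: associate an Auto Scaling group with a CodeDeploy deployment group.
// All names below (MyApp, MyDeploymentGroup, MyAutoScalingGroup, the role ARN) are placeholders.
var AWS = require('aws-sdk');
var codedeploy = new AWS.CodeDeploy({ region: 'us-east-1' });

var params = {
  applicationName: 'MyApp',
  deploymentGroupName: 'MyDeploymentGroup',
  serviceRoleArn: 'arn:aws:iam::123456789012:role/CodeDeployServiceRole',
  autoScalingGroups: ['MyAutoScalingGroup'] // CodeDeploy installs the lifecycle hook for you
};

codedeploy.createDeploymentGroup(params, function(err, data) {
  if (err) console.log(err, err.stack);
  else console.log('Deployment group created: ' + data.deploymentGroupId);
});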

Auto Scaling Lifecycle Hook

The communication between Auto Scaling and CodeDeploy during a scale-out event is based on Auto Scaling lifecycle hooks. If the hooks are not set up correctly, the deployment will fail. We recommend that you do not try to manually set up or modify these hooks because CodeDeploy can do this for you. Auto Scaling lifecycle hooks tell Auto Scaling to send a notification when an instance is about to change to certain Auto Scaling lifecycle states. CodeDeploy listens only for notifications about instances that have launched and are about to be put in the InService state. This state occurs after the EC2 instance has finished booting, but before it is put behind any Elastic Load Balancing load balancers you have configured. Auto Scaling waits for a successful response from CodeDeploy before it continues working on the instance.

Hooks are part of the configuration of your Auto Scaling group. You can use the describe-lifecycle-hooks CLI command to see a list of hooks installed on your Auto Scaling group. When you create or modify a deployment group to contain an Auto Scaling group, CodeDeploy does the following:

Uses the CodeDeploy service role passed in for use with the deployment group to gain permissions to the Auto Scaling group.

Installs a lifecycle hook in the Auto Scaling group for instance launches that sends notifications to a queue owned by CodeDeploy.

Adds a record of the installed hook to the deployment group.

When you remove an Auto Scaling group from a deployment group or delete a deployment group, CodeDeploy does the following:

Uses the service role for the deployment group to gain access to the Auto Scaling group.

Gets the recorded hook from the deployment group and removes it from the Auto Scaling hook.

If the deployment group is being modified (not deleted), deletes the record of the hook from the deployment group.

If there are problems creating hooks, CodeDeploy will try to roll back the changes. If there are problems removing hooks, CodeDeploy will return the unsuccessful hook removals in the API response and continue.
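If you want to verify the hook that CodeDeploy installed, without modifying it, a quick read-only check is to list the lifecycle hooks on the group. Here is a minimal sketch with the AWS SDK for JavaScript; the group name is a placeholder.

// Sketch: inspect (read-only) the lifecycle hooks installed on an Auto Scaling group.
var AWS = require('aws-sdk');
var autoscaling = new AWS.AutoScaling({ region: 'us-east-1' });

autoscaling.describeLifecycleHooks({ AutoScalingGroupName: 'MyAutoScalingGroup' }, function(err, data) {
  if (err) console.log(err, err.stack);
  else data.LifecycleHooks.forEach(function(hook) {
    console.log(hook.LifecycleHookName + ' -> ' + hook.LifecycleTransition);
  });
});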

Under the Hood

Here's the sequence of events that occur during an Auto Scaling scale-out event:

Auto Scaling asks EC2 for a new instance.

EC2 spins up a new instance with the configuration provided by Auto Scaling.

Auto Scaling sees the new instance, puts it into Pending:Wait status, and sends the notification to CodeDeploy.

CodeDeploy receives the instance launch notification from Auto Scaling.

CodeDeploy validates the configuration of the instance and the deployment group.

If the notification looks correct, but the deployment group no longer contains the Auto Scaling group (or we can determine that the deployment group was previously deleted), then CodeDeploy will not deploy anything and will tell Auto Scaling to CONTINUE with the instance launch. Auto Scaling will still respect any other constraints on instance launch; this step does not force Auto Scaling to continue if something else is wrong.

If CodeDeploy can’t process the message (for example, if the stored service role doesn’t grant appropriate permissions), then CodeDeploy will let the hook time out. The default timeout for CodeDeploy is 10 minutes.

CodeDeploy creates a new deployment for the instance to deploy the target revision of the deployment group. (The target revision is the last successfully deployed revision to the deployment group. It is maintained by CodeDeploy.) You will need to deploy to your deployment group at least once for CodeDeploy to identify the target revision. You can use the get-deployment-group CLI command or the CodeDeploy console to get the target revision for a deployment group.

While the deployment is running, it sends heartbeats to Auto Scaling to let it know that the instance is still being worked on.

If something goes wrong with the deployment, CodeDeploy will immediately tell Auto Scaling to ABANDON the instance launch. Auto Scaling terminates the instance and starts the process over again with a new instance.
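To confirm that a target revision exists before relying on the integration, you can read it back from the deployment group. The following is a minimal sketch with the AWS SDK for JavaScript; the application and deployment group names are placeholders.

// Sketch: check which revision CodeDeploy will deploy to newly launched instances.
var AWS = require('aws-sdk');
var codedeploy = new AWS.CodeDeploy({ region: 'us-east-1' });

var params = { applicationName: 'MyApp', deploymentGroupName: 'MyDeploymentGroup' };
codedeploy.getDeploymentGroup(params, function(err, data) {
  if (err) console.log(err, err.stack);
  else if (!data.deploymentGroupInfo.targetRevision) {
    console.log('No target revision yet: run at least one successful deployment first.');
  } else {
    console.log(JSON.stringify(data.deploymentGroupInfo.targetRevision, null, 2));
  }
});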

 

Best Practices

Now that we know how the CodeDeploy and Auto Scaling integration works, let’s go over some best practices when using the two services together:

Setting up or modifying Auto Scaling lifecycle hooks – We recommend that you do not try to set up or modify the Auto Scaling hooks manually because configuration errors could break the CodeDeploy integration.

Beware of failed deployments – When a deployment to a new instance fails, CodeDeploy will mark the instance for termination. Auto Scaling will terminate the instance, spin up a new instance, and notify CodeDeploy to start a deployment. This is great when you have transient errors. However, the downside is that if you have an issue with your target revision (for example, if there is an error in your deployment script), this cycle of launching and terminating instances can go into a loop. We recommend that you closely monitor deployments and set up Auto Scaling notifications to keep track of EC2 instances launched and terminated by Auto Scaling.
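As an example of the monitoring suggested above, you can have Auto Scaling publish launch and terminate events to an SNS topic so a deploy-and-terminate loop becomes visible quickly. This is a minimal sketch with the AWS SDK for JavaScript; the group name and topic ARN are placeholders.

// Sketch: send Auto Scaling launch/terminate notifications to an SNS topic.
var AWS = require('aws-sdk');
var autoscaling = new AWS.AutoScaling({ region: 'us-east-1' });

var params = {
  AutoScalingGroupName: 'MyAutoScalingGroup',
  TopicARN: 'arn:aws:sns:us-east-1:123456789012:asg-events', // placeholder topic
  NotificationTypes: [
    'autoscaling:EC2_INSTANCE_LAUNCH',
    'autoscaling:EC2_INSTANCE_TERMINATE',
    'autoscaling:EC2_INSTANCE_LAUNCH_ERROR'
  ]
};

autoscaling.putNotificationConfiguration(params, function(err, data) {
  if (err) console.log(err, err.stack);
  else console.log('Notification configuration saved');
});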

Troubleshooting Auto Scaling deployments – Troubleshooting deployments involving Auto Scaling groups can be challenging. If you have a failed deployment, we recommend that you disassociate the Auto Scaling group from the deployment group to prevent Auto Scaling from continuously launching and terminating EC2 instances. Next, add a tagged EC2 instance launched with the same base AMI to your deployment group, deploy the target revision to that EC2 instance, and use that to troubleshoot your scripts. When you are confident, associate the deployment group with the Auto Scaling group, deploy the golden revision to your Auto Scaling group, scale up a new EC2 instance (by adjusting Min, Max, and Desired values), and verify that the deployment is successful.

Ordering execution of launch scripts – The CodeDeploy agent looks for and executes deployments as soon as it starts. There is no ordering between the deployment execution and   launch scripts such as user data, cfn-init, etc. We recommend you install the host agent as part of (and maybe as the last step in) the launch scripts so that you can be sure the deployment won’t be executed until the instance has installed dependencies that are not part of your CodeDeploy deployment. If you prefer baking the agent into the base AMI, we recommend that you keep the agent service in a stopped state and use the launch scripts to start the agent service.

Associating multiple deployment groups with the same Auto Scaling group – In general, you should avoid associating multiple deployment groups with the same Auto Scaling group. When Auto Scaling scales up an instance with multiple hooks associated with multiple deployment groups, it sends notifications for all of the hooks at the same time. As a result, multiple CodeDeploy deployments are created. There are several drawbacks to this. These deployments are executed in parallel, so you won’t be able to depend on any ordering between them. If any of the deployments fail, Auto Scaling will immediately terminate the instance. The other deployments that were running will start to fail when the instance shuts down, but they may take an hour to time out.  The host agent processes only one deployment command at a time, so you have two more limitations to consider. First, it’s possible for one of the deployments to be starved for time and fail. This might happen, for example, if the steps in your deployment take more than five minutes to complete. Second, there is no preemption between deployments, so there is no way to enforce step ordering between one deployment and another. We therefore recommend that you minimize the number of deployment groups associated with an Auto Scaling group and consolidate the deployments into a single deployment.

We hope this deep dive into the Auto Scaling integration with CodeDeploy gives you the insight needed to use it effectively.  Are there other features or scenarios with CodeDeploy that you’d be interested in understanding the inner details better?  Let us know in the comments.

A Storm in a Tub of Shit

Post Syndicated from Йовко Ламбрев original http://yovko.net/%D0%B1%D1%83%D1%80%D1%8F-%D0%B2-%D0%BA%D0%B0%D1%86%D0%B0-%D1%81-%D0%BB%D0%B0%D0%B9%D0%BD%D0%B0/

That this is not how it's done, the way Finance Minister Goranov did it, is clear. Unless it is a step in some murky political dance, which could be rated anywhere from strange to alarming.
That the Interior Ministry must be reformed urgently and decisively is also clear. And the moment is right: the government obviously has the public's support, just reconfirmed more than emphatically in the local elections. When, if not now?
That Borisov will back down (hopefully only partially) is no less clear. Just as it is clear that there are certainly decent people in the Interior Ministry and that not all of them are corrupt. We all know such people, but... These decent, uncorrupted people have often kept their eyes wide shut to the outrages of their "colleagues."
Such decent people supported, at least in words and only at the very beginning, those summer protests that had the potential to change Bulgaria, but their superiors quickly bared their teeth at them, and they later showed up masked, snarling, and without identification badges against the civic discontent. And they beat people and broke the law... not only on the night of the white bus. Without the slightest effort to consider whether we might be on the same side of the barricade...
And now they expect understanding and sympathy?...
Those who are called upon to uphold law and order behaved like football hooligans yesterday and trampled the law without ceremony. They demonstrated force and boorishness. And once again they threatened citizens, in what capacity is unclear. Today they continue! (Are the decent people in the Interior Ministry having trouble seeing, or are they keeping collegially silent again?) And this is no small thing, because this is where deliberation stops. This is where everything ends and collapses...
P.S. The reactions of the DPS, the BSP, ombudsman Maya Manolova, and the unions speak unambiguously. And not for the better.
P.P.S. The fact that the DPS criticizes Rumyana Bachvarova yet stays as silent about Goranov as a decent policeman inside the Interior Ministry system is doubly curious. And just as dangerous!
P.P.P.S. And yes... I do not apologize for the title! Even less for the irony...

The Historical Paradox of the Police and the Protests

Post Syndicated from Longanlon original http://kaka-cuuka.com/3588


Yesterday's police protests over the proposed cuts to their salaries give me an occasion to look at the historical significance of the security forces for statehood and to describe the interesting paradox of the state as the monopolist of violence in modern society. In short, who watches the watchmen?

(Read more…) (550 words)


From the Time When, Part 2: Ottomalgia

Post Syndicated from Боян Юруков original http://feedproxy.google.com/~r/yurukov-blog/~3/W0TQ8Vz5bi8/

Under the looming threat that Moskov will introduce a tax on excessively fatty and sugary products, some businesses have to become more flexible in order to survive. That is exactly how a (for now unnamed) chain of banitsa bakeries came up with the idea of returning to the roots of one of its best-selling products: the baklava.
We all remember how delicious the baklava of our childhood was compared to the plastic sold today. And do you know how much better the one from the mid-19th century was? Incomparable in quality and craftsmanship. To convey that message, the bakery chain has borrowed the soc-nostalgia approach from another advertising campaign. In doing so, it will import something that has recently become popular in Turkey: Ottomalgia (from Ottoman + nostalgia).
If it can sell beer, why not baklava too?
Preliminary surveys show that these ads resonate well with nationalist and ethnic-party target groups. The campaign is easily recognized across the entire age spectrum with a superficial knowledge of history. Like the other advertising campaign, this one also has a regional focus and the potential to go viral riding a wave of angry comments and attacks.
I am curious, though, what your opinion is:
From the time when the authorities not only caught criminals, but also hanged them from the rope
From the time when we counted sheep, not calories
From the time when the tax was only 10%
From the time when all the food was organic
From the time when there was the village square instead of Facebook and Twitter
From the time when women spun yarn and did the laundry instead of hanging around BG-Mamma and the mall
From the time when families were big
From the time when maidens eloped to marry
From the time when the authorities built not highways but mosques and baths
From the time when the young went off to study abroad by train, not by plane
From the time when schools were high-quality cell schools
From the time when we tied up the teacher and the priest so the village would have peace and quiet
From the time when we were part of something truly big
From the time when we traded from Tunis to Dubai, from Vienna to Sudan
From the time when Blagoevgrad was just a garrison
From the time when Istanbul was the capital
From the time when Sofia was Serdica
From the time when Stara Zagora was Eski Zara
From the time when Dobrich was Hadzhioglu Pazardzhik
From the time when Smolyan was Ahi Chelebi
From the time when Sozopol was Sizebolu
This article is satire. I shouldn't have to say it explicitly, but it's a big world with all kinds of people. Any use of these collages outside this context will be ridiculed and reprimanded in front of the collective.


YateUCN – the solution for MVNO networks

Post Syndicated from Yate Team original http://blog.yate.ro/2015/11/02/yateucn-the-solution-for-mvno-networks/

With mobile consumers’ expectations on the rise, new business models proliferate. Mobile Virtual Network Operator solutions must differentiate to stay competitive and maximize their offerings. MVNOs wishing to offer subscribers high quality voice and/or data services can use YateUCN as a GMSC (voice), a GGSN (GPRS), or a PGW (LTE data). YateUCN supports billing integration […]

Has a Site Been Hacked? An Easy Guide for Journalists

Post Syndicated from Боян Юруков original http://feedproxy.google.com/~r/yurukov-blog/~3/kEIfJwXnGFE/

In the last 10 days, the websites of several state institutions were unavailable for significant stretches of time. This was something of a problem on election day because of the page of the CIK (the Central Election Commission). Hardly anyone would have noticed that the sites of the Interior Ministry, GRAO, and the President were down if journalists had not sounded the alarm. From my observations with the @GovAlertEu project, people rarely pay them any attention at all. The DANS website is flimsy to begin with and often goes down over nothing.
Journalists were quick to raise a fuss about how the country's main websites had been hacked. Much could be said about what nonsense all of this is, and Bozho has described it well. There you can also read about the most likely reasons someone would pay for such a coordinated attack.
It is understandable, though, that even well-meaning journalists get confused by the terminology. It wouldn't be the first time. It is easy to slip into phrases that bring in more readers, however reluctant you are to admit that you are playing in the same pigsty as the tabloid media.
So I decided to help. I present the short guide for journalists on the topic "Has a site been hacked?". You can also download it in a print-friendly format.
[Flowchart: "Has a site been hacked?"]
Important footnotes:
*: This sign may take the form of a complete defacement of the site, replaced news items and documents, posted political, religious, racist, or other messages, or simply the signature of those responsible.
**: Although DDoS attacks are the most interesting from a media standpoint, there are many reasons a site can go down: a server problem or update, a temporary outage in the state institution's network, a problem with your own internet connection, or simply a careless civil servant.


Using AWS Lambda for Event-driven Data Processing Pipelines

Post Syndicated from Vadim Astakhov original https://blogs.aws.amazon.com/bigdata/post/Tx462DZWHF1WPN/Using-AWS-Lambda-for-Event-driven-Data-Processing-Pipelines

Vadim Astakhov is a Solutions Architect with AWS

Some big data customers want to analyze new data in response to a specific event, and they might already have well-defined pipelines to perform batch processing, orchestrated by AWS Data Pipeline. One example of event-triggered pipelines is when data analysts must analyze data as soon as it arrives, so that they can immediately respond to partners. Scheduling is not an optimal solution in this situation. The main question is how to schedule data processing at an arbitrary time using Data Pipeline, which relies on schedulers.

Here’s a solution. First, create a simple pipeline and test it with data from Amazon S3, then add an Amazon SNS topic to notify the customer when the pipeline is finished so data analysts can review the result. Lastly, create an AWS Lambda function to activate Data Pipeline when new data is successfully committed into an S3 bucket—without managing any scheduling activity. This post will show you how.

Solution that activates Data Pipeline when new data is committed to S3

When Data Pipeline activity can be scheduled, customers can define preconditions that check whether data exists on S3 and then allocate resources. However, Lambda is a good mechanism when Data Pipeline needs to be activated at an arbitrary time.

Cloning pipelines for future use

In this scenario, the customer’s pipeline has been activated through some scheduled activity but the customer wants to be able to invoke the same pipeline in response to an ad-hoc event such as a new data commit to an S3 bucket. The customer has already developed a “template” pipeline that has reached the Finished state.

One way to re-initiate the pipeline is to keep the JSON file with the pipeline definition on S3 and use it to create a new pipeline. Some customers have multiple versions of the same pipeline stored on S3 but want to clone and reuse only the version that was most recently executed. A lightweight way to accommodate such a request is to get the pipeline definition from the finished pipeline and create a clone. This approach relies on recently executed pipelines and does not require the customer to keep a registry of pipeline versions on S3 and track which version was executed most recently.

Even if customers want to maintain such a registry of pipelines on S3, they might also want to get a pipeline definition on-the-fly from an existing pipeline using the Lambda API. They could have complicated, event-driven workflows where they need to clone finished pipelines, re-run them, and then delete the cloned pipelines. That's why it is important first to detect pipelines in the Finished state.

In this post, I demonstrate how you can accomplish such on-the-fly pipeline cloning. There is no direct clone API in Data Pipeline, so you implement this by making several API calls. I also provide code for deleting old clones that have finished.

Three-step workflow

Create a simple pipeline for testing.

Create an SNS notification to notify analysts that the pipeline has finished.

Create a Lambda function to activate the pipeline when new data gets committed to an S3 bucket.

Step 1: Create a simple pipeline

Create a simple pipeline

Open the AWS Data Pipeline console.

If you haven’t created a pipeline in this region, the console displays an introductory screen. Choose Get started now. If you’ve already created a pipeline in this region, the console displays a page that lists your pipelines for the region. Choose Create new pipeline.

Enter a name and description.

Select an Elastic MapReduce (EMR) template and choose Run once on pipeline activation.

In the Step field, enter the following:

/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://elasticmapreduce/samples/wordcount/input,-output,s3://example-bucket/wordcount/output/#{@scheduledStartTime},-mapper,s3n://elasticmapreduce/samples/wordcount/wordSplitter.py,-reducer,aggregate

You can adjust the number of Amazon EMR cluster nodes and select distributions. For more information about creating pipelines, see Getting Started with AWS Data Pipeline.

Step 2: Create an SNS topic

To create an SNS topic:

In a new browser tab, open the Amazon SNS console.

Choose Create topic.

In the Topic name field, type a topic name.

Choose Create topic.

Select the new topic and then choose the topic ARN. The Topic Details page appears.

Topic Details page

Copy the topic ARN for the next task.

Create the subscription for that topic and provide your email address. AWS sends email to confirm your subscription.             
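If you prefer to script this step instead of using the console, the equivalent calls with the AWS SDK for JavaScript look roughly like the sketch below; the topic name and email address are placeholders, not values from this post.

// Sketch: create the SNS topic and an email subscription programmatically.
var AWS = require('aws-sdk');
var sns = new AWS.SNS({ region: 'us-east-1' });

sns.createTopic({ Name: 'pipeline-finished' }, function(err, topic) {
  if (err) return console.log(err, err.stack);
  console.log('Topic ARN: ' + topic.TopicArn); // paste this ARN into the pipeline's SnsAlarm action
  sns.subscribe({
    TopicArn: topic.TopicArn,
    Protocol: 'email',
    Endpoint: 'analyst@example.com' // placeholder address; AWS emails a confirmation link
  }, function(err, sub) {
    if (err) console.log(err, err.stack);
    else console.log('Subscription pending confirmation: ' + sub.SubscriptionArn);
  });
});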

To configure the topic notification action in the pipeline:

In the AWS Data Pipeline console, open your pipeline in the Architect window.

In the right pane, choose Others.

Under DefaultAction1, do the following:

Enter the name for your notification (for example, MyEMRJobNotice).

In the  Type field, choose SnsAlarm.

In the Subject field, enter the subject line for your notification.

In the Topic Arn field, enter the ARN of your topic.

In the Message field, enter the message content.

Leave Role set to the default value.

 Save and activate your pipeline to ensure that it can be executed successfully.

Step 3: Create a Lambda function

On the Lambda console, choose Create a Lambda function. You can select a blueprint or just skip the first step and proceed with Step 2: Configure function, where you provide a function name (such as LambdaDP) and a description, and choose Node.js as the value for the Runtime field.

Your test pipeline is now finished. Re-running a finished pipeline is not currently supported, so to re-run it you clone the pipeline from the template and let Lambda trigger the new clone. You'll also want Lambda to clean up old clones each time it creates a new one. Below are helpful functions to do that. On the Lambda console, use the Code entry type and Edit code inline fields, and start with the following:

console.log('Loading function');
var AWS = require('aws-sdk');
exports.handler = function(event, context) {
var datapipeline = new AWS.DataPipeline();
var pipeline2delete = 'None';
var pipeline = 'df-02….T';
// ...
}

Define your pipeline ID and create a variable for your cloned pipeline IDs, such as pipeline2delete. Then, add a function to check for existing clones left from previous runs, as follows:

//Iterate over the list of pipelines and check if the pipeline clone already exists
datapipeline.listPipelines(paramsall, function(err, data) {
if (err) { console.log(err, err.stack); } // an error occurred
else { console.log(data); // successful response
for (var i in data.pipelineIdList){
if (data.pipelineIdList[i].name == 'myLambdaSample') {
pipeline2delete = data.pipelineIdList[i].id;
console.log('Pipeline clone id to delete: ' + pipeline2delete);
};

If the finished clone from a previous run has been identified, you must invoke the delete function within this loop. The sample code to do that is as follows:

var paramsd = {pipelineId: pipeline2delete /* required */};
datapipeline.deletePipeline(paramsd, function(err, data) {
if (err) { console.log(err, err.stack); } // an error occurred
else console.log('Old clone deleted ' + pipeline2delete + ' Create new clone now');
});

Finally, you need to make three API calls to create a new clone from your original Data Pipeline template. The APIs you can use are as follows:

getPipelineDefinition (for the finished pipeline)

createPipeline

putPipelineDefinition (from #1)

Below are examples of those three calls.

1. Use this pipeline’s definition to create the next clone:

var params = {pipelineId: pipeline};
datapipeline.getPipelineDefinition(params, function(err, definition) {
if (err) console.log(err, err.stack); // an error occurred
else {
var params = {
name: 'myLambdaSample', /* required */
uniqueId: 'myLambdaSample' /* required */
};

2. Use the pipeline definition from the definition object:

datapipeline.createPipeline(params, function(err, pipelineIdObject) {
if (err) console.log(err, err.stack); // an error occurred
else { //new pipeline created with id=pipelineIdObject.pipelineId
console.log(pipelineIdObject); // successful response
//Create and activate pipeline
var params = {
pipelineId: pipelineIdObject.pipelineId,
pipelineObjects: definition.pipelineObjects //(you can add parameter objects and values)

3. Use the definition from the getPipelineDefinition API result:

datapipeline.putPipelineDefinition(params, function(err, data) {
if (err) console.log(err, err.stack);
else {
datapipeline.activatePipeline(pipelineIdObject, function(err, data) { //Activate the pipeline finally
if (err) console.log(err, err.stack);
else console.log(data);
});
}
});
}});
}});

Now you have all the function calls for the Lambda function; you can also wrap them as independent helper functions. To finish configuring the function:

Enter a value for the Handler field as the name of your function (LambdaDP.index).

For Role, select a role that grants access to resources like S3 and Data Pipeline.

Keep the default Memory and Timeout values.

Choose Next, review your function, and choose Create function.

In the Event source field, choose S3.

Provide the bucket name used by the pipeline.

In Event type, choose Put, which activates your pipeline when the new file is committed to the bucket.

Save the function and upload a data file to your S3 bucket.

Check the Data Pipeline console to make sure that the new pipeline has been created and activated (you should get an SNS notification when the pipeline is finished).
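When the Put event fires, Lambda passes the handler an S3 event document. Before cloning and activating the pipeline, it can be useful to inspect that payload; the sketch below shows the general shape of the event and is an illustrative addition, not part of the original walkthrough.

// Sketch: read the bucket and key out of the S3 Put event that triggered the function.
exports.handler = function(event, context) {
  var record = event.Records && event.Records[0];
  if (!record || record.eventSource !== 'aws:s3') {
    return context.fail('Error', 'Unexpected event source');
  }
  var bucket = record.s3.bucket.name;
  var key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
  console.log('New object committed: s3://' + bucket + '/' + key);
  // ...then run the clone-and-activate logic shown in the Appendix below.
  context.succeed();
};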

Conclusion

Congratulations! You have successfully cloned and launched your pipeline from a Lambda function to perform data processing after successfully committing new data to the S3 bucket. You can continue evolving your workflow to include other AWS services, such as Amazon Redshift, Amazon RDS for MySQL, and Amazon DynamoDB.

If you have questions or suggestions, please leave a comment below.

Appendix

Below is a template of the Lambda function that uses all function calls discussed above. This template is only a starting point and isn’t meant for a production environment.

console.log('Loading function');

var AWS = require('aws-sdk');
//var s3 = new AWS.S3({ apiVersion: '2012-10-29' });

exports.handler = function(event, context) {
  var datapipeline = new AWS.DataPipeline();
  var pipeline2delete = 'None';
  var pipeline = 'df-02364022NP3BYIO2UPBT';

  var paramsall = {
    marker: ''
  };

  // Check if a pipeline clone already exists
  datapipeline.listPipelines(paramsall, function(err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
      context.fail('Error', 'Error getting list of pipelines: ' + err);
    }
    else {
      console.log(data); // successful response
      for (var i in data.pipelineIdList) {
        if (data.pipelineIdList[i].name == 'myLambdaSample') {
          pipeline2delete = data.pipelineIdList[i].id;
          console.log('Pipeline id to delete: ' + pipeline2delete);

          var paramsd = {
            pipelineId: pipeline2delete /* required */
          };
          datapipeline.deletePipeline(paramsd, function(err, data) {
            if (err) {
              console.log(err, err.stack); // an error occurred
              context.fail('Error', 'Error deleting pipelines: ' + err);
            }
            else console.log('Old clone deleted ' + pipeline2delete + ' Create new clone now'); // successful response
          });
        }
        else console.log('No clones to delete');
      }
    }
  });

  var params = {
    pipelineId: pipeline
  }; // Using this pipeline's definition to create the next clone
  datapipeline.getPipelineDefinition(params, function(err, definition) {
    if (err) {
      console.log(err, err.stack); // an error occurred
      context.fail('Error', 'Error getting pipeline definition: ' + err);
    }
    else {
      var params = {
        name: 'myLambdaSample', /* required */
        uniqueId: 'myLambdaSample' /* required */
      }; // the definition object contains the pipeline definition
      datapipeline.createPipeline(params, function(err, pipelineIdObject) {
        if (err) {
          console.log(err, err.stack); // an error occurred
          context.fail('Error', 'Error creating pipeline: ' + err);
        }
        else { // new pipeline created with id=pipelineIdObject.pipelineId
          console.log(pipelineIdObject); // successful response
          // Create and activate the pipeline
          var params = {
            pipelineId: pipelineIdObject.pipelineId,
            pipelineObjects: definition.pipelineObjects // (you can add parameter objects and values too)
          }; // Use the definition from the getPipelineDefinition API result
          datapipeline.putPipelineDefinition(params, function(err, data) {
            if (err) {
              console.log(err, err.stack);
              context.fail('Error', 'Error putting pipeline definition: ' + err);
            }
            else {
              datapipeline.activatePipeline(pipelineIdObject, function(err, data) { // Finally, activate the pipeline
                if (err) {
                  console.log(err, err.stack);
                  context.fail('Error', 'Error activating pipeline: ' + err);
                }
                else {
                  console.log(data);
                  context.succeed();
                }
              });
            }
          });
        }
      });
    }
  });
};

——————-

Related

Automating Analytic Workflows on AWS

Under the Hood: Delivering the First Free Global Live Stream of an NFL Game on Yahoo

Post Syndicated from yahoo original https://yahooeng.tumblr.com/post/132155634066

P.P.S. Narayan, VP of Engineering

On Sunday, October 25, Yahoo delivered the first-ever, global live stream of a regular season NFL game to football fans around the world, for free, across devices. Our goal was to distribute the game over the Internet and provide a broadcast-quality experience. Leveraging our focus on consumer products, we worked to identify features and experiences that would be unique for users enjoying a live stream for the first time. In other words, we wanted to make you feel like you were watching on TV, but make the experience even better.

For us, success was twofold: provide the best quality viewing experience and deliver that quality at global scale. In this blog, we will talk about some key technology innovations that helped us achieve this for over 15M unique viewers in 185 countries across the world.

On the technical side, the HD video signal was shipped from London to our encoders in Dallas and Sunnyvale, where it was converted into Internet video. The streams were transcoded (compression that enables efficient network transmission) into 9 bitrates ranging from 6Mbps to 300kbps. We also provided a framerate of 60 frames per second (fps), in addition to 30fps, thus allowing for smooth video playback suited for a sport like NFL football. Having a max bitrate of 6Mbps with 60fps gave a “wow” factor to the viewing experience, and was a first for NFL and sports audiences.

One special Yahoo addition to the programming was an overlaid audio commentary from our Yahoo Studio in Sunnyvale. It was as if you were watching the game alongside our Yahoo Sports experts on your couch. This unique Yahoo take gave NFL viewers a whole new way to experience the game.

Figure 1: High-level Architecture for NFL Live Streaming

Quality Viewing Experience

Our goal was to deliver a premium streaming quality that would bring users a best-in-class viewing experience, similar to TV, one that was extremely smooth and uninterrupted. This meant partnering with multiple CDNs to get the video bits as close to the viewer as possible, optimizing bandwidth usage, and making the video player resilient to problems on the Internet or the user’s network.

Multiple CDNs

In addition to Yahoo’s own Content Delivery Network (CDN) and network infrastructure, which are capable of delivering video around the world, we partnered with six CDNs and Internet Service Providers (ISPs). The NFL game streams were available across all seven CDNs; however, we wanted to route the viewer to the most suitable CDN server based on multiple factors: device, resolution, user-agent, geography, app or site, cable/DSL network, and so on. We built sophisticated capabilities in our platform to be able to define routing and quality policy decisions. The policy engine served more than 80M requests during the game. Routing policy rules were adjusted based on CDN performance and geographies. For example, we were able to reduce the international traffic to one underperforming CDN during the game and the changes were propagated in under six seconds across viewers. Such capabilities delivered a near flawless viewing experience. During the game, we served on average about 5Tbps across the CDNs, and at peak we were serving 7Tbps of video to viewers globally.
Bitrates and Adaptation

Viewers of video streaming on the Internet are all too familiar with the visual aspects of poor quality: the infamous “spinner,” technically termed re-buffering; the blockiness of the video that represents low bitrates; jerkiness of the video, which could be due to frame drops; and plain old errors that show up on the screen.

Since we had nine bitrates available, our Yahoo player could use sophisticated techniques to measure the bandwidth (or network capacity) on a user’s home or service provider network, and pick the best bitrate to minimize buffering or errors. Such adaptive bitrate (ABR) algorithms make the viewing experience smooth. Since we supported 60fps streams, the algorithm also monitored frame drops to decide if the device was capable of supporting the high frame rate. It then reacted appropriately by switching to the 30fps stream if necessary.

Figure 2: Player ABR reacting to CDN that had capacity issues or errors
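The post does not include the actual ABR logic, so here is a minimal, hypothetical JavaScript sketch of the kind of heuristic described above: pick the highest bitrate the measured bandwidth can sustain, and fall back from 60fps to 30fps when frame drops pile up. The bitrate ladder and thresholds are illustrative, not Yahoo's actual values.

// Hypothetical ABR heuristic sketch; not Yahoo's actual player code.
// Bitrates (kbps) roughly matching a nine-rung ladder from 300kbps to 6Mbps.
var LADDER = [300, 600, 1000, 1600, 2400, 3200, 4500, 5200, 6000];

function pickRendition(measuredKbps, droppedFrameRatio, currentFps) {
  // Leave ~20% headroom so small bandwidth dips don't cause re-buffering.
  var budget = measuredKbps * 0.8;
  var bitrate = LADDER[0];
  for (var i = 0; i < LADDER.length; i++) {
    if (LADDER[i] <= budget) bitrate = LADDER[i];
  }
  // If the device keeps dropping frames at 60fps, step down to 30fps.
  var fps = currentFps;
  if (currentFps === 60 && droppedFrameRatio > 0.1) fps = 30;
  return { bitrate: bitrate, fps: fps };
}

// Example: a 4 Mbps connection on a device struggling at 60fps.
console.log(pickRendition(4000, 0.15, 60)); // { bitrate: 3200, fps: 30 }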
Broad Audience Reach

We wanted our global audience to be able to watch this stream anywhere in the world, on any device. So we delivered it on laptops and desktops, on phones and tablets, and to the ardent fans on big-screen TVs, game consoles and other connected devices. Our destination page, which provided a full-screen experience of the game on web and mobile web, was built on node.js and React and extensively optimized for page load and startup latency. In addition, we launched the NFL experience in our key mobile apps: Yahoo, Tumblr, Yahoo Sports and Yahoo Sports Fantasy.

Pure HTML5 on Safari

We brought users pure HTML5 video delivery on the Safari web browser. There is currently an industry-wide move away from Flash, and Yahoo is no exception. As the first step toward achieving this goal, we deployed a “pure” HTML5 player on Safari for the NFL live stream. Making this leap had a positive impact on millions of viewers during the game.

Connected Devices & TV Experience

Our objective was to create a connected-TV video experience better than cable or satellite TV. In just a few months, we were able to develop and deploy on nine different connected-device platforms and on 60+ different TV models.

We wanted a large percentage of our big-screen viewers to experience the 60fps streams. However, we soon realized that even on modern devices this was not easily feasible due to memory, CPU and firmware limitations. So we conducted hundreds of hours of testing to arrive at the right stream configuration for each device. We developed automation tools to quickly validate stream configurations from various CDNs, and created a computer vision (camera automation) test tool to monitor and verify video quality and stability across devices.

Chromecast

Because NFL games are traditionally viewed on television, we wanted to provide viewers an easy way to watch the NFL/Yahoo live stream on their big screens. In addition to connected TV apps, we built Chromecast support into our apps for iOS and Android, allowing viewers to cast the game to big-screen TVs from their mobile devices. To ensure a high-quality, uninterrupted cast, we also built a custom Chromecast receiver app with the same improved resiliency through robust recovery algorithms. Judging by the engagement on our Chromecast streams, we consistently matched or surpassed the viewing times on other experiences.

Global Scale

Yahoo operates multiple data centers across the US and the world for service reliability and capacity. We also have dozens of smaller points of presence (POPs) located close to major population centers to provide a low-latency connection to Yahoo’s infrastructure. Our data centers and POPs are connected by a highly redundant private backbone network. For the NFL game, we upgraded our network and POPs to handle the extra load. We also worked with the CDN vendors to set up new peering points to efficiently route traffic into their networks.

As part of running Internet-scale applications, we always build our software to take advantage of Yahoo’s multiple data centers. Every system has a backup, and in most cases each backup has another backup. Our architecture and contingency plans account for multiple simultaneous failures.

During an NFL game, which typically lasts just under four hours, there is a very small margin of error for detecting and fixing streaming issues. Real-time metrics, as well as detailed data from our backend systems, provide a high-fidelity understanding of the stream quality that viewers are experiencing. Yahoo is a world leader in data, analytics and real-time data processing, so we made extensive use of our data infrastructure, including Hadoop, to provide industry-leading operational metrics during the game.

Player Instrumentation

The Yahoo video player has extensive instrumentation to track everything happening during video playback, and this data is regularly beaconed back to our data centers. The data includes service quality metrics like first-video-frame latency, bitrate, observed bandwidth, buffering and frame drops. The beacons are processed in real time, and we have dashboards showing KPIs like the number of concurrent viewers, total video starts and re-buffering ratio, broken down by dimensions like CDN, device, OS and geography. A sketch of what one such beacon might look like follows.
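This is a minimal TypeScript sketch of what such a quality beacon might carry. The field names, the endpoint and the use of navigator.sendBeacon are assumptions; the post does not describe the actual beacon format.

// Hypothetical quality-of-service beacon; field names and endpoint are illustrative.
interface QosBeacon {
  sessionId: string;
  timestampMs: number;
  firstFrameLatencyMs: number;   // time from play request to first rendered frame
  bitrateKbps: number;           // currently selected rendition
  measuredBandwidthKbps: number;
  bufferingMs: number;           // time spent re-buffering in this interval
  droppedFrames: number;
  cdn: string;
  device: string;
}

const BEACON_URL = "https://example.invalid/qos"; // placeholder endpoint

// Send one beacon per reporting interval; sendBeacon is fire-and-forget,
// so it does not block or disturb playback.
function reportQos(beacon: QosBeacon): void {
  const payload = JSON.stringify(beacon);
  if (typeof navigator !== "undefined" && "sendBeacon" in navigator) {
    navigator.sendBeacon(BEACON_URL, payload);
  } else {
    // Fallback for environments without sendBeacon (e.g. some connected TVs).
    void fetch(BEACON_URL, { method: "POST", body: payload, keepalive: true });
  }
}

// Example: report a 10-second interval of playback.
reportQos({
  sessionId: "abc123",
  timestampMs: Date.now(),
  firstFrameLatencyMs: 850,
  bitrateKbps: 6000,
  measuredBandwidthKbps: 9200,
  bufferingMs: 0,
  droppedFrames: 2,
  cdn: "cdnA",
  device: "desktop",
});

Aggregating events like these server-side is what feeds the real-time dashboards discussed next.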
These real-time dashboards enabled our operations team to make decisions about routing policies and CDN switches in real time, based on quality metrics. In terms of scale, our beacon servers peaked at more than 225K events per second and handled about two billion events in total, roughly 4TB of data during the game.

Backend APIs

Prior to the NFL streaming event, we had designed the backend APIs to deliver at scale, with low latency and high availability. During the game, we served 216 million API calls, with a median latency of 11ms and a 95th-percentile latency of 16ms. The APIs showed six 9s of availability during this period.

Our systems are instrumented exhaustively to obtain real-time feedback on performance. These metrics were available for monitoring through dashboards, and were also used for alerting when performance breached acceptable thresholds.

The Take-Away

Pioneering the delivery of a smooth 60fps live video experience to millions of users around the world was a significant undertaking. Huge thanks to the team for executing against our vision; it was a coordinated effort across Yahoo.

While much of our technology and infrastructure was already set up to handle the scale and load (one of the reasons the NFL chose us), in preparation for the main event we designed a new destination page and enhanced our mobile applications. We also enhanced our control and recovery mechanisms and expanded our infrastructure to handle the huge traffic of the game. We worked hard to ensure that the experience was available on every Internet-connected device. We tuned our video players to deliver the optimal video stream, taking into account device, connectivity, location and ISP. Behind everything was our massive analytical system that measured and aggregated all aspects of quality and engagement. We conducted comprehensive tests with our partners so that game day would be successful.

In the end, the game exceeded our high expectations, setting a bar for quality and scale for live Internet broadcasts to come. We are thrilled and proud of the experience we delivered, and the reception and accolades from our community of users have been gratifying.

Looking to the future, we expect live sporting events to be routinely streamed over the Internet to massive global audiences. People will expect these broadcasts to be flawless, with better-than-HD quality. October 25, 2015 was a significant step toward this vision. Yahoo, as a leading technology company and a top destination for sports, is proud of our role in setting a new standard for sports programming. We look forward to making other global-scale broadcasts like the NFL game happen in the future.

Want to help? Email me at [email protected] and we can talk about opportunities on our team.

The CA’s Role in Fighting Phishing and Malware

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org//2015/10/29/phishing-and-malware.html

Since we announced Let’s Encrypt we’ve often been asked how we’ll ensure that we don’t issue certificates for phishing and malware sites. The concern most commonly expressed is that having valid HTTPS certificates helps these sites look more legitimate, making people more likely to trust them.

Deciding what to do here has been tough. On the one hand, we don’t like these sites any more than anyone else does, and our mission is to help build a safer and more secure Web. On the other hand, we’re not sure that certificate issuance (at least for Domain Validation) is the right level on which to be policing phishing and malware sites in 2015. This post explains our thinking in order to encourage a conversation about the CA ecosystem’s role in fighting these malicious sites.

CAs Make Poor Content Watchdogs

Let’s Encrypt is going to be issuing Domain Validation (DV) certificates. On a technical level, a DV certificate asserts that a public key belongs to a domain – it says nothing else about a site’s content or who runs it. DV certificates do not include any information about a website’s reputation, real-world identity, or safety. However, many people believe the mere presence of a DV certificate ought to connote at least some of these things.

Treating a DV certificate as a kind of “seal of approval” for a site’s content is problematic for several reasons.

First, CAs are not well positioned to operate anti-phishing and anti-malware programs – or to police content more generally. They simply do not have sufficient ongoing visibility into sites’ content. The best CAs can do is check with organizations that have much greater content awareness, such as Microsoft and Google. Google and Microsoft consume vast quantities of data about the Web from massive crawling and reporting infrastructures. This data allows them to use complex machine learning algorithms (developed and operated by dozens of staff) to identify malicious sites and content.

Even if a CA checks for phishing and malware status with a good API, the CA’s ability to accurately express information regarding phishing and malware is extremely limited. Site content can change much faster than certificate issuance and revocation cycles, phishing and malware status can be page-specific, and certificates and their related browser UIs contain little, if any, information about phishing or malware status. When a CA doesn’t issue a certificate for a site with phishing or malware content, users simply don’t see a lock icon. Users are much better informed and protected when browsers include anti-phishing and anti-malware features, which typically do not suffer from any of these limitations.

Another issue with treating DV certificates as a “seal of approval” for site content is that there is no standard for CA anti-phishing and anti-malware measures beyond a simple blacklist of high-value domains, so enforcement is inconsistent across the thousands of CAs trusted by major browsers. Even if one CA takes extraordinary measures to weed out bad sites, attackers can simply shop around to different CAs. The bad guys will almost always be able to get a certificate and hold onto it long enough to exploit people. It doesn’t matter how sophisticated the best CA anti-phishing and anti-malware programs are; it only matters how good the worst are. It’s a “find the weakest link” scenario, and weak links aren’t hard to find.

Browser makers have realized all of this. That’s why they are pushing phishing and malware protection features, and evolving their UIs to more accurately reflect the assertions that certificates actually make.

TLS No Longer Optional

When they were first developed in the 1990s, HTTPS and SSL/TLS were considered “special” protections that were only necessary or useful for particular kinds of websites, like online banks and shopping sites accepting credit cards. We’ve since come to realize that HTTPS is important for almost all websites. It’s important for any website that allows people to log in with a password, any website that tracks its users in any way, any website that doesn’t want its content altered, and any site that offers content people might not want others to know they are consuming. We’ve also learned that any site not secured by HTTPS can be used to attack other sites.

TLS is no longer the exception, nor should it be. That’s why we built Let’s Encrypt. We want TLS to be the default method for communication on the Web. It should just be a fundamental part of the fabric, like TCP or HTTP. When this happens, having a certificate will become an existential issue, rather than a value add, and content policing mistakes will be particularly costly. On a technical level, mistakes will lead to significant down time due to a slow issuance and revocation cycle, and features like HSTS. On a philosophical and moral level, mistakes (innocent or otherwise) will mean censorship, since CAs would be gatekeepers for online speech and presence. This is probably not a good role for CAs.

Our Plan

At least for the time being, Let’s Encrypt is going to check with the Google Safe Browsing API before issuing certificates, and refuse to issue to sites that are flagged as phishing or malware sites. Google’s API is the best source of phishing and malware status information that we have access to, and attempting to do more than query this API before issuance would almost certainly be wasteful and ineffective.
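As a rough illustration of what a pre-issuance check like this involves, here is a minimal TypeScript sketch against the Google Safe Browsing Lookup API, shown in its current v4 threatMatches:find form. The CA software behind Let’s Encrypt is not written in TypeScript, the API version it used in 2015 may differ, and the client identifiers and helper names below are hypothetical; this is only a sketch of the general approach described above, not the project’s actual code.

// Hypothetical pre-issuance check against the Google Safe Browsing Lookup API (v4).
const SAFE_BROWSING_ENDPOINT =
  "https://safebrowsing.googleapis.com/v4/threatMatches:find";

async function isFlagged(domain: string, apiKey: string): Promise<boolean> {
  const body = {
    client: { clientId: "example-ca", clientVersion: "1.0" },
    threatInfo: {
      threatTypes: ["MALWARE", "SOCIAL_ENGINEERING"], // malware and phishing
      platformTypes: ["ANY_PLATFORM"],
      threatEntryTypes: ["URL"],
      threatEntries: [
        { url: `http://${domain}/` },
        { url: `https://${domain}/` },
      ],
    },
  };

  const res = await fetch(`${SAFE_BROWSING_ENDPOINT}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Safe Browsing lookup failed: ${res.status}`);

  // An empty object means no matches; a "matches" array means the URL is flagged.
  const result: { matches?: unknown[] } = await res.json();
  return (result.matches?.length ?? 0) > 0;
}

// Sketch of how the check might gate issuance.
async function issueIfClean(domain: string, apiKey: string): Promise<void> {
  if (await isFlagged(domain, apiKey)) {
    throw new Error(`Refusing to issue: ${domain} is flagged by Safe Browsing`);
  }
  // ...continue with domain validation and certificate issuance...
}

If the lookup flags the domain, issuance stops; otherwise validation proceeds as usual.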

We’re going to implement this phishing and malware status check because many people are not comfortable with CAs entirely abandoning anti-phishing and anti-malware efforts just yet, even for DV certificates. We’d like to continue the conversation for a bit longer before we abandon what many people perceive to be an important CA behavior, even though we disagree.

Conclusion

The fight against phishing and malware content is an important one, but it does not make sense for CAs to be on the front lines, at least when it comes to DV certificates. That said, we’re going to implement checks against the Google Safe Browsing API while we continue the conversation.

We look forward to hearing what you think. Please let us know.
