Tag Archives: fine

The Benefits of Side Projects

Post Syndicated from Bozho original https://techblog.bozho.net/the-benefits-of-side-projects/

Side projects are the things you do at home, after work, for your own “entertainment”, or to satisfy your desire to learn new stuff, in case your workplace doesn’t give you that opportunity (or at least not enough of it). Side projects are also a way to build stuff that you think is valuable but not necessarily “commercialisable”. Many side projects are open-sourced sooner or later and some of them contribute to the pool of tools at other people’s disposal.

I’ve outlined one recommendation about side projects before – do them with technologies that are new to you, so that you learn important things that will keep you better positioned in the software world.

But there are more benefits than that – serendipitous benefits, for example. And I’d like to tell some personal stories about that. I’ll focus on a few examples from my list of side projects to show how, through a sort-of butterfly effect, they helped shape my career.

The computoser project, no matter how cool algorithmic music composition, didn’t manage to have much of a long term impact. But it did teach me something apart from niche musical theory – how to read a bulk of scientific papers (mostly computer science) and understand them without being formally trained in the particular field. We’ll see how that was useful later.

Then there was the “State alerts” project – a website that scraped content from public institutions in my country (legislation, legislation proposals, decisions by regulators, new tenders, etc.), made them searchable, and “subscribable” – so that you get notified when a keyword of interest is mentioned in newly proposed legislation, for example. (I obviously subscribed for “information technologies” and “electronic”).

And that project turned out to have a significant impact on the following years. First, I chose a new technology to write it with – Scala. Which turned out to be of great use when I started working at TomTom, and on the 3rd day I was transferred to a Scala project, which was way cooler and much more complex than the original one I was hired for. It was a bit ironic, as my colleagues had just read that “I don’t like Scala” a few weeks earlier, but nevertheless, that was one of the most interesting projects I’ve worked on, and it went on for two years. Had I not known Scala, I’d probably be gone from TomTom much earlier (as the other project was restructured a few times), and I would not have learned many of the scalability, architecture and AWS lessons that I did learn there.

But the very same project had an even more important follow-up. Because if its “civic hacking” flavour, I was invited to join an informal group of developers (later officiated as an NGO) who create tools that are useful for society (something like MySociety.org). That group gathered regularly, discussed both tools and policies, and at some point we put up a list of policy priorities that we wanted to lobby policy makers. One of them was open source for the government, the other one was open data. As a result of our interaction with an interim government, we donated the official open data portal of my country, functioning to this day.

As a result of that, a few months later we got a proposal from the deputy prime minister’s office to “elect” one of the group for an advisor to the cabinet. And we decided that could be me. So I went for it and became advisor to the deputy prime minister. The job has nothing to do with anything one could imagine, and it was challenging and fascinating. We managed to pass legislation, including one that requires open source for custom projects, eID and open data. And all of that would not have been possible without my little side project.

As for my latest side project, LogSentinel – it became my current startup company. And not without help from the previous two mentioned above – the computer science paper reading was of great use when I was navigating the crypto papers landscape, and from the government job I not only gained invaluable legal knowledge, but I also “got” a co-founder.

Some other side projects died without much fanfare, and that’s fine. But the ones above shaped my “story” in a way that would not have been possible otherwise.

And I agree that such serendipitous chain of events could have happened without side projects – I could’ve gotten these opportunities by meeting someone at a bar (unlikely, but who knows). But we, as software engineers, are capable of tilting chance towards us by utilizing our skills. Side projects are our “extracurricular activities”, and they often lead to unpredictable, but rather positive chains of events. They would rarely be the only factor, but they are certainly great at unlocking potential.

The post The Benefits of Side Projects appeared first on Bozho's tech blog.

Facebook User Pleads Guilty to Uploading Pirated Copy of Deadpool

Post Syndicated from Ernesto original https://torrentfreak.com/facebook-user-pleads-guilty-to-uploading-pirated-copy-of-deadpool-180522/

Every day, hundreds of millions of people use Facebook to share photos, videos and other information.

While most of the content posted on the site is relatively harmless, some people use it to share things they are not supposed to. A pirated copy of Deadpool, for example.

This is what the now 22-year-old Trevon Franklin from Fresno, California, did early 2016. Just a week after the first installment of the box-office hit Deadpool premiered in theaters, he shared a pirated copy of the movie on the social network.

To be clear, Franklin wasn’t the person who originally made the copy available. He simply downloaded it from the file-sharing site Putlocker.is and then proceeded to upload it to his Facebook account, using the screen name “Tre-Von M. King.”

This post went viral with more than six million viewers ‘tuning in.’ While many people dream of this kind of attention, in this case, it meant that copyright holder Twentieth Century Fox and the feds were alerted as well.

The FBI launched a full-fledged investigation which eventually led to an indictment and the arrest of Franklin last summer.

After months of relative silence, Franklin has now signed a plea agreement with the Government where he admits to sharing the pirated film on Facebook. In return, the authorities will recommend a sentence reduction.

“Defendant admits that defendant is, in fact, guilty of the offense to which defendant is agreeing to plead guilty,” the plea agreement reads.

The legal paperwork, signed by both sides, states that Franklin downloaded the pirated copy from Putlocker, knowing full well that he didn’t have permission to do so. He then willfully shared it on Facebook where it was accessed by millions of people.

“Between February 20 and 22, 2016, while Deadpool was still in theaters and had not yet been made available for purchase by the public for home viewing, the copy of Deadpool defendant posted to his Facebook page had been viewed over 6,386,456 times,” the paperwork reads.

From the plea agreement

While a federal case over Facebook uploads is unlikely, the risk of legal trouble was pointed out to Franklin by others.

According to Facebook comments from 2016, several people warned “Tre-Von M. King” that it wasn’t wise to post copyright-infringing material on the social media platform. However, Franklin said he wasn’t worried.

It’s unclear why the US Government decided to pursue this case. Copyright infringement isn’t exactly rare on Facebook. However, it may be that the media attention and the high number of views may have prompted the authorities to set an example.

Under the terms of the plea agreement, Franklin will be sentenced for a Class A misdemeanor. This can lead to a maximum prison sentence of one year, followed by probation or a supervised release, as well as a fine of $100,000. Meanwhile, he has waived his right to a trial by jury.

A copy of the plea agreement is available here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

Maliciously Changing Someone’s Address

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/maliciously_cha.html

Someone changed the address of UPS corporate headquarters to his own apartment in Chicago. The company discovered it three months later.

The problem, of course, is that there isn’t any authentication of change-of-address submissions:

According to the Postal Service, nearly 37 million change-of-address requests ­ known as PS Form 3575 ­ were submitted in 2017. The form, which can be filled out in person or online, includes a warning below the signature line that “anyone submitting false or inaccurate information” could be subject to fines and imprisonment.

To cut down on possible fraud, post offices send a validation letter to both an old and new address when a change is filed. The letter includes a toll-free number to call to report anything suspicious.

Each year, only a tiny fraction of the requests are ever referred to postal inspectors for investigation. A spokeswoman for the U.S. Postal Inspection Service could not provide a specific number to the Tribune, but officials have previously said that the number of change-of-address investigations in a given year totals 1,000 or fewer typically.

While fraud involving change-of-address forms has long been linked to identity thieves, the targets are usually unsuspecting individuals, not massive corporations.

AWS IoT 1-Click – Use Simple Devices to Trigger Lambda Functions

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-iot-1-click-use-simple-devices-to-trigger-lambda-functions/

We announced a preview of AWS IoT 1-Click at AWS re:Invent 2017 and have been refining it ever since, focusing on simplicity and a clean out-of-box experience. Designed to make IoT available and accessible to a broad audience, AWS IoT 1-Click is now generally available, along with new IoT buttons from AWS and AT&T.

I sat down with the dev team a month or two ago to learn about the service so that I could start thinking about my blog post. During the meeting they gave me a pair of IoT buttons and I started to think about some creative ways to put them to use. Here are a few that I came up with:

Help Request – Earlier this month I spent a very pleasant weekend at the HackTillDawn hackathon in Los Angeles. As the participants were hacking away, they occasionally had questions about AWS, machine learning, Amazon SageMaker, and AWS DeepLens. While we had plenty of AWS Solution Architects on hand (decked out in fashionable & distinctive AWS shirts for easy identification), I imagined an IoT button for each team. Pressing the button would alert the SA crew via SMS and direct them to the proper table.

Camera ControlTim Bray and I were in the AWS video studio, prepping for the first episode of Tim’s series on AWS Messaging. Minutes before we opened the Twitch stream I realized that we did not have a clean, unobtrusive way to ask the camera operator to switch to a closeup view. Again, I imagined that a couple of IoT buttons would allow us to make the request.

Remote Dog Treat Dispenser – My dog barks every time a stranger opens the gate in front of our house. While it is great to have confirmation that my Ring doorbell is working, I would like to be able to press a button and dispense a treat so that Luna stops barking!

Homes, offices, factories, schools, vehicles, and health care facilities can all benefit from IoT buttons and other simple IoT devices, all managed using AWS IoT 1-Click.

All About AWS IoT 1-Click
As I said earlier, we have been focusing on simplicity and a clean out-of-box experience. Here’s what that means:

Architects can dream up applications for inexpensive, low-powered devices.

Developers don’t need to write any device-level code. They can make use of pre-built actions, which send email or SMS messages, or write their own custom actions using AWS Lambda functions.

Installers don’t have to install certificates or configure cloud endpoints on newly acquired devices, and don’t have to worry about firmware updates.

Administrators can monitor the overall status and health of each device, and can arrange to receive alerts when a device nears the end of its useful life and needs to be replaced, using a single interface that spans device types and manufacturers.

I’ll show you how easy this is in just a moment. But first, let’s talk about the current set of devices that are supported by AWS IoT 1-Click.

Who’s Got the Button?
We’re launching with support for two types of buttons (both pictured above). Both types of buttons are pre-configured with X.509 certificates, communicate to the cloud over secure connections, and are ready to use.

The AWS IoT Enterprise Button communicates via Wi-Fi. It has a 2000-click lifetime, encrypts outbound data using TLS, and can be configured using BLE and our mobile app. It retails for $19.99 (shipping and handling not included) and can be used in the United States, Europe, and Japan.

The AT&T LTE-M Button communicates via the LTE-M cellular network. It has a 1500-click lifetime, and also encrypts outbound data using TLS. The device and the bundled data plan is available an an introductory price of $29.99 (shipping and handling not included), and can be used in the United States.

We are very interested in working with device manufacturers in order to make even more shapes, sizes, and types of devices (badge readers, asset trackers, motion detectors, and industrial sensors, to name a few) available to our customers. Our team will be happy to tell you about our provisioning tools and our facility for pushing OTA (over the air) updates to large fleets of devices; you can contact them at [email protected].

AWS IoT 1-Click Concepts
I’m eager to show you how to use AWS IoT 1-Click and the buttons, but need to introduce a few concepts first.

Device – A button or other item that can send messages. Each device is uniquely identified by a serial number.

Placement Template – Describes a like-minded collection of devices to be deployed. Specifies the action to be performed and lists the names of custom attributes for each device.

Placement – A device that has been deployed. Referring to placements instead of devices gives you the freedom to replace and upgrade devices with minimal disruption. Each placement can include values for custom attributes such as a location (“Building 8, 3rd Floor, Room 1337”) or a purpose (“Coffee Request Button”).

Action – The AWS Lambda function to invoke when the button is pressed. You can write a function from scratch, or you can make use of a pair of predefined functions that send an email or an SMS message. The actions have access to the attributes; you can, for example, send an SMS message with the text “Urgent need for coffee in Building 8, 3rd Floor, Room 1337.”

Getting Started with AWS IoT 1-Click
Let’s set up an IoT button using the AWS IoT 1-Click Console:

If I didn’t have any buttons I could click Buy devices to get some. But, I do have some, so I click Claim devices to move ahead. I enter the device ID or claim code for my AT&T button and click Claim (I can enter multiple claim codes or device IDs if I want):

The AWS buttons can be claimed using the console or the mobile app; the first step is to use the mobile app to configure the button to use my Wi-Fi:

Then I scan the barcode on the box and click the button to complete the process of claiming the device. Both of my buttons are now visible in the console:

I am now ready to put them to use. I click on Projects, and then Create a project:

I name and describe my project, and click Next to proceed:

Now I define a device template, along with names and default values for the placement attributes. Here’s how I set up a device template (projects can contain several, but I just need one):

The action has two mandatory parameters (phone number and SMS message) built in; I add three more (Building, Room, and Floor) and click Create project:

I’m almost ready to ask for some coffee! The next step is to associate my buttons with this project by creating a placement for each one. I click Create placements to proceed. I name each placement, select the device to associate with it, and then enter values for the attributes that I established for the project. I can also add additional attributes that are peculiar to this placement:

I can inspect my project and see that everything looks good:

I click on the buttons and the SMS messages appear:

I can monitor device activity in the AWS IoT 1-Click Console:

And also in the Lambda Console:

The Lambda function itself is also accessible, and can be used as-is or customized:

As you can see, this is the code that lets me use {{*}}include all of the placement attributes in the message and {{Building}} (for example) to include a specific placement attribute.

Now Available
I’ve barely scratched the surface of this cool new service and I encourage you to give it a try (or a click) yourself. Buy a button or two, build something cool, and let me know all about it!

Pricing is based on the number of enabled devices in your account, measured monthly and pro-rated for partial months. Devices can be enabled or disabled at any time. See the AWS IoT 1-Click Pricing page for more info.

To learn more, visit the AWS IoT 1-Click home page or read the AWS IoT 1-Click documentation.



From Framework to Function: Deploying AWS Lambda Functions for Java 8 using Apache Maven Archetype

Post Syndicated from Ryosuke Iwanaga original https://aws.amazon.com/blogs/compute/from-framework-to-function-deploying-aws-lambda-functions-for-java-8-using-apache-maven-archetype/

As a serverless computing platform that supports Java 8 runtime, AWS Lambda makes it easy to run any type of Java function simply by uploading a JAR file. To help define not only a Lambda serverless application but also Amazon API Gateway, Amazon DynamoDB, and other related services, the AWS Serverless Application Model (SAM) allows developers to use a simple AWS CloudFormation template.

AWS provides the AWS Toolkit for Eclipse that supports both Lambda and SAM. AWS also gives customers an easy way to create Lambda functions and SAM applications in Java using the AWS Command Line Interface (AWS CLI). After you build a JAR file, all you have to do is type the following commands:

aws cloudformation package 
aws cloudformation deploy

To consolidate these steps, customers can use Archetype by Apache Maven. Archetype uses a predefined package template that makes getting started to develop a function exceptionally simple.

In this post, I introduce a Maven archetype that allows you to create a skeleton of AWS SAM for a Java function. Using this archetype, you can generate a sample Java code example and an accompanying SAM template to deploy it on AWS Lambda by a single Maven action.


Make sure that the following software is installed on your workstation:

  • Java
  • Maven
  • (Optional) AWS SAM CLI

Install Archetype

After you’ve set up those packages, install Archetype with the following commands:

git clone https://github.com/awslabs/aws-serverless-java-archetype
cd aws-serverless-java-archetype
mvn install

These are one-time operations, so you don’t run them for every new package. If you’d like, you can add Archetype to your company’s Maven repository so that other developers can use it later.

With those packages installed, you’re ready to develop your new Lambda Function.

Start a project

Now that you have the archetype, customize it and run the code:

cd /path/to/project_home
mvn archetype:generate \
  -DarchetypeGroupId=com.amazonaws.serverless.archetypes \
  -DarchetypeArtifactId=aws-serverless-java-archetype \
  -DarchetypeVersion=1.0.0 \
  -DarchetypeRepository=local \ # Forcing to use local maven repository
  -DinteractiveMode=false \ # For batch mode
  # You can also specify properties below interactively if you omit the line for batch mode
  -DgroupId=YOUR_GROUP_ID \
  -DartifactId=YOUR_ARTIFACT_ID \
  -Dversion=YOUR_VERSION \

You should have a directory called YOUR_ARTIFACT_ID that contains the files and folders shown below:

├── event.json
├── pom.xml
├── src
│   └── main
│       ├── java
│       │   └── Package
│       │       └── Example.java
│       └── resources
│           └── log4j2.xml
└── template.yaml

The sample code is a working example. If you install SAM CLI, you can invoke it just by the command below:

mvn -P invoke verify
[INFO] Scanning for projects...
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] --- maven-jar-plugin:3.0.2:jar (default-jar) @ foo ---
[INFO] Building jar: /private/tmp/foo/target/foo-1.0.jar
[INFO] --- maven-shade-plugin:3.1.0:shade (shade) @ foo ---
[INFO] Including com.amazonaws:aws-lambda-java-core:jar:1.2.0 in the shaded jar.
[INFO] Replacing /private/tmp/foo/target/lambda.jar with /private/tmp/foo/target/foo-1.0-shaded.jar
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-local-invoke) @ foo ---
2018/04/06 16:34:35 Successfully parsed template.yaml
2018/04/06 16:34:35 Connected to Docker 1.37
2018/04/06 16:34:35 Fetching lambci/lambda:java8 image for java8 runtime...
java8: Pulling from lambci/lambda
Digest: sha256:14df0a5914d000e15753d739612a506ddb8fa89eaa28dcceff5497d9df2cf7aa
Status: Image is up to date for lambci/lambda:java8
2018/04/06 16:34:37 Invoking Package.Example::handleRequest (java8)
2018/04/06 16:34:37 Decompressing /tmp/foo/target/lambda.jar
2018/04/06 16:34:37 Mounting /private/var/folders/x5/ldp7c38545v9x5dg_zmkr5kxmpdprx/T/aws-sam-local-1523000077594231063 as /var/task:ro inside runtime container
START RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74 Version: $LATEST
Log output: Greeting is 'Hello Tim Wagner.'
END RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74
REPORT RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74	Duration: 96.60 ms	Billed Duration: 100 ms	Memory Size: 128 MB	Max Memory Used: 7 MB

{"greetings":"Hello Tim Wagner."}

[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.452 s
[INFO] Finished at: 2018-04-06T16:34:40+09:00
[INFO] ------------------------------------------------------------------------

This maven goal invokes sam local invoke -e event.json, so you can see the sample output to greet Tim Wagner.

To deploy this application to AWS, you need an Amazon S3 bucket to upload your package. You can use the following command to create a bucket if you want:

aws s3 mb s3://YOUR_BUCKET --region YOUR_REGION

Now, you can deploy your application by just one command!

mvn deploy \
    -DawsRegion=YOUR_REGION \
    -Ds3Bucket=YOUR_BUCKET \
[INFO] Scanning for projects...
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-package) @ foo ---
Uploading to aws-serverless-java/com.riywo:foo:1.0/924732f1f8e4705c87e26ef77b080b47  11657 / 11657.0  (100.00%)
Successfully packaged artifacts and wrote output template to file target/sam.yaml.
Execute the following command to deploy the packaged template
aws cloudformation deploy --template-file /private/tmp/foo/target/sam.yaml --stack-name <YOUR STACK NAME>
[INFO] --- maven-deploy-plugin:2.8.2:deploy (default-deploy) @ foo ---
[INFO] Skipping artifact deployment
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-deploy) @ foo ---

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - archetype
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 37.176 s
[INFO] Finished at: 2018-04-06T16:41:02+09:00
[INFO] ------------------------------------------------------------------------

Maven automatically creates a shaded JAR file, uploads it to your S3 bucket, replaces template.yaml, and creates and updates the CloudFormation stack.

To customize the process, modify the pom.xml file. For example, to avoid typing values for awsRegion, s3Bucket or stackName, write them inside pom.xml and check in your VCS. Afterward, you and the rest of your team can deploy the function by typing just the following command:

mvn deploy


Lambda Java 8 runtime has some types of handlers: POJO, Simple type and Stream. The default option of this archetype is POJO style, which requires to create request and response classes, but they are baked by the archetype by default. If you want to use other type of handlers, you can use handlerType property like below:

## POJO type (default)
mvn archetype:generate \

## Simple type - String
mvn archetype:generate \

### Stream type
mvn archetype:generate \

See documentation for more details about handlers.

Also, Lambda Java 8 runtime supports two types of Logging class: Log4j 2 and LambdaLogger. This archetype creates LambdaLogger implementation by default, but you can use Log4j 2 if you want:

## LambdaLogger (default)
mvn archetype:generate \

## Log4j 2
mvn archetype:generate \

If you use LambdaLogger, you can delete ./src/main/resources/log4j2.xml. See documentation for more details.


So, what’s next? Develop your Lambda function locally and type the following command: mvn deploy !

With this Archetype code example, available on GitHub repo, you should be able to deploy Lambda functions for Java 8 in a snap. If you have any questions or comments, please submit them below or leave them on GitHub.

Some notes on eFail

Post Syndicated from Robert Graham original https://blog.erratasec.com/2018/05/some-notes-on-efail.html

I’ve been busy trying to replicate the “eFail” PGP/SMIME bug. I thought I’d write up some notes.

PGP and S/MIME encrypt emails, so that eavesdroppers can’t read them. The bugs potentially allow eavesdroppers to take the encrypted emails they’ve captured and resend them to you, reformatted in a way that allows them to decrypt the messages.

Disable remote/external content in email

The most important defense is to disable “external” or “remote” content from being automatically loaded. This is when HTML-formatted emails attempt to load images from remote websites. This happens legitimately when they want to display images, but not fill up the email with them. But most of the time this is illegitimate, they hide images on the webpage in order to track you with unique IDs and cookies. For example, this is the code at the end of an email from politician Bernie Sanders to his supporters. Notice the long random number assigned to track me, and the width/height of this image is set to one pixel, so you don’t even see it:

Such trackers are so pernicious they are disabled by default in most email clients. This is an example of the settings in Thunderbird:

The problem is that as you read email messages, you often get frustrated by the fact the error messages and missing content, so you keep adding exceptions:

The correct defense against this eFail bug is to make sure such remote content is disabled and that you have no exceptions, or at least, no HTTP exceptions. HTTPS exceptions (those using SSL) are okay as long as they aren’t to a website the attacker controls. Unencrypted exceptions, though, the hacker can eavesdrop on, so it doesn’t matter if they control the website the requests go to. If the attacker can eavesdrop on your emails, they can probably eavesdrop on your HTTP sessions as well.

Some have recommended disabling PGP and S/MIME completely. That’s probably overkill. As long as the attacker can’t use the “remote content” in emails, you are fine. Likewise, some have recommend disabling HTML completely. That’s not even an option in any email client I’ve used — you can disable sending HTML emails, but not receiving them. It’s sufficient to just disable grabbing remote content, not the rest of HTML email rendering.

I couldn’t replicate the direct exfiltration

There rare two related bugs. One allows direct exfiltration, which appends the decrypted PGP email onto the end of an IMG tag (like one of those tracking tags), allowing the entire message to be decrypted.

An example of this is the following email. This is a standard HTML email message consisting of multiple parts. The trick is that the IMG tag in the first part starts the URL (blog.robertgraham.com/…) but doesn’t end it. It has the starting quotes in front of the URL but no ending quotes. The ending will in the next chunk.

The next chunk isn’t HTML, though, it’s PGP. The PGP extension (in my case, Enignmail) will detect this and automatically decrypt it. In this case, it’s some previous email message I’ve received the attacker captured by eavesdropping, who then pastes the contents into this email message in order to get it decrypted.

What should happen at this point is that Thunderbird will generate a request (if “remote content” is enabled) to the blog.robertgraham.com server with the decrypted contents of the PGP email appended to it. But that’s not what happens. Instead, I get this:

I am indeed getting weird stuff in the URL (the bit after the GET /), but it’s not the PGP decrypted message. Instead what’s going on is that when Thunderbird puts together a “multipart/mixed” message, it adds it’s own HTML tags consisting of lines between each part. In the email client it looks like this:

The HTML code it adds looks like:

That’s what you see in the above URL, all this code up to the first quotes. Those quotes terminate the quotes in the URL from the first multipart section, causing the rest of the content to be ignored (as far as being sent as part of the URL).

So at least for the latest version of Thunderbird, you are accidentally safe, even if you have “remote content” enabled. Though, this is only according to my tests, there may be a work around to this that hackers could exploit.


In the old days, email was sent plaintext over the wire so that it could be passively eavesdropped on. Nowadays, most providers send it via “STARTTLS”, which sorta encrypts it. Attackers can still intercept such email, but they have to do so actively, using man-in-the-middle. Such active techniques can be detected if you are careful and look for them.
Some organizations don’t care. Apparently, some nation states are just blocking all STARTTLS and forcing email to be sent unencrypted. Others do care. The NSA will passively sniff all the email they can in nations like Iraq, but they won’t actively intercept STARTTLS messages, for fear of getting caught.
The consequence is that it’s much less likely that somebody has been eavesdropping on you, passively grabbing all your PGP/SMIME emails. If you fear they have been, you should look (e.g. send emails from GMail and see if they are intercepted by sniffing the wire).

You’ll know if you are getting hacked

If somebody attacks you using eFail, you’ll know. You’ll get an email message formatted this way, with multipart/mixed components, some with corrupt HTML, some encrypted via PGP. This means that for the most part, your risk is that you’ll be attacked only once — the hacker will only be able to get one message through and decrypt it before you notice that something is amiss. Though to be fair, they can probably include all the emails they want decrypted as attachments to the single email they sent you, so the risk isn’t necessarily that you’ll only get one decrypted.
As mentioned above, a lot of attackers (e.g. the NSA) won’t attack you if its so easy to get caught. Other attackers, though, like anonymous hackers, don’t care.
Somebody ought to write a plugin to Thunderbird to detect this.


It only works if attackers have already captured your emails (though, that’s why you use PGP/SMIME in the first place, to guard against that).
It only works if you’ve enabled your email client to automatically grab external/remote content.
It seems to not be easily reproducible in all cases.
Instead of disabling PGP/SMIME, you should make sure your email client hast remote/external content disabled — that’s a huge privacy violation even without this bug.

Notes: The default email client on the Mac enables remote content by default, which is bad:

The Pirating Elephant in Uncle Sam’s Room

Post Syndicated from Ernesto original https://torrentfreak.com/the-pirating-elephant-in-uncle-sams-room-180413/

It’s not a secret that, in sheer numbers, America is the country that harbors most online pirates.

Perhaps no surprise, since it has a large and well-connected population, but it’s important to note considering what we’re about to write today.

Over the past decade, online piracy has presented itself as a massive problem for the US and its entertainment industries. It’s a global issue that’s hard to contain, but Hollywood and the major record labels are doing what they can.

Two of the key strategies they’ve employed in recent years are website blocking and domain seizures. US companies have traveled to courts all over the world to have ISP blockades put in place, with quite a bit of success.

At the same time, US rightsholders also push foreign domain registrars and registries to suspend or seize domains. The US Government is even jumping in, applying pressure against pirate domains as well.

Previously, the U.S. Embassy in Costa Rica threatened to have the country’s domain name registry shut down, unless it suspended ThePirateBay.cr. This hasn’t happened, yet, but it was a clear signal.

What’s odd, though, is that ThePirateBay.cr is a relatively meaningless proxy site. The real Pirate Bay operates from an .org domain name, which happens to be managed by the US-based Public Interest Registry (PIR).

So, if the US authorities threaten to shut down Costa Rica’s domain registry over a proxy, why is the US-based PIR registry still in action? After all, it’s the registry that ‘manages’ the domain name of the largest pirate site on the entire web, and has done for nearly 15 years.

Also of note is that the entertainment industries previously launched an overseas lawsuit to seize The Pirate Bay’s .se and .is domains, but never attempted to do the same with the US-based .org domain.

There are more of these strange observations. Let’s move back to website blocking, for example.

In a detailed overview, the Motion Picture Association recently reported that ISP blocking measures, which are in place in more than two dozen countries, help to reduce piracy significantly. This is further backed up by industry-supported reports and independent academic research.

In an ideal world, the US entertainment industries would like ISPs in every country to block pirate sites. While this is all fine and understandable from the perspective of these companies, there’s also an elephant in the room.

Over the past decade, US companies have worked hard to spread their blocking message around the world, while they yet have to attempt the same on their home turf. And this happens to be the country with the most pirates of all, which could make a massive impact.

Sure, it was a major success when a court in Iceland ordered local ISPs to block The Pirate Bay. But with roughly 130,000 Internet subscriptions in the entire country, that’s peanuts compared to the US.

So why is the US, the largest “pirate nation,” ignored?

From what we’ve heard, the entertainment industries are not pushing for ISP blockades in US courts because they fear a SOPA-like backlash. This likely applies to domain suspensions as well, which aren’t all that hard to accomplish in the US.

Instead, the major entertainment companies are focusing their efforts elsewhere.

While these entertainment companies are well within their rights to lobby for these measures, there’s an elephant in the room that is hard to ignore. Personally, I can’t help but cringe every time Hollywood pushes the blocking agenda to a new country or demands domain seizures abroad.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift

Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/

Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.

To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.

Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.

What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.

Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.

Solution overview

Here is how the solution is laid out:



The following sections walk you through each of these steps to set up the pipeline.

1. Define the schema

When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. The reason is that many times, incoming events contain all or some of the expected fields based on which values the producers are advertising. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.

To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.

The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.

CREATE EXTERNAL TABLE default.kfhconnectblog (
  awsaccountid string,
  agentarn string,
  currentagentsnapshot string,
  eventid string,
  eventtimestamp string,
  eventtype string,
  instancearn string,
  previousagentsnapshot string,
  version string
STORED AS parquet
LOCATION 's3://your_bucket/kfhconnectblog/'
TBLPROPERTIES ("parquet.compression"="SNAPPY")

That’s all you have to do to prepare the schema for Kinesis Data Firehose.

2. Define the data streams

Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events.  Open the Kinesis Data Streams console and create two streams.  You can configure them with only one shard each because you don’t have a lot of data right now.

3. Define the Kinesis Data Firehose delivery stream for Parquet

Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.

As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.

To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash “ / “ so that Data Firehose creates the date partitions inside that prefix.

On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.

Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.

Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.

Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.

4. Set up the Amazon Connect contact center

After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.

After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.

At this point, your pipeline is complete.  Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.

So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After it is created, Agent One can get to work and log into their console and set their availability to Available so that they are ready to receive calls.

To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.

5. Analyze the data with Athena

Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though in the documentation they were complex structures.  The reason for doing that was simply to show some of the flexibility of Athena to be able to parse JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.

Let’s run the first query to see which agents have logged into the system.

The query might look complex, but it’s fairly straightforward:

WITH dataset AS (
    from_iso8601_timestamp(eventtimestamp) AS event_ts,
      '$.agentstatus.name') AS current_status,
        '$.agentstatus.starttimestamp')) AS current_starttimestamp,
      '$.configuration.firstname') AS current_firstname,
      '$.configuration.lastname') AS current_lastname,
      '$.configuration.username') AS current_username,
      '$.configuration.routingprofile.defaultoutboundqueue.name') AS               current_outboundqueue,
      '$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
      '$.agentstatus.name') as prev_status,
       '$.agentstatus.starttimestamp')) as prev_starttimestamp,
      '$.configuration.firstname') as prev_firstname,
      '$.configuration.lastname') as prev_lastname,
      '$.configuration.username') as prev_username,
      '$.configuration.routingprofile.defaultoutboundqueue.name') as current_outboundqueue,
      '$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
  from kfhconnectblog
  where eventtype <> 'HEART_BEAT'
  current_status as status,
  current_username as username,
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC

The query output looks something like this:

Here is another query that shows the sessions each of the agents engaged with. It tells us where they were incoming or outgoing, if they were completed, and where there were missed or failed calls.

WITH src AS (
     json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
     cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
     cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
  from kfhconnectblog
src2 AS (
  FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
dataset AS (
  json_extract_scalar(c_item, '$.contactid') as c_contactid,
  json_extract_scalar(c_item, '$.channel') as c_channel,
  json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
  json_extract_scalar(c_item, '$.queue.name') as c_queue,
  json_extract_scalar(c_item, '$.state') as c_state,
  from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
  json_extract_scalar(p_item, '$.contactid') as p_contactid,
  json_extract_scalar(p_item, '$.channel') as p_channel,
  json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
  json_extract_scalar(p_item, '$.queue.name') as p_queue,
  json_extract_scalar(p_item, '$.state') as p_state,
  from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
  c_channel as channel,
  c_direction as direction,
  p_state as prev_state,
  c_state as current_state,
  c_ts as current_ts,
  c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC

The query output looks similar to the following:

6. Analyze the data with Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.

Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.

  json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
  json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
  json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
    'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue,
FROM default_schema.kfhconnectblog

The following shows the query output:


In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This great feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data. This feature is equally important if you are investing in building data lakes on AWS.


Additional Reading

If you found this post useful, be sure to check out Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight and Work with partitioned data in AWS Glue.

About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.




[$] Using user-space tracepoints with BPF

Post Syndicated from corbet original https://lwn.net/Articles/753601/rss

Much has been written on LWN about dynamically instrumenting kernel
code. These features are also available to user-space code with a
special kind of probe known as a User Statically-Defined Tracing
(USDT) probe. These probes provide a low-overhead way of
instrumenting user-space code and provide a convenient way to debug applications
running in production. In this final article of the BPF and BCC series
we’ll look at where USDT probes come from and how you can use them to
understand the behavior of your own applications.

Bad Software Is Our Fault

Post Syndicated from Bozho original https://techblog.bozho.net/bad-software-is-our-fault/

Bad software is everywhere. One can even claim that every software is bad. Cool companies, tech giants, established companies, all produce bad software. And no, yours is not an exception.

Who’s to blame for bad software? It’s all complicated and many factors are intertwined – there’s business requirements, there’s organizational context, there’s lack of sufficient skilled developers, there’s the inherent complexity of software development, there’s leaky abstractions, reliance on 3rd party software, consequences of wrong business and purchase decisions, time limitations, flawed business analysis, etc. So yes, despite the catchy title, I’m aware it’s actually complicated.

But in every “it’s complicated” scenario, there’s always one or two factors that are decisive. All of them contribute somehow, but the major drivers are usually a handful of things. And in the case of base software, I think it’s the fault of technical people. Developers, architects, ops.

We don’t seem to care about best practices. And I’ll do some nasty generalizations here, but bear with me. We can spend hours arguing about tabs vs spaces, curly bracket on new line, git merge vs rebase, which IDE is better, which framework is better and other largely irrelevant stuff. But we tend to ignore the important aspects that span beyond the code itself. The context in which the code lives, the non-functional requirements – robustness, security, resilience, etc.

We don’t seem to get security. Even trivial stuff such as user authentication is almost always implemented wrong. These days Twitter and GitHub realized they have been logging plain-text passwords, for example, but that’s just the tip of the iceberg. Too often we ignore the security implications.

“But the business didn’t request the security features”, one may say. The business never requested 2-factor authentication, encryption at rest, PKI, secure (or any) audit trail, log masking, crypto shredding, etc., etc. Because the business doesn’t know these things – we do and we have to put them on the backlog and fight for them to be implemented. Each organization has its specifics and tech people can influence the backlog in different ways, but almost everywhere we can put things there and prioritize them.

The other aspect is testing. We should all be well aware by now that automated testing is mandatory. We have all the tools in the world for unit, functional, integration, performance and whatnot testing, and yet many software projects lack the necessary test coverage to be able to change stuff without accidentally breaking things. “But testing takes time, we don’t have it”. We are perfectly aware that testing saves time, as we’ve all had those “not again!” recurring bugs. And yet we think of all sorts of excuses – “let the QAs test it”, we have to ship that now, we’ll test it later”, “this is too trivial to be tested”, etc.

And you may say it’s not our job. We don’t define what has do be done, we just do it. We don’t define the budget, the scope, the features. We just write whatever has been decided. And that’s plain wrong. It’s not our job to make money out of our code, and it’s not our job to define what customers need, but apart from that everything is our job. The way the software is structured, the security aspects and security features, the stability of the code base, the way the software behaves in different environments. The non-functional requirements are our job, and putting them on the backlog is our job.

You’ve probably heard that every software becomes “legacy” after 6 months. And that’s because of us, our sloppiness, our inability to mitigate external factors and constraints. Too often we create a mess through “just doing our job”.

And of course that’s a generalization. I happen to know a lot of great professionals who don’t make these mistakes, who strive for excellence and implement things the right way. But our industry as a whole doesn’t. Our industry as a whole produces bad software. And it’s our fault, as developers – as the only people who know why a certain piece of software is bad.

In a talk of his, Bob Martin warns us of the risks of our sloppiness. We have been building websites so far, but we are more and more building stuff that interacts with the real world, directly and indirectly. Ultimately, lives may depend on our software (like the recent unfortunate death caused by a self-driving car). And I’ll agree with Uncle Bob that it’s high time we self-regulate as an industry, before some technically incompetent politician decides to do that.

How, I don’t know. We’ll have to think more about it. But I’m pretty sure it’s our fault that software is bad, and no amount of blaming the management, the budget, the timing, the tools or the process can eliminate our responsibility.

Why do I insist on bashing my fellow software engineers? Because if we start looking at software development with more responsibility; with the fact that if it fails, it’s our fault, then we’re more likely to get out of our current bug-ridden, security-flawed, fragile software hole and really become the experts of the future.

The post Bad Software Is Our Fault appeared first on Bozho's tech blog.

UK Internet Filters Block Disney Sites, Internet Safety Tips, and More

Post Syndicated from Ernesto original https://torrentfreak.com/uk-internet-filters-block-disney-sites-internet-safety-tips-and-more-180505/

Over the past several years we have regularly written about court-ordered blockades of pirate sites in the UK.

Today, we take a closer look at another type of blocking, the Internet safety filters UK ISPs offer. These filters, which are sometimes enabled by default, help subscribers to block harmful content, especially for their children.

With help from the Open Rights Group’s Blocked initiative, which documents the scope of various ISP filters, we took a look at how these perform.

The first results are as expected. Many porn sites are blocked and so are sites that are clearly oriented at a mature audience. That’s more or less what these filters are meant for, so no issues there.

We also noticed that many proxies and VPNs are not accessible. While this may seem broad, as they’re not offensive, these tools could allow clever sorts to bypass parental controls, so there’s an argument to be made for their inclusion.

Oddly enough, the Tor browser, which can do the same, is freely accessible. But let’s not digress.

What really stood out to us is that some sites which are targeted at kids, or at least useful to them, are blocked too.

One prime example is the official UK Disney website, located at disney.co.uk, which is blocked by BT’s Strict filters. That seems a bit cruel. The same is true for disneymoviesanywhere.com, which is not very useful, but certainly doesn’t seem harmful to us either.

Apparently, BT doesn’t want children to visit these Disney sites.

No Disney

The parental control filters are supposed to make the web a safer place for kids. While this is a laudable aim, the execution is not always perfect. For example, several ISPs including BT, Plusnet and Virgin Media, are blocking the internetsafetyday.org website.

Admittedly, the site is targeted at parents, but since these will often be behind the same filters, they’re missing out on some good tips and tricks on how to educate their children.

No safety

Talking about education. It’s always good when kids start to experiment with coding at a young age. This is also one of the core messages of the non-profit organization Kidsandcode.org.

“Everyone should have the opportunity to learn how to code,” the site reads.

This makes sense, you’d think, but for kids who are trapped behind the BT Strict or BT Light filters, this is not an option.

What can kids do nowadays then? Play a few simple games? Ideally educational games such as the ones playkidsgames.com offers. As you may have guessed by now, that’s not an option either, at least not behind BT’s Strict filter.

Maybe kids should stick to more boring stuff. Perhaps finish that school project on Vikings, that should be doable, right? Well, it is, as long as you don’t look up vikingsword.com, a historical and educational side dedicated to Viking swords.

Both Three and Sky have blocked the site, with Sky explaining that it’s placed in the “Weapons, Violence, Gore and Hate” category.

Ancient weapons

Perhaps we’re too insensitive, but I think that most children can handle grainy drawings of swords. After all, the average cartoon is more violent nowadays.

And yes, it’s true that decades-old Viking swords are weapons, but Three and Sky are not very consistent as the website of today’s largest weapons manufacturer Lockheed Martin appears to pass through all parental filters just fine.

Luckily, the Open Rights Group allows us to check for these odd results, and report sites which are inaccurately blocked. Interested in checking if your favorite website is blocked? You can do so here. Feel free to report any unusual findings in the comments.

Open Rights Group is currently looking for donations and other means of support to keep its Blocked project up and running. More information is available here.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

Analyze data in Amazon DynamoDB using Amazon SageMaker for real-time prediction

Post Syndicated from YongSeong Lee original https://aws.amazon.com/blogs/big-data/analyze-data-in-amazon-dynamodb-using-amazon-sagemaker-for-real-time-prediction/

Many companies across the globe use Amazon DynamoDB to store and query historical user-interaction data. DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency.

Often, customers want to turn their valuable data in DynamoDB into insights by analyzing a copy of their table stored in Amazon S3. Doing this separates their analytical queries from their low-latency critical paths. This data can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value. Customers often turn to DynamoDB because of its great scalability and high availability. After a successful launch, many customers want to use the data in DynamoDB to predict future behaviors or provide personalized recommendations.

DynamoDB is a good fit for low-latency reads and writes, but it’s not practical to scan all data in a DynamoDB database to train a model. In this post, I demonstrate how you can use DynamoDB table data copied to Amazon S3 by AWS Data Pipeline to predict customer behavior. I also demonstrate how you can use this data to provide personalized recommendations for customers using Amazon SageMaker. You can also run ad hoc queries using Amazon Athena against the data. DynamoDB recently released on-demand backups to create full table backups with no performance impact. However, it’s not suitable for our purposes in this post, so I chose AWS Data Pipeline instead to create managed backups are accessible from other services.

To do this, I describe how to read the DynamoDB backup file format in Data Pipeline. I also describe how to convert the objects in S3 to a CSV format that Amazon SageMaker can read. In addition, I show how to schedule regular exports and transformations using Data Pipeline. The sample data used in this post is from Bank Marketing Data Set of UCI.

The solution that I describe provides the following benefits:

  • Separates analytical queries from production traffic on your DynamoDB table, preserving your DynamoDB read capacity units (RCUs) for important production requests
  • Automatically updates your model to get real-time predictions
  • Optimizes for performance (so it doesn’t compete with DynamoDB RCUs after the export) and for cost (using data you already have)
  • Makes it easier for developers of all skill levels to use Amazon SageMaker

All code and data set in this post are available in this .zip file.

Solution architecture

The following diagram shows the overall architecture of the solution.

The steps that data follows through the architecture are as follows:

  1. Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3
  2. Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.
  3. Amazon SageMaker renews the model artifact and update the endpoint.
  4. The converted CSV is available for ad hoc queries with Amazon Athena.
  5. Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.

Building the auto-updating model

This section discusses details about how to read the DynamoDB exported data in Data Pipeline and build automated workflows for real-time prediction with a regularly updated model.

Download sample scripts and data

Before you begin, take the following steps:

  1. Download sample scripts in this .zip file.
  2. Unzip the src.zip file.
  3. Find the automation_script.sh file and edit it for your environment. For example, you need to replace 's3://<your bucket>/<datasource path>/' with your own S3 path to the data source for Amazon ML. In the script, the text enclosed by angle brackets—< and >—should be replaced with your own path.
  4. Upload the json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar file to your S3 path so that the ADD jar command in Apache Hive can refer to it.

For this solution, the banking.csv  should be imported into a DynamoDB table.

Export a DynamoDB table

To export the DynamoDB table to S3, open the Data Pipeline console and choose the Export DynamoDB table to S3 template. In this template, Data Pipeline creates an Amazon EMR cluster and performs an export in the EMRActivity activity. Set proper intervals for backups according to your business requirements.

One core node(m3.xlarge) provides the default capacity for the EMR cluster and should be suitable for the solution in this post. Leave the option to resize the cluster before running enabled in the TableBackupActivity activity to let Data Pipeline scale the cluster to match the table size. The process of converting to CSV format and renewing models happens in this EMR cluster.

For a more in-depth look at how to export data from DynamoDB, see Export Data from DynamoDB in the Data Pipeline documentation.

Add the script to an existing pipeline

After you export your DynamoDB table, you add an additional EMR step to EMRActivity by following these steps:

  1. Open the Data Pipeline console and choose the ID for the pipeline that you want to add the script to.
  2. For Actions, choose Edit.
  3. In the editing console, choose the Activities category and add an EMR step using the custom script downloaded in the previous section, as shown below.

Paste the following command into the new step after the data ­­upload step:

s3://#{myDDBRegion}.elasticmapreduce/libs/script-runner/script-runner.jar,s3://<your bucket name>/automation_script.sh,#{output.directoryPath},#{myDDBRegion}

The element #{output.directoryPath} references the S3 path where the data pipeline exports DynamoDB data as JSON. The path should be passed to the script as an argument.

The bash script has two goals, converting data formats and renewing the Amazon SageMaker model. Subsequent sections discuss the contents of the automation script.

Automation script: Convert JSON data to CSV with Hive

We use Apache Hive to transform the data into a new format. The Hive QL script to create an external table and transform the data is included in the custom script that you added to the Data Pipeline definition.

When you run the Hive scripts, do so with the -e option. Also, define the Hive table with the 'org.openx.data.jsonserde.JsonSerDe' row format to parse and read JSON format. The SQL creates a Hive EXTERNAL table, and it reads the DynamoDB backup data on the S3 path passed to it by Data Pipeline.

Note: You should create the table with the “EXTERNAL” keyword to avoid the backup data being accidentally deleted from S3 if you drop the table.

The full automation script for converting follows. Add your own bucket name and data source path in the highlighted areas.

hive -e "
ADD jar s3://<your bucket name>/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar ; 
DROP TABLE IF EXISTS blog_backup_data ;
CREATE EXTERNAL TABLE blog_backup_data (
 customer_id map<string,string>,
 age map<string,string>, job map<string,string>, 
 marital map<string,string>,education map<string,string>, 
 default map<string,string>, housing map<string,string>,
 loan map<string,string>, contact map<string,string>, 
 month map<string,string>, day_of_week map<string,string>, 
 duration map<string,string>, campaign map<string,string>,
 pdays map<string,string>, previous map<string,string>, 
 poutcome map<string,string>, emp_var_rate map<string,string>, 
 cons_price_idx map<string,string>, cons_conf_idx map<string,string>,
 euribor3m map<string,string>, nr_employed map<string,string>, 
 y map<string,string> ) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 

INSERT OVERWRITE DIRECTORY 's3://<your bucket name>/<datasource path>/' 
SELECT concat( customer_id['s'],',', 
 age['n'],',', job['s'],',', 
 marital['s'],',', education['s'],',', default['s'],',', 
 housing['s'],',', loan['s'],',', contact['s'],',', 
 month['s'],',', day_of_week['s'],',', duration['n'],',', 
 poutcome['s'],',', emp_var_rate['n'],',', cons_price_idx['n'],',',
 cons_conf_idx['n'],',', euribor3m['n'],',', nr_employed['n'],',', y['n'] ) 
FROM blog_backup_data
WHERE customer_id['s'] > 0 ; 

After creating an external table, you need to read data. You then use the INSERT OVERWRITE DIRECTORY ~ SELECT command to write CSV data to the S3 path that you designated as the data source for Amazon SageMaker.

Depending on your requirements, you can eliminate or process the columns in the SELECT clause in this step to optimize data analysis. For example, you might remove some columns that have unpredictable correlations with the target value because keeping the wrong columns might expose your model to “overfitting” during the training. In this post, customer_id  columns is removed. Overfitting can make your prediction weak. More information about overfitting can be found in the topic Model Fit: Underfitting vs. Overfitting in the Amazon ML documentation.

Automation script: Renew the Amazon SageMaker model

After the CSV data is replaced and ready to use, create a new model artifact for Amazon SageMaker with the updated dataset on S3.  For renewing model artifact, you must create a new training job.  Training jobs can be run using the AWS SDK ( for example, Amazon SageMaker boto3 ) or the Amazon SageMaker Python SDK that can be installed with “pip install sagemaker” command as well as the AWS CLI for Amazon SageMaker described in this post.

In addition, consider how to smoothly renew your existing model without service impact, because your model is called by applications in real time. To do this, you need to create a new endpoint configuration first and update a current endpoint with the endpoint configuration that is just created.

## Define variable 
DTTIME=`date +%Y-%m-%d-%H-%M-%S`
ROLE="<your AmazonSageMaker-ExecutionRole>" 

# Select containers image based on region.  
case "$REGION" in
"us-west-2" )
"us-east-1" )
"us-east-2" )
"eu-west-1" )
    echo "Invalid Region Name"
    exit 1 ;  

# Start training job and creating model artifact 
S3OUTPUT="s3://<your bucket name>/model/" 
aws sagemaker create-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}  --algorithm-specification TrainingImage=${IMAGE},TrainingInputMode=File --role-arn ${ROLE}  --input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<your bucket name>/<datasource path>/", "S3DataDistributionType": "FullyReplicated" } }, "ContentType": "text/csv", "CompressionType": "None" , "RecordWrapperType": "None"  }]'  --output-data-config S3OutputPath=${S3OUTPUT} --resource-config  InstanceType=${INSTANCETYPE},InstanceCount=${INSTANCECOUNT},VolumeSizeInGB=${VOLUMESIZE} --stopping-condition MaxRuntimeInSeconds=120 --hyper-parameters feature_dim=20,predictor_type=binary_classifier  

# Wait until job completed 
aws sagemaker wait training-job-completed-or-stopped --training-job-name ${TRAINING_JOB_NAME}  --region ${REGION}

# Get newly created model artifact and create model
MODELARTIFACT=`aws sagemaker describe-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}  --query 'ModelArtifacts.S3ModelArtifacts' --output text `
aws sagemaker create-model --region ${REGION} --model-name ${MODELNAME}  --primary-container Image=${IMAGE},ModelDataUrl=${MODELARTIFACT}  --execution-role-arn ${ROLE}

# create a new endpoint configuration 
aws sagemaker  create-endpoint-config --region ${REGION} --endpoint-config-name ${CONFIGNAME}  --production-variants  VariantName=Users,ModelName=${MODELNAME},InitialInstanceCount=1,InstanceType=ml.m4.xlarge

# create or update the endpoint
STATUS=`aws sagemaker describe-endpoint --endpoint-name  ServiceEndpoint --query 'EndpointStatus' --output text --region ${REGION} `
if [[ $STATUS -ne "InService" ]] ;
    aws sagemaker  create-endpoint --endpoint-name  ServiceEndpoint  --endpoint-config-name ${CONFIGNAME} --region ${REGION}    
    aws sagemaker  update-endpoint --endpoint-name  ServiceEndpoint  --endpoint-config-name ${CONFIGNAME} --region ${REGION}

Grant permission

Before you execute the script, you must grant proper permission to Data Pipeline. Data Pipeline uses the DataPipelineDefaultResourceRole role by default. I added the following policy to DataPipelineDefaultResourceRole to allow Data Pipeline to create, delete, and update the Amazon SageMaker model and data source in the script.

 "Version": "2012-10-17",
 "Statement": [
 "Effect": "Allow",
 "Action": [
 "Resource": "*"

Use real-time prediction

After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint. This approach is useful for interactive web, mobile, or desktop applications.

Following, I provide a simple Python code example that queries against Amazon SageMaker endpoint URL with its name (“ServiceEndpoint”) and then uses them for real-time prediction.

=== Python sample for real-time prediction ===

#!/usr/bin/env python
import boto3
import json 

client = boto3.client('sagemaker-runtime', region_name ='<your region>' )
new_customer_info = '34,10,2,4,1,2,1,1,6,3,190,1,3,4,3,-1.7,94.055,-39.8,0.715,4991.6'
response = client.invoke_endpoint(
result = json.loads(response['Body'].read().decode())
--- output(response) ---
{u'predictions': [{u'score': 0.7528127431869507, u'predicted_label': 1.0}]}

Solution summary

The solution takes the following steps:

  1. Data Pipeline exports DynamoDB table data into S3. The original JSON data should be kept to recover the table in the rare event that this is needed. Data Pipeline then converts JSON to CSV so that Amazon SageMaker can read the data.Note: You should select only meaningful attributes when you convert CSV. For example, if you judge that the “campaign” attribute is not correlated, you can eliminate this attribute from the CSV.
  2. Train the Amazon SageMaker model with the new data source.
  3. When a new customer comes to your site, you can judge how likely it is for this customer to subscribe to your new product based on “predictedScores” provided by Amazon SageMaker.
  4. If the new user subscribes your new product, your application must update the attribute “y” to the value 1 (for yes). This updated data is provided for the next model renewal as a new data source. It serves to improve the accuracy of your prediction. With each new entry, your application can become smarter and deliver better predictions.

Running ad hoc queries using Amazon Athena

Amazon Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. Athena is useful for examining data and collecting statistics or informative summaries about data. You can also use the powerful analytic functions of Presto, as described in the topic Aggregate Functions of Presto in the Presto documentation.

With the Data Pipeline scheduled activity, recent CSV data is always located in S3 so that you can run ad hoc queries against the data using Amazon Athena. I show this with example SQL statements following. For an in-depth description of this process, see the post Interactive SQL Queries for Data in Amazon S3 on the AWS News Blog. 

Creating an Amazon Athena table and running it

Simply, you can create an EXTERNAL table for the CSV data on S3 in Amazon Athena Management Console.

=== Table Creation ===
 age int, 
 job string, 
 marital string , 
 education string, 
 default string, 
 housing string, 
 loan string, 
 contact string, 
 month string, 
 day_of_week string, 
 duration int, 
 campaign int, 
 pdays int , 
 previous int , 
 poutcome string, 
 emp_var_rate double, 
 cons_price_idx double,
 cons_conf_idx double, 
 euribor3m double, 
 nr_employed double, 
 y int 
LOCATION 's3://<your bucket name>/<datasource path>/';

The following query calculates the correlation coefficient between the target attribute and other attributes using Amazon Athena.

=== Sample Query ===

SELECT corr(age,y) AS correlation_age_and_target, 
 corr(duration,y) AS correlation_duration_and_target, 
 corr(campaign,y) AS correlation_campaign_and_target,
 corr(contact,y) AS correlation_contact_and_target
FROM ( SELECT age , duration , campaign , y , 
 CASE WHEN contact = 'telephone' THEN 1 ELSE 0 END AS contact 
 FROM datasource 
 ) datasource ;


In this post, I introduce an example of how to analyze data in DynamoDB by using table data in Amazon S3 to optimize DynamoDB table read capacity. You can then use the analyzed data as a new data source to train an Amazon SageMaker model for accurate real-time prediction. In addition, you can run ad hoc queries against the data on S3 using Amazon Athena. I also present how to automate these procedures by using Data Pipeline.

You can adapt this example to your specific use case at hand, and hopefully this post helps you accelerate your development. You can find more examples and use cases for Amazon SageMaker in the video AWS 2017: Introducing Amazon SageMaker on the AWS website.


Additional Reading

If you found this post useful, be sure to check out Serving Real-Time Machine Learning Predictions on Amazon EMR and Analyzing Data in S3 using Amazon Athena.


About the Author

Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”



Yahoo! Fined 35 Million USD For Late Disclosure Of Hack

Post Syndicated from Darknet original https://www.darknet.org.uk/2018/05/yahoo-fined-35-million-usd-for-late-disclosure-of-hack/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

Yahoo! Fined 35 Million USD For Late Disclosure Of Hack

Ah Yahoo! in trouble again, this time the news is Yahoo! fined for 35 million USD by the SEC for the 2 years delayed disclosure of the massive hack, we actually reported on the incident in 2016 when it became public – Massive Yahoo Hack – 500 Million Accounts Compromised.

Yahoo! has been having a rocky time for quite a few years now and just recently has sold Flickr to SmugMug for an undisclosed amount, I hope that at least helps pay off some of the fine.

Read the rest of Yahoo! Fined 35 Million USD For Late Disclosure Of Hack now! Only available at Darknet.

The Helium Factor and Hard Drive Failure Rates

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/helium-filled-hard-drive-failure-rates/

Seagate Enterprise Capacity 3.5 Helium HDD

In November 2013, the first commercially available helium-filled hard drive was introduced by HGST, a Western Digital subsidiary. The 6 TB drive was not only unique in being helium-filled, it was for the moment, the highest capacity hard drive available. Fast forward a little over 4 years later and 12 TB helium-filled drives are readily available, 14 TB drives can be found, and 16 TB helium-filled drives are arriving soon.

Backblaze has been purchasing and deploying helium-filled hard drives over the past year and we thought it was time to start looking at their failure rates compared to traditional air-filled drives. This post will provide an overview, then we’ll continue the comparison on a regular basis over the coming months.

The Promise and Challenge of Helium Filled Drives

We all know that helium is lighter than air — that’s why helium-filled balloons float. Inside of an air-filled hard drive there are rapidly spinning disk platters that rotate at a given speed, 7200 rpm for example. The air inside adds an appreciable amount of drag on the platters that in turn requires an appreciable amount of additional energy to spin the platters. Replacing the air inside of a hard drive with helium reduces the amount of drag, thereby reducing the amount of energy needed to spin the platters, typically by 20%.

We also know that after a few days, a helium-filled balloon sinks to the ground. This was one of the key challenges in using helium inside of a hard drive: helium escapes from most containers, even if they are well sealed. It took years for hard drive manufacturers to create containers that could contain helium while still functioning as a hard drive. This container innovation allows helium-filled drives to function at spec over the course of their lifetime.

Checking for Leaks

Three years ago, we identified SMART 22 as the attribute assigned to recording the status of helium inside of a hard drive. We have both HGST and Seagate helium-filled hard drives, but only the HGST drives currently report the SMART 22 attribute. It appears the normalized and raw values for SMART 22 currently report the same value, which starts at 100 and goes down.

To date only one HGST drive has reported a value of less than 100, with multiple readings between 94 and 99. That drive continues to perform fine, with no other errors or any correlating changes in temperature, so we are not sure whether the change in value is trying to tell us something or if it is just a wonky sensor.

Helium versus Air-Filled Hard Drives

There are several different ways to compare these two types of drives. Below we decided to use just our 8, 10, and 12 TB drives in the comparison. We did this since we have helium-filled drives in those sizes. We left out of the comparison all of the drives that are 6 TB and smaller as none of the drive models we use are helium-filled. We are open to trying different comparisons. This just seemed to be the best place to start.

Lifetime Hard Drive Failure Rates: Helium vs. Air-Filled Hard Drives table

The most obvious observation is that there seems to be little difference in the Annualized Failure Rate (AFR) based on whether they contain helium or air. One conclusion, given this evidence, is that helium doesn’t affect the AFR of hard drives versus air-filled drives. My prediction is that the helium drives will eventually prove to have a lower AFR. Why? Drive Days.

Let’s go back in time to Q1 2017 when the air-filled drives listed in the table above had a similar number of Drive Days to the current number of Drive Days for the helium drives. We find that the failure rate for the air-filled drives at the time (Q1 2017) was 1.61%. In other words, when the drives were in use a similar number of hours, the helium drives had a failure rate of 1.06% while the failure rate of the air-filled drives was 1.61%.

Helium or Air?

My hypothesis is that after normalizing the data so that the helium and air-filled drives have the same (or similar) usage (Drive Days), the helium-filled drives we use will continue to have a lower Annualized Failure Rate versus the air-filled drives we use. I expect this trend to continue for the next year at least. What side do you come down on? Will the Annualized Failure Rate for helium-filled drives be better than air-filled drives or vice-versa? Or do you think the two technologies will be eventually produce the same AFR over time? Pick a side and we’ll document the results over the next year and see where the data takes us.

The post The Helium Factor and Hard Drive Failure Rates appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

LC4: Another Pen-and-Paper Cipher

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/lc4_another_pen.html

Interesting symmetric cipher: LC4:

Abstract: ElsieFour (LC4) is a low-tech cipher that can be computed by hand; but unlike many historical ciphers, LC4 is designed to be hard to break. LC4 is intended for encrypted communication between humans only, and therefore it encrypts and decrypts plaintexts and ciphertexts consisting only of the English letters A through Z plus a few other characters. LC4 uses a nonce in addition to the secret key, and requires that different messages use unique nonces. LC4 performs authenticated encryption, and optional header data can be included in the authentication. This paper defines the LC4 encryption and decryption algorithms, analyzes LC4’s security, and describes a simple appliance for computing LC4 by hand.

Almost two decades ago I designed Solitaire, a pen-and-paper cipher that uses a deck of playing cards to store the cipher’s state. This algorithm uses specialized tiles. This gives the cipher designer more options, but it can be incriminating in a way that regular playing cards are not.

Still, I like seeing more designs like this.

Hacker News thread.

EC2 Fleet – Manage Thousands of On-Demand and Spot Instances with One Request

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-fleet-manage-thousands-of-on-demand-and-spot-instances-with-one-request/

EC2 Spot Fleets are really cool. You can launch a fleet of Spot Instances that spans EC2 instance types and Availability Zones without having to write custom code to discover capacity or monitor prices. You can set the target capacity (the size of the fleet) in units that are meaningful to your application and have Spot Fleet create and then maintain the fleet on your behalf. Our customers are creating Spot Fleets of all sizes. For example, one financial service customer runs Monte Carlo simulations across 10 different EC2 instance types. They routinely make requests for hundreds of thousands of vCPUs and count on Spot Fleet to give them access to massive amounts of capacity at the best possible price.

EC2 Fleet
Today we are extending and generalizing the set-it-and-forget-it model that we pioneered in Spot Fleet with EC2 Fleet, a new building block that gives you the ability to create fleets that are composed of a combination of EC2 On-Demand, Reserved, and Spot Instances with a single API call. You tell us what you need, capacity and instance-wise, and we’ll handle all the heavy lifting. We will launch, manage, monitor and scale instances as needed, without the need for scaffolding code.

You can specify the capacity of your fleet in terms of instances, vCPUs, or application-oriented units, and also indicate how much of the capacity should be fulfilled by Spot Instances. The application-oriented units allow you to specify the relative power of each EC2 instance type in a way that directly maps to the needs of your application. All three capacity specification options (instances, vCPUs, and application-oriented units) are known as weights.

I think you’ll find a number ways this feature makes managing a fleet of instances easier, and believe that you will also appreciate the team’s near-term feature roadmap of interest (more on that in a bit).

Using EC2 Fleet
There are a number of ways that you can use this feature, whether you’re running a stateless web service, a big data cluster or a continuous integration pipeline. Today I’m going to describe how you can use EC2 Fleet for genomic processing, but this is similar to workloads like risk analysis, log processing or image rendering. Modern DNA sequencers can produce multiple terabytes of raw data each day, to process that data into meaningful information in a timely fashion you need lots of processing power. I’ll be showing you how to deploy a “grid” of worker nodes that can quickly crunch through secondary analysis tasks in parallel.

Projects in genomics can use the elasticity EC2 provides to experiment and try out new pipelines on hundreds or even thousands of servers. With EC2 you can access as many cores as you need and only pay for what you use. Prior to today, you would need to use the RunInstances API or an Auto Scaling group for the On-Demand & Reserved Instance portion of your grid. To get the best price performance you’d also create and manage a Spot Fleet or multiple Spot Auto Scaling groups with different instance types if you wanted to add Spot Instances to turbo-boost your secondary analysis. Finally, to automate scaling decisions across multiple APIs and Auto Scaling groups you would need to write Lambda functions that periodically assess your grid’s progress & backlog, as well as current Spot prices – modifying your Auto Scaling Groups and Spot Fleets accordingly.

You can now replace all of this with a single EC2 Fleet, analyzing genomes at scale for as little as $1 per analysis. In my grid, each step in in the pipeline requires 1 vCPU and 4 GiB of memory, a perfect match for M4 and M5 instances with 4 GiB of memory per vCPU. I will create a fleet using M4 and M5 instances with weights that correspond to the number of vCPUs on each instance:

  • m4.16xlarge – 64 vCPUs, weight = 64
  • m5.24xlarge – 96 vCPUs, weight = 96

This is expressed in a template that looks like this:

"Overrides": [
  "InstanceType": "m4.16xlarge",
  "WeightedCapacity": 64,
  "InstanceType": "m5.24xlarge",
  "WeightedCapacity": 96,

By default, EC2 Fleet will select the most cost effective combination of instance types and Availability Zones (both specified in the template) using the current prices for the Spot Instances and public prices for the On-Demand Instances (if you specify instances for which you have matching RIs, your discounts will apply). The default mode takes weights into account to get the instances that have the lowest price per unit. So for my grid, fleet will find the instance that offers the lowest price per vCPU.

Now I can request capacity in terms of vCPUs, knowing EC2 Fleet will select the lowest cost option using only the instance types I’ve defined as acceptable. Also, I can specify how many vCPUs I want to launch using On-Demand or Reserved Instance capacity and how many vCPUs should be launched using Spot Instance capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 2880,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 1920,
	"DefaultTargetCapacityType": "Spot"

The above means that I want a total of 2880 vCPUs, with 960 vCPUs fulfilled using On-Demand and 1920 using Spot. The On-Demand price per vCPU is lower for m5.24xlarge than the On-Demand price per vCPU for m4.16xlarge, so EC2 Fleet will launch 10 m5.24xlarge instances to fulfill 960 vCPUs. Based on current Spot pricing (again, on a per-vCPU basis), EC2 Fleet will choose to launch 30 m4.16xlarge instances or 20 m5.24xlarges, delivering 1920 vCPUs either way.

Putting it all together, I have a single file (fl1.json) that describes my fleet:

    "LaunchTemplateConfigs": [
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0e8c754449b27161c",
                "Version": "1"
        "Overrides": [
          "InstanceType": "m4.16xlarge",
          "WeightedCapacity": 64,
          "InstanceType": "m5.24xlarge",
          "WeightedCapacity": 96,
    "TargetCapacitySpecification": {
        "TotalTargetCapacity": 2880,
        "OnDemandTargetCapacity": 960,
        "SpotTargetCapacity": 1920,
        "DefaultTargetCapacityType": "Spot"

I can launch my fleet with a single command:

$ aws ec2 create-fleet --cli-input-json file://home/ec2-user/fl1.json

My entire fleet is created within seconds and was built using 10 m5.24xlarge On-Demand Instances and 30 m4.16xlarge Spot Instances, since the current Spot price was 1.5¢ per vCPU for m4.16xlarge and 1.6¢ per vCPU for m5.24xlarge.

Now lets imagine my grid has crunched through its backlog and no longer needs the additional Spot Instances. I can then modify the size of my fleet by changing the target capacity in my fleet specification, like this:

    "TotalTargetCapacity": 960,

Since 960 was equal to the amount of On-Demand vCPUs I had requested, when I describe my fleet I will see all of my capacity being delivered using On-Demand capacity:

"TargetCapacitySpecification": {
	"TotalTargetCapacity": 960,
	"OnDemandTargetCapacity": 960,
	"SpotTargetCapacity": 0,
	"DefaultTargetCapacityType": "Spot"

When I no longer need my fleet I can delete it and terminate the instances in it like this:

$ aws ec2 delete-fleets --fleet-id fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a \
    "UnsuccessfulFleetDletetions": [],
    "SuccessfulFleetDeletions": [
            "CurrentFleetState": "deleted_terminating",
            "PreviousFleetState": "active",
            "FleetId": "fleet-838cf4e5-fded-4f68-acb5-8c47ee1b248a"

Earlier I described how RI discounts apply when EC2 Fleet launches instances for which you have matching RIs, so you might be wondering how else RI customers benefit from EC2 Fleet. Let’s say that I own regional RIs for M4 instances. In my EC2 Fleet I would remove m5.24xlarge and specify m4.10xlarge and m4.16xlarge. Then when EC2 Fleet creates the grid, it will quickly find M4 capacity across the sizes and AZs I’ve specified, and my RI discounts apply automatically to this usage.

In the Works
We plan to connect EC2 Fleet and EC2 Auto Scaling groups. This will let you create a single fleet that mixed instance types and Spot, Reserved and On-Demand, while also taking advantage of EC2 Auto Scaling features such as health checks and lifecycle hooks. This integration will also bring EC2 Fleet functionality to services such as Amazon ECS, Amazon EKS, and AWS Batch that build on and make use of EC2 Auto Scaling for fleet management.

Available Now
You can create and make use of EC2 Fleets today in all public AWS Regions!