Tag Archives: dna

Amazon Redshift Dense Compute (DC2) Nodes Deliver Twice the Performance as DC1 at the Same Price

Post Syndicated from Quaseer Mujawar original https://aws.amazon.com/blogs/big-data/amazon-redshift-dense-compute-dc2-nodes-deliver-twice-the-performance-as-dc1-at-the-same-price/

Amazon Redshift makes analyzing exabyte-scale data fast, simple, and cost-effective. It delivers advanced data warehousing capabilities, including parallel execution, compressed columnar storage, and end-to-end encryption as a fully managed service, for less than $1,000/TB/year. With Amazon Redshift Spectrum, you can run SQL queries directly against exabytes of unstructured data in Amazon S3 for $5/TB scanned.

Today, we are making our Dense Compute (DC) family faster and more cost-effective with new second-generation Dense Compute (DC2) nodes at the same price as our previous generation DC1. DC2 is designed for demanding data warehousing workloads that require low latency and high throughput. DC2 features powerful Intel E5-2686 v4 (Broadwell) CPUs, fast DDR4 memory, and NVMe-based solid state disks.

We’ve tuned Amazon Redshift to take advantage of the better CPU, network, and disk on DC2 nodes, providing up to twice the performance of DC1 at the same price. Our DC2.8xlarge instances now provide twice the memory per slice of data and an optimized storage layout with 30 percent better storage utilization.

Customer successes

Several flagship customers, ranging from fast-growing startups to large Fortune 100 companies, previewed the new DC2 node type. In their tests, DC2 provided up to twice the performance of DC1. Our preview customers saw faster ETL (extract, transform, and load) jobs, higher query throughput, better concurrency, faster reports, and shorter data-to-insight times, all at the same cost as DC1. DC2.8xlarge customers also noted that their databases used up to 30 percent less disk space due to our optimized storage format, reducing their costs.

4Cite Marketing, one of America’s fastest growing private companies, uses Amazon Redshift to analyze customer data and determine personalized product recommendations for retailers. “Amazon Redshift’s new DC2 node is giving us a 100 percent performance increase, allowing us to provide faster insights for our retailers, more cost-effectively, to drive incremental revenue,” said Jim Finnerty, 4Cite’s senior vice president of product.

BrandVerity, a Seattle-based brand protection and compliance‎ company, provides solutions to monitor, detect, and mitigate online brand, trademark, and compliance abuse. “We saw a 70 percent performance boost with the DC2 nodes for running Redshift Spectrum queries. As a result, we can analyze far more data for our customers and deliver results much faster,” said Hyung-Joon Kim, principal software engineer at BrandVerity.

“Amazon Redshift is at the core of our operations and our marketing automation tools,” said Jarno Kartela, head of analytics and chief data scientist at DNA Plc, one of the leading Finnish telecommunications groups and Finland’s largest cable operator and pay TV provider. “We saw a 52 percent performance gain in moving to Amazon Redshift’s DC2 nodes. We can now run queries in half the time, allowing us to provide more analytics power and reduce time-to-insight for our analytics and marketing automation users.”

You can read about their experiences on our Customer Success page.

Get started

You can try the new node type using our getting started guide. Just choose dc2.large or dc2.8xlarge in the Amazon Redshift console:

If you have a DC1.large Amazon Redshift cluster, you can restore to a new DC2.large cluster using an existing snapshot. To migrate from DS2.xlarge, DS2.8xlarge, or DC1.8xlarge Amazon Redshift clusters, you can use the resize operation to move data to your new DC2 cluster. For more information, see Clusters and Nodes in Amazon Redshift.
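
As an illustration only (the cluster and snapshot identifiers below are placeholders), the snapshot restore onto DC2 hardware can also be done from the CLI, since restore-from-cluster-snapshot accepts a node type:

# identifiers below are placeholders for your own cluster and snapshot names
aws redshift restore-from-cluster-snapshot \
    --cluster-identifier my-dc2-cluster \
    --snapshot-identifier my-dc1-snapshot \
    --node-type dc2.large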

To get the latest Amazon Redshift feature announcements, check out our What’s New page, and subscribe to the RSS feed.

Top 10 Most Pirated Movies of The Week on BitTorrent – 09/11/17

Post Syndicated from Ernesto original https://torrentfreak.com/top-10-pirated-movies-week-bittorrent-091117/

This week we have three newcomers in our chart.

Pirates of the Caribbean: Dead Men Tell No Tales is the most downloaded movie.

The data for our weekly download chart is estimated by TorrentFreak, and is for informational and educational reference only. All the movies in the list are Web-DL/Webrip/HDRip/BDrip/DVDrip unless stated otherwise.

RSS feed for the weekly movie download chart.

This week’s most downloaded movies are:
Most downloaded movies via torrents

Rank (last week) | Movie name | IMDb rating / trailer
1 (…) | Pirates of the Caribbean: Dead Men Tell No Tales | 6.9 / trailer
2 (1) | Hitman’s Bodyguard | 7.2 / trailer
3 (2) | Wonder Woman | 8.2 / trailer
4 (3) | The Mummy 2017 | 5.8 / trailer
5 (…) | The Big Sick | 6.9 / trailer
6 (4) | Despicable Me 3 | 6.4 / trailer
7 (5) | Baywatch | 5.7 / trailer
8 (…) | Kidnap | 6.9 / trailer
9 (6) | Guardians of the Galaxy Vol. 2 | 8.0 / trailer
10 (8) | Spider-Man: Homecoming (HDTS) | 8.0 / trailer

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Hacking a Gene Sequencer by Encoding Malware in a DNA Strand

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/08/hacking_a_gene_.html

One of the common ways to hack a computer is to mess with its input data. That is, if you can feed the computer data that it interprets — or misinterprets — in a particular way, you can trick the computer into doing things that it wasn’t intended to do. This is basically what a buffer overflow attack is: the data input overflows a buffer and ends up being executed by the computer process.

Well, some researchers did this with a computer that processes DNA, and they encoded their malware in the DNA strands themselves:

To make the malware, the team translated a simple computer command into a short stretch of 176 DNA letters, denoted as A, G, C, and T. After ordering copies of the DNA from a vendor for $89, they fed the strands to a sequencing machine, which read off the gene letters, storing them as binary digits, 0s and 1s.

Erlich says the attack took advantage of a spill-over effect, when data that exceeds a storage buffer can be interpreted as a computer command. In this case, the command contacted a server controlled by Kohno’s team, from which they took control of a computer in their lab they were using to analyze the DNA file.

News articles. Research paper.

Announcing the New Customer Compliance Center

Post Syndicated from Chad Woolf original https://aws.amazon.com/blogs/security/announcing-the-new-customer-compliance-center/

AWS has the longest running, most effective, and most customer-obsessed compliance program in the cloud market. We have always centered our program around customers, obtaining the certifications needed to provide our customers with the proper level of validated transparency in order to enable them to certify their own AWS workloads [download .pdf of AWS certifications]. We also offer a rich suite of embedded compliance tooling, enabling customers and partners to more effectively manage security controls and in turn provide evidence of effective control operation to their auditors. Along with our customers and partners, we have the largest, most diverse, and most comprehensive compliance footprint in the industry.

Enabling customers is a core part of the AWS DNA. Today, in the spirit of that pedigree, I’m happy to announce we’ve launched a new AWS Customer Compliance Center. This center is focused on the security and compliance of our customers on AWS. You can learn from other customer experiences and discover how your peers have solved the compliance, governance, and audit challenges present in today’s regulatory environment. You can also access our industry-first cloud Auditor Learning Path via the customer center. These online university learning resources are logical learning paths, specifically designed for security, compliance and audit professionals, allowing you to build on the IT skills you have to move your environment to the next generation of audit and security assurance. As we engage with our security and compliance customer colleagues on this topic, we will continue to update and improve upon the existing resource and publish new enablers in the coming months.

We are excited to continue to work with our customers on moving from the old-guard manual audit world to the new cloud-enabled, automated, “secure and compliant by default” model we’ve been leading over the past few years.

– Chad Woolf, AWS Security & Compliance

A new twist on data backup: CloudNAS

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/cloudnas-backup/

Morro CacheDrive

There are many ways for SMBs, professionals, and advanced users to back up their data. The process can be as simple as copying files to a flash drive or an external drive, or as sophisticated as using a Synology or QNAP NAS device as your primary storage device and syncing the files to a cloud storage service such as Backblaze B2.

A recent entry into the backup arena is Morro Data and their CloudNAS solution, where files are stored in the cloud, cached locally as needed, and synced globally among the other CloudNAS systems in a given organization. There are three components to the solution:

  • A Morro CacheDrive — This resides on your internal network like a NAS device and stores from 1 TB to 8 TB of data, depending on the model
  • The CloudNAS service — This software runs on the Morro CacheDrive to keep track of and manage the data
  • Backblaze B2 Cloud Storage — Where the data is stored in the cloud

The Morro CacheDrive is installed on your local network and looks like a network share. On Windows, the share can be mounted as a drive letter, M: for example. On the Mac, the device is mounted as a Shared device (Databank in the example below).

CloudNAS software dashboard

In either case, the device works like a folder/directory, typically on your desktop. You then either drag and drop or save a file to the folder/directory. This places the file on the CacheDrive. Once there, the file is automatically backed up to the cloud. In the case of the CloudNAS solution, that cloud is Backblaze B2.

All that sounds pretty straightforward, but what makes the CloudNAS solution unique is that it allows you to have effectively unlimited storage space. For example, you can access 5 TB of data from a 1 TB CacheDrive. Confused? Let me explain. All 5 TB of the data is stored in B2, having been uploaded to B2 each time you stored data on the CacheDrive. The 1 TB CacheDrive keeps (caches) the most recent or most often used files locally. When you need a file that isn’t currently stored on the CacheDrive, the CloudNAS software automatically downloads it from the B2 cloud to the CacheDrive and makes it available to use as desired.

Things to know about the CloudNAS solution

  • Sharing Systems: Multiple users can mount the same CacheDrive with each being able to update and share the files.
  • Synced Systems: If you have two or more CloudNAS systems on your network, they will keep the B2 directory of files synced between all of the systems. Everyone on the network sees the same file list.
  • Unlimited Data: Regardless of the size of the CacheDrive device you purchase, you will not run out of space as Backblaze B2 will contain all of your data. That said, you should choose the size of your CacheDrive that fits your operational environment.
  • Network Speed: Files are initially stored on the CacheDrive, then copied to B2. Local network connections are typically much faster than internet connections. This means your files are uploaded to the CacheDrive quickly, then transferred to B2 as time allows at the speed of your internet connection, all without slowing you down. This should be interesting to those of you who have slower internet connections.
  • Access: The files stored using the Cloud NAS solution can be accessed through the shared folder/directory on your desktop as well as through a web-based Team Portal.

Getting Started

To start, you purchase a Morro CacheDrive. The price starts at $499.00 for a unit with 1 TB of cache storage. Next you choose a CloudNAS subscription. This starts at $10/month for the Standard plan, and lets you manage up to 10 TB of data. Finally, you connect Backblaze B2 to the Morro system to finish the set-up process. You pay Backblaze each month for the data you store in and download from B2 while using the Morro solution.

The CloudNAS solution is certainly a different approach to storing your data. You get the ability to store a nearly unlimited amount of data without having to upgrade your hardware as you go, and all of your data is readily available with just a few clicks. For users who need to store terabytes of data that need to be available anytime, the CloudNAS solution is worth a look.

The post A new twist on data backup: CloudNAS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Tijuana Rick’s 1969 Wurlitzer Jukebox revitalisation

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/1969-wurlitzer-jukebox/

After Tijuana Rick’s father-in-law came by a working 1969 Wurlitzer 3100 jukebox earlier this year, he and Tijuana Rick quickly realised they lacked the original 45s to play on it. When they introduced a Raspberry Pi 3 into the mix, this was no longer an issue.

1969 Wurlitzer 3100

Restored and retrofitted Jukebox with Arduino and Raspberry Pi

Tijuana Rick

Yes, I shall be referring to Rick as Tijuana Rick throughout this blog post. Be honest, wouldn’t you if you were writing about someone whose moniker is Tijuana Rick?

Wurlitzer

The Wurlitzer jukebox has to be one of the classic icons of Americana. It evokes images of leather-booth-lined diners filled with rock ‘n’ roll music and teddy-haired bad boys eyeing Cherry Cola-sipping Nancys and Sandys across the checkered tile floor.

Raspberry Pi Wurlitzer

image courtesy of Ariadna Bach

With its brightly lit exterior and visible record-changing mechanism, the Wurlitzer is more than just your average pub jukebox. I should know: I have an average pub jukebox in my house, and although there’s some wonderfully nostalgic joy in pressing its buttons to play my favourite track, it’s not a Wurlitzer.

Raspberry Pi Wurlitzer

Americana – exactly what it says on the tin jukebox

The Wurlitzer company was founded in 1853 by a German immigrant called – you guessed it – Rudolf Wurlitzer, and at first it imported stringed instruments for the U.S. military. When the company moved from Ohio to New York, it expanded its production range to electric pianos, organs, and jukeboxes.

And thus ends today’s history lesson.

Tijuana Rick and the Wurlitzer

Since he had prior experience in repurposing physical switches for digital ends, Tijuana Rick felt confident that he could modify the newly acquired jukebox to play MP3s while still using the standard, iconic track selection process.

Raspberry Pi Wurlitzer

In order to do this, however, he had to venture into brand-new territory: mould making. Since many of the Wurlitzer’s original buttons were in disrepair, Tijuana Rick decided to try his hand at making moulds to create a set of replacements. Using an original button, he made silicone moulds, and then produced perfect button clones in exactly the right shade of red.

Raspberry Pi Wurlitzer

Then he turned to the computing side of the project. While he set up an Arduino Mega to control the buttons, Tijuana Rick decided to use a Raspberry Pi to handle the audio playback. After an extensive online search for code inspiration, he finally found this script by Thomas Sprinkmeier and used it as the foundation for the project’s software.

More images and video of the build can be found on Tijuana Rick’s website.

Fixer-uppers

We see a lot of tech upgrades and restorations using Raspberry Pis, from old cameras such as this Mansfield Holiday Zoom, and toys like this beloved Teddy Ruxpin, to… well… dinosaurs. If a piece of retro tech has any room at all for a Pi or a Pi Zero, someone in the maker community is bound to give it a 21st century overhaul.

What have been your favourite Pi retrofit projects so far? Have you seen a build that’s inspired you to restore or recreate something from your past? Got any planned projects or successful hacks? Make sure to share them in the comments below!

The post Tijuana Rick’s 1969 Wurlitzer Jukebox revitalisation appeared first on Raspberry Pi.

Hard Drive Cost Per Gigabyte

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/

Hard Drive Cost

For hard drive prices, the race to zero is over: nobody won. For the past 35+ years, hard drive prices have dropped, from around $500,000 per gigabyte in 1981 to less than $0.03 per gigabyte today. This includes the period of the Thailand drive crisis in 2012 that spiked hard drive prices. Matthew Komorowski has done an admirable job of documenting the hard drive price curve through March 2014, and we’d like to fill in the blanks with our own drive purchase data to complete the picture. As you’ll see, the hard drive pricing curve has flattened out.

75,000 New Hard Drives

We first looked at the cost per gigabyte of a hard drive in 2013 when we examined the effects of the Thailand drive crisis on our business. When we wrote that post, the cost per gigabyte for a 4 TB hard drive was about $0.04 per gigabyte. Since then, 5 TB, 6 TB, 8 TB, and most recently 10 TB hard drives have been introduced, and during that period we have purchased nearly 75,000 drives. Below is a chart, by drive size, of the drives we purchased since that last report in 2013.

Hard Drive Cost Per GB by drive size

Observations

  1. We purchase drives in bulk, thousands at a time. The price you might get at Costco or Best Buy, or on Amazon, will most likely be higher.
  2. The effect of the Thailand Drive crisis is clearly seen from October 2011 through mid-2013.

The 4 TB Drive Enigma

Up through the 4 TB drive models, the cost per gigabyte of a larger sized drive always became less than the smaller sized drives. In other words, the cost per gigabyte of a 2 TB drive was less than that of a 1 TB drive resulting in higher density at a lower cost per gigabyte. This changed with the introduction of 6- and 8 TB drives, especially as it relates to the 4 TB drives. As you can see in the chart above, the cost per gigabyte of the 6 TB drives did not fall below that of the 4 TB drives. You can also observe that the 8 TB drives are just approaching the cost per gigabyte of the 4 TB drives. The 4 TB drives are the price king as seen in the chart below of the current cost of Seagate consumer drives by size.

Seagate Hard Drive Prices By Size

Drive Size | Model | Price | Cost/GB
1 TB | ST1000DM010 | $49.99 | $0.050
2 TB | ST2000DM006 | $66.99 | $0.033
3 TB | ST3000DM008 | $83.72 | $0.028
4 TB | ST4000DM005 | $99.99 | $0.025
6 TB | ST6000DM004 | $240.00 | $0.040
8 TB | ST8000DM005 | $307.34 | $0.038

The data on this chart was sourced from the current price of these drives on Amazon. The drive models selected were “consumer” drives, like those we typically use in our data centers.
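
As a quick sanity check, the Cost/GB column is simply the price divided by the capacity in decimal gigabytes (1 TB = 1,000 GB). A one-liner reproduces three of the values above:

awk 'BEGIN { printf "1TB: $%.3f/GB  4TB: $%.3f/GB  6TB: $%.3f/GB\n", 49.99/1000, 99.99/4000, 240.00/6000 }'
# prints: 1TB: $0.050/GB  4TB: $0.025/GB  6TB: $0.040/GB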

The manufacturing and marketing efficiencies that drive the pricing of hard drives seem to have changed over time. For example, the 6 TB drives have been in the market at least 3 years, but are not even close to the cost per gigabyte of the 4 TB drives. Meanwhile, back in 2011, the 3 TB drive models fell below the cost per gigabyte of the 2 TB drives they “replaced” within a few months. Have we as consumers decided that 4 TB drives are “big enough” for our needs, so that we are not demanding (by purchasing) larger sized drives in the quantities needed to push down the unit cost?

Approaching Zero: There’s a Limit

The important aspect is the trend of the cost over time. While it has continued to move downward, the rate of change has slowed dramatically as observed in the chart below which represents our average quarterly cost per gigabyte over time.

Hard Drive Cost per GB over time

The rate at which the cost per gigabyte of a hard drive declines is itself slowing. For example, from January 2009 to January 2011, our average cost for a hard drive decreased 45% from $0.11 to $0.06 – a drop of $0.05 per gigabyte. From January 2015 to January 2017, the average cost decreased 26% from $0.038 to $0.028 – a drop of just $0.01 per gigabyte. This means that the declining price of storage will become less relevant in driving down the cost of providing storage.

Back in 2011, IDC predicted that overall data will grow by 50 times by 2020, and in 2014, EMC estimated that by 2020, we will be creating 44 trillion gigabytes of data annually. That’s quite a challenge for the storage industry, especially as the cost per gigabyte curve for hard drives is flattening out. Improvements in existing storage technologies (Helium, HAMR) along with future technologies (Quantum Storage, DNA) are on the way – we can’t wait. Of course, we’d like these new storage devices to be 50% less expensive per gigabyte than today’s hard drives. That would be a good start.

The post Hard Drive Cost Per Gigabyte appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

New – API & CloudFormation Support for Amazon CloudWatch Dashboards

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-api-cloudformation-support-for-amazon-cloudwatch-dashboards/

We launched CloudWatch Dashboards a couple of years ago. In the post that I wrote for the launch, I showed you how to interactively create a dashboard that displayed chosen CloudWatch metrics in graphical form. After the launch, we added additional features including a full screen mode, a dark theme, control over the range of the Y axis, simplified renaming, persistent storage, and new visualization options.

New API & CLI
While console support is wonderful for interactive use, many customers have asked us to support programmatic creation and manipulation of dashboards and the widgets within. They would like to dynamically build and maintain dashboards, adding and removing widgets as the corresponding AWS resources are created and destroyed. Other customers are interested in setting up and maintaining a consistent set of dashboards across two or more AWS accounts.

I am happy to announce that API, CLI, and AWS CloudFormation support for CloudWatch Dashboards is available now and that you can start using it today!

There are four new API functions (and equivalent CLI commands):

ListDashboards / aws cloudwatch list-dashboards – Fetch a list of all dashboards within an account, or a subset that share a common prefix.

GetDashboard / aws cloudwatch get-dashboard – Fetch details for a single dashboard.

PutDashboard / aws cloudwatch put-dashboard – Create a new dashboard or update an existing one.

DeleteDashboards / aws cloudwatch delete-dashboards – Delete one or more dashboards.

Dashboard Concepts
I want to show you how to use these functions and commands. Before I dive in, I should review a couple of important dashboard concepts and attributes.

Global – Dashboards are part of an AWS account, and are not associated with a specific AWS Region. Each account can have up to 500 dashboards.

Named – Each dashboard has a name that is unique within the AWS account. Names can be up to 255 characters long.

Grid Model – Each dashboard is composed of a grid of cells. The grid is 24 cells across and as tall as necessary. Each widget on the dashboard is positioned at a particular set of grid coordinates, and has a size that spans an integral number of grid cells.

Widgets (Visualizations) – Each widget can display text or a set of CloudWatch metrics. Text is specified using Markdown; metrics can be displayed as single values, line charts, or stacked area charts. Each dashboard can have up to 100 widgets. Widgets that display metrics can also be associated with a CloudWatch Alarm.

Dashboards have a JSON representation that you can now see and edit from within the console. Simply click on the Action menu and choose View/edit source:

Here’s the source for my dashboard:

You can use this JSON as a starting point for your own applications. As you can see, there’s an entry in the widgets array for each widget on the dashboard; each entry describes one widget, starting with its type, position, and size.
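
The screenshot of my dashboard’s source isn’t reproduced here, but as an illustration, a minimal single-widget body following the same structure could be created with the put-dashboard command covered below (the instance ID is a placeholder):

# the InstanceId value below is a placeholder
aws cloudwatch put-dashboard --dashboard-name Example --dashboard-body '
{"widgets": [
  {"type": "metric", "x": 0, "y": 0, "width": 6, "height": 6,
   "properties": {"view": "timeSeries", "stacked": false,
                  "metrics": [["AWS/EC2", "NetworkIn", "InstanceId", "i-0123456789abcdef0"]],
                  "period": 300, "stat": "Average", "region": "us-east-1",
                  "title": "Example widget"}}]}'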

Creating a Dashboard Using the API
Let’s say I want to create a dashboard that has a widget for each of my EC2 instances in a particular region. I’ll use Python and the AWS SDK for Python, and start as follows (excuse the amateur nature of my code):

import boto3
import json

cw  = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")

x, y          = [0, 0]
width, height = [3, 3]
max_width     = 12
widgets       = []

Then I simply iterate over the instances, creating a widget dictionary for each one, and appending it to the widgets array:

instances = ec2.describe_instances()
for r in instances['Reservations']:
    for i in r['Instances']:

        widget = {'type'      : 'metric',
                  'x'         : x,
                  'y'         : y,
                  'height'    : height,
                  'width'     : width,
                  'properties': {'view'    : 'timeSeries',
                                 'stacked' : False,
                                 'metrics' : [['AWS/EC2', 'NetworkIn', 'InstanceId', i['InstanceId']],
                                              ['.',       'NetworkOut', '.',         '.']
                                             ],
                                 'period'  : 300,
                                 'stat'    : 'Average',
                                 'region'  : 'us-east-1',
                                 'title'   : i['InstanceId']
                                }
                 }

        widgets.append(widget)

I update the position (x and y) within the loop, and form a grid (if I don’t specify positions, the widgets will be laid out left to right, top to bottom):

        x += width
        if (x + width > max_width):
            x = 0
            y += height

After I have processed all of the instances, I create a JSON version of the widget array:

body   = {'widgets' : widgets}
body_j = json.dumps(body)

And I create or update my dashboard:

cw.put_dashboard(DashboardName = "EC2_Networking",
                 DashboardBody = body_j)

I run the code, and get the following dashboard:

The CloudWatch team recommends that dashboards created programmatically include a text widget indicating that the dashboard was generated automatically, along with a link to the source code or CloudFormation template that did the work. This will discourage users from making manual, out-of-band changes to the dashboards.

As I mentioned earlier, each metric widget can also be associated with a CloudWatch Alarm. You can create the alarms programmatically or by using a CloudFormation template such as the Sample CPU Utilization Alarm. If you decide to do this, the alarm threshold will be displayed in the widget. To learn more about this, read Tara Walker’s recent post, Amazon CloudWatch Launches Alarms on Dashboards.

Going one step further, I could use CloudWatch Events and a Lambda function to track the creation and deletion of certain resources and update a dashboard in concert with the changes. To learn how to do this, read Keeping CloudWatch Dashboards up to Date Using AWS Lambda.

Accessing a Dashboard Using the CLI
I can also access and manipulate my dashboards from the command line. For example, I can generate a simple list:

$ aws cloudwatch list-dashboards --output table
----------------------------------------------
|               ListDashboards               |
+--------------------------------------------+
||             DashboardEntries             ||
|+-----------------+----------------+-------+|
||  DashboardName  | LastModified   | Size  ||
|+-----------------+----------------+-------+|
||  Disk-Metrics   |  1496405221.0  |  316  ||
||  EC2_Networking |  1498090434.0  |  2830 ||
||  Main-Metrics   |  1498085173.0  |  234  ||
|+-----------------+----------------+-------+|

And I can get rid of the Disk-Metrics dashboard:

$ aws cloudwatch delete-dashboards --dashboard-names Disk-Metrics

I can also retrieve the JSON that defines a dashboard:
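
For the dashboard created earlier, that call looks like the following (the original screenshot of the output isn’t reproduced here); the response returns the DashboardBody as a JSON string:

aws cloudwatch get-dashboard --dashboard-name EC2_Networking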

Creating a Dashboard Using CloudFormation
Dashboards can also be specified in CloudFormation templates. Here’s a simple template in YAML (the DashboardBody is still specified in JSON):

Resources:
  MyDashboard:
    Type: "AWS::CloudWatch::Dashboard"
    Properties:
      DashboardName: SampleDashboard
      DashboardBody: '{"widgets":[{"type":"text","x":0,"y":0,"width":6,"height":6,"properties":{"markdown":"Hi there from CloudFormation"}}]}'

I place the template in a file and then create a stack using the console or the CLI:

$ aws cloudformation create-stack --stack-name MyDashboard --template-body file://dash.yaml
{
    "StackId": "arn:aws:cloudformation:us-east-1:xxxxxxxxxxxx:stack/MyDashboard/a2a3fb20-5708-11e7-8ffd-500c21311262"
}

Here’s the dashboard:

Available Now
This feature is available now and you can start using it today. You can create 3 dashboards with up to 50 metrics per dashboard at no charge; additional dashboards are priced at $3 per month, as listed on the CloudWatch Pricing page. You can make up to 1 million calls to the new API functions each month at no charge; beyond that you pay $.01 for every 1,000 calls.

Jeff;

ISP Doesn’t Have to Expose Alleged BitTorrent Pirates, Finnish Court Rules

Post Syndicated from Ernesto original https://torrentfreak.com/isp-doesnt-have-to-expose-alleged-bittorrent-pirates-finnish-court-rules-170615/

Starting three years ago, copyright holders began sending out thousands of settlement letters to alleged pirates in Finland, a practice often described as copyright trolling.

This week, however, the local Market Court has put the brakes on these efforts, with a rather significant ruling.

In the case in question, filmmakers requested the personal information of hundreds of alleged BitTorrent users from Internet provider DNA. However, after a careful review by a panel of seven judges, the Court decided not to grant the request.

The rightsholders provided a detailed log from a BitTorrent monitoring tool as evidence. While the Court didn’t doubt that the pirated material had been shared, it questioned how significant the infringements were.

The provided list of IP addresses and timestamps doesn’t show how much data was shared, or for how long.

The evidence included an overview of the total number of users sharing the same file in a single BitTorrent swarm. However, the fact that thousands of people were sharing the same file says nothing about the significance of individual infringements.

“[T]he applicant has not claimed or provided any explanation that would indicate that the distribution of its work, by an IP address in the application, would have repeatedly occurred or for a longer period of time,” the Market Court writes.

The verdict, first reported by Iltalehti, refers to a recent case at the European Court of Justice and stresses that the significance of an infringement must be weighed against the defendants’ privacy rights. In this case, the court decided that the evidence doesn’t warrant the exposure of the alleged pirates.

“Since the applicant has not provided sufficient proof of compliance with the conditions set out in Article 60a of the Copyright Act to adoption of an application, the application must be dismissed,” the Market Court writes.

The outcome is a clear victory for the accused BitTorrent users. Time will tell whether rightsholders will adapt their evidence to the ruling, or whether they will test their luck elsewhere. The current ruling can still be appealed.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Building High-Throughput Genomic Batch Workflows on AWS: Batch Layer (Part 3 of 4)

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/building-high-throughput-genomic-batch-workflows-on-aws-batch-layer-part-3-of-4/

Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at AWS

Angel Pizarro is a Scientific Computing Technical Business Development Manager at AWS

This post is the third in a series on how to build a genomics workflow on AWS. In Part 1, we introduced a general architecture, shown below, and highlighted the three common layers in a batch workflow:

  • Job
  • Batch
  • Workflow

In Part 2, you built a Docker container for each job that needed to run as part of your workflow, and stored them in Amazon ECR.

In Part 3, you tackle the batch layer and build a scalable, elastic, and easily maintainable batch engine using AWS Batch.

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (for example, CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs that you submit. With AWS Batch, you do not need to install and manage your own batch computing software or server clusters, which allows you to focus on analyzing results, such as those of your genomic analysis.

Integrating applications into AWS Batch

If you are new to AWS Batch, we recommend reading Setting Up AWS Batch to ensure that you have the proper permissions and AWS environment.

After you have a working environment, you define several types of resources:

  • IAM roles that provide service permissions
  • A compute environment that launches and terminates compute resources for jobs
  • A custom Amazon Machine Image (AMI)
  • A job queue to submit the units of work and to schedule the appropriate resources within the compute environment to execute those jobs
  • Job definitions that define how to execute an application

After the resources are created, you’ll test the environment and create an AWS Lambda function to send generic jobs to the queue.

This genomics workflow covers the basic steps. For more information, see Getting Started with AWS Batch.

Creating the necessary IAM roles

AWS Batch simplifies batch processing by managing a number of underlying AWS services so that you can focus on your applications. As a result, you create IAM roles that give the service permissions to act on your behalf. In this section, deploy the AWS CloudFormation template included in the GitHub repository and extract the ARNs for later use.

To deploy the stack, go to the top level of the repo and run the following command:

aws cloudformation create-stack --template-body file://batch/setup/iam.template.yaml --stack-name iam --capabilities CAPABILITY_NAMED_IAM

You can capture the output from this stack in the Outputs tab in the CloudFormation console:
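
If you prefer to stay on the command line, the same output values can be pulled with describe-stacks (a sketch using the stack name from the command above):

aws cloudformation describe-stacks --stack-name iam \
    --query 'Stacks[0].Outputs' --output table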

Creating the compute environment

In AWS Batch, you will set up a managed compute environment. Managed compute environments automatically launch and terminate compute resources on your behalf based on the aggregate resources needed by your jobs, such as vCPU and memory, and simple boundaries that you define.

When defining your compute environment, specify the following:

  • Desired instance types in your environment
  • Min and max vCPUs in the environment
  • The Amazon Machine Image (AMI) to use
  • The percentage of the On-Demand price to bid on the Spot Market
  • The VPC subnets that can be used

AWS Batch then provisions an elastic and heterogeneous pool of Amazon EC2 instances based on the aggregate resource requirements of jobs sitting in the RUNNABLE state. If a mix of CPU and memory-intensive jobs are ready to run, AWS Batch provisions the appropriate ratio and size of CPU and memory-optimized instances within your environment. For this post, you will use the simplest configuration, in which instance types are set to "optimal" allowing AWS Batch to choose from the latest C, M, and R EC2 instance families.

While you could create this compute environment in the console, we provide the following CLI commands. Replace the subnet IDs and key name with your own private subnets and key, and the image-id with the image you will build in the next section.

ACCOUNTID=<your account id>
SERVICEROLE=<from output in CloudFormation template>
IAMFLEETROLE=<from output in CloudFormation template>
JOBROLEARN=<from output in CloudFormation template>
SUBNETS=<comma delimited list of subnets>
SECGROUPS=<your security groups>
SPOTPER=50 # percentage of on demand
IMAGEID=<ami-id corresponding to the one you created>
INSTANCEROLE=<from output in CloudFormation template>
REGISTRY=${ACCOUNTID}.dkr.ecr.us-east-1.amazonaws.com
KEYNAME=<your key name>
MAXCPU=1024 # max vCPUs in compute environment
ENV=myenv

# Creates the compute environment
aws batch create-compute-environment --compute-environment-name genomicsEnv-$ENV --type MANAGED --state ENABLED --service-role ${SERVICEROLE} --compute-resources type=SPOT,minvCpus=0,maxvCpus=$MAXCPU,desiredvCpus=0,instanceTypes=optimal,imageId=$IMAGEID,subnets=$SUBNETS,securityGroupIds=$SECGROUPS,ec2KeyPair=$KEYNAME,instanceRole=$INSTANCEROLE,bidPercentage=$SPOTPER,spotIamFleetRole=$IAMFLEETROLE
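
The compute environment takes a little while to become usable; before wiring up job queues, it’s worth confirming that it has reached the VALID state (a quick check using the name from the command above):

aws batch describe-compute-environments \
    --compute-environments genomicsEnv-$ENV \
    --query 'computeEnvironments[0].status' --output text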

Creating the custom AMI for AWS Batch

While you can use default Amazon ECS-optimized AMIs with AWS Batch, you can also provide your own image in managed compute environments. We will use this feature to provision additional scratch EBS storage on each of the instances that AWS Batch launches and also to encrypt both the Docker and scratch EBS volumes.

AWS Batch has the same requirements for your AMI as Amazon ECS. To build the custom image, modify the default Amazon ECS-Optimized Amazon Linux AMI in the following ways:

  • Attach a 1 TB scratch volume to /dev/sdb
  • Encrypt the Docker and new scratch volumes
  • Mount the scratch volume to /docker_scratch by modifying /etc/fstab

The first two tasks can be addressed when you create the custom AMI in the console. Spin up a small t2.micro instance, and proceed through the standard EC2 instance launch.

After your instance has launched, record the IP address and then SSH into the instance. Copy and paste the following code:

sudo yum -y update
# Partition and format the scratch volume (attached as /dev/sdb, which shows up as /dev/xvdb)
sudo parted /dev/xvdb mklabel gpt
sudo parted /dev/xvdb mkpart primary 0% 100%
sudo mkfs -t ext4 /dev/xvdb1
# Create the mount point and add an fstab entry so the volume mounts on boot
sudo mkdir /docker_scratch
echo -e '/dev/xvdb1\t/docker_scratch\text4\tdefaults\t0\t0' | sudo tee -a /etc/fstab
sudo mount -a

This auto-mounts your scratch volume to /docker_scratch, which is your scratch directory for batch processing. Next, create your new AMI and record the image ID.
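
If you prefer the CLI to the console for the imaging step, a sketch of the call looks like the following (the instance ID is yours; the AMI name and description are arbitrary examples):

# <your instance id> is the instance you just configured; name and description are examples
aws ec2 create-image --instance-id <your instance id> \
    --name "ecs-batch-scratch-ami" --description "ECS-optimized AMI with /docker_scratch scratch volume"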

Creating the job queues

AWS Batch job queues are used to coordinate the submission of batch jobs. Your jobs are submitted to job queues, which can be mapped to one or more compute environments. Job queues have priority relative to each other. You can also specify the order in which they consume resources from your compute environments.

In this solution, use two job queues. The first is for high priority jobs, such as alignment or variant calling. Set this with a high priority (1000) and map it back to the previously created compute environment. Next, set up a second job queue for low priority jobs, such as quality statistics generation. To create these job queues, enter the following CLI commands:

aws batch create-job-queue --job-queue-name highPriority-${ENV} --compute-environment-order order=0,computeEnvironment=genomicsEnv-${ENV}  --priority 1000 --state ENABLED
aws batch create-job-queue --job-queue-name lowPriority-${ENV} --compute-environment-order order=0,computeEnvironment=genomicsEnv-${ENV}  --priority 1 --state ENABLED

Creating the job definitions

To run the Isaac aligner container image locally, supply the Amazon S3 locations for the FASTQ input sequences, the reference genome to align to, and the output BAM file. For more information, see tools/isaac/README.md.

The Docker container itself also requires some information on a suitable mountable volume so that it can read and write temporary files without running out of space.

Note: In the following example, the FASTQ files as well as the reference files to run are in a publicly available bucket.

FASTQ1=s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz
FASTQ2=s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz
REF=s3://aws-batch-genomics-resources/reference/isaac/
BAM=s3://mybucket/genomic-workflow/test_results/bam/

mkdir ~/scratch

docker run --rm -ti -v ${HOME}/scratch:/scratch $REPO_URI --bam_s3_folder_path $BAM \
--fastq1_s3_path $FASTQ1 \
--fastq2_s3_path $FASTQ2 \
--reference_s3_path $REF \
--working_dir /scratch 

Locally running containers can typically expand their CPU and memory resource headroom. In AWS Batch, the CPU and memory requirements are hard limits and are allocated to the container image at runtime.

Isaac is a fairly resource-intensive algorithm, as it creates an uncompressed index of the reference genome in memory to match the query DNA sequences. The large memory space is shared across multiple CPU threads, and Isaac can scale almost linearly with the number of CPU threads given to it as a parameter.

To fit these characteristics, choose an optimal instance size to maximize the number of CPU threads based on a given large memory footprint, and deploy a Docker container that uses all of the instance resources. In this case, we chose a host instance with 80+ GB of memory and 32+ vCPUs. The following code is example JSON that you can pass to the AWS CLI to create a job definition for Isaac.

aws batch register-job-definition --job-definition-name isaac-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/isaac",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":80000,
"vcpus":32,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

You can copy and paste the following code for the other three job definitions:

aws batch register-job-definition --job-definition-name strelka-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/strelka",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":32000,
"vcpus":32,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

aws batch register-job-definition --job-definition-name snpeff-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/snpeff",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":10000,
"vcpus":4,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

aws batch register-job-definition --job-definition-name samtoolsStats-${ENV} --type container --retry-strategy attempts=3 --container-properties '
{"image": "'${REGISTRY}'/samtools_stats",
"jobRoleArn":"'${JOBROLEARN}'",
"memory":10000,
"vcpus":4,
"mountPoints": [{"containerPath": "/scratch", "readOnly": false, "sourceVolume": "docker_scratch"}],
"volumes": [{"name": "docker_scratch", "host": {"sourcePath": "/docker_scratch"}}]
}'

The value for "image" comes from the previous post on creating a Docker image and publishing to ECR. You can find the value for jobRoleArn in the output of the CloudFormation template that you deployed earlier. In addition to providing the number of CPU cores and memory required by Isaac, you also give it a storage volume for scratch and staging. The volume comes from the previously defined custom AMI.

Testing the environment

After you have created the Isaac job definition, you can submit the job using the AWS Batch submitJob API action. While the base mappings for Docker run are taken care of in the job definition that you just built, the specific job parameters should be specified in the container overrides section of the API call. Here’s what this would look like in the CLI, using the same parameters as in the bash commands shown earlier:

aws batch submit-job --job-name testisaac --job-queue highPriority-${ENV} --job-definition isaac-${ENV}:1 --container-overrides '{
"command": [
            "--bam_s3_folder_path", "s3://mybucket/genomic-workflow/test_batch/bam/",
            "--fastq1_s3_path", "s3://aws-batch-genomics-resources/fastq/SRR1919605_1.fastq.gz",
            "--fastq2_s3_path", "s3://aws-batch-genomics-resources/fastq/SRR1919605_2.fastq.gz",
            "--reference_s3_path", "s3://aws-batch-genomics-resources/reference/isaac/",
            "--working_dir", "/scratch",
            "--cmd_args", " --exome "]
}'

When you execute a submitJob call, a jobId is returned. You can then track the progress of your job using the describeJobs API action:

aws batch describe-jobs --jobs <jobId returned from submitJob>
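
If you only need the job’s state, the standard --query filter narrows the output (assuming the returned jobId is stored in a JOBID variable):

aws batch describe-jobs --jobs $JOBID --query 'jobs[0].status' --output text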

You can also track the progress of all of your jobs in the AWS Batch console dashboard.

To see exactly where a RUNNING job is at, use the link in the AWS Batch console to direct you to the appropriate location in CloudWatch logs.

Completing the batch environment setup

To finish, create a Lambda function to submit a generic AWS Batch job.

In the Lambda console, create a Python 2.7 Lambda function named batchSubmitJob. Copy and paste the following code. This is similar to the batch-submit-job-python27 Lambda blueprint. Use the LambdaBatchExecutionRole that you created earlier. For more information about creating functions, see Step 2.1: Create a Hello World Lambda Function.

from __future__ import print_function

import json
import boto3

batch_client = boto3.client('batch')

def lambda_handler(event, context):
    # Log the received event
    print("Received event: " + json.dumps(event, indent=2))
    # Get parameters for the SubmitJob call
    # http://docs.aws.amazon.com/batch/latest/APIReference/API_SubmitJob.html
    job_name = event['jobName']
    job_queue = event['jobQueue']
    job_definition = event['jobDefinition']
    
    # containerOverrides, dependsOn, and parameters are optional
    container_overrides = event['containerOverrides'] if event.get('containerOverrides') else {}
    parameters = event['parameters'] if event.get('parameters') else {}
    depends_on = event['dependsOn'] if event.get('dependsOn') else []
    
    try:
        response = batch_client.submit_job(
            dependsOn=depends_on,
            containerOverrides=container_overrides,
            jobDefinition=job_definition,
            jobName=job_name,
            jobQueue=job_queue,
            parameters=parameters
        )
        
        # Log response from AWS Batch
        print("Response: " + json.dumps(response, indent=2))
        
        # Return the jobId
        event['jobId'] = response['jobId']
        return event
    
    except Exception as e:
        print(e)
        message = 'Error getting Batch Job status'
        print(message)
        raise Exception(message)
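
To exercise the function, you could send it a minimal test event from the CLI. This is only a sketch: the queue and job definition names assume ENV=myenv as in the earlier commands, and the output file name is arbitrary.

# names below assume ENV=myenv; adjust to match your environment
aws lambda invoke --function-name batchSubmitJob \
    --payload '{"jobName": "testisaac", "jobQueue": "highPriority-myenv", "jobDefinition": "isaac-myenv:1"}' \
    response.json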

Conclusion

In part 3 of this series, you successfully set up your data processing, or batch, environment in AWS Batch. We also provided a Python script in the corresponding GitHub repo that takes care of all of the above CLI arguments for you, as well as building out the job definitions for all of the jobs in the workflow: Isaac, Strelka, SAMtools, and snpEff. You can check the script’s README for additional documentation.

In Part 4, you’ll cover the workflow layer using AWS Step Functions and AWS Lambda.

Please leave any questions and comments below.

File deduplication written in bash

Post Syndicated from Delian Delchev original http://deliantech.blogspot.com/2017/05/file-deduplication-written-in-bash.html

Once there was this guy who asked me whether I would be able to write a file deduplication script in shell.

It is not very hard and it is a curious problem, so I am publishing my code here:

#!/bin/bash
# Deduplicate files in the given directory by replacing identical copies with hard links.
[ ! -d "$1" ] && echo "$1 is not a directory! exit" && exit 1
cd "$1" || exit 1
oldsize="yyyyy"; oldname="xxxxx"
# List files as size:name, sorted by size descending, so identical files end up adjacent.
find . -type f -ls | awk '{ print $7":"$11 }' | sort -k 1,1 -n -r | while read -r line; do
  size=${line%%:*}   # everything before the first colon (the size)
  name=${line#*:}    # everything after the first colon (the path)
  # Same size as the previous file and byte-for-byte identical? Replace it with a hard link.
  if [ "$oldsize" == "$size" -a -f "$name" -a -f "$oldname" ] && diff -s "$oldname" "$name"; then
      rm -f "$name"
      ln "$oldname" "$name"
      continue
  fi
  oldsize="$size"
  oldname="$name"
done
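
Saved as dedup.sh (the file name is just an example), it is pointed at a directory like this:

chmod +x dedup.sh
./dedup.sh /path/to/photos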

I am wondering whether it could be made even simpler…

Building a Competitive Moat: Turning Challenges Into Advantages

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/turning-challenges-into-advantages/

castle on top of a storage pod

In my previous post on how Backblaze got started, I mentioned that “just because we knew the right solution, didn’t mean that it was possible.” I’ll dig into that here. The right solution was to offer unlimited backup for $5 per month. The price of storage at the time, however, would have likely forced us to price our unlimited backup service at 2x – 5x that.

We were faced with a difficult challenge – compromise a fundamental feature of our product by removing the unlimited storage element, increase our price point in order to cover our costs but likely limit our potential customer base, seek funding in order to run at a loss while we built market share with a hope/prayer we could make a profit in the future, or find another way (huge unknown that might not have a solution). Below I’ll dig into the options that were available, the paths we tried, and how this challenge completely transformed our company and ended up being our greatest technological advantage.

Available Options:

Use a Storage Service

Originally we intended to build the backup application, but leave the back-end storage to others; likely Amazon S3. This had many advantages:

  1. We would not have to worry about the storage at all
  2. It would scale up or down as we needed it
  3. We would pay only for what we used

Especially as a small, bootstrapped company with limited resources – these were incredible benefits.

There was just one problem. At S3’s then-current pricing ($0.15/GB/month), a customer storing just 33 GB would cost us 100% of the $5 per month we would collect. Additionally, we would need to pay S3 transaction and download charges, along with our engineering/support/marketing and other expenses. The conclusion: even if the average customer stored just 33 GB, it would cost us at least $10/month for a customer that we were charging just $5/month.

In 2007, when we were getting started, there were a few other storage services available. But all were more expensive. Despite the fantastic benefits of using such a service, it simply didn’t work for us.

Buy Storage Systems

Buying storage systems didn’t have all the benefits of using a storage service – we would have to forecast need, buy in big blocks up front, manage data centers, etc. – but it seemed the second-best option. Companies such as EMC, NetApp, Dell, and others sold hundreds of petabytes of storage systems where they provide the servers, software, and support.

Alas, there were two problems: one temporary, the other permanent (and fatal). The temporary problem was that these systems were hundreds of thousands of dollars just to get started. This was challenging for us from a cash-flow perspective, but it was just a question of coming up with the cash. The permanent problem was that these systems cost ~$1,000/TB of storage. Hard drives were selling for ~$100/TB, so there was a 10x markup for the storage system. That markup eliminated pursuing this path. What if the average customer had 100 GB to store? It would take us 20 months to pay off the purchase. We weren’t sure how much data the average customer would have, but the scenarios we were running made it seem like a $5/month price point was unsustainable.

Our Choices Were:

Don’t Offer the Right Solution

If it’s impossible to offer unlimited backup for $5/month, there are certainly choices. We could have raised the price to $10/month, not made the backup unlimited, or closed up shop altogether. All doable, none ideal.

Raise Funding

Plenty of companies raise funding before they can be self-sustaining, and it can work out great for everyone. We had raised funding for a previous company and believed we could have done it for Backblaze. And raising funding would have taken care of the cash-flow issue if we chose to buy storage systems.

However, it would have left us with a business with negative unit economics – we would lose money on every customer, and the faster we grew, the more money we would lose. VCs do fund these types of companies often (many of the delivery companies today fall in this realm) with the idea that, at scale, you improve your cost structure and possibly also charge more. But it’s a dangerous game since not only is the business not self-sustaining, it inevitably must be significantly altered in order to survive.

Find a Way to Store Data for Less

If there were some way to store data for less, significantly less, it could all work. We had a tiny glimmer of hope that it would be possible: Since hard drives only cost ~$100/TB, if we could somehow use those drives without adding much overhead, that would be quite affordable.

“we wanted to build a sustainable business from day one and build a culture that believes dollars come from customers.”

Our first decision was to not compromise our product by restricting the amount of storage. Although this would have been a much easier solution, it violated our core mission: Create a simple and inexpensive solution to backup all of your important data.

We had previously also decided not to raise funding to get started because we wanted to build a sustainable business from day one and build a culture that believes dollars come from customers. With those decisions made, we moved onto finding the best solution to fulfill our mission and create a viable company.

Experimentation

All we wanted was to attach hard drives to the Internet. If we could do that inexpensively, our backup application could store the data there and we could offer our unlimited backup service.

A hard drive needs to be connected to a server to be available on the Internet. It certainly wouldn’t be very cost effective to have one server for every hard drive, as the server costs would dominate the equation. Alternatively, trying to attach a lot of drives to a server resulted in needing expensive “enterprise” servers. The goal then became cost-efficiently attaching as many hard drives as possible to one server. According to its spec, USB is supposed to allow for 127 devices to be daisy-chained to a single port. We tried; it didn’t work. We considered Firewire, which could connect 63 devices, but the connectors are aimed at graphic designers and ended up too expensive on a unit-basis. Our next thought was to use small consumer-grade DAS (Direct-attached storage) devices and connect those to a server. We managed to attach 8 DAS devices with 4 drives each for a total of 32 hard drives connected to one server.

DAS units attached to a server
This worked well, but it was operationally challenging as none of these devices were meant to fit in a data center rack. Further complicating matters was that moving one of these setups required cabling 10 power cords, and separately moving 9 boxes. Fine at small scale, but very hard to scale up.

We realized that we didn’t need all the boxes, we just needed backplanes to connect the drives from the DAS boxes to the motherboard from the server. We found a different DAS box that supports port multipliers and took that backplane. How did we decide on that DAS box? Tim, co-founder & Chief Cloud Officer, remembers going to Fry’s and picking the box that looked “about right”.

That all laid the path for our eventual 45 drive design. The next thought was: If we could put all that in one box, it might be the solution we were looking for. The first iteration of this was a plywood box.

the first wooden storage pod

That eventually evolved into a steel server and what we refer to as a Storage Pod.

steel storage pod chassis

Building a Storage Platform

The Storage Pod became our key building block, but was just a tiny component of the ‘storage platform’. We had to write software that would run on each Storage Pod, software that would create redundancy between the Storage Pods, and central software and systems that would coordinate other aspects of the system to accept/load balance/validate/clean-up data. We had to find and train contract manufacturers to build the Storage Pods, find and negotiate data center space and bandwidth, setup processes to buy drives and track their reliability, hire people to maintain the systems, and setup the business processes to do all of this and more at scale.

All of this ended up taking tremendous technical effort, management engagement, and work from all corners of Backblaze. But it has also paid enormous dividends.

The Transformation

We started Backblaze thinking of ourselves as a backup company. In reality, we became a storage company with ‘backup’ as the first service we offered on our storage platform. Our backup service relies on the storage platform as, without the storage platform, we couldn’t offer unlimited backup. To enable the backup service, storage became the foundation of our company and is still what we live and breathe every day.

It didn’t just change how we built the service, it changed the fundamental DNA of the company.

Dividends

Creating our own storage platform was certainly hard. But it enabled us to offer our unlimited backup for a low price and do that while running a sustainable business.

“It didn’t just change how we built the service, it changed the fundamental DNA of the company.”

We felt that we had a service and price point that customers wanted, and we “unlocked” the way to let us build it. Having our storage platform also provides us with a deep connection to our customers and the storage community – we share how we build Storage Pods and how reliable hard drives in our environment have been. That content, in turn, helps bring awareness to Backblaze; the awareness helps establish the company as a tech leader; that reputation helps us recruit to our growing team and earns customers who are evaluating our solutions vs. Storage Company X.

And after years of being a storage company with a backup service, and being asked all the time to just offer our storage directly, we launched our Backblaze B2 Cloud Storage service. We offer this raw storage at a price of $0.005/GB/month – that’s less than one-quarter of the price of S3.

If we had built our backup service on one of the existing storage services or storage systems, it would have been easier – but none of this would have been possible. This challenge, which we have spent a decade working to overcome, has also transformed our company and became our greatest technological advantage.

The post Building a Competitive Moat: Turning Challenges Into Advantages appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Spotify’s Beta Used ‘Pirate’ MP3 Files, Some From Pirate Bay

Post Syndicated from Andy original https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files-some-from-pirate-bay-170509/

While some pirates will probably never be tempted away from the digital high seas, over the past decade millions have ditched or tapered down their habit with the help of Spotify.

It’s no coincidence that from the very beginning more than a decade ago, the streaming service had more than a few things in common with the piracy scene.

Spotify CEO Daniel Ek originally worked with uTorrent creator Ludvig ‘Ludde’ Strigeus before the pair sold uTorrent to BitTorrent Inc. and began work on Spotify. Later, the company told TF that pirates were their target.

“Spotify is a new way of enjoying music. We believe Spotify provides a viable alternative to music piracy,” the company said.

“We think the way forward is to create a service better than piracy, thereby converting users into a legal, sustainable alternative which also enriches the total music experience.”

The technology deployed by Spotify was also familiar. Like the majority of ‘pirate’ platforms at the time, Spotify operated a peer-to-peer (P2P) system which grew to become one of the largest on the Internet. It was shut down in 2011.

But in the clearest nod to pirates, Spotify was available for free, supported by ads if the user desired. This was the platform’s greatest asset as it sought to win over a generation that had grown accustomed to gorging on free MP3s. Interestingly, however, an early Pirate Bay figure has now revealed that Spotify also had a use for the free content floating around the Internet.

As one of the early members of Sweden’s infamous Piratbyrån (piracy bureau), Rasmus Fleischer was also one of the key figures at The Pirate Bay. Over the years he’s been a writer, researcher, debater, and musician, and in 2012 he finished his PhD thesis on “music’s political economy.”

As part of a five-person team, Fleischer is now writing a book about Spotify. Titled ‘Spotify Teardown – Inside the Black Box of Streaming Music’, the book aims to shine light on the history of the famous music service and also spills the beans on a few secrets.

In an interview with Sweden’s DI.se, Fleischer reveals that when Spotify was in early beta, the company used unlicensed music to kick-start the platform.

“Spotify’s beta version was originally a pirate service. It was distributing MP3 files that the employees happened to have on their hard drives,” he reveals.

Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.

Solid proof has been more difficult to come by but Fleischer says he knows for certain that Spotify was using music obtained not only from pirate sites, but the most famous pirate site of all.

According to the writer, a few years ago he was involved with a band that decided to distribute their music on The Pirate Bay instead of the usual outlets. Soon after, the album appeared on Spotify’s beta service.

“I thought that was funny. So I emailed Spotify and asked how they obtained it. They said that ‘now, during the test period, we will use music that we find’,” Fleischer recalls.

For a company that has attracting pirates built into its DNA, it’s perhaps fitting that it tempted them with the same bait found on pirate sites. Certainly, the company’s history of a pragmatic attitude towards piracy means that few will be shouting ‘hypocrites’ at the streaming platform now.

Indeed, according to Fleischer the successes and growth of Spotify are directly linked to the temporary downfall of The Pirate Bay following the raid on the site in 2006, and the lawsuits that followed.

“The entire Spotify beta period and its early launch history is in perfect sync with the Pirate Bay process,” Fleischer explains.

“They would not have had as much attention if they had not been able to surf that wave. The company’s early history coincides with the Pirate Party becoming a hot topic, and the trial of the Pirate Bay in the Stockholm District Court.”

In 2013, Fleischer told TF that The Pirate Bay had “helped catalyze so-called ‘new business models’,” and it now appears that Spotify is reaping the benefits and looks set to keep doing so into the future.

An in-depth interview with Rasmus Fleischer will be published here soon, including an interesting revelation detailing how TorrentFreak readers positively affected the launch of Spotify in the United States.

Spotify Teardown – Inside the Black Box of Streaming Music will be published early 2018.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Kim Dotcom Asks Police to Urgently Interview FBI Director Jim Comey

Post Syndicated from Andy original https://torrentfreak.com/kim-dotcom-asks-police-to-urgently-interview-fbi-director-jim-comey-170425/

When authorities in the United States and New Zealand shut down Megaupload in 2012, large amounts of data were seized in both locations. The data in the US is currently gathering dust but over in New Zealand yet another storm is brewing.

In the weeks following the raid, hard drives seized from Dotcom in New Zealand were cloned and sent to the FBI in the United States. A judge later found that this should not have been allowed, ruling that the copies in the FBI’s possession must be destroyed.

Like almost every process in the Megaupload saga the ruling went to appeal and in 2014 Dotcom won again, with the Court of Appeal upholding the lower court’s decision, stating that the removal of the clones to the United States was “plainly not authorized.”

At the time Dotcom said that fighting back is “encoded in his DNA” and today he’s taking that fight to the FBI. On Sunday, FBI director James Comey touched down in Queenstown, New Zealand, for an intelligence conference. With Comey in the country, Dotcom seized the moment to file a complaint with local police.

In the complaint shared with TorrentFreak, lawyer Simon Cogan draws police attention to the Court of Appeal ruling determining that clones of Dotcom drives were unlawfully shipped to the FBI in the United States. Since Comey is in the country, police should take the opportunity to urgently interview him over this potential criminal matter.

“As director of the FBI, Mr Comey will be able to assist Police with their investigation of the matters raised in Mr Dotcom’s complaint,” the complaint reads, noting several key areas of interest as detailed below.

Speaking with TF, Dotcom says that since the New Zealand High Court and Court of Appeal have both ruled that the FBI had no authority to remove his data from New Zealand, the FBI acted unlawfully.

“In simple terms the FBI has committed theft,” Dotcom says.

“The NZ courts don’t have jurisdiction in the US and could therefore not assist me in getting my data back. But FBI Director Comey has just arrived in New Zealand for a conference meaning he is in the jurisdiction of NZ courts. We have asked the NZ police to question Mr Comey about the theft and to investigate.”

In addition to seeking assistance from the police, Dotcom says that he’s also initiated a new lawsuit to have his data returned.

“We have also launched a separate civil court action to force Mr Comey to return my data to New Zealand and to erase any and all copies the FBI / US Govt holds. We expect an urgent hearing of the matter in the High Court tomorrow,” Dotcom concludes.

It’s likely that this will be another Dotcom saga that will run and run, but despite the seriousness of the matter in hand, Dotcom was happy to take to Twitter this morning, delivering a video message in his own inimitable style.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Operating OpenStack at Scale

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/159795571841

By James Penick, Cloud Architect & Gurpreet Kaur, Product Manager

A version of this byline was originally written for and appears in CIO Review.

A successful private cloud presents a consistent and reliable facade over the complexities of hyperscale infrastructure. It must simultaneously handle constant organic traffic growth, unanticipated spikes, a multitude of hardware vendors, and discordant customer demands. The depth of this complexity only increases with the age of the business, leaving a private cloud operator saddled with legacy hardware, old network infrastructure, customers dependent on legacy operating systems, and the list goes on. These are the foundations of the horror stories told by grizzled operators around the campfire.

Providing a plethora of services globally for over a billion active users requires a hyperscale infrastructure. Yahoo’s on-premises infrastructure comprises datacenters housing hundreds of thousands of physical and virtual compute resources globally, connected via a multi-terabit network backbone. As one of the very first hyperscale internet companies in the world, Yahoo grew its infrastructure organically – things were built, and rebuilt, as the company learned and grew. The resulting web of modern and legacy infrastructure became progressively more difficult to manage. Initial attempts to manage this via IaaS (Infrastructure-as-a-Service) taught some hard lessons. However, those lessons served us well when OpenStack was selected to manage Yahoo’s datacenters; some of those lessons are shared below.

Centralized team offering Infrastructure-as-a-Service

Chief amongst the lessons learned prior to OpenStack was that IaaS must be presented as a core service to the whole organization by a dedicated team. An a-la-carte-IaaS, where each user is expected to manage their own control plane and inventory, just isn’t sustainable at scale. Multiple teams tackling the same challenges involved in the curation of software, deployment, upkeep, and security within an organization is not just a duplication of effort; it removes the opportunity for improved synergy with all levels of the business. The first OpenStack cluster, with a centralized dedicated developer and service engineering team, went live in June 2012.  This model has served us well and has been a crucial piece of making OpenStack succeed at Yahoo. One of the biggest advantages to a centralized, core team is the ability to collaborate with the foundational teams upon which any business is built: Supply chain, Datacenter Site-Operations, Finance, and finally our customers, the engineering teams. Building a close relationship with these vital parts of the business provides the ability to streamline the process of scaling inventory and presenting on-demand infrastructure to the company.

Developers love instant access to compute resources

Our developer productivity clusters, named “OpenHouse,” were a huge hit. Ideation and experimentation are core to developers’ DNA at Yahoo, and OpenHouse empowers our engineers to innovate, prototype, develop, and quickly iterate on ideas. No longer is a developer reliant on a static and costly development machine under their desk. OpenHouse enables developer agility and cost savings by obviating the desktop.

Dynamic infrastructure empowers agile products

From a humble beginning of a single, small OpenStack cluster, Yahoo’s OpenStack footprint is growing beyond 100,000 VM instances globally, with our single largest virtual machine cluster running over a thousand compute nodes, without using Nova Cells.

Until this point, Yahoo’s production footprint was nearly 100% focused on baremetal – a part of the business that one cannot simply ignore. In 2013, Yahoo OpenStack Baremetal began to manage all new compute deployments. Interestingly, after moving to a common API to provision baremetal and virtual machines, there was a marked increase in demand for virtual machines.

Developers across all major business units ranging from Yahoo Mail, Video, News, Finance, Sports and many more, were thrilled with getting instant access to compute resources to hit the ground running on their projects. Today, the OpenStack team is continuing to fully migrate the business to OpenStack-managed. Our baremetal footprint is well beyond that of our VMs, with over 100,000 baremetal instances provisioned by OpenStack Nova via Ironic.

How did Yahoo hit this scale?  

Scaling OpenStack begins with understanding how its various components work and how they communicate with one another. This topic can be very deep and for the sake of brevity, we’ll hit the high points.

1. Start at the bottom and think about the underlying hardware

Do not overlook the unique resource constraints for the services which power your cloud, nor the fashion in which those services are to be used. Leverage that understanding to drive hardware selection. For example, when one examines the role of the database server in an OpenStack cluster and considers the multitude of calls to the database (compute node heartbeats, instance state changes, normal user operations, and so on), one would conclude that this core component is extremely busy in even a modest-sized Nova cluster and needs adequate computational resources to perform. Yet many deployers skimp on the hardware, and the performance of the whole cluster ends up bottlenecked by database I/O. By thinking ahead you can save yourself a lot of heartburn later on.

2. Think about how things communicate

Our cluster databases are configured to be multi-master single-writer with automated failover. Control plane services have been modified to split DB reads directly to the read slaves and only write to the write-master. This distributes load across the database servers.

3. Scale wide

OpenStack has many small horizontally-scalable components which can peacefully cohabitate on the same machines: the Nova, Keystone, and Glance APIs, for example. Stripe these across several small or modestly sized servers. Some services, such as the Nova scheduler, run the risk of race conditions when running multi-active. If the risk of race conditions is unacceptable, use ZooKeeper to manage leader election.

4. Remove dependencies

In a Yahoo datacenter, DHCP is only used to provision baremetal servers. By statically declaring IPs in our instances via cloud-init, our infrastructure is less prone to outage from a failure in the DHCP infrastructure.

5. Don’t be afraid to replace things

Neutron used Dnsmasq to provide DHCP services; however, it was not designed to address the complexity or scale of a dynamic environment. For example, Dnsmasq must be restarted for any config change, such as when a new host is being provisioned. In the Yahoo OpenStack clusters this has been replaced by ISC-DHCPD, which scales far better than Dnsmasq and allows dynamic configuration updates via an API.

6. Or split them apart

Some of the core imaging services provided by Ironic, such as DHCP, TFTP, and HTTPS communicate with a host during the provisioning process. These services are normally  part of the Ironic Conductor (IC) service. In our environment we split these services into a new and physically-distinct service called the Ironic Transport Service (ITS). This brings value by:

  • Adding security: Splitting the ITS from the IC allows us to block all network traffic from production compute nodes to the IC, and other parts of our control plane. If a malicious entity attacks a node serving production traffic, they cannot escalate from it  to our control plane.
  • Scale: The ITS hosts allow us to horizontally scale the core provisioning services with which nodes communicate.
  • Flexibility: ITS allows Yahoo to manage remote sites, such as peering points, without building a new cluster in that site. Resources in those sites can now be managed by the nearest Yahoo owned & operated (O&O) datacenter, without needing to build a whole cluster in each site.

Be prepared for faulty hardware!

Running IaaS reliably at hyperscale is more than just scaling the control plane. One must take a holistic look at the system and consider everything. In fact, when examining provisioning failures, our engineers determined the majority root cause was faulty hardware. For example, there are a number of machines from varying vendors whose IPMI firmware fails from time to time, leaving the host inaccessible to remote power management. Some fail within minutes or weeks of installation. These failures occur on many different models, across many generations, and across many hardware vendors. Exposing these failures to users would create a very negative experience, and the cloud must be built to tolerate this complexity.

Focus on the end state

Yahoo’s experience shows that one can run OpenStack at hyperscale, leveraging it to wrap infrastructure and remove perceived complexity. Correctly leveraged, OpenStack presents an easy, consistent, and error-free interface. Delivering this interface is core to our design philosophy as Yahoo continues to double down on our OpenStack investment. The Yahoo OpenStack team looks forward to continue collaborating with the OpenStack community to share feedback and code.

[$] Kubernetes & security

Post Syndicated from jake original https://lwn.net/Articles/720215/rss

Every conference venue has problems with the mix of room sizes, but I don’t recall ever going to a talk that so badly needed to be in a bigger room as Jessie Frazelle and Alex Mohr’s talk at CloudNativeCon/KubeCon Europe 2017 on securing Kubernetes. The cause of the enthusiasm was the opportunity to get “best practice” information on securing Kubernetes, and how Kubernetes might be evolving to assist with this, directly from the source.

[$] Network security in the microservice environment

Post Syndicated from corbet original https://lwn.net/Articles/719638/rss

We have seen that a microservice architecture is intimately tied to the use of a TCP/IP network as the interconnecting fabric, so when Bernard Van De Walle from Aporeto gave a talk at CloudNativeCon and KubeCon Europe 2017 on why we shouldn’t bother securing that network, it seemed a pretty provocative idea.

Backblaze Labs: The Future of DNA Storage

Post Syndicated from Yev original https://www.backblaze.com/blog/backblaze-labs-future-dna-storage/

Backblaze Labs – the data storage innovation branch of Backblaze – is proud to present to you its newest innovation, Backblaze DNA Storage. Most recently Backblaze Labs designed Storage Pod 6.0 and built out our Vault infrastructure that powers Backblaze Personal Backup and B2 Cloud Storage. DNA Storage has long been in our sights and we’re happy to see our work validated as Science magazine has caught up to us, reporting its latest capabilities in their article DNA could store all of the world’s data in one room.

Backblaze has always been known for very dense storage, and when we started to see reports that up to 1 zettabyte of data could be stored in a single gram of DNA, Backblaze Labs kicked itself into high gear. We were new to bio-storage, so we wanted to take it slow and steady.

The Ah-Ha Moment

In keeping with our bootstrapping ethos our team has not only been studying how to best employ DNA storage, but has also taken an active role by volunteering* in our alpha program as test subjects. Currently our server farm resides inside giant datacenters, but with DNA storage our ‘Storage Bods™’ could be… mobile!

*Note: rumors of employees being “voluntold” to take part are just that, rumors.

Unfortunately, this feature also proved to be the downfall of our earliest experiment – our first volunteer, Lego, ran away for a day. Fortunately we found her… but we realized a new approach was needed.

Our first human Storage Bod was Bob, our network engineer. His enthusiasm to take part, with the right motivation, was exactly what we needed. Aside from the sporadic bouts of nausea, we were able to keep our first test unit of storage, roughly 100 TB of 70’s, 80’s & 90’s TV sitcoms, in place with some stability. Strangely, we noticed some odd side effects, with Bob kicking our refrigerator to get a drink and developing an odd laugh. Able to look past this, we moved on to our next phase.

Now that our initial test “hosts” were used to the process, we started experimenting with data redundancy. It’s great to have a mobile backup, but what about making sure that the data exists in two places at once? We wanted this new project to be in keeping with our recommended 3-2-1 Backup strategy, didn’t we?

Unfortunately, our first attempts were again, not as successful as we’d have hoped. While we were able to start replicating the data, keeping it in two different locations proved to be a harder problem than we initially thought.

The breakthrough came when we stopped thinking about simply having the data in two locations and instead focused on having identical copies of the data in two distinct units. Sound familiar? It should! We had stumbled onto the DNA-storage equivalent of RAID 1!

The best part about this setup was that when one set of data got corrupted, the other would know and could come in for repairs. It was a win/win. Or as we call it, a twin/twin!

Next Steps

We’d like to implement a Backblaze Vault type of set-up with our new Storage Bod systems, 20 Bods grouped together running our Reed-Solomon encoding algorithm to create 99.999999% DNA data durability. That way, even if three of the Storage Bods were to go down or leave town, all of the data can be recovered. We don’t know how all of this will work out yet, but we’re sure we’ll need to get a lot more office snacks for our Storage Bods and we’re excited about the possibility of having billions of terabytes of data walking around. What could go wrong?

The post Backblaze Labs: The Future of DNA Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Amazon Elasticsearch Service support for Elasticsearch 5.1

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/amazon-elasticsearch-service-support-for-es-5-1/

The Amazon Elasticsearch Service is a fully managed service that provides easier deployment, operation, and scale for the Elasticsearch open-source search and analytics engine. We are excited to announce that Amazon Elasticsearch Service now supports Elasticsearch 5.1 and Kibana 5.1.

Elasticsearch 5 comes with a ton of new features and enhancements that customers can now take advantage of in Amazon Elasticsearch Service. Elements of the Elasticsearch 5 release are as follows:

  • Indexing performance: Improved Indexing throughput with updates to lock implementation & async translog fsyncing
  • Ingestion Pipelines: Incoming data can be sent to a pipeline that applies a series of ingestion processors, allowing transformation to the exact data you want to have in your search index. There are twenty processors included, from simple appending to complex regex applications
  • Painless scripting: Amazon Elasticsearch Service supports Painless, a new secure and performant scripting language for Elasticsearch 5. You can use scripting to change the precedence of search results, delete index fields by query, modify search results to return specific fields, and more.
  • New data structures: Lucene 6 data structures, new data types (half_float, text, keyword), and more complete support for dots-in-fieldnames
  • Search and Aggregations: Refactored search API, BM25 relevance calculations, Instant Aggregations, improvements to histogram aggregations & terms aggregations, and rewritten percolator & completion suggester
  • User experience: Strict settings and body & query string parameter validation, index management improvement, default deprecation logging, new shard allocation API, and new indices efficiency pattern for rollover & shrink APIs
  • Java REST client: simple HTTP/REST Java client that works with Java 7 and handles retry on node failure, as well as round-robin, sniffing, and logging of requests
  • Other improvements: Lazy unicast hosts DNS lookup, automatic parallel tasking of reindex, update-by-query, delete-by-query, and search cancellation by task management API

The compelling new enhancements of Elasticsearch 5 are meant to make the service faster and easier to use while providing better security. Amazon Elasticsearch Service is a managed service designed to aid customers in building, developing and deploying solutions with Elasticsearch by providing the following capabilities:

  • Multiple configurations of instance types
  • Amazon EBS volumes for data storage
  • Cluster stability improvement with dedicated master nodes
  • Zone awareness – Cluster node allocation across two Availability Zones in the region
  • Access Control & Security with AWS Identity and Access Management (IAM)
  • Various geographical locations/regions for resources
  • Amazon Elasticsearch domain snapshots for replication, backup and restore
  • Integration with Amazon CloudWatch for monitoring Amazon Elasticsearch domain metrics
  • Integration with AWS CloudTrail for configuration auditing
  • Integration with other AWS Services like Kinesis Firehose and DynamoDB for loading of real-time streaming data into Amazon Elasticsearch Service

Amazon Elasticsearch Service allows dynamic changes with zero downtime. You can add instances, remove instances, change instance sizes, change storage configuration, and make other changes dynamically.
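
For example, resizing a cluster is just a configuration update on the domain. Below is a minimal sketch using the AWS SDK for JavaScript; the region, domain name, instance type, and count are placeholders rather than values from this walkthrough:

var AWS = require('aws-sdk');
var es = new AWS.ES({ region: 'us-east-1' });       // placeholder region

// Scale a (hypothetical) domain to four data nodes without downtime.
es.updateElasticsearchDomainConfig({
    DomainName: 'onboarding-domain',                // placeholder domain name
    ElasticsearchClusterConfig: {
        InstanceType: 'm3.medium.elasticsearch',
        InstanceCount: 4
    }
}, function(err, data) {
    if (err) console.log('Update failed: ' + err);
    else console.log('Pending cluster config: ' +
        JSON.stringify(data.DomainConfig.ElasticsearchClusterConfig.Options));
});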

The best way to highlight some of the aforementioned capabilities is with an example.

During a presentation at the IT/Dev conference, I demonstrated how to build a serverless employee onboarding system using Express.js, AWS Lambda, Amazon DynamoDB, and Amazon S3. In the demo, the information collected was personnel data stored in DynamoDB about an employee going through a fictional onboarding process. Imagine if the collected employee data could be searched, queried, and analyzed as needed by the company’s HR department. We can easily augment the onboarding system to add these capabilities by enabling the employee table to use DynamoDB Streams to trigger Lambda and store the desired employee attributes in Amazon Elasticsearch Service.

The result is the following solution architecture:

We will focus solely on how to dynamically store and index employee data in Amazon Elasticsearch Service each time an employee record is entered and subsequently stored in the database.
To add this enhancement to the existing onboarding solution, we will implement it as shown in the detailed cloud architecture diagram below:

Let’s look at how to implement the employee load process to the Amazon Elasticsearch Service, which is the first process flow shown in the diagram above.

Amazon Elasticsearch Service: Domain Creation

Let’s now visit the AWS Console to check out Amazon Elasticsearch Service with Elasticsearch 5 in action. As you probably guessed, from the AWS Console home, we select Elasticsearch Service under the Analytics group.

The first step in creating an Elasticsearch solution is to create a domain. You will notice that when creating an Amazon Elasticsearch Service domain, you now have the option to choose the Elasticsearch 5.1 version. Since we are discussing the launch of support for Elasticsearch 5, we will, of course, choose the 5.1 Elasticsearch engine version when creating our domain in Amazon Elasticsearch Service.


After clicking Next, we will set up our Elasticsearch domain by configuring our instance and storage settings. The instance type and the number of instances for your cluster should be determined based upon your application’s availability, network volume, and data needs. A recommended best practice is to choose two or more instances in order to avoid possible data inconsistencies or split-brain failure conditions with Elasticsearch. Therefore, I will choose two instances/data nodes for my cluster and set up EBS as my storage device.

To understand how many instances you will need for your specific application, please review the blog post, Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need, on the AWS Database blog.

All that is left for me is to set up the access policy and deploy the service. Once I create my service, the domain will be initialized and deployed.

Now that I have my Elasticsearch service running, I now need a mechanism to populate it with data. I will implement a dynamic data load process of the employee data to Amazon Elasticsearch Service using DynamoDB Streams.

Amazon DynamoDB: Table and Streams

Before I head to the DynamoDB console, I will quickly cover the basics.

Amazon DynamoDB is a scalable, distributed NoSQL database service. DynamoDB Streams provide an ordered, time-based sequence of every CRUD operation to the items in a DynamoDB table. Each stream record has information about the primary attribute modification for an individual item in the table. Streams execute asynchronously and can write stream records in practically real time. Additionally, a stream can be enabled when a table is created or can be enabled and modified on an existing table. You can learn more about DynamoDB Streams in the DynamoDB developer guide.

Now we will head to the DynamoDB console and view the OnboardingEmployeeData table.

This table has a primary partition key, UserID, that is a string data type and a primary sort key, Username, which is also of a string data type. We will use the UserID as the document ID in Elasticsearch. You will also notice that on this table, streams are enabled and the stream view type is New image. A stream that is set to a New image view type will have stream records that display the entire item record after it has been updated. You also have the option to have the stream present records that provide data items before modification, provide only the items’ key attributes, or provide old and new item information.  If you opt to use the AWS CLI to create your DynamoDB table, the key information to capture is the Latest Stream ARN shown underneath the Stream Details section. A DynamoDB stream has a unique ARN identifier that is outside of the ARN of the DynamoDB table. The stream ARN will be needed to create the IAM policy for access permissions between the stream and the Lambda function.
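
As a reference point, here is a rough sketch of how a table like this, with streams enabled, could be created with the AWS SDK for JavaScript; the region and throughput values are placeholders, not settings taken from the onboarding demo:

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });   // placeholder region

dynamodb.createTable({
    TableName: 'OnboardingEmployeeData',
    AttributeDefinitions: [
        { AttributeName: 'UserID', AttributeType: 'S' },
        { AttributeName: 'Username', AttributeType: 'S' }
    ],
    KeySchema: [
        { AttributeName: 'UserID', KeyType: 'HASH' },       // partition key
        { AttributeName: 'Username', KeyType: 'RANGE' }     // sort key
    ],
    ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
    StreamSpecification: { StreamEnabled: true, StreamViewType: 'NEW_IMAGE' }
}, function(err, data) {
    if (err) console.log(err);
    // The stream ARN needed for the IAM policy in the next section.
    else console.log(data.TableDescription.LatestStreamArn);
});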

IAM Policy

The first thing that is essential for any service implementation is getting the correct permissions in place. Therefore, I will first go to the IAM console to create a role and a policy for my Lambda function that will provide permissions for DynamoDB and Elasticsearch.

First, I will create a policy based upon an existing managed policy for Lambda execution with DynamoDB Streams.

This will take us to the Review Policy screen, which will have the selected managed policy details. I’ll name this policy Onboarding-LambdaDynamoDB-toElasticsearch and then customize it for my solution. The first thing you should notice is that the current policy allows access to all streams; however, the best practice is to have this policy access only the specific DynamoDB stream, by adding the Latest Stream ARN. Hence, I will alter the policy, add the stream ARN for the DynamoDB table, OnboardingEmployeeData, and validate the policy. The altered policy is as shown below.
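
The sketch below is illustrative only; the account ID and the stream-label timestamp in the ARN are placeholders, and the managed policy you start from may contain additional statements:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeStream",
        "dynamodb:GetRecords",
        "dynamodb:GetShardIterator",
        "dynamodb:ListStreams"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/OnboardingEmployeeData/stream/2017-03-01T00:00:00.000"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}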

The only thing left is to add the Amazon Elasticsearch Service permissions in the policy. The core policy for Amazon Elasticsearch Service access permissions is as shown below:
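
As a sketch, the statement grants the es:ESHttpPost action our Lambda function needs in order to index documents with signed POST requests; the wildcard resource is what gets tightened in the next step:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "es:ESHttpPost",
      "Resource": "*"
    }
  ]
}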

 

I will use this policy and add the specific Elasticsearch domain ARN as the Resource for the policy. This ensures that I have a policy that enforces the Least Privilege security best practice for policies. With the Amazon Elasticsearch Service domain added as shown, I can validate and save the policy.

The best way to create a custom policy is to use the IAM Policy Simulator or view examples of the AWS service permissions in the service documentation. You can also find examples of policies for a subset of AWS services here. Remember, you should only add the ES permissions that are needed, following the least-privilege security best practice; the policy shown above is used only as an example.

We will create the role for our Lambda function to use to grant access and attach the aforementioned policy to the role.

AWS Lambda: DynamoDB triggered Lambda function

AWS Lambda is the core of Amazon Web Services serverless computing offering. With Lambda, you can write and run code using supported languages for almost any type of application or backend service. Lambda will trigger your code in response to events from AWS services or from HTTP requests. Lambda will dynamically scale based upon workload and you only pay for your code execution.

We will have DynamoDB Streams trigger a Lambda function that will create an index and send data to Elasticsearch. Another option for this is to use the Logstash plugin for DynamoDB. However, since several of the Logstash processors are now included in the Elasticsearch 5.1 core, and given the improved performance optimizations, I will opt to use Lambda to process my DynamoDB stream and load data to Amazon Elasticsearch Service.
Now let us head over to the AWS Lambda console and create the Lambda function for loading employee data to Amazon Elasticsearch Service.

Once in the console, I will create a new Lambda function by selecting the Blank Function blueprint that will take me to the Configure Trigger page. Once on the trigger page, I will select DynamoDB as the AWS service which will trigger Lambda, and I provide the following trigger related options:

  • Table: OnboardingEmployeeData
  • Batch size: 100 (default)
  • Starting position: Trim Horizon

I hit the Next button, and I am on the Configure Function screen. The name of my function will be ESEmployeeLoad and I will write this function in Node.js 4.3.

The Lambda function code is as follows:

var AWS = require('aws-sdk');
var path = require('path');

//Object for all the ElasticSearch Domain Info
var esDomain = {
    region: process.env.RegionForES,
    endpoint: process.env.EndpointForES,
    index: process.env.IndexForES,
    doctype: 'onboardingrecords'
};
//AWS Endpoint from created ES Domain Endpoint
var endpoint = new AWS.Endpoint(esDomain.endpoint);
//The AWS credentials are picked up from the environment.
var creds = new AWS.EnvironmentCredentials('AWS');

console.log('Loading function');
exports.handler = (event, context, callback) => {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    console.log(JSON.stringify(esDomain));
    
    event.Records.forEach((record) => {
        console.log(record.eventID);
        console.log(record.eventName);
        console.log('DynamoDB Record: %j', record.dynamodb);
       
        var dbRecord = JSON.stringify(record.dynamodb);
        postToES(dbRecord, context, callback);
    });
};

function postToES(doc, context, lambdaCallback) {
    var req = new AWS.HttpRequest(endpoint);

    req.method = 'POST';
    req.path = path.join('/', esDomain.index, esDomain.doctype);
    req.region = esDomain.region;
    req.headers['presigned-expires'] = false;
    req.headers['Host'] = endpoint.host;
    req.body = doc;

    var signer = new AWS.Signers.V4(req , 'es');  // es: service code
    signer.addAuthorization(creds, new Date());

    var send = new AWS.NodeHttpClient();
    send.handleRequest(req, null, function(httpResp) {
        var respBody = '';
        httpResp.on('data', function (chunk) {
            respBody += chunk;
        });
        httpResp.on('end', function (chunk) {
            console.log('Response: ' + respBody);
            lambdaCallback(null,'Lambda added document ' + doc);
        });
    }, function(err) {
        console.log('Error: ' + err);
        lambdaCallback('Lambda failed with error ' + err);
    });
}

The Lambda function Environment variables are:
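
  • RegionForES – the AWS region of the Amazon Elasticsearch Service domain, used when signing requests
  • EndpointForES – the domain endpoint that documents are posted to
  • IndexForES – the index that the onboarding records are written into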

I will select an Existing role option and choose the ESOnboardingSystem IAM role I created earlier.

Upon completing my IAM role permissions for the Lambda function, I can review the Lambda function details and complete the creation of ESEmployeeLoad function.

I have completed the process of building my Lambda function to talk to Elasticsearch, and now I test my function by simulating data changes to my database.

Now my function, ESEmployeeLoad, will execute upon changes to the data in my database from my onboarding system. Additionally, I can review the processing of the Lambda function to Elasticsearch by reviewing the CloudWatch logs.

Now I can alter my Lambda function to take advantage of the new features or go directly to Elasticsearch and utilize the new ingest pipelines. An example of this would be to implement a pipeline for my Employee record documents.
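
As a sketch only, such a pipeline could be defined through the Elasticsearch ingest API; the pipeline name and processors below are illustrative, and the field path reflects the raw DynamoDB stream record that the ESEmployeeLoad function posts:

PUT _ingest/pipeline/onboarding-pipeline
{
  "description": "Illustrative pipeline for employee onboarding records",
  "processors": [
    { "lowercase": { "field": "NewImage.Username.S" } },
    { "set": { "field": "source", "value": "onboarding-system" } }
  ]
}

Index requests that append ?pipeline=onboarding-pipeline to the document path would then run through these processors before the document is stored.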

I can replicate this function for handling the badge updates to the employee record, and/or leverage other preprocessors against the employee data. For instance, if I wanted to search the data based upon a particular attribute in the Elasticsearch document, I could use the Search API to retrieve matching records from the dataset.
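
Here is a sketch of that kind of query, reusing the request-signing pattern and environment variables from the ESEmployeeLoad function; the field path and search value are illustrative:

var AWS = require('aws-sdk');
var path = require('path');

var endpoint = new AWS.Endpoint(process.env.EndpointForES);
var creds = new AWS.EnvironmentCredentials('AWS');

function searchES(queryBody, callback) {
    var req = new AWS.HttpRequest(endpoint);
    req.method = 'POST';                    // _search accepts POST with a JSON body
    req.path = path.join('/', process.env.IndexForES, 'onboardingrecords', '_search');
    req.region = process.env.RegionForES;
    req.headers['presigned-expires'] = false;
    req.headers['Host'] = endpoint.host;
    req.body = JSON.stringify(queryBody);

    var signer = new AWS.Signers.V4(req, 'es');
    signer.addAuthorization(creds, new Date());

    new AWS.NodeHttpClient().handleRequest(req, null, function(httpResp) {
        var respBody = '';
        httpResp.on('data', function(chunk) { respBody += chunk; });
        httpResp.on('end', function() { callback(null, JSON.parse(respBody)); });
    }, callback);
}

// Example: match onboarding records by the Username attribute of the stream record.
searchES({ query: { match: { 'NewImage.Username.S': 'jdoe' } } }, function(err, results) {
    if (err) console.log('Search error: ' + err);
    else console.log('Matching records: ' + results.hits.total);
});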

The possibilities are endless, and you can get as creative as your data needs dictate while maintaining great performance.

Amazon Elasticsearch Service: Kibana 5.1

All Amazon Elasticsearch Service domains using Elasticsearch 5.1 are bundled with Kibana 5.1, the latest version of the open-source visualization tool.

The companion visualization and analytics platform, Kibana, has also been enhanced in the Kibana 5.1 release. Kibana is used to view, search, and interact with Elasticsearch data through a myriad of different charts, tables, and maps. In addition, Kibana performs advanced data analysis on large volumes of data. Key enhancements of the Kibana release are as follows:

  • Visualization tool new design: Updated color scheme and maximization of screen real-estate
  • Timelion: visualization tool with a time-based query DSL
  • Console: formerly known as Sense, now part of the core, using the same configuration for free-form requests to Elasticsearch
  • Scripted field language: ability to use the new Painless scripting language in the Elasticsearch cluster
  • Tag Cloud Visualization: 5.1 adds a word-based graphical view of data sized by importance
  • More Charts: return of previously removed charts and addition of advanced view for X-Pack
  • Profiler UI: provides an enhancement to the profile API with a tree view
  • Rendering performance improvement: Discover performance fixes, decrease of CPU load

Summary

As you can see, this release is expansive, with many enhancements to assist customers in building Elasticsearch solutions. Amazon Elasticsearch Service now supports 15 new Elasticsearch APIs and 6 new plugins, along with a broad set of supported operations for Elasticsearch 5.1.

You can read more about the supported operations for Elasticsearch in the Amazon Elasticsearch Developer Guide, and you can get started by visiting the Amazon Elasticsearch Service website and/or signing in to the AWS Management Console.

Tara