Last week Backblaze made the exciting announcement that through partnerships with Packet and ServerCentral, cloud computing is available to Backblaze B2 Cloud Storage customers.
Those of you familiar with cloud computing will understand the significance of this news. We are now offering the least expensive cloud storage + cloud computing available anywhere. You no longer have to submit to the lock-in tactics and exorbitant prices charged by the other big players in the cloud services biz.
We understand that some of our cloud backup and storage customers might be unfamiliar with cloud computing. Backblaze made its name in cloud backup and object storage, and that’s what our customers know us for. In response to customers’ requests, we’ve directly connected our B2 cloud object storage with cloud compute providers. This adds the ability to use and run programs on data once it’s in the B2 cloud, opening up a world of new uses for B2. Just some of the possibilities include media transcoding and rendering, web hosting, application development and testing, business analytics, disaster recovery, on-demand computing capacity (cloud bursting), AI, and mobile and IoT applications.
The world has been moving to a multi-cloud / hybrid cloud world, and customers are looking for more choices than those offered by the existing cloud players. Our B2 compute partnerships build on our mission to offer cloud storage that’s astonishingly easy and low-cost. They enable our customers to move into a more flexible and affordable cloud services ecosystem that provides a greater variety of choices and costs far less. We believe we are helping to fulfill the promise of the internet by allowing customers to choose the best-of-breed services from the best vendors.
If You’re Not Familiar with Cloud Computing, Here’s a Quick Overview
Cloud computing is another component of cloud services, like object storage, that replicates in the cloud a basic function of a computer system. Think of services that operate in a cloud as an infinitely scalable version of what happens on your desktop computer. In your desktop computer you have computing/processing (CPU), fast storage (like an SSD), data storage (like your disk drive), and memory (RAM). Their counterparts in the cloud are computing (CPU), block storage (fast storage), object storage (data storage), and processing memory (RAM).
CPU, RAM, fast internal storage, and a hard drive are the basic building blocks of a computer. They are also the basic building blocks of cloud computing.
Some customers require only some of these services, such as cloud storage. B2 as a standalone service has proven to be an outstanding solution for those customers interested in backing up or archiving data. There are many customers that would like additional capabilities, such as performing operations on that data once it’s in the cloud. They need object storage combined with computing.
With the just announced compute partnerships, Backblaze is able to offer computing services to anyone using B2. A direct connection between Backblaze’s and our partners’ data centers means that our customers can process data stored in B2 with high speed, low latency, and zero data transfer costs.
Cloud service providers package up CPU, storage, and memory into services that you can rent on an hourly basis. You can scale up and down and add or remove services as you need them.
How Does Computing + B2 Work?
Those wanting to use B2 with computing will need to sign up for accounts with Backblaze and either Packet or ServerCentral. Packet customers need only select “SJC1” as their region and then get started. The process is also simple for ServerCentral customers — they just need to register with a ServerCentral account rep.
The direct connection between B2 and our compute partners means customers will experience very low latency (less than 10ms) between services. Even better, all data transfers between B2 and the compute partner are free. When combined with Backblaze B2, customers can obtain cloud computing services for as little as 50% of the cost of Amazon’s Elastic Compute Cloud (EC2).
Opening Up the Cloud “Walled Garden”
Traditionally, cloud vendors charge fees for customers to move data outside the “walled garden” of that particular vendor. These fees reach upwards of $0.12 per gigabyte (GB) for data egress. This large fee for customers accessing their own data restricts users from using a multi-cloud approach and taking advantage of less expensive or better-performing options. With free transfers between B2 and Packet or ServerCentral, customers now have a predictable, scalable solution for computing and data storage while avoiding vendor lock-in. Dropbox made waves when they saved $75 million by migrating off of AWS. Adding computing to B2 helps anyone interested in moving some or all of their computing off of AWS and thereby cutting their AWS bill by 50% or more.
What are the Advantages of Cloud Storage + Computing?
Using computing and storage in the cloud provides a number of advantages over using in-house resources.
You don’t have to purchase the actual hardware, software licenses, and provide space and IT resources for the systems.
Cloud computing is available with just a few minutes’ notice, and you only pay for whatever period of time you need. You avoid having additional hardware on your balance sheet.
Resources are in the cloud and can provide online services to customers, mobile users, and partners located anywhere in the world.
You can isolate the work on these systems from your normal production environment, making them ideal for testing and trying out new applications and development projects.
Computing resources scale when you need them to, providing temporary or ongoing extra resources for expected or unexpected demand.
They can provide redundant and failover services when and if your primary systems are unavailable for whatever reason.
Where Can I Learn More?
We encourage B2 customers to explore the options available at our partner sites, Packet and ServerCentral. They are happy to help customers understand what services are available and how to get started.
We are excited to see what you build! And please tell us in the comments what you are doing or have planned with B2 + computing.
Regularly I receive mail from people wanting to advertise on, write for, or sponsor posts on my blog. My rule is that I say no to everyone. There is no amount of money or free stuff that will get me to write about your security product or service.
With regard to squid, however, I have no such compunctions. Send me any sort of squid anything, and I am happy to write about it. Earlier this week, for example, I received two — not one — copies of the new book Squid Empire: The Rise and Fall of Cephalopods. I haven’t read it yet, but it looks good. It’s the story of prehistoric squid.
In the cybersecurity community, much time is spent trying to speak the language of business in order to communicate our problems to business leaders. One way we do this is by trying to adapt the concept of “return on investment,” or “ROI,” to explain why they need to spend more money. Stop doing this. It’s nonsense. ROI is a concept pushed by vendors in order to justify why you should pay money for their snake oil security products. Don’t play the vendor’s game.
The correct concept is simply “risk analysis”. Here’s how it works. List out all the risks. For each risk, calculate:
How often it occurs.
How much damage it does.
How to mitigate it.
How effective the mitigation is (reduces chance and/or cost).
How much the mitigation costs.
If you have risk of something that’ll happen once-per-day on average, costing $1000 each time, then a mitigation costing $500/day that reduces likelihood to once-per-week is a clear win for investment.
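The arithmetic in that example can be sketched in a few lines. This is a toy model, not a full risk matrix; the numbers are the ones from the example above:

```python
def expected_daily_cost(events_per_day, cost_per_event, mitigation_cost_per_day=0.0):
    """Expected daily loss from a risk, plus any daily mitigation spend."""
    return events_per_day * cost_per_event + mitigation_cost_per_day

# Unmitigated: once per day on average, $1000 in damage each time.
unmitigated = expected_daily_cost(1.0, 1000.0)           # 1000.0 per day

# Mitigated: likelihood drops to once per week, mitigation costs $500/day.
mitigated = expected_daily_cost(1.0 / 7, 1000.0, 500.0)  # ~642.86 per day

# The mitigation is a clear win whenever mitigated < unmitigated.
```

A vendor’s ROI pitch only means something once its numbers can be plugged into a comparison like this one.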
Now, ROI should in theory fit directly into this model. If you are paying $500/day to reduce that risk, I could use ROI to show you hypothetical products that will …
…reduce the remaining risk to once-per-month for an additional $10/day.
…replace that $500/day mitigation with a $400/day mitigation.
But this is never done. Companies don’t have a sophisticated enough risk matrix to plug in some ROI numbers to reduce cost/risk. Instead, ROI is a standalone calculation done by a vendor pimping product, or by a security engineer building empires within the company.
If you haven’t done risk analysis to begin with (and almost none of you have), then ROI calculations are pointless.
But there are further problems. This is risk analysis as done in industries like oil and gas, which have inanimate risk. Almost all their risks are due to accidental failures, like the Deepwater Horizon incident. In our industry, cybersecurity, risks are animate — driven by hackers. Our risk models are based on trying to guess what hackers might do.
An example of this problem: when our drug company jacks up the price of an HIV drug, Anonymous hackers will break in and dump all our financial data, and our CFO will go to jail. A lot of our risks now come not from the technical side, but from the whims and fads of the hacker community.
Another example is when some Google researcher finds a vuln in WordPress, and our website gets hacked with it three months from now. We have to forecast not only what hackers can do now, but what they might be able to do in the future.
Finally, there is the problem in cybersecurity that we really can’t distinguish between pesky and existential threats. Take ransomware. A lot of large organizations have gotten accustomed to just wiping a few workers’ machines every day and restoring from backups. It’s a small, pesky problem of little consequence. Then one day a piece of ransomware gets domain admin privileges and takes down the entire business for several weeks, as happened after NotPetya. Inevitably our risk models come down on the high side of estimates, with us claiming that all threats are existential, when in fact most companies continue to survive major breaches.
These difficulties with risk analysis lead us to punt on the problem altogether, but that’s not the right answer. No matter how faulty our risk analysis is, we still have to go through the exercise.
One model of how to do this calculation is architecture. We know we need a certain number of toilets per building, even without doing ROI on the value of such toilets. The same is true for a lot of security engineering. We know we need firewalls, encryption, and OWASP hardening, even without specifically doing a calculation. Passwords and session cookies need to go across SSL. That’s the starting point from which we start to analyze risks and mitigations — what we need beyond SSL, for example.
So stop using “ROI”, or worse, the abomination “ROSI”. Start doing risk analysis.
To everyone’s surprise, the sun has actually managed to show its face this summer in Britain! So we’re not feeling too guilty for having asked the newest crop of Pioneers to Make it outdoors. In fact, the 11- to 16-year-olds that took part in our second digital making challenge not only made things that celebrate the outdoors – some of them actually carted their entire coding setup into the garden. Epic!
We asked you to make it outdoors with tech, challenging all our Pioneers to code and build awesome projects that celebrate the outside world. And we were not disappointed! Congratulations to everyone who took part. Every entry was great and we loved them all.
We set the challenge to Make it outdoors, and our theme winners HH Squared really delivered! You best captured the spirit of what our challenge was asking with your fabulous, fun-looking project which used the outdoors to make it a success. HH Squared, we loved Pi Spy so much that we may have to make our own for Pi Towers! Congratulations on winning this award.
Watching all the entry videos, our judges had the tricky task of picking the top of the pops from among the projects. In addition to ‘theme winner’, we had a number of other categories to help make their job a little bit easier:
We appreciate what you’re trying to do: We know that when tackling a digital making project, time and tech sometimes aren’t in your favour. But we still want to see what you’ve got up to, and this award category recognises that even though you haven’t fully realised your ambition yet, you’ve made a great start. *And*, when you do finish, we think it’s going to be awesome. Congratulations to the UTC Bullfrogs for winning this award – we can’t wait to see the final project!
Inspiring journey: This category recognises that getting from where you’ve started to where you want to go isn’t always smooth sailing. Maybe teams had tech problems, maybe they had logistical problems, but the winners of this award did a great job of sharing the trials and tribulations they encountered along the way! Coding Doughnuts, your project was a little outside the box IN a box. We loved it.
Technically brilliant: This award is in recognition of some serious digital making chops. Robot Apocalypse Committee, you owned this award. Get in!
Best explanation: Digital making is an endeavour that involves making a thing, and then sharing that thing. The winners of this category did a great job of showing us exactly what they made, and how they made it. They also get bonus points for making a highly watchable, entertaining video. Uniteam, we got it. We totally got it! What a great explanation of such a wonderful project – and it made us laugh too. Well done!
The Judges’ Special Recognition Awards
Because we found it so hard to just pick five winners, the following teams will receive our Judges’ Special Recognition Award:
PiChasers with their project Auqa (yes, the spelling is intentional!)
Sunscreen Superstars, making sure we’re all protected in the glorious British sunshine
Off The Shelf and their ingenious Underwater Canal Scanner
Glassbox, who made us all want Nerf guns thanks to their project Tin Can Alley
Turtle Tamers, ensuring the well-being of LEGO turtles around the world with their project Umbrella Empire
Winners from both our Make us laugh and Make it outdoors challenges will be joining us at Google HQ for a Pioneers summer camp full of making funtimes! They’ll also receive some amazing prizes to help them continue in their digital making adventures.
Massive thanks go to our judges for helping to pick the winners!
And for your next Pioneers challenge…
Ha, as if we’re going to tell you just yet – we’re still recovering from this challenge! We’ll be back in September to announce the theme of the next cycle – so make sure to sign up for our newsletter to be reminded closer to the time.
Today seems like a good time to recap some of the features that we have added to Amazon EC2 Container Service over the last year or so, and to share some customer success stories and code with you! The service makes it easy for you to run any number of Docker containers across a managed cluster of EC2 instances, with full console, API, CloudFormation, CLI, and PowerShell support. You can store your Linux and Windows Docker images in the EC2 Container Registry for easy access.
Launch Recap
Let’s start by taking a look at some of the newest ECS features and some helpful how-to blog posts that will show you how to use them:
IAM Roles for Tasks – You can secure your infrastructure by assigning IAM roles to ECS tasks. This allows you to grant permissions on a fine-grained, per-task basis, customizing the permissions to the needs of each task. Read IAM Roles for Tasks to learn more.
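As a rough sketch of where the per-task role fits, a task definition carries it in the taskRoleArn field. The family name, ARNs, and image below are made up for illustration:

```python
# Hypothetical ECS task definition: containers in this task inherit only the
# permissions of the IAM role named in taskRoleArn, not the instance's role.
task_def = {
    "family": "orders-service",  # hypothetical service name
    "taskRoleArn": "arn:aws:iam::123456789012:role/OrdersTaskRole",  # hypothetical role
    "containerDefinitions": [
        {
            "name": "orders",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/orders:latest",
            "memory": 512,
            "essential": True,
        }
    ],
}

# Registering it is a single call (requires AWS credentials):
# import boto3
# boto3.client("ecs").register_task_definition(**task_def)
```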
Service Auto Scaling – You can define scaling policies that scale your services (tasks) up and down in response to changes in demand. You set the desired minimum and maximum number of tasks, create one or more scaling policies, and Service Auto Scaling will take care of the rest. The documentation for Service Auto Scaling will help you to make use of this feature.
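A minimal sketch of the first step, registering a service’s desired task count as a scalable target; the cluster and service names are assumptions for illustration:

```python
# Hypothetical Service Auto Scaling setup for an ECS service: register the
# service's DesiredCount as a scalable target bounded by a min and max.
scalable_target = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/demo-cluster/orders-service",  # hypothetical names
    "ScalableDimension": "ecs:service:DesiredCount",
    "MinCapacity": 2,   # never run fewer than 2 tasks
    "MaxCapacity": 10,  # never run more than 10 tasks
}

# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**scalable_target)
# Scaling policies attached to this target then move DesiredCount between
# MinCapacity and MaxCapacity in response to demand.
```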
Blox – Scheduling, in a container-based environment, is the process of assigning tasks to instances. ECS gives you three options: automated (via the built-in Service Scheduler), manual (via the RunTask function), and custom (via a scheduler that you provide). Blox is an open source scheduler that supports a one-task-per-host model, with room to accommodate other models in the future. It monitors the state of the cluster and is well-suited to running monitoring agents, log collectors, and other daemon-style tasks.
Container Instance Draining – From time to time you may need to remove an instance from a running cluster in order to scale the cluster down or to perform a system update. Earlier this year we added a set of lifecycle hooks that allow you to better manage the state of the instances. Read the blog post How to Automate Container Instance Draining in Amazon ECS to see how to use the lifecycle hooks and a Lambda function to automate the process of draining existing work from an instance while preventing new work from being scheduled for it.
Task Placement Policies – This launch provided you with fine-grained control over the placement of tasks on container instances within clusters. It allows you to construct policies that include cluster constraints, custom constraints (location, instance type, AMI, and attribute), placement strategies (spread or bin pack) and to use them without writing any code. Read Introducing Amazon ECS Task Placement Policies to see how to do this!
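To make the constraint and strategy pieces concrete, here is a hedged sketch of a RunTask request combining both; the cluster name, task definition, and instance-type expression are made up for illustration:

```python
# Hypothetical RunTask parameters: restrict candidate instances with a
# memberOf constraint, then spread tasks across Availability Zones and
# bin-pack on memory within each zone.
run_task_params = {
    "cluster": "demo-cluster",            # hypothetical cluster
    "taskDefinition": "orders-service:1", # hypothetical task definition
    "count": 4,
    "placementConstraints": [
        {"type": "memberOf", "expression": "attribute:ecs.instance-type =~ m4.*"}
    ],
    "placementStrategy": [
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        {"type": "binpack", "field": "memory"},
    ],
}

# import boto3
# boto3.client("ecs").run_task(**run_task_params)
```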
EC2 Container Service in Action
Many of our customers, from large enterprises to hot startups, and across all industries such as financial services, hospitality, and consumer electronics, are using Amazon ECS to run their microservices applications in production. Companies such as Capital One, Expedia, Okta, Riot Games, and Viacom rely on Amazon ECS.
Mapbox is a platform for designing and publishing custom maps. The company uses ECS to power their entire batch processing architecture to collect and process over 100 million miles of sensor data per day that they use for powering their maps. They also optimize their batch processing architecture on ECS using Spot Instances. The Mapbox platform powers over 5,000 apps and reaches more than 200 million users each month. Its backend runs on ECS allowing it to serve more than 1.3 billion requests per day. To learn more about their migration to ECS, read their recent blog post, We Switched to Amazon ECS, and You Won’t Believe What Happened Next.
Travel company Expedia designed their backends with a microservices architecture. With the popularization of Docker, they decided they would like to adopt Docker for its faster deployments and environment portability. They chose to use ECS to orchestrate all their containers because it had great integration with the AWS platform, everything from ALB to IAM roles to VPC integration. This made ECS very easy to use with their existing AWS infrastructure. ECS really reduced the heavy lifting of deploying and running containerized applications. Expedia runs 75% of all apps on AWS in ECS allowing it to process 4 billion requests per hour. Read Kuldeep Chowhan’s blog post, How Expedia Runs Hundreds of Applications in Production Using Amazon ECS to learn more.
Realtor.com provides home buyers and sellers with a comprehensive database of properties that are currently for sale. Their move to AWS and ECS has helped them to support business growth that now numbers 50 million unique monthly users who drive up to 250,000 requests per second at peak times. ECS has helped them to deploy their code more quickly while increasing utilization of their cloud infrastructure. Read the Realtor.com Case Study to learn more about how they use ECS, Kinesis, and other AWS services.
What do you do when you have almost half a petabyte (PB) of data? That’s the situation in which Michael Oskierko finds himself. He’s a self-proclaimed digital pack rat who’s amassed more than 390 terabytes (TB) total, and it’s continuing to grow.
Based in Texas, Michael Oskierko is a financial analyst by day. But he’s set up one of the biggest personal data warehouses we’ve seen. The Oskierko family has a huge collection of photos, videos, documents and more – much more than most of us. Heck, more data than many companies have.
How Did It Get Like This?
“There was a moment when we were pregnant with our second child,” Michael explained. “I guess it was a nesting instinct. I was looking at pictures of our first child and played them back on a 4K monitor. It was grainy and choppy.”
Disappointed with the quality of those early images, he vowed to store future memories in a pristine state. “I got a DSLR that took great pictures and saved everything in RAW format. That’s about 30 MB per image right there.”
Michael says he now has close to 1 million photos (from many different devices, not just the DSLR) and about 200,000 videos stored in their original formats. Michael says that video footage from his drone alone occupies about 300 GB.
The Oskierkos are also avid music listeners: iTunes counts 707 days’ worth of music in their library at present. Michael keeps Green Day’s entire library on heavy rotation, with a lot of other alternative rock a few clicks away. His wife’s musical tastes are quite broad, ranging from ghetto rap to gospel. They’re also avid audiobook listeners, and it all adds up: Dozens more TB of shared storage space dedicated to audio files.
What’s more, he’s kept very careful digital records of stuff that otherwise might have gotten tossed to the curbside years ago. “I have every single note, test, project, and assignment from 7th grade through graduate school scanned and archived,” he tells us. He’s even scanned his textbooks from high school and college!
“I started cutting these up and scanning the pages before the nifty ‘Scan to PDF’ was a real widespread option and duplexing scanners were expensive,” he said.
One of the biggest uses of space isn’t something that Michael needs constant access to, but he’s happy to have when the need arises. As a hobbyist programmer who works in multiple languages and on different platforms, Michael maintains a library of uncompressed disk images (ISOs) which he uses as needed.
When you have this much storage, it’s silly to get greedy with it. Michael operates his sprawling setup as a personal cloud for his family members, as well.
“I have a few hosted websites, and everyone in my family has a preconfigured FTP client to connect to my servers,” he said.
Bargain Hunting For Big Storage
How do you get 390 TB without spending a mint? Michael says it’s all about finding the right deals. The whole thing got started when a former boss asked if Michael would be interested in buying the assets of his shuttered computer repair business. Michael ended up with an inventory of parts which he’s successfully scavenged into the beginning of his 390 TB digital empire.
He’s augmented and improved that over time, evolving his digital library across six distinct storage systems that he uses to maintain all of his family’s personal data. He keeps an eye out wherever he can for good deals.
“There are a few IT support and service places I pass by on my daily commute to work,” he said. He stops in periodically to check if they’re blowing out inventory. Ebay and other online auction sites are great places for him to find deals.
“I just bought 100 1 TB drives from a guy on eBay for $4 each,” he said.
Michael has outgrown and retired a bunch of devices over the years as his storage empire has grown, but he keeps an orderly collection of parts and supplies for when he has to make some repairs.
How To Manage Large Directories: Keep It Simple
“I thoroughly enjoy data archiving and organizing,” Michael said. Perhaps a massive understatement. While he’s looked at Digital Asset Management (DAM) software and other tools to manage his ever-growing library, Michael prefers a more straightforward approach to figuring out what’s where. His focus is on a simplified directory structure.
“I would have to say I spend about 2 hours a week just going through files and sorting things out but it’s fun for me,” Michael said. “There are essentially five top-level directories.”
Documents, installs, disk images, music, and a general storage directory comprise the highest hierarchy. “I don’t put files in folders with other folders,” he explained. “The problem I run into is figuring out where to go for old archives that are spread across multiple machines.”
How To Back Up That Much Data
Even though he has a high-speed fiber optic connection to the Internet, Michael doesn’t want to use it all for backup. So much of his local backup and duplication is done using cloning and Windows’ built-in Xcopy tool, which he manages using home-grown batch files.
Michael also relies on Backblaze Personal Backup for mission-critical data on his family’s personal systems. “I recommend it to everyone I talk to,” he said.
In addition to loads of available local storage for backups, three of Michael’s personal computers back up to Backblaze. He makes them accessible to family members who want the peace of mind of cloud-based backup. He’s also set up Backblaze for his father-in-law’s business and his mother’s personal computer.
“I let Backblaze do all the heavy lifting,” he said. “If you ever have a failure, Backblaze will have a copy we can restore.”
Thanks from all of us at Backblaze for spreading the love, Michael!
The 390 TB is spread across six systems, which has led to some logistical difficulties for Michael, like remembering to power up the right one to get what he needs (he doesn’t typically run everything all the time to help conserve electricity).
“Sometimes I have to sit there and think, ‘Where did I store my drone footage,’” Michael said.
To simplify things, Michael is trying to consolidate his setup. And to that end, he recently acquired a decommissioned Storage Pod from Backblaze. He said he plans to populate the 45-bay Pod with the largest hard drives he can afford, which will hopefully make it simpler, easier, and more efficient to store all that data.
Well, as soon as he can find a great deal on 8 TB and 10 TB drives, anyway. Keep checking eBay, Michael, and stay in touch! We can’t wait to see what your Storage Pod will look like in action!
I finally got a chance to see Rogue One: A Star Wars Story recently (I know, I’m late to the game, but I was off my feet for a few weeks around the holidays). It got me thinking about data backups and data security. Whether you’re in a small business, an enterprise, the Imperial Forces, or just backing up a home computer, the same rules apply.
Spoiler Alert: If you haven’t seen Rogue One and don’t want any plot details leaked, skip this blog post.
Test Your Security
The Imperial databank on Scarif is an impressive facility with only one way on and off the planet. That Shield Gate is one heck of a firewall. But Jyn Erso and Cassian Andor prove that Imperial security is fallible. No matter how good you think your data defense is, it can probably be better.
Conduct regular reviews of your backup and data security strategies to make sure you’re not leaving any glaring holes open that someone can take advantage of. Regularly change passwords. Use encryption. Here’s more on how we use encryption.
Scarif is the only place in the Galaxy where the Empire keeps a copy of the plans. If you only have one backup, it’s better than nothing – but not by much. Especially when Governor Tarkin decides to test his new toy on your planet. Better to back up in at least two places.
We recommend a two-step approach. In addition to the live data on your computer, you should keep a local backup copy on site in case you need to do a quick restore. Another copy in the cloud (not Cloud City) will make sure that no matter what happens, you have a copy you can recover from (that’s what we’re here for).
If you don’t already have a backup strategy in place, make sure to check out our Computer Backup Guide for lots of information about how to get started.
Check Your Backups
One other thing we learn from the Death Star plans – the Empire didn’t manage version control very well. Take a close look at the Death Star schematic that Jyn and Cassian absconded with. Notice anything…off?
Yeah, that’s right. The focus lens for the superlaser is equatorial. Now, everyone knows that the Death Star’s superlaser is actually on the northern hemisphere. Which goes to show you that Jyn and Cassian made off with a previous backup, not the current data.
It’s important to test your backups periodically to make sure that the files that are important to you are safe and sound. Don’t just set up a backup system and forget it – verify periodically that all the data you actually need is being backed up.
Restoring your data shouldn’t be as hard as massing a rebel assault on Scarif. There’s another practical reason to test your backup and restore process periodically — so you’ll be familiar with the workflow when it matters. Catastrophes that require data recovery are fraught with enough peril. Don’t make it worse by learning how to use the software on the fly; otherwise you might end up like an X-Wing hitting the Shield Gate.
You’re One With The Force And The Force Is With You
Data security and backup doesn’t need to be a battle. Develop a strategy that works for you, make sure your data is safe and sound, and check it once in awhile to make sure it’s up to date and complete. That way, just like the Force, your data will be with you, always.
Are you trying to move away from a batch-based ETL pipeline? You might do this, for example, to get real-time insights into your streaming data, such as clickstream, financial transactions, sensor data, customer interactions, and so on. If so, it’s possible that as soon as you get down to requirements, you realize your streaming data doesn’t have all of the fields you need for real-time processing, and you are really missing that JOIN keyword!
You might also have requirements to enrich the streaming data based on a static reference, a dynamic dataset, or even with another streaming data source. How can you achieve this without sacrificing the velocity of the streaming data?
In this blog post, I provide three use cases and approaches for joining and enriching streaming data:
Joining streaming data with a relatively static dataset on Amazon S3 using Amazon Kinesis Analytics
In this use case, Amazon Kinesis Analytics can be used to define a reference data input on S3, and to use that S3 data to enrich a streaming data source.
For example, bike-share systems around the world publish data files about available bikes and docks at each station in real time. Bike-share data feeds that follow the General Bikeshare Feed Specification (GBFS) include a reference dataset containing a static list of all stations, their capacities, and locations.
Let’s say you would like to enrich the bike-availability data (which changes throughout the day) with the bike station’s latitude and longitude (which is static) for downstream applications. The architecture would look like this:
To illustrate how you can use Amazon Kinesis Analytics to do this, follow these steps to set up an AWS CloudFormation stack, which will do the following:
Continuously produce sample bike-availability data onto an Amazon Kinesis stream (with, by default, a single shard).
Generate a sample S3 reference data file.
Create an Amazon Kinesis Analytics application that performs the join with the reference data file.
For the stack name, type demo-data-enrichment-s3-reference-stack. Under the KinesisAnalyticsReferenceDataS3Bucket parameter, type the name of an S3 bucket in the US East (N. Virginia) region where the reference data will be uploaded. This CloudFormation template will create a sample data file under KinesisAnalyticsReferenceDataS3Key. (Make sure the value of the KinesisAnalyticsReferenceDataS3Key parameter does not conflict with existing S3 objects in your bucket.) You’ll also see parameters referencing an S3 bucket (LambdaKinesisAnalyticsApplicationCustomResourceCodeS3Bucket) and an associated S3 key. These represent the location of the AWS Lambda package (written in Python) that contains the AWS CloudFormation custom resources required to create an Amazon Kinesis application. Do not modify the values of these parameters.
Follow the remaining steps in the Create Stack wizard (click “Next” button, and then the “Create” button). Wait until the status displayed for the stack is CREATE_COMPLETE.
You now have an Amazon Kinesis stream in your AWS account that carries sample bike-availability streaming data (named demo-data-enrichment-bike-availability-input-stream if you used the default parameters), along with an Amazon Kinesis Analytics application that performs the join.
You might want to examine the reference data file generated under your bucket (its default name is demo_data_enrichment_s3_reference_data.json). If you are using a JSON formatter/viewer, it will look like the following. Note how the records are put together as top-level objects (no commas separating the records).
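To make that format concrete, here is a hedged sketch of how records could be serialized as concatenated top-level JSON objects. The field names are assumptions based on the join columns used later in this walkthrough; the actual demo file may differ.

```python
import json

def to_reference_file_body(records):
    """Serialize records as concatenated top-level JSON objects (no commas, no enclosing array)."""
    return "".join(json.dumps(rec) for rec in records)

# Illustrative station records; the real demo data is generated by the CloudFormation stack.
body = to_reference_file_body([
    {"station_id": "1", "station_name": "Main St", "station_lat": 40.7, "station_lon": -74.0},
    {"station_id": "2", "station_name": "Oak Ave", "station_lat": 40.8, "station_lon": -74.1},
])
```

Each record is a complete JSON object placed directly after the previous one, which is what allows Kinesis Analytics to parse them individually.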
In the CloudFormation template, the reference data has been added to the Amazon Kinesis application through the AddApplicationReferenceDataSource API method.
Next, open the Amazon Kinesis Analytics console and examine the application. If you did not modify the default parameters, the application name should be demo-data-enrichment-s3-reference-application.
The Application status should be RUNNING. If its status is still STARTING, wait until it changes to RUNNING. Choose Application details, and then go to SQL results. You should see a page similar to the following:
In the Amazon Kinesis Analytics SQL code, an in-application stream (DESTINATION_SQL_STREAM) is created with five columns. Data is inserted into the stream using a pump (STREAM_PUMP) by joining the source stream (SOURCE_SQL_STREAM_001) and the reference S3 data file (REFERENCE_DATA) using the station_id field. For more information about streams and pumps, see In-Application Streams and Pumps in the Amazon Kinesis Analytics Developer Guide.
The joined (enriched) data fields – station_name, station_lat and station_lon – should appear on the Real-time analytics tab, DESTINATION_SQL_STREAM.
The LEFT OUTER JOIN works just as it does in ANSI-standard SQL. When you use LEFT OUTER JOIN, the streaming records are preserved even if there is no matching station_id in the reference data. You can delete LEFT OUTER (leaving only JOIN, which performs an INNER join), choose Save and run SQL, and then only the records with a matching station_id will appear in the output.
You can write the results to an Amazon Kinesis stream or to S3, Amazon Redshift, or Amazon Elasticsearch Service with an Amazon Kinesis Firehose delivery stream by adding a destination on the Destination tab. For more information, see the Writing SQL on Streaming Data with Amazon Kinesis Analytics blog post.
Note: If you have not modified the default parameters in the CloudFormation template, an S3 reference data source has already been added to your application. If you are interested in adding an S3 reference data source to your own streams, the following code snippet shows what it looks like in Python:
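The snippet below is a minimal sketch of that call using the boto3 AddApplicationReferenceDataSource API. The bucket ARN, file key, role ARN, and application name are hypothetical placeholders, and the schema is an assumption modeled on the demo's station fields; substitute your own values.

```python
def build_reference_data_source(table_name, bucket_arn, file_key, role_arn):
    """Build the ReferenceDataSource argument for AddApplicationReferenceDataSource.
    All argument values are supplied by the caller; nothing here is demo-specific."""
    return {
        "TableName": table_name,  # name of the in-application reference table
        "S3ReferenceDataSource": {
            "BucketARN": bucket_arn,
            "FileKey": file_key,
            "ReferenceRoleARN": role_arn,  # IAM role Kinesis Analytics assumes to read the object
        },
        # Schema assumed from the demo's JSON records; adjust to match your file.
        "ReferenceSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [
                {"Name": "station_id", "SqlType": "VARCHAR(16)", "Mapping": "$.station_id"},
                {"Name": "station_name", "SqlType": "VARCHAR(64)", "Mapping": "$.station_name"},
                {"Name": "station_lat", "SqlType": "DOUBLE", "Mapping": "$.station_lat"},
                {"Name": "station_lon", "SqlType": "DOUBLE", "Mapping": "$.station_lon"},
            ],
        },
    }

def add_reference_data(application_name, source):
    import boto3  # imported here so the pure helper above is usable without boto3 installed
    client = boto3.client("kinesisanalytics", region_name="us-east-1")
    # AddApplicationReferenceDataSource requires the current application version id.
    version_id = client.describe_application(ApplicationName=application_name)[
        "ApplicationDetail"]["ApplicationVersionId"]
    client.add_application_reference_data_source(
        ApplicationName=application_name,
        CurrentApplicationVersionId=version_id,
        ReferenceDataSource=source)
```

You would call `add_reference_data("my-application", build_reference_data_source(...))` with your own names and ARNs.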
This data enrichment approach works for a relatively static dataset because it requires you to upload the entire reference dataset to S3 whenever there is a change. It might not work for cases where the dataset changes too frequently for the uploads to keep up with the streaming data.
If you change the reference data stored in the S3 bucket, you need to use the UpdateApplication operation (using the API or AWS CLI) to refresh the data in the Amazon Kinesis Analytics in-application table. You can use the following Python script to refresh the data. Although it will not be covered in this blog post, you can automate data refresh by setting up Amazon S3 event notification with AWS Lambda.
import boto3

APPLICATION_NAME = "demo-data-enrichment-s3-reference-application"  # Name of the Kinesis Analytics application
AWS_REGION = "us-east-1"  # AWS region

def main():
    kinesisanalytics_client = boto3.client("kinesisanalytics", region_name=AWS_REGION)
    # retrieve the current application version id and its S3 reference data source
    detail = kinesisanalytics_client.describe_application(
        ApplicationName=APPLICATION_NAME)["ApplicationDetail"]
    ref = detail["ReferenceDataSourceDescriptions"][0]
    s3 = ref["S3ReferenceDataSourceDescription"]
    # update the application; re-applying the same S3 source refreshes the in-application table
    kinesisanalytics_client.update_application(
        ApplicationName=APPLICATION_NAME,
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
        ApplicationUpdate={"ReferenceDataSourceUpdates": [{
            "ReferenceId": ref["ReferenceId"],
            "S3ReferenceDataSourceUpdate": {
                "BucketARNUpdate": s3["BucketARN"],
                "FileKeyUpdate": s3["FileKey"],
                "ReferenceRoleARNUpdate": s3["ReferenceRoleARN"]}}]})

if __name__ == "__main__":
    main()
To clean up resources created during this demo
To delete the Amazon Kinesis Analytics application, open the Amazon Kinesis Analytics console, choose the application (demo-data-enrichment-s3-reference-application), choose Actions, and then choose Delete application.
To delete the stack, open the AWS CloudFormation console, choose the stack (demo-data-enrichment-s3-reference-stack), choose Actions, and then choose Delete Stack.
The S3 reference data file should have been removed from your bucket, but your other data files should remain intact.
Enriching streaming data with a dynamic dataset using AWS Lambda and Amazon DynamoDB
In many cases, the data you want to enrich is not static. Imagine a case in which you are capturing user activities from a web or mobile application and have to enrich the activity data with frequently changing fields from a user database table.
A better approach is to store the reference data in a way that supports efficient random reads and writes. Amazon DynamoDB is a fully managed non-relational database that fits this case. AWS Lambda can be set up to automatically read batches of records from your Amazon Kinesis stream, perform data enrichment (such as looking up data in DynamoDB), and then produce the enriched records onto another stream.
If you have a stream of user activities and want to look up a user’s birth year from a DynamoDB table, the architecture will look like this:
To set up a demo with this architecture, follow these steps to set up an AWS CloudFormation stack, which continuously produces sample user scores onto an Amazon Kinesis stream. An AWS Lambda function enriches the records with data from a DynamoDB table and produces the results onto another Amazon Kinesis stream.
Open the CloudFormation console and choose Create Stack. Make sure the US East (N. Virginia) region is selected.
Choose Specify an Amazon S3 template URL, enter or paste the following URL, and then choose Next.
For the stack name, type demo-data-enrichment-ddb-reference-stack. Review the parameters. If none of them conflict with your Amazon Kinesis stream names or DynamoDB table name, you can accept the default values and choose Next. The following steps are written with the assumption that you are using the default values.
Complete the remaining steps in the Create Stack wizard. Wait until CREATE_COMPLETE is displayed for the stack status. This CloudFormation template creates three Lambda functions: one for setting up sample data in a DynamoDB table, one for producing sample records continuously, and one for data enrichment. The description of this last function is Data Enrichment Demo – Lambda function to consume and enrich records from Kinesis.
Open the Lambda console, select Data Enrichment Demo – Lambda function to consume and enrich records from Kinesis, and then choose the Triggers tab.
If No records processed is displayed for Last result, wait for a few minutes, and then reload this page. If things are set up correctly, OK will be displayed for Last result. This might take a few minutes.
At this point, the Lambda function is performing the data enrichment and producing records onto an output stream. Lambda is also commonly used in this position to preprocess records so that a downstream analytics application can handle more complicated data formats.
Although this data enrichment process doesn't involve Amazon Kinesis Analytics, you can still use the Kinesis Analytics service to examine, or even process, the enriched records by creating a new Kinesis Analytics application and connecting it to demo-data-enrichment-user-activity-output-stream.
The records on the demo-data-enrichment-user-activity-output-stream will look like the following and will show the enriched birthyear field on the streaming data.
You can use this birthyear value in your Amazon Kinesis Analytics application. Although it's not covered in this blog post, you can, for example, aggregate user activities by age group.
You can review the code for the Lambda function on the Code tab. The code used here is for demonstration purposes only. You might want to modify and thoroughly test it before applying it to your use case.
Note the following:
The Lambda function receives records in batches (instead of one record per Lambda invocation). You can control this behavior through Batch size on the Triggers tab (in the preceding example, Batch size is set to 100). The batch size you specify is the maximum number of records that you want your Lambda function to receive per invocation.
The retrieval from the reference data source (in this case, DynamoDB) can be done in batch instead of record by record. This example uses the batch_get_item() API method.
Depending on how strict the record sequencing requirement is, the writing of results to the output stream can also be done in batch. This example uses the put_records() API method.
DynamoDB is not the only option. What's important is to have a data source that supports random reads (and writes) with high efficiency. Because it's a distributed database with built-in partitioning, DynamoDB is a good choice, but you can also use Amazon RDS as long as data retrieval is fast enough (perhaps with a proper index and by spreading the read workload over read replicas). You can also use an in-memory database such as Amazon ElastiCache for Redis or Memcached.
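A condensed sketch of such an enrichment function, following the batching notes above. The DynamoDB table name (demo-user-table) and the user_id/birthyear record shape are illustrative assumptions, not the demo stack's actual code; the output stream name is taken from this walkthrough.

```python
import base64
import json

def enrich(records, birthyears):
    """Pure enrichment step: attach a birthyear field to each decoded record."""
    for rec in records:
        rec["birthyear"] = birthyears.get(rec["user_id"])
    return records

def handler(event, context):
    # Lambda delivers Kinesis records in batches (controlled by Batch size on the Triggers tab).
    records = [json.loads(base64.b64decode(r["kinesis"]["data"])) for r in event["Records"]]
    import boto3  # imported here so enrich() stays testable without boto3 installed
    dynamodb = boto3.client("dynamodb", region_name="us-east-1")
    # Batch-read the reference rows; batch_get_item requires unique keys (max 100 per call).
    keys = [{"user_id": {"S": uid}} for uid in {rec["user_id"] for rec in records}]
    resp = dynamodb.batch_get_item(RequestItems={"demo-user-table": {"Keys": keys}})
    birthyears = {item["user_id"]["S"]: int(item["birthyear"]["N"])
                  for item in resp["Responses"]["demo-user-table"]}
    enriched = enrich(records, birthyears)
    # Batch-write the enriched records onto the output stream with put_records.
    boto3.client("kinesis", region_name="us-east-1").put_records(
        StreamName="demo-data-enrichment-user-activity-output-stream",
        Records=[{"Data": json.dumps(rec), "PartitionKey": rec["user_id"]}
                 for rec in enriched])
```

Production code would also need to handle UnprocessedKeys from batch_get_item and failed records from put_records, which this sketch omits.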
To clean up the resources created for this demo
If you created a Kinesis Analytics application: Open the Amazon Kinesis Analytics console, select your application, choose Actions, and then choose Delete application.
Open the CloudFormation console, choose demo-data-enrichment-ddb-reference-stack, choose Actions, and then choose Delete Stack.
Joining multiple sets of streaming data using Amazon Kinesis Analytics
To illustrate this use case, let’s say you have two streams of temperature sensor data coming from a set of machines: one measuring the engine temperature and the other measuring the power supply temperature. For the best prediction of machine failure, you’ve been told (perhaps by your data scientist) that the two temperature readings must be validated against a prediction model ─ not individually, but as a set.
Because this is streaming data, the time window for the data joining should be clearly defined. In this example, the joining must occur on temperature readings within the same window (same minute) so that the temperature of the engine is matched against the temperature of the power supply in the same timeframe.
This data joining can be achieved with Amazon Kinesis Analytics, but has to follow a certain pattern.
First of all, an Amazon Kinesis Analytics application supports no more than one streaming data source. The different sets of streaming data have to be produced onto one Amazon Kinesis stream.
An Amazon Kinesis Analytics application can use this Amazon Kinesis stream as input and can split the data into multiple in-application streams based on a certain field (in this example, sensor_location).
The joining can now occur on the two in-application streams. In this example, the join fields are the machine_id and the one-minute tumbling window.
The selection of the time field for the windowing criteria is a topic of its own. For more information, see Writing SQL on Streaming Data with Amazon Kinesis Analytics. A well-aligned window is important for real-world applications. In this example, the processing time (ROWTIME) is used for the tumbling window calculation for simplicity.
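The application performs this join in SQL, but the windowing logic can be illustrated in plain Python: group each reading under a (machine_id, minute) key, average per window, and inner-join the two sensors' groups. This is a toy sketch under assumed record shapes, not the application's code.

```python
from collections import defaultdict

def minute_bucket(epoch_seconds):
    """Truncate a timestamp to the start of its one-minute tumbling window."""
    return epoch_seconds - (epoch_seconds % 60)

def join_readings(engine_readings, psu_readings):
    """Average each sensor per (machine_id, window), then inner-join the two sides.
    Each reading is assumed to be a (machine_id, epoch_seconds, temperature) tuple."""
    def averages(readings):
        sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
        for machine_id, ts, temp in readings:
            key = (machine_id, minute_bucket(ts))
            sums[key][0] += temp
            sums[key][1] += 1
        return {k: total / n for k, (total, n) in sums.items()}
    eng, psu = averages(engine_readings), averages(psu_readings)
    # Only keys present on both sides survive, as in the application's inner join.
    return {k: (eng[k], psu[k]) for k in eng.keys() & psu.keys()}
```

For example, two engine readings and one power supply reading for machine m1 in the same minute produce a single joined row of the two averages.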
Again, you can follow these steps to set up an AWS CloudFormation stack, which continuously produces two sets of sample sensor data onto a single Amazon Kinesis stream and creates an Amazon Kinesis Analytics application that performs the joining.
Open the CloudFormation console and choose Create Stack. Make sure the US East (N. Virginia) region is selected.
Choose Specify an Amazon S3 template URL, enter or paste the following URL, and then choose Next.
For the stack name, type demo-data-joining-multiple-streams-stack. Review the parameters. If none of them conflict with your Amazon Kinesis stream names, you can accept the default values and choose Next. The following steps are written with the assumption that you are using the default values.
You will also see parameters referencing an S3 bucket and an S3 key. These point to an AWS Lambda package (written in Python) that contains the AWS CloudFormation custom resources used to create an Amazon Kinesis Analytics application.
Complete the remaining steps in the Create Stack wizard. Wait until the status displayed for the stack is CREATE_COMPLETE.
Open the Amazon Kinesis Analytics console and examine the application. If you have not modified the default parameters in the CloudFormation template, the application name should be demo-data-joining-multiple-streams-application.
The Application status should be RUNNING. If its status is still STARTING, wait until it changes to RUNNING.
Choose Application details, and then go to SQL results.
You should see a page similar to the following:
The SQL statements have been populated for you. The script first creates two in-application streams (for engine and power supply temperature readings, respectively). DESTINATION_SQL_STREAM holds the joined results.
On the Real-time analytics tab, you'll see the average temperature readings of the engine and power supply joined together using the machine_id and per-minute tumbling window. The results can be written to an Amazon Kinesis stream or to S3, Amazon Redshift, or Amazon Elasticsearch Service with a Firehose delivery stream.
To clean up resources created for this demo
To delete the Amazon Kinesis Analytics application, open the Amazon Kinesis Analytics console, choose demo-data-joining-multiple-streams-application, choose Actions, and then choose Delete application.
Open the CloudFormation console, choose demo-data-joining-multiple-streams-stack, choose Actions, and then choose Delete Stack.
In this blog post, I shared three approaches for joining and enriching streaming data on Amazon Kinesis Streams by using Amazon Kinesis Analytics, AWS Lambda, and Amazon DynamoDB. I hope this information will be helpful in situations where your real-time analytics applications require additional data fields from reference data sources, or where real-time insights must be derived from data across multiple streaming sources. These joining and enrichment techniques will help you get the best business value from your data.
If you have a question or suggestion, please leave a comment below.
About the author
Assaf Mentzer is a Senior Consultant in Big Data & Analytics for AWS Professional Services. He works with enterprise customers to provide leadership on big data projects, helping them reach their full data potential in the cloud. In his spare time, he enjoys watching sports, playing ping pong, and hanging out with family and friends.
In our willingness to believe any evil of Trump, some have claimed his original name was "Drumpf". This isn't true; that isn't how the German language works. Trump has the power to short-circuit critical thinking in both his supporters and his enemies, and the "Drumpf" meme is just one example.
There was no official pronunciation or spelling of German words and names until after Trump's grandfather was born. As this Guardian article describes, in Kallstadt, the town where Trump's grandfather was born, you'll see many different spellings of the family name in the church records, like "Drumb, Tromb, Tromp, Trum, Trumpff, Dromb" and "Trump". A person might spell their name different ways on different documents, and the names of children might be spelled differently than their parents'. It makes German genealogy tough sometimes.
During that time, different areas of Germany had dialects that were as far apart as Dutch and German are today. Indeed, these dialects persist. Germans who grow up outside of cities often learn their local dialect and standard German as two different languages. Everyone understands standard German, but many villagers cannot speak it. They often live their entire lives within a hundred kilometers of where they grew up because if they go too far away, people can no longer understand them.
The various German dialects, sub-dialects, and accents often had consistent language shifts, where the same sound is pronounced differently across many words. For example, words that in English have a ‘p’ will in German have ‘pf” instead, like the word penny becoming Pfennig, or pepper becoming Pfeffer.
Kallstadt is located in the Pfalz region of Germany, or as they pronounce it in the local dialect, Palz. You see what I'm getting at: what is 'pf' in standard German is 'p' (as in English) in the local dialect. Thus, you'd say "Trump" if you were speaking the Pfalz dialect, or "Trumpf" if you were speaking standard German.
It’s like the word for stocking, which in standard German is Strumpf. In documents written around that time in the Pfalz region, you’d find spellings like Strump, Strumpf, Strumpff, Strimp, and Stromp. Both the vowels and the last consonant would change (according to a Pfalz dictionary I found online).
Friederich Trump was born in 1869, at a time when Germany was split into numerous smaller countries. The German Empire that unified Germany was created in 1871. The conference to standardize the language and spellings was held in 1876. Friederich emigrated to America in 1885. In other words, his birth predates the era in which the spelling of names would have been standardized.
From the records we have, “Trump” was on his baptism record, and “Trump” is how he spelled his name in America, but “Trumpf”, with an ‘f’ was on his immigration form. That’s perfectly reasonable. The immigration officer was probably a German speaker, who asked his name, and spelled it according to his version of German, with an ‘f’.
This idea of an official spelling/pronunciation of a name is a modern invention, with the invention of the modern “state” and “government officials”. It didn’t exist back when Friederich was born. His only birth record is actually his baptismal record at the local church.
Thus, Trump's name is spelled "Trump". It was never officially spelled any other way in the past, and it was never "changed". Sure, you'll see church documents and such with different spellings, but that's just how all words and names were handled back then. Insisting that he's "Drumpf" is ignorant; that's not how the German language works.
Update: Somebody named Gwenda Blair wrote a book on Trump's family, which claims the name comes from Hanns Drumpf, who settled in Kallstadt in 1608. But they can't connect the dots, because right afterward the Thirty Years' War happened. It's a famous event in Germany because it burnt most of the church records. Almost all German family trees can be traced back to the Thirty Years' War, but no further.
It’s probable they were related. It’s possible that Hanns was even an ancestor of Trump who at one time spelled his name “Drumpf”. But that’s still not the official spelling, because that’s not how German worked at that time.
Update: According to Snopes, it's true that "Donald Trump's ancestors changed their surname from Drumpf to Trump". Snopes is wrong, because they are morons. The correct answer is "there's no record of a name change". The fact that different sources make conflicting claims should've been proof enough for Snopes that there's no evidence to support the claim.
BTW, my great-grandmother is "Pennsylvania Dutch", most of whom came from that same region. I may be distantly related to Trump.
Also BTW, isn't it weird that we are talking about his grandfather, born 150 years ago? His grandfather was 36 when his son was born, and his father was 41 when Trump was born. Three generations back, and we are already in a pre-historical era, that is to say, the era where we had writing but not standardized spelling.
A certain monk heard that master Suku knew the secret of designing code for maximum reusability. But whenever the monk begged the master to share her wisdom, Suku only walked away. Exasperated, the monk asked one of Suku’s three apprentices for help.
“To learn the master’s great secret, you must approach her correctly,” explained the apprentice. “Come; I shall assist you.”
The apprentice gave the monk special ceremonial robes, which were several sizes too large and had to be wound twice around his arms and legs. To keep the robes from unraveling, the apprentice tied a long sash tightly around the monk's body from wrists to ankles. When the monk protested that walking was now impossible, the apprentice only nodded, saying that the monk was meant to approach Suku on his belly, with his head low and his feet high.
Angrily the monk writhed slowly down the corridor on his stomach, cursing Suku and wondering whether any information could possibly be worth such ridiculous effort.
At this thought, the monk was suddenly enlightened.
Some masters answer a question with a single gesture; Suku answered without even being asked. The master has a most efficient API indeed, for she returns a usable value even when her function is not called.
The general wanted a mount that could cross the Empire.
The groom delivered only a plush saddle.
Many fine horses stumble on stony roads—
Sometimes even a general should think with his posterior.
I guess it’s interesting to look into where it came from.
The Internet tells me that Valentine’s Day, much like every other interesting holiday, is rooted in an ancient Roman festival called Lupercalia and held on February 13 through 15.
I say “festival”, but, uh. It involved sacrificing a goat and a dog, cutting the goat’s hide into strips, dipping the strips into the goat’s own blood, and then running around town wearing a goatskin and slapping (whipping?) women with the strips. Probably while drunk. But the women were generally on board, because this was supposed to make them fertile. Oh, and the people doing this were priests, of course.
Then they’d have a hookup lottery, where all the single ladies would put their names in a jar, and a guy could draw a name, and the pair would go shack up for the night or maybe the year.
All of this was in service of the god Lupercus, the god of — you guessed it — shepherds.
You zany Romans.
I found half a dozen different descriptions of this holiday that explain it half a dozen very different ways, so I’m going to avoid elaborating too much for fear someone will think I know what I’m talking about. The goal seems to have been a bit fuzzy, anything from purification to fertility to honoring the founders of Rome to just an excuse to drink and eat a lot. About as consistent as our modern holidays, then.
It seems this festival overtook an earlier one called Februalia, which was a purification celebration — essentially, spring cleaning. That’s what February is named for. So there is in fact a very distant tie between the very month of February and Valentine’s Day. Neat. (There was also a god of purification named Februus, though in an interesting twist, he came later and thus was named after the festival/month, not the other way around.)
The chain of events thus far is something like: it rains in spring, which led to a cleaning festival, which turned into a purification festival, which was somehow co-opted by one involving fertility rites.
I’m pretty sure we don’t still do the goat-blood-slapping thing, so something else must have happened.
During the 200s, the Roman Emperor was Claudius II, a man who was tragically unaware that the decline of the Roman Empire had begun in 190.
The first thing I read was that he had the goofy idea that unmarried men made better soldiers than married men, so he decreed it illegal for young people to marry. Some Christian priests were officiating weddings in secret, and one of them was named Valentine, and that’s why he was executed. This is all according to some Christian website that’s really trying to drive home the martyrdom angle.
But then I checked the liberal hedonistic pit Wikipedia, and it tells me that no such decree was ever issued, which rather puts a hole in the story.
Regardless, there was definitely a Christian priest named Valentine who was executed on February 14, 269. Apocryphally, his last words were a note that he signed "your Valentine". And then he had his head cut off. Aww.
Astoundingly, Claudius had another Valentine put to death exactly four years later — on February 14, 273. A third Valentine may also have been executed in Africa on February 14, but nobody knows any specifics.
It was a couple hundred years before the reigning pope decided that this Lupercalia thing simply had to go, and invoked that most hallowed of Christian traditions: stamping out a pagan holiday by making up a new one on the same day. And so February 14 became St. Valentine’s Day. It probably should’ve been St. Valentines’ Day.
Or, uh, so the story goes, but this is pretty murky too. I’m not sure if the reuse of the date was intentional, or a huge coincidence — frankly, given how many festivals the Romans had, they’d have had a hard time inventing a new holiday that didn’t overlap an existing Roman thing. I also have absolutely no idea what St. Valentines’ Day was actually for, insofar as days devoted to saints are actually for anything in particular. The best I’ve found is “feasting”.
A few casual retellings of this story implicitly (or even explicitly) link St. Valentines’ Day directly to romance right from the beginning, based on the marriage story. But it seems that there’s no record of any romantic connotations until a poem written by Chaucer in 1382, which is extra weird because he seemed to be nodding at an existing tradition. Another nod appears in a charter written in 1400, founding a court which by all other appearances seemed to never have existed.
Reading about this has been a truly surreal experience. I initially avoided looking at Wikipedia for fear I would just end up paraphrasing it, and instead found a dozen conflicting stories that all turned out to be wrong anyway. The actual history (well, as told by Wikipedia) appears to be full of huge gaps that we just don’t know anything about. All I can really say for sure is that Valentine’s Day exists, at least one guy named Valentine was once put to death on the same day, and the Romans had a festival around the same time.
The rest of the history isn’t nearly so bizarre. In the early 1800s, companies started printing pre-made Valentine cards, and cheap postage made it feasible to mail them around. Momentum built from there, and of course, anyone with something vaguely romantic to sell latched on.
I can’t remember ever doing anything particularly romantic for Valentine’s Day. No one in this house is really big on holidays; we don’t even do anything special for Christmas. So I don’t have particularly strong feelings here.
In fact, my strongest association with the day is still from grade school. I don’t know if this is a cultural or regional thing or what, but I had multiple teachers who expected every student to go buy one of those big boxes of 30 cheesy Valentine’s Day cards branded with popular cartoons or whatever and give one to every other student.
I never understood this. I still don’t. I can see how it might have come about: kids exchange cards, some kids don’t get any, they feel bad, teachers see an obvious way to compensate. But then what’s the point of doing it at all? What does this even have to do with romance any more? Did it ever, considering I wasn’t even 10 when this was happening, or was it always a goofy popularity contest?
Here’s a better tradition for young kids, I think: the teacher buys a mountain of chocolate and gives it out. It’s based on my own tradition of buying a mountain of discount chocolate on February 15. Everyone wins!
Come to think of it, those incredibly cheesy cards might be one of my favorite parts of Valentine’s Day. Not the mass-produced Spongebob ones, but knowingly-corny ones created by artists, like the set Mel made a few years ago (some of which only make sense in the context of a story they were telling at the time). I heart-ily endorse creating more of this kind of nonsense.
Another hallowed tradition is for single people to lament that they’re single. I guess, anyway. Like the Super Bowl thing from a few weeks ago, this is something I’ve seen complained about much more than I’ve seen actually happen. I don’t really know anything about human mating rituals, but lamenting one’s lack of a partner seems exactly opposite to the spirit of Valentine’s Day, which at its heart seems to be about flirting. Whether there’s an actual connection or not, maybe that Roman holiday had it right: everyone should put their names in an urn, draw partners at random, and hook up for a night. While dressed up as goats. That sounds like a good plan.
Hmm. I just don’t really have a lot to say here. Valentine’s Day is the pink and red holiday that fills the seasonal aisle after Christmas but before the Easter candy has been stocked. It comes, and it goes, and I think once we used it as an excuse to go to a restaurant. Its history is more confusing than outrageous, and its modern incarnation is a fairly quiet blip on the calendar. So I think that’s all I’ve got. Sorry 🙂