AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console. Just point AWS Glue to your data store. AWS Glue discovers your data and stores the associated metadata (for example, a table definition and schema) in the AWS Glue Data Catalog.
AWS Glue has native connectors to data sources using JDBC drivers, either on AWS or elsewhere, as long as there is IP connectivity. In this post, we demonstrate how to connect to data sources that are not natively supported in AWS Glue today. We walk through connecting to and running ETL jobs against two such data sources, IBM DB2 and SAP Sybase. However, you can use the same process with any other JDBC-accessible database.
AWS Glue data sources
AWS Glue natively supports the following data stores by using the JDBC protocol:
One of the fastest growing architectures deployed on AWS is the data lake. The ETL processes that are used to ingest, clean, transform, and structure data are critically important for this architecture. Having the flexibility to interoperate with a broader range of database engines allows for a quicker adoption of the data lake architecture.
For data sources that AWS Glue doesn’t natively support, such as IBM DB2, Pivotal Greenplum, SAP Sybase, or any other relational database management system (RDBMS), you can import custom database connectors from Amazon S3 into AWS Glue jobs. In this case, the connection to the data source must be made from the AWS Glue script to extract the data, rather than using AWS Glue connections. To learn more, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.
Setting up an ETL job for an IBM DB2 data source
The first example demonstrates how to connect the AWS Glue ETL job to an IBM DB2 instance, transform the data from the source, and store it in Apache Parquet format in Amazon S3. To successfully create the ETL job using an external JDBC driver, you must define the following:
The S3 location of the job script
The S3 location of the temporary directory
The S3 location of the JDBC driver
The S3 location of the Parquet data (output)
The IAM role for the job
By default, AWS Glue suggests bucket names for the scripts and the temporary directory using the following format:
Keep in mind that having the AWS Glue job and S3 buckets in the same AWS Region helps save on cross-Region data transfer fees. For this post, we will work in the US East (Ohio) Region (us-east-2).
Creating the IAM role
The next step is to set up the IAM role that the ETL job will use:
Sign in to the AWS Management Console, and search for IAM.
On the IAM console, choose Roles in the left navigation pane.
Choose Create role. The role type of trusted entity must be an AWS service, specifically AWS Glue.
Choose Next: Permissions.
Search for the AWSGlueServiceRole policy, and select it.
Search again, this time for the SecretsManagerReadWrite policy. This policy allows the AWS Glue job to access database credentials that are stored in AWS Secrets Manager.
CAUTION: This policy is open and is being used for testing purposes only. You should create a custom policy to narrow the access just to the secrets that you want to use in the ETL job.
Select this policy, and choose Next: Review.
Give your role a name, for example, GluePermissions, and confirm that both policies were selected.
Choose Create role.
Now that you have created the IAM role, it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the DB2 driver, which is available on the IBM Support site.
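If you prefer to script the upload, a minimal boto3 sketch follows; the bucket, key, and jar file name are examples only, so point them at the S3 location that you will later reference as the job's dependent JARs path.

import boto3

s3 = boto3.client('s3')
# Example bucket/key; use the location you plan to reference in the job.
s3.upload_file('db2jcc4.jar',
               'aws-glue-scripts-1234567890-us-east-2',
               'drivers/db2jcc4.jar')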
Storing database credentials
It is a best practice to store database credentials in a safe store. In this case, we use AWS Secrets Manager to securely store credentials. Follow these steps to create those credentials:
Open the console, and search for Secrets Manager.
In the AWS Secrets Manager console, choose Store a new secret.
Under Select a secret type, choose Other type of secrets.
In the Secret key/value, set one row for each of the following parameters:
db_url (for example, jdbc:db2://10.10.12.12:50000/SAMPLE)
output_bucket (for example, aws-glue-data-output-1234567890-us-east-2/User)
For Secret name, use DB2_Database_Connection_Info.
Keep the Disable automatic rotation check box selected.
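As a quick sanity check, and to preview how the ETL job will read the credentials, here is a minimal boto3 sketch that retrieves the secret created above (it assumes the us-east-2 Region used in this post):

import json
import boto3

client = boto3.client('secretsmanager', region_name='us-east-2')
response = client.get_secret_value(SecretId='DB2_Database_Connection_Info')
secret = json.loads(response['SecretString'])
print(secret['db_url'], secret['output_bucket'])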
Adding a job in AWS Glue
The next step is to author the AWS Glue job, following these steps:
In the AWS Management Console, search for AWS Glue.
In the navigation pane on the left, choose Jobs under the ETL section.
Choose Add job.
Fill in the basic Job properties:
Give the job a name (for example, db2-job).
Choose the IAM role that you created previously (GluePermissions).
For This job runs, choose A new script to be authored by you.
For ETL language, choose Python.
In the Script libraries and job parameters section, choose the location of your JDBC driver for Dependent jars path.
On the Connections page, choose Next.
On the summary page, choose Save job and edit script. This creates the job and opens the script editor.
In the editor, replace the existing code with the following script. Important: the final part of the script maps the fields from the source table to the destination, drops the null fields to save space in the Parquet destination, and writes the output to Amazon S3 in Parquet format.
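The full script from the original job isn't reproduced here; the following is a condensed sketch of what it looks like. It assumes that, in addition to db_url and output_bucket, the secret also holds db_username, db_password, and db_table keys, and it hard-codes the DB2 driver class name; adjust these to match your environment.

import sys
import json
import boto3
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import DropNullFields

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Fetch the connection details stored earlier in AWS Secrets Manager.
secrets = boto3.client('secretsmanager', region_name='us-east-2')
secret = json.loads(
    secrets.get_secret_value(SecretId='DB2_Database_Connection_Info')['SecretString'])

# Read the source table over JDBC. The db_username, db_password, and db_table
# keys are assumed to be stored in the same secret alongside db_url.
df = spark.read.format('jdbc') \
    .option('driver', 'com.ibm.db2.jcc.DB2Driver') \
    .option('url', secret['db_url']) \
    .option('user', secret['db_username']) \
    .option('password', secret['db_password']) \
    .option('dbtable', secret['db_table']) \
    .load()

df.printSchema()
print(df.count())

# Map/transform as needed (an ApplyMapping step would typically go here),
# drop null fields to save space, and write the result to S3 in Parquet format.
dyf = DropNullFields.apply(frame=DynamicFrame.fromDF(df, glueContext, 'dyf'))
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='s3',
    connection_options={'path': 's3://' + secret['output_bucket']},
    format='parquet')

job.commit()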
Choose the black X on the right side of the screen to close the editor.
Running the ETL job
Now that you have created the job, the next step is to execute it as follows:
On the Jobs page, select your new job. On the Action menu, choose Run job, and confirm that you want to run the job. Wait a few moments while it finishes execution.
After the job shows as Succeeded, choose Logs to read the output of the job.
In the output of the job, you will find the schema printed by df.printSchema() and the row count reported by df.count().
Also, if you go to your output bucket in S3, you will find the Parquet result of the ETL job.
Using AWS Glue, you have created an ETL job that connects to an existing database using an external JDBC driver. It enables you to execute any transformation that you need.
Setting up an ETL job for an SAP Sybase data source
In this section, we describe how to create an AWS Glue ETL job against an SAP Sybase data source. The process described in the previous section works for a Sybase data source, with a few changes required in the job:
While creating the job, choose the correct jar for the JDBC dependency.
In the script, change the reference to the secret to be used from AWS Secrets Manager:
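For example, assuming the Sybase connection details were stored under a new secret name (the name and values below are hypothetical), the lookup in the script becomes:

secret = json.loads(
    secrets.get_secret_value(SecretId='Sybase_Database_Connection_Info')['SecretString'])

# The JDBC URL and driver class stored in the secret also change, for example:
#   db_url:  jdbc:sybase:Tds:10.10.12.13:5000/SAMPLE
#   driver:  com.sybase.jdbc4.jdbc.SybDriver (jConnect)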
After you successfully execute the new ETL job, the output contains the same type of information that was generated with the DB2 data source.
Note that each of these JDBC drivers has its own nuances and different licensing terms that you should be aware of before using them.
Maximizing JDBC read parallelism
Something to keep in mind while working with big data sources is memory consumption. In some cases, "Out of Memory" errors occur when all the data is read into a single executor. One way to address this is to rely on the read parallelism that you can implement with Apache Spark and AWS Glue. To learn more, see the Apache Spark SQL module.
You can use the following options:
partitionColumn: The name of an integer column that is used for partitioning.
lowerBound: The minimum value of partitionColumn that is used to decide partition stride.
upperBound: The maximum value of partitionColumn that is used to decide partition stride.
numPartitions: The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), forms the partition strides for the generated WHERE clause expressions used to split the partitionColumn. When unset, this defaults to SparkContext.defaultParallelism.
Those options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but they don’t filter the rows in the table. Therefore, Spark partitions and returns all rows in the table. For example:
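The example from the original post isn't reproduced here; the following is a minimal sketch that reuses the connection details from the secret and assumes an integer employee_id column suitable for partitioning:

df = spark.read.format('jdbc') \
    .option('driver', 'com.ibm.db2.jcc.DB2Driver') \
    .option('url', secret['db_url']) \
    .option('user', secret['db_username']) \
    .option('password', secret['db_password']) \
    .option('dbtable', secret['db_table']) \
    .option('partitionColumn', 'employee_id') \
    .option('lowerBound', 1) \
    .option('upperBound', 100000) \
    .option('numPartitions', 10) \
    .load()

# Spark issues 10 parallel queries, each covering one stride of employee_id;
# rows outside [1, 100000] still land in the first or last partition.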
It’s important to be careful with the number of partitions because too many partitions could also result in Spark crashing your external database systems.
Using the process described in this post, you can connect to and run AWS Glue ETL jobs against any data source that can be reached using a JDBC driver. This includes new generations of common analytical databases like Greenplum and others.
You can improve the query efficiency of these datasets by using partitioning and pushdown predicates. For more information, see Managing Partitions for ETL Output in AWS Glue. This technique opens the door to moving data and feeding data lakes in hybrid environments.
Many of my colleagues are fortunate to be able to spend a good part of their day sitting down with and listening to our customers, doing their best to understand ways that we can better meet their business and technology needs. This information is treated with extreme care and is used to drive the roadmap for new services and new features.
AWS customers in the financial services industry (often abbreviated as FSI) are looking ahead to the Fundamental Review of the Trading Book (FRTB) regulations that will come into effect between 2019 and 2021. Among other things, these regulations mandate a new approach to the "value at risk" calculations that each financial institution must perform in the four-hour time window after trading ends in New York and begins in Tokyo. Today, our customers report this mission-critical calculation consumes on the order of 200,000 vCPUs, growing to between 400K and 800K vCPUs in order to meet the FRTB regulations. While there's still some debate about the magnitude and frequency with which they'll need to run this expanded calculation, the overall direction is clear.
Building a Big Grid
In order to make sure that we are ready to help our FSI customers meet these new regulations, we worked with TIBCO to set up and run a proof of concept grid in the AWS Cloud. The periodic nature of the calculation, along with the amount of processing power and storage needed to run it to completion within four hours, makes it a great fit for an environment where a vast amount of cost-effective compute power is available on an on-demand basis.
Our customers are already using the TIBCO GridServer on-premises and want to use it in the cloud. This product is designed to run grids at enterprise scale. It runs apps in a virtualized fashion, and accepts requests for resources, dynamically provisioning them on an as-needed basis. The cloud version supports Amazon Linux as well as the PostgreSQL-compatible edition of Amazon Aurora.
Working together with TIBCO, we set out to create a grid that was substantially larger than the current high-end prediction of 800K vCPUs, adding a 50% safety factor and then rounding up to reach 1.3 million vCPUs (5x the size of the largest on-premises grid). With that target in mind, the account limits were raised as follows:
Spot Instance Limit – 120,000
EBS Volume Limit – 120,000
EBS Capacity Limit – 2 PB
If you plan to create a grid of this size, you should also bring your friendly local AWS Solutions Architect into the loop as early as possible. They will review your plans, provide you with architecture guidance, and help you to schedule your run.
Running the Grid
We hit the Go button and launched the grid, watching as it bid for and obtained Spot Instances, each of which booted, initialized, and joined the grid within two minutes. The test workload used the Strata open source analytics & market risk library from OpenGamma and was set up with their assistance.
The grid grew, as planned, to 61,299 Spot Instances (1.3 million vCPUs drawn from 34 instance types spanning 3 generations of EC2 hardware), with just 1,937 instances reclaimed and automatically replaced during the run. It cost $30,000 per hour to run, at an average hourly cost of $0.078 per vCPU. If the same instances had been used in On-Demand form, the hourly cost to run the grid would have been approximately $93,000.
Despite the scale of the grid, prices for the EC2 instances did not move during the bidding process. This is due to the overall size of the AWS Cloud and the smooth price change model that we launched late last year.
To give you a sense of the compute power, we computed that this grid would have taken the #1 position on the TOP 500 supercomputer list in November 2007 by a considerable margin, and the #2 position in June 2008. Today, it would occupy position #360 on the list.
I hope that you enjoyed this AWS success story, and that it gives you an idea of the scale that you can achieve in the cloud!
I understand the urge, given how bad the larger political crises have gotten, to want to give to charities other than those related to software freedom. There are important causes out there that have become more urgent this year. Here’s three issues which have become shockingly more acute this year:
making sure the USA keeps its commitment to immigrants to allow them to make a new life here just like my own ancestors did,
assuring that the great national nature reserves are maintained and left pristine for generations to come,
assuring that we have zero tolerance for abusive behavior — particularly by those in power against people who come to them for help and job opportunities.
These are just three of the many issues this year that I’ve seen get worse, not better. I am glad that I know and support people who work on these issues, and I urge everyone to work on these issues, too.
Nevertheless, as I plan my primary donations this year, I’m again, as I always do, giving to the FSF and my own employer, Software Freedom Conservancy. The reason is simple: software freedom is still an essential cause and it is frankly one that most people don’t understand (yet). I wrote almost two years ago about the phenomenon I dubbed Kuhn’s Paradox. Simply put: it keeps getting more and more difficult to avoid proprietary software in a normal day’s tasks, even while the number of lines of code licensed freely gets larger every day.
As long as that paradox remains true, I see software freedom as urgent. I know that we're losing ground on so many other causes, too. But those of you who read my blog are some of the few people in the world who understand that software freedom is under threat and needs the urgent work that the very few software-freedom-related organizations, like the FSF and Software Freedom Conservancy, are doing. I hope you'll donate now to both of them. For my part, I gave $120 myself to the FSF as part of the monthly Associate Membership program, and in a few minutes, I'm going to give $400 to Conservancy. I'll be frank: if you work in technology in an industrialized country, I'm quite sure you can afford that level of money, and I suspect those amounts are less than most of you spent on technology equipment and/or network connectivity charges this year. Make a difference for us and give to the cause of software freedom at least as much as you're giving to large technology companies.
Finally, a good reason to give to smaller charities like the FSF and Conservancy is that your donation makes a bigger difference. I do think bigger organizations, such as (to pick an example of an organization I used to give to) my local NPR station, do important work. However, I was listening this week to my local NPR station, and they said their goal for that day was to raise $50,000. For Conservancy, that's closer to the goal we have for an entire fundraising season, which for this year was $75,000. The thing is: NPR is an important part of USA society, but it's one that nearly everyone understands. So few people understand the threats looming from proprietary software, and they may not understand at all until it's too late — when all their devices are locked down, DRM is fully ubiquitous, and no one is allowed to tinker with the software on their devices and learn the wonderful art of computer programming. We are at real risk of reaching that dystopia before 90% of the world's population understands the threat!
Thus, giving to organizations in the area of software freedom is just going to have a bigger and more immediate impact than more general causes that more easily connect with people. You’re giving to prevent a future that not everyone understands yet, and making an impact on our work to help explain the dangers to the larger population.
Good article on the history and practice of e-mail tracking:
The tech is pretty simple. Tracking clients embed a line of code in the body of an email — usually in a 1×1 pixel image, so tiny it’s invisible, but also in elements like hyperlinks and custom fonts. When a recipient opens the email, the tracking client recognizes that pixel has been downloaded, as well as where and on what device. Newsletter services, marketers, and advertisers have used the technique for years, to collect data about their open rates; major tech companies like Facebook and Twitter followed suit in their ongoing quest to profile and predict our behavior online.
But lately, a surprising — and growing — number of tracked emails are being sent not from corporations, but from acquaintances. “We have been in touch with users that were tracked by their spouses, business partners, competitors,” says Florian Seroussi, the founder of OMC. “It’s the wild, wild west out there.”
According to OMC’s data, a full 19 percent of all “conversational” email is now tracked. That’s one in five of the emails you get from your friends. And you probably never noticed.
I admit it’s enticing. I would very much like the statistics that adding trackers to Crypto-Gram would give me. But I still don’t do it.
Glenn Gore here, Chief Architect for AWS. I’m in Las Vegas this week — with 43K others — for re:Invent 2017. We’ve got a lot of exciting announcements this week. I’m going to check in to the Architecture blog with my take on what’s interesting about some of the announcements from a cloud architecture perspective. My first post can be found here.
The Media and Entertainment industry has been a rapid adopter of AWS due to the scale, reliability, and low costs of our services. This has enabled customers to create new, online, digital experiences for their viewers ranging from broadcast to streaming to Over-the-Top (OTT) services that can be a combination of live, scheduled, or ad-hoc viewing, while supporting devices ranging from high-def TVs to mobile devices. Creating an end-to-end video service requires many different components often sourced from different vendors with different licensing models, which creates a complex architecture and a complex environment to support operationally.
In my role, I participate in many AWS and industry events and often work with the production and event teams that put these shows together. With all the logistical tasks they have to deal with, the biggest question is often: “Will the live stream work?” Compounding this fear is the reality that, as users, we are also quick to jump on social media and make noise when a live stream drops while we are following along remotely. Worse is when I see event organizers actively choosing not to live stream content because of the risk of failure and exposure — leading them to take the safe option and not stream at all.
With AWS Media Services addressing many of the issues around putting together a high-quality media service, live streaming, and providing access to a library of content through a variety of mechanisms, I can’t wait to see more event teams use live streaming without the concern and worry I’ve seen in the past. I am excited for what this also means for non-media companies, as video becomes an increasingly common way of sharing information and adding a more personalized touch to internally- and externally-facing content.
AWS Media Services will allow you to focus more on the content and not worry about the platform. Awesome!
Amazon Neptune
As a civilization, we have been developing new ways to record and store information, and to model the relationships between sets of information, for more than a thousand years. Government census data, tax records, births, deaths, and marriages were all recorded on media ranging from knotted cords in the Inca civilization and clay tablets in ancient Babylon, to written texts in Western Europe during the late Middle Ages.
One of the first challenges of computing was figuring out how to store and work with vast amounts of information in a programmatic way, especially as the volume of information was increasing at a faster rate than ever before. We have seen different generations of how to organize this information in some form of database, ranging from flat files to the Information Management System (IMS) used in the 1960s for the Apollo space program, to the rise of the relational database management system (RDBMS) in the 1970s. These innovations drove a lot of subsequent innovations in information management and application development as we were able to move from thousands of records to millions and billions.
Today, as architects and developers, we have a vast variety of database technologies to select from, which have different characteristics that are optimized for different use cases:
Relational databases are well understood after decades of use in the majority of companies who required a database to store information. Amazon Relational Database Service (Amazon RDS) supports many popular relational database engines such as MySQL, Microsoft SQL Server, PostgreSQL, MariaDB, and Oracle. We have even brought the traditional RDBMS into the cloud world through Amazon Aurora, which provides MySQL and PostgreSQL support with the performance and reliability of commercial-grade databases at 1/10th the cost.
Non-relational databases (NoSQL) provided a simpler method of storing and retrieving information that was often faster and more scalable than traditional RDBMS technology. The concept of non-relational databases has existed since the 1960s but really took off in the early 2000s with the rise of web-based applications that required performance and scalability that relational databases struggled with at the time. AWS published this Dynamo whitepaper in 2007, with DynamoDB launching as a service in 2012. DynamoDB has quickly become one of the critical design elements for many of our customers who are building highly-scalable applications on AWS. We continue to innovate with DynamoDB, and this week launched global tables and on-demand backup at re:Invent 2017. DynamoDB excels in a variety of use cases, such as tracking of session information for popular websites, shopping cart information on e-commerce sites, and keeping track of gamers’ high scores in mobile gaming applications, for example.
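As a small illustration of the shopping-cart use case mentioned above, here is a hedged boto3 sketch; the table name and key schema are hypothetical:

import boto3

# Hypothetical table with a composite key of user_id (partition) and sku (sort).
dynamodb = boto3.resource('dynamodb', region_name='us-east-2')
cart = dynamodb.Table('ShoppingCart')

# Add an item to a user's cart, then read it back by key.
cart.put_item(Item={'user_id': 'user-123', 'sku': 'B0000123', 'quantity': 2})
item = cart.get_item(Key={'user_id': 'user-123', 'sku': 'B0000123'})['Item']
print(item)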
Graph databases focus on the relationship between data items in the store. With a graph database, we work with nodes, edges, and properties to represent data, relationships, and information. Graph databases are designed to make it easy and fast to traverse and retrieve complex hierarchical data models. Graph databases share some concepts from the NoSQL family of databases such as key-value pairs (properties) and the use of a non-SQL query language such as Gremlin. Graph databases are commonly used for social networking, recommendation engines, fraud detection, and knowledge graphs. We released Amazon Neptune to help simplify the provisioning and management of graph databases as we believe that graph databases are going to enable the next generation of smart applications.
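To give a feel for the property-graph model and Gremlin, here is a minimal sketch using the gremlinpython client; the Neptune endpoint, labels, and property names are placeholders, not part of the announcement:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical Neptune cluster endpoint; substitute your own.
endpoint = 'wss://my-neptune-cluster.cluster-abc123.us-east-2.neptune.amazonaws.com:8182/gremlin'
conn = DriverRemoteConnection(endpoint, 'g')
g = traversal().withRemote(conn)

# Create two person vertices and a "knows" edge between them.
g.addV('person').property('name', 'alice').next()
g.addV('person').property('name', 'bob').next()
g.V().has('person', 'name', 'alice') \
    .addE('knows').to(__.V().has('person', 'name', 'bob')).next()

# Traverse the relationship: who does alice know?
print(g.V().has('person', 'name', 'alice').out('knows').values('name').toList())

conn.close()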
A common use case I am hearing about every week as I talk to customers is how to incorporate chatbots within their organizations. Amazon Lex and Amazon Polly have made it easy for customers to experiment and build chatbots for a wide range of scenarios, but one of the missing pieces of the puzzle was how to model decision trees and knowledge graphs so the chatbot could guide the conversation in an intelligent manner.
Graph databases are ideal for this particular use case, and having Amazon Neptune simplifies the deployment of a graph database while providing high performance, scalability, availability, and durability as a managed service. Security of your graph database is critical. To help ensure this, you can run Amazon Neptune within your Amazon Virtual Private Cloud (Amazon VPC) and store your data encrypted at rest using encryption integrated with AWS Key Management Service (AWS KMS). Neptune also supports AWS Identity and Access Management (AWS IAM) to help further protect and restrict access.
Our customers now have the choice of many different database technologies to ensure that they can optimize each application and service for their specific needs. Just as DynamoDB has unlocked and enabled many new workloads that weren’t possible in relational databases, I can’t wait to see what new innovations and capabilities are enabled from graph databases as they become easier to use through Amazon Neptune.
Look for more on DynamoDB and Amazon S3 from me on Monday.
Robot-builder extraordinaire Clément Didier is ushering in the era of our cybernetic overlords. Future generations will remember him as the creator of robots constructed from cardboard and conductive paint which are so easy to replicate that a robot could do it. Welcome to the singularity.
This cool robot was made with the #PiCap, conductive paint and @Raspberry_Pi by @clementdidier. Full tutorial: https://t.co/AcQVTS4vr2 https://t.co/D04U5UGR0P
To assemble the robot, Clément made use of a Pi Cap board, a motor driver, and most importantly, a tube of Bare Conductive Electric Paint. He painted the control interface onto the cardboard surface of the robot, allowing a human, replicant, or superior robot to direct its movements simply by touching the paint.
The Raspberry Pi 3, the motor control board, and the painted input buttons interface via the GPIO breakout pins on the Pi Cap. Crocodile clips connect the Pi Cap to the cardboard-and-paint control surface, while jumper wires connect it to the motor control board.
Sing with me: ‘The Raspberry Pi’s connected to the Pi Cap, and the Pi Cap’s connected to the inputs, and…’
Two battery packs provide power to the Raspberry Pi, and to the four independently driven motors. Software, written in Python, allows the robot to respond to inputs from the conductive paint. The motors drive wheels attached to a plastic chassis, moving and turning the robot at the touch of a square of black paint.
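The robot's control software isn't reproduced in this post, but a minimal sketch of the idea looks something like the following; the GPIO pin numbers are hypothetical, and read_touched_pad() stands in for the Bare Conductive touch-sensing library that the Pi Cap provides:

import time
import RPi.GPIO as GPIO

# Hypothetical pin assignments; match them to your wiring between the Pi Cap
# breakout, the motor driver board, and the painted touch pads.
MOTOR_PINS = {'left_fwd': 17, 'left_rev': 18, 'right_fwd': 22, 'right_rev': 23}

GPIO.setmode(GPIO.BCM)
for pin in MOTOR_PINS.values():
    GPIO.setup(pin, GPIO.OUT, initial=GPIO.LOW)

def read_touched_pad():
    """Stand-in for the Pi Cap touch library: return the index of the painted
    pad currently being touched, or None."""
    return None

def drive(left_fwd, left_rev, right_fwd, right_rev):
    GPIO.output(MOTOR_PINS['left_fwd'], left_fwd)
    GPIO.output(MOTOR_PINS['left_rev'], left_rev)
    GPIO.output(MOTOR_PINS['right_fwd'], right_fwd)
    GPIO.output(MOTOR_PINS['right_rev'], right_rev)

try:
    while True:
        pad = read_touched_pad()
        if pad == 0:        # forward
            drive(1, 0, 1, 0)
        elif pad == 1:      # turn left
            drive(0, 0, 1, 0)
        elif pad == 2:      # turn right
            drive(1, 0, 0, 0)
        else:               # stop
            drive(0, 0, 0, 0)
        time.sleep(0.05)
finally:
    GPIO.cleanup()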
Clément used masking tape and a paintbrush to create the control buttons. For a human, this is obviously a fiddly process which relies on the blocking properties of the masking tape and a steady hand. For a robot, however, the process would be a simple, freehand one, resulting in neatly painted circuits on every single one of countless robotic minions. Cybernetic domination is at (metallic) hand.
One fiddly job for a human, one easy task for robotkind
The instructions and code for Clément’s build can be found here.
There are two opposing models of how the Internet has changed protest movements. The first is that the Internet has made protesters mightier than ever. This comes from the successful revolutions in Tunisia (2010-11), Egypt (2011), and Ukraine (2013). The second is that it has made them more ineffectual. Derided as “slacktivism” or “clicktivism,” the ease of action without commitment can result in movements like Occupy petering out in the US without any obvious effects. Of course, the reality is more nuanced, and Zeynep Tufekci teases that out in her new book Twitter and Tear Gas.
Tufekci is a rare interdisciplinary figure. As a sociologist, programmer, and ethnographer, she studies how technology shapes society and drives social change. She has a dual appointment in both the School of Information Science and the Department of Sociology at the University of North Carolina at Chapel Hill, and is a Faculty Associate at the Berkman Klein Center for Internet and Society at Harvard University. Her regular New York Times column on the social impacts of technology is a must-read.
Modern Internet-fueled protest movements are the subjects of Twitter and Tear Gas. As an observer, writer, and participant, Tufekci examines how modern protest movements have been changed by the Internet — and what that means for protests going forward. Her book combines her own ethnographic research and her usual deft analysis, with the research of others and some big data analysis from social media outlets. The result is a book that is both insightful and entertaining, and whose lessons are much broader than the book’s central topic.
“The Power and Fragility of Networked Protest” is the book’s subtitle. The power of the Internet as a tool for protest is obvious: it gives people newfound abilities to quickly organize and scale. But, according to Tufekci, it’s a mistake to judge modern protests using the same criteria we used to judge pre-Internet protests. The 1963 March on Washington might have culminated in hundreds of thousands of people listening to Martin Luther King Jr. deliver his “I Have a Dream” speech, but it was the culmination of a multi-year protest effort and the result of six months of careful planning made possible by that sustained effort. The 2011 protests in Cairo came together in mere days because they could be loosely coordinated on Facebook and Twitter.
That’s the power. Tufekci describes the fragility by analogy. Nepalese Sherpas assist Mt. Everest climbers by carrying supplies, laying out ropes and ladders, and so on. This means that people with limited training and experience can make the ascent, which is no less dangerous — sometimes with disastrous results. Says Tufekci: “The Internet similarly allows networked movements to grow dramatically and rapidly, but without prior building of formal or informal organizational and other collective capacities that could prepare them for the inevitable challenges they will face and give them the ability to respond to what comes next.” That makes them less able to respond to government counters, change their tactics — a phenomenon Tufekci calls “tactical freeze” — make movement-wide decisions, and survive over the long haul.
Tufekci isn’t arguing that modern protests are necessarily less effective, but that they’re different. Effective movements need to understand these differences, and leverage these new advantages while minimizing the disadvantages.
To that end, she develops a taxonomy for talking about social movements. Protests are an example of a “signal” that corresponds to one of several underlying “capacities.” There’s narrative capacity: the ability to change the conversation, as Black Lives Matter did with police violence and Occupy did with wealth inequality. There’s disruptive capacity: the ability to stop business as usual. An early Internet example is the 1999 WTO protests in Seattle. And finally, there’s electoral or institutional capacity: the ability to vote, lobby, fund raise, and so on. Because of various “affordances” of modern Internet technologies, particularly social media, the same signal — a protest of a given size — reflects different underlying capacities.
This taxonomy also informs government reactions to protest movements. Smart responses target attention as a resource. The Chinese government responded to 2015 protesters in Hong Kong by not engaging with them at all, denying them camera-phone videos that would go viral and attract the world’s attention. Instead, they pulled their police back and waited for the movement to die from lack of attention.
If this all sounds dry and academic, it’s not. Twitter and Tear Gas is infused with a richness of detail stemming from her personal participation in the 2013 Gezi Park protests in Turkey, as well as personal on-the-ground interviews with protesters throughout the Middle East — particularly Egypt and her native Turkey — Zapatistas in Mexico, WTO protesters in Seattle, Occupy participants worldwide, and others. Tufekci writes with a warmth and respect for the humans that are part of these powerful social movements, gently intertwining her own story with the stories of others, big data, and theory. She is adept at writing for a general audience, and, despite being published by the intimidating Yale University Press, her book is more mass-market than academic. What rigor is there is presented in a way that carries readers along rather than distracting.
The synthesist in me wishes Tufekci would take some additional steps, taking the trends she describes outside of the narrow world of political protest and applying them more broadly to social change. Her taxonomy is an important contribution to the more-general discussion of how the Internet affects society. Furthermore, her insights on the networked public sphere have applications for understanding technology-driven social change in general. These are hard conversations for society to have. We largely prefer to allow technology to blindly steer society or — in some ways worse — leave it to unfettered for-profit corporations. When you’re reading Twitter and Tear Gas, keep current and near-term future technological issues such as ubiquitous surveillance, algorithmic discrimination, and automation and employment in mind. You’ll come away with new insights.
Tufekci twice quotes historian Melvin Kranzberg from 1985: “Technology is neither good nor bad; nor is it neutral.” This foreshadows her central message. For better or worse, the technologies that power the networked public sphere have changed the nature of political protest as well as government reactions to and suppressions of such protest.
I have long characterized our technological future as a battle between the quick and the strong. The quick — dissidents, hackers, criminals, marginalized groups — are the first to make use of a new technology to magnify their power. The strong are slower, but have more raw power to magnify. So while protesters are the first to use Facebook to organize, the governments eventually figure out how to use Facebook to track protesters. It’s still an open question who will gain the upper hand in the long term, but Tufekci’s book helps us understand the dynamics at work.
I hesitate to blog this, because it’s an example of everything that’s wrong with pop psychology. Malcolm Harris writes about millennials, and has a theory of why millennials leak secrets. My guess is that you could write a similar essay about every named generation, every age group, and so on.
KLRU-TV, Austin PBS created Austin City Limits 42 years ago and has produced it ever since. Austin City Limits is the longest-running music series in television history. Over the years, KLRU accumulated over 550 episodes and thousands of hours of unaired footage stored on videotape. When KLRU decided to preserve their collection they turned to Backblaze for help with uploading and storing this unparalleled musical anthology in the Backblaze B2 cloud.
Upload: Backblaze B2 Fireball
KLRU started their preservation efforts by digitizing their collection of videotapes. After some internal processing, they were ready to upload the files to Backblaze, but there was a problem – one facing many organizations with a stash of historical digital data – their network connection was “slow”. It was fine for daily work, but uploading terabytes of data was not going to work.
“We would not have been able to get this project off the ground without the B2 Fireball.” – James Cole, KLRU
Backblaze B2 Fireball to the rescue. The B2 Fireball is a secure, shippable, data ingest system capable of transporting up to 40 terabytes of data from your location to Backblaze where the data is ingested into your B2 account. Designed for those organizations that have large amounts of data locally that they want to store in the cloud, the Backblaze B2 Fireball was just what KLRU needed to get the project started.
Preserve: Live Archive with B2
The KLRU staff is working hard to digitize and restore their entire musical archive, and they are committed to preserving their data by having both a local copy and a cloud copy of their files. By choosing Backblaze B2 Cloud Storage over a near-line or off-line storage solution, KLRU now has an affordable live archive of their data that they can access without delay, any time they need it.
You can download and read the entire Austin City Limits case study for more details on how KLRU used B2 as part of their strategy to preserve their entire catalog of Austin City Limits content for future generations.
We’re excited to co-host the 10th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place on June 13 – 15 at the San Jose Convention Center. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop – such as data science, cloud and operations, IoT and applications – and has been aptly renamed the DataWorks Summit. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event:
Familiarize yourself with the cutting edge in Apache project developments from the committers
Learn from your peers and industry experts about innovative and real-world use cases, development and administration tips and tricks, success stories and best practices to leverage all your data – on-premise and in the cloud – to drive predictive analytics, distributed deep-learning and artificial intelligence initiatives
Attend one of our more than 170 technical deep dive breakout sessions from nearly 200 speakers across eight tracks
Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data and more
Similar to previous years, we look forward to continuing Yahoo’s decade-long tradition of thought leadership at this year’s summit. Join us for an in-depth look at Yahoo’s Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse and Distributed Deep Learning at the breakout sessions below. Or, stop by Yahoo kiosk #700 at the community showcase.
Also, as a co-host of the event, Yahoo is pleased to offer a 20% discount for the summit with the code MSPO20. Register here for Hadoop Summit, San Jose, California!
Andy Feng – VP Architecture, Big Data and Machine Learning
Lee Yang – Sr. Principal Engineer
In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, that was open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.
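As a rough illustration of the programming model the talk describes (not the project's full API), a minimal TensorFlowOnSpark sketch might look like this; the cluster sizes are arbitrary and the map function body is a placeholder:

from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster

def map_fun(args, ctx):
    # Each Spark executor runs this function as one TensorFlow node; ctx
    # reports its role (ps or worker) and task index. The TensorFlow graph
    # and training loop for the job would be defined here.
    print('node {} running as {}'.format(ctx.task_index, ctx.job_name))

sc = SparkContext(conf=SparkConf().setAppName('tfos-sketch'))
num_executors = 4
num_ps = 1

cluster = TFCluster.run(sc, map_fun, None, num_executors, num_ps,
                        False, TFCluster.InputMode.TENSORFLOW)
cluster.shutdown()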
2:10 – 2:50 P.M. Handling Kernel Upgrades at Scale – The Dirty Cow Story
Samy Gawande – Sr. Operations Engineer
Savitha Ravikrishnan – Site Reliability Engineer
Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of hundreds of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips, and tricks for dealing with large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, through the Dirty COW use case (a privilege escalation vulnerability found in the Linux kernel in late 2016).
5:00 – 5:40 P.M. Data Highway Rainbow – Petabyte Scale Event Collection, Transport, and Delivery at Yahoo
Nilam Sharma – Sr. Software Engineer
Huibing Yin – Sr. Software Engineer
This talk presents the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport and aggregated delivery as a service. Data Highway supports collection from multiple data centers & aggregated delivery in primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm and Kafka; with Storm & Kafka endpoints tailored towards latency sensitive consumers.
DAY 2. WEDNESDAY June 14, 2017
9:05 – 9:15 A.M. Yahoo General Session – Shaping Data Platform for Lasting Value
Sumeet Singh – Sr. Director, Products
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.
12:20 – 1:00 P.M. CaffeOnSpark Update – Recent Enhancements and Use Cases
Mridul Jain – Sr. Principal Engineer
Jun Shi – Principal Engineer
By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update audiences about the recent development of CaffeOnSpark. We will highlight new features and capabilities: a unified data layer that supports multi-label datasets, distributed LSTM training, interleaved testing with training, a monitoring/profiling framework, and Docker deployment.
12:20 – 1:00 P.M. Tez Shuffle Handler – Shuffling at Scale with Apache Hadoop
Jon Eagles – Principal Engineer
Kuhu Shukla – Software Engineer
In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch, which has support for multi-partition fetch to mitigate performance slowdown, and provides deletion APIs to reduce disk usage for long-running Tez sessions. As an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real-world jobs at scale.
2:10 – 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes
Thiruvel Thirumoolan – Principal Engineer
Francis Liu – Sr. Principal Engineer
At Yahoo!, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (i.e., batch, near real-time, ad hoc, etc.). We will walk through the multi-tenancy features, explaining our motivation and how they work, as well as our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.
2:10 – 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Nick Huang – Director, Data Engineering, Yahoo Mail
Saurabh Dixit – Sr. Principal Engineer, Yahoo Mail
Since 2014, the Yahoo Mail Data Engineering team has taken on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the way data plays in Yahoo Mail. In this session we will share our experience from this 3-year journey, from the system architecture and the analytics systems built, to the lessons learned from development and the drive for adoption.
DAY3. THURSDAY June 15, 2017
2:10 – 2:50 P.M. OracleStore – A Highly Performant RawStore Implementation for Hive Metastore
Chris Drome – Sr. Principal Engineer
Jin Sun – Principal Engineer
Today, Yahoo uses Hive in many different spaces, from ETL pipelines to ad hoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. In order for Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof-of-concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. Then, we implemented OracleStore with specific goals built-in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains that were realized with this alternative implementation. These include a reduction of 97%+ in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL.
3:00 P.M. – 3:40 P.M. Bullet – A Real Time Data Query Engine
Akshai Sarma – Sr. Software Engineer
Michael Natkovich – Director, Engineering
Bullet is an open-sourced, lightweight, pluggable querying system for streaming data, implemented on top of Storm, with no persistence layer. It allows you to filter, project, and aggregate on data in transit. It includes a UI and web service (WS). Instead of running queries on a finite set of data that arrived and was persisted, or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.
3:00 P.M. – 3:40 P.M. Yahoo – Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez
Rohini Palaniswamy – Sr. Principal Engineer
Last year at Yahoo, we put great effort into scaling and stabilizing Pig on Tez and making it production ready, and by the end of the year we had retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen, such as the custom YARN Shuffle Handler, reworked DAG scheduling order, serialization changes, etc. We will also cover exciting new features that were added to Pig for performance, such as bloom join and byte code generation.
4:10 P.M. – 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning
Evans Ye, Software Engineer
Apache Bigtop, as an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build up their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together as well. In this presentation, we’ll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. The core innovations here are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. The content of this talk includes the containerized CI framework, technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.
In November 2015 we announced that the Raspberry Pi Foundation was joining forces with Code Club to give more young people the opportunity to learn how to make things with computers. In the 18 months since we made that announcement, we have more than doubled the number of Code Clubs. Over 10,000 clubs are now active, in communities all over the world.
Children at a Code Club in Australia
The UK is where the movement started, and there are now an amazing 5750 Code Clubs engaging over 85,000 young people in the UK each week. The rest of the world is catching up rapidly. With the help of our regional partners, there are over 4000 clubs outside the UK, and fast-growing Code Club communities in Australia, Bangladesh, Brazil, Canada, Croatia, France, Hong Kong, New Zealand, and Ukraine. This year we have already launched new partnerships in Spain and South Korea, with more to come.
It’s fantastic to see the movement growing so quickly, and it’s all due to the amazing community of volunteers, teachers, parents, and young people who make everything possible. Thank you all!
Today, we are announcing the next stage of Code Club’s evolution. Drum roll, please…
Starting in September, we are extending Code Club to 9- to 13-year-olds.
Students at a Code Club in Brazil
Those in the know will remember that Code Club has, until now, been focused on 9- to 11-year-olds. So why the change?
Put simply: demand. There is a huge demand from young people for more opportunities to learn about computing generally, and for Code Club specifically. The first generations of Code Club graduates have moved on to more senior schools, and they’re telling us that they just don’t have the opportunities they need to learn more about digital making. We’ve decided to take up the challenge.
For the UK, this means that schools will be supported to set up Code Clubs for Years 7 and 8. Non-school venues, like libraries, will be able to offer their clubs to a wider age group.
Growing Code Club International
Code Club is a global movement, and we will be working with our regional partners to make sure that it is available to 9- to 13-year-olds in every community in the world. That includes accelerating the work to translate club materials into even more languages.
A Code Club volunteer and students in Brazil
As part of the change, we will be expanding our curriculum and free educational resources to cater for older children and more experienced coders. Like all our educational resources, the new materials will be created by qualified and experienced educators. They will be designed to help young people build a wide range of skills and competencies, including teamwork, problem-solving, and creativity.
Our first step towards supporting a wider age range is a pilot programme, launching today, with 50 secondary schools in the UK. Over the next few months, we will be working closely with them to find out the best ways to make the programme work for older kids.
Supporting Code Club
For now, you can help us spread the word. If you know a school, youth club, library, or similar venue that could host a club for young people aged 9 to 13, then encourage them to get involved.
Lastly, I want to say a massive “thank you!” to all the organisations and individuals that support Code Club financially. We care passionately about Code Club being free for every child to attend. That’s only possible because of the generous donations and grants that we receive from so many companies, foundations, and people who share our mission to put the power of digital making into the hands of people all over the world.
By James Penick, Cloud Architect & Gurpreet Kaur, Product Manager
A version of this byline was originally written for and appears in CIO Review.
A successful private cloud presents a consistent and reliable facade over the complexities of hyperscale infrastructure. It must simultaneously handle constant organic traffic growth, unanticipated spikes, a multitude of hardware vendors, and discordant customer demands. The depth of this complexity only increases with the age of the business, leaving a private cloud operator saddled with legacy hardware, old network infrastructure, customers dependent on legacy operating systems, and the list goes on. These are the foundations of the horror stories told by grizzled operators around the campfire.
Providing a plethora of services globally for over a billion active users requires a hyperscale infrastructure. Yahoo’s on-premises infrastructure comprises datacenters housing hundreds of thousands of physical and virtual compute resources globally, connected via a multi-terabit network backbone. As one of the very first hyperscale internet companies in the world, Yahoo’s infrastructure had grown organically – things were built, and rebuilt, as the company learned and grew. The resulting web of modern and legacy infrastructure became progressively more difficult to manage. Initial attempts to manage this via IaaS (Infrastructure-as-a-Service) taught some hard lessons. However, those lessons served us well when OpenStack was selected to manage Yahoo’s datacenters; some of them are shared below.
Centralized team offering Infrastructure-as-a-Service
Chief amongst the lessons learned prior to OpenStack was that IaaS must be presented as a core service to the whole organization by a dedicated team. An a-la-carte-IaaS, where each user is expected to manage their own control plane and inventory, just isn’t sustainable at scale. Multiple teams tackling the same challenges involved in the curation of software, deployment, upkeep, and security within an organization is not just a duplication of effort; it removes the opportunity for improved synergy with all levels of the business. The first OpenStack cluster, with a centralized dedicated developer and service engineering team, went live in June 2012. This model has served us well and has been a crucial piece of making OpenStack succeed at Yahoo. One of the biggest advantages to a centralized, core team is the ability to collaborate with the foundational teams upon which any business is built: Supply chain, Datacenter Site-Operations, Finance, and finally our customers, the engineering teams. Building a close relationship with these vital parts of the business provides the ability to streamline the process of scaling inventory and presenting on-demand infrastructure to the company.
Developers love instant access to compute resources
Our developer productivity clusters, named “OpenHouse,” were a huge hit. Ideation and experimentation are core to developers’ DNA at Yahoo. It empowers our engineers to innovate, prototype, develop, and quickly iterate on ideas. No longer is a developer reliant on a static and costly development machine under their desk. OpenHouse enables developer agility and cost savings by obviating the desktop.
Dynamic infrastructure empowers agile products
From a humble beginning of a single, small OpenStack cluster, Yahoo’s OpenStack footprint is growing beyond 100,000 VM instances globally, with our single largest virtual machine cluster running over a thousand compute nodes, without using Nova Cells.
Until this point, Yahoo’s production footprint was nearly 100% focused on baremetal – a part of the business that one cannot simply ignore. In 2013, Yahoo OpenStack Baremetal began to manage all new compute deployments. Interestingly, after moving to a common API to provision baremetal and virtual machines, there was a marked increase in demand for virtual machines.
Developers across all major business units ranging from Yahoo Mail, Video, News, Finance, Sports and many more, were thrilled with getting instant access to compute resources to hit the ground running on their projects. Today, the OpenStack team is continuing to fully migrate the business to OpenStack-managed. Our baremetal footprint is well beyond that of our VMs, with over 100,000 baremetal instances provisioned by OpenStack Nova via Ironic.
How did Yahoo hit this scale?
Scaling OpenStack begins with understanding how its various components work and how they communicate with one another. This topic can be very deep and for the sake of brevity, we’ll hit the high points.
1. Start at the bottom and think about the underlying hardware
Do not overlook the unique resource constraints of the services that power your cloud, nor the fashion in which those services will be used. Leverage that understanding to drive hardware selection. For example, when one examines the role of the database server in an OpenStack cluster and considers the multitudinous calls to the database (compute node heartbeats, instance state changes, normal user operations, and so on), it becomes clear that this core component is extremely busy in even a modest-sized Nova cluster and needs adequate computational resources to perform. Yet many deployers skimp on the hardware, and the performance of the whole cluster ends up bottlenecked by database I/O. By thinking ahead, you can save yourself a lot of heartburn later on.
2. Think about how things communicate
Our cluster databases are configured to be multi-master single-writer with automated failover. Control plane services have been modified to split DB reads directly to the read slaves and only write to the write-master. This distributes load across the database servers.
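To make the pattern concrete, here is a minimal sketch of that read/write split using SQLAlchemy’s routing-session recipe. This is purely illustrative, not Yahoo’s actual code; the hostnames, credentials, and schema name are placeholders.

```python
# Illustrative read/write splitting: writes go to the single write-master,
# reads are spread across replicas (hostnames and credentials are placeholders).
import random

from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

write_engine = create_engine("mysql+pymysql://nova:secret@db-master/nova")
read_engines = [
    create_engine("mysql+pymysql://nova:secret@db-replica-1/nova"),
    create_engine("mysql+pymysql://nova:secret@db-replica-2/nova"),
]

class RoutingSession(Session):
    """Route flushes (writes) to the master and everything else to a replica."""
    def get_bind(self, mapper=None, clause=None, **kwargs):
        if self._flushing:
            return write_engine
        return random.choice(read_engines)

SessionFactory = sessionmaker(class_=RoutingSession)
```

The same split can also be done below the application, for example in a SQL proxy, but doing it in the service keeps the routing decision close to the code that knows whether a given query can tolerate replica lag.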
3. Scale wide
OpenStack has many small horizontally-scalable components which can peacefully cohabitate on the same machines: the Nova, Keystone, and Glance APIs, for example. Stripe these across several small or modestly sized machines. Some services, such as the Nova scheduler, run the risk of race conditions when running multi-active. If the risk of race conditions is unacceptable, use ZooKeeper to manage leader election.
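For the leader-election case, a minimal sketch using the kazoo ZooKeeper client looks like the following. This is our own illustration, not Yahoo’s code; the ZooKeeper hosts, election path, and identifier are placeholders.

```python
# Illustrative leader election with ZooKeeper via kazoo: only the elected
# leader runs the work loop; the other candidates block and take over on failure.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # placeholder hosts
zk.start()

def run_active_scheduler():
    # Reached only while this process holds leadership.
    print("Elected leader: starting the scheduler work loop")

election = zk.Election("/demo/scheduler-election", identifier="scheduler-host-1")
election.run(run_active_scheduler)  # blocks; rejoins the election if leadership is lost
```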
4. Remove dependencies
In a Yahoo datacenter, DHCP is only used to provision baremetal servers. Because we statically declare IPs in our instances via cloud-init, our infrastructure is less prone to outages caused by failures in the DHCP infrastructure.
5. Don’t be afraid to replace things
Neutron uses Dnsmasq to provide DHCP services; however, Dnsmasq was not designed for the complexity or scale of a dynamic environment. For example, Dnsmasq must be restarted for any configuration change, such as when a new host is being provisioned. In the Yahoo OpenStack clusters it has been replaced by ISC DHCPD, which scales far better than Dnsmasq and allows dynamic configuration updates via an API.
6. Or split them apart
Some of the core imaging services provided by Ironic, such as DHCP, TFTP, and HTTPS, communicate with a host during the provisioning process. These services are normally part of the Ironic Conductor (IC) service. In our environment we split them out into a new, physically distinct service called the Ironic Transport Service (ITS). This brings value in three ways:
Security: Splitting the ITS from the IC allows us to block all network traffic from production compute nodes to the IC and other parts of our control plane. If a malicious entity attacks a node serving production traffic, they cannot escalate from it to our control plane.
Scale: The ITS hosts allow us to horizontally scale the core provisioning services with which nodes communicate.
Flexibility: The ITS allows Yahoo to manage remote sites, such as peering points, without building a whole cluster in each site. Resources in those sites can now be managed by the nearest Yahoo owned & operated (O&O) datacenter.
Be prepared for faulty hardware!
Running IaaS reliably at hyperscale is more than just scaling the control plane; one must take a holistic look at the entire system. In fact, when examining provisioning failures, our engineers determined that the root cause in the majority of cases was faulty hardware. For example, there are a number of machines from various vendors whose IPMI firmware fails from time to time, leaving the host inaccessible to remote power management. Some fail within minutes of installation, others only after weeks. These failures occur across many different models, many generations, and many hardware vendors. Exposing these failures to users would create a very negative experience, so the cloud must be built to tolerate this complexity.
Focus on the end state
Yahoo’s experience shows that one can run OpenStack at hyperscale, leveraging it to wrap infrastructure and remove perceived complexity. Correctly leveraged, OpenStack presents an easy, consistent, and error-free interface. Delivering this interface is core to our design philosophy as Yahoo continues to double down on its OpenStack investment. The Yahoo OpenStack team looks forward to continuing to collaborate with the OpenStack community, sharing feedback and code.
Two days ago, the White House released a report on privacy: “Privacy in our Digital Lives: Protecting Individuals and Promoting Innovation.” The report summarizes things the administration has done, and lists future challenges:
Areas for Further Attention
Technology will pose new consumer privacy and security challenges.
Emerging technology may simultaneously create new challenges and opportunities for law enforcement and national security.
The digital economy is making privacy a global value.
Consumers’ voices are being heard — and must continue to be heard — in the regulatory process.
The Federal Government benefits from hiring more privacy professionals.
Transparency is vital for earning and retaining public trust.
Privacy is a bipartisan issue.
I especially like the framing of privacy as a right. From President Obama’s introduction:
Privacy is more than just, as Justice Brandeis famously proclaimed, the “right to be let alone.” It is the right to have our most personal information be kept safe by others we trust. It is the right to communicate freely and to do so without fear. It is the right to associate freely with others, regardless of the medium. In an age where so many of our thoughts, words, and movements are digitally recorded, privacy cannot simply be an abstract concept in our lives; privacy must be an embedded value.
For the past 240 years, the core of our democracy — the values that have helped propel the United States of America — have remained largely the same. We are still a people founded on the beliefs of equality and economic prosperity for all. The fierce independence that encouraged us to break from an oppressive king is the same independence found in young women and men across the country who strive to make their own path in this world and create a life unique unto themselves. So long as that independence is encouraged, so long as it is fostered by the ability to transcend past data points and by the ability to speak and create free from intrusion, the United States will continue to lead the world. Privacy is necessary to our economy, free expression, and the digital free flow of data because it is fundamental to ourselves.
Privacy, as a right that has been enjoyed by past generations, must be protected in our digital ecosystem so that future generations are given the same freedoms to engage, explore, and create the future we all seek.
I know; rhetoric is easy, policy is hard. But we can’t change policy without a changed rhetoric.
EDITED TO ADD: The document was originally on the whitehouse.gov website, but was deleted in the Trump transition.
Way back in April of 2014 we launched the original Compute Module (CM1) which was based around the BCM2835 processor of the original Raspberry Pi. CM1 was a great success and we’ve seen a lot of uptake from various markets, particularly in IoT and home and factory automation. Not to be outdone by its bigger Raspberry Pi brother, the Compute Module is also destined for space!
Compute Module 3
Since releasing the original Compute Module we’ve launched two further generations of much faster Raspberry Pi boards, so today we bring you the shiny new Compute Module 3 (CM3), which is based on the Raspberry Pi 3 hardware and provides twice the RAM and roughly 10x the CPU performance of the original module. We’ve been talking about the Compute Module 3 since the launch of the Raspberry Pi 3, and we’re already excited to see early adopter NEC launching its CM3-enabled display solution.
The idea of the Compute Module was to provide an easy and cost effective route to producing customised products based on the Pi hardware and software platform. The thought was to provide the ‘team in a garage’ with easy access to the same technology as the big guys. The module takes care of the complexity of routing out the processor pins, the high speed RAM interface and core power supply and allows a simple carrier board to provide just what is needed in terms of external interfaces and form factor. The module uses a standard DDR2 SODIMM form factor, sockets for which are made by several manufacturers and are easily available and inexpensive.
In fact, today we are launching two versions of Compute Module 3. The first is the ‘standard’ CM3, which has a BCM2837 processor at up to 1.2GHz with 1GB of RAM (the same as the Pi 3) and 4GB of on-module eMMC flash. The second version is what we are calling ‘Compute Module 3 Lite’ (CM3L), which has the same BCM2837 and 1GB of RAM but brings the SD card interface out to the module pins so a user can wire it up to an eMMC or SD card of their choice.
Back side of CM3 (left) and CM3L (right).
We are also releasing an updated version of our get-you-started breakout board, the Compute Module IO Board V3 (CMIO3). This board provides the necessary power to the module, lets you program the module’s flash memory (for the non-Lite versions) or use an SD card (Lite versions), exposes the processor interfaces in a slightly friendlier fashion (pin headers and flexi connectors, much like the Pi), and provides the HDMI and USB connectors needed for an entire system that can boot Raspbian (or the OS of your choice). This board provides both a starting template for those who want to design with the Compute Module and a quick way to start experimenting with the hardware, building and testing a system, before going to the expense of fabricating a custom board. The CMIO3 can accept an original Compute Module, CM3, or CM3L.
With the launch of CM3 and CM3 Lite we are not obsoleting the original Compute Module; we still see it as a valid product in its own right, being a lower-cost and lower-power option where the performance of a CM3 would be overkill.
CM3 and CM3L are priced at $30 and $25 respectively (excluding tax and shipping) and this price applies to any size order. The original Compute Module is also reduced to $25. Our partners RS and Premier Farnell are also providing full development kits which include all you need to get started designing with the Compute Module 3.
The CM3 is largely backwards compatible with CM1 designs that have followed our design guidelines. The caveats are that the module is 1mm taller than the original, and that the processor core supply (VBAT) can draw significantly more current; consequently the processor itself will run much hotter under heavy CPU load, so designers need to consider thermals based on expected use cases.
CM3 (left) is 1mm taller than CM1 (right)
We’re very glad to finally be launching the Compute Module 3, and we’re excited to see what people do with it. Head on over to our partners element14 and RS Components to buy yours today!
This year’s Consumer Electronics Show (CES) just wrapped up in Las Vegas. The usual parade of cool tech toys created a lot of headlines this year, but there were some genuine trends to keep an eye on too. If you’re like us, you’re probably one of the first people around to adopt promising new technologies when they emerge. As early adopters we can sometimes lose sight of the forest for the trees when it comes to understanding what this means for everyone else, so we’re going to look at it through that prism.
2017 promises to be a big year for voice-activated “smart home” devices. The final landscape is still to be determined; all the expected players are in the game right now: Amazon, Apple, Google, Microsoft, and even some smaller players.
Amazon deserves props after a holiday season that saw its Echo and Echo Dot devices in high demand. The company has published an API, and Alexa is picking up plenty of support from third-party manufacturers. Amazon’s ambitions for Alexa clearly reach far beyond the Echo.
Electronics giant LG is building Alexa into a line of robots designed for domestic duties and a refrigerator that also sports interior fridge cams, for example. Ford is integrating Alexa support into its Sync 3 automotive interface. Televisions, lighting devices, and home security products are among the many devices to feature Alexa integration.
Alexa is the new hotness, but the real trend here is voice-assisted connectivity around the home. Even if Alexa runs out of steam, this tech is here to stay. The Internet of Things and voice-activated interfaces are converging quickly, though they haven’t fully met yet; it’s tantalizingly close. For now it’s still a niche, and it will stay one for as long as consumers have to piece different things together to get it all to work. That means there’s still room for disruption.
There’s especially ripe opportunity in underserved verticals. Take the home health market, for example: Natural language interfaces have huge implications for elderly and disabled care and assistance. Finding and developing solutions for those sorts of vertical markets is an awesome opportunity for the right players.
Of course, with great power comes great responsibility. The family of a six-year-old recently got stuck with a $160 bill after she told Alexa to order her cookies and a dollhouse. The family ended up donating the accidental order to charity. For what it’s worth, that problem can be avoided by activating a confirmation code feature in the Alexa software.
The Electric Vehicle (EV) Market Heats Up
One of the trickiest things to do at CES is separate hype from substance. Nowhere was that more apparent last week than at the unveiling of Faraday Future’s FF91, a new Electric Vehicle (EV) positioned to go toe-to-toe with Tesla’s EV fleet.
The FF91 can purportedly go 378 miles on a single charge and also possesses autonomous driving capabilities (although its vaunted self-parking abilities didn’t demo as well as planned). When, or if, it will make it into production is still an open question. Faraday Future says it will ship next year, assuming the company can get past the production and manufacturing woes that have plagued it up until now.
While new vehicles and vehicle concepts are still largely the domain of auto shows, some auto manufacturers used CES to float new concepts ahead of the Detroit Auto Show, which happens this week. Toyota, for example, showed off its Concept-i, a car with artificial intelligence and natural language processing (like Siri or Alexa) designed to learn from you and adapt.
As we mentioned, Alexa is integrated into Ford’s Sync 3 platform, too. You can already buy new cars with CarPlay and Android Auto, which make it a lot easier to stay connected, get directions, and entertain yourself on the morning commute simply by talking to your car instead of touching buttons. That’s a smart user interface change, but it’s still a potentially dangerous distraction for the driver. For this technology to succeed, it’s imperative that natural language interface designers make the experience as frictionless as possible.
Chrysler is making a play for future millennial families. We’re not making this up; the company used “millennial” to describe the target market several times. The Portal concept is an electric minivan of sorts that’s chock-full of buzzwords: facial recognition, Wi-Fi, media sharing, ten charging ports, semi-autonomous driving abilities, and more.
2017 marks a pivot for carmakers in this respect. For years the conventional wisdom was that millennials were a lost cause for automakers; Uber and Zipcar were all they needed. It turns out that was totally wrong. Economic pressures and diverse lifestyles may have delayed millennials’ trek toward auto ownership, but they’re turning out now in big numbers to buy wheels. Millennial families will need transportation just like the generations before them, going back to the station wagon era, which is why Chrysler says this “fifth-generation” family car will go into production sometime after 2018.
Volkswagen showed off its new I.D. concept car, a Golf-looking EV that also has all the requisite buzzwords. Speaking of buzzwords, what really excited us was the I.D. Buzz. This new EV resurrects the styling of the Hippy-era Microbus, with mood lighting, autonomous driving capabilities and a retractable steering wheel.
Rumors have persisted for years that VW was on the cusp of introducing a refreshed Microbus, but they have never come to pass. And unfortunately, VW has no concrete plans to actually produce this one either; it seems to be a marketing effort to draw on nostalgic Boomer appeal more than anything.
Both Buzz and Chrysler’s Portal do give us some insight about where auto makers are going when it comes to future generations of minivans: Electric, autonomous, customizable and more social than ever. If we are headed towards a future where vehicles drive themselves, family transportation will look very different than it is today.
Laptops At Both Extremes
CES saw the rollout of several new PC laptop models and concepts that will be hitting store shelves over the next several months.
Gamers looking for more real estate – a lot more real estate – were interested in Razer’s latest concept, Project Valerie. The laptop sports not one but three 4K displays, which fold out on hinges. That’s roughly 12K pixels of horizontal image space, mated to an Nvidia GeForce GTX 1080 graphics processor. A unibody aluminum chassis keeps it relatively thin (1.5 inches) when closed, but the entire rig weighs more than 12 pounds. Razer doesn’t have any immediate production plans, and to add insult to injury, its prototypes were reportedly stolen before the end of the show.
Unlike Razer, Acer has production plans – immediate plans – for its gargantuan 21-inch Predator 21X laptop, priced at $8,999 and headed to store shelves next month. It was announced last year, but Acer finally offered launch details last week. A 17-inch model is also coming soon.
Big gaming laptops make for pretty pictures and certainly have their place in the PC ecosystem, but they’re niche devices. After a ramp up on 2-in-1s and low-powered laptops, Intel’s Kaby Lake processors are finally ready for the premium and mid-range laptop market. Kaby Lake efficiency improvements are helping PC makers build thinner and lighter laptops with better battery life, 4K video processing, faster solid state storage and more.
HP, Asus, MSI, Dell (and its gaming arm Alienware) were among the many companies with sleek new Kaby Lake-equipped models.
Gaming in the cloud with Nvidia
Nvidia, makers of premium graphics processors, offers GeForce Now cloud gaming to users of its Shield, an Android-based gaming handheld. That service is expanding to Windows and Mac in March.
Gaming as a Service, if you will, isn’t a new idea. OnLive pioneered the concept more than a decade ago. Gaikai followed, then was acquired by Sony in 2012. Nvidia’s had limited success with GeForce Now, but it’s been a single-platform offering up until now.
Nvidia has robust data centers to handle the processing and traffic, so best of luck to them as they scale up to meet demand. Gaming is very sensitive to network disruption – no gamer appreciates lag – so it’ll be interesting to see how GeForce Now scales to accommodate the new devices.
Mesh networking delivers more consistent, stronger network reception and performance than a conventional Wi-Fi router. Some of us have set up routers and extenders to fix dead spots; mesh networking works differently, using smarter traffic routing and better radio management across multiple base stations.
Eero, Ubiquiti, and even Google (with Google Wifi) are already offering mesh networking products, and this market segment looks to expand big in 2017. Netgear, Linksys, Asus, TP-Link and others are among those with new mesh networking setups. Mesh networking gear is still hampered by a higher price than plain old routers. That means the value isn’t there for some of us who have networking gear that gets the job done, even with shortcomings like dead zones or slow zones. But prices are coming down fast as more companies get into the market. If you have an 802.11ac router you’re happy with, stick with it for now, and move to a mesh networking setup for your next Wi-Fi upgrade.
Getting Your Feet Into VR
Our award for wackiest CES product has to go to the Cerevo Taclim: tactile-feedback shoes and wireless hand controllers that help you “feel” the surface you’re walking on – crunching snow underfoot, splashing through water. At an expected $1,000-$1,500 a pop, these probably won’t be next year’s Hatchimals, but it’s fun to imagine what game devs can do with the technology. Strap these to your feet, then break out your best Hadouken in Street Fighter VR!
CES isn’t the real world. Only a fraction of what’s shown off ever sees the light of day, but it’s always interesting to see the trend-focused consumer electronics market shift and change from year to year. At the end of the year we hope to look back and see how much of this stuff ended up resonating with the actual consumer the show is named for.
2016 is safely in our rear-view mirrors. It’s time to take a look back at the year that was and see what technology had the biggest impact on consumers and businesses alike. We also have an eye to 2017 to see what the future holds.
AI and machine learning in the cloud
Truly sentient computers and robots are still the stuff of science fiction (and the premise of one of 2016’s most promising new SF TV series, HBO’s Westworld). Neural networks are nothing new, but 2016 saw huge strides in artificial intelligence and machine learning, especially in the cloud.
Google, Amazon, Apple, IBM, Microsoft and others are developing cloud computing infrastructures designed especially for AI work. It’s this technology that’s underpinning advances in image recognition technology, pattern recognition in cybersecurity, speech recognition, natural language interpretation and other advances.
Microsoft’s newly formed AI and Research Group is finding ways to get artificial intelligence into Microsoft products like its Bing search engine and Cortana natural language assistant. Some of these efforts, while well-meaning, still need refinement: early in 2016 Microsoft launched Tay, an AI chatbot designed to mimic the natural language characteristics of a teenage girl and learn from interacting with Twitter users. Microsoft had to shut Tay down after Twitter users exploited vulnerabilities that caused Tay to begin spewing wildly inappropriate responses. Even so, such efforts pave the way for future products that blur the line between man and machine.
Finance, energy, climatology – anywhere you find big data sets you’re going to find uses for machine learning. On the consumer end it can help your grocery app guess what you might want or need based on your spending habits. Financial firms use machine learning to help predict customer credit scores by analyzing profile information. One of the most intriguing uses of machine learning is in security: Pattern recognition helps systems predict malicious intent and figure out where exploits will come from.
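As a toy illustration of that last point (our own sketch, not any particular vendor’s product), an anomaly detector can be trained on features of normal activity and then asked to flag outliers:

```python
# Toy anomaly detection with scikit-learn; the feature values are invented.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [failed_logins_last_hour, distinct_source_ips, megabytes_uploaded]
normal_activity = np.array([
    [0, 1, 2], [1, 1, 5], [0, 1, 1], [2, 1, 3], [0, 2, 4], [1, 1, 2],
])
model = IsolationForest(contamination=0.1, random_state=0).fit(normal_activity)

suspicious = np.array([[40, 12, 900]])  # brute-force attempts plus bulk upload
print(model.predict(suspicious))        # -1 means "looks anomalous"
```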
Meanwhile we’re still waiting for Rosie the Robot from the Jetsons. And flying cars. So if Elon Musk has any spare time in 2017, maybe he can get on that.
Augmented Reality (AR) games have been around for a good long time – ever since smartphone makers put cameras on them, game makers have been toying with the mix of real life and games.
AR games took a giant step forward with a game released in 2016 that you couldn’t get away from, at least for a little while. We’re talking about Pokémon GO, of course. Niantic, makers of another AR game called Ingress, used the framework they built for that game to power Pokémon GO. Kids, parents, young, old, it seemed like everyone with an iPhone that could run the game caught wild Pokémon, hatched eggs by walking, and battled each other in Pokémon gyms.
For a few weeks, anyway.
Technical glitches, problems with scale and limited gameplay value ultimately hurt Pokémon GO’s longevity. Today the game only garners a fraction of the public interest it did at peak. It continues to be successful, albeit not at the stratospheric pace it first set.
Niantic was able to tie together several factors to bring such an explosive and, if you’ll pardon the overused buzzword, disruptive game to bear. Chief among them was its previous work on Ingress, which uses the same geomap data as Pokémon GO, so Niantic had already done a huge amount of the legwork needed to get the new game up and running. The company cleverly used Google Maps data to form the basis of both games, relying on already-identified public landmarks and other locations tagged by Ingress players (Ingress has been around since 2011).
Then, of course, there’s the Pokémon connection: an intensely beloved gaming property that’s been popular with generations of video game players and cartoon watchers since the 1990s. The dearth of Pokémon-branded games on smartphones meant an instant explosion of popularity upon Pokémon GO’s release.
2016 also saw the introduction of several new virtual reality (VR) headsets designed for home and mobile use. Samsung Gear VR and Google Daydream View made a splash. As these products continue to make consumer inroads, we’ll see more games push the envelope of what you can achieve with VR and AR.
Hybrid Cloud services combine public cloud storage (like B2 Cloud Storage) or public compute (like Amazon Web Services) with a private cloud platform. Specialized content and file management software glues it all together, making the experience seamless for the user.
Businesses get the instant access and speed they need to get work done, with the ability to fall back on on-demand cloud-based resources when scale is needed. B2’s hybrid cloud integrations include OpenIO, which helps businesses maintain data storage on-premise until it’s designated for archive and stored in the B2 cloud.
The cost of entry and usage of Hybrid Cloud services has continued to fall. For example, small and medium-sized organizations in the post-production industry are finding that Hybrid Cloud storage is now a viable strategy for managing the large amounts of data they use on a daily basis. This strategy is enabled by the low cost of B2 Cloud Storage, which provides ready access to cloud-stored data.
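To make the archive step concrete, here is a minimal sketch of pushing a finished asset from on-premise storage into a B2 bucket with Backblaze’s b2sdk Python library. The application key, bucket name, and file paths are placeholders, and a real integration (such as the OpenIO one mentioned above) would handle this step automatically.

```python
# Illustrative archive-to-B2 step; the application key, bucket name, and
# file paths below are placeholders.
from b2sdk.v2 import InMemoryAccountInfo, B2Api

b2_api = B2Api(InMemoryAccountInfo())
b2_api.authorize_account("production", "YOUR_APP_KEY_ID", "YOUR_APP_KEY")

bucket = b2_api.get_bucket_by_name("postproduction-archive")
bucket.upload_local_file(
    local_file="/mnt/san/projects/spot-2017/final_cut.mov",
    file_name="spot-2017/final_cut.mov",
)
```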
There are practical deployment and scale issues that have kept Hybrid Cloud services from being widely used in the largest enterprise environments. Small to medium businesses and vertical markets like Media & Entertainment have found promising, economical opportunities to use it, which bodes well for the future.
Inexpensive 3D printers
3D printing, once a rarified technology, has become increasingly commoditized over the past several years. That’s been due in part to the “Maker Movement”: thousands of folks all around the world who love to tinker and build. XYZprinting is out in front of makers and others with its line of inexpensive desktop da Vinci printers.
The da Vinci Mini is a tabletop model aimed at home users which starts at under $300. You can download and tweak thousands of 3D models to build toys, games, art projects and educational items. They’re built using spools of biodegradable, non-toxic plastics derived from corn starch which dispense sort of like the bobbin on a sewing machine. The da Vinci Mini works with Macs and PCs and can connect via USB or Wi-Fi.
Quadcopter drones have been fun tech toys for a while now, but the new trend we saw in 2016 was “do it yourself” models. One result was Flybrix, which combines lightweight drone motors with LEGO building toys. Flybrix was so successful that it sold out of inventory for the 2016 holiday season and is backlogged with orders into the new year.
Each Flybrix kit comes with the motors, LEGO building blocks, cables and gear you need to build your own quad, hex or octocopter drone (as well as a cheerful-looking LEGO pilot to command the new vessel). A downloadable app for iOS or Android lets you control your creation. A deluxe kit includes a handheld controller so you don’t have to tie up your phone.
If you already own a 3D printer like the da Vinci Mini, you’ll find plenty of model files available for download and modification so you can print your own parts, though you’ll probably need help from one of the many maker sites to figure out what else you’ll need for aerial flight and control.
5D Glass Storage
Research at the University of Southampton may yield the next big leap in optical storage technology meant for long-term archival. The boffins at the Optoelectronics Research Centre have developed a new data storage technique that embeds information in glass “nanostructures” on a storage disc the size of a U.S. quarter.
A Blu-ray Disc can hold 50 GB, but one of the new 5D glass storage discs, only the size of a U.S. quarter, can hold 360 TB: 7,200 times more. It’s like a super-stable, supercharged version of a CD. Not only is the data inscribed on much smaller structures within the glass, it is also encoded at multiple angles, hence “5D.”
An upside to this is an absence of bit rot: The glass medium is extremely stable, with a shelf life predicted in billions of years. The downside is that this is still a write-once medium, so it’s intended for long term storage.
This tech is still years away from practical use, but it took a big step forward in 2016 when the University announced the development of a practical information encoding scheme to use with it.
Smart Home Tech
Are you ready to talk to your house to tell it to do things? If you’re not already, you probably will be soon. Google’s Google Home is a $129 voice-activated speaker powered by the Google Assistant. You can use it for everything from streaming music and video to a nearby TV to reading your calendar or to-do list. You can also tell it to operate other supported devices like the Nest smart thermostat and Philips Hue lights.
Amazon has its own similar wireless speaker product called the Echo, powered by Amazon’s Alexa information assistant. Amazon has differentiated its Echo offerings by making the Dot – a hockey puck-sized device that connects to a speaker you already own. So Amazon customers can begin to outfit their connected homes for less than $50.
Apple’s HomeKit isn’t a speaker like Amazon Echo or Google Home; it’s a software framework. You use the Home app on your iOS 10-equipped iPhone or iPad to connect and configure supported devices, and you control them with Siri, Apple’s own intelligent assistant, on any supported Apple device. HomeKit turns on lights, turns up the thermostat, operates switches, and more.
Smart home tech has been coming in fits and starts for a while – the Nest smart thermostat is already in its third generation, for example. But 2016 was the year we finally saw the “Internet of things” coalescing into a smart home that we can control through voice and gestures in a … well, smart way.
Welcome To The Future
It’s 2017, welcome to our brave new world. While it’s anyone’s guess what the future holds, there are at least a few tech trends that are pretty safe to bet on. They include:
Internet of Things: More smart connected devices are coming online in the home and at work every day, and this trend will accelerate in 2017, with more and more devices requiring some form of Internet connectivity to work. Expect to see a lot more appliances, devices, and accessories that make use of the APIs promoted by Google, Amazon, and Apple to let you control everything in your life using just your voice and a smart speaker setup.
Blockchain security: Blockchain is the digital ledger technology that makes Bitcoin work. Its distributed design and validation system make it possible to verify that no one has tampered with the records, which makes it well suited to applications beyond cryptocurrency, like making sure your smart thermostat (see above) hasn’t been hacked (there is a tiny sketch of the idea after this list). Expect 2017 to be the year we see more mainstream acceptance, use, and development of blockchain technology from financial institutions, the creation of new private blockchain networks, and improved usability aimed at making blockchain easier for regular consumers to use. Blockchain-based voting is here too. Given all this movement, it also wouldn’t surprise us to see government regulators take a much deeper interest in blockchain.
5G: Verizon is field-testing 5G on its wireless network, which it says will deliver speeds 30-50 times faster than 4G LTE. We’re still a few years away from wide-scale 5G deployment, but expect to hear a lot more about it from Verizon and other wireless players in 2017.
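As promised above, here is a tiny, deliberately simplified sketch of the hash-chaining idea behind a tamper-evident ledger. Real blockchains add consensus, signatures, and much more; this just shows why editing an old record is detectable.

```python
# Each record commits to the hash of the previous one, so changing any earlier
# record breaks every later link and verification fails.
import hashlib
import json

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain, prev = [], "0" * 64  # genesis: no previous block
for reading in [{"thermostat": 68}, {"thermostat": 70}, {"thermostat": 69}]:
    block = {"data": reading, "prev_hash": prev}
    prev = block_hash(block)
    chain.append(block)

def verify(chain) -> bool:
    prev = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev:
            return False  # a link was altered or replaced
        prev = block_hash(block)
    return True

print(verify(chain))                  # True
chain[0]["data"]["thermostat"] = 55   # tamper with an old record
print(verify(chain))                  # False
```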
Enough of our bloviation. Let’s open the floor to you. What do you think were the biggest technology trends in 2016? What’s coming in 2017 that has you the most excited? Let us know in the comments!