Tag Archives: Processing

Dutch Film Distributor Wins Right To Chase Pirates, Store Data For 5 Years

Post Syndicated from Andy original https://torrentfreak.com/dutch-film-distributor-wins-right-to-chase-pirates-store-data-for-5-years-171208/

For many years, Dutch Internet users were allowed to download copyrighted content without reprisals, provided it was for their own personal use.

In 2014, however, the European Court of Justice ruled that the country’s “piracy levy” to compensate rightsholders was unlawful. Almost immediately, the government announced a downloading ban.

In March 2016, anti-piracy outfit BREIN followed up by obtaining permission from the Dutch Data Protection Authority to track and store the personal data of alleged BitTorrent pirates. This year, movie distributor Dutch FilmWorks (DFW) made a similar application.

The company said that it would be pursuing alleged pirates to deter future infringement but many suspected that securing cash settlements was its main aim. That was confirmed in August.

“[The letter to alleged pirates] will propose a fee. If someone does not agree [to pay], the organization can start a lawsuit,” said DFW CEO Willem Pruijsserts

“In Germany, this costs between €800 and €1,000, although we find this a bit excessive. But of course it has to be a deterrent, so it will be more than a tenner or two,” he added.

But despite the grand plans, nothing would be possible without first obtaining the necessary permission from the Data Protection Authority. This Wednesday, however, that arrived.

“DFW has given sufficient guarantees for the proper and careful processing of personal data. This means that DFW has been given a green light from the Data Protection Authority to collect personal data, such as IP addresses, from people downloading from illegal sources,” the Authority announced.

Noting that it received feedback from four entities during the six-week consultation process following the publication of its draft decision during the summer, the Data Protection Authority said that further investigations were duly carried out. All input was considered before handing down the final decision.

The Authority said it was satisfied that personal data would be handled correctly and that the information collected and stored would be encrypted and hashed to ensure integrity. Furthermore, data will not be retained for longer than is necessary.

“DFW has stated…that data from users with Dutch IP addresses who were involved in the exchange of a title owned by DFW, but in respect of which there is no intention to follow up on that within three months after receipt, will be destroyed,” the decision reads.

For any cases that are active and haven’t been discarded in the initial three-month period, DFW will be allowed to hold alleged pirates’ data for a maximum of five years, a period that matches the time a company has to file a claim under the Dutch Civil Code.

“When DFW does follow up on a file, DFW carries out further research into the identity of the users of the IP addresses. For this, it is necessary to contact the Internet service providers of the subscribers who used the IP addresses found in the BitTorrent network,” the Authority notes.

According to the decision, once DFW has a person’s details it can take any of several actions, starting with a simple warning or moving up to an amicable cash settlement. Failing that, it might choose to file a full-on court case in which the distributor seeks an injunction against the alleged pirate plus compensation and costs.

Only time will tell what strategy DFW will deploy against alleged pirates but since these schemes aren’t cheap to run, it’s likely that simple warning letters will be seriously outnumbered by demands for cash settlement.

While it seems unlikely that the Data Protection Authority will change its mind at this late stage, it’s decision remains open to appeal. Interested parties have just under six weeks to make their voices heard. Failing that, copyright trolling will hit the Netherlands in the weeks and months to come.

The full decision can be found here (Dutch, pdf) via Tweakers

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

[$] Mozilla releases tools and data for speech recognition

Post Syndicated from jake original https://lwn.net/Articles/740768/rss

Voice computing has long been a staple of science fiction, but it has
only relatively recently made its way into fairly common mainstream use.
Gadgets like mobile
phones and “smart” home assistant devices (e.g. Amazon Echo, Google Home)
have brought voice-based user interfaces to the masses. The voice
processing for those gadgets relies on various proprietary services “in the
cloud”, which generally leaves the free-software world out in the cold.
There have
been FOSS speech-recognition efforts over
the years, but Mozilla’s recent
announcement
of the release of its voice-recognition code and voice
data set should help further the goal of FOSS voice interfaces.

Implementing Dynamic ETL Pipelines Using AWS Step Functions

Post Syndicated from Tara Van Unen original https://aws.amazon.com/blogs/compute/implementing-dynamic-etl-pipelines-using-aws-step-functions/

This post contributed by:
Wangechi Dole, AWS Solutions Architect
Milan Krasnansky, ING, Digital Solutions Developer, SGK
Rian Mookencherry, Director – Product Innovation, SGK

Data processing and transformation is a common use case you see in our customer case studies and success stories. Often, customers deal with complex data from a variety of sources that needs to be transformed and customized through a series of steps to make it useful to different systems and stakeholders. This can be difficult due to the ever-increasing volume, velocity, and variety of data. Today, data management challenges cannot be solved with traditional databases.

Workflow automation helps you build solutions that are repeatable, scalable, and reliable. You can use AWS Step Functions for this. A great example is how SGK used Step Functions to automate the ETL processes for their client. With Step Functions, SGK has been able to automate changes within the data management system, substantially reducing the time required for data processing.

In this post, SGK shares the details of how they used Step Functions to build a robust data processing system based on highly configurable business transformation rules for ETL processes.

SGK: Building dynamic ETL pipelines

SGK is a subsidiary of Matthews International Corporation, a diversified organization focusing on brand solutions and industrial technologies. SGK’s Global Content Creation Studio network creates compelling content and solutions that connect brands and products to consumers through multiple assets including photography, video, and copywriting.

We were recently contracted to build a sophisticated and scalable data management system for one of our clients. We chose to build the solution on AWS to leverage advanced, managed services that help to improve the speed and agility of development.

The data management system served two main functions:

  1. Ingesting a large amount of complex data to facilitate both reporting and product funding decisions for the client’s global marketing and supply chain organizations.
  2. Processing the data through normalization and applying complex algorithms and data transformations. The system goal was to provide information in the relevant context—such as strategic marketing, supply chain, product planning, etc. —to the end consumer through automated data feeds or updates to existing ETL systems.

We were faced with several challenges:

  • Output data that needed to be refreshed at least twice a day to provide fresh datasets to both local and global markets. That constant data refresh posed several challenges, especially around data management and replication across multiple databases.
  • The complexity of reporting business rules that needed to be updated on a constant basis.
  • Data that could not be processed as contiguous blocks of typical time-series data. The measurement of the data was done across seasons (that is, combination of dates), which often resulted with up to three overlapping seasons at any given time.
  • Input data that came from 10+ different data sources. Each data source ranged from 1–20K rows with as many as 85 columns per input source.

These challenges meant that our small Dev team heavily invested time in frequent configuration changes to the system and data integrity verification to make sure that everything was operating properly. Maintaining this system proved to be a daunting task and that’s when we turned to Step Functions—along with other AWS services—to automate our ETL processes.

Solution overview

Our solution included the following AWS services:

  • AWS Step Functions: Before Step Functions was available, we were using multiple Lambda functions for this use case and running into memory limit issues. With Step Functions, we can execute steps in parallel simultaneously, in a cost-efficient manner, without running into memory limitations.
  • AWS Lambda: The Step Functions state machine uses Lambda functions to implement the Task states. Our Lambda functions are implemented in Java 8.
  • Amazon DynamoDB provides us with an easy and flexible way to manage business rules. We specify our rules as Keys. These are key-value pairs stored in a DynamoDB table.
  • Amazon RDS: Our ETL pipelines consume source data from our RDS MySQL database.
  • Amazon Redshift: We use Amazon Redshift for reporting purposes because it integrates with our BI tools. Currently we are using Tableau for reporting which integrates well with Amazon Redshift.
  • Amazon S3: We store our raw input files and intermediate results in S3 buckets.
  • Amazon CloudWatch Events: Our users expect results at a specific time. We use CloudWatch Events to trigger Step Functions on an automated schedule.

Solution architecture

This solution uses a declarative approach to defining business transformation rules that are applied by the underlying Step Functions state machine as data moves from RDS to Amazon Redshift. An S3 bucket is used to store intermediate results. A CloudWatch Event rule triggers the Step Functions state machine on a schedule. The following diagram illustrates our architecture:

Here are more details for the above diagram:

  1. A rule in CloudWatch Events triggers the state machine execution on an automated schedule.
  2. The state machine invokes the first Lambda function.
  3. The Lambda function deletes all existing records in Amazon Redshift. Depending on the dataset, the Lambda function can create a new table in Amazon Redshift to hold the data.
  4. The same Lambda function then retrieves Keys from a DynamoDB table. Keys represent specific marketing campaigns or seasons and map to specific records in RDS.
  5. The state machine executes the second Lambda function using the Keys from DynamoDB.
  6. The second Lambda function retrieves the referenced dataset from RDS. The records retrieved represent the entire dataset needed for a specific marketing campaign.
  7. The second Lambda function executes in parallel for each Key retrieved from DynamoDB and stores the output in CSV format temporarily in S3.
  8. Finally, the Lambda function uploads the data into Amazon Redshift.

To understand the above data processing workflow, take a closer look at the Step Functions state machine for this example.

We walk you through the state machine in more detail in the following sections.

Walkthrough

To get started, you need to:

  • Create a schedule in CloudWatch Events
  • Specify conditions for RDS data extracts
  • Create Amazon Redshift input files
  • Load data into Amazon Redshift

Step 1: Create a schedule in CloudWatch Events
Create rules in CloudWatch Events to trigger the Step Functions state machine on an automated schedule. The following is an example cron expression to automate your schedule:

In this example, the cron expression invokes the Step Functions state machine at 3:00am and 2:00pm (UTC) every day.

Step 2: Specify conditions for RDS data extracts
We use DynamoDB to store Keys that determine which rows of data to extract from our RDS MySQL database. An example Key is MCS2017, which stands for, Marketing Campaign Spring 2017. Each campaign has a specific start and end date and the corresponding dataset is stored in RDS MySQL. A record in RDS contains about 600 columns, and each Key can represent up to 20K records.

A given day can have multiple campaigns with different start and end dates running simultaneously. In the following example DynamoDB item, three campaigns are specified for the given date.

The state machine example shown above uses Keys 31, 32, and 33 in the first ChoiceState and Keys 21 and 22 in the second ChoiceState. These keys represent marketing campaigns for a given day. For example, on Monday, there are only two campaigns requested. The ChoiceState with Keys 21 and 22 is executed. If three campaigns are requested on Tuesday, for example, then ChoiceState with Keys 31, 32, and 33 is executed. MCS2017 can be represented by Key 21 and Key 33 on Monday and Tuesday, respectively. This approach gives us the flexibility to add or remove campaigns dynamically.

Step 3: Create Amazon Redshift input files
When the state machine begins execution, the first Lambda function is invoked as the resource for FirstState, represented in the Step Functions state machine as follows:

"Comment": ” AWS Amazon States Language.", 
  "StartAt": "FirstState",
 
"States": { 
  "FirstState": {
   
"Type": "Task",
   
"Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Start",
    "Next": "ChoiceState" 
  } 

As described in the solution architecture, the purpose of this Lambda function is to delete existing data in Amazon Redshift and retrieve keys from DynamoDB. In our use case, we found that deleting existing records was more efficient and less time-consuming than finding the delta and updating existing records. On average, an Amazon Redshift table can contain about 36 million cells, which translates to roughly 65K records. The following is the code snippet for the first Lambda function in Java 8:

public class LambdaFunctionHandler implements RequestHandler<Map<String,Object>,Map<String,String>> {
    Map<String,String> keys= new HashMap<>();
    public Map<String, String> handleRequest(Map<String, Object> input, Context context){
       Properties config = getConfig(); 
       // 1. Cleaning Redshift Database
       new RedshiftDataService(config).cleaningTable(); 
       // 2. Reading data from Dynamodb
       List<String> keyList = new DynamoDBDataService(config).getCurrentKeys();
       for(int i = 0; i < keyList.size(); i++) {
           keys.put(”key" + (i+1), keyList.get(i)); 
       }
       keys.put(”key" + T,String.valueOf(keyList.size()));
       // 3. Returning the key values and the key count from the “for” loop
       return (keys);
}

The following JSON represents ChoiceState.

"ChoiceState": {
   "Type" : "Choice",
   "Choices": [ 
   {

      "Variable": "$.keyT",
     "StringEquals": "3",
     "Next": "CurrentThreeKeys" 
   }, 
   {

     "Variable": "$.keyT",
    "StringEquals": "2",
    "Next": "CurrentTwooKeys" 
   } 
 ], 
 "Default": "DefaultState"
}

The variable $.keyT represents the number of keys retrieved from DynamoDB. This variable determines which of the parallel branches should be executed. At the time of publication, Step Functions does not support dynamic parallel state. Therefore, choices under ChoiceState are manually created and assigned hardcoded StringEquals values. These values represent the number of parallel executions for the second Lambda function.

For example, if $.keyT equals 3, the second Lambda function is executed three times in parallel with keys, $key1, $key2 and $key3 retrieved from DynamoDB. Similarly, if $.keyT equals two, the second Lambda function is executed twice in parallel.  The following JSON represents this parallel execution:

"CurrentThreeKeys": { 
  "Type": "Parallel",
  "Next": "NextState",
  "Branches": [ 
  {

     "StartAt": “key31",
    "States": { 
       “key31": {

          "Type": "Task",
        "InputPath": "$.key1",
        "Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Execution",
        "End": true 
       } 
    } 
  }, 
  {

     "StartAt": “key32",
    "States": { 
     “key32": {

        "Type": "Task",
       "InputPath": "$.key2",
         "Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Execution",
       "End": true 
      } 
     } 
   }, 
   {

      "StartAt": “key33",
       "States": { 
          “key33": {

                "Type": "Task",
             "InputPath": "$.key3",
             "Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Execution",
           "End": true 
       } 
     } 
    } 
  ] 
} 

Step 4: Load data into Amazon Redshift
The second Lambda function in the state machine extracts records from RDS associated with keys retrieved for DynamoDB. It processes the data then loads into an Amazon Redshift table. The following is code snippet for the second Lambda function in Java 8.

public class LambdaFunctionHandler implements RequestHandler<String, String> {
 public static String key = null;

public String handleRequest(String input, Context context) { 
   key=input; 
   //1. Getting basic configurations for the next classes + s3 client Properties
   config = getConfig();

   AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient(); 
   // 2. Export query results from RDS into S3 bucket 
   new RdsDataService(config).exportDataToS3(s3,key); 
   // 3. Import query results from S3 bucket into Redshift 
    new RedshiftDataService(config).importDataFromS3(s3,key); 
   System.out.println(input); 
   return "SUCCESS"; 
 } 
}

After the data is loaded into Amazon Redshift, end users can visualize it using their preferred business intelligence tools.

Lessons learned

  • At the time of publication, the 1.5–GB memory hard limit for Lambda functions was inadequate for processing our complex workload. Step Functions gave us the flexibility to chunk our large datasets and process them in parallel, saving on costs and time.
  • In our previous implementation, we assigned each key a dedicated Lambda function along with CloudWatch rules for schedule automation. This approach proved to be inefficient and quickly became an operational burden. Previously, we processed each key sequentially, with each key adding about five minutes to the overall processing time. For example, processing three keys meant that the total processing time was three times longer. With Step Functions, the entire state machine executes in about five minutes.
  • Using DynamoDB with Step Functions gave us the flexibility to manage keys efficiently. In our previous implementations, keys were hardcoded in Lambda functions, which became difficult to manage due to frequent updates. DynamoDB is a great way to store dynamic data that changes frequently, and it works perfectly with our serverless architectures.

Conclusion

With Step Functions, we were able to fully automate the frequent configuration updates to our dataset resulting in significant cost savings, reduced risk to data errors due to system downtime, and more time for us to focus on new product development rather than support related issues. We hope that you have found the information useful and that it can serve as a jump-start to building your own ETL processes on AWS with managed AWS services.

For more information about how Step Functions makes it easy to coordinate the components of distributed applications and microservices in any workflow, see the use case examples and then build your first state machine in under five minutes in the Step Functions console.

If you have questions or suggestions, please comment below.

The Pi Towers Secret Santa Babbage

Post Syndicated from Mark Calleja original https://www.raspberrypi.org/blog/secret-santa-babbage/

Tired of pulling names out of a hat for office Secret Santa? Upgrade your festive tradition with a Raspberry Pi, thermal printer, and everybody’s favourite microcomputer mascot, Babbage Bear.

Raspberry Pi Babbage Bear Secret Santa

The name’s Santa. Secret Santa.

It’s that time of year again, when the cosiness gets turned up to 11 and everyone starts thinking about jolly fat men, reindeer, toys, and benevolent home invasion. At Raspberry Pi, we’re running a Secret Santa pool: everyone buys a gift for someone else in the office. Obviously, the person you buy for has to be picked in secret and at random, or the whole thing wouldn’t work. With that in mind, I created Secret Santa Babbage to do the somewhat mundane task of choosing gift recipients. This could’ve just been done with some names in a hat, but we’re Raspberry Pi! If we don’t make a Python-based Babbage robot wearing a jaunty hat and programmed to spread Christmas cheer, who will?

Secret Santa Babbage

Ho ho ho!

Mecha-Babbage Xmas shenanigans

The script the robot runs is pretty basic: a list of names entered as comma-separated strings is shuffled at the press of a GPIO button, then a name is popped off the end and stored as a variable. The name is matched to a photo of the person stored on the Raspberry Pi, and a thermal printer pinched from Alex’s super awesome PastyCam (blog post forthcoming, maybe) prints out the picture and name of the person you will need to shower with gifts at the Christmas party. (Well, OK — with one gift. No more than five quid’s worth. Nothing untoward.) There’s also a redo function, just in case you pick yourself: press another button and the last picked name — still stored as a variable — is appended to the list again, which is shuffled once more, and a new name is popped off the end.

Secret Santa Babbage prototyping

Prototyping!

As the build was a bit of a rush job undertaken at the request of our ‘Director of Vibe’ Emily, there are a few things I’d like to improve about this functionality that I didn’t get around to — more on that later. To add some extra holiday spirit to the project at the last minute, I used Pygame to play a WAV file of Santa’s jolly laugh while Babbage chooses a name for you. The file is included in the GitHub repo along with everything else, because ‘tis the season, etc., etc.

Secret Santa Babbage prototyping

Editor’s note: Considering these desk adornments, Mark’s Secret Santa gift-giver has a lot to go on.

Writing the code for Xmas Mecha-Babbage was fairly straightforward, though it uses some tricky bits for managing the thermal printer. You’ll need to install the drivers to make it go, as well as the CUPS package for managing the print hosting. You can find instructions for these things here, thanks to the wonderful Adafruit crew. Also, for reasons I couldn’t fathom, this will all only work on a Pi 2 and not a Pi 3, as there are some compatibility issues with the thermal printer otherwise. (I also tested the script on a Pi Zero W…no dice.)

Building a Christmassy throne

The hardest (well, fiddliest) parts of making the whole build were constructing the throne and wiring the bear. Using MakerCase, Inkscape, a bit of ingenuity, and a laser cutter, I was able to rig up a Christmassy plywood throne which has a hole through the seat so I could run the wires down from Babbage and to the Pi inside. I finished the throne by rubbing a couple of fingers of beeswax into it; as well as making the wood shine just a little bit and protecting it against getting wet, this had the added bonus of making it smell awesome.

Secret Santa Babbage inside

Next year’s iteration will be mulled wine–scented.

I next soldered two LEDs to some lengths of wire, and then ran the wires through holes at the top of the throne and down the back along a small channel I had carved with a narrow chisel to connect them to the Pi’s GPIO pins. The green LED will remain on as long as Babbage is running his program, and the red one will light up while he is processing your request. Once the red LED goes off again, the next person can have a go. I also laser-cut a final piece of wood to overlay the back of Babbage’s Xmas throne and cover the wiring a bit.

Creating a Xmas cyborg bear

Taking two 6 mm tactile buttons, I clipped the spiky metal legs off one side of each (the buttons were going into a stuffed christmas toy, after all) and soldered a length of wire to each of the remaining legs. Next, I made a small incision into Babbage with my trusty Swiss army knife (in a place that actually made me cringe a little) and fed the buttons up into his paws. At some point in this process I was standing in the office wrestling with the bear and muttering to myself, which elicited some very strange looks from my colleagues.

Secret Santa Babbage throne

Poor Babbage…

One thing to note here is to make sure the wires remain attached at the solder points while you push them up into Babbage’s paws. The first time I tried it, I snapped one of my connections and had to start again. It helped to remove some stuffing like a tunnel and then replace it afterward. Moreover, you can use your fingertip to support the joints as you poke the wire in. Finally, a couple of squirts of hot glue to keep Babbage’s furry cheeks firmly on the seat, and done!

Secret Santa Babbage

Next year: Game of Thrones–inspired candy cane throne

The Secret Santa Babbage masterpiece

The whole build process was the perfect holiday mix of cheerful and macabre, and while getting the thermal printer to work was a little time-consuming, the finished product definitely raised some smiles around the office and added a bit of interesting digital flavour to a staid office tradition. And it also helped people who are new to the office or from other branches of the Foundation to know for whom they will be buying a gift.

Secret Santa Babbage

Ready to dispense Christmas cheer!

There are a few ways in which I’ll polish this project before next year, such as having the script write the names to external text files to create a record that will persist in case of a reboot, and maybe having Secret Santa Babbage play you a random Christmas carol when you squeeze his paw instead of just laughing merrily every time. (I also thought about adding electric shocks for those people who are on the naughty list, but HR said no. Bah, humbug!)

Make your own

The code and laser cut plans for the whole build are available here. If you plan to make your own, let us know which stuffed toy you will be turning into a Secret Santa cyborg! And if you’ve been working on any other Christmas-themed Raspberry Pi projects, we’d like to see those too, so tag us on social media to share the festive maker cheer.

The post The Pi Towers Secret Santa Babbage appeared first on Raspberry Pi.

AWS Contributes to Milestone 1.0 Release and Adds Model Serving Capability for Apache MXNet

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-contributes-to-milestone-1-0-release-and-adds-model-serving-capability-for-apache-mxnet/

Post by Dr. Matt Wood

Today AWS announced contributions to the milestone 1.0 release of the Apache MXNet deep learning engine including the introduction of a new model-serving capability for MXNet. The new capabilities in MXNet provide the following benefits to users:

1) MXNet is easier to use: The model server for MXNet is a new capability introduced by AWS, and it packages, runs, and serves deep learning models in seconds with just a few lines of code, making them accessible over the internet via an API endpoint and thus easy to integrate into applications. The 1.0 release also includes an advanced indexing capability that enables users to perform matrix operations in a more intuitive manner.

  • Model Serving enables set up of an API endpoint for prediction: It saves developers time and effort by condensing the task of setting up an API endpoint for running and integrating prediction functionality into an application to just a few lines of code. It bridges the barrier between Python-based deep learning frameworks and production systems through a Docker container-based deployment model.
  • Advanced indexing for array operations in MXNet: It is now more intuitive for developers to leverage the powerful array operations in MXNet. They can use the advanced indexing capability by leveraging existing knowledge of NumPy/SciPy arrays. For example, it supports MXNet NDArray and Numpy ndarray as index, e.g. (a[mx.nd.array([1,2], dtype = ‘int32’]).

2) MXNet is faster: The 1.0 release includes implementation of cutting-edge features that optimize the performance of training and inference. Gradient compression enables users to train models up to five times faster by reducing communication bandwidth between compute nodes without loss in convergence rate or accuracy. For speech recognition acoustic modeling like the Alexa voice, this feature can reduce network bandwidth by up to three orders of magnitude during training. With the support of NVIDIA Collective Communication Library (NCCL), users can train a model 20% faster on multi-GPU systems.

  • Optimize network bandwidth with gradient compression: In distributed training, each machine must communicate frequently with others to update the weight-vectors and thereby collectively build a single model, leading to high network traffic. Gradient compression algorithm enables users to train models up to five times faster by compressing the model changes communicated by each instance.
  • Optimize the training performance by taking advantage of NCCL: NCCL implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs. NCCL provides communication routines that are optimized to achieve high bandwidth over interconnection between multi-GPUs. MXNet supports NCCL to train models about 20% faster on multi-GPU systems.

3) MXNet provides easy interoperability: MXNet now includes a tool for converting neural network code written with the Caffe framework to MXNet code, making it easier for users to take advantage of MXNet’s scalability and performance.

  • Migrate Caffe models to MXNet: It is now possible to easily migrate Caffe code to MXNet, using the new source code translation tool for converting Caffe code to MXNet code.

MXNet has helped developers and researchers make progress with everything from language translation to autonomous vehicles and behavioral biometric security. We are excited to see the broad base of users that are building production artificial intelligence applications powered by neural network models developed and trained with MXNet. For example, the autonomous driving company TuSimple recently piloted a self-driving truck on a 200-mile journey from Yuma, Arizona to San Diego, California using MXNet. This release also includes a full-featured and performance optimized version of the Gluon programming interface. The ease-of-use associated with it combined with the extensive set of tutorials has led significant adoption among developers new to deep learning. The flexibility of the interface has driven interest within the research community, especially in the natural language processing domain.

Getting started with MXNet
Getting started with MXNet is simple. To learn more about the Gluon interface and deep learning, you can reference this comprehensive set of tutorials, which covers everything from an introduction to deep learning to how to implement cutting-edge neural network models. If you’re a contributor to a machine learning framework, check out the interface specs on GitHub.

To get started with the Model Server for Apache MXNet, install the library with the following command:

$ pip install mxnet-model-server

The Model Server library has a Model Zoo with 10 pre-trained deep learning models, including the SqueezeNet 1.1 object classification model. You can start serving the SqueezeNet model with just the following command:

$ mxnet-model-server \
  --models squeezenet=https://s3.amazonaws.com/model-server/models/squeezenet_v1.1/squeezenet_v1.1.model \
  --service dms/model_service/mxnet_vision_service.py

Learn more about the Model Server and view the source code, reference examples, and tutorials here: https://github.com/awslabs/mxnet-model-server/

-Dr. Matt Wood

Glenn’s Take on re:Invent 2017 – Part 3

Post Syndicated from Glenn Gore original https://aws.amazon.com/blogs/architecture/glenns-take-on-reinvent-2017-part-3/

Glenn Gore here, Chief Architect for AWS. I was in Las Vegas last week — with 43K others — for re:Invent 2017. I checked in to the Architecture blog here and here with my take on what was interesting about some of the bigger announcements from a cloud-architecture perspective.

In the excitement of so many new services being launched, we sometimes overlook feature updates that, while perhaps not as exciting as Amazon DeepLens, have significant impact on how you architect and develop solutions on AWS.

Amazon DynamoDB is used by more than 100,000 customers around the world, handling over a trillion requests every day. From the start, DynamoDB has offered high availability by natively spanning multiple Availability Zones within an AWS Region. As more customers started building and deploying truly-global applications, there was a need to replicate a DynamoDB table to multiple AWS Regions, allowing for read/write operations to occur in any region where the table was replicated. This update is important for providing a globally-consistent view of information — as users may transition from one region to another — or for providing additional levels of availability, allowing for failover between AWS Regions without loss of information.

There are some interesting concurrency-design aspects you need to be aware of and ensure you can handle correctly. For example, we support the “last writer wins” reconciliation where eventual consistency is being used and an application updates the same item in different AWS Regions at the same time. If you require strongly-consistent read/writes then you must perform all of your read/writes in the same AWS Region. The details behind this can be found in the DynamoDB documentation. Providing a globally-distributed, replicated DynamoDB table simplifies many different use cases and allows for the logic of replication, which may have been pushed up into the application layers to be simplified back down into the data layer.

The other big update for DynamoDB is that you can now back up your DynamoDB table on demand with no impact to performance. One of the features I really like is that when you trigger a backup, it is available instantly, regardless of the size of the table. Behind the scenes, we use snapshots and change logs to ensure a consistent backup. While backup is instant, restoring the table could take some time depending on its size and ranges — from minutes to hours for very large tables.

This feature is super important for those of you who work in regulated industries that often have strict requirements around data retention and backups of data, which sometimes limited the use of DynamoDB or required complex workarounds to implement some sort of backup feature in the past. This often incurred significant, additional costs due to increased read transactions on their DynamoDB tables.

Amazon Simple Storage Service (Amazon S3) was our first-released AWS service over 11 years ago, and it proved the simplicity and scalability of true API-driven architectures in the cloud. Today, Amazon S3 stores trillions of objects, with transactional requests per second reaching into the millions! Dealing with data as objects opened up an incredibly diverse array of use cases ranging from libraries of static images, game binary downloads, and application log data, to massive data lakes used for big data analytics and business intelligence. With Amazon S3, when you accessed your data in an object, you effectively had to write/read the object as a whole or use the range feature to retrieve a part of the object — if possible — in your individual use case.

Now, with Amazon S3 Select, an SQL-like query language is used that can work with delimited text and JSON files, as well as work with GZIP compressed files. We don’t support encryption during the preview of Amazon S3 Select.

Amazon S3 Select provides two major benefits:

  • Faster access
  • Lower running costs

Serverless Lambda functions, where every millisecond matters when you are being charged, will benefit greatly from Amazon S3 Select as data retrieval and processing of your Lambda function will experience significant speedups and cost reductions. For example, we have seen 2x speed improvement and 80% cost reduction with the Serverless MapReduce code.

Other AWS services such as Amazon Athena, Amazon Redshift, and Amazon EMR will support Amazon S3 Select as well as partner offerings including Cloudera and Hortonworks. If you are using Amazon Glacier for longer-term data archival, you will be able to use Amazon Glacier Select to retrieve a subset of your content from within Amazon Glacier.

As the volume of data that can be stored within Amazon S3 and Amazon Glacier continues to scale on a daily basis, we will continue to innovate and develop improved and optimized services that will allow you to work with these magnificently-large data sets while reducing your costs (retrieval and processing). I believe this will also allow you to simplify the transformation and storage of incoming data into Amazon S3 in basic, semi-structured formats as a single copy vs. some of the duplication and reformatting of data sometimes required to do upfront optimizations for downstream processing. Amazon S3 Select largely removes the need for this upfront optimization and instead allows you to store data once and process it based on your individual Amazon S3 Select query per application or transaction need.

Thanks for reading!

Glenn contemplating why CSV format is still relevant in 2017 (Italy).

GPIO expander: access a Pi’s GPIO pins on your PC/Mac

Post Syndicated from Gordon Hollingworth original https://www.raspberrypi.org/blog/gpio-expander/

Use the GPIO pins of a Raspberry Pi Zero while running Debian Stretch on a PC or Mac with our new GPIO expander software! With this tool, you can easily access a Pi Zero’s GPIO pins from your x86 laptop without using SSH, and you can also take advantage of your x86 computer’s processing power in your physical computing projects.

A Raspberry Pi zero connected to a laptop - GPIO expander

What is this magic?

Running our x86 Stretch distribution on a PC or Mac, whether installed on the hard drive or as a live image, is a great way of taking advantage of a well controlled and simple Linux distribution without the need for a Raspberry Pi.

The downside of not using a Pi, however, is that there aren’t any GPIO pins with which your Scratch or Python programs could communicate. This is a shame, because it means you are limited in your physical computing projects.

I was thinking about this while playing around with the Pi Zero’s USB booting capabilities, having seen people employ the Linux gadget USB mode to use the Pi Zero as an Ethernet device. It struck me that, using the udev subsystem, we could create a simple GUI application that automatically pops up when you plug a Pi Zero into your computer’s USB port. Then the Pi Zero could be programmed to turn into an Ethernet-connected computer running pigpio to provide you with remote GPIO pins.

So we went ahead and built this GPIO expander application, and your PC or Mac can now have GPIO pins which are accessible through Scratch or the GPIO Zero Python library. Note that you can only use this tool to access the Pi Zero.

You can also install the application on the Raspberry Pi. Theoretically, you could connect a number of Pi Zeros to a single Pi and (without a USB hub) use a maximum of 140 pins! But I’ve not tested this — one for you, I think…

Making the GPIO expander work

If you’re using a PC or Mac and you haven’t set up x86 Debian Stretch yet, you’ll need to do that first. An easy way to do it is to download a copy of the Stretch release from this page and image it onto a USB stick. Boot from the USB stick (on most computers, you just need to press F10 during booting and select the stick when asked), and then run Stretch directly from the USB key. You can also install it to the hard drive, but be aware that installing it will overwrite anything that was on your hard drive before.

Whether on a Mac, PC, or Pi, boot through to the Stretch desktop, open a terminal window, and install the GPIO expander application:

sudo apt install usbbootgui

Next, plug in your Raspberry Pi Zero (don’t insert an SD card), and after a few seconds the GUI will appear.

A screenshot of the GPIO expander GUI

The Raspberry Pi USB programming GUI

Select GPIO expansion board and click OK. The Pi Zero will now be programmed as a locally connected Ethernet port (if you run ifconfig, you’ll see the new interface usb0 coming up).

What’s really cool about this is that your plugged-in Pi Zero is now running pigpio, which allows you to control its GPIOs through the network interface.

With Scratch 2

To utilise the pins with Scratch 2, just click on the start bar and select Programming > Scratch 2.

In Scratch, click on More Blocks, select Add an Extension, and then click Pi GPIO.

Two new blocks will be added: the first is used to set the output pin, the second is used to get the pin value (it is true if the pin is read high).

This a simple application using a Pibrella I had hanging around:

A screenshot of a Scratch 2 program - GPIO expander

With Python

This is a Python example using the GPIO Zero library to flash an LED:

[email protected]:~ $ export GPIOZERO_PIN_FACTORY=pigpio
[email protected]:~ $ export PIGPIO_ADDR=fe80::1%usb0
[email protected]:~ $ python3
>>> from gpiozero import LED
>>> led = LED(17)
>>> led.blink()
A Raspberry Pi zero connected to a laptop - GPIO expander

The pinout command line tool is your friend

Note that in the code above the IP address of the Pi Zero is an IPv6 address and is shortened to fe80::1%usb0, where usb0 is the network interface created by the first Pi Zero.

With pigs directly

Another option you have is to use the pigpio library and the pigs application and redirect the output to the Pi Zero network port running IPv6. To do this, you’ll first need to set some environment variable for the redirection:

[email protected]:~ $ export PIGPIO_ADDR=fe80::1%usb0
[email protected]:~ $ pigs bc2 0x8000
[email protected]:~ $ pigs bs2 0x8000

With the commands above, you should be able to flash the LED on the Pi Zero.

The secret sauce

I know there’ll be some people out there who would be interested in how we put this together. And I’m sure many people are interested in the ‘buildroot’ we created to run on the Pi Zero — after all, there are lots of things you can create if you’ve got a Pi Zero on the end of a piece of IPv6 string! For a closer look, find the build scripts for the GPIO expander here and the source code for the USB boot GUI here.

And be sure to share your projects built with the GPIO expander by tagging us on social media or posting links in the comments!

The post GPIO expander: access a Pi’s GPIO pins on your PC/Mac appeared first on Raspberry Pi.

timeShift(GrafanaBuzz, 1w) Issue 24

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/12/01/timeshiftgrafanabuzz-1w-issue-24/

Welcome to TimeShift

It’s hard to believe it’s already December. Here at Grafana Labs we’ve been spending a lot of time working on new features and enhancements for Grafana v5, and finalizing our selections for GrafanaCon EU. This week we have some interesting articles to share and a number of plugin updates. Enjoy!


Latest Release

Grafana 4.6.2 is now available and includes some bug fixes:

  • Prometheus: Fixes bug with new Prometheus alerts in Grafana. Make sure to download this version if you’re using Prometheus for alerting. More details in the issue. #9777
  • Color picker: Bug after using textbox input field to change/paste color string #9769
  • Cloudwatch: build using golang 1.9.2 #9667, thanks @mtanda
  • Heatmap: Fixed tooltip for “time series buckets” mode #9332
  • InfluxDB: Fixed query editor issue when using > or < operators in WHERE clause #9871

Download Grafana 4.6.2 Now


From the Blogosphere

Monitoring Camel with Prometheus in Red Hat OpenShift: This in-depth walk-through will show you how to build an Apache Camel application from scratch, deploy it in a Kubernetes environment, gather metrics using Prometheus and display them in Grafana.

How to run Grafana with DeviceHive: We see more and more examples of people using Grafana in IoT. This article discusses how to gather data from the IoT platform, DeviceHive, and build useful dashboards.

How to Install Grafana on Linux Servers: Pretty self-explanatory, but this tutorial walks you installing Grafana on Ubuntu 16.04 and CentOS 7. After installation, it covers configuration and plugin installation. This is the first article in an upcoming series about Grafana.

Monitoring your AKS cluster with Grafana: It’s important to know how your application is performing regardless of where it lives; the same applies to Kubernetes. This article focuses on aggregating data from Kubernetes with Heapster and feeding it to a backend for Grafana to visualize.

CoinStatistics: With the price of Bitcoin skyrocketing, more and more people are interested in cryptocurrencies. This is a cool dashboard that has a lot of stats about popular cryptocurrencies, and has a calculator to let you know when you can buy that lambo.

Using OpenNTI As A Collector For Streaming Telemetry From Juniper Devices: Part 1: This series will serve as a quick start guide for getting up and running with streaming real-time telemetry data from Juniper devices. This first article covers some high-level concepts and installation, while part 2 covers configuration options.

How to Get Metrics for Advance Alerting to Prevent Trouble: What good is performance monitoring if you’re never told when something has gone wrong? This article suggests ways to be more proactive to prevent issues and avoid the scramble to troubleshoot issues.

Thoughtworks: Technology Radar: We got a shout-out in the latest Technology Radar in the Tools section, as the dashboard visualization tool of choice for Prometheus!


GrafanaCon Tickets are Going Fast

Tickets are going fast for GrafanaCon EU, but we still have a seat reserved for you. Join us March 1-2, 2018 in Amsterdam for 2 days of talks centered around Grafana and the surrounding monitoring ecosystem including Graphite, Prometheus, InfluxData, Elasticsearch, Kubernetes, and more.

Get Your Ticket Now


Grafana Plugins

We have a number of plugin updates to highlight this week. Authors improve plugins regularly to fix bugs and improve performance, so it’s important to keep your plugins up to date. We’ve made updating easy; for on-prem Grafana, use the Grafana-cli tool, or update with 1 click if you’re using Hosted Grafana.

UPDATED PLUGIN

Clickhouse Data Source – The Clickhouse Data Source received a substantial update this week. It now has support for Ace Editor, which has a reformatting function for the query editor that automatically formats your sql. If you’re using Clickhouse then you should also have a look at CHProxy – see the plugin readme for more details.


Update

UPDATED PLUGIN

Influx Admin Panel – This panel received a number of small fixes. A new version will be coming soon with some new features.

Some of the changes (see the release notes) for more details):

  • Fix issue always showing query results
  • When there is only one row, swap rows/cols (ie: SHOW DIAGNOSTICS)
  • Improve auto-refresh behavior
  • Show ‘message’ response. (ie: please use POST)
  • Fix query time sorting
  • Show ‘status’ field (killed, etc)

Update

UPDATED PLUGIN

Gnocchi Data Source – The latest version of the Gnocchi Data Source adds support for dynamic aggregations.


Update

UPDATED PLUGINS

BT Plugins – All of the BT panel plugins received updates this week.


Upcoming Events:

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We have some awesome talks and events coming soon. Hope to see you at one of these!

KubeCon | Austin, TX – Dec. 6-8, 2017: We’re sponsoring KubeCon 2017! This is the must-attend conference for cloud native computing professionals. KubeCon + CloudNativeCon brings together leading contributors in:

  • Cloud native applications and computing
  • Containers
  • Microservices
  • Central orchestration processing
  • And more

Buy Tickets

FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. Carl Bergquist is managing the Cloud and Monitoring Devroom, and we’ve heard there were some great talks submitted. There is no need to register; all are welcome.


Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

YIKES! Glad it’s not – there’s good attention and bad attention.


Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions


How are we doing?

Let us know if you’re finding these weekly roundups valuable. Submit a comment on this article below, or post something at our community forum. Find an article I haven’t included? Send it my way. Help us make timeShift better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

NSA "Red Disk" Data Leak

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/11/nsa_red_disk_da.html

ZDNet is reporting about another data leak, this one from US Army’s Intelligence and Security Command (INSCOM), which is also within to the NSA.

The disk image, when unpacked and loaded, is a snapshot of a hard drive dating back to May 2013 from a Linux-based server that forms part of a cloud-based intelligence sharing system, known as Red Disk. The project, developed by INSCOM’s Futures Directorate, was slated to complement the Army’s so-called distributed common ground system (DCGS), a legacy platform for processing and sharing intelligence, surveillance, and reconnaissance information.

[…]

Red Disk was envisioned as a highly customizable cloud system that could meet the demands of large, complex military operations. The hope was that Red Disk could provide a consistent picture from the Pentagon to deployed soldiers in the Afghan battlefield, including satellite images and video feeds from drones trained on terrorists and enemy fighters, according to a Foreign Policy report.

[…]

Red Disk was a modular, customizable, and scalable system for sharing intelligence across the battlefield, like electronic intercepts, drone footage and satellite imagery, and classified reports, for troops to access with laptops and tablets on the battlefield. Marking files found in several directories imply the disk is “top secret,” and restricted from being shared to foreign intelligence partners.

A couple of points. One, this isn’t particularly sensitive. It’s an intelligence distribution system under development. It’s not raw intelligence. Two, this doesn’t seem to be classified data. Even the article hedges, using the unofficial term of “highly sensitive.” Three, it doesn’t seem that Chris Vickery, the researcher that discovered the data, has published it.

Chris Vickery, director of cyber risk research at security firm UpGuard, found the data and informed the government of the breach in October. The storage server was subsequently secured, though its owner remains unknown.

This doesn’t feel like a big deal to me.

Slashdot thread.

Presenting AWS IoT Analytics: Delivering IoT Analytics at Scale and Faster than Ever Before

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/launch-presenting-aws-iot-analytics/

One of the technology areas I thoroughly enjoy is the Internet of Things (IoT). Even as a child I used to infuriate my parents by taking apart the toys they would purchase for me to see how they worked and if I could somehow put them back together. It seems somehow I was destined to end up the tough and ever-changing world of technology. Therefore, it’s no wonder that I am really enjoying learning and tinkering with IoT devices and technologies. It combines my love of development and software engineering with my curiosity around circuits, controllers, and other facets of the electrical engineering discipline; even though an electrical engineer I can not claim to be.

Despite all of the information that is collected by the deployment of IoT devices and solutions, I honestly never really thought about the need to analyze, search, and process this data until I came up against a scenario where it became of the utmost importance to be able to search and query through loads of sensory data for an anomaly occurrence. Of course, I understood the importance of analytics for businesses to make accurate decisions and predictions to drive the organization’s direction. But it didn’t occur to me initially, how important it was to make analytics an integral part of my IoT solutions. Well, I learned my lesson just in time because this re:Invent a service is launching to make it easier for anyone to process and analyze IoT messages and device data.

 

Hello, AWS IoT Analytics!  AWS IoT Analytics is a fully managed service of AWS IoT that provides advanced data analysis of data collected from your IoT devices.  With the AWS IoT Analytics service, you can process messages, gather and store large amounts of device data, as well as, query your data. Also, the new AWS IoT Analytics service feature integrates with Amazon Quicksight for visualization of your data and brings the power of machine learning through integration with Jupyter Notebooks.

Benefits of AWS IoT Analytics

  • Helps with predictive analysis of data by providing access to pre-built analytical functions
  • Provides ability to visualize analytical output from service
  • Provides tools to clean up data
  • Can help identify patterns in the gathered data

Be In the Know: IoT Analytics Concepts

  • Channel: archives the raw, unprocessed messages and collects data from MQTT topics.
  • Pipeline: consumes messages from channels and allows message processing.
    • Activities: perform transformations on your messages including filtering attributes and invoking lambda functions advanced processing.
  • Data Store: Used as a queryable repository for processed messages. Provide ability to have multiple datastores for messages coming from different devices or locations or filtered by message attributes.
  • Data Set: Data retrieval view from a data store, can be generated by a recurring schedule. 

Getting Started with AWS IoT Analytics

First, I’ll create a channel to receive incoming messages.  This channel can be used to ingest data sent to the channel via MQTT or messages directed from the Rules Engine. To create a channel, I’ll select the Channels menu option and then click the Create a channel button.

I’ll name my channel, TaraIoTAnalyticsID and give the Channel a MQTT topic filter of Temperature. To complete the creation of my channel, I will click the Create Channel button.

Now that I have my Channel created, I need to create a Data Store to receive and store the messages received on the Channel from my IoT device. Remember you can set up multiple Data Stores for more complex solution needs, but I’ll just create one Data Store for my example. I’ll select Data Stores from menu panel and click Create a data store.

 

I’ll name my Data Store, TaraDataStoreID, and once I click the Create the data store button and I would have successfully set up a Data Store to house messages coming from my Channel.

Now that I have my Channel and my Data Store, I will need to connect the two using a Pipeline. I’ll create a simple pipeline that just connects my Channel and Data Store, but you can create a more robust pipeline to process and filter messages by adding Pipeline activities like a Lambda activity.

To create a pipeline, I’ll select the Pipelines menu option and then click the Create a pipeline button.

I will not add an Attribute for this pipeline. So I will click Next button.

As we discussed there are additional pipeline activities that I can add to my pipeline for the processing and transformation of messages but I will keep my first pipeline simple and hit the Next button.

The final step in creating my pipeline is for me to select my previously created Data Store and click Create Pipeline.

All that is left for me to take advantage of the AWS IoT Analytics service is to create an IoT rule that sends data to an AWS IoT Analytics channel.  Wow, that was a super easy process to set up analytics for IoT devices.

If I wanted to create a Data Set as a result of queries run against my data for visualization with Amazon Quicksight or integrate with Jupyter Notebooks to perform more advanced analytical functions, I can choose the Analyze menu option to bring up the screens to create data sets and access the Juypter Notebook instances.

Summary

As you can see, it was a very simple process to set up the advanced data analysis for AWS IoT. With AWS IoT Analytics, you have the ability to collect, visualize, process, query and store large amounts of data generated from your AWS IoT connected device. Additionally, you can access the AWS IoT Analytics service in a myriad of different ways; the AWS Command Line Interface (AWS CLI), the AWS IoT API, language-specific AWS SDKs, and AWS IoT Device SDKs.

AWS IoT Analytics is available today for you to dig into the analysis of your IoT data. To learn more about AWS IoT and AWS IoT Analytics go to the AWS IoT Analytics product page and/or the AWS IoT documentation.

Tara

GDPR – A Practical Guide For Developers

Post Syndicated from Bozho original https://techblog.bozho.net/gdpr-practical-guide-developers/

You’ve probably heard about GDPR. The new European data protection regulation that applies practically to everyone. Especially if you are working in a big company, it’s most likely that there’s already a process for gettign your systems in compliance with the regulation.

The regulation is basically a law that must be followed in all European countries (but also applies to non-EU companies that have users in the EU). In this particular case, it applies to companies that are not registered in Europe, but are having European customers. So that’s most companies. I will not go into yet another “12 facts about GDPR” or “7 myths about GDPR” posts/whitepapers, as they are often aimed at managers or legal people. Instead, I’ll focus on what GDPR means for developers.

Why am I qualified to do that? A few reasons – I was advisor to the deputy prime minister of a EU country, and because of that I’ve been both exposed and myself wrote some legislation. I’m familiar with the “legalese” and how the regulatory framework operates in general. I’m also a privacy advocate and I’ve been writing about GDPR-related stuff in the past, i.e. “before it was cool” (protecting sensitive data, the right to be forgotten). And finally, I’m currently working on a project that (among other things) aims to help with covering some GDPR aspects.

I’ll try to be a bit more comprehensive this time and cover as many aspects of the regulation that concern developers as I can. And while developers will mostly be concerned about how the systems they are working on have to change, it’s not unlikely that a less informed manager storms in in late spring, realizing GDPR is going to be in force tomorrow, asking “what should we do to get our system/website compliant”.

The rights of the user/client (referred to as “data subject” in the regulation) that I think are relevant for developers are: the right to erasure (the right to be forgotten/deleted from the system), right to restriction of processing (you still keep the data, but mark it as “restricted” and don’t touch it without further consent by the user), the right to data portability (the ability to export one’s data), the right to rectification (the ability to get personal data fixed), the right to be informed (getting human-readable information, rather than long terms and conditions), the right of access (the user should be able to see all the data you have about them), the right to data portability (the user should be able to get a machine-readable dump of their data).

Additionally, the relevant basic principles are: data minimization (one should not collect more data than necessary), integrity and confidentiality (all security measures to protect data that you can think of + measures to guarantee that the data has not been inappropriately modified).

Even further, the regulation requires certain processes to be in place within an organization (of more than 250 employees or if a significant amount of data is processed), and those include keeping a record of all types of processing activities carried out, including transfers to processors (3rd parties), which includes cloud service providers. None of the other requirements of the regulation have an exception depending on the organization size, so “I’m small, GDPR does not concern me” is a myth.

It is important to know what “personal data” is. Basically, it’s every piece of data that can be used to uniquely identify a person or data that is about an already identified person. It’s data that the user has explicitly provided, but also data that you have collected about them from either 3rd parties or based on their activities on the site (what they’ve been looking at, what they’ve purchased, etc.)

Having said that, I’ll list a number of features that will have to be implemented and some hints on how to do that, followed by some do’s and don’t’s.

  • “Forget me” – you should have a method that takes a userId and deletes all personal data about that user (in case they have been collected on the basis of consent, and not due to contract enforcement or legal obligation). It is actually useful for integration tests to have that feature (to cleanup after the test), but it may be hard to implement depending on the data model. In a regular data model, deleting a record may be easy, but some foreign keys may be violated. That means you have two options – either make sure you allow nullable foreign keys (for example an order usually has a reference to the user that made it, but when the user requests his data be deleted, you can set the userId to null), or make sure you delete all related data (e.g. via cascades). This may not be desirable, e.g. if the order is used to track available quantities or for accounting purposes. It’s a bit trickier for event-sourcing data models, or in extreme cases, ones that include some sort of blcokchain/hash chain/tamper-evident data structure. With event sourcing you should be able to remove a past event and re-generate intermediate snapshots. For blockchain-like structures – be careful what you put in there and avoid putting personal data of users. There is an option to use a chameleon hash function, but that’s suboptimal. Overall, you must constantly think of how you can delete the personal data. And “our data model doesn’t allow it” isn’t an excuse.
  • Notify 3rd parties for erasure – deleting things from your system may be one thing, but you are also obligated to inform all third parties that you have pushed that data to. So if you have sent personal data to, say, Salesforce, Hubspot, twitter, or any cloud service provider, you should call an API of theirs that allows for the deletion of personal data. If you are such a provider, obviously, your “forget me” endpoint should be exposed. Calling the 3rd party APIs to remove data is not the full story, though. You also have to make sure the information does not appear in search results. Now, that’s tricky, as Google doesn’t have an API for removal, only a manual process. Fortunately, it’s only about public profile pages that are crawlable by Google (and other search engines, okay…), but you still have to take measures. Ideally, you should make the personal data page return a 404 HTTP status, so that it can be removed.
  • Restrict processing – in your admin panel where there’s a list of users, there should be a button “restrict processing”. The user settings page should also have that button. When clicked (after reading the appropriate information), it should mark the profile as restricted. That means it should no longer be visible to the backoffice staff, or publicly. You can implement that with a simple “restricted” flag in the users table and a few if-clasues here and there.
  • Export data – there should be another button – “export data”. When clicked, the user should receive all the data that you hold about them. What exactly is that data – depends on the particular usecase. Usually it’s at least the data that you delete with the “forget me” functionality, but may include additional data (e.g. the orders the user has made may not be delete, but should be included in the dump). The structure of the dump is not strictly defined, but my recommendation would be to reuse schema.org definitions as much as possible, for either JSON or XML. If the data is simple enough, a CSV/XLS export would also be fine. Sometimes data export can take a long time, so the button can trigger a background process, which would then notify the user via email when his data is ready (twitter, for example, does that already – you can request all your tweets and you get them after a while).
  • Allow users to edit their profile – this seems an obvious rule, but it isn’t always followed. Users must be able to fix all data about them, including data that you have collected from other sources (e.g. using a “login with facebook” you may have fetched their name and address). Rule of thumb – all the fields in your “users” table should be editable via the UI. Technically, rectification can be done via a manual support process, but that’s normally more expensive for a business than just having the form to do it. There is one other scenario, however, when you’ve obtained the data from other sources (i.e. the user hasn’t provided their details to you directly). In that case there should still be a page where they can identify somehow (via email and/or sms confirmation) and get access to the data about them.
  • Consent checkboxes – this is in my opinion the biggest change that the regulation brings. “I accept the terms and conditions” would no longer be sufficient to claim that the user has given their consent for processing their data. So, for each particular processing activity there should be a separate checkbox on the registration (or user profile) screen. You should keep these consent checkboxes in separate columns in the database, and let the users withdraw their consent (by unchecking these checkboxes from their profile page – see the previous point). Ideally, these checkboxes should come directly from the register of processing activities (if you keep one). Note that the checkboxes should not be preselected, as this does not count as “consent”.
  • Re-request consent – if the consent users have given was not clear (e.g. if they simply agreed to terms & conditions), you’d have to re-obtain that consent. So prepare a functionality for mass-emailing your users to ask them to go to their profile page and check all the checkboxes for the personal data processing activities that you have.
  • “See all my data” – this is very similar to the “Export” button, except data should be displayed in the regular UI of the application rather than an XML/JSON format. For example, Google Maps shows you your location history – all the places that you’ve been to. It is a good implementation of the right to access. (Though Google is very far from perfect when privacy is concerned)
  • Age checks – you should ask for the user’s age, and if the user is a child (below 16), you should ask for parent permission. There’s no clear way how to do that, but my suggestion is to introduce a flow, where the child should specify the email of a parent, who can then confirm. Obviosuly, children will just cheat with their birthdate, or provide a fake parent email, but you will most likely have done your job according to the regulation (this is one of the “wishful thinking” aspects of the regulation).

Now some “do’s”, which are mostly about the technical measures needed to protect personal data. They may be more “ops” than “dev”, but often the application also has to be extended to support them. I’ve listed most of what I could think of in a previous post.

  • Encrypt the data in transit. That means that communication between your application layer and your database (or your message queue, or whatever component you have) should be over TLS. The certificates could be self-signed (and possibly pinned), or you could have an internal CA. Different databases have different configurations, just google “X encrypted connections. Some databases need gossiping among the nodes – that should also be configured to use encryption
  • Encrypt the data at rest – this again depends on the database (some offer table-level encryption), but can also be done on machine-level. E.g. using LUKS. The private key can be stored in your infrastructure, or in some cloud service like AWS KMS.
  • Encrypt your backups – kind of obvious
  • Implement pseudonymisation – the most obvious use-case is when you want to use production data for the test/staging servers. You should change the personal data to some “pseudonym”, so that the people cannot be identified. When you push data for machine learning purposes (to third parties or not), you can also do that. Technically, that could mean that your User object can have a “pseudonymize” method which applies hash+salt/bcrypt/PBKDF2 for some of the data that can be used to identify a person
  • Protect data integrity – this is a very broad thing, and could simply mean “have authentication mechanisms for modifying data”. But you can do something more, even as simple as a checksum, or a more complicated solution (like the one I’m working on). It depends on the stakes, on the way data is accessed, on the particular system, etc. The checksum can be in the form of a hash of all the data in a given database record, which should be updated each time the record is updated through the application. It isn’t a strong guarantee, but it is at least something.
  • Have your GDPR register of processing activities in something other than Excel – Article 30 says that you should keep a record of all the types of activities that you use personal data for. That sounds like bureaucracy, but it may be useful – you will be able to link certain aspects of your application with that register (e.g. the consent checkboxes, or your audit trail records). It wouldn’t take much time to implement a simple register, but the business requirements for that should come from whoever is responsible for the GDPR compliance. But you can advise them that having it in Excel won’t make it easy for you as a developer (imagine having to fetch the excel file internally, so that you can parse it and implement a feature). Such a register could be a microservice/small application deployed separately in your infrastructure.
  • Log access to personal data – every read operation on a personal data record should be logged, so that you know who accessed what and for what purpose
  • Register all API consumers – you shouldn’t allow anonymous API access to personal data. I’d say you should request the organization name and contact person for each API user upon registration, and add those to the data processing register. Note: some have treated article 30 as a requirement to keep an audit log. I don’t think it is saying that – instead it requires 250+ companies to keep a register of the types of processing activities (i.e. what you use the data for). There are other articles in the regulation that imply that keeping an audit log is a best practice (for protecting the integrity of the data as well as to make sure it hasn’t been processed without a valid reason)

Finally, some “don’t’s”.

  • Don’t use data for purposes that the user hasn’t agreed with – that’s supposed to be the spirit of the regulation. If you want to expose a new API to a new type of clients, or you want to use the data for some machine learning, or you decide to add ads to your site based on users’ behaviour, or sell your database to a 3rd party – think twice. I would imagine your register of processing activities could have a button to send notification emails to users to ask them for permission when a new processing activity is added (or if you use a 3rd party register, it should probably give you an API). So upon adding a new processing activity (and adding that to your register), mass email all users from whom you’d like consent.
  • Don’t log personal data – getting rid of the personal data from log files (especially if they are shipped to a 3rd party service) can be tedious or even impossible. So log just identifiers if needed. And make sure old logs files are cleaned up, just in case
  • Don’t put fields on the registration/profile form that you don’t need – it’s always tempting to just throw as many fields as the usability person/designer agrees on, but unless you absolutely need the data for delivering your service, you shouldn’t collect it. Names you should probably always collect, but unless you are delivering something, a home address or phone is unnecessary.
  • Don’t assume 3rd parties are compliant – you are responsible if there’s a data breach in one of the 3rd parties (e.g. “processors”) to which you send personal data. So before you send data via an API to another service, make sure they have at least a basic level of data protection. If they don’t, raise a flag with management.
  • Don’t assume having ISO XXX makes you compliant – information security standards and even personal data standards are a good start and they will probably 70% of what the regulation requires, but they are not sufficient – most of the things listed above are not covered in any of those standards

Overall, the purpose of the regulation is to make you take conscious decisions when processing personal data. It imposes best practices in a legal way. If you follow the above advice and design your data model, storage, data flow , API calls with data protection in mind, then you shouldn’t worry about the huge fines that the regulation prescribes – they are for extreme cases, like Equifax for example. Regulators (data protection authorities) will most likely have some checklists into which you’d have to somehow fit, but if you follow best practices, that shouldn’t be an issue.

I think all of the above features can be implemented in a few weeks by a small team. Be suspicious when a big vendor offers you a generic plug-and-play “GDPR compliance” solution. GDPR is not just about the technical aspects listed above – it does have organizational/process implications. But also be suspicious if a consultant claims GDPR is complicated. It’s not – it relies on a few basic principles that are in fact best practices anyway. Just don’t ignore them.

The post GDPR – A Practical Guide For Developers appeared first on Bozho's tech blog.

Amazon EC2 Bare Metal Instances with Direct Access to Hardware

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-ec2-bare-metal-instances-with-direct-access-to-hardware/

When customers come to us with new and unique requirements for AWS, we listen closely, ask lots of questions, and do our best to understand and address their needs. When we do this, we make the resulting service or feature generally available; we do not build one-offs or “snowflakes” for individual customers. That model is messy and hard to scale and is not the way we work.

Instead, every AWS customer has access to whatever it is that we build, and everyone benefits. VMware Cloud on AWS is a good example of this strategy in action. They told us that they wanted to run their virtualization stack directly on the hardware, within the AWS Cloud, giving their customers access to the elasticity, security, and reliability (not to mention the broad array of services) that AWS offers.

We knew that other customers also had interesting use cases for bare metal hardware and didn’t want to take the performance hit of nested virtualization. They wanted access to the physical resources for applications that take advantage of low-level hardware features such as performance counters and Intel® VT that are not always available or fully supported in virtualized environments, and also for applications intended to run directly on the hardware or licensed and supported for use in non-virtualized environments.

Our multi-year effort to move networking, storage, and other EC2 features out of our virtualization platform and into dedicated hardware was already well underway and provided the perfect foundation for a possible solution. This work, as I described in Now Available – Compute-Intensive C5 Instances for Amazon EC2, includes a set of dedicated hardware accelerators.

Now that we have provided VMware with the bare metal access that they requested, we are doing the same for all AWS customers. I’m really looking forward to seeing what you can do with them!

New Bare Metal Instances
Today we are launching a public preview the i3.metal instance, the first in a series of EC2 instances that offer the best of both worlds, allowing the operating system to run directly on the underlying hardware while still providing access to all of the benefits of the cloud. The instance gives you direct access to the processor and other hardware, and has the following specifications:

  • Processing – Two Intel Xeon E5-2686 v4 processors running at 2.3 GHz, with a total of 36 hyperthreaded cores (72 logical processors).
  • Memory – 512 GiB.
  • Storage – 15.2 terabytes of local, SSD-based NVMe storage.
  • Network – 25 Gbps of ENA-based enhanced networking.

Bare Metal instances are full-fledged members of the EC2 family and can take advantage of Elastic Load Balancing, Auto Scaling, Amazon CloudWatch, Auto Recovery, and so forth. They can also access the full suite of AWS database, IoT, mobile, analytics, artificial intelligence, and security services.

Previewing Now
We are launching a public preview of the Bare Metal instances today; please sign up now if you want to try them out.

You can now bring your specialized applications or your own stack of virtualized components to AWS and run them on Bare Metal instances. If you are using or thinking about using containers, these instances make a great host for CoreOS.

An AMI that works on one of the new C5 instances should also work on an I3 Bare Metal Instance. It must have the ENA and NVMe drivers, and must be tagged for ENA.

Jeff;

 

Amazon MQ – Managed Message Broker Service for ActiveMQ

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-mq-managed-message-broker-service-for-activemq/

Messaging holds the parts of a distributed application together, while also adding resiliency and enabling the implementation of highly scalable architectures. For example, earlier this year, Amazon Simple Queue Service (SQS) and Amazon Simple Notification Service (SNS) supported the processing of customer orders on Prime Day, collectively processing 40 billion messages at a rate of 10 million per second, with no customer-visible issues.

SQS and SNS have been used extensively for applications that were born in the cloud. However, many of our larger customers are already making use of open-sourced or commercially-licensed message brokers. Their applications are mission-critical, and so is the messaging that powers them. Our customers describe the setup and on-going maintenance of their messaging infrastructure as “painful” and report that they spend at least 10 staff-hours per week on this chore.

New Amazon MQ
Today we are launching Amazon MQ – a managed message broker service for Apache ActiveMQ that lets you get started in minutes with just three clicks! As you may know, ActiveMQ is a popular open-source message broker that is fast & feature-rich. It offers queues and topics, durable and non-durable subscriptions, push-based and poll-based messaging, and filtering.

As a managed service, Amazon MQ takes care of the administration and maintenance of ActiveMQ. This includes responsibility for broker provisioning, patching, failure detection & recovery for high availability, and message durability. With Amazon MQ, you get direct access to the ActiveMQ console and industry standard APIs and protocols for messaging, including JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. This allows you to move from any message broker that uses these standards to Amazon MQ–along with the supported applications–without rewriting code.

You can create a single-instance Amazon MQ broker for development and testing, or an active/standby pair that spans AZs, with quick, automatic failover. Either way, you get data replication across AZs and a pay-as-you-go model for the broker instance and message storage.

Amazon MQ is a full-fledged part of the AWS family, including the use of AWS Identity and Access Management (IAM) for authentication and authorization to use the service API. You can use Amazon CloudWatch metrics to keep a watchful eye metrics such as queue depth and initiate Auto Scaling of your consumer fleet as needed.

Launching an Amazon MQ Broker
To get started, I open up the Amazon MQ Console, select the desired AWS Region, enter a name for my broker, and click on Next step:

Then I choose the instance type, indicate that I want to create a standby , and click on Create broker (I can select a VPC and fine-tune other settings in the Advanced settings section):

My broker will be created and ready to use in 5-10 minutes:

The URLs and endpoints that I use to access my broker are all available at a click:

I can access the ActiveMQ Web Console at the link provided:

The broker publishes instance, topic, and queue metrics to CloudWatch. Here are the instance metrics:

Available Now
Amazon MQ is available now and you can start using it today in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), and Asia Pacific (Sydney) Regions.

The AWS Free Tier lets you use a single-AZ micro instance for up to 750 hours and to store up to 1 gigabyte each month, for one year. After that, billing is based on instance-hours and message storage, plus charges Internet data transfer if the broker is accessed from outside of AWS.

Jeff;

Introducing AWS AppSync – Build data-driven apps with real-time and off-line capabilities

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/introducing-amazon-appsync/

In this day and age, it is almost impossible to do without our mobile devices and the applications that help make our lives easier. As our dependency on our mobile phone grows, the mobile application market has exploded with millions of apps vying for our attention. For mobile developers, this means that we must ensure that we build applications that provide the quality, real-time experiences that app users desire.  Therefore, it has become essential that mobile applications are developed to include features such as multi-user data synchronization, offline network support, and data discovery, just to name a few.  According to several articles, I read recently about mobile development trends on publications like InfoQ, DZone, and the mobile development blog AlleviateTech, one of the key elements in of delivering the aforementioned capabilities is with cloud-driven mobile applications.  It seems that this is especially true, as it related to mobile data synchronization and data storage.

That being the case, it is a perfect time for me to announce a new service for building innovative mobile applications that are driven by data-intensive services in the cloud; AWS AppSync. AWS AppSync is a fully managed serverless GraphQL service for real-time data queries, synchronization, communications and offline programming features. For those not familiar, let me briefly share some information about the open GraphQL specification. GraphQL is a responsive data query language and server-side runtime for querying data sources that allow for real-time data retrieval and dynamic query execution. You can use GraphQL to build a responsive API for use in when building client applications. GraphQL works at the application layer and provides a type system for defining schemas. These schemas serve as specifications to define how operations should be performed on the data and how the data should be structured when retrieved. Additionally, GraphQL has a declarative coding model which is supported by many client libraries and frameworks including React, React Native, iOS, and Android.

Now the power of the GraphQL open standard query language is being brought to you in a rich managed service with AWS AppSync.  With AppSync developers can simplify the retrieval and manipulation of data across multiple data sources with ease, allowing them to quickly prototype, build and create robust, collaborative, multi-user applications. AppSync keeps data updated when devices are connected, but enables developers to build solutions that work offline by caching data locally and synchronizing local data when connections become available.

Let’s discuss some key concepts of AWS AppSync and how the service works.

AppSync Concepts

  • AWS AppSync Client: service client that defines operations, wraps authorization details of requests, and manage offline logic.
  • Data Source: the data storage system or a trigger housing data
  • Identity: a set of credentials with permissions and identification context provided with requests to GraphQL proxy
  • GraphQL Proxy: the GraphQL engine component for processing and mapping requests, handling conflict resolution, and managing Fine Grained Access Control
  • Operation: one of three GraphQL operations supported in AppSync
    • Query: a read-only fetch call to the data
    • Mutation: a write of the data followed by a fetch,
    • Subscription: long-lived connections that receive data in response to events.
  • Action: a notification to connected subscribers from a GraphQL subscription.
  • Resolver: function using request and response mapping templates that converts and executes payload against data source

How It Works

A schema is created to define types and capabilities of the desired GraphQL API and tied to a Resolver function.  The schema can be created to mirror existing data sources or AWS AppSync can create tables automatically based the schema definition. Developers can also use GraphQL features for data discovery without having knowledge of the backend data sources. After a schema definition is established, an AWS AppSync client can be configured with an operation request, like a Query operation. The client submits the operation request to GraphQL Proxy along with an identity context and credentials. The GraphQL Proxy passes this request to the Resolver which maps and executes the request payload against pre-configured AWS data services like an Amazon DynamoDB table, an AWS Lambda function, or a search capability using Amazon Elasticsearch. The Resolver executes calls to one or all of these services within a single network call minimizing CPU cycles and bandwidth needs and returns the response to the client. Additionally, the client application can change data requirements in code on demand and the AppSync GraphQL API will dynamically map requests for data accordingly, allowing prototyping and faster development.

In order to take a quick peek at the service, I’ll go to the AWS AppSync console. I’ll click the Create API button to get started.

 

When the Create new API screen opens, I’ll give my new API a name, TarasTestApp, and since I am just exploring the new service I will select the Sample schema option.  You may notice from the informational dialog box on the screen that in using the sample schema, AWS AppSync will automatically create the DynamoDB tables and the IAM roles for me.It will also deploy the TarasTestApp API on my behalf.  After review of the sample schema provided by the console, I’ll click the Create button to create my test API.

After the TaraTestApp API has been created and the associated AWS resources provisioned on my behalf, I can make updates to the schema, data source, or connect my data source(s) to a resolver. I also can integrate my GraphQL API into an iOS, Android, Web, or React Native application by cloning the sample repo from GitHub and downloading the accompanying GraphQL schema.  These application samples are great to help get you started and they are pre-configured to function in offline scenarios.

If I select the Schema menu option on the console, I can update and view the TarasTestApp GraphQL API schema.


Additionally, if I select the Data Sources menu option in the console, I can see the existing data sources.  Within this screen, I can update, delete, or add data sources if I so desire.

Next, I will select the Query menu option which takes me to the console tool for writing and testing queries. Since I chose the sample schema and the AWS AppSync service did most of the heavy lifting for me, I’ll try a query against my new GraphQL API.

I’ll use a mutation to add data for the event type in my schema. Since this is a mutation and it first writes data and then does a read of the data, I want the query to return values for name and where.

If I go to the DynamoDB table created for the event type in the schema, I will see that the values from my query have been successfully written into the table. Now that was a pretty simple task to write and retrieve data based on a GraphQL API schema from a data source, don’t you think.


 Summary

AWS AppSync is currently in AWS AppSync is in Public Preview and you can sign up today. It supports development for iOS, Android, and JavaScript applications. You can take advantage of this managed GraphQL service by going to the AWS AppSync console or learn more by reviewing more details about the service by reading a tutorial in the AWS documentation for the service or checking out our AWS AppSync Developer Guide.

Tara

 

Using Amazon Redshift Spectrum, Amazon Athena, and AWS Glue with Node.js in Production

Post Syndicated from Rafi Ton original https://aws.amazon.com/blogs/big-data/using-amazon-redshift-spectrum-amazon-athena-and-aws-glue-with-node-js-in-production/

This is a guest post by Rafi Ton, founder and CEO of NUVIAD. NUVIAD is, in their own words, “a mobile marketing platform providing professional marketers, agencies and local businesses state of the art tools to promote their products and services through hyper targeting, big data analytics and advanced machine learning tools.”

At NUVIAD, we’ve been using Amazon Redshift as our main data warehouse solution for more than 3 years.

We store massive amounts of ad transaction data that our users and partners analyze to determine ad campaign strategies. When running real-time bidding (RTB) campaigns in large scale, data freshness is critical so that our users can respond rapidly to changes in campaign performance. We chose Amazon Redshift because of its simplicity, scalability, performance, and ability to load new data in near real time.

Over the past three years, our customer base grew significantly and so did our data. We saw our Amazon Redshift cluster grow from three nodes to 65 nodes. To balance cost and analytics performance, we looked for a way to store large amounts of less-frequently analyzed data at a lower cost. Yet, we still wanted to have the data immediately available for user queries and to meet their expectations for fast performance. We turned to Amazon Redshift Spectrum.

In this post, I explain the reasons why we extended Amazon Redshift with Redshift Spectrum as our modern data warehouse. I cover how our data growth and the need to balance cost and performance led us to adopt Redshift Spectrum. I also share key performance metrics in our environment, and discuss the additional AWS services that provide a scalable and fast environment, with data available for immediate querying by our growing user base.

Amazon Redshift as our foundation

The ability to provide fresh, up-to-the-minute data to our customers and partners was always a main goal with our platform. We saw other solutions provide data that was a few hours old, but this was not good enough for us. We insisted on providing the freshest data possible. For us, that meant loading Amazon Redshift in frequent micro batches and allowing our customers to query Amazon Redshift directly to get results in near real time.

The benefits were immediately evident. Our customers could see how their campaigns performed faster than with other solutions, and react sooner to the ever-changing media supply pricing and availability. They were very happy.

However, this approach required Amazon Redshift to store a lot of data for long periods, and our data grew substantially. In our peak, we maintained a cluster running 65 DC1.large nodes. The impact on our Amazon Redshift cluster was evident, and we saw our CPU utilization grow to 90%.

Why we extended Amazon Redshift to Redshift Spectrum

Redshift Spectrum gives us the ability to run SQL queries using the powerful Amazon Redshift query engine against data stored in Amazon S3, without needing to load the data. With Redshift Spectrum, we store data where we want, at the cost that we want. We have the data available for analytics when our users need it with the performance they expect.

Seamless scalability, high performance, and unlimited concurrency

Scaling Redshift Spectrum is a simple process. First, it allows us to leverage Amazon S3 as the storage engine and get practically unlimited data capacity.

Second, if we need more compute power, we can leverage Redshift Spectrum’s distributed compute engine over thousands of nodes to provide superior performance – perfect for complex queries running against massive amounts of data.

Third, all Redshift Spectrum clusters access the same data catalog so that we don’t have to worry about data migration at all, making scaling effortless and seamless.

Lastly, since Redshift Spectrum distributes queries across potentially thousands of nodes, they are not affected by other queries, providing much more stable performance and unlimited concurrency.

Keeping it SQL

Redshift Spectrum uses the same query engine as Amazon Redshift. This means that we did not need to change our BI tools or query syntax, whether we used complex queries across a single table or joins across multiple tables.

An interesting capability introduced recently is the ability to create a view that spans both Amazon Redshift and Redshift Spectrum external tables. With this feature, you can query frequently accessed data in your Amazon Redshift cluster and less-frequently accessed data in Amazon S3, using a single view.

Leveraging Parquet for higher performance

Parquet is a columnar data format that provides superior performance and allows Redshift Spectrum (or Amazon Athena) to scan significantly less data. With less I/O, queries run faster and we pay less per query. You can read all about Parquet at https://parquet.apache.org/ or https://en.wikipedia.org/wiki/Apache_Parquet.

Lower cost

From a cost perspective, we pay standard rates for our data in Amazon S3, and only small amounts per query to analyze data with Redshift Spectrum. Using the Parquet format, we can significantly reduce the amount of data scanned. Our costs are now lower, and our users get fast results even for large complex queries.

What we learned about Amazon Redshift vs. Redshift Spectrum performance

When we first started looking at Redshift Spectrum, we wanted to put it to the test. We wanted to know how it would compare to Amazon Redshift, so we looked at two key questions:

  1. What is the performance difference between Amazon Redshift and Redshift Spectrum on simple and complex queries?
  2. Does the data format impact performance?

During the migration phase, we had our dataset stored in Amazon Redshift and S3 as CSV/GZIP and as Parquet file formats. We tested three configurations:

  • Amazon Redshift cluster with 28 DC1.large nodes
  • Redshift Spectrum using CSV/GZIP
  • Redshift Spectrum using Parquet

We performed benchmarks for simple and complex queries on one month’s worth of data. We tested how much time it took to perform the query, and how consistent the results were when running the same query multiple times. The data we used for the tests was already partitioned by date and hour. Properly partitioning the data improves performance significantly and reduces query times.

Simple query

First, we tested a simple query aggregating billing data across a month:

SELECT 
  user_id, 
  count(*) AS impressions, 
  SUM(billing)::decimal /1000000 AS billing 
FROM <table_name> 
WHERE 
  date >= '2017-08-01' AND 
  date <= '2017-08-31'  
GROUP BY 
  user_id;

We ran the same query seven times and measured the response times (red marking the longest time and green the shortest time):

Execution Time (seconds)
  Amazon Redshift Redshift Spectrum
CSV
Redshift Spectrum Parquet
Run #1 39.65 45.11 11.92
Run #2 15.26 43.13 12.05
Run #3 15.27 46.47 13.38
Run #4 21.22 51.02 12.74
Run #5 17.27 43.35 11.76
Run #6 16.67 44.23 13.67
Run #7 25.37 40.39 12.75
Average 21.53  44.82 12.61

For simple queries, Amazon Redshift performed better than Redshift Spectrum, as we thought, because the data is local to Amazon Redshift.

What was surprising was that using Parquet data format in Redshift Spectrum significantly beat ‘traditional’ Amazon Redshift performance. For our queries, using Parquet data format with Redshift Spectrum delivered an average 40% performance gain over traditional Amazon Redshift. Furthermore, Redshift Spectrum showed high consistency in execution time with a smaller difference between the slowest run and the fastest run.

Comparing the amount of data scanned when using CSV/GZIP and Parquet, the difference was also significant:

Data Scanned (GB)
CSV (Gzip) 135.49
Parquet 2.83

Because we pay only for the data scanned by Redshift Spectrum, the cost saving of using Parquet is evident and substantial.

Complex query

Next, we compared the same three configurations with a complex query.

Execution Time (seconds)
  Amazon Redshift Redshift Spectrum CSV Redshift Spectrum Parquet
Run #1 329.80 84.20 42.40
Run #2 167.60 65.30 35.10
Run #3 165.20 62.20 23.90
Run #4 273.90 74.90 55.90
Run #5 167.70 69.00 58.40
Average 220.84 71.12 43.14

This time, Redshift Spectrum using Parquet cut the average query time by 80% compared to traditional Amazon Redshift!

Bottom line: For complex queries, Redshift Spectrum provided a 67% performance gain over Amazon Redshift. Using the Parquet data format, Redshift Spectrum delivered an 80% performance improvement over Amazon Redshift. For us, this was substantial.

Optimizing the data structure for different workloads

Because the cost of S3 is relatively inexpensive and we pay only for the data scanned by each query, we believe that it makes sense to keep our data in different formats for different workloads and different analytics engines. It is important to note that we can have any number of tables pointing to the same data on S3. It all depends on how we partition the data and update the table partitions.

Data permutations

For example, we have a process that runs every minute and generates statistics for the last minute of data collected. With Amazon Redshift, this would be done by running the query on the table with something as follows:

SELECT 
  user, 
  COUNT(*) 
FROM 
  events_table 
WHERE 
  ts BETWEEN ‘2017-08-01 14:00:00’ AND ‘2017-08-01 14:00:59’ 
GROUP BY 
  user;

(Assuming ‘ts’ is your column storing the time stamp for each event.)

With Redshift Spectrum, we pay for the data scanned in each query. If the data is partitioned by the minute instead of the hour, a query looking at one minute would be 1/60th the cost. If we use a temporary table that points only to the data of the last minute, we save that unnecessary cost.

Creating Parquet data efficiently

On the average, we have 800 instances that process our traffic. Each instance sends events that are eventually loaded into Amazon Redshift. When we started three years ago, we would offload data from each server to S3 and then perform a periodic copy command from S3 to Amazon Redshift.

Recently, Amazon Kinesis Firehose added the capability to offload data directly to Amazon Redshift. While this is now a viable option, we kept the same collection process that worked flawlessly and efficiently for three years.

This changed, however, when we incorporated Redshift Spectrum. With Redshift Spectrum, we needed to find a way to:

  • Collect the event data from the instances.
  • Save the data in Parquet format.
  • Partition the data effectively.

To accomplish this, we save the data as CSV and then transform it to Parquet. The most effective method to generate the Parquet files is to:

  1. Send the data in one-minute intervals from the instances to Kinesis Firehose with an S3 temporary bucket as the destination.
  2. Aggregate hourly data and convert it to Parquet using AWS Lambda and AWS Glue.
  3. Add the Parquet data to S3 by updating the table partitions.

With this new process, we had to give more attention to validating the data before we sent it to Kinesis Firehose, because a single corrupted record in a partition fails queries on that partition.

Data validation

To store our click data in a table, we considered the following SQL create table command:

create external TABLE spectrum.blog_clicks (
    user_id varchar(50),
    campaign_id varchar(50),
    os varchar(50),
    ua varchar(255),
    ts bigint,
    billing float
)
partitioned by (date date, hour smallint)  
stored as parquet
location 's3://nuviad-temp/blog/clicks/';

The above statement defines a new external table (all Redshift Spectrum tables are external tables) with a few attributes. We stored ‘ts’ as a Unix time stamp and not as Timestamp, and billing data is stored as float and not decimal (more on that later). We also said that the data is partitioned by date and hour, and then stored as Parquet on S3.

First, we need to get the table definitions. This can be achieved by running the following query:

SELECT 
  * 
FROM 
  svv_external_columns 
WHERE 
  tablename = 'blog_clicks';

This query lists all the columns in the table with their respective definitions:

schemaname tablename columnname external_type columnnum part_key
spectrum blog_clicks user_id varchar(50) 1 0
spectrum blog_clicks campaign_id varchar(50) 2 0
spectrum blog_clicks os varchar(50) 3 0
spectrum blog_clicks ua varchar(255) 4 0
spectrum blog_clicks ts bigint 5 0
spectrum blog_clicks billing double 6 0
spectrum blog_clicks date date 7 1
spectrum blog_clicks hour smallint 8 2

Now we can use this data to create a validation schema for our data:

const rtb_request_schema = {
    "name": "clicks",
    "items": {
        "user_id": {
            "type": "string",
            "max_length": 100
        },
        "campaign_id": {
            "type": "string",
            "max_length": 50
        },
        "os": {
            "type": "string",
            "max_length": 50            
        },
        "ua": {
            "type": "string",
            "max_length": 255            
        },
        "ts": {
            "type": "integer",
            "min_value": 0,
            "max_value": 9999999999999
        },
        "billing": {
            "type": "float",
            "min_value": 0,
            "max_value": 9999999999999
        }
    }
};

Next, we create a function that uses this schema to validate data:

function valueIsValid(value, item_schema) {
    if (schema.type == 'string') {
        return (typeof value == 'string' && value.length <= schema.max_length);
    }
    else if (schema.type == 'integer') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'float' || schema.type == 'double') {
        return (typeof value == 'number' && value >= schema.min_value && value <= schema.max_value);
    }
    else if (schema.type == 'boolean') {
        return typeof value == 'boolean';
    }
    else if (schema.type == 'timestamp') {
        return (new Date(value)).getTime() > 0;
    }
    else {
        return true;
    }
}

Near real-time data loading with Kinesis Firehose

On Kinesis Firehose, we created a new delivery stream to handle the events as follows:

Delivery stream name: events
Source: Direct PUT
S3 bucket: nuviad-events
S3 prefix: rtb/
IAM role: firehose_delivery_role_1
Data transformation: Disabled
Source record backup: Disabled
S3 buffer size (MB): 100
S3 buffer interval (sec): 60
S3 Compression: GZIP
S3 Encryption: No Encryption
Status: ACTIVE
Error logging: Enabled

This delivery stream aggregates event data every minute, or up to 100 MB, and writes the data to an S3 bucket as a CSV/GZIP compressed file. Next, after we have the data validated, we can safely send it to our Kinesis Firehose API:

if (validated) {
    let itemString = item.join('|')+'\n'; //Sending csv delimited by pipe and adding new line

    let params = {
        DeliveryStreamName: 'events',
        Record: {
            Data: itemString
        }
    };

    firehose.putRecord(params, function(err, data) {
        if (err) {
            console.error(err, err.stack);        
        }
        else {
            // Continue to your next step 
        }
    });
}

Now, we have a single CSV file representing one minute of event data stored in S3. The files are named automatically by Kinesis Firehose by adding a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to S3. Because we use the date and hour as partitions, we need to change the file naming and location to fit our Redshift Spectrum schema.

Automating data distribution using AWS Lambda

We created a simple Lambda function triggered by an S3 put event that copies the file to a different location (or locations), while renaming it to fit our data structure and processing flow. As mentioned before, the files generated by Kinesis Firehose are structured in a pre-defined hierarchy, such as:

S3://your-bucket/your-prefix/2017/08/01/20/events-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz

All we need to do is parse the object name and restructure it as we see fit. In our case, we did the following (the event is an object received in the Lambda function with all the data about the object written to S3):

/*
	object key structure in the event object:
your-prefix/2017/08/01/20/event-4-2017-08-01-20-06-06-536f5c40-6893-4ee4-907d-81e4d3b09455.gz
	*/

let key_parts = event.Records[0].s3.object.key.split('/'); 

let event_type = key_parts[0];
let date = key_parts[1] + '-' + key_parts[2] + '-' + key_parts[3];
let hour = key_parts[4];
if (hour.indexOf('0') == 0) {
 		hour = parseInt(hour, 10) + '';
}
    
let parts1 = key_parts[5].split('-');
let minute = parts1[7];
if (minute.indexOf('0') == 0) {
        minute = parseInt(minute, 10) + '';
}

Now, we can redistribute the file to the two destinations we need—one for the minute processing task and the other for hourly aggregation:

    copyObjectToHourlyFolder(event, date, hour, minute)
        .then(copyObjectToMinuteFolder.bind(null, event, date, hour, minute))
        .then(addPartitionToSpectrum.bind(null, event, date, hour, minute))
        .then(deleteOldMinuteObjects.bind(null, event))
        .then(deleteStreamObject.bind(null, event))        
        .then(result => {
            callback(null, { message: 'done' });            
        })
        .catch(err => {
            console.error(err);
            callback(null, { message: err });            
        }); 

Kinesis Firehose stores the data in a temporary folder. We copy the object to another folder that holds the data for the last processed minute. This folder is connected to a small Redshift Spectrum table where the data is being processed without needing to scan a much larger dataset. We also copy the data to a folder that holds the data for the entire hour, to be later aggregated and converted to Parquet.

Because we partition the data by date and hour, we created a new partition on the Redshift Spectrum table if the processed minute is the first minute in the hour (that is, minute 0). We ran the following:

ALTER TABLE 
  spectrum.events 
ADD partition
  (date='2017-08-01', hour=0) 
  LOCATION 's3://nuviad-temp/events/2017-08-01/0/';

After the data is processed and added to the table, we delete the processed data from the temporary Kinesis Firehose storage and from the minute storage folder.

Migrating CSV to Parquet using AWS Glue and Amazon EMR

The simplest way we found to run an hourly job converting our CSV data to Parquet is using Lambda and AWS Glue (and thanks to the awesome AWS Big Data team for their help with this).

Creating AWS Glue jobs

What this simple AWS Glue script does:

  • Gets parameters for the job, date, and hour to be processed
  • Creates a Spark EMR context allowing us to run Spark code
  • Reads CSV data into a DataFrame
  • Writes the data as Parquet to the destination S3 bucket
  • Adds or modifies the Redshift Spectrum / Amazon Athena table partition for the table
import sys
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','day_partition_key', 'hour_partition_key', 'day_partition_value', 'hour_partition_value' ])

#day_partition_key = "partition_0"
#hour_partition_key = "partition_1"
#day_partition_value = "2017-08-01"
#hour_partition_value = "0"

day_partition_key = args['day_partition_key']
hour_partition_key = args['hour_partition_key']
day_partition_value = args['day_partition_value']
hour_partition_value = args['hour_partition_value']

print("Running for " + day_partition_value + "/" + hour_partition_value)

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

df = spark.read.option("delimiter","|").csv("s3://nuviad-temp/events/"+day_partition_value+"/"+hour_partition_value)
df.registerTempTable("data")

df1 = spark.sql("select _c0 as user_id, _c1 as campaign_id, _c2 as os, _c3 as ua, cast(_c4 as bigint) as ts, cast(_c5 as double) as billing from data")

df1.repartition(1).write.mode("overwrite").parquet("s3://nuviad-temp/parquet/"+day_partition_value+"/hour="+hour_partition_value)

client = boto3.client('athena', region_name='us-east-1')

response = client.start_query_execution(
    QueryString='alter table parquet_events add if not exists partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ')  location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

response = client.start_query_execution(
    QueryString='alter table parquet_events partition(' + day_partition_key + '=\'' + day_partition_value + '\',' + hour_partition_key + '=' + hour_partition_value + ') set location \'s3://nuviad-temp/parquet/' + day_partition_value + '/hour=' + hour_partition_value + '\'' ,
    QueryExecutionContext={
        'Database': 'spectrumdb'
    },
    ResultConfiguration={
        'OutputLocation': 's3://nuviad-temp/convertresults'
    }
)

job.commit()

Note: Because Redshift Spectrum and Athena both use the AWS Glue Data Catalog, we could use the Athena client to add the partition to the table.

Here are a few words about float, decimal, and double. Using decimal proved to be more challenging than we expected, as it seems that Redshift Spectrum and Spark use them differently. Whenever we used decimal in Redshift Spectrum and in Spark, we kept getting errors, such as:

S3 Query Exception (Fetch). Task failed due to an internal error. File 'https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parquet has an incompatible Parquet schema for column 's3://nuviad-events/events.lat'. Column type: DECIMAL(18, 8), Parquet schema:\noptional float lat [i:4 d:1 r:0]\n (https://s3-external-1.amazonaws.com/nuviad-temp/events/2017-08-01/hour=2/part-00017-48ae5b6b-906e-4875-8cde-bc36c0c6d0ca.c000.snappy.parq

We had to experiment with a few floating-point formats until we found that the only combination that worked was to define the column as double in the Spark code and float in Spectrum. This is the reason you see billing defined as float in Spectrum and double in the Spark code.

Creating a Lambda function to trigger conversion

Next, we created a simple Lambda function to trigger the AWS Glue script hourly using a simple Python code:

import boto3
import json
from datetime import datetime, timedelta
 
client = boto3.client('glue')
 
def lambda_handler(event, context):
    last_hour_date_time = datetime.now() - timedelta(hours = 1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d") 
    hour_partition_value = last_hour_date_time.strftime("%-H") 
    response = client.start_job_run(
    JobName='convertEventsParquetHourly',
    Arguments={
         '--day_partition_key': 'date',
         '--hour_partition_key': 'hour',
         '--day_partition_value': day_partition_value,
         '--hour_partition_value': hour_partition_value
         }
    )

Using Amazon CloudWatch Events, we trigger this function hourly. This function triggers an AWS Glue job named ‘convertEventsParquetHourly’ and runs it for the previous hour, passing job names and values of the partitions to process to AWS Glue.

Redshift Spectrum and Node.js

Our development stack is based on Node.js, which is well-suited for high-speed, light servers that need to process a huge number of transactions. However, a few limitations of the Node.js environment required us to create workarounds and use other tools to complete the process.

Node.js and Parquet

The lack of Parquet modules for Node.js required us to implement an AWS Glue/Amazon EMR process to effectively migrate data from CSV to Parquet. We would rather save directly to Parquet, but we couldn’t find an effective way to do it.

One interesting project in the works is the development of a Parquet NPM by Marc Vertes called node-parquet (https://www.npmjs.com/package/node-parquet). It is not in a production state yet, but we think it would be well worth following the progress of this package.

Timestamp data type

According to the Parquet documentation, Timestamp data are stored in Parquet as 64-bit integers. However, JavaScript does not support 64-bit integers, because the native number type is a 64-bit double, giving only 53 bits of integer range.

The result is that you cannot store Timestamp correctly in Parquet using Node.js. The solution is to store Timestamp as string and cast the type to Timestamp in the query. Using this method, we did not witness any performance degradation whatsoever.

Lessons learned

You can benefit from our trial-and-error experience.

Lesson #1: Data validation is critical

As mentioned earlier, a single corrupt entry in a partition can fail queries running against this partition, especially when using Parquet, which is harder to edit than a simple CSV file. Make sure that you validate your data before scanning it with Redshift Spectrum.

Lesson #2: Structure and partition data effectively

One of the biggest benefits of using Redshift Spectrum (or Athena for that matter) is that you don’t need to keep nodes up and running all the time. You pay only for the queries you perform and only for the data scanned per query.

Keeping different permutations of your data for different queries makes a lot of sense in this case. For example, you can partition your data by date and hour to run time-based queries, and also have another set partitioned by user_id and date to run user-based queries. This results in faster and more efficient performance of your data warehouse.

Storing data in the right format

Use Parquet whenever you can. The benefits of Parquet are substantial. Faster performance, less data to scan, and much more efficient columnar format. However, it is not supported out-of-the-box by Kinesis Firehose, so you need to implement your own ETL. AWS Glue is a great option.

Creating small tables for frequent tasks

When we started using Redshift Spectrum, we saw our Amazon Redshift costs jump by hundreds of dollars per day. Then we realized that we were unnecessarily scanning a full day’s worth of data every minute. Take advantage of the ability to define multiple tables on the same S3 bucket or folder, and create temporary and small tables for frequent queries.

Lesson #3: Combine Athena and Redshift Spectrum for optimal performance

Moving to Redshift Spectrum also allowed us to take advantage of Athena as both use the AWS Glue Data Catalog. Run fast and simple queries using Athena while taking advantage of the advanced Amazon Redshift query engine for complex queries using Redshift Spectrum.

Redshift Spectrum excels when running complex queries. It can push many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer, so that queries use much less of your cluster’s processing capacity.

Lesson #4: Sort your Parquet data within the partition

We achieved another performance improvement by sorting data within the partition using sortWithinPartitions(sort_field). For example:

df.repartition(1).sortWithinPartitions("campaign_id")…

Conclusion

We were extremely pleased with using Amazon Redshift as our core data warehouse for over three years. But as our client base and volume of data grew substantially, we extended Amazon Redshift to take advantage of scalability, performance, and cost with Redshift Spectrum.

Redshift Spectrum lets us scale to virtually unlimited storage, scale compute transparently, and deliver super-fast results for our users. With Redshift Spectrum, we store data where we want at the cost we want, and have the data available for analytics when our users need it with the performance they expect.


About the Author

With 7 years of experience in the AdTech industry and 15 years in leading technology companies, Rafi Ton is the founder and CEO of NUVIAD. He enjoys exploring new technologies and putting them to use in cutting edge products and services, in the real world generating real money. Being an experienced entrepreneur, Rafi believes in practical-programming and fast adaptation of new technologies to achieve a significant market advantage.

 

 

AWS Media Services – Process, Store, and Monetize Cloud-Based Video

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-media-services-process-store-and-monetize-cloud-based-video/

Do you remember what web video was like in the early days? Standalone players, video no larger than a postage stamp, slow & cantankerous connections, overloaded servers, and the ever-present buffering messages were the norm less than two decades ago.

Today, thanks to technological progress and a broad array of standards, things are a lot better. Video consumers are now in control. They use devices of all shapes, sizes, and vintages to enjoy live and recorded content that is broadcast, streamed, or sent over-the-top (OTT, as they say), and expect immediate access to content that captures and then holds their attention. Meeting these expectations presents a challenge for content creators and distributors. Instead of generating video in a one-size-fits-all format, they (or their media servers) must be prepared to produce video that spans a broad range of sizes, formats, and bit rates, taking care to be ready to deal with planned or unplanned surges in demand. In the face of all of this complexity, they must backstop their content with a monetization model that supports the content and the infrastructure to deliver it.

New AWS Media Services
Today we are launching an array of broadcast-quality media services, each designed to address one or more aspects of the challenge that I outlined above. You can use them together to build a complete end-to-end video solution or you can use one or more in building-block style. In true AWS fashion, you can spend more time innovating and less time setting up and running infrastructure, leaving you ready to focus on creating, delivering, and monetizing your content. The services are all elastic, allowing you to ramp up processing power, connections, and storage and giving you the ability to handle million-user (and beyond) spikes with ease.

Here are the services (all accessible from a set of interactive consoles as well as through a comprehensive set of APIs):

AWS Elemental MediaConvert – File-based transcoding for OTT, broadcast, or archiving, with support for a long list of formats and codecs. Features include multi-channel audio, graphic overlays, closed captioning, and several DRM options.

AWS Elemental MediaLive – Live encoding to deliver video streams in real time to both televisions and multiscreen devices. Allows you to deploy highly reliable live channels in minutes, with full control over encoding parameters. It supports ad insertion, multi-channel audio, graphic overlays, and closed captioning.

AWS Elemental MediaPackage – Video origination and just-in-time packaging. Starting from a single input, produces output for multiple devices representing a long list of current and legacy formats. Supports multiple monetization models, time-shifted live streaming, ad insertion, DRM, and blackout management.

AWS Elemental MediaStore – Media-optimized storage that enables high performance and low latency applications such as live streaming, while taking advantage of the scale and durability of Amazon Simple Storage Service (S3).

AWS Elemental MediaTailor – Monetization service that supports ad serving and server-side ad insertion, a broad range of devices, transcoding, and accurate reporting of server-side and client-side ad insertion.

Instead of listing out all of the features in the sections below, I’ve simply included as many screen shots as possible with the expectation that this will give you a better sense of the rich set of features, parameters, and settings that you get with this set of services.

AWS Elemental MediaConvert
MediaConvert allows you to transcode content that is stored in files. You can process individual files or entire media libraries, or anything in-between. You simply create a conversion job that specifies the content and the desired outputs, and submit it to MediaConvert. There’s no software to install or patch and the service scales to meet your needs without affecting turnaround time or performance.

The MediaConvert Console lets you manage Output presets, Job templates, Queues, and Jobs:

You can use a built-in system preset or you can make one of your own. You have full control over the settings when you make your own:

Jobs templates are named, and produce one or more output groups. You can add a new group to a template with a click:

When everything is ready to go, you create a job and make some final selections, then click on Create:

Each account starts with a default queue for jobs, where incoming work is processed in parallel using all processing resources available to the account. Adding queues does not add processing resources, but does cause them to be apportioned across queues. You can temporarily pause one queue in order to devote more resources to the others. You can submit jobs to paused queues and you can also cancel any that have yet to start.

Pricing for this service is based on the amount of video that you process and the features that you use.

AWS Elemental MediaLive
This service is for live encoding, and can be run 24×7. MediaLive channels are deployed on redundant resources distributed in two physically separated Availability Zones in order to provide the reliability expected by our customers in the broadcast industry. You can specify your inputs and define your channels in the MediaLive Console:

After you create an Input, you create a Channel and attach it to the Input:

You have full control over the settings for each channel:

 

AWS Elemental MediaPackage
This service lets you deliver video to many devices from a single source. It focuses on protection and just-in-time packaging, giving you the ability to provide your users with the desired content on the device of their choice. You simply create a channel to get started:

Then you add one or more endpoints. Once again, plenty of options and full control, including a startover window and a time delay:

You find the input URL, user name, and password for your channel and route your live video stream to it for packaging:

AWS Elemental MediaStore
MediaStore offers the performance, consistency, and latency required for live and on-demand media delivery. Objects are written and read into a new “temporal” tier of object storage for a limited amount of time, then move silently into S3 for long-lived durability. You simply create a storage container to group your media content:

The container is available within a minute or so:

Like S3 buckets, MediaStore containers have access policies and no limits on the number of objects or storage capacity.

MediaStore helps you to take full advantage of S3 by managing the object key names so as to maximize storage and retrieval throughput, in accord with the Request Rate and Performance Considerations.

AWS Elemental MediaTailor
This service takes care of server-side ad insertion while providing a broadcast-quality viewer experience by transcoding ad assets on the fly. Your customer’s video player asks MediaTailor for a playlist. MediaTailor, in turn, calls your Ad Decision Server and returns a playlist that references the origin server for your original video and the ads recommended by the Ad Decision Server. The video player makes all of its requests to a single endpoint in order to ensure that client-side ad-blocking is ineffective. You simply create a MediaTailor Configuration:

Context information is passed to the Ad Decision Server in the URL:

Despite the length of this post I have barely scratched the surface of the AWS Media Services. Once AWS re:Invent is in the rear view mirror I hope to do a deep dive and show you how to use each of these services.

Available Now
The entire set of AWS Media Services is available now and you can start using them today! Pricing varies by service, but is built around a pay-as-you-go model.

Jeff;

A Thanksgiving Carol: How Those Smart Engineers at Twitter Screwed Me

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/11/a-thanksgiving-carol-how-those-smart.html

Thanksgiving Holiday is a time for family and cheer. Well, a time for family. It’s the holiday where we ask our doctor relatives to look at that weird skin growth, and for our geek relatives to fix our computers. This tale is of such computer support, and how the “smart” engineers at Twitter have ruined this for life.

My mom is smart, but not a good computer user. I get my enthusiasm for science and math from my mother, and she has no problem understanding the science of computers. She keeps up when I explain Bitcoin. But she has difficulty using computers. She has this emotional, irrational belief that computers are out to get her.

This makes helping her difficult. Every problem is described in terms of what the computer did to her, not what she did to her computer. It’s the computer that needs to be fixed, instead of the user. When I showed her the “haveibeenpwned.com” website (part of my tips for securing computers), it showed her Tumblr password had been hacked. She swore she never created a Tumblr account — that somebody or something must have done it for her. Except, I was there five years ago and watched her create it.

Another example is how GMail is deleting her emails for no reason, corrupting them, and changing the spelling of her words. She emails the way an impatient teenager texts — all of us in the family know the misspellings are not GMail’s fault. But I can’t help her with this because she keeps her GMail inbox clean, deleting all her messages, leaving no evidence behind. She has only a vague description of the problem that I can’t make sense of.

This last March, I tried something to resolve this. I configured her GMail to send a copy of all incoming messages to a new, duplicate account on my own email server. With evidence in hand, I would then be able solve what’s going on with her GMail. I’d be able to show her which steps she took, which buttons she clicked on, and what caused the weirdness she’s seeing.

Today, while the family was in a state of turkey-induced torpor, my mom brought up a problem with Twitter. She doesn’t use Twitter, she doesn’t have an account, but they keep sending tweets to her phone, about topics like Denzel Washington. And she said something about “peaches” I didn’t understand.

This is how the problem descriptions always start, chaotic, with mutually exclusive possibilities. If you don’t use Twitter, you don’t have the Twitter app installed, so how are you getting Tweets? Over much gnashing of teeth, it comes out that she’s getting emails from Twitter, not tweets, about Denzel Washington — to someone named “Peaches Graham”. Naturally, she can only describe these emails, because she’s already deleted them.

“Ah ha!”, I think. I’ve got the evidence! I’ll just log onto my duplicate email server, and grab the copies to prove to her it was something she did.

I find she is indeed receiving such emails, called “Moments”, about topics trending on Twitter. They are signed with “DKIM”, proving they are legitimate rather than from a hacker or spammer. The only way that can happen is if my mother signed up for Twitter, despite her protestations that she didn’t.

I look further back and find that there were also confirmation messages involved. Back in August, she got a typical Twitter account signup message. I am now seeing a little bit more of the story unfold with this “Peaches Graham” name on the account. It wasn’t my mother who initially signed up for Twitter, but Peaches, who misspelled the email address. It’s one of the reasons why the confirmation process exists, to make sure you spelled your email address correctly.

It’s now obvious my mom accidentally clicked on the [Confirm] button. I don’t have any proof she did, but it’s the only reasonable explanation. Otherwise, she wouldn’t have gotten the “Moments” messages. My mom disputed this, emphatically insisting she never clicked on the emails.

It’s at this point that I made a great mistake, saying:

“This sort of thing just doesn’t happen. Twitter has very smart engineers. What’s the chance they made the mistake here, or…”.

I recognized condescension of words as they came out of my mouth, but dug myself deeper with:

“…or that the user made the error?”

This was wrong to say even if I were right. I have no excuse. I mean, maybe I could argue that it’s really her fault, for not raising me right, but no, this is only on me.

Regardless of what caused the Twitter emails, the problem needs to be fixed. The solution is to take control of the Twitter account by using the password reset feature. I went to the Twitter login page, clicked on “Lost Password”, got the password reset message, and reset the password. I then reconfigured the account to never send anything to my mom again.

But when I logged in I got an error saying the account had not yet been confirmed. I paused. The family dog eyed me in wise silence. My mom hadn’t clicked on the [Confirm] button — the proof was right there. Moreover, it hadn’t been confirmed for a long time, since the account was created in 2011.

I interrogated my mother some more. It appears that this has been going on for years. She’s just been deleting the emails without opening them, both the “Confirmations” and the “Moments”. She made it clear she does it this way because her son (that would be me) instructs her to never open emails she knows are bad. That’s how she could be so certain she never clicked on the [Confirm] button — she never even opens the emails to see the contents.

My mom is a prolific email user. In the last eight months, I’ve received over 10,000 emails in the duplicate mailbox on my server. That’s a lot. She’s technically retired, but she volunteers for several charities, goes to community college classes, and is joining an anti-Trump protest group. She has a daily routine for triaging and processing all the emails that flow through her inbox.

So here’s the thing, and there’s no getting around it: my mom was right, on all particulars. She had done nothing, the computer had done it to her. It’s Twitter who is at fault, having continued to resend that confirmation email every couple months for six years. When Twitter added their controversial “Moments” feature a couple years back, somehow they turned on Notifications for accounts that technically didn’t fully exist yet.

Being right this time means she might be right the next time the computer does something to her without her touching anything. My attempts at making computers seem rational has failed. That they are driven by untrustworthy spirits is now a reasonable alternative.

Those “smart” engineers at Twitter screwed me. Continuing to send confirmation emails for six years is stupid. Sending Notifications to unconfirmed accounts is stupid. Yes, I know at the bottom of the message it gives a “Not my account” selection that she could have clicked on, but it’s small and easily missed. In any case, my mom never saw that option, because she’s been deleting the messages without opening them — for six years.

Twitter can fix their problem, but it’s not going to help mine. Forever more, I’ll be unable to convince my mom that the majority of her problems are because of user error, and not because the computer people are out to get her.