“Security is hard” is a tautology, especially in the fast-moving world of container orchestration. We have previously covered various aspects of Linux container security through, for example, the Clear Containers implementation or the broader question of Kubernetes and security, but those are mostly concerned with container isolation; they do not address the question of trusting a container’s contents. What is a container running? Who built it and when? Even assuming we have good programmers and solid isolation layers, propagating that good code around a Kubernetes cluster and making strong assertions on the integrity of that supply chain is far from trivial. The 2018 KubeCon + CloudNativeCon Europe event featured some projects that could eventually solve that problem.
Today, we’re excited to announce local build support in AWS CodeBuild.
AWS CodeBuild is a fully managed build service. There are no servers to provision and scale, or software to install, configure, and operate. You just specify the location of your source code, choose your build settings, and CodeBuild runs build scripts for compiling, testing, and packaging your code.
In this blog post, I’ll show you how to set up CodeBuild locally to build and test a sample Java application.
By building an application on a local machine you can:
Test the integrity and contents of a buildspec file locally.
Test and build an application locally before committing.
Identify and fix errors quickly from your local development environment.
Prerequisites
In this post, I am using AWS Cloud9 IDE as my development environment.
If you would like to use AWS Cloud9 as your IDE, follow the express setup steps in the AWS Cloud9 User Guide.
The AWS Cloud9 IDE comes with Docker and Git already installed. If you are going to use your laptop or desktop machine as your development environment, install Docker and Git before you start.
Note: When you run the CodeBuild local agent, you need to provide three environment variables, namely IMAGE_NAME, SOURCE, and ARTIFACTS (see the example after this list).
IMAGE_NAME: The name of your build environment image.
SOURCE: The absolute path to your source code directory.
ARTIFACTS: The absolute path to your artifact output folder.
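As a reference, here is a minimal sketch of how these variables might be passed when running the local agent as a Docker container. The build image tag, the local agent image name, and the paths shown are assumptions for illustration only; substitute the build image you built locally and the directories on your own machine.

# Hypothetical local agent invocation; adjust image names and paths to your environment
docker run -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e "IMAGE_NAME=aws/codebuild/java:openjdk-8" \
  -e "SOURCE=/home/ec2-user/environment/sample-web-app" \
  -e "ARTIFACTS=/home/ec2-user/environment/artifacts" \
  amazon/aws-codebuild-local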
When you run the sample project, you get a runtime error that says the YAML file does not exist. This is because a buildspec.yml file is not included in the sample web project. AWS CodeBuild requires a buildspec.yml to run a build. For more information about buildspec.yml, see Build Spec Example in the AWS CodeBuild User Guide.
Let’s add a buildspec.yml file with the following content to the sample-web-app folder and then rebuild the project.
version: 0.2

phases:
  build:
    commands:
      - echo Build started on `date`
      - mvn install

artifacts:
  files:
    - target/javawebdemo.war
This time your build should be successful. Upon successful execution, look in the /artifacts folder for the final built artifacts.zip file to validate.
Conclusion
In this blog post, I showed you how to quickly set up the CodeBuild local agent to build projects right from your local desktop machine or laptop. As you see, local builds can improve developer productivity by helping you identify and fix errors quickly.
I hope you found this post useful. Feel free to leave your feedback or suggestions in the comments.
At the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Mimi Zohar gave a presentation in the filesystem track on the Linux integrity subsystem. There is a lot of talk that the integrity subsystem (usually referred to as “IMA”, which is the integrity measurement architecture, though there is more to the subsystem) is complex and not documented well, she said. So she wanted to give an overview of the subsystem and then to discuss some filesystem-related concerns.
At the 2018 Linux Storage, Filesystem, and Memory Management Summit, Ted Ts’o introduced an integrity feature akin to dm-verity that targets Android, at least to start with. It is meant to protect the integrity of files on the system so that any tampering would be detectable. The initial use case would be for a certain special type of Android file, but other systems may find uses for it as well.
The very short version is that a UK bank, TSB, which had been merged into and then many years later was spun out of Lloyds Bank, was bought by the Spanish bank Banco Sabadell in 2015. Lloyds had continued to run the TSB systems and was to transfer them over to Sabadell over the weekend. It’s turned out to be an epic failure, and it’s not clear if and when this can be straightened out.
The more serious issue is that customers still can’t access their online accounts and, even more disconcerting, are sometimes being allowed into other people’s accounts, which suggests there are massive problems with data integrity. That’s a nightmare to sort out.
Even worse, the fact that this situation has persisted strongly suggests that Lloyds went ahead with the migration without allowing for a rollback.
Today, I’m very pleased to announce that AWS services comply with the General Data Protection Regulation (GDPR). This means that, in addition to benefiting from all of the measures that AWS already takes to maintain services security, customers can deploy AWS services as a key part of their GDPR compliance plans.
This announcement confirms we have completed the entirety of our GDPR service readiness audit, validating that all generally available services and features adhere to the high privacy bar and data protection standards required of data processors by the GDPR. We completed this work two months ahead of the May 25, 2018 enforcement deadline in order to give customers and APN partners an environment in which they can confidently build their own GDPR-compliant products, services, and solutions.
AWS’s GDPR service readiness is only part of the story; we are continuing to work alongside our customers and the AWS Partner Network (APN) to help on their journey toward GDPR compliance. Along with this announcement, I’d like to highlight the following examples of ways AWS can help you accelerate your own GDPR compliance efforts.
Security of Personal Data
During our GDPR service readiness audit, our security and compliance experts confirmed that AWS has in place effective technical and organizational measures for data processors to secure personal data in accordance with the GDPR. Security remains our highest priority, and we continue to innovate and invest in a high bar for security and compliance across all global operations. Our industry-leading functionality provides the foundation for our long list of internationally recognized certifications and accreditations, demonstrating compliance with rigorous international standards, such as ISO 27001 for technical measures, ISO 27017 for cloud security, ISO 27018 for cloud privacy, SOC 1, SOC 2 and SOC 3, PCI DSS Level 1, and EU-specific certifications such as BSI’s Common Cloud Computing Controls Catalogue (C5). AWS continues to pursue the certifications that assist our customers.
Compliance-enabling Services
Many requirements under the GDPR focus on ensuring effective control and protection of personal data. AWS services give you the capability to implement your own security measures in the ways you need in order to enable your compliance with the GDPR, including specific measures such as:
Encryption of personal data
Ability to ensure the ongoing confidentiality, integrity, availability, and resilience of processing systems and services
Ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident
Processes for regularly testing, assessing, and evaluating the effectiveness of technical and organizational measures for ensuring the security of processing
AWS also offers an advanced set of security and compliance services designed specifically to help handle the requirements of the GDPR. Numerous AWS services have particular significance for customers focusing on GDPR compliance, including:
Amazon GuardDuty – a security service featuring intelligent threat detection and continuous monitoring
Amazon Macie – a machine learning tool that helps discover and secure personal data stored in Amazon S3
Amazon Inspector – an automated security assessment service to help keep applications in conformity with best security practices
AWS Config Rules – a monitoring service that dynamically checks cloud resources for compliance with security rules
Additionally, we have published a whitepaper, “Navigating GDPR Compliance on AWS,” dedicated to this topic. This paper details how to tie GDPR concepts to specific AWS services, including those relating to monitoring, data access, and key management. Furthermore, our GDPR Center will give you access to the up-to-date resources you need to tackle requirements that directly support your GDPR efforts.
Compliant DPA
We offer a GDPR-compliant Data Processing Addendum (DPA), enabling you to comply with GDPR contractual obligations.
Conformity with a Code of Conduct
The GDPR introduces adherence to a “code of conduct” as a mechanism for demonstrating sufficient guarantees that the requirements it places on data processors are met. In this context, we previously announced compliance with the CISPE Code of Conduct. The CISPE Code of Conduct provides customers with additional assurances regarding their ability to fully control their data in a safe, secure, and compliant environment when they use services from providers like AWS. More detail about the CISPE Code of Conduct can be found at: https://aws.amazon.com/compliance/cispe/
Training and Summits
We can provide you with training on navigating GDPR compliance using AWS services via our Professional Services team. This team has a GDPR workshop offering, which is a two-day facilitated session customized to your specific needs and challenges. We are also providing GDPR presentations during our AWS Summits in European countries, as well as San Francisco and Tokyo.
Additional Resources
Finally, we have teams of compliance, data protection, and security experts, as well as the APN, helping customers across Europe prepare for running regulated workloads in the cloud as the GDPR becomes enforceable. For additional information on this, please contact your AWS Account Manager.
As we move towards May 25 and beyond, we’ll be posting a series of blogs to dive deeper into GDPR-related concepts along with how AWS can help. Please visit our GDPR Center for more information. We’re excited about being your partner in fully addressing this important regulation.
-Chad Woolf
Vice President, AWS Security Assurance
Interested in additional AWS Security news? Follow the AWS Security Blog on Twitter.
AWS has achieved Spain’s Esquema Nacional de Seguridad (ENS) High certification across 29 services. To successfully achieve the ENS High Standard, BDO España conducted an independent audit and attested that AWS meets confidentiality, integrity, and availability standards. This provides the assurance needed by Spanish Public Sector organizations wanting to build secure applications and services on AWS.
The National Security Framework, regulated under Royal Decree 3/2010, was developed through close collaboration between ENAC (Entidad Nacional de Acreditación), the Ministry of Finance and Public Administration, the CCN (National Cryptologic Centre), and other administrative bodies.
The following AWS Services are ENS High accredited across our Dublin and Frankfurt Regions:
AWS Key Management Service (KMS) now uses FIPS 140-2 validated hardware security modules (HSM) and supports FIPS 140-2 validated endpoints, which provide independent assurances about the confidentiality and integrity of your keys. Having additional third-party assurances about the keys you manage in AWS KMS can make it easier to use the service for regulated workloads.
AWS KMS HSMs are designed so that no one, not even AWS employees, can retrieve your plaintext keys. The service uses the FIPS 140-2 validated HSMs to protect your keys when you request the service to create keys on your behalf or when you import them. Your plaintext keys are never written to disk and are only used in volatile memory of the HSMs while performing your requested cryptographic operation. Furthermore, AWS KMS keys are never transmitted outside the AWS Regions in which they were created. And HSM firmware updates are controlled by multi-party access that is audited and reviewed by an independent group within AWS.
AWS KMS HSMs are validated at level 2 overall and at level 3 in the following areas:
Cryptographic Module Specification
Roles, Services, and Authentication
Physical Security
Design Assurance
You can also make AWS KMS requests to API endpoints that terminate TLS sessions using a FIPS 140-2 validated cryptographic software module. To do so, connect to the unique FIPS 140-2 validated HTTPS endpoints in the AWS KMS requests made from your applications. AWS KMS FIPS 140-2 validated HTTPS endpoints are powered by the OpenSSL FIPS Object Module. FIPS 140-2 validated API endpoints are available in all commercial regions where AWS KMS is available.
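As a rough illustration of how an application points at one of these endpoints, the following snippet uses the AWS SDK for Java (in the same abbreviated style as the other code in this collection) to build a KMS client with an endpoint override and then request a data key. The endpoint hostname and key alias shown are assumptions for illustration; check the AWS KMS documentation for the FIPS endpoint in your Region and use your own key identifier.

// Build a KMS client that sends requests to a FIPS 140-2 validated endpoint
// (the endpoint shown is illustrative; confirm the FIPS endpoint name for your Region)
AWSKMS kmsClient = AWSKMSClientBuilder.standard()
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                "https://kms-fips.us-west-2.amazonaws.com", "us-west-2"))
        .build();

// Requests made with this client now terminate TLS at the FIPS validated endpoint,
// for example generating a data key under a (hypothetical) customer master key alias
GenerateDataKeyResult dataKey = kmsClient.generateDataKey(new GenerateDataKeyRequest()
        .withKeyId("alias/example-key")
        .withKeySpec(DataKeySpec.AES_256));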
Here’s a long post. We think you’ll find it interesting. If you don’t have time to read it all, we recommend you watch this video, which will fill you in with everything you need, and then head straight to the product page to fill yer boots. (We recommend the video anyway, even if you do have time for a long read. ‘Cos it’s fab.)
If you’ve been a Raspberry Pi watcher for a while now, you’ll have a bit of a feel for how we update our products. Just over two years ago, we released Raspberry Pi 3 Model B. This was our first 64-bit product, and our first product to feature integrated wireless connectivity. Since then, we’ve sold over nine million Raspberry Pi 3 units (we’ve sold 19 million Raspberry Pis in total), which have been put to work in schools, homes, offices and factories all over the globe.
Those Raspberry Pi watchers will know that we have a history of releasing improved versions of our products a couple of years into their lives. The first example was Raspberry Pi 1 Model B+, which added two additional USB ports, introduced our current form factor, and rolled up a variety of other feedback from the community. Raspberry Pi 2 didn’t get this treatment, of course, as it was superseded after only one year; but it feels like it’s high time that Raspberry Pi 3 received the “plus” treatment.
So, without further ado, Raspberry Pi 3 Model B+ is now on sale for $35 (the same price as the existing Raspberry Pi 3 Model B), featuring:
A 1.4GHz 64-bit quad-core ARM Cortex-A53 CPU
Dual-band 802.11ac wireless LAN and Bluetooth 4.2
Faster Ethernet (Gigabit Ethernet over USB 2.0)
Power-over-Ethernet support (with separate PoE HAT)
Improved PXE network and USB mass-storage booting
Improved thermal management
Alongside a 200MHz increase in peak CPU clock frequency, we have roughly three times the wired and wireless network throughput, and the ability to sustain high performance for much longer periods.
Behold the shiny
Raspberry Pi 3B+ is available to buy today from our network of Approved Resellers.
New features, new chips
Roger Thornton did the design work on this revision of the Raspberry Pi. Here, he and I have a chat about what’s new.
The new product is built around BCM2837B0, an updated version of the 64-bit Broadcom application processor used in Raspberry Pi 3B, which incorporates power integrity optimisations, and a heat spreader (that’s the shiny metal bit you can see in the photos). Together these allow us to reach higher clock frequencies (or to run at lower voltages to reduce power consumption), and to more accurately monitor and control the temperature of the chip.
Dual-band wireless LAN and Bluetooth are provided by the Cypress CYW43455 “combo” chip, connected to a Proant PCB antenna similar to the one used on Raspberry Pi Zero W. Compared to its predecessor, Raspberry Pi 3B+ delivers somewhat better performance in the 2.4GHz band, and far better performance in the 5GHz band, as demonstrated by these iperf results from LibreELEC developer Milhouse.
                            Tx bandwidth (Mb/s)    Rx bandwidth (Mb/s)
Raspberry Pi 3B                    35.7                   35.6
Raspberry Pi 3B+ (2.4GHz)          46.7                   46.3
Raspberry Pi 3B+ (5GHz)             102                    102
The wireless circuitry is encapsulated under a metal shield, rather fetchingly embossed with our logo. This has allowed us to certify the entire board as a radio module under FCC rules, which in turn will significantly reduce the cost of conformance testing Raspberry Pi-based products.
We’ll be teaching metalwork next.
Previous Raspberry Pi devices have used the LAN951x family of chips, which combine a USB hub and 10/100 Ethernet controller. For Raspberry Pi 3B+, Microchip have supported us with an upgraded version, LAN7515, which supports Gigabit Ethernet. While the USB 2.0 connection to the application processor limits the available bandwidth, we still see roughly a threefold increase in throughput compared to Raspberry Pi 3B. Again, here are some typical iperf results.
                            Tx bandwidth (Mb/s)    Rx bandwidth (Mb/s)
Raspberry Pi 3B                    94.1                   95.5
Raspberry Pi 3B+                    315                    315
We use a magjack that supports Power over Ethernet (PoE), and bring the relevant signals to a new 4-pin header. We will shortly launch a PoE HAT which can generate the 5V necessary to power the Raspberry Pi from the 48V PoE supply.
There… are… four… pins!
Coming soon to a Raspberry Pi 3B+ near you
Raspberry Pi 3B was our first product to support PXE Ethernet boot. Testing it in the wild shook out a number of compatibility issues with particular switches and traffic environments. Gordon has rolled up fixes for all known issues into the BCM2837B0 boot ROM, and PXE boot is now enabled by default.
Clocking, voltages and thermals
The improved power integrity of the BCM2837B0 package, and the improved regulation accuracy of our new MaxLinear MxL7704 power management IC, have allowed us to tune our clocking and voltage rules for both better peak performance and longer-duration sustained performance.
Below 70°C, we use the improvements to increase the core frequency to 1.4GHz. Above 70°C, we drop to 1.2GHz, and use the improvements to decrease the core voltage, increasing the period of time before we reach our 80°C thermal throttle; the reduction in power consumption is such that many use cases will never reach the throttle. Like a modern smartphone, we treat the thermal mass of the device as a resource, to be spent carefully with the goal of optimising user experience.
This graph, courtesy of Gareth Halfacree, demonstrates that Raspberry Pi 3B+ runs faster and at a lower temperature for the duration of an eight‑minute quad‑core Sysbench CPU test.
Note that Raspberry Pi 3B+ does consume substantially more power than its predecessor. We strongly encourage you to use a high-quality 2.5A power supply, such as the official Raspberry Pi Universal Power Supply.
FAQs
We’ll keep updating this list over the next couple of days, but here are a few to get you started.
Are you discontinuing earlier Raspberry Pi models?
No. We have a lot of industrial customers who will want to stick with the existing products for the time being. We’ll keep building these models for as long as there’s demand. Raspberry Pi 1B+, Raspberry Pi 2B, and Raspberry Pi 3B will continue to sell for $25, $35, and $35 respectively.
What about Model A+?
Raspberry Pi 1A+ continues to be the $20 entry-level “big” Raspberry Pi for the time being. We are considering the possibility of producing a Raspberry Pi 3A+ in due course.
What about the Compute Module?
CM1, CM3 and CM3L will continue to be available. We may offer versions of CM3 and CM3L with BCM2837B0 in due course, depending on customer demand.
Are you still using VideoCore?
Yes. VideoCore IV 3D is the only publicly-documented 3D graphics core for ARM‑based SoCs, and we want to make Raspberry Pi more open over time, not less.
Credits
A project like this requires a vast amount of focused work from a large team over an extended period. Particular credit is due to Roger Thornton, who designed the board and ran the exhaustive (and exhausting) RF compliance campaign, and to the team at the Sony UK Technology Centre in Pencoed, South Wales. A partial list of others who made major direct contributions to the BCM2837B0 chip program, CYW43455 integration, LAN7515 and MxL7704 developments, and Raspberry Pi 3B+ itself follows:
James Adams, David Armour, Jonathan Bell, Maria Blazquez, Jamie Brogan-Shaw, Mike Buffham, Rob Campling, Cindy Cao, Victor Carmon, KK Chan, Nick Chase, Nigel Cheetham, Scott Clark, Nigel Clift, Dominic Cobley, Peter Coyle, John Cronk, Di Dai, Kurt Dennis, David Doyle, Andrew Edwards, Phil Elwell, John Ferdinand, Doug Freegard, Ian Furlong, Shawn Guo, Philip Harrison, Jason Hicks, Stefan Ho, Andrew Hoare, Gordon Hollingworth, Tuomas Hollman, EikPei Hu, James Hughes, Andy Hulbert, Anand Jain, David John, Prasanna Kerekoppa, Shaik Labeeb, Trevor Latham, Steve Le, David Lee, David Lewsey, Sherman Li, Xizhe Li, Simon Long, Fu Luo Larson, Juan Martinez, Sandhya Menon, Ben Mercer, James Mills, Max Passell, Mark Perry, Eric Phiri, Ashwin Rao, Justin Rees, James Reilly, Matt Rowley, Akshaye Sama, Ian Saturley, Serge Schneider, Manuel Sedlmair, Shawn Shadburn, Veeresh Shivashimper, Graham Smith, Ben Stephens, Mike Stimson, Yuree Tchong, Stuart Thomson, John Wadsworth, Ian Watch, Sarah Williams, Jason Zhu.
If you’re not on this list and think you should be, please let me know, and accept my apologies.
This post summarizes the responses we received to our November 28 post asking our readers how they handle the challenge of digital asset management (DAM).
This past November, we published a blog post entitled What’s the Best Solution for Managing Digital Photos and Videos? We asked our readers to tell us how they’re currently backing up their digital media assets and what their ideal system might be. We posed these questions:
How are you currently backing up your digital photos, video files, and/or file libraries/catalogs? Do you have a backup system that uses attached drives, a local network, the cloud, or offline storage media? Does it work well for you?
Imagine your ideal digital asset backup setup. What would it look like? Don’t be constrained by current products, technologies, brands, or solutions. Invent a technology or product if you wish. Describe an ideal system that would work the way you want it to.
We were thrilled to receive a large number of responses from readers. What was clear from the responses is that there is no consensus on solutions for either amateur or professional, and that users had many ideas for how digital media management could be improved to meet their needs.
We asked our readers to contribute to this dialog for a number of reasons. As a cloud backup and cloud storage service provider, we want to understand how our users are working with digital media so we know how to improve our services. Also, we want to participate in the digital media community, and hope that sharing the challenges our readers are facing and the solutions they are using will make a contribution to that community.
The State of Managing Digital Media
While a few readers told us they had settled on a system that worked for them, most said that they were still looking for a better solution. Many expressed frustration with the ever-growing volume of digital photo and video data, which keeps increasing as the resolution of still and video cameras rises. Amateurs are making do with a number of consumer services, while professionals employ a wide range of commercial, open source, or jury-rigged solutions for managing data and maintaining its integrity.
I’ve summarized the responses we received in three sections: 1) what readers are doing today, 2) common wishes they have for improvements, and 3) concerns expressed by a number of respondents.
The Digital Media Workflow
Protecting Media From Camera to Cloud
We heard from a wide range of smartphone users, DSLR and other format photographers, and digital video creators. Speed of operation, the ability to share files with collaborators and clients, and product feature sets were frequently cited as reasons for selecting their particular solution. Also of great importance was protecting the integrity of media through the entire capture, transfer, editing, and backup workflow.
Avid Media Composer
Many readers said they backed up their camera memory cards as soon as possible to a computer or external drive and erased cards only when they had more than one backup of the media. Some said that they used dual memory cards that are written to simultaneously by the camera for peace of mind.
While some cameras now come equipped with Wi-Fi, no one other than smartphone users said they were using Wi-Fi as part of their workflow. Also, we didn’t receive feedback from any photographers who regularly shoot tethered.
Some readers said they still use CDs and DVDs for storing media. One user admitted to previously using VHS tape.
NAS (Network Attached Storage) is in wide use. Synology, Drobo, FreeNAS, and other RAID and non-RAID storage devices were frequently mentioned.
A number were backing up their NAS to the cloud for archiving. Others said they had duplicate external drives that were stored onsite or offsite, including in a physical safe, other business locations, a bank lock box, and even “mom’s house.”
Many said they had regular backup practices, including nightly backups, weekly and other regularly scheduled backups, often in non-work hours.
One reader said that a monthly data scrub was performed on the NAS to ensure data integrity.
Hardware used for backups included Synology, QNAP, Drobo, and FreeNAS systems.
Services used by our readers for backing up included Backblaze Backup, Backblaze B2 Cloud Storage, CrashPlan, SmugMug, Amazon Glacier, Google Photos, Amazon Prime Photos, Adobe Creative Cloud, Apple Photos, Lima, DropBox, and Tarsnap. Some readers made a distinction between how they used sync (such as DropBox), backup (such as Backblaze Backup), and storage (such as Backblaze B2), but others did not. (See Sync vs. Backup vs. Storage on our blog for an explanation of the differences.)
Software used for backups and maintaining file integrity included Arq, Carbon Copy Cloner, ChronoSync, SoftRAID, FreeNAS, corz checksum, rclone, rsync, Apple Time Machine, Capture One, Btrfs, BorgBackup, SuperDuper, restic, Acronis True Image, custom Python scripts, and smartphone apps PhotoTransfer and PhotoSync.
Cloud torrent services mentioned were Offcloud, Bitport, and Seedr.
A common practice mentioned is to use SSD (Solid State Drives) in the working computer or attached drives (or both) to improve speed and reliability. Protection from magnetic fields was another reason given to use SSDs.
Many users copy their media to multiple attached or network drives for redundancy.
Users of Lightroom reported keeping their Lightroom catalog on a local drive and their photo files on an attached drive. They frequently had different backup schemes for the catalog and the media. Many readers are careful to have multiple backups of their Lightroom catalog. Some expressed the desire to back up both their original raw files and their edited (working) raw files, but limitations in bandwidth and backup media caused some to give priority to good backups of their raw files, since the edited files could be recreated if necessary.
A number of smartphone users reported using Apple or Google Photos to store their photos and share them.
Digital Editing and Enhancement
Adobe still rules for many users for photo editing. Some expressed interest in alternatives from Phase One, Skylum (formerly Macphun), ON1, and DxO.
Adobe Lightroom
While Adobe Lightroom (and Adobe Photoshop for some) are the foundation of many users’ photo media workflow, others are still looking for something that might better suit their needs. A number of comments were made regarding Adobe’s switch to a subscription model.
Software used for image and video editing and enhancement included Adobe Lightroom, Adobe Photoshop, Luminar, Affinity Photo, Phase One, DxO, ON1, GoPro Quik, Apple Aperture (discontinued), Avid Media Composer, Adobe Premiere, and Apple Final Cut Studio (discontinued) or Final Cut Pro.
Luminar 2018 DAM preview
Managing, Archiving, Adding Metadata, Searching for Media Files
While some of our respondents are casual or serious amateur digital media users, others make a living from digital photography and videography. A number of our readers report having hundreds of thousands of files and many terabytes of data — even approaching one petabyte of data for one professional who responded. Whether amateur or professional, all shared the desire to preserve their digital media assets for the future. Consequently, they want to be able to attach metadata quickly and easily, and search for and retrieve files from wherever they are stored when necessary.
It’s not surprising that metadata was of great interest to our readers. Tagging, categorizing, and maintaining searchable records is important to anyone dealing with digital media.
While Lightroom was frequently used to manage catalogs, metadata, and files, others used spreadsheets to record archive location and grep for searching records.
Some liked the idea of Adobe’s Creative Cloud but weren’t excited about its cost and lack of choice in cloud providers.
Others reported using Photo Mechanic, DxO, digiKam, Google Photos, Daminion, Photo Supreme, Phraseanet, Phase One Media Pro, Google Picasa (discontinued), Adobe Bridge, Synology Photo Station, FotoStation, PhotoShelter, Flickr, and SmugMug.
Photo Mechanic 5
Common Wishes For Managing Digital Media in the Future
Our readers came through with numerous suggestions for how digital media management could be improved. There were a number of common themes centered around bigger and better storage, faster broadband or other ways to get data into the cloud, managing metadata, and ensuring integrity of their data.
Many wished for faster internet speeds that would make transferring and backing up files more efficient. This desire was expressed multiple times. Many said that the sheer volume of digital data they worked with made cloud services and storage impractical.
A number of readers would like the option to ship files on a physical device to a cloud provider so that the initial large transfer would not take as long. Some wished to be able to send monthly physical transfers, with incremental transfers sent over the internet. (Note that Backblaze supports adding data via a hardware drive to B2 Cloud Storage with our Fireball service.)
Reasonable service cost, not surprisingly, was a desire expressed by just about everyone.
Many wished for not just backup, but long-term archiving of data. One suggestion was to be able to specify the length-of-term for archiving and pay by that metric for specific sets of files.
An easy-to-use Windows, Macintosh, or Linux client was a feature that many appreciated. Some were comfortable with using third-party apps for cloud storage and others wanted a vendor-supplied client.
A number of users like the combination of NAS and cloud. Many backed up their NAS devices to the cloud. Some suggested that the NAS should be the local gateway to unlimited virtual storage in the cloud. (They should read our recent blog post on Morro Data’s CloudNAS solution.)
Some just wanted the storage problem solved. They would like the computer system to manage storage intelligently so they don’t have to. One reader said that storage should be managed and optimized by the system, as RAM is, and not by the user.
Common Concerns Expressed by our Readers
Over and over again our readers expressed similar concerns about the state of digital asset management.
Dealing with large volumes of data was a common challenge. As digital media files increase in size, readers struggle to manage the amount of data they have to deal with. As one reader wrote, “Why don’t I have an online backup of my entire library? Because it’s too much damn data!”
Many said they would back up more often, or back up even more files if they had the bandwidth or storage media to do so.
The cloud is attractive to many, but some said that they didn’t have the bandwidth to get their data into the cloud in an efficient manner, the cloud is too expensive, or they have other concerns about trusting the cloud with their data.
Most of our respondents are using Apple computer systems, some Windows, and a few Linux. A lot of the Mac users are using Time Machine. Some liked the concept of Time Machine but said they had experienced corrupted data when using it.
Visibility into the backup process was mentioned many times. Users want to know what’s happening to their data. A number said they wanted automatic integrity checks of their data and reports sent to them if anything changes.
A number of readers said they didn’t want to be locked into one vendor’s proprietary solution. They prefer open standards to prevent loss if a vendor leaves the market, changes the product, or makes a turn in strategy that they don’t wish to follow.
A number of users talked about how their practices differed depending on whether they were working in the field or working in a studio or at home. Access to the internet and data transfer speed was an issue for many.
It’s clear that people working in high resolution photography and videography are pushing the envelope for moving data between storage devices and the cloud.
Some readers expressed concern about the integrity of their stored data. They were concerned that over time, files would degrade. Some asked for tools to verify data integrity manually, or that data integrity should be monitored and reported by the storage vendor on a regular basis. The OpenZFS and Btrfs file systems were mentioned by some.
A few readers mentioned that they preferred redundant data centers for cloud storage.
Metadata is an important element for many, and making sure that metadata is easily and permanently associated with their files is essential.
The ability to share working files with collaborators or finished media with clients, friends, and family also is a common requirement.
Thank You for Your Comments and Suggestions
As a cloud backup and storage provider, your contributions were of great interest to us. A number of readers made suggestions for how we can improve or augment our services to increase the options for digital media management. We listened and are considering your comments. They will be included in our discussions and planning for possible future services and offerings from Backblaze. We thank everyone for your contributions.
Digital media management
Let’s Keep the Conversation Going!
Were you surprised by any of the responses? Do you have something further to contribute? This is by no means the end of our exploration of how to better serve media professionals, so let’s keep the lines of communication open.
Blockchain, AI, big data, NoSQL, microservices, single page applications, cloud, SOA. What do these have in common? They have been or are hyped. At some point they were “the big thing” du jour. Everyone was investigating the possibility of using them, everyone was talking about them, there were meetups, conferences, articles on Hacker News and Reddit. There are more examples, of course (what’s the JavaScript framework this month?), but I’ll focus my examples on those above.
Another thing they have in common is that they are useful. All of them have some pretty good applications that are definitely worth the time and investment.
Yet another thing they have in common is that they are far from universally applicable. I’ve argued that monoliths are often still the better approach and that microservices introduce too much complexity for the average project. Big Data is something very few organizations actually have; AI/machine learning can help a wide variety of problems, but it is just a tool in a toolbox, not the solution to all problems. Single page applications are great for, yeah, applications, but most websites are still websites, not feature-rich frontends – you don’t need an SPA for every type of website. NoSQL has solved niche issues, and issues of scale that few companies have had, but nothing beats a good old relational database for the typical project out there. “The cloud” is not always where you want your software to be; and SOA just means everything (ESBs, direct integrations, even microservices, according to some). And the blockchain – it seems to be having limited success beyond cryptocurrencies.
And finally, another trait many of them share is that the hype has settled down. Only yesterday I read an article about the “death of the microservices madness”. I don’t see nearly as many new NoSQL databases as a few years ago, some of the projects that have been popular have faded. SOA and “the cloud” are already “boring”, and we’ve realized we don’t actually have big data if it fits in an Excel spreadsheet. SPAs and AI are still high in popularity, but we are getting a good understanding as a community why and when they are useful.
But it seems that nuanced reality has never stopped us from hyping a particular technology or approach. And maybe that’s okay in order to give a promising, though niche, technology the spotlight and let it shine in the particular use cases where it fits.
But countless projects have and will suffer from our collective inability to filter through these hypes. I’d bet millions of developer hours have been wasted in trying to use the above technologies where they just didn’t fit. It’s like that scene from Idiocracy where a guy tries to fit a rectangular figure into a circular hole.
And the new one is now “the blockchain”. I won’t repeat my rant, but in summary – it doesn’t solve many of the problems companies are trying to solve with it right now, just because it’s cool. Or at least it doesn’t solve them better than existing solutions. Many pilots will be carried out, and many hours will be wasted in figuring out why that thing doesn’t work. A few of those projects will be a good fit and will actually bring value.
Do you need to reach multi-party consensus for the data you store? Can all stakeholders support the infrastructure to run their node(s)? Do they have the staff to administer the node(s)? Do you need to execute distributed application code on the data? Won’t it be easier to just deploy RESTful APIs and integrate the parties that way? Do you need to store all the data, or just parts of it, to guarantee data integrity?
“If all you have is a hammer, everything looks like a nail,” as the famous saying goes. In the software industry we repeatedly find new and cool hammers and then try to hit as many nails as we can. But only a few of them are actual nails. The rest remain ugly, hard to support, “who was the idiot that wrote this” and “I wasn’t here when the decisions were made” types of projects.
I don’t have the illusion that we will calm down and skip the next hypes. Especially if adding the hyped word to your company’s name raises your stock price. But if there’s one thing I’d like people to ask themselves when choosing a technology stack, it is “do we really need that to solve our problems?”.
If the answer is really “yes”, then great, go ahead and deploy the multi-organization permissioned blockchain, or fork Ethereum, or whatever. If not, you can still do a project at home that you can safely abandon. And if you need some pilot project to figure out whether the new piece of technology would be beneficial – go ahead and try it. But have a baseline – the fact that it somehow worked doesn’t mean it’s better than old, tested models of doing the same thing.
The following list includes the ten most downloaded AWS security and compliance documents in 2017. Using this list, you can learn about what other AWS customers found most interesting about security and compliance last year.
AWS Security Best Practices – This guide is intended for customers who are designing the security infrastructure and configuration for applications running on AWS. The guide provides security best practices that will help you define your Information Security Management System (ISMS) and build a set of security policies and processes for your organization so that you can protect your data and assets in the AWS Cloud.
AWS: Overview of Security Processes – This whitepaper describes the physical and operational security processes for the AWS managed network and infrastructure, and helps answer questions such as, “How does AWS help me protect my data?”
Service Organization Controls (SOC) 3 Report – This publicly available report describes internal AWS controls relevant to security, availability, processing integrity, confidentiality, and privacy.
Introduction to AWS Security – This document provides an introduction to AWS’s approach to security, including the controls in the AWS environment, and some of the products and features that AWS makes available to customers to meet your security objectives.
AWS: Risk and Compliance – This whitepaper provides information to help customers integrate AWS into their existing control framework, including a basic approach for evaluating AWS controls and a description of AWS certifications, programs, reports, and third-party attestations.
Use AWS WAF to Mitigate OWASP’s Top 10 Web Application Vulnerabilities – AWS WAF is a web application firewall that helps you protect your websites and web applications against various attack vectors at the HTTP protocol level. This whitepaper outlines how you can use AWS WAF to mitigate the application vulnerabilities that are defined in the Open Web Application Security Project (OWASP) Top 10 list of most common categories of application security flaws.
Introduction to Auditing the Use of AWS – This whitepaper provides information, tools, and approaches for auditors to use when auditing the security of the AWS managed network and infrastructure.
AWS Security and Compliance: Quick Reference Guide – By using AWS, you inherit the many security controls that we operate, thus reducing the number of security controls that you need to maintain. Your own compliance and certification programs are strengthened while at the same time lowering your cost to maintain and run your specific security assurance requirements. Learn more in this quick reference guide.
AWS has updated its certifications against ISO 9001, ISO 27001, ISO 27017, and ISO 27018 standards, bringing the total to 67 services now under ISO compliance. We added the following 29 services this cycle:
AWS maintains certifications through extensive audits of its controls to ensure that information security risks that affect the confidentiality, integrity, and availability of company and customer information are appropriately managed.
You can download copies of the AWS ISO certificates that contain AWS’s in-scope services and Regions, and use these certificates to jump-start your own certification efforts:
Linux Foundation Director of IT infrastructure security, Konstantin Ryabitsev, has put together a lengthy guide to using Git and PGP to protect the integrity of source code. In a Google+ post, he called it “beta quality” and asked for help with corrections and fixes. “PGP incorporates a trust delegation mechanism known as the ‘Web of Trust.’ At its core, this is an attempt to replace the need for centralized Certification Authorities of the HTTPS/TLS world. Instead of various software makers dictating who should be your trusted certifying entity, PGP leaves this responsibility to each user.
Unfortunately, very few people understand how the Web of Trust works, and even fewer bother to keep it going. It remains an important aspect of the OpenPGP specification, but recent versions of GnuPG (2.2 and above) have implemented an alternative mechanism called ‘Trust on First Use’ (TOFU).
You can think of TOFU as ‘the SSH-like approach to trust.’ With SSH, the first time you connect to a remote system, its key fingerprint is recorded and remembered. If the key changes in the future, the SSH client will alert you and refuse to connect, forcing you to make a decision on whether you choose to trust the changed key or not.
Similarly, the first time you import someone’s PGP key, it is assumed to be trusted. If at any point in the future GnuPG comes across another key with the same identity, both the previously imported key and the new key will be marked as invalid and you will need to manually figure out which one to keep.
In this guide, we will be using the TOFU trust model.”
Now, Amazon Cloud Directory makes it easier for you to apply schema changes across your directories with in-place schema upgrades. Your directory now remains available while Cloud Directory applies backward-compatible schema changes such as the addition of new fields. Without migrating data between directories or applying code changes to your applications, you can upgrade your schemas. You also can view the history of your schema changes in Cloud Directory by using version identifiers, which help you track and audit schema versions across directories. If you have multiple instances of a directory with the same schema, you can view the version history of schema changes to manage your directory fleet and ensure that all directories are running with the same schema version.
In this blog post, I demonstrate how to perform an in-place schema upgrade and use schema versions in Cloud Directory. I add additional attributes to an existing facet and add a new facet to a schema. I then publish the new schema and apply it to running directories, upgrading the schema in place. I also show how to view the version history of a directory schema, which helps me to ensure my directory fleet is running the same version of the schema and has the correct history of schema changes applied to it.
Note: I share Java code examples in this post. I assume that you are familiar with the AWS SDK and can use Java-based code to build a Cloud Directory code example. You can apply the concepts I cover in this post to other programming languages such as Python and Ruby.
Cloud Directory fundamentals
I will start by covering a few Cloud Directory fundamentals. If you are already familiar with the concepts behind Cloud Directory facets, schemas, and schema lifecycles, you can skip to the next section.
Facets: Groups of attributes. You use facets to define object types. For example, you can define a device schema by adding facets such as computers, phones, and tablets. A computer facet can track attributes such as serial number, make, and model. You can then use the facets to create computer objects, phone objects, and tablet objects in the directory to which the schema applies (a short sketch of creating such an object follows this list).
Schemas: Collections of facets. Schemas define which types of objects can be created in a directory (such as users, devices, and organizations) and enforce validation of data for each object class. All data within a directory must conform to the applied schema. As a result, the schema definition is essentially a blueprint to construct a directory with an applied schema.
Schema lifecycle: The four distinct states of a schema: Development, Published, Applied, and Deleted. Schemas in the Published and Applied states have version identifiers and cannot be changed. Schemas in the Applied state are used by directories for validation as applications insert or update data. You can change schemas in the Development state as many times as you need them to. In-place schema upgrades allow you to apply schema changes to an existing Applied schema in a production directory without the need to export and import the data populated in the directory.
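To make the facet and schema concepts concrete before moving on, here is a minimal, hypothetical sketch (in the same abbreviated AWS SDK for Java style used later in this post) of creating a single computer object against an applied ComputerInfo facet. The client, directoryArn, and appliedSchemaArn variables are assumed to exist already, and any additional required attributes would be supplied in the same way as SerialNumber.

// Create a computer object under the directory root by using the ComputerInfo facet
CreateObjectResult computer = client.createObject(new CreateObjectRequest()
        .withDirectoryArn(directoryArn)
        .withSchemaFacets(new SchemaFacet()
                .withSchemaArn(appliedSchemaArn)
                .withFacetName("ComputerInfo"))
        .withObjectAttributeList(new AttributeKeyAndValue()
                .withKey(new AttributeKey()
                        .withSchemaArn(appliedSchemaArn)
                        .withFacetName("ComputerInfo")
                        .withName("SerialNumber"))
                .withValue(new TypedAttributeValue().withStringValue("SN-0001")))
        .withParentReference(new ObjectReference().withSelector("/"))
        .withLinkName("computer-SN-0001"));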
How to add attributes to a computer inventory application schema and perform an in-place schema upgrade
To demonstrate how to set up schema versioning and perform an in-place schema upgrade, I will use an example of a computer inventory application that uses Cloud Directory to store relationship data. Let’s say that at my company, AnyCompany, we use this computer inventory application to track all computers we give to our employees for work use. I previously created a ComputerSchema and assigned its version identifier as 1. This schema contains one facet called ComputerInfo that includes attributes for SerialNumber, Make, and Model, as shown in the following schema details.
AnyCompany has offices in Seattle, Portland, and San Francisco. I have deployed the computer inventory application for each of these three locations. As shown in the lower left part of the following diagram, ComputerSchema is in the Published state with a version of 1. The Published schema is applied to SeattleDirectory, PortlandDirectory, and SanFranciscoDirectory for AnyCompany’s three locations. Implementing separate directories for different geographic locations when you don’t have any queries that cross location boundaries is a good data partitioning strategy and gives your application better response times with lower latency.
The following code example creates the schema in the Development state by using a JSON file, publishes the schema, and then creates directories for the Seattle, Portland, and San Francisco locations. For this example, I assume the schema has been defined in the JSON file. The createSchema API creates a schema Amazon Resource Name (ARN) with the name defined in the variable, SCHEMA_NAME. I can use the putSchemaFromJson API to add specific schema definitions from the JSON file.
// The utility method to get valid Cloud Directory schema JSON
String validJson = getJsonFile("ComputerSchema_version_1.json");

String SCHEMA_NAME = "ComputerSchema";

String developmentSchemaArn = client.createSchema(new CreateSchemaRequest()
        .withName(SCHEMA_NAME))
        .getSchemaArn();

// Put the schema document in the Development schema
PutSchemaFromJsonResult result = client.putSchemaFromJson(new PutSchemaFromJsonRequest()
        .withSchemaArn(developmentSchemaArn)
        .withDocument(validJson));
The following code example takes the schema that is currently in the Development state and publishes the schema, changing its state to Published.
String SCHEMA_VERSION = "1";

String publishedSchemaArn = client.publishSchema(
        new PublishSchemaRequest()
                .withDevelopmentSchemaArn(developmentSchemaArn)
                .withVersion(SCHEMA_VERSION))
        .getPublishedSchemaArn();

// Our Published schema ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1
The following code example creates a directory named SeattleDirectory and applies the published schema. The createDirectory API call creates a directory by using the published schema provided in the API parameters. Note that Cloud Directory stores a version of the schema in the directory in the Applied state. I will use similar code to create directories for PortlandDirectory and SanFranciscoDirectory.
String DIRECTORY_NAME = "SeattleDirectory";

CreateDirectoryResult directory = client.createDirectory(
        new CreateDirectoryRequest()
                .withName(DIRECTORY_NAME)
                .withSchemaArn(publishedSchemaArn));

String directoryArn = directory.getDirectoryArn();
String appliedSchemaArn = directory.getAppliedSchemaArn();

// This code section can be reused to create directories for Portland and San Francisco locations with the appropriate directory names

// Our directory ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX

// Our applied schema ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1
Revising a schema
Now let’s say my company, AnyCompany, wants to add more information for computers and to track which employees have been assigned a computer for work use. I modify the schema to add two attributes to the ComputerInfo facet: Description and OSVersion (operating system version). I make Description optional because it is not important for me to track this attribute for the computer objects I create. I make OSVersion mandatory because it is critical for me to track it for all computer objects so that I can make changes such as applying security patches or making upgrades. Because I make OSVersion mandatory, I must provide a default value that Cloud Directory will apply to objects that were created before the schema revision, in order to handle backward compatibility. Note that you can replace the value in any object with a different value.
I also add a new facet to track computer assignment information, shown in the following updated schema as the ComputerAssignment facet. This facet tracks these additional attributes: Name (the name of the person to whom the computer is assigned), EMail (the email address of the assignee), Department, and department CostCenter. Note that Cloud Directory refers to the previously available version identifier as the Major Version. Because I can now add a minor version to a schema, I also denote the changed schema as Minor Version A.
The following diagram shows the changes that were made when I added another facet to the schema and attributes to the existing facet. The highlighted area of the diagram (bottom left) shows that the schema changes were published.
The following code example revises the existing Development schema by adding the new attributes to the ComputerInfo facet and by adding the ComputerAssignment facet. I use a new JSON file for the schema revision, and for the purposes of this example, I am assuming the JSON file has the full schema including planned revisions.
// The utility method to get a valid Cloud Directory schema JSON
String schemaJson = getJsonFile("ComputerSchema_version_1_A.json");

// Put the schema document in the Development schema
PutSchemaFromJsonResult result = client.putSchemaFromJson(
        new PutSchemaFromJsonRequest()
                .withSchemaArn(developmentSchemaArn)
                .withDocument(schemaJson));
Upgrading the Published schema
The following code example performs an in-place schema upgrade of the Published schema with schema revisions (it adds new attributes to the existing facet and another facet to the schema). The upgradePublishedSchema API upgrades the Published schema with backward-compatible changes from the Development schema.
// From an earlier code example, I know the publishedSchemaArn has this value: "arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1"
// Upgrade publishedSchemaArn to minorVersion A. The Development schema must be backward compatible with
// the existing publishedSchemaArn.

String minorVersion = "A";

UpgradePublishedSchemaResult upgradePublishedSchemaResult = client.upgradePublishedSchema(new UpgradePublishedSchemaRequest()
        .withDevelopmentSchemaArn(developmentSchemaArn)
        .withPublishedSchemaArn(publishedSchemaArn)
        .withMinorVersion(minorVersion));

String upgradedPublishedSchemaArn = upgradePublishedSchemaResult.getUpgradedSchemaArn();

// The Published schema ARN after the upgrade shows a minor version as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1/A
Upgrading the Applied schema
The following diagram shows the in-place schema upgrade for the SeattleDirectory directory. I am performing the schema upgrade so that I can reflect the new schemas in all three directories. As a reminder, I added new attributes to the ComputerInfo facet and also added the ComputerAssignment facet. After the schema and directory upgrade, I can create objects for the ComputerInfo and ComputerAssignment facets in the SeattleDirectory. Any objects that were created with the old facet definition for ComputerInfo will now use the default values for any additional attributes defined in the new schema.
I use the following code example to perform an in-place upgrade of the SeattleDirectory to a Major Version of 1 and a Minor Version of A. Note that you should change a Major Version identifier in a schema to make backward-incompatible changes such as changing the data type of an existing attribute or dropping a mandatory attribute from your schema. Backward-incompatible changes require directory data migration from a previous version to the new version. You should change a Minor Version identifier in a schema to make backward-compatible upgrades such as adding additional attributes or adding facets, which in turn may contain one or more attributes. The upgradeAppliedSchema API lets me upgrade an existing directory with a different version of a schema.
// This upgrades ComputerSchema version 1 of the Applied schema in SeattleDirectory to Major Version 1 and Minor Version A
// The schema must be backward compatible or the API will fail with IncompatibleSchemaException
UpgradeAppliedSchemaResult upgradeAppliedSchemaResult = client.upgradeAppliedSchema(new UpgradeAppliedSchemaRequest()
.withDirectoryArn(directoryArn)
.withPublishedSchemaArn(upgradedPublishedSchemaArn));
String upgradedAppliedSchemaArn = upgradeAppliedSchemaResult.getUpgradedSchemaArn();
// The Applied schema ARN after the in-place schema upgrade will appear as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1
// This code section can be reused to upgrade directories for the Portland and San Francisco locations with the appropriate directory ARN
Note: Cloud Directory has excluded returning the Minor Version identifier in the Applied schema ARN for backward compatibility and to enable the application to work across older and newer versions of the directory.
The following diagram shows the changes that are made when I perform an in-place schema upgrade in the two remaining directories, PortlandDirectory and SanFranciscoDirectory. I make these calls sequentially, upgrading PortlandDirectory first and then upgrading SanFranciscoDirectory. I use the same code example that I used earlier to upgrade SeattleDirectory. Now, all my directories are running the most current version of the schema. Also, I made these schema changes without having to migrate data and while maintaining my application’s high availability.
Schema revision history
I can now view the schema revision history for any of AnyCompany’s directories by using the listAppliedSchemaArns API. Cloud Directory maintains the five most recent versions of applied schema changes. Similarly, to inspect the current Minor Version that was applied to my schema, I use the getAppliedSchemaVersion API. The listAppliedSchemaArns API returns the schema ARNs based on my schema filter as defined in withSchemaArn.
I use the following code example to query an Applied schema for its version history.
// This returns the five most recent Minor Versions associated with a Major Version
ListAppliedSchemaArnsResult listAppliedSchemaArnsResult = client.listAppliedSchemaArns(new ListAppliedSchemaArnsRequest()
.withDirectoryArn(directoryArn)
.withSchemaArn(upgradedAppliedSchemaArn));
// Note: The listAppliedSchemaArns API without the SchemaArn filter returns all the Major Versions in a directory
The listAppliedSchemaArns API returns the two ARNs as shown in the following output.
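That output appears as a screenshot in the original post; to reproduce it yourself, you could print the returned ARNs from the result object, which exposes them through getSchemaArns():
// Print the Applied schema ARNs returned by listAppliedSchemaArns
for (String arn : listAppliedSchemaArnsResult.getSchemaArns()) {
    System.out.println(arn);
}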
The following code example queries an Applied schema for its current Minor Version by using the getAppliedSchemaVersion API.
// This returns the current Applied schema's Minor Version ARN
GetAppliedSchemaVersionResult getAppliedSchemaVersionResult = client.getAppliedSchemaVersion(new GetAppliedSchemaVersionRequest()
.withSchemaArn(upgradedAppliedSchemaArn));
The getAppliedSchemaVersion API returns the current Applied schema ARN with a Minor Version, as shown in the following output.
If you have a lot of directories, schema revision API calls can help you audit your directory fleet and ensure that all directories are running the same version of a schema. Such auditing can help you ensure high integrity of directories across your fleet.
Summary
You can use in-place schema upgrades to make changes to your directory schema as you evolve your data set to match the needs of your application. An in-place schema upgrade allows you to maintain high availability for your directory and applications while the upgrade takes place. For more information about in-place schema upgrades, see the in-place schema upgrade documentation.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this post, start a new thread in the Directory Service forum or contact AWS Support.
This post was contributed by Wangechi Dole, AWS Solutions Architect; Milan Krasnansky, ING, Digital Solutions Developer, SGK; and Rian Mookencherry, Director – Product Innovation, SGK.
Data processing and transformation is a common use case you see in our customer case studies and success stories. Often, customers deal with complex data from a variety of sources that needs to be transformed and customized through a series of steps to make it useful to different systems and stakeholders. This can be difficult due to the ever-increasing volume, velocity, and variety of data. Today, these data management challenges cannot be solved with traditional databases alone.
Workflow automation helps you build solutions that are repeatable, scalable, and reliable. You can use AWS Step Functions for this. A great example is how SGK used Step Functions to automate the ETL processes for their client. With Step Functions, SGK has been able to automate changes within the data management system, substantially reducing the time required for data processing.
In this post, SGK shares the details of how they used Step Functions to build a robust data processing system based on highly configurable business transformation rules for ETL processes.
SGK: Building dynamic ETL pipelines
SGK is a subsidiary of Matthews International Corporation, a diversified organization focusing on brand solutions and industrial technologies. SGK’s Global Content Creation Studio network creates compelling content and solutions that connect brands and products to consumers through multiple assets including photography, video, and copywriting.
We were recently contracted to build a sophisticated and scalable data management system for one of our clients. We chose to build the solution on AWS to leverage advanced, managed services that help to improve the speed and agility of development.
The data management system served two main functions:
Ingesting a large amount of complex data to facilitate both reporting and product funding decisions for the client’s global marketing and supply chain organizations.
Processing the data through normalization and applying complex algorithms and data transformations. The system's goal was to provide information in the relevant context (such as strategic marketing, supply chain, or product planning) to the end consumer through automated data feeds or updates to existing ETL systems.
We were faced with several challenges:
Output data that needed to be refreshed at least twice a day to provide fresh datasets to both local and global markets. That constant data refresh posed several challenges, especially around data management and replication across multiple databases.
The complexity of reporting business rules that needed to be updated on a constant basis.
Data that could not be processed as contiguous blocks of typical time-series data. The measurement of the data was done across seasons (that is, combinations of dates), which often resulted in up to three overlapping seasons at any given time.
Input data that came from 10+ different data sources. Each data source ranged from 1–20K rows with as many as 85 columns per input source.
These challenges meant that our small Dev team heavily invested time in frequent configuration changes to the system and data integrity verification to make sure that everything was operating properly. Maintaining this system proved to be a daunting task and that’s when we turned to Step Functions—along with other AWS services—to automate our ETL processes.
Solution overview
Our solution included the following AWS services:
AWS Step Functions: Before Step Functions was available, we were using multiple Lambda functions for this use case and running into memory limit issues. With Step Functions, we can execute steps in parallel, in a cost-efficient manner, without running into memory limitations.
AWS Lambda: The Step Functions state machine uses Lambda functions to implement the Task states. Our Lambda functions are implemented in Java 8.
Amazon DynamoDB: Provides us with an easy and flexible way to manage business rules. We specify our rules as Keys, which are key-value pairs stored in a DynamoDB table.
Amazon RDS: Our ETL pipelines consume source data from our RDS MySQL database.
Amazon Redshift: We use Amazon Redshift for reporting purposes because it integrates with our BI tools. Currently we are using Tableau for reporting which integrates well with Amazon Redshift.
Amazon S3: We store our raw input files and intermediate results in S3 buckets.
Amazon CloudWatch Events: Our users expect results at a specific time. We use CloudWatch Events to trigger Step Functions on an automated schedule.
Solution architecture
This solution uses a declarative approach to defining business transformation rules that are applied by the underlying Step Functions state machine as data moves from RDS to Amazon Redshift. An S3 bucket is used to store intermediate results. A CloudWatch Event rule triggers the Step Functions state machine on a schedule. The following diagram illustrates our architecture:
Here are more details for the above diagram:
A rule in CloudWatch Events triggers the state machine execution on an automated schedule.
The state machine invokes the first Lambda function.
The Lambda function deletes all existing records in Amazon Redshift. Depending on the dataset, the Lambda function can create a new table in Amazon Redshift to hold the data.
The same Lambda function then retrieves Keys from a DynamoDB table. Keys represent specific marketing campaigns or seasons and map to specific records in RDS.
The state machine executes the second Lambda function using the Keys from DynamoDB.
The second Lambda function retrieves the referenced dataset from RDS. The records retrieved represent the entire dataset needed for a specific marketing campaign.
The second Lambda function executes in parallel for each Key retrieved from DynamoDB and stores the output in CSV format temporarily in S3.
Finally, the Lambda function uploads the data into Amazon Redshift.
To understand the above data processing workflow, take a closer look at the Step Functions state machine for this example.
We walk you through the state machine in more detail in the following sections.
Walkthrough
To get started, you need to:
Create a schedule in CloudWatch Events
Specify conditions for RDS data extracts
Create Amazon Redshift input files
Load data into Amazon Redshift
Step 1: Create a schedule in CloudWatch Events
Create rules in CloudWatch Events to trigger the Step Functions state machine on an automated schedule. The following is an example cron expression to automate your schedule:
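The cron expression itself is shown as a screenshot in the original post. A sketch that matches the schedule described below, with placeholder rule, target, and ARN names, could look like this; the expression cron(0 3,14 * * ? *) fires at minute 0 of hours 3 and 14 UTC.
import com.amazonaws.services.cloudwatchevents.AmazonCloudWatchEvents;
import com.amazonaws.services.cloudwatchevents.AmazonCloudWatchEventsClientBuilder;
import com.amazonaws.services.cloudwatchevents.model.PutRuleRequest;
import com.amazonaws.services.cloudwatchevents.model.PutTargetsRequest;
import com.amazonaws.services.cloudwatchevents.model.RuleState;
import com.amazonaws.services.cloudwatchevents.model.Target;

// Create a scheduled rule and point it at the state machine (names and ARNs below are placeholders)
AmazonCloudWatchEvents events = AmazonCloudWatchEventsClientBuilder.defaultClient();
events.putRule(new PutRuleRequest()
        .withName("etl-schedule")
        .withScheduleExpression("cron(0 3,14 * * ? *)")
        .withState(RuleState.ENABLED));
events.putTargets(new PutTargetsRequest()
        .withRule("etl-schedule")
        .withTargets(new Target()
                .withId("etl-state-machine")
                .withArn("arn:aws:states:us-west-2:XXXXXXXXXXXX:stateMachine:EtlStateMachine")
                .withRoleArn("arn:aws:iam::XXXXXXXXXXXX:role/EventsInvokeStepFunctions")));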
In this example, the cron expression invokes the Step Functions state machine at 3:00am and 2:00pm (UTC) every day.
Step 2: Specify conditions for RDS data extracts
We use DynamoDB to store Keys that determine which rows of data to extract from our RDS MySQL database. An example Key is MCS2017, which stands for Marketing Campaign Spring 2017. Each campaign has a specific start and end date, and the corresponding dataset is stored in RDS MySQL. A record in RDS contains about 600 columns, and each Key can represent up to 20K records.
A given day can have multiple campaigns with different start and end dates running simultaneously. In the following example DynamoDB item, three campaigns are specified for the given date.
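The example item itself is shown as a screenshot in the original post. A sketch of how an equivalent item could be written with the DynamoDB Document API, with placeholder table, attribute, and campaign names, might look like this:
import java.util.Arrays;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

// Store the campaign Keys that are active on a given date (table, attribute, and key values are placeholders)
DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());
Table campaignKeys = dynamoDB.getTable("CampaignKeys");
campaignKeys.putItem(new Item()
        .withPrimaryKey("Date", "2017-04-03")
        .withList("Keys", Arrays.asList("MCS2017", "MCF2017", "MCW2017")));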
The state machine example shown above uses Keys 31, 32, and 33 in the first ChoiceState and Keys 21 and 22 in the second ChoiceState. These keys represent marketing campaigns for a given day. For example, on Monday, there are only two campaigns requested. The ChoiceState with Keys 21 and 22 is executed. If three campaigns are requested on Tuesday, for example, then ChoiceState with Keys 31, 32, and 33 is executed. MCS2017 can be represented by Key 21 and Key 33 on Monday and Tuesday, respectively. This approach gives us the flexibility to add or remove campaigns dynamically.
Step 3: Create Amazon Redshift input files
When the state machine begins execution, the first Lambda function is invoked as the resource for FirstState, represented in the Step Functions state machine as follows:
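The state machine definition appears as an image in the original post; in Amazon States Language, a FirstState of this kind would be a Task state along these lines (the Lambda ARN and the name of the next state are assumptions):
"FirstState": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:FirstLambdaFunction",
  "Next": "ChoiceState"
}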
As described in the solution architecture, the purpose of this Lambda function is to delete existing data in Amazon Redshift and retrieve keys from DynamoDB. In our use case, we found that deleting existing records was more efficient and less time-consuming than finding the delta and updating existing records. On average, an Amazon Redshift table can contain about 36 million cells, which translates to roughly 65K records. The following is the code snippet for the first Lambda function in Java 8:
public class LambdaFunctionHandler implements RequestHandler<Map<String,Object>,Map<String,String>> {
Map<String,String> keys= new HashMap<>();
public Map<String, String> handleRequest(Map<String, Object> input, Context context){
Properties config = getConfig();
// 1. Cleaning Redshift Database
new RedshiftDataService(config).cleaningTable();
// 2. Reading data from Dynamodb
List<String> keyList = new DynamoDBDataService(config).getCurrentKeys();
for(int i = 0; i < keyList.size(); i++) {
keys.put(”key" + (i+1), keyList.get(i));
}
keys.put(”key" + T,String.valueOf(keyList.size()));
// 3. Returning the key values and the key count from the “for” loop
return (keys);
}
}
The variable $.keyT represents the number of keys retrieved from DynamoDB. This variable determines which of the parallel branches should be executed. At the time of publication, Step Functions does not support dynamic parallel state. Therefore, choices under ChoiceState are manually created and assigned hardcoded StringEquals values. These values represent the number of parallel executions for the second Lambda function.
For example, if $.keyT equals 3, the second Lambda function is executed three times in parallel with keys, $key1, $key2 and $key3 retrieved from DynamoDB. Similarly, if $.keyT equals two, the second Lambda function is executed twice in parallel. The following JSON represents this parallel execution:
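The JSON in the original post is shown as an image. A minimal sketch of the arrangement described above, with assumed state names, a placeholder Lambda ARN, and InputPath used to hand each branch its key, could look like this (the two-key Parallel state would be analogous):
"ChoiceState": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.keyT", "StringEquals": "3", "Next": "ThreeKeysInParallel" },
    { "Variable": "$.keyT", "StringEquals": "2", "Next": "TwoKeysInParallel" }
  ]
},
"ThreeKeysInParallel": {
  "Type": "Parallel",
  "End": true,
  "Branches": [
    { "StartAt": "ProcessKey1", "States": { "ProcessKey1": { "Type": "Task", "InputPath": "$.key1",
        "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:SecondLambdaFunction", "End": true } } },
    { "StartAt": "ProcessKey2", "States": { "ProcessKey2": { "Type": "Task", "InputPath": "$.key2",
        "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:SecondLambdaFunction", "End": true } } },
    { "StartAt": "ProcessKey3", "States": { "ProcessKey3": { "Type": "Task", "InputPath": "$.key3",
        "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:SecondLambdaFunction", "End": true } } }
  ]
}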
Step 4: Load data into Amazon Redshift
The second Lambda function in the state machine extracts the records from RDS associated with the keys retrieved from DynamoDB. It processes the data and then loads it into an Amazon Redshift table. The following is the code snippet for the second Lambda function in Java 8.
public class LambdaFunctionHandler implements RequestHandler<String, String> {
public static String key = null;
public String handleRequest(String input, Context context) {
key=input;
// 1. Getting basic configurations for the next classes + S3 client
Properties config = getConfig();
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
// 2. Export query results from RDS into S3 bucket
new RdsDataService(config).exportDataToS3(s3,key);
// 3. Import query results from S3 bucket into Redshift
new RedshiftDataService(config).importDataFromS3(s3,key);
System.out.println(input);
return "SUCCESS";
}
}
After the data is loaded into Amazon Redshift, end users can visualize it using their preferred business intelligence tools.
Lessons learned
At the time of publication, the 1.5 GB memory hard limit for Lambda functions was inadequate for processing our complex workload. Step Functions gave us the flexibility to chunk our large datasets and process them in parallel, saving on costs and time.
In our previous implementation, we assigned each key a dedicated Lambda function along with CloudWatch rules for schedule automation. This approach proved to be inefficient and quickly became an operational burden. Previously, we processed each key sequentially, with each key adding about five minutes to the overall processing time. For example, processing three keys meant that the total processing time was three times longer. With Step Functions, the entire state machine executes in about five minutes.
Using DynamoDB with Step Functions gave us the flexibility to manage keys efficiently. In our previous implementations, keys were hardcoded in Lambda functions, which became difficult to manage due to frequent updates. DynamoDB is a great way to store dynamic data that changes frequently, and it works perfectly with our serverless architectures.
Conclusion
With Step Functions, we were able to fully automate the frequent configuration updates to our dataset, resulting in significant cost savings, reduced risk of data errors due to system downtime, and more time for us to focus on new product development rather than support-related issues. We hope that you have found the information useful and that it can serve as a jump-start to building your own ETL processes on AWS with managed AWS services.
For more information about how Step Functions makes it easy to coordinate the components of distributed applications and microservices in any workflow, see the use case examples and then build your first state machine in under five minutes in the Step Functions console.
If you have questions or suggestions, please comment below.
NAS + CLOUD GIVEAWAY FROM MORRO DATA AND BACKBLAZE
Backblaze and Morro Data have teamed up to offer a hardware and software package giveaway that combines the best of NAS and the cloud for managing your photos and videos. You’ll find information about how to enter this promotion at the end of this post.
Whether you’re a serious amateur photographer, an Instagram fanatic, or a professional videographer, you’ve encountered the challenge of accessing, organizing, and storing your growing collection of digital photos and videos. The problems are similar for both amateur and professional — they vary chiefly in scale and cost — and the choices for addressing this challenge increase in number and complexity every day.
In this post we’ll be talking about the basics of managing digital photos and videos and trying to define the goals for a good digital asset management system (DAM). There’s a lot to cover, and we can’t get to all of it in one post. We will write more on this topic in future posts.
To start off, what is digital asset management (DAM)? In his book, The DAM Book: Digital Asset Management for Photographers, author Peter Krogh describes DAM as a term that refers to your entire digital photography ecosystem and how you work with it. It comprises the choices you make about every component of your digital photography practice.
Anyone considering how to manage their digital assets will need to consider the following questions:
How do I like to work, and need to work if I have clients, partners, or others with whom I need to cooperate?
What are the software and hardware options I need to consider to set up an efficient system that suits my needs?
How do DAS (direct-attached storage), NAS (network-attached storage), the cloud, and other storage solutions fit into a working system?
Is there a difference between how and where I back up and archive my files?
How do I find media files in my collection?
How do I handle a digital archive that just keeps growing and growing?
How do I make sure that the methods and system I choose won’t lock me into a closed-end, proprietary system?
Tell us what you’re using for digital media management
Earlier this week we published a post entitled What’s the Best Solution for Managing Digital Photos and Videos? in which we asked our readers to tell us how they manage their media files and what they would like to have in an ideal system. We’ll write a post after the first of the year based on the replies we receive. We encourage you to visit this week’s post and contribute your comments to the conversation.
Getting Started with Digital Asset Management
Whether you have hundreds, thousands, or millions of digital media files, you’re going to need a plan on how to manage them. Let’s start with the goals for what a good digital media management plan should look like.
Goals of a Good Digital Media Management System
1) Don’t lose your files
At the very least, your system should preserve files you wish to keep for future use. A good system will be reliable, support maintaining multiple copies of your data, and will integrate well with your data backup strategy. You should analyze each step of how you handle your cameras, memory cards, disks, and other storage media to understand the points at which your data is most vulnerable and how to minimize the possibility of data loss.
2) Find media when you need it
Your system should enable you to find files when you need them.
3) Work economically
You want a system that meets your budget and doesn’t waste your time.
4) Edit or Enhance the images or video
You’ll want the ability to make changes, change formats, and repurpose your media for different uses.
5) Share media in ways you choose
A good system will help you share your files with clients, friends, and family, giving you choices of different media, formats, and control over access and privacy.
6) Doesn’t lock your media into a proprietary system
Your system shouldn’t lock you into file formats, proprietary protocols, or make it difficult or impossible to get your media out of a particular vendor’s environment. You want a system that uses common and open formats and protocols to maintain the compatibility of your media with as yet unknown hardware and software you might want to use in the future.
Media Storage Options
Photographers and videographers differ in aspects of their workflow, and amateurs and professionals have different needs and options, but there are some common elements that are typically found in a digital media workflow:
Data is collected in a digital camera
Data is copied from the camera to a computer, a transport device, or a storage device
Data is brought into a computer system where original files are typically backed up and copies made for editing and enhancement (depending on type of system)
Data files are organized into folders, and metadata added or edited to aid in record keeping and finding files in the future
Files are edited and enhanced, with backups made during the process
File formats might be changed manually or automatically depending on system
Versions are created for client review, sharing, posting, publishing, or other uses
File versions are archived either manually or automatically
Files await possible future retrieval and use
These days, most of our digital media devices have multiple options for getting the digital media out of the camera. Those options can include Wi-Fi, direct cable connection, or one of a number of types and makes of memory cards. If your digital media device of choice is a smartphone, then you’re used to syncing your recent photos with your computer or a cloud service. If you sync with Apple Photos/iCloud or Google Photos, then one of those services may fulfill just about all your needs for managing your digital media.
If you’re a serious amateur or professional, your solution is more complex. You likely transfer your media from the camera to a computer or storage device (perhaps waiting to erase the memory cards until you’re sure you’ve safely got multiple copies of your files). The computer might already contain your image or video editing tools, or you might use it as a device to get your media back to your home or studio.
If you’ve got a fast internet connection, you might transfer your files to the cloud for safekeeping, to send them to a co-worker so she can start working on them, or to give your client a preview of what you’ve got. The cloud is also useful if you need the media to be accessible from different locations or on various devices.
If you’ve been working for a while, you might have data stored in some older formats such as CD, DVD, DVD-RAM, Zip, Jaz, or other format. Besides the inevitable degradation that occurs with older media, just finding a device to read the data can be a challenge, and it doesn’t get any easier as time passes. If you have data in older formats that you wish to save, you should transfer and preserve that data as soon as possible.
Let’s address the different types of storage devices and approaches.
Direct-attached Storage (DAS)
DAS includes any type of drive that is either internal to your computer and connected via the host bus adapter (HBA) using a common bus protocol such as ATA, SATA, or SCSI, or externally connected to the computer through, for example, USB or Thunderbolt.
Solid-state drives (SSD) are popular these days for their speed and reliability. In a system with different types of drives, it’s best to put your OS, applications, and video files on the fastest drive (typically the SSD), and use the slower drives when speed is not as critical.
A DAS device is directly accessible only from the host to which the DAS is attached, and only when the host is turned on, as the DAS incorporates no networking hardware or environment. Data on DAS can be shared on a network through capabilities provided by the operating system used on the host.
DAS can include a single drive attached via a single cable, multiple drives attached in a series, or multiple drives combined into a virtual unit by hardware and software, an example of which is RAID (Redundant Array of Inexpensive [or Independent] Disks). Storage virtualization such as RAID combines multiple physical disk drive components into one or more logical units for the purposes of data redundancy, performance improvement, or both.
Network-attached Storage (NAS)
A popular option these days is the use of network-attached storage (NAS) for storing working data, backing up data, and sharing data with co-workers. Compared to general purpose servers, NAS can offer several advantages, including faster data access, easier administration, and simple configuration through a web interface.
Users have the choice of a wide number of NAS vendors and storage approaches from vendors such as Morro Data, QNAP, Synology, Drobo, and many more.
NAS uses file-based protocols such as NFS (popular on UNIX systems), SMB/CIFS (Server Message Block/Common Internet File System used with MS Windows systems), AFP (used with Apple Macintosh computers), or NCP (used with OES and Novell NetWare). Multiple protocols are often supported by a single NAS device. NAS devices frequently include RAID or similar capability, providing virtualized storage and often performance improvements.
NAS devices are popular for digital media files due to their large capacities, data protection capabilities, speed, expansion options through adding more and bigger drives, and the ability to share files on a local office or home network or more widely on the internet. NAS devices often include the capability to back up the data on the NAS to another NAS or to the cloud, making them a great hub for a digital media management system.
The Cloud
The cloud is becoming increasingly attractive as a component of a digital asset management system due to a number of inherent advantages:
Cloud data centers employ redundant technologies to protect the integrity of the stored data
Data stored in the cloud can be shared, if desired
Cloud storage is effectively limitless, as opposed to DAS and most NAS implementations
Cloud storage can be accessed through a wide range of interfaces, and APIs (Application Programming Interfaces), making cloud storage extremely flexible
Cloud storage supports an extensive ecosystem of add-on hardware, software, and applications to enhance your DAM. Backblaze’s B2 Cloud Storage, for example, has a long list of integrations with media-oriented partners such as Axle video, Cantemo, Cubix, and others
Anyone working with digital media will tell you that the biggest challenge with the cloud is the large amount of data that must be transferred to the cloud, especially if someone already has a large library of media that exists on drives that they want to put into the cloud. Internet access speeds are getting faster, but not fast enough for users like Drew Geraci (known for his incredible time lapse photography and other work, including the opening to Netflix’s House of Cards), who told me he can create one terabyte of data in just five minutes when using nine 8K cameras simultaneously.
While we wait for everyone to get 10GB broadband transfer speeds, there are other options, such as Backblaze’s Fireball, which enables B2 Cloud Storage users to copy up to 40TB of data to a drive and send it directly to Backblaze.
There are technologies available that can accelerate internet TCP/IP speeds and enable faster data transfers to and from cloud storage such as Backblaze B2. We’ll be writing about these technologies in a future post.
CloudNAS
A recent entry into the storage space is Morro Data and their CloudNAS solution. Files are stored in the cloud, cached locally on a CloudNAS device as needed, and synced globally among the other CloudNAS systems in a given organization. To the user, all of their files are listed in one catalog, but they could be stored locally or in the cloud. Another advantage is that uploads to the cloud are done behind the scenes as time and network permit. A file stays local until it is safely stored in the B2 Cloud; after that, it may be removed from the CloudNAS device, depending on how often it is accessed. There are more details on the CloudNAS solution in our A New Twist on Data Backup: CloudNAS blog post. (See below for how to enter our Backblaze/Morro Data giveaway.)
Cataloging and Searching Your Media
A key component of any DAM system is the ability to find files when you need them. You’ll want the ability to catalog all of your digital media, assign keywords and metadata that make sense for the way you work, and have that catalog available and searchable even when the digital files themselves are located on various drives, in the cloud, or even disconnected from your current system.
Adobe’s Lightroom is a popular application for cataloging and managing image workflow. Lightroom can handle an enormous number of files, and has a flexible catalog that can be stored locally and used to search for files that have been archived to different storage devices. Users debate whether one master catalog or multiple catalogs are the best way to work in Lightroom. In any case, it’s critical that you back up your DAM catalogs as diligently as you back up your digital media.
The latest version of Lightroom, Lightroom CC (distinguished from Lightroom CC Classic), is coupled with Adobe’s Creative Cloud service. In addition to the subscription plan for Lightroom and other Adobe Suite applications, you’ll need to choose and pay a subscription fee for how much storage you wish to use in Adobe’s Creative Cloud. You don’t get a choice of other cloud vendors.
Another popular option for image editing is Phase One Capture One, and Phase One Media Pro SE for cataloging and management. Macphun’s Luminar is available for both Macintosh and Windows. Macphun has announced that it will launch a digital asset manager component for Luminar in 2018 that will compete with Adobe’s offering for a complete digital image workflow.
Any media management system needs to include or work seamlessly with the editing and enhancement tools you use for photos or videos. We’ve already talked about some cataloging solutions that include image editing as well. Some of the mainstream photo apps, such as Google Photos and Apple Photos, include rudimentary to mid-level editing tools. It’s up to the more capable applications to deliver the power needed for real photo or video editing, e.g. Adobe Photoshop, Adobe Lightroom, Macphun’s Luminar, and Phase One Capture One for photography, and Adobe Premiere, Apple Final Cut Pro, or Avid Media Composer (among others) for video editing.
Ensuring Future Compatibility for Your Media
Images come out of your camera in a variety of formats. Camera makers have their proprietary raw file formats (CR2 from Canon, NEF from Nikon, for example), and Adobe has a proprietary, but open, standard for digital images called DNG (Digital Negative) that is used in Lightroom and products from other vendors, as well.
Whichever you choose, be aware that you are betting that whichever format you use will be supported years down the road when you go back to your files and want to open a file with whatever will be your future photo/video editing setup. So always think of the future and consider the solution that is most likely to still be supported in future applications.
There are myriad aspects to a digital asset management system, and as we said at the outset, many choices to make. We hope you’ll take us up on our request to tell us what you’re using to manage your photos and videos and what an ideal system for you would look like. We want to make Backblaze Backup and B2 Cloud Storage more useful to our customers, and your input will help us do that.
In the meantime, why not enter the Backblaze + Morro Data Promotion described below. You could win!
ENTER TO WIN A DREAM DIGITAL MEDIA COMBO
Morro Data and Backblaze Team Up to Deliver the Dream Digital Media Backup Solution
Visit Dream Photo Backup to learn about this combination of NAS, software, and the cloud that provides a complete solution for managing, archiving, and accessing your digital media files. You’ll have the opportunity to win Morro Data’s CacheDrive G40 (with 1TB of HDD cache), an annual subscription to CloudNAS Basic Global File Services, and $100 of Backblaze B2 Cloud Storage. The total value of this package is greater than $700. Enter at Dream Photo Backup.
You’ve probably heard about GDPR, the new European data protection regulation that applies to practically everyone. Especially if you are working in a big company, it’s most likely that there’s already a process for getting your systems in compliance with the regulation.
The regulation is basically a law that must be followed in all European countries, and it also applies to companies that are not registered in Europe but have European customers. So that’s most companies. I will not go into yet another “12 facts about GDPR” or “7 myths about GDPR” posts/whitepapers, as they are often aimed at managers or legal people. Instead, I’ll focus on what GDPR means for developers.
Why am I qualified to do that? A few reasons: I was an advisor to the deputy prime minister of an EU country, and because of that I’ve both been exposed to legislation and written some myself. I’m familiar with the “legalese” and how the regulatory framework operates in general. I’m also a privacy advocate and I’ve been writing about GDPR-related topics in the past, i.e. “before it was cool” (protecting sensitive data, the right to be forgotten). And finally, I’m currently working on a project that (among other things) aims to help with covering some GDPR aspects.
I’ll try to be a bit more comprehensive this time and cover as many aspects of the regulation that concern developers as I can. And while developers will mostly be concerned about how the systems they are working on have to change, it’s not unlikely that a less informed manager storms in in late spring, realizing GDPR is going to be in force tomorrow, asking “what should we do to get our system/website compliant”.
The rights of the user/client (referred to as “data subject” in the regulation) that I think are relevant for developers are: the right to erasure (the right to be forgotten/deleted from the system), the right to restriction of processing (you still keep the data, but mark it as “restricted” and don’t touch it without further consent by the user), the right to data portability (the user should be able to get a machine-readable dump of all their data), the right to rectification (the ability to get personal data fixed), the right to be informed (getting human-readable information, rather than long terms and conditions), and the right of access (the user should be able to see all the data you have about them).
Additionally, the relevant basic principles are: data minimization (one should not collect more data than necessary), integrity and confidentiality (all security measures to protect data that you can think of + measures to guarantee that the data has not been inappropriately modified).
Even further, the regulation requires certain processes to be in place within an organization (of more than 250 employees or if a significant amount of data is processed), and those include keeping a record of all types of processing activities carried out, including transfers to processors (3rd parties), which includes cloud service providers. None of the other requirements of the regulation have an exception depending on the organization size, so “I’m small, GDPR does not concern me” is a myth.
It is important to know what “personal data” is. Basically, it’s every piece of data that can be used to uniquely identify a person or data that is about an already identified person. It’s data that the user has explicitly provided, but also data that you have collected about them from either 3rd parties or based on their activities on the site (what they’ve been looking at, what they’ve purchased, etc.)
Having said that, I’ll list a number of features that will have to be implemented and some hints on how to do that, followed by some do’s and don’ts.
“Forget me” – you should have a method that takes a userId and deletes all personal data about that user (in case it has been collected on the basis of consent, and not due to contract enforcement or legal obligation). It is actually useful for integration tests to have that feature (to clean up after the test), but it may be hard to implement depending on the data model. In a regular data model, deleting a record may be easy, but some foreign keys may be violated. That means you have two options – either make sure you allow nullable foreign keys (for example, an order usually has a reference to the user that made it, but when the user requests their data be deleted, you can set the userId to null), or make sure you delete all related data (e.g. via cascades). This may not be desirable, e.g. if the order is used to track available quantities or for accounting purposes. It’s a bit trickier for event-sourcing data models, or in extreme cases, ones that include some sort of blockchain/hash chain/tamper-evident data structure. With event sourcing you should be able to remove a past event and re-generate intermediate snapshots. For blockchain-like structures – be careful what you put in there and avoid putting personal data of users. There is an option to use a chameleon hash function, but that’s suboptimal. Overall, you must constantly think of how you can delete the personal data. And “our data model doesn’t allow it” isn’t an excuse.
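To make that concrete, here is a minimal sketch of the nullable-foreign-key approach with plain JDBC; the table and column names are made up, and a real implementation would run in a transaction and cover every table that holds personal data:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Keep the orders for accounting, but drop their link to the user; then delete the user record itself
static void forgetUser(Connection connection, long userId) throws SQLException {
    try (PreparedStatement detachOrders = connection.prepareStatement(
                "UPDATE orders SET user_id = NULL WHERE user_id = ?");
         PreparedStatement deleteUser = connection.prepareStatement(
                "DELETE FROM users WHERE id = ?")) {
        detachOrders.setLong(1, userId);
        detachOrders.executeUpdate();
        deleteUser.setLong(1, userId);
        deleteUser.executeUpdate();
    }
}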
Notify 3rd parties for erasure – deleting things from your system may be one thing, but you are also obligated to inform all third parties that you have pushed that data to. So if you have sent personal data to, say, Salesforce, Hubspot, twitter, or any cloud service provider, you should call an API of theirs that allows for the deletion of personal data. If you are such a provider, obviously, your “forget me” endpoint should be exposed. Calling the 3rd party APIs to remove data is not the full story, though. You also have to make sure the information does not appear in search results. Now, that’s tricky, as Google doesn’t have an API for removal, only a manual process. Fortunately, it’s only about public profile pages that are crawlable by Google (and other search engines, okay…), but you still have to take measures. Ideally, you should make the personal data page return a 404 HTTP status, so that it can be removed.
Restrict processing – in your admin panel where there’s a list of users, there should be a button “restrict processing”. The user settings page should also have that button. When clicked (after reading the appropriate information), it should mark the profile as restricted. That means it should no longer be visible to the backoffice staff, or publicly. You can implement that with a simple “restricted” flag in the users table and a few if-clauses here and there.
Export data – there should be another button – “export data”. When clicked, the user should receive all the data that you hold about them. What exactly that data is depends on the particular use case. Usually it’s at least the data that you delete with the “forget me” functionality, but it may include additional data (e.g. the orders the user has made may not be deleted, but should be included in the dump). The structure of the dump is not strictly defined, but my recommendation would be to reuse schema.org definitions as much as possible, for either JSON or XML. If the data is simple enough, a CSV/XLS export would also be fine. Sometimes data export can take a long time, so the button can trigger a background process, which would then notify the user via email when their data is ready (Twitter, for example, does that already – you can request all your tweets and you get them after a while).
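As an illustration, a minimal export sketch using Jackson, with a loosely schema.org-shaped structure; the field names and the way you gather the data depend on your own model:
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

// Assemble everything you hold about the user into one pretty-printed JSON document
static String exportUserData(Map<String, Object> personalData, List<Map<String, Object>> orders)
        throws JsonProcessingException {
    Map<String, Object> dump = new LinkedHashMap<>();
    dump.put("@context", "http://schema.org");
    dump.put("@type", "Person");
    dump.putAll(personalData);
    dump.put("orders", orders); // related data that is kept (e.g. for accounting) but still exported
    return new ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(dump);
}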
Allow users to edit their profile – this seems an obvious rule, but it isn’t always followed. Users must be able to fix all data about them, including data that you have collected from other sources (e.g. using a “login with facebook” you may have fetched their name and address). Rule of thumb – all the fields in your “users” table should be editable via the UI. Technically, rectification can be done via a manual support process, but that’s normally more expensive for a business than just having the form to do it. There is one other scenario, however, when you’ve obtained the data from other sources (i.e. the user hasn’t provided their details to you directly). In that case there should still be a page where they can identify somehow (via email and/or sms confirmation) and get access to the data about them.
Consent checkboxes – this is in my opinion the biggest change that the regulation brings. “I accept the terms and conditions” would no longer be sufficient to claim that the user has given their consent for processing their data. So, for each particular processing activity there should be a separate checkbox on the registration (or user profile) screen. You should keep these consent checkboxes in separate columns in the database, and let the users withdraw their consent (by unchecking these checkboxes from their profile page – see the previous point). Ideally, these checkboxes should come directly from the register of processing activities (if you keep one). Note that the checkboxes should not be preselected, as this does not count as “consent”.
Re-request consent – if the consent users have given was not clear (e.g. if they simply agreed to terms & conditions), you’d have to re-obtain that consent. So prepare a functionality for mass-emailing your users to ask them to go to their profile page and check all the checkboxes for the personal data processing activities that you have.
“See all my data” – this is very similar to the “Export” button, except data should be displayed in the regular UI of the application rather than an XML/JSON format. For example, Google Maps shows you your location history – all the places that you’ve been to. It is a good implementation of the right to access. (Though Google is very far from perfect when privacy is concerned)
Age checks – you should ask for the user’s age, and if the user is a child (below 16), you should ask for parental permission. There’s no clear way how to do that, but my suggestion is to introduce a flow where the child specifies the email of a parent, who can then confirm. Obviously, children will just cheat with their birthdate, or provide a fake parent email, but you will most likely have done your job according to the regulation (this is one of the “wishful thinking” aspects of the regulation).
Now some “do’s”, which are mostly about the technical measures needed to protect personal data. They may be more “ops” than “dev”, but often the application also has to be extended to support them. I’ve listed most of what I could think of in a previous post.
Encrypt the data in transit. That means that communication between your application layer and your database (or your message queue, or whatever component you have) should be over TLS. The certificates could be self-signed (and possibly pinned), or you could have an internal CA. Different databases have different configurations; just google “X encrypted connections”. Some databases need gossiping among the nodes – that should also be configured to use encryption.
Encrypt the data at rest – this again depends on the database (some offer table-level encryption), but can also be done on machine-level. E.g. using LUKS. The private key can be stored in your infrastructure, or in some cloud service like AWS KMS.
Encrypt your backups – kind of obvious
Implement pseudonymisation – the most obvious use-case is when you want to use production data for the test/staging servers. You should change the personal data to some “pseudonym”, so that the people cannot be identified. When you push data for machine learning purposes (to third parties or not), you can also do that. Technically, that could mean that your User object can have a “pseudonymize” method which applies hash+salt/bcrypt/PBKDF2 for some of the data that can be used to identify a person
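For example, a pseudonymize helper based on the JDK’s built-in PBKDF2 could look like the following sketch; the iteration count and how you manage the salt are choices you’d make per dataset, keeping in mind that the same input and salt should always map to the same pseudonym:
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Derive a stable pseudonym for an identifying value; the same value + salt always produces the same output
static String pseudonymize(String identifyingValue, byte[] salt) throws Exception {
    PBEKeySpec spec = new PBEKeySpec(identifyingValue.toCharArray(), salt, 65536, 256);
    byte[] derived = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).getEncoded();
    return Base64.getEncoder().encodeToString(derived);
}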
Protect data integrity – this is a very broad thing, and could simply mean “have authentication mechanisms for modifying data”. But you can do something more, even as simple as a checksum, or a more complicated solution (like the one I’m working on). It depends on the stakes, on the way data is accessed, on the particular system, etc. The checksum can be in the form of a hash of all the data in a given database record, which should be updated each time the record is updated through the application. It isn’t a strong guarantee, but it is at least something.
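A simple version of that checksum, hashing the record’s fields in a stable order, could be sketched like this; where you store it and when you recompute it is application-specific:
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hash all fields of a record in sorted-key order; store the hex digest alongside the record and
// recompute it on every update made through the application
static String recordChecksum(Map<String, String> recordFields) throws NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    for (Map.Entry<String, String> field : new TreeMap<>(recordFields).entrySet()) {
        digest.update(field.getKey().getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0); // separator so adjacent values cannot be confused
        digest.update(String.valueOf(field.getValue()).getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0);
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
    }
    return hex.toString();
}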
Have your GDPR register of processing activities in something other than Excel – Article 30 says that you should keep a record of all the types of activities that you use personal data for. That sounds like bureaucracy, but it may be useful – you will be able to link certain aspects of your application with that register (e.g. the consent checkboxes, or your audit trail records). It wouldn’t take much time to implement a simple register, but the business requirements for that should come from whoever is responsible for the GDPR compliance. But you can advise them that having it in Excel won’t make it easy for you as a developer (imagine having to fetch the excel file internally, so that you can parse it and implement a feature). Such a register could be a microservice/small application deployed separately in your infrastructure.
Log access to personal data – every read operation on a personal data record should be logged, so that you know who accessed what and for what purpose
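The shape of such a log entry could be as simple as the following sketch; whether it goes into a database table, an append-only file, or a logging pipeline is up to you:
import java.time.Instant;

// One entry per read of a personal-data record: who accessed whose data, and why
class PersonalDataAccessEntry {
    final Instant timestamp = Instant.now();
    final String actorId;        // the employee or system component that read the data
    final String dataSubjectId;  // the person whose data was read
    final String purpose;        // the processing activity this read belongs to

    PersonalDataAccessEntry(String actorId, String dataSubjectId, String purpose) {
        this.actorId = actorId;
        this.dataSubjectId = dataSubjectId;
        this.purpose = purpose;
    }
}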
Register all API consumers – you shouldn’t allow anonymous API access to personal data. I’d say you should request the organization name and contact person for each API user upon registration, and add those to the data processing register. Note: some have treated article 30 as a requirement to keep an audit log. I don’t think it is saying that – instead it requires 250+ companies to keep a register of the types of processing activities (i.e. what you use the data for). There are other articles in the regulation that imply that keeping an audit log is a best practice (for protecting the integrity of the data as well as to make sure it hasn’t been processed without a valid reason)
Finally, some don’ts.
Don’t use data for purposes that the user hasn’t agreed with – that’s supposed to be the spirit of the regulation. If you want to expose a new API to a new type of clients, or you want to use the data for some machine learning, or you decide to add ads to your site based on users’ behaviour, or sell your database to a 3rd party – think twice. I would imagine your register of processing activities could have a button to send notification emails to users to ask them for permission when a new processing activity is added (or if you use a 3rd party register, it should probably give you an API). So upon adding a new processing activity (and adding that to your register), mass email all users from whom you’d like consent.
Don’t log personal data – getting rid of the personal data from log files (especially if they are shipped to a 3rd party service) can be tedious or even impossible. So log just identifiers if needed. And make sure old log files are cleaned up, just in case.
Don’t put fields on the registration/profile form that you don’t need – it’s always tempting to just throw in as many fields as the usability person/designer agrees on, but unless you absolutely need the data for delivering your service, you shouldn’t collect it. Names you should probably always collect, but unless you are delivering something, a home address or phone number is unnecessary.
Don’t assume 3rd parties are compliant – you are responsible if there’s a data breach in one of the 3rd parties (e.g. “processors”) to which you send personal data. So before you send data via an API to another service, make sure they have at least a basic level of data protection. If they don’t, raise a flag with management.
Don’t assume having ISO XXX makes you compliant – information security standards and even personal data standards are a good start and they will probably cover 70% of what the regulation requires, but they are not sufficient – most of the things listed above are not covered in any of those standards.
Overall, the purpose of the regulation is to make you take conscious decisions when processing personal data. It imposes best practices in a legal way. If you follow the above advice and design your data model, storage, data flow, and API calls with data protection in mind, then you shouldn’t worry about the huge fines that the regulation prescribes – they are for extreme cases, like Equifax for example. Regulators (data protection authorities) will most likely have some checklists into which you’d have to somehow fit, but if you follow best practices, that shouldn’t be an issue.
I think all of the above features can be implemented in a few weeks by a small team. Be suspicious when a big vendor offers you a generic plug-and-play “GDPR compliance” solution. GDPR is not just about the technical aspects listed above – it does have organizational/process implications. But also be suspicious if a consultant claims GDPR is complicated. It’s not – it relies on a few basic principles that are in fact best practices anyway. Just don’t ignore them.
The AWS Knowledge Center helps answer the questions most frequently asked by AWS Support customers. The following 10 Knowledge Center security articles and videos have been the most viewed this month. It’s likely you’ve wondered about a few of these topics yourself, so here’s a chance to learn the answers!
The White House has released a new version of the Vulnerabilities Equities Process (VEP). This is the inter-agency process by which the US government decides whether to inform the software vendor of a vulnerability it finds, or keep it secret and use it to eavesdrop on or attack other systems. You can read the new policy or the fact sheet, but the best place to start is Cybersecurity Coordinator Rob Joyce’s blog post.
In considering a way forward, there are some key tenets on which we can build a better process.
Improved transparency is critical. The American people should have confidence in the integrity of the process that underpins decision making about discovered vulnerabilities. Since I took my post as Cybersecurity Coordinator, improving the VEP and ensuring its transparency have been key priorities, and we have spent the last few months reviewing our existing policy in order to improve the process and make key details about the VEP available to the public. Through these efforts, we have validated much of the existing process and ensured a rigorous standard that considers many potential equities.
The interests of all stakeholders must be fairly represented. At a high level we consider four major groups of equities: defensive equities; intelligence / law enforcement / operational equities; commercial equities; and international partnership equities. Additionally, ordinary people want to know the systems they use are resilient, safe, and sound. These core considerations, which have been incorporated into the VEP Charter, help to standardize the process by which decision makers weigh the benefit to national security and the national interest when deciding whether to disclose or restrict knowledge of a vulnerability.
Accountability of the process and those who operate it is important to establish confidence in those served by it. Our public release of the unclassified portions of the Charter will shed light on aspects of the VEP that were previously shielded from public review, including who participates in the VEP’s governing body, known as the Equities Review Board. We make it clear that departments and agencies with protective missions participate in VEP discussions, as well as other departments and agencies that have broader equities, like the Department of State and the Department of Commerce. We also clarify what categories of vulnerabilities are submitted to the process and ensure that any decision not to disclose a vulnerability will be reevaluated regularly. There are still important reasons to keep many of the specific vulnerabilities evaluated in the process classified, but we will release an annual report that provides metrics about the process to further inform the public about the VEP and its outcomes.
Our system of government depends on informed and vigorous dialogue to discover and make available the best ideas that our diverse society can generate. This publication of the VEP Charter will likely spark discussion and debate. This discourse is important. I also predict that articles will make breathless claims of “massive stockpiles” of exploits while describing the issue. That simply isn’t true. The annual reports and transparency of this effort will reinforce that fact.
Mozilla is pleased with the new charter. I am less so; it looks to me like the same old policy with some new transparency measures — which I’m not sure I trust. The devil is in the details, and we don’t know the details — and it has giant loopholes that pretty much anything can fall through:
The United States Government’s decision to disclose or restrict vulnerability information could be subject to restrictions by partner agreements and sensitive operations. Vulnerabilities that fall within these categories will be cataloged by the originating Department/Agency internally and reported directly to the Chair of the ERB. The details of these categories are outlined in Annex C, which is classified. Quantities of excepted vulnerabilities from each department and agency will be provided in ERB meetings to all members.
There’s a lot we don’t know about the VEP. The Washington Post says that the NSA used EternalBlue “for more than five years,” which implies that it was discovered after the 2010 process was put in place. It’s not clear if all vulnerabilities are given such consideration, or if bugs are periodically reviewed to determine if they should be disclosed. That said, any VEP that allows something as dangerous as EternalBlue — or the Cisco vulnerabilities that the Shadow Brokers leaked last August — to remain unpatched for years isn’t serving national security very well. As a former NSA employee said, the quality of intelligence that could be gathered was “unreal.” But so was the potential damage. The NSA must avoid hoarding vulnerabilities.
I stand by that, and am not sure the new policy changes anything.
EDITED TO ADD (11/22): Adam Shostack points out that the process does not cover design flaws or trade-offs, and that those need to be covered:
…we need the VEP to expand to cover those issues. I’m not going to claim that will be easy, that the current approach will translate, or that they should have waited to handle those before publishing. One obvious place it gets harder is the sources and methods tradeoff. But we need the internet to be a resilient and trustworthy infrastructure.