A member of the Raspberry Pi community in Ontario, Canada spotted this story from the University of Toronto on CBC News. Engineers have created a device that enables healthcare workers to monitor COVID-19 patients continuously without the need to enter their hospital rooms.
Continuous, remote monitoring
Up-to-date information can be checked from any nursing station computer or smartphone. This advance could prove invaluable in conserving Personal Protective Equipment (PPE) supplies, which staff have to don for each hospital room visit. It also allows for the constant monitoring of patients at a time when hospital workers are extremely stretched.
Mount Sinai Hospital approached the University of Toronto’s engineering department to ask for their help in finding a way to monitor vital signs both continuously and remotely. A team of three PhD students, led by Professor Willy Wong, came up with the solution in just three days.
Communicating finger-clip monitor measurements
The simple concept involves connecting a Raspberry Pi 4 to standard finger-clip monitors, already in use across the hospital to monitor the respiratory status of COVID-19 patients. The finger clips detect what light is absorbed by the blood in a patient’s finger. Blood absorbs different colours of light to different degrees depending on how well oxygenated it is, so these measurements tell medical staff whether patients might be having difficulty with breathing.
The Raspberry Pi communicates this information over a wireless network to a server that Wong’s team deployed, allowing the nurses’ station computers or doctors’ smartphones to access data on how their patients are doing. This relieves staff of the need to enter patients’ rooms to check the data output on bedside monitors.
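The team’s code isn’t published here, but the data path is simple enough to sketch. In the outline below, read_spo2_and_pulse() is a stand-in for whatever driver reads the finger-clip sensor, and the server URL and room identifier are hypothetical; treat it as a rough illustration of the Pi-to-server hop rather than the Toronto implementation.

```python
# Rough sketch of the Raspberry Pi side of such a setup (not the Toronto team's code).
import time
import requests

SERVER_URL = "http://ward-monitor.local/api/readings"  # placeholder server address
ROOM_ID = "room-12"                                     # placeholder room identifier

def read_spo2_and_pulse():
    """Placeholder for the finger-clip driver: return (SpO2 %, pulse bpm)."""
    return 97.0, 72.0  # dummy values; a real driver would read the sensor here

while True:
    spo2, pulse = read_spo2_and_pulse()
    try:
        # Push the latest reading to the ward server over the wireless network.
        requests.post(
            SERVER_URL,
            json={"room": ROOM_ID, "spo2": spo2, "pulse": pulse},
            timeout=5,
        )
    except requests.RequestException:
        pass  # keep sampling even if the network drops briefly
    time.sleep(1)
```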
Feedback has been unanimously positive since several prototypes were deployed in a trial at Mount Sinai. And a local retirement home has been in touch to ask about using the invention to help care for their residents. Professor Wong says solutions like this one are a “no-brainer” when trying to monitor large groups of people as healthcare workers battle COVID-19. “This was a quintessentially electrical and computer engineering problem,” he explains.
Professor Wong’s team included PhD candidates Bill Shi, Yan Li, and Brian Wang.
The University of Toronto is also home to engineers who are currently developing an automated, more sensitive and rapid test for COVID-19. You can read more about their project, which is based on quantum dots – nano-scale particles that bind to different components of the virus’s genetic material and glow brightly in different colours when struck by light. This gives multiple data points per patient sample and provides increased confidence in test results.
This blog post was co-authored by Ujjwal Ratan, a senior AI/ML solutions architect on the global life sciences team.
Healthcare data is generated at an ever-increasing rate and is predicted to reach 35 zettabytes by 2020. Being able to cost-effectively and securely manage this data, whether for patient care, research, or legal reasons, is increasingly important for healthcare providers.
Healthcare providers must have the ability to ingest, store and protect large volumes of data including clinical, genomic, device, financial, supply chain, and claims. AWS is well-suited to this data deluge with a wide variety of ingestion, storage and security services (e.g. AWS Direct Connect, Amazon Kinesis Streams, Amazon S3, Amazon Macie) for customers to handle their healthcare data. In a recent Healthcare IT News article, healthcare thought-leader, John Halamka, noted, “I predict that five years from now none of us will have datacenters. We’re going to go out to the cloud to find EHRs, clinical decision support, analytics.”
I realize simply storing this data is challenging enough. Magnifying the problem is the fact that healthcare data is increasingly attractive to cyber attackers, making security a top priority. According to Mariya Yao in her Forbes column, it is estimated that individual medical records can be worth hundreds or even thousands of dollars on the black market.
In this first of a 2-part post, I will address the value that AWS can bring to customers for ingesting, storing and protecting provider’s healthcare data. I will describe key components of any cloud-based healthcare workload and the services AWS provides to meet these requirements. In part 2 of this post we will dive deep into the AWS services used for advanced analytics, artificial intelligence and machine learning.
The data tsunami is upon us
So where is this data coming from? In addition to the ubiquitous electronic health record (EHR), the sources of this data include:
devices such as MRIs, x-rays and ultrasounds
sensors and wearables for patients
medical equipment telemetry
Additional sources of data come from non-clinical, operational systems such as:
claims and billing
Data from these sources can be structured (e.g., claims data) as well as unstructured (e.g., clinician notes). Some data arrives as streams, such as feeds from patient monitors, while other data arrives in batches. Still other data arrives in near real time, such as HL7 messages. All of this data has retention policies dictating how long it must be stored. Much of this data is stored in perpetuity, as many systems in use today have no purge mechanism. AWS has services to manage all these data types as well as their retention, security, and access policies.
Imaging is a significant contributor to this data tsunami. Increasing demand for early-stage diagnoses along with aging populations drive increasing demand for images from CT, PET, MRI, ultrasound, digital pathology, X-ray and fluoroscopy. For example, a thin-slice CT image can be hundreds of megabytes. Increasing demand and strict retention policies make storage costly.
Due to the plummeting cost of gene sequencing, molecular diagnostics (including liquid biopsy) is a large contributor to this data deluge. Many predict that as the value of molecular testing becomes more identifiable, the reimbursement models will change and it will increasingly become the standard of care. According to the Washington Post article “Sequencing the Genome Creates so Much Data We Don’t Know What to do with It,”
“Some researchers predict that up to one billion people will have their genome sequenced by 2025 generating up to 40 exabytes of data per year.”
Although genomics is primarily used for oncology diagnostics today, it’s also used for other purposes, such as pharmacogenomics, which helps us understand how an individual will metabolize a medication.
It is increasingly challenging for the typical hospital, clinic or physician practice to securely store, process and manage this data without cloud adoption.
AWS offers a variety of ingestion techniques depending on the nature of the data, including its size, frequency, and structure. AWS Snowball and AWS Snowmobile are appropriate for extremely large, secure data transfers, whether one-time or episodic. AWS Glue is a fully managed ETL service for securely moving data from on-premises systems to AWS, and Amazon Kinesis can be used for ingesting streaming data.
Amazon S3, Amazon S3 IA, and Amazon Glacier are economical data storage services with a pay-as-you-go pricing model that expands (or shrinks) with the customer’s requirements.
The above architecture has four distinct components: ingestion, storage, security, and analytics. In this post, I dive deeper into the first three components: ingestion, storage, and security. In Part 2, I will look at how to use AWS analytics services to derive value from, and optimize, your healthcare data.
A typical provider data center consists of many systems with varied datasets. AWS provides multiple tools and services to effectively and securely connect to these data sources and ingest data in various formats. Customers can choose from a range of services and use them in accordance with their use case.
For use cases involving one-time (or periodic) very large data migrations into AWS, customers can take advantage of AWS Snowball devices. These devices come in two sizes, 50 TB and 80 TB, and can be combined to create a petabyte-scale data transfer solution.
The devices are easy to connect and load, and they are shipped to AWS, avoiding the network bottlenecks associated with such large-scale data migrations. The devices are extremely secure, supporting 256-bit encryption, and come in a tamper-resistant enclosure. AWS Snowball imports data into Amazon S3, which can then interface with other AWS compute services to process that data in a scalable manner.
For use cases where a portion of the datasets needs to stay on premises for active use while the rest is offloaded to AWS, the AWS Storage Gateway service can be used. The service allows you to seamlessly integrate on-premises applications via standard storage protocols like iSCSI or NFS mounted on a gateway appliance. It supports a file interface, a volume interface, and a tape interface, which can be used for a range of use cases like disaster recovery, backup and archiving, cloud bursting, storage tiering, and migration.
By using the AWS proposed reference architecture for disaster recovery, healthcare providers can ensure their data assets are securely stored on the cloud and are easily accessible in the event of a disaster. The “AWS Disaster Recovery” whitepaper includes details on options available to customers based on their desired recovery time objective (RTO) and recovery point objective (RPO).
AWS is an ideal destination for offloading large volumes of less-frequently-accessed data. These datasets are rarely used in active compute operations but are exceedingly important to retain for reasons like compliance. By storing these datasets on AWS, customers can take advantage of a highly durable platform to securely store their data and retrieve it easily when they need to. For more details on how AWS enables customers to run backup and archival use cases on AWS, please refer to the following set of whitepapers.
A healthcare provider may have a variety of databases spread throughout the hospital system supporting critical applications such as EHR, PACS, finance, and many more. These datasets often need to be aggregated to derive information and calculate metrics to optimize business processes. AWS Glue is a fully managed Extract, Transform, and Load (ETL) service that can read data from a JDBC-enabled, on-premises database and transfer the datasets into AWS services like Amazon S3, Amazon Redshift, and Amazon RDS. This allows customers to create transformation workflows that integrate smaller datasets from multiple sources and aggregate them on AWS.
Healthcare providers deal with a variety of streaming datasets that often have to be analyzed in near real time. These datasets come from a variety of sources, such as sensors, messaging buses, and social media, and often do not adhere to an industry standard. The Amazon Kinesis suite of services, which includes Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics, is well suited to the task of deriving value from streaming data.
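As a rough illustration (not taken from the original post), a producer can push clinical messages into a Kinesis stream with a few lines of boto3; the stream name and message content below are hypothetical.

```python
# Minimal sketch of a Kinesis producer for streaming clinical messages.
import json
import boto3

kinesis = boto3.client("kinesis")

def send_hl7_message(message_text, patient_id):
    # Using the patient ID as the partition key keeps one patient's messages
    # ordered within a shard.
    kinesis.put_record(
        StreamName="hl7-intake-stream",  # placeholder stream name
        Data=json.dumps({"hl7": message_text}).encode("utf-8"),
        PartitionKey=patient_id,
    )
```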
Example: Using AWS Glue to de-identify and ingest healthcare data into S3
Let’s consider a scenario in which a provider maintains patient records in a database that they want to ingest into S3. The provider also wants to de-identify the data by stripping personally identifiable attributes and storing the non-identifiable information in an S3 bucket separate from the one that contains identifiable information. Doing this allows the healthcare provider to isolate sensitive information and apply more restrictive S3 bucket policies to it.
To ingest records into S3, we create a Glue job that reads from the source database using a Glue connection. The connection is also used by a Glue crawler to populate the Glue Data Catalog with the schema of the source database. We will use a Glue development endpoint and a Zeppelin notebook server on EC2 to develop and execute the job.
Step 1: Import the necessary libraries and also set a glue context which is a wrapper on the spark context:
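The original post showed this step as a screenshot; a minimal equivalent in PySpark on a Glue development endpoint looks roughly like this.

```python
# Step 1 (sketch): Glue and Spark imports, plus a GlueContext wrapping the SparkContext.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
```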
Step 2: Create a dataframe from the source data. I call the dataframe “readmissionsdata”. Printing its schema confirms what the crawler discovered:
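A sketch of this step follows; the catalog database and table names are placeholders for whatever the crawler created in your account.

```python
# Step 2 (sketch): load the source table through its Glue Data Catalog entry
# and convert it to a Spark DataFrame.
readmissions_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="hospital_db",    # placeholder catalog database
    table_name="readmissions"  # placeholder table populated by the crawler
)
readmissionsdata = readmissions_dyf.toDF()
readmissionsdata.printSchema()  # inspect the columns the crawler discovered
```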
Step 3: Now select the columns that contain identifiable information and store them in a new dataframe. Call the new dataframe “phi”.
Step 4: Non-PHI columns are stored in a separate dataframe. Call this dataframe “nonphi”.
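Steps 3 and 4 can be sketched as two column selections; the column names below are hypothetical, so substitute the identifiers that actually appear in your source schema.

```python
# Steps 3 and 4 (sketch): split identifiable and non-identifiable columns.
phi_columns = ["patient_name", "ssn", "address", "phone_number"]  # hypothetical names

phi = readmissionsdata.select(*phi_columns)   # identifiable attributes only
nonphi = readmissionsdata.drop(*phi_columns)  # everything else
```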
Step 5: Write the two dataframes into two separate S3 buckets
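A sketch of the final step, assuming the two buckets are named “phi” and “nonphi” (the second name is illustrative):

```python
# Step 5 (sketch): write each DataFrame to its own S3 bucket so the PHI bucket
# can carry stricter bucket policies.
phi.write.mode("overwrite").parquet("s3://phi/readmissions/")
nonphi.write.mode("overwrite").parquet("s3://nonphi/readmissions/")
```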
Once successfully executed, the PHI and non-PHI attributes are stored in two separate files in two separate buckets that can be individually maintained.
In 2016, 327 healthcare providers reported a protected health information (PHI) breach, affecting 16.4 million patient records. In 2017, 342 data breaches were reported, involving 3.2 million patient records.
To date, AWS has released 51 HIPAA-eligible services to help customers address security challenges and is in the process of making many more services HIPAA-eligible. These HIPAA-eligible services (along with all other AWS services) help customers build solutions that comply with HIPAA security and auditing requirements. A catalogue of these services can be found at AWS HIPAA-eligible services. It is important to note that AWS manages physical and logical access controls for the AWS boundary. However, the overall security of your workloads is a shared responsibility, where you are responsible for controlling user access to content on your AWS accounts.
AWS storage services allow you to store data efficiently while maintaining high durability and scalability. By using Amazon S3 as the central storage layer, you can take advantage of the Amazon S3 storage management features to get operational metrics on your data sets and transition them between various storage classes to save costs. By tagging objects on Amazon S3, you can build a governance layer on Amazon S3 to grant role-based access to objects using AWS IAM and Amazon S3 bucket policies.
To learn more about the Amazon S3 storage management features, see the following link.
In the example above, we are storing the PHI information in a bucket named “phi.” Now, we want to protect this information to make sure it’s encrypted, protected from unauthorized access, and that all access requests to the data are logged.
Encryption: S3 provides settings to enable default encryption on a bucket. This ensures any object in the bucket is encrypted by default.
Logging: S3 provides object-level logging that can be used to capture all API calls to the objects. The API calls are logged in AWS CloudTrail for easy access and consolidation. It also supports event notifications to proactively alert customers of read and write operations.
Access control: Customers can use S3 bucket policies and IAM policies to restrict access to the phi bucket. They can also add a condition that enforces multi-factor authentication on the bucket. For example, the following policy enforces multi-factor authentication on the phi bucket:
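The policy itself appeared as an image in the original post; the sketch below shows a common way to express it (denying any request that was not MFA-authenticated), along with the default-encryption setting mentioned above, applied with boto3. Adjust the bucket ARNs and encryption algorithm to your environment.

```python
# Sketch: default encryption plus an MFA-enforcing bucket policy on the phi bucket.
import json
import boto3

s3 = boto3.client("s3")

# Default encryption: every new object written to the bucket is encrypted at rest.
s3.put_bucket_encryption(
    Bucket="phi",
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# Bucket policy: deny all S3 actions on the phi bucket unless the caller
# authenticated with multi-factor authentication.
mfa_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyAccessWithoutMFA",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::phi", "arn:aws:s3:::phi/*"],
            "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket="phi", Policy=json.dumps(mfa_policy))
```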
In Part 1 of this blog, we detailed the ingestion, storage, security, and management of healthcare data on AWS. Stay tuned for Part 2, where we dive deep into optimizing the data for analytics and machine learning.
As ransomware attacks have grown in number in recent months, the tactics and attack vectors also have evolved. While the primary method of attack used to be to target individual computer users within organizations with phishing emails and infected attachments, we’re increasingly seeing attacks that target weaknesses in businesses’ IT infrastructure.
How Ransomware Attacks Typically Work
In our previous posts on ransomware, we described the common vehicles hackers use to infect organizations with ransomware. Most often, trojan-horse downloaders are distributed through malicious downloads and spam emails. The emails contain a variety of file attachments which, if opened, download and run one of the many ransomware variants. Once a user’s computer is infected with a malicious downloader, it retrieves additional malware, which frequently includes crypto-ransomware. After the files have been encrypted, a ransom payment is demanded of the victim in order to decrypt the files.
What’s Changed With the Latest Ransomware Attacks?
In 2016, a customized ransomware strain called SamSam began attacking the servers in primarily health care institutions. SamSam, unlike more conventional ransomware, is not delivered through downloads or phishing emails. Instead, the attackers behind SamSam use tools to identify unpatched servers running Red Hat’s JBoss enterprise products. Once the attackers have successfully gained entry into one of these servers by exploiting vulnerabilities in JBoss, they use other freely available tools and scripts to collect credentials and gather information on networked computers. Then they deploy their ransomware to encrypt files on these systems before demanding a ransom. Gaining entry to an organization through its IT center rather than its endpoints makes this approach scalable and especially unsettling.
SamSam’s methodology is to scour the Internet searching for accessible and vulnerable JBoss application servers, especially ones used by hospitals. It’s not unlike a burglar rattling doorknobs in a neighborhood to find unlocked homes. When SamSam finds an unlocked home (an unpatched server), the software infiltrates the system. It is then free to spread across the company’s network by stealing passwords. As it traverses the network and systems, it encrypts files, preventing access until the victims pay the hackers a ransom, typically between $10,000 and $15,000. The low ransom amount has encouraged some victimized organizations to pay the ransom rather than incur the downtime required to wipe and reinitialize their IT systems.
The success of SamSam is due to its effectiveness rather than its sophistication. SamSam can enter and traverse a network without human intervention. Some organizations are learning too late that securing internet-facing services in their data center from attack is just as important as securing endpoints.
The typical steps in a SamSam ransomware attack are:
1 Attackers gain access to a vulnerable server
Attackers exploit vulnerable software or weak/stolen credentials.
2 Attack spreads via remote access tools
Attackers harvest credentials, create SOCKS proxies to tunnel traffic, and abuse RDP to install SamSam on more computers in the network.
3 Ransomware payload deployed
Attackers run batch scripts to execute ransomware on compromised machines.
4 Ransomware demand delivered requiring payment to decrypt files
Demand amounts vary from victim to victim. Relatively low ransom amounts appear to be designed to encourage quick payment decisions.
What all the organizations successfully exploited by SamSam have in common is that they were running unpatched servers that made them vulnerable to SamSam. Some organizations had their endpoints and servers backed up, while others did not. Some of those without backups they could use to recover their systems chose to pay the ransom money.
Timeline of SamSam History and Exploits
Since its appearance in 2016, SamSam has been in the news with many successful incursions into healthcare, business, and government institutions.
March 2016 SamSam appears
SamSam campaign targets vulnerable JBoss servers. Attackers home in on healthcare organizations specifically, as they’re more likely to have unpatched JBoss machines.
April 2016 SamSam finds new targets
SamSam begins targeting schools and government. After initial success targeting healthcare, attackers branch out to other sectors.
April 2017 New tactics include RDP
Attackers shift to targeting organizations with exposed RDP connections, and maintain focus on healthcare. An attack on Erie County Medical Center costs the hospital $10 million over three months of recovery.
January 2018 Municipalities attacked
• Attack on Municipality of Farmington, NM. • Attack on Hancock Health. • Attack on Adams Memorial Hospital • Attack on Allscripts (Electronic Health Records), which includes 180,000 physicians, 2,500 hospitals, and 7.2 million patients’ health records.
February 2018 Attack volume increases
• Attack on Davidson County, NC. • Attack on Colorado Department of Transportation.
March 2018 SamSam shuts down Atlanta
• Second attack on Colorado Department of Transportation. • City of Atlanta suffers a devastating attack by SamSam. The attack has far-reaching impacts — crippling the court system, keeping residents from paying their water bills, limiting vital communications like sewer infrastructure requests, and pushing the Atlanta Police Department to file paper reports. • SamSam campaign nets $325,000 in 4 weeks. Infections spike as attackers launch new campaigns. Healthcare and government organizations are once again the primary targets.
How to Defend Against SamSam and Other Ransomware Attacks
The best way to respond to a ransomware attack is to avoid having one in the first place. Beyond that, making sure your valuable data is backed up and unreachable by ransomware infection will ensure that your downtime and data loss are minimal or nonexistent if you ever suffer an attack.
Use anti-virus and anti-malware software or other security policies to block known payloads from launching.
Make frequent, comprehensive backups of all important files and isolate them from local and open networks. In a recent survey, cybersecurity professionals rated data backup and recovery (74%) as by far the most effective way to respond to a successful ransomware attack.
Keep offline backups of data stored in locations inaccessible from any potentially infected computer, such as disconnected external storage drives or the cloud, which prevents them from being accessed by the ransomware.
Install the latest security updates issued by software vendors of your OS and applications. Remember to patch early and patch often to close known vulnerabilities in operating systems, server software, browsers, and web plugins.
Consider deploying security software to protect endpoints, email servers, and network systems from infection.
Exercise cyber hygiene, such as using caution when opening email attachments and links.
Segment your networks to keep critical computers isolated and to prevent the spread of malware in case of attack. Turn off unneeded network shares.
Turn off admin rights for users who don’t require them. Give users the lowest system permissions they need to do their work.
Restrict write permissions on file servers as much as possible.
Educate yourself, your employees, and your family in best practices to keep malware out of your systems. Update everyone on the latest email phishing scams and social engineering tactics aimed at turning victims into abettors.
Please Tell Us About Your Experiences with Ransomware
Have you endured a ransomware attack or have a strategy to avoid becoming a victim? Please tell us of your experiences in the comments.
Check out my latest Hacker in Residence project for SparkFun Electronics: the Helmet Guardian! It’s a Pi Zero-powered impact force monitor that turns on an LED if your head or body experiences a potentially dangerous impact. Install it in your sports helmet, on your bicycle, or in your car to keep track of impacts and tell you when it’s time to visit the doctor.
We’ve all knocked our heads at least once in our lives, maybe due to tripping over a loose paving slab, or to falling off a bike, or to walking into the corner of the overhead cupboard door for the third time this week — will I ever learn?! More often than not, even when we’re seeing stars, we brush off the accident and continue with our day, oblivious to the long-term damage we may be doing.
Force of impact
After some thorough research, Jennifer Fox, founder of FoxBot Industries, concluded that forces of 4 to 6 G sustained for more than a few seconds are dangerous to the human body. With this in mind, she decided to use a Raspberry Pi Zero W and an accelerometer to create a helmet-mounted impact force monitor that notifies its wearer if this level of G-force has been met.
Obviously, if you do have a serious fall, you should always seek medical advice. This project is an example of how affordable technology can be used to create medical and citizen science builds, and not a replacement for professional medical services.
Setting up the impact monitor
Jennifer’s monitor requires only a few pieces of tech: a Zero W, an accelerometer and breakout board, a rechargeable USB battery, and an LED, plus the standard wires and resistors for these components.
After installing Raspbian, Jennifer enabled SSH and I2C on the Zero W to make it run headlessly, and then accessed it from a laptop. This allows her to control the Pi without physically connecting to it, and it makes for a wireless finished project.
Jen wired the Pi to the accelerometer breakout board and LED as shown in the schematic below.
The LED acts as a signal of significant impacts, turning on when the G-force threshold is reached, and not turning off again until the program is reset.
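This isn’t Jennifer’s exact code, just a sketch of the loop described above. read_acceleration_g() stands in for the driver that reads the accelerometer breakout over I2C, and the LED pin and 4 G threshold are illustrative rather than taken from her build.

```python
# Sketch of the impact-monitoring loop (not the project's exact code).
import time
import RPi.GPIO as GPIO

LED_PIN = 17       # hypothetical BCM pin wired to the LED
THRESHOLD_G = 4.0  # lower end of the 4-6 G danger range discussed above

GPIO.setmode(GPIO.BCM)
GPIO.setup(LED_PIN, GPIO.OUT)

def read_acceleration_g():
    """Placeholder for the accelerometer driver: return (x, y, z) in G."""
    return 0.0, 0.0, 1.0  # dummy reading; replace with real I2C reads

try:
    while True:
        x, y, z = read_acceleration_g()
        magnitude = (x * x + y * y + z * z) ** 0.5
        if magnitude >= THRESHOLD_G:
            # Latch the LED on; it stays lit until the program is reset.
            GPIO.output(LED_PIN, GPIO.HIGH)
        time.sleep(0.01)
finally:
    GPIO.cleanup()
```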
With over 500 petabytes of data under management, we need more people keeping the drives spinning in our data center. We’re constantly hiring Systems Administrators and Data Center Technicians, and here’s our latest one! Let’s learn a bit more about Jacob, shall we?
What is your Backblaze Title? Data Center Technician
Where are you originally from? Ojai, CA
What attracted you to Backblaze? It’s a technical job that believes in training its employees and treating them well.
What do you expect to learn while being at Backblaze? As much as I can.
Where else have you worked? I was a Team Lead at Target, I did some volunteer work with the Ventura County Medical Center, and I also worked at a motocross track.
Where did you go to school? Ventura Community College, then 1 semester at Sac State
What’s your dream job? Don’t really have one. Whatever can support my family and that I enjoy.
Favorite place you’ve traveled? Yosemite National Park for the touristy stuff, Bend Oregon for a good getaway place.
Favorite hobby? Gaming and music. It’s a tie.
Of what achievement are you most proud? Marrying my wife Masha.
Star Trek or Star Wars? Wars. 100%. I’m a major Star Wars geek.
Coke or Pepsi? Monster.
Favorite food? French fries.
Why do you like certain things? Because my brain tells me I like them.
Thank you for helping care for all of our customers’ data. Welcome to the data center team, Jacob!
Applying technology to healthcare data has the potential to produce many exciting and important outcomes. The analysis produced from healthcare data can empower clinicians to improve the health of individuals and populations by enabling them to make better decisions that enhance the care they provide.
The Observational Health Data Sciences and Informatics (OHDSI, pronounced “Odyssey”) program and community is working toward this goal by producing data standards and open-source solutions to store and analyze observational health data. Using the OHDSI tools, you can visualize the health of your entire population. You can build cohorts of patients, analyze incidence rates for various conditions, and estimate the effect of treatments on patients with certain conditions. You can also model health outcome predictions using machine learning algorithms.
One of the challenges often faced when working with big data tools is the expense of the infrastructure required to run them. Another challenge is the learning curve to implement and begin using these tools. Amazon Web Services has enabled us to address many of the classic IT challenges by making enterprise class infrastructure and technology available in an affordable, elastic, and automated way. This blog post demonstrates how to combine some of the OHDSI projects (Atlas, Achilles, WebAPI, and the OMOP Common Data Model) with AWS technologies. By doing so, you can quickly and inexpensively implement a health data science and informatics environment.
Shown following is just one example of the population health analysis that is possible with the OHDSI tools. This visualization shows the prevalence of various drugs within the given population of people. This information helps researchers and clinicians discover trends and make better informed decisions about patient health.
OHDSI application architecture on AWS
Before deploying an application on AWS that transmits, processes, or stores protected health information (PHI) or personally identifiable information (PII), address your organization’s compliance concerns. Make sure that you have worked with your internal compliance and legal team to ensure compliance with the laws and regulations that govern your organization. To understand how you can use AWS services as a part of your overall compliance program, see the AWS HIPAA Compliance whitepaper. With that said, we paid careful attention to the HIPAA control set during the design of this solution.
This blog post presents a complete OHDSI application environment, including a data warehouse with sample data. It has the following features:
Following, you can see a block diagram of how the OHDSI tools map to the services provided by AWS.
Atlas is the web application that researchers interact with to perform analysis. Atlas interacts with the underlying databases through a web services application named WebAPI. In this example, both Atlas and WebAPI are deployed and managed by AWS Elastic Beanstalk. Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications. Simply upload the Atlas and WebAPI code and Elastic Beanstalk automatically handles the deployment. It covers everything from capacity provisioning, load balancing, autoscaling, and high availability, to application health monitoring. Using a feature of Elastic Beanstalk called ebextensions, the Atlas and WebAPI servers are customized to use an encrypted storage volume for the middleware application logs.
Atlas stores the state of the various patient cohorts that are analyzed in a dedicated database separate from your observational health data. This database is provided by Amazon Aurora with PostgreSQL compatibility.
Amazon Aurora is a relational database built for the cloud that combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching, and backups. It is configured for high availability and uses encryption at rest for the database and backups, and encryption in flight for the JDBC connections.
All of your observational health data is stored inside the OHDSI Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). This model also stores useful vocabulary tables that help to translate values from various data sources (like EHR systems and claims data).
The OMOP CDM schema is deployed onto Amazon Redshift. Amazon Redshift is a fast, fully managed data warehouse that allows you to run complex analytic queries against petabytes of structured data. It uses sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. You can also resize an Amazon Redshift cluster as your requirements change.
The solution in this blog post automatically loads de-identified sample data of 1,000 people from the CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF). The data has helpful formatting from LTS Computing LLC. Vocabulary data from the OHDSI Athena project is also loaded into the OMOP CDM, and a results set is computed by OHDSI Achilles.
Following is a detailed technical diagram showing the configuration of the architecture to be deployed.
Deploying OHDSI on AWS
Everything just described is automatically deployed by using an AWS CloudFormation template. Using this template, you can quickly get started with the OHDSI project. The CloudFormation templates for this deployment as well as all of the supporting scripts and source code can be found in the AWS Labs GitHub repo.
From your AWS account, open the CloudFormation Management Console and choose Create Stack. From there, copy and paste the following URL in the Specify an Amazon S3 template URL box, and choose Next.
On the next screen, you provide a Stack Name (this can be anything you like) and a few other parameters for your OHDSI environment.
You use the DatabasePassword parameter to set the password for the master user account of the Amazon Redshift and Aurora databases.
You use the EBEndpoint name to generate a unique URL for Atlas to access the OHDSI environment. The URL takes the form http://EBEndpoint.AWS-Region.elasticbeanstalk.com, where EBEndpoint is the name you provide and AWS-Region is the AWS Region you deploy into. You can configure this URL through Elastic Beanstalk if you want to change it in the future.
You use the KPair option to choose one of your existing Amazon EC2 key pairs to use with the instances that Elastic Beanstalk deploys. By doing this, you can gain administrative access to these instances in the future if you need to. If you don’t already have an Amazon EC2 key pair, you can generate one for free. You do this by going to the Key Pairs section of the EC2 console and choosing Generate Key Pair.
Finally, you use the UserIPRange parameter to specify a CIDR IP address range from which to access your OHDSI environment. By default, your OHDSI environment is accessible over the public internet. Use UserIPRange to limit access over the Internet to a single IP address or a range of IP addresses that represent users you want to have access. Through additional configuration, you can also make your OHDSI environment completely private and accessible only through a VPN or AWS Direct Connect private circuit.
When you’ve provided all Parameters, choose Next.
On the next screen, you can provide some other optional information like tags at your discretion, or just choose Next.
On the next screen, you can review what will be deployed. At the bottom of the screen, there is a check box for you to acknowledge that AWS CloudFormation might create IAM resources with custom names. This is correct; the template being deployed creates four custom roles that give permission for the AWS services involved to communicate with each other. Details of these permissions are inside the CloudFormation template referenced in the URL given in the first step. Check the box acknowledging this and choose Next.
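If you prefer to script the deployment instead of clicking through the console, the same launch can be expressed with boto3. This is only a sketch: the template URL is a placeholder (use the one given in the post), and the parameter values shown for DatabasePassword, EBEndpoint, KPair, and UserIPRange are examples.

```python
# Sketch: launching the OHDSI stack programmatically instead of via the console.
import boto3

cfn = boto3.client("cloudformation")
cfn.create_stack(
    StackName="ohdsi-demo",
    TemplateURL="https://s3.amazonaws.com/example-bucket/ohdsi.template",  # placeholder
    Parameters=[
        {"ParameterKey": "DatabasePassword", "ParameterValue": "ChangeMe123!"},
        {"ParameterKey": "EBEndpoint", "ParameterValue": "my-ohdsi-env"},
        {"ParameterKey": "KPair", "ParameterValue": "my-ec2-keypair"},
        {"ParameterKey": "UserIPRange", "ParameterValue": "203.0.113.0/24"},
    ],
    # Equivalent of checking the IAM acknowledgement box in the console.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```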
You can watch as CloudFormation builds out your OHDSI architecture. A CloudFormation deployment is called a stack. The parent stack creates two child stacks, one containing the VPC and IAM roles and another created by Elastic Beanstalk with the Atlas and WebAPI servers. When all three stacks have reached the green CREATE_COMPLETE status, as shown in the screenshot following, then the OHDSI architecture has been deployed.
There is still some work going on behind the scenes, though. To watch the progress, browse to the Amazon Redshift section of your AWS Management Console and choose the Amazon Redshift cluster that was created for your OHDSI architecture. After you do so, you can observe the Loads and Queries tabs.
First, on the Loads tab, you can see the CMS De-SynPUF sample data and Athena vocabulary data being loaded into the OMOP Common Data Model. After you see the VOCABULARY table reach the COMPLETED status (as shown following), all of the sample and vocabulary data has been loaded.
After the data loads, the Achilles computation starts. On the Queries tab, you can watch Achilles running queries against your database to build out the Results schema. Achilles runs a large number of queries, and the entire process can take quite some time (about 20 minutes for the sample data we’ve loaded). Eventually, no new queries show up in the Queries tab, which shows that the Achilles computation is completed. The entire process from the time you executed the CloudFormation template until the Achilles computation is completed usually takes about an hour and 15 minutes.
At this point, you can browse to the Elastic Beanstalk section of the AWS Management Console. There, you can choose the OHDSI Application and Environment (green box) that was deployed by the CloudFormation template. At the top of the dashboard, as shown following, you see a link to a URL. This URL matches the name you provided in the EBEndpoint parameter of the CloudFormation template. Choose this URL, and you can start using Atlas to explore the CMS DE-SynPUF sample data!
Cost of deploying this environment
It used to be common to see healthcare data analytics environments deployed in an on-premises data center with expensive data warehouse appliances and virtualized environments. The cloud era has democratized the availability of the infrastructure required to do this type of data analysis, so that now it is within reach of even small organizations. This environment can expand to analyze petabyte-scale health data, and you only pay for what you need. See an estimated breakdown of the monthly cost components for this environment as deployed on the AWS Solution Calculator.
It’s also worth noting that this environment does not have to be run all of the time. If you are only performing analyses periodically, you can terminate the environment when you are finished and restore it from the database backups when you want to continue working. This would reduce the cost of operation even further.
Now that you have a fully functional OHDSI environment with sample data, you can use this to explore and learn the toolset and its capabilities. After learning with the sample data, you can begin gaining insights by analyzing your own organization’s health data. You can do this using an extract, transform, load (ETL) process from one or more of your health data sources.
James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.
Karen Sandler has been giving conference talks about free software and open medical devices for the better part of a decade at this point. LWN briefly covered a 2010 LinuxCon talk and a 2012 linux.conf.au (LCA) talk; her talk at LCA 2012 was her first full-length keynote, she said. In this year’s edition, she reviewed her history (including her love for LCA, based in part on that 2012 visit) and gave an update on the status of the source code for the device implanted in her heart.
On January 3, the world learned about a series of major security vulnerabilities in modern microprocessors. Called Spectre and Meltdown, these vulnerabilities were discovered by several different researchers last summer, disclosed to the microprocessors’ manufacturers, and patched — at least to the extent possible.
This news isn’t really any different from the usual endless stream of security vulnerabilities and patches, but it’s also a harbinger of the sorts of security problems we’re going to be seeing in the coming years. These are vulnerabilities in computer hardware, not software. They affect virtually all high-end microprocessors produced in the last 20 years. Patching them requires large-scale coordination across the industry, and in some cases drastically affects the performance of the computers. And sometimes patching isn’t possible; the vulnerability will remain until the computer is discarded.
Spectre and Meltdown aren’t anomalies. They represent a new area to look for vulnerabilities and a new avenue of attack. They’re the future of security — and it doesn’t look good for the defenders.
Modern computers do lots of things at the same time. Your computer and your phone simultaneously run several applications — or apps. Your browser has several windows open. A cloud computer runs applications for many different computers. All of those applications need to be isolated from each other. For security, one application isn’t supposed to be able to peek at what another one is doing, except in very controlled circumstances. Otherwise, a malicious advertisement on a website you’re visiting could eavesdrop on your banking details, or the cloud service purchased by some foreign intelligence organization could eavesdrop on every other cloud customer, and so on. The companies that write browsers, operating systems, and cloud infrastructure spend a lot of time making sure this isolation works.
Both Spectre and Meltdown break that isolation, deep down at the microprocessor level, by exploiting performance optimizations that have been implemented for the past decade or so. Basically, microprocessors have become so fast that they spend a lot of time waiting for data to move in and out of memory. To increase performance, these processors guess what data they’re going to receive and execute instructions based on that. If the guess turns out to be correct, it’s a performance win. If it’s wrong, the microprocessors throw away what they’ve done without losing any time. This feature is called speculative execution.
Spectre and Meltdown attack speculative execution in different ways. Meltdown is more of a conventional vulnerability; the designers of the speculative-execution process made a mistake, so they just needed to fix it. Spectre is worse; it’s a flaw in the very concept of speculative execution. There’s no way to patch that vulnerability; the chips need to be redesigned in such a way as to eliminate it.
Since the announcement, manufacturers have been rolling out patches to these vulnerabilities to the extent possible. Operating systems have been patched so that attackers can’t make use of the vulnerabilities. Web browsers have been patched. Chips have been patched. From the user’s perspective, these are routine fixes. But several aspects of these vulnerabilities illustrate the sorts of security problems we’re only going to be seeing more of.
First, attacks against hardware, as opposed to software, will become more common. Last fall, vulnerabilities were discovered in Intel’s Management Engine, a remote-administration feature on its microprocessors. Like Spectre and Meltdown, they affected how the chips operate. Looking for vulnerabilities on computer chips is new. Now that researchers know this is a fruitful area to explore, security researchers, foreign intelligence agencies, and criminals will be on the hunt.
Second, because microprocessors are fundamental parts of computers, patching requires coordination between many companies. Even when manufacturers like Intel and AMD can write a patch for a vulnerability, computer makers and application vendors still have to customize and push the patch out to the users. This makes it much harder to keep vulnerabilities secret while patches are being written. Spectre and Meltdown were announced prematurely because details were leaking and rumors were swirling. Situations like this give malicious actors more opportunity to attack systems before they’re guarded.
Third, these vulnerabilities will affect computers’ functionality. In some cases, the patches for Spectre and Meltdown result in significant reductions in speed. The press initially reported 30%, but that only seems true for certain servers running in the cloud. For your personal computer or phone, the performance hit from the patch is minimal. But as more vulnerabilities are discovered in hardware, patches will affect performance in noticeable ways.
And then there are the unpatchable vulnerabilities. For decades, the computer industry has kept things secure by finding vulnerabilities in fielded products and quickly patching them. Now there are cases where that doesn’t work. Sometimes it’s because computers are in cheap products that don’t have a patch mechanism, like many of the DVRs and webcams that are vulnerable to the Mirai (and other) botnets — groups of Internet-connected devices sabotaged for coordinated digital attacks. Sometimes it’s because a computer chip’s functionality is so core to a computer’s design that patching it effectively means turning the computer off. This, too, is becoming more common.
Increasingly, everything is a computer: not just your laptop and phone, but your car, your appliances, your medical devices, and global infrastructure. These computers are and always will be vulnerable, but Spectre and Meltdown represent a new class of vulnerability. Unpatchable vulnerabilities in the deepest recesses of the world’s computer hardware are the new normal. It’s going to leave us all much more vulnerable in the future.
The security of pretty much every computer on the planet has just gotten a lot worse, and the only real solution — which of course is not a solution — is to throw them all away and buy new ones.
On Wednesday, researchers just announced a series of major security vulnerabilities in the microprocessors at the heart of the world’s computers for the past 15-20 years. They’ve been named Spectre and Meltdown, and they have to do with manipulating different ways processors optimize performance by rearranging the order of instructions or performing different instructions in parallel. An attacker who controls one process on a system can use the vulnerabilities to steal secrets elsewhere on the computer. (The research papers are here and here.)
This means that a malicious app on your phone could steal data from your other apps. Or a malicious program on your computer — maybe one running in a browser window from that sketchy site you’re visiting, or as a result of a phishing attack — can steal data elsewhere on your machine. Cloud services, which often share machines amongst several customers, are especially vulnerable. This affects corporate applications running on cloud infrastructure, and end-user cloud applications like Google Drive. Someone can run a process in the cloud and steal data from every other user on the same hardware.
Information about these flaws has been secretly circulating amongst the major IT companies for months as they researched the ramifications and coordinated updates. The details were supposed to be released next week, but the story broke early and everyone is scrambling. By now all the major cloud vendors have patched their systems against the vulnerabilities that can be patched against.
“Throw it away and buy a new one” is ridiculous security advice, but it’s what US-CERT recommends. It is also unworkable. The problem is that there isn’t anything to buy that isn’t vulnerable. Pretty much every major processor made in the past 20 years is vulnerable to some flavor of these vulnerabilities. Patching against Meltdown can degrade performance by almost a third. And there’s no patch for Spectre; the microprocessors have to be redesigned to prevent the attack, and that will take years. (Here’s a running list of who’s patched what.)
This is bad, but expect it more and more. Several trends are converging in a way that makes our current system of patching security vulnerabilities harder to implement.
The first is that these vulnerabilities affect embedded computers in consumer devices. Unlike our computers and phones, these systems are designed and produced at a lower profit margin with less engineering expertise. There aren’t security teams on call to write patches, and there often aren’t mechanisms to push patches onto the devices. We’re already seeing this with home routers, digital video recorders, and webcams. The vulnerability that allowed them to be taken over by the Mirai botnet last August simply can’t be fixed.
The second is that some of the patches require updating the computer’s firmware. This is much harder to walk consumers through, and is more likely to permanently brick the device if something goes wrong. It also requires more coordination. In November, Intel released a firmware update to fix a vulnerability in its Management Engine (ME): another flaw in its microprocessors. But it couldn’t get that update directly to users; it had to work with the individual hardware companies, and some of them just weren’t capable of getting the update to their customers.
We’re already seeing this. Some patches require users to disable the computer’s password, which means organizations can’t automate the patch. Some antivirus software blocks the patch, or — worse — crashes the computer. This results in a three-step process: patch your antivirus software, patch your operating system, and then patch the computer’s firmware.
The final reason is the nature of these vulnerabilities themselves. These aren’t normal software vulnerabilities, where a patch fixes the problem and everyone can move on. These vulnerabilities are in the fundamentals of how the microprocessor operates.
It shouldn’t be surprising that microprocessor designers have been building insecure hardware for 20 years. What’s surprising is that it took 20 years to discover it. In their rush to make computers faster, they weren’t thinking about security. They didn’t have the expertise to find these vulnerabilities. And those who did were too busy finding normal software vulnerabilities to examine microprocessors. Security researchers are starting to look more closely at these systems, so expect to hear about more vulnerabilities along these lines.
Spectre and Meltdown are pretty catastrophic vulnerabilities, but they only affect the confidentiality of data. Now that they — and the research into the Intel ME vulnerability — have shown researchers where to look, more is coming — and what they’ll find will be worse than either Spectre or Meltdown. There will be vulnerabilities that will allow attackers to manipulate or delete data across processes, potentially fatal in the computers controlling our cars or implanted medical devices. These will be similarly impossible to fix, and the only strategy will be to throw our devices away and buy new ones.
This isn’t to say you should immediately turn your computers and phones off and not use them for a few years. For the average user, this is just another attack method amongst many. All the major vendors are working on patches and workarounds for the attacks they can mitigate. All the normal security advice still applies: watch for phishing attacks, don’t click on strange e-mail attachments, don’t visit sketchy websites that might run malware on your browser, patch your systems regularly, and generally be careful on the Internet.
You probably won’t notice that performance hit once Meltdown is patched, except maybe in backup programs and networking applications. Embedded systems that do only one task, like your programmable thermostat or the computer in your refrigerator, are unaffected. Small microprocessors that don’t do all of the vulnerable fancy performance tricks are unaffected. Browsers will figure out how to mitigate this in software. Overall, the security of the average Internet-of-Things device is so bad that this attack is in the noise compared to the previously known risks.
It’s a much bigger problem for cloud vendors; the performance hit will be expensive, but I expect that they’ll figure out some clever way of detecting and blocking the attacks. All in all, as bad as Spectre and Meltdown are, I think we got lucky.
But more are coming, and they’ll be worse. 2018 will be the year of microprocessor vulnerabilities, and it’s going to be a wild ride.
The opening keynote by Charlie Stross (one of my favorite authors) was quite interesting, with the observation that corporations can be viewed as an early form of artificial intelligence, along with all sorts of interesting consequences of that; it’s worth setting aside some time to watch it (I don’t know whether he’ll post it on his blog).
The talk about the gamified social credit system in China didn’t tell me anything new and wasn’t particularly well presented, but it’s worth reading up on the situation.
Harald Welte talked about the internet and the BBSes of the old days (in Germany, that is), generally all things we used to play with back in the day. Ivo asked me whether we could put together a talk like that, or find a history of what happened in Bulgaria. I thought something like that already existed, but I can’t find it; does anyone remember a good history of those times?
The talk about Iran had a little useful information in it, but mostly wasn’t worth it. The talk about Saudi Arabia didn’t have much content either.
The talk on “Low Cost Non-Invasive Biomedical Imaging” is my favorite so far, and we should get one of those for the lab. It sounds like a technology worth playing with, and one that could greatly improve the work of all kinds of doctors.
“Defeating (Not)Petya’s Cryptography” had some useful moments.
Once I manage to watch a few more things, I’ll write about them too. Anyone who wants to can go watch directly at initLab; that way there’ll be someone to discuss them with 🙂
Update: “The Ultimate Apollo Guidance Computer Talk” turned out to be great, especially the architecture of the thing, which looks like it was held together with wire and duct tape.
Abstract: Among storage components, hard disk drives (HDDs) have become the most commonly used type of non-volatile storage due to their recent technological advances, including enhanced energy efficacy and significantly improved areal density. Such advances in HDDs have made them an inevitable part of numerous computing systems, including personal computers, closed-circuit television (CCTV) systems, medical bedside monitors, and automated teller machines (ATMs). Despite the widespread use of HDDs and their critical role in real-world systems, there exist only a few research studies on the security of HDDs. In particular, prior research studies have discussed how HDDs can potentially leak critical private information through acoustic or electromagnetic emanations. Borrowing theoretical principles from acoustics and mechanics, we propose a novel denial-of-service (DoS) attack against HDDs that exploits a physical phenomenon known as acoustic resonance. We perform a comprehensive examination of the physical characteristics of several HDDs and create acoustic signals that cause significant vibrations in HDDs’ internal components. We demonstrate that such vibrations can negatively influence the performance of HDDs embedded in real-world systems. We show the feasibility of the proposed attack in two real-world case studies, namely, personal computers and CCTVs.
The cell phones we carry with us constantly are the most perfect surveillance device ever invented, and our laws haven’t caught up to that reality. That might change soon.
This week, the Supreme Court will hear a case with profound implications on your security and privacy in the coming years. The Fourth Amendment’s prohibition of unlawful search and seizure is a vital right that protects us all from police overreach, and the way the courts interpret it is increasingly nonsensical in our computerized and networked world. The Supreme Court can either update current law to reflect the world, or it can further solidify an unnecessary and dangerous police power.
The case centers on cell phone location data and whether the police need a warrant to get it, or if they can use a simple subpoena, which is easier to obtain. Current Fourth Amendment doctrine holds that you lose all privacy protections over any data you willingly share with a third party. Your cellular provider, under this interpretation, is a third party with whom you’ve willingly shared your movements, 24 hours a day, going back months — even though you don’t really have any choice about whether to share with them. So police can request records of where you’ve been from cell carriers without any judicial oversight. The case before the court, Carpenter v. United States, could change that.
Traditionally, information that was most precious to us was physically close to us. It was on our bodies, in our homes and offices, in our cars. Because of that, the courts gave that information extra protections. Information that we stored far away from us, or gave to other people, afforded fewer protections. Police searches have been governed by the “third-party doctrine,” which explicitly says that information we share with others is not considered private.
The Internet has turned that thinking upside-down. Our cell phones know who we talk to and, if we’re talking via text or e-mail, what we say. They track our location constantly, so they know where we live and work. Because they’re the first and last thing we check every day, they know when we go to sleep and when we wake up. Because everyone has one, they know whom we sleep with. And because of how those phones work, all that information is naturally shared with third parties.
More generally, all our data is literally stored on computers belonging to other people: our e-mail, text messages, photos, Google docs, and more, all in the cloud. We store it there not because it’s unimportant, but precisely because it is important. And as the Internet of Things computerizes the rest of our lives, even more data will be collected by other people: data from our health trackers and medical devices, data from our home sensors and appliances, and data from Internet-connected “listeners” like Alexa, Siri, and your voice-activated television.
All this data will be collected and saved by third parties, sometimes for years. The result is a detailed dossier of your activities more complete than any private investigator – or police officer – could possibly collect by following you around.
The issue here is not whether the police should be allowed to use that data to help solve crimes. Of course they should. The issue is whether that information should be protected by the warrant process that requires the police to have probable cause to investigate you and get approval by a court.
Warrants are a security mechanism. They prevent the police from abusing their authority to investigate someone they have no reason to suspect of a crime. They prevent the police from going on “fishing expeditions.” They protect our rights and liberties, even as we willingly give up our privacy to the legitimate needs of law enforcement.
The third-party doctrine never made a lot of sense. Just because I share an intimate secret with my spouse, friend, or doctor doesn’t mean that I no longer consider it private. It makes even less sense in today’s hyper-connected world. It’s long past time the Supreme Court recognized that a months-long history of my movements is private, and that my e-mails and other personal data deserve the same protections, whether they’re on my laptop or on Google’s servers.
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
Raju Gulabani, VP, Database, Analytics and AI at AWS will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business, such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real-time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly.
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. The American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart disease. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases.
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session focuses on identifying the components and workflows in your current environment and on providing best practices to migrate these workloads to the right AWS data analytics products. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so they can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud-optimized GeoTIFF, MRF, and CRF, as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using OAuth, an authorization protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for OAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows, as well as a REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora.
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliance on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high-quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness and share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and set up filters and drill-downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are over-run with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), API, clickstream, unstructured, and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning.
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC Flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:Invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
ClearCare – helping home care agencies operate efficiently and grow their business.
DNAnexus – providing a cloud-based global network for sharing and managing genomic data.
ClearCare (San Francisco, CA)
ClearCare envisions a future where home care is the only choice for aging in place. Home care agencies play a critical role in the economy and their communities by significantly lowering the overall cost of care, reducing the number of hospital admissions, and bending the cost curve of aging. Patients receiving home care typically have multiple chronic conditions and functional limitations, driving over $190 billion in healthcare spending in the U.S. each year. To offset these costs, health insurance payers are developing in-home care management programs for patients. ClearCare’s goal is to help home care agencies leverage technology to improve costs, outcomes, and quality of life for the aging population. The company’s powerful software platform is specifically designed for use by non-medical, in-home care agencies to manage their businesses.
Founder and CEO Geoff Nudd created ClearCare because of his own grandmother’s need for care. Keeping family members and caregivers up to date on a loved one’s well-being can be difficult, so Geoff created what is now ClearCare’s Family Room, which enables caregivers and agency staff to check schedules and receive real-time updates about what’s happening in the home. Since then, agencies have provided feedback on other areas of their businesses that could be streamlined. ClearCare has now built over 20 modules to help home care agencies optimize operations with services including a telephony service, billing and payroll, and more. ClearCare now serves over 4,000 home care agencies, representing 500,000 caregivers and 400,000 seniors.
Using AWS, ClearCare is able to spin up reliable infrastructure for proofs of concept and iterate on those systems to quickly get value to market. The company runs many AWS services including Amazon Elasticsearch Service, Amazon RDS, and Amazon CloudFront. Amazon EMR and Amazon Athena have enabled ClearCare to build a Hadoop-based ETL and data warehousing system that processes terabytes of data each day. By utilizing these managed services, ClearCare has been able to go from concept to customer delivery in less than three months.
DNAnexus is accelerating the application of genomic data in precision medicine by providing a cloud-based platform for sharing and managing genomic and biomedical data and analysis tools. The company was founded in 2009 by Stanford graduate student Andreas Sundquist and two Stanford professors Arend Sidow and Serafim Batzoglou, to address the need for scaling secondary analysis of next-generation sequencing (NGS) data in the cloud. The founders quickly learned that users needed a flexible solution to build complex analysis workflows and tools that enable them to share and manage large volumes of data. DNAnexus is optimized to address the challenges of security, scalability, and collaboration for organizations that are pursuing genomic-based approaches to health, both in clinics and research labs. DNAnexus has a global customer base – spanning North America, Europe, Asia-Pacific, South America, and Africa – that runs a million jobs each month and is doubling their storage year-over-year. The company currently stores more than 10 petabytes of biomedical and genomic data. That is equivalent to approximately 100,000 genomes, or in simpler terms, over 50 billion Facebook photos!
DNAnexus is working with its customers to help expand their translational informatics research, which includes expanding into clinical trial genomic services. This will help companies developing different medicines to better stratify clinical trial populations and develop companion tests that enable the right patient to get the right medicine. In collaboration with Janssen Human Microbiome Institute, DNAnexus is also launching Mosaic – a community platform for microbiome research.
AWS provides DNAnexus and its customers the flexibility to grow and scale research programs. Building the technology infrastructure required to manage these projects in-house is expensive and time-consuming. DNAnexus removes that barrier for labs of any size by using AWS scalable cloud resources. The company deploys its customers’ genomic pipelines on Amazon EC2, using Amazon S3 for high-performance, high-durability storage, and Amazon Glacier for low-cost data archiving. DNAnexus is also an AWS Life Sciences Competency Partner.
Deriving insights from large datasets is central to nearly every industry, and life sciences is no exception. To combat the rising cost of bringing drugs to market, pharmaceutical companies are looking for ways to optimize their drug development processes. They are turning to big data analytics to better quantify the effect that their drug compounds have on different populations and to look for new clinical indications for existing drugs.
A real world evidence (RWE) platform is loosely defined as an integrated set of services and products that life sciences companies use to securely acquire, store, and analyze large, often disparate datasets to gain insight into the functions of a specific drug or intervention. The petabyte-scale data generated from wearables, medical devices, genomics, clinical imaging, and claims (to name a few) allows pharmaceutical and other life sciences companies to build big data platforms to analyze these datasets. And many of them are doing it on AWS.
At the center of nearly every RWE platform is a data lake that houses different data types. It also stores the related metadata to identify where each piece of data came from, who owned it, etc. Analytics engines integrate the relevant streaming (e.g., wearables), structured (e.g., claims data), and unstructured (e.g., notes in electronic health records) data.
In this post, I highlight common architectural patterns that customers are using to maximize the value of real world evidence on AWS. The architecture presented here can be reproduced in multiple regions, so you can respect local data sovereignty requirements, when applicable, while conducting global studies. This post doesn’t cover all considerations for real world evidence (such as security and authentication), but instead focuses on the areas that are related to your data flow.
A data lake is your source of truth
The ability to integrate disparate data types is critical to maximizing the utility of RWE. Life sciences companies have to be able to store, search, and retrieve data of different types and sizes, including (but not limited to) the following:
Streaming data from wearables and medical devices
Structured data from genomics and claims data
Unstructured data from notes in electronic health records
A common solution to integrate these data types is a data lake. Data lakes allow organizations to store all their data, regardless of data type, in a centralized repository. Because data can be stored as-is, there’s no need to convert it to a predefined schema. And you no longer need to know what questions you want to ask of your data beforehand. You can use data lakes for ad hoc analyses, so you can quickly explore and discover new insights without needing to structure the data first, as you would with a traditional data warehouse.
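As a small, concrete illustration of storing data as-is, the sketch below lands a raw device file in the S3-backed lake with user-defined metadata attached at upload time, which downstream cataloging can key on. The bucket, key layout, and metadata fields are assumptions for illustration, not a prescribed layout.

```python
# Hypothetical example: landing a raw device file in the data lake with
# user-defined metadata (study and subject identifiers) attached at upload time.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="subject_001_eeg_2017-11-01.edf",           # local raw file (assumed)
    Bucket="rwe-data-lake-raw",                           # hypothetical bucket
    Key="wearables/study-042/subject-001/eeg/2017-11-01.edf",
    ExtraArgs={
        "Metadata": {
            "study-id": "study-042",
            "subject-id": "subject-001",
            "data-type": "wearable-eeg",
        }
    },
)
```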
Although there are many uses for storing real world evidence data in a data lake, here are a few examples of the processes it can facilitate for pharma and biotech companies:
Quickly access all data for a given subject, from clinical images to genomics to claims data.
Associate incoming RWE data with existing data in your RWE data lake, such as by subject or study.
Select specific windows in longitudinal studies (microbiome, metabolomics, etc.).
The following diagram shows an architecture of the features within a data lake and several examples of how data enters a data lake:
This architecture, which is entirely serverless and backed by Amazon S3, lets you scale your data lake to easily accommodate any data size for your RWE platform. Additionally, the components presented in this diagram can be secured via IAM controls and service policies. This enables you to secure and protect the sensitive information that often resides within an RWE platform. Amazon S3 is the canonical source for objects in your RWE data lake. You track and search metadata that is associated with these objects in a data catalog built on Amazon Elasticsearch Service and Amazon DynamoDB.
Let’s dive deeper into what this architecture does and how data makes it into the data lake:
1. Streaming (e.g., wearables), structured, and unstructured data is acquired from myriad devices and sources. Depending on the data size, you might use AWS IoT (streaming), AWS Storage Gateway (mid-size/continual batch), and AWS Snowball (large legacy datasets, such as imaging). AWS IoT writes to Amazon Kinesis Firehose, which transforms the telemetry data in-flight to land both transformed and raw data in Amazon S3.
2. When data lands in Amazon S3 buckets, an AWS Lambda function is invoked (either by trigger or manually).
3. This Lambda function writes to a data catalog that is fronted by Amazon API Gateway. The data catalog contains metadata about all the object data in Amazon S3, as well as data that resides in databases, such as Amazon Redshift.
4. An AWS Lambda function on the other side of API Gateway writes the appropriate metadata about the objects, such as the study that the data was generated from, into Amazon Elasticsearch Service and/or Amazon DynamoDB, which I refer to as the data catalog (a simplified sketch of this cataloging step follows below).
The data catalog mentioned in steps 3 and 4 is central to your data management. In most analyses, the first step is to build a data manifest of where the data-of-interest lies, which is discussed in later sections.
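To make the cataloging step more concrete, here is a minimal Python sketch of an S3-triggered Lambda function that records object metadata in DynamoDB. For brevity it collapses steps 3 and 4 into a single direct write (skipping the API Gateway front end), and the table name, key schema, and metadata fields are hypothetical rather than taken from the actual implementation.

```python
# Minimal sketch (not the authors' implementation) of the cataloging step:
# an S3-triggered Lambda that records object metadata in a DynamoDB table.
import urllib.parse

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
catalog_table = dynamodb.Table("rwe-data-catalog")  # hypothetical table name


def lambda_handler(event, context):
    """Invoked by an S3 PUT event; writes one catalog entry per new object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Pull user-defined metadata (e.g., study ID, data type) set at upload time.
        head = s3.head_object(Bucket=bucket, Key=key)
        user_metadata = head.get("Metadata", {})

        catalog_table.put_item(
            Item={
                "object_uri": f"s3://{bucket}/{key}",       # partition key (assumed)
                "size_bytes": head["ContentLength"],
                "last_modified": head["LastModified"].isoformat(),
                "study_id": user_metadata.get("study-id", "unknown"),
                "data_type": user_metadata.get("data-type", "unknown"),
            }
        )
    return {"cataloged": len(event["Records"])}
```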
Normalizing data for real world evidence
As I previously mentioned, data can enter your real world evidence data lake in many different formats. In many cases, you might need to normalize this data into a specific format to ease downstream analysis. For example, you might have to integrate many different electronic health records that are stored in different formats. You also might have to transform genomics data, often represented as a variant call format (VCF) file, into a columnar format such as Apache Parquet that big data technologies can query more efficiently. While you can certainly analyze this data in the format in which it entered, it might be better for querying after it’s transformed into a different format, or loaded into a different data store.
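As a rough illustration of such a transformation, the sketch below flattens the fixed columns of a VCF file into Parquet with pandas. A production pipeline would typically also expand the INFO and per-sample fields and run at scale (for example, on Spark or AWS Batch); the paths and column choices here are assumptions for illustration only.

```python
# Simplified sketch of one normalization job: flattening the fixed columns of a
# VCF file into a columnar Parquet file.
import gzip

import pandas as pd  # requires pandas and pyarrow

VCF_COLUMNS = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]


def vcf_to_parquet(vcf_path: str, parquet_path: str) -> None:
    opener = gzip.open if vcf_path.endswith(".gz") else open
    rows = []
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):        # skip meta-information and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            rows.append(dict(zip(VCF_COLUMNS, fields[:8])))

    df = pd.DataFrame(rows, columns=VCF_COLUMNS)
    df["pos"] = df["pos"].astype("int64")   # store the position as a proper integer
    df.to_parquet(parquet_path, index=False)


# Example (hypothetical local paths; in practice the input and output live in S3):
# vcf_to_parquet("subject_001.vcf.gz", "subject_001.variants.parquet")
```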
In the architecture shown in the following diagram, AWS Step Functions orchestrates the data normalization (extract, transform, load—or ETL) process. Step Functions is a serverless workflow service that ensures that your long-running ETL jobs execute in order and complete successfully. At a high level, you first query the data catalog to get a manifest of the data to be normalized. Normalization occurs on Amazon EC2 instances, and the results are then stored back in Amazon S3 or in databases such as Amazon Redshift for future analysis. The locations of these results are also logged in the data catalog for future querying.
The ETL workflow can be expressed as a Step Functions state machine, defined in Amazon States Language (a JSON document) and registered with the Step Functions service.
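As an illustrative sketch only (not the original definition), the Python snippet below registers a comparable state machine with boto3: build the manifest, submit batch jobs, then poll until the jobs complete. Every ARN, name, and timing value is a hypothetical placeholder.

```python
# Illustrative only: registering a manifest -> batch-submission -> completion-check
# state machine with boto3. All ARNs, names, and timing values are hypothetical.
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "Comment": "Sketch of an RWE data-normalization (ETL) workflow",
    "StartAt": "BuildManifest",
    "States": {
        "BuildManifest": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:build-manifest",
            "Next": "SubmitBatchJobs",
        },
        "SubmitBatchJobs": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:submit-batch-jobs",
            "Next": "WaitForJobs",
        },
        "WaitForJobs": {"Type": "Wait", "Seconds": 300, "Next": "CheckJobStatus"},
        "CheckJobStatus": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:check-job-status",
            "Next": "JobsComplete",
        },
        "JobsComplete": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "SUCCEEDED", "Next": "Done"},
                {"Variable": "$.status", "StringEquals": "FAILED", "Next": "Failed"},
            ],
            "Default": "WaitForJobs",
        },
        "Done": {"Type": "Succeed"},
        "Failed": {"Type": "Fail"},
    },
}

response = sfn.create_state_machine(
    name="rwe-etl-sketch",                                           # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/rwe-etl-sfn-role",       # hypothetical role
)
print(response["stateMachineArn"])
```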
1. Either a user or an application submits a POST request through Amazon API Gateway to process or transform a set of data. This request contains the query parameters that correspond to generating the data manifest. It is passed into an AWS Step Functions state machine, which orchestrates the ETL process.
2. A Lambda function queries the data catalog and builds a manifest that contains the location of the data of interest.
3. The manifest is passed to downstream Lambda functions in the state machine that orchestrate the batch workflow to process the data that’s specified in the manifest.
4. These Lambda functions submit jobs to AWS Batch to execute batch jobs on Amazon EC2 (a sketch of this submission step follows the list).
5. Amazon EC2 processes the data (e.g., data normalization) and stores results back in Amazon S3 and/or Amazon Redshift.
6. Results data is then logged in the data catalog, using the process shown in the data acquisition diagram.
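Here is a minimal sketch of the job-submission step (step 4): a Lambda function that walks the manifest and submits one AWS Batch job per object to normalize. The queue, job definition, and manifest fields are hypothetical placeholders.

```python
# Minimal sketch of the job-submission Lambda: walk the data manifest and submit
# one AWS Batch job per object. Queue, job definition, and field names are assumed.
import boto3

batch = boto3.client("batch")


def lambda_handler(event, context):
    """Receives the manifest from the previous state and fans out Batch jobs."""
    manifest = event["manifest"]   # e.g., [{"object_uri": "s3://bucket/key.vcf.gz"}, ...]
    job_ids = []

    for entry in manifest:
        basename = entry["object_uri"].rsplit("/", 1)[-1].replace(".", "-")
        response = batch.submit_job(
            jobName="normalize-" + basename,
            jobQueue="rwe-normalization-queue",            # hypothetical queue
            jobDefinition="vcf-to-parquet:1",              # hypothetical job definition
            containerOverrides={
                "environment": [
                    {"name": "INPUT_URI", "value": entry["object_uri"]},
                    {"name": "OUTPUT_PREFIX", "value": "s3://rwe-results-bucket/parquet/"},
                ]
            },
        )
        job_ids.append(response["jobId"])

    # Pass the job IDs along so a later state can poll for completion.
    return {"job_ids": job_ids, "status": "SUBMITTED"}
```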
Analyzing and visualizing data in real world evidence
Ultimately, the value in real world evidence platforms is the value you derive from the wide variety of data that feeds into them. Big data analytics (such as Spark on Amazon EMR), machine learning (with the Deep Learning AMIs), and healthcare data warehouses are all fundamental to maximizing the value of RWE.
Given the wide array of questions you can answer, you want your architecture to be as flexible as possible. Your organization also might use business intelligence (BI) tools. Again, as with the data normalization tier, you can use Step Functions to orchestrate your workflow. You first build the data manifest and then submit each portion of your desired analysis to the relevant data location and compute options, such as Amazon EMR or Amazon Redshift. Your organization’s BI tool of choice, such as Amazon QuickSight or one offered by our AWS Big Data Competency partners, can extract and visualize the results.
Here is the workflow in more detail:
1. A user initiates a query on a dataset. In the context of RWE, this is largely analyzing a cohort in the context of a specific indication (drug response, etc.).
2. AWS Step Functions invokes a Lambda function to query the data catalog and build a manifest of data.
3. The manifest is passed to subsequent Lambda functions, which orchestrate the data analysis through different AWS services.
4. These analyses can include Amazon EMR for population-scale genomics, Amazon EC2 for HPC and machine learning, and Amazon Redshift for your healthcare data warehouse.
5. Results from each analysis are staged back in Amazon S3 and logged in the data catalog.
6. BI tools like Amazon QuickSight or Tableau, or data analysis notebooks like Jupyter, can query Amazon S3 to visualize results, such as through Amazon Athena (see the query sketch below).
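As a small sketch of that last step, the snippet below runs an ad hoc Athena query over results stored in S3 from a Jupyter notebook (or any Python client) using boto3. The database, table, and output bucket names are assumptions.

```python
# Sketch: run an ad hoc Athena query over results in S3 and print the first page.
# Database, table, and bucket names are hypothetical.
import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT variant_id,
       COUNT(*) AS responder_count
FROM rwe_results.variant_response      -- hypothetical table over Parquet in S3
WHERE responder = TRUE
GROUP BY variant_id
ORDER BY responder_count DESC
LIMIT 20
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://rwe-athena-results/"},  # assumed bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row holds the column headers
        print([col.get("VarCharValue") for col in row["Data"]])
```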
A practical example
Imagine that you are at a pharmaceutical company that is developing a drug to address a neurological condition. You have longitudinal data in the form of brain scans and have noticed that the drug compound under development seems to benefit a specific subpopulation of individuals. As a result, you decide to sequence the genomes of all the responders and non-responders to look for specific biomarkers that could indicate response levels. Success would result in a companion diagnostic that you could use alongside your drug to increase its overall efficacy by delivering it only to the responding population.
Let’s briefly look at how this study might play out in the steps of data acquisition, normalization, and analysis described earlier.
Data from longitudinal brain imaging studies can be moved to AWS using AWS Snowball. Genomics data coming off your genome sequencers would land in Amazon S3 via AWS Storage Gateway and/or AWS Direct Connect. These datasets are stored in your data catalog where you track items such as date generated, anonymized subject ID, etc.
Genomes are transformed from raw reads (e.g., FASTQ) to a human readable format (VCF) that identifies variations in a genome. These VCF files are subsequently extracted, transformed, and loaded into a format that is amenable to big data analytics, such as Parquet.
You build a data manifest of your brain scans in Amazon S3. You use the deep learning AMI and P2 instance family to build machine learning models to identify images that represent different stages in your disorder-of-interest. You then run association analyses that combine your genomics data with your brain imaging models and drug response data to identify what genomic variants associate with improved treatment outcomes. You can manage these analyses via Jupyter notebooks, or you can connect with your BI tool of choice to visualize your results.
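To sketch what such an association analysis might look like in code, the example below joins per-subject variant calls with drug response labels and runs a Fisher's exact test per variant. The Parquet file names and column names are hypothetical, and a real study would also need multiple-testing correction and covariate adjustment.

```python
# Illustrative sketch of the association step: test whether carrying a given variant
# is associated with drug response, using a Fisher's exact test per variant.
import pandas as pd
from scipy.stats import fisher_exact

# Per-subject variant calls and per-subject response labels (hypothetical files).
variants = pd.read_parquet("cohort_variants.parquet")   # columns: subject_id, variant_id, has_variant (bool)
responses = pd.read_parquet("drug_response.parquet")    # columns: subject_id, responder (bool)

merged = variants.merge(responses, on="subject_id")

results = []
for variant_id, group in merged.groupby("variant_id"):
    # 2x2 contingency table: variant carrier status vs. responder status.
    table = pd.crosstab(group["has_variant"], group["responder"]).reindex(
        index=[False, True], columns=[False, True], fill_value=0
    )
    odds_ratio, p_value = fisher_exact(table.values)
    results.append({"variant_id": variant_id, "odds_ratio": odds_ratio, "p_value": p_value})

associations = pd.DataFrame(results).sort_values("p_value")
print(associations.head(10))
```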
It will always be day one for RWE
By implementing real world evidence platforms on AWS, you can quickly integrate and interrogate disparate healthcare data to advance human health. By designing your RWE platform to be flexible, you can readily incorporate new big data technologies as they become relevant to your needs. This enables you to innovate and discover at a quicker rate.
The healthcare ecosystem has chosen a variety of tools and techniques for working with big data, but one tool that comes up again and again in many of the architectures we design and review is Spark on Amazon EMR. Will Spark power the data behind precision medicine?
About the Authors
Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s exploring the outdoors, learning a new thing to cook, or spending time with his wife and his dog, Macaroon.
Welcome back to another month of Hot Startups! Every day, startups are creating innovative and exciting businesses, applications, and products around the world. Each month we feature a handful of startups doing cool things using AWS.
July is all about learning! These companies are focused on providing access to tools and resources to expand knowledge and skills in different ways.
This month’s startups:
CodeHS – provides fun and accessible computer science curriculum for middle and high schools.
Insight – offers intensive fellowships to grow technical talent in Data Science.
iTranslate – enables people to read, write, and speak in over 90 languages, anywhere in the world.
CodeHS (San Francisco, CA)
In 2012, Stanford students Zach Galant and Jeremy Keeshin were computer science majors and TAs for introductory classes when they noticed a trend among their peers. Many wished that they had been exposed to computer science earlier in life. In their senior year, Zach and Jeremy launched CodeHS to give middle and high schools the opportunity to provide a fun, accessible computer science education to students everywhere. CodeHS is a web-based curriculum pathway complete with teacher resources, lesson plans, and professional development opportunities. The curriculum is supplemented with time-saving teacher tools to help with lesson planning, grading and reviewing student code, and managing their classroom.
CodeHS aspires to empower all students to meaningfully impact the future, and believes that coding is becoming a new foundational skill, alongside reading and writing, that allows students to further explore any interest or area of study. At the time CodeHS was founded in 2012, only 10% of high schools in America offered a computer science course. Zach and Jeremy set out to change that by providing a solution that made it easy for schools and districts to get started. With CodeHS, thousands of teachers have been trained and are teaching hundreds of thousands of students all over the world. To use CodeHS, all that’s needed is the internet and a web browser. Students can write and run their code online, and teachers can immediately see what the students are working on and how they are doing.
Amazon EC2, Amazon RDS, Amazon ElastiCache, Amazon CloudFront, and Amazon S3 make it possible for CodeHS to scale their site to meet the needs of schools all over the world. CodeHS also relies on AWS to compile and run student code in the browser, which is extremely important when teaching server-side languages such as Java, which powers the AP course. Since usage rises and falls based on school schedules, Amazon CloudWatch and Elastic Load Balancing are used to easily scale up when students are running code so they have a seamless experience.
Insight was founded in 2012 to create a new educational model, optimize hiring for data teams, and facilitate successful career transitions among data professionals. Over the last 5 years, Insight has kept ahead of market trends and launched a series of professional training fellowships including Data Science, Health Data Science, Data Engineering, and Artificial Intelligence. Finding individuals with the right skill set, background, and culture fit is a challenge for big companies and startups alike, and Insight is focused on developing top talent through intensive 7-week fellowships. To date, Insight has over 1,000 alumni at over 350 companies including Amazon, Google, Netflix, Twitter, and The New York Times.
The Data Engineering team at Insight is well-versed in the current ecosystem of open source tools and technologies and provides mentorship on the best practices in this space. The technical teams are continually working with external groups in a variety of data advisory and mentorship capacities, but the majority of Insight partners participate in professional sessions. Companies visit the Insight office to speak with fellows in an informal setting and provide details on the type of work they are doing and how their teams are growing. These sessions have proved invaluable as fellows experience a significantly better interview process and companies yield engaged and enthusiastic new team members.
An important aspect of Insight’s fellowships is the opportunity for hands-on work, focusing on everything from building big-data pipelines to contributing novel features to industry-standard open source efforts. Insight provides free AWS resources for all fellows to use, in addition to mentorship from the Data Engineering team. Fellows regularly use Amazon S3, Amazon EC2, Amazon Kinesis, Amazon EMR, AWS Lambda, Amazon Redshift, and Amazon RDS, among other services. The experience with AWS gives fellows a solid skill set as they transition into the industry. Fellowships are currently being offered in Boston, New York, Seattle, and the Bay Area.
Check out the Insight blog for more information on trends in data infrastructure, artificial intelligence, and cutting-edge data products.
When the App Store was introduced in 2008, the founders of iTranslate saw an opportunity to be part of something big. The group of four fully believed that the iPhone and apps were going to change the world, and together they brainstormed ideas for their own app. The combination of translation and mobile devices seemed a natural fit, and by 2009 iTranslate was born. iTranslate’s mission is to enable travelers, students, business professionals, employers, and medical staff to read, write, and speak in all languages, anywhere in the world. The app allows users to translate text, voice, websites and more into nearly 100 languages on various platforms. Today, iTranslate is the leading player for conversational translation and dictionary apps, with more than 60 million downloads and 6 million monthly active users.
iTranslate is breaking language barriers through disruptive technology and innovation, enabling people to translate in real time. The app has a variety of features designed to optimize productivity including offline translation, website and voice translation, and language auto detection. iTranslate also recently launched the world’s first ear translation device in collaboration with Bragi, a company focused on smart earphones. The Dash Pro allows people to communicate freely, while having a personal translator right in their ear.
iTranslate started using Amazon Polly soon after it was announced. CEO Alexander Marktl said, “As the leading translation and dictionary app, it is our mission at iTranslate to provide our users with the best possible tools to read, write, and speak in all languages across the globe. Amazon Polly provides us with the ability to efficiently produce and use high quality, natural sounding synthesized speech.” The stable and simple-to-use API, low latency, and free caching allow iTranslate to scale as they continue adding features to their app. Customers also enjoy the option to change speech rate and change between male and female voices. To assure quality, speed, and reliability of their products, iTranslate also uses Amazon EC2, Amazon S3, and Amazon Route 53.
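For a sense of what that integration looks like, here is a minimal boto3 sketch of a text-to-speech call like the one iTranslate describes; the voice, text, and output file are illustrative placeholders, not iTranslate’s actual implementation.

```python
# Hypothetical example of synthesizing speech with Amazon Polly via boto3.
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Text="Where is the nearest train station?",
    OutputFormat="mp3",
    VoiceId="Joanna",  # choosing a different VoiceId switches between voices
)

# Polly returns an audio stream, which an app can cache or play back directly.
with open("translation.mp3", "wb") as audio_file:
    audio_file.write(response["AudioStream"].read())
```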
You’re watching the new episode of Game of Thrones, and suddenly you hear your children, up and about after their bedtime! Now you’ll probably miss a crucial moment of the show because you have to put them to bed again. Or you’re out to dinner with friends and longing for the sight of your sleeping small humans. What do you do? Text the babysitter to check on them? Well, luckily for you these issues could soon be things of the past, thanks to Bert Vuylsteke and his Pi-powered Sleepbuddy. This IoT-controlled social robot could fulfil all your remote babysitting needs!
This is the actual concept of my robot and in what context it can be used.
A social robot?
A social robot fulfils a role normally played by a person, and interacts with humans via human language, gestures, and facial expressions. This is what Bert says about the role of the Sleepbuddy:
[For children, it] is a friend or safeguard from nightmares, but it is so much more for the babysitters or parents. The babysitters or parents connect their smartphone/tablet/PC to the Sleepbuddy. This will give them access to control all his emotions, gestures, microphone, speaker and camera. In the eye is a hidden camera to see the kids sleeping. The speaker and microphone allow communication with the kids through WiFi.
The roots of the Sleepbuddy
As a student at Ghent University, Bert had to build a social robot using OPSORO, the university’s open-source robotics platform. The developers of this platform create social robots for research purposes. They are also making all software, as well as hardware design plans, available on GitHub. In addition, you will soon be able to purchase their robot kits via a Kickstarter. OPSORO robots are designed around the Raspberry Pi, and controlled via a web interface. The interface allows you to customise your robot’s behaviour, using visual or text-based programming languages.
The Sleepbuddy’s components
Building the Sleepbuddy
Bert has provided a detailed Instructable describing the process of putting the Sleepbuddy together, complete with video walk-throughs. Bear in mind, though, that the making techniques he used include thermoforming, laser cutting, and 3D printing. If you want to recreate this build, you may need to contact your local makerspace to find out whether they have the necessary equipment.
Assembling the Sleepbuddy
Finally, Bert added an especially cute touch to this project by covering the Sleepbuddy in blackboard paint, so kids can draw on the robot to really make it their own!
So many robots!
At Pi Towers we are partial to all kinds of robots, be they ones that test medical devices, play chess or Connect 4, or fight other robots. If they twerk, or are cute, tiny, or shoddy, we maybe even like them a tiny bit more.
Do you share our love of robots? Would you like to make your own? Then check out our resource for building a simple robot buggy. Maybe it will kick-start your career as the general of a robot army. A robot army that does good, of course! Let us know your benevolent robot overlord plans in the comments.
If you are anything like me, Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning are completely fascinating and exciting topics. As AI, ML, and Deep Learning become more widely used, for me it means that the science fiction written by Dr. Isaac Asimov, the robotics and medical advancements in Star Wars, and the technologies that enabled Captain Kirk and his Star Trek crew “to boldly go where no man has gone before” can become achievable realities.
Most people interested in the aforementioned topics are familiar with the AI and ML solutions enabled by Deep Learning, such as Convolutional Neural Networks for Image and Video Classification, Speech Recognition, Natural Language interfaces, and Recommendation Engines. However, it is not always easy to set up the infrastructure, environment, and tools that enable data scientists, machine learning practitioners, research scientists, and deep learning hobbyists/advocates to dive into these technologies. Most developers want to move quickly from getting started with deep learning to training models and developing solutions using deep learning technologies.
For these reasons, I would like to share some resources that will help to quickly build deep learning solutions whether you are an experienced data scientist or a curious developer wanting to get started.
Deep Learning Resources
Apache MXNet is Amazon’s deep learning framework of choice. With the power of the Apache MXNet framework and NVIDIA GPU computing, you can easily launch scalable deep learning projects and solutions on the AWS Cloud. As you get started on your MXNet deep learning quest, there are a variety of self-service tutorials and datasets available to you:
Launch an AWS Deep Learning AMI: This guide walks you through the steps to launch the AWS Deep Learning AMI with Ubuntu
MXNet – Create a computer vision application: This hands-on tutorial uses a pre-built notebook to walk you through using neural networks to build a computer vision application to identify handwritten digits
AWS Machine Learning Datasets: AWS hosts datasets for Machine Learning on the AWS Marketplace that you can access for free. These large datasets are available for anyone to analyze without needing to download or store the data.
Predict and Extract – Learn to use pre-trained models for predictions: This hands-on tutorial walks you through using a pre-trained model for prediction and feature extraction with the full ImageNet dataset; a minimal sketch of the same idea follows this list.
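The sketch below is not the tutorial’s notebook, just a hedged illustration of loading a pre-trained ImageNet model from MXNet’s Gluon model zoo and running a prediction; the image path is a placeholder.

```python
# Illustrative prediction with a pre-trained model from the Gluon model zoo.
import mxnet as mx
from mxnet.gluon.model_zoo import vision
from mxnet.gluon.data.vision import transforms

net = vision.resnet18_v1(pretrained=True)   # downloads ImageNet weights

# Standard ImageNet preprocessing.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = mx.image.imread("cat.jpg")              # placeholder input image (HWC, uint8)
batch = transform(img).expand_dims(axis=0)    # add a batch dimension

probs = mx.nd.softmax(net(batch))[0].asnumpy()
top5 = probs.argsort()[::-1][:5]
print("Top-5 class indices:", top5)
```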
AWS Deep Learning AMIs
AWS offers Amazon Machine Images (AMIs) for use on Amazon EC2 so that you can quickly deploy the infrastructure needed to start your deep learning journey. The AWS Deep Learning AMIs come pre-configured with popular deep learning frameworks and can be launched on Amazon EC2 instances running Amazon Linux or Ubuntu for AI-targeted solutions and models. The deep learning frameworks supported and pre-configured on the Deep Learning AMI include:
Microsoft Cognitive Toolkit (CNTK)
Additionally, the AWS Deep Learning AMIs come preconfigured with Jupyter notebooks for Python 2.7 and 3.4, the AWS SDK for Python, and other data-science-related Python packages and dependencies. The NVIDIA CUDA and NVIDIA CUDA Deep Neural Network (cuDNN) libraries are also preinstalled for all of the supported deep learning frameworks, and the Intel Math Kernel Library is installed for the Apache MXNet framework. You can launch any of the Deep Learning AMIs by visiting the AWS Marketplace and using the Try the Deep Learning AMIs link.
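If you prefer to script the launch rather than click through the Marketplace, a sketch along the following lines would work with boto3; the AMI ID, key pair, and security group are placeholders you would replace with your own values.

```python
# Hypothetical launch of a Deep Learning AMI on a GPU instance via boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder Deep Learning AMI ID
    InstanceType="p2.xlarge",             # NVIDIA GPU instance type
    KeyName="my-key-pair",                # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}; SSH in and open Jupyter to start experimenting.")
```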
It is a great time to dive into Deep Learning. You can accelerate your work by using the AWS Deep Learning AMIs running on the AWS Cloud to get your deep learning environment up quickly, or get started learning more about Deep Learning on AWS with MXNet using the AWS self-service resources. Of course, you can learn even more about Deep Learning, Machine Learning, and Artificial Intelligence on AWS by reviewing the AWS Deep Learning page, the Amazon AI product page, and the AWS AI Blog.
Criminals go where the money is, and cybercriminals are no exception.
And right now, the money is in ransomware.
It’s a simple scam. Encrypt the victim’s hard drive, then extract a fee to decrypt it. The scammers can’t charge too much, because they want the victim to pay rather than give up on the data. But they can charge individuals a few hundred dollars, and they can charge institutions like hospitals a few thousand. Do it at scale, and it’s a profitable business.
And scale is how ransomware works. Computers are infected automatically, with viruses that spread over the internet. Payment is no more difficult than buying something online, and it’s payable in untraceable bitcoin, with some ransomware makers offering tech support to those unsure of how to buy or transfer bitcoin. Customer service is important; people need to know they’ll get their files back once they pay.
And they want you to pay. If they’re lucky, they’ve encrypted your irreplaceable family photos, or the documents of a project you’ve been working on for weeks. Or maybe your company’s accounts receivable files or your hospital’s patient records. The more you need what they’ve stolen, the better.
The particular ransomware making headlines is called WannaCry, and it’s infected some pretty serious organizations.
What can you do about it? Your first line of defense is to diligently install every security update as soon as it becomes available, and to migrate to systems that vendors still support. Microsoft issued a security patch that protects against WannaCry months before the ransomware started infecting systems; the malware only works against computers that haven’t been patched. And many of the systems it infects are older computers, no longer normally supported by Microsoft, though it did belatedly release a patch for those older systems. I know it’s hard, but until companies are forced to maintain old systems, you’re much safer upgrading.
This is easier advice for individuals than for organizations. You and I can pretty easily migrate to a new operating system, but organizations sometimes have custom software that breaks when they change OS versions or install updates. Many of the organizations hit by WannaCry had outdated systems for exactly these reasons. But as expensive and time-consuming as updating might be, the risks of not doing so are increasing.
Your second line of defense is good antivirus software. Sometimes ransomware tricks you into encrypting your own hard drive by clicking on a file attachment that you thought was benign. Antivirus software can often catch your mistake and prevent the malicious software from running. This isn’t perfect, of course, but it’s an important part of any defense.
Your third line of defense is to diligently back up your files. There are systems that do this automatically for your hard drive. You can invest in one of those. Or you can store your important data in the cloud. If your irreplaceable family photos are in a backup drive in your house, then the ransomware has that much less hold on you. If your e-mail and documents are in the cloud, then you can just reinstall the operating system and bypass the ransomware entirely. I know storing data in the cloud has its own privacy risks, but they may be less than the risks of losing everything to ransomware.
That takes care of your computers and smartphones, but what about everything else? We’re deep into the age of the “Internet of things.”
There are now computers in your household appliances. There are computers in your cars and in the airplanes you travel on. Computers run our traffic lights and our power grids. These are all vulnerable to ransomware. The Mirai botnet exploited a vulnerability in internet-enabled devices like DVRs and webcams to launch a denial-of-service attack against a critical internet name server; next time it could just as easily disable the devices and demand payment to turn them back on.
Re-enabling a webcam will be cheap; re-enabling your car will cost more. And you don’t want to know how vulnerable implanted medical devices are to these sorts of attacks.
Commercial solutions are coming, probably a convenient repackaging of the three lines of defense described above. But it’ll be yet another security surcharge you’ll be expected to pay because the computers and internet-of-things devices you buy are so insecure. Because there are currently no liabilities for lousy software and no regulations mandating secure software, the market rewards software that’s fast and cheap at the expense of good. Until that changes, ransomware will continue to be a profitable line of criminal business.