We’re excited to announce that Amazon Web Services (AWS) has achieved certification under the Korea-Personal Information & Information Security Management System (K-ISMS-P) standard (effective from December 16, 2020 to December 15, 2023). The assessment by the Korea Internet & Security Agency (KISA) covered the operation of infrastructure (including compute, storage, networking, databases, and security) in the AWS Asia Pacific (Seoul) Region. AWS was the first global cloud service provider (CSP) to obtain K-ISMS certification (the previous version of K-ISMS-P) back in 2017. Now AWS is the first global CSP to achieve compliance with the K-ISMS portion of the new K-ISMS-P standard.
Sponsored by KISA and affiliated with the Korean Ministry of Science and ICT (MSIT), K-ISMS-P serves as a standard for evaluating whether enterprises and organizations operate and manage their information security management systems consistently and securely, such that they thoroughly protect their information assets. The new K-ISMS-P standard combined the K-ISMS and K-PIMS (Personal Information Management System) standards with updated control items. Accordingly, the new K-ISMS certification and K-ISMS-P certification (personal information–focused) are introduced under the updated standard.
This certification helps enterprises and organizations across South Korea, regardless of industry, meet KISA compliance requirements more efficiently. Achieving this certification demonstrates the proactive approach AWS has taken to meet compliance set by the South Korean government and to deliver secure AWS services to customers. In addition, we’ve launched Quick Start and Operational Best Practices (conformance pack) pages to provide customers with a compliance framework that they can utilize for their K-ISMS-P compliance needs. Enterprises and organizations can use these toolkits and AWS certification to reduce the effort and cost of getting their own K-ISMS-P certification. You can download the AWS K-ISMS certification under the K-ISMS-P standard from AWS Artifact. To learn more about the AWS K-ISMS certification, see the AWS K-ISMS page. If you have any questions, don’t hesitate to contact your AWS Account Manager.
If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.
Today, I’m pleased to announce the launch of Cloud Audit Academy AWS-specific (CAA AWS-specific). This is a new, accelerated training program for auditing AWS Cloud implementations, and is designed for auditors, regulators, or anyone working within a control framework.
Over the past few years, how to audit security in the cloud has become one of the fastest-growing questions I hear from Amazon Web Services (AWS) customers, across multiple industries and all around the world. Here are the two pain points that I hear about most often:
Engineering teams want to move workloads that are subject to regulatory frameworks to AWS to take advantage of its innovation capabilities, but security and risk teams are uncertain how AWS can help them meet their compliance requirements through audits.
Compliance teams want to effectively audit the cloud environments and take advantage of the available security control options that are built into the cloud, but the legacy audit processes and control frameworks are built for an on-premises environment. The differences require some reconciliation and improvement work to be done on compliance programs, audit processes, and auditor training.
To help address these issues for not only AWS customers but for any auditor or compliance team facing cloud migration, we announced Cloud Audit Academy Cloud Agnostic (CAA Cloud Agnostic) at re:Inforce 2019. This foundational, first-of-its-kind course provides baseline knowledge on auditing in the cloud and on understanding the differences in control operation, design, and auditing. It is cloud agnostic and can benefit security and compliance professionals in any industry—including independent third-party auditors. Since its launch in June 2019, 1,400 students have followed this cloud audit learning path, with 91 percent of participants saying that they would recommend the workshop to others.
So today we’re releasing the next phase of that education program, Cloud Audit Academy AWS-specific. Offered virtually or in-person, CAA AWS-specific is an instructor-led workshop on addressing risks and auditing security in the AWS Cloud, with a focus on the security and audit tools provided by AWS. All instructors have professional audit industry experience, current audit credentials, and maintain AWS Solutions Architect credentials.
Here are four things to know about CAA AWS-specific and what it has to offer audit and compliance teams:
Content was created with PricewaterhouseCoopers (PwC)
PricewaterhouseCoopers worked with us to develop the curriculum content, bringing their expertise in independent risk and control auditing.
“With so many of our customers already in the cloud—or ready to be—we’ve seen a huge increase in the need to meet regulatory and compliance requirements. We’re excited to have combined our risk and controls experience with the power of AWS to create a curriculum in which customers can not only [leverage AWS to help them] meet their compliance needs, but unlock the total value of their cloud investment.” – Paige Hayes, Global Account Leader at PwC
Attendees earn continuing professional education credits
Based on feedback from CAA Cloud Agnostic, we now offer continuing professional education (CPE) credits to attendees. Completion of CAA AWS-specific will allow attendees to earn 28 CPE credits towards any of the International Information System Security Certification Consortium, or (ISC)², certifications, and 18 CPE credits towards any Global Information Assurance Certification (GIAC).
Training helps boost confidence when auditing the AWS Cloud
Our customers have proven repeatedly that running sensitive workloads in AWS can be more secure than in on-premises environments. However, a lack of knowledge and updated processes for implementing, monitoring, and proving compliance in the cloud has caused some difficulty. Through CAA AWS-specific, you will get critical training to become more comfortable and confident knowing how to audit the AWS environment with precision.
“Our FSI customer conversations are often focused on security and compliance controls. Leveraging the Cloud Audit Academy enables our team to educate the internal and external auditors of our customers. CAA provides them the necessary tools and knowledge to evaluate and gain comfort with their AWS control environment firsthand. The varying depth and levels focus on everything from basic cloud auditing to diving deeper into the domains which align with our governance and control domains. We reference key AWS services that customers can utilize to create an effective control environment that [helps to meet their] regulatory and audit expectations.” – Jeff (Axe) Axelrad, Compliance Manager, AWS Financial Services
Training enables the governance, risk, and compliance professional
In four days of CAA AWS-specific, you’ll become more comfortable with topics like control domains, network management, vulnerability management, logging and monitoring, incident response, and general knowledge about compliance controls in the cloud.
“In addition to [using AWS to help support and maintain their compliance], our customers need to be able to clearly communicate with their external auditors and regulators HOW compliance is achieved. CAA doesn’t teach auditors how to audit, but rather accelerates the learning necessary to understand specifically how the control landscape changes.” – Jesse Skibbe, Sr. Practice Manager, AWS Professional Services
CAA Cloud Agnostic provides some foundational concepts and is a prerequisite to CAA AWS-specific. It is available for free online at our AWS Training and Certification learning library, or you can contact your account manager to have a one-day instructor-led training session in person.
If it sounds like Cloud Audit Academy training would benefit you and your team, contact our AWS Security Assurance Services team or contact your AWS account manager. For more information, check out the newly updated Security Audit Learning Path.
The new AWS Security Hub PCI DSS v3.2.1 standard enables you to validate a subset of PCI DSS requirements and helps with ongoing PCI DSS security activities by conducting continuous and automated checks. The new Security Hub standard also makes it easier to proactively monitor AWS resources, which is critical for any company involved with the storage, processing, or transmission of cardholder data. There’s also a Security score feature for the Security Hub standard, which can help support preparations for a PCI DSS assessment.
Use this post to learn how to:
Enable the AWS Security Hub PCI DSS v3.2.1 standard and navigate results
Interpret your security score
Remediate failed security checks
Understand requirements related to findings
Enable Security Hub’s PCI DSS v3.2.1 standard and navigate results
Note: This section assumes that you have Security Hub enabled in one or more accounts. To learn how to enable Security Hub, follow these instructions. If you don’t have Security Hub enabled, the first time you enable Security Hub you will be given the option to enable PCI DSS v3.2.1.
To enable the PCI DSS v3.2.1 security standard in Security Hub:
Open Security Hub and enable PCI DSS v3.2.1 Security standards. (Once enabled, Security Hub will begin evaluating related resources in the current AWS account and region against the AWS controls within the standard. The scope of the assessment is the current AWS account).
When the evaluation completes, select View results.
Now you are on the PCI DSS v3.2.1 page (Figure 1). You can see all 32 currently implemented security controls in this standard, their severities, and their status for this account and region. Use search and filters to narrow down the controls by status, severity, title, or related requirement.
Figure 1: PCI DSS v3.2.1 standard results page
Select the name of the control to review detailed information about it. This action will take you to the control’s detail page (Figure 2), which gives you related findings.
Figure 2: Detailed control information
If a specific control is not relevant for you, you can disable the control by selecting Disable and providing a Reason for disabling. (See Disabling Individual Compliance Controls for instructions).
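If you prefer to script the enablement step instead of clicking through the console, the following is a minimal boto3 sketch. It is not part of the original walkthrough, and it assumes default credentials and the us-east-1 Region; it looks up the standard’s ARN rather than hard-coding it.

```python
import boto3

# Assumes credentials are configured and Security Hub is already enabled.
securityhub = boto3.client("securityhub", region_name="us-east-1")

# Find the PCI DSS standard ARN instead of hard-coding it.
standards = securityhub.describe_standards()["Standards"]
pci_arn = next(s["StandardsArn"] for s in standards if "pci-dss" in s["StandardsArn"])

# Subscribe the current account/region to the PCI DSS v3.2.1 standard.
securityhub.batch_enable_standards(
    StandardsSubscriptionRequests=[{"StandardsArn": pci_arn}]
)

# Confirm the subscription status (READY once the checks start running).
for sub in securityhub.get_enabled_standards()["StandardsSubscriptions"]:
    print(sub["StandardsArn"], sub["StandardsStatus"])
```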
How to interpret and improve your “Security score”
After enabling the PCI DSS v3.2.1 standard in Security Hub, you will notice a Security score appear for the standard itself, and for your account overall. These scores range between 0% and 100%.
Figure 3: Security score for PCI DSS standard (left) and overall (right)
The PCI DSS standard’s Security score represents the proportion of passed PCI DSS controls over enabled PCI DSS controls. The score is displayed as a percentage. Similarly, the overall Security score represents the proportion of passed controls over enabled controls, including controls from every enabled Security Hub standard, displayed as a percentage.
Your aim should be to pass all enabled security checks to reach a score of 100%. Reaching a 100% security score for the AWS Security Hub PCI DSS standard will help you prepare for a PCI DSS assessment. The PCI DSS Compliance Standard in Security Hub is designed to help you with your ongoing PCI DSS security activities.
An important note: the controls cannot verify whether your systems are compliant with the PCI DSS standard. They can neither replace internal efforts nor guarantee that you will pass a PCI DSS assessment.
Remediating failed security checks
To remediate a failed control, you need to remediate every failed finding for that control.
To prioritize remediation, we recommend filtering by Failed controls and then remediating issues starting with critical severity and ending with low severity controls.
Identify a control you want to remediate and visit the control detail page.
Follow the Remediation instructions link, and then follow the step-by-step remediation instructions, applying them for every failed finding.
Figure 4: The control detail page, with a link to the remediation instructions
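If you want to triage failed findings at scale rather than one control at a time, a boto3 sketch like the one below can help. It is not from the original post; the GeneratorId prefix of "pci-dss" is an assumption about how these findings are labeled, so verify it against a sample finding in your account (filtering on ComplianceStatus alone also works).

```python
import boto3

securityhub = boto3.client("securityhub", region_name="us-east-1")  # assumed region

# Filter for active, failed findings that appear to come from the PCI DSS standard.
filters = {
    "GeneratorId": [{"Value": "pci-dss", "Comparison": "PREFIX"}],
    "ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}],
    "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
}

# Group finding titles by severity so remediation can start with the worst.
by_severity = {}
paginator = securityhub.get_paginator("get_findings")
for page in paginator.paginate(Filters=filters):
    for finding in page["Findings"]:
        label = finding["Severity"].get("Label", "UNKNOWN")
        by_severity.setdefault(label, set()).add(finding["Title"])

for label in ["CRITICAL", "HIGH", "MEDIUM", "LOW", "INFORMATIONAL", "UNKNOWN"]:
    for title in sorted(by_severity.get(label, [])):
        print(label, "-", title)
```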
How to interpret “Related requirements”
Every control displays Related requirements in the control card and in the control’s detail page. For PCI DSS, the Related requirements show which PCI DSS requirements are related to the Security Hub PCI DSS control. A single AWS control might relate to multiple PCI DSS requirements.
Figure 5: Related requirements in the control detail page
The user guide lists the related PCI DSS requirements and explains how the specific Security Hub PCI DSS control is related to the requirement.
For example, the AWS Config rule cmk-backing-key-rotation-enabled checks that key rotation is enabled for each customer master key (CMK), but it doesn’t check for CMKs that are using key material imported with the AWS Key Management Service (AWS KMS) BYOK mechanism. The related PCI DSS requirement that is mapped to this rule is PCI DSS 3.6.4 – “Cryptographic keys should be changed once they have reached the end of their cryptoperiod.” Although PCI DSS doesn’t specify the time frame for cryptoperiods, this rule is mapped because, if key rotation is enabled, rotation occurs annually by default with a customer-managed CMK.
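To illustrate what that Config rule evaluates, here is a hedged boto3 sketch (not part of the original post) that walks the account’s keys and reports rotation status for customer managed, AWS-generated CMKs, while calling out imported (BYOK) key material separately.

```python
import boto3

kms = boto3.client("kms")

# Report rotation status for customer managed CMKs, the population the rule checks.
paginator = kms.get_paginator("list_keys")
for page in paginator.paginate():
    for key in page["Keys"]:
        meta = kms.describe_key(KeyId=key["KeyId"])["KeyMetadata"]
        if meta["KeyManager"] != "CUSTOMER":
            continue  # skip AWS managed keys
        if meta["Origin"] != "AWS_KMS":
            # Imported (BYOK) key material is not rotated automatically, which is
            # why the Config rule does not evaluate these keys.
            print(meta["KeyId"], "imported key material - manage rotation manually")
            continue
        try:
            status = kms.get_key_rotation_status(KeyId=meta["KeyId"])
            print(meta["KeyId"], "rotation enabled:", status["KeyRotationEnabled"])
        except kms.exceptions.UnsupportedOperationException:
            print(meta["KeyId"], "rotation not supported for this key type")
```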
Conclusion
The new AWS Security Hub PCI DSS v3.2.1 standard is fundamental for any company involved with storing, processing, or transmitting cardholder data. In this post, you learned how to enable the standard to begin proactively monitoring your AWS resources against the Security Hub PCI DSS controls. You also learned how to navigate the PCI DSS results within Security Hub. By frequently reviewing failed security checks, prioritizing their remediation, and aiming to achieve a 100% security score for PCI DSS within Security Hub, you’ll be better prepared for a PCI DSS assessment.
If you have feedback about this post, submit comments in the Comments section below. If you have questions, please start a new thread on the Security Hub forums.
King County voters will be able to use their name and birthdate to log in to a Web portal through the Internet browser on their phones, says Bryan Finney, the CEO of Democracy Live, the Seattle-based voting company providing the technology.
Once voters have completed their ballots, they must verify their submissions and then submit a signature on the touch screen of their device.
Finney says election officials in Washington are adept at signature verification because the state votes entirely by mail. That will be the way people are caught if they log in to the system under false pretenses and try to vote as someone else.
The King County elections office plans to print out the ballots submitted electronically by voters whose signatures match and count the papers alongside the votes submitted through traditional routes.
While advocates say this creates an auditable paper trail, many security experts say that because the ballots cross the Internet before they are printed, any subsequent audits on them would be moot. If a cyberattack occurred, an audit could essentially require double-checking ballots that may already have been altered, says Buell.
Of course it’s not an auditable paper trail. There’s a reason why security experts use the phrase “voter-verifiable paper ballots.” A centralized printout of a received Internet message is not voter verifiable.
There are some good lessons in this article on financial fraud:
That’s how we got it so wrong. We were looking for incidental breaches of technical regulations, not systematic crime. And the thing is, that’s normal. The nature of fraud is that it works outside your field of vision, subverting the normal checks and balances so that the world changes while the picture stays the same. People in financial markets have been missing the wood for the trees for as long as there have been markets.
[…]
Trust — particularly between complete strangers, with no interactions beside relatively anonymous market transactions — is the basis of the modern industrial economy. And the story of the development of the modern economy is in large part the story of the invention and improvement of technologies and institutions for managing that trust.
And as industrial society develops, it becomes easier to be a victim. In The Wealth of Nations, Adam Smith described how prosperity derived from the division of labour — the 18 distinct operations that went into the manufacture of a pin, for example. While this was going on, the modern world also saw a growing division of trust. The more a society benefits from the division of labour in checking up on things, the further you can go into a con game before you realise that you’re in one.
[…]
Libor teaches us a valuable lesson about commercial fraud — that unlike other crimes, it has a problem of denial as well as one of detection. There are very few other criminal acts where the victim not only consents to the criminal act, but voluntarily transfers the money or valuable goods to the criminal. And the hierarchies, status distinctions and networks that make up a modern economy also create powerful psychological barriers against seeing fraud when it is happening. White-collar crime is partly defined by the kind of person who commits it: a person of high status in the community, the kind of person who is always given the benefit of the doubt.
[…]
Fraudsters don’t play on moral weaknesses, greed or fear; they play on weaknesses in the system of checks and balances — the audit processes that are meant to supplement an overall environment of trust. One point that comes up again and again when looking at famous and large-scale frauds is that, in many cases, everything could have been brought to a halt at a very early stage if anyone had taken care to confirm all the facts. But nobody does confirm all the facts. There are just too bloody many of them. Even after the financial rubble has settled and the arrests been made, this is a huge problem.
This blog post was co-authored by Ujjwal Ratan, a senior AI/ML solutions architect on the global life sciences team.
Healthcare data is generated at an ever-increasing rate and is predicted to reach 35 zettabytes by 2020. Being able to cost-effectively and securely manage this data whether for patient care, research or legal reasons is increasingly important for healthcare providers.
Healthcare providers must have the ability to ingest, store and protect large volumes of data including clinical, genomic, device, financial, supply chain, and claims. AWS is well-suited to this data deluge with a wide variety of ingestion, storage and security services (e.g. AWS Direct Connect, Amazon Kinesis Streams, Amazon S3, Amazon Macie) for customers to handle their healthcare data. In a recent Healthcare IT News article, healthcare thought-leader, John Halamka, noted, “I predict that five years from now none of us will have datacenters. We’re going to go out to the cloud to find EHRs, clinical decision support, analytics.”
I realize simply storing this data is challenging enough. Magnifying the problem is the fact that healthcare data is increasingly attractive to cyber attackers, making security a top priority. According to Mariya Yao in her Forbes column, it is estimated that individual medical records can be worth hundreds or even thousands of dollars on the black market.
In this first of a 2-part post, I will address the value that AWS can bring to customers for ingesting, storing, and protecting providers’ healthcare data. I will describe key components of any cloud-based healthcare workload and the services AWS provides to meet these requirements. In part 2 of this post, we will dive deep into the AWS services used for advanced analytics, artificial intelligence, and machine learning.
The data tsunami is upon us
So where is this data coming from? In addition to the ubiquitous electronic health record (EHR), the sources of this data include:
genomic sequencers
devices such as MRIs, x-rays and ultrasounds
sensors and wearables for patients
medical equipment telemetry
mobile applications
Additional sources of data come from non-clinical, operational systems such as:
human resources
finance
supply chain
claims and billing
Data from these sources can be structured (e.g., claims data) as well as unstructured (e.g., clinician notes). Some data comes across in streams such as that taken from patient monitors, while some comes in batch form. Still other data comes in near-real time such as HL7 messages. All of this data has retention policies dictating how long it must be stored. Much of this data is stored in perpetuity as many systems in use today have no purge mechanism. AWS has services to manage all these data types as well as their retention, security and access policies.
Imaging is a significant contributor to this data tsunami. Increasing demand for early-stage diagnoses along with aging populations drive increasing demand for images from CT, PET, MRI, ultrasound, digital pathology, X-ray and fluoroscopy. For example, a thin-slice CT image can be hundreds of megabytes. Increasing demand and strict retention policies make storage costly.
Due to the plummeting cost of gene sequencing, molecular diagnostics (including liquid biopsy) is a large contributor to this data deluge. Many predict that as the value of molecular testing becomes more identifiable, the reimbursement models will change and it will increasingly become the standard of care. According to the Washington Post article “Sequencing the Genome Creates so Much Data We Don’t Know What to do with It,”
“Some researchers predict that up to one billion people will have their genome sequenced by 2025 generating up to 40 exabytes of data per year.”
Although genomics is primarily used for oncology diagnostics today, it’s also used for other purposes, such as pharmacogenomics, which is used to understand how an individual will metabolize a medication.
Reference Architecture
It is increasingly challenging for the typical hospital, clinic or physician practice to securely store, process and manage this data without cloud adoption.
Amazon has a variety of ingestion techniques depending on the nature of the data, including size, frequency, and structure. AWS Snowball and AWS Snowmobile are appropriate for extremely large, secure data transfers, whether one time or episodic. AWS Glue is a fully managed ETL service for securely moving data from on-premises systems to AWS, and Amazon Kinesis can be used for ingesting streaming data.
Amazon S3, Amazon S3 IA, and Amazon Glacier are economical, data-storage services with a pay-as-you-go pricing model that expand (or shrink) with the customer’s requirements.
The above architecture has four distinct components: ingestion, storage, security, and analytics. In this post I will dive deeper into the first three components, namely ingestion, storage, and security. In part 2, I will look at how to use AWS analytics services to derive value from, and optimize, your healthcare data.
Ingestion
A typical provider data center will consist of many systems with varied datasets. AWS provides multiple tools and services to effectively and securely connect to these data sources and ingest data in various formats. Customers can choose from a range of services and use them in accordance with their use case.
For use cases involving one-time (or periodic), very large data migrations into AWS, customers can take advantage of AWS Snowball devices. These devices come in two sizes, 50 TB and 80 TB, and can be combined to create a petabyte-scale data transfer solution.
The devices are easy to connect and load, and they are shipped to AWS, avoiding the network bottlenecks associated with such large-scale data migrations. The devices are extremely secure, supporting 256-bit encryption, and come in a tamper-resistant enclosure. AWS Snowball imports data into Amazon S3, which can then interface with other AWS compute services to process that data in a scalable manner.
For use cases involving a need to store a portion of datasets on premises for active use and offload the rest to AWS, the AWS Storage Gateway service can be used. The service allows you to seamlessly integrate on-premises applications via standard storage protocols like iSCSI or NFS mounted on a gateway appliance. It supports a file interface, a volume interface, and a tape interface, which can be used for a range of use cases like disaster recovery, backup and archiving, cloud bursting, storage tiering, and migration.
The AWS Storage Gateway appliance can use the AWS Direct Connect service to establish a dedicated network connection from the on-premises data center to AWS.
Specific Industry Use Cases
By using the AWS proposed reference architecture for disaster recovery, healthcare providers can ensure their data assets are securely stored on the cloud and are easily accessible in the event of a disaster. The “AWS Disaster Recovery” whitepaper includes details on options available to customers based on their desired recovery time objective (RTO) and recovery point objective (RPO).
AWS is an ideal destination for offloading large volumes of less-frequently-accessed data. These datasets are rarely used in active compute operations but are exceedingly important to retain for reasons like compliance. By storing these datasets on AWS, customers can take advantage of the highly durable platform to securely store their data and also retrieve it easily when needed. For more details on how AWS enables customers to run backup and archival use cases on AWS, please refer to the following set of whitepapers.
A healthcare provider may have a variety of databases spread throughout the hospital system supporting critical applications such as EHR, PACS, finance, and many more. These datasets often need to be aggregated to derive information and calculate metrics to optimize business processes. AWS Glue is a fully managed Extract, Transform, and Load (ETL) service that can read data from a JDBC-enabled, on-premises database and transfer the datasets into AWS services like Amazon S3, Amazon Redshift, and Amazon RDS. This allows customers to create transformation workflows that integrate smaller datasets from multiple sources and aggregate them on AWS.
Healthcare providers deal with a variety of streaming datasets, which often have to be analyzed in near real time. These datasets come from a variety of sources such as sensors, messaging buses, and social media, and often do not adhere to an industry standard. The Amazon Kinesis suite of services, which includes Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics, is an ideal set of services for deriving value from streaming data.
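As a small illustration (not from the original post), the sketch below writes one JSON telemetry record to a Kinesis stream with boto3; the stream name and record fields are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical device telemetry record; real HL7 or device payloads vary widely.
record = {"device_id": "icu-monitor-42", "heart_rate": 78, "timestamp": "2018-01-15T10:22:31Z"}

kinesis.put_record(
    StreamName="patient-telemetry",           # hypothetical stream name
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["device_id"],          # keeps records per device ordered
)
```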
Example: Using AWS Glue to de-identify and ingest healthcare data into S3
Let’s consider a scenario in which a provider maintains patient records in a database they want to ingest into S3. The provider also wants to de-identify the data by stripping personally identifiable attributes and storing the non-identifiable information in an S3 bucket. This bucket is different from the one that contains identifiable information. Doing this allows the healthcare provider to separate sensitive information with more restrictions set up via S3 bucket policies.
To ingest records into S3, we create a Glue job that reads from the source database using a Glue connection. The connection is also used by a Glue crawler to populate the Glue Data Catalog with the schema of the source database. We will use a Glue development endpoint and an Apache Zeppelin notebook server on Amazon EC2 to develop and execute the job. A consolidated sketch of the following steps appears after Step 5.
Step 1: Import the necessary libraries and set up a Glue context, which is a wrapper around the Spark context.
Step 2: Create a dataframe from the source data. I call the dataframe “readmissionsdata”. Print its schema to verify the columns.
Step 3: Select the columns that contain identifiable information and store them in a new dataframe. Call the new dataframe “phi”.
Step 4: Store the non-PHI columns in a separate dataframe. Call this dataframe “nonphi”.
Step 5: Write the two dataframes into two separate S3 buckets.
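The original post showed these steps as notebook screenshots, which are not reproduced here. The following is a minimal consolidated sketch of Steps 1 through 5 using AWS Glue DynamicFrames (a close stand-in for the dataframes described above); the catalog database name, table name, column names, and bucket names are all hypothetical.

```python
import time

from awsglue.context import GlueContext
from awsglue.transforms import DropFields, SelectFields
from pyspark.context import SparkContext

# Step 1: wrap the Spark context in a Glue context.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Step 2: read the source table that the crawler cataloged (names are hypothetical).
readmissionsdata = glueContext.create_dynamic_frame.from_catalog(
    database="healthcare-db",
    table_name="patient_readmissions",
)
readmissionsdata.printSchema()

# Hypothetical list of identifiable columns.
phi_columns = ["patient_name", "date_of_birth", "address"]

# Step 3: keep only the identifiable (PHI) columns.
phi = SelectFields.apply(frame=readmissionsdata, paths=phi_columns)

# Step 4: drop the PHI columns to produce the de-identified dataset.
nonphi = DropFields.apply(frame=readmissionsdata, paths=phi_columns)

# Step 5: write each dataset to its own S3 bucket (bucket names are placeholders).
glueContext.write_dynamic_frame.from_options(
    frame=phi,
    connection_type="s3",
    connection_options={"path": "s3://example-phi-bucket/"},
    format="parquet",
)
glueContext.write_dynamic_frame.from_options(
    frame=nonphi,
    connection_type="s3",
    connection_options={"path": "s3://example-nonphi-bucket/"},
    format="parquet",
)
```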
Once successfully executed, the PHI and non-PHI attributes are stored in two separate files in two separate buckets that can be individually maintained.
Storage
In 2016, 327 healthcare providers reported a protected health information (PHI) breach, affecting 16.4 million patient records [1]. In 2017, 342 data breaches were reported, involving 3.2 million patient records [2].
To date, AWS has released 51 HIPAA-eligible services to help customers address security challenges and is in the process of making many more services HIPAA-eligible. These HIPAA-eligible services (along with all other AWS services) help customers build solutions that comply with HIPAA security and auditing requirements. A catalog of HIPAA-eligible services can be found at AWS HIPAA-eligible services. It is important to note that AWS manages physical and logical access controls for the AWS boundary. However, the overall security of your workloads is a shared responsibility, where you are responsible for controlling user access to content in your AWS accounts.
AWS storage services allow you to store data efficiently while maintaining high durability and scalability. By using Amazon S3 as the central storage layer, you can take advantage of the Amazon S3 storage management features to get operational metrics on your datasets and transition them between various storage classes to save costs. By tagging objects on Amazon S3, you can build a governance layer on Amazon S3 to grant role-based access to objects using AWS Identity and Access Management (IAM) and Amazon S3 bucket policies.
To learn more about the Amazon S3 storage management features, see the Amazon S3 documentation.
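As one hedged example of the storage management features described above (not part of the original post), the boto3 sketch below transitions aging objects to S3 Standard-IA and then to Glacier; the bucket name, prefix, and transition windows are placeholders to adapt to your retention policies.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; adjust the transition days to your retention rules.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-imaging-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-aging-studies",
                "Filter": {"Prefix": "radiology/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```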
Security
In the example above, we are storing the PHI information in a bucket named “phi.” Now, we want to protect this information to make sure it’s encrypted, that unauthorized access is prevented, and that all access requests to the data are logged.
Encryption: S3 provides settings to enable default encryption on a bucket. This ensures any object in the bucket is encrypted by default.
Logging: S3 provides object-level logging that can be used to capture all API calls to the object. The API calls are logged in AWS CloudTrail for easy access and consolidation. S3 also supports event notifications to proactively alert customers of read and write operations.
Access control: Customers can use S3 bucket policies and IAM policies to restrict access to the phi bucket. They can also enforce multi-factor authentication (MFA) on the bucket. For example, a bucket policy can deny any request to the phi bucket that is not authenticated with MFA:
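The policy itself appeared as an image in the original post and is not reproduced here. The boto3 sketch below shows one common way to apply default encryption and an MFA-enforcing bucket policy; the bucket name and statement ID are illustrative, and the deny-when-no-MFA pattern is a widely used approach rather than the exact policy from the post.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "phi"  # bucket name from the example above; substitute your own

# Turn on default encryption so every new object is encrypted at rest.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)

# Deny any request to the bucket that was not authenticated with MFA.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RequireMFAForPhiBucket",  # illustrative statement ID
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::" + bucket,
                "arn:aws:s3:::" + bucket + "/*",
            ],
            "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```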
In Part 1 of this blog, we detailed the ingestion, storage, security and management of healthcare data on AWS. Stay tuned for part two where we are going to dive deep into optimizing the data for analytics and machine learning.
Elections serve two purposes. The first, and obvious, purpose is to accurately choose the winner. But the second is equally important: to convince the loser. To the extent that an election system is not transparently and auditably accurate, it fails in that second purpose. Our election systems are failing, and we need to fix them.
Today, we conduct our elections on computers. Our registration lists are in computer databases. We vote on computerized voting machines. And our tabulation and reporting is done on computers. We do this for a lot of good reasons, but a side effect is that elections now have all the insecurities inherent in computers. The only way to reliably protect elections from both malice and accident is to use something that is not hackable or unreliable at scale; the best way to do that is to back up as much of the system as possible with paper.
Recently, there have been two graphic demonstrations of how bad our computerized voting system is. In 2007, the states of California and Ohio conducted audits of their electronic voting machines. Expert review teams found exploitable vulnerabilities in almost every component they examined. The researchers were able to undetectably alter vote tallies, erase audit logs, and load malware on to the systems. Some of their attacks could be implemented by a single individual with no greater access than a normal poll worker; others could be done remotely.
Last year, the Defcon hackers’ conference sponsored a Voting Village. Organizers collected 25 pieces of voting equipment, including voting machines and electronic poll books. By the end of the weekend, conference attendees had found ways to compromise every piece of test equipment: to load malicious software, compromise vote tallies and audit logs, or cause equipment to fail.
It’s important to understand that these were not well-funded nation-state attackers. These were not even academics who had been studying the problem for weeks. These were bored hackers, with no experience with voting machines, playing around between parties one weekend.
It shouldn’t be any surprise that voting equipment, including voting machines, voter registration databases, and vote tabulation systems, are that hackable. They’re computers — often ancient computers running operating systems no longer supported by the manufacturers — and they don’t have any magical security technology that the rest of the industry isn’t privy to. If anything, they’re less secure than the computers we generally use, because their manufacturers hide any flaws behind the proprietary nature of their equipment.
We’re not just worried about altering the vote. Sometimes causing widespread failures, or even just sowing mistrust in the system, is enough. And an election whose results are not trusted or believed is a failed election.
Voting systems have another requirement that makes security even harder to achieve: the requirement for a secret ballot. Because we have to securely separate the election-roll system that determines who can vote from the system that collects and tabulates the votes, we can’t use the security systems available to banking and other high-value applications.
We can securely bank online, but can’t securely vote online. If we could do away with anonymity — if everyone could check that their vote was counted correctly — then it would be easy to secure the vote. But that would lead to other problems. Before the US had the secret ballot, voter coercion and vote-buying were widespread.
We can’t, so we need to accept that our voting systems are insecure. We need an election system that is resilient to the threats. And for many parts of the system, that means paper.
Let’s start with the voter rolls. We know they’ve already been targeted. In 2016, someone changed the party affiliation of hundreds of voters before the Republican primary. That’s just one possibility. A well-executed attack that deletes, for example, one in five voters at random — or changes their addresses — would cause chaos on election day.
Yes, we need to shore up the security of these systems. We need better computer, network, and database security for the various state voter organizations. We also need to better secure the voter registration websites, with better design and better internet security. We need better security for the companies that build and sell all this equipment.
Multiple, unchangeable backups are essential. A record of every addition, deletion, and change needs to be stored on a separate system, on write-once media like a DVD. Copies of that DVD, or — even better — a paper printout of the voter rolls, should be available at every polling place on election day. We need to be ready for anything.
Next, the voting machines themselves. Security researchers agree that the gold standard is a voter-verified paper ballot. The easiest (and cheapest) way to achieve this is through optical-scan voting. Voters mark paper ballots by hand; they are fed into a machine and counted automatically. That paper ballot is saved, and serves as a final true record in a recount in case of problems. Touch-screen machines that print a paper ballot to drop in a ballot box can also work for voters with disabilities, as long as the ballot can be easily read and verified by the voter.
Finally, the tabulation and reporting systems. Here again we need more security in the process, but we must always use those paper ballots as checks on the computers. A manual, post-election, risk-limiting audit varies the number of ballots examined according to the margin of victory. Conducting this audit after every election, before the results are certified, gives us confidence that the election outcome is correct, even if the voting machines and tabulation computers have been tampered with. Additionally, we need better coordination and communications when incidents occur.
It’s vital to agree on these procedures and policies before an election. Before the fact, when anyone can win and no one knows whose votes might be changed, it’s easy to agree on strong security. But after the vote, someone is the presumptive winner — and then everything changes. Half of the country wants the result to stand, and half wants it reversed. At that point, it’s too late to agree on anything.
The politicians running in the election shouldn’t have to argue their challenges in court. Getting elections right is in the interest of all citizens. Many countries have independent election commissions that are charged with conducting elections and ensuring their security. We don’t do that in the US.
Instead, we have representatives from each of our two parties in the room, keeping an eye on each other. That provided acceptable security against 20th-century threats, but is totally inadequate to secure our elections in the 21st century. And the belief that the diversity of voting systems in the US provides a measure of security is a dangerous myth, because a few districts can be decisive and there are so few voting-machine vendors.
We can do better. In 2017, the Department of Homeland Security declared elections to be critical infrastructure, allowing the department to focus on securing them. On 23 March, Congress allocated $380m to states to upgrade election security.
These are good starts, but don’t go nearly far enough. The constitution delegates elections to the states but allows Congress to “make or alter such Regulations”. In 1845, Congress set a nationwide election day. Today, we need it to set uniform and strict election standards.
Amazon EMR empowers many customers to build big data processing applications quickly and cost-effectively, using popular distributed frameworks such as Apache Spark, Apache HBase, Presto, and Apache Flink. For organizations that are crafting their analytical applications on Amazon EMR, there is a growing need to keep their data assets organized in an automated fashion. Because datasets tend to grow exponentially, using cataloging tools is essential to automating data discovery and organizing data assets.
AWS Glue Data Catalog provides this essential capability, allowing you to automatically discover and catalog metadata about your data stores in a central repository. Since Amazon EMR 5.8.0, customers have been using the AWS Glue Data Catalog as a metadata store for Apache Hive and Spark SQL applications that are running on Amazon EMR. Starting with Amazon EMR 5.10.0, you can catalog datasets using AWS Glue and run queries using Presto on Amazon EMR from the Hue (Hadoop User Experience) and Apache Zeppelin UIs.
You might wonder what scenarios warrant using Presto running on Amazon EMR and when to choose Amazon Athena (which uses Presto as the query engine under the hood). Both are excellent tools for querying massive amounts of data, but they address different needs and use cases.
Amazon Athena provides the easiest way to run interactive queries for data in Amazon S3 without needing to set up or manage any servers. Presto running on Amazon EMR gives you much more flexibility in how you configure and run your queries, providing the ability to federate to other data sources if needed. For example, you might have a use case that requires LDAP authentication for clients such as the Presto CLI or JDBC/ODBC drivers. Or you might have a workflow where you need to join data between different systems like MySQL/Amazon Redshift/Apache Cassandra and Hive. In these examples, Presto running on Amazon EMR is the right tool to use because it can be configured to enable LDAP authentication in addition to the desired database connectors at cluster launch.
Now, let’s look at how metadata management for Presto works with AWS Glue.
Using an AWS Glue crawler to discover datasets
The AWS Glue Data Catalog is a reference to the location, schema, and runtime metrics of your datasets. To create this reference metadata, AWS Glue needs to crawl your datasets. In this exercise, we use an AWS Glue crawler to populate tables in the Data Catalog for the NYC taxi rides dataset.
The following are the steps for adding a crawler:
Sign in to the AWS Management Console, and open the AWS Glue console. In the navigation pane, choose Crawlers. Then choose Add crawler.
On the Add a data store page, specify the location of the NYC taxi rides dataset.
In the next step, choose an existing IAM role if one is available, or create a new role. Then choose Next.
On the scheduling page, for Frequency, choose Run on demand.
On the Configure the crawler’s output page, choose Add database. Specify blog-db as the database name. (You can specify a name of your choice, but be sure to choose the correct database name when running queries.)
Follow the remaining steps using the default values to create a crawler.
When the crawler displays the Ready state, navigate to Databases and choose blog-db from the list of databases (or search for it by specifying it as a filter). Then choose Tables. You should see the three tables created by the crawler.
(Optional) The discovered data is classified as CSV files. You can optionally convert this data into Parquet format for better response times on your queries.
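If you prefer to script the crawler setup rather than follow the console steps above, a minimal boto3 sketch follows; it is not part of the original walkthrough, and the crawler name, IAM role, and S3 path are hypothetical placeholders.

```python
import time

import boto3

glue = boto3.client("glue")

# Role and S3 path are placeholders; point them at your own role and dataset copy.
glue.create_crawler(
    Name="nyc-taxi-crawler",
    Role="AWSGlueServiceRole-blog",
    DatabaseName="blog-db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/nyc-taxi-rides/"}]},
)
glue.start_crawler(Name="nyc-taxi-crawler")

# Wait for the crawler to finish, then list the tables it created.
while glue.get_crawler(Name="nyc-taxi-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)

for table in glue.get_tables(DatabaseName="blog-db")["TableList"]:
    print(table["Name"])
```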
Launching an Amazon EMR cluster
With the dataset discovered and organized, we can now walk through different options for launching Presto on an Amazon EMR cluster to use the AWS Glue Data Catalog.
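One scripted option, not shown in the original post, is to launch the cluster with boto3 and enable the Glue Data Catalog for Presto through a configuration classification. The "presto-connector-hive" classification and the Glue property used below follow the EMR documentation for release 5.10.0 and later, but verify them for your release; the instance sizes, key pair, and roles are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

response = emr.run_job_flow(
    Name="presto-glue-demo",
    ReleaseLabel="emr-5.10.0",
    Applications=[{"Name": "Presto"}, {"Name": "Hadoop"}, {"Name": "Hue"}],
    Configurations=[
        {
            # Tells Presto on EMR to use the AWS Glue Data Catalog as its metastore.
            "Classification": "presto-connector-hive",
            "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
        }
    ],
    Instances={
        "MasterInstanceType": "m4.xlarge",   # placeholder sizing
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "my-key-pair",         # hypothetical key pair
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])
```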
After you’ve set up the Amazon EMR cluster with Presto, the AWS Glue Data Catalog is available through a default “hive” catalog. To change between the Hive and Glue metastores, you have to manually update hive.properties and restart the Presto server. Connect to the master node on your EMR cluster using SSH, and run the Presto CLI to start running queries interactively.
$ presto-cli --catalog hive
Begin with a simple query to sample a few rows:
presto> SELECT * FROM "blog-db".taxi limit 10;
The query shows a few sample rows as follows:
Query the average fare for trips at each hour of the day and for each day of the month on the Parquet version of the taxi dataset.
presto> SELECT EXTRACT (HOUR FROM pickup_datetime) AS hour, avg(fare_amount) AS average_fare FROM "blog-db".taxi_parquet GROUP BY 1 ORDER BY 1;
The following image shows the results:
More interestingly, you can compute the number of trips that gave tips in the 10 percent, 15 percent, or higher percentage range:
presto> -- Tip Percent Category
SELECT TipPrctCtgry
, COUNT (DISTINCT TripID) TripCt
FROM
(SELECT TripID
, (CASE
WHEN fare_prct < 0.7 THEN 'FL70'
WHEN fare_prct < 0.8 THEN 'FL80'
WHEN fare_prct < 0.9 THEN 'FL90'
ELSE 'FL100'
END) FarePrctCtgry
, (CASE
WHEN tip_prct < 0.1 THEN 'TL10'
WHEN tip_prct < 0.15 THEN 'TL15'
WHEN tip_prct < 0.2 THEN 'TL20'
ELSE 'TG20'
END) TipPrctCtgry
FROM
(SELECT TripID
, (fare_amount / total_amount) as fare_prct
, (extra / total_amount) as extra_prct
, (tip_amount / total_amount) as tip_prct
, (mta_tax / total_amount) as mta_taxprct
, (tolls_amount / total_amount) as tolls_prct
, (improvement_surcharge / total_amount) as imprv_suchrgprct
, total_amount
FROM
(SELECT *
, (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) as TripID
from "blog-db”.taxi_parquet
WHERE total_amount > 0
) as t
) as t
) ct
GROUP BY TipPrctCtgry;
The results are as follows:
While the preceding query is running, navigate to the web interface for Presto on Amazon EMR at http://master-public-dns-name:8889/. Here you can look into the query metrics, such as active worker nodes, number of rows read per second, reserved memory, and parallelism.
Running queries in the Presto Editor on Hue
If you installed Hue with your Amazon EMR launch, you can also run queries on Hue’s Presto Editor. On the Amazon EMR Cluster console, choose Enable Web Connection, and follow the instructions to access the web interfaces for Hue and Zeppelin.
After the web connection is enabled, choose the Hue link to open the web interface. At the login screen, if you are the administrator logging in for the first time, type a user name and password to create your Hue superuser account, and then choose Create account. Otherwise, type your user name and password (or the credentials provided by your administrator) and log in.
Choose the Presto Editor from the menu. You can run Presto queries against your tables in the AWS Glue Data Catalog.
Conclusion
Having a shared data catalog for applications on Amazon EMR alleviates a myriad of data-related challenges that organizations face today—including discovery, governance, auditability, and collaboration. In this post, we explored how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR. Go ahead, give this a try, and share your experience with us!
Radhika Ravirala is a Solutions Architect at Amazon Web Services where she helps customers craft distributed big data applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. She holds an M.S. in computer science from San Jose State University.
The Amazon RDS team launched nearly 80 features in 2017. Some of them were covered in this blog, others on the AWS Database Blog, and the rest in What’s New or Forum posts. To wrap up my week, I thought it would be worthwhile to give you an organized recap. So here we go!
Take the AWS Certified Security – Specialty beta exam for the chance to be among the first to hold this new AWS Certification. This beta exam allows experienced cloud security professionals to demonstrate and validate their expertise. Register today – this beta exam will only be available from January 15 to March 2!
About the exam
This beta exam validates that the successful candidate can effectively demonstrate knowledge of how to secure the AWS platform. The exam covers incident response, logging and monitoring, infrastructure security, identity and access management, and data protection.
The exam validates:
Familiarity with regional- and country-specific security and compliance regulations and meta issues that these regulations embody.
An understanding of specialized data classifications and AWS data protection mechanisms.
An understanding of data encryption methods and AWS mechanisms to implement them.
An understanding of secure Internet protocols and AWS mechanisms to implement them.
A working knowledge of AWS security services and features of services to provide a secure production environment.
Competency gained from two or more years of production deployment experience using AWS security services and features.
Ability to make tradeoff decisions with regard to cost, security, and deployment complexity given a set of application requirements.
The beta is open to anyone who currently holds an Associate or Cloud Practitioner certification. We recommend candidates have five years of IT security experience designing and implementing security solutions, and at least two years of hands-on experience securing AWS workloads.
How to prepare
We have training and other resources to help you prepare for the beta exam:
AWS Security Fundamentals (digital, 3 hours) This course introduces you to fundamental cloud computing and AWS security concepts, including AWS access control and management, governance, logging, and encryption methods. It also covers security-related compliance protocols and risk management strategies, as well as procedures related to auditing your AWS security infrastructure.
Security Operations on AWS (classroom, 3 days) This course demonstrates how to efficiently use AWS security services to stay secure and compliant in the AWS Cloud. The course focuses on the AWS-recommended security best practices that you can implement to enhance the security of your data and systems in the cloud. The course highlights the security features of AWS key services including compute, storage, networking, and database services.
If you are an experienced cloud security professional, you can demonstrate and validate your expertise with the new AWS Certified Security – Specialty beta exam. This exam allows you to demonstrate your knowledge of incident response, logging and monitoring, infrastructure security, identity and access management, and data protection. Register today – this beta exam will be available only from January 15 to March 2, 2018.
By taking this exam, you can validate your:
Familiarity with region-specific and country-specific security and compliance regulations and meta issues that these regulations include.
Understanding of data encryption methods and secure internet protocols, and the AWS mechanisms to implement them.
Working knowledge of AWS security services to provide a secure production environment.
Ability to make trade-off decisions with regard to cost, security, and deployment complexity when given a set of application requirements.
See the full list of security knowledge you can validate by taking this beta exam.
Who is eligible?
The beta exam is open to anyone who currently holds an AWS Associate or Cloud Practitioner certification. We recommend candidates have five years of IT security experience designing and implementing security solutions, and at least two years of hands-on experience securing AWS workloads.
AWS Security Fundamentals (digital, 3 hours) This digital course introduces you to fundamental cloud computing and AWS security concepts, including AWS access control and management, governance, logging, and encryption methods. It also covers security-related compliance protocols and risk management strategies, as well as procedures related to auditing your AWS security infrastructure.
Security Operations on AWS (classroom, 3 days) This instructor-led course demonstrates how to efficiently use AWS security services to help stay secure and compliant in the AWS Cloud. The course focuses on the AWS-recommended security best practices that you can implement to enhance the security of your AWS resources. The course highlights the security features of AWS compute, storage, networking, and database services.
If you have questions about this new beta exam, contact us.
Many enterprises use Microsoft Active Directory to manage users, groups, and computers in a network. And one question comes up frequently: How can Active Directory users access big data workloads running on Amazon EMR with the same single sign-on (SSO) experience they have when accessing resources in the Active Directory network?
This post walks you through the process of using AWS CloudFormation to set up a cross-realm trust and extend authentication from an Active Directory network into an Amazon EMR cluster with Kerberos enabled. By establishing a cross-realm trust, Active Directory users can use their Active Directory credentials to access an Amazon EMR cluster and run jobs as themselves.
Walkthrough overview
In this example, you build a solution that allows Active Directory users to seamlessly access Amazon EMR clusters and run big data jobs. Here’s what you need before setting up this solution:
A possible limit increase for your account (Note: Usually a limit increase will not be necessary. See the AWS Service Limits documentation if you encounter a limit error while building the solution.)
To make it easier for you to get started, I created AWS CloudFormation templates that automatically configure and deploy the solution for you. The following steps and resources are involved in setting up the solution:
Note: If you want to manually create and configure the components for this solution without using AWS CloudFormation, refer to the Amazon EMR cross-realm documentation. IMPORTANT: The AWS CloudFormation templates used in this post are designed to work only in the us-east-1 (N. Virginia) Region. They are not intended for production use without modification.
Single-step solution deployment
If you don’t want to set up each component individually, you can use the single-step AWS CloudFormation template. The single-step template is a master template that uses nested stacks (additional templates) to launch and configure all the resources for the solution in one go.
To deploy the single-step template into your account, choose Launch Stack:
This takes you to the Create stack wizard in the AWS CloudFormation console. The template is launched in the US East (N. Virginia) Region by default. Do not change to a different Region because the template is designed to work only in us-east-1 (N. Virginia).
On the Select Template page, keep the default URL for the AWS CloudFormation template, and then choose Next.
On the Specify Details page, review the parameters for the template. Provide values for the parameters that require input (for more information, see the parameters table that follows).
The following parameters are available in this template.
Domain Controller name (default: DC1): NetBIOS (hostname) name of the Active Directory server. This name can be up to 15 characters long.
Active Directory domain (default: example.com): Fully qualified domain name (FQDN) of the forest root domain (for example, example.com).
Domain NetBIOS name (default: EXAMPLE): NetBIOS name of the domain for users of earlier versions of Windows. This name can be up to 15 characters long.
Domain admin user (default: CrossRealmAdmin): User name for the account that is added as domain administrator. This account is separate from the default administrator account.
Domain admin password (requires input): Password for the domain admin user. Must be at least eight characters including letters, numbers, and symbols.
Key pair name (requires input): Name of an existing key pair, which enables you to connect securely to your instance after it launches.
Instance type (default: m4.xlarge): Instance type for the domain controller and the Amazon EMR cluster.
Allowed IP address (default: 10.0.0.0/16): The client IP address range that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH.
EMR Kerberos realm (default: EC2.INTERNAL): Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region).
Trusted AD domain (default: EXAMPLE.COM): The Active Directory (AD) domain that you want to trust. This is the same as the “Active Directory domain.” However, it must use all uppercase letters (for example, EXAMPLE.COM).
Cross-realm trust password (requires input): Password that you want to use for your cross-realm trust.
Instance count (default: 2): The number of instances (core nodes) for the cluster.
EMR applications (default: Hadoop, Spark, Ganglia, Hive): Comma-separated list of applications to install on the cluster.
After you specify the template details, choose Next. On the Options page, choose Next again. On the Review page, select the I acknowledge that AWS CloudFormation might create IAM resources with custom names check box, and then choose Create.
It takes approximately 45 minutes for the deployment to complete. When the stack launch is complete, it will return outputs with information about the resources that were created. Note the outputs and skip to the Managing and testing the solution section. You can view the stack outputs on the AWS Management Console or by querying AWS CloudFormation programmatically, as sketched below.
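The CLI command from the original post isn’t reproduced here; as one alternative, the following boto3 sketch retrieves the same stack outputs. The stack name is a placeholder for whatever you chose in the Create stack wizard.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# "EMRCrossRealmStack" is a placeholder; use the stack name you chose in the wizard.
stack = cfn.describe_stacks(StackName="EMRCrossRealmStack")["Stacks"][0]
for output in stack.get("Outputs", []):
    print(output["OutputKey"], "=", output["OutputValue"])
```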
This section describes how to use AWS CloudFormation templates to perform each step separately in the solution.
Create and configure an Amazon VPC
In order for you to establish a cross-realm trust between an Amazon EMR Kerberos realm and an Active Directory domain, your Amazon VPC must meet the following requirements:
The subnet used for the Amazon EMR cluster must have a CIDR block of fewer than nine digits (for example, 10.0.1.0/24).
Both DNS resolution and DNS hostnames must be enabled (set to “yes”); see the AWS CLI sketch after this list for one way to check and enable these attributes.
The Active Directory domain controller must be the DNS server for instances in the Amazon VPC (this is configured in the next step).
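If you are preparing a VPC yourself, a minimal AWS CLI sketch for checking and enabling the two DNS attributes looks like the following (the VPC ID is a placeholder, and each attribute must be modified in a separate call):
aws ec2 describe-vpc-attribute --vpc-id vpc-xxxxxxxx --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-xxxxxxxx --attribute enableDnsHostnames
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-support "{\"Value\":true}"
aws ec2 modify-vpc-attribute --vpc-id vpc-xxxxxxxx --enable-dns-hostnames "{\"Value\":true}"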
To use the AWS CloudFormation template to create and configure an Amazon VPC with the prerequisites listed previously, choose Launch Stack:
Note: If you want to create the VPC manually (without using AWS CloudFormation), see Set Up the VPC and Subnet in the Amazon EMR documentation.
Launching this stack creates the following AWS resources:
Amazon VPC with CIDR block 10.0.0.0/16 (Name: CrossRealmVPC)
Internet Gateway (Name: CrossRealmGateway)
Public subnet with CIDR block 10.0.1.0/24 (Name: CrossRealmSubnet)
Security group allowing inbound access from the VPC’s subnets (Name tag: CrossRealmSecurityGroup)
When the stack launch is complete, it should return outputs similar to the following.
Key
Value example
Description
SubnetID
subnet-xxxxxxxx
The subnet for the Active Directory domain controller and the EMR cluster.
SecurityGroup
sg-xxxxxxxx
The security group for the Active Directory domain controller.
VPCID
vpc-xxxxxxxx
The Active Directory domain controller and EMR cluster will be launched on this VPC.
Note the outputs because they are used in the next step. You can view the stack outputs on the AWS Management Console or by using the same aws cloudformation describe-stacks command shown earlier, with this stack’s name.
Launch and configure an Active Directory domain controller
In this step, you use an AWS CloudFormation template to automatically launch and configure a new Active Directory domain controller and cross-realm trust.
Note: There are various ways to install and configure an Active Directory domain controller. For details on manually launching and installing a domain controller without AWS CloudFormation, see Step 2: Launch and Install the AD Domain Controller in the Amazon EMR documentation.
In addition to launching and configuring an Active Directory domain controller and cross-realm trust, this AWS CloudFormation template also sets the domain controller as the DNS server (name server) for your Amazon VPC. In other words, the template creates a new DHCP options set for the VPC where it’s deployed and sets the private IP address of the domain controller as the name server in that options set.
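For reference, a rough AWS CLI equivalent of that DNS change looks like the following (the template does this for you; the domain name, name server IP address, DHCP options set ID, and VPC ID are placeholders):
aws ec2 create-dhcp-options --dhcp-configurations "Key=domain-name,Values=example.com" "Key=domain-name-servers,Values=10.0.1.10"
aws ec2 associate-dhcp-options --dhcp-options-id dopt-xxxxxxxx --vpc-id vpc-xxxxxxxx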
IMPORTANT: You should not use this template on a production VPC with existing resources like Amazon EC2 instances. When you launch this stack, make sure that you use the new environment and resources (Amazon VPC, subnet, and security group) that were created in the Create and configure an Amazon VPC step.
To launch this stack, choose Launch Stack:
The following table contains information about the parameters available in this template. Review the parameters for the template and provide values for those that require input.
Parameter
Default
Description
VPC ID
Requires input
Launch the domain controller on this VPC (for example, use the VPC created in the Create and configure an Amazon VPC step).
Subnet ID
Requires input
Subnet used for the domain controller (for example, use the subnet created in the Create and configure an Amazon VPC step).
Security group ID
Requires input
Security group (SG) for the domain controller (for example, use the SG created in the Create and configure an Amazon VPC step).
Domain Controller name
DC1
NetBIOS name of the Active Directory server (up to 15 characters).
Active Directory domain
example.com
Fully qualified domain name (FQDN) of the forest root domain (for example, example.com).
Domain NetBIOS name
EXAMPLE
NetBIOS name of the domain for users of earlier versions of Windows. This name can be up to 15 characters long.
Domain admin user
CrossRealmAdmin
User name for the account that is added as domain administrator. This account is separate from the default administrator account.
Domain admin password
Requires input
Password for the domain admin user. Must be at least eight characters including letters, numbers, and symbols.
Key pair name
Requires input
Name of an existing EC2 key pair to enable access to the domain controller instance.
Instance type
m4.xlarge
Instance type for the domain controller.
EMR Kerberos realm
EC2.INTERNAL
Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region).
Cross-realm trust password
Requires input
Password that you want to use for your cross-realm trust.
It takes 25–30 minutes for this stack to be created. When it’s complete, note the stack’s outputs, and then move to the next step: Create a security configuration and launch an Amazon EMR cluster with Kerberos enabled.
Create a security configuration and launch an Amazon EMR cluster with Kerberos enabled
To launch a kerberized Amazon EMR cluster, you first need to create a security configuration containing the cross-realm trust configuration. You then specify cluster-specific Kerberos attributes when launching the cluster.
In this step, you use AWS CloudFormation to launch and configure a kerberized Amazon EMR cluster with a cross-realm trust. If you want to manually launch and configure a cluster with Kerberos enabled, see Step 6: Launch a Kerberized EMR Cluster in the Amazon EMR documentation.
Note: At the time of this writing, AWS CloudFormation does not yet support launching Amazon EMR clusters with Kerberos authentication enabled. To overcome this limitation, I created a template that uses an AWS Lambda-backed custom resource to launch and configure the Amazon EMR cluster with Kerberos enabled. If you use this template, there’s nothing else that you need to do. Just keep in mind that the template creates and invokes an AWS Lambda function (custom resource) to launch the cluster.
To create a cross-realm trust security configuration and launch a kerberized Amazon EMR cluster using AWS CloudFormation, choose Launch Stack:
The following table lists and describes the template parameters for deploying a kerberized Amazon EMR cluster and configuring a cross-realm trust.
Parameter
Default
Description
Active Directory domain
example.com
The Active Directory domain that you want to establish the cross-realm trust with.
Domain admin user (joiner user)
CrossRealmAdmin
The user name of an Active Directory domain user with privileges to join domains/computers to the Active Directory domain (joiner user).
Domain admin password
Requires input
Password of the joiner user.
Cross-realm trust password
Requires input
Password of your cross-realm trust.
EC2 key pair name
Requires input
Name of an existing key pair, which enables you to connect securely to your cluster after it launches.
Subnet ID
Requires input
Subnet that you want to use for your Amazon EMR cluster (for example, choose the subnet created in the Create and configure an Amazon VPC step).
Security group ID
Requires input
Security group that you want to use for your Amazon EMR cluster (for example, choose the security group created in the Create and configure an Amazon VPC step).
Instance type
m4.xlarge
The instance type that you want to use for the cluster nodes.
Instance count
2
The number of instances (core nodes) for the cluster.
Allowed IP address
10.0.0.0/16
The client IP address range that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH.
EMR applications
Hadoop, Spark, Ganglia, Hive
Comma-separated list of the applications that you want installed on the cluster.
EMR Kerberos realm
EC2.INTERNAL
Cluster’s Kerberos realm name. By default, the realm name is derived from the cluster’s VPC domain name in uppercase letters (for example, EC2.INTERNAL is the default VPC domain name in the us-east-1 Region).
Trusted AD domain
EXAMPLE.COM
The Active Directory domain that you want to trust. This is the same as the “Active Directory domain” parameter. However, it must use all uppercase letters (for example, EXAMPLE.COM).
It takes 10–15 minutes for this stack to be created. When it’s complete, note the stack’s outputs, and then move to the next section: Managing and testing the solution.
Managing and testing the solution
Now that you’ve configured and built the solution, it’s time to test it by connecting to a cluster using Active Directory credentials.
SSH to a cluster using Active Directory credentials (single sign-on)
After you launch a kerberized Amazon EMR cluster, if you used the AWS CloudFormation templates and added your client IP range to the Allowed IP address parameter, you should be able to connect to the cluster using an SSH client and your Active Directory user credentials. If you have trouble connecting to the cluster using SSH, check the cluster’s security group to make sure that it allows inbound SSH connection (TCP port 22) from your client’s IP address (source).
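If you need to open SSH access from your client IP address, you can do so in the console or with a call similar to the following (the security group ID and CIDR range are placeholders):
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 22 --cidr 203.0.113.5/32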
The following steps assume that you’re using a client such as OpenSSH. If you’re using a different SSH application (for example, PuTTY), consult the application-specific documentation.
Note: Because the cluster was launched with a cross-realm trust configuration, you don’t need to use a private key (.pem file) when you connect to it as a domain user using SSH.
To connect to your Amazon EMR cluster as an Active Directory user using SSH, run the following command. Replace ad_user with the domain admin user that you created while setting up the domain controller and replace master_node_URL with the cluster’s URL (see the stack’s outputs to find this information):
$ ssh -l <ad_user> <master_node_URL>
If your SSH client is configured to use a key as the preferred authentication method, the login might fail. If that’s the case, you can add the following options to your SSH command to force the SSH connection to use password authentication:
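For example, with OpenSSH you can force password authentication for a single connection as follows (same placeholders as the earlier command):
$ ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no -l <ad_user> <master_node_URL>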
After a domain user connects to the cluster using SSH, if this is the first time that the user is connecting, a local home directory is created for that user. In addition, if you used the create-hdfs-home-ba.sh bootstrap action when launching the cluster (done by default if you used the AWS CloudFormation template to launch a kerberized cluster), an HDFS user home directory is also created automatically.
Note: If you manually launched the cluster and did not use the create-hdfs-home-ba.sh bootstrap action, then you’ll need to manually create HDFS user home directories for your users.
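A minimal sketch for creating a home directory manually from the master node follows (the user name is a placeholder; on a kerberized cluster, the hdfs superuser must hold a valid Kerberos ticket before these commands will succeed):
$ sudo -u hdfs hdfs dfs -mkdir -p /user/<ad_user>
$ sudo -u hdfs hdfs dfs -chown <ad_user>:<ad_user> /user/<ad_user>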
When you connect to the cluster using SSH for the first time (as a domain user), you should see the following messages if the HDFS home directory for your domain user was successfully created:
Running jobs on a kerberized Amazon EMR cluster
To run a job on a kerberized cluster, the user submitting the job must first be authenticated. If you followed the previous section to connect to your cluster as an Active Directory user using SSH, the user should be authenticated automatically.
If running the klist command returns a “No credentials cache found” message, it means that the user is not authenticated (the user doesn’t have a Kerberos ticket). You can re-authenticate a user at any time by running the following command (be sure to use all uppercase letters for the Active Directory domain):
$ kinit <username>@<AD_DOMAIN>
When the user is authenticated, they can submit jobs just like they would on a non-kerberized cluster.
Auditing jobs
Another advantage that Kerberos can provide is that you can easily tell which user ran a particular job. For example, connect (using SSH) to a kerberized cluster with an Active Directory user, and submit the SparkPi sample application:
$ spark-example SparkPi
After running the SparkPi application, go to the Amazon EMR console and choose your cluster. Then choose the Application history tab. There you can see information about the application, including the user that submitted the job:
Common issues
Although it would be hard to cover every possible Kerberos issue, this section covers some of the more common issues that might occur and ways to fix them.
Issue 1: You can successfully connect and get authenticated on a cluster. However, whenever you try to run a job, it fails with an error similar to the following:
Solution: Make sure that an HDFS home directory for the user was created and that it has the right permissions.
Issue 2: You can successfully connect to the cluster, but you can’t run any Hadoop or HDFS commands.
Solution: Use the klist command to confirm whether the user is authenticated and has a valid Kerberos ticket. Use the kinit command to re-authenticate a user.
Issue 3: You can’t connect (using SSH) to the cluster using Active Directory user credentials, but you can manually authenticate the user with kinit.
Solution: Make sure that the Active Directory domain controller is the DNS server (name server) for the cluster nodes.
Cleaning up
After completing and testing this solution, remember to clean up the resources. If you used the AWS CloudFormation templates to create the resources, then use the AWS CloudFormation console or AWS CLI/SDK to delete the stacks. Deleting a stack also deletes the resources created by that stack.
If one of your stacks does not delete, make sure that there are no dependencies on the resources created by that stack. For example, if you deployed an Amazon VPC using AWS CloudFormation and then deployed a domain controller into that VPC using a different AWS CloudFormation stack, you must first delete the domain controller stack before the VPC stack can be deleted.
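For example, you can delete a stack with the AWS CLI by using a command like the following (the stack name is a placeholder for the name you used), working backward from the most recently created stack:
aws cloudformation delete-stack --stack-name emr-kerberos-cluster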
Summary
The ability to authenticate users and services with Kerberos not only allows you to secure your big data applications, but it also enables you to easily integrate Amazon EMR clusters with an Active Directory environment. This post showed how you can use Kerberos on Amazon EMR to create a single sign-on solution where Active Directory domain users can seamlessly access Amazon EMR clusters and run big data applications. We also showed how you can use AWS CloudFormation to automate the deployment of this solution.
The following list includes the ten most downloaded AWS security and compliance documents in 2017. Using this list, you can learn about what other AWS customers found most interesting about security and compliance last year.
AWS Security Best Practices – This guide is intended for customers who are designing the security infrastructure and configuration for applications running on AWS. The guide provides security best practices that will help you define your Information Security Management System (ISMS) and build a set of security policies and processes for your organization so that you can protect your data and assets in the AWS Cloud.
AWS: Overview of Security Processes – This whitepaper describes the physical and operational security processes for the AWS managed network and infrastructure, and helps answer questions such as, “How does AWS help me protect my data?”
Service Organization Controls (SOC) 3 Report – This publicly available report describes internal AWS security controls, availability, processing integrity, confidentiality, and privacy.
Introduction to AWS Security – This document provides an introduction to AWS’s approach to security, including the controls in the AWS environment, and some of the products and features that AWS makes available to customers to meet your security objectives.
AWS: Risk and Compliance – This whitepaper provides information to help customers integrate AWS into their existing control framework, including a basic approach for evaluating AWS controls and a description of AWS certifications, programs, reports, and third-party attestations.
Use AWS WAF to Mitigate OWASP’s Top 10 Web Application Vulnerabilities – AWS WAF is a web application firewall that helps you protect your websites and web applications against various attack vectors at the HTTP protocol level. This whitepaper outlines how you can use AWS WAF to mitigate the application vulnerabilities that are defined in the Open Web Application Security Project (OWASP) Top 10 list of most common categories of application security flaws.
Introduction to Auditing the Use of AWS – This whitepaper provides information, tools, and approaches for auditors to use when auditing the security of the AWS managed network and infrastructure.
AWS Security and Compliance: Quick Reference Guide – By using AWS, you inherit the many security controls that we operate, thus reducing the number of security controls that you need to maintain. Your own compliance and certification programs are strengthened while at the same time lowering your cost to maintain and run your specific security assurance requirements. Learn more in this quick reference guide.
We find that AWS customers often require that every query submitted to Presto running on Amazon EMR is logged. They want to track what query was submitted, when it was submitted and who submitted it.
In this blog post, we will demonstrate how to implement and install a Presto event listener for custom logging, debugging, and performance analysis of queries executed on an EMR cluster. An event listener is a Presto plugin that is invoked when an event such as query creation, query completion, or split completion occurs.
Presto also provides a system connector to access metrics and details about a running cluster. In particular, the system connector gets information about currently running and recently run queries by using the system.runtime.queries table. From the Presto command line interface (CLI), you can retrieve this data by running Select * from system.runtime.queries; and Select * from system.runtime.tasks;. This connector is available out of the box and exposes information and metrics through SQL.
In addition to providing custom logging and debugging, the Presto event listener lets you enhance the information provided by the system connector by capturing detailed query context, query statistics, and failure information during query creation, query completion, or split completion.
We will begin by providing a detailed walkthrough of the implementation of the Presto event listener in Java followed by its deployment on the EMR cluster.
Implementation
We use the Eclipse IDE to create a Maven Project, as shown below:
Once you have created the Maven Project, modify the pom.xml file to add the dependency for Presto, as shown following:
After you add the Presto dependency to the pom.xml file, create a Java package under the src/main/java folder. In our project, we have named the package com.amazonaws.QueryEventListener. You can choose the naming convention that best fits your organization. Within this package, create three Java files for the EventListener, the EventListenerFactory, and the EventListenerPlugin.
As the Presto website says: “EventListenerFactory is responsible for creating an EventListener instance. It also defines an EventListener name, which is used by the administrator in a Presto configuration. Implementations of EventListener implement methods for the event types they are interested in handling. The implementation of EventListener and EventListenerFactory must be wrapped as a plugin and installed on the Presto cluster.”
In our project, we have named these Java files QueryEventListener, QueryEventListenerFactory, and QueryEventListenerPlugin:
Now we write our code for the Java files.
QueryEventListener – QueryEventListener implements the Presto EventListener interface. It has a constructor that creates five rotating log files of 524 MB each. After creating QueryEventListener, we implement the query creation, query completion, and split completion methods and log the events relevant to us. You can choose to include more events based on your needs.
QueryEventListenerFactory – The QueryEventListenerFactory class implements the Presto EventListenerFactory interface. We implement the method getName, which provides a registered EventListenerFactory name to Presto. We also implement the create method, which creates an EventListener instance.
You can find the code for the QueryEventListenerFactory class in this Amazon repository.
QueryEventListenerPlugin – The QueryEventListenerPlugin class implements the Presto EventListenerPlugin interface. This class has a getEventListenerFactories method that returns an immutable list containing the EventListenerFactory. Basically, in this class we are wrapping QueryEventListener and QueryEventListenerFactory.
Finally, in our project we add a META-INF folder with a services subfolder, and in the services subfolder we create a file called com.facebook.presto.spi.Plugin:
As the Presto documentation describes: “Each plugin identifies an entry point: an implementation of the plugin interface. This class name is provided to Presto via the standard Java ServiceLoader interface: the classpath contains a resource file named com.facebook.presto.spi.Plugin in the META-INF/services directory”.
We add the name of our plugin class com.amazonaws.QueryEventListener.QueryEventListenerPlugin to the com.facebook.presto.spi.Plugin file, as shown below:
Next we will show you how to deploy the Presto plugin we created to Amazon EMR for custom logging.
Presto logging overview on Amazon EMR
By default, Presto produces three log files that capture the configuration properties and overall operational events of the components that make up Presto, and that log end-user access to the Presto UI.
On Amazon EMR, these log files are written to /var/log/presto, and the log files in this directory are pushed to Amazon S3. The custom query log files created by the plugin in this post are written to the same directory, so they end up in the same S3 location.
Steps to deploy Presto on Amazon EMR with custom logging
To deploy Presto on Amazon EMR with custom logging, you use a bootstrap action. The bootstrap script is available in this Amazon repository.
#!/bin/bash
# Detect whether this node is the EMR master by inspecting the instance metadata
# that Amazon EMR writes to /mnt/var/lib/info/instance.json.
IS_MASTER=true
if [ -f /mnt/var/lib/info/instance.json ]
then
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
IS_MASTER=true
else
IS_MASTER=false
fi
fi
# Copy the custom EventListener plugin JAR into the Presto plugin directory on every node.
sudo mkdir -p /usr/lib/presto/plugin/queryeventlistener
sudo /usr/bin/aws s3 cp s3://replace-with-your-bucket/QueryEventListener.jar /tmp
sudo cp /tmp/QueryEventListener.jar /usr/lib/presto/plugin/queryeventlistener/
# On the master node only, register the plugin with the Presto coordinator by creating
# event-listener.properties with the name returned by EventListenerFactory.getName().
if [ "$IS_MASTER" = true ]; then
sudo mkdir -p /usr/lib/presto/etc
sudo bash -c 'cat <<EOT >> /usr/lib/presto/etc/event-listener.properties
event-listener.name=event-listener
EOT'
fi
First, upload the JAR file created in the previous section to Amazon S3, and update the S3 location in the bootstrap script (s3://replace-with-your-bucket/QueryEventListener.jar) to point to the bucket where you placed the JAR.
After updating the bootstrap with the S3 location for your JAR, upload that bootstrap to your own bucket.
The bootstrap action copies the JAR file with the custom EventListener implementation to all nodes of the cluster. It also creates a file named event-listener.properties on the Amazon EMR master node, which configures the coordinator to enable the custom logging plugin by setting the event-listener.name property to event-listener. As the Presto documentation explains, Presto uses this property to find a registered EventListenerFactory based on the name returned by EventListenerFactory.getName().
Now that the bootstrap is ready, the following AWS CLI command can be used to create a new EMR cluster with the bootstrap:
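The exact command depends on your environment, but a minimal sketch looks like the following (the release label, key pair name, subnet ID, bucket, and bootstrap script name are placeholders to replace with your own values):
aws emr create-cluster \
  --name "presto-custom-logging" \
  --release-label emr-5.11.0 \
  --applications Name=Hadoop Name=Hive Name=Presto \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=your-key-pair,SubnetId=subnet-xxxxxxxx \
  --bootstrap-actions Path=s3://replace-with-your-bucket/presto_custom_logging_bootstrap.sh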
After we have implemented the Presto custom logging plugin and deployed it, we can run a test and see the output of the log files.
We first create a Hive table with the DDL below from the Hive command line (type hive in the SSH terminal to the EMR master node to access the Hive command line):
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
Then, we access the presto command line by typing the following command on the terminal:
presto-cli --server localhost:8889 --catalog hive --schema default
Finally, we run a simple query:
Select * from wikistats limit 10;
We then go to the /var/log/presto directory and look at the contents of the log file queries-YYYY-MM-DDTHH\:MM\:SS.0.log. As depicted in the screenshot below, our QueryEventListener plugin captures the fields shown for the Query Created and Query Completed events. Moreover, if there are splits, the plugin will also capture split events.
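To watch the custom log as queries arrive, you can tail it from the master node (the exact file name includes a timestamp, so a wildcard is used here):
$ sudo tail -f /var/log/presto/queries-*.log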
Note: If you want to include the query text executed by the user for auditing and debugging purposes, add the field appropriately in the QueryEventListener class methods, as shown below:
Because this is custom logging, you can capture as many fields as are available for the particular events. To find out the fields available for each of the events, see the Java Classes provided by Presto at this GitHub location.
Summary
In this post, you learned how to add custom logging to Presto on EMR to enhance your organization’s auditing capabilities and provide insights into performance.
If you have questions or suggestions, please leave a comment below.
Zafar Kapadia is a Cloud Application Architect for AWS. He works on Application Development and Optimization projects. He is also an avid cricketer and plays in various local leagues.
Francisco Oliveira is a Big Data Engineer with AWS Professional Services. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.
Cloud security with scalability and innovation: at AWS, this is our top priority. To help you securely architect cloud solutions, AWS Training and Certification recently added new free digital training about security, including a new course about Amazon GuardDuty, a new managed threat-detection service.
These introductory courses, built by AWS experts, are suitable for users and decision makers alike. Each course lasts 10–15 minutes, and you can learn at your own pace.
Introduction to Amazon GuardDuty (10 minutes) – Learn how to use Amazon GuardDuty, a new AWS managed threat-detection service that continuously monitors your account to detect known and unknown threats.
Introduction to Amazon Macie (10 minutes) – Learn how Amazon Macie uses machine learning to automatically discover, classify, and protect sensitive data in the AWS Cloud.
Introduction to AWS Artifact (10 minutes) – Get a brief overview of this service, including the definition of an audit artifact and a demonstration of AWS Artifact.
Introduction to AWS WAF (15 minutes) – Review common AWS WAF use cases and learn which conditions AWS WAF (a web application firewall) can detect. A brief demonstration shows how to configure AWS WAF filters and rules.
Go deeper with AWS security courses
To supplement your foundational training, take these security-focused courses:
AWS Security Fundamentals (3 hours) is a self-paced digital course. This course introduces you to fundamental cloud computing and AWS security concepts, including AWS access control and management, governance, logging, and encryption methods. The course also covers security-related compliance protocols and risk management strategies, as well as procedures related to auditing your AWS security infrastructure.
Security Operations on AWS (3 days) is an instructor-led course that offers in-depth classroom instruction. This course demonstrates how to use AWS security services to help remain secure and compliant in the AWS Cloud. The course focuses on AWS security best practices that you can implement to enhance the security of your data and systems in the cloud.
AWS Training and Certification continually evaluates and expands the training courses available to you, so be sure to visit the website regularly to explore the latest offerings.
Today, AWS introduced AWS Single Sign-On (AWS SSO), a service that makes it easy for you to centrally manage SSO access to multiple AWS accounts and business applications. AWS SSO provides a user portal so that your users can find and access all of their assigned accounts and applications from one place, using their existing corporate credentials. AWS SSO is integrated with AWS Organizations to enable you to manage access to AWS accounts in your organization. In addition, AWS SSO supports Security Assertion Markup Language (SAML) 2.0, which means you can extend SSO access to your SAML-enabled applications by using the AWS SSO application configuration wizard. AWS SSO also includes built-in SSO integrations with many business applications, such as Salesforce, Box, and Office 365.
In this blog post, I help you get started with AWS SSO by answering three main questions:
What benefits does AWS SSO provide?
What are the key features of AWS SSO?
How do I get started?
1. What benefits does AWS SSO provide?
You can connect your corporate Microsoft Active Directory to AWS SSO so that your users can sign in to the user portal with their user names and passwords to access the AWS accounts and applications to which you have granted them access. The following screenshot shows an example of the AWS SSO user portal.
You can use AWS SSO to centrally assign, manage, and audit your users’ access to multiple AWS accounts and SAML-enabled business applications. You can add new users to the appropriate Active Directory group, which automatically gives them access to the AWS accounts and applications assigned for members of that group. AWS SSO also provides better visibility into which users accessed which accounts and applications from the user portal by recording all user portal sign-in activities in AWS CloudTrail. AWS SSO records details such as the IP address, user name, date, and time of the sign-in request. Any changes made by administrators in the AWS SSO console also are recorded in CloudTrail, and you can use security information and event management (SIEM) solutions such as Splunk to analyze the associated CloudTrail logs.
2. What are the key features of AWS SSO?
AWS SSO includes the following key features.
AWS SSO user portal: In the user portal, your users can easily find and access all applications and AWS accounts to which you have granted them access. Users can access the user portal with their corporate Active Directory credentials and access these applications without needing to enter their user name and password again.
Integration with AWS Organizations: AWS SSO is integrated with Organizations to enable you to manage access to all AWS accounts in your organization. When you enable AWS SSO in your organization’s master account, AWS SSO lists all the accounts managed in your organization for which you can enable SSO access to AWS consoles.
Integration with on-premises Active Directory: AWS SSO integrates with your on-premises Active Directory by using AWS Directory Service. Users can access AWS accounts and business applications by using their Active Directory credentials. You can manage which users or groups in your corporate directory can access which AWS accounts.
Centralized permissions management: With AWS SSO, you can centrally manage the permissions granted to users when they access AWS accounts via the AWS Management Console. You define users’ permissions as permission sets, which are collections of permissions that are based on a combination of AWS managed policies or AWS managed policies for job functions. AWS managed policies are designed to provide permissions for many common use cases, and AWS managed policies for job functions are designed to closely align with common job functions in the IT industry.
With AWS SSO, you can configure all the necessary user permissions to your AWS resources in your AWS accounts by applying permission sets. For example, you can grant database administrators broad permissions to Amazon Relational Database Service in your development accounts, but limit their permissions in your production accounts. As you change these permission sets, AWS SSO helps you keep them updated in all relevant AWS accounts, allowing you to manage permissions centrally.
Application configuration wizard: You can configure SSO access to any SAML-enabled business application by using the AWS SSO application configuration wizard.
Built-in SSO integrations: AWS SSO provides built-in SSO integrations and step-by-step configuration instructions for many commonly used business applications such as Office 365, Salesforce, and Box.
Centralized auditing: AWS SSO logs all sign-in and administrative activities in CloudTrail. You can send these logs to SIEM solutions such as Splunk to analyze them.
Highly available multi-tenant SSO infrastructure: AWS SSO is built on a highly available, AWS managed SSO infrastructure. The AWS SSO multi-tenant architecture enables you to start using the service quickly without needing to procure hardware or install software.
3. How do I get started?
To get started, connect your corporate Active Directory to AWS SSO by using AWS Directory Service. You have two choices to connect your corporate directory: use AD Connector, or configure an Active Directory trust with your on-premises Active Directory. After connecting your corporate directory, you can set up accounts and applications for SSO access. You also can use AWS Managed Microsoft AD in the cloud to manage your users and groups in the cloud, if you don’t have an on-premises Active Directory or don’t want to connect to on-premises Active Directory.
The preceding diagram shows how AWS SSO helps connect your users to the AWS accounts and business applications to which they need access. The numbers in the diagram correspond to the following use cases.
Use case 1: Manage SSO access to AWS accounts
With AWS SSO, you can grant your users access to AWS accounts in your organization. You can do this by adding your users to groups in your corporate Active Directory. In AWS SSO, specify which Active Directory groups can access which AWS accounts, and then pick a permission set to specify the level of SSO access you are granting these Active Directory groups. AWS SSO then sets up AWS account access for the users in the groups. Going forward, you can add new users to your Active Directory groups, and AWS SSO automatically provides the users access to the configured accounts. You also can grant Active Directory users direct access to AWS accounts (without needing to add users to Active Directory groups).
To configure AWS account access for your users:
Navigate to the AWS SSO console, and choose AWS accounts from the navigation pane. Choose which accounts you want users to access from the list of accounts. For this example, I am choosing three accounts from my MarketingBU organizational unit. I then choose Assign users.
Choose Users, start typing to search for users, and then choose Search connected directory. This search will return a list of users from your connected directory. You can also search for groups.
To select permission sets, you first have to create one. Choose Create new permission set.
You can use an existing job function policy to create a permission set. This type of policy allows you to apply predefined AWS managed policies to a permission set that are based on common job functions in the IT industry. Alternatively, you can create a custom permission set based on custom policies.
For this example, I choose the SecurityAudit job function policy and then choose Create. As a result, this permission set will be available for me to pick on the next screen.
Choose a permission set to indicate what level of access you want to grant your users. For this example, I assign the SecurityAudit permission set I created in the previous step to the users I chose. I then choose Finish.
Your users can sign in to the user portal and access the accounts to which you gave them access. AWS SSO automatically sets up the necessary trust between accounts to enable SSO. AWS SSO also sets up the necessary permissions in each account. This helps you scale your administrative tasks across multiple AWS accounts.
The users can choose an account and a permission set to sign in to that account without needing to provide a password again. For example, if you grant a user two permission sets—one that is more restrictive and one that is less restrictive—the user can choose which permission set to use for a specific session. In the following screenshot, John has signed in to the AWS SSO user portal. He can see all the accounts to which he has access. For example, he can sign in to the Production Account with SecurityAudit permissions.
Use case 2: Manage SSO access to business applications
AWS SSO has built-in support for SSO access to commonly used business applications such as Salesforce, Office 365, and Box. You can find these applications in the AWS SSO console and easily configure SSO access by using the application configuration wizard. After you configure an application for SSO access, you can grant users access by searching for users and groups in your corporate directory. For a complete list of supported applications, navigate to the AWS SSO console.
To configure SSO access to business applications:
Navigate to the AWS SSO console and choose Applications from the navigation pane.
Choose Add a new application and choose one or more of the applications in the list. For this example, I have chosen Dropbox.
Depending on which application you choose, you will be asked to complete step-by-step instructions to configure the application for SSO access. The instructions guide you to use the details provided in the AWS SSO metadata section to configure your application, and then to provide your application details in the Application metadata section. Choose Save changes when you are done.
Optionally, you can provide additional SAML attribute mappings by choosing the Attribute mappings tab. You need to do this only if you want to pass user attributes from your corporate directory to the application.
To give your users access to this application, choose the Assigned users tab. Choose Assign users to search your connected directory, and choose a user or group that can access this application.
Use case 3: Manage SSO access to custom SAML-enabled applications
You also can enable SSO access to your custom-built or partner-built SAML applications by using the AWS SSO application configuration wizard.
To configure SSO access to SAML-enabled applications:
Navigate to the AWS SSO console and choose Applications from the navigation pane.
Choose Add a new application, choose Custom SAML 2.0 application, and choose Add.
On the Custom SAML 2.0 application page, copy or download the AWS SSO metadata from the AWS SSO metadata section to configure your custom SAML-enabled application to recognize AWS SSO as an identity provider.
On the same page, complete the application configuration details in the Application metadata section, and choose Save changes.
You can provide additional SAML attribute mappings to be passed to your application in the SAML assertion by choosing the Attribute mappings tab. See the documentation for a list of all available attributes.
To give your users access to this application, choose the Assigned users tab. Choose Assign users to search your connected directory, and choose a user or group that can access this application.
Summary
In this blog post, I introduced AWS SSO and explained its key features, benefits, and use cases. With AWS SSO, you can centrally manage and audit SSO access to all your AWS accounts, cloud applications, and custom applications. To start using AWS SSO, navigate to the AWS SSO console.
If you have feedback or questions about AWS SSO, start a new thread on the AWS SSO forum.
Now, Amazon Cloud Directory makes it easier for you to apply schema changes across your directories with in-place schema upgrades. Your directory now remains available while Cloud Directory applies backward-compatible schema changes such as the addition of new fields. Without migrating data between directories or applying code changes to your applications, you can upgrade your schemas. You also can view the history of your schema changes in Cloud Directory by using version identifiers, which help you track and audit schema versions across directories. If you have multiple instances of a directory with the same schema, you can view the version history of schema changes to manage your directory fleet and ensure that all directories are running with the same schema version.
In this blog post, I demonstrate how to perform an in-place schema upgrade and use schema versions in Cloud Directory. I add additional attributes to an existing facet and add a new facet to a schema. I then publish the new schema and apply it to running directories, upgrading the schema in place. I also show how to view the version history of a directory schema, which helps me to ensure my directory fleet is running the same version of the schema and has the correct history of schema changes applied to it.
Note: I share Java code examples in this post. I assume that you are familiar with the AWS SDK and can use Java-based code to build a Cloud Directory code example. You can apply the concepts I cover in this post to other programming languages such as Python and Ruby.
Cloud Directory fundamentals
I will start by covering a few Cloud Directory fundamentals. If you are already familiar with the concepts behind Cloud Directory facets, schemas, and schema lifecycles, you can skip to the next section.
Facets: Groups of attributes. You use facets to define object types. For example, you can define a device schema by adding facets such as computers, phones, and tablets. A computer facet can track attributes such as serial number, make, and model. You can then use the facets to create computer objects, phone objects, and tablet objects in the directory to which the schema applies.
Schemas: Collections of facets. Schemas define which types of objects can be created in a directory (such as users, devices, and organizations) and enforce validation of data for each object class. All data within a directory must conform to the applied schema. As a result, the schema definition is essentially a blueprint to construct a directory with an applied schema.
Schema lifecycle: The four distinct states of a schema: Development, Published, Applied, and Deleted. Schemas in the Published and Applied states have version identifiers and cannot be changed. Schemas in the Applied state are used by directories for validation as applications insert or update data. You can change schemas in the Development state as many times as you need them to. In-place schema upgrades allow you to apply schema changes to an existing Applied schema in a production directory without the need to export and import the data populated in the directory.
How to add attributes to a computer inventory application schema and perform an in-place schema upgrade
To demonstrate how to set up schema versioning and perform an in-place schema upgrade, I will use an example of a computer inventory application that uses Cloud Directory to store relationship data. Let’s say that at my company, AnyCompany, we use this computer inventory application to track all computers we give to our employees for work use. I previously created a ComputerSchema and assigned its version identifier as 1. This schema contains one facet called ComputerInfo that includes attributes for SerialNumber, Make, and Model, as shown in the following schema details.
AnyCompany has offices in Seattle, Portland, and San Francisco. I have deployed the computer inventory application for each of these three locations. As shown in the lower left part of the following diagram, ComputerSchema is in the Published state with a version of 1. The Published schema is applied to SeattleDirectory, PortlandDirectory, and SanFranciscoDirectory for AnyCompany’s three locations. Implementing separate directories for different geographic locations when you don’t have any queries that cross location boundaries is a good data partitioning strategy and gives your application better response times with lower latency.
The following code example creates the schema in the Development state by using a JSON file, publishes the schema, and then creates directories for the Seattle, Portland, and San Francisco locations. For this example, I assume the schema has been defined in the JSON file. The createSchema API creates a schema Amazon Resource Name (ARN) with the name defined in the variable, SCHEMA_NAME. I can use the putSchemaFromJson API to add specific schema definitions from the JSON file.
// The utility method to get valid Cloud Directory schema JSON
String validJson = getJsonFile("ComputerSchema_version_1.json");
String SCHEMA_NAME = "ComputerSchema";
String developmentSchemaArn = client.createSchema(new CreateSchemaRequest()
.withName(SCHEMA_NAME))
.getSchemaArn();
// Put the schema document in the Development schema
PutSchemaFromJsonResult result = client.putSchemaFromJson(new PutSchemaFromJsonRequest()
.withSchemaArn(developmentSchemaArn)
.withDocument(validJson));
The following code example takes the schema that is currently in the Development state and publishes the schema, changing its state to Published.
String SCHEMA_VERSION = "1";
String publishedSchemaArn = client.publishSchema(
new PublishSchemaRequest()
.withDevelopmentSchemaArn(developmentSchemaArn)
.withVersion(SCHEMA_VERSION))
.getPublishedSchemaArn();
// Our Published schema ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1
The following code example creates a directory named SeattleDirectory and applies the published schema. The createDirectory API call creates a directory by using the published schema provided in the API parameters. Note that Cloud Directory stores a version of the schema in the directory in the Applied state. I will use similar code to create directories for PortlandDirectory and SanFranciscoDirectory.
String DIRECTORY_NAME = "SeattleDirectory";
CreateDirectoryResult directory = client.createDirectory(
new CreateDirectoryRequest()
.withName(DIRECTORY_NAME)
.withSchemaArn(publishedSchemaArn));
String directoryArn = directory.getDirectoryArn();
String appliedSchemaArn = directory.getAppliedSchemaArn();
// This code section can be reused to create directories for Portland and San Francisco locations with the appropriate directory names
// Our directory ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX
// Our applied schema ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1
Revising a schema
Now let’s say my company, AnyCompany, wants to add more information for computers and to track which employees have been assigned a computer for work use. I modify the schema to add two attributes to the ComputerInfo facet: Description and OSVersion (operating system version). I make Description optional because it is not important for me to track this attribute for the computer objects I create. I make OSVersion mandatory because it is critical for me to track it for all computer objects so that I can make changes such as applying security patches or making upgrades. Because I make OSVersion mandatory, I must provide a default value that Cloud Directory will apply to objects that were created before the schema revision, in order to handle backward compatibility. Note that you can replace the value in any object with a different value.
I also add a new facet to track computer assignment information, shown in the following updated schema as the ComputerAssignment facet. This facet tracks these additional attributes: Name (the name of the person to whom the computer is assigned), EMail (the email address of the assignee), Department, and department CostCenter. Note that Cloud Directory refers to the previously available version identifier as the Major Version. Because I can now add a minor version to a schema, I also denote the changed schema as Minor Version A.
The following diagram shows the changes that were made when I added another facet to the schema and attributes to the existing facet. The highlighted area of the diagram (bottom left) shows that the schema changes were published.
The following code example revises the existing Development schema by adding the new attributes to the ComputerInfo facet and by adding the ComputerAssignment facet. I use a new JSON file for the schema revision, and for the purposes of this example, I am assuming the JSON file has the full schema including planned revisions.
// The utility method to get a valid CloudDirectory schema JSON
String schemaJson = getJsonFile("ComputerSchema_version_1_A.json");
// Put the schema document in the Development schema
PutSchemaFromJsonResult result = client.putSchemaFromJson(
new PutSchemaFromJsonRequest()
.withSchemaArn(developmentSchemaArn)
.withDocument(schemaJson));
Upgrading the Published schema
The following code example performs an in-place schema upgrade of the Published schema with schema revisions (it adds new attributes to the existing facet and another facet to the schema). The upgradePublishedSchema API upgrades the Published schema with backward-compatible changes from the Development schema.
// From an earlier code example, I know the publishedSchemaArn has this value: "arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1"
// Upgrade publishedSchemaArn to minorVersion A. The Development schema must be backward compatible with
// the existing publishedSchemaArn.
String minorVersion = "A";
UpgradePublishedSchemaResult upgradePublishedSchemaResult = client.upgradePublishedSchema(new UpgradePublishedSchemaRequest()
.withDevelopmentSchemaArn(developmentSchemaArn)
.withPublishedSchemaArn(publishedSchemaArn)
.withMinorVersion(minorVersion));
String upgradedPublishedSchemaArn = upgradePublishedSchemaResult.getUpgradedSchemaArn();
// The Published schema ARN after the upgrade shows a minor version as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1/A
Upgrading the Applied schema
The following diagram shows the in-place schema upgrade for the SeattleDirectory directory. I am performing the schema upgrade so that I can reflect the new schemas in all three directories. As a reminder, I added new attributes to the ComputerInfo facet and also added the ComputerAssignment facet. After the schema and directory upgrade, I can create objects for the ComputerInfo and ComputerAssignment facets in the SeattleDirectory. Any objects that were created with the old facet definition for ComputerInfo will now use the default values for any additional attributes defined in the new schema.
I use the following code example to perform an in-place upgrade of the SeattleDirectory to a Major Version of 1 and a Minor Version of A. Note that you should change a Major Version identifier in a schema to make backward-incompatible changes such as changing the data type of an existing attribute or dropping a mandatory attribute from your schema. Backward-incompatible changes require directory data migration from a previous version to the new version. You should change a Minor Version identifier in a schema to make backward-compatible upgrades such as adding additional attributes or adding facets, which in turn may contain one or more attributes. The upgradeAppliedSchema API lets me upgrade an existing directory with a different version of a schema.
// This upgrades ComputerSchema version 1 of the Applied schema in SeattleDirectory to Major Version 1 and Minor Version A
// The schema must be backward compatible or the API will fail with IncompatibleSchemaException
UpgradeAppliedSchemaResult upgradeAppliedSchemaResult = client.upgradeAppliedSchema(new UpgradeAppliedSchemaRequest()
.withDirectoryArn(directoryArn)
.withPublishedSchemaArn(upgradedPublishedSchemaArn));
String upgradedAppliedSchemaArn = upgradeAppliedSchemaResult.getUpgradedSchemaArn();
// The Applied schema ARN after the in-place schema upgrade will appear as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1
// This code section can be reused to upgrade directories for the Portland and San Francisco locations with the appropriate directory ARN
Note: Cloud Directory has excluded returning the Minor Version identifier in the Applied schema ARN for backward compatibility and to enable the application to work across older and newer versions of the directory.
The following diagram shows the changes that are made when I perform an in-place schema upgrade in the two remaining directories, PortlandDirectory and SanFranciscoDirectory. I make these calls sequentially, upgrading PortlandDirectory first and then upgrading SanFranciscoDirectory. I use the same code example that I used earlier to upgrade SeattleDirectory. Now, all my directories are running the most current version of the schema. Also, I made these schema changes without having to migrate data and while maintaining my application’s high availability.
Schema revision history
I can now view the schema revision history for any of AnyCompany’s directories by using the listAppliedSchemaArns API. Cloud Directory maintains the five most recent versions of applied schema changes. Similarly, to inspect the current Minor Version that was applied to my schema, I use the getAppliedSchemaVersion API. The listAppliedSchemaArns API returns the schema ARNs based on my schema filter as defined in withSchemaArn.
I use the following code example to query an Applied schema for its version history.
// This returns the five most recent Minor Versions associated with a Major Version
ListAppliedSchemaArnsResult listAppliedSchemaArnsResult = client.listAppliedSchemaArns(new ListAppliedSchemaArnsRequest()
.withDirectoryArn(directoryArn)
.withSchemaArn(upgradedAppliedSchemaArn));
// Note: The listAppliedSchemaArns API without the SchemaArn filter returns all the Major Versions in a directory
The listAppliedSchemaArns API returns the two ARNs as shown in the following output.
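The exact values depend on your account and directory, but based on the applied schema ARN format shown earlier, the two entries look similar to the following (account ID and directory GUID are placeholders):
arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1
arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1/A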
The following code example queries an Applied schema for current Minor Version by using the getAppliedSchemaVersion API.
// This returns the current Applied schema's Minor Version ARN
GetAppliedSchemaVersionResult getAppliedSchemaVersionResult = client.getAppliedSchemaVersion(new GetAppliedSchemaVersionRequest()
.withSchemaArn(upgradedAppliedSchemaArn));
The getAppliedSchemaVersion API returns the current Applied schema ARN with a Minor Version, as shown in the following output.
If you have a lot of directories, schema revision API calls can help you audit your directory fleet and ensure that all directories are running the same version of a schema. Such auditing can help you ensure high integrity of directories across your fleet.
Summary
You can use in-place schema upgrades to make changes to your directory schema as you evolve your data set to match the needs of your application. An in-place schema upgrade allows you to maintain high availability for your directory and applications while the upgrade takes place. For more information about in-place schema upgrades, see the in-place schema upgrade documentation.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this post, start a new thread in the Directory Service forum or contact AWS Support.
One of the main tenets of the Family Educational Rights and Privacy Act (FERPA) is the protection of student education records, including personally identifiable information (PII) and directory information. We recently updated our FERPA Compliance on AWS whitepaper to include AWS service-specific guidance for 24 AWS services. The whitepaper describes how these services can be used to help secure protected data. In conjunction with more detailed service-specific documentation, this updated information helps make it easier for you to plan, deploy, and operate secure environments to meet your compliance requirements in the AWS Cloud.
The updated whitepaper is especially useful for educational institutions and their vendors who need to understand:
How AWS services can be used to help deploy educational and PII workloads securely in the AWS Cloud.
Key disciplines in a security program (such as auditing, data destruction, and backup and disaster recovery) that help you run a FERPA-compliant program.
In a related effort to help you secure PII, we also added to the whitepaper a mapping of NIST SP 800-122, which provides guidance for protecting PII, as well as a link to our NIST SP 800-53 Quick Start, a CloudFormation template that automatically configures AWS resources and deploys a multi-tier, Linux-based web application. To learn how this Quick Start works, see the Automate NIST Compliance in AWS GovCloud (US) with AWS Quick Start Tools video. The template helps you streamline and automate secure baselines in AWS—from initial design to operational security readiness—by incorporating the expertise of AWS security and compliance subject matter experts.
For more information about AWS Compliance and FERPA or to request support for your organization, contact your AWS account manager.
– Chris Gile, Senior Manager, AWS Security Assurance
Scale takes on a whole new meaning when it comes to IoT. Last year I was lucky enough to tour a gigantic factory that had, on average, one environment sensor per square meter. The sensors measured temperature, humidity, and air purity several times per second, and served as an early warning system for contaminants. I’ve heard customers express interest in deploying IoT-enabled consumer devices in the millions or tens of millions.
With powerful, long-lived devices deployed in a geographically distributed fashion, managing security is crucial. However, the limited local compute power and memory on these devices can constrain the use of encryption and other forms of data protection.
To address these challenges and to allow our customers to confidently deploy IoT devices at scale, we are working on IoT Device Defender. While the details might change before release, AWS IoT Device Defender is designed to offer these benefits:
Continuous Auditing – AWS IoT Device Defender monitors the policies related to your devices to ensure that the desired security settings are in place. It looks for drifts away from best practices and supports custom audit rules so that you can check for conditions that are specific to your deployment. For example, you could check to see if a compromised device has subscribed to sensor data from another device. You can run audits on a schedule or on an as-needed basis.
Real-Time Detection and Alerting – AWS IoT Device Defender looks for and quickly alerts you to unusual behavior that could be coming from a compromised device. It does this by monitoring the behavior of similar devices over time, looking for unauthorized access attempts, changes in connection patterns, and changes in traffic patterns (either inbound or outbound).
Fast Investigation and Mitigation – If you get an alert that something unusual is happening, AWS IoT Device Defender gives you the tools, including contextual information, to help you investigate and mitigate the problem. Device information, device statistics, diagnostic logs, and previous alerts are all at your fingertips. You have the option to reboot the device, revoke its permissions, reset it to factory defaults, or push a security fix.
Stay tuned
I’ll have more info (and a hands-on post) as soon as possible!