Tag Archives: .sh

Analyze OpenFDA Data in R with Amazon S3 and Amazon Athena

Post Syndicated from Ryan Hood original https://aws.amazon.com/blogs/big-data/analyze-openfda-data-in-r-with-amazon-s3-and-amazon-athena/

One of the great benefits of Amazon S3 is the ability to host, share, or consume public data sets. This provides transparency into data to which an external data scientist or developer might not normally have access. By exposing the data to the public, you can glean many insights that would have been difficult with a data silo.

The openFDA project creates easy access to the high value, high priority, and public access data of the Food and Drug Administration (FDA). The data has been formatted and documented in consumer-friendly standards. Critical data related to drugs, devices, and food has been harmonized and can easily be called by application developers and researchers via API calls. OpenFDA has published two whitepapers that drill into the technical underpinnings of the API infrastructure as well as how to properly analyze the data in R. In addition, FDA makes openFDA data available on S3 in raw format.

In this post, I show how to use S3, Amazon EMR, and Amazon Athena to analyze the drug adverse events dataset. A drug adverse event is an undesirable experience associated with the use of a drug, including serious drug side effects, product use errors, product quality problems, and therapeutic failures.

Data considerations

Keep in mind that this data does have limitations. In the United States, these adverse events are submitted to the FDA voluntarily by consumers, so there may not be reports for all events that occurred. There is no certainty that the reported event was actually due to the product. The FDA does not require that a causal relationship between a product and event be proven, and reports do not always contain the detail necessary to evaluate an event. Because of this, there is no way to identify the true number of events. The important takeaway from all this is that the information contained in this data has not been verified to produce cause-and-effect relationships. Despite this disclaimer, many interesting insights and much value can be derived from the data to accelerate drug safety research.

Data analysis using SQL

For application developers who want to perform targeted searching and lookups, the API endpoints provided by the openFDA project are “ready to go” for software integration using a standard API powered by Elasticsearch, NodeJS, and Docker. However, for data analysis purposes, it is often easier to work with the data using SQL and statistical packages that expect a SQL table structure. For large-scale analysis, APIs often have query limits, such as 5000 records per query. This can cause extra work for data scientists who want to analyze the full dataset instead of small subsets of data.

To address the concern of requiring all the data in a single dataset, the openFDA project released the full 100 GB of harmonized data files that back the openFDA project onto S3. Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It’s a quick and easy way to answer your questions about adverse events and aspirin that does not require you to spin up databases or servers.

While you could point tools directly at the openFDA S3 files, you can find greatly improved performance and use of the data by following some of the preparation steps later in this post.

Architecture

This post explains how to use the following architecture to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.

Steps:

  1. Load the openFDA /drug/event dataset into Spark and convert it to gzip to allow for streaming.
  2. Transform the data in Spark and save the results as a Parquet file in S3.
  3. Query the S3 Parquet file with Athena.
  4. Perform visualization and analysis of the data in R and Python on Amazon EC2.

Optimizing public data sets: A primer on data preparation

Those who want to jump right into preparing the files for Athena may want to skip ahead to the next section.

Transforming, or pre-processing, files is a common task when using many public data sets. Before you jump into the specific steps for transforming the openFDA data files into a format optimized for Athena, I thought it would be worthwhile to provide a quick exploration of the problem.

Making a dataset in S3 efficiently accessible with minimal transformation for the end user has two key elements:

  1. Partitioning the data into objects that contain a complete part of the data (such as data created within a specific month).
  2. Using file formats that make it easy for applications to locate subsets of data (for example, gzip, Parquet, ORC, etc.).

With these two key elements in mind, you can now apply transformations to the openFDA adverse event data to prepare it for Athena. You might find the data techniques employed in this post to be applicable to many of the questions you might want to ask of the public data sets stored in Amazon S3.

Before you get started, I encourage anyone interested in doing deeper healthcare analysis on AWS to first read the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing protected health information (PHI).

Also, the adverse event analysis shown for aspirin is strictly for demonstration purposes and should not be used for any real decision or taken as anything other than a demonstration of AWS capabilities. However, there have been robust case studies published that have explored a causal relationship between aspirin and adverse reactions using OpenFDA data. If you are seeking research on aspirin or its risks, visit organizations such as the Centers for Disease Control and Prevention (CDC) or the Institute of Medicine (IOM).

Preparing data for Athena

For this walkthrough, you will start with the FDA adverse events dataset, which is stored as JSON files within zip archives on S3. You then convert it to Parquet for analysis. Why do you need to convert it? The original data download is stored in objects that are partitioned by quarter.

Here is a small sample of what you find in the adverse events (/drugs/event) section of the openFDA website.

If you were looking for events that happened in a specific quarter, this is not a bad solution. For most other scenarios, such as looking across the full history of aspirin events, it requires you to access a lot of data that you won’t need. The zip file format is not ideal for using data in place because zip readers must have random access to the file, which means the data can’t be streamed. Additionally, the zip files contain large JSON objects.

To read the data in these JSON files, you must use a streaming JSON decoder or a computer with a significant amount of RAM to decode the JSON. Opening up these files for public consumption is a great start. However, you still need to prepare the data with a few lines of Spark code so that the JSON can be streamed.

Step 1:  Convert the file types

Using Apache Spark on EMR, you can extract all of the zip files and pull out the events from the JSON files. To do this, use the Scala code below to decompress each zip file and create a text file. In addition, compress the JSON files with gzip to improve Spark’s performance and reduce your overall storage footprint. The Scala code can be run in either the Spark Shell or in an Apache Zeppelin notebook on your EMR cluster.

If you are unfamiliar with either Apache Zeppelin or the Spark Shell, the following posts serve as great references:


import scala.io.Source
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.hadoop.io.compress.GzipCodec

// Input Directory
val inputFile = "s3://download.open.fda.gov/drug/event/2015q4/*.json.zip";

// Output Directory
val outputDir = "s3://{YOUR OUTPUT BUCKET HERE}/output/2015q4/";

// Read the zip files from S3 as binary files
val zipFiles = sc.binaryFiles(inputFile);

// Process each zip file to extract the JSON as a text file and save it
// in the output directory, compressed with gzip
val rdd = zipFiles.flatMap((file: (String, PortableDataStream)) => {
    val zipStream = new ZipInputStream(file._2.open)
    val entry = zipStream.getNextEntry
    val iter = Source.fromInputStream(zipStream).getLines
    iter
}).map(_.replaceAll("\\s+", "")).saveAsTextFile(outputDir, classOf[GzipCodec])

Step 2:  Transform JSON into Parquet

With just a few more lines of Scala code, you can use Spark’s abstractions to convert the JSON into a Spark DataFrame and then export the data back to S3 in Parquet format.

Spark requires the JSON to be in JSON Lines format to be parsed correctly into a DataFrame.

// Output Parquet directory
val outputDir = "s3://{YOUR OUTPUT BUCKET NAME}/output/drugevents"
// Input json file
val inputJson = "s3://{YOUR OUTPUT BUCKET NAME}/output/2015q4/*"
// Load dataframe from json file multiline 
val df = spark.read.json(sc.wholeTextFiles(inputJson).values)
// Extract results from dataframe
val results = df.select("results")
// Save it to Parquet
results.write.parquet(outputDir)

Step 3:  Create an Athena table

With the data cleanly prepared and stored in S3 using the Parquet format, you can now place an Athena table on top of it to get a better understanding of the underlying data.

Because the openFDA data structure incorporates several layers of nesting, it can be a complex process to try to manually derive the underlying schema in a Hive-compatible format. To shorten this process, you can load the top row of the DataFrame from the previous step into a Hive table within Zeppelin and then extract the “create table” statement from SparkSQL.

results.createOrReplaceTempView("data")

val top1 = spark.sql("select * from data tablesample(1 rows)")

top1.write.format("parquet").mode("overwrite").saveAsTable("drugevents")

val show_cmd = spark.sql("show create table drugevents").show(1, false)

This returns a “create table” statement that you can almost paste directly into the Athena console. Make some small modifications (adding the word “external” and replacing “using” with “stored as”), and then execute the code in the Athena query editor. The table is created.

For the openFDA data, the DDL returns all string fields, as the date format used in this dataset does not conform to the yyyy-mm-dd hh:mm:ss[.f…] format required by Hive. For this analysis, the string format works appropriately, but it would be possible to extend this code to use a Presto function to convert the strings into time stamps.

CREATE EXTERNAL TABLE  drugevents (
   companynumb  string, 
   safetyreportid  string, 
   safetyreportversion  string, 
   receiptdate  string, 
   patientagegroup  string, 
   patientdeathdate  string, 
   patientsex  string, 
   patientweight  string, 
   serious  string, 
   seriousnesscongenitalanomali  string, 
   seriousnessdeath  string, 
   seriousnessdisabling  string, 
   seriousnesshospitalization  string, 
   seriousnesslifethreatening  string, 
   seriousnessother  string, 
   actiondrug  string, 
   activesubstancename  string, 
   drugadditional  string, 
   drugadministrationroute  string, 
   drugcharacterization  string, 
   drugindication  string, 
   drugauthorizationnumb  string, 
   medicinalproduct  string, 
   drugdosageform  string, 
   drugdosagetext  string, 
   reactionoutcome  string, 
   reactionmeddrapt  string, 
   reactionmeddraversionpt  string)
STORED AS parquet
LOCATION
  's3://{YOUR TARGET BUCKET}/output/drugevents'

With the Athena table in place, you can start to explore the data by running ad hoc queries within Athena or doing more advanced statistical analysis in R.

Using SQL and R to analyze adverse events

Using the openFDA data with Athena makes it very easy to translate your questions into SQL code and perform quick analysis on the data. After you have prepared the data for Athena, you can begin to explore the relationship between aspirin and adverse drug events, as an example. One of the most common metrics to measure adverse drug events is the Proportional Reporting Ratio (PRR). It is defined as:

PRR = (m/n) / ((M-m)/(N-n))

where
m = number of reports with the drug and the event
n = number of reports with the drug
M = number of reports with the event in the database
N = number of reports in the database

Gastrointestinal haemorrhage has the highest PRR of any reaction to aspirin when viewed in aggregate. One question you may want to ask is how the PRR has trended on a yearly basis for gastrointestinal haemorrhage since 2005.

Using the following query in Athena, you can see the PRR trend of “GASTROINTESTINAL HAEMORRHAGE” reactions with “ASPIRIN” since 2005:

with drug_and_event as 
(select rpad(receiptdate, 4, 'NA') as receipt_year
    , reactionmeddrapt
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug_and_event 
from fda.drugevents
where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by reactionmeddrapt, rpad(receiptdate, 4, 'NA') 
), reports_with_drug as 
(
select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug 
 from fda.drugevents 
 where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
group by rpad(receiptdate, 4, 'NA') 
), reports_with_event as 
(
   select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_event 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
   group by rpad(receiptdate, 4, 'NA')
), total_reports as 
(
   select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as total_reports 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
   group by rpad(receiptdate, 4, 'NA')
)
select  drug_and_event.receipt_year, 
(1.0 * drug_and_event.reports_with_drug_and_event/reports_with_drug.reports_with_drug)/ (1.0 * (reports_with_event.reports_with_event- drug_and_event.reports_with_drug_and_event)/(total_reports.total_reports-reports_with_drug.reports_with_drug)) as prr
, drug_and_event.reports_with_drug_and_event
, reports_with_drug.reports_with_drug
, reports_with_event.reports_with_event
, total_reports.total_reports
from drug_and_event
    inner join reports_with_drug on  drug_and_event.receipt_year = reports_with_drug.receipt_year   
    inner join reports_with_event on  drug_and_event.receipt_year = reports_with_event.receipt_year
    inner join total_reports on  drug_and_event.receipt_year = total_reports.receipt_year
order by  drug_and_event.receipt_year


One nice feature of Athena is that you can quickly connect to it via R or any other tool that can use a JDBC driver to visualize the data and understand it more clearly.

With this quick R script that can be run in R Studio either locally or on an EC2 instance, you can create a visualization of the PRR and Reporting Odds Ratio (RoR) for “GASTROINTESTINAL HAEMORRHAGE” reactions from “ASPIRIN” since 2005 to better understand these trends.

# Connect to Athena
# (assumes 'drv' already holds the Athena JDBC driver, e.g. loaded via RJDBC::JDBC)
conn <- dbConnect(drv, '<Your JDBC URL>', s3_staging_dir="<Your S3 Location>", user=Sys.getenv("USER_NAME"), password=Sys.getenv("USER_PASSWORD"))

# Declare Adverse Event
adverseEvent <- "'GASTROINTESTINAL HAEMORRHAGE'"

# Build SQL Blocks
sqlFirst <- "SELECT rpad(receiptdate, 4, 'NA') as receipt_year, count(DISTINCT safetyreportid) as event_count FROM fda.drugevents WHERE rpad(receiptdate,4,'NA') between '2005' and '2015'"
sqlEnd <- "GROUP BY rpad(receiptdate, 4, 'NA') ORDER BY receipt_year"

# Extract Aspirin with adverse event counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN' AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
aspirinAdverseCount = dbGetQuery(conn,sql)

# Extract Aspirin counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN'", sqlEnd,sep=" ")
aspirinCount = dbGetQuery(conn,sql)

# Extract adverse event counts
sql <- paste(sqlFirst,"AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
adverseCount = dbGetQuery(conn,sql)

# All Drug Adverse event Counts
sql <- paste(sqlFirst, sqlEnd,sep=" ")
allDrugCount = dbGetQuery(conn,sql)

# Select correct rows
selAll =  allDrugCount$receipt_year == aspirinAdverseCount$receipt_year
selAspirin = aspirinCount$receipt_year == aspirinAdverseCount$receipt_year
selAdverse = adverseCount$receipt_year == aspirinAdverseCount$receipt_year

# Calculate Numbers
m <- c(aspirinAdverseCount$event_count)
n <- c(aspirinCount[selAspirin,2])
M <- c(adverseCount[selAdverse,2])
N <- c(allDrugCount[selAll,2])

# Calculate proportional reporting ratio
PRR = (m/n)/((M-m)/(N-n))

# Calculate reporting Odds Ratio
d = n-m
D = N-M
ROR = (m/d)/(M/D)

# Plot the PRR and ROR
g_range <- range(0, PRR, ROR)
g_range[2] <- g_range[2] + 3
yearLen <- length(aspirinAdverseCount$receipt_year)
ax <- aspirinAdverseCount$receipt_year
plot(PRR, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
axis(1, 1:yearLen, lab=ax)
axis(2, las=1, at=1*0:g_range[2])
box()
lines(ROR, type="o", pch=22, lty=2, col="red")

As you can see, the PRR and RoR have both remained fairly steady over this time range. With the R Script above, all you need to do is change the adverseEvent variable from GASTROINTESTINAL HAEMORRHAGE to another type of reaction to analyze and compare those trends.

Summary

In this walkthrough:

  • You used a Scala script on EMR to convert the openFDA zip files to gzip.
  • You then transformed the JSON blobs into flattened Parquet files using Spark on EMR.
  • You created an Athena DDL so that you could query these Parquet files residing in S3.
  • Finally, you pointed the R package at the Athena table to analyze the data without pulling it into a database or creating your own servers.

If you have questions or suggestions, please comment below.


Next Steps

Take your skills to the next level. Learn how to optimize Amazon S3 for an architecture commonly used to enable genomic data analysis. Also, be sure to read more about running R on Amazon Athena.

About the Authors

Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys watching the Cubs win the World Series and attempting to Sous-vide anything he can find in his refrigerator.

Vikram Anand is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys playing soccer and watching the NFL & European Soccer leagues.

Dave Rocamora is a Solutions Architect at Amazon Web Services on the Open Data team. Dave is based in Seattle and when he is not opening data, he enjoys biking and drinking coffee outside.

Basic API Rate-Limiting

Post Syndicated from Bozho original https://techblog.bozho.net/basic-api-rate-limiting/

It is likely that you are developing some form of (web/RESTful) API, and in case it is publicly-facing (or even when it’s internal), you normally want to rate-limit it somehow. That is, to limit the number of requests performed over a period of time, in order to save resources and protect from abuse.

This can probably be achieved at the web-server/load balancer level with some clever configuration, but usually you want the rate limiter to be client-specific (i.e. each client of your API should have a separate rate limit), and the way the client is identified varies. It’s probably still possible to do it on the load balancer, but I think it makes sense to have it on the application level.

I’ll use spring-mvc for the example, but any web framework has a good way to plug an interceptor.

So here’s an example of a spring-mvc interceptor:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

import javax.annotation.PreDestroy;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

@Component
public class RateLimitingInterceptor extends HandlerInterceptorAdapter {

    private static final Logger logger = LoggerFactory.getLogger(RateLimitingInterceptor.class);
    
    @Value("${rate.limit.enabled}")
    private boolean enabled;
    
    @Value("${rate.limit.hourly.limit}")
    private int hourlyLimit;

    private Map<String, SimpleRateLimiter> limiters = new ConcurrentHashMap<>();
    
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler)
            throws Exception {
        if (!enabled) {
            return true;
        }
        String clientId = request.getHeader("Client-Id");
        // let non-API requests pass
        if (clientId == null) {
            return true;
        }
        SimpleRateLimiter rateLimiter = getRateLimiter(clientId);
        boolean allowRequest = rateLimiter.tryAcquire();
    
        if (!allowRequest) {
            response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
        }
        response.addHeader("X-RateLimit-Limit", String.valueOf(hourlyLimit));
        return allowRequest;
    }
    
    private SimpleRateLimiter getRateLimiter(String clientId) {
        // create a limiter for this client on demand
        return limiters.computeIfAbsent(clientId, id -> createRateLimiter(id));
    }

    private SimpleRateLimiter createRateLimiter(String clientId) {
        // one limiter per client, allowing hourlyLimit requests per hour
        // (uses the SimpleRateLimiter implementation shown later in this post)
        return SimpleRateLimiter.create(hourlyLimit, TimeUnit.HOURS);
    }

	
    @PreDestroy
    public void destroy() {
        // loop and finalize all limiters
    }
}

This initializes rate-limiters per client on demand. Alternatively, on startup you could just loop through all registered API clients and create a rate limiter for each. In case the rate limiter doesn’t allow more requests (tryAcquire() returns false), then return “Too many requests” and abort the execution of the request (return “false” from the interceptor).

This sounds simple. But there are a few catches. You may wonder where the SimpleRateLimiter above is defined. We’ll get there, but first let’s see what options we have for rate limiter implementations.

The most recommended one seems to be the guava RateLimiter. It has a straightforward factory method that gives you a rate limiter for a specified rate (permits per second). However, it doesn’t accommodate web APIs very well, as you can’t initialize the RateLimiter with a pre-existing number of permits. That means a period of time should elapse before the limiter would allow requests. There’s another issue – if you have fewer than one permit per second (e.g. if your desired rate limit is “200 requests per hour”), you can pass a fraction (hourlyLimit / secondsInHour), but it still won’t work the way you expect it to, as internally there’s a “maxPermits” field that caps the number of permits to much less than you want. Also, the rate limiter doesn’t allow bursts – you have exactly X permits per second, but you cannot spread them over a longer period of time, e.g. have 5 requests in one second, and then no requests for the next few seconds. In fact, all of the above can be solved, but sadly, only through hidden fields that you don’t have access to. Multiple feature requests have existed for years now, but Guava just doesn’t update the rate limiter, making it much less applicable to API rate-limiting.
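For illustration, here is roughly what using the guava limiter for an hourly limit looks like; this is a sketch rather than code from the post, and the hourlyLimit value is an arbitrary assumption:

import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.RateLimiter;

public class GuavaLimiterSketch {
    public static void main(String[] args) {
        int hourlyLimit = 200; // hypothetical per-client limit
        // Guava only accepts permits per second, so an hourly limit becomes a small fraction
        RateLimiter limiter = RateLimiter.create(hourlyLimit / (double) TimeUnit.HOURS.toSeconds(1));
        // returns immediately with false if no permit is currently available
        boolean allowed = limiter.tryAcquire();
        System.out.println("request allowed: " + allowed);
    }
}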

Using reflection, you can tweak the parameters and make the limiter work. However, it’s ugly, and it’s not guaranteed it will work as expected. I have shown here how to initialize a guava rate limiter with X permits per hour, with burstability and full initial permits. When I thought that would do, I saw that tryAcquire() has a synchronized(..) block. Will that mean all requests will wait for each other when simply checking whether allowed to make a request? That would be horrible.

So in fact the guava RateLimiter is not meant for (web) API rate-limiting. Maybe keeping it feature-poor is Guava’s way for discouraging people from misusing it?

That’s why I decided to implement something simple myself, based on a Java Semaphore. Here’s the naive implementation:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class SimpleRateLimiter {
    private Semaphore semaphore;
    private int maxPermits;
    private TimeUnit timePeriod;
    private ScheduledExecutorService scheduler;

    public static SimpleRateLimiter create(int permits, TimeUnit timePeriod) {
        SimpleRateLimiter limiter = new SimpleRateLimiter(permits, timePeriod);
        limiter.schedulePermitReplenishment();
        return limiter;
    }

    private SimpleRateLimiter(int permits, TimeUnit timePeriod) {
        this.semaphore = new Semaphore(permits);
        this.maxPermits = permits;
        this.timePeriod = timePeriod;
    }

    public boolean tryAcquire() {
        return semaphore.tryAcquire();
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    public void schedulePermitReplenishment() {
        scheduler = Executors.newScheduledThreadPool(1);
        // top the semaphore back up to maxPermits once every "1 timePeriod"
        scheduler.scheduleAtFixedRate(() -> {
            semaphore.release(maxPermits - semaphore.availablePermits());
        }, 1, 1, timePeriod);
    }
}

It takes a number of permits (the allowed number of requests) and a time period. The time period is “1 X”, where X can be second/minute/hour/day – depending on how you want your limit to be configured – per second, per minute, hourly, daily. Every 1 X a scheduler replenishes the acquired permits (in the example above there’s one scheduler per client, which may be inefficient with a large number of clients – you can pass a shared scheduler pool instead). There is no control for bursts (a client can spend all permits with a rapid succession of requests), there is no warm-up functionality, and there is no gradual replenishment. Depending on what you want, this may not be ideal, but that’s just a basic rate limiter that is thread-safe and doesn’t have any blocking. I wrote a unit test to confirm that the limiter behaves properly, and also ran performance tests against a local application to make sure the limit is obeyed. So far it seems to be working.
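As a rough illustration of how the class is meant to be used (this is a sketch, not the actual unit test from the post, and the numbers are arbitrary):

import java.util.concurrent.TimeUnit;

public class SimpleRateLimiterDemo {
    public static void main(String[] args) {
        // allow 10 requests per minute; the scheduler replenishes permits every minute
        SimpleRateLimiter limiter = SimpleRateLimiter.create(10, TimeUnit.MINUTES);

        int allowed = 0;
        for (int i = 0; i < 15; i++) {
            if (limiter.tryAcquire()) {
                allowed++;
            }
        }
        // only the first 10 attempts within the same minute should get a permit
        System.out.println("allowed " + allowed + " of 15 requests");

        limiter.stop();
    }
}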

Are there alternatives? Well, yes – there are libraries like RateLimitJ that use Redis to implement rate-limiting. That would mean, however, that you need to set up and run Redis, which seems like an overhead for “simply” having rate-limiting. (Note: it seems to also have an in-memory version.)

On the other hand, how would rate-limiting work properly in a cluster of application nodes? Do the application nodes need some database or gossip protocol to share data about the per-client permits (requests) remaining? Not necessarily. A very simple approach to this issue would be to assume that the load balancer distributes the load equally among your nodes. That way you would just have to set the limit on each node to be equal to the total limit divided by the number of nodes. It won’t be exact, but you rarely need it to be – allowing 5-10 more requests won’t kill your application, and allowing 5-10 fewer won’t be dramatic for the users.

That, however, would mean that you have to know the number of application nodes. If you employ auto-scaling (e.g. in AWS), the number of nodes may change depending on the load. If that is the case, instead of configuring a hard-coded number of permits, the replenishing scheduled job can calculate the “maxPermits” on the fly, by calling an AWS (or other cloud-provider) API to obtain the number of nodes in the current auto-scaling group. That would still be simpler than supporting a redis deployment just for that.
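A minimal sketch of that idea, assuming the AWS SDK for Java (v1); the auto-scaling group name and total limit are hypothetical:

import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.AutoScalingGroup;
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;

public class PerNodePermits {
    public static void main(String[] args) {
        int totalHourlyLimit = 1000; // hypothetical cluster-wide limit

        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();
        AutoScalingGroup group = autoScaling
                .describeAutoScalingGroups(new DescribeAutoScalingGroupsRequest()
                        .withAutoScalingGroupNames("my-api-asg")) // hypothetical group name
                .getAutoScalingGroups()
                .get(0);

        // each node takes an equal share of the total limit
        int nodeCount = Math.max(1, group.getInstances().size());
        int maxPermitsPerNode = totalHourlyLimit / nodeCount;

        System.out.println("permits for this node: " + maxPermitsPerNode);
    }
}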

Overall, I’m surprised there isn’t a “canonical” way to implement rate-limiting (in Java). Maybe the need for rate-limiting is not as common as it may seem. Or it’s implemented manually – by temporarily banning API clients that use “too many resources”.

Update: someone pointed out the bucket4j project, which seems nice and worth taking a look at.

The post Basic API Rate-Limiting appeared first on Bozho's tech blog.

mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html

Introducing mkosi

After blogging about casync I realized I never blogged about the
mkosi tool that combines nicely with it. mkosi has been around for a
while already, and it’s time to make it a bit better known. mkosi
stands for Make Operating System Image, and is a tool for precisely
that: generating an OS tree or image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite
well known and popular. But mkosi has a number of features that I
think make it interesting for a variety of use-cases that other tools
don’t cover that well.

What is mkosi?

What are those use-cases, and what precisely sets mkosi apart?
mkosi is definitely a tool with a focus on developers’ needs for
building OS images, for testing and debugging, but also for generating
production images with cryptographic protection. A typical use-case
would be to add a mkosi.default file to an existing project (for
example, one written in C or Python), thus making it easy to
generate an OS image for it. mkosi will put together the image with
development headers and tools, compile your code in it, run your test
suite, then throw away the image again, and build a new one, this time
without development headers and tools, and install your build
artifacts in it. This final image is then “production-ready”, and only
contains your built program and the minimal set of packages you
configured otherwise. Such an image could then be deployed with
casync (or any other tool of course) to be delivered to your set of
servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on
today’s technology, not yesteryear’s. Specifically this means that
we’ll generate GPT partition tables, not MBR/DOS ones. When you tell
mkosi to generate a bootable image for you, it will make it bootable
on EFI, not on legacy BIOS. The GPT images generated follow
specifications such as the Discoverable Partitions Specification,
so that /etc/fstab can remain unpopulated and tools such as
systemd-nspawn can automatically dissect the image and boot from
it.

So, let’s have a look at the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional
options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build
images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same
feature level currently. Also, as mkosi is based on dnf
--installroot, debootstrap, pacstrap and zypper, and those tools
are not packaged universally on all distributions, you might not be
able to build images for all those distributions on arbitrary host
distributions. For example, Fedora doesn’t package zypper, hence
you cannot easily build an openSUSE image on Fedora, but you can
still build Fedora (obviously…), Debian, Ubuntu and ArchLinux images
on it just fine.

The GPT images are put together in a way that they aren’t just
compatible with UEFI systems, but also with VM and container managers
(that is, at least the smart ones, i.e. VM managers that know UEFI,
and container managers that grok GPT disk images) to a large
degree. In fact, the idea is that you can use mkosi to build a
single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off it, using systemd’s RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used
if available to ensure the image is not tampered with (yes, you read
that right, systemd-nspawn and systemd’s RootImage= setting
automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest way to use mkosi is to invoke it without any
parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you,
call it image.raw, and drop it in the current directory. The
distribution used will be the same one as your host runs.

Of course in most cases you want more control over how the image is
put together, i.e. select package sets, select the distribution, size
partitions and so on. Most of that you can actually specify on the
command line, but it is recommended to instead create a couple of
mkosi.$SOMETHING files and directories in some directory. Then,
simply change to that directory and run mkosi without any further
arguments. The tool will then look in the current working directory
for these files and directories and make use of them (similar to how
make looks for a Makefile…). Every single file/directory is
optional, but if they exist they are honored. Here’s a list of the
files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you
    can configure what kind of image you want, which distribution, which
    packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy
    everything inside it into the images built. You can place arbitrary
    directory hierarchies in here, and they’ll be copied over whatever is
    already in the image, after it was put together by the distribution’s
    package manager. This is the best way to drop additional static files
    into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build
    script. When it exists, mkosi will build two images, one after the
    other in the mode already mentioned above: the first version is the
    build image, and may include various build-time dependencies such as
    a compiler or development headers. The build script is also copied
    into it, and then run inside it. The script should then build
    whatever shall be built and place the result in $DESTDIR (don’t
    worry, popular build tools such as Automake or Meson all honor
    $DESTDIR anyway, so there’s not much to do here explicitly). It may
    also run a test suite, or anything else you like. After the script
    finishes, the build image is removed again, and a second image (the
    final image) is built. This time, no development packages are
    included, and the build script is not copied into the image again —
    however, the build artifacts from the first run (i.e. those placed in
    $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked
    inside the image (inside a systemd-nspawn invocation) and can
    adjust the image as it likes at a very late point in the image
    preparation. If mkosi.build exists, i.e. the dual-phased
    development build process is used, then this script will be invoked
    twice: once inside the build image and once inside the final
    image. The first parameter passed to the script clarifies which phase
    it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a
    container configuration file for systemd-nspawn (see
    systemd.nspawn(5)
    for details), which shall be shipped along with the final image and
    shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package
    cache directory for the builds. This directory is effectively bind
    mounted into the image at build time, in order to speed up building
    images. The package installers of the various distributions will
    place their package files here, so that subsequent runs can reuse
    them.

  7. mkosi.passphrase — If this file exists, it should contain a
    pass-phrase to use for the LUKS encryption (if that’s enabled for the
    image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an
    X.509 key pair to use for signing the kernel and initrd for UEFI
    SecureBoot, if that’s enabled.

How to use it

So, let’s come back to our most trivial example, without any of the
mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create an image file image.raw in the current
directory. How do we use it? Of course, we could dd it onto some USB
stick and boot it on a bare-metal device. However, it’s much simpler
to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let’s make things more interesting. Let’s still not use any of
the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar to the above, but we made three changes: it’s no
longer GPT + ext4, but GPT + btrfs. Moreover, the system is made
bootable on UEFI systems, and finally, the output is now called
foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except
that this uses full VM virtualization rather than container
virtualization. (Note that the way to run a UEFI qemu/kvm instance
appears to change all the time and is different on the various
distributions. It’s quite annoying, and I can’t really tell you what
the right qemu command line is to make this work on your system.)

Of course, it’s not all raw GPT disk images with mkosi. Let’s try
a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as a plain directory you can’t
boot it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask
mkosi to generate a compressed GPT image with a root squashfs,
compress the result with xz, and generate a SHA256SUMS file with
the hashes of the generated artifacts. The image will contain the
SSH client as well as everybody’s favorite editor.

Now, let’s make use of the various mkosi.$SOMETHING files. Let’s
say we are working on some Automake-based project and want to make it
easy to generate a disk image off the development tree with the
version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let’s add a build script:

# cat > mkosi.build <<EOF
#!/bin/sh
cd $SRCDIR
./autogen.sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let’s try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you’ll notice that building an image like
this can be quite slow. And slow build times are actively hurtful to
your productivity as a developer. Hence let’s make things a bit
faster. First, let’s make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and
generate less network traffic) as the packages will now be downloaded
only once and reused. However, you’ll notice that unpacking all those
packages and the rest of the work is still quite slow. But mkosi can
help you with that. Simply use mkosi‘s incremental build feature. In
this mode mkosi will make a copy of the build and final images
immediately before dropping in your build sources or artifacts, so
that building an image becomes a lot quicker: instead of always
starting totally from scratch a build will now reuse everything it can
reuse from a previous run, and immediately begin with building your
sources rather than the build image to build your sources in. To
enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated
anymore from your distribution’s servers, as the cached copy is made
after all packages are installed, and hence until you actually delete
the cached copy the distribution’s network servers aren’t contacted
again and no RPMs or DEBs are downloaded. This means the distribution
you use becomes “frozen in time” this way. (Which might be a bad
thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you’ll notice that it
won’t overwrite the generated image when it already exists. You can
either delete the file yourself first (rm image.raw) or let mkosi
do it for you right before building a new image, with mkosi -f. You
can also tell mkosi to not only remove any such pre-existing images,
but also remove any cached copies of the incremental feature, by using
-f twice.

I wrote mkosi originally in order to test systemd, and quickly
generate a disk image of various distributions with the most current
systemd version from git, without all that affecting my host system. I
regularly use mkosi for that today, in incremental mode. The two
commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the
very newest set of RPMs provided by Fedora, instead of a cached
snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git
tree: mkosi.default and mkosi.build. This way, any developer who
wants to quickly test something with current systemd git, or wants to
prepare a patch based on it and test it, can check out the systemd
repository and simply run mkosi in it, and a few minutes later he has
a bootable image he can test in systemd-nspawn or KVM. casync has
similar files: mkosi.default, mkosi.build.

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled
    disk images if you ask for it. For that use the --verity switch on
    the command line or Verity= setting in mkosi.default. Of course,
    dm-verity implies that the root volume is read-only. In this mode
    the top-level dm-verity hash will be placed alongside the output
    disk image in a file named the same way, but with the .roothash
    suffix. If the image is to be created bootable, the root hash is also
    included on the kernel command line in the roothash= parameter,
    which current systemd versions can use to both find and activate the
    root partition in a dm-verity protected way. BTW: it’s a good idea
    to combine this dm-verity mode with the raw_squashfs image mode,
    to generate a genuinely protected, compressed image suitable for
    running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum
    file SHA256SUMS for you (--checksum) covering all the files it
    outputs (which could be the image file itself, a matching .nspawn
    file using the mkosi.nspawn file mentioned above, as well as the
    .roothash file for the dm-verity root hash.) It can then
    optionally sign this with gpg (--sign). Note that systemd‘s
    machinectl pull-tar and machinectl pull-raw command can download
    these files and the SHA256SUMS file automatically and verify things
    on download. In other words: what mkosi outputs is perfectly
    ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To
    make use of that, place your X.509 key pair in two files
    mkosi.secureboot.crt and mkosi.secureboot.key, and set
    SecureBoot= or --secure-boot. If so, mkosi will sign the
    kernel/initrd/kernel command line combination during the build. Of
    course, if you use this mode, you should also use
    Verity=/--verity=, otherwise the setup makes only partial
    sense. Note that mkosi will not help you with actually enrolling
    the keys you use in your UEFI BIOS.

  4. mkosi has minimal support for GIT checkouts: when it recognizes
    it is run in a git checkout and you use the mkosi.build script
    stuff, the source tree will be copied into the build image, but with
    all files excluded by .gitignore removed.

  5. There’s support for encryption in place. Use --encrypt= or
    Encrypt=. Note that the UEFI ESP is never encrypted though, and the
    root partition only if explicitly requested. The /home and /srv
    partitions are unconditionally encrypted if that’s enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line
    arguments may be configured for the image to generate.

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies,
listed in the
README. Most
notably you need a somewhat recent systemd version to make use of its
full feature set: systemd 233. Older versions are already packaged for
various distributions, but much of what I describe above is only
available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn’t
available in Fedora, but there’s a COPR.

Future

It is my intention to continue turning mkosi into a tool suitable
for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and
systemd/sd-boot native support for A/B IoT style partition
setups. The idea is that the combination of systemd, casync and
mkosi provides generic building blocks for building secure,
auto-updating devices in a generic way, even though all the pieces
may also be used individually.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like
    $SOMEOTHERPROJECT!
    — Well, to my knowledge there’s no tool that
    integrates this nicely with your project’s development tree, and can
    do dm-verity and UEFI SecureBoot and all that stuff for you. So
    nope, I don’t think this is exactly like $SOMEOTHERPROJECT, thank you
    very much.

  2. What about creating MBR/DOS partition images? — That’s really
    out of focus to me. This is an exercise in figuring out how generic
    OSes and devices in the future should be built and an attempt to
    commoditize OS image building. And no, the future doesn’t speak MBR,
    sorry. That said, I’d be quite interested in adding support for
    booting on Raspberry Pi, possibly using a hybrid approach, i.e. using
    a GPT disk label, but arranging things in a way that the Raspberry Pi
    boot protocol (which is built around DOS partition tables), can still
    work.

  3. Is this portable? — Well, depends what you mean by
    portable. No, this tool runs on Linux only, and as it uses
    systemd-nspawn during the build process it doesn’t run on
    non-systemd systems either. But then again, you should be able to
    create images for any architecture you like with it, but of course if
    you want the image bootable on bare-metal systems, only systems doing
    UEFI are supported (but systemd-nspawn should still work fine on
    them).

  4. Where can I get this stuff? — Try
    GitHub. And some distributions
    carry packaged versions, but I think none of them carry the current v3
    yet.

  5. Is this a systemd project? — Yes, it’s hosted under the
    systemd GitHub umbrella. And yes,
    during run-time systemd-nspawn in a current version is required. But
    no, the code-bases are separate otherwise, already because systemd
    is a C project, and mkosi Python.

  6. Requiring systemd 233 is a pretty steep requirement, no?
    — Yes, but the feature we need kind of matters (systemd-nspawn‘s
    --overlay= switch), and again, this isn’t supposed to be a tool for
    legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am
    not an LXC nor Docker guy. If you select directory or subvolume
    as image type, LXC should be able to boot the generated images just
    fine, but I didn’t try. Last time I looked, Docker doesn’t permit
    running proper init systems as PID 1 inside the container, as they
    define their own run-time without intention to emulate a proper
    system. Hence, no I don’t think it will work, at least not with an
    unpatched Docker version. That said, again, don’t ask me questions
    about Docker, it’s not precisely my area of expertise, and quite
    frankly I am not a fan. To my knowledge neither LXC nor Docker are
    able to run containers directly off GPT disk images, hence the
    various raw_xyz image types are definitely not compatible with
    either. That means if you want to generate a single raw disk image
    that can be booted unmodified both in a container and on bare-metal,
    then systemd-nspawn is the container manager to go for
    (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that’s up to you really.

If you hack on some complex project and need a quick way to compile
and run your project on a specific current Linux distribution, then
mkosi is an excellent way to do that. Simply drop the mkosi.default
and mkosi.build files in your git tree and everything will be
easy. (And of course, as indicated above: if the project you are
hacking on happens to be called systemd or casync be aware that
those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great
choice too, as it will make it reasonably easy to generate secure
images that are protected against offline modification, by using
dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a
VM or systemd-nspawn container, or a portable service then mkosi
is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd
init systems, old VM managers, Docker, … then no, mkosi is not for
you, but there are plenty of well-established alternatives around that
cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to
accept your patches and other contributions.

Oh, and one unrelated last thing: don’t forget to submit your talk
proposal and/or buy a ticket for All Systems Go! 2017 in Berlin — the
conference where things like systemd, casync and mkosi are
discussed, along with a variety of other Linux userspace projects used
for building systems.

mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html

Introducing mkosi

After blogging about
casync
I realized I never blogged about the
mkosi tool that combines nicely
with it. mkosi has been around for a while already, and its time to
make it a bit better known. mkosi stands for Make Operating System
Image
, and is a tool for precisely that: generating an OS tree or
image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite
well known and popular. But mkosi has a number of features that I
think make it interesting for a variety of use-cases that other tools
don’t cover that well.

What is mkosi?

What are those use-cases, and what does mkosi precisely set apart?
mkosi is definitely a tool with a focus on developer’s needs for
building OS images, for testing and debugging, but also for generating
production images with cryptographic protection. A typical use-case
would be to add a mkosi.default file to an existing project (for
example, one written in C or Python), and thus making it easy to
generate an OS image for it. mkosi will put together the image with
development headers and tools, compile your code in it, run your test
suite, then throw away the image again, and build a new one, this time
without development headers and tools, and install your build
artifacts in it. This final image is then “production-ready”, and only
contains your built program and the minimal set of packages you
configured otherwise. Such an image could then be deployed with
casync (or any other tool of course) to be delivered to your set of
servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on
today’s technology, not yesteryear’s. Specifically this means that
we’ll generate GPT partition tables, not MBR/DOS ones. When you tell
mkosi to generate a bootable image for you, it will make it bootable
on EFI, not on legacy BIOS. The GPT images generated follow
specifications such as the Discoverable Partitions
Specification
,
so that /etc/fstab can remain unpopulated and tools such as
systemd-nspawn can automatically dissect the image and boot from
them.

So, let’s have a look on the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional
options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build
images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same
feature level currently. Also, as mkosi is based on dnf
--installroot
, debootstrap, pacstrap and zypper, and those
packages are not packaged universally on all distributions, you might
not be able to build images for all those distributions on arbitrary
host distributions.

The GPT images are put together in a way that they aren’t just
compatible with UEFI systems, but also with VM and container managers
(that is, at least the smart ones, i.e. VM managers that know UEFI,
and container managers that grok GPT disk images) to a large
degree. In fact, the idea is that you can use mkosi to build a
single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off, using systemd’s RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used
if available to ensure the image is not tampered with (yes, you read
that right, systemd-nspawn and systemd’s RootImage= setting
automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest usage of mkosi is by simply invoking it without
parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you,
will call it image.raw and drop it in the current directory. The
distribution used will be the same one as your host runs.

Of course in most cases you want more control about how the image is
put together, i.e. select package sets, select the distribution, size
partitions and so on. Most of that you can actually specify on the
command line, but it is recommended to instead create a couple of
mkosi.$SOMETHING files and directories in some directory. Then,
simply change to that directory and run mkosi without any further
arguments. The tool will then look in the current working directory
for these files and directories and make use of them (similar to how
make looks for a Makefile…). Every single file/directory is
optional, but if they exist they are honored. Here’s a list of the
files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you
    can configure what kind of image you want, which distribution, which
    packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy
    everything inside it into the images built. You can place arbitrary
    directory hierarchies in here, and they’ll be copied over whatever is
    already in the image, after it was put together by the distribution’s
    package manager. This is the best way to drop additional static files
    into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build
    script. When it exists, mkosi will build two images, one after the
    other in the mode already mentioned above: the first version is the
    build image, and may include various build-time dependencies such as
    a compiler or development headers. The build script is also copied
    into it, and then run inside it. The script should then build
    whatever shall be built and place the result in $DESTDIR (don’t
    worry, popular build tools such as Automake or Meson all honor
    $DESTDIR anyway, so there’s not much to do here explicitly). It may
    also run a test suite, or anything else you like. After the script
    finished, the build image is removed again, and a second image (the
    final image) is built. This time, no development packages are
    included, and the build script is not copied into the image again —
    however, the build artifacts from the first run (i.e. those placed in
    $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked
    inside the image (inside a systemd-nspawn invocation) and can
    adjust the image as it likes at a very late point in the image
    preparation. If mkosi.build exists, i.e. the dual-phased
    development build process is used, then this script will be invoked
    twice: once inside the build image and once inside the final
    image. The first parameter passed to the script clarifies which phase
    it is run in (a brief sketch follows this list).

  5. mkosi.nspawn — If this file exists, it should contain a
    container configuration file for systemd-nspawn (see
    systemd.nspawn(5) for details), which shall be shipped along with
    the final image and shall be included in the check-sum calculations
    (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package
    cache directory for the builds. This directory is effectively bind
    mounted into the image at build time, in order to speed up building
    images. The package installers of the various distributions will
    place their package files here, so that subsequent runs can reuse
    them.

  7. mkosi.passphrase — If this file exists, it should contain a
    pass-phrase to use for the LUKS encryption (if that’s enabled for the
    image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an
    X.509 key pair to use for signing the kernel and initrd for UEFI
    SecureBoot, if that’s enabled.
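
To illustrate the mkosi.postinst hook from item 4 above, here is a
minimal sketch. The exact strings passed as the first argument are an
assumption on my part (I use "build" and "final" here); check the
mkosi documentation for the precise values your version expects:

# cat > mkosi.postinst <<'EOF'
#!/bin/sh
# $1 names the phase this script runs in (assumed values: "build" or "final")
if [ "$1" = "final" ]; then
    # Only tweak the final image, not the throw-away build image,
    # e.g. enable a service shipped by one of the installed packages.
    systemctl enable httpd.service
fi
EOF
# chmod +x mkosi.postinst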

How to use it

So, let’s come back to our most trivial example, without any of the
mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create an image file image.raw in the current
directory. How do we use it? Of course, we could dd it onto some USB
stick and boot it on a bare-metal device. However, it’s much simpler
to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let’s make things more interesting. Let’s still not use any of
the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar to the above, but we made three changes: it’s no
longer GPT + ext4, but GPT + btrfs. Moreover, the system is made
bootable on UEFI systems, and finally, the output is now called
foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except
that this uses full VM virtualization rather than container
virtualization. (Note that the way to run a UEFI qemu/kvm instance
appears to change all the time and is different on the various
distributions. It’s quite annoying, and I can’t really tell you what
the right qemu command line is to make this work on your system.)

Of course, it’s not all raw GPT disk images with mkosi. Let’s try
a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as a plain directory you can’t boot
it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask
mkosi to generate a GPT image with a read-only squashfs as root,
compress the result with xz, and generate a SHA256SUMS file with
the hashes of the generated artifacts. The image will contain the
SSH client as well as everybody’s favorite editor.
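
On the consuming side, the generated SHA256SUMS file can be checked
with standard tools. A minimal sketch, assuming you are in the
directory containing the downloaded artifacts:

# sha256sum -c SHA256SUMS

If you additionally sign the checksum file (see the --sign switch
discussed below), a gpg --verify run on the resulting signature
completes the picture.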

Now, let’s make use of the various mkosi.$SOMETHING files. Let’s
say we are working on some Automake-based project and want to make it
easy to generate a disk image off the development tree with the
version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let’s add a build script:

# cat > mkosi.build <<EOF
#!/bin/sh
./autogen.sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let’s try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you’ll notice that building an image like
this can be quite slow. And slow build times are actively hurtful to
your productivity as a developer. Hence let’s make things a bit
faster. First, let’s make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and
generate less network traffic) as the packages will now be downloaded
only once and reused. However, you’ll notice that unpacking all those
packages and the rest of the work is still quite slow. But mkosi can
help you with that. Simply use mkosi‘s incremental build feature. In
this mode mkosi will make a copy of the build and final images
immediately before dropping in your build sources or artifacts, so
that building an image becomes a lot quicker: instead of always
starting totally from scratch a build will now reuse everything it can
reuse from a previous run, and immediately begin with building your
sources rather than the build image to build your sources in. To
enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated
anymore from your distribution’s servers, as the cached copy is made
after all packages are installed, and hence until you actually delete
the cached copy the distribution’s network servers aren’t contacted
again and no RPMs or DEBs are downloaded. This means the distribution
you use becomes “frozen in time” this way. (Which might be a bad
thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you’ll notice that it
won’t overwrite the generated image when it already exists. You can
either delete the file yourself first (rm image.raw) or let mkosi
do it for you right before building a new image, with mkosi -f. You
can also tell mkosi to not only remove any such pre-existing images,
but also remove any cached copies of the incremental feature, by using
-f twice.

I wrote mkosi originally in order to test systemd, and quickly
generate a disk image of various distributions with the most current
systemd version from git, without all that affecting my host system. I
regularly use mkosi for that today, in incremental mode. The two
commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the
very newest set of RPMs provided by Fedora, instead of a cached
snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git
tree: mkosi.default and mkosi.build. This way, any developer who
wants to quickly test something with current systemd git, or wants to
prepare a patch based on it and test it can check out the systemd
repository and simply run mkosi in it and a few minutes later he has
a bootable image he can test in systemd-nspawn or KVM. casync has
similar files: mkosi.default and mkosi.build.
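
In practice that workflow might look roughly like this (a sketch; the
repository URL is the upstream systemd GitHub location, and mkosi has
to be run as root):

# git clone https://github.com/systemd/systemd.git
# cd systemd
# mkosi -i && systemd-nspawn -bi image.raw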

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled
    disk images if you ask for it. For that use the --verity switch on
    the command line or Verity= setting in mkosi.default. Of course,
    dm-verity implies that the root volume is read-only. In this mode
    the top-level dm-verity hash will be placed along-side the output
    disk image in a file named the same way, but with the .roothash
    suffix. If the image is to be created bootable, the root hash is also
    included on the kernel command line in the roothash= parameter,
    which current systemd versions can use to both find and activate the
    root partition in a dm-verity protected way. BTW: it’s a good idea
    to combine this dm-verity mode with the raw_squashfs image mode,
    to generate a genuinely protected, compressed image suitable for
    running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum
    file SHA256SUMS for you (--checksum) covering all the files it
    outputs (which could be the image file itself, a matching .nspawn
    file using the mkosi.nspawn file mentioned above, as well as the
    .roothash file for the dm-verity root hash.) It can then
    optionally sign this with gpg (--sign). Note that systemd‘s
    machinectl pull-tar and machinectl pull-raw command can download
    these files and the SHA256SUMS file automatically and verify things
    on download. In other words: what mkosi outputs is perfectly
    ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To
    make use of that, place your X.509 key pair in two files
    mkosi.secureboot.crt and mkosi.secureboot.key, and set
    SecureBoot= or --secure-boot. If so, mkosi will sign the
    kernel/initrd/kernel command line combination during the build. Of
    course, if you use this mode, you should also use
    Verity=/--verity=, otherwise the setup makes only partial
    sense. Note that mkosi will not help you with actually enrolling
    the keys you use in your UEFI BIOS. (A sketch combining these
    switches follows this list.)

  4. mkosi has minimal support for git checkouts: when it recognizes
    it is run in a git checkout and you use the mkosi.build script
    stuff, the source tree will be copied into the build image, but with
    all files excluded by .gitignore removed.

  5. There’s support for encryption in place. Use --encrypt= or
    Encrypt=. Note that the UEFI ESP is never encrypted though, and the
    root partition only if explicitly requested. The /home and /srv
    partitions are unconditionally encrypted if that’s enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line
    arguments may be configured for the image to generate.
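
To tie the SecureBoot and dm-verity features together, here is a
minimal sketch that generates a throw-away test key pair and builds a
signed, verity-protected image. Note that the key file names are
spelled both mkosi.secure-boot.* and mkosi.secureboot.* in this post,
so check the mkosi README for the spelling your version expects; also,
the openssl invocation is just one way to create a self-signed test
certificate and is not suitable for production keys:

# openssl req -new -x509 -newkey rsa:2048 -nodes -days 365 \
    -subj "/CN=mkosi SecureBoot test" \
    -keyout mkosi.secure-boot.key -out mkosi.secure-boot.crt
# mkosi -t raw_squashfs --bootable --secure-boot --verity --checksum --sign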

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies,
listed in the
README. Most
notably you need a somewhat recent systemd version to make use of its
full feature set: systemd 233. Older versions are already packaged for
various distributions, but much of what I describe above is only
available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn’t
available in Fedora, but there’s a COPR.

Future

It is my intention to continue turning mkosi into a tool suitable
for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and
systemd/sd-boot native support for A/B IoT style partition
setups. The idea is that the combination of systemd, casync and
mkosi provides generic building blocks for building secure,
auto-updating devices in a generic way, even though all pieces may be
used individually, too.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like
    $SOMEOTHERPROJECT!
    — Well, to my knowledge there’s no tool that
    integrates this nicely with your project’s development tree, and can
    do dm-verity and UEFI SecureBoot and all that stuff for you. So
    nope, I don’t think this is exactly like $SOMEOTHERPROJECT, thank you
    very much.

  2. What about creating MBR/DOS partition images? — That’s really
    out of focus to me. This is an exercise in figuring out how generic
    OSes and devices in the future should be built and an attempt to
    commoditize OS image building. And no, the future doesn’t speak MBR,
    sorry. That said, I’d be quite interested in adding support for
    booting on Raspberry Pi, possibly using a hybrid approach, i.e. using
    a GPT disk label, but arranging things in a way that the Raspberry Pi
    boot protocol (which is built around DOS partition tables) can still
    work.

  3. Is this portable? — Well, depends what you mean by
    portable. No, this tool runs on Linux only, and as it uses
    systemd-nspawn during the build process it doesn’t run on
    non-systemd systems either. But then again, you should be able to
    create images for any architecture you like with it, but of course if
    you want the image bootable on bare-metal systems, only systems doing
    UEFI are supported (but systemd-nspawn should still work fine on
    them).

  4. Where can I get this stuff? — Try
    GitHub. And some distributions
    carry packaged versions, but I think none of them carry the current
    v3 yet.

  5. Is this a systemd project? — Yes, it’s hosted under the
    systemd GitHub umbrella. And yes,
    during run-time systemd-nspawn in a current version is required. But
    no, the code-bases are separate otherwise, if only because systemd
    is a C project, and mkosi a Python one.

  6. Requiring systemd 233 is a pretty steep requirement, no?
    Yes, but the feature we need kind of matters (systemd-nspawn‘s
    --overlay= switch), and again, this isn’t supposed to be a tool for
    legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am
    not an LXC nor Docker guy. If you select directory or subvolume
    as image type, LXC should be able to boot the generated images just
    fine, but I didn’t try. Last time I looked, Docker doesn’t permit
    running proper init systems as PID 1 inside the container, as they
    define their own run-time without intention to emulate a proper
    system. Hence, no I don’t think it will work, at least not with an
    unpatched Docker version. That said, again, don’t ask me questions
    about Docker, it’s not precisely my area of expertise, and quite
    frankly I am not a fan. To my knowledge neither LXC nor Docker are
    able to run containers directly off GPT disk images, hence the
    various raw_xyz image types are definitely not compatible with
    either. That means if you want to generate a single raw disk image
    that can be booted unmodified both in a container and on bare-metal,
    then systemd-nspawn is the container manager to go for
    (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that’s up to you really.

If you hack on some complex project and need a quick way to compile
and run your project on a specific current Linux distribution, then
mkosi is an excellent way to do that. Simply drop the mkosi.default
and mkosi.build files in your git tree and everything will be
easy. (And of course, as indicated above: if the project you are
hacking on happens to be called systemd or casync be aware that
those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great
choice too, as it will make it reasonably easy to generate secure
images that are protected against offline modification, by using
dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a
VM or systemd-nspawn container, or a portable service then mkosi
is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd
init systems, old VM managers, Docker, … then no, mkosi is not for
you, but there are plenty of well-established alternatives around that
cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to
accept your patches and other contributions.

Oh, and one unrelated last thing: don’t forget to submit your talk
proposal and/or buy a ticket for All Systems Go! 2017 in Berlin — the
conference where things like systemd, casync and mkosi are
discussed, along with a variety of other Linux userspace projects used
for building systems.

Kotlin and Groovy JVM Languages with AWS Lambda

Post Syndicated from Juan Villa original https://aws.amazon.com/blogs/compute/kotlin-and-groovy-jvm-languages-with-aws-lambda/


Juan Villa – Partner Solutions Architect

 

When most people hear “Java” they think of Java the programming language. Java is a lot more than a programming language; it also implies a larger ecosystem including the Java Virtual Machine (JVM). Java, the programming language, is just one of the many languages that can be compiled to run on the JVM. Some of the most popular JVM languages, other than Java, are Clojure, Groovy, Scala, Kotlin, JRuby, and Jython (see this link for a list of more JVM languages).

Did you know that you can compile and subsequently run all these languages on AWS Lambda?

AWS Lambda supports the Java 8 runtime, but this does not mean you are limited to the Java language. The Java 8 runtime is capable of running JVM languages such as Kotlin and Groovy once they have been compiled and packaged as a “fat” JAR (a JAR file containing all necessary dependencies and classes bundled in).

In this blog post, we’ll work through building AWS Lambda functions in both the Kotlin and Groovy programming languages. To compile and package our projects, we will use the Gradle build tool.

To follow along, please clone the Git repository available at GitHub here. Also, I recommend using an Integrated Development Environment (IDE) such as JetBrains’ IntelliJ IDEA; this is the IDE I used while working on these projects.

Kotlin

Kotlin is a statically-typed JVM language designed and developed by JetBrains (one of our Amazon Partner Network Technology partners) and the open source community. Compared to Java the programming language, Kotlin has additional powerful language features such as: Data Classes, Default Arguments, Extensions, Elvis Operator, and Destructuring Declarations. This is just a short list of Kotlin’s powerful language features. For a more thorough list of features, and how to use them, refer to the full documentation of the Kotlin language.

Let’s jump right into the code and see what an AWS Lambda function looks like in Kotlin.

package com.aws.blog.jvmlangs.kotlin

import java.io.*
import com.fasterxml.jackson.module.kotlin.*

data class HandlerInput(val who: String)
data class HandlerOutput(val message: String)

class Main {
    val mapper = jacksonObjectMapper()

    fun handler(input: InputStream, output: OutputStream): Unit {
        val inputObj = mapper.readValue<HandlerInput>(input)
        mapper.writeValue(output, HandlerOutput("Hello ${inputObj.who}"))
    }
}

The above example is a very simple Hello World application that accepts as an input a JSON object containing a key called “who” and returns a JSON object containing a key called “message” with a value of “Hello {who}”.

AWS Lambda does not support serializing JSON objects into Kotlin data classes, but don’t worry! AWS Lambda supports passing an input object as a Stream, and also supports an output Stream for returning a result (see this link for more information). Combined with the Input/Output Stream form of the handler function, we are using the Jackson library with a Kotlin extension function to support serialization and deserialization of Kotlin data class types.

To get started with this example, let’s first compile and package the Kotlin project.

git clone https://github.com/awslabs/lambda-kotlin-groovy-example
cd lambda-kotlin-groovy-example/kotlin
./gradlew shadowJar

Once packaged, a JAR file containing all necessary dependencies will be available at “build/libs/jvmlangs-kotlin-1.0-SNAPSHOT-all.jar”. Now let’s deploy this package to AWS Lambda.

To deploy the lambda function, we will be using the AWS Command Line Interface (CLI). You can find information on how to set up the AWS CLI here. This tool allows you to set up and manage AWS services via the command line.

aws lambda create-function --region us-east-1 --function-name kotlin-hello \
--zip-file fileb://build/libs/jvmlangs-kotlin-1.0-SNAPSHOT-all.jar \
--role arn:aws:iam::<account_id>:role/lambda_basic_execution \
--handler com.aws.blog.jvmlangs.kotlin.Main::handler --runtime java8 \
--timeout 15 --memory-size 128

Once deployed, we can test the function by invoking the lambda function from the CLI.

aws lambda invoke --function-name kotlin-hello --payload '{"who": "AWS Fan"}' output.txt
cat output.txt

If successful, you’ll see an output of “{"message":"Hello AWS Fan"}”.
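
If you later change the Kotlin code, you do not need to recreate the function; rebuilding the fat JAR and pushing the new package is enough. A small sketch using the same function name as above:

./gradlew shadowJar
aws lambda update-function-code --function-name kotlin-hello \
--zip-file fileb://build/libs/jvmlangs-kotlin-1.0-SNAPSHOT-all.jar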

Groovy

Groovy is an optionally typed JVM language with both dynamic and static typing capabilities. Groovy is currently being supported by the Apache Software Foundation. Like Kotlin, Groovy also packs a lot of powerful features such as: Closures, Dynamic Typing, Collection Literals, String Interpolation, and Elvis Operator. This is just a short list, see the full documentation for a list of features and how to use them.

Once again, let’s jump right into the code.

package com.aws.blog.jvmlangs.groovy

class HandlerInput {
    String who
}
class HandlerOutput {
    String message
}

class Main {
    def handler(HandlerInput input) {
        return new HandlerOutput(message: "Hello ${input.who}")
    }
}

Just like the Kotlin example, we have defined a function that takes a simple JSON object containing a “who” key value and builds a response containing a “message” key. Note that in this case we are not using the Input/Output Stream form of the handler function, but rather we are letting AWS Lambda serialize the input JSON object into the type HandlerInput. To accomplish this, AWS Lambda uses the Jackson library and handles the serialization for us.

Let’s go ahead and compile and package this Groovy example.

git clone https://github.com/awslabs/lambda-kotlin-groovy-example
cd lambda-kotlin-groovy-example/groovy
./gradlew shadowJar

Once packaged, a JAR file containing all necessary dependencies will be available at “build/libs/jvmlangs-groovy-1.0-SNAPSHOT-all.jar”. Now let’s deploy this package to AWS Lambda.

aws lambda create-function --region us-east-1 --function-name groovy-hello \
--zip-file fileb://build/libs/jvmlangs-groovy-1.0-SNAPSHOT-all.jar \
--role arn:aws:iam::<account_id>:role/lambda_basic_execution \
--handler com.aws.blog.jvmlangs.groovy.Main::handler --runtime java8 \
--timeout 15 --memory-size 128

Once deployed, we can test the function by invoking the lambda function from the CLI.

aws lambda invoke --function-name groovy-hello --payload '{"who": "AWS Fan"}' output.txt
cat output.txt

If successful, you’ll see an output of “{"message":"Hello AWS Fan"}”.

Gradle Build Tool

Finally, let’s touch on how we built the JAR packages from the Kotlin and Groovy sources above. To build the JARs we used the Gradle build tool. Gradle builds a project by reading instructions from a file called “build.gradle”. This is a file written in Gradle’s Groovy Domain Specific Language (DSL). You can find more information on the Gradle build file by looking at their documentation. Let’s take a look at the Gradle build files we used for this post.

For the Kotlin example, this is the build file we used.

buildscript {
    repositories {
        mavenCentral()
        jcenter()
    }
    dependencies {
        classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
        classpath "com.github.jengelman.gradle.plugins:shadow:1.2.3"
    }
}

group 'com.aws.blog.jvmlangs.kotlin'
version '1.0-SNAPSHOT'

apply plugin: 'kotlin'
apply plugin: 'com.github.johnrengelman.shadow'

repositories {
    mavenCentral()
}

dependencies {
    compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
    compile "com.fasterxml.jackson.module:jackson-module-kotlin:2.8.2"
}

For the Groovy example this is the build file we used.

buildscript {
    repositories {
        jcenter()
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.3'
    }
}

group 'com.aws.blog.jvmlangs.groovy'
version '1.0-SNAPSHOT'

apply plugin: 'groovy'
apply plugin: 'com.github.johnrengelman.shadow'

repositories {
    mavenCentral()
}

dependencies {
    compile 'org.codehaus.groovy:groovy-all:2.3.11'
    testCompile group: 'junit', name: 'junit', version: '4.11'
}

As you can see, the build files for both Kotlin and Groovy files are very similar. For the Kotlin project we define a dependency on the Jackson Kotlin module. Also, for each respective language we include the language supporting libraries (kotlin-stdlib and groovy-all respectively).

In addition, you will notice that we are using a plugin called “shadow”. We use this plugin to package all the project dependencies into one JAR by using the Gradle task “shadowJar”. You can find more information on Shadow in their documentation.
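
If you want to confirm that the shadow plugin really did bundle your dependencies, you can list the contents of the resulting JAR. A quick sketch (the jar tool ships with the JDK; unzip -l works just as well):

jar tf build/libs/jvmlangs-kotlin-1.0-SNAPSHOT-all.jar | head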

Final Words

Don’t stop here though! Take a look at other JVM languages and get them running on AWS Lambda with the Java 8 runtime. Maybe start with Clojure? or Scala?

Also take a look at the AWS Lambda Java libraries provided by AWS. They provide interfaces and models that make events from event sources easier to handle.

How to Control TLS Ciphers in Your AWS Elastic Beanstalk Application by Using AWS CloudFormation

Post Syndicated from Paco Hope original https://aws.amazon.com/blogs/security/how-to-control-tls-ciphers-in-your-aws-elastic-beanstalk-application-by-using-aws-cloudformation/

Securing data in transit is critical to the integrity of transactions on the Internet. Whether you log in to an account with your user name and password or give your credit card details to a retailer, you want your data protected as it travels across the Internet from place to place. One of the protocols in widespread use to protect data in transit is Transport Layer Security (TLS). Every time you access a URL that begins with “https” instead of just “http”, you are using a TLS-secured connection to a website.

To demonstrate that your application has a strong TLS configuration, you can use services like the one provided by SSL Labs. There are also open source, command-line-oriented TLS testing programs such as testssl.sh (which I do not cover in this post) and sslscan (which I cover later in this post). The goal of testing your TLS configuration is to provide evidence that weak cryptographic ciphers are disabled in your TLS configuration and only strong ciphers are enabled. In this blog post, I show you how to control the TLS security options for your secure load balancer in AWS CloudFormation, pass the TLS certificate and host name for your secure AWS Elastic Beanstalk application to the CloudFormation script as parameters, and then confirm that only strong TLS ciphers are enabled on the launched application by testing it with SSLLabs.

Background

In some situations, it’s not enough to simply turn on TLS with its default settings and call it done. Over the years, a number of vulnerabilities have been discovered in the TLS protocol itself with codenames such as CRIME, POODLE, and Logjam. Though some vulnerabilities were in specific implementations, such as OpenSSL, others were vulnerabilities in the Secure Sockets Layer (SSL) or TLS protocol itself.

The only way to avoid some TLS vulnerabilities is to ensure your web server uses only the latest version of TLS. Some organizations want to limit their TLS configuration to the highest possible security levels to satisfy company policies, regulatory requirements, or other information security requirements. In practice, such limitations usually mean using TLS version 1.2 (at the time of this writing, TLS 1.3 is in the works) and using only strong cryptographic ciphers. Note that forcing a high-security TLS connection in this manner limits which types of devices can connect to your web server. I address this point at the end of this post.

The default TLS configuration in most web servers is compatible with the broadest set of clients (such as web browsers, mobile devices, and point-of-sale systems). As a result, older ciphers and protocol versions are usually enabled. This is true for the Elastic Load Balancing load balancer that is created in your Elastic Beanstalk application as well as for web server software such as Apache and nginx.  For example, TLS versions 1.0 and 1.1 are enabled in addition to 1.2. The RC4 cipher is permitted, even though that cipher is too weak for the most demanding security requirements. If your application needs to prioritize the security of connections over compatibility with legacy devices, you must adjust the TLS encryption settings on your application. The solution in this post helps you make those adjustments.

Prerequisites for the solution

Before you implement this solution, you must have a few prerequisites in place:

  1. You must have a hosted zone in Amazon Route 53 where the name of the secure application will be created. I use example.com as my domain name in this post and assume that I host example.com publicly in Route 53. To learn more about creating and hosting a zone publicly in Route 53, see Working with Public Hosted Zones.
  2. You must choose a name to be associated with the secure app. In this case, I use secure.example.com as the DNS name to be associated with the secure app. This means that I’m trying to create an Elastic Beanstalk application whose URL will be https://secure.example.com/.
  3. You must have a TLS certificate hosted in AWS Certificate Manager (ACM). This certificate must be issued with the name you decided in Step 2. If you are new to ACM, see Getting Started. If you are already familiar with ACM, request a certificate and get its Amazon Resource Name (ARN). Look up the ARN for the certificate that you created by opening the ACM console. The ARN looks something like: arn:aws:acm:eu-west-1:111122223333:certificate/12345678-abcd-1234-abcd-1234abcd1234.

Implementing the solution

You can use two approaches to control the TLS ciphers used by your load balancer: one is to use a predefined protocol policy from AWS, and the other is to write your own protocol policy that lists exactly which ciphers should be enabled. There are many ciphers and options that can be set, so the appropriate AWS predefined policy is often the simplest policy to use. If you have to comply with an information security policy that requires enabling or disabling specific ciphers, you will probably find it easiest to write a custom policy listing only the ciphers that are acceptable to your requirements.

AWS released two predefined TLS policies on March 10, 2017: ELBSecurityPolicy-TLS-1-1-2017-01 and ELBSecurityPolicy-TLS-1-2-2017-01. These policies restrict TLS negotiations to TLS 1.1 and 1.2, respectively. You can find a good comparison of the ciphers that these policies enable and disable in the HTTPS listener documentation for Elastic Load Balancing. If your requirements are simply “support TLS 1.1 and later” or “support TLS 1.2 and later,” those AWS predefined cipher policies are the best place to start. If you need to control your cipher choice with a custom policy, I show you in this post which lines of the CloudFormation template to change.

Download the predefined policy CloudFormation template

Many AWS customers rely on CloudFormation to launch their AWS resources, including their Elastic Beanstalk applications. To change the ciphers and protocol versions supported on your load balancer, you must put those options in a CloudFormation template. You can store your site’s TLS certificate in ACM and create the corresponding DNS alias record in the correct zone in Route 53.

To start, download the CloudFormation template that I have provided for this blog post, or deploy the template directly in your environment. This template creates a CloudFormation stack in your default VPC that contains two resources: an Elastic Beanstalk application that deploys a standard sample PHP application, and a Route 53 record in a hosted zone. This CloudFormation template selects the AWS predefined policy called ELBSecurityPolicy-TLS-1-2-2017-01 and deploys it.

Launching the sample application from the CloudFormation console

In the CloudFormation console, choose Create Stack. You can either upload the template through your browser, or load the template into an Amazon S3 bucket and type the S3 URL in the Specify an Amazon S3 template URL box.

After you click Next, you will see that there are three parameters defined: CertificateARN, ELBHostName, and HostedDomainName. Set the CertificateARN parameter to the ARN of the certificate you want to use for your application. Set the ELBHostName parameter to the hostname part of the URL. For example, if your URL were https://secure.example.com/, the HostedDomainName parameter would be example.com and the ELBHostName parameter would be secure.

For the sample application, choose Next and then choose Create, and the CloudFormation stack will be created. For your own applications, you might need to set other options such as a database, VPC options, or Amazon SNS notifications. For more details, see AWS Elastic Beanstalk Environment Configuration. To deploy an application other than our sample PHP application, create your own application source bundle.

Launching the sample application from the command line

In addition to launching the sample application from the console, you can specify the parameters from the command line. Because the template uses parameters, you can launch multiple copies of the application, specifying different parameters for each copy. To launch the application from a Linux command line with the AWS CLI, insert the correct values for your application, as shown in the following command.

aws cloudformation create-stack --stack-name "SecureSampleApplication" \
--template-url https://<URL of your CloudFormation template in S3> \
--parameters ParameterKey=CertificateARN,ParameterValue=<Your ARN> \
ParameterKey=ELBHostName,ParameterValue=<Your Host Name> \
ParameterKey=HostedDomainName,ParameterValue=<Your Domain Name>

When that command exits, it prints the StackID of the stack it created. Save that StackID for later so that you can fetch the stack’s outputs from the command line.
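
If you are scripting this, you can also block until the stack has finished creating before you look up its outputs. A small sketch using the stack name from the earlier command:

aws cloudformation wait stack-create-complete --stack-name "SecureSampleApplication"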

Using a custom cipher specification

If you want to specify your own cipher choices, you can use the same CloudFormation template and change two lines. Let’s assume your information security policies require you to disable any ciphers that use Cipher Block Chaining (CBC) mode encryption. These ciphers are enabled in the ELBSecurityPolicy-TLS-1-2-2017-01 managed policy, so to satisfy that security requirement, you have to modify the CloudFormation template to use your own protocol policy.

In the template, locate the three lines that define the TLSHighPolicy.

- Namespace:  aws:elb:policies:TLSHighPolicy
  OptionName: SSLReferencePolicy
  Value:      ELBSecurityPolicy-TLS-1-2-2017-01

Change the OptionName and Value for the TLSHighPolicy. Instead of referring to the AWS predefined policy by name, explicitly list all the ciphers you want to use. Change those three lines so they look like the following.

- Namespace: aws:elb:policies:TLSHighPolicy
  OptionName: SSLProtocols
  Value:  Protocol-TLSv1.2,Server-Defined-Cipher-Order,ECDHE-ECDSA-AES256-GCM-SHA384,ECDHE-ECDSA-AES128-GCM-SHA256,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-RSA-AES128-GCM-SHA256

This protocol policy stipulates that the load balancer should:

  • Negotiate connections using only TLS 1.2.
  • Ignore any attempts by the client (for example, the web browser or mobile device) to negotiate a weaker cipher.
  • Accept four specific, strong combinations of cipher and key exchange—and nothing else.

The protocol policy enables only TLS 1.2, strong ciphers that do not use CBC mode encryption, and strong key exchange.

Connect to the secure application

When your CloudFormation stack is in the CREATE_COMPLETED state, you will find three outputs:

  1. The public DNS name of the load balancer
  2. The secure URL that was created
  3. TestOnSSLLabs output that contains a direct link for testing your configuration

You can either enter the secure URL in a web browser (for example, https://secure.example.com/), or click the link in the Outputs to open your sample application and see the demo page. Note that you must use HTTPS—this template has disabled HTTP on port 80 and only listens with HTTPS on port 443.

If you launched your application through the command line, you can view the CloudFormation outputs using the command line as well. You need to know the StackId of the stack you launched and insert it in the following stack-name parameter.

aws cloudformation describe-stacks --stack-name "<ARN of Your Stack>" \
--query 'Stacks[0].Outputs'

Test your application over the Internet with SSLLabs

The easiest way to confirm that the load balancer is using the secure ciphers that we chose is to enter the URL of the load balancer in the form on SSL Labs’ SSL Server Test page. If you do not want the name of your load balancer to be shared publicly on SSLLabs.com, select the Do not show the results on the boards check box. After a minute or two of testing, SSLLabs gives you a detailed report of every cipher it tried and how your load balancer responded. This test simulates many devices that might connect to your website, including mobile phones, desktop web browsers, and software libraries such as Java and OpenSSL. The report tells you whether these clients would be able to connect to your application successfully.
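
By the way, if you prefer to kick off the scan from a script instead of the web form, SSL Labs also exposes an HTTP API. A hedged sketch (the endpoint and parameters here come from the publicly documented SSL Labs API, not from any AWS tooling; check the SSL Labs API documentation before relying on them):

curl -s "https://api.ssllabs.com/api/v3/analyze?host=secure.example.com&publish=off"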

Assuming all went well, you should receive an A grade for the sample application. The biggest contributors to the A grade are:

  • Supporting only TLS 1.2, and not TLS 1.1, TLS 1.0, or SSL 3.0
  • Supporting only strong ciphers such as AES, and not weaker ciphers such as RC4
  • Having an X.509 public key certificate issued correctly by ACM

How to test your application privately with sslscan

You might not be able to reach your Elastic Beanstalk application from the Internet because it might be in a private subnet that is only accessible internally. If you want to test the security of your load balancer’s configuration privately, you can use one of the open source command-line tools such as sslscan. You can install and run the sslscan command on any Amazon EC2 Linux instance or even from your own laptop. Be sure that the Elastic Beanstalk application you want to test will accept an HTTPS connection from your Amazon Linux EC2 instance or from your laptop.

The easiest way to get sslscan on an Amazon Linux EC2 instance (a combined sketch follows this list) is to:

  1. Enable the Extra Packages for Enterprise Linux (EPEL) repository.
  2. Run sudo yum install sslscan.
  3. After the command runs successfully, run sslscan secure.example.com to scan your application for supported ciphers.
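
Put together, those three steps might look roughly like this on an Amazon Linux instance (a sketch; it assumes the EPEL repository definition is already present on the AMI and that yum-config-manager is available via yum-utils):

sudo yum-config-manager --enable epel
sudo yum install -y sslscan
sslscan secure.example.com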

The results are similar to Qualys’ results at SSLLabs.com, but the sslscan tool does not summarize and evaluate the results to assign a grade. It just reports whether your application accepted a connection using the cipher that it tried. You must decide for yourself whether that set of accepted connections represents the right level of security for your application. If you have been asked to build a secure load balancer that meets specific security requirements, the output from sslscan helps to show how the security of your application is configured.

The following sample output shows a small subset of the total output of the sslscan tool.

Accepted  TLS12  256 bits  AES256-GCM-SHA384
Accepted  TLS12  256 bits  AES256-SHA256
Accepted  TLS12  256 bits  AES256-SHA
Rejected  TLS12  256 bits  CAMELLIA256-SHA
Failed    TLS12  256 bits  PSK-AES256-CBC-SHA
Rejected  TLS12  128 bits  ECDHE-RSA-AES128-GCM-SHA256
Rejected  TLS12  128 bits  ECDHE-ECDSA-AES128-GCM-SHA256
Rejected  TLS12  128 bits  ECDHE-RSA-AES128-SHA256

An Accepted connection is one that was successful: the load balancer and the client were both able to use the indicated cipher. Failed and Rejected connections are connections where the load balancer would not accept the level of security that the client was requesting. As a result, the load balancer closed the connection instead of communicating insecurely. The difference between Failed and Rejected is based on whether the TLS connection was closed cleanly.

Comparing the two policies

The main difference between our custom policy and the AWS predefined policy is whether or not CBC ciphers are accepted. The test results with both policies are identical except for the results shown in the following table. The only change in the policy, and therefore the only change in the results, is that the cipher suites using CBC ciphers have been disabled.

Cipher Suite Name              Encryption Algorithm  Key Size (bits)  ELBSecurityPolicy-TLS-1-2-2017-01  Custom Policy
ECDHE-RSA-AES256-GCM-SHA384    AESGCM                256              Enabled                            Enabled
ECDHE-RSA-AES256-SHA384        AES                   256              Enabled                            Disabled
AES256-GCM-SHA384              AESGCM                256              Enabled                            Disabled
AES256-SHA256                  AES                   256              Enabled                            Disabled
ECDHE-RSA-AES128-GCM-SHA256    AESGCM                128              Enabled                            Enabled
ECDHE-RSA-AES128-SHA256        AES                   128              Enabled                            Disabled
AES128-GCM-SHA256              AESGCM                128              Enabled                            Disabled
AES128-SHA256                  AES                   128              Enabled                            Disabled

Strong ciphers and compatibility

The custom policy described in the previous section prevents legacy devices and older versions of software and web browsers from connecting. The output at SSLLabs provides a list of devices and applications (such as Internet Explorer 10 on Windows 7) that cannot connect to an application that uses the TLS policy. By design, the load balancer will refuse to connect to a device that is unable to negotiate a connection at the required levels of security. Users who use legacy software and devices will see different errors, depending on which device or software they use (for example, Internet Explorer on Windows, Chrome on Android, or a legacy mobile application). The error messages will be some variation of “connection failed” because the Elastic Load Balancer closes the connection without responding to the user’s request. This behavior can be problematic for websites that must be accessible to older desktop operating systems or older mobile devices.

If you need to support legacy devices, adjust the TLSHighPolicy in the CloudFormation template. For example, if you need to support web browsers on Windows 7 systems (and you cannot enable TLS 1.2 support on those systems), you can change the policy to enable TLS 1.1. To do this, change the value of SSLReferencePolicy to ELBSecurityPolicy-TLS-1-1-2017-01.

Enabling legacy protocol versions such as TLS version 1.1 will allow older devices to connect, but then the application may not be compliant with the information security policies or business requirements that require strong ciphers.

Conclusion

Using Elastic Beanstalk, Route 53, and ACM can help you launch secure applications that are designed to not only protect data but also meet regulatory compliance requirements and your information security policies. The TLS policy, either custom or predefined, allows you to control exactly which cryptographic ciphers are enabled on your Elastic Load Balancer. The TLS test results provide you with clear evidence you can use to demonstrate compliance with security policies or requirements. The parameters in this post’s CloudFormation template also make it adaptable and reusable for multiple applications. You can use the same template to launch different applications on different secure URLs by simply changing the parameters that you pass to the template.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the CloudFormation forum.

– Paco

How to Update AWS CloudHSM Devices and Client Instances to the Software and Firmware Versions Supported by AWS

Post Syndicated from Tracy Pierce original https://aws.amazon.com/blogs/security/how-to-update-aws-cloudhsm-devices-and-client-instances-to-the-software-and-firmware-versions-supported-by-aws/

As I explained in my previous Security Blog post, a hardware security module (HSM) is a hardware device designed with the security of your data and cryptographic key material in mind. It is tamper-resistant hardware that prevents unauthorized users from attempting to pry open the device, plug in any extra devices to access data or keys such as subtokens, or damage the outside housing. The HSM device AWS CloudHSM offers is the Luna SA 7000 (also called Safenet Network HSM 7000), which is created by Gemalto. Depending on the firmware version you install, many of the security properties of these HSMs will have been validated under Federal Information Processing Standard (FIPS) 140-2, a standard issued by the National Institute of Standards and Technology (NIST) for cryptography modules. These standards are in place to protect the integrity and confidentiality of the data stored on cryptographic modules.

To help ensure its continued use, functionality, and support from AWS, we suggest that you update your AWS CloudHSM device software and firmware as well as the client instance software to current versions offered by AWS. As of the publication of this blog post, the current non-FIPS-validated versions are 5.4.9/client, 5.3.13/software, and 6.20.2/firmware, and the current FIPS-validated versions are 5.4.9/client, 5.3.13/software, and 6.10.9/firmware. (The firmware version determines FIPS validation.) It is important to know your current versions before updating so that you can follow the correct update path.

In this post, I demonstrate how to update your current CloudHSM devices and client instances so that you are using the most current versions of software and firmware. If you contact AWS Support for CloudHSM hardware and application issues, you will be required to update to these supported versions before proceeding. Also, any newly provisioned CloudHSM devices will use these supported software and firmware versions only, and AWS does not offer “downgrade options.”

Note: Before you perform any updates, check with your local CloudHSM administrator and application developer to verify that these updates will not conflict with your current applications or architecture.

Overview of the update process

To update your client and CloudHSM devices, you must use both update paths offered by AWS. The first path involves updating the software on your client instance, also known as a control instance. Following the second path updates the software first and then the firmware on your CloudHSM devices. The CloudHSM software must be updated before the firmware because of the firmware’s dependencies on the software in order to work appropriately.

As I demonstrate in this post, the correct update order is:

  1. Updating your client instance
  2. Updating your CloudHSM software
  3. Updating your CloudHSM firmware

To update your client instance, you must have the private SSH key you created when you first set up your client environment. This key is used to connect via SSH protocol on port 22 of your client instance. If you have more than one client instance, you must repeat this connection and update process on each of them. The following diagram shows the flow of an SSH connection from your local network to your client instances in the AWS Cloud.

Diagram that shows the flow of an SSH connection from your local network to your client instances in the AWS Cloud

After you update your client instance to the most recent software (5.3.13), you then must update the CloudHSM device software and firmware. First, you must initiate an SSH connection from any one client instance to each CloudHSM device, as illustrated in the following diagram. A successful SSH connection will have you land at the Luna shell, denoted by lunash:>. Second, you must be able to initiate a Secure Copy (SCP) of files to each device from the client instance. Because the software and firmware updates require an elevated level of privilege, you must have the Security Officer (SO) password that you created when you initialized your CloudHSM devices.

Diagram illustrating the initiation of an SSH connection from any one client instance to each CloudHSM device

After you have completed all updates, you can receive enhanced troubleshooting assistance from AWS, if you need it. When new versions of software and firmware are released, AWS performs extensive testing to ensure your smooth transition from version to version.

Detailed guidance for updating your client instance, CloudHSM software, and CloudHSM firmware

1.  Updating your client instance

Let’s start by updating your client instances. My client instance and CloudHSM devices are in the eu-west-1 region, but these steps work the same in any AWS region. Because Gemalto offers client instances in both Linux and Windows, I will cover steps to update both. I will start with Linux. Please note that all commands should be run as the “root” user.

Updating the Linux client

  1. SSH from your local network into the client instance. You can do this from Linux or Windows. Typically, you would do this from the directory where you have stored your private SSH key by using a command like the following in a terminal or PuTTY session. This initiates the SSH connection by pointing to the path of your SSH key and denoting the user name and IP address of your client instance (the user name and address placeholders below stand in for your own values).
    ssh -i /Users/Bob/Keys/CloudHSM_SSH_Key.pem <user_name>@<client_instance_ip>

  2. After the SSH connection is established, you must stop all applications and services on the instance that are using the CloudHSM device. This is required because you are removing old software and installing new software in its place. After you have stopped all applications and services, you can move on to remove the existing version of the client software.
    /usr/safenet/lunaclient/bin/uninstall.sh

This command will remove the old client software, but will not remove your configuration file or certificates. These will be saved in the Chrystoki.conf file of your /etc directory and your /usr/safenet/lunaclient/cert directory. Do not delete these files because you will lose the configuration of your CloudHSM devices and client connections.

  3. Download the new client software package: cloudhsm-safenet-client. Double-click it to extract the archive, and then run the installer.
    SafeNet-Luna-client-5-4-9/linux/64/install.sh

    Make sure you choose the Luna SA option when presented with it. Because the directory where your certificates are installed is the same, you do not need to copy these certificates to another directory. You do, however, need to ensure that the Chrystoki.conf file, located at /etc/Chrystoki.conf, has the same path and name for the certificates as when you created them. (The path or names should not have changed, but you should still verify they are same as before the update.)

  4. Check that the PATH environment variable points to the directory /usr/safenet/lunaclient/bin so that there are no issues when you restart applications and services; a quick sketch of this check follows the list. The update process for your Linux client instance is now complete.
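
For that last PATH check, a quick sketch (assuming a bash-style shell on the client instance):

echo "$PATH" | grep -q "/usr/safenet/lunaclient/bin" || export PATH="$PATH:/usr/safenet/lunaclient/bin"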

Updating the Windows client

Use the following steps to update your Windows client instances:

  1. Use Remote Desktop Protocol (RDP) from your local network into the client instance. You can accomplish this with the RDP application of your choice.
  2. After you establish the RDP connection, stop all applications and services on the instance that are using the CloudHSM device. This is required because you will remove old software and install new software in its place or overwrite it. If your client software version is older than 5.4.1, you need to completely remove it and all patches by using Programs and Features in the Windows Control Panel. If your client software version is 5.4.1 or newer, proceed without removing the software. Your configuration file will remain intact in the crystoki.ini file of your C:\Program Files\SafeNet\Lunaclient\ directory. All certificates are preserved in the C:\Program Files\SafeNet\Lunaclient\cert\ directory. Again, do not delete these files, or you will lose all configuration and client connection data.
  3. After you have completed these steps, download the new client software: cloudhsm-safenet-client. Extract the archive from the downloaded file, and launch the SafeNet-Luna-client-5-4-9\win\64\Lunaclient.msi installer. Choose the Luna SA option when it is presented to you. Because the directory where your certificates are installed is the same, you do not need to copy these certificates to another directory. You do, however, need to ensure that the crystoki.ini file, which is located at C:\Program Files\SafeNet\Lunaclient\crystoki.ini, has the same path and name for the certificates as when you created them. (The path and names should not have changed, but you should still verify they are the same as before the update.)
  4. Make one last check to ensure the PATH environment variable points to the directory C:\Program Files\SafeNet\Lunaclient\ to help ensure no issues when you restart applications and services. The update process for your Windows client instance is now complete.

2.  Updating your CloudHSM software

Now that your clients are up to date with the most current software version, it’s time to move on to your CloudHSM devices. A few important notes:

  • Back up your data to a Luna SA Backup device. AWS does not sell or support the Luna SA Backup devices, but you can purchase them from Gemalto. We do, however, offer the steps to back up your data to a Luna SA Backup device. Do not update your CloudHSM devices without backing up your data first.
  • If the names of your clients used for Network Trust Link Service (NTLS) connections have a capital “T” as the eighth character, the client will not work after this update. This is because of a Gemalto naming convention. Before upgrading, ensure you modify your client names accordingly. The NTLS connection uses a two-way digital certificate authentication and SSL data encryption to protect sensitive data transmitted between your CloudHSM device and the client instances.
  • The syslog configuration for the CloudHSM devices will be lost. After the update is complete, notify AWS Support and we will update the configuration for you.

Now on to updating the software versions. There are actually three different update paths to follow, and I will cover each. Depending on the current software versions on your CloudHSM devices, you might need to follow all three or just one.

Updating the software from version 5.1.x to 5.1.5

If you are running any version of the software older than 5.1.5, you must first update to version 5.1.5 before proceeding. To update to version 5.1.5:

  1. Stop all applications and services that access the CloudHSM device.
  2. Download the Luna SA software update package.
  3. Extract all files from the archive.
  4. Run the following command from your client instance to copy the lunasa_update-5.1.5-2.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> lunasa_update-5.1.5-2.spkg manager@<hsm_ip_address>:

    <private_key_file> is the private portion of your SSH key pair and <hsm_ip_address> is the IP address of your CloudHSM elastic network interface (ENI). The ENI is the network endpoint that permits access to your CloudHSM device. The IP address was supplied to you when the CloudHSM device was provisioned.

  5. Use the following command to connect to your CloudHSM device and log in with your Security Officer (SO) password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  6. Run the following commands to verify and then install the updated Luna SA software package.
    lunash:> package verify lunasa_update-5.1.5-2.spkg -authcode <auth_code>

    lunash:> package update lunasa_update-5.1.5-2.spkg -authcode <auth_code>

    The value you will use for <auth_code> is contained in the lunasa_update-5.1.5-2.auth file found in the 630-010165-018_reva.tar archive you downloaded in Step 2.

  7. Reboot the CloudHSM device by running the following command.
    lunash:> sysconf appliance reboot

When all the steps in this section are completed, you will have updated your CloudHSM software to version 5.1.5. You can now move on to update to version 5.3.10.

Updating the software to version 5.3.10

You can update to version 5.3.10 only if you are currently running version 5.1.5. To update to version 5.3.10:

  1. Stop all applications and services that access the CloudHSM device.
  2. Download the v 5.3.10 Luna SA software update package.
  3. Extract all files from the archive.
  4. Run the following command to copy the lunasa_update-5.3.10-7.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> lunasa_update-5.3.10-7.spkg manager@<hsm_ip_address>:

    <private_key_file> is the private portion of your SSH key pair and <hsm_ip_address> is the IP address of your CloudHSM ENI.

  5. Run the following command to connect to your CloudHSM device and log in with your SO password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  6. Run the following commands to verify and then install the updated Luna SA software package.
    lunash:> package verify lunasa_update-5.3.10-7.spkg -authcode <auth_code>

    lunash:> package update lunasa_update-5.3.10-7.spkg -authcode <auth_code>

The value you will use for <auth_code> is contained in the lunasa_update-5.3.10-7.auth file found in the SafeNet-Luna-SA-5-3-10.zip archive you downloaded in Step 2.

  7. Reboot the CloudHSM device by running the following command.
    lunash:> sysconf appliance reboot

When all the steps in this section are completed, you will have updated your CloudHSM software to version 5.3.10. You can now move on to update to version 5.3.13.

Note: Do not configure your applog settings at this point; you must first update the software to version 5.3.13 in the following step.

Updating the software to version 5.3.13

You can update to version 5.3.13 only if you are currently running version 5.3.10. If you are not already running version 5.3.10, follow the two update paths mentioned previously in this section.

To update to version 5.3.13:

  1. Stop all applications and services that access the CloudHSM device.
  2. Download the Luna SA software update package.
  3. Extract all files from the archive.
  4. Run the following command to copy the lunasa_update-5.3.13-1.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> lunasa_update-5.3.13-1.spkg manager@<hsm_ip_address>:

<private_key_file> is the private portion of your SSH key pair and <hsm_ip_address> is the IP address of your CloudHSM ENI.

  5. Run the following command to connect to your CloudHSM device and log in with your SO password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  6. Run the following commands to verify and then install the updated Luna SA software package.
    lunash:> package verify lunasa_update-5.3.13-1.spkg -authcode <auth_code>

    lunash:> package update lunasa_update-5.3.13-1.spkg -authcode <auth_code>

The value you will use for <auth_code> is contained in the lunasa_update-5.3.13-1.auth file found in the SafeNet-Luna-SA-5-3-13.zip archive that you downloaded in Step 2.

  7. When updating to this software version, the option to update the firmware is also offered. If you do not require a version of the firmware validated under FIPS 140-2, accept the firmware update to version 6.20.2. If you do require a version of the firmware validated under FIPS 140-2, do not accept the firmware update and instead update by using the steps in the next section, “Updating your CloudHSM FIPS 140-2 validated firmware.”
  8. After updating the CloudHSM device, reboot it by running the following command.
    lunash:> sysconf appliance reboot

  9. Disable NTLS IP checking on the CloudHSM device so that it can operate within its VPC. To do this, run the following command.
    lunash:> ntls ipcheck disable

When all the steps in this section are completed, you will have updated your CloudHSM software to version 5.3.13. If you don’t need the FIPS 140-2 validated firmware, you will have also updated the firmware to version 6.20.2. If you do need the FIPS 140-2 validated firmware, proceed to the next section.

3.  Updating your CloudHSM FIPS 140-2 validated firmware

To update the FIPS 140-2 validated version of the firmware to 6.10.9, use the following steps:

  1. Download version 6.10.9 of the firmware package.
  2. Extract all files from the archive.
  3. Run the following command to copy the 630-010430-010_SPKG_LunaFW_6.10.9.spkg file to the CloudHSM device.
    $ scp -i <private_key_file> 630-010430-010_SPKG_LunaFW_6.10.9.spkg manager@<hsm_ip_address>:

<private_key_file> is the private portion of your SSH key pair, and <hsm_ip_address> is the IP address of your CloudHSM ENI.

  4. Run the following command to connect to your CloudHSM device and log in with your SO password.
    $ ssh -i <private_key_file> manager@<hsm_ip_address>
    
    lunash:> hsm login

  5. Run the following commands to verify and then install the updated Luna SA firmware package.
    lunash:> package verify 630-010430-010_SPKG_LunaFW_6.10.9.spkg -authcode <auth_code>

    lunash:> package update 630-010430-010_SPKG_LunaFW_6.10.9.spkg -authcode <auth_code>

The value you will use for <auth_code> is contained in the 630-010430-010_SPKG_LunaFW_6.10.9.auth file found in the 630-010430-010_SPKG_LunaFW_6.10.9.zip archive that you downloaded in Step 1.

  6. Run the following command to update the firmware of the CloudHSM devices.
    lunash:> hsm update firmware

  7. After you have updated the firmware, reboot the CloudHSM devices to complete the installation.
    lunash:> sysconf appliance reboot

Summary

In this blog post, I walked you through how to update your existing CloudHSM devices and clients so that they are using supported client, software, and firmware versions. Per AWS Support and CloudHSM Terms and Conditions, your devices and clients must use the most current supported software and firmware for continued troubleshooting assistance. Software and firmware versions regularly change based on customer use cases and requirements. Because AWS tests and validates all updates from Gemalto, you must install all updates for firmware and software by using our package links described in this post and elsewhere in our documentation.

If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing this solution, please start a new thread on the CloudHSM forum.

– Tracy

Using AWS Lambda and Amazon DynamoDB in an Automated Approach to Managing AWS CloudFormation Template Parameters and Mappings

Post Syndicated from Anuj Sharma original https://aws.amazon.com/blogs/devops/custom-lookup-using-aws-lambda-and-amazon-dynamodb/

AWS CloudFormation gives you an easy way to codify the creation and management of related AWS resources. The optional Mappings and Parameters sections of CloudFormation templates help you organize and parameterize your templates so you can quickly customize your stack. As organizations adopt Infrastructure as Code best practices, the number of mappings and parameters can quickly grow. In this blog post, we’ll discuss how AWS Lambda and Amazon DynamoDB can be used to simplify updates, reuse, quick lookups, and reporting for these mappings and parameters.

Solution overview

There are three parts to the solution described in this post. You’ll find sample code for this solution in this AWS Labs GitHub repository.

  1. DynamoDB table: Used as a central location to store and update all key-value pairs used in the ‘Mappings’ and ‘Parameters’ sections of the CloudFormation template. This could be a centralized table for the whole organization, with a partition key consisting of the team name and environment (for example, development, test, production) and a sort key for the application name. For more information about the types of keys supported by DynamoDB, see Core Components in the Amazon DynamoDB Developer Guide.

Here is the sample data in the table. “teamname-environment” is the partition key and “appname” is the sort key.

{
        "teamname-environment": "team1-dev",
        "appname": "app1",
        "mappings": {
            "elbsubnet1": "subnet-123456",
            "elbsubnet2": "subnet-234567",
            "appsubnet1": "subnet-345678",
            "appsubnet2": "subnet-456789",
            "vpc": "vpc-123456",
            "appname":"app1",
            "costcenter":"123456",
            "teamname":"team1",
            "environment":"dev",
            "certificate":"arn-123456qwertyasdfgh",
            "compliancetype":"pci",
            "amiid":"ami-123456",
            "region":"us-west-1",
            "publichostedzoneid":"Z234asdf1234asdf",
            "privatehostedzoneid":"Z345SDFGCVHD123",
            "hostedzonename":"demo.internal"
        }
}

  2. Lambda function: Accepts the inputs of the primary keys, looks up the DynamoDB table, and returns all the key-value data. A minimal sketch of such a function appears after this list.
  3. Custom lookup resource: Makes a call to the Lambda function, with the inputs of primary keys (accepted by the Lambda function) and retrieves the key-value data. All CloudFormation templates can duplicate this generic custom resource.
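A minimal sketch of the lookup function might look like the following. This is not the exact code from the AWS Labs repository; the table name (custom-lookup) matches the walkthrough, while the resource property names (TeamEnvironment, AppName) are illustrative assumptions.

import json
import urllib.request

import boto3

dynamodb = boto3.resource("dynamodb")


def send_response(event, context, status, data):
    # Custom resources report back by PUTting a JSON document to a pre-signed S3 URL.
    body = json.dumps({
        "Status": status,
        "Reason": "See CloudWatch Logs: " + context.log_stream_name,
        "PhysicalResourceId": context.log_stream_name,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data,
    }).encode("utf-8")
    request = urllib.request.Request(event["ResponseURL"], data=body, method="PUT",
                                     headers={"Content-Type": ""})
    urllib.request.urlopen(request)


def handler(event, context):
    # Nothing to look up on stack deletion; just acknowledge the request.
    if event["RequestType"] == "Delete":
        send_response(event, context, "SUCCESS", {})
        return

    props = event["ResourceProperties"]       # keys passed in by the custom resource
    table = dynamodb.Table("custom-lookup")
    item = table.get_item(Key={
        "teamname-environment": props["TeamEnvironment"],  # partition key
        "appname": props["AppName"],                        # sort key
    })["Item"]

    # Return the mappings so the template can read them with Fn::GetAtt.
    send_response(event, context, "SUCCESS", item["mappings"])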

 

The diagram shows the interaction of services and resources in this solution.

  1. Users create a DynamoDB table and, using the DynamoDB console, AWS SDK, or AWS CLI, insert the mappings, parameters, and key-value data.
  2. CloudFormation templates have a custom resource, which calls a Lambda function. The combination of team name and application environment (“teamname-environment”) is the partition key input. Application Name (“appname”) is the sort key input to the Lambda function.
  3. The Lambda function queries the DynamoDB table based on the inputs.
  4. DynamoDB responds to the Lambda function with the results.
  5. The Lambda function responds to the custom resource in the CloudFormation stack, with the key-value data.
  6. The key-value data retrieved by the custom resource is then used by other resources in the stack using GetAtt Intrinsic function.

 

Using the sample code

To use the sample code for this solution, follow these steps:

  1. Clone the repository.
git clone https://github.com/awslabs/custom-lookup-lambda.git
cd custom-lookup-lambda
  2. The values in the sample-mappings.json file will be inserted into DynamoDB. Each record in sample-mappings.json corresponds to an item in DynamoDB. The mappings object contains the mapping.
  3. This solution uses Python as a programming model for AWS Lambda. If you haven’t set up your local development environment for Python, follow these steps, and then install the awscli python package.
pip install awscli
  4. To prepare your access keys or assume-role to make calls to AWS, configure the AWS Command Line Interface as described here. The IAM user or the assumed role used to make API calls must have, at minimum, this access. You can attach this policy to the IAM user or IAM group if you are using access keys, or to the IAM role if you are assuming a role. For more information, see Attaching Managed Policies in the IAM User Guide.
  5. Run insertrecord.sh to create the DynamoDB table named custom-lookup and insert the items in sample-mappings.json.
./insertrecord.sh

This script does the following:

  • Installs boto3 and requests Python packages through pip.
  • Executes the Python script to create the DynamoDB table (custom-lookup) and puts the data in sample-mappings.json; a minimal boto3 sketch of this step follows.
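For reference, the table creation and load performed by that Python script can be approximated with a few boto3 calls. This is a hedged sketch rather than the script itself; it assumes sample-mappings.json is a JSON array of items shaped like the example shown earlier, and the throughput values are placeholders.

import json

import boto3

dynamodb = boto3.client("dynamodb")

# Create the lookup table with the partition and sort keys used by the sample data.
dynamodb.create_table(
    TableName="custom-lookup",
    AttributeDefinitions=[
        {"AttributeName": "teamname-environment", "AttributeType": "S"},
        {"AttributeName": "appname", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "teamname-environment", "KeyType": "HASH"},
        {"AttributeName": "appname", "KeyType": "RANGE"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
dynamodb.get_waiter("table_exists").wait(TableName="custom-lookup")

# Put each record from the sample file into the table.
table = boto3.resource("dynamodb").Table("custom-lookup")
with open("sample-mappings.json") as f:
    for record in json.load(f):       # assumes the file is a JSON array of items
        table.put_item(Item=record)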
  6. Run deployer.sh to package and create the Lambda function.
./deployer.sh

This script does the following:

  7. Using sample-stack.yaml, create a stack that makes a call to the Lambda function created in step 6. The function queries the DynamoDB table and produces as output the values corresponding to team1-dev and app1.
aws cloudformation deploy --template-file sample-stack.yaml --stack-name sample-stack

  8. Examine the output in the AWS CloudFormation console. You should see the values retrieved from the DynamoDB table. The Fn::GetAtt function allows you to use the values retrieved by the custom resource Lambda function.

For example, if the custom resource Lambda function is called using resource name as CUSTOMLOOKUP in the sample-stack, the value of key=amiid will be used in the stack using !GetAtt CUSTOMLOOKUP.amiid. Likewise, the value of key=vpc will be used in the stack using !GetAtt CUSTOMLOOKUP.vpc and so on.

Conclusion

In this blog post, we showed how to use an AWS CloudFormation custom resource backed by an AWS Lambda function to query Amazon DynamoDB to retrieve key-value data, thereby replacing the Mappings and Parameter sections of the CloudFormation template. This solution provides a more automated approach to managing template parameters and mappings. You can use the DynamoDB table to simplify updates, reuse, quick lookups, and reporting for these mappings and parameters.

Replicating and Automating Sync-Ups for a Repository with AWS CodeCommit

Post Syndicated from Cherry Zhou original https://aws.amazon.com/blogs/devops/replicating-and-automating-sync-ups-for-a-repository-with-aws-codecommit/

by Chenwei (Cherry) Zhou, Software Development Engineer


 

Many of our customers have expressed interest in the following scenarios:

  • Backing up or replicating an AWS CodeCommit repository to another AWS region.
  • Automatically backing up repositories currently hosted on other services (for example, GitHub or BitBucket) to AWS CodeCommit.

In this blog post, we’ll show you how to automate the replication of a source repository to a repository in AWS CodeCommit. Your source repository could be another AWS CodeCommit repository, a local repository, or a repository hosted on other Git services.

To replicate your repository, you’ll first need to set up a repository in AWS CodeCommit to use as your backup/replica repository. After replicating the contents in your source repository to the backup repository, we’ll demonstrate how you can set up a scheduled job to periodically sync up your source repository with the backup/replica.

Where do I host this?

You can host your local repository and schedule your task on your own machine or on an Amazon EC2 instance. For an example of how to set up an EC2 instance for access to an AWS CodeCommit repository, including a sample AWS CloudFormation template for launching the instance, see Launch an Amazon EC2 Instance to Access the AWS CodeCommit Repository in the AWS for DevOps Guide.

 

Part 1: Set Up a Replica Repository

In this section, we’ll create an AWS CodeCommit repository and replicate your source repository to it.

  1. If you haven’t already done so, set up for AWS CodeCommit. Then follow the steps to create a CodeCommit repository in the region of your choice. Choose a name that will help you remember that this repository is a replica or backup repository. For example, you could create a repository in the US East (Ohio) region and name it MyReplicaRepo. This is the name and region we’ll use in this post.
  2. Use the git clone --mirror command to clone the source repository, including the directory where you want to create the local repo, to your local computer. You are not cloning the repository you just created in AWS CodeCommit. You are cloning the repository you want to replicate or back up to that AWS CodeCommit repository. For example, to clone a sample application created for AWS demonstration purposes and hosted on GitHub (https://github.com/awslabs/aws-demo-php-simple-app.git) to a local repo in a directory named my-repo-replica:
git clone --mirror https://github.com/awslabs/aws-demo-php-simple-app.git my-repo-replica

IMPORTANT

  • DO NOT use your working directory as the local clone repository. Your work-in-progress commits would also be pushed for backup.
  • DO NOT make local changes to this local repository. It should be used for sync-up operations only.
  • DO NOT manually push any changes to this replica repository. It will cause conflicts later when your scheduled job pushes changes in the source repository. Treat it as a read-only repository, and push all of your development changes to your source repository.
  3. Change directories to the directory where you made the clone:
cd my-repo-replica
  4. Use the git remote add RemoteName RemoteRepositoryURL command to add the AWS CodeCommit repository you created as a remote repository for the local repo. Use an appropriate nickname, such as sync. (Because this is a mirror, the default nickname, origin, will already be in use.) For example, to add your AWS CodeCommit repository MyReplicaRepo as a remote for my-repo-replica with the nickname sync:
git remote add sync ssh://git-codecommit.us-east-2.amazonaws.com/v1/repos/MyReplicaRepo

When you push large repositories, consider using SSH instead of HTTPS. When you push a large change, a large number of changes, or a large repository, long-running HTTPS connections are often terminated prematurely due to networking issues or firewall settings. For more information about setting up AWS CodeCommit for SSH, see For SSH Connections on Linux, macOS, or Unix or For SSH Connections on Windows.

Tip

Use the git remote show command to review the list of remotes set for your local repo.

  5. Run the git push sync --mirror command to push to your replica repository.
  • If you named your remote for the replica repository something else, replace sync with your remote name.
  • The --mirror option specifies that all refs under refs/ (which includes, but is not limited to, refs/heads/, refs/remotes/, and refs/tags/) will be mirrored to the remote repository. If you only want to push branches and commits, but don’t care if you push other references such as tags, you can use the --all option instead.

 

Your replica repository is now ready for sync-up operations. To do a manual sync, run git pull to pull from your original repository, and then run git push sync --mirror to push to the replica repository. Again, do not push any local changes to your replica repository at any time.

 

Part 2: Create a Periodic Sync Job

You can use a number of tools to set up an automated sync job. In this section, we’ll briefly cover four common tools: a cron job (Linux), a task in Windows Task Scheduler (Windows), a launchd instance (macOS), and, for those users who already have a Jenkins server set up, a Freestyle project with build triggers. Feel free to use whatever tools are best for you.

Note

Some hosted repositories offer options for syncing repositories, such as Git hooks, notifications, and other triggers. To learn more about those options, consult the documentation for your source repository system.

 

All of the following approaches rely on commands that pull the latest changes from the source repository to your local clone repo, and then mirror those changes to your AWS CodeCommit repository. They can be summed up as follows:

cd /path/to/your/local/repo
git pull
git push sync --mirror

Where and how you save and schedule these commands depends on your operating system and tool(s). We’ve included just a few options/examples from a variety of approaches.

 

In Linux:

  1. At the terminal, run the crontab -e command to edit your crontab file in your default editor.
  2. Add a line for a new cron job that will change directories to your local clone repo, pull from your source repository, and mirror any changes to your AWS CodeCommit repository on the schedule you specify. For example, to run a daily job at 2:45 A.M. for a local repo named my-repo-replica in the ~/tmp directory where you nicknamed your remote (the AWS CodeCommit repository) sync, your new line might look like this:
45 2 * * * cd ~/tmp/my-repo-replica && git pull && git push sync --mirror
  3. Save the crontab file and exit your editor.

 

In Windows:

  1. Create a batch file that contains the command to change directories to your local clone repo, pull from your source repository, and mirror any changes up to your AWS CodeCommit repository. For example, if you created your local repo my-repo-replica in a c:\temp directory, and you nicknamed your remote (the AWS CodeCommit repository) sync, your file might look like this:
cd /d c:\temp\my-repo-replica
git pull
git push sync --mirror
  2. Save the batch file with a name like my-repo-backup.bat.
  3. Open Task Scheduler. (Not sure how? The simplest way is to open a command line and run taskschd.msc.)
  4. In Actions, choose Create Basic Task, and then follow the steps in the wizard.

 

In macOS:

  1. Create a shell script that contains the command to change directories to your local clone repo, pull from your source repository, and mirror any changes up to your AWS CodeCommit repository. For example, if you created your local repo my-repo-replica in a ~/Documents directory, and you nicknamed your remote (the AWS CodeCommit repository) sync, your file might look like this:
cd ~/Documents/my-repo-replica
git pull
git push sync --mirror
  2. Save the shell script with a name like my-repo-backup.sh.
  3. Create a launchd property list file that runs the shell script on the schedule you specify. For example, if you stored my-repo-backup.sh in ~/Documents, to run the script daily at 2:45 A.M., your plist file might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.codecommit.backup</string>
    <key>ProgramArguments</key>
    <array>
        <string>~/Documents/my-repo-backup.sh</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Minute</key>
        <integer>45</integer>
        <key>Hour</key>
        <integer>2</integer>
    </dict>
</dict>
</plist>
  4. Save your plist file in the ~/Library/LaunchAgents, /Library/LaunchAgents, or /Library/LaunchDaemons folder, depending on the definition you want for the job.
  5. Run the launchctl command to load your job. For example, if you want to load a plist file named codecommit.sync.plist in ~/Library/LaunchAgents, your command might look like this:
launchctl load ~/Library/LaunchAgents/codecommit.sync.plist

 

For Jenkins:

  1. Open Jenkins.
  2. Create a new job as a Freestyle project.

codecommit_replicate_new_project

  3. In the Build Triggers section, select Build periodically, and set up a schedule for the task. Jenkins uses cron expressions to run periodic tasks. For more information, see the Jenkins documentation for the syntax of cron.

If you are replicating a GitHub or BitBucket repository, you can also set the task to build when the Git hook is triggered.

The following example builds once a day between midnight and 1 A.M.

codecommit_replicate_build_triggers

  4. In the Build section, add a build step and choose Execute Windows batch command or Execute Shell. Then write a script and implement the Git operations:
cd /path/to/your/local/repo
git pull
git push sync --mirror

Note: Jenkins may require the full path for Git.

The following example is a Windows batch command file, with the full path for Git on the host.

codecommit_replicate_build

  5. Save the configuration for the task.

 

Your AWS CodeCommit replica repository will now be automatically updated with any changes to your source repository as scheduled.

We hope you’ve enjoyed this blog post. If you have questions or suggestions for future blog posts, please leave them in the comments below or visit our user forum!

 

Creating a Simple “Fetch & Run” AWS Batch Job

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/creating-a-simple-fetch-and-run-aws-batch-job/

Dougal Ballantyne
Dougal Ballantyne, Principal Product Manager – AWS Batch

Docker enables you to create highly customized images that are used to execute your jobs. These images allow you to easily share complex applications between teams and even organizations. However, sometimes you might just need to run a script!

This post details the steps to create and run a simple “fetch & run” job in AWS Batch. AWS Batch executes jobs as Docker containers using Amazon ECS. You build a simple Docker image containing a helper application that can download your script or even a zip file from Amazon S3. AWS Batch then launches an instance of your container image to retrieve your script and run your job.

AWS Batch overview

AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.

With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2 Spot Instances.

“Fetch & run” walkthrough

The following steps get everything working:

  • Build a Docker image with the fetch & run script
  • Create an Amazon ECR repository for the image
  • Push the built image to ECR
  • Create a simple job script and upload it to S3
  • Create an IAM role to be used by jobs to access S3
  • Create a job definition that uses the built image
  • Submit and run a job that executes the job script from S3

Prerequisites

Before you get started, there a few things to prepare. If this is the first time you have used AWS Batch, you should follow the Getting Started Guide and ensure you have a valid job queue and compute environment.

After you are up and running with AWS Batch, the next thing is to have an environment to build and register the Docker image to be used. For this post, register this image in an ECR repository. This is a private repository by default and can easily be used by AWS Batch jobs.

You also need a working Docker environment to complete the walkthrough. For the examples, I used Docker for Mac. Alternatively, you could easily launch an EC2 instance running Amazon Linux and install Docker.

You need the AWS CLI installed. For more information, see Installing the AWS Command Line Interface.

Building the fetch & run Docker image

The fetch & run Docker image is based on Amazon Linux. It includes a simple script that reads some environment variables and then uses the AWS CLI to download the job script (or zip file) to be executed.

To get started, download the source code from the aws-batch-helpers GitHub repository. The following link pulls the latest version: https://github.com/awslabs/aws-batch-helpers/archive/master.zip. Unzip the downloaded file and navigate to the “fetch-and-run” folder. Inside this folder are two files:

  • Dockerfile
  • fetch_and_run.sh

Dockerfile is used by Docker to build an image. Look at the contents; you should see something like the following:

FROM amazonlinux:latest

RUN yum -y install unzip aws-cli
ADD fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
WORKDIR /tmp
USER nobody

ENTRYPOINT ["/usr/local/bin/fetch_and_run.sh"]
  • The FROM line instructs Docker to pull the base image from the amazonlinux repository, using the latest tag.
  • The RUN line executes a shell command as part of the image build process.
  • The ADD line copies the fetch_and_run.sh script into the /usr/local/bin directory inside the image.
  • The WORKDIR line sets the default directory to /tmp when the image is used to start a container.
  • The USER line sets the default user that the container executes as.
  • Finally, the ENTRYPOINT line instructs Docker to call the /usr/local/bin/fetch_and_run.sh script when it starts the container. When running as an AWS Batch job, it is passed the contents of the command parameter.

Now, build the Docker image! Assuming that the docker command is in your PATH and you don’t need sudo to access it, you can build the image with the following command (note the dot at the end of the command):

docker build -t awsbatch/fetch_and_run .   

This command should produce an output similar to the following:

Sending build context to Docker daemon 373.8 kB

Step 1/6 : FROM amazonlinux:latest
latest: Pulling from library/amazonlinux
c9141092a50d: Pull complete
Digest: sha256:2010c88ac1e7c118d61793eec71dcfe0e276d72b38dd86bd3e49da1f8c48bf54
Status: Downloaded newer image for amazonlinux:latest
 ---> 8ae6f52035b5
Step 2/6 : RUN yum -y install unzip aws-cli
 ---> Running in e49cba995ea6
Loaded plugins: ovl, priorities
Resolving Dependencies
--> Running transaction check
---> Package aws-cli.noarch 0:1.11.29-1.45.amzn1 will be installed

  << removed for brevity >>

Complete!
 ---> b30dfc9b1b0e
Removing intermediate container e49cba995ea6
Step 3/6 : ADD fetch_and_run.sh /usr/local/bin/fetch_and_run.sh
 ---> 256343139922
Removing intermediate container 326092094ede
Step 4/6 : WORKDIR /tmp
 ---> 5a8660e40d85
Removing intermediate container b48a7b9c7b74
Step 5/6 : USER nobody
 ---> Running in 72c2be3af547
 ---> fb17633a64fe
Removing intermediate container 72c2be3af547
Step 6/6 : ENTRYPOINT /usr/local/bin/fetch_and_run.sh
 ---> Running in aa454b301d37
 ---> fe753d94c372

Removing intermediate container aa454b301d37
Successfully built 9aa226c28efc

In addition, you should see a new local repository called awsbatch/fetch_and_run when you run the following command:

docker images
REPOSITORY               TAG              IMAGE ID            CREATED             SIZE
awsbatch/fetch_and_run   latest           9aa226c28efc        19 seconds ago      374 MB
amazonlinux              latest           8ae6f52035b5        5 weeks ago         292 MB

To add more packages to the image, you could update the RUN line or add a second one, right after it.

Creating an ECR repository

The next step is to create an ECR repository to store the Docker image, so that it can be retrieved by AWS Batch when running jobs.

  1. In the ECR console, choose Get Started or Create repository.
  2. Enter a name for the repository, for example: awsbatch/fetch_and_run.
  3. Choose Next step and follow the instructions.

    fetchAndRunBatch_1.png

You can keep the console open, as the tips can be helpful.

Push the built image to ECR

Now that you have a Docker image and an ECR repository, it is time to push the image to the repository. Use the following AWS CLI commands, if you have used the previous example names. Run the docker login command returned by aws ecr get-login before tagging and pushing, and replace the example AWS account number (012345678901) with your own account.

aws ecr get-login --region us-east-1

docker tag awsbatch/fetch_and_run:latest 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest

docker push 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run:latest

Create a simple job script and upload to S3

Next, create and upload a simple job script that is executed using the fetch_and_run image that you just built and registered in ECR. Start by creating a file called myjob.sh with the example content below:

#!/bin/bash

date
echo "Args: $@"
env
echo "This is my simple test job!."
echo "jobId: $AWS_BATCH_JOB_ID"
echo "jobQueue: $AWS_BATCH_JQ_NAME"
echo "computeEnvironment: $AWS_BATCH_CE_NAME"
sleep $1
date
echo "bye bye!!"

Upload the script to an S3 bucket.

aws s3 cp myjob.sh s3://<bucket>/myjob.sh

Create an IAM role

When the fetch_and_run image runs as an AWS Batch job, it fetches the job script from Amazon S3. You need an IAM role that the AWS Batch job can use to access S3.

  1. In the IAM console, choose Roles, Create New Role.
  2. Enter a name for your new role, for example: batchJobRole, and choose Next Step.
  3. For Role Type, under AWS Service Roles, choose Select next to “Amazon EC2 Container Service Task Role” and then choose Next Step.

    fetchAndRunBatch_2.png

  4. On the Attach Policy page, type “AmazonS3ReadOnlyAccess” into the Filter field and then select the check box for that policy.

    fetchAndRunBatch_3.png

  5. Choose Next Step, Create Role. You see the details of the new role.

    fetchAndRunBatch_4.png

Create a job definition

Now that you’ve created all the resources needed, pull everything together and build a job definition that you can use to run one or many AWS Batch jobs.

  1. In the AWS Batch console, choose Job Definitions, Create.
  2. For the Job Definition, enter a name, for example, fetch_and_run.
  3. For IAM Role, choose the role that you created earlier, batchJobRole.
  4. For ECR Repository URI, enter the URI where the fetch_and_run image was pushed, for example: 012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run.
  5. Leave the Command field blank.
  6. For vCPUs, enter 1. For Memory, enter 500.

    fetchAndRunBatch_5.png

  7. For User, enter “nobody”.

  8. Choose Create job definition. (A boto3 equivalent of these console steps is sketched below.)
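If you prefer to script this step, the console actions above map onto a single boto3 call. The image URI, account number, and role name below repeat the example values from earlier and must be replaced with your own; treat this as a hedged sketch rather than the walkthrough’s exact procedure.

import boto3

batch = boto3.client("batch")

# Register a job definition that runs the fetch_and_run image under the batchJobRole.
response = batch.register_job_definition(
    jobDefinitionName="fetch_and_run",
    type="container",
    containerProperties={
        "image": "012345678901.dkr.ecr.us-east-1.amazonaws.com/awsbatch/fetch_and_run",
        "vcpus": 1,
        "memory": 500,
        "jobRoleArn": "arn:aws:iam::012345678901:role/batchJobRole",
        "user": "nobody",
    },
)
print(response["jobDefinitionArn"])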

Submit and run a job

Now, submit and run a job that uses the fetch_and_run image to download the job script and execute it.

  1. In the AWS Batch console, choose Jobs, Submit Job.
  2. Enter a name for the job, for example: script_test.
  3. Choose the latest fetch_and_run job definition.
  4. For Job Queue, choose a queue, for example: first-run-job-queue.
  5. For Command, enter myjob.sh,60.
  6. Choose Validate Command.

    fetchAndRunBatch_6.png

  7. Enter the following environment variables and then choose Submit job.

    • Key=BATCH_FILE_TYPE, Value=script
    • Key=BATCH_FILE_S3_URL, Value=s3://<bucket>/myjob.sh. Don’t forget to use the correct URL for your file.

    fetchAndRunBatch_7.png

  8. After the job is completed, check the final status in the console.

    fetchAndRunBatch_8.png

  9. In the job details page, you can also choose View logs for this job in CloudWatch console to see your job log.

    fetchAndRunBatch_9.png

How the fetch and run image works

The fetch_and_run image works as a combination of the Docker ENTRYPOINT and COMMAND feature, and a shell script that reads environment variables set as part of the AWS Batch job. When building the Docker image, it starts with a base image from Amazon Linux and installs a few packages from the yum repository. This becomes the execution environment for the job.

If the script you planned to run needed more packages, you would add them using the RUN parameter in the Dockerfile. You could even change it to a different base image such as Ubuntu, by updating the FROM parameter.

Next, the fetch_and_run.sh script is added to the image and set as the container ENTRYPOINT. The script simply reads some environment variables and then downloads and runs the script/zip file from S3. It looks for the following environment variables: BATCH_FILE_TYPE and BATCH_FILE_S3_URL. If you run fetch_and_run.sh with no environment variables, you get the following usage message:

  • BATCH_FILE_TYPE not set, unable to determine type (zip/script) of

Usage:

export BATCH_FILE_TYPE="script"

export BATCH_FILE_S3_URL="s3://my-bucket/my-script"

fetch_and_run.sh script-from-s3 [ <script arguments> ]

– or –

export BATCH_FILE_TYPE="zip"

export BATCH_FILE_S3_URL="s3://my-bucket/my-zip"

fetch_and_run.sh script-from-zip [ <script arguments> ]

This shows that it supports two values for BATCH_FILE_TYPE, either “script” or “zip”. When you set “script”, it causes fetch_and_run.sh to download a single file and then execute it, in addition to passing in any further arguments to the script. If you set it to “zip”, this causes fetch_and_run.sh to download a zip file, then unpack it and execute the script name passed and any further arguments. You can use the “zip” option to pass more complex jobs with all the application’s dependencies in one file.

Finally, the ENTRYPOINT parameter tells Docker to execute the /usr/local/bin/fetch_and_run.sh script when creating a container. In addition, it passes the contents of the COMMAND parameter as arguments to the script. This is what enables you to pass the script and arguments to be executed by the fetch_and_run image with the Command field in the SubmitJob API action call; a boto3 sketch of that call follows.
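As a rough illustration, submitting the same job through the API with boto3 looks like the following; the queue name and bucket are the example values used earlier and should be replaced with yours.

import boto3

batch = boto3.client("batch")

# Ask fetch_and_run to download myjob.sh from S3 and run it with "60" as its first argument.
response = batch.submit_job(
    jobName="script_test",
    jobQueue="first-run-job-queue",
    jobDefinition="fetch_and_run",
    containerOverrides={
        "command": ["myjob.sh", "60"],
        "environment": [
            {"name": "BATCH_FILE_TYPE", "value": "script"},
            {"name": "BATCH_FILE_S3_URL", "value": "s3://<bucket>/myjob.sh"},
        ],
    },
)
print(response["jobId"])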

Summary

In this post, I detailed the steps to create and run a simple “fetch & run” job in AWS Batch. You can now easily use the same job definition to run as many jobs as you need by uploading a job script to Amazon S3 and calling SubmitJob with the appropriate environment variables.

If you have questions or suggestions, please comment below.

Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/157196488076


By Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team

Introduction

Today, we are pleased to offer TensorFlowOnSpark to the community, our latest open source framework for distributed deep learning on big-data clusters.

Deep learning (DL) has evolved significantly in recent years. At Yahoo, we’ve found that in order to gain insight from massive amounts of data, we need to deploy distributed deep learning. Existing DL frameworks often require us to set up separate clusters for deep learning, forcing us to create multiple programs for a machine learning pipeline (see Figure 1 below). Having separate clusters requires us to transfer large datasets between them, introducing unwanted system complexity and end-to-end learning latency.

[Figure 1]

Last year we addressed scale-out issues by developing and publishing CaffeOnSpark, our open source framework that allows distributed deep learning and big-data processing on identical Spark and Hadoop clusters. We use CaffeOnSpark at Yahoo to improve our NSFW image detection, to automatically identify eSports game highlights from live-streamed videos, and more. With the community’s valuable feedback and contributions, CaffeOnSpark has been upgraded with LSTM support, a new data layer, training and test interleaving, a Python API, and deployment on docker containers. This has been great for our Caffe users, but what about those who use the deep learning framework TensorFlow? We’re taking a page from our own playbook and doing for TensorFlow what we did for Caffe.

After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016. In October 2016, TensorFlow introduced HDFS support. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. TensorFlow programs could not be deployed on existing big-data clusters, thus increasing the cost and latency for those who wanted to take advantage of this technology at scale.

To address this limitation, several community projects wired TensorFlow onto Spark clusters. SparkNet added the ability to launch TensorFlow networks in Spark executors. DataBricks proposed TensorFrame to manipulate Apache Spark’s DataFrames with TensorFlow programs. While these approaches are a step in the right direction, after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.

TensorFlowOnSpark

[Figure 2]

Our new framework, TensorFlowOnSpark (TFoS), enables distributed TensorFlow execution on Spark and Hadoop clusters. As illustrated in Figure 2 above, TensorFlowOnSpark is designed to work along with SparkSQL, MLlib, and other Spark libraries in a single pipeline or program (e.g. Python notebook).

TensorFlowOnSpark supports all types of TensorFlow programs, enabling both asynchronous and synchronous training and inferencing. It supports model parallelism and data parallelism, as well as TensorFlow tools such as TensorBoard on Spark clusters.

Any TensorFlow program can be easily modified to work with TensorFlowOnSpark. Typically, changes to fewer than 10 lines of Python code are needed. Many developers at Yahoo who use TensorFlow have easily migrated TensorFlow programs for execution with TensorFlowOnSpark.

TensorFlowOnSpark supports direct tensor communication among TensorFlow processes (workers and parameter servers). Process-to-process direct communication enables TensorFlowOnSpark programs to scale easily by adding machines. As illustrated in Figure 3, TensorFlowOnSpark doesn’t involve Spark drivers in tensor communication, and thus achieves similar scalability as stand-alone TensorFlow clusters.

[Figure 3]

TensorFlowOnSpark provides two different modes to ingest data for training and inference:

  1. TensorFlow QueueRunners: TensorFlowOnSpark leverages TensorFlow’s file readers and QueueRunners to read data directly from HDFS files. Spark is not involved in accessing data.
  2. Spark Feeding: Spark RDD data is fed to each Spark executor, which subsequently feeds the data into the TensorFlow graph via feed_dict.

Figure 4 illustrates how the synchronous distributed training of Inception image classification network scales in TFoS using QueueRunners with a simple setting: 1 GPU, 1 reader, and batch size 32 for each worker. Four TFoS jobs were launched to train 100,000 steps. When these jobs completed after 2+ days, the top-5 accuracy of these jobs were 0.730, 0.814, 0.854, and 0.879. Reaching top-5 accuracy of 0.730 takes 46 hours for a 1-worker job, 22.5 hours for a 2-worker job, 13 hours for a 4-worker job, and 7.5 hours for an 8-worker job. TFoS thus achieves near linear scalability for Inception model training. This is very encouraging, although TFoS scalability will vary for different models and hyperparameters.

[Figure 4]

RDMA for Distributed TensorFlow

In Yahoo’s Hadoop clusters, GPU nodes are connected by both Ethernet and Infiniband. Infiniband provides faster connectivity and supports direct access to other servers’ memories over RDMA. Current TensorFlow releases, however, only support distributed learning using gRPC over Ethernet. To speed up distributed learning, we have enhanced the TensorFlow C++ layer to enable RDMA over Infiniband.

In conjunction with our TFoS release, we are introducing a new protocol for TensorFlow servers in addition to the default “grpc” protocol. Any distributed TensorFlow program can leverage our enhancement via specifying protocol=“grpc_rdma” in tf.train.ServerDef() or tf.train.Server().

With this new protocol, a RDMA rendezvous manager is created to ensure tensors are written directly into the memory of remote servers. We minimize the tensor buffer creation: Tensor buffers are allocated once at the beginning, and then reused across all training steps of a TensorFlow job. From our early experimentation with large models like the VGG-19 network, our RDMA implementation has demonstrated a significant speedup on training time compared with the existing gRPC implementation.

Since RDMA support is a highly requested capability (see TensorFlow issue #2916), we decided to make our current implementation available as an alpha release to the TensorFlow community. In the coming weeks, we will polish our RDMA implementation further, and share detailed benchmark results.

Simple CLI and API

TFoS programs are launched by the standard Apache Spark command, spark-submit. As illustrated below, users can specify the number of Spark executors, the number of GPUs per executor, and the number of parameter servers in the CLI. A user can also state whether they want to use TensorBoard (--tensorboard) and/or RDMA (--rdma).

      spark-submit --master ${MASTER} \
      ${TFoS_HOME}/examples/slim/train_image_classifier.py \
      --model_name inception_v3 \
      --train_dir hdfs://default/slim_train \
      --dataset_dir hdfs://default/data/imagenet \
      --dataset_name imagenet \
      --dataset_split_name train \
      --cluster_size ${NUM_EXEC} \
      --num_gpus ${NUM_GPU} \
      --num_ps_tasks ${NUM_PS} \
      --sync_replicas \
      --replicas_to_aggregate ${NUM_WORKERS} \
      --tensorboard \
      --rdma

TFoS provides a high-level Python API (illustrated in our sample Python notebook):

  • TFCluster.reserve() … construct a TensorFlow cluster from Spark executors
  • TFCluster.start() … launch the TensorFlow program on the executors
  • TFCluster.train() or TFCluster.inference() … feed RDD data to TensorFlow processes
  • TFCluster.shutdown() … shut down TensorFlow execution on the executors
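Strung together, a driver program might look roughly like the sketch below. This is illustrative only: the keyword arguments and the mnist_dist module are assumptions on my part, so consult the TensorFlowOnSpark examples for the exact signatures.

# Illustrative sketch; argument names and mnist_dist are assumptions, not the exact TFoS API.
from pyspark import SparkContext
from tensorflowonspark import TFCluster

import mnist_dist  # hypothetical module holding the per-worker TensorFlow code


def main():
    sc = SparkContext(appName="mnist_tfos")
    data_rdd = sc.textFile("hdfs://default/data/mnist/csv/train")

    # Reserve executors for TensorFlow, start the graph, feed RDD data, then shut down.
    cluster = TFCluster.reserve(sc, num_executors=4, num_ps=1,
                                tensorboard=True, input_mode="spark")
    cluster.start(mnist_dist.map_fun, None)
    cluster.train(data_rdd)
    cluster.shutdown()


if __name__ == "__main__":
    main()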

Open Source

Yahoo is happy to release TensorFlowOnSpark at github.com/yahoo/TensorFlowOnSpark and an RDMA enhancement of TensorFlow at github.com/yahoo/tensorflow/tree/yahoo. Multiple example programs (including mnist, cifar10, inception, and VGG) are provided to illustrate the simple conversion process of TensorFlow programs to TensorFlowOnSpark, and to leverage RDMA. An Amazon Machine Image is also available for applying TensorFlowOnSpark on AWS EC2.

Going forward, we will advance TensorFlowOnSpark as we continue to do with CaffeOnSpark. We welcome the community’s continued feedback and contributions to CaffeOnSpark, and are interested in thoughts on ways TensorFlowOnSpark can be enhanced.

Secure Amazon EMR with Encryption

Post Syndicated from Sai Sriparasa original https://aws.amazon.com/blogs/big-data/secure-amazon-emr-with-encryption/

In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Because of the critical nature of these workloads, enterprises implement organization- or industry-wide policies, as well as regulatory or compliance policies. Such policy requirements are designed to protect sensitive data from unauthorized access.

A common requirement within such policies is about encrypting data at-rest and in-flight. Amazon EMR uses “security configurations” to make it easy to specify the encryption keys and certificates, ranging from AWS Key Management Service to supplying your own custom encryption materials provider.

You create a security configuration that specifies encryption settings and then use the configuration when you create a cluster. This makes it easy to build the security configuration one time and use it for any number of clusters.

o_Amazon_EMR_Encryption_1

In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.

Data at rest

  • Data residing on Amazon S3—S3 client-side encryption with EMR
  • Data residing on disk—the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using Linux Unified Key System (LUKS)

Data in transit

  • Data in transit from EMR to S3, or vice versa—S3 client side encryption with EMR
  • Data in transit between nodes in a cluster—in-transit encryption via Secure Sockets Layer (SSL) for MapReduce and Simple Authentication and Security Layer (SASL) for Spark shuffle encryption
  • Data being spilled to disk or cached during a shuffle phase—Spark shuffle encryption or LUKS encryption

Encryption walkthrough

For this post, you create a security configuration that implements encryption in transit and at rest. To achieve this, you create the following resources:

  • KMS keys for LUKS encryption and S3 client-side encryption for data exiting EMR to S3
  • SSL certificates to be used for MapReduce shuffle encryption
  • The environment into which the EMR cluster is launched. For this post, you launch EMR in private subnets and set up an S3 VPC endpoint to get the data from S3.
  • An EMR security configuration

All of the scripts and code snippets used for this walkthrough are available on the aws-blog-emrencryption GitHub repo.

Generate KMS keys

For this walkthrough, you use AWS KMS, a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data and disks.

You generate two KMS master keys, one for S3 client-side encryption to encrypt data going out of EMR and the other for LUKS encryption to encrypt the local disks. The Hadoop MapReduce framework uses HDFS. Spark uses the local file system on each slave instance for intermediate data throughout a workload, where data could be spilled to disk when it overflows memory.

To generate the keys, use the kms.json AWS CloudFormation script.  As part of this script, provide an alias name, or display name, for the keys. An alias must be in the “alias/aliasname” format, and can only contain alphanumeric characters, an underscore, or a dash.
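If you would rather create the keys outside of CloudFormation, the equivalent boto3 calls are shown below as a minimal sketch; the alias names are examples and must follow the “alias/aliasname” format described above.

import boto3

kms = boto3.client("kms")


def create_key_with_alias(alias_name, description):
    # Create a customer master key and attach a display alias to it.
    key = kms.create_key(Description=description)
    kms.create_alias(AliasName=alias_name, TargetKeyId=key["KeyMetadata"]["KeyId"])
    return key["KeyMetadata"]["Arn"]


# One key for S3 client-side encryption, one for LUKS local disk encryption.
s3_key_arn = create_key_with_alias("alias/emr-s3-cse", "EMR S3 client-side encryption key")
luks_key_arn = create_key_with_alias("alias/emr-luks", "EMR local disk (LUKS) encryption key")
print(s3_key_arn, luks_key_arn)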

o_Amazon_EMR_Encryption_2

After you finish generating the keys, the ARNs are available as part of the outputs.

o_Amazon_EMR_Encryption_3

Generate SSL certificates

The SSL certificates allow the encryption of the MapReduce shuffle using HTTPS while the data is in transit between nodes.

o_Amazon_EMR_Encryption_4

For this walkthrough, use OpenSSL to generate a self-signed X.509 certificate with a 2048-bit RSA private key that allows access to the issuer’s EMR cluster instances. This prompts you to provide subject information to generate the certificates.

Use the cert-create.sh script to generate SSL certificates that are compressed into a zip file. Upload the zipped certificates to S3 and keep a note of the S3 prefix. You use this S3 prefix when you build your security configuration.

Important

This example is a proof-of-concept demonstration only. Using self-signed certificates is not recommended and presents a potential security risk. For production systems, use a trusted certification authority (CA) to issue certificates.

To implement certificates from custom providers, use the TLSArtifacts provider interface.

Build the environment

For this walkthrough, launch an EMR cluster into a private subnet. If you already have a VPC and would like to launch this cluster into a public subnet, skip this section and jump to the Create a Security Configuration section.

To launch the cluster into a private subnet, the environment must include the following resources:

  • VPC
  • Private subnet
  • Public subnet
  • Bastion
  • Managed NAT gateway
  • S3 VPC endpoint

As the EMR cluster is launched into a private subnet, you need a bastion or a jump server to SSH onto the cluster. After the cluster is running, you need access to the Internet to request the data keys from KMS. Private subnets do not have access to the Internet directly, so route this traffic via the managed NAT gateway. Use an S3 VPC endpoint to provide a highly reliable and a secure connection to S3.

o_Amazon_EMR_Encryption_5

In the CloudFormation console, create a new stack for this environment and use the environment.json CloudFormation template to deploy it.

As part of the parameters, pick an instance family for the bastion and an EC2 key pair to be used to SSH onto the bastion. Provide an appropriate stack name and add the appropriate tags. For example, the following screenshot is the review step for a stack that I created.

o_Amazon_EMR_Encryption_6

After creating the environment stack, look at the Output tab and make a note of the VPC ID, bastion, and private subnet IDs, as you will use them when you launch the EMR cluster resources.

o_Amazon_EMR_Encryption_7

Create a security configuration

The final step before launching the secure EMR cluster is to create a security configuration. For this walkthrough, create a security configuration with S3 client-side encryption using EMR, and LUKS encryption for local volumes using the KMS keys created earlier. You also use the SSL certificates generated and uploaded to S3 earlier for encrypting the MapReduce shuffle.

o_Amazon_EMR_Encryption_8
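The same security configuration can also be created programmatically. The following boto3 sketch mirrors the settings described above; the key ARNs and the S3 location of the zipped certificates are placeholders for the values you created earlier, and the JSON layout follows the EMR security configuration schema as I understand it, so verify it against the EMR documentation.

import json

import boto3

emr = boto3.client("emr")

# S3 client-side encryption and LUKS at rest, plus TLS artifacts for in-transit encryption.
encryption_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "CSE-KMS",
                "AwsKmsKey": "<s3-cse-kms-key-arn>",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "<luks-kms-key-arn>",
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://<bucket>/<prefix>/certs.zip",
            }
        },
    }
}

emr.create_security_configuration(
    Name="emr-encryption-walkthrough",
    SecurityConfiguration=json.dumps(encryption_config),
)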

Launch an EMR cluster

Now, you can launch an EMR cluster in the private subnet. First, verify that the service role being used for EMR has access to the AmazonElasticMapReduceRole managed service policy. The default service role is EMR_DefaultRole. For more information, see Configuring User Permissions Using IAM Roles.

From the Build the environment section, you have the VPC ID and the subnet ID for the private subnet into which the EMR cluster should be launched. Select those values for the Network and EC2 Subnet fields. In the next step, provide a name and tags for the cluster.

o_Amazon_EMR_Encryption_9

The last step is to select the private key, assign the security configuration that was created in the Create a security configuration section, and choose Create Cluster.

o_Amazon_EMR_Encryption_10
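If you prefer to launch the cluster with the API rather than the console, a hedged boto3 sketch follows; the subnet ID, key pair name, instance types, and release label are placeholders, and the security configuration name matches the sketch in the previous section.

import boto3

emr = boto3.client("emr")

# Launch into the private subnet and attach the security configuration by name.
response = emr.run_job_flow(
    Name="emr-encrypted-cluster",
    ReleaseLabel="emr-5.4.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    SecurityConfiguration="emr-encryption-walkthrough",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    VisibleToAllUsers=True,
    Instances={
        "Ec2SubnetId": "<private-subnet-id>",
        "Ec2KeyName": "<key-pair-name>",
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m4.xlarge", "InstanceCount": 2},
        ],
    },
)
print(response["JobFlowId"])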

Now that you have the environment and the cluster up and running, you can get onto the master node to run scripts. You need the IP address, which you can retrieve from the EMR console page. Choose Hardware, Master Instance group and note the private IP address of the master node.

o_Amazon_EMR_Encryption_11

As the master node is in a private subnet, SSH onto the bastion instance first and then jump from the bastion instance to the master node. For information about how to SSH onto the bastion and then to the Hadoop master, open the ssh-commands.txt file. For more information about how to get onto the bastion, see the Securely Connect to Linux Instances Running in a Private Amazon VPC post.

After you are on the master node, bring your own Hive or Spark scripts. For testing purposes, the GitHub /code directory includes the test.py PySpark and test.q Hive scripts.

Summary

As part of this post, I’ve identified the different phases where data needs to be encrypted and walked through how data in each phase can be encrypted. Then, I described a step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration. As part of this walkthrough, you also secured the data by launching your cluster in a private subnet within a VPC, and used a bastion instance for access to the EMR cluster.

If you have questions or suggestions, please comment below.


About the Author

Sai Sriparasa is a Big Data Consultant for AWS Professional Services. He works with our customers to provide strategic & tactical big data solutions with an emphasis on automation, operations & security on AWS. In his spare time, he follows sports and current affairs.

 

 

 



Managing Secrets for Amazon ECS Applications Using Parameter Store and IAM Roles for Tasks

Post Syndicated from Chris Barclay original https://aws.amazon.com/blogs/compute/managing-secrets-for-amazon-ecs-applications-using-parameter-store-and-iam-roles-for-tasks/

Thanks to my colleague Stas Vonholsky  for a great blog on managing secrets with Amazon ECS applications.

—–

As containerized applications and microservice-oriented architectures become more popular, managing secrets, such as a password to access an application database, becomes more challenging and critical.

Some examples of the challenges include:

  • Support for various access patterns across container environments such as dev, test, and prod
  • Isolated access to secrets on a container/application level rather than at the host level
  • Multiple decoupled services with their own needs for access, both as services and as clients of other services

This post focuses on newly released features that support further improvements to secret management for containerized applications running on Amazon ECS. My colleague, Matthew McClean, also published an excellent post on the AWS Security Blog, How to Manage Secrets for Amazon EC2 Container Service–Based Applications by Using Amazon S3 and Docker, which discusses some of the limitations of passing and storing secrets with container parameter variables.

Most secret management tools provide the following functionality:

  • Highly secured storage system
  • Central management capabilities
  • Secure authorization and authentication mechanisms
  • Integration with key management and encryption providers
  • Secure introduction mechanisms for access
  • Auditing
  • Secret rotation and revocation

Amazon EC2 Systems Manager Parameter Store

Parameter Store is a feature of Amazon EC2 Systems Manager. It provides a centralized, encrypted store for sensitive information and has many advantages when combined with other capabilities of Systems Manager, such as Run Command and State Manager. The service is fully managed, highly available, and highly secured.

Because Parameter Store is accessible using the Systems Manager API, AWS CLI, and AWS SDKs, you can also use it as a generic secret management store. Secrets can be easily rotated and revoked. Parameter Store is integrated with AWS KMS so that specific parameters can be encrypted at rest with the default or custom KMS key. Importing KMS keys enables you to use your own keys to encrypt sensitive data.
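
For example, rotation and revocation can be scripted with the AWS CLI; the parameter name and key ID below are placeholders:

# Rotate: overwrite the existing value with a new one
aws ssm put-parameter --name <parameter-name> --value "<new-secret-value>" \
  --type SecureString --key-id "<kms-key-id>" --overwrite --region us-east-1

# Revoke: remove the parameter entirely
aws ssm delete-parameter --name <parameter-name> --region us-east-1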

Access to Parameter Store is enabled by IAM policies and supports resource level permissions for access. An IAM policy that grants permissions to specific parameters or a namespace can be used to limit access to these parameters. CloudTrail logs, if enabled for the service, record any attempt to access a parameter.

While Amazon S3 has many of the above features and can also be used to implement a central secret store, Parameter Store has the following added advantages:

  • Easy creation of namespaces to support different stages of the application lifecycle.
  • KMS integration that abstracts parameter encryption from the application while requiring the instance or container to have access to the KMS key and for the decryption to take place locally in memory.
  • Stored history about parameter changes.
  • A service that can be controlled separately from S3, which is likely used for many other applications.
  • A configuration data store, reducing overhead from implementing multiple systems.
  • No usage costs.

Note: At the time of publication, Systems Manager doesn’t support VPC private endpoint functionality. To enforce stricter access to a Parameter Store endpoint from a private VPC, use a NAT gateway with a set Elastic IP address together with IAM policy conditions that restrict parameter access to a limited set of IP addresses.

IAM roles for tasks

With IAM roles for Amazon ECS tasks, you can specify an IAM role to be used by the containers in a task. Applications interacting with AWS services must sign their API requests with AWS credentials. This feature provides a strategy for managing credentials for your applications to use, similar to the way that Amazon EC2 instance profiles provide credentials to EC2 instances.

Instead of creating and distributing your AWS credentials to the containers or using the EC2 instance role, you can associate an IAM role with an ECS task definition or the RunTask API operation. For more information, see IAM Roles for Tasks.

You can use IAM roles for tasks to securely introduce and authenticate the application or container with the centralized Parameter Store. Access to the secret manager should include features such as:

  • Limited TTL for credentials used
  • Granular authorization policies
  • An ID to track the requests in the logs of the central secret manager
  • Integration support with the scheduler that could map between the container or task deployed and the relevant access privileges

IAM roles for tasks support this use case well, as the role credentials can be accessed only from within the container for which the role is defined. The role exposes temporary credentials and these are rotated automatically. Granular IAM policies are supported with optional conditions about source instances, source IP addresses, time of day, and other options.

The source IAM role can be identified in the CloudTrail logs based on a unique Amazon Resource Name and the access permissions can be revoked immediately at any time with the IAM API or console. As Parameter Store supports resource level permissions, a policy can be created to restrict access to specific keys and namespaces.

Dynamic environment association

In many cases, the container image does not change when moving between environments, which supports immutable deployments and ensures that results are reproducible. What does change is the configuration: in this context, specifically the secrets. For example, a database and its password might be different in the staging and production environments. The question remains: how does the application know which secret to retrieve? Should it fetch prod.app1.secret, test.app1.secret, or something else?

One option can be to pass the environment type as an environment variable to the container. The application then concatenates the environment type (prod, test, etc.) with the relative key path and retrieves the relevant secret. In most cases, this leads to a number of separate ECS task definitions.

When you describe the task definition in a CloudFormation template, you could base the entry in the IAM role that provides access to Parameter Store, KMS key, and environment property on a single CloudFormation parameter, such as “environment type.” This approach could support a single task definition type that is based on a generic CloudFormation template.
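
As a minimal sketch of the environment-variable approach described above (the ENV_TYPE variable and key naming here are hypothetical), the container's startup logic could look like the following:

# ENV_TYPE is passed to the container as an environment variable (for example, prod or test)
DB_PASS=$(aws ssm get-parameters --names "${ENV_TYPE}.app1.db-pass" --with-decryption \
  --region us-east-1 --query 'Parameters[0].Value' --output text)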

Walkthrough: Securely access Parameter Store resources with IAM roles for tasks

This walkthrough is configured for the US East (N. Virginia) Region (us-east-1). I recommend using the same region.

Step 1: Create the keys and parameters

First, create the following KMS keys with the default security policy to be used to encrypt various parameters:

  • prod-app1 – used to encrypt any secrets for app1.
  • license-code – used to encrypt license-related secrets.
aws kms create-key --description prod-app1 --region us-east-1
aws kms create-key --description license-code --region us-east-1

Note the KeyId property in the output of both commands. You use it throughout the walkthrough to identify the KMS keys.
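
Alternatively, here is a sketch that captures the key IDs into shell variables at creation time, instead of copying them from the command output manually:

# Capture the key IDs directly from the create-key output
PROD_APP1_KEY_ID=$(aws kms create-key --description prod-app1 --region us-east-1 \
  --query 'KeyMetadata.KeyId' --output text)
LICENSE_CODE_KEY_ID=$(aws kms create-key --description license-code --region us-east-1 \
  --query 'KeyMetadata.KeyId' --output text)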

The following commands create three parameters in Parameter Store:

  • prod.app1.db-pass (encrypted with the prod-app1 KMS key)
  • general.license-code (encrypted with the license-code KMS key)
  • prod.app2.user-name (stored as a standard string without encryption)
aws ssm put-parameter --name prod.app1.db-pass --value "AAAAAAAAAAA" --type SecureString --key-id "<key-id-for-prod-app1-key>" --region us-east-1
aws ssm put-parameter --name general.license-code --value "CCCCCCCCCCC" --type SecureString --key-id "<key-id-for-license-code-key>" --region us-east-1
aws ssm put-parameter --name prod.app2.user-name --value "BBBBBBBBBBB" --type String --region us-east-1

Step 2: Create the IAM role and policies

Now, create an IAM role and policy to associate with the ECS task that you create later. The trust policy for the IAM role needs to allow the ecs-tasks entity to assume the role.

{
   "Version": "2012-10-17",
   "Statement": [
     {
       "Sid": "",
       "Effect": "Allow",
       "Principal": {
         "Service": "ecs-tasks.amazonaws.com"
       },
       "Action": "sts:AssumeRole"
     }
   ]
 }

Save the above policy as a file in the local directory with the name ecs-tasks-trust-policy.json.

aws iam create-role --role-name prod-app1 --assume-role-policy-document file://ecs-tasks-trust-policy.json

The following policy is attached to the role and later associated with the app1 container. Access is granted to the prod.app1.* namespace parameters, the encryption key required to decrypt the prod.app1.db-pass parameter and the license code parameter. The namespace resource permission structure is useful for building various hierarchies (based on environments, applications, etc.).

Make sure to replace <key-id-for-prod-app1-key> with the key ID for the relevant KMS key and <account-id> with your account ID in the following policy.

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Action": [
                 "ssm:DescribeParameters"
             ],
             "Resource": "*"
         },
         {
             "Sid": "Stmt1482841904000",
             "Effect": "Allow",
             "Action": [
                 "ssm:GetParameters"
             ],
             "Resource": [
                 "arn:aws:ssm:us-east-1:<account-id>:parameter/prod.app1.*",
                 "arn:aws:ssm:us-east-1:<account-id>:parameter/general.license-code"
             ]
         },
         {
             "Sid": "Stmt1482841948000",
             "Effect": "Allow",
             "Action": [
                 "kms:Decrypt"
             ],
             "Resource": [
                 "arn:aws:kms:us-east-1:<account-id>:key/<key-id-for-prod-app1-key>"
             ]
         }
     ]
 }

Save the above policy as a file in the local directory with the name app1-secret-access.json:

aws iam create-policy --policy-name prod-app1 --policy-document file://app1-secret-access.json

Replace <account-id> with your account ID in the following command:

aws iam attach-role-policy --role-name prod-app1 --policy-arn "arn:aws:iam::<account-id>:policy/prod-app1"

Step 3: Add the testing script to an S3 bucket

Create a file with the script below, name it access-test.sh, and add it to an S3 bucket in your account (a CLI sketch for the upload follows the script). Make sure the object is publicly accessible and note the object link, for example https://s3-eu-west-1.amazonaws.com/my-new-blog-bucket/access-test.sh

#!/bin/bash
# This is a simple bash script that is used to test access to the EC2 Parameter Store.
# Install the AWS CLI
apt-get -y install python2.7 curl
curl -O https://bootstrap.pypa.io/get-pip.py
python2.7 get-pip.py
pip install awscli
# Getting region
EC2_AVAIL_ZONE=`curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone`
EC2_REGION="`echo \"$EC2_AVAIL_ZONE\" | sed -e 's:\([0-9][0-9]*\)[a-z]*\$:\\1:'`"
# Trying to retrieve parameters from the EC2 Parameter Store
APP1_WITH_ENCRYPTION=`aws ssm get-parameters --names prod.app1.db-pass --with-decryption --region $EC2_REGION --output text 2>&1`
APP1_WITHOUT_ENCRYPTION=`aws ssm get-parameters --names prod.app1.db-pass --no-with-decryption --region $EC2_REGION --output text 2>&1`
LICENSE_WITH_ENCRYPTION=`aws ssm get-parameters --names general.license-code --with-decryption --region $EC2_REGION --output text 2>&1`
LICENSE_WITHOUT_ENCRYPTION=`aws ssm get-parameters --names general.license-code --no-with-decryption --region $EC2_REGION --output text 2>&1`
APP2_WITHOUT_ENCRYPTION=`aws ssm get-parameters --names prod.app2.user-name --no-with-decryption --region $EC2_REGION --output text 2>&1`
# The nginx server is started after the script is invoked, preparing folder for HTML.
if [ ! -d /usr/share/nginx/html/ ]; then
mkdir -p /usr/share/nginx/html/;
fi
chmod 755 /usr/share/nginx/html/

# Creating an HTML file to be accessed at http://<public-instance-DNS-name>/ecs.html
cat > /usr/share/nginx/html/ecs.html <<EOF
<!DOCTYPE html>
<html>
<head>
<title>App1</title>
<style>
body {padding: 20px;margin: 0 auto;font-family: Tahoma, Verdana, Arial, sans-serif;}
code {white-space: pre-wrap;}
result {background: hsl(220, 80%, 90%);}
</style>
</head>
<body>
<h1>Hi there!</h1>
<p style="padding-bottom: 0.8cm;">Following are the results of different access attempts as experienced by "App1".</p>

<p><b>Access to prod.app1.db-pass:</b><br/>
<pre><code>aws ssm get-parameters --names prod.app1.db-pass --with-decryption</code><br/>
<code><result>$APP1_WITH_ENCRYPTION</result></code><br/>
<code>aws ssm get-parameters --names prod.app1.db-pass --no-with-decryption</code><br/>
<code><result>$APP1_WITHOUT_ENCRYPTION</result></code></pre><br/>
</p>

<p><b>Access to general.license-code:</b><br/>
<pre><code>aws ssm get-parameters --names general.license-code --with-decryption</code><br/>
<code><result>$LICENSE_WITH_ENCRYPTION</result></code><br/>
<code>aws ssm get-parameters --names general.license-code --no-with-decryption</code><br/>
<code><result>$LICENSE_WITHOUT_ENCRYPTION</result></code></pre><br/>
</p>

<p><b>Access to prod.app2.user-name:</b><br/>
<pre><code>aws ssm get-parameters --names prod.app2.user-name --no-with-decryption</code><br/>
<code><result>$APP2_WITHOUT_ENCRYPTION</result></code><br/>
</p>

<p><em>Thanks for visiting</em></p>
</body>
</html>
EOF
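
A minimal sketch of the upload with the AWS CLI (the bucket name is a placeholder); the --acl flag makes the object publicly readable, which this walkthrough relies on:

aws s3 cp access-test.sh s3://<your-bucket>/access-test.sh --acl public-read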

Step 4: Create a test cluster

I recommend creating a new ECS test cluster with the latest ECS AMI and ECS agent on the instance. Use the following field values:

  • Cluster name: access-test
  • EC2 instance type: t2.micro
  • Number of instances: 1
  • Key pair: No EC2 key pair is required, unless you’d like to SSH to the instance and explore the running container.
  • VPC: Choose the default VPC. If unsure, you can find the VPC ID with the IP range 172.31.0.0/16 in the Amazon VPC console.
  • Subnets: Pick a subnet in the default VPC.
  • Security group: Create a new security group with CIDR block 0.0.0.0/0 and port 80 for inbound access.

Leave other fields with the default settings.

Create a simple task definition that relies on the public NGINX container and the role that you created for app1. Specify properties such as the available container resources and port mappings. Note that the command option is used to download and invoke a test script that installs the AWS CLI in the container, runs a number of get-parameters commands, and creates an HTML file with the results.

Replace <account-id> with your account ID, <your-S3-URI> with a link to the S3 object created in step 3 in the following commands:

aws ecs register-task-definition --family access-test --task-role-arn "arn:aws:iam::<account-id>:role/prod-app1" --container-definitions name="access-test",image="nginx",portMappings="[{containerPort=80,hostPort=80,protocol=tcp}]",readonlyRootFilesystem=false,cpu=512,memory=490,essential=true,entryPoint="sh,-c",command="\"/bin/sh -c \\\"apt-get update ; apt-get -y install curl ; curl -O <your-S3-URI> ; chmod +x access-test.sh ; ./access-test.sh ; nginx -g 'daemon off;'\\\"\"" --region us-east-1

aws ecs run-task --cluster access-test --task-definition access-test --count 1 --region us-east-1

Verifying access

After the task is in a running state, check the public DNS name of the instance and navigate to the following page:

http://<ec2-instance-public-DNS-name>/ecs.html

You should see the results of running different access tests from the container after a short duration.

If the test results don’t appear immediately, wait a few seconds and refresh the page.
Make sure that inbound traffic for port 80 is allowed on the security group attached to the instance.

The results you see in the static results HTML page should be the same as running the following commands from the container.

prod.app1.db-pass

aws ssm get-parameters --names prod.app1.db-pass --with-decryption --region us-east-1
aws ssm get-parameters --names prod.app1.db-pass --no-with-decryption --region us-east-1

Both commands should work, as the policy provides access to both the parameter and the required KMS key.

general.license-code

aws ssm get-parameters --names general.license-code --no-with-decryption --region us-east-1
aws ssm get-parameters --names general.license-code --with-decryption --region us-east-1

Only the first command with the “no-with-decryption” parameter should work. The policy allows access to the parameter in Parameter Store but there’s no access to the KMS key. The second command should fail with an access denied error.

prod.app2.user-name

aws ssm get-parameters --names prod.app2.user-name --no-with-decryption --region us-east-1

The command should fail with an access denied error, as there are no permissions associated with the namespace for prod.app2.

Finishing up

Remember to delete all resources (such as the KMS keys and EC2 instance), so that you don’t incur charges.
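
As a sketch, part of the cleanup can be done from the CLI (the key IDs are placeholders); the ECS cluster and its EC2 instance can be removed from the ECS console:

# Delete the parameters
aws ssm delete-parameter --name prod.app1.db-pass --region us-east-1
aws ssm delete-parameter --name general.license-code --region us-east-1
aws ssm delete-parameter --name prod.app2.user-name --region us-east-1

# Schedule the KMS keys for deletion (7 days is the minimum waiting period)
aws kms schedule-key-deletion --key-id <key-id-for-prod-app1-key> --pending-window-in-days 7 --region us-east-1
aws kms schedule-key-deletion --key-id <key-id-for-license-code-key> --pending-window-in-days 7 --region us-east-1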

Conclusion

Central secret management is an important aspect of securing containerized environments. By using Parameter Store and task IAM roles, customers can create a central secret management store and a well-integrated access layer that allows applications to access only the keys they need, to restrict access on a container basis, and to further encrypt secrets with custom keys with KMS.

Whether the secret management layer is implemented with Parameter Store, Amazon S3, Amazon DynamoDB, or a solution such as Vault or KeyWhiz, it’s a vital part of the process of managing and accessing secrets.

Serving Real-Time Machine Learning Predictions on Amazon EMR

Post Syndicated from Derek Graeber original https://aws.amazon.com/blogs/big-data/serving-real-time-machine-learning-predictions-on-amazon-emr/

The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. We want to highlight one of many ways to host those recommendations (for example, see the Analyzing Genomics Data at Scale using R, AWS Lambda, and Amazon API Gateway post).

In this post, we look at one possible way to host a trained ALS model on Amazon EMR using Apache Spark to serve movie predictions in real time. It is a continuation of two recent posts that are prerequisites: the Building a Recommendation Engine post and the Installing and Running JobServer on EMR post.

In future posts we will cover other alternatives for serving real-time machine-learning predictions, namely AWS Lambda and Amazon EC2 Container Service, by running the prediction functions locally and loading the saved models from S3 to the local execution environments.

Walkthrough: Trained ALS model

For this walkthrough, you use the MovieLens dataset as set forth in the Building a Recommendation Engine post; the model should already have been generated and persisted to Amazon S3. That post uses the Alternating Least Squares (ALS) algorithm to train on the data and generate the model.

You take that model and persist it in memory in JobServer on Amazon EMR. After it’s persisted, you can expose RESTful endpoints to AWS Lambda, which in turn can be invoked from a static UI page hosted on S3, securing access with Amazon Cognito.

Here are the steps that you follow:

  1. Create the infrastructure, including EMR with JobServer and Lambda.
  2. Load the trained model into Spark on EMR via JobServer.
  3. Stage a static HTML page on S3.
  4. Access the AWS Lambda endpoints via the static HTML page authenticated with Amazon Cognito.

The following diagram shows the infrastructure architecture.

o_realtime_1_1_1

Create the hosting environment

Because you are already familiar with JobServer on EMR from the prerequisite Installing and Running JobServer post, this post won’t review most of that work. You use EMR 4.7.1, as does the earlier post. At the time of this writing, JobServer does not have a version for Spark 2.0, so you use JobServer v0.6.2.

Be sure to get the updated aws-blog-jobserver-emr project. The project structure has not changed, but code and configuration have been added. Specifically, under the <project> directory, reference the following:

  • <project>/BA – the bootstrap actions to install JobServer on EMR
  • <project>/cfn – the location of the necessary AWS CloudFormation templates
  • <project>/html – the static HTML page to be hosted in S3
  • <project>/jobserver_configs – the configurations for JobServer
  • <project>/ml_commands – sample invocation commands using cURL (for reference only)
  • <project>/policy – the role policies for artifacts
  • <project>/python_lambda – the source code for the AWS Lambda Python code
  • <project>/src – the Scala source code for JobServer

You expose three endpoints in JobServer (and Lambda):

  1. LoadModelAndData – This takes both the model and data stored on S3 and loads them into JobServer. You MUST invoke this endpoint first.
  2. MoviesRec – The recommendation endpoint for all movies for a user (in this case, userId=100).
  3. MoviesRecByGenre – The recommendation endpoint for movies of a specific genre for a specific user (in this case, userId=100).

Use AWS CloudFormation to create the EMR cluster that JobServer runs on and the Lambda functions that access JobServer.

To get started, prep the deployment and review the assumptions.

Prep the deployment and stage configurations for JobServer

You use S3 as the landing point for all necessary configurations for JobServer and Lambda. You MUST be familiar with the JobServer on EMR post and understand the configurations discussed in detail. We touch on them only briefly in this post as we walk you through the following steps. For each step, be sure to note the full S3 object path; a consolidated upload sketch follows the steps.

  1. Compile the JobServer code with Maven and put the output .jar file in the S3 bucket (jobserverEmr-1.0.jar).
  2. Put the file <project>/jobserver_configs/emr_contexts_ml.conf in the S3 bucket.
  3. Put the file <project>/jobserver_configs/emr_v1.6.2.conf in the S3 bucket.
  4. Open the BA at <project>/BA/install_jobserver_ML.sh and update the following properties:
    • EMR_SH_FILE property – the path to the object on S3 to emr_v1.6.2.conf
    • EMR_CONF_FILE property – the path to the object on S3 to emr_contexts_ml.conf
  5. Put the install_jobserver_ML.sh BA in the S3 bucket.
  6. Open the BA at <project>/BA/startServer.sh and update the properties:
    • EMR_JOBSERVER_FILE – object on S3 to jobserverEmr-1.0.jar
  7. Put the startServer.sh BA in the S3 bucket.
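
For reference, here is a consolidated sketch of these uploads with the AWS CLI; the bucket name, prefixes, and local paths are placeholders for your own layout:

S3_BUCKET=s3://<S3folder>
# Compiled JobServer jar (step 1)
aws s3 cp jobserverEmr-1.0.jar $S3_BUCKET/jobserver/jobserverEmr-1.0.jar
# JobServer configurations (steps 2 and 3)
aws s3 cp jobserver_configs/emr_contexts_ml.conf $S3_BUCKET/jobserver/configs/emr_contexts_ml.conf
aws s3 cp jobserver_configs/emr_v1.6.2.conf $S3_BUCKET/jobserver/configs/emr_v1.6.2.conf
# Bootstrap actions (steps 5 and 7), after updating the properties in steps 4 and 6
aws s3 cp BA/install_jobserver_ML.sh $S3_BUCKET/jobserver/BA/install_jobserver_ML.sh
aws s3 cp BA/startServer.sh $S3_BUCKET/jobserver/BA/startServer.sh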

At this point, your JobServer is configured. Right now, you should have noted the following:

  • The full S3 object path to the install_jobserver_ML.sh bootstrap action
  • The full S3 object path to startServer.sh bootstrap action

Now, you can create the Python code for Lambda.

Prep the deployment and stage configurations for Lambda

Create a zip file with the Python code and its dependency, using Python 2.7 for the runtime. Under <project>/python_lambda, the models.py file is the source code. It uses the sjs-client API; install that library locally and include it in the deployment package zip file. For more information, see Creating a Deployment Package (Python).

pip install python-sjsclient -t <project>

Zip the models.py script together with the sjs-client libraries (make sure that models.py is at the root of the archive and not in a subdirectory). In the working environment, name the output .zip file python_lambda.zip or similar. Place this zip file in the S3 bucket and be sure to make a note of the full S3 object path.
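
A minimal sketch of building and staging the package, assuming models.py and the installed sjs-client libraries sit together in one directory:

# Zip from inside the directory so models.py ends up at the root of the archive
cd <project>/python_lambda
zip -r python_lambda.zip .
aws s3 cp python_lambda.zip s3://<S3folder>/ml/lambda/python_lambda.zip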

Now that you have all the dependencies, you can move onto creating the environment for CloudFormation.

Create the JobServer environment

To create the JobServer environment, use CloudFormation and the items you staged on S3. The CloudFormation template is located at <project>/cfn/ml-jobserver.template.

It assumes the following:

  1. The user creating the environment has access to IAM, Route 53, Lambda, and EMR.
  2. It is being run in the default VPC of the account, in a public subnet.
  3. The account already has the EMR_DefaultRole and EMR_EC2_DefaultRole roles defined.

Get the VPCID value of the default VPC and the SubnetID value for the subnet in which the artifacts are created. Provide the following seven parameters as inputs to the CloudFormation template:

  • VPCID – the default VPC
    • vpc-XXXXXXXX
  • SubnetID – the subnet in the VPC to use
    • subnet-XXXXXXXX
  • EMRKeyPair – the name of an EC2 key to be installed on the EMR master node
    • YOUR-EC2-KEYPAIR
  • BaJobServerLoc – the S3 location of the install JobServer BA
    • s3://<S3folder>/jobserver/BA/install_jobserver_ML.sh
  • StepStartJobServerLoc – the S3 location of the startServer.sh BA
    • s3://<S3folder>/jobserver/BA/startServer.sh
  • LambdaS3BucketName – the name of the bucket where the Python code is located
    • <S3folder>
  • LambdaS3BucketKey – the key (without the bucket) where your Python code is located
    • ml/lambda/python_lambda.zip

Browse to the CloudFormation console and execute this template, being sure to supply the required parameters. The template takes approximately 10-12 minutes to complete.

When the template has completed, the EMR cluster is running JobServer with your ML-JobServer code, and the Lambda functions are deployed. You can find the names of the Lambda functions in the Outputs section of the stack; you need these later to host a static webpage in S3 and invoke the endpoints. The three endpoints created correspond to the three access points you deployed in JobServer:

  • LoadDataLambdaARN
  • RecommendationLambdaARN
  • GenreLambdaARN

o_realtime_2

Using the following test event for each of the Lambda functions created, you can test each endpoint (be sure to test LoadDataLambda first); a CLI invocation sketch follows the test event. In this use case, the model you created with Zeppelin was persisted to S3 as denoted, and the MovieLens data is on S3:

{
  "userId": 100,
  "genre": "Comedy",
  "s3DataLoc": "dgraeberaws-blogs/ml/data/movielens/small/",
  "s3ModelLoc": "dgraeberaws-blogs/ml/models/movielens/recommendations/"
}
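
For example, here is a sketch of invoking the load function from the CLI; the function name comes from the CloudFormation outputs, and depending on your CLI version the payload may need to be passed with fileb:// or --cli-binary-format raw-in-base64-out:

# Save the JSON test event above as test-event.json
aws lambda invoke --function-name <LoadDataLambda-function-name> \
  --payload file://test-event.json --region us-east-1 load-output.json
cat load-output.json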

With the server deployed and the model loaded and tested via Lambda test events, you can move on to creating a static HTML page to access these endpoints.

Create the static UI

You host a static HTML page on S3 and authenticate as an unauthenticated (guest) identity with Amazon Cognito. In this use case, you use a federated identity pool, created using the console. You can either create the IAM roles manually or let the Amazon Cognito console create them for you. The unauthenticated role must have a policy attached that allows access to Lambda, as well as the proper Amazon Cognito trust policy (see <project>/policy for samples).

After the identity pool is created, make note of the pool ID. Now, edit the mltest.html code located in <project>/html. Make four updates to give this page access to the Lambda code:

  • The Amazon Cognito identity pool ID
  • The ARN of the LoadDataLambdaARN function (from the CloudFormation output)
  • The ARN of the GenreLambdaARN function (from the CloudFormation output)
  • The ARN of the RecommendationLambdaARN function (from the CloudFormation output)

Edit the mltest.html page with these values, as indicated below:

o_realtime_3
Create an S3 bucket and configure it as a static website. After it’s created, place the mltest.html page in the bucket (with proper permissions for public read access) and browse to that object.
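
A sketch of those steps with the AWS CLI (the bucket name is a placeholder):

aws s3 mb s3://<your-static-site-bucket>
aws s3 website s3://<your-static-site-bucket> --index-document mltest.html
aws s3 cp mltest.html s3://<your-static-site-bucket>/mltest.html --acl public-read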

o_realtime_4

Conclusion

We have shown you one possible way to host a real-time recommendation engine on Amazon EMR with JobServer. You can quickly create infrastructure and load the model data. Also, using RESTful endpoints and AWS Lambda, you can expose the model to end users to provide recommendations in real time. This is only one example of a way to provide real-time predictions using standard technologies. Look for future posts related to Machine Learning on the AWS Big Data Blog.

If you have questions or suggestions, please comment below.


About the Authors


derek_graeber_90_1
Derek Graeber is a Senior Consultant in Big Data & Analytics for AWS Professional Services. He works with enterprise customers to provide leadership on big data projects, helping them realize their business goals when running on AWS. In his spare time, he enjoys spending time with his wife and family, occasionally getting out to reaffirm that he will never be a good golfer.

 

guy_ernest_90
Guy Ernest is a principal business development manager for machine learning in AWS. He has the exciting opportunity to help shape and deliver on a strategy to build mind share and broad use of Amazon’s cloud computing platform for AI, machine learning, and deep learning use cases. In his spare time, he enjoys spending time with his wife and family, gathering embarrassing stories to share in talks about Amazon and the future of AI.



Running Jupyter Notebook and JupyterHub on Amazon EMR

Post Syndicated from Tom Zeng original https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/

Tom Zeng is a Solutions Architect for Amazon EMR

Jupyter Notebook (formerly IPython) is one of the most popular user interfaces for running Python, R, Julia, Scala, and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models. Jupyter notebooks are self-contained documents that can include live code, charts, narrative text, and more. The notebooks can be easily converted to HTML, PDF, and other formats for sharing.

Amazon EMR is a popular hosted big data processing service that allows users to easily run Hadoop, Spark, Presto, and other Hadoop ecosystem applications, such as Hive and Pig.

Python, Scala, and R provide support for Spark and Hadoop, and running them in Jupyter on Amazon EMR makes it easy to take advantage of:

  • the big-data processing capabilities of Hadoop applications.
  • the large selection of Python and R packages for analytics and visualization.

JupyterHub is a multiple-user environment for Jupyter. You can use the following bootstrap action (BA) to install Jupyter and JupyterHub on Amazon EMR:

s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh

These are the supported Jupyter kernels:

  • Python
  • R
  • Scala
  • Apache Toree (which provides the Spark, PySpark, SparkR, and SparkSQL kernels)
  • Julia
  • Ruby
  • JavaScript
  • CoffeeScript
  • Torch

The BA will install Jupyter, JupyterHub, and sample notebooks on the master node.

Commonly used Python and R data science and machine learning packages can be optionally installed on all nodes.

The following arguments can be passed to the BA:

  • --r – Install the IRKernel for R.
  • --toree – Install the Apache Toree kernel that supports Scala, PySpark, SQL, and SparkR for Apache Spark.
  • --julia – Install the IJulia kernel for Julia.
  • --torch – Install the iTorch kernel for Torch (machine learning and visualization).
  • --ruby – Install the iRuby kernel for Ruby.
  • --ds-packages – Install the Python data science-related packages (scikit-learn, pandas, statsmodels).
  • --ml-packages – Install the Python machine learning-related packages (theano, keras, tensorflow).
  • --python-packages – Install specific Python packages (for example, ggplot and nilearn).
  • --port – Set the port for the Jupyter notebook. The default is 8888.
  • --password – Set the password for the Jupyter notebook.
  • --localhost-only – Restrict Jupyter to listen on localhost only. The default is to listen on all IP addresses.
  • --jupyterhub – Install JupyterHub.
  • --jupyterhub-port – Set the port for JupyterHub. The default is 8000.
  • --notebook-dir – Specify the notebook folder. This could be a local directory or an S3 bucket.
  • --cached-install – Use some cached dependency artifacts on S3 to speed up installation.
  • --ssl – Enable SSL. For production, make sure to use your own certificate and key files.
  • --copy-samples – Copy sample notebooks to the notebook folder.
  • --spark-opts – User-supplied Spark options to override the default values.
  • --s3fs – Use s3fs instead of s3nb (the default) for storing notebooks on Amazon S3. This argument can cause slowness if the S3 bucket has lots of files. The Upload file and Create folder menu options do not work with s3nb.

By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000.  The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.

The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.

If you used --jupyterhub, use Linux user accounts to sign in to JupyterHub. (Be sure to create passwords for the Linux users first.) The default admin user for JupyterHub is hadoop, and it can be used to set up other users. The --password option sets the password for Jupyter and for the hadoop user for JupyterHub.
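
For example, here is a minimal sketch of creating an additional JupyterHub user on the master node (the user name is arbitrary):

# Run on the EMR master node
sudo adduser analyst1
sudo passwd analyst1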

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>

The following example CLI command is used to launch a five-node (c3.4xlarge) EMR 5.2.0 cluster with the bootstrap action. The BA will install all the available kernels. It will also install the ggplot and nilearn Python packages and set:

  • the Jupyter port to 8880
  • the password to jupyter
  • the JupyterHub port to 8001
aws emr create-cluster --release-label emr-5.2.0 \
  --name 'emr-5.2.0 sparklyr + jupyter cli example' \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Tez Name=Ganglia Name=Presto \
  --ec2-attributes KeyName=<your-ec2-key>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.4xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.4xlarge \
  --region us-east-1 \
  --log-uri s3://<your-s3-bucket>/emr-logs/ \
  --bootstrap-actions \
    Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8880,--password,jupyter,--jupyterhub,--jupyterhub-port,8001,--cached-install,--notebook-dir,s3://<your-s3-bucket>/notebooks/,--copy-samples]

Replace <your-ec2-key> with the name of your EC2 key pair and <your-s3-bucket> with the S3 bucket where you store notebooks. You can also change the instance types to suit your needs and budget.

If you are using the EMR console to launch a cluster, you can specify the bootstrap action as follows:

o_jupyter_1

When the cluster is available, set up the SSH tunnel and web proxy. The Jupyter notebook should be available at localhost:8880 (as specified in the example CLI command).
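
One option for the tunnel is plain local port forwarding; the following is a sketch for the ports used in this example (the key file and DNS name are placeholders), and a SOCKS proxy with a browser proxy extension works as well:

# Forward the Jupyter (8880) and JupyterHub (8001) ports from the master node
ssh -i ~/<your-ec2-key>.pem -N \
  -L 8880:localhost:8880 \
  -L 8001:localhost:8001 \
  hadoop@<master-public-dns-name>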

o_jupyter_2

After you have signed in, you will see the home page, which displays the notebook files:

o_jupyter_3

If JupyterHub is installed, the Sign in page should be available at port 8001 (as specified in the CLI example):

o_jupyter_4

After you are signed in, you’ll see the JupyterHub and Jupyter home pages are the same. The JupyterHub URL, however, is /user/<username>/tree instead of /tree.

o_jupyter_5

The JupyterHub Admin page is used for managing users:

o_jupyter_6

You can install Jupyter extensions from the Nbextensions tab:

o_jupyter_7

If you specified the --copy-samples option in the BA, you should see sample notebooks on the home page. To try the samples, first open and run the CopySampleDataToHDFS.ipynb notebook to copy some sample data files to HDFS. In the CLI example, --python-packages,'ggplot nilearn' is used to install the ggplot and nilearn packages. You can verify those packages were installed by running the Py-ggplot and PyNilearn notebooks.

The CreateUser.ipynb notebook contains examples for setting up JupyterHub users.

The PySpark.ipynb and ScalaSpark.ipynb notebooks contain the Python and Scala versions of some machine learning examples from the Spark distribution (Logistic Regression, Neural Networks, Random Forest, and Support Vector Machines):

o_jupyter_8

PyHivePrestoHDFS.ipynb shows how to access Hive, Presto, and HDFS in Python. (Be sure to run the CreateHivePrestoS3Tables.ipynb first to create tables.) The %%time and %%timeit cell magics can be used to benchmark Hive and Presto queries (and other executable code):

o_jupyter_9

Here are some other sample notebooks for you to try.

SparkSQLParquetJSON.ipynb:

o_SparkSQLParquetJSON

plot_separating_hyperplane.ipynb:

o_plot_separating_hyperplane

R-SVMLinearNonLinear.ipynb:

o_R-SVMLinearNonLinear

plot_iris.ipynb:

o_plot_iris

Julia-IrisPlot.ipynb:

o_Julia-IrisPlot

PyIrisPlot.ipynb:

o_PyIrisPlot

R-RandomForestVisualization.ipynb:

o_R-RandomForestVisualization

GrangerCausality.ipynb:

o_GrangerCausality

SQLite.ipynb:

o_SQLite

GraphvizDot.ipynb:

o_GraphvizDot

Conclusion

Data scientists who run Jupyter and JupyterHub on Amazon EMR can use Python, R, Julia, and Scala to process, analyze, and visualize big data stored in Amazon S3. Jupyter notebooks can be saved to S3 automatically, so users can shut down and launch new EMR clusters, as needed. EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets. This can greatly reduce the cost of data-science investigations.

If you have questions about using Jupyter and JupyterHub on EMR or would like share your use cases, please leave a comment below.



Interactive Analysis of Genomic Datasets Using Amazon Athena

Post Syndicated from Aaron Friedman original https://aws.amazon.com/blogs/big-data/interactive-analysis-of-genomic-datasets-using-amazon-athena/

Aaron Friedman is a Healthcare and Life Sciences Solutions Architect with Amazon Web Services

The genomics industry is in the midst of a data explosion. Due to the rapid drop in the cost to sequence genomes, genomics is now central to many medical advances. When your genome is sequenced and analyzed, raw sequencing files are processed in a multi-step workflow to identify where your genome differs from a standard reference. Your variations are stored in a Variant Call Format (VCF) file, which is then combined with the VCF files of other individuals to enable population-scale analyses. Many of these datasets are publicly available, and an increasing number are hosted on AWS as part of our Open Data project.

To mine genomic data for new discoveries, researchers in both industry and academia build complex models to analyze populations at scale. When building models, they first explore the datasets-of-interest to understand what questions the data might answer. In this step, interactivity is key, as it allows them to move easily from one question to the next.

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to set up or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (for example, Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know the nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena, and demonstrate how Athena is well suited to address common genomics query patterns. I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

 

Select, aggregate, annotate query pattern in genomics

Genomics researchers may ask different questions of their dataset, such as:

  • What variations in a genome may increase the risk of developing disease?
  • What positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the genomic reference?
  • What variations in a genome influence how an individual may respond to a specific drug treatment?
  • Does a group of individuals contain a higher frequency of a genomic variant known to alter response to a drug relative to the general population?

All these questions, and more, can be generalized under a common query pattern I like to call “Select, Aggregate, Annotate”. Some of our genomics customers, such as Human Longevity, Inc., routinely use this query pattern in their work.

In each of the above queries, you execute the following steps:

SELECT: Specify the cohort of individuals meeting certain criteria (disease, drug response, age, BMI, entire population, etc.).

AGGREGATE: Generate summary statistics of genomic variants across the cohort that you selected.

ANNOTATE: Assign meaning to each of the variants by joining on known information about each variant.

Dataset preparation

Properly organizing your dataset is one of the most critical decisions for enabling fast, interactive analyses. Based on the query pattern I just described, the table representing your population needs to have the following information:

  • A unique sample ID corresponding to each sample in your population
  • Information about each variant, specifically its location in the genome as well as the specific deviation from the reference
  • Information about how many times a variant occurs in a sample (0, 1, or 2 times), as well as whether there are multiple variants at the same site. This is known as a genotype.

The extract, transform, load (ETL) process to generate the appropriate data representation has two main steps. First, you use ADAM, a genomics analysis platform built on top of Spark, to convert the variant information residing in a VCF file to Parquet for easier downstream analytics, in a process similar to the one described in the Will Spark Power the Data behind Precision Medicine? post. Then, you use custom Python code to massage the data and select only the appropriate fields that you need for analysis with Athena.

First, spin up an EMR cluster (version 5.0.3) for the ETL process. I used a c4.xlarge for my master node and m4.4xlarges with 1 TB of scratch for my core nodes.

After you SSH into your master node, clone the git repository. You can also put this in as a bootstrap action when spinning up your cluster.

sudo yum -y install git
git clone https://github.com/awslabs/aws-big-data-blog.git 

You then need to install ADAM and configure it to run with Spark 2.0. In your terminal, enter the following:

# Install Maven
wget http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvzf apache-maven-3.3.9-bin.tar.gz
export PATH=/home/hadoop/apache-maven-3.3.9/bin:$PATH

# Install ADAM
git clone https://github.com/bigdatagenomics/adam.git
cd adam
sed -i 's/2.6.0/2.7.2/' pom.xml
./scripts/move_to_spark_2.sh
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests

export PATH=/home/hadoop/adam/bin:$PATH

Now that ADAM is installed, enter the working directory of the repo that you first cloned, which contains the code relevant to this post. Your next step is to convert a VCF containing all the information about your population.

For this post, you are going to focus on a single region in the genome, defined as chromosome 22, for all 2,504 samples in the Thousand Genomes dataset. As chromosome 22 was the first chromosome to be sequenced as part of the Human Genome Project, it is your first here as well. You can scale the size of your cluster depending on whether you are looking at just chromosome 22, or the entire genome. In the following lines, replace mytestbucket with your bucket name of choice.

PARQUETS3PATH=s3://<mytestbucket>/thousand_genomes/grch37/chr22.parquet/
cd ~/aws-big-data-blog/aws-blog-athena-genomics/etl/thousand_genomes/
chmod +x ./convert_vcf.sh
./convert_vcf.sh $PARQUETS3PATH

After this conversion completes, you can reduce your Parquet file to only the fields you need.

TRIMMEDS3PATH=s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/
spark-submit --master yarn --executor-memory 40G --deploy-mode cluster create_trimmed_parquet.py --input_s3_path $PARQUETS3PATH --output_s3_path $TRIMMEDS3PATH 

Annotation data preparation

For this post, use the ClinVar dataset, which is an archive of information about variation in genomes and health. While Parquet is a preferred format for Athena relative to raw text, using the ClinVar TXT file demonstrates the ability of Athena to read multiple file types.

In a Unix terminal, execute the following commands (replace mytestbucket with your bucket name of choice):

ANNOPATH=s3://<mytestbucket>/clinvar/

wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz

# Need to strip out the first line (header)
zcat variant_summary.txt.gz | sed '1d' > temp ; mv -f temp variant_summary.trim.txt ; gzip variant_summary.trim.txt

aws s3 cp variant_summary.trim.txt.gz $ANNOPATH --sse

Creating Tables using Amazon Athena

In the recent Amazon Athena – Interactive SQL Queries for Data in Amazon S3 post, Jeff Barr covered how to get started with Athena and navigate to the console. Athena uses the Hive DDL when you create, alter, or drop a table. This means you can use the standard Hive syntax for creating tables. These commands can also be found in the aws-blog-athena-genomics GitHub repo under sql/setup.

First, create your database:

CREATE DATABASE demo;

Next, define your table with the appropriate mappings from the previous ETL processes. Be sure to change to your appropriate bucket name.

For your population data:

If you want to skip the ETL process described above for your population data, you can use the following path for your analysis: s3://aws-bigdata-blog/artifacts/athena_genomics.parquet/ – be sure to substitute it in for the LOCATION below.

CREATE EXTERNAL TABLE demo.samplevariants
(
  alternateallele STRING
  ,chromosome STRING
  ,endposition BIGINT
  ,genotype0 STRING
  ,genotype1 STRING
  ,referenceallele STRING
  ,sampleid STRING
  ,startposition BIGINT
)
STORED AS PARQUET
LOCATION 's3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/';

For your annotation data:

CREATE EXTERNAL TABLE demo.clinvar
(
    alleleId STRING
    ,variantType STRING
    ,hgvsName STRING
    ,geneID STRING
    ,geneSymbol STRING
    ,hgncId STRING
    ,clinicalSignificance STRING
    ,clinSigSimple STRING
    ,lastEvaluated STRING
    ,rsId STRING
    ,dbVarId STRING
    ,rcvAccession STRING
    ,phenotypeIds STRING
    ,phenotypeList STRING
    ,origin STRING
    ,originSimple STRING
    ,assembly STRING
    ,chromosomeAccession STRING
    ,chromosome STRING
    ,startPosition INT
    ,endPosition INT
    ,referenceAllele STRING
    ,alternateAllele STRING
    ,cytogenetic STRING
    ,reviewStatus STRING
    ,numberSubmitters STRING
    ,guidelines STRING
    ,testedInGtr STRING
    ,otherIds STRING
    ,submitterCategories STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://<mytestbucket>/clinvar/';

You can confirm that both tables have been created by choosing the eye icon next to a table name in the console. This automatically runs a query akin to SELECT * FROM table LIMIT 10. If it succeeds, then your data has been loaded successfully.
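
The same check can be run outside the console; here is a sketch with the Athena CLI (the results bucket is a placeholder):

QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT * FROM demo.samplevariants LIMIT 10" \
  --result-configuration OutputLocation=s3://<mytestbucket>/athena-results/ \
  --region us-east-1 --query QueryExecutionId --output text)

aws athena get-query-results --query-execution-id $QUERY_ID --region us-east-1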

o_athena_genomes_1

Applying the select, aggregate, annotate paradigm

In this section, I walk you through how to use the query pattern described earlier to answer two common questions in genomics. You examine the frequency of different genotypes, which is a distinct combination of all fields in the samplevariants table with the exception of sampleid. These queries can also be found in the aws-blog-athena-genomics GitHub repo under sql/queries.

Population drug response

What small molecules/drugs are most likely to affect a subpopulation of individuals (ancestry, age, etc.) based on their genomic information?

In this query, assume that you have some phenotype data about your population. In this case, also assume that all samples sharing the pattern “NA12” are part of a specific demographic.

In this query, use sampleid as your predicate pushdown. The general steps are:

  1. Filter by the samples in your subpopulation
  2. Aggregate variant frequencies for the subpopulation-of-interest
  3. Join on ClinVar dataset
  4. Filter by variants that have been implicated in drug-response
  5. Order by highest frequency variants

To answer this question, you can craft the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.rsid
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants
        WHERE sampleid LIKE 'NA12%'))
JOIN demo.clinvar cv
ON sv.chromosome = cv.chromosome
    AND sv.startposition = cv.startposition - 1
    AND sv.endposition = cv.endposition
    AND sv.referenceallele = cv.referenceallele
    AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
    AND cv.clinicalsignificance LIKE '%response%'
    AND sampleid LIKE 'NA12%'
GROUP BY  sv.chromosome
          ,sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,cv.rsid
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

Enter the query into the console and you see something like the following:

o_athena_genomes_2

When you inspect the results, you can quickly see (often in a matter of seconds!) that this population of individuals has a high frequency of variants associated with the metabolism of debrisoquine.

o_athena_genomes_3

Quality control

Are you systematically finding locations in your reference that are called as variation? In other words, what positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the reference? Are any of these variants clinically implicated? If so, could this be a false positive clinical finding?

In this query, the entire population is needed so the SELECT predicate is not used. The general steps are:

  1. Aggregate variant frequencies for the entire Thousand Genomes population
  2. Join on the ClinVar dataset
  3. Filter by variants that have been implicated in disease
  4. Order by the highest frequency variants

This translates into the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.clinicalsignificance
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants))
    JOIN demo.clinvar cv
    ON sv.chromosome = cv.chromosome
        AND sv.startposition = cv.startposition - 1
        AND sv.endposition = cv.endposition
        AND sv.referenceallele = cv.referenceallele
        AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
        AND cv.clinsigsimple='1'
GROUP BY  sv.chromosome ,
          sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

As you can see in the following screenshot, the highest frequency results have conflicting information, being listed both as potentially causing disease and being benign. In genomics, variants with a higher frequency are less likely to cause disease. Your quick analysis allows you to discount the pathogenic clinical significance annotations of these high genotype frequency variants.

o_athena_genomes_4

 

Summary

I hope you have seen how you can easily integrate datasets of different types and values, and combine them to quickly derive new meaning from your data. Do you have a dataset you want to explore or want to learn more about Amazon Athena? Check out our Getting Started Guide and happy querying!

If you have questions or suggestions, please comment below.


About the author

friedman_90
Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with our ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s out hiking or learning a new thing to cook.

 

 

