Tag Archives: Docker

Send ECS Container Logs to CloudWatch Logs for Centralized Monitoring

Post Syndicated from Chris Barclay original http://blogs.aws.amazon.com/application-management/post/TxFRDMTMILAA8X/Send-ECS-Container-Logs-to-CloudWatch-Logs-for-Centralized-Monitoring

My colleagues Brandon Chavis, Pierre Steckmeyer and Chad Schmutzer sent a nice guest post that demonstrates how to send your container logs to a central source for easy troubleshooting and alarming.

 

—–

Amazon EC2 Container Service (Amazon ECS) is a highly scalable, high performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon EC2 instances.

In this multipart blog post, we have chosen to take a universal struggle amongst IT professionals—log collection—and approach it from different angles to highlight possible architectural patterns that facilitate communication and data sharing between containers.

When building applications on ECS, it is a good practice to follow a microservices approach, which encourages designing each application component as a single container. This design improves flexibility and elasticity and leads to a loosely coupled architecture that is resilient and easy to maintain. However, this architectural style makes it important to consider how your containers will communicate and share data with each other.

Why is it useful?

Application logs are useful for many reasons. They are the primary source of troubleshooting information. In the field of security, they are essential to forensics. Web server logs are often leveraged for analysis (at scale) in order to gain insight into usage, audience, and trends.

Centrally collecting container logs is a common problem that can be solved in a number of ways. The Docker community has offered solutions such as having working containers map a shared volume; having a log-collecting container; and getting logs from a container that logs to stdout/stderr and retrieving them with docker logs.

In this post, we present a solution using Amazon CloudWatch Logs. CloudWatch is a monitoring service for AWS cloud resources and the applications you run on AWS. CloudWatch Logs can be used to collect and monitor your logs for specific phrases, values, or patterns. For example, you could set an alarm on the number of errors that occur in your system logs or view graphs of web request latencies from your application logs. An additional advantage is that you get a single pane of glass for all of your monitoring needs, because metrics such as CPU, disk I/O, and network for your container instances are already available in CloudWatch.
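
As an illustration of that kind of alarm, a metric filter and an alarm can be attached to a log group from the AWS CLI. This is only a sketch: the log group name, filter pattern, and metric namespace below are placeholders, not values created by this walkthrough.

$ aws logs put-metric-filter --log-group-name <your_log_group_name> --filter-name HttpdErrors --filter-pattern "error" --metric-transformations metricName=HttpdErrorCount,metricNamespace=ECSLogs,metricValue=1

$ aws cloudwatch put-metric-alarm --alarm-name httpd-error-count --namespace ECSLogs --metric-name HttpdErrorCount --statistic Sum --period 300 --threshold 10 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1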

Here is how we are going to do it

Our approach involves setting up a container whose sole purpose is logging. It runs rsyslog and the CloudWatch Logs agent, and we use Docker Links to communicate with other containers. With this strategy, it becomes easy to link existing application containers such as Apache and have discrete logs per task. This logging container is defined in each ECS task definition, which is a collection of containers running together on the same container instance. With our container log collection strategy, you do not have to modify your Docker image. Any log mechanism tweak is specified in the task definition.

 

Note: This blog provisions a new ECS cluster in order to test the following instructions. Also, please note that we are using the US East (N. Virginia) region throughout this exercise. If you would like to use a different AWS region, please make sure to update your configuration accordingly.

Linking to a CloudWatch logging container

We will create a container that can be deployed as a syslog host. It will accept standard syslog connections on 514/TCP to rsyslog through container links, and will also forward those logs to CloudWatch Logs via the CloudWatch Logs agent. The idea is that this container can be deployed as the logging component in your architecture (not limited to ECS; it could be used for any centralized logging).
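
For reference, letting rsyslog accept remote messages on 514/TCP boils down to two directives in its configuration, shown here in rsyslog's legacy syntax purely as an illustration; the actual settings used by the image live in the .conf files of the companion repository, and these lines are rsyslog directives, not shell commands:

$ModLoad imtcp
$InputTCPServerRun 514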

As a proof of concept, we show you how to deploy a container running httpd, clone some static web content (for this example, we clone the ECS documentation), and have the httpd access and error logs sent to the rsyslog service running on the syslog container via container linking. We also send the Docker and ecs-agent logs from the EC2 instance the task is running on. The logs in turn are sent to CloudWatch Logs via the CloudWatch Logs agent.

Note: Be sure to replace your information throughout the document as necessary (for example: replace "my_docker_hub_repo" with the name of your own Docker Hub repository).

We also assume that all of the following requirements are in place in your AWS account:

A VPC exists for the account

There is an IAM user with permissions to launch EC2 instances and create IAM policies/roles

SSH keys have been generated

Git and Docker are installed on the image building host

The user owns a Docker Hub account and a repository ("my_docker_hub_repo" in this document)

Let’s get started.

Create the Docker image

The first step is to create the Docker image to use as a logging container. For this, all you need is a machine that has Git and Docker installed. You could use your own local machine or an EC2 instance.

Install Git and Docker. The following steps pertain to the Amazon Linux AMI, but you should follow the Git and Docker installation instructions appropriate for your machine.

$ sudo yum update -y && sudo yum -y install git docker

Make sure that the Docker service is running:

$ sudo service docker start

Clone the GitHub repository containing the files you need:

$ git clone https://github.com/awslabs/ecs-cloudwatch-logs.git
$ cd ecs-cloudwatch-logs

You should now have a directory containing two .conf files and a Dockerfile. Feel free to read the content of these files and identify the mechanisms used.
 

Log in to Docker Hub:

$ sudo docker login

Build the container image (replace my_docker_hub_repo with your repository name):

$ sudo docker build -t my_docker_hub_repo/cloudwatchlogs .

Push the image to your repo:

$ sudo docker push my_docker_hub_repo/cloudwatchlogs

Use the build-and-push time to dive deeper into what will live in this container. You can follow along by reading the Dockerfile. Here are a few things worth noting:

The first RUN updates the distribution and installs rsyslog, pip, and curl.

The second RUN downloads the AWS CloudWatch Logs agent.

The third RUN enables remote connections for rsyslog.

The fourth RUN removes the local6 and local7 facilities to prevent duplicate entries. If you don’t do this, you would see every single Apache log entry in /var/log/syslog.

The last RUN specifies which output files will receive the log entries on local6 and local7 (e.g., "if the facility is local6 and it is tagged with httpd, put those into this httpd-access.log file").

We use Supervisor to run more than one process in this container: rsyslog and the CloudWatch Logs agent.

We expose port 514 for rsyslog to collect log entries via the Docker link.
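
Putting those pieces together, a condensed sketch of such a Dockerfile might look like the following. This is illustrative only: the base image, package names, configuration file names, and download URL are assumptions, and the real Dockerfile in the ecs-cloudwatch-logs repository differs in its details.

FROM ubuntu:14.04

# Update the distribution and install rsyslog, pip, curl, and Supervisor
RUN apt-get update && apt-get -y install rsyslog python-pip curl supervisor

# Download the CloudWatch Logs agent setup script
RUN curl -o /usr/local/awslogs-agent-setup.py https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py

# Configuration files from the repository (file names here are placeholders)
COPY rsyslog.conf /etc/rsyslog.d/
COPY supervisord.conf /etc/supervisor/conf.d/

# rsyslog listens for linked containers on this port
EXPOSE 514

# Supervisor starts both rsyslog and the CloudWatch Logs agent
CMD ["/usr/bin/supervisord", "-n"]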

Create an ECS cluster

Now, create an ECS cluster. One way to do this is with the Amazon ECS console first-run wizard, but for now all you need is an empty ECS cluster.

7. Navigate to the ECS console and choose Create cluster. Give it a unique name (such as "ECSCloudWatchLogs"), and choose Create.

Create an IAM role

The next five steps set up a CloudWatch-enabled IAM role with EC2 permissions and spin up a new container instance with this role. All of this can be done manually via the console, or you can run a CloudFormation template. To use the CloudFormation template, navigate to the CloudFormation console, create a new stack by using this template, and go straight to step 14 (just specify the ECS cluster name used above, choose your preferred instance type, select the appropriate EC2 SSH key, and leave the rest unchanged). Otherwise, continue on to step 8.

8. Create an IAM policy for CloudWatch Logs and ECS: point your browser to the IAM console, choose Policies and then Create Policy. Choose Select next to Create Your Own Policy. Give your policy a name (e.g., ECSCloudWatchLogs) and paste the text below as the Policy Document value.

{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"logs:Create*",
"logs:PutLogEvents"
],
"Effect": "Allow",
"Resource": "arn:aws:logs:*:*:*"
},
{
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:RegisterContainerInstance",
"ecs:Submit*",
"ecs:Poll"
],
"Effect": "Allow",
"Resource": "*"
}
]
}

9. Create a new IAM EC2 service role and attach the above policy to it. In IAM, choose Roles, Create New Role. Pick a name for the role (e.g., ECSCloudWatchLogs). Choose Role Type, Amazon EC2. Find and pick the policy you just created, click Next Step, and then Create Role.
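
If you prefer the command line over the console, the same role can be assembled roughly as follows. This is a sketch that attaches the policy inline: ecs-cloudwatch-policy.json is assumed to contain the policy document from step 8, and the trust policy is the standard EC2 one.

$ aws iam create-role --role-name ECSCloudWatchLogs --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

$ aws iam put-role-policy --role-name ECSCloudWatchLogs --policy-name ECSCloudWatchLogs --policy-document file://ecs-cloudwatch-policy.json

$ aws iam create-instance-profile --instance-profile-name ECSCloudWatchLogs

$ aws iam add-role-to-instance-profile --instance-profile-name ECSCloudWatchLogs --role-name ECSCloudWatchLogs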

Launch an EC2 instance and ECS cluster

10. Launch an instance with the Amazon ECS AMI and the above role in the US East (N. Virginia) region. On the EC2 console page, choose Launch Instance. Choose Community AMIs. In the search box, type "amazon-ecs-optimized" and choose Select for the latest version (2015.03.b). Select the appropriate instance type and choose Next.

11. Choose the appropriate Network value for your ECS cluster. Make sure that Auto-assign Public IP is enabled. Choose the IAM role that you just created (e.g., ECSCloudWatchLogs). Expand Advanced Details and in the User data field, add the following while substituting your_cluster_name for the appropriate name:

#!/bin/bash
echo ECS_CLUSTER=your_cluster_name >> /etc/ecs/ecs.config

12. Choose Next: Add Storage, then Next: Tag Instance. You can give your container instance a name on this page. Choose Next: Configure Security Group. On this page, you should make sure that both SSH and HTTP are open to at least your own IP address.

13. Choose Review and Launch, then Launch, and select the appropriate SSH key. Note the instance ID.

14. Ensure that your newly spun-up EC2 instance is part of your container instances (it may take up to a minute for the container instance to register with ECS). In the ECS console, select the appropriate cluster, then select the ECS Instances tab. After a minute or so, you should see a container instance with the instance ID that you just noted.
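
You can also check this from the AWS CLI, assuming the cluster name chosen in step 7; the ARN argument in the second command comes from the output of the first:

$ aws ecs list-container-instances --cluster ECSCloudWatchLogs

$ aws ecs describe-container-instances --cluster ECSCloudWatchLogs --container-instances <container_instance_arn>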

15. On the left pane of the ECS console, choose Task Definitions, then Create new Task Definition. On the JSON tab, paste the code below, overwriting the default text. Make sure to replace "my_docker_hub_repo" with your own Docker Hub repo name and choose Create.

{
"volumes": [
{
"name": "ecs_instance_logs",
"host": {
"sourcePath": "/var/log"
}
}
],
"containerDefinitions": [
{
"environment": [],
"name": "cloudwatchlogs",
"image": "my_docker_hub_repo/cloudwatchlogs",
"cpu": 50,
"portMappings": [],
"memory": 64,
"essential": true,
"mountPoints": [
{
"sourceVolume": "ecs_instance_logs",
"containerPath": "/mnt/ecs_instance_logs",
"readOnly": true
}
]
},
{
"environment": [],
"name": "httpd",
"links": [
"cloudwatchlogs"
],
"image": "httpd",
"cpu": 50,
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
],
"memory": 128,
"entryPoint": ["/bin/bash", "-c"],
"command": [
"apt-get update && apt-get -y install wget && echo 'CustomLog \"| /usr/bin/logger -t httpd -p local6.info -n cloudwatchlogs -P 514\" \"%v %h %l %u %t %r %>s %b %{Referer}i %{User-agent}i\"' >> /usr/local/apache2/conf/httpd.conf && echo 'ErrorLogFormat \"%v [%t] [%l] [pid %P] %F: %E: [client %a] %M\"' >> /usr/local/apache2/conf/httpd.conf && echo 'ErrorLog \"| /usr/bin/logger -t httpd -p local7.info -n cloudwatchlogs -P 514\"' >> /usr/local/apache2/conf/httpd.conf && echo ServerName `hostname` >> /usr/local/apache2/conf/httpd.conf && rm -rf /usr/local/apache2/htdocs/* && cd /usr/local/apache2/htdocs && wget -mkEpnp -nH --cut-dirs=4 http://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html && /usr/local/bin/httpd-foreground"
],
"essential": true
}
],
"family": "cloudwatchlogs"
}

What are some highlights of this task definition?

The sourcePath value allows the CloudWatch Logs agent running in the log collection container to access the host-based Docker and ECS agent log files. You can change the retention period in CloudWatch Logs.
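
For example, retention can be set from the AWS CLI once a log group exists; the log group name depends on the agent configuration baked into the logging image, so the value below is a placeholder:

$ aws logs put-retention-policy --log-group-name <your_log_group_name> --retention-in-days 14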

The cloudwatchlogs container is marked essential, which means that if log collection goes down, so should the application it is collecting from. Similarly, the web server is marked essential as well. You can easily change this behavior.

The command section is a bit lengthy. Let us break it down:

We first install wget so that we can later clone the ECS documentation for display on our web server.

We then write four lines to httpd.conf. These are the echo commands. They describe how httpd will generate log files and their format. Notice how we tag (-t httpd) these entries with httpd and assign them a specific facility (-p localX.info). We also specify that logger is to send these entries to the host given by -n (cloudwatchlogs) on the port given by -P (514). This is handled by linking. Hence, port 514 is left untouched on the machine and we could have as many of these logging containers running as we want.

%h %l %u %t %r %>s %b %{Referer}i %{User-agent}i should look fairly familiar to anyone who has looked into tweaking Apache logs. The initial %v is the server name and it will be replaced by the container ID. This is how we are able to discern what container the logs come from in CloudWatch Logs.

We remove the default httpd landing page with rm -rf.

We instead use wget to download a clone of the ECS documentation.

And, finally, we start httpd. Note that we redirect httpd log files in our task definition at the command level for the httpd image. Applying the same concept to another image would simply require you to know where your application maintains its log files.

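If you want to sanity-check the plumbing, you can run the same logger invocation by hand from inside the httpd container. This is a hypothetical test command; the cloudwatchlogs host name only resolves there because of the container link:

$ echo "test entry" | /usr/bin/logger -t httpd -p local6.info -n cloudwatchlogs -P 514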

Create a service

16. On the Services tab in the ECS console, choose Create. Choose the task definition created in step 15, name the service, and set the number of tasks to 1. Choose Create service.

17. The task will start running shortly. You can press the refresh icon on your service’s Tasks tab. After the status says "Running", choose the task and expand the httpd container. The container instance IP will be a hyperlink under the Network bindings section’s External link. When you select the link you should see a clone of the Amazon ECS documentation. You are viewing this thanks to the httpd container running on your ECS cluster.

18. Open the CloudWatch Logs console to view the new ECS log entries.
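
The same entries can also be pulled from the AWS CLI; the log group and stream names are placeholders, since they depend on the agent configuration in the logging image:

$ aws logs describe-log-groups

$ aws logs describe-log-streams --log-group-name <your_log_group_name>

$ aws logs get-log-events --log-group-name <your_log_group_name> --log-stream-name <your_log_stream_name>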

Conclusion

If you have followed all of these steps, you should now have a two-container task running in your ECS cluster. One container serves web pages while the other collects the log activity from the web container and sends it to CloudWatch Logs. Such a setup can be replicated with any other application: all you need to do is specify a different container image and describe the expected log files in the command section.

Our docker & screen development environment

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2015/05/05/our-docker--screen-development-environment/

The raintank software stack is comprised of a number of different components that all work together to deliver our platform. Some of our components include:
front-end visualization and dashboarding (based on Grafana)

network performance data "collectors"

event and metric processing and persisting

core RESTful API

To support these services we have a number of shared services, including Elasticsearch, InfluxDB, MySQL, RabbitMQ, and Redis. As the stack has grown it has become increasingly more difficult to manage our development environments.

Grafana 2.0 Released

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2015/04/20/grafana-2.0-released/

Grafana 2.0 stable is now released and available for download.
We provide deb, rpm and generic binary tar packages for download as well as APT and YUM repositories. If your platform is not included, check the Building from source guide.
Download page

Installing on Debian / Ubuntu

Installing on RPM-based Linux (CentOS, Fedora, OpenSuse, RedHat)

Installing via Docker

Migrating from 1.x to 2.x

What’s New in Grafana v2.

Set up a build pipeline with Jenkins and Amazon ECS

Post Syndicated from Chris Barclay original http://blogs.aws.amazon.com/application-management/post/Tx32RHFZHXY6ME1/Set-up-a-build-pipeline-with-Jenkins-and-Amazon-ECS

My colleague Daniele Stroppa sent a nice guest post that demonstrates how to use Jenkins to build Docker images for Amazon EC2 Container Service.

 

—–

 

In this walkthrough, we’ll show you how to set up and configure a build pipeline using Jenkins and the Amazon EC2 Container Service (ECS).

 

We’ll be using a sample Python application, available on GitHub. The repository contains a simple Dockerfile that uses a Python base image and runs our application:

FROM python:2-onbuild
CMD [ "python", "./application.py" ]

This Dockerfile is used by the build pipeline to create a new Docker image upon pushing code to the repository. The built image will then be used to start a new service on an ECS cluster.

 

For the purpose of this walkthrough, fork the py-flask-signup-docker repository to your account.

 

Set up the build environment

For our build environment we’ll launch an Amazon EC2 instance using the Amazon Linux AMI and install and configure the required packages. Make sure that the security group you select for your instance allows traffic on ports TCP/22 and TCP/80.

 

Install and configure Jenkins, Docker and Nginx

Connect to your instance using your private key and switch to the root user. First, let’s update the repositories and install Docker, Nginx and Git.

# yum update -y
# yum install -y docker nginx git

To install Jenkins on Amazon Linux, we need to add the Jenkins repository and install Jenkins from there.

# wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo
# rpm --import http://pkg.jenkins-ci.org/redhat/jenkins-ci.org.key
# yum install jenkins

As Jenkins typically uses port TCP/8080, we’ll configure Nginx as a proxy. Edit the Nginx config file (/etc/nginx/nginx.conf) and change the server configuration to look like this:

server {
listen 80;
server_name _;

location / {
proxy_pass http://127.0.0.1:8080;
}
}

We’ll be using Jenkins to build our Docker images, so we need to add the jenkins user to the docker group. A reboot may be required for the changes to take effect.

# usermod -a -G docker jenkins

Start the Docker, Jenkins and Nginx services and make sure they will be running after a reboot:

# service docker start
# service jenkins start
# service nginx start
# chkconfig docker on
# chkconfig jenkins on
# chkconfig nginx on

You can launch the Jenkins instance complete with all the required plugins with this CloudFormation template.

 

Point your browser to the public DNS name of your EC2 instance (e.g. ec2-54-163-4-211.compute-1.amazonaws.com) and you should be able to see the Jenkins home page:

 

 

The Jenkins installation is currently accessible through the Internet without any form of authentication. Before proceeding to the next step, let’s secure Jenkins. Select Manage Jenkins on the Jenkins home page, click Configure Global Security and then enable Jenkins security by selecting the Enable Security checkbox.

 

For the purpose of this walkthrough, select Jenkins’s Own User Database under Security realm and make sure to select the Allow users to sign up checkbox. Under Authorization, select Matrix-based security. Add a user (e.g. admin) and provide necessary privileges to this user.

 

 

After that’s complete, save your changes. You will now be asked to provide a username and password to log in. Click Create an account, provide your username – i.e. admin – and fill in the user details. You will then be able to log in securely to Jenkins.

 

Install and configure Jenkins plugins

The last step in setting up our build environment is to install and configure the Jenkins plugins required to build a Docker image and publish it to a Docker registry (DockerHub in our case). We’ll also need a plugin to interact with the code repository of our choice, GitHub in our case.

 

From the Jenkins dashboard select Manage Jenkins and click Manage Plugins. On the Available tab, search for and select the following plugins:

Docker Build and Publish plugin

dockerhub plugin

Github plugin

Then click the Install button. After the plugin installation is completed, select Manage Jenkins from the Jenkins dashboard and click Configure System. Look for the Docker Image Builder section and fill in your Docker registry (DockerHub) credentials:

 

 

Install and configure the Amazon ECS CLI

Now we are ready to set up and configure the ECS Command Line Interface (CLI). The sample application creates and uses an Amazon DynamoDB table to store signup information, so make sure that the IAM role that you create for the EC2 instances allows the dynamodb:* action.

 

Follow the Setting Up with Amazon ECS guide to get ready to use ECS. If you haven’t done so yet, make sure to start at least one container instance in your account and create the Amazon ECS service role in the AWS IAM console.

 

Make sure that Jenkins is able to use the ECS CLI. Switch to the jenkins user and configure the AWS CLI, providing your credentials:

# sudo -su jenkins
> aws configure

Login to Docker Hub
The Jenkins user needs to log in to Docker Hub before doing the first build:

# docker login

Create a task definition template

Create a task definition template for our application (note, you will replace the image name with your own repository):

{
"family": "flask-signup",
"containerDefinitions": [
{
"image": "your-repository/flask-signup:v_%BUILD_NUMBER%",
"name": "flask-signup",
"cpu": 10,
"memory": 256,
"essential": true,
"portMappings": [
{
"containerPort": 5000,
"hostPort": 80
}
]
}
]
}

Save your task definition template as flask-signup.json. Since the image specified in the task definition template will be built in the Jenkins job, at this point we will create a dummy task definition. Substitute the %BUILD_NUMBER% parameter in your task definition template with a non-existent value (0) and register it with ECS:

# sed -e "s;%BUILD_NUMBER%;0;g" flask-signup.json > flask-signup-v_0.json
# aws ecs register-task-definition --cli-input-json file://flask-signup-v_0.json
{
"taskDefinition": {
"volumes": [],
"taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/flask-signup:1",
"containerDefinitions": [
{
"name": "flask-signup",
"image": "your-repository/flask-signup:v_0",
"cpu": 10,
"portMappings": [
{
"containerPort": 5000,
"hostPort": 80
}
],
"memory": 256,
"essential": true
}
],
"family": "flask-signup",
"revision": 1
}
}

Make note of the family value (flask-signup), as it will be needed when configuring the Execute shell step in the Jenkins job.

 

Create the ECS IAM Role, an ELB and your service definition

Create a new IAM role (e.g. ecs-service-role), select the Amazon EC2 Container Service Role type and attach the AmazonEC2ContainerServiceRole policy. This allows ECS to create and manage AWS resources, such as an ELB, on your behalf. Create an Amazon Elastic Load Balancing (ELB) load balancer to be used in your service definition and note the ELB name (e.g. elb-flask-signup-1985465812). Create the flask-signup-service service, specifying the task definition (e.g. flask-signup) and the ELB name (e.g. elb-flask-signup-1985465812):

# aws ecs create-service --cluster default --service-name flask-signup-service --task-definition flask-signup --load-balancers loadBalancerName=elb-flask-signup-1985465812,containerName=flask-signup,containerPort=5000 --role ecs-service-role --desired-count 0
{
"service": {
"status": "ACTIVE",
"taskDefinition": "arn:aws:ecs:us-east-1:123456789012:task-definition/flask-signup:1",
"desiredCount": 0,
"serviceName": "flask-signup-service",
"clusterArn": "arn:aws:ecs:us-east-1:123456789012:cluster/default",
"serviceArn": "arn:aws:ecs:us-east-1:123456789012:service/flask-signup-service",
"runningCount": 0
}
}

Since we have not yet built a Docker image for our task, make sure to set the --desired-count flag to 0.

 

Configure the Jenkins build

On the Jenkins dashboard, click on New Item, select the Freestyle project job, add a name for the job, and click OK. Configure the Jenkins job:

Under GitHub Project, add the path of your GitHub repository – e.g. https://github.com/awslabs/py-flask-signup-docker. In addition to the application source code, the repository contains the Dockerfile used to build the image, as explained at the beginning of this walkthrough. 

Under Source Code Management provide the Repository URL for Git, e.g. https://github.com/awslabs/py-flask-signup-docker.

In the Build Triggers section, select Build when a change is pushed to GitHub.

In the Build section, add a Docker build and publish step to the job and configure it to publish to your Docker registry repository (e.g. DockerHub) and add a tag to identify the image (e.g. v_$BUILD_NUMBER). 

 

The Repository Name specifies the name of the Docker repository where the image will be published; this is composed of a user name (dstroppa) and an image name (flask-signup). In our case, the Dockerfile sits in the root path of our repository, so we won’t specify any path in the Directory Dockerfile is in field. Note, the repository name needs to be the same as what is used in the task definition template in flask-signup.json.

Add an Execute Shell step and add the ECS CLI commands to start a new task on your ECS cluster. 

The script for the Execute shell step will look like this:

#!/bin/bash
SERVICE_NAME="flask-signup-service"
IMAGE_VERSION="v_"${BUILD_NUMBER}
TASK_FAMILY="flask-signup"

# Create a new task definition for this build
sed -e "s;%BUILD_NUMBER%;${BUILD_NUMBER};g" flask-signup.json > flask-signup-v_${BUILD_NUMBER}.json
aws ecs register-task-definition --family flask-signup --cli-input-json file://flask-signup-v_${BUILD_NUMBER}.json

# Update the service with the new task definition and desired count
TASK_REVISION=`aws ecs describe-task-definition --task-definition flask-signup | egrep "revision" | tr "/" " " | awk '{print $2}' | sed 's/"$//'`
DESIRED_COUNT=`aws ecs describe-services --services ${SERVICE_NAME} | egrep "desiredCount" | tr "/" " " | awk '{print $2}' | sed 's/,$//'`
if [ ${DESIRED_COUNT} = "0" ]; then
DESIRED_COUNT="1"
fi

aws ecs update-service --cluster default --service ${SERVICE_NAME} --task-definition ${TASK_FAMILY}:${TASK_REVISION} --desired-count ${DESIRED_COUNT}
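
As a side note, the egrep/awk/sed parsing above can be replaced with the AWS CLI's built-in --query option (JMESPath), which is less fragile. A sketch of the equivalent assignments, using the same variables as the script:

TASK_REVISION=`aws ecs describe-task-definition --task-definition ${TASK_FAMILY} --query 'taskDefinition.revision' --output text`
DESIRED_COUNT=`aws ecs describe-services --cluster default --services ${SERVICE_NAME} --query 'services[0].desiredCount' --output text`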

To trigger the build process on Jenkins upon pushing to the GitHub repository we need to configure a service hook on GitHub. Go to the GitHub repository settings page, select Webhooks and Services and add a service hook for Jenkins (GitHub plugin). Add the Jenkins hook url: http://<username>:<password>@<EC2-DNS-Name>/github-webhook/.

 

 

Now we have configured the Jenkins job so that whenever a change is committed to the GitHub repository, it will trigger the build process on Jenkins.

 

Happy building

From your local repository, push the application code to GitHub:

# git add *
# git commit -m "Kicking off Jenkins build"
# git push origin master

This will trigger the Jenkins job. After the job is completed, point your browser to the public DNS name for your EC2 container instance and verify that the application is correctly running:

 

Conclusion

In this walkthrough we demonstrated how to use Jenkins to automate the deployment of an ECS service. See the documentation for further information on Amazon ECS.

 

 

systemd For Administrators, Part XXI

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/systemd-for-administrators-part-xxi.html

Container Integration
Containers have been one of the hot topics on Linux for a while now. Container managers such as libvirt-lxc, LXC or Docker are widely known and used these days. In this blog story I want to shed some light on systemd's integration points with container managers, to allow seamless management of services across container boundaries. We'll focus on OS containers here, i.e. the case where an init system runs inside the container, and the container hence in most ways appears like an independent system of its own. Much of what I describe here is available on pretty much any container manager that implements the logic described here, including libvirt-lxc. However, to make things easy we'll focus on systemd-nspawn, the mini-container manager that is shipped with systemd itself. systemd-nspawn uses the same kernel interfaces as the other container managers, but is less flexible, as it is designed to be a container manager that is as simple to use as possible and "just works", rather than a generic tool you can configure in every low-level detail. We use systemd-nspawn extensively when developing systemd.
Anyway, so let’s get started with our run-through. Let’s start by
creating a Fedora container tree in a subdirectory:
# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in /srv/mycontainer. This command line is Fedora-specific, but most distributions provide similar functionality in one way or another. The examples section in the systemd-nspawn(1) man page contains a list of the various command lines for other distributions.
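For example, on a Debian or Ubuntu host a comparable tree could be created with debootstrap (an illustrative command, not taken from the original post):
# debootstrap --arch=amd64 stable /srv/mycontainer-debian http://deb.debian.org/debian/
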
Now that the new container is installed, let's set an initial root password:
# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then
use passwd to set the root password. After that the initial setup is done,
hence let’s boot it up and log in as root with our new password:
$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[ OK ] Reached target Remote File Systems.
[ OK ] Created slice Root Slice.
[ OK ] Created slice User and Session Slice.
[ OK ] Created slice System Slice.
[ OK ] Created slice system-getty.slice.
[ OK ] Reached target Slices.
[ OK ] Listening on Delayed Shutdown Socket.
[ OK ] Listening on /dev/initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket.
Starting Journal Service…
[ OK ] Started Journal Service.
[ OK ] Reached target Paths.
Mounting Debug File System…
Mounting Configuration File System…
Mounting FUSE Control File System…
Starting Create static device nodes in /dev…
Mounting POSIX Message Queue File System…
Mounting Huge Pages File System…
[ OK ] Reached target Encrypted Volumes.
[ OK ] Reached target Swap.
Mounting Temporary Directory…
Starting Load/Save Random Seed…
[ OK ] Mounted Configuration File System.
[ OK ] Mounted FUSE Control File System.
[ OK ] Mounted Temporary Directory.
[ OK ] Mounted POSIX Message Queue File System.
[ OK ] Mounted Debug File System.
[ OK ] Mounted Huge Pages File System.
[ OK ] Started Load/Save Random Seed.
[ OK ] Started Create static device nodes in /dev.
[ OK ] Reached target Local File Systems (Pre).
[ OK ] Reached target Local File Systems.
Starting Trigger Flushing of Journal to Persistent Storage…
Starting Recreate Volatile Files and Directories…
[ OK ] Started Recreate Volatile Files and Directories.
Starting Update UTMP about System Reboot/Shutdown…
[ OK ] Started Trigger Flushing of Journal to Persistent Storage.
[ OK ] Started Update UTMP about System Reboot/Shutdown.
[ OK ] Reached target System Initialization.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Login Service…
Starting Permit User Sessions…
Starting D-Bus System Message Bus…
[ OK ] Started D-Bus System Message Bus.
Starting Cleanup of Temporary Directories…
[ OK ] Started Cleanup of Temporary Directories.
[ OK ] Started Permit User Sessions.
Starting Console Getty…
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
[ OK ] Started Login Service.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container
integration of systemd. Let’s have a look at the first tool,
machinectl. When run without parameters it shows a list of all
locally running containers:
$ machinectl
MACHINE CONTAINER SERVICE
mycontainer container nspawn

1 machines listed.

The “status” subcommand shows details about the container:
$ machinectl status mycontainer
mycontainer:
Since: Mi 2014-11-12 16:47:19 CET; 51s ago
Leader: 5374 (systemd)
Service: nspawn; class container
Root: /srv/mycontainer
Address: 192.168.178.38
10.36.6.162
fd00::523f:56ff:fe00:4994
fe80::523f:56ff:fe00:4994
OS: Fedora 20 (Heisenbug)
Unit: machine-mycontainer.scope
├─5374 /usr/lib/systemd/systemd
└─system.slice
├─dbus.service
│ └─5414 /bin/dbus-daemon –system –address=systemd: –nofork –nopidfile –systemd-act…
├─systemd-journald.service
│ └─5383 /usr/lib/systemd/systemd-journald
├─systemd-logind.service
│ └─5411 /usr/lib/systemd/systemd-logind
└─console-getty.service
└─5416 /sbin/agetty –noclear -s console 115200 38400 9600

With this we see some interesting information about the container,
including its control group tree (with processes), IP addresses and
root directory.
The “login” subcommand gets us a new login shell in the container:
# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The “reboot” subcommand reboots the container:
# machinectl reboot mycontainer

The “poweroff” subcommand powers the container off:
# machinectl poweroff mycontainer

So much about the machinectl tool. The tool knows a couple of more
commands, please check the man
page

for details. Note again that even though we use systemd-nspawn as
container manager here the concepts apply to any container manager
that implements the logic described
here
,
including libvirt-lxc for example.
machinectl is not the only tool that is useful in conjunction with
containers. Many of systemd’s own tools have been updated to
explicitly support containers too! Let’s try this (after starting the
container up again first, repeating the systemd-nspawn command from
above.):
# hostnamectl -M mycontainer set-hostname "wuff"

This uses
hostnamectl(1)
on the local container and sets its hostname.
Similar, many other tools have been updated for connecting to local
containers. Here’s
systemctl(1)‘s -M switch
in action:
# systemctl -M mycontainer
UNIT LOAD ACTIVE SUB DESCRIPTION
-.mount loaded active mounted /
dev-hugepages.mount loaded active mounted Huge Pages File System
dev-mqueue.mount loaded active mounted POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted /proc/sys/kernel/random/boot_id
[…]
time-sync.target loaded active active System Time Synchronized
timers.target loaded active active Timers
systemd-tmpfiles-clean.timer loaded active waiting Daily Cleanup of Temporary Directories

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass –all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified
container, not the host. (Output is shortened here, the blog story is
already getting too long).
Let’s use this to restart a service within our container:
# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support though than just the -M
switch. With the -r switch it shows the units running on the host,
plus all units of all local, running containers:
# systemctl -r
UNIT LOAD ACTIVE SUB DESCRIPTION
boot.automount loaded active waiting EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount loaded active waiting Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0x2dLVDSx2d1-intel_backlight.device loaded active plugged /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[…]
timers.target loaded active active Timers
mandb.timer loaded active waiting Daily man-db cache update
systemd-tmpfiles-clean.timer loaded active waiting Daily Cleanup of Temporary Directories
mycontainer:-.mount loaded active mounted /
mycontainer:dev-hugepages.mount loaded active mounted Huge Pages File System
mycontainer:dev-mqueue.mount loaded active mounted POSIX Message Queue File System
[…]
mycontainer:time-sync.target loaded active active System Time Synchronized
mycontainer:timers.target loaded active active Timers
mycontainer:systemd-tmpfiles-clean.timer loaded active waiting Daily Cleanup of Temporary Directories

LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass –all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, then followed by the
units of the one container we have currently running. The units of the
containers are prefixed with the container name, and a colon
(“:”). (The output is shortened again for brevity’s sake.)
The list-machines subcommand of systemctl shows a list of all
running containers, inquiring the system managers within the containers
about system state and health. More specifically it shows if
containers are properly booted up, or if there are any failed
services:
# systemctl list-machines
NAME STATE FAILED JOBS
delta (host) running 0 0
mycontainer running 0 0
miau degraded 1 0
waldi running 0 0

4 machines listed.

To make things more interesting we have started two more containers in
parallel. One of them has a failed service, which results in the
machine state to be degraded.
Let’s have a look at
journalctl(1)‘s
container support. It too supports -M to show the logs of a specific
container:
# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes…
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the
host and all local containers:
# journalctl -m -e

(Let’s skip the output here completely, I figure you can extrapolate
how this looks.)
But it’s not only systemd’s own tools that understand container
support these days, procps sports support for it, too:
# ps -eo pid,machine,args
PID MACHINE COMMAND
1 – /usr/lib/systemd/systemd –switched-root –system –deserialize 20
[…]
2915 – emacs contents/projects/containers.md
3403 – [kworker/u16:7]
3415 – [kworker/u16:9]
4501 – /usr/libexec/nm-vpnc-service
4519 – /usr/sbin/vpnc –non-inter –no-detach –pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid –
4749 – /usr/libexec/dconf-service
4980 – /usr/lib/systemd/systemd-resolved
5006 – /usr/lib64/firefox/firefox
5168 – [kworker/u16:0]
5192 – [kworker/u16:4]
5193 – [kworker/u16:5]
5497 – [kworker/u16:1]
5591 – [kworker/u16:8]
5711 – sudo -s
5715 – /bin/bash
5749 – /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer /usr/lib/systemd/systemd
5799 mycontainer /usr/lib/systemd/systemd-journald
5862 mycontainer /usr/lib/systemd/systemd-logind
5863 mycontainer /bin/dbus-daemon –system –address=systemd: –nofork –nopidfile –systemd-activation
5868 mycontainer /sbin/agetty –noclear –keep-baud console 115200 38400 9600 vt102
5871 mycontainer /usr/sbin/sshd -D
6527 mycontainer /usr/lib/systemd/systemd-resolved
[…]

This shows a process list (shortened). The second column shows the
container a process belongs to. All processes shown with “-” belong to
the host itself.
But it doesn’t stop there. The new “sd-bus” D-Bus client library we
have been preparing in the systemd/kdbus context knows containers
too. While you use sd_bus_open_system() to connect to your local
host’s system bus sd_bus_open_system_container() may be used to
connect to the system bus of any local container, so that you can
execute bus methods on it.
sd-login.h
and machined’s bus
interface

provide a number of APIs to add container support to other programs
too. They support enumeration of containers as well as retrieving the
machine name from a PID and similar.
systemd-networkd also has support for containers. When run inside a
container it will by default run a DHCP client and IPv4LL on any veth
network interface named host0 (this interface is special under the
logic described here). When run on the host networkd will by default
provide a DHCP server and IPv4LL on veth network interface named ve-
followed by a container name.
Let’s have a look at one last facet of systemd’s container
integration: the hook-up with the name service switch. Recent systemd
versions contain a new NSS module nss-mymachines that make the names
of all local containers resolvable via gethostbyname() and
getaddrinfo(). This only applies to containers that run within their
own network namespace. With the systemd-nspawn command shown above the
the container shares the network configuration with the host however;
hence let’s restart the container, this time with a virtual veth
network link between host and container:
# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer –network-veth -b

Now, (assuming that networkd is used in the container and outside) we
can already ping the container using its name, due to the simple magic
of nss-mymachines:
# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping, it works with
all other tools that use libc gethostbyname() or getaddrinfo()
too, among them venerable ssh.
And this is pretty much all I want to cover for now. We briefly
touched a variety of integration points, and there’s a lot more still
if you look closely. We are working on even more container integration
all the time, so expect more new features in this area with every
systemd release.
Note that the whole machine concept is actually not limited to
containers, but covers VMs too to a certain degree. However, the
integration is not as close, as access to a VM’s internals is not as
easy as for containers, as it usually requires a network transport
instead of allowing direct syscall access.
Anyway, I hope this is useful. For further details, please have a look
at the linked man pages and other documentation.

systemd For Administrators, Part XXI

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/systemd-for-administrators-part-xxi.html

Container Integration

For a while now, containers have been one of the hot topics on
Linux. Container managers such as libvirt-lxc, LXC or Docker are
widely known and used these days. In this blog story I want to shed
some light on systemd’s integration points with container managers, to
allow seamless management of services across container boundaries.

We’ll focus on OS containers here, i.e. the case where an init system
runs inside the container, and the container hence in most ways
appears like an independent system of its own. Much of what I describe
here is available on pretty much any container manager that implements
the logic described
here, including libvirt-lxc. However, to make things easy we’ll focus on
systemd-nspawn,
the mini-container manager that is shipped with systemd
itself. systemd-nspawn uses the same kernel interfaces as the other
container managers, but is less flexible, as it is designed to be a
container manager that is as simple to use as possible and “just
works”, rather than trying to be a generic tool you can configure in
every low-level detail. We use systemd-nspawn extensively when
developing systemd.

Anyway, so let’s get started with our run-through. Let’s start by
creating a Fedora container tree in a subdirectory:

# yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal

This downloads a minimal Fedora system and installs it in
/srv/mycontainer. This command line is Fedora-specific, but most
distributions provide similar functionality in one way or another. The
examples section in the systemd-nspawn(1) man page
contains a list of the various command lines for other distributions.

We now have the new container installed; let’s set an initial root password:

# systemd-nspawn -D /srv/mycontainer
Spawning container mycontainer on /srv/mycontainer
Press ^] three times within 1s to kill container.
-bash-4.2# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
-bash-4.2# ^D
Container mycontainer exited successfully.
#

We use systemd-nspawn here to get a shell in the container, and then
use passwd to set the root password. With that, the initial setup is done,
so let’s boot it up and log in as root with our new password:

$ systemd-nspawn -D /srv/mycontainer -b
Spawning container mycontainer on /srv/mycontainer.
Press ^] three times within 1s to kill container.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.

Welcome to Fedora 20 (Heisenbug)!

[  OK  ] Reached target Remote File Systems.
[  OK  ] Created slice Root Slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Created slice System Slice.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Reached target Slices.
[  OK  ] Listening on Delayed Shutdown Socket.
[  OK  ] Listening on /dev/initctl Compatibility Named Pipe.
[  OK  ] Listening on Journal Socket.
         Starting Journal Service...
[  OK  ] Started Journal Service.
[  OK  ] Reached target Paths.
         Mounting Debug File System...
         Mounting Configuration File System...
         Mounting FUSE Control File System...
         Starting Create static device nodes in /dev...
         Mounting POSIX Message Queue File System...
         Mounting Huge Pages File System...
[  OK  ] Reached target Encrypted Volumes.
[  OK  ] Reached target Swap.
         Mounting Temporary Directory...
         Starting Load/Save Random Seed...
[  OK  ] Mounted Configuration File System.
[  OK  ] Mounted FUSE Control File System.
[  OK  ] Mounted Temporary Directory.
[  OK  ] Mounted POSIX Message Queue File System.
[  OK  ] Mounted Debug File System.
[  OK  ] Mounted Huge Pages File System.
[  OK  ] Started Load/Save Random Seed.
[  OK  ] Started Create static device nodes in /dev.
[  OK  ] Reached target Local File Systems (Pre).
[  OK  ] Reached target Local File Systems.
         Starting Trigger Flushing of Journal to Persistent Storage...
         Starting Recreate Volatile Files and Directories...
[  OK  ] Started Recreate Volatile Files and Directories.
         Starting Update UTMP about System Reboot/Shutdown...
[  OK  ] Started Trigger Flushing of Journal to Persistent Storage.
[  OK  ] Started Update UTMP about System Reboot/Shutdown.
[  OK  ] Reached target System Initialization.
[  OK  ] Reached target Timers.
[  OK  ] Listening on D-Bus System Message Bus Socket.
[  OK  ] Reached target Sockets.
[  OK  ] Reached target Basic System.
         Starting Login Service...
         Starting Permit User Sessions...
         Starting D-Bus System Message Bus...
[  OK  ] Started D-Bus System Message Bus.
         Starting Cleanup of Temporary Directories...
[  OK  ] Started Cleanup of Temporary Directories.
[  OK  ] Started Permit User Sessions.
         Starting Console Getty...
[  OK  ] Started Console Getty.
[  OK  ] Reached target Login Prompts.
[  OK  ] Started Login Service.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (console)

mycontainer login: root
Password:
-bash-4.2#

Now we have everything ready to play around with the container
integration of systemd. Let’s have a look at the first tool,
machinectl. When run without parameters it shows a list of all
locally running containers:

$ machinectl
MACHINE                          CONTAINER SERVICE
mycontainer                      container nspawn

1 machines listed.

The “status” subcommand shows details about the container:

$ machinectl status mycontainer
mycontainer:
       Since: Mi 2014-11-12 16:47:19 CET; 51s ago
      Leader: 5374 (systemd)
     Service: nspawn; class container
        Root: /srv/mycontainer
     Address: 192.168.178.38
              10.36.6.162
              fd00::523f:56ff:fe00:4994
              fe80::523f:56ff:fe00:4994
          OS: Fedora 20 (Heisenbug)
        Unit: machine-mycontainer.scope
              ├─5374 /usr/lib/systemd/systemd
              └─system.slice
                ├─dbus.service
                │ └─5414 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-act...
                ├─systemd-journald.service
                │ └─5383 /usr/lib/systemd/systemd-journald
                ├─systemd-logind.service
                │ └─5411 /usr/lib/systemd/systemd-logind
                └─console-getty.service
                  └─5416 /sbin/agetty --noclear -s console 115200 38400 9600

With this we see some interesting information about the container,
including its control group tree (with processes), IP addresses and
root directory.

The “login” subcommand gets us a new login shell in the container:

# machinectl login mycontainer
Connected to container mycontainer. Press ^] three times within 1s to exit session.

Fedora release 20 (Heisenbug)
Kernel 3.18.0-0.rc4.git0.1.fc22.x86_64 on an x86_64 (pts/0)

mycontainer login:

The “reboot” subcommand reboots the container:

# machinectl reboot mycontainer

The “poweroff” subcommand powers the container off:

# machinectl poweroff mycontainer

So much for the machinectl tool. The tool knows a couple more
commands; please check the man page for details. Note again that even
though we use systemd-nspawn as the container manager here, the
concepts apply to any container manager that implements the logic
described here, including libvirt-lxc for example.

machinectl is not the only tool that is useful in conjunction with
containers. Many of systemd’s own tools have been updated to
explicitly support containers too! Let’s try this (after starting the
container up again first, repeating the systemd-nspawn command from
above):

# hostnamectl -M mycontainer set-hostname "wuff"

This uses
hostnamectl(1)
on the local container and sets its hostname.
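
To verify, you can query the container the same way; run without a verb, hostnamectl simply prints the current status, which should now show the new hostname:

# hostnamectl -M mycontainer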

Similarly, many other tools have been updated for connecting to local
containers. Here’s
systemctl(1)’s -M switch
in action:

# systemctl -M mycontainer
UNIT                                 LOAD   ACTIVE SUB       DESCRIPTION
-.mount                              loaded active mounted   /
dev-hugepages.mount                  loaded active mounted   Huge Pages File System
dev-mqueue.mount                     loaded active mounted   POSIX Message Queue File System
proc-sys-kernel-random-boot_id.mount loaded active mounted   /proc/sys/kernel/random/boot_id
[...]
time-sync.target                     loaded active active    System Time Synchronized
timers.target                        loaded active active    Timers
systemd-tmpfiles-clean.timer         loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

49 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

As expected, this shows the list of active units on the specified
container, not the host. (Output is shortened here; the blog story is
already getting too long.)

Let’s use this to restart a service within our container:

# systemctl -M mycontainer restart systemd-resolved.service

systemctl has more container support than just the -M switch,
though. With the -r switch it shows the units running on the host,
plus all units of all local, running containers:

# systemctl -r
UNIT                                        LOAD   ACTIVE SUB       DESCRIPTION
boot.automount                              loaded active waiting   EFI System Partition Automount
proc-sys-fs-binfmt_misc.automount           loaded active waiting   Arbitrary Executable File Formats File Syst
sys-devices-pci0000:00-0000:00:02.0-drm-card0-card0x2dLVDSx2d1-intel_backlight.device loaded active plugged   /sys/devices/pci0000:00/0000:00:02.0/drm/ca
[...]
timers.target                                                                                       loaded active active    Timers
mandb.timer                                                                                         loaded active waiting   Daily man-db cache update
systemd-tmpfiles-clean.timer                                                                        loaded active waiting   Daily Cleanup of Temporary Directories
mycontainer:-.mount                                                                                 loaded active mounted   /
mycontainer:dev-hugepages.mount                                                                     loaded active mounted   Huge Pages File System
mycontainer:dev-mqueue.mount                                                                        loaded active mounted   POSIX Message Queue File System
[...]
mycontainer:time-sync.target                                                                        loaded active active    System Time Synchronized
mycontainer:timers.target                                                                           loaded active active    Timers
mycontainer:systemd-tmpfiles-clean.timer                                                            loaded active waiting   Daily Cleanup of Temporary Directories

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

191 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

We can see here first the units of the host, followed by the
units of the one container we have currently running. The units of the
containers are prefixed with the container name and a colon
(“:”). (The output is shortened again for brevity’s sake.)

The list-machines subcommand of systemctl shows a list of all
running containers, querying the system managers within the containers
about system state and health. More specifically, it shows whether
containers are properly booted up and whether there are any failed
services:

# systemctl list-machines
NAME         STATE   FAILED JOBS
delta (host) running      0    0
mycontainer  running      0    0
miau         degraded     1    0
waldi        running      0    0

4 machines listed.

To make things more interesting we have started two more containers in
parallel. One of them has a failed service, which results in the
machine state being degraded.

Let’s have a look at
journalctl(1)’s
container support. It too supports -M to show the logs of a specific
container:

# journalctl -M mycontainer -n 8
Nov 12 16:51:13 wuff systemd[1]: Starting Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Reached target Graphical Interface.
Nov 12 16:51:13 wuff systemd[1]: Starting Update UTMP about System Runlevel Changes...
Nov 12 16:51:13 wuff systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Nov 12 16:51:13 wuff systemd[1]: Started Update UTMP about System Runlevel Changes.
Nov 12 16:51:13 wuff systemd[1]: Startup finished in 399ms.
Nov 12 16:51:13 wuff sshd[35]: Server listening on 0.0.0.0 port 24.
Nov 12 16:51:13 wuff sshd[35]: Server listening on :: port 24.

However, it also supports -m to show the combined log stream of the
host and all local containers:

# journalctl -m -e

(Let’s skip the output here completely; I figure you can extrapolate
how this looks.)

But it’s not only systemd’s own tools that are container-aware
these days; procps sports support for it, too:

# ps -eo pid,machine,args
 PID MACHINE                         COMMAND
   1 -                               /usr/lib/systemd/systemd --switched-root --system --deserialize 20
[...]
2915 -                               emacs contents/projects/containers.md
3403 -                               [kworker/u16:7]
3415 -                               [kworker/u16:9]
4501 -                               /usr/libexec/nm-vpnc-service
4519 -                               /usr/sbin/vpnc --non-inter --no-detach --pid-file /var/run/NetworkManager/nm-vpnc-bfda8671-f025-4812-a66b-362eb12e7f13.pid -
4749 -                               /usr/libexec/dconf-service
4980 -                               /usr/lib/systemd/systemd-resolved
5006 -                               /usr/lib64/firefox/firefox
5168 -                               [kworker/u16:0]
5192 -                               [kworker/u16:4]
5193 -                               [kworker/u16:5]
5497 -                               [kworker/u16:1]
5591 -                               [kworker/u16:8]
5711 -                               sudo -s
5715 -                               /bin/bash
5749 -                               /home/lennart/projects/systemd/systemd-nspawn -D /srv/mycontainer -b
5750 mycontainer                     /usr/lib/systemd/systemd
5799 mycontainer                     /usr/lib/systemd/systemd-journald
5862 mycontainer                     /usr/lib/systemd/systemd-logind
5863 mycontainer                     /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
5868 mycontainer                     /sbin/agetty --noclear --keep-baud console 115200 38400 9600 vt102
5871 mycontainer                     /usr/sbin/sshd -D
6527 mycontainer                     /usr/lib/systemd/systemd-resolved
[...]

This shows a process list (shortened). The second column shows the
container a process belongs to. All processes shown with “-” belong to
the host itself.

But it doesn’t stop there. The new “sd-bus” D-Bus client library we
have been preparing in the systemd/kdbus context knows containers
too. While you use sd_bus_open_system() to connect to your local
host’s system bus, sd_bus_open_system_container() may be used to
connect to the system bus of any local container, so that you can
execute bus methods on it.

sd-login.h and machined’s bus interface provide a number of APIs to
add container support to other programs too. They support enumeration
of containers as well as retrieving the machine name for a PID, and
similar.
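
As a quick illustration of machined’s bus interface (a sketch only, assuming machined is running on the host), you can browse its object tree and resolve a container by name with busctl:

# busctl tree org.freedesktop.machine1
# busctl call org.freedesktop.machine1 /org/freedesktop/machine1 org.freedesktop.machine1.Manager GetMachine s mycontainer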

systemd-networkd also has support for containers. When run inside a
container it will by default run a DHCP client and IPv4LL on any veth
network interface named host0 (this interface is special under the
logic described here). When run on the host, networkd will by default
provide a DHCP server and IPv4LL on any veth network interface named
ve- followed by the container name.
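
For example, after booting a container with --network-veth (as shown below), the host side of the virtual link shows up as an interface named after the container, which you can inspect with:

# ip addr show ve-mycontainer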

Let’s have a look at one last facet of systemd’s container
integration: the hook-up with the name service switch. Recent systemd
versions contain a new NSS module, nss-mymachines, that makes the names
of all local containers resolvable via gethostbyname() and
getaddrinfo(). This only applies to containers that run within their
own network namespace. With the systemd-nspawn command shown above,
however, the container shares the network configuration with the host;
hence let’s restart the container, this time with a virtual veth
network link between host and container:

# machinectl poweroff mycontainer
# systemd-nspawn -D /srv/mycontainer --network-veth -b

Now (assuming that networkd is used both in the container and on the
host) we can already ping the container using its name, due to the
simple magic of nss-mymachines:

# ping mycontainer
PING mycontainer (10.0.0.2) 56(84) bytes of data.
64 bytes from mycontainer (10.0.0.2): icmp_seq=1 ttl=64 time=0.124 ms
64 bytes from mycontainer (10.0.0.2): icmp_seq=2 ttl=64 time=0.078 ms

Of course, name resolution not only works with ping; it works with
all other tools that use libc’s gethostbyname() or getaddrinfo(),
too, among them the venerable ssh.
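
For example, to log into the container over SSH (the journal output above shows the container’s sshd listening on port 24):

# ssh -p 24 root@mycontainer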

And this is pretty much all I want to cover for now. We briefly
touched on a variety of integration points, and there’s a lot more still
if you look closely. We are working on even more container integration
all the time, so expect more new features in this area with every
systemd release.

Note that the whole machine concept is actually not limited to
containers, but covers VMs too, to a certain degree. However, the
integration is not as close, since access to a VM’s internals is not as
easy as for containers: it usually requires a network transport
instead of allowing direct syscall access.

Anyway, I hope this is useful. For further details, please have a look
at the linked man pages and other documentation.

Running Docker on AWS OpsWorks

Post Syndicated from Chris Barclay original http://blogs.aws.amazon.com/application-management/post/Tx2FPK7NJS5AQC5/Running-Docker-on-AWS-OpsWorks

AWS OpsWorks lets you deploy and manage applications of all shapes and sizes. OpsWorks layers let you create blueprints for EC2 instances to install and configure any software that you want. This blog will show you how to create a custom layer for Docker. For an overview of Docker, see https://www.docker.com/tryit. Docker lets you precisely define the runtime environment for your application and deploy your application code and its runtime as a Docker container. You can use Docker containers to support new languages like Go or to incorporate your dev and test workflows seamlessly with AWS OpsWorks.

 

The Docker layer uses Chef recipes to install Docker and deploy containers to the EC2 instances running in that layer. Simply provide a Dockerfile and OpsWorks will automatically run the recipes to build and run the container. A stack can have multiple Docker layers and you can deploy multiple Docker containers to each layer. You can extend or change the Chef example recipes to use Docker in the way that works best for you. If you aren’t familiar with Chef recipes, see Cookbooks 101 for an introduction.

 

Step 1: Create Recipes

First, create a repository to store your Chef recipes. OpsWorks supports Git and Subversion, or you can store an archive bundle on Amazon S3. The structure of a cookbook repository is described in the OpsWorks documentation.

 

The docker::install recipe installs the necessary Docker software on your instances:

 

case node[:platform]
when "ubuntu","debian"
  package "docker.io" do
    action :install
  end
when 'centos','redhat','fedora','amazon'
  package "docker" do
    action :install
  end
end

service "docker" do
  action :start
end

The docker::docker-deploy recipe deploys your docker containers (specified by a Dockerfile):

include_recipe 'deploy'

node[:deploy].each do |application, deploy|

  if node[:opsworks][:instance][:layers].first != deploy[:environment_variables][:layer]
    Chef::Log.debug("Skipping deploy::docker application #{application} as it is not deployed to this layer")
    next
  end

  opsworks_deploy_dir do
    user deploy[:user]
    group deploy[:group]
    path deploy[:deploy_to]
  end

  opsworks_deploy do
    deploy_data deploy
    app application
  end

  bash "docker-cleanup" do
    user "root"
    code <<-EOH
      if docker ps | grep #{deploy[:application]};
      then
        docker stop #{deploy[:application]}
        sleep 3
        docker rm #{deploy[:application]}
        sleep 3
      fi
      if docker images | grep #{deploy[:application]};
      then
        docker rmi #{deploy[:application]}
      fi
    EOH
  end

  bash "docker-build" do
    user "root"
    cwd "#{deploy[:deploy_to]}/current"
    code <<-EOH
      docker build -t=#{deploy[:application]} . > #{deploy[:application]}-docker.out
    EOH
  end

  dockerenvs = " "
  deploy[:environment_variables].each do |key, value|
    dockerenvs = dockerenvs + " -e " + key + "=" + value
  end

  bash "docker-run" do
    user "root"
    cwd "#{deploy[:deploy_to]}/current"
    code <<-EOH
      docker run #{dockerenvs} -p #{node[:opsworks][:instance][:private_ip]}:#{deploy[:environment_variables][:service_port]}:#{deploy[:environment_variables][:container_port]} --name #{deploy[:application]} -d #{deploy[:application]}
    EOH
  end

end

Then create a repository to store your Dockerfile. Here’s a sample Dockerfile to get you going:

FROM ubuntu:12.04

RUN apt-get update
RUN apt-get install -y nginx zip curl

RUN echo "daemon off;" >> /etc/nginx/nginx.conf
RUN curl -o /usr/share/nginx/www/master.zip -L https://codeload.github.com/gabrielecirulli/2048/zip/master
RUN cd /usr/share/nginx/www/ && unzip master.zip && mv 2048-master/* . && rm -rf 2048-master master.zip

EXPOSE 80

CMD ["/usr/sbin/nginx", "-c", "/etc/nginx/nginx.conf"]

Step 2: Create an OpsWorks Stack

Now you’re ready to use these recipes with OpsWorks. Open the OpsWorks console.

Select Add a Stack to create an OpsWorks stack.

Give it a name and select Advanced.

Set Use custom Chef Cookbooks to Yes.

Set Repository type to Git.

Set the Repository URL to the repository where you stored the recipes created in the previous step.

Click the Add Stack button at the bottom of the page to create the stack.

Step 3: Add a Layer

Select Add Layer. 

Choose Custom Layer, set the name to “Docker”, shortname to “docker”, and click Add Layer. 

Click the layer’s edit Recipes action and scroll to the Custom Chef recipes section. You will notice there are several headings—Setup, Configure, Deploy, Undeploy, and Shutdown—which correspond to OpsWorks lifecycle events. OpsWorks triggers these events at key points in the instance’s lifecycle and runs the associated recipes.

Enter docker::install in the Setup box and click + to add it to the list.

Enter docker::docker-deploy in the Deploy box and click + to add it to the list.

Click the Save button at the bottom to save the updated configuration. 

Step 4: Add an Instance

The Layer page should now show the Docker layer. However, the layer just controls how to configure instances. You now need to add some instances to the layer. Click Instances in the navigation pane and under the Docker layer, click + Instance. For this walkthrough, just accept the default settings and click Add Instance to add the instance to the layer. Click start in the row’s Actions column to start the instance. OpsWorks will then launch a new EC2 instance and run the Setup recipes to configure Docker. The instance’s status will change to online when it’s ready.

 

Step 5: Add an App & Deploy

Once you’ve started your instances:

In the navigation pane, click Apps and on the Apps page, click Add an app.

On the App page, give it a Name.

Set the app’s type to other.

Specify the app’s repository type. 

Specify the app’s repository URL. This is where your Dockerfile lives and is usually a separate repository from the cookbook repository specified in step 2.

Set the following environment variables:

container_port – Set this variable to the port specified by the EXPOSE parameter in your Dockerfile.

service_port – Set this variable to the port your container will expose on the instance to the outside world. Note: Be sure that your security groups allow inbound traffic for the port specified in service_port.

layer – Set this variable to the shortname of the layer that you want this container deployed to (from Step 3). This lets you have multiple docker layers with different apps deployed on each, such as a front-end web app and a back-end worker. 

For our example, set container_port=80, service_port=80, and layer=docker. You can also define additional environment variables that are automatically passed onto your Docker container, for example a database endpoint that your app connects with. 

Keep the default values for the remaining settings and click Add App. 

To install the code on the server, you must deploy the app. It will take some time for your instances to boot up completely. Once they show up as “online” in the Instances view, navigate to Apps and click deploy in the Actions column. If you create multiple docker layers, note that although the deployment defaults to all instances in the stack, the containers will only be deployed to the layer specified in the layer environment variable.

Once the deployment is complete, you can see your app by clicking the public IP address of the server. You can update your Dockerfile and redeploy at any time.

Step 6: Making attributes dynamic

The recipes written for this blog pass environment variables into the Docker container when it is started. If you need to update the configuration while the app is running, such as a database password, solutions like etcd can make attributes dynamic. An etcd server can run on each instance and be populated by the instance’s OpsWorks attributes. You can update OpsWorks attributes, and thereby the values stored in etcd, at any time, and those values are immediately available to apps running in Docker containers. A future blog post will cover how to create recipes to install etcd and pass OpsWorks attributes, including app environment variables, to the Docker containers.
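
As a rough sketch of the idea (assuming an etcd server with the v2-style API is listening locally and using /myapp/db_password as an example key), a recipe could publish an attribute and the running container could read it back:

# etcdctl set /myapp/db_password "$DB_PASSWORD"
# etcdctl get /myapp/db_password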

 

Summary

These instructions have demonstrated how to use AWS OpsWorks and Docker to deploy applications, represented by Dockerfiles. You can also use Docker layers with other AWS OpsWorks features, including automatic instance scaling and integration with Elastic Load Balancing and Amazon RDS.

 

Revisiting How We Put Together Linux Systems

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

In a previous blog story I discussed
Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems.
I now want to take the opportunity to explain a bit where we want to
take this with
systemd in the
longer run, and what we want to build out of it. This is going to be a
longer story, so better grab a cold bottle of
Club Mate before you start
reading.
Traditional Linux distributions are built around packaging systems
like RPM or dpkg, and an organization model where upstream developers
and downstream packagers are relatively clearly separated: an upstream
developer writes code, and puts it somewhere online, in a tarball. A
packager then grabs it and turns it into RPMs/DEBs. The user then
grabs these RPMs/DEBs and installs them locally on the system. For a
variety of uses this is a fantastic scheme: users have a large
selection of readily packaged software available, in mostly uniform
packaging, from a single source they can trust. In this scheme the
distribution vets all software it packages, and as long as the user
trusts the distribution all should be good. The distribution takes the
responsibility of ensuring the software is not malicious, of timely
fixing security problems and helping the user if something is wrong.
Upstream Projects
However, this scheme also has a number of problems, and doesn’t fit
many use-cases of our software particularly well. Let’s have a look at
the problems of this scheme for many upstreams:

Upstream software vendors are fully dependent on downstream
distributions to package their stuff. It’s the downstream
distribution that decides on schedules, packaging details, and how
to handle support. Often upstream vendors want much faster release
cycles than the downstream distributions follow.

Realistic testing is extremely unreliable and next to
impossible. Since the end-user can run a variety of different
package versions together, and expects the software he runs to just
work on any combination, the test matrix explodes. If upstream tests
its version on distribution X release Y, then there’s no guarantee
that that’s the precise combination of packages that the end user
will eventually run. In fact, it is very unlikely that the end user
will, since most distributions probably updated a number of
libraries the package relies on by the time the package ends up being
made available to the user. The fact that each package can be
individually updated by the user, and each user can combine library
versions, plug-ins and executables relatively freely, results in a high
risk of something going wrong.

Since there are so many different distributions in so many different
versions around, if upstream tries to build and test software for
them it needs to do so for a large number of distributions, which is
a massive effort.

The distributions are actually quite different in many ways. In
fact, they are different in a lot of the most basic
functionality. For example, the path where to put x86-64 libraries
is different on Fedora- and Debian-derived systems.

Developing software for a number of distributions and versions is
hard: if you want to do it, you need to actually install them, each
one of them, manually, and then build your software for each.

Since most downstream distributions have strict licensing and
trademark requirements (and rightly so), any kind of closed source
software (or otherwise non-free) does not fit into this scheme at
all.

All of this together makes it really hard for many upstreams to work
nicely with the way Linux currently works. Often they try to improve
the situation for them, for example by bundling libraries, to make
their test and build matrices smaller.
System Vendors
The toolbox approach of classic Linux distributions is fantastic for
people who want to put together their individual system, nicely
adjusted to exactly what they need. However, this is not really how
many of today’s Linux systems are built, installed or updated. If you
build any kind of embedded device, a server system, or even user
systems, you frequently do your work based on complete system images,
that are linearly versioned. You build these images somewhere, and
then you replicate them atomically to a larger number of systems. On
these systems, you don’t install or remove packages, you get a defined
set of files, and besides installing or updating the system there are
no ways how to change the set of tools you get.
The current Linux distributions are not particularly good at providing
for this major use-case of Linux. Their strict focus on individual
packages as well as package managers as end-user install and update
tool is incompatible with what many system vendors want.
Users
The classic Linux distribution scheme is frequently not what end users
want, either. Many users are used to app markets like the ones Android,
Windows or iOS/Mac have. Markets are a platform that doesn’t package, build or
maintain software like distributions do, but simply allows users to
quickly find and download the software they need, with the app vendor
responsible for keeping the app updated, secured, and all that on the
vendor’s release cycle. Users tend to be impatient. They want their
software quickly, and the fine distinction between trusting a single
distribution or a myriad of app developers individually is usually not
important for them. The companies behind the marketplaces usually try
to improve this trust problem by providing sand-boxing technologies: as
a replacement for the distribution that audits, vets, builds and
packages the software and thus allows users to trust it to a certain
level, these vendors try to find technical solutions to ensure that
the software they offer for download can’t be malicious.
Existing Approaches To Fix These Problems
Now, all the issues pointed out above are not new, and there are
sometimes quite successful attempts to do something about it. Ubuntu
Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of
this problem set, usually with a strict focus on one facet of Linux
systems. For example, Ubuntu Apps focus strictly on end user (desktop)
applications, and don’t care about how we built/update/install the OS
itself, or containers. Docker OTOH focuses on containers only, and
doesn’t care about end-user apps. Software Collections tries to focus
on the development environments. ChromeOS focuses on the OS itself,
but only for end-user devices. CoreOS also focuses on the OS, but
only for server systems.

The approaches they take are usually good at specific things, and use
a variety of different technologies, on different layers. However,
none of these projects has tried to fix these problems in a generic
way, for all uses, right in the core components of the OS itself.

Linux has been tremendously successful because its kernel is so
generic: you can build supercomputers and tiny embedded devices out of
it. It’s time we came up with a basic, reusable scheme for solving the
problem set described above that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom
Gundersen, David Herrmann, and yours truly) recently met in Berlin
about all these things, and tried to come up with a scheme that is
somewhat simple, but tries to solve the issues generically, for all
use-cases, as part of the systemd project. All that in a way that is
somewhat compatible with the current scheme of distributions, to allow
a slow, gradual adoption. Also, and that’s something one cannot stress
enough: the toolbox scheme of classic Linux distributions is
actually a good one, and for many cases the right one. However, we
need to make sure we make distributions relevant again for all
use-cases, not just those of highly individualized systems.

Anyway, so let’s summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their
    software (regardless of whether it is just an app or the whole OS)
    directly for the end user, and know the precise combination of
    libraries and packages it will operate with.

  • We want to allow end users and administrators to install these
    packages on their systems, regardless of which distribution they
    have installed.

  • We want a unified solution that ultimately can cover updates for
    full systems, OS containers, end user apps, programming ABIs, and
    more. These updates shall be double-buffered (at least). This is an
    absolute necessity if we want to prepare the ground for operating
    systems that manage themselves, that can update safely without
    administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact, we want a
    fully trustable OS, with images that can be verified by a full
    trust chain from the firmware (EFI SecureBoot!), through the boot
    loader and the kernel, to the initrd. Cryptographically secure
    verification of the code we execute is relevant on the desktop
    (like ChromeOS does), but also for apps, for embedded devices and
    even on servers (in a post-Snowden world, in particular).

What We Propose

So much for the set of problems and what we are trying to do. Now,
let’s discuss the technical bits we came up with:

The scheme we propose is built around the concepts of btrfs and Linux
file system name-spacing. btrfs at this point already has a large
number of features that fit neatly into our concept, and the
maintainers are busy working on a couple of others we want to
eventually make use of.

As the first part of our proposal we make heavy use of btrfs
sub-volumes and introduce a clear naming scheme for them. We name
snapshots like this:

  • usr:<vendorid>:<architecture>:<version> — This refers to a full
    vendor operating system tree. It’s basically a /usr tree (and no
    other directories), in a specific version, with everything you need
    to boot it up inside it. The <vendorid> field is replaced by some
    vendor identifier, maybe a scheme like
    org.fedoraproject.FedoraWorkstation. The <architecture> field
    specifies a CPU architecture the OS is designed for, for example
    x86-64. The <version> field specifies a specific OS version, for
    example 23.4. An example sub-volume name could hence look like this:
    usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> — This refers to an
    instance of an operating system. It’s basically a root directory,
    containing primarily /etc and /var (but possibly more). Sub-volumes
    of this type do not contain a populated /usr tree though. The
    <name> field refers to some instance name (maybe the host name of
    the instance). The other fields are defined as above. An example
    sub-volume name is
    root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> — This refers to a
    vendor runtime. A runtime here is supposed to be a set of
    libraries and other resources that are needed to run apps (for the
    concept of apps see below), all in a /usr tree. In this regard it
    is very similar to the usr sub-volumes explained above; however,
    while a usr sub-volume is a full OS and contains everything
    necessary to boot, a runtime is really only a set of
    libraries. You cannot boot it, but you can run apps with it. An
    example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> — This is very
    similar to a vendor runtime as described above: it contains just a
    /usr tree, but goes one step further in that it additionally
    contains all the development headers, compilers and build tools
    that allow developing against a specific runtime. For each runtime
    there should be a framework. When you develop against a specific
    framework in a specific architecture, the resulting app will be
    compatible with the runtime of the same vendor ID and architecture.
    Example: framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> — This
    encapsulates an application bundle. It contains a tree that at
    runtime is mounted to /opt/<vendorid>, and contains all the
    application’s resources. The <vendorid> could be a string like
    org.libreoffice.LibreOffice, the <runtime> field refers to the
    vendor ID of one specific runtime the application is built for, for
    example org.gnome.GNOME3_20:3.20.1. The <architecture> and
    <version> fields refer to the architecture the application is built
    for, and of course its version. Example:
    app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> — This sub-volume shall refer to the home
    directory of the specific user. The <user> field contains the user
    name, the <uid> and <gid> fields the numeric Unix UIDs and GIDs
    of the user. The idea here is that in the long run the list of
    sub-volumes is sufficient as a user database (but see
    below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly
identifiable. It is our intention to introduce a new GPT partition type
ID for this.
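
To make the scheme more concrete, here is a minimal Python sketch that
parses sub-volume names of the kinds listed above into structured
records. It is purely illustrative: the field layout follows the scheme
as described, but the function and class names are invented for the
example, and it assumes that individual fields contain no colons.

    from dataclasses import dataclass
    from typing import Optional

    # Fields that follow the type prefix, per the naming scheme above.
    _FIELDS = {
        "usr":       ("vendorid", "architecture", "version"),
        "root":      ("name", "vendorid", "architecture"),
        "runtime":   ("vendorid", "architecture", "version"),
        "framework": ("vendorid", "architecture", "version"),
        "app":       ("vendorid", "runtime", "architecture", "version"),
        "home":      ("user", "uid", "gid"),
    }

    @dataclass
    class SubVolume:
        kind: str
        fields: dict

    def parse_subvolume(name: str) -> Optional[SubVolume]:
        """Parse a name such as
        'usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4'."""
        kind, _, rest = name.partition(":")
        expected = _FIELDS.get(kind)
        if expected is None:
            return None        # not part of the naming scheme
        values = rest.split(":")
        if len(values) != len(expected):
            return None        # malformed name
        return SubVolume(kind, dict(zip(expected, values)))

Running this over the example names given above yields records that the
later steps (mounting, matching, updating) can work with.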

How To Use It

Now that we have introduced this naming scheme, let’s see what we can
build out of it:

  • When booting up a system we mount the root directory from one of the
    root sub-volumes, and then mount /usr from a matching usr
    sub-volume. Matching here means it carries the same <vendorid>
    and <architecture>. Of course, by default we should pick the
    matching usr sub-volume with the newest version (see the sketch
    after this list).

  • When we boot up an OS container, we do exactly the same as when
    we boot up a regular system: we simply combine a usr sub-volume
    with a root sub-volume.

  • When we enumerate the system’s users we simply go through the
    list of home snapshots.

  • When a user authenticates and logs in we mount his home
    directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the
    app sub-volume to /opt/<vendorid>/, and the appropriate runtime
    sub-volume the app picked to /usr, as well as the user’s
    /home/$USER to its place.

  • When a developer wants to develop against a specific runtime he
    installs the right framework, and then temporarily transitions into
    a name space where /usr is mounted from the framework sub-volume,
    and /home/$USER from his own home directory. In this name space he
    then runs his build commands. He can build in multiple name spaces
    at the same time, if he intends to build software for multiple
    runtimes or architectures at once.

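As an illustration of the matching logic in the first item of the list
above, here is a small Python sketch, reusing the hypothetical
parse_subvolume() helper from the earlier sketch, that picks for a
given root sub-volume the matching usr sub-volume with the newest
version. Real version ordering would need proper rules for names like
25beta; the sketch only shows the idea.

    def version_key(version):
        # Naive ordering: numeric components compare numerically,
        # everything else falls back to string comparison.
        return [(0, int(p)) if p.isdigit() else (1, p)
                for p in version.split(".")]

    def pick_usr_for_root(root_name, subvolumes):
        """Return the newest usr:... sub-volume whose vendorid and
        architecture match the given root:... sub-volume, or None."""
        root = parse_subvolume(root_name)
        candidates = []
        for name in subvolumes:
            sv = parse_subvolume(name)
            if not sv or sv.kind != "usr":
                continue
            if sv.fields["vendorid"] != root.fields["vendorid"]:
                continue
            if sv.fields["architecture"] != root.fields["architecture"]:
                continue
            candidates.append((version_key(sv.fields["version"]), name))
        return max(candidates)[1] if candidates else None
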
Instantiating a new system or OS container (which is exactly the same
in this scheme) consists of little more than creating a new,
appropriately named root sub-volume. Quite naturally, you can share one
vendor OS copy in one specific version with a multitude of container
instances.

Everything is double-buffered (or actually, n-fold-buffered), because
usr, runtime, framework and app sub-volumes can exist in multiple
versions. Of course, by default the execution logic should always pick
the newest release of each sub-volume, but it is up to the user to keep
multiple versions around, and possibly execute older versions, if he
desires to do so. In fact, as on ChromeOS, this could even be handled
automatically: if a system fails to boot with a newer snapshot, the
boot loader can automatically revert to an older version of the OS.
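
A minimal sketch of what such an automatic fallback could look like,
assuming the boot logic keeps a simple failed-boot counter per usr
sub-volume; the counter mechanism and the threshold are invented for
illustration and are not how ChromeOS implements this.

    def pick_bootable_usr(candidates, failed_boots, max_failures=3):
        """candidates: usr sub-volume names, newest first.
        failed_boots: maps sub-volume name to failed boot attempts."""
        for name in candidates:
            if failed_boots.get(name, 0) < max_failures:
                return name    # newest version not (yet) known to be bad
        # Everything exceeded the failure budget; fall back to the oldest.
        return candidates[-1] if candidates else None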

An Example

Note that as a result this allows installing not only multiple end-user
applications into the same btrfs volume, but also multiple operating
systems, multiple system instances, multiple runtimes and multiple
frameworks. Or to spell this out in an example:

Let’s say Fedora, Mageia and ArchLinux all implement this scheme,
and provide ready-made end-user images. Also, the GNOME, KDE and SDL
projects all define a runtime+framework to develop against. Finally,
both LibreOffice and Firefox provide their stuff according to this
scheme. You can now trivially install all of these into the same btrfs
volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems
installed, each of them in three versions, and one even in a beta
version. We have four system instances around. Two of them are Fedora
instances: maybe one of them is the one we usually boot from, and the
other we run for very specific purposes in an OS container. We also
have the runtimes for two GNOME releases in multiple versions, plus one
for KDE. Then, we have the development trees for one version each of
KDE and GNOME around, as well as two apps that make use of two releases
of the GNOME runtime. Finally, we have the home directories of two
users.

Now, with the name-spacing concepts we introduced above, we can
actually relatively freely mix and match apps and OSes, or develop
against specific frameworks in specific versions on any operating
system. It doesn’t matter if you booted your ArchLinux instance, or
your Fedora one, you can execute both LibreOffice and Firefox just
fine, because at execution time they get matched up with the right
runtime, and all of them are available from all the operating systems
you installed. You get the precise runtime that the upstream vendor of
Firefox/LibreOffice did their testing with. It doesn’t matter anymore
which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the
sub-volume list, it doesn’t matter which system you boot, the
distribution should be able to find your local users automatically,
without any configuration in /etc/passwd.
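
As an illustration of how such enumeration could work, here is a hedged
Python sketch that lists the home:* sub-volumes of a btrfs volume with
the btrfs subvolume list command and turns them into passwd-style
records. The output parsing and the record format are assumptions made
for the example; nothing like this ships today.

    import subprocess

    def enumerate_home_users(volume="/"):
        """Yield (user, uid, gid, home) for each home:* sub-volume.
        Typically needs root privileges."""
        out = subprocess.run(["btrfs", "subvolume", "list", volume],
                             check=True, capture_output=True,
                             text=True).stdout
        for line in out.splitlines():
            # e.g. "ID 257 gen 8 top level 5 path home:lennart:1000:1000"
            path = line.split(" path ", 1)[-1]
            if not path.startswith("home:"):
                continue
            _, user, uid, gid = path.split(":")
            yield user, int(uid), int(gid), f"/home/{user}"

    if __name__ == "__main__":
        for user, uid, gid, home in enumerate_home_users():
            print(f"{user}:x:{uid}:{gid}::{home}:/bin/sh")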

Building Blocks

With this naming scheme, plus the way we can combine sub-volumes at
execution time, we have already come quite far. But how do we actually
get these sub-volumes onto the final machines, and how do we update
them? Well, btrfs has a feature called “send-and-receive”. It basically
allows you to “diff” two file system versions and generate a binary
delta. You can generate these deltas on a developer’s machine and then
push them into the user’s system, and he’ll get the exact same
sub-volume too. This is how we envision installation and updating of
operating systems, applications, runtimes and frameworks. At
installation time, we simply deserialize an initial send-and-receive
delta into our btrfs volume, and later, when a new version is released,
we just add in the few bits that are new by dropping in another
send-and-receive delta under a new sub-volume name. And we do it
exactly the same way for the OS itself, for a runtime, a framework or
an app. There’s no technical distinction anymore. The underlying
operation for installing apps, runtimes, frameworks and vendor OSes, as
well as the operation for updating them, is done the exact same way for
all.
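
To make this more tangible, here is a hedged sketch of what generating
and applying such deltas could look like, built on the real btrfs
send/receive commands (btrfs send -p <parent> emits an incremental
delta, btrfs receive deserializes it); the Python wrapper, paths and
sub-volume names are invented for the example.

    import subprocess

    def generate_delta(new_snapshot, delta_file, parent_snapshot=None):
        """On the vendor machine: serialize a read-only snapshot. With a
        parent given, only the incremental difference is emitted."""
        cmd = ["btrfs", "send"]
        if parent_snapshot:
            cmd += ["-p", parent_snapshot]
        cmd.append(new_snapshot)
        with open(delta_file, "wb") as out:
            subprocess.run(cmd, check=True, stdout=out)

    def apply_delta(delta_file, target_volume):
        """On the user's machine: deserializing the delta creates the
        new sub-volume next to the existing ones."""
        with open(delta_file, "rb") as inp:
            subprocess.run(["btrfs", "receive", target_volume],
                           check=True, stdin=inp)

    # Hypothetical incremental OS update from version 23.4 to 23.5:
    # generate_delta("/vol/usr:org.example.OS:x86_64:23.5",
    #                "/tmp/os.delta",
    #                parent_snapshot="/vol/usr:org.example.OS:x86_64:23.4")
    # apply_delta("/tmp/os.delta", "/vol")
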
Of course, keeping multiple full /usr trees around sounds like an
awful lot of waste, after all they will contain a lot of very similar
data, since a lot of resources are shared between distributions,
frameworks and runtimes. However, thankfully btrfs is actually able to
de-duplicate this for us. If we add in a new app snapshot, this simply
adds in the new files that changed. Moreover, different runtimes and
operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user,
desktop side of things, the concept is also extremely powerful in
server scenarios. For example, it is easy to build your own usr
trees and deliver them to your hosts using this scheme. The usr
sub-volumes are supposed to be something that administrators can put
together. After deserializing them into a couple of hosts, you can
trivially instantiate them as OS containers there, simply by adding a
new root sub-volume for each instance, referencing the usr tree you
just put together. Instantiating OS containers hence becomes as easy
as creating a new btrfs sub-volume. And you can still update the images
nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies nicely to embedded
use-cases. Regardless of whether you build a TV, an IVI system or a
phone: you can put together your OS versions as usr trees, and then use
the btrfs send-and-receive facilities to deliver them to the systems,
and update them there.

Many people, when they hear the word “btrfs”, instantly reply with “is
it ready yet?”. Thankfully, most of the functionality we really need
here is strictly read-only. With the exception of the home
sub-volumes (see below) all snapshots are strictly read-only, and are
delivered as immutable vendor trees onto the devices. They never are
changed. Even if btrfs might still be immature, for this kind of
read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example,
an installer image could include a Fedora version compiled for x86-64,
one for i386, one for ARM, all in the same btrfs volume. Due to btrfs’
de-duplication they will share as much as possible, and when the image
is booted up the right sub-volume is automatically picked. Something
similar of course applies to the apps too!

This also allows us to implement something that we like to call
Operating-System-As-A-Virus. Installing a new system is little more
than the following steps (sketched in code below):

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting

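Here is a hedged sketch of these steps wired together with standard
tools (sgdisk, mkfs.vfat, mkfs.btrfs, mount, btrfs receive). The device
path, partition sizes, type codes and the use of bootctl for the boot
loader are assumptions made for illustration; this is an outline, not a
supported installer.

    import os
    import subprocess

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def install_os_to(device, usr_delta,
                      esp_mount="/mnt/esp", vol_mount="/mnt/vol"):
        """Rough outline of the installation steps listed above.
        WARNING: destructive, and illustrative only."""
        # 1.-3. New GPT table, an ESP and a btrfs partition
        run("sgdisk", "--zap-all", device)
        run("sgdisk", "-n", "1:0:+512M", "-t", "1:ef00", device)
        run("sgdisk", "-n", "2:0:0", "-t", "2:8300", device)
        run("mkfs.vfat", "-F", "32", device + "1")
        run("mkfs.btrfs", "-f", device + "2")
        # 4. Deserialize the vendor usr sub-volume into the btrfs volume
        os.makedirs(vol_mount, exist_ok=True)
        run("mount", device + "2", vol_mount)
        with open(usr_delta, "rb") as delta:
            subprocess.run(["btrfs", "receive", vol_mount],
                           check=True, stdin=delta)
        # 5. Install a boot loader into the ESP (systemd-boot assumed)
        os.makedirs(esp_mount, exist_ok=True)
        run("mount", device + "1", esp_mount)
        run("bootctl", "--path=" + esp_mount, "install")
        # 6. Rebooting is left to the caller.
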
Now, since the only real vendor data you need is the usr sub-volume,
you can trivially duplicate this onto any block device you want. Let’s
say you are a happy Fedora user, and you want to provide a friend with
his own installation of this awesome system, all on a USB stick. All
you have to do is follow the steps above, using your installed
usr tree as the source to copy from. And there you go! You don’t have
to be afraid that any of your personal data is copied too, as the usr
sub-volume is the exact version your vendor provided you with. Or in
other words: there’s no distinction anymore between installer images
and installed systems. It’s all the same. Installation becomes
replication, nothing more. Live CDs and installed systems can be fully
identical.

Note that in this design apps are actually developed against a single,
very specific runtime that contains all the libraries they can link
against (including a specific glibc version!). Any library that is not
included in the runtime the developer picked must be included in the
app itself. This is similar to how apps on Android declare one very
specific Android version they are developed against. This greatly
simplifies application installation, as there’s no dependency hell:
each app pulls in one runtime, and the app is actually free to pick
which one, as you can have multiple installed, though only one is used
by each app.

Also note that operating systems built this way will never end up in a
“half-updated” state, as is common when a system is updated using
RPM/dpkg. When updating the system, the code will either run the old or
the new version, but it will never see part of the old files and part
of the new files. This is the same for apps, runtimes, and frameworks,
too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for
this. The scheme relies on the ability to monopolize the
vendor OS resources in /usr, which is the key to what I described in
Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems
a few weeks back. Then, of course, for the full desktop app concept we
need a strong sandbox that does more than just hide files from the
file system view. After all, with an app concept like the above, the
primary interface between the executed desktop apps and the rest of the
system is IPC (which is why we work on kdbus and teach it all
kinds of sand-boxing features) and the kernel itself. Harald Hoyer has
started working on generating the btrfs send-and-receive images based
on Fedora.

Getting to the full scheme will take a while. Currently we have many
of the building blocks ready, but some major items are missing. For
example, we push quite a few problems into btrfs that other solutions
try to solve in user space. One of them is
signing/verification of images. The btrfs maintainers are working on
adding this to the code base, but currently nothing exists. This
functionality is essential, though, if we are to arrive at a fully
verified system where a trust chain exists all the way from the
firmware to the apps. Also, to make the home sub-volume scheme fully
workable we need encrypted sub-volumes, so that the sub-volume’s
pass-phrase can be used for authenticating users in PAM. This doesn’t
exist either.

Working towards this scheme is a gradual process. Many of the steps we
require for it are useful outside of the grand scheme though, which
means we can slowly work towards the goal, and our users can already
benefit from what we are working on as we go.

Also, and most importantly, this is not really a departure from
traditional operating systems:

Each app and each OS sees a traditional Unix hierarchy with
/usr, /home, /opt, /var and /etc. It executes in an environment that is
pretty much identical to how it would be run on traditional systems.

There’s no need to fully move to a system that uses only btrfs and
strictly follows this sub-volume scheme. For example, we intend to
provide implicit support for systems that are installed on ext4 or
xfs, or that are put together with traditional packaging tools such as
RPM or dpkg: if the user tries to install a
runtime/app/framework/OS image on a system that doesn’t use btrfs so
far, it can just create a loop-back btrfs image in /var and push the
data into that. Even we developers will run our stuff like this for a
while; after all, this new scheme is not particularly useful for highly
individualized systems, and we developers usually tend to run
systems like that.
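
A hedged sketch of how such a loop-back btrfs image could be set up
with standard tools (truncate, mkfs.btrfs and a loop mount); the file
location, size and mount point are invented for the example.

    import os
    import subprocess

    IMAGE = "/var/lib/vendor-subvolumes.btrfs"   # hypothetical location
    MOUNT = "/var/lib/vendor-subvolumes"         # hypothetical mount point

    def ensure_loopback_btrfs(size="10G"):
        """Create a sparse file if needed, format it as btrfs and
        loop-mount it, so sub-volumes can be received even on an
        ext4/xfs root."""
        if not os.path.exists(IMAGE):
            subprocess.run(["truncate", "-s", size, IMAGE], check=True)
            subprocess.run(["mkfs.btrfs", IMAGE], check=True)
        os.makedirs(MOUNT, exist_ok=True)
        subprocess.run(["mount", "-o", "loop", IMAGE, MOUNT], check=True)
        return MOUNT   # e.g. pass this to btrfs receive
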
Also note that this is in no way a departure from packaging systems
like RPM or DEB. Even if the new scheme we propose is used for
installing and updating a specific system, it is RPM/DEB that is used
to put together the vendor OS tree initially. Hence, even in this
scheme RPM/DEB remain highly relevant, though not strictly as end-user
tools anymore, but as build tools.

So Let’s Summarize Again What We Propose

  • We want a unified scheme for installing and updating OS images,
    user apps, runtimes and frameworks.

  • We want a unified scheme for relatively freely mixing OS
    images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of
    all executed code can be done, all the way to the firmware, as a
    standard feature of the system.

  • We want to allow app vendors to write their programs against very
    specific frameworks, knowing that they will end up being
    executed with exactly the set of libraries they chose.

  • We want to allow parallel installation of multiple OSes and versions
    of them, multiple runtimes in multiple versions, as well as multiple
    frameworks in multiple versions. And of course, multiple apps in
    multiple versions.

  • We want everything double-buffered (or actually n-fold-buffered), to
    ensure we can reliably update and roll back versions, in particular
    to safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS
    container is as simple as adding in a new snapshot and restarting
    the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS
    instances from a single vendor tree, with zero difference whether we
    boot them on bare metal, in a VM or as a container.

  • We want to enable Linux to have an open scheme that people can use
    to build app markets and similar schemes, not restricted to a
    specific vendor.

Final Words

I’ll be talking about this at LinuxCon Europe in October. I originally
intended to discuss this at the Linux Plumbers Conference (which I
assumed was the right forum for this kind of major plumbing-level
improvement), and at linux.conf.au, but there was no interest in my
session submissions there…

Of course this is all work in progress. These are the current ideas we
are working towards. As we progress we will likely change a number of
things. For example, the precise naming of the sub-volumes might look
very different in the end.

Of course, we are developers of the systemd project. Implementing this
scheme is not just a job for the systemd developers, however. This is a
reinvention of how distributions work, and hence needs strong support
from the distributions. We really hope we can trigger some interest by
publishing this proposal now, and get the distributions on board. This,
after all, is explicitly not supposed to be a solution for one specific
project and one specific vendor’s product; we care about making this
open, and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us
(IRC, mail, G+, …).

The future is going to be awesome!

Revisiting How We Put Together Linux Systems

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

In a previous blog story I discussed
Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems,
I now want to take the opportunity to explain a bit where we want to
take this with
systemd in the
longer run, and what we want to build out of it. This is going to be a
longer story, so better grab a cold bottle of
Club Mate before you start
reading.

Traditional Linux distributions are built around packaging systems
like RPM or dpkg, and an organization model where upstream developers
and downstream packagers are relatively clearly separated: an upstream
developer writes code, and puts it somewhere online, in a tarball. A
packager than grabs it and turns it into RPMs/DEBs. The user then
grabs these RPMs/DEBs and installs them locally on the system. For a
variety of uses this is a fantastic scheme: users have a large
selection of readily packaged software available, in mostly uniform
packaging, from a single source they can trust. In this scheme the
distribution vets all software it packages, and as long as the user
trusts the distribution all should be good. The distribution takes the
responsibility of ensuring the software is not malicious, of timely
fixing security problems and helping the user if something is wrong.

Upstream Projects

However, this scheme also has a number of problems, and doesn’t fit
many use-cases of our software particularly well. Let’s have a look at
the problems of this scheme for many upstreams:

  • Upstream software vendors are fully dependent on downstream
    distributions to package their stuff. It’s the downstream
    distribution that decides on schedules, packaging details, and how
    to handle support. Often upstream vendors want much faster release
    cycles then the downstream distributions follow.

  • Realistic testing is extremely unreliable and next to
    impossible. Since the end-user can run a variety of different
    package versions together, and expects the software he runs to just
    work on any combination, the test matrix explodes. If upstream tests
    its version on distribution X release Y, then there’s no guarantee
    that that’s the precise combination of packages that the end user
    will eventually run. In fact, it is very unlikely that the end user
    will, since most distributions probably updated a number of
    libraries the package relies on by the time the package ends up being
    made available to the user. The fact that each package can be
    individually updated by the user, and each user can combine library
    versions, plug-ins and executables relatively freely, results in a high
    risk of something going wrong.

  • Since there are so many different distributions in so many different
    versions around, if upstream tries to build and test software for
    them it needs to do so for a large number of distributions, which is
    a massive effort.

  • The distributions are actually quite different in many ways. In
    fact, they are different in a lot of the most basic
    functionality. For example, the path where x86-64 libraries are put
    is different on Fedora and Debian-derived systems.

  • Developing software for a number of distributions and versions is
    hard: if you want to do it, you need to actually install them, each
    one of them, manually, and then build your software for each.

  • Since most downstream distributions have strict licensing and
    trademark requirements (and rightly so), any kind of closed source
    software (or otherwise non-free) does not fit into this scheme at
    all.

All of this together makes it really hard for many upstreams to work
nicely with the way Linux currently works. Often they try to improve
the situation for themselves, for example by bundling libraries, to make
their test and build matrices smaller.

System Vendors

The toolbox approach of classic Linux distributions is fantastic for
people who want to put together their individual system, nicely
adjusted to exactly what they need. However, this is not really how
many of today’s Linux systems are built, installed or updated. If you
build any kind of embedded device, a server system, or even user
systems, you frequently do your work based on complete system images,
that are linearly versioned. You build these images somewhere, and
then you replicate them atomically to a larger number of systems. On
these systems, you don’t install or remove packages, you get a defined
set of files, and besides installing or updating the system there are
no ways how to change the set of tools you get.

The current Linux distributions are not particularly good at providing
for this major use-case of Linux. Their strict focus on individual
packages as well as package managers as end-user install and update
tool is incompatible with what many system vendors want.

Users

The classic Linux distribution scheme is frequently not what end users
want, either. Many users are used to the app markets that Android, Windows
or iOS/Mac have. Markets are a platform that doesn’t package, build or
maintain software like distributions do, but simply allows users to
quickly find and download the software they need, with the app vendor
responsible for keeping the app updated, secured, and all that on the
vendor’s release cycle. Users tend to be impatient. They want their
software quickly, and the fine distinction between trusting a single
distribution or a myriad of app developers individually is usually not
important for them. The companies behind the marketplaces usually try
to improve this trust problem by providing sand-boxing technologies: as
a replacement for the distribution that audits, vets, builds and
packages the software and thus allows users to trust it to a certain
level, these vendors try to find technical solutions to ensure that
the software they offer for download can’t be malicious.

Existing Approaches To Fix These Problems

Now, all the issues pointed out above are not new, and there are
sometimes quite successful attempts to do something about it. Ubuntu
Apps, Docker, Software Collections, ChromeOS, CoreOS all fix part of
this problem set, usually with a strict focus on one facet of Linux
systems. For example, Ubuntu Apps focus strictly on end user (desktop)
applications, and don’t care about how we build/update/install the OS
itself, or containers. Docker OTOH focuses on containers only, and
doesn’t care about end-user apps. Software Collections tries to focus
on the development environments. ChromeOS focuses on the OS itself,
but only for end-user devices. CoreOS also focuses on the OS, but
only for server systems.

The approaches they find are usually good at specific things, and use
a variety of different technologies, on different layers. However,
none of these projects tried to fix these problems in a generic way,
for all uses, right in the core components of the OS itself.

Linux has achieved tremendous success because its kernel is so
generic: you can build supercomputers and tiny embedded devices out of
it. It’s time we come up with a basic, reusable scheme for solving
the problem set described above that is equally generic.

What We Want

The systemd cabal (Kay Sievers, Harald Hoyer, Daniel Mack, Tom
Gundersen, David Herrmann, and yours truly) recently met in Berlin
about all these things, and tried to come up with a scheme that is
somewhat simple, but tries to solve the issues generically, for all
use-cases, as part of the systemd project. All that in a way that is
somewhat compatible with the current scheme of distributions, to allow
a slow, gradual adoption. Also, and that’s something one cannot stress
enough: the toolbox scheme of classic Linux distributions is
actually a good one, and for many cases the right one. However, we
need to make sure we make distributions relevant again for all
use-cases, not just those of highly individualized systems.

Anyway, so let’s summarize what we are trying to do:

  • We want an efficient way that allows vendors to package their
    software (regardless if just an app, or the whole OS) directly for
    the end user, and know the precise combination of libraries and
    packages it will operate with.

  • We want to allow end users and administrators to install these
    packages on their systems, regardless of which distribution they have
    installed on it.

  • We want a unified solution that ultimately can cover updates for
    full systems, OS containers, end user apps, programming ABIs, and
    more. These updates shall be double-buffered, (at least). This is an
    absolute necessity if we want to prepare the ground for operating
    systems that manage themselves, that can update safely without
    administrator involvement.

  • We want our images to be trustable (i.e. signed). In fact we want a
    fully trustable OS, with images that can be verified by a full
    trust chain from the firmware (EFI SecureBoot!), through the boot loader, through the
    kernel, and initrd. Cryptographically secure verification of the
    code we execute is relevant on the desktop (like ChromeOS does), but
    also for apps, for embedded devices and even on servers (in a post-Snowden
    world, in particular).

What We Propose

So much about the set of problems, and what we are trying to do. So,
now, let’s discuss the technical bits we came up with:

The scheme we propose is built around a variety of concepts from btrfs
and Linux file system name-spacing. btrfs at this point already has a
large number of features that fit neatly in our concept, and the
maintainers are busy working on a couple of others we want to
eventually make use of.

As first part of our proposal we make heavy use of btrfs sub-volumes and
introduce a clear naming scheme for them. We name snapshots like this:

  • usr:<vendorid>:<architecture>:<version> — This refers to a full
    vendor operating system tree. It’s basically a /usr tree (and no
    other directories), in a specific version, with everything you need to boot
    it up inside it. The <vendorid> field is replaced by some vendor
    identifier, maybe a scheme like
    org.fedoraproject.FedoraWorkstation. The <architecture> field
    specifies a CPU architecture the OS is designed for, for example
    x86-64. The <version> field specifies a specific OS version, for
    example 23.4. An example sub-volume name could hence look like this:
    usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4

  • root:<name>:<vendorid>:<architecture> — This refers to an
    instance of an operating system. It’s basically a root directory,
    containing primarily /etc and /var (but possibly more). Sub-volumes
    of this type do not contain a populated /usr tree though. The
    <name> field refers to some instance name (maybe the host name of
    the instance). The other fields are defined as above. An example
    sub-volume name is
    root:revolution:org.fedoraproject.FedoraWorkstation:x86_64.

  • runtime:<vendorid>:<architecture>:<version> — This refers to a
    vendor runtime. A runtime here is supposed to be a set of
    libraries and other resources that are needed to run apps (for the
    concept of apps see below), all in a /usr tree. In this regard this
    is very similar to the usr sub-volumes explained above, however,
    while a usr sub-volume is a full OS and contains everything
    necessary to boot, a runtime is really only a set of
    libraries. You cannot boot it, but you can run apps with it. An
    example sub-volume name is: runtime:org.gnome.GNOME3_20:x86_64:3.20.1

  • framework:<vendorid>:<architecture>:<version> — This is very
    similar to a vendor runtime as described above: it contains just a
    /usr tree, but goes one step further: it additionally contains all
    development headers, compilers and build tools, that allow
    developing against a specific runtime. For each runtime there should
    be a framework. When you develop against a specific framework in a
    specific architecture, then the resulting app will be compatible
    with the runtime of the same vendor ID and architecture. Example:
    framework:org.gnome.GNOME3_20:x86_64:3.20.1

  • app:<vendorid>:<runtime>:<architecture>:<version> — This
    encapsulates an application bundle. It contains a tree that at
    runtime is mounted to /opt/<vendorid>, and contains all the
    application’s resources. The <vendorid> could be a string like
    org.libreoffice.LibreOffice, the <runtime> refers to the
    vendor id of one specific runtime the application is built for, for
    example org.gnome.GNOME3_20:3.20.1. The <architecture> and
    <version> refer to the architecture the application is built for,
    and of course its version. Example:
    app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133

  • home:<user>:<uid>:<gid> — This sub-volume shall refer to the home
    directory of the specific user. The <user> field contains the user
    name, and the <uid> and <gid> fields the numeric Unix UID and GID
    of the user. The idea here is that in the long run the list of
    sub-volumes is sufficient as a user database (but see
    below). Example: home:lennart:1000:1000.

btrfs partitions that adhere to this naming scheme should be clearly
identifiable. It is our intention to introduce a new GPT partition type
ID for this.
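
To make the naming a little more concrete, here is a minimal shell
sketch of creating a few sub-volumes that follow this scheme; the mount
point /var/lib/volumes and the concrete names are illustrative
assumptions only, not part of the proposal:

    # Illustrative only: create sub-volumes following the proposed naming
    # scheme on a btrfs volume mounted at /var/lib/volumes (assumed path).
    cd /var/lib/volumes
    btrfs subvolume create "usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4"
    btrfs subvolume create "root:revolution:org.fedoraproject.FedoraWorkstation:x86_64"
    btrfs subvolume create "runtime:org.gnome.GNOME3_20:x86_64:3.20.1"
    btrfs subvolume create "home:lennart:1000:1000"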

How To Use It

Now that we have introduced this naming scheme, let’s see what we can
build out of it:

  • When booting up a system we mount the root directory from one of the
    root sub-volumes, and then mount /usr from a matching usr
    sub-volume. Matching here means it carries the same <vendorid>
    and <architecture>. Of course, by default we should pick the
    matching usr sub-volume with the newest version (see the sketch
    after this list).

  • When we boot up an OS container, we do exactly the same as when
    we boot up a regular system: we simply combine a usr sub-volume
    with a root sub-volume.

  • When we enumerate the system’s users we simply go through the
    list of home snapshots.

  • When a user authenticates and logs in we mount his home
    directory from his snapshot.

  • When an app is run, we set up a new file system name-space, mount the
    app sub-volume to /opt/<vendorid>/, and the appropriate runtime
    sub-volume the app picked to /usr, as well as the user’s
    /home/$USER to its place.

  • When a developer wants to develop against a specific runtime he
    installs the right framework, and then temporarily transitions into
    a name space where /usr is mounted from the framework sub-volume, and
    /home/$USER from his own home directory. In this name space he then
    runs his build commands. He can build in multiple name spaces at the
    same time, if he intends to build software for multiple runtimes or
    architectures at the same time.
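
To illustrate the first and the app-related points above, here is a
rough, hedged shell sketch; the device name /dev/sda2, the mount points
and the chosen versions are all made up for the example:

    # Booting: combine a root sub-volume with a matching usr sub-volume
    # (illustrative device and names; assumes an empty /usr mount point
    # exists inside the root sub-volume).
    mount -o subvol="root:revolution:org.fedoraproject.FedoraWorkstation:x86_64" /dev/sda2 /sysroot
    mount -o subvol="usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4",ro /dev/sda2 /sysroot/usr

    # Running an app: private mount name-space with the chosen runtime on
    # /usr and the app bundle on /opt/<vendorid> (again purely illustrative).
    unshare --mount sh -c '
        mount -o subvol="runtime:org.gnome.GNOME3_20:x86_64:3.20.1",ro /dev/sda2 /usr
        mkdir -p /opt/org.libreoffice.LibreOffice
        mount -o subvol="app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133",ro /dev/sda2 /opt/org.libreoffice.LibreOffice
        exec /opt/org.libreoffice.LibreOffice/bin/soffice
    '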

Instantiating a new system or OS container (which is exactly the same
in this scheme) just consists of creating a new appropriately named
root sub-volume. Completely naturally you can share one vendor OS
copy in one specific version with a multitude of container instances.

Everything is double-buffered (or actually, n-fold-buffered), because
usr, runtime, framework, app sub-volumes can exist in multiple
versions. Of course, by default the execution logic should always pick
the newest release of each sub-volume, but it is up to the user to keep
multiple versions around, and possibly execute older versions, if he
desires to do so. In fact, like on ChromeOS this could even be handled
automatically: if a system fails to boot with a newer snapshot, the
boot loader can automatically revert back to an older version of the
OS.
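
A trivial sketch of the “pick the newest matching version” step,
assuming sub-volume names can simply be listed and version-sorted (the
vendor id, architecture and mount point are illustrative):

    # Illustrative: list the usr sub-volumes for one vendor/architecture and
    # pick the newest one with a version sort on the fourth :-separated field.
    btrfs subvolume list -o /var/lib/volumes | awk '{ print $NF }' \
        | grep '^usr:org.fedoraproject.FedoraWorkstation:x86_64:' \
        | sort -t: -k4 -V | tail -n 1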

An Example

Note that as a result this allows installing not only multiple end-user
applications into the same btrfs volume, but also multiple operating
systems, multiple system instances, multiple runtimes, multiple
frameworks. Or to spell this out in an example:

Let’s say Fedora, Mageia and ArchLinux all implement this scheme,
and provide ready-made end-user images. Also, the GNOME, KDE, SDL
projects all define a runtime+framework to develop against. Finally,
both LibreOffice and Firefox provide their stuff according to this
scheme. You can now trivially install all of these into the same btrfs
volume:

  • usr:org.fedoraproject.WorkStation:x86_64:24.7
  • usr:org.fedoraproject.WorkStation:x86_64:24.8
  • usr:org.fedoraproject.WorkStation:x86_64:24.9
  • usr:org.fedoraproject.WorkStation:x86_64:25beta
  • usr:org.mageia.Client:i386:39.3
  • usr:org.mageia.Client:i386:39.4
  • usr:org.mageia.Client:i386:39.6
  • usr:org.archlinux.Desktop:x86_64:302.7.8
  • usr:org.archlinux.Desktop:x86_64:302.7.9
  • usr:org.archlinux.Desktop:x86_64:302.7.10
  • root:revolution:org.fedoraproject.WorkStation:x86_64
  • root:testmachine:org.fedoraproject.WorkStation:x86_64
  • root:foo:org.mageia.Client:i386
  • root:bar:org.archlinux.Desktop:x86_64
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.1
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.4
  • runtime:org.gnome.GNOME3_20:x86_64:3.20.5
  • runtime:org.gnome.GNOME3_22:x86_64:3.22.0
  • runtime:org.kde.KDE5_6:x86_64:5.6.0
  • framework:org.gnome.GNOME3_22:x86_64:3.22.0
  • framework:org.kde.KDE5_6:x86_64:5.6.0
  • app:org.libreoffice.LibreOffice:GNOME3_20:x86_64:133
  • app:org.libreoffice.LibreOffice:GNOME3_22:x86_64:166
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:39
  • app:org.mozilla.Firefox:GNOME3_20:x86_64:40
  • home:lennart:1000:1000
  • home:hrundivbakshi:1001:1001

In the example above, we have three vendor operating systems
installed. All of them in three versions, and one even in a beta
version. We have four system instances around. Two of them of Fedora,
maybe one of them we usually boot from, the other we run for very
specific purposes in an OS container. We also have the runtimes for
two GNOME releases in multiple versions, plus one for KDE. Then, we
have the development trees for one version of KDE and GNOME around, as
well as two apps, that make use of two releases of the GNOME
runtime. Finally, we have the home directories of two users.

Now, with the name-spacing concepts we introduced above, we can
actually relatively freely mix and match apps and OSes, or develop
against specific frameworks in specific versions on any operating
system. It doesn’t matter if you booted your ArchLinux instance, or
your Fedora one, you can execute both LibreOffice and Firefox just
fine, because at execution time they get matched up with the right
runtime, and all of them are available from all the operating systems
you installed. You get the precise runtime that the upstream vendor of
Firefox/LibreOffice did their testing with. It doesn’t matter anymore
which distribution you run, and which distribution the vendor prefers.

Also, given that the user database is actually encoded in the
sub-volume list, it doesn’t matter which system you boot, the
distribution should be able to find your local users automatically,
without any configuration in /etc/passwd.
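
As a sketch, and assuming the same illustrative mount point as before,
enumerating users could then be as simple as:

    # Illustrative: derive the user list from home:<user>:<uid>:<gid> sub-volumes.
    btrfs subvolume list -o /var/lib/volumes | awk '{ print $NF }' | grep '^home:' \
        | awk -F: '{ printf "%s uid=%s gid=%s\n", $2, $3, $4 }'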

Building Blocks

With this naming scheme plus the way we can combine sub-volumes at
execution time we have already come quite far, but how do we actually get
these sub-volumes onto the final machines, and how do we update them? Well,
btrfs has a feature they call “send-and-receive”. It basically allows
you to “diff” two file system versions, and generate a binary
delta. You can generate these deltas on a developer’s machine and then
push them into the user’s system, and he’ll get the exact same
sub-volume too. This is how we envision installation and updating of
operating systems, applications, runtimes, frameworks. At installation
time, we simply deserialize an initial send-and-receive delta into
our btrfs volume, and later, when a new version is released we just
add in the few bits that are new, by dropping in another
send-and-receive delta under a new sub-volume name. And we do it
exactly the same for the OS itself, for a runtime, a framework or an
app. There’s no technical distinction anymore. The underlying
operation for installing apps, runtimes, frameworks and vendor OSes, as well
as the operation for updating them is done the exact same way for all.
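
Here is a minimal sketch of what that flow looks like with today’s
btrfs tooling; the sub-volume names, versions and stream file names are
made up for illustration:

    # On the vendor/build side: serialize a full image, plus an incremental
    # delta against the previous release (sub-volumes must be read-only).
    btrfs send "usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4" > usr-23.4.stream
    btrfs send -p "usr:org.fedoraproject.FedoraWorkstation:x86_64:23.4" \
        "usr:org.fedoraproject.FedoraWorkstation:x86_64:23.5" > usr-23.5.delta

    # On the target system: deserialize into the local btrfs volume.
    btrfs receive -f usr-23.4.stream /var/lib/volumes
    btrfs receive -f usr-23.5.delta  /var/lib/volumes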

Of course, keeping multiple full /usr trees around sounds like an
awful lot of waste, after all they will contain a lot of very similar
data, since a lot of resources are shared between distributions,
frameworks and runtimes. However, thankfully btrfs actually is able to
de-duplicate this for us. If we add in a new app snapshot, this simply
adds in the new files that changed. Moreover different runtimes and
operating systems might actually end up sharing the same tree.

Even though the example above focuses primarily on the end-user,
desktop side of things, the concept is also extremely powerful in
server scenarios. For example, it is easy to build your own usr
trees and deliver them to your hosts using this scheme. The usr
sub-volumes are supposed to be something that administrators can put
together. After deserializing them into a couple of hosts, you can
trivially instantiate them as OS containers there, simply by adding a
new root sub-volume for each instance, referencing the usr tree you
just put together. Instantiating OS containers hence becomes as easy
as creating a new btrfs sub-volume. And you can still update the images
nicely, get fully double-buffered updates and everything.

And of course, this scheme also applies great to embedded
use-cases. Regardless if you build a TV, an IVI system or a phone: you
can put together your OS versions as usr trees, and then use
btrfs-send-and-receive facilities to deliver them to the systems, and
update them there.

Many people when they hear the word “btrfs” instantly reply with “is
it ready yet?”. Thankfully, most of the functionality we really need
here is strictly read-only. With the exception of the home
sub-volumes (see below) all snapshots are strictly read-only, and are
delivered as immutable vendor trees onto the devices. They never are
changed. Even if btrfs might still be immature, for this kind of
read-only logic it should be more than good enough.

Note that this scheme also enables doing fat systems: for example,
an installer image could include a Fedora version compiled for x86-64,
one for i386, one for ARM, all in the same btrfs volume. Due to btrfs’
de-duplication they will share as much as possible, and when the image
is booted up the right sub-volume is automatically picked. Something
similar of course applies to the apps too!

This also allows us to implement something that we like to call
Operating-System-As-A-Virus. Installing a new system is little more
than the following (a shell sketch follows the list):

  • Creating a new GPT partition table
  • Adding an EFI System Partition (FAT) to it
  • Adding a new btrfs volume to it
  • Deserializing a single usr sub-volume into the btrfs volume
  • Installing a boot loader into the EFI System Partition
  • Rebooting
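
A hedged shell sketch of those steps; the target device, partition
sizes, GPT type codes and stream file name are all assumptions of the
example (the dedicated GPT type mentioned above does not exist yet):

    # WARNING: illustrative only -- this wipes /dev/sdb.
    sgdisk --zap-all /dev/sdb
    sgdisk -n 1:0:+512M -t 1:ef00 /dev/sdb   # EFI System Partition
    sgdisk -n 2:0:0     -t 2:8300 /dev/sdb   # btrfs volume (generic Linux type for now)
    mkfs.vfat  /dev/sdb1
    mkfs.btrfs /dev/sdb2
    mount /dev/sdb2 /mnt
    btrfs receive -f usr-23.4.stream /mnt    # deserialize the vendor usr sub-volume
    umount /mnt
    # ...then install a boot loader of your choice into the ESP, and reboot.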

Now, since the only real vendor data you need is the usr sub-volume,
you can trivially duplicate this onto any block device you want. Let’s
say you are a happy Fedora user, and you want to provide a friend with
his own installation of this awesome system, all on a USB stick. All
you have to do for this is follow the steps above, using your installed
usr tree as source to copy. And there you go! And you don’t have to
be afraid that any of your personal data is copied too, as the usr
sub-volume is the exact version your vendor provided you with. Or in
other words: there’s no distinction anymore between installer images
and installed systems. It’s all the same. Installation becomes
replication, not more. Live-CDs and installed systems can be fully
identical.

Note that in this design apps are actually developed against a single,
very specific runtime, that contains all libraries it can link against
(including a specific glibc version!). Any library that is not
included in the runtime the developer picked must be included in the
app itself. This is similar to how apps on Android declare one very
specific Android version they are developed against. This greatly
simplifies application installation, as there’s no dependency hell:
each app pulls in one runtime, and the app is actually free to pick
which one, as you can have multiple installed, though only one is used
by each app.

Also note that operating systems built this way will never see
“half-updated” systems, as it is common when a system is updated using
RPM/dpkg. When updating the system the code will either run the old or
the new version, but it will never see part of the old files and part
of the new files. This is the same for apps, runtimes, and frameworks,
too.

Where We Are Now

We are currently working on a lot of the groundwork necessary for
this. This scheme relies on the ability to monopolize the
vendor OS resources in /usr, which is the key of what I described in
Factory Reset, Stateless Systems, Reproducible Systems & Verifiable Systems
a few weeks back. Then, of course, for the full desktop app concept we
need a strong sandbox that does more than just hide files from the
file system view. After all with an app concept like the above the
primary interface between the executed desktop apps and the rest of the
system is via IPC (which is why we work on kdbus and teach it all
kinds of sand-boxing features), and the kernel itself. Harald Hoyer has
started working on generating the btrfs send-and-receive images based
on Fedora.

Getting to the full scheme will take a while. Currently we have many
of the building blocks ready, but some major items are missing. For
example, we push quite a few problems into btrfs that other solutions
try to solve in user space. One of them is actually
signing/verification of images. The btrfs maintainers are working on
adding this to the code base, but currently nothing exists. This
functionality is essential though to come to a fully verified system
where a trust chain exists all the way from the firmware to the
apps. Also, to make the home sub-volume scheme fully workable we
actually need encrypted sub-volumes, so that the sub-volume’s
pass-phrase can be used for authenticating users in PAM. This doesn’t
exist either.

Working towards this scheme is a gradual process. Many of the steps we
require for this are useful outside of the grand scheme though, which
means we can slowly work towards the goal, and our users can already
benefit from what we are working on as we go.

Also, and most importantly, this is not really a departure from
traditional operating systems:

Each app and each OS sees a traditional Unix hierarchy with
/usr, /home, /opt, /var, /etc. It executes in an environment that is
pretty much identical to how it would be run on traditional systems.

There’s no need to fully move to a system that uses only btrfs and
follows strictly this sub-volume scheme. For example, we intend to
provide implicit support for systems that are installed on ext4 or
xfs, or that are put together with traditional packaging tools such as
RPM or dpkg: if the user tries to install a
runtime/app/framework/os image on a system that doesn’t use btrfs so
far, it can just create a loop-back btrfs image in /var, and push the
data into that. Even we developers will run our stuff like this for a
while; after all, this new scheme is not particularly useful for highly
individualized systems, and we developers usually tend to run
systems like that.
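
A minimal sketch of that loop-back fallback, with all file names and
sizes being illustrative assumptions:

    # Illustrative: create and mount a loop-back btrfs image on a non-btrfs /var.
    truncate -s 10G /var/lib/volumes.img
    mkfs.btrfs /var/lib/volumes.img
    mkdir -p /var/lib/volumes
    mount -o loop /var/lib/volumes.img /var/lib/volumes
    btrfs receive -f runtime-3.20.1.stream /var/lib/volumes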

Also note that this is in no way a departure from packaging systems like
RPM or DEB. Even if the new scheme we propose is used for installing
and updating a specific system, it is RPM/DEB that is used to put
together the vendor OS tree initially. Hence, even in this scheme
RPM/DEB are highly relevant, though not strictly as an end-user tool
anymore, but as a build tool.

So Let’s Summarize Again What We Propose

  • We want a unified scheme for how we can install and update OS images,
    user apps, runtimes and frameworks.

  • We want a unified scheme for how you can relatively freely mix OS
    images, apps, runtimes and frameworks on the same system.

  • We want a fully trusted system, where cryptographic verification of
    all executed code can be done, all the way to the firmware, as
    a standard feature of the system.

  • We want to allow app vendors to write their programs against very
    specific frameworks, under the knowledge that they will end up being
    executed with the exact same set of libraries chosen.

  • We want to allow parallel installation of multiple OSes and versions
    of them, multiple runtimes in multiple versions, as well as multiple
    frameworks in multiple versions. And of course, multiple apps in
    multiple versions.

  • We want everything double buffered (or actually n-fold buffered), to
    ensure we can reliably update/rollback versions, in particular to
    safely do automatic updates.

  • We want a system where updating a runtime, OS, framework, or OS
    container is as simple as adding in a new snapshot and restarting
    the runtime/OS/framework/OS container.

  • We want a system where we can easily instantiate a number of OS
    instances from a single vendor tree, with zero difference whether we
    boot them on bare metal, in a VM or as a container.

  • We want to enable Linux to have an open scheme that people can use
    to build app markets and similar schemes, not restricted to a
    specific vendor.

Final Words

I’ll be talking about this at LinuxCon Europe in October. I originally
intended to discuss this at the Linux Plumbers Conference (which I
assumed was the right forum for this kind of major plumbing level
improvement), and at linux.conf.au, but there was no interest in my
session submissions there…

Of course this is all work in progress. These are our current ideas we
are working towards. As we progress we will likely change a number of
things. For example, the precise naming of the sub-volumes might look
very different in the end.

Of course, we are developers of the systemd project. Implementing this
scheme is not just a job for the systemd developers. This is a
reinvention of how distributions work, and hence needs great support from
the distributions. We really hope we can trigger some interest by
publishing this proposal now, to get the distributions on board. This
after all is explicitly not supposed to be a solution for one specific
project and one specific vendor product; we care about making this
open, and solving it for the generic case, without cutting corners.

If you have any questions about this, you know how you can reach us
(IRC, mail, G+, …).

The future is going to be awesome!