Building a Graph Database on AWS Using Amazon DynamoDB and Titan

Post Syndicated from Nick Corbett original https://blogs.aws.amazon.com/bigdata/post/Tx12NN92B1F5K0C/Building-a-Graph-Database-on-AWS-Using-Amazon-DynamoDB-and-Titan

Nick Corbett is a Big Data Consultant for AWS Professional Services

You might not know it, but a graph has changed your life. A bold claim perhaps, but companies such as Facebook, LinkedIn, and Twitter have revolutionized the way society interacts through their ability to manage a huge network of relationships. However, graphs aren’t just used in social media; they can represent many different systems, including financial transactions for fraud detection, customer purchases for recommendation engines, computer network topologies, or the logistics operations of Amazon.com.

In this post, I would like to introduce you to a technology that makes it easy to manipulate graphs in AWS at massive scale. To do this, let’s imagine that you have decided to build a mobile app to help you and your friends with the simple task of finding a good restaurant. You quickly decide to build a ‘server-less’ infrastructure, using Amazon Cognito for identity management and data synchronization, Amazon API Gateway for your REST API, and AWS Lambda to implement microservices that fulfil your business logic. Your final decision is where to store your data. Because your vision is to build a network of friends and restaurants, the natural choice is a graph database rather than an RDBMS. Titan running on Amazon DynamoDB is a great fit for the job.

DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance together with seamless scalability. Recently, AWS announced a plug-in for Titan that allows it to use DynamoDB as a storage backend. This means you can now build a graph database using Titan and not worry about the performance, scalability, or operational management of storing your data.

Your vision for the network that will power your app is shown below. It illustrates the three major parts of a graph: vertices (or nodes), edges, and properties.

A vertex (or node) represents an entity, such as a person or restaurant. In your graph, you have three types of vertex: customers, restaurants, and the type of cuisine served (called genre in the code examples).

An edge defines a relationship between two vertices. For example, a customer might visit a restaurant or a restaurant may serve food of a particular cuisine. An edge always has direction – it will be outgoing from one vertex and incoming to the other.

A property is a key-value pair that enriches a vertex or an edge. For example, a customer has a name or the customer might rate their experience when they visit a restaurant.
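The three parts described above can be sketched as a tiny in-memory data structure. This is only an illustration in Python, not how Titan or DynamoDB store a graph; the class names, the `out_edges` helper, and the sample customer are assumptions made up for the example, while the labels `visit` and `restaurant_genre` and the property keys follow the queries later in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    label: str  # 'customer', 'restaurant' or 'genre'
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str      # e.g. 'visit', 'friend', 'restaurant_genre'
    out_v: Vertex   # the vertex the edge leaves (outgoing side)
    in_v: Vertex    # the vertex the edge arrives at (incoming side)
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: one customer, one restaurant, one genre.
alice = Vertex("U1", "customer", {"name": "Alice"})
cafe = Vertex("R1", "restaurant", {"restaurant_name": "Luna Cafe"})
genre = Vertex("G1", "genre", {"genreId": "Cafeteria"})

edges = [
    # A property on an edge: the customer's food rating for this visit.
    Edge("visit", alice, cafe, {"visit_food": 2}),
    Edge("restaurant_genre", cafe, genre),
]


def out_edges(vertex, label):
    """Return the edges with the given label that leave `vertex`."""
    return [e for e in edges if e.out_v is vertex and e.label == label]


visits = out_edges(alice, "visit")
print(visits[0].in_v.properties["restaurant_name"], visits[0].properties["visit_food"])
```

Because every edge records both its outgoing and its incoming vertex, the same edge can be followed in either direction, which is what the `out` and `in` steps in the Gremlin queries rely on.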

After a short time, your app is ready to be released, albeit as a minimum viable product. The initial functionality of your app is very simple: your customer supplies a cuisine, such as ‘Pizza’ or ‘Sushi’, and the app returns a list of restaurants they might like to visit.

To show how this works in Titan, you can follow the instructions in the AWS Big Data Blog’s GitHub repository to load some sample data into your own Titan database, using DynamoDB as the backend store. The data used in this example is based on a data set provided by the Machine Learning Repository at UCI (see References). By default, the example uses Amazon DynamoDB Local, a small client-side database and server that mimics the DynamoDB service. This component is intended to support local development and small-scale testing, and lets you save on provisioned throughput, data storage, and transfer fees.

Interaction with Titan is through a graph traversal language called Gremlin, in much the same way as you would use SQL to interact with an RDBMS. However, whereas SQL is declarative, Gremlin is implemented as a functional pipeline; the results of each operation in the query are piped to the next stage. This gives you a degree of control over not just what results your query generates but also how it is executed. Gremlin is part of the open source Apache TinkerPop stack, which has become the de facto standard framework for graph databases and is supported by products such as Titan, Neo4j, and OrientDB.

Titan is written in Java, and the sample data is loaded through its Java API by running Gremlin commands. Your microservices running in Lambda would use the same Java API, calling through to DynamoDB to store the data. The data stored in DynamoDB is compressed and not human-readable (for more information about the storage format, see Titan Graph Modeling in DynamoDB).

For the purposes of this post, however, it’s easier to use the Gremlin REPL, written in Groovy. The instructions on GitHub show you how to start your Gremlin session.

A simple Gremlin query that finds restaurants based on a type of cuisine is shown below:

gremlin> g.V.has('genreId', 'Pizzeria').in.restaurant_name

==>La Fontana Pizza Restaurante and Cafe
==>Dominos Pizza
==>Little Cesarz
==>pizza clasica
==>Restaurante Tiberius

This introduces the concept of how graph queries work; you select one or more vertices then use the language to walk (or traverse) across the graph. You can also see the functional pipeline in action as the results of each element are passed to the next step in the query. The query can be read as shown below.
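Read as a pipeline, the query has three steps: select the genre vertices whose ‘genreId’ is ‘Pizzeria’, follow their incoming edges back to the restaurants, and emit each restaurant’s name. The sketch below mimics those steps with Python generators so you can see each stage feeding the next. It is an illustration only; the helper names (`has`, `in_`, `prop`) and the three sample restaurants wired to genres are assumptions, not Titan’s API or the full sample data set.

```python
# Hypothetical sample data: genre vertices and restaurant -> genre edges.
pizzeria = {"genreId": "Pizzeria"}
cafeteria = {"genreId": "Cafeteria"}
GENRES = [pizzeria, cafeteria]

EDGES = [  # (outgoing restaurant vertex, incoming genre vertex)
    ({"restaurant_name": "Dominos Pizza"}, pizzeria),
    ({"restaurant_name": "Luna Cafe"}, cafeteria),
    ({"restaurant_name": "pizza clasica"}, pizzeria),
]


def has(vertices, key, value):
    """Step 1: keep only vertices whose property `key` equals `value`."""
    return (v for v in vertices if v.get(key) == value)


def in_(vertices):
    """Step 2: follow incoming edges back to the vertices they start from."""
    for v in vertices:
        for source, target in EDGES:
            if target is v:
                yield source


def prop(vertices, key):
    """Step 3: emit one property value per vertex."""
    return (v[key] for v in vertices)


# Each stage consumes the output of the previous one, like the Gremlin pipeline.
names = list(prop(in_(has(GENRES, "genreId", "Pizzeria")), "restaurant_name"))
print(names)  # ['Dominos Pizza', 'pizza clasica']
```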

The query gives us five restaurants to recommend to our customer. This query would be just as easy to run if your data was based in an RDBMS, so at this point not much is gained by using a graph database. However, as more customers start using your app and the first feature requests come in, you start to feel the benefit of your decision.

Initial feedback from your customers is good. However, they tell you that although it’s great to get a recommendation based on a cuisine, it would be better if they could receive recommendations based on places their friends have visited. You quickly add a ‘friend’ feature to the app and change the Gremlin query that you use to provide recommendations:

This query assumes that a particular user (‘U1064’) has asked us to find a ‘Cafeteria’ restaurant that their friends have visited. The Gremlin syntax can be read as shown below.

This query uses a pattern called ‘backtrack’. You make a selection of vertices and ‘remember’ them. You then traverse the graph, selecting more nodes. Finally, you ‘backtrack’ to your remembered selection and reduce it to those vertices that have a path through to your current position.
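The backtrack pattern can be sketched in plain Python. This is a minimal illustration, assuming made-up friends and visits and assuming each restaurant has a single genre; the comments map each line to the Gremlin steps `as` and `back` that appear in the queries later in this post.

```python
# Hypothetical sample data keyed by user or restaurant name.
FRIENDS = {"U1064": ["U1", "U2"]}
VISITS = {"U1": ["Luna Cafe", "Dominos Pizza"], "U2": ["Luna Cafe"]}
GENRE = {"Luna Cafe": "Cafeteria", "Dominos Pizza": "Pizzeria"}


def friends_restaurants(user, genre_id):
    """Restaurants of one genre that the user's friends have visited."""
    results = []
    for friend in FRIENDS.get(user, []):        # out('friend')
        for restaurant in VISITS.get(friend, []):   # out('visit')
            remembered = restaurant             # as('x'): remember this vertex
            genre = GENRE.get(restaurant)       # out('restaurant_genre')
            if genre == genre_id:               # has('genreId', ...)
                results.append(remembered)      # back('x'): return to it
    return results


print(friends_restaurants("U1064", "Cafeteria"))  # ['Luna Cafe', 'Luna Cafe']
```

The same restaurant appears twice because two different paths reach it, one through each friend. A pipeline emits one result per path, so you would deduplicate or count the results before showing them to a customer.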

Again, this query could be executed in an RDBMS but it would be complex. Because you would keep all customers in a single table, finding friends would involve looping back to join a table to itself. While it’s perfectly possible to do this in SQL, the syntax can become long, especially if you want to loop multiple times; for example, how many of my friends’ friends have visited the same restaurant as me? A more important problem would be the performance. Each SQL join would introduce extra latency to the query and you may find that, as your database grows, you can’t meet the strict latency requirements of a modern app. In my test system, Titan returned the answer to this query in 38 ms, but the RDBMS where I staged the data took over 0.3 seconds to resolve, roughly an order of magnitude difference!

Your new recommendations work well, but some customers are still not happy. Just because their friends visited a restaurant doesn’t mean that they enjoyed it; they only want recommendations for restaurants their friends actually liked. You update your app again and ask customers to rate their experience, using ‘0’ for poor, ‘1’ for good, and ‘2’ for excellent. You then modify the query to:

g.V.has('userId','U1101').out('friend').outE('visit').has('visit_food', T.gte, 1).as('x').inV.as('y').out('restaurant_genre').has('genreId', 'Seafood').back('x').transform{e, m -> [food: m.x.visit_food, name:m.y.restaurant_name]}.groupCount{it.name}.cap

==>{Restaurante y Pescaderia Tampico=1, Restaurante Marisco Sam=1, Mariscos El Pescador=2}

This query is based on a user (‘U1101’) asking for a seafood restaurant. The stages of the query are shown below.

This query shows how you can filter for a property on an edge. When you traverse the ‘visit’ edge, you filter for those visits where the food rating was greater than or equal to 1. The query also shows how you can transform results from a pipeline to a new object. You build a simple object, with two properties (food rating and name), for each ‘hit’ you have against your query criteria. Finally, the query also demonstrates the ‘groupCount’ function. This aggregation provides a count of each unique name.
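The filter, transform, and group-count stages can be sketched in Python as follows. The visit edges, restaurant names, and ratings below are invented for the illustration and are not taken from the sample data set; `collections.Counter` stands in for Gremlin’s ‘groupCount’.

```python
from collections import Counter

# Hypothetical 'visit' edges from friends to seafood restaurants.
VISIT_EDGES = [
    {"friend": "U1", "restaurant": "Seafood A", "visit_food": 2},
    {"friend": "U2", "restaurant": "Seafood A", "visit_food": 1},
    {"friend": "U2", "restaurant": "Seafood B", "visit_food": 1},
    {"friend": "U3", "restaurant": "Seafood B", "visit_food": 0},
]


def recommend(edges, min_food=1):
    """Count qualifying visits per restaurant name."""
    # Filter on an edge property: keep visits rated at least `min_food`.
    liked = (e for e in edges if e["visit_food"] >= min_food)
    # Transform each hit into a small object with a food rating and a name.
    hits = ({"food": e["visit_food"], "name": e["restaurant"]} for e in liked)
    # Group and count by name.
    return Counter(hit["name"] for hit in hits)


counts = recommend(VISIT_EDGES)
print(counts.most_common(1))  # [('Seafood A', 2)]
```

The visit rated ‘0’ is dropped before counting, so a restaurant only scores for visits that a friend rated ‘good’ or better.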

The net result of this query is that the ‘best’ seafood restaurant to recommend is ‘Mariscos El Pescador’, as your customer’s friends have made two visits in which they rated the food as ‘good’ or better.

The reputation of your app grows and more and more customers sign up. It’s great to take advantage of DynamoDB scalability; there’s no need to re-architect your solution as you gain more users, as your storage backend can scale to deal with millions or even hundreds of millions of customers.

Soon, it becomes apparent that most of your customers are using your app when they are out and about. You need to enhance your app so that it can make recommendations that are close to the customer. Fortunately, Titan comes with built-in geo queries. The query below imagines that customer ‘U1064’ is asking for a ‘Cafeteria’ and that you’ve captured the location of their mobile as (22.165, -101.0):

g.V.has('userId', 'U1064').out('friend').outE('visit').has('visit_rating', T.gte, 2).has('visit_food', T.gte, 2).inV.as('x').out('restaurant_genre').has('genreId', 'Cafeteria').back('x').has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5)).as('b').transform{e, m -> m.b.restaurant_name + " distance " + m.b.restaurant_place.getPoint().distance(Geoshape.point(22.165, -101.00).getPoint())}

==>Luna Cafe distance 2.774053451453471
==>Cafeteria y Restaurant El Pacifico distance 3.064723519030348

This query is the same as before except that there’s an extra filter:

has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5))

Each restaurant vertex has a property called ‘restaurant_place’, which is a geo-point (a latitude and longitude). The filter restricts selection to any restaurants whose ‘restaurant_place’ is within 5 km of the customer’s current location. The part of the query that transforms the output from the pipeline is modified to include the distance to the customer. You can use this to order your recommendations so the nearest is shown first.
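To make the ‘within 5 km’ filter concrete, here is a minimal Python sketch that uses the haversine great-circle formula and then sorts the hits by distance. It is an assumption that this approximates what Titan computes; Titan’s own distance calculation may differ in detail, and the two restaurants and their coordinates are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(restaurants, lat, lon, radius_km):
    """Return (distance_km, name) pairs inside the circle, nearest first."""
    hits = [(haversine_km(lat, lon, r["lat"], r["lon"]), r["name"]) for r in restaurants]
    return sorted((d, n) for d, n in hits if d <= radius_km)


# Hypothetical restaurants near the customer's location (22.165, -101.00).
RESTAURANTS = [
    {"name": "Far", "lat": 22.300, "lon": -101.00},   # roughly 15 km north
    {"name": "Near", "lat": 22.170, "lon": -101.00},  # roughly 0.56 km north
]

nearby = within(RESTAURANTS, 22.165, -101.00, 5)
for distance, name in nearby:
    print(f"{name} distance {distance:.2f}")
```

Sorting the (distance, name) pairs is one simple way to show the nearest recommendation first, as suggested above.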

Your app hits the big time as more and more customers use it to find a good dining experience. You are approached by one of the restaurants, which wants to run a promotion to acquire new customers. Their request is simple – they will pay you to send an in-app advert to your customers who are friends of people who have visited their restaurant, but who haven’t visited the restaurant themselves. Relieved that your app can finally make some money, you set about writing the query. This type of query follows an ‘except’ pattern:

gremlin> x = []
gremlin> g.V.has('RestaurantId','135052').in('visit').aggregate(x).out('friend').except(x).userId.order

The query assumes that RestaurantId 135052 has made the approach. The first line defines a variable ‘x’ as an empty list. The steps of the query are shown below.

The ‘except’ pattern used in this query makes it very easy to select elements that have not been selected in a previous step. This makes queries such as the one above, or “who are a customer’s friends’ friends that are not already their friends”, easy to resolve. Once again, you could write this query in SQL, but the syntax would be far more complex than the simple Gremlin query used above, and the multiple joins needed to resolve the query would affect performance.
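The ‘except’ pattern amounts to collecting one set of vertices, traversing onward, and discarding anything already collected. The Python sketch below illustrates that idea; the visitors and friendships are invented for the example, and the comments map each line to the corresponding Gremlin step.

```python
# Hypothetical sample data: who visited the restaurant, and who their friends are.
VISITORS = {"135052": ["U1", "U2"]}
FRIENDS = {"U1": ["U2", "U3"], "U2": ["U1", "U4", "U3"]}


def promo_targets(restaurant_id):
    """Friends of visitors who have not visited the restaurant themselves."""
    x = list(VISITORS.get(restaurant_id, []))    # in('visit').aggregate(x)
    already_visited = set(x)
    targets = []
    for visitor in x:
        for friend in FRIENDS.get(visitor, []):  # out('friend')
            if friend not in already_visited:    # except(x)
                targets.append(friend)
    return sorted(targets)                       # userId.order


print(promo_targets("135052"))  # ['U3', 'U3', 'U4']
```

‘U3’ appears twice because two visitors list that user as a friend. Before sending adverts you would probably deduplicate the result, for example with `sorted(set(targets))`.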

Summary

In this post, I’ve shown you how to build a simple graph database using Titan with DynamoDB for storage. Compared to a more traditional RDBMS approach, a graph database can offer many advantages when you need to model a complex network. Your queries will be easier to understand and you may well get better performance from using a storage engine geared towards graph traversal. Using DynamoDB for your storage gives the added benefit of a fully managed, scalable repository for storing your data. You can concentrate on producing an app that excites your customers rather than managing infrastructure.

If you have any questions or suggestions, please leave a comment below.

References

Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011

Thoughts on Canonical, Ltd.’s Updated Ubuntu IP Policy

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2015/07/15/ubuntu-ip-policy.html

Most of you by now have probably seen Conservancy’s and FSF’s statements
regarding today’s update to Canonical, Ltd.’s Ubuntu IP Policy. I have a
few personal comments, speaking only for myself, that I want to add that
don’t appear in the FSF’s or Conservancy’s analysis. (I wrote nearly all of
Conservancy’s analysis and did some editing on FSF’s analysis, but the
statements I add here are my personal opinions and don’t necessarily
reflect the views of the FSF or Conservancy, notwithstanding that I have
affiliations with both orgs.)

First of all, I think it’s important to note the timeline: it took two
years of work by two charities to get this change done. The scary thing is
that compared to their peers who have also violated the GPL, Canonical,
Ltd. acted rather quickly.
As Conservancy pointed out regarding the VMware lawsuit, it’s not uncommon for these
negotiations to take even four years before we all give up and have to file
a lawsuit. So, Canonical, Ltd. resolved the matter at least twice
as fast as VMware, and they deserve some credit for that — even if
other GPL violators have set the bar quite low.

Second, I have to express my sympathy for the positions on this matter
taken by Matthew Garrett and Jonathan Riddell. Their positions show clearly
that, while the GPL violation is now fully resolved, the community is very
concerned about what happens regarding non-copylefted software in Ubuntu,
and thus Ubuntu as a whole.

Realize, though, that these trump clauses are widely used throughout the
software industry. For example, electronics manufacturers who ship an
Android/Linux system with a standard, disgustingly worded, forbid-everything
EULA usually include a trump clause not unlike Ubuntu’s. In such systems,
usually, the only copylefted program is the kernel named Linux. The rest
of the distribution includes tons of (now proprietarized) non-copylefted
code from Android (as well as a bunch of born-proprietary applications
too). The trump clause assures the software freedom rights for that one
copylefted work present, but all the non-copylefted ones are subject to the
strict EULA (which often includes “no reverse engineer
clauses”, etc.). That means if the electronics company did change
the Android Java code in some way, you can’t even legally reverse engineer
it — even though it was Apache-licensed by upstream.

Trump clauses are thus less than ideal because they achieve compliance
only by allowing a copyleft to prevail when the overarching license
contradicts specific requirements, permissions, or rights under copyleft.
That’s acceptable because copyleft licenses have many important clauses
that assure and uphold software freedom. By contrast, most non-copyleft
licenses have very few requirements, and thus they lack adequate terms to
triumph over any anti-software-freedom terms of the overarching license.
For example, if I take a 100% ISC-licensed program and build a
binary from it, nothing in the ISC license prohibits me from imposing this
license on you: “you may not redistribute this binary
commercially”. Thus, even if I also say to you: “but also, if
the ISC license grants rights, my aforementioned license does not modify or
reduce those rights”, nothing has changed for you. You still have a
binary that you can’t distribute commercially, and there was no text in the
ISC license to force the trump clause to save you.

Therefore, this whole situation is a simple and clear argument for why
copyleft matters. Copyleft can and does (when someone like me actually
enforces it) prevent such situations. But copyleft is not infinitely
expansive. Nearly every full operating system distribution available
includes an aggregated mix of copylefted, non-copyleft, and often
fully-proprietary userspace applications. Nearly every company that
distributes them wraps the whole thing with some agreement that restricts
some rights that copyleft defends, and then adds a trump clause that gives
an exception just for FLOSS license compliance. Sadly, I have yet to see a
company trailblaze adoption of a “software freedom
preservation” clause that guarantees copyleft-like compliance for
non-copylefted programs and packages. Thus, the problem with Ubuntu is
just a particularly bad example of what has become a standard industry
practice by nearly every “open source” company.

How badly these practices impact software freedom depends on the
strictness and detailed terms of the overarching license
(and not the contents of the trump clause itself; they are
generally isomorphic [0]). The task of analyzing and
rating “relative badness” of each overarching licensing
document is monumental; there are probably thousands of different ones in
use today. Matthew Garrett points out why Canonical, Ltd.’s is
particularly bad, but that doesn’t mean there aren’t worse (and better)
situations of a similar ilk. Perhaps our next best move is to use copyleft
licenses more often, so that the trump clauses actually do more.

In other words, as long as there is non-copylefted software aggregated in a
given distribution of an otherwise Free Software system, companies will
seek to put non-Free terms on top of the non-copylefted parts. To my
knowledge, every distribution-shipping company (except for
extremely rare, Free-Software-focused companies like ThinkPenguin) places
some kind of restrictions in their business terms for their enterprise
distribution products. Everyone seems to be asking me today to build the
“worst to almost-benign” ranking of these terms, but I’ve
resisted the urge to try. I think the safe bet is to assume that if you’re
looking at one of these trump clauses, there is some sort of
software-freedom-unfriendly restriction floating around in the broader
agreement, and you should thus just avoid that product entirely. Or, if
you really want to use it, fork it from source and relicense the
non-copylefted stuff under copyleft licenses (which is permitted by nearly
all non-copyleft licenses), to prevent future downstream actors from adding
more restrictive terms. I’d even suggest this as a potential solution to
the current Ubuntu problem (or, better yet, just go back upstream to Debian
and do the same :).

Finally, IMO the biggest problem with these “overarching licenses
with a trump clause” is their use by companies who herald “open
source” friendliness. I suspect the community ire comes from a sense
of betrayal. Yet, I feel only my usual anger at proprietary software here;
I don’t feel betrayed. Rather, this is just another situation that proves
that saying you are an “open source company” isn’t enough;
only the company’s actions and “fine print” terms matter. Now
that open source has really succeeded at coopting software freedom,
enormous effort is required to ascertain if any company respects your
software freedom. We must ignore the ballyhoo of “community
managers” and look closely at the real story.

[0] Despite Canonical,
Ltd.’s use of a trump clause, I don’t think these various trump
clauses are canonically isomorphic. There is no natural mapping
between these various trump clauses, but they all do have the same
effect: they assure that when the overarching terms conflict with
a FLOSS license, the FLOSS license triumphs over the
overarching terms, no matter what they are. However, the
potential relevance of the phrase “canonical
isomorphism” here is yet another example of why it’s confusing
and insidious that Canonical, Ltd. insisted so strongly
on using canonical in a non-canonical way.

SSDs: A gift and a curse

Post Syndicated from Laurie Denness original https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/

Artur Bergman, founder of a CDN exclusively powered by super fast SSDs, has made many compelling cases over the years to use them. He was definitely ahead of the curve here, but he’s right. Nowadays, they’re denser, 100x faster and as competitively priced as hard disks in most server configurations.

At Etsy, we’ve been trying to get on this bandwagon for the last 5 years too. It’s got a lot better value for money in the last year, so we’ve gone from “dipping our toes in the water” to “ORDER EVERYTHING WITH SSDs!” pretty rapidly.

This isn’t a post about how great SSDs are though: Seriously, they’re amazing. The new Dell R630 allows for 24x 960GB 1.8″ SSDs in a 1U chassis. That’s 19TB usable ludicrously fast, sub millisecond latency storage after RAID6, that will blow away anything you can get on spinning rust, use less power, and is actually reasonably priced per GB.

Picture of Dell R630, 24x 960GB SSDs in 1U chassis
Plus, they look amazing.
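
As a rough check of the 19TB figure above, here is a small Python sketch of the RAID6 arithmetic. It assumes two drives' worth of capacity go to parity, that 960GB means 960 × 10^9 bytes, and that the usable figure is quoted in binary terabytes (TiB). The function name is made up for illustration.

```python
def raid6_usable_tib(drive_count: int, drive_size_gb: int) -> float:
    """Approximate usable RAID6 capacity in TiB (assumes decimal GB drives)."""
    if drive_count < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    data_drives = drive_count - 2          # two drives' worth of parity
    usable_bytes = data_drives * drive_size_gb * 10**9
    return usable_bytes / 2**40            # bytes -> TiB


# 24 drives of 960GB each, as in the chassis described above
print(round(raid6_usable_tib(24, 960), 1))
```

Under those assumptions the result comes out at roughly 19.2, which is consistent with the number quoted above; formatting and controller overhead are ignored.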

So if this post isn’t “GO BUY ALL THE SSDs NOW”, what is it? Well, it’s a cautionary tale that it’s not all unicorns and IOPs.

The problem(s) with SSDs

When SSDs first started to come out, people were concerned that these drives “only” handled a certain number of operations or data during their lifetime, and they’d be changing SSDs far more frequently than conventional spinning rust. Actually, that’s totally not the case and we haven’t experienced that at all. We have thousands of SSDs, and we’ve lost maybe one or two to old age, and it probably wasn’t wear related.

Spoiler alert: SSD firmware is buggy

When was the last time your hard disk failed because the firmware did something whacky? Well, Seagate had a pretty famous case back in 2009 where the drives may not ever power on again if you power them off. Whoops.

But the majority of times, the issue is the physical hardware… The infamous “spinning rust” that is in the drive.

So, SSDs solve this forever right? No moving parts.. Measured mean time to failure of hundreds of years before the memory wears out? Perfect!

Here’s the run down of the firmware issues we’ve had over 5 or so years:

Intel

Okay, bad start, we’ve actually had no issues with Intel. This seems to be common across other companies we’ve spoken to. We started putting single 160GB drives in our web servers about 4 years ago, because it gave us low power, fast, reliable storage and the space requirements for web servers and utility boxes were low anyway. No more waiting for the metal to seize up! We have SSDs that have long outlived the servers.

OCZ

Outside of the 160GB Intel drives, our search (Solr) stack was the first to benefit from denser, fast storage. Search indexes were getting big; too big for memory. In addition, getting them off disk and serving search results to users was limited by the random disk latency.

Rather than many expensive, relatively fast but low capacity spinning rust drives in a RAID array, we opted for OCZ Talos 960GB disks. These weren’t too bad; we had a spate of initial failures in what seemed like a bad batch, but we were able to learn from this and make the app more resilient to failures.

However, they had poor SMART info (none) so predicting failures was hard.

Unfortunately, the company later went bankrupt, and Toshiba rescued them from the dead. They were unavailable for long enough that we simply ditched them and moved on.

HP SSDs

We briefly tried running third party SSDs on our older (HP) Graphite boxes… This was a quick, fairly cheap win as it got us a tonne of performance for relatively little money (back then we needed much less Graphite storage). This worked fine until the drives started to fail.

Unfortunately, HP have proprietary RAID controllers, and they don’t support SMART. Or rather, they refuse to talk to non-HP drives using off the shelf technology; they have their own methods.

Slot an unsupported disk or SSD into the controller, and you have no idea how that drive is performing or failing. We quickly learnt this after running for a while on these boxes, and performance randomly tanked. The SSDs underlying the RAID array seemed to be dying and slowing down, and we had no way of knowing which one (or ones), or how to fix it. Presumably the drives were not being issued TRIM commands either.

When we had to purchase a new box for our primary database this left us with no choice: we had to pay HP for SSDs. 960GB SSDs direct from HP, properly supported, cost us around $7000 each. Yes, each. We had to buy 4 of them to get the storage we needed.

On the upside, they do have fancy detailed stats (like wear levelling) exposed via the controller and ILO, and none have failed yet almost 3 years on (in fact, they’re all showing 99% health). You get what you pay for, luckily.

Samsung

Samsung saved the day and picked up from OCZ with a ludicrously cheap 960GB offering, the 840 EVO. A consumer drive, so very limited warranty, but for the price (~$400-500) you got great IOPS and they were reliable. They had better SMART info, and seemed to play nicely with our hardware.

We have a lot of these drives:

[~/chef-repo (master)] $ knife search node block_device_sda_model:'Samsung' -a block_device.sda.model

117 items found

That’s 117 hosts with those drives, most of which have 6 each, and that doesn’t include hosts that have them behind RAID controllers (for example, our Graphite boxes). In particular, they’ve been awesome for our ELK logging cluster.

Then BB6Q happened…

I hinted that we used these for Graphite. They worked great! Who wouldn’t want thousands and thousands of IOPs for relatively little money? Buying SSDs from OEMs is still expensive, and they give you those darn fancy “enterprise” level drives. Pfft. Redundancy at the app level, right?

We had started buying Dell, who use a rebranded LSI RAID controller so they happily talked to the drives including providing full SMART info. We had 16 of those Samsung drives behind the Dell controller giving us 7.3TB of super fast storage.

Given the already proven pattern, we ordered the same spec box for a Ganglia hardware refresh. And they didn’t work. The RAID controller hung on startup trying to initialise the drives, so long that the Boot ROM was never loaded so it was impossible to boot from an array created using them.

What had changed?! A quick

"MegaCli -AdpAllInfo -a0 | diff"

on the two boxes, revealed: The firmware on the drive had changed. (shout out to those of you who know the MegaCli parameters by heart now…)
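
The same compare-two-boxes trick can be sketched in Python. This is only an illustration: the report text below is invented (it is not real MegaCli output, and the OLDFW revision is a placeholder), and `changed_lines` is a hypothetical helper. Only the BB6Q revision comes from the story.

```python
import difflib


def changed_lines(old_report: str, new_report: str) -> list[str]:
    """Return only the added/removed lines between two text reports."""
    diff = list(
        difflib.unified_diff(
            old_report.splitlines(),
            new_report.splitlines(),
            lineterm="",
            n=0,  # no context lines, just the changes
        )
    )
    # The first two lines are the '---' / '+++' file headers; skip them,
    # then keep the '+' and '-' lines and drop the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Invented sample reports from a "working" and a "broken" box
working_box = "Model: Samsung SSD\nFirmware: OLDFW\nSlots: 16"
broken_box = "Model: Samsung SSD\nFirmware: BB6Q\nSlots: 16"

for line in changed_lines(working_box, broken_box):
    print(line)
```

In practice the two inputs would be the saved output of the same command run on each host.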

Weeks of debugging and back and forth with both Dell (who were very nice given these drives were unsupported) and Samsung revealed there were definitely firmware issues with this particular BB6Q release.

It soon became public that not only did this new firmware somehow break compatibility with Dell RAID controllers (by accident), but it also had a crippling performance bug… The drives got slower and slower over time, because Samsung had messed up their block allocation algorithm.

In the end, behind LSI controllers, it was the controller sending particular ATA commands to the drives that would make them hang and not respond.. And so the RAID controller would have to wait for it to time out.

Samsung put out a firmware updater and “fixer” tool for this, but it needed to move your data around so only ran on Windows with NTFS.

With hundreds of these things that are in production and working, but have a crippling performance issue, we had to figure out how they would get flashed. An awesome contractor for Samsung agreed that if we drove over batches of drives (luckily, they are incredibly close to our datacenter) they would flash them and return them the next day.

This story has a relatively happy ending then; our drives are getting fixed, and we’re still buying their drives; now the 960GB 850 PRO model, as they remain a great value for money high performance drive.

Talking with other companies, we’re not alone with Samsung issues like this, even the 840 PRO has some issues that require hard power cycles to fix. But the price is hard to beat, especially now the 850 range is looking more solid.

LiteOn

LiteOn were famously known for making CD writers back when CD writers were new and exciting.

But they’re also a chosen OEM partner of Dell’s for their official “value” SSDs. Value is a relative term here, but they’re infinitely cheaper than HP’s offerings, enterprise level, fully supported and for all that, “only” twice the price of Samsung (~$940)

We decided to buy new SSD based database boxes, because SSDs were too hard to resist for these use cases; crazy performance and at 1TB capacity, not too much more expensive per GB than spinning rust. We had to buy many many 15,000rpm drives to even get near the performance, and they were expensive at 300GB capacity. We could spend a bit more money and save power, rack space, and get more disk space and IOPs.

For similar reasons to HP, we thought it best to pay the premium for a fully supported solution, especially as Samsung had just caused all these issues with their firmware.

With that in mind, we ordered some R630’s hot off the production line with 960GB LiteOn’s, tested performance, and it was great: 30,000 random write IOPs across 4 SSDs in RAID6, (5.5 TB useable space).

We put them live, and they promptly blew up spectacularly. (Yes, we had a postmortem about this). The RAID controller claimed that two drives had died simultaneously, with another being reset by the adapter. Did we really get two disks to die at once?

This took months of working closely with Dell to figure out. Replacement of drives, backplane, and then the whole box, but the problem persisted. Just a few short hours of intense IO, especially on a box with only 4 SSDs would cause them to flip out. And in the mean time, we’d ordered 50+ of these boxes with varying amounts of SSDs installed, having tested so well initially.

Eventually it transpired that, like most good problems, it was a combination of many factors that caused these issues. The SSDs were having extended garbage collection periods, exacerbated by a smaller number of SSDs with higher IO, in RAID6. This caused the controller to kick the drive out of the array… and unfortunately due to the write levelling across the drives, at least two of them were garbage collecting at the same time, destroying the array integrity.

The fix was no small deal; Dell and LiteOn together identified and fixed weaknesses in their RAID controller, the backplane and the SSD firmware. It was great to see the companies working together rather than just pointing fingers here, and the fixes for all sizes except 960GB were out within a month.

The story here continues for us though; the 960GB drive remains unsolved, as it caused more issues, and we had almost exclusively purchased those. For systems that weren’t fully loaded, Dell kindly provided us with 800GB replacements and extra drives to make up the space. For the rest, the stress is spread across the 22 drives, so garbage collection isn’t as intense and they remain in operation until a firmware fix arrives.

Summary

I’m hesitant to recommend any one particular brand, because I’m sure as with the hard disk phenomenon (the law where each person has their preferred brand that they’ve never had issues with, but everyone else has), people’s experiences will have varied.

We should probably collect some real data on this as an industry and share it around; I’ve always been of the mindset that we’re weirdly secretive sometimes of what hardware/software we use but we should share, so if anyone wants to contribute let me know.

But: you can probably continue to buy Intel and Samsung, depending on your use case/budget, and as usual, own your own availability and add resiliency to your apps and hardware, because things always fail in ways you can’t imagine.

Always Follow the Money

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2014/10/10/anita-borg.html

Selena Larson wrote an article describing the Male Allies Plenary Panel
at the Anita Borg Institute’s Grace Hopper Celebration on Wednesday night.
There is a video available of the panel
(that’s the youtube link, the links on Anita Borg Institute’s
website don’t work with Free Software).

Selena’s article pretty much covers it. The only point that I thought
useful to add was that one can “follow the money” here.
Interestingly enough, Facebook, Google, GoDaddy, and Intuit were all
listed as top-tier sponsors of the event.
I find it a strange correlation that not one man on this panel is from a
company that didn’t sponsor the event. Are there no male allies
to the cause of women in tech worth hearing from who work for companies that, say,
don’t have enough money to sponsor the event? Perhaps that’s true, but
it’s somewhat surprising.

Honest US Congresspeople often say that the main problem with corruption
of campaign funds is that those who donate simply have more access and time
to make their case to the congressional representatives. They aren’t
buying votes; they’re buying access for conversations. (This was covered
well in This American Life, Episode 461).

I often see a similar problem in the “Open Source” world. The
loudest microphones can be bought by the highest bidder (in various ways),
so we hear more from the wealthiest companies. The amazing thing about
this story, frankly, is that buying the microphone didn’t work
this time. I’m very glad the audience refused to let it happen! I’d love
to see a similar reaction at the corporate-controlled “Open Source and
Linux” conferences!

Update later in the day: The conference I’m commenting on
above is the same conference where Satya Nadella, CEO of Microsoft, said
that women shouldn’t ask for raises, and Microsoft is also a
top-tier sponsor of the conference. I’m left wondering if anyone who spoke
at this conference didn’t pay for the privilege of making these gaffes.

Why I’ll be letting Nagios live on a bit longer, thank you very much

Post Syndicated from Laurie Denness original https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/

My my, hasn’t @supersheep stirred up a bit of controversy over Nagios over the last week?

In case you missed it, he brought up an excellent topic that’s close to my heart: Nagios. In his words, we should stop using it, so we can let it die. (I read about this in DevOpsWeekly, which you should absolutely sign up to if you haven’t already, it’s fantastic)

Mr Sheep (Andy) brought up some excellent points, and when I read them I must admit getting fairly triggered and angry that someone would speak about one of my favourite things in such a horrible way! Then maybe I started thinking I had a problem. Was I blindly in love with this thing? Naive to the alternatives, a fan boy? Do I need help? Luckily I could reach out to my wonderful coworkers, and @benjammingh was quick to confirm that yes, I do need help, but then again don’t we all. That’s a separate issue.

Anyway, the folks at reddit had plenty to say about this too. Some of the answers are sane, some are… Not so. Other people were seemingly very angry too. I don’t blame them.. It’s a bold move to stand up and say a perfectly good piece of software “sucks” and “you shouldn’t use it”. Which was the intention, of course, to make us talk about it.

Now the dust has settled slightly, I’m going to tell you why I still love Nagios, and why it will continue to be used at Etsy, addressing the points Andy brought up individually.

“Doesn’t scale at all”

Yeah, that Gearman thing freaks me out too. I don’t think I’d want to use it, even though we use Gearman extremely heavily at Etsy for the site (we even invited the creator in for our Code as Craft speaker series).

But what scale are people talking here? Is it really that hard?

We “only” have 10,000 checks in our primary datacenter, all active, usually on 2-3 minute check intervals with a bunch on 30 seconds. I’m honestly not sure if that’s impressive or embarrassing, but the machine is 80% idle, so it’s not like there isn’t headroom for more. And this isn’t a super-duper spec box by any means. In fact, it’s actually one of the oldest servers we have.

use_large_installation_tweaks

We had to enable use_large_installation_tweaks to get the latency down, and apart from that it made absolutely no difference to our Nagios operation. Our check latency is currently 2.324 seconds.

I’m not sure how familiar people are with this flag… Our latency crept up to minutes without it, and it’s not massively well documented online that you can probably enable it with almost no effect to anything except… Making Nagios not suck quite so much.

It’s literally a “go faster” flag.

Disable CPU scaling

Our Nagios boxes are HP or Dell servers, that by default have a “dynamic” CPU scaling setting enabled. Great for power saving, but for some reason the intelligence built into this system is absolutely horrible with Nagios. Because Nagios has extremely high context switches, but relatively low CPU, it causes a lot of problems with the intelligent management. If you’re still having latency issues, set the server to “static high performance mode” or equivalent.

We’ve tested this in a bunch of other places, and the only other place it helped was syslog-ng. Normally it’s pretty smart, but there *are* a few cases that trip it up.

Horizontal Scaling

The reason we’ve ended up with 10,000 checks on that single box is because that datacenter is now full, and we’ve moved onto another one, so we’ve started scaling Nagios horizontally rather than vertically. It makes a lot more sense to have a Nagios instance in each network/datacenter location so you can get a “clean view” of what’s going on inside that datacenter rather than letting a network link show half the hosts as dead. If you lose cross-DC connectivity, how will you ever know what really happened in that DC when it comes back?

This does present some small annoyances, for example we needed to come up with a solution for aggregating status together into one place. We use Nagdash for that. It uses the nagios-api, which I’ll come onto more later. We also use nagios-api to allow us to downtime hosts quickly and easily via irccat in IRC, regardless of the datacenter.

We’ve done the same with Ganglia and FITB too, for the same reasons. Much easier to scale things by adding more boxes, once you get over the hurdles of managing multiple instances. As long as you’re using configuration management.

“Second most horrible configuration”

After sendmail. Fair enough… m4 anyone? Some people like it though, it’s called preference.

Anyway, those are some strong feelings. Ever used XML based configuration? ini files? Yaml? Hadoop? In *my opinion* they’re worse. Maybe you’re a fan.

Regardless, if you spend your day picking through Nagios config files, then you probably either love it anyway, you’re doing a huge rewrite of your old config, or you’re probably doing it wrong. You can easily automate this.

We came up with a pretty simple solution for the split NRPE/Nagios configs thing at Etsy: Stop worrying about the NRPE configs and put every check on every host. The entire directory is 3MB, and does it matter if you have a check on a system you never use? No. Now you only have one config to worry about.

Andy acknowledges Chef/Puppet automation later where he calls using them to manage your Nagios configuration a “band aid”. Is managing your Apache config a “band aid”? How about your resolv.conf? Depending on your philosophy, you could basically call configuration management in general a giant band aid. Is that a bad thing? No! That’s what makes it awesome. Our job is tying together components to construct a functioning system, at many many levels. At the highest level, at Etsy we’re here to make a shopping website. There are a bunch more systems tied together to make that possible lower down.

This is actually the Unix philosophy. Many small parts, applications that do a small specific thing, which you tie together using “|”. A pipe. You pipe data in to one application, and you manipulate it how you want on the way out. Which brings me onto:

“No programmatic interfaces”

At this point I am threatened with “If I catch you parsing status.dat I will beat your ass”. Bring it on!

We’re using the wonderful nagios-api project extremely heavily at Etsy because it provides a fantastic REST API for anything you’ve ever wanted in Nagios. And it does so by parsing status.dat. So sue me. Call me crazy, but isn’t parsing one machine readable output into another machine readable output basically computers? Where exactly is the issue in that?
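
To show how little is involved in that kind of translation, here is a minimal Python sketch of turning status.dat-style blocks into JSON. It is not how nagios-api is implemented, the sample data is invented, and it assumes the usual layout of a `type {` line, `key=value` lines and a closing `}`.

```python
import json


def parse_status_dat(text: str) -> list[dict]:
    """Parse 'type { key=value ... }' blocks into a list of dicts."""
    blocks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        if line.endswith("{") and "=" not in line:
            current = {"type": line[:-1].strip()}
        elif line == "}":
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")  # values may contain '='
            current[key] = value
    return blocks


# Invented sample in the status.dat style
SAMPLE = """
hoststatus {
    host_name=web01
    current_state=0
    }

servicestatus {
    host_name=web01
    service_description=HTTP
    current_state=2
    plugin_output=CRITICAL - timeout (limit=10s)
    }
"""

print(json.dumps(parse_status_dat(SAMPLE), indent=2))
```

One machine readable format in, another one out; everything is kept as strings here, so a real consumer would still need to convert states and timestamps.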

Not only that, but it works really really well. We’ve contributed bits back to extend the functionality, and now our entire day to day workflow depends on it.

Would it be cool if it was built in? Maybe. Does it matter that it’s not? No. Again, pipes people. We’re using Chef as “echo” into Nagios, and then piping Nagios output into nagios-api for the output.

“Horrendous interface”

Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912. I wouldn’t argue if it was modernised slightly.

“Stupid wire format for clients”

I don’t think I’ve ever looked. Why are you looking? When was the last time NRPE broke? Maybe you have a good reason. I don’t.

“Throws away perfdata”

Again with the pipes! As Nagios logs this, we throw it into Splunk and Logstash. I admit we don’t bother doing much with it from there, as I like my graphs powered by something that was designed to graph, but a couple of times I’ve parsed the perfdata output in one of those two to get data I need.
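
For a sense of what parsing perfdata involves, here is a hedged Python sketch. It assumes the conventional plugin perfdata layout of space-separated `label=value[unit];warn;crit;min;max` items, with single quotes around labels that contain spaces; the function name and sample string are made up, and thresholds are left as strings because they can be ranges.

```python
import re
import shlex

# A number followed by an optional unit such as s, ms, %, B, MB
_VALUE = re.compile(r"^(-?\d+(?:\.\d+)?)([a-zA-Z%]*)$")


def parse_perfdata(text: str) -> dict:
    """Parse a perfdata string into {label: {value, unit, thresholds...}}."""
    metrics = {}
    for item in shlex.split(text):        # shlex handles quoted labels
        label, _, rest = item.rpartition("=")
        fields = rest.split(";")
        match = _VALUE.match(fields[0])
        if not label or not match:
            continue                      # skip anything malformed
        entry = {"value": float(match.group(1)), "unit": match.group(2)}
        for name, raw in zip(("warn", "crit", "min", "max"), fields[1:]):
            if raw:
                entry[name] = raw
        metrics[label] = entry
    return metrics


print(parse_perfdata("time=0.002s;1.0;2.0;0 'disk /'=42%;80;90"))
```

From there the values can be shipped to whatever graphing or logging system you already pipe things into.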

All singing all dancing!

In the end though, I think the theme we’re coming onto here is that Andy really wants a big monolithic thing to handle everything for him, whereas actually I’m a massive fan of using the right tool for the job. You can buy a clock radio that is also an iPod dock, mp3 player, torch, battery charger, cheese grater, but it does all those things terribly.

For example, I don’t often need the perfdata because we have Ganglia for system level metrics, Graphite for our app level metrics, and we alert on data from both of those using Nagios.

In the end, Nagios is an extremely stable, extremely customisable piece of software, which does the job of scheduling and running shell scripts and then taking that and running other shell scripts to tell someone about it incredibly well. No it doesn’t do everything. Is that a bad thing?
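
The "running scripts" contract really is that small. Below is a rough Python sketch of a check, assuming the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of output. The names `check_value` and `main` and the 80/90 thresholds are invented for illustration and are not taken from any real plugin.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_value(value, warn, crit):
    """Compare a value to thresholds; return (exit_code, status_line)."""
    if value >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif value >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after the '|' is perfdata for whoever wants to graph it
    return code, f"{label} - value is {value} | value={value};{warn};{crit}"


def main(argv):
    """Entry point: print one status line and return the exit code."""
    try:
        code, message = check_value(float(argv[1]), 80, 90)
    except (IndexError, ValueError):
        code, message = UNKNOWN, "UNKNOWN - usage: check_value.py NUMBER"
    print(message)
    return code
```

Saved as a script, it would end with `sys.exit(main(sys.argv))` so the scheduler sees the exit code.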

Murphy said this excellently:

“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

(As a side note, yes all of our Nagios instances monitor each other, no they’ve never crashed)

I will be honest; I haven’t used Sensu, because I’m in a happy place right now, but just the architectural diagram of how it works scares the shit out of me. When you need 7 arrow colours to describe where data is going in a monitoring system, I’m starting to fear it slightly. But hey, if it works, good on you guys. It just looks a lot like this. Nothing wrong with that, if you can make it stable and reliable.

Your mileage may vary

The nice thing about this world is people have choices. You may read everything I just wrote and still think Nagios is rubbish. No problem!

Certainly for us, things are working out pretty great, so Nagios will be with us for some time (drama involving monitoring plugins aside…). When we’ve hit a limit, that’ll be the next thing out the window or re-worked. But for now, long live Nagios. And it’s far from being on life support.

And, the best thing is, that doesn’t even stop Andy making something awesome. Hell, if it’s really good, maybe we’ll use it and contribute to it. But declaring Nagios as dead isn’t going to help that effort, actually. It will just alienate people. But I’m sure there are many of you who are sick of it, so please, don’t let us stop you.

Follow me on Twitter: @lozzd

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

To start less.

And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until Syslog is fully
started up and initialized. And so on.
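
To make the serialization concrete, here is a minimal Python sketch (not part of any init system) that groups the services named above into start-up "waves", using only the dependencies stated in this paragraph. The lowercase service names and the `startup_waves` helper are illustrative assumptions; real dependency graphs are larger.

```python
from graphlib import TopologicalSorter

# service -> services it must wait for (taken from the paragraph above)
DEPENDENCIES = {
    "syslog": set(),
    "dbus": {"syslog"},
    "hal": {"syslog"},
    "avahi": {"dbus", "syslog"},
    "x11": {"hal", "syslog"},
    "libvirtd": {"hal", "avahi", "syslog"},
}


def startup_waves(dependencies):
    """Group services into waves; each wave waits for the previous one."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything startable right now
        waves.append(ready)
        sorter.done(*ready)                 # pretend the whole wave is up
    return waves


for number, wave in enumerate(startup_waves(DEPENDENCIES), start=1):
    print(number, wave)
```

Even with only six services, the boot needs four sequential waves, and each wave has to wait until the slowest daemon of the previous one reports that it is ready.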

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
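
The two-step scheme described above (create the listening socket first, then hand it to the daemon across exec()) can be sketched in a few lines. This is only an illustration under assumptions, not the code of any real init system: the socket path, the inline stand-in daemon and the function name start_with_socket are all made up for the example, and the file descriptor number is simply passed as a command line argument.

```python
import os
import socket
import subprocess
import sys
import tempfile

# Hypothetical stand-in "daemon": it never calls bind() or listen(),
# it just adopts the already-listening socket whose fd it is given.
DAEMON_SOURCE = r"""
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
conn.sendall(b"reply from daemon")
conn.close()
"""


def start_with_socket():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "service.sock")

        # Step 1 (the "init system"): create the listening socket.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(8)

        # Step 2: start the daemon and let it inherit the socket.
        fd = listener.fileno()
        daemon = subprocess.Popen(
            [sys.executable, "-c", DAEMON_SOURCE, str(fd)],
            pass_fds=[fd],
        )
        listener.close()  # the daemon holds its own copy now

        # A client may connect right away. If the daemon is still
        # initializing, the connection waits in the kernel's queue.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        chunks = []
        while True:
            chunk = client.recv(1024)
            if not chunk:
                break
            chunks.append(chunk)
        client.close()
        daemon.wait()
        return b"".join(chunks)
```

Note that the client never checks whether the daemon is up. It connects to the socket and the kernel takes care of the rest. The sketch assumes a POSIX system, since it relies on AF_UNIX sockets and file descriptor inheritance.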

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself has finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus managed to catch up and process it.
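
The syslog case can be reproduced in miniature. The following is a sketch under assumptions rather than anything taken from a real syslog implementation: a datagram socket in a temporary directory stands in for /dev/log, a single process plays both the clients and the late-starting daemon, and the function name log_before_syslog_is_up is invented for the example.

```python
import os
import socket
import tempfile


def log_before_syslog_is_up(messages):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log")  # stand-in for /dev/log

        # The socket exists and is bound before any "syslog" code runs.
        log_socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        log_socket.bind(path)

        # Clients log right away and carry on; nobody is reading yet.
        for message in messages:
            client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
            client.sendto(message, path)
            client.close()

        # Later the "syslog daemon" finishes its start-up and drains
        # whatever the kernel buffered in the meantime.
        received = [log_socket.recv(4096) for _ in messages]
        log_socket.close()
        return received
```

As long as the socket buffer does not fill up, none of the senders has to wait, which is exactly the property the boot process wants to exploit.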

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start-up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening of the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this video where the launchd
folks explain what they are doing. Unfortunately this idea never
really caught on outside of the Apple camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at the serialization graphs of the boot process of
current distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that is
fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then when our file-system has finished fsck and quota
due to normal boot-up we replace it by the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right-away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has supported Control Groups (aka “cgroups”). Basically
they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups were introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use them, for example, to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
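
To give an idea of how little is needed to look up the cgroup membership of a process, here is a small sketch that parses /proc/$PID/cgroup. Each line of that file has the form hierarchy-ID:controller-list:path. The function names are made up for this example, and the sample paths in the usage note are invented as well.

```python
def parse_proc_cgroup(text):
    """Parse the content of /proc/<pid>/cgroup.

    Returns a list of (hierarchy_id, controllers, path) tuples.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path is the rest.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def cgroups_of(pid="self"):
    """Read and parse the cgroup membership of a process (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as handle:
        return parse_proc_cgroup(handle.read())
```

For a hypothetical line such as 3:cpu,cpuacct:/daemons/apache this yields (3, ["cpu", "cpuacct"], "/daemons/apache"). A supervisor can use the path to tell which service a stray process belongs to, no matter how often it has forked.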

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate’s updatedb on system interactivity.

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do similar in fashion to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns around all dependencies,
from the feet onto their head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.
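
The difference between the two directions is easy to see in a toy model. The sketch below is not Upstart's job syntax or its real behaviour. It only models the two rule styles as described above on a made-up table of requirements, and the function names needed_for and started_by_event are invented for this example.

```python
def needed_for(goal, requires):
    """Dependency style: the goal plus everything it transitively needs."""
    needed, todo = set(), [goal]
    while todo:
        unit = todo.pop()
        if unit not in needed:
            needed.add(unit)
            todo.extend(requires.get(unit, ()))
    return needed


def started_by_event(started, requires):
    """Event style: "start a when b is started", applied repeatedly."""
    running = {started}
    changed = True
    while changed:
        changed = False
        for unit, needs in requires.items():
            needs = set(needs)
            if unit not in running and needs and needs <= running:
                running.add(unit)
                changed = True
    return running


# Hypothetical "a needs b" table.
REQUIRES = {
    "dbus": {"syslog"},
    "NetworkManager": {"dbus"},
    "avahi": {"dbus"},
    "cups": {"avahi"},
}
```

With this table, needed_for("NetworkManager", REQUIRES) pulls in only NetworkManager, dbus and syslog, while started_by_event("dbus", REQUIRES) ends up starting NetworkManager, avahi and cups as well, whether anybody asked for them or not.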

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo) in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little substantially more than Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the
code.
And here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

service: these are the most obvious kind of unit:
daemons that can be started, stopped, restarted, reloaded. For
compatibility with SysV we not only support our own
configuration files for services, but also are able to read
classic SysV init scripts, in particular we parse the LSB
header, if it exists. /etc/init.d is hence not much
more than just another source of configuration.

socket: this unit encapsulates a socket in the
file-system or on the Internet. We currently support AF_INET,
AF_INET6, AF_UNIX sockets of the types stream, datagram, and
sequential packet. We also support classic FIFOs as
transport. Each socket unit has a matching
service unit, that is started if the first connection
comes in on the socket or FIFO. Example: nscd.socket
starts nscd.service on an incoming connection.

device: this unit encapsulates a device in the
Linux device tree. If a device is marked for this via udev
rules, it will be exposed as a device unit in
systemd. Properties set with udev can be used as
configuration source to set dependencies for device units.

mount: this unit encapsulates a mount point in the
file system hierarchy. systemd monitors all mount points as
they come and go, and can also be used to mount or
unmount mount-points. /etc/fstab is used here as an
additional configuration source for these mount points, similar to
how SysV init scripts can be used as additional configuration
source for service units.

automount: this unit type encapsulates an automount
point in the file system hierarchy. Each automount
unit has a matching mount unit, which is started
(i.e. mounted) as soon as the automount directory is
accessed.

target: this unit type is used for logical
grouping of units: instead of actually doing anything by itself
it simply references other units, which thereby can be controlled
together. Examples for this are: multi-user.target,
which is a target that basically plays the role of run-level 5 on
a classic SysV system, or bluetooth.target which is
requested as soon as a bluetooth dongle becomes available and
which simply pulls in bluetooth related services that otherwise
would not need to be started: bluetoothd and
obexd and suchlike.

snapshot: similar to target units
snapshots do not actually do anything themselves and their only
purpose is to reference other units. Snapshots can be used to
save/rollback the state of all services and units of the init
system. Primarily it has two intended use cases: to allow the
user to temporarily enter a specific state such as “Emergency
Shell”, terminating current services, and provide an easy way to
return to the state before, pulling up all services again that
got temporarily pulled down. And to ease support for system
suspending: still many services cannot correctly deal with
system suspend, and it is often a better idea to shut them down
before suspend, and restore them afterwards.

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.
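
The prefix rule for mounts is simple enough to sketch. What follows is an illustration of the idea, not systemd's implementation. The function name implicit_mount_dependencies is hypothetical, and the list of mount points in the usage note is invented.

```python
import posixpath


def implicit_mount_dependencies(mount_point, all_mount_points):
    """Return the mount points that are path prefixes of mount_point.

    The result is ordered from the shortest prefix to the longest,
    i.e. in the order the mounts would have to be established.
    """
    target = posixpath.normpath(mount_point)
    prefixes = []
    for candidate in all_mount_points:
        candidate = posixpath.normpath(candidate)
        if candidate == target:
            continue
        # Compare whole path components, so /hom is not a prefix of /home.
        base = candidate.rstrip("/") + "/"
        if target.startswith(base):
            prefixes.append(candidate)
    return sorted(prefixes, key=len)
```

Given the mount points /, /home, /home/lennart, /homework and /var, the mount for /home/lennart depends on / and /home, and on nothing else.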

A short list of other features:

For each process that is spawned, you may control: the
environment, resource limits, working and root directory, umask,
OOM killer adjustment, nice level, IO class and priority, CPU policy
and priority, CPU affinity, timer slack, user id, group id,
supplementary group ids, readable/writable/inaccessible
directories, shared/private/slave mount flags,
capabilities/bounding set, secure bits, CPU scheduler reset on
fork, private /tmp name-space, cgroup control for
various subsystems. Also, you can easily connect
stdin/stdout/stderr of services to syslog, /dev/kmsg,
arbitrary TTYs. If connected to a TTY for input systemd will make
sure a process gets exclusive access, optionally waiting or enforcing
it.

Every executed process gets its own cgroup (currently by
default in the debug subsystem, since that subsystem is not
otherwise used and does not do much more than the most basic
process grouping), and it is very easy to configure systemd to
place services in cgroups that have been configured externally,
for example via the libcgroups utilities.

The native configuration files use a syntax that closely
follows the well-known .desktop files. It is a simple syntax for
which parsers exist already in many software frameworks. Also, this
allows us to rely on existing tools for i18n for service
descriptions, and similar. Administrators and developers don’t
need to learn a new syntax.
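
To illustrate the point about existing parsers: a file in .desktop-like syntax can be read with the INI parser from the Python standard library. Treat this as a sketch under assumptions. The section and key names ([Unit], [Service], Description, ExecStart) are the ones known from later systemd releases and may not match the experimental syntax exactly, the binary path is made up, and load_unit is a hypothetical helper.

```python
import configparser

# Illustrative unit file; section names, keys and path are assumptions.
EXAMPLE_UNIT = """\
[Unit]
Description=Avahi mDNS/DNS-SD Stack

[Service]
ExecStart=/usr/sbin/avahi-daemon
"""


def load_unit(text):
    """Parse .desktop-style text into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of keys such as ExecStart
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}
```

A generic INI parser is only an approximation here. For instance, Python's configparser rejects a key that is repeated within a section, which a dedicated parser might well want to allow.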

As mentioned, we provide compatibility with SysV init
scripts. We take advantage of LSB and Red Hat chkconfig headers
if they are available. If they aren’t we try to make the best of
the otherwise available information, such as the start
priorities in /etc/rc.d. These init scripts are simply
considered a different source of configuration, hence an easy
upgrade path to proper systemd services is available. Optionally
we can read classic PID files for services to identify the main
pid of a daemon. Note that we make use of the dependency
information from the LSB init script headers, and translate
those into native systemd dependencies. Side note: Upstart is
unable to harvest and make use of that information. Boot-up on a
plain Upstart system with mostly LSB SysV init scripts will
hence not be parallelized, a similar system running systemd
however will. In fact, for Upstart all SysV scripts together
make one job that is executed, they are not treated
individually, again in contrast to systemd where SysV init
scripts are just another source of configuration and are all
treated and controlled individually, much like any other native
systemd service.

Similarly, we read the existing /etc/fstab
configuration file, and consider it just another source of
configuration. Using the comment= fstab option you can
even mark /etc/fstab entries to become systemd
controlled automount points.

If the same unit is configured in multiple configuration
sources (e.g. /etc/systemd/system/avahi.service exists,
and /etc/init.d/avahi too), then the native
configuration will always take precedence, the legacy format is
ignored, allowing an easy upgrade path and packages to carry
both a SysV init script and a systemd service file for a
while.

We support a simple templating/instance mechanism. Example:
instead of having six configuration files for six gettys, we
only have one getty@.service file which gets instantiated to
getty@tty2.service and suchlike. The interface part can
even be inherited by dependency expressions, i.e. it is easy to
encode that a service dhcpcd@eth0.service pulls in
avahi-autoipd@eth0.service, while leaving the
eth0 string wild-carded.
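
The instance mechanism boils down to a bit of string handling. The sketch below assumes the naming convention prefix@instance.type, with prefix@.type as the template. The unit names and all three helper functions are illustrative and not taken from the systemd sources.

```python
def split_unit_name(name):
    """Split a unit name into (template, instance, unit_type).

    For a plain unit without an instance part, the "template" is the
    name itself and the instance is None.
    """
    stem, dot, unit_type = name.rpartition(".")
    if not dot:
        raise ValueError(f"unit name without type suffix: {name!r}")
    prefix, at, instance = stem.partition("@")
    if not at:
        return name, None, unit_type
    return f"{prefix}@.{unit_type}", instance, unit_type


def instantiate(template, instance):
    """Fill the instance string into a template such as getty@.service."""
    prefix, at, rest = template.partition("@")
    if not at or not rest.startswith("."):
        raise ValueError(f"not a template name: {template!r}")
    return f"{prefix}@{instance}{rest}"


def inherit_instance(unit_name, dependency_template):
    """Resolve a wild-carded dependency using the instance of unit_name."""
    _, instance, _ = split_unit_name(unit_name)
    if instance is None:
        raise ValueError(f"{unit_name!r} has no instance to inherit")
    return instantiate(dependency_template, instance)
```

With these helpers, inherit_instance("dhcpcd@eth0.service", "avahi-autoipd@.service") yields avahi-autoipd@eth0.service without eth0 ever being spelled out in the dependency.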

For socket activation we support full compatibility with the
traditional inetd modes, as well as a very simple mode that
tries to mimic launchd socket activation and is recommended for
new services. The inetd mode only allows passing one socket to
the started daemon, while the native mode supports passing
arbitrary numbers of file descriptors. We also support one
instance per connection, as well as one instance for all
connections modes. In the former mode we name the cgroup the
daemon will be started in after the connection parameters, and
utilize the templating logic mentioned above for this. Example:
sshd.socket might spawn services
sshd@192.168.0.1-4711-192.168.0.2-22.service with a
cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
(i.e. the IP address and port numbers are used in the instance
names. For AF_UNIX sockets we use PID and user id of the
connecting client). This provides a nice way for the
administrator to identify the various instances of a daemon and
control their runtime individually. The native socket passing
mode is very easily implementable in applications: if
$LISTEN_FDS is set it contains the number of sockets
passed and the daemon will find them sorted as listed in the
.service file, starting from file descriptor 3 (a
nicely written daemon could also use fstat() and
getsockname() to identify the sockets in case it
receives more than one). In addition we set $LISTEN_PID
to the PID of the daemon that shall receive the fds, because
environment variables are normally inherited by sub-processes and
hence could confuse processes further down the chain. Even
though this socket passing logic is very simple to implement in
daemons, we will provide a BSD-licensed reference implementation
that shows how to do this. We have ported a couple of existing
daemons to this new scheme.
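
On the daemon side, the native protocol described above needs only a few lines. This is a sketch of the logic as explained in the text, not the reference implementation. The function name passed_sockets is hypothetical, and the environment and PID are taken as parameters purely to make the function easy to try out.

```python
import os

FIRST_PASSED_FD = 3  # passed sockets start right after stdin/stdout/stderr


def passed_sockets(environ=None, pid=None):
    """Return the list of file descriptors handed to this process.

    The list is empty if nothing was passed, or if $LISTEN_PID shows
    that the variables were inherited and meant for another process.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        # A variable is missing or does not contain a number.
        return []
    if count < 0:
        return []
    return list(range(FIRST_PASSED_FD, FIRST_PASSED_FD + count))
```

A daemon started with LISTEN_FDS=2 and a matching LISTEN_PID would get [3, 4] back and could wrap those descriptors in socket objects. Run by hand, without the variables set, it gets an empty list and can fall back to creating its sockets itself.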

We provide compatibility with /dev/initctl to a
certain extent. This compatibility is in fact implemented with a
FIFO-activated service, which simply translates these legacy
requests to D-Bus requests. Effectively this means the old
shutdown, poweroff and similar commands from
Upstart and sysvinit continue to work with
systemd.

We also provide compatibility with utmp and
wtmp. Possibly even to an extent that is far more
than healthy, given how crufty utmp and wtmp
are.

systemd supports several kinds of
dependencies between units. After/Before can be used to fix
the order in which units are activated. It is completely
orthogonal to Requires and Wants, which
express a positive requirement dependency, either mandatory, or
optional. Then, there is Conflicts which
expresses a negative requirement dependency. Finally, there are
three further, less used dependency types.

systemd has a minimal transaction system. Meaning: if a unit
is requested to start up or shut down we will add it and all its
dependencies to a temporary transaction. Then, we will
verify if the transaction is consistent (i.e. whether the
ordering via After/Before of all units is
cycle-free). If it is not, systemd will try to fix it up, and
removes non-essential jobs from the transaction that might
remove the loop. Also, systemd tries to suppress non-essential
jobs in the transaction that would stop a running
service. Non-essential jobs are those which the original request
did not directly include but which were pulled in by
Wants type of dependencies. Finally we check whether
the jobs of the transaction contradict jobs that have already
been queued, and optionally the transaction is aborted then. If
all worked out and the transaction is consistent and minimized
in its impact it is merged with all already outstanding jobs and
added to the run queue. Effectively this means that before
executing a requested operation, we will verify that it makes
sense, fixing it if possible, and only failing if it really cannot
work.
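
The consistency check on the ordering can be illustrated with a small sketch in C. This is not systemd's actual code: the fixed-size matrix, the struct ordering type and the function names are assumptions chosen to keep the example short. It only shows the general technique of finding a loop in After/Before ordering with a depth-first search.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_UNITS 16

/* after[a][b] is true when unit a is ordered after unit b. */
struct ordering {
    size_t n_units;
    bool after[MAX_UNITS][MAX_UNITS];
};

enum mark { UNVISITED, IN_PROGRESS, DONE };

/* Depth-first walk; returns true if a loop is reachable from `unit`. */
bool visit_unit(const struct ordering *o, size_t unit, enum mark *marks)
{
    marks[unit] = IN_PROGRESS;
    for (size_t dep = 0; dep < o->n_units; dep++) {
        if (!o->after[unit][dep])
            continue;
        if (marks[dep] == IN_PROGRESS)
            return true;    /* back edge: the ordering contains a loop */
        if (marks[dep] == UNVISITED && visit_unit(o, dep, marks))
            return true;
    }
    marks[unit] = DONE;
    return false;
}

/* Returns true if the ordering of the transaction is not cycle-free. */
bool ordering_has_cycle(const struct ordering *o)
{
    enum mark marks[MAX_UNITS] = { UNVISITED };

    assert(o->n_units <= MAX_UNITS);
    for (size_t u = 0; u < o->n_units; u++)
        if (marks[u] == UNVISITED && visit_unit(o, u, marks))
            return true;
    return false;
}
```

In a scheme like the one described above, a detected loop would be the trigger for dropping non-essential jobs and checking again, rather than for failing right away.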

We record start/exit time as well as the PID and exit status
of every process we spawn and supervise. This data can be used
to cross-link daemons with their data in abrtd, auditd and
syslog. Think of a UI that will highlight crashed daemons for
you, and allows you to easily navigate to the respective UIs for
syslog, abrt, and auditd that will show the data generated from
and for this daemon on a specific run.

We support reexecution of the init process itself at any
time. The daemon state is serialized before the reexecution and
deserialized afterwards. That way we provide a simple way to
facilitate init system upgrades as well as handover from an
initrd daemon to the final daemon. Open sockets and autofs
mounts are properly serialized away, so that they stay
connectible all the time, in a way that clients will not even
notice that the init system reexecuted itself. Also, the fact
that a big part of the service state is encoded anyway in the
cgroup virtual file system would even allow us to resume
execution without access to the serialization data. The
reexecution code paths are actually mostly the same as the init
system configuration reloading code paths, which
guarantees that reexecution (which is probably more seldom
triggered) gets similar testing as reloading (which is probably
more common).

Starting the work of removing shell scripts from the boot
process we have recoded part of the basic system setup in C and
moved it directly into systemd. Among that is mounting of the API
file systems (i.e. virtual file systems such as /proc,
/sys and /dev) and setting of the
host-name.

Server state is introspectable and controllable via
D-Bus. This is not complete yet but quite extensive.

While we want to emphasize socket-based and bus-name-based
activation, and we hence support dependencies between sockets and
services, we also support traditional inter-service
dependencies. We support multiple ways in which such a service can
signal its readiness: by forking and having the start process
exit (i.e. traditional daemonize() behaviour), as well
as by watching the bus until a configured service name appears.

There’s an interactive mode which asks for confirmation each
time a process is spawned by systemd. You may enable it by
passing systemd.confirm_spawn=1 on the kernel command
line.

With the systemd.default= kernel command line
parameter you can specify which unit systemd should start on
boot-up. Normally you’d specify something like
multi-user.target here, but another choice could even
be a single service instead of a target, for example
out-of-the-box we ship a service emergency.service that
is similar in its usefulness as init=/bin/bash, however
has the advantage of actually running the init system, hence
offering the option to boot up the full system from the
emergency shell.

There’s a minimal UI that allows you to
start/stop/introspect services. It’s far from complete but
useful as a debugging tool. It’s written in Vala (yay!) and goes
by the name of systemadm.

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services yet. Thankfully most distributions don’t
carry too many native Upstart services yet.)

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot their normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starts services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events. (i.e. “start this 5h after it last ran” as well as “start
this every monday 5 am”)

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem set of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, you can already run systemd just fine
as a normal user, and it will detect that it is run that way.
Support for this mode has been available since the very beginning,
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined
place where we can put user sockets. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our
repository. In addition, to have something to start with, here’s
a tarball with unit configuration files that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

We ask daemon writers not to fork or even double fork
in their processes, but run their event loop from the initial process
systemd starts for you. Also, don’t call setsid().

Don’t drop user privileges in the daemon itself, leave this
to systemd and configure it in systemd service configuration
files. (There are exceptions here. For example, for some daemons
there are good reasons to drop privileges inside the daemon
code, after an initialization phase that requires elevated
privileges.)

Don’t write PID files

Grab a name on the bus

You may rely on systemd for logging, you are welcome to log
whatever you need to log to stderr.

Let systemd create and watch sockets for you, so that socket
activation works. Hence, interpret $LISTEN_FDS and
$LISTEN_PID as described above.

Use SIGTERM for requesting shut downs from your daemon.
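
As a rough illustration of the points above, the following C sketch shows a daemon that stays in its initial process, logs to stderr and shuts down on SIGTERM. It is only a skeleton under assumptions: the function names are made up for this example, and sigsuspend() stands in for whatever real event loop the daemon would run.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested = 0;

static void on_shutdown_signal(int signo)
{
    (void)signo;
    shutdown_requested = 1;
}

/* Installs the shutdown handler for `signo`; returns 0 on success. */
int install_shutdown_handler(int signo)
{
    struct sigaction sa;

    assert(signo > 0);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_shutdown_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

int shutdown_was_requested(void)
{
    return shutdown_requested != 0;
}

/* Runs in the process the init system started: no fork(), no setsid(),
 * no PID file, no privilege dropping. */
int run_daemon(void)
{
    sigset_t block, old;

    if (install_shutdown_handler(SIGTERM) < 0) {
        perror("sigaction");
        return 1;
    }

    /* Keep SIGTERM blocked except while waiting, so that a signal arriving
     * between the flag check and the wait is not lost. */
    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    if (sigprocmask(SIG_BLOCK, &block, &old) < 0) {
        perror("sigprocmask");
        return 1;
    }

    fprintf(stderr, "daemon: started\n");
    while (!shutdown_requested)
        sigsuspend(&old);    /* stand-in for the real event loop */

    sigprocmask(SIG_SETMASK, &old, NULL);
    fprintf(stderr, "daemon: SIGTERM received, shutting down\n");
    return 0;
}
```

A main() function would simply return run_daemon(); everything else (user, sockets, logging destination) is left to the service configuration.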

The list above is very similar to what Apple
recommends for daemons compatible with launchd. It should be
easy to extend daemons that already support launchd
activation to support systemd activation as well.

Note that systemd supports daemons not written in this style
perfectly as well, already for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have
written proof-of-concept patches, and the porting turned out
to be very easy. Also, we can leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?

Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (Formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.

Is this a Red Hat project?

No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.

Will this come to Fedora?

If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.

Will this come to OpenSUSE?

Kay’s pursuing that, so something similar as for Fedora applies here, too.

Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?

That’s up to them. We’d certainly welcome their interest, and help with the integration.

Why didn’t you just add this to Upstart, why did you invent something new?

Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.

If you love Apple launchd so much, why not adopt that?

launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.

Is this an NIH project?

Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.

Don’t forget that it is Upstart that includes
a library called NIH
(which is kind of a reimplementation of glib) — not systemd!

Will this run on [insert non-Linux OS here]?

Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), a port to other operating systems appears to us as not
making a lot of sense. Also, we, the people involved are
unlikely to be interested in merging possible ports to other
platforms and work with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.

Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.

If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.

I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?

Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (Very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers, packagers, as well as folks who are interested
in writing documentation, or contributing a logo.

Community

At this time we only have a source code
repository and an IRC channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our GIT repository has moved.

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

  • To start less.
  • And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similarly for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
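
The init-system side of this scheme might be sketched in C as follows. This is an illustration under assumptions, not systemd's code: the function names are hypothetical, only a single AF_UNIX stream socket is passed, and the environment variables follow the $LISTEN_FDS/$LISTEN_PID convention used by systemd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

#define LISTEN_FDS_START 3

/* Creates a listening AF_UNIX stream socket at `path`; returns fd or -1. */
int make_listening_socket(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (strlen(path) >= sizeof(addr.sun_path))
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);
    unlink(path);    /* remove a stale socket file from an earlier run */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Forks and executes argv[0] with listen_fd installed as descriptor 3.
 * Returns the child's PID in the parent, or -1 if fork() failed. */
pid_t spawn_with_socket(int listen_fd, char *const argv[])
{
    char pid_str[32];
    pid_t pid;

    assert(listen_fd >= 0);
    pid = fork();
    if (pid != 0)
        return pid;

    /* Child: move the socket to the agreed descriptor number. */
    if (listen_fd != LISTEN_FDS_START) {
        if (dup2(listen_fd, LISTEN_FDS_START) < 0)
            _exit(127);
        close(listen_fd);
    }
    /* The PID does not change across exec(), so it can be set here. */
    snprintf(pid_str, sizeof(pid_str), "%ld", (long)getpid());
    if (setenv("LISTEN_PID", pid_str, 1) < 0 ||
        setenv("LISTEN_FDS", "1", 1) < 0)
        _exit(127);
    execv(argv[0], argv);
    _exit(127);    /* only reached if execv() failed */
}

/* Waits for the child; returns its exit code, or -1 if it did not exit
 * normally. */
int wait_exit_status(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the socket is already listening before the daemon is executed, clients can connect as soon as make_listening_socket() returns, no matter how long the daemon takes to start.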

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself has finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus managed to catch up and process it.
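
The buffering behaviour in the syslog example can be tried out with a small C sketch. It assumes a datagram AF_UNIX socket at a made-up path instead of the real /dev/log, and the function names are hypothetical; the point is only that messages sent before the logger reads anything are kept in the socket buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

static int fill_addr(struct sockaddr_un *addr, const char *path)
{
    if (strlen(path) >= sizeof(addr->sun_path))
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sun_family = AF_UNIX;
    strcpy(addr->sun_path, path);
    return 0;
}

/* Init side: create the socket early, before the logger is running. */
int bind_log_socket(const char *path)
{
    struct sockaddr_un addr;
    int fd;

    if (fill_addr(&addr, path) < 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    unlink(path);    /* remove a stale socket file from an earlier run */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Client side: send one message; returns 0 on success, -1 on failure. */
int send_log(const char *path, const char *msg)
{
    struct sockaddr_un addr;
    ssize_t n;
    int fd;

    assert(msg != NULL);
    if (fill_addr(&addr, path) < 0)
        return -1;
    fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    n = sendto(fd, msg, strlen(msg), 0,
               (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
    return n == (ssize_t)strlen(msg) ? 0 : -1;
}

/* Logger side, once it is up: dequeue the next message into buf and
 * NUL-terminate it. Returns the message length, or -1 on error. */
ssize_t read_log(int fd, char *buf, size_t size)
{
    ssize_t n;

    assert(size > 0);
    n = recv(fd, buf, size - 1, 0);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

Calling send_log() a few times right after bind_log_socket(), and read_log() only afterwards, returns the messages in the order they were sent, as long as the buffer has not run full in between.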

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening of the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this
video where the launchd folks explain what they are
doing. Unfortunately this idea never really caught on outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at the serialization graphs of the boot process
of current distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that is
fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then when our file-system has finished fsck and quota
due to normal boot-up we replace it by the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right-away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.
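
To get a feel for numbers like these on your own machine, a rough count can be scripted. The following is a minimal sketch, not the method used for the figures above: the names count_tool_calls and count_in_directory and the tool list are made up for illustration, and a plain whole-word match also counts mentions in comments, so it tends to overestimate.

```python
import re
from collections import Counter
from pathlib import Path

TOOLS = ("grep", "awk", "cut", "sed")  # illustrative selection, not exhaustive


def count_tool_calls(script_text, tools=TOOLS):
    """Count whole-word occurrences of each tool name in a script's text."""
    counts = Counter()
    for tool in tools:
        # Reject matches glued to other word characters (e.g. "egrep"),
        # but accept a leading path such as /bin/grep.
        pattern = r"(?<![\w.-])" + re.escape(tool) + r"(?![\w.-])"
        counts[tool] = len(re.findall(pattern, script_text))
    return counts


def count_in_directory(directory="/etc/init.d", tools=TOOLS):
    """Sum the per-file counts over every readable regular file in a directory."""
    total = Counter()
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        try:
            total += count_tool_calls(path.read_text(errors="replace"), tools)
        except OSError:
            continue  # unreadable file: skip it
    return total
```

Calling count_in_directory() on a SysV-style system gives a lower-level view of the same problem: every counted occurrence is a candidate for a fork, an exec and a dynamic-linker run at boot.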

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has known Control Groups (aka “cgroups”). Basically they
allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups were introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use them for example to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
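
As a small illustration of how accessible this information is, here is a sketch that reads and parses /proc/$PID/cgroup. It assumes the kernel's documented line format of hierarchy-ID:controller-list:cgroup-path; the function names are invented for this example.

```python
from pathlib import Path


def parse_cgroup_text(text):
    """Parse /proc/<pid>/cgroup content.

    Each non-empty line has the form "hierarchy-ID:controller-list:path".
    Returns a list of (hierarchy_id, [controllers], path) tuples.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the first two colons only; the path may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def cgroups_of(pid="self"):
    """Return the parsed cgroup membership of a process (Linux only)."""
    return parse_cgroup_text(Path(f"/proc/{pid}/cgroup").read_text())
```

A supervisor could call cgroups_of(pid) for any process it encounters and map it back to the service that originally spawned it, however often the process has forked since.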

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate’s updatedb on system interactivity.
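
For the curious, this is roughly what such a call looks like. Python has no wrapper for ioprio_set() in its standard library, so the sketch below goes through the raw syscall via ctypes. The constants mirror the kernel's ioprio definitions; the syscall number 251 is an assumption that only holds on x86-64, and the function names are made up for this example.

```python
import ctypes
import os

IOPRIO_CLASS_SHIFT = 13      # class is stored above the 13 data bits
IOPRIO_CLASS_IDLE = 3        # only gets I/O time when nobody else wants it
IOPRIO_WHO_PROCESS = 1       # "which" argument: a single process
SYS_IOPRIO_SET_X86_64 = 251  # assumption: valid on x86-64 only


def ioprio_value(io_class, data=0):
    """Combine an I/O scheduling class and class data into one priority value."""
    return (io_class << IOPRIO_CLASS_SHIFT) | data


def set_idle_io_priority(pid=0, syscall_nr=SYS_IOPRIO_SET_X86_64):
    """Put a process (0 = the caller) into the idle I/O class."""
    libc = ctypes.CDLL(None, use_errno=True)
    value = ioprio_value(IOPRIO_CLASS_IDLE)
    if libc.syscall(syscall_nr, IOPRIO_WHO_PROCESS, pid, value) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A babysitter would apply a setting like this in the child after fork() and before exec(), so that the service never runs with the default priority.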

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do similar in fashion to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns around all dependencies,
from the feet onto their head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it simpler: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.
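
The difference is easy to show in a few lines. Below is a minimal sketch of goal-driven resolution over explicit “a needs b” rules; it is an illustration of the principle only, not code from any init system, and the dependency table is a made-up example.

```python
def units_to_start(goal, requires):
    """Return the goal and everything it transitively needs, dependencies first."""
    order = []
    visiting = set()

    def visit(unit):
        if unit in order:
            return
        if unit in visiting:
            raise ValueError(f"dependency cycle involving {unit}")
        visiting.add(unit)
        for dep in requires.get(unit, ()):
            visit(dep)
        visiting.discard(unit)
        order.append(unit)

    visit(goal)
    return order


# Hypothetical "a needs b" table, following the example in the text.
REQUIRES = {
    "NetworkManager": ["dbus"],
    "dbus": ["syslog"],
    "avahi": ["dbus"],
}
```

Asking for NetworkManager yields syslog, dbus and NetworkManager, in that order. Asking for dbus yields only syslog and dbus: nothing that merely could use D-Bus gets started, and the table itself stays available at runtime to answer the question of why something was started.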

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo) in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little substantially more than Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the code. And
here’s a quick rundown of its features, and the
rationale behind them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

  1. service: these are the most obvious kind of unit:
    daemons that can be started, stopped, restarted, reloaded. For
    compatibility with SysV we not only support our own
    configuration files for services, but also are able to read
    classic SysV init scripts, in particular we parse the LSB
    header, if it exists. /etc/init.d is hence not much
    more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the
    file-system or on the Internet. We currently support AF_INET,
    AF_INET6, AF_UNIX sockets of the types stream, datagram, and
    sequential packet. We also support classic FIFOs as
    transport. Each socket unit has a matching
    service unit, that is started if the first connection
    comes in on the socket or FIFO. Example: nscd.socket
    starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the
    Linux device tree. If a device is marked for this via udev
    rules, it will be exposed as a device unit in
    systemd. Properties set with udev can be used as
    configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the
    file system hierarchy. systemd monitors all mount points as
    they come and go, and can also be used to mount or
    unmount mount-points. /etc/fstab is used here as an
    additional configuration source for these mount points, similar to
    how SysV init scripts can be used as additional configuration
    source for service units.
  5. automount: this unit type encapsulates an automount
    point in the file system hierarchy. Each automount
    unit has a matching mount unit, which is started
    (i.e. mounted) as soon as the automount directory is
    accessed.
  6. target: this unit type is used for logical
    grouping of units: instead of actually doing anything by itself
    it simply references other units, which thereby can be controlled
    together. Examples for this are: multi-user.target,
    which is a target that basically plays the role of run-level 5 on
    a classic SysV system, or bluetooth.target which is
    requested as soon as a bluetooth dongle becomes available and
    which simply pulls in bluetooth related services that otherwise
    would not need to be started: bluetoothd and
    obexd and suchlike.
  7. snapshot: similar to target units
    snapshots do not actually do anything themselves and their only
    purpose is to reference other units. Snapshots can be used to
    save/rollback the state of all services and units of the init
    system. Primarily it has two intended use cases: to allow the
    user to temporarily enter a specific state such as “Emergency
    Shell”, terminating current services, and provide an easy way to
    return to the state before, pulling up all services again that
    got temporarily pulled down. And to ease support for system
    suspending: still many services cannot correctly deal with
    system suspend, and it is often a better idea to shut them down
    before suspend, and restore them afterwards.
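
Because unit names are file names, the type of a unit can be read straight off its suffix. The sketch below shows that idea, together with the pairing of socket units to service units and automount units to mount units described in the list above. The function names are invented, and the set of types is just the seven listed here.

```python
UNIT_TYPES = ("service", "socket", "device", "mount",
              "automount", "target", "snapshot")

# Which unit type gets activated on behalf of which, per the list above.
ACTIVATES = {"socket": "service", "automount": "mount"}


def split_unit_name(file_name):
    """Split "avahi.service" into ("avahi", "service")."""
    name, dot, suffix = file_name.rpartition(".")
    if not dot or not name or suffix not in UNIT_TYPES:
        raise ValueError(f"not a recognised unit name: {file_name!r}")
    return name, suffix


def matching_unit(file_name):
    """Return the unit started by a socket or automount unit, else None."""
    name, unit_type = split_unit_name(file_name)
    target_type = ACTIVATES.get(unit_type)
    return f"{name}.{target_type}" if target_type else None
```

With this, matching_unit("nscd.socket") gives "nscd.service", which is exactly the example from the socket entry above.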

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.
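
The prefix rule for mounts can be sketched as follows. This is an illustration under the assumption that prefixes are compared per path component (so /homework does not depend on /home); it is not taken from the systemd sources, and the function name is made up.

```python
import posixpath


def implicit_mount_deps(mount_point, known_mounts):
    """Return the known mount points that are ancestors of mount_point.

    The result is ordered from the outermost ancestor to the innermost,
    which is also the order in which they would have to be mounted.
    """
    path = posixpath.normpath(mount_point)
    deps = []
    while path != "/":
        path = posixpath.dirname(path)  # step up one path component
        if path in known_mounts:
            deps.append(path)
    return deps[::-1]
```

For /home/lennart and the known mounts /, /home and /var, this returns / and /home, in that order.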

A short list of other features:

  1. For each process that is spawned, you may control: the
    environment, resource limits, working and root directory, umask,
    OOM killer adjustment, nice level, IO class and priority, CPU policy
    and priority, CPU affinity, timer slack, user id, group id,
    supplementary group ids, readable/writable/inaccessible
    directories, shared/private/slave mount flags,
    capabilities/bounding set, secure bits, CPU scheduler reset on
    fork, private /tmp name-space, cgroup control for
    various subsystems. Also, you can easily connect
    stdin/stdout/stderr of services to syslog, /dev/kmsg,
    arbitrary TTYs. If connected to a TTY for input systemd will make
    sure a process gets exclusive access, optionally waiting or enforcing
    it.
  2. Every executed process gets its own cgroup (currently by
    default in the debug subsystem, since that subsystem is not
    otherwise used and does not much more than the most basic
    process grouping), and it is very easy to configure systemd to
    place services in cgroups that have been configured externally,
    for example via the libcgroups utilities.
  3. The native configuration files use a syntax that closely
    follows the well-known .desktop files. It is a simple syntax for
    which parsers exist already in many software frameworks. Also, this
    allows us to rely on existing tools for i18n for service
    descriptions, and similar. Administrators and developers don’t
    need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init
    scripts. We take advantage of LSB and Red Hat chkconfig headers
    if they are available. If they aren’t we try to make the best of
    the otherwise available information, such as the start
    priorities in /etc/rc.d. These init scripts are simply
    considered a different source of configuration, hence an easy
    upgrade path to proper systemd services is available. Optionally
    we can read classic PID files for services to identify the main
    pid of a daemon. Note that we make use of the dependency
    information from the LSB init script headers, and translate
    those into native systemd dependencies. Side note: Upstart is
    unable to harvest and make use of that information. Boot-up on a
    plain Upstart system with mostly LSB SysV init scripts will
    hence not be parallelized, a similar system running systemd
    however will. In fact, for Upstart all SysV scripts together
    make one job that is executed, they are not treated
    individually, again in contrast to systemd where SysV init
    scripts are just another source of configuration and are all
    treated and controlled individually, much like any other native
    systemd service.
  5. Similarly, we read the existing /etc/fstab
    configuration file, and consider it just another source of
    configuration. Using the comment= fstab option you can
    even mark /etc/fstab entries to become systemd
    controlled automount points.
  6. If the same unit is configured in multiple configuration
    sources (e.g. /etc/systemd/system/avahi.service exists,
    and /etc/init.d/avahi too), then the native
    configuration will always take precedence, the legacy format is
    ignored, allowing an easy upgrade path and packages to carry
    both a SysV init script and a systemd service file for a
    while.
  7. We support a simple templating/instance mechanism. Example:
    instead of having six configuration files for six gettys, we
    only have one getty@.service file which gets instantiated to
    getty@tty2.service and suchlike. The interface part can
    even be inherited by dependency expressions, i.e. it is easy to
    encode that a service dhcpcd@eth0.service pulls in
    avahi-autoipd@eth0.service, while leaving the
    eth0 string wild-carded.
  8. For socket activation we support full compatibility with the
    traditional inetd modes, as well as a very simple mode that
    tries to mimic launchd socket activation and is recommended for
    new services. The inetd mode only allows passing one socket to
    the started daemon, while the native mode supports passing
    arbitrary numbers of file descriptors. We also support one
    instance per connection, as well as one instance for all
    connections modes. In the former mode we name the cgroup the
    daemon will be started in after the connection parameters, and
    utilize the templating logic mentioned above for this. Example:
    sshd.socket might spawn services
    sshd@192.168.0.1-4711-192.168.0.2-22.service with a
    cgroup of sshd@.service/192.168.0.1-4711-192.168.0.2-22
    (i.e. the IP address and port numbers are used in the instance
    names. For AF_UNIX sockets we use PID and user id of the
    connecting client). This provides a nice way for the
    administrator to identify the various instances of a daemon and
    control their runtime individually. The native socket passing
    mode is very easily implementable in applications: if
    $LISTEN_FDS is set it contains the number of sockets
    passed and the daemon will find them sorted as listed in the
    .service file, starting from file descriptor 3 (a
    nicely written daemon could also use fstat() and
    getsockname() to identify the sockets in case it
    receives more than one). In addition we set $LISTEN_PID
    to the PID of the daemon that shall receive the fds, because
    environment variables are normally inherited by sub-processes and
    hence could confuse processes further down the chain. Even
    though this socket passing logic is very simple to implement in
    daemons, we will provide a BSD-licensed reference implementation
    that shows how to do this. We have ported a couple of existing
    daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a
    certain extent. This compatibility is in fact implemented with a
    FIFO-activated service, which simply translates these legacy
    requests to D-Bus requests. Effectively this means the old
    shutdown, poweroff and similar commands from
    Upstart and sysvinit continue to work with
    systemd.
  10. We also provide compatibility with utmp and
    wtmp. Possibly even to an extent that is far more
    than healthy, given how crufty utmp and wtmp
    are.
  11. systemd supports several kinds of
    dependencies between units. After/Before can be used to fix
    the order in which units are activated. It is completely
    orthogonal to Requires and Wants, which
    express a positive requirement dependency, either mandatory, or
    optional. Then, there is Conflicts which
    expresses a negative requirement dependency. Finally, there are
    three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit
    is requested to start up or shut down we will add it and all its
    dependencies to a temporary transaction. Then, we will
    verify if the transaction is consistent (i.e. whether the
    ordering via After/Before of all units is
    cycle-free). If it is not, systemd will try to fix it up, and
    removes non-essential jobs from the transaction that might
    remove the loop. Also, systemd tries to suppress non-essential
    jobs in the transaction that would stop a running
    service. Non-essential jobs are those which the original request
    did not directly include but which were pulled in by
    Wants type of dependencies. Finally we check whether
    the jobs of the transaction contradict jobs that have already
    been queued, and optionally the transaction is aborted then. If
    all worked out and the transaction is consistent and minimized
    in its impact it is merged with all already outstanding jobs and
    added to the run queue. Effectively this means that before
    executing a requested operation, we will verify that it makes
    sense, fixing it if possible, and only failing if it really cannot
    work.
  13. We record start/exit time as well as the PID and exit status
    of every process we spawn and supervise. This data can be used
    to cross-link daemons with their data in abrtd, auditd and
    syslog. Think of a UI that will highlight crashed daemons for
    you, and allows you to easily navigate to the respective UIs for
    syslog, abrt, and auditd that will show the data generated from
    and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any
    time. The daemon state is serialized before the reexecution and
    deserialized afterwards. That way we provide a simple way to
    facilitate init system upgrades as well as handover from an
    initrd daemon to the final daemon. Open sockets and autofs
    mounts are properly serialized away, so that they stay
    connectible all the time, in a way that clients will not even
    notice that the init system reexecuted itself. Also, the fact
    that a big part of the service state is encoded anyway in the
    cgroup virtual file system would even allow us to resume
    execution without access to the serialization data. The
    reexecution code paths are actually mostly the same as the init
    system configuration reloading code paths, which
    guarantees that reexecution (which is probably more seldom
    triggered) gets similar testing as reloading (which is probably
    more common).
  15. Starting the work of removing shell scripts from the boot
    process we have recoded part of the basic system setup in C and
    moved it directly into systemd. Among that is mounting of the API
    file systems (i.e. virtual file systems such as /proc,
    /sys and /dev.) and setting of the
    host-name.
  16. Server state is introspectable and controllable via
    D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based
    activation, and we hence support dependencies between sockets and
    services, we also support traditional inter-service
    dependencies. We support multiple ways in which such a service can
    signal its readiness: by forking and having the start process
    exit (i.e. traditional daemonize() behaviour), as well
    as by watching the bus until a configured service name appears.
  18. There’s an interactive mode which asks for confirmation each
    time a process is spawned by systemd. You may enable it by
    passing systemd.confirm_spawn=1 on the kernel command
    line.
  19. With the systemd.default= kernel command line
    parameter you can specify which unit systemd should start on
    boot-up. Normally you’d specify something like
    multi-user.target here, but another choice could even
    be a single service instead of a target, for example
    out-of-the-box we ship a service emergency.service that
    is similar in its usefulness as init=/bin/bash, however
    has the advantage of actually running the init system, hence
    offering the option to boot up the full system from the
    emergency shell.
  20. There’s a minimal UI that allows you to
    start/stop/introspect services. It’s far from complete but
    useful as a debugging tool. It’s written in Vala (yay!) and goes
    by the name of systemadm.
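
To show how little a daemon needs to do for the native socket passing mode described in the list above, here is a minimal sketch of the receiving side. It follows the $LISTEN_FDS and $LISTEN_PID rules as stated there and is not the promised reference implementation; the function names and the constant name are chosen for this example only.

```python
import os
import socket

FIRST_PASSED_FD = 3  # passed descriptors start right after stdin/stdout/stderr


def listen_fds(environ=None, pid=None):
    """Return the file descriptor numbers passed to this process, if any.

    The descriptors only belong to us if $LISTEN_PID names our own PID;
    otherwise the variables were merely inherited from a parent process.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables unset or malformed: not socket activated
    if count < 0:
        return []
    return list(range(FIRST_PASSED_FD, FIRST_PASSED_FD + count))


def open_passed_sockets():
    """Wrap each passed descriptor in a socket object (Python 3.7 or later,
    where family and type are detected from the descriptor itself)."""
    return [socket.socket(fileno=fd) for fd in listen_fds()]
```

A daemon that gets an empty list back simply falls back to creating and binding its sockets itself, so the same binary keeps working on systems without socket activation.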

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.
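
Coming back to the transaction check from the feature list for a moment: verifying that the After/Before ordering is cycle-free is a plain topological sort. The sketch below uses Python's graphlib module (Python 3.9 or later) to show the check itself. It deliberately leaves out the repair step of dropping non-essential jobs, and the function names and unit names are illustrative only.

```python
from graphlib import CycleError, TopologicalSorter


def activation_order(after):
    """Return one valid start order.

    `after` maps each unit to the units it must be started after.
    Raises graphlib.CycleError if the ordering contains a loop.
    """
    return list(TopologicalSorter(after).static_order())


def is_cycle_free(after):
    """True if the ordering constraints can all be satisfied."""
    try:
        activation_order(after)
    except CycleError:
        return False
    return True
```

For a simple chain where NetworkManager.service is ordered after dbus.service, and dbus.service after syslog.service, activation_order() returns the three units in start order; two units that are each ordered after the other make is_cycle_free() return False.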

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services yet. Thankfully most distributions don’t
carry too many native Upstart services yet.)

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot their normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starts services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events. (i.e. “start this 5h after it last ran” as well as “start
this every Monday 5 am”)

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem set of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, already, you can run systemd just fine
as a normal user, and it will detect that it is run that way.
Support for this mode has been available since the very beginning,
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined
place where we can put user sockets. None of these issues are
really essential for systemd, but they’d certainly improve
things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our
repository. In addition, to have something to start with, here’s
a tarball with unit configuration files that allows an
otherwise unmodified Fedora 13 system to work with systemd. We
have no RPMs to offer you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

[Screenshot: systemadm]

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

[Screenshot: ps]

That’s an excerpt of the output of ps xaf -eo
pid,user,args,cgroup showing how neatly the processes are
sorted into the cgroup of their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

  • We ask daemon writers not to fork or even double fork
    in their processes, but run their event loop from the initial process
    systemd starts for you. Also, don’t call setsid().
  • Don’t drop user privileges in the daemon itself, leave this
    to systemd and configure it in systemd service configuration
    files. (There are exceptions here. For example, for some daemons
    there are good reasons to drop privileges inside the daemon
    code, after an initialization phase that requires elevated
    privileges.)
  • Don’t write PID files.
  • Grab a name on the bus.
  • You may rely on systemd for logging, you are welcome to log
    whatever you need to log to stderr.
  • Let systemd create and watch sockets for you, so that socket
    activation works. Hence, interpret $LISTEN_FDS and
    $LISTEN_PID as described above.
  • Use SIGTERM for requesting shut downs from your daemon.
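
To make the socket-activation item a bit more tangible, here is a
minimal Python sketch of the receiving side. It assumes that the
passed descriptors start at file descriptor 3 and that $LISTEN_PID
names the process they are meant for. The function name
inherited_listen_fds is made up for this illustration, and a real
daemon would more likely do the same thing in C.

```python
import os

# Assumption: descriptors passed by the init system start at fd 3
# (0, 1 and 2 are stdin, stdout and stderr).
LISTEN_FDS_START = 3


def inherited_listen_fds(environ=None, pid=None):
    """Return the socket fds handed to this process, or [] if none."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        # $LISTEN_PID guards against a child inheriting the variables
        # by accident: only the addressed process may use the sockets.
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        # Variables missing or malformed: not socket activated.
        return []
    if count <= 0:
        return []
    return list(range(LISTEN_FDS_START, LISTEN_FDS_START + count))
```

A daemon written this way would wrap each returned number with
socket.socket(fileno=fd) and go straight to accept(), falling back
to creating its own sockets when the list is empty.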

The list above is very similar to what Apple
recommends for daemons compatible with launchd. It should be
easy to extend daemons that already support launchd
activation to support systemd activation as well.

Note that systemd also supports daemons not written in this style
perfectly well, if only for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have
written proof-of-concept patches, and the porting turned out
to be very easy. Also, we can leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?
Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
the result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (Formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.
Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.
Will this come to Fedora?
If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.
Will this come to OpenSUSE?
Kay’s pursuing that, so something similar as for Fedora applies here, too.
Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That’s up to them. We’d certainly welcome their interest, and help with the integration.
Why didn’t you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.
If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.
Is this an NIH project?
Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.
Don’t forget that it is Upstart that includes
a library called NIH
(which is kind of a reimplementation of glib) — not systemd!
Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), so a port to other operating systems appears to us as not
making a lot of sense. Also, we, the people involved, are
unlikely to be interested in merging possible ports to other
platforms and work with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.
If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.
I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers, packagers, as well as folks who are interested
in writing documentation, or contributing a logo.

Community

At this time we only have a source code
repository and an IRC channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our GIT repository has moved.

Rethinking PID 1

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/systemd.html

If you are well connected or good at reading between the lines
you might already know what this blog post is about. But even then
you may find this story interesting. So grab a cup of coffee,
sit down, and read what’s coming.

This blog story is long, so even though I can only recommend
reading the long story, here’s the one sentence summary: we are
experimenting with a new init system and it is fun.

Here’s the code. And here’s the story:

Process Identifier 1

On every Unix system there is one process with the special
process identifier 1. It is started by the kernel before all other
processes and is the parent process for all those other processes
that have nobody else to be child of. Due to that it can do a lot
of stuff that other processes cannot do. And it is also
responsible for some things that other processes are not
responsible for, such as bringing up and maintaining userspace
during boot.

Historically on Linux the software acting as PID 1 was the
venerable sysvinit package, though it had been showing its age for
quite a while. Many replacements have been suggested, only one of
them really took off: Upstart, which has by now found
its way into all major distributions.

As mentioned, the central responsibility of an init system is
to bring up userspace. And a good init system does that
fast. Unfortunately, the traditional SysV init system was not
particularly fast.

For a fast and efficient boot-up two things are crucial:

  • To start less.
  • And to start more in parallel.

What does that mean? Starting less means starting fewer
services or deferring the starting of services until they are
actually needed. There are some services where we know that they
will be required sooner or later (syslog, D-Bus system bus, etc.),
but for many others this isn’t the case. For example, bluetoothd
does not need to be running unless a bluetooth dongle is actually
plugged in or an application wants to talk to its D-Bus
interfaces. Same for a printing system: unless the machine
physically is connected to a printer, or an application wants to
print something, there is no need to run a printing daemon such as
CUPS. Avahi: if the machine is not connected to a
network, there is no need to run Avahi, unless some application wants
to use its APIs. And even SSH: as long as nobody wants to contact
your machine there is no need to run it, as long as it is then
started on the first connection. (And admit it, on most machines
where sshd might be listening somebody connects to it only every
other month or so.)

Starting more in parallel means that if we have
to run something, we should not serialize its start-up (as sysvinit
does), but run it all at the same time, so that the available
CPU and disk IO bandwidth is maxed out, and hence
the overall start-up time minimized.

Hardware and Software Change Dynamically

Modern systems (especially general purpose OS) are highly
dynamic in their configuration and use: they are mobile, different
applications are started and stopped, different hardware added and
removed again. An init system that is responsible for maintaining
services needs to listen to hardware and software
changes. It needs to dynamically start (and sometimes stop)
services as they are needed to run a program or enable some
hardware.

Most current systems that try to parallelize boot-up still
synchronize the start-up of the various daemons involved: since
Avahi needs D-Bus, D-Bus is started first, and only when D-Bus
signals that it is ready, Avahi is started too. Similar for other
services: libvirtd and X11 need HAL (well, I am considering the
Fedora 13 services here, ignore that HAL is obsolete), hence HAL
is started first, before libvirtd and X11 are started. And
libvirtd also needs Avahi, so it waits for Avahi too. And all of
them require syslog, so they all wait until Syslog is fully
started up and initialized. And so on.

Parallelizing Socket Services

This kind of start-up synchronization results in the
serialization of a significant part of the boot process. Wouldn’t
it be great if we could get rid of the synchronization and
serialization cost? Well, we can, actually. For that, we need to
understand what exactly the daemons require from each other, and
why their start-up is delayed. For traditional Unix daemons,
there’s one answer to it: they wait until the socket the other
daemon offers its services on is ready for connections. Usually
that is an AF_UNIX socket in the file-system, but it could be
AF_INET[6], too. For example, clients of D-Bus wait until
/var/run/dbus/system_bus_socket can be connected to,
clients of syslog wait for /dev/log, clients of CUPS wait
for /var/run/cups/cups.sock and NFS mounts wait for
/var/run/rpcbind.sock and the portmapper IP port, and so
on. And think about it, this is actually the only thing they wait
for!

Now, if that’s all they are waiting for, if we manage to make
those sockets available for connection earlier and only actually
wait for that instead of the full daemon start-up, then we can
speed up the entire boot and start more processes in parallel. So,
how can we do that? Actually quite easily in Unix-like systems: we
can create the listening sockets before we actually start
the daemon, and then just pass the socket during exec()
to it. That way, we can create all sockets for all
daemons in one step in the init system, and then in a second step
run all daemons at once. If a service needs another, and it is not
fully started up, that’s completely OK: what will happen is that
the connection is queued in the providing service and the client
will potentially block on that single request. But only that one
client will block and only on that one request. Also, dependencies
between services will no longer necessarily have to be configured
to allow proper parallelized start-up: if we start all sockets at
once and a service needs another it can be sure that it can
connect to its socket.
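
The queueing behaviour described above is easy to try out. The
following Python sketch plays all three roles in one process: the
"init system" creates the listening socket, a client connects and
sends before anybody has called accept(), and only then does the
"daemon" pick up the connection. The socket path and the function
name are made up for this illustration, and a real init system
would of course hand the socket to a separate daemon process.

```python
import os
import socket
import tempfile


def queue_before_daemon_starts(message=b"hello"):
    """Show that a client can connect and send before anyone accept()s."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "service.sock")  # hypothetical socket path

        # Step 1: the "init system" creates the listening socket up front.
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(16)

        # Step 2: a client starts "too early". connect() and sendall()
        # still succeed: the kernel queues the connection and buffers
        # the data until somebody is ready to read it.
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)
        client.sendall(message)

        # Step 3: the "daemon" finally gets going and drains the queue.
        conn, _ = listener.accept()
        received = conn.recv(len(message))

        for sock in (conn, client, listener):
            sock.close()
        return received
```

Note that the client never had to know whether the service was
already running. Only a client that waits for a reply would block,
and only until the daemon reaches its accept() call.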

Because this is at the core of what is following, let me say
this again, with different words and by example: if you start
syslog and various syslog clients at the same time, what will
happen in the scheme pointed out above is that the messages of the
clients will be added to the /dev/log socket buffer. As
long as that buffer doesn’t run full, the clients will not have to
wait in any way and can immediately proceed with their start-up. As
soon as syslog itself has finished start-up, it will dequeue all
messages and process them. Another example: we start D-Bus and
several clients at the same time. If a synchronous bus
request is sent and hence a reply expected, what will happen is
that the client will have to block, however only that one client
and only until D-Bus managed to catch up and process it.

Basically, the kernel socket buffers help us to maximize
parallelization, and the ordering and synchronization is done by
the kernel, without any further management from userspace! And if
all the sockets are available before the daemons actually start-up,
dependency management also becomes redundant (or at least
secondary): if a daemon needs another daemon, it will just connect
to it. If the other daemon is already started, this will
immediately succeed. If it isn’t started but in the process of
being started, the first daemon will not even have to wait for it,
unless it issues a synchronous request. And even if the other
daemon is not running at all, it can be auto-spawned. From the
first daemon’s perspective there is no difference, hence dependency
management becomes mostly unnecessary or at least secondary, and
all of this in optimal parallelization and optionally with
on-demand loading. On top of this, this is also more robust, because
the sockets stay available regardless of whether the actual daemons
might temporarily become unavailable (maybe due to crashing). In
fact, you can easily write a daemon with this that can run, and
exit (or crash), and run again and exit again (and so on), and all
of that without the clients noticing or losing any request.

It’s a good time for a pause, go and refill your coffee mug,
and be assured, there is more interesting stuff following.

But first, let’s clear a few things up: is this kind of logic
new? No, it certainly is not. The most prominent system that works
like this is Apple’s launchd system: on MacOS the listening of the
sockets is pulled out of all daemons and done by launchd. The
services themselves hence can all start up in parallel and
dependencies need not be configured for them. And that is
actually a really ingenious design, and the primary reason why
MacOS manages to provide the fantastic boot-up times it
provides. I can highly recommend this
video where the launchd folks explain what they are
doing. Unfortunately this idea never really caught on outside of the Apple
camp.

The idea is actually even older than launchd. Prior to launchd
the venerable inetd worked much like this: sockets were
centrally created in a daemon that would start the actual service
daemons passing the socket file descriptors during
exec(). However the focus of inetd certainly
wasn’t local services, but Internet services (although later
reimplementations supported AF_UNIX sockets, too). It also wasn’t a
tool to parallelize boot-up or even useful for getting implicit
dependencies right.

For TCP sockets inetd was primarily used in a way that
for every incoming connection a new daemon instance was
spawned. That meant that for each connection a new
process was spawned and initialized, which is not a
recipe for high-performance servers. However, right from the
beginning inetd also supported another mode, where a
single daemon was spawned on the first connection, and that single
instance would then go on and also accept the follow-up connections
(that’s what the wait/nowait option in
inetd.conf was for, a particularly badly documented
option, unfortunately.) Per-connection daemon starts probably gave
inetd its bad reputation for being slow. But that’s not entirely
fair.
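
For illustration, here is a small Python sketch of the inetd-style
hand-over: the parent creates the listening socket and passes the
file descriptor across exec() to a child, which only ever calls
accept() on it. This corresponds to the mode where a single daemon
instance receives the listening socket itself. The socket path,
the names and the toy "upper-case echo" service are invented for
this example and are not how inetd itself is implemented.

```python
import os
import socket
import subprocess
import sys
import tempfile

# The child stands in for a service daemon: it never calls bind() or
# listen(), it only accept()s on the descriptor number given in argv.
CHILD = """
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
conn, _ = listener.accept()
conn.sendall(conn.recv(64).upper())
conn.close()
"""


def spawn_with_socket(payload=b"ping"):
    """Create a socket, exec a child that serves it, return its reply."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "echo.sock")  # hypothetical socket path
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        listener.bind(path)
        listener.listen(1)

        # pass_fds keeps the descriptor open across fork() and exec(),
        # under the same number it has in the parent.
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD, str(listener.fileno())],
            pass_fds=[listener.fileno()],
        )
        listener.close()  # a supervisor could keep it open for restarts

        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.settimeout(10)
        client.connect(path)
        client.sendall(payload)
        reply = client.recv(64)
        client.close()
        child.wait()
        return reply
```

In this sketch the parent closes its copy of the socket after the
hand-over. A supervisor that wants to restart a crashed service
without clients noticing would keep its copy open instead.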

Parallelizing Bus Services

Modern daemons on Linux tend to provide services via D-Bus
instead of plain AF_UNIX sockets. Now, the question is, for those
services, can we apply the same parallelizing boot logic as for
traditional socket services? Yes, we can, D-Bus already has all
the right hooks for it: using bus activation a service can be
started the first time it is accessed. Bus activation also gives
us the minimal per-request synchronisation we need for starting up
the providers and the consumers of D-Bus services at the same
time: if we want to start Avahi at the same time as CUPS (side
note: CUPS uses Avahi to browse for mDNS/DNS-SD printers), then we
can simply run them at the same time, and if CUPS is quicker than
Avahi via the bus activation logic we can get D-Bus to queue the
request until Avahi manages to establish its service name.

So, in summary: the socket-based service activation and the
bus-based service activation together enable us to start
all daemons in parallel, without any further
synchronization. Activation also allows us to do lazy-loading of
services: if a service is rarely used, we can just load it the
first time somebody accesses the socket or bus name, instead of
starting it during boot.

And if that’s not great, then I don’t know what is
great!

Parallelizing File System Jobs

If you look at
the serialization graphs of the boot process of current
distributions, there are more synchronisation points than just
daemon start-ups: most prominently there are file-system related
jobs: mounting, fscking, quota. Right now, on boot-up a lot of
time is spent idling to wait until all devices that are listed in
/etc/fstab show up in the device tree and are then
fsck’ed, mounted, quota checked (if enabled). Only after that is
fully finished do we go on and boot the actual services.

Can we improve this? It turns out we can. Harald Hoyer came up
with the idea of using the venerable autofs system for this:

Just like a connect() call shows that a service is
interested in another service, an open() (or a similar
call) shows that a service is interested in a specific file or
file-system. So, in order to improve how much we can parallelize
we can make those apps wait only if a file-system they are looking
for is not yet mounted and readily available: we set up an autofs
mount point, and then when our file-system finished fsck and quota
due to normal boot-up we replace it by the real mount. While the
file-system is not ready yet, the access will be queued by the
kernel and the accessing process will block, but only that one
daemon and only that one access. And this way we can begin
starting our daemons even before all file systems have been fully
made available — without them missing any files, and maximizing
parallelization.

Parallelizing file system jobs and service jobs does
not make sense for /, after all that’s where the service
binaries are usually stored. However, for file-systems such as
/home, that usually are bigger, even encrypted, possibly
remote and seldom accessed by the usual boot-up daemons, this
can improve boot time considerably. It is probably not necessary
to mention this, but virtual file systems, such as
procfs or sysfs should never be mounted via autofs.

I wouldn’t be surprised if some readers might find integrating
autofs in an init system a bit fragile and even weird, and maybe
more on the “crackish” side of things. However, having played
around with this extensively I can tell you that this actually
feels quite right. Using autofs here simply means that we can
create a mount point without having to provide the backing file
system right-away. In effect it hence only delays accesses. If an
application tries to access an autofs file-system and we take very
long to replace it with the real file-system, it will hang in an
interruptible sleep, meaning that you can safely cancel it, for
example via C-c. Also note that at any point, if the mount point
should not be mountable in the end (maybe because fsck failed), we
can just tell autofs to return a clean error code (like
ENOENT). So, I guess what I want to say is that even though
integrating autofs into an init system might appear adventurous at
first, our experimental code has shown that this idea works
surprisingly well in practice — if it is done for the right
reasons and the right way.

Also note that these should be direct autofs
mounts, meaning that from an application perspective there’s
little effective difference between a classic mount point and one
based on autofs.

Keeping the First User PID Small

Another thing we can learn from the MacOS boot-up logic is
that shell scripts are evil. Shell is fast and shell is slow. It
is fast to hack, but slow in execution. The classic sysvinit boot
logic is modelled around shell scripts. Whether it is
/bin/bash or any other shell (that was written to make
shell scripts faster), in the end the approach is doomed to be
slow. On my system the scripts in /etc/init.d call
grep at least 77 times. awk is called 92
times, cut 23 and sed 74. Every time those
commands (and others) are called, a process is spawned, the
libraries searched, some start-up stuff like i18n and so on set up
and more. And then after seldom doing more than a trivial string
operation the process is terminated again. Of course, that has to
be incredibly slow. No other language but shell would do something like
that. On top of that, shell scripts are also very fragile, and
change their behaviour drastically based on environment variables
and suchlike, stuff that is hard to oversee and control.
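
If you want to get a rough feeling for this on your own machine,
a few lines of Python are enough. The sketch below counts how
often the names of some external tools appear in the scripts of a
directory. This is an assumption about how one might arrive at
such figures, not the method used for the numbers above, and it is
a static count of mentions rather than a count of processes
actually spawned during boot. The function name is made up.

```python
import re
from collections import Counter
from pathlib import Path

TOOLS = ("grep", "awk", "cut", "sed")


def count_tool_mentions(directory, tools=TOOLS):
    """Count whole-word occurrences of each tool name in the scripts."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, tools)) + r")\b")
    counts = Counter({tool: 0 for tool in tools})
    for script in sorted(Path(directory).iterdir()):
        if not script.is_file():
            continue
        text = script.read_text(errors="replace")
        for line in text.splitlines():
            # Crude heuristic: drop everything after the first '#',
            # so that comments are not counted as invocations.
            code = line.split("#", 1)[0]
            counts.update(pattern.findall(code))
    return dict(counts)
```

Calling count_tool_mentions("/etc/init.d") on a SysV-style system
gives a lower bound at best, since loops and helper functions can
run the same command many times.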

So, let’s get rid of shell scripts in the boot process! Before
we can do that we need to figure out what they are currently
actually used for: well, the big picture is that most of the time,
what they do is actually quite boring. Most of the scripting is
spent on trivial setup and tear-down of services, and should be
rewritten in C, either in separate executables, or moved into the
daemons themselves, or simply be done in the init system.

It is not likely that we can get rid of shell scripts during
system boot-up entirely anytime soon. Rewriting them in C takes
time, in a few cases does not really make sense, and sometimes
shell scripts are just too handy to do without. But we can
certainly make them less prominent.

A good metric for measuring shell script infestation of the
boot process is the PID number of the first process you can start
after the system is fully booted up. Boot up, log in, open a
terminal, and type echo $$. Try that on your Linux
system, and then compare the result with MacOS! (Hint, it’s
something like this: Linux PID 1823; MacOS PID 154, measured on
test systems we own.)

Keeping Track of Processes

A central part of a system that starts up and maintains
services should be process babysitting: it should watch
services. Restart them if they shut down. If they crash it should
collect information about them, and keep it around for the
administrator, and cross-link that information with what is
available from crash dump systems such as abrt, and in logging
systems like syslog or the audit system.

It should also be capable of shutting down a service
completely. That might sound easy, but is harder than you
think. Traditionally on Unix a process that does double-forking
can escape the supervision of its parent, and the old parent will
not learn about the relation of the new process to the one it
actually started. An example: currently, a misbehaving CGI script
that has double-forked is not terminated when you shut down
Apache. Furthermore, you will not even be able to figure out its
relation to Apache, unless you know it by name and purpose.

So, how can we keep track of processes, so that they cannot
escape the babysitter, and that we can control them as one unit
even if they fork a gazillion times?

Different people came up with different solutions for this. I
am not going into much detail here, but let’s at least say that
approaches based on ptrace or the netlink connector (a kernel
interface which allows you to get a netlink message each time any
process on the system fork()s or exit()s) that some people have
investigated and implemented, have been criticised as ugly and not
very scalable.

So what can we do about this? Well, for quite a while now the
kernel has known Control Groups
(aka “cgroups”). Basically they allow the creation of a
hierarchy of groups of processes. The hierarchy is directly
exposed in a virtual file-system, and hence easily accessible. The
group names are basically directory names in that file-system. If
a process belonging to a specific cgroup fork()s, its child will
become a member of the same group. Unless it is privileged and has
access to the cgroup file system it cannot escape its
group. Originally, cgroups have been introduced into the kernel
for the purpose of containers: certain kernel subsystems can
enforce limits on resources of certain groups, such as limiting
CPU or memory usage. Traditional resource limits (as implemented
by setrlimit()) are (mostly) per-process. cgroups on the
other hand let you enforce limits on entire groups of
processes. cgroups are also useful to enforce limits outside of
the immediate container use case. You can use it for example to
limit the total amount of memory or CPU Apache and all its
children may use. Then, a misbehaving CGI script can no longer
escape your setrlimit() resource control by simply
forking away.

In addition to container and resource limit enforcement cgroups
are very useful to keep track of daemons: cgroup membership is
securely inherited by child processes, they cannot escape. There’s
a notification system available so that a supervisor process can
be notified when a cgroup runs empty. You can find the cgroups of
a process by reading /proc/$PID/cgroup. cgroups hence
make a very good choice to keep track of processes for babysitting
purposes.
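
As a small illustration of how accessible this information is,
the following Python sketch reads and parses /proc/$PID/cgroup.
It assumes the line format hierarchy-id:controllers:path. The
function names and the sample path in the usage note are
hypothetical.

```python
def parse_cgroup_file(text):
    """Map each controller list to the cgroup path named in the file.

    Each non-empty line is expected to look like
    hierarchy-id:controllers:path
    """
    groups = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so that a ':' inside the path survives.
        _hierarchy, controllers, path = line.split(":", 2)
        groups[controllers] = path
    return groups


def cgroups_of(pid="self"):
    """Return the cgroups of a process, e.g. cgroups_of(1)."""
    with open(f"/proc/{pid}/cgroup") as handle:
        return parse_cgroup_file(handle.read())
```

Given a made-up line such as 3:debug:/example/demo.service, the
parser returns {"debug": "/example/demo.service"}, which is all a
babysitter needs to map a stray process back to its service.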

Controlling the Process Execution Environment

A good babysitter should not only oversee and control when a
daemon starts, ends or crashes, but also set up a good, minimal,
and secure working environment for it.

That means setting obvious process parameters such as the
setrlimit() resource limits, user/group IDs or the
environment block, but does not end there. The Linux kernel gives
users and administrators a lot of control over processes (some of
it is rarely used, currently). For each process you can set CPU
and IO scheduler controls, the capability bounding set, CPU
affinity or of course cgroup environments with additional limits,
and more.
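
To give an idea of what setting up such an environment involves,
here is a minimal Python sketch that starts a command with a fixed
environment block, a lowered file descriptor limit and a reduced
CPU priority. The function name and default values are invented
for this example. It covers only a small part of the controls
listed above and is not how systemd itself does it.

```python
import os
import resource
import subprocess


def run_confined(argv, env, max_open_files=64, niceness=5):
    """Run argv with a fixed environment, fd limit and CPU priority."""

    def setup():
        # Runs in the child between fork() and exec(), so the limits
        # apply to the service only and not to the supervisor.
        _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if hard == resource.RLIM_INFINITY:
            soft = max_open_files
        else:
            soft = min(max_open_files, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
        os.nice(niceness)

    # env replaces the environment block entirely: the child sees
    # nothing of the supervisor's own variables.
    return subprocess.run(
        argv,
        env=env,
        preexec_fn=setup,
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
```

For example, run_confined(["/bin/sh", "-c", "ulimit -n"], {}) lets
the shell report the lowered limit. User and group IDs, CPU
affinity or capabilities would be set in the same place, between
fork() and exec().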

As an example, ioprio_set() with
IOPRIO_CLASS_IDLE is a great way to minimize the effect
of locate’s updatedb on system interactivity.

On top of that certain high-level controls can be very useful,
such as setting up read-only file system overlays based on
read-only bind mounts. That way one can run certain daemons so
that all (or some) file systems appear read-only to them, so that
EROFS is returned on every write request. As such this can be used
to lock down what daemons can do similar in fashion to a poor
man’s SELinux policy system (but this certainly doesn’t replace
SELinux, don’t get any bad ideas, please).

Finally logging is an important part of executing services:
ideally every bit of output a service generates should be logged
away. An init system should hence provide logging to daemons it
spawns right from the beginning, and connect stdout and stderr to
syslog or in some cases even /dev/kmsg which in many
cases makes a very useful replacement for syslog (embedded folks,
listen up!), especially in times where the kernel log buffer is
configured ridiculously large out-of-the-box.

On Upstart

To begin with, let me emphasize that I actually like the code
of Upstart, it is very well commented and easy to
follow. It’s certainly something other projects should learn
from (including my own).

That being said, I can’t say I agree with the general approach
of Upstart. But first, a bit more about the project:

Upstart does not share code with sysvinit, and its
functionality is a super-set of it, and provides compatibility to
some degree with the well known SysV init scripts. Its main
feature is its event-based approach: starting and stopping of
processes is bound to “events” happening in the system, where an
“event” can be a lot of different things, such as: a network
interface becomes available or some other software has been
started.

Upstart does service serialization via these events: if the
syslog-started event is triggered this is used as an
indication to start D-Bus since it can now make use of Syslog. And
then, when dbus-started is triggered,
NetworkManager is started, since it may now use
D-Bus, and so on.

One could say that this way the actual logical dependency tree
that exists and is understood by the admin or developer is
translated and encoded into event and action rules: every logical
“a needs b” rule that the administrator/developer is aware of
becomes a “start a when b is started” plus “stop a when b is
stopped”. In some way this certainly is a simplification:
especially for the code in Upstart itself. However I would argue
that this simplification is actually detrimental. First of all,
the logical dependency system does not go away, the person who is
writing Upstart files must now translate the dependencies manually
into these event/action rules (actually, two rules for each
dependency). So, instead of letting the computer figure out what
to do based on the dependencies, the user has to manually
translate the dependencies into simple event/action rules. Also,
because the dependency information has never been encoded it is
not available at runtime, effectively meaning that an
administrator who tries to figure out why something
happened, i.e. why a is started when b is started, has no chance
of finding that out.

Furthermore, the event logic turns around all dependencies,
from the feet onto their head. Instead of minimizing the
amount of work (which is something that a good init system should
focus on, as pointed out in the beginning of this blog story), it
actually maximizes the amount of work to do during
operations. Or in other words, instead of having a clear goal and
only doing the things it really needs to do to reach the goal, it
does one step, and then after finishing it, it does all
steps that possibly could follow it.

Or to put it more simply: the fact that the user just started D-Bus
is in no way an indication that NetworkManager should be started
too (but this is what Upstart would do). It’s right the other way
round: when the user asks for NetworkManager, that is definitely
an indication that D-Bus should be started too (which is certainly
what most users would expect, right?).

A good init system should start only what is needed, and that
on-demand. Either lazily or parallelized and in advance. However
it should not start more than necessary, particularly not
everything installed that could use that service.
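
The difference between the two directions can be sketched in a few lines of Python. This is a deliberately simplified model, not how Upstart or any other init system is actually implemented; the REQUIRES table and the unit names in it are made up for illustration.

```python
# Hypothetical "a needs b" table, for illustration only.
REQUIRES = {
    "NetworkManager": ["dbus"],
    "avahi": ["dbus"],
    "dbus": ["syslog"],
    "syslog": [],
}


def needed_for(goal, requires):
    """Dependency-driven: start only what the goal transitively needs,
    with dependencies ordered before their users."""
    order = []
    seen = set()

    def visit(unit):
        if unit in seen:
            return
        seen.add(unit)
        for dep in requires.get(unit, []):
            visit(dep)
        order.append(unit)

    visit(goal)
    return order


def triggered_by(started, requires):
    """Event-driven: once `started` is up, start everything that could
    use it, then everything that could use those, and so on."""
    order = []
    queue = [started]
    while queue:
        unit = queue.pop(0)
        if unit in order:
            continue
        order.append(unit)
        # sorted() only keeps this illustration deterministic.
        queue.extend(sorted(u for u, deps in requires.items() if unit in deps))
    return order
```

With this table, asking for dbus in the dependency-driven model starts syslog and dbus only, while the event-driven model reacts to dbus by also starting NetworkManager and avahi, whether or not anybody asked for them.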

Finally, I fail to see the actual usefulness of the event
logic. It appears to me that most events that are exposed in
Upstart actually are not punctual in nature, but have duration: a
service starts, is running, and stops. A device is plugged in, is
available, and is plugged out again. A mount point is in the
process of being mounted, is fully mounted, or is being
unmounted. A power plug is plugged in, the system runs on AC, and
the power plug is pulled. Only a minority of the events an init
system or process supervisor should handle are actually punctual,
most of them are tuples of start, condition, and stop. This
information is again not available in Upstart, because it focuses
on singular events, and ignores durable dependencies.

Now, I am aware that some of the issues I pointed out above are
in some way mitigated by certain more recent changes in Upstart,
particularly condition based syntaxes such as start on
(local-filesystems and net-device-up IFACE=lo) in Upstart
rule files. However, to me this appears mostly as an attempt to
fix a system whose core design is flawed.

Besides that Upstart does OK for babysitting daemons, even though
some choices might be questionable (see above), and there are certainly a lot
of missed opportunities (see above, too).

There are other init systems besides sysvinit, Upstart and
launchd. Most of them offer little substantially more than Upstart or
sysvinit. The most interesting other contender is Solaris SMF,
which supports proper dependencies between services. However, in
many ways it is overly complex and, let’s say, a bit academic
with its excessive use of XML and new terminology for known
things. It is also closely bound to Solaris specific features such
as the contract system.

Putting it All Together: systemd

Well, this is another good time for a little pause, because
after I have hopefully explained above what I think a good PID 1
should be doing and what the current most used system does, we’ll
now come to where the beef is. So, go and refill your coffee mug
again. It’s going to be worth it.

You probably guessed it: what I suggested above as requirements
and features for an ideal init system is actually available now,
in a (still experimental) init system called systemd, and
which I hereby want to announce. Again, here’s the code. And
here’s a quick rundown of its features, and the rationale behind
them:

systemd starts up and supervises the entire system (hence the
name…). It implements all of the features pointed out above and
a few more. It is based around the notion of units. Units
have a name and a type. Since their configuration is usually
loaded directly from the file system, these unit names are
actually file names. Example: a unit avahi.service is
read from a configuration file by the same name, and of course
could be a unit encapsulating the Avahi daemon. There are several
kinds of units:

  1. service: these are the most obvious kind of unit:
    daemons that can be started, stopped, restarted, reloaded. For
    compatibility with SysV we not only support our own
    configuration files for services, but also are able to read
    classic SysV init scripts, in particular we parse the LSB
    header, if it exists. /etc/init.d is hence not much
    more than just another source of configuration.
  2. socket: this unit encapsulates a socket in the
    file-system or on the Internet. We currently support AF_INET,
    AF_INET6, AF_UNIX sockets of the types stream, datagram, and
    sequential packet. We also support classic FIFOs as
    transport. Each socket unit has a matching
    service unit, that is started if the first connection
    comes in on the socket or FIFO. Example: nscd.socket
    starts nscd.service on an incoming connection.
  3. device: this unit encapsulates a device in the
    Linux device tree. If a device is marked for this via udev
    rules, it will be exposed as a device unit in
    systemd. Properties set with udev can be used as
    configuration source to set dependencies for device units.
  4. mount: this unit encapsulates a mount point in the
    file system hierarchy. systemd monitors all mount points as
    they come and go, and can also be used to mount or
    unmount mount-points. /etc/fstab is used here as an
    additional configuration source for these mount points, similar to
    how SysV init scripts can be used as additional configuration
    source for service units.
  5. automount: this unit type encapsulates an automount
    point in the file system hierarchy. Each automount
    unit has a matching mount unit, which is started
    (i.e. mounted) as soon as the automount directory is
    accessed.
  6. target: this unit type is used for logical
    grouping of units: instead of actually doing anything by itself
    it simply references other units, which thereby can be controlled
    together. Examples for this are: multi-user.target,
    which is a target that basically plays the role of run-level 5 on
    a classic SysV system, or bluetooth.target which is
    requested as soon as a bluetooth dongle becomes available and
    which simply pulls in bluetooth related services that otherwise
    would not need to be started: bluetoothd and
    obexd and suchlike.
  7. snapshot: similar to target units
    snapshots do not actually do anything themselves and their only
    purpose is to reference other units. Snapshots can be used to
    save/rollback the state of all services and units of the init
    system. Primarily it has two intended use cases: to allow the
    user to temporarily enter a specific state such as “Emergency
    Shell”, terminating current services, and provide an easy way to
    return to the state before, pulling up all services again that
    got temporarily pulled down. And to ease support for system
    suspending: still many services cannot correctly deal with
    system suspend, and it is often a better idea to shut them down
    before suspend, and restore them afterwards.

All these units can have dependencies between each other (both
positive and negative, i.e. ‘Requires’ and ‘Conflicts’): a device
can have a dependency on a service, meaning that as soon as a
device becomes available a certain service is started. Mounts get
an implicit dependency on the device they are mounted from. Mounts
also get implicit dependencies on mounts that are their prefixes
(i.e. a mount /home/lennart implicitly gets a dependency
added to the mount for /home) and so on.
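
The prefix rule for mounts is easy to picture in code. The following Python sketch is not taken from systemd; the function name is made up, and treating the root mount as a prefix of everything is an assumption of this example.

```python
import posixpath


def implicit_mount_deps(mount_point, known_mounts):
    """Return the known mount points that are proper path prefixes of
    mount_point, outermost first."""
    target = posixpath.normpath(mount_point)
    deps = []
    for candidate in known_mounts:
        cand = posixpath.normpath(candidate)
        if cand == target:
            continue
        # Compare whole path components, so /homework is no prefix of /home.
        prefix = cand if cand.endswith("/") else cand + "/"
        if target.startswith(prefix):
            deps.append(cand)
    return sorted(deps, key=len)
```

Given the mounts /, /home, /home/lennart and /homework, the mount for /home/lennart would get dependencies on / and /home, but not on /homework.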

A short list of other features:

  1. For each process that is spawned, you may control: the
    environment, resource limits, working and root directory, umask,
    OOM killer adjustment, nice level, IO class and priority, CPU policy
    and priority, CPU affinity, timer slack, user id, group id,
    supplementary group ids, readable/writable/inaccessible
    directories, shared/private/slave mount flags,
    capabilities/bounding set, secure bits, CPU scheduler reset on
    fork, private /tmp name-space, cgroup control for
    various subsystems. Also, you can easily connect
    stdin/stdout/stderr of services to syslog, /dev/kmsg,
    arbitrary TTYs. If connected to a TTY for input systemd will make
    sure a process gets exclusive access, optionally waiting or enforcing
    it.
  2. Every executed process gets its own cgroup (currently by
    default in the debug subsystem, since that subsystem is not
    otherwise used and does not much more than the most basic
    process grouping), and it is very easy to configure systemd to
    place services in cgroups that have been configured externally,
    for example via the libcgroup utilities.
  3. The native configuration files use a syntax that closely
    follows the well-known .desktop files. It is a simple syntax for
    which parsers exist already in many software frameworks. Also, this
    allows us to rely on existing tools for i18n for service
    descriptions, and similar. Administrators and developers don’t
    need to learn a new syntax.
  4. As mentioned, we provide compatibility with SysV init
    scripts. We take advantage of LSB and Red Hat chkconfig headers
    if they are available. If they aren’t we try to make the best of
    the otherwise available information, such as the start
    priorities in /etc/rc.d. These init scripts are simply
    considered a different source of configuration, hence an easy
    upgrade path to proper systemd services is available. Optionally
    we can read classic PID files for services to identify the main
    pid of a daemon. Note that we make use of the dependency
    information from the LSB init script headers, and translate
    those into native systemd dependencies. Side note: Upstart is
    unable to harvest and make use of that information. Boot-up on a
    plain Upstart system with mostly LSB SysV init scripts will
    hence not be parallelized, a similar system running systemd
    however will. In fact, for Upstart all SysV scripts together
    make one job that is executed, they are not treated
    individually, again in contrast to systemd where SysV init
    scripts are just another source of configuration and are all
    treated and controlled individually, much like any other native
    systemd service.
  5. Similarly, we read the existing /etc/fstab
    configuration file, and consider it just another source of
    configuration. Using the comment= fstab option you can
    even mark /etc/fstab entries to become systemd
    controlled automount points.
  6. If the same unit is configured in multiple configuration
    sources (e.g. /etc/systemd/system/avahi.service exists,
    and /etc/init.d/avahi too), then the native
    configuration will always take precedence, the legacy format is
    ignored, allowing an easy upgrade path and packages to carry
    both a SysV init script and a systemd service file for a
    while.
  7. We support a simple templating/instance mechanism. Example:
    instead of having six configuration files for six gettys, we
    only have one [email protected] file which gets instantiated to
    [email protected] and suchlike. The interface part can
    even be inherited by dependency expressions, i.e. it is easy to
    encode that a service [email protected] pulls in
    [email protected], while leaving the
    eth0 string wild-carded.
  8. For socket activation we support full compatibility with the
    traditional inetd modes, as well as a very simple mode that
    tries to mimic launchd socket activation and is recommended for
    new services. The inetd mode only allows passing one socket to
    the started daemon, while the native mode supports passing
    arbitrary numbers of file descriptors. We also support one
    instance per connection, as well as one instance for all
    connections modes. In the former mode we name the cgroup the
    daemon will be started in after the connection parameters, and
    utilize the templating logic mentioned above for this. Example:
    sshd.socket might spawn services
    [email protected] with a
    cgroup of [email protected]/192.168.0.1-4711-192.168.0.2-22
    (i.e. the IP address and port numbers are used in the instance
    names. For AF_UNIX sockets we use PID and user id of the
    connecting client). This provides a nice way for the
    administrator to identify the various instances of a daemon and
    control their runtime individually. The native socket passing
    mode is very easily implementable in applications: if
    $LISTEN_FDS is set it contains the number of sockets
    passed and the daemon will find them sorted as listed in the
    .service file, starting from file descriptor 3 (a
    nicely written daemon could also use fstat() and
    getsockname() to identify the sockets in case it
    receives more than one). In addition we set $LISTEN_PID
    to the PID of the daemon that shall receive the fds, because
    environment variables are normally inherited by sub-processes and
    hence could confuse processes further down the chain. Even
    though this socket passing logic is very simple to implement in
    daemons, we will provide a BSD-licensed reference implementation
    that shows how to do this. We have ported a couple of existing
    daemons to this new scheme.
  9. We provide compatibility with /dev/initctl to a
    certain extent. This compatibility is in fact implemented with a
    FIFO-activated service, which simply translates these legacy
    requests to D-Bus requests. Effectively this means the old
    shutdown, poweroff and similar commands from
    Upstart and sysvinit continue to work with
    systemd.
  10. We also provide compatibility with utmp and
    wtmp. Possibly even to an extent that is far more
    than healthy, given how crufty utmp and wtmp
    are.
  11. systemd supports several kinds of
    dependencies between units. After/Before can be used to fix
    the ordering how units are activated. It is completely
    orthogonal to Requires and Wants, which
    express a positive requirement dependency, either mandatory, or
    optional. Then, there is Conflicts which
    expresses a negative requirement dependency. Finally, there are
    three further, less used dependency types.
  12. systemd has a minimal transaction system. Meaning: if a unit
    is requested to start up or shut down we will add it and all its
    dependencies to a temporary transaction. Then, we will
    verify if the transaction is consistent (i.e. whether the
    ordering via After/Before of all units is
    cycle-free). If it is not, systemd will try to fix it up, and
    removes non-essential jobs from the transaction that might
    remove the loop. Also, systemd tries to suppress non-essential
    jobs in the transaction that would stop a running
    service. Non-essential jobs are those which the original request
    did not directly include but which were pulled in by
    Wants type of dependencies. Finally we check whether
    the jobs of the transaction contradict jobs that have already
    been queued, and optionally the transaction is aborted then. If
    all worked out and the transaction is consistent and minimized
    in its impact it is merged with all already outstanding jobs and
    added to the run queue. Effectively this means that before
    executing a requested operation, we will verify that it makes
    sense, fixing it if possible, and only failing if it really cannot
    work.
  13. We record start/exit time as well as the PID and exit status
    of every process we spawn and supervise. This data can be used
    to cross-link daemons with their data in abrtd, auditd and
    syslog. Think of a UI that will highlight crashed daemons for
    you, and allows you to easily navigate to the respective UIs for
    syslog, abrt, and auditd that will show the data generated from
    and for this daemon on a specific run.
  14. We support reexecution of the init process itself at any
    time. The daemon state is serialized before the reexecution and
    deserialized afterwards. That way we provide a simple way to
    facilitate init system upgrades as well as handover from an
    initrd daemon to the final daemon. Open sockets and autofs
    mounts are properly serialized away, so that they stay
    connectible all the time, in a way that clients will not even
    notice that the init system reexecuted itself. Also, the fact
    that a big part of the service state is encoded anyway in the
    cgroup virtual file system would even allow us to resume
    execution without access to the serialization data. The
    reexecution code paths are actually mostly the same as the init
    system configuration reloading code paths, which
    guarantees that reexecution (which is probably more seldom
    triggered) gets similar testing as reloading (which is probably
    more common).
  15. Starting the work of removing shell scripts from the boot
    process we have recoded part of the basic system setup in C and
    moved it directly into systemd. Among that is mounting of the API
    file systems (i.e. virtual file systems such as /proc,
    /sys and /dev) and setting of the
    host-name.
  16. Server state is introspectable and controllable via
    D-Bus. This is not complete yet but quite extensive.
  17. While we want to emphasize socket-based and bus-name-based
    activation, and we hence support dependencies between sockets and
    services, we also support traditional inter-service
    dependencies. We support multiple ways how such a service can
    signal its readiness: by forking and having the start process
    exit (i.e. traditional daemonize() behaviour), as well
    as by watching the bus until a configured service name appears.
  18. There’s an interactive mode which asks for confirmation each
    time a process is spawned by systemd. You may enable it by
    passing systemd.confirm_spawn=1 on the kernel command
    line.
  19. With the systemd.default= kernel command line
    parameter you can specify which unit systemd should start on
    boot-up. Normally you’d specify something like
    multi-user.target here, but another choice could even
    be a single service instead of a target, for example
    out-of-the-box we ship a service emergency.service that
    is similar in usefulness to init=/bin/bash, but
    has the advantage of actually running the init system, hence
    offering the option to boot up the full system from the
    emergency shell.
  20. There’s a minimal UI that allows you to
    start/stop/introspect services. It’s far from complete but
    useful as a debugging tool. It’s written in Vala (yay!) and goes
    by the name of systemadm.
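
To make the transaction check from item 12 a little more concrete, here is a small Python sketch. It is a simplified model under stated assumptions, not systemd’s actual algorithm: the function names are made up, every job is either “requested” (essential) or “wanted” (pulled in via Wants, hence non-essential), and only the After/Before ordering is checked.

```python
def find_cycle(jobs, after):
    """Return a list of jobs forming an ordering cycle, or None.
    after[a] lists the jobs that a must be ordered after."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {job: WHITE for job in jobs}
    stack = []

    def visit(job):
        colour[job] = GREY
        stack.append(job)
        for dep in after.get(job, ()):
            if dep not in colour:
                continue  # ordering against a job outside the transaction
            if colour[dep] == GREY:
                return stack[stack.index(dep):]
            if colour[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        colour[job] = BLACK
        return None

    for job in jobs:
        if colour[job] == WHITE:
            found = visit(job)
            if found:
                return found
    return None


def make_consistent(requested, wanted, after):
    """Drop non-essential jobs until the ordering is cycle-free.
    Returns the surviving jobs, or raises ValueError if a cycle
    consists of essential jobs only."""
    jobs = list(requested) + [j for j in wanted if j not in requested]
    while True:
        cycle = find_cycle(jobs, after)
        if cycle is None:
            return jobs
        droppable = [j for j in cycle if j not in requested]
        if not droppable:
            raise ValueError("ordering cycle among essential jobs: %s" % cycle)
        jobs.remove(droppable[0])
```

For example, if a and b are requested, c is merely wanted, and the ordering says a after c, c after b and b after a, the sketch drops c and keeps a and b. If a and b alone form the loop, the transaction fails.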

It should be noted that systemd uses many Linux-specific
features, and does not limit itself to POSIX. That unlocks a lot
of functionality a system that is designed for portability to
other operating systems cannot provide.

Status

All the features listed above are already implemented. Right
now systemd can already be used as a drop-in replacement for
Upstart and sysvinit (at least as long as there aren’t too many
native Upstart services yet. Thankfully most distributions don’t
carry too many native Upstart services yet.)

However, testing has been minimal, our version number is
currently at an impressive 0. Expect breakage if you run this in
its current state. That said, overall it should be quite stable
and some of us already boot their normal development systems with
systemd (in contrast to VMs only). YMMV, especially if you try
this on distributions we developers don’t use.

Where is This Going?

The feature set described above is certainly already
comprehensive. However, we have a few more things on our plate. I
don’t really like speaking too much about big plans but here’s a
short overview in which direction we will be pushing this:

We want to add at least two more unit types: swap
shall be used to control swap devices the same way we
already control mounts, i.e. with automatic dependencies on the
device tree devices they are activated from, and
suchlike. timer shall provide functionality similar to
cron, i.e. starts services based on time events, the
focus being both monotonic clock and wall-clock/calendar
events (i.e. “start this 5h after it last ran” as well as “start
this every Monday at 5 am”).
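
The two kinds of time events behave quite differently, which a short Python sketch can show. Nothing here reflects an existing systemd interface (the timer unit type is only planned); both function names are made up, and the calendar helper uses naive datetimes and ignores time zones and daylight saving changes.

```python
from datetime import datetime, timedelta


def next_monotonic(last_run_seconds, interval_seconds):
    """'Start this 5h after it last ran': both values are seconds on a
    monotonic clock, so wall-clock jumps do not affect the result."""
    return last_run_seconds + interval_seconds


def next_calendar(now, weekday, hour):
    """'Start this every Monday at 5 am': return the first datetime
    strictly after `now` that falls on `weekday` (Monday is 0) at
    `hour` o'clock."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

The monotonic variant only ever adds an interval to a timestamp, while the calendar variant has to be recomputed whenever the wall clock is set, which is why a notification about clock jumps would be useful for such a unit type.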

More importantly however, it is also our plan to experiment with
systemd not only for optimizing boot times, but also to make it
the ideal session manager, to replace (or possibly just augment)
gnome-session, kdeinit and similar daemons. The problem sets of a
session manager and an init system are very similar: quick start-up
is essential and babysitting processes the focus. Using the same
code for both uses hence suggests itself. Apple recognized that
and does just that with launchd. And so should we: socket and bus
based activation and parallelization is something session services
and system services can benefit from equally.

I should probably note that all three of these features are
already partially available in the current code base, but not
complete yet. For example, you can already run systemd just fine
as a normal user, and it will detect that it is run that way.
Support for this mode has been available since the very beginning,
and is in the very core. (It is also exceptionally useful for
debugging! This works fine even without having the system
otherwise converted to systemd for booting.)

However, there are some things we probably should fix in the
kernel and elsewhere before finishing work on this: we
need swap status change notifications from the kernel similar to
how we can already subscribe to mount changes; we want a
notification when CLOCK_REALTIME jumps relative to
CLOCK_MONOTONIC; we want to allow normal processes to get
some init-like powers; we need a well-defined place where we can
put user sockets. None of these issues are really essential for
systemd, but they’d certainly improve things.

You Want to See This in Action?

Currently, there are no tarball releases, but it should be
straightforward to check out the code from our repository. In
addition, to have something to start with, here’s a tarball with
unit configuration files that allows an otherwise unmodified
Fedora 13 system to work with systemd. We have no RPMs to offer
you for now.

An easier way is to download this Fedora 13 qemu image, which
has been prepared for systemd. In the grub menu you can select
whether you want to boot the system with Upstart or systemd. Note
that this system is minimally modified only. Service information
is read exclusively from the existing SysV init scripts. Hence it
will not take advantage of the full socket and bus-based
parallelization pointed out above, however it will interpret the
parallelization hints from the LSB headers, and hence boots faster
than the Upstart system, which in Fedora does not employ any
parallelization at the moment. The image is configured to output
debug information on the serial console, as well as writing it to
the kernel log buffer (which you may access with dmesg).
You might want to run qemu configured with a virtual
serial terminal. All passwords are set to systemd.
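
To give an idea of what reading the hints from the LSB headers involves, here is a rough Python sketch of a header parser. It is not systemd’s parser (which is written in C); the function name and the example script are made up, and multi-line Description fields are not handled.

```python
# Hypothetical init script, for illustration only.
EXAMPLE_SCRIPT = """#!/bin/sh
### BEGIN INIT INFO
# Provides:          exampled
# Required-Start:    $syslog $network
# Should-Start:      dbus
# Default-Start:     3 5
### END INIT INFO
echo "starting exampled"
"""


def parse_lsb_header(script_text):
    """Extract the key/value pairs between the LSB BEGIN/END INIT INFO
    markers. Values are split on whitespace; returns {} if the script
    carries no header."""
    fields = {}
    inside = False
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped == "### BEGIN INIT INFO":
            inside = True
            continue
        if stripped == "### END INIT INFO":
            break
        if not inside or not stripped.startswith("#") or ":" not in stripped:
            continue
        key, _, value = stripped.lstrip("#").partition(":")
        fields[key.strip()] = value.split()
    return fields
```

An init system can then turn fields such as Required-Start into ordering and requirement dependencies, and start scripts that do not depend on each other in parallel.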

Even simpler than downloading and booting the qemu image is
looking at pretty screen-shots. Since an init system usually is
well hidden beneath the user interface, some shots of
systemadm and ps must do:

systemadm

That’s systemadm showing all loaded units, with more detailed
information on one of the getty instances.

ps

That’s an excerpt of the output of ps xaf -eo pid,user,args,cgroup
showing how neatly the processes are sorted into the cgroup of
their service. (The fourth column is the
cgroup, the debug: prefix is shown because we use the
debug cgroup controller for systemd, as mentioned earlier. This is
only temporary.)

Note that both of these screenshots show an only minimally
modified Fedora 13 Live CD installation, where services are
exclusively loaded from the existing SysV init scripts. Hence,
this does not use socket or bus activation for any existing
service.

Sorry, no bootcharts or hard data on start-up times for the
moment. We’ll publish that as soon as we have fully parallelized
all services from the default Fedora install. Then, we’ll welcome
you to benchmark the systemd approach, and provide our own
benchmark data as well.

Well, presumably everybody will keep bugging me about this, so
here are two numbers I’ll tell you. However, they are completely
unscientific as they are measured for a VM (single CPU) and by
using the stop timer in my watch. Fedora 13 booting up with
Upstart takes 27s, with systemd we reach 24s (from grub to gdm,
same system, same settings, shorter value of two bootups, one
immediately following the other). Note however that this shows
nothing more than the speedup effect reached by using the LSB
dependency information parsed from the init script headers for
parallelization. Socket or bus based activation was not utilized
for this, and hence these numbers are unsuitable to assess the
ideas pointed out above. Also, systemd was set to debug verbosity
levels on a serial console. So again, this benchmark data has
barely any value.

Writing Daemons

An ideal daemon for use with systemd does a few things
differently than things were traditionally done. Later on, we will
publish a longer guide explaining and suggesting how to write a daemon for use
with systemd. Basically, things get simpler for daemon
developers:

  • We ask daemon writers not to fork or even double fork
    in their processes, but run their event loop from the initial process
    systemd starts for you. Also, don’t call setsid().
  • Don’t drop user privileges in the daemon itself, leave this
    to systemd and configure it in systemd service configuration
    files. (There are exceptions here. For example, for some daemons
    there are good reasons to drop privileges inside the daemon
    code, after an initialization phase that requires elevated
    privileges.)
  • Don’t write PID files.
  • Grab a name on the bus.
  • You may rely on systemd for logging, you are welcome to log
    whatever you need to log to stderr.
  • Let systemd create and watch sockets for you, so that socket
    activation works. Hence, interpret $LISTEN_FDS and
    $LISTEN_PID as described above.
  • Use SIGTERM for requesting shut downs from your daemon.

The list above is very similar to what Apple recommends for
daemons compatible with launchd. It should be easy to extend
daemons that already support launchd activation to support
systemd activation as well.
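
How little a daemon has to do to interpret $LISTEN_FDS and $LISTEN_PID can be shown with a short Python sketch. This is not the reference implementation mentioned above; the function name and its two parameters are made up so that the logic can be tried out without actually being socket-activated.

```python
import os

LISTEN_FDS_START = 3  # passed descriptors start at fd 3, as described above


def passed_fds(environ=None, pid=None):
    """Return the file descriptor numbers passed by the init system,
    or [] if none were passed to this process."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []  # variables were inherited, they are not meant for us
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # unset or malformed variables
    if count < 0:
        return []
    return list(range(LISTEN_FDS_START, LISTEN_FDS_START + count))
```

A daemon would call passed_fds() early in its start-up and, if the result is empty, fall back to creating its sockets itself. In Python each returned descriptor can be wrapped with socket.socket(fileno=fd).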

Note that systemd supports daemons not written in this style
perfectly as well, already for compatibility reasons (launchd has
only limited support for that). As mentioned, this even extends to
existing inetd capable daemons which can be used unmodified for
socket activation by systemd.

So, yes, should systemd prove itself in our experiments and get
adopted by the distributions it would make sense to port at least
those services that are started by default to use socket or
bus-based activation. We have written proof-of-concept patches,
and the porting turned out to be very easy. Also, we can
leverage the work that has already
been done for launchd, to a certain extent. Moreover, adding
support for socket-based activation does not make the service
incompatible with non-systemd systems.

FAQs

Who’s behind this?
Well, the current code-base is mostly my work, Lennart
Poettering (Red Hat). However the design in all its details is
the result of close cooperation between Kay Sievers (Novell) and
me. Other people involved are Harald Hoyer (Red Hat), Dhaval
Giani (formerly IBM), and a few others from various
companies such as Intel, SUSE and Nokia.

Is this a Red Hat project?
No, this is my personal side project. Also, let me emphasize
this: the opinions reflected here are my own. They are not
the views of my employer, or Ronald McDonald, or anyone
else.

Will this come to Fedora?
If our experiments prove that this approach works out, and
discussions in the Fedora community show support for this, then
yes, we’ll certainly try to get this into Fedora.

Will this come to OpenSUSE?
Kay’s pursuing that, so something similar to what I said for
Fedora applies here, too.

Will this come to Debian/Gentoo/Mandriva/MeeGo/Ubuntu/[insert your favourite distro here]?
That’s up to them. We’d certainly welcome their interest, and help with the integration.

Why didn’t you just add this to Upstart, why did you invent something new?
Well, the point of the part about Upstart above was to show
that the core design of Upstart is flawed, in our
opinion. Starting completely from scratch suggests itself if the
existing solution appears flawed in its core. However, note that
we took a lot of inspiration from Upstart’s code-base
otherwise.

If you love Apple launchd so much, why not adopt that?
launchd is a great invention, but I am not convinced that it
would fit well into Linux, nor that it is suitable for a system
like Linux with its immense scalability and flexibility to
numerous purposes and uses.

Is this an NIH project?
Well, I hope that I managed to explain in the text above why
we came up with something new, instead of building on Upstart or
launchd. We came up with systemd due to technical
reasons, not political reasons.
Don’t forget that it is Upstart that includes a library called
NIH (which is kind of a reimplementation of glib), not systemd!

Will this run on [insert non-Linux OS here]?
Unlikely. As pointed out, systemd uses many Linux specific
APIs (such as epoll, signalfd, libudev, cgroups, and numerous
more), so a port to other operating systems appears to us as not
making a lot of sense. Also, we, the people involved, are
unlikely to be interested in merging possible ports to other
platforms and working with the constraints this introduces. That said,
git supports branches and rebasing quite well, in case
people really want to do a port.
Actually portability is even more limited than just to other OSes: we require a very
recent Linux kernel, glibc, libcgroup and libudev. No support for
less-than-current Linux systems, sorry.
If folks want to implement something similar for other
operating systems, the preferred mode of cooperation is probably
that we help you identify which interfaces can be shared with
your system, to make life easier for daemon writers to support
both systemd and your systemd counterpart. Probably, the focus should be
to share interfaces, not code.

I hear [fill one in here: the Gentoo boot system, initng,
Solaris SMF, runit, uxlaunch, …] is an awesome init system and
also does parallel boot-up, so why not adopt that?
Well, before we started this we actually had a very close
look at the various systems, and none of them did what we had in
mind for systemd (with the exception of launchd, of course). If
you cannot see that, then please read again what I wrote
above.

Contributions

We are very interested in patches and help. It should be common
sense that every Free Software project can only benefit from the
widest possible external contributions. That is particularly true
for a core part of the OS, such as an init system. We value your
contributions and hence do not require copyright assignment (very
much unlike Canonical/Upstart!). And also, we use git,
everybody’s favourite VCS, yay!

We are particularly interested in help getting systemd to work
on other distributions, besides Fedora and OpenSUSE. (Hey, anybody
from Debian, Gentoo, Mandriva, MeeGo looking for something to do?)
But even beyond that we are keen to attract contributors on every
level: we welcome C hackers, packagers, as well as folks who are interested
in writing documentation or contributing a logo.

Community

At this time we only have a source code repository and an IRC
channel (#systemd on
Freenode). There’s no mailing list, web site or bug tracking
system. We’ll probably set something up on freedesktop.org
soon. If you have any questions or want to contact us otherwise we
invite you to join us on IRC!

Update: our GIT repository has moved.