Tag Archives: binary format

[$] An alternative device-tree source language

Post Syndicated from corbet original https://lwn.net/Articles/730217/rss

Device trees have become, in a relatively short time, the preferred way to
inform the kernel of the available hardware on systems where that hardware
is not discoverable — most ARM systems, among others. In short, a
device tree is a textual description of a system’s hardware that is
compiled to a simple binary format and passed to the kernel by the
bootloader. The source format for device trees has been established for a
long time — longer than Linux has been using it. Perhaps it’s time for a
change, but a proposal for a new
device-tree source format has generated a fair amount of controversy in the
small corner of the community that concerns itself with such things.

Portable Computing Language (pocl) v0.14 released

Post Syndicated from ris original https://lwn.net/Articles/719726/rss

Pocl aims to become a performance portable open source (MIT-licensed)
implementation of the OpenCL standard. Version
0.14
adds support for LLVM/Clang 4.0 and 3.9 and a new binary format
that enables running OpenCL programs on hosts without online compiler
support. There is also initial support for out-of-order command queue task
scheduling and plenty of bug fixes.

Storing Pokémon without SQL

Post Syndicated from Eevee original https://eev.ee/blog/2016/08/05/storing-pok%C3%A9mon-without-sql/

I run veekun, a little niche Pokédex website that mostly focuses on (a) very accurate data for every version, derived directly from the games and (b) a bunch of nerdy nerd tools.

It’s been languishing for a few years. (Sorry.) Part of it is that the team has never been very big, and all of us have either drifted away or gotten tied up in other things.

And part of it is that the schema absolutely sucks to work with. I’ve been planning to fix it for a year or two now, and with Sun/Moon on the horizon, it’s time I actually got around to doing that.

Alas! I’m still unsure on some of the details. I’m hoping if I talk them out, a clear best answer will present itself. It’s like advanced rubber duck debugging, with the added bonus that maybe a bunch of strangers will validate my thinking.

(Spoilers: I think I figured some stuff out by the end, so you don’t actually need to read any of this.)

The data

Pokémon has a lot of stuff going on under the hood.

  • The Pokémon themselves have one or two types; a set of abilities; moves they might learn at a given level or from a certain “tutor” NPC or via a specific item; evolution via one of at least twelve different mechanisms and which may branch; items they may be holding in the wild; six stats, plus effort for those six stats; flavor text; and a variety of other little data.

  • A number of Pokémon also have multiple forms, which can mean any number of differences that still “count” as the same Pokémon. Some forms are purely cosmetic (Unown); some affect the Pokémon’s type (Arceus); some affect stats (Pumpkaboo); some affect learned moves (Meowstic); some swap out a signature move (Rotom); some disable evolution (Pichu). Some forms can be switched at will; some switch automatically; some cannot be switched between at all. There aren’t really any hard and fast rules here. They’re effectively different Pokémon with the same name, except most of the properties are the same.

  • Moves are fairly straightforward, except that their effects vary wildly and it would be mighty convenient to be able to categorize them in a way that’s useful to a computer. After 17 years of trying, I’ve still not managed this.

  • Places connect to each other in various directions. They also may have some number of wild Pokémon, which appear at given levels with given probability. Oh, but certain conditions can change some — but not all! — of the possible encounters in an area, making for a UI nightmare. It gets particularly bad in Heart Gold and Soul Silver, where encounters (and their rates) are affected by time of day (morning, midday, night) and the music you’re playing (Sinnoh, Hoenn, none) and whether there’s an active swarm. Try to make sense of Rattata on Route 3.

  • Event Pokémon — those received from giveaways — may be given in several different ways, to several different regions, and may “lock” any of the Pokémon’s attributes either to a specific value or a choice of values.

  • And of course, all of this exists in at least eight different languages, plus a few languages with their own fanon vernacular, plus romanization for katakana and Hangul.

Even that would be all well and good, but the biggest problem of all is that any and all of this can change between games. Pairs of games — say, Red and Blue — tend to be mostly identical except for the encounters, since they come out at the same time. Spiky-Eared Pichu exists only in HGSS, and never appears again. The move Hypnosis has 60% accuracy in every game, except in Diamond and Pearl, where it has 70% accuracy. Sand Attack is ground-type, except in the first generation of games, where it was normal. Several Pokémon change how they evolve in later games, because they relied on a mechanic that was dropped. The type strength/weakness chart has been updated a couple times. And so on.

Oh, and there are several spin-off series, which often reuse the names of moves but completely change how they work. The entire Mystery Dungeon series, for example. Or even Pokémon Go.

This is awful.

The current approach

Since time immemorial, veekun has used a relational database. (Except for that one time I tried a single massive XML file, but let’s not talk about that.) It’s already straining the limits of this format, and it doesn’t even include half the stuff I just mentioned, like event Pokémon or where the move tutors are or Spiky-Eared Pichu’s disabled evolution.

Just the basic information about the Pokémon themselves is spread across three tables: pokemon_species, pokemon, and pokemon_forms. “Species” is supposed to be the pure essence of the name, so it contains stuff like “is this a baby” or “what does this evolve from/into” (which, in the case of Pichu, is already wrong!). pokemon_forms contains every form imaginable, including all 28 Unown, and tries to loosely categorize them — but it also treats Pokémon without forms as having a single “default” form. And then pokemon contains a somewhat arbitrary subset of forms and tacks other data onto them. Other tables arbitrarily join to whichever of these is most appropriate.

Tables may also be segmented by “version” (Red), “version group” (Red and Blue), or “generation” (Red, Blue, and Yellow), depending on when the data tends to vary. Oh, but there are also a number of conquest_* tables for Pokémon Conquest, which doesn’t have a row in versions since it’s not a mainline version. And I think there’s a goofy hack for Stadium in there somewhere.

For data that virtually never varies, except that one time it did, we… don’t really do anything. Base EXP was completely overhauled in X and Y, for example, and we only have a single base_experience column in the pokemon table, so it just contains the new X and Y values. What if you want to know about experience for an older game? Well, oops. Similarly, the type chart is the one from X and Y, which is no longer correct for previous games.

Aligning entities across games can be a little tricky, too. Earlier games had the Itemfinder, gen 5 had the Dowsing MCHN, and now we have the Dowsing Machine. These are all clearly the same item, but only the name Dowsing Machine appears anywhere in veekun, because there’s no support for changing names across games. The last few games also technically “renamed” every move and Pokémon from all-caps to title case, but this isn’t reflected anywhere. In fact, the all-caps names have never appeared on veekun.

All canonical textual data, including the names of fundamental entities like Pokémon and moves, are in separate tables so they can be separated by language as well. Numerous combinations of languages/games are missing, and I don’t think we actually have a list of which games were even released in which languages.

The result is a massive spread of tables, many of them very narrow but very tall, with joins that are not obvious if you’re not a DBA. I forget how half of it works if I haven’t looked at it in at least a month. I make this stuff available for anyone to use, too, so I would greatly prefer if it were (a) understandable by mortals and (b) not comically incomplete in poorly-documented ways.

I think a lot of this is a fear of massively duplicating the pile of data we’ve already got. Fixing the Dowsing Machine thing, for example, would require duplicating the name of every single item for every single game, just to fix this one item that was renamed twice. Fixing the base EXP problem would require yet another new table just for base experience, solely because it changed once.

It’s long past time to fix this.

SQL is bad, actually

(Let me cut you off right now: NoSQL is worse.)

I like the idea of a relational database. You have a schema describing your data, and you can link it together in myriad different ways, and it’s all built around set operations, and wow that’s pretty cool.

The actual implementation leaves a little to be desired. You can really only describe anything as flat tuples. You want to have things that can contain several other things, perhaps in order? Great! Make another flat tuple describing that, and make sure you remember to ask for the order explicitly, every single time you query.

Oh boy, querying. Querying is so, so tedious. You can’t even use all your carefully-constructed foreign key constraints as a shortcut; you have to write out foo.bar_id = bar.id in full every single time.

There are GUIs and whatnot, but the focus is all wrong. It’s on tables. Of course it’s on tables, but a single table is frequently not a useful thing to see on its own. For any given kind of entity (as defined however you think about your application), a table probably only contains a slice of what the entity is about, but it contains that slice for every single instance. Meanwhile, you can’t actually see a single entity on its own.

I’ll repeat that: you cannot.

Consider, for example, a Pokémon. A Pokémon has up to two types, which are rather fundamental properties. How do you view or fetch the Pokémon and its types?

Fuck you, that’s how. If you join pokemon to pokemon_types, you get this goofy result where everything about the Pokémon is potentially duplicated, but each row contains a distinct type.

Want to see abilities as well? There can be up to three of those! Join to both pokemon_abilities and pokemon_types, and now you get up to six rows, which looks increasingly not at all like what you actually wanted. Want moves as well? Good luck.

I don’t understand how this is still the case. SQL is 42 years old! How has it not evolved to have even the slightest nod towards the existence of nested data? This isn’t some niche use case; it’s responsible for at least a third of veekun’s tables!

This die-hard focus on data-as-spreadsheets is probably why we’ve tried so hard to avoid “duplication”, even when it’s the correct thing to do. The fundamental unit of a relational database is the table, and seeing a table full of the same information copied over and over just feels wrong.

But it’s really the focus on tables that’s wrong. The important point isn’t that Bulbasaur is named “BULBASAUR” in ten different games; it’s that each of those games has a name for Bulbasaur, and it happens to be the same much of the time.

NoSQL exists, yes, but I don’t trust anyone who looked at SQL and decided that the real problem was that it has too much schema.

I know the structure of my data, and I’m happy to have it be enforced. The problem isn’t that writing a schema is hard. The problem is that any schema that doesn’t look like a bank ledger maps fairly poorly to SQL primitives. It works, and it’s correct (if you can figure out how to express what you want), but the ergonomics are atrocious.

We’ve papered over some of this with SQLAlchemy’s excellent ORM, but you have to be very good at SQLAlchemy to make the mapping natural, which is the whole goal of using an ORM. I’m pretty good, and it’s still fairly clumsy.

A new idea

So. How about YAML?

See, despite our hesitation to duplicate everything, the dataset really isn’t that big. All of the data combined are a paltry 17MB, which could fit in RAM without much trouble; then we could search and wrangle it with regular Python operations. I could still have a schema, remember, because I wrote a thing for that. And other people could probably make more sense of some YAML files than CSV dumps (!) of a tangled relational database.

The idea is to re-dump every game into its own set of YAML files, describing just the raw data in a form generic enough that it can handle every (main series) game. I did a proof of concept of this for Pokémon earlier this year, and it looks like:

%TAG !dex! tag:veekun.com,2005:pokedex/
--- !!omap
- bulbasaur: !dex!pokemon
    name: BULBASAUR
    types:
    - grass
    - poison
    base-stats:
      attack: 49
      defense: 49
      hp: 45
      special: 65
      speed: 45
    growth-rate: medium-slow
    base-experience: 64
    pokedex-numbers:
      kanto: 1
    evolutions:
    - into: ivysaur
      minimum-level: 16
      trigger: level-up
    species: SEED
    flavor-text: "A strange seed was\nplanted on its\nback at birth.\fThe plant sprouts\nand
      grows with\nthis POKéMON."
    height: 28
    weight: 150
    moves:
      level-up:
      - 1: tackle
      # ...
    game-index: 153

This is all just regular ol’ YAML syntax. This is for English Red; there’d also be one for French Red, Spanish Red, etc. Ultimately, there’d be a lot of files, with a separate set for every game in every language.
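To make "wrangle it with regular Python operations" concrete, here's a minimal loading sketch using plain PyYAML rather than the camel loader mentioned below; the file path is hypothetical, and it cheats by building plain dicts instead of proper objects:

import yaml

# The %TAG directive makes !dex!pokemon shorthand for this full tag URI.
POKEMON_TAG = 'tag:veekun.com,2005:pokedex/pokemon'

def construct_pokemon(loader, node):
    # A real loader would build a Pokemon object here; a dict is enough for a sketch.
    return loader.construct_mapping(node, deep=True)

yaml.SafeLoader.add_constructor(POKEMON_TAG, construct_pokemon)

with open('red/en/pokemon.yaml') as f:          # hypothetical path
    document = yaml.load(f, Loader=yaml.SafeLoader)

# The document root is an !!omap, which PyYAML loads as a list of (key, value) pairs.
pokedex = dict(document)
grass_types = [ident for ident, mon in pokedex.items() if 'grass' in mon['types']]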

The UI will have to figure out when some datum was the same in every game, but it frequently does that even now, so that’s not a significant new burden. If anything, it’s an improvement, since now it’ll be happening only in one place; right now there are a lot of ad-hoc “deduplication” steps done behind the scenes when we add new data.

I like this idea, but I still feel very uneasy about it for unclear reasons. It is a wee bit out there. I could just take this same approach of “fuck it, store everything” and still use a relational database. But look at this little chunk of data; it already tells you plenty of interesting facts about Bulbasaur and only Bulbasaur, yet it would need at least half a dozen tables to express in a relational database. And you couldn’t inspect just Bulbasaur, and you’d have to do multiple queries to actually get everything, and there’d be no useful way to work with the data independently of the app, and so on. Worst of all, the structure is often not remotely obvious from looking at the tables, whereas you can literally see it in YAML syntax.

There are other advantages, as well:

  • A schema can still be enforced Python-side, using the camel loader, which by the way will produce objects rather than plain dicts. (That’s what the !dex!pokemon tag is for.)
  • If you don’t care about veekun at all and just want data, you have it in a straightforward format, for any version you like.
  • YAML libraries are fairly common, and even someone with very limited programming experience can make sense of the above structure. Currently we store CSV database dumps and offer a tool to load into an RDBMS, which has led to a number of bug reports about obscure compatibility issues with various databases, as well as numerous emails from people who are confused about how to load the data or even about what a database is.
  • It’s much more obvious what’s missing. If there’s no directory for Pokémon Yellow, surprise! That means we don’t have Pokémon Yellow. If the directory exists but there’s no places.yaml, guess what we’re missing! Figuring out what’s there and what’s not in a relational system is much more difficult; I only recently realized that we don’t have flavor text for any game before Black/White.
  • I’ll never again have to rearchitect the schema because a new game changed something I didn’t expect could ever change. Similarly, the UI can drop a lot of special cases for “this changes between games”, “this changes between generations”, etc. and treat it all consistently.
  • Pokémon forms can just be two Pokémon with the same species name. Fuck it, store everything. YAML even has “merge” syntax built right in that can elide the common parts. (This isn’t shown above, and I don’t know exactly what the syntax looks like yet.)

Good idea? Sure, maybe? Okay let’s look at some details, where the devil’s in.

Problems

There are several, and they are blocking my progress on this, and I only have three months to go.

Speed

There will be a lot of YAML, and loading a lot of YAML is not particularly quick, even with pyyaml’s C loader. YAML is a complicated format and this is a lot of text to chew through. I won’t know for sure how slow this is until I actually have more than a handful of games in this format, though.
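(One small mitigation, for what it's worth: prefer the C loader when it's compiled in and fall back otherwise. The helper name is made up.)

import yaml

try:
    # libyaml-backed loader; much faster, but not always available.
    FastLoader = yaml.CSafeLoader
except AttributeError:
    FastLoader = yaml.SafeLoader

def load_yaml(path):
    with open(path) as f:
        return yaml.load(f, Loader=FastLoader)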

I have a similar concern about memory use, since I’ll suddenly be storing a whole lot of identical data. I do have an idea for reducing memory use for strings, which is basically manual interning:

string_datum = big_ol_string_dict.setdefault(string_datum, string_datum)

If I load two YAML files that contain the same string, I can reuse the first one instead of keeping two copies around for no reason. (Strings are immutable in Python, so this is fine.)
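Since the loaded data is nested, the same trick has to be applied recursively; something like this hypothetical helper:

_interned = {}

def intern_strings(value):
    # Collapse duplicate strings across files; leave everything else alone.
    if isinstance(value, str):
        return _interned.setdefault(value, value)
    if isinstance(value, list):
        return [intern_strings(v) for v in value]
    if isinstance(value, dict):
        return {intern_strings(k): intern_strings(v) for k, v in value.items()}
    return value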

Alas, I’ve seen this done before, and it does have a teeny bit of overhead, which might make the speed issue even worse.

So I think what I’m going to do is load everything into objects, resolve duplicate strings, and then… store it all in a pickle! Then the next time the app goes to load the data, if the pickle is newer than any of the files, just load the pickle instead. Pickle is a well-specified binary format (much faster to parse) and should be able to remember that strings have already been de-duplicated.

I know, I know: I said don’t use pickle. This is the one case where pickle is actually useful: as a disposable cache. It doesn’t leave the machine, so there are no security concerns; it’s not shared between multiple copies of the app at the same time; and if it fails to load for any reason at all, the app can silently trash it and load the data directly.
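The cache logic itself is tiny; a sketch with made-up names, using mtimes for the "newer than any of the files" check:

import os
import pickle

def load_with_cache(yaml_paths, build, cache_path='pokedex.cache.pickle'):
    # build() is whatever slow function parses all the YAML into objects.
    try:
        cache_mtime = os.path.getmtime(cache_path)
        if all(os.path.getmtime(p) < cache_mtime for p in yaml_paths):
            with open(cache_path, 'rb') as f:
                return pickle.load(f)
    except Exception:
        pass  # any problem with the cache at all: fall through and rebuild

    data = build(yaml_paths)
    with open(cache_path, 'wb') as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data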

I just hope that pickle will be quick enough, or this whole idea falls apart. Trouble is, I can’t know for sure until I’m halfway done.

Languages versus games

Earlier I implied that every single game would get its own set of data: English Red has a set of files, French Red has the same set of files, etc.

For the very early games, this directly reflects their structure: each region got its own cartridge with the game in a single language. Different languages might have different character sets, different UI, different encounters (Phanpy and Teddiursa were swapped in Gold and Silver’s Western releases), different mechanics (leech moves fail against a Substitute in gen 1, but only in Japanese), and different graphics (several Gold and Silver trainer classes were slightly censored outside of Japan). You could very well argue that they’re distinct games.

The increased storage space of the Nintendo DS changed things. The games were still released regionally, but every game contains every language’s flavor text and “genus” (the stuff you see in the Pokédex). This was an actual feature of the game: if you received a Pokémon caught in another language — made much easier by the introduction of online trading — then you’d get the flavor text for that language in your Pokédex.

The DS versions also use a filesystem rather than baking everything into the binary, so very little code needed to change between languages; everything of interest was in text files.

From X and Y onward, there are no localizations. Every game contains the full names and descriptions of everything, plus the entire game script, in every language. In fact, you can choose which language to play the game in — in an almost unprecedented move for a Nintendo game, an American player with the American copy of the game can play the entire thing in Japanese.

(If this weren’t the case, you’d need an entire separate 3DS to do that, since the 3DS is region-locked. Thanks, Nintendo.)

The question, then, is how to sensibly store all this.


With the example YAML above, human-language details like names and flavor text are baked right into the Pokémon. This makes sense in the context of a single game, where those are properties of a Pokémon. If you take that to be the schema, then the obvious thing to do is to have a separate file for every game in every language: /red/en/pokemon.yaml, /red/fr/pokemon.yaml, and so on.

This isn’t ideal, since most of the other data is going to be the same. But those games are also the smallest, and anyway this captures the rare oddball difference like Phanpy and Teddiursa (though hell if I know how to express that in the UI).

With X and Y, everything goes out the window. There are effectively no separate games any more, so /x/en versus /x/fr makes no sense. It’s very clear now that flavor text — and even names — aren’t direct properties of the Pokémon, but of some combination of the Pokémon and the player.


One option is to put some flexibility in the directory structure.

/red
  /en
    pokemon.yaml
    pokemon-text.yaml
  /ja
    pokemon.yaml
    pokemon-text.yaml
...
/x
  pokemon.yaml
  /en
    pokemon-text.yaml
  /ja
    pokemon-text.yaml

A pokemon-text.yaml file would be a very simple mapping.

bulbasaur:
    name: BULBASAUR
    species: SEED
    flavor-text: "A strange seed was\nplanted on its\nback at birth.\fThe plant sprouts\nand
      grows with\nthis POKéMON."
ivysaur:
    ...

(Note that the lower-case keys like bulbasaur are identifiers, not names — they’re human-readable and obviously based on the English names, but they’re supposed to be treated as opaque dev-only keys. In fact I might try to obfuscate them further, to discourage anyone from title-casing them and calling them names.)

Something about this doesn’t sit well. I think part of it is that the structure in pokemon-text.yaml doesn’t represent a meaningful thing, which is somewhat at odds with the idea of loading each file directly into a set of objects. With this approach, I have to patchwork update existing objects as I go.
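The patchwork itself isn't hard, just inelegant; roughly, with made-up names and assuming everything loads into plain dicts:

import yaml

def apply_text(pokedex, text_path, language):
    # Bolt per-language strings onto Pokemon that were already loaded from pokemon.yaml.
    with open(text_path) as f:
        texts = yaml.safe_load(f)
    for identifier, strings in texts.items():
        pokedex[identifier].setdefault('text', {})[language] = strings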

It’s kind of a philosophical quibble, granted.


An extreme solution would be to pretend that X and Y are several different games: have /x/en and /x/fr, even though they contain mostly the same information taken from the same source.

I don’t think that’s a great idea, especially since the merged approach will surely be how all future games work as well.


At the other extreme, I could treat the older games as though they were separate versions themselves. Add a grouping called “cartridge” or something that’s a subset of “version”. Many of the oddball differences are between the Japanese version and everyone else, too.

There’s even a little justification for this in the way the first few games were released. Japan first got Red and Green, which had goofy art and were very buggy; they were later polished and released as the single version Japanese Blue, which became the basis for worldwide releases of Red and Blue. Japanese Red is a fairly different game from American Red; Japanese Blue is closer to American Blue but still not really the same. veekun already has a couple of nods towards this, such as having separate Red/Green and Red/Blue sprite art.

That would lead to a list of games like jp-red, jp-green, jp-blue, ww-red, ww-blue, yellow (I think they were similar across the board), jp-gold, jp-silver, ww-gold, ww-silver, crystal (again, I don’t think there were any differences), and so on. The schema would look like:

bulbasaur:
    name:
        en: BULBASAUR
        fr: BULBIZARRE
        es: BULBASAUR
        ...
    flavor-text: 
        en: "A strange seed was\nplanted on its\nback at birth.\fThe plant sprouts\nand
          grows with\nthis POKéMON."
        ...

The Japanese games, of course, would only have Japanese entries. A huge advantage of this approach is that it also works perfectly with the newer games, where this is effectively the structure of the original data anyway.

This does raise the question of exactly how I generate such a file without constantly reloading and redumping it. I guess I could dump every language of a game at the same time. That would also let me verify that there are no differences besides text.

The downside is mostly that the UI would have to consolidate this, and the results might be a little funky. Merging jp-gold with ww-gold and just calling it “Gold” when the information is the same, okay, sure, that’s easy and makes sense. jp-red versus ww-red is a bit weirder of a case. On the other hand, veekun currently pretends Red and Green didn’t even exist, which is certainly wrong.

I’d have to look more into the precise differences to be sure this would actually work, but the more I think about it, the more reasonable this sounds. Probably the biggest benefit is that non-text data would only differ across games, not potentially across games and languages.

Wow, this might be a really good idea. And it had never occurred to me before writing this section. This rubber duck thing really works, thanks!

Forms

As mentioned above, rather than try to group forms into various different tiers based on how much they differ, I might as well just go whole hog and have every form act as a completely distinct Pokémon.

Doing this with YAML’s merge syntax would even make the differences crystal clear:

plant-wormadam:
    &plant-wormadam
    types: [bug, grass]
    abilities:
        1: anticipation
        2: anticipation
        hidden: overcoat
    moves:
        ...
    # etc
trash-wormadam:
    <<: *plant-wormadam  # means "merge in everything from this other node"
    types: [bug, ground]
    moves:
        ...
# Even better:
unown-a:
    &unown-a
    types: [psychic]
    name: ...
    # whatever else
unown-c:
    <<: *unown-a
unown-d:
    <<: *unown-a
unown-e:
    <<: *unown-a

One catch is that I don’t know how to convince PyYAML to output merge nodes, though it’s perfectly happy to read them.

But wait, hang on. This is a list of Pokémon, not forms. Wormadam is a Pokémon. Plant Wormadam is a form.

Right?

This distinction has haunted us rather thoroughly since we first tried to support it with Diamond and Pearl. The website is still a little weird about this: it acts as though “Plant Wormadam” is the name of a distinct Pokémon (because it affects type) and has distinct pages for Sandy and Trash Wormadam, but “Burmy” is a single page, even though Wormadam evolves from Burmy and they have the same forms. (In Burmy’s case, though, form only affects the sprite and nothing else.) You can also get distinct forms in search results, which may or may not be what you want — but it also may or may not make sense to “ignore” forms when searching. In many cases we’ve arbitrarily chosen a form as the “default” even when there isn’t a clear one, just so you get something useful when you type in “wormadam”.

Either way, there needs to be something connecting them. Merge keys are only a shorthand for writing YAML; they’re completely invisible to app code and don’t exist in the deserialized data.
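(A quick check of that last point, assuming PyYAML: the safe loader flattens merge keys as it reads, so app code only ever sees the merged result.)

import yaml

doc = """
unown-a: &unown-a
    types: [psychic]
    form: a
unown-b:
    <<: *unown-a
    form: b
"""
data = yaml.safe_load(doc)
print(data['unown-b'])  # {'types': ['psychic'], 'form': 'b'} - no trace of the merge key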

YAML does have a nice shorthand syntax for a list of mappings:

bulbasaur:
-   name: ...
    types: ...
unown:
-   &unown-a
    name: ...
    types: ...
    form: a
-   <<: *unown-a
    form: b
-   <<: *unown-a
    form: c
...

Hm, now we lose the unown-a that functions as the actual identifier for the form.

Alternatively, there could be an entire separate type for sets of forms, since we do have tags here.

bulbasaur: !dex!pokemon
    name: ...
unown: !dex!pokemon-form-set
    unown-a: !dex!pokemon
        name: ...
    unown-b: !dex!pokemon
        ...

An unadorned Pokemon could act as a set of 1, then? I guess?

Come to think of it, this knits with another question: where does data specific to a set of forms go? Data like “can you switch between forms” and “is this purely cosmetic”. We can’t readily get that from the games, since it’s code rather than data.

It’s also extremely unlikely to ever change, since it’s a fundamental part of each multi-form Pokémon’s in-universe lore. So it makes sense to store that stuff in some separate manually-curated place, right? In which case, we could do the same for storing which sets of forms “count” as the same Pokémon. That is, the data files could contain plant-wormadam and sandy-wormadam as though they were completely distinct, and then we’d have our own bits on top (which we need anyway) to say that, hey, those are both forms of the same thing, wormadam.

That mirrors how the actual games handle this, too — the three Wormadam forms have completely separate stat/etc. structs.

Ah, but the games don’t store the Burmy or Unown forms separately, because they’re cosmetic. How does our code handle that? I guess there’s only one unown, and then we also know that there are 28 possible sprites?

But Arceus’s forms have different types, and they’re not stored separately either. (I think you could argue that Arceus is cosmetic-only, the cosmetic form is changed by Arceus’s type, and Arceus’s type is really just changed by Arceus’s ability. I’m pretty sure the ability doesn’t work if you hack it onto any other Pokémon, but I can’t remember whether Arceus still changes type if hacked to have a different ability.)

Relying too much on outside information also makes the data a teeny bit harder for anyone else to use; suddenly they have three Wormadams, none of which are quite called “Wormadam”, but all of which share the same Pokédex number. (Oh, right, we could just link them by Pokédex number.) That feels goofy, but if what you’re after is something with a definitive set of types, there is nothing called simply “Wormadam”.

Oh, and there’s a minigame that only exists in Heart Gold and Soul Silver, but that has different stats even for cosmetic forms. Christ.

I don’t think there’s any perfect answer here. I have a list of all the forms if you’d like to see more of this madness.

The Python API

So you want to load all this data and do stuff with it. Cool. There’ll be a class like this:

class Pokemon(Locus):
    types = List(Type, min=1, max=2)
    growth_rate = Scalar(GrowthRate)
    game_index = Scalar(int)
    ...

You know, a little declarative schema that matches the YAML structure. I love declarative classes.
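None of this exists yet, so purely for illustration: Scalar and List could be ordinary descriptors that validate on assignment. A rough sketch:

class Scalar:
    def __init__(self, type_):
        self.type_ = type_

    def __set_name__(self, owner, name):
        self.name = name

    def __set__(self, obj, value):
        if not isinstance(value, self.type_):
            raise TypeError(f"{self.name} must be a {self.type_.__name__}")
        obj.__dict__[self.name] = value

    def __get__(self, obj, owner=None):
        return self if obj is None else obj.__dict__.get(self.name)

class List(Scalar):
    def __init__(self, type_, min=0, max=None):
        super().__init__(type_)
        self.min, self.max = min, max

    def __set__(self, obj, value):
        value = list(value)
        if len(value) < self.min or (self.max is not None and len(value) > self.max):
            raise ValueError(f"{self.name} expects between {self.min} and {self.max} items")
        if not all(isinstance(v, self.type_) for v in value):
            raise TypeError(f"{self.name} items must be {self.type_.__name__}")
        obj.__dict__[self.name] = value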

The big question here is what a Pokemon is. (Besides whether it’s a form or not.) Is it a wrapper around all the possible data from every possible game, or just the data from one particular game? Probably the former, since the latter would mean having some twenty different Pokemon all called bulbasaur and that’s weird.

(Arguably, the former would be wrong because much of this stuff only applies to the main games and not Mystery Dungeon or Ranger or whatever else. That’s a very different problem that I’ll worry about later.)

I guess then a Pokemon would wrap all its attributes in a kind of amalgamation object:

print(pokemon)                          # <Pokemon: bulbasaur>
print(pokemon.growth_rate)              # <MultiValue: bulbasaur.growth_rate>
current = Game.latest
print(current)                          # <Game: alpha-sapphire>
print(pokemon.growth_rate[current])     # <GrowthRate: medium-slow>
pokemonv = pokemon.for_version(current)
print(pokemonv)                         # <Pokemon: bulbasaur/alpha-sapphire>
print(pokemonv.growth_rate)             # <GrowthRate: medium-slow>

There’s one more level you might want: a wrapper that slices by language solely for your own convenience, so you can say print(some_pokemon.name) and get a single string rather than a thing that contains all of them.

Should you be able to slice by language but not by version, so pokemon.name is a thing containing all English names across all the games? I guess that sounds reasonable to want, right? It would also let you treat text like any other property, which could be handy.

print(pokemon)                          # <Pokemon: bulbasaur>
print(pokemon.growth_rate)              # <MultiValue: bulbasaur.growth_rate>
# I'm making up method names on the fly here, so.
# Also there will probably be a few ways to group together changed properties,
# depending entirely on what the UI needs.
print(pokemon.growth_rate.meld())       # [((...every game...), <GrowthRate: medium-slow>)]
print(pokemon.growth_rate.unify())      # <GrowthRate: medium-slow>
pokemonl = pokemon.for_language(Language['en'])
print(pokemonl.name)                    # <MultiValue: bulbasaur.name>
print(pokemonl.name.meld())             # [((<Game: ww-red>, ...), 'BULBASAUR'), ((<Game: x>, ...), 'Bulbasaur')]
print(pokemonl.name.unify())            # None, maybe ValueError?

(Having written all of this, I suddenly realize that I’m targeting Python 3, where I can use é in class names. Which I am probably going to do a lot.)

I think… this all… seems reasonable and doable. It’ll require some clever wrapper types, but that’s okay.
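For what it's worth, the wrapper doesn't need to be terribly clever. One possible shape for MultiValue, matching the made-up meld()/unify() calls above (and assuming the per-game values are hashable):

class MultiValue:
    def __init__(self, values_by_game):
        # e.g. {'ww-red': 'BULBASAUR', 'x': 'Bulbasaur'}
        self.values_by_game = values_by_game

    def __getitem__(self, game):
        return self.values_by_game[game]

    def meld(self):
        # Group games that share a value, preserving first-seen order.
        groups = {}
        for game, value in self.values_by_game.items():
            groups.setdefault(value, []).append(game)
        return [(tuple(games), value) for value, games in groups.items()]

    def unify(self):
        # The single common value, or None if the games disagree.
        values = set(self.values_by_game.values())
        return values.pop() if len(values) == 1 else None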

Hmm

I know these are relatively minor problems in the grand scheme of things. People handle hundreds of millions of rows in poorly-designed MySQL tables all the time and manage just fine. I’m mostly waffling because this is a lot of (hobby!) work and I’ve already been through several of these rearchitecturings and I’m tired of discovering the dozens of drawbacks only after all the work is done.

Writing this out has provided some clarity, though, and I think I have a better idea of what I want to do. So, thanks.

I’d like to have a proof of concept of this, covering some arbitrary but representative subset of games, by the end of the month. Keep your eyes peeled.

Automatically inferring file syntax with afl-analyze

Post Syndicated from Michal Zalewski original http://lcamtuf.blogspot.com/2016/02/say-hello-to-afl-analyze.html

The nice thing about the control flow instrumentation used by American Fuzzy Lop is that it allows you to do much more than just, well, fuzzing stuff. For example, the suite has long shipped with a standalone tool called afl-tmin, capable of automatically shrinking test cases while still making sure that they exercise the same functionality in the targeted binary (or that they trigger the same crash). A related tool, afl-cmin, employs a similar trick to eliminate redundant files in large testing corpora.

The latest release of AFL features another nifty new addition along these lines: afl-analyze. The tool takes an input file, sequentially flips bytes in this data stream, and then observes the behavior of the targeted binary after every flip. From this information, it can infer several things:

  • No-op blocks that do not elicit any changes to control flow (say, comments, pixel data, etc).
  • Checksums, magic values, and other short, atomically compared tokens where any bit flip causes the same change to program execution.
  • Longer blobs exhibiting this property – almost certainly corresponding to checksummed or encrypted data.
  • “Pure” data sections, where analyzer-injected changes consistently elicit differing changes to control flow.
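To make the idea concrete, here is a toy version of the flip-and-observe loop in Python. It is emphatically not how afl-analyze works internally; the real tool watches AFL's edge-coverage instrumentation rather than exit codes and output, and is vastly faster.

import subprocess

def classify_bytes(target_cmd, data):
    # Crude stand-in for AFL's behavior signature: exit code plus captured output.
    def run(buf):
        proc = subprocess.run(target_cmd, input=buf, capture_output=True)
        return (proc.returncode, proc.stdout)

    baseline = run(data)
    verdicts = []
    for i in range(len(data)):
        mutated = bytearray(data)
        mutated[i] ^= 0xFF          # flip every bit in this byte
        verdicts.append('no-op' if run(bytes(mutated)) == baseline else 'significant')
    return verdicts

# Example: which bytes does `cut -d ' ' -f1` actually care about?
# print(classify_bytes(['cut', '-d', ' ', '-f1'], b'hello world\nfoo bar\n'))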

These classifications give us some remarkable and quick insights into the syntax of the file and the behavior of the underlying parser. It may sound too good to be true, but actually seems to work in practice. For a quick demo, let’s see what afl-analyze has to say about running cut -d ' ' -f1 on a text file:

We see that cut really only cares about spaces and newlines. Interestingly, it also appears that the tool always tokenizes the entire line, even if it’s just asked to return the first token. Neat, right?

Of course, the value of afl-analyze is greater for incomprehensible binary formats than for simple text utilities; perhaps even more so when dealing with black-box parsers (which can be analyzed thanks to the runtime QEMU instrumentation supported in AFL). To try out the tool’s ability to deal with binaries, let’s check out libpng:

This looks pretty damn good: we have two four-byte signatures, followed by chunk length, four-byte chunk name, chunk length, some image metadata, and then a comment section. Neat, right? All in a matter of seconds: no configuration needed and no knobs to turn.

Of course, the tool shipped just moments ago and is still very much experimental; expect some kinks. Field testing and feedback welcome!

Performance Tuning Your Titan Graph Database on AWS

Post Syndicated from Nick Corbett original https://blogs.aws.amazon.com/bigdata/post/Tx3JO46NGMX3WAW/Performance-Tuning-Your-Titan-Graph-Database-on-AWS

Nick Corbett is a Big Data Consultant for AWS Professional Services

Graph databases can outperform an RDBMS and give much simpler query syntax for many use cases. In my last post, Building a Graph Database on AWS Using Amazon DynamoDB and Titan, I showed how a network of relationships can be stored and queried using a graph database. In this post, I show you how to tune the performance of your Titan database running on Amazon DynamoDB in AWS.

Earlier today, AWS released a new version of the Amazon DynamoDB Storage Backend for Titan, which is compatible with Titan v1.0.0. Titan v1.0.0 moves on from the previous version, v0.5.4, with improvements to the query execution engine and compatibility with the latest version of the Apache TinkerPop stack (v3). (See the What’s New post for details about this release.)

The following diagram shows you the stack architecture:

As you saw in my last post, your application runs Gremlin commands on the graph database to traverse the vertices and edges, in much the same way as you would run SQL against an RDBMS. Your application interacts with the TinkerPop API, either calling Titan directly or calling a TinkerPop Gremlin Server instance. You can think of the TinkerPop API in a similar way to ODBC: it gives a standard, vendor-neutral interface to a graph database and is implemented by major graph engines including Titan, Neo4J, and OrientDB.

Here’s an example of a network that is best modelled using a graph database. On the 4th July 2012, CERN announced the preliminary result of the discovery of the Higgs Boson from teams working with the Large Hadron Collider. While most of us struggle to understand what a Higgs particle is and the true implications of its discovery, the scientific community was hugely interested in this event. The sample data [1] used in this post records Twitter activity between the 1st and 7th of July 2012 that includes one of the following keywords or hashtags: lhc, cern, boson, higgs.

This data can be used to understand how information flows from one user (or vertex) in the graph to another. To do this, you can run the following ‘path finder’ query that calculates all the paths between two vertices:

s=System.currentTimeMillis(); g.V.has('UserId', 46996).outE.has('tweettype', 'RT').inV.loop(3){it.object.UserId != 33833 && it.loops < 6}.path.filter{it.last().UserId == 33833}.transform{[id: it.UserId, time:System.currentTimeMillis() - s]}.iterate(); t = System.currentTimeMillis() - s;

This was adapted from the shortest path recipe in the Gremlin documentation and finds all paths between two users that are connected by re-tweets. The query starts at the vertex with UserId 46996 and will traverse a maximum of 6 edges in all directions in an attempt to find UserId 33833. You can use the number of traversals to change how easy it is for the system to calculate this query. The greater the number of traversals, the more vertices need to be considered and the more demanding the query.

When creating a Titan graph using DynamoDB, your first decision is whether to use the single-item or multi-item data model. Titan uses the BigTable paradigm to persist data, and the DynamoDB Storage Backend supports two implementations of this. Both store vertex, edge, and property data in a DynamoDB table called edgestore (for more details about the tables the storage backend uses, see Titan Graph Modeling in DynamoDB).

However, the single-item model puts each vertex, together with all its properties, edges, and properties of its edges, into a single DynamoDB item in this table. The multi-item model stores each piece of information in a separate DynamoDB item and uses the hash key to link all information for a particular vertex, as shown in the following example tables. Note that the actual data stored is in binary format and not human readable.

The advantage of the single-item data model is that DynamoDB consumes fewer IOPS to read your data so it’s quicker to run queries.

To demonstrate this, the Twitter data was loaded into two graph databases, one for each data model. The path finder query was run using 4 traversals. The query against data stored in the single-item data model returned in 25% of the time, taking 1114 ms compared to 4484 ms for the multi-item format.

However, even the query against the data stored in the multi-item model was faster than running the equivalent query in an RDBMS. The same data was loaded into MySQL running in Amazon Relational Database Service (RDS) with an equivalent instance size (4 CPU / 16 GiB memory).  A SQL statement took 10204 ms to find all the paths between the two vertices.

In general, queries against a graph stored in the single-item data model outperform those stored in the multi-item data model. However, if you need to access a single property or edge, then the multi-item model is faster as the cost of reading the smaller row is less. Also, if you update a vertex or edge in a multi-item data model, changing the smaller row is cheaper and quicker.

The main restriction of the single-item data model is the size of the row, as all properties, edges (both in and out), and edge properties must fit into a single DynamoDB item, currently capped at 400 KB. After it’s created, you cannot change the storage model of your graph, so you need to be sure that you won’t exceed this limit.

At first glance, the Twitter data set looks like a good candidate for the single-item data model, as the only property associated with each vertex is a UserId. However, if live data was captured and added to this model, new edges would be constantly discovered as users retweet and mention each other. Each new edge adds extra storage requirements to its parent vertex; if a user is too active, then you can’t store all their interactions in a single item. Because you don’t want to find yourself in a situation where you can’t add a new edge to a vertex, it’s best to use the multi-item storage model unless you are absolutely sure the single model works for you.

Using the multiple-item storage model allows your graph to have highly connected vertices or lots of properties, but it does come with a performance cost. This “storage versus performance” choice is not one that most of us would want to make and it is worth looking at the architecture and configuration of the system to see if you can tune performance.

In order to tune your graph database, you need visibility into how the system is performing. Titan uses the open source Metrics package to make many measurements visible at run-time. Metrics integrates with many common tools, such as Log4J, Ganglia, or Graphite, making it easy to get insight into your system. The DynamoDB Storage Backend extends this by adding metrics of its own, which can be enabled in the graph configuration. By combining these with measures taken from Amazon CloudWatch, you will be able to understand what is happening in your graph database.

The obvious place to start is to increase the size of the instance that is running the Gremlin query. The following graph shows the time taken to run the 6-loop version of the path finder query on various instances in the m4 family. As you can see, increasing the instance size does improve performance, but not by much:

As you can see, moving from an m4.large (2 vCPU, 8 GiB memory) instance to an m4.2xlarge (8 vCPU, 32 GiB) only gives a 9% gain in performance when running this particular query, which shows it isn’t bound by memory or CPU. The maximum IOPS against the DynamoDB table is also shown on the graph above (orange line), taken from the DynamoDB Storage Backend metric. The edgestore table was configured to allow 750 read IOPS, but you only see between 200 and 250 when the query is run.

You can configure the DynamoDB storage backend to limit the number of IOPS it uses against each table. Specifically, by changing the storage.dynamodb.stores.edgestore.read-rate configuration parameter, you can cap the maximum IOPS that the storage backend consumes against the main table. The following graph shows the same query run as above, on an m4.xlarge instance. This time, however, the storage backend was configured to limit the read rate against the edgestore table.

The time taken to run the ten-loop query was recorded:

  

As the maximum read-rate of the storage backend is increased, the time taken to execute the query (blue line) shows exponential decay. This tells us that, for this particular query, the system can only achieve about 200 IOPS against the DynamoDB table. This number will vary from query to query. For example, the following query times how long it takes to build a deduped list of all vertices that can be reached within 8 edge traversals of UserId 46996:

s=System.currentTimeMillis(); a = g.V.has('UserId', 46996).out.loop(1){it.loops < 8}{true}.dedup.count();s=System.currentTimeMillis()-s;

Running this query on an m4.xlarge instance generated over 550 IOPS against the edgestore table, as shown in the following Amazon CloudWatch screenshot:

The tests described so far have all been run on a single EC2 instance, using the Gremlin Console (part of the TinkerPop stack) as a Titan client. This architecture works well if the client is JVM-compatible, as it can directly call the Titan libraries. The Gremlin Console is written in Groovy and directly calls the TinkerPop API of the Titan libraries.

An alternative approach is to use Gremlin Server. Gremlin Server is also part of the TinkerPop stack and provides REST and WebSockets interfaces to allow multiple clients to run queries against the database. If you choose to put a Gremlin Server instance into your architecture, you still execute the same Gremlin query but will be able to run these from non-JVM clients. You can also choose to have more memory or CPU as this instance is responsible for resolving your queries.

To get more throughput in a multi-client solution, you can scale out your Gremlin Server instances. A standard configuration, with the instances in an Auto Scaling group spread over multiple Availability Zones, works best. You can use an Elastic Load Balancing load balancer to manage the traffic to the instances.

If you want to use the WebSockets interface, configure your load balancer to use TCP (rather than HTTP); if you want to retain information about client IP addresses, enable Proxy Protocol Support. You need to make sure that the DynamoDB Storage Backend configuration on each Gremlin Server instance limits the number of IOPS so that the total generated by all your instances does not exceed the provisioned IOPS of your DynamoDB table.

If you choose to scale out your Gremlin Server instances, you need to make sure they are in session-less mode. This means that your entire Gremlin query needs to be encapsulated in a single request to a server, which has the advantage that each server is stateless and can easily scale.

Summary

In this post, I have shown you how the storage model and instance size can affect the performance of a Titan graph database using DynamoDB for storage. The single-item data model outperforms the multiple-item data model in most cases, but comes with the cost of limiting the amount of information you can store.

I have also shown you how to get maximum performance out of a single instance running Gremlin queries and how to increase system throughput by scaling out. By using a combination of CloudWatch and the metrics that the DynamoDB Storage Backend creates, you can get insight into how your graph database is performing and make decisions on how to design your system for best performance.

References

M. De Domenico, A. Lima, P. Mougel and M. Musolesi. The Anatomy of a Scientific Rumor. (Nature Open Access) Scientific Reports 3, 2980 (2013).

Picture of Higgs Boson taken from CERN Document Server. Shared under creative commons license.

Performance Tuning Your Titan Graph Database on AWS

Post Syndicated from Nick Corbett original https://blogs.aws.amazon.com/bigdata/post/Tx3JO46NGMX3WAW/Performance-Tuning-Your-Titan-Graph-Database-on-AWS

Nick Corbett is a Big Data Consultant for AWS Professional Services

Graph databases can outperform an RDBMS and give much simpler query syntax for many use cases. In my last post, Building a Graph Database on AWS Using Amazon DynamoDB and Titan, I showed how a network of relationships can be stored and queried using a graph database. In this post, I show you how to tune the performance of your Titan database running on Amazon DynamoDB in AWS.

Earlier today, AWS released a new version of the Amazon DynamoDB Storage Backend for Titan, which is compatible with Titan v1.0.0. Titan v1.0.0 moves on from the previous version, v0.5.4, with improvement to the query execution engine and compatibility with the latest version of the Apache TinkerPop stack (v3). (See the What’s New post for details about this release.)

The following diagram shows you the stack architecture:

As you saw in my last post, your application runs Gremlin commands on the graph database to traverse the vertices and edges, in much the same way as you would run SQL against an RDBMS. Your application interacts with the TinkerPop API; either calling Titan directly or calling a TinkerPop Gremlin Server instance. You can think of the TinkerPop API in a similar way to ODBC; it gives a standard, vendor-neutral interface to a graph database and is implemented by major graph engines including Titan, Neo4J, or OrientDB.

Here’s an example of a network that is best modelled using a graph database. On the 4th July 2012, CERN announced the preliminary result of the discovery of the Higgs Boson from teams working with the Large Hadron Collider. While most of us struggle to understand what a Higgs particle is and the true implications of its discovery, the scientific community was hugely interested in this event. The sample data1 used in this post records Twitter activity between the 1st and 7th of July 2012 that includes one of the following keyword or hashtags: lhc, cern, boson, higgs.

This data can be used to understand how information flows from one user (or vertex) in the graph to another. To do this, you can run the following ‘path finder’ query that calculates all the paths between two vertices:

s=System.currentTimeMillis(); g.V.has(‘UserId’, 46996).outE.has(‘tweettype’, ‘RT’).inV.loop(3){it.object.UserId != 33833 && it.loops < 6}.path.filter{it.last().UserId == 33833}.transform{[id: it.UserId, time:System.currentTimeMillis() – s]}.iterate(); t = System.currentTimeMillis() – s;

This was adapted from the shortest path recipe in the Gremlin documentation and finds all paths between two users that are connected by re-tweets. The query starts at the vertex with UserId 46996 and will traverse a maximum of 6 edges in all directions in an attempt to find UserId 33833. You can use the number of traversals to change how easy it is for the system to calculate this query. The greater the number of traversals, the more vertices need to be considered and the more demanding the query.

When creating a Titan graph using DynamoDB, your first decision is whether to use the single-item or multi-item data model. Titan uses the BigTable paradigm to persist data and the DynamoDB Storage Backend supports these two implementations of this. Both store vertex, edge, and property in a DynamoDB table called edgestore (for more details about the tables the storage backend uses, see Titan Graph Modeling in DynamoDB).

However, the single-item model puts each vertex, together with all its properties, edges, and properties of its edges into in a single DynamoDB item in this table. The multi-item model stores each piece of information in a separate DynamoDB item and uses the hash key to link all information for a particular vertex, as shown in the following example tables. Note that the actual data stored is in binary format and not human readable.

The advantage of the single-item data model is that DynamoDB consumes fewer IOPS to read your data so it’s quicker to run queries.

To demonstrate this, the Twitter data was loaded into two graph databases, one for each data model. The path finder query was run using 4 traversals. The query against data stored in the single-item data model returned in 25% of the time, taking 1114 ms compared to 4484 ms for the multi-item format.

However, even the query against the data stored in the multi-item model was faster than running the equivalent query in an RDBMS. The same data was loaded into MySQL running in Amazon Relational Database Service (RDS) with an equivalent instance size (4 CPU / 16 GiB memory).  A SQL statement took 10204 ms to find all the paths between the two vertices.

In general, queries against a graph stored in the single-item data model outperform those stored in the multi-item data model. However, if you need to access a single property or edge, then the multi-item model is faster as the cost of reading the smaller row is less. Also, if you update a vertex or edge in a multi-item data model the cost of changing the smaller row is cheaper and quicker.

The main restriction of the single-item data model is the size of the row, as all properties, edges (both in and out), and edge properties must fit into a single DynamoDB item, currently capped at 400 KB. After it’s created, you cannot change the storage model of your graph, so you need to be sure that you won’t exceed this limit.

At first glance, the Twitter data set looks like a good candidate for the single-item data model, as the only property associated with each vertex is a UserId. However, if live data was captured and added to this model, new edges would be constantly discovered as users retweet and mention each other. Each new edge adds extra storage requirements to its parent vertex; if a user is too active, then you can’t store all their interactions in a single item. Because you don’t want to find yourself in a situation where you can’t add a new edge to a vertex, it’s best to use the multi-item storage model unless you are absolutely sure the single-item model works for you.

Using the multiple-item storage model allows your graph to have highly connected vertices or lots of properties, but it does come with a performance cost. This "storage versus performance" choice is not one that most of us would want to make, so it is worth looking at the architecture and configuration of the system to see if you can tune performance.

In order to tune your graph database, you need visibility into how the system is performing. Titan uses the open source Metrics package to make many measurements visible at run-time. Metrics integrates with many common tools, such as Log4J, Ganglia, or Graphite, making it easy to get insight into your system. The DynamoDB Storage Backend extends this by adding metrics of its own, which can be enabled in the graph configuration. By combining these with measures taken from Amazon CloudWatch, you will be able to understand what is happening in your graph database.

The obvious place to start is to increase the size of the instance that is running the Gremlin query. The following graph shows the time taken to run the six-loop version of the path finder query on various instances in the m4 family. Increasing the instance size does improve performance, but not by much:

As you can see, moving from an m4.large (2 vCPU, 8 GiB memory) instance to an m4.2xlarge (8 vCPU, 32 GiB) only gives a 9% gain in performance when running this particular query, which shows it isn’t bound by memory or CPU. The maximum IOPS against the DynamoDB table is also shown on the graph above (orange line), taken from the DynamoDB Storage Backend metric. The edgestore table was configured to allow 750 read IOPS, but you only see between 200 and 250 when the query is run.

You can configure the DynamoDB storage backend to limit the number of IOPS it uses against each table. Specifically, by changing the storage.dynamodb.stores.edgestore.read-rate configuration parameter, you can cap the maximum IOPS that the storage backend consumes against the main table. The following graph shows the same query run as above, on an m4.xlarge instance. This time, however, the storage backend was configured to limit the read rate against the edgestore table.
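As an aside, here is a minimal sketch of how that read-rate limit might be set when opening a graph programmatically rather than through a properties file. Only the storage.dynamodb.stores.edgestore.read-rate key comes from the discussion above; the backend class name, the read-rate value of 100, and the Titan 0.5.x-style API calls are assumptions for illustration.

import org.apache.commons.configuration.BaseConfiguration;

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class RateLimitedGraph {
    public static void main(String[] args) {
        BaseConfiguration conf = new BaseConfiguration();
        // Assumed fully qualified class name of the DynamoDB storage backend.
        conf.setProperty("storage.backend",
                "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager");
        // Cap the read IOPS the backend may consume against the edgestore table.
        conf.setProperty("storage.dynamodb.stores.edgestore.read-rate", 100);

        TitanGraph graph = TitanFactory.open(conf);
        // ... run Gremlin traversals against the graph ...
        graph.shutdown();
    }
}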

The time taken to run the ten-loop query was recorded:


As the maximum read-rate of the storage backend is increased, the time taken to execute the query (blue line) shows exponential decay. This tells us that, for this particular query, the system can only achieve about 200 IOPS against the DynamoDB table. This number will vary from query to query. For example, the following query times how long it takes to build a deduped list of all vertices that can be reached within 8 edge traversals of UserId 46996:

s=System.currentTimeMillis(); a = g.V.has('UserId', 46996).out.loop(1){it.loops < 8}{true}.dedup.count();s=System.currentTimeMillis()-s;

Running this query on an m4.xlarge instance generated over 550 IOPS against the edgestore table, as shown in the following Amazon CloudWatch screenshot:

The tests described so far have all been run on a single EC2 instance, using the Gremlin Console (part of the TinkerPop stack) as a Titan client. This architecture works well if the client is JVM-compatible, as it can directly call the Titan libraries. The Gremlin Console is written in Groovy and directly calls the TinkerPop API of the Titan libraries.

An alternative approach is to use Gremlin Server. Gremlin Server is also part of the TinkerPop stack and provides REST and WebSockets interfaces to allow multiple clients to run queries against the database. If you choose to put a Gremlin Server instance into your architecture, you still execute the same Gremlin query but will be able to run these from non-JVM clients. You can also choose to have more memory or CPU as this instance is responsible for resolving your queries.

To get more throughput in a multi-client solution, you can scale out your Gremlin Server instances. A standard configuration, with the instances in an Auto Scaling group spread over multiple Availability Zones, works best. You can use an Elastic Load Balancing load balancer to manage the traffic to the instances.

If you want to use the WebSockets interface, configure your load balancer to use TCP (rather than HTTP); if you want to retain information about client IP addresses, enable Proxy Protocol Support. You need to make sure that the DynamoDB Storage Backend configuration on each Gremlin Server instance limits the number of IOPS so that the total generated by all your instances does not exceed the provisioned IOPS of your DynamoDB table.

If you choose to scale out your Gremlin Server instances, you need to make sure they are in session-less mode. This means that your entire Gremlin query needs to be encapsulated in a single request to a server, which has the advantage that each server is stateless and can easily scale.

Summary

In this post, I have shown you how the storage model and instance size can affect the performance of a Titan graph database using DynamoDB for storage. The single-item data model outperforms the multiple-item data model in most cases, but comes with the cost of limiting the amount of information you can store.

I have also shown you how to get maximum performance out of a single instance running Gremlin queries and how to increase system throughput by scaling out. By using a combination of CloudWatch and the metrics that the DynamoDB Storage Backend creates, you can get insight into how your graph database is performing and make decisions on how to design your system for best performance.

References

M. De Domenico, A. Lima, P. Mougel and M. Musolesi. The Anatomy of a Scientific Rumor. (Nature Open Access) Scientific Reports 3, 2980 (2013).

Picture of Higgs Boson taken from CERN Document Server. Shared under creative commons license.

Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library

Post Syndicated from Kevin Deng original https://blogs.aws.amazon.com/bigdata/post/Tx3ET30EGDKUUI2/Implementing-Efficient-and-Reliable-Producers-with-the-Amazon-Kinesis-Producer-L

Kevin Deng is an SDE with the Amazon Kinesis team and is the lead author of the Amazon Kinesis Producer Library

How do you vertically scale an Amazon Kinesis producer application by 100x? While it’s easy to get started with streaming data into Amazon Kinesis, streaming large volumes of data efficiently and reliably presents some challenges. When a host needs to send many records per second (RPS) to Amazon Kinesis, simply calling the basic PutRecord API action in a loop is inadequate.

To reduce overhead and increase throughput, the application must batch records and implement parallel HTTP requests. It also must deal with the transient failures inherent to any network application and perform retries as needed. And in a large-scale application, monitoring is needed to allow operators to diagnose and troubleshoot any issues that arise.

In this post, you’ll develop a sample producer application and evolve it through several stages of complexity to learn about these challenges. I discuss each challenge at a conceptual level and present sample solutions for some of them. The goal is to help you understand what problems the KPL solves and, at a high level, how it solves them.

All code in this post is available in the AWS Big Data Blog Github repository.

Scenario

Consider a typical use case of Amazon Kinesis: realtime clickstream analysis. In this setup, a webserver receives requests from the browser about the links a user has clicked on a website. A typical request might contain a payload that looks like this:

{
"sessionId": "seXACNTS3FoQuqTVxAM",
"fmt": "1",
"num": "1",
"cv": "7",
"frm": "2",
"url": "http%3A//www.mywebsite.com/activityi%3Bsrc%3D1782317%3Btype%3D2015s004%3Bcat%3Dgs6pr0%3Bord%3D3497210042551.1597%3F",
"ref": "http%3A//www.mywebsite.com/us/explore/myproduct-features-and-specs/%3Fcid%3Dppc-",
"random": "3833153354"
}

This payload is 342 bytes long, and the examples in this post assume that most of the requests coming in are roughly this size (~350 bytes).

After receiving the request, the webserver sends it to Amazon Kinesis to be consumed by an analysis application. This application provides the business intelligence in near-realtime that can be used for various purposes. Don’t worry about the consumer in this post; focus only on the producer, which in this case is the webserver.

Overall Strategy

To avoid interfering with other workloads such as actually serving pages or receiving other requests, the webserver doesn’t try to send the data to Amazon Kinesis within its request handler. Instead, it places the payload of each request into a queue to be processed by a separate component. Call this component ClickEventsToKinesis; its implementation is what I discuss in the rest of this post. Start with the following abstract class that all subsequent implementations inherit from:

public abstract class AbstractClickEventsToKinesis implements Runnable {
protected final static String STREAM_NAME = "myStream";
protected final static String REGION = "us-east-1";

protected final BlockingQueue<ClickEvent> inputQueue;
protected volatile boolean shutdown = false;
protected final AtomicLong recordsPut = new AtomicLong(0);

protected AbstractClickEventsToKinesis(
BlockingQueue<ClickEvent> inputQueue) {
this.inputQueue = inputQueue;
}

@Override
public void run() {
while (!shutdown) {
try {
runOnce();
} catch (Exception e) {
e.printStackTrace();
}
}
}

public long recordsPut() {
return recordsPut.get();
}

public void stop() {
shutdown = true;
}

protected abstract void runOnce() throws Exception;
}

In this code, ClickEvent is a trivial wrapper around the payload, with the sessionId field extracted. Use the sessionId as your partition key when putting into Amazon Kinesis.

public class ClickEvent {
private String sessionId;
private String payload;

public ClickEvent(String sessionId, String payload) {
this.sessionId = sessionId;
this.payload = payload;
}

public String getSessionId() {
return sessionId;
}

public String getPayload() {
return payload;
}
}

For simplicity, assume that the server runs on an Amazon EC2 instance that has an appropriate IAM role such that you can simply use DefaultAWSCredentialsProviderChain for credentials.

Basic Producer

Imagine that you have just launched your website. Traffic is fairly low at the moment, and each one of your servers is getting about 50 ClickEvents objects per second. Start with the simplest possible implementation: take the ClickEvents objects from the queue one at a time and send each to Amazon Kinesis synchronously. At this point, make no attempt at handling errors or performing retries should any of the puts fail.

public class BasicClickEventsToKinesis extends AbstractClickEventsToKinesis {
private final AmazonKinesis kinesis;

public BasicClickEventsToKinesis(BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new AmazonKinesisClient().withRegion(
Regions.fromName(REGION));
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
kinesis.putRecord(STREAM_NAME, data, partitionKey);
recordsPut.getAndIncrement();
}
}

That didn’t take too long, not bad at all. This implementation attains a throughput of about 90 records per second (RPS) in us-east-1 on a c3.2xlarge Ubuntu 15 instance, sufficient for your present needs.

Using the KPL – Quick Preview

Take a look at a similar implementation that uses the KPL instead:

public class KPLClickEventsToKinesis extends AbstractClickEventsToKinesis {
private final KinesisProducer kinesis;

public KPLClickEventsToKinesis(BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new KinesisProducer(new KinesisProducerConfiguration()
.setRegion(REGION)
.setRecordMaxBufferedTime(5000));
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
while (kinesis.getOutstandingRecordsCount() > 5e4) {
Thread.sleep(1);
}
kinesis.addUserRecord(STREAM_NAME, partitionKey, data);
recordsPut.getAndIncrement();
}
}

While the two implementations are very similar on the surface, a significant amount of additional logic is performed by the KPL-based implementation underneath the covers. The latter achieves a throughput of about 76,000 RPS when run on the same host, which is about 800 times greater. As you go through the rest of the post, it should become clear why that is the case. I’ll show you a more advanced example of using the KPL later on and go through the various API methods, but for now I’ll discuss the problems with the basic implementation you created earlier.

Throughput

Imagine now that your website has gone viral, and as a result each one of your servers is receiving about 5000 ClickEvents objects per second instead of 50. Because BasicClickEventsToKinesis can only do 90 RPS, you’ll need to make some changes to get the throughput you need.

Multithreading

One approach would be to create multiple instances of BasicClickEventsToKinesis, pass each one a reference to the same queue, and then submit all of them to an ExecutorService object. If all goes well, this should multiply your throughput by some large fraction of the number of instances you create. To get to 5000 RPS at 90 RPS each, you would require 56 instances in the ideal case, but in reality you’ll need more because the scaling is not going to be perfectly linear.

Code implementing the above approach is available in the repository as the class MultithreadedClickEventsToKinesis.
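If you just want the shape of the idea, a minimal sketch might look like the following; the worker count of 64 is an illustrative assumption rather than a tuned value, and the class is a simplification of what is in the repository.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SketchMultithreadedClickEventsToKinesis {
    // Illustrative worker count; in practice you would tune this per host.
    private static final int NUM_WORKERS = 64;

    private final List<BasicClickEventsToKinesis> workers = new ArrayList<>();
    private final ExecutorService executor =
            Executors.newFixedThreadPool(NUM_WORKERS);

    public SketchMultithreadedClickEventsToKinesis(
            BlockingQueue<ClickEvent> inputQueue) {
        // Every worker shares the same input queue, so records are spread
        // across all of them.
        for (int i = 0; i < NUM_WORKERS; i++) {
            BasicClickEventsToKinesis worker =
                    new BasicClickEventsToKinesis(inputQueue);
            workers.add(worker);
            executor.submit(worker);
        }
    }

    public long recordsPut() {
        return workers.stream()
                .mapToLong(BasicClickEventsToKinesis::recordsPut)
                .sum();
    }

    public void stop() {
        workers.forEach(BasicClickEventsToKinesis::stop);
        executor.shutdown();
    }
}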

The problem with this approach is that it incurs a large amount of CPU overhead from context switching between threads. In addition to this, sending records one by one also incurs large overheads from signature version 4 and HTTP, further consuming CPU cycles as well as network bandwidth. I’ll discuss the bandwidth aspect next.

Bandwidth Overhead

If you take a look at what’s actually going over the wire, you’ll see that each request looks more or less like the following:

Even though your original data and partition key are only about 370 bytes in size, the HTTP request ends up consuming about 1200 bytes. Some of this is due to expansion from base64-encoding, but the bulk of it comes from the HTTP headers, which constitute just over half of the entire request.

Batching with PutRecords

To improve efficiency, the service team introduced the PutRecords API method in December 2014. This allows many records to be sent with a single HTTP request, thus amortizing the cost of the headers and fixed-cost components of signature version 4. Another benefit of using PutRecords is that you don’t have to incur a round-trip of latency between every single record, even if you don’t use multiple connections in parallel.

The following code example implements a new class called BatchedClickEventsToKinesis to take advantage of these benefits of PutRecords.

public class BatchedClickEventsToKinesis extends AbstractClickEventsToKinesis {
protected AmazonKinesis kinesis;
protected List<PutRecordsRequestEntry> entries;

private int dataSize;

public BatchedClickEventsToKinesis(BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new AmazonKinesisClient().withRegion(
Regions.fromName(REGION));
entries = new ArrayList<>();
dataSize = 0;
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
recordsPut.getAndIncrement();

addEntry(new PutRecordsRequestEntry()
.withPartitionKey(partitionKey)
.withData(data));
}

@Override
public long recordsPut() {
return super.recordsPut() - entries.size();
}

@Override
public void stop() {
super.stop();
flush();
}

protected void flush() {
kinesis.putRecords(new PutRecordsRequest()
.withStreamName(STREAM_NAME)
.withRecords(entries));
entries.clear();
}

protected void addEntry(PutRecordsRequestEntry entry) {
int newDataSize = dataSize + entry.getData().remaining() +
entry.getPartitionKey().length();
if (newDataSize <= 5 * 1024 * 1024 && entries.size() < 500) {
dataSize = newDataSize;
entries.add(entry);
} else {
flush();
dataSize = 0;
addEntry(entry);
}
}
}

This implementation gives you 5500 RPS, a dramatic improvement in the throughput and efficiency of the producer.

PutRecords is not Atomic

There is however a subtle issue with PutRecords in that it is not an atomic operation; unlike the other API methods that either fail or succeed, PutRecords can partially fail. When a batch of multiple records is sent with PutRecords, it’s possible for every record to individually fail or succeed, independent of the other ones in the same batch. Unless an error prevents the entire HTTP request containing the batch from being delivered, PutRecords always returns a 200 result, even if some records within the batch have failed. This means that instead of simply relying on exceptions to indicate errors, you must write code that examines the PutRecordsResult objects to detect individual record failures and take appropriate action.
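For illustration only, a minimal check of a PutRecords response might look like the following sketch; it assumes an AmazonKinesis client and a PutRecordsRequest built as in the class above, and simply logs failures instead of retrying them.

// Sketch: report per-record failures from a single PutRecords call.
// Assumes the surrounding class holds an AmazonKinesis field named kinesis.
private void putAndReportFailures(PutRecordsRequest request) {
    PutRecordsResult result = kinesis.putRecords(request);
    if (result.getFailedRecordCount() > 0) {
        List<PutRecordsResultEntry> resultEntries = result.getRecords();
        for (int i = 0; i < resultEntries.size(); i++) {
            PutRecordsResultEntry entry = resultEntries.get(i);
            if (entry.getErrorCode() != null) {
                // The i-th entry of the request failed; a real producer
                // would decide here whether to retry it.
                System.err.printf("Record %d failed: %s (%s)%n",
                        i, entry.getErrorCode(), entry.getErrorMessage());
            }
        }
    }
}

The Retries section that follows shows a fuller treatment of the same idea.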

Retries

For various reasons, calls to the Amazon Kinesis API can occasionally fail. To avoid losing data because of this, you’ll want to retry failed puts.

Add retries to the BatchedClickEventsToKinesis from earlier by re-implementing the flush() method in a subclass:

public class RetryingBatchedClickEventsToKinesis
extends BatchedClickEventsToKinesis {
private final int MAX_BACKOFF = 30000;
private final int MAX_ATTEMPTS = 5;
private final Set<String> RETRYABLE_ERR_CODES = ImmutableSet.of(
"ProvisionedThroughputExceededException",
"InternalFailure",
"ServiceUnavailable");

private int backoff;
private int attempt;
private Map<PutRecordsRequestEntry, Integer> recordAttempts;

public RetryingBatchedClickEventsToKinesis(
BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
reset();
recordAttempts = new HashMap<>();
}

@Override
protected void flush() {
PutRecordsRequest req = new PutRecordsRequest()
.withStreamName(STREAM_NAME)
.withRecords(entries);

PutRecordsResult res = null;
try {
res = kinesis.putRecords(req);
reset();
} catch (AmazonClientException e) {
if ((e instanceof AmazonServiceException
&& ((AmazonServiceException) e).getStatusCode() / 100 == 4)
|| attempt == MAX_ATTEMPTS) {
reset();
throw e;
}

try {
Thread.sleep(backoff);
} catch (InterruptedException ie) {
ie.printStackTrace();
}
backoff = Math.min(MAX_BACKOFF, backoff * 2);

attempt++;
flush();
return;
}

final List<PutRecordsResultEntry> results = res.getRecords();
List<PutRecordsRequestEntry> retries = IntStream
.range(0, results.size())
.mapToObj(i -> {
PutRecordsRequestEntry e = entries.get(i);
String errorCode = results.get(i).getErrorCode();
int n = recordAttempts.getOrDefault(e, 1) + 1;
// Determine whether the record should be retried
if (errorCode != null &&
RETRYABLE_ERR_CODES.contains(errorCode) &&
n < MAX_ATTEMPTS) {
recordAttempts.put(e, n);
return Optional.of(e);
} else {
recordAttempts.remove(e);
return Optional.<PutRecordsRequestEntry>empty();
}
})
.filter(Optional::isPresent)
.map(Optional::get)
.collect(Collectors.toList());

entries.clear();
retries.forEach(e -> addEntry(e));
}

private void reset() {
attempt = 1;
backoff = 100;
}
}

There’s a lot going on here, so I’ll walk you through all the issues.

Different Types of Errors

The first type of error this code attempts to handle is a complete request failure, indicated by HTTP status code 4XX or 5XX. This occurs when the entire PutRecordsRequest request fails to get processed, for example because the service had an internal error, or if the credentials are invalid. 4XX errors are pointless to retry because they are not recoverable. For 5XX errors, on the other hand, perform an exponential backoff followed by a retry. Only do so if the put failed and you haven’t reached the maximum number of attempts for this record.

Partial Failures

As I discussed earlier, PutRecords calls can partially fail. The second half of the flush() method deals with this. For each entry in PutRecordsResult, match it up with the original PutRecordsRequestEntry request that you tried to put. Then, check whether you need to retry the record based on the error code and number of attempts already made.

Backoff Strategies

Despite all of this, you’re not quite done yet. Notice that you have backoffs if the entire request fails, but not for individual records. What if certain shards had problems but not others? If you performed backoff whenever any record failed, you would unnecessarily delay records and reduce throughput to those shards that are not affected.

On the other hand, if you retried failed records without any backoff (as in the example), then you might end up spamming a shard even though it’s already having problems. Regardless of how you mitigate this problem, you would still want a way to know if it, or some other problem, is occurring in production. You need to add monitoring to the application, so that operators can quickly respond to and fix any issues that occur.

Monitoring

The two main methods of monitoring are metrics and realtime log analysis (RTLA). In AWS, these are offered by Amazon CloudWatch and Amazon CloudWatch Logs, respectively. CloudWatch metrics lets users submit metric data, which is then aggregated over several time windows and made available for retrieval.

Here’s how you modify BasicClickEventsToKinesis to emit a metric:

public class MetricsEmittingBasicClickEventsToKinesis
extends AbstractClickEventsToKinesis {
private final AmazonKinesis kinesis;
private final AmazonCloudWatch cw;

protected MetricsEmittingBasicClickEventsToKinesis(
BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new AmazonKinesisClient().withRegion(
Regions.fromName(REGION));
cw = new AmazonCloudWatchClient().withRegion(Regions.fromName(REGION));
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
recordsPut.getAndIncrement();

PutRecordResult res = kinesis.putRecord(
STREAM_NAME, data, partitionKey);

MetricDatum d = new MetricDatum()
.withDimensions(
new Dimension().withName("StreamName").withValue(STREAM_NAME),
new Dimension().withName("ShardId").withValue(res.getShardId()),
new Dimension().withName("Host").withValue(
InetAddress.getLocalHost().toString()))
.withValue(1.0)
.withMetricName("RecordsPut");
cw.putMetricData(new PutMetricDataRequest()
.withMetricData(d)
.withNamespace("MySampleProducer"));
}
}

As you can see, the code to emit the metric ended up being longer than the code that does the actual work (i.e., putting the record). In practice, to get a good picture of what the system is doing, you’ll need far more than just one metric. This means you’ll need to create utilities to simplify the call site.

Another problem is that every PutMetricData call makes an HTTP request, and the overhead from this is quite substantial because the payload size is small. Perhaps unsurprisingly, the solution to this problem is basically the same for CloudWatch metrics as it is for Amazon Kinesis records.

To address these issues, you’ll end up basically writing a different application, all because you wanted to keep track of how the original application was behaving.
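To make that concrete, one piece of such a utility might buffer MetricDatum objects and flush them in batches with a single PutMetricData call, mirroring what you did for records with PutRecords. The sketch below is an illustration rather than production code; the batch size of 20 is an assumption based on the PutMetricData per-request limit at the time of writing, so check the current CloudWatch limits before relying on it.

import java.util.ArrayList;
import java.util.List;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.MetricDatum;
import com.amazonaws.services.cloudwatch.model.PutMetricDataRequest;

public class BufferedMetricsSender {
    // Assumed per-request limit on the number of datums.
    private static final int MAX_DATUMS_PER_REQUEST = 20;

    private final AmazonCloudWatch cw =
            new AmazonCloudWatchClient().withRegion(Regions.fromName("us-east-1"));
    private final List<MetricDatum> buffer = new ArrayList<>();

    // Buffer a datum and flush once the batch is full.
    public synchronized void record(MetricDatum datum) {
        buffer.add(datum);
        if (buffer.size() >= MAX_DATUMS_PER_REQUEST) {
            flush();
        }
    }

    // Send all buffered datums in a single PutMetricData call.
    public synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        cw.putMetricData(new PutMetricDataRequest()
                .withNamespace("MySampleProducer")
                .withMetricData(buffer));
        buffer.clear();
    }
}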

Aggregation

I’m going to shift the focus now from the throughput of sending records to the throughput of the Amazon Kinesis shards actually receiving them.

Provisioned Capacity

Amazon Kinesis requires the provisioning of capacity through the concept of shards. Each shard can ingest 1 MiB or 1000 records per second, and is throttled if either limit is reached. In a cost-optimal architecture, we all want to have as few shards as possible, but still be able to ingest all data promptly at peak traffic.

In the example here, your records are a mere 350 bytes each. Putting 1000 of these a second results in a throughput of 0.33 MiB/s, only a third of what the shard would’ve allowed you to do in terms of bandwidth. If you can increase your data-to-record ratio by 3x or more, you can get away with having only 1/3 as many shards.

Increasing Capacity Utilization with Aggregation

So how can you increase your shard utilization? The obvious answer is to concatenate multiple records together. If you combined three of your ClickEvents objects, you’d have a 1050 byte record. This is called aggregation.

Aggregation is not as simple as merely concatenating strings together. There are various additional design considerations to keep in mind:

You want to ensure that the consumer application can still access all the records with the same partition key without having to perform distributed sorting or resorting to an application level map. This means you have to be careful about how you group records together.

You want a buffering scheme that imposes a limit on how long records can wait in buffers in order to prevent excessive delays. This means you need to add timestamps and timers to your components.

You want a binary format that’s unambiguous between the aggregated and non-aggregated case, such that every record can be deserialized correctly at the consumer. This requires code not just in the producer, but the consumer as well.

Because of these intricacies, implementing record aggregators can be challenging and time-consuming.
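To make the framing part of the problem concrete, the sketch below shows a naive length-prefixed concatenation of payloads that share a partition key. This is purely an illustration of the ambiguity you have to design around; it is not the aggregation format the KPL actually uses (which is protobuf-based), and a matching deaggregation step would be needed on the consumer side.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class NaiveAggregator {
    // Concatenate several payloads that share a partition key into one
    // length-prefixed blob. Illustration only; not the KPL's format.
    public static byte[] aggregate(List<String> payloads) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String payload : payloads) {
            byte[] data = payload.getBytes(StandardCharsets.UTF_8);
            out.writeInt(data.length);   // length prefix
            out.write(data);             // payload bytes
        }
        out.flush();
        return bytes.toByteArray();
    }
}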

Amazon Kinesis Producer Library

Because the challenges of batching, retries, and monitoring are shared by the majority of Amazon Kinesis application developers, we decided to build the Amazon Kinesis Producer Library (KPL). The KPL implements solutions to these problems and frees developers from having to worry about how best to stream their data into Amazon Kinesis, allowing them to focus on the actual business logic of their applications instead.

Architecture

The diagram below shows an overview of the internal architecture of the KPL:

KPL Architecture

Native Process

The KPL is split into a native process and a wrapper. The native process does all the actual work of processing and sending records, while the wrapper manages the native process and communicates with it.

Each KPL native process can process records from multiple streams at once, as long as they’re in the same region and use the same credentials for API calls. Each stream is handled by a separate pipeline, which contains components performing the functions I’ve discussed throughout this post.

IPC

Communication between the wrapper and the native process takes place over named pipes (either Unix or Windows), and the messages are implemented with Google protocol buffers. It’s possible to build your own KPL wrapper using the protobuf message definitions available in the source code. The native process is agnostic to how the wrapper is implemented as long as the messages sent to it are valid.

Multithreading

The native process is multithreaded, and can process multiple records simultaneously. This allows it to scale to higher throughputs on systems with more CPU cores. The threads running the pipelines do not participate in I/O to avoid unnecessarily blocking internal processing. Dedicated threads service the IPC manager and HTTP client. The latter uses asynchronous I/O to avoid needing a thread pool.

In practice, all the details are abstracted away by the wrapper, so that for the most part, the KPL should behave just like a regular library in the wrapper’s language.

Using the KPL

The KPL provides the following features out of the box:

Batching of puts using PutRecords (the Collector in the architecture diagram)

Tracking of record age and enforcement of maximum buffering times (all components)

Per-shard record aggregation (the Aggregator)

Retries in case of errors, with ability to distinguish between retryable and non-retryable errors (the Retrier)

Per-shard rate limiting to prevent excessive and pointless spamming (the Limiter)

Useful metrics and a highly efficient CloudWatch client (not shown in diagram)

Configuration

Many configuration options are available to customize the behavior of many of the components. To get started, however, the only things you need are AWS credentials and the AWS region. The KPL attempts to use the DefaultAWSCredentialsProviderChain if no credentials providers are given in the config. In addition, the KPL automatically attempts to fetch the region from Amazon EC2 metadata, so when you are running in Amazon EC2, no configuration may be needed at all.

Currently, configuration is global and applies to all streams. This includes the credentials used for putting records and CloudWatch metrics. If you wish to have different configurations for different streams, it’s feasible to create multiple instances of the KPL, as long as the number is reasonably small (e.g., 10 to 15). Because the core is native code, the fixed overhead per instance is small.
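As an illustration of that multiple-instance approach, the following hypothetical sketch creates two producers with different regions and buffering limits. The stream purposes, regions, and the 100 ms value are made-up examples; setRegion and setRecordMaxBufferedTime are the same configuration calls used earlier in this post.

import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

public class PerStreamProducers {
    // Hypothetical: one producer tuned for throughput, one for low latency.
    private final KinesisProducer clickstreamProducer = new KinesisProducer(
            new KinesisProducerConfiguration()
                    .setRegion("us-east-1")
                    .setRecordMaxBufferedTime(5000));

    private final KinesisProducer auditProducer = new KinesisProducer(
            new KinesisProducerConfiguration()
                    .setRegion("us-west-2")
                    .setRecordMaxBufferedTime(100));
}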

API

Now take a look at the class AdvancedKPLClickEventsToKinesis, to take a tour of the features of the KPL Java API:

public class AdvancedKPLClickEventsToKinesis
extends AbstractClickEventsToKinesis {
private static final Random RANDOM = new Random();
private static final Log log = LogFactory.getLog(
AdvancedKPLClickEventsToKinesis.class);

private final KinesisProducer kinesis;

protected AdvancedKPLClickEventsToKinesis(
BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new KinesisProducer(new KinesisProducerConfiguration()
.setRegion(REGION));
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
String payload = event.getPayload();
ByteBuffer data = ByteBuffer.wrap(payload.getBytes("UTF-8"));
while (kinesis.getOutstandingRecordsCount() > 1e4) {
Thread.sleep(1);
}
recordsPut.getAndIncrement();

ListenableFuture<UserRecordResult> f =
kinesis.addUserRecord(STREAM_NAME, partitionKey, data);
Futures.addCallback(f, new FutureCallback<UserRecordResult>() {
@Override
public void onSuccess(UserRecordResult result) {
long totalTime = result.getAttempts().stream()
.mapToLong(a -> a.getDelay() + a.getDuration())
.sum();
// Only log with a small probability, otherwise it’ll be very
// spammy
if (RANDOM.nextDouble() < 1e-5) {
log.info(String.format(
"Succesfully put record, partitionKey=%s, "
+ "payload=%s, sequenceNumber=%s, "
+ "shardId=%s, took %d attempts, "
+ "totalling %s ms",
partitionKey, payload, result.getSequenceNumber(),
result.getShardId(), result.getAttempts().size(),
totalTime));
}
}

@Override
public void onFailure(Throwable t) {
if (t instanceof UserRecordFailedException) {
UserRecordFailedException e =
(UserRecordFailedException) t;
UserRecordResult result = e.getResult();

String errorList =
StringUtils.join(result.getAttempts().stream()
.map(a -> String.format(
"Delay after prev attempt: %d ms, "
+ "Duration: %d ms, Code: %s, "
+ "Message: %s",
a.getDelay(), a.getDuration(),
a.getErrorCode(),
a.getErrorMessage()))
.collect(Collectors.toList()), "\n");

log.error(String.format(
"Record failed to put, partitionKey=%s, "
+ "payload=%s, attempts:n%s",
partitionKey, payload, errorList));
}
};
});
}

@Override
public long recordsPut() {
try {
return kinesis.getMetrics("UserRecordsPut").stream()
.filter(m -> m.getDimensions().size() == 2)
.findFirst()
.map(Metric::getSum)
.orElse(0.0)
.longValue();
} catch (Exception e) {
throw new RuntimeException(e);
}
}

@Override
public void stop() {
super.stop();
kinesis.flushSync();
kinesis.destroy();
}
}

Backpressure

First, examine the runOnce() method. You’ll notice that you’re calling getOutstandingRecordsCount() in a loop. This is a measure of the backpressure in the system, which should be checked before putting more records, to avoid exhausting system resources.

Async and Futures

Moving on, the KPL has an asynchronous interface that does not block. When addUserRecord is called, the record is placed into a queue serviced by another thread and the method returns immediately with a ListenableFuture object. To this, a FutureCallback object can be added that is invoked when the record completes. Adding the callback is optional, which means the API can be used in a fire-and-forget fashion.

The KPL always returns a UserRecordResult object in the future, whether the record succeeded or not, but because of the way ListenableFuture is designed, retrieving the UserRecordResult is slightly different between the success and failure cases. In the latter, it is wrapped in a UserRecordFailedException object.

Information about Put Attempts

Every UserRecordResult object contains a list of Attempt objects describing each attempt at sending a particular record to Amazon Kinesis. Timing information and error messages from failures are available. This is useful for application logging and debugging, as demonstrated in the onFailure() method. The same information is also uploaded to CloudWatch as metrics named BufferingTime and RequestTime, and there are individual error count metrics (one per unique error code) as well. Attempt objects provide this information at a per-record granularity, while the CloudWatch metrics are aggregated over a shard or stream.

Metrics

Now take a look at the recordsPut() method. You can retrieve metrics from the current KinesisProducer instance directly with the getMetrics() methods. These are the same metrics as those that are uploaded to CloudWatch, the difference being that getMetrics() retrieves the data for the instance on which it is called, whereas the data in CloudWatch is aggregated over all instances sharing the same metrics namespace. The overload getMetrics(String, int) allows you to get data aggregated over a specified number of seconds. This is useful for tracking sliding-window statistics. If called without the int argument, getMetrics() returns the cumulative statistics from the time the KinesisProducer instance started to the present moment. For a list of the available metrics, see the Monitoring Amazon Kinesis with Amazon CloudWatch topic in the Amazon Kinesis Developer’s Guide.
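For example, a sliding-window variant of the method above might total UserRecordsPut over just the last 30 seconds; this sketch assumes the same kinesis field and surrounding class as shown earlier.

// Assumes a KinesisProducer field named kinesis, as in the class above.
// Returns the number of user records put during the last 30 seconds.
public long recordsPutLast30Seconds() {
    try {
        return kinesis.getMetrics("UserRecordsPut", 30).stream()
                .filter(m -> m.getDimensions().size() == 2)
                .findFirst()
                .map(Metric::getSum)
                .orElse(0.0)
                .longValue();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}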

Termination

Finally, look at the stop() method. Because the KPL uses a child process, it is necessary to explicitly terminate the child if you wish to do so without exiting the JVM. This is what the destroy() method does. Note that destroy() does not attempt to wait for buffered or in-flight records. To ensure all records are sent before killing the child, call flushSync(). This method blocks until all records are complete. A non-blocking flush() method is also available.

Conclusion

In this post, I’ve discussed several challenges that are commonly encountered when implementing producer applications for Amazon Kinesis: the need to aggregate records and use batch puts while controlling buffering delays, the need to retry records in case of failures, and the need to monitor the application using metrics.

At the same time, I showed you how to develop a sample application and evolve it through several stages of increasing complexity in an attempt to address those needs. Despite having a non-trivial amount of code, the sample falls far short of the requirements for an efficient and reliable producer library that can be reused across many different applications.

The KPL was developed in recognition of the fact that these same challenges are faced by many customers, and it presents a general solution that customers can leverage to save time and effort when developing Amazon Kinesis-based solutions.

You can get started using the KPL with just a few lines of code, but the API is also flexible enough to cater to more advanced use cases, including retrieving detailed information about record puts and realtime metrics from the currently running instance.

Hopefully, this post has given you a deeper understanding of Amazon Kinesis producer applications and provided some insight into how to develop or evolve your own solution by leveraging the KPL.

If you have questions or suggestions, please leave a comment below.

————————

Related:

Snakes in the Stream! Feeding and Eating Amazon Kinesis Streams with Python

Python Kinesis

 

 

Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library

Post Syndicated from Kevin Deng original https://blogs.aws.amazon.com/bigdata/post/Tx3ET30EGDKUUI2/Implementing-Efficient-and-Reliable-Producers-with-the-Amazon-Kinesis-Producer-L

Kevin Deng is an SDE with the Amazon Kinesis team and is the lead author of the Amazon Kinesis Producer Library

How do you vertically scale an Amazon Kinesis producer application by 100x? While it’s easy to get started with streaming data into Amazon Kinesis, streaming large volumes of data efficiently and reliably presents some challenges. When a host needs to send many records per second (RPS) to Amazon Kinesis, simply calling the basic PutRecord API action in a loop is inadequate.

To reduce overhead and increase throughput, the application must batch records and implement parallel HTTP requests. It also must deal with the transient failures inherent to any network application and perform retries as needed. And in a large-scale application, monitoring is needed to allow operators to diagnose and troubleshoot any issues that arise.

In this post, you’ll develop a sample producer application and evolve it through several stages of complexity to learn about these challenges. I discuss each challenge at a conceptual level and present sample solutions for some of them. The goal is to help you understand what problems the KPL solves and, at a high level, how it solves them.

All code in this post is available in the AWS Big Data Blog Github repository.

Scenario

Consider a typical use case of Amazon Kinesis: realtime clickstream analysis. In this setup, a webserver receives requests from the browser about the links a user has clicked on a website. A typical request might contain a payload that looks like this:

{
"sessionId": "seXACNTS3FoQuqTVxAM",
"fmt": "1",
"num": "1",
"cv": "7",
"frm": "2",
"url": "http%3A//www.mywebsite.com/activityi%3Bsrc%3D1782317%3Btype%3D2015s004%3Bcat%3Dgs6pr0%3Bord%3D3497210042551.1597%3F",
"ref": "http%3A//www.mywebsite.com/us/explore/myproduct-features-and-specs/%3Fcid%3Dppc-",
"random": "3833153354"
}

This payload is 342 bytes long, and the examples in this post assume that most of the requests coming in are roughly this size (~350 bytes).

After receiving the request, the webserver sends it to Amazon Kinesis to be consumed by an analysis application. This application provides the business intelligence in near-realtime that can be used for various purposes. Don’t worry about the consumer in this post; focus only on the producer–in this case, the webserver.

Overall Strategy

To avoid interfering with other workloads such as actually serving pages or receiving other requests, the webserver doesn’t try to send the data to Amazon Kinesis within its request handler. Instead, it places the payload of each request into a queue to be processed by a separate component. Call this component ClickEventsToKinesis; its implementation is what I discuss in the rest of this post. Start with the following abstract class that all subsequent implementations inherit from:

public abstract class AbstractClickEventsToKinesis implements Runnable {
protected final static String STREAM_NAME = "myStream";
protected final static String REGION = "us-east-1";

protected final BlockingQueue<ClickEvent> inputQueue;
protected volatile boolean shutdown = false;
protected final AtomicLong recordsPut = new AtomicLong(0);

protected AbstractClickEventsToKinesis(
BlockingQueue<ClickEvent> inputQueue) {
this.inputQueue = inputQueue;
}

@Override
public void run() {
while (!shutdown) {
try {
runOnce();
} catch (Exception e) {
e.printStackTrace();
}
}
}

public long recordsPut() {
return recordsPut.get();
}

public void stop() {
shutdown = true;
}

protected abstract void runOnce() throws Exception;
}

In this code, ClickEvent is a trivial wrapper around the payload, with the sessionId field extracted. Use the sessionId as your partition key when putting into Amazon Kinesis.

public class ClickEvent {
private String sessionId;
private String payload;

public ClickEvent(String sessionId, String payload) {
this.sessionId = sessionId;
this.payload = payload;
}

public String getSessionId() {
return sessionId;
}

public String getPayload() {
return payload;
}
}

For simplicity, assume that the server runs on an Amazon EC2 instance that has an appropriate IAM role such that you can simply use DefaultAWSCredentialsProviderChain for credentials.

Basic Producer

Imagine that you have just launched your website. Traffic is fairly low at the moment, and each one of your servers is getting about 50 ClickEvents objects per second. Start with the simplest possible implementation: take the ClickEvents objects from the queue one at a time and send each to Amazon Kinesis synchronously. At this point, make no attempt at handlings errors or performing retries should any of the puts fail.

public class BasicClickEventsToKinesis extends AbstractClickEventsToKinesis {
private final AmazonKinesis kinesis;

public BasicClickEventsToKinesis(BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new AmazonKinesisClient().withRegion(
Regions.fromName(REGION));
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
kinesis.putRecord(STREAM_NAME, data, partitionKey);
recordsPut.getAndIncrement();
}
}

That didn’t take too long, not bad at all. This implementation attains a throughput of about 90 records per second (RPS) in us-east-1 on a c3.2xlarge Ubuntu 15 instance, sufficient for your present needs.

Using the KPL – Quick Preview

Take a look at a similar implementation that uses the KPL instead:

public class KPLClickEventsToKinesis extends AbstractClickEventsToKinesis {
private final KinesisProducer kinesis;

public KPLClickEventsToKinesis(BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new KinesisProducer(new KinesisProducerConfiguration()
.setRegion(REGION)
.setRecordMaxBufferedTime(5000));
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
while (kinesis.getOutstandingRecordsCount() > 5e4) {
Thread.sleep(1);
}
kinesis.addUserRecord(STREAM_NAME, partitionKey, data);
recordsPut.getAndIncrement();
}
}

While the two implementations are very similar on the surface, a significant amount of additional logic is performed by the KPL-based implementation underneath the covers. The latter achieves a throughput of about 76,000 RPS when run on the same host, which is about 800 times greater. As you go through the rest of the post, it should become clear why that is the case. I’ll show you a more advanced example of using the KPL later on and go through the various API methods, but for now I’ll discuss the problems with the basic implementation you created earlier.

Throughput

Imagine now that your website has gone viral, and as a result each one of your servers is receiving about 5000 ClickEvents objects per second instead of 50. Because BasicClickEventsToKinesis can only do 90 RPS, you’ll need to make some changes to get the throughput you need.

Multithreading

One approach would be to create multiple instances of BasicClickEventsToKinesis, pass each one a reference to the same queue, and then submit all of them to an ExecutorService object. If all goes well, this should multiply your throughput by some large fraction of the number of instances you create. To get to 5000 RPS at 90 RPS each, you would require 56 instances in the ideal case, but in reality you’ll need more because the scaling is not going to be perfectly linear.

Code implementing the above approach is available in the repository as the class MultithreadedClickEventsToKinesis.

The problem with this approach is that it incurs a large amount of CPU overhead from context switching between threads. In addition to this, sending records one by one also incurs large overheads from signature version 4 and HTTP, further consuming CPU cycles as well as network bandwidth. I’ll discuss the bandwidth aspect next.

Bandwidth Overhead

If you take a look at what’s actually going over the wire, you’ll see that each request looks more or less like the following:

Even though your original data and partition key are only about 370 bytes in size, the HTTP request ends up consuming about 1200 bytes. Some of this is due to expansion from base64-encoding, but the bulk of it comes from the HTTP headers, which constitute just over half of the entire request.

Batching with PutRecords

To improve efficiency, the service team introduced the PutRecords API method in December 2014. This allows many records to be sent with a single HTTP request, thus amortizing the cost of the headers and fixed-cost components of signature version 4. Another benefit of using PutRecords is that you don’t have to incur a round-trip of latency between every single record, even if you don’t use multiple connections in parallel.

The following code example implements a new class called BatchedClickEventsToKinesis to take advantage of these benefits of PutRecords.

public class BatchedClickEventsToKinesis extends AbstractClickEventsToKinesis {
protected AmazonKinesis kinesis;
protected List<PutRecordsRequestEntry> entries;

private int dataSize;

public BatchedClickEventsToKinesis(BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new AmazonKinesisClient().withRegion(
Regions.fromName(REGION));
entries = new ArrayList<>();
dataSize = 0;
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
recordsPut.getAndIncrement();

addEntry(new PutRecordsRequestEntry()
.withPartitionKey(partitionKey)
.withData(data));
}

@Override
public long recordsPut() {
return super.recordsPut() – entries.size();
}

@Override
public void stop() {
super.stop();
flush();
}

protected void flush() {
kinesis.putRecords(new PutRecordsRequest()
.withStreamName(STREAM_NAME)
.withRecords(entries));
entries.clear();
}

protected void addEntry(PutRecordsRequestEntry entry) {
int newDataSize = dataSize + entry.getData().remaining() +
entry.getPartitionKey().length();
if (newDataSize <= 5 * 1024 * 1024 && entries.size() < 500) {
dataSize = newDataSize;
entries.add(entry);
} else {
flush();
dataSize = 0;
addEntry(entry);
}
}
}

This implementation gives you 5500 RPS, a dramatic improvement in the throughput and efficiency of the producer.

PutRecords is not Atomic

There is however a subtle issue with PutRecords in that it is not an atomic operation; unlike the other API methods that either fail or succeed, PutRecords can partially fail. When a batch of multiple records is sent with PutRecords, it’s possible for every record to individually fail or succeed, independent of the other ones in the same batch. Unless an error prevents the entire HTTP request containing the batch from being delivered, PutRecords always returns a 200 result, even if some records within the batch have failed. This means that instead of simply relying on exceptions to indicate errors, you must write code that examines the PutRecordsResult objects to detect individual record failures and take appropriate action.

Retries

For various reasons, calls to the Amazon Kinesis API can occasionally fail. To avoid losing data because of this, you’ll want to retry failed puts.

Add retries to the BatchedClickEventsToKinesis from earlier by re-implementing the flush() method in a subclass:

public class RetryingBatchedClickEventsToKinesis
extends BatchedClickEventsToKinesis {
private final int MAX_BACKOFF = 30000;
private final int MAX_ATTEMPTS = 5;
private final Set<String> RETRYABLE_ERR_CODES = ImmutableSet.of(
"ProvisionedThroughputExceededException",
"InternalFailure",
"ServiceUnavailable");

private int backoff;
private int attempt;
private MapSet<PutRecordsRequestEntry, Integer> recordAttempts;

public RetryingBatchedClickEventsToKinesis(
BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
reset();
recordAttempts = new HashMap<>();
}

@Override
protected void flush() {
PutRecordsRequest req = new PutRecordsRequest()
.withStreamName(STREAM_NAME)
.withRecords(entries);

PutRecordsResult res = null;
try {
res = kinesis.putRecords(req);
reset();
} catch (AmazonClientException e) {
if ((e instanceof AmazonServiceException
&& ((AmazonServiceException) e).getStatusCode() / 100 == 4)
|| attempt == MAX_ATTEMPTS) {
reset();
throw e;
}

try {
Thread.sleep(backoff);
} catch (InterruptedException ie) {
ie.printStackTrace();
}
Math.min(MAX_BACKOFF, backoff *= 2);

attempt++;
flush();
}

final List<PutRecordsResultEntry> results = res.getRecords();
List<PutRecordsRequestEntry> retries = IntStream
.range(0, results.size())
.mapToObj(i -> {
PutRecordsRequestEntry e = entries.get(i);
String errorCode = results.get(i).getErrorCode();
int n = recordAttempts.getOrDefault(e, 1) + 1;
// Determine whether the record should be retried
if (errorCode != null &&
RETRYABLE_ERR_CODES.contains(errorCode) &&
n < MAX_ATTEMPTS) {
recordAttempts.put(e, n);
return Optional.of(e);
} else {
recordAttempts.remove(e);
return Optional.<PutRecordsRequestEntry>empty();
}
})
.filter(Optional::isPresent)
.map(Optional::get)
.collect(Collectors.toList());

entries.clear();
retries.forEach(e -> addEntry(e));
}

private void reset() {
attempt = 1;
backoff = 100;
}
}

There’s a lot going on here, so I’ll walk you through all the issues.

Different Types of Errors

The first type of error this code attempts to handle is a complete request failure, indicated by HTTP status code 4XX or 5XX. This occurs when the entire PutRecordsRequest request fails to get processed, for example because the service had an internal error, or if the credentials are invalid. 4XX errors are pointless to retry because they are not recoverable. For 5XX errors, on the other hand, perform an exponential backoff followed by a retry. Only do so if the put failed and you haven’t reached the maximum number of attempts for this record.

Partial Failures

As I discussed earlier, PutRecords calls can partially fail. The second half of the flush() method deals with this. For each entry in PutRecordsResult, match it up with the original PutRecordsRequestEntry request that you tried to put. Then, check whether you need to retry the record based on the error code and number of attempts already made.

Backoff Strategies

Despite all of this, you’re not quite done yet. Notice that you have backoffs if the entire request fails, but not for individual records. What if certain shards had problems but not others? If you performed backoff whenever any record failed, you would unnecessarily delay records and reduce throughput to those shards that are not affected.

On the other hand, if you retried failed records without any backoff (as in the example), then you might end up spamming a shard even though it’s already having problems. Regardless of how you mitigate this problem, you would still want a way to know if it, or some other problem, is occurring in production. You need to add monitoring to the application, so that operators can quickly respond to and fix any issues that occur.

Monitoring

The two main methods of monitoring are metrics and realtime log analysis (RTLA). In AWS, these are offered by Amazon CloudWatch and Amazon CloudWatch Logs respectively. CloudWatch metrics allows users to submit metric data, which are then aggregated over several time windows, and made available for retrieval.

Here’s how you modify BasicClickEventsToKinesis to emit a metric:

public class MetricsEmittingBasicClickEventsToKinesis
extends AbstractClickEventsToKinesis {
private final AmazonKinesis kinesis;
private final AmazonCloudWatch cw;

protected MetricsEmittingBasicClickEventsToKinesis(
BlockingQueue<ClickEvent> inputQueue) {
super(inputQueue);
kinesis = new AmazonKinesisClient().withRegion(
Regions.fromName(REGION));
cw = new AmazonCloudWatchClient().withRegion(Regions.fromName(REGION));
}

@Override
protected void runOnce() throws Exception {
ClickEvent event = inputQueue.take();
String partitionKey = event.getSessionId();
ByteBuffer data = ByteBuffer.wrap(
event.getPayload().getBytes("UTF-8"));
recordsPut.getAndIncrement();

PutRecordResult res = kinesis.putRecord(
STREAM_NAME, data, partitionKey);

MetricDatum d = new MetricDatum()
.withDimensions(
new Dimension().withName("StreamName").withValue(STREAM_NAME),
new Dimension().withName("ShardId").withValue(res.getShardId()),
new Dimension().withName("Host").withValue(
InetAddress.getLocalHost().toString()))
.withValue(1.0)
.withMetricName("RecordsPut");
cw.putMetricData(new PutMetricDataRequest()
.withMetricData(d)
.withNamespace("MySampleProducer"));
}
}

As you can see, the code to emit the metric ended up being longer than the code that does the actual work (i.e., putting the record). In practice, to get a good picture of what the system is doing, you’ll need far more than just one metric. This means you’ll need to create utilities to simplify the call site.

Another problem is that every PutMetricData call makes a HTTP request, and the overhead from this is quite substantial because the payload size is small. Perhaps unsurprisingly, the solution to this problem is basically the same for CloudWatch metrics as it is for Amazon Kinesis records.

To address these issues, you’ll end up basically writing a different application, all because you wanted to keep track of how the original application was behaving.

Aggregation

I’m going to shift the focus now from the throughput of sending records to the throughput of the Amazon Kinesis shards actually receiving them.

Provisioned Capacity

Amazon Kinesis requires the provisioning of capacity through the concept of shards. Each shard can ingest 1 MiB or 1000 records per second, and is throttled if either limit is reached. In a cost-optimal architecture, we all want to have as few shards as possible, but still be able to ingest all data promptly at peak traffic.

In the example here, your records are a mere 350 bytes each. Putting 1000 of these a second results in a throughput of 0.33 MiB/s, only a third of what the shard would’ve allowed you to do in terms of bandwidth. If you can increase your data-to-record ratio by 3x or more, you can get away with having only 1/3 as many shards.

Increasing Capacity Utilization with Aggregation

So how can you increase your shard utilization? The obvious answer is to concatenate multiple records together. If you combined three of your ClickEvents objects, you’d have a 1050 byte record. This is called aggregation.

Aggregation is not always as simple as simply concatenating strings together. There are various additional design considerations to keep in mind:

You want to ensure that the consumer application can still access all the records with the same partition key without having to perform distributed sorting or resorting to an application level map. This means you have to be careful about how you group records together.

You want a buffering scheme that imposes a limit on how long records can wait in buffers in order to prevent excessive delays. This means you need to add timestamps and timers to your components.

You want a binary format that’s unambiguous between the aggregated and non-aggregated case, such that every record can be deserialized correctly at the consumer. This requires code not just in the producer, but the consumer as well.

Because of these intricacies, implementing record aggregators can be challenging and time-consuming.
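
To make these considerations concrete, here is a deliberately simplified sketch of a per-partition-key aggregator. It is not the KPL's actual wire format (the KPL is described below); it only illustrates grouping by partition key, a maximum buffering time, and an unambiguous length-prefixed encoding behind a magic header so a consumer can tell aggregated records apart from plain ones. The class name, magic bytes, and thresholds are all assumptions for illustration.

// Illustrative only: a toy aggregator that groups payloads by partition key,
// flushes on size or age, and length-prefixes each payload behind a magic
// header so a consumer can distinguish aggregated from plain records.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ToyAggregator {
    // Assumed magic header; the real KPL uses its own binary format.
    private static final byte[] MAGIC = "AGG1".getBytes(StandardCharsets.US_ASCII);
    private static final int MAX_BYTES = 25 * 1024;   // flush well before the record size limit
    private static final long MAX_AGE_MILLIS = 100;   // cap buffering delay

    private final String partitionKey;                // one aggregator per key
    private final List<byte[]> payloads = new ArrayList<>();
    private long firstAddTime = -1;
    private int bufferedBytes = 0;

    ToyAggregator(String partitionKey) {
        this.partitionKey = partitionKey;
    }

    // Adds a payload; returns an aggregated record if the buffer is full.
    ByteBuffer add(byte[] payload) {
        if (payloads.isEmpty()) {
            firstAddTime = System.currentTimeMillis();
        }
        payloads.add(payload);
        bufferedBytes += payload.length + 4;           // 4-byte length prefix
        return bufferedBytes >= MAX_BYTES ? drain() : null;
    }

    // Called periodically by a timer to bound how long records wait.
    ByteBuffer drainIfOld() {
        boolean old = !payloads.isEmpty()
                && System.currentTimeMillis() - firstAddTime > MAX_AGE_MILLIS;
        return old ? drain() : null;
    }

    private ByteBuffer drain() {
        ByteBuffer out = ByteBuffer.allocate(MAGIC.length + bufferedBytes);
        out.put(MAGIC);
        for (byte[] p : payloads) {
            out.putInt(p.length).put(p);               // length-prefixed, unambiguous
        }
        payloads.clear();
        bufferedBytes = 0;
        out.flip();
        return out;                                    // put with this.partitionKey
    }
}

A producer would keep one such aggregator per partition key, plus a timer thread calling drainIfOld() so low-traffic keys are not delayed indefinitely; the consumer checks for the magic header and, if present, reads length-prefixed payloads until the buffer is exhausted.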

Amazon Kinesis Producer Library

Because the challenges of batching, retries, and monitoring are shared by the majority of Amazon Kinesis application developers, we decided to build the Amazon Kinesis Producer Library (KPL). The KPL implements solutions to these problems and frees developers from having to worry about how best to stream their data into Amazon Kinesis, allowing them to focus on the actual business logic of their applications instead.

Architecture

The diagram below shows an overview of the internal architecture of the KPL:

KPL Architecture

Native Process

The KPL is split into a native process and a wrapper. The native process does all the actual work of processing and sending records, while the wrapper manages the native process and communicates with it.

Each KPL native process can process records from multiple streams at once, as long as they’re in the same region and use the same credentials for API calls. Each stream is handled by a separate pipeline, which contains components performing the functions I’ve discussed throughout this post.

IPC

Communication between the wrapper and the native process takes place over named pipes (either Unix or Windows), and the messages are implemented with Google protocol buffers. It’s possible to build your own KPL wrapper using the protobuf message definitions available in the source code. The native process is agnostic to how the wrapper is implemented as long as the messages sent to it are valid.

Multithreading

The native process is multithreaded, and can process multiple records simultaneously. This allows it to scale to higher throughputs on systems with more CPU cores. The threads running the pipelines do not participate in I/O to avoid unnecessarily blocking internal processing. Dedicated threads service the IPC manager and HTTP client. The latter uses asynchronous I/O to avoid needing a thread pool.

In practice, all the details are abstracted away by the wrapper, so that for the most part, the KPL should behave just like a regular library in the wrapper’s language.

Using the KPL

The KPL provides the following features out of the box:

Batching of puts using PutRecords (the Collector in the architecture diagram)

Tracking of record age and enforcement of maximum buffering times (all components)

Per-shard record aggregation (the Aggregator)

Retries in case of errors, with ability to distinguish between retryable and non-retryable errors (the Retrier)

Per-shard rate limiting to prevent excessive and pointless spamming (the Limiter)

Useful metrics and a highly efficient CloudWatch client (not shown in diagram)

Configuration

Many configuration options are available to customize the behavior of many of the components. To get started, however, the only things you need are AWS credentials and the AWS region. The KPL attempts to use the DefaultAWSCredentialsProviderChain if no credentials providers are given in the config. In addition, the KPL automatically attempts to fetch the region from Amazon EC2 metadata, so when you are running in Amazon EC2, no configuration may be needed at all.

Currently, configuration is global and applies to all streams. This includes the credentials used for putting records and CloudWatch metrics. If you wish to have different configurations for different streams, it’s feasible to create multiple instances of the KPL, as long as the number is reasonably small (e.g., 10 to 15). Because the core is native code, the fixed overhead per instance is small.
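
As a hedged sketch of what that looks like, two independently configured producers can coexist in one JVM. Only setRegion appears in the sample below; the other setters (setCredentialsProvider, setRecordMaxBufferedTime, setMetricsNamespace) are taken from KinesisProducerConfiguration and should be verified against the KPL version you use.

// Sketch: two KinesisProducer instances with different configurations.
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

final class Producers {
    static KinesisProducer forClickstream() {
        return new KinesisProducer(new KinesisProducerConfiguration()
                .setRegion("us-east-1")
                .setCredentialsProvider(new DefaultAWSCredentialsProviderChain())
                .setRecordMaxBufferedTime(100)   // milliseconds of buffering allowed
                .setMetricsNamespace("ClickstreamProducer"));
    }

    static KinesisProducer forAuditLog() {
        // Different credentials and a longer buffering window for a
        // lower-priority stream.
        return new KinesisProducer(new KinesisProducerConfiguration()
                .setRegion("us-west-2")
                .setCredentialsProvider(new ProfileCredentialsProvider("audit"))
                .setRecordMaxBufferedTime(1000)
                .setMetricsNamespace("AuditLogProducer"));
    }
}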

API

Now take a look at the AdvancedKPLClickEventsToKinesis class for a tour of the features of the KPL Java API:

public class AdvancedKPLClickEventsToKinesis
        extends AbstractClickEventsToKinesis {
    private static final Random RANDOM = new Random();
    private static final Log log = LogFactory.getLog(
            AdvancedKPLClickEventsToKinesis.class);

    private final KinesisProducer kinesis;

    protected AdvancedKPLClickEventsToKinesis(
            BlockingQueue<ClickEvent> inputQueue) {
        super(inputQueue);
        kinesis = new KinesisProducer(new KinesisProducerConfiguration()
                .setRegion(REGION));
    }

    @Override
    protected void runOnce() throws Exception {
        ClickEvent event = inputQueue.take();
        String partitionKey = event.getSessionId();
        String payload = event.getPayload();
        ByteBuffer data = ByteBuffer.wrap(payload.getBytes("UTF-8"));
        while (kinesis.getOutstandingRecordsCount() > 1e4) {
            Thread.sleep(1);
        }
        recordsPut.getAndIncrement();

        ListenableFuture<UserRecordResult> f =
                kinesis.addUserRecord(STREAM_NAME, partitionKey, data);
        Futures.addCallback(f, new FutureCallback<UserRecordResult>() {
            @Override
            public void onSuccess(UserRecordResult result) {
                long totalTime = result.getAttempts().stream()
                        .mapToLong(a -> a.getDelay() + a.getDuration())
                        .sum();
                // Only log with a small probability, otherwise it'll be very
                // spammy
                if (RANDOM.nextDouble() < 1e-5) {
                    log.info(String.format(
                            "Successfully put record, partitionKey=%s, "
                                    + "payload=%s, sequenceNumber=%s, "
                                    + "shardId=%s, took %d attempts, "
                                    + "totalling %s ms",
                            partitionKey, payload, result.getSequenceNumber(),
                            result.getShardId(), result.getAttempts().size(),
                            totalTime));
                }
            }

            @Override
            public void onFailure(Throwable t) {
                if (t instanceof UserRecordFailedException) {
                    UserRecordFailedException e =
                            (UserRecordFailedException) t;
                    UserRecordResult result = e.getResult();

                    String errorList =
                            StringUtils.join(result.getAttempts().stream()
                                    .map(a -> String.format(
                                            "Delay after prev attempt: %d ms, "
                                                    + "Duration: %d ms, Code: %s, "
                                                    + "Message: %s",
                                            a.getDelay(), a.getDuration(),
                                            a.getErrorCode(),
                                            a.getErrorMessage()))
                                    .collect(Collectors.toList()), "\n");

                    log.error(String.format(
                            "Record failed to put, partitionKey=%s, "
                                    + "payload=%s, attempts:\n%s",
                            partitionKey, payload, errorList));
                }
            }
        });
    }

    @Override
    public long recordsPut() {
        try {
            return kinesis.getMetrics("UserRecordsPut").stream()
                    .filter(m -> m.getDimensions().size() == 2)
                    .findFirst()
                    .map(Metric::getSum)
                    .orElse(0.0)
                    .longValue();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void stop() {
        super.stop();
        kinesis.flushSync();
        kinesis.destroy();
    }
}

Backpressure

First, examine the runOnce() method. You’ll notice that you’re calling getOutstandingRecordsCount() in a loop. This is a measure of the backpressure in the system, which should be checked before putting more records, to avoid exhausting system resources.

Async and Futures

Moving on, the KPL has an asynchronous interface that does not block. When addUserRecord is called, the record is placed into a queue serviced by another thread and the method returns immediately with a ListenableFuture object. To this, a FutureCallback object can be added that is invoked when the record completes. Adding the callback is optional, which means the API can be used in a fire-and-forget fashion.

The KPL always returns a UserRecordResult object in the future, whether the record succeeded or not, but because of the way ListenableFuture is designed, retrieving the UserRecordResult is slightly different between the success and failure cases. In the latter, it is wrapped in a UserRecordFailedException object.

Information about Put Attempts

Every UserRecordResult object contains a list of Attempt objects describing each attempt at sending a particular record to Amazon Kinesis. Timing information and error messages from failures are available. This is useful for application logging and debugging, as demonstrated in the onFailure() method. The same information is also uploaded to CloudWatch as metrics named BufferingTime and RequestTime, and there are individual error count metrics (one per unique error code) as well. Attempt objects provide this information at a per-record granularity, while the CloudWatch metrics are aggregated over a shard or stream.

Metrics

Now take a look at the recordsPut() method. You can retrieve metrics from the current KinesisProducer instance directly with the getMetrics() methods. These are the same metrics as those that are uploaded to CloudWatch, the difference being that getMetrics() retrieves the data for the instance on which it is called, whereas the data in CloudWatch is aggregated over all instances sharing the same metrics namespace. The overload getMetrics(String, int) allows you to get data aggregated over a specified number of seconds. This is useful for tracking sliding-window statistics. If called without the int argument, getMetrics() returns the cumulative statistics from the time the KinesisProducer instance started to the present moment. For a list of the available metrics, see the Monitoring Amazon Kinesis with Amazon CloudWatch topic in the Amazon Kinesis Developer’s Guide.
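
For example, a sliding-window variant of recordsPut() could sit alongside it in the class above (a sketch reusing only the accessors already shown in the sample; the 30-second window is arbitrary):

// Sketch: per-instance records put over the last 30 seconds, using the
// sliding-window overload of getMetrics() described above.
long recordsPutLast30s() {
    try {
        return kinesis.getMetrics("UserRecordsPut", 30).stream()
                .filter(m -> m.getDimensions().size() == 2)
                .findFirst()
                .map(Metric::getSum)
                .orElse(0.0)
                .longValue();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}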

Termination

Finally, look at the stop() method. Because the KPL uses a child process, it is necessary to explicitly terminate the child if you wish to do so without exiting the JVM. This is what the destroy() method does. Note that destroy() does not attempt to wait for buffered or in-flight records. To ensure all records are sent before killing the child, call flushSync(). This method blocks until all records are complete. A non-blocking flush() method is also available.

Conclusion

In this post, I’ve discussed several challenges that are commonly encountered when implementing producer applications for Amazon Kinesis: the need to aggregate records and use batch puts while controlling buffering delays, the need to retry records in case of failures, and the need to monitor the application using metrics.

At the same time, I showed you how to develop a sample application and evolve it through several stages of increasing complexity in an attempt to address those needs. Despite having a non-trivial amount of code, the sample falls far short of the requirements for an efficient and reliable producer library that can be reused across many different applications.

The KPL was developed in recognition of the fact that these same challenges are faced by many customers, and it presents a general solution that customers can leverage to save time and effort when developing Amazon Kinesis-based solutions.

You can get started using the KPL with just a few lines of code, but the API is also flexible enough to cater to more advanced use cases, including retrieving detailed information about record puts and realtime metrics from the currently running instance.

Hopefully, this post has given you a deeper understanding of Amazon Kinesis producer applications and provided some insight into how to develop or evolve your own solution by leveraging the KPL.

If you have questions or suggestions, please leave a comment below.

————————

Related:

Snakes in the Stream! Feeding and Eating Amazon Kinesis Streams with Python


systemd for Administrators, Part VIII

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/the-new-configuration-files.html

Another episode of my ongoing series on systemd for Administrators:

The New Configuration Files

One of the formidable new features of systemd is
that it comes with a complete set of modular early-boot services that are
written in simple, fast, parallelizable and robust C, replacing the
shell “novels” the various distributions featured before. Our little
Project Zero Shell[1] has been a full success. We currently
cover pretty much everything most desktop and embedded
distributions should need, plus a big part of the server needs:

  • Checking and mounting of all file systems
  • Updating and enabling quota on all file systems
  • Setting the host name
  • Configuring the loopback network device
  • Loading the SELinux policy and relabelling /run and /dev as necessary on boot
  • Registering additional binary formats in the kernel, such as Java, Mono and WINE binaries
  • Setting the system locale
  • Setting up the console font and keyboard map
  • Creating, removing and cleaning up of temporary and volatile files and directories
  • Applying mount options from /etc/fstab to pre-mounted API VFS
  • Applying sysctl kernel settings
  • Collecting and replaying readahead information
  • Updating utmp boot and shutdown records
  • Loading and saving the random seed
  • Statically loading specific kernel modules
  • Setting up encrypted hard disks and partitions
  • Spawning automatic gettys on serial kernel consoles
  • Maintenance of Plymouth
  • Machine ID maintenance
  • Setting of the UTC distance for the system clock

On a standard Fedora 15 install, only a few legacy and storage
services still require shell scripts during early boot. If you don’t
need those, you can easily disable them and enjoy your shell-free boot
(like I do every day). The shell-less boot systemd offers you is a
unique feature on Linux.

Many of these small components are configured via configuration
files in /etc. Some of these are fairly standardized among
distributions and hence supporting them in the C implementations was
easy and obvious. Examples include: /etc/fstab,
/etc/crypttab or /etc/sysctl.conf. However, for
others no standardized file or directory existed which forced us to add
#ifdef orgies to our sources to deal with the different
places the distributions we want to support store these things. All
these configuration files have in common that they are dead-simple and
there is simply no good reason for distributions to distinguish
themselves with them: they all do the very same thing, just
a bit differently.

To improve the situation and benefit from the unifying force that
systemd is, we thus decided to read the per-distribution configuration
files only as fallbacks — and to introduce new configuration
files as primary source of configuration wherever applicable. Of
course, where possible these standardized configuration files should
not be new inventions but rather just standardizations of the best
distribution-specific configuration files previously used. Here’s a
little overview over these new common configuration files systemd
supports on all distributions:

  • /etc/hostname:
    the host name for the system. One of the most basic and trivial
    system settings. Nonetheless previously all distributions used
    different files for this. Fedora used /etc/sysconfig/network,
    OpenSUSE /etc/HOSTNAME. We chose to standardize on the
    Debian configuration file /etc/hostname.
  • /etc/vconsole.conf:
    configuration of the default keyboard mapping and console font.
  • /etc/locale.conf:
    configuration of the system-wide locale.
  • /etc/modules-load.d/*.conf:
    a drop-in directory for kernel modules to statically load at
    boot (for the very few that still need this).
  • /etc/sysctl.d/*.conf:
    a drop-in directory for kernel sysctl parameters, extending what you
    can already do with /etc/sysctl.conf.
  • /etc/tmpfiles.d/*.conf:
    a drop-in directory for configuration of runtime files that need to be
    removed/created/cleaned up at boot and during uptime.
  • /etc/binfmt.d/*.conf:
    a drop-in directory for registration of additional binary formats for
    systems like Java, Mono and WINE.
  • /etc/os-release:
    a standardization of the various distribution ID files like
    /etc/fedora-release and similar. Really every distribution
    introduced their own file here; writing a simple tool that just prints
    out the name of the local distribution usually means including a
    database of release files to check. The LSB tried to standardize
    something like this with the lsb_release
    tool, but quite frankly the idea of employing a shell script in this
    is not the best choice the LSB folks ever made. To rectify this we
    just decided to generalize this, so that everybody can use the same
    file here.
  • /etc/machine-id:
    a machine ID file, superseding D-Bus’ machine ID file. This file is
    guaranteed to exist and be valid on a systemd system, covering also
    stateless boots. By moving this out of the D-Bus logic it is hopefully
    interesting for a lot of additional uses as a unique and stable
    machine identifier.
  • /etc/machine-info:
    a new information file encoding meta data about a host, like a pretty
    host name and an icon name, replacing stuff like
    /etc/favicon.png and suchlike. This is maintained by systemd-hostnamed.
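
To make the formats tangible, here are illustrative examples of what a few of these files might contain. The values (host name, keymap, font, module name, sysctl setting) are made up for illustration; only the file names and formats come from the list above:

# /etc/hostname
frobnicator

# /etc/vconsole.conf
KEYMAP=us
FONT=latarcyrheb-sun16

# /etc/locale.conf
LANG=en_US.UTF-8

# /etc/modules-load.d/virtio.conf — one module name per line
virtio-net

# /etc/sysctl.d/50-coredump.conf — same key = value syntax as sysctl.conf
kernel.core_pattern = /var/lib/coredumps/core-%e-%p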

It is our definite intention to convince you to use these new
configuration files in your configuration tools: if your
configuration frontend writes these files instead of the old ones, it
automatically becomes more portable between Linux distributions, and
you are helping to standardize Linux. This makes things simpler to
understand and more obvious for users and administrators. Of course,
right now, only systemd-based distributions read these files, but that
already covers all important distributions in one way or another, except for one. And it’s a bit of a
chicken-and-egg problem: a standard becomes a standard by being
used. In order to gently push everybody to standardize on these files
we also want to make clear that sooner or later we plan to drop the
fallback support for the old configuration files from
systemd. That means adoption of this new scheme can happen slowly and piece
by piece. But the final goal of only having one set of configuration
files must be clear.

Many of these configuration files are relevant not only for
configuration tools but also (and sometimes even primarily) in
upstream projects. For example, we invite projects like Mono, Java, or
WINE to install a drop-in file in /etc/binfmt.d/ from their
upstream build systems. Per-distribution downstream support for binary
formats would then no longer be necessary and your platform would work
the same on all distributions. Something similar applies to all
software which needs creation/cleaning of certain runtime files and
directories at boot, for example beneath the /run hierarchy
(i.e. /var/run as it used to be known). These
projects should just drop in configuration files in
/etc/tmpfiles.d, also from the upstream build systems. This
also helps speed up the boot process, as separate per-project SysV
shell scripts which implement trivial things like registering a binary
format or removing/creating temporary/volatile files at boot are no
longer necessary. Or another example, where upstream support would be
fantastic: projects like X11 could probably benefit from reading the
default keyboard mapping for its displays from
/etc/vconsole.conf.
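
As an illustration of what such upstream drop-ins might look like (the file names, the interpreter path and the /run/myapp directory are examples, not files any project actually ships), a binfmt_misc registration and a tmpfiles.d entry are each a single line:

# /etc/binfmt.d/mono.conf — register CLR binaries with the kernel
# (binfmt_misc format: :name:type:offset:magic:mask:interpreter:flags)
:CLR:M::MZ::/usr/bin/mono:

# /etc/tmpfiles.d/myapp.conf — create /run/myapp at boot, owned by myapp
d /run/myapp 0755 myapp myapp -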

Of course, I have no doubt that not everybody is happy with our
choice of names (and formats) for these configuration files. In the
end we had to pick something, and from all the choices these appeared
to be the most convincing. The file formats are as simple as they can
be, and usually easily written and read even from shell scripts. That
said, /etc/bikeshed.conf could of course also have been a
fantastic configuration file name!

So, help us standardize Linux! Use the new configuration files!
Adopt them upstream, adopt them downstream, adopt them all across the
distributions!

Oh, and in case you are wondering: yes, all of these files were
discussed in one way or another with various folks from the various
distributions. And there has even been some push towards supporting
some of these files even outside of systemd systems.

Footnotes

[1] Our slogan: “The only shell that should get started
during boot is gnome-shell!” — Yes, the slogan needs a bit of
work, but you get the idea.
