Tag Archives: nmap

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/751346/rss

Security updates have been issued by Arch Linux (openssl and zziplib), Debian (ldap-account-manager, ming, python-crypto, sam2p, sdl-image1.2, and squirrelmail), Fedora (bchunk, koji, libidn, librelp, nodejs, and php), Gentoo (curl, dhcp, libvirt, mailx, poppler, qemu, and spice-vdagent), Mageia (389-ds-base, aubio, cfitsio, libvncserver, nmap, and ntp), openSUSE (GraphicsMagick, ImageMagick, spice-gtk, and wireshark), Oracle (kubernetes), Slackware (patch), and SUSE (apache2 and openssl).

Security updates for Thursday

Post Syndicated from jake original https://lwn.net/Articles/750432/rss

Security updates have been issued by Debian (drupal7, graphicsmagick, libdatetime-timezone-perl, thunderbird, and tzdata), Fedora (gd, libtiff, mozjs52, and nmap), Gentoo (thunderbird), Red Hat (openstack-tripleo-common, openstack-tripleo-heat-templates and sensu), SUSE (kernel, libvirt, and memcached), and Ubuntu (icu, librelp, openssl, and thunderbird).

Some notes on memcached DDoS

Post Syndicated from Robert Graham original http://blog.erratasec.com/2018/03/some-notes-on-memcached-ddos.html

I thought I’d write up some notes on the memcached DDoS. Specifically, I describe how many I found scanning the Internet with masscan, and how to use masscan as a killswitch to neuter the worst of the attacks.

Test your servers

I added code to my port scanner for this, then scanned the Internet:
masscan -pU:11211 –banners | grep memcached
This example scans the entire Internet (/0). Replaced with your address range (or ranges).
This produces output that looks like this:
Banner on port 11211/udp on [memcached] uptime=230130 time=1520485357 version=1.4.13
Banner on port 11211/udp on [memcached] uptime=3935192 time=1520485363 version=1.4.17
Banner on port 11211/udp on [memcached] uptime=230130 time=1520485357 version=1.4.13
Banner on port 11211/udp on [memcached] uptime=399858 time=1520485362 version=1.4.20
Banner on port 11211/udp on [memcached] uptime=29429482 time=1520485363 version=1.4.20
Banner on port 11211/udp on [memcached] uptime=2879363 time=1520485366 version=1.2.6
Banner on port 11211/udp on [memcached] uptime=42083736 time=1520485365 version=1.4.13
The “banners” check filters out those with valid memcached responses, so you don’t get other stuff that isn’t memcached. To filter this output further, use  the ‘cut’ to grab just column 6:
… | cut -d ‘ ‘ -f 6 | cut -d: -f1
You often get multiple responses to just one query, so you’ll want to sort/uniq the list:
… | sort | uniq

My results from an Internet wide scan

I got 15181 results (or roughly 15,000).
People are using Shodan to find a list of memcached servers. They might be getting a lot results back that response to TCP instead of UDP. Only UDP can be used for the attack.

Other researchers scanned the Internet a few days ago and found ~31k. I don’t know if this means people have been removing these from the Internet.

Masscan as exploit script

BTW, you can not only use masscan to find amplifiers, you can also use it to carry out the DDoS. Simply import the list of amplifier IP addresses, then spoof the source address as that of the target. All the responses will go back to the source address.
masscan -iL amplifiers.txt -pU:11211 –spoof-ip –rate 100000
I point this out to show how there’s no magic in exploiting this. Numerous exploit scripts have been released, because it’s so easy.

Why memcached servers are vulnerable

Like many servers, memcached listens to local IP address for local administration. By listening only on the local IP address, remote people cannot talk to the server.
However, this process is often buggy, and you end up listening on either (all interfaces) or on one of the external interfaces. There’s a common Linux network stack issue where this keeps happening, like trying to get VMs connected to the network. I forget the exact details, but the point is that lots of servers that intend to listen only on end up listening on external interfaces instead. It’s not a good security barrier.
Thus, there are lots of memcached servers listening on their control port (11211) on external interfaces.

How the protocol works

The protocol is documented here. It’s pretty straightforward.
The easiest amplification attacks is to send the “stats” command. This is 15 byte UDP packet that causes the server to send back either a large response full of useful statistics about the server.  You often see around 10 kilobytes of response across several packets.
A harder, but more effect attack uses a two step process. You first use the “add” or “set” commands to put chunks of data into the server, then send a “get” command to retrieve it. You can easily put 100-megabytes of data into the server this way, and causes a retrieval with a single “get” command.
That’s why this has been the largest amplification ever, because a single 100-byte packet can in theory cause a 100-megabytes response.
Doing the math, the 1.3 terabit/second DDoS divided across the 15,000 servers I found vulnerable on the Internet leads to an average of 100-megabits/second per server. This is fairly minor, and is indeed something even small servers (like Raspberry Pis) can generate.

Neutering the attack (“kill switch”)

If they are using the more powerful attack against you, you can neuter it: you can send a “flush_all” command back at the servers who are flooding you, causing them to drop all those large chunks of data from the cache.
I’m going to describe how I would do this.
First, get a list of attackers, meaning, the amplifiers that are flooding you. The way to do this is grab a packet sniffer and capture all packets with a source port of 11211. Here is an example using tcpdump.
tcpdump -i -w attackers.pcap src port 11221
Let that run for a while, then hit [ctrl-c] to stop, then extract the list of IP addresses in the capture file. The way I do this is with tshark (comes with Wireshark):
tshark -r attackers.pcap -Tfields -eip.src | sort | uniq > amplifiers.txt
Now, craft a flush_all payload. There are many ways of doing this. For example, if you are using nmap or masscan, you can add the bytes to the nmap-payloads.txt file. Also, masscan can read this directly from a packet capture file. To do this, first craft a packet, such as with the following command line foo:
echo -en “\x00\x00\x00\x00\x00\x01\x00\x00flush_all\r\n” | nc -q1 -u 11211
Capture this packet using tcpdump or something, and save into a file “flush_all.pcap”. If you want to skip this step, I’ve already done this for you, go grab the file from GitHub:
Now that we have our list of attackers (amplifiers.txt) and a payload to blast at them (flush_all.pcap), use masscan to send it:
masscan -iL amplifiers.txt -pU:112211 –pcap-payload flush_all.pcap

Reportedly, “shutdown” may also work to completely shutdown the amplifiers. I’ll leave that as an exercise for the reader, since of course you’ll be adversely affecting the servers.

Some notes

Here are some good reading on this attack:

Dynamic Users with systemd

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/dynamic-users-with-systemd.html

TL;DR: you may now configure systemd to dynamically allocate a UNIX
user ID for service processes when it starts them and release it when
it stops them. It’s pretty secure, mixes well with transient services,
socket activated services and service templating.

Today we released systemd
. Among
other improvements this greatly extends the dynamic user logic of
systemd. Dynamic users are a powerful but little known concept,
supported in its basic form since systemd 232. With this blog story I
hope to make it a bit better known.

The UNIX user concept is the most basic and well-understood security
concept in POSIX operating systems. It is UNIX/POSIX’ primary security
concept, the one everybody can agree on, and most security concepts
that came after it (such as process capabilities, SELinux and other
MACs, user name-spaces, …) in some form or another build on it, extend
it or at least interface with it. If you build a Linux kernel with all
security features turned off, the user concept is pretty much the one
you’ll still retain.

Originally, the user concept was introduced to make multi-user systems
a reality, i.e. systems enabling multiple human users to share the
same system at the same time, cleanly separating their resources and
protecting them from each other. The majority of today’s UNIX systems
don’t really use the user concept like that anymore though. Most of
today’s systems probably have only one actual human user (or even
less!), but their user databases (/etc/passwd) list a good number
more entries than that. Today, the majority of UNIX users in most
environments are system users, i.e. users that are not the technical
representation of a human sitting in front of a PC anymore, but the
security identity a system service — an executable program — runs
as. Event though traditional, simultaneous multi-user systems slowly
became less relevant, their ground-breaking basic concept became the
cornerstone of UNIX security. The OS is nowadays partitioned into
isolated services — and each service runs as its own system user, and
thus within its own, minimal security context.

The people behind the Android OS realized the relevance of the UNIX
user concept as the primary security concept on UNIX, and took its use
even further: on Android not only system services take benefit of the
UNIX user concept, but each UI app gets its own, individual user
identity too — thus neatly separating app resources from each other,
and protecting app processes from each other, too.

Back in the more traditional Linux world things are a bit less
advanced in this area. Even though users are the quintessential UNIX
security concept, allocation and management of system users is still a
pretty limited, raw and static affair. In most cases, RPM or DEB
package installation scripts allocate a fixed number of (usually one)
system users when you install the package of a service that wants to
take benefit of the user concept, and from that point on the system
user remains allocated on the system and is never deallocated again,
even if the package is later removed again. Most Linux distributions
limit the number of system users to 1000 (which isn’t particularly a
lot). Allocating a system user is hence expensive: the number of
available users is limited, and there’s no defined way to dispose of
them after use. If you make use of system users too liberally, you are
very likely to run out of them sooner rather than later.

You may wonder why system users are generally not deallocated when the
package that registered them is uninstalled from a system (at least on
most distributions). The reason for that is one relevant property of
the user concept (you might even want to call this a design flaw):
user IDs are sticky to files (and other objects such as IPC
objects). If a service running as a specific system user creates a
file at some location, and is then terminated and its package and user
removed, then the created file still belongs to the numeric ID (“UID”)
the system user originally got assigned. When the next system user is
allocated and — due to ID recycling — happens to get assigned the same
numeric ID, then it will also gain access to the file, and that’s
generally considered a problem, given that the file belonged to a
potentially very different service once upon a time, and likely should
not be readable or changeable by anything coming after
it. Distributions hence tend to avoid UID recycling which means system
users remain registered forever on a system after they have been
allocated once.

The above is a description of the status quo ante. Let’s now focus on
what systemd’s dynamic user concept brings to the table, to improve
the situation.

Introducing Dynamic Users

With systemd dynamic users we hope to make make it easier and cheaper
to allocate system users on-the-fly, thus substantially increasing the
possible uses of this core UNIX security concept.

If you write a systemd service unit file, you may enable the dynamic
user logic for it by setting the
option in its [Service] section to yes. If you do a system user is
dynamically allocated the instant the service binary is invoked, and
released again when the service terminates. The user is automatically
allocated from the UID range 61184–65519, by looking for a so far
unused UID.

Now you may wonder, how does this concept deal with the sticky user
issue discussed above? In order to counter the problem, two strategies
easily come to mind:

  1. Prohibit the service from creating any files/directories or IPC objects

  2. Automatically removing the files/directories or IPC objects the
    service created when it shuts down.

In systemd we implemented both strategies, but for different parts of
the execution environment. Specifically:

  1. Setting DynamicUser=yes implies
    ProtectHome=read-only. These
    sand-boxing options turn off write access to pretty much the whole OS
    directory tree, with a few relevant exceptions, such as the API file
    systems /proc, /sys and so on, as well as /tmp and
    /var/tmp. (BTW: setting these two options on your regular services
    that do not use DynamicUser= is a good idea too, as it drastically
    reduces the exposure of the system to exploited services.)

  2. Setting DynamicUser=yes implies
    PrivateTmp=yes. This
    option sets up /tmp and /var/tmp for the service in a way that it
    gets its own, disconnected version of these directories, that are not
    shared by other services, and whose life-cycle is bound to the
    service’s own life-cycle. Thus if the service goes down, the user is
    removed and all its temporary files and directories with it. (BTW: as
    above, consider setting this option for your regular services that do
    not use DynamicUser= too, it’s a great way to lock things down

  3. Setting DynamicUser=yes implies
    RemoveIPC=yes. This
    option ensures that when the service goes down all SysV and POSIX IPC
    objects (shared memory, message queues, semaphores) owned by the
    service’s user are removed. Thus, the life-cycle of the IPC objects is
    bound to the life-cycle of the dynamic user and service, too. (BTW:
    yes, here too, consider using this in your regular services, too!)

With these four settings in effect, services with dynamic users are
nicely sand-boxed. They cannot create files or directories, except in
/tmp and /var/tmp, where they will be removed automatically when
the service shuts down, as will any IPC objects created. Sticky
ownership of files/directories and IPC objects is hence dealt with

option may be used to open up a bit the sandbox to external
programs. If you set it to a directory name of your choice, it will be
created below /run when the service is started, and removed in its
entirety when it is terminated. The ownership of the directory is
assigned to the service’s dynamic user. This way, a dynamic user
service can expose API interfaces (AF_UNIX sockets, …) to other
services at a well-defined place and again bind the life-cycle of it to
the service’s own run-time. Example: set RuntimeDirectory=foobar in
your service, and watch how a directory /run/foobar appears at the
moment you start the service, and disappears the moment you stop
it again. (BTW: Much like the other settings discussed above,
RuntimeDirectory= may be used outside of the DynamicUser= context
too, and is a nice way to run any service with a properly owned,
life-cycle-managed run-time directory.)

Persistent Data

Of course, a service running in such an environment (although already
very useful for many cases!), has a major limitation: it cannot leave
persistent data around it can reuse on a later run. As pretty much the
whole OS directory tree is read-only to it, there’s simply no place it
could put the data that survives from one service invocation to the

With systemd 235 this limitation is removed: there are now three new
LogsDirectory= and CacheDirectory=. In many ways they operate like
RuntimeDirectory=, but create sub-directories below /var/lib,
/var/log and /var/cache, respectively. There’s one major
difference beyond that however: directories created that way are
persistent, they will survive the run-time cycle of a service, and
thus may be used to store data that is supposed to stay around between
invocations of the service.

Of course, the obvious question to ask now is: how do these three
settings deal with the sticky file ownership problem?

For that we lifted a concept from container managers. Container
managers have a very similar problem: each container and the host
typically end up using a very similar set of numeric UIDs, and unless
user name-spacing is deployed this means that host users might be able
to access the data of specific containers that also have a user by the
same numeric UID assigned, even though it actually refers to a very
different identity in a different context. (Actually, it’s even worse
than just getting access, due to the existence of setuid file bits,
access might translate to privilege elevation.) The way container
managers protect the container images from the host (and from each
other to some level) is by placing the container trees below a
boundary directory, with very restrictive access modes and ownership
(0700 and root:root or so). A host user hence cannot take advantage
of the files/directories of a container user of the same UID inside of
a local container tree, simply because the boundary directory makes it
impossible to even reference files in it. After all on UNIX, in order
to get access to a specific path you need access to every single
component of it.

How is that applied to dynamic user services? Let’s say
StateDirectory=foobar is set for a service that has DynamicUser=
turned off. The instant the service is started, /var/lib/foobar is
created as state directory, owned by the service’s user and remains in
existence when the service is stopped. If the same service now is run
with DynamicUser= turned on, the implementation is slightly
altered. Instead of a directory /var/lib/foobar a symbolic link by
the same path is created (owned by root), pointing to
/var/lib/private/foobar (the latter being owned by the service’s
dynamic user). The /var/lib/private directory is created as boundary
directory: it’s owned by root:root, and has a restrictive access
mode of 0700. Both the symlink and the service’s state directory will
survive the service’s life-cycle, but the state directory will remain,
and continues to be owned by the now disposed dynamic UID — however it
is protected from other host users (and other services which might get
the same dynamic UID assigned due to UID recycling) by the boundary

The obvious question to ask now is: but if the boundary directory
prohibits access to the directory from unprivileged processes, how can
the service itself which runs under its own dynamic UID access it
anyway? This is achieved by invoking the service process in a slightly
modified mount name-space: it will see most of the file hierarchy the
same way as everything else on the system (modulo /tmp and
/var/tmp as mentioned above), except for /var/lib/private, which
is over-mounted with a read-only tmpfs file system instance, with a
slightly more liberal access mode permitting the service read
access. Inside of this tmpfs file system instance another mount is
placed: a bind mount to the host’s real /var/lib/private/foobar
directory, onto the same name. Putting this together these means that
superficially everything looks the same and is available at the same
place on the host and from inside the service, but two important
changes have been made: the /var/lib/private boundary directory lost
its restrictive character inside the service, and has been emptied of
the state directories of any other service, thus making the protection
complete. Note that the symlink /var/lib/foobar hides the fact that
the boundary directory is used (making it little more than an
implementation detail), as the directory is available this way under
the same name as it would be if DynamicUser= was not used. Long
story short: for the daemon and from the view from the host the
indirection through /var/lib/private is mostly transparent.

This logic of course raises another question: what happens to the
state directory if a dynamic user service is started with a state
directory configured, gets UID X assigned on this first invocation,
then terminates and is restarted and now gets UID Y assigned on the
second invocation, with X ≠ Y? On the second invocation the directory
— and all the files and directories below it — will still be owned by
the original UID X so how could the second instance running as Y
access it? Our way out is simple: systemd will recursively change the
ownership of the directory and everything contained within it to UID Y
before invoking the service’s executable.

Of course, such recursive ownership changing (chown()ing) of whole
directory trees can become expensive (though according to my
experiences, IRL and for most services it’s much cheaper than you
might think), hence in order to optimize behavior in this regard, the
allocation of dynamic UIDs has been tweaked in two ways to avoid the
necessity to do this expensive operation in most cases: firstly, when
a dynamic UID is allocated for a service an allocation loop is
employed that starts out with a UID hashed from the service’s
name. This means a service by the same name is likely to always use
the same numeric UID. That means that a stable service name translates
into a stable dynamic UID, and that means recursive file ownership
adjustments can be skipped (of course, after validation). Secondly, if
the configured state directory already exists, and is owned by a
suitable currently unused dynamic UID, it’s preferably used above
everything else, thus maximizing the chance we can avoid the
chown()ing. (That all said, ultimately we have to face it, the
currently available UID space of 4K+ is very small still, and
conflicts are pretty likely sooner or later, thus a chown()ing has to
be expected every now and then when this feature is used extensively).

Note that CacheDirectory= and LogsDirectory= work very similar to
StateDirectory=. The only difference is that they manage directories
below the /var/cache and /var/logs directories, and their boundary
directory hence is /var/cache/private and /var/log/private,


So, after all this introduction, let’s have a look how this all can be
put together. Here’s a trivial example:

# cat > /etc/systemd/system/dynamic-user-test.service <<EOF
ExecStart=/usr/bin/sleep 4711
# systemctl daemon-reload
# systemctl start dynamic-user-test
# systemctl status dynamic-user-test
● dynamic-user-test.service
   Loaded: loaded (/etc/systemd/system/dynamic-user-test.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2017-10-06 13:12:25 CEST; 3s ago
 Main PID: 2967 (sleep)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/dynamic-user-test.service
           └─2967 /usr/bin/sleep 4711

Okt 06 13:12:25 sigma systemd[1]: Started dynamic-user-test.service.
# ps -e -o pid,comm,user | grep 2967
 2967 sleep           dynamic-user-test
# id dynamic-user-test
uid=64642(dynamic-user-test) gid=64642(dynamic-user-test) groups=64642(dynamic-user-test)
# systemctl stop dynamic-user-test
# id dynamic-user-test
id: ‘dynamic-user-test’: no such user

In this example, we create a unit file with DynamicUser= turned on,
start it, check if it’s running correctly, have a look at the service
process’ user (which is named like the service; systemd does this
automatically if the service name is suitable as user name, and you
didn’t configure any user name to use explicitly), stop the service
and verify that the user ceased to exist too.

That’s already pretty cool. Let’s step it up a notch, by doing the
same in an interactive transient service (for those who don’t know
systemd well: a transient service is a service that is defined and
started dynamically at run-time, for example via the systemd-run
command from the shell. Think: run a service without having to write a
unit file first):

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u15750.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ id
uid=63122(run-u15750) gid=63122(run-u15750) groups=63122(run-u15750) context=system_u:system_r:initrc_t:s0
sh-4.4$ ls -al /var/lib/private/
total 0
drwxr-xr-x. 3 root       root        60  6. Okt 13:21 .
drwxr-xr-x. 1 root       root       852  6. Okt 13:21 ..
drwxr-xr-x. 1 run-u15750 run-u15750   8  6. Okt 13:22 wuff
sh-4.4$ ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
sh-4.4$ ls -ld /var/lib/wuff/
drwxr-xr-x. 1 run-u15750 run-u15750 0  6. Okt 13:21 /var/lib/wuff/
sh-4.4$ echo hello > /var/lib/wuff/test
sh-4.4$ exit
# id run-u15750
id: ‘run-u15750’: no such user
# ls -al /var/lib/private
total 0
drwx------. 1 root  root   66  6. Okt 13:21 .
drwxr-xr-x. 1 root  root  852  6. Okt 13:21 ..
drwxr-xr-x. 1 63122 63122   8  6. Okt 13:22 wuff
# ls -ld /var/lib/wuff
lrwxrwxrwx. 1 root root 12  6. Okt 13:21 /var/lib/wuff -> private/wuff
# ls -ld /var/lib/wuff/
drwxr-xr-x. 1 63122 63122 8  6. Okt 13:22 /var/lib/wuff/
# cat /var/lib/wuff/test

The above invokes an interactive shell as transient service
run-u15750.service (systemd-run picked that name automatically,
since we didn’t specify anything explicitly) with a dynamic user whose
name is derived automatically from the service name. Because
StateDirectory=wuff is used, a persistent state directory for the
service is made available as /var/lib/wuff. In the interactive shell
running inside the service, the ls commands show the
/var/lib/private boundary directory and its contents, as well as the
symlink that is placed for the service. Finally, before exiting the
shell, a file is created in the state directory. Back in the original
command shell we check if the user is still allocated: it is not, of
course, since the service ceased to exist when we exited the shell and
with it the dynamic user associated with it. From the host we check
the state directory of the service, with similar commands as we did
from inside of it. We see that things are set up pretty much the same
way in both cases, except for two things: first of all the user/group
of the files is now shown as raw numeric UIDs instead of the
user/group names derived from the unit name. That’s because the user
ceased to exist at this point, and “ls” shows the raw UID for files
owned by users that don’t exist. Secondly, the access mode of the
boundary directory is different: when we look at it from outside of
the service it is not readable by anyone but root, when we looked from
inside we saw it it being world readable.

Now, let’s see how things look if we start another transient service,
reusing the state directory from the first invocation:

# systemd-run --pty --property=DynamicUser=yes --property=StateDirectory=wuff /bin/sh
Running as unit: run-u16087.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4$ cat /var/lib/wuff/test
sh-4.4$ ls -al /var/lib/wuff/
total 4
drwxr-xr-x. 1 run-u16087 run-u16087  8  6. Okt 13:22 .
drwxr-xr-x. 3 root       root       60  6. Okt 15:42 ..
-rw-r--r--. 1 run-u16087 run-u16087  6  6. Okt 13:22 test
sh-4.4$ id
uid=63122(run-u16087) gid=63122(run-u16087) groups=63122(run-u16087) context=system_u:system_r:initrc_t:s0
sh-4.4$ exit

Here, systemd-run picked a different auto-generated unit name, but
the used dynamic UID is still the same, as it was read from the
pre-existing state directory, and was otherwise unused. As we can see
the test file we generated earlier is accessible and still contains
the data we left in there. Do note that the user name is different
this time (as it is derived from the unit name, which is different),
but the UID it is assigned to is the same one as on the first
invocation. We can thus see that the mentioned optimization of the UID
allocation logic (i.e. that we start the allocation loop from the UID
owner of any existing state directory) took effect, so that no
recursive chown()ing was required.

And that’s the end of our example, which hopefully illustrated a bit
how this concept and implementation works.


Now that we had a look at how to enable this logic for a unit and how
it is implemented, let’s discuss where this actually could be useful
in real life.

  • One major benefit of dynamic user IDs is that running a
    privilege-separated service leaves no artifacts in the system. A
    system user is allocated and made use of, but it is discarded
    automatically in a safe and secure way after use, in a fashion that is
    safe for later recycling. Thus, quickly invoking a short-lived service
    for processing some job can be protected properly through a user ID
    without having to pre-allocate it and without this draining the
    available UID pool any longer than necessary.

  • In many cases, starting a service no longer requires
    package-specific preparation. Or in other words, quite often
    useradd/mkdir/chown/chmod invocations in “post-inst” package
    scripts, as well as
    drop-ins become unnecessary, as the DynamicUser= and
    StateDirectory=/CacheDirectory=/LogsDirectory= logic can do the
    necessary work automatically, on-demand and with a well-defined

  • By combining dynamic user IDs with the transient unit concept, new
    creative ways of sand-boxing are made available. For example, let’s say
    you don’t trust the correct implementation of the sort command. You
    can now lock it into a simple, robust, dynamic UID sandbox with a
    simple systemd-run and still integrate it into a shell pipeline like
    any other command. Here’s an example, showcasing a shell pipeline
    whose middle element runs as a dynamically on-the-fly allocated UID,
    that is released when the pipelines ends.

    # cat some-file.txt | systemd-run ---pipe --property=DynamicUser=1 sort -u | grep -i foobar > some-other-file.txt
  • By combining dynamic user IDs with the systemd templating logic it
    is now possible to do much more fine-grained and fully automatic UID
    management. For example, let’s say you have a template unit file
    /etc/systemd/system/[email protected]:


    Now, let’s say you want to start one instance of this service for
    each of your customers. All you need to do now for that is:

    # systemctl enable [email protected] --now

    And you are done. (Invoke this as many times as you like, each time
    replacing customerxyz by some customer identifier, you get the

  • By combining dynamic user IDs with socket activation you may easily
    implement a system where each incoming connection is served by a
    process instance running as a different, fresh, newly allocated UID
    within its own sandbox. Here’s an example waldo.socket:


    With a matching [email protected]:


    With the two unit files above, systemd will listen on TCP/IP port
    2048, and for each incoming connection invoke a fresh instance of
    [email protected], each time utilizing a different, new,
    dynamically allocated UID, neatly isolated from any other

  • Dynamic user IDs combine very well with state-less systems,
    i.e. systems that come up with an unpopulated /etc and /var. A
    service using dynamic user IDs and the StateDirectory=,
    CacheDirectory=, LogsDirectory= and RuntimeDirectory= concepts
    will implicitly allocate the users and directories it needs for
    running, right at the moment where it needs it.

Dynamic users are a very generic concept, hence a multitude of other
uses are thinkable; the list above is just supposed to trigger your

What does this mean for you as a packager?

I am pretty sure that a large number of services shipped with today’s
distributions could benefit from using DynamicUser= and
StateDirectory= (and related settings). It often allows removal of
post-inst packaging scripts altogether, as well as any sysusers.d
and tmpfiles.d drop-ins by unifying the needed declarations in the
unit file itself. Hence, as a packager please consider switching your
unit files over. That said, there are a number of conditions where
DynamicUser= and StateDirectory= (and friends) cannot or should
not be used. To name a few:

  1. Service that need to write to files outside of /run/<package>,
    /var/lib/<package>, /var/cache/<package>, /var/log/<package>,
    /var/tmp, /tmp, /dev/shm are generally incompatible with this
    scheme. This rules out daemons that upgrade the system as one example,
    as that involves writing to /usr.

  2. Services that maintain a herd of processes with different user
    IDs. Some SMTP services are like this. If your service has such a
    super-server design, UID management needs to be done by the
    super-server itself, which rules out systemd doing its dynamic UID
    magic for it.

  3. Services which run as root (obviously…) or are otherwise

  4. Services that need to live in the same mount name-space as the host
    system (for example, because they want to establish mount points
    visible system-wide). As mentioned DynamicUser= implies
    ProtectSystem=, PrivateTmp= and related options, which all require
    the service to run in its own mount name-space.

  5. Your focus is older distributions, i.e. distributions that do not
    have systemd 232 (for DynamicUser=) or systemd 235 (for
    StateDirectory= and friends) yet.

  6. If your distribution’s packaging guides don’t allow it. Consult
    your packaging guides, and possibly start a discussion on your
    distribution’s mailing list about this.


A couple of additional, random notes about the implementation and use
of these features:

  1. Do note that allocating or deallocating a dynamic user leaves
    /etc/passwd untouched. A dynamic user is added into the user
    database through the glibc NSS module
    and this information never hits the disk.

  2. On traditional UNIX systems it was the job of the daemon process
    itself to drop privileges, while the DynamicUser= concept is
    designed around the service manager (i.e. systemd) being responsible
    for that. That said, since v235 there’s a way to marry DynamicUser=
    and such services which want to drop privileges on their own. For
    that, turn on DynamicUser= and set
    to the user name the service wants to setuid() to. This has the
    effect that systemd will allocate the dynamic user under the specified
    name when the service is started. Then, prefix the command line you
    specify in
    with a single ! character. If you do, the user is allocated for the
    service, but the daemon binary is is invoked as root instead of the
    allocated user, under the assumption that the daemon changes its UID
    on its own the right way. Not that after registration the user will
    show up instantly in the user database, and is hence resolvable like
    any other by the daemon process. Example:

  3. You may wonder why systemd uses the UID range 61184–65519 for its
    dynamic user allocations (side note: in hexadecimal this reads as
    0xEF00–0xFFEF). That’s because distributions (specifically Fedora)
    tend to allocate regular users from below the 60000 range, and we
    don’t want to step into that. We also want to stay away from 65535 and
    a bit around it, as some of these UIDs have special meanings (65535 is
    often used as special value for “invalid” or “no” UID, as it is
    identical to the 16bit value -1; 65534 is generally mapped to the
    “nobody” user, and is where some kernel subsystems map unmappable
    UIDs). Finally, we want to stay within the 16bit range. In a user
    name-spacing world each container tends to have much less than the full
    32bit UID range available that Linux kernels theoretically
    provide. Everybody apparently can agree that a container should at
    least cover the 16bit range though — already to include a nobody
    user. (And quite frankly, I am pretty sure assigning 64K UIDs per
    container is nicely systematic, as the the higher 16bit of the 32bit
    UID values this way become a container ID, while the lower 16bit
    become the logical UID within each container, if you still follow what
    I am babbling here…). And before you ask: no this range cannot be
    changed right now, it’s compiled in. We might change that eventually

  4. You might wonder what happens if you already used UIDs from the
    61184–65519 range on your system for other purposes. systemd should
    handle that mostly fine, as long as that usage is properly registered
    in the user database: when allocating a dynamic user we pick a UID,
    see if it is currently used somehow, and if yes pick a different one,
    until we find a free one. Whether a UID is used right now or not is
    checked through NSS calls. Moreover the IPC object lists are checked to
    see if there are any objects owned by the UID we are about to
    pick. This means systemd will avoid using UIDs you have assigned
    otherwise. Note however that this of course makes the pool of
    available UIDs smaller, and in the worst cases this means that
    allocating a dynamic user might fail because there simply are no
    unused UIDs in the range.

  5. If not specified otherwise the name for a dynamically allocated
    user is derived from the service name. Not everything that’s valid in
    a service name is valid in a user-name however, and in some cases a
    randomized name is used instead to deal with this. Often it makes
    sense to pick the user names to register explicitly. For that use
    User= and choose whatever you like.

  6. If you pick a user name with User= and combine it with
    DynamicUser= and the user already exists statically it will be used
    for the service and the dynamic user logic is automatically
    disabled. This permits automatic up- and downgrades between static and
    dynamic UIDs. For example, it provides a nice way to move a system
    from static to dynamic UIDs in a compatible way: as long as you select
    the same User= value before and after switching DynamicUser= on,
    the service will continue to use the statically allocated user if it
    exists, and only operates in the dynamic mode if it does not. This is
    useful for other cases as well, for example to adapt a service that
    normally would use a dynamic user to concepts that require statically
    assigned UIDs, for example to marry classic UID-based file system
    quota with such services.

  7. systemd always allocates a pair of dynamic UID and GID at the same
    time, with the same numeric ID.

  8. If the Linux kernel had a “shiftfs” or similar functionality,
    i.e. a way to mount an existing directory to a second place, but map
    the exposed UIDs/GIDs in some way configurable at mount time, this
    would be excellent for the implementation of StateDirectory= in
    conjunction with DynamicUser=. It would make the recursive
    chown()ing step unnecessary, as the host version of the state
    directory could simply be mounted into a the service’s mount
    name-space, with a shift applied that maps the directory’s owner to the
    services’ UID/GID. But I don’t have high hopes in this regard, as all
    work being done in this area appears to be bound to user name-spacing
    — which is a concept not used here (and I guess one could say user
    name-spacing is probably more a source of problems than a solution to
    one, but you are welcome to disagree on that).

And that’s all for now. Enjoy your dynamic users!

Announcement: IPS code

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/08/announcement-ips-code.html

So after 20 years, IBM is killing off my BlackICE code created in April 1998. So it’s time that I rewrite it.

BlackICE was the first “inline” intrusion-detection system, aka. an “intrusion prevention system” or IPS. ISS purchased my company in 2001 and replaced their RealSecure engine with it, and later renamed it Proventia. Then IBM purchased ISS in 2006. Now, they are formally canceling the project and moving customers onto Cisco’s products, which are based on Snort.

So now is a good time to write a replacement. The reason is that BlackICE worked fundamentally differently than Snort, using protocol analysis rather than pattern-matching. In this way, it worked more like Bro than Snort. The biggest benefit of protocol-analysis is speed, making it many times faster than Snort. The second benefit is better detection ability, as I describe in this post on Heartbleed.

So my plan is to create a new project. I’ll be checking in the starter bits into GitHub starting a couple weeks from now. I need to figure out a new name for the project, so I don’t have to rip off a name from William Gibson like I did last time :).

Some notes:

  • Yes, it’ll be GNU open source. I’m a capitalist, so I’ll earn money like snort/nmap dual-licensing it, charging companies who don’t want to open-source their addons. All capitalists GNU license their code.
  • C, not Rust. Sorry, I’m going for extreme scalability. We’ll re-visit this decision later when looking at building protocol parsers.
  • It’ll be 95% compatible with Snort signatures. Their language definition leaves so much ambiguous it’ll be hard to be 100% compatible.
  • It’ll support Snort output as well, though really, Snort’s events suck.
  • Protocol parsers in Lua, so you can use it as a replacement for Bro, writing parsers to extract data you are interested in.
  • Protocol state machine parsers in C, like you see in my Masscan project for X.509.
  • First version IDS only. These days, “inline” means also being able to MitM the SSL stack, so I’m gong to have to think harder on that.
  • Mutli-core worker threads off PF_RING/DPDK/netmap receive queues. Should handle 10gbps, tracking 10 million concurrent connections, with quad-core CPU.
So if you want to contribute to the project, here’s what I need:
  • Requirements from people who work daily with IDS/IPS today. I need you to write up what your products do well that you really like. I need to you write up what they suck at that needs to be fixed. These need to be in some detail.
  • Testing environment to play with. This means having a small server plugged into a real-world link running at a minimum of several gigabits-per-second available for the next year. I’ll sign NDAs related to the data I might see on the network.
  • Coders. I’ll be doing the basic architecture, but protocol parsers, output plugins, etc. will need work. Code will be in C and Lua for the near term. Unfortunately, since I’m going to dual-license, I’ll need waivers before accepting pull requests.
Anyway, follow me on Twitter @erratarob if you want to contribute.

Deploying an NGINX Reverse Proxy Sidecar Container on Amazon ECS

Post Syndicated from Nathan Peck original https://aws.amazon.com/blogs/compute/nginx-reverse-proxy-sidecar-container-on-amazon-ecs/

Reverse proxies are a powerful software architecture primitive for fetching resources from a server on behalf of a client. They serve a number of purposes, from protecting servers from unwanted traffic to offloading some of the heavy lifting of HTTP traffic processing.

This post explains the benefits of a reverse proxy, and explains how to use NGINX and Amazon EC2 Container Service (Amazon ECS) to easily implement and deploy a reverse proxy for your containerized application.


NGINX is a high performance HTTP server that has achieved significant adoption because of its asynchronous event driven architecture. It can serve thousands of concurrent requests with a low memory footprint. This efficiency also makes it ideal as a reverse proxy.

Amazon ECS is a highly scalable, high performance container management service that supports Docker containers. It allows you to run applications easily on a managed cluster of Amazon EC2 instances. Amazon ECS helps you get your application components running on instances according to a specified configuration. It also helps scale out these components across an entire fleet of instances.

Sidecar containers are a common software pattern that has been embraced by engineering organizations. It’s a way to keep server side architecture easier to understand by building with smaller, modular containers that each serve a simple purpose. Just like an application can be powered by multiple microservices, each microservice can also be powered by multiple containers that work together. A sidecar container is simply a way to move part of the core responsibility of a service out into a containerized module that is deployed alongside a core application container.

The following diagram shows how an NGINX reverse proxy sidecar container operates alongside an application server container:

In this architecture, Amazon ECS has deployed two copies of an application stack that is made up of an NGINX reverse proxy side container and an application container. Web traffic from the public goes to an Application Load Balancer, which then distributes the traffic to one of the NGINX reverse proxy sidecars. The NGINX reverse proxy then forwards the request to the application server and returns its response to the client via the load balancer.

Reverse proxy for security

Security is one reason for using a reverse proxy in front of an application container. Any web server that serves resources to the public can expect to receive lots of unwanted traffic every day. Some of this traffic is relatively benign scans by researchers and tools, such as Shodan or nmap:

[18/May/2017:15:10:10 +0000] "GET /YesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScann HTTP/1.1" 404 1389 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
[18/May/2017:18:19:51 +0000] "GET /clientaccesspolicy.xml HTTP/1.1" 404 322 - Cloud mapping experiment. Contact [email protected]

But other traffic is much more malicious. For example, here is what a web server sees while being scanned by the hacking tool ZmEu, which scans web servers trying to find PHPMyAdmin installations to exploit:

[18/May/2017:16:27:39 +0000] "GET /mysqladmin/scripts/setup.php HTTP/1.1" 404 391 - ZmEu
[18/May/2017:16:27:39 +0000] "GET /web/phpMyAdmin/scripts/setup.php HTTP/1.1" 404 394 - ZmEu
[18/May/2017:16:27:39 +0000] "GET /xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /apache-default/phpmyadmin/scripts/setup.php HTTP/1.1" 404 405 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /phpMyAdmin- HTTP/1.1" 404 397 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /mysql/scripts/setup.php HTTP/1.1" 404 386 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /admin/scripts/setup.php HTTP/1.1" 404 386 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /forum/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /typo3/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:42 +0000] "GET /phpMyAdmin- HTTP/1.1" 404 399 - ZmEu
[18/May/2017:16:27:44 +0000] "GET /administrator/components/com_joommyadmin/phpmyadmin/scripts/setup.php HTTP/1.1" 404 418 - ZmEu
[18/May/2017:18:34:45 +0000] "GET /phpmyadmin/scripts/setup.php HTTP/1.1" 404 390 - ZmEu
[18/May/2017:16:27:45 +0000] "GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1" 404 401 - ZmEu

In addition, servers can also end up receiving unwanted web traffic that is intended for another server. In a cloud environment, an application may end up reusing an IP address that was formerly connected to another service. It’s common for misconfigured or misbehaving DNS servers to send traffic intended for a different host to an IP address now connected to your server.

It’s the responsibility of anyone running a web server to handle and reject potentially malicious traffic or unwanted traffic. Ideally, the web server can reject this traffic as early as possible, before it actually reaches the core application code. A reverse proxy is one way to provide this layer of protection for an application server. It can be configured to reject these requests before they reach the application server.

Reverse proxy for performance

Another advantage of using a reverse proxy such as NGINX is that it can be configured to offload some heavy lifting from your application container. For example, every HTTP server should support gzip. Whenever a client requests gzip encoding, the server compresses the response before sending it back to the client. This compression saves network bandwidth, which also improves speed for clients who now don’t have to wait as long for a response to fully download.

NGINX can be configured to accept a plaintext response from your application container and gzip encode it before sending it down to the client. This allows your application container to focus 100% of its CPU allotment on running business logic, while NGINX handles the encoding with its efficient gzip implementation.

An application may have security concerns that require SSL termination at the instance level instead of at the load balancer. NGINX can also be configured to terminate SSL before proxying the request to a local application container. Again, this also removes some CPU load from the application container, allowing it to focus on running business logic. It also gives you a cleaner way to patch any SSL vulnerabilities or update SSL certificates by updating the NGINX container without needing to change the application container.

NGINX configuration

Configuring NGINX for both traffic filtering and gzip encoding is shown below:

http {
  # NGINX will handle gzip compression of responses from the app server
  gzip on;
  gzip_proxied any;
  gzip_types text/plain application/json;
  gzip_min_length 1000;
  server {
    listen 80;
    # NGINX will reject anything not matching /api
    location /api {
      # Reject requests with unsupported HTTP method
      if ($request_method !~ ^(GET|POST|HEAD|OPTIONS|PUT|DELETE)$) {
        return 405;
      # Only requests matching the whitelist expectations will
      # get sent to the application server
      proxy_pass http://app:3000;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection 'upgrade';
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_cache_bypass $http_upgrade;

The above configuration only accepts traffic that matches the expression /api and has a recognized HTTP method. If the traffic matches, it is forwarded to a local application container accessible at the local hostname app. If the client requested gzip encoding, the plaintext response from that application container is gzip-encoded.

Amazon ECS configuration

Configuring ECS to run this NGINX container as a sidecar is also simple. ECS uses a core primitive called the task definition. Each task definition can include one or more containers, which can be linked to each other:

  "containerDefinitions": [
       "name": "nginx",
       "image": "<NGINX reverse proxy image URL here>",
       "memory": "256",
       "cpu": "256",
       "essential": true,
       "portMappings": [
           "containerPort": "80",
           "protocol": "tcp"
       "links": [
       "name": "app",
       "image": "<app image URL here>",
       "memory": "256",
       "cpu": "256",
       "essential": true
   "networkMode": "bridge",
   "family": "application-stack"

This task definition causes ECS to start both an NGINX container and an application container on the same instance. Then, the NGINX container is linked to the application container. This allows the NGINX container to send traffic to the application container using the hostname app.

The NGINX container has a port mapping that exposes port 80 on a publically accessible port but the application container does not. This means that the application container is not directly addressable. The only way to send it traffic is to send traffic to the NGINX container, which filters that traffic down. It only forwards to the application container if the traffic passes the whitelisted rules.


Running a sidecar container such as NGINX can bring significant benefits by making it easier to provide protection for application containers. Sidecar containers also improve performance by freeing your application container from various CPU intensive tasks. Amazon ECS makes it easy to run sidecar containers, and automate their deployment across your cluster.

To see the full code for this NGINX sidecar reference, or to try it out yourself, you can check out the open source NGINX reverse proxy reference architecture on GitHub.

– Nathan

Querying OpenStreetMap with Amazon Athena

Post Syndicated from Seth Fitzsimmons original https://aws.amazon.com/blogs/big-data/querying-openstreetmap-with-amazon-athena/

This is a guest post by Seth Fitzsimmons, member of the 2017 OpenStreetMap US board of directors. Seth works with clients including the Humanitarian OpenStreetMap Team, Mapzen, the American Red Cross, and World Bank to craft innovative geospatial solutions.

OpenStreetMap (OSM) is a free, editable map of the world, created and maintained by volunteers and available for use under an open license. Companies and non-profits like Mapbox, Foursquare, Mapzen, the World Bank, the American Red Cross and others use OSM to provide maps, directions, and geographic context to users around the world.

In the 12 years of OSM’s existence, editors have created and modified several billion features (physical things on the ground like roads or buildings). The main PostgreSQL database that powers the OSM editing interface is now over 2TB and includes historical data going back to 2007. As new users join the open mapping community, more and more valuable data is being added to OpenStreetMap, requiring increasingly powerful tools, interfaces, and approaches to explore its vastness.

This post explains how anyone can use Amazon Athena to quickly query publicly available OSM data stored in Amazon S3 (updated weekly) as an AWS Public Dataset. Imagine that you work for an NGO interested in improving knowledge of and access to health centers in Africa. You might want to know what’s already been mapped, to facilitate the production of maps of surrounding villages, and to determine where infrastructure investments are likely to be most effective.

Note: If you run all the queries in this post, you will be charged approximately $1 based on the number of bytes scanned. All queries used in this post can be found in this GitHub gist.

What is OpenStreetMap?

As an open content project, regular OSM data archives are made available to the public via planet.openstreetmap.org in a few different formats (XML, PBF). This includes both snapshots of the current state of data in OSM as well as historical archives.

Working with “the planet” (as the data archives are referred to) can be unwieldy. Because it contains data spanning the entire world, the size of a single archive is on the order of 50 GB. The format is bespoke and extremely specific to OSM. The data is incredibly rich, interesting, and useful, but the size, format, and tooling can often make it very difficult to even start the process of asking complex questions.

Heavy users of OSM data typically download the raw data and import it into their own systems, tailored for their individual use cases, such as map rendering, driving directions, or general analysis. Now that OSM data is available in the Apache ORC format on Amazon S3, it’s possible to query the data using Athena without even downloading it.

How does Athena help?

You can use Athena along with data made publicly available via OSM on AWS. You don’t have to learn how to install, configure, and populate your own server instances and go through multiple steps to download and transform the data into a queryable form. Thanks to AWS and partners, a regularly updated copy of the planet file (available within hours of OSM’s weekly publishing schedule) is hosted on S3 and made available in a format that lends itself to efficient querying using Athena.

Asking questions with Athena involves registering the OSM planet file as a table and making SQL queries. That’s it. Nothing to download, nothing to configure, nothing to ingest. Athena distributes your queries and returns answers within seconds, even while querying over 9 years and billions of OSM elements.

You’re in control. S3 provides high availability for the data and Athena charges you per TB of data scanned. Plus, we’ve gone through the trouble of keeping scanning charges as small as possible by transcoding OSM’s bespoke format as ORC. All the hard work of transforming the data into something highly queryable and making it publicly available is done; you just need to bring some questions.

Registering Tables

The OSM Public Datasets consist of three tables:

  • planet
    Contains the current versions of all elements present in OSM.
  • planet_history
    Contains a historical record of all versions of all elements (even those that have been deleted).
  • changesets
    Contains information about changesets in which elements were modified (and which have a foreign key relationship to both the planet and planet_history tables).

To register the OSM Public Datasets within your AWS account so you can query them, open the Athena console (make sure you are using the us-east-1 region) to paste and execute the following table definitions:


  id BIGINT,
  type STRING,
  lat DECIMAL(9,7),
  lon DECIMAL(10,7),
  members ARRAY<STRUCT<type: STRING, ref: BIGINT, role: STRING>>,
  changeset BIGINT,
  timestamp TIMESTAMP,
  uid BIGINT,
  user STRING,
  version BIGINT
LOCATION 's3://osm-pds/planet/';


CREATE EXTERNAL TABLE planet_history (
    id BIGINT,
    type STRING,
    lat DECIMAL(9,7),
    lon DECIMAL(10,7),
    nds ARRAY<STRUCT<ref: BIGINT>>,
    members ARRAY<STRUCT<type: STRING, ref: BIGINT, role: STRING>>,
    changeset BIGINT,
    timestamp TIMESTAMP,
    uid BIGINT,
    user STRING,
    version BIGINT,
    visible BOOLEAN
LOCATION 's3://osm-pds/planet-history/';


    id BIGINT,
    created_at TIMESTAMP,
    open BOOLEAN,
    closed_at TIMESTAMP,
    comments_count BIGINT,
    min_lat DECIMAL(9,7),
    max_lat DECIMAL(9,7),
    min_lon DECIMAL(10,7),
    max_lon DECIMAL(10,7),
    num_changes BIGINT,
    uid BIGINT,
    user STRING
LOCATION 's3://osm-pds/changesets/';


Under the Hood: Extract, Transform, Load

So, what happens behind the scenes to make this easier for you? In a nutshell, the data is transcoded from the OSM PBF format into Apache ORC.

There’s an AWS Lambda function (running every 15 minutes, triggered by CloudWatch Events) that watches planet.openstreetmap.org for the presence of weekly updates (using rsync). If that function detects that a new version has become available, it submits a set of AWS Batch jobs to mirror, transcode, and place it as the “latest” version. Code for this is available at osm-pds-pipelines GitHub repo.

To facilitate the transcoding into a format appropriate for Athena, we have produced an open source tool, OSM2ORC. The tool also includes an Osmosis plugin that can be used with complex filtering pipelines and outputs an ORC file that can be uploaded to S3 for use with Athena, or used locally with other tools from the Hadoop ecosystem.

What types of questions can OpenStreetMap answer?

There are many uses for OpenStreetMap data; here are three major ones and how they may be addressed using Athena.

Case Study 1: Finding Local Health Centers in West Africa

When the American Red Cross mapped more than 7,000 communities in West Africa in areas affected by the Ebola epidemic as part of the Missing Maps effort, they found themselves in a position where collecting a wide variety of data was both important and incredibly beneficial for others. Accurate maps play a critical role in understanding human communities, especially for populations at risk. The lack of detailed maps for West Africa posed a problem during the 2014 Ebola crisis, so collecting and producing data around the world has the potential to improve disaster responses in the future.

As part of the data collection, volunteers collected locations and information about local health centers, something that will facilitate treatment in future crises (and, more importantly, on a day-to-day basis). Combined with information about access to markets and clean drinking water and historical experiences with natural disasters, this data was used to create a vulnerability index to select communities for detailed mapping.

For this example, you find all health centers in West Africa (many of which were mapped as part of Missing Maps efforts). This is something that healthsites.io does for the public (worldwide and editable, based on OSM data), but you’re working with the raw data.

Here’s a query to fetch information about all health centers, tagged as nodes (points), in Guinea, Sierra Leone, and Liberia:

SELECT * from planet
WHERE type = 'node'
  AND tags['amenity'] IN ('hospital', 'clinic', 'doctors')
  AND lon BETWEEN -15.0863 AND -7.3651
  AND lat BETWEEN 4.3531 AND 12.6762;

Buildings, as “ways” (polygons, in this case) assembled from constituent nodes (points), can also be tagged as medical facilities. In order to find those, you need to reassemble geometries. Here you’re taking the average of all nodes that make up a building (which will be the approximate center point, which is close enough for this purpose). Here is a query that finds both buildings and points that are tagged as medical facilities:

-- select out nodes and relevant columns
WITH nodes AS (
  FROM planet
  WHERE type = 'node'
-- select out ways and relevant columns
ways AS (
  FROM planet
  WHERE type = 'way'
    AND tags['amenity'] IN ('hospital', 'clinic', 'doctors')
-- filter nodes to only contain those present within a bounding box
nodes_in_bbox AS (
  FROM nodes
  WHERE lon BETWEEN -15.0863 AND -7.3651
    AND lat BETWEEN 4.3531 AND 12.6762
-- find ways intersecting the bounding box
  AVG(nodes.lat) lat,
  AVG(nodes.lon) lon
FROM ways
JOIN nodes_in_bbox nodes ON nodes.id = nd.ref
GROUP BY (ways.type, ways.id, ways.tags)
FROM nodes_in_bbox
WHERE tags['amenity'] IN ('hospital', 'clinic', 'doctors');

You could go a step further and query for additional tags included with these (for example, opening_hours) and use that as a metric for measuring “completeness” of the dataset and to focus on additional data to collect (and locations to fill out).

Case Study 2: Generating statistics about mapathons

OSM has a history of holding mapping parties. Mapping parties are events where interested people get together and wander around outside, gathering and improving information about sites (and sights) that they pass. Another form of mapping party is the mapathon, which brings together armchair and desk mappers to focus on improving data in another part of the world.

Mapathons are a popular way of enlisting volunteers for Missing Maps, a collaboration between many NGOs, education institutions, and civil society groups that aims to map the most vulnerable places in the developing world to support international and local NGOs and individuals. One common way that volunteers participate is to trace buildings and roads from aerial imagery, providing baseline data that can later be verified by Missing Maps staff and volunteers working in the areas being mapped.

(Image and data from the American Red Cross)

Data collected during these events lends itself to a couple different types of questions. People like competition, so Missing Maps has developed a set of leaderboards that allow people to see where they stand relative to other mappers and how different groups compare. To facilitate this, hashtags (such as #missingmaps) are included in OSM changeset comments. To do similar ad hoc analysis, you need to query the list of changesets, filter by the presence of certain hashtags in the comments, and group things by username.

Now, find changes made during Missing Maps mapathons at George Mason University (using the #gmu hashtag):

FROM changesets
WHERE regexp_like(tags['comment'], '(?i)#gmu');

This includes all tags associated with a changeset, which typically include a mapper-provided comment about the changes made (often with additional hashtags corresponding to OSM Tasking Manager projects) as well as information about the editor used, imagery referenced, etc.

If you’re interested in the number of individual users who have mapped as part of the Missing Maps project, you can write a query such as:

FROM changesets
WHERE regexp_like(tags['comment'], '(?i)#missingmaps');

25,610 people (as of this writing)!

Back at GMU, you’d like to know who the most prolific mappers are:

SELECT user, count(*) AS edits
FROM changesets
WHERE regexp_like(tags['comment'], '(?i)#gmu')
ORDER BY count(*) DESC;

Nice job, BrokenString!

It’s also interesting to see what types of features were added or changed. You can do that by using a JOIN between the changesets and planet tables:

SELECT planet.*, changesets.tags
FROM planet
JOIN changesets ON planet.changeset = changesets.id
WHERE regexp_like(changesets.tags['comment'], '(?i)#gmu');

Using this as a starting point, you could break down the types of features, highlight popular parts of the world, or do something entirely different.

Case Study 3: Building Condition

With building outlines having been produced by mappers around (and across) the world, local Missing Maps volunteers (often from local Red Cross / Red Crescent societies) go around with Android phones running OpenDataKit  and OpenMapKit to verify that the buildings in question actually exist and to add additional information about them, such as the number of stories, use (residential, commercial, etc.), material, and condition.

This data can be used in many ways: it can provide local geographic context (by being included in map source data) as well as facilitate investment by development agencies such as the World Bank.

Here are a collection of buildings mapped in Dhaka, Bangladesh:

(Map and data © OpenStreetMap contributors)

For NGO staff to determine resource allocation, it can be helpful to enumerate and show buildings in varying conditions. Building conditions within an area can be a means of understanding where to focus future investments.

Querying for buildings is a bit more complicated than working with points or changesets. Of the three core OSM element types—node, way, and relation, only nodes (points) have geographic information associated with them. Ways (lines or polygons) are composed of nodes and inherit vertices from them. This means that ways must be reconstituted in order to effectively query by bounding box.

This results in a fairly complex query. You’ll notice that this is similar to the query used to find buildings tagged as medical facilities above. Here you’re counting buildings in Dhaka according to building condition:

-- select out nodes and relevant columns
WITH nodes AS (
  FROM planet
  WHERE type = 'node'
-- select out ways and relevant columns
ways AS (
  FROM planet
  WHERE type = 'way'
-- filter nodes to only contain those present within a bounding box
nodes_in_bbox AS (
  FROM nodes
  WHERE lon BETWEEN 90.3907 AND 90.4235
    AND lat BETWEEN 23.6948 AND 23.7248
-- fetch and expand referenced ways
referenced_ways AS (
  FROM ways
  JOIN nodes_in_bbox nodes ON nodes.id = nd.ref
-- fetch *all* referenced nodes (even those outside the queried bounding box)
exploded_ways AS (
    nodes.id node_id,
    ARRAY[nodes.lat, nodes.lon] coordinates
  FROM referenced_ways ways
  JOIN nodes ON nodes.id = nd.ref
  ORDER BY ways.id, idx
-- query ways matching the bounding box
FROM exploded_ways
GROUP BY tags['building:condition']
ORDER BY count(*) DESC;

Most buildings are unsurveyed (125,000 is a lot!), but of those that have been, most are average (as you’d expect). If you were to further group these buildings geographically, you’d have a starting point to determine which areas of Dhaka might benefit the most.


OSM data, while incredibly rich and valuable, can be difficult to work with, due to both its size and its data model. In addition to the time spent downloading large files to work with locally, time is spent installing and configuring tools and converting the data into more queryable formats. We think Amazon Athena combined with the ORC version of the planet file, updated on a weekly basis, is an extremely powerful and cost-effective combination. This allows anyone to start querying billions of records with simple SQL in no time, giving you the chance to focus on the analysis, not the infrastructure.

To download the data and experiment with it using other tools, the latest OSM ORC-formatted file is available via OSM on AWS at s3://osm-pds/planet/planet-latest.orcs3://osm-pds/planet-history/history-latest.orc, and s3://osm-pds/changesets/changesets-latest.orc.

We look forward to hearing what you find out!

The command-line, for cybersec

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/01/the-command-line-for-cybersec.html

On Twitter I made the mistake of asking people about command-line basics for cybersec professionals. A got a lot of useful responses, which I summarize in this long (5k words) post. It’s mostly driven by the tools I use, with a bit of input from the tweets I got in response to my query.


By command-line this document really means bash.

There are many types of command-line shells. Windows has two, ‘cmd.exe’ and ‘PowerShell’. Unix started with the Bourne shell ‘sh’, and there have been many variations of this over the years, ‘csh’, ‘ksh’, ‘zsh’, ‘tcsh’, etc. When GNU rewrote Unix user-mode software independently, they called their shell “Bourne Again Shell” or “bash” (queue “JSON Bourne” shell jokes here).

Bash is the default shell for Linux and macOS. It’s also available on Windows, as part of their special “Windows Subsystem for Linux”. The windows version of ‘bash’ has become my most used shell.

For Linux IoT devices, BusyBox is the most popular shell. It’s easy to clear, as it includes feature-reduced versions of popular commands.


‘Man’ is the command you should not run if you want help for a command.

Man pages are designed to drive away newbies. They are only useful if you already mostly an expert with the command you desire help on. Man pages list all possible features of a program, but do not highlight examples of the most common features, or the most common way to use the commands.

Take ‘sed’ as an example. It’s used most commonly to do a search-and-replace in files, like so:

$ sed ‘s/rob/dave/’ foo.txt

This usage is so common that many non-geeks know of it. Yet, if you type ‘man sed’ to figure out how to do a search and replace, you’ll get nearly incomprehensible gibberish, and no example of this most common usage.

I point this out because most guides on using the shell recommend ‘man’ pages to get help. This is wrong, it’ll just endlessly frustrate you. Instead, google the commands you need help on, or better yet, search StackExchange for answers.

You might try asking questions, like on Twitter or forum sites, but this requires a strategy. If you ask a basic question, self-important dickholes will respond by telling you to “rtfm” or “read the fucking manual”. A better strategy is to exploit their dickhole nature, such as saying “too bad command xxx cannot do yyy”. Helpful people will gladly explain why you are wrong, carefully explaining how xxx does yyy.

If you must use ‘man’, use the ‘apropos’ command to find the right man page. Sometimes multiple things in the system have the same or similar names, leading you to the wrong page.

apt-get install yum

Using the command-line means accessing that huge open-source ecosystem. Most of the things in this guide do no already exist on the system. You have to either compile them from source, or install via a package-manager. Linux distros ship with a small footprint, but have a massive database of precompiled software “packages” in the cloud somewhere. Use the “package manager” to install the software from the cloud.

On Debian-derived systems (like Ubuntu, Kali, Raspbian), type “apt-get install masscan” to install “masscan” (as an example). Use “apt-cache search scan” to find a bunch of scanners you might want to install.

On RedHat systems, use “yum” instead. On BSD, use the “ports” system, which you can also get working for macOS.

If no pre-compiled package exists for a program, then you’ll have to download the source code and compile it. There’s about an 80% chance this will work easy, following the instructions. There is a 20% chance you’ll experience “dependency hell”, for example, needing to install two mutually incompatible versions of Python.

Bash is a scripting language

Don’t forget that shells are really scripting languages. The bit that executes a single command is just a degenerate use of the scripting language. For example, you can do a traditional for loop like:

$ for i in $(seq 1 9); do echo $i; done

In this way, ‘bash’ is no different than any other scripting language, like Perl, Python, NodeJS, PHP CLI, etc. That’s why a lot of stuff on the system actually exists as short ‘bash’ programs, aka. shell scripts.

Few want to write bash scripts, but you are expected to be able to read them, either to tweek existing scripts on the system, or to read StackExchange help.

File system commands

The macOS “Finder” or Windows “File Explorer” are just graphical shells that help you find files, open, and save them. The first commands you learn are for the same functionality on the command-line: pwd, cd, ls, touch, rm, rmdir, mkdir, chmod, chown, find, ln, mount.

The command “rm –rf /” removes everything starting from the root directory. This will also follow mounted server directories, deleting files on the server. I point this out to give an appreciation of the raw power you have over the system from the command-line, and how easy you can disrupt things.

Of particular interest is the “mount” command. Desktop versions of Linux typically mount USB flash drives automatically, but on servers, you need to do it manually, e.g.:

$ mkdir ~/foobar
$ mount /dev/sdb ~/foobar

You’ll also use the ‘mount’ command to connect to file servers, using the “cifs” package if they are Windows file servers:

# apt-get install cifs-utils
# mkdir /mnt/vids
# mount -t cifs -o username=robert,password=foobar123  // /mnt/vids

Linux system commands

The next commands you’ll learn are about syadmin the Linux system: ps, top, who, history, last, df, du, kill, killall, lsof, lsmod, uname, id, shutdown, and so on.

The first thing hackers do when hacking into a system is run “uname” (to figure out what version of the OS is running) and “id” (to figure out which account they’ve acquired, like “root” or some other user).

The Linux system command I use most is “dmesg” (or ‘tail –f /var/log/dmesg’) which shows you the raw system messages. For example, when I plug in USB drives to a server, I look in ‘dmesg’ to find out which device was added so that I can mount it. I don’t know if this is the best way, it’s just the way I do it (servers don’t automount USB drives like desktops do).

Networking commands

The permanent state of the network (what gets configured on the next bootup) is configured in text files somewhere. But there are a wealth of commands you’ll use to view the current state of networking, make temporary changes, and diagnose problems.

The ‘ifconfig’ command has long been used to view the current TCP/IP configuration and make temporary changes. Learning how TCP/IP works means playing a lot with ‘ifconfig’. Use “ifconfig –a” for even more verbose information.

Use the “route” command to see if you are sending packets to the right router.

Use ‘arp’ command to make sure you can reach the local router.

Use ‘traceroute’ to make sure packets are following the correct route to their destination. You should learn the nifty trick it’s based on (TTLs). You should also play with the TCP, UDP, and ICMP options.

Use ‘ping’ to see if you can reach the target across the Internet. Usefully measures the latency in milliseconds, and congestion (via packet loss). For example, ping NetFlix throughout the day, and notice how the ping latency increases substantially during “prime time” viewing hours.

Use ‘dig’ to make sure DNS resolution is working right. (Some use ‘nslookup’ instead). Dig is useful because it’s the raw universal DNS tool – every time they add some new standard feature to DNS, they add that feature into ‘dig’ as well.

The ‘netstat –tualn’ command views the current TCP/IP connections and which ports are listening. I forget what the various options “tualn” mean, only it’s the output I always want to see, rather than the raw “netstat” command by itself.

You’ll want to use ‘ethtool –k’ to turn off checksum and segmentation offloading. These are features that break packet-captures sometimes.

There is this new fangled ‘ip’ system for Linux networking, replacing many of the above commands, but as an old timer, I haven’t looked into that.

Some other tools for diagnosing local network issues are ‘tcpdump’, ‘nmap’, and ‘netcat’. These are described in more detail below.


In general, you’ll remotely log into a system in order to use the command-line. We use ‘ssh’ for that. It uses a protocol similar to SSL in order to encrypt the connection. There are two ways to use ‘ssh’ to login, with a password or with a client-side certificate.

When using SSH with a password, you type “ssh [email protected]”. The remote system will then prompt you for a password for that account.

When using client-side certificates, use “ssh-keygen” to generate a key, then either copy the public-key of the client to the server manually, or use “ssh-copy-id” to copy it using the password method above.

How this works is basic application of public-key cryptography. When logging in with a password, you get a copy of the server’s public-key the first time you login, and if it ever changes, you get a nasty warning that somebody may be attempting a man in the middle attack.

$ ssh [email protected]

When using client-side certificates, the server trusts your public-key. This is similar to how client-side certificates work in SSL VPNs.

You can use SSH for things other than loging into a remote shell. You can script ‘ssh’ to run commands remotely on a system in a local shell script. You can use ‘scp’ (SSH copy) to transfer files to and from a remote system. You can do tricks with SSH to create tunnels, which is popular way to bypass the restrictive rules of your local firewall nazi.


This is your general cryptography toolkit, doing everything from simple encryption, to public-key certificate signing, to establishing SSL connections.

It is extraordinarily user hostile, with terrible inconsistency among options. You can only figure out how to do things by looking up examples on the net, such as on StackExchange. There are competing SSL libraries with their own command-line tools, like GnuTLS and Mozilla NSS that you might find easier to use.

The fundamental use of the ‘openssl’ tool is to create public-keys, “certificate requests”, and creating self-signed certificates. All the web-site certificates I’ve ever obtained has been using the openssl command-line tool to create CSRs.

You should practice using the ‘openssl’ tool to encrypt files, sign files, and to check signatures.

You can use openssl just like PGP for encrypted emails/messages, but following the “S/MIME” standard rather than PGP standard. You might consider learning the ‘pgp’ command-line tools, or the open-source ‘gpg’ or ‘gpg2’ tools as well.

You should learn how to use the “openssl s_client” feature to establish SSL connections, as well as the “openssl s_server” feature to create an SSL proxy for a server that doesn’t otherwise support SSL.

Learning all the ways of using the ‘openssl’ tool to do useful things will go a long way in teaching somebody about crypto and cybersecurity. I can imagine an entire class consisting of nothing but learning ‘openssl’.

netcat (nc, socat, cyptocat, ncat)

A lot of Internet protocols are based on text. That means you can create a raw TCP connection to the service and interact with them using your keyboard. The classic tool for doing this is known as “netcat”, abbreviated “nc”. For example, connect to Google’s web server at port and type the HTTP HEAD command followed by a blank line (hit [return] twice):

$ nc www.google.com 80

HTTP/1.0 200 OK
Date: Tue, 17 Jan 2017 01:53:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP=”This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info.”
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=95=o7GT1uJCWTPhaPAefs4CcqF7h7Yd7HEqPdAJncZfWfDSnNfliWuSj3XfS5GJXGt67-QJ9nc8xFsydZKufBHLj-K242C3_Vak9Uz1TmtZwT-1zVVBhP8limZI55uXHuPrejAxyTxSCgR6MQ; expires=Wed, 19-Jul-2017 01:53:28 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding

Another classic example is to connect to port 25 on a mail server to send email, spoofing the “MAIL FROM” address.

There are several versions of ‘netcat’ that work over SSL as well. My favorite is ‘ncat’, which comes with ‘nmap’, as it’s actively maintained. In theory, “openssl s_client” should also work this way.


At some point, you’ll need to port scan. The standard program for this is ‘nmap’, and it’s the best. The classic way of using it is something like:

# nmap –A scanme.nmap.org

The ‘-A’ option means to enable all the interesting features like OS detection, version detection, and basic scripts on the most common ports that a server might have open. It takes awhile to run. The “scanme.nmap.org” is a good site to practice on.

Nmap is more than just a port scanner. It has a rich scripting system for probing more deeply into a system than just a port, and to gather more information useful for attacks. The scripting system essentially contains some attacks, such as password guessing.

Scanning the Internet, finding services identified by ‘nmap’ scripts, and interacting with them with tools like ‘ncat’ will teach you a lot about how the Internet works.

BTW, if ‘nmap’ is too slow, using ‘masscan’ instead. It’s a lot faster, though has much more limited functionality.

Packet sniffing with tcpdump and tshark

All Internet traffic consists of packets going between IP addresses. You can capture those packets and view them using “packet sniffers”. The most important packet-sniffer is “Wireshark”, a GUI. For the command-line, there is ‘tcpdump’ and ‘tshark’.

You can run tcpdump on the command-line to watch packets go in/out of the local computer. This performs a quick “decode” of packets as they are captured. It’ll reverse-lookup IP addresses into DNS names, which means its buffers can overflow, dropping new packets while it’s waiting for DNS name responses for previous packets (which can be disabled with -n):

# tcpdump –p –i eth0

A common task is to create a round-robin set of files, saving the last 100 files of 1-gig each. Older files are overwritten. Thus, when an attack happens, you can stop capture, and go backward in times and view the contents of the network traffic using something like Wireshark:

# tcpdump –p -i eth0 -s65535 –C 1000 –W 100 –w cap

Instead of capturing everything, you’ll often set “BPF” filters to narrow down to traffic from a specific target, or a specific port.

The above examples use the –p option to capture traffic destined to the local computer. Sometimes you may want to look at all traffic going to other machines on the local network. You’ll need to figure out how to tap into wires, or setup “monitor” ports on switches for this to work.

A more advanced command-line program is ‘tshark’. It can apply much more complex filters. It can also be used to extract the values of specific fields and dump them to a text files.


These are some rather trivial commands, but you should know them.

The ‘base64’ command encodes binary data in text. The text can then be passed around, such as in email messages. Base64 encoding is often automatic in the output from programs like openssl and PGP.

In many cases, you’ll need to view a hex dump of some binary data. There are many programs to do this, such as hexdump, xxd, od, and more.


Grep searches for a pattern within a file. More important, it searches for a regular expression (regex) in a file. The fu of Unix is that a lot of stuff is stored in text files, and use grep for regex patterns in order to extra stuff stored in those files.

The power of this tool really depends on your mastery of regexes. You should master enough that you can understand StackExhange posts that explain almost what you want to do, and then tweek them to make them work.

Grep, by default, shows only the matching lines. In many cases, you only want the part that matches. To do that, use the –o option. (This is not available on all versions of grep).

You’ll probably want the better, “extended” regular expressions, so use the –E option.

You’ll often want “case-insensitive” options (matching both upper and lower case), so use the –i option.

For example, to extract all MAC address from a text file, you might do something like the following. This extracts all strings that are twelve hex digits.

$ grep –Eio ‘[0-9A-F]{12}’ foo.txt

Text processing

Grep is just the first of the various “text processing filters”. Other useful ones include ‘sed’, ‘cut’, ‘sort’, and ‘uniq’.

You’ll be an expert as piping output of one to the input of the next. You’ll use “sort | uniq” as god (Dennis Ritchie) intended and not the heresy of “sort –u”.

You might want to master ‘awk’. It’s a new programming language, but once you master it, it’ll be easier than other mechanisms.

You’ll end up using ‘wc’ (word-count) a lot. All it does is count the number of lines, words, characters in a file, but you’ll find yourself wanting to do this a lot.

csvkit and jq

You get data in CSV format and JSON format a lot. The tools ‘csvkit’ and ‘jq’ respectively help you deal with those tools, to convert these files into other formats, sticking the data in databases, and so forth.

It’ll be easier using these tools that understand these text formats to extract data than trying to write ‘awk’ command or ‘grep’ regexes.


Most files are binary with a few readable ASCII strings. You use the program ‘strings’ to extract those strings.

This one simple trick sounds stupid, but it’s more powerful than you’d think. For example, I knew that a program probably contained a hard-coded password. I then blindly grabbed all the strings in the program’s binary file and sent them to a password cracker to see if they could decrypt something. And indeed, one of the 100,000 strings in the file worked, thus finding the hard-coded password.

tail -f

So ‘tail’ is just a standard Linux tool for looking at the end of files. If you want to keep checking the end of a live file that’s constantly growing, then use “tail –f”. It’ll sit there waiting for something new to be added to the end of the file, then print it out. I do this a lot, so I thought it’d be worth mentioning.

tar –xvfz, gzip, xz, 7z

In prehistorical times (like the 1980s), Unix was backed up to tape drives. The tar command could be used to combine a bunch of files into a single “archive” to be sent to the tape drive, hence “tape archive” or “tar”.

These days, a lot of stuff you download will be in tar format (ending in .tar). You’ll need to learn how to extract it:

$ tar –xvf something.tar

Nobody knows what the “xvf” options mean anymore, but these letters most be specified in that order. I’m joking here, but only a little: somebody did a survey once and found that virtually nobody know how to use ‘tar’ other than the canned formulas such as this.

Along with combining files into an archive you also need to compress them. In prehistoric Unix, the “compress” command would be used, which would replace a file with a compressed version ending in ‘.z’. This would found to be encumbered with patents, so everyone switched to ‘gzip’ instead, which replaces a file with a new one ending with ‘.gz’.

$ ls foo.txt*
$ gzip foo.txt
$ ls foo.txt*

Combined with tar, you get files with either the “.tar.gz” extension, or simply “.tgz”. You can untar and uncompress at the same time:

$ tar –xvfz something .tar.gz

Gzip is always good enough, but nerds gonna nerd and want to compress with slightly better compression programs. They’ll have extensions like “.bz2”, “.7z”, “.xz”, and so on. There are a ton of them. Some of them are supported directly by the ‘tar’ program:

$ tar –xvfj something.tar.bz2

Then there is the “zip/unzip” program, which supports Windows .zip file format. To create compressed archives these days, I don’t bother with tar, but just use the ZIP format. For example, this will recursively descend a directory, adding all files to a ZIP file that can easily be extracted under Windows:

$ zip –r test.zip ./test/


I should include this under the system tools at the top, but it’s interesting for a number of purposes. The usage is simply to copy one file to another, the in-file to the out-file.

$ dd if=foo.txt of=foo2.txt

But that’s not interesting. What interesting is using it to write to “devices”. The disk drives in your system also exist as raw devices under the /dev directory.

For example, if you want to create a boot USB drive for your Raspberry Pi:

# dd if=rpi-ubuntu.img of=/dev/sdb

Or, you might want to hard erase an entire hard drive by overwriting random data:

# dd if=/dev/urandom of=/dev/sdc

Or, you might want to image a drive on the system, for later forensics, without stumbling on things like open files.

# dd if=/dev/sda of=/media/Lexar/infected.img

The ‘dd’ program has some additional options, like block size and so forth, that you’ll want to pay attention to.

screen and tmux

You log in remotely and start some long running tool. Unfortunately, if you log out, all the processes you started will be killed. If you want it to keep running, then you need a tool to do this.

I use ‘screen’. Before I start a long running port scan, I run the “screen” command. Then, I type [ctrl-a][ctrl-d] to disconnect from that screen, leaving it running in the background.

Then later, I type “screen –r” to reconnect to it. If there are more than one screen sessions, using ‘-r’ by itself will list them all. Use “-r pid” to reattach to the proper one. If you can’t, then use “-D pid” or “-D –RR pid” to forced the other session to detached from whoever is using it.

Tmux is an alternative to screen that many use. It’s cool for also having lots of terminal screens open at once.

curl and wget

Sometimes you want to download files from websites without opening a browser. The ‘curl’ and ‘wget’ programs do that easily. Wget is the traditional way of doing this, but curl is a bit more flexible. I use curl for everything these days, except mirroring a website, in which case I just do “wget –m website”.

The thing that makes ‘curl’ so powerful is that it’s really designed as a tool for poking and prodding all the various features of HTTP. That it’s also useful for downloading files is a happy coincidence. When playing with a target website, curl will allow you do lots of complex things, which you can then script via bash. For example, hackers often write their cross-site scripting/forgeries in bash scripts using curl.


As mentioned above, bash is its own programming language. But it’s weird, and annoying. So sometimes you want a real programming language. Here are some useful ones.

Yes, PHP is a language that runs in a web server for creating web pages. But if you know the language well, it’s also a fine command-line language for doing stuff.

Yes, JavaScript is a language that runs in the web browser. But if you know it well, it’s also a great language for doing stuff, especially with the “nodejs” version.

Then there are other good command line languages, like the Python, Ruby, Lua, and the venerable Perl.

What makes all these great is the large library support. Somebody has already written a library that nearly does what you want that can be made to work with a little bit of extra code of your own.

My general impression is that Python and NodeJS have the largest libraries likely to have what you want, but you should pick whichever language you like best, whichever makes you most productive. For me, that’s NodeJS, because of the great Visual Code IDE/debugger.

iptables, iptables-save

I shouldn’t include this in the list. Iptables isn’t a command-line tool as such. The tool is the built-in firewalling/NAT features within the Linux kernel. Iptables is just the command to configure it.

Firewalling is an important part of cybersecurity. Everyone should have some experience playing with a Linux system doing basic firewalling tasks: basic rules, NATting, and transparent proxying for mitm attacks.

Use ‘iptables-save’ in order to persistently save your changes.


Similar to ‘iptables’, ‘mysql’ isn’t a tool in its own right, but a way of accessing a database maintained by another process on the system.

Filters acting on text files only goes so far. Sometimes you need to dump it into a database, and make queries on that database.

There is also the offensive skill needed to learn how targets store things in a database, and how attackers get the data.

Hackers often publish raw SQL data they’ve stolen in their hacks (like the Ashley-Madisan dump). Being able to stick those dumps into your own database is quite useful. Hint: disable transaction logging while importing mass data.

If you don’t like SQL, you might consider NoSQL tools like Elasticsearch, MongoDB, and Redis that can similarly be useful for arranging and searching data. You’ll probably have to learn some JSON tools for formatting the data.

Reverse engineering tools

A cybersecurity specialty is “reverse engineering”. Some want to reverse engineer the target software being hacked, to understand vulnerabilities. This is needed for commercial software and device firmware where the source code is hidden. Others use these tools to analyze viruses/malware.

The ‘file’ command uses heuristics to discover the type of a file.

There’s a whole skillset for analyzing PDF and Microsoft Office documents. I play with pdf-parser. There’s a long list at this website:

There’s a whole skillset for analyzing executables. Binwalk is especially useful for analyzing firmware images.

Qemu is useful is a useful virtual-machine. It can emulate full systems, such as an IoT device based on the MIPS processor. Like some other tools mentioned here, it’s more a full subsystem than a simple command-line tool.

On a live system, you can use ‘strace’ to view what system calls a process is making. Use ‘lsof’ to view which files and network connections a process is making.

Password crackers

A common cybersecurity specialty is “password cracking”. There’s two kinds: online and offline password crackers.

Typical online password crackers are ‘hydra’ and ‘medusa’. They can take files containing common passwords and attempt to log on to various protocols remotely, like HTTP, SMB, FTP, Telnet, and so on. I used ‘hydra’ recently in order to find the default/backdoor passwords to many IoT devices I’ve bought recently in my test lab.

Online password crackers must open TCP connections to the target, and try to logon. This limits their speed. They also may be stymied by systems that lock accounts, or introduce delays, after too many bad password attempts.

Typical offline password crackers are ‘hashcat’ and ‘jtr’ (John the Ripper). They work off of stolen encrypted passwords. They can attempt billions of passwords-per-second, because there’s no network interaction, nothing slowing them down.

Understanding offline password crackers means getting an appreciation for the exponential difficulty of the problem. A sufficiently long and complex encrypted password is uncrackable. Instead of brute-force attempts at all possible combinations, we must use tricks, like mutating the top million most common passwords.

I use hashcat because of the great GPU support, but John is also a great program.

WiFi hacking

A common specialty in cybersecurity is WiFi hacking. The difficulty in WiFi hacking is getting the right WiFi hardware that supports the features (monitor mode, packet injection), then the right drivers installed in your operating system. That’s why I use Kali rather than some generic Linux distribution, because it’s got the right drivers installed.

The ‘aircrack-ng’ suite is the best for doing basic hacking, such as packet injection. When the parents are letting the iPad babysit their kid with a loud movie at the otherwise quite coffeeshop, use ‘aircrack-ng’ to deauth the kid.

The ‘reaver’ tool is useful for hacking into sites that leave WPS wide open and misconfigured.

Remote exploitation

A common specialty in cybersecurity is pentesting.

Nmap, curl, and netcat (described above) above are useful tools for this.

Some useful DNS tools are ‘dig’ (described above), dnsrecon/dnsenum/fierce that try to enumerate and guess as many names as possible within a domain. These tools all have unique features, but also have a lot of overlap.

Nikto is a basic tool for probing for common vulnerabilities, out-of-date software, and so on. It’s not really a vulnerability scanner like Nessus used by defenders, but more of a tool for attack.

SQLmap is a popular tool for probing for SQL injection weaknesses.

Then there is ‘msfconsole’. It has some attack features. This is humor – it has all the attack features. Metasploit is the most popular tool for running remote attacks against targets, exploiting vulnerabilities.

Text editor

Finally, there is the decision of text editor. I use ‘vi’ variants. Others like ‘nano’ and variants. There’s no wrong answer as to which editor to use, unless that answer is ‘emacs’.


Obviously, not every cybersecurity professional will be familiar with every tool in this list. If you don’t do reverse-engineering, then you won’t use reverse-engineering tools.

On the other hand, regardless of your specialty, you need to know basic crypto concepts, so you should know something like the ‘openssl’ tool. You need to know basic networking, so things like ‘nmap’ and ‘tcpdump’. You need to be comfortable processing large dumps of data, manipulating it with any tool available. You shouldn’t be frightened by a little sysadmin work.

The above list is therefore a useful starting point for cybersecurity professionals. Of course, those new to the industry won’t have much familiarity with them. But it’s fair to say that I’ve used everything listed above at least once in the last year, and the year before that, and the year before that. I spend a lot of time on StackExchange and Google searching the exact options I need, so I’m not an expert, but I am familiar with the basic use of all these things.

Joining and Enriching Streaming Data on Amazon Kinesis

Post Syndicated from Assaf Mentzer original https://aws.amazon.com/blogs/big-data/joining-and-enriching-streaming-data-on-amazon-kinesis/

Are you trying to move away from a batch-based ETL pipeline? You might do this, for example, to get real-time insights into your streaming data, such as clickstream, financial transactions, sensor data, customer interactions, and so on.  If so, it’s possible that as soon as you get down to requirements, you realize your streaming data doesn’t have all of the fields you need for real-time processing, and you are really missing that JOIN keyword!

You might also have requirements to enrich the streaming data based on a static reference, a dynamic dataset, or even with another streaming data source.  How can you achieve this without sacrificing the velocity of the streaming data?

In this blog post, I provide three use cases and approaches for joining and enriching streaming data:

Joining streaming data with a relatively static dataset on Amazon S3 using Amazon Kinesis Analytics

In this use case, Amazon Kinesis Analytics can be used to define a reference data input on S3, and use S3 for enriching a streaming data source.

For example, bike share systems around the world can publish data files about available bikes and docks, at each station, in real time.  On bike-share system data feeds that follow the General Bikeshare Feed Specification (GBFS), there is a reference dataset that contains a static list of all stations, their capacities, and locations.

Let’s say you would like to enrich the bike-availability data (which changes throughout the day) with the bike station’s latitude and longitude (which is static) for downstream applications.  The architecture would look like this:


To illustrate how you can use Amazon Kinesis Analytics to do this, follow these steps to set up an AWS CloudFormation stack, which will do the following:

  • Continuously produce sample bike-availability data onto an Amazon Kinesis stream (with, by default, a single shard).
  • Generate a sample S3 reference data file.
  • Create an Amazon Kinesis Analytics application that performs the join with the reference data file.

To create the CloudFormation stack

  1. Open the AWS CloudFormation console and choose Create Stack. Make sure the US East (N. Virginia) region is selected.
  2. Choose Specify an Amazon S3 template URL, enter or paste the following URL, and then choose Next.
  1. For the stack name, type demo-data-enrichment-s3-reference-stack. Under the KinesisAnalyticsReferenceDataS3Bucket parameter, type the name of an S3 bucket in the US East (N. Virginia) region where the reference data will be uploaded.  This CloudFormation template will create a sample data file under KinesisAnalyticsReferenceDataS3Key. (Make sure the value of the KinesisAnalyticsReferenceDataS3Key parameter does not conflict with existing S3 objects in your bucket).You’ll also see parameters referencing an S3 bucket (LambdaKinesisAnalyticsApplicationCustomResourceCodeS3Bucket) and an associated S3 key. The S3 bucket and the S3 key represent the location of the AWS Lambda package (written in Python) that contains the AWS CloudFormation custom resources required to create an Amazon Kinesis application. Do not modify the values of these parameters.
  1. Follow the remaining steps in the Create Stack wizard (click “Next” button, and then the “Create” button). Wait until the status displayed for the stack is CREATE_COMPLETE.

You now have an Amazon Kinesis stream in your AWS account that has sample bike-availability streaming data which, if you followed the template, is named demo-data-enrichment-bike-availability-input-stream and an Amazon Kinesis Analytics application that performs the joining.

You might want to examine the reference data file generated under your bucket (its default name is demo_data_enrichment_s3_reference_data.json).  If you are using a JSON formatter/viewer, it will look like the following.  Note how the records are put together as top-level objects (no commas separating the records).

  "station_id": "s001",
  "name": "Liberty Ave",
  "lat": 40.6892,
  "lon": -74.0445
  "station_id": "s002",
  "name": "Empire St",
  "lat": 40.7484,
  "lon": -73.9857

In the CloudFormation template, the reference data has been added to the Amazon Kinesis application through the AddApplicationReferenceDataSource API method.

Next, open the Amazon Kinesis Analytics console and examine the application.  If you did not modify the default parameters, the application name should be demo-data-enrichment-s3-reference-application.

The Application status should be RUNNING.  If its status is still STARTING, wait until it changes to RUNNING. Choose Application details, and then go to SQL results. You should see a page similar to the following:


In the Amazon Kinesis Analytics SQL code, an in-application stream (DESTINATION_SQL_STREAM) is created with five columns. Data is inserted into the stream using a pump (STREAM_PUMP) by joining the source stream (SOURCE_SQL_STREAM_001) and the reference S3 data file (REFERENCE_DATA) using the station_id field. For more information about streams and pumps, see In-Application Streams and Pumps in the Amazon Kinesis Analytics Developer Guide.

The joined (enriched) data fields – station_name, station_lat and station_lon – should appear on the Real-time analytics tab, DESTINATION_SQL_STREAM.


The LEFT OUTER JOIN works just like the ANSI-standard SQL. When you use LEFT OUTER JOIN, the streaming records are preserved even if there is no matching station_id in the reference data.  You can delete LEFT OUTER (leaving only JOIN, which means INNER join), choose Save and then run SQL. Only the records with a matching station_id will appear in the output.


You can write the results to an Amazon Kinesis stream or to S3, Amazon Redshift, or Amazon Elasticsearch Service with an Amazon Kinesis Firehose delivery stream by adding a destination on the Destination tab.  For more information, see the Writing SQL on Streaming Data with Amazon Kinesis Analytics blog post.

Note: If you have not modified the default parameters in the CloudFormation template, S3 reference data source should have been added to your application.  If you are interested in adding S3 reference data source on your own streams, the following code snippet shows what it looks like in Python:

import boto3
kinesisanalytics_client = boto3.client('kinesisanalytics')

reference_data_s3_bucket = 'YOUR REFERENCE DATA S3 BUCKET'
reference_data_s3_key = 'YOUR REFERENCE DATA S3 KEY' #example: folder/sub-folder/reference.json
reference_data_role_arn = 'YOUR IAM ROLE ARN' #with s3:GetObject permission on the reference data s3 key. Example: arn:aws:iam::123456789012:role/service-role/role-name

response = kinesisanalytics_client.describe_application(
application_version_id = response['ApplicationDetail']['ApplicationVersionId']

        'TableName': 'REFERENCE_DATA',
        'S3ReferenceDataSource': {
            'BucketARN': 'arn:aws:s3:::%s' % reference_data_s3_bucket,
            'FileKey': reference_data_s3_key,
            'ReferenceRoleARN': reference_data_role_arn
        'ReferenceSchema': {
            'RecordFormat': {
                'RecordFormatType': 'JSON',
                'MappingParameters': {
                    'JSONMappingParameters': {
                        'RecordRowPath': '$'
            'RecordEncoding': 'UTF-8',
            'RecordColumns': [
                    'Name': 'station_id',
                    'Mapping': '$.station_id',
                    'SqlType': 'VARCHAR(4)'
                    'Name': 'name',
                    'Mapping': '$.name',
                    'SqlType': 'VARCHAR(64)'
                    'Name': 'lat',
                    'Mapping': '$.lat',
                    'SqlType': 'REAL'
                    'Name': 'lon',
                    'Mapping': '$.lon',
                    'SqlType': 'REAL'


This data enrichment approach works for a relatively static dataset because it requires you to upload the entire reference dataset to S3 whenever there is a change.  This approach might not work for cases where the dataset changes too frequently to catch up with the streaming data.

If you change the reference data stored in the S3 bucket, you need to use the UpdateApplication operation (using the API or AWS CLI) to refresh the data in the Amazon Kinesis Analytics in-application table.  You can use the following Python script to refresh the data. Although it will not be covered in this blog post, you can automate data refresh by setting up Amazon S3 event notification with AWS Lambda.

#!/usr/bin/env python
import json,boto3

# Name of the Kinesis Analytics Application

# AWS region

def main_handler():


    # retrieve the current application version id
    response_describe_application = kinesisanalytics_client.describe_application(

    # update the application
    response = kinesisanalytics_client.update_application(


if __name__ == "__main__":


To clean up resources created during this demo

  1. To delete the Amazon Kinesis Analytics application, open the Amazon Kinesis Analytics console, choose the application (demo-data-enrichment-s3-reference-application), choose Actions, and then choose Delete application.
  2. To delete the stack, open the AWS CloudFormation console, choose the stack (demo-data-enrichment-s3-reference-stack), choose Actions, and then choose Delete Stack.

The S3 reference data file should have been removed from your bucket, but your other data files should remain intact.

Enriching streaming data with a dynamic dataset using AWS Lambda and Amazon DynamoDB

In many cases, the data you want to enrich is not static. Imagine a case in which you are capturing user activities from a web or mobile application and have to enrich the activity data with frequently changing fields from a user database table.

A better approach is to store reference data in a way that can support random reads and writes efficiently.  Amazon DynamoDB is a fully managed non-relational database that can be used in this case.  AWS Lambda can be set up to automatically read batches of records from your Amazon Kinesis streams, which can perform data enrichment like looking up data from DynamoDB, and then produce the enriched data onto another stream.

If you have a stream of user activities and want to look up a user’s birth year from a DynamoDB table, the architecture will look like this:


To set up a demo with this architecture, follow these steps to set up an AWS CloudFormation stack, which continuously produces sample user scores onto an Amazon Kinesis stream. An AWS Lambda function enriches the records with data from a DynamoDB table and produces the results onto another Amazon Kinesis stream.

  1. Open the CloudFormation console and choose Create Stack. Make sure the US East (N. Virginia) region is selected.
  2. Choose Specify an Amazon S3 template URL, enter or paste the following URL, and then choose Next.
  3. For the stack name, type demo-data-enrichment-ddb-reference-stack. Review the parameters. If none of them conflict with your Amazon Kinesis stream names or DynamoDB table name, you can accept the default values and choose Next.  The following steps are written with the assumption that you are using the default values.
  4. Complete the remaining steps in the Create Stack wizard. Wait until CREATE_COMPLETE is displayed for the stack status.This CloudFormation template creates three Lambda functions: one for setting up sample data in a DynamoDB table, one for producing sample records continuously, and one for data enrichment. The description for this last function is Data Enrichment Demo – Lambda function to consume and enrich records from Kinesis.
  1. Open the Lambda console, select Data Enrichment Demo – Lambda function to consume and enrich records from Kinesis, and then choose the Triggers
  2. If No records processed is displayed for Last result, wait for a few minutes, and then reload this page. If things are set up correctly, OK will be displayed for Last result.  This might take a few minutes.


At this point, the Lambda function is performing the data enrichment and producing records onto an output stream. Lambda is commonly used for preprocessing the analytics app to handle more complicated data formats.

Although this data enrichment process doesn’t involve Amazon Kinesis Analytics, you can still use the Kinesis Analytics service to examine, or even process, the enriched records, by creating a new Kinesis Analytics application and connect it to demo-data-enrichment-user-activity-output-stream.

The records on the demo-data-enrichment-user-activity-output-stream will look like the following and will show the enriched birthyear field on the streaming data.


You can use this birthyear value in your Amazon Kinesis Analytics application. Although it’s not covered in this blog post, you can aggregate by age groups and count the user activities.

You can review the code for the Lambda function on the Code tab. The code used here is for demonstration purpose only. You might want to modify and thoroughly test it before applying it to your use case.

Note the following:

  • The Lambda function receives records in batches (instead of one record per Lambda invocation). You can control this behavior through Batch size on the Triggers tab (in the preceding example, Batch size is set to 100). The batch size you specify is the maximum number of records that you want your Lambda function to receive per invocation.
  • The retrieval from the reference data source (in this case, DynamoDB) can be done in batch instead of record by record. This example uses the batch_get_item() API method.
  • Depending on how strict the record sequencing requirement is, the writing of results to the output stream can also be done in batch. This example uses the put_records() API method.

DynamoDB is not the only option.  What’s important is to have a data source that supports random data read (and write) at a high efficiency.  Because it’s a distributed database with built-in partitioning, DynamoDB is a good choice, but you can also use Amazon RDS as long as the data retrieval is fast enough (perhaps by having a proper index and by spreading the read workload over read replicas). You can also use an in-memory database like Amazon ElastiCache for Redis or Memcached.

To clean up the resources created for this demo

  1. If you created a Kinesis Analytics application: Open the Amazon Kinesis Analytics console, select your application, choose Actions, and then choose Delete application.
  2. Open the CloudFormation console, choose demo-data-enrichment-ddb-reference-stack, choose Actions, and then choose Delete Stack.

Joining multiple sets of streaming data using Amazon Kinesis Analytics

To illustrate this use case, let’s say you have two streams of temperature sensor data coming from a set of machines: one measuring the engine temperature and the other measuring the power supply temperature.  For the best prediction of machine failure, you’ve been told (perhaps by your data scientist) that the two temperature readings must be validated against a prediction model ─ not individually, but as a set.

Because this is streaming data, the time window for the data joining should be clearly defined.  In this example, the joining must occur on temperature readings within the same window (same minute) so that the temperature of the engine is matched against the temperature of the power supply in the same timeframe.

This data joining can be achieved with Amazon Kinesis Analytics, but has to follow a certain pattern.

First of all, an Amazon Kinesis Analytics application supports no more than one streaming data source. The different sets of streaming data have to be produced onto one Amazon Kinesis stream.

An Amazon Kinesis Analytics application can use this Amazon Kinesis stream as input and can process the data based on a certain field (in this example, sensor_location) into multiple in-application streams.

The joining can now occur on the two in-application streams.  In this example, the join fields are the machine_id and the one-minute tumbling window.


The selection of the time field for the windowing criteria is a topic of its own.  For information, see Writing SQL on Streaming Data with Amazon Kinesis Analytics. It is important to get a well-aligned window for real-world applications.  In this example, the processing time (ROWTIME) will be used for the tumbling window calculation for simplicity.

Again, you can follow these steps to set up an AWS CloudFormation stack, which continuously produces two sets of sample sensor data onto a single Amazon Kinesis stream and creates an Amazon Kinesis Analytics application that performs the joining.

  1. Open the CloudFormation console and choose Create Stack. Make sure the US East (N. Virginia) region is selected.
  2. Choose Specify an Amazon S3 template URL, enter or paste the following URL, and then choose Next.
  3. For the stack name, type demo-data-joining-multiple-streams-stack. Review the parameters. If none of them conflict with your Amazon Kinesis stream, you can accept the default values and choose Next. The following steps are written with the assumption that you are using the default values.
  4. You will also see parameters referencing an S3 bucket and S3 key. This is an AWS Lambda package (written in Python) that contains the AWS CloudFormation custom resources to create an Amazon Kinesis application.
  5. Complete the remaining steps in the Create Stack wizard. Wait until the status displayed for the stack is CREATE_COMPLETE.

Open the Amazon Kinesis Analytics console and examine the application.  If you have not modified the default parameters in the CloudFormation template, the application name should be demo-data-joining-multiple-streams-application.

The Application status should be RUNNING.  If its status is still STARTING, wait until it changes to RUNNING.

Choose Application details, and then go to SQL results.

You should see a page similar to the following:


The SQL statements have been populated for you. The script first creates two in-application streams (for engine and power supply temperature readings, respectively).  DESTINATION_SQL_STREAM holds the joined results.

On the Real-time analytics tab, you’ll see the average temperature readings of the engine and power supply have been joined together using the machine_id and per-minute tumbling window.  The results can be written to an Amazon Kinesis stream or to S3, Amazon Redshift, or Amazon Elastisearch Service with a Firehose delivery stream.


To clean up resources created for this demo

  1. To delete the Amazon Kinesis Analytics application, open the Amazon Kinesis Analytics console, choose demo-data-joining-multiple-streams-application, choose Actions, and then choose Delete application.
  2. Open the CloudFormation console, choose demo-data-joining-multiple-streams-stack, choose Actions, and then choose Delete Stack.


In this blog post, I shared three approaches for joining and enriching streaming data on Amazon Kinesis Streams by using Amazon Kinesis Analytics, AWS Lambda, and Amazon DynamoDB.  I hope this information will be helpful in situations when your real-time analytics applications require additional data fields from reference data sources or the real-time insights must be derived from data across multiple streaming data sources.  These joining and enrichment techniques will help you can get the best business value from your data.

If you have a question or suggestion, please leave a comment below.

About the author


assaf_mentzer_90Assaf Mentzer is a Senior Consultant in Big Data & Analytics for AWS Professional Services. He works with enterprise customers to provide leadership on big data projects, helping them reach their full data potential in the cloud. In his spare time, he enjoys watching sports, playing ping pong, and hanging out with family and friends.





Writing SQL on Streaming Data with Amazon Kinesis Analytics


BitUnmap: Attacking Android Ashmem (Project Zero blog)

Post Syndicated from jake original http://lwn.net/Articles/708017/rss

Google’s Project Zero blog has a detailed look at exploiting a vulnerability in Android’s ashmem shared-memory facility. “The mismatch between the mmap-ed and munmap-ed length provides us with a great exploitation primitive! Specifically, we could supply a short length for the mmap operation and a longer length for the munmap operation – thus resulting in deletion of an arbitrarily large range of virtual memory following our bitmap object. Moreover, there’s no need for the deleted range to contain one continuous memory mapping, since the range supplied in munmap simply ignores unmapped pages.

Once we delete a range of memory, we can then attempt to “re-capture” that memory region with controlled data, by causing another allocation in the remote process. By doing so, we can forcibly “free” a data structure and replace its contents with our own chosen data — effectively forcing a use-after-free condition.”

Python FAQ: Why should I use Python 3?

Post Syndicated from Eevee original https://eev.ee/blog/2016/07/31/python-faq-why-should-i-use-python-3/

Part of my Python FAQ, which is doomed to never be finished.

The short answer is: because it’s the actively-developed version of the language, and you should use it for the same reason you’d use 2.7 instead of 2.6.

If you’re here, I’m guessing that’s not enough. You need something to sweeten the deal. Well, friend, I have got a whole mess of sugar cubes just for you.

And once you’re convinced, you may enjoy the companion article, how to port to Python 3! It also has some more details on the diffences between Python 2 and 3, whereas this article doesn’t focus too much on the features removed in Python 3.

Some background

If you aren’t neck-deep in Python, you might be wondering what the fuss is all about, or why people keep telling you that Python 3 will set your computer on fire. (It won’t.)

Python 2 is a good language, but it comes with some considerable baggage. It has two integer types; it may or may not be built in a way that completely mangles 16/17 of the Unicode space; it has a confusing mix of lazy and eager functional tools; it has a standard library that takes “batteries included” to lengths beyond your wildest imagination; it boasts strong typing, then casually insists that None < 3 < "2"; overall, it’s just full of little dark corners containing weird throwbacks to the days of Python 1.

(If you’re really interested, Nick Coghlan has written an exhaustive treatment of the slightly different question of why Python 3 was created. This post is about why Python 3 is great, so let’s focus on that.)

Fixing these things could break existing code, whereas virtually all code written for 2.0 will still work on 2.7. So Python decided to fix them all at once, producing a not-quite-compatible new version of the language, Python 3.

Nothing like this has really happened with a mainstream programming language before, and it’s been a bit of a bumpy ride since then. Python 3 was (seemingly) designed with the assumption that everyone would just port to Python 3, drop Python 2, and that would be that. Instead, it’s turned out that most libraries want to continue to run on both Python 2 and Python 3, which was considerably difficult to make work at first. Python 2.5 was still in common use at the time, too, and it had none of the helpful backports that showed up in Python 2.6 and 2.7; likewise, Python 3.0 didn’t support u'' strings. Writing code that works on both 2.5 and 3.0 was thus a ridiculous headache.

The porting effort also had a dependency problem: if your library or app depends on library A, which depends on library B, which depends on C, which depends on D… then none of those projects can even think about porting until D’s porting effort is finished. Early days were very slow going.

Now, though, things are looking brighter. Most popular libraries work with Python 3, and those that don’t are working on it. Python 3’s Unicode handling, one of its most contentious changes, has had many of its wrinkles ironed out. Python 2.7 consists largely of backported Python 3 features, making it much simpler to target 2 and 3 with the same code — and both 2.5 and 2.6 are no longer supported.

Don’t get me wrong, Python 2 will still be around for a while. A lot of large applications have been written for Python 2 — think websites like Yelp, YouTube, Reddit, Dropbox — and porting them will take some considerable effort. I happen to know that at least one of those websites was still running 2.6 last year, years after 2.6 had been discontinued, if that tells you anything about the speed of upgrades for big lumbering software.

But if you’re just getting started in Python, or looking to start a new project, there aren’t many reasons not to use Python 3. There are still some, yes — but unless you have one specifically in mind, they probably won’t affect you.

I keep having Python beginners tell me that all they know about Python 3 is that some tutorial tried to ward them away from it for vague reasons. (Which is ridiculous, since especially for beginners, Python 2 and 3 are fundamentally not that different.) Even the #python IRC channel has a few people who react, ah, somewhat passive-aggressively towards mentions of Python 3. Most of the technical hurdles have long since been cleared; it seems like one of the biggest roadblocks now standing in the way of Python 3 adoption is the community’s desire to sabotage itself.

I think that’s a huge shame. Not many people seem to want to stand up for Python 3, either.

Well, here I am, standing up for Python 3. I write all my new code in Python 3 now — because Python 3 is great and you should use it. Here’s why.

Hang on, let’s be real for just a moment

None of this is going to 💥blow your mind💥. It’s just a programming language. I mean, the biggest change to Python 2 in the last decade was probably the addition of the with statement, which is nice, but hardly an earth-shattering innovation. The biggest changes in Python 3 are in the same vein: they should smooth out some points of confusion, help avoid common mistakes, and maybe give you a new toy to play with.

Also, if you’re writing a library that needs to stay compatible with Python 2, you won’t actually be able to use any of this stuff. Sorry. In that case, the best reason to port is so application authors can use this stuff, rather than citing your library as the reason they’re trapped on Python 2 forever. (But hey, if you’re starting a brand new library that will blow everyone’s socks off, do feel free to make it Python 3 exclusive.)

Application authors, on the other hand, can go wild.

Unicode by default

Let’s get the obvious thing out of the way.

In Python 2, there are two string types: str is a sequence of bytes (which I would argue makes it not a string), and unicode is a sequence of Unicode codepoints. A literal string in source code is a str, a bytestring. Reading from a file gives you bytestrings. Source code is assumed ASCII by default. It’s an 8-bit world.

If you happen to be an English speaker, it’s very easy to write Python 2 code that seems to work perfectly, but chokes horribly if fed anything outside of ASCII. The right thing involves carefully specifying encodings everywhere and using u'' for virtually all your literal strings, but that’s very tedious and easily forgotten.

Python 3 reshuffles this to put full Unicode support front and center.

Most obviously, the str type is a real text type, similar to Python 2’s unicode. Literal strings are still str, but now that makes them Unicode strings. All of the “structural” strings — names of types, functions, modules, etc. — are likewise Unicode strings. Accordingly, identifiers are allowed to contain any Unicode “letter” characters. repr() no longer escapes printable Unicode characters, though there’s a new ascii() (and corresponding !a format cast and %a placeholder) that does. Unicode completely pervades the language, for better or worse.

And just for the record: this is way better. It is so much better. It is incredibly better. Do you know how much non-ASCII garbage I type? Every single em dash in this damn post was typed by hand, and Python 2 would merrily choke on them.

Source files are now assumed to be UTF-8 by default, so adding an em dash in a comment will no longer break your production website. (I have seen this happen.) You’re still free to specify another encoding explicitly if you want, using a magic comment.

There is no attempted conversion between bytes and text, as in Python 2; b'a' + 'b' is a TypeError. Some modules require you to know what you’re dealing with: zlib.compress only accepts bytes, because zlib is defined in terms of bytes; json.loads only accepts str, because JSON is defined in terms of Unicode codepoints. Calling str() on some bytes will defer to repr, producing something like "b'hello'". (But see -b and -bb below.) Overall it’s pretty obvious when you’ve mixed bytes with text.

Oh, and two huge problem children are fixed: both the csv module and urllib.parse (formerly urlparse) can handle text. If you’ve never tried to make those work, trust me, this is miraculous.

I/O does its best to make everything Unicode. On Unix, this is a little hokey, since the filesystem is explicitly bytes with no defined encoding; Python will trust the various locale environment variables, which on most systems will make everything UTF-8. The default encoding of text-mode file I/O is derived the same way and thus usually UTF-8. (If it’s not what you expect, run locale and see what you get.) Files opened in binary mode, with a 'b', will still read and write bytes.

Python used to come in “narrow” and “wide” builds, where “narrow” builds actually stored Unicode as UTF-16, and this distinction could leak through to user code in subtle ways. On a narrow build, unichr(0x1F4A3) raises ValueError, and the length of u'💣' is 2. Surprise! Maybe your code will work on someone else’s machine, or maybe it won’t. Python 3.3 eliminated narrow builds.

I think those are the major points. For the most part, you should be able to write code as though encodings don’t exist, and the right thing will happen more often. And the wrong thing will immediately explode in your face. It’s good for you.

If you work with binary data a lot, you might be frowning at me at this point; it was a bit of a second-class citizen in Python 3.0. I think things have improved, though: a number of APIs support both bytes and text, the bytes-to-bytes codec issue has largely been resolved, we have bytes.hex() and bytes.fromhex(), bytes and bytearray both support % now, and so on. They’re listening!

Refs: Python 3.0 release notes; myriad mentions all over the documentation

Backported features

Python 3.0 was released shortly after Python 2.6, and a number of features were then backported to Python 2.7. You can use these if you’re only targeting Python 2.7, but if you were stuck with 2.6 for a long time, you might not have noticed them.

  • Set literals:

    {1, 2, 3}
  • Dict and set comprehensions:

    {word.lower() for word in words}
    {value: key for (key, value) in dict_to_invert.items()}
  • Multi-with:

    with open("foo") as f1, open("bar") as f2:
  • print is now a function, with a couple bells and whistles added: you can change the delimiter with the sep argument, you can change the terminator to whatever you want (including nothing) with the end argument, and you can force a flush with the flush argument. In Python 2.6 and 2.7, you still have to opt into this with from __future__ import print_function.

  • The string representation of a float now uses the shortest decimal number that has the same underlying value — for example, repr(1.1) was '1.1000000000000001' in Python 2.6, but is just '1.1' in Python 2.7 and 3.1+, because both are represented the same way in a 64-bit float.

  • collections.OrderedDict is a dict-like type that remembers the order of its keys.

    Note that you cannot do OrderedDict(a=1, b=2), because the constructor still receives its keyword arguments in a regular dict, losing the order. You have to pass in a sequence of 2-tuples or assign keys one at a time.

  • collections.Counter is a dict-like type for counting a set of things. It has some pretty handy operations that allow it to be used like a multiset.

  • The entire argparse module is a backport from 3.2.

  • str.format learned a , formatting specifier for numbers, which always uses commas and groups of three digits. This is wrong for many countries, and the correct solution involves using the locale module, but it’s useful for quick output of large numbers.

  • re.sub, re.subn, and re.split accept a flags argument. Minor, but, thank fucking God.

Ref: Python 2.7 release notes

Iteration improvements

Everything is lazy

Python 2 has a lot of pairs of functions that do the same thing, except one is eager and one is lazy: range and xrange, map and itertools.imap, dict.keys and dict.iterkeys, and so on.

Python 3.0 eliminated all of the lazy variants and instead made the default versions lazy. Iterating over them works exactly the same way, but no longer creates an intermediate list — for example, range(1000000000) won’t eat all your RAM. If you need to index them or store them for later, you can just wrap them in list(...).

Even better, the dict methods are now “views“. You can keep them around, and they’ll reflect any changes to the underlying dict. They also act like sets, so you can do a.keys() & b.keys() to get the set of keys that exist in both dicts.

Refs: dictionary view docs; Python 3.0 release notes


Unpacking got a huge boost. You could always do stuff like this in Python 2:

a, b, c = range(3)  # a = 0, b = 1, c = 2

Python 3.0 introduces:

a, b, *c = range(5)  # a = 0, b = 1, c = [2, 3, 4]
a, *b, c = range(5)  # a = 0, b = [1, 2, 3], c = 4

Python 3.5 additionally allows use of the * and ** unpacking operators in literals, or multiple times in function calls:

print(*range(3), *range(3))  # 0 1 2 0 1 2

x = [*range(3), *range(3)]  # x = [0, 1, 2, 0, 1, 2]
y = {*range(3), *range(3)}  # y = {0, 1, 2}  (it's a set, remember!)
z = {**dict1, **dict2}  # finally, syntax for dict merging!

Refs: Python 3.0 release notes; PEP 3132; Python 3.5 release notes; PEP 448

yield from

yield from is an extension of yield. Where yield produces a single value, yield from yields an entire sequence.

def flatten(*sequences):
    for seq in sequences:
        yield from seq

list(flatten([1, 2], [3, 4]))  # [1, 2, 3, 4]

Of course, for a simple example like that, you could just do some normal yielding in a for loop. The magic of yield from is that it can also take another generator or other lazy iterable, and it’ll effectively pause the current generator until the given one has been exhausted. It also takes care of passing values back into the generator using .send() or .throw().

def foo():
    a = yield 1
    b = yield from bar(a)
    print("foo got back", b)
    yield 4

def bar(a):
    print("in bar", a)
    x = yield 2
    y = yield 3
    print("leaving bar")
    return x + y

gen = foo()
val = None
while True:
        newval = gen.send(val)
    except StopIteration:
    print("yielded", newval)
    val = newval * 10

# yielded 1
# in bar 10
# yielded 2
# yielded 3
# leaving bar
# foo got back 50
# yielded 4

Oh yes, and you can now return a value from a generator. The return value becomes the result of a yield from, or if the caller isn’t using yield from, it’s available as the argument to the StopIteration exception.

A small convenience, perhaps. The real power here isn’t in the use of generators as lazy iterators, but in the use of generators as coroutines.

A coroutine is a function that can “suspend” itself, like yield does, allowing other code to run until the function is resumed. It’s kind of like an alternative to threading, but only one function is actively running at any given time, and that function has to delierately relinquish control (or end) before anything else can run.

Generators could do this already, more or less, but only one stack frame deep. That is, you can yield from a generator to suspend it, but if the generator calls another function, that other function has no way to suspend the generator. This is still useful, but significantly less powerful than the coroutine functionality in e.g. Lua, which lets any function yield anywhere in the call stack.

With yield from, you can create a whole chain of generators that yield from one another, and as soon as the one on the bottom does a regular yield, the entire chain will be suspended.

This laid the groundwork for making the asyncio module possible. I’ll get to that later.

Refs: docs; Python 3.3 release notes; PEP 380

Syntactic sugar

Keyword-only arguments

Python 3.0 introduces “keyword-only” arguments, which must be given by name. As a corollary, you can now accept a list of args and have more arguments afterwards. The full syntax now looks something like this:

def foo(a, b=None, *args, c=None, d, **kwargs):

Here, a and d are required, b and c are optional. c and d must be given by name.

foo(1)                      # TypeError: missing d
foo(1, 2)                   # TypeError: missing d
foo(d=4)                    # TypeError: missing a
foo(1, d=4)                 # a = 1, d = 4
foo(1, 2, d=4)              # a = 1, b = 2, d = 4
foo(1, 2, 3, d=4)           # a = 1, b = 2, args = (3,), d = 4
foo(1, 2, c=3, d=4)         # a = 1, b = 2, c = 3, d = 4
foo(1, b=2, c=3, d=4, e=5)  # a = 1, b = 2, c = 3, d = f, kwargs = {'e': 5}

This is extremely useful for functions with a lot of arguments, functions with boolean arguments, functions that accept *args (or may do so in the future) but also want some options, etc. I use it a lot!

If you want keyword-only arguments, but you don’t want to accept *args, you just leave off the variable name:

def foo(*, arg=None):

Refs: Python 3.0 release notes; PEP 3102

Format strings

Python 3.6 (not yet out) will finally bring us string interpolation, more or less, using the str.format() syntax:

a = 0x133
b = 0x352
print(f"The answer is {a + b:04x}.")

It’s pretty much the same as str.format(), except that instead of a position or name, you can give an entire expression. The formatting suffixes with : still work, the special built-in conversions like !r still work, and __format__ is still invoked.

Refs: docs; Python 3.6 release notes; PEP 498

async and friends

Right, so, about coroutines.

Python 3.4 introduced the asyncio module, which offers building blocks for asynchronous I/O (and bringing together the myriad third-party modules that do it already).

The design is based around coroutines, which are really generators using yield from. The idea, as I mentioned above, is that you can create a stack of generators that all suspend at once:

def foo():
    # do some stuff
    yield from bar()
    # do more stuff

def bar():
    # do some stuff
    response = yield from get_url("https://eev.ee/")
    # do more stuff

When this code calls get_url() (not actually a real function, but see aiohttp), get_url will send a request off into the æther, and then yield. The entire stack of generators — get_url, bar, and foo — will all suspend, and control will return to whatever first called foo, which with asyncio will be an “event loop”.

The event loop’s entire job is to notice that get_url yielded some kind of “I’m doing a network request” thing, remember it, and resume other coroutines in the meantime. (Or just twiddle its thumbs, if there’s nothing else to do.) When a response comes back, the event loop will resume get_url and send it the response. get_url will do some stuff and return it up to bar, who continues on, none the wiser that anything unusual happened.

The magic of this is that you can call get_url several times, and instead of having to wait for each request to completely finish before the next one can even start, you can do other work while you’re waiting. No threads necessary; this is all one thread, with functions cooperatively yielding control when they’re waiting on some external thing to happen.

Now, notice that you do have to use yield from each time you call another coroutine. This is nice in some ways, since it lets you see exactly when and where your function might be suspended out from under you, which can be important in some situations. There are also arguments about why this is bad, and I don’t care about them.

However, yield from is a really weird phrase to be sprinkling all over network-related code. It’s meant for use with iterables, right? Lists and tuples and things. get_url is only one thing. What are we yielding from it? Also, what’s this @coroutine decorator that doesn’t actually do anything?

Python 3.5 smoothed over this nonsense by introducing explicit syntax for these constructs, using new async and await keywords:

async def foo():
    # do some stuff
    await bar()
    # do more stuff

async def bar():
    # do some stuff
    response = await get_url("https://eev.ee/")
    # do more stuff

async def clearly identifies a coroutine, even one that returns immediately. (Before, you’d have a generator with no yield, which isn’t actually a generator, which causes some problems.) await explains what’s actually happening: you’re just waiting for another function to be done.

async for and async with are also available, replacing some particularly clumsy syntax you’d need to use before. And, handily, you can only use any of these things within an async def.

The new syntax comes with corresponding new special methods like __await__, whereas the previous approach required doing weird things with __iter__, which is what yield from ultimately calls.

I could fill a whole post or three with stuff about asyncio, and can’t possibly give it justice in just a few paragraphs. The short version is: there’s built-in syntax for doing network stuff in parallel without threads, and that’s cool.

Refs for asyncio: docs (asyncio); Python 3.4 release notes; PEP 3156

Refs for async and await: docs (await); docs (async); docs (special methods); Python 3.5 release notes; PEP 492

Function annotations

Function arguments and return values can have annotations:

def foo(a: "hey", b: "what's up") -> "whoa":

The annotations are accessible via the function’s __annotations__ attribute. They have no special meaning to Python, so you’re free to experiment with them.


You were free to experiment with them, but the addition of the typing module (mentioned below) has hijacked them for type hints. There’s no clear way to attach a type hint and some other value to the same argument, so you’ll have a tough time making function annotations part of your API.

There’s still no hard requirement that annotations be used exclusively for type hints (and it’s not like Python does anything with type hints, either), but the original PEP suggests it would like that to be the case someday. I guess we’ll see.

If you want to see annotations preserved for other uses as well, it would be a really good idea to do some creative and interesting things with them as soon as possible. Just saying.

Refs: docs; Python 3.0 release notes; PEP 3107

Matrix multiplication

Python 3.5 learned a new infix operator for matrix multiplication, spelled @. It doesn’t do anything for any built-in types, but it’s supported in NumPy. You can implement it yourself with the __matmul__ special method and its r and i variants.

Shh. Don’t tell anyone, but I suspect there are fairly interesting things you could do with an operator called @ — some of which have nothing to do with matrix multiplication at all!

Refs: Python 3.5 release notes; PEP 465


... is now valid syntax everywhere. It evaluates to the Ellipsis singleton, which does nothing. (This exists in Python 2, too, but it’s only allowed when slicing.)

It’s not of much practical use, but you can use it to indicate an unfinished stub, in a way that’s clearly not intended to be final but will still parse and run:

class ReallyComplexFiddlyThing:
    # fuck it, do this later

Refs: docs; Python 3.0 release notes

Enhanced exceptions

A slightly annoying property of Python 2’s exception handling is that if you want to do your own error logging, or otherwise need to get at the traceback, you have to use the slightly funky sys.exc_info() API and carry the traceback around separately. As of Python 3.0, exceptions automatically have a __traceback__ attribute, as well as a .with_traceback() method that sets the traceback and returns the exception itself (so you can use it inline).

This makes some APIs a little silly — __exit__ still accepts the exception type and value and traceback, even though all three are readily available from just the exception object itself.

A much more annoying property of Python 2’s exception handling was that custom exception handling would lose track of where the problem actually occurred. Consider the following call stack.


Now say an exception happens in E, and it’s caught by code like this in C.

except Exception as e:
    raise CustomError("Failed to call D")

Because this creates and raises a new exception, the traceback will start from this point and not even mention E. The best workaround for this involves manually creating a traceback between C and E, formatting it as a string, and then including that in the error message. Preposterous.

Python 3.0 introduced exception chaining, which allows you to do this:

raise CustomError("Failed to call D") from e

Now, if this exception reaches the top level, Python will format it as:

Traceback (most recent call last):
File C, blah blah
File D, blah blah
File E, blah blah

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File A, blah blah
File B, blah blah
File C, blah blah
CustomError: Failed to call D

The best part is that you don’t need to explicitly say from e at all — if you do a plain raise while there’s already an active exception, Python will automatically chain them together. Even internal Python exceptions will have this behavior, so a broken exception handler won’t lose the original exception. (In the implicit case, the intermediate text becomes “During handling of the above exception, another exception occurred:”.)

The chained exception is stored on the new exception as either __cause__ (if from an explicit raise ... from) or __context__ (if automatic).

If you direly need to hide the original exception, Python 3.3 introduced raise ... from None.

Speaking of exceptions, the error messages for missing arguments have been improved. Python 2 does this:

TypeError: foo() takes exactly 1 argument (0 given)

Python 3 does this:

TypeError: foo() missing 1 required positional argument: 'a'


Cooler classes

super() with no arguments

You can call super() with no arguments. It Just Works. Hallelujah.

Also, you can call super() with no arguments. That’s so great that I could probably just fill the rest of this article with it and be satisfied.

Did I mention you can call super() with no arguments?

Refs: docs; Python 3.0 release notes; PEP 3135

New metaclass syntax and kwargs for classes

Compared to that, everything else in this section is going to sound really weird and obscure.

For example, __metaclass__ is gone. It’s now a keyword-only argument to the class statement.

class Foo(metaclass=FooMeta):

That doesn’t sound like much, right? Just some needless syntax change that makes porting harder, right?? Right??? Haha nope watch this because it’s amazing but it barely gets any mention at all.

class Foo(metaclass=FooMeta, a=1, b=2, c=3):

You can include arbitrary keyword arguments in the class statement, and they will be passed along to the metaclass call as keyword arguments. (You have to catch them in both __new__ and __init__, since they always get the same arguments.) (Also, the class statement now has the general syntax of a function call, so you can put *args and **kwargs in it.)

This is pretty slick. Consider SQLAlchemy, which uses a metaclass to let you declare a table with a class.

class SomeTable(TableBase):
    __tablename__ = 'some_table'
    id = Column()

Note that SQLAlchemy has you put the name of the table in the clumsy __tablename__ attribute, which it invented. Why not just name? Well, because then you couldn’t declare a column called name! Any “declarative” metaclass will have the same problem of separating the actual class contents from configuration. Keyword arguments offer an easy way out.

# only hypothetical, alas
class SomeTable(TableBase, name='some_table'):
    id = Column()

Refs: docs; Python 3.0 release notes; PEP 3115


Another new metaclass feature is the introduction of the __prepare__ method.

You may have noticed that the body of a class is just a regular block, which can contain whatever code you want. Before decorators were a thing, you’d actually declare class methods in two stages:

class Foo:
    def do_the_thing(cls):
    do_the_thing = classmethod(do_the_thing)

That’s not magical class-only syntax; that’s just regular code assigning to a variable. You can put ifs and fors and whiles and dels inside a class body, too; you just don’t see it very often because there aren’t very many useful reasons to do it.

A class body is a kind of weird pseudo-scope. It can create locals, and it can read values from outer scopes, but methods don’t see the class body as an outer scope. Once the class body reaches its end, any remaining locals are passed to the type constructor and become the new class’s attributes. (This is why, for example, you can’t refer to a class directly within its own body — the class doesn’t and can’t exist until after the body has executed.)

All of this is to say: __prepare__ is a new hook that returns the dict the class body’s locals go into.

Maybe that doesn’t sound particularly interesting, but consider: the value you return doesn’t have to be an actual dict. It can be anything that understands __setitem__. You could, say, use an OrderedDict, and keep track of the order your attributes were declared. That’s useful for declarative metaclasses, where the order of attributes may be important (consider a C struct).

But you can go further. You might allow more than one attribute of the same name. You might do something special with the attributes as soon as they’re assigned, rather than at the end of the body. You might predeclare some attributes. __prepare__ is passed the class’s kwargs, so you might alter the behavior based on those.

For a nice practical example, consider the new enum module, which I briefly mention later on. One drawback of this module is that you have to specify a value for every variant, since variants are defined as class attributes, which must have a value. There’s an example of automatic numbering, but it still requires assigning a dummy value like (). Clever use of __prepare__ would allow lifting this restriction:

# XXX: Should prefer MutableMapping here, but the ultimate call to type()
# raises a TypeError if you pass a namespace object that doesn't inherit
# from dict!  Boo.
class EnumLocals(dict):
    def __init__(self):
        self.nextval = 1

    def __getitem__(self, key):
        if key not in self and not key.startswith('_') and not key.endswith('_'):
            self[key] = self.nextval
            self.nextval += 1
        return super().__getitem__(key)

class EnumMeta(type):
    def __prepare__(meta, name, bases):
        return EnumLocals()

class Colors(metaclass=EnumMeta):

print(Colors.red, Colors.green, Colors.blue)
# 1 2 3

Deciding whether this is a good idea is left as an exercise.

This is an exceptionally obscure feature that gets very little attention — it’s not even mentioned explicitly in the 3.0 release notes — but there’s nothing else like it in the language. Between __prepare__ and keyword arguments, the class statement has transformed into a much more powerful and general tool for creating all kinds of objects. I almost wish it weren’t still called class.

Refs: docs; Python 3.0 release notes; PEP 3115

Attribute definition order

If that’s still too much work, don’t worry: a proposal was just accepted for Python 3.6 that makes this even easier. Now every class will have a __definition_order__ attribute, a tuple listing the names of all the attributes assigned within the class body, in order. (To make this possible, the default return value of __prepare__ will become an OrderedDict, but the __dict__ attribute will remain a regular dict.)

Now you don’t have to do anything at all: you can always check to see what order any class’s attributes were defined in.

Additionally, descriptors can now implement a __set_name__ method. When a class is created, any descriptor implementing the method will have it called with the containing class and the name of the descriptor.

I’m very excited about this, but let me try to back up. A descriptor is a special Python object that can be used to customize how a particular class attribute works. The built-in property decorator is a descriptor.

class MyClass:
    foo = SomeDescriptor()

c = MyClass()
c.foo = 5  # calls SomeDescriptor.__set__!
print(c.foo)  # calls SomeDescriptor.__get__!

This is super cool and can be used for all sorts of DSL-like shenanigans.

Now, most descriptors ultimately want to store a value somewhere, and the obvious place to do that is in the object’s __dict__. Above, SomeDescriptor might want to store its value in c.__dict__['foo'], which is fine since Python will still consult the descriptor first. If that weren’t fine, it could also use the key '_foo', or whatever. It probably wants to use its own name somehow, because otherwise… what would happen if you had two SomeDescriptors in the same class?

Therein lies the problem, and one of my long-running and extremely minor frustrations with Python. Descriptors have no way to know their own name! There are only really two solutions to this:

  1. Require the user to pass the name in as an argument, too: foo = SomeDescriptor('foo'). Blech!

  2. Also have a metaclass (or decorator, or whatever), which can iterate over all the class’s attributes, look for SomeDescriptor objects, and tell them what their names are. Needing a metaclass means you can’t make general-purpose descriptors meant for use in arbitrary classes; a decorator would work, but boy is that clumsy.

Both of these suck and really detract from what could otherwise be very neat-looking syntax trickery.

But now! Now, when MyClass is created, Python will have a look through its attributes. If it sees that the foo object has a __set_name__ method, it’ll call that method automatically, passing it both the owning class and the name 'foo'! Huzzah!

This is so great I am so happy you have no idea.

Lastly, there’s now an __init_subclass__ class method, which is called when the class is subclassed. A great many metaclasses exist just to do a little bit of work for each new subclass; now, you don’t need a metaclass at all in many simple cases. You want a plugin registry? No problem:

class Plugin:
    _known_plugins = {}

    def __init_subclass__(cls, *, name, **kwargs):
        cls._known_plugins[name] = cls

    def get_plugin(cls, name):
        return cls._known_plugins[name]

    # ...probably some interface stuff...

class FooPlugin(Plugin, name="foo"):

No metaclass needed at all.

Again, none of this stuff is available yet, but it’s all slated for Python 3.6, due out in mid-December. I am super pumped.

Refs: docs (customizing class creation); docs (descriptors); Python 3.6 release notes; PEP 520 (attribute definition order); PEP 487 (__init_subclass__ and __set_name__)

Math stuff

int and long have been merged, and there is no longer any useful distinction between small and very large integers. I’ve actually run into code that breaks if you give it 1 instead of 1L, so, good riddance. (Python 3.0 release notes; PEP 237)

The / operator always does “true” division, i.e., gives you a float. If you want floor division, use //. Accordingly, the __div__ magic method is gone; it’s split into two parts, __truediv__ and __floordiv__. (Python 3.0 release notes; PEP 238)

decimal.Decimal, fractions.Fraction, and floats now interoperate a little more nicely: numbers of different types hash to the same value; all three types can be compared with one another; and most notably, the Decimal and Fraction constructors can accept floats directly. (docs (decimal); docs (fractions); Python 3.2 release notes)

math.gcd returns the greatest common divisor of two integers. This existed before, but was in the fractions module, where nobody knew about it. (docs; Python 3.5 release notes)

math.inf is the floating-point infinity value. Previously, this was only available by writing float('inf'). There’s also a math.nan, but let’s not? (docs; Python 3.5 release notes)

math.isclose (and the corresponding complex version, cmath.isclose) determines whether two values are “close enough”. Intended to do the right thing when comparing floats. (docs; Python 3.5 release notes; PEP 485)

More modules

The standard library has seen quite a few improvements. In fact, Python 3.2 was developed with an explicit syntax freeze, so it consists almost entirely of standard library enhancements. There are far more changes across six and a half versions than I can possibly list here; these are the ones that stood out to me.

The module shuffle

Python 2, rather inexplicably, had a number of top-level modules that were named after the single class they contained, CamelCase and all. StringIO and SimpleHTTPServer are two obvious examples. In Python 3, the StringIO class lives in io (along with BytesIO), and SimpleHTTPServer has been renamed to http.server. If you’re anything like me, you’ll find this deeply satisfying.

Wait, wait, there’s a practical upside here. Python 2 had several pairs of modules that did the same thing with the same API, but one was pure Python and one was much faster C: pickle/cPickle, profile/cProfile, and StringIO/cStringIO. I’ve seen code (cough, older versions of Babel, cough) that spent a considerable amount of its startup time reading pickles with the pure Python version, because it did the obvious thing and used the pickle module. Now, these pairs have been merged: importing pickle gives you the faster C implementation, importing profile gives you the faster C implementation, and BytesIO/StringIO are the fast C implementations in the io module.

Refs: docs (sort of); Python 3.0 release notes; PEP 3108 (exhaustive list of removed and renamed modules)

Additions to existing modules

A number of file format modules, like bz2 and gzip, went through some cleanup and modernization in 3.2 through 3.4: some learned a more straightforward open function, some gained better support for the bytes/text split, and several learned to use their file types as context managers (i.e., with with).

collections.ChainMap is a mapping type that consults some number of underlying mappings in order, allowing for a “dict with defaults” without having to merge them together. (docs; Python 3.3 release notes)

configparser dropped its ridiculous distinction between ConfigParser and SafeConfigParser; there is now only ConfigParser, which is safe. The parsed data now preserves order by default and can be read or written using normal mapping syntax. Also there’s a fancier alternative interpolation parser. (docs; Python 3.2 release notes)

contextlib.ContextDecorator is some sort of devilry that allows writing a context manager which can also be used as a decorator. It’s used to implement the @contextmanager decorator, so those can be used as decorators as well. (docs; Python 3.2 release notes)

contextlib.ExitStack offers cleaner and more fine-grained handling of multiple context managers, as well as resources that don’t have their own context manager support. (docs; Python 3.3 release notes)

contextlib.suppress is a context manager that quietly swallows a given type of exception. (docs; Python 3.4 release notes)

contextlib.redirect_stdout is a context manager that replaces sys.stdout for the duration of a block. (docs; Python 3.4 release notes)

datetime.timedelta already existed, of course, but now it supports being multiplied and divided by numbers or divided by other timedeltas. The upshot of this is that timedelta finally, finally has a .total_seconds() method which does exactly what it says on the tin. (docs; Python 3.2 release notes)

datetime.timezone is a new concrete type that can represent fixed offsets from UTC. There has long been a datetime.tzinfo, but it was a useless interface, and you were left to write your own actual class yourself. datetime.timezone.utc is a pre-existing instance that represents UTC, an offset of zero. (docs; Python 3.2 release notes)

functools.lru_cache is a decorator that caches the results of a function, keyed on the arguments. It also offers cache usage statistics and a method for emptying the cache. (docs; Python 3.2 release notes)

functools.partialmethod is like functools.partial, but the resulting object can be used as a descriptor (read: method). (docs; Python 3.4 release notes)

functools.singledispatch allows function overloading, based on the type of the first argument. (docs; Python 3.4 release notes; PEP 443)

functools.total_ordering is a class decorator that allows you to define only __eq__ and __lt__ (or any other) and defines the other comparison methods in terms of them. Note that since Python 3.0, __ne__ is automatically the inverse of __eq__ and doesn’t need defining. Note also that total_ordering doesn’t correctly support NotImplemented until Python 3.4. For an even easier way to do this, consider my classtools.keyed_ordering decorator. (docs; Python 3.2 release notes)

inspect.getattr_static fetches an attribute like getattr but avoids triggering dynamic lookup like @property. (docs; Python 3.2 release notes)

inspect.signature fetches the signature of a function as the new and more featureful Signature object. It also knows to follow the __wrapped__ attribute set by functools.wraps since Python 3.2, so it can see through well-behaved wrapper functions to the “original” signature. (docs; Python 3.3 release notes; PEP 362)

The logging module can, finally, use str.format-style string formatting by passing style='{' to Formatter. (docs; Python 3.2 release notes)

The logging module spits warnings and higher to stderr if logging hasn’t been otherwise configured. This means that if your app doesn’t use logging, but it uses a library that does, you’ll get actual output rather than the completely useless “No handlers could be found for logger ‘foo’”. (docs; Python 3.2 release notes)

os.scandir lists the contents of a directory while avoiding stat calls as much as possible, making it significantly faster. (docs; Python 3.5 release notes; PEP 471)

re.fullmatch checks for a match against the entire input string, not just a substring. (docs; Python 3.4 release notes)

reprlib.recursive_repr is a decorator for __repr__ implementations that can detect recursive calls to the same object and replace them with ..., just like the built-in structures. Believe it or not, reprlib is an existing module, though in Python 2 it was called repr. (docs; Python 3.2 release notes)

shutil.disk_usage returns disk space statistics for a given path with no fuss. (docs; Python 3.3 release notes)

shutil.get_terminal_size tries very hard to detect the size of the terminal window. (docs; Python 3.3 release notes)

subprocess.run is a new streamlined function that consolidates several other helpers in the subprocess module. It returns an object that describes the final state of the process, and it accepts arguments for a timeout, requiring that the process return success, and passing data as stdin. This is now the recommended way to run a single subprocess. (docs; Python 3.5 release notes)

tempfile.TemporaryDirectory is a context manager that creates a temporary directory, then destroys it and its contents at the end of the block. (docs; Python 3.2 release notes)

textwrap.indent can add an arbitrary prefix to every line in a string. (docs; Python 3.3 release notes)

time.monotonic returns the value of a monotonic clock — i.e., it will never go backwards. You should use this for measuring time durations within your program; using time.time() will produce garbage results if the system clock changes due to DST, a leap second, NTP, manual intervention, etc. (docs; Python 3.3 release notes; PEP 418)

time.perf_counter returns the value of the highest-resolution clock available, but is only suitable for measuring a short duration. (docs; Python 3.3 release notes; PEP 418)

time.process_time returns the total system and user CPU time for the process, excluding sleep. Note that the starting time is undefined, so only durations are meaningful. (docs; Python 3.3 release notes; PEP 418)

traceback.walk_stack and traceback.walk_tb are small helper functions that walk back along a stack or traceback, so you can use simple iteration rather than the slightly clumsier linked-list approach. (docs; Python 3.5 release notes)

types.MappingProxyType offers a read-only proxy to a dict. Since it holds a reference to the dict in C, you can return MappingProxyType(some_dict) to effectively create a read-only dict, as the original dict will be inaccessible from Python code. This is the same type used for the __dict__ of an immutable object. Note that this has existed in various forms for a while, but wasn’t publicly exposed or documented; see my module dictproxyhack for something that does its best to work on every Python version. (docs; Python 3.3 release notes)

types.SimpleNamespace is a blank type for sticking arbitrary unstructed attributes to. Previously, you would have to make a dummy subclass of object to do this. (docs; Python 3.3 release notes)

weakref.finalize allows you to add a finalizer function to an arbitrary (weakrefable) object from the “outside”, without needing to add a __del__. The finalize object will keep itself alive, so there’s no need to hold onto it. (docs; Python 3.4 release notes)

New modules with backports

These are less exciting, since they have backports on PyPI that work in Python 2 just as well. But they came from Python 3 development, so I credit Python 3 for them, just like I credit NASA for inventing the microwave.

asyncio is covered above, but it’s been backported as trollius for 2.6+, with the caveat that Pythons before 3.3 don’t have yield from and you have to use yield From(...) as a workaround. That caveat means that third-party asyncio libraries will almost certainly not work with trollius! For this and other reasons, the maintainer is no longer supporting it. Alas. Guess you’ll have to upgrade to Python 3, then.

enum finally provides an enumeration type, something which has long been desired in Python and solved in myriad ad-hoc ways. The variants become instances of a class, can be compared by identity, can be converted between names and values (but only explicitly), can have custom methods, and can implement special methods as usual. There’s even an IntEnum base class whose values end up as subclasses of int (!), making them perfectly compatible with code expecting integer constants. Enums have a surprising amount of power, far more than any approach I’ve seen before; I heartily recommend that you skim the examples in the documentation. Backported as enum34 for 2.4+. (docs; Python 3.4 release notes; PEP 435)

ipaddress offers types for representing IPv4 and IPv6 addresses and subnets. They can convert between several representations, perform a few set-like operations on subnets, identify special addresses, and so on. Backported as ipaddress for 2.6+. (There’s also a py2-ipaddress, but its handling of bytestrings differs from Python 3’s built-in module, which is likely to cause confusing compatibility problems.) (docs; Python 3.3 release notes; PEP 3144)

pathlib provides the Path type, representing a filesystem path that you can manipulate with methods rather than the mountain of functions in os.path. It also overloads / so you can do path / 'file.txt', which is kind of cool. PEP 519 intends to further improve interoperability of Paths with classic functions for the not-yet-released Python 3.6. Backported as pathlib2 for 2.6+; there’s also a pathlib, but it’s no longer maintained, and I don’t know what happened there. (docs; Python 3.4 release notes; PEP 428)

selectors (created as part of the work on asyncio) attempts to wrap select in a high-level interface that doesn’t make you want to claw your eyes out. A noble pursuit. Backported as selectors34 for 2.6+. (docs; Python 3.4 release notes)

statistics contains a number of high-precision statistical functions. Backported as backports.statistics for 2.6+. (docs; Python 3.4 release notes; PEP 450)

unittest.mock provides multiple ways for creating dummy objects, temporarily (with a context manager or decorator) replacing an object or some of its attributes, and verifying that some sequence of operations was performed on a dummy object. I’m not a huge fan of mocking so much that your tests end up mostly testing that your source code hasn’t changed, but if you have to deal with external resources or global state, some light use of unittest.mock can be very handy — even if you aren’t using the rest of unittest. Backported as mock for 2.6+. (docs; Python 3.3, but no release notes)

New modules without backports

Perhaps more exciting because they’re Python 3 exclusive! Perhaps less exciting because they’re necessarily related to plumbing.


faulthandler is a debugging aid that can dump a Python traceback during a segfault or other fatal signal. It can also be made to hook on an arbitrary signal, and can intervene even when Python code is deadlocked. You can use the default behavior with no effort by passing -X faulthandler on the command line, by setting the PYTHONFAULTHANDLER environment variable, or by using the module API manually.

I think -X itself is new as of Python 3.2, though it’s not mentioned in the release notes. It’s reserved for implementation-specific options; there are a few others defined for CPython, and the options can be retrieved from Python code via sys._xoptions.

Refs: docs; Python 3.3 release notes


importlib is the culmination of a whole lot of work, performed in multiple phases across numerous Python releases, to extend, formalize, and cleanly reimplement the entire import process.

I can’t possibly describe everything the import system can do and what Python versions support what parts of it. Suffice to say, it can do a lot of things: Python has built-in support for importing from zip files, and I’ve seen third-party import hooks that allow transparently importing modules written in another programming language.

If you want to mess around with writing your own custom importer, importlib has a ton of tools for helping you do that. It’s possible in Python 2, too, using the imp module, but that’s a lot rougher around the edges.

If not, the main thing of interest is the import_module function, which imports a module by name without all the really weird semantics of __import__. Seriously, don’t use __import__. It’s so weird. It probably doesn’t do what you think. importlib.import_module even exists in Python 2.7.

Refs: docs; Python 3.3 release notes; PEP 302?


tracemalloc is another debugging aid which tracks Python’s memory allocations. It can also compare two snapshots, showing how much memory has been allocated or released between two points in time, and who was responsible. If you have rampant memory use issues, this is probably more helpful than having Python check its own RSS.

Technically, tracemalloc can be used with Python 2.7… but that involves patching and recompiling Python, so I hesitate to call it a backport. Still, if you really need it, give it a whirl.

Refs: docs; Python 3.4 release notes; PEP 454


typing offers a standard way to declare type hints — the expected types of arguments and return values. Type hints are given using the function annotation syntax.

Python itself doesn’t do anything with the annotations, though they’re accessible and inspectable at runtime. An external tool like mypy can perform static type checking ahead of time, using these standard types. mypy is an existing project that predates typing (and works with Python 2), but the previous syntax relied on magic comments; typing formalizes the constructs and puts them in the standard library.

I haven’t actually used either the type hints or mypy myself, so I can’t comment on how helpful or intrusive they are. Give them a shot if they sound useful to you.

Refs: docs; Python 3.5 release notes; PEP 484

venv and ensurepip

I mean, yes, of course, virtualenv and pip are readily available in Python 2. The whole point of these is that they are bundled with Python, so you always have them at your fingertips and never have to worry about installing them yourself.

Installing Python should now give you pipX and pipX.Y commands automatically, corresponding to the latest stable release of pip when that Python version was first released. You’ll also get pyvenv, which is effectively just virtualenv.

There’s also a module interface: python -m ensurepip will install pip (hopefully not necessary), python -m pip runs pip with a specific Python version (a feature of pip and not new to the bundling), and python -m venv runs the bundled copy of virtualenv with a specific Python version.

There was a time where these were completely broken on Debian, because Debian strongly opposes vendoring (the rationale being that it’s easiest to push out updates if there’s only one copy of a library in the Debian package repository), so they just deleted ensurepip and venv? Which completely defeated the point of having them in the first place? I think this has been fixed by now, but it might still bite you if you’re on the Ubuntu 14.04 LTS.

Refs: ensurepip docs; pyvenv docs; Python 3.4 release notes; PEP 453


zipapp makes it easy to create executable zip applications, which have been a thing since 2.6 but have languished in obscurity. Well, no longer.

This wasn’t particularly difficult before: you just zip up some code, make sure there’s a __main__.py in the root, and pass it to Python. Optionally, you can set it executable and add a shebang line, since the ZIP format ignores any leading junk in the file. That’s basically all zipapp does. (It does not magically infer your dependencies and bundle them as well; you’re on your own there.)

I can’t find a backport, which is a little odd, since I don’t think this module does anything too special.

Refs: docs; Python 3.5 release notes; PEP 441

Miscellaneous nice enhancements

There were a lot of improvements to language semantics that don’t fit anywhere else above, but make me a little happier.

The interactive interpreter does tab-completion by default. I say “by default” because I’ve been told that it was supported before, but you had to do some kind of goat blood sacrifice to get it to work. Also, command history persists between runs. (docs; Python 3.4 release notes)

The -b command-line option produces a warning when calling str() on a bytes or bytearray, or when comparing text to bytes. -bb produces an error. (docs)

The -I command-like option runs Python in “isolated mode”: it ignores all PYTHON* environment variables and leaves the current directory and user site-packages directories off of sys.path. The idea is to use this when running a system script (or in the shebang line of a system script) to insulate it from any weird user-specific stuff. (docs; Python 3.4 release notes)

Functions and classes learned a __qualname__ attribute, which is a dotted name describing (lexically) where they were defined. For example, a method’s __name__ might be foo, but its __qualname__ would be something like SomeClass.foo. Similarly, a class or function defined within another function will list that containing function in its __qualname__. (docs; Python 3.3 release notes; PEP 3155)

Generators signal their end by raising StopIteration internally, but it was also possible to raise StopIteration directly within a generator — most notably, when calling next() on an exhausted iterator. This would cause the generator to end prematurely and silently. Now, raising StopIteration inside a generator will produce a warning, which will become a RuntimeError in Python 3.7. You can opt into the fatal behavior early with from __future__ import generator_stop. (Python 3.5 release notes; PEP 479)

Implicit namespace packages allow a package to span multiple directories. The most common example is a plugin system, foo.plugins.*, where plugins may come from multiple libraries, but all want to share the foo.plugins namespace. Previously, they would collide, and some sys.path tricks were necessary to make it work; now, support is built in. (This feature also allows you to have a regular package without an __init__.py, but I’d strongly recommend still having one.) (Python 3.3 release notes; PEP 420)

Object finalization behaves in less quirky ways when destroying an isolated reference cycle. Also, modules no longer have their contents changed to None during shutdown, which fixes a long-running type of error when a __del__ method tries to call, say, os.path.join() — if you were unlucky, os.path would have already have had its contents replaced with Nones, and you’d get an extremely confusing TypeError from trying to call a standard library function. (Python 3.4 release notes; PEP 442)

str.format_map is like str.format, but it accepts a mapping object directly (instead of having to flatten it with **kwargs). This allows some fancy things that weren’t previously possible, like passing a fake map that creates values on the fly based on the keys looked up in it. (docs; Python 3.2 release notes)

When a blocking system call is interrupted by a signal, it returns EINTR, indicating that the calling code should try the same system call again. In Python, this becomes OSError or InterruptedError. I have never in my life seen any C or Python code that actually deals with this correctly. Now, Python will do it for you: all the built-in and standard library functions that make use of system calls will automatically retry themselves when interrupted. (Python 3.5 release notes; PEP 475)

File descriptors created by Python code are now flagged “non-inheritable”, meaning they’re closed automatically when spawning a child process. (docs; Python 3.4 release notes; PEP 446)

A number of standard library functions now accept file descriptors in addition to paths. (docs; Python 3.3 release notes)

Several different OS and I/O exceptions were merged into a single and more fine-grained hierarchy, rooted at OSError. Code can now catch a specific subclass in most cases, rather than examine .errno. (docs; Python 3.3 release notes; PEP 3151)

ResourceWarning is a new kind of warning for issues with resource cleanup. One is produced if a file object is destroyed, but was never closed, which can cause issues on Windows or with garbage-collected Python implementations like PyPy; one is also produced if uncollectable objects still remain when Python shuts down, indicating some severe finalization problems. The warning is ignored by default, but can be enabled with -W default on the command line. (Python 3.2 release notes)

hasattr() only catches (and returns False for) AttributeErrors. Previously, any exception would be considered a sign that the attribute doesn’t exist, even though an unusual exception like an OSError usually means the attribute is computed dynamically, and that code is broken somehow. Now, exceptions other than AttributeError are allowed to propagate to the caller. (docs; Python 3.2 release notes)

Hash randomization is on by default, meaning that dict and set iteration order is different per Python runs. This protects against some DoS attacks, but more importantly, it spitefully forces you not to rely on incidental ordering. (docs; Python 3.3 release notes)

List comprehensions no longer leak their loop variables into the enclosing scope. (Python 3.0 release notes)

nonlocal allows writing to a variable in an enclosing (but non-global) scope. (docs; Python 3.0 release notes; PEP 3104)

Comparing objects of incompatible types now produces a TypeError, rather than using Python 2’s very silly fallback. (Python 3.0 release notes)

!= defaults to returning the opposite of ==. (Python 3.0 release notes)

Accessing a method as a class attribute now gives you a regular function, not an “unbound method” object. (Python 3.0 release notes)

The input builtin no longer performs an eval (!), removing a huge point of confusion for beginners. This is the behavior of raw_input in Python 2. (docs; Python 3.0 release notes; PEP 3111)

Fast and furious

These aren’t necessarily compelling, and they may not even make any appreciable difference for your code, but I think they’re interesting technically.

Objects’ __dict__s can now share their key storage internally. Instances of the same type generally have the same attribute names, so this provides a modest improvement in speed and memory usage for programs that create a lot of user-defined objects. (Python 3.3 release notes; PEP 412)

OrderedDict is now implemented in C, making it “4 to 100” (!) times faster. Note that the backport in the 2.7 standard library is pure Python. So, there’s a carrot. (Python 3.5 release notes)

The GIL was made more predictable. My understanding is that the old behavior was to yield after some number of Python bytecode operations, which could take wildly varying amounts of time; the new behavior yields after a given duration, by default 5ms. (Python 3.2 release notes)

The io library was rewritten in C, making it more fast. Again, the Python 2.7 implementation is pure Python. (Python 3.1 release notes)

Tuples and dicts containing only immutable objects — i.e., objects that cannot possibly contain circular references — are ignored by the garbage collector. This was backported to Python 2.7, too, but I thought it was super interesting. (Python 3.1 release notes)

That’s all I’ve got

Huff, puff.

I hope something here appeals to you as a reason to at least experiment with Python 3. It’s fun over here. Give it a try.

I’ve bought some more awful IoT stuff

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/43486.html

I bought some awful WiFi lightbulbs a few months ago. The short version: they introduced terrible vulnerabilities on your network, they violated the GPL and they were also just bad at being lightbulbs. Since then I’ve bought some other Internet of Things devices, and since people seem to have a bizarre level of fascination with figuring out just what kind of fractal of poor design choices these things frequently embody, I thought I’d oblige.

Today we’re going to be talking about the KanKun SP3, a plug that’s been around for a while. The idea here is pretty simple – there’s lots of devices that you’d like to be able to turn on and off in a programmatic way, and rather than rewiring them the simplest thing to do is just to insert a control device in between the wall and the device andn ow you can turn your foot bath on and off from your phone. Most vendors go further and also allow you to program timers and even provide some sort of remote tunneling protocol so you can turn off your lights from the comfort of somebody else’s home.

The KanKun has all of these features and a bunch more, although when I say “features” I kind of mean the opposite. I plugged mine in and followed the install instructions. As is pretty typical, this took the form of the plug bringing up its own Wifi access point, the app on the phone connecting to it and sending configuration data, and the plug then using that data to join your network. Except it didn’t work. I connected to the plug’s network, gave it my SSID and password and waited. Nothing happened. No useful diagnostic data. Eventually I plugged my phone into my laptop and ran adb logcat, and the Android debug logs told me that the app was trying to modify a network that it hadn’t created. Apparently this isn’t permitted as of Android 6, but the app was handling this denial by just trying again. I deleted the network from the system settings, restarted the app, and this time the app created the network record and could modify it. It still didn’t work, but that’s because it let me give it a 5GHz network and it only has a 2.4GHz radio, so one reset later and I finally had it online.

The first thing I normally do to one of these things is run nmap with the -O argument, which gives you an indication of what OS it’s running. I didn’t really need to in this case, because if I just telnetted to port 22 I got a dropbear ssh banner. Googling turned up the root password (“p9z34c”) and I was logged into a lightly hacked (and fairly obsolete) OpenWRT environment.

It turns out that here’s a whole community of people playing with these plugs, and it’s common for people to install CGI scripts on them so they can turn them on and off via an API. At first this sounds somewhat confusing, because if the phone app can control the plug then there clearly is some kind of API, right? Well ha yeah ok that’s a great question and oh good lord do things start getting bad quickly at this point.

I’d grabbed the apk for the app and a copy of jadx, an incredibly useful piece of code that’s surprisingly good at turning compiled Android apps into something resembling Java source. I dug through that for a while before figuring out that before packets were being sent, they were being handed off to some sort of encryption code. I couldn’t find that in the app, but there was a native ARM library shipped with it. Running strings on that showed functions with names matching the calls in the Java code, so that made sense. There were also references to AES, which explained why when I ran tcpdump I only saw bizarre garbage packets.

But what was surprising was that most of these packets were substantially similar. There were a load that were identical other than a 16-byte chunk in the middle. That plus the fact that every payload length was a multiple of 16 bytes strongly indicated that AES was being used in ECB mode. In ECB mode each plaintext is split up into 16-byte chunks and encrypted with the same key. The same plaintext will always result in the same encrypted output. This implied that the packets were substantially similar and that the encryption key was static.

Some more digging showed that someone had figured out the encryption key last year, and that someone else had written some tools to control the plug without needing to modify it. The protocol is basically ascii and consists mostly of the MAC address of the target device, a password and a command. This is then encrypted and sent to the device’s IP address. The device then sends a challenge packet containing a random number. The app has to decrypt this, obtain the random number, create a response, encrypt that and send it before the command takes effect. This avoids the most obvious weakness around using ECB – since the same plaintext always encrypts to the same ciphertext, you could just watch encrypted packets go past and replay them to get the same effect, even if you didn’t have the encryption key. Using a random number in a challenge forces you to prove that you actually have the key.

At least, it would do if the numbers were actually random. It turns out that the plug is just calling rand(). Further, it turns out that it never calls srand(). This means that the plug will always generate the same sequence of challenges after a reboot, which means you can still carry out replay attacks if you can reboot the plug. Strong work.

But there was still the question of how the remote control works, since the code on github only worked locally. tcpdumping the traffic from the server and trying to decrypt it in the same way as local packets worked fine, and showed that the only difference was that the packet started “wan” rather than “lan”. The server decrypts the packet, looks at the MAC address, re-encrypts it and sends it over the tunnel to the plug that registered with that address.

That’s not really a great deal of authentication. The protocol permits a password, but the app doesn’t insist on it – some quick playing suggests that about 90% of these devices still use the default password. And the devices are all based on the same wifi module, so the MAC addresses are all in the same range. The process of sending status check packets to the server with every MAC address wouldn’t take that long and would tell you how many of these devices are out there. If they’re using the default password, that’s enough to have full control over them.

There’s some other failings. The github repo mentioned earlier includes a script that allows arbitrary command execution – the wifi configuration information is passed to the system() command, so leaving a semicolon in the middle of it will result in your own commands being executed. Thankfully this doesn’t seem to be true of the daemon that’s listening for the remote control packets, which seems to restrict its use of system() to data entirely under its control. But even if you change the default root password, anyone on your local network can get root on the plug. So that’s a thing. It also downloads firmware updates over http and doesn’t appear to check signatures on them, so there’s the potential for MITM attacks on the plug itself. The remote control server is on AWS unless your timezone is GMT+8, in which case it’s in China. Sorry, Western Australia.

It’s running Linux and includes Busybox and dnsmasq, so plenty of GPLed code. I emailed the manufacturer asking for a copy and got told that they wouldn’t give it to me, which is unsurprising but still disappointing.

The use of AES is still somewhat confusing, given the relatively small amount of security it provides. One thing I’ve wondered is whether it’s not actually intended to provide security at all. The remote servers need to accept connections from anywhere and funnel decent amounts of traffic around from phones to switches. If that weren’t restricted in any way, competitors would be able to use existing servers rather than setting up their own. Using AES at least provides a minor obstacle that might encourage them to set up their own server.

Overall: the hardware seems fine, the software is shoddy and the security is terrible. If you have one of these, set a strong password. There’s no rate-limiting on the server, so a weak password will be broken pretty quickly. It’s also infringing my copyright, so I’d recommend against it on that point alone.

comment count unavailable comments

No GSoC for Avahi

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/gsoc-avahi.html

As it seems, the Avahi project has not been
accepted as Google Summer of Code
organization, much like the GStreamer project.

Grr, I cannot say I really understand why three wiki engines [1] got accepted,
or a UI frontend for nmap – but not important infrastructure projects like
GStreamer or Avahi. Mhmm, maybe I am just envious, and considering these two projects important is just hybris…

Anyway, we had already prepared a list of exciting [2] GSoC project
ideas for Avahi. If anyone is interested to work on one of these there might be a small chance to get this done under the GNOME umbrella. Feel free to contact either me or Trent if you are interested!


[1] If there is something we already have enough of in Free Software – then it is Wiki engines. just check the output of apt-cache search wiki | wc -l on a recent Debian system.

[2] In our definition of exciting, of course – which doesn’t seem to be the same as Google’s. Grrrh!