
Lambda@Edge – Intelligent Processing of HTTP Requests at the Edge

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/lambdaedge-intelligent-processing-of-http-requests-at-the-edge/

Late last year I announced a preview of Lambda@Edge and talked about how you could use it to intelligently process HTTP requests at locations that are close (latency-wise) to your customers. Developers who applied and gained access to the preview have been making good use of it, and have provided us with plenty of very helpful feedback. During the preview we added the ability to generate HTTP responses and support for CloudWatch Logs, and also updated our roadmap based on the feedback.

Now Generally Available
Today I am happy to announce that Lambda@Edge is now generally available! You can use it to:

  • Inspect cookies and rewrite URLs to perform A/B testing (see the sketch after this list).
  • Send specific objects to your users based on the User-Agent header.
  • Implement access control by looking for specific headers before passing requests to the origin.
  • Add, drop, or modify headers to direct users to different cached objects.
  • Generate new HTTP responses.
  • Cleanly support legacy URLs.
  • Modify or condense headers or URLs to improve cache utilization.
  • Make HTTP requests to other Internet resources and use the results to customize responses.
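
As a quick sketch of the first item above, here is what a cookie-driven A/B test could look like in a Viewer Request function. This example is not from the original post: the cookie name ("experiment") and the URI prefixes are made-up values, but the event structure (event.Records[0].cf.request and the headers map of {key, value} arrays) is the one Lambda@Edge provides.

'use strict';
exports.handler = (event, context, callback) => {

    /* Hypothetical example: route viewers to variant "a" or "b" based on a cookie */
    const request = event.Records[0].cf.request;
    const headers = request.headers;

    /* Cookies arrive as zero or more 'cookie' headers, each a {key, value} pair */
    let group = 'a';
    if (headers.cookie) {
        headers.cookie.forEach((cookie) => {
            if (cookie.value.indexOf('experiment=b') !== -1) {
                group = 'b';
            }
        });
    }

    /* Rewrite the URI so that CloudFront serves (and caches) the variant for this group */
    request.uri = '/' + group + request.uri;

    /* Return the (possibly modified) request to continue processing */
    callback(null, request);
};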

Lambda@Edge allows you to create web-based user experiences that are rich and personal. As is rapidly becoming the norm in today’s world, you don’t need to provision or manage any servers. You simply upload your code (Lambda functions written in Node.js) and pick one of the CloudFront behaviors that you have created for the distribution, along with the desired CloudFront event:

In this case, my function (the imaginatively named EdgeFunc1) would run in response to origin requests for image/* within the indicated distribution. As you can see, you can run code in response to four different CloudFront events:

Viewer Request – This event is triggered when a request arrives from a viewer (an HTTP client, generally a web browser or a mobile app), and has access to the incoming HTTP request. As you know, each CloudFront edge location maintains a large cache of objects so that it can efficiently respond to repeated requests. This particular event is triggered regardless of whether the requested object is already cached.

Origin Request – This event is triggered when the edge location is about to make a request back to the origin because the requested object is not cached at the edge location. It has access to the request that will be made to the origin (often an S3 bucket or code running on an EC2 instance).

Origin Response – This event is triggered after the origin returns a response to a request. It has access to the response from the origin.

Viewer Response – This event is triggered before the edge location returns a response to the viewer. It has access to the response.

Functions are globally replicated and requests are automatically routed to the optimal location for execution. You can write your code once and with no overt action on your part, have it be available at low latency to users all over the world.

Your code has full access to requests and responses, including headers, cookies, the HTTP method (GET, HEAD, and so forth), and the URI. Subject to a few restrictions, it can modify existing headers and insert new ones.
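
For example, here is a minimal sketch (not from the original post) of a function that reads the User-Agent header and inserts a new custom header; the X-Device-Type name is a made-up example. It mainly illustrates the data structure: header names are lowercase keys, and each value is an array of {key, value} pairs.

'use strict';
exports.handler = (event, context, callback) => {

    /* Hypothetical example: derive a device type from the User-Agent header */
    const request = event.Records[0].cf.request;
    const headers = request.headers;

    const userAgent  = headers['user-agent'] ? headers['user-agent'][0].value : '';
    const deviceType = /Mobile|Android|iPhone/.test(userAgent) ? 'mobile' : 'desktop';

    /* Insert a new header (subject to the restrictions described later in this post) */
    headers['x-device-type'] = [{key: 'X-Device-Type', value: deviceType}];

    callback(null, request);
};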

Lambda@Edge in Action
Let’s create a simple function that runs in response to the Viewer Request event. I open up the Lambda Console and create a new function. I choose the Node.js 6.10 runtime and search for cloudfront blueprints:

I choose cloudfront-response-generation and configure a trigger to invoke the function:

The Lambda Console provides me with some information about the operating environment for my function:

I enter a name and a description for my function, as usual:

The blueprint includes a fully operational function. It generates a “200” HTTP response and a very simple body:

I used this as the starting point for my own code, which pulls some interesting values from the request and displays them in a table:

'use strict';
exports.handler = (event, context, callback) => {

    /* Set table row style */
    const rs = '"border-bottom:1px solid black;vertical-align:top;"';
    /* Get request */
    const request = event.Records[0].cf.request;
   
    /* Get values from request */ 
    const httpVersion = request.httpVersion;
    const clientIp    = request.clientIp;
    const method      = request.method;
    const uri         = request.uri;
    const headers     = request.headers;
    const host        = headers['host'][0].value;
    const agent       = headers['user-agent'][0].value;
    
    var sreq = JSON.stringify(event.Records[0].cf.request, null, ' ');
    sreq = sreq.replace(/\n/g, '<br/>');

    /* Generate body for response */
    const body = 
     '<html>\n'
     + '<head><title>Hello From Lambda@Edge</title></head>\n'
     + '<body>\n'
     + '<table style="border:1px solid black;background-color:#e0e0e0;border-collapse:collapse;" cellpadding=4 cellspacing=4>\n'
     + '<tr style=' + rs + '><td>Host</td><td>'        + host     + '</td></tr>\n'
     + '<tr style=' + rs + '><td>Agent</td><td>'       + agent    + '</td></tr>\n'
     + '<tr style=' + rs + '><td>Client IP</td><td>'   + clientIp + '</td></tr>\n'
     + '<tr style=' + rs + '><td>Method</td><td>'      + method   + '</td></tr>\n'
     + '<tr style=' + rs + '><td>URI</td><td>'         + uri      + '</td></tr>\n'
     + '<tr style=' + rs + '><td>Raw Request</td><td>' + sreq     + '</td></tr>\n'
     + '</table>\n'
     + '</body>\n'
     + '</html>';

    /* Generate HTTP response */
    const response = {
        status: '200',
        statusDescription: 'HTTP OK',
        httpVersion: httpVersion,
        body: body,
        headers: {
            'vary':          [{key: 'Vary',          value: '*'}],
            'last-modified': [{key: 'Last-Modified', value:'2017-01-13'}]
        },
    };

    callback(null, response);
};

I configure my handler, and request the creation of a new IAM Role with Basic Edge Lambda permissions:

On the next page I confirm my settings (as I would do for a regular Lambda function), and click on Create function:

This creates the function, attaches the trigger to the distribution, and also initiates global replication of the function. The status of my distribution changes to In Progress for the duration of the replication (typically 5 to 8 minutes):

The status changes back to Deployed as soon as the replication completes:

Then I access the root of my distribution (https://dogy9dy9kvj6w.cloudfront.net/), the function runs, and this is what I see:

Feel free to click on the image (it is linked to the root of my distribution) to run my code!

As usual, this is a very simple example and I am sure that you can do a lot better. Here are a few ideas to get you started:

Site Management – You can take an entire dynamic website offline and replace critical pages with Lambda@Edge functions for maintenance or during a disaster recovery operation.

High Volume Content – You can create scoreboards, weather reports, or public safety pages and make them available at the edge, both quickly and cost-effectively.

Create something cool and share it in the comments or in a blog post, and I’ll take a look.

Things to Know
Here are a few things to keep in mind as you start to think about how to put Lambda@Edge to use in your application:

Timeouts – Functions that handle Origin Request and Origin Response events must complete within 3 seconds. Functions that handle Viewer Request and Viewer Response events must complete within 1 second.

Versioning – After you update your code in the Lambda Console, you must publish a new version and set up a fresh set of triggers for it, and then wait for the replication to complete. You must always refer to your code using a version number; $LATEST and aliases do not apply.
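
If you script your deployments, publishing a version looks roughly like the following sketch, which uses the AWS SDK for JavaScript; the function name is a placeholder, and the versioned ARN in the response is what a CloudFront trigger must reference.

'use strict';
const AWS = require('aws-sdk');

/* Lambda@Edge functions are created in the US East (N. Virginia) Region */
const lambda = new AWS.Lambda({region: 'us-east-1'});

lambda.publishVersion({FunctionName: 'EdgeFunc1'}, (err, data) => {
    if (err) {
        console.error(err);
        return;
    }
    /* data.FunctionArn is the qualified ARN, including the new version number */
    console.log('Versioned ARN for the CloudFront trigger:', data.FunctionArn);
});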

Headers – As you can see from my code, the HTTP request headers are accessible as an array. The headers fall into four categories:

  • Accessible – Can be read, written, deleted, or modified.
  • Restricted – Must be passed on to the origin.
  • Read-only – Can be read, but not modified in any way.
  • Blacklisted – Not seen by code, and cannot be added.

Runtime Environment – The runtime environment provides each function with 128 MB of memory, but no built-in libraries or access to /tmp.

Web Service Access – Functions that handle Origin Request and Origin Response events can access the AWS APIs and fetch content via HTTP. These requests are always made synchronously with respect to the original request or response.
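
Here is a hedged sketch (not from the original post) of an Origin Request function that fetches a small JSON document from another web service and uses it to add a header before the request continues to the origin. The endpoint URL and header name are placeholders; remember that time spent on the outbound call counts against the 3 second limit mentioned above.

'use strict';
const https = require('https');

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;

    /* Placeholder endpoint; the call happens before the request is passed to the origin */
    https.get('https://config.example.com/flags.json', (res) => {
        let data = '';
        res.on('data', (chunk) => { data += chunk; });
        res.on('end', () => {
            let flags = {};
            try { flags = JSON.parse(data); } catch (e) { /* fall back to defaults */ }
            request.headers['x-experiment-group'] = [{
                key: 'X-Experiment-Group',
                value: flags.group || 'control',
            }];
            callback(null, request);
        });
    }).on('error', () => callback(null, request));   /* on failure, pass the request through unchanged */
};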

Function Replication – As I mentioned earlier, your functions will be globally replicated. The replicas are visible in the “other” regions from the Lambda Console:

CloudFront – Everything that you already know about CloudFront and CloudFront behaviors is relevant to Lambda@Edge. You can use multiple behaviors in each distribution (each with up to four Lambda@Edge functions), customize header & cookie forwarding, and so forth. You can also make the association between events and functions (via ARNs that include function versions) while you are editing a behavior:

Available Now
Lambda@Edge is available now and you can start using it today. Pricing is based on the number of times that your functions are invoked and the amount of time that they run (see the Lambda@Edge Pricing page for more info).

Jeff;

 

casync — A tool for distributing file system images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html

Introducing casync

In the past months I have been working on a new project:
casync. casync takes
inspiration from the popular rsync file
synchronization tool as well as the probably even more popular
git revision control system. It combines the
idea of the rsync algorithm with the idea of git-style
content-addressable file systems, and creates a new system for
efficiently storing and delivering file system images, optimized for
high-frequency update cycles over the Internet. Its current focus is
on delivering IoT, container, VM, application, portable service or OS
images, but I hope to extend it later in a generic fashion to become
useful for backups and home directory synchronization as well (but
more about that later).

The basic technological building blocks casync is built from are
neither new nor particularly innovative (at least not anymore);
however, the way casync combines them is different from existing tools,
and that’s what makes it useful for a variety of use-cases that other
tools can’t cover that well.

Why?

I created casync after studying how today’s popular tools store and
deliver file system images. To briefly name a few: Docker has a
layered tarball approach,
OSTree serves the
individual files directly via HTTP and maintains packed deltas to
speed up updates, while other systems operate on the block layer and
place raw squashfs images (or other archival file systems, such as
ISO9660) for download on HTTP shares (in the better cases combined
with zsync data).

Neither of these approaches appeared fully convincing to me when used
in high-frequency update cycle systems. In such systems, it is
important to optimize towards a couple of goals:

  1. Most importantly, make updates cheap traffic-wise (for this most tools use image deltas of some form)
  2. Put boundaries on disk space usage on servers (keeping deltas between all version combinations clients might want to run updates between would suggest keeping an exponentially growing number of deltas on servers)
  3. Put boundaries on disk space usage on clients
  4. Be friendly to Content Delivery Networks (CDNs), i.e. serve neither too many small nor too many overly large files, and only require the most basic form of HTTP. Provide the repository administrator with high-level knobs to tune the average file size delivered.
  5. Simplicity to use for users, repository administrators and developers

I don’t think any of the tools mentioned above are really good on more
than a small subset of these points.

Specifically: Docker’s layered tarball approach dumps the “delta”
question onto the feet of the image creators: the best way to make
your image downloads minimal is to base your work on an existing image
clients might already have, inheriting its resources and maintaining
full history. Here, revision control (a tool for the developer) is
intermingled with update management (a concept for optimizing
production delivery). As container histories grow, individual deltas
are likely to stay small, but on the other hand a brand-new deployment
usually requires downloading the full history onto the deployment
system, even though there’s no use for it there, and likely requires
substantially more disk space and larger downloads.

OSTree’s serving of individual files is unfriendly to CDNs (as many
small files in file trees cause an explosion of HTTP GET
requests). To counter that, OSTree supports placing pre-calculated
delta images between selected revisions on the delivery servers, which
means a certain amount of revision management that leaks into the
clients.

Delivering direct squashfs (or other file system) images is almost
beautifully simple, but of course means every update requires a full
download of the newest image, which is both bad for disk usage and
generated traffic. Enhancing it with zsync makes this a much better
option, as it can reduce generated traffic substantially at very
little cost of history/meta-data (no explicit deltas between a large
number of versions need to be prepared server side). On the other hand
server requirements in disk space and functionality (HTTP Range
requests) are minus points for the use-case I am interested in.

(Note: all the mentioned systems have great properties, and it’s not
my intention to badmouth them. The only point I am trying to make is
that for the use case I care about — file system image delivery with
high-frequency update cycles — each system comes with certain
drawbacks.)

Security & Reproducibility

Besides the issues pointed out above I wasn’t happy with the security
and reproducibility properties of these systems. In today’s world
where security breaches involving hacking and breaking into connected
systems happen every day, an image delivery system that cannot make
strong guarantees regarding data integrity is out of
date. Specifically, the tarball format is famously nondeterministic:
the very same file tree can result in any number of different
valid serializations depending on the tool used, its version and the
underlying OS and file system. Some tar implementations attempt to
correct that by guaranteeing that each file tree maps to exactly
one valid serialization, but such a property is always only specific
to the tool used. I strongly believe that any good update system must
guarantee on every single link of the chain that there’s only one
valid representation of the data to deliver, that can easily be
verified.

What casync Is

So much for the background on why I created casync. Now, let’s have a
look at what casync actually is like, and what it does. Here’s the
brief technical overview:

Encoding: Let’s take a large linear data stream, split it into
variable-sized chunks (the size of each being a function of the
chunk’s contents), and store these chunks in individual, compressed
files in some directory, each file named after a strong hash value of
its contents, so that the hash value may be used as a key for
retrieving the full chunk data. Let’s call this directory a “chunk
store”. At the same time, generate a “chunk index” file that lists
these chunk hash values plus their respective chunk sizes in a simple
linear array. The chunking algorithm is supposed to create variable,
but similarly sized chunks from the data stream, and do so in a way
that the same data results in the same chunks even if placed at
varying offsets. For more information see this blog story.
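
To make the idea more concrete, here is a rough sketch of content-defined chunking plus a chunk index, written in Node.js for illustration only: casync itself is written in C and uses a buzhash rolling hash, while the multiplier, window size, and cut mask below are simplified stand-ins rather than casync's actual parameters.

'use strict';
const crypto = require('crypto');

const WINDOW = 48;       // bytes in the rolling-hash window (illustrative value)
const MASK   = 0xffff;   // low 16 bits zero => chunk boundary, ~64 KiB average chunks
const B      = 31;       // hash multiplier (illustrative value)

// Precompute B^WINDOW modulo 2^32 so the byte leaving the window can be removed.
let BPOW = 1;
for (let i = 0; i < WINDOW; i++) BPOW = Math.imul(BPOW, B) >>> 0;

// Returns the "chunk index": an ordered list of { digest, size } entries.
function chunkIndex(buffer) {
  const index = [];
  let start = 0;
  let hash = 0;
  for (let i = 0; i < buffer.length; i++) {
    hash = (Math.imul(hash, B) + buffer[i]) >>> 0;
    if (i - start >= WINDOW) {
      hash = (hash - Math.imul(buffer[i - WINDOW], BPOW)) >>> 0;
    }
    // A boundary depends only on the bytes in the window, so identical data
    // produces identical chunks even when shifted to a different offset.
    if ((hash & MASK) === 0 || i === buffer.length - 1) {
      const chunk = buffer.slice(start, i + 1);
      index.push({
        digest: crypto.createHash('sha256').update(chunk).digest('hex'),
        size: chunk.length,
      });
      start = i + 1;
      hash = 0;
    }
  }
  return index;
}

In a real chunk store each chunk would additionally be compressed and written to a file named after its digest; two versions of a stream that differ only locally would then share most of their chunk files.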

Decoding: Let’s take the chunk index file, and reassemble the large
linear data stream by concatenating the uncompressed chunks retrieved
from the chunk store, keyed by the listed chunk hash values.

As an extra twist, we introduce a well-defined, reproducible,
random-access serialization format for file trees (think: a more
modern tar), to permit efficient, stable storage of complete file
trees in the system, simply by serializing them and then passing them
into the encoding step explained above.

Finally, let’s put all this on the network: for each image you want to
deliver, generate a chunk index file and place it on an HTTP
server. Do the same with the chunk store, and share it between the
various index files you intend to deliver.

Why bother with all of this? Streams with similar contents will result
in mostly the same chunk files in the chunk store. This means it is
very efficient to store many related versions of a data stream in the
same chunk store, thus minimizing disk usage. Moreover, when
transferring linear data streams, chunks already known on the receiving
side can be reused, thus minimizing network traffic.

Why is this different from rsync or OSTree, or similar tools? Well,
one major difference between casync and those tools is that we
remove file boundaries before chunking things up. This means that
small files are lumped together with their siblings and large files
are chopped into pieces, which permits us to recognize similarities in
files and directories beyond file boundaries, and makes sure our chunk
sizes are pretty evenly distributed, without the file boundaries
affecting them.

The “chunking” algorithm is based on the buzhash rolling hash
function. SHA256 is used as the strong hash function to generate digests
of the chunks. xz is used to compress the individual chunks.

Here’s a diagram that would hopefully explain a bit how the encoding
process works, were it not for my crappy drawing skills:

Diagram

The diagram shows the encoding process from top to bottom. It starts
with a block device or a file tree, which is then serialized and
chunked up into variable sized blocks. The compressed chunks are then
placed in the chunk store, while a chunk index file is written listing
the chunk hashes in order. (The original SVG of this graphic may be
found here.)

Details

Note that casync operates on two different layers, depending on the
use-case of the user:

  1. You may use it on the block layer. In this case the raw block data
    on disk is taken as-is, read directly from the block device, split
    into chunks as described above, compressed, stored and delivered.

  2. You may use it on the file system layer. In this case, the
    file tree serialization format mentioned above comes into play:
    the file tree is serialized depth-first (much like tar would do
    it) and then split into chunks, compressed, stored and delivered.

The fact that it may be used on both the block and file system layer
opens it up for a variety of different use-cases. In the VM and IoT
ecosystems shipping images as block-level serializations is more
common, while in the container and application world file-system-level
serializations are more typically used.

Chunk index files referring to block-layer serializations carry the
.caibx suffix, while chunk index files referring to file system
serializations carry the .caidx suffix. Note that you may also use
casync as a direct tar replacement, i.e. without the chunking, just
generating the plain linear file tree serialization. Such files
carry the .catar suffix. Internally .caibx files are identical to
.caidx files, the only difference is semantic: .caidx files
describe a .catar file, while .caibx files may describe any other
blob. Finally, chunk stores are directories carrying the .castr
suffix.

Features

Here are a couple of other features casync has:

  1. When downloading a new image you may use casync‘s --seed=
    feature: each block device, file, or directory specified is processed
    using the same chunking logic described above, and is used as
    preferred source when putting together the downloaded image locally,
    avoiding network transfer of it. This of course is useful whenever
    updating an image: simply specify one or more old versions as seed and
    only download the chunks that truly changed since then. Note that
    using seeds requires no history relationship between seed and the new
    image to download. This has major benefits: you can even use it to
    speed up downloads of relatively foreign and unrelated data. For
    example, when downloading a container image built using Ubuntu you can
    use your Fedora host OS tree in /usr as seed, and casync will
    automatically use whatever it can from that tree, for example timezone
    and locale data that tends to be identical between
    distributions. Example: casync extract
    http://example.com/myimage.caibx --seed=/dev/sda1 /dev/sda2. This
    will place the block-layer image described by the indicated URL in the
    /dev/sda2 partition, using the existing /dev/sda1 data as seeding
    source. An invocation like this could typically be used by IoT systems
    with an A/B partition setup. Example 2: casync extract
    http://example.com/mycontainer-v3.caidx --seed=/srv/container-v1
    --seed=/srv/container-v2 /src/container-v3 is very similar but
    operates on the file system layer, and uses two old container versions
    to seed the new version.

  2. When operating on the file system level, the user has fine-grained
    control on the meta-data included in the serialization. This is
    relevant since different use-cases tend to require a different set of
    saved/restored meta-data. For example, when shipping OS images, file
    access bits/ACLs and ownership matter, while file modification times
    hurt. When doing personal backups OTOH file ownership matters little
    but file modification times are important. Moreover different backing
    file systems support different feature sets, and storing more
    information than necessary might make it impossible to validate a tree
    against an image if the meta-data cannot be replayed in full. Due to
    this, casync provides a set of --with= and --without= parameters
    that allow fine-grained control of the data stored in the file tree
    serialization, including the granularity of modification times and
    more. The precise set of selected meta-data features is also always
    part of the serialization, so that seeding can work correctly and
    automatically.

  3. casync tries to be as accurate as possible when storing file
    system meta-data. This means that besides the usual baseline of file
    meta-data (file ownership and access bits), and more advanced features
    (extended attributes, ACLs, file capabilities) a number of more exotic
    kinds of data are stored as well, including Linux
    chattr(1) file attributes, as
    well as FAT file attributes
    (you may wonder why the latter? — EFI is FAT, and /efi is part of
    the comprehensive serialization of any host). In the future I intend
    to extend this further, for example storing btrfs sub-volume
    information where available. Note that as described above every single
    type of meta-data may be turned off and on individually, hence if you
    don’t need FAT file bits (and I figure it’s pretty likely you don’t),
    then they won’t be stored.

  4. The user creating .caidx or .caibx files may control the desired
    average chunk length (before compression) freely, using the
    --chunk-size= parameter. Smaller chunks increase the number of
    generated files in the chunk store and increase HTTP GET load on the
    server, but also ensure that sharing between similar images is
    improved, as identical patterns in the images stored are more likely
    to be recognized. By default casync will use a 64K average chunk
    size. Tweaking this can be particularly useful when adapting the
    system to specific CDNs, or when delivering compressed disk images
    such as squashfs (see below).

  5. Emphasis is placed on making all invocations reproducible,
    well-defined and strictly deterministic. As mentioned above this is a
    requirement to reach the intended security guarantees, but is also
    useful for many other use-cases. For example, the casync digest
    command may be used to calculate a hash value identifying a specific
    directory in all desired detail (use --with= and --without= to pick
    the desired detail). Moreover the casync mtree command may be used
    to generate a BSD mtree(5) compatible manifest of a directory tree,
    .caidx or .catar file.

  6. The file system serialization format is nicely composable. By this
    I mean that the serialization of a file tree is the concatenation of
    the serializations of all files and file sub-trees located at the
    top of the tree, with zero meta-data references from any of these
    serializations into the others. This property is essential to ensure
    maximum reuse of chunks when similar trees are serialized.

  7. When extracting file trees or disk image files, casync
    will automatically create
    reflinks
    from any specified seeds if the underlying file system supports it
    (such as btrfs, ocfs, and future xfs). After all, instead of
    copying the desired data from the seed, we can just tell the file
    system to link up the relevant blocks. This works both when extracting
    .caidx and .caibx files — the latter of course only when the
    extracted disk image is placed in a regular raw image file on disk,
    rather than directly on a plain block device, as plain block devices
    do not know the concept of reflinks.

  8. Optionally, when extracting file trees, casync can
    create traditional UNIX hard-links for identical files in specified
    seeds (--hardlink=yes). This works on all UNIX file systems, and can
    save substantial amounts of disk space. However, this only works for
    very specific use-cases where disk images are considered read-only
    after extraction, as any changes made to one tree will propagate to
    all other trees sharing the same hard-linked files, as that’s the
    nature of hard-links. In this mode, casync exposes OSTree-like
    behavior, which is built heavily around read-only hard-link trees.

  9. casync tries to be smart when choosing what to include in file
    system images. Implicitly, file systems such as procfs and sysfs are
    excluded from serialization, as they expose API objects, not real
    files. Moreover, the “nodump” (+d)
    chattr(1) flag is honored by
    default, permitting users to mark files to exclude from serialization.

  10. When creating and extracting file trees casync may apply an
    automatic or explicit UID/GID shift. This is particularly useful when
    transferring container images for use with Linux user namespacing.

  11. In addition to local operation, casync currently supports HTTP,
    HTTPS, FTP and ssh natively for downloading chunk index files and
    chunks (the ssh mode requires installing casync on the remote host,
    but an sftp mode not requiring that should be easy to
    add). When creating index files or chunks, only ssh is supported as
    remote back-end.

  12. When operating on block-layer images, you may expose locally or
    remotely stored images as local block devices. Example: casync mkdev
    http://example.com/myimage.caibx
    exposes the disk image described by
    the indicated URL as local block device in /dev, which you then may
    use the usual block device tools on, such as mount or fdisk (only
    read-only though). Chunks are downloaded on access with high priority,
    and at low priority when idle in the background. Note that in this
    mode, casync also plays a role similar to “dm-verity”, as all blocks
    are validated against the strong digests in the chunk index file
    before passing them on to the kernel’s block layer. This feature is
    implemented through Linux’ NBD kernel facility.

  13. Similarly, when operating on file-system-layer images, you may mount
    locally or remotely stored images as regular file systems. Example:
    casync mount http://example.com/mytree.caidx /srv/mytree mounts the
    file tree image described by the indicated URL as a local directory
    /srv/mytree. This feature is implemented through Linux’ FUSE kernel
    facility. Note that special care is taken that the images exposed this
    way can be packed up again with casync make and are guaranteed to
    return the bit-by-bit exact same serialization that they were
    mounted from. No data is lost or changed while passing things through
    FUSE (OK, strictly speaking this is a lie, we do lose ACLs, but that’s
    hopefully just a temporary gap to be fixed soon).

  14. In IoT A/B fixed size partition setups the file systems placed in
    the two partitions are usually much shorter than the partition size,
    in order to keep some room for later, larger updates. casync is able
    to analyze the super-block of a number of common file systems in order
    to determine the actual size of a file system stored on a block
    device, so that writing a file system to such a partition and reading
    it back again will result in reproducible data. Moreover this speeds
    up the seeding process, as there’s little point in seeding the
    white-space after the file system within the partition.

Example Command Lines

Here’s how to use casync, explained with a few examples:

$ casync make foobar.caidx /some/directory

This will create a chunk index file foobar.caidx in the local
directory, and populate the chunk store directory default.castr
located next to it with the chunks of the serialization (you can
change the name for the store directory with --store= if you
like). This command operates on the file-system level. A similar
command operating on the block level:

$ casync make foobar.caibx /dev/sda1

This command creates a chunk index file foobar.caibx in the local
directory describing the current contents of the /dev/sda1 block
device, and populates default.castr in the same way as above. Note
that you may as well read a raw disk image from a file instead of a
block device:

$ casync make foobar.caibx myimage.raw

To reconstruct the original file tree from the .caidx file and
the chunk store of the first command, use:

$ casync extract foobar.caidx /some/other/directory

And similar for the block-layer version:

$ casync extract foobar.caibx /dev/sdb1

or, to extract the block-layer version into a raw disk image:

$ casync extract foobar.caibx myotherimage.raw

The above are the most basic commands, operating on local data
only. Now let’s make this more interesting, and reference remote
resources:

$ casync extract http://example.com/images/foobar.caidx /some/other/directory

This extracts the specified .caidx onto a local directory. This of
course assumes that foobar.caidx was uploaded to the HTTP server in
the first place, along with the chunk store. You can use any command
you like to accomplish that, for example scp or
rsync. Alternatively, you can let casync do this directly when
generating the chunk index:

$ casync make ssh.example.com:images/foobar.caidx /some/directory

This will use ssh to connect to the ssh.example.com server, and then
place the .caidx file and the chunks on it. Note that this mode of
operation is “smart”: this scheme will only upload chunks currently
missing on the server side, and not re-transmit what already is
available.

Note that you can always configure the precise path or URL of the
chunk store via the --store= option. If you do not do that, then the
store path is automatically derived from the path or URL: the last
component of the path or URL is replaced by default.castr.

Of course, when extracting .caidx or .caibx files from remote sources,
using a local seed is advisable:

$ casync extract http://example.com/images/foobar.caidx --seed=/some/existing/directory /some/other/directory

Or on the block layer:

$ casync extract http://example.com/images/foobar.caibx --seed=/dev/sda1 /dev/sdb2

When creating chunk indexes on the file system layer casync will by
default store meta-data as accurately as possible. Let’s create a chunk
index with reduced meta-data:

$ casync make foobar.caidx --with=sec-time --with=symlinks --with=read-only /some/dir

This command will create a chunk index for a file tree serialization
that has three features above the absolute baseline supported: 1s
granularity time-stamps, symbolic links and a single read-only bit. In
this mode, all the other meta-data bits are not stored, including
nanosecond time-stamps, full UNIX permission bits, file ownership or
even ACLs or extended attributes.

Now let’s make a .caidx file available locally as a mounted file
system, without extracting it:

$ casync mount http://example.com/images/foobar.caidx /mnt/foobar

And similar, let’s make a .caibx file available locally as a block device:

$ casync mkdev http://example.com/images/foobar.caibx

This will create a block device in /dev and print the used device
node path to STDOUT.

As mentioned, casync is big about reproducibility. Let’s make use of
that to calculate a digest identifying a very specific version of
a file tree:

$ casync digest .

This digest will include all meta-data bits casync and the underlying
file system know about. Usually, to make this useful you want to
configure exactly what meta-data to include:

$ casync digest --with=unix .

This makes use of the --with=unix shortcut for selecting meta-data
fields. Specifying --with=unix selects all meta-data that
traditional UNIX file systems support. It is a shortcut for writing out:
--with=16bit-uids --with=permissions --with=sec-time --with=symlinks
--with=device-nodes --with=fifos --with=sockets.

Note that when calculating digests or creating chunk indexes you may
also use the negative --without= option to remove specific features
but start from the most precise:

$ casync digest --without=flag-immutable

This generates a digest with the most accurate meta-data, but leaves
one feature out: chattr(1)‘s
immutable (+i) file flag.

To list the contents of a .caidx file use a command like the following:

$ casync list http://example.com/images/foobar.caidx

or

$ casync mtree http://example.com/images/foobar.caidx

The former command will generate a brief list of files and
directories, not too different from tar t or ls -al in its
output. The latter command will generate a BSD
mtree(5) compatible
manifest. Note that casync actually stores substantially more file
meta-data than mtree files can express, though.

What casync isn’t

  1. casync is not an attempt to minimize serialization and downloaded
    deltas to the extreme. Instead, the tool is supposed to find a good
    middle ground, that is good on traffic and disk space, but not at the
    price of convenience or requiring explicit revision control. If you
    care about updates that are absolutely minimal, there are binary delta
    systems around that might be an option for you, such as Google’s
    Courgette.

  2. casync is not a replacement for rsync, or git or zsync or
    anything like that. They have very different use-cases and
    semantics. For example, rsync permits you to directly synchronize two
    file trees remotely. casync just cannot do that, and it is unlikely
    it ever will.

Where next?

casync is supposed to be a generic synchronization tool. Its primary
focus for now is delivery of OS images, but I’d like to make it useful
for a couple other use-cases, too. Specifically:

  1. To make the tool useful for backups, encryption is missing. I have
    pretty concrete plans how to add that. When implemented, the tool
    might become an alternative to restic,
    BorgBackup or
    tarsnap.

  2. Right now, if you want to deploy casync in real-life, you still
    need to validate the downloaded .caidx or .caibx file yourself, for
    example with some gpg signature. It is my intention to integrate with
    gpg in a minimal way so that signing and verifying chunk index files
    is done automatically.

  3. In the longer run, I’d like to build an automatic synchronizer for
    $HOME between systems from this. Each $HOME instance would be
    stored automatically in regular intervals in the cloud using casync,
    and conflicts would be resolved locally.

  4. casync is written in a shared library style, but it is not yet
    built as one. Specifically this means that almost all of casync‘s
    functionality is supposed to be available as a C API soon, and
    applications can process casync files on every level. It is my
    intention to make this library useful enough so that it will be easy
    to write a module for GNOME’s gvfs subsystem in order to make remote
    or local .caidx files directly available to applications (as an
    alternative to casync mount). In fact the idea is to make this all
    flexible enough that even the remoting back-ends can be replaced
    easily, for example to replace casync‘s default HTTP/HTTPS back-ends
    built on CURL with GNOME’s own HTTP implementation, in order to share
    cookies, certificates, … There’s also an alternative method to
    integrate with casync in place already: simply invoke casync as a
    sub-process. casync will inform you about a certain set of state
    changes using a mechanism compatible with
    sd_notify(3). In
    future it will also propagate progress data this way and more.

  5. I intend to add a new seeding back-end that sources chunks from
    the local network. After downloading the new .caidx file off the
    Internet casync would then search for the listed chunks on the local
    network first before retrieving them from the Internet. This should
    speed things up on all installations that have multiple similar
    systems deployed in the same network.

Further plans are listed tersely in the
TODO file.

FAQ:

  1. Is this a systemd project? — casync is hosted under the
    github systemd umbrella, and the
    projects share the same coding style. However, the code-bases are
    distinct and without interdependencies, and casync works fine both
    on systemd systems and systems without it.

  2. Is casync portable? — At the moment: no. I only run Linux and
    that’s what I code for. That said, I am open to accepting portability
    patches (unlike for systemd, which doesn’t really make sense on
    non-Linux systems), as long as they don’t interfere too much with the
    way casync works. Specifically this means that I am not too
    enthusiastic about merging portability patches for OSes lacking the
    openat(2) family
    of APIs.

  3. Does casync require reflink-capable file systems to work, such
    as btrfs?
    — No it doesn’t. The reflink magic in casync is
    employed when the file system permits it, and it’s good to have it,
    but it’s not a requirement, and casync will implicitly fall back to
    copying when it isn’t available. Note that casync supports a number
    of file system features on a variety of file systems that aren’t
    available everywhere, for example FAT’s system/hidden file flags or
    xfs‘s projinherit file flag.

  4. Is casync stable? — I just tagged the first, initial
    release. While I have been working on it for quite some time and it
    is quite featureful, this is the first time I advertise it publicly,
    and it hence received very little testing outside of its own test
    suite. I am also not fully ready to commit to the stability of the
    current serialization or chunk index format. I don’t see any breakages
    coming for it though. casync is pretty light on documentation right
    now, and does not even have a man page. I also intend to correct that
    soon.

  5. Are the .caidx/.caibx and .catar file formats open and
    documented?
    casync is Open Source, so if you want to know the
    precise format, have a look at the sources for now. It’s definitely my
    intention to add comprehensive docs for both formats however. Don’t
    forget this is just the initial version right now.

  6. casync is just like $SOMEOTHERTOOL! Why are you reinventing
    the wheel (again)?
    — Well, because casync isn’t “just like” some
    other tool. I am pretty sure I did my homework, and that there is no
    tool just like casync right now. The tools coming closest are probably
    rsync, zsync, tarsnap, restic, but they are quite different beasts
    each.

  7. Why did you invent your own serialization format for file trees?
    Why don’t you just use tar?
    — That’s a good question, and other
    systems — most prominently tarsnap — do that. However, as mentioned
    above tar doesn’t enforce reproducibility. It also doesn’t really do
    random access: if you want to access some specific file you need to
    read every single byte stored before it in the tar archive to find
    it, which is of course very expensive. The serialization casync
    implements places a focus on reproducibility, random access, and
    meta-data control. Much like traditional tar it can still be
    generated and extracted in a stream fashion though.

  8. Does casync save/restore SELinux/SMACK file labels? — At the
    moment not. That’s not because I wouldn’t want it to, but simply
    because I am not a guru of either of these systems, and didn’t want to
    implement something I do not fully grok nor can test. If you look at
    the sources you’ll find that there’s already some definitions in place
    that keep room for them though. I’d be delighted to accept a patch
    implementing this fully.

  9. What about delivering squashfs images? How well does chunking
    work on compressed serializations?
    – That’s a very good point!
    Usually, if you apply a chunking algorithm to a compressed data
    stream (let’s say a tar.gz file), then changing a single bit at the
    front will propagate into the entire remainder of the file, so that
    minimal changes will explode into major changes. Thankfully this
    doesn’t apply that strictly to squashfs images, as it provides
    random access to files and directories and thus breaks up the
    compression streams in regular intervals to make seeking easy. This
    fact is beneficial for systems employing chunking, such as casync as
    this means single bit changes might affect their vicinity but will not
    explode in an unbounded fashion. In order to achieve the best results
    when delivering squashfs images through casync, the block sizes of
    squashfs and the chunk sizes of casync should be matched up
    (using casync‘s --chunk-size= option). How precisely to choose
    both values is left as a research subject for the user, for now.

  10. What does the name casync mean? – It’s a synchronizing
    tool, hence the -sync suffix, following rsync‘s naming. It makes
    use of the content-addressable concept of git hence the ca-
    prefix.

  11. Where can I get this stuff? Is it already packaged? – Check
    out the sources on GitHub. I just tagged the first version. Martin
    Pitt has packaged casync for Ubuntu. There is also an ArchLinux
    package. Zbigniew Jędrzejewski-Szmek has prepared a Fedora RPM that
    hopefully will soon be included in the distribution.

Should you care? Is this a tool for you?

Well, that’s up to you really. If you are involved with projects that
need to deliver IoT, VM, container, application or OS images, then
maybe this is a great tool for you — but other options exist, some of
which are linked above.

Note that casync is an Open Source project: if it doesn’t do exactly
what you need, prepare a patch that adds what you need, and we’ll
consider it.

If you are interested in the project and would like to talk about this
in person, I’ll be presenting casync soon at Kinvolk’s Linux
Technologies Meetup in Berlin, Germany. You are invited. I also intend
to talk about it at All Systems Go!, also in Berlin.

How to Help Protect Dynamic Web Applications Against DDoS Attacks by Using Amazon CloudFront and Amazon Route 53

Post Syndicated from Holly Willey original https://aws.amazon.com/blogs/security/how-to-protect-dynamic-web-applications-against-ddos-attacks-by-using-amazon-cloudfront-and-amazon-route-53/

Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints.

In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.

Background

AWS hosts CloudFront and Route 53 services on a distributed network of proxy servers, called edge locations, in data centers throughout the world. Using the global Amazon network of edge locations for application delivery and DNS service plays an important part in building a comprehensive defense against DDoS attacks for your dynamic web applications. These web applications can benefit from the increased security and availability provided by CloudFront and Route 53, as well as from an improved end-user experience due to reduced latency.

The following screenshot of an Amazon.com webpage shows how static and dynamic content can compose a dynamic web application that is delivered via HTTPS protocol for the encryption of user page requests as well as the pages that are returned by a web server.

Screenshot of an Amazon.com webpage with static and dynamic content

The following map shows the global Amazon network of edge locations available to serve static content and proxy requests for dynamic content back to the origin as of the writing of this blog post. For the latest list of edge locations, see AWS Global Infrastructure.

Map showing Amazon edge locations

How AWS Shield, CloudFront, and Route 53 work to help protect against DDoS attacks

To help keep your dynamic web applications available when they are under DDoS attack, the steps in this post enable AWS Shield Standard by configuring your applications behind CloudFront and Route 53. AWS Shield Standard protects your resources from common, frequently occurring network and transport layer DDoS attacks. Attack traffic can be geographically isolated and absorbed using the capacity in edge locations close to the source. Additionally, you can configure geographical restrictions to help block attacks originating from specific countries.

The request-routing technology in CloudFront connects each client to the nearest edge location, as determined by continuously updated latency measurements. HTTP and HTTPS requests sent to CloudFront can be monitored, and access to your application resources can be controlled at edge locations using AWS WAF. Based on conditions that you specify in AWS WAF, such as the IP addresses that requests originate from or the values of query strings, traffic can be allowed, blocked, or allowed and counted for further investigation or remediation. The following diagram shows how static and dynamic web application content can originate from endpoint resources within AWS or your corporate data center. For more details, see How CloudFront Delivers Content and How CloudFront Works with Regional Edge Caches.

Route 53 DNS requests and subsequent application traffic routed through CloudFront are inspected inline. Always-on monitoring, anomaly detection, and mitigation against common infrastructure DDoS attacks such as SYN/ACK floods, UDP floods, and reflection attacks are built into both Route 53 and CloudFront. For a review of common DDoS attack vectors, see How to Help Prepare for DDoS Attacks by Reducing Your Attack Surface. When the SYN flood attack threshold is exceeded, SYN cookies are activated to avoid dropping connections from legitimate clients. Deterministic packet filtering drops malformed TCP packets and invalid DNS requests, only allowing traffic to pass that is valid for the service. Heuristics-based anomaly detection evaluates attributes such as type, source, and composition of traffic. Traffic is scored across many dimensions, and only the most suspicious traffic is dropped. This method allows you to avoid false positives while protecting application availability.

Route 53 is also designed to withstand DNS query floods, which are real DNS requests that can continue for hours and attempt to exhaust DNS server resources. Route 53 uses shuffle sharding and anycast striping to spread DNS traffic across edge locations and help protect the availability of the service.

The next four sections provide guidance about how to deploy CloudFront, Route 53, AWS WAF, and, optionally, AWS Shield Advanced.

Deploy CloudFront

To take advantage of application delivery with DDoS mitigations at the edge, start by creating a CloudFront distribution and configuring origins:

  1. Sign in to the AWS Management Console and open the CloudFront console.
  2. Choose Create Distribution.
  3. On the first page of the Create Distribution Wizard, in the Web section, choose Get Started.
  4. Specify origin settings for the distribution. The following screenshot of the CloudFront console shows an example CloudFront distribution configured with an Elastic Load Balancing load balancer origin, as shown in the previous diagram. I have configured this example to set the Origin SSL Protocols to use TLSv1.2 and the Origin Protocol Policy to HTTP Only. For more information about creating an HTTPS listener for your ELB load balancer and requesting a certificate from AWS Certificate Manager (ACM), see Getting Started with Elastic Load Balancing, Supported Regions, and Requiring HTTPS for Communication Between CloudFront and Your Custom Origin.
  5. Specify cache behavior settings for the distribution, as shown in the following screenshot. You can configure each URL path pattern with a set of associated cache behaviors. For dynamic web applications, set the Minimum TTL to 0 so that CloudFront will make a GET request with an If-Modified-Since header back to the origin. When CloudFront proxies traffic to the origin from edge locations and back, multiple concurrent requests for the same object are collapsed into a single request. The request is sent over a persistent connection from the edge location to the region over networks monitored by AWS. The use of a large initial TCP window size in CloudFront maximizes the available bandwidth, and TCP Fast Open (TFO) reduces latency.
  6. To ensure that all traffic to CloudFront is encrypted and to enable SSL termination from clients at global edge locations, specify Redirect HTTP to HTTPS for Viewer Protocol Policy. Moving SSL termination to CloudFront offloads computationally expensive SSL negotiation, helps mitigate SSL abuse, and reduces latency with the use of OCSP stapling and session tickets. For more information about options for serving HTTPS requests, see Choosing How CloudFront Serves HTTPS Requests. For dynamic web applications, set Allowed HTTP Methods to include all methods, set Forward Headers to All, and for Query String Forwarding and Caching, choose Forward all, cache based on all.
  7. Specify distribution settings for the distribution, as shown in the following screenshot. Enter your domain names in the Alternate Domain Names box and choose Custom SSL Certificate.
  8. Choose Create Distribution. Note the x.cloudfront.net Domain Name of the distribution. In the next section, you will configure Route 53 to route traffic to this CloudFront distribution domain name.

Configure Route 53

When you created a web distribution in the previous section, CloudFront assigned a domain name to the distribution, such as d111111abcdef8.cloudfront.net. You can use this domain name in the URLs for your content, such as: http://d111111abcdef8.cloudfront.net/logo.jpg.

Alternatively, you might prefer to use your own domain name in URLs, such as: http://example.com/logo.jpg. You can accomplish this by creating a Route 53 alias resource record set that routes dynamic web application traffic to your CloudFront distribution by using your domain name. Alias resource record sets are virtual records, specific to Route 53, that map queries for your domain name to your CloudFront distribution. Alias resource record sets are similar to CNAME records except there is no charge for DNS queries to Route 53 alias resource record sets mapped to AWS services. Alias resource record sets are also not visible to resolvers, and they can be created for the root domain (zone apex) as well as subdomains.

A hosted zone, similar to a DNS zone file, is a collection of records that belongs to a single parent domain name. Each hosted zone has four nonoverlapping name servers in a delegation set. If a DNS query is dropped, the client automatically retries the next name server. If you have not already registered a domain name and have not configured a hosted zone for your domain, complete these two prerequisite steps before proceeding: register a domain name, and create a public hosted zone for that domain in Route 53.

After you have registered your domain name and configured your public hosted zone, follow these steps to create an alias resource record set:

  1. Sign in to the AWS Management Console and open the Route 53 console.
  2. In the navigation pane, choose Hosted Zones.
  3. Choose the name of the hosted zone for the domain that you want to use to route traffic to your CloudFront distribution.
  4. Choose Create Record Set.
  5. Specify the following values:
    • Name – Type the domain name that you want to use to route traffic to your CloudFront distribution. The default value is the name of the hosted zone. For example, if the name of the hosted zone is example.com and you want to use acme.example.com to route traffic to your distribution, type acme.
    • Type – Choose A – IPv4 address. If IPv6 is enabled for the distribution and you are creating a second resource record set, choose AAAA – IPv6 address.
    • Alias – Choose Yes.
    • Alias Target – In the CloudFront distributions section, choose the name that CloudFront assigned to the distribution when you created it.
    • Routing Policy – Accept the default value of Simple.
    • Evaluate Target Health – Accept the default value of No.
  6. Choose Create.
  7. If IPv6 is enabled for the distribution, repeat Steps 4 through 6. Specify the same settings except for the Type field, as explained in Step 5.

The following screenshot of the Route 53 console shows a Route 53 alias resource record set that is configured to map a domain name to a CloudFront distribution.
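
If you prefer to create the same record programmatically, the following sketch uses the AWS SDK for JavaScript. The hosted zone ID, record name, and distribution domain name are placeholders; Z2FDTNDATAQYW2 is the fixed hosted zone ID that Route 53 uses for CloudFront alias targets.

'use strict';
const AWS = require('aws-sdk');
const route53 = new AWS.Route53();

const params = {
    HostedZoneId: 'ZEXAMPLE12345',             /* your hosted zone (placeholder) */
    ChangeBatch: {
        Changes: [{
            Action: 'UPSERT',
            ResourceRecordSet: {
                Name: 'acme.example.com',
                Type: 'A',                     /* repeat with Type: 'AAAA' if IPv6 is enabled */
                AliasTarget: {
                    DNSName: 'd111111abcdef8.cloudfront.net',
                    HostedZoneId: 'Z2FDTNDATAQYW2',
                    EvaluateTargetHealth: false,
                },
            },
        }],
    },
};

route53.changeResourceRecordSets(params, (err, data) => {
    if (err) console.error(err);
    else console.log('Change submitted:', data.ChangeInfo.Id);
});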

If your dynamic web application requires geo redundancy, you can use latency-based routing in Route 53 to run origin servers in different AWS regions. Route 53 is integrated with CloudFront to collect latency measurements from each edge location. With Route 53 latency-based routing, each CloudFront edge location goes to the region with the lowest latency for the origin fetch.

Enable AWS WAF

AWS WAF is a web application firewall that helps detect and mitigate web application layer DDoS attacks by inspecting traffic inline. Application layer DDoS attacks use well-formed but malicious requests to evade mitigation and consume application resources. You can define custom security rules (also called web ACLs) that contain a set of conditions, rules, and actions to block attacking traffic. After you define web ACLs, you can apply them to CloudFront distributions, and web ACLs are evaluated in the priority order you specified when you configured them. Real-time metrics and sampled web requests are provided for each web ACL.

You can configure AWS WAF whitelisting or blacklisting in conjunction with CloudFront geo restriction to prevent users in specific geographic locations from accessing your application. The AWS WAF API supports security automation such as blacklisting IP addresses that exceed request limits, which can be useful for mitigating HTTP flood attacks. Use the AWS WAF Security Automations Implementation Guide to implement rate-based blacklisting.

The following diagram shows how (a) the flow of CloudFront access log files to an Amazon S3 bucket (b) provides the source data for the Lambda log parser function (c) to identify HTTP flood traffic and update AWS WAF web ACLs. As CloudFront receives requests on behalf of your dynamic web application, it sends access logs to an S3 bucket, triggering the Lambda log parser. The Lambda function parses CloudFront access logs to identify suspicious behavior, such as an unusual number of requests or errors, and it automatically updates your AWS WAF rules to block subsequent requests from the IP addresses in question for a period of time that you specify.

Diagram of the process
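
To give a flavor of what the log parser does when it decides to block an offender, here is a hedged sketch (not the solution's actual code) that inserts an address into an AWS WAF IP set using boto3; the IP set ID and the offending address are placeholders:

import boto3

waf = boto3.client("waf")  # global AWS WAF endpoint, as used with CloudFront

IP_SET_ID = "example-ip-set-id"   # placeholder: the IP set referenced by your block rule
OFFENDING_IP = "192.0.2.44/32"    # placeholder: an address flagged by the log parser

# Every AWS WAF (classic) update call requires a fresh change token.
token = waf.get_change_token()["ChangeToken"]

waf.update_ip_set(
    IPSetId=IP_SET_ID,
    ChangeToken=token,
    Updates=[{
        "Action": "INSERT",
        "IPDescriptor": {"Type": "IPV4", "Value": OFFENDING_IP},
    }],
)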

In addition to automated rate-based blacklisting to help protect against HTTP flood attacks, prebuilt AWS CloudFormation templates are available to simplify the configuration of AWS WAF for a proactive application-layer security defense. The following diagram shows how a CloudFormation template is used to create the CommonAttackProtection stack, which includes AWS WAF web ACLs whose rules block, allow, or count requests that meet the criteria defined in each rule.

Diagram of CloudFormation template input into the creation of the CommonAttackProtection stack

To implement these application layer protections, follow the steps in Tutorial: Quickly Setting Up AWS WAF Protection Against Common Attacks. After you have created your AWS WAF web ACLs, you can assign them to your CloudFront distribution by updating the settings.

  1. Sign in to the AWS Management Console and open the CloudFront console.
  2. Choose the link under the ID column for your CloudFront distribution.
  3. Choose Edit on the General tab.
  4. Choose your AWS WAF web ACL from the drop-down list.
  5. Choose Yes, Edit.
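
If you would rather script the association than click through the console, a boto3 sketch along these lines should work (the distribution ID and web ACL ID are placeholders):

import boto3

cloudfront = boto3.client("cloudfront")

DISTRIBUTION_ID = "EDFDVBD6EXAMPLE"  # placeholder: your distribution ID
WEB_ACL_ID = "example-web-acl-id"    # placeholder: the web ACL you created

# Read the current configuration and its ETag, set the web ACL,
# then write the whole configuration back with the ETag as IfMatch.
response = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
config = response["DistributionConfig"]
config["WebACLId"] = WEB_ACL_ID

cloudfront.update_distribution(
    Id=DISTRIBUTION_ID,
    DistributionConfig=config,
    IfMatch=response["ETag"],
)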

Activate AWS Shield Advanced (optional)

Deploying CloudFront, Route 53, and AWS WAF as described in this post enables the built-in DDoS protections for your dynamic web applications that are included with AWS Shield Standard. (There is no upfront cost or charge for AWS Shield Standard beyond the normal pricing for CloudFront, Route 53, and AWS WAF.) AWS Shield Standard is designed to meet the needs of many dynamic web applications.

For dynamic web applications that have a high risk or history of frequent, complex, or high volume DDoS attacks, AWS Shield Advanced provides additional DDoS mitigation capacity, attack visibility, cost protection, and access to the AWS DDoS Response Team (DRT). For more information about AWS Shield Advanced pricing, see AWS Shield Advanced pricing. To activate advanced protection services, follow these steps:

  1. Sign in to the AWS Management Console and open the AWS WAF console.
  2. If this is your first time signing in to the AWS WAF console, choose Get started with AWS Shield Advanced. Otherwise, choose Protected resources.
  3. Choose Activate AWS Shield Advanced.
  4. Choose the resource type and resource to protect.
  5. For Name, enter a friendly name that will help you identify the AWS resources that are protected. For example, My CloudFront AWS Shield Advanced distributions.
  6. (Optional) For Web DDoS attack, select Enable. You will be prompted to associate an existing web ACL with these resources, or create a new ACL if you don’t have any yet.
  7. Choose Add DDoS protection.
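
Shield Advanced protections can also be added programmatically once your account is subscribed. As a rough boto3 sketch (the distribution ARN is a placeholder, and this assumes an active Shield Advanced subscription):

import boto3

shield = boto3.client("shield")

# Placeholder ARN in the form arn:aws:cloudfront::<account-id>:distribution/<distribution-id>
DISTRIBUTION_ARN = "arn:aws:cloudfront::123456789012:distribution/EDFDVBD6EXAMPLE"

shield.create_protection(
    Name="My CloudFront AWS Shield Advanced distribution",
    ResourceArn=DISTRIBUTION_ARN,
)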

Summary

In this blog post, I outline the steps to deploy CloudFront and configure Route 53 in front of your dynamic web application to leverage the global Amazon network of edge locations for DDoS resiliency. The post also provides guidance about enabling AWS WAF for application layer traffic monitoring and automated rules creation to block malicious traffic. I also cover the optional steps to activate AWS Shield Advanced, which helps build a more comprehensive defense against DDoS attacks for your dynamic web applications.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, please open a new thread on the AWS WAF forum.

– Holly

Firefox 52.0

Post Syndicated from ris original https://lwn.net/Articles/716426/rss

Firefox 52.0 has been released. This version features support for WebAssembly, adds user warnings for non-secure HTTP pages with logins, implements the Strict Secure Cookies specification which forbids insecure HTTP sites from setting cookies with the “secure” attribute, and enhances Sync to allow users to send and open tabs from one device to another. See the release notes for more information.

Cloudflare Reverse Proxies are Dumping Uninitialized Memory

Post Syndicated from ris original https://lwn.net/Articles/715535/rss

Thanks to Josh Triplett for sending us this chromium bug report about a dump of uninitialized memory caused by Cloudflare’s reverse proxies. “A while later, we figured out how to reproduce the problem. It looked like that if an html page hosted behind cloudflare had a specific combination of unbalanced tags, the proxy would intersperse pages of uninitialized memory into the output (kinda like heartbleed, but cloudflare specific and worse for reasons I’ll explain later). My working theory was that this was related to their “ScrapeShield” feature which parses and obfuscates html – but because reverse proxies are shared between customers, it would affect *all* Cloudflare customers. We fetched a few live samples, and we observed encryption keys, cookies, passwords, chunks of POST data and even HTTPS requests for other major cloudflare-hosted sites from other users. Once we understood what we were seeing and the implications, we immediately stopped and contacted cloudflare security.”

You don’t need printer security

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/02/you-dont-need-printer-security.html

So there’s this tweet:

What it’s probably referring to is this:

This is an obviously bad idea.

Well, not so “obvious”, so some people have asked me to clarify the situation. After all, without “security”, couldn’t a printer just be added to a botnet of IoT devices?

The answer is this:

Fixing insecurity is almost always better than adding a layer of security.

Adding security is notoriously problematic, for three reasons:

  1. Hackers are active attackers. When presented with a barrier in front of an insecurity, they’ll often find ways around that barrier. It’s a common problem with “web application firewalls”, for example.
  2. The security software itself can become a source of vulnerabilities hackers can attack, which has happened frequently in anti-virus and intrusion prevention systems.
  3. Security features are usually snake oil, sounding great on paper, but with no details and no independent evaluation provided to the public.

It’s the last one that’s most important. HP markets features, but there’s no guarantee they work. In particular, similar features in other products have proven not to work in the past.

HP describes its three special features in a brief whitepaper [*]. They aren’t bad, but at the same time, they aren’t particularly good. Windows already offers all these features. Indeed, as far as I know, they are just using Windows as their firmware operating system, and are just slapping an “HP” marketing name onto existing Windows functionality.

HP Sure Start: This refers to the standard feature in almost all devices these days of having a secure boot process. Windows supports this in UEFI boot. Apple’s iPhones work this way, which is why the FBI needed Apple’s help to break into a captured terrorist’s phone. It’s a feature built into most IoT hardware, though most don’t enable it in software.

Whitelisting: Their description sounds like “signed firmware updates”, but if that were the case, they’d call it that. Traditionally, “whitelisting” referred to a different feature, containing a list of hashes for programs that can run on the device. Either way, it’s a pretty common functionality.

Run-time intrusion detection: They have numerous, conflicting descriptions on their website. It may mean scanning memory for signatures of known viruses. It may mean stack cookies. It may mean double-checking kernel modules. Windows already does all of these things, and they provide only a tiny benefit in stopping security threats.

As for traditional threats for attacks against printers, none of these really are important. What you need to secure a printer is the ability to disable services you aren’t using (close ports), enable passwords and other access control, and delete files of old print jobs so hackers can’t grab them from the printer. HP has features to address these security problems, but then, so do its competitors.

Lastly, printers should be behind firewalls, not only protected from the Internet, but also segmented from the corporate network, so that only the designated ports, or flows between the printer and print servers, are enabled.

Conclusion

The features HP describes are snake oil. If they worked well, they’d still only address a small part of the spectrum of attacks against printers. And, since there’s no technical details or independent evaluation of the features, they are almost certainly lies.

If HP really cared about security, they’d make their software more secure. They’d use fuzzing tools like AFL to secure it. They’d enable ASLR and stack cookies. They’d compile C code with run-time buffer overflow checks. They’d have a bug bounty program. It’s not something they can easily market, but at least it’d be real.

If you cared about printer security, then do the steps I outline above, especially firewalling printers from the traditional network. Seriously, putting a $100 firewall between a VLAN for your printers and the rest of the network is a cheap and easy way to get a vast amount of security. If you can’t secure printers this way, buying snake oil features like HP describes won’t help you.

CES 2017: Trends For the Tech Savvy To Watch

Post Syndicated from Peter Cohen original https://www.backblaze.com/blog/ces-2017-trends-tech-savvy-watch/

This year’s Consumer Electronics Show (CES) just wrapped up in Las Vegas. The usual parade of cool tech toys created a lot of headlines this year, but there were some genuine trends to keep an eye on too. If you’re like us, you’re probably one of the first people around to adopt promising new technologies when they emerge. As early adopters we can sometimes lose the forest for the trees when it comes to understanding what this means for everyone else, so we’re going to look at it through that prism.

Alexa everywhere

2017 promises to be a big year for voice-activated “smart home” devices. The final landscape is still to be determined – all the expected players have a foot in the door right now: Amazon, Apple, Google, Microsoft, even some smaller players.

Amazon deserves props after a holiday season that saw its Echo and Echo Dot devices in high demand. The company has published an API, and Alexa is picking up plenty of support from third-party manufacturers. Amazon is testing Alexa far beyond the Echo, it seems.

Electronics giant LG is building Alexa into a line of robots designed for domestic duties and a refrigerator that also sports interior fridge cams, for example. Ford is integrating Alexa support into its Sync 3 automotive interface. Televisions, lighting devices, and home security products are among the many devices to feature Alexa integration.

Alexa is the new hotness, but the real trend here is in voice-assisted connectivity around the home. Even if Alexa runs out of steam, this tech is here to stay. The Internet of Things and voice activated interfaces are converging quickly, though that day isn’t today. It’s tantalizingly close. It’s still a niche, though, where it will stay for as long as consumers have to piece different things together to get it to work. That means there’s still room for disruption.

There’s especially ripe opportunity in underserved verticals. Take the home health market, for example: Natural language interfaces have huge implications for elderly and disabled care and assistance. Finding and developing solutions for those sorts of vertical markets is an awesome opportunity for the right players.

Of course, with great power comes great responsibility. The family of a six-year-old recently got stuck with a $160 bill after she told Alexa to order her cookies and a dollhouse. The family ended up donating the accidental order to charity. For what it’s worth, that problem can be avoided by activating a confirmation code feature in the Alexa software.

The Electric Vehicle (EV) Market Heats Up

One of the trickiest things to do at CES is separate hype from substance. Nowhere was that more apparent last week than at the unveiling of Faraday Future’s FF91, a new Electric Vehicle (EV) positioned to go toe-to-toe with Tesla’s EV fleet.

The FF91 EV can purportedly go 378 miles on a single charge and also possesses autonomous driving capabilities (although its vaunted self-parking abilities didn’t demo as well as planned). When or if it’ll make it into production is still a head-scratcher, however. Faraday Future says it’ll be out next year, assuming that the company is beyond the production and manufacturing woes that have plagued it up until now.

While new vehicles and vehicle concepts are still largely the domain of auto shows, some auto manufacturers used CES to float new concepts ahead of the Detroit Auto Show, which happens this week. Toyota, for example, showed off its Concept-i, a car with artificial intelligence and natural language processing (like Siri or Alexa) designed to learn from you and adapt.

As we mentioned, Alexa is integrated into Ford’s Sync 3 platform, too. Already you can buy new cars with CarPlay and Android Auto, which makes it a lot easier to just talk with your mobile device to stay connected, get directions and entertain yourself on the morning commute simply by talking to your car instead of touching buttons. That’s a smart user interface change, but it’s still a potentially dangerous distraction for the driver. For this technology to succeed, it’s imperative that natural language interface designers make the experience as frictionless as possible.

Chrysler is making a play for future millennial families. We’re not making this up – they used “millennial” to describe the target market for this several times. The Portal concept is an electric minivan of sorts that’s chock-full of buzzwords: facial recognition, Wi-Fi, media sharing, ten charging ports, semi-autonomous driving abilities, and more.

2017 marks a pivot for car makers in this respect. For years the conventional wisdom was that millennials were a lost cause for auto makers – Uber and Zipcar were all they needed. It turns out that was totally wrong. Economic pressures and diverse lifestyles may have delayed millennials’ trek toward auto ownership, but they’re turning out now in big numbers to buy wheels. Millennial families will need transportation just like the generations before them, going back to the station wagon, which is why Chrysler says this “fifth-generation” family car will go into production sometime after 2018.

Volkswagen showed off its new I.D. concept car, a Golf-looking EV that also has all the requisite buzzwords. Speaking of buzzwords, what really excited us was the I.D. Buzz. This new EV resurrects the styling of the Hippy-era Microbus, with mood lighting, autonomous driving capabilities and a retractable steering wheel.

Rumors have persisted for years that VW was on the cusp of introducing a refreshed Microbus, but those rumors have never come to pass. And unfortunately, VW has no concrete plans to actually produce this – it seems to be a marketing effort to draw on nostalgic Boomer appeal, more than anything.

Both Buzz and Chrysler’s Portal do give us some insight about where auto makers are going when it comes to future generations of minivans: Electric, autonomous, customizable and more social than ever. If we are headed towards a future where vehicles drive themselves, family transportation will look very different than it is today.

Laptops At Both Extremes

CES saw the rollout of several new PC laptop models and concepts that will be hitting store shelves over the next several months.

Gamers looking for more real estate – a lot more real estate – were interested in Razer’s latest concept, Project Valerie. The laptop sports not one but three 4K displays which fold out on hinges. That’s 12K pixels of horizontal image space, mated to an Nvidia GeForce GTX 1080 graphics processor. A unibody aluminum chassis keeps it relatively thin (1.5 inches) when closed, but the entire rig weighs more than 12 pounds. Razer doesn’t have any immediate production plans, which may explain why their prototype was stolen before the end of the show.

Unlike Razer, Acer has production plans – immediate plans – for its gargantuan 21-inch Predator 21X laptop, priced at $8,999 and headed to store shelves next month. It was announced last year, but Acer finally offered launch details last week. A 17-inch model is also coming soon.

Big gaming laptops make for pretty pictures and certainly have their place in the PC ecosystem, but they’re niche devices. After a ramp up on 2-in-1s and low-powered laptops, Intel’s Kaby Lake processors are finally ready for the premium and mid-range laptop market. Kaby Lake efficiency improvements are helping PC makers build thinner and lighter laptops with better battery life, 4K video processing, faster solid state storage and more.

HP, Asus, MSI, Dell (and its gaming arm Alienware) were among the many companies with sleek new Kaby Lake-equipped models.

Gaming in the cloud with Nvidia

Nvidia, makers of premium graphics processors, offers GeForce Now cloud gaming to users of its Shield, an Android-based gaming handheld. That service is expanding to Windows and Mac in March.

Gaming as a Service, if you will, isn’t a new idea. OnLive pioneered the concept more than a decade ago. Gaikai followed, then was acquired by Sony in 2012. Nvidia’s had limited success with GeForce Now, but it’s been a single-platform offering up until now.

Nvidia has robust data centers to handle the processing and traffic, so best of luck to them as they scale up to meet demand. Gaming is very sensitive to network disruption – no gamer appreciates lag – so it’ll be interesting to see how GeForce Now scales to accommodate the new devices.

Mesh Networking

Mesh networking delivers more consistent, stronger network reception and performance than a conventional Wi-Fi router. Some of us have set up routers and extenders to fix dead spots – mesh networking works differently through smart traffic and better radio management between multiple network bases.

Eero, Ubiquiti, and even Google (with Google Wifi) are already offering mesh networking products, and this market segment looks to expand big in 2017. Netgear, Linksys, Asus, TP-Link and others are among those with new mesh networking setups. Mesh networking gear is still hampered by a higher price than plain old routers. That means the value isn’t there for some of us who have networking gear that gets the job done, even with shortcomings like dead zones or slow zones. But prices are coming down fast as more companies get into the market. If you have an 802.11ac router you’re happy with, stick with it for now, and move to a mesh networking setup for your next Wi-Fi upgrade.

Getting Your Feet Into VR

Our award for wackiest CES product has to go to Cerevo Taclim. Tactile feedback shoes and wireless hand controllers that help you “feel” the surface you’re walking on. Crunching snow underfoot, splashing through water. At an expected $1,000-$1,500 a pop, these probably won’t be next year’s Hatchimals, but it’s fun to imagine what game devs can do with the technology. Strap these to your feet then break out your best Hadouken in Street Fighter VR!

CES isn’t the real world. Only a fraction of what’s shown off ever sees the light of day, but it’s always interesting to see the trend-focused consumer electronics market shift and change from year to year. At the end of the year we hope to look back and see how much of this stuff ended up resonating with the actual consumer the show is named for.

The post CES 2017: Trends For the Tech Savvy To Watch appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

[email protected] – Preview

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/coming-soon-lambda-at-the-edge/

Just last week, a comment that I made on Hacker News resulted in an interesting email from an AWS customer!

He told me that he runs a single page app that is hosted on S3 (read about this in Host Your Static Website on Amazon S3) and served up at low latency through Amazon CloudFront. The page includes some dynamic elements that are customized for each user via an API hosted on AWS Elastic Beanstalk.

Here’s how he explained his problem to me:

In order to properly get indexed by search engines and in order for previews of our content to show up correctly within Facebook and Twitter, we need to serve a prerendered version of each of our pages. In order to do this, every time a normal user hits our site, we need them to be served our normal front end from CloudFront. But if the user agent matches Google / Facebook / Twitter etc., we need to instead redirect them to the prerendered version of the site.

Without spilling any beans I let him know that we were very aware of this use case and that we had some interesting solutions in the works. Other customers have also let us know that they want to customize their end user experience by making quick decisions out at the edge.

It turns out that there are many compelling use cases for “intelligent” processing of HTTP requests at a location that is close (latency-wise) to the customer. These include inspection and alteration of HTTP headers, access control (requiring certain cookies to be present), device detection, A/B testing, expedited or special handling for crawlers or ‘bots, and rewriting user-friendly URLs to accommodate legacy systems. Many of these use cases require more processing and decision-making than can be expressed by simple pattern matching and rules.

[email protected]
In order to provide support for these use cases (and others that you will dream up), we are launching a preview of [email protected]. This new Lambda-based processing model allows you to write JavaScript code that runs within the ever-growing network of AWS edge locations.

You can now write lightweight request processing logic that springs to life quickly and handles requests and responses that flow through a CloudFront distribution. You can run code in response to four distinct events:

Viewer Request – Your code will run on every request, whether the content is cached or not. Here’s some simple header processing code:

exports.viewer_request_handler = function(event, context) {
  // Copy every incoming header to a duplicate header prefixed with "X-".
  var request = event.Records[0].cf.request;
  var headers = request.headers;
  // Snapshot the keys first so the headers we add aren't enumerated again.
  Object.keys(headers).forEach(function(header) {
    headers["X-".concat(header)] = headers[header];
  });
  context.succeed(request);
}

Origin Request – Your code will run when the requested content is not cached at the edge, before the request is passed along to the origin. You can add more headers, modify existing ones, or modify the URL.

Viewer Response – Your code will run on every response, cached or not. You could use this to clean up some headers that need not be passed back to the viewer.

Origin Response – Your code will run after a cache miss causes an origin fetch and returns a response to the edge.

Your code has access to many aspects of the requests and responses including the URL, method, HTTP version, client IP address, and headers. Initially, you will be able to add, delete, and modify the headers. Soon, you will have complete read/write access to all of the values including the body.

Because your JavaScript code will be part of the request/response path, it must be lean, mean, and self-contained. It cannot make calls to other web services and it cannot access other AWS resources. It must run within 128 MB of memory, and complete within 50 ms.

To get started, you will simply create a new Lambda function, set your distribution as the trigger, and choose the new Edge runtime:

Then you write your code as usual; Lambda will take care of the behind-the-scenes work of getting it to the edge locations.

Interested?
I believe that this cool new processing model will lead to the creation of some very cool new applications and development tools. I can’t wait to see what you come up with!

We are launching a limited preview of [email protected] today and are taking applications now. If you have a relevant use case and are ready to try this out, please apply here.

Jeff;

 

Hacking Password-Protected Computers via the USB Port

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/11/hacking_passwor.html

PoisonTap is an impressive hacking tool that can compromise computers via the USB port, even when they are password-protected. What’s interesting is the chain of vulnerabilities the tool exploits. No individual vulnerability is a problem, but together they create a big problem.

Kamkar’s trick works by chaining together a long, complex series of seemingly innocuous software security oversights that only together add up to a full-blown threat. When PoisonTap — a tiny $5 Raspberry Pi microcomputer loaded with Kamkar’s code and attached to a USB adapter — is plugged into a computer’s USB drive, it starts impersonating a new ethernet connection. Even if the computer is already connected to Wifi, PoisonTap is programmed to tell the victim’s computer that any IP address accessed through that connection is actually on the computer’s local network rather than the internet, fooling the machine into prioritizing its network connection to PoisonTap over that of the Wifi network.

With that interception point established, the malicious USB device waits for any request from the user’s browser for new web content; if you leave your browser open when you walk away from your machine, chances are there’s at least one tab in your browser that’s still periodically loading new bits of HTTP data like ads or news updates. When PoisonTap sees that request, it spoofs a response and feeds your browser its own payload: a page that contains a collection of iframes — a technique for invisibly loading content from one website inside another — that consist of carefully crafted versions of virtually every popular website address on the internet. (Kamkar pulled his list from web-popularity ranking service Alexa‘s top one million sites.)

As it loads that long list of site addresses, PoisonTap tricks your browser into sharing any cookies it’s stored from visiting them, and writes all of that cookie data to a text file on the USB stick. Sites use cookies to check if a visitor has recently logged into the page, allowing visitors to avoid doing so repeatedly. So that list of cookies allows any hacker who walks away with the PoisonTap and its stored text file to access the user’s accounts on those sites.

There’s more. Here’s another article with more details. Also note that HTTPS is a protection.

Yesterday, I testified about this at a joint hearing of the Subcommittee on Communications and Technology, and the Subcommittee on Commerce, Manufacturing, and Trade — both part of the Committee on Energy and Commerce of the US House of Representatives. Here’s the video; my testimony starts around 1:10:10.

The topic was the Dyn attacks and the Internet of Things. I talked about different market failures that will affect security on the Internet of Things. One of them was this problem of emergent vulnerabilities. I worry that as we continue to connect things to the Internet, we’re going to be seeing a lot of these sorts of attacks: chains of tiny vulnerabilities that combine into a massive security risk. It’ll be hard to defend against these types of attacks. If no one product or process is to blame, no one has responsibility to fix the problem. So I gave a mostly Republican audience a pro-regulation message. They were surprisingly polite and receptive.

New – HTTP/2 Support for CloudFront

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-http2-support-for-cloudfront/

When I interview a candidate for a technical position, I often ask them to explain what happens when they see an interesting link and decide to click on it. I encourage them to go into as much detail as they would like. The answers let me know how well they understand and can explain a complex concept. Some candidates will sum up the entire process in a sentence or two. Others have filled an entire whiteboard with detailed diagrams. At a very simple level, here are the steps that I like to see (If you know much about HTTP, you know that I have failed to mention the SSL handshake, cookies, response codes, caches, content distribution networks, and all sorts of other details):

  1. The domain name is converted to an IP address by way of a DNS lookup.
  2. A TCP connection is made to the remote server.
  3. A GET request is issued.
  4. The remote server locates (or generates) the desired content and returns it to fulfill the request.
  5. The TCP connection is closed.
  6. The client processes and displays the result.

A complex web page might contain references to scripts, style sheets, images, and so forth. If this is the case, the entire sequence must be repeated for each reference. On mobile devices, each connection request wakes up the radio, adding hundreds or thousands of milliseconds of overhead (read about Application Network Latency Overhead to learn more).

Amazon CloudFront is a global content distribution network, or CDN. Several of CloudFront’s features help to make the process above more efficient. For example, it caches frequently used content in dozens of edge locations scattered across the planet. This allows CloudFront to respond to requests more quickly and over a shorter distance, thereby improving application performance. With no minimum usage commitments, CloudFront can help you to deliver web sites, videos, and other types of content to your end users in an economical and expeditious way.

New HTTP/2 Support
The retrieval process that I described above contains a lot of room for improvement. Repeated DNS lookups are avoidable, as are TCP connection requests. HTTP/2, a new version of the HTTP protocol, streamlines the process by reusing the TCP connection if possible. This core feature, combined with many other changes to the existing HTTP model, has the potential to reduce latency and to improve the performance of all types of web applications.

Today we are launching HTTP/2 support for CloudFront. You can enable it on a per-distribution basis today and your HTTP/2-aware clients and applications will start to make use of it right away. While HTTP/2 does not mandate the use of encryption, it turns out that all of the common web browsers require the use of HTTPS connections in conjunction with HTTP/2. Therefore, you may need to make some changes to your site or application in order to take full advantage of HTTP/2. Due to the (fairly clever) way in which HTTP/2 works, older clients remain compatible with web endpoints that do support it.

The connection from CloudFront back to your origin server is still made using HTTP/1. You don’t need to make any server-side changes in order to make your static or dynamic content accessible via HTTP/2.

Several AWS customers have already been testing CloudFront’s HTTP/2 support and have seen clear performance improvements. Marfeel is an ad tech platform that helps publishers to create, optimize, and monetize mobile web sites. They told us that CloudFront’s HTTP/2 support has reduced their first-render time by 17%. This allows the sites that they create to consistently load within 0.8 seconds, making them more accessible to mobile readers.

Enabling HTTP/2
To enable HTTP/2 for an existing CloudFront distribution, simply open up the CloudFront Console, locate the distribution, and click on Edit. Then change the Supported HTTP Versions to include HTTP/2:

The change will be effective within minutes and your users should start to see the benefits shortly thereafter. As I noted earlier, HTTP/2 must be used in conjunction with HTTPS. You can use your browser’s developer tools to verify that HTTP/2 is being used. Here’s what I see when I use the Network tool in Firefox:

You can also add HTTP/2 support to Curl and test from the command line:

$ curl --http2 -I https://d25c7x5dprwhn6.cloudfront.net/images/amazon_fulfilment_center_phoenix.jpg
HTTP/2.0 200
content-type:image/jpeg
content-length:650136
date:Sun, 04 Sep 2016 23:32:39 GMT
last-modified:Sat, 03 Sep 2016 15:21:01 GMT
etag:"b95a82b8df7373895a44a01a3c1e6f8d"
x-amz-version-id:fgWz_QaWo_.4GF7_VOl0gkBwnOmOALz6
accept-ranges:bytes
server:AmazonS3
age:644
x-cache:Hit from cloudfront
via:1.1 91e54ea7c5cc54f4a3500c72b19a2a23.cloudfront.net (CloudFront)
x-amz-cf-id:Dr_A3emW7OdxWfs3O0lDZfiBFL7loKMFCP9XC0_FYmCkeRuyXcu5BQ==
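
If you prefer to flip the setting from code rather than the console, a hedged boto3 sketch looks like this (the distribution ID is a placeholder):

import boto3

cloudfront = boto3.client("cloudfront")
DISTRIBUTION_ID = "EDFDVBD6EXAMPLE"  # placeholder: your distribution ID

# Read the current configuration, switch on HTTP/2, and write it back.
response = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
config = response["DistributionConfig"]
config["HttpVersion"] = "http2"  # serves both HTTP/1.1 and HTTP/2 clients

cloudfront.update_distribution(
    Id=DISTRIBUTION_ID,
    DistributionConfig=config,
    IfMatch=response["ETag"],
)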

Learning More
Here are some resources that you can use to learn more about HTTP/2 and how it can benefit your application:

Available Now
This feature is available now and you can start using it today at no extra charge.


Jeff;

 

Raspberry Pi as retail product display

Post Syndicated from Matt Richardson original https://www.raspberrypi.org/blog/raspberry-pi-as-retail-product-display/

Digitec Galaxus is an electronics retailer in Switzerland. Among other things, they sell Raspberry Pis and related accessories, including our official 7” Touch Display. Many of their customers likely noticed that they haven’t had the Touch Display in stock recently, but there’s an interesting reason.

381A9940_kl21

The retailer wanted to replace their tablet-based digital product labels with something more robust, so they turned to Raspberry Pi 2 with the 7” Touch Display. Each store has 105 screens, which means that the staff of Digitec Galaxus assembled 840 custom Pi-based digital product labels. The screens enable their customers to view up-to-date product information, price, and product ratings from their community as they look at the product up-close.

To pull this off, the engineering team used Raspbian Jessie Lite and installed Chromium. They wrote a startup script which launches Chromium in kiosk mode and handles adjusting the display’s backlight. The browser loads a local HTML page and uses JavaScript to download the most up-to-date content using an AJAX call. When a keyboard is connected, the staff can set the parameters for the display, which are stored as cookies in the browser. For good measure, the team also introduced many levels of fault tolerance into their design. Just as one example, the boot script starts Chromium in a loop to ensure that it will be relaunched automatically if it crashes. It can also handle sudden loss of power and network connectivity issues.

IMG_810121

Whether it’s a young person’s learning computer, the brains of a DIY home automation project, or a node in a factory sensor network, we beam with pride when we see our little computer being used in so many different ways. This project in particular is a great example of how those who sell Raspberry Pi products can harness Pi’s power for their own operations.

The post Raspberry Pi as retail product display appeared first on Raspberry Pi.

Python FAQ: How do I port to Python 3?

Post Syndicated from Eevee original https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/

Part of my Python FAQ, which is doomed to never be finished.

Maybe you have a Python 2 codebase. Maybe you’d like to make it work with Python 3. Maybe you really wish someone would write a comically long article on how to make that happen.

I have good news! You’re already reading one.

(And if you’re not sure why you’d want to use Python 3 in the first place, perhaps you’d be interested in the companion article which delves into exactly that question?)

Don’t be intimidated

This article is quite long, but don’t take that as a sign that this is necessarily a Herculean task. I’m trying to cover every issue I can ever recall running across, which means a lot of small gotchas.

I’ve ported several codebases from Python 2 to Python 2+3, and most of them have gone pretty smoothly. If you have modern Python 2 code that handles Unicode responsibly, you’re already halfway there.

However… if you still haven’t ported by now, almost eight years after Python 3.0 was first released, chances are you have either a lumbering giant of an app or ancient and weird 2.2-era code. Or, perish the thought, a lumbering giant consisting largely of weird 2.2-era code. In that case, you’ll want to clean up the more obvious issues one at a time, then go back and start worrying about actually running parts of your code on Python 3.

On the other hand, if your Python 2 code is pretty small and you’ve just never gotten around to porting, good news! It’s not that bad, and much of the work can be done automatically. Python 3 is ultimately the same language as Python 2, just with some sharp bits filed off.

Making some tough decisions

We say “porting from 2 to 3”, but what we usually mean is “porting code from 2 to both 2 and 3”. That ends up being more difficult (and ugly), since rather than writing either 2 or 3, you have to write the common subset of 2 and 3. As nifty as some of the features in 3 are, you can’t actually use any of them if you have to remain compatible with Python 2.

The first thing you need to do, then, is decide exactly which versions of Python you’re targeting. For 2, your options are:

  • Python 2.5+ is possible, but very difficult, and this post doesn’t really discuss it. Even something as simple as exception handling becomes painful, because the only syntax that works in Python 3 was first introduced in Python 2.6. I wouldn’t recommend doing this.

  • Python 2.6+ used to be fairly common, and is well-tread ground. However, Python 2.6 reached end-of-life in 2013, and some common libraries have been dropping support for it. If you want to preserve Python 2.6 compatibility for the sake of making a library more widely-available, well, I’d urge you to reconsider. If you want to preserve Python 2.6 compatibility because you’re running a proprietary app on it, you should stop reading this right now and go upgrade to 2.7 already.

  • Python 2.7 is the last release of the Python 2 series, but is guaranteed to be supported until at least 2020. The major focus of the release was backporting a lot of minor Python 3 features, making it the best possible target for code that’s meant to run on both 2 and 3.

  • There is, of course, also the choice of dropping Python 2 support, in which case this process will be much easier. Python 2 is still very widely-used, though, so library authors probably won’t want to do this. App authors do have the option, but unless your app is trivial, it’s much easier to maintain Python 2 support during the port — that way you can port iteratively, and the app will still function on Python 2 in the interim, rather than being a 2/3 hybrid that can’t run on either.

Most of this post assumes you’re targeting Python 2.7, though there are mentions of 2.6 as well.

You also have to decide which version of Python 3 to target.

  • Python 3.0 and 3.1 are forgettable. Python 3 was still stabilizing for its first couple minor versions, and from what I hear, compatibility with both 2.7 and 3.0 is a huge pain. Both versions are also past end-of-life.

  • Python 3.2 and 3.3 are a common minimum version to target. Python 3.3 reinstated support for u'...' literals (redundant in Python 3, where normal strings are already Unicode), which makes supporting both 2 and 3 much easier. I bundle it with Python 3.2 because the latest version that stable PyPy supports is 3.2, but it also supports u'...' literals. You’ll support the biggest surface area by targeting that, a sort of 3.2½. (There’s an alpha PyPy supporting 3.3, but as of this writing it’s not released as stable yet.)

  • Python 3.4 and 3.5 add shiny new features, but you can only really use them if you’re dropping support for Python 2. Again, I’d suggest targeting Python 2.7 + Python 3.2½ first, then dropping the Python 2 support and adding whatever later Python 3 trinkets you want.

Another consideration is what attitude you want your final code to take. Do you want Python 2 code with enough band-aids that it also works on Python 3, or Python 3 code that’s carefully written so it still works on Python 2? The differences are subtle! Consider code like x = map(a, b). map returns a list in Python 2, but a lazy iterable in Python 3. Which way do you want to port this code?

# Python 2 style: force eager evaluation, even on Python 3
x = list(map(a, b))

# Python 3 style: use lazy evaluation, even on Python 2
try:
    from future_builtins import map
except ImportError:
    pass
x = map(a, b)

The answer may depend on which Python you primarily use for development, your target audience, or even case-by-case based on how x is used.

Personally, I’d err on the side of preserving Python 3 semantics and porting them to Python 2 when possible. I’m pretty used to Python 3, though, and you or your team might be thrown for a loop by changing Python 2’s behavior.

At the very least, prefer if PY2 to if not PY3. The former stresses that Python 2 is the special case, which is increasingly true going forward. Eventually there’ll be a Python 4, and perhaps even a Python 5, and those future versions will want the “Python 3” behavior.
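
As a small illustration of that attitude (using six purely as an example; six.text_type already exists for this exact case, so the helper below is just a sketch):

import six

if six.PY2:
    # Python 2 is the special case; everywhere else, assume Python 3 semantics.
    text_type = unicode  # noqa: F821 -- only defined on Python 2
else:
    text_type = str

def to_text(value, encoding="utf-8"):
    """Coerce bytes to text the same way on both 2 and 3."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)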

Some helpful tools

The good news is that you don’t have to do all of this manually.

2to3 is a standard library module (since 2.6) that automatically modifies Python 2 source code to change some common Python 2 constructs to the Python 3 equivalent. (It also doubles as a framework for making arbitrary changes to Python code.)

Unfortunately, it ports 2 to 3, not 2 to 2+3. For libraries, it’s possible to rig 2to3 to run automatically on your code just before it’s installed on Python 3, so you can keep writing Python 2 code — but 2to3 isn’t perfect, and this makes it impossible to develop with your library on Python 3, so Python 3 ends up as a second-class citizen. I wouldn’t recommend it.

The more common approach is to use something like six, a library that wraps many of the runtime differences between 2 and 3, so you can run the same codebase on both 2 and 3.

Of course, that still leaves you making the changes yourself. A more recent innovation is the python-future project, which combines both of the above. It has a future library of renames and backports of Python 3 functionality that goes further than six and is designed to let you write Python 3-esque code that still runs on Python 2. It also includes a futurize script, based on the 2to3 plumbing, that rewrites your code to target 2+3 (using python-future’s library) rather than just 3.

The nice thing about python-future is that it explicitly takes the stance of writing code against Python 3 semantics and backporting them to Python 2. It’s very dedicated to this: it has a future.builtins module that includes not only easy cases like map, but also entire pure-Python reimplementations of types like bytes. (Naturally, this adds some significant overhead as well.) I do like the overall attitude, but I’m not totally sold on all the changes, and you might want to leaf through them to see which ones you like.

futurize isn’t perfect, but it’s probably the best starting point. The 2to3 design splits the various edits into a variety of “fixers” that each make a single style of change, and futurize works the same way, inheriting many of the fixers from 2to3. The nice thing about futurize is that it groups the fixers into “stages”, where stage 1 (futurize --stage1) only makes fairly straightforward changes, like fixing the except syntax. More importantly, it doesn’t add any dependencies on the future library, so it’s useful for making the easy changes even if you’d prefer to use six. You’re also free to choose individual fixes to apply, if you discover that some particular change breaks your code.

Another advantage of this approach is that you can tackle the porting piecemeal, which is great for very large projects. Run one fixer at a time, starting with the very simple ones like updating to except ... as ... syntax, and convince yourself that everything is fine before you do the next one. You can make some serious strides towards 3 compatibility just by eliminating behavior that already has cromulent alternatives in Python 2.

If you expect your Python 3 port to take a very long time — say, if you have a large project with numerous developers and a frantic release schedule — then you might want to prevent older syntax from creeping in with a tool like autopep8, which can automatically fix some deprecated features with a much lighter touch. If you’d like to automatically enforce that, say, from __future__ import absolute_import is at the top of every Python file, that’s a bit beyond the scope of this article, but I’ve had pre-commit + reorder_python_imports thrust upon me in the past to fairly good effect.

Anyway! For each of the issues below, I’ll mention whether futurize can fix it, the name of the responsible fixer, and whether six has anything relevant. If the name of the fixer begins with lib2to3, that means it’s part of the standard library, and you can use it with 2to3 without installing python-future.

Here we go!

Things you shouldn’t even be doing

These are ancient, ancient practices, and even Python 2 programmers may be surprised by them. Some of them are arguably outright bugs in the language; others are just old and forgotten. They generally have equivalents that work even in older versions of Python 2.

Old-style classes

class Foo:
    ...

In Python 3, this code creates a class that inherits from object. In Python 2, it creates a completely different kind of thing entirely: an “old-style” class, which worked a little differently from built-in types. The differences are generally subtle:

  • Old-style classes don’t support __getattribute__ or __slots__.

  • Old-style classes don’t correctly support data descriptors, i.e. the assignment behavior of @property.

  • Old-style classes had a __coerce__ method, which would attempt to turn a value into a built-in numeric type before performing a math operation.

  • Old-style classes didn’t use the C3 MRO, so in the case of diamond inheritance, a class could be skipped entirely by super().

  • Old-style instances check the instance for a special method name; new-style instances check the type. Additionally, if a special method isn’t found on an old-style instance, the lookup falls back to __getattr__; this is not the case for new-style classes (which makes proxying more complicated).

That last one is the only thing old-style classes can do that new-style classes cannot, and if you’re relying on it, you have a bit of refactoring to do. (The really curious thing is that there doesn’t seem to be a particularly good reason for the limitation on new-style classes, and it doesn’t even make things faster. Maybe that’ll be fixed in Python 4?)

If you have no idea what any of that means or why you should care, chances are you’re either not using old-style classes at all, or you’re only using them because you forgot to write (object) somewhere. In that case, futurize --stage2 will happily change class Foo: to class Foo(object): for you, using the libpasteurize.fixes.fix_newstyle fixer. (Strictly speaking, this is a Python 2 compatibility issue, since the old syntax still works fine in Python 3 — it just means something else now.)

cmp

Python 2 originally used the C approach for sorting. Given two things A and B, a comparison would produce a negative number if A < B, zero if A == B, and a positive number if A > B. This was the only way to customize sorting; there’s a cmp() built-in function, a __cmp__ special method, and cmp arguments to list.sort() and sorted().

This is a little cumbersome, as you may have noticed if you’ve ever tried to do custom sorting in Perl or JavaScript. Even a case-insensitive sort involves repeating yourself. Most custom sorts will have the same basic structure of cmp(op(a), op(b)), when the only thing you really care about is op.

names.sort(cmp=lambda a, b: cmp(a.lower(), b.lower()))

But more importantly, the C approach is flat-out wrong for some types. Consider sets, which use comparison to indicate subsets versus supersets:

{1, 2} < {1, 2, 3}  # True
{1, 2, 3} > {1, 2}  # True
{1, 2} < {1, 2}  # False
{1, 2} <= {1, 2}  # True

So what to do with {1, 2} < {3, 4}, where none of the three possible answers is correct?

Early versions of Python 2 added “rich comparisons”, which introduced methods for all six possible comparisons: __eq__, __ne__, __lt__, __le__, __gt__, and __ge__. You’re free to return False for all six, or even True for all six, or return NotImplemented to allow deferring to the other operand. The cmp argument became key instead, which allows mapping the original values to a different item to use for comparison:

names.sort(key=lambda a: a.lower())

(This is faster, too, since there are fewer calls to the lambda, fewer calls to .lower(), and no calls to cmp.)


So, fixing all this. Luckily, Python 2 supports all of the new stuff, so you don’t need compatibility hacks.

To replace simple implementations of __cmp__, you need only write the appropriate rich comparison methods. You could even do this the obvious way:

class Foo(object):
    def __cmp__(self, other):
        return cmp(self.prop, other.prop)

    def __eq__(self, other):
        return self.__cmp__(other) == 0

    def __ne__(self, other):
        return self.__cmp__(other) != 0

    def __lt__(self, other):
        return self.__cmp__(other) < 0

    ...

You would also have to change the use of cmp to a manual if tree, since cmp is gone in Python 3. I don’t recommend this.

A lazier alternative would be to use functools.total_ordering (backported from 3.0 into 2.7), which generates four of the comparison methods, given a class that implements __eq__ and one other:

@functools.total_ordering
class Foo(object):
    def __eq__(self, other):
        return self.prop == other.prop

    def __lt__(self, other):
        return self.prop < other.prop

There are a couple problems with this code. For one, it’s still pretty repetitive, accessing .prop four times (and imagine if you wanted to compare several properties). For another, it’ll either cause an error or do entirely the wrong thing if you happen to compare with an object of a different type. You should return NotImplemented in this case, but total_ordering doesn’t handle that correctly until Python 3.4. If those bother you, you might enjoy my own classtools.keyed_ordering, which uses a __key__ method (much like the key argument) to generate all six methods:

@classtools.keyed_ordering
class Foo(object):
    def __key__(self):
        return self.prop

Replacing uses of cmp arguments should be straightforward: a cmp argument of cmp(op(a), op(b)) becomes a key argument of op. If you’re doing something more elaborate, there’s a functools.cmp_to_key function (also backported from 3.0 to 2.7), which converts a cmp function to one usable as a key. (The implementation is much like the first Foo example above: it involves a class that calls the wrapped function from its comparison methods, and returns True or False depending on the return value.)
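
For example, a quick sketch of wrapping an old-style comparison function:

import functools

def compare_lengths(a, b):
    # An old-style cmp function: negative, zero, or positive.
    return len(a) - len(b)

words = ["pear", "fig", "banana"]
words.sort(key=functools.cmp_to_key(compare_lengths))
# words is now ['fig', 'pear', 'banana']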

Finally, if you’re using cmp directly, don’t do that. If you really, really need it for something other than Python’s own sorting, just use an if.

The only help futurize offers is in futurize --stage2, via libfuturize.fixes.fix_cmp, which adds an import of past.builtins.cmp if it detects you’re using the cmp function anywhere.

Comparing incompatible types

Python 2’s use of C-style ordering also means that any two objects, of any types, must be either equal or occur in some defined order. Python’s answer to this problem is to sort on the names of the types. So None < 3 < "1", because "NoneType" < "int" < "str".

Python 3 removes this fallback rule; if two values don’t know how to compare against each other (i.e. both return NotImplemented), you just get a TypeError.

This might affect you in subtle ways, such as if you’re sorting a list of objects that may contain Nones and expecting it to silently work. The fix depends entirely on the type of data you have, and no automated tool can handle that for you. Most likely, you didn’t mean to be sorting a heterogeneous list in the first place.

Of course, you could always sort on type(x).__name__, but I don’t know why you would do that.
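
If you really do need to sort a list that mixes None with comparable values, one common trick is a key that sends the Nones to the end, for example:

values = [3, None, 1, None, 2]

# Tuples compare element by element, so every None sorts after every number.
values.sort(key=lambda x: (x is None, x))
# values is now [1, 2, 3, None, None]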

The sets module

Python 2.3 introduced its set types as Set and ImmutableSet in the sets module. Since Python 2.4, they’ve been built-in types, set and frozenset. The sets module is gone in Python 3, so just use the built-in names.

Creating exceptions

Python 2 allows you to do this:

raise RuntimeError, "an error happened at runtime!!"

There’s not really any good reason to do this, since you can just as well do:

raise RuntimeError("an error happened at runtime!!")

futurize --stage1 will rewrite the two-arg form to a regular object creation via the libfuturize.fixes.fix_raise fixer. It’ll also fix this alternative way of specifying an exception type, which is so bizarre and obscure that I did not know about it until I read the fixer’s source code:

raise (((A, B), C), ...)  # equivalent to `raise A` (?!)

Additionally, exceptions act like sequences in Python 2, but not in Python 3. You can just operate on the .args sequence directly, in either version. Alas, there’s no automated way to fix this.
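
In other words, index the exception’s args rather than the exception itself; a tiny sketch that behaves the same on 2 and 3:

try:
    int("not a number")
except ValueError as e:
    message = e.args[0]  # e[0] only works on Python 2; e.args[0] works on both
    print(message)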

Backticks

Did you know that `x` is equivalent to repr(x) in Python 2? Yeah, most people don’t. It’s super weird. futurize --stage1 will fix this with the lib2to3.fixes.fix_repr fixer.

has_key

Very old code may still be using somedict.has_key("foo"). "foo" in somedict has worked since Python 2.2. What are you doing. futurize --stage1 will fix this with the lib2to3.fixes.fix_has_key fixer.

<>

<> is equivalent to != in Python 2! This is an ancient, ancient holdover, and there’s no reason to still be using it. futurize --stage1 will fix this with the lib2to3.fixes.fix_ne fixer.

(You could also use from __future__ import barry_as_FLUFL, which restores <> in Python 3. It’s an easter egg. I’m joking. Please don’t actually do this.)

Things with easy Python 2 equivalents

These aren’t necessarily ancient, but they have an alternative you can just as well express in Python 2, so there’s no need to juggle 2 and 3.

Other ancient builtins

apply() is gone. Use the built-in syntax, f(*args, **kwargs).

callable() was briefly gone, but then came back in Python 3.2.

coerce() is gone; it was only used for old-style classes.

execfile() is gone. Read the file and pass its contents to exec() instead.

file() is gone; Python 3 has multiple file types, and a hierarchy of interfaces defined in the io module. Occasionally, code uses this as a synonym for open(), but you should really be using open() anyway.

intern() has been moved into the sys module, though I have no earthly idea why you’d be using it.

raw_input() has been renamed to input(), and the old ludicrous input() is gone. If you really need input(), please stop.

reduce() has been moved into the functools module, but it’s there in Python 2.6 as well.

reload() has been moved into the imp module. It’s unreliable garbage and you shouldn’t be using it anyway.

futurize --stage1 can fix several of these:

  • apply, via lib2to3.fixes.fix_apply
  • intern, via lib2to3.fixes.fix_intern
  • reduce, via lib2to3.fixes.fix_reduce

futurize --stage2 can also fix execfile via the libfuturize.fixes.fix_execfile fixer, which imports past.builtins.execfile. The 2to3 fixer uses an open() call, but the true correct fix is to use a with block.

futurize --stage2 has a couple of fixers for raw_input, but you can just as well import future.builtins.input or six.moves.input.

Nothing can fix coerce, which has no equivalent. Curiously, I don’t see a fixer for file, which is trivially fixed by replacing it with open. Nothing for reload, either.

Catching exceptions

Historically, the way to say “if there’s a ValueError, store it in e and run some code” was:

try:
    ...
except ValueError, e:
    ...

Unfortunately, that’s very easy to confuse with the syntax for catching two different types of exception:

except (ValueError, TypeError):
    ...

If you forget the parentheses, you’ll only catch ValueError, and the exception will be assigned to a variable called, er, TypeError. Whoops!

Python 3.0 introduced clearer syntax, which was also backported to Python 2.6:

except ValueError as e:
    ...

Python 3.0 finally removed the old syntax, so you must use the as form. futurize --stage1 will fix this with the lib2to3.fixes.fix_except fixer.

As an additional wrinkle, the extra variable e is deleted at the end of the block in Python 3, but not in Python 2. If you really need to refer to it after the block, just assign it to a different name.

(The reason for this is that captured exceptions contain a traceback in Python 3, and tracebacks contain the locals for the current frame, and those locals will contain the captured exception. The resulting cycle would keep all local variables alive until the cycle detector dealt with it, at least in CPython. Scrapping the exception as soon as it’s been dealt with was a simple way to keep this from accidentally happening all over the place. It usually doesn’t make sense to refer to a captured exception after the except block, anyway, since the variable may or may not even exist, and that’s generally weird and bad in Python.)
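
If you really do need the exception after the block, a sketch of the rename mentioned above (do_something and handle_error are hypothetical placeholders):

error = None
try:
    do_something()
except ValueError as e:
    error = e            # `e` itself is unbound once the except block ends on Python 3
if error is not None:
    handle_error(error)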

Octals

It’s not uncommon for a new programmer to try to zero-pad a set of numbers:

a = 07
b = 08
c = 09
d = 10

Of course, this will have the rather bizarre result that 08 is a SyntaxError, even though 07 works fine — because numbers starting with a 0 are parsed as octal.

This is a holdover from C, and it’s fairly surprising, since there’s virtually no reason to ever use octal. The only time I can ever remember using it is for passing file modes to chmod.

Python 3.0 requires octal literals to be prefixed with 0o, in line with 0x for hex and 0b for binary; literal integers starting with only a 0 are a syntax error. Python 2.6 supports both forms.

futurize --stage1 will fix this with the lib2to3.fixes.fix_numliterals fixer.

pickle

If you’re using the pickle module (which you shouldn’t be), and you intend to pass pickles back and forth between Python 2 and Python 3, there’s a small issue to be aware of. pickle has several different “protocol” versions, and the default version used in Python 3 is protocol 3, which Python 2 cannot read.

The fix is simple: just find where you’re calling pickle.dump() or pickle.dumps(), and pass a protocol argument of 2. Protocol 2 is the highest version supported by Python 2, and you probably want to be using it anyway, since it’s much more compact and faster to read/write than Python 2’s default, protocol 0.

You may be already using HIGHEST_PROTOCOL, but you’ll have the same problem: the highest protocol supported in any version of Python 3 is unreadable by Python 2.
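
In other words, if your pickles have to cross the version boundary, pin the protocol explicitly; a quick sketch:

import pickle

data = {'answer': 42}
blob = pickle.dumps(data, protocol=2)   # the highest protocol Python 2 can read
restored = pickle.loads(blob)           # works on either version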


A somewhat bigger problem is that if you pickle an instance of a user-defined class on Python 2, the pickle will record all its attributes as bytestrings, because that’s what they are in Python 2. Python 3 will then dutifully load the pickle and populate your object’s __dict__ with keys like b'foo'. obj.foo will then not actually exist, because obj.foo looks for the string 'foo', and 'foo' != b'foo' in Python 3.

Don’t use pickle, kids.

It’s possible to fix this, but also a huge pain in the ass. If you don’t know how, you definitely shouldn’t be using pickle.

Things that have a __future__ import

Occasionally, the syntax changed in an incompatible way, but the new syntax was still backported and hidden behind a __future__ import — Python’s mechanism for opting into syntax changes. You have to put such an import at the top of the file, optionally after a docstring, like this:

"""My super important module."""
from __future__ import with_statement

print is now a function

In Python 3 the print statement is gone; print is an ordinary built-in function, which means parentheses.

Ugh! Parentheses! Why, Guido, why?

The reason is that the print statement has incredibly goofy syntax, unlike anything else in the language:

print >>a, b, c,

You might not even recognize the >> bit, but it lets you print to a file other than sys.stdout. It’s baked specifically into the print syntax. Python 3 replaces this with a straightforward built-in function with a couple extra bells and whistles. The above would be written:

print(b, c, end='', file=a)

It’s slightly more verbose, but it’s also easier to tell what’s going on, and that teeny little comma at the end is now a more obvious keyword argument.

from __future__ import print_function will forget about the print statement for the rest of the file, and make the builtin print function available instead. futurize --stage1 will fix all uses of print and add the __future__ import, with the libfuturize.fixes.fix_print_with_import fixer. (There’s also a 2to3 fixer, but it doesn’t add the __future__ import, since it’s unnecessary in Python 3.)

A word of warning: do not just use print with parentheses without adding the __future__ import. This may appear to work in stock Python 2:

print("See, what's the problem?  This works fine!")

However, that’s parsed as the print statement followed by an expression in parentheses. It becomes more obvious if you try to print two values:

print("The answer is:", 3)
# ("The answer is:", 3)

Now you have a comma inside parentheses, which is a tuple, so the old print statement prints its repr.
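
With the __future__ import in place, the parenthesized form behaves the same on both versions:

from __future__ import print_function

print("The answer is:", 3)          # The answer is: 3, on Python 2 and 3 alike
print("no newline here", end='')    # keyword arguments now work on Python 2 too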

Division always produces a float

Quick, what’s the answer here?

5 / 2

If you’re a normal human being, you’ll say 2.5 or 2½. Unfortunately, if you’re like Python and have been afflicted by C, you might say the answer is 2, because this is “integer division” — a bizarre and alien concept probably invented because CPUs didn’t have FPUs when C was first invented.

Python 3.0 decided that maybe contorting fundamental arithmetic to match the inadequacies of 1970s hardware is not the best idea, and so it changed division to always produce a float.

Since Python 2.6, from __future__ import division will alter the division operator to always do true division. If you want to do floor division, there’s a separate // operator, which has existed for ages; you can use it in Python 2 with or without the __future__ import.

Note that true division always produces a float, even if the result is integral: 6 / 3 is 2.0. On the other hand, floor division uses the same typing rules as C-style division: 5 // 2 is 2, but 5 // 2.0 is 2.0.
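
To make that concrete, here is how the operators behave once the __future__ import is in effect (matching Python 3):

from __future__ import division

5 / 2     # 2.5
6 / 3     # 2.0, true division always produces a float
5 // 2    # 2
5 // 2.0  # 2.0, floor division keeps the C-style typing rules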

futurize --stage2 will “fix” this with the libfuturize.fixes.fix_division fixer, but unfortunately that just adds the __future__ import. With the --conservative option, it uses the libfuturize.fixes.fix_division_safe fixer instead, which imports past.utils.old_div, a forward-port of Python 2’s division operator.

The trouble here is that the new / always produces a float, and the new // always floors, but the old / sometimes did one and sometimes did the other. futurize can’t just replace all uses of / with //, because 5/2.0 is 2.5 but 5//2.0 is 2.0, and it can’t generally know what types the operands are.

You might be best off fixing this one manually — perhaps using fix_division_safe to find all the places you do division, then changing them to use the right operator.

Of course, the __div__ magic method is gone in Python 3, replaced explicitly by __floordiv__ (//) and __truediv__ (/). Both of those methods already exist in Python 2, and __truediv__ is even called when you use / in the presence of the future import, so being compatible is a simple matter of implementing all three and deferring to one of the others from __div__.
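
A sketch of that pattern for a hypothetical number-ish class:

from __future__ import division

class Money(object):
    def __init__(self, cents):
        self.cents = cents

    def __truediv__(self, other):      # used by / on Python 3 (and on 2 with the future import)
        return Money(self.cents / other)

    def __floordiv__(self, other):     # used by // everywhere
        return Money(self.cents // other)

    __div__ = __truediv__              # used by plain / on Python 2 without the future import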

Relative imports

In Python 2, if you’re in the module foo.bar and say import quux, Python will look for a foo.quux before it looks for a top-level quux. The former behavior is called a relative import, though it might be more clearly called a sibling import. It’s troublesome for several reasons.

  • If you have a sibling called quux, and there’s also a top-level or standard library module called quux, you can’t import the latter. (There used to be a py.std module for providing indirect access to the standard library, for this very reason!)

  • If you import the top-level quux module, and then later add a foo.quux module, you’ll suddenly be importing a different module.

  • When reading the source code, it’s not clear which imports are siblings and which are top-level. In fact, the modules you get depend on the module you’re in, so moving or renaming a file may change its imports in non-obvious ways.

Python 3 eliminates this behavior: import quux always means the top-level module. It also adds syntax for “explicit relative” or “absolute relative” (yikes) imports: from . import quux or from .quux import somefunc explicitly means to look for a sibling named quux. (You can also use ..quux to look in the parent package, three dots to look in the grandparent, etc.)

The explicit syntax is supported since Python 2.5. The old sibling behavior can be disabled since Python 2.5 with from __future__ import absolute_import.
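
So inside a package, a portable sibling import might look like this (a sketch; foo/bar.py importing its sibling foo/quux.py):

# foo/bar.py
from __future__ import absolute_import   # disable the implicit sibling behavior on Python 2

from . import quux                        # explicit sibling import; works on 2.5+ and 3
from .quux import somefunc
import json                               # now unambiguously the top-level stdlib module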

futurize --stage1 has a libfuturize.fixes.fix_absolute_import fixer, which attempts to detect sibling imports and convert them to explicit relative imports. If it finds any sibling imports, it’ll also add the __future__ line, though honestly you should make an effort to put that line in all of your Python 2 code.

It’s possible for the futurize fixer to guess wrong about a sibling import, but in general it works pretty well.

(There is one case I’ve run across where simply replacing import sibling with from . import sibling didn’t work. Unfortunately, it was Yelp code that I no longer have access to, and I can’t remember the precise details. It involved having several sibling imports inside a __init__.py, where the siblings also imported from each other in complex ways. The sibling imports worked, but the explicit relative imports failed, for some really obscure timing reason. It’s even possible this was a 2.6 bug that’s been fixed in 2.7. If you see it, please let me know!)

Things that require some effort

These problems are a little more obscure, but many of them are also more difficult to fix automatically. If you have a massive codebase, these are where the problems start to appear.

The grand module shuffle

A whole bunch of modules were deleted, merged, or removed. A full list is in PEP 3108, but you’ll never have heard of most of them. Here are the ones that might affect you.

  • __builtin__ has been renamed to builtins. Note that this is a module, not the __builtins__ attribute of modules, which is exactly why it was renamed. Incidentally, you should be using the builtins module rather than __builtins__ anyway. Or, wait, no, just don’t use either, please don’t mess with the built-in scope.

  • ConfigParser has been renamed to configparser.

  • Queue has been renamed to queue.

  • SocketServer has been renamed to socketserver.

  • cStringIO and StringIO are gone; instead, use StringIO or BytesIO from the io module. Note that these also exist in Python 2, but are pure-Python rather than the C versions in current Python 3.

  • cPickle is gone. Importing pickle in Python 3 now gives you the C implementation automatically.

  • cProfile is gone. Importing profile in Python 3 gives you the C implementation automatically.

  • copy_reg has been renamed to copyreg.

  • anydbm, dbhash, dbm, dumbdbm, gdbm, and whichdb have all been merged into a dbm package.

  • dummy_thread has become _dummy_thread. It’s an implementation of the _thread module that doesn’t actually do any threading. You should be using dummy_threading instead, I guess?

  • httplib has become http.client. BaseHTTPServer, CGIHTTPServer, and SimpleHTTPServer have been merged into a single http.server module. Cookie has become http.cookies. cookielib has become http.cookiejar.

  • repr has been renamed to reprlib. (The module, not the built-in function.)

  • thread has been renamed to _thread, and you should really be using the threading module instead.

  • A whole mess of top-level Tk modules have been combined into a tkinter package.

  • The contents of urllib, urllib2, and urlparse have been consolidated and then split into urllib.error, urllib.parse, and urllib.request.

  • xmlrpclib has become xmlrpc.client. DocXMLRPCServer and SimpleXMLRPCServer have been merged into xmlrpc.server.

futurize --stage2 will fix this with the somewhat invasive libfuturize.fixes.fix_future_standard_library fixer, which uses a mechanism from future that adds aliases to Python 2 to make all the Python 3 standard library names work. It’s an interesting idea, but it didn’t actually work for all cases when I tried it (though now I can’t recall what was broken), so YMMV.

Alternatively, you could manually replace any affected imports with imports from six.moves, which provides aliases that work on either version.

Or as a last resort, you can just sprinkle try ... except ImportError around.
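
That last-resort approach looks something like this, using configparser as an example:

try:
    import configparser                     # Python 3
except ImportError:
    import ConfigParser as configparser     # Python 2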

Built-in iterators are now lazy

filter, map, range, and zip are all lazy in Python 3. You can still iterate over their return values (once), but if you have code that expects to be able to index them or traverse them more than once, it’ll break in Python 3. (Well, not range, that’s fine.) The lazy equivalents — xrange and the functions in itertools — are of course gone in Python 3.

In either case, the easiest thing to do is force eager evaluation by wrapping the call in list() or tuple(), which you’ll occasionally need to do in Python 3 regardless.

For the sake of consistency, you may want to import the lazy versions from the standard library future_builtins module. It only exists in Python 2, so be sure to wrap the import in a try.
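
Something along these lines (a sketch):

try:
    from future_builtins import filter, map, zip   # Python 2: use the lazy versions
except ImportError:
    pass                                            # Python 3: the builtins are already lazy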

futurize --stage2 tries to address this with several of lib2to3’s fixers, but the results aren’t particularly pleasing: calls to all four are unconditionally wrapped in list(), even in an obviously safe case like a for block. I’d just look through your uses of them manually.

A more subtle point: if you pass a string or tuple to Python 2’s filter, the return value will be the same type. Blindly wrapping the call in list() will of course change the behavior. Filtering a string is not a particularly common thing to do, but I’ve seen someone complain about it before, so take note.

Also, Python 3’s map stops at the shortest input sequence, whereas Python 2 extends shorter sequences with Nones. You can fix this with itertools.zip_longest (which in Python 2 is izip_longest!), but honestly, I’ve never even seen anyone pass multiple sequences to map.

Relatedly, dict.iteritems (plus its friends, iterkeys and itervalues) is gone in Python 3, as the plain items (plus keys and values) is already lazy. The dict.view* methods are also gone, as they were only backports of Python 3’s normal behavior.

Both six and future.utils contain functions called iteritems, etc., which provide a lazy iterator in both Python 2 and 3. They also offer view* functions, which are closer to the Python 3 behavior, though I can’t say I’ve ever seen anyone actually use dict.viewitems in real code.

Of course, if you explicitly want a list of dictionary keys (or items or values), list(d) and list(d.items()) do the same thing in both versions.
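
If you do want lazy iteration on both versions, the six helpers look like this (a sketch):

from __future__ import print_function
import six

d = {'a': 1, 'b': 2}
for key, value in six.iteritems(d):   # d.iteritems() on Python 2, d.items() on Python 3
    print(key, value)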

buffer is gone

The buffer type has been replaced by memoryview (also in Python 2.7), which is similar but not identical. If you’ve even heard of either of these types, you probably know more about the subtleties involved than I do. There’s a lib2to3.fixes.fix_buffer fixer that blindly replaces buffer with memoryview, but futurize doesn’t use it in either stage.

Several special methods were renamed

Where Python 2 has __str__ and __unicode__, Python 3 has __bytes__ and __str__. The trick is that __str__ should return the native str type for each version: a bytestring for Python 2, but a Unicode string for Python 3. Also, you almost certainly don’t want a __bytes__ method in Python 3, where bytes is no longer used for text.

Both six and python-future have a python_2_unicode_compatible class decorator that tries to do the right thing. You write only a single __str__ method that returns a Unicode string. In Python 3, that’s all you need, so the decorator does nothing; in Python 2, the decorator will rename your method to __unicode__ and add a __str__ that returns the same value encoded as UTF-8. If you need different behavior, you’ll have to roll it yourself with if PY2.
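
Using the decorator looks roughly like this (shown with the six spelling; python-future’s is equivalent):

# -*- coding: utf-8 -*-
import six

@six.python_2_unicode_compatible
class Greeting(object):
    def __str__(self):
        return u'héllo'   # always return text; the decorator supplies __unicode__/__str__ on Python 2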


Python 2’s next method is more appropriately __next__ in Python 3. The easy way to address this is to call your method __next__, then alias it with next = __next__. Be sure you never call it directly as a method, only with the built-in next() function.
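
A sketch of that aliasing for a small iterator class:

class CountDown(object):
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):      # the Python 3 spelling
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n

    next = __next__          # the Python 2 spelling; only call it via next(obj)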

Alternatively, future.builtins contains an alternative next which always calls __next__, but on Python 2, it falls back to trying next if __next__ doesn’t exist.

futurize --stage1 changes all use of obj.next() to next(obj) via the libfuturize.fixes.fix_next_call fixer. futurize --stage2 renames next methods to __next__ via the lib2to3.fixes.fix_next fixer (which also fixes calls). Note that there’s a remote chance of false positives, if for some reason you happened to use next as a regular method name.


Python 2’s __nonzero__ is Python 3’s __bool__. Again, you can just alias it manually. Or futurize --stage2 will rename it with the lib2to3.fixes.fix_nonzero fixer.

Renaming it will of course break it in Python 2, but futurize --stage2 also has a libfuturize.fixes.fix_object fixer that imports python-future’s own builtins.object. The replacement object class has a few methods for making Python 3’s __str__, __next__, and __bool__ work on Python 2.

This is one of the mildly invasive things python-future does, and it may or may not sit well. Up to you.


__long__ is completely gone, as there is no long type in Python 3.

__getslice__, __setslice__, and __delslice__ are gone. Instead, slice objects are passed to __getitem__ and friends. On the off chance you use these, you’ll have to do something clever in the item methods to defer to your slice logic on Python 3.

__oct__ and __hex__ are gone; oct() and hex() now consult __index__. I seriously doubt this will impact anyone.

__div__ is gone, as mentioned previously.

Unbound methods are gone; function attributes renamed

Say you have this useless class.

class Foo(object):
    def bar(self):
        pass

In Python 2, Foo.bar is an “unbound method”, a type that’s generally unseen and unexposed other than as types.MethodType. In Python 3, Foo.bar is just a regular function.

Offhand, I can only think of one time this would matter: if you want to get at attributes on the function, perhaps for the sake of a method decorator. In Python 2, you have to go through the unbound method’s .im_func attribute to get the original function, but in Python 3, you already have the original function and can get the attributes directly.

If you’re doing this anywhere, an easy way to make it work in both versions is:

method = Foo.bar
method = getattr(method, 'im_func', method)

As for bound methods (the objects you get from accessing methods but not calling them, like [].append), the im_self and im_func attributes have been renamed to __self__ and __func__. Happily, these names also work in Python 2.6, so no compatibility hacks are necessary.

im_class is completely gone in Python 3. Methods have no interest in which class they’re attached to. They can’t, since the same function could easily be attached to more than one class. If you’re relying on im_class somehow, for some reason… well, don’t do that, maybe.

Relatedly, the func_* function attributes have been renamed to dunder names in Python 3, since assigning function attributes is a fairly common practice and Python doesn’t like to clog namespaces with its own builtin names. func_closure, func_code, func_defaults, func_dict, func_doc, func_globals, and func_name are now __closure__, __code__, etc. (Note that func_doc and func_name were already aliases for __doc__ and __name__, and func_defaults is much more easily inspected with the inspect module.) The new names are not available in Python 2, so you’ll need to do a getattr dance, or use the get_function_* functions from six.

Metaclass syntax has changed

In Python 2, a metaclass is declared by assigning to a special name in the class body:

class Foo(object):
    __metaclass__ = FooMeta
    ...

Admittedly, this doesn’t make a lot of sense. The metaclass affects how a class is created, and the class body is evaluated as part of that creation, so this is sort of a goofy hack.

Python 3 changed this, opening the door to a few new neat tricks in the process, which you can find out about in the companion article.

class Foo(object, metaclass=FooMeta):
    ...

The catch is finding a way to express this idea in both Python 2 and Python 3 — the old syntax is ignored in Python 3, and the new syntax is a syntax error in Python 2.

It’s a bit of a pain, but the class statement is really just a lot of sugar for calling the type() constructor; after all, Python classes are just instances of type. All you have to do is manually create an instance of your metaclass, rather than of type.

Fortunately, other people have already made this work for you. futurize --stage2 will fix this using the libfuturize.fixes.fix_metaclass fixer, which imports future.utils.with_metaclass and produces the following:

from future.utils import with_metaclass

class Foo(with_metaclass(FooMeta, object)):
    ...

This creates an intermediate dummy class with the right metaclass, which you then inherit from. Classes use the same metaclass as their parents, so this works fine in any Python.

If you don’t want to depend on python-future, the same function exists in the six module.
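
The six spelling is nearly identical; a sketch:

import six

class Foo(six.with_metaclass(FooMeta, object)):
    pass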

Re-raising exceptions has different syntax

raise with no arguments does the same thing in Python 2 and Python 3: it re-raises the exception currently being handled, preserving the original traceback.

The problem comes in with the three-argument form of raise, which is for preserving the traceback while raising a different exception. It might look like this:

try:
    some_fragile_function()
except Exception as e:
    raise MyLibraryError, MyLibraryError("Failed to do a thing: " + str(e)), sys.exc_info()[2]

sys.exc_info()[2] is, of course, the only way to get the current traceback in Python 2. You may have noticed that the three arguments to raise are the same three things that sys.exc_info() returns: the type, the value, and the traceback.

Python 3 introduces exception chaining. If something raises an exception from within an except block, Python will remember the original exception, attach it to the new one, and show both exceptions when printing a traceback — including both exceptions’ types, messages, and where they happened. So to wrap and rethrow an exception, you don’t need to do anything special at all.

try:
    some_fragile_function()
except Exception:
    raise MyLibraryError("Failed to do a thing")

For more complicated handling, you can also explicitly say raise new_exception from old_exception. Exceptions contain their associated tracebacks as a __traceback__ attribute in Python 3, so there’s no need to muck around getting the traceback manually. If you really want to give an explicit traceback, you can use the .with_traceback() method, which just assigns to __traceback__ and then returns self.

raise MyLibraryError("Failed to do a thing").with_traceback(some_traceback)

It’s hard to say what it even means to write code that works “equivalently” in both versions, because Python 3 handles this problem largely automatically, and Python 2 code tends to have a variety of ad-hoc solutions. Note that you cannot simply do this:

if PY3:
    raise MyLibraryError("Beep boop") from exc
else:
    raise MyLibraryError, MyLibraryError("Beep boop"), sys.exc_info()[2]

The first raise is a syntax error in Python 2, and the second is a syntax error in Python 3. if won’t protect you from parse errors. (On the other hand, you can hide .with_traceback() behind an if, since that’s just a regular method call and will parse with no issues.)

six has a reraise function that will smooth out the differences for you (probably by using exec). The drawback is that it’s of course Python 2-oriented syntax, and on Python 3 the final traceback will include more context than expected.
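
For reference, the six.reraise version of the earlier example might look like this (a sketch):

import sys
import six

try:
    some_fragile_function()
except Exception as e:
    six.reraise(MyLibraryError,
                MyLibraryError("Failed to do a thing: " + str(e)),
                sys.exc_info()[2])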

Alternatively, there’s a six.raise_from, which is designed around the raise X from Y syntax of Python 3. The drawback is that Python 2 has no obvious equivalent, so you just get raise X, losing the old exception and its traceback.

There’s no clear right approach here; it depends on how you’re handling re-raising. Code that just blindly raises new exceptions doesn’t need any changes, and will get exception chaining for free on Python 3. Code that does more elaborate things, like implementing its own form of chaining or storing exc_info tuples to be re-raised later, may need a little more care.

Bytestrings are sequences of integers

In Python 2, bytes is a synonym for str, the default string type. Iterating or indexing a bytes/str produces 1-character strs.

list(b'hello')  # ['h', 'e', 'l', 'l', 'o']
b'hello'[0:4]  # 'hell'
b'hello'[0]  # 'h'
b'hello'[0][0][0][0][0]  # 'h' -- it's turtles all the way down

In Python 3, bytes is a specialized type for handling binary data, not text. As such, iterating or indexing a bytes produces integers.

list(b'hello')  # [104, 101, 108, 108, 111]
b'hello'[0:4]  # b'hell'
b'hello'[0]  # 104
b'hello'[0][0][0][0]  # TypeError, since you can't index 104

If you have explicitly binary data that you want to be bytes in Python 3, this may pose a bit of a problem. Aside from checking the version explicitly and making heavy use of chr/ord, there are two approaches.

One is to use bytearray instead. This is like bytes, but mutable. More importantly, since it was introduced as a new type in Python 2.6 — after Python 3.0 came out — it has the same iterating and indexing behavior as Python 3’s bytes, even in Python 2.

bytearray(b'hello')[0]  # 104, on either Python 2 or 3

The other is to slice rather than index, since slicing always produces a new iterable of the same type. If you want to extract a single character from a bytes, just take a one-element slice.

b'hello'[0]  # 104
b'hello'[0:1]  # b'h'

Things that are just a royal pain in the ass

Unicode

Saving the best for last, almost!

Honestly, if your Python 2 code is already careful with Unicode — working with unicode internally, and encoding/decoding only at the “boundaries” of your code — then you shouldn’t have too many problems. If your code is not so careful, you should really try to make it a little more careful before you worry about Python 3, since Python 3’s whole jam is to force you to be careful.

See, in Python 2, you can combine bytestrings (str) and text strings (unicode) more or less freely. Python will automatically try to convert between the two using the “default encoding”, which is generally ascii. Python 3 makes text strings the default string type, demotes bytestrings, and forbids ever converting between them.

Most obviously, Python 2’s str and unicode have been renamed to bytes and str in Python 3. If you happen to be using the names anywhere, you’ll probably need to change them! six offers text_type and binary_type, though you can just use bytes to mean the same thing in either version. python-future also has backports for both Python 3’s bytes and str types, which seems like an extreme approach to me. Changing str to mean a text type even in Python 2 might be a good idea, though.

b'' and u'' work the same way in either Python 2 or 3, but unadorned strings like '' are always the str type, which has different behavior. There is a from __future__ import unicode_literals, which will cause unadorned strings to be unicode in Python 2, and this might work for you. However, this prevents you from writing literal “native” strings — strings of the same type Python uses for names, keyword arguments, etc. Usually this won’t matter, since Python 2 will silently convert between bytes and text, but it’s caused me the occasional problem.

The right thing to do is just explicitly mark every single string with either a b or u sigil as necessary. That just, you know, sucks. But you should be doing it even if you’re not porting to Python 3.

basestring is completely gone in Python 3. str and bytes have no common base type, and their semantics are different enough that it rarely makes sense to treat them the same way. If you’re using basestring in Python 2, it’s probably to allow code to work on either form of “text”, and you’ll only want to use str in Python 3 (where bytes are completely unsuitable for text). six.string_types provides exactly this. futurize --stage2 also runs the lib2to3.fixes.fix_basestring fixer, but this replaces basestring with str, which will almost certainly break your code in Python 2. If you intend to use stage 2, definitely audit your uses of basestring first.
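
The usual compatible check then looks something like this (a sketch):

import six

def is_text(value):
    return isinstance(value, six.string_types)   # (basestring,) on Python 2, (str,) on Python 3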

As mentioned above, bytestrings are sequences of integers, which may affect code trying to work with explicitly binary data.

Python 2 has both .decode() and .encode() on both bytes and text; if you try to encode bytes or decode text, Python will try to implicitly convert to the right type first. In Python 3, only text has an .encode() and only bytes have a .decode().

Relatedly, Python 2 allows you to do some cute tricks with “encodings” that aren’t really encodings; for example, "hi".encode('hex') produces '6869'. In Python 3, encoding must produce bytes, and decoding must produce text, so these sorts of text-to-text or bytes-to-bytes translations aren’t allowed. You can still do them explicitly with the codecs module, e.g. codecs.encode(b'hi', 'hex'), which also works in Python 2, despite being undocumented. (Note that Python 3 specifically requires bytes for the hex codec, alas. If it’s any consolation, there’s a bytes.hex() method to do this directly, which you can’t use anyway if you’re targeting Python 2.)

Python 3’s open decodes as UTF-8 by default (a vast oversimplification, but usually), so if you’re manually decoding after reading, you’ll get an error in Python 3. You could explicitly open the file in binary mode (preserving the Python 2 behavior), or you could use codecs.open to decode transparently on read (preserving the Python 3 behavior). The same goes for writing.
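
For example, reading a UTF-8 file as text on both versions might look like this (the filename is a placeholder):

import codecs

with codecs.open('data.txt', encoding='utf-8') as f:   # decodes transparently on read, like Python 3's open()
    text = f.read()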

sys.stdin, sys.stdout, and sys.stderr are all text streams in Python 3, so they have the same caveats as above, with the additional wrinkle that you didn’t actually open them yourself. Their .buffer attribute gives a handle opened in binary mode (Python 2 behavior), or you can adapt them to transcode transparently (Python 3 behavior):

if six.PY2:
    sys.stdin = codecs.getreader('utf-8')(sys.stdin)
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
    sys.stderr = codecs.getwriter('utf-8')(sys.stderr)

A text-mode file’s .tell() in Python 3 still returns a number that can be passed back to .seek(), but the number is not necessarily meaningful, and in particular can’t be used to estimate progress through a file. (Python uses a few very high bits as flags to indicate the state of the decoder; if you mask them off, what’s left is probably the byte position in the file as you’d expect, but this is pretty definitively a hack.)

Python 3 likes to treat filenames as text, but most of the functions in os and os.path will accept either text or bytes as their arguments (and return a value of the same type), so you should be okay there.

os.environ has text keys and values in Python 3. If you direly need bytes, you can use os.environb (and os.getenvb()).

I think that covers most of the obvious basics. This is a whole sprawling topic that I can’t hope to cover off the top of my head. I’ve seen it be both fairly painful and completely pain-free, depending entirely on the state of the Python 2 codebase.

Oh, one final note: there’s a module for Python 2 called unicode-nazi (sorry, I didn’t name it) that will produce a warning anytime a bytestring is implicitly converted to a text string, or vice versa. It might help you root out places you’re accidentally slopping types back and forth, which will certainly break in Python 3. I’ve only tried it on a comically large project where it found thousands of violations, including plenty in surprising places in the standard library, so it may or may not be of any practical help.

Things that are not actually gone

String formatting with %

There’s a widespread belief that str % ... is deprecated, since there’s a newer and shinier str.format() method.

Well, it’s not. It’s not gone; it’s not deprecated; it still works just fine. I don’t like to use it, myself, since it’s easy to make accidentally ambiguous — "%s" % foo can crash if foo is a tuple! — but it’s not going anywhere. In fact, as of Python 3.5, bytes and bytearray support % but not .format.

optparse

argparse is certainly better, but the optparse module still exists in Python 3. It has been deprecated since Python 3.2, though.

Things that are preposterously obscure but that I have seen cause problems nonetheless

Tuple unpacking

A little-used feature of Python 2 is tuple unpacking in function arguments:

def foo(a, (b, c)):
    print a, b, c

x = (2, 3)
foo(1, x)

This syntax is gone in Python 3. I’ve rarely seen anyone use it, except in two cases. One was a parsing library that relied pretty critically on using it in every parsing function you wrote; whoops.

The other is when sorting a dict’s items:

sorted(d.items(), key=lambda (k, v): k + v)

In Python 3, you have to write that as lambda kv: kv[0] + kv[1]. Boo.

long is gone

Python 3 merged its long type with int, so now there’s only one integral type, called int.

Python 2 promotes int to long pretty much transparently, and longs aren’t very common in the first place, so it’s fairly unlikely that this will make a difference. On the off chance you’re type-checking for integers with isinstance(x, (int, long)) (and really, why are you doing that), you can just use six.integer_types instead.

Note that futurize --stage2 applies the lib2to3.fixes.fix_long fixer, which blindly renames long to int, leaving you with inappropriate code like isinstance(x, (int, int)).

However…

I have seen some very obscure cases where a hand-rolled binary protocol would encode ints and longs differently. My advice would be to not do that.

Oh, and a little-known feature of Python 2’s syntax is that you can have long literals by suffixing them with an L:

123  # int
123L  # long

You can write 1267650600228229401496703205376 directly in Python 2 code, and it’ll automatically create a long, so the only reason to do this is if you explicitly need a long with a small value like 1. If that’s the case, something has gone catastrophically wrong.

repr changes

These should really only affect you if you’re using reprs as expected test output (or, god forbid, as cache keys or something). Some notable changes:

  • Unicode strings have a u prefix in Python 2. In Python 3, of course, Unicode strings are just strings, so there’s no prefix.

  • Conversely, bytestrings have a b prefix in Python 3, but not in Python 2 (though the b prefix is allowed in source code).

  • Python 2 escapes all non-ASCII characters, even in the repr of a Unicode string. Python 3 only escapes control characters and codepoints considered non-printing.

  • Large integers and explicit longs have an L suffix in Python 2, but not in Python 3, where there is no separate long type.

  • A set becomes set([1, 2, 3]) in Python 2, but {1, 2, 3} in Python 3. The set literal syntax is allowed in source code in Python 2.7, but the repr wasn’t changed until 3.0.

  • floats stringify to the shortest possible representation that has the same underlying value — e.g., str(1.1) is '1.1' rather than '1.1000000000000001'. This change was backported to Python 2.7 as well, but I have seen it break tests.

Hash randomization

Python has traditionally had a predictable hashing mechanism: repr(dict(a=1, b=2, c=3)) will always produce the same string. (On the same platform with the same Python version, at least.) Unfortunately this opens the door to an obscure DoS exploit that was known to Perl long ago: if you know a web application is written in Python, you can construct a query string that will become a dict whose keys all go in the same hash bucket. If your query string is long enough and you send enough requests, you can tie up all the Python processes in dealing with hash collisions.

The fix is hash randomization, which seeds the hashing algorithm in such a way that items are bucketed differently every time Python runs. It’s available in Python 2.7 via an environment variable or the -R argument, but it wasn’t turned on by default until Python 3.3.

The fear was that it might break things. Naturally, it has broken things. Mostly, reprs in tests. But it also changes the iteration order of dicts between Python runs. I have seen code using dicts whose keys happened to always be sorted in alphabetical or insertion order before, but with hash randomization, the keys were of course in a different order every time the code ran. The author assumed that Python had somehow broken dict sorting (which it has never had).

nonlocal

Python 3 introduces the nonlocal keyword, which is like global except it looks through all outer scopes in the expected order. It fixes this mild annoyance:

def make_function():
    counter = 0
    def function():
        nonlocal counter
        counter += 1  # without 'nonlocal', this declares a new local!
        print("I've been called", counter, "times!")
    return function

The problem is that any use of assignment within a function automatically creates a new local, and locals are known statically for the entire body of the function. (They actually affect how functions are compiled, in CPython.) So without nonlocal, the above code would see counter += 1, but counter is a new local that has never been assigned a value, so Python cannot possibly add 1 to it, and you get an UnboundLocalError.

nonlocal tells Python that when it sees an assignment of a name that exists in some outer scope, it should reuse that outer variable rather than shadowing it. Great, right? Purely a new feature. No problem.

Unfortunately, I’ve worked on a codebase that needed this feature in Python 2, and decided to fake it with a class… named nonlocal.

def make_function():
    class nonlocal:
        counter = 0
    def function():
        nonlocal.counter += 1  # this alters an outer value in-place, so it's fine
        print("I've been called", nonlocal.counter, "times!")
    return function

The class here is used purely as a dummy container. Assigning to an attribute doesn’t create any locals, because it’s equivalent to a method call, so the operand must already exist. This is a slightly quirky approach, but it works fine.

Except that, of course, nonlocal is a keyword in Python 3, so this becomes complete gibberish. It’s such gibberish that (if I remember correctly) 2to3 actually cannot parse it, even though it’s perfectly valid Python 2 code.

I don’t have a magical fix for this one. Just, uh, don’t name things nonlocal.

List comprehensions no longer leak

Python 2 has the slightly inconsistent behavior that loop variables in a generator expression ((...)) are scoped to the generator expression, but loop variables in a list comprehension ([...]) belong to the enclosing scope.

The only reason is in implementation details: a list comprehension acts like a for loop, which has the same behavior, whereas a generator expression actually creates a generator internally.

Python 3 brings these cases into line: loop variables in list comprehensions (or dict or set comprehensions) are also scoped to the comprehension itself.

I cannot imagine any possible reason why this would affect you negatively, and yet, I can swear I’ve seen it happen. I wish I could remember where, because I’m sure it’s an exciting story.

cStringIO.h is gone

cStringIO.h is a private and undocumented C interface to Python 2’s cStringIO.StringIO type. It was removed in Python 3, or at least is somewhere I can’t find it.

This was one of the reasons Thrift’s Python 3 port took almost 3 years: Thrift has a “fast” C module that makes use of this private interface, and it’s not obvious how to replace it. I think they ended up just having the module not exist on Python 3, so Python 3 will just be mysteriously slower.

Some troublesome libraries

MySQLdb is some ancient, clunky, noncompliant, underdocumented trash, much like the database it connects to. It’s nigh abandoned, though it still promises Python 3 support in the MySQLdb 2.0 vaporware. I would suggest not using MySQL, but barring that, try mysqlclient, a fork of MySQLdb that continues development and adds Python 3 support. (The same people also maintain an earlier project, pymysql, which strives to be a pure-Python drop-in replacement for MySQLdb — it’s not quite perfect, but its existence is interesting and it’s sure easier to read than MySQLdb.)

At a glance, Thrift still hasn’t had a release since it merged Python 3 support, eight months ago. It’s some enterprise nightmare, anyway, and bizarrely does code generation for a bunch of dynamic languages. Might I suggest just using the pure-Python thriftpy, which parses Thrift definitions on the fly?

Twisted is, ah, large and complex. Parts of it now support Python 3; parts of it do not. If you need the parts that don’t, well, maybe you could give them a hand?

M2Crypto is working on it, though I’m pretty sure most Python crypto nerds would advise you to use cryptography instead.

And so on

You may find any number of other obscure compatibility problems, just as you might when upgrading from 2.6 to 2.7. The Python community has a lot of clever people willing to help you out, though, and they’ve probably even seen your super duper niche problem before.

Don’t let that, or this list of gotchas in general, dissuade you! Better to start now than later; even fixing an integer division gets you one step closer to having your code run on Python 3 as well.

A recipe for cookies

This isn’t about baked goods but about information technology. The term is a direct loan-translation from English, where the word cookies is used (even though the literal Bulgarian translation would be “курабийки” rather than “бисквитки”).

Still, what are cookies for? Cookies are pieces of structured information containing various parameters. They are created by the servers that serve web pages and are transmitted through the service parts of the HTTP protocol (the so-called HTTP headers), the transfer protocol that browsers use to exchange information with servers. Cookies were added to the HTTP specification in version 1.0, in the early 1990s. The Internet was far less developed then than it is today, which explains some of HTTP’s peculiarities. The protocol is based on requests (from the client) and responses (from the server), with each request/response pair made over a separate connection (socket) between the client and the server. This scheme is extremely convenient, because it doesn’t require a constant, stable Internet connection: the connection is actually used only for a brief moment. Unfortunately, because of this peculiarity HTTP is often called a stateless protocol, a protocol that does not preserve state: the server has no way of knowing that a series of consecutive requests was sent by one and the same client. This is unlike IRC, SMTP, FTP, and other protocols created in the same years, which open a single connection and send and receive data in both directions over it. With such protocols communication starts with a handshake or authentication between the parties, after which both sides know that, as long as the connection stays open, they are talking to a specific peer.

To overcome this shortcoming of the protocol, after version 0.9 (the first version to see real-world use), version 1.0 introduced the cookie mechanism. The technology was named cookies after the Brothers Grimm tale of Hansel and Gretel, who marked their way through the dark forest by scattering crumbs. Most Bulgarian translations of the tale speak of bread crumbs, a term that has found another home in IT and on the web, but the story is in fact a German folk tale with many different versions over the years. The analogy is obvious: cookies make it possible to trace the user’s path through the dark forest of the stateless HTTP protocol.

How do cookies work? Cookies are created by servers and sent to users. With every subsequent request to the same server, the user must send back a copy of the cookies received from that server. This gives the server a mechanism for tracking the user along the way, that is, for knowing when one and the same user has made a series of otherwise unrelated requests.

Imagine you send a first request to the server and it replies that you need to authenticate, by providing a username and password. You send them, and the server verifies that it really is you. But because of how the HTTP protocol works, once you have sent the username and password, the connection between you and the server is closed. Later you send a new request. The server has no way of knowing that it is you again, since the new request arrives over a new connection.

Now imagine the same scheme with cookie technology added. You send a request to the server; it replies that it wants a username and password. You send them, and the server returns a cookie recording that the holder of this cookie has already provided a username and password. The connection closes. Later you send a new request, and since it goes to the same server, the client must send back a copy of the cookies received from it. The new request therefore also carries the cookie saying that you provided a valid username and password earlier.

So, cookies are created by servers, and the data is sent by the web server together with the response to a request for a resource (a particular web page, text, an image, or some other file). The cookie is returned as part of the protocol’s header information. When a client (the client software, a browser) receives a response from a server that contains a cookie, it has to process it. If the client already has such a cookie, it should update its information about it; if it does not, it should create it. If the cookie has expired, it should destroy it, and so on.

It is often said that cookies are small text files stored on the user’s computer. That is not always true: cookies were separate text files in early versions of some browsers (Internet Explorer and Netscape Navigator, for example), but most modern browsers store cookies differently. Current versions of Mozilla Firefox and Google Chrome, for instance, keep all cookies in a single file, an sqlite database. The database approach is a good fit, since cookies are structured information that is convenient to keep in a database, and access to it is considerably more efficient. Microsoft Edge, however, still stores cookies as text files in the AppData directory.

What parameters do the cookies sent by servers contain? Each cookie can carry a name, a value, a domain, an address (path), an expiry time, and some security parameters (whether it is valid only over https, and whether it is accessible only to the transfer protocol, i.e. not to javascript and other application layers on the client). The name and value are mandatory for every cookie: they specify the name of the variable in which the information will be stored and its value, the stored information itself.

The domain is the main part of the address of the site sending the cookie, the part that identifies the machine (server) on the network. According to the technical specifications, the cookie’s domain must match the domain of the server that sends it, but there are some exceptions. The first exception is that when several levels of domains are used, cookies can be created for different parts of the hierarchy. For example, the response to a request to the domain example.com may create a cookie valid only for example.com, or a cookie valid for that domain and all of its subdomains. A cookie valid for all subdomains is written with a dot before the domain name, i.e. .example.com. The second exception applies when the cookie is created not by the server but by an application layer on the client (for example by javascript). A js file may be loaded from one domain into an html page from another domain: the server sending a cookie can send it on behalf of the domain the js file came from, while the js itself, loaded in a page from another domain, can create (and read) cookies for the html page’s domain.

The address (or path) is the remaining part of the URL. By default cookies are valid for the path / (that is, the root of the website), which means the cookie is valid for every possible address on the server. It is, however, possible to explicitly specify a particular address for which the cookie is valid. Conceptually, addresses represent a path in a hierarchical file structure on the server (although they don’t have to), so cookie paths likewise describe the place of their creation in a hierarchical file system. For example, if a cookie’s path is /dir/, it is valid in the directory named dir, including all of its subdirectories.

To give a slightly more realistic example: if we use cookies to store authentication information for the admin panel of a website located in the /admin/ directory, we can specify that the cookie is valid only for the address /admin/. That way, the cookies created by the server for the needs of the admin panel will not be sent with requests for other resources on the same server.

The expiry time determines how long the user should keep the cookie and return it with every subsequent request to the same server and address (path). When the server wants to delete a cookie held by the user, it sends that cookie with an expiry time in the past, which causes it to expire and be deleted automatically on the user’s side.

Cookies also have parameters whose job is to protect the transmitted data. These are two boolean flags: one determines whether the cookie is accessible (for both reading and writing) only to the http protocol, or also to the application layer on the client (for example to javascript); the other determines whether the cookie is sent over all protocols or only over https (secure http).
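
As a concrete illustration, here is roughly how those attributes map onto a Set-Cookie header, sketched with Python 3’s standard http.cookies module; the names and values are made up:

from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie['session_id'] = 'abc123'
cookie['session_id']['domain'] = '.example.com'   # valid for example.com and all of its subdomains
cookie['session_id']['path'] = '/admin/'          # only sent with requests under /admin/
cookie['session_id']['max-age'] = 1800            # expires after 30 minutes
cookie['session_id']['secure'] = True             # only sent over https
cookie['session_id']['httponly'] = True           # not visible to javascript on the client

print(cookie.output())
# prints something like:
# Set-Cookie: session_id=abc123; Domain=.example.com; HttpOnly; Max-Age=1800; Path=/admin/; Secure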

As you might guess, a single exchange can involve more than one cookie. Just as the server can create several cookies at once, the client can return several cookies to the server. That is precisely why, in addition to domains and addresses, cookies also have names.

Cookies also come with a number of limits. Most browsers do not allow more than 20 simultaneously valid cookies for one and the same domain; in Mozilla Firefox the limit is 50, and in Opera 30. The size of each individual cookie is also limited, to no more than 4KB (4096 bytes). The cookie specification RFC 2109 from 1997 states that a client may store up to 300 cookies, 20 per domain, each up to 4KB in size. The later specification RFC 6265 from 2011 raises the limits to 3000 in total and 50 per domain. Still, don’t forget that every cookie is sent by the client with every subsequent request to the server: if we hit the ceiling with 50 cookies of 4KB each, nearly 200KB would travel with every request in cookies alone, which can become a serious traffic burden even with the technical capabilities of modern Internet access.

Of course, the earlier example of saving the user’s successful authentication in a cookie comes with many security considerations. First of all, it is not a good idea to save the username and password in a cookie, since the cookie is stored on the user’s computer. That means that at any time, if a malicious party gains access to that computer, they can read the username and password from the cookies saved there. On the other hand, if we store only the username in the cookie, without the password, we have no protection against forged cookies: any malicious party could craft a fake cookie containing an arbitrary (someone else’s) username and present itself to the server under that identity.

That is why the most commonly used mechanism is this: on every authentication with a username and password, once the server has verified them, it creates a temporarily valid identifier and sends it as a cookie. Different technologies know this identifier under different names: one-time password (OTP), token, session, and so on. In this scheme the server stores information about the user for a limited time (the lifetime of the session). Each such record (often called a session) gets an identification number, which is sent to the user as a cookie. Since the user will return this identifier with every subsequent request, the server can restore the stored information about the user and have it available on every request. At the same time, the information is stored on the server rather than on the client, so a malicious user cannot modify or forge it. Moreover, the identifier is valid only for a limited period (for example 30 minutes): even if the cookie with the identifier remains on the user’s computer, the identifier recorded in it will be useless half an hour later. Last but not least, pressing the logout button deletes the user’s data stored on the server even if the 30-minute period has not yet expired. That is exactly why it is important to always use the logout button when leaving online systems.
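
A toy sketch of that server-side scheme (all names here are made up for illustration; a real implementation would live inside your web framework):

import os
import time

SESSIONS = {}                  # server-side store: session id -> user data
SESSION_LIFETIME = 30 * 60     # 30 minutes

def start_session(username):
    session_id = os.urandom(16).hex()   # a random, unguessable identifier (Python 3)
    SESSIONS[session_id] = {'user': username,
                            'expires': time.time() + SESSION_LIFETIME}
    return session_id                   # sent back to the client in a cookie

def current_user(session_id):
    session = SESSIONS.get(session_id)
    if session is None or session['expires'] < time.time():
        SESSIONS.pop(session_id, None)  # unknown or expired: treat as not logged in
        return None
    return session['user']

def end_session(session_id):
    SESSIONS.pop(session_id, None)      # the logout button deletes the server-side state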

What else is stored in cookies? Practically everything! Cookies are very often used to store a user’s individual settings. When the user changes a setting, the server sends them a cookie with that setting and a long expiry time. On every subsequent visit by the same user to the same site, the saved setting is sent along with the request to the server. The server then knows about the desired setting and applies it when returning the response. An example of such a setting is the number of records shown per page: once you change that number, the chosen value can be saved in a cookie, and on every subsequent request the server will always know about the setting and return the correct response.

Other information is stored in cookies as well, for example the items placed in an online shop’s basket, or statistics about when you last visited a given site, how many times you have visited it, which pages you viewed, how long you stayed on each page, and so on.

Does the European Union ban cookies? The Cookie Monster from Sesame Street would be quite upset to learn that the EU wants to restrict the use of cookies. In reality the truth is rather different, but the information is widely misinterpreted. Let’s step out of the technological sphere and a little way into the legal one. First of all, which are the relevant legal acts? The one usually cited is the European Directive 2009/136/EC of 25 November 2009. The truth is that this directive does not address cookies directly. It amends another directive, Directive 2002/22/EC of 7 March 2002 on universal service (of which Internet access is a part) and users’ rights. The amendment made by the 2009 directive reads as follows (Art. 5, p. 30):

5. Article 5(3) shall be replaced by the following:

"3. Member States shall ensure that the storing of information, or the gaining of access to information already stored, in the terminal equipment of a subscriber or user is only allowed on condition that the subscriber or user concerned has given his or her consent, having been provided with clear and comprehensive information, in accordance with Directive 95/46/EC, inter alia, about the purposes of the processing. This shall not prevent any technical storage or access for the sole purpose of carrying out the transmission of a communication over an electronic communications network, or as strictly necessary in order for the provider of an information society service explicitly requested by the subscriber or user to provide the service."

The preamble of the same directive also states:

(66) Third parties may wish to store information on the equipment of a user, or gain access to information already stored, for a number of purposes, ranging from the legitimate (such as certain types of cookies) to those involving unwarranted intrusion into the private sphere (such as spyware or viruses). It is therefore of paramount importance that users be provided with clear and comprehensive information when engaging in any activity which could result in such storage or gaining of access. The methods of providing information and offering the right to refuse should be as user-friendly as possible. Exceptions to the obligation to provide information and offer the right to refuse should be limited to those situations where the technical storage or access is strictly necessary for the legitimate purpose of enabling the use of a specific service explicitly requested by the subscriber or user. Where it is technically possible and effective, in accordance with the relevant provisions of Directive 95/46/EC, the user's consent to processing may be expressed by using the appropriate settings of a browser or other application. The enforcement of these requirements should be made more effective by way of enhanced powers granted to the relevant national authorities.

Both quotations point to one more thing worth noting: they refer to Directive 95/46/EC of 24 October 1995, which deals with the protection of individuals with regard to the processing of personal data. Of course, it is hard to argue that a directive from the mid-1990s directly addresses the functioning of the Internet, which was still not widely used at the time, while cookie technology, having appeared only a few years earlier, was still extremely new and rarely used.

Before continuing with the analysis, we should note that, as regards protecting users from cookies, none of the three directives cited has so far been transposed into Bulgarian national legislation (at least not in the Electronic Governance Act, the Electronic Communications Act or the Personal Data Protection Act). Thankfully, the mechanism of EU directives is designed in such a way that they are binding on all Member States, regardless of whether national legislation exists in the respective area or not.

And yet, why does the EU want to restrict the use of cookies? Beyond the many applications of the technology listed so far (storing authentication, storing settings, tracking user behaviour, collecting statistics about the user, marketing analyses, storing shopping-cart data and so on, with some of these overlapping or complementing each other), cookies can also be used to intrude seriously into users' privacy, by carrying out various forms of tracking and analysis of users' consumption and behaviour on the Internet in order to serve targeted advertising, or for other, even unlawful, purposes.

Let us look at a more realistic example based on the technology already discussed. Suppose we have a simple informational site about cars, with no special functionality and no services of its own. The site, however, uses Google Analytics (a free visitor-statistics tool provided by Google), and its owner, wanting to monetise the site's content at least to some degree, has also enabled Google AdSense (Google's service for displaying advertising banners on publishers' sites). We also have a user who searches Google for information on how to repair a flat car tyre. The user finds the site mentioned above in Google, clicks the link and lands on the site. The same user also has an email account in Gmail (a free email service provided by Google). As you can see, one common denominator has been following us the whole way: Google. It is by no means the only big player in this market; the example is simply the most accessible one for a general audience. In effect, Google simultaneously has access to the emails the user has sent and received, to what they searched for, to which site they entered, to what they read there (exactly which pages), to how much time they spent on that site, and also to information about every other site the same user has visited in the past, regardless of whether they visited it from the same computer or from another one: it is enough for all those sites to use Google Analytics for their statistics. If the user has a work computer and a home computer and has logged into Gmail from both, Google Analytics can infer that the sites visited on the two computers, which otherwise have no connection to each other, were visited by one and the same person. The user should then not be surprised if they go to a third location, unrelated in any way to the previous two (the home and the work computer), log into their email, and later see an advertisement for new car tyres on some arbitrary site they visit.

Tracking everything described so far is possible precisely through the cookie mechanism. It is no coincidence that cookies are called a state-keeping mechanism and were created for "tracking the users of the protocol". Tracking in the positive, purely technological sense of the term, of course, but tracking nonetheless, which in the hands of ill-intentioned parties can take on a completely different scale and serve completely different goals.

All of this can be (and is) combined with other data collected about users: IP geolocation, information about the Internet connection, the device used, the screen size, the operating system, the software used, and many other pieces of data that are supplied automatically while we browse the web.

Once again, the cookie technology itself has built-in protection mechanisms. For example, cookies belonging to one website cannot be read by another site. Here, however, the whole scheme breaks down, because cookies are accessible to the client's application layer (JavaScript), while at the same time millions of sites around the world take advantage of the otherwise free visitor-statistics service of one and the same provider. It is precisely this provider that is the connecting link in the whole scheme. Each individual site and each individual user holds no particularly useful information on their own, but the connecting link holds all of it, and given the will (and believe me, for all these services to be free, the will has long since turned into a need), that accumulated information can be analysed, processed and used for any purpose whatsoever.
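A rough sketch of what such an embedded statistics script can look like is shown below; the cookie name _vid and the address collector.example are hypothetical, but the mechanics mirror what was just described:

    // Runs inside every page that embeds it, i.e. in the context of each site.
    (function () {
      // Keep a per-visitor identifier in the embedding site's own cookies.
      const match = document.cookie.match(/(?:^|;\s*)_vid=([^;]+)/);
      let id = match ? match[1] : null;
      if (!id) {
        id = Math.random().toString(36).slice(2);
        document.cookie = '_vid=' + id + '; Max-Age=63072000; Path=/';   // two years
      }
      // Report the page view to the collector. The browser also attaches the
      // collector's own (third-party) cookies to this request, and those are the
      // same no matter which site the script is embedded in.
      new Image().src = 'https://collector.example/hit?vid=' + id +
                        '&page=' + encodeURIComponent(location.href) +
                        '&ref=' + encodeURIComponent(document.referrer);
    })();

Each individual site sees only its own cookie, but every beacon ends up at the same collector, and once the user logs into any of the collector's other services (the Gmail scenario above), those beacons can be tied to a concrete person.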

And because these practices can reach quite deep into people's private lives, the European Union has taken measures to oblige providers of Internet services (website owners) to inform users about what data is stored about them on their terminal devices (i.e. on the users' own computers) and for what purposes that data is used.

Several aspects are important to clarify here. First, neither the use of cookies nor tracking as such is prohibited; what is required is that users be informed about what is being done and why, and that they explicitly consent to it. The other important detail is that cookies are not the only mechanism for storing information on terminal devices, and the European legislation is not limited to cookies alone. Local Storage is a modern alternative to cookies; although it works differently and offers quite different capabilities, it also stores information on terminal devices, can in practice be used to track users, and can affect their rights with respect to the processing of their personal data. In this sense the European directives cover every form of storing information on users' devices, not just cookies. The directives also treat the storage of data for tracking users separately from the cookies needed for the technological functioning of Internet systems. A distinction is likewise made between third-party cookies and cookies set by the site owners themselves; in the example with the cookies left by Google Analytics, Google is the third party.
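For comparison, a few lines of browser JavaScript are enough to keep a persistent identifier in Local Storage as well (the key visitor_id is my own example). Unlike a cookie it is not sent automatically with every request, but any script on the page can read it and transmit it whenever it likes:

    // Persist an identifier across visits using Local Storage instead of a cookie.
    let id = localStorage.getItem('visitor_id');
    if (id === null) {
      id = crypto.randomUUID();               // available in current browsers (secure contexts)
      localStorage.setItem('visitor_id', id);
    }
    console.log('identifier persisted across visits:', id);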

When the data (whether cookies or something else) is stored by third parties (parties other than the service provider and the end user), this does not relieve the service provider of its obligations to inform users and to ask for their consent. In short, if I run a site that uses Google Analytics, the obligation to inform the site's users and to ask for their consent remains mine (i.e. the site owner's, the person providing the service), not the third party's (i.e. it is not Google's obligation).

Another interesting point is that the directive considers it acceptable for the user's consent to be expressed through a setting of the browser or of another application. Here we can return to the technology. Some time ago there was P3P (the Platform for Privacy Preferences), a technology that started out promisingly but ended up being implemented only by Internet Explorer; eventually the W3C discontinued work on the specification, one of the cited reasons being that the planned technology was relatively complex to implement. Today most browsers support Do Not Track (DNT), which is simply an HTTP header named DNT: if it is present in the client's request with the value 1, it indicates that the user does not consent to being tracked. Of course, the tracking that DNT refers to and the storing of and access to information on users' terminal devices that the European directives refer to are not the same thing. You can write and read information on the client without tracking them, and that alone can already breach the European directives; conversely, you can track a client without writing or reading any data locally (for example via browser fingerprinting, with the data stored entirely on the server).
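As a minimal, purely illustrative sketch (plain Node.js; the analytics.js path is my own placeholder), a server that chooses to honour DNT could simply avoid emitting its tracking snippet; as noted above, though, respecting DNT is not the same thing as complying with the directives:

    const http = require('http');

    http.createServer((req, res) => {
      const doNotTrack = req.headers['dnt'] === '1';   // "DNT: 1" means the user refuses tracking
      res.setHeader('Content-Type', 'text/html');
      res.end(doNotTrack
        ? '<p>Hello! (no analytics script served)</p>'
        : '<p>Hello!</p><script src="/analytics.js"></script>');
    }).listen(8082);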

Finally, let us sum up:

  1. The European Union does not ban cookies;
  2. The European Union provides measures for the protection of personal data by imposing rules that require users' consent to be sought when data is written to or accessed on their terminal devices;
  3. The requirement for informed consent is not limited to cookies; it covers all existing and any future technologies that allow information to be written to and accessed on users' terminal devices;
  4. Users must be informed and their consent requested regardless of whether the data is written or accessed directly by the service provider or through the services of a third party. Put simply, if we use Google Analytics, it is we who must warn the user and ask for consent, not Google. In this case the service provider (the site itself) acts as a controller of personal data (not within the meaning of the Bulgarian Personal Data Protection Act, but within the meaning of the European directives), while Google acts as a third party authorised by the controller to manage that data on its behalf and at its expense; ergo, the responsibility lies with the controller;
  5. Informing the user and asking for consent is not necessary when the data being written and read serves technological needs related to providing the service that the user has explicitly requested. I would say that even in such cases I would personally choose to inform the user about what is written and read and why, without, however, asking for consent;
  6. According to the European directives, users may also express their consent through specific technologies created for that purpose, but the use of such technologies does not remove the need to inform the user about what data is stored and for what purposes it is processed;
  7. The European rules in this area appear not to have been transposed into Bulgaria's national legislation, which does not mean that they may be ignored.

Still $5/Month

Post Syndicated from Yev original https://www.backblaze.com/blog/unlimited-cloud-backup-still-5-month/

Backblaze Five Dollars

On June 2nd, 2008, when Backblaze was still just five guys hammering away on a big red box in a one-bedroom apartment turned office, Backblaze went into “Public Beta” with version 1.0.0.71. With great fanfare we announced unlimited online backup of your Mac (PC support came later) for just $5 a month. We received coverage in TechCrunch, Ars Technica and more; it was great. But here’s a deep, dark secret: we started charging customers $5/month even before the June 4th launch. In fact, below is the very first deposit receipt we ever received.

First Backblaze Payment

Sharp-eyed readers will notice that the person making the payment is Tim Nufire (tnufire), our head of operations. Before you go and call that cheating, Tim was not actually reimbursed for his $5; it was our first bit of revenue. In fact, Tim started a “tradition” of having all employees pay for Backblaze on their work computers, one we finally did away with about two years ago, though we still pay for Backblaze on our personal PCs and Macs.

While other companies reduce features or raise prices, we have kept our price the same. We did this by continuously reducing our cost structure with Storage Pod improvements. Then we designed and built a highly scalable storage system using Backblaze Vaults and our own Reed-Solomon erasure coding library, which we open-sourced. We’ve also managed to weather a few bumps along the way, like the time Backblaze was almost acquired and the Thailand drive crisis. In any case, 8 years ago we started with $5/month for our online backup service, and it is the same price today and will be tomorrow.

We figured it might be fun to see what else you could get for just $5. Let’s take a look at Fiverr, a website practically dedicated to the notion of something costing just $5…

What can you get for $5?

  1. 1 vector illustration
  2. 1 black and white anime-style portrait
  3. A 40-word Spokesperson video
  4. A 500-word article which is interesting and grammatically correct
  5. 75 Minute voice-over
  6. A Psychic Reading
  7. Have a “sexy” Happy Birthday song sung for you
  8. Get one error fixed on your WordPress site

Even though some things are getting pricier, like a $5 footlong now being $6, there are also plenty of everyday things you can still get for $5, like a box of Girl Scout cookies. Or, for less than $0.17 per day ($5 divided by roughly 30 days), you can get all your data automatically backed up and available in the cloud! It’s your $5.00; choose wisely.

Either way, Backblaze is still…just $5/month.

The post Still $5/Month appeared first on Backblaze Blog | The Life of a Cloud Backup Company.

State of Online Tracking

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/05/state_of_online.html

Really interesting research: “Online tracking: A 1-million-site measurement and analysis,” by Steven Englehardt and Arvind Narayanan:

Abstract: We present the largest and most detailed measurement of online tracking conducted to date, based on a crawl of the top 1 million websites. We make 15 types of measurements on each site, including stateful (cookie-based) and stateless (fingerprinting-based) tracking, the effect of browser privacy tools, and the exchange of tracking data between different sites (“cookie syncing”). Our findings include multiple sophisticated fingerprinting techniques never before measured in the wild.

This measurement is made possible by our web privacy measurement tool, OpenWPM, which uses an automated version of a full-fledged consumer browser. It supports parallelism for speed and scale, automatic recovery from failures of the underlying browser, and comprehensive browser instrumentation. OpenWPM is open-source and has already been used as the basis of seven published studies on web privacy and security.

Summary in this blog post.

Polypaudio 0.8 Released

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/projects/polypaudio-0.8.html

The reports of Polypaudio’s death are greatly exaggerated.

We are proud to announce the release of Polypaudio
0.8, our networked sound daemon for Linux, other Unix-like operating
systems, and Microsoft Windows. Since the last official release, 0.7,
more than a year has passed. In the meantime Polypaudio experienced
major improvements. Major contributions have been made by both Pierre
Ossman and me. Pierre is being paid by Cendio AB to work on
Polypaudio. Cendio distributes Polypaudio along with their ThinLinc
Terminal Server.

Some of the major changes:

  • New playback buffer model that allows applications to freely seek in
    the server-side playback buffer (with both relative and absolute indices) and to
    synchronize multiple streams together, such that playback times are guaranteed
    to stay synchronized even in the case of a buffer underrun. (Lennart)
  • Ported to Microsoft Windows and Sun Solaris (Pierre)
  • Many inner loops (like sample type conversions) have been ported
    to liboil, which
    enables us to take advantage of modern SIMD instruction sets, like MMX or SSE/SSE2. (Lennart)
  • Support for channel maps which allow applications to assign
    specific speaker positions to logical channels. This enables support
    for “surround sound”. In addition we now support separate volumes for
    all channels. (Lennart)
  • Support for hardware volume control for drivers that support
    it. (Lennart, Pierre)
  • Local users may now be authenticated just by membership in a
    UNIX group, without the need to exchange authentication cookies. (Lennart)
  • A new driver module, module-detect, which automatically detects
    which local output devices are available and loads the needed
    drivers. Supports ALSA, OSS, Solaris and Win32 devices. (Lennart, Pierre)
  • Two new modules implementing RTP/SDP/SAP based multicast audio
    streaming. Useful for streaming music to multiple PCs with speakers
    simultaneously. Or for implementing a simple “always-on” conferencing
    solution for the LAN. Or for sharing a single MIC/LINE-IN jack on the
    LAN. (Lennart)
  • Two new modules for connecting Polypaudio to a JACK audio server
    (Lennart)
  • A new Zeroconf (mDNS/DNS-SD) publisher module. (Lennart)
  • A new module to control the volume of an output sink with a LIRC-supported infrared remote
    control, and another one for doing so with a multimedia keyboard. (Lennart)
  • Support for resolving remote host names asynchronously using libasyncns. (Lennart)
  • A simple proof-of-concept HTTP module, which dumps the current daemon status to HTML. (Lennart)
  • Added proper validity checking of the passed parameters to every
    single API function. (Lennart)
  • Last but not least, the documentation has been beefed up a lot and
    is no longer just simple doxygen-based API documentation. (Pierre, Lennart)

Sounds good, doesn’t it? But that’s not all!

We’re really excited about this new Polypaudio release. However,
there are more very exciting, good news in the Polypaudio world. Pierre
implemented a Polypaudio plugin for alsa-libs. This means you
may now use any ALSA-aware application to access a Polypaudio sound
server! The patch has already been merged upstream, and will probably
appear in the next official release of alsa-plugins.

Due to the massive internal changes we had to make a lot of modifications to
the public API. Hence applications which currently make use of the Polypaudio
0.7 API need to be updated. The patches or packages I maintain will be updated
in the next weeks one-by-one. (That is: xmms-polyp, the MPlayer patch, the
libao patch, the GStreamer patch and the PortAudio patch)

A side note: I wonder what this new release means for Polypaudio in
Debian. I’ve never been informed by the Debian maintainers of
Polypaudio that it has been uploaded to Debian, and never of the
removal either. In fact I never exchanged a single line with those who
were the Debian maintainers of Polypaudio. Is this how the Debian
project wants its developers to communicate with upstream? I doubt
it!

How does Polypaudio compare to ESOUND?

Polypaudio does everything that ESOUND does, and much more. It is a
fully compatible drop-in replacement. With a small script you can make
it command-line compatible (including autospawning). ESOUND clients
may connect to our daemon just like they did to the original ESOUND
daemon, since we implemented a compatibility module for the ESOUND
protocol.

Support for other well-known networked audio protocols (such as
NAS) should be easy to add – if there is a need.

For a full list of the features that Polypaudio has over ESOUND,
see Polypaudio’s
homepage
.

How does Polypaudio compare to ALSA‘s dmix?

Some people might ask whether there still is a need for a sound
server in times where ALSA’s dmix plugin is available. The
answer is: yes!

Firstly, Polypaudio is networked, which dmix is
not. However, there are many reasons why Polypaudio is useful on
non-networked systems as well. Polypaudio is portable; it is available
not just for Linux but also for FreeBSD, Solaris and even Microsoft
Windows. Polypaudio is extensible: there is a broad range of additional
modules
available which allow the user to use Polypaudio in many
exciting ways ALSA doesn’t offer. In Polypaudio streams, devices and
other server internals can be monitored and introspected freely. The
volumes of multiple streams may be manipulated independently of one
another, which allows exciting new applications like a work-alike of
the per-application mixer tool featured in the upcoming Windows
Vista. In multi-user systems, Polypaudio offers a secure and safe way
to allow multiple users to access the sound device
simultaneously. Polypaudio may be accessed through the ESOUND and the
ALSA APIs. In addition, ALSA dmix is still not supported properly by
many ALSA clients, and is difficult to set up.

A side note: dmix forks off its own simple sound daemon
anyway, so there is no big difference compared to using Polypaudio with
the ALSA plugin in auto-spawning mode. (Though admittedly, those ALSA
clients that don’t work properly with dmix won’t work properly with our
ALSA plugin either, since they actually use the ALSA API incorrectly.)

How does Polypaudio compare to JACK?

Every time people discuss sound servers on Unix/Linux and which way
is the right one for desktops, JACK gets mentioned and suggested by some
as a replacement for ESOUND on the desktop. However, this is not
practical. JACK is not intended to be a desktop sound server; instead
it is designed with professional audio in mind. Its semantics are
different from other sound servers: e.g. it uses exclusively floating
point samples, doesn’t deal directly with interleaved channels and
maintains a server-global timeline which may be stopped and seeked
around. All that translates badly to desktop usage. JACK is really
nice software, but it is just not designed for the normal desktop user,
who isn’t working on professional audio production.

Since we think that JACK is really a nice piece of work, we added
two new modules to Polypaudio which can be used to hook it up to a
JACK server.

Get Polypaudio 0.8, while it is hot!

BTW: We’re looking for a logo for Polypaudio. Feel free to send us your suggestions!

Update: The Debian rant is unjust to Jeff Waugh. In fact, he had informed me that he prepared Debian packages of Polypaudio. I just never realized that he had actually uploaded them to Debian. What still stands, however, is that I’ve not been informed or asked about the removal.