
A lesson in social engineering: president debates

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/08/a-lesson-in-social-engineering.html

In theory, we hackers are supposed to be experts in social engineering. In practice, we get suckered into it like everyone else. I point this out because of the upcoming presidential debates between Hillary and Trump (and hopefully Johnson). There is no debate, there is only social engineering.

Some think Trump will pull out of the debates, because he’s been complaining a lot lately that they are rigged. No. That’s just because Trump is a populist demagogue. A politician can only champion the cause of the “people” if there is something “powerful” to fight against. He has to set things up ahead of time (debates, elections, etc.) so that any failure on his part can be attributed to the powerful corrupting the system. His constant whining about the debates doesn’t mean he’ll pull out any more than whining about the election means he’ll pull out of that.
Moreover, he’s down in the polls (What polls? What’s the question??). He therefore needs the debates to pull himself back up. And it’ll likely work — because social-engineering.
Here’s how the social engineering works, and how Trump will win the debates.
The moderators, the ones running the debate, will do their best to ask Trump the toughest questions they can think of. At this point, I think their first question will be about the Khan family, and Trump's crappy treatment of their hero son. This is one of Trump's biggest weaknesses, especially among military-obsessed Republicans.
And Trump's response to this will be awesome. I don't know what it will be, but I do know that he's employing some of the world's top speechwriters and debate specialists to work on the answer. He'll be practicing this question diligently, working on a scripted answer that covers the many ways it can be asked, from now until the debates. And then, when that question comes up, it'll look like he's just responding off-the-cuff, without any special thought, and it'll impress the heck out of all the viewers who don't already hate him.
The same will apply to all Trump's weak points. You think the debates are an opportunity for the press to lock him down, to make him reveal his weak points once and for all in front of a national audience, but the reverse is true. What the audience will instead see is somebody given tough, nearly impossible questions, and who nonetheless has a competent answer to everything. This will impress everyone with how "presidential" Trump has become.
Also, wavering voters will see that Trump gets much tougher questions than Hillary. This will feed into Trump's claim that the media is biased against him. Of course, the reality is that Trump is a walking disaster area with so many more weaknesses to hit, but there's some truth to the claim that the media has a strong left-wing bias. Regardless of Trump's performance, the media will be on trial during the debate, and they'll lose.
The danger to Trump is that he goes off script, that his advisors haven't beaten it into his head hard enough that he's supposed to be social engineering, not just talking. That's been his greatest flaw so far. But, and this is a big "but", it's also been his biggest strength. By owning his gaffes, he's seen as a more authentic man of the people and not a slick politician. I point this out because we are all still working according to the rules of past elections, and Trump appears to have rewritten the rules for this election.
Anyway, this post is about social engineering, not politics. You should watch the debate, not for content, but for how well each candidate does social engineering. Watch how they field every question, then "bridge" to a prepared statement they've been practicing for months. Watch how the moderators try to take them "off message", and how the candidates put things back "on message". Watch how Clinton, while being friendly and natural, never ever gets "off message", and how you don't even notice that she's "bridging" to her message. Watch how Trump, though, will get flustered and off message. Watch how Hillary (almost) never loses control of her hand gestures, while Trump frequently does.
At least, this is what I'll be watching for. And I'll be live-tweeting, paraphrasing what the candidates are really saying, as egregiously as I can :).

EQGRP tools are post-exploitation

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/08/eqgrp-tools-are-post-exploitation.html

A recent leak exposed hacking tools from the "Equation Group", a group likely related to the NSA TAO (the NSA/DoD hacking group). I thought I'd write up some comments.

Despite the existence of 0days, these tools seem to be overwhelmingly post-exploitation. They aren’t the sorts of tools you use to break into a network — but the sorts of tools you use afterwards.

The focus of the tools appears to be hacking into network equipment, installing implants, achieving persistence, and using the equipment to sniff network traffic.

Different pentesters have different ways of doing things once they've gotten inside a network, and this is reflected in their toolkits. Some focus on Windows and getting domain admin control, and have tools like mimikatz. Others focus on webapps and how to install hostile PHP scripts. In this case, these tools reflect a methodology that goes after network equipment.

It's a good strategy. Finding equipment is easy and undetectable: just run a traceroute. As long as network equipment isn't causing problems, sysadmins ignore it, so your implants are unlikely to be detected. Internal network equipment is rarely patched, so old exploits are still likely to work. Some tools appear to target bugs in equipment that are likely older than the Equation Group itself.

In particular, because network equipment sits at the center of the network instead of at the edges, you can reach out and sniff packets through it. Half the time this is a built-in feature of the equipment, so no special implant is needed. Conversely, when you're at the edge of the network, switches often prevent you from sniffing packets, and even if you exploit the switch (e.g. by ARP flooding), all you get are nearby machines. Reaching critical machines across the network requires remotely hacking network devices.

So you see a group of pentest-type people (TAO hackers) with a consistent methodology, and toolmakers who develop and refine tools for them. Tool development is a rare thing among pentesters: they use tools, they don't develop them. Having programmers on staff dramatically changes the nature of pentesting.

Consider the program xml2pcap. I don't know exactly what it does, but it looks like similar tools I've written in my own pentests. Various network devices will allow you to sniff packets, but produce output in custom formats. Therefore, you need a quick-and-dirty tool that converts that weird format back into the standard pcap format for use with tools like Wireshark. More than once I've had to convert HTML/XML output to pcap. Setting port filters for 21 (FTP) and 23 (Telnet) produces low-bandwidth traffic with a high return (admin passwords) within networks; all you need to exploit this is a script that can convert the packets into the standard format.

Also consider the tftpd tool in the dump. Many network devices support that protocol (TFTP, which runs over UDP port 69) for updating firmware and configuration, and that's pretty much all it's used for. This points to a defensive security strategy for your organization: log all TFTP traffic.

The same applies to SNMP. By the way, SNMP vulnerabilities in network equipment are still low-hanging fruit. SNMP stores thousands of configuration parameters and statistics in a big tree, meaning that it has an enormous attack surface. Any settable, variable-length value (OCTET STRING, OBJECT IDENTIFIER) is something you can play with for buffer overflows and format-string bugs. The Cisco 0day in the toolkit was one example.

Some have pointed out that the code in the tools is crappy, and they make obvious crypto errors (such as using the same initialization vectors). This is nonsense. It’s largely pentesters, not software developers, creating these tools. And they have limited threat models — encryption is to avoid easy detection that they are exfiltrating data, not to prevent somebody from looking at the data.

From that perspective, then, this is fine code, with some effort spent on quality for tools that don't particularly need it. I'm a professional coder, and my little scripts often suck worse than the code I see here.

Lastly, I don't think it's a hack of the NSA themselves. Those people are over-the-top paranoid about opsec. But 95% of the US cyber-industrial-complex is made up of companies, which are much more lax about security than the NSA itself. It's probably one of those companies that got popped, such as an employee who went to DEFCON and accidentally left his notebook computer open on the hotel WiFi.

Conclusion

Despite the 0days, these appear to be post-exploitation tools. They look like the sort of tools pentesters might develop over years, where each time they pop a target, they do a little development based on the devices they find inside that new network in order to compromise more machines/data.

Raspberry Shake – your personal seismograph

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/raspberry-shake-personal-seismograph/

There are some applications for the Raspberry Pi that were a very long way from our minds back in 2009, when we were trying to come up with a computer to get kids programming again. I think it’s fair to say that we did not think we were building a personal seismograph.

Raspberry Shake has blown past its Kickstarter target of $7,000 to raise ten times that amount, and it’s still got a couple of days to go.

Raspberry Shake is sensitive enough to detect earthquakes of magnitude 2 and higher at a distance of 50 miles, and earthquakes of magnitude 4 or greater from 300 miles away. Angel Rodriguez, the maker, says:

It will also record earthquakes of larger magnitudes farther away but it will miss some of the subtleties. Raspberry Shake can detect and record short period (0.5 – 15 Hz) earthquakes; the farther away an earthquake, the less of that range of frequencies can be recorded.

Raspberry Shake seismograph

At the heart of this kit is a geophone: a device that converts movement into voltage. (Think of it as being a bit like a microphone for geology.) Inside the little geophone a coil moves relative to a magnet, creating current. Angel has a nice demonstration of how a geophone works:

What’s inside a Geophone

In order to get data from the ground, we need a sensor able to detect it. A geophone is a ground-motion transducer that converts ground movement into voltage. Raspberry Shake uses a geophone, and in this video we are going to show you what's inside of it.

The little add-on board amplifies and digitises the signal from the geophone, and feeds it to your Raspberry Pi.

The Raspberry Pi time-stamps the data, stores it in a seismic-industry-standard format, and serves it in answer to client requests. The results are displayed on your smartphone or computer monitor. The complete system is called a seismograph.

Angel and the other instrument builders behind the Raspberry Shake make seismographs and other equipment for a living. This device is the little brother of a seismograph his team makes for universities and other earthquake observers. It runs the same open-source software that the United States Geological Survey (USGS) uses.

Angel says:

Don't be fooled by the size and the price. Raspberry Shake is better than many of the short-period seismometers in current use by the local networks of the USGS and many developing countries. Several software vendors have, for the first time, provided personal no-cost licenses for this project.

Raspberry Shake will make observatory-quality data that can be shared in the worldwide-standard SEED format. All modern automated seismology programs used by observatories can use the data from the Raspberry Shake. It's the Volkswagen of seismometers: yes, there are Lamborghini seismographs, but both the Lamborghini and the Volkswagen will get you from point A to point B.

To prove it, here’s some data from a Raspberry Shake ($99 if you back the Kickstarter now) against data from a $50,000 professional seismograph. In this image the Raspberry Shake’s data is displayed at the top. Both devices are showing data from the same regional earthquake.


Data from Raspberry Shake (top) and Nanometrics Trillium Compact (bottom)

Bringing the affordability of a piece of kit like this down to consumer levels is a real achievement: previously this sort of equipment has only been available to universities, governments and other bodies with the ability to make very big investments. As you’ve probably gathered, we love it: head over to back Raspberry Shake on Kickstarter quickly, before the opportunity’s gone!


Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1

Post Syndicated from Ryan Nienhuis original https://blogs.aws.amazon.com/bigdata/post/Tx2D4GLDJXPKHOY/Writing-SQL-on-Streaming-Data-with-Amazon-Kinesis-Analytics-Part-1

Ryan Nienhuis is a Senior Product Manager for Amazon Kinesis

This is the first of two AWS Big Data blog posts on Writing SQL on Streaming Data with Amazon Kinesis Analytics. In this post, I provide an overview of streaming data and key concepts like the basics of streaming SQL, and complete a walkthrough using a simple example. In the next post, I will cover more advanced stream processing concepts using Amazon Kinesis Analytics.

Most organizations use batch data processing to perform their analytics in daily or hourly intervals to inform their business decisions and improve their customer experiences. However, you can derive significantly more value from your data if you are able to process and react in real time. Indeed, the value of insights in your data can decline rapidly over time – the faster you react, the better. For example:

  • Analyzing your company’s key performance indicators over the last 24 hours is a better reflection of your current business than analyzing last month’s metrics.
  • Reacting to an operational event as it is happening is far more valuable than discovering a week later that the event occurred. 
  • Identifying that a customer is unable to complete a purchase on your ecommerce site so you can assist them in completing the order is much better than finding out next week that they were unable to complete the transaction.

Real-time insights are extremely valuable, but difficult to extract from streaming data. Processing data in real time can be difficult because it needs to be done quickly and continuously to keep up with the speed at which the data is produced. In addition, the analysis may require data to be processed in the same order in which it was generated for accurate results, which can be hard due to the distributed nature of the data.

Because of these complexities, people start by implementing simple applications that perform streaming ETL, such as collecting, validating, and normalizing log data across different applications. Some then progress to basic processing like rolling min-max computations, while a select few implement sophisticated processing such as anomaly detection or correlating events by user sessions.  With each step, more and more value is extracted from the data but the difficulty level also increases.

With the launch of Amazon Kinesis Analytics, you can now easily write SQL on streaming data, providing a powerful way to build a stream processing application in minutes. The service allows you to connect to streaming data sources, process the data with sub-second latencies, and continuously emit results to downstream destinations for use in real-time alerts, dashboards, or further analysis.

This post introduces you to Amazon Kinesis Analytics, the fundamentals of writing ANSI-Standard SQL over streaming data, and works through a simple example application that continuously generates metrics over time windows.

What is streaming data?

Today, data is generated continuously from a large variety of sources, including clickstream data from mobile and web applications, ecommerce transactions, application logs from servers, telemetry from connected devices, and many other sources.

Typically, hundreds to millions of these sources create data that is usually small (order of kilobytes) and occurs in a sequence. For example, your ecommerce site has thousands of individuals concurrently interacting with the site, each generating a sequence of events based upon their activity (click product, add to cart, purchase product, etc.). When these sequences are captured continuously from these sources as events occur, the data is categorized as streaming data.

Amazon Kinesis Streams

Capturing event data with low latency and durably storing it in a highly available, scalable data store, such as Amazon Kinesis Streams, is the foundation for streaming data. Streams enables you to capture and store data for ordered, replayable, real-time processing using a streaming application. You configure your data sources to emit data into the stream, then build applications that read and process data from that stream in real-time. To build your applications, you can use the Amazon Kinesis Client Library (KCL), AWS Lambda, Apache Storm, and a number of other solutions, including Amazon Kinesis Analytics.

Amazon Kinesis Firehose

One of the more common use cases for streaming data is to capture it and then load it to a cloud storage service, a database, or other analytics service. Amazon Kinesis Firehose is a fully managed service that offers an easy-to-use solution to collect and deliver streaming data to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.

With Firehose, you create delivery streams using the AWS Management Console to specify your destinations of choice and choose from configuration options that allow you to batch, compress, and encrypt your data before it is loaded into the destination. From there, you set up your data sources to start sending data to the Firehose delivery stream, which loads it continuously to your destinations with no ongoing administration.

Amazon Kinesis Analytics

Amazon Kinesis Analytics provides an easy and powerful way to process and analyze streaming data with standard SQL. Using Analytics, you build applications that continuously read data from streaming sources, process it in real-time using SQL code, and emit the results downstream to your configured destinations.

An Analytics application can ingest data from Streams and Firehose. For common formats, the service detects a schema associated with the data in your source, which you can further refine using an interactive schema editor. Your application's SQL code can be anything from a simple count or average to more advanced analytics like correlating events over time windows. You author your SQL using an interactive editor, and then test it with live streaming data.

Finally, you configure your application to emit SQL results to up to four destinations, including S3, Amazon Redshift, and Amazon Elasticsearch Service (through a Firehose delivery stream); or to an Amazon Kinesis stream. After setup, the service scales your application to handle your query complexity and streaming data throughput – you don’t have to provision or manage any servers.

Walkthrough (part 1): Run your first SQL query using Amazon Kinesis Analytics

The easiest way to understand Amazon Kinesis Analytics is to try it out; you need an AWS account to get started. Partway through, I interrupt the walkthrough to discuss time windows in more detail, and then you create a second SQL query with more metrics and an additional step for your application.

A streaming application consists of three components:

  • Streaming data sources
  • Analytics written in SQL
  • Destinations for the results

The application continuously reads data from a streaming source, generates analytics using your SQL code, and emits those results to up to four destinations. This walkthrough will cover the first two steps and point you in the right direction for completing an end-to-end application by adding a destination for your SQL results.

Create an Amazon Kinesis Analytics application

  1. Open the Amazon Kinesis Analytics console and choose Create a new application.

  2. Provide a name and (optional) description for your application and choose Continue.

You are taken to the application hub page.

Create a streaming data source

For input, Analytics supports Amazon Kinesis Streams and Amazon Kinesis Firehose as streaming data sources, as well as reference data input through S3. The primary difference between the two source types is that data is read continuously from streaming data sources but only once from reference data sources. Reference data sources are used for joining against the incoming stream to enrich the data.

In Amazon Kinesis Analytics, choose Connect to a source.

If you have existing Amazon Kinesis streams or Firehose delivery streams, they are shown here.

For the purposes of this post, you will be using a demo stream, which creates and populates a stream with sample data on your behalf. The demo stream is created under your account with a single shard, which supports up to 1 MB/sec of write throughput and 2 MB/sec of read throughput. Analytics will write simulated stock ticker data to the demo stream directly from your browser. Your application will read data from the stream in real time.

Next, choose Configure a new stream and Create demo stream.

Later, you will refer to the demo stream in your SQL code as “SOURCE_SQL_STREAM_001”. Analytics calls the DiscoverInputSchema API action, which infers a schema by sampling records from your selected input data stream. You can see the applied schema on your data in the formatted sample shown in the browser, as well as the original sample taken from the raw stream. You can then edit the schema to fine tune it to your needs.

Feel free to explore; when you are ready, choose Save and continue. You are taken back to the streaming application hub.

Create a SQL query for analyzing data

On the streaming application hub, choose Go to SQL Editor and Run Application.

This SQL editor is the development environment for Amazon Kinesis Analytics. On the top portion of the screen, there is a text editor with syntax highlighting and intelligent auto-complete, as well as a number of SQL templates to help you get started. On the bottom portion of the screen, there is an area for you to explore your source data, your intermediate SQL results, and the data you are writing to your destinations. You can view the entire flow of your application here, end-to-end.

Next, choose Add SQL from Template.

Amazon Kinesis Analytics provides a number of SQL templates that work with the demo stream. Feel free to explore; when you’re ready, choose the COUNT, AVG, etc. (aggregate functions) + Tumbling time window template and choose Add SQL to Editor.

The SELECT statement in this SQL template performs a count over a 10-second tumbling window. A window is used to group rows together relative to the current row that the Amazon Kinesis Analytics application is processing.

Choose Save and run SQL. Congratulations, you just wrote your first SQL query on streaming data!

Streaming SQL with Amazon Kinesis Analytics

In a relational database, you work with tables of data, using INSERT statements to add records and SELECT statements to query the data in a table. In Amazon Kinesis Analytics, you work with in-application streams, which are similar to tables in that you can CREATE, INSERT, and SELECT from them. However, unlike a table, data is continuously inserted into an in-application stream, even while you are executing a SQL statement against it. The data in an in-application stream is therefore unbounded.

In your application code, you interact primarily with in-application streams. For instance, a source in-application stream represents your configured Amazon Kinesis stream or Firehose delivery stream in the application, which by default is named “SOURCE_SQL_STREAM_001”. A destination in-application stream represents your configured destinations, which by default is named “DESTINATION_SQL_STREAM”. When interacting with in-application streams, the following is true:

  • The SELECT statement is used in the context of an INSERT statement. That is, when you select rows from one in-application stream, you insert results into another in-application stream.
  • The INSERT statement is always used in the context of a pump. That is, you use pumps to write to an in-application stream. A pump is the mechanism used to make an INSERT statement continuous.

There are two separate SQL statements in the template you selected in the first walkthrough. The first statement creates a target in-application stream for the SQL results; the second statement creates a PUMP for inserting into that stream and includes the SELECT statement.
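Stripped to its skeleton, the two statements look something like this sketch (the second walkthrough below extends it with more columns):

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    ticker_symbol_count INTEGER);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM ticker_symbol,
              COUNT(*) AS ticker_symbol_count
FROM "SOURCE_SQL_STREAM_001"
-- The FLOOR expression buckets rows into 10-second tumbling windows.
GROUP BY ticker_symbol, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);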

Generating real-time analytics using windows

In the console, look at the SQL results from the walkthrough, which are sampled and continuously streamed to the console.

In the example application you just built, you used a 10-second tumbling time window to perform an aggregation of records. Notice the special column called ROWTIME, which represents the time a row was inserted into the first in-application stream. The ROWTIME value increments every 10 seconds with each new set of SQL results. (Some 10-second windows may not be shown in the console because we sample results on the high-speed stream.) You use this special column in your tumbling time window to help define the start and end of each result set.

Windows are important because they define the bounds for which you want your query to operate. The starting bound is usually the current row that Amazon Kinesis Analytics is processing, and the window defines the ending bound. Windows are required with any query that works across rows, because the in-application stream is unbounded, and windows provide a mechanism to bound the result set and make the query deterministic. Analytics supports three types of windows: tumbling, sliding, and custom windows. These concepts will be covered in depth in our next blog post.

Tumbling windows, like the one you selected in your template, are useful for periodic reports. You could use a tumbling window to compute the average number of visitors to your website over the last 5 minutes, or the maximum over the past hour. A single result is emitted for each key in the group, as specified by the GROUP BY clause, at the end of each window.
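For instance, assuming a destination stream with matching columns has already been created, a sketch of a one-minute tumbling count per ticker over the demo stream could look like this (the pump name is just illustrative):

CREATE OR REPLACE PUMP "MINUTE_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM ticker_symbol,
              COUNT(*) AS ticker_symbol_count
FROM "SOURCE_SQL_STREAM_001"
-- FLOOR(ROWTIME TO MINUTE) buckets rows into one-minute tumbling windows.
GROUP BY ticker_symbol, FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);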

In streaming data, there are different types of time, and how they are used is important to the analytics. Our example uses ROWTIME, or processing time, which is great for some use cases. However, in many scenarios, you want a time that more accurately reflects when the event occurred, such as the event time or the ingest time. Amazon Kinesis Analytics supports all three time semantics for processing data: processing, event, and ingest time. These concepts will be covered in depth in our next blog post.
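As a quick sketch of the difference, the two built-in times can be selected side by side; an event time, if your records carry one, would come from a timestamp column in your own data:

SELECT STREAM "SOURCE_SQL_STREAM_001".ROWTIME AS processing_time,
              -- Ingest time: when the record arrived at the stream.
              APPROXIMATE_ARRIVAL_TIME AS ingest_time,
              ticker_symbol
FROM "SOURCE_SQL_STREAM_001";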

Walkthrough (part 2): Run your second SQL query using Amazon Kinesis Analytics

The next part of the walkthrough adds some additional metrics to your first SQL query and adds a second step to your application.

Add metrics to the SQL statement

In the SQL editor, add some additional SQL code.

First, add some metrics including the average price, average change, maximum price, and minimum price over the same window. Note that you need to add these in your SELECT statement as well as the in-application stream you are inserting into, DESTINATION_SQL_STREAM.

Second, add the sector to the query so you have additional information about the stock ticker. Note that the sector must be added to both the SELECT and GROUP BY clauses.

When you are finished, your SQL code should look like the following:

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16), 
    ticker_symbol_count INTEGER,
    avg_price REAL,
    avg_change REAL,
    max_price REAL,
    min_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM   ticker_symbol,
                sector,
                COUNT(*) AS ticker_symbol_count,
                AVG(price) as avg_price,
                AVG(change) as avg_change,
                MAX(price) as max_price,
                MIN(price) as min_price
FROM "SOURCE_SQL_STREAM_001"
GROUP BY ticker_symbol, sector, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);

Choose Save and run SQL.

Add a second step to your SQL code

Next, add a second step to your SQL code. You can use in-application streams to store intermediate SQL results, which can then be used as input for additional SQL statements. This allows you to build applications with multiple serial steps before sending the results to the destination of your choice. You can also use in-application streams to perform multiple steps in parallel and send the results to multiple destinations.

First, change the DESTINATION_SQL_STREAM name in your two SQL statements to be INTERMEDIATE_SQL_STREAM.

Next, add a second SQL step that selects from INTERMEDIATE_SQL_STREAM and INSERTS into a DESTINATION_SQL_STREAM. The SELECT statement should filter only for companies in the TECHNOLOGY sector using a simple WHERE clause. You must also create the DESTINATION_SQL_STREAM to insert SQL results into. Your final application code should look like the following:

CREATE OR REPLACE STREAM "INTERMEDIATE_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16), 
    ticker_symbol_count INTEGER,
    avg_price REAL,
    avg_change REAL,
    max_price REAL,
    min_price REAL);

CREATE OR REPLACE  PUMP "STREAM_PUMP" AS INSERT INTO "INTERMEDIATE_SQL_STREAM"
SELECT STREAM   ticker_symbol,
                sector,
                COUNT(*) AS ticker_symbol_count,
                AVG(price) as avg_price,
                AVG(change) as avg_change,
                MAX(price) as max_price,
                MIN(price) as min_price
FROM "SOURCE_SQL_STREAM_001"
GROUP BY ticker_symbol, sector, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16), 
    ticker_symbol_count INTEGER,
    avg_price REAL,
    avg_change REAL,
    max_price REAL,
    min_price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP_02" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM   ticker_symbol, sector, ticker_symbol_count, avg_price, avg_change, max_price, min_price
FROM "INTERMEDIATE_SQL_STREAM"
WHERE sector = 'TECHNOLOGY';

Choose Save and run SQL.

You can see both of the in-application streams on the left side of the Real-time analytics tab, and select either to see each step in your application for end-to-end visibility.

From here, you can add a destination for your SQL results, such as an Amazon S3 bucket. After setup, your application continuously reads data from the streaming source, processes it using your SQL code, and emits the results to your configured destination.

Clean up

The final step is to clean up. Take the following steps to avoid incurring charges.

  1. Delete the Streams demo stream.
  2. Stop the Analytics application.

Summary

Previously, real-time stream data processing was only accessible to those with the technical skills to build and manage a complex application. With Amazon Kinesis Analytics, anyone familiar with the ANSI SQL standard can build and deploy a stream data processing application in minutes.

The application you just built provides a managed and elastic data processing pipeline using Analytics that calculates useful results over streaming data. Results are calculated as they arrive, and you can configure a destination to deliver them to a persistent store like Amazon S3.

It’s simple to get this solution working for your use case. All that is required is to replace the Amazon Kinesis demo stream with your own, and then set up data producers. From there, configure the analytics and you have an end-to-end solution for capturing, processing, and durably storing streaming data.

If you have questions or suggestions, please comment below.

 

Amazon Kinesis Analytics – Process Streaming Data in Real Time with SQL

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-kinesis-analytics-process-streaming-data-in-real-time-with-sql/

As you may know, Amazon Kinesis greatly simplifies the process of working with real-time streaming data in the AWS Cloud. Instead of setting up and running your own processing and short-term storage infrastructure, you simply create a Kinesis Stream or Kinesis Firehose, arrange to pump data into it, and then build an application to process or analyze it.

While it is relatively easy to build streaming data solutions using Kinesis Streams and Kinesis Firehose, we want to make it even easier. We want you, whether you are a procedural developer, a data scientist, or a SQL developer, to be able to process voluminous clickstreams from web applications, telemetry and sensor reports from connected devices, server logs, and more using a standard query language, all in real time!

Amazon Kinesis Analytics
Today I am happy to be able to announce the availability of Amazon Kinesis Analytics. You can now run continuous SQL queries against your streaming data, filtering, transforming, and summarizing the data as it arrives. You can focus on processing the data and extracting business value from it instead of wasting your time on infrastructure. You can build a powerful, end-to-end stream processing pipeline in 5 minutes without having to write anything more complex than a SQL query.

When I think of running a series of SQL queries against a database table, I generally think of the data as staying more or less static while the queries come and go pretty quickly. Rows are added, changed, and deleted all the time, but this does not generally matter when considering a single query that runs at a particular point in time. Running a Kinesis Analytics query against streaming data turns this model sideways.  The queries are long-running and the data changes many times per second as new records, observations, or log entries arrive. Once you wrap your head around this, you will see that the query processing model is very easy to understand: You build persistent queries that process records as they arrive.

In order to control the set of records that will be processed by a given query, you make use of a processing “window.” Kinesis Analytics supports three different types of windows:

Tumbling windows are used for periodic reports. You could use a tumbling window to summarize data over time. Perhaps you get thousands or millions of requests per second, and would like to know how many arrive each minute. When the current tumbling window closes, the next one begins after it. A new result is generated each time the window fills up.

Sliding windows are used for monitoring and other types of trend detection. For example, you could use a sliding window to compute a real-time moving average for an error rate. Records enter the window, contribute to the result as long as they are within it, and the window advances. A new result is generated each time a new record enters the window. You can adjust the size of the window to control the sensitivity of the results.
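A sketch of a one-minute moving average per ticker in streaming SQL (pump and window names are illustrative, and a matching destination stream is assumed) might look like this:

CREATE OR REPLACE PUMP "SLIDING_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM ticker_symbol,
              -- Average over the trailing minute, per ticker; a new result
              -- is emitted as each record enters the window.
              AVG(price) OVER LAST_MINUTE AS avg_price
FROM "SOURCE_SQL_STREAM_001"
WINDOW LAST_MINUTE AS (PARTITION BY ticker_symbol RANGE INTERVAL '1' MINUTE PRECEDING);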

Custom windows are used when the appropriate grouping is not strictly based on time. If you are processing clickstream data or server logs, you can use a custom window to perform an action known as sessionization. In other words, you can bound each query by the first and last actions performed by each user, as identified by a session identifier within the incoming data. You can write a query that computes the number of pages visited by each user or the time that they spend on your site.

While all of this might sound somewhat complicated, it is actually pretty easy to implement. Kinesis Analytics will analyze a sample of the incoming records and then propose a suitable schema. You can use it as-is, or you can fine-tune it to better reflect your actual data model. Once the schema has been defined, you can use the built-in SQL editor (complete with syntax checking and easy testing against live data). You can configure Kinesis Analytics to route the results of the query to up to four destinations including Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, or an Amazon Kinesis Stream.

When you build your first Amazon Kinesis Analytics application you need to write a pair of cooperating SQL statements (more complex applications can use more, but all it takes is two to get up and running):

A statement to create an in-application stream to store intermediate SQL results (a stream is like a SQL table that is continuously updated, which you can select from and insert into).

Your SQL query, which selects from one in-application stream and inserts into another in-application stream.

Your SQL statements can also JOIN the records against reference data that originates in S3. This can be handy when you want to enhance or modify the records to include additional, perhaps more descriptive, information.
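As a sketch, assuming an S3-backed reference table named COMPANY_REF with a company_name column (both names hypothetical), and a target stream created to match, the enrichment join would look like this:

CREATE OR REPLACE PUMP "ENRICH_PUMP" AS INSERT INTO "ENRICHED_SQL_STREAM"
SELECT STREAM "SOURCE_SQL_STREAM_001".ticker_symbol,
              c.company_name,
              "SOURCE_SQL_STREAM_001".price
FROM "SOURCE_SQL_STREAM_001"
-- COMPANY_REF is the in-application table configured from your S3 object.
JOIN "COMPANY_REF" AS c
ON "SOURCE_SQL_STREAM_001".ticker_symbol = c.ticker_symbol;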

Amazon Kinesis Analytics in Action
Let’s spend a few minutes looking at Amazon Kinesis Analytics in action!

I log in to the Amazon Kinesis Analytics Console and click on Create new application. Then I enter a name and a description for my app:

Now I can manage my data source, my queries, and the destination(s):

I can select one of my existing input streams:

Or I can configure a new one (I’ll do that):

I click on Create demo stream to create a stream that will be populated with sample stock ticker data. This takes 30 to 40 seconds!

Kinesis Analytics peeks at the stream and proposes a schema. I can accept it as-is or fine tune it:

Then I hop over to the SQL editor. It offers to start my app. That seems like a good idea, so I agree  and click on Yes, start application:

Here’s the actual SQL editor:

I can write my query from scratch or I can use a template:

I picked Continuous filter; here’s the SQL:
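The Continuous filter template pairs a destination stream with a pump whose SELECT applies a WHERE clause to each arriving row. A sketch along these lines, with column types following the demo stream's schema:

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16),
    change REAL,
    price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM ticker_symbol, sector, change, price
FROM "SOURCE_SQL_STREAM_001"
-- Continuous filter: only rows matching the predicate are emitted.
WHERE sector SIMILAR TO '%TECH%';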

I inspected it, nodded in agreement, and then clicked on Save and run SQL.  Within seconds, results began to flow in and were visible in the Console:

I used the SQL editor to modify the query to remove the sector and price columns and ran the query again. When I did this I learned that I needed to remove the columns from the CREATE STREAM statement (this is obvious in retrospect but it was the end of a long day).

Here’s the revised result set:

In most cases the next step would be to route the results to a new or existing stream. I can do that from the Console:

With just a couple of clicks and a little bit of typing, I have created an Amazon Kinesis Analytics app that is capable of processing a production-scale stock ticker stream. This "demo" needs no changes whatsoever before being used in production. I think that's kind of cool.

Learn More & Try it Yourself
As usual, I have barely scratched the surface of this exciting new service!  To learn more, you should read the new post, Writing SQL on Streaming Data with Amazon Kinesis Analytics.

You should be able to replicate my steps above in 5 minutes or less and I strongly recommend that you do so. Create your application, customize the SQL query, and learn how to process streaming data at scale.

Available Now
Amazon Kinesis Analytics is available now and you can start running queries against your streaming data today!


Jeff;

Skycademy 2016

Post Syndicated from Dan Fisher original https://www.raspberrypi.org/blog/skycademy-2016/

Over the next three days, we have 30 educators arriving at Pi Towers to learn how to build, launch, and track a High Altitude Balloon (HAB). For the uninitiated, Skycademy 2016 is our second CPD event; it gives educators experience of launching balloons and shows them how this can be used for an inspiring, project-based learning experience.


This is my first year preparing for Skycademy, and it has been a steep but worthwhile learning curve. Launching a HAB combines aspects of maths, physics, computing, design and technology, and geography, and the sheer scope of the project means that it’s rare for school-age children to get these types of experiences. It’s great news, then, that Raspberry Pi have the in-house skills, ambition, and commitment to run such things, and train others to run them too.

Skycademy runs over three days: on the first day, delegates form teams and take part in several workshops aimed at planning and building their flight. Day Two sees them launch, track, and recover their payload. Day Three has them regroup to reflect and plan for the year ahead. The support doesn’t end there: our Skycademy graduates go on to take part in a year-long project that will see them launch flights at their own schools and organisations, helped by their own students.

Tracking tomorrow’s launch

If you’re interested in watching the launch tomorrow, you can follow our progress by searching for #skycademy on Twitter. You can also use the links below to track the progress of different teams. Today, you will begin to see their payloads appearing on the map, and tomorrow you’ll be able to follow the chase.

 

  • All teams: tracking at rpf.io/flights, images at rpf.io/flights/images
  • Alto: rpf.io/alto, rpf.io/alto/images
  • Cirrus: rpf.io/cirrus, rpf.io/cirrus/images
  • Cumulus: rpf.io/cumulus, rpf.io/cumulus/images
  • Nimbus: rpf.io/nimbus, rpf.io/nimbus/images
  • Stratus: rpf.io/stratus, rpf.io/stratus/images

 

Our current launch plan is to set the balloons free slightly to the west of Cambridge around 10am, but we’ll be posting updates to Twitter.

If you aren’t lucky enough to be taking part in Skycademy today, don’t worry: we’ll be making lots of resources available in the near future for anyone to access and run their own flights. Alternatively, you can also visit Dave Akerman’s website for lots of HAB information and guides to get you started.

Welcome to Dan Fisher’s ‘Fun with HABs’

I recently found out what lay in store for our latest crop of educators when I took part in a test launch two weeks ago…

We made our way to the launch site at Elsworth, Cambridgeshire, feeling nervous and excited. We arrived at 09.30, as experts Dave Akerman and Steve Randall were already starting to assemble their kit. The hope was that we might actually be able to break the world record for the highest amateur unmanned balloon flight. Dave and Steve are continually leapfrogging each other for this title.


The payload Dave assembled weighs about 250g and consists of a Raspberry Pi A+ connected to Pi-In-The-Sky (PITS) and LoRa boards. The lighter the payload, the higher the potential altitude. The boards broadcast packets of data back to earth, which can be decoded by our tracking equipment.


Surprisingly, the payload’s chassis assembly is hardly high-tech: a polystyrene capsule gaffer-taped to some nylon cord and balsa wood, to which the balloon and parachute are attached. For this launch, Dave and Steve used hydrogen rather than helium, as it enables you to achieve higher altitudes. Having no previous experience working with pure hydrogen, I had visions of some kind of disaster happening.

Hindenburg

We weighed the payload to calculate how much hydrogen we would need to fill the balloon and ensure the correct ascent rate. Too much hydrogen means the balloon ascends too quickly and might burst early. Too little hydrogen results in a slow balloon which might not burst at all, and could float away and be lost.


After Dave filled the balloon with hydrogen, we attached the real payload (lots more gaffer tape) and we were ready for a good ol’ launch ‘n’ track. However, as is often the case, it didn’t exactly go to plan…

Home, home on the range

Picture the scene: two Raspberry Pi staffers are driving off-road through a military firing range. Behind the wheel is Dave Akerman, grinning broadly.

“It’s so much more interesting when they don’t just land in a ditch,” he says, speeding the SUV over another pothole.

We've tracked our high altitude balloon for two hours to an area of land in Thetford Forest, Norfolk, which is used for live ammo practice: not somewhere you'd want to go without permission. Access is looking unlikely until we get a call from the nearby army base's ops team: we're in. We make our way past the firing range and into the woods.


After tracking as far as we can by car, we continue on foot until we spot the payload about ten metres up in a fir tree with very few branches. There’s no way of climbing up. Fortunately, Dave has come armed with the longest telescopic pole I’ve ever seen. It even has a hook on the business end for snagging the parachute’s cords. I act as a spotter as Dave manoeuvres the pole into position and tugs the payload free.


Giddy with the unexpected success of our recovery, we head back to the SUV and make for the exit, only to find we’ve been locked in. Scenarios where we’ve unwittingly become contestants in the next Hunger Games cross my mind. Armed only with long plastic poles, I worry we might be early casualties.


After feverish calls to the base again, they agree to come out and free us: a man in a MoD jacket dramatically smashes the lock with a hammer. We race back to Cambridge HQ, payload in hand and with a story to tell.

The Great Escape

Uploaded by David Akerman on 2016-07-26.

That’s it for now; look out for our post-Skycademy follow-up post soon!


How to Remove Single Points of Failure by Using a High-Availability Partition Group in Your AWS CloudHSM Environment

Post Syndicated from Tracy Pierce original https://blogs.aws.amazon.com/security/post/Tx7VU4QS5RCK7Q/How-to-Remove-Single-Points-of-Failure-by-Using-a-High-Availability-Partition-Gr

A hardware security module (HSM) is a hardware device designed with the security of your data and cryptographic key material in mind. It is tamper-resistant hardware that prevents unauthorized users from prying open the device, plugging in extra devices to access data or keys (such as subtokens), or damaging the outside housing. If any such interference occurs, the device wipes all stored information so that unauthorized parties do not gain access to your data or cryptographic key material. A high-availability (HA) setup can be beneficial because, with multiple HSMs kept in different data centers and all data synced between them, the loss of one HSM does not mean the loss of your data.

In this post, I will walk you through steps to remove single points of failure in your AWS CloudHSM environment by setting up an HA partition group. Single points of failure occur when a single CloudHSM device fails in a non-HA configuration, which can result in the permanent loss of keys and data. The HA partition group, however, allows for one or more CloudHSM devices to fail, while still keeping your environment operational.

Prerequisites

You will need a few things to build your HA partition group with CloudHSM:

  • 2 CloudHSM devices. AWS offers a free two-week trial. AWS will provision the trial for you and send you the CloudHSM information such as the Elastic Network Interface (ENI) and the private IP address assigned to the CloudHSM device so that you may begin testing. If you have used CloudHSM before, another trial cannot be provisioned, but you can set up production CloudHSM devices on your own. See Provisioning Your HSMs.
  • A client instance from which to access your CloudHSM devices. You can create this manually, or via an AWS CloudFormation template. You can connect to this instance in your public subnet, and then it can communicate with the CloudHSM devices in your private subnets.
  • An HA partition group, which ensures the syncing and load balancing of all CloudHSM devices you have created.

The CloudHSM setup process takes about 30 minutes from beginning to end. By the end of this how-to post, you should be able to set up multiple CloudHSM devices and an HA partition group in AWS with ease. Keep in mind that each production CloudHSM device you provision comes with an up-front fee of $5,000. You are not charged for any CloudHSM devices provisioned for a trial, unless you decide to move them to production when the trial ends.

If you decide to move your provisioned devices to your production environment, you will be billed $5,000 per device. If you decide to stop the trial so as not to be charged, you have up to 24 hours after the trial ends to let AWS Support know of your decision.

Solution overview

How HA works

HA is a feature of the Luna SA 7000 HSM hardware device AWS uses for its CloudHSM service. (Luna SA 7000 HSM is also known as the “SafeNet Network HSM” in more recent SafeNet documentation. Because AWS documentation refers to this hardware as “Luna SA 7000 HSM,” I will use this same product name in this post.) This feature allows more than one CloudHSM device to be placed as members in a load-balanced group setup. By having more than one device on which all cryptographic material is stored, you remove any single points of failure in your environment.

You access your CloudHSM devices in this HA partition group through one logical endpoint, which distributes traffic to the CloudHSM devices that are members of this group in a load-balanced fashion. Even though traffic is balanced between the HA partition group members, any new data or changes in data that occur on any CloudHSM device will be mirrored for continuity to the other members of the HA partition group. A single HA partition group is logically represented by a slot, which is physically composed of multiple partitions distributed across all HA nodes. Traffic is sent through the HA partition group, and then distributed to the partitions that are linked. All partitions are then synced so that data is persistent on each one identically.

The following diagram illustrates the HA partition group functionality.

  1. Application servers send traffic to your HA partition group endpoint.
  2. The HA partition group takes all requests and distributes them evenly between the CloudHSM devices that are members of the HA partition group.
  3. Each CloudHSM device mirrors itself to each other member of the HA partition group to ensure data integrity and availability.

Automatic recovery

If you ever lose data, you want a hands-off, quick recovery. Before autoRecovery was introduced, you could take advantage of the redundancy and performance HA partition groups offer, but you were still required to manually intervene when a group member was lost.

HA partition group members may fail for a number of reasons, including:

  • Loss of power to a CloudHSM device.
  • Loss of network connectivity to a CloudHSM device. If network connectivity is lost, it will be seen as a failed device and recovery attempts will be made.

Recovery of partition group members will only work if the following are true:

  • HA autoRecovery is enabled.
  • There are at least two nodes (CloudHSM devices) in the HA partition group.
  • Connectivity is established at startup.
  • The recover retry limit is not reached (if reached or exceeded, the only option is manual recovery).

HA autoRecovery is not enabled by default and must be explicitly enabled by running the following command, which is found in Enabling Automatic Recovery.

>vtl haAdmin -autoRecovery -retry <count>

When enabling autoRecovery, set the -retry and -interval parameters. The -retry parameter can be a value between 0 and 500 (or -1 for infinite retries), and equals the number of times the CloudHSM device will attempt automatic recovery. The -interval parameter is in seconds and can be any value between 60 and 1200; it is the amount of time the CloudHSM device waits between automatic recovery attempts.

Setting up two Production CloudHSM devices in AWS

Now that I have discussed how HA partition groups work and why they are useful, I will show how to set up your CloudHSM environment and the HA partition group itself. To create an HA partition group environment, you need a minimum of two CloudHSM devices. You can have as many as 16 CloudHSM devices associated with an HA partition group at any given time. These must be associated with the same account and region, but can be spread across multiple Availability Zones, which is the ideal setup for an HA partition group. Automatic recovery is great for larger HA partition groups because it allows the devices to quickly attempt recovery and resync data in the event of a failure, without requiring manual intervention.

Set up the CloudHSM environment

To set up the CloudHSM environment, you must have a few things already in place:

  • An Amazon VPC.
  • At least one public subnet and two private subnets.
  • An Amazon EC2 client instance (m1.small running Amazon Linux x86 64-bit) in the public subnet, with the SafeNet client software already installed. This instance uses the key pair that you specified during creation of the CloudFormation stack. You can find a ready-to-use Amazon Machine Image (AMI) in our Community AMIs. Simply log into the EC2 console, choose Launch Instance, click Community AMIs, and search for CloudHSM. Because we regularly release new AMIs with software updates, searching for CloudHSM will show all available AMIs for a region. Select the AMI with the most recent client version.
  • Two security groups, one for the client instance and one for the CloudHSM devices. The security group for the client instance, which resides in the public subnet, will allow SSH on port 22 from your local network. The security group for the CloudHSM devices, which resides in the private subnet, will allow SSH on port 22 and NTLS on port 1792 from your public subnet. These will both be ingress rules (egress rules allow all traffic).
  • An Elastic IP address for the client instance.
  • An IAM role that delegates AWS resource access to CloudHSM. You can create this role in the IAM console:

    1. Click Roles and then click Create New Role.
    2. Type a name for the role and then click Next Step.
    3. Under AWS Service Roles, click Select next to AWS CloudHSM.
    4. In the Attach Policy step, select AWSCloudHSMRole as the policy. Click Next Step.
    5. Click Create Role.

We have a CloudFormation template available that will set up the CloudHSM environment for you:

  1. Go to the CloudFormation console.
  2. Choose Create Stack. Specify https://cloudhsm.s3.amazonaws.com/cloudhsm-quickstart.json as the Amazon S3 template URL.
  3. On the next two pages, specify parameters such as the Stack name, SSH Key Pair, Tags, and SNS Topic for alerts. You will find SNS Topic under the Advanced arrow. Then, click Create.

When the new stack is in the CREATE_COMPLETE state, you will have the IAM role to be used for provisioning your CloudHSM devices, the private and public subnets, your client instance with Elastic IP (EIP), and the security groups for both the CloudHSM devices and the client instance. The CloudHSM security group will already have its necessary rules in place to permit SSH and NTLS access from your public subnet; however, you still must add the rules to the client instance's security group to permit SSH access from your allowed IPs. To do this:

  1. In the VPC console, make sure you select the same region as the region in which your HSM VPC resides.
  2. Select the security group in your HSM VPC that will be used for the client instance.
  3. Add an inbound rule that allows TCP traffic on port 22 (SSH) from your local network IP addresses.
  4. On the Inbound tab, from the Create a new rule list, select SSH, and enter the IP address range of the local network from which you will connect to your client instance.
  5. Click Add Rule, and then click Apply Rule Changes.

After adding the IP rules for SSH (port 22) to your client instance’s security group, test the connection by attempting to make a SSH connection locally to your client instance EIP. Make sure to write down all the subnet and role information, because you will need this later.

Create an SSH key pair

The SSH key pair that you will now create will be used by CloudHSM devices to authenticate the manager account when connecting from your client instance. The manager account is simply the user that is permitted to SSH to your CloudHSM devices. Before provisioning the CloudHSM devices, you create the SSH key pair so that you can provide the public key to the CloudHSM during setup. The private key remains on your client instance to complete the authentication process. You can generate the key pair on any computer, on either Linux or Windows, as long as you ensure the private key ends up copied to the client instance. I cover both processes in this section.

In Linux, you will use the ssh-keygen command. By typing just this command into the terminal window, you will receive output similar to the following.

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is: df:c4:49:e9:fe:8e:7b:eb:28:d5:1f:72:82:fb:f2:69
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|             .   |
|            o    |
|           + .   |
|        S   *.   |
|         . =.o.o |
|          ..+ +..|
|          .o Eo .|
|           .OO=. |
+-----------------+
$

In Windows, use PuTTYgen to create your key pair:

  1. Start PuTTYgen. For Type of key to generate, select SSH-2 RSA.
  2. In the Number of bits in a generated key field, specify 2048.
  3. Click Generate.
  4. Move your mouse pointer around in the blank area of the Key section below the progress bar (to generate some randomness) until the progress bar is full.
  5. A private/public key pair has now been generated.
  6. In the Key comment field, type a name for the key pair that you will remember.
  7. Click Save public key and name your file.
  8. Click Save private key and name your file. It is imperative that you do not lose this key, so make sure to store it somewhere safe.
  9. Right-click the text field labeled Public key for pasting into OpenSSH authorized_keys file and choose Select All.
  10. Right-click again in the same text field and choose Copy.

The following screenshot shows what the PuTTYgen output will look like after you have created the key pair.

You must convert the keys created by PuTTYgen to OpenSSH format for use with other clients by using the following command.

ssh-keygen -i -f puttygen_key > openssh_key

The public key will be used to provision the CloudHSM device and the private key will be stored on the client instance to authenticate the SSH sessions. The public SSH key will look something like the following. If it does not, it is not in the correct format and must be converted using the preceding procedure.

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA6bUsFjDSFcPC/BZbIAv8cAR5syJMBGiEqzFOIEHbm0fPkkQ0U6KppzuXvVlc2u7w0mgPMhnkEfV6j0YBITu0Rs8rNHZFJsCYXpdoPxMMgmCf/FaOiKrb7+1xk21q2VwZyj13GPUsCxQhRW7dNidaaYTf14sbd9AqMUH4UOUjs27MhO37q8/WjV3wVWpFqexm3f4HPyMLAAEeExT7UziHyoMLJBHDKMN71Ok2kV24wwn+t9P/Va/6OR6LyCmyCrFyiNbbCDtQ9JvCj5RVBla5q4uEkFRl0t6m9XZg+qT67sDDoystq3XEfNUmDYDL4kq1xPM66KFk3OS5qeIN2kcSnQ==

Whether you are saving the private key on your local computer or moving it to the client instance, you must ensure that the file permissions are correct. You can do this by running the following commands (throughout this post, be sure to replace placeholder content with your own values). The first command sets the necessary permissions; the second command adds the private key to the authentication agent.

$ chmod 600 ~/.ssh/<private_key_file>
$ ssh-add ~/.ssh/<private_key_file>

Set up the AWS CLI tools

Now that you have your SSH key pair ready, you can set up the AWS CLI tools so that you can provision and manage your CloudHSM devices. If you used the CloudFormation template or the CloudHSM AMI to set up your client instance, you already have the CLI installed. You can check this by running cloudhsm version at the command prompt; the output should report the current version, 3.0.5. If you chose to use your own AMI and install the Luna SA software, you can install the CloudHSM CLI tools with the following steps.

$ wget https://s3.amazonaws.com/cloudhsm-software/CloudHsmCLI.egg
$ sudo easy_install-2.7 -s /usr/local/bin CloudHsmCLI.egg
$ cloudhsm version
{
      "Version": "<version>"
}

You must also set up file and directory ownership for your user on the client instance, including for the Chrystoki.conf file, which is the configuration file for the CloudHSM device. CloudHSM devices come from the factory ready to house cryptographic keys and perform cryptographic operations on data, but they must be configured to connect to your client instances:

  1. On the client instance, set the owner and write permission on the Chrystoki.conf file.
$ sudo chown <owner> /etc/Chrystoki.conf
$ sudo chmod +w /etc/Chrystoki.conf

The <owner> can be either the user or a group the user belongs to (for example, ec2-user).

  2. On the client instance, set the owner of the Luna client directory:
$ sudo chown <owner> -R <luna_client_dir>

The <owner> should be the same as the <owner> of the Chrystoki.conf file. The <luna_client_dir> differs based on the version of the LunaSA client software installed. New setups should be running version 5.3 or newer; if you have an older client with version 5.1 installed, use the 5.1 path:

  • Client software version 5.3: /usr/safenet/lunaclient/
  • Client software version 5.1: /usr/lunasa/

You also must configure the AWS CLI tools with AWS credentials to use for the API calls. These can be set by config files, by passing the credentials in the commands, or by instance profile. The most secure option, which eliminates the need to hard-code credentials in a config file, is to use an instance profile on your client instance. All CLI commands in this post are performed on a client instance launched with an IAM role that has CloudHSM permissions. If you want to set your credentials in a config file instead, remember that each CLI command must include --profile <profilename>, with <profilename> being the name you assigned in the config file for these credentials. See Configuring the AWS CloudHSM CLI Tools for help with setting up the AWS CLI tools.

You will then set up a persistent SSH tunnel for all CloudHSM devices to communicate with the client instance. This is done by editing the ~/.ssh/config file. Replace <CloudHSM_ip_address> with the private IP of your CloudHSM device, and replace <private_key_file> with the file location of your SSH private key created earlier (for example, /home/user/.ssh/id_rsa).

Host <CloudHSM_ip_address>
User manager
IdentityFile <private_key_file>

The client instance also needs client certificates to authenticate with the CloudHSM partitions or partition group. The location of these files differs depending on the LunaSA client software you are using. Again, new setups should be on version 5.3 or newer; older clients may still be on version 5.1:

  • Linux clients

    • Client software version 5.3: /usr/safenet/lunaclient/cert
    • Client software version 5.1: /usr/lunasa/cert
  • Windows clients

    • Client software version 5.3: %ProgramFiles%\SafeNet\LunaClient\cert
    • Client software version 5.1: %ProgramFiles%\LunaSA\cert

To create the client certificates, you can use the OpenSSL Toolkit or the LunaSA client-side vtl commands. The OpenSSL Toolkit is a program that allows you to manage TLS (Transport Layer Security) and SSL (Secure Sockets Layer) protocols, and it is commonly used to create SSL certificates for secure communication between internal network devices. The LunaSA client-side vtl commands are installed on your client instance along with the Luna software. If you used either CloudFormation or the CloudHSM AMI, the vtl commands are already installed for you. If you chose to launch a different AMI, you can download the Luna software. After you download the software, run the command linux/64/install.sh as root on a Linux instance, or run windows\64\LunaClient.msi on a Windows instance, and install the Luna SA option. I show certificate creation with both the OpenSSL Toolkit and LunaSA in the following section.

OpenSSL Toolkit

      $ openssl genrsa -out <luna_client_cert_dir>/<client_name>Key.pem 2048
      $ openssl req -new -x509 -days 3650 -key <luna_client_cert_dir>/<client_name>Key.pem -out <luna_client_cert_dir>/<client_name>.pem

The <luna_client_cert_dir> is the LunaSA Client certificate directory on the client and the <client_name> can be whatever you choose.

LunaSA

      $ sudo vtl createCert -n <client_name>

The output of the preceding LunaSA command will be similar to the following.

Private Key created and written to:

<luna_client_cert_dir>/<client_name>Key.pem

Certificate created and written to:

<luna_client_cert_dir>/<client_name>.pem

You will need these key file locations later on, so make sure you write them down or save them to a file. One last thing to do at this point is create the client Amazon Resource Name (ARN), which you do by running the following command.

$ cloudhsm create-client --certificate-file <luna_client_cert_dir>/<client_name>.pem

{
      "ClientARN": "<client_arn>",
      "RequestId": "<request_id>"
}

Also write down the client ARN in a safe location, because you will need it when registering your client instances to the HA partition group.

Provision your CloudHSM devices

Now for the fun and expensive part. Always remember that for each CloudHSM device you provision to your production environment, there is an upfront fee of $5,000. Because you need more than one CloudHSM device to set up an HA partition group, provisioning two CloudHSM devices to production will cost an upfront fee of $10,000.

If this is your first time trying out CloudHSM, you can have a two-week trial provisioned for you at no cost. The only cost will occur if you decide to keep your CloudHSM devices and move them into production. If you are unsure of the usage in your company, I highly suggest doing a trial first. You can open a support case requesting a trial at any time. You must have a paid support plan to request a CloudHSM trial.

To provision the two CloudHSM devices, SSH into your client instance and run the following CLI command.

$ cloudhsm create-hsm \
--subnet-id <subnet_id> \
--ssh-public-key-file <public_key_file> \
--iam-role-arn <iam_role_arn> \
--syslog-ip <syslog_ip_address>

The response should resemble the following.

{
      "HsmArn": "<hsm_arn>",
      "RequestId": "<request_id>"
}

Make note of each CloudHSM ARN because you will need them to initialize the CloudHSM devices and later add them to the HA partition group.

Initialize the CloudHSM devices

Configuring your CloudHSM devices, or initializing them as the process is formally called, is what allows you to set up the configuration files, certificate files, and passwords on the CloudHSM itself. Because you already have your CloudHSM ARNs from the previous section, you can run the describe-hsm command to get the EniId and the EniIp of the CloudHSM devices. Your results should be similar to the following.

$ cloudhsm describe-hsm --hsm-arn <hsm_arn>
{
     "EniId": "<eni_id>",
     "EniIp": "<eni_ip>",
     "HsmArn": "<hsm_arn>",
     "IamRoleArn": "<iam_role_arn>",
     "Partitions": [],
     "RequestId": "<request_id>",
     "SerialNumber": "<serial_number>",
     "SoftwareVersion": "5.1.3-1",
     "SshPublicKey": "<public_key_text>",
     "Status": "<status>",
     "SubnetId": "<subnet_id>",
     "SubscriptionStartDate": "2014-02-05T22:59:38.294Z",
     "SubscriptionType": "PRODUCTION",
     "VendorName": "SafeNet Inc."
}

Now that you know the EniId of each CloudHSM, you need to apply the CloudHSM security group to them. This ensures that connections can occur from any instance that has the client security group assigned. When a trial is provisioned for you, or when you provision CloudHSM devices yourself, the default security group of the VPC is automatically assigned to the ENI. You must change this to the security group that permits ingress on ports 22 and 1792 from your client instance.

To apply a CloudHSM security group to an EniId:

  1. Go to the EC2 console, and choose Network Interfaces in the left pane.
  2. Select the EniId of the CloudHSM.
  3. From the Actions drop-down list, choose Change Security Groups. Choose the security group for your CloudHSM device, and then click Save.

To complete the initialization process, you must ensure a persistent SSH connection is in place from your client to the CloudHSM. Remember the ~/.ssh/config file you edited earlier? Now that you have the IP address of the CloudHSM devices and the location of the private SSH key file, go back and fill in that config file’s parameters by using your favorite text editor.

Now, initialize the devices by using the initialize-hsm command with the information you gathered from the provisioning steps. The placeholder values in the following example stand in for your own naming and password conventions, and should be replaced with your information during the initialization of the CloudHSM devices.

$ cloudhsm initialize-hsm \
--hsm-arn <hsm_arn> \
--label <label> \
--cloning-domain <cloning_domain> \
--so-password <so_password>

The <label> is a unique name for the CloudHSM device that should be easy to remember; you can also use it as a descriptive label that tells what the device is for. The <cloning_domain> is a secret used to control cloning of key material from one CloudHSM to another. It can be any unique name that fits your company’s naming conventions, such as exampleproduction or exampledevelopment. If you are going to set up an HA partition group environment, the <cloning_domain> must be the same across all CloudHSMs. The <so_password> is the security officer password for the CloudHSM device; for ease of administration, it should be the same across all devices as well. It is important to use passwords and cloning domain names that you will remember, because they are unrecoverable and losing them means losing all data on a CloudHSM device. For your use, we do supply a Password Worksheet if you want to write down your passwords and store the printed page in a secure place.
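
To make the shape of the call concrete, here is what an initialization might look like with sample values filled in; the ARN, label, and cloning domain below are invented for this example.

$ cloudhsm initialize-hsm \
--hsm-arn arn:aws:cloudhsm:us-east-1:123456789012:hsm-00000000 \
--label hsm1-example \
--cloning-domain exampleproduction \
--so-password <so_password>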

Configure the client instance

Configuring the client instance is important because it is the secure link between you, your applications, and the CloudHSM devices. The client instance opens a secure channel to the CloudHSM devices and sends all requests over this channel so that the CloudHSM device can perform the cryptographic operations and key storage. Because you already have launched the client instance and mostly configured it, the only step left is to create the Network Trust Link (NTL) between the client instance and the CloudHSM. For this, we will use the LunaSA vtl commands again.

  1. Copy the server certificate from the CloudHSM to the client.
$ scp -i ~/.ssh/<private_key_file> manager@<CloudHSM_ip_address>:server.pem .
  2. Register the CloudHSM certificate with the client.
$ sudo vtl addServer -n <CloudHSM_ip_address> -c server.pem
New server <CloudHSM_ip_address> successfully added to server list.
  3. Copy the client certificate to the CloudHSM.
$ scp -i ~/.ssh/<private_key_file> <client_cert_directory>/<client_name>.pem manager@<CloudHSM_ip_address>:
  4. Connect to the CloudHSM.
$ ssh -i ~/.ssh/<private_key_file> manager@<CloudHSM_ip_address>
lunash:>
  5. Register the client.
lunash:> client register -client <client_id> -hostname <client_name>

The <client_id> and <client_name> should be the same for ease of use, and this should be the same as the name you used when you created your client certificate.

  6. On the CloudHSM, log in with the SO password.
lunash:> hsm login
  7. Create a partition on each CloudHSM (use the same name for ease of remembrance).
lunash:> partition create -partition <partition_name> -password <partition_password> -domain <cloning_domain>

The <partition_password> does not have to be the same as the SO password, and for security purposes, it should be different.

  8. Assign the client to the partition.
lunash:> client assignPartition -client <client_id> -partition <partition_name>
  9. Verify that the partition assignment went correctly.
lunash:> client show -client <client_id>
  10. Log in to the client and verify it has been properly configured.
$ vtl verify
The following Luna SA Slots/Partitions were found:
Slot    Serial #         Label
====    =========        ============
1      <serial_num1>     <partition_name>
2      <serial_num2>     <partition_name>

You should see an entry for each partition created on each CloudHSM device. This step lets you know that the CloudHSM devices and client instance were properly configured.

The partitions created and assigned in the previous steps are for testing purposes only and will not be used in the HA partition group setup. The HA partition group workflow will automatically create a partition on each CloudHSM device for its purposes. At this point, you have created the client and at least two CloudHSM devices, and you have set up and tested the connection between the client instance and the CloudHSM devices. The next step is to ensure fault tolerance by setting up the HA partition group.

Set up the HA partition group for fault tolerance

Now that you have provisioned multiple CloudHSM devices in your account, you will add them to an HA partition group. As I explained earlier in this post, an HA partition group is a virtual partition that represents a group of partitions distributed over many physical CloudHSM devices for HA. Automatic recovery is also a key factor in ensuring HA and data integrity across your HA partition group members. If you followed the previous procedures in this post, setting up the HA partition group should be relatively straightforward.

Create the HA partition group

First, you will create the actual HA partition group itself. Using the CloudHSM CLI on your client instance, run the following command to create the HA partition group and name it per your company’s naming conventions. In the following command, replace <label> with the name you chose.

$ cloudhsm create-hapg --group-label <label>

Register the CloudHSM devices with the HA partition group

Now, add the already initialized CloudHSM devices to the HA partition group. You will need to run the following command for each CloudHSM device you want to add to the HA partition group.

$ cloudhsm add-hsm-to-hapg \
--hsm-arn <hsm_arn> \
--hapg-arn <hapg_arn> \
--cloning-domain <cloning_domain> \
--partition-password <partition_password> \
--so-password <so_password>

You should see output similar to the following after each successful addition to the HA partition group.

{
      "Status": "Addition of HSM <hsm_arn> to HA partition group <hapg_arn> successful"
}

Register the client with the HA partition group

The last step is to register the client with the HA partition group. You will need the client ARN from earlier in the post, and you will use the CloudHSM CLI command register-client-to-hapg to complete this process.

$ cloudhsm register-client-to-hapg \
--client-arn <client_arn> \
--hapg-arn <hapg_arn>
{
      "Status": "Registration of the client <client_arn> to the HA partition group <hapg_arn> successful"
}

Registering the client with the HA partition group entitles you to the client configuration file and the server certificates, which you retrieve by using the get-client-configuration AWS CLI command.

$ cloudhsm get-client-configuration \
--client-arn <client_arn> \
--hapg-arn <hapg_arn> \
--cert-directory <server_cert_location> \
--config-directory /etc/

The configuration file has been copied to /etc/
The server certificate has been copied to <server_cert_location>

The <server_cert_location> will differ depending on the LunaSA client software you are using:

  • Client software version 5.3: /usr/safenet/lunaclient/cert/server
  • Client software version 5.1: /usr/lunasa/cert/server

Lastly, to verify the client configuration, run the following LunaSA vtl command.

$ vtl haAdmin show

In the output, you will see a heading, HA Group and Member Information. Ensure that the number of group members equals the number of CloudHSM devices you added to the HA partition group. If the number does not match what you have provisioned, you might have missed a step in the provisioning process. Going back through the provisioning process usually repairs this. However, if you still encounter issues, opening a support case is the quickest way to get assistance.

Another way to verify the HA partition group setup is to check the /etc/Chrystoki.conf file for output similar to the following.

VirtualToken = {
   VirtualToken00Label = hapg1;
   VirtualToken00SN = 1529127380;
   VirtualToken00Members = 475256026,511541022;
}
HASynchronize = {
   hapg1 = 1;
}
HAConfiguration = {
   reconnAtt = -1;
   AutoReconnectInterval = 60;
   HAOnly = 1;
}

Summary

You have now completed the process of provisioning CloudHSM devices, the client instance for connection, and your HA partition group for fault tolerance. You can begin using an application of your choice to access the CloudHSM devices for key management and encryption. By accessing CloudHSM devices via the HA partition group, you ensure that all traffic is load balanced between all backing CloudHSM devices. The HA partition group will ensure that each CloudHSM has identical information so that it can respond to any request issued.

Now that you have an HA partition group set up with automatic recovery, if a CloudHSM device fails, the device will attempt to recover itself, and all traffic will be rerouted to the remaining CloudHSM devices in the HA partition group so as not to interrupt traffic. After recovery (manual or automatic), all data will be replicated across the CloudHSM devices in the HA partition group to ensure consistency.

If you have questions about any part of this blog post, please post them on the IAM forum.

– Tracy

Hacking the Vote

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/08/hacking_the_vot.html

Russia has attacked the U.S. in cyberspace in an attempt to influence our national election, many experts have concluded. We need to take this national security threat seriously and both respond and defend, despite the partisan nature of this particular attack.

There is virtually no debate about that, either from the technical experts who analyzed the attack last month or from the FBI, which is analyzing it now. The hackers have already released DNC emails and voicemails, and promise more data dumps.

While their motivation remains unclear, they could continue to attack our election from now to November — and beyond.

Like everything else in society, elections have gone digital. And just as we’ve seen cyberattacks affecting all aspects of society, we’re going to see them affecting elections as well.

What happened to the DNC is an example of organizational doxing — the publishing of private information — an increasingly popular tactic against both government and private organizations. There are other ways to influence elections: denial-of-service attacks against candidate and party networks and websites, attacks against campaign workers and donors, attacks against voter rolls or election agencies, hacks of the candidate websites and social media accounts, and — the one that scares me the most — manipulation of our highly insecure but increasingly popular electronic voting machines.

On the one hand, this attack is a standard intelligence gathering operation, something the NSA does against political targets all over the world and other countries regularly do to us. The only thing different between this attack and the more common Chinese and Russian attacks against our government networks is that the Russians apparently decided to publish selected pieces of what they stole in an attempt to influence our election, and to use Wikileaks as a way to both hide their origin and give them a veneer of respectability.

All of the attacks listed above can be perpetrated by other countries and by individuals as well. They’ve been done in elections in other countries. They’ve been done in other contexts. The Internet broadly distributes power, and what was once the sole purview of nation states is now in the hands of the masses. We’re living in a world where disgruntled people with the right hacking skills can influence our elections, wherever they are in the world.

The Snowden documents have shown the world how aggressive our own intelligence agency is in cyberspace. But despite all of the policy analysis that has gone into our own national cybersecurity, we seem perpetually taken by surprise when we are attacked. While foreign interference in national elections isn’t new, and something the U.S. has repeatedly done, electronic interference is a different animal.

The Obama Administration is considering how to respond, but politics will get in the way. Were this an attack against a popular Internet company, or a piece of our physical infrastructure, we would all be together in response. But because these attacks affect one political party, the other party benefits. Even worse, the benefited candidate is actively inviting more foreign attacks against his opponent, though he now says he was just being sarcastic. Any response from the Administration or the FBI will be viewed through this partisan lens, especially because the President is a Democrat.

We need to rise above that. These threats are real and they affect us all, regardless of political affiliation. That this particular attack targeted the DNC is no indication of who the next attack might target. We need to make it clear to the world that we will not accept interference in our political process, whether by foreign countries or lone hackers.

However we respond to this act of aggression, we also need to increase the security of our election systems against all threats — and quickly.

We tend to underestimate threats that haven’t happened — we discount them as “theoretical” — and overestimate threats that have happened at least once. The terrorist attacks of 9/11 are a showcase example of that: Administration officials ignored all the warning signs, and then drastically overreacted after the fact. These Russian attacks against our voting system have happened. And they will happen again, unless we take action.

If a foreign country attacked U.S. critical infrastructure, we would respond as a nation against the threat. But if that attack falls along political lines, the response is more complicated. It shouldn’t be. This is a national security threat against our democracy, and needs to be treated as such.

This essay previously appeared on CNN.com.

I wish I enjoyed Pokémon Go

Post Syndicated from Eevee original https://eev.ee/blog/2016/07/31/i-wish-i-enjoyed-pok%C3%A9mon-go/

I’ve been trying really hard not to be a sourpuss about this, because everyone seems to enjoy it a lot and I don’t want to be the jerk pissing in their cornflakes.

And yet!

Despite all the potential of the game, despite all the fervor all across the world, it doesn’t tickle my fancy.

It seems like the sort of thing I ought to enjoy. Pokémon is kind of my jam, if you hadn’t noticed. When I don’t enjoy a Pokémon thing, something is wrong with at least one of us.

The app is broken

I’m not talking about the recent update that everyone’s mad about and that I haven’t even tried. They removed pawprints, which didn’t work anyway? That sucks, yeah, but I think it’s more significant that the thing is barely usable.

I’ve gone out hunting Pokémon several times with my partner and their husband. We wandered around for about an hour each time, and like clockwork, the game would just stop working for me every fifteen minutes. It would still run, and the screen would still update, but it would completely ignore all taps or swipes. The only fix seems to be killing it and restarting it, which takes like a week, and meanwhile the rest of my party has already caught the Zubat or whatever and is moving on.

For the brief moments when it works, it seems to be constantly confused about exactly where I am and which way I’m facing. Pokéstops (Poké Stops?) have massive icons when they’re nearby, and more than once I’ve had to mess around with the camera angle to be able to tap a nearby Pokémon, because a cluster of several already-visited Pokéstops are in the way. There’s also a strip along the bottom of the screen, surrounding the menu buttons, where tapping just does nothing at all.

I’ve had the AR Pokémon catching screen — the entire conceit of the game — lag so badly on multiple occasions that a Pokéball just stayed frozen in midair, and I couldn’t tell if I’d hit the Pokémon or not. There was also the time the Pokéball hit the Pokémon, landed on the ground, and… slowly rolled into the distance. For at least five minutes. I’m not exaggerating this time.

The game is much more responsive with AR disabled, so the Pokémon appear on a bland and generic background, which… seems to defeat the purpose of the game.

(Catching Pokémon doesn’t seem to have any real skill to it, either? Maybe I’m missing something, but I don’t understand how I’m supposed to gauge distance to an isolated 3D model and somehow connect this to how fast I flick my finger. I don’t really like “squishy” physics games like Angry Birds, and this is notably worse. It might as well be random.)

I had a better time just enjoying my party’s company and looking at actual wildlife, which in this case consists of cicadas and a few semi-wild rabbits that inexplicably live in a nearby park. I feel that something has gone wrong with your augmented reality game when it is worse than reality.

It’s not about Pokémon

Let’s see if my reasoning is sound, here.

In the mainline Pokémon games, you play as a human, but many of your important interactions are with Pokémon. You carry a number of Pokémon with you. When you encounter a Pokémon, you immediately send out your own. All the NPCs talk about how much they love Pokémon. There are overworld Pokémon hanging out. It’s pretty clear what the focus is. It’s right there on the title screen, even: both the word itself and an actual Pokémon.

Contrast this with Pokémon Go.

Most of the time, the only thing of interest on the screen is your avatar, a human. Once you encounter a Pokémon, you don’t send out your own; it’s just you, and it. In fact, once you catch a Pokémon, you hardly ever interact with it again. You can go look at its stats, assuming you can find it in your party of, what, 250?

The best things I’ve seen done with the app are AR screenshots of Pokémon in funny or interesting real-world places. It didn’t even occur to me that you can only do this with wild Pokémon until I played it. You can’t use the AR feature — again, the main conceit of the game — with your own Pokémon. How obvious is this? How can it not be possible? (If it is possible, it’s so well-hidden that several rounds of poking through the app haven’t revealed how to do it, which is still a knock for hiding the most obvious thing to want to do.)

So you are a human, and you wander around hoping you see Pokémon, and then you catch them, and then they are effectively just a sprite in a list until you feed them to your other Pokémon. And feed them you must, because the only way to level up a Pokémon is to feed them the corpses — sorry, “candies” — of their brethren. The Pokémon themselves aren’t involved in this process; they are passive consumers you fatten up.

If you’re familiar with Nuzlocke runs, you might be aware of just how attached players — or even passive audiences — can get to their Pokémon in mainline games. Yet in Pokémon Go, the critters themselves are just something to collect, just something to have, just something to sacrifice. No other form of interaction is offered.

In Pokémon X and Y, you can pet your Pokémon and feed them cakes, then go solve puzzles with them. They will love you in return. In Pokémon Go, you can swipe to make the model rotate.

There is some kind of battle system in here somewhere, but as far as I can tell, you only ever battle against gym leaders, who are jerks who’ve been playing the damn thing since it came out and have Pokémon whose CP have more digits than you even knew were possible. Also the battling is real-time with some kind of weird gestural interface, so it’s kind of a crapshoot whether you even do the thing you want, a far cry from the ostensibly strategic theme of the mainline games.

If I didn’t know any better, I’d think some no-name third-party company just took an existing product and poorly plastered Pokémon onto it.

There are very few Pokémon per given area

The game is limited to generation 1, the Red/Blue/Yellow series. And that’s fine.

I’ve seen about six of them.

Rumor has it that they are arranged very cleverly, with fire Pokémon appearing in deserts and water Pokémon appearing in waterfronts. That sounds really cool, except that I don’t live at the intersection of fifteen different ecosystems. How do you get ice Pokémon? Visit my freezer?

I freely admit, I’m probably not the target audience here; I don’t have a commute at all, and on an average day I have no reason to leave the house. I can understand that I might not see a huge variety, sure. But I’ve seen several friends lamenting that they don’t see much variety on their own commutes, or around the points of interest near where they live.

If you spend most of your time downtown in a major city, the game is probably great; if you live out in the sticks, it sounds a bit barren. It might be a little better if you could actually tell how to find Pokémon that are more than a few feet away — there used to be a distance indicator for nearby Pokémon, which I’m told even worked at one point, but it’s never worked since I first tried the game and it’s gone now.

Ah, of course, there’s always Pokévision, a live map of what Pokémon are where… which Niantic just politely asked to cease and desist.

It’s full of obvious “free-to-play” nudges

I put “free-to-play” in quotes because it’s a big ol’ marketing lie and I don’t know why the gaming community even tolerates the phrase. The game is obviously designed to be significantly worse if you don’t give them money, and there are little reminders of this everywhere.

The most obvious example: eggs rain from the sky, and are the only way to get Pokémon that don’t appear naturally nearby. You have to walk a certain number of kilometers to hatch an egg, much like the mainline games, which is cute.

Ah, but you also have to put an egg in an incubator for the steps to count. And you only start with one. And they’re given to you very rarely, and any beyond the one you start with only have limited uses at a time. And you can carry 9 eggs at a time.

Never fear! You can buy an extra (limited-use) incubator for the low low price of $1.48. Or maybe $1.03. It’s hard to tell, since (following the usual pattern of flagrant dishonesty) you first have to turn real money into game-specific trinkets at one of several carefully obscured exchange rates.

The thing is, you could just sell a Pokémon game. Nintendo has done so quite a few times, in fact. But who would pay for Pokémon Go, in the state it’s in?

In conclusion

This game is bad and I wish it weren’t bad. If you enjoy it, that’s awesome, and I’m not trying to rain on your parade, really. I just wish I enjoyed it too.

Python FAQ: How do I port to Python 3?

Post Syndicated from Eevee original https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/

Part of my Python FAQ, which is doomed to never be finished.

Maybe you have a Python 2 codebase. Maybe you’d like to make it work with Python 3. Maybe you really wish someone would write a comically long article on how to make that happen.

I have good news! You’re already reading one.

(And if you’re not sure why you’d want to use Python 3 in the first place, perhaps you’d be interested in the companion article which delves into exactly that question?)

Don’t be intimidated

This article is quite long, but don’t take that as a sign that this is necessarily a Herculean task. I’m trying to cover every issue I can ever recall running across, which means a lot of small gotchas.

I’ve ported several codebases from Python 2 to Python 2+3, and most of them have gone pretty smoothly. If you have modern Python 2 code that handles Unicode responsibly, you’re already halfway there.

However… if you still haven’t ported by now, almost eight years after Python 3.0 was first released, chances are you have either a lumbering giant of an app or ancient and weird 2.2-era code. Or, perish the thought, a lumbering giant consisting largely of weird 2.2-era code. In that case, you’ll want to clean up the more obvious issues one at a time, then go back and start worrying about actually running parts of your code on Python 3.

On the other hand, if your Python 2 code is pretty small and you’ve just never gotten around to porting, good news! It’s not that bad, and much of the work can be done automatically. Python 3 is ultimately the same language as Python 2, just with some sharp bits filed off.

Making some tough decisions

We say “porting from 2 to 3”, but what we usually mean is “porting code from 2 to both 2 and 3”. That ends up being more difficult (and ugly), since rather than writing either 2 or 3, you have to write the common subset of 2 and 3. As nifty as some of the features in 3 are, you can’t actually use any of them if you have to remain compatible with Python 2.

The first thing you need to do, then, is decide exactly which versions of Python you’re targeting. For 2, your options are:

  • Python 2.5+ is possible, but very difficult, and this post doesn’t really discuss it. Even something as simple as exception handling becomes painful, because the only syntax that works in Python 3 was first introduced in Python 2.6. I wouldn’t recommend doing this.

  • Python 2.6+ used to be fairly common, and is well-tread ground. However, Python 2.6 reached end-of-life in 2013, and some common libraries have been dropping support for it. If you want to preserve Python 2.6 compatibility for the sake of making a library more widely-available, well, I’d urge you to reconsider. If you want to preserve Python 2.6 compatibility because you’re running a proprietary app on it, you should stop reading this right now and go upgrade to 2.7 already.

  • Python 2.7 is the last release of the Python 2 series, but is guaranteed to be supported until at least 2020. The major focus of the release was backporting a lot of minor Python 3 features, making it the best possible target for code that’s meant to run on both 2 and 3.

  • There is, of course, also the choice of dropping Python 2 support, in which case this process will be much easier. Python 2 is still very widely-used, though, so library authors probably won’t want to do this. App authors do have the option, but unless your app is trivial, it’s much easier to maintain Python 2 support during the port — that way you can port iteratively, and the app will still function on Python 2 in the interim, rather than being a 2/3 hybrid that can’t run on either.

Most of this post assumes you’re targeting Python 2.7, though there are mentions of 2.6 as well.

You also have to decide which version of Python 3 to target.

  • Python 3.0 and 3.1 are forgettable. Python 3 was still stabilizing for its first couple minor versions, and from what I hear, compatibility with both 2.7 and 3.0 is a huge pain. Both versions are also past end-of-life.

  • Python 3.2 and 3.3 are a common minimum version to target. Python 3.3 reinstated support for u'...' literals (redundant in Python 3, where normal strings are already Unicode), which makes supporting both 2 and 3 much easier. I bundle it with Python 3.2 because the latest version that stable PyPy supports is 3.2, but it also supports u'...' literals. You’ll support the biggest surface area by targeting that, a sort of 3.2½. (There’s an alpha PyPy supporting 3.3, but as of this writing it’s not released as stable yet.)

  • Python 3.4 and 3.5 add shiny new features, but you can only really use them if you’re dropping support for Python 2. Again, I’d suggest targeting Python 2.7 + Python 3.2½ first, then dropping the Python 2 support and adding whatever later Python 3 trinkets you want.

Another consideration is what attitude you want your final code to take. Do you want Python 2 code with enough band-aids that it also works on Python 3, or Python 3 code that’s carefully written so it still works on Python 2? The differences are subtle! Consider code like x = map(a, b). map returns a list in Python 2, but a lazy iterable in Python 3. Which way do you want to port this code?

# Python 2 style: force eager evaluation, even on Python 3
x = list(map(a, b))

# Python 3 style: use lazy evaluation, even on Python 2
try:
    from future_builtins import map
except ImportError:
    pass
x = map(a, b)

The answer may depend on which Python you primarily use for development, your target audience, or even case-by-case based on how x is used.

Personally, I’d err on the side of preserving Python 3 semantics and porting them to Python 2 when possible. I’m pretty used to Python 3, though, and you or your team might be thrown for a loop by changing Python 2’s behavior.

At the very least, prefer if PY2 to if not PY3. The former stresses that Python 2 is the special case, which is increasingly true going forward. Eventually there’ll be a Python 4, and perhaps even a Python 5, and those future versions will want the “Python 3” behavior.
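
A minimal sketch of that pattern (the PY2 name is just a convention; six ships the same flag as six.PY2):

import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa -- this name only exists on Python 2
else:
    text_type = str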

Some helpful tools

The good news is that you don’t have to do all of this manually.

2to3 is a standard library module (since 2.6) that automatically modifies Python 2 source code to change some common Python 2 constructs to the Python 3 equivalent. (It also doubles as a framework for making arbitrary changes to Python code.)

Unfortunately, it ports 2 to 3, not 2 to 2+3. For libraries, it’s possible to rig 2to3 to run automatically on your code just before it’s installed on Python 3, so you can keep writing Python 2 code — but 2to3 isn’t perfect, and this makes it impossible to develop with your library on Python 3, so Python 3 ends up as a second-class citizen. I wouldn’t recommend it.

The more common approach is to use something like six, a library that wraps many of the runtime differences between 2 and 3, so you can run the same codebase on both 2 and 3.

Of course, that still leaves you making the changes yourself. A more recent innovation is the python-future project, which combines both of the above. It has a future library of renames and backports of Python 3 functionality that goes further than six and is designed to let you write Python 3-esque code that still runs on Python 2. It also includes a futurize script, based on the 2to3 plumbing, that rewrites your code to target 2+3 (using python-future’s library) rather than just 3.

The nice thing about python-future is that it explicitly takes the stance of writing code against Python 3 semantics and backporting them to Python 2. It’s very dedicated to this: it has a future.builtins module that includes not only easy cases like map, but also entire pure-Python reimplementations of types like bytes. (Naturally, this adds some significant overhead as well.) I do like the overall attitude, but I’m not totally sold on all the changes, and you might want to leaf through them to see which ones you like.
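
To give a flavor, here’s a tiny sketch assuming the future package is installed; both names behave with Python 3 semantics even on Python 2.

from future.builtins import bytes, map

b = bytes(b'abc')
print(b[0])                    # 97 on both 2 and 3, like Python 3's bytes
lazy = map(lambda x: x * x, range(3))  # a lazy iterable on both versions
print(list(lazy))              # [0, 1, 4]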

futurize isn’t perfect, but it’s probably the best starting point. The 2to3 design splits the various edits into a variety of “fixers” that each make a single style of change, and futurize works the same way, inheriting many of the fixers from 2to3. The nice thing about futurize is that it groups the fixers into “stages”, where stage 1 (futurize --stage1) only makes fairly straightforward changes, like fixing the except syntax. More importantly, it doesn’t add any dependencies on the future library, so it’s useful for making the easy changes even if you’d prefer to use six. You’re also free to choose individual fixes to apply, if you discover that some particular change breaks your code.

Another advantage of this approach is that you can tackle the porting piecemeal, which is great for very large projects. Run one fixer at a time, starting with the very simple ones like updating to except ... as ... syntax, and convince yourself that everything is fine before you do the next one. You can make some serious strides towards 3 compatibility just by eliminating behavior that already has cromulent alternatives in Python 2.

If you expect your Python 3 port to take a very long time — say, if you have a large project with numerous developers and a frantic release schedule — then you might want to prevent older syntax from creeping in with a tool like autopep8, which can automatically fix some deprecated features with a much lighter touch. If you’d like to automatically enforce that, say, from __future__ import absolute_import is at the top of every Python file, that’s a bit beyond the scope of this article, but I’ve had pre-commit + reorder_python_imports thrust upon me in the past to fairly good effect.

Anyway! For each of the issues below, I’ll mention whether futurize can fix it, the name of the responsible fixer, and whether six has anything relevant. If the name of the fixer begins with lib2to3, that means it’s part of the standard library, and you can use it with 2to3 without installing python-future.

Here we go!

Things you shouldn’t even be doing

These are ancient, ancient practices, and even Python 2 programmers may be surprised by them. Some of them are arguably outright bugs in the language; others are just old and forgotten. They generally have equivalents that work even in older versions of Python 2.

Old-style classes

class Foo:
    ...

In Python 3, this code creates a class that inherits from object. In Python 2, it creates a completely different kind of thing entirely: an “old-style” class, which worked a little differently from built-in types. The differences are generally subtle:

  • Old-style classes don’t support __getattribute__ or __slots__.

  • Old-style classes don’t correctly support data descriptors, i.e. the assignment behavior of @property.

  • Old-style classes had a __coerce__ method, which would attempt to turn a value into a built-in numeric type before performing a math operation.

  • Old-style classes didn’t use the C3 MRO, so in the case of diamond inheritance, a class could be skipped entirely by super().

  • Old-style instances check the instance for a special method name; new-style instances check the type. Additionally, if a special method isn’t found on an old-style instance, the lookup falls back to __getattr__; this is not the case for new-style classes (which makes proxying more complicated).

That last one is the only thing old-style classes can do that new-style classes cannot, and if you’re relying on it, you have a bit of refactoring to do. (The really curious thing is that there doesn’t seem to be a particularly good reason for the limitation on new-style classes, and it doesn’t even make things faster. Maybe that’ll be fixed in Python 4?)

If you have no idea what any of that means or why you should care, chances are you’re either not using old-style classes at all, or you’re only using them because you forgot to write (object) somewhere. In that case, futurize --stage2 will happily change class Foo: to class Foo(object): for you, using the libpasteurize.fixes.fix_newstyle fixer. (Strictly speaking, this is a Python 2 compatibility issue, since the old syntax still works fine in Python 3 — it just means something else now.)
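
If you want to see one of these differences firsthand, here’s a small sketch (meant to be run under Python 2) demonstrating the @property assignment behavior described above:

class OldStyle:                # classic class on Python 2
    @property
    def prop(self):
        return 42

class NewStyle(object):
    @property
    def prop(self):
        return 42

old = OldStyle()
old.prop = 99                  # silently shadows the property in the instance dict
print(old.prop)                # prints 99, not 42

new = NewStyle()
new.prop = 99                  # AttributeError: can't set attribute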

cmp

Python 2 originally used the C approach for sorting. Given two things A and B, a comparison would produce a negative number if A < B, zero if A == B, and a positive number if A > B. This was the only way to customize sorting; there’s a cmp() built-in function, a __cmp__ special method, and cmp arguments to list.sort() and sorted().

This is a little cumbersome, as you may have noticed if you’ve ever tried to do custom sorting in Perl or JavaScript. Even a case-insensitive sort involves repeating yourself. Most custom sorts will have the same basic structure of cmp(op(a), op(b)), when the only thing you really care about is op.

names.sort(cmp=lambda a, b: cmp(a.lower(), b.lower()))

But more importantly, the C approach is flat-out wrong for some types. Consider sets, which use comparison to indicate subsets versus supersets:

{1, 2} < {1, 2, 3}  # True
{1, 2, 3} > {1, 2}  # True
{1, 2} < {1, 2}  # False
{1, 2} <= {1, 2}  # True

So what to do with {1, 2} < {3, 4}, where none of the three possible answers is correct?

Early versions of Python 2 added “rich comparisons”, which introduced methods for all six possible comparisons: __eq__, __ne__, __lt__, __le__, __gt__, and __ge__. You’re free to return False for all six, or even True for all six, or return NotImplemented to allow deferring to the other operand. The cmp argument became key instead, which allows mapping the original values to a different item to use for comparison:

names.sort(key=lambda a: a.lower())

(This is faster, too, since there are fewer calls to the lambda, fewer calls to .lower(), and no calls to cmp.)


So, fixing all this. Luckily, Python 2 supports all of the new stuff, so you don’t need compatibility hacks.

To replace simple implementations of __cmp__, you need only write the appropriate rich comparison methods. You could even do this the obvious way:

class Foo(object):
    def __cmp__(self, other):
        return cmp(self.prop, other.prop)

    def __eq__(self, other):
        return self.__cmp__(other) == 0

    def __ne__(self, other):
        return self.__cmp__(other) != 0

    def __lt__(self, other):
        return self.__cmp__(other) < 0

    ...

You would also have to change the use of cmp to a manual if tree, since cmp is gone in Python 3. I don’t recommend this.

A lazier alternative would be to use functools.total_ordering (backported from 3.0 into 2.7), which generates four of the comparison methods, given a class that implements __eq__ and one other:

@functools.total_ordering
class Foo(object):
    def __eq__(self, other):
        return self.prop == other.prop

    def __lt__(self, other):
        return self.prop < other.prop

There are a couple problems with this code. For one, it’s still pretty repetitive, accessing .prop four times (and imagine if you wanted to compare several properties). For another, it’ll either cause an error or do entirely the wrong thing if you happen to compare with an object of a different type. You should return NotImplemented in this case, but total_ordering doesn’t handle that correctly until Python 3.4. If those bother you, you might enjoy my own classtools.keyed_ordering, which uses a __key__ method (much like the key argument) to generate all six methods:

@classtools.keyed_ordering
class Foo(object):
    def __key__(self):
        return self.prop

Replacing uses of cmp arguments should be straightforward: a cmp argument of cmp(op(a), op(b)) becomes a key argument of op. If you’re doing something more elaborate, there’s a functools.cmp_to_key function (also backported from 3.0 to 2.7), which converts a cmp function to one usable as a key. (The implementation is much like the first Foo example above: it involves a class that calls the wrapped function from its comparison methods, and returns True or False depending on the return value.)
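
For example, a sketch of cmp_to_key in action; the comparison function here is invented for illustration, and (x > y) - (x < y) is the usual portable stand-in for cmp itself:

import functools

def compare_lengths(a, b):
    # legacy three-way comparison: negative, zero, or positive
    return (len(a) > len(b)) - (len(a) < len(b))

words = ['sphinx', 'of', 'black', 'quartz']
words.sort(key=functools.cmp_to_key(compare_lengths))
print(words)  # ['of', 'black', 'sphinx', 'quartz']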

Finally, if you’re using cmp directly, don’t do that. If you really, really need it for something other than Python’s own sorting, just use an if.

The only help futurize offers is in futurize --stage2, via libfuturize.fixes.fix_cmp, which adds an import of past.builtins.cmp if it detects you’re using the cmp function anywhere.

Comparing incompatible types

Python 2’s use of C-style ordering also means that any two objects, of any types, must be either equal or occur in some defined order. Python’s answer to this problem is to sort on the names of the types. So None < 3 < "1", because "NoneType" < "int" < "str".

Python 3 removes this fallback rule; if two values don’t know how to compare against each other (i.e. both return NotImplemented), you just get a TypeError.

This might affect you in subtle ways, such as if you’re sorting a list of objects that may contain Nones and expecting it to silently work. The fix depends entirely on the type of data you have, and no automated tool can handle that for you. Most likely, you didn’t mean to be sorting a heterogeneous list in the first place.

Of course, you could always sort on type(x).__name__, but I don’t know why you would do that.
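
If you really do need a mixed list to keep sorting, one portable trick is a key that partitions the incompatible values so they are never compared directly. A sketch for the None case:

data = [3, None, 1, None, 2]
data.sort(key=lambda x: (x is None, x))
print(data)  # [1, 2, 3, None, None] -- Nones sort last, never compared to ints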

The sets module

Python 2.3 introduced its set types as Set and ImmutableSet in the sets module. Since Python 2.4, they’ve been built-in types, set and frozenset. The sets module is gone in Python 3, so just use the built-in names.

Creating exceptions

Python 2 allows you to do this:

raise RuntimeError, "an error happened at runtime!!"

There’s not really any good reason to do this, since you can just as well do:

raise RuntimeError("an error happened at runtime!!")

futurize --stage1 will rewrite the two-arg form to a regular object creation via the libfuturize.fixes.fix_raise fixer. It’ll also fix this alternative way of specifying an exception type, which is so bizarre and obscure that I did not know about it until I read the fixer’s source code:

raise (((A, B), C), ...)  # equivalent to `raise A` (?!)

Additionally, exceptions act like sequences in Python 2, but not in Python 3. You can just operate on the .args sequence directly, in either version. Alas, there’s no automated way to fix this.
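
A small sketch of the portable spelling:

try:
    raise ValueError('bad value', 42)
except ValueError as e:
    reason = e.args[0]   # works on both; Python 2 also allowed e[0]
    detail = e.args[1]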

Backticks

Did you know that `x` is equivalent to repr(x) in Python 2? Yeah, most people don’t. It’s super weird. futurize --stage1 will fix this with the lib2to3.fixes.fix_repr fixer.

has_key

Very old code may still be using somedict.has_key("foo"). "foo" in somedict has worked since Python 2.2. What are you doing. futurize --stage1 will fix this with the lib2to3.fixes.fix_has_key fixer.

<>

<> is equivalent to != in Python 2! This is an ancient, ancient holdover, and there’s no reason to still be using it. futurize --stage1 will fix this with the lib2to3.fixes.fix_ne fixer.

(You could also use from __future__ import barry_as_FLUFL, which restores <> in Python 3. It’s an easter egg. I’m joking. Please don’t actually do this.)

Things with easy Python 2 equivalents

These aren’t necessarily ancient, but they have an alternative you can just as well express in Python 2, so there’s no need to juggle 2 and 3.

Other ancient builtins

apply() is gone. Use the built-in syntax, f(*args, **kwargs).

callable() was briefly gone, but then came back in Python 3.2.

coerce() is gone; it was only used for old-style classes.

execfile() is gone. Read the file and pass its contents to exec() instead.

file() is gone; Python 3 has multiple file types, and a hierarchy of interfaces defined in the io module. Occasionally, code uses this as a synonym for open(), but you should really be using open() anyway.

intern() has been moved into the sys module, though I have no earthly idea why you’d be using it.

raw_input() has been renamed to input(), and the old ludicrous input() is gone. If you really need input(), please stop.

reduce() has been moved into the functools module, but it’s there in Python 2.6 as well.

reload() has been moved into the imp module. It’s unreliable garbage and you shouldn’t be using it anyway.

futurize --stage1 can fix several of these:

  • apply, via lib2to3.fixes.fix_apply
  • intern, via lib2to3.fixes.fix_intern
  • reduce, via lib2to3.fixes.fix_reduce

futurize --stage2 can also fix execfile via the libfuturize.fixes.fix_execfile fixer, which imports past.builtins.execfile. The 2to3 fixer uses an open() call, but the true correct fix is to use a with block.

futurize --stage2 has a couple of fixers for raw_input, but you can just as well import future.builtins.input or six.moves.input.

Nothing can fix coerce, which has no equivalent. Curiously, I don’t see a fixer for file, which is trivially fixed by replacing it with open. Nothing for reload, either.

Catching exceptions

Historically, the way to say “if there’s a ValueError, store it in e and run some code” was:

try:
    ...
except ValueError, e:
    ...

Unfortunately, that’s very easy to confuse with the syntax for catching two different types of exception:

except (ValueError, TypeError):
    ...

If you forget the parentheses, you’ll only catch ValueError, and the exception will be assigned to a variable called, er, TypeError. Whoops!

Python 3.0 introduced clearer syntax, which was also backported to Python 2.6:

except ValueError as e:
    ...

Python 3.0 finally removed the old syntax, so you must use the as form. futurize --stage1 will fix this with the lib2to3.fixes.fix_except fixer.

As an additional wrinkle, the extra variable e is deleted at the end of the block in Python 3, but not in Python 2. If you really need to refer to it after the block, just assign it to a different name.
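
Something like this sketch, with a stand-in risky_thing:

def risky_thing():
    raise ValueError('oops')

error = None
try:
    risky_thing()
except ValueError as e:
    error = e  # `e` itself is unbound once this block ends in Python 3
if error is not None:
    print(error.args)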

(The reason for this is that captured exceptions contain a traceback in Python 3, and tracebacks contain the locals for the current frame, and those locals will contain the captured exception. The resulting cycle would keep all local variables alive until the cycle detector dealt with it, at least in CPython. Scrapping the exception as soon as it’s been dealt with was a simple way to keep this from accidentally happening all over the place. It usually doesn’t make sense to refer to a captured exception after the except block, anyway, since the variable may or may not even exist, and that’s generally weird and bad in Python.)

Octals

It’s not uncommon for a new programmer to try to zero-pad a set of numbers:

a = 07
b = 08
c = 09
d = 10

Of course, this will have the rather bizarre result that 08 is a SyntaxError, even though 07 works fine — because numbers starting with a 0 are parsed as octal.

This is a holdover from C, and it’s fairly surprising, since there’s virtually no reason to ever use octal. The only time I can ever remember using it is for passing file modes to chmod.

Python 3.0 requires octal literals to be prefixed with 0o, in line with 0x for hex and 0b for binary; literal integers starting with only a 0 are a syntax error. Python 2.6 supports both forms.
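
So the chmod example, written portably (with a hypothetical filename):

import os

# 0o755 parses on Python 2.6+ and Python 3; plain 0755 is a SyntaxError in 3.
os.chmod('deploy.sh', 0o755)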

futurize --stage1 will fix this with the lib2to3.fixes.fix_numliterals fixer.

pickle

If you’re using the pickle module (which you shouldn’t be), and you intend to pass pickles back and forth between Python 2 and Python 3, there’s a small issue to be aware of. pickle has several different “protocol” versions, and the default version used in Python 3 is protocol 3, which Python 2 cannot read.

The fix is simple: just find where you’re calling pickle.dump() or pickle.dumps(), and pass a protocol argument of 2. Protocol 2 is the highest version supported by Python 2, and you probably want to be using it anyway, since it’s much more compact and faster to read/write than Python 2’s default, protocol 0.

You may be already using HIGHEST_PROTOCOL, but you’ll have the same problem: the highest protocol supported in any version of Python 3 is unreadable by Python 2.
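
A quick sketch of the safe call:

import pickle

obj = {'spam': 1}  # stand-in for whatever you're actually pickling
blob = pickle.dumps(obj, protocol=2)  # the newest protocol Python 2 can read
assert pickle.loads(blob) == obj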


A somewhat bigger problem is that if you pickle an instance of a user-defined class on Python 2, the pickle will record all its attributes as bytestrings, because that’s what they are in Python 2. Python 3 will then dutifully load the pickle and populate your object’s __dict__ with keys like b'foo'. obj.foo will then not actually exist, because obj.foo looks for the string 'foo', and 'foo' != b'foo' in Python 3.

Don’t use pickle, kids.

It’s possible to fix this, but also a huge pain in the ass. If you don’t know how, you definitely shouldn’t be using pickle.

Things that have a __future__ import

Occasionally, the syntax changed in an incompatible way, but the new syntax was still backported and hidden behind a __future__ import — Python’s mechanism for opting into syntax changes. You have to put such an import at the top of the file, optionally after a docstring, like this:

"""My super important module."""
from __future__ import with_statement

print is now a function

Ugh! Parentheses! Why, Guido, why?

The reason is that the print statement has incredibly goofy syntax, unlike anything else in the language:

print >>a, b, c,

You might not even recognize the >> bit, but it lets you print to a file other than sys.stdout. It’s baked specifically into the print syntax. Python 3 replaces this with a straightforward built-in function with a couple extra bells and whistles. The above would be written:

print(b, c, end='', file=a)

It’s slightly more verbose, but it’s also easier to tell what’s going on, and that teeny little comma at the end is now a more obvious keyword argument.

from __future__ import print_function will forget about the print statement for the rest of the file, and make the builtin print function available instead. futurize --stage1 will fix all uses of print and add the __future__ import, with the libfuturize.fixes.fix_print_with_import fixer. (There’s also a 2to3 fixer, but it doesn’t add the __future__ import, since it’s unnecessary in Python 3.)

A word of warning: do not just use print with parentheses without adding the __future__ import. This may appear to work in stock Python 2:

print("See, what's the problem?  This works fine!")

However, that’s parsed as the print statement followed by an expression in parentheses. It becomes more obvious if you try to print two values:

print("The answer is:", 3)
# ("The answer is:", 3)

Now you have a comma inside parentheses, which is a tuple, so the old print statement prints its repr.

Division always produces a float

Quick, what’s the answer here?

5 / 2

If you’re a normal human being, you’ll say 2.5 or 2½. Unfortunately, if you’re like Python and have been afflicted by C, you might say the answer is 2, because this is “integer division” — a bizarre and alien concept probably invented because CPUs didn’t have FPUs when C was first invented.

Python 3.0 decided that maybe contorting fundamental arithmetic to match the inadequacies of 1970s hardware is not the best idea, and so it changed division to always produce a float.

Since Python 2.6, from __future__ import division will alter the division operator to always do true division. If you want to do floor division, there’s a separate // operator, which has existed for ages; you can use it in Python 2 with or without the __future__ import.

Note that true division always produces a float, even if the result is integral: 6 / 3 is 2.0. On the other hand, floor division uses the same typing rules as C-style division: 5 // 2 is 2, but 5 // 2.0 is 2.0.

futurize --stage2 will “fix” this with the libfuturize.fixes.fix_division fixer, but unfortunately that just adds the __future__ import. With the --conservative option, it uses the libfuturize.fixes.fix_division_safe fixer instead, which imports past.utils.old_div, a forward-port of Python 2’s division operator.

The trouble here is that the new / always produces a float, and the new // always floors, but the old / sometimes did one and sometimes did the other. futurize can’t just replace all uses of / with //, because 5/2.0 is 2.5 but 5//2.0 is 2.0, and it can’t generally know what types the operands are.

You might be best off fixing this one manually — perhaps using fix_division_safe to find all the places you do division, then changing them to use the right operator.
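
In practice the manual fix tends to look like this sketch:

from __future__ import division  # at the top of the file, Python 2.6+

print(5 / 2)     # 2.5 on both versions now
print(5 // 2)    # 2; explicit floor division
print(5 // 2.0)  # 2.0; floor division keeps C-style result types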

Of course, the __div__ magic method is gone in Python 3, replaced explicitly by __floordiv__ (//) and __truediv__ (/). Both of those methods already exist in Python 2, and __truediv__ is even called when you use / in the presence of the future import, so being compatible is a simple matter of implementing all three and deferring to one of the others from __div__.
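
A minimal sketch of that, with a made-up wrapper class:

class Meters(object):  # hypothetical example class
    def __init__(self, n):
        self.n = n

    def __truediv__(self, other):   # / in Python 3, or Python 2 with the import
        return Meters(self.n / float(other))

    def __floordiv__(self, other):  # // everywhere
        return Meters(self.n // other)

    __div__ = __truediv__  # / in Python 2 without the __future__ import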

Relative imports

In Python 2, if you’re in the module foo.bar and say import quux, Python will look for a foo.quux before it looks for a top-level quux. This behavior is called an implicit relative import, though it might be more clearly called a sibling import. It’s troublesome for several reasons.

  • If you have a sibling called quux, and there’s also a top-level or standard library module called quux, you can’t import the latter. (There used to be a py.std module for providing indirect access to the standard library, for this very reason!)

  • If you import the top-level quux module, and then later add a foo.quux module, you’ll suddenly be importing a different module.

  • When reading the source code, it’s not clear which imports are siblings and which are top-level. In fact, the modules you get depend on the module you’re in, so moving or renaming a file may change its imports in non-obvious ways.

Python 3 eliminates this behavior: import quux always means the top-level module. It also adds syntax for “explicit relative” or “absolute relative” (yikes) imports: from . import quux or from .quux import somefunc explicitly means to look for a sibling named quux. (You can also use ..quux to look in the parent package, three dots to look in the grandparent, etc.)

The explicit syntax is supported since Python 2.5. The old sibling behavior can be disabled since Python 2.5 with from __future__ import absolute_import.

futurize --stage1 has a libfuturize.fixes.fix_absolute_import fixer, which attempts to detect sibling imports and convert them to explicit relative imports. If it finds any sibling imports, it’ll also add the __future__ line, though honestly you should make an effort to put that line in all of your Python 2 code.

It’s possible for the futurize fixer to guess wrong about a sibling import, but in general it works pretty well.

(There is one case I’ve run across where simply replacing import sibling with from . import sibling didn’t work. Unfortunately, it was Yelp code that I no longer have access to, and I can’t remember the precise details. It involved having several sibling imports inside a __init__.py, where the siblings also imported from each other in complex ways. The sibling imports worked, but the explicit relative imports failed, for some really obscure timing reason. It’s even possible this was a 2.6 bug that’s been fixed in 2.7. If you see it, please let me know!)

Things that require some effort

These problems are a little more obscure, but many of them are also more difficult to fix automatically. If you have a massive codebase, these are where the problems start to appear.

The grand module shuffle

A whole bunch of modules were deleted, merged, or removed. A full list is in PEP 3108, but you’ll never have heard of most of them. Here are the ones that might affect you.

  • __builtin__ has been renamed to builtins. Note that this is a module, not the __builtins__ attribute of modules, which is exactly why it was renamed. Incidentally, you should be using the builtins module rather than __builtins__ anyway. Or, wait, no, just don’t use either, please don’t mess with the built-in scope.

  • ConfigParser has been renamed to configparser.

  • Queue has been renamed to queue.

  • SocketServer has been renamed to socketserver.

  • cStringIO and StringIO are gone; instead, use StringIO or BytesIO from the io module. Note that these also exist in Python 2, but are pure-Python rather than the C versions in current Python 3.

  • cPickle is gone. Importing pickle in Python 3 now gives you the C implementation automatically.

  • cProfile, on the other hand, still exists in Python 3; the planned merger into profile never actually happened, so you can keep importing it by name.

  • copy_reg has been renamed to copyreg.

  • anydbm, dbhash, dbm, dumbdbm, gdbm, and whichdb have all been merged into a dbm package.

  • dummy_thread has become _dummy_thread. It’s an implementation of the _thread module that doesn’t actually do any threading. You should be using dummy_threading instead, I guess?

  • httplib has become http.client. BaseHTTPServer, CGIHTTPServer, and SimpleHTTPServer have been merged into a single http.server module. Cookie has become http.cookies. cookielib has become http.cookiejar.

  • repr has been renamed to reprlib. (The module, not the built-in function.)

  • thread has been renamed to _thread, and you should really be using the threading module instead.

  • A whole mess of top-level Tk modules have been combined into a tkinter package.

  • The contents of urllib, urllib2, and urlparse have been consolidated and then split into urllib.error, urllib.parse, and urllib.request.

  • xmlrpclib has become xmlrpc.client. DocXMLRPCServer and SimpleXMLRPCServer have been merged into xmlrpc.server.

futurize --stage2 will fix this with the somewhat invasive libfuturize.fixes.fix_future_standard_library fixer, which uses a mechanism from future that adds aliases to Python 2 to make all the Python 3 standard library names work. It’s an interesting idea, but it didn’t actually work for all cases when I tried it (though now I can’t recall what was broken), so YMMV.

Alternatively, you could manually replace any affected imports with imports from six.moves, which provides aliases that work on either version.

Or as a last resort, you can just sprinkle try ... except ImportError around.
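
The sprinkling looks like this:

try:
    import configparser  # Python 3 name
except ImportError:
    import ConfigParser as configparser  # Python 2 name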

Built-in iterators are now lazy

filter, map, range, and zip are all lazy in Python 3. You can still iterate over their return values (once), but if you have code that expects to be able to index them or traverse them more than once, it’ll break in Python 3. (Well, not range, that’s fine.) The lazy equivalents — xrange and itertools’s izip, imap, and ifilter — are of course gone in Python 3.

In either case, the easiest thing to do is force eager evaluation by wrapping the call in list() or tuple(), which you’ll occasionally need to do in Python 3 regardless.

For the sake of consistency, you may want to import the lazy versions from the standard library future_builtins module. It only exists in Python 2, so be sure to wrap the import in a try.
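
Something like this, say:

try:
    # Python 2: shadow the eager builtins with the lazy versions.
    from future_builtins import filter, map, zip
except ImportError:
    pass  # Python 3: the builtins are already lazy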

futurize --stage2 tries to address this with several of lib2to3’s fixers, but the results aren’t particularly pleasing: calls to all four are unconditionally wrapped in list(), even in an obviously safe case like a for block. I’d just look through your uses of them manually.

A more subtle point: if you pass a string or tuple to Python 2’s filter, the return value will be the same type. Blindly wrapping the call in list() will of course change the behavior. Filtering a string is not a particularly common thing to do, but I’ve seen someone complain about it before, so take note.

Also, Python 3’s map stops at the shortest input sequence, whereas Python 2 extends shorter sequences with Nones. You can fix this with itertools.zip_longest (which in Python 2 is izip_longest!), but honestly, I’ve never even seen anyone pass multiple sequences to map.

Relatedly, dict.iteritems (plus its friends, iterkeys and itervalues) is gone in Python 3, since the plain items (plus keys and values) methods are already lazy there. The dict.view* methods are also gone, as they were only backports of Python 3’s normal behavior.

Both six and future.utils contain functions called iteritems, etc., which provide a lazy iterator in both Python 2 and 3. They also offer view* functions, which are closer to the Python 3 behavior, though I can’t say I’ve ever seen anyone actually use dict.viewitems in real code.
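
Usage is about what you’d expect; a sketch with six:

from six import iteritems

d = {'a': 1, 'b': 2}
for key, value in iteritems(d):  # lazy on both versions
    print(key, value)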

Of course, if you explicitly want a list of dictionary keys (or items or values), list(d) and list(d.items()) do the same thing in both versions.

buffer is gone

The buffer type has been replaced by memoryview (also in Python 2.7), which is similar but not identical. If you’ve even heard of either of these types, you probably know more about the subtleties involved than I do. There’s a lib2to3.fixes.fix_buffer fixer that blindly replaces buffer with memoryview, but futurize doesn’t use it in either stage.

Several special methods were renamed

Where Python 2 has __str__ and __unicode__, Python 3 has __bytes__ and __str__. The trick is that __str__ should return the native str type for each version: a bytestring for Python 2, but a Unicode string for Python 3. Also, you almost certainly don’t want a __bytes__ method in Python 3, where bytes is no longer used for text.

Both six and python-future have a python_2_unicode_compatible class decorator that tries to do the right thing. You write only a single __str__ method that returns a Unicode string. In Python 3, that’s all you need, so the decorator does nothing; in Python 2, the decorator will rename your method to __unicode__ and add a __str__ that returns the same value encoded as UTF-8. If you need different behavior, you’ll have to roll it yourself with if PY2.
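
A sketch of the decorator in action, using six’s spelling:

from six import python_2_unicode_compatible

@python_2_unicode_compatible
class Greeting(object):
    def __str__(self):
        # Return text; on Python 2 the decorator renames this to __unicode__
        # and adds a __str__ that encodes the result as UTF-8.
        return u'hello, world'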


Python 2’s next method is more appropriately __next__ in Python 3. The easy way to address this is to call your method __next__, then alias it with next = __next__. Be sure you never call it directly as a method, only with the built-in next() function.
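
For example, a counting iterator done this way:

class Counter(object):
    def __init__(self):
        self.n = 0

    def __iter__(self):
        return self

    def __next__(self):  # Python 3 spelling
        self.n += 1
        return self.n

    next = __next__      # Python 2 spelling, same function

it = Counter()
print(next(it))  # 1
print(next(it))  # 2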

Alternatively, future.builtins contains an alternative next which always calls __next__, but on Python 2, it falls back to trying next if __next__ doesn’t exist.

futurize --stage1 changes all use of obj.next() to next(obj) via the libfuturize.fixes.fix_next_call fixer. futurize --stage2 renames next methods to __next__ via the lib2to3.fixes.fix_next fixer (which also fixes calls). Note that there’s a remote chance of false positives, if for some reason you happened to use next as a regular method name.


Python 2’s __nonzero__ is Python 3’s __bool__. Again, you can just alias it manually. Or futurize --stage2 will rename it with the lib2to3.fixes.fix_nonzero fixer.
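
The manual alias is a one-liner:

class Box(object):
    def __init__(self, contents):
        self.contents = contents

    def __bool__(self):        # Python 3 spelling
        return bool(self.contents)

    __nonzero__ = __bool__     # Python 2 spelling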

Renaming it will of course break it in Python 2, but futurize --stage2 also has a libfuturize.fixes.fix_object fixer that imports python-future’s own builtins.object. The replacement object class has a few methods for making Python 3’s __str__, __next__, and __bool__ work on Python 2.

This is one of the mildly invasive things python-future does, and it may or may not sit well. Up to you.


__long__ is completely gone, as there is no long type in Python 3.

__getslice__, __setslice__, and __delslice__ are gone. Instead, slice objects are passed to __getitem__ and friends. On the off chance you use these, you’ll have to do something clever in the item methods to defer to your slice logic on Python 3.

__oct__ and __hex__ are gone; oct() and hex() now consult __index__. I seriously doubt this will impact anyone.

__div__ is gone, as mentioned previously.

Unbound methods are gone; function attributes renamed

Say you have this useless class.

class Foo(object):
    def bar(self):
        pass

In Python 2, Foo.bar is an “unbound method”, a type that’s generally unseen and unexposed other than as types.MethodType. In Python 3, Foo.bar is just a regular function.

Offhand, I can only think of one time this would matter: if you want to get at attributes on the function, perhaps for the sake of a method decorator. In Python 2, you have to go through the unbound method’s .im_func attribute to get the original function, but in Python 3, you already have the original function and can get the attributes directly.

If you’re doing this anywhere, an easy way to make it work in both versions is:

method = Foo.bar
method = getattr(method, 'im_func', method)

As for bound methods (the objects you get from accessing methods but not calling them, like [].append), the im_self and im_func attributes have been renamed to __self__ and __func__. Happily, these names also work in Python 2.6, so no compatibility hacks are necessary.

im_class is completely gone in Python 3. Methods have no interest in which class they’re attached to. They can’t, since the same function could easily be attached to more than one class. If you’re relying on im_class somehow, for some reason… well, don’t do that, maybe.

Relatedly, the func_* function attributes have been renamed to dunder names in Python 3, since assigning function attributes is a fairly common practice and Python doesn’t like to clog namespaces with its own builtin names. func_closure, func_code, func_defaults, func_dict, func_doc, func_globals, and func_name are now __closure__, __code__, etc. (Note that func_doc and func_name were already aliases for __doc__ and __name__, and func_defaults is much more easily inspected with the inspect module.) Python 2.6 added the dunder names as aliases, so on 2.6+ you can use them directly; on anything older, you’ll need a getattr dance, or the get_function_* functions from six.
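
A sketch of both spellings, plus six’s helper:

import six

def f():
    pass

code = six.get_function_code(f)  # works wherever six does
# Manual fallback for ancient Pythons: prefer the new name, then the old.
code = getattr(f, '__code__', None) or f.func_code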

Metaclass syntax has changed

In Python 2, a metaclass is declared by assigning to a special name in the class body:

class Foo(object):
    __metaclass__ = FooMeta
    ...

Admittedly, this doesn’t make a lot of sense. The metaclass affects how a class is created, and the class body is evaluated as part of that creation, so this is sort of a goofy hack.

Python 3 changed this, opening the door to a few new neat tricks in the process, which you can find out about in the companion article.

class Foo(object, metaclass=FooMeta):
    ...

The catch is finding a way to express this idea in both Python 2 and Python 3 — the old syntax is ignored in Python 3, and the new syntax is a syntax error in Python 2.

It’s a bit of a pain, but the class statement is really just a lot of sugar for calling the type() constructor; after all, Python classes are just instances of type. All you have to do is manually create an instance of your metaclass, rather than of type.
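
Spelled out, with a stand-in metaclass so the sketch runs:

class FooMeta(type):
    pass

# Same class the `class` statement would build, on either version:
Foo = FooMeta('Foo', (object,), {'attr': 1})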

Fortunately, other people have already made this work for you. futurize --stage2 will fix this using the libfuturize.fixes.fix_metaclass fixer, which imports future.utils.with_metaclass and produces the following:

from future.utils import with_metaclass

class Foo(with_metaclass(FooMeta, object)):
    ...

This creates an intermediate dummy class with the right metaclass, which you then inherit from. Classes use the same metaclass as their parents, so this works fine in any Python.

If you don’t want to depend on python-future, the same function exists in the six module.

Re-raising exceptions has different syntax

raise with no arguments does the same thing in Python 2 and Python 3: it re-raises the exception currently being handled, preserving the original traceback.

The problem comes in with the three-argument form of raise, which is for preserving the traceback while raising a different exception. It might look like this:

try:
    some_fragile_function()
except Exception as e:
    raise MyLibraryError, MyLibraryError("Failed to do a thing: " + str(e)), sys.exc_info()[2]

sys.exc_info()[2] is, of course, the only way to get the current traceback in Python 2. You may have noticed that the three arguments to raise are the same three things that sys.exc_info() returns: the type, the value, and the traceback.

Python 3 introduces exception chaining. If something raises an exception from within an except block, Python will remember the original exception, attach it to the new one, and show both exceptions when printing a traceback — including both exceptions’ types, messages, and where they happened. So to wrap and rethrow an exception, you don’t need to do anything special at all.

try:
    some_fragile_function()
except Exception:
    raise MyLibraryError("Failed to do a thing")

For more complicated handling, you can also explicitly say raise new_exception from old_exception. Exceptions contain their associated tracebacks as a __traceback__ attribute in Python 3, so there’s no need to muck around getting the traceback manually. If you really want to give an explicit traceback, you can use the .with_traceback() method, which just assigns to __traceback__ and then returns self.

raise MyLibraryError("Failed to do a thing").with_traceback(some_traceback)

It’s hard to say what it even means to write code that works “equivalently” in both versions, because Python 3 handles this problem largely automatically, and Python 2 code tends to have a variety of ad-hoc solutions. Note that you cannot simply do this:

if PY3:
    raise MyLibraryError("Beep boop") from exc
else:
    raise MyLibraryError, MyLibraryError("Beep boop"), sys.exc_info()[2]

The first raise is a syntax error in Python 2, and the second is a syntax error in Python 3. if won’t protect you from parse errors. (On the other hand, you can hide .with_traceback() behind an if, since that’s just a regular method call and will parse with no issues.)

six has a reraise function that will smooth out the differences for you (probably by using exec). The drawback is that it’s of course Python 2-oriented syntax, and on Python 3 the final traceback will include more context than expected.

Alternatively, there’s a six.raise_from, which is designed around the raise X from Y syntax of Python 3. The drawback is that Python 2 has no obvious equivalent, so you just get raise X, losing the old exception and its traceback.
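
A sketch of both helpers, with stand-ins for the library error and the fragile function:

import sys
import six

class MyLibraryError(Exception):  # stand-in for your library's exception
    pass

def some_fragile_function():  # stand-in for the code that blows up
    raise ValueError('beep')

try:
    try:
        some_fragile_function()
    except Exception as e:
        # Python 2 style: re-raise a new exception with the current traceback.
        six.reraise(MyLibraryError,
                    MyLibraryError('Failed to do a thing'),
                    sys.exc_info()[2])
        # Python 3 style chaining (degrades to a plain raise on Python 2):
        # six.raise_from(MyLibraryError('Failed to do a thing'), e)
except MyLibraryError:
    pass  # the re-raised error carries the original traceback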

There’s no clear right approach here; it depends on how you’re handling re-raising. Code that just blindly raises new exceptions doesn’t need any changes, and will get exception chaining for free on Python 3. Code that does more elaborate things, like implementing its own form of chaining or storing exc_info tuples to be re-raised later, may need a little more care.

Bytestrings are sequences of integers

In Python 2, bytes is a synonym for str, the default string type. Iterating or indexing a bytes/str produces 1-character strs.

list(b'hello')  # ['h', 'e', 'l', 'l', 'o']
b'hello'[0:4]  # 'hell'
b'hello'[0]  # 'h'
b'hello'[0][0][0][0][0]  # 'h' -- it's turtles all the way down

In Python 3, bytes is a specialized type for handling binary data, not text. As such, iterating or indexing a bytes produces integers.

list(b'hello')  # [104, 101, 108, 108, 111]
b'hello'[0:4]  # b'hell'
b'hello'[0]  # 104
b'hello'[0][0][0][0]  # TypeError, since you can't index 104

If you have explicitly binary data that you want to stay bytes in Python 3, this may pose a bit of a problem. Aside from just checking the version explicitly and making heavy use of chr/ord, there are two approaches.

One is to use bytearray instead. This is like bytes, but mutable. More importantly, since it was introduced as a new type in Python 2.6 — after Python 3.0 came out — it has the same iterating and indexing behavior as Python 3’s bytes, even in Python 2.

bytearray(b'hello')[0]  # 104, on either Python 2 or 3

The other is to slice rather than index, since slicing always produces a new iterable of the same type. If you want to extract a single character from a bytes, just take a one-element slice.

b'hello'[0]  # 104 in Python 3, 'h' in Python 2
b'hello'[0:1]  # b'h' in both

Things that are just a royal pain in the ass

Unicode

Saving the best for last, almost!

Honestly, if your Python 2 code is already careful with Unicode — working with unicode internally, and encoding/decoding only at the “boundaries” of your code — then you shouldn’t have too many problems. If your code is not so careful, you should really try to make it a little more careful before you worry about Python 3, since Python 3’s whole jam is to force you to be careful.

See, in Python 2, you can combine bytestrings (str) and text strings (unicode) more or less freely. Python will automatically try to convert between the two using the “default encoding”, which is generally ascii. Python 3 makes text strings the default string type, demotes bytestrings, and forbids ever converting between them.

Most obviously, Python 2’s str and unicode have been renamed to bytes and str in Python 3. If you happen to be using the names anywhere, you’ll probably need to change them! six offers text_type and binary_type, though you can just use bytes to mean the same thing in either version. python-future also has backports for both Python 3’s bytes and str types, which seems like an extreme approach to me. Changing str to mean a text type even in Python 2 might be a good idea, though.
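
six’s names in action:

from six import text_type, binary_type

assert isinstance(u'hi', text_type)    # unicode on 2, str on 3
assert isinstance(b'hi', binary_type)  # str on 2, bytes on 3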

b'' and u'' work the same way in either Python 2 or 3, but unadorned strings like '' are always the str type, which has different behavior. There is a from __future__ import unicode_literals, which will cause unadorned strings to be unicode in Python 2, and this might work for you. However, this prevents you from writing literal “native” strings — strings of the same type Python uses for names, keyword arguments, etc. Usually this won’t matter, since Python 2 will silently convert between bytes and text, but it’s caused me the occasional problem.

The right thing to do is just explicitly mark every single string with either a b or u sigil as necessary. That just, you know, sucks. But you should be doing it even if you’re not porting to Python 3.

basestring is completely gone in Python 3. str and bytes have no common base type, and their semantics are different enough that it rarely makes sense to treat them the same way. If you’re using basestring in Python 2, it’s probably to allow code to work on either form of “text”, and you’ll only want to use str in Python 3 (where bytes are completely unsuitable for text). six.string_types provides exactly this. futurize --stage2 also runs the lib2to3.fixes.fix_basestring fixer, but this replaces basestring with str, which will almost certainly break your code in Python 2. If you intend to use stage 2, definitely audit your uses of basestring first.

As mentioned above, bytestrings are sequences of integers, which may affect code trying to work with explicitly binary data.

Python 2 has both .decode() and .encode() on both bytes and text; if you try to encode bytes or decode text, Python will try to implicitly convert to the right type first. In Python 3, only text has an .encode() and only bytes have a .decode().

Relatedly, Python 2 allows you to do some cute tricks with “encodings” that aren’t really encodings; for example, "hi".encode('hex') produces '6869'. In Python 3, encoding must produce bytes, and decoding must produce text, so these sorts of text-to-text or bytes-to-bytes translations aren’t allowed. You can still do them explicitly with the codecs module, e.g. codecs.encode(b'hi', 'hex'), which also works in Python 2, despite being undocumented. (Note that Python 3 specifically requires bytes for the hex codec, alas. If it’s any consolation, there’s a bytes.hex() method to do this directly, which you can’t use anyway if you’re targeting Python 2.)

Python 3’s open decodes as UTF-8 by default (a vast oversimplification, but usually), so if you’re manually decoding after reading, you’ll get an error in Python 3. You could explicitly open the file in binary mode (preserving the Python 2 behavior), or you could use codecs.open to decode transparently on read (preserving the Python 3 behavior). The same goes for writing.
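
There’s also io.open, which exists in Python 2.6+ and is the same function as Python 3’s built-in open. A sketch, with a hypothetical filename:

import io

with io.open('notes.txt', encoding='utf-8') as f:
    text = f.read()  # text (unicode) on both versions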

sys.stdin, sys.stdout, and sys.stderr are all text streams in Python 3, so they have the same caveats as above, with the additional wrinkle that you didn’t actually open them yourself. Their .buffer attribute gives a handle opened in binary mode (Python 2 behavior), or you can adapt them to transcode transparently (Python 3 behavior):

import codecs
import sys

import six

if six.PY2:
    sys.stdin = codecs.getreader('utf-8')(sys.stdin)
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
    sys.stderr = codecs.getwriter('utf-8')(sys.stderr)

A text-mode file’s .tell() in Python 3 still returns a number that can be passed back to .seek(), but the number is not necessarily meaningful, and in particular can’t be used to estimate progress through a file. (Python uses a few very high bits as flags to indicate the state of the decoder; if you mask them off, what’s left is probably the byte position in the file as you’d expect, but this is pretty definitively a hack.)

Python 3 likes to treat filenames as text, but most of the functions in os and os.path will accept either text or bytes as their arguments (and return a value of the same type), so you should be okay there.

os.environ has text keys and values in Python 3. If you direly need bytes, you can use os.environb (and os.getenvb()).

I think that covers most of the obvious basics. This is a whole sprawling topic that I can’t hope to cover off the top of my head. I’ve seen it be both fairly painful and completely pain-free, depending entirely on the state of the Python 2 codebase.

Oh, one final note: there’s a module for Python 2 called unicode-nazi (sorry, I didn’t name it) that will produce a warning anytime a bytestring is implicitly converted to a text string, or vice versa. It might help you root out places you’re accidentally slopping types back and forth, which will certainly break in Python 3. I’ve only tried it on a comically large project where it found thousands of violations, including plenty in surprising places in the standard library, so it may or may not be of any practical help.

Things that are not actually gone

String formatting with %

There’s a widespread belief that str % ... is deprecated, since there’s a newer and shinier str.format() method.

Well, it’s not. It’s not gone; it’s not deprecated; it still works just fine. I don’t like to use it, myself, since it’s easy to make accidentally ambiguous — "%s" % foo can crash if foo is a tuple! — but it’s not going anywhere. In fact, as of Python 3.5, bytes and bytearray support % but not .format.

optparse

argparse is certainly better, but the optparse module still exists in Python 3. It has been deprecated since Python 3.2, though.

Things that are preposterously obscure but that I have seen cause problems nonetheless

Tuple unpacking

A little-used feature of Python 2 is tuple unpacking in function arguments:

def foo(a, (b, c)):
    print a, b, c

x = (2, 3)
foo(1, x)

This syntax is gone in Python 3. I’ve rarely seen anyone use it, except in two cases. One was a parsing library that relied pretty critically on using it in every parsing function you wrote; whoops.

The other is when sorting a dict’s items:

sorted(d.items(), key=lambda (k, v): k + v)

In Python 3, you have to write that as lambda kv: kv[0] + kv[1]. Boo.

long is gone

Python 3 merged its long type with int, so now there’s only one integral type, called int.

Python 2 promotes int to long pretty much transparently, and longs aren’t very common in the first place, so it’s fairly unlikely that this will make a difference. On the off chance you’re type-checking for integers with isinstance(x, (int, long)) (and really, why are you doing that), you can just use six.integer_types instead.
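
That is:

from six import integer_types

assert isinstance(10 ** 100, integer_types)  # (int, long) on 2, (int,) on 3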

Note that futurize --stage2 applies the lib2to3.fixes.fix_long fixer, which blindly renames long to int, leaving you with inappropriate code like isinstance(x, (int, int)).

However…

I have seen some very obscure cases where a hand-rolled binary protocol would encode ints and longs differently. My advice would be to not do that.

Oh, and a little-known feature of Python 2’s syntax is that you can have long literals by suffixing them with an L:

123  # int
123L  # long

You can write 1267650600228229401496703205376 directly in Python 2 code, and it’ll automatically create a long, so the only reason to do this is if you explicitly need a long with a small value like 1. If that’s the case, something has gone catastrophically wrong.

repr changes

These should really only affect you if you’re using reprs as expected test output (or, god forbid, as cache keys or something). Some notable changes:

  • Unicode strings have a u prefix in Python 2. In Python 3, of course, Unicode strings are just strings, so there’s no prefix.

  • Conversely, bytestrings have a b prefix in Python 3, but not in Python 2 (though the b prefix is allowed in source code).

  • Python 2 escapes all non-ASCII characters, even in the repr of a Unicode string. Python 3 only escapes control characters and codepoints considered non-printing.

  • Large integers and explicit longs have an L suffix in Python 2, but not in Python 3, where there is no separate long type.

  • A set becomes set([1, 2, 3]) in Python 2, but {1, 2, 3} in Python 3. The set literal syntax is allowed in source code in Python 2.7, but the repr wasn’t changed until 3.0.

  • floats stringify to the shortest possible representation that has the same underlying value — e.g., str(1.1) is '1.1' rather than '1.1000000000000001'. This change was backported to Python 2.7 as well, but I have seen it break tests.

Hash randomization

Python has traditionally had a predictable hashing mechanism: repr(dict(a=1, b=2, c=3)) will always produce the same string. (On the same platform with the same Python version, at least.) Unfortunately this opens the door to an obscure DoS exploit that was known to Perl long ago: if you know a web application is written in Python, you can construct a query string that will become a dict whose keys all go in the same hash bucket. If your query string is long enough and you send enough requests, you can tie up all the Python processes in dealing with hash collisions.

The fix is hash randomization, which seeds the hashing algorithm in such a way that items are bucketed differently every time Python runs. It’s available in Python 2.7 via an environment variable or the -R argument, but it wasn’t turned on by default until Python 3.3.

The fear was that it might break things. Naturally, it has broken things. Mostly, reprs in tests. But it also changes the iteration order of dicts between Python runs. I have seen code using dicts whose keys happened to always be sorted in alphabetical or insertion order before, but with hash randomization, the keys were of course in a different order every time the code ran. The author assumed that Python had somehow broken dict sorting (which it has never had).

nonlocal

Python 3 introduces the nonlocal keyword, which is like global except it looks through all outer scopes in the expected order. It fixes this mild annoyance:

def make_function():
    counter = 0
    def function():
        nonlocal counter
        counter += 1  # without 'nonlocal', this declares a new local!
        print("I've been called", counter, "times!")
    return function

The problem is that any use of assignment within a function automatically creates a new local, and locals are known statically for the entire body of the function. (They actually affect how functions are compiled, in CPython.) So without nonlocal, the above code would see counter += 1, but counter is a new local that has never been assigned a value, so Python cannot possibly add 1 to it, and you get an UnboundLocalError.

nonlocal tells Python that when it sees an assignment of a name that exists in some outer scope, it should reuse that outer variable rather than shadowing it. Great, right? Purely a new feature. No problem.

Unfortunately, I’ve worked on a codebase that needed this feature in Python 2, and decided to fake it with a class… named nonlocal.

def make_function():
    class nonlocal:
        counter = 0
    def function():
        nonlocal.counter += 1  # this alters an outer value in-place, so it's fine
        print("I've been called", counter, "times!")
    return function

The class here is used purely as a dummy container. Assigning to an attribute doesn’t create any locals, because it’s equivalent to a method call, so the operand must already exist. This is a slightly quirky approach, but it works fine.

Except that, of course, nonlocal is a keyword in Python 3, so this becomes complete gibberish. It’s such gibberish that (if I remember correctly) 2to3 actually cannot parse it, even though it’s perfectly valid Python 2 code.

I don’t have a magical fix for this one. Just, uh, don’t name things nonlocal.
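
If you still need the trick, a keyword-safe variant is the classic mutable cell, sketched here:

def make_function():
    counter = [0]  # one-element list as a mutable cell; nothing rebinds it
    def function():
        counter[0] += 1
        print("I've been called", counter[0], "times!")
    return function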

List comprehensions no longer leak

Python 2 has the slightly inconsistent behavior that loop variables in a generator expression ((...)) are scoped to the generator expression, but loop variables in a list comprehension ([...]) belong to the enclosing scope.

The only reason is in implementation details: a list comprehension acts like a for loop, which has the same behavior, whereas a generator expression actually creates a generator internally.

Python 3 brings these cases into line: loop variables in list comprehensions (or dict or set comprehensions) are also scoped to the comprehension itself.

I cannot imagine any possible reason why this would affect you negatively, and yet, I can swear I’ve seen it happen. I wish I could remember where, because I’m sure it’s an exciting story.

cStringIO.h is gone

cStringIO.h is a private and undocumented C interface to Python 2’s cStringIO.StringIO type. It was removed in Python 3, or at least is somewhere I can’t find it.

This was one of the reasons Thrift’s Python 3 port took almost 3 years: Thrift has a “fast” C module that makes use of this private interface, and it’s not obvious how to replace it. I think they ended up just having the module not exist on Python 3, so Python 3 will just be mysteriously slower.

Some troublesome libraries

MySQLdb is some ancient, clunky, noncompliant, underdocumented trash, much like the database it connects to. It’s nigh abandoned, though it still promises Python 3 support in the MySQLdb 2.0 vaporware. I would suggest not using MySQL, but barring that, try mysqlclient, a fork of MySQLdb that continues development and adds Python 3 support. (The same people also maintain an earlier project, pymysql, which strives to be a pure-Python drop-in replacement for MySQLdb — it’s not quite perfect, but its existence is interesting and it’s sure easier to read than MySQLdb.)

At a glance, Thrift still hasn’t had a release since it merged Python 3 support, eight months ago. It’s some enterprise nightmare, anyway, and bizarrely does code generation for a bunch of dynamic languages. Might I suggest just using the pure-Python thriftpy, which parses Thrift definitions on the fly?

Twisted is, ah, large and complex. Parts of it now support Python 3; parts of it do not. If you need the parts that don’t, well, maybe you could give them a hand?

M2Crypto is working on it, though I’m pretty sure most Python crypto nerds would advise you to use cryptography instead.

And so on

You may find any number of other obscure compatibility problems, just as you might when upgrading from 2.6 to 2.7. The Python community has a lot of clever people willing to help you out, though, and they’ve probably even seen your super duper niche problem before.

Don’t let that, or this list of gotchas in general, dissuade you! Better to start now than later; even fixing an integer division gets you one step closer to having your code run on Python 3 as well.

The Security of Our Election Systems

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/07/the_security_of_11.html

Russia was behind the hacks into the Democratic National Committee’s computer network that led to the release of thousands of internal emails just before the party’s convention began, U.S. intelligence agencies have reportedly concluded.

The FBI is investigating. WikiLeaks promises there is more data to come. The political nature of this cyberattack means that Democrats and Republicans are trying to spin this as much as possible. Even so, we have to accept that someone is attacking our nation’s computer systems in an apparent attempt to influence a presidential election. This kind of cyberattack targets the very core of our democratic process. And it points to the possibility of an even worse problem in November: that our election systems and our voting machines could be vulnerable to a similar attack.

If the intelligence community has indeed ascertained that Russia is to blame, our government needs to decide what to do in response. This is difficult because the attacks are politically partisan, but it is essential. If foreign governments learn that they can influence our elections with impunity, this opens the door for future manipulations, both document thefts and dumps like this one that we see and more subtle manipulations that we don’t see.

Retaliation is politically fraught and could have serious consequences, but this is an attack against our democracy. We need to confront Russian President Vladimir Putin in some way, politically, economically or in cyberspace, and make it clear that we will not tolerate this kind of interference by any government. Regardless of your political leanings this time, there’s no guarantee the next country that tries to manipulate our elections will share your preferred candidates.

Even more important, we need to secure our election systems before autumn. If Putin’s government has already used a cyberattack to attempt to help Trump win, there’s no reason to believe he won’t do it again, especially now that Trump is inviting the “help.”

Over the years, more and more states have moved to electronic voting machines and have flirted with Internet voting. These systems are insecure and vulnerable to attack.

But while computer security experts like me have sounded the alarm for many years, states have largely ignored the threat, and the machine manufacturers have thrown up enough obfuscating babble that election officials are largely mollified.

We no longer have time for that. We must ignore the machine manufacturers’ spurious claims of security, create tiger teams to test the machines’ and systems’ resistance to attack, drastically increase their cyber-defenses and take them offline if we can’t guarantee their security online.

Longer term, we need to return to election systems that are secure from manipulation. This means voting machines with voter-verified paper audit trails, and no Internet voting. I know it’s slower and less convenient to stick to the old-fashioned way, but the security risks are simply too great.

There are other ways to attack our election system on the Internet besides hacking voting machines or changing vote tallies: deleting voter records, hijacking candidate or party websites, targeting and intimidating campaign workers or donors. There have already been multiple instances of political doxing (publishing personal information and documents about a person or organization), and we could easily see more of it in this election cycle. We need to take these risks much more seriously than before.

Government interference with foreign elections isn’t new, and in fact, that’s something the United States itself has repeatedly done in recent history. Using cyberattacks to influence elections is newer but has been done before, too, most notably in Latin America. Hacking of voting machines isn’t new, either. But what is new is a foreign government interfering with a U.S. national election on a large scale. Our democracy cannot tolerate it, and we as citizens cannot accept it.

Last April, the Obama administration issued an executive order outlining how we as a nation respond to cyberattacks against our critical infrastructure. While our election technology was not explicitly mentioned, our political process is certainly critical. And while they’re a hodgepodge of separate state-run systems, together their security affects every one of us. After everyone has voted, it is essential that both sides believe the election was fair and the results accurate. Otherwise, the election has no legitimacy.

Election security is now a national security issue; federal officials need to take the lead, and they need to do it quickly.

This essay originally appeared in the Washington Post.

Looking for: Systems Administrator

Post Syndicated from Yev original https://www.backblaze.com/blog/looking-systems-administrator/

Want to join a rapidly expanding team and help us grow Backblaze to new heights? We’re looking for a Sys Admin who is looking for a challenging and fast-paced working environment. The position can either be in San Mateo, California or in our Rancho Cordova datacenter! Interested? Check out the job description and application details below:

Here’s what you’ll be working on:

    – Rebuild failed RAID arrays, diagnose and repair file system problems (ext4) and debug other operations problems with minimal supervision.
    – Administrative proficiency in software patches, releases and system upgrades.
    – Troubleshoot and resolve operational problems.
    – Help deploy, configure and maintain production systems.
    – Assist with networks and services (static/dynamic web servers, etc) as needed.
    – Assist in efforts to automate provisioning and other tasks that need to be run across hundreds of servers.
    – Help maintain monitoring systems to measure system availability and detect issues.
    – Help qualify hardware and components.
    – Participate in the 24×7 on-call pager rotation and respond to alerts as needed. This may include occasional trips to Backblaze datacenter(s).
    – Write, design, maintain and support operational Documentation and scripts.
    – Help train operations staff as needed.

This is a must:

    – Strong knowledge of Linux system administration, Debian experience preferred.
    – 4+ years of experience.
    – Bash scripting skills required.
    – Ability to lift/move 50-75 lbs and work down near the floor as needed.
    – Position based in the San Mateo Corporate Office or the Rancho Cordova Datacenter, California.

It would be nice if you had:

    – Experience configuring and supporting (Debian) Linux software RAID (mdadm).
    – Experience configuring and supporting file systems on Linux (Debian).
    – Experience troubleshooting server hardware/component issues.
    – Experience supporting Apache, Tomcat, and Java services.
    – Experience with automation in a production environment (Puppet/Chef/Ansible).
    – Experience supporting network equipment (layer 2 switches).

Required for all Backblaze Employees:

    – Good attitude and willingness to do whatever it takes to get the job done.
    – Strong desire to work for a small fast paced company.
    – Desire to learn and adapt to rapidly changing technologies and work environment.
    – Occasional visits to Backblaze datacenters necessary.
    – Rigorous adherence to best practices.
    – Relentless attention to detail.
    – Excellent interpersonal skills and good oral/written communication.
    – Excellent troubleshooting and problem solving skills.
    – OK with pets in office.

Backblaze is an Equal Opportunity Employer and we offer competitive salary and benefits, including our no policy vacation policy.

If this sounds like you — follow these steps:

  • Send an email to [email protected] with the position in the subject line.
  • Include your resume.
  • Tell us a bit about your Sys Admin experience and why you’re excited to work with Backblaze.

The post Looking for: Systems Administrator appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AWS re:Invent 2016 Registration is Now Open

Post Syndicated from Andy Werth original https://blogs.aws.amazon.com/bigdata/post/Tx72VGZS9D7E2M/AWS-re-Invent-2016-Registration-is-Now-Open

Register now for the fifth annual AWS re:Invent, the largest gathering of the global cloud computing community. Join us in Las Vegas for opportunities to connect, collaborate, and learn about AWS solutions.

There will be many opportunities for developers and data scientists working in big data to sharpen their skills and learn what’s coming next with AWS’s big data services.

This year we are offering all-new technical deep-dives on topics such as IoT, serverless computing, security, and containers. We are also delivering more than 400 sessions, more hands-on labs, bootcamps, and opportunities for one-on-one engagements with AWS experts.

Date: November 28 – December 2, 2016
Location: The Venetian, The Mirage, The Encore in Las Vegas, NV
Full Conference Pass: $1,599

Learn More and Register 

What you’ll gain from attending AWS re:Invent 2016:

  • Exciting new products, services, and other reveals in the Keynote and State of the Union sessions.
  • In-depth technical information across hundreds of breakout sessions.
  • Guidance on AWS services and features, straight from AWS engineers, architects, and partners.
  • Master new skills with technical bootcamps, hands-on workshops, and onsite AWS certification exams.
  • New technical tracks focusing on hot topics, such as containers, serverless computing, and IoT.
  • Industry-specific deep dives, focusing on using AWS in financial services, healthcare, life sciences, government, and digital media.
  • Celebrate with the AWS community at our after-hours activities, including the re:Play party, Tatonka atomic wings competition, a pub crawl, and new this year, a 5k run.

Register now and take advantage of great rates at The Venetian | Palazzo, The Mirage, and the Wynn | Encore. In addition, check out the discounts on domestic and international flights from Delta Air Lines – our preferred global airline. Learn more »

Need help justifying your AWS re:Invent trip to your manager? Use our justification email template.

Stay connected: Join the conversation on Twitter using #reInvent and follow @AWSreInvent for news and updates.

We look forward to seeing you in November!

Best regards,

AWS re:Invent Team

P.S. Interested in sponsorship opportunities at AWS re:Invent 2016? Reach out to us here.

If you need support or have questions, please review the FAQ page or contact us at: [email protected].

Some stuff about color

Post Syndicated from Eevee original https://eev.ee/blog/2016/07/16/some-stuff-about-color/

I’ve been trying to paint more lately, which means I have to actually think about color. Like an artist, I mean. I’m okay at thinking about color as a huge nerd, but I’m still figuring out how to adapt that.

While I work on that, here is some stuff about color from the huge nerd perspective, which may or may not be useful or correct.

Hue

Hues are what we usually think of as “colors”, independent of how light or dim or pale they are: general categories like purple and orange and green.

Strictly speaking, a hue is a specific wavelength of light. I think it’s really weird to think about light as coming in a bunch of wavelengths, so I try not to think about the precise physical mechanism too much. Instead, here’s a rainbow.

rainbow spectrum

These are all the hues the human eye can see. (Well, the ones this image and its colorspace and your screen can express, anyway.) They form a nice spectrum, which wraps around so the two red ends touch.

(And here is the first weird implication of the physical interpretation: purple is not a real color, in the sense that there is no single wavelength of light that we see as purple. The actual spectrum runs from red to blue; when we see red and blue simultaneously, we interpret it as purple.)

The spectrum is divided by three sharp lines: yellow, cyan, and magenta. The areas between those lines are largely dominated by red, green, and blue. These are the two sets of primary colors, those hues from which any others can be mixed.

Red, green, and blue (RGB) make up the additive primary colors, so named because they add light on top of black. LCD screens work exactly this way: each pixel is made up of three small red, green, and blue rectangles. It’s also how the human eye works, which is fascinating but again a bit too physical.

Cyan, magenta, and yellow are the subtractive primary colors, which subtract light from white. This is how ink, paint, and other materials work. When you look at an object, you’re seeing the colors it reflects, which are the colors it doesn’t absorb. A red ink reflects red light, which means it absorbs green and blue light. Cyan ink only absorbs red, and yellow ink only absorbs blue; if you mix them, you’ll get ink that absorbs both red and blue, and thus will appear green. A pure black is often included to make CMYK; mixing all three colors would technically get you black, but it might be a bit muddy and would definitely use three times as much ink.

The great kindergarten lie

Okay, you probably knew all that. What confused me for the longest time was how no one ever mentioned the glaring contradiction with what every kid is taught in grade school art class: that the primary colors are red, blue, and yellow. Where did those come from, and where did they go?

I don’t have a canonical answer for that, but it does make some sense. Here’s a comparison: the first spectrum is a full rainbow, just like the one above. The second is the spectrum you get if you use red, blue, and yellow as primary colors.

a full spectrum of hues, labeled with color names that are roughly evenly distributed
a spectrum of hues made from red, blue, and yellow

The color names come from xkcd’s color survey, which asked a massive number of visitors to give freeform names to a variety of colors. One of the results was a map of names for all the fully-saturated colors, providing a rough consensus for how English speakers refer to them.

The first wheel is what you get if you start with red, green, and blue — but since we’re talking about art class here, it’s really what you get if you start with cyan, magenta, and yellow. The color names are spaced fairly evenly, save for blue and green, which almost entirely consume the bottom half.

The second wheel is what you get if you start with red, blue, and yellow. Red has replaced magenta, and blue has replaced cyan, so neither of those colors appears on the wheel — red and blue are composites in the subtractive model, and you can’t make primary colors like cyan or magenta out of composite colors.

Look what this has done to the distribution of names. Pink and purple have shrunk considerably. Green is half its original size and somewhat duller. Red, orange, and yellow now consume a full half of the wheel.

There’s a really obvious advantage here, if you’re a painter: people are orange.

Yes, yes, we subdivide orange into a lot of more specific colors like “peach” and “brown”, but peach is just pale orange, and brown is just dark orange. Everyone, of every race, is approximately orange. Sunburn makes you redder; fear and sickness make you yellower.

People really like to paint other people, so it makes perfect sense to choose primary colors that easily mix to make people colors.

Meanwhile, cyan and magenta? When will you ever use those? Nothing in nature remotely resembles either of those colors. The true color wheel is incredibly, unnaturally bright. The reduced color wheel is much more subdued, with only one color that stands out as bright: yellow, the color of sunlight.

You may have noticed that I even cheated a little bit. The blue in the second wheel isn’t the same as the blue from the first wheel; it’s halfway between cyan and blue, a tertiary color I like to call azure. True pure blue is just as unnatural as true cyan; azure is closer to the color of the sky, which is reflected as the color of water.

People are orange. Sunlight is yellow. Dirt and rocks and wood are orange. Skies and oceans are blue. Blush and blood and sunburn are red. Sunsets are largely red and orange. Shadows are blue, the opposite of yellow. Plants are green, but in sun or shade they easily skew more blue or yellow.

All of these colors are much easier to mix if you start with red, blue, and yellow. It may not match how color actually works, but it’s a useful approximation for humans. (Anyway, where will you find dyes that are cyan or magenta? Blue is hard enough.)

I’ve actually done some painting since I first thought about this, and would you believe they sell paints in colors other than bright red, blue, and yellow? You can just pick whatever starting colors you want and the whole notion of “primary” goes a bit out the window. So maybe this is all a bit moot.

More on color names

The way we name colors fascinates me.

A “basic color term” is a single, unambiguous, very common name for a group of colors. English has eleven: red, orange, yellow, green, blue, purple, black, white, gray, pink, and brown.

Of these, orange is the only tertiary hue; brown is the only name for a specifically low-saturation color; pink and grey are the only names for specifically light shades. I can understand grey — it’s handy to have a midpoint between black and white — but the other exceptions are quite interesting.

Looking at the first color wheel again, “blue” and “green” together consume almost half of the spectrum. That seems reasonable, since they’re both primary colors, but “red” is relatively small; large chunks of it have been eaten up by its neighbors.

Orange is a tertiary color in either RGB or CMYK: it’s a mix of red and yellow, a primary and secondary color. Yet we ended up with a distinct name for it. I could understand if this were to give white folks’ skin tones their own category, similar to the reasons for the RBY art class model, but we don’t generally refer to white skin as “orange”. So where did this color come from?

Sometimes I imagine a parallel universe where we have common names for other tertiary colors. How much richer would the blue/green side of the color wheel be if “chartreuse” or “azure” were basic color terms? Can you even imagine treating those as distinct colors, not just variants of green or blue? That’s exactly how we treat orange, even though it’s just a variant of red.

I can’t speak to whether our vocabulary truly influences how we perceive or think (and that often-cited BBC report seems to have no real source). But for what it’s worth, I’ve been trying to think of “azure” as distinct for a few years now, and I’ve had a much easier time dealing with blues in art and design. Giving the cyan end of blue a distinct and common name has given me an anchor, something to arrange thoughts around.

Come to think of it, yellow is an interesting case as well. A decent chunk of the spectrum was ultimately called “yellow” in the xkcd map; here’s that chunk zoomed in a bit.

full range of xkcd yellows

How much of this range would you really call yellow, rather than green (or chartreuse!) or orange? Yellow is a remarkably specific color: mixing it even slightly with one of its neighbors loses some of its yellowness, and darkening it moves it swiftly towards brown.

I wonder why this is. When we see a yellowish-orange, are we inclined to think of it as orange because it looks like orange under yellow sunlight? Is it because yellow is between red and green, and the red and green receptors in the human eye pick up on colors that are very close together?


Most human languages develop their color terms in a similar order, with a split between blue and green often coming relatively late in a language’s development. Of particular interest to me is that orange and pink are listed as a common step towards the end — I’m really curious as to whether that happens universally and independently, or it’s just influence from Western color terms.

I’d love to see a list of the basic color terms in various languages, but such a thing is proving elusive. There’s a neat map of how many colors exist in various languages, but it doesn’t mention what the colors are. It’s easy enough to find a list of colors in various languages, like this one, but I have no idea whether they’re basic in each language. Note also that this chart only has columns for English’s eleven basic colors, even though Russian and several other languages have a twelfth basic term for azure. The page even mentions this, but doesn’t include a column for it, which seems ludicrous in an “omniglot” table.

The only language I know many color words in is Japanese, so I went delving into some of its color history. It turns out to be a fascinating example, because you can see how the color names developed right in the spelling of the words.

See, Japanese has a couple different types of words that function like adjectives. Many of the most common ones end in -i, like kawaii, and can be used like verbs — we would translate kawaii as “cute”, but it can function just as well as “to be cute”. I’m under the impression that -i adjectives trace back to Old Japanese, and new ones aren’t created any more.

That’s really interesting, because to my knowledge, only five Japanese color names are in this form: kuroi (black), shiroi (white), akai (red), aoi (blue), and kiiroi (yellow). So these are, necessarily, the first colors the language could describe. If you compare to the chart showing progression of color terms, this is the bottom cell in column IV: white, red, yellow, green/blue, and black.

A great many color names are compounds with iro, “color” — for example, chairo (brown) is cha (tea) + iro. Of the five basic terms above, kiiroi is almost of that form, but unusually still has the -i suffix. (You might think that shiroi contains iro, but shi is a single character distinct from i. kiiroi is actually written with the kanji for iro.) It’s possible, then, that yellow was the latest of these five words — and that would give Old Japanese words for white, red/yellow, green/blue, and black, matching the most common progression.

Skipping ahead some centuries, I was surprised to learn that midori, the word for green, was only promoted to a basic color fairly recently. It’s existed for a long time and originally referred to “greenery”, but it was considered to be a shade of blue (ao) until the Allied occupation after World War II, when teaching guidelines started to mention a blue/green distinction. (I would love to read more details about this, if you have any; the West’s coming in and adding a new color is a fascinating phenomenon, and I wonder what other substantial changes were made to education.)

Japanese still has a number of compound words that use ao (blue!) to mean what we would consider green: aoshingou is a green traffic light, aoao means “lush” in a natural sense, aonisai is a greenhorn (presumably from the color of unripe fruit), aojiru is a drink made from leafy vegetables, and so on.

This brings us to at least six basic colors, the fairly universal ones: black, white, red, yellow, blue, and green. What others does Japanese have?

From here, it’s a little harder to tell. I’m not exactly fluent and definitely not a native speaker, and resources aimed at native English speakers are more likely to list colors familiar to English speakers. (I mean, until this week, I never knew just how common it was for aoi to mean green, even though midori as a basic color is only about as old as my parents.)

I do know two curious standouts: pinku (pink) and orenji (orange), both English loanwords. I can’t be sure that they’re truly basic color terms, but they sure do come up a lot. The thing is, Japanese already has names for these colors: momoiro (the color of peach — flowers, not the fruit!) and daidaiiro (the color of, um, an orange). Why adopt loanwords for concepts that already exist?

I strongly suspect, but cannot remotely qualify, that pink and orange weren’t basic colors until Western culture introduced the idea that they could be — and so the language adopted the idea and the words simultaneously. (A similar thing happened with grey, natively haiiro and borrowed as guree, but in my limited experience even the loanword doesn’t seem to be very common.)

Based on the shape of the words and my own unqualified guesses of what counts as “basic”, the progression of basic colors in Japanese seems to be:

  1. black, white, red (+ yellow), blue (+ green) — Old Japanese
  2. yellow — later Old Japanese
  3. brown — sometime in the past millennium
  4. green — after WWII
  5. pink, orange — last few decades?

And in an effort to put a teeny bit more actual research into this, I searched the Leeds Japanese word frequency list (drawn from websites, so modern Japanese) for some color words. Here’s the rank of each. Word frequency is generally such that the actual frequency of a word is inversely proportional to its rank — so a word in rank 100 is twice as common as a word in rank 200. The five -i colors are split into both noun and adjective forms, so I’ve included an adjusted rank that you would see if they were counted as a single word, using ab / (a + b).

  • white: 1010 ≈ 1959 (as a noun) + 2083 (as an adjective)
  • red: 1198 ≈ 2101 (n) + 2790 (adj)
  • black: 1253 ≈ 2017 (n) + 3313 (adj)
  • blue: 1619 ≈ 2846 (n) + 3757 (adj)
  • green: 2710
  • yellow: 3316 ≈ 6088 (n) + 7284 (adj)
  • orange: 4732 (orenji), n/a (daidaiiro)
  • pink: 4887 (pinku), n/a (momoiro)
  • purple: 6502 (murasaki)
  • grey: 8472 (guree), 10848 (haiiro)
  • brown: 10622 (chairo)
  • gold: 12818 (kin’iro)
  • silver: n/a (gin’iro)
  • navy: n/a (kon)

n/a” doesn’t mean the word is never used, only that it wasn’t in the top 15,000.

I’m not sure where the cutoff is for “basic” color terms, but it’s interesting to see where the gaps lie. I’m especially surprised that yellow is so far down, and that purple (which I hadn’t even mentioned here) is as high as it is. Also, green is above yellow, despite having been a basic color for less than a century! Go, green.

For comparison, in American English:

  • black: 254
  • white: 302
  • red: 598
  • blue: 845
  • green: 893
  • yellow: 1675
  • brown: 1782
  • golden: 1835
  • gray: 1949
  • pink: 2512
  • orange: 3171
  • purple: 3931
  • silver: n/a
  • navy: n/a

Don’t read too much into the actual ranks; the languages and corpuses are both very different.

Color models

There are numerous ways to arrange and identify colors, much as there are numerous ways to identify points in 3D space. There are also benefits and drawbacks to each model, but I’m often most interested in how much sense the model makes to me as a squishy human.

RGB is the most familiar to anyone who does things with computers — it splits a color into its red, green, and blue channels, and measures the amount of each from “none” to “maximum”. (HTML sets this range as 0 to 255, but you could just as well call it 0 to 1, or -4 to 7600.)

RGB has a couple of interesting problems. Most notably, it’s kind of difficult to read and write by hand. You can sort of get used to how it works, though I’m still not particularly great at it. I keep in mind these rules:

  1. The largest channel is roughly how bright the color is.

    This follows pretty easily from the definition of RGB: it’s colored light added on top of black. The maximum amount of every color makes white, so less than the maximum must be darker, and of course none of any color stays black.

  2. The smallest channel is how pale (desaturated) the color is.

    Mixing equal amounts of red, green, and blue will produce grey. So if the smallest channel is green, you can imagine “splitting” the color between a grey (green, green, green), and the leftovers (red – green, 0, blue – green). Mixing grey with a color will of course make it paler — less saturated, closer to grey — so the bigger the smallest channel, the greyer the color.

  3. Whatever’s left over tells you the hue.

It might be time for an illustration. Consider the color (50%, 62.5%, 75%). The brightness is “capped” at 75%, the largest channel; the desaturation is 50%, the smallest channel. Here’s what that looks like.

illustration of the color (50%, 62.5%, 75%) split into three chunks of 50%, 25%, and 25%

Cutting out the grey and the darkness leaves a chunk in the middle of actual differences between the colors. Note that I’ve normalized it to (0%, 50%, 100%), which is the percentage of that small middle range. Removing the smallest and largest channels will always leave you with a middle chunk where at least one channel is 0% and at least one channel is 100%. (Or it’s grey, and there is no middle chunk.)

The odd one out is green at 50%, so the hue of this color is halfway between cyan (green + blue) and blue. That hue is… azure! So this color is a slightly darkened and fairly dull azure. (The actual amount of “greyness” is the smallest relative to the largest, so in this case it’s about ⅔ grey, or about ⅓ saturated.) Here’s that color.

a slightly darkened, fairly dull azure
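
Here’s a little Python sketch of that mental procedure, run on the same example color (the function name is mine, not any standard API):

```python
def split_rgb(r, g, b):
    """Split a color (channels 0-1) into brightness, greyness, and a
    normalized hue chunk, following the three rules above."""
    lo, hi = min(r, g, b), max(r, g, b)
    brightness = hi                                # rule 1: largest channel
    greyness = 0.0 if hi == 0 else lo / hi         # rule 2: smallest channel
    if hi == lo:
        hue_chunk = None                           # pure grey: no hue
    else:
        hue_chunk = tuple((c - lo) / (hi - lo) for c in (r, g, b))
    return brightness, greyness, hue_chunk

print(split_rgb(0.50, 0.625, 0.75))
# (0.75, 0.666..., (0.0, 0.5, 1.0)): ¾ bright, ⅔ grey, hue halfway from cyan to blue
```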

This is a bit of a pain to do in your head all the time, so why not do it directly?

HSV is what you get when you directly represent colors as hue, saturation, and value. It’s often depicted as a cylinder, with hue represented as an angle around the color wheel: 0° for red, 120° for green, and 240° for blue. Saturation ranges from grey to a fully-saturated color, and value ranges from black to, er, the color. The azure above is (210°, ⅓, ¾) in HSV — 210° is halfway between 180° (cyan) and 240° (blue), ⅓ is the saturation measurement mentioned before, and ¾ is the largest channel.

It’s that hand-waved value bit that gives me trouble. I don’t really know how to intuitively explain what value is, which makes it hard to modify value to make the changes I want. I feel like I should have a better grasp of this after a year and a half of drawing, but alas.

I prefer HSL, which uses hue, saturation, and lightness. Lightness ranges from black to white, with the unperturbed color in the middle. Here’s lightness versus value for the azure color. (Its lightness is ⅝, the average of the smallest and largest channels.)

comparison of lightness and value for the azure color

The lightness just makes more sense to me. I can understand shifting a color towards white or black, and the color in the middle of that bar feels related to the azure I started with. Value looks almost arbitrary; I don’t know where the color at the far end comes from, and it just doesn’t seem to have anything to do with the original azure.
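
(Incidentally, Python’s standard library can do both conversions, though it calls HSL “HLS” and orders the channels differently. Using the azure from above:)

```python
import colorsys

r, g, b = 0.50, 0.625, 0.75                # the azure from the example

h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(h * 360, s, v)                        # 210.0  0.333...  0.75

h, l, s = colorsys.rgb_to_hls(r, g, b)      # note: HLS, not HSL!
print(h * 360, l, s)                        # 210.0  0.625  0.333...
```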

I’d hoped Wikipedia could clarify this for me. It tells me value is the same thing as brightness, but the mathematical definition on that page matches the definition of intensity from the little-used HSI model. I looked up lightness instead, and the first sentence says it’s also known as value. So lightness is value is brightness is intensity, but also they’re all completely different.

Wikipedia also says that HSV is sometimes known as HSB (where the “B” is for “brightness”), but I swear I’ve only ever seen HSB used as a synonym for HSL. I don’t know anything any more.

Oh, and in case you weren’t confused enough, the definition of “saturation” is different in HSV and HSL. Good luck!

Wikipedia does have some very nice illustrations of HSV and HSL, though, including depictions of them as a cone and double cone.

(Incidentally, you can use HSL directly in CSS now — there are hsl() and hsla() CSS3 functions which evaluate as colors. Combining these with Sass’s scale-color() function makes it fairly easy to come up with decent colors by hand, without having to go back and forth with an image editor. And I can even sort of read them later!)

An annoying problem with all of these models is that the idea of “lightness” is never quite consistent. Even in HSL, a yellow will appear much brighter than a blue with the same saturation and lightness. You may even have noticed in the RGB split diagram that I used dark red and green text, but light blue — the pure blue is so dark that a darker blue on top is hard to read! Yet all three colors have the same lightness in HSL, and the same value in HSV.

Clearly neither of these definitions of lightness or brightness or whatever is really working. There’s a thing called luminance, which is a weighted sum of the red, green, and blue channels that puts green as a whopping ten times brighter than blue. It tends to reflect how bright colors actually appear.
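
A rough sketch of that weighting, using the common Rec. 709 coefficients (which assume linear RGB channels; more on linearity in the colorspace section below):

```python
def luminance(r, g, b):
    # Rec. 709 weights; green dominates and blue barely registers.
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

print(luminance(0, 1, 0))  # pure green: ~0.72
print(luminance(0, 0, 1))  # pure blue:  ~0.07, about ten times dimmer
```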

Unfortunately, luminance and related values are only used in fairly obscure color models, like YUV and Lab. I don’t mean “obscure” in the sense that nobody uses them, but rather that they’re very specialized and not often seen outside their particular niches: YUV is very common in video encoding, and Lab is useful for serious photo editing.

Lab is pretty interesting, since it’s intended to resemble how human vision works. It’s designed around the opponent process theory, which states that humans see color in three pairs of opposites: black/white, red/green, and yellow/blue. The idea is that we perceive color as somewhere along these axes, so a redder color necessarily appears less green — put another way, while it’s possible to see “yellowish green”, there’s no such thing as a “yellowish blue”.

(I wonder if that explains our affection for orange: we effectively perceive yellow as a fourth distinct primary color.)

Lab runs with this idea, making its three channels be lightness (but not the HSL lightness!), a (green to red), and b (blue to yellow). The neutral points for a and b are at zero, with green/blue extending in the negative direction and red/yellow extending in the positive direction.

Lab can express a whole bunch of colors beyond RGB, meaning they can’t be shown on a monitor, or even represented in most image formats. And you now have four primary colors in opposing pairs. That all makes it pretty weird, and I’ve actually never used it myself, but I vaguely aspire to do so someday.

I think those are all of the major ones. There’s also XYZ, which I think is some kind of master color model. Of course there’s CMYK, which is used for printing, but it’s effectively just the inverse of RGB.

With that out of the way, now we can get to the hard part!

Colorspaces

I called RGB a color model: a way to break colors into component parts.

Unfortunately, RGB alone can’t actually describe a color. You can tell me you have a color (0%, 50%, 100%), but what does that mean? 100% of what? What is “the most blue”? More importantly, how do you build a monitor that can display “the most blue” the same way as other monitors? Without some kind of absolute reference point, this is meaningless.

A color space is a color model plus enough information to map the model to absolute real-world colors. There are a lot of these. I’m looking at Krita’s list of built-in colorspaces and there are at least a hundred, most of them RGB.

I admit I’m bad at colorspaces and have basically done my best to not ever have to think about them, because they’re a big tangled mess and hard to reason about.

For example! The effective default RGB colorspace that almost everything will assume you’re using by default is sRGB, specifically designed to be this kind of global default. Okay, great.

Now, sRGB has gamma built in. Gamma correction means slapping an exponent on color values to skew them towards or away from black. The color is assumed to be in the range 0–1, so any positive power will produce output from 0–1 as well. An exponent greater than 1 will skew towards black (because you’re multiplying a number less than 1 by itself), whereas an exponent less than 1 will skew away from black.

What this means is that halfway between black and white in sRGB isn’t (50%, 50%, 50%), but around (73%, 73%, 73%). Here’s a great example, borrowed from this post (with numbers out of 255):

alternating black and white lines alongside gray squares of 128 and 187

Which one looks more like the alternating bands of black and white lines? Surely the one you pick is the color that’s actually halfway between black and white.

And yet, in most software that displays or edits images, interpolating white and black will give you a 50% gray — much darker than the original looked. A quick test is to scale that image down by half and see whether the result looks closer to the top square or the bottom square. (Firefox, Chrome, and GIMP get it wrong; Krita gets it right.)

The right thing to do here is convert an image to a linear colorspace before modifying it, then convert it back for display. In a linear colorspace, halfway between white and black is still 50%, but it looks like the 73% grey. This is great fun: it involves a piecewise function and an exponent of 2.4.
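
If you’re curious, the actual sRGB transfer functions look roughly like this (the standard piecewise curves; a sketch, not a full color-management library):

```python
def srgb_to_linear(c):
    # Decode an sRGB value (0-1): linear segment near black, then a 2.4 power.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c):
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

print(linear_to_srgb(0.5))  # ~0.735 (linear 50% grey encodes as ~73% sRGB)
print(srgb_to_linear(0.5))  # ~0.214 ("50% grey" in sRGB is much darker than it claims)
```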

It’s really difficult to reason about this, for much the same reason that it’s hard to grasp text encoding problems in languages with only one string type. Ultimately you still have an RGB triplet at every stage, and it’s very easy to lose track of what kind of RGB that is. Then there’s the fact that most images don’t specify a colorspace in the first place so you can’t be entirely sure whether it’s sRGB, linear sRGB, or something else entirely; monitors can have their own color profiles; you may or may not be using a program that respects an embedded color profile; and so on. How can you ever tell what you’re actually looking at and whether it’s correct? I can barely keep track of what I mean by “50% grey”.

And then… what about transparency? Should a 50% transparent white atop solid black look like 50% grey, or 73% grey? Krita seems to leave it to the colorspace: sRGB gives the former, but linear sRGB gives the latter. Does this mean I should paint in a linear colorspace? I don’t know! (Maybe I’ll give it a try and see what happens.)

Something I genuinely can’t answer is what effect this has on HSV and HSL, which are defined in terms of RGB. Is there such a thing as linear HSL? Does anyone ever talk about this? Would it make lightness more sensible?

There is a good reason for this, at least: the human eye is better at distinguishing dark colors than light ones. I was surprised to learn that, but of course, it’s been hidden from me by sRGB, which is deliberately skewed to dedicate more space to darker colors. In a linear colorspace, a gradient from white to black would have a lot of indistinguishable light colors, but appear to have severe banding among the darks.

several different black to white gradients

All three of these are regular black-to-white gradients drawn in 8-bit color (i.e., channels range from 0 to 255). The top one is the naïve result if you draw such a gradient in sRGB: the midpoint is the too-dark 50% grey. The middle one is that same gradient, but drawn in a linear colorspace. Obviously, a lot of dark colors are “missing”, in the sense that we could see them but there’s no way to express them in linear color. The bottom gradient makes this more clear: it’s a gradient of all the greys expressible in linear sRGB.

This is the first time I’ve ever delved so deeply into exactly how sRGB works, and I admit it’s kind of blowing my mind a bit. Straightforward linear color is so much lighter, and this huge bias gives us a lot more to work with. Also, 73% being the midpoint certainly explains a few things about my problems with understanding brightness of colors.

There are other RGB colorspaces, of course, and I suppose they all make for an equivalent CMYK colorspace. YUV and Lab are families of colorspaces, though I think most people talking about Lab specifically mean CIELAB (or “L*a*b*”), and there aren’t really any competitors. HSL and HSV are defined in terms of RGB, and image data is rarely stored directly as either, so there aren’t really HSL or HSV colorspaces.

I think that exhausts all the things I know.

Real world color is also a lie

Just in case you thought these problems were somehow unique to computers. Surprise! Modelling color is hard because color is hard.

I’m sure you’ve seen the checker shadow illusion, possibly one of the most effective optical illusions, where the presence of a shadow makes a gray square look radically different than a nearby square of the same color.

Our eyes are very good at stripping away ambient light effects to tell what color something “really” is. Have you ever been outside in bright summer weather for a while, then come inside and everything is starkly blue? Lingering compensation for the yellow sunlight shifting everything to be slightly yellow; the opposite of yellow is blue.

Or, here, I like this. I’m sure there are more drastic examples floating around, but this is the best I could come up with. Here are some Pikachu I found via GIS.

photo of Pikachu plushes on a shelf

My question for you is: what color is Pikachu?

Would you believe… orange?

photo of Pikachu plushes on a shelf, overlaid with color swatches; the Pikachu in the background are orange

In each box, the bottom color is what I color-dropped, and the top color is the same hue with 100% saturation and 50% lightness. It’s the same spot, on the same plush, right next to each other — but the one in the background is orange, not yellow. At best, it’s brown.

What we see as “yellow in shadow” and interpret to be “yellow, but darker” turns out to be another color entirely. (The grey whistles are, likewise, slightly blue.)

Did you know that mirrors are green? You can see it in a mirror tunnel: the image gets slightly greener as it goes through the mirror over and over.

Distant mountains and other objects, of course, look bluer.

This all makes painting rather complicated, since it’s not actually about painting things the color that they “are”, but painting them in such a way that a human viewer will interpret them appropriately.

I, er, don’t know enough to really get very deep here. I really should, seeing as I keep trying to paint things, but I don’t have a great handle on it yet. I’ll have to defer to Mel’s color tutorial. (warning: big)

Blending modes

You know, those things in Photoshop.

I’ve always found these remarkably unintuitive. Most of them have names that don’t remotely describe what they do, the math doesn’t necessarily translate to useful understanding, and they’re incredibly poorly-documented. So I went hunting for some precise definitions, even if I had to read GIMP’s or Krita’s source code.

In the following, A is a starting image, and B is something being drawn on top with the given blending mode. (In the case of layers, B is the layer with the mode, and A is everything underneath.) Generally, the same operation is done on each of the RGB channels independently. Everything is scaled to 0–1, and results are generally clamped to that range.

I believe all of these treat layer alpha the same way: linear interpolation between A and the combination of A and B. If B has alpha t, and the blending mode is a function f, then the result is t × f(A, B) + (1 - t) × A.

If A and B themselves have alpha, the result is a little more complicated, and probably not that interesting. It tends to work how you’d expect. (If you’re really curious, look at the definition of BLEND() in GIMP’s developer docs.)

  • Normal: B. No blending is done; new pixels replace old pixels.

  • Multiply: A × B. As the name suggests, the channels are multiplied together. This is very common in digital painting for slapping on a basic shadow or tinting a whole image.

    I think the name has always thrown me off just a bit because “Multiply” sounds like it should make things bigger and thus brighter — but because we’re dealing with values from 0 to 1, Multiply can only ever make colors darker.

    Multiplying with black produces black. Multiplying with white leaves the other color unchanged. Multiplying with a gray is equivalent to blending with black. Multiplying a color with itself squares the color, which is similar to applying gamma correction.

    Multiply is commutative — if you swap A and B, you get the same result.

  • Screen: 1 - (1 - A)(1 - B). This is sort of an inverse of Multiply; it multiplies darkness rather than lightness. It’s defined as inverting both colors, multiplying, and inverting the result. Accordingly, Screen can only make colors lighter, and is also commutative. All the properties of Multiply apply to Screen, just inverted.

  • Hard Light: Equivalent to Multiply if B is dark (i.e., less than 0.5), or Screen if B is light. There’s an additional factor of 2 included to compensate for how the range of B is split in half: Hard Light with B = 0.4 is equivalent to Multiply with B = 0.8, since 0.4 is 0.8 of the way to 0.5. Right.

    This seems like a possibly useful way to apply basic highlights and shadows with a single layer? I may give it a try.

    The math is commutative, but since B is checked and A is not, Hard Light is itself not commutative.

  • Soft Light: Like Hard Light, but softer. No, really. There are several different versions of this, and they’re all a bit of a mess, not very helpful for understanding what’s going on.

    If you graphed the effect various values of B had on a color, you’d have a straight line from 0 up to 1 (at B = 0.5), and then it would abruptly change to a straight line back down to 0. Soft Light just seeks to get rid of that crease. Here’s Hard Light compared with GIMP’s Soft Light, where A is a black to white gradient from bottom to top, and B is a black to white gradient from left to right.

    graphs of combinations of all grays with Hard Light versus Soft Light

    You can clearly see the crease in the middle of Hard Light, where B = 0.5 and it transitions from Multiply to Screen.

  • Overlay: Equivalent to either Hard Light or Soft Light, depending on who you ask. In GIMP, it’s Soft Light; in Krita, it’s Hard Light except the check is done on A rather than B. Given the ambiguity, I think I’d rather just stick with Hard Light or Soft Light explicitly.

  • Difference: abs(A - B). Does what it says on the tin. I don’t know why you would use this? Difference with black causes no change; Difference with white inverts the colors. Commutative.

  • Addition and Subtract: A + B and A - B. I didn’t think much of these until I discovered that Krita has a built-in brush that uses Addition mode. It’s essentially just a soft spraypaint brush, but because it uses Addition, painting over the same area with a dark color will gradually turn the center white, while the fainter edges remain dark. The result is a fiery glow effect, which is pretty cool. I used it manually as a layer mode for a similar effect, to make a field of sparkles. I don’t know if there are more general applications.

    Addition is commutative, of course, but Subtract is not.

  • Divide: A ÷ B. Apparently this is the same as changing the white point to B. Accordingly, the result will blow out towards white very quickly as B gets darker.

  • Dodge and Burn: A ÷ (1 - B) and 1 - (1 - A) ÷ B. Inverses in the same way as Multiply and Screen. Similar to Divide, but with B inverted — so Dodge changes the white point to 1 - B, with similar caveats as Divide. I’ve never seen either of these effects not look horrendously gaudy, but I think photographers manage to use them, somehow.

  • Darken Only and Lighten Only: min(A, B) and max(A, B). Commutative.

  • Linear Light: (2 × A + B) - 1. I think this is the same as Sai’s “Lumi and Shade” mode, which is very popular, at least in this house. It works very well for simple lighting effects, and shares the Soft/Hard Light property that darker colors darken and lighter colors lighten, but I don’t have a great grasp of it yet and don’t know quite how to explain what it does. So I made another graph:

    graph of Linear Light, with a diagonal band of shading going from upper left to bottom right

    Super weird! Half the graph is solid black or white; you have to stay in that sweet zone in the middle to get reasonable results.

    This is actually a combination of two other modes, Linear Dodge and Linear Burn, combined in much the same way as Hard Light. I’ve never encountered them used on their own, though.

  • Hue, Saturation, Value: Work like you might expect: converts A to HSV and replaces either its hue, saturation, or value with B’s.

  • Color: Uses HSL, unlike the above three. Combines B’s hue and saturation with A’s lightness.

  • Grain Extract and Grain Merge: A - B + 0.5 and A + B - 0.5. These are clearly related to film grain, somehow, but their exact use eludes me.

    I did find this example post where someone combines a photo with a blurred copy using Grain Extract and Grain Merge. Grain Extract picked out areas of sharp contrast, and Grain Merge emphasized them, which seems relevant enough to film grain. I might give these a try sometime.

Those are all the modes in GIMP (except Dissolve, which isn’t a real blend mode; also, GIMP doesn’t have Linear Light). Photoshop has a handful more. Krita has a preposterous number of other modes, no, really, it is absolutely ridiculous, you cannot even imagine.
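
To make these formulas concrete, here’s a minimal Python sketch of a few of the modes above, on single channels in 0–1 (my own toy code, not how GIMP or Krita implement them):

```python
def multiply(a, b):   return a * b
def screen(a, b):     return 1 - (1 - a) * (1 - b)
def difference(a, b): return abs(a - b)

def hard_light(a, b):
    # Multiply for dark B, Screen for light B, with the factor of 2
    # compensating for splitting B's range in half.
    return multiply(a, 2 * b) if b < 0.5 else screen(a, 2 * b - 1)

def linear_light(a, b):
    return min(1.0, max(0.0, 2 * a + b - 1))  # clamps; half the graph saturates

def blend(f, a, b, t=1.0):
    """Layer alpha t works the same for every mode: t*f(A,B) + (1-t)*A."""
    return t * f(a, b) + (1 - t) * a

print(blend(multiply, 0.8, 0.5))    # 0.4: Multiply only ever darkens
print(blend(hard_light, 0.8, 0.4))  # 0.64, same as Multiply with B = 0.8
```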

I may be out of things

There’s plenty more to say about color, both technically and design-wise — contrast and harmony, color blindness, relativity, dithering, etc. I don’t know if I can say any of it with any particular confidence, though, so perhaps it’s best I stop here.

I hope some of this was instructive, or at least interesting!

Moving the Utilization Needle with Hadoop Overcommit

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/147408435396

yahoohadoop:

By Nathan Roberts and Jason Lowe

Hadoop was developed at Yahoo more than 10 years ago and its usage continues to demonstrate significant growth year-in and year-out. Because Hadoop is an Apache open source project, this growth is not limited to just Yahoo. Hundreds, if not thousands, of companies are turning to Hadoop to power their big data analytics. Below is a graph of the gigabyte-hours consumed per day on Yahoo Hadoop clusters (where 1 GB of RAM used for 1 hour = 1 GB-hour). As you can see in the graph, the demand for compute resources on our clusters shows no signs of slowing down.

image

The resource scheduler within Hadoop (a.k.a. YARN – Yet Another Resource Negotiator) makes it easy to support new big-data applications within our Hadoop clusters, and this flexibility helps fuel sustained growth. We’ve fully embraced YARN by supporting several new application domains like Hive on Tez, Pig on Tez, Spark, and various machine learning applications. As we find new use cases for the existing frameworks and continue adding even more frameworks, the demand for big-data compute resources will continue to grow for the foreseeable future.

20,000,000 GB-hours in this chart equate to millions of dollars per year worth of compute. Given the size of this investment, we are constantly exploring ways to use our compute hardware as efficiently as possible. Over the past couple of years, there have been several achievements in this area:

  • We have developed tools that allow application owners to measure and optimize the resource utilization of their applications. As an example, one such tool compares allocated container sizes vs. the size of the MapReduce tasks running within the containers.
  • Hive and Pig have migrated to the Tez DAG execution framework which is significantly more efficient than MapReduce for these workloads.
  • Several performance improvements have been made to increase the scale at which we can run Hadoop. Central daemons such as the Namenode and Resourcemanager can throttle large clusters if not operating efficiently.

And most recently,

  • We’ve introduced a unique Dynamic Overcommit feature which improves the coordination between the YARN scheduler and the actual hardware utilization within the cluster. This additional coordination allows YARN to take advantage of the fact that containers don’t always make use of the resource they’ve reserved. At the 9th Annual Hadoop Summit last month, which Yahoo co-hosted, this was the focus of one of Yahoo’s keynote addresses. We also devoted time to the subject in a breakout session. Videos of the talks can be seen here and here, respectively.

The concept of Dynamic Overcommit is simple: Employ techniques to take advantage of reserved, but unused resources in the cluster.

The graph below illustrates the opportunities that Dynamic Overcommit offers. The shaded portions illustrate times when the cluster is fully reserved while CPU and Memory utilizations are well below 100%. This is the first-order opportunity that Dynamic Overcommit immediately addresses, i.e. improve system utilization when YARN is fully utilized. Perhaps more importantly, once the first opportunity is addressed, we can run the cluster with less overall headroom (because we’re making full use of the available physical resource). Less headroom means an overall increase in utilization, and this is where the big wins are. Let’s say we can increase average CPU utilization from 40% to 60% – that’s a 50% increase in the amount of actual work the cluster is getting done!

image

How Does Dynamic Overcommit Work?

Every container running in a YARN cluster declares the amount of Memory and CPU it will need to perform its job. YARN monitors all containers to make sure they do not exceed their declaration. In the case of memory, this is a non-negotiable contract – if a container exceeds its memory allocation, it is immediately terminated.  Obviously, applications will avoid this situation; therefore, it’s essentially guaranteed that all containers will have some amount of “padding” built into their configuration. Furthermore, most containers don’t use all their resource all of the time, leading to even more unused resource. When you add up all this unused resource, it turns out to be a significant opportunity. Dynamic Overcommit takes advantage of these unused resources by safely and effectively over-booking the nodes in the cluster.

The challenge with overbooking resources is this simple question: “What happens when everyone actually wants all the resources they requested?” With the Dynamic Overcommit feature, we take a very simple, yet highly-effective approach – we overcommit nodes based on their current utilization ratio and then in the event a node runs out of a resource, we preempt the most-recently launched containers until the situation improves. One of the advantages of Hadoop is that it is designed from the ground up to deal with failures. If an individual component fails, the application frameworks know how to re-run the work somewhere else, with only minimal impact to the application. Other than having to re-run pre-empted containers, there should be no other impact to applications.
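
To sketch the idea in code (entirely hypothetical names and thresholds; the real feature lives inside the YARN ResourceManager and NodeManagers):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Container:
    id: str
    launch_time: float = field(default_factory=time.time)

@dataclass
class Node:
    physical_mb: int
    used_mb: int = 0
    containers: list = field(default_factory=list)

def advertised_capacity(node, utilization, overcommit_ratio=1.25):
    """Tell the scheduler the node is bigger than it is while actual
    utilization stays low, e.g. a 128GB node advertised as 160GB.
    (The 0.6 threshold and 1.25 ratio are made-up tunables.)"""
    return int(node.physical_mb * (overcommit_ratio if utilization < 0.6 else 1.0))

def preempt_if_needed(node, freed_mb_per_kill=2048):
    """If the node genuinely runs out of memory, preempt the most recently
    launched containers first; frameworks re-run the lost work elsewhere."""
    victims = []
    while node.used_mb > node.physical_mb and node.containers:
        victim = max(node.containers, key=lambda c: c.launch_time)
        node.containers.remove(victim)
        node.used_mb -= freed_mb_per_kill  # rough credit for the freed memory
        victims.append(victim.id)
    return victims
```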

The following diagram illustrates the basic design of Dynamic Overcommit. Notice how the node resource was adjusted from 128GB to 160GB; this is overcommit in action.

image

Results

We have enabled Dynamic Overcommit in several of our Hadoop clusters, but one large research cluster in particular is where we perform most of our measurements and tuning. This section describes results from this research cluster. The shaded regions in the following graph illustrate times when this cluster is highly-utilized from a YARN perspective. Prior to overcommit being enabled, CPU and Memory utilization would have been in the 40-50% range. With overcommit, these metrics are in the 50-80% range.

image

But what about work lost due to too much overcommit? The following graph illustrates GB-hours gained due to overcommit vs. GB-hours lost due to preemption. For the time period shown, we gained 3.3 million GB-hours at the price of 502 GB-hours lost to preemption. This demonstrates that with the current configuration, it’s very rare for us to have to preempt containers.

image

What’s next for Dynamic Overcommit?

  • Sync with the Apache community on how best to get this capability into Apache. Our implementation has been very successful and is serving us well within Yahoo. Now we look forward to working with the Hadoop community to help finalize similar capabilities in Apache Hadoop. For reference, related Apache jira are listed below:

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content

Post Syndicated from SmartNews original https://blogs.aws.amazon.com/bigdata/post/Tx2V1BSKGITCMTU/How-SmartNews-Built-a-Lambda-Architecture-on-AWS-to-Analyze-Customer-Behavior-an

This is a guest post by Takumi Sakamoto, a software engineer at SmartNews. SmartNews in their own words: “SmartNews is a machine learning-based news discovery app that delivers the very best stories on the Web for more than 18 million users worldwide.”

Data processing is one of the key technologies for SmartNews. Every team’s workload involves data processing for various purposes. The news team at SmartNews uses data as input to their machine learning algorithm for delivering the very best stories on the Web. The product team relies on data to run various A/B tests, to learn about how our customers consume news articles, and to make product decisions.

To meet the goals of both teams, we built a sustainable data platform based on the lambda architecture, which is a data-processing framework that handles a massive amount of data and integrates batch and real-time processing within a single framework.

Thanks to AWS services and OSS technologies, our data platform is highly scalable and reliable, and is flexible enough to satisfy various requirements with minimum cost and effort.

Our current system generates tens of GBs of data from multiple data sources, and runs daily aggregation queries or machine learning algorithms on datasets with hundreds of GBs. Some outputs by machine learning algorithms are joined on data streams for gathering user feedback in near real-time (e.g. the last 5 minutes). It lets us adapt our product for users with minimum latency. In this post, I’ll show you how we built a SmartNews data platform on AWS.

The image below depicts the platform. Please scroll to see the full architecture.

Design principles

Before I dive into how we built our data platform, it’s important to know the design principles behind the architecture.

When we started to discuss the data platform, most data was stored in a document database. Although it was a good fit at product launch, it became painful as we grew. For data platform maintainers, it was very expensive to store and serve data at scale. At that time, our system generated more than 10 GB of user activity records every day and processing time increased linearly. For data platform users, it was hard to try something new for data processing because of the database’s insufficient scalability and limited integration with the big data ecosystem. Obviously, that wasn’t sustainable for either group.

To make our data platform sustainable, we decided to completely separate the compute and storage layers. We adopted Amazon S3  for file storage and Amazon Kinesis Streams for stream storage. Both services replicate data into multiple Availability Zones and keep it available without high operation costs. We don’t have to pay much attention to the storage layer and we can focus on the computation layer that transforms raw data to a valuable output.

In addition, Amazon S3 and Amazon Kinesis Streams let us run multiple compute layers without complex negotiations. After data is stored, everyone can consume it in their own way. For example, if a team wants to try a new version of Spark, they can launch a new cluster and start to evaluate it immediately. That means every engineer in SmartNews can craft any solutions using whatever tools they feel are best suited to the task.

Input data

The first step is dispatching raw data to both the batch layer and the speed layer for processing. There are two types of data sources at SmartNews:

  • Groups of user activity logs generated from our mobile app
  • Various tables on Amazon RDS

User activity logs include more than 60 types of activities to understand user behavior such as which news articles are read. After we receive logs from the mobile app, all logs are passed to Fluentd, an OSS log collector, and forwarded to Amazon S3 and Amazon Kinesis Streams. If you are not familiar with Fluentd, see Store Apache Logs into Amazon S3 and Collect Log Files into Kinesis Stream in Real-Time to understand how Fluentd works.

Our recommended practice is adding the flush_at_shutdown parameter. If set to true, Fluentd waits for the buffer to flush at shutdown. Because our instances are scaled automatically, it’s important to store log files on Amazon S3 before terminating instances.

In addition, monitoring Fluentd status is important so that you know when bad things happen. We use Datadog and some Fluentd plugins. Because the Fluent-plugin-flowcounter counts incoming messages and bytes per second, we post these metrics to Dogstatsd via Fluent-plugin-dogstatsd. An example configuration is available in a GitHub Gist post.

After metrics are sent to Datadog, we can visualize aggregated metrics across any level that we choose. The following graph aggregates the number of records per data source.

Also, Datadog notifies us when things go wrong. The alerts in the figure below let us know that there have been no incoming records on an instance for the last 1 hour. We also monitor Fluentd’s buffer status by using Datadog’s Fluentd integration.

Various tables on Amazon RDS are dumped by Embulk, an OSS bulk data loader, and exported to Amazon S3. Its pluggable architecture lets us mask some fields that we don’t want to export to the data platform.

Batch layer

This layer is responsible for various ETL tasks such as transforming text files into columnar files (RCFile or ORCFile) for following consumers, generating machine learning features, and pre-computing the batch views.

We run multiple Amazon EMR clusters for each task. Amazon EMR lets us run multiple heterogeneous Hive and Spark clusters with a few clicks. Because all data is stored on Amazon S3, we can use Spot Instances for most tasks and adjust cluster capacity dynamically. It significantly reduces the cost of running our data processing system.

In addition to data processing itself, task management is very important for this layer. Although a cron scheduler is a good first solution, it becomes hard to maintain as the number of ETL tasks grows.

When using a cron scheduler, a developer needs to write additional code to handle dependencies such as waiting until the previous task is done, or failure handling such as retrying failed tasks or specifying timeouts for long-running tasks. We use Airflow, an open-sourced task scheduler, to manage our ETL tasks. We can define ETL tasks and dependencies with Python scripts.

Because every task is described as code, we can introduce pull request–based review flows for modifying ETL tasks.
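
A minimal Airflow DAG looks something like this (the task names and commands are made up for illustration; our real DAGs are more involved):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('daily_etl', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')

# Hypothetical tasks: dump RDS tables with Embulk, then convert the dump to ORC.
dump_rds = BashOperator(task_id='dump_rds',
                        bash_command='embulk run rds_dump.yml', dag=dag)
to_orc = BashOperator(task_id='to_orc',
                      bash_command='hive -f to_orc.hql', dag=dag)

# Dependencies are plain Python, too: to_orc waits until dump_rds succeeds,
# and Airflow handles retries and timeouts per task.
to_orc.set_upstream(dump_rds)
```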

Serving layer

The serving layer indexes and exposes the views so that they can be queried.

We use Presto for this layer. Presto is an open source, distributed SQL query engine for running interactive queries against various data sources such as Hive tables on S3, MySQL on Amazon RDS, Amazon Redshift, and Amazon Kinesis Streams. Presto converts a SQL query into a series of task stages and processes each stage in parallel. Because all processing occurs in memory to reduce disk I/O, end-to-end latency is very low: ~30 seconds to scan billions of records.

With Presto, we can analyze the data from various perspectives. The following simplified query shows the result of A/B testing by user clusters.

```sql
-- Suppose that this table exists
DESC hive.default.user_activities;
user_id bigint
action  varchar
abtest  array<map<varchar, varchar>>
url     varchar

-- Summarize page view per A/B Test identifier
--   for comparing two algorithms v1 & v2
SELECT
  dt,
  t['behaviorId'],
  count(*) as pv
FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%' AND action = 'viewArticle'
  AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId'] ORDER BY dt
;

-- Output:
-- 2016-01-01 | algorithm_v1 | 40000
-- 2016-01-01 | algorithm_v2 | 62000
```

Speed layer

Like the batch layer, the speed layer computes views from the data it receives. The difference is latency. Sometimes, the low latency adds valuable outputs for the product.

For example, we need to detect current trending news by interest-based clusters to deliver the best stories for each user. For this purpose, we run Spark Streaming.

User feedback in Amazon Kinesis Streams is joined on the interest-based user cluster data calculated in offline machine learning, which then outputs metrics for each news article. These metrics are used to rank news articles in a later phase. What Spark Streaming does in the above figure looks something like the following:

```scala
def main(args: Array[String]): Unit = {
  // ..... (prepare SparkContext)

  // Load user clusters that are generated by offline machine learning
  // (declared up front with var so it can be refreshed periodically)
  var userClusterRDD: RDD[(Long, Int)] = sc.emptyRDD[(Long, Int)]
  if (needToUpdate) {
    userClusterRDD = sqlContext.sql(
      "SELECT user_id, cluster_id FROM user_cluster"
    ).map( row => {
      (row.getLong(0), row.getInt(1))
    })
  }

  // Fetch and parse JSON records in Amazon Kinesis Streams
  val userPageviewStream: DStream[(Long, String)] = ssc.union(kinesisStreams)
    .map( byteArray => {
      val json = new String(byteArray)
      val userActivity = parse(json)  // JSON parsing elided; assumes fields user_id and url
      (userActivity.user_id, userActivity.url)
    })

  // Join stream records with pre-calculated user clusters
  val clusterPageviewStream: DStream[(Int, String)] = userPageviewStream
    .transform( userPageviewStreamRDD => {
      userPageviewStreamRDD.join(userClusterRDD).map( data => {
        val (userId, (url, clusterId) ) = data
        (clusterId, url)
      })
    })

  // ..... (aggregates pageview by clusters and store to DynamoDB)
}
```

Because every EMR cluster uses the shared Hive metastore, Spark Streaming applications can load all tables created on the batch layer by using SQLContext. After a table is loaded as an RDD (Resilient Distributed Dataset), we can join it to a Kinesis stream.

Spark Streaming is a great tool for empowering your machine learning–based application, but it can be overkill for simpler use cases such as monitoring. For these cases, we use AWS Lambda and PipelineDB (not covered here in detail).

Output data

Chartio is a commercial business intelligence (BI) service. Chartio enables every member (including non-engineers!) in the company to create, edit, and refine beautiful dashboards with minimal effort. This has saved us hours each week so we can spend our time improving our product, not reporting on it. Because Chartio supports various data sources such as Amazon RDS (MySQL, PostgreSQL), Presto, PipelineDB, Amazon Redshift, and Amazon Elasticsearch, you can start using it easily.

Summary

In this post, I’ve shown you how SmartNews uses AWS services and OSS technologies to create a data platform that is highly scalable and reliable, and is flexible enough to satisfy various requirements with minimum cost and effort. If you’re interested in our data platform, check out these two slides in our SlideShare: Building a Sustainable Data Platform on AWS  and Stream Processing in SmartNews.

If you have questions or suggestions, please leave a comment below.

Takumi Sakamoto is not an Amazon employee and does not represent Amazon.

———————————

Related

Building a Near Real-Time Discovery Platform with AWS

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Nine-year-old inventor’s award-winning asthma monitor

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/nine-year-old-inventors-asthma-monitor/

We keep a very close eye on the annual Tech4Good competition, and especially the children who are nominated for their BT Young Pioneer award; there are some fiercely smart kids there doing some hugely impressive work. This year’s field was very close (I would not have liked to be judging – there were some extraordinary projects presented).

Tech4Good award winners 2016

Arnav Sharma, nine years old, was the Winner of Winners as well as the winner of the Young Pioneer section with this asthma monitor, which runs on Raspberry Pi. Arnav started by learning about the causes and effects of asthma, and thought about ways to help patients. He discovered that asthma is hard to diagnose, but can be fatal if left undetected. This leads to many children being over-diagnosed and over-medicated; inhalers are often given as treatment to reduce the symptoms of asthma, but come with side-effects like reduced growth and immunity. Arnav discovered that the best way to manage asthma is to prevent attacks by understanding what triggers asthma attacks and following a treatment plan.

Asthma Pi

AsthmaPi

Arnav’s AsthmaPi uses a Raspberry Pi, a Sense HAT, an MQ-135 gas sensor, a Sharp optical dust sensor and an Arduino Uno. The sensors on the Sense HAT measure temperature and humidity, while the MQ gas sensor detects nitrogen compounds, carbon dioxide, cigarette smoke, smog, ammonia and alcohol, all known asthma triggers. The dust sensor measures the size of dust particles and their density. The AsthmaPi is programmed in Python and C++, and triggers email and SMS text message alerts to remind the owner to take medication and to go for review visits.

Here’s Arnav’s very impressive project video, which will walk you through what he’s put together, and how it all works.

This is the video demo for the AsthmaPi: An affordable asthma management kit made by Arnav Sharma, aged 9, finalist of Tech For Good competition. Please tweet him at #T4GArnavSharma or visit his page here http://www.tech4goodawards.com/finalist/arnav-sharma/ or vote for him at http://www.tech4goodawards.com/peoples-award/ Thank you.

Well done Arnav!

The post Nine-year-old inventor’s award-winning asthma monitor appeared first on Raspberry Pi.

ERTS – Exploit Reliability Testing System

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/heOaYUkdEdU/

ERTS or Exploit Reliability Testing System is a Python based tool to calculate the reliability of an exploit based on the number of times the exploit is able to control EIP register with the desired address/value. It’s created to help you code reliable exploits and take the manual parts out of running and re-running exploits […]
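
The metric itself is straightforward; something like this toy sketch (my own illustration based on the description above, not ERTS’s actual code):

```python
# Reliability = runs where EIP held the desired value / total runs.
def reliability(observed_eips, desired_eip=0x41414141):
    hits = sum(1 for eip in observed_eips if eip == desired_eip)
    return hits / len(observed_eips)

runs = [0x41414141, 0x41414141, 0xdeadbeef, 0x41414141]
print(f"{reliability(runs):.0%}")  # 75%: 3 of 4 runs controlled EIP
```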

Read the full post at darknet.org.uk

Hey, Mac and iOS users: Make sure to back up before you upgrade!

Post Syndicated from Peter Cohen original https://www.backblaze.com/blog/os-upgrade-backup-plan/

blog-backup-macos-sierra

Editor’s note: This article was originally published over the summer as a guideline for those dabbling with Apple’s public betas. It’s been updated to reflect the upcoming release of the new Mac and iOS versions.

New versions of Apple’s operating systems are coming to your iPhone and Mac later this month! iOS 10 will be released on September 13th, with macOS 10.12 “Sierra” coming a week later. If you’re planning to upgrade your Mac or iOS device with Apple’s newest software, you should make it a point to back up before you install anything new.

The new releases were announced in June at Apple’s annual Worldwide Developer Conference (WWDC) in San Francisco, which gathers thousands of Apple developers from around the world each year. It’s a familiar annual ritual: Apple introduces new versions of both the Mac and iOS operating systems, and they’re tested by developers and the public throughout the summer. At its recent iPhone 7 event, Apple announced the release dates for iOS 10 and macOS 10.12.

Here’s a rundown of some of the cool stuff coming with these new releases.

macOS

With this release, Apple is rebranding the OS X operating system as macOS to keep it consistent with iOS, tvOS, and watchOS. macOS’s new tentpole feature is Siri support, so you can talk to your computer the same way you talk to your phone. A lot of other new features have been added too, including Apple Pay support and the ability to unlock your Mac using your Apple Watch. Some exciting under-the-hood changes to the operating system provide more optimized storage and seamless cloud transfer, which we’ll say more about later.

macOS Sierra

Until a few years ago, only registered developers could get access to new operating system software ahead of everyone else. Apple has since loosened the reins, expanding its Apple Beta Software Program to regular civilians, not just Apple experts and pros.

That means more people than ever are using pre-release versions of iOS and macOS. Apple makes you wade through pages of legalese, and it’s easy to get glassy-eyed at all the stuff they throw at you. So if this is your first time, keep a few things in mind before you get rolling with the new software.

Back up early and often

Changing your Mac or iPhone’s operating system isn’t like installing a new version of an app, even though Apple has tried to make it a relatively simple process. The operating system is essential software for these devices, and how it works has a cascading effect on all the other apps and services you depend on.

Sometimes features and services you find absolutely necessary are left out – sometimes by accident, sometimes by circumstance. And that can (and does) change from pre-release build to pre-release build. The bottom line is that you want to be prepared if something goes drastically wrong, and you want to be inconvenienced as little as possible when it does.

One way you can do that is to make sure you have a restore point you can recover from before upgrading your system with new, unproven software. That way, if things go awry – and in pre-release days they often do – you can get your system back to working order and be none the worse for wear.

If you’re not currently backing up, it’s easy to get started using our 3-2-1 Backup Strategy. The idea is that there should be three copies of your data: the main one you use, a local backup copy, and a remote copy stored at a secure offsite data center like Backblaze. It’s served us and thousands of our customers very well over the years, so we recommend it unabashedly. Also check out our Mac Backup Guide.
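
For a feel of what the local leg of 3-2-1 looks like in practice, here’s a toy Python sketch – not a real backup tool, and the paths are illustrative examples – that refreshes a copy of a folder on an external drive. The third, offsite copy is what a service like Backblaze handles continuously, so there’s nothing to script for it here.

import shutil
from pathlib import Path

source = Path.home() / "Documents"           # copy 1: the data you use
local_backup = Path("/Volumes/BackupDrive")  # copy 2: a local external drive

if local_backup.exists():
    # Mirror the folder onto the external drive (copy 2 of 3).
    shutil.copytree(source, local_backup / "Documents", dirs_exist_ok=True)
    print("Local copy refreshed.")
else:
    print("Backup drive not mounted; local copy skipped.")

# Copy 3 lives offsite with a cloud backup service such as Backblaze.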

Don’t use your only hardware

It’s a really bad idea to install early release software on any computer or device you absolutely need. If you only have one Mac or one iPhone, I’d seriously reconsider installing any kind of pre-release software on it, unless you know you can live without it for however many hours you’ll need to restore it to working condition.

It’s a good idea to use beta operating system software only on a spare machine you can afford to lose for a while if you need to reset or reinstall. Especially in the early days, you can never count on things working quite as they should.

At the very least, in the case of your Mac, I’d strongly consider having a spare external hard drive to use. You can set it up as a bootable drive with the new operating system on it – simply attach the drive, turn on your Mac, and hold down the Option key to select the external drive.

Some users repartition their Mac’s startup disk, adding a second partition that they use for pre-release software. I stay away from this method, as recovering from it sometimes requires resorting to command-line work in macOS to restore things to where they should be.

If you plan to use a pre-release version of iOS, tvOS, or watchOS on any of your devices, it’d be wise to limit your use to spare devices only. Older iPads and iPhones will work (within the limits of what iOS 10 supports – Apple has posted system requirements), or you can pick up an iPod touch for $200 and have a very nice little testbed for iOS 10.

Your patience will pay off

This week saw the first public beta release of the new operating systems. These betas are intended to give early adopters first crack at the new software, and to let Apple shake down the changes. While the new features are cool, just remember that this is still very much a work in progress. Expect problems if you decide to install the software.

If you have only a single device or computer with which to use this stuff, the general release, or even some of the later public betas, will be a better time to upgrade than right now.

Even then, though, the same rules apply – please make sure to back up all of your systems before installing operating system software, even release software. Better safe than sorry, especially where the safety and security of your data is concerned.

The post Hey, Mac and iOS users: Make sure to back up before you upgrade! appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.