# ISP Questions Impartiality of Judges in Copyright Troll Cases

Post Syndicated from Andy original https://torrentfreak.com/isp-questions-impartiality-of-judges-in-copyright-troll-cases-180602/

Following in the footsteps of similar operations around the world, two years ago the copyright trolling movement landed on Swedish shores.

The pattern was a familiar one, with trolls harvesting IP addresses from BitTorrent swarms and tracing them back to Internet service providers. Then, after presenting evidence to a judge, the trolls obtained orders that compelled ISPs to hand over their customers’ details. From there, the trolls demanded cash payments to make supposed lawsuits disappear.

It’s a controversial business model that rarely receives outside praise. Many ISPs have tried to slow down the flood but most eventually grow tired of battling to protect their customers. The same cannot be said of Swedish ISP Bahnhof.

The ISP, which is also a strong defender of privacy, has become known for fighting back against copyright trolls. Indeed, to thwart them at the very first step, the company deletes IP address logs after just 24 hours, which prevents its customers from being targeted.

Bahnhof says that the copyright business appeared “dirty and corrupt” right from the get go, so it now operates Utpressningskollen.se, a web portal where the ISP publishes data on Swedish legal cases in which copyright owners demand customer data from ISPs through the Patent and Market Courts.

Over the past two years, Bahnhof says it has documented 76 cases of which six are still ongoing, 11 have been waived and a majority 59 have been decided in favor of mainly movie companies. Bahnhof says that when it discovered that 59 out of the 76 cases benefited one party, it felt a need to investigate.

In a detailed report compiled by Bahnhof Communicator Carolina Lindahl and sent to TF, the ISP reveals that it examined the individual decision-makers in the cases before the Courts and found five judges with “questionable impartiality.”

“One of the judges, we can call them Judge 1, has closed 12 of the cases, of which two have been waived and the other 10 have benefitted the copyright owner, mostly movie companies,” Lindahl notes.

“Judge 1 apparently has written several articles in the magazine NIR – Nordiskt Immateriellt Rättsskydd (Nordic Intellectual Property Protection) – which is mainly supported by Svenska Föreningen för Upphovsrätt, the Swedish Association for Copyright (SFU).

“SFU is a member-financed group centered around copyright that publishes articles, hands out scholarships, arranges symposiums, etc. On their website they have a public calendar where Judge 1 appears regularly.”

Bahnhof says that the financiers of the SFU are Sveriges Television AB (Sweden’s national public TV broadcaster), Filmproducenternas Rättsförening (a legally-oriented association for filmproducers), BMG Chrysalis Scandinavia (a media giant) and Fackförbundet för Film och Mediabranschen (a union for the movie and media industry).

“This means that Judge 1 is involved in a copyright association sponsored by the film and media industry, while also judging in copyright cases with the film industry as one of the parties,” the ISP says.

Bahnhof’s also has criticism for Judge 2, who participated as an event speaker for the Swedish Association for Copyright, and Judge 3 who has written for the SFU-supported magazine NIR. According to Lindahl, Judge 4 worked for a bureau that is partly owned by a board member of SFU, who also defended media companies in a “high-profile” Swedish piracy case.

That leaves Judge 5, who handled 10 of the copyright troll cases documented by Bahnhof, waiving one and deciding the remaining nine in favor of a movie company plaintiff.

“Judge 5 has been questioned before and even been accused of bias while judging a high-profile piracy case almost ten years ago. The accusations of bias were motivated by the judge’s membership of SFU and the Swedish Association for Intellectual Property Rights (SFIR), an association with several important individuals of the Swedish copyright community as members, who all defend, represent, or sympathize with the media industry,” Lindahl says.

Bahnhof hasn’t named any of the judges nor has it provided additional details on the “high-profile” case. However, anyone who remembers the infamous trial of ‘The Pirate Bay Four’ a decade ago might recall complaints from the defense (1,2,3) that several judges involved in the case were members of pro-copyright groups.

While there were plenty of calls to consider them biased, in May 2010 the Supreme Court ruled otherwise, a fact Bahnhof recognizes.

“Judge 5 was never sentenced for bias by the court, but regardless of the court’s decision this is still a judge who shares values and has personal connections with [the media industry], and as if that weren’t enough, the judge has induced an additional financial aspect by participating in events paid for by said party,” Lindahl writes.

“The judge has parties and interest holders in their personal network, a private engagement in the subject and a financial connection to one party – textbook characteristics of bias which would make anyone suspicious.”

The decision-makers of the Patent and Market Court and their relations.

The ISP notes that all five judges have connections to the media industry in the cases they judge, which isn’t a great starting point for returning “objective and impartial” results. In its summary, however, the ISP is scathing of the overall system, one in which court cases “almost looked rigged” and appear to be decided in favor of the movie company even before reaching court.

In general, however, Bahnhof says that the processes show a lack of individual attention, such as the court blindly accepting questionable IP address evidence supplied by infamous anti-piracy outfit MaverickEye.

“The court never bothers to control the media company’s only evidence (lists generated by MaverickMonitor, which has proven to be an unreliable software), the court documents contain several typos of varying severity, and the same standard texts are reused in several different cases,” the ISP says.

“The court documents show a lack of care and control, something that can easily be taken advantage of by individuals with shady motives. The findings and discoveries of this investigation are strengthened by the pure numbers mentioned in the beginning which clearly show how one party almost always wins.

“If this is caused by bias, cheating, partiality, bribes, political agenda, conspiracy or pure coincidence we can’t say for sure, but the fact that this process has mainly generated money for the film industry, while citizens have been robbed of their personal integrity and legal certainty, indicates what forces lie behind this machinery,” Bahnhof’s Lindahl concludes.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

# Ефектите на медийната консолидация: по сценарий

Post Syndicated from nellyo original https://nellyo.wordpress.com/2018/04/04/sinclair-2/

Sinclair Broadcast Group е най-голяма тв група по брой медии и също така по покритие в САЩ. Известни са с  новинарско съдържание и предавания, които популяризират консервативни политически позиции  и са в подкрепа на Републиканската партия.

Когато Тръмп каже следното –

– ето какво следва в медиите на Синклер  – каскада от еднотипни изпълнения по сценарий, както се вижда във видеото:

“Някои  медии  използват своите платформи, за да наложат своето лично пристрастие. Това е изключително опасно за нашата демокрация.”

Видеото е публикувано и в NYT:

Показва ефектите  на медийната консолидация за правото на информация.  А също показва и как това управление означава критиката като fake.

+ op-ed от вчера

# Israeli Security Attacks AMD by Publishing Zero-Day Exploits

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/03/israeli_securit.html

Last week, the Israeli security company CTS Labs published a series of exploits against AMD chips. The publication came with the flashy website, detailed whitepaper, cool vulnerability names — RYZENFALL, MASTERKEY, FALLOUT, and CHIMERA — and logos we’ve come to expect from these sorts of things. What’s new is that the company only gave AMD a day’s notice, which breaks with every norm about responsible disclosure. CTS Labs didn’t release details of the exploits, only high-level descriptions of the vulnerabilities, but it is probably still enough for others to reproduce their results. This is incredibly irresponsible of the company.

Moreover, the vulnerabilities are kind of meh. Nicholas Weaver explains:

In order to use any of the four vulnerabilities, an attacker must already have almost complete control over the machine. For most purposes, if the attacker already has this access, we would generally say they’ve already won. But these days, modern computers at least attempt to protect against a rogue operating system by having separate secure subprocessors. CTS Labs discovered the vulnerabilities when they looked at AMD’s implementation of the secure subprocessor to see if an attacker, having already taken control of the host operating system, could bypass these last lines of defense.

In a “Clarification,” CTS Labs kind of agrees:

The vulnerabilities described in amdflaws.com could give an attacker that has already gained initial foothold into one or more computers in the enterprise a significant advantage against IT and security teams.

The only thing the attacker would need after the initial local compromise is local admin privileges and an affected machine. To clarify misunderstandings — there is no need for physical access, no digital signatures, no additional vulnerability to reflash an unsigned BIOS. Buy a computer from the store, run the exploits as admin — and they will work (on the affected models as described on the site).

The weirdest thing about this story is that CTS Labs describes one of the vulnerabilities, Chimera, as a backdoor. Although it doesn’t t come out and say that this was deliberately planted by someone, it does make the point that the chips were designed in Taiwan. This is an incredible accusation, and honestly needs more evidence before we can evaluate it.

The upshot of all of this is that CTS Labs played this for maximum publicity: over-hyping its results and minimizing AMD’s ability to respond. And it may have an ulterior motive:

But CTS’s website touting AMD’s flaws also contained a disclaimer that threw some shadows on the company’s motives: “Although we have a good faith belief in our analysis and believe it to be objective and unbiased, you are advised that we may have, either directly or indirectly, an economic interest in the performance of the securities of the companies whose products are the subject of our reports,” reads one line. WIRED asked in a follow-up email to CTS whether the company holds any financial positions designed to profit from the release of its AMD research specifically. CTS didn’t respond.

We all need to demand better behavior from security researchers. I know that any publicity is good publicity, but I am pleased to see the stories critical of CTS Labs outnumbering the stories praising it.

EDITED TO ADD (3/21): AMD responds:

AMD’s response today agrees that all four bug families are real and are found in the various components identified by CTS. The company says that it is developing firmware updates for the three PSP flaws. These fixes, to be made available in “coming weeks,” will be installed through system firmware updates. The firmware updates will also mitigate, in some unspecified way, the Chimera issue, with AMD saying that it’s working with ASMedia, the third-party hardware company that developed Promontory for AMD, to develop suitable protections. In its report, CTS wrote that, while one CTS attack vector was a firmware bug (and hence in principle correctable), the other was a hardware flaw. If true, there may be no effective way of solving it.

Response here.

# Setting up bug bounties for success

Post Syndicated from Michal Zalewski original https://lcamtuf.blogspot.com/2018/03/setting-up-bug-bounties-for-success.html

Bug bounties end up in the news with some regularity, usually for the wrong reasons. I’ve been itching to write
about that for a while – but instead of dwelling on the mistakes of the bygone days, I figured it may be better to
talk about some of the ways to get vulnerability rewards right.

#### What do you get out of bug bounties?

There’s plenty of differing views, but I like to think of such programs
simply as a bid on researchers’ time. In the most basic sense, you get three benefits:

• Improved ability to detect bugs in production before they become major incidents.
• A comparatively unbiased feedback loop to help you prioritize and measure other security work.
• A robust talent pipeline for when you need to hire.

#### What bug bounties don’t offer?

You don’t get anything resembling a comprehensive security program or a systematic assessment of your platforms.
Researchers end up looking for bugs that offer favorable effort-to-payoff ratios for their skills and given the
very imperfect information they have about your enterprise. In other words, you may end up with a hundred
people looking for XSS and just one person looking for RCE.

Your reward structure can steer them toward the targets and bugs you care about, but it’s difficult to fully
eliminate this inherent skew. There’s only so far you can jack up your top-tier rewards, and only so far you can
go lowering the bottom-tier ones.

#### Don’t you have to outcompete the black market to get all the “good” bugs?

There is a free market price discovery component to it all: if you’re not getting the engagement you
were hoping for, you should probably consider paying more.

That said, there are going to be researchers who’d rather hurt you than work for you, no matter how much you pay;
you don’t have to win them over, and you don’t have to outspend every authoritarian government or
every crime syndicate. A bug bounty is effective simply if it attracts enough eyeballs to make bugs statistically
harder to find, and reduces the useful lifespan of any zero-days in black market trade. Plus, most
researchers don’t want their work to be used to crack down on dissidents in Egypt or Vietnam.

Another factor is that you’re paying for different things: a black market buyer probably wants a reliable exploit
capable of delivering payloads, and then demands silence for months or years to come; a vendor-run
bug bounty program is usually perfectly happy with a reproducible crash and doesn’t mind a researcher blogging
about their work.

In fact, while money is important, you will probably find out that it’s not enough to retain your top talent;
many folks want bug bounties to be more than a business transaction, and find a lot of value in having a close
relationship with your security team, comparing notes, and growing together. Fostering that partnership can
be more important than adding another $10,000 to your top reward. #### How do I prevent it all from going horribly wrong? Bug bounties are an unfamiliar beast to most lawyers and PR folks, so it’s a natural to be wary and try to plan for every eventuality with pages and pages of impenetrable rules and fine-print legalese. This is generally unnecessary: there is a strong self-selection bias, and almost every participant in a vulnerability reward program will be coming to you in good faith. The more friendly, forthcoming, and approachable you seem, and the more you treat them like peers, the more likely it is for your relationship to stay positive. On the flip side, there is no faster way to make enemies than to make a security researcher feel that they are now talking to a lawyer or to the PR dept. Most people have strong opinions on disclosure policies; instead of imposing your own views, strive to patch reported bugs reasonably quickly, and almost every reporter will play along. Demand researchers to cancel conference appearances, take down blog posts, or sign NDAs, and you will sooner or later end up in the news. #### But what if that’s not enough? As with any business endeavor, mistakes will happen; total risk avoidance is seldom the answer. Learn to sincerely apologize for mishaps; it’s not a sign of weakness to say “sorry, we messed up”. And you will almost certainly not end up in the courtroom for doing so. It’s good to foster a healthy and productive relationship with the community, so that they come to your defense when something goes wrong. Encouraging people to disclose bugs and talk about their experiences is one way of accomplishing that. #### What about extortion? You should structure your program to naturally discourage bad behavior and make it stand out like a sore thumb. Require bona fide reports with complete technical details before any reward decision is made by a panel of named peers; and make it clear that you never demand non-disclosure as a condition of getting a reward. To avoid researchers accidentally putting themselves in awkward situations, have clear rules around data exfiltration and lateral movement: assure them that you will always pay based on the worst-case impact of their findings; in exchange, ask them to stop as soon as they get a shell and never access any data that isn’t their own. #### So… are there any downsides? Yep. Other than souring up your relationship with the community if you implement your program wrong, the other consideration is that bug bounties tend to generate a lot of noise from well-meaning but less-skilled researchers. When this happens, do not get frustrated and do not penalize such participants; instead, help them grow. Consider publishing educational articles, giving advice on how to investigate and structure reports, or offering free workshops every now and then. The other downside is cost; although bug bounties tend to offer far more bang for your buck than your average penetration test, they are more random. The annual expenses tend to be fairly predictable, but there is always some possibility of having to pay multiple top-tier rewards in rapid succession. This is the kind of uncertainty that many mid-level budget planners react badly to. Finally, you need to be able to fix the bugs you receive. It would be nuts to prefer to not know about the vulnerabilities in the first place – but once you invite the research, the clock starts ticking and you need to ship fixes reasonably fast. #### So… should I try it? There are folks who enthusiastically advocate for bug bounties in every conceivable situation, and people who dislike them with fierce passion; both sentiments are usually strongly correlated with the line of business they are in. In reality, bug bounties are not a cure-all, and there are some ways to make them ineffectual or even dangerous. But they are not as risky or expensive as most people suspect, and when done right, they can actually be fun for your team, too. You won’t know for sure until you try. # [$] Open-source trusted computing for IoT

Post Syndicated from jake original https://lwn.net/Articles/747564/rss

At this year’s FOSDEM in Brussels,
Jan Tobias Mühlberg gave a talk on the
latest work on Sancus, a
project that was originally presented
at the USENIX Security Symposium in 2013. The project is a fully
open-source hardware platform to support “trusted
computing
” and other security functionality. It is designed to be used for
internet of things (IoT)
devices, automotive applications, critical infrastructure, and other
embedded devices where trusted code is expected to be run.

# Meet India’s women Open Source warriors (Factor Daily)

Post Syndicated from corbet original https://lwn.net/Articles/746546/rss

The Factor Daily site has a
look at work to increase the diversity
of open-source contributors in
India. “Over past two months, we interviewed at least two dozen
people from within and outside the open source community to identify a set
of women open source contributors from India. While the list is not
conclusive by any measure, it’s a good starting point in identifying the
women who are quietly shaping the future of open source from this part of
the world and how they dealt with gender biases.

# Random with care

Post Syndicated from Eevee original https://eev.ee/blog/2018/01/02/random-with-care/

Hi! Here are a few loose thoughts about picking random numbers.

## A word about crypto

DON’T ROLL YOUR OWN CRYPTO

This is all aimed at frivolous pursuits like video games. Hell, even video games where money is at stake should be deferring to someone who knows way more than I do. Otherwise you might find out that your deck shuffles in your poker game are woefully inadequate and some smartass is cheating you out of millions. (If your random number generator has fewer than 226 bits of state, it can’t even generate every possible shuffling of a deck of cards!)

## Use the right distribution

Most languages have a random number primitive that spits out a number uniformly in the range [0, 1), and you can go pretty far with just that. But beware a few traps!

### Random pitches

Say you want to pitch up a sound by a random amount, perhaps up to an octave. Your audio API probably has a way to do this that takes a pitch multiplier, where I say “probably” because that’s how the only audio API I’ve used works.

Easy peasy. If 1 is unchanged and 2 is pitched up by an octave, then all you need is rand() + 1. Right?

No! Pitch is exponential — within the same octave, the “gap” between C and C♯ is about half as big as the gap between B and the following C. If you pick a pitch multiplier uniformly, you’ll have a noticeable bias towards the higher pitches.

One octave corresponds to a doubling of pitch, so if you want to pick a random note, you want 2 ** rand().

### Random directions

For two dimensions, you can just pick a random angle with rand() * TAU.

If you want a vector rather than an angle, or if you want a random direction in three dimensions, it’s a little trickier. You might be tempted to just pick a random point where each component is rand() * 2 - 1 (ranging from −1 to 1), but that’s not quite right. A direction is a point on the surface (or, equivalently, within the volume) of a sphere, and picking each component independently produces a point within the volume of a cube; the result will be a bias towards the corners of the cube, where there’s much more extra volume beyond the sphere.

No? Well, just trust me. I don’t know how to make a diagram for this.

Anyway, you could use the Pythagorean theorem a few times and make a huge mess of things, or it turns out there’s a really easy way that even works for two or four or any number of dimensions. You pick each coordinate from a Gaussian (normal) distribution, then normalize the resulting vector. In other words, using Python’s random module:

 1 2 3 4 5 6 def random_direction(): x = random.gauss(0, 1) y = random.gauss(0, 1) z = random.gauss(0, 1) r = math.sqrt(x*x + y*y + z*z) return x/r, y/r, z/r 

Why does this work? I have no idea!

Note that it is possible to get zero (or close to it) for every component, in which case the result is nonsense. You can re-roll all the components if necessary; just check that the magnitude (or its square) is less than some epsilon, which is equivalent to throwing away a tiny sphere at the center and shouldn’t affect the distribution.

### Beware Gauss

Since I brought it up: the Gaussian distribution is a pretty nice one for choosing things in some range, where the middle is the common case and should appear more frequently.

That said, I never use it, because it has one annoying drawback: the Gaussian distribution has no minimum or maximum value, so you can’t really scale it down to the range you want. In theory, you might get any value out of it, with no limit on scale.

In practice, it’s astronomically rare to actually get such a value out. I did a hundred million trials just to see what would happen, and the largest value produced was 5.8.

But, still, I’d rather not knowingly put extremely rare corner cases in my code if I can at all avoid it. I could clamp the ends, but that would cause unnatural bunching at the endpoints. I could reroll if I got a value outside some desired range, but I prefer to avoid rerolling when I can, too; after all, it’s still (astronomically) possible to have to reroll for an indefinite amount of time. (Okay, it’s really not, since you’ll eventually hit the period of your PRNG. Still, though.) I don’t bend over backwards here — I did just say to reroll when picking a random direction, after all — but when there’s a nicer alternative I’ll gladly use it.

And lo, there is a nicer alternative! Enter the beta distribution. It always spits out a number in [0, 1], so you can easily swap it in for the standard normal function, but it takes two “shape” parameters α and β that alter its behavior fairly dramatically.

With α = β = 1, the beta distribution is uniform, i.e. no different from rand(). As α increases, the distribution skews towards the right, and as β increases, the distribution skews towards the left. If α = β, the whole thing is symmetric with a hump in the middle. The higher either one gets, the more extreme the hump (meaning that value is far more common than any other). With a little fiddling, you can get a number of interesting curves.

Screenshots don’t really do it justice, so here’s a little Wolfram widget that lets you play with α and β live:

Note that if α = 1, then 1 is a possible value; if β = 1, then 0 is a possible value. You probably want them both greater than 1, which clamps the endpoints to zero.

Also, it’s possible to have either α or β or both be less than 1, but this creates very different behavior: the corresponding endpoints become poles.

Anyway, something like α = β = 3 is probably close enough to normal for most purposes but already clamped for you. And you could easily replicate something like, say, NetHack’s incredibly bizarre rnz function.

### Random frequency

Say you want some event to have an 80% chance to happen every second. You (who am I kidding, I) might be tempted to do something like this:

 1 2 if random() < 0.8 * dt: do_thing() 

In an ideal world, dt is always the same and is equal to 1 / f, where f is the framerate. Replace that 80% with a variable, say P, and every tic you have a P / f chance to do the… whatever it is.

Each second, f tics pass, so you’ll make this check f times. The chance that any check succeeds is the inverse of the chance that every check fails, which is $$1 – \left(1 – \frac{P}{f}\right)^f$$.

For P of 80% and a framerate of 60, that’s a total probability of 55.3%. Wait, what?

Consider what happens if the framerate is 2. On the first tic, you roll 0.4 twice — but probabilities are combined by multiplying, and splitting work up by dt only works for additive quantities. You lose some accuracy along the way. If you’re dealing with something that multiplies, you need an exponent somewhere.

But in this case, maybe you don’t want that at all. Each separate roll you make might independently succeed, so it’s possible (but very unlikely) that the event will happen 60 times within a single second! Or 200 times, if that’s someone’s framerate.

If you explicitly want something to have a chance to happen on a specific interval, you have to check on that interval. If you don’t have a gizmo handy to run code on an interval, it’s easy to do yourself with a time buffer:

 1 2 3 4 5 6 timer += dt # here, 1 is the "every 1 seconds" while timer > 1: timer -= 1 if random() < 0.8: do_thing() 

Using while means rolls still happen even if you somehow skipped over an entire second.

(For the curious, and the nerds who already noticed: the expression $$1 – \left(1 – \frac{P}{f}\right)^f$$ converges to a specific value! As the framerate increases, it becomes a better and better approximation for $$1 – e^{-P}$$, which for the example above is 0.551. Hey, 60 fps is pretty accurate — it’s just accurately representing something nowhere near what I wanted. Er, you wanted.)

### Rolling your own

Of course, you can fuss with the classic [0, 1] uniform value however you want. If I want a bias towards zero, I’ll often just square it, or multiply two of them together. If I want a bias towards one, I’ll take a square root. If I want something like a Gaussian/normal distribution, but with clearly-defined endpoints, I might add together n rolls and divide by n. (The normal distribution is just what you get if you roll infinite dice and divide by infinity!)

It’d be nice to be able to understand exactly what this will do to the distribution. Unfortunately, that requires some calculus, which this post is too small to contain, and which I didn’t even know much about myself until I went down a deep rabbit hole while writing, and which in many cases is straight up impossible to express directly.

Here’s the non-calculus bit. A source of randomness is often graphed as a PDF — a probability density function. You’ve almost certainly seen a bell curve graphed, and that’s a PDF. They’re pretty nice, since they do exactly what they look like: they show the relative chance that any given value will pop out. On a bog standard bell curve, there’s a peak at zero, and of course zero is the most common result from a normal distribution.

(Okay, actually, since the results are continuous, it’s vanishingly unlikely that you’ll get exactly zero — but you’re much more likely to get a value near zero than near any other number.)

For the uniform distribution, which is what a classic rand() gives you, the PDF is just a straight horizontal line — every result is equally likely.

If there were a calculus bit, it would go here! Instead, we can cheat. Sometimes. Mathematica knows how to work with probability distributions in the abstract, and there’s a free web version you can use. For the example of squaring a uniform variable, try this out:

 1 PDF[TransformedDistribution[u^2, u \[Distributed] UniformDistribution[{0, 1}]], u] 

(The \[Distributed] is a funny tilde that doesn’t exist in Unicode, but which Mathematica uses as a first-class operator. Also, press shiftEnter to evaluate the line.)

This will tell you that the distribution is… $$\frac{1}{2\sqrt{u}}$$. Weird! You can plot it:

 1 Plot[%, {u, 0, 1}] 

(The % refers to the result of the last thing you did, so if you want to try several of these, you can just do Plot[PDF[…], u] directly.)

The resulting graph shows that numbers around zero are, in fact, vastly — infinitely — more likely than anything else.

What about multiplying two together? I can’t figure out how to get Mathematica to understand this, but a great amount of digging revealed that the answer is -ln x, and from there you can plot them both on Wolfram Alpha. They’re similar, though squaring has a much better chance of giving you high numbers than multiplying two separate rolls — which makes some sense, since if either of two rolls is a low number, the product will be even lower.

What if you know the graph you want, and you want to figure out how to play with a uniform roll to get it? Good news! That’s a whole thing called inverse transform sampling. All you have to do is take an integral. Good luck!

This is all extremely ridiculous. New tactic: Just Simulate The Damn Thing. You already have the code; run it a million times, make a histogram, and tada, there’s your PDF. That’s one of the great things about computers! Brute-force numerical answers are easy to come by, so there’s no excuse for producing something like rnz. (Though, be sure your histogram has sufficiently narrow buckets — I tried plotting one for rnz once and the weird stuff on the left side didn’t show up at all!)

By the way, I learned something from futzing with Mathematica here! Taking the square root (to bias towards 1) gives a PDF that’s a straight diagonal line, nothing like the hyperbola you get from squaring (to bias towards 0). How do you get a straight line the other way? Surprise: $$1 – \sqrt{1 – u}$$.

### Okay, okay, here’s the actual math

I don’t claim to have a very firm grasp on this, but I had a hell of a time finding it written out clearly, so I might as well write it down as best I can. This was a great excuse to finally set up MathJax, too.

Say $$u(x)$$ is the PDF of the original distribution and $$u$$ is a representative number you plucked from that distribution. For the uniform distribution, $$u(x) = 1$$. Or, more accurately,

$$u(x) = \begin{cases} 1 & \text{ if } 0 \le x \lt 1 \\ 0 & \text{ otherwise } \end{cases}$$

Remember that $$x$$ here is a possible outcome you want to know about, and the PDF tells you the relative probability that a roll will be near it. This PDF spits out 1 for every $$x$$, meaning every number between 0 and 1 is equally likely to appear.

We want to do something to that PDF, which creates a new distribution, whose PDF we want to know. I’ll use my original example of $$f(u) = u^2$$, which creates a new PDF $$v(x)$$.

The trick is that we need to work in terms of the cumulative distribution function for $$u$$. Where the PDF gives the relative chance that a roll will be (“near”) a specific value, the CDF gives the relative chance that a roll will be less than a specific value.

The conventions for this seem to be a bit fuzzy, and nobody bothers to explain which ones they’re using, which makes this all the more confusing to read about… but let’s write the CDF with a capital letter, so we have $$U(x)$$. In this case, $$U(x) = x$$, a straight 45° line (at least between 0 and 1). With the definition I gave, this should make sense. At some arbitrary point like 0.4, the value of the PDF is 1 (0.4 is just as likely as anything else), and the value of the CDF is 0.4 (you have a 40% chance of getting a number from 0 to 0.4).

Calculus ahoy: the PDF is the derivative of the CDF, which means it measures the slope of the CDF at any point. For $$U(x) = x$$, the slope is always 1, and indeed $$u(x) = 1$$. See, calculus is easy.

Okay, so, now we’re getting somewhere. What we want is the CDF of our new distribution, $$V(x)$$. The CDF is defined as the probability that a roll $$v$$ will be less than $$x$$, so we can literally write:

$$V(x) = P(v \le x)$$

(This is why we have to work with CDFs, rather than PDFs — a PDF gives the chance that a roll will be “nearby,” whatever that means. A CDF is much more concrete.)

What is $$v$$, exactly? We defined it ourselves; it’s the do something applied to a roll from the original distribution, or $$f(u)$$.

$$V(x) = P\!\left(f(u) \le x\right)$$

Now the first tricky part: we have to solve that inequality for $$u$$, which means we have to do something, backwards to $$x$$.

$$V(x) = P\!\left(u \le f^{-1}(x)\right)$$

Almost there! We now have a probability that $$u$$ is less than some value, and that’s the definition of a CDF!

$$V(x) = U\!\left(f^{-1}(x)\right)$$

Hooray! Now to turn these CDFs back into PDFs, all we need to do is differentiate both sides and use the chain rule. If you never took calculus, don’t worry too much about what that means!

$$v(x) = u\!\left(f^{-1}(x)\right)\left|\frac{d}{dx}f^{-1}(x)\right|$$

Wait! Where did that absolute value come from? It takes care of whether $$f(x)$$ increases or decreases. It’s the least interesting part here by far, so, whatever.

There’s one more magical part here when using the uniform distribution — $$u(\dots)$$ is always equal to 1, so that entire term disappears! (Note that this only works for a uniform distribution with a width of 1; PDFs are scaled so the entire area under them sums to 1, so if you had a rand() that could spit out a number between 0 and 2, the PDF would be $$u(x) = \frac{1}{2}$$.)

$$v(x) = \left|\frac{d}{dx}f^{-1}(x)\right|$$

So for the specific case of modifying the output of rand(), all we have to do is invert, then differentiate. The inverse of $$f(u) = u^2$$ is $$f^{-1}(x) = \sqrt{x}$$ (no need for a ± since we’re only dealing with positive numbers), and differentiating that gives $$v(x) = \frac{1}{2\sqrt{x}}$$. Done! This is also why square root comes out nicer; inverting it gives $$x^2$$, and differentiating that gives $$2x$$, a straight line.

Incidentally, that method for turning a uniform distribution into any distribution — inverse transform sampling — is pretty much the same thing in reverse: integrate, then invert. For example, when I saw that taking the square root gave $$v(x) = 2x$$, I naturally wondered how to get a straight line going the other way, $$v(x) = 2 – 2x$$. Integrating that gives $$2x – x^2$$, and then you can use the quadratic formula (or just ask Wolfram Alpha) to solve $$2x – x^2 = u$$ for $$x$$ and get $$f(u) = 1 – \sqrt{1 – u}$$.

Multiply two rolls is a bit more complicated; you have to write out the CDF as an integral and you end up doing a double integral and wow it’s a mess. The only thing I’ve retained is that you do a division somewhere, which then gets integrated, and that’s why it ends up as $$-\ln x$$.

And that’s quite enough of that! (Okay but having math in my blog is pretty cool and I will definitely be doing more of this, sorry, not sorry.)

## Random vs varied

Sometimes, random isn’t actually what you want. We tend to use the word “random” casually to mean something more like chaotic, i.e., with no discernible pattern. But that’s not really random. In fact, given how good humans can be at finding incidental patterns, they aren’t all that unlikely! Consider that when you roll two dice, they’ll come up either the same or only one apart almost half the time. Coincidence? Well, yes.

If you ask for randomness, you’re saying that any outcome — or series of outcomes — is acceptable, including five heads in a row or five tails in a row. Most of the time, that’s fine. Some of the time, it’s less fine, and what you really want is variety. Here are a couple examples and some fairly easy workarounds.

### NPC quips

The nature of games is such that NPCs will eventually run out of things to say, at which point further conversation will give the player a short brush-off quip — a slight nod from the designer to the player that, hey, you hit the end of the script.

Some NPCs have multiple possible quips and will give one at random. The trouble with this is that it’s very possible for an NPC to repeat the same quip several times in a row before abruptly switching to another one. With only a few options to choose from, getting the same option twice or thrice (especially across an entire game, which may have numerous NPCs) isn’t all that unlikely. The notion of an NPC quip isn’t very realistic to start with, but having someone repeat themselves and then abruptly switch to something else is especially jarring.

The easy fix is to show the quips in order! Paradoxically, this is more consistently varied than choosing at random — the original “order” is likely to be meaningless anyway, and it already has the property that the same quip can never appear twice in a row.

If you like, you can shuffle the list of quips every time you reach the end, but take care here — it’s possible that the last quip in the old order will be the same as the first quip in the new order, so you may still get a repeat. (Of course, you can just check for this case and swap the first quip somewhere else if it bothers you.)

That last behavior is, in fact, the canonical way that Tetris chooses pieces — the game simply shuffles a list of all 7 pieces, gives those to you in shuffled order, then shuffles them again to make a new list once it’s exhausted. There’s no avoidance of duplicates, though, so you can still get two S blocks in a row, or even two S and two Z all clumped together, but no more than that. Some Tetris variants take other approaches, such as actively avoiding repeats even several pieces apart or deliberately giving you the worst piece possible.

### Random drops

Random drops are often implemented as a flat chance each time. Maybe enemies have a 5% chance to drop health when they die. Legally speaking, over the long term, a player will see health drops for about 5% of enemy kills.

Over the short term, they may be desperate for health and not survive to see the long term. So you may want to put a thumb on the scale sometimes. Games in the Metroid series, for example, have a somewhat infamous bias towards whatever kind of drop they think you need — health if your health is low, missiles if your missiles are low.

I can’t give you an exact approach to use, since it depends on the game and the feeling you’re going for and the variables at your disposal. In extreme cases, you might want to guarantee a health drop from a tough enemy when the player is critically low on health. (Or if you’re feeling particularly evil, you could go the other way and deny the player health when they most need it…)

The problem becomes a little different, and worse, when the event that triggers the drop is relatively rare. The pathological case here would be something like a raid boss in World of Warcraft, which requires hours of effort from a coordinated group of people to defeat, and which has some tiny chance of dropping a good item that will go to only one of those people. This is why I stopped playing World of Warcraft at 60.

Dialing it back a little bit gives us Enter the Gungeon, a roguelike where each room is a set of encounters and each floor only has a dozen or so rooms. Initially, you have a 1% chance of getting a reward after completing a room — but every time you complete a room and don’t get a reward, the chance increases by 9%, up to a cap of 80%. Once you get a reward, the chance resets to 1%.

The natural question is: how frequently, exactly, can a player expect to get a reward? We could do math, or we could Just Simulate The Damn Thing.

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 from collections import Counter import random histogram = Counter() TRIALS = 1000000 chance = 1 rooms_cleared = 0 rewards_found = 0 while rewards_found < TRIALS: rooms_cleared += 1 if random.random() * 100 < chance: # Reward! rewards_found += 1 histogram[rooms_cleared] += 1 rooms_cleared = 0 chance = 1 else: chance = min(80, chance + 9) for gaps, count in sorted(histogram.items()): print(f"{gaps:3d} | {count / TRIALS * 100:6.2f}%", '#' * (count // (TRIALS // 100))) 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  1 | 0.98% 2 | 9.91% ######### 3 | 17.00% ################ 4 | 20.23% #################### 5 | 19.21% ################### 6 | 15.05% ############### 7 | 9.69% ######### 8 | 5.07% ##### 9 | 2.09% ## 10 | 0.63% 11 | 0.12% 12 | 0.03% 13 | 0.00% 14 | 0.00% 15 | 0.00% 

We’ve got kind of a hilly distribution, skewed to the left, which is up in this histogram. Most of the time, a player should see a reward every three to six rooms, which is maybe twice per floor. It’s vanishingly unlikely to go through a dozen rooms without ever seeing a reward, so a player should see at least one per floor.

Of course, this simulated a single continuous playthrough; when starting the game from scratch, your chance at a reward always starts fresh at 1%, the worst it can be. If you want to know about how many rewards a player will get on the first floor, hey, Just Simulate The Damn Thing.

 1 2 3 4 5 6 7  0 | 0.01% 1 | 13.01% ############# 2 | 56.28% ######################################################## 3 | 27.49% ########################### 4 | 3.10% ### 5 | 0.11% 6 | 0.00% 

Cool. Though, that’s assuming exactly 12 rooms; it might be worth changing that to pick at random in a way that matches the level generator.

(Enter the Gungeon does some other things to skew probability, which is very nice in a roguelike where blind luck can make or break you. For example, if you kill a boss without having gotten a new gun anywhere else on the floor, the boss is guaranteed to drop a gun.)

### Critical hits

I suppose this is the same problem as random drops, but backwards.

Say you have a battle sim where every attack has a 6% chance to land a devastating critical hit. Presumably the same rules apply to both the player and the AI opponents.

Consider, then, that the AI opponents have exactly the same 6% chance to ruin the player’s day. Consider also that this gives them an 0.4% chance to critical hit twice in a row. 0.4% doesn’t sound like much, but across an entire playthrough, it’s not unlikely that a player might see it happen and find it incredibly annoying.

Perhaps it would be worthwhile to explicitly forbid AI opponents from getting consecutive critical hits.

## In conclusion

An emerging theme here has been to Just Simulate The Damn Thing. So consider Just Simulating The Damn Thing. Even a simple change to a random value can do surprising things to the resulting distribution, so unless you feel like differentiating the inverse function of your code, maybe test out any non-trivial behavior and make sure it’s what you wanted. Probability is hard to reason about.

# When should behaviour outside a community have consequences inside it?

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/50099.html

Free software communities don’t exist in a vacuum. They’re made up of people who are also members of other communities, people who have other interests and engage in other activities. Sometimes these people engage in behaviour outside the community that may be perceived as negatively impacting communities that they’re a part of, but most communities have no guidelines for determining whether behaviour outside the community should have any consequences within the community. This post isn’t an attempt to provide those guidelines, but aims to provide some things that community leaders should think about when the issue is raised.

## Some things to consider

### Did the behaviour violate the law?

This seems like an obvious bar, but it turns out to be a pretty bad one. For a start, many things that are common accepted behaviour in various communities may be illegal (eg, reverse engineering work may contravene a strict reading of US copyright law), and taking this to an extreme would result in expelling anyone who’s ever broken a speed limit. On the flipside, refusing to act unless someone broke the law is also a bad threshold – much behaviour that communities consider unacceptable may be entirely legal.

There’s also the problem of determining whether a law was actually broken. The criminal justice system is (correctly) biased to an extent in favour of the defendant – removing someone’s rights in society should require meeting a high burden of proof. However, this is not the threshold that most communities hold themselves to in determining whether to continue permitting an individual to associate with them. An incident that does not result in a finding of criminal guilt (either through an explicit finding or a failure to prosecute the case in the first place) should not be ignored by communities for that reason.

### Did the behaviour violate your community norms?

There’s plenty of behaviour that may be acceptable within other segments of society but unacceptable within your community (eg, lobbying for the use of proprietary software is considered entirely reasonable in most places, but rather less so at an FSF event). If someone can be trusted to segregate their behaviour appropriately then this may not be a problem, but that’s probably not sufficient in all cases. For instance, if someone acts entirely reasonably within your community but engages in lengthy anti-semitic screeds on 4chan, it’s legitimate to question whether permitting them to continue being part of your community serves your community’s best interests.

### Did the behaviour violate the norms of the community in which it occurred?

Of course, the converse is also true – there’s behaviour that may be acceptable within your community but unacceptable in another community. It’s easy to write off someone acting in a way that contravenes the standards of another community but wouldn’t violate your expected behavioural standards – after all, if it wouldn’t breach your standards, what grounds do you have for taking action?

But you need to consider that if someone consciously contravenes the behavioural standards of a community they’ve chosen to participate in, they may be willing to do the same in your community. If pushing boundaries is a frequent trait then it may not be too long until you discover that they’re also pushing your boundaries.

## Why do you care?

A community’s code of conduct can be looked at in two ways – as a list of behaviours that will be punished if they occur, or as a list of behaviours that are unlikely to occur within that community. The former is probably the primary consideration when a community adopts a CoC, but the latter is how many people considering joining a community will think about it.

If your community includes individuals that are known to have engaged in behaviour that would violate your community standards, potential members or contributors may not trust that your CoC will function as adequate protection. A community that contains people known to have engaged in sexual harassment in other settings is unlikely to be seen as hugely welcoming, even if they haven’t (as far as you know!) done so within your community. The way your members behave outside your community is going to be seen as saying something about your community, and that needs to be taken into account.

A second (and perhaps less obvious) aspect is that membership of some higher profile communities may be seen as lending general legitimacy to someone, and they may play off that to legitimise behaviour or views that would be seen as abhorrent by the community as a whole. If someone’s anti-semitic views (for example) are seen as having more relevance because of their membership of your community, it’s reasonable to think about whether keeping them in your community serves the best interests of your community.

## Conclusion

I’ve said things like “considered” or “taken into account” a bunch here, and that’s for a good reason – I don’t know what the thresholds should be for any of these things, and there doesn’t seem to be even a rough consensus in the wider community. We’ve seen cases in which communities have acted based on behaviour outside their community (eg, Debian removing Jacob Appelbaum after it was revealed that he’d sexually assaulted multiple people), but there’s been no real effort to build a meaningful decision making framework around that.

As a result, communities struggle to make consistent decisions. It’s unreasonable to expect individual communities to solve these problems on their own, but that doesn’t mean we can ignore them. It’s time to start coming up with a real set of best practices.

comments

# The Official Projects Book volume 3 — out now

Post Syndicated from Rob Zwetsloot original https://www.raspberrypi.org/blog/projects-book-3/

Hey folks, Rob from The MagPi here with some very exciting news! The third volume of the Official Raspberry Pi Projects Book is out right this second, and we’ve packed its 200 pages with the very best Raspberry Pi projects and guides!

## A peek inside the projects book

We start you off with a neat beginners guide to programming in Python,  walking you from the very basics all the way through to building the classic videogame Pong from scratch!

Check out what’s inside!

Then we showcase some of the most inspiring projects from around the community, such as a camera for taking photos of the moon, a smart art installation, amazing arcade machines, and much more.

Emulate the Apollo mission computers on the Raspberry Pi

Next, we ease you into a series of tutorials that will help you get the most out of your Raspberry Pi. Among other things, you’ll be learning how to sync your Pi to Dropbox, use it to create a waterproof camera, and even emulate an Amiga.

We’ve also assembled a load of reviews to let you know what you should be buying if you want to extend your Pi experience.

Learn more about Pimoroni’s Enviro pHAT

I am extremely proud of what the entire MagPi team has put together here, and I know you’ll enjoy reading it as much as we enjoyed creating it.

## How to get yours

In the UK, you can get your copy of the new Official Raspberry Pi Projects Book at WH Smith and all good newsagents today. In the US, print copies will be available in stores such as Barnes & Noble very soon.

Or order a copy from the Raspberry Pi Press store — before the end of Sunday 26 November, you can use the code BLACKFRIDAY to get 10% off your purchase!

There’s also the digital version, which you can get via The MagPi Android and iOS apps. And, as always, there’s a free PDF, which is available here.

We think this new projects book is the perfect stocking filler, although we may be just a tad biased. Anyway, I hope you’ll love it!

The post The Official Projects Book volume 3 — out now appeared first on Raspberry Pi.

# Some notes about the Kaspersky affair

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/10/some-notes-about-kaspersky-affair.html

I thought I’d write up some notes about Kaspersky, the Russian anti-virus vendor that many believe has ties to Russian intelligence.

There’s two angles to this story. One is whether the accusations are true. The second is the poor way the press has handled the story, with mainstream outlets like the New York Times more intent on pushing government propaganda than informing us what’s going on.

### The press

Before we address Kaspersky, we need to talk about how the press covers this.
The mainstream media’s stories have been pure government propaganda, like this one from the New York Times. It garbles the facts of what happened, and relies primarily on anonymous government sources that cannot be held accountable. It’s so messed up that we can’t easily challenge it because we aren’t even sure exactly what it’s claiming.
The Society of Professional Journalists have a name for this abuse of anonymous sources, the “Washington Game“. Journalists can identify this as bad journalism, but the big newspapers like The New York Times continues to do it anyway, because how dare anybody criticize them?
For all that I hate the anti-American bias of The Intercept, at least they’ve had stories that de-garble what’s going on, that explain things so that we can challenge them.

### Our Government

Our government can’t tell us everything, of course. But at the same time, they need to tell us something, to at least being clear what their accusations are. These vague insinuations through the media hurt their credibility, not help it. The obvious craptitude is making us in the cybersecurity community come to Kaspersky’s defense, which is not the government’s aim at all.
There are lots of issues involved here, but let’s consider the major one insinuated by the NYTimes story, that Kaspersky was getting “data” files along with copies of suspected malware. This is troublesome if true.
But, as Kaspersky claims today, it’s because they had detected malware within a zip file, and uploaded the entire zip — including the data files within the zip.
This is reasonable. This is indeed how anti-virus generally works. It completely defeats the NYTimes insinuations.
This isn’t to say Kaspersky is telling the truth, of course, but that’s not the point. The point is that we are getting vague propaganda from the government further garbled by the press, making Kaspersky’s clear defense the credible party in the affair.
It’s certainly possible for Kaspersky to write signatures to look for strings like “TS//SI/OC/REL TO USA” that appear in secret US documents, then upload them to Russia. If that’s what our government believes is happening, they need to come out and be explicit about it. They can easily setup honeypots, in the way described in today’s story, to confirm it. However, it seems the government’s description of honeypots is that Kaspersky only upload files that were clearly viruses, not data.

### Kaspersky

I believe Kaspersky is guilty, that the company and Eugene himself, works directly with Russian intelligence.
That’s because on a personal basis, people in government have given me specific, credible stories — the sort of thing they should be making public. And these stories are wholly unrelated to stories that have been made public so far.
You shouldn’t believe me, of course, because I won’t go into details you can challenge. I’m not trying to convince you, I’m just disclosing my point of view.
But there are some public reasons to doubt Kaspersky. For example, when trying to sell to our government, they’ve claimed they can help us against terrorists. The translation of this is that they could help our intelligence services. Well, if they are willing to help our intelligence services against customers who are terrorists, then why wouldn’t they likewise help Russian intelligence services against their adversaries?
Then there is how Russia works. It’s a violent country. Most of the people mentioned in that “Steele Dossier” have died. In the hacker community, hackers are often coerced to help the government. Many have simply gone missing.
Being rich doesn’t make Kaspersky immune from this — it makes him more of a target. Russian intelligence knows he’s getting all sorts of good intelligence, such as malware written by foreign intelligence services. It’s unbelievable they wouldn’t put the screws on him to get this sort of thing.
Russia is our adversary. It’d be foolish of our government to buy anti-virus from Russian companies. Likewise, the Russian government won’t buy such products from American companies.

### Conclusion

I have enormous disrespect for mainstream outlets like The New York Times and the way they’ve handled the story. It makes me want to come to Kaspersky’s defense.

I have enormous respect for Kaspersky technology. They do good work.

But I hear stories. I don’t think our government should be trusting Kaspersky at all. For that matter, our government shouldn’t trust any cybersecurity products from Russia, China, Iran, etc.

# timeShift(GrafanaBuzz, 1w) Issue 18

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/20/timeshiftgrafanabuzz-1w-issue-18/

Welcome to another issue of timeShift. This week we released Grafana 4.6.0-beta2, which includes some fixes for alerts, annotations, the Cloudwatch data source, and a few panel updates. We’re also gearing up for Oredev, one of the biggest tech conferences in Scandinavia, November 7-10. In addition to sponsoring, our very own Carl Bergquist will be presenting “Monitoring for everyone.” Hope to see you there – swing by our booth and say hi!

#### Latest Release

Grafana 4.6-beta-2 is now available! Grafana 4.6.0-beta2 adds fixes for:

• ColorPicker display
• Alerting test
• Cloudwatch improvements
• CSV export
• Text panel enhancements
• Annotation fix for MySQL

To see more details on what’s in the newest version, please see the release notes.

#### From the Blogosphere

Screeps and Grafana: Graphing your AI: If you’re unfamiliar with Screeps, it’s a MMO RTS game for programmers, where the objective is to grow your colony through programming your units’ AI. You control your colony by writing JavaScript, which operates 247 in the single persistent real-time world filled by other players. This article walks you through graphing all your game stats with Grafana.

ntopng Grafana Integration: The Beauty of Data Visualization: Our friends at ntop created a tutorial so that you can graph ntop monitoring data in Grafana. He goes through the metrics exposed, configuring the ntopng Data Source plugin, and building your first dashboard. They’ve also created a nice video tutorial of the process.

Installing Graphite and Grafana to Display the Graphs of Centreon: This article, provides a step-by-step guide to getting your Centreon data into Graphite and visualizing the data in Grafana.

Bit v. Byte Episode 3 – Metrics for the Win: Bit v. Byte is a new weekly Podcast about the web industry, tools and techniques upcoming and in use today. This episode dives into metrics, and discusses Grafana, Prometheus and NGINX Amplify.

Code-Quickie: Visualize heating with Grafana: With the winter weather coming, Reinhard wanted to monitor the stats in his boiler room. This article covers not only the visualization of the data, but the different devices and sensors you can use to can use in your own home.

RuuviTag with C.H.I.P – BLE – Node-RED: Following the temperature-monitoring theme from the last article, Tobias writes about his journey of hooking up his new RuuviTag to Grafana to measure temperature, relative humidity, air pressure and more.

#### Early Bird will be Ending Soon

Early bird discounts will be ending soon, but you still have a few days to lock in the lower price. We will be closing early bird on October 31, so don’t wait until the last minute to take advantage of the discounted tickets!

Also, there’s still time to submit your talk. We’ll accept submissions through the end of October. We’re looking for technical and non-technical talks of all sizes. Submit a CFP now.

#### Grafana Plugins

This week we have updates to two panels and a brand new panel that can add some animation to your dashboards. Installing plugins in Grafana is easy; for on-prem Grafana, use the Grafana-cli tool, or with 1 click if you are using Hosted Grafana.

NEW PLUGIN

Geoloop Panel – The Geoloop panel is a simple visualizer for joining GeoJSON to Time Series data, and animating the geo features in a loop. An example of using the panel would be showing the rate of rainfall during a 5-hour storm.

UPDATED PLUGIN

Breadcrumb Panel – This plugin keeps track of dashboards you have visited within one session and displays them as a breadcrumb. The latest update fixes some issues with back navigation and url query params.

UPDATED PLUGIN

Influx Admin Panel – The Influx Admin panel duplicates features from the now deprecated Web Admin Interface for InfluxDB and has lots of features like letting you see the currently running queries, which can also be easily killed.

Changes in the latest release:

• Converted to typescript project based on typescript-template-datasource
• Select Databases. This only works with PR#8096
• Added time format options
• Show tags from response
• Support template variables in the query

#### Contribution of the week:

Each week we highlight some of the important contributions from our amazing open source community. Thank you for helping make Grafana better!

The Stockholm Go Meetup had a hackathon this week and sent a PR for letting whitelisted cookies pass through the Grafana proxy. Thanks to everyone who worked on this PR!

#### Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

This is awesome – we can’t get enough of these public dashboards!

#### We Need Your Help!

Do you have a graph that you love because the data is beautiful or because the graph provides interesting information? Please get in touch. Tweet or send us an email with a screenshot, and we’ll tell you about this fun experiment.

#### Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our

#### How are we doing?

Please tell us how we’re doing. Submit a comment on this article below, or post something at our community forum. Help us make these weekly roundups better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

# Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

## Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

• Install and configure RStudio with Athena
• Log in to RStudio
• Install R packages
• Connect to Athena
• Create a dataset
• Create models

### Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

• Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
• Provisioning of the EC2 instance in an existing VPC and public subnet
• Installation of Java 8
• Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
• Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
• S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
• RStudio username and password
• Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
• Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

### Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
2. Log in to RStudio with the and password you provided during setup.

### Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         2 hours 42 minutes
##     H2O cluster version:        3.10.4.6
##     H2O cluster version age:    4 months and 4 days !!!
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   3.30 GB
##     H2O cluster total cores:    4
##     H2O cluster allowed cores:  4
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     H2O Internal Security:      FALSE
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
install_aws_sdk()
}

load_sdk()
## NULL

### Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir=Sys.getenv("ATHENABUCKET"),
aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")


Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"
##  [7] "eventcodes"          "events"              "billboard"
## [10] "billboardtop10"      "elb_logs"            "gdelthist"
## [13] "gdeltmaster"         "twitter"             "twitter3"

### Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

 Field Description year Year that song was released songtitle Title of the song artistname Name of the song artist songid Unique identifier for the song artistid Unique identifier for the song artist timesignature Variable estimating the time signature of the song timesignature_confidence Confidence in the estimate for the timesignature loudness Continuous variable indicating the average amplitude of the audio in decibels tempo Variable indicating the estimated beats per minute of the song tempo_confidence Confidence in the estimate for tempo key Variable with twelve levels indicating the estimated key of the song (C, C#, B) key_confidence Confidence in the estimate for key energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness pitch Continuous variable that indicates the pitch of the song timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

### Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE sampledb.billboard(
## 2                                       year int,
## 3                               songtitle string,
## 4                              artistname string,
## 5                                  songid string,
## 6                                artistid string,
## 7                              timesignature int,
## 8                timesignature_confidence double,
## 9                                loudness double,
## 10                                  tempo double,
## 11                       tempo_confidence double,
## 12                                       key int,
## 13                         key_confidence double,
## 14                                 energy double,
## 15                                  pitch double,
## 16                           timbre_0_min double,
## 17                           timbre_0_max double,
## 18                           timbre_1_min double,
## 19                           timbre_1_max double,
## 20                           timbre_2_min double,
## 21                           timbre_2_max double,
## 22                           timbre_3_min double,
## 23                           timbre_3_max double,
## 24                           timbre_4_min double,
## 25                           timbre_4_max double,
## 26                           timbre_5_min double,
## 27                           timbre_5_max double,
## 28                           timbre_6_min double,
## 29                           timbre_6_max double,
## 30                           timbre_7_min double,
## 31                           timbre_7_max double,
## 32                           timbre_8_min double,
## 33                           timbre_8_max double,
## 34                           timbre_9_min double,
## 35                           timbre_9_max double,
## 36                          timbre_10_min double,
## 37                          timbre_10_max double,
## 38                          timbre_11_min double,
## 39                          timbre_11_max double,
## 40                                     top10 int)
## 41                             ROW FORMAT DELIMITED
## 42                         FIELDS TERMINATED BY ','
## 43                            STORED AS INPUTFORMAT
## 44       'org.apache.hadoop.mapred.TextInputFormat'
## 45                                     OUTPUTFORMAT
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

### Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373


The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error:
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307


### Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

### Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

### Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
##
|
|                                                                 |   0%
|
|=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
##
|
|                                                                 |   0%
|
|=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"
##  [3] "artistname"               "songid"
##  [5] "artistid"                 "timesignature"
##  [7] "timesignature_confidence" "loudness"
##  [9] "tempo"                    "tempo_confidence"
## [11] "key"                      "key_confidence"
## [13] "energy"                   "pitch"
## [15] "timbre_0_min"             "timbre_0_max"
## [17] "timbre_1_min"             "timbre_1_max"
## [19] "timbre_2_min"             "timbre_2_max"
## [21] "timbre_3_min"             "timbre_3_max"
## [23] "timbre_4_min"             "timbre_4_max"
## [25] "timbre_5_min"             "timbre_5_max"
## [27] "timbre_6_min"             "timbre_6_max"
## [29] "timbre_7_min"             "timbre_7_max"
## [31] "timbre_8_min"             "timbre_8_max"
## [33] "timbre_9_min"             "timbre_9_max"
## [35] "timbre_10_min"            "timbre_10_max"
## [37] "timbre_11_min"            "timbre_11_max"
## [39] "top10"


### Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38


#### Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
|                                                                 |   0%
|
|=====                                                            |   8%
|
|=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or 
h2o.auc(h2o.performance(modelh1,test.h2o))
## [1] 0.8431394


The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

• The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
• The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
• The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

• Model 2, in which you keep “energy” and omit “loudness”
• Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

#### Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"
##  [3] "artistname"               "songid"
##  [5] "artistid"                 "timesignature"
##  [7] "timesignature_confidence" "loudness"
##  [9] "tempo"                    "tempo_confidence"
## [11] "key"                      "key_confidence"
## [13] "energy"                   "pitch"
## [15] "timbre_0_min"             "timbre_0_max"
## [17] "timbre_1_min"             "timbre_1_max"
## [19] "timbre_2_min"             "timbre_2_max"
## [21] "timbre_3_min"             "timbre_3_max"
## [23] "timbre_4_min"             "timbre_4_max"
## [25] "timbre_5_min"             "timbre_5_max"
## [27] "timbre_6_min"             "timbre_6_max"
## [29] "timbre_7_min"             "timbre_7_max"
## [31] "timbre_8_min"             "timbre_8_max"
## [33] "timbre_9_min"             "timbre_9_max"
## [35] "timbre_10_min"            "timbre_10_max"
## [37] "timbre_11_min"            "timbre_11_max"
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
|                                                                 |   0%
|
|=======                                                          |  10%
|
|=================================================================| 100%


Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o))
## [1] 0.8431933

You can make the following observations:

• The AUC metric is 0.8431933.
• Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
• As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

#### CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"
##  [3] "artistname"               "songid"
##  [5] "artistid"                 "timesignature"
##  [7] "timesignature_confidence" "loudness"
##  [9] "tempo"                    "tempo_confidence"
## [11] "key"                      "key_confidence"
## [13] "energy"                   "pitch"
## [15] "timbre_0_min"             "timbre_0_max"
## [17] "timbre_1_min"             "timbre_1_max"
## [19] "timbre_2_min"             "timbre_2_max"
## [21] "timbre_3_min"             "timbre_3_max"
## [23] "timbre_4_min"             "timbre_4_max"
## [25] "timbre_5_min"             "timbre_5_max"
## [27] "timbre_6_min"             "timbre_6_max"
## [29] "timbre_7_min"             "timbre_7_max"
## [31] "timbre_8_min"             "timbre_8_max"
## [33] "timbre_9_min"             "timbre_9_max"
## [35] "timbre_10_min"            "timbre_10_max"
## [37] "timbre_11_min"            "timbre_11_max"
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
##
|
|                                                                 |   0%
|
|========                                                         |  12%
|
|=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
##
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run h2o.predict and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389


You can make the following observations:

• The AUC metric is 0.8492389.
• From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
• Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
• Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

• Desired model accuracy at a given threshold
• Number of correct predictions for top10 hits
• Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
##
|
|                                                                 |   0%
|
|=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
##
## [373 rows x 3 columns]
predict.regh$predict ## predict ## 1 0 ## 2 0 ## 3 0 ## 4 0 ## 5 0 ## 6 0 ## ## [373 rows x 1 column] dpr<-as.data.frame(predict.regh) #Rename the predicted column colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10' table(dpr$predict_top10)
##
##   0   1
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10) ## ## 0 1 ## 314 59  Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231. It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels? View the two models from an investment perspective: • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular. • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10. • The baseline model is not useful, as it simply does not label any song as a hit. Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label. #### GBM model H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function. Before you do this, you need to convert the target variable to a factor for multinomial classification techniques. train.h2o$top10=as.factor(train.h2o$top10) gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial") ## | | | 0% | |=== | 5% | |===== | 7% | |====== | 9% | |======= | 10% | |====================== | 33% | |===================================== | 56% | |==================================================== | 79% | |================================================================ | 98% | |=================================================================| 100% perf.gbmh<-h2o.performance(gbm.modelh,test.h2o) perf.gbmh ## H2OBinomialMetrics: gbm ## ## MSE: 0.09860778 ## RMSE: 0.3140188 ## LogLoss: 0.3206876 ## Mean Per-Class Error: 0.2120263 ## AUC: 0.8630573 ## Gini: 0.7261146 ## ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: ## 0 1 Error Rate ## 0 266 48 0.152866 =48/314 ## 1 16 43 0.271186 =16/59 ## Totals 282 91 0.171582 =64/373 ## ## Maximum Metrics: Maximum metrics at their respective thresholds ## metric threshold value idx ## 1 max f1 0.189757 0.573333 90 ## 2 max f2 0.130895 0.693717 145 ## 3 max f0point5 0.327346 0.598802 26 ## 4 max accuracy 0.442757 0.876676 14 ## 5 max precision 0.802184 1.000000 0 ## 6 max recall 0.049990 1.000000 284 ## 7 max specificity 0.802184 1.000000 0 ## 8 max absolute_mcc 0.169135 0.496486 104 ## 9 max min_per_class_accuracy 0.169135 0.796610 104 ## 10 max mean_per_class_accuracy 0.169135 0.805948 104 ## ## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or  h2o.sensitivity(perf.gbmh,0.5) ## Warning in h2o.find_row_by_threshold(object, t): Could not find exact ## threshold: 0.5 for this set of metrics; using closest threshold found: ## 0.501205344484314. Run h2o.predict and apply your desired threshold on a ## probability column. ## [[1]] ## [1] 0.1355932 h2o.auc(perf.gbmh) ## [1] 0.8630573 This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3. As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold. H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally. Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose. For models that require more computation, consider using accelerated computing instances such as the P2 instance type. system.time( dlearning.modelh <- h2o.deeplearning(y = y.dep, x = x.indep, training_frame = train.h2o, epoch = 250, hidden = c(250,250), activation = "Rectifier", seed = 1122, distribution="multinomial" ) ) ## | | | 0% | |=== | 4% | |===== | 8% | |======== | 12% | |========== | 16% | |============= | 20% | |================ | 24% | |================== | 28% | |===================== | 32% | |======================= | 36% | |========================== | 40% | |============================= | 44% | |=============================== | 48% | |================================== | 52% | |==================================== | 56% | |======================================= | 60% | |========================================== | 64% | |============================================ | 68% | |=============================================== | 72% | |================================================= | 76% | |==================================================== | 80% | |======================================================= | 84% | |========================================================= | 88% | |============================================================ | 92% | |============================================================== | 96% | |=================================================================| 100% ## user system elapsed ## 1.216 0.020 166.508 perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o) perf.dl ## H2OBinomialMetrics: deeplearning ## ## MSE: 0.1678359 ## RMSE: 0.4096778 ## LogLoss: 1.86509 ## Mean Per-Class Error: 0.3433013 ## AUC: 0.7568822 ## Gini: 0.5137644 ## ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: ## 0 1 Error Rate ## 0 290 24 0.076433 =24/314 ## 1 36 23 0.610169 =36/59 ## Totals 326 47 0.160858 =60/373 ## ## Maximum Metrics: Maximum metrics at their respective thresholds ## metric threshold value idx ## 1 max f1 0.826267 0.433962 46 ## 2 max f2 0.000000 0.588235 239 ## 3 max f0point5 0.999929 0.511811 16 ## 4 max accuracy 0.999999 0.865952 10 ## 5 max precision 1.000000 1.000000 0 ## 6 max recall 0.000000 1.000000 326 ## 7 max specificity 1.000000 1.000000 0 ## 8 max absolute_mcc 0.999929 0.363219 16 ## 9 max min_per_class_accuracy 0.000004 0.662420 145 ## 10 max mean_per_class_accuracy 0.000000 0.685334 224 ## ## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>) h2o.sensitivity(perf.dl,0.5) ## Warning in h2o.find_row_by_threshold(object, t): Could not find exact ## threshold: 0.5 for this set of metrics; using closest threshold found: ## 0.496293348880151. Run h2o.predict and apply your desired threshold on a ## probability column. ## [[1]] ## [1] 0.3898305 h2o.auc(perf.dl) ## [1] 0.7568822  The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers. H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.  Metric Description Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall. Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate. Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives. Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant AUC Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.0.90 – 1 = excellent (A)0.8 – 0.9 = good (B)0.7 – 0.8 = fair (C).6 – 0.7 = poor (D)0.5 – 0.5 = fail (F) Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.  Metric Model 3 GBM Model Deep Learning Model Accuracy(max) 0.882038(t=0.435479) 0.876676(t=0.442757) 0.865952(t=0.999999) Precision(max) 1.0(t=0.821606) 1.0(t=0802184) 1.0(t=1.0) Recall(max) 1.0 1.0 1.0(t=0) Specificity(max) 1.0 1.0 1.0(t=1) Sensitivity 0.2033898 0.1355932 0.3898305(t=0.5) AUC 0.8492389 0.8630573 0.756882 Note: ‘t’ denotes threshold. Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier. If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits. The AUC metric for the GBM model is also higher than that of Model 3. Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster. ## Conclusion In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue. If you have questions or suggestions, please comment below. ### Additional Reading ### About the Authors Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies. Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post. # We Are Not Having a Productive Debate About Women in Tech Post Syndicated from Bozho original https://techblog.bozho.net/not-productive-debate-women-tech/ Yes, it’s about the “anti-diversity memo”. But I won’t go into particular details of the memo, the firing, who’s right and wrong, who’s liberal and who’s conservative. Actually, I don’t need to repeat this post, which states almost exactly what I think about the particular issue. Just in case, and before someone decided to label me as “sexist white male” that knows nothing, I guess should clearly state that I acknowledge that biases against women are real and that I strongly support equal opportunity, and I think there must be more women in technology. I also have to state that I think the author of “the memo” was well-meaning, had some well argued, research-backed points and should not be ostracized. But I want to “rant” about the quality of the debate. On one side we have conservatives who are throwing themselves in defense of the fired googler, insisting that liberals are banning conservative points of view, that it is normal to have so few woman in tech and that everything is actually okay, or even that women are inferior. On the other side we have triggered liberals that are ready to shout “discrimination” and “harassment” at anything that resembles an attempt to claim anything different than total and absolute equality, in many cases using a classical “strawman” argument (e.g. “he’s saying women should not work in tech, he’s obviously wrong”). Everyone seems to be too eager to take side and issue a verdict on who’s right and who’s wrong, to blame the other side for all related and unrelated woes and while doing that, exhibit a huge amount of biases. If the debate is about that, we’d better shut it down as soon as possible, as it’s not going to lead anywhere. No matter how much conservatives want “a debate”, and no matter how much liberals want to advance equality. Oh, and by the way – this “conservatives” vs “liberals” is a false dichotomy. Most people hold a somewhat sensible stance in between. But let’s get to the actual issue: Women are underrepresented in STEM (Science, technology, engineering, mathematics). That is a fact everyone agrees on and is blatantly obvious when you walk in any software company office. Why is that the case? The whole debate revolved around biological and social differences, some of which are probably even true – that women value job flexibility more than being promoted or getting higher salary, that they are more neurotic (on average), that they are less confident, that they are more empathic and so on. These difference have been studied and documented, and as much as I have my reservations about psychology studies (so much so, that even meta-analysis are shown by meta-meta-analysis to be flawed) and social science in general, there seems to be a consensus there (by the way, it’s a shame that Gizmodo removed all the scientific references when they first published “the memo”). But that is not the issue. As it has been pointed out, there’s equal applicability of male and female “inherent” traits when working with technology. Why are we talking about “techonology”, and why not “mining and construction”, as many will point out. Let’s cut that argument once and for all – mining and construction are blue collar jobs that have a high chance of being automated in the near future and are in decline. The problem that we’re trying to solve is – how to make the dominant profession of the future – information technology – one of equal opportunity. Yes, it’s a a bold claim, but software is going to be everywhere and the industry will grow. This is why it’s so important to discuss it, not because we are developers and we are somewhat affected by that. So, there has been extended research on the matter, and the reasons are – surprise – complex and intertwined and there is no simple issue that, once resolved, will unlock the path of women to tech jobs. What would diversity give us and why should we care? Let’s assume for a moment we don’t care about equal opportunity and we are right-leaning, conservative people. Well, imagine you have a growing business and you need to hire developers. What would you prefer – having fewer or more people of whom to choose from? Having fewer or more diverse skills (technical and social) on the job market? The answer is obvious. The more people, regardless of their gender, race, whatever, are on the job market, the better for businesses. So I guess we’ve agreed on the two points so far – that women are underrepresented, and that it’s better for everyone if there are more people with technical skills on the job market, which includes more women. The “final” questions is – how? And this questions seems to not be anywhere in the discussion. Instead, we are going in circles with irrelevant arguments trying to either show that we’ve read more scientific papers than others, that we are more liberal than others or that we are more pro free speech. Back to “how” – in Bulgaria we have a social meme: “I don’t know what is the right way, but the way you are doing it is NOT the right way”. And much of the underlying sentiment of “the memo” is similar – that google should stop doing some of the stuff it is doing about diversity, or do them differently (but doesn’t tell us how exactly). Hiring biases, internal programs, whatever, seem to bother him. But this is just talking about the surface of the problem. These programs are correcting something that remains hidden in “the memo”. Google, on their diversity page, say that 20% of their tech employees are women. At the same time, in another diversity section, they claim “18% of CS graduates are women”. So, I guess, job done – they’ve reached the maximum possible diversity. They’ve hired as many women in tech as CS graduates there are. Anything more than that, even if it doesn’t mean they’ll hire worse developers, will leave the rest of the industry with less women. So, sure, 50/50 in Google would sound cool, but the industry average will still be bad. And that’s the actual, underlying reason that we should have already arrived at, and we should’ve started discussing the “how”. Girls do not see STEM as a thing for them. Our biases are projected on younger girls which culminate at a “this is not for girls” mantra. No matter how diverse hiring policies we have, if we don’t address the issue at a way earlier stage, we aren’t getting anywhere. In schools and even kindergartens we need to have an inclusive environment where “this is not for girls” is frowned upon. We should not discourage girls from liking math, or making math sound uncool and “hard for girls” (in my biased world I actually know more women mathematicians than men). This comic seems like on a different topic (gender-specific toys), but it’s actually not about toys – it’s about what is considered (stereo)typical of a girl to do. And most of these biases are unconscious, and come from all around us (school, TV, outdoor ads, people on the street, relatives, etc.), and it takes effort to confront them. To do that, we need policy decisions. We need lobbying education departments / ministries to encourage girls more in the STEM direction (and don’t worry, they’ll be good at it). By the way, guess what – Google’s diversity program is not just about hiring more women, it actually includes education policies with stuff like “influencing perception about computer science”, “getting more girls to code” and scholarships. Let’s discuss the education policies, the path to getting 40-50% of CS graduates to be female, and before that – more girls in schools with technical focus, and ultimately – how to get society to not perceive technology and science as “not for girls”. Let each girl decide on her own. All the other debates are short-sighted and not to the point at all. Will biological differences matter then? They probably will – but not significantly to justify a high gender imbalance. I am no expert in education policies and I don’t know what will work and what won’t. There is research on the matter that we should look at, and maybe argue about it. Everything else is wasted keystrokes. The post We Are Not Having a Productive Debate About Women in Tech appeared first on Bozho's tech blog. # Yet more reasons to disagree with experts on nPetya Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/yet-more-reasons-to-disagree-with.html In WW II, they looked at planes returning from bombing missions that were shot full of holes. Their natural conclusion was to add more armor to the sections that were damaged, to protect them in the future. But wait, said the statisticians. The original damage is likely spread evenly across the plane. Damage on returning planes indicates where they could damage and still return. The undamaged areas are where they were hit and couldn’t return. Thus, it’s the undamaged areas you need to protect. This is called survivorship bias. Many experts are making the same mistake with regards to the nPetya ransomware. I hate to point this out, because they are all experts I admire and respect, especially @MalwareJake, but it’s still an error. An example is this tweet: The context of this tweet is the discussion of why nPetya was well written with regards to spreading, but full of bugs with regards to collecting on the ransom. The conclusion therefore that it wasn’t intended to be ransomware, but was intended to simply be a “wiper”, to cause destruction. But this is just survivorship bias. If nPetya had been written the other way, with excellent ransomware features and poor spreading, we would not now be talking about it. Even that initial seeding with the trojaned MeDoc update wouldn’t have spread it far enough. In other words, all malware samples we get are good at spreading, either on their own, or because the creator did a good job seeding them. It’s because we never see the ones that didn’t spread. With regards to nPetya, a lot of experts are making this claim. Since it spread so well, but had hopelessly crippled ransomware features, that must have been the intent all along. Yet, as we see from survivorship bias, none of us would’ve seen nPetya had it not been for the spreading feature. # NonPetya: no evidence it was a "smokescreen" Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/06/nonpetya-no-evidence-it-was-smokescreen.html Many well-regarded experts claim that the not-Petya ransomware wasn’t “ransomware” at all, but a “wiper” whose goal was to destroy files, without any intent at letting victims recover their files. I want to point out that there is no real evidence of this. Certainly, things look suspicious. For one thing, it certainly targeted the Ukraine. For another thing, it made several mistakes that prevent them from ever decrypting drives. Their email account was shutdown, and it corrupts the boot sector. But these things aren’t evidence, they are problems. They are things needing explanation, not things that support our preferred conspiracy theory. The simplest, Occam’s Razor explanation explanation is that they were simple mistakes. Such mistakes are common among ransomware. We think of virus writers as professional software developers who thoroughly test their code. Decades of evidence show the opposite, that such software is of poor quality with shockingly bad bugs. It’s true that effectively, nPetya is a wiper. Matthieu Suiche‏ does a great job describing one flaw that prevents it working. @hasherezade does a great job explaining another flaw. But best explanation isn’t that this is intentional. Even if these bugs didn’t exist, it’d still be a wiper if the perpetrators simply ignored the decryption requests. They need not intentionally make the decryption fail. Thus, the simpler explanation is that it’s simply a bug. Ransomware authors test the bits they care about, and test less well the bits they don’t. It’s quite plausible to believe that just before shipping the code, they’d add a few extra features, and forget to regression test the entire suite. I mean, I do that all the time with my code. Some have pointed to the sophistication of the code as proof that such simple errors are unlikely. This isn’t true. While it’s more sophisticated than WannaCry, it’s about average for the current state-of-the-art for ransomware in general. What people think of, such the Petya base, or using PsExec to spread throughout a Windows domain, is already at least a year old. Indeed, the use of PsExec itself is a bit clumsy, when the code for doing the same thing is already public. It’s just a few calls to basic Windows networking APIs. A sophisticated virus would do this itself, rather than clumsily use PsExec. Infamy doesn’t mean skill. People keep making the mistake that the more widespread something is in the news, the more skill, the more of a “conspiracy” there must be behind it. This is not true. Virus/worm writers often do newsworthy things by accident. Indeed, the history of worms, starting with the Morris Worm, has been things running out of control more than the author’s expectations. What makes nPetya newsworthy isn’t the EternalBlue exploit or the wiper feature. Instead, the creators got lucky with MeDoc. The software is used by every major organization in the Ukraine, and at the same time, their website was horribly insecure — laughably insecure. Furthermore, it’s autoupdate feature didn’t check cryptographic signatures. No hacker can plan for this level of widespread incompetence — it’s just extreme luck. Thus, the effect of bumbling around is something that hit the Ukraine pretty hard, but it’s not necessarily the intent of the creators. It’s like how the Slammer worm hit South Korea pretty hard, or how the Witty worm hit the DoD pretty hard. These things look “targeted”, especially to the victims, but it was by pure chance (provably so, in the case of Witty). Certainly, MeDoc was targeted. But then, targeting a single organization is the norm for ransomware. They have to do it that way, giving each target a different Bitcoin address for payment. That it then spread to the entire Ukraine, and further, is the sort of thing that typically surprises worm writers. Finally, there’s little reason to believe that there needs to be a “smokescreen”. Russian hackers are targeting the Ukraine all the time. Whether Russian hackers are to blame for “ransomware” vs. “wiper” makes little difference. Conclusion We know that Russian hackers are constantly targeting the Ukraine. Therefore, the theory that this was nPetya’s goal all along, to destroy Ukraines computers, is a good one. Yet, there’s no actual “evidence” of this. nPetya’s issues are just as easily explained by normal software bugs. The smokescreen isn’t needed. The boot record bug isn’t needed. The single email address that was shutdown isn’t significant, since half of all ransomware uses the same technique. The experts who disagree with me are really smart/experienced people who you should generally trust. It’s just that I can’t see their evidence. Update: I wrote another blogpost about “survivorship bias“, refuting the claim by many experts talking about the sophistication of the spreading feature. Update: comment asks “why is there no Internet spreading code?”. The answer is “I don’t know”, but unanswerable questions aren’t evidence of a conspiracy. “What aren’t there any stars in the background?” isn’t proof the moon landings are fake, such because you can’t answer the question. One guess is that you never want ransomware to spread that far, until you’ve figured out how to get payment from so many people. # From Idea to Launch: Getting Your First Customers Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/how-to-get-your-first-customers/ After deciding to build an unlimited backup service and developing our own storage platform, the next step was to get customers and feedback. Not all customers are created equal. Let’s talk about the types, and when and how to attract them. ## How to Get Your First Customers First Step – Don’t Launch Publicly Launch when you’re ready for the judgments of people who don’t know you at all. Until then, don’t launch. Sign up users and customers either that you know, those you can trust to cut you some slack (while providing you feedback), or at minimum those for whom you can set expectations. For months the Backblaze website was a single page with no ability to get the product and minimal info on what it would be. This is not to counter the Lean Startup ‘iterate quickly with customer feedback’ advice. Rather, this is an acknowledgement that there are different types of feedback required based on your development stage. Sign Up Your Friends We knew all of our first customers; they were friends, family, and previous co-workers. Many knew what we were up to and were excited to help us. No magic marketing or tech savviness was required to reach them – we just asked that they try the service. We asked them to provide us feedback on their experience and collected it through email and conversations. While the feedback wasn’t unbiased, it was nonetheless wide-ranging, real, and often insightful. These people were willing to spend time carefully thinking about their feedback and delving deeper into the conversations. Broaden to Beta Unless you’re famous or your service costs$1 million per customer, you’ll probably need to expand quickly beyond your friends to build a business – and to get broader feedback. Our next step was to broaden the customer base to beta users.

Opening up the service in beta provides three benefits:

1. Air cover for the early warts. There are going to be issues, bugs, unnecessarily complicated user flows, and poorly worded text. Beta tells people, “We don’t consider the product ‘done’ and you should expect some of these issues. Please be patient with us.”
2. A request for feedback. Some people always provide feedback, but beta communicates that you want it.
3. An awareness opportunity. Opening up in beta provides an early (but not only) opportunity to have an announcement and build awareness.

Pitching Beta to Press
Not all press cares about, or is even willing to cover, beta products. Much of the mainstream press wants to write about services that are fully live, have scale, and are important in the marketplace. However, there are a number of sites that like to cover the leading edge – and that means covering betas. Techcrunch, Ars Technica, and SimpleHelp covered our initial private beta launch. I’ll go into the details of how to work with the press to cover your announcements in a post next month.

Private vs. Public Beta
Both private and public beta provide all three of the benefits above. The difference between the two is that private betas are much more controlled, whereas public ones bring in more users. But this isn’t an either/or – I recommend doing both.

Private Beta
For our original beta in 2008, we decided that we were comfortable with about 1,000 users subscribing to our service. That would provide us with a healthy amount of feedback and get some early adoption, while not overwhelming us or our server capacity, and equally important not causing cash flow issues from having to buy more equipment. So we decided to limit the sign-up to only the first 1,000 people who signed up; then we would shut off sign-ups for a while.

But how do you even get 1,000 people to sign up for your service? In our case, get some major publications to write about our beta. (Note: In a future post I’ll explain exactly how to find and reach out to writers. Sign up to receive all of the entrepreneurial posts in this series.)

Public Beta
For our original service (computer backup), we did not have a public beta; but when we launched Backblaze B2, we had a private and then a public beta. The private beta allowed us to work out early kinks, while the public beta brought us a more varied set of use cases. In public beta, there is no cap on the number of users that may try the service.

While this is a first-class problem to have, if your service is flooded and stops working, it’s still a problem. Think through what you will do if that happens. In our early days, when our system could get overwhelmed by volume, we had a static web page hosted with a different registrar that wouldn’t let customers sign up but would tell them when our service would be open again. When we reached a critical volume level we would redirect to it in order to at least provide status for when we could accept more customers.

Collect Feedback
Since one of the goals of betas is to get feedback, we made sure that we had our email addresses clearly presented on the site so users could send us thoughts. We were most interested in broad qualitative feedback on users’ experience, so all emails went to an internal mailing list that would be read by everyone at Backblaze.

For our B2 public and private betas, we also added an optional short survey to the sign-up process. In order to be considered for the private beta you had to fill the survey out, though we found that 80% of users continued to fill out the survey even when it was not required. This survey had both closed-end questions (“how much data do you have”) and open-ended ones (“what do you want to use cloud storage for?”).

BTW, despite us getting a lot of feedback now via our support team, Twitter, and marketing surveys, we are always open to more – you can email me directly at gleb.budman {at} backblaze.com.

Don’t Throw Away Users
Initially our backup service was available only on Windows, but we had an email sign-up list for people who wanted it for their Mac. This provided us with a sense of market demand and a ready list of folks who could be beta users and early adopters when we had a Mac version. Have a service targeted at doctors but lawyers are expressing interest? Capture that.

## Product Launch

When
The first question is “when” to launch. Presuming your service is in ‘public beta’, what is the advantage of moving out of beta and into a “version 1.0”, “gold”, or “public availability”? That depends on your service and customer base. Some services fly through public beta. Gmail, on the other hand, was (in)famous for being in beta for 5 years, despite having over 100 million users.

The term beta says to users, “give us some leeway, but feel free to use the service”. That’s fine for many consumer apps and will have near zero impact on them. However, services aimed at businesses and government will often not be adopted with a beta label as the enterprise customers want to know the company feels the service is ‘ready’. While Backblaze started out as a purely consumer service, because it was a data backup service, it was important for customers to trust that the service was ready.

No product is bug-free. But from a product readiness perspective, the nomenclature should also be a reflection of the quality of the product. You can launch a product with one feature that works well out of beta. But a product with fifty features on which half the users will bump into problems should likely stay in beta. The customer feedback, surveys, and your own internal testing should guide you in determining this quality during the beta. Be careful about “we’ve only seen that one time” or “I haven’t been able to reproduce that on my machine”; those issues are likely to scale with customers when you launch.

How
Launching out of beta can be as simple as removing the beta label from the website/product. However, this can be a great time to reach out to press, write a blog post, and send an email announcement to your customers.

Consider thanking your beta testers somehow; can they get some feature turned out for free, an extension of their trial, or premium support? If nothing else, remember to thank them for their feedback. Users that signed up during your beta are likely the ones who will propel your service. They had the need and interest to both be early adopters and deal with bugs. They are likely the key to getting 1,000 true fans.

The Beginning
The title of this post was “Getting your first customers”, because getting to launch may feel like the peak of your journey when you’re pre-launch, but it really is just the beginning. It’s a step along the journey of building your business. If your launch is wildly successful, enjoy it, work to build on the momentum, but don’t lose track of building your business. If your launch is a dud, go out for a coffee with your team, say “well that sucks”, and then get back to building your business. You can learn a tremendous amount from your early customers, and they can become your biggest fans, but the success of your business will depend on what you continue to do the months and years after your launch.

The post From Idea to Launch: Getting Your First Customers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

# 2017-05-09 bias-и и дебъгване

Post Syndicated from Vasil Kolev original https://vasil.ludost.net/blog/?p=3352

Нещо странично.

Тия дни в офиса около някакви занимания обсъждахме следната задача:

“Имаме банда пирати (N на брой, капитан и N-1 останали членове), които искат да си разделят съкровище от 100 пари. Пиратите имат строга линейна йерархия (знае се кой след кой е). Разделянето става по следния начин – текущият капитан предлага разпределение, гласува се и ако събере половината или повече от гласовете се приема, ако не – убиват го и следващия по веригата предлага разпределение. Въпросът е какво трябва да предложи капитанът, така че всички да се съгласят, ако приемем, че всички в екипажа са перфектни логици. Също така пиратите са кръвожадни и ако при гласуване против има шанс да спечели и същите пари, пак ще предпочете да убие капитана. Също така всички са алчни и целта е капитанът да запази най-много за себе си.”
(задачата не идва от икономиката, въпреки че и там всички са перфектни логици и за това толкова много им се дънят теориите)

Решението на задачата е интересно (за него – по-долу), но е доста по-интересно колко трудно се оказа да я реша. Първоначалната ми идея беше просто на горната половина от пиратите да се разделят намалящи суми, понеже това е стандартния начин, по който се случват нещата. Това се оказа неефективно. После ми напомниха (което сам трябваше да се сетя), че такива задачи се решават отзад-напред и по индукция, и като за начало започнахме с въпроса, какво става ако са само двама?

Първият ми отговор беше – ами другия член на екипажа ще иска винаги да убие капитана, щото така ще вземе всичко. Обаче се оказа, че и капитана има глас, т.е. ако останат само двама, капитанът взима всичко и разпределението е 100 за него и нищо за другия.

Какво следва, ако са трима? Казах – добре, тогава даваш на единия 1, на другия 2, и останалото за капитана, понеже ако останат само двама, последния няма да вземе нищо, капитанът гласува за себе си и втория и да е за и против, няма значение. Само че няма нужда да даваме нищо на средния, щото не ни пука за мнението му, така всъщност правилното разпределение идва 1, 0, 99. Тук пак си пролича bias-а, пак очаквах да има някаква пропорция.

Long story short, следващата итерация е 0, 1, 0, 99, понеже така ако не се съгласят, на следващия ход предпоследния ако не се съгласи няма да вземе нищо, и на другите двама мнението няма значение. Pattern-а мисля, че си личи 🙂

Лошото е колко много влияеше bias-а, който съм натрупал от четене за разпределения в реалния живот – какво са пиратите, как няма перфектни логици (и реално никой няма да смята по тоя начин, а ще се стремят към нещо, което им се вижда честно), как това тотално изключва политическата възможност N/2+1 от долната част да гласуват винаги против, докато не дойде всичкото до тях и после да си го разделят по равно и всякакви подобни варианти от реалния живот. Ако примерът беше с каквото и да е друго (например не включваше хора), вероятно щеше да е доста по-лесно да гледам абстрактно.

Което е още един довод в подкрепа на идеята ми, че много по-лесно се дебъгва нещо чуждо (често и което никога не си виждал), отколкото нещо, с което почти постоянно се занимаваш. Над 90% от проблемите (това не се базира на никаква статистика, а на усещане) са достатъчно прости, че да могат да се решат със стандартни методи и да не изискват много задълбочено познаване на системата (половината ми живот е минал в дебъгване на неща, които не разбирам, доста по-често успешно, отколкото не) и вероятно като/ако правя debug workshop-а (за който много хора ми натякват), ще е с проблеми, с които и аз не съм запознат, да е наистина забавно …

# Get wordy with our free resources

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/get-wordy-with-our-free-resources/

Here at the Raspberry Pi Foundation, we take great pride in the wonderful free resources we produce for you to use in classes, at home and in coding clubs. We publish them under a Creative Commons licence, and they’re an excellent way to develop your digital-making skills.

With yesterday being World Poetry Day (I’m a day late to the party. Shhh), I thought I’d share some wordy-themed [wordy-themed? Are you sure? – Ed] resources for you all to have a play with.

## Shakespearean Insult Generator

Have you ever found yourself lost for words just when the moment calls for your best comeback? With the Shakespearean Insult Generator, your mumbled retorts to life’s awkward situations will have the lyrical flow of our nation’s most beloved bard.

Thou sodden-witted lord! Thou hast no more brain than I have in mine elbows!

Not only will the generator provide you with hours of potty-mouthed fun, it’ll also teach you how to read and write data in CSV format using Python, how to manipulate lists, and how to choose a random item from a list.

## Talk like a Pirate

Ye’ll never be forced t’walk the plank once ye learn how to talk like a scurvy ol’ pirate… yaaaarrrgh!

The Talk like a Pirate speech generator teaches you how to use jQuery to cause live updates on a web page, how to write regular expressions to match patterns and words, and how to create a web page to input text and output results.

Once you’ve mastered those skills, you can use them to create other speech generators. How about a speech generator that turns certain words into their slang counterparts? Or one that changes words into txt speak – laugh into LOL, and see you into CU?

## Secret Agent Chat

So you’ve already mastered insults via list manipulation and random choice, and you’ve converted words into hilarious variations through matching word patterns and input/output. What’s next?

The Secret Agent Chat resource shows you how random numbers can be used to encrypt messages, how iteration can be used to encrypt individual characters, and, to make sure nobody cracks your codes, the importance of keeping your keys secret. And with these new skills under your belt, you can write and encrypt messages between you and your friends, ensuring that nobody will be able to read your secrets.

## Unlocking your transferable skill set

One of the great things about building projects like these is the way it expands your transferable skill set. When you complete a project using one of our resources, you gain abilities that can be transferred to other projects and situations. You might never need to use a ‘Talk like a Pirate’ speech generator, but you might need to create a way to detect and alter certain word patterns in a document. And while you might be able to coin your own colourful insults, making the Shakespearean Insult Generator gives you the ability to select words from lists at random, allowing you to write a program that picks names to create sports or quiz teams without bias.

All of our resources are available for free on our website, and we continually update them to offer you more opportunities to work on your skills, whatever your age and experience.

Have you built anything from our resources? Let us know in the comments.

The post Get wordy with our free resources appeared first on Raspberry Pi.

# Some notes on the RAND 0day report

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/03/some-notes-on-rand-0day-report.html

The RAND Corporation has a research report on the 0day market [ * ]. It’s pretty good. They talked to all the right people. It should be considered the seminal work on the issue. They’ve got the pricing about right ($1 million for full chain iPhone exploit, but closer to$100k for others). They’ve got the stats about right (5% chance somebody else will discover an exploit).

Yet, they’ve got some problems, namely phrasing the debate as activists want, rather than a neutral view of the debate.

The report frequently uses the word “stockpile”. This is a biased term used by activists. According to the dictionary, it means:

a large accumulated stock of goods or materials, especially one held in reserve for use at a time of shortage or other emergency.

Activists paint the picture that the government (NSA, CIA, DoD, FBI) buys 0day to hold in reserve in case they later need them. If that’s the case, then it seems reasonable that it’s better to disclose/patch the vuln then let it grow moldy in a cyberwarehouse somewhere.

But that’s not how things work. The government buys vulns it has immediate use for (primarily). Almost all vulns it buys are used within 6 months. Most vulns in its “stockpile” have been used in the previous year. These cyberweapons are not in a warehouse, but in active use on the front lines.

This is top secret, of course, so people assume it’s not happening. They hear about no cyber operations (except Stuxnet), so they assume such operations aren’t occurring. Thus, they build up the stockpiling assumption rather than the active use assumption.

If the RAND wanted to create an even more useful survey, they should figure out how many thousands of times per day our government (NSA, CIA, DoD, FBI) exploits 0days. They should characterize who they target (e.g. terrorists, child pornographers), success rate, and how many people they’ve killed based on 0days. It’s this data, not patching, that is at the root of the policy debate.

That 0days are actively used determines pricing. If the government doesn’t have immediate need for a vuln, it won’t pay much for it, if anything at all. Conversely, if the government has urgent need for a vuln, it’ll pay a lot.

Let’s say you have a remote vuln for Samsung TVs. You go to the NSA and offer it to them. They tell you they aren’t interested, because they see no near term need for it. Then a year later, spies reveal ISIS has stolen a truckload of Samsung TVs, put them in all the meeting rooms, and hooked them to Internet for video conferencing. The NSA then comes back to you and offers $500k for the vuln. Likewise, the number of sellers affects the price. If you know they desperately need the Samsung TV 0day, but they are only offering$100k, then it likely means that there’s another seller also offering such a vuln.

That’s why iPhone vulns are worth \$1 million for a full chain exploit, from browser to persistence. They use it a lot, it’s a major part of ongoing cyber operations. Each time Apple upgrades iOS, the change breaks part of the existing chain, and the government is keen on getting a new exploit to fix it. They’ll pay a lot to the first vuln seller who can give them a new exploit.

Thus, there are three prices the government is willing to pay for an 0day (the value it provides to the government):

• the price for an 0day they will actively use right now (high)
• the price for an 0day they’ll stockpile for possible use in the future (low)
• the price for an 0day they’ll disclose to the vendor to patch (very low)

That these are different prices is important to the policy debate. When activists claim the government should disclose the 0day they acquire, they are ignoring the price the 0day was acquired for. Since the government actively uses the 0day, they are acquired for a high-price, with their “use” value far higher than their “patch” value. It’s an absurd argument to make that they government should then immediately discard that money, to pay “use value” prices for “patch” results.

If the policy becomes that the NSA/CIA should disclose/patch the 0day they buy, it doesn’t mean business as usual acquiring vulns. It instead means they’ll stop buying 0day.

In other words, “patching 0day” is not an outcome on either side of the debate. Either the government buys 0day to use, or it stops buying 0day. In neither case does patching happen.

The real argument is whether the government (NSA, CIA, DoD, FBI) should be acquiring, weaponizing, and using 0day in the first place. It demands that we unilaterally disarm our military, intelligence, and law enforcement, preventing them from using 0days against our adversaries while our adversaries continue to use 0days against us.

That’s the gaping hole in both the RAND paper and most news reporting of this controversy. They characterize the debate the way activists want, as if the only question is the value of patching. They avoid talking about unilateral cyberdisarmament, even though that’s the consequence of the policy they are advocating. They avoid comparing the value of 0days to our country for active use (high) compared to the value to to our country for patching (very low).

Conclusion

It’s nice that the RAND paper studied the value of patching and confirmed it’s low, that only around 5% of our cyber-arsenal is likely to be found by others. But it’d be nice if they also looked at the point of view of those actively using 0days on a daily basis, rather than phrasing the debate the way activists want.