# Judge Issues Devastating Order Against BitTorrent Copyright Troll

Post Syndicated from Ernesto original https://torrentfreak.com/judge-issues-devastating-order-bittorrent-copyright-troll-180110/

In recent years, file-sharers around the world have been pressured to pay significant settlement fees, or face legal repercussions.

These so-called “copyright trolling” efforts have been a common occurrence in the United States since the turn of the last decade.

Increasingly, however, courts are growing weary of these cases. Many districts have turned into no-go zones for copyright trolls, and the people behind Prenda Law were arrested and are being prosecuted in a criminal case.

In the Western District of Washington, the tide also appears to have turned. After Venice PI, a copyright holder of the film “Once Upon a Time in Venice”, sued a man who later passed away, concerns were raised over the validity of the evidence.

Venice PI responded to the concerns with a declaration explaining its data gathering technique and assuring the Court that false positives are out of the question.

That testimony didn’t help much though, as a recently filed minute order shows this week. The order applies to a dozen cases and prohibits the company from reaching out to any defendants until further notice, as there are several alarming issues that have to be resolved first.

One of the problems is that Venice PI declared that it’s owned by a company named Lost Dog Productions, which in turn is owned by Voltage Productions. Interestingly, these companies don’t appear in the usual records.

“A search of the California Secretary of State’s online database, however, reveals no registered entity with the name ‘Lost Dog’ or ‘Lost Dog Productions’,” the Court notes.

“Moreover, although ‘Voltage Pictures, LLC’ is registered with the California Secretary of State, and has the same address as Venice PI, LLC, the parent company named in plaintiff’s corporate disclosure form, ‘Voltage Productions, LLC,’ cannot be found in the California Secretary of State’s online database and does not appear to exist.”

In other words, the company that filed the lawsuit, as well as its parent company, are extremely questionable.

While the above is a reason for concern, it’s just the tip of the iceberg. The Court not only points out administrative errors, but it also has serious doubts about the evidence collection process. This was carried out by the German company MaverickEye, which used the tracking technology of another German company, GuardaLey.

GuardaLey CEO Benjamin Perino, who claims that he coded the tracking software, wrote a declaration explaining that the infringement detection system at issue “cannot yield a false positive.” However, the Court doubts this statement and Perino’s qualifications in general.

“Perino has been proffered as an expert, but his qualifications consist of a technical high school education and work experience unrelated to the peer-to-peer file-sharing technology known as BitTorrent,” the Court writes.

“Perino does not have the qualifications necessary to be considered an expert in the field in question, and his opinion that the surveillance program is incapable of error is both contrary to common sense and inconsistent with plaintiff’s counsel’s conduct in other matters in this district. Plaintiff has not submitted an adequate offer of proof.”

It seems like the Court would prefer to see an assessment from a qualified independent expert instead of the person who wrote the software. For now, this means that the IP-address evidence, in these cases, is not good enough. That’s quite a blow for the copyright holder.

If that wasn’t enough the Court also highlights another issue that’s possibly even more problematic. When Venice PI requested the subpoenas to identify alleged pirates, they relied on declarations from Daniel Arheidt, a consultant for MaverickEye.

These declarations fail to mention, however, that MaverickEye doesn’t have the proper paperwork to collect IP addresses.

“Nowhere in Arheidt’s declarations does he indicate that either he or MaverickEye is licensed in Washington to conduct private investigation work,” the order reads.

This is important, as doing private investigator work without a license is a gross misdemeanor in Washington. The copyright holder was aware of this requirement because it was brought up in related cases in the past.

“Plaintiff’s counsel has apparently been aware since October 2016, when he received a letter concerning LHF Productions, Inc. v. Collins, C16-1017 RSM, that Arheidt might be committing a crime by engaging in unlicensed surveillance of Washington citizens, but he did not disclose this fact to the Court.”

The order is very bad news for Venice PI. The company had hoped to score a few dozen easy settlements but the tables have now been turned. The Court instead asks the company to explain the deficiencies and provide additional details. In the meantime, the copyright holder is urged not to spend or transfer any of the settlement money that has been collected thus far.

The latter indicates that Venice PI might have to hand defendants their money back, which would be pretty unique.

The order suggests that the Judge is very suspicious of these trolling activities. In a footnote there’s a link to a Fight Copyright Trolls article which revealed that the same counsel dismissed several cases, allegedly to avoid having IP-address evidence scrutinized.

Even more bizarrely, in another footnote the Court also doubts if MaverickEye’s aforementioned consultant, Daniel Arheidt, actually exists.

“The Court has recently become aware that Arheidt is the latest in a series of German declarants (Darren M. Griffin, Daniel Macek, Daniel Susac, Tobias Fieser, Michael Patzer) who might be aliases or even fictitious.

“Plaintiff will not be permitted to rely on Arheidt’s declarations or underlying data without explaining to the Court’s satisfaction Arheidt’s relationship to the above-listed declarants and producing proof beyond a reasonable doubt of Arheidt’s existence,” the court adds.

These are serious allegations, to say the least.

If a copyright holder uses non-existent companies and questionable testimony from unqualified experts, after obtaining evidence illegally, to get a subpoena backed by a fictitious person… something’s not quite right.

A copy of the minute order, which affects a series of cases, is available here (pdf).


# Random with care

Post Syndicated from Eevee original https://eev.ee/blog/2018/01/02/random-with-care/

Hi! Here are a few loose thoughts about picking random numbers.

## A word about crypto

DON’T ROLL YOUR OWN CRYPTO

This is all aimed at frivolous pursuits like video games. Hell, even video games where money is at stake should be deferring to someone who knows way more than I do. Otherwise you might find out that your deck shuffles in your poker game are woefully inadequate and some smartass is cheating you out of millions. (If your random number generator has fewer than 226 bits of state, it can’t even generate every possible shuffling of a deck of cards!)

## Use the right distribution

Most languages have a random number primitive that spits out a number uniformly in the range [0, 1), and you can go pretty far with just that. But beware a few traps!

### Random pitches

Say you want to pitch up a sound by a random amount, perhaps up to an octave. Your audio API probably has a way to do this that takes a pitch multiplier, where I say “probably” because that’s how the only audio API I’ve used works.

Easy peasy. If 1 is unchanged and 2 is pitched up by an octave, then all you need is rand() + 1. Right?

No! Pitch is exponential — within the same octave, the “gap” between C and C♯ is about half as big as the gap between B and the following C. If you pick a pitch multiplier uniformly, you’ll have a noticeable bias towards the higher pitches.

One octave corresponds to a doubling of pitch, so if you want to pick a random note, you want 2 ** rand().
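As a concrete sketch (the function name and the `octaves` parameter are mine, not from any particular audio API):

```python
import random

def random_pitch_multiplier(octaves=1.0):
    # rand() is uniform in [0, 1); exponentiating makes the result
    # uniform in *pitch* (log) space instead of multiplier space,
    # so each semitone is equally likely.
    return 2 ** (random.random() * octaves)
```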

### Random directions

For two dimensions, you can just pick a random angle with rand() * TAU.

If you want a vector rather than an angle, or if you want a random direction in three dimensions, it’s a little trickier. You might be tempted to just pick a random point where each component is rand() * 2 - 1 (ranging from −1 to 1), but that’s not quite right. A direction is a point on the surface (or, equivalently, within the volume) of a sphere, and picking each component independently produces a point within the volume of a cube; the result will be a bias towards the corners of the cube, where there’s much more extra volume beyond the sphere.

No? Well, just trust me. I don’t know how to make a diagram for this.

Anyway, you could use the Pythagorean theorem a few times and make a huge mess of things, or it turns out there’s a really easy way that even works for two or four or any number of dimensions. You pick each coordinate from a Gaussian (normal) distribution, then normalize the resulting vector. In other words, using Python’s random module:

```python
import math
import random

def random_direction():
    x = random.gauss(0, 1)
    y = random.gauss(0, 1)
    z = random.gauss(0, 1)
    r = math.sqrt(x*x + y*y + z*z)
    return x/r, y/r, z/r
```

Why does this work? I have no idea!

Note that it is possible to get zero (or close to it) for every component, in which case the result is nonsense. You can re-roll all the components if necessary; just check that the magnitude (or its square) is less than some epsilon, which is equivalent to throwing away a tiny sphere at the center and shouldn’t affect the distribution.
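Folding that guard into the function above might look like this (a sketch; the epsilon value is an arbitrary choice of mine):

```python
import math
import random

def random_direction(epsilon=1e-6):
    # Re-roll until the vector is comfortably away from the origin,
    # comparing the squared magnitude to avoid a needless sqrt.
    while True:
        x = random.gauss(0, 1)
        y = random.gauss(0, 1)
        z = random.gauss(0, 1)
        mag_sq = x*x + y*y + z*z
        if mag_sq > epsilon * epsilon:
            break
    r = math.sqrt(mag_sq)
    return x/r, y/r, z/r
```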

### Beware Gauss

Since I brought it up: the Gaussian distribution is a pretty nice one for choosing things in some range, where the middle is the common case and should appear more frequently.

That said, I never use it, because it has one annoying drawback: the Gaussian distribution has no minimum or maximum value, so you can’t really scale it down to the range you want. In theory, you might get any value out of it, with no limit on scale.

In practice, it’s astronomically rare to actually get such a value out. I did a hundred million trials just to see what would happen, and the largest value produced was 5.8.

But, still, I’d rather not knowingly put extremely rare corner cases in my code if I can at all avoid it. I could clamp the ends, but that would cause unnatural bunching at the endpoints. I could reroll if I got a value outside some desired range, but I prefer to avoid rerolling when I can, too; after all, it’s still (astronomically) possible to have to reroll for an indefinite amount of time. (Okay, it’s really not, since you’ll eventually hit the period of your PRNG. Still, though.) I don’t bend over backwards here — I did just say to reroll when picking a random direction, after all — but when there’s a nicer alternative I’ll gladly use it.

And lo, there is a nicer alternative! Enter the beta distribution. It always spits out a number in [0, 1], so you can easily swap it in for the standard normal function, but it takes two “shape” parameters α and β that alter its behavior fairly dramatically.

With α = β = 1, the beta distribution is uniform, i.e. no different from rand(). As α increases, the distribution skews towards the right, and as β increases, the distribution skews towards the left. If α = β, the whole thing is symmetric with a hump in the middle. The higher either one gets, the more extreme the hump (meaning that value is far more common than any other). With a little fiddling, you can get a number of interesting curves.
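Python exposes this as `random.betavariate`; a tiny sketch of the symmetric α = β case (the helper name is mine):

```python
import random

def humped_rand(shape=3):
    # alpha == beta gives a symmetric hump centered on 0.5, always
    # within [0, 1]; higher shape values make the hump more extreme.
    return random.betavariate(shape, shape)
```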

Screenshots don’t really do it justice, so here’s a little Wolfram widget that lets you play with α and β live:

Note that if α = 1, then 1 is a possible value; if β = 1, then 0 is a possible value. You probably want them both greater than 1, which clamps the endpoints to zero.

Also, it’s possible to have either α or β or both be less than 1, but this creates very different behavior: the corresponding endpoints become poles.

Anyway, something like α = β = 3 is probably close enough to normal for most purposes but already clamped for you. And you could easily replicate something like, say, NetHack’s incredibly bizarre rnz function.

### Random frequency

Say you want some event to have an 80% chance to happen every second. You (who am I kidding, I) might be tempted to do something like this:

```python
if random() < 0.8 * dt:
    do_thing()
```

In an ideal world, dt is always the same and is equal to 1 / f, where f is the framerate. Replace that 80% with a variable, say P, and every tic you have a P / f chance to do the… whatever it is.

Each second, f tics pass, so you’ll make this check f times. The chance that any check succeeds is the inverse of the chance that every check fails, which is $$1 – \left(1 – \frac{P}{f}\right)^f$$.

For P of 80% and a framerate of 60, that’s a total probability of 55.3%. Wait, what?

Consider what happens if the framerate is 2. On the first tic, you roll 0.4 twice — but probabilities are combined by multiplying, and splitting work up by dt only works for additive quantities. You lose some accuracy along the way. If you’re dealing with something that multiplies, you need an exponent somewhere.
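If you do want a framerate-independent per-tic roll, the exponent version looks like this (a sketch; `per_tic_chance` is a name I made up):

```python
import random

def per_tic_chance(p_per_second, dt):
    # Survival chances multiply, so split P across tics with an
    # exponent rather than a division: over 1/dt tics,
    # 1 - (1 - per_tic_chance) ** (1/dt) works back out to P.
    return 1 - (1 - p_per_second) ** dt

def should_fire(dt, p_per_second=0.8):
    # Roll once per tic; the overall per-second chance stays at P
    # no matter what the framerate is.
    return random.random() < per_tic_chance(p_per_second, dt)
```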

But in this case, maybe you don’t want that at all. Each separate roll you make might independently succeed, so it’s possible (but very unlikely) that the event will happen 60 times within a single second! Or 200 times, if that’s someone’s framerate.

If you explicitly want something to have a chance to happen on a specific interval, you have to check on that interval. If you don’t have a gizmo handy to run code on an interval, it’s easy to do yourself with a time buffer:

```python
timer += dt
# here, 1 is the "every 1 seconds"
while timer > 1:
    timer -= 1
    if random() < 0.8:
        do_thing()
```

Using while means rolls still happen even if you somehow skipped over an entire second.

(For the curious, and the nerds who already noticed: the expression $$1 – \left(1 – \frac{P}{f}\right)^f$$ converges to a specific value! As the framerate increases, it becomes a better and better approximation for $$1 – e^{-P}$$, which for the example above is 0.551. Hey, 60 fps is pretty accurate — it’s just accurately representing something nowhere near what I wanted. Er, you wanted.)

### Rolling your own

Of course, you can fuss with the classic [0, 1] uniform value however you want. If I want a bias towards zero, I’ll often just square it, or multiply two of them together. If I want a bias towards one, I’ll take a square root. If I want something like a Gaussian/normal distribution, but with clearly-defined endpoints, I might add together n rolls and divide by n. (The normal distribution is just what you get if you roll infinite dice and divide by infinity!)
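A sketch of those three tricks (names are mine):

```python
import random

def biased_low():
    # Squaring pushes results toward 0.
    return random.random() ** 2

def biased_high():
    # Taking the square root pushes results toward 1.
    return random.random() ** 0.5

def dice_rand(n=4):
    # Averaging n rolls approximates a bell curve around 0.5,
    # but with hard endpoints at 0 and 1.
    return sum(random.random() for _ in range(n)) / n
```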

It’d be nice to be able to understand exactly what this will do to the distribution. Unfortunately, that requires some calculus, which this post is too small to contain, and which I didn’t even know much about myself until I went down a deep rabbit hole while writing, and which in many cases is straight up impossible to express directly.

Here’s the non-calculus bit. A source of randomness is often graphed as a PDF — a probability density function. You’ve almost certainly seen a bell curve graphed, and that’s a PDF. They’re pretty nice, since they do exactly what they look like: they show the relative chance that any given value will pop out. On a bog standard bell curve, there’s a peak at zero, and of course zero is the most common result from a normal distribution.

(Okay, actually, since the results are continuous, it’s vanishingly unlikely that you’ll get exactly zero — but you’re much more likely to get a value near zero than near any other number.)

For the uniform distribution, which is what a classic rand() gives you, the PDF is just a straight horizontal line — every result is equally likely.

If there were a calculus bit, it would go here! Instead, we can cheat. Sometimes. Mathematica knows how to work with probability distributions in the abstract, and there’s a free web version you can use. For the example of squaring a uniform variable, try this out:

```
PDF[TransformedDistribution[u^2, u \[Distributed] UniformDistribution[{0, 1}]], u]
```

(The \[Distributed] is a funny tilde that doesn’t exist in Unicode, but which Mathematica uses as a first-class operator. Also, press Shift+Enter to evaluate the line.)

This will tell you that the distribution is… $$\frac{1}{2\sqrt{u}}$$. Weird! You can plot it:

```
Plot[%, {u, 0, 1}]
```

(The % refers to the result of the last thing you did, so if you want to try several of these, you can just do Plot[PDF[…], u] directly.)

The resulting graph shows that numbers around zero are, in fact, vastly — infinitely — more likely than anything else.

What about multiplying two together? I can’t figure out how to get Mathematica to understand this, but a great amount of digging revealed that the answer is -ln x, and from there you can plot them both on Wolfram Alpha. They’re similar, though squaring has a much better chance of giving you high numbers than multiplying two separate rolls — which makes some sense, since if either of two rolls is a low number, the product will be even lower.

What if you know the graph you want, and you want to figure out how to play with a uniform roll to get it? Good news! That’s a whole thing called inverse transform sampling. All you have to do is take an integral. Good luck!

This is all extremely ridiculous. New tactic: Just Simulate The Damn Thing. You already have the code; run it a million times, make a histogram, and tada, there’s your PDF. That’s one of the great things about computers! Brute-force numerical answers are easy to come by, so there’s no excuse for producing something like rnz. (Though, be sure your histogram has sufficiently narrow buckets — I tried plotting one for rnz once and the weird stuff on the left side didn’t show up at all!)
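A sketch of that brute-force approach (the helper name is mine):

```python
import random
from collections import Counter

def simulated_pdf(roll, buckets=20, trials=100000):
    # Bucket a pile of rolls of a [0, 1) source to approximate its PDF.
    hist = Counter(int(roll() * buckets) for _ in range(trials))
    return [hist[b] / trials for b in range(buckets)]
```

Feed it `lambda: random.random() ** 2` and the first bucket towers over the rest, just like the hyperbola Mathematica produced.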

By the way, I learned something from futzing with Mathematica here! Taking the square root (to bias towards 1) gives a PDF that’s a straight diagonal line, nothing like the hyperbola you get from squaring (to bias towards 0). How do you get a straight line the other way? Surprise: $$1 – \sqrt{1 – u}$$.

### Okay, okay, here’s the actual math

I don’t claim to have a very firm grasp on this, but I had a hell of a time finding it written out clearly, so I might as well write it down as best I can. This was a great excuse to finally set up MathJax, too.

Say $$u(x)$$ is the PDF of the original distribution and $$u$$ is a representative number you plucked from that distribution. For the uniform distribution, $$u(x) = 1$$. Or, more accurately,

$$u(x) = \begin{cases} 1 & \text{ if } 0 \le x \lt 1 \\ 0 & \text{ otherwise } \end{cases}$$

Remember that $$x$$ here is a possible outcome you want to know about, and the PDF tells you the relative probability that a roll will be near it. This PDF spits out 1 for every $$x$$, meaning every number between 0 and 1 is equally likely to appear.

We want to do something to that PDF, which creates a new distribution, whose PDF we want to know. I’ll use my original example of $$f(u) = u^2$$, which creates a new PDF $$v(x)$$.

The trick is that we need to work in terms of the cumulative distribution function for $$u$$. Where the PDF gives the relative chance that a roll will be (“near”) a specific value, the CDF gives the relative chance that a roll will be less than a specific value.

The conventions for this seem to be a bit fuzzy, and nobody bothers to explain which ones they’re using, which makes this all the more confusing to read about… but let’s write the CDF with a capital letter, so we have $$U(x)$$. In this case, $$U(x) = x$$, a straight 45° line (at least between 0 and 1). With the definition I gave, this should make sense. At some arbitrary point like 0.4, the value of the PDF is 1 (0.4 is just as likely as anything else), and the value of the CDF is 0.4 (you have a 40% chance of getting a number from 0 to 0.4).

Calculus ahoy: the PDF is the derivative of the CDF, which means it measures the slope of the CDF at any point. For $$U(x) = x$$, the slope is always 1, and indeed $$u(x) = 1$$. See, calculus is easy.

Okay, so, now we’re getting somewhere. What we want is the CDF of our new distribution, $$V(x)$$. The CDF is defined as the probability that a roll $$v$$ will be less than $$x$$, so we can literally write:

$$V(x) = P(v \le x)$$

(This is why we have to work with CDFs, rather than PDFs — a PDF gives the chance that a roll will be “nearby,” whatever that means. A CDF is much more concrete.)

What is $$v$$, exactly? We defined it ourselves; it’s the do something applied to a roll from the original distribution, or $$f(u)$$.

$$V(x) = P\!\left(f(u) \le x\right)$$

Now the first tricky part: we have to solve that inequality for $$u$$, which means we have to do something, backwards to $$x$$.

$$V(x) = P\!\left(u \le f^{-1}(x)\right)$$

Almost there! We now have a probability that $$u$$ is less than some value, and that’s the definition of a CDF!

$$V(x) = U\!\left(f^{-1}(x)\right)$$

Hooray! Now to turn these CDFs back into PDFs, all we need to do is differentiate both sides and use the chain rule. If you never took calculus, don’t worry too much about what that means!

$$v(x) = u\!\left(f^{-1}(x)\right)\left|\frac{d}{dx}f^{-1}(x)\right|$$

Wait! Where did that absolute value come from? It takes care of whether $$f(x)$$ increases or decreases. It’s the least interesting part here by far, so, whatever.

There’s one more magical part here when using the uniform distribution — $$u(\dots)$$ is always equal to 1, so that entire term disappears! (Note that this only works for a uniform distribution with a width of 1; PDFs are scaled so the entire area under them sums to 1, so if you had a rand() that could spit out a number between 0 and 2, the PDF would be $$u(x) = \frac{1}{2}$$.)

$$v(x) = \left|\frac{d}{dx}f^{-1}(x)\right|$$

So for the specific case of modifying the output of rand(), all we have to do is invert, then differentiate. The inverse of $$f(u) = u^2$$ is $$f^{-1}(x) = \sqrt{x}$$ (no need for a ± since we’re only dealing with positive numbers), and differentiating that gives $$v(x) = \frac{1}{2\sqrt{x}}$$. Done! This is also why square root comes out nicer; inverting it gives $$x^2$$, and differentiating that gives $$2x$$, a straight line.

Incidentally, that method for turning a uniform distribution into any distribution — inverse transform sampling — is pretty much the same thing in reverse: integrate, then invert. For example, when I saw that taking the square root gave $$v(x) = 2x$$, I naturally wondered how to get a straight line going the other way, $$v(x) = 2 – 2x$$. Integrating that gives $$2x – x^2$$, and then you can use the quadratic formula (or just ask Wolfram Alpha) to solve $$2x – x^2 = u$$ for $$x$$ and get $$f(u) = 1 – \sqrt{1 – u}$$.
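To sanity-check that numerically, a quick sketch (the function name is mine): the transform below should produce the down-sloping line \(v(x) = 2 - 2x\), whose mean works out to \(\int_0^1 x(2 - 2x)\,dx = \frac{1}{3}\).

```python
import random

def ramp_down():
    # Inverse transform sampling for the PDF v(x) = 2 - 2x:
    # integrate to get the CDF 2x - x^2, then invert it.
    return 1 - (1 - random.random()) ** 0.5
```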

Multiply two rolls is a bit more complicated; you have to write out the CDF as an integral and you end up doing a double integral and wow it’s a mess. The only thing I’ve retained is that you do a division somewhere, which then gets integrated, and that’s why it ends up as $$-\ln x$$.

And that’s quite enough of that! (Okay but having math in my blog is pretty cool and I will definitely be doing more of this, sorry, not sorry.)

## Random vs varied

Sometimes, random isn’t actually what you want. We tend to use the word “random” casually to mean something more like chaotic, i.e., with no discernible pattern. But that’s not really random. In fact, given how good humans can be at finding incidental patterns, apparent patterns in genuinely random output aren’t all that unlikely! Consider that when you roll two dice, they’ll come up either the same or only one apart almost half the time. Coincidence? Well, yes.

If you ask for randomness, you’re saying that any outcome — or series of outcomes — is acceptable, including five heads in a row or five tails in a row. Most of the time, that’s fine. Some of the time, it’s less fine, and what you really want is variety. Here are a couple examples and some fairly easy workarounds.

### NPC quips

The nature of games is such that NPCs will eventually run out of things to say, at which point further conversation will give the player a short brush-off quip — a slight nod from the designer to the player that, hey, you hit the end of the script.

Some NPCs have multiple possible quips and will give one at random. The trouble with this is that it’s very possible for an NPC to repeat the same quip several times in a row before abruptly switching to another one. With only a few options to choose from, getting the same option twice or thrice (especially across an entire game, which may have numerous NPCs) isn’t all that unlikely. The notion of an NPC quip isn’t very realistic to start with, but having someone repeat themselves and then abruptly switch to something else is especially jarring.

The easy fix is to show the quips in order! Paradoxically, this is more consistently varied than choosing at random — the original “order” is likely to be meaningless anyway, and it already has the property that the same quip can never appear twice in a row.

If you like, you can shuffle the list of quips every time you reach the end, but take care here — it’s possible that the last quip in the old order will be the same as the first quip in the new order, so you may still get a repeat. (Of course, you can just check for this case and swap the first quip somewhere else if it bothers you.)
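That check-and-swap might look like this (a sketch; the generator name is mine, and it assumes the quips are distinct):

```python
import random

def quip_cycle(quips):
    # Yield quips forever, reshuffling each pass, never repeating the
    # same quip twice in a row across a reshuffle boundary.
    order = list(quips)
    last = None
    while True:
        random.shuffle(order)
        if order[0] == last and len(order) > 1:
            # The new order would repeat the previous quip; swap it
            # somewhere else in the list.
            j = random.randrange(1, len(order))
            order[0], order[j] = order[j], order[0]
        for quip in order:
            yield quip
        last = order[-1]
```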

That last behavior is, in fact, the canonical way that Tetris chooses pieces — the game simply shuffles a list of all 7 pieces, gives those to you in shuffled order, then shuffles them again to make a new list once it’s exhausted. There’s no avoidance of duplicates, though, so you can still get two S blocks in a row, or even two S and two Z all clumped together, but no more than that. Some Tetris variants take other approaches, such as actively avoiding repeats even several pieces apart or deliberately giving you the worst piece possible.

### Random drops

Random drops are often implemented as a flat chance each time. Maybe enemies have a 5% chance to drop health when they die. Statistically speaking, over the long term, a player will see health drops for about 5% of enemy kills.

Over the short term, they may be desperate for health and not survive to see the long term. So you may want to put a thumb on the scale sometimes. Games in the Metroid series, for example, have a somewhat infamous bias towards whatever kind of drop they think you need — health if your health is low, missiles if your missiles are low.

I can’t give you an exact approach to use, since it depends on the game and the feeling you’re going for and the variables at your disposal. In extreme cases, you might want to guarantee a health drop from a tough enemy when the player is critically low on health. (Or if you’re feeling particularly evil, you could go the other way and deny the player health when they most need it…)

The problem becomes a little different, and worse, when the event that triggers the drop is relatively rare. The pathological case here would be something like a raid boss in World of Warcraft, which requires hours of effort from a coordinated group of people to defeat, and which has some tiny chance of dropping a good item that will go to only one of those people. This is why I stopped playing World of Warcraft at 60.

Dialing it back a little bit gives us Enter the Gungeon, a roguelike where each room is a set of encounters and each floor only has a dozen or so rooms. Initially, you have a 1% chance of getting a reward after completing a room — but every time you complete a room and don’t get a reward, the chance increases by 9%, up to a cap of 80%. Once you get a reward, the chance resets to 1%.

The natural question is: how frequently, exactly, can a player expect to get a reward? We could do math, or we could Just Simulate The Damn Thing.

```python
from collections import Counter
import random

histogram = Counter()

TRIALS = 1000000

chance = 1
rooms_cleared = 0
rewards_found = 0
while rewards_found < TRIALS:
    rooms_cleared += 1
    if random.random() * 100 < chance:
        # Reward!
        rewards_found += 1
        histogram[rooms_cleared] += 1
        rooms_cleared = 0
        chance = 1
    else:
        chance = min(80, chance + 9)

for gaps, count in sorted(histogram.items()):
    print(f"{gaps:3d} | {count / TRIALS * 100:6.2f}%", '#' * (count // (TRIALS // 100)))
```
```
  1 |   0.98%
  2 |   9.91% #########
  3 |  17.00% ################
  4 |  20.23% ####################
  5 |  19.21% ###################
  6 |  15.05% ###############
  7 |   9.69% #########
  8 |   5.07% #####
  9 |   2.09% ##
 10 |   0.63%
 11 |   0.12%
 12 |   0.03%
 13 |   0.00%
 14 |   0.00%
 15 |   0.00%
```

We’ve got kind of a hilly distribution, skewed to the left, which is up in this histogram. Most of the time, a player should see a reward every three to six rooms, which is maybe twice per floor. It’s vanishingly unlikely to go through a dozen rooms without ever seeing a reward, so a player should see at least one per floor.

Of course, this simulated a single continuous playthrough; when starting the game from scratch, your chance at a reward always starts fresh at 1%, the worst it can be. If you want to know about how many rewards a player will get on the first floor, hey, Just Simulate The Damn Thing.

```
  0 |   0.01%
  1 |  13.01% #############
  2 |  56.28% ########################################################
  3 |  27.49% ###########################
  4 |   3.10% ###
  5 |   0.11%
  6 |   0.00%
```

Cool. Though, that’s assuming exactly 12 rooms; it might be worth changing that to pick at random in a way that matches the level generator.

(Enter the Gungeon does some other things to skew probability, which is very nice in a roguelike where blind luck can make or break you. For example, if you kill a boss without having gotten a new gun anywhere else on the floor, the boss is guaranteed to drop a gun.)

### Critical hits

I suppose this is the same problem as random drops, but backwards.

Say you have a battle sim where every attack has a 6% chance to land a devastating critical hit. Presumably the same rules apply to both the player and the AI opponents.

Consider, then, that the AI opponents have exactly the same 6% chance to ruin the player’s day. Consider also that this gives them a 0.36% chance (0.06²) to critical hit twice in a row. That doesn’t sound like much, but across an entire playthrough, it’s not unlikely that a player will see it happen and find it incredibly annoying.

Perhaps it would be worthwhile to explicitly forbid AI opponents from getting consecutive critical hits.
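A minimal sketch of that rule, assuming a 6% base chance (the class name is mine):

```python
import random

class CritRoller:
    # Rolls critical hits at the given chance, but never allows two
    # crits in a row.
    def __init__(self, chance=0.06):
        self.chance = chance
        self.last_was_crit = False

    def roll(self):
        crit = (not self.last_was_crit) and random.random() < self.chance
        self.last_was_crit = crit
        return crit
```

One side effect worth knowing: suppressing back-to-back crits slightly lowers the effective crit rate, to about 5.7% here, so you may want to nudge the base chance up to compensate.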

## In conclusion

An emerging theme here has been to Just Simulate The Damn Thing. So consider Just Simulating The Damn Thing. Even a simple change to a random value can do surprising things to the resulting distribution, so unless you feel like differentiating the inverse function of your code, maybe test out any non-trivial behavior and make sure it’s what you wanted. Probability is hard to reason about.

# When should behaviour outside a community have consequences inside it?

Post Syndicated from Matthew Garrett original https://mjg59.dreamwidth.org/50099.html

Free software communities don’t exist in a vacuum. They’re made up of people who are also members of other communities, people who have other interests and engage in other activities. Sometimes these people engage in behaviour outside the community that may be perceived as negatively impacting communities that they’re a part of, but most communities have no guidelines for determining whether behaviour outside the community should have any consequences within the community. This post isn’t an attempt to provide those guidelines, but aims to provide some things that community leaders should think about when the issue is raised.

## Some things to consider

### Did the behaviour violate the law?

This seems like an obvious bar, but it turns out to be a pretty bad one. For a start, many things that are commonly accepted behaviour in various communities may be illegal (eg, reverse engineering work may contravene a strict reading of US copyright law), and taking this to an extreme would result in expelling anyone who’s ever broken a speed limit. On the flip side, refusing to act unless someone broke the law is also a bad threshold – much behaviour that communities consider unacceptable may be entirely legal.

There’s also the problem of determining whether a law was actually broken. The criminal justice system is (correctly) biased to an extent in favour of the defendant – removing someone’s rights in society should require meeting a high burden of proof. However, this is not the threshold that most communities hold themselves to in determining whether to continue permitting an individual to associate with them. An incident that does not result in a finding of criminal guilt (either through an explicit finding or a failure to prosecute the case in the first place) should not be ignored by communities for that reason.

### Did the behaviour violate your community norms?

There’s plenty of behaviour that may be acceptable within other segments of society but unacceptable within your community (eg, lobbying for the use of proprietary software is considered entirely reasonable in most places, but rather less so at an FSF event). If someone can be trusted to segregate their behaviour appropriately then this may not be a problem, but that’s probably not sufficient in all cases. For instance, if someone acts entirely reasonably within your community but engages in lengthy anti-semitic screeds on 4chan, it’s legitimate to question whether permitting them to continue being part of your community serves your community’s best interests.

### Did the behaviour violate the norms of the community in which it occurred?

Of course, the converse is also true – there’s behaviour that may be acceptable within your community but unacceptable in another community. It’s easy to write off someone acting in a way that contravenes the standards of another community but wouldn’t violate your expected behavioural standards – after all, if it wouldn’t breach your standards, what grounds do you have for taking action?

But you need to consider that if someone consciously contravenes the behavioural standards of a community they’ve chosen to participate in, they may be willing to do the same in your community. If pushing boundaries is a frequent trait then it may not be too long until you discover that they’re also pushing your boundaries.

## Why do you care?

A community’s code of conduct can be looked at in two ways – as a list of behaviours that will be punished if they occur, or as a list of behaviours that are unlikely to occur within that community. The former is probably the primary consideration when a community adopts a CoC, but the latter is how many people considering joining a community will think about it.

If your community includes individuals that are known to have engaged in behaviour that would violate your community standards, potential members or contributors may not trust that your CoC will function as adequate protection. A community that contains people known to have engaged in sexual harassment in other settings is unlikely to be seen as hugely welcoming, even if they haven’t (as far as you know!) done so within your community. The way your members behave outside your community is going to be seen as saying something about your community, and that needs to be taken into account.

A second (and perhaps less obvious) aspect is that membership of some higher profile communities may be seen as lending general legitimacy to someone, and they may play off that to legitimise behaviour or views that would be seen as abhorrent by the community as a whole. If someone’s anti-semitic views (for example) are seen as having more relevance because of their membership of your community, it’s reasonable to think about whether keeping them in your community serves the best interests of your community.

## Conclusion

I’ve said things like “considered” or “taken into account” a bunch here, and that’s for a good reason – I don’t know what the thresholds should be for any of these things, and there doesn’t seem to be even a rough consensus in the wider community. We’ve seen cases in which communities have acted based on behaviour outside their community (eg, Debian removing Jacob Appelbaum after it was revealed that he’d sexually assaulted multiple people), but there’s been no real effort to build a meaningful decision making framework around that.

As a result, communities struggle to make consistent decisions. It’s unreasonable to expect individual communities to solve these problems on their own, but that doesn’t mean we can ignore them. It’s time to start coming up with a real set of best practices.


# FCC Repeals U.S. Net Neutrality Rules

Post Syndicated from Ernesto original https://torrentfreak.com/fcc-repeals-u-s-net-neutrality-rules-171214/

In recent months, millions of people have protested the FCC’s plan to repeal U.S. net neutrality rules, which were put in place by the Obama administration.

However, an outpouring of public outrage, criticism from major tech companies, and even warnings from pioneers of the Internet had no effect.

Today the FCC voted to repeal the old rules, effectively ending net neutrality.

Under the net neutrality rules that have been in effect during recent years, ISPs were specifically prohibited from blocking or throttling “lawful” traffic, and from engaging in paid prioritization of it. In addition, Internet providers could be regulated as carriers under Title II.

Now that these rules have been repealed, Internet providers have more freedom to experiment with paid prioritization. Under the new guidelines, they can charge customers extra for access to some online services, or throttle certain types of traffic.

Most critics of the repeal fear that, now that the old net neutrality rules are in the trash, ‘fast lanes’ for some services, and throttling for others, will become commonplace in the U.S.

This could also mean that BitTorrent traffic becomes a target once again. After all, it was Comcast’s ‘secretive’ BitTorrent throttling that started the broader net neutrality debate, now ten years ago.

Comcast’s throttling history is a sensitive issue, not least for the company itself.

Before the Obama-era net neutrality rules, the ISP vowed that it would no longer discriminate against specific traffic classes. Ahead of the FCC vote yesterday, it doubled down on this promise.

“Despite repeated distortions and biased information, as well as misguided, inaccurate attacks from detractors, our Internet service is not going to change,” writes David Cohen, Comcast’s Chief Diversity Officer.

“We have repeatedly stated, and reiterate today, that we do not and will not block, throttle, or discriminate against lawful content.”

It’s worth highlighting the term “lawful” in the last sentence. It is by no means a promise that pirate sites won’t be blocked.

As we’ve highlighted in the past, blocking pirate sites was already an option under the now-repealed rules. The massive copyright loophole made sure of that. Targeting all torrent traffic is even an option, in theory.

That said, today’s FCC vote certainly makes it easier for ISPs to block or throttle BitTorrent traffic across the entire network. For the time being, however, there are no signs that any ISPs plan to do so.

If they do, we will know soon enough. The FCC requires all ISPs to be transparent under the new plan. They have to disclose network management practices, blocking efforts, commercial prioritization, and the like.

And with the current focus on net neutrality, ISPs are likely to tread carefully, or else they might just face an exodus of customers.

Finally, it’s worth highlighting that today’s vote is not the end of the road yet. Net neutrality supporters are planning to convince Congress to overturn the repeal. In addition, there is also talk of taking the matter to court.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN discounts, offers and coupons

# Digital Rights Groups Warn Against Copyright “Parking Tickets” Bill

Post Syndicated from Ernesto original https://torrentfreak.com/digital-rights-groups-warn-against-copyright-parking-tickets-bill-171203/

Nearly five years ago, US lawmakers agreed to carry out a comprehensive review of United States copyright law.

In the following years, the House Judiciary Committee held dozens of hearings on various topics, from DMCA reform and fair use exemptions to the possibility of a small claims court for copyright offenses.

While many of the topics never got far beyond the discussion stage, there’s now a new bill on the table that introduces a small claims process for copyright offenses.

The CASE Act, short for Copyright Alternative in Small-Claims Enforcement, proposes to establish a small claims court to resolve copyright disputes outside the federal courts. This means that legal costs will be significantly reduced.

The idea behind the bill is to lower the barrier for smaller copyright holders with limited resources, who usually refrain from going to court. Starting a federal case with proper representation is quite costly, while the outcome is rather uncertain.

While this may sound noble, digital rights groups, including the Electronic Frontier Foundation (EFF) and Public Knowledge, warn that the bill could do more harm than good.

One of the problems they highlight is that the proposed “Copyright Claims Board” would be connected to the US Copyright Office. Given this connection, the groups fear that its judges might be somewhat biased towards copyright holders.

“Unfortunately, the Copyright Office has a history of putting copyright holders’ interests ahead of other important legal rights and policy concerns. We fear that any small claims process the Copyright Office conducts will tend to follow that pattern,” EFF’s Mitch Stoltz warns.

The copyright claims board will have three judges who can hear cases from all over the country. They can award damages of up to $15,000 per infringement, or $30,000 per case.

Participation is voluntary and potential defendants can opt out. However, if they fail to do so, any order against them can still be binding and enforceable through a federal court.

An opt-in system would be much better, according to EFF, as that would prevent abuse by copyright holders who are looking for cheap default judgments.

“[A]n opt-in approach would help ensure that both participants affirmatively choose to litigate their dispute in this new court, and help prevent copyright holders from abusing the system to obtain inexpensive default judgments that will be hard to appeal.”

While smart defendants would opt out in certain situations, those who are less familiar with the law might become the target of what are essentially copyright parking tickets.

“Knowledgeable defendants will opt out of such proceedings, while legally unsophisticated targets, including ordinary Internet users, could find themselves committed to an unfair, accelerated process handing out largely unappealable $5,000 copyright parking tickets,” EFF adds.

In its current form, the small claims court may prove to be an ideal tool for copyright trolls, including those who made a business out of filing federal cases against alleged BitTorrent pirates.

This copyright troll angle is highlighted by both EFF and Public Knowledge, who urge lawmakers to revise the bill.

“[I]t’s not hard to see how trolls and default judgments could come to dominate the system,” Public Knowledge says.

“Instead of creating a reliable, fair mechanism for independent artists to pursue scaled infringement claims online, it would establish an opaque, unaccountable legislation mill that will likely get bogged down by copyright trolls and questionable claimants looking for a payout,” they conclude.

Various copyright holder groups are more positive about the bill. The Copyright Alliance, for example, says that it will empower creators with smaller budgets to protect their rights.

“The next generation of creators deserves copyright protection that is as pioneering and forward-thinking as they are. They deserve practical solutions to the real-life problems they face as creators. This bill is the first step.”

# The Official Projects Book volume 3 — out now

Post Syndicated from Rob Zwetsloot original https://www.raspberrypi.org/blog/projects-book-3/

Hey folks, Rob from The MagPi here with some very exciting news! The third volume of the Official Raspberry Pi Projects Book is out right this second, and we’ve packed its 200 pages with the very best Raspberry Pi projects and guides!
## A peek inside the projects book

We start you off with a neat beginner’s guide to programming in Python, walking you from the very basics all the way through to building the classic videogame Pong from scratch!

Check out what’s inside!

Then we showcase some of the most inspiring projects from around the community, such as a camera for taking photos of the moon, a smart art installation, amazing arcade machines, and much more.

Emulate the Apollo mission computers on the Raspberry Pi

Next, we ease you into a series of tutorials that will help you get the most out of your Raspberry Pi. Among other things, you’ll be learning how to sync your Pi to Dropbox, use it to create a waterproof camera, and even emulate an Amiga.

We’ve also assembled a load of reviews to let you know what you should be buying if you want to extend your Pi experience.

Learn more about Pimoroni’s Enviro pHAT

I am extremely proud of what the entire MagPi team has put together here, and I know you’ll enjoy reading it as much as we enjoyed creating it.

## How to get yours

In the UK, you can get your copy of the new Official Raspberry Pi Projects Book at WH Smith and all good newsagents today. In the US, print copies will be available in stores such as Barnes & Noble very soon.

Or order a copy from the Raspberry Pi Press store — before the end of Sunday 26 November, you can use the code BLACKFRIDAY to get 10% off your purchase!

There’s also the digital version, which you can get via The MagPi Android and iOS apps. And, as always, there’s a free PDF, which is available here.

We think this new projects book is the perfect stocking filler, although we may be just a tad biased. Anyway, I hope you’ll love it!

The post The Official Projects Book volume 3 — out now appeared first on Raspberry Pi.
# Some notes about the Kaspersky affair

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/10/some-notes-about-kaspersky-affair.html

I thought I’d write up some notes about Kaspersky, the Russian anti-virus vendor that many believe has ties to Russian intelligence.

There are two angles to this story. One is whether the accusations are true. The second is the poor way the press has handled the story, with mainstream outlets like the New York Times more intent on pushing government propaganda than informing us what’s going on.

### The press

Before we address Kaspersky, we need to talk about how the press covers this.

The mainstream media’s stories have been pure government propaganda, like this one from the New York Times. It garbles the facts of what happened, and relies primarily on anonymous government sources that cannot be held accountable. It’s so messed up that we can’t easily challenge it because we aren’t even sure exactly what it’s claiming.

The Society of Professional Journalists has a name for this abuse of anonymous sources: the “Washington Game”. Journalists can identify this as bad journalism, but big newspapers like The New York Times continue to do it anyway, because how dare anybody criticize them?

For all that I hate the anti-American bias of The Intercept, at least they’ve had stories that de-garble what’s going on, that explain things so that we can challenge them.

### Our Government

Our government can’t tell us everything, of course. But at the same time, they need to tell us something, to at least be clear what their accusations are. These vague insinuations through the media hurt their credibility, not help it. The obvious craptitude is making us in the cybersecurity community come to Kaspersky’s defense, which is not the government’s aim at all.
There are lots of issues involved here, but let’s consider the major one insinuated by the NYTimes story: that Kaspersky was getting “data” files along with copies of suspected malware. This is troublesome if true. But, as Kaspersky claims today, it’s because they had detected malware within a zip file, and uploaded the entire zip — including the data files within the zip.

This is reasonable. This is indeed how anti-virus generally works. It completely defeats the NYTimes insinuations.

This isn’t to say Kaspersky is telling the truth, of course, but that’s not the point. The point is that we are getting vague propaganda from the government, further garbled by the press, making Kaspersky’s clear defense the credible party in the affair.

It’s certainly possible for Kaspersky to write signatures to look for strings like “TS//SI/OC/REL TO USA” that appear in secret US documents, then upload them to Russia. If that’s what our government believes is happening, they need to come out and be explicit about it. They can easily set up honeypots, in the way described in today’s story, to confirm it. However, it seems the government’s description of honeypots is that Kaspersky only uploads files that were clearly viruses, not data.

### Kaspersky

I believe Kaspersky is guilty: that the company, and Eugene himself, works directly with Russian intelligence. That’s because, on a personal basis, people in government have given me specific, credible stories — the sort of thing they should be making public. And these stories are wholly unrelated to stories that have been made public so far.

You shouldn’t believe me, of course, because I won’t go into details you can challenge. I’m not trying to convince you, I’m just disclosing my point of view.

But there are some public reasons to doubt Kaspersky. For example, when trying to sell to our government, they’ve claimed they can help us against terrorists. The translation of this is that they could help our intelligence services.
Well, if they are willing to help our intelligence services against customers who are terrorists, then why wouldn’t they likewise help Russian intelligence services against their adversaries?

Then there is how Russia works. It’s a violent country. Most of the people mentioned in that “Steele Dossier” have died. In the hacker community, hackers are often coerced to help the government. Many have simply gone missing.

Being rich doesn’t make Kaspersky immune from this — it makes him more of a target. Russian intelligence knows he’s getting all sorts of good intelligence, such as malware written by foreign intelligence services. It’s unbelievable they wouldn’t put the screws on him to get this sort of thing.

Russia is our adversary. It’d be foolish of our government to buy anti-virus from Russian companies. Likewise, the Russian government won’t buy such products from American companies.

### Conclusion

I have enormous disrespect for mainstream outlets like The New York Times and the way they’ve handled the story. It makes me want to come to Kaspersky’s defense.

I have enormous respect for Kaspersky technology. They do good work. But I hear stories. I don’t think our government should be trusting Kaspersky at all. For that matter, our government shouldn’t trust any cybersecurity products from Russia, China, Iran, etc.

# timeShift(GrafanaBuzz, 1w) Issue 18

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/20/timeshiftgrafanabuzz-1w-issue-18/

Welcome to another issue of timeShift. This week we released Grafana 4.6.0-beta2, which includes some fixes for alerts, annotations, the Cloudwatch data source, and a few panel updates. We’re also gearing up for Oredev, one of the biggest tech conferences in Scandinavia, November 7-10. In addition to sponsoring, our very own Carl Bergquist will be presenting “Monitoring for everyone.” Hope to see you there – swing by our booth and say hi!

#### Latest Release

Grafana 4.6-beta-2 is now available!
Grafana 4.6.0-beta2 adds fixes for:

• ColorPicker display
• Alerting test
• Cloudwatch improvements
• CSV export
• Text panel enhancements
• Annotation fix for MySQL

To see more details on what’s in the newest version, please see the release notes.

#### From the Blogosphere

Screeps and Grafana: Graphing your AI: If you’re unfamiliar with Screeps, it’s an MMO RTS game for programmers, where the objective is to grow your colony through programming your units’ AI. You control your colony by writing JavaScript, which operates 24/7 in the single persistent real-time world filled with other players. This article walks you through graphing all your game stats with Grafana.

ntopng Grafana Integration: The Beauty of Data Visualization: Our friends at ntop created a tutorial so that you can graph ntop monitoring data in Grafana. It goes through the metrics exposed, configuring the ntopng Data Source plugin, and building your first dashboard. They’ve also created a nice video tutorial of the process.

Installing Graphite and Grafana to Display the Graphs of Centreon: This article provides a step-by-step guide to getting your Centreon data into Graphite and visualizing the data in Grafana.

Bit v. Byte Episode 3 – Metrics for the Win: Bit v. Byte is a new weekly podcast about the web industry, tools and techniques upcoming and in use today. This episode dives into metrics, and discusses Grafana, Prometheus and NGINX Amplify.

Code-Quickie: Visualize heating with Grafana: With the winter weather coming, Reinhard wanted to monitor the stats in his boiler room. This article covers not only the visualization of the data, but also the different devices and sensors you can use in your own home.

RuuviTag with C.H.I.P – BLE – Node-RED: Following the temperature-monitoring theme from the last article, Tobias writes about his journey of hooking up his new RuuviTag to Grafana to measure temperature, relative humidity, air pressure and more.
#### Early Bird will be Ending Soon

Early bird discounts will be ending soon, but you still have a few days to lock in the lower price. We will be closing early bird on October 31, so don’t wait until the last minute to take advantage of the discounted tickets!

Also, there’s still time to submit your talk. We’ll accept submissions through the end of October. We’re looking for technical and non-technical talks of all sizes. Submit a CFP now.

#### Grafana Plugins

This week we have updates to two panels and a brand new panel that can add some animation to your dashboards. Installing plugins in Grafana is easy; for on-prem Grafana, use the grafana-cli tool, or with 1 click if you are using Hosted Grafana.

NEW PLUGIN

Geoloop Panel – The Geoloop panel is a simple visualizer for joining GeoJSON to Time Series data, and animating the geo features in a loop. An example of using the panel would be showing the rate of rainfall during a 5-hour storm.

UPDATED PLUGIN

Breadcrumb Panel – This plugin keeps track of dashboards you have visited within one session and displays them as a breadcrumb. The latest update fixes some issues with back navigation and url query params.

UPDATED PLUGIN

Influx Admin Panel – The Influx Admin panel duplicates features from the now-deprecated Web Admin Interface for InfluxDB and has lots of features, like letting you see the currently running queries, which can also be easily killed. Changes in the latest release:

• Converted to typescript project based on typescript-template-datasource
• Select Databases. This only works with PR#8096
• Added time format options
• Show tags from response
• Support template variables in the query

#### Contribution of the week:

Each week we highlight some of the important contributions from our amazing open source community. Thank you for helping make Grafana better!

The Stockholm Go Meetup had a hackathon this week and sent a PR for letting whitelisted cookies pass through the Grafana proxy.
Thanks to everyone who worked on this PR!

#### Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

This is awesome – we can’t get enough of these public dashboards!

#### We Need Your Help!

Do you have a graph that you love because the data is beautiful or because the graph provides interesting information? Please get in touch. Tweet or send us an email with a screenshot, and we’ll tell you about this fun experiment.

#### Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our

#### How are we doing?

Please tell us how we’re doing. Submit a comment on this article below, or post something at our community forum. Help us make these weekly roundups better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

# Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.
In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine.

RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output.

In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

## Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

• Install and configure RStudio with Athena
• Log in to RStudio
• Install R packages
• Connect to Athena
• Create a dataset
• Create models

### Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.
Launching this stack creates all required resources and prerequisites:

• Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
• Provisioning of the EC2 instance in an existing VPC and public subnet
• Installation of Java 8
• Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
• Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
• S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
• RStudio username and password
• Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
• Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

### Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLs or by your outgoing proxy/firewall.

1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
2. Log in to RStudio with the username and password you provided during setup.

### Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

```r
# install pacman – a handy package manager for managing installs
if ("pacman" %in% rownames(installed.packages()) == FALSE) {
  install.packages("pacman")
}
library(pacman)
p_load(h2o, rJava, RJDBC, awsjavasdk)
h2o.init(nthreads = -1)
## Connection successful!
```
```r
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         2 hours 42 minutes
##     H2O cluster version:        3.10.4.6
##     H2O cluster version age:    4 months and 4 days !!!
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   3.30 GB
##     H2O cluster total cores:    4
##     H2O cluster allowed cores:  4
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     H2O Internal Security:      FALSE
##     R Version:                  R version 3.3.3 (2017-03-06)

## Warning in h2o.clusterInfo():
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
```

```r
# install AWS SDK if not present (prerequisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}
load_sdk()
## NULL
```

### Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

```r
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)

# download the file into the current working directory
if (!file.exists(fil)) download.file(URL, fil)

# verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"

drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv,
  'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
  s3_staging_dir=Sys.getenv("ATHENABUCKET"),
  aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
```

Verify the connection. The results returned depend on your specific Athena setup.
```r
con
## <JDBCConnection>

dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"
##  [7] "eventcodes"          "events"              "billboard"
## [10] "billboardtop10"      "elb_logs"            "gdelthist"
## [13] "gdeltmaster"         "twitter"             "twitter3"
```

### Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket.

The table below provides a description of the fields used in this dataset.

| Field | Description |
|---|---|
| year | Year that song was released |
| songtitle | Title of the song |
| artistname | Name of the song artist |
| songid | Unique identifier for the song |
| artistid | Unique identifier for the song artist |
| timesignature | Variable estimating the time signature of the song |
| timesignature_confidence | Confidence in the estimate for the timesignature |
| loudness | Continuous variable indicating the average amplitude of the audio in decibels |
| tempo | Variable indicating the estimated beats per minute of the song |
| tempo_confidence | Confidence in the estimate for tempo |
| key | Variable with twelve levels indicating the estimated key of the song (C, C#, …, B) |
| key_confidence | Confidence in the estimate for key |
| energy | Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness |
| pitch | Continuous variable that indicates the pitch of the song |
| timbre_0_min thru timbre_11_min | Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector |
| timbre_0_max thru timbre_11_max | Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector |
| top10 | Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not) |

### Create an Athena table based on the dataset

In the Athena console, select the default database, sampledb, or create a new database.
Run the following create table statement.

create external table if not exists billboard
(
  year int,
  songtitle string,
  artistname string,
  songID string,
  artistID string,
  timesignature int,
  timesignature_confidence double,
  loudness double,
  tempo double,
  tempo_confidence double,
  key int,
  key_confidence double,
  energy double,
  pitch double,
  timbre_0_min double, timbre_0_max double,
  timbre_1_min double, timbre_1_max double,
  timbre_2_min double, timbre_2_max double,
  timbre_3_min double, timbre_3_max double,
  timbre_4_min double, timbre_4_max double,
  timbre_5_min double, timbre_5_max double,
  timbre_6_min double, timbre_6_max double,
  timbre_7_min double, timbre_7_max double,
  timbre_8_min double, timbre_8_max double,
  timbre_9_min double, timbre_9_max double,
  timbre_10_min double, timbre_10_max double,
  timbre_11_min double, timbre_11_max double,
  Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data';

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.
dbGetQuery(con, "show create table sampledb.billboard")
##    createtab_stmt
## 1  CREATE EXTERNAL TABLE sampledb.billboard(
## 2    year int,
## 3    songtitle string,
## 4    artistname string,
## 5    songid string,
## 6    artistid string,
## 7    timesignature int,
## 8    timesignature_confidence double,
## 9    loudness double,
## 10   tempo double,
## 11   tempo_confidence double,
## 12   key int,
## 13   key_confidence double,
## 14   energy double,
## 15   pitch double,
## 16   timbre_0_min double,
## 17   timbre_0_max double,
## 18   timbre_1_min double,
## 19   timbre_1_max double,
## 20   timbre_2_min double,
## 21   timbre_2_max double,
## 22   timbre_3_min double,
## 23   timbre_3_max double,
## 24   timbre_4_min double,
## 25   timbre_4_max double,
## 26   timbre_5_min double,
## 27   timbre_5_max double,
## 28   timbre_6_min double,
## 29   timbre_6_max double,
## 30   timbre_7_min double,
## 31   timbre_7_max double,
## 32   timbre_8_min double,
## 33   timbre_8_max double,
## 34   timbre_9_min double,
## 35   timbre_9_max double,
## 36   timbre_10_min double,
## 37   timbre_10_max double,
## 38   timbre_11_min double,
## 39   timbre_11_max double,
## 40   top10 int)
## 41 ROW FORMAT DELIMITED
## 42   FIELDS TERMINATED BY ','
## 43 STORED AS INPUTFORMAT
## 44   'org.apache.hadoop.mapred.TextInputFormat'
## 45 OUTPUTFORMAT
## 46   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47 LOCATION
## 48   's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49 TBLPROPERTIES (
## 50   'transient_lastDdlTime'='1505484133')

### Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.
dbGetQuery(con, " SELECT songtitle,artistname,top10 FROM sampledb.billboard WHERE lower(artistname) = 'janet jackson' AND top10 = 1")
##                           songtitle    artistname top10
## 1                           Runaway Janet Jackson     1
## 2                   Because Of Love Janet Jackson     1
## 3                             Again Janet Jackson     1
## 4                                If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson     1
## 6                         Black Cat Janet Jackson     1
## 7                   Come Back To Me Janet Jackson     1
## 8                           Alright Janet Jackson     1
## 9                          Escapade Janet Jackson     1
## 10                    Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*) FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge their impact on the song’s overall popularity. Look at one such property, timesignature, and determine the value that is most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note: Running the preceding query results in the following error:
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested
#fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try
#again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature FROM sampledb.billboard")
dftimesignature <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                     songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

### Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first. This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve
#JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.
r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614            0.652
## 2                    0.906   -9.541 117.742            0.542
nrow(BillboardTrain)
## [1] 7201

### Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year                          songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010                    Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

### Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## |=================================================================| 100%

Inspect the column names in your H2O dataframes.
colnames(train.h2o)
##  [1] "year"                     "songtitle"
##  [3] "artistname"               "songid"
##  [5] "artistid"                 "timesignature"
##  [7] "timesignature_confidence" "loudness"
##  [9] "tempo"                    "tempo_confidence"
## [11] "key"                      "key_confidence"
## [13] "energy"                   "pitch"
## [15] "timbre_0_min"             "timbre_0_max"
## [17] "timbre_1_min"             "timbre_1_max"
## [19] "timbre_2_min"             "timbre_2_max"
## [21] "timbre_3_min"             "timbre_3_max"
## [23] "timbre_4_min"             "timbre_4_max"
## [25] "timbre_5_min"             "timbre_5_max"
## [27] "timbre_6_min"             "timbre_6_max"
## [29] "timbre_7_min"             "timbre_7_max"
## [31] "timbre_8_min"             "timbre_8_max"
## [33] "timbre_9_min"             "timbre_9_max"
## [35] "timbre_10_min"            "timbre_10_max"
## [37] "timbre_11_min"            "timbre_11_max"
## [39] "top10"

### Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this is your dependent variable and everything else is independent.

Create your first model using GLM. Because GLM works best with numeric data, create your model by dropping the non-numeric variables; only the variables that describe the numerical attributes of the song are used in the logistic regression model. You won’t use these variables: "year", "songtitle", "artistname", "songid", or "artistid".

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

#### Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.
h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)

h2o.auc(h2o.performance(modelh1,test.h2o))
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.
dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                       NA         <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

• The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
• The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
• The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, that is, songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, values between 0.5 and 1.0 (or between -1.0 and -0.5) indicate strong correlation, while values between -0.1 and 0.1 indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

• Model 2, in which you keep “energy” and omit “loudness”
• Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.
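As a side note, the Pearson correlation that cor() reports can be computed by hand. The following Python sketch uses made-up loudness/energy values (not the actual dataset) purely to illustrate the calculation behind a correlation coefficient like the 0.7399067 above:

```python
# Toy sketch: Pearson correlation between two hypothetical columns.
# The values below are invented for illustration, not taken from the dataset.
from math import sqrt

loudness = [-6.3, -9.5, -4.2, -11.0, -7.8, -3.5]
energy   = [0.92, 0.55, 0.97, 0.40, 0.70, 0.99]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Values near +1 or -1 flag exactly the multicollinearity discussed above.
r = pearson(loudness, energy)
```

Predictors with a correlation this strong carry largely redundant information, which is why dropping one of them stabilizes the coefficient estimates.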
#### Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"
##  [3] "artistname"               "songid"
##  [5] "artistid"                 "timesignature"
##  [7] "timesignature_confidence" "loudness"
##  [9] "tempo"                    "tempo_confidence"
## [11] "key"                      "key_confidence"
## [13] "energy"                   "pitch"
## [15] "timbre_0_min"             "timbre_0_max"
## [17] "timbre_1_min"             "timbre_1_max"
## [19] "timbre_2_min"             "timbre_2_max"
## [21] "timbre_3_min"             "timbre_3_max"
## [23] "timbre_4_min"             "timbre_4_max"
## [25] "timbre_5_min"             "timbre_5_max"
## [27] "timbre_6_min"             "timbre_6_max"
## [29] "timbre_7_min"             "timbre_7_max"
## [31] "timbre_8_min"             "timbre_8_max"
## [33] "timbre_9_min"             "timbre_9_max"
## [35] "timbre_10_min"            "timbre_10_max"
## [37] "timbre_11_min"            "timbre_11_max"
## [39] "top10"

y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38

modelh2 <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## |=================================================================| 100%

Measure the performance of Model 2.
h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
##
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)

dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                       NA         <NA>

h2o.auc(h2o.performance(modelh2,test.h2o))
## [1] 0.8431933

You can make the following observations:

• The AUC metric is 0.8431933.
• Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
• As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use case, as energy is not significant.

#### Create Model 3: Keep loudness and omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"
##  [3] "artistname"               "songid"
##  [5] "artistid"                 "timesignature"
##  [7] "timesignature_confidence" "loudness"
##  [9] "tempo"                    "tempo_confidence"
## [11] "key"                      "key_confidence"
## [13] "energy"                   "pitch"
## [15] "timbre_0_min"             "timbre_0_max"
## [17] "timbre_1_min"             "timbre_1_max"
## [19] "timbre_2_min"             "timbre_2_max"
## [21] "timbre_3_min"             "timbre_3_max"
## [23] "timbre_4_min"             "timbre_4_max"
## [25] "timbre_5_min"             "timbre_5_max"
## [27] "timbre_6_min"             "timbre_6_max"
## [29] "timbre_7_min"             "timbre_7_max"
## [31] "timbre_8_min"             "timbre_8_max"
## [33] "timbre_9_min"             "timbre_9_max"
## [35] "timbre_10_min"            "timbre_10_max"
## [37] "timbre_11_min"            "timbre_11_max"
## [39] "top10"

y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38

modelh3 <- h2o.glm(y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## |=================================================================| 100%

perfh3 <- h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
##
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)

dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                       NA         <NA>

h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run h2o.predict and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898

h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

• The AUC metric is 0.8492389.
• From the confusion matrix, the model correctly predicts that 33 songs will be Top 10 hits (true positives). However, it also has 28 false positives (songs that the model predicted would be Top 10 hits, but that ended up not making the Top 10), and it misses 26 actual Top 10 hits (false negatives).
• Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion as Model 2.
• Loudness is significant in this model.

Overall, Model 3 predicts a higher number of Top 10 hits with an acceptable accuracy rate. To choose the best fit for production runs, record labels should consider the following factors:

• Desired model accuracy at a given threshold
• Number of correct predictions for Top 10 hits
• Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
##
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
##
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
##
##   0   1
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation. For example, observation 1 is 96.54739% likely not to be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates a Top 10 hit and predict=0 indicates not a Top 10 hit). The second set of results lists the actual predictions made. From the third set of results, this model predicts that 61 songs will be Top 10 hits.

Compute the baseline accuracy by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
##
##   0   1
## 314  59


Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?
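As a sanity check, both accuracy figures follow directly from counts already shown: the baseline accuracy from the test-set class distribution, and Model 3's accuracy from its confusion matrix at the F1-optimal threshold. A small Python sketch using those counts:

```python
# Counts taken from Model 3's confusion matrix on the 2010 test set (above).
tn, fp = 286, 28   # actual non-hits: correctly / incorrectly classified
fn, tp = 26, 33    # actual Top 10 hits: missed / correctly predicted

total = tn + fp + fn + tp               # 373 test songs
model3_accuracy = (tn + tp) / total     # ~0.8552
baseline_accuracy = (tn + fp) / total   # always predict "not Top 10": ~0.8418
```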

View the two models from an investment perspective:

• A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
• How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 Top 10 hits correctly at an optimal threshold, which is more than half of the 59 songs that actually made the Top 10 in 2010.
• It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
• The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

#### GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10 <- as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y = y.dep, x = x.indep,
  training_frame = train.h2o,
  ntrees = 500, max_depth = 4,
  learn_rate = 0.01, seed = 1122,
  distribution = "multinomial")
## |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
##
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run h2o.predict and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
    x = x.indep,
    training_frame = train.h2o,
    epochs = 250,
    hidden = c(250,250),
    activation = "Rectifier",
    seed = 1122,
    distribution = "multinomial"
  )
)
## |=================================================================| 100%
##    user  system elapsed
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
##
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
##
## Gains/Lift Table: Extract with h2o.gainsLift(<model>, <data>) or h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run h2o.predict and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822


The AUC metric for this model is 0.7568822, which is lower than what you got from the earlier models. I recommend further experimentation with different hyperparameters, such as the learning rate, the number of epochs, or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

| Metric | Description |
| --- | --- |
| Sensitivity | The proportion of positives that are correctly identified. Also called the true positive rate, or recall. |
| Specificity | The proportion of negatives that are correctly identified. Also called the true negative rate. |
| Threshold | The cutoff point that balances specificity and sensitivity. While the model may not provide the highest prediction accuracy at this point, it is not biased toward positives or negatives. |
| Precision | The fraction of the instances retrieved that are relevant to the information needed; for example, how many of the positively classified cases are actually positive. |
| AUC | Provides insight into how well the classifier separates the two classes. The implicit goal is to handle situations where the sample distribution is highly skewed, with a tendency to overfit to a single class. 0.9–1.0 = excellent (A); 0.8–0.9 = good (B); 0.7–0.8 = fair (C); 0.6–0.7 = poor (D); 0.5–0.6 = fail (F). |
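These definitions can be sanity-checked against the deep learning model's confusion matrix shown in the H2O output above. A quick Python illustration of mine (not part of the original R workflow), with the counts copied from that output:

```python
# Confusion matrix for the deep learning model, copied from the H2O output above:
#            predicted 0   predicted 1
# actual 0       290            24
# actual 1        36            23
tn, fp = 290, 24
fn, tp = 36, 23

sensitivity = tp / (tp + fn)  # true positive rate / recall
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)    # fraction of predicted positives that are real

print(round(sensitivity, 4))  # 0.3898
print(round(specificity, 4))  # 0.9236
print(round(precision, 4))    # 0.4894
```

The sensitivity value agrees with the 0.3898305 reported by `h2o.sensitivity(perf.dl, 0.5)` above.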

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

| Metric | Model 3 | GBM Model | Deep Learning Model |
| --- | --- | --- | --- |
| Accuracy (max) | 0.882038 (t=0.435479) | 0.876676 (t=0.442757) | 0.865952 (t=0.999999) |
| Precision (max) | 1.0 (t=0.821606) | 1.0 (t=0.802184) | 1.0 (t=1.0) |
| Recall (max) | 1.0 | 1.0 | 1.0 (t=0) |
| Specificity (max) | 1.0 | 1.0 | 1.0 (t=1) |
| Sensitivity | 0.2033898 | 0.1355932 | 0.3898305 (t=0.5) |
| AUC | 0.8492389 | 0.8630573 | 0.756882 |

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

## Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.

### About the Authors

Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

# We Are Not Having a Productive Debate About Women in Tech

Post Syndicated from Bozho original https://techblog.bozho.net/not-productive-debate-women-tech/

Yes, it’s about the “anti-diversity memo”. But I won’t go into particular details of the memo, the firing, who’s right and wrong, who’s liberal and who’s conservative. Actually, I don’t need to repeat this post, which states almost exactly what I think about the particular issue. Just in case, and before someone decides to label me as a “sexist white male” who knows nothing, I should clearly state that I acknowledge that biases against women are real, that I strongly support equal opportunity, and that I think there must be more women in technology. I also have to state that I think the author of “the memo” was well-meaning, had some well-argued, research-backed points, and should not be ostracized.

But I want to “rant” about the quality of the debate. On one side we have conservatives who are throwing themselves into the defense of the fired googler, insisting that liberals are banning conservative points of view, that it is normal to have so few women in tech and that everything is actually okay, or even that women are inferior. On the other side we have triggered liberals who are ready to shout “discrimination” and “harassment” at anything that resembles an attempt to claim anything different from total and absolute equality, in many cases using a classic “strawman” argument (e.g. “he’s saying women should not work in tech, he’s obviously wrong”).

Everyone seems to be too eager to take sides and issue a verdict on who’s right and who’s wrong, to blame the other side for all related and unrelated woes, and, while doing that, to exhibit a huge amount of bias. If the debate is about that, we’d better shut it down as soon as possible, as it’s not going to lead anywhere. No matter how much conservatives want “a debate”, and no matter how much liberals want to advance equality. Oh, and by the way – this “conservatives” vs “liberals” framing is a false dichotomy. Most people hold a somewhat sensible stance in between. But let’s get to the actual issue:

Women are underrepresented in STEM (Science, technology, engineering, mathematics). That is a fact everyone agrees on and is blatantly obvious when you walk in any software company office.

Why is that the case? The whole debate revolved around biological and social differences, some of which are probably even true – that women value job flexibility more than being promoted or getting a higher salary, that they are more neurotic (on average), that they are less confident, that they are more empathic and so on. These differences have been studied and documented, and as much as I have my reservations about psychology studies (so much so that even meta-analyses are shown by meta-meta-analyses to be flawed) and social science in general, there seems to be a consensus there (by the way, it’s a shame that Gizmodo removed all the scientific references when they first published “the memo”). But that is not the issue. As it has been pointed out, there’s equal applicability of male and female “inherent” traits when working with technology.

Why are we talking about “technology” and not “mining and construction”, as many will ask? Let’s cut that argument once and for all – mining and construction are blue-collar jobs that have a high chance of being automated in the near future and are in decline. The problem that we’re trying to solve is how to make the dominant profession of the future – information technology – one of equal opportunity. Yes, it’s a bold claim, but software is going to be everywhere and the industry will grow. This is why it’s so important to discuss it, not because we are developers and we are somewhat affected by that.

So, there has been extended research on the matter, and the reasons are – surprise – complex and intertwined and there is no simple issue that, once resolved, will unlock the path of women to tech jobs.

What would diversity give us and why should we care? Let’s assume for a moment we don’t care about equal opportunity and we are right-leaning, conservative people. Well, imagine you have a growing business and you need to hire developers. What would you prefer – having fewer or more people to choose from? Having fewer or more diverse skills (technical and social) on the job market? The answer is obvious. The more people – regardless of their gender, race, whatever – there are on the job market, the better for businesses.

So I guess we’ve agreed on the two points so far – that women are underrepresented, and that it’s better for everyone if there are more people with technical skills on the job market, which includes more women.

The “final” question is – how?

And this question seems to be nowhere in the discussion. Instead, we are going in circles with irrelevant arguments, trying to show either that we’ve read more scientific papers than others, that we are more liberal than others, or that we are more pro free speech.

Back to “how” – in Bulgaria we have a social meme: “I don’t know what is the right way, but the way you are doing it is NOT the right way”. And much of the underlying sentiment of “the memo” is similar – that google should stop doing some of the stuff it is doing about diversity, or do them differently (but doesn’t tell us how exactly). Hiring biases, internal programs, whatever, seem to bother him. But this is just talking about the surface of the problem. These programs are correcting something that remains hidden in “the memo”.

Google, on their diversity page, say that 20% of their tech employees are women. At the same time, in another diversity section, they claim “18% of CS graduates are women”. So, I guess, job done – they’ve reached the maximum possible diversity. They’ve hired as many women in tech as CS graduates there are. Anything more than that, even if it doesn’t mean they’ll hire worse developers, will leave the rest of the industry with less women. So, sure, 50/50 in Google would sound cool, but the industry average will still be bad.

And that’s the actual, underlying reason that we should have already arrived at, and where we should’ve started discussing the “how”. Girls do not see STEM as a thing for them. Our biases are projected onto young girls, culminating in a “this is not for girls” mantra. No matter how diverse our hiring policies are, if we don’t address the issue at a much earlier stage, we aren’t getting anywhere.

In schools and even kindergartens we need to have an inclusive environment where “this is not for girls” is frowned upon. We should not discourage girls from liking math, or make math sound uncool and “hard for girls” (in my biased world I actually know more women mathematicians than men). This comic seems to be about a different topic (gender-specific toys), but it’s actually not about toys – it’s about what is considered (stereo)typical for a girl to do. And most of these biases are unconscious, and come from all around us (school, TV, outdoor ads, people on the street, relatives, etc.), and it takes effort to confront them.

To do that, we need policy decisions. We need to lobby education departments and ministries to encourage girls more in the STEM direction (and don’t worry, they’ll be good at it). By the way, guess what – Google’s diversity program is not just about hiring more women; it actually includes education policies, with initiatives like “influencing perception about computer science”, “getting more girls to code”, and scholarships.

Let’s discuss the education policies, the path to getting 40-50% of CS graduates to be female, and before that – more girls in schools with a technical focus, and ultimately – how to get society to stop perceiving technology and science as “not for girls”. Let each girl decide on her own. All the other debates are short-sighted and beside the point. Will biological differences matter then? They probably will – but not significantly enough to justify a high gender imbalance.

I am no expert in education policies and I don’t know what will work and what won’t. There is research on the matter that we should look at, and maybe argue about it. Everything else is wasted keystrokes.

The post We Are Not Having a Productive Debate About Women in Tech appeared first on Bozho's tech blog.

# Piracy Narrative Isn’t About Ethics Anymore, It’s About “Danger”

Post Syndicated from Andy original https://torrentfreak.com/piracy-narrative-isnt-about-ethics-anymore-its-about-danger-170812/

Over the years there have been almost endless attempts to stop people from accessing copyright-infringing content online. Campaigns have come and gone and almost two decades later the battle is still ongoing.

Early on, when panic enveloped the music industry, the campaigns centered around people getting sued. Grabbing music online for free could be costly, the industry warned, while parading the heads of a few victims on pikes for the world to see.

Periodically, however, the aim has been to appeal to the public’s better nature. The idea is that people essentially want to do the ‘right thing’, so once they understand that largely hard-working Americans are losing their livelihoods, people will stop downloading from The Pirate Bay. For some, this probably had the desired effect but millions of people are still getting their fixes for free, so the job isn’t finished yet.

In more recent years, notably since the MPAA and RIAA had their eyes blacked in the wake of SOPA, the tone has shifted. In addition to educating the public, torrent and streaming sites are increasingly being painted as enemies of the public they claim to serve.

Several studies, largely carried out on behalf of the Digital Citizens Alliance (DCA), have claimed that pirate sites are hotbeds of malware, baiting consumers in with tasty pirate booty only to offload trojans, viruses, and God-knows-what. These reports have been ostensibly published as independent public interest documents but this week an advisor to the DCA suggested a deeper interest for the industry.

Hemanshu Nigam is a former federal prosecutor, ex-Chief Security Officer for News Corp and Fox Interactive Media, and former VP Worldwide Internet Enforcement at the MPAA. In an interview with Deadline this week, he spoke about alleged links between pirate sites and malware distributors. He also indicated that warning people about the dangers of pirate sites has become Hollywood’s latest anti-piracy strategy.

“The industry narrative has changed. When I was at the MPAA, we would tell people that stealing content is wrong and young people would say, yeah, whatever, you guys make a lot of money, too bad,” he told the publication.

“It has gone from an ethical discussion to a dangerous one. Now, your parents’ bank account can be raided, your teenage daughter can be spied on in her bedroom and extorted with the footage, or your computer can be locked up along with everything in it and held for ransom.”

Nigam’s stance isn’t really a surprise since he’s currently working for the Digital Citizens Alliance as an advisor. In turn, the Alliance is at least partly financed by the MPAA. There’s no suggestion whatsoever that Nigam is involved in any propaganda effort, but recent signs suggest that the DCA’s work in malware awareness is more about directing people away from pirate sites than protecting them from the alleged dangers within.

That being said and despite the bias, it’s still worth giving experts like Nigam an opportunity to speak. Largely thanks to industry efforts with brands, pirate sites are increasingly being forced to display lower-tier ads, which can be problematic. On top, some sites’ policies mean they don’t deserve any visitors at all.

In the Deadline piece, however, Nigam alleges that hackers have previously reached out to pirate websites offering $200 to $5,000 per day “depending on the size of the pirate website” to have the site infect users with malware. If true, that’s a serious situation and people who would ordinarily use ‘pirate’ sites would definitely appreciate the details.

For example, to which sites did hackers make this offer and, crucially, which sites turned down the offer and which ones accepted?

It’s important to remember that pirates are just another type of consumer and they would boycott sites in a heartbeat if they discovered they’d been paid to infect them with malware. But, as usual, the claims are extremely light in detail. Instead, there’s simply a blanket warning to stay away from all unauthorized sites, which isn’t particularly helpful.

In some cases, of course, operational security will prevent some details coming to light but without these, people who don’t get infected on a ‘pirate’ site (the vast majority) simply won’t believe the allegations. As the author of the Deadline piece pointed out, it’s a bit like Reefer Madness all over again.

The point here is that without hard independent evidence to back up these claims – reports listing sites alongside the malware they’re supposed to have spread, and when – few people will respond to perceived scaremongering. Free content trumps a few distant worries almost every time, whether that involves malware or the threat of a lawsuit.

It’ll be up to the DCA and their MPAA paymasters to consider whether the approach is working but thus far, not even having government heavyweights on board has helped.

Earlier this year the DCA launched a video campaign, enlisting 15 attorneys general to publish their own anti-piracy PSAs on YouTube. Thus far, interest has been minimal, to say the least.

At the time of writing, the 15 PSAs have 3,986 views in total, with 2,441 of those coming from a single video by Wisconsin Attorney General Brad Schimel. Despite that relative success, even his video got slammed with 2 upvotes and 127 downvotes.

A few of the other videos have a couple of hundred views each, but more than half have fewer than 70. Perhaps most worryingly for the DCA, apart from the Schimel PSA, none have any upvotes at all, only downvotes. It’s unclear who the viewers were, but it seems reasonable to conclude they weren’t entertained.

The bottom line is nobody likes malware or having their banking details stolen but yet again, people who claim to have the public interest at heart aren’t actually making a difference on the ground. It could be argued that groups advocating online safety should be publishing guides on how to stay protected on the Internet period, not merely advising people to stay away from certain sites.

But of course, that wouldn’t achieve the goals of the MPAA Digital Citizens Alliance.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

# Yet more reasons to disagree with experts on nPetya

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/yet-more-reasons-to-disagree-with.html

In WW II, they looked at planes returning from bombing missions that were shot full of holes. The natural conclusion was to add more armor to the sections that were damaged, to protect them in the future. But wait, said the statisticians. The original damage is likely spread evenly across the plane. Damage on returning planes indicates where a plane could be hit and still return. The undamaged areas are where the planes that didn’t return were hit. Thus, it’s the undamaged areas you need to protect.

This is called survivorship bias.
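The armor argument can be made concrete with a toy simulation (an illustration of mine, not from the post; the zone names and the rule that critical hits down the plane are hypothetical):

```python
import random

random.seed(1)
ZONES = ["wings", "fuselage", "tail", "engine", "cockpit"]
CRITICAL = {"engine", "cockpit"}  # assumption: a hit here downs the plane

all_hits = {z: 0 for z in ZONES}       # damage across the whole fleet
returned_hits = {z: 0 for z in ZONES}  # damage we actually get to inspect

for _ in range(100_000):
    hit = random.choice(ZONES)  # damage is spread evenly overall
    all_hits[hit] += 1
    if hit not in CRITICAL:
        returned_hits[hit] += 1  # only planes with non-critical hits come home

# Fleet-wide damage is uniform, yet the returning sample shows
# zero hits in exactly the zones that most need armor.
print(returned_hits["engine"], returned_hits["cockpit"])  # 0 0
```

The inspected sample is systematically missing the fatal hits, which is precisely the bias the statisticians corrected for.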
Many experts are making the same mistake with regards to the nPetya ransomware.

I hate to point this out, because they are all experts I admire and respect, especially @MalwareJake, but it’s still an error. An example is this tweet:

The context of this tweet is the discussion of why nPetya was well written with regards to spreading, but full of bugs with regards to collecting the ransom. The conclusion, therefore, is that it wasn’t intended to be ransomware, but was intended to simply be a “wiper”, to cause destruction.

But this is just survivorship bias. If nPetya had been written the other way, with excellent ransomware features and poor spreading, we would not now be talking about it. Even that initial seeding with the trojaned MeDoc update wouldn’t have spread it far enough.

In other words, all malware samples we get are good at spreading, either on their own or because the creator did a good job seeding them. We never see the ones that didn’t spread.

With regards to nPetya, a lot of experts are making this claim: since it spread so well but had hopelessly crippled ransomware features, destruction must have been the intent all along. Yet, as we see from survivorship bias, none of us would’ve seen nPetya had it not been for the spreading feature.

# NonPetya: no evidence it was a "smokescreen"

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/06/nonpetya-no-evidence-it-was-smokescreen.html

Many well-regarded experts claim that the not-Petya ransomware wasn’t “ransomware” at all, but a “wiper” whose goal was to destroy files, without any intent at letting victims recover their files. I want to point out that there is no real evidence of this.

Certainly, things look suspicious. For one thing, it certainly targeted the Ukraine. For another, it made several mistakes that prevent the attackers from ever decrypting drives: their email account was shut down, and the malware corrupts the boot sector.

But these things aren’t evidence, they are problems. They are things needing explanation, not things that support our preferred conspiracy theory.

The simplest, Occam’s Razor explanation is that they were simple mistakes. Such mistakes are common in ransomware. We think of virus writers as professional software developers who thoroughly test their code. Decades of evidence show the opposite: such software is of poor quality, with shockingly bad bugs.

It’s true that, effectively, nPetya is a wiper. Matthieu Suiche does a great job describing one flaw that prevents it from working. @hasherezade does a great job explaining another flaw. But the best explanation isn’t that this is intentional. Even if these bugs didn’t exist, it’d still be a wiper if the perpetrators simply ignored the decryption requests. They need not intentionally make the decryption fail.

Thus, the simpler explanation is that it’s simply a bug. Ransomware authors test the bits they care about, and test less well the bits they don’t. It’s quite plausible to believe that just before shipping the code, they’d add a few extra features, and forget to regression test the entire suite. I mean, I do that all the time with my code.

Some have pointed to the sophistication of the code as proof that such simple errors are unlikely. This isn’t true. While it’s more sophisticated than WannaCry, it’s about average for the current state of the art in ransomware. What people point to, such as the Petya base or using PsExec to spread throughout a Windows domain, is already at least a year old.

Indeed, the use of PsExec itself is a bit clumsy, when the code for doing the same thing is already public. It’s just a few calls to basic Windows networking APIs. A sophisticated virus would do this itself, rather than clumsily use PsExec.

Infamy doesn’t mean skill. People keep making the mistake of assuming that the more widespread something is in the news, the more skill, and the more of a “conspiracy”, there must be behind it. This is not true. Virus/worm writers often do newsworthy things by accident. Indeed, the history of worms, starting with the Morris Worm, has been one of things running out of control, beyond their authors’ expectations.

What makes nPetya newsworthy isn’t the EternalBlue exploit or the wiper feature. Instead, the creators got lucky with MeDoc. The software is used by every major organization in the Ukraine, and at the same time, their website was horribly insecure — laughably insecure. Furthermore, its auto-update feature didn’t check cryptographic signatures. No hacker can plan for this level of widespread incompetence — it’s just extreme luck.

Thus, the effect of bumbling around is something that hit the Ukraine pretty hard, but it’s not necessarily the intent of the creators. It’s like how the Slammer worm hit South Korea pretty hard, or how the Witty worm hit the DoD pretty hard. These things look “targeted”, especially to the victims, but it was by pure chance (provably so, in the case of Witty).

Certainly, MeDoc was targeted. But then, targeting a single organization is the norm for ransomware. They have to do it that way, giving each target a different Bitcoin address for payment. That it then spread to the entire Ukraine, and further, is the sort of thing that typically surprises worm writers.

Finally, there’s little reason to believe that there needs to be a “smokescreen”. Russian hackers are targeting the Ukraine all the time. Whether Russian hackers are to blame for “ransomware” vs. “wiper” makes little difference.

Conclusion

We know that Russian hackers are constantly targeting the Ukraine. Therefore, the theory that this was nPetya’s goal all along – to destroy Ukraine’s computers – is a good one.

Yet, there’s no actual “evidence” of this. nPetya’s issues are just as easily explained by normal software bugs. The smokescreen isn’t needed. The boot record bug isn’t needed. The single email address that was shut down isn’t significant, since half of all ransomware uses the same technique.

The experts who disagree with me are really smart/experienced people who you should generally trust. It’s just that I can’t see their evidence.

Update: I wrote another blogpost about “survivorship bias”, refuting the claim by many experts about the sophistication of the spreading feature.

Update: a comment asks “why is there no Internet spreading code?”. The answer is “I don’t know”, but unanswerable questions aren’t evidence of a conspiracy. “Why aren’t there any stars in the background?” isn’t proof the moon landings are fake just because you can’t answer the question. One guess is that you never want ransomware to spread that far, until you’ve figured out how to get payment from so many people.

# From Idea to Launch: Getting Your First Customers

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/how-to-get-your-first-customers/

After deciding to build an unlimited backup service and developing our own storage platform, the next step was to get customers and feedback. Not all customers are created equal. Let’s talk about the types, and when and how to attract them.

## How to Get Your First Customers

First Step – Don’t Launch Publicly

Launch when you’re ready for the judgments of people who don’t know you at all. Until then, don’t launch. Sign up users and customers that you know, that you can trust to cut you some slack (while providing you feedback), or at minimum those for whom you can set expectations. For months the Backblaze website was a single page with no ability to get the product and minimal info on what it would be. This is not meant to counter the Lean Startup ‘iterate quickly with customer feedback’ advice. Rather, it is an acknowledgement that different types of feedback are required at different development stages.

Sign Up Your Friends
We knew all of our first customers; they were friends, family, and previous co-workers. Many knew what we were up to and were excited to help us. No magic marketing or tech savviness was required to reach them – we just asked that they try the service. We asked them to provide us feedback on their experience and collected it through email and conversations. While the feedback wasn’t unbiased, it was nonetheless wide-ranging, real, and often insightful. These people were willing to spend time carefully thinking about their feedback and delving deeper into the conversations.

Broaden to Beta

Unless you’re famous or your service costs $1 million per customer, you’ll probably need to expand quickly beyond your friends to build a business – and to get broader feedback. Our next step was to broaden the customer base to beta users. Opening up the service in beta provides three benefits:

1. Air cover for the early warts. There are going to be issues, bugs, unnecessarily complicated user flows, and poorly worded text. Beta tells people, “We don’t consider the product ‘done’ and you should expect some of these issues. Please be patient with us.”
2. A request for feedback. Some people always provide feedback, but beta communicates that you want it.
3. An awareness opportunity. Opening up in beta provides an early (but not only) opportunity to have an announcement and build awareness.

Pitching Beta to Press

Not all press cares about, or is even willing to cover, beta products. Much of the mainstream press wants to write about services that are fully live, have scale, and are important in the marketplace. However, there are a number of sites that like to cover the leading edge – and that means covering betas. Techcrunch, Ars Technica, and SimpleHelp covered our initial private beta launch. I’ll go into the details of how to work with the press to cover your announcements in a post next month.

Private vs. Public Beta

Both private and public beta provide all three of the benefits above. The difference between the two is that private betas are much more controlled, whereas public ones bring in more users. But this isn’t an either/or – I recommend doing both.

Private Beta

For our original beta in 2008, we decided that we were comfortable with about 1,000 users subscribing to our service. That would provide us with a healthy amount of feedback and get some early adoption, while not overwhelming us or our server capacity, and equally important not causing cash flow issues from having to buy more equipment.
So we decided to limit the sign-up to only the first 1,000 people who signed up; then we would shut off sign-ups for a while. But how do you even get 1,000 people to sign up for your service? In our case, get some major publications to write about our beta. (Note: In a future post I’ll explain exactly how to find and reach out to writers. Sign up to receive all of the entrepreneurial posts in this series.)

Public Beta

For our original service (computer backup), we did not have a public beta; but when we launched Backblaze B2, we had a private and then a public beta. The private beta allowed us to work out early kinks, while the public beta brought us a more varied set of use cases. In public beta, there is no cap on the number of users that may try the service. While this is a first-class problem to have, if your service is flooded and stops working, it’s still a problem. Think through what you will do if that happens. In our early days, when our system could get overwhelmed by volume, we had a static web page hosted with a different registrar that wouldn’t let customers sign up but would tell them when our service would be open again. When we reached a critical volume level we would redirect to it in order to at least provide status for when we could accept more customers.

Collect Feedback

Since one of the goals of betas is to get feedback, we made sure that we had our email addresses clearly presented on the site so users could send us thoughts. We were most interested in broad qualitative feedback on users’ experience, so all emails went to an internal mailing list that would be read by everyone at Backblaze. For our B2 public and private betas, we also added an optional short survey to the sign-up process. In order to be considered for the private beta you had to fill the survey out, though we found that 80% of users continued to fill out the survey even when it was not required.
This survey had both closed-ended questions (“how much data do you have?”) and open-ended ones (“what do you want to use cloud storage for?”). BTW, despite us getting a lot of feedback now via our support team, Twitter, and marketing surveys, we are always open to more – you can email me directly at gleb.budman {at} backblaze.com.

Don’t Throw Away Users

Initially our backup service was available only on Windows, but we had an email sign-up list for people who wanted it for their Mac. This provided us with a sense of market demand and a ready list of folks who could be beta users and early adopters when we had a Mac version. Have a service targeted at doctors but lawyers are expressing interest? Capture that.

## Product Launch

When

The first question is “when” to launch. Presuming your service is in ‘public beta’, what is the advantage of moving out of beta and into a “version 1.0”, “gold”, or “public availability”? That depends on your service and customer base. Some services fly through public beta. Gmail, on the other hand, was (in)famous for being in beta for 5 years, despite having over 100 million users. The term beta says to users, “give us some leeway, but feel free to use the service”. That’s fine for many consumer apps and will have near zero impact on them. However, services aimed at businesses and government will often not be adopted with a beta label, as enterprise customers want to know the company feels the service is ‘ready’. While Backblaze started out as a purely consumer service, because it was a data backup service, it was important for customers to trust that the service was ready.

No product is bug-free. But from a product readiness perspective, the nomenclature should also be a reflection of the quality of the product. You can launch a product with one feature that works well out of beta. But a product with fifty features on which half the users will bump into problems should likely stay in beta.
The customer feedback, surveys, and your own internal testing should guide you in determining this quality during the beta. Be careful about “we’ve only seen that one time” or “I haven’t been able to reproduce that on my machine”; those issues are likely to scale with customers when you launch.

### How

Launching out of beta can be as simple as removing the beta label from the website/product. However, this can be a great time to reach out to press, write a blog post, and send an email announcement to your customers. Consider thanking your beta testers somehow: can they get some feature turned on for free, an extension of their trial, or premium support? If nothing else, remember to thank them for their feedback. Users who signed up during your beta are likely the ones who will propel your service. They had the need and interest to both be early adopters and deal with bugs. They are likely the key to getting 1,000 true fans.

## The Beginning

The title of this post was “Getting Your First Customers” because, while getting to launch may feel like the peak of your journey when you’re pre-launch, it really is just the beginning. It’s a step along the journey of building your business. If your launch is wildly successful, enjoy it and work to build on the momentum, but don’t lose track of building your business. If your launch is a dud, go out for a coffee with your team, say “well, that sucks”, and then get back to building your business. You can learn a tremendous amount from your early customers, and they can become your biggest fans, but the success of your business will depend on what you continue to do in the months and years after your launch.

The post From Idea to Launch: Getting Your First Customers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

# 2017-05-09 Biases and debugging

Post Syndicated from Vasil Kolev original https://vasil.ludost.net/blog/?p=3352

Something tangential.
These days at the office, during some activity, we discussed the following puzzle: “We have a band of pirates (N in total: a captain and N-1 other members) who want to split a treasure of 100 coins. The pirates have a strict linear hierarchy (everyone knows who comes after whom). The split works like this: the current captain proposes a distribution and a vote is held; if it gets half or more of the votes it is accepted; if not, the captain is killed and the next in line proposes a distribution. The question is what the captain should propose so that everyone agrees, assuming the entire crew are perfect logicians. The pirates are also bloodthirsty: if, by voting against, a pirate stands to get the same amount of money anyway, he will still prefer to kill the captain. Everyone is also greedy, and the captain’s goal is to keep as much as possible for himself.” (The puzzle does not come from economics, even though there, too, everyone is a perfect logician, which is why so many of its theories fail.)

The solution to the puzzle is interesting (more on it below), but it was far more interesting how hard it turned out to be for me to solve. My first idea was simply to hand out decreasing amounts to the upper half of the pirates, since that is the standard way these things go. That turned out not to work. Then I was reminded (of something I should have figured out myself) that puzzles like this are solved backwards, by induction, so we started with the question of what happens if there are only two pirates. My first answer was that the other crew member will always want to kill the captain, since that way he takes everything. But it turns out the captain has a vote too, so if only two remain, the captain takes everything: the distribution is 100 for him and nothing for the other.

What follows if there are three? I said: fine, give one pirate 1 coin, the other 2, and the rest to the captain, because if only two remain, the last pirate gets nothing; the captain votes for himself, and whether the second pirate is for or against doesn’t matter.
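This kind of backward induction is mechanical enough to hand to a computer. Here is a short sketch (my illustration of the stated rules, not code from anywhere in particular): the captain votes for himself, a plan passes with at least half of the votes, and the cheapest way to buy the missing votes is to offer one coin more than a pirate would get in the next round.

```python
import math

def split(n, total=100):
    """Payouts for a crew of n, listed from the lowest-ranked pirate
    up to the captain (last element).  A plan passes with at least
    half of the n votes, and the captain votes for his own plan."""
    if n == 1:
        return [total]
    nxt = split(n - 1, total)          # payouts if this captain is killed
    bribes = math.ceil(n / 2) - 1      # votes needed besides the captain's own
    # Cheapest strategy: bribe the pirates who would get the least in the
    # next round, offering each one coin more than they'd otherwise get.
    cheapest = sorted(range(n - 1), key=lambda i: nxt[i])[:bribes]
    payout = [0] * n
    for i in cheapest:
        payout[i] = nxt[i] + 1
    payout[-1] = total - sum(payout)
    return payout

print(split(2))  # [0, 100]
print(split(3))  # [1, 0, 99]
print(split(4))  # [0, 1, 0, 99]
```

Running it for small crews reproduces the results discussed here, which is a handy check on hand-waved reasoning.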
Only there’s no need to give the middle pirate anything, since we don’t care about his opinion, so the correct distribution is actually 1, 0, 99. My bias showed itself again here: I kept expecting some kind of proportion. Long story short, the next iteration is 0, 1, 0, 99, because if the crew refuses, then on the next round the second-to-last pirate gets nothing if he refuses again, and the other two pirates’ opinions don’t matter. I think the pattern is clear 🙂

The bad part is how strongly I was influenced by the bias I have accumulated from reading about distributions in real life: what pirates are actually like, how there are no perfect logicians (and in reality nobody would reason this way; people would aim for something that seems fair to them), how this completely ignores the political option of the N/2+1 pirates at the bottom always voting against until everything comes down to them and then splitting it evenly, and all sorts of similar real-life scenarios. If the example had been about anything else (anything not involving people), it would probably have been much easier to look at it abstractly.

Which is one more argument in support of my idea that it is much easier to debug something foreign (often something you have never even seen) than something you work on almost constantly. Over 90% of problems (a figure based on no statistics whatsoever, just a feeling) are simple enough to be solved with standard methods and don’t require deep knowledge of the system (half my life has been spent debugging things I don’t understand, successfully far more often than not), and if/when I do the debugging workshop (which many people keep nagging me about), it will probably feature problems I am not familiar with either, so that it’s actually fun…

# Get wordy with our free resources

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/get-wordy-with-our-free-resources/

Here at the Raspberry Pi Foundation, we take great pride in the wonderful free resources we produce for you to use in classes, at home, and in coding clubs.
We publish them under a Creative Commons licence, and they’re an excellent way to develop your digital-making skills. With yesterday being World Poetry Day (I’m a day late to the party. Shhh), I thought I’d share some wordy-themed [wordy-themed? Are you sure? – Ed] resources for you all to have a play with.

## Shakespearean Insult Generator

Have you ever found yourself lost for words just when the moment calls for your best comeback? With the Shakespearean Insult Generator, your mumbled retorts to life’s awkward situations will have the lyrical flow of our nation’s most beloved bard.

Thou sodden-witted lord! Thou hast no more brain than I have in mine elbows!

Not only will the generator provide you with hours of potty-mouthed fun, it’ll also teach you how to read and write data in CSV format using Python, how to manipulate lists, and how to choose a random item from a list.

## Talk like a Pirate

Ye’ll never be forced t’walk the plank once ye learn how to talk like a scurvy ol’ pirate… yaaaarrrgh!

The Talk like a Pirate speech generator teaches you how to use jQuery to make live updates on a web page, how to write regular expressions to match patterns and words, and how to create a web page that takes input text and outputs results. Once you’ve mastered those skills, you can use them to create other speech generators. How about one that turns certain words into their slang counterparts? Or one that changes words into txt speak – laugh into LOL, and see you into CU?

## Secret Agent Chat

So you’ve already mastered insults via list manipulation and random choice, and you’ve converted words into hilarious variations through matching word patterns and input/output. What’s next? The Secret Agent Chat resource shows you how random numbers can be used to encrypt messages, how iteration can be used to encrypt individual characters, and, to make sure nobody cracks your codes, the importance of keeping your keys secret.
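To give a flavour of the idea, here is a minimal Python sketch (an illustration only, not the code from the resource itself): a shared secret key seeds a random number generator, and each character is shifted by the next “random” number, so anyone holding the same key can regenerate the shifts and undo them.

```python
import random

def encrypt(message, key):
    """Shift each character by a key-seeded 'random' amount.
    Seeding with the shared key means the recipient can regenerate
    the exact same shift sequence to reverse the encryption.
    (Illustration only: Python's random is not cryptographically secure.)"""
    rng = random.Random(key)
    return [(ord(ch) + rng.randint(0, 255)) % 0x110000 for ch in message]

def decrypt(numbers, key):
    """Regenerate the same shift sequence and subtract it back out."""
    rng = random.Random(key)
    return "".join(chr((n - rng.randint(0, 255)) % 0x110000) for n in numbers)

secret = encrypt("meet at dawn", key=1337)
print(decrypt(secret, key=1337))  # meet at dawn
```

Without the key, the numbers alone tell an eavesdropper very little, which is exactly why keeping the key secret matters.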
And with these new skills under your belt, you can write and encrypt messages between you and your friends, ensuring that nobody will be able to read your secrets.

## Unlocking your transferable skill set

One of the great things about building projects like these is the way they expand your transferable skill set. When you complete a project using one of our resources, you gain abilities that can be transferred to other projects and situations. You might never need a ‘Talk like a Pirate’ speech generator, but you might need a way to detect and alter certain word patterns in a document. And while you might be able to coin your own colourful insults, making the Shakespearean Insult Generator gives you the ability to select words from lists at random, allowing you to write a program that picks names for sports or quiz teams without bias.

All of our resources are available for free on our website, and we continually update them to offer you more opportunities to work on your skills, whatever your age and experience. Have you built anything from our resources? Let us know in the comments.

The post Get wordy with our free resources appeared first on Raspberry Pi.

# Canada Rejects Flawed and One-Sided “Piracy” Claims From US Govt.

Post Syndicated from Ernesto original https://torrentfreak.com/canada-rejects-flawed-and-one-sided-piracy-claims-from-us-govt-170310/

Every year the Office of the US Trade Representative (USTR) releases an updated version of its Special 301 Report, calling out other nations for failing to live up to U.S. IP enforcement standards. In recent years Canada has been placed on this “watch list” many times, for a variety of reasons. That the country fails to properly deter piracy is one of the prime complaints circulated by the U.S. Government. Even after Canada revamped its copyright law, including a mandatory piracy notice scheme and extending the copyright term to 70 years after publication, the allegations didn’t go away in 2016.
Now, a year later, new hearings are underway to discuss the 2017 version of the report. Fearing repercussions, several countries have joined stakeholders to defend their positions. However, Canada was notably absent.

While the Canadian Government hasn’t made a lot of fuss in the media, a confidential memo, obtained by University of Ottawa professor Michael Geist, shows that it has little faith in the USTR report.

“Canada does not recognize the validity of the Special 301 and considers the process and the Report to be flawed,” the Government memo reads.

“The Report fails to employ a clear methodology and the findings tend to rely on industry allegations rather than empirical evidence and objective analysis.”

The document in question was prepared for Minister Mélanie Joly last year, after the 2016 report was published. It points out, in no uncertain terms, that Canada doesn’t recognize the validity of the 301 process, and it includes several talking points for the media.

Excerpt from the note

This year, rightsholders have once again labeled Canada a “piracy haven,” so it wouldn’t be a big surprise if the country is listed again. Based on the Canadian Government’s lack of response, it is likely that the Northern neighbor still has little faith in the report.

TorrentFreak spoke with law professor Michael Geist, who has been very critical of the USTR’s 301 process in the past. He believes that Canada is doing the right thing and characterizes the yearly 301 report as biased.

“I think the Canadian government is exactly right in its assessment of the Special 301 report process. It is little more than a lobbying document and the content largely reflects biased submissions from lobby groups,” Geist tells TorrentFreak.

In a recent article the professor explains that, contrary to claims from entertainment industry groups, Canada now has some of the toughest anti-piracy laws in the world. But these rightsholder groups want more.
Some of the requests, such as those put forward by the industry group IIPA, even go beyond what the United States itself is doing, or far beyond internationally agreed standards.

“[T]he submissions frequently engage in a double standard with the IIPA lobbying against fair use in other countries even though the U.S. has had fair use for decades,” Geist says.

“It also often calls on countries to implement rules that go far beyond their international obligations such as the demands that countries adopt a DMCA-style approach for the WIPO Internet treaties even though those treaties are far more flexible in their requirements.”

This critique of the USTR’s annual report is not new, as its allegedly biased nature has been discussed by various experts in the past. However, coming from a country, Canada’s rejection will have an impact, and Professor Geist hopes that other nations will follow suit.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

# Some notes on the RAND 0day report

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/03/some-notes-on-rand-0day-report.html

The RAND Corporation has a research report on the 0day market [ * ]. It’s pretty good. They talked to all the right people. It should be considered the seminal work on the issue. They’ve got the pricing about right ($1 million for a full-chain iPhone exploit, but closer to $100k for others). They’ve got the stats about right (a 5% chance somebody else will discover an exploit). Yet they’ve got some problems, namely phrasing the debate the way activists want, rather than taking a neutral view of it.

The report frequently uses the word “stockpile”. This is a biased term favored by activists. According to the dictionary, it means: a large accumulated stock of goods or materials, especially one held in reserve for use at a time of shortage or other emergency.
Activists paint the picture that the government (NSA, CIA, DoD, FBI) buys 0day to hold in reserve in case it later needs them. If that were the case, then it would seem reasonable that it’s better to disclose/patch the vuln than let it grow moldy in a cyberwarehouse somewhere.

But that’s not how things work. The government primarily buys vulns it has immediate use for. Almost all vulns it buys are used within 6 months. Most vulns in its “stockpile” have been used in the previous year. These cyberweapons are not in a warehouse, but in active use on the front lines. This is top secret, of course, so people assume it’s not happening. They hear about no cyber operations (except Stuxnet), so they assume such operations aren’t occurring. Thus, they build up the stockpiling assumption rather than the active-use assumption.

If RAND wanted to create an even more useful survey, they should figure out how many thousands of times per day our government (NSA, CIA, DoD, FBI) exploits 0days. They should characterize who it targets (e.g. terrorists, child pornographers), its success rate, and how many people it has killed based on 0days. It’s this data, not patching, that is at the root of the policy debate.

That 0days are actively used determines pricing. If the government doesn’t have immediate use for a vuln, it won’t pay much for it, if anything at all. Conversely, if the government has an urgent need for a vuln, it’ll pay a lot.

Let’s say you have a remote vuln for Samsung TVs. You go to the NSA and offer it to them. They tell you they aren’t interested, because they see no near-term need for it. Then a year later, spies reveal ISIS has stolen a truckload of Samsung TVs, put them in all their meeting rooms, and hooked them to the Internet for video conferencing. The NSA then comes back to you and offers $500k for the vuln.

Likewise, the number of sellers affects the price. If you know they desperately need the Samsung TV 0day, but they are only offering $100k, then it likely means that there’s another seller also offering such a vuln.

That’s why iPhone vulns are worth $1 million for a full-chain exploit, from browser to persistence. They use it a lot; it’s a major part of ongoing cyber operations. Each time Apple upgrades iOS, the change breaks part of the existing chain, and the government is keen on getting a new exploit to fix it. They’ll pay a lot to the first vuln seller who can give them a new exploit.

Thus, there are three prices the government is willing to pay for an 0day (the value it provides to the government):

• the price for an 0day they will actively use right now (high)
• the price for an 0day they’ll stockpile for possible use in the future (low)
• the price for an 0day they’ll disclose to the vendor to patch (very low)

That these are different prices is important to the policy debate. When activists claim the government should disclose the 0days it acquires, they are ignoring the price the 0day was acquired for. Since the government actively uses the 0day, it is acquired at a high price, with its “use” value far higher than its “patch” value. It’s absurd to argue that the government should then immediately discard that money, paying “use value” prices for “patch” results.

If the policy becomes that the NSA/CIA should disclose/patch the 0days they buy, it doesn’t mean business as usual for acquiring vulns. It instead means they’ll stop buying 0day.

In other words, “patching 0day” is not an outcome on either side of the debate. Either the government buys 0day to use, or it stops buying 0day. In neither case does patching happen.

The real argument is whether the government (NSA, CIA, DoD, FBI) should be acquiring, weaponizing, and using 0day in the first place. The activists’ position demands that we unilaterally disarm our military, intelligence, and law enforcement, preventing them from using 0days against our adversaries while our adversaries continue to use 0days against us.

That’s the gaping hole in both the RAND paper and most news reporting of this controversy. They characterize the debate the way activists want, as if the only question is the value of patching. They avoid talking about unilateral cyberdisarmament, even though that’s the consequence of the policy they are advocating. They avoid comparing the value of 0days to our country for active use (high) with their value to our country for patching (very low).

## Conclusion

It’s nice that the RAND paper studied the value of patching and confirmed it’s low, that only around 5% of our cyber-arsenal is likely to be found by others. But it’d be nice if they also looked at the point of view of those actively using 0days on a daily basis, rather than phrasing the debate the way activists want.

# A note about "false flag" operations

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/03/a-note-about-false-flag-operations.html

There’s nothing in the CIA #Vault7 leaks that calls into question strong attribution, like Russia being responsible for the DNC hacks. On the other hand, it does call into question weak attribution, like North Korea being responsible for the Sony hacks.

There are really two types of attribution. Strong attribution is a preponderance of evidence that would convince an unbiased, skeptical expert. Weak attribution is flimsy evidence that confirms what people are predisposed to believe.

The DNC hacks have strong evidence pointing to Russia. Not only does all the malware check out, but also other, harder to “false flag” bits, like active command-and-control servers. A serious operator could still false-flag this in theory, if only by bribing people in Russia, but nothing in the CIA dump hints at this.

The Sony hacks have weak evidence pointing to North Korea. One of the items was the use of the RawDisk driver, used both in malware attributed to North Korea and the Sony attacks. This was described as “flimsy” at the time [*]. The CIA dump [*] demonstrates that indeed it’s flimsy — as apparently CIA malware also uses the RawDisk code.

In the coming days, biased partisans are going to seize on the CIA leaks as proof of “false flag” operations, calling into question Russian hacks. No, this isn’t valid. We experts in the industry criticized “malware techniques” as flimsy attribution, long before the Sony attack, and long before the DNC hacks. All the CIA leaks do is prove we were right. On the other hand, the DNC hack attribution is based on more than just this, so nothing in the CIA leaks calls into question that attribution.