Tag Archives: EMET

timeShift(GrafanaBuzz, 1w) Issue 16

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/10/06/timeshiftgrafanabuzz-1w-issue-16/

Welcome to another issue of TimeShift. In addition to the roundup of articles and plugin updates, we had a big announcement this week – Early Bird tickets to GrafanaCon EU are now available! We’re also accepting CFPs through the end of October, so if you have a topic in mind, don’t wait until the last minute; please send it our way. Speakers who are selected will receive a comped ticket to the conference.

Early Bird Tickets Now Available

We’ve released a limited number of Early Bird tickets before General Admission tickets are available. Take advantage of this discount before they’re sold out!

Get Your Early Bird Ticket Now

Interested in speaking at GrafanaCon? We’re looking for technical and non-technical talks of all sizes. Submit a CFP Now.

From the Blogosphere

Get insights into your Azure Cosmos DB: partition heatmaps, OMS, and More: Microsoft recently announced the ability to access a subset of Azure Cosmos DB metrics via Azure Monitor API. Grafana Labs built an Azure Monitor Plugin for Grafana 4.5 to visualize the data.

How to monitor Docker for Mac/Windows: Brian was tired of guessing about the performance of his development machines and test environment. Here, he shows how to monitor Docker with Prometheus to get a better understanding of a dev environment in his quest to monitor all the things.

Prometheus and Grafana to Monitor 10,000 servers: This article covers enokido’s process of choosing a monitoring platform. He identifies three possible solutions, outlines the pros and cons of each, and discusses why he chose Prometheus.

GitLab Monitoring: It’s fascinating to see Grafana dashboards with production data from companies around the world. For instance, we’ve previously highlighted the huge number of dashboards Wikimedia publicly shares. This week, we found that GitLab also has public dashboards to explore.

Monitoring a Docker Swarm Cluster with cAdvisor, InfluxDB and Grafana | The Laboratory: It’s important to know the state of your applications in a scalable environment such as Docker Swarm. This video covers an overview of Docker, VMs vs. containers, orchestration and how to monitor Docker Swarm.

Introducing Telemetry: Actionable Time Series Data from Counters: Learn how to use counters from multiple disparate sources, devices, operating systems, and applications to generate actionable time series data.

ofp_sniffer Branch 1.2 (docker/influxdb/grafana) Upcoming Features: This video demo shows off some of the upcoming features for OFP_Sniffer, an OpenFlow sniffer to help network troubleshooting in production networks.

Grafana Plugins

Plugin authors add new features and bugfixes all the time, so it’s important to always keep your plugins up to date. To update plugins from on-prem Grafana, use the grafana-cli tool; if you are using Hosted Grafana, you can update with one click! If you have questions or need help, hit up our community site, where the Grafana team and members of the community are happy to help.


PNP for Nagios Data Source – The latest release for the PNP data source has some fixes and adds a mathematical factor option.



Google Calendar Data Source – This week, there was a small bug fix for the Google Calendar annotations data source.



BT Plugins – Our friends at BT have been busy. All of the BT plugins in our catalog received an update this week. The plugins are the Status Dot Panel, the Peak Report Panel, the Trend Box Panel and the Alarm Box Panel.

Changes include:

  • Custom dashboard links now work in Internet Explorer.
  • The Peak Report panel no longer supports click-to-sort.
  • The Status Dot panel tooltips now look like Grafana tooltips.

This week’s MVC (Most Valuable Contributor)

Each week we highlight some of the important contributions from our amazing open source community. This week, we’d like to recognize a contributor who did a lot of work to improve Prometheus support.

Thanks to Alin Sinpalean for his Prometheus PR, which aligns the step and interval parameters. Alin got a lot of feedback from the Prometheus community and spent a lot of time and energy explaining, debating and iterating before the PR was ready.
Thank you!

Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Check out our Open Positions

Tweet of the Week

We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

Wow – Excited to be a part of exploring data to find out how Mexico City is evolving.

We Need Your Help!

Do you have a graph that you love because the data is beautiful or because the graph provides interesting information? Please get in touch. Tweet or send us an email with a screenshot, and we’ll tell you about this fun experiment.

Tell Me More

What do you think?

That’s a wrap! How are we doing? Submit a comment on this article below, or post something at our community forum. Help us make these weekly roundups better!

Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

Nazis, are bad

Post Syndicated from Eevee original https://eev.ee/blog/2017/08/13/nazis-are-bad/

Anonymous asks:

Could you talk about something related to the management/moderation and growth of online communities? IOW your thoughts on online community management, if any.

I think you’ve tweeted about this stuff in the past so I suspect you have thoughts on this, but if not, again, feel free to just blog about … anything 🙂

Oh, I think I have some stuff to say about community management, in light of recent events. None of it hasn’t already been said elsewhere, but I have to get this out.

Hopefully the content warning is implicit in the title.

I am frustrated.

I’ve gone on before about a particularly bothersome phenomenon that hurts a lot of small online communities: often, people are willing to tolerate the misery of others in a community, but then get up in arms when someone pushes back. Someone makes a lot of off-hand, off-color comments about women? Uses a lot of dog-whistle terms? Eh, they’re not bothering anyone, or at least not bothering me. Someone else gets tired of it and tells them to knock it off? Whoa there! Now we have the appearance of conflict, which is unacceptable, and people will turn on the person who’s pissed off — even though they’ve been at the butt end of an invisible conflict for who knows how long. The appearance of peace is paramount, even if it means a large chunk of the population is quietly miserable.

Okay, so now, imagine that on a vastly larger scale, and also those annoying people who know how to skirt the rules are Nazis.

The label “Nazi” gets thrown around a lot lately, probably far too easily. But when I see a group of people doing the Hitler salute, waving large Nazi flags, wearing Nazi armbands styled after the SS, well… if the shoe fits, right? I suppose they might have flown across the country to join a torch-bearing mob ironically, but if so, the joke is going way over my head. (Was the murder ironic, too?) Maybe they’re not Nazis in the sense that the original party doesn’t exist any more, but for ease of writing, let’s refer to “someone who espouses Nazi ideology and deliberately bears a number of Nazi symbols” as, well, “a Nazi”.

This isn’t a new thing, either; I’ve stumbled upon any number of Twitter accounts that are decorated in Nazi regalia. I suppose the trouble arises when perfectly innocent members of the alt-right get unfairly labelled as Nazis.

But hang on; this march was called “Unite the Right” and was intended to bring together various far right sub-groups. So what does their choice of aesthetic say about those sub-groups? I haven’t heard, say, alt-right coiner Richard Spencer denounce the use of Nazi symbology — extra notable since he was fucking there and apparently didn’t care to discourage it.

And so begins the rule-skirting. “Nazi” is definitely overused, but even using it to describe white supremacists who make not-so-subtle nods to Hitler is likely to earn you some sarcastic derailment. A Nazi? Oh, so is everyone you don’t like and who wants to establish a white ethno state a Nazi?

Calling someone a Nazi — or even a white supremacist — is an attack, you see. Merely expressing the desire that people of color not exist is perfectly peaceful, but identifying the sentiment for what it is causes visible discord, which is unacceptable.

These clowns even know this sort of thing and strategize around it. Or, try, at least. Maybe it wasn’t that successful this weekend — though flicking through Charlottesville headlines now, they seem to be relatively tame in how they refer to the ralliers.

I’m reminded of a group of furries — the alt-furries — who have been espousing white supremacy and wearing red armbands with a white circle containing a black… pawprint. Ah, yes, that’s completely different.

So, what to do about this?

“Ignore them” is a popular option, often espoused to bullied children by parents who have never been bullied, shortly before they resume complaining about passive-aggressive office politics. The trouble with ignoring them is that, just like in smaller communities, they have a tendency to fester. They take over large chunks of influential Internet surface area like 4chan and Reddit; they help get an inept buffoon elected; and then they start to have torch-bearing rallies and run people over with cars.

4chan illustrates a kind of corollary here. Anyone who’s steeped in Internet Culture™ is surely familiar with 4chan; I was never a regular visitor, but it had enough influence that I was still aware of it and some of its culture. It was always thick with irony, which grew into a sort of ironic detachment — perhaps one of the major sources of the recurring online trope that having feelings is bad — which proceeded into ironic racism.

And now the ironic racism is indistinguishable from actual racism, as tends to be the case. Do they “actually” “mean it”, or are they just trying to get a rise out of people? What the hell is unironic racism if not trying to get a rise out of people? What difference is there to onlookers, especially as they move to become increasingly involved with politics?

“It’s just a joke” and “it was just a thoughtless comment” are exceptionally common defenses made by people desperate to preserve the illusion of harmony, but the strain of overt white supremacy currently running rampant through the US was built on those excuses.

The other favored option is to debate them, to defeat their ideas with better ideas.

Well, hang on. What are their ideas, again? I hear they were chanting stuff like “go back to Africa” and “fuck you, faggots”. Given that this was an overtly political rally (and again, the Nazi fucking regalia), I don’t think it’s a far cry to describe their ideas as “let’s get rid of black people and queer folks”.

This is an underlying proposition: that white supremacy is inherently violent. After all, if the alt-right seized total political power, what would they do with it? If I asked the same question of Democrats or Republicans, I’d imagine answers like “universal health care” or “screw over poor people”. But people whose primary goal is to have a country full of only white folks? What are they going to do, politely ask everyone else to leave? They’re invoking the memory of people who committed genocide and also tried to take over the fucking world. They are outright saying, these are the people we look up to, this is who we think had a great idea.

How, precisely, does one defeat these ideas with rational debate?

Because the underlying core philosophy beneath all this is: “it would be good for me if everything were about me”. And that’s true! (Well, it probably wouldn’t work out how they imagine in practice, but it’s true enough.) Consider that slavery is probably fantastic if you’re the one with the slaves; the issue is that it’s reprehensible, not that the very notion contains some kind of 101-level logical fallacy. That’s probably why we had a fucking war over it instead of hashing it out over brunch.

…except we did hash it out over brunch once, and the result was that slavery was still allowed but slaves only counted as 60% of a person for the sake of counting how much political power states got. So that’s how rational debate worked out. I’m sure the slaves were thrilled with that progress.

That really only leaves pushing back, which raises the question of how to push back.

And, I don’t know. Pushing back is much harder in spaces you don’t control, spaces you’re already struggling to justify your own presence in. For most people, that’s most spaces. It’s made all the harder by that tendency to preserve illusory peace; even the tamest request that someone knock off some odious behavior can be met by pushback, even by third parties.

At the same time, I’m aware that white supremacists prey on disillusioned young white dudes who feel like they don’t fit in, who were promised the world and inherited kind of a mess. Does criticism drive them further away? The alt-right also opposes “political correctness”, i.e. “not being a fucking asshole”.

God knows we all suck at this kind of behavior correction, even within our own in-groups. Fandoms have become almost ridiculously vicious as platforms like Twitter and Tumblr amplify individual anger to deafening levels. It probably doesn’t help that we’re all just exhausted, that every new fuck-up feels like it bears the same weight as the last hundred combined.

This is the part where I admit I don’t know anything about people and don’t have any easy answers. Surprise!

The other alternative is, well, punching Nazis.

That meme kind of haunts me. It raises really fucking complicated questions about when violence is acceptable, in a culture that’s completely incapable of answering them.

America’s relationship to violence is so bizarre and two-faced as to be almost incomprehensible. We worship it. We have the biggest military in the world by an almost comical margin. It’s fairly mainstream to own deadly weapons for the express stated purpose of armed revolution against the government, should that become necessary, where “necessary” is left ominously undefined. Our movies are about explosions and beating up bad guys; our video games are about explosions and shooting bad guys. We fantasize about solving foreign policy problems by nuking someone — hell, our talking heads are currently in polite discussion about whether we should nuke North Korea and annihilate up to twenty-five million people, as punishment for daring to have the bomb that only we’re allowed to have.

But… violence is bad.

That’s about as far as the other side of the coin gets. It’s bad. We condemn it in the strongest possible terms. Also, guess who we bombed today?

I observe that the one time Nazis were a serious threat, America was happy to let them try to take over the world until their allies finally showed up on our back porch.

Maybe I don’t understand what “violence” means. In a quest to find out why people are talking about “leftist violence” lately, I found a National Review article from May that twice suggests blocking traffic is a form of violence. Anarchists have smashed some windows and set a couple fires at protests this year — and, hey, please knock that crap off? — which is called violence against, I guess, Starbucks. Black Lives Matter could be throwing a birthday party and Twitter would still be abuzz with people calling them thugs.

Meanwhile, there’s a trend of murderers with increasingly overt links to the alt-right, and everyone is still handling them with kid gloves. First it was murders by people repeating their talking points; now it’s the culmination of a torches-and-pitchforks mob. (Ah, sorry, not pitchforks; assault rifles.) And we still get this incredibly bizarre both-sides-ism, a White House that refers to the people who didn’t murder anyone as “just as violent if not more so”.

Should you punch Nazis? I don’t know. All I know is that I’m extremely dissatisfied with discourse that’s extremely alarmed by hypothetical punches — far more mundane than what you’d see after a sporting event — but treats a push for ethnic cleansing as a mere difference of opinion.

The equivalent to a punch in an online space is probably banning, which is almost laughable in comparison. It doesn’t cause physical harm, but it is a use of concrete force. Doesn’t pose quite the same moral quandary, though.

Somewhere in the middle is the currently popular pastime of doxxing (doxxxxxxing) people spotted at the rally in an attempt to get them fired or whatever. Frankly, that skeeves me out, though apparently not enough that I’m directly chastising anyone for it.

We aren’t really equipped, as a society, to deal with memetic threats. We aren’t even equipped to determine what they are. We had a fucking world war over this, and now people are outright saying “hey I’m like those people we went and killed a lot in that world war” and we give them interviews and compliment their fashion sense.

A looming question is always, what if they then do it to you? What if people try to get you fired, to punch you for your beliefs?

I think about that a lot, and then I remember that it’s perfectly legal to fire someone for being gay in half the country. (Courts are currently wrangling whether Title VII forbids this, but with the current administration, I’m not optimistic.) I know people who’ve been fired for coming out as trans. I doubt I’d have to look very far to find someone who’s been punched for either reason.

And these aren’t even beliefs; they’re just properties of a person. You can stop being a white supremacist, one of those people yelling “fuck you, faggots”.

So I have to recuse myself from this asinine question, because I can’t fairly judge the risk of retaliation when it already happens to people I care about.

Meanwhile, if a white supremacist does get punched, I absolutely still want my tax dollars to pay for their universal healthcare.

The same wrinkle comes up with free speech, which is paramount.

The ACLU reminds us that the First Amendment “protects vile, hateful, and ignorant speech”. I think they’ve forgotten that that’s a side effect, not the goal. No one sat down and suggested that protecting vile speech was some kind of noble cause, yet that’s how we seem to be treating it.

The point was to avoid a situation where the government is arbitrarily deciding what qualifies as vile, hateful, and ignorant, and was using that power to eliminate ideas distasteful to politicians. You know, like, hypothetically, if they interrogated and jailed a bunch of people for supporting the wrong economic system. Or convicted someone under the Espionage Act for opposing the draft. (Hey, that’s where the “shouting fire in a crowded theater” line comes from.)

But these are ideas that are already in the government. Bannon, a man who was chair of a news organization he himself called “the platform for the alt-right”, has the President’s ear! How much more mainstream can you get?

So again I’m having a little trouble balancing “we need to defend the free speech of white supremacists or risk losing it for everyone” against “we fairly recently were ferreting out communists and the lingering public perception is that communists are scary, not that the government is”.

This isn’t to say that freedom of speech is bad, only that the way we talk about it has become fanatical to the point of absurdity. We love it so much that we turn around and try to apply it to corporations, to platforms, to communities, to interpersonal relationships.

Look at 4chan. It’s completely public and anonymous; you only get banned for putting the functioning of the site itself in jeopardy. Nothing is stopping a larger group of people from joining its politics board and tilting sentiment the other way — except that the current population is so odious that no one wants to be around them. Everyone else has evaporated away, as tends to happen.

Free speech is great for a government, to prevent quashing politics that threaten the status quo (except it’s a joke and they’ll do it anyway). People can’t very readily just bail when the government doesn’t like them, anyway. It’s also nice to keep in mind to some degree for ubiquitous platforms. But the smaller you go, the easier it is for people to evaporate away, and the faster pure free speech will turn the place to crap. You’ll be left only with people who care about nothing.

At the very least, it seems clear that the goal of white supremacists is some form of destabilization, of disruption to the fabric of a community for purely selfish purposes. And those are the kinds of people you want to get rid of as quickly as possible.

Usually this is hard, because they act just nicely enough to create some plausible deniability. But damn, if someone is outright telling you they love Hitler, maybe skip the principled hand-wringing and eject them.

Seven Tips for Using S3DistCp on Amazon EMR to Move Data Efficiently Between HDFS and Amazon S3

Post Syndicated from Illya Yalovyy original https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/

Have you ever needed to move a large amount of data between Amazon S3 and Hadoop Distributed File System (HDFS) but found that the data set was too large for a simple copy operation? EMR can help you with this. In addition to processing and analyzing petabytes of data, EMR can move large amounts of data.

In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and that adds several useful features. In addition to moving data between HDFS and S3, S3DistCp is also a Swiss Army knife of file manipulations. In this post we’ll cover the following tips for using S3DistCp, starting with basic use cases and then moving to more advanced scenarios:

1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit a S3DistCp step to an EMR cluster

1. Copy or move files without transformation

We’ve observed that customers often use S3DistCp to copy data from one storage location to another, whether S3 or HDFS. Syntax for this operation is simple and straightforward:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table

The source location may contain extra files that we don’t necessarily want to copy. Here, we can use filters based on regular expressions to do things such as copying files with the .log extension only.

Each subfolder has the following files:

$ hadoop fs -ls /data/incoming/hourly_table/2017-02-01/03
Found 8 items
-rw-r--r--   1 hadoop hadoop     197850 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.25845.log
-rw-r--r--   1 hadoop hadoop     484006 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.32953.log
-rw-r--r--   1 hadoop hadoop     868522 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.62649.log
-rw-r--r--   1 hadoop hadoop     408072 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.64637.log
-rw-r--r--   1 hadoop hadoop    1031949 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.70767.log
-rw-r--r--   1 hadoop hadoop     368240 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.89910.log
-rw-r--r--   1 hadoop hadoop     437348 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.96053.log
-rw-r--r--   1 hadoop hadoop        800 2017-02-19 03:41 /data/incoming/hourly_table/2017-02-01/03/processing.meta

To copy only the required files, let’s use the --srcPattern option:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table_filtered --srcPattern '.*\.log'

After the upload has finished successfully, let’s check the folder contents in the destination location to confirm only the files ending in .log were copied:

$ hadoop fs -ls s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03
-rw-rw-rw-   1     197850 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.25845.log
-rw-rw-rw-   1     484006 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.32953.log
-rw-rw-rw-   1     868522 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.62649.log
-rw-rw-rw-   1     408072 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.64637.log
-rw-rw-rw-   1    1031949 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.70767.log
-rw-rw-rw-   1     368240 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.89910.log
-rw-rw-rw-   1     437348 2017-02-19 22:56 s3://my-tables/incoming/hourly_table_filtered/2017-02-01/03/2017-02-01.03.96053.log
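Before running a copy, it can help to preview which paths a pattern would select. The following is a minimal sketch, not part of S3DistCp itself: it feeds three sample paths from the source listing above through grep -E, used here as a rough stand-in for the Java regular expressions that S3DistCp evaluates. Anchoring the pattern with ^ and $ assumes that the pattern has to match the whole path; the variable names are only illustrative.

```shell
# Sample paths taken from the source listing shown earlier.
paths='/data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.25845.log
/data/incoming/hourly_table/2017-02-01/03/2017-02-01.03.32953.log
/data/incoming/hourly_table/2017-02-01/03/processing.meta'

# The same pattern that is passed to --srcPattern.
pattern='.*\.log'

# Assumption: the pattern must match the entire path, so anchor it with ^ and $.
matched=$(printf '%s\n' "$paths" | grep -E "^${pattern}\$")

# Print the paths that would be selected for copying.
printf '%s\n' "$matched"
```

Simple patterns like this one behave the same in both regex dialects, but Java-only constructs such as \d would need to be rewritten before grep can check them.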

Sometimes, data needs to be moved instead of copied. In this case, we can use the --deleteOnSuccess option. This option is similar to aws s3 mv, which you might have used previously with the AWS CLI. The files are first copied and then deleted from the source:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest s3://my-tables/incoming/hourly_table_archive --deleteOnSuccess

After the preceding operation, the source location has only empty folders, and the target location contains all files.

$ hadoop fs -ls -R s3://my-tables/incoming/hourly_table/2017-02-01/
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/00
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/01
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/21
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/incoming/hourly_table/2017-02-01/22

$ hadoop fs -ls s3://my-tables/incoming/hourly_table_archive/2017-02-01/01
-rw-rw-rw-   1     676756 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/2017-02-01.01.27047.log
-rw-rw-rw-   1     780197 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/2017-02-01.01.59789.log
-rw-rw-rw-   1    1041789 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/2017-02-01.01.82293.log
-rw-rw-rw-   1        400 2017-02-19 23:27 s3://my-tables/incoming/hourly_table_archive/2017-02-01/01/processing.meta

The important things to remember here are that, with the --deleteOnSuccess flag, S3DistCp deletes only files and that it doesn’t delete parent folders, even when they are empty.

2. Copy and change file compression on the fly

Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_gz --outputCodec=gz

The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:

  • “none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.
  • “keep” – Don’t change the compression of the files but copy them as-is.

Let’s check the files in the target folder, which have now been compressed with the gz codec:

$ hadoop fs -ls s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/
Found 3 items
-rw-rw-rw-   1     78756 2017-02-20 00:07 s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/2017-02-01.01.27047.log.gz
-rw-rw-rw-   1     80197 2017-02-20 00:07 s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/2017-02-01.01.59789.log.gz
-rw-rw-rw-   1    121178 2017-02-20 00:07 s3://my-tables/incoming/hourly_table_gz/2017-02-01/01/2017-02-01.01.82293.log.gz

3. Copy files incrementally

In real life, the upstream process drops files at some cadence. For instance, new files might get created every hour, or every minute. The downstream process can be configured to pick them up on a different schedule.

Let’s say data lands on S3 and we want to process it on HDFS daily. Copying all files every time doesn’t scale very well. Fortunately, S3DistCp has a built-in solution for that.

For this solution, we use a manifest file. That file allows S3DistCp to keep track of copied files. Following is an example of the command:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest s3://my-tables/processing/hourly_table --srcPattern '.*\.log' --outputManifest=manifest-2017-02-25.gz --previousManifest=s3://my-tables/processing/hourly_table/manifest-2017-02-24.gz

The command takes two manifest files as parameters, outputManifest and previousManifest. The first one contains a list of all copied files (old and new), and the second contains a list of files copied previously. This way, we can recreate the full history of operations and see what files were copied during each run:

$ hadoop fs -text s3://my-tables/processing/hourly_table/manifest-2017-02-24.gz > previous.lst
$ hadoop fs -text s3://my-tables/processing/hourly_table/manifest-2017-02-25.gz > current.lst
$ diff previous.lst current.lst
> {"path":"s3://my-tables/processing/hourly_table/2017-02-25/00/2017-02-15.00.50958.log","baseName":"2017-02-25/00/2017-02-15.00.50958.log","srcDir":"s3://my-tables/processing/hourly_table","size":610308}
> {"path":"s3://my-tables/processing/hourly_table/2017-02-25/00/2017-02-25.00.93423.log","baseName":"2017-02-25/00/2017-02-25.00.93423.log","srcDir":"s3://my-tables/processing/hourly_table","size":178928}

S3DistCp creates the manifest file in the local file system at the path provided with --outputManifest (for example, /tmp/mymanifest.gz). When the copy operation finishes, it moves that manifest to <DESTINATION LOCATION>.
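If only the paths of newly copied files are of interest, the two listings can be compared line by line and the "path" field extracted. The sketch below is one hedged way to do that with standard tools. It assumes each manifest line is a single JSON object that begins with the "path" key, as in the diff output above; the entry for the earlier run (example.log) is hypothetical and exists only to make the example self-contained.

```shell
# Work in a throwaway directory so the sketch leaves nothing behind.
workdir=$(mktemp -d)

# previous.lst: one hypothetical entry standing in for an earlier run.
cat > "$workdir/previous.lst" <<'EOF'
{"path":"s3://my-tables/processing/hourly_table/2017-02-24/23/example.log","baseName":"2017-02-24/23/example.log","srcDir":"s3://my-tables/processing/hourly_table","size":1}
EOF

# current.lst: the earlier entry plus one entry from the diff output above.
cat > "$workdir/current.lst" <<'EOF'
{"path":"s3://my-tables/processing/hourly_table/2017-02-24/23/example.log","baseName":"2017-02-24/23/example.log","srcDir":"s3://my-tables/processing/hourly_table","size":1}
{"path":"s3://my-tables/processing/hourly_table/2017-02-25/00/2017-02-25.00.93423.log","baseName":"2017-02-25/00/2017-02-25.00.93423.log","srcDir":"s3://my-tables/processing/hourly_table","size":178928}
EOF

# Keep lines that appear only in current.lst, then pull out the "path" value.
new_paths=$(grep -vxFf "$workdir/previous.lst" "$workdir/current.lst" \
  | sed -E 's/^\{"path":"([^"]*)".*$/\1/')

printf '%s\n' "$new_paths"

# Clean up the temporary files.
rm -rf "$workdir"
```

For manifests with a different field order, a JSON-aware tool would be a safer choice than sed.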

4. Copy multiple folders in one job

Imagine that we need to copy several folders. Usually, we run as many copy jobs as there are folders that need to be copied. With S3DistCp, the copy can be done in one go. All we need is to prepare a file with a list of prefixes and use it as a parameter for the tool:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/processing/sample_table --srcPrefixesFile file://${PWD}/folders.lst

In this case, the folders.lst file contains the following prefixes:

$ cat folders.lst

As a result, the target location has only the requested subfolders:

$ hadoop fs -ls -R s3://my-tables/processing/sample_table
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-10
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-10/11
-rw-rw-rw-   1     139200 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-10/11/2017-02-10.11.12980.log
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-19
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-19/02
-rw-rw-rw-   1     702058 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-19/02/2017-02-19.02.19497.log
-rw-rw-rw-   1     265404 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-19/02/2017-02-19.02.26671.log
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-23
drwxrwxrwx   -          0 1970-01-01 00:00 s3://my-tables/processing/sample_table/2017-02-23/00
-rw-rw-rw-   1     310425 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-23/00/2017-02-23.00.10061.log
-rw-rw-rw-   1    1030397 2017-02-24 05:59 s3://my-tables/processing/sample_table/2017-02-23/00/2017-02-23.00.22664.log

5. Aggregate files based on a pattern

Hadoop is optimized for reading a smaller number of large files rather than many small files, whether from S3 or HDFS. You can use S3DistCp to aggregate small files into fewer large files of a size that you choose, which can optimize your analysis for both performance and cost.

In the following example, we combine small files into bigger files. We do so by using a regular expression with the --groupBy option.

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/processing/daily_table --targetSize=10 --groupBy='.*/hourly_table/.*/(\d\d)/.*\.log'

Let’s take a look into the target folders and compare them to the corresponding source folders:

$ hadoop fs -ls /data/incoming/hourly_table/2017-02-22/05/
Found 8 items
-rw-r--r--   1 hadoop hadoop     289949 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.11125.log
-rw-r--r--   1 hadoop hadoop     407290 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.19596.log
-rw-r--r--   1 hadoop hadoop     253434 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.30135.log
-rw-r--r--   1 hadoop hadoop     590655 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.36531.log
-rw-r--r--   1 hadoop hadoop     762076 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.47822.log
-rw-r--r--   1 hadoop hadoop     489783 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.80518.log
-rw-r--r--   1 hadoop hadoop     205976 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/2017-02-22.05.99127.log
-rw-r--r--   1 hadoop hadoop        800 2017-02-19 06:07 /data/incoming/hourly_table/2017-02-22/05/processing.meta


$ hadoop fs -ls s3://my-tables/processing/daily_table/2017-02-22/05/
Found 2 items
-rw-rw-rw-   1   10541944 2017-02-28 05:16 s3://my-tables/processing/daily_table/2017-02-22/05/054
-rw-rw-rw-   1   10511516 2017-02-28 05:16 s3://my-tables/processing/daily_table/2017-02-22/05/055

As you can see, seven data files were combined into two with a size close to the requested 10 MB. The *.meta file was filtered out because the --groupBy pattern works in a similar way to --srcPattern. We recommend keeping files larger than the default block size, which is 128 MB on EMR.

The name of the final file is composed of groups in the regular expression used in --groupBy plus some number to make the name unique. The pattern must have at least one group defined.

Let’s consider one more example. This time, we want the file name to be formed from three parts: year, month, and file extension (.log in this case). Here is an updated command:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/processing/daily_table_2017 --targetSize=10 --groupBy='.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)'

Now we have final files named in a different way:

$ hadoop fs -ls s3://my-tables/processing/daily_table_2017/2017-02-22/05/
Found 2 items
-rw-rw-rw-   1   10541944 2017-02-28 05:16 s3://my-tables/processing/daily_table_2017/2017-02-22/05/2017-05log4
-rw-rw-rw-   1   10511516 2017-02-28 05:16 s3://my-tables/processing/daily_table_2017/2017-02-22/05/2017-05log5

As you can see, the names of the final files are the concatenation of the three groups from the regular expression, (2017-), (\d\d), and (log), followed by a number.
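To see how the capture groups drive both the filtering and the naming, the grouping step can be imitated outside the cluster. The following is a minimal sketch of the idea, not S3DistCp's actual implementation: groupKey and groupFiles are hypothetical helper names, and the sketch assumes that the pattern has to match the whole path.

```javascript
// Illustrative only: imitates the grouping idea behind --groupBy.
// groupKey and groupFiles are hypothetical helpers, not part of S3DistCp.

// Returns the concatenated capture groups, or null when the path does not match.
function groupKey(pattern, path) {
  // Assumption: the pattern has to match the whole path.
  const match = new RegExp('^(?:' + pattern + ')$').exec(path);
  if (match === null) return null;
  return match.slice(1).join('');
}

// Buckets paths by group key; paths that do not match are dropped.
function groupFiles(pattern, paths) {
  const groups = new Map();
  for (const path of paths) {
    const key = groupKey(pattern, path);
    if (key === null) continue;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(path);
  }
  return groups;
}

const dir = '/data/incoming/hourly_table/2017-02-22/05/';
const paths = [
  dir + '2017-02-22.05.11125.log',
  dir + '2017-02-22.05.19596.log',
  dir + 'processing.meta'
];

// The two patterns used in the commands above.
const byHour = String.raw`.*/hourly_table/.*/(\d\d)/.*\.log`;
const byYearHour = String.raw`.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)`;

console.log(groupFiles(byHour, paths));      // both .log files land under key '05'
console.log(groupKey(byYearHour, paths[0])); // '2017-05log'
console.log(groupKey(byHour, paths[2]));     // null: processing.meta is filtered out
```

The keys have the same shape as the prefixes of the output file names shown above (05 and 2017-05log). The trailing number that S3DistCp appends to make names unique is not modeled here.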

You might find that occasionally you get an error that looks like the following:

$ s3-dist-cp --src /data/incoming/hourly_table --dest s3://my-tables/processing/daily_table_2017 --targetSize=10 --groupBy='.*/hourly_table/.*(2018-).*/(\d\d)/.*\.(log)'
17/04/27 15:37:45 INFO S3DistCp.S3DistCp: Created 0 files to copy 0 files
Exception in thread "main" java.lang.RuntimeException: Error running job
	at com.amazon.elasticmapreduce.S3DistCp.S3DistCp.run(S3DistCp.java:927)
	at com.amazon.elasticmapreduce.S3DistCp.S3DistCp.run(S3DistCp.java:705)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at com.amazon.elasticmapreduce.S3DistCp.Main.main(Main.java:22)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

In this case, the key information is contained in Created 0 files to copy 0 files. S3DistCp didn’t find any files to copy because the regular expression in the --groupBy option doesn’t match any files in the source location.

The reason for this issue varies. For example, it can be a mistake in the specified pattern. In the preceding example, we don’t have any files for the year 2018. Another common reason is incorrect escaping of the pattern when we submit an S3DistCp command as a step, which is addressed later in this post.

6. Upload files larger than 1 TB in size

The default upload chunk size when doing an S3 multipart upload is 128 MB. When files are larger than 1 TB, the total number of parts can reach over 10,000. Such a large number of parts can make the job run for a very long time or even fail.

In this case, you can improve job performance by increasing the size of each part. In S3DistCp, you can do this by using the --multipartUploadChunkSize option.
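The arithmetic behind this advice is easy to check. The sketch below assumes that sizes are expressed in MiB and relies on the Amazon S3 limit of 10,000 parts per multipart upload. The names partsNeeded and minChunkSizeMiB are hypothetical helpers for illustration, not S3DistCp options.

```javascript
// Amazon S3 allows at most 10,000 parts in one multipart upload.
const MAX_PARTS = 10000;
const MIB_PER_TIB = 1024 * 1024;

// Number of parts needed to upload a file of the given size.
function partsNeeded(fileSizeMiB, chunkSizeMiB) {
  return Math.ceil(fileSizeMiB / chunkSizeMiB);
}

// Smallest whole-MiB chunk size that keeps the upload within MAX_PARTS.
function minChunkSizeMiB(fileSizeMiB) {
  return Math.ceil(fileSizeMiB / MAX_PARTS);
}

const twoTiB = 2 * MIB_PER_TIB; // a hypothetical 2 TiB file

console.log(partsNeeded(twoTiB, 128));  // 16384 parts with the default chunk size
console.log(partsNeeded(twoTiB, 1000)); // 2098 parts with 1000 MiB chunks
console.log(minChunkSizeMiB(twoTiB));   // 210 MiB is the smallest chunk that fits
```

For this hypothetical 2 TiB file, the default chunk size would exceed the part limit, while a chunk size of 210 MiB or more stays within it.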

Let’s test how it works on several files about 200 GB in size. With the default part size, it takes about 84 minutes to copy them to S3 from HDFS.

We can increase the default part size to 1000 MB:

$ time s3-dist-cp --src /data/gb200 --dest s3://my-tables/data/S3DistCp/gb200_2 --multipartUploadChunkSize=1000
real    41m1.616s

The maximum part size is 5 GB. Keep in mind that larger parts have a higher chance of failing during upload and don’t necessarily speed up the process. Let’s run the same job with the maximum part size:

$ time s3-dist-cp --src /data/gb200 --dest s3://my-tables/data/S3DistCp/gb200_2 --multipartUploadChunkSize=5000
real    40m17.331s

7. Submit an S3DistCp step to an EMR cluster

You can run the S3DistCp tool in several ways. First, you can SSH to the master node and execute the command in a terminal window as we did in the preceding examples. This approach might be convenient for many use cases, but sometimes you might want to create a cluster that has some data on HDFS. You can do this by submitting a step directly in the AWS Management Console when creating a cluster.

In the console’s Add Step dialog box, we can fill in the fields in the following way:

  • Step type: Custom JAR
  • Name*: S3DistCp Step
  • JAR location: command-runner.jar
  • Arguments: s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/input/hourly_table --targetSize 10 --groupBy .*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)
  • Action on failure: Continue

Notice that we didn’t add quotation marks around our pattern. We needed quotation marks when we were using bash in the terminal window, but not here. The console takes care of escaping and transferring our arguments to the command on the cluster.

Another common use case is to run S3DistCp recurrently or on some event. We can always submit a new step to the existing cluster. The syntax here is slightly different than in previous examples. We separate arguments by commas. In the case of a complex pattern, we shield the whole step option with single quotation marks:

aws emr add-steps --cluster-id j-ABC123456789Z --steps 'Name=LoadData,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Type=CUSTOM_JAR,Args=s3-dist-cp,--src,s3://my-tables/incoming/hourly_table,--dest,/data/input/hourly_table,--targetSize,10,--groupBy,.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)'
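When steps are submitted from a script, it can help to assemble the comma-separated value programmatically rather than by hand. The following is a minimal sketch under that assumption. buildStepOption is a hypothetical helper, not part of the AWS CLI, and it simply rejects arguments that contain commas instead of attempting any escaping.

```javascript
// Hypothetical helper: assembles the shorthand value passed to --steps.
// Arguments are joined with commas, so an argument that itself contains
// a comma is rejected rather than escaped.
function buildStepOption(step) {
  for (const arg of step.args) {
    if (arg.includes(',')) {
      throw new Error('Argument contains a comma: ' + arg);
    }
  }
  return [
    'Name=' + step.name,
    'Jar=' + step.jar,
    'ActionOnFailure=' + step.actionOnFailure,
    'Type=' + step.type,
    'Args=' + step.args.join(',')
  ].join(',');
}

const stepOption = buildStepOption({
  name: 'LoadData',
  jar: 'command-runner.jar',
  actionOnFailure: 'CONTINUE',
  type: 'CUSTOM_JAR',
  args: [
    's3-dist-cp',
    '--src', 's3://my-tables/incoming/hourly_table',
    '--dest', '/data/input/hourly_table',
    '--targetSize', '10',
    // String.raw keeps the backslashes of the pattern intact.
    '--groupBy', String.raw`.*/hourly_table/.*(2017-).*/(\d\d)/.*\.(log)`
  ]
});

console.log(stepOption);
```

The resulting string is the value that the command above wraps in single quotation marks after --steps.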


This post showed you the basics of how S3DistCp works and highlighted some of its most useful features. It covered how you can use S3DistCp to optimize for raw files of different sizes and also selectively copy different files between locations. We also looked at several options for using the tool from SSH, the AWS Management Console, and the AWS CLI.

If you have questions or suggestions, leave a message in the comments.

Next Steps

Take your new knowledge to the next level! Click on the post below and learn the top 10 tips to improve query performance in Amazon Athena.

Top 10 Performance Tuning Tips for Amazon Athena

About the Author

Illya Yalovyy is a Senior Software Development Engineer with Amazon Web Services. He works on cutting-edge features of EMR and is heavily involved in open source projects such as Apache Hive, Apache ZooKeeper, and Apache Sqoop. His spare time is completely dedicated to his children and family.


Roundup of AWS HIPAA Eligible Service Announcements

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/roundup-of-aws-hipaa-eligible-service-announcements/

At AWS we have had a number of HIPAA eligible service announcements. Patrick Combes, the Healthcare and Life Sciences Global Technical Leader at AWS, and Aaron Friedman, a Healthcare and Life Sciences Partner Solutions Architect at AWS, have written this post to tell you all about it.


We are pleased to announce that the following AWS services have been added to the BAA in recent weeks: Amazon API Gateway, AWS Direct Connect, AWS Database Migration Service, and Amazon SQS. All four of these services facilitate moving data into and through AWS, and we are excited to see how customers will be using these services to advance their solutions in healthcare. While we know the use cases for each of these services are vast, we wanted to highlight some ways that customers might use these services with Protected Health Information (PHI).

As with all HIPAA-eligible services covered under the AWS Business Associate Addendum (BAA), PHI must be encrypted while at-rest or in-transit. We encourage you to reference our HIPAA whitepaper, which details how you might configure each of AWS’ HIPAA-eligible services to store, process, and transmit PHI. And of course, for any portion of your application that does not touch PHI, you can use any of our 90+ services to deliver the best possible experience to your users. You can find some ideas on architecting for HIPAA on our website.

Amazon API Gateway
Amazon API Gateway is a web service that makes it easy for developers to create, publish, monitor, and secure APIs at any scale. With PHI now able to securely transit API Gateway, applications such as patient/provider directories, patient dashboards, medical device reports/telemetry, HL7 message processing and more can securely accept and deliver information to any number and type of applications running within AWS or client presentation layers.

One area where we are particularly excited to see customers use Amazon API Gateway is the exchange of healthcare information. The Fast Healthcare Interoperability Resources (FHIR) specification will likely become the next-generation standard for how health information is shared between entities. With strong support for RESTful architectures, FHIR can be easily codified within an API on Amazon API Gateway. For more information on FHIR, our AWS Healthcare Competency partner, Datica, has an excellent primer.

AWS Direct Connect
Some of our healthcare and life sciences customers, such as Johnson & Johnson, leverage hybrid architectures and need to connect their on-premises infrastructure to the AWS Cloud. Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.

In addition to a hybrid-architecture strategy, AWS Direct Connect can assist with the secure migration of data to AWS, which is the first step to using the wide array of our HIPAA-eligible services to store and process PHI, such as Amazon S3 and Amazon EMR. Additionally, you can connect to third-party/externally-hosted applications or partner-provided solutions as well as securely and reliably connect end users to those same healthcare applications, such as a cloud-based Electronic Medical Record system.

AWS Database Migration Service (DMS)
To date, customers have migrated over 20,000 databases to AWS through the AWS Database Migration Service. Customers often use DMS as part of their cloud migration strategy, and now it can be used to securely and easily migrate your core databases containing PHI to the AWS Cloud. As your source database remains fully operational during the migration with DMS, you minimize downtime for these business-critical applications as you migrate your databases to AWS. This service can now be utilized to securely transfer such items as patient directories, payment/transaction record databases, revenue management databases and more into AWS.

Amazon Simple Queue Service (SQS)
Amazon Simple Queue Service (SQS) is a message queueing service for reliably communicating among distributed software components and microservices at any scale. One way that we envision customers using SQS with PHI is to buffer requests between application components that pass HL7 or FHIR messages to other parts of their application. You can leverage features like SQS FIFO to ensure that your messages containing PHI are delivered in the order they are received and remain available until a consumer processes and deletes them. This is important for applications that handle patient record updates or process payment information in a hospital.

Let’s get building!
We are beyond excited to see how our customers will use our newly HIPAA-eligible services as part of their healthcare applications. What are you most excited for? Leave a comment below.

Operating OpenStack at Scale

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/159795571841

By James Penick, Cloud Architect & Gurpreet Kaur, Product Manager

A version of this byline was originally written for and appears in CIO Review.

A successful private cloud presents a consistent and reliable facade over the complexities of hyperscale infrastructure. It must simultaneously handle constant organic traffic growth, unanticipated spikes, a multitude of hardware vendors, and discordant customer demands. The depth of this complexity only increases with the age of the business, leaving a private cloud operator saddled with legacy hardware, old network infrastructure, customers dependent on legacy operating systems, and the list goes on. These are the foundations of the horror stories told by grizzled operators around the campfire.

Providing a plethora of services globally for over a billion active users requires a hyperscale infrastructure. Yahoo’s on-premises infrastructure is comprised of datacenters housing hundreds of thousands of physical and virtual compute resources globally, connected via a multi-terabit network backbone. As one of the very first hyperscale internet companies in the world, Yahoo’s infrastructure had grown organically – things were built, and rebuilt, as the company learned and grew. The resulting web of modern and legacy infrastructure became progressively more difficult to manage. Initial attempts to manage this via IaaS (Infrastructure-as-a-Service) taught some hard lessons. However, those lessons, some of which are shared below, served us well when OpenStack was selected to manage Yahoo’s datacenters.

Centralized team offering Infrastructure-as-a-Service

Chief amongst the lessons learned prior to OpenStack was that IaaS must be presented as a core service to the whole organization by a dedicated team. An a-la-carte-IaaS, where each user is expected to manage their own control plane and inventory, just isn’t sustainable at scale. Multiple teams tackling the same challenges involved in the curation of software, deployment, upkeep, and security within an organization is not just a duplication of effort; it removes the opportunity for improved synergy with all levels of the business. The first OpenStack cluster, with a centralized dedicated developer and service engineering team, went live in June 2012.  This model has served us well and has been a crucial piece of making OpenStack succeed at Yahoo. One of the biggest advantages to a centralized, core team is the ability to collaborate with the foundational teams upon which any business is built: Supply chain, Datacenter Site-Operations, Finance, and finally our customers, the engineering teams. Building a close relationship with these vital parts of the business provides the ability to streamline the process of scaling inventory and presenting on-demand infrastructure to the company.

Developers love instant access to compute resources

Our developer productivity clusters, named “OpenHouse,” were a huge hit. Ideation and experimentation are core to developers’ DNA at Yahoo. It empowers our engineers to innovate, prototype, develop, and quickly iterate on ideas. No longer is a developer reliant on a static and costly development machine under their desk. OpenHouse enables developer agility and cost savings by obviating the desktop.

Dynamic infrastructure empowers agile products

From a humble beginning of a single, small OpenStack cluster, Yahoo’s OpenStack footprint is growing beyond 100,000 VM instances globally, with our single largest virtual machine cluster running over a thousand compute nodes, without using Nova Cells.

Until this point, Yahoo’s production footprint was nearly 100% focused on baremetal – a part of the business that one cannot simply ignore. In 2013, Yahoo OpenStack Baremetal began to manage all new compute deployments. Interestingly, after moving to a common API to provision baremetal and virtual machines, there was a marked increase in demand for virtual machines.

Developers across all major business units, ranging from Yahoo Mail, Video, News, Finance, Sports and many more, were thrilled with getting instant access to compute resources to hit the ground running on their projects. Today, the OpenStack team is continuing to migrate the business fully to OpenStack-managed infrastructure. Our baremetal footprint is well beyond that of our VMs, with over 100,000 baremetal instances provisioned by OpenStack Nova via Ironic.

How did Yahoo hit this scale?  

Scaling OpenStack begins with understanding how its various components work and how they communicate with one another. This topic can be very deep and for the sake of brevity, we’ll hit the high points.

1. Start at the bottom and think about the underlying hardware

Do not overlook the unique resource constraints for the services which power your cloud, nor the fashion in which those services are to be used. Leverage that understanding to drive hardware selection. For example, when one examines the role of the database server in an OpenStack cluster, and considers the multitudinous calls to the database: compute node heartbeats, instance state changes, normal user operations, and so on; they would conclude this core component is extremely busy in even a modest-sized Nova cluster, and in need of adequate computational resources to perform. Yet many deployers skimp on the hardware. The performance of the whole cluster is bottlenecked by the DB I/O. By thinking ahead you can save yourself a lot of heartburn later on.

2. Think about how things communicate

Our cluster databases are configured to be multi-master single-writer with automated failover. Control plane services have been modified to split DB reads directly to the read slaves and only write to the write-master. This distributes load across the database servers.
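The read/write split described above can be pictured with a small routing sketch. This is a hypothetical illustration of the idea only, not Yahoo's or OpenStack's code: the SplitRouter class, the server names, and the rule for recognizing read statements are all assumptions made for the example.

```javascript
// Hypothetical sketch of read/write splitting; not Yahoo's or OpenStack's code.
class SplitRouter {
  constructor(writer, readers) {
    this.writer = writer;   // the single write-master
    this.readers = readers; // servers that only answer reads
    this.next = 0;          // round-robin cursor over the readers
  }

  // Picks the server that should run the given SQL statement.
  route(sql) {
    // Assumption: only SELECT and SHOW statements count as reads.
    const isRead = /^\s*(select|show)\b/i.test(sql);
    if (!isRead || this.readers.length === 0) return this.writer;
    const server = this.readers[this.next];
    this.next = (this.next + 1) % this.readers.length;
    return server;
  }
}

const router = new SplitRouter('db-master', ['db-replica-1', 'db-replica-2']);

console.log(router.route('SELECT * FROM instances'));          // db-replica-1
console.log(router.route('UPDATE instances SET host = NULL')); // db-master
console.log(router.route('SELECT 1'));                         // db-replica-2
```

Reads rotate across the read servers while every write goes to the single writer, which is how the load ends up distributed across the database servers.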

3. Scale wide

OpenStack has many small horizontally-scalable components which can peacefully cohabitate on the same machines: the Nova, Keystone, and Glance APIs, for example. Stripe these across several small or modest machines. Some services, such as the Nova scheduler, run the risk of race conditions when running multi-active. If the risk of race conditions is unacceptable, use ZooKeeper to manage leader election.

4. Remove dependencies

In a Yahoo datacenter, DHCP is only used to provision baremetal servers. By statically declaring IPs in our instances via cloud-init, our infrastructure is less prone to outage from a failure in the DHCP infrastructure.

5. Don’t be afraid to replace things

Neutron used Dnsmasq to provide DHCP services; however, it was not designed to address the complexity or scale of a dynamic environment. For example, Dnsmasq must be restarted for any config change, such as when a new host is being provisioned. In the Yahoo OpenStack clusters this has been replaced by ISC-DHCPD, which scales far better than Dnsmasq and allows dynamic configuration updates via an API.

6. Or split them apart

Some of the core imaging services provided by Ironic, such as DHCP, TFTP, and HTTPS, communicate with a host during the provisioning process. These services are normally part of the Ironic Conductor (IC) service. In our environment we split these services into a new and physically-distinct service called the Ironic Transport Service (ITS). This brings value by:

  • Adding security: Splitting the ITS from the IC allows us to block all network traffic from production compute nodes to the IC, and other parts of our control plane. If a malicious entity attacks a node serving production traffic, they cannot escalate from it  to our control plane.
  • Scale: The ITS hosts allow us to horizontally scale the core provisioning services with which nodes communicate.
  • Flexibility: ITS allows Yahoo to manage remote sites, such as peering points, without building a new cluster in that site. Resources in those sites can now be managed by the nearest Yahoo owned & operated (O&O) datacenter, without needing to build a whole cluster in each site.

Be prepared for faulty hardware!

Running IaaS reliably at hyperscale is more than just scaling the control plane. One must take a holistic look at the system and consider everything. In fact, when examining provisioning failures, our engineers determined that the most common root cause was faulty hardware. For example, there are a number of machines from varying vendors whose IPMI firmware fails from time to time, leaving the host inaccessible to remote power management. Some fail within minutes or weeks of installation. These failures occur on many different models, across many generations, and across many hardware vendors. Exposing these failures to users would create a very negative experience, and the cloud must be built to tolerate this complexity.

Focus on the end state

Yahoo’s experience shows that one can run OpenStack at hyperscale, leveraging it to wrap infrastructure and remove perceived complexity. Correctly leveraged, OpenStack presents an easy, consistent, and error-free interface. Delivering this interface is core to our design philosophy as Yahoo continues to double down on our OpenStack investment. The Yahoo OpenStack team looks forward to continuing to collaborate with the OpenStack community to share feedback and code.

Private Properties with JavaScript (prototypes)

Post Syndicated from Delian Delchev original http://deliantech.blogspot.com/2017/01/private-properties-with-javascript.html

Some say that JavaScript does not have private properties with prototypes, and that in this respect it is not like TypeScript or Python.
However, that is not fully correct, especially with this comparison.
Python doesn’t have anything private. All the scopes are globally addressable and all the variables, properties and methods are accessible globally (as long as you know the scope path, which can easily be traced from within the software itself).
TypeScript is just a pre-processor. It does extra semantic checks during compilation, but enforces nothing at runtime.
However, it is generally untrue that you have no private properties available in JavaScript.
You do have scope inheritance and you don’t have global access to any scope from outside. Therefore you can make private properties easily.
For example, if you have:
function F1() {
  var XXX = 0;
  return function () {
     return XXX;
  };
}
F1()() will respond with the value of XXX but there is no way to access XXX from outside.
So what about prototyping?
You can do this there too.
function F1() {
  var XXX = 'yyy'; // Private property
  function F1() {
     // Your constructor is here
     this.YYY = 'xxx'; // public property
  }
  F1.prototype.method1 = function() {
    return XXX + this.YYY;
  };
  return new F1();
}
Then if you do:
x = F1();
x = new F1();  // Both work the same way
and check:
console.log(x.YYY); // output 'xxx'
console.log(x.XXX); // output undefined (XXX is not reachable from outside)
console.log(x.method1()); // output 'yyyxxx'
So everything works! You have prototyping with private and public properties.
The same way you can do private and public methods.
The only thing that will not work with this method is instanceof:
x instanceof F1 will respond with false, because x is not an instance of the upper F1, but of the inner F1.

The current ES6 standard has classes, but they are essentially a wrapper on top of prototypes, so this technique still applies. Additionally, with Object.defineProperty you can apply extra protection on top of a property.
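As a minimal sketch of what such extra protection might look like, the example below uses Object.defineProperty to make a property read-only and hidden from enumeration. The Account constructor and the tryOverwrite helper are hypothetical names chosen for this illustration. Note that this protects the property from being changed, not from being read.

```javascript
// Hypothetical example: a property that is readable but cannot be
// overwritten, deleted, redefined, or listed during enumeration.
function Account(owner) {
  Object.defineProperty(this, 'owner', {
    value: owner,
    writable: false,     // assignment is rejected
    enumerable: false,   // hidden from Object.keys and for...in
    configurable: false  // cannot be deleted or redefined
  });
}

// Tries to overwrite the property in strict mode, where a rejected
// assignment throws a TypeError instead of being silently ignored.
function tryOverwrite(obj) {
  'use strict';
  try {
    obj.owner = 'mallory';
    return true;
  } catch (e) {
    return false;
  }
}

var acct = new Account('alice');

console.log(tryOverwrite(acct)); // false
console.log(acct.owner);         // 'alice'
console.log(Object.keys(acct));  // []
```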

There is another approach to the problem, without prototypes.

function F1() {
  if (!(this instanceof F1)) return; // Protect against global scope pollution
  var XXX = 'yyy'; // Private property
  this.YYY = 'xxx'; // Public property
  function PrivateMethod() {
    return XXX + this.YYY;
  }
  this.method1 = function() {
    return PrivateMethod.call(this); // pass the instance along as this
  };
}

With the approach shown above you have both private and public methods and properties, with a workable instanceof.
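A quick way to convince yourself is to run a self-contained variant of this constructor and check each claim. The sketch below repeats the pattern under the hypothetical name F2 so that it runs on its own. It calls the private function with call(this), an assumption made here so that this.YYY resolves to the instance.

```javascript
function F2() {
  if (!(this instanceof F2)) return; // called without new: do nothing
  var XXX = 'yyy';   // private: lives only in this closure
  this.YYY = 'xxx';  // public
  function privateMethod() {
    return XXX + this.YYY;
  }
  this.method1 = function () {
    // call() passes the instance along, so this.YYY resolves inside privateMethod
    return privateMethod.call(this);
  };
}

var x = new F2();

console.log(x.method1());     // 'yyyxxx'
console.log(x.XXX);           // undefined: the private property is not exposed
console.log(x.privateMethod); // undefined: neither is the private method
console.log(x instanceof F2); // true
```

The trade-off compared with the prototype version is that method1 is created anew for every instance instead of being shared through the prototype.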

Let’s Encrypt 2016 In Review

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org//2017/01/06/le-2016-in-review.html

Our first full year as a live CA was an exciting one. I’m incredibly proud of what our team and community accomplished during 2016. I’d like to share some thoughts about how we’ve changed, what we’ve accomplished, and what we’ve learned.

At the start of 2016, Let’s Encrypt certificates had been available to the public for less than a month and we were supporting approximately 240,000 active (unexpired) certificates. That seemed like a lot at the time! Now we’re frequently issuing that many new certificates in a single day while supporting more than 20,000,000 active certificates in total. We’ve issued more than a million certificates in a single day a few times recently. We’re currently serving an average of 6,700 OCSP responses per second. We’ve done a lot of optimization work, we’ve had to add some hardware, and there have been some long nights for our staff, but we’ve been able to keep up and we’re ready for another year of strong growth.

Let's Encrypt certificate issuance statistics.

We added a number of new features during the past year, including support for the ACME DNS challenge, ECDSA signing, IPv6, and Internationalized Domain Names.

When 2016 started, our root certificate had not been accepted into any major root programs. Today we’ve been accepted into the Mozilla, Apple, and Google root programs. We’re close to announcing acceptance into another major root program. These are major steps towards being able to operate as an independent CA. You can read more about why here.

The ACME protocol for issuing and managing certificates is at the heart of how Let’s Encrypt works. Having a well-defined and heavily audited specification developed in public on a standards track has been a major contributor to our growth and the growth of our client ecosystem. Great progress was made in 2016 towards standardizing ACME in the IETF ACME working group. We’re hoping for a final document around the end of Q2 2017, and we’ll announce plans for implementation of the updated protocol around that time as well.

Supporting the kind of growth we saw in 2016 meant adding staff, and during the past year Internet Security Research Group (ISRG), the non-profit entity behind Let’s Encrypt, went from four full-time employees to nine. We’re still a pretty small crew given that we’re now one of the largest CAs in the world (if not the largest), but it works because of our intense focus on automation, the fact that we’ve been able to hire great people, and because of the incredible support we receive from the Let’s Encrypt community.

Let’s Encrypt exists in order to help create a 100% encrypted Web. Our own metrics can be interesting, but they’re only really meaningful in terms of the impact they have on progress towards a more secure and privacy-respecting Web. The metric we use to track progress towards that goal is the percentage of page loads using HTTPS, as seen by browsers. According to Firefox Telemetry, the Web has gone from approximately 39% of page loads using HTTPS each day to just about 49% during the past year. We’re incredibly close to a Web that is more encrypted than not. We’re proud to have been a big part of that, but we can’t take credit for all of it. Many people and organizations around the globe have come to realize that we need to invest in a more secure and privacy-respecting Web, and have taken steps to secure their own sites as well as their customers’. Thank you to everyone that has advocated for HTTPS this year, or helped to make it easier for people to make the switch.

We learned some lessons this year. When we had service interruptions they were usually related to managing the rapidly growing database backing our CA. Also, while most of our code had proper tests, some small pieces didn’t and that led to incidents that shouldn’t have happened. That said, I’m proud of the way we handle incidents promptly, including quick and transparent public disclosure.

We also learned a lot about our client ecosystem. At the beginning of 2016, ISRG / Let’s Encrypt provided client software called letsencrypt. We’ve always known that we would never be able to produce software that would work for every Web server/stack, but we felt that we needed to offer a client that would work well for a large number of people and that could act as a reference client. By March of 2016, earlier than we had foreseen, it had become clear that our community was up to the task of creating a wide range of quality clients, and that our energy would be better spent fostering that community than producing our own client. That’s when we made the decision to hand off development of our client to the Electronic Frontier Foundation (EFF). EFF renamed the client to Certbot and has been doing an excellent job maintaining and improving it as one of many client options.

As exciting as 2016 was for Let’s Encrypt and encryption on the Web, 2017 seems set to be an even more incredible year. Much of the infrastructure and many of the plans necessary for a 100% encrypted Web came into being or solidified in 2016. More and more hosting providers and CDNs are supporting HTTPS with one click or by default, often without additional fees. It has never been easier for people and organizations running their own sites to find the tools, services, and information they need to move to HTTPS. Browsers are planning to update their user interfaces to better reflect the risks associated with non-secure connections.

We’d like to thank our community, including our sponsors, for making everything we did this past year possible. Please consider getting involved or making a donation, and if your company or organization would like to sponsor Let’s Encrypt please email us at [email protected].

Month in Review: December 2016

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/month-in-review-december-2016/

Another month of big data solutions on the Big Data Blog.

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. In this post, walk through the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.

Amazon Redshift Engineering’s Advanced Table Design Playbook
Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. In practice, the best way to improve query performance by orders of magnitude is by tuning Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings and configuring tables for data durability and recovery purposes.

Interactive Analysis of Genomic Datasets Using Amazon Athena
In this post, learn to prepare genomic data for analysis with Amazon Athena. We’ll demonstrate how Athena is well-adapted to address common genomics query paradigms using the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study. Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

Joining and Enriching Streaming Data on Amazon Kinesis
In this blog post, learn three approaches for joining and enriching streaming data on Amazon Kinesis Streams by using Amazon Kinesis Analytics, AWS Lambda, and Amazon DynamoDB.

Using SaltStack to Run Commands in Parallel on Amazon EMR
SaltStack is an open source project for automation and configuration management. It started as a remote execution engine designed to scale to many machines while delivering high-speed execution. You can now use the new bootstrap action that installs SaltStack on Amazon EMR. It provides a basic configuration that enables selective targeting of the nodes based on instance roles, instance groups, and other parameters.

Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway
Amazon Game Studios’ new title Breakaway is an online 4v4 team battle sport that delivers fast action, teamwork, and competition. In this post, learn the technical details of how the Breakaway team uses AWS to collect, process, and analyze gameplay telemetry to answer questions about arena design.

Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events
With new support for Amazon EMR in Amazon CloudWatch Events, you can be notified quickly and programmatically respond to state changes in your EMR clusters. Additionally, these events are also displayed in the Amazon EMR console. CloudWatch Events allows you to create filters and rules to match these events and route them to Amazon SNS topics, AWS Lambda functions, Amazon SQS queues, streams in Amazon Kinesis Streams, or built-in targets.

Run Jupyter Notebook and JupyterHub on Amazon EMR
Data scientists who run Jupyter and JupyterHub on Amazon EMR can use Python, R, Julia, and Scala to process, analyze, and visualize big data stored in Amazon S3. Jupyter notebooks can be saved to S3 automatically, so users can shut down and launch new EMR clusters, as needed. See how EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets.

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
In this post, see how you can build a business intelligence capability for streaming IoT device data using AWS serverless and managed services. You can be up and running in minutes―starting small, but able to easily grow to millions of devices and billions of messages.

Serving Real-Time Machine Learning Predictions on Amazon EMR
The typical progression for creating and using a trained model for recommendations falls into two general areas: training the model and hosting the model. Model training has become a well-known standard practice. In this post, we highlight one way to host those recommendations using Amazon EMR with JobServer.

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
In this post, learn to generate a predictive model for flight delays that can be used to help pick the flight least likely to add to your travel stress. To accomplish this, you’ll use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models.


Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR
Sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. Sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API. This short post shows you how to run RStudio and sparklyr on EMR.

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

IoT saves lives but infosec wants to change that

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/12/iot-saves-lives.html

The cybersecurity industry mocks/criticizes IoT. That’s because they are evil and wrong. IoT saves lives. This was demonstrated a couple weeks ago when a terrorist attempted to drive a truck through a Christmas market in Germany. The truck has an Internet-connected braking system (firmware updates, configuration, telemetry). When it detected the collision, it deployed the brakes, bringing the truck to a stop. Injuries and deaths were a 10th of the similar Nice truck attack earlier in the year.

All the trucks shipped by Scania in the last five years have had mobile phone connectivity to the Internet. Scania pulls back telemetry from trucks, for the purposes of improving drivers, but also to help improve the computerized features of the trucks. They put everything under the microscope, such as how to improve air conditioning to make the trucks more environmentally friendly.

Among their features is the “Autonomous Emergency Braking” system. This is the system that saved lives in Germany.

You can read up on these features on their website, or in their annual report [*].

My point is this: the cybersecurity industry is a bunch of police-state fetishists that want to stop innovation, to solve the “security” problem first before allowing innovation to continue. This will only cost lives. Yes, we desperately need to solve the problem. Almost certainly, the Scania system can trivially be hacked by mediocre hackers. But if Scania had waited first to secure its system before rolling it out in trucks, many more people would now be dead in Germany. Don’t listen to cybersecurity professionals who want to stop the IoT revolution — they just don’t care if people die.

Update: Many, such as the first comment, point out that the emergency brakes operate independently of the Internet connection, thus disproving this post.

That’s silly. That’s the case of all IoT devices. The toaster still toasts without Internet. The surveillance camera still records video without Internet. My car, which also has emergency brakes, still stops. In almost no IoT is the Internet connectivity integral to the day-to-day operation. Instead, Internet connectivity is for things like configuration, telemetry, and downloading firmware updates — as in the case of Scania.

While the brakes don’t make their decision based on the current connectivity, connectivity is nonetheless essential to the equation. Scania monitors its fleet of 170,000 trucks and uses that information to make trucks, including braking systems, better.

My car is no more or less Internet connected than the Scania truck, yet hackers have released exploits at hacking conferences for it, and it’s listed as a classic example of an IoT device. Before you say a Scania truck isn’t an IoT device, you first have to get all those other hackers to stop calling my car an IoT device.

Monetize your APIs in AWS Marketplace using API Gateway

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/monetize-your-apis-in-aws-marketplace-using-api-gateway/

Shiva Krishnamurthy, Sr. Product Manager

Amazon API Gateway helps you quickly build highly scalable, secure, and robust APIs. Today, we are announcing an integration of API Gateway with AWS Marketplace. You can now easily monetize your APIs built with API Gateway, market them directly to AWS customers, and reuse AWS bill calculation and collection mechanisms.

AWS Marketplace offers over 3,500 software listings across 35 product categories and has over 100K active customers. With the recent announcement of SaaS Subscriptions, API sellers can, for the first time, take advantage of the full suite of Marketplace features, including customer acquisition, unified billing, and reporting. For AWS customers, this means that they can now subscribe to API products through AWS Marketplace and pay on an existing AWS bill. This gives you direct access to the AWS customer base.

To get started, identify the API on API Gateway that you want to sell on AWS Marketplace. Next, package that API into usage plans. Usage plans allow you to set throttling limits and quotas to your APIs and allow you to control third-party usage of your API. You can create multiple usage plans with different limits (e.g., Silver, Gold, Platinum) and offer them as different API products on AWS Marketplace.
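As a rough illustration of the tiering described above, the sketch below builds request parameters for three usage plans. The tier names come from the text; every numeric limit, the API ID, the stage name, and the `PetStore-` naming scheme are hypothetical. The dictionaries are shaped to match the keyword arguments of boto3's `create_usage_plan` call for API Gateway, but nothing here contacts AWS.

```python
# Sketch only: tier names are from the post; all limits, the API ID,
# and the stage name are made-up values for illustration.

def usage_plan_params(tier, rate_limit, burst_limit, monthly_quota,
                      api_id="a1b2c3d4e5", stage="prod"):
    """Build keyword arguments shaped for API Gateway's create_usage_plan."""
    return {
        "name": f"PetStore-{tier}",
        "description": f"{tier} tier of the Pet Store API",
        "apiStages": [{"apiId": api_id, "stage": stage}],
        # Throttling: steady-state requests per second and burst size.
        "throttle": {"rateLimit": float(rate_limit), "burstLimit": burst_limit},
        # Quota: total requests allowed per calendar month.
        "quota": {"limit": monthly_quota, "period": "MONTH"},
    }

# (rate limit, burst limit, monthly quota) per tier, all hypothetical.
TIERS = {
    "Silver": (10, 20, 100_000),
    "Gold": (50, 100, 1_000_000),
    "Platinum": (200, 400, 10_000_000),
}

plans = [usage_plan_params(tier, *limits) for tier, limits in TIERS.items()]

for plan in plans:
    print(plan["name"], plan["throttle"], plan["quota"])
    # With AWS credentials configured, each dict could be passed as
    # boto3.client("apigateway").create_usage_plan(**plan).
```

Keeping the limits in one table makes it easy to review what each product on AWS Marketplace actually grants before any plan is created.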

Let’s suppose that you offer a Pet Store API managed by API Gateway and you want to start selling it through AWS Marketplace. You must offer a developer portal: a website that you maintain and that allows new customers to register an account using AWS-provided billing identifiers. Also, the portal needs to provide registered customers with access to your APIs during and after purchase.

To help you get started, we have created a reference implementation of a developer portal application. You can use this implementation to create a developer portal from scratch, or use it as a reference guide to integrate API Gateway into an existing developer portal that you already operate. For a detailed walkthrough on setting up a developer portal using our reference application, see (Generate Your Own API Gateway Developer Portal).

After you have set up your developer portal, register as a seller with the AWS Marketplace. After registration, submit a product load form to list your product for sale. In this step, you describe your API product, define pricing, and submit AWS account IDs to be used to test the subscription. You also submit the URL of your developer portal. Currently, API Gateway only supports “per request” pricing models.

After you have registered as a seller, you are given an AWS Marketplace product code. Log in to the API Gateway console to associate this product code with the corresponding usage plan on API Gateway. This tells API Gateway to send telemetry records to AWS Marketplace when your API is used.


After the product code is associated, test the “end user” flow by subscribing to your API products using the AWS IDs that you submitted via the Marketplace; verify the proper functionality. When you’ve finished verifying, submit your product for final approval using instructions provided in the Seller Guide.

Visit here to learn more about this feature.

IT Governance in a Dynamic DevOps Environment

Post Syndicated from Shashi Prabhakar original https://aws.amazon.com/blogs/devops/it-governance-in-a-dynamic-devops-environment/

Governance involves the alignment of security and operations with productivity to ensure a company achieves its business goals. Customers who are migrating to the cloud might be in various stages of implementing governance. Each stage poses its own challenges. In this blog post, the first in a series, I will discuss a four-step approach to automating governance with AWS services.

Governance and the DevOps Environment
Developers with a DevOps and agile mindset are responsible for building and operating services. They often rely on a central security team to develop and apply policies, seek security reviews and approvals, or implement best practices.

These policies and rules are not strictly enforced by the security team. They are treated as guidelines that developers can follow to get the much-desired flexibility from using AWS. However, due to time constraints or lack of awareness, developers may not always follow best practices and standards. If these best practices and rules were strictly enforced, the security team could become a bottleneck.

For customers migrating to AWS, the automated governance mechanisms described in this post will preserve flexibility for developers while providing controls for the security team.

These are some common challenges in a dynamic development environment:

·      Shortcuts taken to accomplish tasks quickly, such as hardcoding credentials in code.

·      Cost management (for example, controlling the type of instance launched).

·      Knowledge transfer.

·      Manual processes.

Steps to Governance
Here is a four-step approach to automating governance:

At initial setup, you want to implement some (1) controls for high-risk actions. After they are in place, you need to (2) monitor your environment to make sure you have configured resources correctly. Monitoring will help you discover issues you want to (3) fix as soon as possible. You’ll also want to regularly produce an (4) audit report that shows everything is compliant.

The example in this post helps illustrate the four-step approach: A central IT team allows its Big Data team to run a test environment of Amazon EMR clusters. The team runs the EMR job with 100 t2.medium instances, but when a team member spins up 100 r3.8xlarge instances to complete the job more quickly, the business incurs an unexpected expense.

The central IT team cares about governance and implements a few measures to prevent this from happening again:

·      Control elements: The team uses CloudFormation to restrict the number and type of instances and AWS Identity and Access Management to allow only a certain group to modify the EMR cluster.

·      Monitor elements: The team uses tagging, AWS Config, and AWS Trusted Advisor to monitor the instance limit and determine if anyone exceeded the number of allowed instances.

·      Fix: The team creates a custom Config rule to terminate instances that are not of the type specified.

·      Audit: The team reviews the lifecycle of the EMR instance in AWS Config.



You can prevent mistakes by standardizing configurations (through AWS CloudFormation), restricting configuration options (through AWS Service Catalog), and controlling permissions (through IAM).

AWS CloudFormation helps you control the workflow environment in a single package. In this example, we use a CloudFormation template to restrict the number and type of instances and tagging to control the environment.

For example, the team can prevent the choice of r3.8xlarge instances by using CloudFormation with a fixed instance type and a fixed number of instances (100).

CloudFormation Template Sample

EMR cluster with tag:

{
"Type" : "AWS::EMR::Cluster",
"Properties" : {
"AdditionalInfo" : JSON object,
"Applications" : [ Applications, … ],
"BootstrapActions" : [ Bootstrap Actions, … ],
"Configurations" : [ Configurations, … ],
"Instances" : JobFlowInstancesConfig,
"JobFlowRole" : String,
"LogUri" : String,
"Name" : String,
"ReleaseLabel" : String,
"ServiceRole" : String,
"Tags" : [ Resource Tag, … ],
"VisibleToAllUsers" : Boolean
}
}
EMR cluster JobFlowInstancesConfig InstanceGroupConfig with fixed instance type and number:

{
"BidPrice" : String,
"Configurations" : [ Configuration, … ],
"EbsConfiguration" : EBSConfiguration,
"InstanceCount" : Integer,
"InstanceType" : String,
"Market" : String,
"Name" : String
}
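To make the idea of a fixed instance type and count concrete, here is a minimal sketch that generates such a template in Python. The instance type (t2.medium) and total count (100) come from the example in this post; the split into one master plus 99 core instances, the release label, the role names, the tag, and the resource and cluster names are assumptions for illustration. Because the template exposes no parameters, whoever launches the stack cannot pick a different type or count.

```python
import json

INSTANCE_TYPE = "t2.medium"   # fixed type from the example in this post
TOTAL_INSTANCES = 100         # fixed count from the example in this post


def instance_group(name, count):
    """One InstanceGroupConfig with the type and count hardcoded."""
    return {
        "Name": name,
        "InstanceType": INSTANCE_TYPE,
        "InstanceCount": count,
        "Market": "ON_DEMAND",
    }


def build_emr_template(cluster_name="bigdata-test-cluster"):
    """Return a CloudFormation template dict that exposes no parameters."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "TestCluster": {                      # hypothetical logical ID
                "Type": "AWS::EMR::Cluster",
                "Properties": {
                    "Name": cluster_name,
                    "ReleaseLabel": "emr-5.2.0",            # assumed release
                    "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed role name
                    "ServiceRole": "EMR_DefaultRole",       # assumed role name
                    "Instances": {
                        # Assumed split: 1 master + 99 core = 100 instances.
                        "MasterInstanceGroup": instance_group("Master", 1),
                        "CoreInstanceGroup": instance_group(
                            "Core", TOTAL_INSTANCES - 1),
                    },
                    "Tags": [{"Key": "Team", "Value": "BigData"}],  # assumed tag
                    "VisibleToAllUsers": True,
                },
            }
        },
    }


template = build_emr_template()
print(json.dumps(template, indent=2))
```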

AWS Service Catalog can be used to distribute approved products (servers, databases, websites) in AWS. This gives IT administrators more flexibility in terms of which user can access which products. It also gives them the ability to enforce compliance based on business standards.

AWS IAM is used to control which users can access which AWS services and resources. By using IAM roles, you can avoid the use of root credentials in your code to access AWS resources.

In this example, we give the team lead full EMR access, including console and API access (not covered here), and give developers read-only access with no console access. If a developer wants to run the job, the developer just needs PEM files.

IAM Policy
This policy is for the team lead with full EMR access:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [ … ],
"Resource": "*"
}
]
}
This policy is for developers with read-only access:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [ … ],
"Resource": "*"
}
]
}
These are IAM managed policies. If you want to change the permissions, you can create your own IAM custom policy.
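If you do write a custom policy, a read-only variant might look like the sketch below. This is not the content of the managed policies shown above; the two wildcard actions are an assumption about what "read-only" should cover for EMR, and you would narrow `Resource` in practice.

```python
import json


def read_only_emr_policy():
    """Sketch of a custom policy granting only describe/list calls on EMR."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed action set: read-style EMR calls only.
                "Action": [
                    "elasticmapreduce:Describe*",
                    "elasticmapreduce:List*",
                ],
                "Resource": "*",
            }
        ],
    }


policy = read_only_emr_policy()
print(json.dumps(policy, indent=2))
```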



Use logs available from AWS CloudTrail, Amazon CloudWatch, Amazon VPC, Amazon S3, and Elastic Load Balancing as much as possible. You can use AWS Config, Trusted Advisor, and CloudWatch events and alarms to monitor these logs.

AWS CloudTrail can be used to log API calls in AWS. It helps you fix problems, secure your environment, and produce audit reports. For example, you could use CloudTrail logs to identify who launched those r3.8xlarge instances.
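The sketch below shows one way that lookup could work on a downloaded CloudTrail log file. The top-level `Records` array and the `eventTime`, `eventName`, and `userIdentity` fields are standard CloudTrail record fields; the sample records are hand-written, and the layout of `requestParameters` is an assumption, which is why the code searches for `instanceType` at any depth rather than relying on a fixed path.

```python
import json

TARGET_TYPE = "r3.8xlarge"


def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def who_launched(log_text, instance_type=TARGET_TYPE):
    """Return sorted (eventTime, user ARN, eventName) tuples for matching calls."""
    hits = []
    for record in json.loads(log_text).get("Records", []):
        params = record.get("requestParameters") or {}
        if instance_type in find_values(params, "instanceType"):
            hits.append((record["eventTime"],
                         record["userIdentity"].get("arn", "unknown"),
                         record["eventName"]))
    return sorted(hits)


# Hand-written sample records; account ID and user names are made up.
SAMPLE_LOG = json.dumps({"Records": [
    {"eventTime": "2016-09-01T10:15:00Z", "eventName": "RunJobFlow",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
     "requestParameters": {"instances": {"instanceGroups": [
         {"instanceType": "t2.medium", "instanceCount": 100}]}}},
    {"eventTime": "2016-09-02T08:30:00Z", "eventName": "RunJobFlow",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
     "requestParameters": {"instances": {"instanceGroups": [
         {"instanceType": "r3.8xlarge", "instanceCount": 100}]}}},
    {"eventTime": "2016-09-02T08:31:00Z", "eventName": "DescribeCluster",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
     "requestParameters": None},
]})

for when, arn, action in who_launched(SAMPLE_LOG):
    print(when, arn, action)
```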


AWS Config can be used to keep track of and act on rules. Config rules check the configuration of your AWS resources for compliance. You’ll also get, at a glance, the compliance status of your environment based on the rules you configured.

Amazon CloudWatch can be used to monitor and alarm on incorrectly configured resources. CloudWatch entities–metrics, alarms, logs, and events–help you monitor your AWS resources. Using metrics (including custom metrics), you can monitor resources and get a dashboard with customizable widgets. CloudWatch Logs can be used to stream data from AWS-provided logs in addition to your system logs, which is helpful for fixing and auditing.

CloudWatch Events help you take actions on changes. VPC flow, S3, and ELB logs provide you with data to make smarter decisions when fixing problems or optimizing your environment.

AWS Trusted Advisor analyzes your AWS environment and provides best practice recommendations in four categories: cost, performance, security, and fault tolerance. This online resource optimization tool also includes AWS limit warnings.

We will use Trusted Advisor to make sure a service limit is not going to become a bottleneck in launching 100 instances:

Trusted Advisor


Depending on the violation and your ability to monitor and view the resource configuration, you might want to take action when you find an incorrectly configured resource that will lead to a security violation. It’s important the fix doesn’t result in unwanted consequences and that you maintain an auditable record of the actions you performed.


You can use AWS Lambda to automate everything. When you use Lambda with Amazon CloudWatch Events to fix issues, you can take action on an instance termination event or the addition of a new instance to an Auto Scaling group. You can take an action on any AWS API call by selecting it as source. You can also use AWS Config managed rules and custom rules with remediation. While you are getting informed about the environment based on AWS Config rules, you can use AWS Lambda to take action on top of these rules. This helps in automating the fixes.

AWS Config to Find Running Instance Type

To fix the problem in our use case, you can implement a custom Config rule and have it trigger an action (for example, shutting down the instances if the instance type is larger than .xlarge, or tearing down the EMR cluster).
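The evaluation logic of such a custom rule could be sketched as follows. The event fields used here (`invokingEvent`, `ruleParameters`, and the `configurationItem` inside the invoking event) are what AWS Config passes to a custom rule's Lambda function; the rule parameter name `allowedInstanceType`, the sample resource ID, and the choice to compare against a single allowed type rather than a size threshold are assumptions. The function only builds the evaluation; it makes no AWS calls.

```python
import json


def evaluate_instance_type(event):
    """Build one AWS Config evaluation for the configuration item in `event`."""
    invoking_event = json.loads(event["invokingEvent"])
    rule_parameters = json.loads(event.get("ruleParameters") or "{}")
    item = invoking_event["configurationItem"]

    # Hypothetical rule parameter; defaults to the type used in this post.
    allowed = rule_parameters.get("allowedInstanceType", "t2.medium")

    if item["resourceType"] != "AWS::EC2::Instance":
        compliance = "NOT_APPLICABLE"
    elif item["configuration"]["instanceType"] == allowed:
        compliance = "COMPLIANT"
    else:
        compliance = "NON_COMPLIANT"

    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


def sample_event(instance_type):
    """Hand-written event in the shape Config sends to a custom rule."""
    return {
        "invokingEvent": json.dumps({"configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "resourceId": "i-0123456789abcdef0",          # made-up ID
            "configurationItemCaptureTime": "2016-09-02T08:35:00.000Z",
            "configuration": {"instanceType": instance_type},
        }}),
        "ruleParameters": json.dumps({"allowedInstanceType": "t2.medium"}),
        "resultToken": "sample-token",
    }


for instance_type in ("t2.medium", "r3.8xlarge"):
    result = evaluate_instance_type(sample_event(instance_type))
    print(instance_type, "->", result["ComplianceType"])
```

Inside a real Lambda function, the returned dictionary would be reported back with the Config client's `put_evaluations` call, using the event's `resultToken`, and a NON_COMPLIANT result could then drive the shutdown step.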



You’ll want to have a report ready for the auditor at the end of the year or quarter. You can automate your reporting system using AWS Config resources.

You can view AWS resource configurations and history so you can see when the r3.8xlarge instance cluster was launched or which security group was attached. You can even search for deleted or terminated instances.

AWS Config Resources



More Control, Monitor, and Fix Examples
Armando Leite from AWS Professional Services has created a sample governance framework that leverages CloudWatch Events and AWS Lambda to enforce a set of controls (flows between layers, no OS root access, no remote logins). When a deviation is noted (monitoring), automated action is taken to respond to an event and, if necessary, recover to a known good state (fix).

·      Remediate (for example, shut down the instance) through custom Config rules or a CloudWatch event to trigger the workflow.

·      Monitor a user’s OS activity and escalation to root access. As events unfold, new Lambda functions dynamically enable more logs and subscribe to log data for further live analysis.

·      If the telemetry indicates it’s appropriate, restore the system to a known good state.

Court: Uploaded Can’t Ignore ‘Spam’ Copyright Notices

Post Syndicated from Ernesto original https://torrentfreak.com/court-uploaded-cant-ignore-spam-copyright-notices-160925/

With millions of visitors per month, Uploaded is one of the largest file-hosting services on the Internet.

Like many of its ‘cloud hosting’ competitors, the service is also used to share copyright infringing material, which is a thorn in the side of various copyright holder groups.

In Germany this has resulted in several lawsuits, where copyright holders want Uploaded to be held liable for files that are shared by its users, if they fail to respond properly to takedown notices.

In one of these cases the court has now clarified that Uploaded can even be held liable for copyright infringement if those messages are never read, due to an overactive ‘spam’ filter.

The case was started several years ago by anti-piracy company proMedia GmbH, which sent a takedown notice to Uploaded on behalf of a record label. The notice asked the site to remove a specific file, but that never happened.

Uploaded, in its defense, argued that it had never seen the takedown notice. It was flagged by its DDoS protection system and directly sent to a spam folder. As such, the notice in question was never read.

Last week the Higher Regional Court of Hamburg ruled on the case. It affirmed an earlier ruling from the Regional Court of Hamburg, concluding that Uploaded can be held liable, spam or no spam.

The court argued that because Uploaded’s anti-DDoS system willingly created a “cemetery for emails,” they can be treated as if they had knowledge of the takedown notices. This decision is now final.

Anja Heller, attorney at the German law firm Rasch which represented the plaintiff, is happy with the outcome in this case.

“File-hosters have to handle their inboxes for abuse notices very carefully. On the one hand they cannot easily claim not having received a notice. On the other hand they have to check blacklisted notices on a regular basis,” she says.

The current lawsuit dealt exclusively with the liability question, so the copyright owners have to file a separate proceeding if they want to obtain damages.

However, together with previous liability verdicts, it definitely makes it harder for file-hosters to operate without a solid anti-piracy strategy.

“The verdict once again shows the high demands German courts place on file-hosters, which are not only obliged to take down links immediately after having received a notice, but have to keep their services clean from infringing content as well,” Heller concludes.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Vote for the top 20 Raspberry Pi projects in The MagPi!

Post Syndicated from Rob Zwetsloot original https://www.raspberrypi.org/blog/vote-top-20-raspberry-pi-projects-magpi/

Although this Thursday will see the release of issue 49 of The MagPi, we’re already hard at work putting together our 50th issue spectacular. As part of this issue we’re going to be covering 50 of the best Raspberry Pi projects ever and we want you, the community, to vote for the top 20.

Below we have listed the 30 projects that we think represent the best of the best. All we ask is that you vote for your favourite. We will have a few special categories with some other amazing projects in the final article, but if you think we’ve missed out something truly excellent, let us know in the comments. Here’s the list so you can remind yourselves of the projects, with the poll posted at the bottom.

From paper boats to hybrid sports cars

  1. SeeMore – a huge sculpture of 256 Raspberry Pis connected as a cluster
  2. BeetBox – beets (vegetable) you can use to play sick beats (music)
  3. Voyage – 300 paper boats (actually polypropylene) span a river, and you control how they light up
  4. Aquarium – a huge aquarium with Pi-powered weather control simulating the environment of the Cayman Islands
  5. ramanPi – a Raman spectrometer used to identify different types of molecules
  6. Joytone – an electronic musical instrument operated by 72 back-lit joysticks
  7. Internet of LEGO – a city of LEGO, connected to and controlled by the internet
  8. McMaster Formula Hybrid – a Raspberry Pi provides telemetry on this hybrid racing car
  9. PiGRRL – Adafruit show us how to make an upgraded, 3D-printed Game Boy
  10. Magic Mirror – check out how you look while getting some at-a-glance info about your day

Dinosaurs, space, and modern art

  1. 4bot – play a game of Connect 4 with a Raspberry Pi robot
  2. Blackgang Chine dinosaurs – these theme park attractions use the diminutive Pi to make them larger than life
  3. Sound Fighter – challenge your friend to the ultimate Street Fight, controlled by pianos
  4. Astro Pi – Raspberry Pis go to space with code written by school kids
  5. Pi in the Sky – Raspberry Pis go to near space and send back live images
  6. BrewPi – a microbrewery controlled by a micro-computer
  7. LED Mirror – a sci-fi effect comes to life as you’re represented on a wall of lights
  8. Raspberry Pi VCR – a retro VCR-player is turned into a pink media playing machine
  9. #OZWall – Contemporary art in the form of many TVs from throughout the ages
  10. #HiutMusic – you choose the music for a Welsh denim factory through Twitter

Robots and arcade machines make the cut

  1. CandyPi – control a jelly bean dispenser from your browser without the need to twist the dial
  2. Digital Zoetrope – still images rotated to create animation, updated for the 21st century
  3. LifeBox – create virtual life inside this box and watch it adapt and survive
  4. Coffee Table Pi – classy coffee table by name, arcade cabinet by nature. Tea and Pac-Man, anyone?
  5. Raspberry Pi Notebook – this handheld Raspberry Pi is many people’s dream machine
  6. Pip-Boy 3000A – turn life into a Bethesda RPG with this custom Pip-Boy
  7. Mason Jar Preserve – Mason jars are used to preserve things, so this one is a beautiful backup server to preserve your data
  8. Pi glass – Google Glass may be gone but you can still make your own amazing Raspberry Pi facsimile
  9. DoodleBorg – a powerful PiBorg robot that can tow a caravan
  10. BigHak – a Big Trak that is truly big: it’s large enough for you to ride in

Now you’ve refreshed your memory of all these amazing projects, it’s time to vote for the one you think is best!

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

The vote is running over the next two weeks, and the results will be in The MagPi 50. We’ll see you again on Thursday for the release of the excellent MagPi 49: don’t miss it!

The post Vote for the top 20 Raspberry Pi projects in The MagPi! appeared first on Raspberry Pi.

Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1

Post Syndicated from Ryan Nienhuis original https://blogs.aws.amazon.com/bigdata/post/Tx2D4GLDJXPKHOY/Writing-SQL-on-Streaming-Data-with-Amazon-Kinesis-Analytics-Part-1

Ryan Nienhuis is a Senior Product Manager for Amazon Kinesis

This is the first of two AWS Big Data blog posts on Writing SQL on Streaming Data with Amazon Kinesis Analytics. In this post, I provide an overview of streaming data and key concepts like the basics of streaming SQL, and complete a walkthrough using a simple example. In the next post, I will cover more advanced stream processing concepts using Amazon Kinesis Analytics.

Most organizations use batch data processing to perform their analytics in daily or hourly intervals to inform their business decisions and improve their customer experiences. However, you can derive significantly more value from your data if you are able to process and react in real time. Indeed, the value of insights in your data can decline rapidly over time – the faster you react, the better. For example:

  • Analyzing your company’s key performance indicators over the last 24 hours is a better reflection of your current business than analyzing last month’s metrics.
  • Reacting to an operational event as it is happening is far more valuable than discovering a week later that the event occurred. 
  • Identifying that a customer is unable to complete a purchase on your ecommerce site so you can assist them in completing the order is much better than finding out next week that they were unable to complete the transaction.

Real-time insights are extremely valuable, but difficult to extract from streaming data. Processing data in real time can be difficult because it needs to be done quickly and continuously to keep up with the speed at which the data is produced. In addition, the analysis may require data to be processed in the same order in which it was generated for accurate results, which can be hard due to the distributed nature of the data.

Because of these complexities, people start by implementing simple applications that perform streaming ETL, such as collecting, validating, and normalizing log data across different applications. Some then progress to basic processing like rolling min-max computations, while a select few implement sophisticated processing such as anomaly detection or correlating events by user sessions.  With each step, more and more value is extracted from the data but the difficulty level also increases.

With the launch of Amazon Kinesis Analytics, you can now easily write SQL on streaming data, providing a powerful way to build a stream processing application in minutes. The service allows you to connect to streaming data sources, process the data with sub-second latencies, and continuously emit results to downstream destinations for use in real-time alerts, dashboards, or further analysis.

This post introduces Amazon Kinesis Analytics and the fundamentals of writing ANSI-standard SQL over streaming data, then works through a simple example application that continuously generates metrics over time windows.

What is streaming data?

Today, data is generated continuously from a large variety of sources, including clickstream data from mobile and web applications, ecommerce transactions, application logs from servers, telemetry from connected devices, and many other sources.

Typically, hundreds to millions of these sources create data that is usually small (order of kilobytes) and occurs in a sequence. For example, your ecommerce site has thousands of individuals concurrently interacting with the site, each generating a sequence of events based upon their activity (click product, add to cart, purchase product, etc.). When these sequences are captured continuously from these sources as events occur, the data is categorized as streaming data.

Amazon Kinesis Streams

Capturing event data with low latency and durably storing it in a highly available, scalable data store, such as Amazon Kinesis Streams, is the foundation for streaming data. Streams enables you to capture and store data for ordered, replayable, real-time processing using a streaming application. You configure your data sources to emit data into the stream, then build applications that read and process data from that stream in real-time. To build your applications, you can use the Amazon Kinesis Client Library (KCL), AWS Lambda, Apache Storm, and a number of other solutions, including Amazon Kinesis Analytics.

Amazon Kinesis Firehose

One of the more common use cases for streaming data is to capture it and then load it into a cloud storage service, a database, or another analytics service. Amazon Kinesis Firehose is a fully managed service that offers an easy-to-use solution to collect and deliver streaming data to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.

With Firehose, you create delivery streams using the AWS Management Console to specify your destinations of choice and choose from configuration options that allow you to batch, compress, and encrypt your data before it is loaded into the destination. From there, you set up your data sources to start sending data to the Firehose delivery stream, which loads it continuously to your destinations with no ongoing administration.
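To make the batch and compress options more concrete, the following Python sketch groups records into fixed-size batches and gzip-compresses each batch. It only illustrates the idea: the function name, the newline-delimited JSON layout, and the batch size are assumptions for this example, and Firehose does this work for you based on the delivery stream configuration.

```python
import gzip
import json

def batch_and_compress(records, batch_size):
    """Yield one gzip-compressed, newline-delimited JSON blob per batch."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        payload = "\n".join(json.dumps(r) for r in batch).encode("utf-8")
        yield gzip.compress(payload)

# Hypothetical records; three records with a batch size of two give two blobs
sample_records = [{"id": 1}, {"id": 2}, {"id": 3}]
blobs = list(batch_and_compress(sample_records, batch_size=2))
first_batch_lines = gzip.decompress(blobs[0]).decode("utf-8").splitlines()
```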

Amazon Kinesis Analytics

Amazon Kinesis Analytics provides an easy and powerful way to process and analyze streaming data with standard SQL. Using Analytics, you build applications that continuously read data from streaming sources, process it in real-time using SQL code, and emit the results downstream to your configured destinations.

An Analytics application can ingest data from Streams and Firehose. The service detects a schema associated with the data in your source for common formats, which you can further refine using an interactive schema editor. Your application’s SQL code can be anything from a simple count or average, to more advanced analytics like correlating events over time windows. You author your SQL using an interactive editor, and then test it with live streaming data.

Finally, you configure your application to emit SQL results to up to four destinations, including S3, Amazon Redshift, and Amazon Elasticsearch Service (through a Firehose delivery stream); or to an Amazon Kinesis stream. After setup, the service scales your application to handle your query complexity and streaming data throughput – you don’t have to provision or manage any servers.

Walkthrough (part 1): Run your first SQL query using Amazon Kinesis Analytics

The easiest way to understand Amazon Kinesis Analytics is to try it out. You need an AWS account to get started. After the first part, the walkthrough pauses to discuss streaming SQL and time windows in more detail; you then create a second SQL query with more metrics and an additional step for your application.

A streaming application consists of three components:

  • Streaming data sources
  • Analytics written in SQL
  • Destinations for the results

The application continuously reads data from a streaming source, generates analytics using your SQL code, and emits those results to up to four destinations. This walkthrough covers the first two components and points you in the right direction for completing an end-to-end application by adding a destination for your SQL results.

Create an Amazon Kinesis Analytics application

  1. Open the Amazon Kinesis Analytics console and choose Create a new application.

  2. Provide a name and (optional) description for your application and choose Continue.

You are taken to the application hub page.

Create a streaming data source

For input, Analytics supports Amazon Kinesis Streams and Amazon Kinesis Firehose as streaming data input, and reference data input through S3. The primary difference between these two kinds of input is that data is read continuously from streaming data sources and in a single load from reference data sources. Reference data sources are used for joining against the incoming stream to enrich the data.
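The enrichment idea can be sketched in a few lines of Python: reference data is loaded once into a lookup table, and every record that arrives on the stream is joined against it. The company names, the `company` field, and the sample events are hypothetical; in Analytics itself you would express this as a SQL join against the reference table.

```python
# Hypothetical reference data, loaded once (for example, from a file in S3)
company_names = {"AMZN": "Amazon", "MSFT": "Microsoft"}

def enrich(stream, reference):
    """Attach the company name to each streaming record as it arrives."""
    for record in stream:
        enriched = dict(record)  # copy so the source record is left untouched
        # .get returns None when the ticker has no match in the reference data
        enriched["company"] = reference.get(record["ticker_symbol"])
        yield enriched

events = [
    {"ticker_symbol": "AMZN", "price": 750.0},
    {"ticker_symbol": "XYZ", "price": 1.0},
]
enriched_events = list(enrich(events, company_names))
```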

In Amazon Kinesis Analytics, choose Connect to a source.

If you have existing Amazon Kinesis streams or Firehose delivery streams, they are shown here.

For the purposes of this post, you will be using a demo stream, which creates and populates a stream with sample data on your behalf. The demo stream is created under your account with a single shard, which supports up to 1 MB/sec of write throughput and 2 MB/sec of read throughput. Analytics will write simulated stock ticker data to the demo stream directly from your browser. Your application will read data from the stream in real time.
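When you later replace the demo stream with your own, the per-shard figures above give a quick way to estimate how many shards you need. The sketch below uses only the two throughput numbers quoted in this post and ignores any other per-shard limits, so treat the result as a lower bound; the function name and the sample workloads are made up.

```python
import math

WRITE_MB_PER_SEC_PER_SHARD = 1.0  # write throughput per shard, as quoted above
READ_MB_PER_SEC_PER_SHARD = 2.0   # read throughput per shard, as quoted above

def shards_needed(write_mb_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies both throughput figures."""
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC_PER_SHARD)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SEC_PER_SHARD)
    return max(1, by_write, by_read)

demo_like = shards_needed(0.2, 0.2)    # a light workload fits in one shard
write_heavy = shards_needed(3.5, 5.0)  # writes dominate: ceil(3.5 / 1) = 4
read_heavy = shards_needed(2.0, 9.0)   # reads dominate: ceil(9 / 2) = 5
```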

Next, choose Configure a new stream and Create demo stream.

Later, you will refer to the demo stream in your SQL code as “SOURCE_SQL_STREAM_001”. Analytics calls the DiscoverInputSchema API action, which infers a schema by sampling records from your selected input data stream. You can see the applied schema on your data in the formatted sample shown in the browser, as well as the original sample taken from the raw stream. You can then edit the schema to fine tune it to your needs.
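To get a feel for what inferring a schema from sampled records involves, here is a deliberately simple Python sketch that guesses a column type for each key in a handful of JSON records. It is not the algorithm behind DiscoverInputSchema; the type names, the widening rule, and the sample records are assumptions chosen for illustration.

```python
import json

def infer_schema(raw_records):
    """Guess a column type for each key seen in a sample of JSON records."""
    schema = {}
    for raw in raw_records:
        for key, value in json.loads(raw).items():
            if isinstance(value, bool):  # check bool first: bool is a subclass of int
                guess = "BOOLEAN"
            elif isinstance(value, int):
                guess = "INTEGER"
            elif isinstance(value, float):
                guess = "REAL"
            else:
                guess = "VARCHAR"
            # Crude widening rule: if samples disagree, fall back to VARCHAR
            if schema.get(key, guess) != guess:
                guess = "VARCHAR"
            schema[key] = guess
    return schema

# Hypothetical sample taken from a stock ticker stream
sample = [
    '{"TICKER_SYMBOL": "AMZN", "PRICE": 750.5, "CHANGE": -1.2}',
    '{"TICKER_SYMBOL": "MSFT", "PRICE": 57.6, "CHANGE": 0.3}',
]
inferred = infer_schema(sample)
```

A sampled schema is only a starting point, which is why the console lets you edit it afterwards.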

Feel free to explore; when you are ready, choose Save and continue. You are taken back to the streaming application hub.

Create a SQL query for analyzing data

On the streaming application hub, choose Go to SQL Editor and Run Application.

This SQL editor is the development environment for Amazon Kinesis Analytics. On the top portion of the screen, there is a text editor with syntax highlighting and intelligent auto-complete, as well as a number of SQL templates to help you get started. On the bottom portion of the screen, there is an area for you to explore your source data, your intermediate SQL results, and the data you are writing to your destinations. You can view the entire flow of your application here, end-to-end.

Next, choose Add SQL from Template.

Amazon Kinesis Analytics provides a number of SQL templates that work with the demo stream. Feel free to explore; when you’re ready, choose the COUNT, AVG, etc. (aggregate functions) + Tumbling time window template and choose Add SQL to Editor.

The SELECT statement in this SQL template performs a count over a 10-second tumbling window. A window is used to group rows together relative to the current row that the Amazon Kinesis Analytics application is processing.
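If tumbling windows are new to you, the following Python sketch mimics a count over a 10-second tumbling window on a small, time-ordered list of events. It is a simplified model under stated assumptions (events arrive in order, timestamps are plain epoch seconds, and the sample data is made up), not a description of how the service executes the template.

```python
from collections import Counter

WINDOW_SECONDS = 10

def tumbling_counts(events):
    """events: iterable of (epoch_seconds, ticker_symbol), in time order.
    Yields (window_start_seconds, {ticker_symbol: count}) per closed window."""
    current_window, counts = None, Counter()
    for ts, ticker in events:
        window = int(ts // WINDOW_SECONDS)
        if current_window is not None and window != current_window:
            # A row from a later window closes the current one
            yield current_window * WINDOW_SECONDS, dict(counts)
            counts = Counter()
        current_window = window
        counts[ticker] += 1
    if counts:
        # Flush the last window; a real stream never ends, so this is demo-only
        yield current_window * WINDOW_SECONDS, dict(counts)

events = [(0, "AMZN"), (3, "AMZN"), (9, "MSFT"), (10, "AMZN"), (25, "MSFT")]
windows = list(tumbling_counts(events))
```

Every event lands in exactly one window, and windows never overlap, which is the defining property of a tumbling window.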

Choose Save and run SQL. Congratulations, you just wrote your first SQL query on streaming data!

Streaming SQL with Amazon Kinesis Analytics

In a relational database, you work with tables of data, using INSERT statements to add records and SELECT statements to query the data in a table. In Amazon Kinesis Analytics, you work with in-application streams, which are similar to tables in that you can CREATE, INSERT, and SELECT from them. However, unlike a table, data is continuously inserted into an in-application stream, even while you are executing a SQL statement against it. The data in an in-application stream is therefore unbounded.

In your application code, you interact primarily with in-application streams. For instance, a source in-application stream represents your configured Amazon Kinesis stream or Firehose delivery stream in the application, which by default is named “SOURCE_SQL_STREAM_001”. A destination in-application stream represents your configured destinations, which by default is named “DESTINATION_SQL_STREAM”. When interacting with in-application streams, the following is true:

  • The SELECT statement is used in the context of an INSERT statement. That is, when you select rows from one in-application stream, you insert results into another in-application stream.
  • The INSERT statement is always used in the context of a pump. That is, you use pumps to write to an in-application stream. A pump is the mechanism used to make an INSERT statement continuous.

There are two separate SQL statements in the template you selected in the first walkthrough. The first statement creates a target in-application stream for the SQL results; the second statement creates a PUMP for inserting into that stream and includes the SELECT statement.
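One way to picture the relationship between in-application streams and pumps is the small Python model below. The class and function names are invented for this sketch, and a list stands in for what is really an unbounded stream; the point is only that the pump is what keeps applying the SELECT to new rows and inserting the results into the target.

```python
class InAppStream:
    """A named, append-only sequence of rows, standing in for an in-application stream."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

def pump(source_rows, select, target):
    """Apply `select` to every incoming row and insert each result into `target`."""
    for row in source_rows:
        result = select(row)
        if result is not None:  # a select may drop rows by returning None
            target.insert(result)

# Hypothetical source rows and a select that keeps two columns
source = [
    {"ticker_symbol": "AMZN", "price": 750.0, "sector": "TECHNOLOGY"},
    {"ticker_symbol": "XOM", "price": 88.0, "sector": "ENERGY"},
]
destination = InAppStream("DESTINATION_SQL_STREAM")
pump(iter(source), lambda r: {"ticker_symbol": r["ticker_symbol"], "price": r["price"]}, destination)
```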

Generating real-time analytics using windows

In the console, look at the SQL results from the walkthrough, which are sampled and continuously streamed to the console.

In the example application you just built, you used a 10-second tumbling time window to perform an aggregation of records. Notice the special column called ROWTIME, which represents the time a row was inserted into the first in-application stream. The ROWTIME value increments every 10 seconds with each new set of SQL results. (Some 10-second windows may not be shown in the console because we sample results on the high-speed stream.) You use this special column in your tumbling time window to help define the start and end of each result set.

Windows are important because they define the bounds within which you want your query to operate. The starting bound is usually the current row that Amazon Kinesis Analytics is processing, and the window defines the ending bound. Windows are required with any query that works across rows, because the in-application stream is unbounded and windows provide a mechanism to bound the result set and make the query deterministic. Analytics supports three types of windows: tumbling, sliding, and custom. These concepts will be covered in depth in our next blog post.
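For contrast with the tumbling window used in the walkthrough, here is a minimal Python sketch of a sliding window: each incoming row is the starting bound, and the window reaches back a fixed number of seconds from it. The function name, the sample events, and the choice of an average are assumptions made for this illustration.

```python
from collections import deque

def sliding_average(events, window_seconds):
    """events: iterable of (epoch_seconds, price), in time order.
    Yields (epoch_seconds, average price over the preceding window) for every row."""
    window = deque()
    total = 0.0
    for ts, price in events:
        window.append((ts, price))
        total += price
        # Evict rows that are at least `window_seconds` older than the current row
        while window[0][0] <= ts - window_seconds:
            _, old_price = window.popleft()
            total -= old_price
        yield ts, total / len(window)

events = [(0, 10.0), (4, 20.0), (12, 30.0)]
averages = list(sliding_average(events, window_seconds=10))
```

Unlike a tumbling window, a sliding window emits a result for every row, and consecutive windows overlap.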

Tumbling windows, like the one you selected in your template, are useful for periodic reports. You can use a tumbling window to compute an average number of visitors to your website in the last 5 minutes, or the maximum over the past hour. A single result is emitted for each key in the group, as specified by the GROUP BY clause, at the end of the defined window.

In streaming data, there are different types of time, and how they are used is important to the analytics. Our example uses ROWTIME, or the processing time, which is great for some use cases. However, in many scenarios, you want a time that more accurately reflects when the event occurred, such as the event or ingest time. Amazon Kinesis Analytics supports all three time semantics for processing data: processing, event, and ingest time. These concepts will be covered in depth in our next blog post.
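The difference between these time semantics shows up as soon as a record is processed later than it was produced. The Python sketch below assigns the same three records to 10-second windows twice, once by processing time and once by event time. The field names `event_time` and `rowtime` and the timestamps are hypothetical.

```python
def window_start(epoch_seconds, size=10):
    """Start of the tumbling window that contains the given timestamp."""
    return int(epoch_seconds // size) * size

# Hypothetical records: when each event happened vs. when it was processed
records = [
    {"event_time": 8, "rowtime": 9},
    {"event_time": 9, "rowtime": 12},   # produced in the first window, processed in the second
    {"event_time": 14, "rowtime": 15},
]

by_processing_time = [window_start(r["rowtime"]) for r in records]
by_event_time = [window_start(r["event_time"]) for r in records]
```

The second record lands in a different window depending on which clock you use, so a per-window count or average changes with the choice of time.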

Part 2: Run your second SQL query using Amazon Kinesis Analytics

The next part of the walkthrough adds some additional metrics to your first SQL query and adds a second step to your application.

Add metrics to the SQL statement

In the SQL editor, add some additional SQL code.

First, add some metrics including the average price, average change, maximum price, and minimum price over the same window. Note that you need to add these in your SELECT statement as well as the in-application stream you are inserting into, DESTINATION_SQL_STREAM.

Second, add the sector to the query so you have additional information about the stock ticker. Note that the sector must be added to both the SELECT and GROUP BY clauses.

When you are finished, your SQL code should look like the following:

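CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16),
    ticker_symbol_count INTEGER,
    avg_price REAL,
    avg_change REAL,
    max_price REAL,
    min_price REAL);

-- The pump name "STREAM_PUMP" is a placeholder; keep whatever name your template uses.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM   ticker_symbol,
                sector,
                COUNT(*) AS ticker_symbol_count,
                AVG(price) as avg_price,
                AVG(change) as avg_change,
                MAX(price) as max_price,
                MIN(price) as min_price
FROM "SOURCE_SQL_STREAM_001"
GROUP BY ticker_symbol, sector, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);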

Choose Save and run SQL.
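If the GROUP BY expression looks opaque, it may help to see roughly the same logic in Python. The sketch below reads the FLOOR expression as "the number of whole 10-second periods since the epoch", which is an interpretation rather than a statement of the SQL engine's exact semantics, and groups rows by ticker symbol, sector, and that window index. The helper names and sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_index(rowtime, seconds=10):
    """Whole 10-second periods elapsed since the epoch for this ROWTIME."""
    return int((rowtime - EPOCH).total_seconds() // seconds)

def aggregate(rows):
    """Group by (ticker_symbol, sector, window) and compute the query's metrics."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["ticker_symbol"], row["sector"], window_index(row["rowtime"]))
        groups[key].append(row)
    results = []
    for (ticker, sector, _window), members in groups.items():
        prices = [m["price"] for m in members]
        changes = [m["change"] for m in members]
        results.append({
            "ticker_symbol": ticker,
            "sector": sector,
            "ticker_symbol_count": len(members),
            "avg_price": sum(prices) / len(prices),
            "avg_change": sum(changes) / len(changes),
            "max_price": max(prices),
            "min_price": min(prices),
        })
    return results

def at(second):
    """Hypothetical ROWTIME values, a few seconds after midnight UTC."""
    return datetime(2020, 1, 1, 0, 0, second, tzinfo=timezone.utc)

rows = [
    {"ticker_symbol": "AMZN", "sector": "TECHNOLOGY", "price": 10.0, "change": 1.0, "rowtime": at(3)},
    {"ticker_symbol": "AMZN", "sector": "TECHNOLOGY", "price": 20.0, "change": -1.0, "rowtime": at(7)},
    {"ticker_symbol": "AMZN", "sector": "TECHNOLOGY", "price": 30.0, "change": 0.5, "rowtime": at(12)},
]
summary = aggregate(rows)
```

The first two rows share a window and collapse into one result; the third row falls into the next window and produces its own.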

Add a second step to your SQL code

Next, add a second step to your SQL code. You can use in-application streams to store intermediate SQL results, which can then be used as input for additional SQL statements. This allows you to build applications with multiple serial steps before sending the results to the destination of your choice. You can also use in-application streams to perform multiple steps in parallel and send the results to multiple destinations.

First, change the DESTINATION_SQL_STREAM name in your two SQL statements to be INTERMEDIATE_SQL_STREAM.

Next, add a second SQL step that selects from INTERMEDIATE_SQL_STREAM and inserts into DESTINATION_SQL_STREAM. The SELECT statement should filter only for companies in the TECHNOLOGY sector using a simple WHERE clause. You must also create the DESTINATION_SQL_STREAM to insert SQL results into. Your final application code should look like the following:

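CREATE OR REPLACE STREAM "INTERMEDIATE_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16),
    ticker_symbol_count INTEGER,
    avg_price REAL,
    avg_change REAL,
    max_price REAL,
    min_price REAL);

-- The pump names "STREAM_PUMP" and "STREAM_PUMP_02" are placeholders; any unique names work.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "INTERMEDIATE_SQL_STREAM"
SELECT STREAM   ticker_symbol,
                sector,
                COUNT(*) AS ticker_symbol_count,
                AVG(price) as avg_price,
                AVG(change) as avg_change,
                MAX(price) as max_price,
                MIN(price) as min_price
FROM "SOURCE_SQL_STREAM_001"
GROUP BY ticker_symbol, sector, FLOOR(("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16),
    ticker_symbol_count INTEGER,
    avg_price REAL,
    avg_change REAL,
    max_price REAL,
    min_price REAL);

CREATE OR REPLACE PUMP "STREAM_PUMP_02" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM   ticker_symbol, sector, ticker_symbol_count, avg_price, avg_change, max_price, min_price
FROM "INTERMEDIATE_SQL_STREAM"
WHERE sector = 'TECHNOLOGY';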

Choose Save and run SQL.

You can see both of the in-application streams on the left side of the Real-time analytics tab, and select either to see each step in your application for end-to-end visibility.
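The two-step structure can be pictured as two chained generators in Python, where the output of the first step is the input of the second. This is only a conceptual sketch: the first step here forwards already-aggregated rows instead of computing them, and the function names and sample rows are invented for the example.

```python
def to_intermediate(source_rows):
    """Step 1 stand-in: the real step aggregates; here it forwards the needed columns."""
    for row in source_rows:
        yield {
            "ticker_symbol": row["ticker_symbol"],
            "sector": row["sector"],
            "avg_price": row["avg_price"],
        }

def to_destination(intermediate_rows):
    """Step 2: keep only TECHNOLOGY rows, like a WHERE clause on the sector column."""
    for row in intermediate_rows:
        if row["sector"] == "TECHNOLOGY":
            yield row

# Hypothetical aggregated rows
source = [
    {"ticker_symbol": "AMZN", "sector": "TECHNOLOGY", "avg_price": 750.0, "extra": 1},
    {"ticker_symbol": "XOM", "sector": "ENERGY", "avg_price": 88.0, "extra": 2},
    {"ticker_symbol": "MSFT", "sector": "TECHNOLOGY", "avg_price": 57.0, "extra": 3},
]
final_rows = list(to_destination(to_intermediate(source)))
```

Because each step only sees the rows produced by the one before it, you can inspect the intermediate results on their own, much as the console lets you select either in-application stream.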

From here, you can add a destination for your SQL results, such as an Amazon S3 bucket. After setup, your application continuously reads data from the streaming source, processes it using your SQL code, and emits the results to your configured destination.

Clean up

The final step is to clean up. Take the following steps to avoid incurring charges.

  1. Delete the Streams demo stream.
  2. Stop the Analytics application.


Previously, real-time stream data processing was only accessible to those with the technical skills to build and manage a complex application. With Amazon Kinesis Analytics, anyone familiar with the ANSI SQL standard can build and deploy a stream data processing application in minutes.

The application you just built provides a managed and elastic data processing pipeline using Analytics that calculates useful results over streaming data. Results are calculated as the data arrives, and you can configure a destination to deliver them to a persistent store like Amazon S3.

It’s simple to get this solution working for your use case. All that is required is to replace the Amazon Kinesis demo stream with your own, and then set up data producers. From there, configure the analytics and you have an end-to-end solution for capturing, processing, and durably storing streaming data.

If you have questions or suggestions, please comment below.