Tag Archives: ctf

FCC Asks Amazon & eBay to Help Eliminate Pirate Media Box Sales

Post Syndicated from Andy original https://torrentfreak.com/fcc-asks-amazon-ebay-to-help-eliminate-pirate-media-box-sales-180530/

Over the past several years, anyone looking for a piracy-configured set-top box could do worse than search for one on Amazon or eBay.

Historically, people deploying search terms including “Kodi” or “fully-loaded” were greeted by page after page of Android-type boxes, each ready for illicit plug-and-play entertainment consumption following delivery.

Although the problem persists on both platforms, people are now much less likely to find infringing devices than they were 12 to 24 months ago. Under pressure from entertainment industry groups, both Amazon and eBay have tightened the screws on sellers of such devices. Now, however, both companies have received requests to stem sales from a completely different direction.

In a letter to eBay CEO Devin Wenig and Amazon CEO Jeff Bezos first spotted by Ars, FCC Commissioner Michael O’Rielly calls on the platforms to take action against piracy-configured boxes that fail to comply with FCC equipment authorization requirements or falsely display FCC logos, contrary to United States law.

“Disturbingly, some rogue set-top box manufacturers and distributors are exploiting the FCC’s trusted logo by fraudulently placing it on devices that have not been approved via the Commission’s equipment authorization process,” O’Rielly’s letter reads.

“Specifically, nine set-top box distributors were referred to the FCC in October for enabling the unlawful streaming of copyrighted material, seven of which displayed the FCC logo, although there was no record of such compliance.”

While O’Rielly admits that the copyright infringement aspects fall outside the jurisdiction of the FCC, he says it’s troubling that many of these devices are used to stream infringing content, “exacerbating the theft of billions of dollars in American innovation and creativity.”

As noted above, both Amazon and eBay have taken steps to reduce sales of pirate boxes on their respective platforms on copyright infringement grounds, something which is duly noted by O’Rielly. However, he points out that devices continue to be sold to members of the public who may believe that the devices are legal since they’re available for sale from legitimate companies.

“For these reasons, I am seeking your further cooperation in assisting the FCC in taking steps to eliminate the non-FCC compliant devices or devices that fraudulently bear the FCC logo,” the Commissioner writes (pdf).

“Moreover, if your company is made aware by the Commission, with supporting evidence, that a particular device is using a fraudulent FCC label or has not been appropriately certified and labeled with a valid FCC logo, I respectfully request that you commit to swiftly removing these products from your sites.”

In the event that Amazon and eBay take action under this request, O’Rielly asks both platforms to hand over information they hold on offending manufacturers, distributors, and suppliers.

Amazon was quick to respond to the FCC. In a letter published by Ars, Amazon’s Public Policy Vice President Brian Huseman assured O’Rielly that the company is not only dedicated to tackling rogue devices on copyright-infringement grounds but also when there is fraudulent use of the FCC’s logos.

Noting that Amazon is a key member of the Alliance for Creativity and Entertainment (ACE) – a group that has been taking legal action against sellers of infringing streaming devices (ISDs) and those who make infringing addons for Kodi-type systems – Huseman says that dealing with the problem is a top priority.

“Our goal is to prevent the sale of ISDs anywhere, as we seek to protect our customers from the risks posed by these devices, in addition to our interest in protecting Amazon Studios content,” Huseman writes.

“In 2017, Amazon became the first online marketplace to prohibit the sale of streaming media players that promote or facilitate piracy. To prevent the sale of these devices, we proactively scan product listings for signs of potentially infringing products, and we also invest heavily in sophisticated, automated real-time tools to review a variety of data sources and signals to identify inauthentic goods.

“These automated tools are supplemented by human reviewers that conduct manual investigations. When we suspect infringement, we take immediate action to remove suspected listings, and we also take enforcement action against sellers’ entire accounts when appropriate.”

Huseman also reveals that since implementing a proactive policy against such devices, “tens of thousands” of listings have been blocked from Amazon. In addition, the platform has been making criminal referrals to law enforcement as well as taking civil action (1,2,3) as part of ACE.

“As noted in your letter, we would also appreciate the opportunity to collaborate further with the FCC to remove non-compliant devices that improperly use the FCC logo or falsely claim FCC certification. If any FCC non-compliant devices are identified, we seek to work with you to ensure they are not offered for sale,” Huseman concludes.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

MPAA Chief Says Fighting Piracy Remains “Top Priority”

Post Syndicated from Andy original https://torrentfreak.com/mpaa-chief-says-fighting-piracy-remains-top-priority-180425/

After several high-profile years at the helm of the movie industry’s most powerful lobbying group, last year saw the departure of Chris Dodd from the role of Chairman and CEO at the MPAA.

The former Senator, who earned more than $3.5m a year championing the causes of the major Hollywood studios since 2011, was immediately replaced by another political heavyweight.

Charles Rivkin, who took up his new role September 5, 2017, previously served as Assistant Secretary of State for Economic and Business Affairs in the Obama administration. With an underperforming domestic box office year behind him, fortunately overshadowed by massive successes globally, this week he spoke before US movie exhibitors for the first time at CinemaCon in Las Vegas.

“Globally, we hit a record high of $40.6 billion at the box office. Domestically, our $11.1 billion box office was slightly down from the 2016 record. But it exactly matched the previous high from 2015. And it was the second highest total in the past decade,” Rivkin said.

Rivkin, who spent time as President and CEO of The Jim Henson Company, told those in attendance that he shares a deep passion for the movie industry and looks forward optimistically to the future, a future in which content is secured from those who intend on sharing it for free.

“Making sure our creative works are valued and protected is one of the most important things we can do to keep that industry heartbeat strong. At the Henson Company, and WildBrain, I learned just how much intellectual property affects everyone. Our entire business model depended on our ability to license Kermit the Frog, Miss Piggy, and the Muppets and distribute them across the globe,” Rivkin said.

“I understand, on a visceral level, how important copyright is to any creative business and in particular our country’s small and medium enterprises – which are the backbone of the American economy. As Chairman and CEO of the MPAA, I guarantee you that fighting piracy in all forms remains our top priority.”

That tackling piracy is high on the MPAA’s agenda won’t come as a surprise but, at least in terms of the number of headlines plastered over the media, high-profile anti-piracy action has been somewhat lacking in recent years.

With lawsuits against torrent sites seemingly a thing of the past and a faltering Megaupload case that will conclude who-knows-when, the MPAA has taken a broader view, seeking partnerships with sometimes rival content creators and distributors, each with a shared desire to curtail illicit media.

“One of the ways that we’re already doing that is through the Alliance for Creativity and Entertainment – or ACE as we call it,” Rivkin said.

“This is a coalition of 30 leading global content creators, including the MPAA’s six member studios as well as Netflix, and Amazon. We work together as a powerful team to ensure our stories are seen as they were intended to be, and that their creators are rewarded for their hard work.”

Announced in June 2017, ACE has become a united anti-piracy powerhouse for a huge range of entertainment industry groups, encompassing the likes of CBS, HBO, BBC, Sky, Bell Canada, Hulu, Lionsgate, Foxtel and Village Roadshow, to name a few.

The coalition was announced by former MPAA Chief Chris Dodd and now, with serious financial input from all companies involved, appears to be picking its fights carefully, focusing on the growing problem of streaming piracy centered around misuse of Kodi and similar platforms.

Beyond threatening relatively small-time producers and distributors of third-party addons and builds (1,2,3), ACE is also attempting to make its mark among the profiteers.

The group now has several lawsuits underway in the United States against people selling piracy-enabled IPTV boxes including Tickbox, Dragon Box, and during the last week, Set TV.

With these important cases pending, Rivkin offered assurances that his organization remains committed to anti-piracy enforcement and he thanked exhibitors for their efforts to prevent people quickly running away with copies of the latest releases.

“I am grateful to all of you for recognizing what is at stake, and for working with us to protect creativity, such as fighting the use of illegal camcorders in theaters,” he said.

“Protecting our creativity isn’t only a fundamental right. It’s an economic necessity, for us and all creative economies. Film and television are among the most valuable – and most impactful – exports we have.”

Thus far at least, Rivkin has a noticeably less aggressive tone on piracy than his predecessor Chris Dodd but it’s unlikely that will be mistaken for weakness among pirates, nor should it. The MPAA isn’t known for going soft on pirates and it certainly won’t be changing course anytime soon.

Tackling climate change and helping the community

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/fair-haven-weather-station/

In today’s guest post, seventh-grade students Evan Callas, Will Ross, Tyler Fallon, and Kyle Fugate share their story of using the Raspberry Pi Oracle Weather Station in their Innovation Lab class, headed by Raspberry Pi Certified Educator Chris Aviles.

Raspberry Pi Certified Educator Chris Aviles Innovation Lab Oracle Weather Station

United Nations Sustainable Goals

Over the past couple of weeks in our Innovation Lab class, our teacher, Mr Aviles, has challenged us students to design a project that helps solve one of the United Nations Sustainable Goals. We chose Climate Action. Innovation Lab is a class that gives students the opportunity to learn about the crossroads where technology, the environment, and entrepreneurship meet. Everyone takes their own path in innovation and learns about the environment using project-based learning.

Raspberry Pi Oracle Weather Station

For our climate change challenge, we decided to build a Raspberry Pi Oracle Weather Station. Tackling the issues of climate change in a way that helps our community stood out to us because we knew that with the help of this weather station we could send local data to farmers and fishermen in town. Recent changes in climate have been affecting farmers’ crops. Unexpected rain, heat, and other unusual weather patterns can completely destabilize the natural growth of the plants and destroy their crops altogether. The amount of labour output needed by farmers has also significantly increased, forcing farmers to grow more food with fewer resources. By using our Raspberry Pi Oracle Weather Station to alert local farmers and fishermen, we can help them be more prepared for and aware of the weather, leading to better crops and safer boating.

Growing teamwork and coding skills

The process of setting up our weather station was fun and simple. Raspberry Pi made the instructions very easy to understand and read, which was very helpful for our team who had little experience in coding or physical computing. We enjoyed working together as a team and were happy to be growing our teamwork skills.

Once we constructed and coded the weather station, we learned that we needed to support the station with PVC pipes. After we completed these steps, we brought the weather station up to the roof of the school and began collecting data. Our information is currently being sent to the Initial State dashboard so that we can share the information with anyone interested. This information will also be recorded and seen by other schools, businesses, and others from around the world who are using the weather station. For example, we can see the weather in countries such as France, Greece and Italy.

Raspberry Pi allows us to build these amazing projects that help us to enjoy coding and physical computing in a fun, engaging, and impactful way. We picked climate change because we care about our community and would like to make a substantial contribution to our town, Fair Haven, New Jersey. It is not every day that kids are given these kinds of opportunities, and we are very lucky and grateful to go to a school and learn from a teacher where these opportunities are given to us. Thanks, Mr Aviles!

To see more awesome projects by Mr Aviles’s class, you can keep up with him on his blog and follow him on Twitter.

The post Tackling climate change and helping the community appeared first on Raspberry Pi.

Getting product security engineering right

Post Syndicated from Michal Zalewski original http://lcamtuf.blogspot.com/2018/02/getting-product-security-engineering.html

Product security is an interesting animal: it is a uniquely cross-disciplinary endeavor that spans policy, consulting,
process automation, in-depth software engineering, and cutting-edge vulnerability research. And in contrast to many
other specializations in our field of expertise – say, incident response or network security – we have virtually no
time-tested and coherent frameworks for setting it up within a company of any size.

In my previous post, I shared some thoughts
on nurturing technical organizations and cultivating the right kind of leadership within. Today, I figured it would
be fitting to follow up with several notes on what I learned about structuring product security work – and about actually
making the effort count.

The “comfort zone” trap

For security engineers, knowing your limits is a sought-after quality: there is nothing more dangerous than a security
expert who goes off script and starts dispensing authoritative-sounding but bogus advice on a topic they know very
little about. But that same quality can be destructive when it prevents us from growing beyond our most familiar role: that of
a critic who pokes holes in other people’s designs.

The role of a resident security critic lends itself all too easily to a sense of supremacy: the mistaken
belief that our cognitive skills exceed the capabilities of the engineers and product managers who come to us for help
– and that the cool bugs we file are the ultimate proof of our special gift. We start taking pride in the mere act
of breaking somebody else’s software – and then write scathing but ineffectual critiques addressed to executives,
demanding that they either put a stop to a project or sign off on a risk. And hey, in the latter case, they better
brace for our triumphant “I told you so” at some later date.

Of course, escalations of this type have their place, but they need to be a very rare sight; when practiced routinely, they are a telltale
sign of a dysfunctional team. We might be failing to think up viable alternatives that are in tune with business or engineering needs; we might
be very unpersuasive, failing to communicate with other rational people in a language they understand; or it might be that our tolerance for risk
is badly out of whack with the rest of the company. Whatever the cause, I’ve seen high-level escalations where the security team
spoke of valiant efforts to resist inexplicably awful design decisions or data sharing setups; and where product leads in turn talked about
pressing business needs randomly blocked by obstinate security folks. Sometimes, simply having them compare their notes would be enough to arrive
at a technical solution – such as sharing a less sensitive subset of the data at hand.

To be effective, any product security program must be rooted in a partnership with the rest of the company, focused on helping them get stuff done
while eliminating or reducing security risks. To combat the toxic us-versus-them mentality, I found it helpful to have some team members with
software engineering backgrounds, even if it’s the ownership of a small open-source project or so. This can broaden our horizons, helping us see
that we all make the same mistakes – and that not every solution that sounds good on paper is usable once we code it up.

Getting off the treadmill

All security programs involve a good chunk of operational work. For product security, this can be a combination of product launch reviews, design consulting requests, incoming bug reports, or compliance-driven assessments of some sort. And curiously, such reactive work also has the property of gradually expanding to consume all the available resources on a team: next year is bound to bring even more review requests, even more regulatory hurdles, and even more incoming bugs to triage and fix.

Being more tractable, such routine tasks are also more readily enshrined in SDLs, SLAs, and all kinds of other official documents that are often mistaken for a mission statement that justifies the existence of our teams. Soon, instead of explaining to a developer why they should fix a particular problem right away, we end up pointing them to page 17 in our severity classification guideline, which defines that “severity 2” vulnerabilities need to be resolved within a month. Meanwhile, another policy may be telling them that they need to run a fuzzer or a web application scanner for a particular number of CPU-hours – no matter whether it makes sense or whether the job is set up right.

To run a product security program that scales sublinearly, stays abreast of future threats, and doesn’t erect bureaucratic speed bumps just for the sake of it, we need to recognize this inherent tendency for operational work to take over – and we need to rein it in. No matter what last year’s policy says, we usually don’t need to be doing security reviews with a particular cadence or to a particular depth; if we need to scale them back 10% to staff a two-quarter project that fixes an important API and squashes an entire class of bugs, it’s a short-term risk we should feel empowered to take.

As noted in my earlier post, I find contingency planning to be a valuable tool in this regard: why not ask ourselves how the team would cope if the workload went up another 30%, but bad financial results precluded any team growth? It’s actually fun to think about such hypotheticals ahead of time – and hey, if the ideas sound good, why not try them out today?

Living for a cause

It can be difficult to understand if our security efforts are structured and prioritized right; when faced with such uncertainty, it is natural to stick to the safe fundamentals – investing most of our resources into the very same things that everybody else in our industry appears to be focusing on today.

I think it’s important to combat this mindset – and if so, we might as well tackle it head on. Rather than focusing on tactical objectives and policy documents, try to write down a concise mission statement explaining why you are a team in the first place, what specific business outcomes you are aiming for, how you prioritize them, and how you want it all to change in a year or two. It should be a fluid narrative that reads right and that everybody on your team can take pride in; my favorite way of starting the conversation is telling folks that we could always have a new VP tomorrow – and that the VP’s first order of business could be asking, “why do you have so many people here and how do I know they are doing the right thing?”. It’s a playful but realistic framing device that motivates people to get it done.

In general, a comprehensive product security program should probably start with the assumption that no matter how many resources we have at our disposal, we will never be able to stay in the loop on everything that’s happening across the company – and even if we did, we’re not going to be able to catch every single bug. It follows that one of our top priorities for the team should be making sure that bugs don’t happen very often; a scalable way of getting there is equipping engineers with intuitive and usable tools that make it easy to perform common tasks without having to worry about security at all. Examples include standardized, managed containers for production jobs; safe-by-default APIs, such as strict contextual autoescaping for XSS or type safety for SQL; security-conscious style guidelines; or plug-and-play libraries that take care of common crypto or ACL enforcement tasks.
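
To make the idea of a safe-by-default API a bit more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than a description of any particular company’s tooling: the TrustedSQL wrapper, run_query, and render_greeting are hypothetical names, the isinstance check is only a runtime approximation of what a static type checker would enforce, and html.escape is a much simpler stand-in for strict contextual autoescaping.

```python
import html
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustedSQL:
    """Query text written by a developer, never assembled from user input."""
    text: str


def run_query(conn, query, params=()):
    # Refuse plain strings so that ad-hoc string concatenation fails loudly.
    if not isinstance(query, TrustedSQL):
        raise TypeError("query must be a TrustedSQL constant")
    # User-supplied values only ever travel as bound parameters.
    return conn.execute(query.text, params).fetchall()


def render_greeting(name):
    # Escaping happens inside the helper, so callers cannot forget it.
    return "<p>Hello, " + html.escape(name) + "!</p>"


# Small in-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES (?)", ("alice",))

FIND_USER = TrustedSQL("SELECT name FROM users WHERE name = ?")
hostile = "alice' OR '1'='1"

rows_ok = run_query(conn, FIND_USER, ("alice",))
rows_hostile = run_query(conn, FIND_USER, (hostile,))
```

The point of the design is that the easy path and the safe path are the same: an engineer who passes a hand-built string to run_query gets an immediate error, and a hostile value bound as a parameter is compared as data and never parsed as SQL.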

Of course, not all problems can be addressed at the framework level, and not every engineer will always reach for the right tools. Because of this, the next principle that I found to be worth focusing on is containment and mitigation: making sure that bugs are difficult to exploit when they happen, or that the damage is kept in check. The solutions in this space can range from low-level enhancements (say, hardened allocators or seccomp-bpf sandboxes) to client-facing features such as browser origin isolation or Content Security Policy.
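
As one small illustration of the client-facing end of that range, the sketch below builds a Content Security Policy header that only allows scripts carrying a per-response nonce. The helper name build_csp_header and the specific choice of directives are my assumptions for the example; a real policy has to be tuned to the application it protects.

```python
import secrets


def build_csp_header(nonce):
    """Return a (name, value) pair for a nonce-based policy."""
    directives = [
        "default-src 'self'",           # same-origin resources by default
        f"script-src 'nonce-{nonce}'",  # only scripts tagged with this nonce
        "object-src 'none'",            # no plugin content
        "base-uri 'none'",              # no <base> tag hijacking
    ]
    return "Content-Security-Policy", "; ".join(directives)


# A fresh, unpredictable nonce should be generated for every response.
nonce = secrets.token_urlsafe(16)
header_name, header_value = build_csp_header(nonce)
```

With a policy along these lines, an injected script tag that lacks the nonce is not executed by a supporting browser, so an XSS bug that slips through is much harder to exploit.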

The usual consulting, review, and outreach tasks are an important facet of a product security program, but probably shouldn’t be the sole focus of your team. It’s also best to avoid undue emphasis on vulnerability showmanship: while valuable in some contexts, it creates a hypercompetitive environment that may be hostile to less experienced team members – not to mention, squashing individual bugs offers very limited value if the same issue is likely to be reintroduced into the codebase the next day. I like to think of security reviews as a teaching opportunity instead: it’s a way to raise awareness, form partnerships with engineers, and help them develop lasting habits that reduce the incidence of bugs. Metrics to understand the impact of your work are important, too; if your engagements are seen mostly as yet another layer of red tape, product teams will stop reaching out to you for advice.

The other tenet of a healthy product security effort requires us to recognize that at scale and given enough time, every defense mechanism is bound to fail – and so, we need ways to prevent bugs from turning into incidents. The efforts in this space may range from developing product-specific signals for the incident response and monitoring teams; to offering meaningful vulnerability reward programs and nourishing a healthy and respectful relationship with the research community; to organizing regular offensive exercises in hopes of spotting bugs before anybody else does.

Oh, one final note: an important feature of a healthy security program is the existence of multiple feedback loops that help you spot problems without the need to micromanage the organization and without being deathly afraid of taking chances. For example, the data coming from bug bounty programs, if analyzed correctly, offers a wonderful way to alert you to systemic problems in your codebase – and later on, to measure the impact of any remediation and hardening work.
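
A feedback loop of this kind can start out very simple. The sketch below tallies bug bounty reports by bug class per quarter and flags the classes that keep recurring. The records are illustrative placeholders, not real data, and the function names and the threshold are assumptions made for the example.

```python
from collections import Counter

# Illustrative placeholder records, NOT real data: (quarter, bug class).
reports = [
    ("2018-Q1", "xss"),
    ("2018-Q1", "xss"),
    ("2018-Q1", "sql-injection"),
    ("2018-Q1", "xss"),
    ("2018-Q2", "xss"),
    ("2018-Q2", "csrf"),
]


def counts_by_class(reports, quarter):
    """Count reports per bug class within one quarter."""
    return Counter(bug_class for q, bug_class in reports if q == quarter)


def recurring_classes(reports, quarter, threshold):
    """Bug classes reported at least `threshold` times in the quarter."""
    counts = counts_by_class(reports, quarter)
    return sorted(c for c, n in counts.items() if n >= threshold)


q1_counts = counts_by_class(reports, "2018-Q1")
q1_systemic = recurring_classes(reports, "2018-Q1", threshold=3)
q2_counts = counts_by_class(reports, "2018-Q2")
```

A class that repeatedly crosses the threshold hints at a systemic problem worth a framework-level fix, and comparing the same count in later quarters gives a rough measure of whether the remediation worked.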

The Decision on Transparency

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/transparency-in-business/

This post by Backblaze’s CEO and co-founder Gleb Budman is the seventh in a series about entrepreneurship. You can choose posts in the series from the list below:

  1. How Backblaze got Started: The Problem, The Solution, and the Stuff In-Between
  2. Building a Competitive Moat: Turning Challenges Into Advantages
  3. From Idea to Launch: Getting Your First Customers
  4. How to Get Your First 1,000 Customers
  5. Surviving Your First Year
  6. How to Compete with Giants
  7. The Decision on Transparency

“Are you crazy?” “Why would you do that?!” “You shouldn’t share that!”

These are just a few of the common questions and comments we heard after posting some of the information we have shared over the years. So was it crazy? Misguided? Should you do it?

With that background I’d like to dig into the decision to become so transparent, from releasing stats on hard drive failures, to storage pod specs, to publishing our cloud storage costs, and open sourcing the Reed-Solomon code. What was the thought process behind becoming so transparent when most companies work so hard to hide their inner workings, especially information such as the Storage Pod specs that would normally be considered a proprietary advantage? Most importantly I’d like to explore the positives and negatives of being so transparent.

Sharing Intellectual Property

The first “transparency” that garnered a flurry of “why would you share that?!” came as a result of us deciding to open source our Storage Pod design: publishing the specs, parts, prices, and how to build it yourself. The Storage Pod was a key component of our infrastructure, gave us a cost (and thus competitive) advantage, took significant effort to develop, and had a fair bit of intellectual property: the “IP.”

The negatives of sharing this are obvious: it allows our competitors to use the design to reduce our cost advantage, and it gives away the IP, which could be patentable or have value as a trade secret.

The positives were certainly less obvious, and at the time we couldn’t have guessed how massive they would be.

We wrestled with the decision: prospective users and others online didn’t believe we could offer our service for such a low price, thinking that we would burn through some cash hoard and then go out of business. We wanted to reassure them, but how?

This is how our response evolved:

We’ve built a lower cost storage platform.
But why would anyone believe us?
Because, we’ve designed our own servers and they’re less expensive.
But why would anyone believe they were so low cost and efficient?
Because here’s how much they cost versus others.
But why would anyone believe they cost that little and still enabled us to efficiently store data?
Because here are all the components they’re made of, this is how to build them, and this is how they work.
Ok, you can’t argue with that.

Great — so that would reassure people. But should we do this? Is it worth it?

This was 2009; we were a tiny company of seven people working from our co-founder’s one-bedroom apartment. We decided that the risk of not having potential customers trust us was more impactful than the risk of our competitors possibly deciding to use our server architecture. The former might kill the company in short order; the latter might make it harder for us to compete in the future. Moreover, we figured that most competitors were established on their own platforms and were unlikely to switch to ours, even if it were better.

Takeaway: Build your brand today. There are no assurances you will make it to tomorrow if you can’t make people believe in you today.

A Sharing Success Story — The Backblaze Storage Pod

So with that, we decided to publish everything about the Storage Pod. As for deciding to actually open source it? That was a ‘thank you’ to the open source community upon whose shoulders we stood as we used software such as Linux, Tomcat, etc.

With eight years of hindsight, here’s what happened:

  • As best as I can tell, none of our direct competitors ever used our Storage Pod design, opting instead to continue paying more for commercial solutions.
  • Hundreds of press articles have been written about Backblaze as a direct result of sharing the Storage Pod design.
  • Millions of people have read press articles or our blog posts about the Storage Pods.
  • Backblaze was established as a storage tech thought leader, and a resource for those looking for information in the space.
  • Our blog became viewed as a resource, not a corporate mouthpiece.
  • Recruiting has been made easier through the awareness of Backblaze, the appreciation for us taking on challenging tech problems in interesting ways, and for our openness.
  • Sourcing for our Storage Pods has become easier because we can point potential vendors to our blog posts and say, “here’s what we need.”

And those are just the direct benefits for us. One of the things that warms my heart is that doing this has helped others:

  • Several companies have started selling servers based on our Storage Pod designs.
  • Netflix credits Backblaze with being the inspiration behind their CDN servers.
  • Many schools, labs, and others have shared that they’ve been able to do what they didn’t think was possible because using our Storage Pod designs provided lower-cost storage.
  • And I want to believe that in general we pushed forward the development of low-cost storage servers in the industry.

So overall, the decision on being transparent and sharing our Storage Pod designs was a clear win.

Takeaway: Never underestimate the value of goodwill. It can help build new markets that fuel your future growth and create new ecosystems.

Sharing An “Almost Acquisition”

Acquisition announcements are par for the course. No company, however, talks about the acquisition that fell through. If rumors appear in the press, the company’s response is always, “no comment.” But in 2010, when Backblaze was almost, but not quite, acquired, we wrote about it in detail. Crazy?

The negatives of sharing this are slightly less obvious, but the two issues most people worried about were, 1) the fact that the company could be acquired would spook customers, and 2) the fact that it wasn’t would signal to potential acquirers that something was wrong.

So, why share this at all? No one was asking “did you almost get acquired?”

First, we had established a culture of transparency and this was a significant event that occurred for us, thus we defaulted to assuming we would share. Second, we learned that acquisitions fall through all the time, not just during the early fishing stage, but even after term sheets are signed, diligence is done, and all the paperwork is complete. I felt we had learned some things about the process that would be valuable to others that were going through it.

As it turned out, we received emails from startup founders saying they saved the post for the future, and from lawyers, VCs, and advisors saying they shared it with their portfolio companies. Among the most touching emails I received was from a founder who said that after an acquisition fell through she felt so alone that she became incredibly depressed, and that reading our post helped her see that this happens and that things could be OK after. Being transparent about almost getting acquired was worth it just to help that one founder.

And what about the concerns? As for spooking customers, maybe some were — but our sign-ups went up, not down, afterward. Any company can be acquired, and many of the world’s largest have been. That we were being both thoughtful about where to go with it, and open about it, I believe gave customers a sense that we would do the right thing if it happened. And as for signaling to potential acquirers? The ones I’ve spoken with all knew this happens regularly enough that it’s not a factor.

Takeaway: Being open and transparent is also a form of giving back to others.

Sharing Strategic Data

For years people have been desperate to know how reliable hard drives are. They could go to Amazon for individual reviews, but someone saying “this drive died for me” doesn’t provide statistical insight. Google published a study that showed annualized drive failure rates, but didn’t break down the results by manufacturer or model. Since Backblaze has deployed about 100,000 hard drives to store customer data, we have been able to collect a wealth of data on the reliability of the drives by make, model, and size. Was Backblaze the only one with this data? Of course not: Google, Amazon, Microsoft, and any other cloud-scale storage provider tracked it. Yet none would publish. Should Backblaze?

Again, starting with the main negatives: 1) sharing which drives we liked could increase demand for them, thus reducing availability or increasing prices, and 2) publishing the data might make the drive vendors unhappy with us, thereby making it difficult for us to buy drives.

But we felt that the largest drive purchasers (Amazon, Google, etc.) already had their own stats and would buy the drives they chose, and if individuals or smaller companies used our stats, they wouldn’t sufficiently move the overall market demand. Also, we hoped that the drive companies would see that we were being fair in our analysis and, if anything, would leverage our data to make drives even better.

Again, publishing the data resulted in tremendous value for Backblaze, with millions of people having read the analysis that we put out quarterly. Also, becoming known as the place to go for drive reliability information is a natural fit with being a backup and storage provider. In addition, in a twist from many people’s expectations, some of the drive companies actually started working more closely with us, seeing that we could be a good source of data for them as feedback. We’ve also seen many individuals and companies make more data-based decisions on which drives to buy, and researchers have used the data for a variety of analyses.

Backblaze blog analytics showing spike in readership after a hard drive stats post

Takeaway: Being open and transparent is rarely as risky as it seems.

Sharing Revenue (And Other Metrics)

Journalists always want to publish company revenue and other metrics, and private companies always shy away from sharing. For a long time we did, too. Then, we opened up about that, as well.

The negatives of sharing these numbers are: 1) external parties may otherwise perceive you’re doing better than you are, 2) if you share numbers often, you may show that growth has slowed or worse, and 3) it gives your competitors info to compare their own business to.

We decided that, while some may have perceived we were bigger, our scale was plenty significant. Since we choose what we share and when, it’s up to us whether to disclose at any point. And if our competitors compare, what will they actually change that would affect us?

I did wait to share revenue until I felt I had the right person to write about it. At one point a journalist said she wouldn’t write about us unless I disclosed revenue. I suggested we had a lot to offer for the story, but didn’t want to share revenue yet. She refused to budge and I walked away from the article. Several years later, I reached out to a journalist who had covered Backblaze before and who I felt understood our business, and offered to share revenue with him. He wrote a deep-dive about the company, with revenue being one of the components of the story.

Sharing these metrics showed that we were at scale and running a real business, one with positive unit economics and margins, but not one where we were gouging customers.

Takeaway: Being open with the press about items typically not shared can be uncomfortable, but the press can amplify your story.

Should You Share?

For Backblaze, I believe the results of transparency have been staggering. However, it’s not for everyone. Apple has, clearly, been wildly successful taking secrecy to the extreme. In their case, early disclosure combined with the long cycle of hardware releases could significantly impact sales of current products.

“For Backblaze, I believe the results of transparency have been staggering.” — Gleb Budman

I will argue, however, that for most startups transparency wins. Most startups need to establish credibility and trust, build awareness and a fan base, show that they understand what their customers need and be useful to them, and show the soul and passion behind the company. Some startup companies try to buy these virtues with investor money, and sometimes amplifying your brand via paid marketing helps. But, authentic transparency can build awareness and trust not only less expensively, but more deeply than money can buy.

Backblaze was open from the beginning. With no outside investors, as founders we were able to express ourselves and make our decisions. And it’s easier to be a company that shares if you do it from the start, but for any company, here are a few suggestions:

  1. Ask about sharing: If something significant happens — good or bad — ask “should we share this?” If you made a tough decision, ask “should we share the thinking behind the decision and why it was tough?”
  2. Default to yes: It’s often scary to share, but look for the reasons to say ‘yes,’ not the reasons to say ‘no.’ That doesn’t mean you won’t sometimes decide not to, but make that the high bar.
  3. Minimize reviews: Press releases tend to be sanitized and boring because they’ve been endlessly wordsmithed by committee. Establish the few things you don’t want shared, but minimize the number of people that have to see anything else before it can go out. Teach, then trust.
  4. Engage: Sharing will result in comments on your blog, social, articles, etc. Reply to people’s questions and engage. It’ll make the readers more engaged and give you a better understanding of what they’re looking for.
  5. Accept mistakes: Things will become public that aren’t perfectly sanitized. Accept that and don’t punish people for oversharing.

Building a company culture that is open to sharing takes time, but continuous practice will get you there, and over time the company will find its voice and its approach to sharing.

The post The Decision on Transparency appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Analyzing AWS Cost and Usage Reports with Looker and Amazon Athena

Post Syndicated from Dillon Morrison original https://aws.amazon.com/blogs/big-data/analyzing-aws-cost-and-usage-reports-with-looker-and-amazon-athena/

This is a guest post by Dillon Morrison at Looker. Looker is, in their own words, “a new kind of analytics platform–letting everyone in your business make better decisions by getting reliable answers from a tool they can use.” 

As the breadth of AWS products and services continues to grow, customers are able to more easily move their technology stack and core infrastructure to AWS. One of the attractive benefits of AWS is the cost savings. Rather than paying upfront capital expenses for large on-premises systems, customers can instead pay variable expenses for on-demand services. To further reduce expenses, AWS users can reserve resources for specific periods of time, and automatically scale resources as needed.

The AWS Cost Explorer is great for aggregated reporting. However, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and can be the better choice for the long term. Thankfully, with the introduction of Amazon Athena, monitoring and managing these costs is now easier than ever.

In this post, I walk through setting up the data pipeline for cost and usage reports, Amazon S3, and Athena, and discuss some of the most common levers for cost savings. I surface tables through Looker, which comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive.

Analysis with Athena

With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Athena uses Apache Hive’s DDL to create tables, and the Presto querying engine to process queries. Analysis can be performed directly on raw data in S3. Conveniently, AWS exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.

After the data pipeline is established, cost and usage data (the recommended billing data, per AWS documentation) provides a plethora of comprehensive information around usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, this report can be cut-and-sliced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, purchase option (on-demand, spot, or reserved), and so on.

Walkthrough

By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which are discussed below.

Prerequisites

If you want to follow along, you need the following resources:

Enable the cost and usage reports

First, enable the Cost and Usage report. For Time unit, select Hourly. For Include, select Resource IDs. All options are prompted in the report-creation window.

The Cost and Usage report dumps CSV files into the specified S3 bucket. Please note that it can take up to 24 hours for the first file to be delivered after enabling the report.

Configure the S3 bucket and files for Athena querying

In addition to the CSV file, AWS also creates a JSON manifest file for each cost and usage report. Athena requires that all of the files in the S3 bucket are in the same format, so we need to get rid of all these manifest files. If you’re looking to get started with Athena quickly, you can simply go into your S3 bucket and delete the manifest file manually, skip the automation described below, and move on to the next section.

To automate the process of removing the manifest file each time a new report is dumped into S3, which I recommend as you scale, there are a few additional steps. The folks at Concurrency Labs wrote a great overview and set of scripts for this, which you can find in their GitHub repo.

These scripts take the data from an input bucket, remove anything unnecessary, and dump it into a new output bucket. We can utilize AWS Lambda to trigger this process whenever new data is dropped into S3, or on a nightly basis, or whatever makes most sense for your use-case, depending on how often you’re querying the data. Please note that enabling the “hourly” report means that data is reported at the hour-level of granularity, not that a new file is generated every hour.
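As a rough illustration of the "remove anything unnecessary" step, the sketch below filters an S3 key listing down to the report data files and skips the JSON manifests. It is not the Concurrency Labs script. The function name, the suffix list, and the example keys are all assumptions made for illustration, and the S3 listing and copy calls are left out so the logic can be read on its own.

```python
from typing import Iterable, List

# Assumed suffixes for report data files; adjust to match your report settings.
DATA_SUFFIXES = (".csv", ".csv.gz")


def select_report_data_keys(keys: Iterable[str]) -> List[str]:
    """Return only the keys that look like report data, skipping manifests."""
    return [key for key in keys if key.lower().endswith(DATA_SUFFIXES)]


if __name__ == "__main__":
    # Hypothetical object keys, for illustration only.
    listing = [
        "reports/20170601-20170701/example-Manifest.json",
        "reports/20170601-20170701/example-1.csv.gz",
    ]
    print(select_report_data_keys(listing))
```

A Lambda handler could apply a filter like this to the keys in the input bucket and copy only the selected objects to the output bucket that Athena reads.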

Following these scripts, you’ll notice that we’re adding a date partition field, which isn’t necessary but improves query performance. In addition, converting data from CSV to a columnar format like ORC or Parquet also improves performance. We can automate this process using Lambda whenever new data is dropped in our S3 bucket. Amazon Web Services discusses columnar conversion at length, and provides walkthrough examples, in their documentation.

As a long-term solution, best practice is to use compression, partitioning, and conversion. However, for purposes of this walkthrough, we’re not going to worry about them so we can get up-and-running quicker.

Set up the Athena query engine

In your AWS console, navigate to the Athena service, and click “Get Started”. Follow the tutorial and set up a new database (we’ve called ours “AWS Optimizer” in this example). Don’t worry about configuring your initial table, per the tutorial instructions. We’ll be creating a new table for cost and usage analysis. Once you’ve walked through the tutorial steps, you’ll be able to access the Athena interface, and can begin running Hive DDL statements to create new tables.

One important thing to note is that the Cost and Usage CSVs also contain the column headers in their first row, meaning that the column headers would be included in the dataset and any queries. For testing and quick set-up, you can remove this line manually from your first few CSV files. Long-term, you’ll want to use a script to programmatically remove this row each time a new file is dropped in S3 (every few hours typically). We’ve drafted up a sample script for ease of reference, which we run on Lambda. We utilize Lambda’s native ability to invoke the script whenever a new object is dropped in S3.
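The header-removal step can be sketched in a few lines. This is not the sample script mentioned above, only a minimal illustration of the idea. The function name is hypothetical, and it assumes the header occupies exactly the first line and that the file is small enough to process in memory.

```python
import gzip


def strip_header(data: bytes, compressed: bool = True) -> bytes:
    """Return the CSV payload without its first line (the column headers).

    Assumes the header is exactly one line, with no embedded newlines.
    """
    raw = gzip.decompress(data) if compressed else data
    # Split once on the first newline; everything after it is the data rows.
    _, _, body = raw.partition(b"\n")
    return gzip.compress(body) if compressed else body
```

In a Lambda function, the bytes would come from the newly delivered S3 object, and the result would be written to the bucket that the Athena table points at.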

For cost and usage, we recommend using the DDL statement below. Since our data is in CSV format, we don’t need to use a SerDe; we can simply specify the field delimiter, the escape character, and the line terminator, along with the structure of the files (“TEXTFILE”). Note that Athena also supports an OpenCSV SerDe, if you prefer to use that.

 

CREATE EXTERNAL TABLE IF NOT EXISTS cost_and_usage (
identity_LineItemId String,
identity_TimeInterval String,
bill_InvoiceId String,
bill_BillingEntity String,
bill_BillType String,
bill_PayerAccountId String,
bill_BillingPeriodStartDate String,
bill_BillingPeriodEndDate String,
lineItem_UsageAccountId String,
lineItem_LineItemType String,
lineItem_UsageStartDate String,
lineItem_UsageEndDate String,
lineItem_ProductCode String,
lineItem_UsageType String,
lineItem_Operation String,
lineItem_AvailabilityZone String,
lineItem_ResourceId String,
lineItem_UsageAmount String,
lineItem_NormalizationFactor String,
lineItem_NormalizedUsageAmount String,
lineItem_CurrencyCode String,
lineItem_UnblendedRate String,
lineItem_UnblendedCost String,
lineItem_BlendedRate String,
lineItem_BlendedCost String,
lineItem_LineItemDescription String,
lineItem_TaxType String,
product_ProductName String,
product_accountAssistance String,
product_architecturalReview String,
product_architectureSupport String,
product_availability String,
product_bestPractices String,
product_cacheEngine String,
product_caseSeverityresponseTimes String,
product_clockSpeed String,
product_currentGeneration String,
product_customerServiceAndCommunities String,
product_databaseEdition String,
product_databaseEngine String,
product_dedicatedEbsThroughput String,
product_deploymentOption String,
product_description String,
product_durability String,
product_ebsOptimized String,
product_ecu String,
product_endpointType String,
product_engineCode String,
product_enhancedNetworkingSupported String,
product_executionFrequency String,
product_executionLocation String,
product_feeCode String,
product_feeDescription String,
product_freeQueryTypes String,
product_freeTrial String,
product_frequencyMode String,
product_fromLocation String,
product_fromLocationType String,
product_group String,
product_groupDescription String,
product_includedServices String,
product_instanceFamily String,
product_instanceType String,
product_io String,
product_launchSupport String,
product_licenseModel String,
product_location String,
product_locationType String,
product_maxIopsBurstPerformance String,
product_maxIopsvolume String,
product_maxThroughputvolume String,
product_maxVolumeSize String,
product_maximumStorageVolume String,
product_memory String,
product_messageDeliveryFrequency String,
product_messageDeliveryOrder String,
product_minVolumeSize String,
product_minimumStorageVolume String,
product_networkPerformance String,
product_operatingSystem String,
product_operation String,
product_operationsSupport String,
product_physicalProcessor String,
product_preInstalledSw String,
product_proactiveGuidance String,
product_processorArchitecture String,
product_processorFeatures String,
product_productFamily String,
product_programmaticCaseManagement String,
product_provisioned String,
product_queueType String,
product_requestDescription String,
product_requestType String,
product_routingTarget String,
product_routingType String,
product_servicecode String,
product_sku String,
product_softwareType String,
product_storage String,
product_storageClass String,
product_storageMedia String,
product_technicalSupport String,
product_tenancy String,
product_thirdpartySoftwareSupport String,
product_toLocation String,
product_toLocationType String,
product_training String,
product_transferType String,
product_usageFamily String,
product_usagetype String,
product_vcpu String,
product_version String,
product_volumeType String,
product_whoCanOpenCases String,
pricing_LeaseContractLength String,
pricing_OfferingClass String,
pricing_PurchaseOption String,
pricing_publicOnDemandCost String,
pricing_publicOnDemandRate String,
pricing_term String,
pricing_unit String,
reservation_AvailabilityZone String,
reservation_NormalizedUnitsPerReservation String,
reservation_NumberOfReservations String,
reservation_ReservationARN String,
reservation_TotalReservedNormalizedUnits String,
reservation_TotalReservedUnits String,
reservation_UnitsPerReservation String,
resourceTags_userName String,
resourceTags_usercostcategory String
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://<<your bucket name>>';
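Because the column list depends on which services and tags appear in your own report, it can be convenient to generate the statement from a report's header row rather than typing it by hand. The sketch below is one way to do that. It assumes headers take the form category/Name (for example identity/LineItemId) and that tag headers look like resourceTags/user:Name; both function names are hypothetical.

```python
from typing import List


def header_to_columns(header_line: str) -> List[str]:
    """Turn a CSV header line into Hive column definitions, all typed String."""
    columns = []
    for name in header_line.strip().split(","):
        # Assumption: headers look like "identity/LineItemId" or
        # "resourceTags/user:Name". Map '/' to '_' and drop ':' so the
        # result matches the naming used in the DDL statement.
        safe = name.strip().replace("/", "_").replace(":", "")
        columns.append(f"{safe} String")
    return columns


def build_ddl(table: str, header_line: str, location: str) -> str:
    """Assemble a CREATE EXTERNAL TABLE statement for a delimited text file."""
    cols = ",\n".join(header_to_columns(header_line))
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED\n"
        "  FIELDS TERMINATED BY ','\n"
        "  ESCAPED BY '\\\\'\n"
        "  LINES TERMINATED BY '\\n'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )
```

Typing every column as String keeps the table definition simple, at the price of casting numeric fields in queries.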

Once you’ve successfully executed the command, you should see a new table named “cost_and_usage” with the below properties. Now we’re ready to start executing queries and running analysis!

Start with Looker and connect to Athena

Setting up Looker is a quick process, and you can try it out for free here (or download from Amazon Marketplace). It takes just a few seconds to connect Looker to your Athena database, and Looker comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive. After you’re connected, you can use the Looker UI to run whatever analysis you’d like. Looker translates this UI to optimized SQL, so any user can execute and visualize queries for true self-service analytics.

Major cost saving levers

Now that the data pipeline is configured, you can dive into the most popular use cases for cost savings. In this post, I focus on:

  • Purchasing Reserved Instances vs. On-Demand Instances
  • Data transfer costs
  • Allocating costs over users or other Attributes (denoted with resource tags)

On-Demand, Spot, and Reserved Instances

Purchasing Reserved Instances vs On-Demand Instances is arguably going to be the biggest cost lever for heavy AWS users (Reserved Instances run up to 75% cheaper!). AWS offers three options for purchasing instances:

  • On-Demand—Pay as you use.
  • Spot (variable cost)—Bid on spare Amazon EC2 computing capacity.
  • Reserved Instances—Pay for an instance for a specific, allotted period of time.

When purchasing a Reserved Instance, you can also choose to pay all-upfront, partial-upfront, or monthly. The more you pay upfront, the greater the discount.

If your company has been using AWS for some time now, you should have a good sense of your overall instance usage on a per-month or per-day basis. Rather than paying for these instances On-Demand, you should try to forecast the number of instances you’ll need, and reserve them with upfront payments.

The total amount of usage with Reserved Instances versus overall usage with all instances is called your coverage ratio. It’s important not to confuse your coverage ratio with your Reserved Instance utilization. Utilization represents the amount of reserved hours that were actually used. Don’t worry about exceeding capacity: you can still set up Auto Scaling preferences so that more instances get added whenever your coverage or utilization crosses a certain threshold (we often see a target of 80% for both coverage and utilization among savvy customers).
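The difference between the two ratios is easiest to see with numbers. The sketch below uses hypothetical function names and made-up hours: coverage divides the reserved hours you used by all instance hours, while utilization divides the same reserved hours by the reserved hours you paid for.

```python
def coverage_ratio(reserved_hours_used: float, total_instance_hours: float) -> float:
    """Share of all instance hours that ran under a reservation."""
    if total_instance_hours == 0:
        return 0.0
    return reserved_hours_used / total_instance_hours


def utilization(reserved_hours_used: float, reserved_hours_purchased: float) -> float:
    """Share of purchased reserved hours that were actually used."""
    if reserved_hours_purchased == 0:
        return 0.0
    return reserved_hours_used / reserved_hours_purchased


if __name__ == "__main__":
    # Made-up month: 1,000 instance hours in total, 800 reserved hours
    # purchased, of which 700 were actually consumed.
    print(coverage_ratio(700, 1000))  # 0.7
    print(utilization(700, 800))      # 0.875
```

In this made-up month, coverage is 70% and utilization is 87.5%: most of the reservation was used, but nearly a third of the usage still ran outside it.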

Calculating the reserved costs and coverage can be a bit tricky with the level of granularity provided by the cost and usage report. The following query shows your total cost over the last 6 months, broken out by Reserved Instance vs other instance usage. You can substitute the cost field for usage if you’d prefer. Please note that you should only have data for the time period after the cost and usage report has been enabled (though you can opt for up to 3 months of historical data by contacting your AWS Account Executive). If you’re just getting started, this query will only show a few days.

 

SELECT 
	DATE_FORMAT(from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate),'%Y-%m') AS "cost_and_usage.usage_start_month",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_ris",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_non_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_non_ris"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
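The repeated CASE expressions in the query all encode one rule: a line item counts as Reserved Instance spend when its type is DiscountedUsage, RIFee, or Fee. As a hedged illustration of that rule only (not of how Looker or Athena execute it), the same split can be computed in plain Python over (line item type, cost) pairs; the function name and sample rows are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# Line item types the query treats as Reserved Instance spend.
RI_LINE_ITEM_TYPES = {"DiscountedUsage", "RIFee", "Fee"}


def ri_spend_share(rows: Iterable[Tuple[str, float]]) -> Dict[str, Optional[float]]:
    """Split unblended cost into RI and non-RI spend and compute the RI share."""
    total = 0.0
    reserved = 0.0
    for line_item_type, cost in rows:
        total += cost
        if line_item_type in RI_LINE_ITEM_TYPES:
            reserved += cost
    # Mirror NULLIF(total, 0) in the query: no share when there is no spend.
    share = reserved / total if total else None
    return {
        "total": total,
        "reserved": reserved,
        "non_reserved": total - reserved,
        "percent_spend_on_ris": share,
    }
```

Passing rows for a single month gives the same four figures as one row of the query result.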

The resulting table should look something like the image below (I’m surfacing tables through Looker, though the same table would result from querying via command line or any other interface).

With a BI tool, you can create dashboards for easy reference and monitoring. New data is dumped into S3 every few hours, so your dashboards can update several times per day.

It’s an iterative process to understand the appropriate number of Reserved Instances needed to meet your business needs. After you’ve properly integrated Reserved Instances into your purchasing patterns, the savings can be significant. If your coverage is consistently below 70%, you should seriously consider adjusting your purchase types and opting for more Reserved Instances.

Data transfer costs

One of the great things about AWS data storage is that it’s incredibly cheap. Most charges come from moving and processing that data. There are several different prices for transferring data, broken out largely by transfers between regions and Availability Zones. Transfers between regions are the most costly, followed by transfers between Availability Zones. Transfers within the same region and same Availability Zone are free unless using Elastic or public IP addresses, in which case there is a cost. You can find more detailed information in the AWS Pricing Docs. With this in mind, there are several simple strategies for helping reduce costs.

First, since costs increase when transferring data between regions, it’s wise to ensure that as many services as possible reside within the same region. The more you can localize services to one specific region, the lower your costs will be.

Second, you should maximize the data you’re routing directly within AWS services and IP addresses. Transfers out to the open internet are the most costly and least performant mechanisms of data transfers, so it’s best to keep transfers within AWS services.

Lastly, data transfers between private IP addresses are cheaper than between Elastic or public IP addresses, so utilizing private IP addresses as much as possible is the most cost-effective strategy.

The following query provides a table depicting the total costs for each AWS product, broken out by transfer cost type. Substitute the “lineitem_productcode” field in the query to segment the costs by any other attribute. If you notice any unusually high spikes in cost, you’ll need to dig deeper to understand what’s driving that spike: location, volume, and so on. Drill down into specific costs by including “product_usagetype” and “product_transfertype” in your query to identify the types of transfer costs that are driving up your bill.

SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-In')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_inbound_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-Out')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_outbound_data_transfer_cost"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
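The three REGEXP_LIKE conditions are substring matches, so a row whose usage type contains DataTransfer-In or DataTransfer-Out is also counted in the overall data transfer total. The sketch below reproduces that bucketing in Python to make the overlap explicit. The function name and the sample usage type strings are illustrative assumptions, not values taken from a real report.

```python
import re
from typing import Dict, Iterable, Tuple


def transfer_costs(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum cost into the same three buckets the query computes."""
    totals = {"data_transfer": 0.0, "inbound": 0.0, "outbound": 0.0}
    for usage_type, cost in rows:
        # re.search matches anywhere in the string, like REGEXP_LIKE in the query.
        if re.search("DataTransfer", usage_type):
            totals["data_transfer"] += cost
        if re.search("DataTransfer-In", usage_type):
            totals["inbound"] += cost
        if re.search("DataTransfer-Out", usage_type):
            totals["outbound"] += cost
    return totals


if __name__ == "__main__":
    # Illustrative usage type strings and costs.
    sample = [
        ("USW2-DataTransfer-Out-Bytes", 4.0),
        ("DataTransfer-In-Bytes", 1.0),
        ("BoxUsage:t2.micro", 10.0),
    ]
    print(transfer_costs(sample))
```

Any usage type that does not contain the literal text DataTransfer is left out of all three buckets, so it is worth checking the distinct usage types in your own report before relying on these totals.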

When moving between regions or over the open web, many data transfer costs also include the origin and destination location of the data movement. Using a BI tool with mapping capabilities, you can get a nice visual of data flows. The point at the center of the map is used to represent external data flows over the open internet.

Analysis by tags

AWS provides the option to apply custom tags to individual resources, so you can allocate costs over whatever customized segment makes the most sense for your business. For a SaaS company that hosts software for customers on AWS, maybe you’d want to tag the size of each customer. The following query uses custom tags to display the reserved, data transfer, and total cost for each AWS service, broken out by tag categories, over the last 6 months. You’ll want to substitute the cost_and_usage.resourcetags_customersegment and cost_and_usage.customer_segment with the name of your customer field.

 

SELECT * FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY z___min_rank) as z___pivot_row_rank, RANK() OVER (PARTITION BY z__pivot_col_rank ORDER BY z___min_rank) as z__pivot_col_ordering FROM (
SELECT *, MIN(z___rank) OVER (PARTITION BY "cost_and_usage.product_code") as z___min_rank FROM (
SELECT *, RANK() OVER (ORDER BY CASE WHEN z__pivot_col_rank=1 THEN (CASE WHEN "cost_and_usage.total_unblended_cost" IS NOT NULL THEN 0 ELSE 1 END) ELSE 2 END, CASE WHEN z__pivot_col_rank=1 THEN "cost_and_usage.total_unblended_cost" ELSE NULL END DESC, "cost_and_usage.total_unblended_cost" DESC, z__pivot_col_rank, "cost_and_usage.product_code") AS z___rank FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY CASE WHEN "cost_and_usage.customer_segment" IS NULL THEN 1 ELSE 0 END, "cost_and_usage.customer_segment") AS z__pivot_col_rank FROM (
SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	cost_and_usage.resourcetags_customersegment  AS "cost_and_usage.customer_segment",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_data_transfers_unblended",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.unblended_percent_spend_on_ris"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1,2) ww
) bb WHERE z__pivot_col_rank <= 16384
) aa
) xx
) zz
 WHERE z___pivot_row_rank <= 500 OR z__pivot_col_ordering = 1 ORDER BY z___pivot_row_rank
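Most of the nesting in this query is pivot bookkeeping (the ranking of rows and columns). The aggregation at its core is a sum of cost grouped by product and tag value. A minimal sketch of that core in Python, with a hypothetical function name, made-up segment values, and an "untagged" label chosen here for rows with no tag:

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def cost_by_product_and_tag(
    rows: Iterable[Tuple[str, Optional[str], float]]
) -> Dict[str, Dict[str, float]]:
    """Sum cost per product, broken out by tag value."""
    pivot: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for product, segment, cost in rows:
        # Resources without a tag value are grouped under "untagged".
        pivot[product][segment or "untagged"] += cost
    return {product: dict(segments) for product, segments in pivot.items()}
```

A large untagged bucket is usually the first thing such a pivot reveals, and it is a good prompt to tighten tagging before drawing conclusions from the per-segment numbers.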

The resulting table in this example looks like the results below. In this example, you can tell that we’re making poor use of Reserved Instances because they represent such a small portion of our overall costs.

Again, using a BI tool to visualize these costs and trends over time makes the analysis much easier to consume and take action on.

Summary

Saving costs on your AWS spend is always an iterative, ongoing process. Hopefully with these queries alone, you can start to understand your spending patterns and identify opportunities for savings. However, this is just a peek into the many opportunities available through analysis of the Cost and Usage report. Each company is different, with unique needs and usage patterns. To achieve maximum cost savings, we encourage you to set up an analytics environment that enables your team to explore all potential cuts and slices of your usage data, whenever it’s necessary. Exploring different trends and spikes across regions, services, user types, etc. helps you gain a comprehensive understanding of your major cost levers and consistently implement new cost reduction strategies.

Note that all of the queries and analysis provided in this post were generated using the Looker data platform. If you’re already a Looker customer, you can get all of this analysis, additional pre-configured dashboards, and much more using Looker Blocks for AWS.


About the Author

Dillon Morrison leads the Platform Ecosystem at Looker. He enjoys exploring new technologies and architecting the most efficient data solutions for the business needs of his company and their customers. In his spare time, you’ll find Dillon rock climbing in the Bay Area or nose deep in the docs of the latest AWS product release at his favorite cafe (“Arlequin in SF is unbeatable!”).

 

 

 

Top 10 Most Obvious Hacks of All Time (v0.9)

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/top-10-most-obvious-hacks-of-all-time.html

For teaching hacking/cybersecurity, I thought I’d create a list of the most obvious hacks of all time. Not the best hacks, the most sophisticated hacks, or the hacks with the biggest impact, but the most obvious hacks: ones that even the least knowledgeable among us should be able to understand. Below I propose some hacks that fit this bill, though in no particular order.

The reason I’m writing this is that my niece wants me to teach her some hacking. I thought I’d start with the obvious stuff first.

Shared Passwords

If you use the same password for every website, and one of those websites gets hacked, then the hacker has your password for all your websites. The reason your Facebook account got hacked wasn’t because of anything Facebook did, but because you used the same email-address and password when creating an account on “beagleforums.com”, which got hacked last year.

I’ve heard people say “I’m secure, because I choose a complex password and use it everywhere”. No, this is the very worst thing you can do. Sure, you can use the same password on all sites you don’t care much about, but for Facebook, your email account, and your bank, you should have a unique password, so that when other sites get hacked, your important sites are secure.

And yes, it’s okay to write down your passwords on paper.

Tools: HaveIBeenPwned.com

PIN encrypted PDFs

My accountant emails PDF statements encrypted with the last 4 digits of my Social Security Number. This is not encryption — a 4 digit number has only 10,000 combinations, and a hacker can guess all of them in seconds.
PIN numbers for ATM cards work because ATM machines are online, and the machine can reject your card after four guesses. PIN numbers don’t work for documents, because they are offline — the hacker has a copy of the document on their own machine, disconnected from the Internet, and can continue making bad guesses with no restrictions.
Passwords protecting documents must be long enough that even trillions upon trillions of guesses are insufficient to find them.
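To make the arithmetic concrete, here is a minimal sketch of an offline guessing loop. It does not touch a real PDF: `make_checker` is a hypothetical stand-in for "try to decrypt the document with this PIN", implemented as a simple hash comparison, and the function names are my own.

```python
import hashlib
from itertools import product


def make_checker(secret_pin: str):
    """Hypothetical stand-in for 'does this PIN open the document?'."""
    target = hashlib.sha256(secret_pin.encode()).hexdigest()
    return lambda guess: hashlib.sha256(guess.encode()).hexdigest() == target


def brute_force_pin(check, digits: int = 4):
    """Try every PIN of the given length, in order from 0000 upward.

    Returns (pin, attempts), or (None, attempts) if nothing matched.
    Nothing stops the loop early because the check runs offline.
    """
    attempts = 0
    for combo in product("0123456789", repeat=digits):
        attempts += 1
        guess = "".join(combo)
        if check(guess):
            return guess, attempts
    return None, attempts


if __name__ == "__main__":
    # Worst case for a 4-digit PIN: the very last candidate.
    print(brute_force_pin(make_checker("9999")))
```

Even the worst case is only 10,000 attempts, which is why a short numeric PIN offers no real protection for a file the attacker holds a copy of.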

Tools: Hashcat, John the Ripper

SQL and other injection

The lazy way of combining websites with databases is to combine user input with an SQL statement. This combines code with data, so the obvious consequence is that hackers can craft data to mess with the code.
No, this isn’t obvious to the general public, but it should be obvious to programmers. The moment you write code that adds unfiltered user-input to an SQL statement, the consequence should be obvious. Yet, “SQL injection” has remained one of the most effective hacks for the last 15 years because somehow programmers don’t understand the consequence.
CGI shell injection is a similar issue. Back in early days, when “CGI scripts” were a thing, it was really important, but these days, not so much, so I just included it with SQL. The consequence of executing shell code should’ve been obvious, but weirdly, it wasn’t. The IT guy at the company I worked for back in the late 1990s came to me and asked “this guy says we have a vulnerability, is he full of shit?”, and I had to answer “no, he’s right — obviously so”.
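The SQL case described above can be sketched with Python's built-in `sqlite3` module. The table, column names, and sample data are made up for illustration; the point is only the difference between pasting input into the SQL text and passing it as a bound parameter.

```python
import sqlite3


def setup() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two sample users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input becomes part of the SQL code itself.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the input is passed as data and is never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = setup()
    evil = "x' OR '1'='1"
    print(lookup_unsafe(conn, evil))  # every user's secret leaks
    print(lookup_safe(conn, evil))    # no user has that literal name
```

With the crafted input, the unsafe version's WHERE clause becomes always true and returns every row, while the parameterized version simply finds no user by that name.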

XSS (“Cross Site Scripting”) [*] is another injection issue, but this time at somebody’s web browser rather than a server. It works because websites will echo back what is sent to them. For example, if you search for Cross Site Scripting with the URL https://www.google.com/search?q=cross+site+scripting, then you’ll get a page back from the server that contains that string. If the string is JavaScript code rather than text, then some servers (though not Google) send back the code in the page in a way that it’ll be executed. This is most often used to hack somebody’s account: you send them an email or tweet a link, and when they click on it, the JavaScript gives control of the account to the hacker.

Cross site injection issues like this should probably be their own category, but I’m including it here for now.
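The echo-back problem behind XSS can be sketched in a few lines. The two render functions are hypothetical stand-ins for a search results page; the only real library call is `html.escape` from Python's standard library.

```python
import html


def render_unsafe(query: str) -> str:
    # Echoes the search string straight into the page, markup and all.
    return "<p>Results for: " + query + "</p>"


def render_safe(query: str) -> str:
    # Escapes <, >, & and quotes so the browser displays the string as text.
    return "<p>Results for: " + html.escape(query) + "</p>"


if __name__ == "__main__":
    payload = "<script>alert(1)</script>"
    print(render_unsafe(payload))
    print(render_safe(payload))
```

In the unsafe page the browser sees a real script tag; in the escaped page it sees harmless text.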

More: Wikipedia on SQL injection, Wikipedia on cross site scripting.
Tools: Burpsuite, SQLmap

Buffer overflows

In the C programming language, programmers first create a buffer, then read input into it. If input is longer than the buffer, then it overflows. The extra bytes overwrite other parts of the program, letting the hacker run code.
Again, it’s not a thing the general public is expected to know about, but is instead something C programmers should be expected to understand. They should know that it’s up to them to check the length and stop reading input before it overflows the buffer, that there’s no language feature that takes care of this for them.
We are three decades after the first major buffer overflow exploits, so there is no excuse for C programmers not to understand this issue.
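Python checks bounds for you, so the sketch below only simulates the C situation: a flat `bytearray` stands in for memory, with an 8-byte input buffer followed directly by a 4-byte "return address". The layout, sizes, and addresses are assumptions chosen for illustration, not how any particular program is laid out.

```python
BUF_SIZE = 8            # size of the simulated input buffer
RET_OFFSET = BUF_SIZE   # the "return address" sits right after the buffer
RET_SIZE = 4


def new_memory() -> bytearray:
    """Simulated memory: an empty buffer followed by a return address of 0x1000."""
    mem = bytearray(BUF_SIZE + RET_SIZE)
    mem[RET_OFFSET:RET_OFFSET + RET_SIZE] = (0x1000).to_bytes(RET_SIZE, "little")
    return mem


def unchecked_copy(mem: bytearray, data: bytes) -> None:
    """Like an unchecked C read: copies every input byte, ignoring BUF_SIZE.

    (Input longer than the whole simulated memory raises IndexError here;
    real C would just keep writing.)
    """
    for i, byte in enumerate(data):
        mem[i] = byte


def checked_copy(mem: bytearray, data: bytes) -> None:
    """What the programmer is expected to do: stop at the end of the buffer."""
    for i, byte in enumerate(data[:BUF_SIZE]):
        mem[i] = byte


def return_address(mem: bytearray) -> int:
    return int.from_bytes(mem[RET_OFFSET:RET_OFFSET + RET_SIZE], "little")


if __name__ == "__main__":
    payload = b"A" * BUF_SIZE + (0xDEADBEEF).to_bytes(RET_SIZE, "little")
    mem = new_memory()
    unchecked_copy(mem, payload)
    print(hex(return_address(mem)))  # attacker-chosen value
```

The extra four bytes of input land on the return address, which is the step that lets an attacker redirect execution in a real exploit.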

What makes buffer overflows particularly obvious is the way they are wrapped in exploits, like in Metasploit. While the bug itself is obviously a bug, actually exploiting it can take some very non-obvious skill. However, once that exploit is written, any trained monkey can press a button and run the exploit. That’s where we get the insult “script kiddie” from: it refers to wannabe-hackers who never learn enough to write their own exploits, but who spend a lot of time running the exploit scripts written by better hackers than they.

More: Wikipedia on buffer overflow, Wikipedia on script kiddie,  “Smashing The Stack For Fun And Profit” — Phrack (1996)
Tools: bash, Metasploit

SendMail DEBUG command (historical)

The first popular email server in the 1980s was called “SendMail”. It had a feature whereby if you sent a “DEBUG” command to it, it would execute any code following the command. The consequence of this was obvious: hackers could (and did) upload code to take control of the server. This was used in the Morris Worm of 1988. Most Internet machines of the day ran SendMail, so the worm spread fast, infecting most machines.
This bug was mostly ignored at the time. It was thought of as a theoretical problem that might only rarely be used to hack a system. Part of the motivation of the Morris Worm was to demonstrate the consequences of such problems, consequences that should’ve been obvious but somehow were rejected by everyone.

More: Wikipedia on Morris Worm

Email Attachments/Links

I’m conflicted whether I should add this or not, because here’s the deal: you are supposed to click on attachments and links within emails. That’s what they are there for. The difference between good and bad attachments/links is not obvious. Indeed, easy-to-use email systems make detecting the difference harder.
On the other hand, the consequences of bad attachments/links are obvious. Worms like ILOVEYOU spread so easily because people trusted attachments coming from their friends, and ran them.
We have no solution to the problem of bad email attachments and links. Viruses and phishing are pervasive problems. Yet, we know why they exist.

Default and backdoor passwords

The Mirai botnet was caused by surveillance-cameras having default and backdoor passwords, and being exposed to the Internet without a firewall. The consequence should be obvious: people will discover the passwords and use them to take control of the bots.
Surveillance-cameras have the problem that they are usually exposed to the public, and can’t be reached without a ladder — often a really tall ladder. Therefore, you don’t want a button consumers can press to reset to factory defaults. You want a remote way to reset them. Therefore, they put backdoor passwords to do the reset. Such passwords are easy for hackers to reverse-engineer, and hence, take control of millions of cameras across the Internet.
The same reasoning applies to “default” passwords. Many users will not change the defaults, leaving a ton of devices hackers can hack.

Masscan and background radiation of the Internet

I’ve written a tool that can easily scan the entire Internet in a short period of time. It surprises people that this is possible, but it’s obvious from the numbers. Internet addresses are only 32 bits long, or roughly 4 billion combinations. A fast Internet link can easily handle 1 million packets-per-second, so the entire Internet can be scanned in roughly 4000 seconds, a little more than an hour. It’s basic math.
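The estimate can be checked with a few lines of arithmetic. This is only the back-of-the-envelope calculation from the paragraph above, not anything masscan itself does; the function name is my own.

```python
def scan_seconds(address_bits: int = 32, packets_per_second: int = 1_000_000) -> float:
    """Seconds needed to send one packet to every address in the space."""
    return (2 ** address_bits) / packets_per_second


secs = scan_seconds()
print(f"{secs:.0f} seconds, about {secs / 60:.0f} minutes")
# prints "4295 seconds, about 72 minutes"
```

The exact figure is a bit above the rounded 4000 seconds because 2^32 is closer to 4.3 billion than 4 billion.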
Because it’s so easy, many people do it. If you monitor your Internet link, you’ll see a steady trickle of packets coming in from all over the Internet, especially Russia and China, from hackers scanning the Internet for things they can hack.
People’s reaction to this scanning is weirdly emotional, taking it personally, such as:
  1. Why are they hacking me? What did I do to them?
  2. Great! They are hacking me! That must mean I’m important!
  3. Grrr! How dare they?! How can I hack them back for some retribution!?

I find this odd, because obviously such scanning isn’t personal, the hackers have no idea who you are.

Tools: masscan, firewalls

Packet-sniffing, sidejacking

If you connect to the Starbucks WiFi, a hacker nearby can easily eavesdrop on your network traffic, because it’s not encrypted. Windows even warns you about this, in case you weren’t sure.

At DefCon, they have a “Wall of Sheep”, where they show passwords from people who logged onto stuff using the insecure “DefCon-Open” network. They call them “sheep” for not grasping the basic fact that unencrypted traffic is unencrypted.

To be fair, it’s actually non-obvious to many people. Even if the WiFi itself is not encrypted, SSL traffic is. They expect their services to be encrypted, without them having to worry about it. And in fact, most are, especially Google, Facebook, Twitter, Apple, and other major services that won’t allow you to log in anymore without encryption.

But many services (especially old ones) may not be encrypted. Unless users check and verify them carefully, they’ll happily expose passwords.

What’s interesting about this is the situation 10 years ago, when most services used SSL only to encrypt the passwords, but then used unencrypted connections after that, relying on “cookies”. This allowed the cookies to be sniffed and stolen, allowing other people to share the login session. I used this on stage at BlackHat to connect to somebody’s GMail session. Google, and other major websites, fixed this soon after. But it should never have been a problem, because the sidejacking of cookies should have been obvious.

Tools: Wireshark, dsniff

Stuxnet LNK vulnerability

Again, this issue isn’t obvious to the public, but it should’ve been obvious to anybody who knew how Windows works.
When Windows loads a .dll, it first calls the function DllMain(). A Windows link file (.lnk) can load icons/graphics from the resources in a .dll file. It does this by loading the .dll file, thus calling DllMain. Thus, a hacker could put on a USB drive a .lnk file pointing to a .dll file, and thus, cause arbitrary code execution as soon as a user inserted a drive.
I say this is obvious because I did this, created .lnks that pointed to .dlls, but without hostile DllMain code. The consequence should’ve been obvious to me, but I totally missed the connection. We all missed the connection, for decades.

Social Engineering and Tech Support [* * *]

After posting this, many people have pointed out “social engineering”, especially of “tech support”. This probably should be up near #1 in terms of obviousness.

The classic example of social engineering is when you call tech support and tell them you’ve lost your password, and they reset it for you with a minimum of questions proving who you are. For example, you set the volume on your computer really loud and play the sound of a crying baby in the background and appear to be a bit frazzled and incoherent, which explains why you aren’t answering the questions they are asking. They, understanding your predicament as a new parent, will go the extra mile in helping you, resetting “your” password.

One of the interesting consequences is how it affects domain names (DNS). It’s quite easy in many cases to call up the registrar and convince them to transfer a domain name. This has been used in lots of hacks. It’s really hard to defend against. If a registrar charges only $9/year for a domain name, then it really can’t afford to provide very good tech support — or very secure tech support — to prevent this sort of hack.

Social engineering is such a huge and obvious problem that it’s outside the scope of this document. Just google it to find example after example.

A related issue that perhaps deserves its own section is OSINT [*], or “open-source intelligence”, where you gather public information about a target. For example, on the day the bank manager is out on vacation (which you got from their Facebook post) you show up and claim to be a bank auditor, and are shown into their office where you grab their backup tapes. (We’ve actually done this).

More: Wikipedia on Social Engineering, Wikipedia on OSINT, “How I Won the Defcon Social Engineering CTF” — blogpost (2011), “Questioning 42: Where’s the Engineering in Social Engineering of Namespace Compromises” — BSidesLV talk (2016)

Blue-boxes (historical) [*]

Telephones historically used what we call “in-band signaling”. That’s why when you dial on an old phone, it makes sounds — those sounds are sent no differently than the way your voice is sent. Thus, it was possible to make tone generators to do things other than simply dial calls. Early hackers (in the 1970s) would make tone-generators called “blue-boxes” and “black-boxes” to make free long distance calls, for example.

These days, “signaling” and “voice” are digitized, then sent as separate channels or “bands”. This is called “out-of-band signaling”. You can’t trick the phone system by generating tones. When your iPhone makes sounds when you dial, it’s entirely for your benefit and has nothing to do with how it signals the cell tower to make a call.

Early hackers, like the founders of Apple, are famous for having started their careers making such “boxes” for tricking the phone system. The problem was obvious back in the day, which is why, as the phone system moved from analog to digital, the problem was fixed.

More: Wikipedia on blue box, Wikipedia article on Steve Wozniak.

Thumb drives in parking lots [*]

A simple trick is to put a virus on a USB flash drive, and drop it in a parking lot. Somebody is bound to notice it, stick it in their computer, and open the file.

This can be extended with tricks. For example, you can put a file labeled “third-quarter-salaries.xlsx” on the drive that requires macros to be run in order to open. It’s irresistible to other employees who want to know what their peers are being paid, so they’ll bypass any warning prompts in order to see the data.

Another example is to go online and get custom USB sticks made printed with the logo of the target company, making them seem more trustworthy.

We also did a trick of taking an Adobe Flash game, “Punch the Monkey”, and replacing the monkey with a logo of a competitor of our target. They not only played the game (infecting themselves with our virus), but gave it to others inside the company to play, infecting others, including the CEO.

Thumb drives like this have been used in many incidents, such as Russians hacking military headquarters in Afghanistan. It’s really hard to defend against.

More: “Computer Virus Hits U.S. Military Base in Afghanistan” — USNews (2008), “The Return of the Worm That Ate The Pentagon” — Wired (2011), DoD Bans Flash Drives — Stripes (2008)

Googling [*]

Search engines like Google will index your website — your entire website. Frequently companies put things on their website without much protection because they are nearly impossible for users to find. But Google finds them, then indexes them, causing them to pop up with innocent searches.
There are books written on “Google hacking” explaining what search terms to look for, like “not for public release”, in order to find such documents.

More: Wikipedia entry on Google Hacking, “Google Hacking” book.

URL editing [*]

At the top of every browser is what’s called the “URL”. You can change it. Thus, if you see a URL that looks like this:

http://www.example.com/documents?id=138493

Then you can edit it to see the next document on the server:

http://www.example.com/documents?id=138494

The owner of the website may think they are secure, because nothing points to this document, so the Google search won’t find it. But that doesn’t stop a user from manually editing the URL.
An example of this is a big Fortune 500 company that posts the quarterly results to the website an hour before the official announcement. Simply editing the URL from previous financial announcements allows hackers to find the document, then buy/sell the stock as appropriate in order to make a lot of money.
Another example is the classic case of Andrew “Weev” Auernheimer who did this trick in order to download the account email addresses of early owners of the iPad, including movie stars and members of the Obama administration. It’s an interesting legal case because on one hand, techies consider this so obvious as to not be “hacking”. On the other hand, non-techies, especially judges and prosecutors, believe this to be obviously “hacking”.
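How little effort the edit takes can be shown with the standard `urllib.parse` module. The helper name and the `id` parameter are taken from the example URL above; the point of the sketch is that a sequential identifier in a URL is a counter, not an access control.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_document_url(url: str, param: str = "id") -> str:
    """Return the same URL with the numeric query parameter increased by one."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query[param] = [str(int(query[param][0]) + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(next_document_url("http://www.example.com/documents?id=138493"))
# prints "http://www.example.com/documents?id=138494"
```

If the server hands over whatever document the number points to, the only defense is for the server to check that the requester is allowed to see it.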

DDoS, spoofing, and amplification [*]

For decades now, online gamers have figured out an easy way to win: just flood the opponent with Internet traffic, slowing their network connection. This is called a DoS, which stands for “Denial of Service”. DoSing game competitors is often a teenager’s first foray into hacking.
A variant of this is when you hack a bunch of other machines on the Internet, then command them to flood your target. (The hacked machines are often called a “botnet”, a network of robot computers). This is called DDoS, or “Distributed DoS”. At this point, it gets quite serious, as instead of competitive gamers, hackers can take down entire businesses. Extortion scams, DDoSing websites then demanding payment to stop, are a common way hackers earn money.
Another form of DDoS is “amplification”. Sometimes when you send a packet to a machine on the Internet it’ll respond with a much larger response, either a very large packet or many packets. The hacker can then send a packet to many of these sites, “spoofing” or forging the IP address of the victim. This causes all those sites to then flood the victim with traffic. Thus, with a small amount of outbound traffic, the hacker can flood the inbound traffic of the victim.
This is one of those things that has worked for 20 years, because it’s so obvious teenagers can do it, yet there is no obvious solution. President Trump’s executive order on cybersecurity specifically demanded that his government come up with a report on how to address this, but it’s unlikely that they’ll come up with any useful strategy.
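The leverage in amplification is simple multiplication. The packet sizes and bandwidth below are made-up numbers chosen only to show the calculation; real factors depend on the protocol being abused.

```python
def amplification_factor(request_bytes: int, response_bytes: int) -> float:
    """How many bytes the victim receives per byte the attacker sends."""
    return response_bytes / request_bytes


def victim_inbound_mbps(attacker_outbound_mbps: float, factor: float) -> float:
    """Traffic arriving at the victim for a given attacker send rate."""
    return attacker_outbound_mbps * factor


# Hypothetical example: a 60-byte spoofed request triggers a 3000-byte reply.
factor = amplification_factor(request_bytes=60, response_bytes=3000)
print(factor, victim_inbound_mbps(10, factor))
# prints "50.0 500.0"
```

With these assumed numbers, an attacker with a 10 Mbps uplink could direct about 500 Mbps at the victim.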

More: Wikipedia on DDoS, Wikipedia on Spoofing

Conclusion

Tweet me (@ErrataRob) your obvious hacks, so I can add them to the list.

Amazon Rekognition Update – Celebrity Recognition

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-rekognition-update-celebrity-recognition/

We launched Amazon Rekognition at re:Invent (Amazon Rekognition – Image Detection and Recognition Powered by Deep Learning) and added Image Moderation earlier this year.

Today we are adding celebrity recognition!

Rekognition has been trained to identify hundreds of thousands of people who are famous, noteworthy, or prominent in fields that include politics, sports, entertainment, business, and media. The list is global, and is updated frequently.

To access this feature, simply call the new RecognizeCelebrities function. In addition to the bounding box and facial landmark feature returned by the existing DetectFaces function, the new function returns information about any celebrities that it recognizes:

"Id": "3Ir0du6", 
"MatchConfidence": 97, 
"Name": "Jeff Bezos", 
"Urls": [ "www.imdb.com/name/nm1757263" ]

The Urls provide additional information about the celebrity. The API currently returns links to IMDB content; we may add other sources in the future.
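As a rough sketch of how the call might look from Python, the snippet below uses the boto3 `recognize_celebrities` operation and pulls the fields shown above out of the response. The helper names, the default region, and the 90% confidence threshold are my own assumptions, and running `recognize_from_file` requires boto3 and AWS credentials.

```python
def summarize_celebrities(response: dict, min_confidence: float = 90.0) -> list:
    """Return (name, confidence, urls) for each sufficiently confident match."""
    results = []
    for face in response.get("CelebrityFaces", []):
        if face.get("MatchConfidence", 0) >= min_confidence:
            results.append(
                (face["Name"], face["MatchConfidence"], face.get("Urls", []))
            )
    return results


def recognize_from_file(path: str, region: str = "us-east-1") -> list:
    """Send a local image to Rekognition and summarize the matches."""
    import boto3  # imported lazily so the helper above works without AWS access

    client = boto3.client("rekognition", region_name=region)
    with open(path, "rb") as image_file:
        response = client.recognize_celebrities(Image={"Bytes": image_file.read()})
    return summarize_celebrities(response)


if __name__ == "__main__":
    # Response fragment from the post, wrapped in the CelebrityFaces list.
    sample = {
        "CelebrityFaces": [
            {
                "Id": "3Ir0du6",
                "MatchConfidence": 97,
                "Name": "Jeff Bezos",
                "Urls": ["www.imdb.com/name/nm1757263"],
            }
        ]
    }
    print(summarize_celebrities(sample))
```

Keeping the response parsing separate from the network call makes it easy to try the logic against saved responses before pointing it at real images.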

You can use the Celebrity Recognition Demo in the AWS Management Console to experiment with this feature:

If you have an image archive you can now index it by celebrity. You could also use a combination of celebrity recognition and object detection to build all kinds of search tools. If your images are already stored in S3, you can process them in-place.

I’m sure that you will come up with all sorts of interesting uses for this new feature. Leave me a comment and let me know what you build!

Jeff;

 

AWS Enables Consortium Science to Accelerate Discovery

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-enables-consortium-science-to-accelerate-discovery/

My colleague Mia Champion is a scientist (check out her publications), an AWS Certified Solutions Architect, and an AWS Certified Developer. The time that she spent doing research on large datasets gave her an appreciation for the value of cloud computing in the bioinformatics space, which she summarizes and explains in the guest post below!

Jeff;


Technological advances in scientific research continue to enable the collection of exponentially growing datasets that are also increasing in the complexity of their content. The global pace of innovation is now also fueled by the recent cloud-computing revolution, which provides researchers with a seemingly boundless scalable and agile infrastructure. Now, researchers can remove the hindrances of having to own and maintain their own sequencers, microscopes, compute clusters, and more. Using the cloud, scientists can easily store, manage, process and share datasets for millions of patient samples with gigabytes and more of data for each individual. As the American physicist John Bardeen once said: “Science is a collaborative effort. The combined results of several people working together is much more effective than could be that of an individual scientist working alone.”

Prioritizing Reproducible Innovation, Democratization, and Data Protection
Today, we have many individual researchers and organizations leveraging secure cloud enabled data sharing on an unprecedented scale and producing innovative, customized analytical solutions using the AWS cloud.  But, can secure data sharing and analytics be done on such a collaborative scale as to revolutionize the way science is done across a domain of interest or even across discipline/s of science? Can building a cloud-enabled consortium of resources remove the analytical variability that leads to diminished reproducibility, which has long plagued the interpretability and impact of research discoveries? The answers to these questions are ‘yes’ and initiatives such as the Neuro Cloud Consortium, The Global Alliance for Genomics and Health (GA4GH), and The Sage Bionetworks Synapse platform, which powers many research consortiums including the DREAM challenges, are starting to put into practice model cloud-initiatives that will not only provide impactful discoveries in the areas of neuroscience, infectious disease, and cancer, but are also revolutionizing the way in which scientific research is done.

Bringing Crowd Developed Models, Algorithms, and Functions to the Data
Collaborative projects have traditionally allowed investigators to download datasets such as those used for comparative sequence analysis or for training a deep learning algorithm on medical imaging data. Investigators were then able to develop and execute their analysis using institutional clusters, local workstations, or even laptops:

This method of collaboration is problematic for many reasons. The first concern is data security, since dataset download essentially permits “chain-data-sharing” with any number of recipients. Second, analytics done using compute environments that are not templated at some level introduces the risk of variable analytics that itself is not reproducible by a different investigator, or even the same investigator using a different compute environment. Third, the required data dump, processing, and then re-upload or distribution to the collaborative group is highly inefficient and dependent upon each individual’s networking and compute capabilities. Overall, traditional methods of scientific collaboration have introduced methods in which security is compromised and time to discovery is hampered.

Using the AWS cloud, collaborative researchers can share datasets easily and securely by taking advantage of Identity and Access Management (IAM) policy restrictions for user bucket access as well as S3 bucket policies or Access Control Lists (ACLs). To streamline analysis and ensure data security, many researchers are eliminating the necessity to download datasets entirely by leveraging resources that facilitate moving the analytics to the data source and/or taking advantage of remote API requests to access a shared database or data lake. One way our customers are accomplishing this is to leverage container based Docker technology to provide collaborators with a way to submit algorithms or models for execution on the system hosting the shared datasets:

Docker container images have all of the application’s dependencies bundled together, and therefore provide a high degree of versatility and portability, which is a significant advantage over using other executable-based approaches. In the case of collaborative machine learning projects, each docker container will contain applications, language runtime, packages and libraries, as well as any of the more popular deep learning frameworks commonly used by researchers including: MXNet, Caffe, TensorFlow, and Theano.

A common feature in these frameworks is the ability to leverage a host machine’s Graphical Processing Units (GPUs) for significant acceleration of the matrix and vector operations involved in the machine learning computations. As such, researchers with these objectives can leverage EC2’s new P2 instance types in order to power execution of submitted machine learning models. In addition, GPUs can be mounted directly to containers using the NVIDIA Docker tool and appear at the system level as additional devices. By leveraging Amazon EC2 Container Service and the EC2 Container Registry, collaborators are able to execute analytical solutions submitted to the project repository by their colleagues in a reproducible fashion as well as continue to build on their existing environment.  Researchers can also architect a continuous deployment pipeline to run their docker-enabled workflows.

In conclusion, emerging cloud-enabled consortium initiatives serve as models for the broader research community for how cloud-enabled community science can expedite discoveries in Precision Medicine while also providing a platform where data security and discovery reproducibility are inherent to the project execution.

Mia D. Champion, Ph.D.

 

Pollexy – Building a Special Needs Voice Assistant with Amazon Polly and Raspberry Pi

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/pollexy-building-a-special-needs-voice-assistant-with-amazon-polly-and-raspberry-pi/

April is Autism Awareness month and about 1 in 68 children in the U.S. have been identified with autism spectrum disorder (ASD) (CDC 2014). In this post from Troy Larson, a Sr. Devops Cloud Architect here at AWS, you get an introduction to a project he has been working on to help his son Calvin.

I have been asked how the minds at AWS come up with so many different ideas. Sometimes they come from a deeply personal place, where someone sees a way to help others. Pollexy is an amazing example of just that. Read about Pollexy and then watch the video here.

-Ana


Background

As a computer programming parent of a 16-year old non-verbal teenage boy with autism, I have been constantly searching over the years to find ways to use technology to make our lives together safer, happier and more comfortable. At the core of this challenge is the most basic of all human interaction—communication. While Calvin is able to respond to verbal instruction, he is not able to speak responsively. In his entire life, we’ve never had a conversation. He is able to be left alone in his room to play, but most every task or set of tasks requires a human to verbally prompt him along the way. Having other children and responsibilities in the home, at times the intensity of supervision can be negatively impactful on the home dynamic.

Genesis

When I saw the announcement of Amazon Polly and Amazon Lex at re:Invent last year, I immediately started churning on how we could leverage these technologies to assist Calvin. He responds well to human verbal prompts, but would he understand a digital voice? So one Saturday, I set up a Raspberry Pi in his room and closed his door and crouched around the corner with other family members so Calvin couldn’t see us. I connected to the Raspberry Pi and instructed Polly to speak in Joanna’s familiar pacific tone, “Calvin, it’s time to take a potty break. Go out of your bedroom and go to the bathroom.” In a few seconds, we heard his doorknob turn and I poked my head out of my hiding place. Calvin passed by, looking at me quizzically, then went into the bathroom as Joanna had instructed. We all looked at each other in amazement: he had listened and responded perfectly to the completely invisible voice of someone he’d never heard before. After discussing some ideas around this with co-workers, a colleague suggested I enter the IoT and AI Science Fair at our annual AWS Sales Kick-Off meeting. Less than two months after the Polly and Lex announcement and 3500 lines of code later, Pollexy, along with Calvin, debuted at the Science Fair.

Overview

Pollexy (“Polly” + “Lex”) is a Raspberry Pi and mobile-based special needs verbal assistant that lets caretakers schedule audio task prompts and messages both on a recurring schedule and/or on-demand. Caretakers can schedule regular medicine reminder messages or hourly bathroom break messages, for example, and at the same time use their Amazon Echo and mobile device to request a specific message be played immediately. Caretakers can even set it up so that the person needs to confirm that they’ve heard the message. For example, my son won’t pay attention to Pollexy unless Pollexy first asks him to “Push the blue button.” Pollexy will wait until he has pushed the button and then speak the actual message. Other people may be able to respond verbally using Lex, or not require a confirmation at all. Pollexy can be tailored to what works best.

And then most importantly—and most challenging—in a large house, how do we make sure the person is in the room where we play the message? What if we have a special needs adult living in an in-law suite? Are they in the living room or the kitchen? And what about multiple people? What if we have multiple people in different areas of the house, each of whom has a message? Let’s explore the basic elements and tie the pieces together.

Basic Elements of Pollexy

In the spirit of Amazon’s Leadership Principle “Invent and Simplify,” we want to minimize the complexity of the Pollexy architecture. We can break Pollexy down into three types of objects and three components, all of which work together in a way that’s easily explainable.

Object #1: Person

Pollexy can support any number of people. A person is identified by a unique name. We can set basic preferences such as “requires confirmation” and, most importantly, we can define a location schedule. This means that we can create an Outlook-like schedule that sets preferences for where someone should be in the house at a given time.

Object #2: Location

A location is simply a uniquely identifiable place where a device is physically sitting. Based on the user’s location schedule, Pollexy will know which device to contact first, second, third, etc. We can also “mute” devices if needed (naptime, etc.).

Object #3: Message

Obviously, this is the actual message we want to play. Attached to each message is a person and a recurring schedule (only if it’s not a one-time message). We don’t store location with the message, because Pollexy figures out the person’s location when the message is ready to be delivered.

Component #1: Scheduler

Every message needs to be scheduled. This is a command-line tool where you basically say Tell “Calvin” that “you need to brush your teeth” every night at 8 p.m. This message is then stored in DynamoDB, waiting to be picked up by the queueing Lambda function.

Component #2: Queueing Engine

Every minute, a Lambda function runs and checks the scheduler to see if any messages are ready to be delivered. If a message is ready, the function looks up the person’s location schedule, figures out where they are, and pushes the message into an SQS queue for that location.

Component #3: Speaker Engine

Every minute on the Raspberry Pi device, the speaker engine spins up and checks the SQS queue for its location. If there are messages, the speaker engine looks at the user’s preferences and initiates communication to convey the message. If the person doesn’t respond, the speaker engine checks whether the person has a secondary location in their schedule and drops the message in the SQS queue for that location. In the end, a message will either be delivered or eventually time out (if someone is out of the house for the day).
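
The hand-off between the queueing engine and the speaker engine can be sketched in a few lines of Python. This is only an illustration of the routing idea described above, not Pollexy's actual code: the names Person, locations_at, and route_message, the hour-based schedule format, and the plain dictionaries standing in for per-location SQS queues are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """Hypothetical person record: a name plus a location schedule."""
    name: str
    # Maps an hour of the day (0-23) to locations in priority order.
    schedule: dict = field(default_factory=dict)

    def locations_at(self, hour):
        """Return the locations to try at this hour, most likely first."""
        return self.schedule.get(hour, [])


def route_message(person, message, hour, queues, muted=(), skip=()):
    """Put the message on the queue of the first usable location.

    `queues` is a dict of location name -> list, standing in for one SQS
    queue per location. `skip` holds locations already tried without a
    response. Returns the chosen location, or None when no location is
    left (the message would eventually time out).
    """
    for location in person.locations_at(hour):
        if location in muted or location in skip:
            continue
        queues.setdefault(location, []).append((person.name, message))
        return location
    return None


calvin = Person("Calvin", {20: ["bedroom", "living_room"]})
queues = {}
text = "You need to brush your teeth"
first = route_message(calvin, text, 20, queues)
# No response in the first room: fall back to the next scheduled location.
second = route_message(calvin, text, 20, queues, skip={first})
```

Keeping the location out of the message itself, as the post describes, is what makes the fallback cheap: the same message is simply re-queued for the next location in the schedule.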

Respect and Freedom are the Keys

We often take our personal privacy and respect for granted, so imagine, for a special needs person, the lack of privacy and freedom that comes with having someone constantly in your presence. This is amplified for those on the autism spectrum, where intrusion into personal space can escalate into anger and frustration. Pollexy becomes their own personal, gentle, and never-flustered friend to coach them along the way, giving them confidence, respect, and the sense of privacy and freedom we all want to enjoy.

-Troy Larson

HashPump – Exploit Hash Length Extension Attack

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/3DOE2xyGowM/

HashPump is a C++ based command line tool to exploit the Hash Length Extension Attack with various hash types supported, including MD4, MD5, SHA1, SHA256, and SHA512. There’s a good write-up of how to use this in practical terms here: Plaid CTF 2014: mtpox. You can download HashPump here:…

Read the full post at darknet.org.uk

Commenting Policy for This Blog

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/03/commenting_poli.html

Over the past few months, I have been watching my blog comments decline in civility. I blame it in part on the contentious US election and its aftermath. It’s also a consequence of not requiring visitors to register in order to post comments, and of our tolerance for impassioned conversation. Whatever the causes, I’m tired of it. Partisan nastiness is driving away visitors who might otherwise have valuable insights to offer.

I have been engaging in more active comment moderation. What that means is that I have been quicker to delete posts that are rude, insulting, or off-topic. This is my blog. I consider the comments section as analogous to a gathering at my home. It’s not a town square. Everyone is expected to be polite and respectful, and if you’re an unpleasant guest, I’m going to ask you to leave. Your freedom of speech does not compel me to publish your words.

I like people who disagree with me. I like debate. I even like arguments. But I expect everyone to behave as if they’ve been invited into my home.

I realize that I sometimes express opinions on political matters; I find they are relevant to security at all levels. On those posts, I welcome on-topic comments regarding those opinions. I don’t welcome people pissing and moaning about the fact that I’ve expressed my opinion on something other than security technology. As I said, it’s my blog.

So, please… Assume good faith. Be polite. Minimize profanity. Argue facts, not personalities. Stay on topic. If you want a model to emulate, look at Clive Robinson’s posts.

Schneier on Security is not a professional operation. There’s no advertising, so no revenue to hire staff. My part-time moderator — paid out of my own pocket — and I do what we can when we can. If you see a comment that’s spam, or off-topic, or an ad hominem attack, flag it and be patient. Don’t reply or engage; we’ll get to it. And we won’t always post an explanation when we delete something.

My own stance on privacy and anonymity means that I’m not going to require commenters to register a name or e-mail address, so that isn’t an option. And I really don’t want to disable comments.

I dislike having to deal with this problem. I’ve been proud and happy to see how interesting and useful the comments section has been all these years. I’ve watched many blogs and discussion groups descend into toxicity as a result of trolls and drive-by ideologues derailing the conversations of regular posters. I’m not going to let that happen here.

Security advisories for Monday

Post Syndicated from ris original http://lwn.net/Articles/712296/rss

CentOS has updated java-1.8.0-openjdk (C7; C6: multiple vulnerabilities).

Debian has updated libphp-swiftmailer (code execution), mariadb-10.0 (multiple mostly unspecified vulnerabilities), and openjpeg2 (multiple vulnerabilities).

Debian-LTS has updated groovy (code execution) and opus (code execution).

Fedora has updated docker-latest (F24: privilege escalation), ed (F25: denial of service), groovy (F25: code execution), libnl3 (F25; F24: privilege escalation), opus (F25; F24: code execution), qemu (F25: multiple vulnerabilities), squid (F25: two vulnerabilities), and webkitgtk4 (F25; F24: multiple vulnerabilities).

Gentoo has updated DBD-mysql (multiple vulnerabilities), dcraw (denial of service from 2015), DirectFB (two vulnerabilities from 2014), libupnp (two vulnerabilities), lua (code execution from 2014), ppp (denial of service from 2015), qemu (multiple vulnerabilities), quagga (two vulnerabilities), and zlib (multiple vulnerabilities).

Mageia has updated libpng, libpng12 (NULL dereference bug).

openSUSE has updated perl-DBD-mysql (42.2; 42.1: three vulnerabilities) and xtrabackup (42.2; 42.1: information disclosure).

Oracle has updated java-1.8.0-openjdk (OL7; OL6: multiple vulnerabilities).

SUSE has updated gstreamer-0_10-plugins-good (SLE12-SP1; SLE11-SP4: multiple vulnerabilities).

Amazon Rekognition – Image Detection and Recognition Powered by Deep Learning

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-rekognition-image-detection-and-recognition-powered-by-deep-learning/

What do you see when you look at this picture?

You might simply see an animal. Maybe you see a pet, a dog, or a Golden Retriever. The association between the image and these labels is not hard-wired into your brain. Instead, you learned the labels after seeing hundreds or thousands of examples. Operating on a number of different levels, you learned to distinguish an animal from a plant, a dog from a cat, and a Golden Retriever from other dog breeds.

Deep Learning for Image Detection
Giving computers the same level of comprehension has proven to be a very difficult task. Over the course of decades, computer scientists have taken many different approaches to the problem. Today, a broad consensus has emerged that the best way to tackle this problem is via deep learning. Deep learning uses a combination of feature abstraction and neural networks to produce results that can be (as Arthur C. Clarke once said) indistinguishable from magic. However, it comes at a considerable cost. First, you need to put a lot of work into the training phase. In essence, you present the learning network with a broad spectrum of labeled examples (“this is a dog”, “this is a pet”, and so forth) so that it can correlate features in the image with the labels. This phase is computationally expensive due to the size and the multi-layered nature of the neural networks. After the training phase is complete, evaluating new images against the trained network is far easier. The results are traditionally expressed in confidence levels (0 to 100%) rather than as cold, hard facts. This allows you to decide just how much precision is appropriate for your applications.

Introducing Amazon Rekognition
Today I would like to tell you about Amazon Rekognition. Powered by deep learning and built by our Computer Vision team over the course of many years, this fully-managed service already analyzes billions of images daily. It has been trained on thousands of objects and scenes, and is now available for you to use in your own applications. You can use the Rekognition Demos to put the service through its paces before you dive in and start writing code that uses the Rekognition API.

Rekognition was designed from the get-go to run at scale. It comprehends scenes, objects, and faces. Given an image, it will return a list of labels. Given an image with one or more faces, it will return bounding boxes for each face, along with attributes. Let’s see what it has to say about the picture of my dog (her name is Luna, by the way):

As you can see, Rekognition labeled Luna as an animal, a dog, a pet, and as a golden retriever with a high degree of confidence. It is important to note that these labels are independent, in the sense that the deep learning model does not explicitly understand the relationship between, for example, dogs and animals. It just so happens that both of these labels were simultaneously present on the dog-centric training material presented to Rekognition.

Let’s see how it does with a picture of my wife and me:

Amazon Rekognition found our faces, set up bounding boxes, and let me know that my wife was happy (the picture was taken on her birthday, so I certainly hope she was).

You can also use Rekognition to compare faces and to see if a given image contains any one of a number of faces that you have asked it to recognize.

All of this power is accessible from a set of API functions (the console is great for quick demos). For example, you can call DetectLabels to programmatically reproduce my first example, or DetectFaces to reproduce my second one. You can make multiple calls to IndexFaces to prepare Rekognition to recognize some faces. Each time you do this, Rekognition extracts some features (known as face vectors) from the image, stores the vectors, and discards the image. You can create one or more Rekognition collections and store related groups of face vectors in each one.
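
As a rough illustration of the first call, here is a minimal Python sketch that uses boto3 to run DetectLabels on an image stored in S3 and then keeps only the high-confidence labels. The helper names detect_labels and confident_labels, the thresholds, and the sample response values are assumptions made for this example; the request and response fields (Image, S3Object, MaxLabels, MinConfidence, Labels, Name, Confidence) follow the DetectLabels API.

```python
def detect_labels(bucket, key, min_confidence=75.0, max_labels=10):
    """Call Rekognition DetectLabels on an image stored in S3.

    Needs boto3 and AWS credentials; boto3 is imported lazily so the
    pure helper below can be used without either.
    """
    import boto3

    client = boto3.client("rekognition")
    return client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
        MinConfidence=min_confidence,
    )


def confident_labels(response, threshold=90.0):
    """Return (name, confidence) pairs at or above the threshold, best first."""
    pairs = [(label["Name"], label["Confidence"])
             for label in response.get("Labels", [])]
    kept = [pair for pair in pairs if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hand-written response in the DetectLabels shape (the values are made up).
sample = {"Labels": [
    {"Name": "Animal", "Confidence": 98.1},
    {"Name": "Dog", "Confidence": 97.6},
    {"Name": "Grass", "Confidence": 61.3},
]}
print(confident_labels(sample))
```

Because the results are confidence levels rather than hard facts, the threshold is the knob that trades recall for precision in your own application.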

Rekognition can directly process images stored in Amazon Simple Storage Service (S3). In fact, you can use AWS Lambda functions to process newly uploaded photos at any desired scale. You can use AWS Identity and Access Management (IAM) to control access to the Rekognition APIs.

Applications for Rekognition
So, what can you use this for? I’ve got plenty of ideas to get you started!

If you have a large collection of photos, you can tag and index them using Amazon Rekognition. Because Rekognition is a service, you can process millions of photos per day without having to worry about setting up, running, or scaling any infrastructure. You can implement visual search, tag-based browsing, and all sorts of interactive discovery models.

You can use Rekognition in several different authentication and security contexts. You can compare a face on a webcam to a badge photo before allowing an employee to enter a secure zone. You can perform visual surveillance, inspecting photos for objects or people of interest or concern.

You can build “smart” marketing billboards that collect demographic data about viewers.

Now Available
Rekognition is now available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions and you can start using it today. As part of the AWS Free Tier, you can analyze up to 5,000 images per month and store up to 1,000 face vectors each month for an entire year. After that (and at higher volume), you will pay tiered pricing based on the number of images that you analyze and the number of face vectors that you store.

Jeff;

 

DyMerge – Bruteforce Dictionary Merging Tool

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/sHUaOaOOPPk/

DyMerge is a simple, yet powerful bruteforce dictionary merging tool – written purely in python – which takes given wordlists and merges them into one dynamic dictionary that can then be used as ammunition for a successful dictionary based (or bruteforce) attack. One day the author was making his way through a ctf challenge, and […]

The…

Read the full post at darknet.org.uk

Combining Druid and DataSketches for Real-time, Robust Behavioral Analytics

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/147711922956

By Himanshu Gupta

Millions of users around the world interact with Yahoo through their web browsers and mobile devices, generating billions of events every day (e.g. clicking on ads, clicking on various pages of interest, and logging in). As Yahoo’s data grows larger and more complex, we are investing in new ways to better manage and make sense of it. Behavioral analytics is one important branch of analytics in which we are making significant advancements, and is helping us accomplish these tasks.

Beyond simply measuring how many times a user has performed a certain action, we also try to understand patterns in their actions. We do this in order to help us decide which of our features are impactful and might grow our user base, and to understand responses to ads that might help us improve users’ future experiences.

One example of behavioral analytics is measuring user retention rates for Yahoo properties such as Mail, News, and Finance, and breaking down these rates by different user demographics. Another example is to determine which ads perform well for various types of users (as measured by various signals), and to serve ads appropriately based on that implicit or explicit feedback.

The challenges we face in answering these questions mainly concern storing and interactively querying our user-generated events at massive scale. We make heavy use of distributed systems, and Druid is at the forefront of powering most of our real-time analytics at scale.

One of the features that makes Druid very useful is the ability to summarize data at storage time. This leads to greatly-reduced storage requirements, and hence, faster queries. For example, consider the dataset below:

This data represents ad clicks for different website domains. We can see that there are many repeated attributes, which we call “dimensions,” in our data across different timestamps. Now, most of the time we don’t care that a certain ad was clicked at a precise millisecond in time. What is a lot more interesting to us is how many times an ad was clicked over the period of an hour. Thus, we can truncate the raw event timestamps and group all events with the same set of dimensions. When we group the dimensions, we also aggregate the raw event values for the “clicked” column.

This method is known as summarization, and in practice, we see summarization significantly reduce the amount of raw data we have to store. We’ve chosen to lose some information about the time an event occurred, but there is no loss of fidelity for the “clicked” metric that we really care about.
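
To make the idea concrete, here is a small Python sketch of hourly summarization. It is not Druid code; the roll_up function and the sample events are made up for illustration, and a single “domain” dimension is assumed.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events):
    """Truncate timestamps to the hour and sum clicks per (hour, domain)."""
    totals = defaultdict(int)
    for timestamp, domain, clicked in events:
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0)
        totals[(hour.isoformat(), domain)] += clicked
    return dict(totals)


# Made-up raw events: (timestamp, domain, clicked).
raw = [
    ("2016-01-01T00:00:01", "example.com", 1),
    ("2016-01-01T00:12:45", "example.com", 0),
    ("2016-01-01T00:47:10", "example.com", 1),
    ("2016-01-01T00:30:00", "example.org", 1),
    ("2016-01-01T01:05:00", "example.com", 1),
]
summary = roll_up(raw)
for key in sorted(summary):
    print(key, summary[key])
```

Five raw rows collapse into three summarized rows, and the total number of clicks is unchanged; only the sub-hour timing is lost.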

Let’s consider the same dataset again, but now with information about which user performed the click. When we go to summarize our data, the high-cardinality “user-id” column, in which nearly every value is unique, prevents our data from compacting very well.

The number of unique user-ids could be very high due to the number of users visiting Yahoo every day. So, in our “user-id” column, we end up effectively storing our raw data. Given that we are mostly interested in how many unique users performed certain actions, and we don’t really care about precisely which users did those actions, it would be nice if we could somehow lose some information about the individual users so that our data could still be summarized.

One approach to solving this problem is to create a “sketch” of the user-id dimension. Instead of storing every single unique user-id, we instead maintain a hash-based data structure – also known as a sketch – which has smaller storage requirements and gives estimates of user-id dimension cardinality with predictable accuracy.

Leveraging sketches, our summarized data for the user dimension looks something like this:

Sketch algorithms are highly desirable because they are very scalable, use predictable storage, work with real-time streams of data, and provide predictable estimates. There are many different algorithms to construct different types of sketches, and a lot of fancy mathematics goes into explaining how sketch algorithms work and why they give very good estimates.
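
For intuition, here is a minimal Python sketch of one classic hash-based technique, K-Minimum-Values (KMV). It is a teaching example under stated assumptions, not the DataSketches implementation: the class name KMVSketch, the choice of SHA-1 as the hash, and the default k are all illustrative. The idea is to hash every user-id to a number in [0, 1), keep only the k smallest hashes, and estimate the unique count from how small the k-th smallest hash is.

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values sketch: remembers only the k smallest hash values."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []         # max-heap (negated values) of retained hashes
        self._members = set()   # the same hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        """Map an item to a pseudo-random float in [0, 1)."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def _add_hash(self, value):
        if value in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -value)
            self._members.add(value)
        elif value < -self._heap[0]:
            # Smaller than the current k-th minimum: swap it in.
            evicted = -heapq.heappushpop(self._heap, -value)
            self._members.discard(evicted)
            self._members.add(value)

    def add(self, item):
        self._add_hash(self._hash(item))

    def merge(self, other):
        """Return the sketch of the union of both input streams."""
        merged = KMVSketch(min(self.k, other.k))
        for value in self._members | other._members:
            merged._add_hash(value)
        return merged

    def estimate(self):
        """Estimated number of distinct items seen."""
        if len(self._heap) < self.k:
            return float(len(self._heap))   # exact while under capacity
        return (self.k - 1) / -self._heap[0]


sketch = KMVSketch(k=1024)
for i in range(20000):
    sketch.add("user-%d" % i)
    sketch.add("user-%d" % i)   # duplicates do not change the estimate
print(round(sketch.estimate()))
```

Storage is bounded by k no matter how many users are seen, and two sketches can be merged without revisiting the raw data, which is the property that lets a sketch be stored as an aggregated value in place of the raw user-id column.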

At Yahoo, we recently developed an open source library called DataSketches. DataSketches provides implementations of various approximate sketch-based algorithms that enable faster, cheaper analytics on large datasets. By combining DataSketches with an extremely low-latency data store, such as Druid, you bring sketches into practical use in a big data store. Embedding sketch algorithms in a data store and persisting the actual sketches is relatively novel in the industry, and is the future structure of big data analytics systems.

Druid’s flexible plugin architecture allows us to integrate it with DataSketches; as such, we’ve developed and open sourced an extension to Druid that allows DataSketches to be used as a Druid aggregation function. Druid applies the aggregation function on selected columns and stores aggregated values instead of raw data.

By leveraging the fast, approximate calculations of DataSketches, complex analytic queries such as cardinality estimation and retention analysis can be completed in less than one second in Druid. This allows developers to visualize the results in real-time, and to be able to slice and dice results across a variety of different filters. For example, we can quickly determine how many users visited our core products, including Yahoo News, Sports, and Finance, as well as see how many of those users returned some time later. We can also break down our results in real-time based on user demographics such as age and location.

If you have similar use cases to ours, we invite you to try out DataSketches and Druid for behavioral analytics. For more information about DataSketches, please visit the DataSketches website. For more information about Druid, please visit the project webpage. And finally, documents for the DataSketches and Druid integration can be found in the Druid docs.

Analyze Realtime Data from Amazon Kinesis Streams Using Zeppelin and Spark Streaming

Post Syndicated from Manjeet Chayel original https://blogs.aws.amazon.com/bigdata/post/Tx3K805CZ8WFBRP/Analyze-Realtime-Data-from-Amazon-Kinesis-Streams-Using-Zeppelin-and-Spark-Strea

Manjeet Chayel is a Solutions Architect with AWS

There is streaming data everywhere. This includes clickstream data, data from sensors, data emitted from billions of IoT devices, and more. Not surprisingly, data scientists want to analyze and explore these data streams in real time. This post shows you how you can use Spark Streaming to process data coming from Amazon Kinesis streams, build some graphs using Zeppelin, and then store the Zeppelin notebook in Amazon S3.

Zeppelin overview

Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results.

Zeppelin notebooks can be shared among several users, and visualizations can be published to external dashboards. Zeppelin uses the Spark settings on your cluster and can use Spark’s dynamic allocation of executors to let YARN estimate the optimal resource consumption.

With the latest Zeppelin release (0.5.6) included on Amazon EMR 4.7.0, you can now import notes using links to S3 JSON files, raw file URLs in GitHub, or local files. You can also download a note as a JSON file. This new functionality makes it easier to save and share Zeppelin notes, and it allows you to version your notes during development. The import feature is located on the Zeppelin home screen, and the export feature is located on the toolbar for each note.

Additionally, you can still configure Zeppelin to store its entire notebook file in S3 by adding a configuration for zeppelin-env when creating your cluster (just make sure you have already created the bucket in S3 before creating your cluster).

Streaming data walkthrough

To use this post to play around with streaming data, you need an AWS account and the AWS CLI configured on your machine. The entire pattern can be implemented in a few simple steps:

  1. Create an Amazon Kinesis stream.
  2. Spin up an EMR cluster with Hadoop, Spark, and Zeppelin applications from advanced options.
  3. Use a simple Java producer to push random IoT events data into the Amazon Kinesis stream.
  4. Connect to the Zeppelin notebook.
  5. Import the Zeppelin notebook from GitHub.
  6. Analyze and visualize the streaming data.

We’ll look at each of these steps below.

Create a Kinesis stream

First, create a simple Amazon Kinesis stream, “spark-demo,” with two shards. For more information, see Creating a Stream.
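
If you prefer to script this step, the following Python sketch creates the stream with boto3 and waits for it to become available. The helper name create_demo_stream is made up for this example, and the region is assumed to be us-east-1 to match the endpoint used in the notebook code later in this post; create_stream, the stream_exists waiter, and describe_stream are standard boto3 Kinesis client calls.

```python
def create_demo_stream(client, name="spark-demo", shards=2):
    """Create a Kinesis stream and block until it exists.

    `client` is a boto3 Kinesis client, passed in so the caller controls
    the region and credentials. Returns the reported stream status.
    """
    client.create_stream(StreamName=name, ShardCount=shards)
    client.get_waiter("stream_exists").wait(StreamName=name)
    description = client.describe_stream(StreamName=name)
    return description["StreamDescription"]["StreamStatus"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   kinesis = boto3.client("kinesis", region_name="us-east-1")
#   print(create_demo_stream(kinesis))
```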

Spin up an EMR cluster with Hadoop, Spark, and Zeppelin

Edit the software settings for Zeppelin by copying and pasting the configuration below. Replace the bucket name “demo-s3-bucket” with your S3 bucket name. Note: you do not have to specify S3://. This configuration sets S3 as the notebook storage location and adds the Amazon Kinesis Client Library (KCL) to the environment.

[
   {
      "configurations":[
         {
            "classification":"export",
            "properties":{
               "ZEPPELIN_NOTEBOOK_S3_BUCKET":"demo-s3-bucket/zeppelin",
               "ZEPPELIN_NOTEBOOK_STORAGE":"org.apache.zeppelin.notebook.repo.S3NotebookRepo",
               "ZEPPELIN_NOTEBOOK_USER":"hadoop",
               "SPARK_SUBMIT_OPTIONS":"\"$SPARK_SUBMIT_OPTIONS --packages org.apache.spark:spark-streaming-kinesis-asl_2.10:1.6.0 --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python/lib/py4j-0.9-src.zip:/usr/lib/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.9-src.zip --conf spark.yarn.isPython=true\""
            }
         }
      ],
      "classification":"zeppelin-env",
      "properties":{

      }
   }
]

It takes a few minutes for the cluster to start and change to the “Waiting” state.

While this is happening, you can configure your machine to view web interfaces on the cluster. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters.

Use a simple Java producer to push random IoT events into the Amazon Kinesis stream

I have implemented a simple Java producer application, using the Kinesis Producer Library, which ingests random IoT sensor data into the “spark-demo” Amazon Kinesis stream.

Download the JAR and run it from your laptop or EC2 instance (this requires Java 8):

java -jar KinesisProducer.jar

Data is pushed in CSV format:

device_id,temperature,timestamp

Note: If you are using an EC2 instance, make sure that it has the required permissions to push the data into the Amazon Kinesis stream.
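
To get a feel for the records on the stream, here is a small Python sketch that builds lines in the same device_id,temperature,timestamp layout. It is not the Java producer from this post: the function name make_record, the device naming scheme, the temperature range, and the use of epoch seconds for the timestamp are all assumptions for illustration.

```python
import random
import time


def make_record(device_count=10, rng=random):
    """Build one CSV line in the form device_id,temperature,timestamp."""
    device_id = "device-%d" % rng.randint(1, device_count)
    temperature = rng.randint(10, 40)   # assumed range
    timestamp = int(time.time())        # assumed format: epoch seconds
    return "%s,%d,%d" % (device_id, temperature, timestamp)


for _ in range(3):
    print(make_record())
```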

Connect to the Zeppelin notebook

There are several ways to connect to the UI on the master node. One method is to use a proxy extension to the browser. To learn how, see Option 2, Part 2: Configure Proxy Settings to View Websites Hosted on the Master Node.

To reach the web interfaces, you must establish an SSH tunnel with the master node using either dynamic or local port forwarding. If you establish an SSH tunnel using dynamic port forwarding, you must also configure a proxy server to view the web interface.

The following command opens dynamic port forwarding on port 8157 to the EMR master node. After running it, enable FoxyProxy on your browser using the steps in Configure FoxyProxy for Firefox.

ssh -i <<YOUR-KEY-PAIR>> -ND 8157 hadoop@<<EMR-MASTER-DNS>>

Import the Zeppelin notebook from GitHub

In Zeppelin, choose Import note and Add from URL to import the notebook from the AWS Big Data blog GitHub repository.

Analyze and visualize streaming data

After you import the notebook, you’ll see a few lines of code and some sample SQL as paragraphs. The code in the notebook reads the data from your “spark-demo” Amazon Kinesis stream in batches of 5 seconds (this period can be modified) and stores the data into a temporary Spark table.

After the streaming context has started, Spark starts reading data from streams and populates the temporary table. You can run your SQL queries on this table.

import …   


val endpointUrl = "https://kinesis.us-east-1.amazonaws.com"

// Resolve AWS credentials from the default provider chain
val credentials = new DefaultAWSCredentialsProviderChain().getCredentials()
require(credentials != null,
  "No AWS credentials found. Please specify credentials using one of the methods specified " +
    "in http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html")

// Look up the number of shards so that we create one receiver per shard
val kinesisClient = new AmazonKinesisClient(credentials)
kinesisClient.setEndpoint(endpointUrl)
val numShards = kinesisClient.describeStream("spark-demo").getStreamDescription().getShards().size
val numStreams = numShards

// Setting batch interval to 5 seconds
val batchInterval = Seconds(5)
val kinesisCheckpointInterval = batchInterval
val regionName = RegionUtils.getRegionByEndpoint(endpointUrl).getName()

val ssc = new StreamingContext(sc, batchInterval)

// Create the DStreams
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, "app-spark-demo", "spark-demo", endpointUrl, regionName,
    InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}

// Union all the streams
val unionStreams = ssc.union(kinesisStreams)

// Schema of the incoming data on the stream
val schemaString = "device_id,temperature,timestamp"

// Build the table schema: one nullable string column per field
val tableSchema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName, StringType, true)))

// Processing each RDD and storing it in temporary table
unionStreams.foreachRDD((rdd: RDD[Array[Byte]], time: Time) => {
  // Each record is a CSV line; split it into one Row per event
  val rowRDD = rdd.map(w => Row.fromSeq(new String(w).split(",")))
  val wordsDF = sqlContext.createDataFrame(rowRDD, tableSchema)
  wordsDF.registerTempTable("realTimeTable")
})

Example SQL:

%sql
SELECT device_id,timestamp, avg(temperature) AS avg_temp
FROM realTimeTable
GROUP BY device_id,timestamp 
ORDER BY timestamp
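
For readers who want to see what this query computes without a cluster, here is a plain Python sketch of the same aggregation over a handful of CSV lines. The function name average_temperature and the sample values are made up; the column order follows the device_id,temperature,timestamp layout used above, and timestamps are compared as strings, as they are stored in the table.

```python
from collections import defaultdict


def average_temperature(lines):
    """Average temperature per (device_id, timestamp), ordered by timestamp."""
    sums = defaultdict(lambda: [0.0, 0])   # key -> [running total, count]
    for line in lines:
        device_id, temperature, timestamp = line.strip().split(",")
        bucket = sums[(device_id, timestamp)]
        bucket[0] += float(temperature)
        bucket[1] += 1
    rows = [(device, ts, total / count)
            for (device, ts), (total, count) in sums.items()]
    return sorted(rows, key=lambda row: row[1])


# Made-up sample records.
sample_lines = [
    "device-1,20,1466000005",
    "device-1,30,1466000005",
    "device-2,25,1466000000",
]
for row in average_temperature(sample_lines):
    print(row)
```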


You can also use pie charts.

To modify the processing logic in the foreachRDD block, gracefully stop the streaming context, re-run the foreach paragraph, and re-start the streaming context.

Summary

In this post, I’ve shown you how to use Spark Streaming from a Zeppelin notebook and directly analyze the incoming streaming data. After the analysis you can terminate the cluster; the notebook remains available in the S3 bucket that you configured during cluster creation. I hope you’ve seen how easy it is to use Spark Streaming, Kinesis, and Zeppelin to uncover and share the business intelligence in your streaming data. Please give the process in this post a try and let us know in the comments what your results were!

If you have questions or suggestions, please comment below.


Not All Bugs Are Created Equal

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/146077281316

yahoo-security:

Doug DePerry, Senior Security Engineer, Paranoids

In our inaugural post to The Paranoid, we discussed the human element behind online attacks–the human adversary. We sought to give some perspectives as to who is behind online threats in order to better understand how to defend against them. Yahoo’s bug bounty program applies that insight in our ongoing efforts to provide a safe environment for our users. By thinking about the economics of security, we’ve found that we can tilt the advantage in our favor by partnering with industry-leading security researchers.

We often get questions from both security researchers and people just interested in learning about how programs like these work. We thought we’d use this opportunity to take a quick look under the hood.

First, some background. Bug bounty programs essentially crowd-source security. They allow companies to improve coverage by adding eyes where they need them. Bug bounty researchers also bring depth of expertise and different skill sets that can uncover hard-to-find bugs.

For the past two years, Yahoo has developed one of the largest and most successful bug bounty programs in the industry. We’ve paid out over $1.7 million in bounties, resolved more than 2,000 security bugs, and maintain a “hackership” of more than 2,000 researchers, some of whom make careers out of it.

Security researchers often ask us how we decide the payout associated with a given bug report. At first it might seem logical that we pay based on the type or classification of a security bug. Some bug types tend to be bad, so you might think that they would be paid the same. However, in the vast majority of cases, that’s not the complete story. So if the bug type alone is not what we use to determine the payout, what is? The missing input to the calculation is the impact of the vulnerability. We take into account what data might have been exposed, the sensitivity of that data, the role that data plays, network location and the permissions of the server involved. Those factors are of great importance.

Given the importance of the impact of a bug, the Yahoo bug bounty program does not reward researchers solely based on bug type. The type of bug a security researcher finds is mostly irrelevant. What matters most is what the bug allows them to do, and where. What can an attacker actually do with this specific bug to potentially affect the security of Yahoo or our users? Furthermore, Yahoo’s application landscape is not necessarily uniform; certain properties or applications are more equal than others.

Here’s an example to show how these factors work in practice. SQL injection bugs are often a devastating bug class because they can provide full access to a database. Odds are, if a company has a presence on the web, they are storing sensitive information in databases. But just because an attacker can access the database does not mean it’s game over. The real reason that the SQL injection bug class can be so devastating is the data stored in the database may be accessed or changed by unauthorized parties. The typical impact of a SQL injection bug is high because the data exposed is typically sensitive, except when it’s not. What if the database doesn’t contain any sensitive data?

Part of the process in determining impact can seem opaque to the researcher, and we understand that. That obscurity is an unfortunate but necessary fact of life in a bug bounty program. As an external party, it is just not possible to have all the information. The sort of testing available to participants in a public bug bounty program is inherently “black box”–no documentation, no source code, what you see is what you get.

So we encourage bug reporters to include in their reports what they believe the impact of the vulnerability to be (example report here). A report that contains a thorough and detailed explanation of a legitimate security issue is much more highly valued and rewarded.

We also work closely with the developers to ensure the bug is fixed in a timely manner, and to obtain their expert opinion on impact if necessary. If the developers that created the application tell us that no sensitive data is stored in a particular database, we take that into consideration when awarding your bug. More detailed guidelines for our bug bounty program are available at hackerone.com/yahoo.

To paraphrase a little-known quote, “bug bounty programs don’t reward you for being clever.” Users and researchers should know that we place far more weight on how impactful bugs are to our platforms.

For the hackers out there interested in how @yahoo’s Bug Bounty program works.