Court: Usenet Provider is Not Liable for Piracy

Post Syndicated from Ernesto original https://torrentfreak.com/court-usenet-provider-is-not-liable-for-piracy-161207/

In 2009, Dutch anti-piracy group BREIN, representing the movie and music industries, took News-Service Europe (NSE) – one of Europe’s largest Usenet providers at the time – to court.

BREIN argued that NSE must delete all infringing content from its servers, and in 2011 the Court of Amsterdam sided with the copyright holders.

In its initial verdict, the Court concluded that NSE willingly facilitated copyright infringement through its services. As a result, the company was ordered to remove all copyrighted content and filter future posts for possible copyright infringements.

According to the Usenet provider, this filtering requirement would be too costly to operate. It therefore saw no other option than to shut down its services while the appeal was pending.

After several years of litigation and two interim decisions, the Amsterdam appeals court reached a final decision in the case this week.

The overall conclusion is that NSE is not directly or indirectly liable for copyright infringements that take place through its service. However, the Usenet provider is required to offer a fast and effective notice and takedown procedure (NTD), possibly with additional measures.

NSE is happy with the verdict which it characterizes as a big win.

“We see the outcome as a total victory. The court of appeal completely destroys the earlier verdict. NSE did not infringe copyright and is not liable for copyright infringement,” NSE CEO Patrick Schreurs informs TorrentFreak.

The takedown requirement is a moot point, according to NSE, which says it already had such a procedure in place before it shut down.

“The fact that we do need to implement a NTD is void. Even before the lawsuit started, NSE already offered an effective NTD-procedure. The Court of Appeal even considered in an earlier interim judgment that NSE’s NTD-procedure is sufficient,” Schreurs notes.

BREIN had hoped for a better outcome but is happy with the takedown requirements the court included. The anti-piracy group also highlights that the judgment allows for possible additional measures, which could include a filter.

“We are disappointed that NSE is not deemed to infringe but we feel vindicated because of the recognition of this objective,” BREIN Director Tim Kuik told TorrentFreak.

“Let’s face it: People take subscriptions to download from Usenet because of the availability of infringing content. If that availability is lacking then the viability of this business model built on illegal use disappears,” he added.

An earlier court decision found that a proactive piracy filter would go against the ban on general monitoring requirements. However, new copyright proposals put forward by the European Commission could change this position.

While the verdict offers reassurance for the Usenet industry, it also provides rightsholders with a clear precedent to demand a proper takedown procedure. BREIN intends to keep a close eye on other Usenet providers, and won’t rule out future legal action.

“We will work with rights holders and vendors to determine whether Usenet providers are up to par. As in the case at hand, we are always willing to look for cooperation. However, if we find providers that are unwilling to come to an understanding and live up to their responsibilities then we will take them to court,” Kuik says.

Both parties still have the option to take the case to the Supreme Court, but this hasn’t been decided yet.

NSE has made clear that it won’t be relaunching its Usenet service. It could, however, start a separate case to seek compensation for the losses suffered as a result of the shutdown, something BREIN also mentioned in court.

NSE CEO Patrick Schreurs couldn’t confirm or deny this but noted that the case isn’t over just yet.

“All I can say right now is that this isn’t the end,” he told us.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Serverless at re:Invent 2016 – Wrap-up

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/serverless-at-reinvent-2016-wrap-up/

The re:Invent 2016 conference was an exciting week to be working on serverless at AWS. We announced new features like support for C# and dead letter queues, and launched new application constructs with Lambda such as Lambda@Edge, AWS Greengrass, Amazon Lex, and AWS Step Functions. In addition, we added support for surfacing services built using API Gateway in the AWS Marketplace, expanded the capabilities for custom authorizers, and launched a reference developer portal for managing APIs. Catch up on all the great re:Invent launches here.

In addition to the serverless mini-con with deep dive talks and best practices, we also had customer deep dives from folks at Thomson Reuters, Vevo, Expedia, and FINRA. If you weren’t able to attend the mini-con or missed a specific session, here is a quick link to the entire Serverless Mini Conference Playlist. Other interesting sessions from other tracks are listed below.

Individual Sessions from the Mini Conference

Other Interesting Sessions

If there are other sessions or talks you think I should capture in this list, let me know!

WordPress 4.7

Post Syndicated from ris original http://lwn.net/Articles/708268/rss

WordPress 4.7 “Vaughan” has been released. This
version includes a new default theme, adds new features to the customizer,
comes with REST API endpoints for posts, comments, terms, users, meta, and
settings, and more.
To help give you a solid base to build from, individual themes can provide starter content that appears when you go to customize your brand new site. This can range from placing a business information widget in the best location to providing a sample menu with social icon links to a static front page complete with beautiful images. Don’t worry – nothing new will appear on the live site until you’re ready to save and publish your initial theme setup.

Weekly roundup: Freedom

Post Syndicated from Eevee original https://eev.ee/dev/2016/12/06/weekly-roundup-freedom/

  • zdoom: On a total whim, I resurrected half of an old branch that puts sloped 3D floors in the software renderer. It kinda draws them, but with no textures.

  • blog: I wrote a thing about not copying C which was surprisingly popular.

  • sylph: I accidentally spent 45 minutes writing a microscopic parser for a language that can only print string literals.

  • patreon: I finished up some revamping of my Patreon — the wall of text is now a short and straightforward stack of images, and I dropped the blogging milestones. I’m no longer obliged to write X posts per month, huzzah.

  • art: I pixel-drew some new veekun version icons, which may or may not go live. Also drew a December avatar.

  • veekun: I got started on adapting my ORAS dumping code for Sun and Moon, and have box and dex sprites 90% dumped. Text is mostly done as well.

  • art: I did a few pixels, which you may or may not be seeing in the near future.

veekun effort continues (as I scramble to actually finish the game so I don’t spoil myself). Also trying to, uh, remember how to draw?

Court: ‘Falsely’ Accused ‘Movie Pirate’ Deserves $17K Compensation

Post Syndicated from Ernesto original https://torrentfreak.com/court-falsely-accused-movie-pirate-deserves-17k-compensation-161206/

For more than half a decade so-called “copyright trolling” cases have been keeping the U.S. judicial system busy.

While new lawsuits are still being filed on a weekly basis, there are signs that some judges are growing tired of the practice and becoming increasingly skeptical of the claims made by copyright holders.

In Oregon, a federal judge recently dismissed a complaint filed by the makers of the Adam Sandler movie The Cobbler. The direct infringement claim against the alleged movie pirate was thrown out from the outset, as it was clear that the defendant wasn’t the infringer.

The defendant in question, Thomas Gonzales, operates an adult foster care home where several people had access to the Internet. The filmmakers were aware of this and during a hearing their counsel admitted that any guest could have downloaded the film.

Still, the filmmakers decided to move their case ahead, and for this decision they may now have to pay. After the case was dismissed, the wrongfully accused ‘pirate’ asked to be compensated for the fees he incurred during his defense.

In a findings and recommendations filing published last Friday (pdf), Magistrate Judge Stacie Beckerman concludes that the filmmakers went too far.

“The Court finds that once Plaintiff learned that the alleged infringement was taking place at an adult group care home at which Gonzales did not reside, Plaintiff’s continued pursuit of Gonzales for copyright infringement was objectively unreasonable,” Judge Beckerman writes.

Gonzales argued that the filmmakers are using these lawsuits to pressure people into expensive settlements. While the plaintiffs deny that money is a goal for them, the court shares the defendant’s view.

The “overaggressive” tactics of the filmmakers warrant a fees award, Judge Beckerman writes in her recommendation.

“The Court shares Gonzales’ concern that Plaintiff is motivated, at least in large part, by extracting large settlements from individual consumers prior to any meaningful litigation.

“On balance, the Court has concerns about the motivation behind Plaintiff’s overaggressive litigation of this case and other cases, and that factor weighs in favor of fee shifting.”

Copyright holders often argue that damages awards are needed to deter the defendant and other pirates from infringing. In this case, however, the tables are turned.

The Court states that a fees award in favor of the wrongfully accused defendant should deter the filmmakers and other ‘copyright trolls’ from dragging people into copyright lawsuits without any factual evidence.

“Compensating Gonzales will encourage future defendants with valid defenses to litigate those defenses, even if the litigation is expensive,” Judge Beckerman writes.

“Conversely, and perhaps more importantly, awarding fees to Gonzales should deter Plaintiff in the future from continuing its overaggressive pursuit of alleged infringers without a reasonable factual basis.”

All in all, the Magistrate Judge concludes that Gonzales deserves compensation. She recommends that the court award $17,222 in attorney fees as well as $255 in other expenses.

The filmmakers now have two weeks to object to the findings and recommendations, which means that the fee award is not final yet. However, as DieTrollDie notes, such an objection could also mean that they end up paying more.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

[$] Maintainerless Debian?

Post Syndicated from corbet original http://lwn.net/Articles/708163/rss

The maintainer model is deeply ingrained into the culture of the
free-software community; for any bit of code, there is usually a developer
(or a small group of developers) charged with that code’s maintenance.
Good maintainers can help a project run smoothly, while poor maintainers
can run things into the ground. What is to be done to save a project with
the latter type of maintainer? Forking can be an option in some cases
but, in many others, it’s not a practical alternative. The Debian project
is currently
discussing its approach to bad maintainers — a discussion which has taken a
surprising turn.

Tuesday’s security updates

Post Syndicated from ris original http://lwn.net/Articles/708238/rss

Debian-LTS has updated monit
(regression in previous update).

Fedora has updated dpkg (F25; F24; F23: code execution), gstreamer-plugins-bad-free (F25: code
execution), gstreamer1-plugins-bad-free
(F24: code execution), gstreamer1-plugins-good (F24: multiple
vulnerabilities), kernel (F25; F24; F23:
denial of service), and thunderbird (F25: code execution).

Gentoo has updated arj (multiple vulnerabilities) and util-linux (command injection).

Mageia has updated firefox (code execution), thunderbird (multiple vulnerabilities), and virtualbox (multiple vulnerabilities).

openSUSE has updated GraphicsMagick (Leap42.1; 13.2: two vulnerabilities), ImageMagick (13.2: two vulnerabilities),
mariadb (Leap42.2; Leap42.1: multiple mostly unspecified
vulnerabilities), firefox, thunderbird, nss
(13.1: multiple vulnerabilities), tcpreplay
(Leap42.2: denial of service), kernel
(13.1: multiple vulnerabilities), and thunderbird (SPH for SLE12: multiple vulnerabilities).

Oracle has updated thunderbird (OL7; OL6: code execution).

Red Hat has updated bind
(RHEL6.2, 6.4, 6.5, 6.6, 6.7: denial of service) and sudo (RHEL6,7: privilege escalation).

SUSE has updated java-1_6_0-ibm
(SLEMLS12: multiple vulnerabilities) and firefox, nss (SLE12-SP2,SP1: multiple vulnerabilities).

Ubuntu has updated kernel (16.10; 16.04;
14.04; 12.04: code execution), linux-lts-trusty (12.04: code execution), linux-lts-xenial (14.04: code execution),
linux-raspi2 (16.10; 16.04: code execution), linux-snapdragon (16.04: code execution), and
linux-ti-omap4 (12.04: code execution).

20 Billion Files Restored

Post Syndicated from Yev original https://www.backblaze.com/blog/20-billion-files-restored/

On September 1st we asked you to predict when Backblaze would reach our 20 Billion Files Restored mark. We had a ton of entrants!! Over two thousand people hazarded a guess. It is our pleasure to announce that we’ve finally hit that milestone, and along with it, we’re announcing the winners of our contest.

Before we reveal the date and time that we crossed that threshold, we first want to point out that the closest guess was only off by 23 minutes. Second closest? 57 minutes. That’s kind of remarkable. The 10 winners all pinpointed the exact date, and the furthest-off guess was only four hours and six minutes from the mark. Very impressive!

Congratulations to our winners:

  • Lance – Bismarck, North Dakota
  • Bartosz – Warsaw, Poland
  • Jeremy – Evergreen, Colorado
  • Justin – Los Angeles, California
  • Andy – London, UK
  • Jeffrey – Round Rock, Texas
  • Jose – Merida, Mexico
  • Maria – New York, New York
  • Rizwan – Surrey, UK
  • Max – Howard Beach, New York

11/20/2016 – 10:57 AM

That was the exact time when we restored the 20,000,000,000th file. That’s a lot of memories, documents, and projects saved. The number of files per Backblaze restore varies due to our different restore methods. The most common use case for our restores is when folks forget one or two files and do small restores of just a folder or two.

Restore Fun Facts For a Typical Month:

Where restores are created:

  • 96.3% of restores are done on the web
  • 3.7% are done via our mobile apps

Of all restores:

  • 97.8% are ZIP restores
  • 1.7% are USB HD restores
  • 0.5% are USB Flash Drive restores

The average size of restores:

  • ZIP restores (web & mobile): 25 GB
  • USB HD restores: 1.1 TB (1,100 GB)
  • USB Flash Drive restores: 63 GB

Based on the amount of data in ZIP file restores:

Range in GB    % of Restores
< 1            54.5%
1 – 10         17.5%
10 – 25        9.5%
25 – 50        7.2%
50 – 75        2.6%
75 – 100       1.7%
100 – 200      3.4%
200 – 300      1.5%
300 – 400      1.0%
400 – 500      0.8%
> 500          0.3%

Backblaze was started with the goal of preventing data loss. Even though we’re nearing 300 Petabytes of data stored, we consider our 20 Billion Files Restored the benchmark that we’re most proud of because it validates our ability to help our customers out when they need us most. Thank you for being loyal Backblaze fans, and while we always hope that folks won’t need to create restores, we’ll be here for you if you do!

The post 20 Billion Files Restored appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Alex’s Festive Baubles

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/alexs-festive-baubles/

I made a thing. And because I love you all, I’m going to share the thing with you. Thing? Things! I’m going to share the things. Here you go: baubles!

Raspberry Pi and Code Club Christmas Decorations

These 3D-printable Raspberry Pi and Code Club decorations are the perfect addition to any Christmas tree this year. And if you don’t have a tree, they’re the perfect non-festive addition to life in general. There’s really no reason to say no.

The .stl files you’ll need to make the baubles are available via MyMiniFactory (Raspberry Pi/Code Club) and Thingiverse (Raspberry Pi/Code Club). They’re published under a Creative Commons BY-NC-ND 3.0 license. This means that you can make a pile of decorations for your tree and for your friends, though we do have to ask you not to change the designs, as the logos they’re based on are our trademarks.

Here’s a video of the prototype printout being made. If you can help it, try not to use a brim on your print. Brims, though helpful, are a nightmare to remove from the fiddly Pi logo.

Enjoy.

3D Printed Raspberry Pi Logo

Print time: 20 mins. Printer: Ultimaker 2+. Material: ABS. With thanks to Makespace for use of the 3D printer (http://makespace.org/) and Safakash for the music (https://soundcloud.com/safakash).

The post Alex’s Festive Baubles appeared first on Raspberry Pi.

International Phone Fraud Tactics

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/12/international_p.html

This article outlines two different types of international phone fraud. The first can happen when you call an expensive country like Cuba:

My phone call never actually made it to Cuba. The fraudsters make money because the last carrier simply pretends that it connected to Cuba when it actually connected me to the audiobook recording. So it charges Cuban rates to the previous carrier, which charges the preceding carrier, which charges the preceding carrier, and the costs flow upstream to my telecom carrier. The fraudsters siphoning money from the telecommunications system could be anywhere in the world.

The second happens when phones are forced to dial international premium-rate numbers:

The crime ring wasn’t interested in reselling the actual [stolen] phone hardware so much as exploiting the SIM cards. By using all the phones to call international premium numbers, similar to 900 numbers in the U.S. that charge extra, they were making hundreds of thousands of dollars. Elsewhere — Pakistan and the Philippines being two common locations — organized crime rings have hacked into phone systems to get those phones to constantly dial either international premium numbers or high-rate countries like Cuba, Latvia, or Somalia.

Why is this kind of thing so hard to stop?

Stamping out international revenue share fraud is a collective action problem. “The only way to prevent IRFS fraud is to stop the money. If everyone agrees, if no one pays for IRFS, that disrupts it,” says Yates. That would mean, for example, the second-to-last carrier would refuse to pay the last carrier that routed my call to the audiobooks and the third-to-last would refuse to pay the second-to-last, and so on, all the way back up the chain to my phone company. But when has it been easy to get so many companies to do the same thing? It costs money to investigate fraud cases too, and some companies won’t think it’s worth the trade off. “Some operators take a very positive approach toward fraud management. Others see it as cost of business and don’t put a lot of resources or systems in to manage it,” says Yates.

Pirate Bay Blocking Case Heads Back to Court in Sweden

Post Syndicated from Andy original https://torrentfreak.com/pirate-bay-blocking-case-heads-back-to-court-in-sweden-161206/

Of all websites in the piracy landscape, few can claim to be as hounded as The Pirate Bay (TPB). Due to its resilience and refusal to step into line, the site has been at the core of dozens of direct and indirect court cases for more than a decade.

Today, another process gets underway, with yet another Internet service provider arguing that it should not be held responsible for the actions of The Pirate Bay, or its pirating users.

The case has its roots back in 2014, when Universal Music, Sony Music, Warner Music, Nordisk Film and the Swedish Film Industry teamed up in a lawsuit designed to force Swedish ISP Bredbandsbolaget (Broadband Company) to block the site.

The rightsholders argued that Bredbandsbolaget should be held liable unless it blocked TPB, but the ISP refused to comply. It stated that its only role is to provide customers with Internet access while facilitating the free flow of information.

The case originally went to trial at the Stockholm District Court last October. In line with several other similar rulings elsewhere in Europe, the ISP was expected to lose its case. Instead, it prevailed, with the District Court concluding that Bredbandsbolaget’s actions in facilitating access to the site did not amount to participation in a crime under Swedish law.

The rightsholders inevitably filed an appeal, and today, almost exactly a year later, the parties are set to face off again in a brand new, dedicated venue.

Since September 2016, Sweden has had two new courts. The Patent and Market Court and the Patent and Market Court of Appeal are specialist courts dedicated to tackling intellectual property, competition, and marketing law matters.

The Patent and Market Court is a division of Stockholm District Court while the Patent and Market Court of Appeal is a division of the Svea Court of Appeal. Today’s Pirate Bay case will be heard at the latter.

Bredbandsbolaget’s position remains unchanged. The ISP wants to remain a neutral supplier of Internet connectivity and is alarmed at the prospect of being held liable for any content passing through its infrastructure. While today the discussion is about copyrighted movies, TV shows and music, tomorrow it could be about other offenses allegedly carried out online. The scope is enormous.

Per Strömbäck, representing the copyright holders, told IDG that ISPs like Bredbandsbolaget have knowledge of infringing acts but choose to do nothing about them. This is something the content companies want to change.

“We want to get to a point where a court can order an Internet service provider to block subscribers from accessing an illegal site. The telecom companies will not make that decision themselves,” he says.

A defeat for Bredbandsbolaget in this appeal could have far-reaching consequences. As seen in other countries around Europe, once rightsholders succeed in getting one site blocked, the floodgates open with dozens, perhaps hundreds, of similar requests to block additional domains.

The case begins today and is expected to conclude on Thursday.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

That "Commission on Enhancing Cybersecurity" is absurd

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/12/that-commission-on-enhancing.html

An Obama commission has published a report on how to “Enhance Cybersecurity”. It’s promoted as having been written by neutral, bipartisan, technical experts. Instead, it’s almost entirely dominated by special interests and the Democrat politics of the outgoing administration.

In this post, I’m going through a random list of some of the 53 “action items” proposed by the document. I show how they are policy issues, not technical issues. Indeed, much of the time the technical details are warped to conform to special interests.

IoT passwords

The recommendations include such things as Action Item 2.1.4:

Initial best practices should include requirements to mandate that IoT devices be rendered unusable until users first change default usernames and passwords. 

This recommendation for changing default passwords is repeated many times. It comes from the way the Mirai worm exploits devices by using hardcoded/default passwords.

But this is a misunderstanding of how these devices work. Take, for example, the infamous Xiongmai camera. It has user accounts on the web server to control the camera. If the user forgets the password, the camera can be reset to factory defaults by pressing a button on the outside of the camera.

But here’s the deal with security cameras. They are placed at remote sites miles away, up on the second story where people can’t mess with them. In order to reset them, you need to put a ladder in your truck and drive 30 minutes out to the site, then climb the ladder (an inherently dangerous activity). Therefore, Xiongmai provides a RESET.EXE utility for remotely resetting them. That utility happens to connect via Telnet using a hardcoded password.

The above report misunderstands what’s going on here. It sees Telnet and a hardcoded password, and makes assumptions. Some people assume that this is the normal user account — it’s not, it’s unrelated to the user accounts on the web server portion of the device. Requiring the user to change the password on the web service would have no effect on the Telnet service. Other people assume the Telnet service is accidental, that good security hygiene would remove it. Instead, it’s an intended feature of the product, to remotely reset the device. Fixing the “password” issue as described in the above recommendations would simply mean the manufacturer would create a different, custom backdoor that hackers would eventually reverse engineer, creating a MiraiV2 botnet. Instead of security guides banning backdoors, they need to come up with a standard for remote reset.

That characterization of Mirai as an IoT botnet is wrong. Mirai is a botnet of security cameras. Security cameras are fundamentally different from IoT devices like toasters and fridges because they are often exposed to the public Internet. To stream video on your phone from your security camera, you need a port open on the Internet. Non-camera IoT devices, however, are overwhelmingly protected by a firewall, with no exposure to the public Internet. While you can create a botnet of Internet cameras, you cannot create a botnet of Internet toasters.

The point I’m trying to demonstrate here is that the above report was written by policy folks with little grasp of the technical details of what’s going on. They use Mirai to justify several of their “Action Items”, none of which actually apply to the technical details of Mirai. It has little to do with IoT, passwords, or hygiene.

Public-private partnerships

Action Item 1.2.1: The President should create, through executive order, the National Cybersecurity Private–Public Program (NCP 3 ) as a forum for addressing cybersecurity issues through a high-level, joint public–private collaboration.

We’ve had public-private partnerships to secure cyberspace for over 20 years, such as the FBI InfraGuard partnership. President Clinton had a plan in 1998 to create a public-private partnership to address cyber vulnerabilities. President Bush declared public-private partnerships the “cornerstone” of his 2003 plan to secure cyberspace.

Here we are 20 years later, and this document is full of new naive proposals for public-private partnerships. There’s no analysis of why they have failed in the past, or a discussion of which ones have succeeded.

The many calls for public-private programs reflect the left-wing nature of this supposed “bipartisan” document, which sees government as a paternalistic entity that can help. The right-wing doesn’t believe the government provides any value in these partnerships. In my 20 years of experience with government public-private partnerships in cybersecurity, I’ve found them to be a time waster at best and, at worst, a way to coerce “voluntary measures” out of companies that hurt the public’s interest.

Build a wall and make China pay for it

Action Item 1.3.1: The next Administration should require that all Internet-based federal government services provided directly to citizens require the use of appropriately strong authentication.

This would cost at least $100 per person, for 300 million people, or $30 billion. In other words, it’ll cost more than Trump’s wall with Mexico.

Hardware tokens are cheap. Blizzard (a popular gaming company) must deal with widespread account hacking from “gold sellers”, and provides second factor authentication to its gamers for $6 each. But that ignores the enormous support costs involved. How does a person prove their identity to the government in order to get such a token? To replace a lost token? When old tokens break? What happens if somebody’s token is stolen?

And that’s the best case scenario. Other options, like using cellphones as a second factor, are non-starters.

This is actually not a bad recommendation, as far as government services are concerned, but it ignores the costs and difficulties involved.

But then the recommendations go on to suggest this for private sector as well:

Specifically, private-sector organizations, including top online retailers, large health insurers, social media companies, and major financial institutions, should use strong authentication solutions as the default for major online applications.

No, no, no. There is no reason for a “top online retailer” to know your identity. I lie about my identity. Amazon.com thinks my name is “Edward Williams”, for example.

They get worse with:

Action Item 1.3.3: The government should serve as a source to validate identity attributes to address online identity challenges.

In other words, they are advocating a cyber-dystopic police-state wet-dream where the government controls everyone’s identity. We already see how this fails with Facebook’s “real name” policy, where everyone from political activists in other countries to LGBTQ people in this country gets harassed for revealing their real names.

Anonymity and pseudonymity are precious rights on the Internet that we now enjoy — rights endangered by the radical policies in this document. This document frequently claims to promote security “while protecting privacy”. But the government doesn’t protect privacy — much of what we want from cybersecurity is to protect our privacy from government intrusion. This is nothing new; you’ve heard this privacy debate before. What I’m trying to show here is that the one-sided view of privacy in this document demonstrates how it’s dominated by special interests.

Cybersecurity Framework

Action Item 1.4.2: All federal agencies should be required to use the Cybersecurity Framework. 

The “Cybersecurity Framework” is a bunch of nonsense that would require another long blogpost to debunk. It requires months of training and years of experience to understand. It contains things like “DE.CM-4: Malicious code is detected”, as if that’s a thing organizations are able to do.

All the while it ignores the most common cyber attacks (SQL/web injections, phishing, password reuse, DDoS). It’s a typical example where organizations spend enormous amounts of money following process while getting no closer to solving what the processes are attempting to solve. Federal agencies using the Cybersecurity Framework are no safer from my pentests than those who don’t use it.

It gets even crazier:

Action Item 1.5.1: The National Institute of Standards and Technology (NIST) should expand its support of SMBs in using the Cybersecurity Framework and should assess its cost-effectiveness specifically for SMBs.

Small businesses can’t even afford to read the “Cybersecurity Framework”. Simply reading the doc and trying to understand it would exceed their entire IT/computer budget for the year. It would take a high-priced consultant earning $500/hour to tell them that “DE.CM-4: Malicious code is detected” means “buy antivirus and keep it up to date”.

Software liability is a hoax invented by the Chinese to make our IoT less competitive

Action Item 2.1.3: The Department of Justice should lead an interagency study with the Departments of Commerce and Homeland Security and work with the Federal Trade Commission, the Consumer Product Safety Commission, and interested private sector parties to assess the current state of the law with regard to liability for harm caused by faulty IoT devices and provide recommendations within 180 days. 

For over a decade, leftists in the cybersecurity industry have been pushing the concept of “software liability”. Every time there is a major new development in hacking, such as the worms around 2003, they come out with documents explaining why there’s a “market failure” and that we need liability to punish companies to fix the problem. Then the problem is fixed, without software liability, and the leftists wait for some new development to push the theory yet again.

It’s especially absurd for the IoT marketspace. The harm, as they imagine, is DDoS. But the majority of devices in Mirai were sold by non-US companies to non-US customers. There’s no way US regulations can stop that.

What US regulations will stop is IoT innovation in the United States. Regulations are so burdensome, and liability lawsuits so punishing, that they will kill all innovation within the United States. If you want to get rich with a clever IoT Kickstarter project, forget about it: your entire development budget will go to cybersecurity. The only companies that will be able to afford to ship IoT products in the United States will be large industrial concerns like GE that can afford the overhead of regulation/liability.

Liability is a left-wing policy issue, not one supported by technical analysis. Software liability has proven to be immaterial in any past problem and current proponents are distorting the IoT market to promote it now.

Cybersecurity workforce

Action Item 4.1.1: The next President should initiate a national cybersecurity workforce program to train 100,000 new cybersecurity practitioners by 2020. 

The problem in our industry isn’t the lack of “cybersecurity practitioners”, but the overabundance of “insecurity practitioners”.

Take “SQL injection” as an example. It’s been the most common way hackers break into websites for 15 years. It happens because programmers, those building web-apps, blindly paste input into SQL queries. They do that because they’ve been trained to do it that way. All the textbooks on how to build webapps teach them this. All the examples show them this.
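
To make the failure mode concrete, here is a minimal sketch using a hypothetical users table. The first query is what the programmer intended; the second is what the same string-pasting template produces when an attacker controls the input:

-- Intended query, with the user-supplied name spliced into the SQL string:
SELECT * FROM users WHERE name = 'alice';

-- The same template when the attacker submits the name:  ' OR '1'='1
SELECT * FROM users WHERE name = '' OR '1'='1';
-- The WHERE clause is now always true, so every row comes back.
-- Binding the input as a parameter, instead of pasting it into the SQL text,
-- keeps it from ever changing the structure of the query.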

So you have government programs on one hand pushing tech education, teaching kids to build web-apps with SQL injection. Then you propose to train a second group of people to fix the broken stuff the first group produced.

The solution to SQL/website injections is not more practitioners, but stopping programmers from creating the problems in the first place. The solution to phishing is to use the tools already built into Windows and networks that sysadmins use, not adding new products/practitioners. These are the two most common problems, and they happen not because of a lack of cybersecurity practitioners, but because the lack of cybersecurity as part of normal IT/computers.

I point this out to demonstrate yet again that the document was written by policy people with little or no technical understanding of the problem.

Nutritional label

Action Item 3.1.1: To improve consumers’ purchasing decisions, an independent organization should develop the equivalent of a cybersecurity “nutritional label” for technology products and services—ideally linked to a rating system of understandable, impartial, third-party assessment that consumers will intuitively trust and understand. 

This can’t be done. Grab some IoT devices, like my thermostat, my car, or a Xiongmai security camera used in the Mirai botnet. These devices are so complex that no “nutritional label” can be made for them.

One of the things you’d like to know is all the software dependencies, so that if there’s a bug in OpenSSL, for example, then you know your device is vulnerable. Unfortunately, that requires a nutritional label with 10,000 items on it.

Or, one thing you’d want to know is that the device has no backdoor passwords. But that would miss the Xiongmai devices. The web service has no backdoor passwords. If you caught the Telnet backdoor password and removed it, then you’d miss the special secret backdoor that hackers would later reverse engineer.

This is a policy position chasing a non-existent technical issue, pushed by Peiter Zatko, who has gotten hundreds of thousands of dollars from government grants to push the issue. It’s his way of getting rich and has nothing to do with sound policy.

Cyberczars and ambassadors

Various recommendations call for the appointment of various CISOs, Assistant to the President for Cybersecurity, and an Ambassador for Cybersecurity. But nowhere does it mention these should be technical posts. This is like appointing a Surgeon General who is not a doctor.

Government’s problems with cybersecurity stem from the way technical knowledge is so disrespected. The current cyberczar prides himself on his lack of technical knowledge, because that helps him see the bigger picture.

Ironically, many of the other Action Items are about training cybersecurity practitioners, employees, and managers. None of this can happen as long as leadership is clueless. Technical details matter, as I show above with the Mirai botnet. Subtlety and nuance in technical details can call for opposite policy responses.

Conclusion

This document is promoted as being written by technical experts. However, nothing in the document is neutral technical expertise. Instead, it’s almost entirely a policy document dominated by special interests and left-wing politics. In many places it makes recommendations to the incoming Republican president. His response should be to round-file it immediately.

I only chose a few items, as this blogpost is long enough as it is. I could pick almost any of the 53 Action Items to demonstrate how they are policy and special-interest driven rather than reflecting technical expertise.

Bottomley: Using Your TPM as a Secure Key Store

Post Syndicated from corbet original http://lwn.net/Articles/708162/rss

James Bottomley has posted a
tutorial
on using the trusted platform module to store cryptographic
keys. “The main thing that came out of this discussion was that a
lot of this stack complexity can be hidden from users and we should
concentrate on making the TPM ‘just work’ for all cryptographic functions
where we have parallels in the existing security layers (like the
keystore). One of the great advantages of the TPM, instead of messing
about with USB pkcs11 tokens, is that it has a file format for TPM keys
(I’ll explain this later) which can be used directly in place of standard
private key files.”

4shared’s Piracy ‘Fingerprint’ Tool Helps to Reduce Takedown Notices

Post Syndicated from Ernesto original https://torrentfreak.com/4shareds-piracy-fingerprint-tool-helps-to-reduce-takedown-notices-161205/

With millions of regular visitors, both via the web and through mobile apps, 4shared is one of the largest file-sharing services.

As with many other sites in this niche, copyright holders often complain about the pirated files that are available on the site. Interestingly, however, most complaints are sent to Google.

Over the past several years the search engine has received a massive 50 million takedown requests for 4shared URLs alone. 4shared itself, which has a DMCA takedown procedure in place, receives only a fraction of this number.

Speaking with TorrentFreak, 4shared says it is doing its best to keep rightsholders happy. It has provided several of them with direct-delete accounts, so they can take infringing files offline as quickly as possible.

In addition, 4shared is using the fingerprinting software Echoprint to detect and remove pirated files from its service. This has helped the file-hosting site significantly reduce the number of takedown requests it receives.

“This is our latest and the most efficient system for taking down copyrighted audio files,” 4shared’s Mike tells us.

“We can see that the volume of removal requests keeps reducing from month to month. It has already reached approximately 6,000 per month, which is fifteen times less than the 90,000 monthly requests we received at the beginning of 2015.”

Takedown requests 4shared received

While 4shared has been using the content recognition software for quite a while already, not all copyright holders are eager to use it. Several large industry groups such as IFPI refuse to provide 4shared with fingerprint data.

As a result, the file-hosting service decided to build its own database based on the takedown notices they receive.

“We are gathering the data this way, because IFPI declines our request to provide ‘fingerprints’ upfront,” Mike says.

“Currently we are building the database for the audio content recognition system from direct ban link submissions and the DMCA notices that IFPI and several other major organizations send.”

When a takedown notice arrives, 4shared “fingerprints” the audio file, which is then added to the database. If someone then tries to upload the same file again, an error message occurs.
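
As a deliberately simplified sketch of that workflow (real audio fingerprint matching such as Echoprint is fuzzier than an exact match, and the table and column names here are made up), the blocklist idea boils down to something like this:

-- Fingerprints extracted from files named in takedown notices
CREATE TABLE banned_fingerprints (
  fingerprint varchar(64) PRIMARY KEY,
  notice_id   bigint,
  banned_at   timestamp DEFAULT CURRENT_TIMESTAMP
);

-- At upload time: reject the file if its fingerprint is already banned
-- ('abc123' stands in for the fingerprint of the newly uploaded file)
SELECT EXISTS (
  SELECT 1 FROM banned_fingerprints WHERE fingerprint = 'abc123'
) AS is_blocked;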

4shared doesn’t understand why rightsholders are unwilling to submit the data themselves. There is no need to share actual audio files, they stress, as the fingerprinting data can be easily extracted using a standalone software tool.

The file-hosting service hopes that copyright holders will realize the potential of the system. Not only is it more accurate than the current takedown efforts, but it can also save them a lot of time and money.

“In my opinion, the amount of effort for creating ‘fingerprints’ and uploading to 4shared’s audio recognition database is comparable, or even less, than the amount of effort and the cost of maintaining numerous agents and developing robots that collect lists of links for the direct ban requests or complaints they send,” Mike concludes.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Amazon Redshift Engineering’s Advanced Table Design Playbook: Distribution Styles and Distribution Keys

Post Syndicated from AWS Big Data Blog original https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-distribution-styles-and-distribution-keys/

Zach Christopherson is a Senior Database Engineer on the Amazon Redshift team.


Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys (December 6, 2016)
Part 4: Compression Encodings (December 7, 2016)
Part 5: Table Data Durability (December 8, 2016)


The first table and column properties we discuss in this blog series are table distribution styles (DISTSTYLE) and distribution keys (DISTKEY). This blog installment presents a methodology to guide you through the identification of optimal DISTSTYLEs and DISTKEYs for your unique workload.

When you load data into a table, Amazon Redshift distributes the rows to each of the compute nodes according to the table’s DISTSTYLE. Within each compute node, the rows are assigned to a cluster slice. Depending on node type, each compute node contains 2, 16, or 32 slices. You can think of a slice as a virtual compute node. During query execution, all slices process their assigned rows in parallel. The primary goal in selecting a table’s DISTSTYLE is to evenly distribute the data throughout the cluster for parallel processing.

When you execute a query, the query optimizer might redistribute or broadcast the intermediate tuples throughout the cluster to facilitate any join or aggregation operations. The secondary goal in selecting a table’s DISTSTYLE is to minimize the cost of data movement necessary for query processing. To minimize that movement, data should already be located where it needs to be before the query is executed.
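
One practical way to see that data movement is in the query plan. As a rough sketch against the TPC-H tables used later in this post, EXPLAIN annotates each join with a distribution attribute such as DS_DIST_NONE (the join is collocated, no movement needed), DS_BCAST_INNER (the inner table is broadcast to every compute node), or DS_DIST_BOTH (both sides are redistributed at query time):

EXPLAIN
SELECT o.o_orderdate, SUM(l.l_extendedprice)
FROM orders o
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
GROUP BY o.o_orderdate;

-- Abridged plan output; the join attribute is what matters here:
--   XN Hash Join DS_DIST_NONE    <- both tables distributed on the join key
--   XN Hash Join DS_BCAST_INNER  <- inner table broadcast to all nodes
--   XN Hash Join DS_DIST_BOTH    <- both sides redistributed during the query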

A table might be defined with a DISTSTYLE of EVEN, KEY, or ALL. If you’re unfamiliar with these table properties, you can watch my presentation at the 2016 AWS Santa Clara Summit, where I discussed the basics of distribution starting at the 17-minute mark. I summarize these here:

  • EVEN will do a round-robin distribution of data.
  • KEY requires a single column to be defined as a DISTKEY. On ingest, Amazon Redshift hashes each DISTKEY column value and consistently routes rows with the same hash to the same slice.
  • ALL distribution stores a full copy of the table on the first slice of each node.
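
As a minimal DDL sketch of those three options (the table names and two-column definitions are only for illustration):

-- EVEN: round-robin distribution
CREATE TABLE orders_even (o_orderkey int8, o_custkey int8) DISTSTYLE EVEN;

-- KEY: rows are hashed on the declared DISTKEY column
CREATE TABLE orders_key (o_orderkey int8 DISTKEY, o_custkey int8);

-- ALL: a full copy of the table is stored on each node
CREATE TABLE orders_all (o_orderkey int8, o_custkey int8) DISTSTYLE ALL;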

Which style is most appropriate for your table is determined by several criteria. This post presents a two-phase flow chart that will guide you through questions to ask of your data profile to arrive at the ideal DISTSTYLE and DISTKEY for your scenario.

Phase 1: Identifying Appropriate DISTKEY Columns

Phase 1 seeks to determine if KEY distribution is appropriate. To do so, first determine if the table contains any columns that would appropriately distribute the table data if they were specified as a DISTKEY. If we find that no columns are acceptable DISTKEY columns, then we can eliminate DISTSTYLE KEY as a potential DISTSTYLE option for this table.

Does the column data have a uniformly distributed data profile?

If the hashed column values don’t enable uniform distribution of data to the cluster slices, then you’ll end up with both data skew at rest and data skew in flight (during query processing)—which results in a performance hit due to an unevenly parallelized workload. A nonuniformly distributed data profile occurs in scenarios such as these:

  • Distributing on a column containing a significant percentage of NULL values
  • Distributing on a column, customer_id, where a minority of your customers are responsible for the majority of your data

You can easily identify columns that contain “heavy hitters” or introduce “hot spots” by using some simple SQL code to review the dataset. In the example following, l_orderkey stands out as a poor option that you can eliminate as a potential DISTKEY column:

root@redshift/dev=# SELECT l_orderkey, COUNT(*) 
FROM lineitem 
GROUP BY 1 
ORDER BY 2 DESC 
LIMIT 100;
 l_orderkey |   count
------------+----------
     [NULL] | 124993010
  260642439 |        80
  240404513 |        80
   56095490 |        72
  348088964 |        72
  466727011 |        72
  438870661 |        72 
...
...

When distributing on a given column, it is desirable to have a nearly consistent number of rows/blocks on each slice. Suppose that you think that you’ve identified a column that should result in uniform distribution but want to confirm this. Here, it’s much more efficient to materialize a single-column temporary table, rather than redistributing the entire table only to find out there was nonuniform distribution:

-- Materialize a single column to check distribution
CREATE TEMP TABLE lineitem_dk_l_partkey DISTKEY (l_partkey) AS 
SELECT l_partkey FROM lineitem;

-- Identify the table OID
root@redshift/tpch=# SELECT 'lineitem_dk_l_partkey'::regclass::oid;
  oid
--------
 240791
(1 row) 

Now that the table exists, it’s trivial to review the distribution. In the query results below, we can assess these characteristics for a given table with a defined DISTKEY:

  • skew_rows: A ratio of the number of table rows from the slice with the most rows compared to the slice with the fewest table rows. This value defaults to 100.00 if the table doesn’t populate every slice in the cluster. Closer to 1.00 is ideal.
  • storage_skew: A ratio of the number of blocks consumed by the slice with the most blocks compared to the slice with the fewest blocks. Closer to 1.00 is ideal.
  • pct_populated: Percentage of slices in the cluster that have at least 1 table row. Closer to 100 is ideal.

SELECT "table" tablename, skew_rows,
  ROUND(CAST(max_blocks_per_slice AS FLOAT) /
  GREATEST(NVL(min_blocks_per_slice,0)::int,1)::FLOAT,5) storage_skew,
  ROUND(CAST(100*dist_slice AS FLOAT) /
  (SELECT COUNT(DISTINCT slice) FROM stv_slices),2) pct_populated
FROM svv_table_info ti
  JOIN (SELECT tbl, MIN(c) min_blocks_per_slice,
          MAX(c) max_blocks_per_slice,
          COUNT(DISTINCT slice) dist_slice
        FROM (SELECT b.tbl, b.slice, COUNT(*) AS c
              FROM STV_BLOCKLIST b
              GROUP BY b.tbl, b.slice)
        WHERE tbl = 240791 GROUP BY tbl) iq ON iq.tbl = ti.table_id;
       tablename       | skew_rows | storage_skew | pct_populated
-----------------------+-----------+--------------+---------------
 lineitem_dk_l_partkey |      1.00 |      1.00259 |           100
(1 row)

Note: A small amount of data skew shouldn’t immediately discourage you from considering an otherwise appropriate distribution key. In many cases, the benefits of collocating large JOIN operations offset the cost of cluster slices processing a slightly uneven workload.

Does the column data have high cardinality?

Cardinality is a relative measure of how many distinct values exist within the column. It’s important to consider cardinality alongside the uniformity of data distribution. In some scenarios, a uniform distribution of data can result in low relative cardinality. Low relative cardinality leads to wasted compute capacity from lack of parallelization. For example, consider a cluster with 576 slices (36x DS2.8XLARGE) and the following table:

CREATE TABLE orders (                                            
  o_orderkey int8 NOT NULL			,
  o_custkey int8 NOT NULL			,
  o_orderstatus char(1) NOT NULL		,
  o_totalprice numeric(12,2) NOT NULL	,
  o_orderdate date NOT NULL DISTKEY ,
  o_orderpriority char(15) NOT NULL	,
  o_clerk char(15) NOT NULL			,
  o_shippriority int4 NOT NULL		,
  o_comment varchar(79) NOT NULL                  
); 

 

Within this table, I retain a billion records representing 12 months of orders. Day to day, I expect that the number of orders remains more or less consistent. This consistency creates a uniformly distributed dataset:

root@redshift/tpch=# SELECT o_orderdate, count(*) 
FROM orders GROUP BY 1 ORDER BY 2 DESC; 
 o_orderdate |  count
-------------+---------
 1993-01-18  | 2651712
 1993-08-29  | 2646252
 1993-12-05  | 2644488
 1993-12-04  | 2642598
...
...
 1993-09-28  | 2593332
 1993-12-12  | 2593164
 1993-11-14  | 2593164
 1993-12-07  | 2592324
(365 rows)

However, the cardinality is relatively low when we compare the 365 distinct values of the o_orderdate DISTKEY column to the 576 cluster slices. If each day’s value were hashed and assigned to an empty slice, this data only populates 63% of the cluster at best. Over 36% of the cluster remains idle during scans against this table. In real-life scenarios, we’ll end up assigning multiple distinct values to already populated slices before we populate each empty slice with at least one value.

-- How many values are assigned to each slice
root@redshift/tpch=# SELECT rows/2592324 assigned_values, COUNT(*) number_of_slices FROM stv_tbl_perm WHERE name='orders' AND slice<6400 
GROUP BY 1 ORDER BY 1;
 assigned_values | number_of_slices
-----------------+------------------
               0 |              307
               1 |              192
               2 |               61
               3 |               13
               4 |                3
(5 rows)

So in this scenario, on one end of the spectrum we have 307 of 576 slices not populated with any day’s worth of data, and on the other end we have 3 slices populated with 4 days’ worth of data. Query execution is limited by the rate at which those 3 slices can process their data. At the same time, over half of the cluster remains idle.

Note: The pct_slices_populated column from the table_inspector.sql query result identifies tables that aren’t fully populating the slices within a cluster.

On the other hand, suppose the o_orderdate DISTKEY column was defined with the timestamp data type and actually stores true order timestamp data (not dates stored as timestamps). In this case, the granularity of the time dimension causes the cardinality of the column to increase from the order of hundreds to the order of millions of distinct values. This approach results in all 576 slices being much more evenly populated.

Note: A timestamp column isn’t usually an appropriate DISTKEY column, because it’s often not joined or aggregated on. However, this case illustrates how relative cardinality can be influenced by data granularity, and the significance it has in resulting in a uniform and complete distribution of table data throughout a cluster.
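
If you want to confirm how much granularity changes relative cardinality, a quick check (assuming the column is stored as a timestamp) is to compare the distinct counts at each granularity against the number of slices in the cluster:

-- Distinct values at date vs. timestamp granularity
SELECT COUNT(DISTINCT o_orderdate::date) AS distinct_days,
       COUNT(DISTINCT o_orderdate)       AS distinct_timestamps
FROM orders;

-- Number of cluster slices to compare against
SELECT COUNT(DISTINCT slice) AS cluster_slices FROM stv_slices;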

Do queries perform selective filters on the column?

 

Even if the DISTKEY column ensures a uniform distribution of data throughout the cluster, suboptimal parallelism can arise if that same column is also used to selectively filter records from the table. To illustrate this, the same orders table with a DISTKEY on o_orderdate is still populated with 1 billion records spanning 365 days of data:

CREATE TABLE orders (                                            
  o_orderkey int8 NOT NULL			,
  o_custkey int8 NOT NULL			,
  o_orderstatus char(1) NOT NULL		,
  o_totalprice numeric(12,2) NOT NULL	,
  o_orderdate date NOT NULL DISTKEY ,
  o_orderpriority char(15) NOT NULL	,
  o_clerk char(15) NOT NULL			,
  o_shippriority int4 NOT NULL		,
  o_comment varchar(79) NOT NULL                  
); 

This time, consider the table on a smaller cluster with 80 slices (5x DS2.8XLARGE) instead of 576 slices. With a uniform data distribution and ~4-5x more distinct values than cluster slices, it’s likely that query execution is more evenly parallelized for full table scans of the table. This effect occurs because each slice is more likely to be populated and assigned an equivalent number of records.

However, in many use cases full table scans are uncommon. For example, with time series data it’s more typical for the workload to scan the past 1, 7, or 30 days of data than it is to repeatedly scan the entire table. Let’s assume I have one of these time series data workloads that performs analytics on orders from the last 7 days with SQL patterns, such as the following:

SELECT ... FROM orders 
JOIN ... 
JOIN ... 
WHERE ...
AND o_orderdate between current_date-7 and current_date-1
GROUP BY ...;  

With a predicate such as this, we limit the relevant values to just 7 days. All of these days must reside on a maximum of 7 slices within the cluster. Due to consistent hashing, slices that contain one or more of these 7 values contain all of the records for those specific values:

root@redshift/tpch=# SELECT SLICE_NUM(), COUNT(*) FROM orders 
WHERE o_orderdate BETWEEN current_date-7 AND current_date-1 
GROUP BY 1 ORDER BY 1;
 slice_num |  count
-----------+---------
         3 | 2553840
        33 | 2553892
        40 | 2555232
        41 | 2553092
        54 | 2554296
        74 | 2552168
        76 | 2552224
(7 rows)  

With the dataset shown above, we have at best 7 slices, each fetching 2.5 million rows to perform further processing. For the scenario with EVEN distribution, we expect 80 slices to fetch ~240,000 records each (((10^9 records / 365 days) * 7 days) / 80 slices). The important comparison to consider is whether there is significant overhead in having only 7 slices fetch and process 2.5 million records each, relative to all 80 slices fetching and processing ~240,000 records each.

If the overhead of having a subset of slices perform the majority of the work is significant, then you want to separate your distribution style from your selective filtering criteria. To do so, choose a different distribution key.
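
A common way to move an existing table onto a different distribution key is a deep copy into a new table followed by a rename. A minimal sketch, assuming o_custkey were chosen as the alternative key:

-- Deep copy into a table distributed on the alternative key
CREATE TABLE orders_by_custkey
DISTKEY (o_custkey)
AS SELECT * FROM orders;

-- Once the copy is verified, swap the tables
ALTER TABLE orders RENAME TO orders_datekey_old;
ALTER TABLE orders_by_custkey RENAME TO orders;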

Use the following query to identify how frequently your scans include predicates which filter on the table’s various columns:

SELECT 
    ti."table", ti.diststyle, RTRIM(a.attname) column_name,
    COUNT(DISTINCT s.query ||'-'|| s.segment ||'-'|| s.step) as num_scans,
    COUNT(DISTINCT CASE WHEN TRANSLATE(TRANSLATE(info,')',' '),'(',' ') LIKE ('%'|| a.attname ||'%') THEN s.query ||'-'|| s.segment ||'-'|| s.step END) AS column_filters
FROM stl_explain p
JOIN stl_plan_info i ON ( i.userid=p.userid AND i.query=p.query AND i.nodeid=p.nodeid  )
JOIN stl_scan s ON (s.userid=i.userid AND s.query=i.query AND s.segment=i.segment AND s.step=i.step)
JOIN svv_table_info ti ON ti.table_id=s.tbl
JOIN pg_attribute a ON (a.attrelid=s.tbl AND a.attnum > 0)
WHERE s.tbl IN ([table_id]) 
GROUP BY 1,2,3,a.attnum
ORDER BY attnum;  

From this query result, if the potential DISTKEY column is frequently scanned, you can investigate further, using more complex SQL, to identify whether those filters are extremely selective:

SELECT 
    ti.schemaname||'.'||ti.tablename AS "table", 
    ti.tbl_rows,
    AVG(r.s_rows_pre_filter) avg_s_rows_pre_filter,
    100*ROUND(1::float - AVG(r.s_rows_pre_filter)::float/ti.tbl_rows::float,6) avg_prune_pct,
    AVG(r.s_rows) avg_s_rows,
    100*ROUND(1::float - AVG(r.s_rows)::float/AVG(r.s_rows_pre_filter)::float,6) avg_filter_pct,
    COUNT(DISTINCT i.query) AS num,
    AVG(r.time) AS scan_time,
    MAX(i.query) AS query, TRIM(info) as filter
FROM stl_explain p
JOIN stl_plan_info i ON ( i.userid=p.userid AND i.query=p.query AND i.nodeid=p.nodeid  )
JOIN stl_scan s ON (s.userid=i.userid AND s.query=i.query AND s.segment=i.segment AND s.step=i.step)
JOIN (SELECT table_id,"table" tablename,schema schemaname,tbl_rows,unsorted,sortkey1,sortkey_num,diststyle FROM svv_table_info) ti ON ti.table_id=s.tbl
JOIN (
SELECT query, segment, step, DATEDIFF(s,MIN(starttime),MAX(endtime)) AS time, SUM(rows) s_rows, SUM(rows_pre_filter) s_rows_pre_filter, ROUND(SUM(rows)::float/SUM(rows_pre_filter)::float,6) filter_pct
FROM stl_scan
WHERE userid>1 AND type=2
AND starttime < endtime
GROUP BY 1,2,3
HAVING sum(rows_pre_filter) > 0
) r ON (r.query = i.query and r.segment = i.segment and r.step = i.step)
LEFT JOIN (SELECT attrelid,t.typname FROM pg_attribute a JOIN pg_type t ON t.oid=a.atttypid WHERE attsortkeyord IN (1,-1)) a ON a.attrelid=s.tbl
WHERE s.tbl IN ([table_id])
AND p.info LIKE 'Filter:%' AND p.nodeid > 0
GROUP BY 1,2,10 ORDER BY 1, 9 DESC;

The above SQL describes these items:

  • tbl_rows: Current number of rows in the table at this moment in time.
  • avg_s_rows_pre_filter: Number of rows that were actually scanned after the zone maps were leveraged to prune a number of blocks from being fetched.
  • avg_prune_pct: Percentage of rows that were pruned from the table just by leveraging the zone maps.
  • avg_s_rows: Number of rows remaining after applying the filter criteria defined in the SQL.
  • avg_filter_pct: Percentage of rows, relative to avg_s_rows_pre_filter, that were removed by the user-defined filter criteria.
  • num: Number of queries that include this filter criteria.
  • scan_time: Average number of seconds it takes for the segment which includes that scan to complete.
  • query: Example query ID for the query that issued these filter criteria.
  • filter: Detailed filter criteria specified by user.

In the following query results, we can assess the selectivity for a given filter predicate. Your knowledge of the data profile, and how many distinct values exist within a given range constrained by the filter condition, lets you identify whether a filter should be considered selective or not. If you’re not sure of the data profile, you can always construct SQL code from the query results to get a count of distinct values within that range:

table                 | public.orders
tbl_rows              | 22751520
avg_s_rows_pre_filter | 12581124
avg_prune_pct         | 44.7021
avg_s_rows            | 5736106
avg_filter_pct        | 54.407
num                   | 2
scan_time             | 19
query                 | 1721037
filter                | Filter: ((o_orderdate < '1993-08-01'::date) AND (o_orderdate >= '1993-05-01'::date))

SELECT COUNT(DISTINCT o_orderdate) 
FROM public.orders 
WHERE o_orderdate < '1993-08-01' AND o_orderdate >= '1993-05-01';

We’d especially like to avoid columns that have query patterns with these characteristics:

  • Relative to tbl_rows:
    • A low value for avg_s_rows
    • A high value for avg_s_rows_pre_filter
  • A selective filter on the potential DISTKEY column
  • Limited distinct values within the returned range
  • High scan_time

 If such patterns exist for a column, it’s likely that this column is not a good DISTKEY candidate.

Is the column also a primary compound sortkey column?

 

Note: Sort keys are discussed in greater detail within Part 3 of this blog series.

As shown in the flow chart, even if we are using the column to selectively filter records (thereby potentially restricting post-scan processing to a portion of the slices), in some circumstances it still makes sense to use the column as the distribution key.

If we selectively filter on the column, we're probably also using a sortkey on it. Sorting on the column lets each slice use the column zone maps effectively: slices holding no relevant values rule themselves out quickly, and the remaining slices quickly identify the relevant blocks to fetch. This makes the selective scan cheaper by orders of magnitude than a full column scan on each slice, which in turn helps offset the cost of having a reduced number of slices process the bulk of the data after the scan.
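For illustration only, a trimmed-down, hypothetical DDL pairing the distribution key and the leading sort key on the same selectively filtered column might look like the following (the real orders table has more columns than shown here):

CREATE TABLE orders_by_date (
  o_orderkey   int8 NOT NULL,
  o_orderdate  date NOT NULL,
  o_totalprice numeric(12,2) NOT NULL
)
DISTSTYLE KEY
DISTKEY (o_orderdate)
SORTKEY (o_orderdate);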

You can use the following query to determine the primary sortkey for a table:

SELECT attname FROM pg_attribute 
WHERE attrelid = [table_id] AND attsortkeyord = 1;

Using the SQL code from the last step (used to check avg_s_rows, the number of distinct values in the returned range, and so on), we see the characteristics of a valid DISTKEY option include the following:

  • Relative to tbl_rows, a low value for avg_s_rows_pre_filter
  • Relative to avg_s_rows_pre_filter, a similar number for avg_s_rows
  • Selective filter on the potential DISTKEY column
  • Numerous distinct values within the returned range
  • Low or insignificant scan_time

If such patterns exist, it’s likely that this column is a good DISTKEY candidate.

Do the query patterns facilitate MERGE JOINs?

 

When the following criteria are met, you can use a MERGE JOIN operation, the fastest of the three join operations:

  1. Two tables are sorted (using a compound sort key) and distributed on the same columns.
  2. Both tables are over 80% sorted (svv_table_info.unsorted < 20%)
  3. These tables are joined using the DISTKEY and SORTKEY columns in the JOIN condition.

Because of these restrictive criteria, it’s unusual to encounter a MERGE JOIN operation by chance. Typically, an end user makes explicit design decisions to force this type of JOIN operation, usually because of a requirement for a particular query’s performance. If this JOIN pattern doesn’t exist in your workload, then you won’t benefit from this optimized JOIN operation.
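As a sketch of what such an explicit design looks like (table and column names here are hypothetical), both tables are distributed and compound-sorted on the join column:

CREATE TABLE fact_orders (
  o_orderkey   int8 NOT NULL,
  o_totalprice numeric(12,2) NOT NULL
)
DISTKEY (o_orderkey)
COMPOUND SORTKEY (o_orderkey);

CREATE TABLE fact_lineitem (
  l_orderkey int8 NOT NULL,
  l_quantity numeric(12,2) NOT NULL
)
DISTKEY (l_orderkey)
COMPOUND SORTKEY (l_orderkey);

-- With both tables loaded and kept well sorted, a join on the shared
-- DISTKEY/SORTKEY column is eligible for a merge join:
EXPLAIN
SELECT o.o_orderkey, SUM(l.l_quantity)
FROM fact_orders o
JOIN fact_lineitem l ON l.l_orderkey = o.o_orderkey
GROUP BY 1;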

The following query returns the number of statements that scanned your table, scanned another table that was sorted and distributed on the same column, and performed some type of JOIN operation:

SELECT COUNT(*) num_queries FROM stl_query
WHERE query IN (
  SELECT DISTINCT query FROM stl_scan 
  WHERE tbl = [table_id] AND type = 2 AND userid > 1
  INTERSECT
  SELECT DISTINCT query FROM stl_scan 
  WHERE tbl <> [table_id] AND type = 2 AND userid > 1
  AND tbl IN (
    SELECT DISTINCT attrelid FROM pg_attribute 
    WHERE attisdistkey = true AND attsortkeyord > 0
    MINUS
    SELECT DISTINCT attrelid FROM pg_attribute
    WHERE attsortkeyord = -1)
  INTERSECT
  (SELECT DISTINCT query FROM stl_hashjoin WHERE userid > 1
  UNION
  SELECT DISTINCT query FROM stl_nestloop WHERE userid > 1
  UNION
  SELECT DISTINCT query FROM stl_mergejoin WHERE userid > 1)
);

If this query returns a nonzero count, you potentially have an opportunity to enable a MERGE JOIN for existing queries without modifying any other tables. If it returns zero, then you need to proactively tune multiple tables simultaneously to facilitate the performance of a single query.

Note: If a desired MERGE JOIN optimization requires reviewing and modifying multiple tables, you approach the problem in a different fashion than this straightforward approach. This more complex approach goes beyond the scope of this article. If you’re interested in implementing such an optimization, you can check our documentation on the JOIN operations and ask specific questions in the comments at the end of this blog post.

Phase One Recap

Throughout this phase, we answered questions to determine which columns in this table were potentially appropriate DISTKEY columns for our table. At the end of these steps, you might have identified zero to many potential columns for your specific table and dataset. We’ll be keeping these columns (or lack thereof) in mind as we move along to the next phase.

Phase 2: Deciding Distribution Style

Phase 2 dives deeper into the potential distribution styles to determine which is the best choice for your workload. Generally, it’s best to strive for a DISTSTYLE of KEY whenever appropriate. Choose ALL in the scenarios where it makes sense (and KEY doesn’t). Only choose EVEN when neither KEY nor ALL is appropriate.
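As a quick syntax refresher (these minimal tables are hypothetical and not part of the workload discussed here), the three styles are declared in the DDL as follows:

-- Minimal DDL examples for each distribution style
CREATE TABLE t_key  (id int8, val int4) DISTSTYLE KEY DISTKEY (id);
CREATE TABLE t_all  (id int8, val int4) DISTSTYLE ALL;
CREATE TABLE t_even (id int8, val int4) DISTSTYLE EVEN;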

We’ll work through the following flowchart to guide the decision. Because DISTSTYLE is a table property, we run through this analysis table by table, after completing the phase 1 analysis preceding.

[Flowchart: Phase 2 DISTSTYLE decision process]

Does the table participate in JOINs?

 

DISTSTYLE ALL is only used to guarantee colocation of JOIN operations, regardless of the columns specified in the JOIN conditions. If the table doesn’t participate in JOIN operations, then DISTSTYLE ALL offers no performance benefits and should be eliminated from consideration.

JOIN operations that benefit from colocation span a robust set of database operations. WHERE clause and JOIN clause join operations (INNER, OUTER, and so on) are obviously included, and so are some not-as-obvious operations and syntax like IN, NOT IN, MINUS/EXCEPT, INTERSECT and EXISTS. When answering whether the table participates in JOINs, consider all of these operations.
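To make that concrete, here are two hypothetical query shapes (table and column names are illustrative) that participate in joins even though they contain no explicit JOIN keyword:

-- An IN subquery is resolved as a join between orders and customer
SELECT o_orderkey
FROM orders
WHERE o_custkey IN (SELECT c_custkey FROM customer);

-- EXISTS is likewise resolved with a join under the hood
SELECT o.o_orderkey
FROM orders o
WHERE EXISTS (SELECT 1 FROM lineitem l WHERE l.l_orderkey = o.o_orderkey);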

This query confirms how many distinct queries have scanned this table and have included one or more JOIN operations at some point in the same query:

SELECT COUNT(*) FROM (
SELECT DISTINCT query FROM stl_scan 
WHERE tbl = [table_id] AND type = 2 AND userid > 1
INTERSECT
(SELECT DISTINCT query FROM stl_hashjoin
UNION
SELECT DISTINCT query FROM stl_nestloop
UNION
SELECT DISTINCT query FROM stl_mergejoin));

If this query returns a count of 0, then the table isn’t participating in any type of JOIN, no matter what operations are in use.

Note: Certain uncommon query patterns can cause the preceding query to return false positives (such as if you have simple scan against your table that is later appended to a result set of a subquery that contains JOINs). If you’re not sure, you can always look at the queries specifically with this code:

SELECT userid, query, starttime, endtime, rtrim(querytxt) qtxt 
FROM stl_query WHERE query IN (
SELECT DISTINCT query FROM stl_scan 
WHERE tbl = [table_id] AND type = 2 AND userid > 1
INTERSECT
(SELECT DISTINCT query FROM stl_hashjoin
UNION
SELECT DISTINCT query FROM stl_nestloop
UNION
SELECT DISTINCT query FROM stl_mergejoin))
ORDER BY starttime;

 

Does the table contain at least one potential DISTKEY column?

 

The process detailed in phase 1 helped us to identify a table’s appropriate DISTKEY columns. If no appropriate DISTKEY columns exist, then KEY DISTSTYLE is removed from consideration. If appropriate DISTKEY columns do exist, then EVEN distribution is removed from consideration.

With this simple rule, the decision is never between KEY, EVEN, and ALL—rather it’s between these:

  • KEY and ALL in cases where at least one valid DISTKEY column exists
  • EVEN and ALL in cases where no valid DISTKEY columns exist

 

Can you tolerate additional storage overhead?

 

To answer whether you can tolerate additional storage overhead, the questions are: How large is the table and how is it currently distributed? You can use the following query to answer these questions:

SELECT table_id, "table", diststyle, size, pct_used 
FROM svv_table_info WHERE table_id = [table_id];

The following example shows the number of 1 MB blocks, and the percentage of total cluster storage, currently consumed by duplicate versions of the same orders table built with different DISTSTYLEs:

root@redshift/tpch=# SELECT "table", diststyle, size, pct_used
FROM svv_table_info
WHERE "table" LIKE 'orders_diststyle_%';
         table         |    diststyle    | size  | pct_used
-----------------------+-----------------+-------+----------
 orders_diststyle_even | EVEN            |  6740 |   1.1785
 orders_diststyle_key  | KEY(o_orderkey) |  6740 |   1.1785
 orders_diststyle_all  | ALL             | 19983 |   3.4941
(3 rows)

For DISTSTYLE EVEN or KEY, each node receives just a portion of total table data. However, with DISTSTYLE ALL we are storing a complete version of the table on each compute node. For ALL, as we add nodes to a cluster the amount of data per node remains unchanged. Whether this is significant or not depends on your table size, cluster configuration, and storage overhead. If you use a DS2.8XLARGE configuration with 16TB of storage per node, this increase might be a negligible amount of per-node storage. However, if you use a DC1.LARGE configuration with 160GB of storage per node, then the increase in total cluster storage might be an unacceptable increase.
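To see exactly how a table's blocks are spread across nodes today, the following sketch counts 1 MB blocks per node (substitute your own [table_id]); for a DISTSTYLE ALL table, every node holds roughly the full set of blocks:

-- Per-node 1 MB block counts for a single table
SELECT s.node, COUNT(*) AS blocks_1mb
FROM stv_blocklist b
JOIN stv_slices s ON s.slice = b.slice
WHERE b.tbl = [table_id]
GROUP BY 1
ORDER BY 1;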

You can multiply the number of nodes by the current size of your KEY or EVEN distributed table to get a rough estimate of the size of the table as DISTSTYLE ALL. This estimate should provide enough information to determine whether ALL results in an unacceptable growth in table storage:

SELECT "table", size, pct_used, 
 CASE diststyle
  WHEN 'ALL' THEN size::TEXT
  ELSE '< ' || size*(SELECT COUNT(DISTINCT node) FROM stv_slices)
 END est_distall_size,
 CASE diststyle
  WHEN 'ALL' THEN pct_used::TEXT
  ELSE '< ' || pct_used*(SELECT COUNT(DISTINCT node) FROM stv_slices)
 END est_distall_pct_used
FROM svv_table_info WHERE table_id = [table_id];

If the estimate is unacceptable, then DISTSTYLE ALL should be removed from consideration.

Do the query patterns tolerate reduced parallelism?

 

In MPP database systems, performance at scale is achieved by simultaneously processing portions of the complete dataset with several distributed resources. DISTSTYLE ALL means that you’re sacrificing some parallelism, for both read and write operations, to guarantee a colocation of data on each node.

At some point, the benefits of DISTSTYLE ALL tables are offset by the parallelism reduction. At this point, DISTSTYLE ALL is not a valid option. Where that threshold occurs is different for your write operations and your read operations.

Write operations

For a table with KEY or EVEN DISTSTYLE, database write operations are parallelized across each of the slices. This parallelism means that each slice needs to process only a portion of the complete write operation. For ALL distribution, the write operation doesn’t benefit from parallelism because the write needs to be performed in full on every single node to keep the full dataset synchronized on all nodes. This approach significantly reduces performance compared to the same type of write operation performed on a KEY or EVEN distributed table.

If your table is the target of frequent write operations and you find you can’t tolerate the performance hit, that eliminates DISTSTYLE ALL from consideration.

This query identifies how many write operations have modified a table:

SELECT '[table_id]' AS "table_id", 
(SELECT count(*) FROM 
(SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id]
INTERSECT
SELECT DISTINCT query FROM stl_delete WHERE tbl = [table_id])) AS num_updates,
(SELECT count(*) FROM 
(SELECT DISTINCT query FROM stl_delete WHERE tbl = [table_id]
MINUS
SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id])) AS num_deletes,
(SELECT COUNT(*) FROM 
(SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id] 
MINUS 
SELECT distinct query FROM stl_s3client
MINUS
SELECT DISTINCT query FROM stl_delete WHERE tbl = [table_id])) AS num_inserts,
(SELECT COUNT(*) FROM 
(SELECT DISTINCT query FROM stl_insert WHERE tbl = [table_id]
INTERSECT
SELECT distinct query FROM stl_s3client)) as num_copies,
(SELECT COUNT(*) FROM 
(SELECT DISTINCT xid FROM stl_vacuum WHERE table_id = [table_id]
AND status NOT LIKE 'Skipped%')) AS num_vacuum;

If your table is rarely written to, or if you can tolerate the performance hit, then DISTSTYLE ALL is still a valid option.

Read operations

Reads that access DISTSTYLE ALL tables require slices to scan and process the same data multiple times for a single query operation. This approach seeks to improve query performance by avoiding the network I/O overhead of broadcasting or redistributing data to facilitate a join or aggregation. At the same time, it increases the necessary compute and disk I/O due to the excess work being performed over the same data multiple times.

Suppose that you access the table in many ways, sometimes joining, sometimes not. In this case, you’ll need to determine if the benefit of collocating JOINs with DISTSTYLE ALL is significant and desirable or if the cost of reduced parallelism impacts your queries more significantly.
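If you do materialize test copies of the table with different DISTSTYLEs, one rough way to weigh that trade-off is to compare average scan times against each copy under your real queries. This is only a sketch; [table_id_key] and [table_id_all] are placeholders for your two test tables:

-- Average scan step duration per test copy
SELECT s.tbl, ti."table", ti.diststyle,
       AVG(DATEDIFF(ms, s.starttime, s.endtime)) AS avg_scan_ms
FROM stl_scan s
JOIN svv_table_info ti ON ti.table_id = s.tbl
WHERE s.tbl IN ([table_id_key], [table_id_all])
AND s.type = 2 AND s.userid > 1
GROUP BY 1, 2, 3;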

Patterns and trends to avoid

DISTSTYLE ALL tables are most appropriate for smaller, slowly changing dimension tables. As a general set of guidelines, the patterns following typically suggest that DISTSTYLE ALL is a poor option for a given table:

  • Read operations:
    • Scans against large fact tables
    • Single table scans that are not participating in JOINs
    • Scans against tables with complex aggregations (for example, several windowing aggregates with different partitioning, ordering, and frame clauses)
  • Write operations:
    • A table that is frequently modified with DML statements
    • A table that is ingested with massive data loads
    • A table that requires frequent maintenance with VACUUM or VACUUM REINDEX operations

If your table is accessed in a way that meets these criteria, then DISTSTYLE ALL is unlikely to be a valid option.

Do the query patterns utilize potential DISTKEY columns in JOIN conditions?

If the table participates in JOIN operations and has appropriate DISTKEY columns, then we need to decide between KEY or ALL distribution styles. Considering only how the table participates in JOIN operations, and no other outside factors, these criteria apply:

  • ALL distribution is most appropriate when the workload’s JOIN conditions don’t consistently use one of the potential DISTKEY columns, so colocation has to be guaranteed by storing the full table on every node.
  • KEY distribution is most appropriate when the workload’s JOIN conditions do consistently use one of the potential DISTKEY columns, so distributing on that column colocates the joins without the storage overhead of ALL.

 

Determining the best DISTKEY column

If you’ve determined that DISTSTYLE KEY is best for your table, the next step is to determine which column serves as the ideal DISTKEY column. Of the columns you’ve flagged as appropriate potential DISTKEY columns in phase 1, you’ll want to identify which has the largest impact on your particular workload.

For tables with only a single candidate column, or for workloads that only use one of the candidate columns in JOINs, the choice is obvious. For workloads with mixed JOIN conditions against the same table, the optimal column is determined by your business requirements.

For example, common scenarios to encounter and questions to ask yourself about how you want to distribute are the following:

  • My transformation SQL code and reporting workload benefit from different columns. Do I want to facilitate my transformation job or reporting performance?
  • My dashboard queries and structured reports leverage different JOIN conditions. Do I value interactive query end user experience over business-critical report SLAs?
  • Should I distribute on column_A that occurs in a JOIN condition thousands of times daily for less important analytics, or on column_B that is referenced only tens of times daily for more important analytics? Would I rather improve a 5 second query to 2 seconds 1,000 times per day, or improve a 60-minute query to 24 minutes twice per day?

Your business requirements and where you place value answer these questions, so there is no simple way to offer guidance that covers all scenarios. If you have a scenario with mixed JOIN conditions and no real winner in value, you can always test multiple distribution key options and measure what works best for you. Or you can materialize multiple copies of the table distributed on differing columns and route queries to disparate tables based on query requirements. If you end up attempting the latter approach, pgbouncer-rr is a great utility to simplify the routing of queries for your end users.

Next Steps

Choosing optimal DISTSTYLE and DISTKEY options for your table ensures that your data is distributed evenly for parallel processing, and that data redistribution during query execution is minimal—which ensures your complex analytical workloads perform well over multipetabyte datasets.

By following the process detailed preceding, you can identify the ideal DISTSTYLE and DISTKEY for your specific tables. The final step is to simply rebuild the tables to apply these optimizations. This rebuild can be performed at any time. However, if you intend to continue reading through parts 3, 4, and 5 of the Advanced Table Design Playbook, you might want to wait until the end before you issue the table rebuilds. Otherwise, you might find yourself rebuilding these tables multiple times to implement optimizations identified in later installments.
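When you do rebuild, the typical pattern is a deep copy into a new table created with the chosen DISTSTYLE and DISTKEY, followed by a rename swap. The following is a simplified, hypothetical sketch with a trimmed column list rather than a drop-in script:

-- Hypothetical deep-copy rebuild applying a new distribution key
BEGIN;
CREATE TABLE orders_new (
  o_orderkey   int8 NOT NULL,
  o_orderdate  date NOT NULL,
  o_totalprice numeric(12,2) NOT NULL
)
DISTKEY (o_orderkey);

INSERT INTO orders_new
SELECT o_orderkey, o_orderdate, o_totalprice FROM orders;

ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;

DROP TABLE orders_old;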

In Part 3 of our table design playbook, I’ll describe how to use table properties related to table sorting styles and sort keys for another significant performance gain.


Amazon Redshift Engineering’s Advanced Table Design Playbook

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys (December 6, 2016)
Part 4: Compression Encodings (December 7, 2016)
Part 5: Table Data Durability (December 8, 2016)


About the author


Zach Christopherson is a Palo Alto-based Senior Database Engineer at AWS.
He assists Amazon Redshift users from all industries in fine-tuning their workloads for optimal performance. As a member of the Amazon Redshift service team, he also influences and contributes to the development of new and existing service features. In his spare time, he enjoys trying new restaurants with his wife, Mary, and caring for his newborn daughter, Sophia.

 


Related

Top 10 Performance Tuning Techniques for Amazon Redshift (Updated Nov. 28, 2016)


Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization

Post Syndicated from AWS Big Data Blog original https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/

 Zach Christopherson is a Senior Database Engineer on the Amazon Redshift team.


Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys (December 6, 2016)
Part 4: Compression Encodings (December 7, 2016)
Part 5: Table Data Durability (December 8, 2016)


Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. AWS customers use Amazon Redshift for everything from accelerating existing database environments that are struggling to scale, to ingesting web logs for big data analytics. Amazon Redshift provides an industry-standard JDBC/ODBC driver interface, which allows connections from existing business intelligence tools and reuse of existing analytics queries.

With Amazon Redshift, you can implement any type of data model that’s standard throughout the industry. Whether your data model is third normal form (3NF), star, snowflake, denormalized flat tables, or a combination of these, Amazon Redshift’s unique table properties allow your complex analytical workloads to perform well over multipetabyte data sets.

In practice, I find that the best way to improve query performance by orders of magnitude is by tuning Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings and configuring tables for data durability and recovery purposes. I’ll offer concrete guidance on how to properly work with each property for your use case.

Prerequisites

If you’re working with an existing Amazon Redshift workload, then the Amazon Redshift system tables can help you determine the ideal configurations. Querying these tables for the complete dataset requires cluster access as a privileged superuser. You can determine if your user is privileged from the usesuper column in the following query result set:

root@redshift/dev=# SELECT usename, usesysid, usesuper FROM pg_user WHERE usename=current_user;
 usename | usesysid | usesuper
---------+----------+----------
 root    |      100 | t
(1 row)

In Amazon Redshift, a table rebuild is required when changing most table or column properties. To reduce the time spent rebuilding tables, identify all of the necessary changes up front, so that only a single rebuild is necessary. Once you’ve identified changes, you can query one of our amazon-redshift-utils view definitions (v_generate_tbl_ddl) to generate the existing DDL, for further modification to implement your identified changes.
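For example, assuming you’ve created the utility views in an admin schema and the view keeps the schemaname/tablename/seq/ddl layout from amazon-redshift-utils, pulling the current DDL for a single table looks roughly like this (schema and table names are illustrative):

SELECT ddl
FROM admin.v_generate_tbl_ddl
WHERE schemaname = 'public' AND tablename = 'orders'
ORDER BY seq;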

I’ve also improved the system view SVV_TABLE_INFO with a new view, named v_extended_table_info, which offers an extended output that makes schema and workload reviews much more efficient. I’ll refer to the result set returned by querying this view throughout the series, so I’d recommend that you create the view in the Amazon Redshift cluster database you’re optimizing.

For the sake of brevity throughout these topics, I’ll refer to tables by their object ID (OID). You can get this OID in one of several ways:

root@redshift/dev=# SELECT 'bi.param_tbl_chriz_header'::regclass::oid;
  oid
--------
 108342
(1 row)

root@redshift/dev=# SELECT oid, relname FROM pg_class 
WHERE relname='param_tbl_chriz_header';
  oid   |        relname
--------+------------------------
 108342 | param_tbl_chriz_header
(1 row)

root@redshift/dev=# SELECT table_id, "table" FROM svv_table_info 
WHERE "table"='param_tbl_chriz_header';
 table_id |         table
----------+------------------------
   108342 | param_tbl_chriz_header
(1 row)

root@redshift/dev=# SELECT DISTINCT id FROM stv_tbl_perm 
WHERE name='param_tbl_chriz_header';
   id
--------
 108342
(1 row)

Prioritization

 This series walks you through a number of processes that you can implement on a table-by-table basis. It’s not unusual for clusters that serve multiple disparate workloads to have thousands of tables. Because your time is finite, you’ll want to prioritize optimizations against the tables that are most significant to the workload, to deliver a meaningful improvement to the overall cluster performance.

If you’re a direct end user of the Amazon Redshift cluster, or if you have well-established communication with end users, then it might already be obvious where you should start optimizing. Perhaps end users are reporting cluster slowness for specific reports, which would highlight tables that need optimization.

If you lack intrinsic knowledge of the environment you’re planning to optimize, the scenario might not be as clear. For example, suppose one of the following is true:

  • You’re an external consultant, engaged to optimize an unknown workload for a new client.
  • You’re an Amazon Redshift subject matter expert within your circles, and you’re often approached for guidance regarding Amazon Redshift resources that you didn’t design or implement.
  • You’ve inherited operational ownership of an existing Amazon Redshift cluster and are unfamiliar with the workloads or issues.

Regardless of your particular scenario, it’s always invaluable to approach the optimization by first determining how best to spend your time.

I’ve found that scan frequency and table size are the two metrics most relevant to estimating table significance. The following SQL code helps identify a list of tables relevant to each given optimization scenario, based on characteristics of the recent historical workload. Each of these result sets are ordered by scan frequency, with most scanned tables first.

Scenario: “There are no specific reports of slowness, but I want to ensure I’m getting the most out of my cluster by performing a review on all tables.”

-- Returns table information for all scanned tables
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan WHERE type=2 
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

 Scenario: “The query with ID 4941313 is slow.”

-- Returns table information for all tables scanned by query 4941313
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan WHERE type=2 AND query = 4941313
) 
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “The queries running in transaction with XID=23200 are slow.”

-- Returns table information for all tables scanned within xid 23200
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan 
  WHERE type=2 
  AND query IN (SELECT query FROM stl_query WHERE xid=23200)
) 
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “Our ETL workload running between 02:00 and 04:00 UTC is exceeding our SLAs.”

-- Returns table information for all tables scanned by “etl_user” 
-- during 02:00 and 04:00 on 2016-09-09
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan 
  WHERE type=2 
  AND query IN (
    SELECT q.query FROM stl_query q
    JOIN pg_user u ON u.usesysid=q.userid
    WHERE u.usename='etl_user' 
    AND starttime BETWEEN '2016-09-09 2:00' AND '2016-09-09 04:00')
) 
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “Our reporting workload on tables in the ‘sales’ schema is slow.”

-- Returns table information for all tables scanned by queries 
-- from "reporting_user" which scanned tables in the "sales" schema 
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan 
  WHERE type=2 AND query IN (
    SELECT DISTINCT s.query FROM stl_scan s
    JOIN pg_user u ON u.usesysid = s.userid 
    WHERE s.type=2 AND u.usename='reporting_user' AND s.tbl IN (
      SELECT c.oid FROM pg_class c 
      JOIN pg_namespace n ON n.oid = c.relnamespace 
      WHERE nspname='sales'
    )
  )
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “Our dashboard queries need to be optimized.”

-- Returns table information for all tables scanned by queries 
-- from “dashboard_user”
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT s.tbl FROM stl_scan s
    JOIN pg_user u ON u.usesysid = s.userid 
    WHERE s.type=2 AND u.usename='dashboard_user' 
  )
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Now that we’ve identified which tables should be prioritized for optimization, we can begin. The next blog post in the series will discuss distribution styles and keys.


Amazon Redshift Engineering’s Advanced Table Design Playbook

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys (December 6, 2016)
Part 4: Compression Encodings (December 7, 2016)
Part 5: Table Data Durability (December 8, 2016)


About the author


Zach Christopherson is a Palo Alto-based Senior Database Engineer at AWS.
He assists Amazon Redshift users from all industries in fine-tuning their workloads for optimal performance. As a member of the Amazon Redshift service team, he also influences and contributes to the development of new and existing service features. In his spare time, he enjoys trying new restaurants with his wife, Mary, and caring for his newborn daughter, Sophia.


Related

Top 10 Performance Tuning Techniques for Amazon Redshift (Updated Nov. 28, 2016)


Security advisories for Monday

Post Syndicated from ris original http://lwn.net/Articles/708135/rss

Arch Linux has updated chromium (multiple vulnerabilities) and libdwarf (multiple vulnerabilities).

CentOS has updated firefox (C6; C5: code execution).

Debian-LTS has updated openafs (information leak).

Fedora has updated firefox (F25; F24; F23: code execution), gstreamer1-plugins-bad-free (F25: code execution), gstreamer1-plugins-good (F25: code execution), p7zip (F24; F23: denial of service), phpMyAdmin (F25: multiple vulnerabilities), thunderbird (F24: code execution), and xen (F25; F24; F23: multiple vulnerabilities).

Gentoo has updated busybox (two vulnerabilities), chromium (multiple vulnerabilities), cifs-utils (code execution from 2014), dpkg (code execution), gd (multiple vulnerabilities), libsndfile (two vulnerabilities), libvirt (path traversal), nghttp2 (code execution), nghttp2 (denial of service), patch (denial of service), and pygments (shell injection).

openSUSE has updated containerd, docker, runc (Leap42.1, 42.2: permission bypass), firefox (two vulnerabilities), java-1_7_0-openjdk (13.1: multiple vulnerabilities), java-1_8_0-openjdk (Leap42.1, 42.2: multiple vulnerabilities), libarchive (Leap42.2; Leap42.1: multiple vulnerabilities), thunderbird (code execution), nodejs4 (Leap42.2: code execution), phpMyAdmin (multiple vulnerabilities), sudo (Leap42.2; Leap42.1: three vulnerabilities), tar (Leap42.1, 42.2: file overwrite), and vim (Leap42.2; Leap42.1, 13.2: code execution).

Red Hat has updated thunderbird (code execution).

SUSE has updated qemu (SLE12-SP1: multiple vulnerabilities).
