Amazon Redshift UDF repository on AWSLabs

Post Syndicated from Christopher Crosbie original https://blogs.aws.amazon.com/bigdata/post/TxAV7SH8B701B9/Amazon-Redshift-UDF-repository-on-AWSLabs

Christopher Crosbie is a Healthcare and Life Science Solutions Architect with Amazon Web Services

Zach Christopherson, an Amazon Redshift Database Engineer, contributed to this post

Did you ever have a need for column-level encryption in Amazon Redshift and wish you could simply add f_encrypt_str(column, key) to your SQL query? Have you ever tried to weigh which would be less effort: writing a complicated regex in SQL to parse a query string or pulling the data into Python simply to take advantage of packages like urlparse? When was the last time you were developing a report and wished there was an easy way to get the next business day from a query result or even get the next business day according to your own company’s calendar?

These scenarios represent just a few of the Python UDF functions that AWS has released as part of the initial AWS Labs Amazon Redshift UDF repository: column encryption, parsing, date functions, and more! No longer are you constrained to the world of SQL within your Amazon Redshift data warehouse. Python UDFs allow you to extend Amazon Redshift SQL with Python’s rich library of packages.
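
For readers who have not written one before, here is what a scalar Python UDF looks like in practice. This is a minimal sketch rather than one of the repository’s functions: the f_hostname name and the weblogs table below are made up for illustration.

    CREATE OR REPLACE FUNCTION f_hostname(url VARCHAR)
    RETURNS VARCHAR
    STABLE
    AS $$
        # Amazon Redshift Python UDFs run Python 2.7, so urlparse is in the standard library.
        from urlparse import urlparse
        if url is None:
            return None
        return urlparse(url).hostname
    $$ LANGUAGE plpythonu;

    -- Hypothetical usage: count page views per referring host.
    -- SELECT f_hostname(referrer_url) AS host, COUNT(*) FROM weblogs GROUP BY 1;

Once created, the function can be called anywhere a built-in SQL function can, which is what makes helpers like the repository’s encryption and date functions so convenient to drop into existing queries.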

We encourage all Python UDF and Amazon Redshift developers to take a peek at what’s available today. We also encourage you to submit your own pull requests to show off what you can do with Amazon Redshift and Python UDF capabilities.

If you have questions or suggestions, please leave a comment below.

——————————————–

Related

Introduction to Python UDFs in Amazon Redshift

Foundation report for 2015

Post Syndicated from Michael "Monty" Widenius original http://monty-says.blogspot.com/2016/02/foundation-report-for-2015.html

This is a repost of Otto Kekäläinen’s blog about the MariaDB Foundation’s work in 2015.

The mariadb.org website had over one million page views in 2015, a growth of about 9% since 2014. Good growth has been visible all over the MariaDB ecosystem and we can conclude that 2015 was a successful year for MariaDB.

Increased adoption

MariaDB was included for the first time in an official Debian release (version 8.0 “Jessie”) and there has been strong adoption of MariaDB 10.0 in Linux distributions that already shipped 5.5. MariaDB is now available from all major Linux distributions including SUSE, RedHat, Debian and Ubuntu. Adoption of MariaDB on other platforms also increased, and MariaDB is now available as a database option on, among others, Amazon RDS, 1&1, Azure and the Juju Charm Store (Ubuntu).

Active maintenance and active development

In 2015 there were 6 releases of the 5.5 series, 8 releases of the 10.0 series and 8 releases of the 10.1 series. The 10.1 series was announced for general availability in October 2015 with the release of 10.1.8. In addition, there were also multiple releases of MariaDB Galera Cluster, and the C, Java and ODBC connectors, as well as many other MariaDB tools. The announcements for each release can be read in the mariadb.org blog archives, with further details in the Knowledge Base.

Some of the notable new features in 10.1 include:

Galera clustering is now built in instead of being a separate server version, and can be activated with a simple configuration change.
Traditional replication was also improved and is much faster in certain scenarios.
Table, tablespace and log encryption were introduced.
New security hardening features are enabled by default, and authentication was improved.
Improved support for Spatial Reference systems for GIS data.

We are also proud that the release remains backwards compatible and it is easy to upgrade to 10.1 from any previous MariaDB or MySQL release. 10.1 was also a success in terms of collaboration and included major contributions from multiple companies and developers.

MariaDB events and talks

The main event organized by the MariaDB Foundation in the year was the MariaDB Developer Meetup in Amsterdam in October, at the Booking.com offices. It was a success with over 60 attendees. In addition, there were about a dozen events in 2015 at which MariaDB Foundation staff spoke. We are planning a new MariaDB developer event in early April 2016 in Berlin. We will make a proper announcement of this as soon as we have the date and place fixed.

Staff, board and members

In 2015 the staff included:

Otto Kekäläinen, CEO
Michael “Monty” Widenius, Founder and core developer
Andrea Spåre-Strachan, personal assistant to Mr Widenius
Sergey Vojtovich, core developer
Alexander Barkov, core developer
Vicențiu Ciorbaru, developer
Ian Gilfillan, documentation writer and webmaster

Our staffing will slightly increase as Vicențiu will start working full time for the Foundation in 2016. Our developers worked a lot on performance and scalability issues, ported the best features from new MySQL releases, improved MariaDB portability for platforms like ARM, AIX, IBM s390 and Power8, and fixed security issues and other bugs. A lot of time was also invested in cleaning up the code base, as the current 2.2 million lines of code include quite a lot of legacy code. Version control and issue tracker statistics show that the Foundation staff made 528 commits, reported 373 bugs or issues and closed 424 bugs or other issues. In total there were 2,400 commits made by 91 contributors in 2015.

The Board of Directors in 2015 consisted of:

Chairman Rasmus Johansson, VP Engineering at MariaDB Corporation
Michael “Monty” Widenius, Founder and CTO of MariaDB Corporation
Jeremy Zawodny, Software Engineer at Craigslist
Sergei Golubchik, Chief Architect at MariaDB Corporation
Espen Håkonsen, CIO of Visma and Managing Director of Visma IT & Communications
Eric Herman, Principal Developer at Booking.com

MariaDB Foundation CEO Otto Kekäläinen served as the secretary of the board.

In 2015 we welcomed Booking.com, Visma and Verkkokauppa.com as new major sponsors. Acronis has just joined as a member for 2016. Please check out the full list of supporters. If you want to help the MariaDB Foundation in its mission to guarantee continuity and open collaboration, please support us with an individual or corporate sponsorship.

What will 2016 bring?

We expect steady growth in the adoption of MariaDB in 2016. There are many migrations from legacy database solutions underway, and as the world becomes increasingly digital, there are a ton of new software projects starting that use MariaDB for their SQL and NoSQL data needs. In 2016 many will upgrade to 10.1, and the quickest ones will start using MariaDB 10.2, which is scheduled to be released some time during 2016. MariaDB also has a lot of plugins and storage engines that are getting more and more attention, and we expect more buzz around them when software developers figure out new ways to manage data in fast, secure and scalable ways.

Hackers aren’t smart — people are stupid

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/02/hackers-arent-smart-people-are-stupid.html

The cliche is that hackers are geniuses. That’s not true; hackers are generally stupid.

The top three hacking problems for the last 10 years are “phishing”, “password reuse”, and “SQL injection”. These problems are extremely simple, as measured by the fact that teenagers are able to exploit them. Yet they persist because, unless someone is interested in hacking, they are unable to learn them. They ignore important details. They fail at grasping the core concept.

Phishing

Phishing happens because the hacker forges email from someone you know and trust, such as your bank. It appears nearly indistinguishable from real email that your bank might send. To be fair, good phishing attacks can fool even the experts.

But when you read advice from “experts”, it’s often phrased as “Don’t open emails from people you don’t know”. No, no, no. The problem is that emails appear to come from people you do trust. This advice demonstrates a lack of understanding of the core concept.

What’s going on here is human instinct. We naturally distrust strangers, and we teach our children to distrust strangers. Therefore, this advice is wired into our brains. Whatever advice we hear from experts, we are likely to translate it into “don’t trust strangers” anyway. We have a second instinct of giving advice. We want to tell people “just do this one thing”, wrapping up the problem in one nice package. But these instincts war with the core concept, “phishing emails appear to come from those you trust”. Thus, average users continue to open emails with reckless abandon, because the core concept never gets through.

Password reuse

Similarly, there is today’s gem from the Sydney Morning Herald. When you create accounts on major websites, they frequently require you to “choose 8 letters with upper case, number, and symbol”. Therefore, you assume this is some sort of general security advice to protect your account. It’s not, not really. Instead, it’s a technical detail related to a second layer of defense. In the unlikely event that hackers break into the website, they’ll be able to get the encrypted version of everyone’s password. They use password crackers to guess passwords at a rate of a billion per second. Easily guessed passwords will get cracked in a fraction of a second, but hard-to-guess passwords are essentially uncrackable. But it’s a detail that only matters once the website has already been hacked.

The real problem with passwords is password reuse. People use the same password for unimportant websites, like http://flyfishing.com, as they use for important sites, like http://chase.com or their email. Simple hobbyist sites are easily hacked, allowing hackers to download all the email addresses and passwords. Hackers then run tools to automate trying out that combination on sites like Amazon, Gmail, and banks, hoping for a match.

Therefore, the correct advice is “don’t reuse passwords on important accounts”, such as your business accounts and email account (remember: your email account can reset any other password). In other words, the correct advice is the very opposite of what the Sydney Morning Herald suggested.

The problem here is human nature. We see this requirement (“upper-case and number/symbol”) a lot, so we gravitate toward that. It also appeals to our sense of justice, as if people deserve to get hacked for the moral weakness of choosing simple passwords. Thus, we gravitate toward this issue. At the same time, we ignore password reuse, because it’s more subtle. Thus we get bad advice from “experts” like the Sydney Morning Herald, advising people to do the very opposite of what they should be doing. This article was passed around a lot today in the cybersec community. We all had a good laugh.

SQL injection

SQL injection is not an issue for users, but for programmers. However, it shares the same problem: it’s extremely simple, yet human nature prevents it from being solved.

Most websites are built the same way, with a web server front-end and a database back-end. The web server takes user interactions with the site and converts them into a database query. What you do with a website is data, but the database query is code. Normally, data and code are unrelated and never get mixed up. However, since the website generates code based on data, it’s easy to confuse the two.

SQL injection is when the user (the hacker) sends data to a website front-end that actually contains code, and that code causes the back-end to do something. That something can be to dump all the credit card numbers, or to create an account that allows the hacker to break in. In other words, SQL injection is when websites fail to understand the difference between these two sentences:

Susie said “you owe me $10”.
Susie said you owe me $10.

It’s best illustrated in the following comic:

The core concept is rather easy: don’t mix code with data, or as the comic phrases it, “sanitize your database inputs”. Yet the problem persists because programmers fail to grasp the core concept.

The reason is largely that professors fail to understand the core concept. SQL injection has been the most popular hacker attack for more than a decade, but most professors are even older than that. Thus, they continue to teach website design ignoring this problem. The textbooks they use don’t even mention it.

Conclusion

These are the three most common hacker exploits on the Internet. Teenagers interested in hacking learn how to exploit them within a few hours. Yet they continue to be unsolved because, if you aren’t interested in the issues, you fail to grasp the core concept. The core concept “phishing comes from people you trust” gets translated into “don’t trust emails from strangers”. The core concept of hackers exploiting password reuse becomes “choose strong passwords”. The core concept of mixing code with data simply gets ignored by programmers.

And the problem here isn’t just the average person unwilling or unable to grasp the core concept. Instead, confusion is aided by people who are supposed to be trustworthy, like the Sydney Morning Herald, or your college professor.

I know it’s condescending and rude to point out that “hacking happens because people are stupid”, but that’s really the problem. I don’t know how to point this out in a less rude manner. That’s why most hacking persists.

Nothing says "establishment" like Vox’s attack on Trump

Post Syndicated from Robert Graham original http://blog.erratasec.com/2016/02/vox-is-wrong-about-trump.html

I keep seeing this Ezra Klein Vox article attacking Donald Trump. It’s wrong in every way something can be wrong. Trump is an easy target, but the Vox piece has almost no substance.

Yes, it’s true that Trump proposes several unreasonable policies, such as banning Muslims from coming into this country. I’ll be the first to chime in and call Trump a racist, Nazi bastard for these things. But I’m not sure the other candidates are any better. Sure, they aren’t Nazis, but their politics are just as full of hate and impracticality. For example, Hillary wants to force Silicon Valley into censoring content, brushing aside complaints from those people overly concerned with “freedom of speech”. No candidate, not even Trump, is as radical as Bernie Sanders, who would dramatically reshape the economy. Trump hates Mexican workers inside our country; Bernie hates Mexican workers in their own countries, championing punishing trade restrictions.

Most of the substantive criticisms Vox gives Trump also apply to Bernie. For example, Vox says:

His view of the economy is entirely zero-sum — for Americans to win, others must lose. … His message isn’t so much that he’ll help you as he’ll hurt them…

That’s Bernie’s view of the economy as well. He imagines that the economy is a zero-sum game, and that for the 1% rich to prosper, they must take from the 99% of everyone else. Bernie’s entire message rests on punishing the 1% for the sin of being rich. It’s the basis of all demagoguery that you find some enemy to blame. Trump’s enemies are foreigners, whereas Bernie’s enemies are those of the wrong class. Trump is one step in the direction of the horrors of the Nazi Holocaust. Bernie is one step in the direction of the horrors of old-style Soviet and Red Chinese totalitarian states.

About Trump’s dishonesty, Vox says:

He lies so constantly and so fluently that it’s hard to know if he even realizes he’s lying.

Not true. Trump just lies badly. He’s not the standard slick politician, who lies so fluently that we don’t even realize they are lying. Whether we find a politician’s lying to be objectionable isn’t based on any principle except whether that politician is on our side.

I gave $10 to all 23 presidential candidates, and now get a constant stream of emails from the candidates pumping for more money. They all sound the same, regardless of political party, as if they all read the same book, “How To Run A Presidential Campaign”. For example, before New Year’s, they all sent essentially the same message, “Help us meet this important deadline!”, as if the end of the year is some important fund-raising deadline that must be met. It isn’t; that’s a lie, but such a fluent one that you can’t precisely identify it as a lie. If I were to judge candidate honesty based on donor e-mails, Bernie would be near the top on honesty, and Hillary would be near the bottom, with Trump unexceptionally in the middle.

Vox’s biggest problem is that their attack focuses on Trump’s style more than substance. It’s a well-known logical fallacy that serious people avoid. Style is irrelevant. Trump’s substance provides us enough fodder to attack him; we don’t need to stoop to this low level. The Vox piece is great creative fiction about how nasty Trump is, missing only the standard dig about his hair, but there are no details as to exactly why Trump’s policies are bad, such as the impractical cost of building a 2,000-mile-long wall between us and Mexico, or the necessity of suspending the 6th Amendment right to “due process” when deporting 20 million immigrants.

Vox’s complaint about Trump’s style is mostly that he doesn’t obey the mainstream media. All politicians misspeak. There’s no way to spend that many hours a day talking to the public without making the most egregious of mistakes. The mainstream media has a way of dealing with this, forcing the politician to grovel. They resent how Trump just ignores the problem and barrels on to the next thing. That the press can’t make his mistakes stick makes them very upset.

Imagine a situation where more than half the country believes in an idea, but nobody stands up and publicly acknowledges this. That’s a symptom of repressed speech. You’d think that the only suppressor of speech is the government, but that’s not true. The mainstream media is part of the establishment, and they regularly suppress speech they don’t like.

I point this out because half the country, both Democrats and Republicans, support Trump’s idea of preventing Muslims from coming into our country. Sure, it’s both logically stupid and evilly racist, but that doesn’t matter; half the country supports it. Yet nobody admits supporting the idea publicly, because as soon as they do, they’ll be punished by the mass media. Thus, the idea continues to fester, because it can’t openly be debated. People continue to believe in this bad idea because they are unpersuaded by the ad hominem that “you are such a racist”. The bedrock principle of journalism is that there are two sides to every debate. When half the country believes in a wrong idea, we have to accept that they are all probably reasonable people, and that we can change their minds if we honestly engage them in debate.

This sounds like I’m repeating the “media bias” trope, which politicians like Trump use to deflect even fair media coverage they happen not to like. But it’s not left-wing bias that is the problem here. Instead, it’s that the media has become part of the establishment, with their own seat of power. Ezra Klein’s biggest achievement before Vox was JournoList, designed to help the established press wield their power at the top of the media hierarchy. Ezra Klein is the quintessential press insider. His post attacking Trump is just a typical example of how insiders attack outsiders who don’t conform. Yes, Trump deserves criticism, but based upon substance — not because he challenges how the press establishment has defined how politics should work in America.

Call for Participation: Internet Political Trolls Collection Project 2016

Post Syndicated from Lauren original http://lauren.vortex.com/archive/001151.html

It’s no secret that vile political trolls remain massively at large in the social media landscape during this USA 2016 presidential election season. But who are they? Who are their targets? Who do they support? What are the specific aspects of their attacks in social media comments and their other postings? I’ve begun a survey to collect some detailed data…

Local elections in Frankfurt 2016

Post Syndicated from Боян Юруков original http://feedproxy.google.com/~r/yurukov-blog/~3/JJY720BR4oY/

Just as in last year’s local elections in Bulgaria, foreigners in Germany who have lived in a given city for a certain time have the right to vote for mayor and city council. In Frankfurt the elections are a month away, and everyone eligible to vote has already received a letter encouraging them to do so. This is done for every election. As I wrote in November around the elections for the Council of Foreigners, the letter describes what the elections are, where you can vote, and includes sample ballots. In those elections, by the way, we managed to get a Bulgarian representative into the council for the first time – Daniela Spasova.
The ballots are huge this time as well. I am showing them so you can see that the ones in Bulgaria are not scary at all. The first photo is the ballot for the city council, with the names of all the candidates printed on it. The second is for something like neighborhood representatives. The two ballots are not enlarged copies – these are the actual sizes that will be filled in on election day.
You can imagine how much money is spent on printing sample ballots, information materials and postage. And for all that, turnout is no higher than elsewhere in Europe.
You will notice that the ballots have no security features or anything special. The large text printed across them indicates that the ballot is a sample and cannot be used for voting. A description of how to vote is printed on the back. Everyone can distribute a set number of preference votes, with up to three allowed for a single candidate. This lets you express your preferences better than voting for just one candidate preferentially, or for one party.
Here you can see the information they provide about the elections and the application for voting by mail. With it you receive a ballot by post, which you fill in and send back. Again, the safeguards are minimal, and practically anyone could reach into your mailbox. The bigger problem, however, is that quite a few people do not receive their ballots on time, or when they send them back, the ballots do not reach the election commissions within the specified deadline. Thousands of votes went uncounted in the Bundestag elections this way. There was no scandal about it, though. Many Germans do not even know that such a problem exists.
Another interesting thing is how the results are published after the vote. If you have read this blog, you probably know that our own Central Election Commission (CIK) publishes the data as a ZIP file containing something like CSV files. Although they are not very convenient, they are still a big step forward. After some processing, they allow analysis in tools like the one I created two years ago.
The Frankfurt municipality publishes the data in its own open data portal. We have one at the national level in Bulgaria, but uploading the CIK’s data there is still delayed. They have also published a rather nice tool for exploring the data by district and neighborhood. You will also notice that the election report differs quite a bit from the CIK’s.
By the way, I threw the letter, together with the sample ballots, into the recycling container for paper. (Germans are obsessive about this and separate their rubbish into 7 or 8 categories. If they find a letter with your name or address in any container other than the one for paper, you get a notice from the municipality.) Otherwise, I have no intention of voting in the local elections in Frankfurt – I just wanted to tell you what the process is like, and that the whole debate about how complicated and large the ballots in Bulgaria are is just a smoke bomb to divert attention. Just like Kostinbrod, the crashing of the CIK and Interior Ministry websites, and so on.


Automatically inferring file syntax with afl-analyze

Post Syndicated from Michal Zalewski original http://lcamtuf.blogspot.com/2016/02/say-hello-to-afl-analyze.html

The nice thing about the control flow instrumentation used by American Fuzzy Lop is that it allows you to do much more than just, well, fuzzing stuff. For example, the suite has long shipped with a standalone tool called afl-tmin, capable of automatically shrinking test cases while still making sure that they exercise the same functionality in the targeted binary (or that they trigger the same crash). A related tool, afl-cmin, employs a similar trick to eliminate redundant files in any large testing corpus.

The latest release of AFL features another nifty new addition along these lines: afl-analyze. The tool takes an input file, sequentially flips bytes in this data stream, and then observes the behavior of the targeted binary after every flip. From this information, it can infer several things (a simplified sketch of the idea follows the list below):

No-op blocks that do not elicit any changes to control flow (say, comments, pixel data, etc.).

Checksums, magic values, and other short, atomically compared tokens where any bit flip causes the same change to program execution.

Longer blobs exhibiting this property – almost certainly corresponding to checksummed or encrypted data.

“Pure” data sections, where analyzer-injected changes consistently elicit differing changes to control flow.
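
To make that classification concrete, the following is a deliberately simplified Python sketch of the underlying idea. It is my own illustration, not how afl-analyze works internally: the real tool classifies flips using AFL’s coverage instrumentation and separates several categories, while this toy version only compares exit status and output against a baseline run and labels each byte as either no-op or sensitive.

    import subprocess
    import sys

    def run_target(cmd, data):
        """Run the target with 'data' on stdin and return a coarse behavior fingerprint."""
        proc = subprocess.run(cmd, input=data, capture_output=True)  # Python 3.7+
        return (proc.returncode, proc.stdout)

    def classify_bytes(cmd, data):
        """Flip each byte of 'data' and compare the target's behavior to the baseline run."""
        baseline = run_target(cmd, data)
        labels = []
        for i in range(len(data)):
            mutated = bytearray(data)
            mutated[i] ^= 0xFF  # flip every bit in this byte
            labels.append("no-op" if run_target(cmd, bytes(mutated)) == baseline else "sensitive")
        return labels

    if __name__ == "__main__":
        # Hypothetical usage: python flip_sketch.py sample.txt cut -d ' ' -f1
        path, *target = sys.argv[1:]
        with open(path, "rb") as f:
            sample = f.read()
        for offset, label in enumerate(classify_bytes(target, sample)):
            print("offset %4d: %s" % (offset, label))

Grouping adjacent bytes by how their flips change behavior is what lets the real tool go further and tell apart magic values, checksummed blobs, and plain data.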

This gives us some remarkable and quick insights into the syntax of the file and the behavior of the underlying parser. It may sound too good to be true, but actually seems to work in practice. For a quick demo, let’s see what afl-analyze has to say about running cut -d ' ' -f1 on a text file:

We see that cut really only cares about spaces and newlines. Interestingly, it also appears that the tool always tokenizes the entire line, even if it’s just asked to return the first token. Neat, right?

Of course, the value of afl-analyze is greater for incomprehensible binary formats than for simple text utilities; perhaps even more so when dealing with black-box parsers (which can be analyzed thanks to the runtime QEMU instrumentation supported in AFL). To try out the tool’s ability to deal with binaries, let’s check out libpng:

This looks pretty damn good: we have two four-byte signatures, followed by chunk length, four-byte chunk name, chunk length, some image metadata, and then a comment section. Neat, right? All in a matter of seconds: no configuration needed and no knobs to turn.

Of course, the tool shipped just moments ago and is still very much experimental; expect some kinks. Field testing and feedback welcome!

Encrypt all the things!

Post Syndicated from Mark Henderson original http://blog.serverfault.com/2016/02/09/encrypt-all-the-things/

Let’s talk about encryption. Specifically, HTTPS encryption. If you’ve been following any of the U.S. election debates, encryption is a topic that the politicians want to talk about – but not in the way that most of us would like. And it’s not exclusive to the U.S. – the U.K. is proposing banning encrypted services, and Australia is considering something similar. If you’re really into it, you can get information about most countries’ cryptography laws.

But one thing is very clear – if your traffic is not encrypted, it’s almost certainly being watched and monitored by someone in a government somewhere, which is the well-publicised reason behind governments opposing widespread encryption. The NSA’s PRISM program is the most well known, and it is also contributed to by the British and Australian intelligence agencies.

Which is why when the EFF announced their Let’s Encrypt project (in conjunction with Mozilla, Cisco, Akamai and others), we thought it sounded like a great idea.

The premise is simple:

Provide free encryption certificates
Make renewing certificates and installing them on your systems easy
Keep the certificates secure by installing them properly and following security best practices
Be transparent. Issued and revoked certificates are publicly auditable
Be open. Make a platform and a standard that anyone can use and build on.
Benefit the internet through cooperation – don’t let one body control access to the service

Let’s Encrypt explain this elegantly themselves:

The objective of Let’s Encrypt and the ACME protocol is to make it possible to set up an HTTPS server and have it automatically obtain a browser-trusted certificate, without any human intervention.

The process goes a bit like this:

Get your web server up and running, as per normal, on HTTP.
Install the appropriate Let’s Encrypt tool for your platform. Currently there is ACME protocol support for:

Apache (Let’s Encrypt)
Nginx (Let’s Encrypt — experimental)
HAProxy (janeczku)
IIS (ACMESharp)

Run the tool. It will generate a Certificate Signing Request for your domain, submit it to Let’s Encrypt, and then give you options for validating your ownership of the domain. The easiest method of validating ownership is one the tool can perform automatically: creating a file with a pre-determined, random file name that the Let’s Encrypt servers can then fetch and validate.
The tool then receives the valid certificate from the Let’s Encrypt Certificate Authority, installs it onto your systems, and configures your web server to use the certificate.
You need to renew the certificate in fewer than 90 days – so you then need to set up a scheduled task (a cron job on Linux, a scheduled task on Windows) to execute the renewal command for your platform (see your tool’s documentation for this, and the example commands after this list).
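
To make the flow above concrete, here is roughly what it looks like on a Debian-style server running Apache. The paths, flags, and cron schedule shown are illustrative assumptions, so check your client’s documentation for the exact invocation, in particular for how it expects renewals to be run.

    # Fetch the official client, then request and install a certificate for an Apache vhost
    git clone https://github.com/letsencrypt/letsencrypt
    cd letsencrypt
    ./letsencrypt-auto --apache -d example.com -d www.example.com

    # Renew well inside the 90-day window, for example weekly from cron
    # (the renewal command differs between client versions; see your tool's docs):
    # 0 3 * * 1  /opt/letsencrypt/letsencrypt-auto renew --quiet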

And that’s it. No copy/pasting your CSR into poorly built web interfaces, or waiting for the email to confirm the certificate to come through, or hand-building PEM files with certificate chains. No faxing documents to numbers in foreign countries. No panicking at the last minute because you forgot to renew your certificate. Free, unencumbered, automatically renewed, SSL certificates for life.

Who Let’s Encrypt is for

People running their own web servers.

You could be a small business running a Windows SBS server
You could be a startup offering a Software as a Service platform
You could be a local hackerspace running a forum
You could be a high school student with a website about making clocks

People with a registered, publicly accessible domain name

Let’s Encrypt requires some form of domain name validation, whether that is a file it can probe over HTTP to verify your ownership of the domain name, or a DNS record it can verify
Certificate Authorities no longer issue certificates for “made-up” internal domain names or reserved IP addresses

Who Let’s Encrypt is not for

Anyone on shared web hosting

Let’s Encrypt requires the input of the server operator. If you are not running your own web server, then this isn’t for you.

Anyone who wants to keep the existence of their certificates a secret

Every certificate issued by Let’s Encrypt is publicly auditable, which means that if you don’t want anyone to know that you have a server on a given domain, then don’t use Let’s Encrypt
If you have sensitive server names (such as finance.corp.example.com), even though it’s firewalled, you might not want to use Let’s Encrypt

Anyone who needs a wildcard certificate

Let’s Encrypt does not issue wildcard certificates. They don’t need to – they offer unlimited certificates, and you can even specify multiple Subject Alternative Names on your certificate signing request
However, you may still need a wildcard if:

You have a lot of domains and can’t use SNI (I’m looking at you, Android 2.x, of which there is still a non-trivial number of users)
You have systems that require a wildcard certificate (some unified communications systems do this)

Anyone who needs a long-lived certificate

Let’s Encrypt certificates are only valid for 90 days, and must be renewed prior to then. If you need a long-lived certificate, then Let’s Encrypt is not for you

Anyone who wants Extended Validation

Let’s Encrypt only validates that you have control over a given domain. It does not validate your identity or business or anything of that nature. As such you cannot get the green security bar that displays in the browser for places like banks or PayPal.

Anyone who needs their certificate to be trusted by really old things

If you have devices from 1997 that only trust 1997’s list of CAs, then you’re going to have a bad time
However, this is likely the least of your troubles
Let’s Encrypt is trusted by:

Android version 2.3.6 and above, released 2011-09-02
FireFox version 2.0 and above, released 2006-10-24
Internet Explorer on Windows Vista or above (For Windows XP, see this issue), released 2007-01-30
Google Chrome on Windows Vista or above (For Windows XP, see this issue), released 2008-08-02
Safari on OSX v4.0 or above (Mac OSX 10.4 or newer), released 2005-04-29
Safari on iOS v3.1 or above, released 2010-02-02

However, these are mostly edge cases, and if you’re reading this blog post, then you will know if they apply to you or not.

So let’s get out there and encrypt!

The elephant in the room

“But hang on!”, I hear the eagle-eyed reader say. “Stack Overflow is not using SSL/TLS!” you say. And you would be partly correct.

We do offer SSL on all our main sites. Go ahead, try it:

https://stackoverflow.com/
https://serverfault.com/
https://meta.stackexchange.com/

However, we have some slightly more complicated issues at hand. For details about our issues, see the great blog post by Nick Craver. It’s from 2013 and we have fixed many of the issues that we were facing back then, but there is still some way to go.

However, all our signup and login pages are delivered over HTTPS, and you can switch to HTTPS manually if you prefer – for most sites.

Let’s get started

So how do you get started? If you have a Debian-based Apache server, then grab the Let’s Encrypt tool and go!

If you’re on a different platform, then check the list of pre-built clients above, or take a look at a recent comparison of the most common *nix scripts.

Addendum: Michael Hampton pointed out to me that Fedora ships the Let’s Encrypt package as part of their distribution, and it is also in EPEL if you’re on RedHat, CentOS or another distribution that can make use of EPEL packages.

How to Configure Rate-Based Blacklisting with AWS WAF and AWS Lambda

Post Syndicated from Heitor Vital original https://blogs.aws.amazon.com/security/post/Tx1ZTM4DT0HRH0K/How-to-Configure-Rate-Based-Blacklisting-with-AWS-WAF-and-AWS-Lambda

One security challenge you may have faced is how to prevent your web servers from being flooded by unwanted requests, or scanning tools such as bots and crawlers that don’t respect the crawl-delay directive value. The main objective of this kind of distributed denial of service (DDoS) attack, commonly called an HTTP flood, is to overburden system resources and make them unavailable to your real users or customers (as shown in the following illustration). In this blog post, I will show you how to provision a solution that automatically detects unwanted traffic based on request rate, and then updates configurations of AWS WAF (a web application firewall that protects any application deployed on the Amazon CloudFront content delivery service) to block subsequent requests from those users.

As you will see throughout this post, this process is executed by an AWS Lambda function that processes CloudFront access logs in order to identify bad requesters. This function exposes execution metrics in Amazon CloudWatch so that you can monitor how many request entries were processed and the number of origins blocked. The solution also supports manually adding IP ranges that you want to block proactively, such as well-known bot networks.

The previous illustration shows an infrastructure trying to respond to all requests, an approach that exhausts the web server’s resources. The following illustration shows an infrastructure that uses the solution proposed in this blog post, which blocks requests originating from blacklisted sources.

Solution overview

One way to prevent overburdening your resources is to deploy a rate-based blacklisting solution. This allows you to set a threshold of how many requests your web application can serve. If a bot, crawler, or attacker exceeds the threshold, you can use AWS WAF to block their requests automatically. All the AWS services used for this solution are highlighted in the gray area in the following diagram. CloudFront is configured to deliver both the static and dynamic content of a website that uses Amazon S3, Amazon EC2, and Amazon RDS. Whenever CloudFront receives a request for your web application, AWS WAF inspects the request and instructs CloudFront to either block or allow the request based on the source IP address. The following diagram shows the architecture and flow of the solution.

As CloudFront receives requests on behalf of your web application, it sends access logs, which contain detailed information about the requests, to an S3 bucket.

For every new access log stored in the S3 bucket, a Lambda function is triggered.

The Lambda function analyzes which IP addresses have made more requests than the defined threshold and adds those IP addresses to an AWS WAF block list. AWS WAF blocks those IP addresses for a period of time that you define during the provisioning of the solution. After this blocking period has expired, AWS WAF allows those IP addresses to access your application again, but it continues to monitor the behavior of the traffic from those IP addresses.

The Lambda function publishes execution metrics in CloudWatch, such as the number of requests analyzed and IP addresses blocked.
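
To make the second and third steps above concrete, here is a minimal sketch (in Python with boto3) of what such a Lambda function might look like. The IP set ID, the threshold value, and the parsing details are illustrative assumptions, and the sketch counts requests per log file rather than per minute; the function deployed by the solution also handles per-minute bucketing, expiration of blocked addresses, and metric publishing.

import gzip
from collections import Counter

import boto3

# Hypothetical values for illustration; the deployed function reads these from its configuration.
REQUEST_THRESHOLD = 400                     # requests allowed per log file in this simplified sketch
AUTO_BLOCK_IP_SET_ID = "example-ip-set-id"  # ID of the AWS WAF IP set behind the Auto Block rule

s3 = boto3.client("s3")
waf = boto3.client("waf")  # the AWS WAF API used with CloudFront distributions

def handler(event, context):
    # S3 triggers this function for every new gzipped CloudFront access log.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    s3.download_file(bucket, key, "/tmp/access_log.gz")

    # CloudFront access logs are tab-separated; the client IP (c-ip) is the fifth field.
    hits = Counter()
    with gzip.open("/tmp/access_log.gz", "rt") as log:
        for line in log:
            if line.startswith("#"):
                continue
            hits[line.split("\t")[4]] += 1

    # Insert every offending IP address into the Auto Block IP set as a /32.
    updates = [
        {"Action": "INSERT",
         "IPSetDescriptor": {"Type": "IPV4", "Value": ip + "/32"}}
        for ip, count in hits.items() if count > REQUEST_THRESHOLD
    ]
    if updates:
        token = waf.get_change_token()["ChangeToken"]
        waf.update_ip_set(IPSetId=AUTO_BLOCK_IP_SET_ID, ChangeToken=token, Updates=updates)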

This is how the AWS services shown in the preceding diagram are used to make this solution work:

AWS WAF – A service that gives you control over which traffic to allow or block to your web application by defining customizable web security rules. This solution uses three rules:

Auto Block – This rule is used to add IP addresses identified as unwanted requesters—those that don’t respect the request-per-minute limit. In our solution, after creating this rule, new requests from those IP addresses are blocked until Lambda removes them from the block list after the specified expiration period (by default we use 4 hours).

Manual Block – This rule is used to add IP addresses to a block list manually. These IP addresses are permanently blocked; they can access the web application again only if you remove them from the block list.

Auto Count – This is a quarantine rule: the requests are not blocked, but you can track, in near real time, the number of requests from previously blocked IP addresses. It exists only to give you visibility into an IP address’s behavior after it is removed from the Auto Block rule.

Lambda – A service that lets you run code without provisioning or managing servers. Just upload your code and Lambda takes care of everything required to run and scale your code. You can set up your code to trigger automatically based on activity in other AWS services. In this solution, I’ve added a trigger in S3 to execute the Lambda function every time a new access log file is uploaded. This function is responsible for processing the log data to identify offending IP addresses and block them in AWS WAF.

CloudFront – A content delivery web service that integrates with other AWS services to give developers and businesses an easy way to distribute content to end users with low latency, high data-transfer speeds, and no minimum usage commitments. CloudFront usually delivers access logs within an hour. However, some log entries can be delayed. In this solution, CloudFront distributes both dynamic and static content of the web application.

S3 – A simple storage service that offers software developers a highly scalable, reliable, and low-latency data storage infrastructure at low costs. In this solution, the access log files are saved in an S3 bucket.

CloudWatch – A monitoring service for AWS cloud resources and applications that run on AWS. In this solution, I use CloudWatch to track metrics about the number of access entries processed by the Lambda function and the number of IP addresses that have been blocked.

AWS CloudFormation – A service that enables you to create and manage AWS infrastructure deployments predictably and repeatedly. With CloudFormation, you declare all of your resources and dependencies in a template file. In this solution, I’ve created a template that helps you provision the solution stack (all of the solution’s components) without worrying about creating and configuring the underlying AWS infrastructure.

This solution allows you to define the S3 bucket, the threshold of requests per minute, and the length of time to keep IP addresses in the block list.
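
As an illustration of how the execution metrics mentioned above might be published, here is a minimal boto3 sketch; the namespace and metric names are placeholders, and the names used by the deployed Lambda function may differ.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder namespace and metric names mirroring the metrics described above.
cloudwatch.put_metric_data(
    Namespace="WAFRateBasedBlacklist",
    MetricData=[
        {"MetricName": "RequestsAnalyzed", "Value": 12345, "Unit": "Count"},
        {"MetricName": "IPsBlocked", "Value": 3, "Unit": "Count"},
    ],
)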

Deployment—Using the AWS Management Console

This solution assumes that you already have a CloudFront distribution used to deliver content for your web application. If you do not yet have a CloudFront distribution, follow the instructions on Creating or Updating a Web Distribution Using the CloudFront Console. This solution also uses CloudFormation to simplify the provisioning process. See the CloudFormation User Guide for more information about how the service works.

Step 1: Launch the solution using a CloudFormation template

Go to the CloudFormation console.

Change the region as needed. When you use CloudFormation, all resources are provisioned in the region where you create the stack. Because this solution uses Lambda, see the AWS Global Infrastructure Region Table to check in which AWS regions Lambda is available.

Click the Create New Stack button.

On the Select Template page, upload the waf_template.json found in this GitHub repository.

On the Specify Details page (as shown in the following image):

For Stack name, type the name of your stack.

For Create CloudFront Access Log Bucket, select yes to create a new S3 bucket for CloudFront Access Logs, or select no if you already have an S3 bucket for CloudFront access logs.

For CloudFront Access Log Bucket Name, type the name of the S3 bucket where CloudFront will put access logs. Leave this field empty if you selected no for Create CloudFront Access Log Bucket.

For Request Threshold, type the maximum number of requests that can be made per minute without being blocked.

For WAF Block Period, specify how long (in seconds) IP addresses should be blocked after passing the threshold.

For WAF Quarantine Period, specify how long AWS WAF should monitor IP addresses after AWS WAF has stopped blocking them.


 

On the Options page, click Next.

On the Review page, select the I acknowledge that this template might cause AWS CloudFormation to create IAM resources check box, and then click Create.

This template creates all the components necessary to run the solution: a Lambda function, an AWS WAF Web ACL (named Malicious Requesters) with all necessary rules configured, a CloudWatch custom metric, and, if you selected yes for Create CloudFront Access Log Bucket, an S3 bucket with the name you specified in the CloudFront Access Log Bucket Name parameter.

Step 2: Update CloudFront distribution settings

Update the CloudFront distribution to activate AWS WAF and logging by using the resources generated in the previous step (if you already have an S3 bucket for CloudFront access logs, skip this logging configuration step):

Open the CloudFront console. In the top pane of the console, select the distribution that you want to update.

In the Distribution Settings pane, click the General tab, and then click Edit.

Update the AWS WAF Web ACL settings (as shown in the following image). This option has a drop-down list with all active AWS WAF Web ACLs. Select the Web ACL you created in Step 1 (Malicious Requesters).

For Logging, select On.

For Bucket for Logs, select the bucket that you specified in Step 1.


 

Save your changes.

If you already have an S3 bucket for CloudFront access logs (if you selected no for Create CloudFront Access Log Bucket), enable S3 event notification to trigger the Lambda function when a new log file is added to your CloudFront access log bucket (see more details). To do that, open the S3 console and edit the bucket properties highlighted in the following image.
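
If you prefer to script this step instead of using the console, the same notification can be configured with the AWS SDK. The sketch below uses placeholder bucket and function names; note that the Lambda function must also grant S3 permission to invoke it.

import boto3

s3 = boto3.client("s3")

# Placeholder names; use your existing log bucket and the Lambda function created by the stack.
LOG_BUCKET = "my-cloudfront-access-logs"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:waf-rate-blacklist"

s3.put_bucket_notification_configuration(
    Bucket=LOG_BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "Id": "TriggerRateBlacklistOnNewLog",
            "LambdaFunctionArn": LAMBDA_ARN,
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)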

Step 3: [Optional] Edit CloudFormation parameter values

If you want to change the solution parameters after creating the CloudFormation stack in Step 1 (for example, if you want to change the threshold value or how long IPs are blocked), you don’t need to create a new stack—just update the existing one following these steps:

In the CloudFormation console, from the list of stacks, select the running stack that you want to update.

Click Actions and then Update Stack.


 

On the Select Template page, select Use the current template, and then click Next.

On the Specify Details page, change the values of Rate-Based Blacklisting Parameters (as shown in the following image):

For Request Threshold, type the new maximum number of requests that can be made per minute without being blocked.

For WAF Block Period, specify the new value of how long (in seconds) the IP address should be blocked after passing the threshold.

For WAF Quarantine Period, specify the new value of how long AWS WAF should monitor the IP address after AWS WAF has stopped blocking it.

On the Options page, click Next.

On the Review page, select the I acknowledge that this template might cause AWS CloudFormation to create IAM resources check box, and then click Update.

CloudFormation will update the stack to reflect the new values of the parameters.
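
You can also apply the same parameter update programmatically. The following boto3 sketch reuses the current template and changes only the rate-based blacklisting parameters; the stack name and new values are placeholders, and the parameter keys are the same ones used in the CLI example in the next section.

import boto3

cloudformation = boto3.client("cloudformation")

# Placeholder stack name and values; unchanged parameters keep their previous values.
cloudformation.update_stack(
    StackName="waf-rate-blacklist",
    UsePreviousTemplate=True,
    Capabilities=["CAPABILITY_IAM"],
    Parameters=[
        {"ParameterKey": "CloudFrontCreateAccessLogBucket", "UsePreviousValue": True},
        {"ParameterKey": "CloudFrontAccessLogBucket", "UsePreviousValue": True},
        {"ParameterKey": "RequestThreshold", "ParameterValue": "600"},
        {"ParameterKey": "WAFBlockPeriod", "ParameterValue": "7200"},
        {"ParameterKey": "WAFQuarantinePeriod", "ParameterValue": "14400"},
    ],
)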

How to create the stack by using the AWS CLI

Alternatively, you can create the stack by using the AWS CLI. The following command shows one way to create the stack (remember to change the placeholder parameter values to match your environment).

aws cloudformation create-stack --stack-name <STACK_NAME> --template-body file:///<PATH_TO>/waf_template.json --capabilities CAPABILITY_IAM --parameters ParameterKey=CloudFrontCreateAccessLogBucket,ParameterValue=yes ParameterKey=CloudFrontAccessLogBucket,ParameterValue=<LOG_BUCKET_NAME> ParameterKey=RequestThreshold,ParameterValue=400 ParameterKey=WAFBlockPeriod,ParameterValue=1400 ParameterKey=WAFQuarantinePeriod,ParameterValue=14400

You will find the waf_template.json in this GitHub repository.

Testing

To test the solution offered in this blog post, wait until CloudFront generates a new access log file. Alternatively, you can simulate this process by uploading this sample access log file into the S3 bucket that you designated to receive log files. After completing the upload, check to see if the IP addresses were populated automatically in the AWS WAF Auto Block Set section, and if the CloudWatch metrics were updated. Remember that Lambda can take a few seconds to process the log file, and CloudWatch can take up to two minutes before displaying new metrics. The following image shows how the Auto Block Set section appears after Lambda processes the sample access log file.
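
For example, you could upload the sample log file with a short boto3 script; the file name, bucket, and key below are placeholders.

import boto3

s3 = boto3.client("s3")

# Uploading a sample gzipped access log simulates CloudFront delivering a new log file
# and triggers the Lambda function through the S3 event notification configured earlier.
s3.upload_file("sample-access-log.gz", "my-cloudfront-access-logs", "test/sample-access-log.gz")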

To demonstrate how CloudWatch displays the metrics, see the CloudWatch dashboard, as shown in the following image.

Summary

By following this blog post, you provisioned a solution that automatically blocks IP addresses that exceed a specified request-rate threshold. You can modify the Lambda script I’ve provided to change how you block unwanted requests. For example, you can analyze other data generated by CloudFront (such as the sc-bytes field of the access log file) and block requests based on that.

If you have any questions or comments, leave them in the “Comments” section below or on the AWS WAF forum.

– Heitor

The barking pets of Plovdiv

Post Syndicated from Боян Юруков original http://feedproxy.google.com/~r/yurukov-blog/~3/Ju1oLe5kYOM/

When I talk about access to public information, most people think of big public procurements, corruption, rigged court cases, State Security agents, and so on. We associate access to government data with secrets that are being hidden from us, and we hope that bringing them to light will help us understand and improve the situation. Whether and how that happens is a long and difficult topic.
Here, however, I want to show that all of the above is only one aspect of open data. There is a lot of information, both in central and local government and in private companies, that can be useful in our everyday lives or simply reveal interesting facts about them. One example is the demographic data I have been writing about a lot lately. Another example is pets.
As you probably know, all dogs must be registered with the municipality. Plovdiv has a public register in which you can find quite a few details about every dog, including its name, address, sex, whether it is neutered, whether it has a chip, and so on. The data in the register is not available for download, but with a little knowledge of javascript it can be extracted easily. At the end of the article I have posted a link to the code and to the process of opening up the data.
Which dogs live where in the city?
Once we put all the addresses and categories on a map, we can show quite a lot. This map, for example, illustrates how the register has grown over time.
An interesting observation is that every year there is a big spike in registrations in March. While roughly 300 dogs per month have been registered on average during the other months in recent years, 820 dogs have been registered in March. December has the fewest, with 150. I don’t know the reason for this, but quite a few puppies are probably given as presents around Christmas and New Year, and by March they have reached the registration age of 4 months.

In the same way, we can show the distribution by sex. At first glance you notice that the Proslav district has considerably more male dogs. In fact, across the whole city 26% more males than females are registered. They appear to be the preferred choice. You can also see that there are rather few dogs in Trakia, which is odd given the higher concentration of people there.

Another interesting point is how many people declare that their dog was adopted from a shelter. In Plovdiv there are 111 such dogs, or only 2.64% of all registered dogs. According to the former director of the Zooveterinary complex, interest in adoption is growing. A year ago he noted that more than 210 dogs were adopted in 2013 and 2014. This means that either many of them are not being registered, or they end up in other municipalities. Most likely it is both.
If you would like to adopt or help a dog, have a look at the page of the „Шарко“ foundation, this list by Събина, or become a volunteer with the initiative on TimeHeroes.
Here is the distribution of adopted dogs across the city:

According to the amendments to the law adopted a few days ago, all registered dogs must have a chip. In Plovdiv, 73% of dogs already have one.

By contrast, only 668 dogs, or 16%, are neutered. More than twice as many females as males have been neutered: 24% of all females versus just 9.3% of males. For comparison, over 80% of all dogs that end up in shelters in the US had already been neutered by their previous owners.

The register contains other categories too. 6% of all dogs are registered as hunting dogs. Male hunting dogs outnumber females by 25%, but that is in line with the overall predominance of registered males. 16.5%, or almost 700 dogs, are registered as assistance dogs for people with disabilities, which is far more than I expected. There are also 10 service dogs.

Among the names, the most common are Сава, Макс, Рекс, Ричи, Чарли, Тара and Рони. But there are also names like „Bright Light of Moscow Victory Day“, „black pearl“, „Limited Edition Anelia“, „Император бул Дио“, „Принцеса Шуши“ and probably my favourite, „Gaprillis Glenmorangie“.
Opening up the data
Putting the addresses on the map was easier than I expected. I used a standard script that I had written for my other projects. You will find all the scripts and steps on Github. Some of the addresses and dates in the register were obviously wrong, so I had to drop a few of them for the maps. Overall, though, downloading and cleaning the data took about 45 minutes.
One thing about this register that raises concern is precisely the exact address. Most entries include a floor and apartment number. As I mentioned above, there are about 250 hunting dogs registered in the city. It is a reasonable guess that those apartments also contain firearms. That is probably not information one would want to be publicly available. So it is worth reconsidering which parts of the register should remain public and which should not. As with the case of the website of the Center for Assisted Reproduction, this could create problems for individual people.
Is any of this useful?
Probably not. You will not find a scheme for siphoning off money here, or a juicy detail from our political life. In fact, the data does not even cover all the dogs in the city, only the registered ones. I could not find estimates of how many unregistered dogs there are. It is also not clear whether the data is simply filled in by the owners or whether there is some form of verification of its accuracy (whether the dog actually has a chip, or really belongs to a person with a disability, for example). More than one or two errors can be spotted in it.
Even so, the data shows an interesting side of the city that could be useful to pet-related businesses. Geographic analysis of this data could reveal a shortage of parks for walking dogs, or help in planning better bins with disposable bags for dog waste. Such bins exist in many parks and streets across Europe and help a great deal with cleanliness.
This overview also shows how few of the dogs have been adopted from shelters. While preparing this article I looked for a website or a list of links where one could find a shelter when looking to adopt a dog. Unfortunately, there is no such thing. I gave a few useful links above. If you know of more, share them in the comments. I will be happy to list them below the article.


Submitting User Applications with spark-submit

Post Syndicated from Francisco Oliveira original https://blogs.aws.amazon.com/bigdata/post/Tx578UTQUV7LRP/Submitting-User-Applications-with-spark-submit

Francisco Oliveira is a consultant with AWS Professional Services

Customers starting their big data journey often ask for guidelines on how to submit user applications to Spark running on Amazon EMR. For example, customers ask for guidelines on how to size memory and compute resources available to their applications and the best resource allocation model for their use case.

In this post, I show how to set spark-submit flags to control the memory and compute resources available to your application submitted to Spark running on EMR. I discuss when to use the maximizeResourceAllocation configuration option and dynamic allocation of executors.

Spark execution model

At a high level, each application has a driver program that distributes work in the form of tasks among executors running on several nodes of the cluster.

The driver is the application code that defines the transformations and actions applied to the data set. At its core, the driver instantiates an object of the SparkContext class. This object allows the driver to acquire a connection to the cluster, request resources, split the application actions into tasks, and schedule and launch tasks in the executors.

The executors not only perform tasks sent by the driver but also store data locally. As the executors are created and destroyed (see the “Enabling dynamic allocation of executors” section later), they register and deregister with the driver. The driver and the executors communicate directly.

To execute your application, the driver organizes the work to be accomplished in jobs. Each job is split into stages and each stage consists of a set of independent tasks that run in parallel. A task is the smallest unit of work in Spark and executes the same code, each on a different partition.

Spark programming model

An important abstraction in Spark is the resilient distributed dataset (RDD). This abstraction is key to performing in-memory computations. An RDD is an immutable, read-only collection of data partitions distributed across the nodes of the cluster. Partitions in Spark allow the parallel execution of subsets of the data. Spark applications create RDDs and apply operations to them. Although Spark partitions RDDs automatically, you can also set the number of partitions.

RDDs support two types of operations: transformation and actions. Transformations are operations that generate a new RDD, and actions are operations that write data to external storage or return a value to the driver after running a transformation on the dataset. Common transformations include operations that filter, sort and group by key. Common actions include operations that collect the results of tasks and ship them to the driver, save an RDD, or count the number of elements in a RDD.
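
For example, the following minimal PySpark snippet applies a transformation (filter), which only defines a new RDD, and then actions (count and take), which actually trigger execution on the executors:

from pyspark import SparkContext

sc = SparkContext(appName="RDDBasics")

# A transformation only defines a new RDD; nothing is computed yet.
numbers = sc.parallelize(range(100), numSlices=4)   # an RDD with 4 partitions
evens = numbers.filter(lambda n: n % 2 == 0)

# Actions trigger the actual computation on the executors.
print(evens.count())   # 50
print(evens.take(5))   # [0, 2, 4, 6, 8]

sc.stop()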

spark-submit

A common way to launch applications on your cluster is by using the spark-submit script. This script offers several flags that allow you to control the resources used by your application.

Setting the spark-submit flags is one of the ways to dynamically supply configurations to the SparkContext object that is instantiated in the driver. spark-submit can also read configuration values set in the conf/spark-defaults.conf file (which you can set through EMR configuration options when creating your cluster) or values hardcoded in the application, although the latter is not recommended. An alternative to changing conf/spark-defaults.conf is to use the --conf prop=value flag. In the sections that follow, I present both the spark-submit flag and the property name to use in the spark-defaults.conf file or with the --conf flag.

Spark applications running on EMR

Any application submitted to Spark running on EMR runs on YARN, and each Spark executor runs as a YARN container. When running on YARN, the driver can run in one YARN container in the cluster (cluster mode) or locally within the spark-submit process (client mode).

When running in cluster mode, the driver runs on ApplicationMaster, the component that submits YARN container requests to the YARN ResourceManager according to the resources needed by the application. A simplified and high-level diagram of the application submission process is shown below.

When running in client mode, the driver runs outside ApplicationMaster, in the spark-submit script process from the machine used to submit the application.

Setting the location of the driver

With spark-submit, the --deploy-mode flag can be used to select the location of the driver.

Submitting applications in client mode is advantageous when you are debugging and wish to quickly see the output of your application. For applications in production, the best practice is to run the application in cluster mode. This mode offers you a guarantee that the driver is always available during application execution. However, if you do use client mode and you submit applications from outside your EMR cluster (such as locally, on a laptop), keep in mind that the driver is running outside your EMR cluster and there will be higher latency for driver-executor communication.

Setting the driver resources

The size of the driver depends on the calculations the driver performs and on the amount of data it collects from the executors. When running the driver in cluster mode, spark-submit provides you with the option to control the number of cores (--driver-cores) and the memory (--driver-memory) used by the driver. In client mode, the defaults are 1024 MB of memory and one core for the driver.

Setting the number of cores and the number of executors

The number of executor cores (--executor-cores or spark.executor.cores) selected defines the number of tasks that each executor can execute in parallel. The best practice is to leave one core for the OS and to use about 4-5 cores per executor. The number of cores requested is constrained by the configuration property yarn.nodemanager.resource.cpu-vcores, which controls the number of cores available to all YARN containers running on one node and is set in the yarn-site.xml file.

The number of executors per node can be calculated using the following formula:

number of executors per node = (number of cores on node - 1 for the OS) / (number of tasks per executor)

The total number of executors (--num-executors or spark.executor.instances) for a Spark job is:

total number of executors = (number of executors per node * number of instances) - 1

One executor slot is left for the ApplicationMaster, which hosts the driver when the application runs in cluster mode.

Setting the memory of each executor

The memory space of each executor container is subdivided into two major areas: the Spark executor memory and the memory overhead.

Note that the maximum memory that can be allocated to an executor container depends on the yarn.nodemanager.resource.memory-mb property set in yarn-site.xml. The executor memory (--executor-memory or spark.executor.memory) defines the amount of memory each executor process can use. The memory overhead (spark.yarn.executor.memoryOverhead) is off-heap memory and is automatically added to the executor memory. Its default value is executorMemory * 0.10.

Executor memory unifies sections of the heap for storage and execution purposes. These two subareas can borrow space from one another when one of them needs more than its allotted share. The relevant properties are spark.memory.fraction and spark.memory.storageFraction. For more information, see the Unified Memory Management in Spark 1.6 whitepaper.

The memory of each executor can be calculated using the following formula:

 memory of each executor = max container size on node / number of executors per node

A quick example

To show how you can set the flags I have covered so far, I submit the wordcount example application and then use the Spark history server for a graphical view of the execution.

First, I submit a modified word count sample application as an EMR step to my existing cluster. The code can be seen below:

from __future__ import print_function

import sys

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        # Expect an input location and an output location as arguments.
        print("Usage: wordcount <input> <output>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="WordCount")
    text_file = sc.textFile(sys.argv[1])
    counts = (text_file.flatMap(lambda line: line.split(" "))   # split lines into words
                       .map(lambda word: (word, 1))             # pair each word with a count of 1
                       .reduceByKey(lambda a, b: a + b))        # sum the counts per word
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()

The cluster has six m3.2xlarge instances plus one instance for the master, each with 8 vCPU and 30 GB of memory. The default value of yarn.nodemanager.resource.memory-mb for this instance type is 23 GB.

According to the formulas above, the spark-submit command would be as follows:

spark-submit --deploy-mode cluster --master yarn --num-executors 5 --executor-cores 5 --executor-memory 20g --conf spark.yarn.submit.waitAppCompletion=false wordcount.py s3://inputbucket/input.txt s3://outputbucket/
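
These values follow from the formulas above; here is a quick sketch of the arithmetic, as one way to read the example using the 23 GB container limit mentioned earlier:

# 8 vCPU per node, 5 cores per executor, 6 core nodes, 23 GB usable per node
executors_per_node = (8 - 1) // 5               # = 1 executor per node
total_executors = executors_per_node * 6 - 1    # = 5 (one slot left for the ApplicationMaster)
max_executor_memory = 23 // executors_per_node  # = 23 GB per executor container
# Requesting 20g leaves room for the ~10% memoryOverhead (about 2 GB) within the 23 GB limit.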

I submit the application as an EMR step with the following command:

aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE

Note that I am also setting the property spark.yarn.submit.waitAppCompletion with the step definitions. When this property is set to false, the client submits the application and exits, not waiting for the application to complete. This setting allows you to submit multiple applications to be executed simultaneously by the cluster and is only available in cluster mode.

I use the default values for --driver-memory and --driver-cores, as the sample application is writing directly to Amazon S3 and the driver is not receiving any data from the executors.

Enabling dynamic allocation of executors

Spark on YARN has the ability to dynamically scale up and down the number of executors. This feature can be valuable when you have multiple applications being processed simultaneously as idle executors are released and an application can request additional executors on demand.

To enable this feature, please see the steps in the EMR documentation.

Spark provides granular control to the dynamic allocation mechanism by providing the following properties:

Initial number of executors (spark.dynamicAllocation.initialExecutors)

Minimum number of executors to be used by the application (spark.dynamicAllocation.minExecutors)

Maximum executors that can be requested (spark.dynamicAllocation.maxExecutors)

When to remove an idle executor (spark.dynamicAllocation.executorIdleTimeout)

When to request new executors to process waiting tasks (spark.dynamicAllocation.schedulerBacklogTimeout and spark.dynamicAllocation.sustainedSchedulerBacklogTimeout)
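
As a rough sketch of how these properties can be supplied from application code (they can equally be set in spark-defaults.conf or with --conf flags), the snippet below uses illustrative values; the external shuffle service must also be enabled as described in the EMR documentation.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("DynamicAllocationExample")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.initialExecutors", "2")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "20")
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s"))
sc = SparkContext(conf=conf)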

Automatically configure executors with maximum resource allocation

EMR provides an option to automatically configure the properties above in order to maximize the resource usage of the entire cluster. This configuration option can be valuable when you have only a single application being processed by your cluster at a time. Its usage should be avoided when you expect to run multiple applications simultaneously.

To enable this configuration option, please see the steps in the EMR documentation.

By setting this configuration option during cluster creation, EMR automatically updates the spark-defaults.conf file with the properties that control the compute and memory resources of an executor, as follows:

spark.executor.memory = (yarn.scheduler.maximum-allocation-mb - 1g) - spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead = (yarn.scheduler.maximum-allocation-mb - 1g) * 0.10

spark.executor.instances = [this is set to the initial number of core nodes plus the number of task nodes in the cluster]

spark.executor.cores = yarn.nodemanager.resource.cpu-vcores

spark.default.parallelism = spark.executor.instances * spark.executor.cores

A graphical view of the parallelism

The Spark history server UI is accessible from the EMR console. It provides useful information about your application’s performance and behavior. You can see the list of scheduled stages and tasks, retrieve information about the executors, obtain a summary of memory usage, and retrieve the configurations submitted to the SparkContext object. For the purposes of this post, I show how the flags set in the spark-submit script used in the example above translate to the graphical tool.

To access the Spark history server, enable your SOCKS proxy and choose Spark History Server under Connections.

For Completed applications, choose the only entry available and expand the event timeline as shown below. Spark added 5 executors, as requested with the --num-executors flag.

Next, by navigating to the stage details, you can see the number of tasks running in parallel per executor. This value is the same as the value of the --executor-cores flag.

Summary

In this post, you learned how to use spark-submit flags to submit an application to a cluster. Specifically, you learned how to control where the driver runs, set the resources allocated to the driver and executors, and the number of executors. You also learned when to use the maximizeResourceAllocation configuration option and dynamic allocation of executors.

If you have questions or suggestions, please leave a comment below.

—————————-

Related:

Run an External Zeppelin Instance using S3 Backed Notebooks with Spark on Amazon EMR

Looking to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Hadoop Turns 10

Post Syndicated from yahoo original https://yahooeng.tumblr.com/post/138742476996

yahoohadoop:

by Peter Cnudde, VP of Engineering
It is hard to believe that 10 years have already passed since Hadoop was started at Yahoo. We initially applied it to web search, but since then, Hadoop has become central to everything we do at the company. Today, Hadoop is the de facto platform for processing and storing big data for thousands of companies around the world, including most of the Fortune 500. It has also given birth to a thriving industry around it, comprised of a number of companies who have built their businesses on the platform and continue to invest and innovate to expand its capabilities.
At Yahoo, Hadoop remains a cornerstone technology on which virtually every part of our business relies on to power our world-class products, and deliver user experiences that delight more than a billion users worldwide. Whether it is content personalization for increasing engagement, ad targeting and optimization for serving the right ad to the right consumer, new revenue streams from native ads and mobile search monetization, data processing pipelines, mail anti-spam or search assist and analytics – Hadoop touches them all.
When it comes to scale, Yahoo still boasts one of the largest Hadoop deployments in the world. From a footprint standpoint, we maintain over 35,000 Hadoop servers as a central hosted platform running across 16 clusters with a combined 600 petabytes in storage capacity (HDFS), allowing us to execute 34 million monthly compute jobs on the platform.
But we aren’t stopping there, and actively collaborate with the Hadoop community to further push the scalability boundaries and advance technological innovation. We have used MapReduce historically to power batch-oriented processing, but continue to invest in and adopt low latency data processing stacks on top of Hadoop, such as Storm for stream processing, and Tez and Spark for faster batch processing.
What’s more, the applications of these innovations have spanned the gamut – from cool and fun features, like Flickr’s Magic View to one of our most exciting recent projects that involves combining Apache Spark and Caffe. The project allows us to leverage GPUs to power deep learning on Hadoop clusters. This custom deployment bridges the gap between HPC (High Performance Computing) and big data, and is helping position Yahoo as a frontrunner in the next generation of computing and machine learning.
We’re delighted by the impact the platform has made to the big data movement, and can’t wait to see what the next 10 years has in store.
Cheers!

Collectd and Cassandra 2.2

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2016/02/04/collectd-and-cassandra-2.2/

Collectd is a program that you can run on your systems to gather statistics on performance, processes, and overall status of the system in question. When you send these statistics to a time series database like Graphite, you’ll need some way to access and visualize all that data – after all, if you collect all that data but don’t have any way to use or access it, it’s not going to do you a whole lot of good.

AWS FedRAMP-Trusted Internet Connection (TIC) Overlay Pilot Program

Post Syndicated from Chad Woolf original https://blogs.aws.amazon.com/security/post/Tx1WI9Q90T9J8DV/AWS-FedRAMP-Trusted-Internet-Connection-TIC-Overlay-Pilot-Program

I’m pleased to announce a newly created resource for usage of the Federal Cloud—after successfully completing the testing phase of the FedRAMP-Trusted Internet Connection (TIC) Overlay pilot program, we’ve developed Guidance for TIC Readiness on AWS. This new way of architecting cloud solutions that address TIC capabilities (in a FedRAMP moderate baseline) comes as the result of our relationships with the FedRAMP Program Management Office (PMO), Department of Homeland Security (DHS) TIC PMO, GSA 18F, and FedRAMP third-party assessment organization (3PAO), Veris Group. Ultimately, this approach will provide US Government agencies and contractors with information assisting in the development of “TIC Ready” architectures on AWS.  

Background on TIC

In November 2007, the Office of Management & Budget (OMB) mandated that government users could only access their cloud provider through an agency connection, either a TIC Access Provider (TICAP) or Managed Trusted Internet Protocol Service (MTIPS). These agency connections can be slow and cause additional constraints on a government network or infrastructure. In today’s “anytime, anywhere” world, it’s important for government users to access their cloud-based data from any device with speed and agility.

In May 2015, the FedRAMP PMO and DHS TIC PMO invited AWS to participate in the FedRAMP-TIC Overlay pilot program to develop an approach that balances the need for speed and security, while also removing the frustrations and headaches caused by slow connectivity and suboptimal network routing. The goal of the pilot was to help develop and test a new way of architecting access to cloud-based services with TIC capabilities that would maintain a high level of security—mapped to FedRAMP security controls—and still provide a friendly and accessible government user experience. The following figure shows the current state, with cloud services accessible through an agency TIC, and the proposed future state, with mobile user access directly to cloud services.

AWS pilot results

The pilot was conducted in collaboration with DHS and FedRAMP. As an initial analysis, we leveraged a TIC-capabilities-to-FedRAMP-Moderate controls mapping table provided for the pilot. Our 3PAO determined that 80% of the TIC capabilities were covered within AWS’s existing FedRAMP Agency Authority to Operate. During the course of the pilot, in collaboration with DHS and FedRAMP, 17 of the TIC capabilities were removed from the pilot as either not relevant—and therefore excluded—or not appropriate to a cloud service provider (CSP)—and therefore deferred to the agency. Of the remaining 57 TIC candidate capabilities, we determined that responsibilities would be allocated as follows:

Shared between AWS and the customer (36).

Solely the responsibility of the customer (16).

Solely the responsibility of AWS (5). 

Through the pilot activities, we worked with GSA 18F and our 3PAO to identify and demonstrate implementation of the required capabilities through a combination of native AWS services and the use of technologies available from the AWS Marketplace.

Take advantage of TIC connectivity on AWS today

Our government customers interested in following GSA 18F’s lead now have the capability to deploy and test their own TIC capabilities on AWS. While the FedRAMP-TIC Overlay is being finalized, customers can use the evidence resulting from our TIC Mobile assessment to implement the TIC capabilities as part of their virtual perimeter protection solution using functionality provided by AWS and our ecosystem partners. With a clear definition of the customer responsibility for implementation of the additional TIC capabilities, our government customers can architect for TIC readiness on AWS. 

Take a look at our TIC readiness whitepaper, which provides an overview of the FedRAMP-TIC Overlay pilot and its goals, guidance about how customers can implement TIC, and appendices that provide detailed mappings of customer responsibility for the TIC capabilities.

If you’d like to learn more about AWS’ FedRAMP program, please visit our FAQ page, or, for general compliance information, please see our Cloud Compliance page.

– Chad
