Tag Archives: SSL

Announcement: IPS code

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/08/announcement-ips-code.html

So after 20 years, IBM is killing off my BlackICE code created in April 1998. So it’s time that I rewrite it.

BlackICE was the first “inline” intrusion-detection system, aka. an “intrusion prevention system” or IPS. ISS purchased my company in 2001 and replaced their RealSecure engine with it, and later renamed it Proventia. Then IBM purchased ISS in 2006. Now, they are formally canceling the project and moving customers onto Cisco’s products, which are based on Snort.

So now is a good time to write a replacement. The reason is that BlackICE worked fundamentally differently than Snort, using protocol analysis rather than pattern-matching. In this way, it worked more like Bro than Snort. The biggest benefit of protocol-analysis is speed, making it many times faster than Snort. The second benefit is better detection ability, as I describe in this post on Heartbleed.

So my plan is to create a new project. I’ll be checking the starter bits into GitHub a couple of weeks from now. I need to figure out a new name for the project, so I don’t have to rip off a name from William Gibson like I did last time :).

Some notes:

  • Yes, it’ll be GNU open source. I’m a capitalist, so I’ll earn money the way Snort/Nmap do, dual-licensing it and charging companies that don’t want to open-source their addons. All capitalists GNU license their code.
  • C, not Rust. Sorry, I’m going for extreme scalability. We’ll re-visit this decision later when looking at building protocol parsers.
  • It’ll be 95% compatible with Snort signatures. Their language definition leaves so much ambiguous it’ll be hard to be 100% compatible.
  • It’ll support Snort output as well, though really, Snort’s events suck.
  • Protocol parsers in Lua, so you can use it as a replacement for Bro, writing parsers to extract data you are interested in.
  • Protocol state machine parsers in C, like you see in my Masscan project for X.509 (a minimal sketch of the idea follows this list).
  • First version IDS only. These days, “inline” means also being able to MitM the SSL stack, so I’m going to have to think harder on that.
  • Multi-core worker threads off PF_RING/DPDK/netmap receive queues. Should handle 10 Gbps, tracking 10 million concurrent connections, with a quad-core CPU.
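To illustrate what a protocol state machine parser looks like (the style the Masscan bullet above refers to), here’s a minimal sketch in C. The names and states are hypothetical, not from any existing project: the parser consumes one byte at a time and keeps its state in a small struct, so it can stop at any packet boundary and resume with the next packet, without buffering or reassembling the stream.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch of a byte-at-a-time state-machine parser. */
enum method_state { S_START, S_G, S_GE, S_MATCHED, S_FAILED };

struct method_parser {
    enum method_state state;
};

/* Feed one TCP segment's payload; returns 1 once "GET" has been seen. */
int parse_method(struct method_parser *p, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        switch (p->state) {
        case S_START: p->state = (buf[i] == 'G') ? S_G       : S_FAILED; break;
        case S_G:     p->state = (buf[i] == 'E') ? S_GE      : S_FAILED; break;
        case S_GE:    p->state = (buf[i] == 'T') ? S_MATCHED : S_FAILED; break;
        case S_MATCHED:
        case S_FAILED:
            return p->state == S_MATCHED;
        }
    }
    return p->state == S_MATCHED;
}

int main(void)
{
    struct method_parser p = { S_START };
    parse_method(&p, (const unsigned char *)"GE", 2);  /* first packet  */
    parse_method(&p, (const unsigned char *)"T ", 2);  /* second packet */
    printf("matched: %d\n", p.state == S_MATCHED);
    return 0;
}

Because the state is explicit, a million concurrent connections cost only a million small structs, which is part of why this approach scales better than buffering streams for pattern-matching.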
So if you want to contribute to the project, here’s what I need:
  • Requirements from people who work daily with IDS/IPS today. I need you to write up what your products do well that you really like. I need you to write up what they suck at that needs to be fixed. These need to be in some detail.
  • Testing environment to play with. This means having a small server plugged into a real-world link running at a minimum of several gigabits-per-second available for the next year. I’ll sign NDAs related to the data I might see on the network.
  • Coders. I’ll be doing the basic architecture, but protocol parsers, output plugins, etc. will need work. Code will be in C and Lua for the near term. Unfortunately, since I’m going to dual-license, I’ll need waivers before accepting pull requests.
Anyway, follow me on Twitter @erratarob if you want to contribute.

AWS CloudHSM Update – Cost Effective Hardware Key Management at Cloud Scale for Sensitive & Regulated Workloads

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-cloudhsm-update-cost-effective-hardware-key-management/

Our customers run an incredible variety of mission-critical workloads on AWS, many of which process and store sensitive data. As detailed in our Overview of Security Processes document, AWS customers have access to an ever-growing set of options for encrypting and protecting this data. For example, Amazon Relational Database Service (RDS) supports encryption of data at rest and in transit, with options tailored for each supported database engine (MySQL, SQL Server, Oracle, MariaDB, PostgreSQL, and Aurora).

Many customers use AWS Key Management Service (KMS) to centralize their key management, with others taking advantage of the hardware-based key management, encryption, and decryption provided by AWS CloudHSM to meet stringent security and compliance requirements for their most sensitive data and regulated workloads (you can read my post, AWS CloudHSM – Secure Key Storage and Cryptographic Operations, to learn more about Hardware Security Modules, also known as HSMs).

Major CloudHSM Update
Today, building on what we have learned from our first-generation product, we are making a major update to CloudHSM, with a set of improvements designed to make the benefits of hardware-based key management available to a much wider audience while reducing the need for specialized operating expertise. Here’s a summary of the improvements:

Pay As You Go – CloudHSM is now offered under a pay-as-you-go model that is simpler and more cost-effective, with no up-front fees.

Fully Managed – CloudHSM is now a scalable managed service; provisioning, patching, high availability, and backups are all built-in and taken care of for you. Scheduled backups extract an encrypted image of your HSM from the hardware (using keys that only the HSM hardware itself knows) that can be restored only to identical HSM hardware owned by AWS. For durability, those backups are stored in Amazon Simple Storage Service (S3), and for an additional layer of security, encrypted again with server-side S3 encryption using an AWS KMS master key.

Open & Compatible  – CloudHSM is open and standards-compliant, with support for multiple APIs, programming languages, and cryptography extensions such as PKCS #11, Java Cryptography Extension (JCE), and Microsoft CryptoNG (CNG). The open nature of CloudHSM gives you more control and simplifies the process of moving keys (in encrypted form) from one CloudHSM to another, and also allows migration to and from other commercially available HSMs.

More Secure – CloudHSM Classic (the original model) supports the generation and use of keys that comply with FIPS 140-2 Level 2. We’re stepping that up a notch today with support for FIPS 140-2 Level 3, with security mechanisms that are designed to detect and respond to physical attempts to access or modify the HSM. Your keys are protected with exclusive, single-tenant access to tamper-resistant HSMs that appear within your Virtual Private Clouds (VPCs). CloudHSM supports quorum authentication for critical administrative and key management functions. This feature allows you to define a list of N possible identities that can access the functions, and then require at least M of them to authorize the action. It also supports multi-factor authentication using tokens that you provide.

AWS-Native – The updated CloudHSM is an integral part of AWS and plays well with other tools and services. You can create and manage a cluster of HSMs using the AWS Management Console, AWS Command Line Interface (CLI), or API calls.

Diving In
You can create CloudHSM clusters that contain 1 to 32 HSMs, each in a separate Availability Zone in a particular AWS Region. Spreading HSMs across AZs gives you high availability (including built-in load balancing); adding more HSMs gives you additional throughput. The HSMs within a cluster are kept in sync: performing a task or operation on one HSM in a cluster automatically updates the others. Each HSM in a cluster has its own Elastic Network Interface (ENI).

All interaction with an HSM takes place via the AWS CloudHSM client. It runs on an EC2 instance and uses certificate-based mutual authentication to create secure (TLS) connections to the HSMs.

At the hardware level, each HSM includes hardware-enforced isolation of crypto operations and key storage. Each customer HSM runs on dedicated processor cores.

Setting Up a Cluster
Let’s set up a cluster using the CloudHSM Console:

I click on Create cluster to get started, select my desired VPC and the subnets within it (I can also create a new VPC and/or subnets if needed):

Then I review my settings and click on Create:

After a few minutes, my cluster exists, but is uninitialized:

Initialization simply means retrieving a certificate signing request (the Cluster CSR):

And then creating a private key and using it to sign the request (these commands were copied from the Initialize Cluster docs and I have omitted the output. Note that ID identifies the cluster):

$ openssl genrsa -out CustomerRoot.key 2048
$ openssl req -new -x509 -days 365 -key CustomerRoot.key -out CustomerRoot.crt
$ openssl x509 -req -days 365 -in ID_ClusterCsr.csr   \
                              -CA CustomerRoot.crt    \
                              -CAkey CustomerRoot.key \
                              -CAcreateserial         \
                              -out ID_CustomerHsmCertificate.crt

The next step is to apply the signed certificate to the cluster using the console or the CLI. After this has been done, the cluster can be activated by changing the password for the HSM’s administrative user, otherwise known as the Crypto Officer (CO).
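If you prefer the CLI for that step, the command looks roughly like this (a sketch based on the cloudhsmv2 CLI; substitute your own cluster ID and double-check the parameter names against the current docs):

$ aws cloudhsmv2 initialize-cluster --cluster-id <cluster ID>                          \
                                    --signed-cert file://ID_CustomerHsmCertificate.crt \
                                    --trust-anchor file://CustomerRoot.crt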

Once the cluster has been created, initialized and activated, it can be used to protect data. Applications can use the APIs in AWS CloudHSM SDKs to manage keys, encrypt & decrypt objects, and more. The SDKs provide access to the CloudHSM client (running on the same instance as the application). The client, in turn, connects to the cluster across an encrypted connection.

Available Today
The new HSM is available today in the US East (Northern Virginia), US West (Oregon), US East (Ohio), and EU (Ireland) Regions, with more in the works. Pricing starts at $1.45 per HSM per hour.

Jeff;

Pirate Domain Blocking ‘Door’ Should Remain Open, RIAA Tells Court

Post Syndicated from Ernesto original https://torrentfreak.com/pirate-domain-blocking-door-should-remain-open-riaa-tells-court-170808/

As one of the leading CDN and DDoS protection services, Cloudflare is used by millions of websites across the globe.

This includes thousands of “pirate” sites which rely on the U.S.-based company to keep server loads down.

While Cloudflare is a neutral service provider, rightsholders are not happy with its role. The company has been involved in several legal disputes already, including the RIAA’s lawsuit against MP3Skull.

Last year the record labels won their case against the MP3 download portal, but the site ignored the court order and continued to operate. This prompted the RIAA to go after third-party services, including Cloudflare, to target associated domain names.

The RIAA demanded domain blockades, arguing that Cloudflare actively cooperated with the pirates. The CDN provider objected and argued that the DMCA shielded the company from the broad blocking requirements. In turn, the court ruled that the DMCA doesn’t apply in this case, opening the door to widespread anti-piracy filtering.

While it’s still to be determined whether Cloudflare is indeed “in active concert or participation” with MP3Skull, the company recently asked the court to vacate the order, arguing that the case is moot.

MP3Skull no longer has an active website, and previous domain names either never used Cloudflare or stopped using it long before the order was issued, the company argued.

The RIAA clearly disagrees. According to the music industry group, Cloudflare’s request relies on “misstatements.” The order wasn’t moot when the court issued it in March, and it isn’t moot today, they argue.

Some MP3Skull domains were still actively using Cloudflare as recently as April, but Cloudflare failed to mention these.

“CloudFlare’s arguments to the contrary rely largely on misdirection, pointing to the status of domain names that expressly were not at issue in Plaintiffs’ motion,” the RIAA writes.

Even if all the domain names are no longer active on Cloudflare, the order should remain in place, the RIAA argues. The group points out that nothing is preventing the MP3Skull owners from relaunching the site and moving back to Cloudflare in the future.

“By its own admission, CloudFlare took no steps to prevent Defendants from using its services at any time. Given Defendants’ established practice of moving from domain to domain and from service to service throughout this case in contempt of this Court’s orders, Defendants could easily have resumed — and may tomorrow resume — their use of CloudFlare’s services.”

In addition, the RIAA stressed that the present ruling doesn’t harm Cloudflare at all. Since there are no active MP3Skull domains using the service presently, it need take no action.

“The March 23 Order does not require CloudFlare to do anything. All that Order did was to clarify that Rule 65, and not Section 512(j) of the DMCA, applied,” the RIAA stresses.

While it may seem pointless to spend hours of legal counsel’s time on a site that is no longer active, the dispute shows the importance of the court’s ruling and its wider site-blocking implications.

The RIAA wants to keep the door open for similar requests in the future, and Cloudflare wants to avoid any liability for pirate sites. These looming legal consequences are the main reason why the CDN provider asked the court to vacate the order, the RIAA notes.

“It is evident that the only reason why CloudFlare wants the Court to vacate its March 23 Order is that it does not like the Court’s ruling on the purely legal issue of Rule 65(d)’s scope,” the RIAA writes.

It is now up to the court to decide how to move forward. A decision on Cloudflare’s request is expected in the weeks to come.

The RIAA’s full reply is available here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Deploying an NGINX Reverse Proxy Sidecar Container on Amazon ECS

Post Syndicated from Nathan Peck original https://aws.amazon.com/blogs/compute/nginx-reverse-proxy-sidecar-container-on-amazon-ecs/

Reverse proxies are a powerful software architecture primitive for fetching resources from a server on behalf of a client. They serve a number of purposes, from protecting servers from unwanted traffic to offloading some of the heavy lifting of HTTP traffic processing.

This post explains the benefits of a reverse proxy and shows how to use NGINX and Amazon EC2 Container Service (Amazon ECS) to easily implement and deploy a reverse proxy for your containerized application.

Components

NGINX is a high performance HTTP server that has achieved significant adoption because of its asynchronous event driven architecture. It can serve thousands of concurrent requests with a low memory footprint. This efficiency also makes it ideal as a reverse proxy.

Amazon ECS is a highly scalable, high performance container management service that supports Docker containers. It allows you to run applications easily on a managed cluster of Amazon EC2 instances. Amazon ECS helps you get your application components running on instances according to a specified configuration. It also helps scale out these components across an entire fleet of instances.

Sidecar containers are a common software pattern that has been embraced by engineering organizations. It’s a way to keep server side architecture easier to understand by building with smaller, modular containers that each serve a simple purpose. Just like an application can be powered by multiple microservices, each microservice can also be powered by multiple containers that work together. A sidecar container is simply a way to move part of the core responsibility of a service out into a containerized module that is deployed alongside a core application container.

The following diagram shows how an NGINX reverse proxy sidecar container operates alongside an application server container:

In this architecture, Amazon ECS has deployed two copies of an application stack that is made up of an NGINX reverse proxy sidecar container and an application container. Web traffic from the public goes to an Application Load Balancer, which then distributes the traffic to one of the NGINX reverse proxy sidecars. The NGINX reverse proxy then forwards the request to the application server and returns its response to the client via the load balancer.

Reverse proxy for security

Security is one reason for using a reverse proxy in front of an application container. Any web server that serves resources to the public can expect to receive lots of unwanted traffic every day. Some of this traffic is relatively benign scans by researchers and tools, such as Shodan or nmap:

[18/May/2017:15:10:10 +0000] "GET /YesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScann HTTP/1.1" 404 1389 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
[18/May/2017:18:19:51 +0000] "GET /clientaccesspolicy.xml HTTP/1.1" 404 322 - Cloud mapping experiment. Contact [email protected]

But other traffic is much more malicious. For example, here is what a web server sees while being scanned by the hacking tool ZmEu, which scans web servers trying to find PHPMyAdmin installations to exploit:

[18/May/2017:16:27:39 +0000] "GET /mysqladmin/scripts/setup.php HTTP/1.1" 404 391 - ZmEu
[18/May/2017:16:27:39 +0000] "GET /web/phpMyAdmin/scripts/setup.php HTTP/1.1" 404 394 - ZmEu
[18/May/2017:16:27:39 +0000] "GET /xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /apache-default/phpmyadmin/scripts/setup.php HTTP/1.1" 404 405 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /phpMyAdmin-2.10.0.0/scripts/setup.php HTTP/1.1" 404 397 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /mysql/scripts/setup.php HTTP/1.1" 404 386 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /admin/scripts/setup.php HTTP/1.1" 404 386 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /forum/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /typo3/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:42 +0000] "GET /phpMyAdmin-2.10.0.1/scripts/setup.php HTTP/1.1" 404 399 - ZmEu
[18/May/2017:16:27:44 +0000] "GET /administrator/components/com_joommyadmin/phpmyadmin/scripts/setup.php HTTP/1.1" 404 418 - ZmEu
[18/May/2017:18:34:45 +0000] "GET /phpmyadmin/scripts/setup.php HTTP/1.1" 404 390 - ZmEu
[18/May/2017:16:27:45 +0000] "GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1" 404 401 - ZmEu

In addition, servers can also end up receiving unwanted web traffic that is intended for another server. In a cloud environment, an application may end up reusing an IP address that was formerly connected to another service. It’s common for misconfigured or misbehaving DNS servers to send traffic intended for a different host to an IP address now connected to your server.

It’s the responsibility of anyone running a web server to handle and reject potentially malicious or unwanted traffic. Ideally, the web server rejects this traffic as early as possible, before it reaches the core application code. A reverse proxy is one way to provide this layer of protection for an application server: it can be configured to reject these requests before they reach the application server.

Reverse proxy for performance

Another advantage of using a reverse proxy such as NGINX is that it can be configured to offload some heavy lifting from your application container. For example, every HTTP server should support gzip. Whenever a client requests gzip encoding, the server compresses the response before sending it back to the client. This compression saves network bandwidth, which also improves speed for clients who now don’t have to wait as long for a response to fully download.

NGINX can be configured to accept a plaintext response from your application container and gzip encode it before sending it down to the client. This allows your application container to focus 100% of its CPU allotment on running business logic, while NGINX handles the encoding with its efficient gzip implementation.

An application may have security concerns that require SSL termination at the instance level instead of at the load balancer. NGINX can also be configured to terminate SSL before proxying the request to a local application container. Again, this also removes some CPU load from the application container, allowing it to focus on running business logic. It also gives you a cleaner way to patch any SSL vulnerabilities or update SSL certificates by updating the NGINX container without needing to change the application container.
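As a rough sketch of what instance-level termination looks like, a server block along these lines would accept TLS from the load balancer and proxy plain HTTP to the app container (the certificate paths and the app hostname are placeholders, not part of the reference architecture shown later):

server {
  listen 443 ssl;
  ssl_certificate     /etc/nginx/certs/server.crt;
  ssl_certificate_key /etc/nginx/certs/server.key;

  location /api {
    proxy_pass http://app:3000;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
  }
}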

NGINX configuration

The following configuration sets up NGINX for both traffic filtering and gzip encoding:

http {
  # NGINX will handle gzip compression of responses from the app server
  gzip on;
  gzip_proxied any;
  gzip_types text/plain application/json;
  gzip_min_length 1000;
 
  server {
    listen 80;
 
    # NGINX will reject anything not matching /api
    location /api {
      # Reject requests with unsupported HTTP method
      if ($request_method !~ ^(GET|POST|HEAD|OPTIONS|PUT|DELETE)$) {
        return 405;
      }
 
      # Only requests matching the whitelist expectations will
      # get sent to the application server
      proxy_pass http://app:3000;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection 'upgrade';
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_cache_bypass $http_upgrade;
    }
  }
}

The above configuration only accepts traffic that matches the expression /api and has a recognized HTTP method. If the traffic matches, it is forwarded to a local application container accessible at the local hostname app. If the client requested gzip encoding, the plaintext response from that application container is gzip-encoded.
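For example, requests against the proxy behave roughly like this (the hostname is a placeholder):

$ curl -i http://proxy-host/api/users       # matches /api, proxied to the app container
$ curl -i http://proxy-host/admin           # no matching location block, NGINX returns 404
$ curl -i -X TRACE http://proxy-host/api    # method not in the whitelist, NGINX returns 405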

Amazon ECS configuration

Configuring ECS to run this NGINX container as a sidecar is also simple. ECS uses a core primitive called the task definition. Each task definition can include one or more containers, which can be linked to each other:

{
  "containerDefinitions": [
    {
      "name": "nginx",
      "image": "<NGINX reverse proxy image URL here>",
      "memory": 256,
      "cpu": 256,
      "essential": true,
      "portMappings": [
        {
          "containerPort": 80,
          "protocol": "tcp"
        }
      ],
      "links": [
        "app"
      ]
    },
    {
      "name": "app",
      "image": "<app image URL here>",
      "memory": 256,
      "cpu": 256,
      "essential": true
    }
  ],
  "networkMode": "bridge",
  "family": "application-stack"
}

This task definition causes ECS to start both an NGINX container and an application container on the same instance. Then, the NGINX container is linked to the application container. This allows the NGINX container to send traffic to the application container using the hostname app.

The NGINX container has a port mapping that exposes port 80 on a publicly accessible port, but the application container does not. This means that the application container is not directly addressable. The only way to send it traffic is to send traffic to the NGINX container, which filters that traffic down and only forwards it to the application container if it passes the whitelisted rules.
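To try this out, you could save that JSON to a file and hand it to the AWS CLI, roughly as follows (the cluster and service names are placeholders):

$ aws ecs register-task-definition --cli-input-json file://application-stack.json
$ aws ecs create-service --cluster default                   \
                         --service-name application-stack    \
                         --task-definition application-stack \
                         --desired-count 2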

Conclusion

Running a sidecar container such as NGINX can bring significant benefits by making it easier to provide protection for application containers. Sidecar containers also improve performance by freeing your application container from various CPU intensive tasks. Amazon ECS makes it easy to run sidecar containers, and automate their deployment across your cluster.

To see the full code for this NGINX sidecar reference, or to try it out yourself, you can check out the open source NGINX reverse proxy reference architecture on GitHub.

– Nathan
 @nathankpeck

Top 10 Most Obvious Hacks of All Time (v0.9)

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/top-10-most-obvious-hacks-of-all-time.html

For teaching hacking/cybersecurity, I thought I’d create a list of the most obvious hacks of all time. Not the best hacks, the most sophisticated hacks, or the hacks with the biggest impact, but the most obvious hacks — ones that even the least knowledgeable among us should be able to understand. Below I propose some hacks that fit this bill, though in no particular order.

The reason I’m writing this is that my niece wants me to teach her some hacking. I thought I’d start with the obvious stuff first.

Shared Passwords

If you use the same password for every website, and one of those websites gets hacked, then the hacker has your password for all your websites. The reason your Facebook account got hacked wasn’t because of anything Facebook did, but because you used the same email-address and password when creating an account on “beagleforums.com”, which got hacked last year.

I’ve heard people say “I’m secure, because I choose a complex password and use it everywhere”. No, this is the very worst thing you can do. Sure, you can use the same password on all the sites you don’t care much about, but for Facebook, your email account, and your bank, you should have a unique password, so that when other sites get hacked, your important sites are secure.

And yes, it’s okay to write down your passwords on paper.

Tools: HaveIBeenPwned.com

PIN encrypted PDFs

My accountant emails PDF statements encrypted with the last 4 digits of my Social Security Number. This is not encryption — a 4 digit number has only 10,000 combinations, and a hacker can guess all of them in seconds.
PIN numbers for ATM cards work because ATM machines are online, and the machine can reject your card after four guesses. PIN numbers don’t work for documents, because they are offline — the hacker has a copy of the document on their own machine, disconnected from the Internet, and can continue making bad guesses with no restrictions.
Passwords protecting documents must be long enough that even trillions upon trillions of guesses are insufficient to find them.

Tools: Hashcat, John the Ripper
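To make the arithmetic concrete, here is roughly what the offline attack looks like as a Hashcat mask attack, assuming you have already extracted the document’s password hash (for example with John the Ripper’s pdf2john) and picked the hash mode that matches the PDF version (details that vary, so treat this as a sketch):

$ hashcat -a 3 -m 10500 statement.hash ?d?d?d?d

The mask ?d?d?d?d tries all 10,000 four-digit PINs, which finishes almost instantly on any modern machine.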

SQL and other injection

The lazy way of combining websites with databases is to combine user input with an SQL statement. This combines code with data, so the obvious consequence is that hackers can craft data to mess with the code.
No, this isn’t obvious to the general public, but it should be obvious to programmers. The moment you write code that adds unfiltered user-input to an SQL statement, the consequence should be obvious. Yet, “SQL injection” has remained one of the most effective hacks for the last 15 years because somehow programmers don’t understand the consequence.
CGI shell injection is a similar issue. Back in the early days, when “CGI scripts” were a thing, it was really important, but these days, not so much, so I just included it with SQL. The consequence of executing shell code should’ve been obvious, but weirdly, it wasn’t. The IT guy at the company I worked for back in the late 1990s came to me and asked “this guy says we have a vulnerability, is he full of shit?”, and I had to answer “no, he’s right — obviously so”.
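Here’s a minimal sketch in C of the lazy pattern described above: splicing user input directly into the SQL string. It doesn’t talk to a real database; it just prints the query that an attacker-controlled input produces, which is enough to see the code/data confusion:

#include <stdio.h>

int main(void)
{
    char query[256];
    /* imagine this came from a web form field */
    const char *user_input = "x' OR '1'='1";

    /* the lazy pattern: splice the data straight into the SQL "code" */
    snprintf(query, sizeof(query),
             "SELECT * FROM users WHERE name = '%s';", user_input);

    printf("%s\n", query);
    /* prints: SELECT * FROM users WHERE name = 'x' OR '1'='1';
     * the attacker's data has become part of the query logic, and the
     * WHERE clause now matches every row in the table */
    return 0;
}

The fix is equally simple in concept: keep code and data separate by using parameterized queries, so the database never interprets user input as SQL.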

XSS (“Cross Site Scripting”) [*] is another injection issue, but this time at somebody’s web browser rather than a server. It works because websites will echo back what is sent to them. For example, if you search for Cross Site Scripting with the URL https://www.google.com/search?q=cross+site+scripting, then you’ll get a page back from the server that contains that string. If the string is JavaScript code rather than text, then some servers (though not Google) send back the code in the page in a way that it’ll be executed. This is most often used to hack somebody’s account: you send them an email or tweet a link, and when they click on it, the JavaScript gives control of the account to the hacker.

Cross site injection issues like this should probably be their own category, but I’m including it here for now.

More: Wikipedia on SQL injection, Wikipedia on cross site scripting.
Tools: Burpsuite, SQLmap

Buffer overflows

In the C programming language, programmers first create a buffer, then read input into it. If the input is longer than the buffer, it overflows. The extra bytes overwrite other parts of the program, letting the hacker run code.
Again, it’s not a thing the general public is expected to know about, but is instead something C programmers should be expected to understand. They should know that it’s up to them to check the length and stop reading input before it overflows the buffer, that there’s no language feature that takes care of this for them.
We are three decades after the first major buffer overflow exploits, so there is no excuse for C programmers not to understand this issue.
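A minimal sketch of the mistake in C, next to the bounded copy the programmer is responsible for writing:

#include <stdio.h>
#include <string.h>

/* toy illustration: do NOT write code like parse_name() */
void parse_name(const char *input)
{
    char buf[16];           /* fixed-size buffer */
    strcpy(buf, input);     /* no length check: input longer than 15 bytes
                               overwrites adjacent stack memory, including
                               the saved return address */
    printf("hello, %s\n", buf);
}

void parse_name_safely(const char *input)
{
    char buf[16];
    /* the programmer must bound the copy; nothing in C does it for them */
    strncpy(buf, input, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    printf("hello, %s\n", buf);
}

int main(int argc, char **argv)
{
    if (argc > 1)
        parse_name_safely(argv[1]);
    return 0;
}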

What makes buffer overflows particularly obvious is the way they are wrapped up in exploits, like in Metasploit. While the bug itself is obviously a bug, actually exploiting it can take some very non-obvious skill. However, once that exploit is written, any trained monkey can press a button and run the exploit. That’s where we get the insult “script kiddie” from — referring to wannabe-hackers who never learn enough to write their own exploits, but who spend a lot of time running the exploit scripts written by better hackers than they.

More: Wikipedia on buffer overflow, Wikipedia on script kiddie,  “Smashing The Stack For Fun And Profit” — Phrack (1996)
Tools: bash, Metasploit

SendMail DEBUG command (historical)

The first popular email server in the 1980s was called “SendMail”. It had a feature whereby if you sent a “DEBUG” command to it, it would execute any code following the command. The consequence of this was obvious — hackers could (and did) upload code to take control of the server. This was used in the Morris Worm of 1988. Most Internet machines of the day ran SendMail, so the worm spread fast, infecting most machines.
This bug was mostly ignored at the time. It was thought of as a theoretical problem that might only rarely be used to hack a system. Part of the motivation of the Morris Worm was to demonstrate the consequences of such problems — consequences that should’ve been obvious but somehow were rejected by everyone.

More: Wikipedia on Morris Worm

Email Attachments/Links

I’m conflicted about whether I should add this or not, because here’s the deal: you are supposed to click on attachments and links within emails. That’s what they are there for. The difference between good and bad attachments/links is not obvious. Indeed, easy-to-use email systems make detecting the difference harder.
On the other hand, the consequences of bad attachments/links are obvious. That worms like ILOVEYOU spread so easily is because people trusted attachments coming from their friends, and ran them.
We have no solution to the problem of bad email attachments and links. Viruses and phishing are pervasive problems. Yet, we know why they exist.

Default and backdoor passwords

The Mirai botnet was caused by surveillance-cameras having default and backdoor passwords, and being exposed to the Internet without a firewall. The consequence should be obvious: people will discover the passwords and use them to take control of the bots.
Surveillance-cameras have the problem that they are usually exposed to the public, and can’t be reached without a ladder — often a really tall ladder. Therefore, you don’t want a button consumers can press to reset to factory defaults. You want a remote way to reset them. Therefore, vendors put in backdoor passwords to do the reset. Such passwords are easy for hackers to reverse-engineer, letting them take control of millions of cameras across the Internet.
The same reasoning applies to “default” passwords. Many users will not change the defaults, leaving a ton of devices hackers can hack.

Masscan and background radiation of the Internet

I’ve written a tool that can easily scan the entire Internet in a short period of time. It surprises people that this is possible, but it’s obvious from the numbers. Internet addresses are only 32 bits long, or roughly 4 billion combinations. A fast Internet link can easily handle 1 million packets-per-second, so the entire Internet can be scanned in 4000 seconds, little more than an hour. It’s basic math.
Because it’s so easy, many people do it. If you monitor your Internet link, you’ll see a steady trickle of packets coming in from all over the Internet, especially Russia and China, from hackers scanning the Internet for things they can hack.
People’s reaction to this scanning is weirdly emotional, taking it personally, such as:
  1. Why are they hacking me? What did I do to them?
  2. Great! They are hacking me! That must mean I’m important!
  3. Grrr! How dare they?! How can I hack them back for some retribution!?

I find this odd, because obviously such scanning isn’t personal; the hackers have no idea who you are.

Tools: masscan, firewalls
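For reference, a scan like the one described above is a single command (the port and rate here are illustrative, and in practice you would also maintain an exclude list of networks that have asked not to be scanned):

$ masscan 0.0.0.0/0 -p80 --rate 1000000 --excludefile exclude.txt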

Packet-sniffing, sidejacking

If you connect to the Starbucks WiFi, a hacker nearby can easily eavesdrop on your network traffic, because it’s not encrypted. Windows even warns you about this, in case you weren’t sure.

At DefCon, they have a “Wall of Sheep”, where they show passwords from people who logged onto stuff using the insecure “DefCon-Open” network. They call them “sheep” for not grasping the basic fact that unencrypted traffic is unencrypted.

To be fair, it’s actually non-obvious to many people. Even if the WiFi itself is not encrypted, SSL traffic is. They expect their services to be encrypted, without them having to worry about it. And in fact, most are, especially Google, Facebook, Twitter, Apple, and other major services that won’t allow you to log in anymore without encryption.

But many services (especially old ones) may not be encrypted. Unless users check and verify them carefully, they’ll happily expose passwords.

What’s interesting is that 10 years ago, most services used SSL only to encrypt the password exchange, then switched to unencrypted connections tracked with “cookies”. This allowed the cookies to be sniffed and stolen, allowing other people to share the login session. I used this on stage at BlackHat to connect to somebody’s GMail session. Google, and other major websites, fixed this soon after. But it should never have been a problem — because the sidejacking of cookies should have been obvious.

Tools: Wireshark, dsniff

Stuxnet LNK vulnerability

Again, this issue isn’t obvious to the public, but it should’ve been obvious to anybody who knew how Windows works.
When Windows loads a .dll, it first calls the function DllMain(). A Windows link file (.lnk) can load icons/graphics from the resources in a .dll file. It does this by loading the .dll file, thus calling DllMain. Thus, a hacker could put on a USB drive a .lnk file pointing to a .dll file, and thus, cause arbitrary code execution as soon as a user inserted a drive.
I say this is obvious because I did this, created .lnks that pointed to .dlls, but without hostile DllMain code. The consequence should’ve been obvious to me, but I totally missed the connection. We all missed the connection, for decades.

Social Engineering and Tech Support [* * *]

After posting this, many people have pointed out “social engineering”, especially of “tech support”. This probably should be up near #1 in terms of obviousness.

The classic example of social engineering is when you call tech support and tell them you’ve lost your password, and they reset it for you with a minimum of questions proving who you are. For example, you set the volume on your computer really loud and play the sound of a crying baby in the background and appear to be a bit frazzled and incoherent, which explains why you aren’t answering the questions they are asking. They, understanding your predicament as a new parent, will go the extra mile in helping you, resetting “your” password.

One of the interesting consequences is how it affects domain names (DNS). It’s quite easy in many cases to call up the registrar and convince them to transfer a domain name. This has been used in lots of hacks. It’s really hard to defend against. If a registrar charges only $9/year for a domain name, then it really can’t afford to provide very good tech support — or very secure tech support — to prevent this sort of hack.

Social engineering is such a huge problem, and obvious problem, that it’s outside the scope of this document. Just google it to find example after example.

A related issue that perhaps deserves its own section is OSINT [*], or “open-source intelligence”, where you gather public information about a target. For example, on the day the bank manager is out on vacation (which you got from their Facebook post) you show up and claim to be a bank auditor, and are shown into their office where you grab their backup tapes. (We’ve actually done this).

More: Wikipedia on Social Engineering, Wikipedia on OSINT, “How I Won the Defcon Social Engineering CTF” — blogpost (2011), “Questioning 42: Where’s the Engineering in Social Engineering of Namespace Compromises” — BSidesLV talk (2016)

Blue-boxes (historical) [*]

Telephones historically used what we call “in-band signaling”. That’s why when you dial on an old phone, it makes sounds — those sounds are sent no differently than the way your voice is sent. Thus, it was possible to make tone generators to do things other than simply dial calls. Early hackers (in the 1970s) would make tone-generators called “blue-boxes” and “black-boxes” to make free long distance calls, for example.

These days, “signaling” and “voice” are digitized, then sent as separate channels or “bands”. This is called “out-of-band signaling”. You can’t trick the phone system by generating tones. When your iPhone makes sounds when you dial, it’s entirely for your benefit and has nothing to do with how it signals the cell tower to make a call.

Early hackers, like the founders of Apple, are famous for having started their careers making such “boxes” for tricking the phone system. The problem was obvious back in the day, which is why, as the phone system moved from analog to digital, the problem was fixed.

More: Wikipedia on blue box, Wikipedia article on Steve Wozniak.

Thumb drives in parking lots [*]

A simple trick is to put a virus on a USB flash drive, and drop it in a parking lot. Somebody is bound to notice it, stick it in their computer, and open the file.

This can be extended with tricks. For example, you can put a file labeled “third-quarter-salaries.xlsx” on the drive that requires macros to be run in order to open it. It’s irresistible to other employees who want to know what their peers are being paid, so they’ll bypass any warning prompts in order to see the data.

Another example is to go online and get custom USB sticks made printed with the logo of the target company, making them seem more trustworthy.

We also did a trick of taking an Adobe Flash game “Punch the Monkey” and replacing the monkey with the logo of a competitor of our target. They not only played the game (infecting themselves with our virus), but gave it to others inside the company to play, infecting others, including the CEO.

Thumb drives like this have been used in many incidents, such as Russians hacking military headquarters in Afghanistan. It’s really hard to defend against.

More: “Computer Virus Hits U.S. Military Base in Afghanistan” — USNews (2008), “The Return of the Worm That Ate The Pentagon” — Wired (2011), DoD Bans Flash Drives — Stripes (2008)

Googling [*]

Search engines like Google will index your website — your entire website. Frequently companies put things on their website without much protection because they are nearly impossible for users to find. But Google finds them, then indexes them, causing them to pop up with innocent searches.
There are books written on “Google hacking” explaining what search terms to look for, like “not for public release”, in order to find such documents.

More: Wikipedia entry on Google Hacking, “Google Hacking” book.

URL editing [*]

At the top of every browser is what’s called the “URL”. You can change it. Thus, if you see a URL that looks like this:

http://www.example.com/documents?id=138493

Then you can edit it to see the next document on the server:

http://www.example.com/documents?id=138494

The owner of the website may think they are secure, because nothing points to this document, so the Google search won’t find it. But that doesn’t stop a user from manually editing the URL.
An example of this is a big Fortune 500 company that posts the quarterly results to the website an hour before the official announcement. Simply editing the URL from previous financial announcements allows hackers to find the document, then buy/sell the stock as appropriate in order to make a lot of money.
Another example is the classic case of Andrew “Weev” Auernheimer who did this trick in order to download the account email addresses of early owners of the iPad, including movie stars and members of the Obama administration. It’s an interesting legal case because on one hand, techies consider this so obvious as to not be “hacking”. On the other hand, non-techies, especially judges and prosecutors, believe this to be obviously “hacking”.

DDoS, spoofing, and amplification [*]

For decades now, online gamers have figured out an easy way to win: just flood the opponent with Internet traffic, slowing their network connection. This is called a DoS, which stands for “Denial of Service”. DoSing game competitors is often a teenager’s first foray into hacking.
A variant of this is when you hack a bunch of other machines on the Internet, then command them to flood your target. (The hacked machines are often called a “botnet”, a network of robot computers). This is called DDoS, or “Distributed DoS”. At this point, it gets quite serious, as instead of competitive gamers, hackers can take down entire businesses. Extortion scams, DDoSing websites then demanding payment to stop, are a common way hackers earn money.
Another form of DDoS is “amplification”. Sometimes when you send a packet to a machine on the Internet it’ll respond with a much larger response, either a very large packet or many packets. The hacker can then send a packet to many of these sites, “spoofing” or forging the IP address of the victim. This causes all those sites to then flood the victim with traffic. Thus, with a small amount of outbound traffic, the hacker can flood the inbound traffic of the victim.
This is one of those things that has worked for 20 years, because it’s so obvious teenagers can do it, yet there is no obvious solution. President Trump’s executive order on cyberspace specifically demanded that his government come up with a report on how to address this, but it’s unlikely that they’ll come up with any useful strategy.

More: Wikipedia on DDoS, Wikipedia on Spoofing

Conclusion

Tweet me (@ErrataRob) your obvious hacks, so I can add them to the list.

Top Ten Ways to Protect Yourself Against Phishing Attacks

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/top-ten-ways-protect-phishing-attacks/

It’s hard to miss the increasing frequency of phishing attacks in the news. Earlier this year, a major phishing attack targeted Google Docs users, and attempted to compromise at least one million Google Docs accounts. Experts say the “phish” was convincing and sophisticated, and even people who thought they would never be fooled by a phishing attack were caught in its net.

What is phishing?

Phishing attacks use seemingly trustworthy but malicious emails and websites to obtain your personal account or banking information. The attacks are cunning and highly effective because they often appear to come from an organization or business you actually use. The scam comes into play by tricking you into visiting a website you believe belongs to the trustworthy organization, but in fact is under the control of the phisher attempting to extract your private information.

Phishing attacks are once again in the news due to a handful of high profile ransomware incidents. Ransomware invades a user’s computer, encrypts their data files, and demands payment to decrypt the files. Ransomware most often makes its way onto a user’s computer through a phishing exploit, which gives the ransomware access to the user’s computer.

The best strategy against phishing is to scrutinize every email and message you receive and never get caught. Easier said than done—even smart people sometimes fall victim to a phishing attack. To minimize the damage in the event of a phishing attack, backing up your data is the ultimate defense and should be part of your anti-phishing and overall anti-malware strategy.

How do you recognize a phishing attack?

A phishing attacker may send an email seemingly from a reputable credit card company or financial institution that requests account information, often suggesting that there is a problem with your account. When users respond with the requested information, attackers can use it to gain access to the accounts.

The image below is a mockup of how a phishing attempt might appear. In this example, courtesy of Wikipedia, the bank is fictional, but in a real attempt the sender would use an actual bank, perhaps even the bank where the targeted victim does business. The sender is attempting to trick the recipient into revealing confidential information by getting the victim to visit the phisher’s website. Note the misspelling of the words “received” and “discrepancy” as recieved and discrepency. Misspellings sometimes are indications of a phishing attack. Also note that although the URL of the bank’s webpage appears to be legitimate, the hyperlink would actually take you to the phisher’s webpage, which would be altogether different from the URL displayed in the message.

By Andrew Levine – en:Image:PhishingTrustedBank.png, Public Domain, https://commons.wikimedia.org/w/index.php?curid=549747

Top ten ways to protect yourself against phishing attacks

  1. Always think twice when presented with a link in any kind of email or message before you click on it. Ask yourself whether the sender would ask you to do what it is requesting. Most banks and reputable service providers won’t ask you to reveal your account information or password via email. If in doubt, don’t use the link in the message and instead open a new webpage and go directly to the known website of the organization. Sign in to the site in the normal manner to verify that the request is legitimate.
  2. A good precaution is to always hover over a link before clicking on it and observe the status line in your browser to verify that the link in the text and the destination link are in fact the same.
  3. Phishers are clever, and they’re getting better all the time, and you might be fooled by a simple ruse to make you think the link is one you recognize. Links can have hard-to-detect misspellings that would result in visiting a site very different than what you expected.
  4. Be wary even of emails and messages from people you know. It’s very easy to spoof an email so it appears to come from someone you know, or to create a URL that appears to be legitimate, but isn’t.

For example, let’s say that you work for roughmedia.com and you get an email from Chuck in accounting (chuck@roughrnedia.com) that has an attachment for you, perhaps a company form you need to fill out. You likely wouldn’t notice in the sender address that the phisher has replaced the “m” in media with an “r” and an “n” that look very much like an “m.” You think it’s good old Chuck in accounting and it’s actually someone “phishing” for you to open the attachment and infect your computer. This type of attack is known as “spear phishing” because it’s targeted at a specific individual and is using social engineering—specifically familiarity with the sender—as part of the scheme to fool you into trusting the attachment. This technique is by far the most successful on the internet today. (This example is based on Gimlet Media’s Reply All Podcast Episode, “What Kind of Idiot Gets Phished?”)

  5. Use anti-malware software, but don’t rely on it to catch all attacks. Phishers change their approach often to keep ahead of the software attack detectors.
  6. If you are asked to enter any valuable information, only do so if you’re on a secure connection. Look for the “https” prefix before the site URL, indicating the site is employing SSL (Secure Sockets Layer). If there is no “s” after “http,” it’s best not to enter any confidential information.
By Fabio Lanari – Internet1.jpg by Rock1997 modified., GFDL, https://commons.wikimedia.org/w/index.php?curid=20995390
  7. Avoid logging in to online banks and similar services via public Wi-Fi networks. Criminals can compromise open networks with man-in-the-middle attacks that capture your information or spoof website addresses over the connection and redirect you to a fake page they control.
  8. Email, instant messaging, and gaming social channels are all possible vehicles to deliver phishing attacks, so be vigilant!
  9. Lay the foundation for a good defense by choosing reputable tech vendors and service providers that respect your privacy and take steps to protect your data. At Backblaze, we have full-time security teams constantly looking for ways to improve our security.
  10. When it is available, always take advantage of multi-factor verification to protect your accounts. The standard categories used for authentication are 1) something you know (e.g. your username and password), 2) something you are (e.g. your fingerprint or retina pattern), and 3) something you have (e.g. an authenticator app on your smartphone). An account that allows only a single factor for authentication is more susceptible to hacking than one that supports multiple factors. Backblaze supports multi-factor authentication to protect customer accounts.

Be a good internet citizen, and help reduce phishing and other malware attacks by notifying the organization being impersonated in the phishing attempt, or by forwarding suspicious messages to the Federal Trade Commission at spam@uce.gov. Some email clients and services, such as Microsoft Outlook and Google Gmail, give you the ability to easily report suspicious emails. Phishing emails misrepresenting Apple can be reported to reportphishing@apple.com.

Backing up your data is an important part of a strong defense against phishing and other malware

The best way to avoid becoming a victim is to be vigilant against suspicious messages and emails, but also to assume that no matter what you do, it is very possible that your system will be compromised. Even the most sophisticated and tech-savvy of us can be ensnared if we are tired, in a rush, or just unfamiliar with the latest methods hackers are using. Remember that hackers are working full-time on ways to fool us, so it’s very difficult to keep ahead of them.

The best defense is to make sure that any data that could be compromised by hackers—basically all of the data that is reachable via your computer—is not your only copy. You do that by maintaining an active and reliable backup strategy.

Files that are backed up to cloud storage, such as with Backblaze, are not vulnerable to attacks on your local computer in the way that local files, attached drives, network drives, or sync services like Dropbox that have local directories on your computer are.

In the event that your computer is compromised and your files are lost or encrypted, you can recover your files if you have a cloud backup that is beyond the reach of attacks on your computer.

The post Top Ten Ways to Protect Yourself Against Phishing Attacks appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required

Post Syndicated from Maor Kleider original https://aws.amazon.com/blogs/big-data/amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required/

When we first looked into the possibility of building a cloud-based data warehouse many years ago, we were struck by the fact that our customers were storing ever-increasing amounts of data, and yet only a small fraction of that data ever made it into a data warehouse or Hadoop system for analysis. We saw that this wasn’t just a cloud-specific anomaly. It was also true in the broader industry, where the growth rate of the enterprise storage market segment greatly surpassed that of the data warehousing market segment.

We dubbed this the “dark data” problem. Our customers knew that there was untapped value in the data they collected; why else would they spend money to store it? But the systems available to them to analyze this data were simply too slow, complex, and expensive for them to use on all but a select subset of this data. They were storing it with optimistic hope that, someday, someone would find a solution.

Amazon Redshift became one of the fastest-growing AWS services because it helped solve the dark data problem. It was at least an order of magnitude less expensive and faster than most alternatives available. And Amazon Redshift was fully managed from the start—you didn’t have to worry about capacity, provisioning, patching, monitoring, backups, and a host of other DBA headaches. Many customers, including Vevo, Yelp, Redfin, and Edmunds, migrated to Amazon Redshift to improve query performance, reduce DBA overhead, and lower the cost of analytics.

And our customers’ data continues to grow at a very fast rate. Across the board, gigabytes to petabytes, the average Amazon Redshift customer doubles the data analyzed every year. That’s why we implement features that help customers handle their growing data, for example to double the query throughput or improve the compression ratios from 3x to 4x. That gives our customers some time before they have to consider throwing away data or removing it from their analytic environments. However, there is an increasing number of AWS customers who each generate a petabyte of data every day—that’s an exabyte in only three years. There wasn’t a solution for customers like that. If your data is doubling every year, it’s not long before you have to find new, disruptive approaches that transform the cost, performance, and simplicity curves for managing data.

Let’s look at the options available today. You can use Hadoop-based technologies like Apache Hive with Amazon EMR. This is actually a pretty great solution because it makes it easy and cost-effective to operate directly on data in Amazon S3 without ingestion or transformation. You can spin up clusters as you wish when you need, and size them right for that specific job you’re running. These systems are great at high scale-out processing like scans, filters, and aggregates. On the other hand, they’re not that good at complex query processing. For example, join processing requires data to be shuffled across nodes—for a large amount of data and large numbers of nodes that gets very slow. And joins are intrinsic to any meaningful analytics problem.

You can also use a columnar MPP data warehouse like Amazon Redshift. These systems make it simple to run complex analytic queries with orders of magnitude faster performance for joins and aggregations performed over large datasets. Amazon Redshift, in particular, leverages high-performance local disks, sophisticated query execution, and join-optimized data formats. Because it is just standard SQL, you can keep using your existing ETL and BI tools. But you do have to load data, and you have to provision clusters against the storage and CPU requirements you need.

Both solutions have powerful attributes, but they force you to choose which attributes you want. We see this as a “tyranny of OR.” You can have the throughput of local disks OR the scale of Amazon S3. You can have sophisticated query optimization OR high-scale data processing. You can have fast join performance with optimized formats OR a range of data processing engines that work against common data formats. But you shouldn’t have to choose. At this scale, you really can’t afford to choose. You need “all of the above.”

Redshift Spectrum

We built Redshift Spectrum to end this “tyranny of OR.” With Redshift Spectrum, Amazon Redshift customers can easily query their data in Amazon S3. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of nodes to pull data, filter, project, aggregate, group, and sort. Like Amazon Athena, Redshift Spectrum is serverless and there’s nothing to provision or manage. You just pay for the resources you consume for the duration of your Redshift Spectrum query. Like Amazon Redshift itself, you get the benefits of a sophisticated query optimizer, fast access to data on local disks, and standard SQL. And like nothing else, Redshift Spectrum can execute highly sophisticated queries against an exabyte of data or more—in just minutes.

Redshift Spectrum is a built-in feature of Amazon Redshift, and your existing queries and BI tools will continue to work seamlessly. Under the covers, we manage a fleet of thousands of Redshift Spectrum nodes spread across multiple Availability Zones. These are transparently scaled and allocated to your queries based on the data that you need to process, with no provisioning or commitments. Redshift Spectrum is also highly concurrent—you can access your Amazon S3 data from any number of Amazon Redshift clusters.

The life of a Redshift Spectrum query

It all starts when Redshift Spectrum queries are submitted to the leader node of your Amazon Redshift cluster. The leader node optimizes, compiles, and pushes the query execution to the compute nodes in your Amazon Redshift cluster. Next, the compute nodes obtain the information describing the external tables from your data catalog, dynamically pruning nonrelevant partitions based on the filters and joins in your queries. The compute nodes also examine the data available locally and push down predicates to efficiently scan only the relevant objects in Amazon S3.

The Amazon Redshift compute nodes then generate multiple requests depending on the number of objects that need to be processed, and submit them concurrently to Redshift Spectrum, which pools thousands of Amazon EC2 instances per AWS Region. The Redshift Spectrum worker nodes scan, filter, and aggregate your data from Amazon S3, streaming required data for processing back to your Amazon Redshift cluster. Then, the final join and merge operations are performed locally in your cluster and the results are returned to your client.

Redshift Spectrum’s architecture offers several advantages. First, it elastically scales compute resources separately from the storage layer in Amazon S3. Second, it offers significantly higher concurrency because you can run multiple Amazon Redshift clusters and query the same data in Amazon S3. Third, Redshift Spectrum leverages the Amazon Redshift query optimizer to generate efficient query plans, even for complex queries with multi-table joins and window functions. Fourth, it operates directly on your source data in its native format (Parquet, RCFile, CSV, TSV, Sequence, Avro, RegexSerDe and more to come soon). This means that no data loading or transformation is needed. This also eliminates data duplication and associated costs. Fifth, operating on open data formats gives you the flexibility to leverage other AWS services and execution engines across your various teams to collaborate on the same data in Amazon S3. You get all of this, and because Redshift Spectrum is a feature of Amazon Redshift, you get the same level of end-to-end security, compliance, and certifications as with Amazon Redshift.

Designed for performance and cost-effectiveness

With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to leverage file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost. Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of Redshift Spectrum’s supported compression algorithms, less data is scanned.
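
To make this concrete, here is a minimal, hypothetical sketch of how an external table partitioned by date and stored as Parquet might be defined and queried from Amazon Redshift. The schema, table, bucket, and IAM role names below are placeholders for illustration, not part of the original post:

-- Register an external schema backed by a data catalog (role ARN is a placeholder).
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over Parquet files in S3, partitioned by date.
CREATE EXTERNAL TABLE spectrum_demo.clickstream (
  user_id  varchar(64),
  url      varchar(2048),
  duration integer
)
PARTITIONED BY (event_date date)
STORED AS PARQUET
LOCATION 's3://my-clickstream-bucket/events/';

-- Each partition is registered explicitly, pointing at its S3 prefix.
ALTER TABLE spectrum_demo.clickstream
ADD PARTITION (event_date='2017-01-01')
LOCATION 's3://my-clickstream-bucket/events/event_date=2017-01-01/';

-- Only the partitions and columns this query needs are scanned in S3.
SELECT event_date, count(*) AS views
FROM spectrum_demo.clickstream
WHERE event_date BETWEEN '2017-01-01' AND '2017-03-31'
GROUP BY event_date
ORDER BY event_date;

Partition pruning plus Parquet’s columnar layout is what keeps the amount of S3 data scanned, and therefore the cost, small.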

Amazon Redshift and Redshift Spectrum give you the best of both worlds. If you need to run frequent queries on the same data, you can normalize it, store it in Amazon Redshift, and get all of the benefits of a fully featured data warehouse for storing and querying structured data at a flat rate. At the same time, you can keep your additional data in multiple open file formats in Amazon S3, whether it is historical data or the most recent data, and extend your Amazon Redshift queries across your Amazon S3 data lake.

And that is how Amazon Redshift Spectrum scales data warehousing to exabytes—with no loading required. Redshift Spectrum ends the “tyranny of OR,” enabling you to store your data where you want, in the format you want, and have it available for fast processing using standard SQL when you need it, now and in the future.


Additional Reading

10 Best Practices for Amazon Redshift Spectrum
Amazon QuickSight Adds Support for Amazon Redshift Spectrum
Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data

About the Author

Maor Kleider is a Senior Product Manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.


Why Consumer Design is Good For Business

Post Syndicated from Yev original https://www.backblaze.com/blog/why-consumer-design-is-good-for-business/

Using the Backblaze Cloud Backup App

We know that business users sometimes ask, “Why can’t business software be as easy to use as consumer software?”

At Backblaze, we believe it can be.

We started our business to make backup easier for everyone, knowing that the primary reason people don’t back up is that it is too complicated and too intimidating.

Backblaze has spent the last decade building an unlimited, inexpensive, and best of all easy-to-use backup service. We designed it from the ground up, with the goal of making it a simple service – one that “just works.” We wanted it to be the easiest backup solution for grandmothers and IT administrators alike.

Having a product that’s intuitive and easy to use makes it ideal for people who don’t want to fret about backing up or worry about whether they selected the right files when their backup system was set up. Backblaze backs up all user data by default so there’s no worrying about missing something. What that means is that when you use Backblaze for Business, you’re getting a solution that works out of the box not just for the end user, but also for the account administrator.

Design for Enterprise Scalability but With Consumer Simplicity

Oftentimes, when a product is designed “for enterprise,” the result can be an unintuitive piece of software that only the systems administrators can navigate. While that may be acceptable for antivirus or anti-spam software, there are many products and services that should not require hours to learn to use. Some of the most common services that businesses use today are known for their ease-of-use. Dropbox Sync, Trello, and Slack come to mind.

Backblaze Online Backup is much the same. Regardless of whether you have one computer or are deploying to an organization of 1,000, Backblaze scales so that you and all your users get the same, simple service that backs up and makes data accessible.

Overcomplexity reduces efficiency

The last thing an IT professional wants is users asking them how a program on their computer works, or complaining about a process that’s supposed to be running in the background. The more bloated and over-designed products and services get, the more stumbling blocks appear before the end-user. When you’re developing a product there’s a fine line between adding features and creating an overwhelmingly complicated user interface. The cost of getting that balance wrong is that it will raise more questions than it provides answers, leaving customers and end-users confused with too many choices. Many of the players in the online backup space have made confusing design choices that leave customers perplexed. We believe easy is better for everyone.

Backblaze for Business is built on top of our award winning Computer Backup product that has been in market for over 10 years. We have over 350 PB under storage and have helped users save over 23 BILLION files. We know a lot about backup.

But businesses have unique needs, such as centralized user management and billing, reporting, monitoring usage, and the ability to act on behalf of any user. When an end-user (or the IT admin) installs Backblaze, the backup starts automatically, backing up all the user-data on the machine. There’s no need to select files or folders. The backup process just starts, because all of the data is important. We’ve heard time and time again that a user’s files were saved because we backed up an obscure directory where one or two important files would have been forgotten about had the user been forced to choose what to back up.

Backblaze just works—for everyone

The best products are the ones that don’t impede your workflow and work seamlessly with the processes you have in place. That’s another reason why having something designed with the end user in mind is helpful. You build software that is aware of its environment (not everyone has top-of-the-line computing systems) and stays out of the way.

Making sure that people are diligent about their backup strategy is hard enough. At Backblaze we believe that simplicity is key, and that’s why we designed a backup service that scales from 1 to 10,000 — without having to change a setting.

The post Why Consumer Design is Good For Business appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Game of Thrones Pirates Being Monitored By HBO, Warnings On The Way

Post Syndicated from Ernesto original https://torrentfreak.com/game-thrones-pirates-monitored-hbo-warnings-way-170719/

Earlier this week, HBO released the long-awaited seventh season of the hit series Game of Thrones.

The show has broken several piracy records over the years and, thus far, there has been plenty of interest in the latest season.

This hasn’t gone unnoticed by HBO. Soon after the first episode of the new season appeared online Sunday evening, the company’s anti-piracy partner IP Echelon started sending warnings targeted at torrenting pirates.

The warnings in question include the IP-addresses of alleged BitTorrent users and ask the associated ISPs to alert their subscribers, in order to prevent further infringements.

“We have information leading us to believe that the IP address xx.xxx.xxx.xx was used to download or share Game of Thrones without authorization,” the notification begins.

“HBO owns the copyright or exclusive rights to Game of Thrones, and the unauthorized download or distribution constitutes copyright infringement. Downloading unauthorized or unknown content is also a security risk for computers, devices, and networks.”

Under US copyright law, ISPs are not obligated to forward these emails, which are sent as a DMCA notification. However, many do as a courtesy to the affected rightsholders.

Redacted infringement details from one of the notices

The warnings are not targeted at a single swarm but cover a wide variety of torrents. TorrentFreak has already seen takedown notices for the following files, but it’s likely that many more are being tracked.

  • Game.of.Thrones.S07E01.720p.WEB.h264-TBS[eztv].mkv
  • Game.of.Thrones.S07E01.HDTV.x264-SVA[rarbg]
  • Game.of.Thrones.S07E01.WEB.h264-TBS[ettv]
  • Game.of.Thrones.S07E01.HDTV.x264-SVA[eztv].mkv
  • Game.of.Thrones.S07E01.720p.HDTV.x264-AVS[eztv].mkv

This isn’t the first time that Game of Thrones pirates have received these kinds of warnings. Similar notices were sent out last year for pirated episodes of the sixth season, and it’s now clear that HBO is not backing down.

Although HBO stresses that copyright infringement is against the law, there are no legal strings attached for the subscribers in question. The company doesn’t know the identity of the alleged pirates, and would need to go to court to find out. This has never happened before.

Filing lawsuits against Game of Thrones fans is probably not high on HBO’s list, but the company hopes that affected subscribers will think twice before downloading future episodes after they are warned.

The DMCA notice asks ISPs to inform subscribers about the various legal alternatives that are available, to give them a push in the right direction.

“We also encourage you to inform the subscriber that HBO programming can easily be watched and streamed on many devices legally by adding HBO to the subscriber’s television package,” the notice reads.

While these types of messages may have an effect on some, they only cover a small fraction of the piracy landscape. Millions of people are using pirate streaming tools and websites to watch Game of Thrones, and these views can’t be monitored.

In addition, the fact that many broadcasters worldwide suffered technical issues and outages when Game of Thrones premiered doesn’t help either. The legal options should be superior to the pirated offerings, not the other way around.

A redacted copy of one of the notices is available below.

Dear xxx Communications,

This message is sent on behalf of HOME BOX OFFICE, INC.

We have information leading us to believe that the IP address xx.xxx.xxx.xxx was used to download or share Game of Thrones without authorization (additional details are listed below). HBO owns the copyright or exclusive rights to Game of Thrones, and the unauthorized download or distribution constitutes copyright infringement. Downloading unauthorized or unknown content is also a security risk for computers, devices, and networks.

As the owner of the IP address, HBO requests that xxx Communications immediately contact the subscriber who was assigned the IP address at the date and time below with the details of this notice, and take the proper steps to prevent further downloading or sharing of unauthorized content and additional infringement notices.

We also encourage you to inform the subscriber that HBO programming can easily be watched and streamed on many devices legally by adding HBO to the subscriber’s television package.

We have a good faith belief that use of the copyrighted material detailed below is not authorized by the copyright owner, its agent, or the law. The information in this notice is accurate and we state, under penalty of perjury, that we are authorized to act on behalf of the owner of an exclusive right that is allegedly infringed. This letter is not a complete statement of HBO’s rights in connection with this matter, and nothing contained herein constitutes an express or implied waiver of any rights or remedies of HBO in connection with this matter, all of which are expressly reserved.

We appreciate your assistance and thank you for your cooperation in this matter. Your prompt response is requested. Any further enquiries can be directed to [email protected] Please include this message with your enquiry to ensure a quick response.

Respectfully,

Adrian Leatherland
CEO
IP-Echelon

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Analyze OpenFDA Data in R with Amazon S3 and Amazon Athena

Post Syndicated from Ryan Hood original https://aws.amazon.com/blogs/big-data/analyze-openfda-data-in-r-with-amazon-s3-and-amazon-athena/

One of the great benefits of Amazon S3 is the ability to host, share, or consume public data sets. This provides transparency into data to which an external data scientist or developer might not normally have access. By exposing the data to the public, you can glean many insights that would have been difficult with a data silo.

The openFDA project creates easy access to the high value, high priority, and public access data of the Food and Drug Administration (FDA). The data has been formatted and documented in consumer-friendly standards. Critical data related to drugs, devices, and food has been harmonized and can easily be called by application developers and researchers via API calls. OpenFDA has published two whitepapers that drill into the technical underpinnings of the API infrastructure as well as how to properly analyze the data in R. In addition, FDA makes openFDA data available on S3 in raw format.

In this post, I show how to use S3, Amazon EMR, and Amazon Athena to analyze the drug adverse events dataset. A drug adverse event is an undesirable experience associated with the use of a drug, including serious drug side effects, product use errors, product quality problems, and therapeutic failures.

Data considerations

Keep in mind that this data does have limitations. In the United States, these adverse events are submitted to the FDA voluntarily by consumers, so there may not be reports for all events that occurred. There is no certainty that a reported event was actually due to the product. The FDA does not require that a causal relationship between a product and an event be proven, and reports do not always contain enough detail to evaluate an event. Because of this, there is no way to identify the true number of events. The important takeaway from all this is that the information contained in this data has not been verified to establish cause-and-effect relationships. Despite this disclaimer, many interesting insights and much value can be derived from the data to accelerate drug safety research.

Data analysis using SQL

For application developers who want to perform targeted searching and lookups, the API endpoints provided by the openFDA project are “ready to go” for software integration using a standard API powered by Elasticsearch, NodeJS, and Docker. However, for data analysis purposes, it is often easier to work with the data using SQL and statistical packages that expect a SQL table structure. For large-scale analysis, APIs often have query limits, such as 5000 records per query. This can cause extra work for data scientists who want to analyze the full dataset instead of small subsets of data.

To address the concern of requiring all the data in a single dataset, the openFDA project released the full 100 GB of harmonized data files that back the openFDA project onto S3. Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It’s a quick and easy way to answer your questions about adverse events and aspirin that does not require you to spin up databases or servers.

While you could point tools directly at the openFDA S3 files, you can find greatly improved performance and use of the data by following some of the preparation steps later in this post.

Architecture

This post explains how to use the following architecture to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.

Steps:

  1. Load the openFDA /drug/event dataset into Spark and convert it to gzip to allow for streaming.
  2. Transform the data in Spark and save the results as a Parquet file in S3.
  3. Query the S3 Parquet file with Athena.
  4. Perform visualization and analysis of the data in R and Python on Amazon EC2.

Optimizing public data sets: A primer on data preparation

Those who want to jump right into preparing the files for Athena may want to skip ahead to the next section.

Transforming, or pre-processing, files is a common task for using many public data sets. Before you jump into the specific steps for transforming the openFDA data files into a format optimized for Athena, I thought it would be worthwhile to provide a quick exploration of the problem.

Making a dataset in S3 efficiently accessible with minimal transformation for the end user has two key elements:

  1. Partitioning the data into objects that contain a complete part of the data (such as data created within a specific month).
  2. Using file formats that make it easy for applications to locate subsets of data (for example, gzip, Parquet, ORC, etc.).

With these two key elements in mind, you can now apply transformations to the openFDA adverse event data to prepare it for Athena. You might find the data techniques employed in this post to be applicable to many of the questions you might want to ask of the public data sets stored in Amazon S3.
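
As a purely hypothetical illustration (the bucket name and key layout below are invented for this post, not taken from openFDA), partitioned, columnar data in S3 might be laid out like this, letting engines such as Athena skip entire partitions by key and read only the needed columns within each Parquet object:

s3://example-bucket/drugevents/year=2015/quarter=q3/part-00000.parquet
s3://example-bucket/drugevents/year=2015/quarter=q4/part-00000.parquet
s3://example-bucket/drugevents/year=2016/quarter=q1/part-00000.parquet

A query restricted to year=2016 never has to touch the 2015 objects at all.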

Before you get started, I encourage those who are interested in doing deeper healthcare analysis on AWS to make sure that you first read the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing patient health information (PHI).

Also, the adverse event analysis shown for aspirin is strictly for demonstration purposes and should not be used for any real decision or taken as anything other than a demonstration of AWS capabilities. However, there have been robust case studies published that have explored a causal relationship between aspirin and adverse reactions using OpenFDA data. If you are seeking research on aspirin or its risks, visit organizations such as the Centers for Disease Control and Prevention (CDC) or the Institute of Medicine (IOM).

Preparing data for Athena

For this walkthrough, you will start with the FDA adverse events dataset, which is stored as JSON files within zip archives on S3. You then convert it to Parquet for analysis. Why do you need to convert it? The original data download is stored in objects that are partitioned by quarter.

Here is a small sample of what you find in the adverse events (/drugs/event) section of the openFDA website.

If you were looking for events that happened in a specific quarter, this is not a bad solution. For most other scenarios, such as looking across the full history of aspirin events, it requires you to access a lot of data that you won’t need. The zip file format is not ideal for using data in place because zip readers must have random access to the file, which means the data can’t be streamed. Additionally, the zip files contain large JSON objects.

To read the data in these JSON files, a streaming JSON decoder must be used or a computer with a significant amount of RAM must decode the JSON. Opening up these files for public consumption is a great start. However, you still need to prepare the data with a few lines of Spark code so that the JSON can be streamed.

Step 1:  Convert the file types

Using Apache Spark on EMR, you can extract all of the zip files and pull out the events from the JSON files. To do this, use the Scala code below to decompress the zip files and create text files. In addition, compress the JSON files with gzip to improve Spark’s performance and reduce your overall storage footprint. The Scala code can be run in either the Spark Shell or in an Apache Zeppelin notebook on your EMR cluster.

If you are unfamiliar with either Apache Zeppelin or the Spark Shell, the following posts serve as great references:

 

import scala.io.Source
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.hadoop.io.compress.GzipCodec

// Input Directory
val inputFile = "s3://download.open.fda.gov/drug/event/2015q4/*.json.zip";

// Output Directory
val outputDir = "s3://{YOUR OUTPUT BUCKET HERE}/output/2015q4/";

// Extract zip files from the input directory
val zipFiles = sc.binaryFiles(inputFile);

// Process zip file to extract the json as text file and save it
// in the output directory 
val rdd = zipFiles.flatMap((file: (String, PortableDataStream)) => {
    val zipStream = new ZipInputStream(file._2.open)
    val entry = zipStream.getNextEntry
    val iter = Source.fromInputStream(zipStream).getLines
    iter
}).map(_.replaceAll("\\s+","")).saveAsTextFile(outputDir, classOf[GzipCodec])

Step 2:  Transform JSON into Parquet

With just a few more lines of Scala code, you can use Spark’s abstractions to convert the JSON into a Spark DataFrame and then export the data back to S3 in Parquet format.

Spark requires the JSON to be in JSON Lines format to be parsed correctly into a DataFrame.

// Output Parquet directory
val outputDir = "s3://{YOUR OUTPUT BUCKET NAME}/output/drugevents"
// Input json file
val inputJson = "s3://{YOUR OUTPUT BUCKET NAME}/output/2015q4/*"
// Load dataframe from json file multiline 
val df = spark.read.json(sc.wholeTextFiles(inputJson).values)
// Extract results from dataframe
val results = df.select("results")
// Save it to Parquet
results.write.parquet(outputDir)

Step 3:  Create an Athena table

With the data cleanly prepared and stored in S3 using the Parquet format, you can now place an Athena table on top of it to get a better understanding of the underlying data.

Because the openFDA data structure incorporates several layers of nesting, it can be a complex process to try to manually derive the underlying schema in a Hive-compatible format. To shorten this process, you can load the top row of the DataFrame from the previous step into a Hive table within Zeppelin and then extract the “create  table” statement from SparkSQL.

results.createOrReplaceTempView("data")

val top1 = spark.sql("select * from data tablesample(1 rows)")

top1.write.format("parquet").mode("overwrite").saveAsTable("drugevents")

val show_cmd = spark.sql("show create table drugevents").show(1, false)

This returns a “create table” statement that you can almost paste directly into the Athena console. Make some small modifications (adding the word “external” and replacing “using” with “stored as”), and then execute the code in the Athena query editor. The table is created.

For the openFDA data, the DDL returns all string fields, as the date format used in your dataset does not conform to the yyyy-mm-dd hh:mm:ss[.f…] format required by Hive. For your analysis, the string format works appropriately, but it would be possible to extend this code to use a Presto function to convert the strings into timestamps.

CREATE EXTERNAL TABLE  drugevents (
   companynumb  string, 
   safetyreportid  string, 
   safetyreportversion  string, 
   receiptdate  string, 
   patientagegroup  string, 
   patientdeathdate  string, 
   patientsex  string, 
   patientweight  string, 
   serious  string, 
   seriousnesscongenitalanomali  string, 
   seriousnessdeath  string, 
   seriousnessdisabling  string, 
   seriousnesshospitalization  string, 
   seriousnesslifethreatening  string, 
   seriousnessother  string, 
   actiondrug  string, 
   activesubstancename  string, 
   drugadditional  string, 
   drugadministrationroute  string, 
   drugcharacterization  string, 
   drugindication  string, 
   drugauthorizationnumb  string, 
   medicinalproduct  string, 
   drugdosageform  string, 
   drugdosagetext  string, 
   reactionoutcome  string, 
   reactionmeddrapt  string, 
   reactionmeddraversionpt  string)
STORED AS parquet
LOCATION
  's3://{YOUR TARGET BUCKET}/output/drugevents'

With the Athena table in place, you can start to explore the data by running ad hoc queries within Athena or doing more advanced statistical analysis in R.
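
For example, here is a hypothetical exploratory query (assuming the table was created in a database named fda, as in the queries later in this post) that counts the most frequently reported reactions for aspirin in 2015. It relies on receiptdate being an 8-digit string such as 20151231, which is what the rpad(receiptdate, 4, 'NA') trick in the later queries also assumes; a Presto function such as date_parse(receiptdate, '%Y%m%d') could be used instead if real timestamps were needed:

SELECT reactionmeddrapt,
       count(DISTINCT safetyreportid) AS report_count
FROM fda.drugevents
WHERE medicinalproduct = 'ASPIRIN'
  AND rpad(receiptdate, 4, 'NA') = '2015'
GROUP BY reactionmeddrapt
ORDER BY report_count DESC
LIMIT 10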

Using SQL and R to analyze adverse events

Using the openFDA data with Athena makes it very easy to translate your questions into SQL code and perform quick analysis on the data. After you have prepared the data for Athena, you can begin to explore the relationship between aspirin and adverse drug events, as an example. One of the most common metrics to measure adverse drug events is the Proportional Reporting Ratio (PRR). It is defined as:

PRR = (m/n)/( (M-m)/(N-n) )
Where
m = #reports with drug and event
n = #reports with drug
M = #reports with event in database
N = #reports in database
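
As a quick numerical illustration (these counts are hypothetical, not taken from the dataset): if a drug appears in n = 10,000 reports, m = 100 of which mention the event, while the database as a whole contains N = 1,000,000 reports of which M = 5,000 mention the event, then PRR = (100/10,000) / ((5,000-100)/(1,000,000-10,000)) = 0.0100/0.00495 ≈ 2.0, meaning the event is reported roughly twice as often with the drug as without it.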

Gastrointestinal haemorrhage has the highest PRR of any reaction to aspirin when viewed in aggregate. One question you may want to ask is how the PRR has trended on a yearly basis for gastrointestinal haemorrhage since 2005.

Using the following query in Athena, you can see the PRR trend of “GASTROINTESTINAL HAEMORRHAGE” reactions with “ASPIRIN” since 2005:

with drug_and_event as 
(select rpad(receiptdate, 4, 'NA') as receipt_year
    , reactionmeddrapt
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug_and_event 
from fda.drugevents
where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by reactionmeddrapt, rpad(receiptdate, 4, 'NA') 
), reports_with_drug as 
(
select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug 
 from fda.drugevents 
 where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and medicinalproduct = 'ASPIRIN'
group by rpad(receiptdate, 4, 'NA') 
), reports_with_event as 
(
   select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_event 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
     and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
   group by rpad(receiptdate, 4, 'NA')
), total_reports as 
(
   select rpad(receiptdate, 4, 'NA') as receipt_year
    , count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as total_reports 
   from fda.drugevents
   where rpad(receiptdate,4,'NA') 
     between '2005' and '2015' 
   group by rpad(receiptdate, 4, 'NA')
)
select  drug_and_event.receipt_year, 
(1.0 * drug_and_event.reports_with_drug_and_event/reports_with_drug.reports_with_drug)/ (1.0 * (reports_with_event.reports_with_event- drug_and_event.reports_with_drug_and_event)/(total_reports.total_reports-reports_with_drug.reports_with_drug)) as prr
, drug_and_event.reports_with_drug_and_event
, reports_with_drug.reports_with_drug
, reports_with_event.reports_with_event
, total_reports.total_reports
from drug_and_event
    inner join reports_with_drug on  drug_and_event.receipt_year = reports_with_drug.receipt_year   
    inner join reports_with_event on  drug_and_event.receipt_year = reports_with_event.receipt_year
    inner join total_reports on  drug_and_event.receipt_year = total_reports.receipt_year
order by  drug_and_event.receipt_year


One nice feature of Athena is that you can quickly connect to it via R or any other tool that can use a JDBC driver to visualize the data and understand it more clearly.

With this quick R script that can be run in R Studio either locally or on an EC2 instance, you can create a visualization of the PRR and Reporting Odds Ratio (RoR) for “GASTROINTESTINAL HAEMORRHAGE” reactions from “ASPIRIN” since 2005 to better understand these trends.

# Connect to Athena (assumes the Athena JDBC driver object `drv` has already been created, e.g. with RJDBC)
conn <- dbConnect(drv, '<Your JDBC URL>', s3_staging_dir="<Your S3 Location>", user=Sys.getenv("USER_NAME"), password=Sys.getenv("USER_PASSWORD"))

# Declare Adverse Event
adverseEvent <- "'GASTROINTESTINAL HAEMORRHAGE'"

# Build SQL Blocks
sqlFirst <- "SELECT rpad(receiptdate, 4, 'NA') as receipt_year, count(DISTINCT safetyreportid) as event_count FROM fda.drugsflat WHERE rpad(receiptdate,4,'NA') between '2005' and '2015'"
sqlEnd <- "GROUP BY rpad(receiptdate, 4, 'NA') ORDER BY receipt_year"

# Extract Aspirin with adverse event counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN' AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
aspirinAdverseCount = dbGetQuery(conn,sql)

# Extract Aspirin counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN'", sqlEnd,sep=" ")
aspirinCount = dbGetQuery(conn,sql)

# Extract adverse event counts
sql <- paste(sqlFirst,"AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
adverseCount = dbGetQuery(conn,sql)

# All Drug Adverse event Counts
sql <- paste(sqlFirst, sqlEnd,sep=" ")
allDrugCount = dbGetQuery(conn,sql)

# Select correct rows
selAll =  allDrugCount$receipt_year == aspirinAdverseCount$receipt_year
selAspirin = aspirinCount$receipt_year == aspirinAdverseCount$receipt_year
selAdverse = adverseCount$receipt_year == aspirinAdverseCount$receipt_year

# Calculate Numbers
m <- c(aspirinAdverseCount$event_count)
n <- c(aspirinCount[selAspirin,2])
M <- c(adverseCount[selAdverse,2])
N <- c(allDrugCount[selAll,2])

# Calculate proptional reporting ratio
PRR = (m/n)/((M-m)/(N-n))

# Calculate reporting Odds Ratio
d = n-m
D = N-M
ROR = (m/d)/(M/D)

# Plot the PRR and ROR
g_range <- range(0, PRR,ROR)
g_range[2] <- g_range[2] + 3
yearLen = length(aspirinAdverseCount$receipt_year)
ax <- aspirinAdverseCount$receipt_year   # x-axis labels (report years)
plot(PRR, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
axis(1, 1:yearLen, lab=ax)
axis(2, las=1, at=1*0:g_range[2])
box()
lines(ROR, type="o", pch=22, lty=2, col="red")

As you can see, the PRR and RoR have both remained fairly steady over this time range. With the R Script above, all you need to do is change the adverseEvent variable from GASTROINTESTINAL HAEMORRHAGE to another type of reaction to analyze and compare those trends.

Summary

In this walkthrough:

  • You used a Scala script on EMR to convert the openFDA zip files to gzip.
  • You then transformed the JSON blobs into flattened Parquet files using Spark on EMR.
  • You created an Athena DDL so that you could query these Parquet files residing in S3.
  • Finally, you pointed the R package at the Athena table to analyze the data without pulling it into a database or creating your own servers.

If you have questions or suggestions, please comment below.


Next Steps

Take your skills to the next level. Learn how to optimize Amazon S3 for an architecture commonly used to enable genomic data analysis. Also, be sure to read more about running R on Amazon Athena.



About the Authors

Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys watching the Cubs win the World Series and attempting to Sous-vide anything he can find in his refrigerator.


Vikram Anand is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys playing soccer and watching the NFL & European Soccer leagues.


Dave Rocamora is a Solutions Architect at Amazon Web Services on the Open Data team. Dave is based in Seattle and when he is not opening data, he enjoys biking and drinking coffee outside.


Ultrasonic pi-ano

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/ultrasonic-piano/

At the Raspberry Pi Foundation, we love a good music project. So of course we’re excited to welcome Andy Grove‘s ultrasonic piano to the collection! It is a thing of beauty… and noise. Don’t let the name fool you – this build can do so much more than sound like a piano.

Ultrasonic Pi Piano – Full Demo

The Ultrasonic Pi Piano uses HC-SR04 ultrasonic sensors for input and generates MIDI instructions that are played by fluidsynth. For more information: http://theotherandygrove.com/projects/ultrasonic-pi-piano/

What’s an ultrasonic piano?

What we have here, people of all genders, is really a theremin on steroids. The build’s eight ultrasonic distance sensors detect hand movements and, with the help of an octasonic breakout board, a Raspberry Pi 3 translates their signals into notes. But that’s not all: this digital instrument is almost endlessly customisable – you can set each sensor to a different octave, or to a different instrument.

octasonic breakout board

The breakout board designed by Andy

Andy has implemented gesture controls to allow you to switch between modes you have preset. In his video, you can see that holding your hands over the two sensors most distant from each other changes the instrument. Say you’re bored of the piano – try a xylophone! Not your jam? How about a harpsichord? Or a clarinet? In fact, there are 128 MIDI instruments and sound effects to choose from. Go nuts and compose a piece using tuba, ocarina, and the noise of a guitar fret!

How to build the ultrasonic piano

If you head over to Instructables, you’ll find the thorough write-up Andy has provided. He has also made all his scripts, written in Rust, available on GitHub. Finally, he’s even added a video on how to make a housing, so your ultrasonic piano can look more like a proper instrument, and less like a pile of electronics.

Ultrasonic Pi Piano Enclosure

Uploaded by Andy Grove on 2017-04-13.

Make your own!

If you follow us on Twitter, you may have seen photos and footage of the Raspberry Pi staff attending a Pi Towers Picademy. Like Andy*, quite a few of us are massive Whovians. Consequently, one of our final builds on the course was an ultrasonic theremin that gave off a sound rather like a dying Dalek. Take a look at our masterwork here! We loved our make so much that we’ve since turned the instructions for building it into a free resource. Go ahead and build your own! And be sure to share your compositions with us in the comments.

Sonic the hedgehog is feeling the beat

Sonic is feeling the groove as well

* He has a full-sized Dalek at home. I know, right?

The post Ultrasonic pi-ano appeared first on Raspberry Pi.

Google Removed 2.5 Billion ‘Pirate’ Search Results

Post Syndicated from Ernesto original https://torrentfreak.com/google-removed-2-5-billion-pirate-search-results-170706/

Google is coping with a continuous increase in takedown requests from copyright holders, which target pirate sites in search results.

Just a few years ago the search engine removed ‘only’ a few thousand URLs per day, but this has since grown to millions. When added up, the numbers are truly staggering.

In its transparency report, Google now states that it has removed 2.5 billion reported links for alleged copyright infringement. This is roughly 90 percent of all requests the company received.

The chart below breaks down the takedown requests into several categories. In addition to the URLs that were removed, the search engine also received 154 million duplicate URLs and 25 million invalid URLs.

Another 80 million links remain in search results because they can’t be classified as copyright infringing, according to Google.

Google’s takedown overview

The 2.5 billion removed links are spread out over 1.1 million websites. File-storage service 4shared takes the crown with 64 million targeted URLs, followed at a distance by mp3toys.xyz, rapidgator.net, uploaded.net, and chomikuj.pl.

While rightsholders have increased their takedown efforts over the years, the major entertainment industry groups are still not happy with the current state of Google’s takedown process.

One of the main complaints has been that content which Google de-lists often reappears under new URLs.

“They need to take more proactive responsibility to reduce infringing content that appears on their platform, and, where we expressly notify infringing content to them, to ensure that they do not only take it down, but also keep it down,” a BPI spokesperson told us last month.

Ideally, rightsholders would like Google to ensure that content “stays down” while blocking the most notorious pirate sites from search results entirely. Known ‘pirate’ sites such as The Pirate Bay have no place in search results, they argue.

Google, however, believes such broad measures will lead to all sorts of problems, including over-blocking, and maintains that the current system is working as the DMCA was intended.

The search engine did implement various other initiatives to counter piracy, including the downranking of pirate sites and promoting legal options in search results, which it details in its regularly updated “How Google Fights Piracy” report.

In addition, Google and various rightsholders have signed a voluntary agreement to address “domain hopping” by pirate sites and share data to better understand how users are searching for content. For now, however, this effort is limited to the UK.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Bicrophonic Research Institute and the Sonic Bike

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/sonic-bike/

The Bicrophonic Sonic Bike, created by British sound artist Kaffe Matthews, utilises a Raspberry Pi and GPS signals to map location data and plays music and sound in response to the places you take it on your cycling adventures.

What is Bicrophonics?

Bicrophonics is about the mobility of sound, experienced and shared within a moving space, free of headphones and free of the internet. Music made by the journey you take, played with the space that you move through. The Bicrophonic Research Institute (BRI) http://sonicbikes.net

Cycling and music

I’m sure I wasn’t the only teen to go for bike rides with a group of friends and a radio. Spurred on by our favourite movie, the mid-nineties classic Now and Then, we’d hook up a pair of cheap portable speakers to our handlebars, crank up the volume, and sing our hearts out as we cycled aimlessly down country lanes in the cool light evenings of the British summer.

While Sonic Bikes don’t belt out the same classics that my precariously attached speakers provided, they do give you the same sense of connection to your travelling companions via sound. Linked to GPS locations on the same preset map of zones, each bike can produce the same music, creating a cloud of sound as you cycle.

Sonic Bikes

The Sonic Bike uses five physical components: a Raspberry Pi, power source, USB GPS receiver, rechargeable speakers, and subwoofer. Within the Raspberry Pi, the build utilises mapping software to divide a map into zones and connect each zone with a specific music track.

Sonic Bikes Raspberry Pi

Custom software enables the Raspberry Pi to locate itself among the zones using the USB GPS receiver. Then it plays back the appropriate track until it registers a new zone.

Bicrophonic Research Institute

The Bicrophonic Research Institute is a collective of artists and coders with the shared goal of creating sound directed by people and places via Sonic Bikes. In their own words:

Bicrophonics is about the mobility of sound, experienced and shared within a moving space, free of headphones and free of the internet. Music made by the journey you take, played with the space that you move through.

Their technology has potential beyond the aims of the BRI. The Sonic Bike software could be useful for navigation, logging data and playing beats to indicate when to alter speed or direction. You could even use it to create a guided cycle tour, including automatically reproduced information about specific places on the route.

For the creators of Sonic Bike, the project is ever-evolving, and “continues to be researched and developed to expand the compositional potentials and unique listening experiences it creates.”

Sensory Bike

A good example of this evolution is the Sensory Bike. This offshoot of the Sonic Bike idea plays sounds guided by the cyclist’s own movements – it acts like a two-wheeled musical instrument!

lean to go up, slow to go loud,

a work for Sensory Bikes, the Berlin wall and audience to ride it. ‘ lean to go up, slow to go loud ‘ explores freedom and celebrates escape. Celebrating human energy to find solutions, hot air balloons take off, train lines sing, people cheer and nature continues to grow.

Sensors on the wheels, handlebars, and brakes, together with a Sense HAT at the rear, register the unique way in which the rider navigates their location. The bike produces output based on these variables. Its creators at BRI say:

The Sensory Bike becomes a performative instrument – with riders choosing to go slow, go fast, to hop, zigzag, or circle, creating their own unique sound piece that speeds, reverses, and changes pitch while they dance on their bicycle.

Build your own Sonic Bike

As for many wonderful Raspberry Pi-based builds, the project’s code is available on GitHub, enabling makers to recreate it. All the BRI team ask is that you contact them so they can learn more of your plans and help in any way possible. They even provide code to create your own Sonic Kayak using GPS zones, temperature sensors, and an underwater microphone!

Sonic Kayaks explained

Sonic Kayaks are musical instruments for expanding our senses and scientific instruments for gathering marine micro-climate data. Made by foAm_Kernow with the Bicrophonic Research Institute (BRI), two were first launched at the British Science Festival in Swansea Bay September 6th 2016 and used by the public for 2 days.

The post Bicrophonic Research Institute and the Sonic Bike appeared first on Raspberry Pi.

Goodbye To Bob Chassell

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2017/07/03/Chassell.html

It’s fortunately more common in Free Software communities today to
properly value contributions from non-developers. Historically, though,
contributions from developers were often overvalued and contributions from
others grossly undervalued. One person trailblazed as (likely) the
earliest non-developer contributor to software freedom. His name was
Robert J. Chassell — called Bob by his friends and colleagues. Over
the weekend, our community lost Bob after a long battle with a degenerative
illness.

I am one of the few of my generation in the Free Software community who
had the opportunity to know Bob. He was already semi-retired in the late
1990s when I first became involved with Free Software, but he enjoyed
giving talks about Free Software and occasionally worked the FSF booths at
events where I had begun to volunteer in 1997. He was the first person to
offer mentorship to me as I began the long road of becoming a professional
software freedom activist.

I regularly credit Bob as the first Executive Director of the FSF. While
he technically never held that title, he served as Treasurer for many years
and was the de-facto non-technical manager at the FSF for its first decade
of existence. One need only read
the earliest
issues of the GNU’s Bulletin
to see just a sampling of
the plethora of contributions that Bob made to the FSF and Free Software
generally.

Bob’s primary forte was as a writer and he came to Free Software as a
technical writer. Having focused his career on documenting software and how
it worked to help users make the most of it, software freedom — the
right to improve and modify not only the software, but its documentation as
well — was a moral belief that he held strongly. Bob was an early
member of the privileged group that now encompasses most people in
industrialized society: a non-developer who sees the value in computing and
the improvement it can bring to life. However, Bob’s realization that users
like him (and not just developers) faced detrimental impact from proprietary
software remains somewhat rare, even today. Thus, Bob died in a world where
he was still unique among non-developers: fighting for software freedom as an
essential right for all who use computers.

Bob coined a phrase that I still love to this day. He said once that the
job that we must do as activists was “preserve, protect and promote
software freedom”. Only a skilled writer such as he could come up
with such a perfectly concise alliteration that nevertheless rolls off the
tongue without stuttering. Today, I pulled up an email I sent to Bob in
November 2006 to tell him that (when Novell made their bizarre
software-freedom-unfriendly patent deal with Microsoft)
Novell had coopted his language in their FAQ on the matter. Bob wrote
back: “I am not surprised. You can bet everything [we’ve ever come up
with] will be used against us.”
Bob’s decade-old words are prescient
when I look at the cooption we now face daily in Free Software. I acutely
feel the loss of his insight and thoughtfulness.

One of the saddest facts about Bob’s illness is that his voice was quite
literally lost many years before we lost him entirely. His illness made it
nearly impossible for him to speak. In the late 1990s, I had the pleasure
of regularly hearing Bob’s voice, when I accompanied Bob to talks and
speeches at various conferences. That included the wonderful highlight of
his acceptance speech of GNU’s 2001 achievement award from the USENIX
Association. (I lament that no recordings of any of these talks seem to
be available anywhere.) Throughout the early 2000s, I
would speak to Bob on the telephone at least once a month; he would offer
his sage advice and mentorship in those early years of my professional
software freedom career. Losing his voice in our community has been a
slow-moving tragedy as his illness has progressed. This weekend, that
unique voice was lost to us forever.


I found out we lost Bob through the folks at the FSF, and I don’t at this
time have information about his family’s wishes regarding tributes to him.
I’ll update the post when/if I have them.

In the meantime, the best I can suggest is that anyone who would like to
posthumously get to know Bob please read (what I believe was) the favorite
book that he wrote, An Introduction to Programming in
Emacs Lisp. Bob was a huge
advocate of non-developers learning “a little bit” of
programming — just enough to make their lives easier when they used
computers. He used GNU Emacs from its earliest versions and I recall he
was absolutely giddy to discover new features, help document them, and
teach them to new users. I hope those of you that both already love and
use Emacs and those who don’t will take a moment to read what Bob had to
teach us about his favorite program.

PiCorder, the miniature camcorder

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/picorder/

The modest dimensions of our Raspberry Pi Zero and its wirelessly connectable sibling, the Pi Zero W, enable makers in our community to build devices that are very small indeed. The PiCorder built by Wayne Keenan is probably the slimmest Pi-powered video-recording device we’ve ever seen.

PiCorder – Pimoroni HyperPixel

A simple Pi-camcorder using @pimoroni #HyperPixel, ZeroLipo, lipo bat, camera and #PiZeroW. All parts from the Pirates, total of ~£85. Project build instructions: https://www.hackster.io/TheBubbleworks/picorder-0eb94d

PiCorder hardware

Wayne’s PiCorder is a very straightforward make. On the hardware side, it features a Pimoroni HyperPixel screen, Pi Zero camera module, and Zero LiPo plus LiPo battery pack. To put it together, he simply soldered header pins onto a Zero W, and connected all the components to it – easy as Pi! (Yes, I went there.)

PiCorder

So sleek as to be almost aerodynamic

Recording with the PiCorder (rePiCording?)

Then it was just a matter of installing the HyperPixel driver on the Pi, and the PiCorder was good to go. In this basic setup, recording is controlled via SSH. However, there’s a discussion about better ways to control the device in the comments on Wayne’s write-up. As the HyperPixel is a touchscreen, adding a GUI would make full use of its capabilities.

Picorder screen

Think about how many screens you’re looking at right now

The PiCorder is a great project to recreate if you’re looking to build a small portable camera. If you’re new to soldering, this build is perfect for you: just follow our ‘How to solder’ video and tutorial, and you’re on your way. This could be the start of your journey into the magical world of physical computing!

You could also check our blog on Alex Ellis‘s implementation of YouTube live-streaming for the Pi, and learn how to share your videos in real time.

Cool camera projects

Our educational resources include plenty of cool projects that could use the PiCorder, or for which the device could be adapted.

Get your head around using the official Raspberry Pi Camera Module with this picamera tutorial. Learn how to set up a stationary or wearable time-lapse camera, and turn your images into animated GIFs. You could also kickstart your career as a director by making an amazing stop-motion film!

No matter which camera project you choose to work on, we’d love to see the results. So be sure to share a link in the comments.

The post PiCorder, the miniature camcorder appeared first on Raspberry Pi.

Yet more reasons to disagree with experts on nPetya

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/yet-more-reasons-to-disagree-with.html

In WW II, they looked at planes returning from bombing missions that were shot full of holes. Their natural conclusion was to add more armor to the sections that were damaged, to protect them in the future. But wait, said the statisticians. The original damage is likely spread evenly across the plane. Damage on returning planes indicates where a plane could take damage and still return. The undamaged areas are where they were hit and couldn’t return. Thus, it’s the undamaged areas you need to protect.

This is called survivorship bias.
Many experts are making the same mistake with regards to the nPetya ransomware. 
I hate to point this out, because they are all experts I admire and respect, especially @MalwareJake, but it’s still an error. An example is this tweet:
The context of this tweet is the discussion of why nPetya was well written with regards to spreading, but full of bugs with regards to collecting on the ransom. The conclusion, therefore, is that it wasn’t intended to be ransomware, but was intended to simply be a “wiper”, to cause destruction.
But this is just survivorship bias. If nPetya had been written the other way, with excellent ransomware features and poor spreading, we would not now be talking about it. Even that initial seeding with the trojaned MeDoc update wouldn’t have spread it far enough.
In other words, all malware samples we get are good at spreading, either on their own, or because the creator did a good job seeding them. It’s because we never see the ones that didn’t spread.
With regards to nPetya, a lot of experts are making this claim. Since it spread so well, but had hopelessly crippled ransomware features, that must have been the intent all along. Yet, as we see from survivorship bias, none of us would’ve seen nPetya had it not been for the spreading feature.