Tag Archives: Proxima

People can’t read (Equifax edition)

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/09/people-cant-read-equifax-edition.html

One of these days I’m going to write a guide for journalists reporting on the cyber. One of the items I’d stress is that they often fail to read the text of what is being said, but instead read some sort of subtext that wasn’t explicitly said. This is valid sometimes — as the subtext is what the writer intended all along, even if they didn’t explicitly write it. Other times, though the imagined subtext is not what the writer intended at all.

A good example is the recent Equifax breach. The original statement says:

Equifax Inc. (NYSE: EFX) today announced a cybersecurity incident potentially impacting approximately 143 million U.S. consumers.

The word consumers was widely translated to customers, as in this Bloomberg story:

Equifax Inc. said its systems were struck by a cyberattack that may have affected about 143 million U.S. customers of the credit reporting agency

But these aren’t the same thing. Equifax is a credit rating agency, keeping data on people who are not its own customers. It’s an important difference.

Another good example is yesterday’s quote “confirming” that the “Apache Struts” vulnerability was to blame:

Equifax has been intensely investigating the scope of the intrusion with the assistance of a leading, independent cybersecurity firm to determine what information was accessed and who has been impacted. We know that criminals exploited a U.S. website application vulnerability. The vulnerability was Apache Struts CVE-2017-5638.

But it doesn’t confirm Struts was responsible. Blaming Struts is certainly the subtext of this paragraph, but it’s not the text. It mentions that criminals had exploited the Struts vulnerability, but don’t actually connect the dots to the breach we are all talking about.

There’s probably reasons for this. While it’s easy for forensics to find evidence of Struts exploitation in logfiles, it’s much harder to connect this to the breach. While they suspect Struts, they may not actually be able to confirm it. Or, maybe they are trying to cover things up, where they feel failing to patch is a lesser crime than what they really did.

It’s at this point journalists should earn their pay. Instead rewriting what they read on the Internet, they could do legwork and call up Equifax PR and ask.

The purpose of this post isn’t to discuss Equifax, but the tendency of people to “read between the lines”, to read some subtext that wasn’t actually expressed in the text. Sometimes the subtext is legitimately there, such as how Equifax clearly intends people to blame Struts thought they don’t say it outright. Sometimes the subtext isn’t there, such as how Equifax doesn’t mean it’s own customers, only “U.S. consumers”. Journalists need to be careful about making assumptions about the subtext.


Update: The Equifax CSO has a degree in music. Some people have criticized this. Most people have defended this, pointing out that almost nobody has an “infosec” degree in our industry, and many of the top people have no degree at all. Among others, @thegrugq has pointed out that infosec degrees are only a few years old — they weren’t around 20 years ago when today’s corporate officers were getting their degrees.

Again, we have the text/subtext problem, where people interpret infosec degrees as being the same as computer-science degrees, the later of which have existed for decades. Some, as in this case, consider them to be wildly different. Others consider them to be nearly the same.

Equifax Data Breach – Hack Due To Missed Apache Patch

Post Syndicated from Darknet original https://www.darknet.org.uk/2017/09/equifax-data-breach-hack-due-to-missed-apache-patch/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

Equifax Data Breach – Hack Due To Missed Apache Patch

The Equifax data breach is pretty huge with 143 million records leaked from the hack in the US alone with unknown more in Canada and the UK.

The original statement about the breach is as follows for those that weren’t up to date with it, which came out Sept 7th (4 months AFTER the breach happened).

Equifax Inc. (NYSE: EFX) today announced a cybersecurity incident potentially impacting approximately 143 million U.S.

Read the rest of Equifax Data Breach – Hack Due To Missed Apache Patch now! Only available at Darknet.

Time Warner Hacked – AWS Config Exposes 4M Subscribers

Post Syndicated from Darknet original https://www.darknet.org.uk/2017/09/time-warner-hacked-aws-config-exposes-4m-subscribers/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

Time Warner Hacked – AWS Config Exposes 4M Subscribers

What’s the latest on the web, Time Warner Hacked is what it’s about now as a bad AWS S3 config (once again) exposes the details of approximately 4 Million subscribers.

This follows not long after the Instagram API leaking user contact information and a few other recent leaks involving poorly secured Amazon AWS S3 buckets and I’d hazard a guess that it won’t be the last.

Records of roughly four million Time Warner Cable customers in the US were exposed to the public internet after a contractor failed to properly secure an Amazon cloud database.

Read the rest of Time Warner Hacked – AWS Config Exposes 4M Subscribers now! Only available at Darknet.

Perfect 10 Takes Giganews to Supreme Court, Says It’s Worse Than Megaupload

Post Syndicated from Andy original https://torrentfreak.com/perfect-10-takes-giganews-supreme-court-says-worse-megaupload-170906/

Adult publisher Perfect 10 has developed a reputation for being a serial copyright litigant.

Over the years the company targeted a number of high-profile defendants, including Google, Amazon, Mastercard, and Visa. Around two dozen of Perfect 10’s lawsuits ended in cash settlements and defaults, in the publisher’s favor.

Perhaps buoyed by this success, the company went after Usenet provider Giganews but instead of a company willing to roll over, Perfect 10 found a highly defensive and indeed aggressive opponent. The initial copyright case filed by Perfect 10 alleged that Giganews effectively sold access to Perfect 10 content but things went badly for the publisher.

In November 2014, the U.S. District Court for the Central District of California found that Giganews was not liable for the infringing activities of its users. Perfect 10 was ordered to pay Giganews $5.6m in attorney’s fees and costs. Perfect 10 lost again at the Court of Appeals for the Ninth Circuit.

As a result of these failed actions, Giganews is owned millions by Perfect 10 but the publisher has thus far refused to pay up. That resulted in Giganews filing a $20m lawsuit, accusing Perfect 10 and President Dr. Norman Zada of fraud.

With all this litigation boiling around in the background and Perfect 10 already bankrupt as a result, one might think the story would be near to a conclusion. That doesn’t seem to be the case. In a fresh announcement, Perfect 10 says it has now appealed its case to the US Supreme Court.

“This is an extraordinarily important case, because for the first time, an appellate court has allowed defendants to copy and sell movies, songs, images, and other copyrighted works, without permission or payment to copyright holders,” says Zada.

“In this particular case, evidence was presented that defendants were copying and selling access to approximately 25,000 terabytes of unlicensed movies, songs, images, software, and magazines.”

Referencing an Amicus brief previously filed by the RIAA which described Giganews as “blatant copyright pirates,” Perfect 10 accuses the Ninth Circuit of allowing Giganews to copy and sell trillions of dollars of other people’s intellectual property “because their copying and selling was done in an automated fashion using a computer.”

Noting that “everything is done via computer” these days and with an undertone that the ruling encouraged others to infringe, Perfect 10 says there are now 88 companies similar to Giganews which rely on the automation defense to commit infringement – even involving content owned by people in the US Government.

“These exploiters of other people’s property are fearless. They are copying and selling access to pirated versions of pretty much every movie ever made, including films co-produced by treasury secretary Steven Mnuchin,” Nada says.

“You would think the justice department would do something to protect the viability of this nation’s movie and recording studios, as unfettered piracy harms jobs and tax revenues, but they have done nothing.”

But Zada doesn’t stop at blaming Usenet services, the California District Court, the Ninth Circuit, and the United States Department of Justice for his problems – Congress is to blame too.

“Copyright holders have nowhere to turn other than the Federal courts, whose judges are ridiculously overworked. For years, Congress has failed to provide the Federal courts with adequate funding. As a result, judges can make mistakes,” he adds.

For Zada, those mistakes are particularly notable, particularly since at least one other super high-profile company was shut down in the most aggressive manner possible for allegedly being involved in less piracy than Giganews.

Pointing to the now-infamous Megaupload case, Perfect 10 notes that the Department of Justice completely shut that operation down, filing charges of criminal copyright infringement against Kim Dotcom and seizing $175 million “for selling access to movies and songs which they did not own.”

“Perfect 10 provided evidence that [Giganews] offered more than 200 times as many full length movies as did megaupload.com. But our evidence fell on deaf ears,” Zada complains.

In contrast, Perfect 10 adds, a California District Court found that Giganews had done nothing wrong, allowed it to continue copying and selling access to Perfect 10’s content, and awarded the Usenet provider $5.63m in attorneys fees.

“Prior to this case, no court had ever awarded fees to an alleged infringer, unless they were found to either own the copyrights at issue, or established a fair use defense. Neither was the case here,” Zada adds.

While Perfect 10 has filed a petition with the Supreme Court, the odds of being granted a review are particularly small. Only time will tell how this case will end, but it seems unlikely that the adult publisher will enjoy a happy ending, one in which it doesn’t have to pay Giganews millions of dollars in attorney’s fees.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Security Flaw in Estonian National ID Card

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/09/security_flaw_i.html

We have no idea how bad this really is:

On 30 August, an international team of researchers informed the Estonian Information System Authority (RIA) of a vulnerability potentially affecting the digital use of Estonian ID cards. The possible vulnerability affects a total of almost 750,000 ID-cards issued starting from October 2014, including cards issued to e-residents. The ID-cards issued before 16 October 2014 use a different chip and are not affected. Mobile-IDs are also not impacted.

My guess is that it’s worse than the politicians are saying:

According to Peterkop, the current data shows this risk to be theoretical and there is no evidence of anyone’s digital identity being misused. “All ID-card operations are still valid and we will take appropriate actions to secure the functioning of our national digital-ID infrastructure. For example, we have restricted the access to Estonian ID-card public key database to prevent illegal use.”

And because this system is so important in local politics, the effects are significant:

In the light of current events, some Estonian politicians called to postpone the upcoming local elections, due to take place on 16 October. In Estonia, approximately 35% of the voters use digital identity to vote online.

But the Estonian prime minister, Jüri Ratas, said at a press conference on 5 September that “this incident will not affect the course of the Estonian e-state.” Ratas also recommended to use Mobile-IDs where possible. The prime minister said that the State Electoral Office will decide whether it will allow the usage of ID cards at the upcoming local elections.

The Estonian Police and Border Guard estimates it will take approximately two months to fix the issue with faulty cards. The authority will involve as many Estonian experts as possible in the process.

This is exactly the sort of thing I worry about as ID systems become more prevalent and more centralized. Anyone want to place bets on whether a foreign country is going to try to hack the next Estonian election?

Another article.

HDClub, Russia’s Leading HD-Only Torrent Site, Permanently Shuts Down

Post Syndicated from Andy original https://torrentfreak.com/hdclub-russias-leading-hd-torrent-site-permanently-shuts-down-170830/

While millions of users frequent popular public torrent sites such as The Pirate Bay and RARBG every day, there’s a thriving scene that’s hidden from the wider public eye.

Every week, private torrent trackers cater to dozens of millions of BitTorrent users who have taken the time and effort to gain access to these more secretive communities. Often labeled as elitist and running counter to the broad sharing ethos that made file-sharing the beast it is today, private sites pride themselves on quality, order and speed, something public sites typically struggle to match.

In addition to these notable qualities, many private sites choose to focus on a particular niche. There are sites dedicated to obscure electronic music, comedy, and even magic, but HDClub’s focus was given away by its name.

Dubbing itself “The HighDefinition BitTorrent Community”, HDClub specialized in HD productions including Blu-ray and 3D content, covering movies, TV shows, music videos, and animation.

Born in 2007, HDClub celebrated its ninth birthday on March 9 last year, with 2017 heralding a full decade online for the site. Catering mainly to the Russian and Ukrainian markets, the site’s releases often preserved an English audio option, ideal for those looking for high-quality releases from an unorthodox source at decent speeds.

Of course, HDClub releases often leaked out of the site, meaning that thousands are still available on regular public trackers, as a search on any Western torrent engine reveals.

A sample of HDClub releases listed on Torrentz2

Importantly, the site offered thousands of releases completely unavailable in Russia from licensed sources, meaning it filled a niche in which official outlets either wouldn’t or couldn’t compete. This earned itself a place in Russia’s Top 1000 sites list, despite being a closed membership platform.

The site’s attention to detail and focus earned it a considerable following. For the past few years the site capped membership at 190,000 people but in practice, attendance floated around the 170,000 mark. Seeders peaked at approximately 400,000 with leechers considerably less, making seeding as difficult as one might expect on a ratio-based tracker.

Now, however, the decade-long run of HDClub has come to an abrupt end. Early this week the tracker went dark, reportedly without advance notice. A Russian language announcement now present on its main page explains the reasons for the site’s demise.

“Recently, we received several dozens of complaints from rightsholders weekly, and our community is subjected to attacks and espionage,” the announcement reads.

While public torrent sites are always bombarded with DMCA-style notices, private sites tend to avoid large numbers of complaints. In this case, however, HDClub were clearly feeling the pressure. The site’s main page was open to the public while featuring popular releases, so this probably didn’t help with the load.

It’s not clear what is meant by “attacks and espionage” but it’s possibly a reference to DDoS assaults and third-parties attempting to monitor the site. Nevertheless, as HDClub points out, the climate for torrent, streaming, and similar sites has become increasingly hostile in the region recently.

“In parallel, there is a tightening of Internet legislation in Russia, Ukraine and EU countries,” the site says.

Interestingly, the site’s operators also suggest that interest from some quarters had waned, noting that “the time of enthusiasts irretrievably goes away.” It’s unclear whether that’s a reference to site users, the site’s operators, or indeed both. But in any event, any significant decline in any area can prove fatal, particularly when other pressures are at play.

“In the circumstances, we can no longer support the work of the club in the originally conceived format. The project is closed, but we ask you to refrain from long farewells. Thank you all and goodbye!” the message concludes.

Interestingly, the site ends with a little teaser, which may indicate some hope for the future.

“There are talks on preserving the heritage of the club,” it reads, without adding further details.

Possibly stay tuned…..

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

How to Configure an LDAPS Endpoint for Simple AD

Post Syndicated from Cameron Worrell original https://aws.amazon.com/blogs/security/how-to-configure-an-ldaps-endpoint-for-simple-ad/

Simple AD, which is powered by Samba  4, supports basic Active Directory (AD) authentication features such as users, groups, and the ability to join domains. Simple AD also includes an integrated Lightweight Directory Access Protocol (LDAP) server. LDAP is a standard application protocol for the access and management of directory information. You can use the BIND operation from Simple AD to authenticate LDAP client sessions. This makes LDAP a common choice for centralized authentication and authorization for services such as Secure Shell (SSH), client-based virtual private networks (VPNs), and many other applications. Authentication, the process of confirming the identity of a principal, typically involves the transmission of highly sensitive information such as user names and passwords. To protect this information in transit over untrusted networks, companies often require encryption as part of their information security strategy.

In this blog post, we show you how to configure an LDAPS (LDAP over SSL/TLS) encrypted endpoint for Simple AD so that you can extend Simple AD over untrusted networks. Our solution uses Elastic Load Balancing (ELB) to send decrypted LDAP traffic to HAProxy running on Amazon EC2, which then sends the traffic to Simple AD. ELB offers integrated certificate management, SSL/TLS termination, and the ability to use a scalable EC2 backend to process decrypted traffic. ELB also tightly integrates with Amazon Route 53, enabling you to use a custom domain for the LDAPS endpoint. The solution needs the intermediate HAProxy layer because ELB can direct traffic only to EC2 instances. To simplify testing and deployment, we have provided an AWS CloudFormation template to provision the ELB and HAProxy layers.

This post assumes that you have an understanding of concepts such as Amazon Virtual Private Cloud (VPC) and its components, including subnets, routing, Internet and network address translation (NAT) gateways, DNS, and security groups. You should also be familiar with launching EC2 instances and logging in to them with SSH. If needed, you should familiarize yourself with these concepts and review the solution overview and prerequisites in the next section before proceeding with the deployment.

Note: This solution is intended for use by clients requiring an LDAPS endpoint only. If your requirements extend beyond this, you should consider accessing the Simple AD servers directly or by using AWS Directory Service for Microsoft AD.

Solution overview

The following diagram and description illustrates and explains the Simple AD LDAPS environment. The CloudFormation template creates the items designated by the bracket (internal ELB load balancer and two HAProxy nodes configured in an Auto Scaling group).

Diagram of the the Simple AD LDAPS environment

Here is how the solution works, as shown in the preceding numbered diagram:

  1. The LDAP client sends an LDAPS request to ELB on TCP port 636.
  2. ELB terminates the SSL/TLS session and decrypts the traffic using a certificate. ELB sends the decrypted LDAP traffic to the EC2 instances running HAProxy on TCP port 389.
  3. The HAProxy servers forward the LDAP request to the Simple AD servers listening on TCP port 389 in a fixed Auto Scaling group configuration.
  4. The Simple AD servers send an LDAP response through the HAProxy layer to ELB. ELB encrypts the response and sends it to the client.

Note: Amazon VPC prevents a third party from intercepting traffic within the VPC. Because of this, the VPC protects the decrypted traffic between ELB and HAProxy and between HAProxy and Simple AD. The ELB encryption provides an additional layer of security for client connections and protects traffic coming from hosts outside the VPC.

Prerequisites

  1. Our approach requires an Amazon VPC with two public and two private subnets. The previous diagram illustrates the environment’s VPC requirements. If you do not yet have these components in place, follow these guidelines for setting up a sample environment:
    1. Identify a region that supports Simple AD, ELB, and NAT gateways. The NAT gateways are used with an Internet gateway to allow the HAProxy instances to access the internet to perform their required configuration. You also need to identify the two Availability Zones in that region for use by Simple AD. You will supply these Availability Zones as parameters to the CloudFormation template later in this process.
    2. Create or choose an Amazon VPC in the region you chose. In order to use Route 53 to resolve the LDAPS endpoint, make sure you enable DNS support within your VPC. Create an Internet gateway and attach it to the VPC, which will be used by the NAT gateways to access the internet.
    3. Create a route table with a default route to the Internet gateway. Create two NAT gateways, one per Availability Zone in your public subnets to provide additional resiliency across the Availability Zones. Together, the routing table, the NAT gateways, and the Internet gateway enable the HAProxy instances to access the internet.
    4. Create two private routing tables, one per Availability Zone. Create two private subnets, one per Availability Zone. The dual routing tables and subnets allow for a higher level of redundancy. Add each subnet to the routing table in the same Availability Zone. Add a default route in each routing table to the NAT gateway in the same Availability Zone. The Simple AD servers use subnets that you create.
    5. The LDAP service requires a DNS domain that resolves within your VPC and from your LDAP clients. If you do not have an existing DNS domain, follow the steps to create a private hosted zone and associate it with your VPC. To avoid encryption protocol errors, you must ensure that the DNS domain name is consistent across your Route 53 zone and in the SSL/TLS certificate (see Step 2 in the “Solution deployment” section).
  2. Make sure you have completed the Simple AD Prerequisites.
  3. We will use a self-signed certificate for ELB to perform SSL/TLS decryption. You can use a certificate issued by your preferred certificate authority or a certificate issued by AWS Certificate Manager (ACM).
    Note: To prevent unauthorized connections directly to your Simple AD servers, you can modify the Simple AD security group on port 389 to block traffic from locations outside of the Simple AD VPC. You can find the security group in the EC2 console by creating a search filter for your Simple AD directory ID. It is also important to allow the Simple AD servers to communicate with each other as shown on Simple AD Prerequisites.

Solution deployment

This solution includes five main parts:

  1. Create a Simple AD directory.
  2. Create a certificate.
  3. Create the ELB and HAProxy layers by using the supplied CloudFormation template.
  4. Create a Route 53 record.
  5. Test LDAPS access using an Amazon Linux client.

1. Create a Simple AD directory

With the prerequisites completed, you will create a Simple AD directory in your private VPC subnets:

  1. In the Directory Service console navigation pane, choose Directories and then choose Set up directory.
  2. Choose Simple AD.
    Screenshot of choosing "Simple AD"
  3. Provide the following information:
    • Directory DNS – The fully qualified domain name (FQDN) of the directory, such as corp.example.com. You will use the FQDN as part of the testing procedure.
    • NetBIOS name – The short name for the directory, such as CORP.
    • Administrator password – The password for the directory administrator. The directory creation process creates an administrator account with the user name Administrator and this password. Do not lose this password because it is nonrecoverable. You also need this password for testing LDAPS access in a later step.
    • Description – An optional description for the directory.
    • Directory Size – The size of the directory.
      Screenshot of the directory details to provide
  4. Provide the following information in the VPC Details section, and then choose Next Step:
    • VPC – Specify the VPC in which to install the directory.
    • Subnets – Choose two private subnets for the directory servers. The two subnets must be in different Availability Zones. Make a note of the VPC and subnet IDs for use as CloudFormation input parameters. In the following example, the Availability Zones are us-east-1a and us-east-1c.
      Screenshot of the VPC details to provide
  5. Review the directory information and make any necessary changes. When the information is correct, choose Create Simple AD.

It takes several minutes to create the directory. From the AWS Directory Service console , refresh the screen periodically and wait until the directory Status value changes to Active before continuing. Choose your Simple AD directory and note the two IP addresses in the DNS address section. You will enter them when you run the CloudFormation template later.

Note: Full administration of your Simple AD implementation is out of scope for this blog post. See the documentation to add users, groups, or instances to your directory. Also see the previous blog post, How to Manage Identities in Simple AD Directories.

2. Create a certificate

In the previous step, you created the Simple AD directory. Next, you will generate a self-signed SSL/TLS certificate using OpenSSL. You will use the certificate with ELB to secure the LDAPS endpoint. OpenSSL is a standard, open source library that supports a wide range of cryptographic functions, including the creation and signing of x509 certificates. You then import the certificate into ACM that is integrated with ELB.

  1. You must have a system with OpenSSL installed to complete this step. If you do not have OpenSSL, you can install it on Amazon Linux by running the command, sudo yum install openssl. If you do not have access to an Amazon Linux instance you can create one with SSH access enabled to proceed with this step. Run the command, openssl version, at the command line to see if you already have OpenSSL installed.
    [[email protected] ~]$ openssl version
    OpenSSL 1.0.1k-fips 8 Jan 2015

  2. Create a private key using the command, openssl genrsa command.
    [[email protected] tmp]$ openssl genrsa 2048 > privatekey.pem
    Generating RSA private key, 2048 bit long modulus
    ......................................................................................................................................................................+++
    ..........................+++
    e is 65537 (0x10001)

  3. Generate a certificate signing request (CSR) using the openssl req command. Provide the requested information for each field. The Common Name is the FQDN for your LDAPS endpoint (for example, ldap.corp.example.com). The Common Name must use the domain name you will later register in Route 53. You will encounter certificate errors if the names do not match.
    [[email protected] tmp]$ openssl req -new -key privatekey.pem -out server.csr
    You are about to be asked to enter information that will be incorporated into your certificate request.

  4. Use the openssl x509 command to sign the certificate. The following example uses the private key from the previous step (privatekey.pem) and the signing request (server.csr) to create a public certificate named server.crt that is valid for 365 days. This certificate must be updated within 365 days to avoid disruption of LDAPS functionality.
    [[email protected] tmp]$ openssl x509 -req -sha256 -days 365 -in server.csr -signkey privatekey.pem -out server.crt
    Signature ok
    subject=/C=XX/L=Default City/O=Default Company Ltd/CN=ldap.corp.example.com
    Getting Private key

  5. You should see three files: privatekey.pem, server.crt, and server.csr.
    [[email protected] tmp]$ ls
    privatekey.pem server.crt server.csr

    Restrict access to the private key.

    [[email protected] tmp]$ chmod 600 privatekey.pem

    Keep the private key and public certificate for later use. You can discard the signing request because you are using a self-signed certificate and not using a Certificate Authority. Always store the private key in a secure location and avoid adding it to your source code.

  6. In the ACM console, choose Import a certificate.
  7. Using your favorite Linux text editor, paste the contents of your server.crt file in the Certificate body box.
  8. Using your favorite Linux text editor, paste the contents of your privatekey.pem file in the Certificate private key box. For a self-signed certificate, you can leave the Certificate chain box blank.
  9. Choose Review and import. Confirm the information and choose Import.

3. Create the ELB and HAProxy layers by using the supplied CloudFormation template

Now that you have created your Simple AD directory and SSL/TLS certificate, you are ready to use the CloudFormation template to create the ELB and HAProxy layers.

  1. Load the supplied CloudFormation template to deploy an internal ELB and two HAProxy EC2 instances into a fixed Auto Scaling group. After you load the template, provide the following input parameters. Note: You can find the parameters relating to your Simple AD from the directory details page by choosing your Simple AD in the Directory Service console.
Input parameter Input parameter description
HAProxyInstanceSize The EC2 instance size for HAProxy servers. The default size is t2.micro and can scale up for large Simple AD environments.
MyKeyPair The SSH key pair for EC2 instances. If you do not have an existing key pair, you must create one.
VPCId The target VPC for this solution. Must be in the VPC where you deployed Simple AD and is available in your Simple AD directory details page.
SubnetId1 The Simple AD primary subnet. This information is available in your Simple AD directory details page.
SubnetId2 The Simple AD secondary subnet. This information is available in your Simple AD directory details page.
MyTrustedNetwork Trusted network Classless Inter-Domain Routing (CIDR) to allow connections to the LDAPS endpoint. For example, use the VPC CIDR to allow clients in the VPC to connect.
SimpleADPriIP The primary Simple AD Server IP. This information is available in your Simple AD directory details page.
SimpleADSecIP The secondary Simple AD Server IP. This information is available in your Simple AD directory details page.
LDAPSCertificateARN The Amazon Resource Name (ARN) for the SSL certificate. This information is available in the ACM console.
  1. Enter the input parameters and choose Next.
  2. On the Options page, accept the defaults and choose Next.
  3. On the Review page, confirm the details and choose Create. The stack will be created in approximately 5 minutes.

4. Create a Route 53 record

The next step is to create a Route 53 record in your private hosted zone so that clients can resolve your LDAPS endpoint.

  1. If you do not have an existing DNS domain for use with LDAP, create a private hosted zone and associate it with your VPC. The hosted zone name should be consistent with your Simple AD (for example, corp.example.com).
  2. When the CloudFormation stack is in CREATE_COMPLETE status, locate the value of the LDAPSURL on the Outputs tab of the stack. Copy this value for use in the next step.
  3. On the Route 53 console, choose Hosted Zones and then choose the zone you used for the Common Name box for your self-signed certificate. Choose Create Record Set and enter the following information:
    1. Name – The label of the record (such as ldap).
    2. Type – Leave as A – IPv4 address.
    3. Alias – Choose Yes.
    4. Alias Target – Paste the value of the LDAPSURL on the Outputs tab of the stack.
  4. Leave the defaults for Routing Policy and Evaluate Target Health, and choose Create.
    Screenshot of finishing the creation of the Route 53 record

5. Test LDAPS access using an Amazon Linux client

At this point, you have configured your LDAPS endpoint and now you can test it from an Amazon Linux client.

  1. Create an Amazon Linux instance with SSH access enabled to test the solution. Launch the instance into one of the public subnets in your VPC. Make sure the IP assigned to the instance is in the trusted IP range you specified in the CloudFormation parameter MyTrustedNetwork in Step 3.b.
  2. SSH into the instance and complete the following steps to verify access.
    1. Install the openldap-clients package and any required dependencies:
      sudo yum install -y openldap-clients.
    2. Add the server.crt file to the /etc/openldap/certs/ directory so that the LDAPS client will trust your SSL/TLS certificate. You can copy the file using Secure Copy (SCP) or create it using a text editor.
    3. Edit the /etc/openldap/ldap.conf file and define the environment variables BASE, URI, and TLS_CACERT.
      • The value for BASE should match the configuration of the Simple AD directory name.
      • The value for URI should match your DNS alias.
      • The value for TLS_CACERT is the path to your public certificate.

Here is an example of the contents of the file.

BASE dc=corp,dc=example,dc=com
URI ldaps://ldap.corp.example.com
TLS_CACERT /etc/openldap/certs/server.crt

To test the solution, query the directory through the LDAPS endpoint, as shown in the following command. Replace corp.example.com with your domain name and use the Administrator password that you configured with the Simple AD directory

$ ldapsearch -D "[email protected]corp.example.com" -W sAMAccountName=Administrator

You should see a response similar to the following response, which provides the directory information in LDAP Data Interchange Format (LDIF) for the administrator distinguished name (DN) from your Simple AD LDAP server.

# extended LDIF
#
# LDAPv3
# base <dc=corp,dc=example,dc=com> (default) with scope subtree
# filter: sAMAccountName=Administrator
# requesting: ALL
#

# Administrator, Users, corp.example.com
dn: CN=Administrator,CN=Users,DC=corp,DC=example,DC=com
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: user
description: Built-in account for administering the computer/domain
instanceType: 4
whenCreated: 20170721123204.0Z
uSNCreated: 3223
name: Administrator
objectGUID:: l3h0HIiKO0a/ShL4yVK/vw==
userAccountControl: 512
…

You can now use the LDAPS endpoint for directory operations and authentication within your environment. If you would like to learn more about how to interact with your LDAPS endpoint within a Linux environment, here are a few resources to get started:

Troubleshooting

If you receive an error such as the following error when issuing the ldapsearch command, there are a few things you can do to help identify issues.

ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
  • You might be able to obtain additional error details by adding the -d1 debug flag to the ldapsearch command in the previous section.
    $ ldapsearch -D "[email protected]" -W sAMAccountName=Administrator –d1

  • Verify that the parameters in ldap.conf match your configured LDAPS URI endpoint and that all parameters can be resolved by DNS. You can use the following dig command, substituting your configured endpoint DNS name.
    $ dig ldap.corp.example.com

  • Confirm that the client instance from which you are connecting is in the CIDR range of the CloudFormation parameter, MyTrustedNetwork.
  • Confirm that the path to your public SSL/TLS certificate configured in ldap.conf as TLS_CAERT is correct. You configured this in Step 5.b.3. You can check your SSL/TLS connection with the command, substituting your configured endpoint DNS name for the string after –connect.
    $ echo -n | openssl s_client -connect ldap.corp.example.com:636

  • Verify that your HAProxy instances have the status InService in the EC2 console: Choose Load Balancers under Load Balancing in the navigation pane, highlight your LDAPS load balancer, and then choose the Instances

Conclusion

You can use ELB and HAProxy to provide an LDAPS endpoint for Simple AD and transport sensitive authentication information over untrusted networks. You can explore using LDAPS to authenticate SSH users or integrate with other software solutions that support LDAP authentication. This solution’s CloudFormation template is available on GitHub.

If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the Directory Service forum.

– Cameron and Jeff

From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum

Post Syndicated from Dylan Tong original https://aws.amazon.com/blogs/big-data/from-data-lake-to-data-warehouse-enhancing-customer-360-with-amazon-redshift-spectrum/

Achieving a 360o-view of your customer has become increasingly challenging as companies embrace omni-channel strategies, engaging customers across websites, mobile, call centers, social media, physical sites, and beyond. The promise of a web where online and physical worlds blend makes understanding your customers more challenging, but also more important. Businesses that are successful in this medium have a significant competitive advantage.

The big data challenge requires the management of data at high velocity and volume. Many customers have identified Amazon S3 as a great data lake solution that removes the complexities of managing a highly durable, fault tolerant data lake infrastructure at scale and economically.

AWS data services substantially lessen the heavy lifting of adopting technologies, allowing you to spend more time on what matters most—gaining a better understanding of customers to elevate your business. In this post, I show how a recent Amazon Redshift innovation, Redshift Spectrum, can enhance a customer 360 initiative.

Customer 360 solution

A successful customer 360 view benefits from using a variety of technologies to deliver different forms of insights. These could range from real-time analysis of streaming data from wearable devices and mobile interactions to historical analysis that requires interactive, on demand queries on billions of transactions. In some cases, insights can only be inferred through AI via deep learning. Finally, the value of your customer data and insights can’t be fully realized until it is operationalized at scale—readily accessible by fleets of applications. Companies are leveraging AWS for the breadth of services that cover these domains, to drive their data strategy.

A number of AWS customers stream data from various sources into a S3 data lake through Amazon Kinesis. They use Kinesis and technologies in the Hadoop ecosystem like Spark running on Amazon EMR to enrich this data. High-value data is loaded into an Amazon Redshift data warehouse, which allows users to analyze and interact with data through a choice of client tools. Redshift Spectrum expands on this analytics platform by enabling Amazon Redshift to blend and analyze data beyond the data warehouse and across a data lake.

The following diagram illustrates the workflow for such a solution.

This solution delivers value by:

  • Reducing complexity and time to value to deeper insights. For instance, an existing data model in Amazon Redshift may provide insights across dimensions such as customer, geography, time, and product on metrics from sales and financial systems. Down the road, you may gain access to streaming data sources like customer-care call logs and website activity that you want to blend in with the sales data on the same dimensions to understand how web and call center experiences maybe correlated with sales performance. Redshift Spectrum can join these dimensions in Amazon Redshift with data in S3 to allow you to quickly gain new insights, and avoid the slow and more expensive alternative of fully integrating these sources with your data warehouse.
  • Providing an additional avenue for optimizing costs and performance. In cases like call logs and clickstream data where volumes could be many TBs to PBs, storing the data exclusively in S3 yields significant cost savings. Interactive analysis on massive datasets may now be economically viable in cases where data was previously analyzed periodically through static reports generated by inexpensive batch processes. In some cases, you can improve the user experience while simultaneously lowering costs. Spectrum is powered by a large-scale infrastructure external to your Amazon Redshift cluster, and excels at scanning and aggregating large volumes of data. For instance, your analysts maybe performing data discovery on customer interactions across millions of consumers over years of data across various channels. On this large dataset, certain queries could be slow if you didn’t have a large Amazon Redshift cluster. Alternatively, you could use Redshift Spectrum to achieve a better user experience with a smaller cluster.

Proof of concept walkthrough

To make evaluation easier for you, I’ve conducted a Redshift Spectrum proof-of-concept (PoC) for the customer 360 use case. For those who want to replicate the PoC, the instructions, AWS CloudFormation templates, and public data sets are available in the GitHub repository.

The remainder of this post is a journey through the project, observing best practices in action, and learning how you can achieve business value. The walkthrough involves:

  • An analysis of performance data from the PoC environment involving queries that demonstrate blending and analysis of data across Amazon Redshift and S3. Observe that great results are achievable at scale.
  • Guidance by example on query tuning, design, and data preparation to illustrate the optimization process. This includes tuning a query that combines clickstream data in S3 with customer and time dimensions in Amazon Redshift, and aggregates ~1.9 B out of 3.7 B+ records in under 10 seconds with a small cluster!
  • Guidance and measurements to help assess deciding between two options: accessing and analyzing data exclusively in Amazon Redshift, or using Redshift Spectrum to access data left in S3.

Stream ingestion and enrichment

The focus of this post isn’t stream ingestion and enrichment on Kinesis and EMR, but be mindful of performance best practices on S3 to ensure good streaming and query performance:

  • Use random object keys: The data files provided for this project are prefixed with SHA-256 hashes to prevent hot partitions. This is important to ensure that optimal request rates to support PUT requests from the incoming stream in addition to certain queries from large Amazon Redshift clusters that could send a large number of parallel GET requests.
  • Micro-batch your data stream: S3 isn’t optimized for small random write workloads. Your datasets should be micro-batched into large files. For instance, the “parquet-1” dataset provided batches >7 million records per file. The optimal file size for Redshift Spectrum is usually in the 100 MB to 1 GB range.

If you have an edge case that may pose scalability challenges, AWS would love to hear about it. For further guidance, talk to your solutions architect.

Environment

The project consists of the following environment:

  • Amazon Redshift cluster: 4 X dc1.large
  • Data:
    • Time and customer dimension tables are stored on all Amazon Redshift nodes (ALL distribution style):
      • The data originates from the DWDATE and CUSTOMER tables in the Star Schema Benchmark
      • The customer table contains attributes for 3 million customers.
      • The time data is at the day-level granularity, and spans 7 years, from the start of 1992 to the end of 1998.
    • The clickstream data is stored in an S3 bucket, and serves as a fact table.
      • Various copies of this dataset in CSV and Parquet format have been provided, for reasons to be discussed later.
      • The data is a modified version of the uservisits dataset from AMPLab’s Big Data Benchmark, which was generated by Intel’s Hadoop benchmark tools.
      • Changes were minimal, so that existing test harnesses for this test can be adapted:
        • Increased the 751,754,869-row dataset 5X to 3,758,774,345 rows.
        • Added surrogate keys to support joins with customer and time dimensions. These keys were distributed evenly across the entire dataset to represents user visits from six customers over seven years.
        • Values for the visitDate column were replaced to align with the 7-year timeframe, and the added time surrogate key.

Queries across the data lake and data warehouse 

Imagine a scenario where a business analyst plans to analyze clickstream metrics like ad revenue over time and by customer, market segment and more. The example below is a query that achieves this effect: 

The query part highlighted in red retrieves clickstream data in S3, and joins the data with the time and customer dimension tables in Amazon Redshift through the part highlighted in blue. The query returns the total ad revenue for three customers over the last three months, along with info on their respective market segment.

Unfortunately, this query takes around three minutes to run, and doesn’t enable the interactive experience that you want. However, there’s a number of performance optimizations that you can implement to achieve the desired performance.

Performance analysis

Two key utilities provide visibility into Redshift Spectrum:

  • EXPLAIN
    Provides the query execution plan, which includes info around what processing is pushed down to Redshift Spectrum. Steps in the plan that include the prefix S3 are executed on Redshift Spectrum. For instance, the plan for the previous query has the step “S3 Seq Scan clickstream.uservisits_csv10”, indicating that Redshift Spectrum performs a scan on S3 as part of the query execution.
  • SVL_S3QUERY_SUMMARY
    Statistics for Redshift Spectrum queries are stored in this table. While the execution plan presents cost estimates, this table stores actual statistics for past query runs.

You can get the statistics of your last query by inspecting the SVL_S3QUERY_SUMMARY table with the condition (query = pg_last_query_id()). Inspecting the previous query reveals that the entire dataset of nearly 3.8 billion rows was scanned to retrieve less than 66.3 million rows. Improving scan selectivity in your query could yield substantial performance improvements.

Partitioning

Partitioning is a key means to improving scan efficiency. In your environment, the data and tables have already been organized, and configured to support partitions. For more information, see the PoC project setup instructions. The clickstream table was defined as:

CREATE EXTERNAL TABLE clickstream.uservisits_csv10
…
PARTITIONED BY(customer int4, visitYearMonth int4)

The entire 3.8 billion-row dataset is organized as a collection of large files where each file contains data exclusive to a particular customer and month in a year. This allows you to partition your data into logical subsets by customer and year/month. With partitions, the query engine can target a subset of files:

  • Only for specific customers
  • Only data for specific months
  • A combination of specific customers and year/months

You can use partitions in your queries. Instead of joining your customer data on the surrogate customer key (that is, c.c_custkey = uv.custKey), the partition key “customer” should be used instead:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ORDER BY c.c_name, c.c_mktsegment, uv.yearMonthKey  ASC

This query should run approximately twice as fast as the previous query. If you look at the statistics for this query in SVL_S3QUERY_SUMMARY, you see that only half the dataset was scanned. This is expected because your query is on three out of six customers on an evenly distributed dataset. However, the scan is still inefficient, and you can benefit from using your year/month partition key as well:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ON uv.visitYearMonth = t.d_yearmonthnum
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

All joins between the tables are now using partitions. Upon reviewing the statistics for this query, you should observe that Redshift Spectrum scans and returns the exact number of rows, 66,270,117. If you run this query a few times, you should see execution time in the range of 8 seconds, which is a 22.5X improvement on your original query!

Predicate pushdown and storage optimizations 

Previously, I mentioned that Redshift Spectrum performs processing through large-scale infrastructure external to your Amazon Redshift cluster. It is optimized for performing large scans and aggregations on S3. In fact, Redshift Spectrum may even out-perform a medium size Amazon Redshift cluster on these types of workloads with the proper optimizations. There are two important variables to consider for optimizing large scans and aggregations:

  • File size and count. As a general rule, use files 100 MB-1 GB in size, as Redshift Spectrum and S3 are optimized for reading this object size. However, the number of files operating on a query is directly correlated with the parallelism achievable by a query. There is an inverse relationship between file size and count: the bigger the files, the fewer files there are for the same dataset. Consequently, there is a trade-off between optimizing for object read performance, and the amount of parallelism achievable on a particular query. Large files are best for large scans as the query likely operates on sufficiently large number of files. For queries that are more selective and for which fewer files are operating, you may find that smaller files allow for more parallelism.
  • Data format. Redshift Spectrum supports various data formats. Columnar formats like Parquet can sometimes lead to substantial performance benefits by providing compression and more efficient I/O for certain workloads. Generally, format types like Parquet should be used for query workloads involving large scans, and high attribute selectivity. Again, there are trade-offs as formats like Parquet require more compute power to process than plaintext. For queries on smaller subsets of data, the I/O efficiency benefit of Parquet is diminished. At some point, Parquet may perform the same or slower than plaintext. Latency, compression rates, and the trade-off between user experience and cost should drive your decision.

To help illustrate how Redshift Spectrum performs on these large aggregation workloads, run a basic query that aggregates the entire ~3.7 billion record dataset on Redshift Spectrum, and compared that with running the query exclusively on Amazon Redshift:

SELECT uv.custKey, COUNT(uv.custKey)
FROM <your clickstream table> as uv
GROUP BY uv.custKey
ORDER BY uv.custKey ASC

For the Amazon Redshift test case, the clickstream data is loaded, and distributed evenly across all nodes (even distribution style) with optimal column compression encodings prescribed by the Amazon Redshift’s ANALYZE command.

The Redshift Spectrum test case uses a Parquet data format with each file containing all the data for a particular customer in a month. This results in files mostly in the range of 220-280 MB, and in effect, is the largest file size for this partitioning scheme. If you run tests with the other datasets provided, you see that this data format and size is optimal and out-performs others by ~60X. 

Performance differences will vary depending on the scenario. The important takeaway is to understand the testing strategy and the workload characteristics where Redshift Spectrum is likely to yield performance benefits. 

The following chart compares the query execution time for the two scenarios. The results indicate that you would have to pay for 12 X DC1.Large nodes to get performance comparable to using a small Amazon Redshift cluster that leverages Redshift Spectrum. 

Chart showing simple aggregation on ~3.7 billion records

So you’ve validated that Spectrum excels at performing large aggregations. Could you benefit by pushing more work down to Redshift Spectrum in your original query? It turns out that you can, by making the following modification:

The clickstream data is stored at a day-level granularity for each customer while your query rolls up the data to the month level per customer. In the earlier query that uses the day/month partition key, you optimized the query so that it only scans and retrieves the data required, but the day level data is still sent back to your Amazon Redshift cluster for joining and aggregation. The query shown here pushes aggregation work down to Redshift Spectrum as indicated by the query plan:

In this query, Redshift Spectrum aggregates the clickstream data to the month level before it is returned to the Amazon Redshift cluster and joined with the dimension tables. This query should complete in about 4 seconds, which is roughly twice as fast as only using the partition key. The speed increase is evident upon reviewing the SVL_S3QUERY_SUMMARY table:

  • Bytes scanned is 21.6X less because of the Parquet data format.
  • Only 90 records are returned back to the Amazon Redshift cluster as a result of the push-down, instead of ~66.2 million, leading to substantially less join overhead, and about 530 MB less data sent back to your cluster.
  • No adverse change in average parallelism.

Assessing the value of Amazon Redshift vs. Redshift Spectrum

At this point, you might be asking yourself, why would I ever not use Redshift Spectrum? Well, you still get additional value for your money by loading data into Amazon Redshift, and querying in Amazon Redshift vs. querying S3.

In fact, it turns out that the last version of our query runs even faster when executed exclusively in native Amazon Redshift, as shown in the following chart:

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 3 months of data

As a general rule, queries that aren’t dominated by I/O and which involve multiple joins are better optimized in native Amazon Redshift. For instance, the performance difference between running the partition key query entirely in Amazon Redshift versus with Redshift Spectrum is twice as large as that that of the pushdown aggregation query, partly because the former case benefits more from better join performance.

Furthermore, the variability in latency in native Amazon Redshift is lower. For use cases where you have tight performance SLAs on queries, you may want to consider using Amazon Redshift exclusively to support those queries.

On the other hand, when you perform large scans, you could benefit from the best of both worlds: higher performance at lower cost. For instance, imagine that you wanted to enable your business analysts to interactively discover insights across a vast amount of historical data. In the example below, the pushdown aggregation query is modified to analyze seven years of data instead of three months:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, uv.totalRevenue
…
WHERE customer <= 3 and visitYearMonth >= 199201
… 
FROM dwdate WHERE d_yearmonthnum >= 199201) as t
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

This query requires scanning and aggregating nearly 1.9 billion records. As shown in the chart below, Redshift Spectrum substantially speeds up this query. A large Amazon Redshift cluster would have to be provisioned to support this use case. With the aid of Redshift Spectrum, you could use an existing small cluster, keep a single copy of your data in S3, and benefit from economical, durable storage while only paying for what you use via the pay per query pricing model.

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 7 years of data

Summary

Redshift Spectrum lowers the time to value for deeper insights on customer data queries spanning the data lake and data warehouse. It can enable interactive analysis on datasets in cases that weren’t economically practical or technically feasible before.

There are cases where you can get the best of both worlds from Redshift Spectrum: higher performance at lower cost. However, there are still latency-sensitive use cases where you may want native Amazon Redshift performance. For more best practice tips, see the 10 Best Practices for Amazon Redshift post.

Please visit the Amazon Redshift Spectrum PoC Environment Github page. If you have questions or suggestions, please comment below.

 


Additional Reading

Learn more about how Amazon Redshift Spectrum extends data warehousing out to exabytes – no loading required.


About the Author

Dylan Tong is an Enterprise Solutions Architect at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

 

 

Announcing the Winners of the AWS Chatbot Challenge – Conversational, Intelligent Chatbots using Amazon Lex and AWS Lambda

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/announcing-the-winners-of-the-aws-chatbot-challenge-conversational-intelligent-chatbots-using-amazon-lex-and-aws-lambda/

A couple of months ago on the blog, I announced the AWS Chatbot Challenge in conjunction with Slack. The AWS Chatbot Challenge was an opportunity to build a unique chatbot that helped to solve a problem or that would add value for its prospective users. The mission was to build a conversational, natural language chatbot using Amazon Lex and leverage Lex’s integration with AWS Lambda to execute logic or data processing on the backend.

I know that you all have been anxiously waiting to hear announcements of who were the winners of the AWS Chatbot Challenge as much as I was. Well wait no longer, the winners of the AWS Chatbot Challenge have been decided.

May I have the Envelope Please? (The Trumpets sound)

The winners of the AWS Chatbot Challenge are:

  • First Place: BuildFax Counts by Joe Emison
  • Second Place: Hubsy by Andrew Riess, Andrew Puch, and John Wetzel
  • Third Place: PFMBot by Benny Leong and his team from MoneyLion.
  • Large Organization Winner: ADP Payroll Innovation Bot by Eric Liu, Jiaxing Yan, and Fan Yang

 

Diving into the Winning Chatbot Projects

Let’s take a walkthrough of the details for each of the winning projects to get a view of what made these chatbots distinctive, as well as, learn more about the technologies used to implement the chatbot solution.

 

BuildFax Counts by Joe Emison

The BuildFax Counts bot was created as a real solution for the BuildFax company to decrease the amount the time that sales and marketing teams can get answers on permits or properties with permits meet certain criteria.

BuildFax, a company co-founded by bot developer Joe Emison, has the only national database of building permits, which updates data from approximately half of the United States on a monthly basis. In order to accommodate the many requests that come in from the sales and marketing team regarding permit information, BuildFax has a technical sales support team that fulfills these requests sent to a ticketing system by manually writing SQL queries that run across the shards of the BuildFax databases. Since there are a large number of requests received by the internal sales support team and due to the manual nature of setting up the queries, it may take several days for getting the sales and marketing teams to receive an answer.

The BuildFax Counts chatbot solves this problem by taking the permit inquiry that would normally be sent into a ticket from the sales and marketing team, as input from Slack to the chatbot. Once the inquiry is submitted into Slack, a query executes and the inquiry results are returned immediately.

Joe built this solution by first creating a nightly export of the data in their BuildFax MySQL RDS database to CSV files that are stored in Amazon S3. From the exported CSV files, an Amazon Athena table was created in order to run quick and efficient queries on the data. He then used Amazon Lex to create a bot to handle the common questions and criteria that may be asked by the sales and marketing teams when seeking data from the BuildFax database by modeling the language used from the BuildFax ticketing system. He added several different sample utterances and slot types; both custom and Lex provided, in order to correctly parse every question and criteria combination that could be received from an inquiry.  Using Lambda, Joe created a Javascript Lambda function that receives information from the Lex intent and used it to build a SQL statement that runs against the aforementioned Athena database using the AWS SDK for JavaScript in Node.js library to return inquiry count result and SQL statement used.

The BuildFax Counts bot is used today for the BuildFax sales and marketing team to get back data on inquiries immediately that previously took up to a week to receive results.

Not only is BuildFax Counts bot our 1st place winner and wonderful solution, but its creator, Joe Emison, is a great guy.  Joe has opted to donate his prize; the $5,000 cash, the $2,500 in AWS Credits, and one re:Invent ticket to the Black Girls Code organization. I must say, you rock Joe for helping these kids get access and exposure to technology.

 

Hubsy by Andrew Riess, Andrew Puch, and John Wetzel

Hubsy bot was created to redefine and personalize the way users traditionally manage their HubSpot account. HubSpot is a SaaS system providing marketing, sales, and CRM software. Hubsy allows users of HubSpot to create engagements and log engagements with customers, provide sales teams with deals status, and retrieves client contact information quickly. Hubsy uses Amazon Lex’s conversational interface to execute commands from the HubSpot API so that users can gain insights, store and retrieve data, and manage tasks directly from Facebook, Slack, or Alexa.

In order to implement the Hubsy chatbot, Andrew and the team members used AWS Lambda to create a Lambda function with Node.js to parse the users request and call the HubSpot API, which will fulfill the initial request or return back to the user asking for more information. Terraform was used to automatically setup and update Lambda, CloudWatch logs, as well as, IAM profiles. Amazon Lex was used to build the conversational piece of the bot, which creates the utterances that a person on a sales team would likely say when seeking information from HubSpot. To integrate with Alexa, the Amazon Alexa skill builder was used to create an Alexa skill which was tested on an Echo Dot. Cloudwatch Logs are used to log the Lambda function information to CloudWatch in order to debug different parts of the Lex intents. In order to validate the code before the Terraform deployment, ESLint was additionally used to ensure the code was linted and proper development standards were followed.

 

PFMBot by Benny Leong and his team from MoneyLion

PFMBot, Personal Finance Management Bot,  is a bot to be used with the MoneyLion finance group which offers customers online financial products; loans, credit monitoring, and free credit score service to improve the financial health of their customers. Once a user signs up an account on the MoneyLion app or website, the user has the option to link their bank accounts with the MoneyLion APIs. Once the bank account is linked to the APIs, the user will be able to login to their MoneyLion account and start having a conversation with the PFMBot based on their bank account information.

The PFMBot UI has a web interface built with using Javascript integration. The chatbot was created using Amazon Lex to build utterances based on the possible inquiries about the user’s MoneyLion bank account. PFMBot uses the Lex built-in AMAZON slots and parsed and converted the values from the built-in slots to pass to AWS Lambda. The AWS Lambda functions interacting with Amazon Lex are Java-based Lambda functions which call the MoneyLion Java-based internal APIs running on Spring Boot. These APIs obtain account data and related bank account information from the MoneyLion MySQL Database.

 

ADP Payroll Innovation Bot by Eric Liu, Jiaxing Yan, and Fan Yang

ADP PI (Payroll Innovation) bot is designed to help employees of ADP customers easily review their own payroll details and compare different payroll data by just asking the bot for results. The ADP PI Bot additionally offers issue reporting functionality for employees to report payroll issues and aids HR managers in quickly receiving and organizing any reported payroll issues.

The ADP Payroll Innovation bot is an ecosystem for the ADP payroll consisting of two chatbots, which includes ADP PI Bot for external clients (employees and HR managers), and ADP PI DevOps Bot for internal ADP DevOps team.


The architecture for the ADP PI DevOps bot is different architecture from the ADP PI bot shown above as it is deployed internally to ADP. The ADP PI DevOps bot allows input from both Slack and Alexa. When input comes into Slack, Slack sends the request to Lex for it to process the utterance. Lex then calls the Lambda backend, which obtains ADP data sitting in the ADP VPC running within an Amazon VPC. When input comes in from Alexa, a Lambda function is called that also obtains data from the ADP VPC running on AWS.

The architecture for the ADP PI bot consists of users entering in requests and/or entering issues via Slack. When requests/issues are entered via Slack, the Slack APIs communicate via Amazon API Gateway to AWS Lambda. The Lambda function either writes data into one of the Amazon DynamoDB databases for recording issues and/or sending issues or it sends the request to Lex. When sending issues, DynamoDB integrates with Trello to keep HR Managers abreast of the escalated issues. Once the request data is sent from Lambda to Lex, Lex processes the utterance and calls another Lambda function that integrates with the ADP API and it calls ADP data from within the ADP VPC, which runs on Amazon Virtual Private Cloud (VPC).

Python and Node.js were the chosen languages for the development of the bots.

The ADP PI bot ecosystem has the following functional groupings:

Employee Functionality

  • Summarize Payrolls
  • Compare Payrolls
  • Escalate Issues
  • Evolve PI Bot

HR Manager Functionality

  • Bot Management
  • Audit and Feedback

DevOps Functionality

  • Reduce call volume in service centers (ADP PI Bot).
  • Track issues and generate reports (ADP PI Bot).
  • Monitor jobs for various environment (ADP PI DevOps Bot)
  • View job dashboards (ADP PI DevOps Bot)
  • Query job details (ADP PI DevOps Bot)

 

Summary

Let’s all wish all the winners of the AWS Chatbot Challenge hearty congratulations on their excellent projects.

You can review more details on the winning projects, as well as, all of the submissions to the AWS Chatbot Challenge at: https://awschatbot2017.devpost.com/submissions. If you are curious on the details of Chatbot challenge contest including resources, rules, prizes, and judges, you can review the original challenge website here:  https://awschatbot2017.devpost.com/.

Hopefully, you are just as inspired as I am to build your own chatbot using Lex and Lambda. For more information, take a look at the Amazon Lex developer guide or the AWS AI blog on Building Better Bots Using Amazon Lex (Part 1)

Chat with you soon!

Tara

What’s the Diff: Programs, Processes, and Threads

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/

let's talk about Threads

How often have you heard the term threading in relation to a computer program, but you weren’t exactly sure what it meant? How about processes? You likely understand that a thread is somehow closely related to a program and a process, but if you’re not a computer science major, maybe that’s as far as your understanding goes.

Knowing what these terms mean is absolutely essential if you are a programmer, but an understanding of them also can be useful to the average computer user. Being able to look at and understand the Activity Monitor on the Macintosh, the Task Manager on Windows, or Top on Linux can help you troubleshoot which programs are causing problems on your computer, or whether you might need to install more memory to make your system run better.

Let’s take a few minutes to delve into the world of computer programs and sort out what these terms mean. We’ll simplify and generalize some of the ideas, but the general concepts we cover should help clarify the difference between the terms.

Programs

First of all, you probably are aware that a program is the code that is stored on your computer that is intended to fulfill a certain task. There are many types of programs, including programs that help your computer function and are part of the operating system, and other programs that fulfill a particular job. These task-specific programs are also known as “applications,” and can include programs such as word processing, web browsing, or emailing a message to another computer.

Program

Programs are typically stored on disk or in non-volatile memory in a form that can be executed by your computer. Prior to that, they are created using a programming language such as C, Lisp, Pascal, or many others using instructions that involve logic, data and device manipulation, recurrence, and user interaction. The end result is a text file of code that is compiled into binary form (1’s and 0’s) in order to run on the computer. Another type of program is called “interpreted,” and instead of being compiled in advance in order to run, is interpreted into executable code at the time it is run. Some common, typically interpreted programming languages, are Python, PHP, JavaScript, and Ruby.

The end result is the same, however, in that when a program is run, it is loaded into memory in binary form. The computer’s CPU (Central Processing Unit) understands only binary instructions, so that’s the form the program needs to be in when it runs.

Perhaps you’ve heard the programmer’s joke, “There are only 10 types of people in the world, those who understand binary, and those who don’t.”

Binary is the native language of computers because an electrical circuit at its basic level has two states, on or off, represented by a one or a zero. In the common numbering system we use every day, base 10, each digit position can be anything from 0 to 9. In base 2 (or binary), each position is either a 0 or a 1. (In a future blog post we might cover quantum computing, which goes beyond the concept of just 1’s and 0’s in computing.)

Decimal—Base 10 Binary—Base 2
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001

How Processes Work

The program has been loaded into the computer’s memory in binary form. Now what?

An executing program needs more than just the binary code that tells the computer what to do. The program needs memory and various operating system resources that it needs in order to run. A “process” is what we call a program that has been loaded into memory along with all the resources it needs to operate. The “operating system” is the brains behind allocating all these resources, and comes in different flavors such as macOS, iOS, Microsoft Windows, Linux, and Android. The OS handles the task of managing the resources needed to turn your program into a running process.

Some essential resources every process needs are registers, a program counter, and a stack. The “registers” are data holding places that are part of the computer processor (CPU). A register may hold an instruction, a storage address, or other kind of data needed by the process. The “program counter,” also called the “instruction pointer,” keeps track of where a computer is in its program sequence. The “stack” is a data structure that stores information about the active subroutines of a computer program and is used as scratch space for the process. It is distinguished from dynamically allocated memory for the process that is known as “the heap.”

diagram of how processes work

There can be multiple instances of a single program, and each instance of that running program is a process. Each process has a separate memory address space, which means that a process runs independently and is isolated from other processes. It cannot directly access shared data in other processes. Switching from one process to another requires some time (relatively) for saving and loading registers, memory maps, and other resources.

This independence of processes is valuable because the operating system tries its best to isolate processes so that a problem with one process doesn’t corrupt or cause havoc with another process. You’ve undoubtedly run into the situation in which one application on your computer freezes or has a problem and you’ve been able to quit that program without affecting others.

How Threads Work

So, are you still with us? We finally made it to threads!

A thread is the unit of execution within a process. A process can have anywhere from just one thread to many threads.

Process vs. Thread

diagram of threads in a process over time

When a process starts, it is assigned memory and resources. Each thread in the process shares that memory and resources. In single-threaded processes, the process contains one thread. The process and the thread are one and the same, and there is only one thing happening.

In multithreaded processes, the process contains more than one thread, and the process is accomplishing a number of things at the same time (technically, it’s almost at the same time—read more on that in the “What about Parallelism and Concurrency?” section below).

diagram of single and multi-treaded process

We talked about the two types of memory available to a process or a thread, the stack and the heap. It is important to distinguish between these two types of process memory because each thread will have its own stack, but all the threads in a process will share the heap.

Threads are sometimes called lightweight processes because they have their own stack but can access shared data. Because threads share the same address space as the process and other threads within the process, the operational cost of communication between the threads is low, which is an advantage. The disadvantage is that a problem with one thread in a process will certainly affect other threads and the viability of the process itself.

Threads vs. Processes

So to review:

  1. The program starts out as a text file of programming code,
  2. The program is compiled or interpreted into binary form,
  3. The program is loaded into memory,
  4. The program becomes one or more running processes.
  5. Processes are typically independent of each other,
  6. While threads exist as the subset of a process.
  7. Threads can communicate with each other more easily than processes can,
  8. But threads are more vulnerable to problems caused by other threads in the same process.

Processes vs. Threads — Advantages and Disadvantages

Process Thread
Processes are heavyweight operations Threads are lighter weight operations
Each process has its own memory space Threads use the memory of the process they belong to
Inter-process communication is slow as processes have different memory addresses Inter-thread communication can be faster than inter-process communication because threads of the same process share memory with the process they belong to
Context switching between processes is more expensive Context switching between threads of the same process is less expensive
Processes don’t share memory with other processes Threads share memory with other threads of the same process

What about Concurrency and Parallelism?

A question you might ask is whether processes or threads can run at the same time. The answer is: it depends. On a system with multiple processors or CPU cores (as is common with modern processors), multiple processes or threads can be executed in parallel. On a single processor, though, it is not possible to have processes or threads truly executing at the same time. In this case, the CPU is shared among running processes or threads using a process scheduling algorithm that divides the CPU’s time and yields the illusion of parallel execution. The time given to each task is called a “time slice.” The switching back and forth between tasks happens so fast it is usually not perceptible. The terms parallelism (true operation at the same time) and concurrency (simulated operation at the same time), distinguish between the two type of real or approximate simultaneous operation.

diagram of concurrency and parallelism

Why Choose Process over Thread, or Thread over Process?

So, how would a programmer choose between a process and a thread when creating a program in which she wants to execute multiple tasks at the same time? We’ve covered some of the differences above, but let’s look at a real world example with a program that many of us use, Google Chrome.

When Google was designing the Chrome browser, they needed to decide how to handle the many different tasks that needed computer, communications, and network resources at the same time. Each browser window or tab communicates with multiple servers on the internet to retrieve text, programs, graphics, audio, video, and other resources, and renders that data for display and interaction with the user. In addition, the browser can open many windows, each with many tasks.

Google had to decide how to handle that separation of tasks. They chose to run each browser window in Chrome as a separate process rather than a thread or many threads, as is common with other browsers. Doing that brought Google a number of benefits. Running each window as a process protects the overall application from bugs and glitches in the rendering engine and restricts access from each rendering engine process to others and to the rest of the system. Isolating JavaScript programs in a process prevents them from running away with too much CPU time and memory, and making the entire browser non-responsive.

Google made the calculated trade-off with a multi-processing design as starting a new process for each browser window has a higher fixed cost in memory and resources than using threads. They were betting that their approach would end up with less memory bloat overall.

Using processes instead of threads provides better memory usage when memory gets low. An inactive window is treated as a lower priority by the operating system and becomes eligible to be swapped to disk when memory is needed for other processes, helping to keep the user-visible windows more responsive. If the windows were threaded, it would be more difficult to separate the used and unused memory as cleanly, wasting both memory and performance.

You can read more about Google’s design decisions on Google’s Chromium Blog or on the Chrome Introduction Comic.

The screen capture below shows the Google Chrome processes running on a MacBook Air with many tabs open. Some Chrome processes are using a fair amount of CPU time and resources, and some are using very little. You can see that each process also has many threads running as well.

activity monitor of Google Chrome

The Activity Monitor or Task Manager on your system can be a valuable ally in helping fine-tune your computer or troubleshooting problems. If your computer is running slowly, or a program or browser window isn’t responding for a while, you can check its status using the system monitor. Sometimes you’ll see a process marked as “Not Responding.” Try quitting that process and see if your system runs better. If an application is a memory hog, you might consider choosing a different application that will accomplish the same task.

Windows Task Manager view

Made it This Far?

We hope this Tron-like dive into the fascinating world of computer programs, processes, and threads has helped clear up some questions you might have had.

The next time your computer is running slowly or an application is acting up, you know your assignment. Fire up the system monitor and take a look under the hood to see what’s going on. You’re in charge now.

We love to hear from you

Are you still confused? Have questions? If so, please let us know in the comments. And feel free to suggest topics for future blog posts.

The post What’s the Diff: Programs, Processes, and Threads appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

TVAddons Returns, But in Ugly War With Canadian Telcos Over Kodi Addons

Post Syndicated from Andy original https://torrentfreak.com/tvaddons-returns-ugly-war-canadian-telcos-kodi-addons-170801/

After Dish Network filed a lawsuit against TVAddons in Texas, several high-profile Kodi addons took the decision to shut down. Soon after, TVAddons itself went offline.

In the weeks that followed, several TVAddons-related domains were signed over (1,2) to a Canadian law firm, a mysterious situation that didn’t dovetail well with the US-based legal action.

TorrentFreak can now reveal that the shutdown of TVAddons had nothing to do with the US action and everything to do with a separate lawsuit filed in Canada.

The complaint against TVAddons

Two months ago on June 2, a collection of Canadian telecoms giants including Bell Canada, Bell ExpressVu, Bell Media, Videotron, Groupe TVA, Rogers Communications and Rogers Media, filed a complaint in Federal Court against Montreal resident, Adam Lackman, the man behind TVAddons.

The 18-page complaint details the plaintiffs’ case against Lackman, claiming that he communicated copyrighted TV shows including Game of Thrones, Prison Break, The Big Bang Theory, America’s Got Talent, Keeping Up With The Kardashians and dozens more, to the public in breach of copyright.

The key claim is that Lackman achieved this by developing, hosting, distributing or promoting Kodi add-ons.

Adam Lackman, the man behind TVAddons (@adam.lackman on Instagram)

A total of 18 major add-ons are detailed in the complaint including 1Channel, Exodus, Phoenix, Stream All The Sources, SportsDevil, cCloudTV and Alluc, to name a few. Also under the spotlight is the ‘FreeTelly’ custom Kodi build distributed by TVAddons alongside its Kodi configuration tool, Indigo.

“[The defendant] has made the [TV shows] available to the public by telecommunication in a way that allows members of the public to have access to them from a place and at a time individually chosen by them…consequently infringing the Plaintiffs’ copyright…in contravention of sections 2.4(1.1), 3(1)(f) and 27(1) of the Copyright Act,” the complaint reads.

The complaint alleges that Lackman “induced and/or authorized users” of the FreeTelly and Indigo tools to carry out infringement by his handling and promotion of infringing add-ons, including through TVAddons.ag and Offshoregit.com, in contravention of sections 3(1)(f) and 27(1) of the Copyright Act.

“Approximately 40 million unique users located around the world are actively using Infringing Addons hosted by TVAddons every month, and approximately 900,000 Canadian households use Infringing Add-ons to access television content. The amount of users of Infringing add-ons hosted TVAddons is constantly increasing,” the complaint adds.

To limit the harm allegedly caused by TVAddons, the complaint asked for interim, interlocutory, and permanent injunctions restraining Lackman and associates from developing, promoting or distributing any of the allegedly infringing add-ons or software. On top, the plaintiffs requested punitive and exemplary damages, plus costs.

The interim injunction and Anton Piller Order

Following the filing of the complaint, on June 9 the Federal Court handed down a time-limited interim injunction against Lackman which restrained him from various activities in respect of TVAddons. The process took place ex parte, meaning in secret, without Lackman being able to mount a defense.

The Court also authorized a bailiff and computer forensics experts to take control of Internet domains including TVAddons.ag and Offshoregit.com plus social media and hosting provider accounts for a period of 14 days. These were transferred to Daniel Drapeau at DrapeauLex, an independent court-appointed supervising counsel.

The order also contained an Anton Piller order, a civil search warrant that grants plaintiffs no-notice permission to enter a defendant’s premises in order to secure and copy evidence to support their case, before it can be destroyed or tampered with.

The order covered not only data related to the TVAddons platform, such as operating and financial details, revenues, and banking information, but everything in Lackman’s possession.

The Court ordered the telecoms companies to inform Lackman that the case against him is a civil proceeding and that he could deny entry to his property if he wished. However, that option would put him in breach of the order and would place him at risk of being fined or even imprisoned. Catch 22 springs to mind.

The Court did, however, put limits on the number of people that could be present during the execution of the Anton Piller order (ostensibly to avoid intimidation) and ordered the plaintiffs to deposit CAD$50,000 with the Court, in case the order was improperly executed. That decision would later prove an important one.

The search and interrogation of TVAddons’ operator

On June 12, the order was executed and Lackman’s premises were searched for more than 16 hours. For nine hours he was interrogated and effectively denied his right to remain silent since non-cooperation with an Anton Piller order amounts to contempt of court. The Court’s stated aim of not intimidating Lackman failed.

The TVAddons operator informs TorrentFreak that he heard a disturbance in the hallway outside and spotted several men hiding on the other side of the door. Fearing for his life, Lackman called the police and when they arrived he opened the door. At this point, the police were told by those in attendance to leave, despite Lackman’s protests.

Once inside, Lackman was told he had an hour to find a lawyer, but couldn’t use any electronic device to get one. Throughout the entire day, Lackman says he was reminded by the plaintiffs’ lawyer that he could be held in contempt of court and jailed, even though he was always cooperating.

“I had to sit there and not leave their sight. I was denied access to medication,” Lackman told TorrentFreak. “I had a doctor’s appointment I was forced to miss. I wasn’t even allowed to call and cancel.”

In papers later filed with the court by Lackman’s team, the Anton Piller order was described as a “bombe atomique” since TVAddons had never been served with so much as a copyright takedown notice in advance of this action.

The Anton Piller controversy

Anton Piller orders are only valid when passing a three-step test: when there is a strong prima facie case against the respondent, the damage – potential or actual – is serious for the applicant, and when there is a real possibility that evidence could be destroyed.

For Bell Canada, Bell ExpressVu, Bell Media, Videotron, Groupe TVA, Rogers Communications and Rogers Media, serious problems emerged on at least two of these points after the execution of the order.

For example, TVAddons carried more than 1,500 add-ons yet only 1% of those add-ons were considered to be infringing, a tiny number in the overall picture. Then there was the not insignificant problem with the exchange that took place during the hearing to obtain the order, during which Lackman was not present.

Clearly, the securing of existing evidence wasn’t the number one priority.

Plaintiffs: We want to destroy TVAddons

And the problems continued.

No right to remain silent, no right to consult a lawyer

The Anton Piller search should have been carried out between 8am and 8pm but actually carried on until midnight. As previously mentioned, Adam Lackman was effectively denied his right to remain silent and was forbidden from getting advice from his lawyer.

None of this sat well with the Honourable B. Richard Bell during a subsequent Federal Court hearing to consider the execution of the Anton Piller order.

“It is important to note that the Defendant was not permitted to refuse to answer questions under fear of contempt proceedings, and his counsel was not permitted to clarify the answers to questions. I conclude unhesitatingly that the Defendant was subjected to an examination for discovery without any of the protections normally afforded to litigants in such circumstances,” the Judge said.

“Here, I would add that the ‘questions’ were not really questions at all. They took the form of orders or directions. For example, the Defendant was told to ‘provide to the bailiff’ or ‘disclose to the Plaintiffs’ solicitors’.”

Evidence preservation? More like a fishing trip

But shockingly, the interrogation of Lackman went much, much further. TorrentFreak understands that the TVAddons operator was given a list of 30 names of people that might be operating sites or services similar to TVAddons. He was then ordered to provide all of the information he had on those individuals.

Of course, people tend to guard their online identities so it’s possible that the information provided by Lackman will be of limited use, but Judge Bell was not happy that the Anton Piller order was abused by the plaintiffs in this way.

“I conclude that those questions, posed by Plaintiffs’ counsel, were solely made in furtherance of their investigation and constituted a hunt for further evidence, as opposed to the preservation of then existing evidence,” he wrote in a June 29 order.

But he was only just getting started.

Plaintiffs unlawfully tried to destroy TVAddons before trial

The Judge went on to note that from their own mouths, the Anton Piller order was purposely designed by the plaintiffs to completely shut down TVAddons, despite the fact that only a tiny proportion of the add-ons available on the site were allegedly used to infringe copyright.

“I am of the view that [the order’s] true purpose was to destroy the livelihood of the Defendant, deny him the financial resources to finance a defense to the claim made against him, and to provide an opportunity for discovery of the Defendant in circumstances where none of the procedural safeguards of our civil justice system could be engaged,” Judge Bell wrote.

As noted, plaintiffs must also have a “strong prima facie case” to obtain an Anton Piller order but Judge Bell says he’s not convinced that one exists. Instead, he praised the “forthright manner” of Lackman, who successfully compared the ability of Kodi addons to find content in the same way as Google search can.

So why the big turn around?

Judge Bell said that while the prima facie case may have appeared strong before the judge who heard the matter ex parte (without Lackman being present to defend himself), the subsequent adversarial hearing undermined it, to the point that it no longer met the threshold.

As a result of these failings, Judge Bell declared the Anton Piller order unlawful. Things didn’t improve for the plaintiffs on the injunction front either.

The Judge said that he believes that Lackman has “an arguable case” that he is not violating the Copyright Act by merely providing addons and that TVAddons is his only source of income. So, if an injunction to close the site was granted, the litigation would effectively be over, since the plaintiffs already admitted that their aim was to neutralize the platform.

If the platform was neutralized, Lackman could no longer earn money from the site, which would harm his ability to mount a defense.

“In considering the balance of convenience, I also repeat that the plaintiffs admit that the vast majority of add-ons are non-infringing. Whether the remaining approximately 1% are infringing is very much up for debate. For these reasons, I find the balance of convenience favors the defendant, and no interlocutory injunction will be issued,” the Judge declared.

With the Anton Piller order declared unlawful and no interlocutory injunction (one effective until the final determination of the case) handed down, things were about to get worse for the telecoms companies.

They had paid CAD$50,000 to the court in security in case things went wrong with the Anton Piller order, so TVAddons was entitled to compensation from that amount. That would be helpful, since at this point TVAddons had already run up CAD$75,000 in legal expenses.

On top, the Judge told independent counsel to give everything seized during the Anton Piller search back to Lackman.

The order to return items previously seized

But things were far from over. Within days, the telecoms companies took the decision to the Court of Appeal, asking for a stay of execution (a delay in carrying out a court order) to retain possession of items seized, including physical property, domains, and social media accounts.

Mid-July the appeal was granted and certain confidentiality clauses affecting independent counsel (including Daniel Drapeau, who holds the TVAddons’ domains) were ordered to be continued. However, considering the problems with the execution of the Anton Piller order, Bell Canada, TVA, Videotron and Rogers et al, were ordered to submit an additional security bond of CAD$140,000, on top of the CAD$50,000 already deposited.

So the battle continues, and continue it will

Speaking with TorrentFreak, Adam Lackman says that he has no choice but to fight the telcoms companies since not doing so would result in a loss by default judgment. Interestingly, both he and one of the judges involved in the case thus far believe he has an arguable case.

Lackman says that his activities are protected under the Canadian Copyright Act, specifically subparagraph 2.4(1)(b) which states as follows:

A person whose only act in respect of the communication of a work or other subject-matter to the public consists of providing the means of telecommunication necessary for another person to so communicate the work or other subject-matter does not communicate that work or other subject-matter to the public;

Of course, finding out whether that’s indeed the case will be a costly endeavor.

“It all comes down to whether we will have the financial resources necessary to mount our defense and go to trial. We won’t have ad revenue coming in, since losing our domain names means that we’ll lose the majority of our traffic for quite some time into the future,” Lackman told TF in a statement.

“We’re hoping that others will be as concerned as us about big companies manipulating the law in order to shut down what they see as competition. We desperately need help in financially supporting our legal defense, we cannot do it alone.

“We’ve run up a legal bill of over $100,000 to date. We’re David, and they are four Goliaths with practically unlimited resources. If we lose, it will mean that new case law is made, case law that could mean increased censorship of the internet.”

In the hope of getting support, TVAddons has launched a fundraiser campaign and in the meantime, a new version of the site is back on a new domain, TVAddons.co.

Given TVAddons’ line of defense, the nature of both the platform and Kodi addons, and the fact that there has already been a serious abuse of process during evidence preservation, this is now one of the most interesting and potentially influential copyright cases underway anywhere today.

TVAddons is being represented by Éva Richard , Hilal Ayoubi and Karim Renno in Canada, plus Erin Russell and Jason Sweet in the United States.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Run Common Data Science Packages on Anaconda and Oozie with Amazon EMR

Post Syndicated from John Ohle original https://aws.amazon.com/blogs/big-data/run-common-data-science-packages-on-anaconda-and-oozie-with-amazon-emr/

In the world of data science, users must often sacrifice cluster set-up time to allow for complex usability scenarios. Amazon EMR allows data scientists to spin up complex cluster configurations easily, and to be up and running with complex queries in a matter of minutes.

Data scientists often use scheduling applications such as Oozie to run jobs overnight. However, Oozie can be difficult to configure when you are trying to use popular Python packages (such as “pandas,” “numpy,” and “statsmodels”), which are not included by default.

One such popular platform that contains these types of packages (and more) is Anaconda. This post focuses on setting up an Anaconda platform on EMR, with an intent to use its packages with Oozie. I describe how to run jobs using a popular open source scheduler like Oozie.

Walkthrough

For this post, you walk through the following tasks:

  • Create an EMR cluster.
  • Download Anaconda on your master node.
  • Configure Oozie.
  • Test the steps.

Create an EMR cluster

Spin up an Amazon EMR cluster using the console or the AWS CLI. Use the latest release, and include Apache Hadoop, Apache Spark, Apache Hive, and Oozie.

To create a three-node cluster in the us-east-1 region, issue an AWS CLI command such as the following. This command must be typed as one line, as shown below. It is shown here separated for readability purposes only.

aws emr create-cluster \ 
--release-label emr-5.7.0 \ 
 --name '<YOUR-CLUSTER-NAME>' \
 --applications Name=Hadoop Name=Oozie Name=Spark Name=Hive \ 
 --ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","SubnetId":"<YOUR-SUBNET-ID>","EmrManagedSlaveSecurityGroup":"<YOUR-EMR-SLAVE-SECURITY-GROUP>","EmrManagedMasterSecurityGroup":"<YOUR-EMR-MASTER-SECURITY-GROUP>"}' \ 
 --use-default-roles \ 
 --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Master - 1"},{"InstanceCount":<YOUR-CORE-INSTANCE-COUNT>,"InstanceGroupType":"CORE","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Core - 2"}]'

One-line version for reference:

aws emr create-cluster --release-label emr-5.7.0 --name '<YOUR-CLUSTER-NAME>' --applications Name=Hadoop Name=Oozie Name=Spark Name=Hive --ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","SubnetId":"<YOUR-SUBNET-ID>","EmrManagedSlaveSecurityGroup":"<YOUR-EMR-SLAVE-SECURITY-GROUP>","EmrManagedMasterSecurityGroup":"<YOUR-EMR-MASTER-SECURITY-GROUP>"}' --use-default-roles --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Master - 1"},{"InstanceCount":<YOUR-CORE-INSTANCE-COUNT>,"InstanceGroupType":"CORE","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Core - 2"}]'

Download Anaconda

SSH into your EMR master node instance and download the official Anaconda installer:

wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh

At the time of publication, Anaconda 4.4 is the most current version available. For the download link location for the latest Python 2.7 version (Python 3.6 may encounter issues), see https://www.continuum.io/downloads.  Open the context (right-click) menu for the Python 2.7 download link, choose Copy Link Location, and use this value in the previous wget command.

This post used the Anaconda 4.4 installation. If you have a later version, it is shown in the downloaded filename:  “anaconda2-<version number>-Linux-x86_64.sh”.

Run this downloaded script and follow the on-screen installer prompts.

chmod u+x Anaconda2-4.4.0-Linux-x86_64.sh
./Anaconda2-4.4.0-Linux-x86_64.sh

For an installation directory, select somewhere with enough space on your cluster, such as “/mnt/anaconda/”.

The process should take approximately 1–2 minutes to install. When prompted if you “wish the installer to prepend the Anaconda2 install location”, select the default option of [no].

After you are done, export the PATH to include this new Anaconda installation:

export PATH=/mnt/anaconda/bin:$PATH

Zip up the Anaconda installation:

cd /mnt/anaconda/
zip -r anaconda.zip .

The zip process may take 4–5 minutes to complete.

(Optional) Upload this anaconda.zip file to your S3 bucket for easier inclusion into future EMR clusters. This removes the need to repeat the previous steps for future EMR clusters.

Configure Oozie

Next, you configure Oozie to use Pyspark and the Anaconda platform.

Get the location of your Oozie sharelibupdate folder. Issue the following command and take note of the “sharelibDirNew” value:

oozie admin -sharelibupdate

For this post, this value is “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136”.

Pass in the required Pyspark files into Oozies sharelibupdate location. The following files are required for Oozie to be able to run Pyspark commands:

  • pyspark.zip
  • py4j-0.10.4-src.zip

These are located on the EMR master instance in the location “/usr/lib/spark/python/lib/”, and must be put into the Oozie sharelib spark directory. This location is the value of the sharelibDirNew parameter value (shown above) with “/spark/” appended, that is, “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/”.

To do this, issue the following commands:

hdfs dfs -put /usr/lib/spark/python/lib/py4j-0.10.4-src.zip hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/
hdfs dfs -put /usr/lib/spark/python/lib/pyspark.zip hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/

After you’re done, Oozie can use Pyspark in its processes.

Pass the anaconda.zip file into HDFS as follows:

hdfs dfs -put /mnt/anaconda/anaconda.zip /tmp/myLocation/anaconda.zip

(Optional) Verify that it was transferred successfully with the following command:

hdfs dfs -ls /tmp/myLocation/

On your master node, execute the following command:

export PYSPARK_PYTHON=/mnt/anaconda/bin/python

Set the PYSPARK_PYTHON environment variable on the executor nodes. Put the following configurations in your “spark-opts” values in your Oozie workflow.xml file:

–conf spark.executorEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
–conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python

This is referenced from the Oozie job in the following line in your workflow.xml file, also included as part of your “spark-opts”:

--archives hdfs:///tmp/myLocation/anaconda.zip#anaconda_remote

Your Oozie workflow.xml file should now look something like the following:

<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
<start to="start_spark" />
<action name="start_spark">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="/tmp/test/spark_oozie_test_out3"/>
        </prepare>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>SparkJob</name>
        <class>clear</class>
        <jar>hdfs:///user/oozie/apps/myPysparkProgram.py</jar>
        <spark-opts>--queue default
            --conf spark.ui.view.acls=*
            --executor-memory 2G --num-executors 2 --executor-cores 2 --driver-memory 3g
            --conf spark.executorEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
            --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
            --archives hdfs:///tmp/myLocation/anaconda.zip#anaconda_remote
        </spark-opts>
    </spark>
    <ok to="end"/>
    <error to="kill"/>
</action>
        <kill name="kill">
                <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
</workflow-app>

Test steps

To test this out, you can use the following job.properties and myPysparkProgram.py file, along with the following steps:

job.properties

masterNode ip-xxx-xxx-xxx-xxx.us-east-1.compute.internal
nameNode hdfs://${masterNode}:8020
jobTracker ${masterNode}:8032
master yarn
mode cluster
queueName default
oozie.libpath ${nameNode}/user/oozie/share/lib
oozie.use.system.libpath true
oozie.wf.application.path ${nameNode}/user/oozie/apps/

Note: You can get your master node IP address (denoted as “ip-xxx-xxx-xxx-xxx” here) from the value for the sharelibDirNew parameter noted earlier.

myPysparkProgram.py

from pyspark import SparkContext, SparkConf
import numpy
import sys

conf = SparkConf().setAppName('myPysparkProgram')
sc = SparkContext(conf=conf)

rdd = sc.textFile("/user/hadoop/input.txt")

x = numpy.sum([3,4,5]) #total = 12

rdd = rdd.map(lambda line: line + str(x))
rdd.saveAsTextFile("/user/hadoop/output")

Put the “myPysparkProgram.py” into the location mentioned between the “<jar>xxxxx</jar>” tags in your workflow.xml. In this example, the location is “hdfs:///user/oozie/apps/”. Use the following command to move the “myPysparkProgram.py” file to the correct location:

hdfs dfs -put myPysparkProgram.py /user/oozie/apps/

Put the above workflow.xml file into the “/user/oozie/apps/” location in hdfs:

hdfs dfs –put workflow.xml /user/oozie/apps/

Note: The job.properties file is run locally from the EMR master node.

Create a sample input.txt file with some data in it. For example:

input.txt

This is a sentence.
So is this. 
This is also a sentence.

Put this file into hdfs:

hdfs dfs -put input.txt /user/hadoop/

Execute the job in Oozie with the following command. This creates an Oozie job ID.

oozie job -config job.properties -run

You can check the Oozie job state with the command:

oozie job -info <Oozie job ID>

  1. When the job is successfully finished, the results are located at:
/user/hadoop/output/part-00000
/user/hadoop/output/part-00001

  1. Run the following commands to view the output:
hdfs dfs -cat /user/hadoop/output/part-00000
hdfs dfs -cat /user/hadoop/output/part-00001

The output will be:

This is a sentence. 12
So is this 12
This is also a sentence 12

Summary

The myPysparkProgram.py has successfully imported the numpy library from the Anaconda platform and has produced some output with it. If you tried to run this using standard Python, you’d encounter the following error:

Now when your Python job runs in Oozie, any imported packages that are implicitly imported by your Pyspark script are imported into your job within Oozie directly from the Anaconda platform. Simple!

If you have questions or suggestions, please leave a comment below.


Additional Reading

Learn how to use Apache Oozie workflows to automate Apache Spark jobs on Amazon EMR.

 


About the Author

John Ohle is an AWS BigData Cloud Support Engineer II for the BigData team in Dublin. He works to provide advice and solutions to our customers on their Big Data projects and workflows on AWS. In his spare time, he likes to play music, learn, develop tools and write documentation to further help others – both colleagues and customers alike.

 

 

 

Hardcore UK Pirates Dwindle But Illegal Streaming Poses New Threat

Post Syndicated from Andy original https://torrentfreak.com/hardcore-uk-pirates-dwindle-but-illegal-streaming-poses-new-threat-170707/

For as many years as ‘pirate’ services have been online it has been clear that licensed services need to aggressively compete to stay in the game.

Both the music and movie industries were initially slow to get off the mark but in recent years the position has changed. Licensed services such as Spotify and Netflix are now household names and doing well, even among people who have traditionally consumed illicit content.

This continuing trend was highlighted again this morning in a press release by the UK’s Intellectual Property Office. In a fairly upbeat tone, the IPO notes that innovative streaming models offered by both Netflix and Spotify are helping to keep online infringement stable in the UK.

“The Online Copyright Infringement (OCI) Tracker, commissioned by the UK Intellectual Property Office (IPO), has revealed that 15 per cent of UK internet users, approximately 7 million people, either stream or download material that infringes copyright,” the IPO reports.

The full tracking report, which is now on its 7th wave, is yet to be released but the government has teased a few interesting stats. While the 7 million infringer number is mostly unchanged from last year, the mix of hardcore (only use infringing sources) and casual infringers (also use legal sources) has changed.

“Consumers accessing exclusively free content is at an all-time low,” the IPO reveals, noting that legitimate streaming is also on the up, with Spotify increasing its userbase by 7% since 2016.

But despite the positive signs, the government says that there are concerns surrounding illicit streaming, both of music and video content. Unsurprisingly, ‘pirate’ set-top boxes get a prominent mention and are labeled a threat to positive trends.

“Illicitly adapted set top boxes, which allow users to illegally stream premium TV content such as blockbuster movies, threaten to undermine recent progress. 13 per cent of online infringers are using streaming boxes that can be easily adapted to stream illicit content,” the IPO says.

Again, since the report hasn’t yet been published, there are currently no additional details to be examined. However, the “boxes that can be easily adapted” comment could easily reference Amazon Firesticks, for example, that are currently being used for entirely legitimate means.

The IPO notes that an IPTV consultation is underway which may provide guidance on how the devices can be dealt with in the future. A government response is due to be published later in the summer.

Also heavily on the radar is a fairly steep reported increase in stream-ripping, which is the unlicensed downloading of music from streaming sources so that it can be kept on a user’s hard drive or device.

A separate report, commissioned by the IPO and PRS for Music, reveals that 15% of Internet users have stream-ripped in some way and the use of ripping services is on the up.

“The use of stream-ripping websites increased by 141.3% between 2014 and 2016,” the IPO notes.

“In a survey of over 9000 people, 57% of UK adults claimed to be aware of stream-ripping services. Those who claimed to have used a stream-ripping service were significantly more likely to be male and between the ages of 16 to 34 years.”

PRS goes into a little more detail, claiming that stream-ripping is now “the most prevalent and fastest growing form of music piracy in the UK.” The music licensing outfit claims that almost 70% of music-specific infringement is accounted for by stream-ripping.

The survey, carried out by INCOPRO and Kantar Media, looked at 80 stream-ripping services, which included apps, websites, browser plug-ins and other stand-alone software. Each supplied content from a range of sources including SoundCloud, Spotify and Deezer, but YouTube was found to be the most popular source, accounting for 75 of the 80 services.

There are several reported motivations for users to stream-rip but interestingly the number one reason involves what some people consider to be ‘honest’ piracy. A total of 31% of stream-rippers said that since they already own the music, and only use ripping services to obtain it in another format.

Just over a quarter (26%) said they wanted to listen to music while not connected to the Internet while 25% said that a permanent copy helps them while on the move. Around one in five people who stream-rip say that music is either unaffordable or overpriced.

“We hope that this research will provide the basis for a renewed and re-focused commitment to tackling online copyright infringement,” says Robert Ashcroft, Chief Executive, PRS for Music.

“The long term health of the UK’s cultural and creative sectors is in everyone’s best interests, including those of the digital service providers, and a co-ordinated industry and government approach to tackling stream ripping is essential.”

Ros Lynch, Copyright and IP Enforcement Director at the IPO, took the opportunity to praise the widespread use of legitimate platforms. However, he also noted that innovation also continues in piracy circles, with stream-ripping a prime example.

“It’s great that legal streaming sites continue to be a hugely popular choice for consumers. The success and popularity of these platforms show the importance of evolution and innovation in the entertainment industry,” Lynch said.

“Ironically it is innovation that also benefits those looking to undermine IP rights and benefit financially from copyright infringement. There has never been more choice or flexibility for consumers of TV and music, however illicit streaming devices and stream-ripping are threatening this progress.”

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

How to Increase the Redundancy and Performance of Your AWS Directory Service for Microsoft AD Directory by Adding Domain Controllers

Post Syndicated from Peter Pereira original https://aws.amazon.com/blogs/security/how-to-increase-the-redundancy-and-performance-of-your-aws-directory-service-for-microsoft-ad-directory-by-adding-domain-controllers/

You can now increase the redundancy and performance of your AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as AWS Microsoft AD, directory by deploying additional domain controllers. Adding domain controllers increases redundancy, resulting in even greater resilience and higher availability. This new capability enables you to have at least two domain controllers operating, even if an Availability Zone were to be temporarily unavailable. The additional domain controllers also improve the performance of your applications by enabling directory clients to load-balance their requests across a larger number of domain controllers. For example, AWS Microsoft AD enables you to use larger fleets of Amazon EC2 instances to run .NET applications that perform frequent user attribute lookups.

AWS Microsoft AD is a highly available, managed Active Directory built on actual Microsoft Windows Server 2012 R2 in the AWS Cloud. When you create your AWS Microsoft AD directory, AWS deploys two domain controllers that are exclusively yours in separate Availability Zones for high availability. Now, you can deploy additional domain controllers easily via the Directory Service console or API, by specifying the total number of domain controllers that you want.

AWS Microsoft AD distributes the additional domain controllers across the Availability Zones and subnets within the Amazon VPC where your directory is running. AWS deploys the domain controllers, configures them to replicate directory changes, monitors for and repairs any issues, performs daily snapshots, and updates the domain controllers with patches. This reduces the effort and complexity of creating and managing your own domain controllers in the AWS Cloud.

In this blog post, I create an AWS Microsoft AD directory with two domain controllers in each Availability Zone. This ensures that I always have at least two domain controllers operating, even if an entire Availability Zone were to be temporarily unavailable. To accomplish this, first I create an AWS Microsoft AD directory with one domain controller per Availability Zone, and then I deploy one additional domain controller per Availability Zone.

Solution architecture

The following diagram shows how AWS Microsoft AD deploys all the domain controllers in this solution after you complete Steps 1 and 2. In Step 1, AWS Microsoft AD deploys the two required domain controllers across multiple Availability Zones and subnets in an Amazon VPC. In Step 2, AWS Microsoft AD deploys one additional domain controller per Availability Zone and subnet.

Solution diagram

Step 1: Create an AWS Microsoft AD directory

First, I create an AWS Microsoft AD directory in an Amazon VPC. I can add domain controllers only after AWS Microsoft AD configures my first two required domain controllers. In my example, my domain name is example.com.

When I create my directory, I must choose the VPC in which to deploy my directory (as shown in the following screenshot). Optionally, I can choose the subnets in which to deploy my domain controllers, and AWS Microsoft AD ensures I select subnets from different Availability Zones. In this case, I have no subnet preference, so I choose No Preference from the Subnets drop-down list. In this configuration, AWS Microsoft AD selects subnets from two different Availability Zones to deploy the directory.

Screenshot of choosing the VPC in which to create the directory

I then choose Next Step to review my configuration, and then choose Create Microsoft AD. It takes approximately 40 minutes for my domain controllers to be created. I can check the status from the AWS Directory Service console, and when the status is Active, I can add my two additional domain controllers to the directory.

Step 2: Deploy two more domain controllers in the directory

Now that I have created an AWS Microsoft AD directory and it is active, I can deploy two additional domain controllers in the directory. AWS Microsoft AD enables me to add domain controllers through the Directory Service console or API. In this post, I use the console.

To deploy two more domain controllers in the directory:

  1. I open the AWS Management Console, choose Directory Service, and then choose the Microsoft AD Directory ID. In my example, my recently created directory is example.com, as shown in the following screenshot.Screenshot of choosing the Directory ID
  2. I choose the Domain controllers tab next. Here I can see the two domain controllers that AWS Microsoft AD created for me in Step 1. It also shows the Availability Zones and subnets in which AWS Microsoft AD deployed the domain controllers.Screenshot showing the domain controllers, Availability Zones, and subnets
  3. I then choose Modify on the Domain controllers tab. I specify the total number of domain controllers I want by choosing the subtract and add buttons. In my example, I want four domain controllers in total for my directory.Screenshot showing how to specify the total number of domain controllers
  4. I choose Apply. AWS Microsoft AD deploys the two additional domain controllers and distributes them evenly across the Availability Zones and subnets in my Amazon VPC. Within a few seconds, I can see the Availability Zones and subnets in which AWS Microsoft AD deployed my two additional domain controllers with a status of Creating (see the following screenshot). While AWS Microsoft AD deploys the additional domain controllers, my directory continues to operate by using the active domain controllers—with no disruption of service.
    Screenshot of two additional domain controllers with a status of "Creating"
  5. When AWS Microsoft AD completes the deployment steps, all domain controllers are in Active status and available for use by my applications. As a result, I have improved the redundancy and performance of my directory.

Note: After deploying additional domain controllers, I can reduce the number of domain controllers by repeating the modification steps with a lower number of total domain controllers. Unless a directory is deleted, AWS Microsoft AD does not allow fewer than two domain controllers per directory in order to deliver fault tolerance and high availability.

Summary

In this blog post, I demonstrated how to deploy additional domain controllers in your AWS Microsoft AD directory. By adding domain controllers, you increase the redundancy and performance of your directory, which makes it easier for you to migrate and run mission-critical Active Directory–integrated workloads in the AWS Cloud without having to deploy and maintain your own AD infrastructure.

To learn more about AWS Directory Service, see the AWS Directory Service home page. If you have questions, post them on the Directory Service forum.

– Peter

Under the Hood of Server-Side Encryption for Amazon Kinesis Streams

Post Syndicated from Damian Wylie original https://aws.amazon.com/blogs/big-data/under-the-hood-of-server-side-encryption-for-amazon-kinesis-streams/

Customers are using Amazon Kinesis Streams to ingest, process, and deliver data in real time from millions of devices or applications. Use cases for Kinesis Streams vary, but a few common ones include IoT data ingestion and analytics, log processing, clickstream analytics, and enterprise data bus architectures.

Within milliseconds of data arrival, applications (KCL, Apache Spark, AWS Lambda, Amazon Kinesis Analytics) attached to a stream are continuously mining value or delivering data to downstream destinations. Customers are then scaling their streams elastically to match demand. They pay incrementally for the resources that they need, while taking advantage of a fully managed, serverless streaming data service that allows them to focus on adding value closer to their customers.

These benefits are great; however, AWS learned that many customers could not take advantage of Kinesis Streams unless their data-at-rest within a stream was encrypted. Many customers did not want to manage encryption on their own, so they asked for a fully managed, automatic, server-side encryption mechanism leveraging centralized AWS Key Management Service (AWS KMS) customer master keys (CMK).

Motivated by this feedback, AWS added another fully managed, low cost aspect to Kinesis Streams by delivering server-side encryption via KMS managed encryption keys (SSE-KMS) in the following regions:

  • US East (N. Virginia)
  • US West (Oregon)
  • US West (N. California)
  • EU (Ireland)
  • Asia Pacific (Singapore)
  • Asia Pacific (Tokyo)

In this post, I cover the mechanics of the Kinesis Streams server-side encryption feature. I also share a few best practices and considerations so that you can get started quickly.

Understanding the mechanics

The following section walks you through how Kinesis Streams uses CMKs to encrypt a message in the PutRecord or PutRecords path before it is propagated to the Kinesis Streams storage layer, and then decrypt it in the GetRecords path after it has been retrieved from the storage layer.

When server-side encryption is enabled—which takes just a few clicks in the console—the partition key and payload for every incoming record is encrypted automatically as it’s flowing into Kinesis Streams, using the selected CMK. When data is at rest within a stream, it’s encrypted.

When records are retrieved through a GetRecords request from the encrypted stream, they are decrypted automatically as they are flowing out of the service. That means your Kinesis Streams producers and consumers do not need to be aware of encryption. You have a fully managed data encryption feature at your fingertips, which can be enabled within seconds.

AWS also makes it easy to audit the application of server-side encryption. You can use the AWS Management Console for instant stream-level verification; the responses from PutRecord, PutRecords, and getRecords; or AWS CloudTrail.

Calling PutRecord or PutRecords

When server-side encryption is enabled for a particular stream, Kinesis Streams and KMS perform the following actions when your applications call PutRecord or PutRecords on a stream with server-side encryption enabled. The Amazon Kinesis Producer Library (KPL) uses PutRecords.

 

  1. Data is sent from a customer’s producer (client) to a Kinesis stream using TLS via HTTPS. Data in transit to a stream is encrypted by default.
  2. After data is received, it is momentarily stored in RAM within a front-end proxy layer.
  3. Kinesis Streams authenticates the producer, then impersonates the producer to request input keying material from KMS.
  4. KMS creates key material, encrypts it by using CMK, and sends both the plaintext and encrypted key material to the service, encrypted with TLS.
  5. The client uses the plaintext key material to derive data encryption keys (data keys) that are unique per-record.
  6. The client encrypts the payload and partition key using the data key in RAM within the front-end proxy layer and removes the plaintext data key from memory.
  7. The client appends the encrypted key material to the encrypted data.
  8. The plaintext key material is securely cached in memory within the front-end layer for reuse, until it expires after 5 minutes.
  9. The client delivers the encrypted message to a back-end store where it is stored at rest and fetchable by an authorized consumer through a GetRecords The Amazon Kinesis Client Library (KCL) calls GetRecords to retrieve records from a stream.

Calling getRecords

Kinesis Streams and KMS perform the following actions when your applications call GetRecords on a server-side encrypted stream.

 

  1. When a GeRecords call is made, the front-end proxy layer retrieves the encrypted record from its back-end store.
  2. The consumer (client) makes a request to KMS using a token generated by the customer’s request. KMS authorizes it.
  3. The client requests that KMS decrypt the encrypted key material.
  4. KMS decrypts the encrypted key material and sends the plaintext key material to the client.
  5. Kinesis Streams derives the per-record data keys from the decrypted key material.
  6. If the calling application is authorized, the client decrypts the payload and removes the plaintext data key from memory.
  7. The client delivers the payload over TLS and HTTPS to the consumer, requesting the records. Data in transit to a consumer is encrypted by default.

Verifying server-side encryption

Auditors or administrators often ask for proof that server-side encryption was or is enabled. Here are a few ways to do this.

To check if encryption is enabled now for your streams:

  • Use the AWS Management Console or the DescribeStream API operation. You can also see what CMK is being used for encryption.
  • See encryption in action by looking at responses from PutRecord, PutRecords, or GetRecords When encryption is enabled, the encryptionType parameter is set to “KMS”. If encryption is not enabled, encryptionType is not included in the response.

Sample PutRecord response

{
    "SequenceNumber": "49573959617140871741560010162505906306417380215064887298",
    "ShardId": "shardId-000000000000",
    "EncryptionType": "KMS"
}

Sample GetRecords response

{
    "Records": [
        {
            "Data": "aGVsbG8gd29ybGQ=", 
            "PartitionKey": "test", 
            "ApproximateArrivalTimestamp": 1498292565.825, 
            "EncryptionType": "KMS", 
            "SequenceNumber": "495735762417140871741560010162505906306417380215064887298"
        }, 
        {
            "Data": "ZnJvZG8gbGl2ZXMK", 
            "PartitionKey": "3d0d9301-3c30-4c48-a9a8-e485b2982b28", 
            "ApproximateArrivalTimestamp": 1498292801.747, 
            "EncryptionType": "KMS", 
            "SequenceNumber": "49573959617140871741560010162507115232237011062036103170"
        }
    ], 
    "NextShardIterator": "AAAAAAAAAAEvFypHZDx/4bJVAS34puwdiNcwssKqbh/XhRK7HSYRq3RS+YXJnVKJ8j0gQUt94bONdqQYHk9X9JHgefMUDKzDzndy5WbZWO4CS3hRdMdrbmJ/9KoR4lOfZvqTLt6JWQjDqXv0IaKs06/LHYcEA3oPcyQLOTJHdJl2EzplCTZnn/U295ovxvqF9g9DY8y2nVoMkdFLmdcEMVXjhCDKiRIt", 
    "MillisBehindLatest": 0
}

To check if encryption was enabled, use CloudTrail, which logs the StartStreamEncryption() and StopStreamEncryption() API calls made against a particular stream.

Getting started

It’s very easy to enable, disable, or modify server-side encryption for a particular stream.

  1. In the Kinesis Streams console, select a stream and choose Details.
  2. Select a CMK and select Enabled.
  3. Choose Save.

You can enable encryption only for a live stream, not upon stream creation.  Follow the same process to disable a stream. To use a different CMK, select it and choose Save.

Each of these tasks can also be accomplished using the StartStreamEncryption and StopStreamEncryption API operations.

Considerations

There are a few considerations you should be aware of when using server-side encryption for Kinesis Streams:

  • Permissions
  • Costs
  • Performance

Permissions

One benefit of using the “(Default) aws/kinesis” AWS managed key is that every producer and consumer with permissions to call PutRecord, PutRecords, or GetRecords inherits the right permissions over the “(Default) aws/kinesis” key automatically.

However, this is not necessarily the same case for a CMK. Kinesis Streams producers and consumers do not need to be aware of encryption. However, if you enable encryption using a custom master key but a producer or consumer doesn’t have IAM permissions to use it, PutRecord, PutRecords, or GetRecords requests fail.

This is a great security feature. On the other hand, it can effectively lead to data loss if you inadvertently apply a custom master key that restricts producers and consumers from interacting from the Kinesis stream. Take precautions when applying a custom master key. For more information about the minimum IAM permissions required for producers and consumers interacting with an encrypted stream, see Using Server-Side Encryption.

Costs

When you apply server-side encryption, you are subject to KMS API usage and key costs. Unlike custom KMS master keys, the “(Default) aws/kinesis” CMK is offered free of charge. However, you still need to pay for the API usage costs that Kinesis Streams incurs on your behalf.

API usage costs apply for every CMK, including custom ones. Kinesis Streams calls KMS approximately every 5 minutes when it is rotating the data key. In a 30-day month, the total cost of KMS API calls initiated by a Kinesis stream should be less than a few dollars.

Performance

During testing, AWS discovered that there was a slight increase (typically 0.2 millisecond or less per record) with put and get record latencies due to the additional overhead of encryption.

If you have questions or suggestions, please comment below.

Analysis of Top-N DynamoDB Objects using Amazon Athena and Amazon QuickSight

Post Syndicated from Rendy Oka original https://aws.amazon.com/blogs/big-data/analysis-of-top-n-dynamodb-objects-using-amazon-athena-and-amazon-quicksight/

If you run an operation that continuously generates a large amount of data, you may want to know what kind of data is being inserted by your application. The ability to analyze data intake quickly can be very valuable for business units, such as operations and marketing. For many operations, it’s important to see what is driving the business at any particular moment. For retail companies, for example, understanding which products are currently popular can aid in planning for future growth. Similarly, for PR companies, understanding the impact of an advertising campaign can help them market their products more effectively.

This post covers an architecture that helps you analyze your streaming data. You’ll build a solution using Amazon DynamoDB Streams, AWS Lambda, Amazon Kinesis Firehose, and Amazon Athena to analyze data intake at a frequency that you choose. And because this is a serverless architecture, you can use all of the services here without the need to provision or manage servers.

The data source

You’ll collect a random sampling of tweets via Twitter’s API and store a variety of attributes in your DynamoDB table, such as: Twitter handle, tweet ID, hashtags, location, and Time-To-Live (TTL) value.

In DynamoDB, the primary key is used as an input to an internal hash function. The output from this function determines the partition in which the data will be stored. When using a combination of primary key and sort key as a DynamoDB schema, you need to make sure that no single partition key contains many more objects than the other partition keys because this can cause partition level throttling. For the demonstration in this blog, the Twitter handle will be the primary key and the tweet ID will be the sort key. This allows you to group and sort tweets from each user.

To help you get started, I have written a script that pulls a live Twitter stream that you can use to generate your data. All you need to do is provide your own Twitter Apps credentials, and it should generate the data immediately. Alternatively, I have also provided a script that you can use to generate random Tweets with little effort.

You can find both scripts in the Github repository:

https://github.com/awslabs/aws-blog-dynamodb-analysis

There are some modules that you may need to install to run these scripts. You can find them in Python’s module repository:

To get your own Twitter credentials, go to https://www.twitter.com/ and sign up for a free account, if you don’t already have one. After your account is set up, go to https://apps.twitter.com/. On the main landing page, choose the Create New App button. After the application is created, go to Keys and Access Tokens to get your credentials to use the Twitter API. You’ll need to generate Customer Tokens/Secret and Access Token/Secret. All four keys will be used to authenticate your request.

Architecture overview

Before we begin, let’s take a look at the overall flow of information will look like, from data ingestion into DynamoDB to visualization of results in Amazon QuickSight.

As illustrated in the architecture diagram above, any changes made to the items in DynamoDB will be captured and processed using DynamoDB Streams. Next, a Lambda function will be invoked by a trigger that is configured to respond to events in DynamoDB Streams. The Lambda function processes the data prior to pushing to Amazon Kinesis Firehose, which will output to Amazon S3. Finally, you use Amazon Athena to analyze the streaming data landing in Amazon S3. The result can be explored and visualized in Amazon QuickSight for your company’s business analytics.

You’ll need to implement your custom Lambda function to help transform the raw <key, value> data stored in DynamoDB to a JSON format for Athena to digest, but I can help you with a sample code that you are free to modify.

Implementation

In the following sections, I’ll walk through how you can set up the architecture discussed earlier.

Create your DynamoDB table

First, let’s create a DynamoDB table and enable DynamoDB Streams. This will enable data to be copied out of this table. From the console, use the user_id as the partition key and tweet_id as the sort key:

After the table is ready, you can enable DynamoDB Streams. This process operates asynchronously, so there is no performance impact on the table when you enable this feature. The easiest way to manage DynamoDB Streams is also through the DynamoDB console.

In the Overview tab of your newly created table, click Manage Stream. In the window, choose the information that will be written to the stream whenever data in the table is added or modified. In this example, you can choose either New image or New and old images.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

Configure Kinesis Firehose

Before creating the Lambda function, you need to configure Kinesis Firehose delivery stream so that it’s ready to accept data from Lambda. Open the Firehose console and choose Create Firehose Delivery Stream. From here, choose S3 as the destination and use the following to information to configure the resource. Note the Delivery stream name because you will use it in the next step.

For more details on this process, check out our documentation:

http://docs.aws.amazon.com/firehose/latest/dev/basic-create.html#console-to-s3

Create your Lambda function

Now that Kinesis Firehose is ready to accept data, you can create your Lambda function.

From the AWS Lambda console, choose the Create a Lambda function button and use the Blank Function. Enter a name and description, and choose Python 2.7 as the Runtime. Note your Lambda function name because you’ll need it in the next step.

In the Lambda function code field, you can paste the script that I have written for this purpose. All this function needs is the name of your Firehose stream name set as an environment variable.

import boto3
import json
import os

# Initiate Firehose client
firehose_client = boto3.client('firehose')

def lambda_handler(event, context):
    records = []
    batch   = []
    try :
        for record in event['Records']:
            tweet = {}
            t_stats = '{ "table_name":"%s", "user_id":"%s", "tweet_id":"%s", "approx_post_time":"%d" }\n' \
                      % ( record['eventSourceARN'].split('/')[1], \
                          record['dynamodb']['Keys']['user_id']['S'], \
                          record['dynamodb']['Keys']['tweet_id']['N'], \
                          int(record['dynamodb']['ApproximateCreationDateTime']) )
            tweet["Data"] = t_stats
            records.append(tweet)
        batch.append(records)
        res = firehose_client.put_record_batch(
            DeliveryStreamName = os.environ['firehose_stream_name'],
            Records = batch[0]
        )
        return 'Successfully processed {} records.'.format(len(event['Records']))
    except Exception :
        pass

The handler should be set to lambda_function.lambda_handler and you can use the existing lambda_dynamodb_streams role that’s been created by default.

Enable DynamoDB trigger and start collecting data

Everything is ready to go. Open your table using the DynamoDB console and go to the Triggers tab. Select the Create trigger drop down list and choose Existing Lambda function. In the pop-up window, select the function that you just created, and choose the Create button.

At this point, you can start collecting data with the Python script that I’ve provided. The first one will create a script that will pull public Twitter data and the other will generate fake tweets using Lorem Ipsum text.

Configure Amazon Athena to read the data

Next, you will configure Amazon Athena so that it can read the data Kinesis Firehose outputs to Amazon S3 and allow you to analyze the data as needed. You can connect to Athena directly from the Athena console, and you can establish a connection using JDBC or the Athena API. In this example, I’m going to demonstrate what this looks like on the Athena console.

First, create a new database and a new table. You can do this by running the following two queries. The first query creates a new database:

CREATE DATABASE IF NOT EXISTS ddbtablestats

And the second query creates a new table:

CREATE EXTERNAL TABLE IF NOT EXISTS ddbtablestats.twitterfeed (
    `table_name` string,
    `user_id` string,
    `tweet_id` bigint,
    `approx_post_time` timestamp 
) PARTITIONED BY (
    year string,
    month string,
    day string,
    hour string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
LOCATION 's3://myBucket/dynamodb/streams/transactions/'

Note that this table is created using partitions. Partitioning separates your data into logical parts based on certain criteria, such as date, location, language, etc. This allows Athena to selectively pull your data without needing to process the entire data set. This effectively minimizes the query execution time, and it also allows you to have greater control over the data that you want to query.

After the query has completed, you should be able to see the table in the left side pane of the Athena dashboard.

After the database and table have been created, execute the ALTER TABLE query to populate the partitions in your table. Replace the date with the current date when the script was executed.

ALTER TABLE ddbtablestats.TwitterFeed ADD IF NOT EXISTS
PARTITION (year='2017',month='05',day='17',hour='01') location 's3://myBucket/dynamodb/streams/transactions/2017/05/17/01/'

Using the Athena console, you’ll need to manually populate each partition for each additional partition that you’d like to analyze, however you can programmatically automate this process by using the JDBC driver or any AWS SDK of your choice.

For more information on partitioning in Athena, check out our documentation:

http://docs.aws.amazon.com/athena/latest/ug/partitions.html

Querying the data in Amazon Athena

This is it! Let’s run this query to see the top 10 most active Twitter users in the last 24 hours. You can do this from the Athena console:

SELECT user_id, COUNT(DISTINCT tweet_id) tweets FROM ddbTableStats.TwitterFeed
WHERE year='2017' AND month='05' AND day='17'
GROUP BY user_id
ORDER BY tweets DESC
LIMIT 10

The result should look similar to the following:

Linking Athena to Amazon QuickSight

Finally, to make this data available to a larger audience, let’s visualize this data in Amazon QuickSight. Amazon QuickSight provides native connectivity to AWS data sources such as Amazon Redshift, Amazon RDS, and Amazon Athena. Amazon QuickSight can also connect to on-premises databases, Excel, or CSV files, and it can connect to cloud data sources such as Salesforce.com. For this solution, we will connect Amazon QuickSight to the Athena table we just created.

Amazon QuickSight has a free tier that provides 1 user and 1GB of SPICE (Superfast Parallel In-memory Calculated Engine) capacity free. So you can sign up and use QuickSight free of charge.

When you are signing up for Amazon QuickSight, ensure that you grant permissions for QuickSight to connect to Athena and the S3 bucket where the data is stored.

After you’ve signed up, navigate to the new analysis button, and choose new data set, and then select the Athena data source option. Create a new name for your data source and proceed to the next prompt. At this point, you should see the Athena table you created earlier.

Choose the option to import the data to SPICE for a quicker analysis. SPICE is an in-memory optimized calculation engine that is designed for quick data visualization through parallel processing. SPICE also enables you to refresh your data sets at a regular interval or on-demand as you want.

In the dialog box, confirm this data set creation, and you’ll arrive on the landing page where you can start building your graph. The X-axis will represent the user_id and the Value will be used to represent the SUM total of the tweets from each user.

The Amazon QuickSight report looks like this:

Through this visualization, I can easily see that there are 3 users that tweeted over 20 times that day and that the majority of the users have fewer than 10 tweets that day. I can also set up a scheduled refresh of my SPICE dataset so that I have a dashboard that is regularly updated with the latest data.

Closing thoughts

Here are the benefits that you can gain from using this architecture:

  1. You can optimize the design of your DynamoDB schema that follows AWS best practice recommendations.
  1. You can run analysis and data intelligence in order to understand the current customer demands for your business.
  1. You can store incremental backup for future auditing.

The flexibility of our AWS services invites you to create and design the ideal workflow for your production at any scale, and, as always, if you ever need some guidance, don’t hesitate to reach out to us.I  hope this has been helpful to you! Please leave any questions and comments below.

 


Additional Reading

Learn how to analyze VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight.


About the Author

Rendy Oka is a Big Data Support Engineer for Amazon Web Services. He provides consultations and architectural designs and partners with the TAMs, Solution Architects, and AWS product teams to help develop solutions for our customers. He is also a team lead for the big data support team in Seattle. Rendy has traveled to dozens of countries around the world and takes every opportunity to experience the local culture wherever he goes

 

 

 

 

Yahoo Mail’s New Tech Stack, Built for Performance and Reliability

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/162320493306

By Suhas Sadanandan, Director of Engineering 

When it comes to performance and reliability, there is perhaps no application where this matters more than with email. Today, we announced a new Yahoo Mail experience for desktop based on a completely rewritten tech stack that embodies these fundamental considerations and more.

We built the new Yahoo Mail experience using a best-in-class front-end tech stack with open source technologies including React, Redux, Node.js, react-intl (open-sourced by Yahoo), and others. A high-level architectural diagram of our stack is below.

image

New Yahoo Mail Tech Stack

In building our new tech stack, we made use of the most modern tools available in the industry to come up with the best experience for our users by optimizing the following fundamentals:

Performance

A key feature of the new Yahoo Mail architecture is blazing-fast initial loading (aka, launch).

We introduced new network routing which sends users to their nearest geo-located email servers (proximity-based routing). This has resulted in a significant reduction in time to first byte and should be immediately noticeable to our international users in particular.

We now do server-side rendering to allow our users to see their mail sooner. This change will be immediately noticeable to our low-bandwidth users. Our application is isomorphic, meaning that the same code runs on the server (using Node.js) and the client. Prior versions of Yahoo Mail had programming logic duplicated on the server and the client because we used PHP on the server and JavaScript on the client.   

Using efficient bundling strategies (JavaScript code is separated into application, vendor, and lazy loaded bundles) and pushing only the changed bundles during production pushes, we keep the cache hit ratio high. By using react-atomic-css, our homegrown solution for writing modular and scoped CSS in React, we get much better CSS reuse.  

In prior versions of Yahoo Mail, the need to run various experiments in parallel resulted in additional branching and bloating of our JavaScript and CSS code. While rewriting all of our code, we solved this issue using Mendel, our homegrown solution for bucket testing isomorphic web apps, which we have open sourced.  

Rather than using custom libraries, we use native HTML5 APIs and ES6 heavily and use PolyesterJS, our homegrown polyfill solution, to fill the gaps. These factors have further helped us to keep payload size minimal.

With all the above optimizations, we have been able to reduce our JavaScript and CSS footprint by approximately 50% compared to the previous desktop version of Yahoo Mail, helping us achieve a blazing-fast launch.

In addition to initial launch improvements, key features like search and message read (when a user opens an email to read it) have also benefited from the above optimizations and are considerably faster in the latest version of Yahoo Mail.

We also significantly reduced the memory consumed by Yahoo Mail on the browser. This is especially noticeable during a long running session.

Reliability

With this new version of Yahoo Mail, we have a 99.99% success rate on core flows: launch, message read, compose, search, and actions that affect messages. Accomplishing this over several billion user actions a day is a significant feat. Client-side errors (JavaScript exceptions) are reduced significantly when compared to prior Yahoo Mail versions.

Product agility and launch velocity

We focused on independently deployable components. As part of the re-architecture of Yahoo Mail, we invested in a robust continuous integration and delivery flow. Our new pipeline allows for daily (or more) pushes to all Mail users, and we push only the bundles that are modified, which keeps the cache hit ratio high.

Developer effectiveness and satisfaction

In developing our tech stack for the new Yahoo Mail experience, we heavily leveraged open source technologies, which allowed us to ensure a shorter learning curve for new engineers. We were able to implement a consistent and intuitive onboarding program for 30+ developers and are now using our program for all new hires. During the development process, we emphasise predictable flows and easy debugging.

Accessibility

The accessibility of this new version of Yahoo Mail is state of the art and delivers outstanding usability (efficiency) in addition to accessibility. It features six enhanced visual themes that can provide accommodation for people with low vision and has been optimized for use with Assistive Technology including alternate input devices, magnifiers, and popular screen readers such as NVDA and VoiceOver. These features have been rigorously evaluated and incorporate feedback from users with disabilities. It sets a new standard for the accessibility of web-based mail and is our most-accessible Mail experience yet.

Open source 

We have open sourced some key components of our new Mail stack, like Mendel, our solution for bucket testing isomorphic web applications. We invite the community to use and build upon our code. Going forward, we plan on also open sourcing additional components like react-atomic-css, our solution for writing modular and scoped CSS in React, and lazy-component, our solution for on-demand loading of resources.

Many of our company’s best technical minds came together to write a brand new tech stack and enable a delightful new Yahoo Mail experience for our users.

We encourage our users and engineering peers in the industry to test the limits of our application, and to provide feedback by clicking on the Give Feedback call out in the lower left corner of the new version of Yahoo Mail.

AWS GovCloud (US) and Amazon Rekognition – A Powerful Public Safety Tool

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-govcloud-us-and-amazon-rekognition-a-powerful-public-safety-tool/

I’ve already told you about Amazon Rekognition and described how it uses deep neural network models to analyze images by detecting objects, scenes, and faces.

Today I am happy to tell you that Rekognition is now available in the AWS GovCloud (US) Region. To learn more, read the Amazon Rekognition FAQ, and the Amazon Rekognition Product Details, review the Amazon Rekognition Customer Use Cases, and then build your app using the information on the Amazon Rekognition for Developers page.

Motorola Solutions for Public Safety
While I have your attention, I would love to tell you how Motorola Solutions is exploring how Rekognition can enhance real-time intelligence for public safety personnel in the field and at the command center.

Motorola Solutions provides over 100,000 public safety and commercial customers in more than 100 countries with software, services, and tools for mobile intelligence and digital evidence management, many powered by images captured using body, dashboard, and stationary cameras. Due to the exceptionally sensitive nature of these images, they must be stored in an environment that meets stringent CJIS (Criminal Justice Information Systems) security standards defined by the FBI.

For several years, researchers at Motorola Solutions have been exploring the use of artificial intelligence. For example, they have built prototype applications that use Rekognition, Lex, and Polly in conjunction with their own software to scan images from a body-worn camera for missing persons and to raise alerts without requiring continuous human attention or interaction. With approximately 100,000 missing people in the US alone, law enforcement agencies need to bring powerful tools to bear. At re:Invent 2016, Dan Law (Chief Data Scientist for Motorola Solutions) described how they use AWS to aid in this effort. Here’s the video (Dan’s section is titled AI for Public Safety):

AWS and CJIS
The applications that Dan described can run in AWS GovCloud (US). This is an isolated cloud built to protect and preserve sensitive IT data while meeting the FBI’s CJIS requirements (and many others). AWS GovCloud (US) resides on US soil and is managed exclusively by US citizens. AWS routinely signs CJIS security agreements with our customers and can either perform or allow background checks on our employees, as needed.

Here are some resources that you can use to learn more about AWS and CJIS:

Jeff;