Tag Archives: Processing

Announcing the Winners of the AWS Chatbot Challenge – Conversational, Intelligent Chatbots using Amazon Lex and AWS Lambda

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/announcing-the-winners-of-the-aws-chatbot-challenge-conversational-intelligent-chatbots-using-amazon-lex-and-aws-lambda/

A couple of months ago on the blog, I announced the AWS Chatbot Challenge in conjunction with Slack. The AWS Chatbot Challenge was an opportunity to build a unique chatbot that helped to solve a problem or that would add value for its prospective users. The mission was to build a conversational, natural language chatbot using Amazon Lex and leverage Lex’s integration with AWS Lambda to execute logic or data processing on the backend.

I know that you all have been anxiously waiting to hear announcements of who were the winners of the AWS Chatbot Challenge as much as I was. Well wait no longer, the winners of the AWS Chatbot Challenge have been decided.

May I have the Envelope Please? (The Trumpets sound)

The winners of the AWS Chatbot Challenge are:

  • First Place: BuildFax Counts by Joe Emison
  • Second Place: Hubsy by Andrew Riess, Andrew Puch, and John Wetzel
  • Third Place: PFMBot by Benny Leong and his team from MoneyLion.
  • Large Organization Winner: ADP Payroll Innovation Bot by Eric Liu, Jiaxing Yan, and Fan Yang

 

Diving into the Winning Chatbot Projects

Let’s take a walkthrough of the details for each of the winning projects to get a view of what made these chatbots distinctive, as well as, learn more about the technologies used to implement the chatbot solution.

 

BuildFax Counts by Joe Emison

The BuildFax Counts bot was created as a real solution for the BuildFax company to decrease the amount the time that sales and marketing teams can get answers on permits or properties with permits meet certain criteria.

BuildFax, a company co-founded by bot developer Joe Emison, has the only national database of building permits, which updates data from approximately half of the United States on a monthly basis. In order to accommodate the many requests that come in from the sales and marketing team regarding permit information, BuildFax has a technical sales support team that fulfills these requests sent to a ticketing system by manually writing SQL queries that run across the shards of the BuildFax databases. Since there are a large number of requests received by the internal sales support team and due to the manual nature of setting up the queries, it may take several days for getting the sales and marketing teams to receive an answer.

The BuildFax Counts chatbot solves this problem by taking the permit inquiry that would normally be sent into a ticket from the sales and marketing team, as input from Slack to the chatbot. Once the inquiry is submitted into Slack, a query executes and the inquiry results are returned immediately.

Joe built this solution by first creating a nightly export of the data in their BuildFax MySQL RDS database to CSV files that are stored in Amazon S3. From the exported CSV files, an Amazon Athena table was created in order to run quick and efficient queries on the data. He then used Amazon Lex to create a bot to handle the common questions and criteria that may be asked by the sales and marketing teams when seeking data from the BuildFax database by modeling the language used from the BuildFax ticketing system. He added several different sample utterances and slot types; both custom and Lex provided, in order to correctly parse every question and criteria combination that could be received from an inquiry.  Using Lambda, Joe created a Javascript Lambda function that receives information from the Lex intent and used it to build a SQL statement that runs against the aforementioned Athena database using the AWS SDK for JavaScript in Node.js library to return inquiry count result and SQL statement used.

The BuildFax Counts bot is used today for the BuildFax sales and marketing team to get back data on inquiries immediately that previously took up to a week to receive results.

Not only is BuildFax Counts bot our 1st place winner and wonderful solution, but its creator, Joe Emison, is a great guy.  Joe has opted to donate his prize; the $5,000 cash, the $2,500 in AWS Credits, and one re:Invent ticket to the Black Girls Code organization. I must say, you rock Joe for helping these kids get access and exposure to technology.

 

Hubsy by Andrew Riess, Andrew Puch, and John Wetzel

Hubsy bot was created to redefine and personalize the way users traditionally manage their HubSpot account. HubSpot is a SaaS system providing marketing, sales, and CRM software. Hubsy allows users of HubSpot to create engagements and log engagements with customers, provide sales teams with deals status, and retrieves client contact information quickly. Hubsy uses Amazon Lex’s conversational interface to execute commands from the HubSpot API so that users can gain insights, store and retrieve data, and manage tasks directly from Facebook, Slack, or Alexa.

In order to implement the Hubsy chatbot, Andrew and the team members used AWS Lambda to create a Lambda function with Node.js to parse the users request and call the HubSpot API, which will fulfill the initial request or return back to the user asking for more information. Terraform was used to automatically setup and update Lambda, CloudWatch logs, as well as, IAM profiles. Amazon Lex was used to build the conversational piece of the bot, which creates the utterances that a person on a sales team would likely say when seeking information from HubSpot. To integrate with Alexa, the Amazon Alexa skill builder was used to create an Alexa skill which was tested on an Echo Dot. Cloudwatch Logs are used to log the Lambda function information to CloudWatch in order to debug different parts of the Lex intents. In order to validate the code before the Terraform deployment, ESLint was additionally used to ensure the code was linted and proper development standards were followed.

 

PFMBot by Benny Leong and his team from MoneyLion

PFMBot, Personal Finance Management Bot,  is a bot to be used with the MoneyLion finance group which offers customers online financial products; loans, credit monitoring, and free credit score service to improve the financial health of their customers. Once a user signs up an account on the MoneyLion app or website, the user has the option to link their bank accounts with the MoneyLion APIs. Once the bank account is linked to the APIs, the user will be able to login to their MoneyLion account and start having a conversation with the PFMBot based on their bank account information.

The PFMBot UI has a web interface built with using Javascript integration. The chatbot was created using Amazon Lex to build utterances based on the possible inquiries about the user’s MoneyLion bank account. PFMBot uses the Lex built-in AMAZON slots and parsed and converted the values from the built-in slots to pass to AWS Lambda. The AWS Lambda functions interacting with Amazon Lex are Java-based Lambda functions which call the MoneyLion Java-based internal APIs running on Spring Boot. These APIs obtain account data and related bank account information from the MoneyLion MySQL Database.

 

ADP Payroll Innovation Bot by Eric Liu, Jiaxing Yan, and Fan Yang

ADP PI (Payroll Innovation) bot is designed to help employees of ADP customers easily review their own payroll details and compare different payroll data by just asking the bot for results. The ADP PI Bot additionally offers issue reporting functionality for employees to report payroll issues and aids HR managers in quickly receiving and organizing any reported payroll issues.

The ADP Payroll Innovation bot is an ecosystem for the ADP payroll consisting of two chatbots, which includes ADP PI Bot for external clients (employees and HR managers), and ADP PI DevOps Bot for internal ADP DevOps team.


The architecture for the ADP PI DevOps bot is different architecture from the ADP PI bot shown above as it is deployed internally to ADP. The ADP PI DevOps bot allows input from both Slack and Alexa. When input comes into Slack, Slack sends the request to Lex for it to process the utterance. Lex then calls the Lambda backend, which obtains ADP data sitting in the ADP VPC running within an Amazon VPC. When input comes in from Alexa, a Lambda function is called that also obtains data from the ADP VPC running on AWS.

The architecture for the ADP PI bot consists of users entering in requests and/or entering issues via Slack. When requests/issues are entered via Slack, the Slack APIs communicate via Amazon API Gateway to AWS Lambda. The Lambda function either writes data into one of the Amazon DynamoDB databases for recording issues and/or sending issues or it sends the request to Lex. When sending issues, DynamoDB integrates with Trello to keep HR Managers abreast of the escalated issues. Once the request data is sent from Lambda to Lex, Lex processes the utterance and calls another Lambda function that integrates with the ADP API and it calls ADP data from within the ADP VPC, which runs on Amazon Virtual Private Cloud (VPC).

Python and Node.js were the chosen languages for the development of the bots.

The ADP PI bot ecosystem has the following functional groupings:

Employee Functionality

  • Summarize Payrolls
  • Compare Payrolls
  • Escalate Issues
  • Evolve PI Bot

HR Manager Functionality

  • Bot Management
  • Audit and Feedback

DevOps Functionality

  • Reduce call volume in service centers (ADP PI Bot).
  • Track issues and generate reports (ADP PI Bot).
  • Monitor jobs for various environment (ADP PI DevOps Bot)
  • View job dashboards (ADP PI DevOps Bot)
  • Query job details (ADP PI DevOps Bot)

 

Summary

Let’s all wish all the winners of the AWS Chatbot Challenge hearty congratulations on their excellent projects.

You can review more details on the winning projects, as well as, all of the submissions to the AWS Chatbot Challenge at: https://awschatbot2017.devpost.com/submissions. If you are curious on the details of Chatbot challenge contest including resources, rules, prizes, and judges, you can review the original challenge website here:  https://awschatbot2017.devpost.com/.

Hopefully, you are just as inspired as I am to build your own chatbot using Lex and Lambda. For more information, take a look at the Amazon Lex developer guide or the AWS AI blog on Building Better Bots Using Amazon Lex (Part 1)

Chat with you soon!

Tara

Analyzing AWS Cost and Usage Reports with Looker and Amazon Athena

Post Syndicated from Dillon Morrison original https://aws.amazon.com/blogs/big-data/analyzing-aws-cost-and-usage-reports-with-looker-and-amazon-athena/

This is a guest post by Dillon Morrison at Looker. Looker is, in their own words, “a new kind of analytics platform–letting everyone in your business make better decisions by getting reliable answers from a tool they can use.” 

As the breadth of AWS products and services continues to grow, customers are able to more easily move their technology stack and core infrastructure to AWS. One of the attractive benefits of AWS is the cost savings. Rather than paying upfront capital expenses for large on-premises systems, customers can instead pay variables expenses for on-demand services. To further reduce expenses AWS users can reserve resources for specific periods of time, and automatically scale resources as needed.

The AWS Cost Explorer is great for aggregated reporting. However, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and can be the better choice for the long term. Thankfully, with the introduction of Amazon Athena, monitoring and managing these costs is now easier than ever.

In the post, I walk through setting up the data pipeline for cost and usage reports, Amazon S3, and Athena, and discuss some of the most common levers for cost savings. I surface tables through Looker, which comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive.

Analysis with Athena

With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Athena uses Apache Hive’s DDL to create tables, and the Presto querying engine to process queries. Analysis can be performed directly on raw data in S3. Conveniently, AWS exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.

After the data pipeline is established, cost and usage data (the recommended billing data, per AWS documentation) provides a plethora of comprehensive information around usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, this report can be cut-and-sliced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, purchase option (on-demand, spot, or reserved), and so on.

Walkthrough

By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which are discussed below.

Prerequisites

If you want to follow along, you need the following resources:

Enable the cost and usage reports

First, enable the Cost and Usage report. For Time unit, select Hourly. For Include, select Resource IDs. All options are prompted in the report-creation window.

The Cost and Usage report dumps CSV files into the specified S3 bucket. Please note that it can take up to 24 hours for the first file to be delivered after enabling the report.

Configure the S3 bucket and files for Athena querying

In addition to the CSV file, AWS also creates a JSON manifest file for each cost and usage report. Athena requires that all of the files in the S3 bucket are in the same format, so we need to get rid of all these manifest files. If you’re looking to get started with Athena quickly, you can simply go into your S3 bucket and delete the manifest file manually, skip the automation described below, and move on to the next section.

To automate the process of removing the manifest file each time a new report is dumped into S3, which I recommend as you scale, there are a few additional steps. The folks at Concurrency labs wrote a great overview and set of scripts for this, which you can find in their GitHub repo.

These scripts take the data from an input bucket, remove anything unnecessary, and dump it into a new output bucket. We can utilize AWS Lambda to trigger this process whenever new data is dropped into S3, or on a nightly basis, or whatever makes most sense for your use-case, depending on how often you’re querying the data. Please note that enabling the “hourly” report means that data is reported at the hour-level of granularity, not that a new file is generated every hour.

Following these scripts, you’ll notice that we’re adding a date partition field, which isn’t necessary but improves query performance. In addition, converting data from CSV to a columnar format like ORC or Parquet also improves performance. We can automate this process using Lambda whenever new data is dropped in our S3 bucket. Amazon Web Services discusses columnar conversion at length, and provides walkthrough examples, in their documentation.

As a long-term solution, best practice is to use compression, partitioning, and conversion. However, for purposes of this walkthrough, we’re not going to worry about them so we can get up-and-running quicker.

Set up the Athena query engine

In your AWS console, navigate to the Athena service, and click “Get Started”. Follow the tutorial and set up a new database (we’ve called ours “AWS Optimizer” in this example). Don’t worry about configuring your initial table, per the tutorial instructions. We’ll be creating a new table for cost and usage analysis. Once you walked through the tutorial steps, you’ll be able to access the Athena interface, and can begin running Hive DDL statements to create new tables.

One thing that’s important to note, is that the Cost and Usage CSVs also contain the column headers in their first row, meaning that the column headers would be included in the dataset and any queries. For testing and quick set-up, you can remove this line manually from your first few CSV files. Long-term, you’ll want to use a script to programmatically remove this row each time a new file is dropped in S3 (every few hours typically). We’ve drafted up a sample script for ease of reference, which we run on Lambda. We utilize Lambda’s native ability to invoke the script whenever a new object is dropped in S3.

For cost and usage, we recommend using the DDL statement below. Since our data is in CSV format, we don’t need to use a SerDe, we can simply specify the “separatorChar, quoteChar, and escapeChar”, and the structure of the files (“TEXTFILE”). Note that AWS does have an OpenCSV SerDe as well, if you prefer to use that.

 

CREATE EXTERNAL TABLE IF NOT EXISTS cost_and_usage	 (
identity_LineItemId String,
identity_TimeInterval String,
bill_InvoiceId String,
bill_BillingEntity String,
bill_BillType String,
bill_PayerAccountId String,
bill_BillingPeriodStartDate String,
bill_BillingPeriodEndDate String,
lineItem_UsageAccountId String,
lineItem_LineItemType String,
lineItem_UsageStartDate String,
lineItem_UsageEndDate String,
lineItem_ProductCode String,
lineItem_UsageType String,
lineItem_Operation String,
lineItem_AvailabilityZone String,
lineItem_ResourceId String,
lineItem_UsageAmount String,
lineItem_NormalizationFactor String,
lineItem_NormalizedUsageAmount String,
lineItem_CurrencyCode String,
lineItem_UnblendedRate String,
lineItem_UnblendedCost String,
lineItem_BlendedRate String,
lineItem_BlendedCost String,
lineItem_LineItemDescription String,
lineItem_TaxType String,
product_ProductName String,
product_accountAssistance String,
product_architecturalReview String,
product_architectureSupport String,
product_availability String,
product_bestPractices String,
product_cacheEngine String,
product_caseSeverityresponseTimes String,
product_clockSpeed String,
product_currentGeneration String,
product_customerServiceAndCommunities String,
product_databaseEdition String,
product_databaseEngine String,
product_dedicatedEbsThroughput String,
product_deploymentOption String,
product_description String,
product_durability String,
product_ebsOptimized String,
product_ecu String,
product_endpointType String,
product_engineCode String,
product_enhancedNetworkingSupported String,
product_executionFrequency String,
product_executionLocation String,
product_feeCode String,
product_feeDescription String,
product_freeQueryTypes String,
product_freeTrial String,
product_frequencyMode String,
product_fromLocation String,
product_fromLocationType String,
product_group String,
product_groupDescription String,
product_includedServices String,
product_instanceFamily String,
product_instanceType String,
product_io String,
product_launchSupport String,
product_licenseModel String,
product_location String,
product_locationType String,
product_maxIopsBurstPerformance String,
product_maxIopsvolume String,
product_maxThroughputvolume String,
product_maxVolumeSize String,
product_maximumStorageVolume String,
product_memory String,
product_messageDeliveryFrequency String,
product_messageDeliveryOrder String,
product_minVolumeSize String,
product_minimumStorageVolume String,
product_networkPerformance String,
product_operatingSystem String,
product_operation String,
product_operationsSupport String,
product_physicalProcessor String,
product_preInstalledSw String,
product_proactiveGuidance String,
product_processorArchitecture String,
product_processorFeatures String,
product_productFamily String,
product_programmaticCaseManagement String,
product_provisioned String,
product_queueType String,
product_requestDescription String,
product_requestType String,
product_routingTarget String,
product_routingType String,
product_servicecode String,
product_sku String,
product_softwareType String,
product_storage String,
product_storageClass String,
product_storageMedia String,
product_technicalSupport String,
product_tenancy String,
product_thirdpartySoftwareSupport String,
product_toLocation String,
product_toLocationType String,
product_training String,
product_transferType String,
product_usageFamily String,
product_usagetype String,
product_vcpu String,
product_version String,
product_volumeType String,
product_whoCanOpenCases String,
pricing_LeaseContractLength String,
pricing_OfferingClass String,
pricing_PurchaseOption String,
pricing_publicOnDemandCost String,
pricing_publicOnDemandRate String,
pricing_term String,
pricing_unit String,
reservation_AvailabilityZone String,
reservation_NormalizedUnitsPerReservation String,
reservation_NumberOfReservations String,
reservation_ReservationARN String,
reservation_TotalReservedNormalizedUnits String,
reservation_TotalReservedUnits String,
reservation_UnitsPerReservation String,
resourceTags_userName String,
resourceTags_usercostcategory String  


)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'

STORED AS TEXTFILE
    LOCATION 's3://<<your bucket name>>';

Once you’ve successfully executed the command, you should see a new table named “cost_and_usage” with the below properties. Now we’re ready to start executing queries and running analysis!

Start with Looker and connect to Athena

Setting up Looker is a quick process, and you can try it out for free here (or download from Amazon Marketplace). It takes just a few seconds to connect Looker to your Athena database, and Looker comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive. After you’re connected, you can use the Looker UI to run whatever analysis you’d like. Looker translates this UI to optimized SQL, so any user can execute and visualize queries for true self-service analytics.

Major cost saving levers

Now that the data pipeline is configured, you can dive into the most popular use cases for cost savings. In this post, I focus on:

  • Purchasing Reserved Instances vs. On-Demand Instances
  • Data transfer costs
  • Allocating costs over users or other Attributes (denoted with resource tags)

On-Demand, Spot, and Reserved Instances

Purchasing Reserved Instances vs On-Demand Instances is arguably going to be the biggest cost lever for heavy AWS users (Reserved Instances run up to 75% cheaper!). AWS offers three options for purchasing instances:

  • On-Demand—Pay as you use.
  • Spot (variable cost)—Bid on spare Amazon EC2 computing capacity.
  • Reserved Instances—Pay for an instance for a specific, allotted period of time.

When purchasing a Reserved Instance, you can also choose to pay all-upfront, partial-upfront, or monthly. The more you pay upfront, the greater the discount.

If your company has been using AWS for some time now, you should have a good sense of your overall instance usage on a per-month or per-day basis. Rather than paying for these instances On-Demand, you should try to forecast the number of instances you’ll need, and reserve them with upfront payments.

The total amount of usage with Reserved Instances versus overall usage with all instances is called your coverage ratio. It’s important not to confuse your coverage ratio with your Reserved Instance utilization. Utilization represents the amount of reserved hours that were actually used. Don’t worry about exceeding capacity, you can still set up Auto Scaling preferences so that more instances get added whenever your coverage or utilization crosses a certain threshold (we often see a target of 80% for both coverage and utilization among savvy customers).

Calculating the reserved costs and coverage can be a bit tricky with the level of granularity provided by the cost and usage report. The following query shows your total cost over the last 6 months, broken out by Reserved Instance vs other instance usage. You can substitute the cost field for usage if you’d prefer. Please note that you should only have data for the time period after the cost and usage report has been enabled (though you can opt for up to 3 months of historical data by contacting your AWS Account Executive). If you’re just getting started, this query will only show a few days.

 

SELECT 
	DATE_FORMAT(from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate),'%Y-%m') AS "cost_and_usage.usage_start_month",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_ris",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_non_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_non_ris"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

The resulting table should look something like the image below (I’m surfacing tables through Looker, though the same table would result from querying via command line or any other interface).

With a BI tool, you can create dashboards for easy reference and monitoring. New data is dumped into S3 every few hours, so your dashboards can update several times per day.

It’s an iterative process to understand the appropriate number of Reserved Instances needed to meet your business needs. After you’ve properly integrated Reserved Instances into your purchasing patterns, the savings can be significant. If your coverage is consistently below 70%, you should seriously consider adjusting your purchase types and opting for more Reserved instances.

Data transfer costs

One of the great things about AWS data storage is that it’s incredibly cheap. Most charges often come from moving and processing that data. There are several different prices for transferring data, broken out largely by transfers between regions and availability zones. Transfers between regions are the most costly, followed by transfers between Availability Zones. Transfers within the same region and same availability zone are free unless using elastic or public IP addresses, in which case there is a cost. You can find more detailed information in the AWS Pricing Docs. With this in mind, there are several simple strategies for helping reduce costs.

First, since costs increase when transferring data between regions, it’s wise to ensure that as many services as possible reside within the same region. The more you can localize services to one specific region, the lower your costs will be.

Second, you should maximize the data you’re routing directly within AWS services and IP addresses. Transfers out to the open internet are the most costly and least performant mechanisms of data transfers, so it’s best to keep transfers within AWS services.

Lastly, data transfers between private IP addresses are cheaper than between elastic or public IP addresses, so utilizing private IP addresses as much as possible is the most cost-effective strategy.

The following query provides a table depicting the total costs for each AWS product, broken out transfer cost type. Substitute the “lineitem_productcode” field in the query to segment the costs by any other attribute. If you notice any unusually high spikes in cost, you’ll need to dig deeper to understand what’s driving that spike: location, volume, and so on. Drill down into specific costs by including “product_usagetype” and “product_transfertype” in your query to identify the types of transfer costs that are driving up your bill.

SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-In')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_inbound_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-Out')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_outbound_data_transfer_cost"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

When moving between regions or over the open web, many data transfer costs also include the origin and destination location of the data movement. Using a BI tool with mapping capabilities, you can get a nice visual of data flows. The point at the center of the map is used to represent external data flows over the open internet.

Analysis by tags

AWS provides the option to apply custom tags to individual resources, so you can allocate costs over whatever customized segment makes the most sense for your business. For a SaaS company that hosts software for customers on AWS, maybe you’d want to tag the size of each customer. The following query uses custom tags to display the reserved, data transfer, and total cost for each AWS service, broken out by tag categories, over the last 6 months. You’ll want to substitute the cost_and_usage.resourcetags_customersegment and cost_and_usage.customer_segment with the name of your customer field.

 

SELECT * FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY z___min_rank) as z___pivot_row_rank, RANK() OVER (PARTITION BY z__pivot_col_rank ORDER BY z___min_rank) as z__pivot_col_ordering FROM (
SELECT *, MIN(z___rank) OVER (PARTITION BY "cost_and_usage.product_code") as z___min_rank FROM (
SELECT *, RANK() OVER (ORDER BY CASE WHEN z__pivot_col_rank=1 THEN (CASE WHEN "cost_and_usage.total_unblended_cost" IS NOT NULL THEN 0 ELSE 1 END) ELSE 2 END, CASE WHEN z__pivot_col_rank=1 THEN "cost_and_usage.total_unblended_cost" ELSE NULL END DESC, "cost_and_usage.total_unblended_cost" DESC, z__pivot_col_rank, "cost_and_usage.product_code") AS z___rank FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY CASE WHEN "cost_and_usage.customer_segment" IS NULL THEN 1 ELSE 0 END, "cost_and_usage.customer_segment") AS z__pivot_col_rank FROM (
SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	cost_and_usage.resourcetags_customersegment  AS "cost_and_usage.customer_segment",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_data_transfers_unblended",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.unblended_percent_spend_on_ris"
FROM aws_optimizer.cost_and_usage_raw  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1,2) ww
) bb WHERE z__pivot_col_rank <= 16384
) aa
) xx
) zz
 WHERE z___pivot_row_rank <= 500 OR z__pivot_col_ordering = 1 ORDER BY z___pivot_row_rank

The resulting table in this example looks like the results below. In this example, you can tell that we’re making poor use of Reserved Instances because they represent such a small portion of our overall costs.

Again, using a BI tool to visualize these costs and trends over time makes the analysis much easier to consume and take action on.

Summary

Saving costs on your AWS spend is always an iterative, ongoing process. Hopefully with these queries alone, you can start to understand your spending patterns and identify opportunities for savings. However, this is just a peek into the many opportunities available through analysis of the Cost and Usage report. Each company is different, with unique needs and usage patterns. To achieve maximum cost savings, we encourage you to set up an analytics environment that enables your team to explore all potential cuts and slices of your usage data, whenever it’s necessary. Exploring different trends and spikes across regions, services, user types, etc. helps you gain comprehensive understanding of your major cost levers and consistently implement new cost reduction strategies.

Note that all of the queries and analysis provided in this post were generated using the Looker data platform. If you’re already a Looker customer, you can get all of this analysis, additional pre-configured dashboards, and much more using Looker Blocks for AWS.


About the Author

Dillon Morrison leads the Platform Ecosystem at Looker. He enjoys exploring new technologies and architecting the most efficient data solutions for the business needs of his company and their customers. In his spare time, you’ll find Dillon rock climbing in the Bay Area or nose deep in the docs of the latest AWS product release at his favorite cafe (“Arlequin in SF is unbeatable!”).

 

 

 

What’s the Diff: Programs, Processes, and Threads

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/whats-the-diff-programs-processes-and-threads/

let's talk about Threads

How often have you heard the term threading in relation to a computer program, but you weren’t exactly sure what it meant? How about processes? You likely understand that a thread is somehow closely related to a program and a process, but if you’re not a computer science major, maybe that’s as far as your understanding goes.

Knowing what these terms mean is absolutely essential if you are a programmer, but an understanding of them also can be useful to the average computer user. Being able to look at and understand the Activity Monitor on the Macintosh, the Task Manager on Windows, or Top on Linux can help you troubleshoot which programs are causing problems on your computer, or whether you might need to install more memory to make your system run better.

Let’s take a few minutes to delve into the world of computer programs and sort out what these terms mean. We’ll simplify and generalize some of the ideas, but the general concepts we cover should help clarify the difference between the terms.

Programs

First of all, you probably are aware that a program is the code that is stored on your computer that is intended to fulfill a certain task. There are many types of programs, including programs that help your computer function and are part of the operating system, and other programs that fulfill a particular job. These task-specific programs are also known as “applications,” and can include programs such as word processing, web browsing, or emailing a message to another computer.

Program

Programs are typically stored on disk or in non-volatile memory in a form that can be executed by your computer. Prior to that, they are created using a programming language such as C, Lisp, Pascal, or many others using instructions that involve logic, data and device manipulation, recurrence, and user interaction. The end result is a text file of code that is compiled into binary form (1’s and 0’s) in order to run on the computer. Another type of program is called “interpreted,” and instead of being compiled in advance in order to run, is interpreted into executable code at the time it is run. Some common, typically interpreted programming languages, are Python, PHP, JavaScript, and Ruby.

The end result is the same, however, in that when a program is run, it is loaded into memory in binary form. The computer’s CPU (Central Processing Unit) understands only binary instructions, so that’s the form the program needs to be in when it runs.

Perhaps you’ve heard the programmer’s joke, “There are only 10 types of people in the world, those who understand binary, and those who don’t.”

Binary is the native language of computers because an electrical circuit at its basic level has two states, on or off, represented by a one or a zero. In the common numbering system we use every day, base 10, each digit position can be anything from 0 to 9. In base 2 (or binary), each position is either a 0 or a 1. (In a future blog post we might cover quantum computing, which goes beyond the concept of just 1’s and 0’s in computing.)

Decimal—Base 10 Binary—Base 2
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001

How Processes Work

The program has been loaded into the computer’s memory in binary form. Now what?

An executing program needs more than just the binary code that tells the computer what to do. The program needs memory and various operating system resources that it needs in order to run. A “process” is what we call a program that has been loaded into memory along with all the resources it needs to operate. The “operating system” is the brains behind allocating all these resources, and comes in different flavors such as macOS, iOS, Microsoft Windows, Linux, and Android. The OS handles the task of managing the resources needed to turn your program into a running process.

Some essential resources every process needs are registers, a program counter, and a stack. The “registers” are data holding places that are part of the computer processor (CPU). A register may hold an instruction, a storage address, or other kind of data needed by the process. The “program counter,” also called the “instruction pointer,” keeps track of where a computer is in its program sequence. The “stack” is a data structure that stores information about the active subroutines of a computer program and is used as scratch space for the process. It is distinguished from dynamically allocated memory for the process that is known as “the heap.”

diagram of how processes work

There can be multiple instances of a single program, and each instance of that running program is a process. Each process has a separate memory address space, which means that a process runs independently and is isolated from other processes. It cannot directly access shared data in other processes. Switching from one process to another requires some time (relatively) for saving and loading registers, memory maps, and other resources.

This independence of processes is valuable because the operating system tries its best to isolate processes so that a problem with one process doesn’t corrupt or cause havoc with another process. You’ve undoubtedly run into the situation in which one application on your computer freezes or has a problem and you’ve been able to quit that program without affecting others.

How Threads Work

So, are you still with us? We finally made it to threads!

A thread is the unit of execution within a process. A process can have anywhere from just one thread to many threads.

Process vs. Thread

diagram of threads in a process over time

When a process starts, it is assigned memory and resources. Each thread in the process shares that memory and resources. In single-threaded processes, the process contains one thread. The process and the thread are one and the same, and there is only one thing happening.

In multithreaded processes, the process contains more than one thread, and the process is accomplishing a number of things at the same time (technically, it’s almost at the same time—read more on that in the “What about Parallelism and Concurrency?” section below).

diagram of single and multi-treaded process

We talked about the two types of memory available to a process or a thread, the stack and the heap. It is important to distinguish between these two types of process memory because each thread will have its own stack, but all the threads in a process will share the heap.

Threads are sometimes called lightweight processes because they have their own stack but can access shared data. Because threads share the same address space as the process and other threads within the process, the operational cost of communication between the threads is low, which is an advantage. The disadvantage is that a problem with one thread in a process will certainly affect other threads and the viability of the process itself.

Threads vs. Processes

So to review:

  1. The program starts out as a text file of programming code,
  2. The program is compiled or interpreted into binary form,
  3. The program is loaded into memory,
  4. The program becomes one or more running processes.
  5. Processes are typically independent of each other,
  6. While threads exist as the subset of a process.
  7. Threads can communicate with each other more easily than processes can,
  8. But threads are more vulnerable to problems caused by other threads in the same process.

Processes vs. Threads — Advantages and Disadvantages

Process Thread
Processes are heavyweight operations Threads are lighter weight operations
Each process has its own memory space Threads use the memory of the process they belong to
Inter-process communication is slow as processes have different memory addresses Inter-thread communication can be faster than inter-process communication because threads of the same process share memory with the process they belong to
Context switching between processes is more expensive Context switching between threads of the same process is less expensive
Processes don’t share memory with other processes Threads share memory with other threads of the same process

What about Concurrency and Parallelism?

A question you might ask is whether processes or threads can run at the same time. The answer is: it depends. On a system with multiple processors or CPU cores (as is common with modern processors), multiple processes or threads can be executed in parallel. On a single processor, though, it is not possible to have processes or threads truly executing at the same time. In this case, the CPU is shared among running processes or threads using a process scheduling algorithm that divides the CPU’s time and yields the illusion of parallel execution. The time given to each task is called a “time slice.” The switching back and forth between tasks happens so fast it is usually not perceptible. The terms parallelism (true operation at the same time) and concurrency (simulated operation at the same time), distinguish between the two type of real or approximate simultaneous operation.

diagram of concurrency and parallelism

Why Choose Process over Thread, or Thread over Process?

So, how would a programmer choose between a process and a thread when creating a program in which she wants to execute multiple tasks at the same time? We’ve covered some of the differences above, but let’s look at a real world example with a program that many of us use, Google Chrome.

When Google was designing the Chrome browser, they needed to decide how to handle the many different tasks that needed computer, communications, and network resources at the same time. Each browser window or tab communicates with multiple servers on the internet to retrieve text, programs, graphics, audio, video, and other resources, and renders that data for display and interaction with the user. In addition, the browser can open many windows, each with many tasks.

Google had to decide how to handle that separation of tasks. They chose to run each browser window in Chrome as a separate process rather than a thread or many threads, as is common with other browsers. Doing that brought Google a number of benefits. Running each window as a process protects the overall application from bugs and glitches in the rendering engine and restricts access from each rendering engine process to others and to the rest of the system. Isolating JavaScript programs in a process prevents them from running away with too much CPU time and memory, and making the entire browser non-responsive.

Google made the calculated trade-off with a multi-processing design as starting a new process for each browser window has a higher fixed cost in memory and resources than using threads. They were betting that their approach would end up with less memory bloat overall.

Using processes instead of threads provides better memory usage when memory gets low. An inactive window is treated as a lower priority by the operating system and becomes eligible to be swapped to disk when memory is needed for other processes, helping to keep the user-visible windows more responsive. If the windows were threaded, it would be more difficult to separate the used and unused memory as cleanly, wasting both memory and performance.

You can read more about Google’s design decisions on Google’s Chromium Blog or on the Chrome Introduction Comic.

The screen capture below shows the Google Chrome processes running on a MacBook Air with many tabs open. Some Chrome processes are using a fair amount of CPU time and resources, and some are using very little. You can see that each process also has many threads running as well.

activity monitor of Google Chrome

The Activity Monitor or Task Manager on your system can be a valuable ally in helping fine-tune your computer or troubleshooting problems. If your computer is running slowly, or a program or browser window isn’t responding for a while, you can check its status using the system monitor. Sometimes you’ll see a process marked as “Not Responding.” Try quitting that process and see if your system runs better. If an application is a memory hog, you might consider choosing a different application that will accomplish the same task.

Windows Task Manager view

Made it This Far?

We hope this Tron-like dive into the fascinating world of computer programs, processes, and threads has helped clear up some questions you might have had.

The next time your computer is running slowly or an application is acting up, you know your assignment. Fire up the system monitor and take a look under the hood to see what’s going on. You’re in charge now.

We love to hear from you

Are you still confused? Have questions? If so, please let us know in the comments. And feel free to suggest topics for future blog posts.

The post What’s the Diff: Programs, Processes, and Threads appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Game of Thrones Episode “S07E06” Leaks Online Early

Post Syndicated from Ernesto original https://torrentfreak.com/game-of-thrones-episode-s07e06-leaks-online-early-170816/

Trouble continues for HBO as another episode of the popular Game of Thrones series has just leaked online, days ahead of the official premiere.

Copies of the sixth episode of the current season, titled ‘Death is the Enemy,’ are currently circulating on various streaming portals, direct download, and torrent sites.

The first copy only just appeared on the Pirate Bay, but others were shared elsewhere earlier. One of the leaked videos is 64 minutes long and of high quality, and there are also versions that consist of two separate parts.

Early on, the two parts were circulating on the video streaming site Dailymotion, but these were swiftly removed.

At the moment it’s still unclear how the leak came about but some suggest that it was leaked by HBO itself in Spain. TorrentFreak has not been able to confirm this, but there are no visible watermarks that point elsewhere.

Game of Thrones “S07E06” leak screenshot

This isn’t the first time that a Game of Thrones episode has leaked online early. Two years ago the same happened with the first four episodes of season 5. Nonetheless, that season still broke previous viewership records.

Two weeks ago the fourth episode of the current season was also pirated before its official release. This leak, which carried a prominent “Star India Pvt Ltd” watermark, triggered a lot of interest from impatient Game of Thrones fans as well.

Earlier this week, news broke that four men had been arrested in connection with the breach, which is still being investigated. The arrested men all worked for the local media processing company Prime Focus Technologies, where the leak reportedly originated.

The current leak is not in any way related to the hack on HBO’s system, which occurred earlier and revealed several preliminary Game of Thrones scripts.

This hack has also resulted in leaks of various high profile shows, including the upcoming ninth season of ‘Curb Your Enthusiasm.’ Initially, these were hard to find online, but they are now widely available on the usual pirate sites.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

AWS Summit New York – Summary of Announcements

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-summit-new-york-summary-of-announcements/

Whew – what a week! Tara, Randall, Ana, and I have been working around the clock to create blog posts for the announcements that we made at the AWS Summit in New York. Here’s a summary to help you to get started:

Amazon Macie – This new service helps you to discover, classify, and secure content at scale. Powered by machine learning and making use of Natural Language Processing (NLP), Macie looks for patterns and alerts you to suspicious behavior, and can help you with governance, compliance, and auditing. You can read Tara’s post to see how to put Macie to work; you select the buckets of interest, customize the classification settings, and review the results in the Macie Dashboard.

AWS GlueRandall’s post (with deluxe animated GIFs) introduces you to this new extract, transform, and load (ETL) service. Glue is serverless and fully managed, As you can see from the post, Glue crawls your data, infers schemas, and generates ETL scripts in Python. You define jobs that move data from place to place, with a wide selection of transforms, each expressed as code and stored in human-readable form. Glue uses Development Endpoints and notebooks to provide you with a testing environment for the scripts you build. We also announced that Amazon Athena now integrates with Amazon Glue, as does Apache Spark and Hive on Amazon EMR.

AWS Migration Hub – This new service will help you to migrate your application portfolio to AWS. My post outlines the major steps and shows you how the Migration Hub accelerates, tracks,and simplifies your migration effort. You can begin with a discovery step, or you can jump right in and migrate directly. Migration Hub integrates with tools from our migration partners and builds upon the Server Migration Service and the Database Migration Service.

CloudHSM Update – We made a major upgrade to AWS CloudHSM, making the benefits of hardware-based key management available to a wider audience. The service is offered on a pay-as-you-go basis, and is fully managed. It is open and standards compliant, with support for multiple APIs, programming languages, and cryptography extensions. CloudHSM is an integral part of AWS and can be accessed from the AWS Management Console, AWS Command Line Interface (CLI), and through API calls. Read my post to learn more and to see how to set up a CloudHSM cluster.

Managed Rules to Secure S3 Buckets – We added two new rules to AWS Config that will help you to secure your S3 buckets. The s3-bucket-public-write-prohibited rule identifies buckets that have public write access and the s3-bucket-public-read-prohibited rule identifies buckets that have global read access. As I noted in my post, you can run these rules in response to configuration changes or on a schedule. The rules make use of some leading-edge constraint solving techniques, as part of a larger effort to use automated formal reasoning about AWS.

CloudTrail for All Customers – Tara’s post revealed that AWS CloudTrail is now available and enabled by default for all AWS customers. As a bonus, Tara reviewed the principal benefits of CloudTrail and showed you how to review your event history and to deep-dive on a single event. She also showed you how to create a second trail, for use with CloudWatch CloudWatch Events.

Encryption of Data at Rest for EFS – When you create a new file system, you now have the option to select a key that will be used to encrypt the contents of the files on the file system. The encryption is done using an industry-standard AES-256 algorithm. My post shows you how to select a key and to verify that it is being used.

Watch the Keynote
My colleagues Adrian Cockcroft and Matt Wood talked about these services and others on the stage, and also invited some AWS customers to share their stories. Here’s the video:

Jeff;

 

Game of Thrones Pirates Arrested For Leaking Episode Early

Post Syndicated from Andy original https://torrentfreak.com/game-of-thrones-pirates-arrested-for-leaking-episode-early-170814/

Over the past several years, Game of Thrones has become synonymous with fantastic drama and story telling on the one hand, and Internet piracy on the other. It’s the most pirated TV show in history, hands down.

With the new season well underway, another GoT drama began to unfold early August when the then-unaired episode “The Spoils of War” began to circulate on various file-sharing and streaming sites. The leak only trumped the official release by a few days, but that didn’t stop people downloading in droves.

As previously reported, the leaked episode stated that it was “For Internal Viewing Only” at the top of the screen and on the bottom right sported a “Star India Pvt Ltd” watermark. The company commented shortly after.

“We take this breach very seriously and have immediately initiated forensic investigations at our and the technology partner’s end to swiftly determine the cause. This is a grave issue and we are taking appropriate legal remedial action,” a spokesperson said.

Now, just ten days later, that investigation has already netted its first victims. Four people have reportedly been arrested in India for leaking the episode before it aired.

“We investigated the case and have arrested four individuals for unauthorized publication of the fourth episode from season seven,” Deputy Commissioner of Police Akbar Pathan told AFP.

The report indicates that a complaint was filed by a Mumbai-based company that was responsible for storing and processing the TV episodes for an app. It has been named locally as Prime Focus Technologies, which markets itself as a Netflix “Preferred Vendor”.

It’s claimed that at least some of the men had access to login credentials for Game of Thrones episodes which were then abused for the purposes of leaking.

Local media identified the men as Bhaskar Joshi, Alok Sharma and Abhishek Ghadiyal, who were employed by Prime Focus, and Mohamad Suhail, a former employee, who was responsible for leaking the episode onto the Internet.

All of the men were based in Bangalore and were interrogated “throughout the night” at their workplace on August 11. Star India welcomed the arrests and thanked the authorities for their swift action.

“We are deeply grateful to the police for their swift and prompt action. We believe that valuable intellectual property is a critical part of the development of the creative industry and strict enforcement of the law is essential to protecting it,” the company said in a statement.

“We at Star India and Novi Digital Entertainment Private Limited stand committed and ready to help the law enforcement agencies with any technical assistance and help they may require in taking the investigation to its logical conclusion.”

The men will be held in custody until August 21 while investigations continue.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Launch – AWS Glue Now Generally Available

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/

Today we’re excited to announce the general availability of AWS Glue. Glue is a fully managed, serverless, and cloud-optimized extract, transform and load (ETL) service. Glue is different from other ETL services and platforms in a few very important ways.

First, Glue is “serverless” – you don’t need to provision or manage any resources and you only pay for resources when Glue is actively running. Second, Glue provides crawlers that can automatically detect and infer schemas from many data sources, data types, and across various types of partitions. It stores these generated schemas in a centralized Data Catalog for editing, versioning, querying, and analysis. Third, Glue can automatically generate ETL scripts (in Python!) to translate your data from your source formats to your target formats. Finally, Glue allows you to create development endpoints that allow your developers to use their favorite toolchains to construct their ETL scripts. Ok, let’s dive deep with an example.

In my job as a Developer Evangelist I spend a lot of time traveling and I thought it would be cool to play with some flight data. The Bureau of Transportations Statistics is kind enough to share all of this data for anyone to use here. We can easily download this data and put it in an Amazon Simple Storage Service (S3) bucket. This data will be the basis of our work today.

Crawlers

First, we need to create a Crawler for our flights data from S3. We’ll select Crawlers in the Glue console and follow the on screen prompts from there. I’ll specify s3://crawler-public-us-east-1/flight/2016/csv/ as my first datasource (we can add more later if needed). Next, we’ll create a database called flights and give our tables a prefix of flights as well.

The Crawler will go over our dataset, detect partitions through various folders – in this case months of the year, detect the schema, and build a table. We could add additonal data sources and jobs into our crawler or create separate crawlers that push data into the same database but for now let’s look at the autogenerated schema.

I’m going to make a quick schema change to year, moving it from BIGINT to INT. Then I can compare the two versions of the schema if needed.

Now that we know how to correctly parse this data let’s go ahead and do some transforms.

ETL Jobs

Now we’ll navigate to the Jobs subconsole and click Add Job. Will follow the prompts from there giving our job a name, selecting a datasource, and an S3 location for temporary files. Next we add our target by specifying “Create tables in your data target” and we’ll specify an S3 location in Parquet format as our target.

After clicking next, we’re at screen showing our various mappings proposed by Glue. Now we can make manual column adjustments as needed – in this case we’re just going to use the X button to remove a few columns that we don’t need.

This brings us to my favorite part. This is what I absolutely love about Glue.

Glue generated a PySpark script to transform our data based on the information we’ve given it so far. On the left hand side we can see a diagram documenting the flow of the ETL job. On the top right we see a series of buttons that we can use to add annotated data sources and targets, transforms, spigots, and other features. This is the interface I get if I click on transform.

If we add any of these transforms or additional data sources, Glue will update the diagram on the left giving us a useful visualization of the flow of our data. We can also just write our own code into the console and have it run. We can add triggers to this job that fire on completion of another job, a schedule, or on demand. That way if we add more flight data we can reload this same data back into S3 in the format we need.

I could spend all day writing about the power and versatility of the jobs console but Glue still has more features I want to cover. So, while I might love the script editing console, I know many people prefer their own development environments, tools, and IDEs. Let’s figure out how we can use those with Glue.

Development Endpoints and Notebooks

A Development Endpoint is an environment used to develop and test our Glue scripts. If we navigate to “Dev endpoints” in the Glue console we can click “Add endpoint” in the top right to get started. Next we’ll select a VPC, a security group that references itself and then we wait for it to provision.


Once it’s provisioned we can create an Apache Zeppelin notebook server by going to actions and clicking create notebook server. We give our instance an IAM role and make sure it has permissions to talk to our data sources. Then we can either SSH into the server or connect to the notebook to interactively develop our script.

Pricing and Documentation

You can see detailed pricing information here. Glue crawlers, ETL jobs, and development endpoints are all billed in Data Processing Unit Hours (DPU) (billed by minute). Each DPU-Hour costs $0.44 in us-east-1. A single DPU provides 4vCPU and 16GB of memory.

We’ve only covered about half of the features that Glue has so I want to encourage everyone who made it this far into the post to go read the documentation and service FAQs. Glue also has a rich and powerful API that allows you to do anything console can do and more.

We’re also releasing two new projects today. The aws-glue-libs provide a set of utilities for connecting, and talking with Glue. The aws-glue-samples repo contains a set of example jobs.

I hope you find that using Glue reduces the time it takes to start doing things with your data. Look for another post from me on AWS Glue soon because I can’t stop playing with this new service.
Randall

Deploy a Data Warehouse Quickly with Amazon Redshift, Amazon RDS for PostgreSQL and Tableau Server

Post Syndicated from Jorge A. Lopez original https://aws.amazon.com/blogs/big-data/deploy-a-data-warehouse-quickly-with-amazon-redshift-amazon-rds-for-postgresql-and-tableau-server/

One of the benefits of a data warehouse environment using both Amazon Redshift and Amazon RDS for PostgreSQL is that you can leverage the advantages of each service. Amazon Redshift is a high performance, petabyte-scale data warehouse service optimized for the online analytical processing (OLAP) queries typical of analytic reporting and business intelligence applications. On the other hand, a service like RDS excels at transactional OLTP workloads such as inserting, deleting, or updating rows.

In the recent JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink post, we showed how you can deploy such an environment. Now, you can deploy a similar architecture using the Modern Data Warehouse on AWS Quick Start. The Quick Start is an automated deployment that uses AWS CloudFormation templates to launch, configure, and run the services required to deploy a data warehousing environment on AWS, based on Amazon Redshift and RDS for PostgreSQL.

The Quick Start also includes an instance of Tableau Server, running on Amazon EC2. This gives you the ability to host and serve analytic dashboards, workbooks and visualizations, supported by a trial license. You can play with the sample data source and dashboard, or create your own analyses by uploading your own data sets.

For more information about the Modern Data Warehouse on AWS Quick Start, download the full deployment guide. If you’re ready to get started, use one of the buttons below:

Option 1: Deploy Quick Start into a new VPC on AWS

Option 2: Deploy Quick Start into an existing VPC

If you have questions, please leave a comment below.


Next Steps

You can also join us for the webinar Unlock Insights and Reduce Costs by Modernizing Your Data Warehouse on AWS on Tuesday, August 22, 2017. Pearson, the education and publishing company, will present best practices and lessons learned during their journey to Amazon Redshift and Tableau.

AWS Encryption SDK: How to Decide if Data Key Caching Is Right for Your Application

Post Syndicated from June Blender original https://aws.amazon.com/blogs/security/aws-encryption-sdk-how-to-decide-if-data-key-caching-is-right-for-your-application/

AWS KMS image

Today, the AWS Crypto Tools team introduced a new feature in the AWS Encryption SDK: data key caching. Data key caching lets you reuse the data keys that protect your data, instead of generating a new data key for each encryption operation.

Data key caching can reduce latency, improve throughput, reduce cost, and help you stay within service limits as your application scales. In particular, caching might help if your application is hitting the AWS Key Management Service (KMS) requests-per-second limit and raising the limit does not solve the problem.

However, these benefits come with some security tradeoffs. Encryption best practices generally discourage extensive reuse of data keys.

In this blog post, I explore those tradeoffs and provide information that can help you decide whether data key caching is a good strategy for your application. I also explain how data key caching is implemented in the AWS Encryption SDK and describe the security thresholds that you can set to limit the reuse of data keys. Finally, I provide some practical examples of using the security thresholds to meet cost, performance, and security goals.

Introducing data key caching

The AWS Encryption SDK is a client-side encryption library that makes it easier for you to implement cryptography best practices in your application. It includes secure default behavior for developers who are not encryption experts, while being flexible enough to work for the most experienced users.

In the AWS Encryption SDK, by default, you generate a new data key for each encryption operation. This is the most secure practice. However, in some applications, the overhead of generating a new data key for each operation is not acceptable.

Data key caching saves the plaintext and ciphertext of the data keys you use in a configurable cache. When you need a key to encrypt or decrypt data, you can reuse a data key from the cache instead of creating a new data key. You can create multiple data key caches and configure each one independently. Most importantly, the AWS Encryption SDK provides security thresholds that you can set to determine how much data key reuse you will allow.

To make data key caching easier to implement, the AWS Encryption SDK provides LocalCryptoMaterialsCache, an in-memory, least-recently-used cache with a configurable size. The SDK manages the cache for you, including adding store, search, and match logic to all encryption and decryption operations.

We recommend that you use LocalCryptoMaterialsCache as it is, but you can customize it, or substitute a compatible cache. However, you should never store plaintext data keys on disk.

The AWS Encryption SDK documentation includes sample code in Java and Python for an application that uses data key caching to encrypt data sent to and from Amazon Kinesis Streams.

Balance cost and security

Your decision to use data key caching should balance cost—in time, money, and resources—against security. In every consideration, though, the balance should favor your security requirements. As a rule, use the minimal caching required to achieve your cost and performance goals.

Before implementing data key caching, consider the details of your applications, your security requirements, and the cost and frequency of your encryption operations. In general, your application can benefit from data key caching if each operation is slow or expensive, or if you encrypt and decrypt data frequently. If the cost and speed of your encryption operations are already acceptable or can be improved by other means, do not use a data key cache.

Data key caching can be the right choice for your application if you have high encryption and decryption traffic. For example, if you are hitting your KMS requests-per-second limit, caching can help because you get some of your data keys from the cache instead of calling KMS for every request.

However, you can also create a case in the AWS Support Center to raise the KMS limit for your account. If raising the limit solves the problem, you do not need data key caching.

Configure caching thresholds for cost and security

In the AWS Encryption SDK, you can configure data key caching to allow just enough data key reuse to meet your cost and performance targets while conforming to the security requirements of your application. The SDK enforces the thresholds so that you can use them with any compatible cache.

The data key caching security thresholds apply to each cache entry. The AWS Encryption SDK will not use the data key from a cache entry that exceeds any of the thresholds that you set.

  • Maximum age (required): Set the lifetime of each cached key to be long enough to get cache hits, but short enough to limit exposure of a plaintext data key in memory to a specific time period.

You can use the maximum age threshold like a key rotation policy. Use it to limit the reuse of data keys and minimize exposure of cryptographic materials. You can also use it to evict data keys when the type or source of data that your application is processing changes.

  • Maximum messages encrypted (optional; default is 232 messages): Set the number of messages protected by each cached data key to be large enough to get value from reuse, but small enough to limit the number of messages that might potentially be exposed.

The AWS Encryption SDK only caches data keys that use an algorithm suite with a key derivation function. This technique avoids the cryptographic limits on the number of bytes encrypted with a single key. However, the more data that a key encrypts, the more data that is exposed if the data key is compromised.

Limiting the number of messages, rather than the number of bytes, is particularly useful if your application encrypts many messages of a similar size or when potential exposure must be limited to very few messages. This threshold is also useful when you want to reuse a data key for a particular type of message and know in advance how many messages of that type you have. You can also use an encryption context to select particular cached data keys for your encryption requests.

  • Maximum bytes encrypted (optional; default is 263 – 1): Set the bytes protected by each cached data key to be large enough to allow the reuse you need, but small enough to limit the amount of data encrypted under the same key.

Limiting the number of bytes, rather than the number of messages, is preferable when your application encrypts messages of widely varying size or when possibly exposing large amounts of data is much more of a concern than exposing smaller amounts of data.

In addition to these security thresholds, the LocalCryptoMaterialsCache in the AWS Encryption SDK lets you set its capacity, which is the maximum number of entries the cache can hold.

Use the capacity value to tune the performance of your LocalCryptoMaterialsCache. In general, use the smallest value that will achieve the performance improvements that your application requires. You might want to test with a very small cache of 5–10 entries and expand if necessary. You will need a slightly larger cache if you are using the cache for both encryption and decryption requests, or if you are using encryption contexts to select particular cache entries.

Consider these cache configuration examples

After you determine the security and performance requirements of your application, consider the cache security thresholds carefully and adjust them to meet your needs. There are no magic numbers for these thresholds: the ideal settings are specific to each application, its security and performance requirements, and budget. Use the minimal amount of caching necessary to get acceptable performance and cost.

The following examples show ways you can use the LocalCryptoMaterialsCache capacity setting and the security thresholds to help meet your security requirements:

  • Slow master key operations: If your master key processes only 100 transactions per second (TPS) but your application needs to process 1,000 TPS, you can meet your application requirements by allowing a maximum of 10 messages to be protected under each data key.
  • High frequency and volume: If your master key costs $0.01 per operation and you need to process a consistent 1,000 TPS while staying within a budget of $100,000 per month, allow a maximum of 275 messages for each cache entry.
  • Burst traffic: If your application’s processing bursts to 100 TPS for five seconds in each minute but is otherwise zero, and your master key costs $0.01 per operation, setting maximum messages to 3 can achieve significant savings. To prevent data keys from being reused across bursts (55 seconds), set the maximum age of each cached data key to 20 seconds.
  • Expensive master key operations: If your application uses a low-throughput encryption service that costs as much as $1.00 per operation, you might want to minimize the number of operations. To do so, create a cache that is large enough to contain the data keys you need. Then, set the byte and message limits high enough to allow reuse while conforming to your security requirements. For example, if your security requirements do not permit a data key to encrypt more than 10 GB of data, setting bytes processed to 10 GB still significantly minimizes operations and conforms to your security requirements.

Learn more about data key caching

To learn more about data key caching, including how to implement it, how to set the security thresholds, and details about the caching components, see Data Key Caching in the AWS Encryption SDK. Also, see the AWS Encryption SDKs for Java and Python as well as the Javadoc and Python documentation.

If you have comments about this blog post, submit them in the “Comments” section below. If you have questions, file an issue in the GitHub repos for the Encryption SDK in Java or Python, or start a new thread on the KMS forum.

– June

Turbocharge your Apache Hive queries on Amazon EMR using LLAP

Post Syndicated from Jigar Mistry original https://aws.amazon.com/blogs/big-data/turbocharge-your-apache-hive-queries-on-amazon-emr-using-llap/

Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop cluster using SQL. Data analysts and scientists use Hive to query, summarize, explore, and analyze big data.

With the introduction of Hive LLAP (Low Latency Analytical Processing), the notion of Hive being just a batch processing tool has changed. LLAP uses long-lived daemons with intelligent in-memory caching to circumvent batch-oriented latency and provide sub-second query response times.

This post provides an overview of Hive LLAP, including its architecture and common use cases for boosting query performance. You will learn how to install and configure Hive LLAP on an Amazon EMR cluster and run queries on LLAP daemons.

What is Hive LLAP?

Hive LLAP was introduced in Apache Hive 2.0, which provides very fast processing of queries. It uses persistent daemons that are deployed on a Hadoop YARN cluster using Apache Slider. These daemons are long-running and provide functionality such as I/O with DataNode, in-memory caching, query processing, and fine-grained access control. And since the daemons are always running in the cluster, it saves substantial overhead of launching new YARN containers for every new Hive session, thereby avoiding long startup times.

When Hive is configured in hybrid execution mode, small and short queries execute directly on LLAP daemons. Heavy lifting (like large shuffles in the reduce stage) is performed in YARN containers that belong to the application. Resources (CPU, memory, etc.) are obtained in a traditional fashion using YARN. After the resources are obtained, the execution engine can decide which resources are to be allocated to LLAP, or it can launch Apache Tez processors in separate YARN containers. You can also configure Hive to run all the processing workloads on LLAP daemons for querying small datasets at lightning fast speeds.

LLAP daemons are launched under YARN management to ensure that the nodes don’t get overloaded with the compute resources of these daemons. You can use scheduling queues to make sure that there is enough compute capacity for other YARN applications to run.

Why use Hive LLAP?

With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL  over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:

  • For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs in to their existing architecture without a steep learning curve.
  • It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.
  • It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.
  • LLAP daemons are aware about of the columns and records that are being processed which enables you to enforce fine-grained access control.
  • It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.
  • It can take advantage of a number of Hive optimizations like merging multiple small files for query results, automatically determining the number of reducers for joins and groupbys, etc.
  • It’s optional and modular so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you to run other YARN applications concurrently without reserving a cluster specifically for LLAP.

How do you install Hive LLAP in Amazon EMR?

To install and configure LLAP on an EMR cluster, use the following bootstrap action (BA):

s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh

This BA downloads and installs Apache Slider on the cluster and configures LLAP so that it works with EMR Hive. For LLAP to work, the EMR cluster must have Hive, Tez, and Apache Zookeeper installed.

You can pass the following arguments to the BA.

Argument Definition Default value
--instances Number of instances of LLAP daemon Number of core/task nodes of the cluster
--cache Cache size per instance 20% of physical memory of the node
--executors Number of executors per instance Number of CPU cores of the node
--iothreads Number of IO threads per instance Number of CPU cores of the node
--size Container size per instance 50% of physical memory of the node
--xmx Working memory size 50% of container size
--log-level Log levels for the LLAP instance INFO

LLAP example

This section describes how you can try the faster Hive queries with LLAP using the TPC-DS testbench for Hive on Amazon EMR.

Use the following AWS command line interface (AWS CLI) command to launch a 1+3 nodes m4.xlarge EMR 5.6.0 cluster with the bootstrap action to install LLAP:

aws emr create-cluster --release-label emr-5.6.0 \
--applications Name=Hadoop Name=Hive Name=Hue Name=ZooKeeper Name=Tez \
--bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh","Name":"Custom action"}]' \ 
--ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx"}' 
--service-role EMR_DefaultRole \
--enable-debugging \
--log-uri 's3n://<YOUR-BUCKET/' --name 'test-hive-llap' \
--instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"MASTER","InstanceType":"m4.xlarge","Name":"Master - 1"},{"InstanceCount":3,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"CORE","InstanceType":"m4.xlarge","Name":"Core - 2"}]' 
--region us-east-1

After the cluster is launched, log in to the master node using SSH, and do the following:

  1. Open the hive-tpcds folder:
    cd /home/hadoop/hive-tpcds/
  2. Start Hive CLI using the testbench configuration, create the required tables, and run the sample query:

    hive –i testbench.settings
    hive> source create_tables.sql;
    hive> source query55.sql;

    This sample query runs on a 40 GB dataset that is stored on Amazon S3. The dataset is generated using the data generation tool in the TPC-DS testbench for Hive.It results in output like the following:
  3. This screenshot shows that the query finished in about 47 seconds for LLAP mode. Now, to compare this to the execution time without LLAP, you can run the same workload using only Tez containers:
    hive> set hive.llap.execution.mode=none;
    hive> source query55.sql;


    This query finished in about 80 seconds.

The difference in query execution time is almost 1.7 times when using just YARN containers in contrast to running the query on LLAP daemons. And with every rerun of the query, you notice that the execution time substantially decreases by the virtue of in-memory caching by LLAP daemons.

Conclusion

In this post, I introduced Hive LLAP as a way to boost Hive query performance. I discussed its architecture and described several use cases for the component. I showed how you can install and configure Hive LLAP on an Amazon EMR cluster and how you can run queries on LLAP daemons.

If you have questions about using Hive LLAP on Amazon EMR or would like to share your use cases, please leave a comment below.


Additional Reading

Learn how to to automatically partition Hive external tables with AWS.


About the Author

Jigar Mistry is a Hadoop Systems Engineer with Amazon Web Services. He works with customers to provide them architectural guidance and technical support for processing large datasets in the cloud using open-source applications. In his spare time, he enjoys going for camping and exploring different restaurants in the Seattle area.

 

 

 

 

Deploying an NGINX Reverse Proxy Sidecar Container on Amazon ECS

Post Syndicated from Nathan Peck original https://aws.amazon.com/blogs/compute/nginx-reverse-proxy-sidecar-container-on-amazon-ecs/

Reverse proxies are a powerful software architecture primitive for fetching resources from a server on behalf of a client. They serve a number of purposes, from protecting servers from unwanted traffic to offloading some of the heavy lifting of HTTP traffic processing.

This post explains the benefits of a reverse proxy, and explains how to use NGINX and Amazon EC2 Container Service (Amazon ECS) to easily implement and deploy a reverse proxy for your containerized application.

Components

NGINX is a high performance HTTP server that has achieved significant adoption because of its asynchronous event driven architecture. It can serve thousands of concurrent requests with a low memory footprint. This efficiency also makes it ideal as a reverse proxy.

Amazon ECS is a highly scalable, high performance container management service that supports Docker containers. It allows you to run applications easily on a managed cluster of Amazon EC2 instances. Amazon ECS helps you get your application components running on instances according to a specified configuration. It also helps scale out these components across an entire fleet of instances.

Sidecar containers are a common software pattern that has been embraced by engineering organizations. It’s a way to keep server side architecture easier to understand by building with smaller, modular containers that each serve a simple purpose. Just like an application can be powered by multiple microservices, each microservice can also be powered by multiple containers that work together. A sidecar container is simply a way to move part of the core responsibility of a service out into a containerized module that is deployed alongside a core application container.

The following diagram shows how an NGINX reverse proxy sidecar container operates alongside an application server container:

In this architecture, Amazon ECS has deployed two copies of an application stack that is made up of an NGINX reverse proxy side container and an application container. Web traffic from the public goes to an Application Load Balancer, which then distributes the traffic to one of the NGINX reverse proxy sidecars. The NGINX reverse proxy then forwards the request to the application server and returns its response to the client via the load balancer.

Reverse proxy for security

Security is one reason for using a reverse proxy in front of an application container. Any web server that serves resources to the public can expect to receive lots of unwanted traffic every day. Some of this traffic is relatively benign scans by researchers and tools, such as Shodan or nmap:

[18/May/2017:15:10:10 +0000] "GET /YesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScanningForResearchPurposePleaseHaveALookAtTheUserAgentTHXYesThisIsAReallyLongRequestURLbutWeAreDoingItOnPurposeWeAreScann HTTP/1.1" 404 1389 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36
[18/May/2017:18:19:51 +0000] "GET /clientaccesspolicy.xml HTTP/1.1" 404 322 - Cloud mapping experiment. Contact [email protected]

But other traffic is much more malicious. For example, here is what a web server sees while being scanned by the hacking tool ZmEu, which scans web servers trying to find PHPMyAdmin installations to exploit:

[18/May/2017:16:27:39 +0000] "GET /mysqladmin/scripts/setup.php HTTP/1.1" 404 391 - ZmEu
[18/May/2017:16:27:39 +0000] "GET /web/phpMyAdmin/scripts/setup.php HTTP/1.1" 404 394 - ZmEu
[18/May/2017:16:27:39 +0000] "GET /xampp/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /apache-default/phpmyadmin/scripts/setup.php HTTP/1.1" 404 405 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /phpMyAdmin-2.10.0.0/scripts/setup.php HTTP/1.1" 404 397 - ZmEu
[18/May/2017:16:27:40 +0000] "GET /mysql/scripts/setup.php HTTP/1.1" 404 386 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /admin/scripts/setup.php HTTP/1.1" 404 386 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /forum/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:41 +0000] "GET /typo3/phpmyadmin/scripts/setup.php HTTP/1.1" 404 396 - ZmEu
[18/May/2017:16:27:42 +0000] "GET /phpMyAdmin-2.10.0.1/scripts/setup.php HTTP/1.1" 404 399 - ZmEu
[18/May/2017:16:27:44 +0000] "GET /administrator/components/com_joommyadmin/phpmyadmin/scripts/setup.php HTTP/1.1" 404 418 - ZmEu
[18/May/2017:18:34:45 +0000] "GET /phpmyadmin/scripts/setup.php HTTP/1.1" 404 390 - ZmEu
[18/May/2017:16:27:45 +0000] "GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1" 404 401 - ZmEu

In addition, servers can also end up receiving unwanted web traffic that is intended for another server. In a cloud environment, an application may end up reusing an IP address that was formerly connected to another service. It’s common for misconfigured or misbehaving DNS servers to send traffic intended for a different host to an IP address now connected to your server.

It’s the responsibility of anyone running a web server to handle and reject potentially malicious traffic or unwanted traffic. Ideally, the web server can reject this traffic as early as possible, before it actually reaches the core application code. A reverse proxy is one way to provide this layer of protection for an application server. It can be configured to reject these requests before they reach the application server.

Reverse proxy for performance

Another advantage of using a reverse proxy such as NGINX is that it can be configured to offload some heavy lifting from your application container. For example, every HTTP server should support gzip. Whenever a client requests gzip encoding, the server compresses the response before sending it back to the client. This compression saves network bandwidth, which also improves speed for clients who now don’t have to wait as long for a response to fully download.

NGINX can be configured to accept a plaintext response from your application container and gzip encode it before sending it down to the client. This allows your application container to focus 100% of its CPU allotment on running business logic, while NGINX handles the encoding with its efficient gzip implementation.

An application may have security concerns that require SSL termination at the instance level instead of at the load balancer. NGINX can also be configured to terminate SSL before proxying the request to a local application container. Again, this also removes some CPU load from the application container, allowing it to focus on running business logic. It also gives you a cleaner way to patch any SSL vulnerabilities or update SSL certificates by updating the NGINX container without needing to change the application container.

NGINX configuration

Configuring NGINX for both traffic filtering and gzip encoding is shown below:

http {
  # NGINX will handle gzip compression of responses from the app server
  gzip on;
  gzip_proxied any;
  gzip_types text/plain application/json;
  gzip_min_length 1000;
 
  server {
    listen 80;
 
    # NGINX will reject anything not matching /api
    location /api {
      # Reject requests with unsupported HTTP method
      if ($request_method !~ ^(GET|POST|HEAD|OPTIONS|PUT|DELETE)$) {
        return 405;
      }
 
      # Only requests matching the whitelist expectations will
      # get sent to the application server
      proxy_pass http://app:3000;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection 'upgrade';
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_cache_bypass $http_upgrade;
    }
  }
}

The above configuration only accepts traffic that matches the expression /api and has a recognized HTTP method. If the traffic matches, it is forwarded to a local application container accessible at the local hostname app. If the client requested gzip encoding, the plaintext response from that application container is gzip-encoded.

Amazon ECS configuration

Configuring ECS to run this NGINX container as a sidecar is also simple. ECS uses a core primitive called the task definition. Each task definition can include one or more containers, which can be linked to each other:

 {
  "containerDefinitions": [
     {
       "name": "nginx",
       "image": "<NGINX reverse proxy image URL here>",
       "memory": "256",
       "cpu": "256",
       "essential": true,
       "portMappings": [
         {
           "containerPort": "80",
           "protocol": "tcp"
         }
       ],
       "links": [
         "app"
       ]
     },
     {
       "name": "app",
       "image": "<app image URL here>",
       "memory": "256",
       "cpu": "256",
       "essential": true
     }
   ],
   "networkMode": "bridge",
   "family": "application-stack"
}

This task definition causes ECS to start both an NGINX container and an application container on the same instance. Then, the NGINX container is linked to the application container. This allows the NGINX container to send traffic to the application container using the hostname app.

The NGINX container has a port mapping that exposes port 80 on a publically accessible port but the application container does not. This means that the application container is not directly addressable. The only way to send it traffic is to send traffic to the NGINX container, which filters that traffic down. It only forwards to the application container if the traffic passes the whitelisted rules.

Conclusion

Running a sidecar container such as NGINX can bring significant benefits by making it easier to provide protection for application containers. Sidecar containers also improve performance by freeing your application container from various CPU intensive tasks. Amazon ECS makes it easy to run sidecar containers, and automate their deployment across your cluster.

To see the full code for this NGINX sidecar reference, or to try it out yourself, you can check out the open source NGINX reverse proxy reference architecture on GitHub.

– Nathan
 @nathankpeck

New – Amazon Connect and Amazon Lex Integration

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-amazon-connect-and-amazon-lex-integration/

I’m really excited to share some recent enhancements to two of my favorite services: Amazon Connect and Amazon Lex. Amazon Connect is a self-service, cloud-based contact center service that makes it easy for any business to deliver better customer service at lower cost. Amazon Lex is a service for building conversational interfaces using voice and text. By integrating these two services you can take advantage of Lex‘s automatic speech recognition (ASR) and natural language processing/understading (NLU) capabilities to create great self-service experiences for your customers. To enable this integration the Lex team added support for 8kHz speech input – more on that later. Why should you care about this? Well, if the a bot can solve the majority of your customer’s requests your customers spend less time waiting on hold and more time using your products.

If you need some more background on Amazon Connect or Lex I strongly recommend Jeff’s previous posts[1][2] on these services – especially if you like LEGOs.


Let’s dive in and learn to use this new integration. We’ll take an application that we built on our Twitch channel and modify it for this blog. At the application’s core a user calls an Amazon Connect number which connects them to an Lex bot which invokes an AWS Lambda function based on an intent from Lex. So what does our little application do?

I want to finally settle the question of what the best code editor is: I like vim, it’s a spectacular editor that does one job exceptionally well – editing code (it’s the best). My colleague Jeff likes emacs, a great operating system editor… if you were born with extra joints in your fingers. My colleague Tara loves Visual Studio and sublime. Rather than fighting over what the best editor is I thought we might let you, dear reader, vote. Don’t worry you can even vote for butterflies.

Interested in voting? Call +1 614-569-4019 and tell us which editor you’re voting for! We don’t store your number or record your voice so feel free to vote more than once for vim. Want to see the votes live? http://best-editor-ever.s3-website-us-east-1.amazonaws.com/.

Now, how do we build this little contraption? We’ll cover each component but since we’ve talked about Lex and Lambda before we’ll focus mostly on the Amazon Connect component. I’m going to assume you already have a connect instance running.

Amazon Lex

Let’s start with the Lex side of things. We’ll create a bot named VoteEditor with two intents: VoteEditor with a single slot called editor and ConnectToAgent with no slots. We’ll populate our editor slot full of different code editor names (maybe we’ll leave out emacs).

AWS Lambda

Our Lambda function will also be fairly simple. First we’ll create a Amazon DynamoDB table to store our votes. Then we’ll make a helper method to respond to Lex (build_response) – it will just wrap our message in a Lex friendly response format. Now we just have to figure out our flow logic.


def lambda_handler(event, context):
    if 'ConnectToAgent' == event['currentIntent']['name']:
        return build_response("Ok, connecting you to an agent.")
    elif 'VoteEditor' == event['currentIntent']['name']:
        editor = event['currentIntent']['slots']['editor']
        resp = ddb.update_item(
            Key={"name": editor.lower()},
            UpdateExpression="SET votes = :incr + if_not_exists(votes, :default)",
            ExpressionAttributeValues={":incr": 1, ":default": 0},
            ReturnValues="ALL_NEW"
        )
        msg = "Awesome, now {} has {} votes!".format(
            resp['Attributes']['name'],
            resp['Attributes']['votes'])
        return build_response(msg)

Let’s make sure we understand the code. So, if we got a vote for an editor and it doesn’t exist yet then we add that editor with 1 vote. Otherwise we increase the number of votes on that editor by 1. If we get a request for an agent, we terminate the flow with a nice message. Easy. Now we just tell our Lex bot to use our Lambda function to fulfill our intents. We can test that everything is working over text in the Lex console before moving on.

Amazon Connect

Before we can use our Lex bot in a Contact Flow we have to make sure our Amazon Connect instance has access to it. We can do this by hopping over to the Amazon Connect service console, selecting our instance, and navigating to “Contact Flows”. There should be a section called Lex where you can add your bots!

Now that our Amazon Connect instance can invoke our Lex bot we can create a new Contact Flow that contains our Lex bot. We add the bot to our flow through the “Get customer input” widget from the “Interact” category.

Once we’re on the widget we have a “DTMF” tab for taking input from number keys on a phone or the “Amazon Lex” tab for taking voiceinput and passing it to the Lex service. We’ll use the Lex tab and put in some configuration.

Lots of options, but in short we add the bot we want to use (including the version of the bot), the intents we want to use from our bot, and a short prompt to introduce the bot (and mayb prompt the customer for input).

Our final contact flow looks like this:

A real world example might allow a customer to perform many transactions through a Lex bot. Then on an error or ConnectToAgent intent put the customer into a queue where they could talk to a real person. It could collect and store information about users and populate a rich interface for an agent to use so they could jump right into the conversation with all the context they need.

I want to especially highlight the advantage of 8kHz audio support in Lex. Lex originally only supported speech input that was sampled at a higher rate than the 8 kHz input from the phone. Modern digital communication appliations typically use audio signals sampled at a minimum of 16 kHz. This higher fidelity recroding makes it easier differentiate between sounds like “ess” (/s/) and “eff” (/f/) – or so the audio experts tell me. Phones, however, use a much lower quality recording. Humans, and their ears, are pretty good at using surrounding words to figure out what a voice is saying from a lower quality recording (just check the NASA apollo recordings for proof of this). Most digital phone systems are setup to use 8 kHz sampling by default – it’s a nice tradeoff in bandwidth and fidelity. That’s why your voice sometimes sounds different on the phone. On top of this fundmental sampling rate issue you also have to deal with the fact that a lot of phone call data is already lossy (can you hear me now?). There are thousands of different devices from hundreds of different manufacturers, and tons of different software implentations. So… how do you solve this recognition issue?

The Lex team decided that the best way to address this was to expand the set of models they were using for speech recognition to include an 8kHz model. Support for an 8 kHz telephony audio sampling rate provides increased speech recognition accuracy and fidelity for your contact center interactions. This was a great effort by the team that enables a lot of customers to do more with Amazon Connect.

One final note is that Amazon Connect uses the exact same PostContent endpoint that you can use as an external developer so you don’t have to be a Amazon Connect user to take advantage of this 8kHz feature in Lex.

I hope you guys enjoyed this post and as always the real details are in the docs and API Reference.

Randall

Dutch Film Distributor to Target BitTorrent Users For Cash ‘Fines’

Post Syndicated from Andy original https://torrentfreak.com/dutch-film-distributor-to-target-bittorrent-users-for-cash-fines-170802/

For many carefree years, Dutch Internet users were allowed to download copyrighted content, provided it was for their own personal use. In 2014, however, the European Court of Justice ruled that the country’s “piracy levy” to compensate rightsholders was unlawful. An immediate downloading ban followed.

That action took place more than three years ago but as recently reported by Dutch anti-piracy BREIN, the country still has an appetite for unauthorized content consumption. Some of that takes place with the assistance of torrent sites but for the most part, file-sharers have had little to worry about.

That could all be about to change with the news that local film distributor Dutch Filmworks (DFW) has announced its intention to monitor torrent site users and collect data on their online activities. The news comes via the Dutch Data Protection Authority (Autoriteit Persoonsgegevens), which needs to be formally advised in order for the data collection to go ahead.

DFW’s plans are outlined in a detailed application (Dutch, pdf) dated July 2017. It explains that DFW wishes to combat “the unlawful dissemination of copyright protected works” in order to protect their own interests, and this involves collecting data on Dutch individuals without their knowledge or permission.

“DFW intends to collect data from people who exchange files over the Internet through BitTorrent networks. The data processing consists of capturing proof of exchange of files via IP addresses for the purpose of researching involvement of these users in the distribution or reproduction of copyrighted works,” it reads.

DFW will employ an external German-based tracking company to monitor alleged pirates which will “automatically participate in swarms in which works from DFW are being shared.” Data collected from non-Dutch users will be stripped and discarded but information about local pirates will be retained and processed for further action.

However, in order for DFW to connect an IP address with an individual, the company will have to approach Internet service providers to obtain subscriber information including names and addresses. DFW says that if ISPs won’t cooperate voluntarily, it will be forced to take its case to court. Given past experience, that will probably have to happen.

In March 2016, anti-piracy outfit BREIN obtained permission from the Dutch Data Protection Authority to collect similar data on alleged BitTorrent users, aiming to change attitudes among pirates with fines and legal action.

Several ISPs, most prominently Ziggo, announced that they would not voluntarily cooperate with BREIN and that personal information would only be handed over if BREIN took them to court. It’s logical to presume that Dutch Filmworks will receive the same treatment.

Should the company be successful, however, it has had detailed a stepped plan. First, the alleged pirate will receive a warning and DFW will aim to reach “an amicable settlement” for the breach. If one cannot be reached, further legal action could be taken, up to and including prosecution and claims for damages.

The whole scheme certainly sounds like a classic “copyright trolling” operation in the making but only time will tell which end of the spectrum this project will fall. When asked by NU.nl whether DFW would actually be seeking cash from alleged pirates, it declined to comment.

“This is the first step in this process. We’re going to see what we’re going to do after 25 August,” a spokesperson said.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.