Tag Archives: lambda

Announcing the Winners of the AWS Chatbot Challenge – Conversational, Intelligent Chatbots using Amazon Lex and AWS Lambda

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/announcing-the-winners-of-the-aws-chatbot-challenge-conversational-intelligent-chatbots-using-amazon-lex-and-aws-lambda/

A couple of months ago on the blog, I announced the AWS Chatbot Challenge in conjunction with Slack. The AWS Chatbot Challenge was an opportunity to build a unique chatbot that helped to solve a problem or that would add value for its prospective users. The mission was to build a conversational, natural language chatbot using Amazon Lex and leverage Lex’s integration with AWS Lambda to execute logic or data processing on the backend.

I know that you all have been anxiously waiting to hear announcements of who were the winners of the AWS Chatbot Challenge as much as I was. Well wait no longer, the winners of the AWS Chatbot Challenge have been decided.

May I have the Envelope Please? (The Trumpets sound)

The winners of the AWS Chatbot Challenge are:

  • First Place: BuildFax Counts by Joe Emison
  • Second Place: Hubsy by Andrew Riess, Andrew Puch, and John Wetzel
  • Third Place: PFMBot by Benny Leong and his team from MoneyLion.
  • Large Organization Winner: ADP Payroll Innovation Bot by Eric Liu, Jiaxing Yan, and Fan Yang

 

Diving into the Winning Chatbot Projects

Let’s take a walkthrough of the details for each of the winning projects to get a view of what made these chatbots distinctive, as well as, learn more about the technologies used to implement the chatbot solution.

 

BuildFax Counts by Joe Emison

The BuildFax Counts bot was created as a real solution for the BuildFax company to decrease the amount the time that sales and marketing teams can get answers on permits or properties with permits meet certain criteria.

BuildFax, a company co-founded by bot developer Joe Emison, has the only national database of building permits, which updates data from approximately half of the United States on a monthly basis. In order to accommodate the many requests that come in from the sales and marketing team regarding permit information, BuildFax has a technical sales support team that fulfills these requests sent to a ticketing system by manually writing SQL queries that run across the shards of the BuildFax databases. Since there are a large number of requests received by the internal sales support team and due to the manual nature of setting up the queries, it may take several days for getting the sales and marketing teams to receive an answer.

The BuildFax Counts chatbot solves this problem by taking the permit inquiry that would normally be sent into a ticket from the sales and marketing team, as input from Slack to the chatbot. Once the inquiry is submitted into Slack, a query executes and the inquiry results are returned immediately.

Joe built this solution by first creating a nightly export of the data in their BuildFax MySQL RDS database to CSV files that are stored in Amazon S3. From the exported CSV files, an Amazon Athena table was created in order to run quick and efficient queries on the data. He then used Amazon Lex to create a bot to handle the common questions and criteria that may be asked by the sales and marketing teams when seeking data from the BuildFax database by modeling the language used from the BuildFax ticketing system. He added several different sample utterances and slot types; both custom and Lex provided, in order to correctly parse every question and criteria combination that could be received from an inquiry.  Using Lambda, Joe created a Javascript Lambda function that receives information from the Lex intent and used it to build a SQL statement that runs against the aforementioned Athena database using the AWS SDK for JavaScript in Node.js library to return inquiry count result and SQL statement used.

The BuildFax Counts bot is used today for the BuildFax sales and marketing team to get back data on inquiries immediately that previously took up to a week to receive results.

Not only is BuildFax Counts bot our 1st place winner and wonderful solution, but its creator, Joe Emison, is a great guy.  Joe has opted to donate his prize; the $5,000 cash, the $2,500 in AWS Credits, and one re:Invent ticket to the Black Girls Code organization. I must say, you rock Joe for helping these kids get access and exposure to technology.

 

Hubsy by Andrew Riess, Andrew Puch, and John Wetzel

Hubsy bot was created to redefine and personalize the way users traditionally manage their HubSpot account. HubSpot is a SaaS system providing marketing, sales, and CRM software. Hubsy allows users of HubSpot to create engagements and log engagements with customers, provide sales teams with deals status, and retrieves client contact information quickly. Hubsy uses Amazon Lex’s conversational interface to execute commands from the HubSpot API so that users can gain insights, store and retrieve data, and manage tasks directly from Facebook, Slack, or Alexa.

In order to implement the Hubsy chatbot, Andrew and the team members used AWS Lambda to create a Lambda function with Node.js to parse the users request and call the HubSpot API, which will fulfill the initial request or return back to the user asking for more information. Terraform was used to automatically setup and update Lambda, CloudWatch logs, as well as, IAM profiles. Amazon Lex was used to build the conversational piece of the bot, which creates the utterances that a person on a sales team would likely say when seeking information from HubSpot. To integrate with Alexa, the Amazon Alexa skill builder was used to create an Alexa skill which was tested on an Echo Dot. Cloudwatch Logs are used to log the Lambda function information to CloudWatch in order to debug different parts of the Lex intents. In order to validate the code before the Terraform deployment, ESLint was additionally used to ensure the code was linted and proper development standards were followed.

 

PFMBot by Benny Leong and his team from MoneyLion

PFMBot, Personal Finance Management Bot,  is a bot to be used with the MoneyLion finance group which offers customers online financial products; loans, credit monitoring, and free credit score service to improve the financial health of their customers. Once a user signs up an account on the MoneyLion app or website, the user has the option to link their bank accounts with the MoneyLion APIs. Once the bank account is linked to the APIs, the user will be able to login to their MoneyLion account and start having a conversation with the PFMBot based on their bank account information.

The PFMBot UI has a web interface built with using Javascript integration. The chatbot was created using Amazon Lex to build utterances based on the possible inquiries about the user’s MoneyLion bank account. PFMBot uses the Lex built-in AMAZON slots and parsed and converted the values from the built-in slots to pass to AWS Lambda. The AWS Lambda functions interacting with Amazon Lex are Java-based Lambda functions which call the MoneyLion Java-based internal APIs running on Spring Boot. These APIs obtain account data and related bank account information from the MoneyLion MySQL Database.

 

ADP Payroll Innovation Bot by Eric Liu, Jiaxing Yan, and Fan Yang

ADP PI (Payroll Innovation) bot is designed to help employees of ADP customers easily review their own payroll details and compare different payroll data by just asking the bot for results. The ADP PI Bot additionally offers issue reporting functionality for employees to report payroll issues and aids HR managers in quickly receiving and organizing any reported payroll issues.

The ADP Payroll Innovation bot is an ecosystem for the ADP payroll consisting of two chatbots, which includes ADP PI Bot for external clients (employees and HR managers), and ADP PI DevOps Bot for internal ADP DevOps team.


The architecture for the ADP PI DevOps bot is different architecture from the ADP PI bot shown above as it is deployed internally to ADP. The ADP PI DevOps bot allows input from both Slack and Alexa. When input comes into Slack, Slack sends the request to Lex for it to process the utterance. Lex then calls the Lambda backend, which obtains ADP data sitting in the ADP VPC running within an Amazon VPC. When input comes in from Alexa, a Lambda function is called that also obtains data from the ADP VPC running on AWS.

The architecture for the ADP PI bot consists of users entering in requests and/or entering issues via Slack. When requests/issues are entered via Slack, the Slack APIs communicate via Amazon API Gateway to AWS Lambda. The Lambda function either writes data into one of the Amazon DynamoDB databases for recording issues and/or sending issues or it sends the request to Lex. When sending issues, DynamoDB integrates with Trello to keep HR Managers abreast of the escalated issues. Once the request data is sent from Lambda to Lex, Lex processes the utterance and calls another Lambda function that integrates with the ADP API and it calls ADP data from within the ADP VPC, which runs on Amazon Virtual Private Cloud (VPC).

Python and Node.js were the chosen languages for the development of the bots.

The ADP PI bot ecosystem has the following functional groupings:

Employee Functionality

  • Summarize Payrolls
  • Compare Payrolls
  • Escalate Issues
  • Evolve PI Bot

HR Manager Functionality

  • Bot Management
  • Audit and Feedback

DevOps Functionality

  • Reduce call volume in service centers (ADP PI Bot).
  • Track issues and generate reports (ADP PI Bot).
  • Monitor jobs for various environment (ADP PI DevOps Bot)
  • View job dashboards (ADP PI DevOps Bot)
  • Query job details (ADP PI DevOps Bot)

 

Summary

Let’s all wish all the winners of the AWS Chatbot Challenge hearty congratulations on their excellent projects.

You can review more details on the winning projects, as well as, all of the submissions to the AWS Chatbot Challenge at: https://awschatbot2017.devpost.com/submissions. If you are curious on the details of Chatbot challenge contest including resources, rules, prizes, and judges, you can review the original challenge website here:  https://awschatbot2017.devpost.com/.

Hopefully, you are just as inspired as I am to build your own chatbot using Lex and Lambda. For more information, take a look at the Amazon Lex developer guide or the AWS AI blog on Building Better Bots Using Amazon Lex (Part 1)

Chat with you soon!

Tara

New – SES Dedicated IP Pools

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-ses-dedicated-ip-pools/

Today we released Dedicated IP Pools for Amazon Simple Email Service (SES). With dedicated IP pools, you can specify which dedicated IP addresses to use for sending different types of email. Dedicated IP pools let you use your SES for different tasks. For instance, you can send transactional emails from one set of IPs and you can send marketing emails from another set of IPs.

If you’re not familiar with Amazon SES these concepts may not make much sense. We haven’t had the chance to cover SES on this blog since 2016, which is a shame, so I want to take a few steps back and talk about the service as a whole and some of the enhancements the team has made over the past year. If you just want the details on this new feature I strongly recommend reading the Amazon Simple Email Service Blog.

What is SES?

So, what is SES? If you’re a customer of Amazon.com you know that we send a lot of emails. Bought something? You get an email. Order shipped? You get an email. Over time, as both email volumes and types increased Amazon.com needed to build an email platform that was flexible, scalable, reliable, and cost-effective. SES is the result of years of Amazon’s own work in dealing with email and maximizing deliverability.

In short: SES gives you the ability to send and receive many types of email with the monitoring and tools to ensure high deliverability.

Sending an email is easy; one simple API call:

import boto3
ses = boto3.client('ses')
ses.send_email(
    Source='[email protected]',
    Destination={'ToAddresses': ['[email protected]']},
    Message={
        'Subject': {'Data': 'Hello, World!'},
        'Body': {'Text': {'Data': 'Hello, World!'}}
    }
)

Receiving and reacting to emails is easy too. You can set up rulesets that forward received emails to Amazon Simple Storage Service (S3), Amazon Simple Notification Service (SNS), or AWS Lambda – you could even trigger a Amazon Lex bot through Lambda to communicate with your customers over email. SES is a powerful tool for building applications. The image below shows just a fraction of the capabilities:

Deliverability 101

Deliverability is the percentage of your emails that arrive in your recipients’ inboxes. Maintaining deliverability is a shared responsibility between AWS and the customer. AWS takes the fight against spam very seriously and works hard to make sure services aren’t abused. To learn more about deliverability I recommend the deliverability docs. For now, understand that deliverability is an important aspect of email campaigns and SES has many tools that enable a customer to manage their deliverability.

Dedicated IPs and Dedicated IP pools

When you’re starting out with SES your emails are sent through a shared IP. That IP is responsible for sending mail on behalf of many customers and AWS works to maintain appropriate volume and deliverability on each of those IPs. However, when you reach a sufficient volume shared IPs may not be the right solution.

By creating a dedicated IP you’re able to fully control the reputations of those IPs. This makes it vastly easier to troubleshoot any deliverability or reputation issues. It’s also useful for many email certification programs which require a dedicated IP as a commitment to maintaining your email reputation. Using the shared IPs of the Amazon SES service is still the right move for many customers but if you have sustained daily sending volume greater than hundreds of thousands of emails per day you might want to consider a dedicated IP. One caveat to be aware of: if you’re not sending a sufficient volume of email with a consistent pattern a dedicated IP can actually hurt your reputation. Dedicated IPs are $24.95 per address per month at the time of this writing – but you can find out more at the pricing page.

Before you can use a Dedicated IP you need to “warm” it. You do this by gradually increasing the volume of emails you send through a new address. Each IP needs time to build a positive reputation. In March of this year SES released the ability to automatically warm your IPs over the course of 45 days. This feature is on by default for all new dedicated IPs.

Customers who send high volumes of email will typically have multiple dedicated IPs. Today the SES team released dedicated IP pools to make managing those IPs easier. Now when you send email you can specify a configuration set which will route your email to an IP in a pool based on the pool’s association with that configuration set.

One of the other major benefits of this feature is that it allows customers who previously split their email sending across several AWS accounts (to manage their reputation for different types of email) to consolidate into a single account.

You can read the documentation and blog for more info.

Analyzing AWS Cost and Usage Reports with Looker and Amazon Athena

Post Syndicated from Dillon Morrison original https://aws.amazon.com/blogs/big-data/analyzing-aws-cost-and-usage-reports-with-looker-and-amazon-athena/

This is a guest post by Dillon Morrison at Looker. Looker is, in their own words, “a new kind of analytics platform–letting everyone in your business make better decisions by getting reliable answers from a tool they can use.” 

As the breadth of AWS products and services continues to grow, customers are able to more easily move their technology stack and core infrastructure to AWS. One of the attractive benefits of AWS is the cost savings. Rather than paying upfront capital expenses for large on-premises systems, customers can instead pay variables expenses for on-demand services. To further reduce expenses AWS users can reserve resources for specific periods of time, and automatically scale resources as needed.

The AWS Cost Explorer is great for aggregated reporting. However, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and can be the better choice for the long term. Thankfully, with the introduction of Amazon Athena, monitoring and managing these costs is now easier than ever.

In the post, I walk through setting up the data pipeline for cost and usage reports, Amazon S3, and Athena, and discuss some of the most common levers for cost savings. I surface tables through Looker, which comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive.

Analysis with Athena

With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Athena uses Apache Hive’s DDL to create tables, and the Presto querying engine to process queries. Analysis can be performed directly on raw data in S3. Conveniently, AWS exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.

After the data pipeline is established, cost and usage data (the recommended billing data, per AWS documentation) provides a plethora of comprehensive information around usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, this report can be cut-and-sliced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, purchase option (on-demand, spot, or reserved), and so on.

Walkthrough

By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which are discussed below.

Prerequisites

If you want to follow along, you need the following resources:

Enable the cost and usage reports

First, enable the Cost and Usage report. For Time unit, select Hourly. For Include, select Resource IDs. All options are prompted in the report-creation window.

The Cost and Usage report dumps CSV files into the specified S3 bucket. Please note that it can take up to 24 hours for the first file to be delivered after enabling the report.

Configure the S3 bucket and files for Athena querying

In addition to the CSV file, AWS also creates a JSON manifest file for each cost and usage report. Athena requires that all of the files in the S3 bucket are in the same format, so we need to get rid of all these manifest files. If you’re looking to get started with Athena quickly, you can simply go into your S3 bucket and delete the manifest file manually, skip the automation described below, and move on to the next section.

To automate the process of removing the manifest file each time a new report is dumped into S3, which I recommend as you scale, there are a few additional steps. The folks at Concurrency labs wrote a great overview and set of scripts for this, which you can find in their GitHub repo.

These scripts take the data from an input bucket, remove anything unnecessary, and dump it into a new output bucket. We can utilize AWS Lambda to trigger this process whenever new data is dropped into S3, or on a nightly basis, or whatever makes most sense for your use-case, depending on how often you’re querying the data. Please note that enabling the “hourly” report means that data is reported at the hour-level of granularity, not that a new file is generated every hour.

Following these scripts, you’ll notice that we’re adding a date partition field, which isn’t necessary but improves query performance. In addition, converting data from CSV to a columnar format like ORC or Parquet also improves performance. We can automate this process using Lambda whenever new data is dropped in our S3 bucket. Amazon Web Services discusses columnar conversion at length, and provides walkthrough examples, in their documentation.

As a long-term solution, best practice is to use compression, partitioning, and conversion. However, for purposes of this walkthrough, we’re not going to worry about them so we can get up-and-running quicker.

Set up the Athena query engine

In your AWS console, navigate to the Athena service, and click “Get Started”. Follow the tutorial and set up a new database (we’ve called ours “AWS Optimizer” in this example). Don’t worry about configuring your initial table, per the tutorial instructions. We’ll be creating a new table for cost and usage analysis. Once you walked through the tutorial steps, you’ll be able to access the Athena interface, and can begin running Hive DDL statements to create new tables.

One thing that’s important to note, is that the Cost and Usage CSVs also contain the column headers in their first row, meaning that the column headers would be included in the dataset and any queries. For testing and quick set-up, you can remove this line manually from your first few CSV files. Long-term, you’ll want to use a script to programmatically remove this row each time a new file is dropped in S3 (every few hours typically). We’ve drafted up a sample script for ease of reference, which we run on Lambda. We utilize Lambda’s native ability to invoke the script whenever a new object is dropped in S3.

For cost and usage, we recommend using the DDL statement below. Since our data is in CSV format, we don’t need to use a SerDe, we can simply specify the “separatorChar, quoteChar, and escapeChar”, and the structure of the files (“TEXTFILE”). Note that AWS does have an OpenCSV SerDe as well, if you prefer to use that.

 

CREATE EXTERNAL TABLE IF NOT EXISTS cost_and_usage	 (
identity_LineItemId String,
identity_TimeInterval String,
bill_InvoiceId String,
bill_BillingEntity String,
bill_BillType String,
bill_PayerAccountId String,
bill_BillingPeriodStartDate String,
bill_BillingPeriodEndDate String,
lineItem_UsageAccountId String,
lineItem_LineItemType String,
lineItem_UsageStartDate String,
lineItem_UsageEndDate String,
lineItem_ProductCode String,
lineItem_UsageType String,
lineItem_Operation String,
lineItem_AvailabilityZone String,
lineItem_ResourceId String,
lineItem_UsageAmount String,
lineItem_NormalizationFactor String,
lineItem_NormalizedUsageAmount String,
lineItem_CurrencyCode String,
lineItem_UnblendedRate String,
lineItem_UnblendedCost String,
lineItem_BlendedRate String,
lineItem_BlendedCost String,
lineItem_LineItemDescription String,
lineItem_TaxType String,
product_ProductName String,
product_accountAssistance String,
product_architecturalReview String,
product_architectureSupport String,
product_availability String,
product_bestPractices String,
product_cacheEngine String,
product_caseSeverityresponseTimes String,
product_clockSpeed String,
product_currentGeneration String,
product_customerServiceAndCommunities String,
product_databaseEdition String,
product_databaseEngine String,
product_dedicatedEbsThroughput String,
product_deploymentOption String,
product_description String,
product_durability String,
product_ebsOptimized String,
product_ecu String,
product_endpointType String,
product_engineCode String,
product_enhancedNetworkingSupported String,
product_executionFrequency String,
product_executionLocation String,
product_feeCode String,
product_feeDescription String,
product_freeQueryTypes String,
product_freeTrial String,
product_frequencyMode String,
product_fromLocation String,
product_fromLocationType String,
product_group String,
product_groupDescription String,
product_includedServices String,
product_instanceFamily String,
product_instanceType String,
product_io String,
product_launchSupport String,
product_licenseModel String,
product_location String,
product_locationType String,
product_maxIopsBurstPerformance String,
product_maxIopsvolume String,
product_maxThroughputvolume String,
product_maxVolumeSize String,
product_maximumStorageVolume String,
product_memory String,
product_messageDeliveryFrequency String,
product_messageDeliveryOrder String,
product_minVolumeSize String,
product_minimumStorageVolume String,
product_networkPerformance String,
product_operatingSystem String,
product_operation String,
product_operationsSupport String,
product_physicalProcessor String,
product_preInstalledSw String,
product_proactiveGuidance String,
product_processorArchitecture String,
product_processorFeatures String,
product_productFamily String,
product_programmaticCaseManagement String,
product_provisioned String,
product_queueType String,
product_requestDescription String,
product_requestType String,
product_routingTarget String,
product_routingType String,
product_servicecode String,
product_sku String,
product_softwareType String,
product_storage String,
product_storageClass String,
product_storageMedia String,
product_technicalSupport String,
product_tenancy String,
product_thirdpartySoftwareSupport String,
product_toLocation String,
product_toLocationType String,
product_training String,
product_transferType String,
product_usageFamily String,
product_usagetype String,
product_vcpu String,
product_version String,
product_volumeType String,
product_whoCanOpenCases String,
pricing_LeaseContractLength String,
pricing_OfferingClass String,
pricing_PurchaseOption String,
pricing_publicOnDemandCost String,
pricing_publicOnDemandRate String,
pricing_term String,
pricing_unit String,
reservation_AvailabilityZone String,
reservation_NormalizedUnitsPerReservation String,
reservation_NumberOfReservations String,
reservation_ReservationARN String,
reservation_TotalReservedNormalizedUnits String,
reservation_TotalReservedUnits String,
reservation_UnitsPerReservation String,
resourceTags_userName String,
resourceTags_usercostcategory String  


)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'

STORED AS TEXTFILE
    LOCATION 's3://<<your bucket name>>';

Once you’ve successfully executed the command, you should see a new table named “cost_and_usage” with the below properties. Now we’re ready to start executing queries and running analysis!

Start with Looker and connect to Athena

Setting up Looker is a quick process, and you can try it out for free here (or download from Amazon Marketplace). It takes just a few seconds to connect Looker to your Athena database, and Looker comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive. After you’re connected, you can use the Looker UI to run whatever analysis you’d like. Looker translates this UI to optimized SQL, so any user can execute and visualize queries for true self-service analytics.

Major cost saving levers

Now that the data pipeline is configured, you can dive into the most popular use cases for cost savings. In this post, I focus on:

  • Purchasing Reserved Instances vs. On-Demand Instances
  • Data transfer costs
  • Allocating costs over users or other Attributes (denoted with resource tags)

On-Demand, Spot, and Reserved Instances

Purchasing Reserved Instances vs On-Demand Instances is arguably going to be the biggest cost lever for heavy AWS users (Reserved Instances run up to 75% cheaper!). AWS offers three options for purchasing instances:

  • On-Demand—Pay as you use.
  • Spot (variable cost)—Bid on spare Amazon EC2 computing capacity.
  • Reserved Instances—Pay for an instance for a specific, allotted period of time.

When purchasing a Reserved Instance, you can also choose to pay all-upfront, partial-upfront, or monthly. The more you pay upfront, the greater the discount.

If your company has been using AWS for some time now, you should have a good sense of your overall instance usage on a per-month or per-day basis. Rather than paying for these instances On-Demand, you should try to forecast the number of instances you’ll need, and reserve them with upfront payments.

The total amount of usage with Reserved Instances versus overall usage with all instances is called your coverage ratio. It’s important not to confuse your coverage ratio with your Reserved Instance utilization. Utilization represents the amount of reserved hours that were actually used. Don’t worry about exceeding capacity, you can still set up Auto Scaling preferences so that more instances get added whenever your coverage or utilization crosses a certain threshold (we often see a target of 80% for both coverage and utilization among savvy customers).

Calculating the reserved costs and coverage can be a bit tricky with the level of granularity provided by the cost and usage report. The following query shows your total cost over the last 6 months, broken out by Reserved Instance vs other instance usage. You can substitute the cost field for usage if you’d prefer. Please note that you should only have data for the time period after the cost and usage report has been enabled (though you can opt for up to 3 months of historical data by contacting your AWS Account Executive). If you’re just getting started, this query will only show a few days.

 

SELECT 
	DATE_FORMAT(from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate),'%Y-%m') AS "cost_and_usage.usage_start_month",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_ris",
	COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_non_reserved_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_on_non_ris"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

The resulting table should look something like the image below (I’m surfacing tables through Looker, though the same table would result from querying via command line or any other interface).

With a BI tool, you can create dashboards for easy reference and monitoring. New data is dumped into S3 every few hours, so your dashboards can update several times per day.

It’s an iterative process to understand the appropriate number of Reserved Instances needed to meet your business needs. After you’ve properly integrated Reserved Instances into your purchasing patterns, the savings can be significant. If your coverage is consistently below 70%, you should seriously consider adjusting your purchase types and opting for more Reserved instances.

Data transfer costs

One of the great things about AWS data storage is that it’s incredibly cheap. Most charges often come from moving and processing that data. There are several different prices for transferring data, broken out largely by transfers between regions and availability zones. Transfers between regions are the most costly, followed by transfers between Availability Zones. Transfers within the same region and same availability zone are free unless using elastic or public IP addresses, in which case there is a cost. You can find more detailed information in the AWS Pricing Docs. With this in mind, there are several simple strategies for helping reduce costs.

First, since costs increase when transferring data between regions, it’s wise to ensure that as many services as possible reside within the same region. The more you can localize services to one specific region, the lower your costs will be.

Second, you should maximize the data you’re routing directly within AWS services and IP addresses. Transfers out to the open internet are the most costly and least performant mechanisms of data transfers, so it’s best to keep transfers within AWS services.

Lastly, data transfers between private IP addresses are cheaper than between elastic or public IP addresses, so utilizing private IP addresses as much as possible is the most cost-effective strategy.

The following query provides a table depicting the total costs for each AWS product, broken out transfer cost type. Substitute the “lineitem_productcode” field in the query to segment the costs by any other attribute. If you notice any unusually high spikes in cost, you’ll need to dig deeper to understand what’s driving that spike: location, volume, and so on. Drill down into specific costs by including “product_usagetype” and “product_transfertype” in your query to identify the types of transfer costs that are driving up your bill.

SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost), 0) AS "cost_and_usage.total_unblended_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-In')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_inbound_data_transfer_cost",
	COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-Out')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0) AS "cost_and_usage.total_outbound_data_transfer_cost"
FROM aws_optimizer.cost_and_usage  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

When moving between regions or over the open web, many data transfer costs also include the origin and destination location of the data movement. Using a BI tool with mapping capabilities, you can get a nice visual of data flows. The point at the center of the map is used to represent external data flows over the open internet.

Analysis by tags

AWS provides the option to apply custom tags to individual resources, so you can allocate costs over whatever customized segment makes the most sense for your business. For a SaaS company that hosts software for customers on AWS, maybe you’d want to tag the size of each customer. The following query uses custom tags to display the reserved, data transfer, and total cost for each AWS service, broken out by tag categories, over the last 6 months. You’ll want to substitute the cost_and_usage.resourcetags_customersegment and cost_and_usage.customer_segment with the name of your customer field.

 

SELECT * FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY z___min_rank) as z___pivot_row_rank, RANK() OVER (PARTITION BY z__pivot_col_rank ORDER BY z___min_rank) as z__pivot_col_ordering FROM (
SELECT *, MIN(z___rank) OVER (PARTITION BY "cost_and_usage.product_code") as z___min_rank FROM (
SELECT *, RANK() OVER (ORDER BY CASE WHEN z__pivot_col_rank=1 THEN (CASE WHEN "cost_and_usage.total_unblended_cost" IS NOT NULL THEN 0 ELSE 1 END) ELSE 2 END, CASE WHEN z__pivot_col_rank=1 THEN "cost_and_usage.total_unblended_cost" ELSE NULL END DESC, "cost_and_usage.total_unblended_cost" DESC, z__pivot_col_rank, "cost_and_usage.product_code") AS z___rank FROM (
SELECT *, DENSE_RANK() OVER (ORDER BY CASE WHEN "cost_and_usage.customer_segment" IS NULL THEN 1 ELSE 0 END, "cost_and_usage.customer_segment") AS z__pivot_col_rank FROM (
SELECT 
	cost_and_usage.lineitem_productcode  AS "cost_and_usage.product_code",
	cost_and_usage.resourcetags_customersegment  AS "cost_and_usage.customer_segment",
	COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0) AS "cost_and_usage.total_unblended_cost",
	1.0 * (COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer')    THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.percent_spend_data_transfers_unblended",
	1.0 * (COALESCE(SUM(CASE WHEN (CASE
         WHEN cost_and_usage.lineitem_lineitemtype = 'DiscountedUsage' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'RIFee' THEN 'RI Line Item'
         WHEN cost_and_usage.lineitem_lineitemtype = 'Fee' THEN 'RI Line Item'
         ELSE 'Non RI Line Item'
        END = 'Non RI Line Item') THEN cost_and_usage.lineitem_unblendedcost  ELSE NULL END), 0)) / NULLIF((COALESCE(SUM(cost_and_usage.lineitem_unblendedcost ), 0)),0)  AS "cost_and_usage.unblended_percent_spend_on_ris"
FROM aws_optimizer.cost_and_usage_raw  AS cost_and_usage

WHERE 
	(((from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) >= ((DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))) AND (from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate)) < ((DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))))))
GROUP BY 1,2) ww
) bb WHERE z__pivot_col_rank <= 16384
) aa
) xx
) zz
 WHERE z___pivot_row_rank <= 500 OR z__pivot_col_ordering = 1 ORDER BY z___pivot_row_rank

The resulting table in this example looks like the results below. In this example, you can tell that we’re making poor use of Reserved Instances because they represent such a small portion of our overall costs.

Again, using a BI tool to visualize these costs and trends over time makes the analysis much easier to consume and take action on.

Summary

Saving costs on your AWS spend is always an iterative, ongoing process. Hopefully with these queries alone, you can start to understand your spending patterns and identify opportunities for savings. However, this is just a peek into the many opportunities available through analysis of the Cost and Usage report. Each company is different, with unique needs and usage patterns. To achieve maximum cost savings, we encourage you to set up an analytics environment that enables your team to explore all potential cuts and slices of your usage data, whenever it’s necessary. Exploring different trends and spikes across regions, services, user types, etc. helps you gain comprehensive understanding of your major cost levers and consistently implement new cost reduction strategies.

Note that all of the queries and analysis provided in this post were generated using the Looker data platform. If you’re already a Looker customer, you can get all of this analysis, additional pre-configured dashboards, and much more using Looker Blocks for AWS.


About the Author

Dillon Morrison leads the Platform Ecosystem at Looker. He enjoys exploring new technologies and architecting the most efficient data solutions for the business needs of his company and their customers. In his spare time, you’ll find Dillon rock climbing in the Bay Area or nose deep in the docs of the latest AWS product release at his favorite cafe (“Arlequin in SF is unbeatable!”).

 

 

 

AWS Online Tech Talks – August 2017

Post Syndicated from Sara Rodas original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-august-2017/

Welcome to mid-August, everyone–the season of beach days, family road trips, and an inbox full of “out of office” emails from your coworkers. Just in case spending time indoors has you feeling a bit blue, we’ve got a piping hot batch of AWS Online Tech Talks for you to check out. Kick up your feet, grab a glass of ice cold lemonade, and dive into our latest Tech Talks on Compute and DevOps.

August 2017 – Schedule

Noted below are the upcoming scheduled live, online technical sessions being held during the month of August. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.

Webinars featured this month are:

Thursday, August 17 – Compute

9:00 – 9:40 AM PDT: Deep Dive on [email protected].

Monday, August 28 – DevOps

10:30 – 11:10 AM PDT: Building a Python Serverless Applications with AWS Chalice.

12:00 – 12:40 PM PDT: How to Deploy .NET Code to AWS from Within Visual Studio.

The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations & customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.

– Sara (Hello everyone, I’m a co-op from Northeastern University joining the team until December.)

AWS Config Update – New Managed Rules to Secure S3 Buckets

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-config-update-new-managed-rules-to-secure-s3-buckets/

AWS Config captures the state of your AWS resources and the relationships between them. Among other features, it allows you to select a resource and then view a timeline of configuration changes that affect the resource (read Track AWS Resource Relationships With AWS Config to learn more).

AWS Config rules extends Config with a powerful rule system, with support for a “managed” collection of AWS rules as well as custom rules that you write yourself (my blog post, AWS Config Rules – Dynamic Compliance Checking for Cloud Resources, contains more info). The rules (AWS Lambda functions) represent the ideal (properly configured and compliant) state of your AWS resources. The appropriate functions are invoked when a configuration change is detected and check to ensure compliance.

You already have access to about three dozen managed rules. For example, here are some of the rules that check your EC2 instances and related resources:

Two New Rules
Today we are adding two new managed rules that will help you to secure your S3 buckets. You can enable these rules with a single click. The new rules are:

s3-bucket-public-write-prohibited – Automatically identifies buckets that allow global write access. There’s rarely a reason to create this configuration intentionally since it allows
unauthorized users to add malicious content to buckets and to delete (by overwriting) existing content. The rule checks all of the buckets in the account.

s3-bucket-public-read-prohibited – Automatically identifies buckets that allow global read access. This will flag content that is publicly available, including web sites and documentation. This rule also checks all buckets in the account.

Like the existing rules, the new rules can be run on a schedule or in response to changes detected by Config. You can see the compliance status of all of your rules at a glance:

Each evaluation runs in a matter of milliseconds; scanning an account with 100 buckets will take less than a minute. Behind the scenes, the rules are evaluated by a reasoning engine that uses some leading-edge constraint solving techniques that can, in many cases, address NP-complete problems in polynomial time (we did not resolve P versus NP; that would be far bigger news). This work is part of a larger effort within AWS, some of which is described in a AWS re:Invent presentation: Automated Formal Reasoning About AWS Systems:

Now Available
The new rules are available now and you can start using them today. Like the other rules, they are priced at $2 per rule per month.

Jeff;

New – AWS SAM Local (Beta) – Build and Test Serverless Applications Locally

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-aws-sam-local-beta-build-and-test-serverless-applications-locally/

Today we’re releasing a beta of a new tool, SAM Local, that makes it easy to build and test your serverless applications locally. In this post we’ll use SAM local to build, debug, and deploy a quick application that allows us to vote on tabs or spaces by curling an endpoint. AWS introduced Serverless Application Model (SAM) last year to make it easier for developers to deploy serverless applications. If you’re not already familiar with SAM my colleague Orr wrote a great post on how to use SAM that you can read in about 5 minutes. At it’s core, SAM is a powerful open source specification built on AWS CloudFormation that makes it easy to keep your serverless infrastructure as code – and they have the cutest mascot.

SAM Local takes all the good parts of SAM and brings them to your local machine.

There are a couple of ways to install SAM Local but the easiest is through NPM. A quick npm install -g aws-sam-local should get us going but if you want the latest version you can always install straight from the source: go get github.com/awslabs/aws-sam-local (this will create a binary named aws-sam-local, not sam).

I like to vote on things so let’s write a quick SAM application to vote on Spaces versus Tabs. We’ll use a very simple, but powerful, architecture of API Gateway fronting a Lambda function and we’ll store our results in DynamoDB. In the end a user should be able to curl our API curl https://SOMEURL/ -d '{"vote": "spaces"}' and get back the number of votes.

Let’s start by writing a simple SAM template.yaml:

AWSTemplateFormatVersion : '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  VotesTable:
    Type: "AWS::Serverless::SimpleTable"
  VoteSpacesTabs:
    Type: "AWS::Serverless::Function"
    Properties:
      Runtime: python3.6
      Handler: lambda_function.lambda_handler
      Policies: AmazonDynamoDBFullAccess
      Environment:
        Variables:
          TABLE_NAME: !Ref VotesTable
      Events:
        Vote:
          Type: Api
          Properties:
            Path: /
            Method: post

So we create a [dynamo_i] table that we expose to our Lambda function through an environment variable called TABLE_NAME.

To test that this template is valid I’ll go ahead and call sam validate to make sure I haven’t fat-fingered anything. It returns Valid! so let’s go ahead and get to work on our Lambda function.

import os
import os
import json
import boto3
votes_table = boto3.resource('dynamodb').Table(os.getenv('TABLE_NAME'))

def lambda_handler(event, context):
    print(event)
    if event['httpMethod'] == 'GET':
        resp = votes_table.scan()
        return {'body': json.dumps({item['id']: int(item['votes']) for item in resp['Items']})}
    elif event['httpMethod'] == 'POST':
        try:
            body = json.loads(event['body'])
        except:
            return {'statusCode': 400, 'body': 'malformed json input'}
        if 'vote' not in body:
            return {'statusCode': 400, 'body': 'missing vote in request body'}
        if body['vote'] not in ['spaces', 'tabs']:
            return {'statusCode': 400, 'body': 'vote value must be "spaces" or "tabs"'}

        resp = votes_table.update_item(
            Key={'id': body['vote']},
            UpdateExpression='ADD votes :incr',
            ExpressionAttributeValues={':incr': 1},
            ReturnValues='ALL_NEW'
        )
        return {'body': "{} now has {} votes".format(body['vote'], resp['Attributes']['votes'])}

So let’s test this locally. I’ll need to create a real DynamoDB database to talk to and I’ll need to provide the name of that database through the enviornment variable TABLE_NAME. I could do that with an env.json file or I can just pass it on the command line. First, I can call:
$ echo '{"httpMethod": "POST", "body": "{\"vote\": \"spaces\"}"}' |\
TABLE_NAME="vote-spaces-tabs" sam local invoke "VoteSpacesTabs"

to test the Lambda – it returns the number of votes for spaces so theoritically everything is working. Typing all of that out is a pain so I could generate a sample event with sam local generate-event api and pass that in to the local invocation. Far easier than all of that is just running our API locally. Let’s do that: sam local start-api. Now I can curl my local endpoints to test everything out.
I’ll run the command: $ curl -d '{"vote": "tabs"}' http://127.0.0.1:3000/ and it returns: “tabs now has 12 votes”. Now, of course I did not write this function perfectly on my first try. I edited and saved several times. One of the benefits of hot-reloading is that as I change the function I don’t have to do any additional work to test the new function. This makes iterative development vastly easier.

Let’s say we don’t want to deal with accessing a real DynamoDB database over the network though. What are our options? Well we can download DynamoDB Local and launch it with java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb. Then we can have our Lambda function use the AWS_SAM_LOCAL environment variable to make some decisions about how to behave. Let’s modify our function a bit:

import os
import json
import boto3
if os.getenv("AWS_SAM_LOCAL"):
    votes_table = boto3.resource(
        'dynamodb',
        endpoint_url="http://docker.for.mac.localhost:8000/"
    ).Table("spaces-tabs-votes")
else:
    votes_table = boto3.resource('dynamodb').Table(os.getenv('TABLE_NAME'))

Now we’re using a local endpoint to connect to our local database which makes working without wifi a little easier.

SAM local even supports interactive debugging! In Java and Node.js I can just pass the -d flag and a port to immediately enable the debugger. For Python I could use a library like import epdb; epdb.serve() and connect that way. Then we can call sam local invoke -d 8080 "VoteSpacesTabs" and our function will pause execution waiting for you to step through with the debugger.

Alright, I think we’ve got everything working so let’s deploy this!

First I’ll call the sam package command which is just an alias for aws cloudformation package and then I’ll use the result of that command to sam deploy.

$ sam package --template-file template.yaml --s3-bucket MYAWESOMEBUCKET --output-template-file package.yaml
Uploading to 144e47a4a08f8338faae894afe7563c3  90570 / 90570.0  (100.00%)
Successfully packaged artifacts and wrote output template to file package.yaml.
Execute the following command to deploy the packaged template
aws cloudformation deploy --template-file package.yaml --stack-name 
$ sam deploy --template-file package.yaml --stack-name VoteForSpaces --capabilities CAPABILITY_IAM
Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - VoteForSpaces

Which brings us to our API:
.

I’m going to hop over into the production stage and add some rate limiting in case you guys start voting a lot – but otherwise we’ve taken our local work and deployed it to the cloud without much effort at all. I always enjoy it when things work on the first deploy!

You can vote now and watch the results live! http://spaces-or-tabs.s3-website-us-east-1.amazonaws.com/

We hope that SAM Local makes it easier for you to test, debug, and deploy your serverless apps. We have a CONTRIBUTING.md guide and we welcome pull requests. Please tweet at us to let us know what cool things you build. You can see our What’s New post here and the documentation is live here.

Randall

Automating Blue/Green Deployments of Infrastructure and Application Code using AMIs, AWS Developer Tools, & Amazon EC2 Systems Manager

Post Syndicated from Ramesh Adabala original https://aws.amazon.com/blogs/devops/bluegreen-infrastructure-application-deployment-blog/

Previous DevOps blog posts have covered the following use cases for infrastructure and application deployment automation:

An AMI provides the information required to launch an instance, which is a virtual server in the cloud. You can use one AMI to launch as many instances as you need. It is security best practice to customize and harden your base AMI with required operating system updates and, if you are using AWS native services for continuous security monitoring and operations, you are strongly encouraged to bake into the base AMI agents such as those for Amazon EC2 Systems Manager (SSM), Amazon Inspector, CodeDeploy, and CloudWatch Logs. A customized and hardened AMI is often referred to as a “golden AMI.” The use of golden AMIs to create EC2 instances in your AWS environment allows for fast and stable application deployment and scaling, secure application stack upgrades, and versioning.

In this post, using the DevOps automation capabilities of Systems Manager, AWS developer tools (CodePipeLine, CodeDeploy, CodeCommit, CodeBuild), I will show you how to use AWS CodePipeline to orchestrate the end-to-end blue/green deployments of a golden AMI and application code. Systems Manager Automation is a powerful security feature for enterprises that want to mature their DevSecOps practices.

Here are the high-level phases and primary services covered in this use case:

 

You can access the source code for the sample used in this post here: https://github.com/awslabs/automating-governance-sample/tree/master/Bluegreen-AMI-Application-Deployment-blog.

This sample will create a pipeline in AWS CodePipeline with the building blocks to support the blue/green deployments of infrastructure and application. The sample includes a custom Lambda step in the pipeline to execute Systems Manager Automation to build a golden AMI and update the Auto Scaling group with the golden AMI ID for every rollout of new application code. This guarantees that every new application deployment is on a fully patched and customized AMI in a continuous integration and deployment model. This enables the automation of hardened AMI deployment with every new version of application deployment.

 

 

We will build and run this sample in three parts.

Part 1: Setting up the AWS developer tools and deploying a base web application

Part 1 of the AWS CloudFormation template creates the initial Java-based web application environment in a VPC. It also creates all the required components of Systems Manager Automation, CodeCommit, CodeBuild, and CodeDeploy to support the blue/green deployments of the infrastructure and application resulting from ongoing code releases.

Part 1 of the AWS CloudFormation stack creates these resources:

After Part 1 of the AWS CloudFormation stack creation is complete, go to the Outputs tab and click the Elastic Load Balancing link. You will see the following home page for the base web application:

Make sure you have all the outputs from the Part 1 stack handy. You need to supply them as parameters in Part 3 of the stack.

Part 2: Setting up your CodeCommit repository

In this part, you will commit and push your sample application code into the CodeCommit repository created in Part 1. To access the initial git commands to clone the empty repository to your local machine, click Connect to go to the AWS CodeCommit console. Make sure you have the IAM permissions required to access AWS CodeCommit from command line interface (CLI).

After you’ve cloned the repository locally, download the sample application files from the part2 folder of the Git repository and place the files directly into your local repository. Do not include the aws-codedeploy-sample-tomcat folder. Go to the local directory and type the following commands to commit and push the files to the CodeCommit repository:

git add .
git commit -a -m "add all files from the AWS Java Tomcat CodeDeploy application"
git push

After all the files are pushed successfully, the repository should look like this:

 

Part 3: Setting up CodePipeline to enable blue/green deployments     

Part 3 of the AWS CloudFormation template creates the pipeline in AWS CodePipeline and all the required components.

a) Source: The pipeline is triggered by any change to the CodeCommit repository.

b) BuildGoldenAMI: This Lambda step executes the Systems Manager Automation document to build the golden AMI. After the golden AMI is successfully created, a new launch configuration with the new AMI details will be updated into the Auto Scaling group of the application deployment group. You can watch the progress of the automation in the EC2 console from the Systems Manager –> Automations menu.

c) Build: This step uses the application build spec file to build the application build artifact. Here are the CodeBuild execution steps and their status:

d) Deploy: This step clones the Auto Scaling group, launches the new instances with the new AMI, deploys the application changes, reroutes the traffic from the elastic load balancer to the new instances and terminates the old Auto Scaling group. You can see the execution steps and their status in the CodeDeploy console.

After the CodePipeline execution is complete, you can access the application by clicking the Elastic Load Balancing link. You can find it in the output of Part 1 of the AWS CloudFormation template. Any consecutive commits to the application code in the CodeCommit repository trigger the pipelines and deploy the infrastructure and code with an updated AMI and code.

 

If you have feedback about this post, add it to the Comments section below. If you have questions about implementing the example used in this post, open a thread on the Developer Tools forum.


About the author

 

Ramesh Adabala is a Solutions Architect in Southeast Enterprise Solution Architecture team at Amazon Web Services.

New – Amazon Connect and Amazon Lex Integration

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/new-amazon-connect-and-amazon-lex-integration/

I’m really excited to share some recent enhancements to two of my favorite services: Amazon Connect and Amazon Lex. Amazon Connect is a self-service, cloud-based contact center service that makes it easy for any business to deliver better customer service at lower cost. Amazon Lex is a service for building conversational interfaces using voice and text. By integrating these two services you can take advantage of Lex‘s automatic speech recognition (ASR) and natural language processing/understading (NLU) capabilities to create great self-service experiences for your customers. To enable this integration the Lex team added support for 8kHz speech input – more on that later. Why should you care about this? Well, if the a bot can solve the majority of your customer’s requests your customers spend less time waiting on hold and more time using your products.

If you need some more background on Amazon Connect or Lex I strongly recommend Jeff’s previous posts[1][2] on these services – especially if you like LEGOs.


Let’s dive in and learn to use this new integration. We’ll take an application that we built on our Twitch channel and modify it for this blog. At the application’s core a user calls an Amazon Connect number which connects them to an Lex bot which invokes an AWS Lambda function based on an intent from Lex. So what does our little application do?

I want to finally settle the question of what the best code editor is: I like vim, it’s a spectacular editor that does one job exceptionally well – editing code (it’s the best). My colleague Jeff likes emacs, a great operating system editor… if you were born with extra joints in your fingers. My colleague Tara loves Visual Studio and sublime. Rather than fighting over what the best editor is I thought we might let you, dear reader, vote. Don’t worry you can even vote for butterflies.

Interested in voting? Call +1 614-569-4019 and tell us which editor you’re voting for! We don’t store your number or record your voice so feel free to vote more than once for vim. Want to see the votes live? http://best-editor-ever.s3-website-us-east-1.amazonaws.com/.

Now, how do we build this little contraption? We’ll cover each component but since we’ve talked about Lex and Lambda before we’ll focus mostly on the Amazon Connect component. I’m going to assume you already have a connect instance running.

Amazon Lex

Let’s start with the Lex side of things. We’ll create a bot named VoteEditor with two intents: VoteEditor with a single slot called editor and ConnectToAgent with no slots. We’ll populate our editor slot full of different code editor names (maybe we’ll leave out emacs).

AWS Lambda

Our Lambda function will also be fairly simple. First we’ll create a Amazon DynamoDB table to store our votes. Then we’ll make a helper method to respond to Lex (build_response) – it will just wrap our message in a Lex friendly response format. Now we just have to figure out our flow logic.


def lambda_handler(event, context):
    if 'ConnectToAgent' == event['currentIntent']['name']:
        return build_response("Ok, connecting you to an agent.")
    elif 'VoteEditor' == event['currentIntent']['name']:
        editor = event['currentIntent']['slots']['editor']
        resp = ddb.update_item(
            Key={"name": editor.lower()},
            UpdateExpression="SET votes = :incr + if_not_exists(votes, :default)",
            ExpressionAttributeValues={":incr": 1, ":default": 0},
            ReturnValues="ALL_NEW"
        )
        msg = "Awesome, now {} has {} votes!".format(
            resp['Attributes']['name'],
            resp['Attributes']['votes'])
        return build_response(msg)

Let’s make sure we understand the code. So, if we got a vote for an editor and it doesn’t exist yet then we add that editor with 1 vote. Otherwise we increase the number of votes on that editor by 1. If we get a request for an agent, we terminate the flow with a nice message. Easy. Now we just tell our Lex bot to use our Lambda function to fulfill our intents. We can test that everything is working over text in the Lex console before moving on.

Amazon Connect

Before we can use our Lex bot in a Contact Flow we have to make sure our Amazon Connect instance has access to it. We can do this by hopping over to the Amazon Connect service console, selecting our instance, and navigating to “Contact Flows”. There should be a section called Lex where you can add your bots!

Now that our Amazon Connect instance can invoke our Lex bot we can create a new Contact Flow that contains our Lex bot. We add the bot to our flow through the “Get customer input” widget from the “Interact” category.

Once we’re on the widget we have a “DTMF” tab for taking input from number keys on a phone or the “Amazon Lex” tab for taking voiceinput and passing it to the Lex service. We’ll use the Lex tab and put in some configuration.

Lots of options, but in short we add the bot we want to use (including the version of the bot), the intents we want to use from our bot, and a short prompt to introduce the bot (and mayb prompt the customer for input).

Our final contact flow looks like this:

A real world example might allow a customer to perform many transactions through a Lex bot. Then on an error or ConnectToAgent intent put the customer into a queue where they could talk to a real person. It could collect and store information about users and populate a rich interface for an agent to use so they could jump right into the conversation with all the context they need.

I want to especially highlight the advantage of 8kHz audio support in Lex. Lex originally only supported speech input that was sampled at a higher rate than the 8 kHz input from the phone. Modern digital communication appliations typically use audio signals sampled at a minimum of 16 kHz. This higher fidelity recroding makes it easier differentiate between sounds like “ess” (/s/) and “eff” (/f/) – or so the audio experts tell me. Phones, however, use a much lower quality recording. Humans, and their ears, are pretty good at using surrounding words to figure out what a voice is saying from a lower quality recording (just check the NASA apollo recordings for proof of this). Most digital phone systems are setup to use 8 kHz sampling by default – it’s a nice tradeoff in bandwidth and fidelity. That’s why your voice sometimes sounds different on the phone. On top of this fundmental sampling rate issue you also have to deal with the fact that a lot of phone call data is already lossy (can you hear me now?). There are thousands of different devices from hundreds of different manufacturers, and tons of different software implentations. So… how do you solve this recognition issue?

The Lex team decided that the best way to address this was to expand the set of models they were using for speech recognition to include an 8kHz model. Support for an 8 kHz telephony audio sampling rate provides increased speech recognition accuracy and fidelity for your contact center interactions. This was a great effort by the team that enables a lot of customers to do more with Amazon Connect.

One final note is that Amazon Connect uses the exact same PostContent endpoint that you can use as an external developer so you don’t have to be a Amazon Connect user to take advantage of this 8kHz feature in Lex.

I hope you guys enjoyed this post and as always the real details are in the docs and API Reference.

Randall

AWS CloudFormation Supports Amazon Kinesis Analytics Applications

Post Syndicated from Ryan Nienhuis original https://aws.amazon.com/blogs/big-data/aws-cloudformation-supports-amazon-kinesis-analytics-applications/

You can now provision and manage resources for Amazon Kinesis Analytics applications using AWS CloudFormation.  Kinesis Analytics is the easiest way to process streaming data in real time with standard SQL, without having to learn new programming languages or processing frameworks. Kinesis Analytics enables you to query streaming data or build entire streaming applications using SQL. Using the service, you gain actionable insights and can respond to your business and customer needs promptly.

Customers can create CloudFormation templates that easily create or update Kinesis Analytics applications. Typically, a template is used as a way to manage code across different environments, or to prototype a new streaming data solution quickly.

We have created two sample templates using past AWS Big Data Blog posts that referenced Kinesis Analytics.

For more information about the new feature, see the AWS Cloudformation User Guide.

 

AWS Hot Startups – July 2017

Post Syndicated from Tina Barr original https://aws.amazon.com/blogs/aws/aws-hot-startups-july-2017/

Welcome back to another month of Hot Startups! Every day, startups are creating innovative and exciting businesses, applications, and products around the world. Each month we feature a handful of startups doing cool things using AWS.

July is all about learning! These companies are focused on providing access to tools and resources to expand knowledge and skills in different ways.

This month’s startups:

  • CodeHS – provides fun and accessible computer science curriculum for middle and high schools.
  • Insight – offers intensive fellowships to grow technical talent in Data Science.
  • iTranslate – enables people to read, write, and speak in over 90 languages, anywhere in the world.

CodeHS (San Francisco, CA)

In 2012, Stanford students Zach Galant and Jeremy Keeshin were computer science majors and TAs for introductory classes when they noticed a trend among their peers. Many wished that they had been exposed to computer science earlier in life. In their senior year, Zach and Jeremy launched CodeHS to give middle and high schools the opportunity to provide a fun, accessible computer science education to students everywhere. CodeHS is a web-based curriculum pathway complete with teacher resources, lesson plans, and professional development opportunities. The curriculum is supplemented with time-saving teacher tools to help with lesson planning, grading and reviewing student code, and managing their classroom.

CodeHS aspires to empower all students to meaningfully impact the future, and believe that coding is becoming a new foundational skill, along with reading and writing, that allows students to further explore any interest or area of study. At the time CodeHS was founded in 2012, only 10% of high schools in America offered a computer science course. Zach and Jeremy set out to change that by providing a solution that made it easy for schools and districts to get started. With CodeHS, thousands of teachers have been trained and are teaching hundreds of thousands of students all over the world. To use CodeHS, all that’s needed is the internet and a web browser. Students can write and run their code online, and teachers can immediately see what the students are working on and how they are doing.

Amazon EC2, Amazon RDS, Amazon ElastiCache, Amazon CloudFront, and Amazon S3 make it possible for CodeHS to scale their site to meet the needs of schools all over the world. CodeHS also relies on AWS to compile and run student code in the browser, which is extremely important when teaching server-side languages like Java that powers the AP course. Since usage rises and falls based on school schedules, Amazon CloudWatch and ELBs are used to easily scale up when students are running code so they have a seamless experience.

Be sure to visit the CodeHS website, and to learn more about bringing computer science to your school, click here!

Insight (Palo Alto, CA)

Insight was founded in 2012 to create a new educational model, optimize hiring for data teams, and facilitate successful career transitions among data professionals. Over the last 5 years, Insight has kept ahead of market trends and launched a series of professional training fellowships including Data Science, Health Data Science, Data Engineering, and Artificial Intelligence. Finding individuals with the right skill set, background, and culture fit is a challenge for big companies and startups alike, and Insight is focused on developing top talent through intensive 7-week fellowships. To date, Insight has over 1,000 alumni at over 350 companies including Amazon, Google, Netflix, Twitter, and The New York Times.

The Data Engineering team at Insight is well-versed in the current ecosystem of open source tools and technologies and provides mentorship on the best practices in this space. The technical teams are continually working with external groups in a variety of data advisory and mentorship capacities, but the majority of Insight partners participate in professional sessions. Companies visit the Insight office to speak with fellows in an informal setting and provide details on the type of work they are doing and how their teams are growing. These sessions have proved invaluable as fellows experience a significantly better interview process and companies yield engaged and enthusiastic new team members.

An important aspect of Insight’s fellowships is the opportunity for hands-on work, focusing on everything from building big-data pipelines to contributing novel features to industry-standard open source efforts. Insight provides free AWS resources for all fellows to use, in addition to mentorships from the Data Engineering team. Fellows regularly utilize Amazon S3, Amazon EC2, Amazon Kinesis, Amazon EMR, AWS Lambda, Amazon Redshift, Amazon RDS, among other services. The experience with AWS gives fellows a solid skill set as they transition into the industry. Fellowships are currently being offered in Boston, New York, Seattle, and the Bay Area.

Check out the Insight blog for more information on trends in data infrastructure, artificial intelligence, and cutting-edge data products.

 

iTranslate (Austria)

When the App Store was introduced in 2008, the founders of iTranslate saw an opportunity to be part of something big. The group of four fully believed that the iPhone and apps were going to change the world, and together they brainstormed ideas for their own app. The combination of translation and mobile devices seemed a natural fit, and by 2009 iTranslate was born. iTranslate’s mission is to enable travelers, students, business professionals, employers, and medical staff to read, write, and speak in all languages, anywhere in the world. The app allows users to translate text, voice, websites and more into nearly 100 languages on various platforms. Today, iTranslate is the leading player for conversational translation and dictionary apps, with more than 60 million downloads and 6 million monthly active users.

iTranslate is breaking language barriers through disruptive technology and innovation, enabling people to translate in real time. The app has a variety of features designed to optimize productivity including offline translation, website and voice translation, and language auto detection. iTranslate also recently launched the world’s first ear translation device in collaboration with Bragi, a company focused on smart earphones. The Dash Pro allows people to communicate freely, while having a personal translator right in their ear.

iTranslate started using Amazon Polly soon after it was announced. CEO Alexander Marktl said, “As the leading translation and dictionary app, it is our mission at iTranslate to provide our users with the best possible tools to read, write, and speak in all languages across the globe. Amazon Polly provides us with the ability to efficiently produce and use high quality, natural sounding synthesized speech.” The stable and simple-to-use API, low latency, and free caching allow iTranslate to scale as they continue adding features to their app. Customers also enjoy the option to change speech rate and change between male and female voices. To assure quality, speed, and reliability of their products, iTranslate also uses Amazon EC2, Amazon S3, and Amazon Route 53.

To get started with iTranslate, visit their website here.

—–

Thanks for reading!

-Tina

Run Common Data Science Packages on Anaconda and Oozie with Amazon EMR

Post Syndicated from John Ohle original https://aws.amazon.com/blogs/big-data/run-common-data-science-packages-on-anaconda-and-oozie-with-amazon-emr/

In the world of data science, users must often sacrifice cluster set-up time to allow for complex usability scenarios. Amazon EMR allows data scientists to spin up complex cluster configurations easily, and to be up and running with complex queries in a matter of minutes.

Data scientists often use scheduling applications such as Oozie to run jobs overnight. However, Oozie can be difficult to configure when you are trying to use popular Python packages (such as “pandas,” “numpy,” and “statsmodels”), which are not included by default.

One such popular platform that contains these types of packages (and more) is Anaconda. This post focuses on setting up an Anaconda platform on EMR, with an intent to use its packages with Oozie. I describe how to run jobs using a popular open source scheduler like Oozie.

Walkthrough

For this post, you walk through the following tasks:

  • Create an EMR cluster.
  • Download Anaconda on your master node.
  • Configure Oozie.
  • Test the steps.

Create an EMR cluster

Spin up an Amazon EMR cluster using the console or the AWS CLI. Use the latest release, and include Apache Hadoop, Apache Spark, Apache Hive, and Oozie.

To create a three-node cluster in the us-east-1 region, issue an AWS CLI command such as the following. This command must be typed as one line, as shown below. It is shown here separated for readability purposes only.

aws emr create-cluster \ 
--release-label emr-5.7.0 \ 
 --name '<YOUR-CLUSTER-NAME>' \
 --applications Name=Hadoop Name=Oozie Name=Spark Name=Hive \ 
 --ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","SubnetId":"<YOUR-SUBNET-ID>","EmrManagedSlaveSecurityGroup":"<YOUR-EMR-SLAVE-SECURITY-GROUP>","EmrManagedMasterSecurityGroup":"<YOUR-EMR-MASTER-SECURITY-GROUP>"}' \ 
 --use-default-roles \ 
 --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Master - 1"},{"InstanceCount":<YOUR-CORE-INSTANCE-COUNT>,"InstanceGroupType":"CORE","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Core - 2"}]'

One-line version for reference:

aws emr create-cluster --release-label emr-5.7.0 --name '<YOUR-CLUSTER-NAME>' --applications Name=Hadoop Name=Oozie Name=Spark Name=Hive --ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","SubnetId":"<YOUR-SUBNET-ID>","EmrManagedSlaveSecurityGroup":"<YOUR-EMR-SLAVE-SECURITY-GROUP>","EmrManagedMasterSecurityGroup":"<YOUR-EMR-MASTER-SECURITY-GROUP>"}' --use-default-roles --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Master - 1"},{"InstanceCount":<YOUR-CORE-INSTANCE-COUNT>,"InstanceGroupType":"CORE","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Core - 2"}]'

Download Anaconda

SSH into your EMR master node instance and download the official Anaconda installer:

wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh

At the time of publication, Anaconda 4.4 is the most current version available. For the download link location for the latest Python 2.7 version (Python 3.6 may encounter issues), see https://www.continuum.io/downloads.  Open the context (right-click) menu for the Python 2.7 download link, choose Copy Link Location, and use this value in the previous wget command.

This post used the Anaconda 4.4 installation. If you have a later version, it is shown in the downloaded filename:  “anaconda2-<version number>-Linux-x86_64.sh”.

Run this downloaded script and follow the on-screen installer prompts.

chmod u+x Anaconda2-4.4.0-Linux-x86_64.sh
./Anaconda2-4.4.0-Linux-x86_64.sh

For an installation directory, select somewhere with enough space on your cluster, such as “/mnt/anaconda/”.

The process should take approximately 1–2 minutes to install. When prompted if you “wish the installer to prepend the Anaconda2 install location”, select the default option of [no].

After you are done, export the PATH to include this new Anaconda installation:

export PATH=/mnt/anaconda/bin:$PATH

Zip up the Anaconda installation:

cd /mnt/anaconda/
zip -r anaconda.zip .

The zip process may take 4–5 minutes to complete.

(Optional) Upload this anaconda.zip file to your S3 bucket for easier inclusion into future EMR clusters. This removes the need to repeat the previous steps for future EMR clusters.

Configure Oozie

Next, you configure Oozie to use Pyspark and the Anaconda platform.

Get the location of your Oozie sharelibupdate folder. Issue the following command and take note of the “sharelibDirNew” value:

oozie admin -sharelibupdate

For this post, this value is “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136”.

Pass in the required Pyspark files into Oozies sharelibupdate location. The following files are required for Oozie to be able to run Pyspark commands:

  • pyspark.zip
  • py4j-0.10.4-src.zip

These are located on the EMR master instance in the location “/usr/lib/spark/python/lib/”, and must be put into the Oozie sharelib spark directory. This location is the value of the sharelibDirNew parameter value (shown above) with “/spark/” appended, that is, “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/”.

To do this, issue the following commands:

hdfs dfs -put /usr/lib/spark/python/lib/py4j-0.10.4-src.zip hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/
hdfs dfs -put /usr/lib/spark/python/lib/pyspark.zip hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/

After you’re done, Oozie can use Pyspark in its processes.

Pass the anaconda.zip file into HDFS as follows:

hdfs dfs -put /mnt/anaconda/anaconda.zip /tmp/myLocation/anaconda.zip

(Optional) Verify that it was transferred successfully with the following command:

hdfs dfs -ls /tmp/myLocation/

On your master node, execute the following command:

export PYSPARK_PYTHON=/mnt/anaconda/bin/python

Set the PYSPARK_PYTHON environment variable on the executor nodes. Put the following configurations in your “spark-opts” values in your Oozie workflow.xml file:

–conf spark.executorEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
–conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python

This is referenced from the Oozie job in the following line in your workflow.xml file, also included as part of your “spark-opts”:

--archives hdfs:///tmp/myLocation/anaconda.zip#anaconda_remote

Your Oozie workflow.xml file should now look something like the following:

<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
<start to="start_spark" />
<action name="start_spark">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="/tmp/test/spark_oozie_test_out3"/>
        </prepare>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>SparkJob</name>
        <class>clear</class>
        <jar>hdfs:///user/oozie/apps/myPysparkProgram.py</jar>
        <spark-opts>--queue default
            --conf spark.ui.view.acls=*
            --executor-memory 2G --num-executors 2 --executor-cores 2 --driver-memory 3g
            --conf spark.executorEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
            --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
            --archives hdfs:///tmp/myLocation/anaconda.zip#anaconda_remote
        </spark-opts>
    </spark>
    <ok to="end"/>
    <error to="kill"/>
</action>
        <kill name="kill">
                <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
</workflow-app>

Test steps

To test this out, you can use the following job.properties and myPysparkProgram.py file, along with the following steps:

job.properties

masterNode ip-xxx-xxx-xxx-xxx.us-east-1.compute.internal
nameNode hdfs://${masterNode}:8020
jobTracker ${masterNode}:8032
master yarn
mode cluster
queueName default
oozie.libpath ${nameNode}/user/oozie/share/lib
oozie.use.system.libpath true
oozie.wf.application.path ${nameNode}/user/oozie/apps/

Note: You can get your master node IP address (denoted as “ip-xxx-xxx-xxx-xxx” here) from the value for the sharelibDirNew parameter noted earlier.

myPysparkProgram.py

from pyspark import SparkContext, SparkConf
import numpy
import sys

conf = SparkConf().setAppName('myPysparkProgram')
sc = SparkContext(conf=conf)

rdd = sc.textFile("/user/hadoop/input.txt")

x = numpy.sum([3,4,5]) #total = 12

rdd = rdd.map(lambda line: line + str(x))
rdd.saveAsTextFile("/user/hadoop/output")

Put the “myPysparkProgram.py” into the location mentioned between the “<jar>xxxxx</jar>” tags in your workflow.xml. In this example, the location is “hdfs:///user/oozie/apps/”. Use the following command to move the “myPysparkProgram.py” file to the correct location:

hdfs dfs -put myPysparkProgram.py /user/oozie/apps/

Put the above workflow.xml file into the “/user/oozie/apps/” location in hdfs:

hdfs dfs –put workflow.xml /user/oozie/apps/

Note: The job.properties file is run locally from the EMR master node.

Create a sample input.txt file with some data in it. For example:

input.txt

This is a sentence.
So is this. 
This is also a sentence.

Put this file into hdfs:

hdfs dfs -put input.txt /user/hadoop/

Execute the job in Oozie with the following command. This creates an Oozie job ID.

oozie job -config job.properties -run

You can check the Oozie job state with the command:

oozie job -info <Oozie job ID>

  1. When the job is successfully finished, the results are located at:
/user/hadoop/output/part-00000
/user/hadoop/output/part-00001

  1. Run the following commands to view the output:
hdfs dfs -cat /user/hadoop/output/part-00000
hdfs dfs -cat /user/hadoop/output/part-00001

The output will be:

This is a sentence. 12
So is this 12
This is also a sentence 12

Summary

The myPysparkProgram.py has successfully imported the numpy library from the Anaconda platform and has produced some output with it. If you tried to run this using standard Python, you’d encounter the following error:

Now when your Python job runs in Oozie, any imported packages that are implicitly imported by your Pyspark script are imported into your job within Oozie directly from the Anaconda platform. Simple!

If you have questions or suggestions, please leave a comment below.


Additional Reading

Learn how to use Apache Oozie workflows to automate Apache Spark jobs on Amazon EMR.

 


About the Author

John Ohle is an AWS BigData Cloud Support Engineer II for the BigData team in Dublin. He works to provide advice and solutions to our customers on their Big Data projects and workflows on AWS. In his spare time, he likes to play music, learn, develop tools and write documentation to further help others – both colleagues and customers alike.

 

 

 

New – GPU-Powered Streaming Instances for Amazon AppStream 2.0

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-gpu-powered-streaming-instances-for-amazon-appstream-2-0/

We launched Amazon AppStream 2.0 at re:Invent 2016. This application streaming service allows you to deliver Windows applications to a desktop browser.

AppStream 2.0 is fully managed and provides consistent, scalable performance by running applications on general purpose, compute optimized, and memory optimized streaming instances, with delivery via NICE DCV – a secure, high-fidelity streaming protocol. Our enterprise and public sector customers have started using AppStream 2.0 in place of legacy application streaming environments that are installed on-premises. They use AppStream 2.0 to deliver both commercial and line of business applications to a desktop browser. Our ISV customers are using AppStream 2.0 to move their applications to the cloud as-is, with no changes to their code. These customers focus on demos, workshops, and commercial SaaS subscriptions.

We are getting great feedback on AppStream 2.0 and have been adding new features very quickly (even by AWS standards). So far this year we have added an image builder, federated access via SAML 2.0, CloudWatch monitoring, Fleet Auto Scaling, Simple Network Setup, persistent storage for user files (backed by Amazon S3), support for VPC security groups, and built-in user management including web portals for users.

New GPU-Powered Streaming Instances
Many of our customers have told us that they want to use AppStream 2.0 to deliver specialized design, engineering, HPC, and media applications to their users. These applications are generally graphically intensive and are designed to run on expensive, high-end PCs in conjunction with a GPU (Graphics Processing Unit). Due to the hardware requirements of these applications, cost considerations have traditionally kept them out of situations where part-time or occasional access would otherwise make sense. Recently, another requirement has come to the forefront. These applications almost always need shared, read-write access to large amounts of sensitive data that is best stored, processed, and secured in the cloud. In order to meet the needs of these users and applications, we are launching two new types of streaming instances today:

Graphics Desktop – Based on the G2 instance type, Graphics Desktop instances are designed for desktop applications that use the CUDA, DirectX, or OpenGL for rendering. These instances are equipped with 15 GiB of memory and 8 vCPUs. You can select this instance family when you build an AppStream image or configure an AppStream fleet:

Graphics Pro – Based on the brand-new G3 instance type, Graphics Pro instances are designed for high-end, high-performance applications that can use the NVIDIA APIs and/or need access to large amounts of memory. These instances are available in three sizes, with 122 to 488 GiB of memory and 16 to 64 vCPUs. Again, you can select this instance family when you configure an AppStream fleet:

To learn more about how to launch, run, and scale a streaming application environment, read Scaling Your Desktop Application Streams with Amazon AppStream 2.0.

As I noted earlier, you can use either of these two instance types to build an AppStream image. This will allow you to test and fine tune your applications and to see the instances in action.

Streaming Instances in Action
We’ve been working with several customers during a private beta program for the new instance types. Here are a few stories (and some cool screen shots) to show you some of the applications that they are streaming via AppStream 2.0:

AVEVA is a world leading provider of engineering design and information management software solutions for the marine, power, plant, offshore and oil & gas industries. As part of their work on massive capital projects, their customers need to bring many groups of specialist engineers together to collaborate on the creation of digital assets. In order to support this requirement, AVEVA is building SaaS solutions that combine the streamed delivery of engineering applications with access to a scalable project data environment that is shared between engineers across the globe. The new instances will allow AVEVA to deliver their engineering design software in SaaS form while maximizing quality and performance. Here’s a screen shot of their Everything 3D app being streamed from AppStream:

Nissan, a Japanese multinational automobile manufacturer, trains its automotive specialists using 3D simulation software running on expensive graphics workstations. The training software, developed by The DiSti Corporation, allows its specialists to simulate maintenance processes by interacting with realistic 3D models of the vehicles they work on. AppStream 2.0’s new graphics capability now allows Nissan to deliver these training tools in real time, with up to date content, to a desktop browser running on low-cost commodity PCs. Their specialists can now interact with highly realistic renderings of a vehicle that allows them to train for and plan maintenance operations with higher efficiency.

Cornell University is an American private Ivy League and land-grant doctoral university located in Ithaca, New York. They deliver advanced 3D tools such as AutoDesk AutoCAD and Inventor to students and faculty to support their course work, teaching, and research. Until now, these tools could only be used on GPU-powered workstations in a lab or classroom. AppStream 2.0 allows them to deliver the applications to a web browser running on any desktop, where they run as if they were on a local workstation. Their users are no longer limited by available workstations in labs and classrooms, and can bring their own devices and have access to their course software. This increased flexibility also means that faculty members no longer need to take lab availability into account when they build course schedules. Here’s a copy of Autodesk Inventor Professional running on AppStream at Cornell:

Now Available
Both of the graphics streaming instance families are available in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo) Regions and you can start streaming from them today. Your applications must run in a Windows 2012 R2 environment, and can make use of DirectX, OpenGL, CUDA, OpenCL, and Vulkan.

With prices in the US East (Northern Virginia) Region starting at $0.50 per hour for Graphics Desktop instances and $2.05 per hour for Graphics Pro instances, you can now run your simulation, visualization, and HPC workloads in the AWS Cloud on an economical, pay-by-the-hour basis. You can also take advantage of fast, low-latency access to Amazon Elastic Compute Cloud (EC2), Amazon Simple Storage Service (S3), AWS Lambda, Amazon Redshift, and other AWS services to build processing workflows that handle pre- and post-processing of your data.

Jeff;