Tag Archives: filtering

New – Pay-per-Session Pricing for Amazon QuickSight, Another Region, and Lots More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-pay-per-session-pricing-for-amazon-quicksight-another-region-and-lots-more/

Amazon QuickSight is a fully managed cloud business intelligence system that gives you Fast & Easy to Use Business Analytics for Big Data. QuickSight makes business analytics available to organizations of all shapes and sizes, with the ability to access data that is stored in your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) relational databases, flat files in S3, and (via connectors) data stored in on-premises MySQL, PostgreSQL, and SQL Server databases. QuickSight scales to accommodate tens, hundreds, or thousands of users per organization.

Today we are launching a new, session-based pricing option for QuickSight, along with additional region support and other important new features. Let’s take a look at each one:

Pay-per-Session Pricing
Our customers are making great use of QuickSight and take full advantage of the power it gives them to connect to data sources, create reports, and explore visualizations.

However, not everyone in an organization needs or wants such powerful authoring capabilities. Having access to curated data in dashboards and being able to interact with the data by drilling down, filtering, or slicing-and-dicing is more than adequate for their needs. Subscribing them to a monthly or annual plan can be seen as an unwarranted expense, so a lot of such casual users end up not having access to interactive data or BI.

In order to allow customers to provide all of their users with interactive dashboards and reports, the Enterprise Edition of Amazon QuickSight now allows Reader access to dashboards on a Pay-per-Session basis. QuickSight users are now classified as Admins, Authors, or Readers, with distinct capabilities and prices:

Authors have access to the full power of QuickSight; they can establish database connections, upload new data, create ad hoc visualizations, and publish dashboards, all for $9 per month (Standard Edition) or $18 per month (Enterprise Edition).

Readers can view dashboards, slice and dice data using drill downs, filters and on-screen controls, and download data in CSV format, all within the secure QuickSight environment. Readers pay $0.30 for 30 minutes of access, with a monthly maximum of $5 per reader.

Admins have all authoring capabilities, and can manage users and purchase SPICE capacity in the account. The QuickSight admin now has the ability to set the desired option (Author or Reader) when they invite members of their organization to use QuickSight. They can extend Reader invites to their entire user base without incurring any up-front or monthly costs, paying only for the actual usage.
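To make the Reader pricing concrete, here is a minimal Python sketch of the monthly charge for a single Reader. The two prices come from the text above; the function name is made up for illustration, and the sketch assumes every started 30-minute session is billed as one full session.

```python
SESSION_PRICE = 0.30  # dollars per 30-minute Reader session (from the pricing above)
MONTHLY_CAP = 5.00    # maximum monthly charge per Reader (from the pricing above)


def reader_monthly_cost(sessions: int) -> float:
    """Return the monthly charge for one Reader, given a count of 30-minute sessions.

    Assumption: each session is billed at the flat session price until the
    monthly cap is reached.
    """
    if sessions < 0:
        raise ValueError("sessions must be non-negative")
    return round(min(sessions * SESSION_PRICE, MONTHLY_CAP), 2)


if __name__ == "__main__":
    for n in (0, 4, 16, 17, 40):
        print(f"{n:2d} sessions -> ${reader_monthly_cost(n):.2f}")
```

Under these assumptions the cap is reached at 17 sessions in a month, so a Reader never costs more than $5 no matter how often they open a dashboard.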

To learn more, visit the QuickSight Pricing page.

A New Region
QuickSight is now available in the Asia Pacific (Tokyo) Region:

The UI is in English, with a localized version in the works.

Hourly Data Refresh
Enterprise Edition SPICE data sets can now be set to refresh as frequently as every hour. In the past, each data set could be refreshed up to 5 times a day. To learn more, read Refreshing Imported Data.

Access to Data in Private VPCs
This feature was launched in preview form late last year, and is now available in production form to users of the Enterprise Edition. As I noted at the time, you can use it to implement secure, private communication with data sources that do not have public connectivity, including on-premises data in Teradata or SQL Server, accessed over an AWS Direct Connect link. To learn more, read Working with AWS VPC.

Parameters with On-Screen Controls
QuickSight dashboards can now include parameters that are set using on-screen dropdown, text box, numeric slider or date picker controls. The default value for each parameter can be set based on the user name (QuickSight calls this a dynamic default). You could, for example, set an appropriate default based on each user’s office location, department, or sales territory. Here’s an example:

To learn more, read about Parameters in QuickSight.

URL Actions for Linked Dashboards
You can now connect your QuickSight dashboards to external applications by defining URL actions on visuals. The actions can include parameters, and become available in the Details menu for the visual. URL actions are defined like this:

You can use this feature to link QuickSight dashboards to third party applications (e.g. Salesforce) or to your own internal applications. Read Custom URL Actions to learn how to use this feature.

Dashboard Sharing
You can now share QuickSight dashboards across every user in an account.

Larger SPICE Tables
The per-data set limit for SPICE tables has been raised from 10 GB to 25 GB.

Upgrade to Enterprise Edition
The QuickSight administrator can now upgrade an account from Standard Edition to Enterprise Edition with a click. This enables provisioning of Readers with pay-per-session pricing, private VPC access, row-level security for dashboards and data sets, and hourly refresh of data sets. Enterprise Edition pricing applies after the upgrade.

Available Now
Everything I listed above is available now and you can start using it today!

You can try QuickSight for 60 days at no charge, and you can also attend our June 20th Webinar.

Jeff;

 

Amazon Neptune Generally Available

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-neptune-generally-available/

Amazon Neptune is now Generally Available in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. At the core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latencies. Neptune supports two popular graph models, Property Graph and RDF, through Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune can be used to power everything from recommendation engines and knowledge graphs to drug discovery and network security. Neptune is fully-managed with automatic minor version upgrades, backups, encryption, and fail-over. I wrote about Neptune in detail for AWS re:Invent last year and customers have been using the preview and providing great feedback that the team has used to prepare the service for GA.

Now that Amazon Neptune is generally available there are a few changes from the preview:

Launching an Amazon Neptune Cluster

Launching a Neptune cluster is as easy as navigating to the AWS Management Console and clicking create cluster. Of course you can also launch with CloudFormation, the CLI, or the SDKs.

You can monitor your cluster health and the health of individual instances through Amazon CloudWatch and the console.

Additional Resources

We’ve created two repos with some additional tools and examples here. You can expect continuous development on these repos as we add additional tools and examples.

  • Amazon Neptune Tools Repo
    This repo has a useful tool for converting GraphML files into Neptune compatible CSVs for bulk loading from S3.
  • Amazon Neptune Samples Repo
    This repo has a really cool example of building a collaborative filtering recommendation engine for video game preferences.

Purpose Built Databases

There’s an industry trend where we’re moving more and more onto purpose-built databases. Developers and businesses want to access their data in the format that makes the most sense for their applications. As cloud resources make transforming large datasets easier with tools like AWS Glue, we have a lot more options than we used to for accessing our data. With tools like Amazon Redshift, Amazon Athena, Amazon Aurora, Amazon DynamoDB, and more we get to choose the best database for the job or even enable entirely new use-cases. Amazon Neptune is perfect for workloads where the data is highly connected across data rich edges.

I’m really excited about graph databases and I see a huge number of applications. Looking for ideas of cool things to build? I’d love to build a web crawler in AWS Lambda that uses Neptune as the backing store. You could further enrich it by running Amazon Comprehend or Amazon Rekognition on the text and images found and creating a search engine on top of Neptune.

As always, feel free to reach out in the comments or on Twitter to provide any feedback!

Randall

Monitoring your Amazon SNS message filtering activity with Amazon CloudWatch

Post Syndicated from Rachel Richardson original https://aws.amazon.com/blogs/compute/monitoring-your-amazon-sns-message-filtering-activity-with-amazon-cloudwatch/

This post is courtesy of Otavio Ferreira, Manager, Amazon SNS, AWS Messaging.

Amazon SNS message filtering provides a set of string and numeric matching operators that allow each subscription to receive only the messages of interest. Hence, SNS message filtering can simplify your pub/sub messaging architecture by offloading the message filtering logic from your subscriber systems, as well as the message routing logic from your publisher systems.

After you set the subscription attribute that defines a filter policy, the subscribing endpoint receives only the messages that carry attributes matching this filter policy. Other messages published to the topic are filtered out for this subscription. In this way, the native integration between SNS and Amazon CloudWatch provides visibility into the number of messages delivered, as well as the number of messages filtered out.

CloudWatch metrics are captured automatically for you. To get started with SNS message filtering, see Filtering Messages with Amazon SNS.

Message Filtering Metrics

The following six CloudWatch metrics are relevant to understanding your SNS message filtering activity:

  • NumberOfMessagesPublished – Inbound traffic to SNS. This metric tracks all the messages that have been published to the topic.
  • NumberOfNotificationsDelivered – Outbound traffic from SNS. This metric tracks all the messages that have been successfully delivered to endpoints subscribed to the topic. A delivery takes place either when the incoming message attributes match a subscription filter policy, or when the subscription has no filter policy at all, which results in a catch-all behavior.
  • NumberOfNotificationsFilteredOut – This metric tracks all the messages that were filtered out because they carried attributes that didn’t match the subscription filter policy.
  • NumberOfNotificationsFilteredOut-NoMessageAttributes – This metric tracks all the messages that were filtered out because they didn’t carry any attributes at all and, consequently, didn’t match the subscription filter policy.
  • NumberOfNotificationsFilteredOut-InvalidAttributes – This metric keeps track of messages that were filtered out because they carried invalid or malformed attributes and, thus, didn’t match the subscription filter policy.
  • NumberOfNotificationsFailed – This last metric tracks all the messages that failed to be delivered to subscribing endpoints, regardless of whether a filter policy had been set for the endpoint. This metric is emitted after the message delivery retry policy is exhausted, and SNS stops attempting to deliver the message. At that moment, the subscribing endpoint is likely no longer reachable. For example, the subscribing SQS queue or Lambda function has been deleted by its owner. You may want to closely monitor this metric to address message delivery issues quickly.

Message filtering graphs

Through the AWS Management Console, you can compose graphs to display your SNS message filtering activity. The graph shows the number of messages published, delivered, and filtered out within the timeframe you specify (1h, 3h, 12h, 1d, 3d, 1w, or custom).

SNS message filtering for CloudWatch Metrics

To compose an SNS message filtering graph with CloudWatch:

  1. Open the CloudWatch console.
  2. Choose Metrics, SNS, All Metrics, and Topic Metrics.
  3. Select all metrics to add to the graph, such as:
    • NumberOfMessagesPublished
    • NumberOfNotificationsDelivered
    • NumberOfNotificationsFilteredOut
  4. Choose Graphed metrics.
  5. In the Statistic column, switch from Average to Sum.
  6. Title your graph with a descriptive name, such as “SNS Message Filtering”
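The same three metrics can also be requested programmatically. The Python sketch below only builds the request parameters, using the Sum statistic as in step 5. It assumes the metrics live in the AWS/SNS namespace with a TopicName dimension; the topic name "my-topic" and the helper name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# The three metrics selected in step 3 above.
FILTERING_METRICS = [
    "NumberOfMessagesPublished",
    "NumberOfNotificationsDelivered",
    "NumberOfNotificationsFilteredOut",
]


def build_metric_requests(topic_name, hours=3, period_seconds=300, now=None):
    """Build one CloudWatch request dict per filtering metric.

    Assumptions: namespace "AWS/SNS" and dimension "TopicName".
    The Sum statistic mirrors step 5 of the console walkthrough.
    """
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    return [
        {
            "Namespace": "AWS/SNS",
            "MetricName": name,
            "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
            "StartTime": start,
            "EndTime": now,
            "Period": period_seconds,
            "Statistics": ["Sum"],
        }
        for name in FILTERING_METRICS
    ]


if __name__ == "__main__":
    for request in build_metric_requests("my-topic", hours=1):
        print(request["MetricName"], request["StartTime"], request["EndTime"])
```

With boto3 installed and credentials configured, each dict could be passed as keyword arguments to the CloudWatch client's get_metric_statistics call. That step is left out here so the sketch runs without network access.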

After you have your graph set up, you may want to copy the graph link for bookmarking, emailing, or sharing with co-workers. You may also want to add your graph to a CloudWatch dashboard for easy access in the future. Both actions are available to you on the Actions menu, which is found above the graph.

Summary

SNS message filtering defines how SNS topics behave in terms of message delivery. By using CloudWatch metrics, you gain visibility into the number of messages published, delivered, and filtered out. This enables you to validate the operation of filter policies and more easily troubleshoot during development phases.

SNS message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). CloudWatch metrics for SNS message filtering are available now, in all AWS Regions.

For information about pricing, see the CloudWatch pricing page.

For more information, see:

[$] Bpfilter (and user-mode blobs) for 4.18

Post Syndicated from corbet original https://lwn.net/Articles/755919/rss

In February, the bpfilter mechanism was
first posted to the mailing lists. Bpfilter is meant to be a replacement
for the current in-kernel firewall/packet-filtering code. It provides
little functionality itself; instead, it creates a set of hooks that can
run BPF programs to make the packet-filtering decisions. A version of that patch set has been merged
into the net-next tree for 4.18. It will not be replacing any existing
packet filters in its current form, but it does feature a significant
change to one of its more controversial features: the new user-mode helper
mechanism.

All Systems Go! 2018 CfP Open

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/all-systems-go-2018-cfp-open.html

The All Systems Go! 2018 Call for Participation is Now Open!

The Call for Participation (CFP) for All Systems Go! 2018 is now open.
We’d like to invite you to submit your proposals for consideration to
the CFP submission site.


The CFP will close on July 30th. Notification of acceptance and
non-acceptance will go out within 7 days of the closing of the CFP.

All topics relevant to foundational open-source Linux technologies are
welcome. In particular, however, we are looking for proposals
including, but not limited to, the following topics:

  • Low-level container executors and infrastructure
  • IoT and embedded OS infrastructure
  • BPF and eBPF filtering
  • OS, container, IoT image delivery and updating
  • Building Linux devices and applications
  • Low-level desktop technologies
  • Networking
  • System and service management
  • Tracing and performance measuring
  • IPC and RPC systems
  • Security and Sandboxing

While our focus is definitely more on the user-space side of things,
talks about kernel projects are welcome, as long as they have a clear
and direct relevance for user-space.

For more information please visit our conference website!

masscan, macOS, and firewall

Post Syndicated from Robert Graham original https://blog.erratasec.com/2018/05/masscan-macos-and-firewall.html

One of the more useful features of masscan is the “--banners” check, which connects to the TCP port, sends some request, and gets a basic response back. However, since masscan has its own TCP stack, it’ll interfere with the operating system’s TCP stack if they are sharing the same IPv4 address. The operating system will reply with a RST packet before the TCP connection can be established.

The way to fix this is to use the built-in packet-filtering firewall to block those packets in the operating-system TCP/IP stack. The masscan program still sees everything before the packet-filter, but the operating system can’t see anything after the packet-filter.

Note that we are talking about the “packet-filter” firewall feature here. Remember that macOS, like most operating systems these days, has two separate firewalls: an application firewall and a packet-filter firewall. The application firewall is the one you see in System Settings labeled “Firewall”, and it controls things based upon the application’s identity rather than by which ports it uses. This is normally “on” by default. The packet-filter is normally “off” by default and is of little use to normal users.

Also note that macOS changed packet-filters around version 10.10.5 (“Yosemite”, October 2014). The older one is known as “ipfw”, which was the default firewall for FreeBSD (much of macOS is based on FreeBSD). The replacement is known as PF, which comes from OpenBSD. Whereas you used to use the old “ipfw” command on the command line, you now use the “pfctl” command, as well as the “/etc/pf.conf” configuration file.

What we need to filter is the source port of the packets that masscan will send, so that when replies are received, they won’t reach the operating-system stack, and just go to masscan instead. To do this, we need to find a range of ports that won’t conflict with the operating system. Namely, when the operating system creates outgoing connections, it randomly chooses a source port within a certain range. We want masscan to use source ports in a different range.

To figure out the range macOS uses, we run the following command:

sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last

On my laptop, which is probably the default for macOS, I get the following range. Sniffing with Wireshark confirms this is the range used for source ports for outgoing connections.

net.inet.ip.portrange.first: 49152
net.inet.ip.portrange.last: 65535

So this means I shouldn’t use source ports anywhere in the range 49152 to 65535. On my laptop, I’ve decided to use for masscan the ports 40000 to 41023. The range masscan uses must be a power of 2, so here I’m using 1024 (two to the tenth power).
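A candidate range can be sanity-checked with a few lines of Python. This is only a sketch of the two constraints described above: the size is a power of two, and there is no overlap with the operating system's ephemeral range. The function names are made up, and the default ephemeral range is simply the sysctl output from my laptop.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is a positive power of two (1, 2, 4, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def check_source_range(first: int, last: int,
                       os_first: int = 49152, os_last: int = 65535) -> bool:
    """Check a masscan source-port range against the two constraints.

    The range [first, last] is inclusive. It passes when its size is a power
    of two and it does not overlap the OS ephemeral range [os_first, os_last].
    """
    size = last - first + 1
    overlaps = first <= os_last and last >= os_first
    return is_power_of_two(size) and not overlaps


if __name__ == "__main__":
    print(check_source_range(40000, 41023))  # 1024 ports, outside 49152-65535
    print(check_source_range(49152, 50175))  # 1024 ports, but inside the OS range
```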

To configure masscan, I can either type the parameter “--source-port 40000-41023” every time I run the program, or I can add the following line to /etc/masscan/masscan.conf. Remember that by default, masscan will look in that configuration file for any configuration parameters, so you don’t have to keep retyping them on the command line.

source-port = 40000-41023

Next, I need to add the following firewall rule to the bottom of /etc/pf.conf:

# ":" is an inclusive range; "><" excludes its endpoints and would leave port 40000 unblocked
block in proto tcp from any to any port 40000:41023

However, we aren’t done yet. By default, the packet-filter firewall is off on some versions of macOS. Therefore, every time you reboot your computer, you need to enable it. The simple way to do this is to run the following on the command line:

pfctl -e

Or, if that doesn’t work, try:

pfctl -E

If the firewall is already running, then you’ll need to load the file explicitly (or reboot):

pfctl -f /etc/pf.conf

You can check to see if the rule is active:

pfctl -s rules

Williams: Introducing Git protocol version 2

Post Syndicated from corbet original https://lwn.net/Articles/754872/rss

Brandon Williams writes
about the new Git remote protocol
that will debut in the 2.18 release.
We recently rolled out support for protocol version 2 at Google and
have seen a performance improvement of 3x for no-op fetches of a single
branch on repositories containing 500k references. Protocol v2 has also
enabled a reduction of 8x of the overhead bytes (non-packfile) sent from
googlesource.com servers. A majority of this improvement is due to
filtering references advertised by the server to the refs the client has
expressed interest in.

GetAltName – Discover Sub-Domains From SSL Certificates

Post Syndicated from Darknet original https://www.darknet.org.uk/2018/03/getaltname-discover-sub-domains-from-ssl-certificates/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

GetAltName – Discover Sub-Domains From SSL Certificates

GetAltName is a little script for discovering sub-domains. It extracts Subject Alt Names from SSL certificates directly from HTTPS websites, which can provide you with DNS names or virtual servers.

It’s useful in the discovery phase of a pen-testing assessment, where it can provide you with more information about your target and scope.

Features of GetAltName to Discover Sub-Domains

  • Strips wildcards and www’s
  • Returns a unique list (no duplicates)
  • Works on verified and self-signed certs
  • Domain matching system
  • Filtering for main domains and TLDs
  • Gets additional sub-domains from crt.sh
  • Outputs to clipboard
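To illustrate the general idea, and not GetAltName's own code, here is a minimal Python sketch that pulls DNS names out of a certificate's Subject Alt Name field, strips wildcards and a leading www, and returns a unique list. It uses only the standard library. Both function names are hypothetical, and the sketch handles only certificates that pass verification, because getpeercert() returns an empty dict otherwise.

```python
import socket
import ssl


def subdomains_from_cert(cert: dict) -> list:
    """Extract unique DNS names from a certificate dict.

    `cert` is expected in the form returned by ssl.SSLSocket.getpeercert(),
    where "subjectAltName" is a sequence of (type, value) pairs.
    """
    names = set()
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue  # skip IP addresses and other entry types
        name = value.lower()
        if name.startswith("*."):
            name = name[2:]  # strip wildcard
        if name.startswith("www."):
            name = name[4:]  # strip leading www
        names.add(name)
    return sorted(names)


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect over TLS and return the peer certificate as a dict (needs network)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


if __name__ == "__main__":
    sample = {
        "subjectAltName": (
            ("DNS", "*.example.com"),
            ("DNS", "www.example.com"),
            ("DNS", "api.example.com"),
            ("IP Address", "192.0.2.1"),
        )
    }
    print(subdomains_from_cert(sample))
```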

GetAltName Subdomain Extraction Tool Usage

You can output to a text file and also copy the output to your clipboard as a List or a Single line string, which is useful if you’re trying to make a quick scan with Nmap or other tools.


Message Filtering Operators for Numeric Matching, Prefix Matching, and Blacklisting in Amazon SNS

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/message-filtering-operators-for-numeric-matching-prefix-matching-and-blacklisting-in-amazon-sns/

This blog was contributed by Otavio Ferreira, Software Development Manager for Amazon SNS

Message filtering simplifies the overall pub/sub messaging architecture by offloading message filtering logic from subscribers, as well as message routing logic from publishers. The initial launch of message filtering provided a basic operator that was based on exact string comparison. For more information, see Simplify Your Pub/Sub Messaging with Amazon SNS Message Filtering.

Today, AWS is announcing an additional set of filtering operators that bring even more power and flexibility to your pub/sub messaging use cases.

Message filtering operators

Amazon SNS now supports both numeric and string matching. Specifically, string matching operators allow for exact, prefix, and “anything-but” comparisons, while numeric matching operators allow for exact and range comparisons, as outlined below. Numeric matching operators work for values between -10e9 and +10e9 inclusive, with five digits of accuracy right of the decimal point.

  • Exact matching on string values (Whitelisting): Subscription filter policy {"sport": ["rugby"]} matches message attribute {"sport": "rugby"} only.
  • Anything-but matching on string values (Blacklisting): Subscription filter policy {"sport": [{"anything-but": "rugby"}]} matches message attributes such as {"sport": "baseball"} and {"sport": "basketball"} and {"sport": "football"} but not {"sport": "rugby"}
  • Prefix matching on string values: Subscription filter policy {"sport": [{"prefix": "bas"}]} matches message attributes such as {"sport": "baseball"} and {"sport": "basketball"}
  • Exact matching on numeric values: Subscription filter policy {"balance": [{"numeric": ["=", 301.5]}]} matches message attributes {"balance": 301.500} and {"balance": 3.015e2}
  • Range matching on numeric values: Subscription filter policy {"balance": [{"numeric": ["<", 0]}]} matches negative numbers only, and {"balance": [{"numeric": [">", 0, "<=", 150]}]} matches any positive number up to 150.

As usual, you may apply the “AND” logic by appending multiple keys in the subscription filter policy, and the “OR” logic by appending multiple values for the same key, as follows:

  • AND logic: Subscription filter policy {"sport": ["rugby"], "language": ["English"]} matches only messages that carry both attributes {"sport": "rugby"} and {"language": "English"}
  • OR logic: Subscription filter policy {"sport": ["rugby", "football"]} matches messages that carry either the attribute {"sport": "rugby"} or {"sport": "football"}
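The matching rules above can be modeled locally with a short Python sketch. This is a simplified illustration for reasoning about policies, not how SNS implements filtering: message attributes are represented as a plain dict of Python values, and the function names are made up.

```python
import operator

# Comparison operators allowed inside a {"numeric": [...]} condition.
_COMPARE = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def _matches_value(condition, value):
    """Test one attribute value against one condition from a filter policy."""
    if not isinstance(condition, dict):
        return condition == value  # exact match
    if "prefix" in condition:
        return isinstance(value, str) and value.startswith(condition["prefix"])
    if "anything-but" in condition:
        return value != condition["anything-but"]
    if "numeric" in condition:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        ops = condition["numeric"]  # e.g. [">", 0, "<=", 150]
        pairs = zip(ops[0::2], ops[1::2])
        return all(_COMPARE[op](value, operand) for op, operand in pairs)
    return False


def matches(policy, attributes):
    """AND across policy keys, OR across the values listed for each key."""
    return all(
        key in attributes
        and any(_matches_value(c, attributes[key]) for c in conditions)
        for key, conditions in policy.items()
    )


if __name__ == "__main__":
    print(matches({"sport": [{"prefix": "bas"}]}, {"sport": "baseball"}))
    print(matches({"sport": [{"anything-but": "rugby"}]}, {"sport": "rugby"}))
    print(matches({"balance": [{"numeric": [">", 0, "<=", 150]}]}, {"balance": 150}))
```

In this model an empty policy matches every message, which corresponds to the catch-all behavior of a subscription with no filter policy.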

Message filtering operators in action

Here’s how this new set of filtering operators works. The following example is based on a pharmaceutical company that develops, produces, and markets a variety of prescription drugs, with research labs located in Asia Pacific and Europe. The company built an internal procurement system to manage the purchasing of lab supplies (for example, chemicals and utensils), office supplies (for example, paper, folders, and markers) and tech supplies (for example, laptops, monitors, and printers) from global suppliers.

This distributed system is composed of the four following subsystems:

  • A requisition system that presents the catalog of products from suppliers, and takes orders from buyers
  • An approval system for orders targeted to Asia Pacific labs
  • Another approval system for orders targeted to European labs
  • A fulfillment system that integrates with shipping partners

As shown in the following diagram, the company leverages AWS messaging services to integrate these distributed systems.

  • Firstly, an SNS topic named “Orders” was created to take all orders placed by buyers on the requisition system.
  • Secondly, two Amazon SQS queues, named “Lab-Orders-AP” and “Lab-Orders-EU” (for Asia Pacific and Europe respectively), were created to backlog orders that are up for review on the approval systems.
  • Lastly, an SQS queue named “Common-Orders” was created to backlog orders that aren’t related to lab supplies, which can already be picked up by shipping partners on the fulfillment system.

The company also uses AWS Lambda functions to automatically process lab supply orders that don’t require approval or which are invalid.

In this example, because different types of orders have been published to the SNS topic, the subscribing endpoints have had to set advanced filter policies on their SNS subscriptions, to have SNS automatically filter out orders they can’t deal with.

As depicted in the above diagram, the following five filter policies have been created:

  • The SNS subscription that points to the SQS queue “Lab-Orders-AP” sets a filter policy that matches lab supply orders, with a total value greater than $1,000, and that target Asia Pacific labs only. These more expensive transactions require an approver to review orders placed by buyers.
  • The SNS subscription that points to the SQS queue “Lab-Orders-EU” sets a filter policy that matches lab supply orders, also with a total value greater than $1,000, but that target European labs instead.
  • The SNS subscription that points to the Lambda function “Lab-Preapproved” sets a filter policy that only matches lab supply orders that aren’t as expensive, up to $1,000, regardless of their target lab location. These orders simply don’t require approval and can be automatically processed.
  • The SNS subscription that points to the Lambda function “Lab-Cancelled” sets a filter policy that only matches lab supply orders with total value of $0 (zero), regardless of their target lab location. These orders carry no actual items, obviously need neither approval nor fulfillment, and as such can be automatically canceled.
  • The SNS subscription that points to the SQS queue “Common-Orders” sets a filter policy that blacklists lab supply orders. Hence, this policy matches only office and tech supply orders, which have a more streamlined fulfillment process, and require no approval, regardless of price or target location.
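The five policies described above could be written roughly as follows. This is a sketch under assumptions: the attribute names order_type, total, and location and the value "lab" are hypothetical, since the post describes the policies only in prose. The two location prefixes are the ones quoted in the message walkthrough below. The Lab-Preapproved range is read as greater than $0 and up to $1,000 so that zero-value orders go only to Lab-Cancelled.

```python
import json

# Hypothetical attribute names: order_type, total, location.
FILTER_POLICIES = {
    "Lab-Orders-AP": {
        "order_type": ["lab"],
        "total": [{"numeric": [">", 1000]}],
        "location": [{"prefix": "Asia-Pacific-"}],
    },
    "Lab-Orders-EU": {
        "order_type": ["lab"],
        "total": [{"numeric": [">", 1000]}],
        "location": [{"prefix": "Europe-"}],
    },
    "Lab-Preapproved": {
        "order_type": ["lab"],
        "total": [{"numeric": [">", 0, "<=", 1000]}],
    },
    "Lab-Cancelled": {
        "order_type": ["lab"],
        "total": [{"numeric": ["=", 0]}],
    },
    "Common-Orders": {
        "order_type": [{"anything-but": "lab"}],
    },
}


def as_subscription_attribute(policy: dict) -> str:
    """Serialize a policy to the JSON string set on a subscription."""
    return json.dumps(policy)


if __name__ == "__main__":
    for endpoint, policy in FILTER_POLICIES.items():
        print(endpoint, as_subscription_attribute(policy))
```

Each JSON string is the value that would be set as the subscription's FilterPolicy attribute, for example through the SDK's SetSubscriptionAttributes call.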

After the company finished building this advanced pub/sub architecture, they were then able to launch their internal procurement system and allow buyers to begin placing orders. The diagram above shows six example orders published to the SNS topic. Each order contains message attributes that describe the order, and cause them to be filtered in a different manner, as follows:

  • Message #1 is a lab supply order, with a total value of $15,700 and targeting a research lab in Singapore. Because the value is greater than $1,000, and the location “Asia-Pacific-Southeast” matches the prefix “Asia-Pacific-“, this message matches the first SNS subscription and is delivered to SQS queue “Lab-Orders-AP”.
  • Message #2 is a lab supply order, with a total value of $1,833 and targeting a research lab in Ireland. Because the value is greater than $1,000, and the location “Europe-West” matches the prefix “Europe-“, this message matches the second SNS subscription and is delivered to SQS queue “Lab-Orders-EU”.
  • Message #3 is a lab supply order, with a total value of $415. Because the value is greater than $0 and less than $1,000, this message matches the third SNS subscription and is delivered to Lambda function “Lab-Preapproved”.
  • Message #4 is a lab supply order, but with a total value of $0. Therefore, it only matches the fourth SNS subscription, and is delivered to Lambda function “Lab-Cancelled”.
  • Messages #5 and #6 aren’t lab supply orders actually; one is an office supply order, and the other is a tech supply order. Therefore, they only match the fifth SNS subscription, and are both delivered to SQS queue “Common-Orders”.

Although each message only matched a single subscription, each was tested against the filter policy of every subscription in the topic. Hence, depending on which attributes are set on the incoming message, the message might actually match multiple subscriptions, and multiple deliveries will take place. Also, it is important to bear in mind that subscriptions with no filter policies catch every single message published to the topic, as a blank filter policy equates to a catch-all behavior.

Summary

Amazon SNS allows for both string and numeric filtering operators. As explained in this post, string operators allow for exact, prefix, and “anything-but” comparisons, while numeric operators allow for exact and range comparisons. These advanced filtering operators bring even more power and flexibility to your pub/sub messaging functionality and also allow you to simplify your architecture further by removing even more logic from your subscribers.

Message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). SNS filtering operators for numeric matching, prefix matching, and blacklisting are available now in all AWS Regions, for no extra charge.

To experiment with these new filtering operators yourself, and continue learning, try the 10-minute Tutorial Filter Messages Published to Topics. For more information, see Filtering Messages with Amazon SNS in the SNS documentation.

Amazon Redshift – 2017 Recap

Post Syndicated from Larry Heathcote original https://aws.amazon.com/blogs/big-data/amazon-redshift-2017-recap/

We have been busy adding new features and capabilities to Amazon Redshift, and we wanted to give you a glimpse of what we’ve been doing over the past year. In this article, we recap a few of our enhancements and provide a set of resources that you can use to learn more and get the most out of your Amazon Redshift implementation.

In 2017, we made more than 30 announcements about Amazon Redshift. We listened to you, our customers, and delivered Redshift Spectrum, a feature of Amazon Redshift, that gives you the ability to extend analytics to your data lake—without moving data. We launched new DC2 nodes, doubling performance at the same price. We also announced many new features that provide greater scalability, better performance, more automation, and easier ways to manage your analytics workloads.

To see a full list of our launches, visit our what’s new page—and be sure to subscribe to our RSS feed.

Major launches in 2017

Amazon Redshift Spectrum—extend analytics to your data lake, without moving data

We launched Amazon Redshift Spectrum to give you the freedom to store data in Amazon S3, in open file formats, and have it available for analytics without the need to load it into your Amazon Redshift cluster. It enables you to easily join datasets across Redshift clusters and S3 to provide unique insights that you would not be able to obtain by querying independent data silos.

With Redshift Spectrum, you can run SQL queries against data in an Amazon S3 data lake as easily as you analyze data stored in Amazon Redshift. And you can do it without loading data or resizing the Amazon Redshift cluster based on growing data volumes. Redshift Spectrum separates compute and storage to meet workload demands for data size, concurrency, and performance. Redshift Spectrum scales processing across thousands of nodes, so results are fast, even with massive datasets and complex queries. You can query open file formats that you already use—such as Apache Avro, CSV, Grok, ORC, Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV—directly in Amazon S3, without any data movement.

“For complex queries, Redshift Spectrum provided a 67 percent performance gain,” said Rafi Ton, CEO, NUVIAD. “Using the Parquet data format, Redshift Spectrum delivered an 80 percent performance improvement. For us, this was substantial.”

To learn more about Redshift Spectrum, watch our AWS Summit session Intro to Amazon Redshift Spectrum: Now Query Exabytes of Data in S3, and read our announcement blog post Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data.

DC2 nodes—twice the performance of DC1 at the same price

We launched second-generation Dense Compute (DC2) nodes to provide low latency and high throughput for demanding data warehousing workloads. DC2 nodes feature powerful Intel E5-2686 v4 (Broadwell) CPUs, fast DDR4 memory, and NVMe-based solid state disks (SSDs). We’ve tuned Amazon Redshift to take advantage of the better CPU, network, and disk on DC2 nodes, providing up to twice the performance of DC1 at the same price. Our DC2.8xlarge instances now provide twice the memory per slice of data and an optimized storage layout with 30 percent better storage utilization.

“Redshift allows us to quickly spin up clusters and provide our data scientists with a fast and easy method to access data and generate insights,” said Bradley Todd, technology architect at Liberty Mutual. “We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1.”

Read our customer testimonials to see the performance gains our customers are experiencing with DC2 nodes. To learn more, read our blog post Amazon Redshift Dense Compute (DC2) Nodes Deliver Twice the Performance as DC1 at the Same Price.

Performance enhancements—3x-5x faster queries

On average, our customers are seeing 3x to 5x performance gains for most of their critical workloads.

We introduced short query acceleration to speed up execution of queries such as reports, dashboards, and interactive analysis. Short query acceleration uses machine learning to predict the execution time of a query, and to move short running queries to an express short query queue for faster processing.

We launched results caching to deliver sub-second response times for queries that are repeated, such as dashboards, visualizations, and those from BI tools. Results caching has an added benefit of freeing up resources to improve the performance of all other queries.

We also introduced late materialization to reduce the amount of data scanned for queries with predicate filters by batching and factoring in the filtering of predicates before fetching data blocks in the next column. For example, if only 10 percent of the table rows satisfy the predicate filters, Amazon Redshift can potentially save 90 percent of the I/O for the remaining columns to improve query performance.
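
The arithmetic in that example can be illustrated with a small sketch in plain Scala. It is a toy model of a two-column table, not a description of how Amazon Redshift is implemented, and the names `LateMaterializationSketch`, `scan`, and `savedFraction` are hypothetical.

```scala
// Toy illustration of late materialization on a column-oriented table.
// All names are hypothetical; this is not Amazon Redshift's implementation.
object LateMaterializationSketch {
  // Evaluate the predicate on one column first, then read the other column
  // only at the row positions that passed the filter.
  def scan(predicateColumn: Vector[Int],
           otherColumn: Vector[String],
           predicate: Int => Boolean): Vector[String] = {
    val matchingRows = predicateColumn.indices.filter(i => predicate(predicateColumn(i)))
    matchingRows.map(i => otherColumn(i)).toVector
  }

  // Fraction of reads avoided on the other column.
  def savedFraction(totalRows: Int, fetchedRows: Int): Double =
    1.0 - fetchedRows.toDouble / totalRows
}
```

With 100 rows and a predicate that keeps 10 of them, `scan` reads only 10 values from the second column, so `savedFraction(100, 10)` comes out at about 0.9.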

We launched query monitoring rules and pre-defined rule templates. These features make it easier for you to set metrics-based performance boundaries for workload management (WLM) queries, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.
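
As a rough sketch of the idea, the two example rules can be modeled in plain Scala as predicates over query metrics paired with an action. This is not the WLM configuration syntax. The names `QueryMetrics`, `Rule`, and `evaluate` are hypothetical and chosen only for this illustration.

```scala
// Hypothetical model of metrics-based rules; not the actual WLM configuration format.
object QueryMonitoringSketch {
  sealed trait Action
  case object Abort extends Action
  case object Log extends Action

  final case class QueryMetrics(executionSeconds: Long, nestedLoopJoins: Int)
  final case class Rule(name: String, violated: QueryMetrics => Boolean, action: Action)

  // Abort queries in the short-query queue that run for more than 60 seconds.
  val shortQueueRule = Rule("abort_long_running", _.executionSeconds > 60, Abort)
  // Log queries that contain nested loops.
  val nestedLoopRule = Rule("log_nested_loops", _.nestedLoopJoins > 0, Log)

  // Return the action of every rule whose boundary the query crossed.
  def evaluate(rules: Seq[Rule], metrics: QueryMetrics): Seq[Action] =
    rules.filter(_.violated(metrics)).map(_.action)
}
```

A query that runs for 75 seconds without nested loops triggers only `Abort`; a 5-second query with two nested loop joins triggers only `Log`.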

Customer insights

Amazon Redshift and Redshift Spectrum serve customers across a variety of industries and sizes, from startups to large enterprises. Visit our customer page to see the success that customers are having with our recent enhancements. Learn how companies like Liberty Mutual Insurance saw a 9x reduction in month-end reporting time using DC2 nodes. On this page, you can find case studies, videos, and other content that show how our customers are using Amazon Redshift to drive innovation and business results.

In addition, check out these resources to learn about the success our customers are having building out a data warehouse and data lake integration solution with Amazon Redshift:

Partner solutions

You can enhance your Amazon Redshift data warehouse by working with industry-leading experts. Our AWS Partner Network (APN) Partners have certified their solutions to work with Amazon Redshift. They offer software, tools, integration, and consulting services to help you at every step. Visit our Amazon Redshift Partner page and choose an APN Partner. Or, use AWS Marketplace to find and immediately start using third-party software.

To see what our Partners are saying about Amazon Redshift Spectrum and our DC2 nodes mentioned earlier, read these blog posts:

Resources

Blog posts

Visit the AWS Big Data Blog for a list of all Amazon Redshift articles.

YouTube videos

GitHub

Our community of experts contribute on GitHub to provide tips and hints that can help you get the most out of your deployment. Visit GitHub frequently to get the latest technical guidance, code samples, administrative task automation utilities, the analyze & vacuum schema utility, and more.

Customer support

If you are evaluating or considering a proof of concept with Amazon Redshift, or you need assistance migrating your on-premises or other cloud-based data warehouse to Amazon Redshift, our team of product experts and solutions architects can help you with architecting, sizing, and optimizing your data warehouse. Contact us using this support request form, and let us know how we can assist you.

If you are an Amazon Redshift customer, we offer a no-cost health check program. Our team of database engineers and solutions architects give you recommendations for optimizing Amazon Redshift and Amazon Redshift Spectrum for your specific workloads. To learn more, email us at [email protected].

If you have any questions, email us at [email protected].

Additional Reading

If you found this post useful, be sure to check out Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data, Using Amazon Redshift for Fast Analytical Reports and How to Migrate Your Oracle Data Warehouse to Amazon Redshift Using AWS SCT and AWS DMS.


About the Author

Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out and the taste of classic barbeque.

[$] BPF comes to firewalls

Post Syndicated from corbet original https://lwn.net/Articles/747551/rss

The Linux kernel currently supports two separate network packet-filtering
mechanisms: iptables and nftables. For the last few years, it has been
generally assumed that nftables would eventually replace the older iptables
implementation; few people expected that the kernel developers would,
instead, add a third packet filter. But that would appear to be what is
happening with the newly announced bpfilter
mechanism. Bpfilter may eventually replace both iptables and nftables, but
there are a lot of questions that will need to be answered first.

AWS Glue Now Supports Scala Scripts

Post Syndicated from Mehul Shah original https://aws.amazon.com/blogs/big-data/aws-glue-now-supports-scala-scripts/

We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.

Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.

To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.

This post shows how to build an example Scala script that identifies highly negative issues in the timeline. It pulls out issue events in the timeline sample, analyzes their titles using the sentiment prediction functions from the Stanford CoreNLP libraries, and surfaces the most negative issues.

Getting started

Before we start writing scripts, we use AWS Glue crawlers to get a sense of the data—its structure and characteristics. We also set up a development endpoint and attach an Apache Zeppelin notebook, so we can interactively explore the data and author the script.

Crawl the data

The dataset used in this example was downloaded from the GitHub archive website into our sample dataset bucket in Amazon S3, and copied to the following locations:

s3://aws-glue-datasets-<region>/examples/scala-blog/githubarchive/data/

Choose the best folder by replacing <region> with the region that you’re working in, for example, us-east-1. Crawl this folder, and put the results into a database named githubarchive in the AWS Glue Data Catalog, as described in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, and is organized hierarchically (that is, partitioned) by year, month, and day.

When finished, use the AWS Glue console to navigate to the table named data in the githubarchive database. Notice that this data has eight top-level columns, which are common to each event type, and three partition columns that correspond to year, month, and day.

Choose the payload column, and you will notice that it has a complex schema—one that reflects the union of the payloads of event types that appear in the crawled data. Also note that the schema that crawlers generate is a subset of the true schema because they sample only a subset of the data.

Set up the library, development endpoint, and notebook

Next, you need to download and set up the libraries that estimate the sentiment in a snippet of text. The Stanford CoreNLP libraries contain a number of human language processing tools, including sentiment prediction.

Download the Stanford CoreNLP libraries. Unzip the .zip file, and you’ll see a directory full of jar files. For this example, the following jars are required:

  • stanford-corenlp-3.8.0.jar
  • stanford-corenlp-3.8.0-models.jar
  • ejml-0.23.jar

Upload these files to an Amazon S3 path that is accessible to AWS Glue so that it can load these libraries when needed. For this example, they are in s3://glue-sample-other/corenlp/.

Development endpoints are static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints to interactively send commands and explore and analyze your data. These endpoints have the same configuration as that of AWS Glue’s job execution system. So, commands and scripts that work there also work the same when registered and run as jobs in AWS Glue.

To set up an endpoint and a Zeppelin notebook to work with that endpoint, follow the instructions in the AWS Glue Developer Guide. When you are creating an endpoint, be sure to specify the locations of the previously mentioned jars in the Dependent jars path as a comma-separated list. Otherwise, the libraries will not be loaded.

After you set up the notebook server, go to the Zeppelin notebook by choosing Dev Endpoints in the left navigation pane on the AWS Glue console. Choose the endpoint that you created. Next, choose the Notebook Server URL, which takes you to the Zeppelin server. Log in using the notebook user name and password that you specified when creating the notebook. Finally, create a new note to try out this example.

Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and the output for those commands. Moreover, each notebook includes a number of interpreters. If you set up the Zeppelin server using the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already connected to your new development endpoint, with pyspark as the default. Therefore, throughout this example, you need to prepend %spark at the top of your paragraphs. We show it in the first listing only and omit it afterward for brevity.

Working with the data

In this section, we use AWS Glue extensions to Spark to work with the dataset. We look at the actual schema of the data and filter out the interesting event types for our analysis.

Start with some boilerplate code to import libraries that you need:

%spark

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext

Then, create the Spark and AWS Glue contexts needed for working with the data:

@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)

You need the @transient annotation on the SparkContext when working in Zeppelin; otherwise, you will run into a serialization error when executing commands.

Dynamic frames

This section shows how to create a dynamic frame that contains the GitHub records in the table that you crawled earlier. A dynamic frame is the basic data structure in AWS Glue scripts. It is like an Apache Spark data frame, except that it is designed and optimized for data cleaning and transformation workloads. A dynamic frame is well-suited for representing semi-structured datasets like the GitHub timeline.

A dynamic frame is a collection of dynamic records. In Spark lingo, it is an RDD (resilient distributed dataset) of DynamicRecords. A dynamic record is a self-describing record. Each record encodes its columns and types, so every record can have a schema that is unique from all others in the dynamic frame. This is convenient and often more efficient for datasets like the GitHub timeline, where payloads can vary drastically from one event type to another.
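
To illustrate what "self-describing" means here, the following sketch uses plain Scala maps as stand-ins for records. It does not use the AWS Glue DynamicRecord class. The names `SelfDescribingSketch`, `schemaOf`, and `unionFields`, and the type labels, are hypothetical.

```scala
// Plain-Scala stand-in for self-describing records; not the AWS Glue API.
object SelfDescribingSketch {
  type Record = Map[String, Any]

  // Label the runtime type of one value (the labels are invented for this sketch).
  def typeName(value: Any): String = value match {
    case _: String    => "string"
    case _: Int       => "int"
    case _: Double    => "double"
    case _: Map[_, _] => "struct"
    case _            => "unknown"
  }

  // Each record carries its own schema: field name -> type label.
  def schemaOf(rec: Record): Map[String, String] =
    rec.map { case (name, value) => name -> typeName(value) }

  // The union of field names across records that do not share a schema.
  def unionFields(records: Seq[Record]): Set[String] =
    records.flatMap(_.keySet).toSet
}
```

Two records with different fields each report their own schema, and `unionFields` shows the combined set of columns that a table-level schema would need to cover.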

The following creates a dynamic frame, github_events, from your table:

val github_events = glueContext
                    .getCatalogSource(database = "githubarchive", tableName = "data")
                    .getDynamicFrame()

The getCatalogSource() method returns a DataSource, which represents a particular table in the Data Catalog. The getDynamicFrame() method returns a dynamic frame from the source.

Recall that the crawler created a schema from only a sample of the data. You can scan the entire dataset, count the rows, and print the complete schema as follows:

github_events.count
github_events.printSchema()

The result looks like the following:

The data has 414,826 records. As before, notice that there are eight top-level columns, and three partition columns. If you scroll down, you’ll also notice that the payload is the most complex column.

Run functions and filter records

This section describes how you can create your own functions and invoke them seamlessly to filter records. Unlike filtering with Python lambdas, Scala scripts do not need to convert records from one language representation to another, thereby reducing overhead and running much faster.

Let’s create a function that picks only the IssuesEvents from the GitHub timeline. These events are generated whenever someone posts an issue for a particular repository. Each GitHub event record has a field, “type”, that indicates the kind of event it is. The issueFilter() function returns true for records that are IssuesEvents.

def issueFilter(rec: DynamicRecord): Boolean = { 
    rec.getField("type").exists(_ == "IssuesEvent") 
}

Note that the getField() method returns an Option[Any] type, so you first need to check that it exists before checking the type.
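
The same Option-based check can be tried without AWS Glue. In this sketch, a plain Scala Map stands in for the record, because its get() method also returns an Option. The name `OptionFilterSketch` is hypothetical.

```scala
// A Map stands in for a DynamicRecord here; Map.get returns an Option,
// mirroring the Option[Any] that getField() is described as returning.
object OptionFilterSketch {
  // exists() is false when the field is missing, and otherwise applies the comparison.
  def isIssueEvent(rec: Map[String, Any]): Boolean =
    rec.get("type").exists(_ == "IssuesEvent")
}
```

A record without a "type" field yields None, so exists() returns false without any explicit null or presence check.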

You pass this function to the filter transformation, which applies the function on each record and returns a dynamic frame of those records that pass.

val issue_events =  github_events.filter(issueFilter)

Now, let’s look at the size and schema of issue_events.

issue_events.count
issue_events.printSchema()

It’s much smaller (14,063 records), and the payload schema is less complex, reflecting only the schema for issues. Keep a few essential columns for your analysis, and drop the rest using the ApplyMapping() transform:

val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                 ("actor.login", "string", "actor", "string"), 
                                                 ("repo.name", "string", "repo", "string"),
                                                 ("payload.action", "string", "action", "string"),
                                                 ("payload.issue.title", "string", "title", "string")))
issue_titles.show()

The ApplyMapping() transform is quite handy for renaming columns, casting types, and restructuring records. The preceding code snippet tells the transform to select the fields (or columns) that are enumerated in the left half of the tuples and map them to the fields and types in the right half.
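
To show the select-and-rename part of that behavior in isolation, here is a simplified sketch over nested Scala maps. It ignores type casting, and silently dropping a missing source field is an assumption of the sketch rather than documented AWS Glue behavior. The names `MappingSketch`, `lookup`, and `project` are hypothetical.

```scala
// Simplified select-and-rename over nested maps; not the AWS Glue implementation.
object MappingSketch {
  type Record = Map[String, Any]

  // Follow a dotted path such as "payload.issue.title" through nested maps.
  def lookup(rec: Record, path: String): Option[Any] =
    path.split('.').foldLeft(Option[Any](rec)) {
      case (Some(m: Map[_, _]), key) => m.asInstanceOf[Record].get(key)
      case _                         => None
    }

  // Keep only the listed source paths and rename each one to its target name.
  // A source path that is absent is skipped (an assumption of this sketch).
  def project(rec: Record, mappings: Seq[(String, String)]): Record =
    mappings.flatMap { case (source, target) =>
      lookup(rec, source).map(value => target -> value)
    }.toMap
}
```

Given a record with nested actor and payload fields, `project` with the pairs `"actor.login" -> "actor"` and `"payload.issue.title" -> "title"` returns a flat record that contains only `actor` and `title`.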

Estimating sentiment using Stanford CoreNLP

To focus on the most pressing issues, you might want to isolate the records with the most negative sentiments. The Stanford CoreNLP libraries are Java-based and offer sentiment-prediction functions. Accessing these functions through Python is possible, but quite cumbersome. It requires creating Python surrogate classes and objects for those found on the Java side. Instead, with Scala support, you can use those classes and objects directly and invoke their methods. Let’s see how.

First, import the libraries needed for the analysis:

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

The Stanford CoreNLP libraries have a main driver that orchestrates all of their analysis. The driver setup is heavyweight, setting up threads and data structures that are shared across analyses. Apache Spark runs on a cluster with a main driver process and a collection of backend executor processes that do most of the heavy sifting of the data.

The Stanford CoreNLP shared objects are not serializable, so they cannot be distributed easily across a cluster. Instead, you need to initialize them once for every backend executor process that might need them. Here is how to accomplish that:

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")

object myNLP {
    lazy val coreNLP = new StanfordCoreNLP(props)
}

The properties tell the libraries which annotators to execute and how many words to process. The preceding code creates an object, myNLP, with a field coreNLP that is lazily evaluated. This field is initialized only when it is needed, and only once. So, when the backend executors start processing the records, each executor initializes the driver for the Stanford CoreNLP libraries only one time.
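
You can see the once-only behavior of a lazy field with a small sketch that needs no libraries. The counter and the names `LazyInitSketch` and `Heavy` are hypothetical; the string value stands in for the expensive pipeline object.

```scala
// Demonstrates that a lazy val inside an object is initialized on first access, and only once.
object LazyInitSketch {
  var initCount = 0 // counts how many times the expensive setup runs

  object Heavy {
    // Stand-in for an expensive resource; this body runs on first access only.
    lazy val resource: String = {
      initCount += 1
      "ready"
    }
  }

  def use(): String = Heavy.resource
}
```

Before the first call to `use()`, `initCount` is 0. After any number of calls within the same JVM, it is 1.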

Next is a function that estimates the sentiment of a text string. It first calls Stanford CoreNLP to annotate the text. Then, it pulls out the sentences and takes the average sentiment across all the sentences. The sentiment is a double, from 0.0 as the most negative to 4.0 as the most positive.

def estimatedSentiment(text: String): Double = {
    // No text to score: return NaN so that later numeric comparisons are false.
    if (text == null || text.isEmpty) { return Double.NaN }
    val annotations = myNLP.coreNLP.process(text)
    val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
    // Sum the predicted class of each sentence, then divide by the sentence count.
    sentences.foldLeft(0.0)( (csum, x) => {
        csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree]))
    }) / sentences.length
}
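
The averaging step can be checked on its own with plain integers in place of the per-sentence predictions. This sketch assumes that the per-sentence classes are integers from 0 to 4, as the range above suggests; the name `AverageSketch` is hypothetical.

```scala
// The same foldLeft-then-divide pattern, applied to plain integers.
object AverageSketch {
  // Sum the per-sentence classes as a Double, then divide by the number of sentences.
  // With no sentences this is 0.0 / 0, which evaluates to NaN rather than throwing.
  def averageClass(classes: Seq[Int]): Double =
    classes.foldLeft(0.0)((sum, c) => sum + c) / classes.length
}
```

Because the running sum is a Double, the division is floating-point, so an empty sequence produces NaN instead of an arithmetic exception.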

Now, let’s estimate the sentiment of the issue titles and add that computed field as part of the records. You can accomplish this with the map() method on dynamic frames:

val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
    val mbody = rec.getField("title")
    mbody match {
        case Some(mval: String) => { 
            rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
            rec }
        case _ => rec
    }
})

The map() method applies the user-provided function on every record. The function takes a DynamicRecord as an argument and returns a DynamicRecord. The code above computes the sentiment, adds it in a top-level field, sentiment, to the record, and returns the record.

Count the records with sentiment and show the schema. This takes a few minutes because Spark must initialize the library and run the sentiment analysis, which can be involved.

issue_sentiments.count
issue_sentiments.printSchema()

Notice that all records were processed (14,063), and the sentiment value was added to the schema.

Finally, let’s pick out the titles that have the lowest sentiment (less than 1.5). Count them and print out a sample to see what some of the titles look like.

val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
pressing_issues.count
pressing_issues.show(10)
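
One detail worth checking is how this filter treats records that have no sentiment or a NaN sentiment. The following sketch applies the same predicate to plain Option values; the name `ThresholdSketch` is hypothetical.

```scala
// The threshold predicate from the filter above, applied to plain Option values.
object ThresholdSketch {
  // Missing values fail exists(); NaN fails because every comparison with NaN is false.
  def isPressing(sentiment: Option[Any]): Boolean =
    sentiment.exists(_.asInstanceOf[Double] < 1.5)
}
```

So titles that were empty (scored as NaN) and records without a sentiment field are both excluded from the result.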

Next, write them all to a file so that you can handle them later. (You’ll need to replace the output path with your own.)

glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions("""{"path": "s3://<bucket>/out/path/"}"""), 
                              format = "json")
            .writeDynamicFrame(pressing_issues)

Take a look in the output path, and you can see the output files.

Putting it all together

Now, let’s create a job from the preceding interactive session. The following script combines all the commands from earlier. It processes the GitHub archive files and writes out the highly negative issues:

import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._

object GlueApp {

    object myNLP {
        val props = new Properties()
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
        props.setProperty("parse.maxlen", "70")

        lazy val coreNLP = new StanfordCoreNLP(props)
    }

    def estimatedSentiment(text: String): Double = {
        if (text == null || text.isEmpty) { return Double.NaN }
        val annotations = myNLP.coreNLP.process(text)
        val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
        sentences.foldLeft(0.0)( (csum, x) => { 
            csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])) 
        }) / sentences.length
    }

    def main(sysArgs: Array[String]) {
        val spark: SparkContext = SparkContext.getOrCreate()
        val glueContext: GlueContext = new GlueContext(spark)

        val dbname = "githubarchive"
        val tblname = "data"
        val outpath = "s3://<bucket>/out/path/"

        val github_events = glueContext
                            .getCatalogSource(database = dbname, tableName = tblname)
                            .getDynamicFrame()

        val issue_events =  github_events.filter((rec: DynamicRecord) => {
            rec.getField("type").exists(_ == "IssuesEvent")
        })

        val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
                                                         ("actor.login", "string", "actor", "string"), 
                                                         ("repo.name", "string", "repo", "string"),
                                                         ("payload.action", "string", "action", "string"),
                                                         ("payload.issue.title", "string", "title", "string")))

        val issue_sentiments = issue_titles.map((rec: DynamicRecord) => { 
            val mbody = rec.getField("title")
            mbody match {
                case Some(mval: String) => { 
                    rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
                    rec }
                case _ => rec
            }
        })

        val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))

        glueContext.getSinkWithFormat(connectionType = "s3", 
                              options = JsonOptions(s"""{"path": "$outpath"}"""), 
                              format = "json")
                    .writeDynamicFrame(pressing_issues)
    }
}

Notice that the script is enclosed in a top-level object called GlueApp, which serves as the script’s entry point for the job. (You’ll need to replace the output path with your own.) Upload the script to an Amazon S3 location so that AWS Glue can load it when needed.

To create the job, open the AWS Glue console. Choose Jobs in the left navigation pane, and then choose Add job. Create a name for the job, and specify a role with permissions to access the data. Choose An existing script that you provide, and choose Scala as the language.

For the Scala class name, type GlueApp to indicate the script’s entry point. Specify the Amazon S3 location of the script.

Choose Script libraries and job parameters. In the Dependent jars path field, enter the Amazon S3 locations of the Stanford CoreNLP libraries from earlier as a comma-separated list (without spaces). Then choose Next.
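
If you prefer to build that value rather than type it, a short sketch like the following produces the comma-separated string. It assumes that the three jars sit directly under the example path s3://glue-sample-other/corenlp/ used earlier; the name `JarPathSketch` is hypothetical.

```scala
// Builds the comma-separated jar list, assuming the jars sit directly under the example prefix.
object JarPathSketch {
  val prefix = "s3://glue-sample-other/corenlp/"
  val jars = Seq(
    "stanford-corenlp-3.8.0.jar",
    "stanford-corenlp-3.8.0-models.jar",
    "ejml-0.23.jar")

  // mkString(",") joins the full paths with commas and no spaces.
  val dependentJarsPath: String = jars.map(jar => prefix + jar).mkString(",")
}
```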

No connections are needed for this job, so choose Next again. Review the job properties, and choose Finish. Finally, choose Run job to execute the job.

You can simply edit the script’s input table and output path to run this job on any GitHub timeline datasets that you might have.

Conclusion

In this post, we showed how to write AWS Glue ETL scripts in Scala via notebooks and how to run them as jobs. Scala has the advantage that it is the native language for the Spark runtime. With Scala, it is easier to call Scala or Java functions and third-party libraries for analyses. Moreover, data processing is faster in Scala because there’s no need to convert records from one language runtime to another.

You can find more examples of Scala scripts in our GitHub examples repository: https://github.com/awslabs/aws-glue-samples. We encourage you to experiment with Scala scripts and let us know about any interesting ETL flows that you want to share.

Happy Glue-ing!

Additional Reading

If you found this post useful, be sure to check out Simplify Querying Nested JSON with the AWS Glue Relationalize Transform and Genomic Analysis with Hail on Amazon EMR and Amazon Athena.

About the Authors

Mehul Shah is a senior software manager for AWS Glue. His passion is leveraging the cloud to build smarter, more efficient, and easier to use data systems. He has three girls, and, therefore, he has no spare time.

Ben Sowell is a software development engineer at AWS Glue.

Vinay Vivili is a software development engineer for AWS Glue.

[$] A thorough introduction to eBPF

Post Syndicated from corbet original https://lwn.net/Articles/740157/rss

In his linux.conf.au
2017 talk [YouTube]
on the eBPF in-kernel virtual machine, Brendan Gregg
proclaimed that “super powers have finally come to Linux”. Getting
eBPF to that point has been a long road of evolution and design. While
eBPF was originally used for network packet filtering, it turns out
that running user-space code inside a sanity-checking virtual machine
is a powerful tool for kernel developers and production engineers.

Over time, new eBPF users have appeared to take advantage of its
performance and convenience. This article explains how eBPF evolved,
how it works, and how it is used in the kernel.

Presenting AWS IoT Analytics: Delivering IoT Analytics at Scale and Faster than Ever Before

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/launch-presenting-aws-iot-analytics/

One of the technology areas I thoroughly enjoy is the Internet of Things (IoT). Even as a child I used to infuriate my parents by taking apart the toys they would purchase for me to see how they worked and if I could somehow put them back together. It seems somehow I was destined to end up in the tough and ever-changing world of technology. Therefore, it’s no wonder that I am really enjoying learning and tinkering with IoT devices and technologies. It combines my love of development and software engineering with my curiosity around circuits, controllers, and other facets of the electrical engineering discipline; even though an electrical engineer I cannot claim to be.

Despite all of the information that is collected by the deployment of IoT devices and solutions, I honestly never really thought about the need to analyze, search, and process this data until I came up against a scenario where it became of the utmost importance to be able to search and query through loads of sensory data for an anomaly occurrence. Of course, I understood the importance of analytics for businesses to make accurate decisions and predictions to drive the organization’s direction. But it didn’t occur to me initially, how important it was to make analytics an integral part of my IoT solutions. Well, I learned my lesson just in time because this re:Invent a service is launching to make it easier for anyone to process and analyze IoT messages and device data.

Hello, AWS IoT Analytics! AWS IoT Analytics is a fully managed service of AWS IoT that provides advanced analysis of data collected from your IoT devices. With the AWS IoT Analytics service, you can process messages, gather and store large amounts of device data, and query your data. The new service also integrates with Amazon QuickSight for visualization of your data and brings the power of machine learning through integration with Jupyter Notebooks.

Benefits of AWS IoT Analytics

  • Helps with predictive analysis of data by providing access to pre-built analytical functions
  • Provides the ability to visualize analytical output from the service
  • Provides tools to clean up data
  • Can help identify patterns in the gathered data

Be In the Know: IoT Analytics Concepts

  • Channel: archives the raw, unprocessed messages and collects data from MQTT topics.
  • Pipeline: consumes messages from channels and allows message processing.
    • Activities: perform transformations on your messages, including filtering attributes and invoking Lambda functions for advanced processing.
  • Data Store: used as a queryable repository for processed messages. You can have multiple data stores for messages coming from different devices or locations, or filtered by message attributes.
  • Data Set: a data retrieval view from a data store, which can be generated on a recurring schedule.
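
The concepts above can be pictured as a small data flow. The following is a toy, in-memory sketch in Python, not the service's API: the class names, the `remove_attribute` and `keep_if` helpers, and the sample messages are all hypothetical, chosen only to show how a channel, a pipeline with activities, a data store, and a data set relate to one another.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
# An activity transforms a message, or returns None to filter it out.
Activity = Callable[[Message], Optional[Message]]


@dataclass
class Channel:
    """Archives raw, unprocessed messages."""
    raw_messages: List[Message] = field(default_factory=list)

    def ingest(self, message: Message) -> None:
        self.raw_messages.append(message)


@dataclass
class DataStore:
    """Queryable repository for processed messages."""
    rows: List[Message] = field(default_factory=list)


@dataclass
class Pipeline:
    """Consumes messages from a channel, applies activities, stores results."""
    channel: Channel
    activities: List[Activity]
    datastore: DataStore

    def run(self) -> None:
        for raw in self.channel.raw_messages:
            message: Optional[Message] = dict(raw)  # leave the raw archive untouched
            for activity in self.activities:
                message = activity(message)
                if message is None:  # an activity filtered the message out
                    break
            if message is not None:
                self.datastore.rows.append(message)


def remove_attribute(name: str) -> Activity:
    """Hypothetical activity: drop one attribute from every message."""
    return lambda message: {k: v for k, v in message.items() if k != name}


def keep_if(predicate: Callable[[Message], bool]) -> Activity:
    """Hypothetical activity: keep only messages that satisfy the predicate."""
    return lambda message: message if predicate(message) else None


def data_set(datastore: DataStore, predicate: Callable[[Message], bool]) -> List[Message]:
    """A data set is a retrieval view over a data store."""
    return [row for row in datastore.rows if predicate(row)]


channel = Channel()
channel.ingest({"device": "sensor-1", "temp": 21.5, "debug": "x"})
channel.ingest({"device": "sensor-2", "temp": -999.0, "debug": "y"})  # faulty reading

store = DataStore()
Pipeline(channel, [remove_attribute("debug"), keep_if(lambda m: m["temp"] > -100)], store).run()
warm = data_set(store, lambda row: row["temp"] > 20)
```

In this sketch the channel keeps both raw messages, while the data store holds only the cleaned reading from sensor-1. That mirrors the split between raw archive and processed repository described in the list.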

Getting Started with AWS IoT Analytics

First, I’ll create a channel to receive incoming messages.  This channel can be used to ingest data sent to the channel via MQTT or messages directed from the Rules Engine. To create a channel, I’ll select the Channels menu option and then click the Create a channel button.

I’ll name my channel TaraIoTAnalyticsID and give it an MQTT topic filter of Temperature. To complete the creation of my channel, I will click the Create Channel button.
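
To make the idea of a topic filter concrete, here is a minimal Python sketch of how an MQTT topic filter is matched against a topic name. It assumes the standard MQTT wildcard rules (`+` for one level, `#` for all remaining levels) and ignores special cases such as topics beginning with `$`. The function name `topic_matches` is my own and is not part of any AWS SDK.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Return True if an MQTT topic name matches a topic filter.

    '+' matches exactly one level; '#' matches all remaining levels and
    is only valid as the last level of the filter.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: matches only when it ends the filter.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False  # the filter has more levels than the topic
        if level != "+" and level != topic_levels[i]:
            return False  # literal levels must match exactly (case-sensitive)
    return len(filter_levels) == len(topic_levels)
```

Under these rules a filter of Temperature matches only the topic Temperature itself. A filter such as Temperature/# would also match topics nested beneath it.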

Now that I have my Channel created, I need to create a Data Store to receive and store the messages received on the Channel from my IoT device. Remember that you can set up multiple Data Stores for more complex solution needs, but I’ll just create one Data Store for my example. I’ll select Data Stores from the menu panel and click Create a data store.

I’ll name my Data Store TaraDataStoreID. Once I click the Create the data store button, I will have successfully set up a Data Store to house messages coming from my Channel.

Now that I have my Channel and my Data Store, I will need to connect the two using a Pipeline. I’ll create a simple pipeline that just connects my Channel and Data Store, but you can create a more robust pipeline to process and filter messages by adding Pipeline activities like a Lambda activity.

To create a pipeline, I’ll select the Pipelines menu option and then click the Create a pipeline button.

I will not add an attribute for this pipeline, so I will click the Next button.

As discussed, there are additional pipeline activities that I can add to my pipeline for the processing and transformation of messages, but I will keep my first pipeline simple and hit the Next button.

The final step in creating my pipeline is for me to select my previously created Data Store and click Create Pipeline.
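
The same three console steps can also be scripted. The sketch below builds the request as I understand the boto3 `iotanalytics` client (`create_channel`, `create_datastore`, `create_pipeline`), so treat the exact field names as an assumption to check against the current SDK documentation. The pipeline name TaraPipelineID and the activity labels are hypothetical, since the walkthrough does not name them.

```python
from typing import Any, Dict


def simple_pipeline_request(pipeline_name: str, channel_name: str, datastore_name: str) -> Dict[str, Any]:
    """Build keyword arguments for create_pipeline: channel -> data store."""
    return {
        "pipelineName": pipeline_name,
        "pipelineActivities": [
            # Activity names ("from_channel", "to_datastore") are arbitrary labels;
            # "next" links the channel activity to the data store activity.
            {"channel": {"name": "from_channel", "channelName": channel_name, "next": "to_datastore"}},
            {"datastore": {"name": "to_datastore", "datastoreName": datastore_name}},
        ],
    }


def create_all(client: Any, pipeline_name: str, channel_name: str, datastore_name: str) -> None:
    """Create the channel, the data store, and the pipeline that connects them.

    `client` is expected to behave like boto3.client("iotanalytics").
    """
    client.create_channel(channelName=channel_name)
    client.create_datastore(datastoreName=datastore_name)
    client.create_pipeline(**simple_pipeline_request(pipeline_name, channel_name, datastore_name))


# "TaraPipelineID" is a hypothetical pipeline name used only for illustration.
request = simple_pipeline_request("TaraPipelineID", "TaraIoTAnalyticsID", "TaraDataStoreID")
```

Because `create_all` takes the client as an argument, the request can be inspected or tested without AWS credentials. Pass a real client only when you are ready to create the resources.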

All that is left for me to take advantage of the AWS IoT Analytics service is to create an IoT rule that sends data to an AWS IoT Analytics channel.  Wow, that was a super easy process to set up analytics for IoT devices.
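
For reference, here is a hedged sketch of what such a rule could look like when defined in code rather than in the console. It builds a payload in the shape I believe the boto3 `iot` client's `create_topic_rule` call accepts, with an `iotAnalytics` action. The role ARN is a placeholder, and the rule name in the trailing comment is hypothetical.

```python
from typing import Any, Dict


def iot_analytics_rule_payload(topic: str, channel_name: str, role_arn: str) -> Dict[str, Any]:
    """Build a topic rule payload that forwards every message on `topic` to a channel."""
    return {
        # AWS IoT SQL: select the whole message from the given MQTT topic.
        "sql": f"SELECT * FROM '{topic}'",
        "actions": [
            {"iotAnalytics": {"channelName": channel_name, "roleArn": role_arn}}
        ],
        "ruleDisabled": False,
    }


payload = iot_analytics_rule_payload(
    topic="Temperature",
    channel_name="TaraIoTAnalyticsID",
    # Placeholder ARN: the role must allow the rule to write to the channel.
    role_arn="arn:aws:iam::123456789012:role/ExampleIoTAnalyticsRole",
)

# With a real client this would be sent roughly as:
#   boto3.client("iot").create_topic_rule(ruleName="ExampleRule", topicRulePayload=payload)
```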

If I want to create a Data Set from queries run against my data, either for visualization with Amazon QuickSight or for more advanced analysis in Jupyter Notebooks, I can choose the Analyze menu option to bring up the screens to create data sets and access the Jupyter Notebook instances.
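
As an illustration of what a Data Set definition involves, the sketch below assembles a request in the shape I believe the boto3 `iotanalytics` client's `create_dataset` call expects: a SQL query action plus an optional schedule trigger. The data set name, the action name, and the cron expression are hypothetical. Whether the data store name must be lower-cased inside the query is something to verify against the service documentation.

```python
from typing import Any, Dict, Optional


def dataset_request(dataset_name: str, datastore_name: str,
                    schedule_expression: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for create_dataset: one SQL query, optional schedule."""
    request: Dict[str, Any] = {
        "datasetName": dataset_name,
        "actions": [
            {
                "actionName": "query_all",  # arbitrary label for this action
                "queryAction": {"sqlQuery": f"SELECT * FROM {datastore_name}"},
            }
        ],
    }
    if schedule_expression is not None:
        # Regenerate the data set on a recurring schedule.
        request["triggers"] = [{"schedule": {"expression": schedule_expression}}]
    return request


# Hypothetical names; the cron expression is intended to mean "daily at 12:00 UTC".
daily = dataset_request("ExampleDataSet", "TaraDataStoreID", "cron(0 12 * * ? *)")
on_demand = dataset_request("ExampleDataSet", "TaraDataStoreID")
```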

Summary

As you can see, it was a very simple process to set up advanced data analysis for AWS IoT. With AWS IoT Analytics, you have the ability to collect, visualize, process, query, and store large amounts of data generated from your AWS IoT connected devices. Additionally, you can access the AWS IoT Analytics service in a myriad of different ways: the AWS Command Line Interface (AWS CLI), the AWS IoT API, language-specific AWS SDKs, and AWS IoT Device SDKs.

AWS IoT Analytics is available today for you to dig into the analysis of your IoT data. To learn more about AWS IoT and AWS IoT Analytics go to the AWS IoT Analytics product page and/or the AWS IoT documentation.

Tara

Amazon MQ – Managed Message Broker Service for ActiveMQ

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-mq-managed-message-broker-service-for-activemq/

Messaging holds the parts of a distributed application together, while also adding resiliency and enabling the implementation of highly scalable architectures. For example, earlier this year, Amazon Simple Queue Service (SQS) and Amazon Simple Notification Service (SNS) supported the processing of customer orders on Prime Day, collectively processing 40 billion messages at a rate of 10 million per second, with no customer-visible issues.

SQS and SNS have been used extensively for applications that were born in the cloud. However, many of our larger customers are already making use of open-sourced or commercially-licensed message brokers. Their applications are mission-critical, and so is the messaging that powers them. Our customers describe the setup and on-going maintenance of their messaging infrastructure as “painful” and report that they spend at least 10 staff-hours per week on this chore.

New Amazon MQ
Today we are launching Amazon MQ – a managed message broker service for Apache ActiveMQ that lets you get started in minutes with just three clicks! As you may know, ActiveMQ is a popular open-source message broker that is fast & feature-rich. It offers queues and topics, durable and non-durable subscriptions, push-based and poll-based messaging, and filtering.

As a managed service, Amazon MQ takes care of the administration and maintenance of ActiveMQ. This includes responsibility for broker provisioning, patching, failure detection & recovery for high availability, and message durability. With Amazon MQ, you get direct access to the ActiveMQ console and industry standard APIs and protocols for messaging, including JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. This allows you to move from any message broker that uses these standards to Amazon MQ–along with the supported applications–without rewriting code.
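
To show what "industry standard protocols" means at the wire level, here is a minimal sketch that builds a STOMP SEND frame by hand, following the STOMP 1.2 frame layout (command line, headers, blank line, body, NUL terminator). It skips header escaping and the CONNECT handshake. The destination `/queue/orders` is a hypothetical queue name. A real application would normally use a client library rather than assemble frames itself.

```python
def stomp_send_frame(destination: str, body: str) -> bytes:
    """Build a STOMP SEND frame: command, headers, blank line, body, NUL byte."""
    payload = body.encode("utf-8")
    headers = [
        f"destination:{destination}",
        "content-type:text/plain",
        f"content-length:{len(payload)}",  # length of the body in bytes
    ]
    head = "SEND\n" + "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + payload + b"\x00"


frame = stomp_send_frame("/queue/orders", "hello")
```

Because the frame format is defined by the protocol and not by the broker, the same bytes are valid for any STOMP-speaking broker. That is what makes moving between brokers possible without rewriting application code.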

You can create a single-instance Amazon MQ broker for development and testing, or an active/standby pair that spans AZs, with quick, automatic failover. Either way, you get data replication across AZs and a pay-as-you-go model for the broker instance and message storage.
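
With an active/standby pair, clients typically list both endpoints so that they can reconnect to whichever broker is active. A small sketch of assembling an ActiveMQ failover transport URI follows. The host names are made-up placeholders, and the port is an assumption, so use the endpoints shown in the console for your own broker.

```python
from typing import List


def failover_uri(endpoints: List[str]) -> str:
    """Combine broker endpoints into an ActiveMQ failover transport URI."""
    if not endpoints:
        raise ValueError("at least one broker endpoint is required")
    return "failover:(" + ",".join(endpoints) + ")"


# Hypothetical endpoints for the active and standby brokers.
uri = failover_uri([
    "ssl://b-example-1.mq.us-east-1.amazonaws.com:61617",
    "ssl://b-example-2.mq.us-east-1.amazonaws.com:61617",
])
```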

Amazon MQ is a full-fledged part of the AWS family, including the use of AWS Identity and Access Management (IAM) for authentication and authorization to use the service API. You can use Amazon CloudWatch metrics to keep a watchful eye on metrics such as queue depth and initiate Auto Scaling of your consumer fleet as needed.
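
One simple way to turn queue depth into a scaling decision is to size the consumer fleet so that each consumer handles a bounded backlog. The function below is only an illustrative policy with made-up thresholds and does not describe how Auto Scaling itself computes capacity. In practice the queue depth would come from the broker's CloudWatch metrics.

```python
import math


def desired_consumers(queue_depth: int, messages_per_consumer: int,
                      min_consumers: int = 1, max_consumers: int = 10) -> int:
    """Return a consumer count that keeps the backlog per consumer bounded."""
    if messages_per_consumer <= 0:
        raise ValueError("messages_per_consumer must be positive")
    needed = math.ceil(queue_depth / messages_per_consumer)
    # Clamp to the allowed fleet size.
    return max(min_consumers, min(max_consumers, needed))
```

For example, with a target of 100 queued messages per consumer, a depth of 250 calls for 3 consumers, and a depth of 5,000 is capped at the 10-consumer maximum.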

Launching an Amazon MQ Broker
To get started, I open up the Amazon MQ Console, select the desired AWS Region, enter a name for my broker, and click on Next step:

Then I choose the instance type, indicate that I want to create a standby, and click on Create broker (I can select a VPC and fine-tune other settings in the Advanced settings section):

My broker will be created and ready to use in 5-10 minutes:

The URLs and endpoints that I use to access my broker are all available at a click:

I can access the ActiveMQ Web Console at the link provided:

The broker publishes instance, topic, and queue metrics to CloudWatch. Here are the instance metrics:

Available Now
Amazon MQ is available now and you can start using it today in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), and Asia Pacific (Sydney) Regions.

The AWS Free Tier lets you use a single-AZ micro instance for up to 750 hours and store up to 1 gigabyte each month, for one year. After that, billing is based on instance-hours and message storage, plus charges for Internet data transfer if the broker is accessed from outside of AWS.
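
As a rough worked example of the Free Tier allowance, the sketch below computes how much of a month's usage would fall outside the 750 hours and 1 gigabyte mentioned above. It assumes the account is still within its Free Tier year and is running an eligible single-AZ micro instance. It deliberately leaves out prices and data transfer, since those are not given here.

```python
from typing import Tuple


def billable_usage(instance_hours: float, storage_gb: float,
                   free_hours: float = 750.0, free_storage_gb: float = 1.0) -> Tuple[float, float]:
    """Return (billable instance-hours, billable storage in GB) for one month."""
    return (
        max(0.0, instance_hours - free_hours),
        max(0.0, storage_gb - free_storage_gb),
    )
```

A broker that runs for 720 hours and stores 0.5 GB stays entirely within the allowance, while 800 hours and 3 GB would leave 50 instance-hours and 2 GB billable.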

Jeff;