Tag Archives: workflow

How I built a data warehouse using Amazon Redshift and AWS services in record time

Post Syndicated from Stephen Borg original https://aws.amazon.com/blogs/big-data/how-i-built-a-data-warehouse-using-amazon-redshift-and-aws-services-in-record-time/

This is a customer post by Stephen Borg, the Head of Big Data and BI at Cerberus Technologies.

Cerberus Technologies, in their own words: Cerberus is a company founded in 2017 by a team of visionary iGaming veterans. Our mission is simple – to offer the best tech solutions through a data-driven and a customer-first approach, delivering innovative solutions that go against traditional forms of working and process. This mission is based on the solid foundations of reliability, flexibility and security, and we intend to fundamentally change the way iGaming and other industries interact with technology.

Over the years, I have developed and created a number of data warehouses from scratch. Recently, I built a data warehouse for the iGaming industry single-handedly. To do it, I used the power and flexibility of Amazon Redshift and the wider AWS data management ecosystem. In this post, I explain how I was able to build a robust and scalable data warehouse without the large team of experts typically needed.

In two of my recent projects, I ran into challenges when scaling our data warehouse using on-premises infrastructure. Data was growing by many tens of gigabytes per day, and query performance was suffering. Scaling required major capital investment in hardware and software licenses, plus significant operational costs for maintenance and for the technical staff needed to keep it running and performing well. Unfortunately, I couldn’t get the resources needed to scale the infrastructure with data growth, and these projects were abandoned. Thanks to cloud data warehousing, the bottlenecks of infrastructure resources, capital expense, and operational costs have been significantly reduced or have gone away entirely. There is no longer any excuse for letting the obstacles of the past delay delivering timely insights to decision makers, no matter how much data you have.

With Amazon Redshift and AWS, I delivered a cloud data warehouse to the business very quickly, and with a small team: me. I didn’t have to order hardware or software, and I no longer needed to install, configure, tune, or keep up with patches and version updates. Instead, I easily set up a robust data processing pipeline and we were quickly ingesting and analyzing data. Now, my data warehouse team can be extremely lean, and focus more time on bringing in new data and delivering insights. In this post, I show you the AWS services and the architecture that I used.

Handling data feeds

I have several different data sources that provide everything needed to run the business. The data includes activity from our iGaming platform, social media posts, clickstream data, marketing and campaign performance, and customer support engagements.

To handle the diversity of data feeds, I developed abstract integration applications using Docker that run on Amazon EC2 Container Service (Amazon ECS) and feed data to Amazon Kinesis Data Streams. These data streams can be used for real-time analytics. In my system, each record in Kinesis is preprocessed by an AWS Lambda function to cleanse and aggregate information. My system then routes it through Amazon Kinesis Data Firehose to be stored where I need it on Amazon S3. Suppose that you used an on-premises architecture to accomplish the same task. A team of data engineers would be required to maintain and monitor a Kafka cluster, develop applications to stream data, and maintain a Hadoop cluster and the infrastructure underneath it for data storage. With my stream processing architecture, there are no servers to manage, no disk drives to replace, and no monitoring scripts to write.

Setting up a Kinesis stream takes just a few clicks, and the same is true for Kinesis Data Firehose. Firehose can be configured to automatically consume data from a Kinesis data stream and write compressed data to Amazon S3 every N minutes. When I want to process a Kinesis data stream, it’s very easy to set up a Lambda function to be executed on each message received. I can just set a trigger from the AWS Lambda Management Console.

I also monitor the duration of function execution using Amazon CloudWatch and AWS X-Ray.

Regardless of the format in which I receive data from our partners, I can send it to Kinesis as JSON using my own formatters. After Firehose writes this to Amazon S3, I have everything in nearly the same structure as I received it, but compressed, encrypted, and optimized for reading.

This data is automatically crawled by AWS Glue and placed into the AWS Glue Data Catalog. This means that I can immediately query the data directly on S3 using Amazon Athena or through Amazon Redshift Spectrum. Previously, I used Amazon EMR and an Amazon RDS–based metastore in Apache Hive for catalog management. Now I can avoid the complexity of maintaining Hive Metastore catalogs. Glue takes care of high availability and the operations side so that I know that end users can always be productive.

Working with Amazon Athena and Amazon Redshift for analysis

I found Amazon Athena extremely useful out of the box for ad hoc analysis. Our engineers (me) use Athena to understand new datasets that we receive and to understand what transformations will be needed for long-term query efficiency.
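
As a rough illustration (the table and column names here are hypothetical, not from our actual schema), an ad hoc Athena query over a Glue-crawled clickstream table is just standard SQL against the files sitting on S3:

SELECT campaign_id, COUNT(*) AS clicks
FROM clickstream_raw              -- table discovered by the AWS Glue crawler
WHERE dt = '2018-01-15'           -- hypothetical partition column
GROUP BY campaign_id
ORDER BY clicks DESC
LIMIT 20;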

For our data analysts and data scientists, we’ve selected Amazon Redshift. Amazon Redshift has proven to be the right tool for us over and over again. It easily processes 20+ million transactions per day, regardless of the footprint of the tables and the type of analytics required by the business. Latency is low and query performance expectations have been more than met. We use Redshift Spectrum for long-term data retention, which enables me to extend the analytic power of Amazon Redshift beyond local data to anything stored in S3, and without requiring me to load any data. Redshift Spectrum gives me the freedom to store data where I want, in the format I want, and have it available for processing when I need it.
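
To give a feel for how Spectrum extends Amazon Redshift to data in S3, here is a minimal sketch (the schema, database, table, and role names are placeholders, not our production objects). An external schema is mapped to the AWS Glue Data Catalog once; after that, the S3-resident tables can be queried, and joined with local tables, like any other schema:

CREATE EXTERNAL SCHEMA s3_archive
FROM DATA CATALOG
DATABASE 'events'
IAM_ROLE 'arn:aws:iam::<<account>>:role/<<spectrum role>>';

SELECT player_id, SUM(stake) AS total_stake
FROM s3_archive.bets_archive      -- data stays on S3, no loading required
WHERE bet_date >= '2017-01-01'
GROUP BY player_id;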

To load data directly into Amazon Redshift, I use AWS Data Pipeline to orchestrate data workflows. I create Amazon EMR clusters on an intra-day basis, which I can easily adjust to run more or less frequently as needed throughout the day. EMR clusters are used together with Amazon RDS, Apache Spark 2.0, and S3 storage. The data pipeline application loads ETL configurations from Spring RESTful services hosted on AWS Elastic Beanstalk. The application then loads data from S3 into memory, aggregates and cleans the data, and then writes the final version of the data to Amazon Redshift. This data is then ready to use for analysis. Spark on EMR also helps with recommendations and personalization use cases for various business users, and I find this easy to set up and deliver what users want. Finally, business users use Amazon QuickSight for self-service BI to slice, dice, and visualize the data depending on their requirements.

Each AWS service in this architecture plays its part in saving precious time that’s crucial for delivery and for getting different departments in the business on board. I found the services easy to set up and use, and all have proven to be highly reliable in our production environments. When the architecture was in place, scaling out was either handled completely by the service or was a matter of a simple API call, and crucially it didn’t require me to change a single line of code. Increasing shards for Kinesis can be done in a minute by editing a stream. Increasing capacity for Lambda functions can be accomplished by editing the memory allocated for processing, and concurrency is handled automatically. EMR cluster capacity can easily be increased by changing the master and slave node types in Data Pipeline, or by using Auto Scaling. Lastly, RDS and Amazon Redshift can be easily upgraded without any major tasks to be performed by our team (again, me).

In the end, using AWS services including Kinesis, Lambda, Data Pipeline, and Amazon Redshift allows me to keep my team lean and highly productive. I eliminated the cost and delays of capital infrastructure, as well as the late night and weekend calls for support. I can now give maximum value to the business while keeping operational costs down. My team pushed out an agile and highly responsive data warehouse solution in record time and we can handle changing business requirements rapidly, and quickly adapt to new data and new user requests.


Additional Reading

If you found this post useful, be sure to check out Deploy a Data Warehouse Quickly with Amazon Redshift, Amazon RDS for PostgreSQL and Tableau Server and Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift.


About the Author

Stephen Borg is the Head of Big Data and BI at Cerberus Technologies. He has a background in platform software engineering, and first became involved in data warehousing using the typical RDBMS, SQL, ETL, and BI tools. He quickly became passionate about providing insight to help others optimize the business and add personalization to products. He is now the Head of Big Data and BI at Cerberus Technologies.

 

 

 

Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse that enables you to easily make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas to simple denormalized tables, for running any analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take account of Amazon Redshift’s architecture. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:

  • COPY data from multiple, evenly sized files.
  • Use workload management to improve ETL runtimes.
  • Perform table maintenance regularly.
  • Perform multiple steps in a single transaction.
  • Loading data in bulk.
  • Use UNLOAD to extract large result sets.
  • Use Amazon Redshift Spectrum for ad hoc ETL processing.
  • Monitor daily ETL health using diagnostic queries.

1. COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. For example, when a single large file is loaded into a two-node cluster, only one of the nodes, “Compute-0”, performs all the data ingestion.

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources and the quickest possible throughput.
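
As a minimal sketch (the table name and S3 prefix are placeholders; the IAM role is the example role used later in this post), a single COPY pointed at a common key prefix ingests all of the evenly sized, gzip-compressed files under that prefix in parallel:

COPY sales_staging
FROM 's3://<<S3 Bucket>>/sales/2017/07/02/'     -- loads every file under this prefix
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
GZIP
DELIMITER '|';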

2. Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.

I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.

When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:

  • Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.
  • Claim extra memory available in a queue. When executing an ETL query, you can take advantage of wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), wlm_query_slot_count can be increased for this step (a minimal sketch follows this list).
  • Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.
  • Take advantage of the dynamic memory parameters. They let you shift memory from your ETL queue to your reporting queue after the ETL job has completed.
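
Here is a minimal sketch of the slot-count technique mentioned above. The staging table and S3 path mirror the example later in this post, and the queue is assumed to have 5 slots:

SET wlm_query_slot_count TO 5;   -- claim all 5 slots in the ETL queue for this session
COPY stage_tbl
FROM 's3://<<S3 Bucket>>/batch/2017/07/02/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
GZIP;
SET wlm_query_slot_count TO 1;   -- release the extra slots for other ETL jobs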

3. Perform table maintenance regularly

Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables are regularly VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and have VACUUM and ANALYZE executed on a regular schedule.

  • Use VACUUM to sort tables and remove deleted blocks

During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.

DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.

After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Utils table_info script.

Use the following approaches to ensure that VACUUM is completed in a timely manner:

  • Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.
  • DROP or TRUNCATE intermediate or staging tables, thereby eliminating the need to VACUUM them.
  • If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.
  • Consider using time series tables. This helps reduce the amount of data you need to VACUUM.
  • Use ANALYZE to update database statistics

Amazon Redshift uses a cost-based query planner and optimizer that relies on statistics about tables to make good decisions about the query plan for SQL statements. Regular statistics collection after the ETL completes ensures that user queries run fast, and that daily ETL processes are performant. The Amazon Redshift utility table_info script provides insights into the freshness of the statistics. Keeping the percentage of stale statistics (pct_stats_off) below 20% ensures effective query plans for the SQL queries.
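
As a minimal sketch of the maintenance commands themselves, run here against the daily_table used in the example below:

VACUUM daily_table;                      -- re-sort rows and reclaim space from deleted blocks
ANALYZE daily_table PREDICATE COLUMNS;   -- refresh only the statistics the planner actually uses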

4. Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be surrounded by a BEGIN…END block so that a single commit is performed only after all the transformation logic has been executed. Here is an example of a multi-step ETL script that performs one commit at the end:

BEGIN;
CREATE TEMPORARY TABLE staging_table (…);
INSERT INTO staging_table SELECT .. FROM source;          -- transformation logic
DELETE FROM daily_table WHERE dataset_date = ?;
INSERT INTO daily_table SELECT .. FROM staging_table;     -- daily aggregate
DELETE FROM weekly_table WHERE weekending_date = ?;
INSERT INTO weekly_table SELECT .. FROM staging_table;    -- weekly aggregate
COMMIT;

5. Loading data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

  • Use a manifest file to ingest large datasets that span multiple files. The manifest file is a JSON file that lists all the files to be loaded into Amazon Redshift. Using a manifest file ensures that Amazon Redshift has a consistent view of the data to be loaded from S3, while also ensuring that duplicate files do not result in the same data being loaded more than once.
  • Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.
  • Use ALTER TABLE APPEND to swap data from the staging tables to the target table. Data in the source table is moved to matching columns in the target table. Column order doesn’t matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn’t involve copying data (see the sketch following this list).
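
Here is a minimal sketch of both points; the staging table names are hypothetical. One caveat worth noting: to the best of my understanding, ALTER TABLE APPEND requires both the source and target to be permanent tables and cannot run inside a BEGIN … END block, so the temporary-table pattern and the append pattern serve slightly different steps of the pipeline:

-- Temporary staging table for intra-session transformation steps;
-- it is dropped automatically when the session ends.
CREATE TEMPORARY TABLE stage_daily (LIKE daily_table);

-- Permanent staging table whose blocks are moved wholesale into the target.
CREATE TABLE append_stage (LIKE daily_table);
-- ... load and transform data into append_stage ...
ALTER TABLE daily_table APPEND FROM append_stage;   -- moves blocks instead of copying rows
DROP TABLE append_stage;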

6. Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. With a large SELECT statement, the leader node ends up doing most of the work to stream out the rows.

Use UNLOAD to extract large result sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should use the MAXFILESIZE parameter to keep files at around 150 MB. Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.
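
A minimal sketch, reusing the weekly table and placeholders from the example later in this post:

UNLOAD ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>')
TO 's3://<<S3 Bucket>>/datalake/weekly/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
GZIP
MAXFILESIZE 150 MB;   -- many evenly sized files for Redshift Spectrum to read in parallel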

7. Use Redshift Spectrum for ad hoc ETL processing

Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.
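
As a minimal sketch of this staging pattern (it assumes an external schema, here called spectrum, already mapped to the AWS Glue Data Catalog; the table, columns, and location are hypothetical), the spike data is defined as an external table and queried in place:

CREATE EXTERNAL TABLE spectrum.promo_clicks (
  click_time   TIMESTAMP,
  campaign_id  VARCHAR(32),
  user_id      BIGINT
)
STORED AS PARQUET
LOCATION 's3://<<S3 Bucket>>/staging/promo_clicks/';

SELECT campaign_id, COUNT(*) AS clicks
FROM spectrum.promo_clicks
GROUP BY campaign_id;     -- no cluster resize needed to absorb the spike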

For tips on getting started with and optimizing the use of Redshift Spectrum, see the previous post, 10 Best Practices for Amazon Redshift Spectrum.

8. Monitor daily ETL health using diagnostic queries

Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following monitoring scripts can be used to provide insights into the health of your ETL processes:

For each script, the list below gives what it reports, when to use it, and the suggested solution:

  • commit_stats.sql – Commit queue statistics from past days, showing the largest queue length and queue time first. Use when: DML statements such as INSERT/UPDATE/COPY/DELETE operations take several times longer to execute when multiple of these operations are in progress. Solution: Set up separate WLM queues for the ETL process and limit the concurrency to fewer than 5.
  • copy_performance.sql – COPY command statistics for the past days. Use when: daily COPY operations take longer to execute. Solution: Follow the best practices for the COPY command; analyze data growth with the incoming datasets and consider a cluster resize to meet the expected SLA.
  • table_info.sql – Table skew and unsorted statistics along with storage and key information. Use when: transformation steps take longer to execute. Solution: Set up regular VACUUM jobs to address unsorted rows and reclaim the deleted blocks so that transformation SQL executes optimally; consider a table redesign to avoid data skew.
  • v_check_transaction_locks.sql – Monitor transaction locks. Use when: INSERT/UPDATE/COPY/DELETE operations on particular tables do not respond in a timely manner, compared to when they run after the ETL. Solution: Multiple DML statements are operating on the same target table at the same moment from different transactions; set up ETL job dependencies so that they execute serially for the same target table.
  • v_get_schema_priv_by_user.sql – Get the schemas that a user has access to. Use when: reporting users can view intermediate tables. Solution: Set up separate database groups for reporting and ETL users, and grant access to objects using GRANT.
  • v_generate_tbl_ddl.sql – Get the table DDL. Use when: you need to create an empty table with the same structure as the target table for data backfill. Solution: Generate the DDL using this script for the data backfill.
  • v_space_used_per_tbl.sql – Monitor space used by individual tables. Use when: Amazon Redshift data warehouse space growth is trending upwards more than normal. Solution: Analyze the individual tables that are growing at a higher rate than normal; consider data archival using UNLOAD to S3 and Redshift Spectrum for later analysis; use unscanned_table_summary.sql to find unused tables and archive or drop them.
  • top_queries.sql – Return the top 50 most time-consuming statements aggregated by their text. Use when: ETL transformations are taking longer to execute. Solution: Analyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.

There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of the monitoring of your ETL processes.
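
For a quick in-place check without downloading the utility scripts, a minimal sketch of a query against the SVV_TABLE_INFO system view surfaces the same warning signs (the 20% thresholds mirror the guidance earlier in this post):

SELECT "table", unsorted, stats_off, tbl_rows, skew_rows
FROM svv_table_info
WHERE unsorted > 20 OR stats_off > 20   -- tables needing VACUUM or ANALYZE
ORDER BY unsorted DESC;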

Example ETL process

The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Step 1: Extract from the RDBMS source to an S3 bucket

In this ETL process, the data extract job fetches change data every hour and stages it into multiple hourly files. For example, the staged S3 folder looks like the following:

$ aws s3 ls s3://<<S3 Bucket>>/batch/2017/07/02/
2017-07-02 01:59:58   81900220 20170702T01.export.gz
2017-07-02 02:59:56   84926844 20170702T02.export.gz
2017-07-02 03:59:54   78990356 20170702T03.export.gz
…
2017-07-02 22:00:03   75966745 20170702T21.export.gz
2017-07-02 23:00:02   89199874 20170702T22.export.gz
2017-07-02 00:59:59   71161715 20170702T23.export.gz

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. Further, the files are compressed (gzipped) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:

{
  "entries": [
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T01.export.gz", "mandatory":true},
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T02.export.gz", "mandatory":true},
    …
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T23.export.gz", "mandatory":true}
  ]
}

The data can be ingested using the following command:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3://<<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.

Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables

Data is staged in the “stage_tbl” from where it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables. The following job illustrates a typical weekly process:

BEGIN;
INSERT INTO ETL_LOG (..) VALUES (..);
DELETE FROM weekly_tbl WHERE dataset_week = <<current week>>;
INSERT INTO weekly_tbl (..)
  SELECT date_trunc('week', dataset_day) AS week_begin_dataset_date, SUM(C1) AS C1, SUM(C2) AS C2
  FROM   stage_tbl
  GROUP BY date_trunc('week', dataset_day);
INSERT INTO AUDIT_LOG VALUES (..);
COMMIT;

As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

unload ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>') TO 's3://<<S3 Bucket>>/datalake/weekly/20170526/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Summary

Amazon Redshift lets you easily operate petabyte-scale data warehouses in the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring, and showed these best practices applied to a typical sample ETL workload that transforms data and loads it into Amazon Redshift.

If you have questions or suggestions, please comment below.

 


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

[$] BPFd: Running BCC tools remotely across systems and architectures

Post Syndicated from corbet original https://lwn.net/Articles/744522/rss

BPF is an increasingly capable tool for instrumenting and tracing the
operation of the kernel; it has enabled the creation of the growing set of
BCC tools. Unfortunately, BCC has no support for a cross-development
workflow where the development machine and the target machine running the
developed code are different. Cross-development is favored by
embedded-systems kernel developers who tend to develop on an x86 host and
then flash and test their code on SoCs (System on Chips) based on the ARM
architecture. In this article, I introduce BPFd, a project to enable cross
development using BPF and BCC.

Cloud Babble: The Jargon of Cloud Storage

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/what-is-cloud-computing/

Cloud Babble

One of the things we in the technology business are good at is coming up with names, phrases, euphemisms, and acronyms for the stuff that we create. The Cloud Storage market is no different, and we’d like to help by illuminating some of the cloud storage related terms that you might come across. We know this is just a start, so please feel free to add in your favorites in the comments section below and we’ll update this post accordingly.

Clouds

The cloud is really just a collection of purpose built servers. In a public cloud the servers are shared between multiple unrelated tenants. In a private cloud, the servers are dedicated to a single tenant or sometimes a group of related tenants. A public cloud is off-site, while a private cloud can be on-site or off-site – or on-prem or off-prem, if you prefer.

Both Sides Now: Hybrid Clouds

Speaking of on-prem and off-prem, there are Hybrid Clouds or Hybrid Data Clouds depending on what you need. Both are based on the idea that you extend your local resources (typically on-prem) to the cloud (typically off-prem) as needed. This extension is controlled by software that decides, based on rules you define, what needs to be done where.

A Hybrid Data Cloud is specific to data. For example, you can set up a rule that says all accounting files that have not been touched in the last year are automatically moved off-prem to cloud storage. The files are still available; they are just no longer stored on your local systems. The rules can be defined to fit an organization’s workflow and data retention policies.

A Hybrid Cloud is similar to a Hybrid Data Cloud except it also extends compute. For example, at the end of the quarter, you can spin up order processing application instances off-prem as needed to add to your on-prem capacity. Of course, determining where the transactional data used and created by these applications resides can be an interesting systems design challenge.

Clouds in my Coffee: Fog

Typically, public and private clouds live in large buildings called data centers. Full of servers, networking equipment, and clean air, data centers need lots of power, lots of networking bandwidth, and lots of space. This often limits where data centers are located. The further away you are from a data center, the longer it generally takes to get your data to and from there. This is known as latency. That’s where “Fog” comes in.

Fog is often referred to as clouds close to the ground. Fog, in our cloud world, is basically having a “little” data center near you. This can make data storage and even cloud-based processing faster for everyone nearby. Data, and to a lesser extent processing, can be transferred between the Fog and the Cloud when time is less of a factor. Data could also be aggregated in the Fog and sent to the Cloud. For example, your electric meter could report its minute-by-minute status to the Fog for diagnostic purposes. Then once a day the aggregated data could be sent to the power company’s Cloud for billing purposes.

Another term used in place of Fog is Edge, as in computing at the Edge. In either case, a given cloud (data center) usually has multiple Edges (little data centers) connected to it. The connection between the Edge and the Cloud is sometimes known as the middle-mile. The network in the middle-mile can be less robust than that required to support a stand-alone data center. For example, the middle-mile can use 1 Gbps lines, versus a data center, which would require multiple 10 Gbps lines.

Heavy Clouds No Rain: Data

We’re all aware that we are creating, processing, and storing data faster than ever before. All of this data is stored in either a structured or more likely an unstructured way. Databases and data warehouses are structured ways to store data, but a vast amount of data is unstructured – meaning the schema and data access requirements are not known until the data is queried. A large pool of unstructured data in a flat architecture can be referred to as a Data Lake.

A Data Lake is often created so we can perform some type of “big data” analysis. In an oversimplified example, let’s extend the lake metaphor a bit and ask the question: “how many fish are in our lake?” To get an answer, we take a sufficient sample of our lake’s water (data), count the number of fish we find, and extrapolate based on the size of the lake to get an answer within a given confidence interval.

A Data Lake is usually found in the cloud, an excellent place to store large amounts of non-transactional data. Watch out as this can lead to our data having too much Data Gravity or being locked in the Hotel California. This could also create a Data Silo, thereby making a potential data Lift-and-Shift impossible. Let me explain:

  • Data Gravity — Generally, the more data you collect in one spot, the harder it is to move. When you store data in a public cloud, you have to pay egress and/or network charges to download the data to another public cloud or even to your own on-premise systems. Some public cloud vendors charge a lot more than others, meaning that depending on your public cloud provider, your data could financially have a lot more gravity than you expected.
  • Hotel California — This is like Data Gravity but to a lesser scale. Your data is in the Hotel California if, to paraphrase, “your data can check out any time you want, but it can never leave.” If the cost of downloading your data is limiting the things you want to do with that data, then your data is in the Hotel California. Data is generally most valuable when used, and with cloud storage that can include archived data. This assumes of course that the archived data is readily available, and affordable, to download. When considering a cloud storage project always figure in the cost of using your own data.
  • Data Silo — Over the years, businesses have suffered from organizational silos as information is not shared between different groups, but instead needs to travel up to the top of the silo before it can be transferred to another silo. If your data is “trapped” in a given cloud by the cost it takes to share such data, then you may have a Data Silo, and that’s exactly opposite of what the cloud should do.
  • Lift-and-Shift — This term is used to define the movement of data or applications from one data center to another or from on-prem to off-prem systems. The move generally occurs all at once and once everything is moved, systems are operational and data is available at the new location with few, if any, changes. If your data has too much gravity or is locked in a hotel, a data lift-and-shift may break the bank.

I Can See Clearly Now

Hopefully, the cloudy terms we’ve covered are, well, less cloudy. As we mentioned in the beginning, our compilation is just a start, so please feel free to add your favorite cloud term in the comments section below and we’ll update this post with your contributions. Keep your entries “clean,” and please no words or phrases that are really adverts for your company. Thanks.

The post Cloud Babble: The Jargon of Cloud Storage appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Wanted: Sales Engineer

Post Syndicated from Yev original https://www.backblaze.com/blog/wanted-sales-engineer/

At inception, Backblaze was a consumer company. Thousands upon thousands of individuals came to our website and gave us $5/mo to keep their data safe. But, we didn’t sell business solutions. It took us years before we had a sales team. In the last couple of years, we’ve released products that businesses of all sizes love: Backblaze B2 Cloud Storage and Backblaze for Business Computer Backup. Those businesses want to integrate Backblaze deeply into their infrastructure, so it’s time to hire our first Sales Engineer!

Company Description:
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable, low-cost cloud backup. Recently, we launched B2 – robust and reliable object storage at just $0.005/GB/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.

We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.

We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.

Some Backblaze Perks:

  • Competitive healthcare plans
  • Competitive compensation and 401k
  • All employees receive Option grants
  • Unlimited vacation days
  • Strong coffee
  • Fully stocked Micro kitchen
  • Catered breakfast and lunches
  • Awesome people who work on awesome projects
  • Childcare bonus
  • Normal work hours
  • Get to bring your pets into the office
  • San Mateo Office – located near Caltrain and Highways 101 & 280.

Backblaze B2 cloud storage is a building block for almost any computing service that requires storage. Customers need our help integrating B2 into everything from iOS apps to Docker containers. Some customers integrate directly with the API using the programming language of their choice; others want to solve a specific problem using ready-made software already integrated with B2.

At the same time, our computer backup product is deepening its integration into enterprise IT systems. We are commonly asked how to set Windows policies, integrate with Active Directory, and install the client via remote management tools.

We are looking for a sales engineer who can help our customers navigate the integration of Backblaze into their technical environments.

Are you 1/2” deep into many different technologies, and unafraid to dive deeper?

Can you confidently talk with customers about their technology, even if you have to look up all the acronyms right after the call?

Are you excited to set up complicated software in a lab and write knowledge base articles about your work?

Then Backblaze is the place for you!

Enough about Backblaze already, what’s in it for me?
In this role, you will be given the opportunity to learn about the technologies that drive innovation today; diverse technologies that customers are using day in and out. And more importantly, you’ll learn how to learn new technologies.

Just as an example, in the past 12 months, we’ve had the opportunity to learn and become experts in these diverse technologies:

  • How to setup VM servers for lab environments, both on-prem and using cloud services.
  • Create an automatically “resetting” demo environment for the sales team.
  • Setup Microsoft Domain Controllers with Active Directory and AD Federation Services.
  • Learn the basics of OAUTH and web single sign on (SSO).
  • Archive video workflows from camera to media asset management systems.
  • How to upload/download files from JavaScript by enabling CORS.
  • How to install and monitor online backup installations using RMM tools, like JAMF.
  • Tape (LTO) systems. (Yes – people still use tape for storage!)

How can I know if I’ll succeed in this role?

You have:

  • Confidence. Be able to ask customers questions about their environments and convey to them your technical acumen.
  • Curiosity. Always want to learn about customers’ situations, how they got there and what problems they are trying to solve.
  • Organization. You’ll work with customers, integration partners, and Backblaze team members on projects of various lengths. You can context switch and either have a great memory or keep copious notes. Your checklists have their own checklists.

You are versed in:

  • The fundamentals of Windows, Linux and Mac OS X operating systems. You shouldn’t be afraid to use a command line.
  • Building, installing, integrating and configuring applications on any operating system.
  • Debugging failures – reading logs, monitoring usage, effective google searching to fix problems excites you.
  • The basics of TCP/IP networking and the HTTP protocol.
  • Novice development skills in any programming/scripting language. Have basic understanding of data structures and program flow.
Your background contains:

  • Bachelor’s degree in computer science or the equivalent.
  • 2+ years of experience as a pre or post-sales engineer.
The right extra credit:
There are literally hundreds of previous experiences you can have had that would make you perfect for this job. Some experiences that we know would be helpful for us are below, but make sure you tell us your stories!

  • Experience using or programming against Amazon S3.
  • Experience with large on-prem storage – NAS, SAN, Object. And backing up data on such storage with tools like Veeam, Veritas and others.
  • Experience with photo or video media. Media archiving is a key market for Backblaze B2.
  • Program Arduinos to automatically feed your dog.
  • Experience programming against web or REST APIs. (Point us towards your projects, if they are open source and available to link to.)
  • Experience with sales tools like Salesforce.
  • 3D print door stops.
  • Experience with Windows Servers, Active Directory, Group policies and the like.
What’s it like working with the Sales team?
    The Backblaze sales team collaborates. We help each other out by sharing ideas, templates, and our customer’s experiences. When we talk about our accomplishments, there is no “I did this,” only “we”. We are truly a team.

    We are honest to each other and our customers and communicate openly. We aim to have fun by embracing crazy ideas and creative solutions. We try to think not outside the box, but with no boxes at all. Customers are the driving force behind the success of the company and we care deeply about their success.

    If this all sounds like you:

    1. Send an email to [email protected] with the position in the subject line.
    2. Tell us a bit about your Sales Engineering experience.
    3. Include your resume.

    The post Wanted: Sales Engineer appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

    Set Up a Continuous Delivery Pipeline for Containers Using AWS CodePipeline and Amazon ECS

    Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/set-up-a-continuous-delivery-pipeline-for-containers-using-aws-codepipeline-and-amazon-ecs/

    This post was contributed by Abby Fuller, AWS Senior Technical Evangelist.

    Last week, AWS announced support for Amazon Elastic Container Service (ECS) targets (including AWS Fargate) in AWS CodePipeline. This support makes it easier to create a continuous delivery pipeline for container-based applications and microservices.

    Building and deploying containerized services manually is slow and prone to errors. Continuous delivery with automated build and test mechanisms helps detect errors early, saves time, and reduces failures, making this a popular model for application deployments. Previously, to automate your container workflows with ECS, you had to build your own solution using AWS CloudFormation. Now, you can integrate CodePipeline and CodeBuild with ECS to automate your workflows in just a few steps.

    A typical continuous delivery workflow with CodePipeline, CodeBuild, and ECS might look something like the following:

    • Choosing your source
    • Building your project
    • Deploying your code

    We also have a continuous deployment reference architecture on GitHub for this workflow.

    Getting Started

    First, create a new project with CodePipeline and give the project a name, such as “demo”.

    Next, choose a source location where the code is stored. This could be AWS CodeCommit, GitHub, or Amazon S3. For this example, enter GitHub and then give CodePipeline access to the repository.

    Next, add a build step. You can import an existing build, such as a Jenkins server URL or CodeBuild project, or create a new step with CodeBuild. If you don’t have an existing build project in CodeBuild, create one from within CodePipeline:

    • Build provider: AWS CodeBuild
    • Configure your project: Create a new build project
    • Environment image: Use an image managed by AWS CodeBuild
    • Operating system: Ubuntu
    • Runtime: Docker
    • Version: aws/codebuild/docker:1.12.1
    • Build specification: Use the buildspec.yml in the source code root directory

    Now that you’ve created the CodeBuild step, you can use it as an existing project in CodePipeline.

    Next, add a deployment provider. This is where your built code is placed. It can be a number of different options, such as AWS CodeDeploy, AWS Elastic Beanstalk, AWS CloudFormation, or Amazon ECS. For this example, connect to Amazon ECS.

    For CodeBuild to deploy to ECS, you must create an image definition JSON file. This requires adding some instructions to the pre-build, build, and post-build phases of the CodeBuild build process in your buildspec.yml file. For help with creating the image definition file, see Step 1 of the Tutorial: Continuous Deployment with AWS CodePipeline.

    • Deployment provider: Amazon ECS
    • Cluster name: enter your project name from the build step
    • Service name: web
    • Image filename: enter your image definition filename (“web.json”).

    You are almost done!

    You can now choose an existing IAM service role that CodePipeline can use to access resources in your account, or let CodePipeline create one. For this example, use the wizard, and go with the role that it creates (AWS-CodePipeline-Service).

    Finally, review all of your changes, and choose Create pipeline.

    After the pipeline is created, you’ll have a model of your entire pipeline where you can view your executions, add different tests, add manual approvals, or release a change.

    You can learn more in the AWS CodePipeline User Guide.

    Happy automating!

    Now Open AWS EU (Paris) Region

    Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-open-aws-eu-paris-region/

    Today we are launching our 18th AWS Region, our fourth in Europe. Located in the Paris area, this Region lets AWS customers better serve their own customers in and around France.

    The Details
    The new EU (Paris) Region provides a broad suite of AWS services including Amazon API Gateway, Amazon Aurora, Amazon CloudFront, Amazon CloudWatch, CloudWatch Events, Amazon CloudWatch Logs, Amazon DynamoDB, Amazon Elastic Compute Cloud (EC2), EC2 Container Registry, Amazon ECS, Amazon Elastic Block Store (EBS), Amazon EMR, Amazon ElastiCache, Amazon Elasticsearch Service, Amazon Glacier, Amazon Kinesis Streams, Polly, Amazon Redshift, Amazon Relational Database Service (RDS), Amazon Route 53, Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), Amazon Simple Storage Service (S3), Amazon Simple Workflow Service (SWF), Amazon Virtual Private Cloud, Auto Scaling, AWS Certificate Manager (ACM), AWS CloudFormation, AWS CloudTrail, AWS CodeDeploy, AWS Config, AWS Database Migration Service, AWS Direct Connect, AWS Elastic Beanstalk, AWS Identity and Access Management (IAM), AWS Key Management Service (KMS), AWS Lambda, AWS Marketplace, AWS OpsWorks Stacks, AWS Personal Health Dashboard, AWS Server Migration Service, AWS Service Catalog, AWS Shield Standard, AWS Snowball, AWS Snowball Edge, AWS Snowmobile, AWS Storage Gateway, AWS Support (including AWS Trusted Advisor), Elastic Load Balancing, and VM Import.

    The Paris Region supports all sizes of C5, M5, R4, T2, D2, I3, and X1 instances.

    There are also four edge locations for Amazon Route 53 and Amazon CloudFront: three in Paris and one in Marseille, all with AWS WAF and AWS Shield. Check out the AWS Global Infrastructure page to learn more about current and future AWS Regions.

    The Paris Region will benefit from three AWS Direct Connect locations. Telehouse Voltaire is available today. AWS Direct Connect will also become available at Equinix Paris in early 2018, followed by Interxion Paris.

    All AWS infrastructure regions around the world are designed, built, and regularly audited to meet the most rigorous compliance standards and to provide high levels of security for all AWS customers. These include ISO 27001, ISO 27017, ISO 27018, SOC 1 (Formerly SAS 70), SOC 2 and SOC 3 Security & Availability, PCI DSS Level 1, and many more. This means customers benefit from all the best practices of AWS policies, architecture, and operational processes built to satisfy the needs of even the most security sensitive customers.

    AWS is certified under the EU-US Privacy Shield, and the AWS Data Processing Addendum (DPA) is GDPR-ready and available now to all AWS customers to help them prepare for May 25, 2018 when the GDPR becomes enforceable. The current AWS DPA, as well as the AWS GDPR DPA, allows customers to transfer personal data to countries outside the European Economic Area (EEA) in compliance with European Union (EU) data protection laws. AWS also adheres to the Cloud Infrastructure Service Providers in Europe (CISPE) Code of Conduct. The CISPE Code of Conduct helps customers ensure that AWS is using appropriate data protection standards to protect their data, consistent with the GDPR. In addition, AWS offers a wide range of services and features to help customers meet the requirements of the GDPR, including services for access controls, monitoring, logging, and encryption.

    From Our Customers
    Many AWS customers are preparing to use this new Region. Here’s a small sample:

    Societe Generale, one of the largest banks in France and the world, has accelerated their digital transformation while working with AWS. They developed SG Research, an application that makes reports from Societe Generale’s analysts available to corporate customers in order to improve the decision-making process for investments. The new AWS Region will reduce latency between applications running in the cloud and in their French data centers.

    SNCF is the national railway company of France. Their mobile app, powered by AWS, delivers real-time traffic information to 14 million riders. Extreme weather, traffic events, holidays, and engineering works can cause usage to peak at hundreds of thousands of users per second. They are planning to use machine learning and big data to add predictive features to the app.

    Radio France, the French public radio broadcaster, offers seven national networks, and uses AWS to accelerate its innovation and stay competitive.

    Les Restos du Coeur is a French charity that provides assistance to the needy, delivering food packages and helping with their social and economic integration back into French society. Les Restos du Coeur is using AWS for its CRM system to track the assistance given to each of their beneficiaries and the impact this is having on their lives.

    AlloResto by JustEat (a leader in the French FoodTech industry) is using AWS to scale during traffic peaks and to accelerate their innovation process.

    AWS Consulting and Technology Partners
    We are already working with a wide variety of consulting, technology, managed service, and Direct Connect partners in France. Here’s a partial list:

    AWS Premier Consulting Partners – Accenture, Capgemini, Claranet, CloudReach, DXC, and Edifixio.

    AWS Consulting Partners – ABC Systemes, Atos International SAS, CoreExpert, Cycloid, Devoteam, LINKBYNET, Oxalide, Ozones, Scaleo Information Systems, and Sopra Steria.

    AWS Technology Partners – Axway, Commerce Guys, MicroStrategy, Sage, Software AG, Splunk, Tibco, and Zerolight.

    AWS in France
    We have been investing in Europe, with a focus on France, for the last 11 years. We have also been developing documentation and training programs to help our customers to improve their skills and to accelerate their journey to the AWS Cloud.

    As part of our commitment to AWS customers in France, we plan to train more than 25,000 people in the coming years, helping them develop highly sought after cloud skills. They will have access to AWS training resources in France via AWS Academy, AWSome days, AWS Educate, and webinars, all delivered in French by AWS Technical Trainers and AWS Certified Trainers.

    Use it Today
    The EU (Paris) Region is open for business now and you can start using it today!

    Jeff;

     

    timeShift(GrafanaBuzz, 1w) Issue 26

    Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/12/15/timeshiftgrafanabuzz-1w-issue-26/

    Welcome to TimeShift

    Big news this week: Grafana v5.0 has been merged into master and is available in the nightly builds! We are really excited to share this with the community, and look forward to receiving community feedback (good or bad) on the new features and enhancements. As you see in the video below, there are some big changes that aim to improve workflow, team organization, permissions, and overall user experience. Check out the video below to see it in action, and give it a spin yourself.

    • New Grid Layout Engine: Make it easier to build dashboards and enable more complex layouts
    • Dashboard Folders & Permissions
    • User Teams
    • Improved Dashboard Settings UX
    • Improved Page Design and Navigation

    NOTE: That’s actually Torkel Odegaard, creator of Grafana shredding on the soundtrack!


    Latest Stable Release

    Grafana 4.6.3 is available and includes some bug fixes:

    • Gzip: Fixes bug with Gravatar images when gzip was enabled #5952
    • Alert list: Now shows alert state changes even after adding manual annotations on dashboard #99513
    • Alerting: Fixes bug where rules evaluated as firing when all conditions were false and using the OR operator. #93183
    • Cloudwatch: CloudWatch no longer displays metrics’ default alias #101514, thx @mtanda

    Download Grafana 4.6.3 Now


    From the Blogosphere

    Monitoring MySQL with Prometheus and Grafana: Julien Pivotto (who will be speaking at GrafanaCon EU), gave a great presentation last month on Monitoring MySQL with Prometheus and Grafana. You can also check out his slides.

    Monitor your Docker Containers: docker stats doesn’t often give you the level of insight you need to effectively manage your containers. This article discusses how to use cAdvisor, Prometheus, and Grafana to get a handle on your Docker performance.

    Magento Performance Monitoring with Grafana Dashboards and Alerts: This Christmas-themed post walks you through how to monitor the performance of Magento, start building dashboards, and setup Slack alerts, all while sitting in your rocking chair, sipping eggnog.

    Icinga Web2 and Grafana Working Together: This is a follow-up post about displaying service performance data from Icinga2 in Grafana. Now that we know how to list the services on a dashboard, it would be helpful to filter this list so that specific teams can know the status of services they specifically manage.

    Setup of sitespeed in AWS with Peter Hedenskog: In this video, Peter Hedenskog from Wikimedia and Stefan Judis set up a video call to go over setting up sitespeed in AWS. They create a fully functional Grafana dashboard, including web performance metrics from Stefan’s personal website running in the cloud.

    Deploying Grafana to Access Zabbix in Alibaba Cloud ECS: This article walks you through how to deploy Grafana on Alibaba Cloud ECS to access Zabbix to visualize performance data for your website or application.

    Let’s Summarize the Test Results with Grafana Annotations + Prometheus: The engineers of NTT Communications Corporation have created something of an Advent Calendar, with new posts each day. December 14th’s post focused on Grafana’s new annotation functionality via the UI and the API.


    New Speakers Added!

    We have added new speakers and talk titles to the lineup at grafanacon.org. Only a few are left to include, and they should be added in the next few days.

    Join us March 1-2, 2018 in Amsterdam for 2 days of talks centered around Grafana and the surrounding monitoring ecosystem including Graphite, Prometheus, InfluxData, Elasticsearch, Kubernetes, and many other topics.

    This year we have speakers from Bloomberg, CERN, Tinder, Red Hat, Prometheus, InfluxData, Fastly, Automattic, Percona, and more!

    Get Your Ticket Now


    Grafana Plugins

    This week we have a new plugin for the popular IoT platform DeviceHive, and an update to our own Kubernetes App. To install or update any plugin in an on-prem Grafana instance, use the Grafana-cli tool, or install and update with 1 click on Hosted Grafana.

    NEW PLUGIN

    DeviceHive is an IoT platform that now has a data source plugin, which means you can visualize live commands and notifications from a device.


    Install Now

    UPDATED PLUGIN

    Kubernetes App – The Grafana Kubernetes App allows you to monitor your Kubernetes cluster’s performance. It includes 4 dashboards, Cluster, Node, Pod/Container and Deployment, and also comes with Intel Snap collectors that are deployed to your cluster to collect health metrics.


    Update


    Upcoming Events:

    In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

    FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. Carl Bergquist is managing the Cloud and Monitoring Devroom, and we’ve heard there were some great talks submitted. There is no need to register; all are welcome.


    Tweet of the Week

    We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove


    Ok, ok – this tweet isn’t showing off a dashboard, but we can’t help but be thrilled when someone posts about our poster series. We’ll be working on the fourth poster to be unveiled at GrafanaCon EU!


    Grafana Labs is Hiring!

    We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

    Check out our Open Positions


    How are we doing?

    Let us know what you think about timeShift. Submit a comment on this article below, or post something at our community forum. Find an article I haven’t included? Send it my way. Help us make timeShift better!

    Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

    Now Open – AWS China (Ningxia) Region

    Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-open-aws-china-ningxia-region/

    Today we launched our 17th Region globally, and the second in China. The AWS China (Ningxia) Region, operated by Ningxia Western Cloud Data Technology Co. Ltd. (NWCD), is generally available now and provides customers another option to run applications and store data on AWS in China.

    The Details
    At launch, the new China (Ningxia) Region, operated by NWCD, supports Auto Scaling, AWS Config, AWS CloudFormation, AWS CloudTrail, Amazon CloudWatch, CloudWatch Events, Amazon CloudWatch Logs, AWS CodeDeploy, AWS Direct Connect, Amazon DynamoDB, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), Amazon EC2 Systems Manager, AWS Elastic Beanstalk, Amazon ElastiCache, Amazon Elasticsearch Service, Elastic Load Balancing, Amazon EMR, Amazon Glacier, AWS Identity and Access Management (IAM), Amazon Kinesis Streams, Amazon Redshift, Amazon Relational Database Service (RDS), Amazon Simple Storage Service (S3), Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), AWS Support API, AWS Trusted Advisor, Amazon Simple Workflow Service (SWF), Amazon Virtual Private Cloud, and VM Import. Visit the AWS China Products page for additional information on these services.

    The Region supports all sizes of C4, D2, M4, T2, R4, I3, and X1 instances.

    Check out the AWS Global Infrastructure page to learn more about current and future AWS Regions.

    Operating Partner
    To comply with China’s legal and regulatory requirements, AWS has formed a strategic technology collaboration with NWCD to operate and provide services from the AWS China (Ningxia) Region. Founded in 2015, NWCD is a licensed datacenter and cloud services provider, based in Ningxia, China. NWCD joins Sinnet, the operator of the AWS China (Beijing) Region, as an AWS operating partner in China. Through these relationships, AWS provides its industry-leading technology, guidance, and expertise to NWCD and Sinnet, while NWCD and Sinnet operate and provide AWS cloud services to local customers. While the cloud services offered in both AWS China Regions are the same as those available in other AWS Regions, the AWS China Regions are different in that they are isolated from all other AWS Regions and operated by AWS’s Chinese partners separately from all other AWS Regions. Customers using the AWS China Regions enter into customer agreements with Sinnet and NWCD, rather than with AWS.

    Use it Today
    The AWS China (Ningxia) Region, operated by NWCD, is open for business, and you can start using it now! Starting today, Chinese developers, startups, and enterprises, as well as government, education, and non-profit organizations, can leverage AWS to run their applications and store their data in the new AWS China (Ningxia) Region, operated by NWCD. Customers already using the AWS China (Beijing) Region, operated by Sinnet, can select the AWS China (Ningxia) Region directly from the AWS Management Console, while new customers can request an account at www.amazonaws.cn to begin using both AWS China Regions.

    Jeff;

     

     

    timeShift(GrafanaBuzz, 1w) Issue 25

    Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2017/12/08/timeshiftgrafanabuzz-1w-issue-25/

    Welcome to TimeShift

    This week, a few of us from Grafana Labs, along with 4,000 of our closest friends, headed down to chilly Austin, TX for KubeCon + CloudNativeCon North America 2017. We got to see a number of great talks and were thrilled to see Grafana make appearances in some of the presentations. We were also a sponsor of the conference and handed out a ton of swag (we overnighted some of our custom Grafana scarves, which came in handy for Thursday’s snow).

    We also announced Grafana Labs has joined the Cloud Native Computing Foundation as a Silver member! We’re excited to share our expertise in time series data visualization and open source software with the CNCF community.


    Latest Release

    Grafana 4.6.2 is available and includes some bug fixes:

    • Prometheus: Fixes bug with new Prometheus alerts in Grafana. Make sure to download this version if you’re using Prometheus for alerting. More details in the issue. #9777
    • Color picker: Bug after using textbox input field to change/paste color string #9769
    • Cloudwatch: build using golang 1.9.2 #9667, thanks @mtanda
    • Heatmap: Fixed tooltip for “time series buckets” mode #9332
    • InfluxDB: Fixed query editor issue when using > or < operators in WHERE clause #9871

    Download Grafana 4.6.2 Now


    From the Blogosphere

    Grafana Labs Joins the CNCF: Grafana Labs has officially joined the Cloud Native Computing Foundation (CNCF). We look forward to working with the CNCF community to democratize metrics and help unify traditionally disparate information.

    Automating Web Performance Regression Alerts: Peter and his team needed a faster and easier way to find web performance regressions at the Wikimedia Foundation. Grafana 4’s alerting features were exactly what they needed. This post covers their journey on setting up alerts for both RUM and synthetic testing and shares the alerts they’ve set up on their dashboards.

    How To Install Grafana on Ubuntu 17.10: As you probably guessed from the title, this article walks you through installing and configuring Grafana in the latest version of Ubuntu (or earlier releases). It also covers installing plugins using the Grafana CLI tool.

    Prometheus: Starting the Server with Alertmanager, cAdvisor and Grafana: Learn how to monitor Docker from scratch using cAdvisor, Prometheus and Grafana in this detailed, step-by-step walkthrough.

    Monitoring Java EE Servers with Prometheus and Payara: In this screencast, Adam uses firehose, a Java EE 7+ metrics gateway for Prometheus, to convert the JSON output into Prometheus statistics and visualize the data in Grafana.

    Monitoring Spark Streaming with InfluxDB and Grafana: This article focuses on how to monitor Apache Spark Streaming applications with InfluxDB and Grafana at scale.


    GrafanaCon EU, March 1-2, 2018

    We are currently reaching out to everyone who submitted a talk to GrafanaCon and will soon publish the final schedule at grafanacon.org.

    Join us March 1-2, 2018 in Amsterdam for 2 days of talks centered around Grafana and the surrounding monitoring ecosystem including Graphite, Prometheus, InfluxData, Elasticsearch, Kubernetes, and more.

    Get Your Ticket Now


    Grafana Plugins

    Lots of plugin updates and a new OpenNMS Helm App plugin to announce! To install or update any plugin in an on-prem Grafana instance, use the Grafana-cli tool, or install and update with 1 click on Hosted Grafana.

    NEW PLUGIN

    OpenNMS Helm App – The new OpenNMS Helm App plugin replaces the old OpenNMS data source. Helm allows users to create flexible dashboards using both fault management (FM) and performance management (PM) data from OpenNMS® Horizon™ and/or OpenNMS® Meridian™. The old data source is now deprecated.


    Install Now

    UPDATED PLUGIN

    PNP Data Source – This data source plugin (that uses PNP4Nagios to access RRD files) received a small, but important update that fixes template query parsing.


    Update

    UPDATED PLUGIN

    Vonage Status Panel – The latest version of the Status Panel comes with a number of small fixes and changes. Below are a few of the enhancements:

    • Threshold settings – removed Show Always option, and replaced it with 2 options:
      • Display Alias – Select when to show the metric alias.
      • Display Value – Select when to show the metric value.
    • Text format configuration (bold / italic) for warning / critical / disabled states.
    • Option to change the corner radius of the panel. Now you can change the panel’s shape to have rounded corners.

    Update

    UPDATED PLUGIN

    Google Calendar Plugin – This plugin received a small update, so be sure to install version 1.0.4.


    Update

    UPDATED PLUGIN

    Carpet Plot Panel – The Carpet Plot Panel received a fix for IE 11, and also added the ability to choose custom colors.


    Update


    Upcoming Events:

    In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

    Docker Meetup @ Tuenti | Madrid, Spain – Dec 12, 2017: Javier Provecho: Intro to Metrics with Swarm, Prometheus and Grafana

    Learn how to gain visibility in real time for your micro services. We’ll cover how to deploy a Prometheus server with persistence and Grafana, how to enable metrics endpoints for various service types (docker daemon, traefik proxy and postgres) and how to scrape, visualize and set up alarms based on those metrics.

    RSVP

    Grafana Lyon Meetup #2 | Lyon, France – Dec 14, 2017: This meetup will cover some of the latest innovations in Grafana and a discussion about automation. Also, free beer and chips, so of course you’re going!

    RSVP

    FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. Carl Bergquist is managing the Cloud and Monitoring Devroom, and we’ve heard there were some great talks submitted. There is no need to register; all are welcome.


    Tweet of the Week

    We scour Twitter each week to find an interesting/beautiful dashboard and show it off! #monitoringLove

    We were thrilled to see our dashboards bigger than life at KubeCon + CloudNativeCon this week. Thanks for snapping a photo and sharing!


    Grafana Labs is Hiring!

    We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

    Check out our Open Positions


    How are we doing?

    Hard to believe this is the 25th issue of timeShift! I have a blast writing these roundups, but let me know what you think. Submit a comment on this article below, or post something at our community forum. Find an article I haven’t included? Send it my way. Help us make timeShift better!

    Follow us on Twitter, like us on Facebook, and join the Grafana Labs community.

    About the Amazon Trust Services Migration

    Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/ses/669-2/

    Amazon Web Services is moving the certificates for our services—including Amazon SES—to use our own certificate authority, Amazon Trust Services. We have carefully planned this change to minimize the impact it will have on your workflow. Most customers will not have to take any action during this migration.

    About the Certificates

    The Amazon Trust Services Certificate Authority (CA) uses the Starfield Services CA, which has been valid since 2005. The Amazon Trust Services certificates are available in most major operating systems released in the past 10 years, and are also trusted by all modern web browsers.

    If you send email through the Amazon SES SMTP interface using a mail server that you operate, we recommend that you confirm that the appropriate certificates are installed. You can test whether your server trusts the Amazon Trust Services CAs by visiting the following URLs (for example, by using cURL):

    If you see a message stating that the certificate issuer is not recognized, then you should install the appropriate root certificate. You can download individual certificates from https://www.amazontrust.com/repository. The process of adding a trusted certificate to your server varies depending on the operating system you use. For more information, see “Adding New Root Certificates,” below.
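
    If you prefer to script this check, the following minimal Java sketch illustrates the same idea: it attempts a TLS handshake and reports whether the issuer is trusted. The endpoint shown is a placeholder, so substitute one of the test URLs, and note that the sketch checks the trust store of the JVM it runs in rather than your mail server's configuration.

        import javax.net.ssl.HttpsURLConnection;
        import javax.net.ssl.SSLHandshakeException;
        import java.net.URL;

        public class TrustCheck {
            public static void main(String[] args) throws Exception {
                // Placeholder endpoint for illustration; replace with an Amazon Trust Services test URL.
                URL url = new URL("https://www.amazontrust.com/repository");
                HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
                try {
                    conn.getResponseCode(); // forces the TLS handshake
                    System.out.println("Certificate issuer is trusted: " + conn.getPeerPrincipal());
                } catch (SSLHandshakeException e) {
                    System.out.println("Certificate issuer is not recognized; install the appropriate root certificate. (" + e.getMessage() + ")");
                } finally {
                    conn.disconnect();
                }
            }
        }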

    AWS SDKs and CLI

    Recent versions of the AWS SDKs and the AWS CLI are not impacted by this change. If you use an AWS SDK or a version of the AWS CLI released prior to February 5, 2015, you should upgrade to the latest version.

    Potential Issues

    If your system is configured to use a very restricted list of root CAs (for example, if you use certificate pinning), you may be impacted by this migration. In this situation, you must update your pinned certificates to include the Amazon Trust Services CAs.

    Adding New Root Certificates

    The following sections list the steps you can take to install the Amazon Root CA certificates on your systems if they are not already present.

    macOS

    To install a new certificate on a macOS server

    1. Download the .pem file for the certificate you want to install from https://www.amazontrust.com/repository.
    2. Change the file extension for the file you downloaded from .pem to .crt.
    3. At the command prompt, type the following command to install the certificate: sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain /path/to/certificatename.crt, replacing /path/to/certificatename.crt with the full path to the certificate file.

    Windows Server

    To install a new certificate on a Windows server

    1. Download the .pem file for the certificate you want to install from https://www.amazontrust.com/repository.
    2. Change the file extension for the file you downloaded from .pem to .crt.
    3. At the command prompt, type the following command to install the certificate: certutil -addstore -f "ROOT" c:\path\to\certificatename.crt, replacing c:\path\to\certificatename.crt with the full path to the certificate file.

    Ubuntu

    To install a new certificate on an Ubuntu (or similar) server

    1. Download the .pem file for the certificate you want to install from https://www.amazontrust.com/repository.
    2. Change the file extension for the file you downloaded from .pem to .crt.
    3. Copy the certificate file to the directory /usr/local/share/ca-certificates/
    4. At the command prompt, type the following command to update the certificate authority store: sudo update-ca-certificates

    Red Hat Enterprise Linux/Fedora/CentOS

    To install a new certificate on a Red Hat Enterprise Linux (or similar) server

    1. Download the .pem file for the certificate you want to install from https://www.amazontrust.com/repository.
    2. Change the file extension for the file you downloaded from .pem to .crt.
    3. Copy the certificate file to the directory /etc/pki/ca-trust/source/anchors/
    4. At the command line, type the following command to enable dynamic certificate authority configuration: sudo update-ca-trust force-enable
    5. At the command line, type the following command to update the certificate authority store: sudo update-ca-trust extract

    To learn more about this migration, see How to Prepare for AWS’s Move to Its Own Certificate Authority on the AWS Security Blog.

    Implementing Dynamic ETL Pipelines Using AWS Step Functions

    Post Syndicated from Tara Van Unen original https://aws.amazon.com/blogs/compute/implementing-dynamic-etl-pipelines-using-aws-step-functions/

    This post contributed by:
    Wangechi Dole, AWS Solutions Architect
    Milan Krasnansky, ING, Digital Solutions Developer, SGK
    Rian Mookencherry, Director – Product Innovation, SGK

    Data processing and transformation is a common use case you see in our customer case studies and success stories. Often, customers deal with complex data from a variety of sources that needs to be transformed and customized through a series of steps to make it useful to different systems and stakeholders. This can be difficult due to the ever-increasing volume, velocity, and variety of data. Today, data management challenges cannot be solved with traditional databases.

    Workflow automation helps you build solutions that are repeatable, scalable, and reliable. You can use AWS Step Functions for this. A great example is how SGK used Step Functions to automate the ETL processes for their client. With Step Functions, SGK has been able to automate changes within the data management system, substantially reducing the time required for data processing.

    In this post, SGK shares the details of how they used Step Functions to build a robust data processing system based on highly configurable business transformation rules for ETL processes.

    SGK: Building dynamic ETL pipelines

    SGK is a subsidiary of Matthews International Corporation, a diversified organization focusing on brand solutions and industrial technologies. SGK’s Global Content Creation Studio network creates compelling content and solutions that connect brands and products to consumers through multiple assets including photography, video, and copywriting.

    We were recently contracted to build a sophisticated and scalable data management system for one of our clients. We chose to build the solution on AWS to leverage advanced, managed services that help to improve the speed and agility of development.

    The data management system served two main functions:

    1. Ingesting a large amount of complex data to facilitate both reporting and product funding decisions for the client’s global marketing and supply chain organizations.
    2. Processing the data through normalization and applying complex algorithms and data transformations. The system goal was to provide information in the relevant context (such as strategic marketing, supply chain, and product planning) to the end consumer through automated data feeds or updates to existing ETL systems.

    We were faced with several challenges:

    • Output data that needed to be refreshed at least twice a day to provide fresh datasets to both local and global markets. That constant data refresh posed several challenges, especially around data management and replication across multiple databases.
    • The complexity of reporting business rules that needed to be updated on a constant basis.
    • Data that could not be processed as contiguous blocks of typical time-series data. The measurement of the data was done across seasons (that is, combinations of dates), which often resulted in up to three overlapping seasons at any given time.
    • Input data that came from 10+ different data sources. Each data source ranged from 1–20K rows with as many as 85 columns per input source.

    These challenges meant that our small dev team invested significant time in frequent configuration changes to the system and in data integrity verification to make sure that everything was operating properly. Maintaining this system proved to be a daunting task, and that’s when we turned to Step Functions, along with other AWS services, to automate our ETL processes.

    Solution overview

    Our solution included the following AWS services:

    • AWS Step Functions: Before Step Functions was available, we were using multiple Lambda functions for this use case and running into memory limit issues. With Step Functions, we can execute steps in parallel, in a cost-efficient manner, without running into memory limitations.
    • AWS Lambda: The Step Functions state machine uses Lambda functions to implement the Task states. Our Lambda functions are implemented in Java 8.
    • Amazon DynamoDB provides us with an easy and flexible way to manage business rules. We specify our rules as Keys. These are key-value pairs stored in a DynamoDB table.
    • Amazon RDS: Our ETL pipelines consume source data from our RDS MySQL database.
    • Amazon Redshift: We use Amazon Redshift for reporting purposes because it integrates with our BI tools. Currently we are using Tableau for reporting which integrates well with Amazon Redshift.
    • Amazon S3: We store our raw input files and intermediate results in S3 buckets.
    • Amazon CloudWatch Events: Our users expect results at a specific time. We use CloudWatch Events to trigger Step Functions on an automated schedule.

    Solution architecture

    This solution uses a declarative approach to defining business transformation rules that are applied by the underlying Step Functions state machine as data moves from RDS to Amazon Redshift. An S3 bucket is used to store intermediate results. A CloudWatch Event rule triggers the Step Functions state machine on a schedule. The following diagram illustrates our architecture:

    Here are more details for the above diagram:

    1. A rule in CloudWatch Events triggers the state machine execution on an automated schedule.
    2. The state machine invokes the first Lambda function.
    3. The Lambda function deletes all existing records in Amazon Redshift. Depending on the dataset, the Lambda function can create a new table in Amazon Redshift to hold the data.
    4. The same Lambda function then retrieves Keys from a DynamoDB table. Keys represent specific marketing campaigns or seasons and map to specific records in RDS.
    5. The state machine executes the second Lambda function using the Keys from DynamoDB.
    6. The second Lambda function retrieves the referenced dataset from RDS. The records retrieved represent the entire dataset needed for a specific marketing campaign.
    7. The second Lambda function executes in parallel for each Key retrieved from DynamoDB and stores the output in CSV format temporarily in S3.
    8. Finally, the Lambda function uploads the data into Amazon Redshift.

    To understand the above data processing workflow, take a closer look at the Step Functions state machine for this example.

    We walk you through the state machine in more detail in the following sections.

    Walkthrough

    To get started, you need to:

    • Create a schedule in CloudWatch Events
    • Specify conditions for RDS data extracts
    • Create Amazon Redshift input files
    • Load data into Amazon Redshift

    Step 1: Create a schedule in CloudWatch Events
    Create rules in CloudWatch Events to trigger the Step Functions state machine on an automated schedule. The following is an example cron expression to automate your schedule:
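
        cron(0 3,14 * * ? *)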

    In this example, the cron expression invokes the Step Functions state machine at 3:00am and 2:00pm (UTC) every day.

    Step 2: Specify conditions for RDS data extracts
    We use DynamoDB to store Keys that determine which rows of data to extract from our RDS MySQL database. An example Key is MCS2017, which stands for Marketing Campaign Spring 2017. Each campaign has a specific start and end date, and the corresponding dataset is stored in RDS MySQL. A record in RDS contains about 600 columns, and each Key can represent up to 20K records.

    A given day can have multiple campaigns with different start and end dates running simultaneously. A single DynamoDB item for a given date can, for example, specify three campaigns.

    The state machine example shown above uses Keys 31, 32, and 33 in the first ChoiceState and Keys 21 and 22 in the second ChoiceState. These keys represent marketing campaigns for a given day. For example, on Monday, there are only two campaigns requested. The ChoiceState with Keys 21 and 22 is executed. If three campaigns are requested on Tuesday, for example, then ChoiceState with Keys 31, 32, and 33 is executed. MCS2017 can be represented by Key 21 and Key 33 on Monday and Tuesday, respectively. This approach gives us the flexibility to add or remove campaigns dynamically.
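
    The DynamoDBDataService helper used by the first Lambda function (shown in Step 3) is not included in this post. As a rough sketch of how the current Keys might be read, assuming a hypothetical table named campaign_keys with a date partition key and a string-set attribute keys, the AWS SDK for Java Document API could be used like this:

        import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
        import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
        import com.amazonaws.services.dynamodbv2.document.DynamoDB;
        import com.amazonaws.services.dynamodbv2.document.Item;
        import com.amazonaws.services.dynamodbv2.document.Table;
        import java.time.LocalDate;
        import java.util.ArrayList;
        import java.util.List;

        public class DynamoDBDataService {
            public List<String> getCurrentKeys() {
                AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
                // Hypothetical table layout: one item per date, for example
                // { "date": "2017-12-01", "keys": ["MCS2017", "MCF2017", "XMAS2017"] }
                Table table = new DynamoDB(client).getTable("campaign_keys");
                Item item = table.getItem("date", LocalDate.now().toString());
                return item == null ? new ArrayList<>() : new ArrayList<>(item.getStringSet("keys"));
            }
        }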

    Step 3: Create Amazon Redshift input files
    When the state machine begins execution, the first Lambda function is invoked as the resource for FirstState, represented in the Step Functions state machine as follows:

    "Comment": "AWS Amazon States Language.",
    "StartAt": "FirstState",
    "States": {
      "FirstState": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Start",
        "Next": "ChoiceState"
      }

    As described in the solution architecture, the purpose of this Lambda function is to delete existing data in Amazon Redshift and retrieve keys from DynamoDB. In our use case, we found that deleting existing records was more efficient and less time-consuming than finding the delta and updating existing records. On average, an Amazon Redshift table can contain about 36 million cells, which translates to roughly 65K records. The following is the code snippet for the first Lambda function in Java 8:

    public class LambdaFunctionHandler implements RequestHandler<Map<String, Object>, Map<String, String>> {
        Map<String, String> keys = new HashMap<>();

        public Map<String, String> handleRequest(Map<String, Object> input, Context context) {
            Properties config = getConfig();
            // 1. Cleaning Redshift Database
            new RedshiftDataService(config).cleaningTable();
            // 2. Reading data from DynamoDB
            List<String> keyList = new DynamoDBDataService(config).getCurrentKeys();
            for (int i = 0; i < keyList.size(); i++) {
                keys.put("key" + (i + 1), keyList.get(i));
            }
            keys.put("keyT", String.valueOf(keyList.size()));
            // 3. Returning the key values and the key count from the "for" loop
            return keys;
        }
    }

    The following JSON represents ChoiceState.

    "ChoiceState": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.keyT",
          "StringEquals": "3",
          "Next": "CurrentThreeKeys"
        },
        {
          "Variable": "$.keyT",
          "StringEquals": "2",
          "Next": "CurrentTwoKeys"
        }
      ],
      "Default": "DefaultState"
    }

    The variable $.keyT represents the number of keys retrieved from DynamoDB. This variable determines which of the parallel branches should be executed. At the time of publication, Step Functions does not support dynamic parallel state. Therefore, choices under ChoiceState are manually created and assigned hardcoded StringEquals values. These values represent the number of parallel executions for the second Lambda function.

    For example, if $.keyT equals 3, the second Lambda function is executed three times in parallel, with the keys $.key1, $.key2, and $.key3 retrieved from DynamoDB. Similarly, if $.keyT equals 2, the second Lambda function is executed twice in parallel. The following JSON represents this parallel execution:

    "CurrentThreeKeys": {
      "Type": "Parallel",
      "Next": "NextState",
      "Branches": [
        {
          "StartAt": "key31",
          "States": {
            "key31": {
              "Type": "Task",
              "InputPath": "$.key1",
              "Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Execution",
              "End": true
            }
          }
        },
        {
          "StartAt": "key32",
          "States": {
            "key32": {
              "Type": "Task",
              "InputPath": "$.key2",
              "Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Execution",
              "End": true
            }
          }
        },
        {
          "StartAt": "key33",
          "States": {
            "key33": {
              "Type": "Task",
              "InputPath": "$.key3",
              "Resource": "arn:aws:lambda:xx-xxxx-x:XXXXXXXXXXXX:function:Execution",
              "End": true
            }
          }
        }
      ]
    }

    Step 4: Load data into Amazon Redshift
    The second Lambda function in the state machine extracts records from RDS associated with the keys retrieved from DynamoDB. It processes the data and then loads it into an Amazon Redshift table. The following is the code snippet for the second Lambda function in Java 8.

    public class LambdaFunctionHandler implements RequestHandler<String, String> {
        public static String key = null;

        public String handleRequest(String input, Context context) {
            key = input;
            // 1. Getting basic configuration for the next classes and the S3 client
            Properties config = getConfig();
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // 2. Export query results from RDS into the S3 bucket
            new RdsDataService(config).exportDataToS3(s3, key);
            // 3. Import query results from the S3 bucket into Amazon Redshift
            new RedshiftDataService(config).importDataFromS3(s3, key);
            System.out.println(input);
            return "SUCCESS";
        }
    }
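
    The RdsDataService and RedshiftDataService helpers are not shown either. As a minimal sketch of the final load step, assuming the Amazon Redshift JDBC driver is on the classpath and an IAM role can read the staging bucket (all connection details, bucket, table, and role names below are placeholders), the import boils down to issuing a COPY command:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class RedshiftImportSketch {
            public void importDataFromS3(String key) throws Exception {
                // Bulk-load the CSV file staged in S3 for this campaign key into Amazon Redshift.
                String copySql = "COPY campaign_data "
                               + "FROM 's3://example-intermediate-bucket/" + key + ".csv' "
                               + "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                               + "CSV";
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
                        "master_user", "master_password");
                     Statement stmt = conn.createStatement()) {
                    stmt.execute(copySql);
                }
            }
        }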

    After the data is loaded into Amazon Redshift, end users can visualize it using their preferred business intelligence tools.

    Lessons learned

    • At the time of publication, the 1.5 GB memory hard limit for Lambda functions was inadequate for processing our complex workload. Step Functions gave us the flexibility to chunk our large datasets and process them in parallel, saving on costs and time.
    • In our previous implementation, we assigned each key a dedicated Lambda function along with CloudWatch rules for schedule automation. This approach proved to be inefficient and quickly became an operational burden. Previously, we processed each key sequentially, with each key adding about five minutes to the overall processing time. For example, processing three keys meant that the total processing time was three times longer. With Step Functions, the entire state machine executes in about five minutes.
    • Using DynamoDB with Step Functions gave us the flexibility to manage keys efficiently. In our previous implementations, keys were hardcoded in Lambda functions, which became difficult to manage due to frequent updates. DynamoDB is a great way to store dynamic data that changes frequently, and it works perfectly with our serverless architectures.

    Conclusion

    With Step Functions, we were able to fully automate the frequent configuration updates to our dataset resulting in significant cost savings, reduced risk to data errors due to system downtime, and more time for us to focus on new product development rather than support related issues. We hope that you have found the information useful and that it can serve as a jump-start to building your own ETL processes on AWS with managed AWS services.

    For more information about how Step Functions makes it easy to coordinate the components of distributed applications and microservices in any workflow, see the use case examples and then build your first state machine in under five minutes in the Step Functions console.

    If you have questions or suggestions, please comment below.

    Collect Data Statistics Up to 5x Faster by Analyzing Only Predicate Columns with Amazon Redshift

    Post Syndicated from George Caragea original https://aws.amazon.com/blogs/big-data/collect-data-statistics-up-to-5x-faster-by-analyzing-only-predicate-columns-with-amazon-redshift/

    Amazon Redshift is a fast, fully managed, petabyte-scale data warehousing service that makes it simple and cost-effective to analyze all of your data. Many of our customers—including Boingo Wireless, Scholastic, Finra, Pinterest, and Foursquare—migrated to Amazon Redshift and achieved agility and faster time to insight, while dramatically reducing costs.

    Query optimization and the need for accurate estimates

    When a SQL query is submitted to Amazon Redshift, the query optimizer is in charge of generating all the possible ways to execute that query, and picking the fastest one. This can mean evaluating the cost of thousands, if not millions, of different execution plans.

    The plan cost is calculated based on estimates of the data characteristics. For example, the characteristics could include the number of rows in each base table, the average width of a variable-length column, the number of distinct values in a column, and the most common values in a column. These estimates (or “statistics”) are computed in advance by running an ANALYZE command, and stored in the system catalog.

    How do the query optimizer and ANALYZE work together?

    An ideal scenario is to run ANALYZE after every ETL/ingestion job. This way, when running your workload, the query optimizer can use up-to-date data statistics, and choose the most optimal execution plan, given the updates.

    However, running the ANALYZE command can add significant overhead to the data ingestion scripts. This can lead to customers not running ANALYZE on their data, and using default or stale estimates. The end result is usually the optimizer choosing a suboptimal execution plan that runs for longer than needed.

    Analyzing predicate columns only

    When you run a SQL query, the query optimizer requests statistics only on columns used in predicates in the SQL query (join predicates, filters in the WHERE clause and GROUP BY clauses). Consider the following query:

    SELECT Avg(salary), 
           Min(hiredate), 
           deptname 
    FROM   emp 
    WHERE  state = 'CA' 
    GROUP  BY deptname; 

    In the query above, the optimizer requests statistics only on columns ‘state’ and ‘deptname’, but not on ‘salary’ and ‘hiredate’. If present, statistics on columns ‘salary’ and ‘hiredate’ are ignored, as they do not impact the cost of the execution plans considered.

    Based on the optimizer functionality described earlier, the Amazon Redshift ANALYZE command has been updated to optionally collect information only about columns used in previous queries as part of a filter, join condition or a GROUP BY clause, and columns that are part of distribution or sort keys (predicate columns). There’s a recently introduced option for the ANALYZE command that only analyzes predicate columns:

    ANALYZE <table name> PREDICATE COLUMNS;

    By having Amazon Redshift collect information about predicate columns automatically, and analyzing those columns only, you’re able to reduce the time to run ANALYZE. For example, during the execution of the 99 queries in the TPC-DS workload, only 203 out of the 424 total columns are predicate columns (approximately 48%). By analyzing only the predicate columns for such a workload, the execution time for running ANALYZE can be significantly reduced.

    From my experience in the data warehousing space, I have observed that about 20% of columns in a typical use case are marked as predicate. In such a case, running ANALYZE PREDICATE COLUMNS analyzes roughly one-fifth of the columns, and can therefore lead to a speedup of up to 5x relative to a full ANALYZE run.

    If no information on predicate columns exists in the system (for example, a new table that has not been queried yet), ANALYZE PREDICATE COLUMNS collects statistics on all the columns. When queries on the table are run, Amazon Redshift collects information about predicate column usage, and subsequent runs of ANALYZE PREDICATE COLUMNS operate only on the predicate columns.

    If the workload is relatively stable, and the set of predicate columns does not expand continuously over time, I recommend replacing all occurrences of the ANALYZE command with ANALYZE PREDICATE COLUMNS commands in your application and data ingestion code.
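
    For example, if your ingestion code runs maintenance statements over JDBC after each load, the change is a one-line swap. The snippet below is only a sketch: the connection details are placeholders, it assumes the Amazon Redshift JDBC driver is available, and it reuses the emp table from the earlier example.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class PostLoadMaintenance {
            public static void main(String[] args) throws Exception {
                try (Connection conn = DriverManager.getConnection(
                        "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
                        "master_user", "master_password");
                     Statement stmt = conn.createStatement()) {
                    // Previously: stmt.execute("ANALYZE emp");
                    stmt.execute("ANALYZE emp PREDICATE COLUMNS");
                }
            }
        }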

    Using the Analyze/Vacuum utility

    Several AWS customers are using the Analyze/Vacuum utility from the Redshift-Utils package to manage and automate their maintenance operations. By passing the --predicate-cols option to the Analyze/Vacuum utility, you can enable it to use the ANALYZE PREDICATE COLUMNS feature, giving you the significant reduction in overhead in a completely seamless manner.

    Enhancements to logging for ANALYZE operations

    When running ANALYZE with the PREDICATE COLUMNS option, the type of analyze run (Full vs Predicate Column), as well as information about the predicate columns encountered, is logged in the stl_analyze view:

    SELECT status, 
           starttime, 
           prevtime, 
           num_predicate_cols, 
           num_new_predicate_cols 
    FROM   stl_analyze;
     status        |      starttime      |       prevtime      | num_predicate_cols | num_new_predicate_cols
     --------------+---------------------+---------------------+--------------------+------------------------
     Full          | 2017-11-09 01:15:47 |                     |                  0 |                      0
     PredicateCol  | 2017-11-09 01:16:20 | 2017-11-09 01:15:47 |                  2 |                      2

    AWS also enhanced the pg_statistic catalog table with two new pieces of information: the time stamp at which a column was marked as “predicate”, and the time stamp at which the column was last analyzed.

    The Amazon Redshift documentation provides a view that allows a user to easily see which columns are marked as predicate, when they were marked as predicate, and when a column was last analyzed. For example, for the emp table used above, the output of the view could be as follows:

     SELECT col_name, 
           is_predicate, 
           first_predicate_use, 
           last_analyze 
    FROM   predicate_columns 
    WHERE  table_name = 'emp';
    
     col_name | is_predicate | first_predicate_use  |        last_analyze
    ----------+--------------+----------------------+----------------------------
     id       | f            |                      | 2017-11-09 01:15:47
     name     | f            |                      | 2017-11-09 01:15:47
     deptname | t            | 2017-11-09 01:16:03  | 2017-11-09 01:16:20
     age      | f            |                      | 2017-11-09 01:15:47
     salary   | f            |                      | 2017-11-09 01:15:47
     hiredate | f            |                      | 2017-11-09 01:15:47
     state    | t            | 2017-11-09 01:16:03  | 2017-11-09 01:16:20

    Conclusion

    After loading new data into an Amazon Redshift cluster, statistics need to be re-computed to guarantee performant query plans. By learning which column statistics are actually being used by the customer’s workload and collecting statistics only on those columns, Amazon Redshift is able to significantly reduce the amount of time needed for table maintenance during data loading workflows.


    Additional Reading

    Be sure to check out the Top 10 Tuning Techniques for Amazon Redshift, and the Advanced Table Design Playbook: Distribution Styles and Distribution Keys.


    About the Author

    George Caragea is a Senior Software Engineer with Amazon Redshift. He has been working on MPP Databases for over 6 years and is mainly interested in designing systems at scale. In his spare time, he enjoys being outdoors and on the water in the beautiful Bay Area and finishing the day exploring the rich local restaurant scene.