Tag Archives: hp

Security updates for Friday

Post Syndicated from jake original https://lwn.net/Articles/746988/rss

Security updates have been issued by Arch Linux (clamav), Debian (mailman, mpv, and simplesamlphp), Fedora (tomcat-native), openSUSE (docker, docker-runc, containerd, kernel, mupdf, and python-mistune), Red Hat (kernel), and Ubuntu (mailman and postgresql-9.3, postgresql-9.5, postgresql-9.6).

All-In on Unlimited Backup

Post Syndicated from Gleb Budman original https://www.backblaze.com/blog/all-in-on-unlimited-backup/

The cloud backup industry has seen its share of turmoil. BitCasa, Dell DataSafe, Xdrive, and a dozen others have closed up shop. Mozy, Amazon, and Microsoft offered, but later canceled, their unlimited plans. Recently, CrashPlan for Home customers were notified that their service was being end-of-lifed. And today we’ve heard from Carbonite customers frustrated by this morning’s announcement of a price increase.

We believe that the fundamental goal of a cloud backup is having peace-of-mind: knowing your data — all of it — is safe. For over 10 years Backblaze has been providing that peace-of-mind by offering completely unlimited cloud backup to our customers. And we continue to be committed to that. Knowing that your cloud backup vendor is not going to disappear or fundamentally change their service is an essential element in achieving that peace-of-mind.

Committed to Unlimited Backup

When Mozy discontinued their unlimited backup on Jan 31, 2011, a lot of people asked, “Does this mean Backblaze will discontinue theirs as well?” At that time I wrote the blog post Backblaze is committed to unlimited backup. That was seven years ago. Since then we’ve continued to make Backblaze cloud backup better: dramatically speeding up backups and restores, offering the unique and very popular Restore Return Refund program, enabling direct access and sharing of any file in your backup, and more. We also introduced Backblaze Groups to enable businesses and families to manage backups — all at no additional cost.

How That’s Possible

I’d like to answer the question, “How have you been able to do this when others haven’t?”

First, commitment. It’s not impossible to offer unlimited cloud backup, but it’s not easy. The Backblaze team has been committed to unlimited as a core tenet.

Second, we have pursued the technical, business, and cultural steps required to make it happen. We’ve designed our own servers, written our cloud storage software, run our own operations, and been continually focused on every place we could optimize a penny out of the cost of storage. We’ve built a culture at Backblaze that cares deeply about that.

Ensuring Peace-of-Mind

Price increases and plan changes happen in our industry, but Backblaze has consistently been the low price leader, and continues to stand by the foundational element of our service — truly unlimited backup storage. Carbonite just announced a price increase from $60 to $72/year, and while that’s not an astronomical increase, it’s important to keep in mind the service that they are providing at that rate. The basic Carbonite plan provides a service that doesn’t back up videos or external hard drives by default. We think that’s dangerous. No one wants to discover that their videos weren’t backed up after their computer dies, or have to worry about the safety and durability of their data. That is why we have continued to build on our foundation of unlimited, as well as making our service faster and more accessible. All of these serve the goal of ensuring peace-of-mind for our customers.

3 Months Free For You & A Friend

As part of our commitment to unlimited, you can refer your friends through March 15, 2018 and receive three months of Backblaze service. When you Refer-a-Friend with your personal referral link and they subscribe, both of you will receive three months of service added to your account. See promotion details on our Refer-a-Friend page.

Want A Reminder When Your Carbonite Subscription Runs Out?

If you’re considering switching from Carbonite, we’d love to be your new backup provider. Enter your email and the date you’d like to be reminded in the form below and you’ll get a friendly reminder email from us to start a new backup plan with Backblaze. Or, you could start a free trial today.

We think you’ll be glad you switched, and you’ll have a chance to experience some of that Backblaze peace-of-mind for your data.

Please Send Me a Reminder When I Need a New Backup Provider

The post All-In on Unlimited Backup appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Tuesday

Post Syndicated from ris original https://lwn.net/Articles/746701/rss

Security updates have been issued by Debian (xen), Fedora (clamav, community-mysql, dnsmasq, flatpak, libtasn1, mupdf, p7zip, rsync, squid, thunderbird, tomcat, unbound, and zziplib), Mageia (clamav, curl, dovecot, ffmpeg, gcab, kernel, libtiff, libvpx, php-smarty, pure-ftpd, redis, and thunderbird), openSUSE (apache-commons-email), Red Hat (rh-mariadb100-mariadb), SUSE (firefox), and Ubuntu (clamav, squid3, and systemd).

Security updates for Thursday

Post Syndicated from jake original https://lwn.net/Articles/746078/rss

Security updates have been issued by Debian (chromium-browser, krb5, and smarty3), Fedora (firefox, GraphicsMagick, and moodle), Mageia (rsync), openSUSE (bind, chromium, freeimage, gd, GraphicsMagick, libtasn1, libvirt, nodejs6, php7, systemd, and webkit2gtk3), Red Hat (chromium-browser, systemd, and thunderbird), Scientific Linux (systemd), and Ubuntu (curl, firefox, and ruby2.3).

500 Petabytes And Counting

Post Syndicated from Yev original https://www.backblaze.com/blog/500-petabytes-and-counting/

500 Petabytes = 500,000,000 Gigabytes

It seems like only yesterday that we crossed the 350 petabyte mark. It was actually June 2017, but boy have we been growing since. In October 2017 we crossed 400 petabytes. Today, we’re proud to announce we’ve crossed the 500 petabyte mark. That’s a very healthy clip!

Whether you have 50 GB, 500 GB or are just an avid blog reader, thank you for being on this incredible journey with us through the years.

We’re extremely proud of our track record. Throughout these 11 years we’ve striven to be the simplest, fastest, and most affordable online backup (and now cloud storage) solution available. We’re not just focusing on data ingress, but also adhering to our original goal of making sure that “no one ever loses data again.” How quickly are we restoring data? On average, we’re literally moving at 1,000,000 files per hour.

Even after all these years, one of the most frequent questions asked is, “How has Backblaze maintained such affordable pricing, particularly when the industry continues to move away from unlimited data plans?”

The cloud storage industry is very competitive, with cloud sync, storage, and backup providers leaving the unlimited market every single day: OneDrive, Amazon Cloud Storage, and most recently CrashPlan. Other providers either have tiered pricing (iDrive), or charge almost double or even triple for all the features we provide for our unlimited backup service (Carbonite). So how do we do it?

The answer comes down to our relentless pursuit of lowering costs. Our open-source Backblaze Storage Pods comprise our Backblaze Vaults, and the less expensive and more performant our Storage Pods are, the better the service that we can provide. This all directly translates into the service and pricing we can offer you.

A key part of our service is to be as open as possible with our costs and structure. After all, you are entrusting us with some of your most valuable assets. Still, it is very difficult to find an apples-to-apples comparison with what our competitors are doing. For example, in a 2011 interview, Carbonite’s CEO said that Carbonite’s cost of storing a petabyte was $250,000. At the time, our cost to store a petabyte was $76,481 (more on that calculation can be found here and here). If Backblaze’s fundamental cost to store data is one-third that of Carbonite’s, it makes sense that Carbonite’s cost to its customers would be more than Backblaze’s. Today, Backblaze backup is $50/year and Carbonite’s equivalent service is $149.99.

Our continued focus on reducing costs has allowed us to maintain a healthy business. And after accepting customer data for almost 10 years, we sincerely want to thank you all for giving us your trust, and allowing us to protect your important data and memories for you. Here’s to the next 500 petabytes; they’ll be here before we know it.


Update 2/5/18

Since publishing this post, we have posted the latest in our series of Hard Drive Stats, in which we summarize the performance of the hard drives we used in our data centers in 2017 and previously.

The post 500 Petabytes And Counting appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

2018-01-28 hammers

Post Syndicated from Vasil Kolev original https://vasil.ludost.net/blog/?p=3377

“Don’t force it, get a bigger hammer”

I’ve been meaning to put together a debugging workshop for a long time, and while thinking about how exactly to do it, today I reached an interesting conclusion about the tools I use and build for debugging purposes, and about some of my ways of working in general.

The hammer is a fine thing. Whatever problem you have, after a blow with the hammer the result looks the same (flattened), and that strikes me as a good metaphor for the way I fix certain problems. It can be described as “the shortest and simplest way to reach the required final state, without much regard for what the initial state was.”

For example, these past few days I had to replace a piece of software in some fifty clusters, each with between 3 and 50 machines. Since the tools I have are pssh and pscp, it turned out easiest to copy the needed files to all the servers in one pass, and in a second pass to have pssh log in and, where needed, copy them into place, or otherwise simply delete what I had copied. A tidier approach would have been to pull a list of all the machines where the action was actually needed and do it only there, but writing and running that would have been slower than the rough and fast way.

Similarly, for another tool I had written one script that deploys it to a whole cluster and a separate one that updates it. At some point I realized this was silly and made the installer not care whether something is already installed, so it can simply lay its files down over the top (and, if I interrupt it and run it again, it still does the work that’s needed). The end result was that the total amount of code shrank.

The principle seems to apply to my favorite ways of debugging as well: the one I use most often is strace, which can safely be described as one of the heaviest debugging hammers. Almost regardless of what I’m debugging (compiled C code, php, python, perl, java), I manage to see the symptoms and get a sense of what is going on, even though for each of those languages there is generally a specialized and probably much gentler way to watch what’s happening.
(I should note that there are other heavy cases too: I have a colleague who, to evaluate some mathematical expression, will occasionally start gdb instead of a calculator like bc and do something like “p 1024*1024*231/1.1” in it)

I had started to wonder whether this is actually wrong and should be avoided, and I came to the conclusion that I don’t see another workable approach. Very often we have to debug someone else’s code (code we’ve linked against, code that sits somewhere beneath us, code we depend on, or simply whatever got dumped on us), and reading and understanding it is not an option, because these days there are almost no projects that can be read and absorbed in under a week or two (the record-small codebase that ran the core services at one of the companies I’ve worked for was about 20,000 lines, which is roughly within human capacity, and even that would take quite a while to go through, and that company was a serious exception in this respect). This creates the need for all kinds of auxiliary tools so we can cope, since the human head has serious limitations here, and this is where the hammers come to our aid, the ones that can reduce every problem to a nail (or to a cockroach that just needs to be whacked hard enough).

(not to mention that people like to write clever code, and the cleverer they write, the harder it is to debug what they’ve created)

The Floodgates Are Open – Increased Network Bandwidth for EC2 Instances

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/the-floodgates-are-open-increased-network-bandwidth-for-ec2-instances/

I hope that you have configured your AMIs and your current-generation EC2 instances to use the Elastic Network Adapter (ENA) that I told you about back in mid-2016. The ENA gives you high throughput and low latency, while minimizing the load on the host processor. It is designed to work well in the presence of multiple vCPUs, with intelligent packet routing backed up by multiple transmit and receive queues.

Today we are opening up the floodgates and giving you access to more bandwidth in all AWS Regions. Here are the specifics (in each case, the actual bandwidth is dependent on the instance type and size):

EC2 to S3 – Traffic to and from Amazon Simple Storage Service (S3) can now take advantage of up to 25 Gbps of bandwidth. Previously, traffic of this type had access to 5 Gbps of bandwidth. This will be of benefit to applications that access large amounts of data in S3 or that make use of S3 for backup and restore.

EC2 to EC2 – Traffic to and from EC2 instances in the same or different Availability Zones within a region can now take advantage of up to 5 Gbps of bandwidth for single-flow traffic, or 25 Gbps of bandwidth for multi-flow traffic (a flow represents a single, point-to-point network connection) by using private IPv4 or IPv6 addresses, as described here.

EC2 to EC2 (Cluster Placement Group) – Traffic to and from EC2 instances within a cluster placement group can continue to take advantage of up to 10 Gbps of lower-latency bandwidth for single-flow traffic, or 25 Gbps of lower-latency bandwidth for multi-flow traffic.

To take advantage of this additional bandwidth, make sure that you are using the latest, ENA-enabled AMIs on current-generation EC2 instances. ENA-enabled AMIs are available for Amazon Linux, Ubuntu 14.04 & 16.04, RHEL 7.4, SLES 12, and Windows Server (2008 R2, 2012, 2012 R2, and 2016). The FreeBSD AMI in AWS Marketplace is also ENA-enabled, as is VMware Cloud on AWS.

Jeff;

Security updates for Friday

Post Syndicated from jake original https://lwn.net/Articles/745493/rss

Security updates have been issued by CentOS (389-ds-base, dhcp, kernel, and nautilus), Debian (curl, openssh, and wireshark), Fedora (clamav, firefox, java-9-openjdk, and poco), Gentoo (clamav), openSUSE (curl, libevent, mupdf, mysql-community-server, newsbeuter, php5, redis, and tre), Oracle (389-ds-base, dhcp, kernel, and nautilus), Slackware (mozilla), and Ubuntu (kernel and linux-hwe, linux-azure, linux-gcp, linux-oem).

Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas to simple denormalized tables, for running analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take Amazon Redshift’s architecture into account. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues in the long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:

  • COPY data from multiple, evenly sized files.
  • Use workload management to improve ETL runtimes.
  • Perform table maintenance regularly.
  • Perform multiple steps in a single transaction.
  • Load data in bulk.
  • Use UNLOAD to extract large result sets.
  • Use Amazon Redshift Spectrum for ad hoc ETL processing.
  • Monitor daily ETL health using diagnostic queries.

1. COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. For example, if a single large file is loaded into a two-node cluster, only one of the nodes, “Compute-0”, ends up performing all the data ingestion.

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources and the quickest possible throughput.
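
As a rough sketch of this pattern (the staging table name is hypothetical; the bucket placeholder and IAM role follow the conventions used later in this post), a single COPY pointed at a common key prefix loads every matching file and lets Amazon Redshift spread the work across all slices:

-- One COPY for the whole batch: every object under the prefix is loaded in parallel
COPY orders_staging
FROM 's3://<<S3 Bucket>>/batch/2017/07/02/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
GZIP;    -- the load files were individually gzipped before upload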

2. Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.

I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.

When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:

  • Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.
  • Claim extra memory available in a queue. When executing an ETL query, you can take advantage of the wlm_query_slot_count setting to claim the extra memory available in a particular queue (see the sketch after this list). For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), wlm_query_slot_count can be increased for this step.
  • Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.
  • Take advantage of the dynamic memory parameters to shift memory from your ETL queue to your reporting queue after the ETL job has completed.
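
A minimal sketch of the slot-count idea from the bullets above (the slot numbers and staging table are illustrative, not from the original post): claim extra slots in the ETL queue for a heavy COPY, then release them afterwards.

-- Claim 3 of the ETL queue's 5 slots for this session (illustrative numbers)
SET wlm_query_slot_count TO 3;
COPY stage_orders
FROM 's3://<<S3 Bucket>>/batch/2017/07/02/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
GZIP;
-- Return to a single slot so other ETL steps can run concurrently
SET wlm_query_slot_count TO 1;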

3. Perform table maintenance regularly

Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables are regularly VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and run VACUUM and ANALYZE on a regular schedule.

  • Use VACUUM to sort tables and remove deleted blocks

During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.

DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.

After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Utils table_info script.

Use the following approaches to ensure that VACUUM is completed in a timely manner:

  • Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.
  • DROP or TRUNCATE intermediate or staging tables, thereby eliminating the need to VACUUM them.
  • If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.
  • Consider using time series tables. This helps reduce the amount of data you need to VACUUM.

  • Use ANALYZE to update database statistics

Amazon Redshift uses a cost-based query planner and optimizer that relies on statistics about tables to make good decisions about the query plan for SQL statements. Collecting statistics regularly after the ETL completes ensures that user queries run fast and that daily ETL processes are performant. The Amazon Redshift Utils table_info script provides insights into the freshness of the statistics. Keeping pct_stats_off (the percentage of stale statistics) below 20% ensures effective query plans for your SQL queries.
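
For reference, a minimal maintenance pass on a single table looks like this (the table name is a placeholder); scheduling these after the ETL load keeps both scans and query plans predictable:

VACUUM daily_table;     -- re-sort rows and reclaim space left by deleted blocks
ANALYZE daily_table;    -- refresh the table statistics the query planner relies on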

4. Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be wrapped in a BEGIN…END block so that a single commit is performed only after all the transformation logic has been executed. For example, here is a multi-step ETL script that performs one commit at the end:

BEGIN;
CREATE TEMPORARY TABLE staging_table (..);                -- working table for this run
INSERT INTO staging_table SELECT .. FROM source;          -- transformation logic
DELETE FROM daily_table WHERE dataset_date = ?;
INSERT INTO daily_table SELECT .. FROM staging_table;     -- daily aggregate
DELETE FROM weekly_table WHERE weekending_date = ?;
INSERT INTO weekly_table SELECT .. FROM staging_table;    -- weekly aggregate
COMMIT;

5. Load data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

  • Use a manifest file to ingest large datasets that span multiple files. The manifest file is a JSON file that lists all the files to be loaded into Amazon Redshift. Using a manifest file ensures that Amazon Redshift has a consistent view of the data to be loaded from S3, while also ensuring that duplicate files do not result in the same data being loaded more than one time.
  • Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.
  • Use ALTER TABLE APPEND to swap data from the staging tables to the target table (see the sketch after this list). Data in the source table is moved to matching columns in the target table. Column order doesn’t matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn’t involve copying or moving data.
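
As a rough illustration of the staging-then-append pattern (the table names are hypothetical, and the staging table is created as a permanent table since ALTER TABLE APPEND requires one), the staging table copies the target’s column definitions so the append can move data blocks directly:

CREATE TABLE events_staging (LIKE events);        -- inherits columns, distribution style, and sort keys
COPY events_staging
FROM 's3://<<S3 Bucket>>/batch/manifest20170702.json'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
MANIFEST GZIP;
ALTER TABLE events APPEND FROM events_staging;    -- moves the staged blocks into the target table
DROP TABLE events_staging;                        -- the staging table is empty after the append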

6. Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. With a large SELECT statement, the leader node ends up doing most of the work to stream out the rows.

Use UNLOAD to extract large result sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should make use of the MAXFILESIZE parameter and keep files around 150 MB (see the sketch below). Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.
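
A hedged example of such an UNLOAD (the S3 prefix is a placeholder in the style used elsewhere in this post, and the 150 MB figure simply follows the guidance above):

UNLOAD ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>')
TO 's3://<<S3 Bucket>>/spectrum/weekly/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
MAXFILESIZE 150 MB    -- keep each output file near the size Redshift Spectrum handles well
GZIP;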

7. Use Redshift Spectrum for ad hoc ETL processing

Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.

For tips on getting started with and optimizing the use of Redshift Spectrum, see the previous post, 10 Best Practices for Amazon Redshift Spectrum.
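
As a sketch of the Spectrum pattern (the external schema, Glue database, external table, columns, and the local dim_region table are all hypothetical), staged data in S3 can be registered once and then queried, or joined with local tables, without loading it into the cluster first:

-- Register an external schema backed by the AWS Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_stage
FROM DATA CATALOG
DATABASE 'etl_spillover'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Describe the staged backfill files that already sit in S3
CREATE EXTERNAL TABLE spectrum_stage.clickstream_backfill (
  event_time  timestamp,
  region_id   int,
  user_id     bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://<<S3 Bucket>>/spectrum/clickstream/';

-- Query the spike data in place, joined with a local dimension table
SELECT d.region, COUNT(*) AS events
FROM spectrum_stage.clickstream_backfill f
JOIN dim_region d ON d.region_id = f.region_id
GROUP BY d.region;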

8. Monitor daily ETL health using diagnostic queries

Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following monitoring scripts can be used to provide insights into the health of your ETL processes:

commit_stats.sql – Commit queue statistics from past days, showing largest queue length and queue time first.
Use when: DML statements such as INSERT/UPDATE/COPY/DELETE operations take several times longer to execute when multiple of these operations are in progress.
Solution: Set up separate WLM queues for the ETL process and limit the concurrency to < 5.

copy_performance.sql – COPY command statistics for the past days.
Use when: Daily COPY operations take longer to execute.
Solution: Follow the best practices for the COPY command. Analyze data growth with the incoming datasets and consider a cluster resize to meet the expected SLA.

table_info.sql – Table skew and unsorted statistics along with storage and key information.
Use when: Transformation steps take longer to execute.
Solution: Set up regular VACUUM jobs to address unsorted rows and reclaim the deleted blocks so that transformation SQL executes optimally. Consider a table redesign to avoid data skew.

v_check_transaction_locks.sql – Monitor transaction locks.
Use when: INSERT/UPDATE/COPY/DELETE operations on particular tables do not respond in a timely manner, compared to when run after the ETL.
Solution: Multiple DML statements are operating on the same target table at the same moment from different transactions. Set up ETL job dependencies so that they execute serially for the same target table.

v_get_schema_priv_by_user.sql – Get the schemas that a user has access to.
Use when: Reporting users can view intermediate tables.
Solution: Set up separate database groups for reporting and ETL users, and grant access to objects using GRANT.

v_generate_tbl_ddl.sql – Get the table DDL.
Use when: You need to create an empty table with the same structure as the target table for data backfill.
Solution: Generate the DDL using this script for the data backfill.

v_space_used_per_tbl.sql – Monitor space used by individual tables.
Use when: Amazon Redshift data warehouse space growth is trending upward more than normal.
Solution: Analyze the individual tables that are growing at a higher rate than normal. Consider data archival using UNLOAD to S3 and Redshift Spectrum for later analysis. Use unscanned_table_summary.sql to find unused tables and archive or drop them.

top_queries.sql – Return the top 50 most time-consuming statements, aggregated by their text.
Use when: ETL transformations are taking longer to execute.
Solution: Analyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.

There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of the monitoring of your ETL processes.

Example ETL process

The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Step 1:  Extract from the RDBMS source to an S3 bucket

In this ETL process, the data extract job fetches change data every hour and stages it into multiple hourly files. For example, the staged S3 folder looks like the following:

$ aws s3 ls s3://<<S3 Bucket>>/batch/2017/07/02/
2017-07-02 01:59:58   81900220 20170702T01.export.gz
2017-07-02 02:59:56   84926844 20170702T02.export.gz
2017-07-02 03:59:54   78990356 20170702T03.export.gz
…
2017-07-02 22:00:03   75966745 20170702T21.export.gz
2017-07-02 23:00:02   89199874 20170702T22.export.gz
2017-07-02 00:59:59   71161715 20170702T23.export.gz

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. The files are also compressed (gzipped) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:

{
  "entries": [
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T01.export.gz", "mandatory":true},
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T02.export.gz", "mandatory":true},
    …
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T23.export.gz", "mandatory":true}
  ]
}

The data can be ingested using the following command:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3://<<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.

Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables

Data is staged in the “stage_tbl” from where it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables. The following job illustrates a typical weekly process:

BEGIN;
INSERT INTO ETL_LOG (..) VALUES (..);                 -- record the start of the weekly load
DELETE FROM weekly_tbl WHERE dataset_week = <<current week>>;
INSERT INTO weekly_tbl (..)
  SELECT date_trunc('week', dataset_day) AS week_begin_dataset_date, SUM(C1) AS C1, SUM(C2) AS C2
  FROM   stage_tbl
  GROUP BY date_trunc('week', dataset_day);
INSERT INTO AUDIT_LOG VALUES (..);                    -- record completion for auditing
COMMIT;

As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

unload ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>') TO 's3://<<S3 Bucket>>/datalake/weekly/20170526/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Summary

Amazon Redshift lets you easily operate petabyte-scale data warehouses in the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring, and walked through a sample ETL workload that applies these best practices when loading data into Amazon Redshift.

If you have questions or suggestions, please comment below.

 


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

 

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/744905/rss

Security updates have been issued by Debian (bind9, couchdb, lucene-solr, mysql-5.5, openocd, and php5), Mageia (gdk-pixbuf2.0, golang, and mariadb), openSUSE (curl, gd, ImageMagick, lxterminal, ncurses, newsbeuter, perl-XML-LibXML, and xmltooling), Oracle (kernel), and SUSE (xmltooling).

Security updates for Wednesday

Post Syndicated from ris original https://lwn.net/Articles/744619/rss

Security updates have been issued by Debian (bind9, wordpress, and xbmc), Fedora (awstats, docker, gifsicle, irssi, microcode_ctl, mupdf, nasm, osc, osc-source_validator, and php), Gentoo (newsbeuter, poppler, and rsync), Mageia (gifsicle), Red Hat (linux-firmware and microcode_ctl), Scientific Linux (linux-firmware and microcode_ctl), SUSE (kernel and openssl), and Ubuntu (bind9, eglibc, glibc, and transmission).

Security updates for Monday

Post Syndicated from ris original https://lwn.net/Articles/744398/rss

Security updates have been issued by Arch Linux (qtpass), Debian (libkohana2-php, libxml2, transmission, and xmltooling), Fedora (kernel and qpid-cpp), Gentoo (PolarSSL and xen), Mageia (flash-player-plugin, irssi, kernel, kernel-linus, kernel-tmb, libvorbis, microcode, nvidia-current, php & libgd, poppler, webkit2, and wireshark), openSUSE (gifsicle, glibc, GraphicsMagick, gwenhywfar, ImageMagick, libetpan, mariadb, pngcrush, postgresql94, rsync, tiff, and wireshark), and Oracle (kernel).

Security updates for Tuesday

Post Syndicated from ris original https://lwn.net/Articles/743700/rss

Security updates have been issued by Arch Linux (graphicsmagick and linux-lts), CentOS (thunderbird), Debian (kernel, opencv, php5, and php7.0), Fedora (electrum), Gentoo (libXfont), openSUSE (gimp, java-1_7_0-openjdk, and libvorbis), Oracle (thunderbird), Slackware (irssi), SUSE (kernel, kernel-firmware, and kvm), and Ubuntu (awstats, nvidia-graphics-drivers-384, python-pysaml2, and tomcat7, tomcat8).

12 B2 Power Tips for Experts and Developers

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/advanced-cloud-storage-tips/

If you’ve been using B2 Cloud Storage for a while, you probably think you know all that you can do with it. But do you?

We’ve put together a list of blazing power tips for experts and developers that will take you to the next level. Take a look below.

If you’re new to B2, we have a list of power tips for you, too.
Visit 12 Power Tips for New B2 Users.

1    Manage File Versions

Use Lifecycle Rules on a Bucket to set how many days to keep files that are no longer the current version. This is a great way to manage the amount of space your B2 account is using.

2    Easily Stay on Top of Your B2 Account Limits

Set usage caps and get text/email alerts for your B2 account when you approach limits that you define.

3    Bring on Your Big Files

You can upload files as large as 10TB to B2.

4    You Can Use FedEx to Get Your Data into B2

If you have over 20TB of data, you can use Backblaze’s Fireball hard disk array to load large volumes of data directly into your B2 account. We ship a Fireball to you and you ship it back.

5    You Have Command-Line Control of All B2 Functions

You have complete control over B2 using our command line tool that is available for Macintosh, Windows, and Linux.

6    You Can Use Your Own Domain Name To Front a Public B2 Bucket

You can create a vanity URL for your B2 account.

7    See What’s Happening in Your Account with Graphical Reports

You can view graphical reports summarizing your B2 usage — transactions, downloads, averages, data stored — in your B2 account dashboard.

8    Create a B2 SDK

You can build your own B2 SDK for JVM-based or JVM-compatible languages using our B2 Java SDK on Github.

9    B2’s API is Easy to Use

B2’s API is similar to, but simpler than Amazon’s S3 API, making it super easy for developers to integrate with B2 Cloud Storage.

10    View Code Examples To Get Your B2 Project Started

The B2 API is well documented and has code examples for cURL, Java, Python, Swift, Ruby, C#, and PHP. For example, here’s how to create a B2 Bucket.

11    Developers can set the B2 part size as low as 5 MB

When working with large files, the minimum file part size can be set as low as 5MB or as high as 5GB. This gives developers the ability to maximize the throughput of B2 data uploads and downloads. See Large Files and Downloading for more developer tips.

12    Your App or Device Can Work with B2, as well

Your B2 integration can be listed on Backblaze’s website. Visit Submit an Integration to get started.

Want to Learn More About B2?

You can find more information on B2 on our website and in our help pages.

The post 12 B2 Power Tips for Experts and Developers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.