Tag Archives: sports

Friday Squid Blogging: Introducing the Seattle Kraken

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2020/07/friday_squid_bl_738.html

The Kraken is the name of Seattle’s new NFL franchise.

I have always really liked collective nouns as sports team names (like the Utah Jazz or the Minnesota Wild), mostly because it’s hard to describe individual players.

As usual, you can also use this squid post to talk about the security stories in the news that I haven’t covered.

Read my blog posting guidelines here.

Let’s do virtual sports with Digital Making at Home!

Post Syndicated from Kevin Johnson original https://www.raspberrypi.org/blog/lets-do-virtual-sports-with-digital-making-at-home/

Join us for Digital Making at Home: this week, young people get to make sports games in Scratch! With Digital Making at Home, we invite kids all over the world to code along with us and our new videos every week.

So get ready to exercise your digital making skills with us:

Check out this week’s sporty code-along projects!

And tune in on Wednesday 2pm BST / 9am EDT / 7.30pm IST at rpf.io/home to code along with our live stream session.

The post Let’s do virtual sports with Digital Making at Home! appeared first on Raspberry Pi.

Spanish Soccer League App Spies on Fans

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2019/06/spanish_soccer_.html

The Spanish Soccer League’s smartphone app spies on fans in order to find bars that are illegally streaming its games. The app listens with the microphone for the broadcasts, and then uses geolocation to figure out where the phone is.

The Spanish data protection agency has ordered the league to stop doing this. Not because it’s creepy spying, but because the terms of service — which no one reads anyway — weren’t clear.

I feel the earth move under my feet (in Michigan)

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/michigan-seismic-activity-raspberry-pi/

The University of Michigan is home to the largest stadium in the USA (the second-largest in the world!). So what better place to test for spectator-induced seismic activity than The Big House?

The Big House stadium in Michigan

The Michigan Shake

University of Michigan geology professor Ben van der Pluijm decided to make waves by measuring the seismic activity produced during games at the university’s 107601 person-capacity stadium. Because earthquakes are (thankfully) very rare in the Midwest, and therefore very rarely experienced by van der Pluijm’s introductory geology class, he hoped this approach would make the movement of the Earth more accessible to his students.

“The bottom line was, I wanted something to show people that the Earth just shakes from all kinds of interactions,” explained van der Pluijm in his interview with The Michigan Daily. “All kinds of activity makes the Earth shake.”

The Big House stadium in Michigan

To measure the seismic activity, van der Pluijm used a Raspberry Pi, placing it on a flat concrete surface within the stadium.

Van der Pluijm installed a small machine called a Raspberry Pi computer in the stadium. He said his only requirements were that it needed to be able to plug into the internet and set up on a concrete floor. “Then it sits there and does its thing,” he said. “In fact, it probably does its thing right now.”

He then sent freshman student Sahil Tolia to some games to record the moments of spectator movement and celebration, so that these could be compared with the seismic activity that the Pi registers.

We’re not sure whether Professor van der Pluijm plans on releasing his findings to the outside world, or whether he’ll keep them a close secret with his introductory students, but we hope for the former!

Build your own Raspberry Pi seismic activity reader

We’re not sure what other technology van der Pluijm uses in conjunction with the Raspberry Pi, but it’s fairly easy to create your own seismic activity reader using our board. You can purchase the Raspberry Shake, an add-on board for the Pi that has vertical and horizontal geophones, MEMs accelerometers, and omnidirectional differential pressure transducers. Or you can fashion something at home, for example by taking hints from this project by Carlo Cristini, which uses household items to register movement.

The post I feel the earth move under my feet (in Michigan) appeared first on Raspberry Pi.

Cheating in Bird Racing

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/08/cheating_in_bir.html

I’ve previously written about people cheating in marathon racing by driving — or otherwise getting near the end of the race by faster means than running. In China, two people were convicted of cheating in a pigeon race:

The essence of the plan involved training the pigeons to believe they had two homes. The birds had been secretly raised not just in Shanghai but also in Shangqiu.

When the race was held in the spring of last year, the Shanghai Pigeon Association took all the entrants from Shanghai to Shangqiu and released them. Most of the pigeons started flying back to Shanghai.

But the four specially raised pigeons flew instead to their second home in Shangqiu. According to the court, the two men caught the birds there and then carried them on a bullet train back to Shanghai, concealed in milk cartons. (China prohibits live animals on bullet trains.)

When the men arrived in Shanghai, they released the pigeons, which quickly fluttered to their Shanghai loft, seemingly winning the race.

Stream to Twitch with the push of a button

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/tinkernut-twitch-streaming/

Stream your video gaming exploits to the internet at the touch of a button with the Twitch-O-Matic. Everyone else is doing it, so you should too.

Twitch-O-Matic: Raspberry Pi Twitch Streaming Device – Weekend Hacker #1804

Some gaming consoles make it easy to stream to Twitch, some gaming consoles don’t (come on, Nintendo). So for those that don’t, I’ve made this beta version of the “Twitch-O-Matic”. No it doesn’t chop onions or fold your laundry, but what it DOES do is stream anything with HDMI output to your Twitch channel with the simple push of a button!

eSports and online game streaming

Interest in eSports has skyrocketed over the last few years, with viewership numbers in the hundreds of millions, sponsorship deals increasing in value and prestige, and tournament prize funds reaching millions of dollars. So it’s no wonder that more and more gamers are starting to stream live to online platforms in order to boost their fanbase and try to cash in on this growing industry.

Streaming to Twitch

Launched in 2011, Twitch.tv is an online live-streaming platform with a primary focus on video gaming. Users can create accounts to contribute their comments and content to the site, as well as watching live-streamed gaming competitions and broadcasts. With a staggering fifteen million daily users, Twitch is accessible via smartphone and gaming console apps, smart TVs, computers, and tablets. But if you want to stream to Twitch, you may find yourself using third-party software in order to do so. And with more buttons to click and more wires to plug in for older, app-less consoles, streaming can get confusing.

Enter Tinkernut.

Side note: we ❤ Tinkernut

We’ve featured Tinkernut a few times on the Raspberry Pi blog – his tutorials are clear, his projects are interesting and useful, and his live-streamed comment videos for every build are a nice touch to sharing homebrew builds on the internet.

Tinkernut Raspberry Pi Zero W Twitch-O-Matic

So, yes, we love him. [This is true. Alex never shuts up about him. – Ed.] And since he has over 500K subscribers on YouTube, we’re obviously not the only ones. We wave our Tinkernut flags with pride.


With a Raspberry Pi Zero W, an HDMI to CSI adapter, and a case to fit it all in, Tinkernut’s Twitch-O-Matic allows easy connection to the Twitch streaming service. You’ll also need a button – the bigger, the better in our opinion, though Tinkernut has opted for the Adafruit 16mm Illuminated Pushbutton for his build, and not the 100mm Massive Arcade Button that, sadly, we still haven’t found a reason to use yet.

Adafruit massive button

“I’m sorry, Dave…”

For added frills and pizzazz, Tinketnut has also incorporated Adafruit’s White LED Backlight Module into the case, though you don’t have to do so unless you’re feeling super fancy.

The setup

The Raspberry Pi Zero W is connected to the HDMI to CSI adapter via the camera connector, in the same way you’d attach the camera ribbon. Tinkernut uses a standard Raspbian image on an 8GB SD card, with SSH enabled for remote access from his laptop. He uses the simple command Raspivid to test the HDMI connection by recording ten seconds of video footage from his console.

Tinkernut Raspberry Pi Zero W Twitch-O-Matic

One lead is all you need

Once you have the Pi receiving video from your console, you can connect to Twitch using your Twitch stream key, which you can find by logging in to your account at Twitch.tv. Tinkernut’s tutorial gives you all the commands you need to stream from your Pi.

The frills

To up the aesthetic impact of your project, adding buttons and backlights is fairly straightforward.

Tinkernut Raspberry Pi Zero W Twitch-O-Matic

Pretty LED frills

To run the stream command, Tinketnut uses a button: press once to start the stream, press again to stop. Pressing the button also turns on the LED backlight, so it’s obvious when streaming is in progress.

The tutorial

For the full code and 3D-printable case STL file, head to Tinketnut’s hackster.io project page. And if you’re already using a Raspberry Pi for Twitch streaming, share your build setup with us. Cheers!

The post Stream to Twitch with the push of a button appeared first on Raspberry Pi.

Safety first: a Raspberry Pi safety helmet

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/safety-helmet/

Jennifer Fox is back, this time with a Raspberry Pi Zero–controlled impact force monitor that will notify you if your collision is a worth a trip to the doctor.

Make an Impact Force Monitor!

Check out my latest Hacker in Residence project for SparkFun Electronics: the Helmet Guardian! It’s a Pi Zero powered impact force monitor that turns on an LED if your head/body experiences a potentially dangerous impact. Install in your sports helmets, bicycle, or car to keep track of impact and inform you when it’s time to visit the doctor.


We’ve all knocked our heads at least once in our lives, maybe due to tripping over a loose paving slab, or to falling off a bike, or to walking into the corner of the overhead cupboard door for the third time this week — will I ever learn?! More often than not, even when we’re seeing stars, we brush off the accident and continue with our day, oblivious to the long-term damage we may be doing.

Force of impact

After some thorough research, Jennifer Fox, founder of FoxBot Industries, concluded that forces of 4 to 6 G sustained for more than a few seconds are dangerous to the human body. With this in mind, she decided to use a Raspberry Pi Zero W and an accelerometer to create helmet with an impact force monitor that notifies its wearer if this level of G-force has been met.

Jennifer Fox Raspberry Pi Impact Force Monitor

Obviously, if you do have a serious fall, you should always seek medical advice. This project is an example of how affordable technology can be used to create medical and citizen science builds, and not a replacement for professional medical services.

Setting up the impact monitor

Jennifer’s monitor requires only a few pieces of tech: a Zero W, an accelerometer and breakout board, a rechargeable USB battery, and an LED, plus the standard wires and resistors for these components.

After installing Raspbian, Jennifer enabled SSH and I2C on the Zero W to make it run headlessly, and then accessed it from a laptop. This allows her to control the Pi without physically connecting to it, and it makes for a wireless finished project.

Jen wired the Pi to the accelerometer breakout board and LED as shown in the schematic below.

Jennifer Fox Raspberry Pi Impact Force Monitor

The LED acts as a signal of significant impacts, turning on when the G-force threshold is reached, and not turning off again until the program is reset.

Jennifer Fox Raspberry Pi Impact Force Monitor

Make your own and more

Jennifer’s full code for the impact monitor is on GitHub, and she’s put together a complete tutorial on SparkFun’s website.

For more tutorials from Jennifer Fox, such as her ‘Bark Back’ IoT Pet Monitor, be sure to follow her on YouTube. And for similar projects, check out Matt’s smart bike light and Amelia Day’s physical therapy soccer ball.

The post Safety first: a Raspberry Pi safety helmet appeared first on Raspberry Pi.

Russians Hacked the Olympics

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/03/russians_hacked.html

Two weeks ago, I blogged about the myriad of hacking threats against the Olympics. Last week, the Washington Post reported that Russia hacked the Olympics network and tried to cast the blame on North Korea.

Of course, the evidence is classified, so there’s no way to verify this claim. And while the article speculates that the hacks were a retaliation for Russia being banned due to doping, that doesn’t ring true to me. If they tried to blame North Korea, it’s more likely that they’re trying to disrupt something between North Korea, South Korea, and the US. But I don’t know.

Internet Security Threats at the Olympics

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/02/internet_securi.html

There are a lot:

The cybersecurity company McAfee recently uncovered a cyber operation, dubbed Operation GoldDragon, attacking South Korean organizations related to the Winter Olympics. McAfee believes the attack came from a nation state that speaks Korean, although it has no definitive proof that this is a North Korean operation. The victim organizations include ice hockey teams, ski suppliers, ski resorts, tourist organizations in Pyeongchang, and departments organizing the Pyeongchang Olympics.

Meanwhile, a Russia-linked cyber attack has already stolen and leaked documents from other Olympic organizations. The so-called Fancy Bear group, or APT28, began its operations in late 2017 –­ according to Trend Micro and Threat Connect, two private cybersecurity firms­ — eventually publishing documents in 2018 outlining the political tensions between IOC officials and World Anti-Doping Agency (WADA) officials who are policing Olympic athletes. It also released documents specifying exceptions to anti-doping regulations granted to specific athletes (for instance, one athlete was given an exception because of his asthma medication). The most recent Fancy Bear leak exposed details about a Canadian pole vaulter’s positive results for cocaine. This group has targeted WADA in the past, specifically during the 2016 Rio de Janeiro Olympics. Assuming the attribution is right, the action appears to be Russian retaliation for the punitive steps against Russia.

A senior analyst at McAfee warned that the Olympics may experience more cyber attacks before closing ceremonies. A researcher at ThreatConnect asserted that organizations like Fancy Bear have no reason to stop operations just because they’ve already stolen and released documents. Even the United States Department of Homeland Security has issued a notice to those traveling to South Korea to remind them to protect themselves against cyber risks.

One presumes the Olympics network is sufficiently protected against the more pedestrian DDoS attacks and the like, but who knows?

EDITED TO ADD: There was already one attack.

Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift/

An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse that enables you easily to make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas, to simple de-normalized tables for running any analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take account of Amazon Redshift’s architecture. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:

  • COPY data from multiple, evenly sized files.
  • Use workload management to improve ETL runtimes.
  • Perform table maintenance regularly.
  • Perform multiple steps in a single transaction.
  • Loading data in bulk.
  • Use UNLOAD to extract large result sets.
  • Use Amazon Redshift Spectrum for ad hoc ETL processing.
  • Monitor daily ETL health using diagnostic queries.

1. COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. In the example shown below, a single large file is loaded into a two-node cluster, resulting in only one of the nodes, “Compute-0”, performing all the data ingestion:

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources, and quickest possible throughput.

2. Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.

I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.

When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:

  • Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.
  • Claim extra memory available in a queue. When executing an ETL query, you can take advantage of the wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), the wlm_query_slot_count can be increased for this step.
  • Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.
  • Take advantage of the dynamic memory parameters. They swap the memory from your ETL to your reporting queue after the ETL job has completed.

3. Perform table maintenance regularly

Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables regularly are VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and have VACUUM & ANALYZE executed in a regular fashion.

  • Use VACUUM to sort tables and remove deleted blocks

During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.

DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.

After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Util’s table_info script.

Use the following approaches to ensure that VACCUM is completed in a timely manner:

  • Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.
  • DROP or TRUNCATE intermediate or staging tables, thereby eliminating the need to VACUUM them.
  • If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.
  • Consider using time series This helps reduce the amount of data you need to VACUUM.
  • Use ANALYZE to update database statistics

Amazon Redshift uses a cost-based query planner and optimizer using statistics about tables to make good decisions about the query plan for the SQL statements. Regular statistics collection after the ETL completion ensures that user queries run fast, and that daily ETL processes are performant. The Amazon Redshift utility table_info script provides insights into the freshness of the statistics. Keeping the statistics off (pct_stats_off) less than 20% ensures effective query plans for the SQL queries.

4. Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be surrounded by a BEGIN…END statement so that a single commit is performed only after all the transformation logic has been executed. For example, here is an example multi-step ETL script that performs one commit at the end:

CREATE temporary staging_table;
INSERT INTO staging_table SELECT .. FROM source (transformation logic);
DELETE FROM daily_table WHERE dataset_date =?;
INSERT INTO daily_table SELECT .. FROM staging_table (daily aggregate);
DELETE FROM weekly_table WHERE weekending_date=?;
INSERT INTO weekly_table SELECT .. FROM staging_table(weekly aggregate);

5. Loading data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

  • Use a manifest file to ingest large datasets that span multiple files. The manifest file is a JSON file that lists all the files to be loaded into Amazon Redshift. Using a manifest file ensures that Amazon Redshift has a consistent view of the data to be loaded from S3, while also ensuring that duplicate files do not result in the same data being loaded more than one time.
  • Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.
  • User ALTER table APPEND to swap data from the staging tables to the target table. Data in the source table is moved to matching columns in the target table. Column order doesn’t matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn’t involve copying or moving data.

6. Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. Here is an example of a large SELECT statement. Notice that the leader node is doing most of the work to stream out the rows:

Use UNLOAD to extract large results sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should make use of the MAXFILESIZE parameter to and keep files are 150 MB. Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.

7. Use Redshift Spectrum for ad hoc ETL processing

Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.

For tips on getting started with and optimizing the use of Redshift Spectrum, see the previous post, 10 Best Practices for Amazon Redshift Spectrum.

8. Monitor daily ETL health using diagnostic queries

Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following monitoring scripts can be used to provide insights into the health of your ETL processes:

ScriptUse when…Solution
commit_stats.sql – Commit queue statistics from past days, showing largest queue length and queue time firstDML statements such as INSERT/UPDATE/COPY/DELETE operations take several times longer to execute when multiple of these operations are in progressSet up separate WLM queues for the ETL process and limit the concurrency to < 5.
copy_performance.sql –  Copy command statistics for the past days Daily COPY operations take longer to execute• Follow the best practices for the COPY command.
• Analyze data growth with the incoming datasets and consider cluster resize to meet the expected SLA.
table_info.sql – Table skew and unsorted statistics along with storage and key informationTransformation steps take longer to execute• Set up regular VACCUM jobs to address unsorted rows and claim the deleted blocks so that transformation SQL execute optimally.
• Consider a table redesign to avoid data skewness.
v_check_transaction_locks.sql – Monitor transaction locksINSERT/UPDATE/COPY/DELETE operations on particular tables do not respond back in timely manner, compared to when run after the ETLMultiple DML statements are operating on the same target table at the same moment from different transactions. Set up ETL job dependency so that they execute serially for the same target table.
v_get_schema_priv_by_user.sql – Get the schema that the user has access toReporting users can view intermediate tablesSet up separate database groups for reporting and ETL users, and grants access to objects using GRANT.
v_generate_tbl_ddl.sql – Get the table DDLYou need to create an empty table with same structure as target table for data backfillGenerate DDL using this script for data backfill.
v_space_used_per_tbl.sql – monitor space used by individual tablesAmazon Redshift data warehouse space growth is trending upwards more than normal

Analyze the individual tables that are growing at higher rate than normal. Consider data archival using UNLOAD to S3 and Redshift Spectrum for later analysis.

Use unscanned_table_summary.sql to find unused table and archive or drop them.

top_queries.sql – Return the top 50 time consuming statements aggregated by its textETL transformations are taking longer to executeAnalyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.

There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of monitoring of your ETL processes.

Example ETL process

The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Step 1:  Extract from the RDBMS source to a S3 bucket

In this ETL process, the data extract job fetches change data every 1 hour and it is staged into multiple hourly files. For example, the staged S3 folder looks like the following:

 [[email protected] ~]$ aws s3 ls s3://<<S3 Bucket>>/batch/2017/07/02/
2017-07-02 01:59:58   81900220 20170702T01.export.gz
2017-07-02 02:59:56   84926844 20170702T02.export.gz
2017-07-02 03:59:54   78990356 20170702T03.export.gz
2017-07-02 22:00:03   75966745 20170702T21.export.gz
2017-07-02 23:00:02   89199874 20170702T22.export.gz
2017-07-02 00:59:59   71161715 20170702T23.export.gz

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. Further, the files are compressed (gzipped) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:

  "entries": [
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T01.export.gz", "mandatory":true},
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T02.export.gz", "mandatory":true},
    {"url":" s3://<<S3 Bucket>>/batch/2017/07/02/20170702T23.export.gz", "mandatory":true}

The data can be ingested using the following command:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3:// <<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.

Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables

Data is staged in the “stage_tbl” from where it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables. The following job illustrates a typical weekly process:

INSERT into ETL_LOG (..) values (..);
DELETE from weekly_tbl where dataset_week = <<current week>>;
INSERT into weekly_tbl (..)
  SELECT date_trunc('week', dataset_day) AS week_begin_dataset_date, SUM(C1) AS C1, SUM(C2) AS C2
	FROM   stage_tbl
GROUP BY date_trunc('week', dataset_day);
INSERT into AUDIT_LOG values (..);

As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

unload ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>’) TO 's3:// <<S3 Bucket>>/datalake/weekly/20170526/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';


Amazon Redshift lets you easily operate petabyte-scale data warehouses on the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring. I also demonstrated the best practices being used in a typical sample ETL workload to transform the data into Amazon Redshift.

If you have questions or suggestions, please comment below.


About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.


Optimize Delivery of Trending, Personalized News Using Amazon Kinesis and Related Services

Post Syndicated from Yukinori Koide original https://aws.amazon.com/blogs/big-data/optimize-delivery-of-trending-personalized-news-using-amazon-kinesis-and-related-services/

This is a guest post by Yukinori Koide, an the head of development for the Newspass department at Gunosy.

Gunosy is a news curation application that covers a wide range of topics, such as entertainment, sports, politics, and gourmet news. The application has been installed more than 20 million times.

Gunosy aims to provide people with the content they want without the stress of dealing with a large influx of information. We analyze user attributes, such as gender and age, and past activity logs like click-through rate (CTR). We combine this information with article attributes to provide trending, personalized news articles to users.

In this post, I show you how to process user activity logs in real time using Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, and related AWS services.

Why does Gunosy need real-time processing?

Users need fresh and personalized news. There are two constraints to consider when delivering appropriate articles:

  • Time: Articles have freshness—that is, they lose value over time. New articles need to reach users as soon as possible.
  • Frequency (volume): Only a limited number of articles can be shown. It’s unreasonable to display all articles in the application, and users can’t read all of them anyway.

To deliver fresh articles with a high probability that the user is interested in them, it’s necessary to include not only past user activity logs and some feature values of articles, but also the most recent (real-time) user activity logs.

We optimize the delivery of articles with these two steps.

  1. Personalization: Deliver articles based on each user’s attributes, past activity logs, and feature values of each article—to account for each user’s interests.
  2. Trends analysis/identification: Optimize delivering articles using recent (real-time) user activity logs—to incorporate the latest trends from all users.

Optimizing the delivery of articles is always a cold start. Initially, we deliver articles based on past logs. We then use real-time data to optimize as quickly as possible. In addition, news has a short freshness time. Specifically, day-old news is past news, and even the news that is three hours old is past news. Therefore, shortening the time between step 1 and step 2 is important.

To tackle this issue, we chose AWS for processing streaming data because of its fully managed services, cost-effectiveness, and so on.


The following diagrams depict the architecture for optimizing article delivery by processing real-time user activity logs

There are three processing flows:

  1. Process real-time user activity logs.
  2. Store and process all user-based and article-based logs.
  3. Execute ad hoc or heavy queries.

In this post, I focus on the first processing flow and explain how it works.

Process real-time user activity logs

The following are the steps for processing user activity logs in real time using Kinesis Data Streams and Kinesis Data Analytics.

  1. The Fluentd server sends the following user activity logs to Kinesis Data Streams:
{"article_id": 12345, "user_id": 12345, "action": "click"}
{"article_id": 12345, "user_id": 12345, "action": "impression"}
  1. Map rows of logs to columns in Kinesis Data Analytics.

  1. Set the reference data to Kinesis Data Analytics from Amazon S3.

a. Gunosy has user attributes such as gender, age, and segment. Prepare the following CSV file (user_id, gender, segment_id) and put it in Amazon S3:


b. Add the application reference data source to Kinesis Data Analytics using the AWS CLI:

$ aws kinesisanalytics add-application-reference-data-source \
  --application-name <my-application-name> \
  --current-application-version-id <version-id> \
  --reference-data-source '{
  "S3ReferenceDataSource": {
    "BucketARN": "arn:aws:s3:::<my-bucket-name>",
    "FileKey": "mydata.csv",
    "ReferenceRoleARN": "arn:aws:iam::<account-id>:role/..."
  "ReferenceSchema": {
    "RecordFormat": {
      "RecordFormatType": "CSV",
      "MappingParameters": {
        "CSVMappingParameters": {"RecordRowDelimiter": "\n", "RecordColumnDelimiter": ","}
    "RecordEncoding": "UTF-8",
    "RecordColumns": [
      {"Name": "USER_ID", "Mapping": "0", "SqlType": "INTEGER"},
      {"Name": "GENDER",  "Mapping": "1", "SqlType": "VARCHAR(32)"},
      {"Name": "SEGMENT_ID", "Mapping": "2", "SqlType": "INTEGER"}

This application reference data source can be referred on Kinesis Data Analytics.

  1. Run a query against the source data stream on Kinesis Data Analytics with the application reference data source.

a. Define the temporary stream named TMP_SQL_STREAM.


b. Insert the joined source stream and application reference data source into the temporary stream.


c. Define the destination stream named DESTINATION_SQL_STREAM.


d. Insert the processed temporary stream, using a tumbling window, into the destination stream per minute.


The results look like the following:

  1. Insert the results into Amazon Elasticsearch Service (Amazon ES).
  2. Batch servers get results from Amazon ES every minute. They then optimize delivering articles with other data sources using a proprietary optimization algorithm.

How to connect a stream to another stream in another AWS Region

When we built the solution, Kinesis Data Analytics was not available in the Asia Pacific (Tokyo) Region, so we used the US West (Oregon) Region. The following shows how we connected a data stream to another data stream in the other Region.

There is no need to continue containing all components in a single AWS Region, unless you have a situation where a response difference at the millisecond level is critical to the service.


The solution provides benefits for both our company and for our users. Benefits for the company are cost savings—including development costs, operational costs, and infrastructure costs—and reducing delivery time. Users can now find articles of interest more quickly. The solution can process more than 500,000 records per minute, and it enables fast and personalized news curating for our users.


In this post, I showed you how we optimize trending user activities to personalize news using Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, and related AWS services in Gunosy.

AWS gives us a quick and economical solution and a good experience.

If you have questions or suggestions, please comment below.

Additional Reading

If you found this post useful, be sure to check out Implement Serverless Log Analytics Using Amazon Kinesis Analytics and Joining and Enriching Streaming Data on Amazon Kinesis.

About the Authors

Yukinori Koide is the head of development for the Newspass department at Gunosy. He is working on standardization of provisioning and deployment flow, promoting the utilization of serverless and containers for machine learning and AI services. His favorite AWS services are DynamoDB, Lambda, Kinesis, and ECS.




Akihiro Tsukada is a start-up solutions architect with AWS. He supports start-up companies in Japan technically at many levels, ranging from seed to later-stage.





Yuta Ishii is a solutions architect with AWS. He works with our customers to provide architectural guidance for building media & entertainment services, helping them improve the value of their services when using AWS.






Federate Database User Authentication Easily with IAM and Amazon Redshift

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/federate-database-user-authentication-easily-with-iam-and-amazon-redshift/

Managing database users though federation allows you to manage authentication and authorization procedures centrally. Amazon Redshift now supports database authentication with IAM, enabling user authentication though enterprise federation. No need to manage separate database users and passwords to further ease the database administration. You can now manage users outside of AWS and authenticate them for access to an Amazon Redshift data warehouse. Do this by integrating IAM authentication and a third-party SAML-2.0 identity provider (IdP), such as AD FS, PingFederate, or Okta. In addition, database users can also be automatically created at their first login based on corporate permissions.

In this post, I demonstrate how you can extend the federation to enable single sign-on (SSO) to the Amazon Redshift data warehouse.

SAML and Amazon Redshift

AWS supports Security Assertion Markup Language (SAML) 2.0, which is an open standard for identity federation used by many IdPs. SAML enables federated SSO, which enables your users to sign in to the AWS Management Console. Users can also make programmatic calls to AWS API actions by using assertions from a SAML-compliant IdP. For example, if you use Microsoft Active Directory for corporate directories, you may be familiar with how Active Directory and AD FS work together to enable federation. For more information, see the Enabling Federation to AWS Using Windows Active Directory, AD FS, and SAML 2.0 AWS Security Blog post.

Amazon Redshift now provides the GetClusterCredentials API operation that allows you to generate temporary database user credentials for authentication. You can set up an IAM permissions policy that generates these credentials for connecting to Amazon Redshift. Extending the IAM authentication, you can configure the federation of AWS access though a SAML 2.0–compliant IdP. An IAM role can be configured to permit the federated users call the GetClusterCredentials action and generate temporary credentials to log in to Amazon Redshift databases. You can also set up policies to restrict access to Amazon Redshift clusters, databases, database user names, and user group.

Amazon Redshift federation workflow

In this post, I demonstrate how you can use a JDBC– or ODBC-based SQL client to log in to the Amazon Redshift cluster using this feature. The SQL clients used with Amazon Redshift JDBC or ODBC drivers automatically manage the process of calling the GetClusterCredentials action, retrieving the database user credentials, and establishing a connection to your Amazon Redshift database. You can also use your database application to programmatically call the GetClusterCredentials action, retrieve database user credentials, and connect to the database. I demonstrate these features using an example company to show how different database users accounts can be managed easily using federation.

The following diagram shows how the SSO process works:

  2. Authenticate using Corp Username/Password
  3. IdP sends SAML assertion
  4. Call STS to assume role with SAML
  5. STS Returns Temp Credentials
  6. Use Temp Credentials to get Temp cluster credentials
  7. Connect to Amazon Redshift using temp credentials


Example Corp. is using Active Directory (idp host:demo.examplecorp.com) to manage federated access for users in its organization. It has an AWS account: 123456789012 and currently manages an Amazon Redshift cluster with the cluster ID “examplecorp-dw”, database “analytics” in us-west-2 region for its Sales and Data Science teams. It wants the following access:

  • Sales users can access the examplecorp-dw cluster using the sales_grp database group
  • Sales users access examplecorp-dw through a JDBC-based SQL client
  • Sales users access examplecorp-dw through an ODBC connection, for their reporting tools
  • Data Science users access the examplecorp-dw cluster using the data_science_grp database group.
  • Partners access the examplecorp-dw cluster and query using the partner_grp database group.
  • Partners are not federated through Active Directory and are provided with separate IAM user credentials (with IAM user name examplecorpsalespartner).
  • Partners can connect to the examplecorp-dw cluster programmatically, using language such as Python.
  • All users are automatically created in Amazon Redshift when they log in for the first time.
  • (Optional) Internal users do not specify database user or group information in their connection string. It is automatically assigned.
  • Data warehouse users can use SSO for the Amazon Redshift data warehouse using the preceding permissions.

Step 1:  Set up IdPs and federation

The Enabling Federation to AWS Using Windows Active Directory post demonstrated how to prepare Active Directory and enable federation to AWS. Using those instructions, you can establish trust between your AWS account and the IdP and enable user access to AWS using SSO.  For more information, see Identity Providers and Federation.

For this walkthrough, assume that this company has already configured SSO to their AWS account: 123456789012 for their Active Directory domain demo.examplecorp.com. The Sales and Data Science teams are not required to specify database user and group information in the connection string. The connection string can be configured by adding SAML Attribute elements to your IdP. Configuring these optional attributes enables internal users to conveniently avoid providing the DbUser and DbGroup parameters when they log in to Amazon Redshift.

The user-name attribute can be set up as follows, with a user ID (for example, nancy) or an email address (for example. [email protected]):

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbUser">  

The AutoCreate attribute can be defined as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/AutoCreate">

The sales_grp database group can be included as follows:

<Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbGroups">

For more information about attribute element configuration, see Configure SAML Assertions for Your IdP.

Step 2: Create IAM roles for access to the Amazon Redshift cluster

The next step is to create IAM policies with permissions to call GetClusterCredentials and provide authorization for Amazon Redshift resources. To grant a SQL client the ability to retrieve the cluster endpoint, region, and port automatically, include the redshift:DescribeClusters action with the Amazon Redshift cluster resource in the IAM role.  For example, users can connect to the Amazon Redshift cluster using a JDBC URL without the need to hardcode the Amazon Redshift endpoint:

Previous:  jdbc:redshift://endpoint:port/database

Current:  jdbc:redshift:iam://clustername:region/dbname

Use IAM to create the following policies. You can also use an existing user or role and assign these policies. For example, if you already created an IAM role for IdP access, you can attach the necessary policies to that role. Here is the policy created for sales users for this example:


    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Condition": {
                "StringEquals": {
                    "aws:userid": "AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com"
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Effect": "Allow",
            "Action": [
            "Resource": [

The policy uses the following parameter values:

  • Region: us-west-2
  • AWS Account: 123456789012
  • Cluster name: examplecorp-dw
  • Database group: sales_grp
Policy StatementDescription

Allow users to retrieve the cluster endpoint, region, and port automatically for the Amazon Redshift cluster examplecorp-dw. This specification uses the resource format arn:aws:redshift:region:account-id:cluster:clustername. For example, the SQL client JDBC can be specified in the format jdbc:redshift:iam://clustername:region/dbname.

For more information, see Amazon Resource Names.


Generates a temporary token to authenticate into the examplecorp-dw cluster. “arn:aws:redshift:us-west-2:123456789012:dbuser:examplecorp-dw/${redshift:DbUser}” restricts the corporate user name to the database user name for that user. This resource is specified using the format: arn:aws:redshift:region:account-id:dbuser:clustername/dbusername.

The Condition block enforces that the AWS user ID should match “AIDIODR4TAW7CSEXAMPLE:${redshift:DbUser}@examplecorp.com”, so that individual users can authenticate only as themselves. The AIDIODR4TAW7CSEXAMPLE role has the Sales_DW_IAM_Policy policy attached.

Automatically creates database users in examplecorp-dw, when they log in for the first time. Subsequent logins reuse the existing database user.
Allows sales users to join the sales_grp database group through the resource “arn:aws:redshift:us-west-2:123456789012:dbgroup:examplecorp-dw/sales_grp” that is specified in the format arn:aws:redshift:region:account-id:dbgroup:clustername/dbgroupname.

Similar policies can be created for Data Science users with access to join the data_science_grp group in examplecorp-dw. You can now attach the Sales_DW_IAM_Policy policy to the role that is mapped to IdP application for SSO.
 For more information about how to define the claim rules, see Configuring SAML Assertions for the Authentication Response.

Because partners are not authorized using Active Directory, they are provided with IAM credentials and added to the partner_grp database group. The Partner_DW_IAM_Policy is attached to the IAM users for partners. The following policy allows partners to log in using the IAM user name as the database user name.


    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Condition": {
                "StringEquals": {
                    "redshift:DbUser": "${aws:username}"
            "Effect": "Allow",
            "Action": [
            "Resource": [
            "Effect": "Allow",
            "Action": [
            "Resource": [

redshift:DbUser“: “${aws:username}” forces an IAM user to use the IAM user name as the database user name.

With the previous steps configured, you can now establish the connection to Amazon Redshift through JDBC– or ODBC-supported clients.

Step 3: Set up database user access

Before you start connecting to Amazon Redshift using the SQL client, set up the database groups for appropriate data access. Log in to your Amazon Redshift database as superuser to create a database group, using CREATE GROUP.

Log in to examplecorp-dw/analytics as superuser and create the following groups and users:

CREATE GROUP sales_grp;
CREATE GROUP datascience_grp;
CREATE GROUP partner_grp;

Use the GRANT command to define access permissions to database objects (tables/views) for the preceding groups.

Step 4: Connect to Amazon Redshift using the JDBC SQL client

Assume that sales user “nancy” is using the SQL Workbench client and JDBC driver to log in to the Amazon Redshift data warehouse. The following steps help set up the client and establish the connection:

  1. Download the latest Amazon Redshift JDBC driver from the Configure a JDBC Connection page
  2. Build the JDBC URL with the IAM option in the following format:

Because the redshift:DescribeClusters action is assigned to the preceding IAM roles, it automatically resolves the cluster endpoints and the port. Otherwise, you can specify the endpoint and port information in the JDBC URL, as described in Configure a JDBC Connection.

Identify the following JDBC options for providing the IAM credentials (see the “Prepare your environment” section) and configure in the SQL Workbench Connection Profile:

idp_host=demo.examplecorp.com (The name of the corporate identity provider host)
idp_port=443  (The port of the corporate identity provider host)
user=examplecorp\nancy(corporate user name)
password=***(corporate user password)

The SQL workbench configuration looks similar to the following screenshot:

Now, “nancy” can connect to examplecorp-dw by authenticating using the corporate Active Directory. Because the SAML attributes elements are already configured for nancy, she logs in as database user nancy and is assigned the sales_grp. Similarly, other Sales and Data Science users can connect to the examplecorp-dw cluster. A custom Amazon Redshift ODBC driver can also be used to connect using a SQL client. For more information, see Configure an ODBC Connection.

Step 5: Connecting to Amazon Redshift using JDBC SQL Client and IAM Credentials

This optional step is necessary only when you want to enable users that are not authenticated with Active Directory. Partners are provided with IAM credentials that they can use to connect to the examplecorp-dw Amazon Redshift clusters. These IAM users are attached to Partner_DW_IAM_Policy that assigns them to be assigned to the public database group in Amazon Redshift. The following JDBC URLs enable them to connect to the Amazon Redshift cluster:

jdbc:redshift:iam//examplecorp-dw/analytics?AccessKeyID=XXX&SecretAccessKey=YYY&DbUser=examplecorpsalespartner&DbGroup= partner_grp&AutoCreate=true

The AutoCreate option automatically creates a new database user the first time the partner logs in. There are several other options available to conveniently specify the IAM user credentials. For more information, see Options for providing IAM credentials.

Step 6: Connecting to Amazon Redshift using an ODBC client for Microsoft Windows

Assume that another sales user “uma” is using an ODBC-based client to log in to the Amazon Redshift data warehouse using Example Corp Active Directory. The following steps help set up the ODBC client and establish the Amazon Redshift connection in a Microsoft Windows operating system connected to your corporate network:

  1. Download and install the latest Amazon Redshift ODBC driver.
  2. Create a system DSN entry.
    1. In the Start menu, locate the driver folder or folders:
      • Amazon Redshift ODBC Driver (32-bit)
      • Amazon Redshift ODBC Driver (64-bit)
      • If you installed both drivers, you have a folder for each driver.
    2. Choose ODBC Administrator, and then type your administrator credentials.
    3. To configure the driver for all users on the computer, choose System DSN. To configure the driver for your user account only, choose User DSN.
    4. Choose Add.
  3. Select the Amazon Redshift ODBC driver, and choose Finish. Configure the following attributes:
    Data Source Name =any friendly name to identify the ODBC connection 
    user=uma(corporate user name)
    Auth Type-Identity Provider: AD FS
    password=leave blank (Windows automatically authenticates)
    Cluster ID: examplecorp-dw
    idp_host=demo.examplecorp.com (The name of the corporate IdP host)

This configuration looks like the following:

  1. Choose OK to save the ODBC connection.
  2. Verify that uma is set up with the SAML attributes, as described in the “Set up IdPs and federation” section.

The user uma can now use this ODBC connection to establish the connection to the Amazon Redshift cluster using any ODBC-based tools or reporting tools such as Tableau. Internally, uma authenticates using the Sales_DW_IAM_Policy  IAM role and is assigned the sales_grp database group.

Step 7: Connecting to Amazon Redshift using Python and IAM credentials

To enable partners, connect to the examplecorp-dw cluster programmatically, using Python on a computer such as Amazon EC2 instance. Reuse the IAM users that are attached to the Partner_DW_IAM_Policy policy defined in Step 2.

The following steps show this set up on an EC2 instance:

  1. Launch a new EC2 instance with the Partner_DW_IAM_Policy role, as described in Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances. Alternatively, you can attach an existing IAM role to an EC2 instance.
  2. This example uses Python PostgreSQL Driver (PyGreSQL) to connect to your Amazon Redshift clusters. To install PyGreSQL on Amazon Linux, use the following command as the ec2-user:
    sudo easy_install pip
    sudo yum install postgresql postgresql-devel gcc python-devel
    sudo pip install PyGreSQL

  1. The following code snippet demonstrates programmatic access to Amazon Redshift for partner users:
    #!/usr/bin/env python
    python redshift-unload-copy.py <config file> <region>
    * Copyright 2014, Amazon.com, Inc. or its affiliates. All Rights Reserved.
    * Licensed under the Amazon Software License (the "License").
    * You may not use this file except in compliance with the License.
    * A copy of the License is located at
    * http://aws.amazon.com/asl/
    * or in the "license" file accompanying this file. This file is distributed
    * express or implied. See the License for the specific language governing
    * permissions and limitations under the License.
    import sys
    import pg
    import boto3
    REGION = 'us-west-2'
    CLUSTER_IDENTIFIER = 'examplecorp-dw'
    DB_NAME = 'sales_db'
    DB_USER = 'examplecorpsalespartner'
    options = """keepalives=1 keepalives_idle=200 keepalives_interval=200
    set_timeout_stmt = "set statement_timeout = 1200000"
    def conn_to_rs(host, port, db, usr, pwd, opt=options, timeout=set_timeout_stmt):
        rs_conn_string = """host=%s port=%s dbname=%s user=%s password=%s
                             %s""" % (host, port, db, usr, pwd, opt)
        print "Connecting to %s:%s:%s as %s" % (host, port, db, usr)
        rs_conn = pg.connect(dbname=rs_conn_string)
        return rs_conn
    def main():
        # describe the cluster and fetch the IAM temporary credentials
        global redshift_client
        redshift_client = boto3.client('redshift', region_name=REGION)
        response_cluster_details = redshift_client.describe_clusters(ClusterIdentifier=CLUSTER_IDENTIFIER)
        response_credentials = redshift_client.get_cluster_credentials(DbUser=DB_USER,DbName=DB_NAME,ClusterIdentifier=CLUSTER_IDENTIFIER,DurationSeconds=3600)
        rs_host = response_cluster_details['Clusters'][0]['Endpoint']['Address']
        rs_port = response_cluster_details['Clusters'][0]['Endpoint']['Port']
        rs_db = DB_NAME
        rs_iam_user = response_credentials['DbUser']
        rs_iam_pwd = response_credentials['DbPassword']
        # connect to the Amazon Redshift cluster
        conn = conn_to_rs(rs_host, rs_port, rs_db, rs_iam_user,rs_iam_pwd)
        # execute a query
        result = conn.query("SELECT sysdate as dt")
        # fetch results from the query
        for dt_val in result.getresult() :
            print dt_val
        # close the Amazon Redshift connection
    if __name__ == "__main__":

You can save this Python program in a file (redshiftscript.py) and execute it at the command line as ec2-user:

python redshiftscript.py

Now partners can connect to the Amazon Redshift cluster using the Python script, and authentication is federated through the IAM user.


In this post, I demonstrated how to use federated access using Active Directory and IAM roles to enable single sign-on to an Amazon Redshift cluster. I also showed how partners outside an organization can be managed easily using IAM credentials.  Using the GetClusterCredentials API action, now supported by Amazon Redshift, lets you manage a large number of database users and have them use corporate credentials to log in. You don’t have to maintain separate database user accounts.

Although this post demonstrated the integration of IAM with AD FS and Active Directory, you can replicate this solution across with your choice of SAML 2.0 third-party identity providers (IdP), such as PingFederate or Okta. For the different supported federation options, see Configure SAML Assertions for Your IdP.

If you have questions or suggestions, please comment below.

Additional Reading

Learn how to establish federated access to your AWS resources by using Active Directory user attributes.

About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.


Join us for an evening of League of Legends

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/league-of-legends-evening/

Last month, we shared the news that Riot Games is supporting digital literacy by matching 25% of sales of Championship Ashe and Championship Ward to create a charity fund that will benefit the Raspberry Pi Foundation and two other charities.

Raspberry Pi League of Legends Championship Ashe Riot Games

Vote for the Raspberry Pi Foundation

Riot Games is now calling for all League of Legends players to vote for their favourite charity — the winning organisation will receive 50% of the total fund.

By visiting the ‘Vote for charity’ tab in-client, you’ll be able to choose between the Raspberry Pi Foundation, BasicNeeds, and Learning Equality.

Players can vote only once, and your vote will be multiplied based on your honour level. Voting ends on 5 November 2017 at 11:59pm PT.

League of Legends with Riot Gaming

In honour of the Riot Games Charity Fund vote, and to support the work of the Raspberry Pi Foundation, KimmieRiot and M0RGZ of top female eSports organisation Riot Gaming (no relation to Riot Games) will run a four-hour League of Legends live-stream this Saturday, 21 October, from 6pm to 10pm BST.

Playing as Championship Ashe, they’ll be streaming live to Twitch, and you’re all invited to join in the fun. I’ll be making an appearance in the chat box as RaspberryPiFoundation, and we’ll be giving away some free T-shirts and stickers during the event — make sure to tune in to the conversation.

In a wonderful gesture, Riot Gaming will pass on all donations made to their channel during the live-stream to us. These funds will directly aid the ongoing charitable work of Raspberry Pi and our computing education programmes like CoderDojo.

Make sure to follow Riot Gaming, and activate notifications so you don’t miss the event!

We’re blushing

Thank you to everyone who buys Championship Ashe and Championship Ward, and to all of you who vote for us. We’re honoured to be one of the three charities selected to benefit from the Riot Games Charity Fund.

And a huge thank you to Riot Gaming for organising an evening of Raspberry Pi and League of Legends. We can’t wait!

The post Join us for an evening of League of Legends appeared first on Raspberry Pi.