Rumour has it that there’s a worldwide football tournament on, and that England, surprisingly, are doing quite well. In celebration, here are some soccer-themed Raspberry Pi projects for you to try out at home between (or during) matches.
Start by coding a moving football in Scratch, and work through the project to build a game that tallies your successful attempts on goal within a time limit that you choose. Up the stakes by upgrading your game to include second-player control of the penguin goalie.
Table football
Once you’ve moved on from penalty practice, it’s time to recruit the whole team!
Our Table Football project – free, like all of our learning projects – comes with all the ingredients you need to recreate the classic game, including player sprites, graphics, and sounds.
Instant replay!
Scratch is all well and good, but it’s time we had some real-life table football, with all the snazzy upgrades you can add using a Raspberry Pi.
Demo of the foosball instant replay system. More info: https://github.com/swehner/foos and https://github.com/netsuso/foos-tournament. Music: http://freemusicarchive.org/music/Jahzzar/Blinded_by_dust/Magic_Mountain_1877
Stefan Wehner’s build is fully documented, so you can learn how to add automatic goal detection, slow-motion instant replay, scorekeeping, tallying, and more.
In this video we start to program Marty The Robot to play football, using a camera and Raspberry Pi on board to detect the ball and the goal. With the camera, Marty can spot a ball, and detect a pattern next to the goal.
Have you built a football-themed project using a Raspberry Pi? What projects did we miss in our roundup? Share them with us here in the comments, or on social media.
Jennifer Fox is back, this time with a Raspberry Pi Zero–controlled impact force monitor that will notify you if your collision is worth a trip to the doctor.
Check out my latest Hacker in Residence project for SparkFun Electronics: the Helmet Guardian! It’s a Pi Zero-powered impact force monitor that turns on an LED if your head/body experiences a potentially dangerous impact. Install it in your sports helmet, bicycle, or car to keep track of impacts and inform you when it’s time to visit the doctor.
Concussion
We’ve all knocked our heads at least once in our lives, maybe due to tripping over a loose paving slab, or to falling off a bike, or to walking into the corner of the overhead cupboard door for the third time this week — will I ever learn?! More often than not, even when we’re seeing stars, we brush off the accident and continue with our day, oblivious to the long-term damage we may be doing.
Force of impact
After some thorough research, Jennifer Fox, founder of FoxBot Industries, concluded that forces of 4 to 6 G sustained for more than a few seconds are dangerous to the human body. With this in mind, she decided to use a Raspberry Pi Zero W and an accelerometer to create a helmet with an impact force monitor that notifies its wearer when this level of G-force has been reached.
Obviously, if you do have a serious fall, you should always seek medical advice. This project is an example of how affordable technology can be used to create medical and citizen science builds, and not a replacement for professional medical services.
Setting up the impact monitor
Jennifer’s monitor requires only a few pieces of tech: a Zero W, an accelerometer and breakout board, a rechargeable USB battery, and an LED, plus the standard wires and resistors for these components.
After installing Raspbian, Jennifer enabled SSH and I2C on the Zero W so it could run headlessly, and then accessed it from a laptop. This lets her control the Pi without physically connecting to it, and makes for a wireless finished project.
Jen wired the Pi to the accelerometer breakout board and LED as shown in the schematic below.
The LED acts as a signal of significant impacts, turning on when the G-force threshold is reached, and not turning off again until the program is reset.
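The core detection loop takes only a few lines of Python. The sketch below is a minimal illustration rather than Jennifer’s actual code: it assumes a hypothetical read_acceleration() helper that returns (x, y, z) acceleration in g over I2C, drives the LED with the gpiozero library on an illustrative GPIO pin, and omits the sustained-duration check for brevity.

from math import sqrt
from time import sleep

from gpiozero import LED

# Hypothetical helper: returns (x, y, z) acceleration in g, read over I2C
# from the accelerometer breakout board.
from impact_sensor import read_acceleration

THRESHOLD_G = 4.0   # lower bound of the 4-6 G danger range described above
led = LED(17)       # illustrative GPIO pin for the indicator LED

while True:
    x, y, z = read_acceleration()
    g_force = sqrt(x**2 + y**2 + z**2)
    if g_force >= THRESHOLD_G:
        led.on()    # stays lit until the program is reset
    sleep(0.01)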
Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit
By Kuhu Shukla
This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series.
As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in, and the new release that we tested out internally on our many-thousand-strong cluster is looking good. Today I am looking at a new stack trace from a different Apache project process, and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice, while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1.
Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story on how I found myself in an epic journey of migrating all of Yahoo’s jobs from Apache MapReduce to Apache Tez, a then-new DAG-based execution engine.
Oath’s grid infrastructure is driven through and through by Apache technologies, be it storage with HDFS, resource management with YARN, job execution frameworks with Tez, or user interface engines such as Hive, Hue, Pig, Sqoop, Spark, and Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs, using the polymorphic technologies hosted, developed, and maintained by the Apache community.
On the third day of my job at Yahoo in 2015, I received a YouTube link to An Introduction to Apache Tez. I watched it carefully, trying to keep up with all the questions I had, and recognized a few names from my academic readings of YARN ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions and being ever more careful with whitespace in my patches. One thing was clear – Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community, nearly 80–90% of the Yahoo jobs were running with Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect, and thankfully contributing to Tez became a goal in my next quarter.
The next sprint planning meeting ended with me getting my first major Tez assignment – progress reporting. The progress reporting in Tez was non-existent – “Just needs an API fix,” I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many.
I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. I started attending the bi-weekly Tez community sync-up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open part of the code is the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, down to which data structure we should pick and what a future user of Tez would take from it. In between, the usual status updates and extensive knowledge transfers were made.
Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me.
We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousand jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were now tracked through an umbrella JIRA, TEZ-3334, and the to-do list was long. I picked a few JIRAs, and as I started reading through them I realized: this is all new code I get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, and the team took walks post lunch to discuss how to go about defining the API. Countless hours were spent debugging hangs while fetching data and looking at stack traces and Wireshark captures from our test runs. Six months in, we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration, with high fives, as we continued to address review comments and fix big and small issues with this evolving feature.
As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig community and cross Apache product community interaction made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes.
In 2018, as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. As an astronomy aficionado, I found that going from a newbie Apache contributor to a newbie Apache committer was very much like looking through my telescope: it has endless possibilities and challenges you to be your best.
About the Author:
Kuhu Shukla is a software engineer at Oath and earned her Master’s in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN, and HDFS with a lot of talented Apache PMCs and Committers in Champaign, Illinois. A recent Apache Tez Committer herself, she continues to contribute to YARN and HDFS, and spoke at the 2017 DataWorks Summit on “Tez Shuffle Handler: Shuffling At Scale With Apache Hadoop”. Prior to that she worked on Juniper Networks’ router and switch configuration APIs. She likes to participate in open source conferences and women-in-tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking, and peering through her Dobsonian telescope.
Contributed by: Stephen Liedig, Senior Solutions Architect, ANZ Public Sector, and Otavio Ferreira, Manager, Amazon Simple Notification Service
Want to make your cloud-native applications scalable, fault-tolerant, and highly available? Recently, we wrote a couple of posts about using AWS messaging services Amazon SQS and Amazon SNS to address messaging patterns for loosely coupled communication between highly cohesive components. For more information, see:
Today, AWS is releasing a new message filtering functionality for SNS. This new feature simplifies the pub/sub messaging architecture by offloading the filtering logic from subscribers, as well as the routing logic from publishers, to SNS.
In this post, we walk you through the new message filtering feature, and how to use it to clean up unnecessary logic in your components, and reduce the number of topics in your architecture.
Topic-based filtering
SNS is a fully managed pub/sub messaging service that lets you fan out messages to large numbers of recipients at one time, using topics. SNS topics support a variety of subscription types, allowing you to push messages to SQS queues, AWS Lambda functions, HTTP endpoints, email addresses, and mobile devices (SMS, push).
In the above scenario, every subscriber receives the same message published to the topic, allowing them to process the message independently. For many use cases, this is sufficient.
However, in more complex scenarios, the subscriber may only be interested in a subset of the messages being published. The onus, in that case, is on each subscriber to ensure that they are filtering and only processing those messages in which they are actually interested.
To avoid this additional filtering logic on each subscriber, many organizations have adopted a practice in which the publisher is now responsible for routing different types of messages to different topics. However, as depicted in the following diagram, this topic-based filtering practice can lead to overly complicated publishers, topic proliferation, and additional overhead in provisioning and managing your SNS topics.
Attribute-based filtering
To leverage the new message filtering capability, SNS requires the publisher to set message attributes and each subscriber to set a subscription attribute (a subscription filter policy). When the publisher posts a new message to the topic, SNS attempts to match the incoming message attributes to the filter policy set on each subscription, to determine whether a particular subscriber is interested in that incoming event. If there is a match, SNS then pushes the message to the subscriber in question. The new attribute-based message filtering approach is depicted in the following diagram.
Message filtering in action
Look at how message filtering works. The following example is based on a sports merchandise ecommerce website, which publishes a variety of events to an SNS topic. The events range from checkout events (triggered when orders are placed or canceled) to buyers’ navigation events (triggered when product pages are visited). The code below is based on the existing AWS SDK for Python.
First, create the single SNS topic to which all shopping events are published.
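With the AWS SDK for Python (boto3), that might look like the following minimal sketch; the topic name is illustrative.

import boto3

sns = boto3.client("sns")

# Single topic that receives every shopping event
topic_arn = sns.create_topic(Name="shopping-events")["TopicArn"]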
Next, subscribe the endpoints that will be listening to those shopping events. The first subscriber is an SQS queue that is processed by a payment gateway, while the second subscriber is a Lambda function that indexes the buyer’s shopping interests against a search engine.
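Continuing the sketch, the two subscriptions might be created as shown below; the queue and function ARNs are placeholders for your own resources.

# Placeholder ARNs for the payment gateway queue and the search indexer function
payment_queue_arn = "arn:aws:sqs:us-east-1:123456789012:payment-gateway-queue"
search_indexer_arn = "arn:aws:lambda:us-east-1:123456789012:function:search-indexer"

payment_subscription = sns.subscribe(
    TopicArn=topic_arn, Protocol="sqs", Endpoint=payment_queue_arn
)
search_subscription = sns.subscribe(
    TopicArn=topic_arn, Protocol="lambda", Endpoint=search_indexer_arn
)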
A subscription filter policy is set as a subscription attribute, by the subscription owner, as a simple JSON object, containing a set of key-value pairs. This object defines the kind of event in which the subscriber is interested.
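The sketch below sets one filter policy per subscription, keyed on the event_type attribute used later in this example; the exact values are illustrative.

import json

# The payment gateway only cares about checkout events...
sns.set_subscription_attributes(
    SubscriptionArn=payment_subscription["SubscriptionArn"],
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"event_type": ["order_placed", "order_canceled"]}),
)

# ...while the search indexer only cares about navigation events.
sns.set_subscription_attributes(
    SubscriptionArn=search_subscription["SubscriptionArn"],
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"event_type": ["product_page_visited"]}),
)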
You’re now ready to start publishing events with attributes!
Message attributes allow you to provide structured metadata items (such as time stamps, geospatial data, event type, signatures, and identifiers) about the message. Message attributes are optional and separate from, but sent along with, the message body. You can include up to 10 message attributes with your message.
The first message published in this example is related to an order that has been placed on the ecommerce website. The message attribute “event_type” with the value “order_placed” matches only the filter policy associated with the payment gateway subscription. Therefore, only the SQS queue subscribed to the SNS topic is notified about this checkout event.
The second message published is related to a buyer’s navigation activity on the ecommerce website. The message attribute “event_type” with the value “product_page_visited” matches only the filter policy associated with the search engine subscription. Therefore, only the Lambda function subscribed to the SNS topic is notified about this navigation event.
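Publishing these two messages with boto3 might look like this; the message bodies are illustrative placeholders.

# Checkout event: matches only the payment gateway's filter policy
sns.publish(
    TopicArn=topic_arn,
    Message='{"order_id": "1234"}',  # illustrative payload
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "order_placed"}
    },
)

# Navigation event: matches only the search indexer's filter policy
sns.publish(
    TopicArn=topic_arn,
    Message='{"product_id": "football-shirt"}',  # illustrative payload
    MessageAttributes={
        "event_type": {"DataType": "String", "StringValue": "product_page_visited"}
    },
)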
The following diagram represents the architecture for this ecommerce website, with the message filtering mechanism in action. As described earlier, checkout events are pushed only to the SQS queue, whereas navigation events are pushed to the Lambda function only.
Message filtering criteria
It is important to remember the following things about subscription filter policy matching (a short illustration follows the list):
A subscription filter policy either matches an incoming message, or it doesn’t. It’s Boolean logic.
For a filter policy to match a message, the message must contain all the attribute keys listed in the policy.
Attributes of the message not mentioned in the filtering policy are ignored.
The value of each key in the filter policy is an array containing one or more values. The policy matches if any of the values in the array match the value in the corresponding message attribute.
If the value in the message attribute is an array, then the filter policy matches if the intersection of the policy array and the message array is non-empty.
The matching is exact (character-by-character), without case-folding or any other string normalization.
The values being matched follow JSON rules: Strings enclosed in quotes, numbers, and the unquoted keywords true, false, and null.
Number matching is at the string representation level. Example: 300, 300.0, and 3.0e2 aren’t considered equal.
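To make those rules concrete, here is a simplified, purely illustrative Python sketch of the matching behaviour for string attributes; it is not SNS’s implementation and ignores numeric and other advanced operators.

def matches(filter_policy, message_attributes):
    """Simplified illustration of filter-policy matching for string attributes."""
    for key, allowed_values in filter_policy.items():
        value = message_attributes.get(key)
        if value is None:
            return False                          # every policy key must be present
        if isinstance(value, list):
            # message attribute is an array: any overlap with the policy array matches
            if not set(value) & set(allowed_values):
                return False
        elif value not in allowed_values:         # exact, case-sensitive comparison
            return False
    return True

policy = {"event_type": ["order_placed", "order_canceled"]}
print(matches(policy, {"event_type": "order_placed"}))          # True
print(matches(policy, {"event_type": "product_page_visited"}))  # False: value not in policy array
print(matches(policy, {"customer": "premium"}))                 # False: event_type key missing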
When should I use message filtering?
We recommend using message filtering and grouping subscribers into a single topic only when all of the following are true:
Subscribers are semantically related to each other
Subscribers consume similar types of events
Subscribers are supposed to share the same access permissions on the topic
Technically, you could get away with creating a single topic for your entire domain to handle all event processing, even unrelated use cases, but this wouldn’t be recommended. This option could result in an unnecessarily large topic, which could potentially impact your message delivery latency. Also, you would lose the ability to implement fine-grained access control on your topics.
Finally, if you already use SNS, but had to add filtering logic in your subscribers or routing logic in your publishers (topic-based filtering), you can now immediately benefit from message filtering. This new approach lets you clean up any unnecessary logic in your components, and reduce the number of topics in your architecture.
Summary
As we’ve shown in this post, the new message filtering capability in Amazon SNS gives you a great amount of flexibility in your messaging pattern. It allows you to really simplify your pub/sub infrastructure requirements.
Message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). It’s now available in all AWS commercial regions, at no extra charge.
Here are a few ideas for next steps to get you started:
Add filter policies to your subscriptions on the SNS console,
One of the great benefits of Amazon S3 is the ability to host, share, or consume public data sets. This provides transparency into data to which an external data scientist or developer might not normally have access. By exposing the data to the public, you can glean many insights that would have been difficult with a data silo.
The openFDA project creates easy access to the high value, high priority, and public access data of the Food and Drug Administration (FDA). The data has been formatted and documented in consumer-friendly standards. Critical data related to drugs, devices, and food has been harmonized and can easily be called by application developers and researchers via API calls. OpenFDA has published two whitepapers that drill into the technical underpinnings of the API infrastructure as well as how to properly analyze the data in R. In addition, FDA makes openFDA data available on S3 in raw format.
In this post, I show how to use S3, Amazon EMR, and Amazon Athena to analyze the drug adverse events dataset. A drug adverse event is an undesirable experience associated with the use of a drug, including serious drug side effects, product use errors, product quality problems, and therapeutic failures.
Data considerations
Keep in mind that this data does have limitations. For example, in the United States, these adverse events are submitted to the FDA voluntarily by consumers, so there may not be reports for all events that occurred. There is no certainty that the reported event was actually due to the product. The FDA does not require that a causal relationship between a product and an event be proven, and reports do not always contain enough detail to evaluate an event. Because of this, there is no way to identify the true number of events. The important takeaway from all this is that the information contained in this data has not been verified to establish cause-and-effect relationships. Despite this disclaimer, many interesting insights and much value can be derived from the data to accelerate drug safety research.
Data analysis using SQL
For application developers who want to perform targeted searching and lookups, the API endpoints provided by the openFDA project are “ready to go” for software integration using a standard API powered by Elasticsearch, NodeJS, and Docker. However, for data analysis purposes, it is often easier to work with the data using SQL and statistical packages that expect a SQL table structure. For large-scale analysis, APIs often have query limits, such as 5000 records per query. This can cause extra work for data scientists who want to analyze the full dataset instead of small subsets of data.
To address the concern of requiring all the data in a single dataset, the openFDA project released the full 100 GB of harmonized data files that back the openFDA project onto S3. Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. It’s a quick and easy way to answer your questions about adverse events and aspirin that does not require you to spin up databases or servers.
While you could point tools directly at the openFDA S3 files, you can find greatly improved performance and use of the data by following some of the preparation steps later in this post.
Architecture
This post explains how to use the following architecture to take the raw data provided by openFDA, leverage several AWS services, and derive meaning from the underlying data.
Steps:
Load the openFDA /drug/event dataset into Spark and convert it to gzip to allow for streaming.
Transform the data in Spark and save the results as a Parquet file in S3.
Query the S3 Parquet file with Athena.
Perform visualization and analysis of the data in R and Python on Amazon EC2.
Optimizing public data sets: A primer on data preparation
Those who want to jump right into preparing the files for Athena may want to skip ahead to the next section.
Transforming, or pre-processing, files is a common task for using many public data sets. Before you jump into the specific steps for transforming the openFDA data files into a format optimized for Athena, I thought it would be worthwhile to provide a quick exploration of the problem.
Making a dataset in S3 efficiently accessible with minimal transformation for the end user has two key elements:
Partitioning the data into objects that contain a complete part of the data (such as data created within a specific month).
Using file formats that make it easy for applications to locate subsets of data (for example, gzip, Parquet, ORC, etc.).
With these two key elements in mind, you can now apply transformations to the openFDA adverse event data to prepare it for Athena. You might find the data techniques employed in this post to be applicable to many of the questions you might want to ask of the public data sets stored in Amazon S3.
Before you get started, I encourage those who are interested in doing deeper healthcare analysis on AWS to make sure that you first read the AWS HIPAA Compliance whitepaper. This covers the information necessary for processing and storing protected health information (PHI).
Also, the adverse event analysis shown for aspirin is strictly for demonstration purposes and should not be used for any real decision or taken as anything other than a demonstration of AWS capabilities. However, there have been robust case studies published that have explored a causal relationship between aspirin and adverse reactions using OpenFDA data. If you are seeking research on aspirin or its risks, visit organizations such as the Centers for Disease Control and Prevention (CDC) or the Institute of Medicine (IOM).
Preparing data for Athena
For this walkthrough, you will start with the FDA adverse events dataset, which is stored as JSON files within zip archives on S3. You then convert it to Parquet for analysis. Why do you need to convert it? The original data download is stored in objects that are partitioned by quarter.
Here is a small sample of what you find in the adverse events (/drugs/event) section of the openFDA website.
If you were looking for events that happened in a specific quarter, this is not a bad solution. For most other scenarios, such as looking across the full history of aspirin events, it requires you to access a lot of data that you won’t need. The zip file format is not ideal for using data in place because zip readers must have random access to the file, which means the data can’t be streamed. Additionally, the zip files contain large JSON objects.
To read the data in these JSON files, a streaming JSON decoder must be used, or a computer with a significant amount of RAM must decode the JSON. Opening up these files for public consumption is a great start. However, you still need to prepare the data with a few lines of Spark code so that the JSON can be streamed.
Step 1: Convert the file types
Using Apache Spark on EMR, you can extract all of the zip files and pull out the events from the JSON files. To do this, use the Scala code below to decompress the zip files and create text files. In addition, compress the JSON output with gzip to improve Spark’s performance and reduce your overall storage footprint. The Scala code can be run in either the Spark Shell or in an Apache Zeppelin notebook on your EMR cluster.
If you are unfamiliar with either Apache Zeppelin or the Spark Shell, the following posts serve as great references:
import scala.io.Source
import java.util.zip.ZipInputStream
import org.apache.spark.input.PortableDataStream
import org.apache.hadoop.io.compress.GzipCodec
// Input Directory
val inputFile = "s3://download.open.fda.gov/drug/event/2015q4/*.json.zip";
// Output Directory
val outputDir = "s3://{YOUR OUTPUT BUCKET HERE}/output/2015q4/";
// Extract zip files from the input directory
val zipFiles = sc.binaryFiles(inputFile);
// Process zip file to extract the json as text file and save it
// in the output directory
val rdd = zipFiles.flatMap((file: (String, PortableDataStream)) => {
  // Open the zip archive and read its JSON entry as lines of text
  val zipStream = new ZipInputStream(file._2.open)
  val entry = zipStream.getNextEntry
  val iter = Source.fromInputStream(zipStream).getLines
  iter
}).map(_.replaceAll("\\s+", "")).saveAsTextFile(outputDir, classOf[GzipCodec])
Step 2: Transform JSON into Parquet
With just a few more lines of Scala code, you can use Spark’s abstractions to convert the JSON into a Spark DataFrame and then export the data back to S3 in Parquet format.
Spark requires the JSON to be in JSON Lines format to be parsed correctly into a DataFrame.
// Output Parquet directory
val outputDir = "s3://{YOUR OUTPUT BUCKET NAME}/output/drugevents"
// Input json file
val inputJson = "s3://{YOUR OUTPUT BUCKET NAME}/output/2015q4/*"
// Load the DataFrame from the multiline JSON files
val df = spark.read.json(sc.wholeTextFiles(inputJson).values)
// Extract results from dataframe
val results = df.select("results")
// Save it to Parquet
results.write.parquet(outputDir)
Step 3: Create an Athena table
With the data cleanly prepared and stored in S3 using the Parquet format, you can now place an Athena table on top of it to get a better understanding of the underlying data.
Because the openFDA data structure incorporates several layers of nesting, it can be a complex process to try to manually derive the underlying schema in a Hive-compatible format. To shorten this process, you can load the top row of the DataFrame from the previous step into a Hive table within Zeppelin and then extract the “create table” statement from SparkSQL.
results.createOrReplaceTempView("data")
val top1 = spark.sql("select * from data tablesample(1 rows)")
top1.write.format("parquet").mode("overwrite").saveAsTable("drugevents")
val show_cmd = spark.sql("show create table drugevents").show(1, false)
This returns a “create table” statement that you can almost paste directly into the Athena console. Make some small modifications (adding the word “external” and replacing “using” with “stored as”), and then execute the code in the Athena query editor. The table is created.
For the openFDA data, the DDL returns all string fields, as the date format used in the dataset does not conform to the yyyy-mm-dd hh:mm:ss[.f…] format required by Hive. For your analysis, the string format works appropriately, but it would be possible to extend this code to use a Presto function to convert the strings into time stamps.
With the Athena table in place, you can start to explore the data by running ad hoc queries within Athena or doing more advanced statistical analysis in R.
Using SQL and R to analyze adverse events
Using the openFDA data with Athena makes it very easy to translate your questions into SQL code and perform quick analysis on the data. After you have prepared the data for Athena, you can begin to explore the relationship between aspirin and adverse drug events, as an example. One of the most common metrics to measure adverse drug events is the Proportional Reporting Ratio (PRR). It is defined as:
PRR = (m/n) / ((M-m)/(N-n))

where:
m = number of reports with the drug and the event
n = number of reports with the drug
M = number of reports with the event in the database
N = number of reports in the database
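As a purely illustrative calculation: if a drug appears in 1,000 reports (n), 100 of which mention the event (m), while the full database holds 100,000 reports (N), 5,000 of which mention the event (M), then PRR = (100/1,000) / ((5,000-100)/(100,000-1,000)) = 0.1/0.0495 ≈ 2.0, meaning the event is reported roughly twice as often with the drug as without it.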
Gastrointestinal haemorrhage has the highest PRR of any reaction to aspirin when viewed in aggregate. One question you may want to ask is how the PRR has trended on a yearly basis for gastrointestinal haemorrhage since 2005.
Using the following query in Athena, you can see the PRR trend of “GASTROINTESTINAL HAEMORRHAGE” reactions with “ASPIRIN” since 2005:
with drug_and_event as
(select rpad(receiptdate, 4, 'NA') as receipt_year
, reactionmeddrapt
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug_and_event
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
and medicinalproduct = 'ASPIRIN'
and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by reactionmeddrapt, rpad(receiptdate, 4, 'NA')
), reports_with_drug as
(
select rpad(receiptdate, 4, 'NA') as receipt_year
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_drug
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
and medicinalproduct = 'ASPIRIN'
group by rpad(receiptdate, 4, 'NA')
), reports_with_event as
(
select rpad(receiptdate, 4, 'NA') as receipt_year
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as reports_with_event
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
and reactionmeddrapt= 'GASTROINTESTINAL HAEMORRHAGE'
group by rpad(receiptdate, 4, 'NA')
), total_reports as
(
select rpad(receiptdate, 4, 'NA') as receipt_year
, count(distinct (concat(safetyreportid,receiptdate,reactionmeddrapt))) as total_reports
from fda.drugevents
where rpad(receiptdate,4,'NA')
between '2005' and '2015'
group by rpad(receiptdate, 4, 'NA')
)
select drug_and_event.receipt_year,
(1.0 * drug_and_event.reports_with_drug_and_event/reports_with_drug.reports_with_drug)/ (1.0 * (reports_with_event.reports_with_event- drug_and_event.reports_with_drug_and_event)/(total_reports.total_reports-reports_with_drug.reports_with_drug)) as prr
, drug_and_event.reports_with_drug_and_event
, reports_with_drug.reports_with_drug
, reports_with_event.reports_with_event
, total_reports.total_reports
from drug_and_event
inner join reports_with_drug on drug_and_event.receipt_year = reports_with_drug.receipt_year
inner join reports_with_event on drug_and_event.receipt_year = reports_with_event.receipt_year
inner join total_reports on drug_and_event.receipt_year = total_reports.receipt_year
order by drug_and_event.receipt_year
One nice feature of Athena is that you can quickly connect to it via R or any other tool that can use a JDBC driver to visualize the data and understand it more clearly.
With this quick R script that can be run in R Studio either locally or on an EC2 instance, you can create a visualization of the PRR and Reporting Odds Ratio (RoR) for “GASTROINTESTINAL HAEMORRHAGE” reactions from “ASPIRIN” since 2005 to better understand these trends.
# Connect to Athena (drv is the Athena JDBC driver object loaded via RJDBC)
conn <- dbConnect(drv, '<Your JDBC URL>',
                  s3_staging_dir = "<Your S3 Location>",
                  user = Sys.getenv("USER_NAME"),
                  password = Sys.getenv("USER_PASSWORD"))
# Declare Adverse Event
adverseEvent <- "'GASTROINTESTINAL HAEMORRHAGE'"
# Build SQL Blocks
sqlFirst <- "SELECT rpad(receiptdate, 4, 'NA') as receipt_year, count(DISTINCT safetyreportid) as event_count FROM fda.drugsflat WHERE rpad(receiptdate,4,'NA') between '2005' and '2015'"
sqlEnd <- "GROUP BY rpad(receiptdate, 4, 'NA') ORDER BY receipt_year"
# Extract Aspirin with adverse event counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN' AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
aspirinAdverseCount = dbGetQuery(conn,sql)
# Extract Aspirin counts
sql <- paste(sqlFirst,"AND medicinalproduct ='ASPIRIN'", sqlEnd,sep=" ")
aspirinCount = dbGetQuery(conn,sql)
# Extract adverse event counts
sql <- paste(sqlFirst,"AND reactionmeddrapt=",adverseEvent, sqlEnd,sep=" ")
adverseCount = dbGetQuery(conn,sql)
# All Drug Adverse event Counts
sql <- paste(sqlFirst, sqlEnd,sep=" ")
allDrugCount = dbGetQuery(conn,sql)
# Select correct rows
selAll = allDrugCount$receipt_year == aspirinAdverseCount$receipt_year
selAspirin = aspirinCount$receipt_year == aspirinAdverseCount$receipt_year
selAdverse = adverseCount$receipt_year == aspirinAdverseCount$receipt_year
# Calculate Numbers
m <- c(aspirinAdverseCount$event_count)
n <- c(aspirinCount[selAspirin,2])
M <- c(adverseCount[selAdverse,2])
N <- c(allDrugCount[selAll,2])
# Calculate proportional reporting ratio
PRR = (m/n)/((M-m)/(N-n))
# Calculate reporting Odds Ratio
d = n-m
D = N-M
ROR = (m/d)/(M/D)
# Plot the PRR and ROR
g_range <- range(0, PRR,ROR)
g_range[2] <- g_range[2] + 3
yearLen = length(aspirinAdverseCount$receipt_year)
ax <- aspirinAdverseCount$receipt_year   # x-axis labels (receipt years)
plot(PRR, type="o", col="blue", ylim=g_range, axes=FALSE, ann=FALSE)
axis(1, 1:yearLen, lab=ax)
axis(2, las=1, at=1*0:g_range[2])
box()
lines(ROR, type="o", pch=22, lty=2, col="red")
As you can see, the PRR and RoR have both remained fairly steady over this time range. With the R Script above, all you need to do is change the adverseEvent variable from GASTROINTESTINAL HAEMORRHAGE to another type of reaction to analyze and compare those trends.
Summary
In this walkthrough:
You used a Scala script on EMR to convert the openFDA zip files to gzip.
You then transformed the JSON blobs into flattened Parquet files using Spark on EMR.
You created an Athena DDL so that you could query these Parquet files residing in S3.
Finally, you pointed the R package at the Athena table to analyze the data without pulling it into a database or creating your own servers.
If you have questions or suggestions, please comment below.
Ryan Hood is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys watching the Cubs win the World Series and attempting to Sous-vide anything he can find in his refrigerator.
Vikram Anand is a Data Engineer for AWS. He works on big data projects leveraging the newest AWS offerings. In his spare time, he enjoys playing soccer and watching the NFL & European Soccer leagues.
Dave Rocamora is a Solutions Architect at Amazon Web Services on the Open Data team. Dave is based in Seattle and when he is not opening data, he enjoys biking and drinking coffee outside.
Every year, eighth-grade science teacher Michele Chamberlain challenges her students to find a solution to a real-world problem. The solution must be environmentally friendly, and must demonstrate their sense of global awareness.
Amelia with her project.
One of Michele’s students, 14-year-old Amelia Day, knew she wanted to create something that would help her practice her favourite sport, and approached Chamberlain with an idea for a football-related project.
“I know you said to choose a project you love,” Amelia explained. “I love soccer and I want to do something with engineering. I know I want to compete.”
Originally, the tool was built to help budding football players practise how to kick a ball correctly. The ball, tethered to a parasol shaft, uses a Raspberry Pi, LEDs, Bluetooth, and pressure points; together, these help athletes to connect with the ball with the right degree of force at the appropriate spot.
However, after a conversation with her teacher, it became apparent that Amelia’s ball could be used for so much more. As a result, the project was gradually redirected towards working with stroke therapy patients.
“It uses the aspect of a soccer training tool and that interface makes it fun, but it also uses Bluetooth audio feedback to rebuild the neural pathways inside the brain, and this is what is needed to recover from a stroke,” explains Amelia.
The video above comes as part of Amelia’s submission for the Discovery Education 3M Young Scientist Challenge 2016, a national competition for fifth- to eighth-grade students from across the USA.
One of the last ten finalists, Amelia travelled to 3M HQ in Minnesota this October, where she presented her project to a panel of judges. She placed third runner-up and received a cash prize.
Our very own Amelia Day placed 3rd runner up @ the 3M National Junior Scientist competition this week. Proud to call her a Hawk!✏️⚽️ #LMS
We’re always so proud to see young makers working to change the world and we wish Amelia the best of luck with her future. We expect to see great things from this Lakeridge Middle School Hawk.
So, England, nominally the home of football, is out of the European Championship, having lost to Iceland. Iceland is a country with a population of 330,000 hardy Vikings, whose national sport is handball. England’s population is over 53 million. And we invented soccer.
Iceland’s only football pitch is under snow for much of the year, and their part-time manager is a full-time dentist.
I think perhaps England should refocus their sporting efforts on something a little less challenging. Like table football. With a Raspberry Pi on hand, you can even make it feel stadium-like, with automatic goal detection, slow-motion instant replay, score-keeping, tallying for a league of competitors and more. Come on, nation. I feel that we could do quite well with this; and given that it cuts the size of the team down to two people, it’d keep player salaries at a minimum.
Demo of the foosball instant replay system. More info: https://github.com/swehner/foos and https://github.com/netsuso/foos-tournament. Music: http://freemusicarchive.org/music/Jahzzar/Blinded_by_dust/Magic_Mountain_1877
This build comes from Stefan Wehner, who has documented it meticulously on GitHub. You’ll find full build instructions and a parts list (which starts with a football table), along with all the code you’ll need.
Well done Iceland, by the way. We’re not bitter or anything.