Tag Archives: Reid

Welcome Daren – Datacenter Technician!

Post Syndicated from Yev original https://www.backblaze.com/blog/welcome-daren-datacenter-technician/

The datacenter team continues to expand and the latest person to join the team is Daren! He’s very well versed with our infrastructure and is a welcome addition to the caregivers for our ever-growing fleet!

What is your Backblaze Title?
Datacenter Technician.

Where are you originally from?
Fair Oaks, CA.

What attracted you to Backblaze?
The Pods! I’ve always thought Backblaze had a great business concept and I wanted to be a part of the team that helps build it and make it a huge success.

What do you expect to learn while being at Backblaze?
Everything about Backblaze and what makes it tick.

Where else have you worked?
Sungard Availability Services, ASC Profiles, and Reid’s Family Martial Arts.

Where did you go to school?
American River College and Techskills of California.

What’s your dream job?
I’ve always had an interest in architecture. I’m not sure how good I would be at it, but building design is something I would have liked to try.

Favorite place you’ve traveled?
My favorite place to travel is the Philippines. I have a lot of family there, and I mostly like to visit the smaller villages far from the busy city life. White sandy beaches, family, and Lumpia!

Favorite hobby?
Martial Arts – it’s challenging, great exercise, and a lot of fun!

Star Trek or Star Wars?
Whatever my boss likes.

Coke or Pepsi?
Coke.

Favorite food?
One of my favorite foods is Lumpia. It’s the cousin of the egg roll, but much more amazing: a thin pastry wrapper with a mixture of fillings, typically chopped vegetables, ground beef or pork, and potatoes.

Why do you like certain things?
I like certain things that take me to places I have never been before.

Anything else you’d like to tell us?
I am excited to be a part of the Backblaze team.

Welcome aboard Daren! We’d love to try some of that lumpia sometime!

The post Welcome Daren – Datacenter Technician! appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Presenting AWS IoT Analytics: Delivering IoT Analytics at Scale and Faster than Ever Before

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/launch-presenting-aws-iot-analytics/

One of the technology areas I thoroughly enjoy is the Internet of Things (IoT). Even as a child I used to infuriate my parents by taking apart the toys they would purchase for me to see how they worked and if I could somehow put them back together. It seems I was somehow destined to end up in the tough and ever-changing world of technology. Therefore, it’s no wonder that I am really enjoying learning and tinkering with IoT devices and technologies. It combines my love of development and software engineering with my curiosity around circuits, controllers, and other facets of the electrical engineering discipline, even though I can’t claim to be an electrical engineer.

Despite all of the information that is collected by the deployment of IoT devices and solutions, I honestly never really thought about the need to analyze, search, and process this data until I came up against a scenario where it became of the utmost importance to be able to search and query through loads of sensor data for the occurrence of an anomaly. Of course, I understood the importance of analytics for businesses to make accurate decisions and predictions to drive the organization’s direction. But it didn’t occur to me initially how important it was to make analytics an integral part of my IoT solutions. Well, I learned my lesson just in time, because at this re:Invent a service is launching to make it easier for anyone to process and analyze IoT messages and device data.

 

Hello, AWS IoT Analytics!  AWS IoT Analytics is a fully managed service of AWS IoT that provides advanced analysis of the data collected from your IoT devices.  With the AWS IoT Analytics service, you can process messages, gather and store large amounts of device data, and query your data. The new service also integrates with Amazon QuickSight for visualization of your data and brings the power of machine learning through integration with Jupyter Notebooks.

Benefits of AWS IoT Analytics

  • Helps with predictive analysis of data by providing access to pre-built analytical functions
  • Provides the ability to visualize analytical output from the service
  • Provides tools to clean up data
  • Can help identify patterns in the gathered data

Be In the Know: IoT Analytics Concepts

  • Channel: archives the raw, unprocessed messages and collects data from MQTT topics.
  • Pipeline: consumes messages from channels and allows message processing.
    • Activities: perform transformations on your messages, including filtering attributes and invoking Lambda functions for advanced processing.
  • Data Store: Used as a queryable repository for processed messages. Provides the ability to have multiple data stores for messages coming from different devices or locations, or filtered by message attributes.
  • Data Set: A data retrieval view from a data store that can be generated on a recurring schedule. (A hedged API sketch of these building blocks follows this list.)
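To make the concepts above concrete, here is a minimal sketch of creating the same building blocks programmatically with the AWS SDK for Python (boto3) and its iotanalytics client. The resource names are made up, and the exact parameter shapes should be checked against the boto3 reference; the console walkthrough below remains the primary path in this post.

import boto3

iota = boto3.client('iotanalytics')

# channel: archives the raw, unprocessed messages
iota.create_channel(channelName='demo_channel')

# data store: queryable repository for processed messages
iota.create_datastore(datastoreName='demo_datastore')

# pipeline: consumes from the channel and writes to the data store;
# filter or lambda activities could be inserted between the two steps
iota.create_pipeline(
    pipelineName='demo_pipeline',
    pipelineActivities=[
        {'channel': {'name': 'from_channel',
                     'channelName': 'demo_channel',
                     'next': 'to_datastore'}},
        {'datastore': {'name': 'to_datastore',
                       'datastoreName': 'demo_datastore'}},
    ],
)

# data set: a SQL view over the data store, optionally on a recurring schedule
iota.create_dataset(
    datasetName='demo_dataset',
    actions=[{'actionName': 'sql_action',
              'queryAction': {'sqlQuery': 'SELECT * FROM demo_datastore'}}],
)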

Getting Started with AWS IoT Analytics

First, I’ll create a channel to receive incoming messages.  This channel can be used to ingest data sent to the channel via MQTT or messages directed from the Rules Engine. To create a channel, I’ll select the Channels menu option and then click the Create a channel button.

I’ll name my channel TaraIoTAnalyticsID and give the Channel an MQTT topic filter of Temperature. To complete the creation of my channel, I will click the Create Channel button.

Now that I have my Channel created, I need to create a Data Store to receive and store the messages received on the Channel from my IoT device. Remember, you can set up multiple Data Stores for more complex solution needs, but I’ll just create one Data Store for my example. I’ll select Data Stores from the menu panel and click Create a data store.

 

I’ll name my Data Store TaraDataStoreID, and once I click the Create the data store button, I will have successfully set up a Data Store to house messages coming from my Channel.

Now that I have my Channel and my Data Store, I will need to connect the two using a Pipeline. I’ll create a simple pipeline that just connects my Channel and Data Store, but you can create a more robust pipeline to process and filter messages by adding Pipeline activities like a Lambda activity.

To create a pipeline, I’ll select the Pipelines menu option and then click the Create a pipeline button.

I will not add an Attribute for this pipeline, so I will click the Next button.

As discussed, there are additional pipeline activities that I can add to my pipeline for the processing and transformation of messages, but I will keep my first pipeline simple and hit the Next button.

The final step in creating my pipeline is for me to select my previously created Data Store and click Create Pipeline.

All that is left for me to take advantage of the AWS IoT Analytics service is to create an IoT rule that sends data to an AWS IoT Analytics channel.  Wow, that was a super easy process to set up analytics for IoT devices.
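If you prefer to script that last step, the sketch below uses the AWS IoT create_topic_rule API via boto3 to forward messages from the Temperature topic into the channel created above. The rule name, role ARN, and account ID are placeholders, and the iotAnalytics action payload should be verified against the AWS IoT API reference before use.

import boto3

iot = boto3.client('iot')

# hypothetical rule: send everything published to 'Temperature' to the channel
iot.create_topic_rule(
    ruleName='TemperatureToAnalytics',
    topicRulePayload={
        'sql': "SELECT * FROM 'Temperature'",
        'actions': [{
            'iotAnalytics': {
                'channelName': 'TaraIoTAnalyticsID',
                # IAM role that allows AWS IoT to write into the channel
                'roleArn': 'arn:aws:iam::123456789012:role/iot-analytics-role',
            }
        }],
    },
)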

If I wanted to create a Data Set as a result of queries run against my data for visualization with Amazon QuickSight, or integrate with Jupyter Notebooks to perform more advanced analytical functions, I can choose the Analyze menu option to bring up the screens to create data sets and access the Jupyter Notebook instances.

Summary

As you can see, it was a very simple process to set up the advanced data analysis for AWS IoT. With AWS IoT Analytics, you have the ability to collect, visualize, process, query and store large amounts of data generated from your AWS IoT connected device. Additionally, you can access the AWS IoT Analytics service in a myriad of different ways: the AWS Command Line Interface (AWS CLI), the AWS IoT API, language-specific AWS SDKs, and AWS IoT Device SDKs.

AWS IoT Analytics is available today for you to dig into the analysis of your IoT data. To learn more about AWS IoT and AWS IoT Analytics go to the AWS IoT Analytics product page and/or the AWS IoT documentation.

Tara

The TSA’s Selective Laptop Ban

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/03/the_tsas_select.html

Last Monday, the TSA announced a peculiar new security measure to take effect within 96 hours. Passengers flying into the US on foreign airlines from eight Muslim countries would be prohibited from carrying aboard any electronics larger than a smartphone. They would have to be checked and put into the cargo hold. And now the UK is following suit.

It’s difficult to make sense of this as a security measure, particularly at a time when many people question the veracity of government orders, but other explanations are either unsatisfying or damning.

So let’s look at the security aspects of this first. Laptop computers aren’t inherently dangerous, but they’re convenient carrying boxes. This is why, in the past, TSA officials have demanded passengers turn their laptops on: to confirm that they’re actually laptops and not laptop cases emptied of their electronics and then filled with explosives.

Forcing a would-be bomber to put larger laptops in the plane’s hold is a reasonable defense against this threat, because it increases the complexity of the plot. Both the shoe-bomber Richard Reid and the underwear bomber Umar Farouk Abdulmutallab carried crude bombs aboard their planes with the plan to set them off manually once aloft. Setting off a bomb in checked baggage is more work, which is why we don’t see more midair explosions like Pan Am Flight 103 over Lockerbie, Scotland, in 1988.

Security measures that restrict what passengers can carry onto planes are not unprecedented either. Airport security regularly responds to both actual attacks and intelligence regarding future attacks. After the liquid bombers were captured in 2006, the British banned all carry-on luggage except passports and wallets. I remember talking with a friend who traveled home from London with his daughters in those early weeks of the ban. They reported that airport security officials confiscated every tube of lip balm they tried to hide.

Similarly, the US started checking shoes after Reid, installed full-body scanners after Abdulmutallab and restricted liquids in 2006. But all of those measures were global, and most lessened in severity as the threat diminished.

This current restriction implies some specific intelligence of a laptop-based plot and a temporary ban to address it. However, if that’s the case, why only certain non-US carriers? And why only certain airports? Terrorists are smart enough to put a laptop bomb in checked baggage from the Middle East to Europe and then carry it on from Europe to the US.

Why not require passengers to turn their laptops on as they go through security? That would be a more effective security measure than forcing them to check them in their luggage. And lastly, why is there a delay between the ban being announced and it taking effect?

Even more confusing, the New York Times reported that “officials called the directive an attempt to address gaps in foreign airport security, and said it was not based on any specific or credible threat of an imminent attack.” The Department of Homeland Security FAQ page makes this general statement, “Yes, intelligence is one aspect of every security-related decision,” but doesn’t provide a specific security threat. And yet a report from the UK states the ban “follows the receipt of specific intelligence reports.”

Of course, the details are all classified, which leaves all of us security experts scratching our heads. On the face of it, the ban makes little sense.

One analysis painted this as a protectionist measure targeted at the heavily subsidized Middle Eastern airlines by hitting them where it hurts the most: high-paying business class travelers who need their laptops with them on planes to get work done. That reasoning makes more sense than any security-related explanation, but doesn’t explain why the British extended the ban to UK carriers as well. Or why this measure won’t backfire when those Middle Eastern countries turn around and ban laptops on American carriers in retaliation. And one aviation official told CNN that an intelligence official informed him it was not a “political move.”

In the end, national security measures based on secret information require us to trust the government. That trust is at historic low levels right now, so people both in the US and other countries are rightly skeptical of the official unsatisfying explanations. The new laptop ban highlights this mistrust.

This essay previously appeared on CNN.com.

EDITED TO ADD: Here are two essays that look at the possible political motivations, and fallout, of this ban. And the EFF rightly points out that letting a laptop out of your hands and sight is itself a security risk — for the passenger.

Running R on Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/running-r-on-amazon-athena/

Data scientists are often concerned about managing the infrastructure behind big data platforms while running SQL on R. Amazon Athena is an interactive query service that works directly with data stored in S3 and makes it easy to analyze data using standard SQL without the need to manage infrastructure. Integrating R with Amazon Athena gives data scientists a powerful platform for building interactive analytical solutions.

In this blog post, you’ll connect R/RStudio running on an Amazon EC2 instance with Athena.

Prerequisites

Before you get started, complete the following steps.

    1. Have your AWS account administrator give your AWS account the required permissions to access Athena via Amazon’s Identity and Access Management (IAM) console. This can be done by attaching the associated Athena policies to your data scientist user group in IAM.

 

RAthena_1

  2. Provide a staging directory in the form of an Amazon S3 bucket. Athena will use this to query datasets and store results. We’ll call this staging bucket s3://athenauser-athena-r in the instructions that follow. (A short boto3 sketch for creating this bucket appears after the note below.)

NOTE: In this blog post, I create all AWS resources in the US-East region. Use the Region Table to check the availability of Athena in other regions.
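If you would rather create the staging bucket from code than from the console, a minimal boto3 sketch follows. It uses the example bucket name from the step above and, per the note, assumes the US-East (N. Virginia) region, where no LocationConstraint is needed.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# staging bucket Athena will use for query results (name from the example above)
# bucket names are globally unique, so substitute your own prefix
s3.create_bucket(Bucket='athenauser-athena-r')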

Set up R and RStudio on EC2 

  1. Follow the instructions in the blog post “Running R on AWS” to set up R on an EC2 instance (t2.medium or greater) running Amazon Linux. Read the step below before you begin.
  2. In that blog post under “Advanced Details,” when you reach step 3, use the following bash script to install the latest version of RStudio. Modify the password for RStudio as needed.
#!/bin/bash
#install R
yum install -y R
#install RStudio-Server
wget https://download2.rstudio.org/rstudio-server-rhel-1.0.136-x86_64.rpm
yum install -y --nogpgcheck rstudio-server-rhel-1.0.136-x86_64.rpm
#add user(s)
useradd rstudio
echo rstudio:rstudio | chpasswd

Install Java 8 

  1. SSH into this EC2 instance.
  2. Remove older versions of Java.
  3. Install Java 8. This is required to work with Athena.
  4. Run the following commands on the command line.
#install Java 8, select 'y' from options presented to proceed with installation
sudo yum install java-1.8.0-openjdk-devel
#remove version 7 of Java, select 'y' from options to proceed with removal
sudo yum remove java-1.7.0-openjdk
#configure java, choose 1 as your selection option for java 8 configuration
sudo /usr/sbin/alternatives --config java
#run command below to add Java support to R
sudo R CMD javareconf

#following libraries are required for the interactive application we build later
sudo yum install -y libpng-devel
sudo yum install -y libjpeg-turbo-devel

Set up .Renviron

You need to configure the R environment variable .Renviron with the required Athena credentials.

  1. Get the required credentials from your AWS Administrator in the form of AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  2. Type the following command from the Linux command prompt to bring up the vi editor.
sudo vim /home/rstudio/.Renviron

Provide your Athena credentials in the following form into the editor:
ATHENA_USER=<AWS_ACCESS_KEY_ID>
ATHENA_PASSWORD=<AWS_SECRET_ACCESS_KEY>
  3. Save this file and exit from the editor.

Log in to RStudio

Next, you’ll log in to RStudio on your EC2 instance.

  1. Get the public IP address of your instance from the EC2 dashboard and paste it, followed by :8787 (the port number for RStudio), into your browser window.
  2. Confirm that your IP address has been whitelisted for inbound access to port 8787 as part of the configuration for the security group associated with your EC2 instance. (One way to add this inbound rule programmatically is sketched after this list.)
  3. Log in to RStudio with the username and password you provided previously.
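If port 8787 is not yet open, one way to add the inbound rule is with boto3, as sketched below. The security group ID and the CIDR are placeholders; substitute the security group attached to your EC2 instance and your own public IP address.

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# hypothetical security group ID; restrict access to your own IP only
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 8787,                             # RStudio Server port
        'ToPort': 8787,
        'IpRanges': [{'CidrIp': '203.0.113.10/32'}],  # your workstation IP
    }],
)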

Install R packages

Next, you’ll install and load the required R packages.

#--following R packages are required for connecting R with Athena
install.packages("rJava")
install.packages("RJDBC")
library(rJava)
library(RJDBC)

#--following R packages are required for the interactive application we build later
#--steps below might take several minutes to complete
install.packages(c("plyr","dplyr","png","RgoogleMaps","ggmap"))
library(plyr)
library(dplyr)
library(png)
library(RgoogleMaps)
library(ggmap)

Connect to Athena

The following steps in R download the Athena driver and set up the required connection. Use the JDBC URL associated with your region.

#verify Athena credentials by inspecting results from command below
Sys.getenv()
#set up URL to download Athena JDBC driver
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
fil
#set up driver connection to JDBC
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")
#connect to Athena using the driver, S3 working directory and credentials for Athena 
#replace 'athenauser' below with prefix you have set up for your S3 bucket
con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
s3_staging_dir="s3://athenauser-athena-r",
user=Sys.getenv("ATHENA_USER"),
password=Sys.getenv("ATHENA_PASSWORD"))
#in case of error or warning from step above ensure rJava and RJDBC packages have been loaded
#also ensure you have Java 8 running and configured for R as outlined earlier

Now you’re ready to start querying Athena from RStudio. 

Sample Queries to test

# get a list of all tables currently in Athena 
dbListTables(con)
# run a sample query
dfelb=dbGetQuery(con, "SELECT * FROM sampledb.elb_logs limit 10")
head(dfelb,2)

RAthena_2

Interactive Use Case

Next, you’ll practice interactively querying Athena from R for analytics and visualization. For this purpose, you’ll use GDELT, a publicly available dataset hosted on S3.

Create a table in Athena from R using the GDELT dataset. This step can also be performed from the AWS management console as illustrated in the blog post “Amazon Athena – Interactive SQL Queries for Data in Amazon S3.”

#---sql  create table statement in Athena
dbSendQuery(con, 
"
CREATE EXTERNAL TABLE IF NOT EXISTS sampledb.gdeltmaster (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT,
SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt'
;
"
)



dbListTables(con)

You should see this newly created table named ‘gdeltmaster’ appear in your RStudio console after executing the statement above.

RAthena_3

Query this Athena table to get a count of all CAMEO events that took place in the US in 2015.

#--get count of all CAMEO events that took place in US in year 2015 
#--save results in R dataframe
dfg<-dbGetQuery(con,"SELECT eventcode,count(*) as count
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
group by eventcode
order by eventcode desc"
)
str(dfg)
head(dfg,2)

RAthena_4

#--get list of top 5 most frequently occurring events in US in 2015
dfs=head(arrange(dfg,desc(count)),5)
dfs

RAthena_5

From the R output shown above, you can see that CAMEO event “042” has the highest count. From the CAMEO manual, you can determine that this event has the description “Travel to another location for a meeting or other event.”

Next, you’ll use the knowledge gained from this analysis to get a list of all geo-coordinates associated with this specific event from the Athena table.

#--get a list of latitude and longitude associated with event “042” 
#--save results in R dataframe
dfgeo<-dbGetQuery(con,"SELECT actiongeo_lat,actiongeo_long
FROM sampledb.gdeltmaster
where year = 2015 and ActionGeo_CountryCode IN ('US')
and eventcode = '042'
"
)
#--duration of above query will depend on factors like size of chosen EC2 instance
#--now rename columns in dataframe for brevity
names(dfgeo)[names(dfgeo)=="actiongeo_lat"]="lat"
names(dfgeo)[names(dfgeo)=="actiongeo_long"]="long"
names(dfgeo)
#let us inspect this R dataframe
str(dfgeo)
head(dfgeo,5)

RAthena_6
Next, generate a map for the United States. 

#--generate map for the US using the ggmap package
map=qmap('USA',zoom=3)
map

RAthena_7

Now you’ll plot the geodata obtained from your Athena table onto this map. This will help you visualize all US locations where these events had occurred in 2015. 

#--plot our geo-coordinates on the US map
map + geom_point(data = dfgeo, aes(x = dfgeo$long, y = dfgeo$lat), color="blue", size=0.5, alpha=0.5)

RAthena_8

By visually inspecting the results, you can determine that this specific event was heavily concentrated in the Northeastern part of the US.

Conclusion

You’ve learned how to build a simple interactive application with Athena and R. Athena can be used to store and query the underlying data for your big data applications using standard SQL, while R can be used to interactively query Athena and generate analytical insights using the powerful set of libraries that R provides.

If you have questions or suggestions, please leave your feedback in the comments.

 


About the Author

Gopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movie related, and is fond of old classics like Asterix and Obelix comics and Hitchcock movies.

 

 


Related

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight


 

Why You Should Speak At & Attend LinuxConf Australia

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2016/08/04/lca2016.html

[ This blog was crossposted on Software Freedom Conservancy’s website. ]

Monday 1 February 2016 was the longest day of my life, but I don’t mean
that in the canonical, figurative, and usually negative sense of that
phrase. I mean it literally and in a positive way. I woke up that morning
in Amsterdam in the Netherlands — having the previous night taken an
evening train from Brussels, Belgium with my friend and colleague Tom
Marble. Tom and I had just spent the weekend at FOSDEM 2016, where he and
I co-organize the Legal and Policy Issues DevRoom (with our mutual friends
and colleagues, Richard Fontana and Karen M. Sandler).

Tom and I headed over to AMS airport around 07:00 local time, found some
breakfast and boarded our flights. Tom was homeward bound, but I was about
to do the crazy thing that he’d done in the reverse a few years before: I
was speaking at FOSDEM and LinuxConf Australia, back-to-back. In fact,
because the airline fares were substantially cheaper this way, I didn’t
book a “round the world” flight, but instead two back-to-back
round-trip tickets. I boarded the plane at AMS at 09:30 that morning
(local time), and landed in my (new-ish) hometown of Portland, OR as
afternoon there began. I went home, spent the afternoon with my wife,
sister-in-law, and dogs, washed my laundry, and repacked my bag. My flight
to LAX departed at 19:36 local time, a little after US/Pacific sunset.

I crossed the Pacific Ocean and the international dateline, leaving a day
on deposit to pick up on the way back. After 24 hours of almost literally
chasing the sun, I arrived in Melbourne on the morning of Wednesday 3
February, rode a shuttle bus, dumped my bags at my room, and arrived just
in time for the Wednesday afternoon tea break at LinuxConf Australia 2016
in Geelong.

Nearly everyone who heard this story — or saw me while it was
happening — asked me the same question: Why are you doing
this? The five to six people packed in with me in my coach section on
the LAX→SYD leg are probably still asking this, because I had an
allergic attack of some sort most of the flight and couldn’t stop coughing,

But, nevertheless, I gave a simple answer to everyone who questioned my
crazy BRU→AMS→PDX→LAX→SYD→MEL itinerary: FOSDEM and LinuxConf AU are
two of the most important events on the Free Software annual calendar.
There’s just no question. I’ll write more about FOSDEM sometime soon, but
the rest of this post, I’ll dedicate to LinuxConf Australia (LCA).

One of my biggest regrets in Free Software is that I was once — and
you’ll be surprised by this given my story above — a bit squeamish
about the nearly 15 hour flight to get from the USA to Australia, and
therefore I didn’t attend LCA until 2015. LCA began way back in 1999.
Keep in mind that, other than FOSDEM, no major, community-organized events
have survived from that time. But LCA has the culture and mindset of the
kinds of conferences that our community made in 1999.

LCA is community organized and operated. Groups of volunteers
each year plan the event. In the tradition of science fiction conventions
and other hobbyist activities, groups bid for the conference and offer
their time and effort to make the conference a success. They have an
annual hand-off meeting to be sure the organization lessons are passed from
one committee to the next, and some volunteers even repeat their
involvement year after year. For organizational structure, they rely on a
non-profit organization, Linux Australia, to assist with handling the
funds and providing
infrastructure (just like Conservancy does for our member projects and
their conferences!)

I believe fully that the success of software freedom and GNU/Linux in
particular has not primarily been because companies allow developers to
spend some of their time coding on upstream. Sure, many Free Software
projects couldn’t survive without that component, but what really makes
GNU/Linux, or any Free Software project, truly special is that there’s a
community of users and developers who use, improve, and learn about the
software because it excites and interests them. LCA is one of the few
events specifically designed to invite that sort of person to attend, and
it has for almost an entire generation stood in stark contrast to the
highly corporate, for-profit events that slowly took over our community in
the years that followed LCA’s founding. (Remember all those years of
LinuxWorld Expo? I wasn’t even sad when IDG stopped running it!)

Speaking particularly of earlier this year, LCA 2016 in Geelong, Australia
was a particularly profound event for me. LCA is one of the few events that
accepts my rather political talks about what’s happening in Open Source and
Free Software, so I gave a talk on Friday 5 February 2016 entitled Copyleft
For the Next Decade: A Comprehensive Plan, which was recorded, so you can
watch it. I do
warn everyone that the jokes did not go over well (mine never do), so after I
finished, I was feeling a bit down that I hadn’t made the talk entertaining
enough. But then, something amazing happened: people started walking up to
me and telling me how important my message was. One individual even came up
and told me that he was excited enough that he’d like to match any
donation that Software Freedom Conservancy received during LCA 2016.
Since it was the last day of the event, I quickly went to one of the
organizers, Kathy Reid, and asked
if they would announce this match during the closing ceremonies; she agreed.
In a matter of just an hour or two, I’d gone from believing my talk had
fallen flat to realizing that — regardless of whether I’d presented
well — the concepts I discussed had connected with people.

Then, I sat down in the closing session. I started to tear up slightly
when the organizers announced the donation match. Within 90 seconds,
though, that turned to full tears of joy when the incoming President of
Linux Australia, Hugh Blemings, came on stage and said:

[I’ll start with] a Software Freedom Conservancy thing, as it turns out.
… I can tell that most of you weren’t at Bradley’s talk earlier on
today, but if there is one talk I’d encourage you to watch on the
playback later it would be that one. There’s a very very important
message in there and something to take away for all of us. On behalf of
the Council I’d like to announce … that we’re actually in the
process of making a significant donation from Linux Australia to Software
Freedom Conservancy as well. I urge all of you to consider contributing
individual as well, and there is much left for us to be done as a
community on that front.

I hope that this post helps organizers of events like LCA fully understand
how much something like this means to those of us who run small charities —
and not just with regard to the financial contributions. Knowing that the
organizers of community events feel so strongly positive about our work
really keeps us going. We work hard and spend much time at Conservancy to
serve the Open Source and Free Software community, and knowing the work is
appreciated inspires us to keep working. Furthermore, we know that without
these events, it’s much tougher for us to reach others with our message of
software freedom. So, for us, the feeling is mutual: I’m delighted that
the Linux Australia and LCA folks feel so positively about Conservancy, and
I now look forward to another 15 hour flight for the next LCA.

And, on that note, I chose a strategic time to post this story. On Friday
5 August 2016, the CFP for LCA 2017 closes. So, now is the time for all of
you to submit a talk. If
you regularly speak at Open Source and Free Software events, or have been
considering it, this event really needs to be on your calendar. I look
forward to seeing all of you in Hobart this January.

Anonymization and the Law

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2016/07/anonymization_a.html

Interesting paper: “Anonymization and Risk,” by Ira S. Rubinstein and Woodrow Hartzog:

Abstract: Perfect anonymization of data sets has failed. But the process of protecting data subjects in shared information remains integral to privacy practice and policy. While the deidentification debate has been vigorous and productive, there is no clear direction for policy. As a result, the law has been slow to adapt a holistic approach to protecting data subjects when data sets are released to others. Currently, the law is focused on whether an individual can be identified within a given set. We argue that the better locus of data release policy is on the process of minimizing the risk of reidentification and sensitive attribute disclosure. Process-based data release policy, which resembles the law of data security, will help us move past the limitations of focusing on whether data sets have been “anonymized.” It draws upon different tactics to protect the privacy of data subjects, including accurate deidentification rhetoric, contracts prohibiting reidentification and sensitive attribute disclosure, data enclaves, and query-based strategies to match required protections with the level of risk. By focusing on process, data release policy can better balance privacy and utility where nearly all data exchanges carry some risk.

Exploring Geospatial Intelligence using SparkR on Amazon EMR

Post Syndicated from Gopal Wunnava original https://blogs.aws.amazon.com/bigdata/post/Tx1MECZ47VAV84F/Exploring-Geospatial-Intelligence-using-SparkR-on-Amazon-EMR

Gopal Wunnava is a Senior Consultant with AWS Professional Services

The number of data sources that use location, such as smartphones and the sensor devices used in IoT (the Internet of Things), is expanding rapidly. This explosion has increased demand for analyzing spatial data.

Geospatial intelligence (GEOINT) allows you to analyze data that has geographical or spatial dimensions and present it based on its location. GEOINT can be applied to many industries, including social and environmental sciences, law enforcement, defense, and weather and disaster management. In this blog post, I show you how to build a simple GEOINT application using SparkR that will allow you to appreciate GEOINT capabilities.

R has been a popular platform for GEOINT due to its wide range of functions and packages related to spatial analysis. SparkR provides a great solution for overcoming known limitations in R because it lets you run geospatial applications in a distributed environment provided by the Spark engine. By implementing your SparkR geospatial application on Amazon EMR, you combine  these benefits with the flexibility and ease-of-use provided by EMR.

Overview of a GEOINT application

You’ll use the GDELT project for building this GEOINT application. The GDELT dataset you will use is available on Amazon S3  in the form of a tab-delimited text file with an approximate size of 13 GB.

Your GEOINT application will generate images like the one below.

Building your GEOINT application

Create an EMR cluster (preferably with the latest AMI version) in your region of choice, specifying Spark, Hive, and Ganglia.  To learn how to create a cluster, see Getting Started: Analyzing Big Data with Amazon EMR.

I suggest the r3 instance family for this application (1 master and 8 core nodes, all r3.8xlarge in this case), as it is suitable for the type of SparkR application you will create here. While I have chosen this cluster size for better performance, a smaller cluster size could work as well.
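If you prefer to script the cluster creation instead of using the console, the following rough boto3 sketch launches an equivalent cluster (Spark, Hive, and Ganglia; 1 master plus 8 core r3.8xlarge nodes). The release label, key pair, and IAM role names are placeholders to adjust for your account, and a smaller InstanceCount works fine for experimentation.

import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='sparkr-geoint',
    ReleaseLabel='emr-4.7.2',                  # pick a current release
    Applications=[{'Name': 'Spark'}, {'Name': 'Hive'}, {'Name': 'Ganglia'}],
    Instances={
        'MasterInstanceType': 'r3.8xlarge',
        'SlaveInstanceType': 'r3.8xlarge',
        'InstanceCount': 9,                    # 1 master + 8 core nodes
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'my-key-pair',           # placeholder key pair
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
    VisibleToAllUsers=True,
)
print(response['JobFlowId'])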

After your cluster is ready and you SSH into the master node, run the following command to install the files required for this application to run on EMR:

sudo yum install libjpeg-turbo-devel

For this GEOINT application, you identify and display locations where certain events of interest related to the economy are taking place in the US. For a more detailed description of these events, see the CAMEO manual available from the GDELT website.

You can use either the SparkR shell (type sparkR on the command line) or RStudio to develop this GEOINT application. To learn how to configure RStudio on EMR, see the Crunching Statistics at Scale with SparkR on Amazon EMR blog post.

You need to install the required R packages for this application onto the cluster. This can be done by executing the following R statement:

install.packages(c("plyr","dplyr","mapproj","RgoogleMaps","ggmap"))

Note: The above step can take up to thirty minutes because a number of dependent packages must be installed onto your EMR cluster.

After the required packages are installed, the next step is to load these packages into your R environment:

library(plyr)
library(dplyr)
library(mapproj)
library(RgoogleMaps)
library(ggmap)

You can save the images generated by this application as a PDF document. Unless you use the setwd() function to set your desired path for this file, it defaults to your current working directory.

setwd("/home/hadoop")
pdf("SparkRGEOINTEMR.pdf")

If you are using RStudio, the plots appear in the lower-right corner of your workspace.

Now, create the Hive context that is required to access the external table from within the Spark environment:

#set up Hive context
hiveContext <- sparkRHive.init(sc)

Note: If you are using the SparkR shell in EMR, the spark context ‘sc’ is created automatically for you.  If you are using RStudio, follow the instructions in the Crunching Statistics at Scale with SparkR on Amazon EMR blog post to create the Spark context.

Next, create an external table that points to your source GDELT dataset on S3.

sql(hiveContext,
"
CREATE EXTERNAL TABLE IF NOT EXISTS gdelt (
GLOBALEVENTID BIGINT,
SQLDATE INT,
MonthYear INT,
Year INT,
FractionDate DOUBLE,
Actor1Code STRING,
Actor1Name STRING,
Actor1CountryCode STRING,
Actor1KnownGroupCode STRING,
Actor1EthnicCode STRING,
Actor1Religion1Code STRING,
Actor1Religion2Code STRING,
Actor1Type1Code STRING,
Actor1Type2Code STRING,
Actor1Type3Code STRING,
Actor2Code STRING,
Actor2Name STRING,
Actor2CountryCode STRING,
Actor2KnownGroupCode STRING,
Actor2EthnicCode STRING,
Actor2Religion1Code STRING,
Actor2Religion2Code STRING,
Actor2Type1Code STRING,
Actor2Type2Code STRING,
Actor2Type3Code STRING,
IsRootEvent INT,
EventCode STRING,
EventBaseCode STRING,
EventRootCode STRING,
QuadClass INT,
GoldsteinScale DOUBLE,
NumMentions INT,
NumSources INT,
NumArticles INT,
AvgTone DOUBLE,
Actor1Geo_Type INT,
Actor1Geo_FullName STRING,
Actor1Geo_CountryCode STRING,
Actor1Geo_ADM1Code STRING,
Actor1Geo_Lat FLOAT,
Actor1Geo_Long FLOAT,
Actor1Geo_FeatureID INT,
Actor2Geo_Type INT,
Actor2Geo_FullName STRING,
Actor2Geo_CountryCode STRING,
Actor2Geo_ADM1Code STRING,
Actor2Geo_Lat FLOAT,
Actor2Geo_Long FLOAT,
Actor2Geo_FeatureID INT,
ActionGeo_Type INT,
ActionGeo_FullName STRING,
ActionGeo_CountryCode STRING,
ActionGeo_ADM1Code STRING,
ActionGeo_Lat FLOAT,
ActionGeo_Long FLOAT,
ActionGeo_FeatureID INT,
DATEADDED INT,
SOURCEURL STRING )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://support.elasticmapreduce/training/datasets/gdelt'
");

Note: You might encounter an error in the above statement, with an error message that specifies an unused argument. This can be due to an overwritten Spark context. If this error appears, restart your SparkR or RStudio session.

Next, apply your filters to extract desired events in the last two years for the country of interest and store the results in a SparkR dataframe  named ‘gdelt’. In this post, I focus on spatial analysis of the data for the US only. For code samples that illustrate how this can be done for other countries, such as India, see the Exploring GDELT – Geospatial Analysis using SparkR on EMR GitHub site for the Big Data Blog.

gdelt<-sql(hiveContext,"SELECT * FROM gdelt WHERE ActionGeo_CountryCode IN ('US') AND Year >= 2014")

Register and cache this table in-memory:

registerTempTable(gdelt, "gdelt")
cacheTable(hiveContext, "gdelt")

Rename the columns for readability:

names(gdelt)[names(gdelt)=="actiongeo_countrycode"]="cn"
names(gdelt)[names(gdelt)=="actiongeo_lat"]="lat"
names(gdelt)[names(gdelt)=="actiongeo_long"]="long"
names(gdelt)

Next, extract a subset of columns from your original SparkR dataframe (‘gdelt’).  Of particular interest is the “lat” and “long” columns, as these attributes provide you with the locations where these events have taken place. Register and cache this result table into another SparkR dataframe  named ‘gdt’:

gdt=gdelt[,
          c("sqldate",
            "eventcode",
            "globaleventid",
            "cn",
            "year",
            "lat",
            "long")
          ]
registerTempTable(gdt, "gdt")
cacheTable(hiveContext, "gdt")

Now, filter down to specific events of interest. For this blog post, I have chosen certain event codes that relate to economic aid and co-operation. The CAMEO manual provides more details on what these specific event codes represent; for quick reference, the codes used here are 0211 (appeal for economic co-operation), 0231 (appeal for economic aid), 0311 (intent for economic co-operation), 0331 (intent to provide economic aid), 061 (economic co-operation), and 071 (provide economic aid). Refer to the manual to choose events from other news categories that may be of particular interest to you.

Store your chosen event codes into an R vector object:

ecocodes <- c("0211","0231","0311","0331","061","071")	

Next, apply a filter operation and store the results of this operation into another SparkR dataset named ‘gdeltusinf’. Follow the same approach of registering and caching this table.

gdeltusinf <- filter(gdt,gdt$eventcode %in% ecocodes)
registerTempTable(gdeltusinf, "gdeltusinf")
cacheTable(hiveContext, "gdeltusinf")

Now that you have a smaller dataset that is a subset of the original, collect this SparkR dataframe into a local R dataframe. By doing this, you can leverage the spatial libraries installed previously into your R environment.

dflocale1=collect(select(gdeltusinf,"*"))	

Save the R dataframe to a local file system in case you need to quit your SparkR session and want to reuse the file at a later point. You can also share this saved file with other R users and sessions.  

save(dflocale1, file = "gdeltlocale1.Rdata")

Now, create separate dataframes by country as this allows you to plot the corresponding event locations on separate maps:

dflocalus1=subset(dflocale1,dflocale1$cn=='US')

Next, provide a suitable title for the maps and prepare to plot them:

plot.new()
title("GDELT Analysis for Economy related Events in 2014-2015")
map=qmap('USA',zoom=3)

The first plot identifies locations where all events related to economic aid and co-operation have taken place in the US within the last two years (2014-2015). For this example, these locations are marked in red, but you can choose another color.

map + geom_point(data = dflocalus1, aes(x = dflocalus1$long, y = dflocalus1$lat), color="red", size=0.5, alpha=0.5)
title("All GDELT Event Locations in USA related to Economy in 2014-2015")

It may take several seconds for the image to be displayed. From the above image, you can infer that the six chosen events related to the economy took place in locations all over the US in 2014-2015. While this map provides you with the insight that these events were fairly widespread in the US, you might want to drill down further and identify locations where each of these specific events took place.

For this purpose, display only certain chosen events. Start by displaying locations where events related to ‘0211’ (Economic Co-op for Appeals) have taken place in the US, using the color blue:

dflocalus0211=subset(dflocalus1,dflocalus1$eventcode=='0211')
x0211=geom_point(data = dflocalus0211, aes(x = dflocalus0211$long, y = dflocalus0211$lat), color="blue", size=2, alpha=0.5)
map+x0211
title("GDELT Event Locations in USA: Economic Co-op(appeals)-Code 0211")

From the image above, you can see that the event ‘0211’ (Economic Co-op for Appeals) was fairly widespread as well, but there is more of a concentration within the Eastern region of the US, specifically the Northeast.

Next, follow the same process, but this time for a different event –‘0231’ (Economic Aid for Appeals). Notice the use of the color yellow for this purpose.

dflocalus0231=subset(dflocalus1,dflocalus1$eventcode=='0231')
x0231=geom_point(data = dflocalus0231, aes(x = dflocalus0231$long, y = dflocalus0231$lat), color="yellow", size=2, alpha=0.5)
map+x0231
title("GDELT Event Locations in USA:Economic Aid(appeals)-Code 0231")

From the above image, you can see that there is a heavy concentration of this event type in the Midwest, Eastern, and Western parts of the US while the North-Central region is sparser.

You can follow a similar approach to prepare separate R dataframes for each event of interest. Choosing a different color for each event allows you to identify each event type and locations much more easily.

dflocalus0311=subset(dflocalus1,dflocalus1$eventcode=='0311')
dflocalus0331=subset(dflocalus1,dflocalus1$eventcode=='0331')
dflocalus061=subset(dflocalus1,dflocalus1$eventcode=='061')
dflocalus071=subset(dflocalus1,dflocalus1$eventcode=='071')

x0211=geom_point(data = dflocalus0211, aes(x = dflocalus0211$long, y = dflocalus0211$lat), color="blue", size=3, alpha=0.5)
x0231=geom_point(data = dflocalus0231, aes(x = dflocalus0231$long, y = dflocalus0231$lat), color="yellow", size=1, alpha=0.5)
x0311=geom_point(data = dflocalus0311, aes(x = dflocalus0311$long, y = dflocalus0311$lat), color="red", size=1, alpha=0.5)
x0331=geom_point(data = dflocalus0331, aes(x = dflocalus0331$long, y = dflocalus0331$lat), color="green", size=1, alpha=0.5)
x061=geom_point(data = dflocalus061, aes(x = dflocalus061$long, y = dflocalus061$lat), color="orange", size=1, alpha=0.5)
x071=geom_point(data = dflocalus071, aes(x = dflocalus071$long, y = dflocalus071$lat), color="violet", size=1, alpha=0.5)

Using this approach allows you to overlay locations of different events on the same map, where each event is represented by a specific color. To illustrate this with an example, use the following code to display locations of three specific events from the steps above:

map+x0211+x0231+x0311

legend('bottomleft',c("0211:Appeal for Economic Co-op","0231:Appeal for Economic Aid","0311:Intent for Economic Co-op"),col=c("blue","yellow","red"),pch=16)
title("GDELT Locations In USA: Economy related Events in 2014-2015")

Note: If you are using RStudio, you might have to hit the refresh button on the Plots tab to display the map in the lower-right portion of your workspace.

From the above image, you can see that the Northeast region has the heaviest concentration of events, while certain areas such as North Central are sparser.

Conclusion

I’ve shown you how to build a simple yet powerful geospatial application using SparkR on EMR.  Though R helps with geospatial analysis, native R limitations prevent GEOINT applications from scaling to the extent required by large-scale GEOINT applications. You can overcome these limitations by mixing SparkR with native R workloads, a method referred to as big data / small learning.  This approach lets you take your GEOINT application performance to the next level while still running the R analytics you know and love.

You can find the code samples for this GEOINT application, along with other use cases for this application, on the Exploring GDELT – Geospatial Analysis using SparkR on EMR GitHub site.  You can also find code samples that make use of the concept of pipelines and the pipeR package to implement this functionality. Along similar lines, you can make use of the magrittr package to implement the functionality represented in this application.

If you have any questions or suggestions, please leave a comment below.

————————————

Related

Crunching Statistics at Scale with SparkR on Amazon EMR

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL

Post Syndicated from Bob Strahan original https://blogs.aws.amazon.com/bigdata/post/Tx3G7177U6YHY5I/Query-Routing-and-Rewrite-Introducing-pgbouncer-rr-for-Amazon-Redshift-and-Postg

Bob Strahan is a senior consultant with AWS Professional Services

Have you ever wanted to split your database load across multiple servers or clusters without impacting the configuration or code of your client applications? Or perhaps you have wished for a way to intercept and modify application queries, so that you can make them use optimized tables (sorted, pre-joined, pre-aggregated, etc.), add security filters, or hide changes you have made in the schema?

The pgbouncer-rr project is based on pgbouncer, an open source, PostgreSQL connection pooler. It adds two new significant features:

Routing: Intelligently send queries to different database servers from one client connection; use it to partition or load balance across multiple servers or clusters.

Rewrite: Intercept and programmatically change client queries before they are sent to the server; use it to optimize or otherwise alter queries without modifying your application.

Pgbouncer-rr works the same way as pgbouncer. Any target application can be connected to pgbouncer-rr as if it were an Amazon Redshift or PostgreSQL server, and pgbouncer-rr creates a connection to the actual server, or reuses an existing connection.

You can deploy multiple instances of pgbouncer-rr to avoid throughput bottlenecks or single points of failure, or to support multiple configurations. It can live in an Auto Scaling group, and behind an Elastic Load Balancing load balancer. It can be deployed to a public subnet while your servers reside in private subnets. You can choose to run it as a bastion server using SSH tunneling, or you can use pgbouncer’s recently introduced SSL support for encryption and authentication.

Documentation and community support for pgbouncer can be easily found online;  pgbouncer-rr is a superset of pgbouncer.

Now I’d like to talk about the query routing and query rewrite feature enhancements.

Query Routing

The routing feature maps client connections to server connections using a Python routing function which you provide. Your function is called for each query, with the client username and the query string as parameters. Its return value must identify the target database server. How it does this is entirely up to you.

For example, you might want to run two Amazon Redshift clusters, each optimized to host a distinct data warehouse subject area. You can determine the appropriate cluster for any given query based on the names of the schemas or tables used in the query. This can be extended to support multiple Amazon Redshift clusters or PostgreSQL instances.

In fact, you can even mix and match Amazon Redshift and PostgreSQL, taking care to ensure that your Python functions correctly handle any server-specific grammar in your queries; your database will throw errors if your routing function sends queries it can’t process. And, of course, any query must run entirely on the server to which it is routed; cross-database joins or multi-server transactions do not work!

Here’s another example: you might want to implement controlled load balancing (or A/B testing) across replicated clusters or servers. Your routing function can choose a server for each query based on any combination of the client username, the query string, random variables, or external input. The logic can be as simple or as sophisticated as you want.

Your routing function has access to the full power of the Python language and the myriad of available Python modules. You can use the regular expression module (re) to match words and patterns in the query string, or use the SQL parser module (sqlparse) to support more sophisticated/robust query parsing. You may also want to use the AWS SDK module (boto) to read your routing table configurations from Amazon DynamoDB.
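As a concrete example of that last idea, here is a minimal sketch (not part of pgbouncer-rr itself) that loads the routing rules from a hypothetical DynamoDB table using boto3, producing the same routingtable structure used in the regex-based example later in this section. The table name and attribute names are assumptions.

import boto3

# hypothetical DynamoDB table 'pgbouncer_routes' with items shaped like:
#   {'usernameRegex': '.*', 'queryRegex': '.*tablea.*', 'dbkey': 'dev.1'}
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('pgbouncer_routes')

def load_routing_table():
    # a scan is fine for a small routing table; paginate if it grows large
    items = table.scan()['Items']
    return {
        'route': [{'usernameRegex': i['usernameRegex'],
                   'queryRegex': i['queryRegex'],
                   'dbkey': i['dbkey']} for i in items],
        'default': None,
    }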

The Python routing function is dynamically loaded by pgbouncer-rr from the file you specify in the configuration:

routing_rules_py_module_file = /etc/pgbouncer-rr/routing_rules.py

The file should contain the following Python function:

def routing_rules(username, query):

The function parameters provide the username associated with the client, and a query string. The function return value must be a valid database key name (dbkey) as specified in the configuration file, or None. When a valid dbkey is returned by the routing function, the client connection will be routed to a connection in the specified server connection pool. When None is returned by the routing function, the client remains routed to its current server connection.

The route function is called only for query and prepare packets, with the following restrictions:

All queries must run wholly on the assigned server. Cross-server joins do not work.

Ideally, queries should auto-commit each statement. Set pool_mode = statement in the configuration.

Multi-statement transactions work correctly only if statements are not rerouted by the routing_rules function to a different server pool mid-transaction. Set pool_mode = transaction in the configuration.

If your application uses database catalog tables to discover the schema, then the routing_rules function should direct catalog table queries to a database server that has all the relevant schema objects created.

Simple query routing example

Amazon Redshift cluster 1 has data in table ‘tablea’. Amazon Redshift cluster 2 has data in table ‘tableb’. You want a client to be able to issue queries against either tablea or tableb without needing to know which table resides on which cluster.

Create a (default) entry with a key, say, ‘dev’ in the [databases] section of the pgbouncer configuration. This entry determines the default cluster used for client connections to database ‘dev’. You can make either redshift1 or redshift2 the default, or even specify a third ‘default’ cluster. Create additional entries in the pgbouncer [databases] section for each cluster; give these unique key names such as ‘dev.1’, and ‘dev.2’.

[databases]
dev = host= port=5439 dbname=dev
dev.1 = host= port=5439 dbname=dev
dev.2 = host= port=5439 dbname=dev

Ensure that the configuration file setting routing_rules_py_module_file specifies the path to your Python routing function file, such as ~/routing_rules.py. The code in the file could look like the following:

def routing_rules(username, query):
    if "tablea" in query:
        return "dev.1"
    elif "tableb" in query:
        return "dev.2"
    else:
        return None

This is a toy example, but it illustrates the concept. If a client sends the query SELECT * FROM tablea, it matches the first rule, and is assigned to server pool ‘dev.1’ (redshift1). If a client (and it could be the same client in the same session) sends the query SELECT * FROM tableb, it matches the second rule, and is assigned to server pool ‘dev.2’ (redshift2). Any query that does not match either rule results in None being returned, and the server connection remains unchanged.

Below is an alternative function for the same use case, but the routing logic is defined in a separate extensible data structure using regular expressions to find the table matches. The routing table structure could easily be externalized in a DynamoDB table.

# ROUTING TABLE
# ensure that all dbkey values are defined in [database] section of the pgbouncer ini file
routingtable = {
    'route': [{
        'usernameRegex': '.*',
        'queryRegex': '.*tablea.*',
        'dbkey': 'dev.1'
    }, {
        'usernameRegex': '.*',
        'queryRegex': '.*tableb.*',
        'dbkey': 'dev.2'
    }],
    'default': None
}

# ROUTING FN – CALLED FROM PGBOUNCER-RR – DO NOT CHANGE NAME
# IMPLEMENTS REGEX RULES DEFINED IN ROUTINGTABLE OBJECT
# RETURNS FIRST MATCH FOUND
import re

def routing_rules(username, query):
    for route in routingtable['route']:
        u = re.compile(route['usernameRegex'])
        q = re.compile(route['queryRegex'])
        if u.search(username) and q.search(query):
            return route['dbkey']
    return routingtable['default']

You will likely want to implement more robust and sophisticated rules, taking care to avoid unintended matches. Write test cases to call your function with different inputs and validate the output dbkey values.
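For instance, a few plain assertions against the regex-based function above can catch surprises before you deploy a new routing module. This is only an illustrative sketch; it assumes the function lives in a file named routing_rules.py on the test machine's Python path.

# simple checks for the regex-based routing_rules function above
from routing_rules import routing_rules

def test_routing_rules():
    # queries touching tablea go to dev.1, tableb to dev.2
    assert routing_rules('alice', 'SELECT * FROM tablea') == 'dev.1'
    assert routing_rules('bob', 'SELECT count(*) FROM tableb') == 'dev.2'
    # anything else leaves the current server connection unchanged
    assert routing_rules('alice', 'SELECT 1') is None
    # note the loose regex also matches similar names such as 'tableab';
    # exactly the kind of unintended match to watch for
    assert routing_rules('alice', 'SELECT * FROM tableab') == 'dev.1'

if __name__ == '__main__':
    test_routing_rules()
    print('all routing checks passed')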

Query Rewrite

The rewrite feature provides you with the opportunity to manipulate application queries en route to the server without modifying application code. You might want to do this to:

Optimize an incoming query to use the best physical tables when you have replicated tables with alternative sort/dist keys and column subsets (emulate projections), or when you have stored pre-joined or pre-aggregated data (emulate ‘materialized views’).

Apply query filters to support row-level data partitioning/security.

Roll out new schemas, resolve naming conflicts, and so on, by changing identifier names on the fly.

The rewrite function is also implemented in a fully configurable Python function, dynamically loaded from an external module specified in the configuration: 

rewrite_query_py_module_file = /etc/pgbouncer-rr/rewrite_query.py

The file should contain the following Python function:

def rewrite_query(username, query):

The function parameters provide the username associated with the client, and a query string. The function return value must be a valid SQL query string which returns the result set that you want the client application to receive.

Implementing a query rewrite function is straightforward when the incoming application queries have fixed formats that are easily detectable and easily manipulated, perhaps using regular expression search/replace logic in the Python function. It is much more challenging to build a robust rewrite function to handle SQL statements with arbitrary format and complexity.

Enabling the query rewrite function triggers pgbouncer-rr to enforce that a complete query is contained in the incoming client socket buffer. Long queries are often split across multiple network packets. They should all be in the buffer before the rewrite function is called, which requires that the buffer size be large enough to accommodate the largest query. The default buffer size (2048) is likely too small, so specify a much larger size in the configuration: pkt_buf = 32768.

If a partially received query is detected, and there is room in the buffer for the remainder of the query, pgbouncer-rr waits for the remaining packets to be received before processing the query. If the buffer is not large enough for the incoming query, or if it is not large enough to hold the re-written query (which may be longer than the original), then the rewrite function will fail. By default, the failure is logged and the original query string will be passed to the server unchanged. You can force the client connection to terminate instead, by setting: rewrite_query_disconnect_on_failure = true.
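Put together, the rewrite-related entries in the pgbouncer-rr ini file might look like the following; the buffer size and the disconnect-on-failure choice are illustrative, so tune them for your own workload:

; rewrite settings (illustrative values)
rewrite_query_py_module_file = /etc/pgbouncer-rr/rewrite_query.py
pkt_buf = 32768
rewrite_query_disconnect_on_failure = true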

Simple query rewrite example

You have a star schema with a large fact table in Amazon Redshift (such as ‘sales’) with two related dimension tables (such as ‘store’ and ‘product’). You want to optimize equally for two different queries:

1> SELECT storename, SUM(total) FROM sales JOIN store USING (storeid)
GROUP BY storename ORDER BY storename
2> SELECT prodname, SUM(total) FROM sales JOIN product USING (productid)
GROUP BY prodname ORDER BY prodname

By experimenting, you have determined that the best possible solution is to have two additional tables, each optimized for one of the queries:

store_sales: store and sales tables denormalized, pre-aggregated by store, and sorted and distributed by store name

product_sales: product and sales tables denormalized, pre-aggregated by product, sorted and distributed by product name

So you implement the new tables, and take care of their population in your ETL processes. But you’d like to avoid directly exposing these new tables to your reporting or analytic client applications. This might be the best optimization today, but who knows what the future holds? Maybe you’ll come up with a better optimization later, or maybe Amazon Redshift will introduce cool new features that provide a simpler alternative.

So, you implement a pgbouncer-rr rewrite function to change the original queries on the fly. Ensure that the configuration file setting rewrite_query_py_module_file specifies the path to your Python function file, say ~/rewrite_query.py.

The code in the file could look like this:

import re
def rewrite_query(username, query):
    q1 = ("SELECT storename, SUM(total) FROM sales JOIN store USING (storeid) "
          "GROUP BY storename ORDER BY storename")
    q2 = ("SELECT prodname, SUM(total) FROM sales JOIN product USING (productid) "
          "GROUP BY prodname ORDER BY prodname")
    # re.escape treats the SQL text literally; otherwise the parentheses in
    # SUM(total) and USING (storeid) would be interpreted as regex groups.
    if re.match(re.escape(q1), query):
        new_query = ("SELECT storename, SUM(total) FROM store_sales "
                     "GROUP BY storename ORDER BY storename;")
    elif re.match(re.escape(q2), query):
        new_query = ("SELECT prodname, SUM(total) FROM product_sales "
                     "GROUP BY prodname ORDER BY prodname;")
    else:
        new_query = query
    return new_query

Again, this is a toy example to illustrate the concept. In any real application, your Python function needs to employ more robust query pattern matching and substitution.
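One possible way to make the matching less brittle, sketched here under the assumption that incoming queries still arrive as single statements, is to normalize whitespace and case before comparing, and to keep the known query/replacement pairs in a lookup table:

import re

def _normalize(sql):
    # Collapse runs of whitespace and lowercase so formatting differences
    # between client tools do not break the match.
    return re.sub(r"\s+", " ", sql).strip().lower()

# Known source queries mapped to their optimized replacements.
_REWRITES = {
    _normalize("SELECT storename, SUM(total) FROM sales JOIN store USING (storeid) "
               "GROUP BY storename ORDER BY storename"):
        "SELECT storename, SUM(total) FROM store_sales GROUP BY storename ORDER BY storename;",
    _normalize("SELECT prodname, SUM(total) FROM sales JOIN product USING (productid) "
               "GROUP BY prodname ORDER BY prodname"):
        "SELECT prodname, SUM(total) FROM product_sales GROUP BY prodname ORDER BY prodname;",
}

def rewrite_query(username, query):
    # Fall back to the original query when no rewrite is registered.
    return _REWRITES.get(_normalize(query), query)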

Your reports and client applications use the same join query as before:

SELECT prodname, SUM(total) FROM sales JOIN product USING (productid) GROUP BY prodname ORDER BY prodname;

Now, when you look on the Amazon Redshift console Queries tab, you see that the query received by Amazon Redshift is the rewritten version that uses the new product_sales table, leveraging your pre-joined, pre-aggregated data and the targeted sort and dist keys:

SELECT prodname, SUM(total) FROM product_sales GROUP BY prodname ORDER BY prodname;

Getting Started

Here are the steps to start working with pgbouncer-rr.

Install

Download and install pgbouncer-rr by running the following commands (Amazon Linux/RHEL/CentOS):

# install required packages
sudo yum install libevent-devel openssl-devel git libtool python-devel -y

# download the latest pgbouncer distribution
git clone https://github.com/pgbouncer/pgbouncer.git

# download pgbouncer-rr extensions
git clone https://github.com/awslabs/pgbouncer-rr-patch.git

# merge pgbouncer-rr extensions into pgbouncer code
cd pgbouncer-rr-patch
./install-pgbouncer-rr-patch.sh ../pgbouncer

# build and install
cd ../pgbouncer
git submodule init
git submodule update
./autogen.sh
./configure …
make
sudo make install

Configure

Create a configuration file, using ./pgbouncer-example.ini as a starting point, adding your own database connections and Python routing rules and rewrite query functions.

Set up user authentication; for more information, see authentication file format.

NOTE: The recently added pgbouncer auth_query feature does not work with Amazon Redshift.

By default, pgbouncer-rr does not support SSL/TLS connections. However, you can experiment with pgbouncer’s newest TLS/SSL feature. Just add a private key and certificate to your pgbouncer-rr configuration:

client_tls_sslmode=allow
client_tls_key_file = ./pgbouncer-rr-key.key
client_tls_cert_file = ./pgbouncer-rr-key.crt

Hint: Here’s how to easily generate a test key with a self-signed certificate using openssl:

openssl req -newkey rsa:2048 -nodes -keyout pgbouncer-rr-key.key -x509 -days 365 -out pgbouncer-rr-key.crt

Configure a firewall

Configure your Linux firewall to enable incoming connections on the configured pgbouncer-rr listening port. For example:

sudo firewall-cmd --zone=public --add-port=5439/tcp --permanent
sudo firewall-cmd --reload

If you are running pgbouncer-rr on an Amazon EC2 instance, the instance security group must also be configured to allow incoming TCP connections on the listening port.

Launch

Run pgbouncer-rr as a daemon using the command line pgbouncer <config_file> -d. See pgbouncer --help for command line options. Hint: Use -v -v to enable verbose logging. If you look carefully in the log file, you will see evidence of the query routing and query rewrite features in action.

Connect

Configure your client application as though you were connecting directly to an Amazon Redshift or PostgreSQL database, but be sure to use the pgbouncer-rr hostname and listening port.

Here’s an example using psql:

psql -h pgbouncer-dnshostname -U dbuser -d dev -p 5439

Here’s another example using a JDBC driver URL (Amazon Redshift driver):

jdbc:redshift://pgbouncer-dnshostname:5439/dev

Other uses for pgbouncer-rr

It can be used for lots of things, really. In addition to the examples shown above, here are some other use cases suggested by colleagues:

Serve small or repetitive queries from PostgreSQL tables consisting of aggregated results.

Parse SQL for job-tracking table names to implement workload management with job-tracking tables on PostgreSQL, simplifying application development.

Leverage multiple Amazon Redshift clusters to serve dashboarding workloads with heavy concurrency requirements.

Determine the appropriate route based on the current workload/state of cluster resources (for example, always route to the cluster with the fewest running queries).

Use query rewrite to parse SQL for queries that do not follow Amazon Redshift query design or query best practices; either block these queries or rewrite them for better performance.

Use SQL parsing to limit end users’ ability to access tables with ad hoc queries; for example, identify users scanning N+ years of data and instead issue a query that blocks them with a rewrite (see the sketch after this list), such as: SELECT 'WARNING: scans against v_all_sales must be limited to no more than 30 days' AS alert;

Use SQL parsing to identify and rewrite queries that filter on certain criteria, directing them toward a specific table containing data matching that filter.
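As a sketch of the warning idea in the list above: the v_all_sales view name comes from that bullet, but the date column names and the very rough ‘has a date filter’ heuristic are illustrative assumptions to adapt to your own schema:

import re

def rewrite_query(username, query):
    # Illustrative guardrail: if an ad hoc query touches v_all_sales without an
    # obvious date predicate, return a warning result instead of running the scan.
    q = query.lower()
    has_date_filter = re.search(r"\bwhere\b.*\b(sale_date|order_date)\b", q, re.DOTALL)
    if "v_all_sales" in q and not has_date_filter:
        return ("SELECT 'WARNING: scans against v_all_sales must be limited "
                "to no more than 30 days' AS alert;")
    return query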

Actually, your use cases don’t need to be limited to just routing and query rewriting. You could design a routing function that leaves the route unchanged, but which instead implements purposeful side effects, such as:

Publishing custom CloudWatch metrics, enabling you to monitor specific query patterns and/or user interactions with your databases (a minimal sketch follows this list).

Capturing SQL DDL and INSERT/UPDATE statements, and wrapping them into Amazon Kinesis put-records as input to the method described in Erik Swensson’s excellent post, Building Multi-AZ or Multi-Region Amazon Redshift Clusters.
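For instance, here is a hedged sketch of the first idea using boto3; the namespace, metric name, and query pattern are illustrative, and it assumes AWS credentials, a region, and CloudWatch permissions are already configured. The function publishes a counter whenever it sees a SELECT * query and always returns None, so routing is left unchanged:

import re
import boto3

cloudwatch = boto3.client("cloudwatch")  # assumes credentials and region are configured

def routing_rules(username, query):
    # Side effect only: count SELECT * queries per user, then return None so
    # the client stays on its current server connection.
    if re.search(r"select\s+\*\s+from", query, re.IGNORECASE):
        cloudwatch.put_metric_data(
            Namespace="pgbouncer-rr",               # illustrative namespace
            MetricData=[{
                "MetricName": "SelectStarQueries",  # illustrative metric name
                "Dimensions": [{"Name": "User", "Value": username}],
                "Value": 1.0,
                "Unit": "Count",
            }],
        )
    return None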

We’d love to hear your thoughts and ideas for pgbouncer-rr functions. If you have questions or suggestions, please leave a comment below.

Copyright 2015 Amazon.com, Inc. or its affiliates. All Rights Reserved.

———————————————

Related:

Top 10 Performance Tuning Techniques for Amazon Redshift

 

 


Building a Graph Database on AWS Using Amazon DynamoDB and Titan

Post Syndicated from Nick Corbett original https://blogs.aws.amazon.com/bigdata/post/Tx12NN92B1F5K0C/Building-a-Graph-Database-on-AWS-Using-Amazon-DynamoDB-and-Titan

Nick Corbett is a Big Data Consultant for AWS Professional Services

You might not know it, but a graph has changed your life. A bold claim perhaps, but companies such as Facebook, LinkedIn, and Twitter have revolutionized the way society interacts through their ability to manage a huge network of relationships. However, graphs aren’t just used in social media; they can represent many different systems, including financial transactions for fraud detection, customer purchases for recommendation engines, computer network topologies, or the logistics operations of Amazon.com.

In this post, I would like to introduce you to a technology that makes it easy to manipulate graphs in AWS at massive scale. To do this, let’s imagine that you have decided to build a mobile app to help you and your friends with the simple task of finding a good restaurant. You quickly decide to build a ‘server-less’ infrastructure, using Amazon Cognito for identity management and data synchronization, Amazon API Gateway for your REST API, and AWS Lambda to implement microservices that fulfil your business logic. Your final decision is where to store your data. Because your vision is to build a network of friends and restaurants, the natural choice is a graph database rather than an RDBMS. Titan running on Amazon DynamoDB is a great fit for the job.

DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance together with seamless scalability. Recently, AWS announced a plug-in for Titan that allows it to use DynamoDB as a storage backend. This means you can now build a graph database using Titan and not worry about the performance, scalability, or operational management of storing your data.

Your vision for the network that will power your app is shown below; it illustrates the three major parts of a graph: vertices (or nodes), edges, and properties.

A vertex (or node) represents an entity, such as a person or restaurant. In your graph, you have three types of vertex: customers, restaurants, and the type of cuisine served (called genre in the code examples).

An edge defines a relationship between two vertices. For example, a customer might visit a restaurant or a restaurant may serve food of a particular cuisine. An edge always has direction – it will be outgoing from one vertex and incoming to the other.

A property is a key-value pair that enriches a vertex or an edge. For example, a customer has a name or the customer might rate their experience when they visit a restaurant.

After a short time, your app is ready to be released, albeit as a minimum viable product. The initial functionality of your app is very simple: your customer supplies a cuisine, such as ‘Pizza’ or ‘Sushi’, and the app returns a list of restaurants they might like to visit.

To show how this works in Titan, you can follow these instructions in the AWS Big Data Blog’s GitHub repository to load some sample data into your own Titan database, using DynamoDB as the backend store. The data used in this example was based on a data set provided by the Machine Learning Repository at UCI [1]. By default, the example uses Amazon DynamoDB Local, a small client-side database and server that mimics the DynamoDB service. This component is intended to support local development and small scale testing, and lets you save on provisioned throughput, data storage, and transfer fees.

Interaction with Titan is through a graph traversal language called Gremlin, in much the same way as you would use SQL to interact with an RDBMS. However, whereas SQL is declarative, Gremlin is implemented as a functional pipeline; the results of each operation in the query are piped to the next stage. This provides a degree of control on not just what results your query generates but also how it is executed. Gremlin is part of the Open Source Apache TinkerPop stack, which has become the de facto standard framework for graph databases and is supported by products such as Titan, Neo4j, and OrientDB.

Titan is written in Java and you can see that this API is used to load the sample data by running Gremlin commands. The Java API would also be used by your microservices running in Lambda, calling through to DynamoDB to store the data. In fact, the data stored in DynamoDB is compressed and not human-readable (for more information about the storage format, see Titan Graph Modeling in DynamoDB).

For the purposes of this post, however, it’s easier to use the Gremlin REPL, written in Groovy. The instructions on GitHub show you how to start your Gremlin session.

A simple Gremlin query that finds restaurants based on a type of cuisine is shown below:

gremlin> g.V.has('genreId', 'Pizzeria').in.restaurant_name

==>La Fontana Pizza Restaurante and Cafe
==>Dominos Pizza
==>Little Cesarz
==>pizza clasica
==>Restaurante Tiberius

This introduces the concept of how graph queries work; you select one or more vertices then use the language to walk (or traverse) across the graph. You can also see the functional pipeline in action as the results of each element are passed to the next step in the query. The query can be read as shown below.


The query gives us five restaurants to recommend to our customer. This query would be just as easy to run if your data was based in an RDBMS, so at this point not much is gained by using a graph database. However, as more customers start using your app and the first feature requests come in, you start to feel the benefit of your decision.

Initial feedback from your customers is good. However, they tell you that although it’s great to get a recommendation based on a cuisine, it would be better if they could receive recommendations based on places their friends have visited. You quickly add a ‘friend’ feature to the app and change the Gremlin query that you use to provide recommendations:

This query assumes that a particular user (‘U1064’) has asked us to find a ‘Cafeteria’ restaurant that their friends have visited. The Gremlin syntax can be read as shown below.

This query uses a pattern called ‘backtrack’. You make a selection of vertices and ‘remember’ them. You then traverse the graph, selecting more nodes. Finally, you ‘backtrack’ to your remembered selection and reduce it to those vertices that have a path through to your current position.

Again, this query could be executed in an RDBMS but it would be complex. Because you would keep all customers in a single table, finding friends would involve looping back to join a table to itself. While it’s perfectly possible to do this in SQL, the syntax can become long, especially if you want to loop multiple times; for example, how many of my friends’ friends have visited the same restaurant as me? A more important problem would be the performance. Each SQL join would introduce extra latency to the query and you may find that, as your database grows, you can’t meet the strict latency requirements of a modern app. In my test system, Titan returned the answer to this query in 38ms, but the RDBMS where I staged the data took over 0.3 seconds to resolve it, an order of magnitude difference!

Your new recommendations work well, but some customers are still not happy. Just because their friends visited a restaurant doesn’t mean that they enjoyed it; they only want recommendations to restaurants their friends actually liked. You update your app again and ask customers to rate their experience, using ‘0’ for poor, ‘1’ for good, and ‘2’ for excellent. You then modify the query to:

g.V.has('userId','U1101').out('friend').outE('visit').has('visit_food', T.gte, 1).as('x').inV.as('y').out('restaurant_genre').has('genreId', 'Seafood').back('x').transform{e, m -> [food: m.x.visit_food, name:m.y.restaurant_name]}.groupCount{it.name}.cap

==>{Restaurante y Pescaderia Tampico=1, Restaurante Marisco Sam=1, Mariscos El Pescador=2}

This query is based on a user (‘U1101’) asking for a seafood restaurant. The stages of the query are shown below.

This query shows how you can filter for a property on an edge. When you traverse the ‘visit’ edge, you filter for those visits where the food rating was greater than or equal to 1. The query also shows how you can transform results from a pipeline to a new object. You build a simple object, with two properties (food rating and name), for each ‘hit’ you have against your query criteria. Finally, the query also demonstrates the ‘groupCount’ function. This aggregation provides a count of each unique name.

The net result of this query is that the ‘best’ seafood restaurant to recommend is ‘Mariscos El Pescador’, as your customer’s friends have made two visits in which they rated the food as ‘good’ or better.

The reputation of your app grows and more and more customers sign up. It’s great to take advantage of DynamoDB scalability; there’s no need to re-architect your solution as you gain more users, as your storage backend can scale to deal with millions or even hundreds of millions of customers.

Soon, it becomes apparent that most of your customers are using your app when they are out and about. You need to enhance your app so that it can make recommendations that are close to the customer. Fortunately, Titan comes with built-in geo queries. The query below imagines that customer ‘U1064’ is asking for a ‘Cafeteria’ and that you’ve captured the location of their mobile as (22.165, -101.0):

g.V.has('userId', 'U1064').out('friend').outE('visit').has('visit_rating', T.gte, 2).has('visit_food', T.gte, 2).inV.as('x').out('restaurant_genre').has('genreId', 'Cafeteria').back('x').has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5)).as('b').transform{e, m -> m.b.restaurant_name + " distance " + m.b.restaurant_place.getPoint().distance(Geoshape.point(22.165, -101.00).getPoint())}

==>Luna Cafe distance 2.774053451453471
==>Cafeteria y Restaurant El Pacifico distance 3.064723519030348

This query is the same as before except that there’s an extra filter:

has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5)).

Each restaurant vertex has a property called ‘restaurant_place’, which is a geo-point (a latitude and longitude). The filter restricts selection to any restaurants whose ‘restaurant_place’ is within 5km of the customer’s current location. The part of the query that transforms the output from the pipeline is modified to include the distance to the customer. You can use this to order your recommendations so the nearest is shown first.

Your app hits the big time as more and more customers use it to find a good dining experience. You are approached by one of the restaurants, which wants to run a promotion to acquire new customers. Their request is simple: they will pay you to send an in-app advert to your customers who are friends of people who have visited their restaurant, but who haven’t visited the restaurant themselves. Relieved that your app can finally make some money, you set about writing the query. This type of query follows an ‘except’ pattern:

gremlin> x = []
gremlin> g.V.has('RestaurantId','135052').in('visit').aggregate(x).out('friend').except(x).userId.order

The query assumes that RestaurantId 135052 has made the approach. The first line defines a variable ‘x’ as an array. The steps of the query are shown below.

The ‘except’ pattern used in this query makes it very easy to select elements that have not been selected in a previous step. This makes queries such as the one above, or “who are a customer’s friends’ friends that are not already their friends”, easy to resolve. Once again, you could write this query in SQL, but the syntax would be far more complex than the simple Gremlin query used above, and the multiple joins needed to resolve the query would affect performance.

Summary

In this post, I’ve shown you how to build a simple graph database using Titan with DynamoDB for storage. Compared to a more traditional RDBMS approach, a graph database can offer many advantages when you need to model a complex network. Your queries will be easier to understand and you may well get better performance from using a storage engine geared towards graph traversal. Using DynamoDB for your storage gives the added benefit of a fully managed, scalable repository for storing your data. You can concentrate on producing an app that excites your customers rather than managing infrastructure.

If you have any questions or suggestions, please leave a comment below.

References

Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011

——————————————–

Related:

Scaling Writes on Amazon DynamoDB Tables with Global Secondary Indexes
