April Fools! What a terrible day. So many pranks. You can’t believe anything you read. People invading your space. The mental and physical anguish of enduring the day. It’s time to fight back! Let’s catch the perps in action by making a device that always watches.
Keeping tabs
A Raspberry Pi Zero W, a small camera, and a rechargeable Lithium Polymer (LiPo) battery constitute the bulk of this project’s tech. A pair of 3D-printed parts and gelatine-solidified Coke Zero make up the fake fizzy body.
“So let’s make this video as short as possible and just buy a cheap pre-made spy cam off of Amazon. Just kidding,” Tinkernut jokes in the tutorial video for the project, before going through the step-by-step process of using the Raspberry Pi to “DIY this the right way”.
After accessing the Zero W from his laptop via SSH, Tinkernut opted for using the rpi_camera_surveillance_system Python script written by GitHub user RuiSantosdotme to control the spy cam. Luckily, this meant no additional library setup, and basically no lag on the video feed.
What we want to do is create a script that activates the camera and serves it to a web page so that we can access it from any web browser. There are plenty of different ways to do this (Motion, Raspivid, etc), but I found a simple Python script that does everything I need it to do and doesn’t require any extra software or libraries to install. The best thing about it is that the lag time is practically unnoticeable.
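Tinkernut’s exact script isn’t reproduced here, but a condensed sketch of this kind of camera-to-web-page server (adapted from the well-known picamera web-streaming recipe; the port number and page markup are illustrative, not his actual code) looks roughly like this:

import io
import picamera
import socketserver
from http import server
from threading import Condition

PAGE = "<html><body><img src='stream.mjpg' width='640' height='480'/></body></html>"

class StreamingOutput(object):
    """Buffers each MJPEG frame and notifies waiting HTTP clients."""
    def __init__(self):
        self.frame = None
        self.buffer = io.BytesIO()
        self.condition = Condition()

    def write(self, buf):
        if buf.startswith(b'\xff\xd8'):          # start of a new JPEG frame
            self.buffer.truncate()
            with self.condition:
                self.frame = self.buffer.getvalue()
                self.condition.notify_all()
            self.buffer.seek(0)
        return self.buffer.write(buf)

class StreamingHandler(server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/':
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')
            self.end_headers()
            self.wfile.write(PAGE.encode('utf-8'))
        elif self.path == '/stream.mjpg':
            self.send_response(200)
            self.send_header('Content-Type', 'multipart/x-mixed-replace; boundary=FRAME')
            self.end_headers()
            while True:                           # push frames to the browser as they arrive
                with output.condition:
                    output.condition.wait()
                    frame = output.frame
                self.wfile.write(b'--FRAME\r\n')
                self.send_header('Content-Type', 'image/jpeg')
                self.send_header('Content-Length', len(frame))
                self.end_headers()
                self.wfile.write(frame)
                self.wfile.write(b'\r\n')
        else:
            self.send_error(404)

class StreamingServer(socketserver.ThreadingMixIn, server.HTTPServer):
    allow_reuse_address = True

with picamera.PiCamera(resolution='640x480', framerate=24) as camera:
    output = StreamingOutput()
    camera.start_recording(output, format='mjpeg')
    try:
        StreamingServer(('', 8000), StreamingHandler).serve_forever()
    finally:
        camera.stop_recording()

Pointing any browser at the Pi’s address on that port then shows the live stream.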
With the code in place, every boot-up of the Raspberry Pi automatically launches both the script and a web page of the live video, allowing for constant monitoring of potential sneaks and thieves.
The project is powered by a 1500mAh LiPo battery and the Adafruit LiPo charger. It also includes a simple on/off switch, which Tinkernut wired to the charger and the Pi’s PP1 and PP6 connector pads.
Tinkernut decided to use a Coke Zero bottle for the build, incorporating 3D-printed parts to house the Pi, and a mix of Coke and gelatine to create a realistic-looking filling for the bottle. However, the setup can be transferred to pretty much any hollow item in your home, say, a cookie jar or a cracker box. So get creative and get spying!
A complete spy cam how-to
If you’d like to make your own secret spy cam, you can find a tutorial for Tinkernut’s build at hackster.io, or follow along with his video below. Also make sure to subscribe to his YouTube channel to stay updated on all his newest builds — they’re rather splendid.
Learn how to take a regular Coke Zero bottle, cram a Raspberry Pi and webcam inside of it, and have it still look like a regular Coke Zero bottle. Why would you want to do this? To spy on those irritating April Fooligans!!!
Spring has sprung, and with it, sleepy-eyed wildlife is beginning to roam our gardens and local woodlands. So why not follow hackster.io maker reichley’s tutorial and build your own solar-powered squirrelhouse nature cam?
Inspiration
“I live half a mile above sea level and am SURROUNDED by animals…bears, foxes, turkeys, deer, squirrels, birds”, reichley explains in his tutorial. “Spring has arrived, and there are LOADS of squirrels running around. I was in the building mood and, being a nerd, wished to combine a common woodworking project with the connectivity and observability provided by single-board computers (and their camera add-ons).”
Building a tiny home
reichley started by sketching out a design for the house to determine where the various components would fit.
Since he’s a fan of autonomy and renewable energy, he decided to run the project’s Raspberry Pi Zero W on solar power. To do so, he iterated on the design to include the necessary tech, scaling the roof to fit the panels.
To keep the project running 24/7, reichley had to figure out the overall power consumption of both the Zero W and the Raspberry Pi Camera Module, factoring in the constant WiFi connection and the sunshine hours in his garden.
He used a LiPo SHIM to bump up the power to the required 5V for the Zero. Moreover, he added a BH1750 lux sensor to shut off the LiPo SHIM, and thus the Pi, whenever it’s too dark for decent video.
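reichley’s build may well handle the actual cut-off in hardware, but purely as an illustration of the dark-detection logic, a rough Python sketch could read the BH1750 over I2C and drive a GPIO pin assumed to be wired to the SHIM’s enable circuitry (the pin number and lux threshold here are placeholders):

import time
import smbus
import RPi.GPIO as GPIO

BH1750_ADDR = 0x23        # default I2C address of the BH1750
ONE_TIME_HIGH_RES = 0x20  # one-time high-resolution measurement mode
POWER_PIN = 17            # hypothetical GPIO wired to the power-control circuit
LUX_THRESHOLD = 10        # below this, it's too dark for decent video

bus = smbus.SMBus(1)
GPIO.setmode(GPIO.BCM)
GPIO.setup(POWER_PIN, GPIO.OUT, initial=GPIO.HIGH)

def read_lux():
    data = bus.read_i2c_block_data(BH1750_ADDR, ONE_TIME_HIGH_RES, 2)
    return ((data[0] << 8) | data[1]) / 1.2   # convert the raw reading to lux

while True:
    GPIO.output(POWER_PIN, GPIO.HIGH if read_lux() >= LUX_THRESHOLD else GPIO.LOW)
    time.sleep(60)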
To control the project, he used Calin Crisan’s motionEyeOS video surveillance operating system for single-board computers.
Build your own nature camera
To build your own version, follow reichley’s tutorial, in which you can also find links to all the necessary code and components. You can also check out our free tutorial for building an infrared bird box using the Raspberry Pi NoIR Camera Module. As Eben said in our YouTube live Q&A last week, we really like nature cameras here at Pi Towers, and we’d love to see yours. So if you have any live-stream links or photography from your Raspberry Pi–powered nature cam, please share them with us!
This video presents the MoveLens project: voice-controlled glasses with a magnifying lens. It was my entry for the Voice Activated contest on Instructables. Check the step-by-step guide at Voice Controlled Glasses With Magnifying Lens. Source code: https://github.com/pichiliani/MoveLens Step by Step guide: https://www.instructables.com/id/Voice-Controlled-Glasses-With-Magnifying-Lens/
It’s a kind of magnification
We’ve all been there – that moment when you need another pair of hands to complete a task. And while these glasses may not hold all the answers, they’re a perfect addition to any hobbyist’s arsenal.
Introducing Mauro Pichiliani’s voice-activated glasses: a pair of frames with magnification lenses that can flip up and down in response to a voice command, depending on the task at hand. No more needing to put down your tools in order to put magnifying glasses on. No more trying to re-position a magnifying glass with the back of your left wrist, or getting grease all over your lenses.
As Mauro explains in his tutorial for the glasses:
Many professionals work for many hours looking at very small areas, such as surgeons, watchmakers, jewellery designers and so on. Most of the time these professionals use some kind of magnification glasses that helps them to see better the area they are working with and other tiny items used on the job. The devices that had magnifications lens on a form factor of a glass usually allow the professional to move the lens out of their eye sight, i.e. put aside the lens. However, in some scenarios touching the lens or the glass rim to move away the lens can contaminate the fingers. Also, it is cumbersome and can break the concentration of the professional.
Voice-controlled magnification glasses
Using a Raspberry Pi Zero W, a servo motor, a microphone, and the IBM Watson speech-to-text service, Mauro built a pair of glasses that lets users control the position of the magnification lenses with voice commands.
The glasses Mauro modified, before he started work on them; you have to move the lenses with your hands, like it’s October 2015
Mauro started by dismantling a pair of standard magnification glasses in order to modify the lens supports to allow them to move freely. He drilled a hole in one of the lens supports to provide a place to attach the servo, and used lollipop sticks and hot glue to fix the lenses relative to one another, so they would both move together under the control of the servo. Then, he set up a Raspberry Pi Zero, installing Raspbian and software to use a USB microphone; after connecting the servo to the Pi Zero’s GPIO pins, he set up the Watson speech-to-text service.
Finally, he wrote the code to bring the project together. Two Python scripts direct the servo to raise and lower the lenses, and a Node.js script captures audio from the microphone, passes it on to Watson, checks for an “up” or “down” command, and calls the appropriate Python script as required.
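Mauro’s actual scripts aren’t shown here, but a minimal sketch of one of the servo scripts, assuming the servo signal wire sits on GPIO 18 and that duty cycles of roughly 2.5% and 12.5% correspond to the two lens positions, could look like this:

import sys
import time
import RPi.GPIO as GPIO

SERVO_PIN = 18

GPIO.setmode(GPIO.BCM)
GPIO.setup(SERVO_PIN, GPIO.OUT)
pwm = GPIO.PWM(SERVO_PIN, 50)      # standard 50 Hz hobby-servo signal

def move(duty):
    pwm.start(duty)
    time.sleep(0.5)                # give the servo time to reach the position
    pwm.stop()                     # stop the pulse train so the servo doesn't buzz

if __name__ == '__main__':
    move(12.5 if sys.argv[1] == 'up' else 2.5)
    GPIO.cleanup()

The Node.js listener would then simply invoke such a script with “up” or “down” once Watson returns a matching transcript.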
Your turn
You can follow the tutorial on the Instructables website, where Mauro entered the glasses into the Instructables Voice Activated Challenge. And if you’d like to take your first steps into digital making using the Raspberry Pi, take a look at our free online projects.
Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
Compiling Hail
Because Hail is still under active development, you must compile it before you can start using it. To help simplify the process, you can launch the following AWS CloudFormation template that creates an EMR cluster, compiles Hail, and installs a Jupyter Notebook so that you’re ready to go with Hail.
There are a few things to note about the AWS CloudFormation template. You must provide a password for the Jupyter Notebook. Also, you must provide a virtual private cloud (VPC) to launch Amazon EMR in, and make sure that you select a subnet from within that VPC. Next, update the cluster resources to fit your needs. Lastly, the HailBuildOutputS3Path parameter should be an Amazon S3 bucket/prefix, where you should save the compiled Hail binaries for later use. Leave the Hail and Spark versions as is, unless you’re comfortable experimenting with more recent versions.
When you’ve completed these steps, the following files are saved locally on the cluster to be used when running the Apache Spark Python API (PySpark) shell.
The files are also copied to the Amazon S3 location defined by the AWS CloudFormation template so that you can include them when running jobs using the Amazon EMR Step API.
Collecting genome data
To get started with Hail, use the 1000 Genomes Project dataset available on AWS. The data that you will use is located at s3://1000genomes/release/20130502/.
For Hail to process these files in an efficient manner, they need to be block compressed. In many cases, files that use gzip compression are compressed in blocks, so you don’t need to recompress—you can just rename the file extension from “.gz” to “.bgz”. Hail can process .gz files, but it’s much slower and not recommended. The simple way to accomplish this is to copy the data files from the public S3 bucket to your own and rename them.
The following is the Bash command line to copy the first five genome Variant Call Format (VCF) files and rename them appropriately using the AWS CLI.
for i in $(seq 5); do aws s3 cp s3://1000genomes/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz s3://your_bucket/prefix/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz; done
Now that you have some data files containing variants in the Variant Call Format, you need to get the sample annotations that go along with them. These annotations provide more information about each sample, such as the population they are a part of.
In this section, you use the data collected in the previous section to explore genome variations interactively using a Jupyter Notebook. You then create a simple ETL job to convert these variations into Parquet format. Finally, you query it using Amazon Athena.
Let’s open the Jupyter Notebook. To start, sign in to the AWS Management Console, and open the AWS CloudFormation console. Choose the stack that you created, and then choose the Outputs tab. There you see the JupyterURL. Open this URL in your browser.
Go ahead and download the provided Jupyter Notebook to your local machine. Log in to Jupyter with the password that you provided during stack creation. Choose Upload on the right side, and then choose the notebook from your local machine.
After the notebook is uploaded, choose it from the list on the left to open it.
Select the first cell, update the S3 bucket location to point to the bucket where you saved the compiled Hail libraries, and then choose Run. This code imports the Hail modules that you compiled at the beginning. When the cell is executing, you will see In [*]. When the process is complete, the asterisk (*) is replaced by a number, for example, In [1].
Next, run the subsequent two cells, which import the Hail module into PySpark and initiate the Hail context.
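The exact cell contents come from the CloudFormation build, but roughly, these setup cells amount to something like the following (Hail 0.1-era API; the path to the compiled Python package is a placeholder):

import sys
sys.path.insert(0, '/home/hadoop/hail-python.zip')  # compiled Hail Python package from the build step

from hail import *
hc = HailContext(sc)  # sc is the SparkContext that the PySpark kernel already provides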
The next cell imports a single VCF file from the bucket where you saved your data in the previous section. If you change the Amazon S3 path to not include a file name, it imports all the VCF files in that directory. Depending on your cluster size, it might take a few minutes.
Remember that in the previous section, you also copied an annotation file. Now you use it to annotate the VCF files that you’ve loaded with Hail. Execute the next cell—as a shortcut, you can select the cell and press Shift+Enter.
The import_table API takes a path to the annotation file in TSV (tab-separated values) format and a parameter named impute that attempts to infer the schema of the file, as shown in the output below the cell.
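Put together, the import-and-annotate steps look roughly like this (Hail 0.1 syntax; the bucket, prefix, and annotation file name are placeholders for the files you copied earlier):

vds = hc.import_vcf('s3://your_bucket/prefix/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz')

annotations = hc.import_table('s3://your_bucket/prefix/sample_annotations.panel',
                              impute=True).key_by('sample')   # column names depend on the annotation file
vds = vds.annotate_samples_table(annotations, root='sa.pheno')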
At this point, you can interactively explore the data. For example, you can count the number of samples you have and group them by population.
You can also calculate the standard quality control (QC) metrics on your variants and samples.
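For instance, the population count and QC steps can be as short as the following (again Hail 0.1; the annotation field names depend on how the panel file was imported):

print(vds.query_samples('samples.map(s => sa.pheno.super_pop).counter()'))  # samples per population

vds = vds.variant_qc().sample_qc()  # adds va.qc.* and sa.qc.* fields such as callRate and nHet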
What if you want to query this data outside of Hail and Spark, for example, using Amazon Athena? To start, you need to change the column names to lowercase because Athena currently supports only lowercase names. To do that, use the two functions provided in the notebook and call them on your variant dataset (VDS), as shown in the following image. Note that you’re only changing the case of the variants and samples schemas. If you’ve further augmented your VDS, you might need to modify the lowercase functions to do the same for those schemas.
In the current version of Hail, the sample annotations are not stored in the exported Parquet VDS, so you need to save them separately. As noted by the Hail maintainers, in future versions, the data represented by the VDS Parquet output will change, and it is recommended that you also export the variant annotations. So let’s do that.
Note that both of these lines are similar in that they export a table representation of the sample and variant annotations, convert them to a Spark DataFrame, and finally save them to Amazon S3 in Parquet file format.
Finally, it is beneficial to save the VDS file back to Amazon S3 so that next time you need to analyze your data, you can load it without having to start from the raw VCF. Note that when Hail saves your data, it requires a path and a file name.
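An illustrative version of those export cells (Hail 0.1 plus the standard Spark DataFrame writer; the S3 paths are placeholders matching the layout used later in this post):

vds.samples_table().to_dataframe().write.parquet('s3://your_output_bucket/hail_data/sample_annotations/')
vds.variants_table().to_dataframe().write.parquet('s3://your_output_bucket/hail_data/variant_annotations/')

# Persist the VDS itself so the raw VCFs never need to be re-imported
vds.write('s3://your_output_bucket/hail_data/genotypes.vds')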
After you run these cells, expect it to take some time as it writes out the data.
Discovering table metadata
Before you can query your data, you need to tell Athena the schema of your data. You have a couple of options. The first is to use AWS Glue to crawl the S3 bucket, infer the schema from the data, and create the appropriate table. Before proceeding, you might need to migrate your Athena database to use the AWS Glue Data Catalog.
Creating tables in AWS Glue
To use the AWS Glue crawler, open the AWS Glue console and choose Crawlers in the left navigation pane.
Then choose Add crawler to create a new crawler.
Next, give your crawler a name and assign the appropriate IAM role. Leave Amazon S3 as the data source, and select the S3 bucket where you saved the data and the sample annotations. When you set the crawler’s Include path, be sure to include the entire path, for example: s3://output_bucket/hail_data/sample_annotations/
Under the Exclusion Paths, type _SUCCESS, so that you don’t crawl that particular file.
Continue forward with the default settings until you are asked if you want to add another source. Choose Yes, and add the Amazon S3 path to the variant annotation bucket s3://your_output_bucket/hail_data/variant_annotations/ so that it can build your variant annotation table. Give it an existing database name, or create a new one.
Provide a table prefix and choose Next. Then choose Finish. At this point, assuming that the data is finished writing, you can go ahead and run the crawler. When it finishes, you have two new tables in the database you created that should look something like the following:
You can explore the schema of these tables by choosing their name and then choosing Edit Schema on the right side of the table view; for example:
Creating tables in Amazon Athena
If you cannot or do not want to use AWS Glue crawlers, you can add the tables via the Athena console by typing the following statements:
In the Amazon Athena console, choose the database in which your tables were created. In this case, it looks something like the following:
To verify that you have data, choose the three dots on the right, and then choose Preview table.
Indeed, you can see some data.
You can further explore the sample and variant annotations along with the QC metrics that you calculated previously using Hail.
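If you prefer to script such queries rather than use the console, a hedged boto3 sketch (the database, table, and column names here are placeholders that follow the crawler setup above) would look like this:

import boto3

athena = boto3.client('athena')
athena.start_query_execution(
    QueryString='SELECT super_pop, count(*) AS samples '
                'FROM hail_sample_annotations GROUP BY super_pop',  # hypothetical table and column names
    QueryExecutionContext={'Database': 'genomes'},                  # the database the crawler populated
    ResultConfiguration={'OutputLocation': 's3://your_output_bucket/athena-results/'},
)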
Summary
To summarize, this post demonstrated the ease with which you can install, configure, and use Hail, an open source, highly scalable framework for exploring and analyzing genomics data on Amazon EMR. We demonstrated setting up a Jupyter Notebook to make our exploration easy. We also used the power of Hail to calculate quality control metrics for variants and samples. We exported them to Amazon S3 and allowed a broader range of users and analysts to explore them on-demand in a serverless environment using Amazon Athena.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan and enjoys cheering his team on and hanging out with his family.
Standard Christmas tree ornaments are just so boring, always hanging there doing nothing. Yawn! Lucky for us, Sean Hodgins has created an ornament that plays classic nineties Christmas adverts, because of nostalgia.
This Christmas ornament will really take you back…
Ingredients
Sean first 3D printed a small CRT-shaped ornament resembling the family television set in The Simpsons. He then got to work on the rest of the components.
All images featured in this blog post are c/o Sean Hodgins. Thanks, Sean!
The ornament uses a Raspberry Pi Zero W, 2.2″ TFT LCD screen, Mono Amp, LiPo battery, and speaker, plus the usual peripherals. Sean purposely assembled it with jumper wires and tape, so that he can reuse the components for another project after the festive season.
By adding header pins to a PowerBoost 1000 LiPo charger, Sean was able to connect a switch to control the Pi’s power usage. This method is handy if you want to seal your Pi in a casing that blocks access to the power leads. From there, jumper wires connect the audio amplifier, LCD screen, and PowerBoost to the Zero W.
Code
Then, with Raspbian installed to an SD card and SSH enabled on the Zero W, Sean got the screen to work. The type of screen he used requires both SPI and the FBTFT driver to be enabled. His next step was to set up the audio functionality with the help of an Adafruit tutorial.
For video playback, Sean installed mplayer before writing a program to extract video content from YouTube*. Once extracted, the video files are saved to the Raspberry Pi, allowing for seamless playback on the screen.
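Sean’s playback program isn’t shown here, but the playback side alone can be approximated with a short Python loop that hands locally saved clips to mplayer on the TFT’s framebuffer (the paths, resolution, and framebuffer device are assumptions about this particular screen):

import glob
import subprocess

while True:
    for clip in sorted(glob.glob('/home/pi/videos/*.mp4')):
        subprocess.call(['mplayer', '-vo', 'fbdev2:/dev/fb1',   # draw straight to the TFT framebuffer
                         '-vf', 'scale=320:240', '-really-quiet', clip])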
Construct
When fully assembled, the entire build fit snugly within the 3D-printed television set. And as a final touch, Sean added the cut-out lens of a rectangular magnifying glass to give the display the look of a curved CRT screen.
Finally, the ornament hangs perfectly on the Christmas tree, up and running and spreading nostalgic warmth.
For more information on the build, check out the Instructables tutorial. And to see all of Sean’s builds, subscribe to his YouTube channel.
Make
If you’re looking for similar projects, have a look at this tutorial by Cabe Atwell for building a Pi-powered ornament that receives and displays text messages.
Have you created Raspberry Pi tree ornaments? Maybe you’ve 3D printed some of our own? We’d love to see what you’re doing with a Raspberry Pi this festive season, so make sure to share your projects with us, either in the comments below or via our social media channels.
*At this point, I should note that we don’t support the extraction of video content from YouTube for your own use if you do not have the right permissions. However, since Sean’s device can play back any video, we think it would look great on your tree showing your own family videos from previous years. So, y’know, be good, be legal, and be festive.
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP of Database, Analytics, and AI at AWS, will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
Nextdoor is a private social networking service for neighborhoods. In this session, learn how Nextdoor replaced their home-grown data pipeline, based on a topology of Flume nodes, with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business, such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real-time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real-business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly.
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services, that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases.
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment, and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania, with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so they can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud optimized GeoTIFF, MRF, and CRF, as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.

MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using oAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for oAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as the REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora.
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliance on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high-quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness and share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and set up filters and drill-downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are over-run with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), API, clickstream, unstructured, and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning.
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC Flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:Invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
With the recent launch of Lambda@Edge, it’s now possible for you to provide even more robust functionality to your static websites. Amazon CloudFront is a content distribution network service. In this post, I show how you can use Lambda@Edge along with the CloudFront origin access identity (OAI) for Amazon S3 and still provide simple URLs (such as www.example.com/about/ instead of www.example.com/about/index.html).
Background
Amazon S3 is a great platform for hosting a static website. You don’t need to worry about managing servers or underlying infrastructure—you just publish your static content to an S3 bucket. S3 provides a DNS name such as <bucket-name>.s3-website-<AWS-region>.amazonaws.com. Use this name for your website by creating a CNAME record in your domain’s DNS environment (or Amazon Route 53) as follows:
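For illustration only (reusing the placeholders above rather than real values), such a record takes roughly this form:
www.example.com    CNAME    <bucket-name>.s3-website-<AWS-region>.amazonaws.com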
You can also put CloudFront in front of S3 to further scale the performance of your site and cache the content closer to your users. CloudFront can enable HTTPS-hosted sites, by either using a custom Secure Sockets Layer (SSL) certificate or a managed certificate from AWS Certificate Manager. In addition, CloudFront also offers integration with AWS WAF, a web application firewall. As you can see, it’s possible to achieve some robust functionality by using S3, CloudFront, and other managed services and not have to worry about maintaining underlying infrastructure.
One of the key concerns that you might have when implementing any type of WAF or CDN is that you want to force your users to go through the CDN. If you implement CloudFront in front of S3, you can achieve this by using an OAI. However, in order to do this, you cannot use the HTTP endpoint that is exposed by S3’s static website hosting feature. Instead, CloudFront must use the S3 REST endpoint to fetch content from your origin so that the request can be authenticated using the OAI. This presents some challenges in that the REST endpoint does not support redirection to a default index page.
CloudFront does allow you to specify a default root object (index.html), but it only works on the root of the website (such as http://www.example.com > http://www.example.com/index.html). It does not work on any subdirectory (such as http://www.example.com/about/). If you were to attempt to request this URL through CloudFront, CloudFront would make an S3 GetObject API call against a key that does not exist.
Of course, it is a bad user experience to expect users to always type index.html at the end of every URL (or even know that it should be there). Until now, there has not been an easy way to provide these simpler URLs (equivalent to the DirectoryIndex Directive in an Apache Web Server configuration) to users through CloudFront, at least not if you still want to be able to restrict access to the S3 origin using an OAI. However, with the release of Lambda@Edge, you can use a JavaScript function running on the CloudFront edge nodes to look for these patterns and request the appropriate object key from the S3 origin.
Solution
In this example, you use the compute power at the CloudFront edge to inspect the request as it’s coming in from the client, then rewrite the request so that CloudFront requests a default index object (index.html in this case) for any request URI that ends in ‘/’.
When a request is made against a web server, the client specifies the object to obtain in the request. You can use this URI and apply a regular expression to it so that these URIs get resolved to a default index object before CloudFront requests the object from the origin. Use the following code:
'use strict';
exports.handler = (event, context, callback) => {
// Extract the request from the CloudFront event that is sent to Lambda@Edge
var request = event.Records[0].cf.request;
// Extract the URI from the request
var olduri = request.uri;
// Match any '/' that occurs at the end of a URI. Replace it with a default index
var newuri = olduri.replace(/\/$/, '\/index.html');
// Log the URI as received by CloudFront and the new URI to be used to fetch from origin
console.log("Old URI: " + olduri);
console.log("New URI: " + newuri);
// Replace the received URI with the URI that includes the index page
request.uri = newuri;
// Return to CloudFront
return callback(null, request);
};
To get started, create an S3 bucket to be the origin for CloudFront:
On the other screens, you can just accept the defaults for the purposes of this walkthrough. If this were a production implementation, I would recommend enabling bucket logging and specifying an existing S3 bucket as the destination for access logs. These logs can be useful if you need to troubleshoot issues with your S3 access.
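If you prefer to script the bucket creation instead of clicking through the console, here is a minimal sketch using the AWS SDK for Python (boto3). The bucket name and region are placeholders, not values from this walkthrough:
import boto3

# Placeholder values; S3 bucket names must be globally unique.
BUCKET = "example-lambda-edge-origin"
REGION = "us-west-2"

s3 = boto3.client("s3", region_name=REGION)

# Outside us-east-1, S3 requires an explicit LocationConstraint.
s3.create_bucket(
    Bucket=BUCKET,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)
print("Created bucket:", BUCKET)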
Now, put some content into your S3 bucket. For this walkthrough, create two simple webpages to demonstrate the functionality: A page that resides at the website root, and another that is in a subdirectory.
<s3bucketname>/index.html
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Root home page</title>
</head>
<body>
<p>Hello, this page resides in the root directory.</p>
</body>
</html>
<s3bucketname>/subdirectory/index.html
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Subdirectory home page</title>
</head>
<body>
<p>Hello, this page resides in the /subdirectory/ directory.</p>
</body>
</html>
When uploading the files into S3, you can accept the defaults. You add a bucket policy as part of the CloudFront distribution creation that allows CloudFront to access the S3 origin. You should now have an S3 bucket that looks like the following:
Root of bucket
Subdirectory in bucket
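If you would rather upload the two pages from a script than through the console, a short boto3 sketch along these lines also works (again, the bucket name is a placeholder, and the files are assumed to sit in the current directory):
import boto3

BUCKET = "example-lambda-edge-origin"  # placeholder bucket name

s3 = boto3.client("s3")

# Upload both pages, setting Content-Type so browsers render them as HTML.
for key in ["index.html", "subdirectory/index.html"]:
    with open(key, "rb") as f:
        s3.put_object(Bucket=BUCKET, Key=key, Body=f, ContentType="text/html")
    print("Uploaded", key)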
Next, create a CloudFront distribution that your users will use to access the content. Open the CloudFront console, and choose Create Distribution. For Select a delivery method for your content, under Web, choose Get Started.
On the next screen, you set up the distribution. Below are the options to configure:
Origin Domain Name: Select the S3 bucket that you created earlier.
Restrict Bucket Access: Choose Yes.
Origin Access Identity: Create a new identity.
Grant Read Permissions on Bucket: Choose Yes, Update Bucket Policy.
Object Caching: Choose Customize (I am changing the behavior to avoid having CloudFront cache objects, as this could affect your ability to troubleshoot while implementing the Lambda code).
Minimum TTL: 0
Maximum TTL: 0
Default TTL: 0
You can accept all of the other defaults. Again, this is a proof-of-concept exercise. After you are comfortable that the CloudFront distribution is working properly with the origin and Lambda code, you can re-visit the preceding values and make changes before implementing it in production.
CloudFront distributions can take several minutes to deploy (because the changes have to propagate out to all of the edge locations). After that’s done, test the functionality of the S3-backed static website. Looking at the distribution, you can see that CloudFront assigns a domain name:
Try to access the website using a combination of various URLs:
http://<domainname>/: Works
› curl -v http://d3gt20ea1hllb.cloudfront.net/
* Trying 54.192.192.214...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.214) port 80 (#0)
> GET / HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< ETag: "cb7e2634fe66c1fd395cf868087dd3b9"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: -D2FSRwzfcwyKZKFZr6DqYFkIf4t7HdGw2MkUF5sE6YFDxRJgi0R1g==
< Content-Length: 209
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:16 GMT
< Via: 1.1 6419ba8f3bd94b651d416054d9416f1e.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Root home page</title>
</head>
<body>
<p>Hello, this page resides in the root directory.</p>
</body>
</html>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact
This is because CloudFront is configured to request a default root object (index.html) from the origin.
http://<domainname>/subdirectory/: Doesn’t work
› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/
* Trying 54.192.192.214...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.214) port 80 (#0)
> GET /subdirectory/ HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< ETag: "d41d8cd98f00b204e9800998ecf8427e"
< x-amz-server-side-encryption: AES256
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: Iqf0Gy8hJLiW-9tOAdSFPkL7vCWBrgm3-1ly5tBeY_izU82ftipodA==
< Content-Length: 0
< Content-Type: application/x-directory
< Last-Modified: Wed, 19 Jul 2017 19:21:24 GMT
< Via: 1.1 6419ba8f3bd94b651d416054d9416f1e.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact
If you use a tool such as cURL to test this, you notice that CloudFront and S3 are returning a blank response. The reason for this is that the subdirectory does exist, but it does not resolve to an S3 object. Keep in mind that S3 is an object store, so there are no real directories. User interfaces such as the S3 console present a hierarchical view of a bucket with folders based on the presence of forward slashes, but behind the scenes the bucket is just a collection of keys that represent stored objects.
http://<domainname>/subdirectory/index.html: Works
› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/index.html
* Trying 54.192.192.130...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.130) port 80 (#0)
> GET /subdirectory/index.html HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 20 Jul 2017 20:35:15 GMT
< ETag: "ddf87c487acf7cef9d50418f0f8f8dae"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: RefreshHit from cloudfront
< X-Amz-Cf-Id: bkh6opXdpw8pUomqG3Qr3UcjnZL8axxOH82Lh0OOcx48uJKc_Dc3Cg==
< Content-Length: 227
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:45 GMT
< Via: 1.1 3f2788d309d30f41de96da6f931d4ede.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Subdirectory home page</title>
</head>
<body>
<p>Hello, this page resides in the /subdirectory/ directory.</p>
</body>
</html>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact
This request works as expected because you are referencing the object directly. Now, you implement the Lambda@Edge function to return the default index.html page for any subdirectory. Looking at the example JavaScript code, here’s where the magic happens:
var newuri = olduri.replace(/\/$/, '\/index.html');
You are going to use a JavaScript regular expression to match any ‘/’ that occurs at the end of the URI and replace it with ‘/index.html’. This is equivalent to what S3 does on its own with static website hosting. However, as I mentioned earlier, you can’t rely on this if you want to use a policy on the bucket to restrict it so that users must access the bucket through CloudFront. That way, all requests to the S3 bucket must be authenticated using the S3 REST API. Because of this, you implement a Lambda@Edge function that takes any client request ending in ‘/’ and appends a default ‘index.html’ to the request before requesting the object from the origin.
In the Lambda console, choose Create function. On the next screen, skip the blueprint selection and choose Author from scratch, as you’ll use the sample code provided.
Next, configure the trigger. Choosing the empty box shows a list of available triggers. Choose CloudFront and select your CloudFront distribution ID (created earlier). For this example, leave Cache Behavior as * and CloudFront Event as Origin Request. Select the Enable trigger and replicate box and choose Next.
Next, give the function a name and a description. Then, copy and paste the following code:
'use strict';
exports.handler = (event, context, callback) => {
// Extract the request from the CloudFront event that is sent to Lambda@Edge
var request = event.Records[0].cf.request;
// Extract the URI from the request
var olduri = request.uri;
// Match any '/' that occurs at the end of a URI. Replace it with a default index
var newuri = olduri.replace(/\/$/, '\/index.html');
// Log the URI as received by CloudFront and the new URI to be used to fetch from origin
console.log("Old URI: " + olduri);
console.log("New URI: " + newuri);
// Replace the received URI with the URI that includes the index page
request.uri = newuri;
// Return to CloudFront
return callback(null, request);
};
Next, define a role that grants permissions to the Lambda function. For this example, choose Create new role from template, Basic Edge Lambda permissions. This creates a new IAM role for the Lambda function and grants the following permissions:
In a nutshell, these are the permissions that the function needs to create the necessary CloudWatch log group and log stream, and to put the log events so that the function is able to write logs when it executes.
After the function has been created, you can go back to the browser (or cURL) and re-run the test for the subdirectory request that failed previously:
› curl -v http://d3gt20ea1hllb.cloudfront.net/subdirectory/
* Trying 54.192.192.202...
* TCP_NODELAY set
* Connected to d3gt20ea1hllb.cloudfront.net (54.192.192.202) port 80 (#0)
> GET /subdirectory/ HTTP/1.1
> Host: d3gt20ea1hllb.cloudfront.net
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 20 Jul 2017 21:18:44 GMT
< ETag: "ddf87c487acf7cef9d50418f0f8f8dae"
< Accept-Ranges: bytes
< Server: AmazonS3
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Id: rwFN7yHE70bT9xckBpceTsAPcmaadqWB9omPBv2P6WkIfQqdjTk_4w==
< Content-Length: 227
< Content-Type: text/html
< Last-Modified: Wed, 19 Jul 2017 19:21:45 GMT
< Via: 1.1 3572de112011f1b625bb77410b0c5cca.cloudfront.net (CloudFront), 1.1 iad6-proxy-3.amazon.com:80 (Cisco-WSA/9.1.2-010)
< Connection: keep-alive
<
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Subdirectory home page</title>
</head>
<body>
<p>Hello, this page resides in the /subdirectory/ directory.</p>
</body>
</html>
* Curl_http_done: called premature == 0
* Connection #0 to host d3gt20ea1hllb.cloudfront.net left intact
You have now configured a way for CloudFront to return a default index page for subdirectories in S3!
Summary
In this post, you used Lambda@Edge to be able to use CloudFront with an S3 origin access identity and serve a default root object on subdirectory URLs. To find out more about this use case, see Lambda@Edge integration with CloudFront in our documentation.
If you have questions or suggestions, feel free to comment below. For troubleshooting or implementation help, check out the Lambda forum.
So when Michael Darby’s latest PUBG-inspired Game Boy build appeared in my notifications last week, I squealed with excitement and quickly sent the link to my team…while drinking a cocktail by a pool in Turkey ☀️
For those unfamiliar with the game: PlayerUnknown’s Battlegrounds, or PUBG for short, is a Battle-Royale-style multiplayer online video game in which individuals or teams fight to the death on an island map. As players collect weapons, ammo, and transport, their ‘safe zone’ shrinks, forcing a final face-off until only one character remains.
The game has been an astounding success on Steam, the digital distribution platform which brings PUBG to the masses. It records daily player counts of over a million!
Yeah, I’d say one or two people seem to enjoy it!
PUBG on a Game Boy?!
As it’s a fairly complex game, let’s get this out of the way right now: no, Michael is not running the entire game on a Nintendo Game Boy. That would be impossible. Instead, he’s streaming the game from his home PC to a Raspberry Pi Zero W fitted within the hacked handheld console.
Michael removed the excess plastic inside an old Game Boy Color shell to make space for a Zero W, LiPo battery, and TFT screen. He then soldered the necessary buttons to GPIO pins, and wrote a Python script to control them.
The maker battleground
The full script can be found here, along with a more detailed tutorial for the build.
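To give a flavour of what such a script involves (Michael’s actual script is at the link above), here is a minimal RPi.GPIO sketch that reads a single soldered button; the pin number is a stand-in chosen for illustration, not necessarily one Michael used:
import time
import RPi.GPIO as GPIO

BUTTON_PIN = 17  # hypothetical GPIO pin, for illustration only

GPIO.setmode(GPIO.BCM)
GPIO.setup(BUTTON_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

try:
    while True:
        # With the internal pull-up enabled, the pin reads low while the button is pressed.
        if GPIO.input(BUTTON_PIN) == GPIO.LOW:
            print("Button pressed")
        time.sleep(0.05)
finally:
    GPIO.cleanup()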
In order to stream PUBG to the Zero W, Michael uses Moonlight, an open-source client for NVIDIA’s GameStream technology. He set his PC’s screen resolution to 800×600 and its frame rate to 30, so that streaming the game to the TFT screen works perfectly, albeit with no sound.
The end result is a rather impressive build that has confused YouTube commenters since he uploaded footage of it last week. The video has more than 60,000 views to date, so it appears we’re not the only ones impressed with Michael’s make.
314reactor
If you’re a regular reader of our blog, you may recognise Michael’s name from his recent Nerf blaster mod. And fans of Raspberry Pi may also have seen his Pi-powered Windows 98 wristwatch earlier in the year. He blogs at 314reactor, where you can read more about his digital making projects.
Player Two has entered the game
Now it’s your turn. Have you used a Raspberry Pi to create a gaming system? I’m not just talking arcades and RetroPie here. We want to see everything, from Pi-powered board games to tech on the football field.
Share your builds in the comments below and while you’re at it, what game would you like to stream to a handheld device?
Atlassian is an Australian IT company that develops enterprise software, with its best-known products being its issue-tracking app, Jira, and team collaboration and wiki product, Confluence.
In December 2015, Atlassian went public and made their initial public offering (IPO) under the symbol TEAM, valuing them at $4.37 billion. In summary: they’re big.
What happened?
A facelift
It’s a nice sunny day in Sydney in mid-September of 2017, and Atlassian, after 15 years of consistency, has rebranded, changing their look and feel for a brighter and funner one compared to the dreary previous look. It’s a hell of a lot simpler and, as they show in the above video, it’s going to be used with a lot more creativity and flair in mind—it’s flexible in the sense that they can use it in a lot more ways than before, with a lot more colours than before.
The blues they’re using now work super-well with the logos on a white background, whereas the white logos on their new champion brand colour, blue, can go both ways: some can see it as a bold, daring step which is quite attractive, while others can see it as off-putting and not very user-friendly.
What’s it all mean?
Symbolism
In his announcement blog, Atlassian Co-Founder & Co-CEO Mike Cannon-Brookes mentions that the branding change reflects their newly shifted focus on the concept of teamwork. He goes on to explain that their previous logo depicted the sky-holding Greek titan Atlas and symbolised legendary service and support. But, while that image has become renowned, they’re shifting their focus to the concept of teamwork—why focus on something you’ve already done right, right?
The new logo contains more symbolism than meets the eye, as it can be interpreted as:
Two people high-fiving
A mountain to scale
The letter “A” (seen as two pillars reinforcing each other)
Product logos
Atlassian has created and acquired many products in their adventure so far, and they all seemed to have a similar art style, but something always felt off about their consistency. Well, needless to say, this was addressed with Atlassian’s very own “identity system”, which is a pretty cool term for a consistent logo-look for 14+ products, to fit them under one brand.
The result is a set of unique marks that “still feel very related to each other”. Then again, I also see a new set of “unknown” Pokémon.
Typeface
To add a cherry on top, Atlassian will be using their own custom-made typeface called Charlie Sans, specifically designed to balance legibility with personality–that’s probably the best way to describe it. Otherwise, I’d say, out of purely constructive criticism, that there isn’t much difference between it and any of the other staple fonts, e.g. Arial, Verdana, etc. Then again, I’m not a professional designer.
It doesn’t look as distinct as their previous typeface, but, to be fair, it does look very slick next to the new product logos.
The modest dimensions of our Raspberry Pi Zero and its wirelessly connectable sibling, the Pi Zero W, enable makers in our community to build devices that are very small indeed. The PiCorder built by Wayne Keenan is probably the slimmest Pi-powered video-recording device we’ve ever seen.
A simple Pi-camcorder using @pimoroni #HyperPixel, ZeroLipo, lipo bat, camera and #PiZeroW. All parts from the Pirates, total of ~£85. Project build instructions: https://www.hackster.io/TheBubbleworks/picorder-0eb94d
PiCorder hardware
Wayne’s PiCorder is a very straightforward make. On the hardware side, it features a Pimoroni HyperPixel screen, Pi Zero camera module, and Zero LiPo plus LiPo battery pack. To put it together, he simply soldered header pins onto a Zero W, and connected all the components to it – easy as Pi! (Yes, I went there.)
So sleek as to be almost aerodynamic
Recording with the PiCorder (rePiCording?)
Then it was just a matter of installing the HyperPixel driver on the Pi, and the PiCorder was good to go. In this basic setup, recording is controlled via SSH. However, there’s a discussion about better ways to control the device in the comments on Wayne’s write-up. As the HyperPixel is a touchscreen, adding a GUI would make full use of its capabilities.
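To give a sense of what controlling a recording over SSH looks like, here is a minimal picamera sketch you could run on the Pi; the resolution, clip length, and file name are illustrative rather than taken from Wayne’s build:
import picamera

# Illustrative settings; adjust for your screen, storage, and desired clip length.
with picamera.PiCamera(resolution=(1280, 720), framerate=30) as camera:
    camera.start_recording('/home/pi/picorder-clip.h264')
    camera.wait_recording(30)  # keep recording for 30 seconds
    camera.stop_recording()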
Think about how many screens you’re looking at right now
The PiCorder is a great project to recreate if you’re looking to build a small portable camera. If you’re new to soldering, this build is perfect for you: just follow our ‘How to solder’ video and tutorial, and you’re on your way. This could be the start of your journey into the magical world of physical computing!
You could also check our blog on Alex Ellis‘s implementation of YouTube live-streaming for the Pi, and learn how to share your videos in real time.
Cool camera projects
Our educational resources include plenty of cool projects that could use the PiCorder, or for which the device could be adapted.
Get your head around using the official Raspberry Pi Camera Module with this picamera tutorial. Learn how to set up a stationary or wearable time-lapse camera, and turn your images into animated GIFs. You could also kickstart your career as a director by making an amazing stop-motion film!
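As a taste of the time-lapse idea, a minimal picamera sketch might capture one frame per minute and leave you with numbered JPEGs to stitch into a GIF later; the interval and file names are just examples:
import time
import picamera

with picamera.PiCamera(resolution=(1024, 768)) as camera:
    # Grab a numbered JPEG every 60 seconds.
    for filename in camera.capture_continuous('frame{counter:04d}.jpg'):
        print('Captured', filename)
        time.sleep(60)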
No matter which camera project you choose to work on, we’d love to see the results. So be sure to share a link in the comments.
Dr Lucy Rogers is more than just a human LED. She’s also an incredibly imaginative digital maker, ready and willing to void warranties in her quest to take things apart and put them back together again, better than before. With her recipe for legal, digital indoor fireworks, she does exactly that, leaving an electronic cigarette in a battered state as it produces the smoke effects for this awesome build.
In her IBM blog post, Lucy offers a basic rundown of the build. While it may not be a complete how-to for building the firecrackers, the provided GitHub link and commentary should be enough for the seasoned maker to attempt their own version. If you feel less confident about producing the complete build yourself, there are more than enough resources available online to help you create something flashy and bangy without the added smoke show.
For the physical build itself, Lucy used a plastic soft drink bottle, a paper plate, and plastic tubing. Once painted, they provided the body for her firecrackers, and the support needed to keep the LED NeoPixels in place. She also drilled holes into the main plastic tube that ran up the centre of the firecracker, allowing smoke to billow out at random points. More of that to come.
Spray paint and a touch of gold transform the pieces of plastic piping into firecrackers
The cracking, banging sounds play via a USB audio adapter due to complications between the NeoPixels and the audio jack. Lucy explains:
The audio settings need to be set in the Raspberry Pi’s configuration settings (raspi-config). I also used the Linux program ‘alsamixer’ to set the volume. The firecrackers sound file was made by Phil Andrew. I found that using the Node-RED ‘exec node’ calling the ‘mpg123’ program worked best.
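If you are not using Node-RED, the same effect (shelling out to mpg123 to play the sound) can be sketched in a few lines of Python; the file path is a placeholder rather than Lucy’s actual sound file:
import subprocess

# Placeholder path; any firecracker sound effect in MP3 format will do.
SOUND_FILE = '/home/pi/firecrackers.mp3'

# Equivalent in spirit to the Node-RED exec node: run mpg123 against the file.
subprocess.run(['mpg123', SOUND_FILE], check=True)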
Lucy states that the hacking of the e-cigarette was the hardest part of the build. For the smoke show itself, she reversed its recommended usage as follows:
On an electronic cigarette, if you blow down the air-intake hole (not the outlet hole from which you would normally inhale), smoke comes out of the outlet hole. I attached an aquarium pump to the air-intake hole and the firecracker pipe to the outlet, to make smoke on demand.
For the power, she gingerly hacked at the body with a pipe cutter before replacing the inner LiPo battery with a 30W isolated DC-DC converter, allowing for a safer power flow throughout the build (for “safer flow”, read “less likely to blow up the Raspberry Pi”).
The pump and e-cigarette fit snugly inside the painted bottle, while the Raspberry Pi remains outside
The project was partly inspired by Lucy’s work with Robin Hill Country Park. A how-to of that build can be seen below:
www.farnell.com Dr Lucy Rogers presents her exciting Fire Crackers project, taking you from the initial concept right through to installation. Whilst working in partnership with the Robin Hill country park on the Isle of Wight, Lucy wanted to develop a solution for creating safe electronic Fire Crackers, for their Chinese New year festival.
Although I won’t challenge you all to dismantle electric cigarettes, nor do I expect you to spend money on strobe lights, sensors, and other such peripherals, it would be great to see some other attempts at digital home fireworks. If you build, or have built, anything flashy and noisy, please share it in the comments below.
Gone, it would seem, are the days of ‘Hello, My name is…’ stickers and Sharpies. Who wants a simple sticker on their chest, so flat and dull, when they can wear an entire computer, displaying their name and face in pixelated perfection?
With this PiE-Ink Name Badge, maker Josh King has taken this simple means of identification and upgraded it. And in his Instructables tutorial, he explains exactly how. But here’s the TL;DR for those wanting to get the basic gist of the build.
For the badge, Josh uses a Raspberry Pi Zero, a PaPiRus 2″ e-ink HAT, an Adafruit Powerboost 1000c, and a LiPo battery. He also uses various other components, such as magnets and adhesive putty.
Josh prepped the Zero, soldering the header pins in place, and then attached the Powerboost, allowing the LiPo battery to power the unit and be charged at the same time.
From there, he attaches the PaPiRus HAT and secures the whole thing with the putty, to ensure a snug fit. He also attaches a mini slide switch to allow an on/off function.
Having pre-installed Raspbian on the SD card, Josh follows the setup for the PaPiRus, ensuring all library information is in place and that the Pi recognises the 2″ screen. The code for the badge can then be downloaded directly from Josh’s GitHub account. You’ll need to scale your image down to 200×96 in order for it to fit on the e-ink screen.
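If you are unsure how to do that resize, a short Pillow sketch does the job; the file names here are placeholders:
from PIL import Image

# Placeholder file names.
src = Image.open('badge-photo.png')

# Scale to the 2-inch PaPiRus resolution mentioned above (200 x 96 pixels).
badge = src.resize((200, 96))

# E-ink panels are black and white, so a 1-bit conversion usually looks best.
badge.convert('1').save('badge-photo-200x96.png')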
And there you have it. One Raspberry Pi Zero e-ink name badge, ready for you to show off at the next work function, conference, or when you visit Grandma and she still can’t get your name right.
Today, Yahoo Mail introduced a feature that allows you to automatically sync your mobile photos to Yahoo Mail so that they’re readily available when you’re composing an email from your computer. A key technology behind this feature is a new photo and video platform called “Tripod,” which was born out of the innovations and capabilities of Flickr.
For 13 years, Flickr has served as one of the world’s largest photo-sharing communities and as a platform for millions of people who have collectively uploaded more than 13 billion photos globally. Tripod provides a great opportunity to bring some of the most-loved and useful Flickr features to the Yahoo network of products, including Yahoo Mail, Yahoo Messenger, and Yahoo Answers Now.
Tripod and its Three Services
As the name suggests, Tripod offers three services:
The Pixel Service: for uploading, storing, resizing, editing, and serving photos and videos.
The Enrichment Service: for enriching media metadata using image recognition algorithms. For example, the algorithms might identify and tag scenes, actions, and objects.
The Aggregation Service: for in-application and cross-application metadata aggregation, filtering, and search.
The combination of these three services makes Tripod an end-to-end platform for smart image services. There is also an administrative console for configuring the integration of an application with Tripod, and an identity service for authentication and authorization.
Figure 1: Tripod Architecture
The Pixel Service
Flickr has achieved a highly-scalable photo upload and resizing pipeline. Particularly in the case of large-scale ingestion of thousands of photos and videos, Flickr’s mobile and API teams tuned techniques, like resumable upload and deduplication, to create a high-quality photo-sync experience. On serving, Flickr tackled the challenge of optimizing storage without impacting photo quality, and added dynamic resizing to support more diverse client photo layouts.
Over many years at Flickr, we’ve demonstrated sustained uploads of more than 500 photos per second. The full pipeline includes the PHP Upload API endpoint, backend Java services (Image Daemon, Storage Master), hot-hot uploads across the US West and East Coasts, and five worldwide photo caches, plus a massive CDN.
In Tripod’s Pixel Service, we leverage all of this core technology infrastructure as-is, except for the API endpoint, which is now written in Java and implements a new bucket-based data model.
The Enrichment Service
In 2013, Flickr made an exciting leap. Yahoo acquired two Computer Vision technology companies, IQ Engines and LookFlow, and rolled these incredible teams into Flickr. Using their image recognition algorithms, we enhanced Flickr Search and introduced Magic View to the Flickr Camera Roll.
In Tripod, the Enrichment Service applies the image recognition technology to each photograph, resulting in rich metadata that can be used to enhance filtering, indexing, and searching. The Enrichment Service can identify places, themes, landmarks, objects, colors, text, media similarity, NSFW content, and best thumbnail. It also performs OCR text recognition and applies an aesthetic score to indicate the overall quality of the photograph.
The Aggregation Service
The Aggregation Service lets an application, such as Yahoo Mail, find media based on any criteria. For example, it can return all the photos belonging to a particular person within a particular application, all public photos, or all photos belonging to a particular person taken in a specific location during a specific time period (e.g. San Francisco between March 1, 2015 and May 31, 2015.)
Vespa, Yahoo’s internal search engine, indexes all metadata for each media item. If the Enrichment Service has been run on the media, the metadata is indexed in Vespa and is available to the Aggregation API. The result set from a call to the Aggregation Service depends on authentication and the read permissions defined by an API key.
APIs and SDKs
Each service is expressed as a set of APIs. We upgraded our API technology stack, switching from PHP to Spring MVC on a Java Jetty servlet container, and made use of the latest Spring features such as Spring Data, Spring Boot, and Spring Security with OAuth 2.0. Tripod’s API is defined and documented using Swagger. Each service is developed and deployed semi-autonomously from a separate Git repository with a separate build lifecycle to an independent micro-service container.
Figure 2: Tripod API
Swagger Editor makes it easy to auto-generate SDKs in many languages, depending on the needs of Yahoo product developers. The mobile SDKs for iOS and Android are most commonly used, as is the JS SDK for Yahoo’s web developers. The SDKs make integration with Tripod by a web or mobile application easy. For example, in the case of the Yahoo Mail photo upload feature, the Yahoo Mail mobile app includes the embedded Tripod SDK to manage the photo upload process.
Buckets and API Keys
The Tripod data model differs in some important ways from the Flickr data model. Tripod applications, buckets, and API keys introduce the notion of multi-tenancy, with a strong access control boundary. An application is simply the name of the application that is using Tripod (e.g. Yahoo Mail). Buckets are logical containers for the application’s media, and media in an application is further affected by bucket settings such as compression rate, capacity, media time-to-live, and the selection of enrichments to compute.
Figure 3: Creating a new Bucket
Beyond Tripod’s generic attributes, a bucket may also have custom organizing attributes that are defined by an application’s developers. API keys control read/write permissions on buckets and are used to generate OAuth tokens for anonymous or user-authenticated access to a bucket.
Figure 4: Creating a new API Key
App developers at Yahoo use the Tripod Console to:
Create the buckets and API keys that they will use with their application
Define the bucket settings and the access control rules for each API key
Another departure from the Flickr API is that Tripod can handle media that is not user-generated content (UGC). This is critical for storing curated content, as is required by many Yahoo applications.
Architecture and Implementation
Going from a monolithic architecture to a microservices architecture has had its challenges. In particular, we’ve had to find the right internal communication process between the services. At the core of this is our Pulsar Event Bus, over which we send Avro messages backed by a strong schema registry. This lets each Tripod team move fast, without introducing incompatible changes that would break another Tripod service.
For data persistence, we’ve moved most of our high-scale multi-colo data to Yahoo’s distributed noSQL database. We’ve been experimenting with using Redis Cluster as our caching tier, and we use Vespa to drive the Aggregation service. For Enrichment, we make extensive use of Storm and HBase for real-time processing of Tripod’s Computer Vision algorithms. Finally, we run large backfills using PIG, Oozie, and Hive on Yahoo’s massive grid infrastructure.
In 2017, we expect Tripod will be at 50% of Flickr’s scale, with Tripod supporting the photo and video needs across many Yahoo applications that serve Yahoo’s 1B users across mobile and desktop.
After reading about Tripod, you might have a few questions:
Did Tripod replace Flickr?!
No! Flickr is still here, better than ever. In fact, Flickr celebrated its 13th birthday last week! Over the past several years, the Flickr team has implemented significant innovations on core photo management features (such as an optimized storage footprint, dynamic resizing, Camera Roll, Magic View, and Search). We wanted to make these technology advancements available to other teams at Yahoo!
But, what about the Flickr API? Why not just use that?
Flickr APIs are being used by hundreds of thousands of third-party developers around the world. Flickr’s API was designed for interacting with Flickr Accounts, Photos, and Groups, generally at a lower scale than the Flickr site itself; it was not designed for independent, highly configurable, multi-tenant core photo management at large scale.
How can I join the team?
We’re hiring and we’d love to talk to you about our open opportunities! Just email [email protected] to start the conversation.
People who are accomplished in one field of expertise tend to believe that they can bring unique insights to just about any other debate. I am as guilty as anyone: at one time or another, I aired my thoughts on anything from CNC manufacturing, to electronics, to emergency preparedness, to politics. Today, I’m about to commit the same sin – but instead of pretending to speak from a position of authority, I wanted to share a more personal tale.
The author, circa 1995. The era of hand-crank computers and punch cards.
Back in my school days, I was that one really tall and skinny kid in the class. I wasn’t trying to stay this way; I preferred computer games to sports, and my grandma’s Polish cooking was heavy on potatoes, butter, chicken, dumplings, cream, and cheese. But that did not matter: I could eat what I wanted, as often as I wanted, and I still stayed in shape. This made me look down on chubby kids; if my reckless ways had little or no effect on my body, it followed that they had to be exceptionally lazy and must have lacked even the most basic form of self-control.
As I entered adulthood, my habits remained the same. I felt healthy and stayed reasonably active, walking to and from work every other day and hiking with friends whenever I could. But my looks started to change:
The author at a really exciting BlackHat party in 2002.
I figured it’s just a part of growing up. But somewhere around my twentieth birthday, I stepped on a bathroom scale and typed the result into an online calculator. I was surprised to find out that my BMI was about 24 – pretty darn close to overweight.
“Pssh, you know how inaccurate these things are!”, I exclaimed while searching online to debunk that whole BMI thing. I mean, sure, I had some belly fat – maybe a pizza or two too far – but nothing that wouldn’t go away in time. Besides, I was doing fine, so what would be the point of submitting to the society’s idea of the “right” weight?
It certainly helped that I was having a blast at work. I made a name for myself in the industry, published a fair amount of cool research, authored a book, settled down, bought a house, had a kid. It wasn’t until the age of 26 that I strayed into a doctor’s office for a routine checkup. When the nurse asked me about my weight, I blurted out “oh, 175 pounds, give or take”. She gave me a funny look and asked me to step on the scale.
Turns out it was quite a bit more than 175 pounds. With a BMI of 27.1, I was now firmly into the “overweight” territory. Yeah yeah, the BMI metric was a complete hoax – but why did my passport photos look less flattering than before?
A random mugshot from 2007. Some people are just born big-boned, I think.
Well, damn. I knew what had to happen: from now on, I was going to start eating healthier foods. I traded Cheetos for nuts, KFC for sushi rolls, greasy burgers for tortilla wraps, milk smoothies for Jamba Juice, fries for bruschettas, regular sodas for diet. I’d even throw in a side of lettuce every now and then. It was bound to make a difference. I just wasn’t gonna be one of the losers who check their weight every day and agonize over every calorie on their plate. (Weren’t calories a scam, anyway? I think I read that on that cool BMI conspiracy site.)
By the time I turned 32, my body mass index hit 29. At that point, it wasn’t just a matter of looking chubby. I could do the math: at that rate, I’d be in a real pickle in a decade or two – complete with a ~50% chance of developing diabetes or cardiovascular disease. This wouldn’t just make me miserable, but also mess up the lives of my spouse and kids.
Presenting at Google TGIF in 2013. It must’ve been the unflattering light.
I wanted to get this over with right away, so I decided to push myself hard. I started biking to work, quite a strenuous ride. It felt good, but did not help: I would simply eat more to compensate and ended up gaining a few extra pounds. I tried starving myself. That worked, sure – only to be followed by an even faster rebound. Ultimately, I had to face the reality: I had a problem and I needed a long-term solution. There was no one weird trick to outsmart the calorie-counting crowd, no overnight cure.
I started looking for real answers. My world came crumbling down; I realized that a “healthy” burrito from Chipotle packed four times as many calories as a greasy burger from McDonald’s. That a loaded fruit smoothie from Jamba Juice was roughly equal to two hot dogs with a side of mashed potatoes to boot. That a glass of apple juice fared worse than a can of Sprite, and that bruschetta wasn’t far from deep-fried butter on a stick. It didn’t matter if it was sugar or fat, bacon or kale. Familiar favorites were not better or worse than the rest. Losing weight boiled down to portion control – and sticking to it for the rest of my life.
It was a slow and humbling journey that spanned almost a year. I ended up losing around 70 lbs along the way. What shocked me is that it wasn’t a painful experience; what held me back for years was just my own smugness, plus the folksy wisdom gleaned from the covers of glossy magazines.
Author with a tractor, 2017.
I’m not sure there is a moral to this story. I guess one lesson is: don’t be a judgmental jerk. Sometimes, the simple things – the ones you think you have all figured out – prove to be a lot more complicated than they seem.
Trending on social media is how Yahoo is changing its name to “Altaba” and CEO Marissa Mayer is stepping down. This is false.
What is happening instead is that everything we know of as “Yahoo” (including the brand name) is being sold to Verizon. The bits that are left are a skeleton company that holds stock in Alibaba and a few other companies. Since the brand was sold to Verizon, that investment company could no longer use it, so it chose “Altaba”. Since 83% of its investment is in Alibaba, “Altaba” makes sense. It’s not like this new brand name means anything — the skeleton investment company will be wound down in the next year, either as a special dividend to investors, sold off to Alibaba, or both.
Marissa Mayer is an operations CEO. Verizon didn’t want her to run their newly acquired operations, since the entire point of buying them was to take the web operations in a new direction (though apparently she’ll still work a bit with them through the transition). And of course she’s not an appropriate CEO for an investment company. So she had no job left — she made her own job disappear.
What happened today is an obvious consequence of Alibaba going IPO in September 2014. It meant that Yahoo’s stake of 16% in Alibaba was now liquid. All told, the investment arm of Yahoo was worth $36-billion while the web operations (Mail, Fantasy, Tumblr, etc.) was worth only $5-billion.
In other words, Yahoo became a Wall Street mutual fund who inexplicably also offered web mail and cat videos.
Such a thing cannot exist. If Yahoo didn’t act, shareholders would start suing the company to get their money back. That $36-billion in investments doesn’t belong to Yahoo, it belongs to its shareholders. Thus, the moment the Alibaba IPO closed, Yahoo started planning how to separate the investment arm from the web operations.
Yahoo had basically three choices.
The first choice is simply to give the Alibaba (and other investment) shares as a one-time dividend to Yahoo shareholders.
A second choice is simply to split the company in two, one of which has the investments, and the other the web operations.
The third choice is to sell off the web operations to some chump like Verizon.
Obviously, Marissa Mayer took the third choice. Without a slush fund (the investment arm) to keep it solvent, Yahoo didn’t feel it could run its operations profitably without integration with some other company. That meant it either had to buy a large company to integrate with Yahoo, or sell the Yahoo portion to some other large company.
Every company, especially an Internet one, has a legacy value. It’s the amount of money you’ll get from firing everyone, stopping investment in the future, and just raking in a stream of declining revenue year after year. It’s the fate of early Internet companies like Earthlink and Slashdot. It’s like what I documented with Earthlink [*], which continues to offer email to subscribers, but spends only enough to keep the lights on, not even upgrading to the simplest of things like SSL.
Presumably, Verizon will try to make something of a few of the properties. Apparently, Yahoo’s Fantasy sports stuff is popular, and will probably be rebranded as some new Verizon thing. Tumblr is already its own brand name, independent of Yahoo, and thus will probably continue to exist as its own business unit.
One of the weird things is Yahoo Mail. It’s permanently bound to the “yahoo.com” domain, so you can’t do much with the “Yahoo” brand without bringing Mail along with it. Though at this point, the “Yahoo” brand is pretty tarnished. There’s not much new you can put under that brand anyway. I can’t see how Verizon would want to invest in that brand at all — just milk it for what it can over the coming years.
The investment company cannot long exist on its own. Investors want their money back, so they can make future investment decisions on their own. They don’t want the company to make investment choices for them.
Think about when Yahoo made its initial $1-billion investment for 40% of Alibaba in 2005: it did not do so because it was a good “investment opportunity”, but because Yahoo believed it was a good strategic investment, such as providing an entry into the Chinese market, or providing an e-commerce arm to compete against eBay and Amazon. In other words, Yahoo didn’t consider it a good way of investing its money, but a good way to create a strategic partnership — one that just never materialized. From that point of view, the Alibaba investment was a failure.
In 2012, Marissa Mayer sold off 25% of Alibaba, netting $4-billion after taxes. She then lost all $4-billion on the web operations. That stake would be worth over $50-billion today. You can see the problem: companies with large slush funds just fritter them away keeping operations going. Marissa Mayer abused her position of trust, playing with money that belonged to shareholders.
Thus, Altaba isn’t going to play with shareholders’ money. It’s a skeleton company, so there’s no strategic value to investments. It can make no better investment choices than its shareholders can with their own money. Thus, the only purpose of the skeleton investment company is to return the money to the shareholders. I suspect it’ll choose the most tax-efficient way of doing this, like selling the whole thing to Alibaba, which just exchanges the Altaba shares for Alibaba shares, with a 15% bonus representing the value of the other Altaba investments. Either way, if Altaba is still around a year from now, it’s because its board is skimming money that doesn’t belong to them.
Key points:
Altaba is the name of the remaining skeleton investment company; the “Yahoo” brand was sold with the web operations to Verizon.
The name Altaba sucks, but that hardly matters, because it’s not a brand name that needs to stick around for long — the skeleton company is going to return all its money to its investors.
Yahoo had to spin off its investments — there’s no excuse for 90% of its market value to be investments and 10% in its web operations.
In particular, the money belongs to Yahoo’s investors, not Yahoo the company. It’s not some sort of slush fund Yahoo’s executives could use. Yahoo couldn’t use that money to keep its flailing web operations going, as Marissa Mayer was attempting to do.
Most of Yahoo’s web operations will go the way of Earthlink and Slashdot, as Verizon milks the slowly declining revenue while making no new investments in it.
Caffeination is an important cornerstone of Raspberry Pi development. Gordon in particular drinks so much tea in any given day that we are concerned for the sustainability of Sri Lanka’s plantations, not to mention the colour of his insides. (Conversation at 10.30 this morning: “Gordon, how many cups of tea would you estimate you drink in a day?” “Em…fifteen? I’ve already had five this morning, I drink it through the day and I usually have at least one in bed at night.”)
In an act of one-upmanship, Carrie Anne, James and the other people who write our educational resources have been showing us the state of their mugs this morning too.
Because we love you and want to make you happy, we are not illustrating this post with a picture of Gordon’s insides.
We like to make sure that Gordon, Carrie and the rest of the office tea-drinkers are doing as much work as possible, and are undistracted by the need to steep yet another bag. So we were delighted to happen upon this project from Andrey Chilikin. This is what happens when you are innovative enough to turn one of those antique computer-cup-holders on its end and add that standby of makers everywhere, the trusty lollipop stick. Hook it up to the Raspberry Pi’s GPIO pins, and Bob’s your uncle.
Hector Marco and Ismael Ripoll report a discouraging vulnerability in many encrypted disk setups: simply running up too many password failures will eventually result in a root shell. “This vulnerability allows to obtain a root initramfs shell on affected systems. The vulnerability is very reliable because it doesn’t depend on specific systems or configurations. Attackers can copy, modify or destroy the hard disc as well as set up the network to exfiltrate data. This vulnerability is specially serious in environments like libraries, ATMs, airport machines, labs, etc, where the whole boot process is protect (password in BIOS and GRUB) and we only have a keyboard or/and a mouse.”
Let me start by saying that I am responsible for the title of this blog post. The build’s creator, the wonderful Nicole He, didn’t correct me on this when I contacted her, sooooo…
Anyway, this is one project that caused the residents of Pi Towers (mainly Liz and me) to stare at it open-mouthed, praying that the build used a Raspberry Pi. This happens sometimes. We see something awesome that definitely uses a Raspberry Pi, Arduino or similar and we cross fingers and toes in the hope it’s the former. This was one of those cases, and a quick Instagram comment brought us the answer we’d hoped for.
I’ve shared Nicole’s work on our social channels in the past. A few months back, I came across her Grow Slow project tutorial, which became an instant hit both with our followers and across other social accounts and businesses. Grow Slow uses a Raspberry Pi and webcam to tweet a daily photo of her plant. The tutorial is a great starter for those new to coding and the Pi, another reason why it did so well across social media.
But we’re not here to talk about plants and Twitter. We’re here to talk about the Pi-powered Wonder Pop Controller; a project brought to our attention via a retweet, causing instant drooling.
The controller uses a Raspberry Pi, the Adafruit Capacitive Touch HAT, and copper foil tape to create a networked controller.
“I made it for a class about networks; the idea is that we make a physical controller that can connect to a game played over a TCP socket.”
Now, I’m sure someone will argue that it’s not the licking of the lollipop that creates the connection, but rather the licking of the copper tape. And yes, you’re right. But where’s the fun in a project titled ‘Pi-powered Lickable Copper Tape Controller‘? Exactly.
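To illustrate the pattern Nicole describes, here is a rough sketch of the approach, assuming the (now legacy) Adafruit_MPR121 Python library for the Capacitive Touch HAT and a hypothetical game server address; it is not her actual implementation:
import socket
import time

import Adafruit_MPR121.MPR121 as MPR121

# Hypothetical game server address and touch channel, for illustration only.
SERVER = ('192.168.1.50', 5005)
TOUCH_CHANNEL = 0  # the MPR121 input wired to the copper-tape lollipop

cap = MPR121.MPR121()
if not cap.begin():
    raise RuntimeError('Failed to initialise the capacitive touch HAT')

with socket.create_connection(SERVER) as conn:
    while True:
        if cap.is_touched(TOUCH_CHANNEL):
            conn.sendall(b'LICK\n')  # placeholder message for the game server to interpret
        time.sleep(0.1)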
The idea behind this project is a nice starting block for using capacitive touch for video games controllers. While we figure out our creations, share with us any interesting controllers you’ve made.
… or make one this weekend and share it on Monday. I can wait.
*Continues to play with the sun on Nicole’s website instead of doing any work*
Software in the Public Interest (SPI) has completed its 2016 board elections. There were two open seats on the board in addition to four board members whose terms were expiring. The six newly elected members of the board are Luca Filipozzi, Joerg Jaspert, Jimmy Kaplowitz, Andrew Tridgell, Valerie Young, and Martin Zobel-Helas. The full results, including voter statistics, are also available.
So I accidentally ordered too many Raspberry Pi’s. Therefore, I built a small cluster out of them. I thought I’d write up a parts list for others wanting to build a cluster.
To start with, here are some pics of the cluster. What you see is a stack of 7 RPis. At the bottom of the stack is a USB multiport charger and also an Ethernet hub. You see USB cables coming out of the charger to power the RPis, and out the other side you see Ethernet cables connecting the RPis to a network. I’ve included the mouse and keyboard in the picture to give you a sense of perspective.
Here is the same stack turned around, seen from the other side. Out the bottom left you see three external cables: one Ethernet cable to my main network, and power cables for the USB charger and Ethernet hub. You can see that the USB hub is nicely tied down to the frame, but that the Ethernet hub is just sort of jammed in there somehow.
The concept is to get things as cheap as possible, on a per-unit basis. Otherwise, one might as well just buy more expensive computers. My parts list for a 7x Pi cluster is:
…or $54.65 per unit (or $383 for the entire cluster), or around 50% more than the base Raspberry Pis alone. This is getting a bit expensive, as Newegg always has cheap Android tablets on closeout for $30 to $50.
So here’s a discussion of the parts.
Raspberry Pi 2
These are old boards I’d ordered a while back. They are up to RPi3 now, with slightly faster processors and WiFi/Bluetooth on board, neither of which is useful for a cluster. The RPi2 has four CPUs each running at 900 MHz, as opposed to the RPi3, which has four 1.2 GHz processors. If you order a Raspberry Pi now, it’ll be the newer, better one.
The case
You’ll notice that the RPis are mounted on acrylic sheets, which are in turn held together with standoffs/spacers. This is a relatively expensive option.
A cheaper solution would be just to buy the spacers/standoffs yourself. They are a little hard to find, because the screws need to fit the RPi’s 2.9mm mounting holes, which are unusually tiny. Such spacers/standoffs are usually made of brass, but you can also find nylon ones. For the ends, you need some washers and screws. This brings the price down to about $2/unit, or a lot cheaper if you are buying in bulk for a lot of units.
The micro-SD
The absolute cheapest micro-SD cards I could find were $2.95/unit for 4GB, or half the price of the ones I bought. But the ones I chose are 4x the size and 2x the speed. RPi distros are getting large enough that they no longer fit well on 4GB cards, and are even approaching 8GB. Thus, 16GB cards are the best choice, especially when I could get them for $6/unit. By the time you read this, the price of flash will have changed, up or down. I searched on Newegg, because that’s the easiest way to sort by the cheapest. Most cards should work, but check http://elinux.org/RPi_SD_cards to avoid any known bad cards.
Note that different cards have different speeds, which can have a major impact on performance. You probably don’t care for a cluster, but if you are buying a card for a development system, get the faster ones. The Samsung EVO cards are a good choice for something fast.
USB Charging Hub
What we want here is a charger, not a hub. Both can work, but a charger works better.
A normal hub is about connecting all your USB devices to your desktop/laptop. That doesn’t apply to the RPi: its micro-USB power connector is just for power. It simply leverages the fact that there are already lots of USB power cables and chargers out there, so the Pi doesn’t have to invent a custom connector.
USB hubs can supply some power to the RPi, enough to boot it. However, under load, or when you connect further USB devices to the RPi, there may not be enough power available. You might be able to run a couple of RPis from a normal hub, but when you’ve got all seven running (as in this stack), there might not be enough power. Power problems can outright crash the devices, but worse, they can lead to things like corrupt writes to the flash cards, slowly corrupting the system until it fails.
Luckily, in the last couple of years, suppliers have started offering multiport chargers. These are designed for families (and workplaces) with a lot of phones and tablets to charge. They can charge high-capacity batteries on all ports, supplying much more power than your RPi will ever need.
If you want to go ultra cheap, hubs at $1/port may be adequate. Chargers cost around $4/port.
The charger I chose in particular is the Bolse 60W 7-port charger; I need exactly 7 ports. More ports would be nicer, in case I needed to power something else along with the stack, but this Bolse unit has the nice property that it fits snugly within the stack. The frame came with extra spacers, which I screwed together to make room, and I then used zip ties to hold it firmly in place.
Ethernet hub
The RPis only have 100 Mbps Ethernet, so you don’t need the gigabit hub you’d normally get; a 100 Mbps hub is cheaper, smaller, and lower power. The downside is that while each RPi only does 100 Mbps, combined they could push 700 Mbps, which exceeds what the hub’s 100 Mbps uplink can carry.
I got a $10 hub from Newegg. As you can see, it fits within the frame, though not well. Every gigabit hub I’ve seen is bigger and could not fit this way.
Note that I have a couple of extra RPis, but I only built a 7-high stack because of the Ethernet hub. Hubs have only 8 ports, one of which is needed for the uplink, leaving 7 for devices. I’d have to upgrade to an unwieldy 16-port hub if I wanted more ports, and that wouldn’t fit the nice clean case I’ve got.
For a gigabit option, Ethernet switches cost between $23 and $35. The $35 option is a “smart” switch that supports not only gigabit but also a web-based configuration tool, VLANs, and some other high-end features. If I were paying more for a switch, I’d probably go with the smart/managed one.
Cables (Ethernet, USB)
Buying cables is expensive, as everyone who’s bought a $30 Apple cable knows. But buying in bulk from specialty sellers can bring the price to under $1/cable.
The chief buying factor is length: we want short cables that are just barely long enough. In the pictures above, the Ethernet cables are 1 foot long, as are two of the USB cables. The colored USB cables are 6 inches. I got those off Amazon because they looked cool, but now I’m regretting it.
The easiest, cheapest, and highest quality place to buy cables is Monoprice.com. It allows you to easily select the length and color.
To reach everything in this stack, you’ll need 1-foot cables, though 6-inch cables will work for some (but not all) of the USB devices. If I had put the hubs in the middle of the stack instead of at the bottom, 6-inch cables would have worked better, but I didn’t think that would look as pretty. (I chose these colored cables because somebody suggested them, but they won’t work for the full seven-high tower.)
Power consumption
The power consumption of the entire stack is 13.3 watts while it’s idle. The Ethernet hub by itself was 1.3 watts (so low because it’s 100-mbps instead of gigabit).
Subtracting the hub, that’s 12 watts across 7 boards, or about 1.7 watts each; round it up to 2 watts per RPi while idle.
In previous power tests, an RPi draws an extra 2 to 3 watts while doing heavy computations, so under load the entire stack can start consuming a significant amount of power. I mention this because people think of the RPi in terms of a low-power alternative to Intel’s big CPUs, but in truth, once you’ve got enough RPis in a cluster to equal the computational power of an Intel processor, you’ll probably be consuming more electricity.
The operating system
I grabbed the latest Raspbian image and installed it on one of the RPis. I then removed the card, copied the files off (cp -a), reformatted it to use the f2fs flash file system, then copied the files back on. I then made an image of the card (using dd) and wrote that image to the other 6 cards. Finally, I logged into each one and renamed them rpi-a1, …, rpi-a7. (Security note: this means they all have the same SSH private key, but I don’t care.)
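For the curious, those steps translate roughly into commands like the following. The device name /dev/sdb, the mount points, and the image filename are assumptions for illustration; adjust them for your own card reader, and note you’d also need the card’s boot configuration (cmdline.txt/fstab) to mount the root partition as f2fs.

sudo mkdir -p /mnt/card /tmp/rootfs
sudo mount /dev/sdb2 /mnt/card               # the Raspbian root partition
sudo cp -a /mnt/card/. /tmp/rootfs/          # copy the files off
sudo umount /mnt/card
sudo mkfs.f2fs /dev/sdb2                     # reformat the root partition as f2fs
sudo mount /dev/sdb2 /mnt/card
sudo cp -a /tmp/rootfs/. /mnt/card/          # copy the files back on
sudo umount /mnt/card
sudo dd if=/dev/sdb of=rpi-f2fs.img bs=4M    # image the whole card
sudo dd if=rpi-f2fs.img of=/dev/sdX bs=4M    # write the image to each remaining card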
About flash file systems
The micro-SD flash has a bit of wear leveling, but not enough. A lot of RPi servers I’ve installed in the past have failed after a few months with corrupt drives. I don’t know exactly why, but I suspect the flash is wearing out and getting corrupted.
Thus, I installed f2fs, a wear leveling file system designed especially for this sort of situation. We’ll see if that helps at all.
One big thing is to make sure atime is disabled. It’s a massively brain-dead feature inherited from 1980s Unix that writes to the disk every time you read a file.
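As an example, a hypothetical /etc/fstab entry for an f2fs root partition with atime disabled might look like this (the device name is an assumption; check what your own fstab uses):

/dev/mmcblk0p2  /  f2fs  defaults,noatime  0  1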
I notice that the green LED on the RPi, which indicates disk activity, flashes very briefly once per second (so quickly you’ll miss it unless you look closely at the light). I used iotop -a to find out what was causing it. I think it’s just a hardware feature and not related to disk activity. On the other hand, it’s worth tracking down what writes might be happening in the background that will affect flash lifetime.
What I found was that there is some kernel thread that writes rarely to the disk, and a “f2fs garbage collector” that’s cleaning up the disk for wear leveling. I saw nothing that looked like it was writing regularly to the disk.
What to use it for?
So here’s the thing about an RPi cluster: it’s technically useless. If you run the numbers, it has less compute power and higher power consumption than a normal desktop/laptop computer, so an entire cluster of them will still perform slower than laptops/desktops.
Thus, the point of a cluster is to have something to play and experiment with, not to get the best form of computation. The point of individual RPis is not that they have better performance per watt, but that when you don’t need much performance, they deliver it in a package with very low watts.
With that said, I should do some password cracking benchmarks with them, compared across CPUs and GPUs, measuring power consumption. That’ll be a topic for a later post.
That said, I will be using these, though as individual computers rather than as a “cluster”. There are lots of services I want to run, but I don’t want to run a full desktop running VMware; I’d rather control individual devices.
Conclusion
I’m not sure what I’m going to do with my little RPi stack/cluster, but I wanted to document everything about it so that others can replicate it if they want to.