Meet Callum Fawcett, who shares his journey from tinkering with the first Raspberry Pi while he was at school, to a Master’s degree in computer science and a real-life job in programming. We also get to see some of the awesome projects he’s made along the way.
I first decided to get a Raspberry Pi at the age of 14. I had already started programming a little bit before and found that I really enjoyed the language Python. At the time the first Raspberry Pi came out, my History teacher told us about them and how they would be a great device to use to learn programming. I decided to ask for one to help me learn more. I didn’t really know what I would use it for or how it would even work, but after a little bit of help at the start, I quickly began making small programs in Python. I remember some of my first programs being very simple dictionary-type programs in which I would match English words to German to help with my German homework.
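To give a rough idea of what one of those "dictionary-type" vocabulary programs might look like, here is a minimal Python sketch; the word list and prompts are made up for illustration, not Callum's actual code:

```python
# A tiny English-to-German vocabulary quiz, along the lines of the
# simple dictionary-type programs described above (word list is made up).
vocab = {
    "house": "Haus",
    "dog": "Hund",
    "book": "Buch",
}

score = 0
for english, german in vocab.items():
    answer = input("What is the German for '%s'? " % english)
    if answer.strip().lower() == german.lower():
        print("Correct!")
        score += 1
    else:
        print("Not quite -- it's '%s'." % german)

print("You got %d out of %d." % (score, len(vocab)))
```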
Learning Linux, C++, and Python
Most of my learning was done through two sources. I learnt Linux and how the terminal worked using online resources such as Stack Overflow. I would have a problem that I needed to solve, look up solutions online, and try out commands that I found. This was perhaps the hardest part of learning how to use a Raspberry Pi, as it was something I had never done before, but it really helped me in later years when I would use Linux more than Windows. For learning programming, I preferred to use books. I had a book for C++ and a book for Python that I would work through. These were game-based books, so many of the fun projects that I did were simple text-based games where you typed in responses to questions.
A family robotics project
The first robot Callum made using a Raspberry Pi
By far the coolest project I did with the Raspberry Pi was to build a small robot (shown above). This was a joint project between me and my dad: he sorted out the electronics and I programmed the robot. It was a great opportunity to learn about robotics and refine my programming skills. By the end, the robot was capable of moving around by itself, driving into objects, and then reversing and trying a new direction. It was almost like an unintelligent Roomba that couldn’t hoover, but I spent many hours improving small bits and pieces to make it as easy to use as possible. The one thing I wished for but never managed to achieve with my robot was getting it to map out its surroundings. This was a very ambitious goal at the time, since I was still quite inexperienced in programming. The biggest problem was calibrating the robot’s turning circle, which was never consistent, so it was very hard for the robot to know where it was in the room.
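As a hedged illustration of the behaviour described above (move forward, back off when something is in the way, then try a new direction), here is a minimal Python sketch using the gpiozero library. The motor pins and the ultrasonic distance sensor are assumptions for the example, not details of Callum's actual build:

```python
from gpiozero import Robot, DistanceSensor
from time import sleep

# Hypothetical wiring: motor driver on GPIO 7/8 and 9/10,
# HC-SR04 ultrasonic sensor on GPIO 17 (trigger) and 18 (echo).
robot = Robot(left=(7, 8), right=(9, 10))
sensor = DistanceSensor(echo=18, trigger=17, max_distance=1.0)

try:
    while True:
        if sensor.distance < 0.2:   # something closer than ~20 cm
            robot.backward(0.5)     # reverse away from the obstacle
            sleep(1)
            robot.right(0.5)        # turn and try a new direction
            sleep(0.7)
        else:
            robot.forward(0.5)
        sleep(0.05)
finally:
    robot.stop()
```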
Sense HAT maze game
Another fun project that I worked on used the Sense HAT developed for the Astro Pi computers for use on the International Space Station. Using this, I was able to make a memory maze game (shown below), in which a player is shown a maze for several seconds and then has to navigate that maze from memory by shaking the device. This was my first introduction to using more interactive types of input, and this eventually led to my final-year project, which used these interesting interactions to develop another way of teaching.
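To give a flavour of how such a game might work, here is a minimal sketch using the Sense HAT Python library: show a maze on the LED matrix for a few seconds, then let the player steer a dot from memory by tilting the board. The maze layout, tilt thresholds, and win condition are all illustrative, not Callum's actual code:

```python
from sense_hat import SenseHat
from time import sleep

sense = SenseHat()

# A tiny 8x8 maze: 1 = wall, 0 = path (illustrative layout).
maze = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]

# Show the maze briefly, then hide it: the player navigates from memory.
sense.set_pixels([(255, 255, 255) if cell else (0, 0, 0)
                  for row in maze for cell in row])
sleep(3)
sense.clear()

x, y = 1, 1                      # start position
while (x, y) != (6, 6):          # goal cell
    accel = sense.get_accelerometer_raw()
    dx = 1 if accel["x"] > 0.3 else -1 if accel["x"] < -0.3 else 0
    dy = 1 if accel["y"] > 0.3 else -1 if accel["y"] < -0.3 else 0
    nx, ny = x + dx, y + dy
    if 0 <= nx < 8 and 0 <= ny < 8 and maze[ny][nx] == 0:
        x, y = nx, ny            # only move along remembered paths
    sense.clear()
    sense.set_pixel(x, y, (0, 255, 0))
    sleep(0.2)

sense.show_message("You win!")
```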
Learning programming without formal lessons
I have now just finished my Master’s degree in computer science at the University of Bristol. Before going to university, I had no experience of being taught programming in a formal environment. It was not a taught subject at my secondary school or sixth form. I wanted to get more people at my school interested in this area of study, though, so I ran a coding club where I would help others debug their code and discuss interesting problems with them. The reason that I chose to study computer science is largely because of my experiences with Raspberry Pi and other programming I did in my own time during my teenage years. I likely would have studied history if it weren’t for the programming I had done by myself making robots and other games.
Raspberry Pi has continued to play a part in my degree and extra-curricular activities; I used them in two large projects during my time at university and used a similar device in my final project. My robot experience also helped me to enter my university’s ‘Robot Wars’ competition which, though we never won, was a lot of fun.
A tool for learning and a device for industry
Having a Raspberry Pi is always useful during a hackathon, because it’s such a versatile component. Tech like Raspberry Pi will always be useful for beginners to learn the basics of programming and electronics, but these computers are also becoming more and more useful for people with more experience to make fun and useful projects. I could see tech like Raspberry Pi being used in the future to help quickly prototype many types of electronic devices and, as they become more powerful, even being used as an affordable way of controlling many types of robots, which will become more common in the future.
Our guest blogger Callum
Now I am going on to work on programming robot control systems at Ocado Technology. My experiences of robot building during my years before university played a large part in this decision. Already, robots are becoming a huge part of society, and I think they are only going to become more prominent in the future. Automation through robots and artificial intelligence will become one of the most important tools for humanity during the 21st century, and I look forward to being a part of that process. If it weren’t for learning through Raspberry Pi, I certainly wouldn’t be in this position.
Cheers for your story, Callum! Has tinkering with our tiny computer inspired your educational or professional choices? Let us know in the comments below.
Warning: a GIF used in today’s blog contains flashing images.
Students at the University of Bremen, Germany, have built a wearable camera that records the seconds of vision lost when you blink. Augenblick uses a Raspberry Pi Zero and Camera Module alongside muscle sensors to record footage whenever you close your eyes, producing a rather disjointed film of the sights you miss out on.
Blink and you’ll miss it
The average person blinks up to five times a minute, with each blink lasting 0.5 to 0.8 seconds. These half-seconds add up to about 30 minutes a day. What sights are we losing during these minutes? That is the question asked by students Manasse Pinsuwan and René Henrich when they set out to design Augenblick.
Blinking is a highly invasive mechanism for our eyesight. Every day we close our eyes thousands of times without noticing it. Our mind manages to never let us wonder what exactly happens in the moments that we miss.
Capturing lost moments
For Augenblick, the wearer sticks MyoWare Muscle Sensor pads to their face, and these detect the electrical impulses that trigger blinking.
Two pads are applied over the orbicularis oculi muscle that forms a ring around the eye socket, while the third pad is attached to the cheek as a neutral point.
Biology fact: there are two muscles responsible for blinking. The orbicularis oculi muscle closes the eye, while the levator palpebrae superioris muscle opens it — and yes, they both sound like the names of Harry Potter spells.
The sensor is read 25 times a second. Whenever it detects that the orbicularis oculi is active, the Camera Module records video footage.
Pressing a button on the side of the Augenblick glasses sets the code running. An LED lights up whenever the camera is recording and also serves to confirm the correct placement of the sensor pads.
The Pi Zero saves the footage so that it can be stitched together later to form a continuous, if disjointed, film.
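As a rough sketch of how the pieces described above could fit together in Python (the MCP3008 ADC, GPIO pin numbers, threshold, and file paths are assumptions; the students' actual code isn't shown here):

```python
from gpiozero import MCP3008, Button, LED
from picamera import PiCamera
from time import sleep, time

# Hypothetical wiring: the MyoWare sensor output is read through an
# MCP3008 ADC on channel 0 (the Pi has no analog inputs of its own),
# with a start button on GPIO 23 and a recording LED on GPIO 24.
sensor = MCP3008(channel=0)
button = Button(23)
led = LED(24)
camera = PiCamera()

THRESHOLD = 0.6          # illustrative; needs calibration per wearer
recording = False

button.wait_for_press()  # pressing the button starts the loop

while True:
    muscle_active = sensor.value > THRESHOLD
    if muscle_active and not recording:
        led.on()
        camera.start_recording("/home/pi/blink-%d.h264" % int(time()))
        recording = True
    elif not muscle_active and recording:
        camera.stop_recording()
        led.off()
        recording = False
    sleep(0.04)          # poll the sensor 25 times a second
```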
Learn more about the Augenblick blink camera
You can find more information on the conception, design, and build process of Augenblick here in German, with a shorter explanation including lots of photos here in English.
And if you’re keen to recreate this project, our free project resource for a wearable Pi Zero time-lapse camera will come in handy as a starting point.
SHB is a small invitational gathering of people studying various aspects of the human side of security, organized each year by Alessandro Acquisti, Ross Anderson, and myself. The 50 or so people in the room include psychologists, economists, computer security researchers, sociologists, political scientists, neuroscientists, designers, lawyers, philosophers, anthropologists, business school professors, and a smattering of others. It’s not just an interdisciplinary event; most of the people here are individually interdisciplinary.
The goal is to maximize discussion and interaction. We do that by putting everyone on panels, and limiting talks to 7-10 minutes. The rest of the time is left to open discussion. Four hour-and-a-half panels per day over two days equals eight panels; six people per panel means that 48 people get to speak. We also have lunches, dinners, and receptions — all designed so people from different disciplines talk to each other.
I invariably find this to be the most intellectually stimulating conference of my year. It influences my thinking in many different, and sometimes surprising, ways.
Here are my posts on the first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, and tenth SHB workshops. Follow those links to find summaries, papers, and occasionally audio recordings of the various workshops.
As we shoot way past 500 petabytes of data stored, we need a lot of helping hands in the data center to keep those hard drives spinning! We’ve been hiring quite a lot, and our latest addition is Jack. Let’s learn a bit more about him, shall we?
What is your Backblaze Title? Data Center Tech
Where are you originally from? Walnut Creek, CA until 7th grade when the family moved to Durango, Colorado.
What attracted you to Backblaze? I had heard about how cool the Backblaze community is and have always been fascinated by technology.
What do you expect to learn while being at Backblaze? I expect to learn a lot about how our data centers run and all of the hardware behind it.
Where else have you worked? Garrhs HVAC as an HVAC Installer and then Durango Electrical as a Low Volt Technician.
Where did you go to school? Durango High School and then Montana State University.
What’s your dream job? I would love to be a driver for the Audi Sport. Race cars are so much fun!
Favorite place you’ve traveled? Iceland has definitely been my favorite so far.
Favorite hobby? Video games.
Of what achievement are you most proud? Getting my Eagle Scout badge was a tough, but rewarding experience that I will always cherish.
Star Trek or Star Wars? Star Wars.
Coke or Pepsi? Coke…I know, it’s bad.
Favorite food? Thai food.
Why do you like certain things? I tend to warm up to things the more time I spend around them, although I never really know until it happens.
Anything else you’d like to tell us? I’m a friendly car guy who will always be in love with my European cars and I really enjoy the Backblaze community!
We’re happy you joined us Out West! Welcome aboard Jack!
Earlier this year on 3 and 4 March, communities around the world held Raspberry Jam events to celebrate Raspberry Pi’s sixth birthday. We sent out special birthday kits to participating Jams — it was amazing to know the kits would end up in the hands of people in parts of the world very far from Raspberry Pi HQ in Cambridge, UK.
The Raspberry Jam Camer team: Damien Doumer, Eyong Etta, Loïc Dessap and Lionel Sichom, aka Lionel Tellem
Preparing for the #PiParty
One birthday kit went to Yaoundé, the capital of Cameroon. There, a team of four students in their twenties — Lionel Sichom (aka Lionel Tellem), Eyong Etta, Loïc Dessap, and Damien Doumer — were organising Yaoundé’s first Jam, called Raspberry Jam Camer, as part of the Raspberry Jam Big Birthday Weekend. The team knew one another through their shared interests and skills in electronics, robotics, and programming. Damien explains in his blog post about the Jam that they planned ahead for several activities for the Jam based on their own projects, so they could be confident of having a few things that would definitely be successful for attendees to do and see.
Show-and-tell at Raspberry Jam Cameroon
Loïc presented a Raspberry Pi–based, Android app–controlled robot arm that he had built, and Lionel coded a small video game using Scratch on Raspberry Pi while the audience watched. Damien demonstrated the possibilities of Windows 10 IoT Core on Raspberry Pi, showing how to install it, how to use it remotely, and what you can do with it, including building a simple application.
Loïc showcases the prototype robot arm he built
There was lots more too, with others discussing their own Pi projects and talking about the possibilities Raspberry Pi offers, including a Pi-controlled drone and car. Cake was a prevailing theme of the Raspberry Jam Big Birthday Weekend around the world, and Raspberry Jam Camer made sure they didn’t miss out.
Yay, birthday cake!!
A big success
Most visitors to the Jam were secondary school students, while others were university students and graduates. The majority were unfamiliar with Raspberry Pi, but all wanted to learn about Raspberry Pi and what they could do with it. Damien comments that the fact most people were new to Raspberry Pi made the event more interactive rather than creating any challenges, because the visitors were all interested in finding out about the little computer. The Jam was an all-round success, and the team was pleased with how it went:
What I liked the most was that we sensitized several people about the Raspberry Pi and what one can be capable of with such a small but powerful device. — Damien Doumer
The Jam team rounded off the event by announcing that this was the start of a Raspberry Pi community in Yaoundé. They hope that they and others will be able to organise more Jams and similar events in the area to spread the word about what people can do with Raspberry Pi, and to help them realise their ideas.
Raspberry Jam Camer gets the thumbs-up
The Raspberry Pi community in Cameroon
In a French-language interview about their Jam, the team behind Raspberry Jam Camer said they’d like programming to become the third official language of Cameroon, after French and English; their aim is to popularise programming and digital making across Cameroonian society. Neither of these fields is very familiar to most people in Cameroon, but both are very well aligned with the country’s ambitions for development. The team is conscious of the difficulties around the emergence of information and communication technologies in the Cameroonian context; in response, they are seizing the opportunities Raspberry Pi offers to give children and young people access to modern and constantly evolving technology at low cost.
Thanks to Lionel, Eyong, Damien, and Loïc, and to everyone who helped put on a Jam for the Big Birthday Weekend! Remember, anyone can start a Jam at any time — and we provide plenty of resources to get you started. Check out the Guidebook, the Jam branding pack, our specially-made Jam activities online (in multiple languages), printable worksheets, and more.
Researchers have demonstrated the ability to send inaudible commands to voice assistants like Alexa, Siri, and Google Assistant.
Over the last two years, researchers in China and the United States have begun demonstrating that they can send hidden commands that are undetectable to the human ear to Apple’s Siri, Amazon’s Alexa and Google’s Assistant. Inside university labs, the researchers have been able to secretly activate the artificial intelligence systems on smartphones and smart speakers, making them dial phone numbers or open websites. In the wrong hands, the technology could be used to unlock doors, wire money or buy stuff online – simply with music playing over the radio.
A group of students from the University of California, Berkeley, and Georgetown University showed in 2016 that they could hide commands in white noise played over loudspeakers and through YouTube videos to get smart devices to turn on airplane mode or open a website.
This month, some of those Berkeley researchers published a research paper that went further, saying they could embed commands directly into recordings of music or spoken text. So while a human listener hears someone talking or an orchestra playing, Amazon’s Echo speaker might hear an instruction to add something to your shopping list.
Students taking Design of Mechatronics at the Technical University of Denmark have created some seriously elegant and striking Raspberry Pi speakers. Their builds are part of a project asking them to “explore, design and build a 3D printed speaker, around readily available electronics and components”.
The students have been uploading their designs, incorporating Raspberry Pis and HiFiBerry HATs, to Thingiverse throughout April. The task is a collaboration with luxury brand Bang & Olufsen’s Create initiative, and the results wouldn’t look out of place in a high-end showroom; I’d happily take any of these home.
Søren Qvist’s wall-mounted kitchen sphere uses 3D-printed and laser-cut parts, along with the HiFiBerry HAT and B&O speakers to create a sleek-looking design.
Otto Ømann’s group have designed the Hex One – a work-in-progress wireless 360° speaker. A particular objective for their project is to create a speaker using as many 3D-printed parts as possible.
“The design is supposed to resemble that of a B&O speaker, and from a handful of categories we chose to create a portable and wearable speaker,” explain Gustav Larsen and his team.
Oliver Repholtz Behrens and team have housed a Raspberry Pi and HiFiBerry HAT inside this stylish AirPlay speaker. You can follow their design progress on their team blog.
Tue Thomsen’s six-person team Mechatastic have produced the B&O TILE. “The speaker consists of four 3D-printed cabinet and top parts, where the top should be covered by fabric,” they explain. “The speaker insides consists of laser-cut wood to hold the tweeter and driver and encase the Raspberry Pi.”
The team aimed to design a speaker that would be at home in a kitchen. With a removable upper casing allowing for a choice of colour, the TILE can be customised to fit particular tastes and colour schemes.
Build your own speakers with Raspberry Pis
Raspberry Pi’s onboard audio jack, along with third-party HATs such as the HiFiBerry and Pimoroni Speaker pHAT, make speaker design and fabrication with the Pi an interesting alternative to pre-made tech. These builds don’t tend to be technically complex, and they provide some lovely examples of tech-based projects that reflect makers’ own particular aesthetic style.
If you have access to a 3D printer or a laser cutter, perhaps at a nearby maker space, then those can be excellent resources, but fancy kit isn’t a requirement. Basic joinery and crafting with card or paper are just a couple of ways you can build things that are all your own, using familiar tools and materials. We think more people would enjoy getting hands-on with this sort of thing if they gave it a whirl, and we publish a free magazine to help.
Looking for a new project to build around the Raspberry Pi Zero, I came across the pHAT DAC from Pimoroni. This little add-on board adds audio playback capabilities to the Pi Zero. Because the pHAT uses the GPIO pins, the USB OTG port remains available for a wifi dongle.
This video by Frederick Vandenbosch is a great example of building AirPlay speakers using a Pi and HAT, and a quick search will find you lots more relevant tutorials and ideas.
Have you built your own? Share your speaker-based Pi builds with us in the comments.
Researchers at Princeton University have released IoT Inspector, a tool that analyzes the security and privacy of IoT devices by examining the data they send across the Internet. They’ve already used the tool to study a bunch of different IoT devices. From their blog post:
Finding #3: Many IoT Devices Contact a Large and Diverse Set of Third Parties
In many cases, consumers expect that their devices contact manufacturers’ servers, but communication with other third-party destinations may not be a behavior that consumers expect.
We have found that many IoT devices communicate with third-party services, of which consumers are typically unaware. We have found many instances of third-party communications in our analyses of IoT device network traffic. Some examples include:
Samsung Smart TV. During the first minute after power-on, the TV talks to Google Play, Double Click, Netflix, FandangoNOW, Spotify, CBS, MSNBC, NFL, Deezer, and Facebook, even though we did not sign in or create accounts with any of them.
Amcrest WiFi Security Camera. The camera actively communicates with cellphonepush.quickddns.com using HTTPS. QuickDDNS is a Dynamic DNS service provider operated by Dahua. Dahua is also a security camera manufacturer, although Amcrest’s website makes no references to Dahua. Amcrest customer service informed us that Dahua was the original equipment manufacturer.
Halo Smoke Detector. The smart smoke detector communicates with broker.xively.com. Xively offers an MQTT service, which allows manufacturers to communicate with their devices.
Geeni Light Bulb. The Geeni smart bulb communicates with gw.tuyaus.com, which is operated by TuYa, a China-based company that also offers an MQTT service.
We also looked at a number of other devices, such as Samsung Smart Camera and TP-Link Smart Plug, and found communications with third parties ranging from NTP pools (time servers) to video storage services.
Their first two findings are that “Many IoT devices lack basic encryption and authentication” and that “User behavior can be inferred from encrypted IoT device traffic.” No surprises there.
Ever since we introduced our Groups feature, Backblaze for Business has been growing at a rapid rate! We’ve been staffing up in order to support the product and the newest addition to the sales team, Victoria, joins us as a Sales Development Representative! Let’s learn a bit more about Victoria, shall we?
What is your Backblaze Title? Sales Development Representative.
Where are you originally from? Harrisburg, North Carolina.
What attracted you to Backblaze? The leaders and family-style culture.
What do you expect to learn while being at Backblaze? How to sell, sell, sell!
Where else have you worked? The North Carolina Autism Society, an ophthalmologist’s office, home health care, and another tech startup.
Where did you go to school? The University of North Carolina Chapel Hill and Duke University’s Fuqua School of Business.
What’s your dream job? Fighter pilot, professional snowboarder or killer whale trainer.
Favorite place you’ve traveled? Hawaii and Banff.
Favorite hobby? Basketball and cars.
Of what achievement are you most proud? Missionary work and helping patients feel better.
Star Trek or Star Wars? Neither, but probably Star Wars.
Coke or Pepsi? Neither, bubble tea.
Favorite food? Snow crab legs.
Why do you like certain things? Because God made me that way.
Anything else you’d like to tell us? I’m a germophobe, drink a lot of water, and, unfortunately, am introverted.
Being on the phones all day is a good way to build up those extroversion skills! Welcome to the team and we hope you enjoy learning how to sell, sell, sell!
Backblaze is growing, and with it our need to cater to a lot of different use cases that our customers bring to us. We needed a Solutions Engineer to help out, and after a long search we’ve hired our first one! Let’s learn a bit more about Nathan, shall we?
What is your Backblaze Title? Solutions Engineer. Our customers bring a thousand different use cases to both B1 and B2, and I’m here to help them figure out how best to make those use cases a reality. Also, any odd jobs that Nilay wants me to do.
Where are you originally from? I am native to the San Francisco Bay Area; I studied mathematics at UC Santa Cruz, and then computer science at California University of Hayward (which has since renamed itself California University of the East Hills; I observe that it’s still in Hayward).
What attracted you to Backblaze? Backblaze is a stable company with huge growth and even bigger potential; the business model is attractive, and the team is outstanding. Add to that the strong commitment to transparency, and it’s a hard company to resist. We can store – and restore – data while offering superior reliability at an economic advantage over do-it-yourself, and that’s a great place to be.
What do you expect to learn while being at Backblaze? Everything I need to, but principally how our customers choose to interact with web storage. Storage isn’t a solution per se, but it’s an important component of any persistent solution. I’m looking forward to working with all the different concepts our customers have to make use of storage.
Where else have you worked? All sorts of places, but I’ll admit publicly to EMC, Gemalto, and my own little (failed, alas) startup, IC2N. I worked with low-level document imaging.
Where did you go to school? UC Santa Cruz, BA Mathematics; CU Hayward, Master of Science in Computer Science.
What’s your dream job? Sipping tea in the California redwood forest. However, solutions engineer at Backblaze is a good second choice!
Favorite place you’ve traveled? Ashland, Oregon, for the Oregon Shakespeare Festival and the marble caves (most caves form from limestone).
Favorite hobby? Theater. Pathfinder. Writing. Baking cookies and cakes.
Of what achievement are you most proud? Marrying the most wonderful man in the world.
Star Trek or Star Wars? Star Trek’s utopian science fiction vision of humanity and science resonates a lot more strongly with me than the dystopian science fantasy of Star Wars.
Coke or Pepsi? Neither. I’d much rather have a cup of jasmine tea.
Favorite food? It varies, but I love Indian and Thai cuisine. Truly excellent Italian food is marvelous – wood fired pizza, if I had to pick only one, but the world would be a boring place with a single favorite food.
Why do you like certain things? If I knew that, I’d be in marketing.
Anything else you’d like to tell us? If you haven’t already encountered the amazing authors Patricia McKillip and Lois McMaster Bujold – go encounter them. Be happy.
There’s nothing wrong with a nice cup of tea and a long game of Pathfinder. Sign us up! Welcome to the team Nathan!
Data that describe processes in a spatial context are everywhere in our day-to-day lives and they dominate big data problems. Map data, for instance, whether describing networks of roads or remote sensing data from satellites, get us where we need to go. Atmospheric data from simulations and sensors underlie our weather forecasts and climate models. Devices and sensors with GPS can provide a spatial context to nearly all mobile data.
In this post, we introduce the WIND toolkit, a huge (500 TB), open weather model dataset that’s available to the world on Amazon’s cloud services. We walk through how to access this data and some of the open-source software developed to make it easily accessible. Our solution considers a subset of geospatial data that exist on a grid (raster) and explores ways to provide access to large-scale raster data from weather models. The solution uses foundational AWS services and the Hierarchical Data Format (HDF), a well-adopted format for scientific data.
The approach developed here can be extended to any data that fit in an HDF5 file, which can describe sparse and dense vectors and matrices of arbitrary dimensions. This format is already popular within the physical sciences for both experimental and simulation data. We discuss solutions to gridded data storage for a massive dataset of public weather model outputs called the Wind Integration National Dataset (WIND) toolkit. We also highlight strategies that are general to other large geospatial data management problems.
Wind Integration National Dataset
As variable renewable power penetration levels increase in power systems worldwide, the importance of renewable integration studies to ensure continued economic and reliable operation of the power grid is also increasing. The WIND toolkit is the largest freely available grid integration dataset to date.
The WIND toolkit was developed by 3TIER by Vaisala. They were under a subcontract to the National Renewable Energy Laboratory (NREL) to support studies on integration of wind energy into the existing US grid. NREL is a part of a network of national laboratories for the US Department of Energy and has a mission to advance the science and engineering of energy efficiency, sustainable transportation, and renewable power technologies.
The toolkit has been used by consultants, research groups, and universities worldwide to support grid integration studies. Less traditional uses also include resource assessments for wind plants (such as those powering Amazon data centers), and studying the effects of weather on California condor migrations in the Baja peninsula.
The diversity of applications highlights the value of accessible, open public data. Yet, there’s a catch: the dataset is huge. The WIND toolkit provides simulated atmospheric (weather) data at a two-km spatial resolution and five-minute temporal resolution at multiple heights for seven years. The entire dataset is half a petabyte (500 TB) in size and is stored in the NREL High Performance Computing data center in Golden, Colorado. Making this dataset publicly available easily and in a cost-effective manner is a major challenge.
As other laboratories and public institutions work to release their data to the world, they may face similar challenges to those that we experienced. Some prior, well-intentioned efforts to release huge datasets as-is have resulted in data resources that are technically available but fundamentally unusable. They may be stored in an unintuitive format or indexed and organized to support only a subset of potential uses. Downloading hundreds of terabytes of data is often impractical. Most users don’t have access to a big data cluster (or super computer) to slice and dice the data as they need after it’s downloaded.
We aim to provide a large amount of data (50 terabytes) to the public in a way that is efficient, scalable, and easy to use. In many cases, researchers can access these huge cloud-located datasets using the same software and algorithms they have developed for smaller datasets stored locally. Only the pieces of data they need for their individual analysis must be downloaded. To make this work in practice, we worked with the HDF Group and have built upon their forthcoming Highly Scalable Data Service (HSDS).
In the rest of this post, we discuss how the HSDS software was developed to use Amazon EC2 and Amazon S3 resources to provide convenient and scalable access to these huge geospatial datasets. We describe how the HSDS service has been put to work for the WIND Toolkit dataset and demonstrate how to access it using the h5pyd Python library and the REST API. We conclude with information about our ongoing work to release more ‘open’ datasets to the public using AWS services, and ways to improve and extend the HSDS with newer Amazon services like Amazon ECS and AWS Lambda.
Developing a scalable service for big geospatial data
The HDF5 file format and API have been used for many years and are an effective means of storing large scientific datasets. For example, NASA’s Earth Observing System (EOS) satellites collect more than 16 TB of data per day using HDF5.
With the rise of the cloud, there are new challenges and opportunities to rethink how HDF5 can be enhanced to work effectively as a component in a cloud-native architecture. For the HDF Group, working with NREL has been a great opportunity to put ideas into practice with a production-size dataset.
An HDF5 file consists of a directed graph of group and dataset objects. Datasets can be thought of as multidimensional arrays with support for user-defined metadata tags and compression. Typical operations on datasets are reading or writing data to a regular subregion (a hyperslab) or reading and writing individual elements (a point selection). Also, group and dataset objects may each contain an arbitrary number of user-defined metadata elements known as attributes.
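For readers unfamiliar with these terms, the short h5py example below creates a chunked, compressed dataset, attaches an attribute, and then reads back a hyperslab and a point selection. The dataset name, shape, and chunk layout are illustrative only, not the WIND toolkit's actual schema:

```python
import numpy as np
import h5py

# Create a small file that mimics the structure described above:
# a dataset (multidimensional array) with chunking, compression,
# and user-defined attributes.
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("wtk")
    dset = grp.create_dataset(
        "windspeed_100m",
        shape=(8760, 100, 100),    # (time, y, x): one year hourly on a 100x100 grid
        dtype="f4",
        chunks=(24, 100, 100),     # one chunk per simulated day
        compression="gzip",
    )
    dset.attrs["units"] = "m s-1"  # user-defined metadata (an attribute)
    dset[0:24, :, :] = np.random.rand(24, 100, 100)

with h5py.File("example.h5", "r") as f:
    dset = f["wtk/windspeed_100m"]
    day_one = dset[0:24, 40:60, 40:60]   # hyperslab: a regular subregion
    single = dset[3, 50, 50]             # point selection: one element
    print(dset.attrs["units"], day_one.shape, single)
```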
Many people have used the HDF library in applications developed or ported to run on EC2 instances, but there are a number of constraints that often prove problematic:
The HDF5 library can’t read directly from HDF5 files stored as S3 objects. The entire file (often many GB in size) would need to be copied to local storage before the first byte can be read, and the instance must be configured with an appropriately sized EBS volume.
The HDF library only has access to the computational resources of the instance itself (as opposed to a cluster of instances), so many operations are bottlenecked by the library.
Any modifications to the HDF5 file would somehow have to be synchronized with changes that other instances have made to the same file before writing back to S3.
Using a pattern common to many offerings from AWS, the solution to these constraints is to develop a service framework around the HDF data model. Using this model, the HDF Group has created the Highly Scalable Data Service (HSDS) that provides all the functionality that traditionally was provided by the HDF5 library. By using the service, you don’t need to manage your own file volumes, but can just read and write whatever data that you need.
Because the service manages the actual data persistence to a durable medium (S3, in this case), you don’t need to worry about disk management; simply stream the data you need from the service as you need it. Putting the functionality behind a service also allows some tricks to increase performance (described in more detail later). And lastly, HSDS allows any number of clients to access the data at the same time, enabling HDF5 to be used as a coordination mechanism for multiple readers and writers.
In designing the HSDS architecture, we gave much thought to how to achieve scalability of the HSDS service. For accessing HDF5 data, there are two different types of scaling to consider:
Multiple clients making many requests to the service
Single requests that require a significant amount of data processing
To deal with the first scaling challenge, as with most services, we considered how the service responds as the request rate increases. AWS provides some great tools that help in this regard:
Auto Scaling groups
Elastic Load Balancing load balancers
The ability of S3 to handle large aggregate throughput rates
By using a cluster of EC2 instances behind a load balancer, you can handle different client loads in a cost-effective manner.
The second scaling challenge concerns single requests that would take significant processing time with just one compute node. One example of this from the WIND toolkit would be extracting all the values in the seven-year time span for a given geographic point and dataset.
In HDF5, large datasets are typically stored as “chunks”; that is, a regular partition of the array. In HSDS, each chunk is stored as a binary object in S3. The sequential approach to retrieving the time series values would be for the service to read each chunk needed from S3, extract the needed elements, and go on to the next chunk. In this case, that would involve processing 2557 chunks, and would be quite slow.
Fortunately, with HSDS, you can speed this up quite a bit by exploiting the compute and I/O capabilities of the cluster. Upon receiving the request, the receiving node can use other nodes in the cluster to read different portions of the selection. With multiple nodes reading from S3 in parallel, performance improves as the cluster size increases.
The diagram below illustrates how this works in a simplified case of four chunks and four nodes.
This architecture has worked well in practice. In testing with the WIND toolkit and time series extraction, we observed a request latency of ~60 seconds using four nodes vs. ~5 seconds with 40 nodes. Performance roughly scales with the size of the cluster.
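To make the fan-out idea concrete, here is a minimal Python sketch (an illustration of the pattern, not HSDS's actual implementation): chunk objects are fetched from S3 in parallel and the needed element is extracted from each. The bucket name, key layout, and element index are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3
import numpy as np

# Illustrative only: bucket and key naming are made up for this sketch.
s3 = boto3.client("s3")
BUCKET = "example-hsds-chunks"
chunk_keys = ["wtk/windspeed_100m/chunk_%04d" % i for i in range(2557)]

def fetch_chunk(key):
    # Read one binary chunk object from S3 and pull out the element
    # needed for the requested time series.
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    chunk = np.frombuffer(body, dtype="f4")
    return chunk[42]

# Fetching chunks concurrently is what lets the latency drop as more
# workers (here threads; in HSDS, cluster nodes) are added.
with ThreadPoolExecutor(max_workers=32) as pool:
    time_series = np.array(list(pool.map(fetch_chunk, chunk_keys)))

print(time_series.shape)
```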
A planned enhancement to this is to use AWS Lambda for the worker processing. This enables 1000-way parallel reads at a reasonable cost, as you only pay for the milliseconds of CPU time used with AWS Lambda.
Public access to atmospheric data using HSDS and AWS
An early challenge in releasing the WIND toolkit data was in deciding how to subset the data for different use cases. In general, few researchers need access to the entire 0.5 PB of data and a great deal of efficiency and cost reduction can be gained by making directed constituent datasets.
NREL grid integration researchers initially extracted a 2-TB subset by selecting 120,000 points where the wind resource seemed appropriate for development. They also kept only the data important for wind applications (100-m wind speed, converted to power) at the locations most interesting to those performing grid studies. To support the remaining users who needed more data resolution, we down-sampled the data to a 60-minute temporal resolution, keeping all the other variables and the spatial resolution intact. This reduced dataset is 50 TB in size and describes 30+ atmospheric variables for 7 years at a 60-minute temporal resolution.
The WindViz browser-based Gridded Wind Toolkit Visualizer was created as an example implementation of the HSDS REST API in JavaScript. The visualizer is written in the style of ECMAScript 2016 using a modern development toolchain that includes webpack and Babel. The source code is available through our GitHub repository. The demo page is hosted via GitHub pages, and we use a cross-origin AJAX request to fetch data from the HSDS service running on the EC2 infrastructure. The visualizer can be used to explore the gridded wind toolkit data on a map. Achieve full spatial resolution by zooming in to a specific region.
Programmatic access is possible using the h5pyd Python library, a distributed analog to the widely used h5py library. Users interact with the datasets (variables) and slice the data from its (time x longitude x latitude) cube form as they see fit.
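A minimal sketch of that access pattern is below; the domain path, endpoint URL, and dataset name are placeholders, so check the notebooks for the real values used by the service:

```python
import h5pyd

# Open a domain on the HSDS service much as you would open a local
# HDF5 file with h5py (domain and endpoint below are illustrative).
f = h5pyd.File("/nrel/wtk-us.h5", "r",
               endpoint="https://example-hsds-endpoint")

dset = f["windspeed_100m"]      # a (time x longitude x latitude) cube
print(dset.shape, dset.dtype)

# Slice like a local h5py dataset: only the requested hyperslab is
# transferred from the service, not the whole dataset.
timeseries = dset[:, 200, 300]  # all time steps at one grid point
snapshot = dset[1000, :, :]     # one time step over the whole grid
```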
Examples and use cases are described in a set of Jupyter notebooks available on GitHub. Once the notebooks are set up on an EC2 instance, you have a Jupyter notebook server running on your EC2 server.
From your laptop, create an SSH tunnel:
$ ssh -L 8888:localhost:8888 <IP address of the EC2 server>
Now, you can browse to localhost:8888 using the correct token, and interact with the notebooks as if they were local. Within the directory, there are examples for accessing the HSDS API and plotting wind and weather data using matplotlib.
Controlling access and defraying costs
A final concern is rate limiting and access control. Although the HSDS service is scalable and relatively robust, we had a few practical concerns:
How can we protect from malicious or accidental use that may lead to high egress fees (for example, someone who attempts to repeatedly download the entire dataset from S3)?
How can we keep track of who is using the data both to document the value of the data resource and to justify the costs?
If costs become too high, can we charge for some or all API use to help cover the costs?
To approach these problems, we investigated using Amazon API Gateway and its simplified integration with the AWS Marketplace for SaaS monetization as well as third-party API proxies.
In the end, we chose to use API Umbrella due to its close involvement with http://data.gov. While AWS Marketplace is a compelling option for future datasets, the decision was made to keep this dataset entirely open, at least for now. As community use and associated costs grow, we’ll likely revisit Marketplace. Meanwhile, API Umbrella provides controls for rate limiting and API key registration out of the box and was simple to implement as a front-end proxy to HSDS. Those applications that may want to charge for API use can accomplish a similar strategy using Amazon API Gateway and AWS Marketplace.
Ongoing work and other resources
As NREL and other government research labs, municipalities, and organizations try to share data with the public, we expect many of you will face similar challenges to those we have tried to approach with the architecture described in this post. Providing large datasets is one challenge. Doing so in a way that is affordable and convenient for users is an entirely more difficult goal. Using AWS cloud-native services and the existing foundation of the HDF file format has allowed us to tackle that challenge in a meaningful way.
Dr. Caleb Phillips is a senior scientist with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Caleb comes from a background in computer science systems, applied statistics, computational modeling, and optimization. His work at NREL spans the breadth of renewable energy technologies and focuses on applying modern data science techniques to data problems at scale.
Dr. Caroline Draxl is a senior scientist at NREL. She supports the research and modeling activities of the US Department of Energy from mesoscale to wind plant scale. Caroline uses mesoscale models to research wind resources in various countries, and participates in on- and offshore boundary layer research and in the coupling of the mesoscale flow features (kilometer scale) to the microscale (tens of meters). She holds a M.S. degree in Meteorology and Geophysics from the University of Innsbruck, Austria, and a PhD in Meteorology from the Technical University of Denmark.
John Readey has been a Senior Architect at The HDF Group since he joined in June 2014. His interests include web services related to HDF, applications that support the use of HDF, and data visualization. Before joining The HDF Group, John worked at Amazon.com from 2006–2014, where he developed service-based systems for eCommerce and AWS.
Jordan Perr-Sauer is an RPP intern with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Jordan hopes to use his professional background in software engineering and his academic training in applied mathematics to solve the challenging problems facing America and the world.
The AWS Community Heroes program helps shine a spotlight on some of the innovative work being done by rockstar AWS developers around the globe. Marrying cloud expertise with a passion for community building and education, these Heroes share their time and knowledge across social media and in-person events. Heroes also actively help drive content at Meetups, workshops, and conferences.
This March, we have five Heroes that we’re happy to welcome to our network of cloud innovators:
Peter Sbarski is VP of Engineering at A Cloud Guru and the organizer of Serverlessconf, the world’s first conference dedicated entirely to serverless architectures and technologies. His work at A Cloud Guru allows him to work with, talk and write about serverless architectures, cloud computing, and AWS. He has written a book called Serverless Architectures on AWS and is currently collaborating on another book called Serverless Design Patterns with Tim Wagner and Yochay Kiriaty.
Peter is always happy to talk about cloud computing and AWS, and can be found at conferences and meetups throughout the year. He helps to organize Serverless Meetups in Melbourne and Sydney in Australia, and is always keen to share his experience working on interesting and innovative cloud projects.
Peter’s passions include serverless technologies, event-driven programming, back end architecture, microservices, and orchestration of systems. Peter holds a PhD in Computer Science from Monash University, Australia and can be followed on Twitter, LinkedIn, Medium, and GitHub.
Michael Wittig, in close collaboration with his brother Andreas Wittig, is actively creating AWS-related content. Their book Amazon Web Services in Action (Manning) introduces AWS with a strong focus on automation. Andreas and Michael run the blog cloudonaut.io, where they share their knowledge about AWS with the community. The Wittig brothers have also published a number of video courses with O’Reilly, Manning, Pluralsight, and A Cloud Guru. You can also find them speaking at conferences and user groups in Europe. Both brothers co-organize the AWS user group in Stuttgart.
Fernando is an experienced Infrastructure Solutions Leader, holding five AWS Certifications, with extensive IT Architecture and Management experience in a variety of market sectors. Working as a Cloud Architect Consultant in the United Kingdom since 2014, Fernando built an online community for Hispanic speakers worldwide.
Fernando founded a LinkedIn group, a Slack community, and a YouTube channel, all of them named “AWS en Español”, and started to run a monthly webinar via YouTube streaming where different leaders discuss aspects of and challenges around the AWS Cloud.
During the last 18 months he’s been helping to run and coach AWS User Group leaders across LATAM and Spain, and 10 new User Groups were founded during this time.
Feel free to follow Fernando on Twitter, connect with him on LinkedIn, or join the ever-growing Hispanic Community via Slack, LinkedIn or YouTube.
Anders is a consultant and cloud evangelist at Webstep AS in Norway. He finished his degree in Computer Science at the Norwegian Institute of Technology at about the same time the Internet emerged as a public service. Since then he has been an IT consultant and a passionate advocate of knowledge-sharing.
He architected and implemented his first customer solution on AWS back in 2010, and is essential in building Webstep’s core cloud team. Anders applies his broad expert knowledge across all layers of the organizational stack. He engages with developers on technology and architectures and with top management where he advises about cloud strategies and new business models.
Anders enjoys helping people increase their understanding of AWS and cloud in general, and holds several AWS certifications. He co-founded and co-organizes the AWS User Groups in the largest cities in Norway (Oslo, Bergen, Trondheim and Stavanger), and also uses any opportunity to engage in events related to AWS and cloud wherever he is.
You can follow him on Twitter or connect with him on LinkedIn.
To learn more about the AWS Community Heroes Program and how to get involved with your local AWS community, click here.
In Part 2, we take a deeper look at the differences between HDDs and SSDs, how both HDD and SSD technologies are evolving, and how Backblaze takes advantage of SSDs in our operations and data centers.
The first time you booted a computer or opened an app on a computer with a solid-state drive (SSD), you likely were delighted. I know I was. I loved the speed, the silence, and just the wow factor of this new technology that seemed better in just about every way compared to hard drives.
I was ready to fully embrace the promise of SSDs. And I have. My desktop uses an SSD for booting, applications, and for working files. My laptop has a single 512GB SSD. I still use hard drives, however. The second, third, and fourth drives in my desktop computer are HDDs. The external USB RAID I use for local backup uses HDDs in four drive bays. When my laptop is at my desk it is attached to a 1.5TB USB backup hard drive. HDDs still have a place in my personal computing environment, as they likely do in yours.
Nothing stays the same for long, however, especially in the fast-changing world of computing, so we are certain to see new storage technologies coming to the fore, perhaps with even more wow factor.
Before we get to what’s coming, let’s review the primary differences between HDDs and SSDs in a little more detail in the following table.
A Comparison of HDDs to SSDs
| | HDD | SSD |
|---|---|---|
| Power Draw / Battery Life | More power draw, averages 6–7 watts and therefore uses more battery | Less power draw, averages 2–3 watts, resulting in a 30+ minute battery boost |
| Cost | Only around $0.03 per gigabyte, very cheap (buying a 4TB model) | Expensive, roughly $0.20–$0.30 per gigabyte (based on buying a 1TB drive) |
| Capacity | Typically around 500GB, and 2TB maximum for notebook-size drives; 10TB max for desktops | Typically not larger than 1TB for notebook-size drives; 4TB for desktops |
| Operating System Boot Time | Around 30–40 seconds average bootup time | Around 8–13 seconds average bootup time |
| Noise | Audible clicks and spinning platters can be heard | No moving parts, hence no sound |
| Vibration | The spinning of the platters can sometimes result in vibration | No vibration, as there are no moving parts |
| Heat Produced | Doesn’t produce much heat, but measurably more than an SSD due to moving parts and higher power draw | Lower power draw and no moving parts, so little heat is produced |
| Failure Rate | Mean time between failures of 1.5 million hours | Mean time between failures of 2.0 million hours |
| File Copy / Write Speed | Anywhere from 50–120 MB/s | Generally above 200 MB/s, and up to 550 MB/s for cutting-edge drives |
| Encryption | Full Disk Encryption (FDE) supported on some models | Full Disk Encryption (FDE) supported on some models |
The HDD has an amazing history of improvement and innovation. From its inception in 1956, the hard drive has decreased in size 57,000 times, increased storage 1 million times, and decreased cost 2,000 times. In other words, the cost per gigabyte has decreased by 2 billion times (a 2,000-fold drop in cost multiplied by a million-fold increase in capacity) in about 60 years.
Hard drive manufacturers made these dramatic advances by reducing the size, and consequently the seek times, of platters while increasing their density, improving disk reading technologies, adding multiple arms and read/write heads, developing better bus interfaces, and increasing spin speed and reducing friction with techniques such as filling drives with helium.
In 2005, the drive industry introduced perpendicular recording technology to replace the older longitudinal recording technology, which enabled areal density to reach more than 100 gigabits per square inch. Longitudinal recording aligns data bits horizontally in relation to the drive’s spinning platter, parallel to the surface of the disk, while perpendicular recording aligns bits vertically, perpendicular to the disk surface.
Perpendicular Recording
Other technologies such as bit patterned media recording (BPMR) are contributing to increased densities, as well. Introduced by Toshiba in 2010, BPMR is a proposed hard disk drive technology that could succeed perpendicular recording. It records data using nanolithography in magnetic islands, with one bit per island. This contrasts with current disk drive technology where each bit is stored in 20 to 30 magnetic grains within a continuous magnetic film.
Shingled magnetic recording (SMR) is a magnetic storage data recording technology used in HDDs to increase storage density and overall per-drive storage capacity. Shingled recording writes new tracks that overlap part of the previously written magnetic track, leaving the previous track narrower and allowing for higher track density. Thus, the tracks partially overlap similar to roof shingles. This approach was selected because physical limitations prevent recording magnetic heads from having the same width as reading heads, leaving recording heads wider.
Track Spacing Enabled by SMR Technology (Seagate)
To increase the amount of data stored on a drive’s platter requires cramming the magnetic regions closer together, which means the grains need to be smaller so they won’t interfere with each other. In 2002, Seagate successfully demoed heat-assisted magnetic recording (HAMR). HAMR records magnetically using laser-thermal assistance that ultimately could lead to a 20 terabyte drive by 2019. (See our post on HAMR by Seagate’s CTO Mark Re, What is HAMR and How Does It Enable the High-Capacity Needs of the Future?)
Western Digital claims that its competing microwave-assisted magnetic recording (MAMR) could enable drive capacity to increase up to 40TB by the year 2025. Some industry watchers and drive manufacturers predict increases in areal density from today’s 0.86 terabits per square inch (Tbpsi) to 10 Tbpsi by 2025, resulting in as much as 100TB of drive capacity in the next decade.
The future certainly does look bright for HDDs continuing to be with us for a while.
The Outlook for SSDs
SSDs are also in for some amazing advances.
Interfaces
SATA (Serial Advanced Technology Attachment) is the common hardware interface that allows the transfer of data to and from HDDs and SSDs. SATA SSDs are fine for the majority of home users, as they are generally cheaper, operate at a lower speed, and have a lower write life.
While fine for everyday computing, in a RAID (Redundant Array of Independent Disks), server array, or data center environment, often a better alternative has been to use ‘SAS’ drives, which stands for Serial Attached SCSI. This is another type of interface that, again, is usable either with HDDs or SSDs. ‘SCSI’ stands for Small Computer System Interface (which is why SAS drives are sometimes referred to as ‘scuzzy’ drives). SAS has increased IOPS (Input/Output Operations Per Second) over SATA, meaning it has the ability to read and write data faster. This has made SAS an optimal choice for systems that require high performance and availability.
On an enterprise level, SAS prevails over SATA, as SAS supports over-provisioning to prolong write life and has been specifically designed to run in environments that require constant drive usage.
Enter PCIe
PCIe (Peripheral Component Interconnect Express) is a high speed serial computer expansion bus standard that supports drastically higher data transfer rates over SATA or SAS interfaces due to the fact that there are more channels available for the flow of data.
Many leading drive manufacturers have been adopting PCIe as the standard for new home and enterprise storage and some peripherals. For example, you’ll see that the latest Apple MacBooks ship with PCIe-based flash storage, something that Apple has been adopting over the years with their consumer devices.
PCIe can also be used within data centers for RAID systems and to create high-speed networking capabilities, increasing overall performance and supporting the newer and higher capacity HDDs.
Storage Density
As we covered in Part 1, SSDs are based on a type of non-volatile flash memory called NAND. The latest trend in NAND flash is quad-level-cell (QLC) NAND. NAND is subdivided into types based on how many bits of data are stored in each physical memory cell. SLC (single-level-cell) stores one bit, MLC (multi-level-cell) stores two, TLC (triple-level-cell) stores three, and QLC (quad-level-cell) stores four.
QLC NAND
Storing more data per cell makes NAND denser, but it also makes the memory slower: it takes more time to read and write data when so much additional information is packed into the same cell of memory (a QLC cell must distinguish 16 charge states, versus just two for SLC).
QLC NAND memory is built on older process nodes with larger cells that can more easily store multiple bits of data. The new NAND tech has higher overall reliability, with a higher total number of program/erase (P/E) cycles.
QLC NAND wafer from which individual microcircuits are made
QLC NAND promises to produce faster and denser SSDs. The effect on price also could be dramatic. Tom’s Hardware is predicting that the advent of QLC could push 512GB SSDs down to $100.
Beyond HDDs and SSDs
There is significant work being done that is pushing the bounds of data storage beyond what is possible with spinning platters and microcircuits. A team at Harvard University has used genome-editing to encode video into live bacteria.
Scientists from the University of Southampton’s Optoelectronics Research Centre (ORC) have developed the recording and retrieval processes of five dimensional (5D) digital data by femtosecond laser writing. These discs can hold 360 TB of data and are theoretically stable for up to 14 billion years, thanks to ‘5D data storage’ inscribed by laser nanostructuring in quartz silica glass. Space-X’s Falcon Heavy recently carried this technology into space.
We’ve already discussed the benefits of SSDs. Those that apply particularly to the data center are:
Low power consumption — When you are running lots of drives, power usage adds up. Anywhere you can conserve power is a win.
Speed — Data can be accessed faster, which is especially beneficial for caching databases and other data affecting overall application or system performance.
Lack of vibration — Reducing vibration improves reliability, thereby reducing problems and maintenance. Racks housing SSDs also don’t need the size and structural rigidity required for racks housing HDDs.
Low noise — Data centers will become quieter as more SSDs are deployed.
Low heat production — The less heat generated, the less cooling and power required in the data center.
Faster booting — The faster a storage chassis can get online or a critical server can be rebooted after maintenance or a problem, the better.
Greater areal density — Data centers will be able to store more data in less space, which increases efficiency in all areas (power, cooling, etc.).
The top drive manufacturers say that they expect HDDs and SSDs to coexist for the foreseeable future in all areas — home, business, and data center, with customers choosing which technology and product will best fit their application.
How Backblaze Uses SSDs
In just about all respects, SSDs are superior to HDDs. So why don’t we replace the 100,000+ hard drives we have spinning in our data centers with SSDs?
Our operations team takes advantage of the benefits and savings of SSDs wherever they can, using them in every place that’s appropriate other than primary data storage. They’re particularly useful in our caching and restore layers, where we use them strategically to speed up data transfers. SSDs also speed up access to B2 Cloud Storage metadata. Our operations team is considering moving to SSDs to boot our Storage Pods, where the cost of a small SSD is competitive with hard drives, and their other attributes (small size, lack of vibration, speed, low power consumption, reliability) are all pluses.
A Future with Both HDDs and SSDs
IDC predicts that total data created will grow from approximately 33 zettabytes in 2018 to about 160 zettabytes in 2025. (See What’s a Byte? if you’d like help understanding the size of a zettabyte.)
Annual Size of the Global Datasphere
Over 90% of enterprise drive shipments today are HDD, according to IDC. By 2025, SSDs will comprise almost 20% of drive shipments. SSDs will gain share, but total growth in data created will result in massive sales of both HDDs and SSDs.
Enterprise Byte Shipments: HDD and SSD
As both HDD and SSD sales grow, so does the capacity of both technologies. Given the benefits of SSDs in many applications, we’re likely going to see SSDs replacing HDDs in all but the highest capacity uses.
It’s clear that there are merits to both HDDs and SSDs. If you’re not running a data center, and don’t have more than one or two terabytes of data to store on your home or business computer, your first choice likely should be an SSD. They provide a noticeable improvement in performance during boot-up and data transfer, and are smaller, quieter, and more reliable as well. Save the HDDs for secondary drives, NAS, RAID, and local backup devices in your system.
Perhaps some day we’ll look back at the days of spinning platters with the same nostalgia we now feel for stereo LPs, and some of us will have an HDD paperweight on our floating anti-gravity desk as a conversation piece. Until the day that SSDs’ performance, capacity, and, finally, price drive the last HDD out of the home and the data center, we can expect to live in a world that contains both solid state SSDs and magnetic platter HDDs, and as users we will reap the benefits of both technologies.
Don’t miss future posts on HDDs, SSDs, and other topics, including hard drive stats, cloud storage, and tips and tricks for backing up to the cloud.
Amazon EMR empowers many customers to build big data processing applications quickly and cost-effectively, using popular distributed frameworks such as Apache Spark, Apache HBase, Presto, and Apache Flink. For organizations that are crafting their analytical applications on Amazon EMR, there is a growing need to keep their data assets organized in an automated fashion. Because datasets tend to grow exponentially, using cataloging tools is essential to automating data discovery and organizing data assets.
AWS Glue Data Catalog provides this essential capability, allowing you to automatically discover and catalog metadata about your data stores in a central repository. Since Amazon EMR 5.8.0, customers have been using the AWS Glue Data Catalog as a metadata store for Apache Hive and Spark SQL applications that are running on Amazon EMR. Starting with Amazon EMR 5.10.0, you can catalog datasets using AWS Glue and run queries using Presto on Amazon EMR from the Hue (Hadoop User Experience) and Apache Zeppelin UIs.
You might wonder what scenarios warrant using Presto running on Amazon EMR and when to choose Amazon Athena (which uses Presto as the query engine under the hood). Both are excellent tools for querying massive amounts of data, but they address different needs and use cases.
Amazon Athena provides the easiest way to run interactive queries for data in Amazon S3 without needing to set up or manage any servers. Presto running on Amazon EMR gives you much more flexibility in how you configure and run your queries, providing the ability to federate to other data sources if needed. For example, you might have a use case that requires LDAP authentication for clients such as the Presto CLI or JDBC/ODBC drivers. Or you might have a workflow where you need to join data between different systems like MySQL/Amazon Redshift/Apache Cassandra and Hive. In these examples, Presto running on Amazon EMR is the right tool to use because it can be configured to enable LDAP authentication in addition to the desired database connectors at cluster launch.
Now, let’s look at how metadata management for Presto works with AWS Glue.
Using an AWS Glue crawler to discover datasets
The AWS Glue Data Catalog is a reference to the location, schema, and runtime metrics of your datasets. To create this reference metadata, AWS Glue needs to crawl your datasets. In this exercise, we use an AWS Glue crawler to populate tables in the Data Catalog for the NYC taxi rides dataset.
The following are the steps for adding a crawler (a scripted alternative using the AWS SDK is sketched after the list):
1. Sign in to the AWS Management Console and open the AWS Glue console. In the navigation pane, choose Crawlers, and then choose Add crawler.
2. On the Add a data store page, specify the location of the NYC taxi rides dataset.
3. In the next step, choose an existing IAM role if one is available, or create a new role. Then choose Next.
4. On the scheduling page, for Frequency, choose Run on demand.
5. On the Configure the crawler’s output page, choose Add database and specify blog-db as the database name. (You can specify a name of your choice, but be sure to use the same database name when running queries.)
6. Follow the remaining steps, accepting the default values, to create the crawler.
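If you prefer to script the crawler setup rather than click through the console, the same steps can be expressed with the AWS SDK. The following is a minimal sketch using boto3; the crawler name, IAM role, region, and S3 path are placeholders you would replace with your own values.

# A minimal boto3 sketch of the crawler setup above. The crawler name, IAM
# role, region, and S3 path are illustrative placeholders, not values taken
# from this post.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="nyc-taxi-crawler",                      # hypothetical crawler name
    Role="AWSGlueServiceRole-blog",               # an existing IAM role for AWS Glue
    DatabaseName="blog-db",                       # database the discovered tables land in
    Targets={"S3Targets": [{"Path": "s3://your-bucket/nyc-taxi-rides/"}]},
)

# Run on demand (no schedule was set above), then check the crawler state.
glue.start_crawler(Name="nyc-taxi-crawler")
state = glue.get_crawler(Name="nyc-taxi-crawler")["Crawler"]["State"]
print("Crawler state:", state)  # READY once the crawl completes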
When the crawler reaches the Ready state, navigate to Databases and choose blog-db from the list (or search for it by specifying it as a filter, as shown in the following screenshot). Then choose Tables. You should see the three tables created by the crawler, as follows.
(Optional) The discovered data is classified as CSV. You can convert it into Parquet format for better response times on your queries; one way to do this with Spark is sketched below.
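The following Spark job is a minimal sketch of that conversion. It assumes the cluster’s Spark is also configured to use the AWS Glue Data Catalog as its metastore; the output S3 path and the taxi_parquet table name are placeholders.

# A minimal PySpark sketch for the optional CSV-to-Parquet conversion.
# Assumes Spark on the cluster is configured to use the AWS Glue Data Catalog;
# the output S3 path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nyc-taxi-csv-to-parquet")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the CSV-backed table that the crawler registered.
taxi_csv = spark.table("`blog-db`.taxi")

# Write a Parquet copy and register it as blog-db.taxi_parquet for Presto to query.
(taxi_csv.write
    .mode("overwrite")
    .format("parquet")
    .option("path", "s3://your-bucket/nyc-taxi-parquet/")
    .saveAsTable("`blog-db`.taxi_parquet"))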
Launching an Amazon EMR cluster
With the dataset discovered and organized, we can now walk through different options for launching Presto on an Amazon EMR cluster to use the AWS Glue Data Catalog.
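If you would rather launch the cluster programmatically than through the console, a cluster like this can be created through the EMR API. The sketch below uses boto3’s run_job_flow; the key pair, instance types, and roles are placeholders, and the presto-connector-hive classification shown is, to the best of our knowledge, the EMR configuration that points Presto at the Glue Data Catalog — check the EMR documentation for your release before relying on it.

# A hedged boto3 sketch of launching an EMR cluster with Presto, Hive, and Hue
# that uses the AWS Glue Data Catalog. Key pair, instance types, and roles are
# placeholders; verify the configuration classification for your EMR release.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="presto-glue-demo",
    ReleaseLabel="emr-5.10.0",
    Applications=[{"Name": "Presto"}, {"Name": "Hive"}, {"Name": "Hue"}],
    Configurations=[
        {
            # Assumed classification for pointing Presto's Hive connector at Glue.
            "Classification": "presto-connector-hive",
            "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
        }
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "your-key-pair",            # placeholder key pair
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)

print("Cluster ID:", response["JobFlowId"])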
After you’ve set up the Amazon EMR cluster with Presto, the AWS Glue Data Catalog is available through a default “hive” catalog. To change between the Hive and Glue metastores, you have to manually update hive.properties and restart the Presto server. Connect to the master node on your EMR cluster using SSH, and run the Presto CLI to start running queries interactively.
$ presto-cli --catalog hive
Begin with a simple query to sample a few rows:
presto> SELECT * FROM "blog-db".taxi LIMIT 10;
The query shows a few sample rows as follows:
Query the average fare for trips at each hour of the day and for each day of the month on the Parquet version of the taxi dataset.
presto> SELECT EXTRACT (HOUR FROM pickup_datetime) AS hour, avg(fare_amount) AS average_fare FROM "blog-db".taxi_parquet GROUP BY 1 ORDER BY 1;
The following image shows the results:
More interestingly, you can compute the number of trips that gave tips in the 10 percent, 15 percent, or higher percentage range:
presto> -- Tip Percent Category
SELECT TipPrctCtgry
, COUNT (DISTINCT TripID) TripCt
FROM
(SELECT TripID
, (CASE
WHEN fare_prct < 0.7 THEN 'FL70'
WHEN fare_prct < 0.8 THEN 'FL80'
WHEN fare_prct < 0.9 THEN 'FL90'
ELSE 'FL100'
END) FarePrctCtgry
, (CASE
WHEN tip_prct < 0.1 THEN 'TL10'
WHEN tip_prct < 0.15 THEN 'TL15'
WHEN tip_prct < 0.2 THEN 'TL20'
ELSE 'TG20'
END) TipPrctCtgry
FROM
(SELECT TripID
, (fare_amount / total_amount) as fare_prct
, (extra / total_amount) as extra_prct
, (tip_amount / total_amount) as tip_prct
, (mta_tax / total_amount) as mta_taxprct
, (tolls_amount / total_amount) as tolls_prct
, (improvement_surcharge / total_amount) as imprv_suchrgprct
, total_amount
FROM
(SELECT *
, (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) as TripID
from "blog-db”.taxi_parquet
WHERE total_amount > 0
) as t
) as t
) ct
GROUP BY TipPrctCtgry;
The results are as follows:
While the preceding query is running, navigate to the web interface for Presto on Amazon EMR at http://master-public-dns-name:8889/. Here you can look into query metrics such as active worker nodes, number of rows read per second, reserved memory, and parallelism.
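The same numbers the web UI displays can also be pulled programmatically from the coordinator’s REST interface. The following is a small sketch, assuming the /v1/cluster endpoint that the Presto UI itself reads and network access to port 8889 on the master node (for example, over an SSH tunnel); field names may vary between Presto versions.

# A small sketch that reads cluster-level stats from the Presto coordinator on
# the EMR master node. Assumes the /v1/cluster endpoint used by the Presto UI
# and network access to port 8889 (e.g., via an SSH tunnel).
import json
import urllib.request

COORDINATOR = "http://master-public-dns-name:8889"   # replace with your master node

with urllib.request.urlopen(f"{COORDINATOR}/v1/cluster") as resp:
    stats = json.load(resp)

print("Active workers:  ", stats.get("activeWorkers"))
print("Running queries: ", stats.get("runningQueries"))
print("Reserved memory: ", stats.get("reservedMemory"))
print("Running drivers: ", stats.get("runningDrivers"))  # a rough measure of parallelism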
Running queries in the Presto Editor on Hue
If you installed Hue with your Amazon EMR launch, you can also run queries on Hue’s Presto Editor. On the Amazon EMR Cluster console, choose Enable Web Connection, and follow the instructions to access the web interfaces for Hue and Zeppelin.
After the web connection is enabled, choose the Hue link to open the web interface. At the login screen, if you are the administrator logging in for the first time, type a user name and password to create your Hue superuser account, and then choose Create account. Otherwise, sign in with your user name and password, or with the credentials provided by your administrator.
Choose the Presto Editor from the menu. You can run Presto queries against your tables in the AWS Glue Data Catalog.
Conclusion
Having a shared data catalog for applications on Amazon EMR alleviates a myriad of data-related challenges that organizations face today—including discovery, governance, auditability, and collaboration. In this post, we explored how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR. Go ahead, give this a try, and share your experience with us!
Radhika Ravirala is a Solutions Architect at Amazon Web Services where she helps customers craft distributed big data applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. She holds a M.S in computer science from San Jose State University.
As Backblaze continues to grow, a couple of our departments need to grow right along with it. One of the quickest-growing departments we have at Backblaze is Customer Support. We do all of our support in-house, and the team grows to accommodate our growing customer base! We have a new person joining us in Support: Lin! Let’s take a moment to learn a bit more about her, shall we?
What is your Backblaze Title? Jr. Support Technician.
Where are you originally from? Ventura, CA. It’s okay if you haven’t heard of it, it is very, very, small.
What attracted you to Backblaze? The company culture, the delightful ads on Critical Role, and how immediately genuinely friendly everyone I met was.
Where else have you worked? I previously did content management at Wish, and an awful lot of temp gigs. I did a few years at a coffee shop at the beginning of college, but my first job ever was at a JoAnn’s Fabrics.
Where did you go to school? San Francisco State University
What’s your dream job? Magical Girl!
Favorite place you’ve traveled? Tokyo, but Disneyworld is a real close second.
Favorite hobby? I spend an awful lot of time playing video games, and possibly even more making silly costumes.
Star Trek or Star Wars? Truthfully I love both. But I was raised on original series and next generation Trek.
Coke or Pepsi? Coke … definitely coke.
Favorite food? Cupcakes. Especially funfetti cupcakes.
Anything else you’d like to tell us? I discovered Sailor Moon as a child and it possibly influenced my life way too much. Like many people here I am a huge Disney fan. Anyone who spends longer than a few hours with me will probably tell you I can go on for hours about my cat (but in my defense he’s adorable and fluffy and I have the pictures to prove it).
We keep hiring folks that love Disney! It’s kind of amazing. It’s also nice to have folks in the office that can chat about the latest Critical Role episode! Welcome aboard Lin, we’ll try to get some funfetti stocked for the cupcakes that come in!
As we sail past 500 Petabytes of data stored, our Operations Department continues to grow. To that end, we’ve added a brand new member to our Operations and Engineering teams: Alex! He straddles the line between Ops and Engineering, working on both internal and external systems – making sure they run smoothly. Let’s learn a bit more about Alex, shall we?
What is your Backblaze Title? Operations Engineer.
Where are you originally from? Chicago, IL.
What attracted you to Backblaze? The company mission and overall transparency really appealed to me. It was a great opportunity to work in an environment that aligned with my core values.
What do you expect to learn while being at Backblaze? I expect to learn more about modern cloud technologies and data center deployments.
Where else have you worked? I worked at a startup called Cleversafe out of college which was later acquired by IBM.
Where did you go to school? University of Illinois at Urbana-Champaign
What’s your dream job? NHL general manager.
Favorite place you’ve traveled? The Cinque Terre. It’s a set of small towns along the Italian Riviera that has great hiking.
Favorite hobby? Playing hockey.
Of what achievement are you most proud? Graduating college with two engineering degrees.
Star Trek or Star Wars? Neither.
Favorite food? Homemade pizza.
Why do you like certain things? Life’s too short not to like certain things.
Cinque Terre is definitely one of the most beautiful places on earth, as long as you don’t visit on a foggy day! If you happen to find the perfect NHL job, we’ll understand! Oh, and thanks for bringing Cosmo to the office on occasion! Welcome aboard!
Some things don’t get the credit they deserve. For one of our engineers, Billy, the Locate My Computer feature is near and dear to his heart. It took him a while to build it, and it requires some regular updates, even after all these years. Billy loves the Locate My Computer feature, but really loves knowing how it’s helped customers over the years. One recent story made us decide to write a bit of a greatest hits post as an ode to one of our favorite features — Locate My Computer.
What is it?
Locate My Computer, as you’ll read in the stories below, came about because some of our users had their computers stolen and were trying to find a way to retrieve their devices. They realized that while some of their programs and services like Find My Mac were wiped, in some cases, Backblaze was still running in the background. That created the ability to use our software to figure out where the computer was contacting us from. After manually helping some of the individuals that wrote in, we decided to build it in as a feature. Little did we know the incredible stories it would lead to. We’ll get into that, but first, a little background on why the whole thing came about.
Identifying the Customer Need
Mat — In December 2010, we saw a tweet from @DigitalRoyalty which read: “My friend’s laptop was stolen. He tracked the thief via @Backblaze for weeks & finally identified him on Facebook & Twitter. Digital 007.” Our CEO was manning Twitter at the time and reached out for the whole story. It turns out that Mat Miller had his laptop stolen, and while he was creating some restores a few days later, he noticed a new user was created on his computer and was backing up data. He restored some of those files, saw some information that could help identify the thief, and filed a police report. Read the whole story: Digital 007 — Outwitting The Thief.
Mark — Following Mat Miller’s story, we heard from Mark Bao, an 18-year-old entrepreneur and student at Bentley University who had his laptop stolen. The laptop was stolen out of Mark’s dorm room, and the thief started using it in a variety of ways, including audition practice for Dancing with the Stars. Once Mark logged in to Backblaze and saw that there were new files being uploaded, including a dance practice video, he was able to reach out to campus police and get his laptop back. You can read more about the story in: 18 Year Old Catches Thief Using Backblaze.
After Mat and Mark’s story we thought we were onto something. In addition to those stories that had garnered some media attention, we would occasionally get requests from users that said something along the lines of, “Hey, my laptop was stolen, but I had Backblaze installed. Could you please let me know if it’s still running, and if so, what the IP address is so that I can go to the authorities?” We would help them where we could, but knew that there was probably a much more efficient method of helping individuals and businesses keep track of their computers.
Some of the Greatest Hits, and the Mafia Story
In May of 2011, we launched “Locate My Computer.” This was our way of adding a feature to our already-popular backup client that would allow users to see a rough representation of where their computer was located, and the IP address associated with its last known transmission. After speaking to law enforcement, we learned that those two things were usually enough for the authorities to subpoena an ISP and get the physical address of the last known place the computer phoned home from. From there, they could investigate and, if the device was still there, return it to its rightful owner.
Bridgette — Once the feature went live the stories got even more interesting. Almost immediately after we launched Locate My Computer, we were contacted by Bridgette, who told us of a break-in at her house. Luckily no one was home at the time, but the thief was able to get away with her iMac, DSLR, and a few other prized possessions. As soon as she reported the robbery to the police, they were able to use the Locate My Computer feature to find the thief’s location and recover her missing items. We even made a case study out of Bridgette’s experience. You can read it at: Backblaze And The Stolen iMac.
“Joe” — The crazy recovery stories didn’t end there. Shortly after Bridgette’s story, we received an email from a user (“Joe” — to protect the innocent) who was traveling to Argentina from the United States and had his laptop stolen. After he contacted the police department in Buenos Aires, and explained to them that he was using Backblaze (which the authorities thought was a computer tracking service, and in this case, we were), they were able to get the location of the computer from an ISP in Argentina. When they went to investigate, they realized that the perpetrators were foreign nationals connected to the mafia, and that in addition to a handful of stolen laptops, they were also in the possession of over $1,000,000 in counterfeit currency! Read the whole story about “Joe” and how: Backblaze Found $1 Million in Counterfeit Cash!
The Maker — After “Joe,” we thought that our part in high-profile busts was over, but we were wrong. About a year later we received word from a “maker” who told us that he was able to act as an “internet super-sleuth” and worked hard to find his stolen computer. After a Maker Faire in Detroit, the maker’s car was broken into while they were getting BBQ following a successful show. While some of the computers were locked and encrypted, others were in hibernation mode and wide open to prying eyes. After the police report was filed, the maker went to Backblaze to retrieve his lost files and remembered seeing the little Locate My Computer button. That’s when the story gets really interesting. The victim used a combination of ingenuity, Craigslist, Backblaze, and the local police department to get his computer back, and make a drug bust along the way. Head over to Makezine.com to read about how: How Tracking Down My Stolen Computer Triggered a Drug Bust.
Una — While we kept hearing praise and thanks from our customers who were able to recover their data and find their computers, a little while passed before we would hear a story that was as incredible as the ones above. In July of 2016, we received an email from Una who told us one of the most amazing stories of perseverance that we’d ever heard. With the help of Backblaze and a sympathetic constable in Australia, Una tracked her stolen computer’s journey across 6 countries. She got her computer back and we wrote up the whole story: How Una Found Her Stolen Laptop.
And the Hits Keep on Coming
The most recent story came from “J,” and we’ll share the whole thing with you because it has a really nice conclusion:
Back in September of 2017, I brought my laptop to work to finish up some administrative work before I took off for a vacation. I work in a mall where traffic [is] plenty and more specifically I work at a kiosk in the middle of the mall. This allows for a high amount of traffic passing by every few seconds. I turned my back for about a minute to put away some paperwork. At the time I didn’t notice my laptop missing. About an hour later when I was gathering my belongings for the day I noticed it was gone. I was devastated. This was a high end MacBook Pro that I just purchased. So we are not talking about a little bit of money here. This was a major investment.
Time [went] on. When I got back from my vacation I reached out to my LP (Loss Prevention) team to get images from our security to submit to the police with some thread of hope that they would find whomever stole it. December approached and I did not hear anything. I gave up hope and assumed that the laptop was scrapped. I put an iCloud lock on it and my Find My Mac feature was saying that laptop was “offline.” I just assumed that they opened it, saw it was locked, and tried to scrap it for parts.
Towards the end of January I got an email from Backblaze saying that the computer was successfully backed up. This came as a shock to me as I thought it was wiped. But I guess however they wiped it didn’t remove Backblaze from the SSD. None the less, I was very happy. I sifted through the backup and found the person’s name via the search history. Then, using the Locate my Computer feature I saw where it came online. I reached out on social media to the person in question and updated the police. I finally got ahold of the person who stated she bought it online a few weeks backs. We made arrangements and I’m happy to say that I am typing this email on my computer right now.
J finished by writing: “Not only did I want to share this story with you but also wanted to say thanks! Apple’s find my computer system failed. The police failed to find it. But Backblaze saved the day. This has been the best $5 a month I have ever spent. Not only that but I got all my stuff back. Which made the deal even better! It was like it was never gone.”
Have a Story of Your Own?
We’re more than thrilled to have helped all of these people restore their lost data using Backblaze. Recovering the actual machine using Locate My Computer though, that’s the icing on the cake. We’re proud of what we’ve been able to build here at Backblaze, and we really enjoy hearing stories from people who have used our service to successfully get back up and running, whether that meant restoring their data or recovering their actual computer.
If you have any interesting data recovery or computer recovery stories that you’d like to share with us, please email press@backblaze.com and we’ll share it with Billy and the rest of the Backblaze team. We love hearing them!
This column is from The MagPi issue 59. You can download a PDF of the full issue for free, or subscribe to receive the print edition through your letterbox or the digital edition on your tablet. All proceeds from the print and digital editions help the Raspberry Pi Foundation achieve our charitable goals.
“Hey, world!” Estefannie exclaims, a wide grin across her face as the camera begins to roll for another YouTube tutorial video. With a growing number of followers and wonderful support from her fans, Estefannie is building a solid reputation as an online maker, creating unique, fun content accessible to all.
It’s as if she was born into performing and making for an audience, but this fun, enjoyable journey to social media stardom came not from a desire to be in front of the camera, but rather as a unique approach to her own learning. While studying, Estefannie decided the best way to confirm her knowledge of a subject was to create an educational video explaining it. If she could teach a topic successfully, she knew she’d retained the information. And so her YouTube channel, Estefannie Explains It All, came into being.
Her first videos featured pages of notes with voice-over explanations of data structure and algorithm analysis. Then she moved in front of the camera, and expanded her skills in the process.
But YouTube isn’t her only outlet. With nearly 50000 followers, Estefannie’s Instagram game is strong, adding to an increasing number of female coders taking to the platform. Across her Instagram grid, you’ll find insights into her daily routine, from programming on location for work to behind-the-scenes troubleshooting as she begins to create another tutorial video. It’s hard work, with content creation for both Instagram and YouTube forever on her mind as she continues to work and progress successfully as a software engineer.
As a thank you to her Instagram fans for helping her reach 10000 followers, Estefannie created a free game for Android and iOS called Gravitris — imagine Tetris with balance issues!
Estefannie was born and raised in Mexico, with ambitions to become a graphic designer and animator. However, a documentary on coding at Pixar, and the beauty of Merida’s hair in Brave, opened her mind to the opportunities of software engineering in animation. She altered her career path, moved to the United States, and switched to a Computer Science course.
With a constant desire to make and to learn, Estefannie combines her software engineering profession with her hobby to create fun, exciting content for YouTube.
While studying, Estefannie started a Computer Science Girls Club at the University of Houston, Texas, and she found herself eager to put more time and effort into the movement to increase the percentage of women in the industry. The club was a success, and still is to this day. While Estefannie has handed over the reins, she’s still very involved in the cause.
Through her YouTube videos, Estefannie continues the theme of inclusion, with every project offering a warm sense of approachability for all, regardless of age, gender, or skill. From exploring Scratch and Makey Makey with her young niece and nephew to creating her own Disney ‘Made with Magic’ backpack for a trip to Disney World, Florida, Estefannie’s videos are essentially a documentary of her own learning process, produced so viewers can learn with her — and learn from her mistakes — to create their own tech wonders.
Estefannie’s automated gingerbread house project was a labour of love, with electronics, wires, and candy strewn across both her living room and kitchen for weeks before completion. While she was already a skilled programmer, the world of physical digital making was still fairly new to her. Having ditched her hot glue gun in favour of a soldering iron in a previous video, she continued to experiment and try out new, interesting techniques that are now second nature to many members of the maker community. With the gingerbread house, Estefannie was able to research and apply techniques such as light controls, servos, and app making, although the latter was already firmly within her skill set. The result? A fun video of ups and downs, and a wonderful, festive treat. She even gave her holiday home its own solar panel!
Instagram post from Estefannie Explains It All (@estefanniegg): “A DAY AT RASPBERRY PI TOWERS!! LINK IN BIO @raspberrypifoundation”
And that’s just the beginning of her adventures with Pi…but we won’t spoil her future plans by telling you what’s coming next. Sorry! However, since this article was written last year, Estefannie has released a few more Pi-based project videos, plus some awesome interviews and live-streams with other members of the maker community such as Simone Giertz. She even made us an awesome video for our Raspberry Pi YouTube channel! So be sure to check out her latest releases.
Instagram post from Estefannie Explains It All (@estefanniegg): “Best day yet!! I got to hangout, play Jenga with a huge arm robot, and have afternoon tea with…”
While many wonderful maker videos show off a project without much explanation, or expect a certain level of skill from viewers hoping to recreate the project, Estefannie’s videos exist almost within their own category. We can’t wait to see where Estefannie Explains It All goes next!
Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit
By Kuhu Shukla
This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series.
As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in and the new release that we tested out internally on our many thousand strong cluster is looking good. Today I am looking at a new stack trace from a different Apache project process and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1.
Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story on how I found myself in an epic journey of migrating all of Yahoo jobs from Apache MapReduce to Apache Tez, a then-new DAG based execution engine.
Oath’s grid infrastructure is through and through driven by Apache technologies, be it storage through HDFS, resource management through YARN, job execution frameworks with Tez, or user interface engines such as Hive, Hue, Pig, Sqoop, Spark, and Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs using the polymorphic technologies hosted, developed, and maintained by the Apache community.
On the third day of my job at Yahoo in 2015, I received a YouTube link on An Introduction to Apache Tez. I watched it carefully, trying to keep up with all the questions I had, and recognized a few names from my academic readings of YARN ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions and being ever more careful with whitespaces in my patches. One thing was clear – Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community, nearly 80-90% of the Yahoo jobs were running with Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect, and thankfully contributing to Tez became a goal in my next quarter.
The next sprint planning meeting ended with me getting my first major Tez assignment – progress reporting. The progress reporting in Tez was non-existent – “Just needs an API fix,” I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many.
I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. I started attending the bi-weekly Tez community sync-up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open parts of the code are the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, down to which data structure we should pick and what a future user of Tez would take from it. In between the usual status updates, extensive knowledge transfers were made.
Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me.
We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousands of jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were tracked now through an umbrella JIRA TEZ-3334 and the to-do list was long. I picked a few JIRAs and as I started reading through I realized, this is all new code I get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, the team took walks post lunch and discussed how to go about defining the API. Countless hours were spent debugging hangs while fetching data and looking at stack traces and Wireshark captures from our test runs. Six months in and we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration with high fives as we continued to address review comments and fixing big and small issues with this evolving feature.
As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig community and cross Apache product community interaction made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes.
In 2018 as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. As an astronomy aficionado, going from a newbie Apache contributor to a newbie Apache committer was very much like looking through my telescope - it has endless possibilities and challenges you to be your best.
About the Author:
Kuhu Shukla is a software engineer at Oath and earned her Master’s in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN, and HDFS with a lot of talented Apache PMCs and Committers in Champaign, Illinois. A recent Apache Tez Committer herself, she continues to contribute to YARN and HDFS, and spoke at the 2017 DataWorks Hadoop Summit on “Tez Shuffle Handler: Shuffling At Scale With Apache Hadoop”. Prior to that she worked on Juniper Networks’ router and switch configuration APIs. She likes to participate in open source conferences and women in tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking, and peering through her Dobsonian telescope.
This column is from The MagPi issue 58. You can download a PDF of the full issue for free, or subscribe to receive the print edition through your letterbox or the digital edition on your tablet. All proceeds from the print and digital editions help the Raspberry Pi Foundation achieve our charitable goals.
Dr Lucy Rogers calls herself a Transformer. “I transform simple electronics into cool gadgets, I transform science into plain English, I transform problems into opportunities. I am also a catalyst. I am interested in everything around me, and can often see ways of putting two ideas from very different fields together into one package. If I cannot do this myself, I connect the people who can.”
Among many other projects, Dr Lucy Rogers currently focuses much of her attention on reducing the damage from space debris
It’s a pretty wide range of interests and skills for sure. But it only takes a brief look at Lucy’s résumé to realise that she means it. When she says she’s interested in everything around her, this interest reaches from electronics to engineering, wearable tech, space, robotics, and robotic dinosaurs. And she can be seen talking about all of these things across various companies’ social media, such as IBM, websites including the Women’s Engineering Society, and books, including her own.
With her bright LED boots, Lucy was one of the wonderful Pi community members invited to join us and HRH The Duke of York at St James’s Palace just over a year ago
When not attending conferences as guest speaker, tinkering with electronics, or creating engaging IoT tutorials, she can be found retrofitting Raspberry Pis into the aforementioned robotic dinosaurs at Blackgang Chine Land of Imagination, writing, and judging battling bots for the BBC’s Robot Wars.
First broadcast in the UK between 1998 and 2004, Robot Wars was revived in 2016 with a new look and new judges, including Dr Lucy Rogers. Competitors battle their home-brew robots, and Lucy, together with the other two judges, awards victories among the carnage of robotic remains
Lucy graduated from Lancaster University with a degree in Mechanical Engineering. After that, she spent seven years at Rolls-Royce Industrial Power Group as a graduate trainee before becoming a chartered engineer and earning her PhD in bubbles.
Bubbles?
“Foam formation in low‑expansion fire-fighting equipment. I investigated the equipment to determine how the bubbles were formed,” she explains. Obviously. Bubbles!
Lucy graduated from the Singularity University Graduate Studies Program in 2011, focusing on how robotics, nanotech, medicine, and various technologies can tackle the challenges facing the world
She then went on to become a fellow of the Royal Astronomical Society (RAS) in 2005 and, later, a fellow of both the Institution of Mechanical Engineers (IMechE) and British Interplanetary Society. As a member of the Association of British Science Writers, Lucy wrote It’s ONLY Rocket Science: an Introduction in Plain English.
In It’s Only Rocket Science: An Introduction in Plain English Lucy explains that ‘hard to understand’ isn’t the same as ‘impossible to understand’, and takes her readers through the journey of building a rocket, leaving Earth, and travelling the cosmos
As a standout member of the industry, and all-round fun person to be around, Lucy has quickly established herself as a valued member of the Pi community.
In 2014, with the help of Neil Ford and Andy Stanford-Clark, Lucy worked with the UK’s oldest amusement park, Blackgang Chine Land of Imagination, on the Isle of Wight, with the aim of updating its animatronic dinosaurs. The original Blackgang Chine dinosaurs had a limited range of behaviour: able to roar, move their heads, and stomp a foot in a somewhat repetitive action.
When she contacted Raspberry Pi back in the November of that same year, the team were working on more creative, varied behaviours, giving each dinosaur a new Raspberry Pi-sized brain. This later evolved into a very successful dino-hacking Raspberry Jam.
Lucy, Neil Ford, and Andy Stanford-Clark used several Raspberry Pis and Node-RED to visualise flows of events when updating the robotic dinosaurs at Blackgang Chine. They went on to create the successful WightPi Raspberry Jam event, where visitors could join in with the unique hacking opportunity.
Given her love for tinkering with tech, and a love for stand-up comedy that can be uncovered via a quick YouTube search, it’s no wonder that Lucy was asked to help judge the first round of the ‘Make us laugh’ Pioneers challenge for Raspberry Pi. Alongside comedian Bec Hill, Code Club UK director Maria Quevedo, and the face of the first challenge, Owen Daughtery, Lucy lent her expertise to help name winners in the various categories of the teens event, and offered her support to future Pioneers.