Meet Callum Fawcett, who shares his journey from tinkering with the first Raspberry Pi while he was at school, to a Master’s degree in computer science and a real-life job in programming. We also get to see some of the awesome projects he’s made along the way.
I first decided to get a Raspberry Pi at the age of 14. I had already started programming a little bit before and found that I really enjoyed the language Python. At the time the first Raspberry Pi came out, my History teacher told us about them and how they would be a great device to use to learn programming. I decided to ask for one to help me learn more. I didn’t really know what I would use it for or how it would even work, but after a little bit of help at the start, I quickly began making small programs in Python. I remember some of my first programs being very simple dictionary-type programs in which I would match English words to German to help with my German homework.
Learning Linux, C++, and Python
Most of my learning was done through two sources. I learnt Linux and how the terminal worked using online resources such as Stack Overflow. I would have a problem that I needed to solve, look up solutions online, and try out commands that I found. This was perhaps the hardest part of learning how to use a Raspberry Pi, as it was something I had never done before, but it really helped me in later years when I would use Linux more than Windows. For learning programming, I preferred to use books. I had a book for C++ and a book for Python that I would work through. These were game-based books, so many of the fun projects that I did were simple text-based games where you typed in responses to questions.
A family robotics project
The first robot Callum made using a Raspberry Pi
By far the coolest project I did with the Raspberry Pi was to build a small robot (shown above). This was a joint project between myself and my dad. He sorted out the electronics and I programmed the robot. It was a great opportunity to learn about robotics and refine my programming skills. By the end, the robot was capable of moving around by itself, driving into objects, and then reversing and trying a new direction. It was almost like an unintelligent Roomba that couldn’t hoover, but I spent many hours improving small bits and pieces to make it as easy to use as possible. My one wish that I never managed to achieve with my robot was allowing it to map out its surroundings. This was a very ambitious project at the time, since I was still quite inexperienced in programming. The biggest problem with this was calibrating the robot’s turning circle, which was never consistent so it was very hard to have the robot know where in the room it was.
Sense HAT maze game
Another fun project that I worked on used the Sense HAT developed for the Astro Pi computers for use on the International Space Station. Using this, I was able to make a memory maze game (shown below), in which a player is shown a maze for several seconds and then has to navigate that maze from memory by shaking the device. This was my first introduction to using more interactive types of input, and this eventually led to my final-year project, which used these interesting interactions to develop another way of teaching.
Learning programming without formal lessons
I have now just finished my Master’s degree in computer science at the University of Bristol. Before going to university, I had no experience of being taught programming in a formal environment. It was not a taught subject at my secondary school or sixth form. I wanted to get more people at my school interested in this area of study though, which I did by running a coding club for people. I would help others debug their code and discuss interesting problems with them. The reason that I chose to study computer science is largely because of my experiences with Raspberry Pi and other programming I did in my own time during my teenage years. I likely would have studied history if it weren’t for the programming I had done by myself making robots and other games.
Raspberry Pi has continued to play a part in my degree and extra-curricular activities; I used them in two large projects during my time at university and used a similar device in my final project. My robot experience also helped me to enter my university’s ‘Robot Wars’ competition which, though we never won, was a lot of fun.
A tool for learning and a device for industry
Having a Raspberry Pi is always useful during a hackathon, because it’s such a versatile component. Tech like Raspberry Pi will always be useful for beginners to learn the basics of programming and electronics, but these computers are also becoming more and more useful for people with more experience to make fun and useful projects. I could see tech like Raspberry Pi being used in the future to help quickly prototype many types of electronic devices and, as they become more powerful, even being used as an affordable way of controlling many types of robots, which will become more common in the future.
Our guest blogger Callum
Now I am going on to work on programming robot control systems at Ocado Technology. My experiences of robot building during my years before university played a large part in this decision. Already, robots are becoming a huge part of society, and I think they are only going to become more prominent in the future. Automation through robots and artificial intelligence will become one of the most important tools for humanity during the 21st century, and I look forward to being a part of that process. If it weren’t for learning through Raspberry Pi, I certainly wouldn’t be in this position.
Cheers for your story, Callum! Has tinkering with our tiny computer inspired your educational or professional choices? Let us know in the comments below.
Introducing the new Isaac Computer Science online learning platform and calendar of free events for students and teachers. Be the first to know about new features and content on the platform: Twitter – ncce.io/ytqstw Instagram – ncce.io/ytqsig Facebook – ncce.io/ytqsfb If you are a teacher, you may also be interested in our free online training courses for GCSE Computer Science teachers.
The project is a collaboration between the Raspberry Pi Foundation and the University of Cambridge, and is funded by the Department for Education’s National Centre for Computing Education programme.
Isaac Computer Science
Isaac Computer Science gives you access to a huge range of online learning materials for the classroom, homework, and revision — all for free.
The platform’s resources are mapped to the A level specifications in England (including the AQA and OCR exam boards). You’ll be able to set assignments for your students, have the platform mark them for you, and be confident that the content is relevant and high quality. We are confident that this will save you time in planning lessons and setting homework.
“Computer Science is a relatively small subject area and teachers across the country often work alone without the support of colleagues. Isaac Computer Science will build a teaching and learning community to support teachers at all levels and will offer invaluable support to A level students in their learning journey. As an experienced teacher, I am very excited to have the opportunity to work on this project.” – Diane Dowling, Isaac Computer Science Learning Manager and former teacher
And that’s not all! To further support you, we are also running free student workshops and teacher CPD events at universities and schools around England. Tickets for the events are available to book through the Isaac Computer Science website.
“Isaac Computer Science helped equip me with the skills to teach A level, and ran a great workshop at one of their recent Discovery events using the micro:bit and the Kitronik :MOVE mini. This is a session that I’ll definitely be using again and again.” – James Spencer, Computer Science teacher at St Martin’s School
A teacher works with her students at our recent Discovery event in Cambridge.
Why sign up?
Isaac Computer Science provides:
High-quality materials written by experienced teachers
Resources mapped to the AQA and OCR specifications
CPD events for teachers
Workshops for students
Isaac Computer Science allows you to:
Plan lessons around high-quality content pages, thus saving time
Select and set self-marking homework questions
Pinpoint areas to work on with your students
Manage students’ progress in your personal markbook
Side projects are the things you do at home, after work, for your own “entertainment”, or to satisfy your desire to learn new stuff, in case your workplace doesn’t give you that opportunity (or at least not enough of it). Side projects are also a way to build stuff that you think is valuable but not necessarily “commercialisable”. Many side projects are open-sourced sooner or later and some of them contribute to the pool of tools at other people’s disposal.
I’ve outlined one recommendation about side projects before – do them with technologies that are new to you, so that you learn important things that will keep you better positioned in the software world.
But there are more benefits than that – serendipitous benefits, for example. And I’d like to tell some personal stories about that. I’ll focus on a few examples from my list of side projects to show how, through a sort-of butterfly effect, they helped shape my career.
The computoser project, no matter how cool algorithmic music composition is, didn’t manage to have much of a long-term impact. But it did teach me something apart from niche musical theory – how to read a bulk of scientific papers (mostly computer science) and understand them without being formally trained in the particular field. We’ll see how that was useful later.
Then there was the “State alerts” project – a website that scraped content from public institutions in my country (legislation, legislation proposals, decisions by regulators, new tenders, etc.) and made it searchable and “subscribable” – so that you get notified when a keyword of interest is mentioned in newly proposed legislation, for example. (I obviously subscribed to “information technologies” and “electronic”.)
And that project turned out to have a significant impact on the following years. First, I chose a new technology to write it with – Scala. That turned out to be of great use when I started working at TomTom: on the 3rd day I was transferred to a Scala project, which was way cooler and much more complex than the original one I was hired for. It was a bit ironic, as my colleagues had just read that “I don’t like Scala” a few weeks earlier, but nevertheless, that was one of the most interesting projects I’ve worked on, and it went on for two years. Had I not known Scala, I’d probably be gone from TomTom much earlier (as the other project was restructured a few times), and I would not have learned many of the scalability, architecture and AWS lessons that I did learn there.
But the very same project had an even more important follow-up. Because of its “civic hacking” flavour, I was invited to join an informal group of developers (later made official as an NGO) who create tools that are useful for society (something like MySociety.org). That group gathered regularly, discussed both tools and policies, and at some point we put up a list of policy priorities that we wanted to lobby policy makers for. One of them was open source for the government, the other one was open data. As a result of our interaction with an interim government, we donated the official open data portal of my country, functioning to this day.
As a result of that, a few months later we got a proposal from the deputy prime minister’s office to “elect” one of the group as an advisor to the cabinet. And we decided that could be me. So I went for it and became advisor to the deputy prime minister. The job has nothing to do with anything one could imagine, and it was challenging and fascinating. We managed to pass legislation, including one that requires open source for custom projects, eID and open data. And all of that would not have been possible without my little side project.
As for my latest side project, LogSentinel – it became my current startup company. And not without help from the previous two mentioned above – the computer science paper reading was of great use when I was navigating the crypto papers landscape, and from the government job I not only gained invaluable legal knowledge, but I also “got” a co-founder.
Some other side projects died without much fanfare, and that’s fine. But the ones above shaped my “story” in a way that would not have been possible otherwise.
And I agree that such a serendipitous chain of events could have happened without side projects – I could’ve gotten these opportunities by meeting someone at a bar (unlikely, but who knows). But we, as software engineers, are capable of tilting chance towards us by utilizing our skills. Side projects are our “extracurricular activities”, and they often lead to unpredictable, but rather positive chains of events. They would rarely be the only factor, but they are certainly great at unlocking potential.
Earlier this spring, an excited group of STEM educators came together to participate in the first ever Raspberry Pi and Arduino workshop in Puerto Rico.
Their three-day digital making adventure was led by MakerTechPR’s José Rullán and Raspberry Pi Certified Educator Alex Martínez. They ran the event as part of the Robot Makers challenge organized by Yees! and sponsored by Puerto Rico’s Department of Economic Development and Trade to promote entrepreneurial skills within Puerto Rico’s education system.
Over 30 educators attended the workshop, which covered the use of the Raspberry Pi 3 as a computer and digital making resource. The educators received a kit consisting of a Raspberry Pi 3 with an Explorer HAT Pro and an Arduino Uno. At the end of the workshop, the educators were able to keep the kit as a demonstration unit for their classrooms. They were enthusiastic to learn new concepts and immerse themselves in the world of physical computing.
In their first session, the educators were introduced to the Raspberry Pi as an affordable technology for robotics clubs. In their second session, they explored physical computing and the coding languages needed to control the Explorer HAT Pro. They started off coding with Scratch, with which some educators had experience, and ended with controlling the GPIO pins with Python. In the final session, they learned how to develop applications using the powerful combination of Arduino and Raspberry Pi for robotics projects. This gave them a better understanding of how they could engage their students in physical computing.
“The Raspberry Pi ecosystem is the perfect solution in the classroom because to us it is very resourceful and accessible.” – Alex Martínez
Computer science and robotics courses are important for many schools and teachers in Puerto Rico. The simple idea of programming a microcontroller from a $35 computer increases the chances of more students having access to more technology to create things.
Puerto Rico’s education system has faced enormous challenges after Hurricane Maria, including economic collapse and the government’s closure of many schools due to the exodus of families from the island. By attending training like this workshop, educators in Puerto Rico are becoming more experienced in fields like robotics in particular, which are key for 21st-century skills and learning. This, in turn, can lead to more educational opportunities, and hopefully the reopening of more schools on the island.
“We find it imperative that our children be taught STEM disciplines and skills. Our goal is to continue this work of spreading digital making and computer science using the Raspberry Pi around Puerto Rico. We want our children to have the best education possible.” – Alex Martínez
After attending Picademy in 2016, Alex has integrated the Raspberry Pi Foundation’s online resources into his classroom. He has also taught small workshops around the island and in the local Puerto Rican makerspace community. José is an electrical engineer, entrepreneur, educator and hobbyist who enjoys learning to use technology and sharing his knowledge through projects and challenges.
Join us as we celebrate the Year of Engineering in the newest issue of Hello World, our magazine for computing and digital making educators.
Inspiring future engineers
We’ve brought together a wide range of experts to share their ideas and advice on how to bring engineering to your classroom — read issue 5 to find out the best ways to inspire the next generation.
Plus we’ve got plenty on GP and Scratch, we answer your latest questions, and we bring you our usual collection of useful features, guides, and lesson plans.
Highlights of issue 5 include:
The bluffers’ guide to putting together a tech-themed school trip
Inclusion, and coding for the visually impaired
Getting students interested in databases
Why copying may not always be a bad thing
How to get Hello World #5
Hello World is available as a free download under a Creative Commons license for everyone in the world who is interested in computer science and digital making education. Get the latest issue as a PDF file straight from the Hello World website.
We’re currently offering free print copies of the magazine to serving educators in the UK. This offer is open to teachers, Code Club and CoderDojo volunteers, teaching assistants, teacher trainers, and others who help children and young people learn about computing and digital making. Subscribe to have your free print magazine posted directly to your home, or subscribe digitally. 20,000 educators have already signed up to receive theirs!
Get in touch!
You could write for us about your experiences as an educator, and share your advice with the community. Wherever you are in the world, get in touch by emailing our editorial team about your article idea — we would love to hear from you!
Last week, we shared the first half of our Q&A with Raspberry Pi Trading CEO and Raspberry Pi creator Eben Upton. Today we follow up with all your other questions, including your expectations for a Raspberry Pi 4, Eben’s dream add-ons, and whether we really could go smaller than the Zero.
Get your questions to us now using #AskRaspberryPi on Twitter
With internet security becoming more necessary, will there be automated versions of VPN on an SD card?
There are already third-party tools which turn your Raspberry Pi into a VPN endpoint. Would we do it ourselves? Like the power button, it’s one of those cases where there are a million things we could do and so it’s more efficient to let the community get on with it.
Just to give a counterexample, while we don’t generally invest in optimising for particular use cases, we did invest a bunch of money into optimising Kodi to run well on Raspberry Pi, because we found that very large numbers of people were using it. So, if we find that we get half a million people a year using a Raspberry Pi as a VPN endpoint, then we’ll probably invest money into optimising it and feature it on the website as we’ve done with Kodi. But I don’t think we’re there today.
Have you ever seen any Pis running and doing important jobs in the wild, and if so, how does it feel?
It’s amazing how often you see them driving displays, for example in radio and TV studios. Of course, it feels great. There’s something wonderful about the geographic spread as well. The Raspberry Pi desktop is quite distinctive, both in its previous incarnation with the grey background and logo, and the current one where we have Greg Annandale’s road picture.
And so it’s funny when you see it in places. Somebody sent me a video of them teaching in a classroom in rural Pakistan and in the background was Greg’s picture.
Raspberry Pi 4!?!
There will be a Raspberry Pi 4, obviously. We get asked about it a lot. I’m sticking to the guidance that I gave people that they shouldn’t expect to see a Raspberry Pi 4 this year. To some extent, the opportunity to do the 3B+ was a surprise: we were surprised that we’ve been able to get 200MHz more clock speed, triple the wireless and wired throughput, and better thermals, and still stick to the $35 price point.
We’re up against the wall from a silicon perspective; we’re at the end of what you can do with the 40nm process. It’s not that you couldn’t clock the processor faster, or put a larger processor which can execute more instructions per clock in there, it’s simply about the energy consumption and the fact that you can’t dissipate the heat. So we’ve got to go to a smaller process node and that’s an order of magnitude more challenging from an engineering perspective. There’s more effort, more risk, more cost, and all of those things are challenging.
With 3B+ out of the way, we’re going to start looking at this now. For the first six months or so we’re going to be figuring out exactly what people want from a Raspberry Pi 4. We’re listening to people’s comments about what they’d like to see in a new Raspberry Pi, and I’m hoping by early autumn we should have an idea of what we want to put in it and a strategy for how we might achieve that.
Could you go smaller than the Zero?
The challenge with Zero is that we’re periphery-limited. If you run your hand around the unit, there is no edge of that board that doesn’t have something there. So the question is: “If you want to go smaller than Zero, what feature are you willing to throw out?”
It’s a single-sided board, so you could certainly halve the PCB area if you fold the circuitry and use both sides, though you’d have to lose something. You could give up some GPIO and go back to 26 pins like the first Raspberry Pi. You could give up the camera connector, you could go to micro HDMI from mini HDMI. You could remove the SD card and just do USB boot. I’m inventing a product live on air! But really, you could get down to two thirds and lose a bunch of GPIO – it’s hard to imagine you could get to half the size.
What’s the one feature that you wish you could outfit on the Raspberry Pi that isn’t cost effective at this time? Your dream feature.
Well, more memory. There are obviously technical reasons why we don’t have more memory on there, but there are also market reasons. People ask “why doesn’t the Raspberry Pi have more memory?”, and my response is typically “go and Google ‘DRAM price’”. We’re used to the price of memory going down. And currently, we’re going through a phase where this has turned around and memory is getting more expensive again.
Machine learning would be interesting. There are machine learning accelerators which would be interesting to put on a piece of hardware. But again, they are not going to be used by everyone, so according to our method of pricing what we might add to a board, machine learning gets treated like a $50 chip. But that would be lovely to do.
Which citizen science projects using the Pi have most caught your attention?
I like the wildlife camera projects. We live out in the countryside in a little village, and we’re conscious of being surrounded by nature but we don’t see a lot of it on a day-to-day basis. So I like the nature cam projects, though, to my everlasting shame, I haven’t set one up yet. There’s a range of them, from very professional products to people taking a Raspberry Pi and a camera and putting them in a plastic box. So those are good fun.
How does it feel to go to bed every day knowing you’ve changed the world for the better in such a massive way?
What feels really good is that when we started this in 2006 nobody else was talking about it, but now we’re part of a very broad movement.
We were in a really bad way: we’d seen a collapse in the number of applicants applying to study Computer Science at Cambridge and elsewhere. In our view, this reflected a move away from seeing technology as ‘a thing you do’ to seeing it as a ‘thing that you have done to you’. It is problematic from the point of view of the economy, industry, and academia, but most importantly it damages the life prospects of individual children, particularly those from disadvantaged backgrounds. The great thing about STEM subjects is that you can’t fake being good at them. There are a lot of industries where your Dad can get you a job based on who he knows and then you can kind of muddle along. But if your dad gets you a job building bridges and you suck at it, after the first or second bridge falls down, then you probably aren’t going to be building bridges anymore. So access to STEM education can be a great driver of social mobility.
By the time we were launching the Raspberry Pi in 2012, there was this wonderful movement going on. Code Club, for example, and CoderDojo came along. Lots of different ways of trying to solve the same problem. What feels really, really good is that we’ve been able to do this as part of an enormous community. And some parts of that community became part of the Raspberry Pi Foundation – we merged with Code Club, we merged with CoderDojo, and we continue to work alongside a lot of these other organisations. So in the two seconds it takes me to fall asleep after my face hits the pillow, that’s what I think about.
We’re currently advertising a Programme Manager role in New Delhi, India. Did you ever think that Raspberry Pi would be advertising a role like this when you were bringing together the Foundation?
No, I didn’t.
But if you told me we were going to be hiring somewhere, India probably would have been top of my list because there’s a massive IT industry in India. When we think about our interaction with emerging markets, India, in a lot of ways, is the poster child for how we would like it to work. There have already been some wonderful deployments of Raspberry Pi, for example in Kerala, without our direct involvement. And we think we’ve got something that’s useful for the Indian market. We have a product, we have clubs, we have teacher training. And we have a body of experience in how to teach people, so we have a physical commercial product as well as a charitable offering that we think are a good fit.
It’s going to be massive.
What is your favourite BBC type-in listing?
There was a game called Codename: Druid. There is a famous game called Codename: Droid which was the sequel to Stryker’s Run, which was an awesome, awesome game. And there was a type-in game called Codename: Druid, which was at the bottom end of what you would consider a commercial game.
And I remember typing that in. And what was really cool about it was that the next month, the guy who wrote it did another article that talked about the memory map and which operating system functions used which bits of memory. So if you weren’t going to do disc access, you knew which bits of memory you could trample on and know the operating system would survive.
I still like type-in listings. The Raspberry Pi 2018 Annual has a type-in listing that I wrote for a Babbage versus Bugs game. I will say that’s not the last type-in listing you will see from me in the next twelve months. And if you download the PDF, you could probably copy and paste it into your favourite text editor to save yourself some time.
Previous attempts to track tainted coins had used either the “poison” or the “haircut” method. Suppose I open a new address and pay into it three stolen bitcoin followed by seven freshly-mined ones. Then under poison, the output is ten stolen bitcoin, while under haircut it’s ten bitcoin that are marked 30% stolen. After thousands of blocks, poison tainting will blacklist millions of addresses, while with haircut the taint gets diffused, so neither is very effective at tracking stolen property. Bitcoin due-diligence services supplant haircut taint tracking with AI/ML, but the results are still not satisfactory.
We discovered that, back in 1816, the High Court had to tackle this problem in Clayton’s case, which involved the assets and liabilities of a bank that had gone bust. The court ruled that money must be tracked through accounts on the basis of first-in, first-out (FIFO); the first penny into an account goes to satisfy the first withdrawal, and so on.
Ilia Shumailov has written software that applies FIFO tainting to the blockchain and the results are impressive, with a massive improvement in precision. What’s more, FIFO taint tracking is lossless, unlike haircut; so in addition to tracking a stolen coin forward to find where it’s gone, you can start with any UTXO and trace it backwards to see its entire ancestry. It’s not just good law; it’s good computer science too.
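To make the difference between the three rules concrete, here is a minimal Python sketch applied to the ten-bitcoin example above. It is an illustration only and not Ilia Shumailov's software: the function names (poison, haircut, fifo_withdraw) are invented for this example, an address is modelled as a simple queue of deposits, and real transactions with many inputs and outputs are ignored.

```python
from collections import deque

# Each deposit is (amount_in_bitcoin, is_stolen). Hypothetical toy model:
# one address, whole-number amounts, deposits kept in arrival order.
deposits = [(3, True), (7, False)]

def poison(inputs):
    """Poison rule: any stolen input taints the entire balance."""
    total = sum(amount for amount, _ in inputs)
    return total if any(stolen for _, stolen in inputs) else 0

def haircut(inputs):
    """Haircut rule: the balance carries the stolen fraction of its inputs."""
    total = sum(amount for amount, _ in inputs)
    stolen = sum(amount for amount, is_stolen in inputs if is_stolen)
    return stolen / total

def fifo_withdraw(queue, amount):
    """FIFO rule: a withdrawal is satisfied by the oldest deposits first.

    Returns the chunks paid out as (amount, is_stolen) pairs and removes
    them from the front of the queue.
    """
    if amount > sum(chunk for chunk, _ in queue):
        raise ValueError("insufficient funds")
    paid = []
    remaining = amount
    while remaining > 0:
        chunk, is_stolen = queue[0]
        take = min(chunk, remaining)
        paid.append((take, is_stolen))
        if take == chunk:
            queue.popleft()          # deposit fully spent
        else:
            queue[0] = (chunk - take, is_stolen)  # deposit partly spent
        remaining -= take
    return paid

if __name__ == "__main__":
    print(poison(deposits))    # tainted bitcoin under poison
    print(haircut(deposits))   # tainted fraction under haircut
    account = deque(deposits)
    print(fifo_withdraw(account, 4))  # first payment out of the address
    print(fifo_withdraw(account, 6))  # second payment out of the address
```

Under FIFO the first payment of four bitcoin carries all three stolen coins plus one clean coin, and the later payment of six is entirely clean, so the taint stays attached to specific coins rather than spreading or diluting.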
With the Greenland shark finally caught on video for the very first time, scientists and engineers are discussing the limitations of current marine monitoring technology. One significant advance comes from the CSAIL team at Massachusetts Institute of Technology (MIT): SoFi, the robotic fish.
More info: http://bit.ly/SoFiRobot Paper: http://robert.katzschmann.eu/wp-content/uploads/2018/03/katzschmann2018exploration.pdf
The untethered SoFi robot
Last week, the Computer Science and Artificial Intelligence Laboratory (CSAIL) team at MIT unveiled SoFi, “a soft robotic fish that can independently swim alongside real fish in the ocean.”
Directed by a Super Nintendo controller and acoustic signals, SoFi can dive untethered to a maximum of 18 metres for a total of 40 minutes. A Raspberry Pi receives input from the controller and amplifies the ultrasound signals for SoFi via a HiFiBerry. The controller, Raspberry Pi, and HiFiBerry are sealed within a waterproof, cast-moulded silicone membrane filled with non-conductive mineral oil, allowing for underwater equalisation.
The ultrasound signals, received by a modem within SoFi’s head, control everything from direction, tail oscillation, pitch, and depth to the onboard camera.
As explained on MIT’s news blog, “to make the robot swim, the motor pumps water into two balloon-like chambers in the fish’s tail that operate like a set of pistons in an engine. As one chamber expands, it bends and flexes to one side; when the actuators push water to the other channel, that one bends and flexes in the other direction.”
Ocean exploration
While we’ve seen many autonomous underwater vehicles (AUVs) using onboard Raspberry Pis, SoFi’s ability to roam untethered with a wireless waterproof controller is an exciting achievement.
“To our knowledge, this is the first robotic fish that can swim untethered in three dimensions for extended periods of time. We are excited about the possibility of being able to use a system like this to get closer to marine life than humans can get on their own.” – CSAIL PhD candidate Robert Katzschmann
As the MIT news post notes, SoFi’s simple, lightweight setup of a single camera, a motor, and a smartphone lithium polymer battery sets it apart from existing bulky AUVs that require large motors or support from boats.
For more in-depth information on SoFi and the onboard tech that controls it, find the CSAIL team’s paper here.
The data center keeps growing. With well over 500 Petabytes of data under management, we needed more systems administrators to help us keep track of all the systems as our operation expands. Our latest systems administrator is Billy! Let’s learn a bit more about him, shall we?
What is your Backblaze Title? Sr. Systems Administrator
Where are you originally from? Boston, MA
What attracted you to Backblaze? I’ve read the hard drive articles that were published and was excited to be a part of the company that took the time to do that kind of analysis and share it with the world.
What do you expect to learn while being at Backblaze? I expect that I’ll learn about the problems that arise from a larger scale operation and how to solve them. I’m very curious to find out what they are.
Where else have you worked? I’ve worked for the MIT Math Dept, Google, a social network owned by AOL called Bebo, Evernote, a contractor recommendation site owned by The Home Depot called RedBeacon, and a few others that weren’t as interesting.
Where did you go to school? I started college at The Cooper Union, discovered that Electrical Engineering wasn’t my thing, then graduated from the Computer Science program at Northeastern.
What’s your dream job? Is couch potato a job? I like to solve puzzles and play with toys, which is why I really enjoy being a sysadmin. My dream job is to do pretty much what I do now, but not have to participate in on-call.
Favorite place you’ve traveled? We did a 2 week tour through Europe on our honeymoon. I’d go back to any place there.
Favorite hobby? Reading and listening to music. I spent a stupid amount of money on a stereo, so I make sure it gets plenty of use. I spent much less money on my library card, but I try to utilize it quite a bit as well.
Of what achievement are you most proud? I designed and built a set of shelves for the closet in my kids’ room. Built with hand tools. The only electricity I used was the lights to see what I was doing.
Star Trek or Star Wars? Star Trek: The Next Generation
Coke or Pepsi? Coke!
Favorite food? Pesto. Usually on angel hair, but it also works well on bread, or steak, or a spoon.
Why do you like certain things? I like things that are a little outside the norm, like musical covers and mashups, or things that look like 1 thing but are really something else. Secret compartments are also fun.
Anything else you’d like to tell us? I’m full of anecdotes and lines from songs and movies and TV shows.
Pesto is delicious! Welcome to the systems administrator team, Billy! We’ll keep the fridge stocked with Coke for you!
Backblaze is growing, and with it our need to cater to a lot of different use cases that our customers bring to us. We needed a Solutions Engineer to help out, and after a long search we’ve hired our first one! Let’s learn a bit more about Nathan, shall we?
What is your Backblaze Title? Solutions Engineer. Our customers bring a thousand different use cases to both B1 and B2, and I’m here to help them figure out how best to make those use cases a reality. Also, any odd jobs that Nilay wants me to do.
Where are you originally from? I am native to the San Francisco Bay Area, studying mathematics at UC Santa Cruz, and then computer science at California University of Hayward (which has since renamed itself California University of the East Hills. I observe that it’s still in Hayward).
What attracted you to Backblaze? As a stable, growing company with huge growth and even bigger potential, the business model is attractive, and the team is outstanding. Add to that the strong commitment to transparency, and it’s a hard company to resist. We can store – and restore – data while offering superior reliability at an economic advantage to do-it-yourself, and that’s a great place to be.
What do you expect to learn while being at Backblaze? Everything I need to, but principally how our customers choose to interact with web storage. Storage isn’t a solution per se, but it’s an important component of any persistent solution. I’m looking forward to working with all the different concepts our customers have to make use of storage.
Where else have you worked? All sorts of places, but I’ll admit publicly to EMC, Gemalto, and my own little (failed, alas) startup, IC2N. I worked with low-level document imaging.
Where did you go to school? UC Santa Cruz, BA Mathematics; CU Hayward, Master of Science in Computer Science.
What’s your dream job? Sipping tea in the California redwood forest. However, solutions engineer at Backblaze is a good second choice!
Favorite place you’ve traveled? Ashland, Oregon, for the Oregon Shakespeare Festival and the marble caves (most caves form from limestone).
Favorite hobby? Theater. Pathfinder. Writing. Baking cookies and cakes.
Of what achievement are you most proud? Marrying the most wonderful man in the world.
Star Trek or Star Wars? Star Trek’s utopian science fiction vision of humanity and science resonates a lot more strongly with me than the dystopian science fantasy of Star Wars.
Coke or Pepsi? Neither. I’d much rather have a cup of jasmine tea.
Favorite food? It varies, but I love Indian and Thai cuisine. Truly excellent Italian food is marvelous – wood fired pizza, if I had to pick only one, but the world would be a boring place with a single favorite food.
Why do you like certain things? If I knew that, I’d be in marketing.
Anything else you’d like to tell us? If you haven’t already encountered the amazing authors Patricia McKillip and Lois McMaster Bujold, go encounter them. Be happy.
There’s nothing wrong with a nice cup of tea and a long game of Pathfinder. Sign us up! Welcome to the team Nathan!
Data that describe processes in a spatial context are everywhere in our day-to-day lives and they dominate big data problems. Map data, for instance, whether describing networks of roads or remote sensing data from satellites, get us where we need to go. Atmospheric data from simulations and sensors underlie our weather forecasts and climate models. Devices and sensors with GPS can provide a spatial context to nearly all mobile data.
In this post, we introduce the WIND toolkit, a huge (500 TB), open weather model dataset that’s available to the world on Amazon’s cloud services. We walk through how to access this data and some of the open-source software developed to make it easily accessible. Our solution considers a subset of geospatial data that exist on a grid (raster) and explores ways to provide access to large-scale raster data from weather models. The solution uses foundational AWS services and the Hierarchical Data Format (HDF), a well adopted format for scientific data.
The approach developed here can be extended to any data that fit in an HDF5 file, which can describe sparse and dense vectors and matrices of arbitrary dimensions. This format is already popular within the physical sciences for both experimental and simulation data. We discuss solutions to gridded data storage for a massive dataset of public weather model outputs called the Wind Integration National Dataset (WIND) toolkit. We also highlight strategies that are general to other large geospatial data management problems.
Wind Integration National Dataset
As variable renewable power penetration levels increase in power systems worldwide, the importance of renewable integration studies to ensure continued economic and reliable operation of the power grid is also increasing. The WIND toolkit is the largest freely available grid integration dataset to date.
The WIND toolkit was developed by 3TIER by Vaisala. They were under a subcontract to the National Renewable Energy Laboratory (NREL) to support studies on integration of wind energy into the existing US grid. NREL is a part of a network of national laboratories for the US Department of Energy and has a mission to advance the science and engineering of energy efficiency, sustainable transportation, and renewable power technologies.
The toolkit has been used by consultants, research groups, and universities worldwide to support grid integration studies. Less traditional uses also include resource assessments for wind plants (such as those powering Amazon data centers), and studying the effects of weather on California condor migrations in the Baja peninsula.
The diversity of applications highlights the value of accessible, open public data. Yet, there’s a catch: the dataset is huge. The WIND toolkit provides simulated atmospheric (weather) data at a two-km spatial resolution and five-minute temporal resolution at multiple heights for seven years. The entire dataset is half a petabyte (500 TB) in size and is stored in the NREL High Performance Computing data center in Golden, Colorado. Making this dataset publicly available easily and in a cost-effective manner is a major challenge.
As other laboratories and public institutions work to release their data to the world, they may face similar challenges to those that we experienced. Some prior, well-intentioned efforts to release huge datasets as-is have resulted in data resources that are technically available but fundamentally unusable. They may be stored in an unintuitive format or indexed and organized to support only a subset of potential uses. Downloading hundreds of terabytes of data is often impractical. Most users don’t have access to a big data cluster (or super computer) to slice and dice the data as they need after it’s downloaded.
We aim to provide a large amount of data (50 terabytes) to the public in a way that is efficient, scalable, and easy to use. In many cases, researchers can access these huge cloud-located datasets using the same software and algorithms they have developed for smaller datasets stored locally. Only the pieces of data they need for their individual analysis must be downloaded. To make this work in practice, we worked with the HDF Group and have built upon their forthcoming Highly Scalable Data Service.
In the rest of this post, we discuss how the HSDS software was developed to use Amazon EC2 and Amazon S3 resources to provide convenient and scalable access to these huge geospatial datasets. We describe how the HSDS service has been put to work for the WIND Toolkit dataset and demonstrate how to access it using the h5pyd Python library and the REST API. We conclude with information about our ongoing work to release more ‘open’ datasets to the public using AWS services, and ways to improve and extend the HSDS with newer Amazon services like Amazon ECS and AWS Lambda.
Developing a scalable service for big geospatial data
The HDF5 file format and API have been used for many years and are an effective means of storing large scientific datasets. For example, NASA’s Earth Observing System (EOS) satellites collect more than 16 TB of data per day using HDF5.
With the rise of the cloud, there are new challenges and opportunities to rethink how HDF5 can be enhanced to work effectively as a component in a cloud-native architecture. For the HDF Group, working with NREL has been a great opportunity to put ideas into practice with a production-size dataset.
An HDF5 file consists of a directed graph of group and dataset objects. Datasets can be thought of as a multidimensional array with support for user-defined metadata tags and compression. Typical operations on datasets would be reading or writing data to a regular subregion (a hyperslab) or reading and writing individual elements (a point selection). Also, group and dataset objects may each contain an arbitrary number of the user-defined metadata elements known as attributes.
Many people have used the HDF library in applications developed or ported to run on EC2 instances, but there are a number of constraints that often prove problematic:
The HDF5 library can’t read directly from HDF5 files stored as S3 objects. The entire file (often many GB in size) would need to be copied to local storage before the first byte can be read. Also, the instance must be configured with an appropriately sized EBS volume.
The HDF library only has access to the computational resources of the instance itself (as opposed to a cluster of instances), so many operations are bottlenecked by the library.
Any modifications to the HDF5 file would somehow have to be synchronized with changes that other instances have made to the same file before writing back to S3.
Using a pattern common to many offerings from AWS, the solution to these constraints is to develop a service framework around the HDF data model. Using this model, the HDF Group has created the Highly Scalable Data Service (HSDS) that provides all the functionality that traditionally was provided by the HDF5 library. By using the service, you don’t need to manage your own file volumes, but can just read and write whatever data that you need.
Because the service manages the actual data persistence to a durable medium (S3, in this case), you don’t need to worry about disk management. Simply stream the data you need from the service as you need it. Secondly, putting the functionality behind a service allows some tricks to increase performance (described in more detail later). And lastly, HSDS allows any number of clients to access the data at the same time, enabling HDF5 to be used as a coordination mechanism for multiple readers and writers.
In designing the HSDS architecture, we gave much thought to how to achieve scalability of the HSDS service. For accessing HDF5 data, there are two different types of scaling to consider:
Multiple clients making many requests to the service
Single requests that require a significant amount of data processing
To deal with the first scaling challenge, as with most services, we considered how the service responds as the request rate increases. AWS provides some great tools that help in this regard:
Auto Scaling groups
Elastic Load Balancing load balancers
The ability of S3 to handle large aggregate throughput rates
By using a cluster of EC2 instances behind a load balancer, you can handle different client loads in a cost-effective manner.
The second scaling challenge concerns single requests that would take significant processing time with just one compute node. One example of this from the WIND toolkit would be extracting all the values in the seven-year time span for a given geographic point and dataset.
In HDF5, large datasets are typically stored as “chunks”; that is, a regular partition of the array. In HSDS, each chunk is stored as a binary object in S3. The sequential approach to retrieving the time series values would be for the service to read each chunk needed from S3, extract the needed elements, and go on to the next chunk. In this case, that would involve processing 2557 chunks, and would be quite slow.
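To make the chunk arithmetic concrete, here is a minimal sketch of how a selection maps onto chunks. The chunk shape and the grid indices are hypothetical: the post does not state the actual chunk layout, so the values below were chosen only so that a full time series at one grid point touches 2557 chunks, matching the figure quoted above.

```python
from itertools import product


def chunks_touched(selection, chunk_shape):
    """Return the index of every chunk that intersects a hyperslab.

    selection: one (start, stop) pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_axis = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # chunk holding the first element
        last = (stop - 1) // size    # chunk holding the last element
        per_axis.append(range(first, last + 1))
    return list(product(*per_axis))


# Assumption: hourly steps for 2,557 days, stored one day per chunk in time.
HOURLY_STEPS = 2557 * 24
CHUNK_SHAPE = (24, 64, 64)  # hypothetical (time, i, j) chunk shape

# Full time span at a single (hypothetical) grid cell.
point_series = [(0, HOURLY_STEPS), (500, 501), (800, 801)]
print(len(chunks_touched(point_series, CHUNK_SHAPE)))  # 2557
```

Read sequentially, each of those chunks costs one round trip to S3, which is why the single-node approach is slow.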
Fortunately, with HSDS, you can speed this up quite a bit by exploiting the compute and I/O capabilities of the cluster. Upon receiving the request, the receiving node can use other nodes in the cluster to read different portions of the selection. With multiple nodes reading from S3 in parallel, performance improves as the cluster size increases.
The diagram below illustrates how this works in a simplified case of four chunks and four nodes.
This architecture has worked well in practice. In testing with the WIND toolkit and time series extraction, we observed a request latency of ~60 seconds using four nodes vs. ~5 seconds with 40 nodes. Performance roughly scales with the size of the cluster.
A planned enhancement to this is to use AWS Lambda for the worker processing. This enables 1000-way parallel reads at a reasonable cost, as you only pay for the milliseconds of CPU time used with AWS Lambda.
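The scatter-gather idea described in this section can be sketched in a few lines. This is an illustration only, not the HSDS implementation: a dictionary stands in for S3, threads stand in for cluster nodes, and the names `STORE`, `read_chunk`, and `parallel_read` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_LEN = 4
# Hypothetical stand-in for S3: chunk index -> values stored in that chunk.
STORE = {i: [i * CHUNK_LEN + k for k in range(CHUNK_LEN)] for i in range(4)}


def read_chunk(index):
    """Fetch one chunk; in the real service this would be an S3 object read."""
    return STORE[index]


def read_slice(index, start, stop):
    """Return the elements of one chunk that fall inside [start, stop)."""
    base = index * CHUNK_LEN
    lo = max(start - base, 0)
    hi = min(stop - base, CHUNK_LEN)
    return read_chunk(index)[lo:hi]


def parallel_read(start, stop, workers=4):
    """Fan the chunk reads out to workers, then stitch results in order."""
    first, last = start // CHUNK_LEN, (stop - 1) // CHUNK_LEN
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda i: read_slice(i, start, stop),
                         range(first, last + 1))
        return [value for part in parts for value in part]


print(parallel_read(2, 14))
```

Because `pool.map` returns results in submission order, the receiving side can concatenate the partial reads without any extra bookkeeping.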
Public access to atmospheric data using HSDS and AWS
An early challenge in releasing the WIND toolkit data was in deciding how to subset the data for different use cases. In general, few researchers need access to the entire 0.5 PB of data and a great deal of efficiency and cost reduction can be gained by making directed constituent datasets.
NREL grid integration researchers initially extracted a 2-TB subset by selecting 120,000 points where the wind resource seemed appropriate for development, the most interesting locations for those performing grid studies. They also chose only those data important for wind applications (100-m wind speed, converted to power). To support the remaining users who needed more data resolution, we down-sampled the data to a 60-minute temporal resolution, keeping all the other variables and spatial resolution intact. This reduced dataset is 50 TB in size and describes 30+ atmospheric variables for 7 years at a 60-minute temporal resolution.
The WindViz browser-based Gridded Wind Toolkit Visualizer was created as an example implementation of the HSDS REST API in JavaScript. The visualizer is written in the style of ECMAScript 2016 using a modern development toolchain that includes webpack and Babel. The source code is available through our GitHub repository. The demo page is hosted via GitHub pages, and we use a cross-origin AJAX request to fetch data from the HSDS service running on the EC2 infrastructure. The visualizer can be used to explore the gridded wind toolkit data on a map. Achieve full spatial resolution by zooming in to a specific region.
Programmatic access is possible using the h5pyd Python library, a distributed analog to the widely used h5py library. Users interact with the datasets (variables) and slice the data from its (time x longitude x latitude) cube form as they see fit.
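As a rough sketch of what that slicing looks like, the helper below pulls a time series for one grid cell out of a (time, i, j) cube. The domain path `/nrel/wtk-us.h5` and the dataset name `windspeed_100m` are assumptions for illustration, not values given in this post, and opening the file requires the third-party h5pyd package plus a configured endpoint and API key.

```python
def open_wind_toolkit(domain="/nrel/wtk-us.h5"):
    """Open the dataset through HSDS in read-only mode.

    The domain path is an assumption; h5pyd is imported lazily so the
    slicing helper below can be used without it installed.
    """
    import h5pyd
    return h5pyd.File(domain, "r")


def point_time_series(dataset, t_start, t_stop, i, j):
    """Slice one grid cell's values over [t_start, t_stop) from the cube."""
    return dataset[t_start:t_stop, i, j]


# Usage sketch (needs network access and h5pyd configuration):
# f = open_wind_toolkit()
# speeds = point_time_series(f["windspeed_100m"], 0, 24, 100, 200)
```

Only the requested elements travel over the network; the rest of the cube stays in S3.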
Examples and use cases are described in a set of Jupyter notebooks and available on GitHub:
Now you have a Jupyter notebook server running on your EC2 server.
From your laptop, create an SSH tunnel:
$ ssh -L 8888:localhost:8888 (IP address of the EC2 server)
Now, you can browse to localhost:8888 using the correct token, and interact with the notebooks as if they were local. Within the directory, there are examples for accessing the HSDS API and plotting wind and weather data using matplotlib.
Controlling access and defraying costs
A final concern is rate limiting and access control. Although the HSDS service is scalable and relatively robust, we had a few practical concerns:
How can we protect from malicious or accidental use that may lead to high egress fees (for example, someone who attempts to repeatedly download the entire dataset from S3)?
How can we keep track of who is using the data both to document the value of the data resource and to justify the costs?
If costs become too high, can we charge for some or all API use to help cover the costs?
To approach these problems, we investigated using Amazon API Gateway and its simplified integration with the AWS Marketplace for SaaS monetization as well as third-party API proxies.
In the end, we chose to use API Umbrella due to its close involvement with http://data.gov. While AWS Marketplace is a compelling option for future datasets, the decision was made to keep this dataset entirely open, at least for now. As community use and associated costs grow, we’ll likely revisit Marketplace. Meanwhile, API Umbrella provides controls for rate limiting and API key registration out of the box and was simple to implement as a front-end proxy to HSDS. Those applications that may want to charge for API use can accomplish a similar strategy using Amazon API Gateway and AWS Marketplace.
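To illustrate the kind of per-key rate limiting a front-end proxy applies, here is a minimal token-bucket sketch. It is not how API Umbrella is implemented; the class names, rate, and capacity are hypothetical and chosen only to show the idea.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class KeyedLimiter:
    """Keep one bucket per API key so heavy users cannot starve others."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}

    def allow(self, api_key):
        if api_key not in self.buckets:
            self.buckets[api_key] = TokenBucket(
                self.rate, self.capacity, self.clock)
        return self.buckets[api_key].allow()
```

Counting allowed requests per key also gives a simple usage record, which speaks to the question of documenting who is using the data.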
Ongoing work and other resources
As NREL and other government research labs, municipalities, and organizations try to share data with the public, we expect many of you will face similar challenges to those we have tried to approach with the architecture described in this post. Providing large datasets is one challenge. Doing so in a way that is affordable and convenient for users is an entirely more difficult goal. Using AWS cloud-native services and the existing foundation of the HDF file format has allowed us to tackle that challenge in a meaningful way.
Dr. Caleb Phillips is a senior scientist with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Caleb comes from a background in computer science systems, applied statistics, computational modeling, and optimization. His work at NREL spans the breadth of renewable energy technologies and focuses on applying modern data science techniques to data problems at scale.
Dr. Caroline Draxl is a senior scientist at NREL. She supports the research and modeling activities of the US Department of Energy from mesoscale to wind plant scale. Caroline uses mesoscale models to research wind resources in various countries, and participates in on- and offshore boundary layer research and in the coupling of the mesoscale flow features (kilometer scale) to the microscale (tens of meters). She holds a M.S. degree in Meteorology and Geophysics from the University of Innsbruck, Austria, and a PhD in Meteorology from the Technical University of Denmark.
John Readey has been a Senior Architect at The HDF Group since he joined in June 2014. His interests include web services related to HDF, applications that support the use of HDF and data visualization. Before joining The HDF Group, John worked at Amazon.com from 2006–2014 where he developed service-based systems for eCommerce and AWS.
Jordan Perr-Sauer is an RPP intern with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Jordan hopes to use his professional background in software engineering and his academic training in applied mathematics to solve the challenging problems facing America and the world.
Want to work at a company that helps customers in 156 countries around the world protect the memories they hold dear? A company that stores over 500 petabytes of customers’ photos, music, documents and work files in a purpose-built cloud storage system?
Well here’s your chance. Backblaze is looking for a Vault Storage Engineer!
Company Description:
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2 — robust and reliable object storage at just $0.005/gb/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.
We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.
We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.
Some Backblaze Perks:
Competitive healthcare plans
Competitive compensation and 401k
All employees receive Option grants
Unlimited vacation days
Strong coffee
Fully stocked Micro kitchen
Catered breakfast and lunches
Awesome people who work on awesome projects
New Parent Childcare bonus
Normal work hours
Get to bring your pets into the office
San Mateo Office – located near Caltrain and Highways 101 & 280.
Want to know what you’ll be doing?
You will work on the core of Backblaze: the vault cloud storage system (https://www.backblaze.com/blog/vault-cloud-storage-architecture/). The system accepts files uploaded from customers, stores them durably by distributing them across the data center, automatically handles drive failures, rebuilds data when drives are replaced, and maintains high availability for customers to download their files. There are significant enhancements in the works, and you’ll be a part of making them happen.
Must have a strong background in:
Computer Science
Multi-threaded programming
Distributed Systems
Java
Math (such as matrix algebra and statistics)
Building reliable, testable systems
Bonus points for:
Java
JavaScript
Python
Cassandra
SQL
Looking for an attitude of:
Passionate about building reliable clean interfaces and systems.
Likes to work closely with other engineers, support, and sales to help customers.
Customer Focused (!!) — always focus on the customer’s point of view and how to solve their problem!
Required for all Backblaze Employees:
Good attitude and willingness to do whatever it takes to get the job done
Strong desire to work for a small fast-paced company
Desire to learn and adapt to rapidly changing technologies and work environment
Rigorous adherence to best practices
Relentless attention to detail
Excellent interpersonal skills and good oral/written communication
Excellent troubleshooting and problem solving skills
This position is located in San Mateo, California, but we will also consider remote work as long as you’re no more than three time zones away and can come to San Mateo now and then.
Backblaze is an Equal Opportunity Employer.
Contact Us: If this sounds like you, follow these steps:
The AWS Community Heroes program helps shine a spotlight on some of the innovative work being done by rockstar AWS developers around the globe. Marrying cloud expertise with a passion for community building and education, these Heroes share their time and knowledge across social media and in-person events. Heroes also actively help drive content at Meetups, workshops, and conferences.
This March, we have five Heroes that we’re happy to welcome to our network of cloud innovators:
Peter Sbarski is VP of Engineering at A Cloud Guru and the organizer of Serverlessconf, the world’s first conference dedicated entirely to serverless architectures and technologies. His work at A Cloud Guru allows him to work with, talk and write about serverless architectures, cloud computing, and AWS. He has written a book called Serverless Architectures on AWS and is currently collaborating on another book called Serverless Design Patterns with Tim Wagner and Yochay Kiriaty.
Peter is always happy to talk about cloud computing and AWS, and can be found at conferences and meetups throughout the year. He helps to organize Serverless Meetups in Melbourne and Sydney in Australia, and is always keen to share his experience working on interesting and innovative cloud projects.
Peter’s passions include serverless technologies, event-driven programming, back end architecture, microservices, and orchestration of systems. Peter holds a PhD in Computer Science from Monash University, Australia and can be followed on Twitter, LinkedIn, Medium, and GitHub.
In close collaboration with his brother Andreas Wittig, Michael Wittig is actively creating AWS-related content. Their book Amazon Web Services in Action (Manning) introduces AWS with a strong focus on automation. Andreas and Michael run the blog cloudonaut.io where they share their knowledge about AWS with the community. The Wittig brothers also published a bunch of video courses with O’Reilly, Manning, Pluralsight, and A Cloud Guru. You can also find them speaking at conferences and user groups in Europe. Both brothers are co-organizing the AWS user group in Stuttgart.
Fernando is an experienced Infrastructure Solutions Leader, holding 5 AWS Certifications, with extensive IT Architecture and Management experience in a variety of market sectors. Working as a Cloud Architect Consultant in the United Kingdom since 2014, Fernando built an online community for Hispanic speakers worldwide.
Fernando founded a LinkedIn Group, a Slack Community, and a YouTube channel, all of them named “AWS en Español”, and started to run a monthly webinar via YouTube streaming where different leaders discuss aspects and challenges around AWS Cloud.
During the last 18 months he’s been helping to run and coach AWS User Group leaders across LATAM and Spain, and 10 new User Groups were founded during this time.
Feel free to follow Fernando on Twitter, connect with him on LinkedIn, or join the ever-growing Hispanic Community via Slack, LinkedIn or YouTube.
Anders is a consultant and cloud evangelist at Webstep AS in Norway. He finished his degree in Computer Science at the Norwegian Institute of Technology at about the same time the Internet emerged as a public service. Since then he has been an IT consultant and a passionate advocate of knowledge-sharing.
He architected and implemented his first customer solution on AWS back in 2010, and is essential in building Webstep’s core cloud team. Anders applies his broad expert knowledge across all layers of the organizational stack. He engages with developers on technology and architectures and with top management where he advises about cloud strategies and new business models.
Anders enjoys helping people increase their understanding of AWS and cloud in general, and holds several AWS certifications. He co-founded and co-organizes the AWS User Groups in the largest cities in Norway (Oslo, Bergen, Trondheim and Stavanger), and also uses any opportunity to engage in events related to AWS and cloud wherever he is.
You can follow him on Twitter or connect with him on LinkedIn.
To learn more about the AWS Community Heroes Program and how to get involved with your local AWS community, click here.
Amazon EMR empowers many customers to build big data processing applications quickly and cost-effectively, using popular distributed frameworks such as Apache Spark, Apache HBase, Presto, and Apache Flink. For organizations that are crafting their analytical applications on Amazon EMR, there is a growing need to keep their data assets organized in an automated fashion. Because datasets tend to grow exponentially, using cataloging tools is essential to automating data discovery and organizing data assets.
AWS Glue Data Catalog provides this essential capability, allowing you to automatically discover and catalog metadata about your data stores in a central repository. Since Amazon EMR 5.8.0, customers have been using the AWS Glue Data Catalog as a metadata store for Apache Hive and Spark SQL applications that are running on Amazon EMR. Starting with Amazon EMR 5.10.0, you can catalog datasets using AWS Glue and run queries using Presto on Amazon EMR from the Hue (Hadoop User Experience) and Apache Zeppelin UIs.
You might wonder what scenarios warrant using Presto running on Amazon EMR and when to choose Amazon Athena (which uses Presto as the query engine under the hood). It is important to note that both are excellent tools for querying massive amounts of data, and that they address different needs and use cases.
Amazon Athena provides the easiest way to run interactive queries for data in Amazon S3 without needing to set up or manage any servers. Presto running on Amazon EMR gives you much more flexibility in how you configure and run your queries, providing the ability to federate to other data sources if needed. For example, you might have a use case that requires LDAP authentication for clients such as the Presto CLI or JDBC/ODBC drivers. Or you might have a workflow where you need to join data between different systems like MySQL/Amazon Redshift/Apache Cassandra and Hive. In these examples, Presto running on Amazon EMR is the right tool to use because it can be configured to enable LDAP authentication in addition to the desired database connectors at cluster launch.
Now, let’s look at how metadata management for Presto works with AWS Glue.
Using an AWS Glue crawler to discover datasets
The AWS Glue Data Catalog is a reference to the location, schema, and runtime metrics of your datasets. To create this reference metadata, AWS Glue needs to crawl your datasets. In this exercise, we use an AWS Glue crawler to populate tables in the Data Catalog for the NYC taxi rides dataset.
The following are the steps for adding a crawler:
Sign in to the AWS Management Console, and open the AWS Glue console. In the navigation pane, choose Crawlers. Then choose Add crawler.
On the Add a data store page, specify the location of the NYC taxi rides dataset.
In the next step, choose an existing IAM role if one is available, or create a new role. Then choose Next.
On the scheduling page, for Frequency, choose Run on demand.
On the Configure the crawler’s output page, choose Add database. Specify blog-db as the database name. (You can specify a name of your choice, but be sure to choose the correct database name when running queries.)
Follow the remaining steps using the default values to create a crawler.
When the crawler displays the Ready state, navigate to Databases. (Choose blog-db from the list of databases, or search for it by specifying it as a filter, as shown in the following screenshot.) Then choose Tables. You should see the three tables created by the crawler, as follows.
(Optional) The discovered data is classified as CSV files. You can optionally convert this data into Parquet format for better response times on your queries.
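The console steps above can also be scripted. The sketch below builds the request for the boto3 Glue client's `create_crawler` call and then starts the crawler on demand. The crawler name, IAM role ARN, and S3 path are placeholders you would supply yourself; none of them come from this post, and boto3 is imported lazily because it is a third-party dependency that also needs AWS credentials.

```python
def crawler_request(name, role_arn, s3_path, database="blog-db"):
    """Build keyword arguments for the Glue create_crawler call.

    Leaving out a Schedule means the crawler only runs on demand.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start(request):
    """Create the crawler, then trigger a single on-demand run."""
    import boto3  # requires boto3 and configured AWS credentials
    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Usage sketch with placeholder values (not from the post):
# create_and_start(crawler_request(
#     "taxi-crawler",
#     "arn:aws:iam::123456789012:role/your-glue-role",
#     "s3://your-bucket/path-to-taxi-data/"))
```

Whichever route you take, use the same database name later when you run queries.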
Launching an Amazon EMR cluster
With the dataset discovered and organized, we can now walk through different options for launching Presto on an Amazon EMR cluster to use the AWS Glue Data Catalog.
After you’ve set up the Amazon EMR cluster with Presto, the AWS Glue Data Catalog is available through a default “hive” catalog. To change between the Hive and Glue metastores, you have to manually update hive.properties and restart the Presto server. Connect to the master node on your EMR cluster using SSH, and run the Presto CLI to start running queries interactively.
$ presto-cli --catalog hive
Begin with a simple query to sample a few rows:
presto> SELECT * FROM "blog-db".taxi LIMIT 10;
The query shows a few sample rows as follows:
Query the average fare for trips at each hour of the day on the Parquet version of the taxi dataset.
presto> SELECT EXTRACT (HOUR FROM pickup_datetime) AS hour, avg(fare_amount) AS average_fare FROM "blog-db".taxi_parquet GROUP BY 1 ORDER BY 1;
The following image shows the results:
More interestingly, you can compute the number of trips that gave tips in the 10 percent, 15 percent, or higher percentage range:
presto> -- Tip Percent Category
SELECT TipPrctCtgry
     , COUNT (DISTINCT TripID) TripCt
FROM
  (SELECT TripID
        , (CASE
             WHEN fare_prct < 0.7 THEN 'FL70'
             WHEN fare_prct < 0.8 THEN 'FL80'
             WHEN fare_prct < 0.9 THEN 'FL90'
             ELSE 'FL100'
           END) FarePrctCtgry
        , (CASE
             WHEN tip_prct < 0.1 THEN 'TL10'
             WHEN tip_prct < 0.15 THEN 'TL15'
             WHEN tip_prct < 0.2 THEN 'TL20'
             ELSE 'TG20'
           END) TipPrctCtgry
   FROM
     (SELECT TripID
           , (fare_amount / total_amount) as fare_prct
           , (extra / total_amount) as extra_prct
           , (mta_tax / total_amount) as mta_taxprct
           , (tolls_amount / total_amount) as tolls_prct
           , (tip_amount / total_amount) as tip_prct
           , (improvement_surcharge / total_amount) as imprv_suchrgprct
           , total_amount
      FROM
        (SELECT *
              , (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) as TripID
         FROM "blog-db".taxi_parquet
         WHERE total_amount > 0
        ) as t
     ) as t
  ) ct
GROUP BY TipPrctCtgry;
The results are as follows:
While the preceding query is running, navigate to the web interface for Presto on Amazon EMR at http://master-public-dns-name:8889/. Here you can look into the query metrics, such as active worker nodes, number of rows read per second, reserved memory, and parallelism.
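To sanity-check the tip categories on a handful of rows outside Presto, the bucketing logic can be sketched in Python. This is an illustrative sketch only: it assumes tip_prct is the tip's share of the total amount, uses made-up sample values rather than rows from the taxi dataset, and counts rows rather than distinct TripID values as the query does.

```python
from collections import Counter


def tip_category(tip_amount, total_amount):
    """Mirror the query's CASE expression for the tip share of the total."""
    tip_prct = tip_amount / total_amount
    if tip_prct < 0.1:
        return "TL10"
    if tip_prct < 0.15:
        return "TL15"
    if tip_prct < 0.2:
        return "TL20"
    return "TG20"


# Made-up (tip_amount, total_amount) pairs, not rows from the taxi dataset.
sample_trips = [(0.0, 10.0), (1.2, 10.0), (1.8, 10.0), (3.0, 12.0), (0.5, 0.0)]

# Skip non-positive totals, as the query's WHERE total_amount > 0 does.
counts = Counter(
    tip_category(tip, total) for tip, total in sample_trips if total > 0
)
print(dict(counts))
```

The thresholds are checked in ascending order, so each trip lands in the first category whose upper bound exceeds its tip share, and anything at 20 percent or above falls into TG20.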
Running queries in the Presto Editor on Hue
If you installed Hue with your Amazon EMR launch, you can also run queries on Hue’s Presto Editor. On the Amazon EMR Cluster console, choose Enable Web Connection, and follow the instructions to access the web interfaces for Hue and Zeppelin.
After the web connection is enabled, choose the Hue link to open the web interface. At the login screen, if you are the administrator logging in for the first time, type a user name and password to create your Hue superuser account. Then choose Create account. Otherwise, type your user name and password and choose Create account, or type the credentials provided by your administrator.
Choose the Presto Editor from the menu. You can run Presto queries against your tables in the AWS Glue Data Catalog.
Conclusion
Having a shared data catalog for applications on Amazon EMR alleviates a myriad of data-related challenges that organizations face today—including discovery, governance, auditability, and collaboration. In this post, we explored how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR. Go ahead, give this a try, and share your experience with us!
Radhika Ravirala is a Solutions Architect at Amazon Web Services where she helps customers craft distributed big data applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. She holds an M.S. in computer science from San Jose State University.
Opensource.com looks at the availability of open educational resources (OERs), where to find them, and what the advantages of OERs are. Math and computer science professor David Usinski is a strong advocate for OERs and was interviewed for the article. “The ability to customize the curriculum is one of David’s favorite benefits of OER. ‘The intangible aspect is that OER has allowed me to reinvent my curriculum and take ownership of the content. With a textbook, I am locked into the chapter-by-chapter approach by one or two authors,’ he says. Because of OER ‘I am no longer hindered or confined by published materials and now have the flexibility to create the curriculum that truly addresses the course outcomes.’ By freely sharing the content he creates, other instructors can also benefit.”
In September of last year, we launched our 2017/2018 Astro Pi challenge with our partners at the European Space Agency (ESA). Students from ESA membership and associate countries had the chance to design science experiments and write code to be run on one of our two Raspberry Pis on the International Space Station (ISS).
Submissions for the Mission Space Lab challenge have just closed, and the results are in! Students had the opportunity to design an experiment for one of the following two themes:
Life in space Making use of Astro Pi Vis (Ed) in the European Columbus module to learn about the conditions inside the ISS.
Life on Earth Making use of Astro Pi IR (Izzy), which will be aimed towards the Earth through a window to learn about Earth from space.
ESA astronaut Alexander Gerst, speaking from the replica of the Columbus module at the European Astronaut Center in Cologne, has a message for all Mission Space Lab participants:
Flight status
We had a total of 212 Mission Space Lab entries from 22 countries. Of these, 114 fantastic projects have been given flight status, and the teams’ project code will run in space!
But they’re not winners yet. In April, the code will be sent to the ISS, and then the teams will receive back their experimental data. Next, to get deeper insight into the process of scientific endeavour, they will need to produce a final report analysing their findings. Winners will be chosen based on the merit of their final report, and the winning teams will get exclusive prizes. Check the list below to see if your team got flight status.
Belgium
Flight status achieved:
Team De Vesten, Campus De Vesten, Antwerpen
Ursa Major, CoderDojo Belgium, West-Vlaanderen
Special operations STEM, Sint-Claracollege, Antwerpen
Canada
Flight status achieved:
Let It Grow, Branksome Hall, Toronto
The Dark Side of Light, Branksome Hall, Toronto
Genie On The ISS, Branksome Hall, Toronto
Byte by PIthons, Youth Tech Education Society & Kid Code Jeunesse, Edmonton
The Broadviewnauts, Broadview, Ottawa
Czech Republic
Flight status achieved:
BLEK, Střední Odborná Škola Blatná, Strakonice
Denmark
Flight status achieved:
2y Infotek, Nærum Gymnasium, Nærum
Equation Quotation, Allerød Gymnasium, Lillerød
Team Weather Watchers, Allerød Gymnasium, Allerød
Space Gardners, Nærum Gymnasium, Nærum
Finland
Flight status achieved:
Team Aurora, Hyvinkään yhteiskoulun lukio, Hyvinkää
France
Flight status achieved:
INC2, Lycée Raoul Follereau, Bourgogne
Space Project SP4, Lycée Saint-Paul IV, Reunion Island
Dresseurs2Python, clg Albert CAMUS, essonne
Lazos, Lycée Aux Lazaristes, Rhone
The space nerds, Lycée Saint André Colmar, Alsace
Les Spationautes Valériquais, lycée de la Côte d’Albâtre, Normandie
This column is from The MagPi issue 59. You can download a PDF of the full issue for free, or subscribe to receive the print edition through your letterbox or the digital edition on your tablet. All proceeds from the print and digital editions help the Raspberry Pi Foundation achieve our charitable goals.
“Hey, world!” Estefannie exclaims, a wide grin across her face as the camera begins to roll for another YouTube tutorial video. With a growing number of followers and wonderful support from her fans, Estefannie is building a solid reputation as an online maker, creating unique, fun content accessible to all.
It’s as if she was born into performing and making for an audience, but this fun, enjoyable journey to social media stardom came not from a desire to be in front of the camera, but rather as a unique approach to her own learning. While studying, Estefannie decided the best way to confirm her knowledge of a subject was to create an educational video explaining it. If she could teach a topic successfully, she knew she’d retained the information. And so her YouTube channel, Estefannie Explains It All, came into being.
Her first videos featured pages of notes with voice-over explanations of data structure and algorithm analysis. Then she moved in front of the camera, and expanded her skills in the process.
But YouTube isn’t her only outlet. With nearly 50000 followers, Estefannie’s Instagram game is strong, adding to an increasing number of female coders taking to the platform. Across her Instagram grid, you’ll find insights into her daily routine, from programming on location for work to behind-the-scenes troubleshooting as she begins to create another tutorial video. It’s hard work, with content creation for both Instagram and YouTube forever on her mind as she continues to work and progress successfully as a software engineer.
As a thank you to her Instagram fans for helping her reach 10000 followers, Estefannie created a free game for Android and iOS called Gravitris — imagine Tetris with balance issues!
Estefannie was born and raised in Mexico, with ambitions to become a graphic designer and animator. However, a documentary on coding at Pixar, and the beauty of Merida’s hair in Brave, opened her mind to the opportunities of software engineering in animation. She altered her career path, moved to the United States, and switched to a Computer Science course.
With a constant desire to make and to learn, Estefannie combines her software engineering profession with her hobby to create fun, exciting content for YouTube.
While studying, Estefannie started a Computer Science Girls Club at the University of Houston, Texas, and she found herself eager to put more time and effort into the movement to increase the percentage of women in the industry. The club was a success, and still is to this day. While Estefannie has handed over the reins, she’s still very involved in the cause.
Through her YouTube videos, Estefannie continues the theme of inclusion, with every project offering a warm sense of approachability for all, regardless of age, gender, or skill. From exploring Scratch and Makey Makey with her young niece and nephew to creating her own Disney ‘Made with Magic’ backpack for a trip to Disney World, Florida, Estefannie’s videos are essentially a documentary of her own learning process, produced so viewers can learn with her — and learn from her mistakes — to create their own tech wonders.
Estefannie’s automated gingerbread house project was a labour of love, with electronics, wires, and candy strewn across both her living room and kitchen for weeks before completion. While she already was a skilled programmer, the world of physical digital making was still fairly new for Estefannie. Having ditched her hot glue gun in favour of a soldering iron in a previous video, she continued to experiment and try out new, interesting techniques that are now second nature to many members of the maker community. With the gingerbread house, Estefannie was able to research and apply techniques such as light controls, servos, and app making, although the latter was already firmly within her skill set. The result? A fun video of ups and downs that resulted in a wonderful, festive treat. She even gave her holiday home its own solar panel!
And that’s just the beginning of her adventures with Pi…but we won’t spoil her future plans by telling you what’s coming next. Sorry! However, since this article was written last year, Estefannie has released a few more Pi-based project videos, plus some awesome interviews and live-streams with other members of the maker community such as Simone Giertz. She even made us an awesome video for our Raspberry Pi YouTube channel! So be sure to check out her latest releases.
While many wonderful maker videos show off a project without much explanation, or expect a certain level of skill from viewers hoping to recreate the project, Estefannie’s videos exist almost within their own category. We can’t wait to see where Estefannie Explains It All goes next!
Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit
By Kuhu Shukla
This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series.
As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in and the new release that we tested out internally on our many thousand strong cluster is looking good. Today I am looking at a new stack trace from a different Apache project process and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1.
Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story on how I found myself in an epic journey of migrating all of Yahoo jobs from Apache MapReduce to Apache Tez, a then-new DAG based execution engine.
Oath grid infrastructure is through and through driven by Apache technologies be it storage through HDFS, resource management through YARN, job execution frameworks with Tez and user interface engines such as Hive, Hue, Pig, Sqoop, Spark, Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs using the polymorphic technologies hosted, developed and maintained by the Apache community.
On the third day of my job at Yahoo in 2015, I received a YouTube link on An Introduction to Apache Tez. I watched it carefully trying to keep up with all the questions I had and recognized a few names from my academic readings of Yarn ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up on a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions and being ever more careful with whitespaces in my patches. One thing was clear – Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community nearly 80-90% of the Yahoo jobs were now running with Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect and thankfully contributing to Tez became a goal in my next quarter.
The next sprint planning meeting ended with me getting my first major Tez assignment – progress reporting. The progress reporting in Tez was non-existent – “Just needs an API fix,” I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many.
I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. I started attending the bi-weekly Tez community sync up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open parts of the code are the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, to the extent of which data structure we should pick and what a future user of Tez would take from it. In between the usual status updates, extensive knowledge transfers were made.
Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me.
We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousands of jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were now tracked through an umbrella JIRA, TEZ-3334, and the to-do list was long. I picked a few JIRAs and as I started reading through I realized, this is all new code I get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, the team took walks post lunch and discussed how to go about defining the API. Countless hours were spent debugging hangs while fetching data and looking at stack traces and Wireshark captures from our test runs. Six months in and we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration with high fives as we continued to address review comments and fix big and small issues with this evolving feature.
As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig community and cross Apache product community interaction made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes.
In 2018 as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. As an astronomy aficionado, going from a newbie Apache contributor to a newbie Apache committer was very much like looking through my telescope - it has endless possibilities and challenges you to be your best.
About the Author:
Kuhu Shukla is a software engineer at Oath and did her Masters in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN and HDFS with a lot of talented Apache PMCs and Committers in Champaign, Illinois. A recent Apache Tez Committer herself, she continues to contribute to YARN and HDFS and spoke at the 2017 Dataworks Hadoop Summit on “Tez Shuffle Handler: Shuffling At Scale With Apache Hadoop”. Prior to that she worked on Juniper Networks’ router and switch configuration APIs. She likes to participate in open source conferences and women in tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking and peering through her Dobsonian telescope.
Cue the lights! Cue the music! Picademy is back for another year stateside. We’re excited to bring our free computer science and digital making professional development program for educators to four new cities this summer — you can apply right now.
We’re thrilled to kick off our 2018 season! Before we get started, let’s take a look back at our community’s accomplishments in the 2017 Picademy North America season.
Picademy 2017 highlights
Last year, we partnered with four awesome venues to host eight Picademy events in the United States. At every event across the country, we met incredibly talented educators passionate about bringing digital making to their learners. Whether it was at Ann Arbor District Library’s makerspace, UC Irvine’s College of Engineering, or a creative community center in Boise, Idaho, we were truly inspired by all our Picademy attendees and were thrilled to welcome them to the Raspberry Pi Certified Educator community.
JWU Providence’s College of Engineering & Design recently partnered with the Raspberry Pi Foundation to host Picademy, a free training session designed to give educators the tools to teach computer skills with confidence and creativity. | http://www.jwu.edu
The 2017 Picademy cohorts were a diverse bunch with a lot of experience in their field. We welcomed more than 300 educators from 32 U.S. states and 10 countries. They were a mix of high school, middle school, and elementary classroom teachers, librarians, museum staff, university lecturers, and teacher trainers. More than half of our attendees were teaching computer science or technology already, and over 90% were specifically interested in incorporating physical computing into their work.
Picademy has a strong and lasting impact on educators. Over 80% of graduates said they felt confident using Raspberry Pi after attending, and 88% said they were now interested in leading a digital making event in their community. To showcase two wonderful examples of this success: Chantel Mason led a Raspberry Pi workshop for families and educators in her community in St. Louis, Missouri this fall, and Dean Palmer led a digital making station at the Computer Science for Rhode Island Summit in December.
Picademy 2018 dates
This year, we’re partnering with four new venues to host our Picademy season.
Another new year brings with it thoughts of setting goals and targets. Thankfully, there is a new issue of Hello World packed with practical advice to set you on the road to success.
Hello World is our magazine about computing and digital making for educators, and it’s a collaboration between the Raspberry Pi Foundation and Computing at School, which is part of the British Computer Society.
In issue 4, our international panel of educators and experts recommends approaches to continuing professional development in computer science education.
Approaches to professional development, and much more
With recommendations for more professional development in the Royal Society’s report, and government funding to support this, our cover feature explores some successful approaches. In addition, the issue is packed with other great resources, guides, features, and lesson plans to support educators.
Highlights include:
The Royal Society: After the Reboot — learn about the latest report and its findings about computing education
The Cyber Games — a new programme looking for the next generation of security experts
Engaging Students with Drones
Digital Literacy: Lost in Translation?
Object-oriented Coding with Python
Get your copy of Hello World 4
Hello World is available as a free Creative Commons download for anyone around the world who is interested in computer science and digital making education. You can get the latest issue as a PDF file straight from the Hello World website.
Thanks to the very generous sponsorship of BT, we are able to offer free print copies of the magazine to serving educators in the UK. It’s for teachers, Code Club volunteers, teaching assistants, teacher trainers, and others who help children and young people learn about computing and digital making. So remember to subscribe to have your free print magazine posted directly to your home — 6000 educators have already signed up to receive theirs!
Could you write for Hello World?
By sharing your knowledge and experience of working with young people to learn about computing, computer science, and digital making in Hello World, you will help inspire others to get involved. You will also help bring the power of digital making to more and more educators and learners.
The computing education community is full of people who lend their experience to help colleagues. Contributing to Hello World is a great way to take an active part in this supportive community, and you’ll be adding to a body of free, open-source learning resources that are available for anyone to use, adapt, and share. It’s also a tremendous platform to broadcast your work: Hello World digital versions alone have been downloaded more than 50000 times!