Homes in Madrid, Dublin, Cardiff, Ljubljana, and Leuven are participating in the Citizens Observing UrbaN Transport (WeCount) project, a European Commission–funded research project investigating sustainable economic growth.
1,500 Raspberry Pi traffic sensors will be distributed to homes in the five cities to gather data on traffic conditions. Every hour, the devices will upload information to publicly accessible cloud storage. The team behind WeCount says:
Following this approach, we will be able to quantify local road transport (cars, heavy goods vehicles, active travel modes, and speed), produce scientific knowledge in the field of mobility and environmental pollution, and co-design informed solutions to tackle a variety of road transport challenges.
“With air pollution being blamed for 500,000 premature deaths across the continent in 2018,” states a BBC News article about the project, “the experts running the survey hope their results can be used to make cities healthier places to live.” Says the WeCount team:
[T]he project will provide cost-effective data for local authorities, at a far greater temporal and spatial scale than what would be possible in classic traffic counting campaigns, thereby opening up new opportunities for transportation policy making and research.
The small form factor and low cost of Raspberry Pi mean it’s the ideal brain for citizen science projects across the globe, including our own Raspberry Pi Oracle Weather Station.
Another wonderful Raspberry Pi–powered citizen science project is Penguin Watch, which asks the public to, you guessed it, watch penguins. Time-lapse footage — obtained in the Antarctic by Raspberry Pi Camera Modules connected to Raspberry Pi Zeros — is uploaded to the Penguin Watch website, and anyone in the world can go online to highlight penguins in the footage, helping the research team to monitor the penguin population in these locations.
Setting up. Credit: Alasdair Davies, ZSL
Penguin Watch is highly addictive and it’s for a great cause, so be sure to check it out.
Hey folks, Rob here with good news about the latest edition of The MagPi! Issue 71, out right now, is all about running Android on Raspberry Pi with the help of emteria.OS and Android Things.
Android and Raspberry Pi, two great tastes that go great together!
Android and Raspberry Pi
A big part of our main feature looks at emteria.OS, a version of Android that runs directly on the Raspberry Pi. By running it on a touchscreen setup, you can use your Pi just like an Android tablet — one that’s easily customisable and hackable for all your embedded computing needs. Inside the issue, we’ve got a special emteria.OS discount code for readers.
We also look at Android Things, the official Android release for Raspberry Pi that focuses on IoT applications, and we show you some of the amazing projects that have been built with it.
On top of that, we’ve included guides on how to get started with TensorFlow AI and on building an oscilloscope.
We really loved this card scanning project! Read all about it in issue 71.
All this, along with our usual varied selection of project showcases, excellent tutorials, and definitive reviews!
Get The MagPi 71
You can get The MagPi 71 today from WHSmith, Tesco, Sainsbury’s, and Asda. If you live in the US, head over to your local Barnes & Noble or Micro Center in the next few days for a print copy. You can also get the new issue online from our store, or digitally via our Android or iOS apps. And don’t forget, there’s always the free PDF as well.
New subscription offer!
Want to support the Raspberry Pi Foundation and the magazine? We’ve launched a new way to subscribe to the print version of The MagPi: you can now take out a monthly £4 subscription to the magazine, effectively creating a rolling pre-order system that saves you money on each issue.
You can also take out a twelve-month print subscription and get a Pi Zero W plus case and adapter cables absolutely free! This offer does not currently have an end date.
One of the most common enquiries I receive at Pi Towers is “How can I get my hands on a Raspberry Pi Oracle Weather Station?” Now the answer is: “Why not build your own version using our guide?”
Tadaaaa! The BYO weather station fully assembled.
Our Oracle Weather Station
In 2016 we sent out nearly 1000 Raspberry Pi Oracle Weather Station kits to schools around the world that had applied to be part of our weather station programme. The original kit included a special HAT that allows the Pi to collect weather data from a set of sensors.
The original Raspberry Pi Oracle Weather Station HAT
We designed the HAT to enable students to create their own weather stations and mount them at their schools. As part of the programme, we also provide an ever-growing range of supporting resources. We’ve seen Oracle Weather Stations in great locations with huge differences in climate, and they’ve even recorded the effects of a solar eclipse.
Our new BYO weather station guide
We only had a single batch of HATs made, and unfortunately we’ve given nearly* all the Weather Station kits away. Not only are the kits really popular, we also receive lots of questions about how to add extra sensors or how to take more precise measurements of a particular weather phenomenon. So today, to satisfy your demand for a hackable weather station, we’re launching our Build your own weather station guide!
Fun with meteorological experiments!
Our guide suggests the use of many of the sensors from the Oracle Weather Station kit, so you can build a station that’s as close as possible to the original. As you know, the Raspberry Pi is incredibly versatile, and we’ve made it easy to hack the design in case you want to use different sensors.
Many other tutorials for Pi-powered weather stations don’t explain how the various sensors work or how to store your data. Ours goes into more detail. It shows you how to put together a breadboard prototype, it describes how to write Python code to take readings in different ways, and it guides you through recording these readings in a database.
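To give a flavour of what the “take a reading, store it in a database” loop looks like, here’s a minimal sketch. It is not the guide’s actual code: the read_sensor() function is a placeholder for whichever sensor you end up using, and the database, table, and column names are assumptions chosen for illustration.

```python
import time
import random   # placeholder only; swap for a real sensor library
import pymysql  # MariaDB/MySQL client: pip install pymysql

def read_sensor():
    # Placeholder: replace with a real reading from your chosen sensor,
    # e.g. temperature from a BME280 over I2C.
    return round(random.uniform(10.0, 25.0), 2)

# Hypothetical database credentials and schema for illustration
conn = pymysql.connect(host="localhost", user="pi", password="raspberry",
                       database="weather")

try:
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS readings ("
            "  id INT AUTO_INCREMENT PRIMARY KEY,"
            "  recorded TIMESTAMP DEFAULT CURRENT_TIMESTAMP,"
            "  temperature FLOAT)"
        )
    while True:
        temperature = read_sensor()
        with conn.cursor() as cur:
            cur.execute("INSERT INTO readings (temperature) VALUES (%s)",
                        (temperature,))
        conn.commit()
        time.sleep(60)  # one reading per minute
finally:
    conn.close()
```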
There’s also a section on how to make your station weatherproof. And in case you want to move past the breadboard stage, we also help you with that. The guide shows you how to solder together all the components, similar to the original Oracle Weather Station HAT.
Who should try this build
We think this is a great project to tackle at home, at a STEM club, Scout group, or CoderDojo, and we’re sure that many of you will be chomping at the bit to get started. Before you do, please note that we’ve designed the build to be as straightforward as possible, but it’s still fairly advanced both in terms of electronics and programming. You should read through the whole guide before purchasing any components.
The sensors and components we’re suggesting balance cost, accuracy, and ease of use. Depending on what you want to use your station for, you may wish to use different components. Similarly, the final soldered design in the guide may not be the most elegant, but we think it is achievable for someone with modest soldering experience and basic equipment.
You can build a functioning weather station without soldering with our guide, but the build will be more durable if you do solder it. If you’ve never tried soldering before, that’s OK: we have a Getting started with soldering resource plus video tutorial that will walk you through how it works step by step.
For those of you who are more experienced makers, there are plenty of different ways to put the final build together. We always like to hear about alternative builds, so please post your designs in the Weather Station forum.
Our plans for the guide
Our next step is publishing supplementary guides for adding extra functionality to your weather station. We’d love to hear which enhancements you would most like to see! Our current ideas under development include adding a webcam, making a tweeting weather station, adding a light/UV meter, and incorporating a lightning sensor. Let us know which of these is your favourite, or suggest your own amazing ideas in the comments!
*We do have a very small number of kits reserved for interesting projects or locations: a particularly cool experiment, a novel idea for how the Oracle Weather Station could be used, or places with specific weather phenomena. If you have such a project in mind, please send a brief outline to [email protected], and we’ll consider how we might be able to help you.
The German charity Save Nemo works to protect coral reefs, and they are developing Nemo-Pi, an underwater “weather station” that monitors ocean conditions. Right now, you can vote for Save Nemo in the Google.org Impact Challenge.
Save Nemo
The organisation says there are two major threats to coral reefs: divers and climate change. To make diving safer for reefs, Save Nemo installs buoy anchor points where diving tour boats can anchor without damaging corals in the process.
In addition, they provide dos and don’ts for how to behave on a reef dive.
The Nemo-Pi
To monitor the effects of climate change, and to help divers decide whether conditions are right at a reef while they’re still on shore, Save Nemo is also in the process of perfecting Nemo-Pi.
This Raspberry Pi-powered device is made up of a buoy, a solar panel, a GPS device, a Pi, and an array of sensors. Nemo-Pi measures water conditions such as current, visibility, temperature, carbon dioxide and nitrogen oxide concentrations, and pH. It also uploads its readings live to a public webserver.
The Save Nemo team is currently doing long-term tests of Nemo-Pi off the coast of Thailand and Indonesia. They are also working on improving the device’s power consumption and durability, and testing prototypes with the Raspberry Pi Zero W.
The web dashboard showing live Nemo-Pi data
Long-term goals
Save Nemo aims to install a network of Nemo-Pis at shallow reefs (up to 60 metres deep) in South East Asia. Then diving tour companies can check the live data online and decide day-to-day whether tours are feasible. This will lower the impact of humans on reefs and help the local flora and fauna survive.
A healthy coral reef
Nemo-Pi data may also be useful for groups lobbying for reef conservation, and for scientists and activists who want to shine a spotlight on the awful effects of climate change on sea life, such as coral bleaching caused by rising water temperatures.
A bleached coral reef
Vote now for Save Nemo
If you want to help Save Nemo in their mission today, vote for them to win the Google.org Impact Challenge:
Click “Abstimmen” in the footer of the page to vote
Click “JA” in the footer to confirm
Voting is open until 6 June. You can also follow Save Nemo on Facebook or Twitter. We think this organisation is doing valuable work, and that their projects could be expanded to reefs across the globe. It’s fantastic to see the Raspberry Pi being used to help protect ocean life.
In today’s guest post, seventh-grade students Evan Callas, Will Ross, Tyler Fallon, and Kyle Fugate share their story of using the Raspberry Pi Oracle Weather Station in their Innovation Lab class, headed by Raspberry Pi Certified Educator Chris Aviles.
United Nations Sustainable Goals
The past couple of weeks in our Innovation Lab class, our teacher, Mr Aviles, has challenged us students to design a project that helps solve one of the United Nations Sustainable Goals. We chose Climate Action. Innovation Lab is a class that gives students the opportunity to learn about where the crossroads of technology, the environment, and entrepreneurship meet. Everyone takes their own paths in innovation and learns about the environment using project-based learning.
Raspberry Pi Oracle Weather Station
For our climate change challenge, we decided to build a Raspberry Pi Oracle Weather Station. Tackling the issues of climate change in a way that helps our community stood out to us because we knew that with the help of this weather station we could send the local data to farmers and fishermen in town. Recent changes in climate have been affecting farmers’ crops. Unexpected rain, heat, and other unusual weather patterns can completely destabilize the natural growth of the plants and destroy their crops altogether. The amount of labour output needed by farmers has also significantly increased, forcing farmers to grow more food with fewer resources. By using our Raspberry Pi Oracle Weather Station to alert local farmers, they can be more prepared and aware of the weather, leading to better crops and safer boating.
Growing teamwork and coding skills
The process of setting up our weather station was fun and simple. Raspberry Pi made the instructions very easy to understand and read, which was very helpful for our team who had little experience in coding or physical computing. We enjoyed working together as a team and were happy to be growing our teamwork skills.
Once we constructed and coded the weather station, we learned that we needed to support the station with PVC pipes. After we completed these steps, we brought the weather station up to the roof of the school and began collecting data. Our information is currently being sent to the Initial State dashboard so that we can share the information with anyone interested. This information will also be recorded and seen by other schools, businesses, and others from around the world who are using the weather station. For example, we can see the weather in countries such as France, Greece and Italy.
Raspberry Pi allows us to build these amazing projects that help us to enjoy coding and physical computing in a fun, engaging, and impactful way. We picked climate change because we care about our community and would like to make a substantial contribution to our town, Fair Haven, New Jersey. It is not every day that kids are given these kinds of opportunities, and we are very lucky and grateful to go to a school and learn from a teacher where these opportunities are given to us. Thanks, Mr Aviles!
To see more awesome projects by Mr Aviles’ class, you can keep up with him on his blog and follow him on Twitter.
As they sail aboard their floating game design studio Pino, Rekka Bellum and Devine Lu Linvega are starting to explore the use of Raspberry Pis. As part of an experimental development tool and a weather station, Pis are now aiding them on their nautical adventures!
Pino is on its way to becoming a smart sailboat! Raspberry Pi is the ideal device for sailors, we hope to make many more projects with it. Also the projects continue still, but we have windows now yay!
Barometer
Using a haul of Pimoroni tech including the Enviro pHAT, Scroll pHAT HD, and Mini Black HAT Hack3r, Rekka and Devine have been experimenting with using a Raspberry Pi Zero as an onboard barometer for their sailboat. On their Hundred Rabbits YouTube channel and website, the pair has documented their experimental setups. They have also built another Raspberry Pi rig for distraction-free work and development.
The official Raspberry Pi 7″ touch display, a Raspberry Pi 3B+, a Pimoroni Blinkt, and a Poker II Keyboard make up Pino’s experimental development station.
“The Pi computer is currently used only as an experimental development tool aboard Pino, but could readily be turned into a complete development platform, would our principal computers fail,” they explain, before going into the build process for the Raspberry Pi–powered barometer.
The use of solderless headers makes this weather station an ideal build wherever space and tools are limited.
The barometer uses the sensor power of the Pimoroni Enviro pHAT to measure atmospheric pressure, and a Raspberry Pi Zero displays this data on the Scroll pHAT HD. It thus advises the two travellers of oncoming storms. By taking advantage of the solderless header provided by the Sheffield-based pirates, the Hundred Rabbits team was able to put the device together with relative ease. They provide all information for the build here.
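The Hundred Rabbits write-up has the pair’s full code; as a rough idea of the approach, a pressure readout along these lines can be put together with Pimoroni’s Python libraries. This is only a sketch based on the libraries’ documented calls, not their actual script, and the unit argument may vary between library versions.

```python
import time
from envirophat import weather   # Pimoroni Enviro pHAT library
import scrollphathd              # Pimoroni Scroll pHAT HD library

while True:
    # Barometric pressure in hectopascals; if your library version lacks
    # the unit argument, take the default Pa reading and divide by 100.
    pressure = weather.pressure(unit='hPa')

    scrollphathd.clear()
    scrollphathd.write_string("{:.0f}hPa ".format(pressure), brightness=0.5)

    # Scroll the reading across the display for a few seconds
    for _ in range(50):
        scrollphathd.scroll()
        scrollphathd.show()
        time.sleep(0.05)

    time.sleep(60)  # update once a minute
```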
This is us, this what we do, and these are our intentions! We live, and work from our sailboat Pino. Traveling helps us stay creative, and we feed what we see back into our work. We make games, art, books and music under the studio name ‘Hundred Rabbits.’
As we head into 2018 and start looking forward to longer days in the Northern hemisphere, I thought I’d take a look back at last year’s weather using data from Raspberry Pi Oracle Weather Stations. One of the great things about the kit is that as well as uploading all its readings to the shared online Oracle database, it stores them locally on the Pi in a MySQL or MariaDB database. This means you can use the power of SQL queries coupled with Python code to do automatic data analysis.
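As a taste of what that looks like, a query along these lines pulls monthly rainfall summaries out of the station’s local database. The database, table, and column names below are assumptions based on my recollection of the standard Weather Station schema, so check them against your own installation before relying on the results.

```python
import pymysql  # MariaDB/MySQL client: pip install pymysql

# Credentials and schema are assumptions -- adjust for your own station.
conn = pymysql.connect(host="localhost", user="pi", password="raspberry",
                       database="weather")

query = """
    SELECT MONTH(CREATED) AS month,
           SUM(RAINFALL) AS total_rainfall,
           SUM(RAINFALL) / COUNT(DISTINCT DATE(CREATED)) AS avg_daily_rainfall
    FROM WEATHER_MEASUREMENT
    WHERE YEAR(CREATED) = 2017
    GROUP BY MONTH(CREATED)
    ORDER BY total_rainfall DESC
"""

with conn.cursor() as cur:
    cur.execute(query)
    for month, total, avg_daily in cur.fetchall():
        print("Month {:2d}: total {:.1f} mm, average {:.2f} mm per day".format(
            month, total, avg_daily))

conn.close()
```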
Soggy Surrey
My Weather Station has only been installed since May, so I didn’t have a full 52 weeks of my own data to investigate. Still, my station recorded more than 70000 measurements. Living in England, the first thing I wanted to know was: which was the wettest month? Unsurprisingly, both in terms of average daily rainfall and total rainfall, the start of the summer period — exactly when I went on a staycation — was the soggiest:
What about the global Weather Station community?
Even soggier Bavaria
Here things get slightly trickier. Although we have a shiny Oracle database full of all participating schools’ sensor readings, some of the data needs careful interpretation. Many kits are used as part of the school curriculum and do not always record genuine outdoor conditions. Nevertheless, it appears that Adalbert Stifter Gymnasium in Bavaria, Germany, had an even wetter 2017 than my home did:
The records from Robert-Dannemann Schule in Westerstede, Germany, are a good example of data which was most likely collected while testing and investigating the weather station sensors, rather than in genuine external conditions. Unless this school’s Weather Station was transported to a planet which suffers from extreme hurricanes, it wasn’t actually subjected to wind speeds above 1000km/h in November. Dismissing these and all similarly suspect records, I decided to award the ‘Windiest location of the year’ prize to CEIP Noalla-Telleiro, Spain.
This school is right on the coast, and is subject to some strong and squally weather systems.
Weather Station at CEIP Noalla-Telleiro
They’ve mounted their wind vane and anemometer nice and high, so I can see how they were able to record such high wind velocities.
A couple of Weather Stations have recently been commissioned in equally exposed places — it will be interesting to see whether they will record even higher speeds during 2018.
Highs and lows
After careful analysis and a few disqualifications (a couple of Weather Stations in contention for this category were housed indoors), the ‘Hottest location’ award went to High School of Chalastra in Thessaloniki, Greece. There were a couple of Weather Stations (the one at The Marwadi Education Foundation in India, for example) that reported higher average temperatures than Chalastra’s 24.54 ºC. However, they had uploaded far fewer readings and their data coverage of 2017 was only partial.
At the other end of the thermometer, the location with the coldest average temperature was École de la Rose Sauvage in Calgary, Canada, with a very chilly 9.9 ºC.
Weather Station at École de la Rose Sauvage
I suspect this school has a good chance of retaining the title: their lowest 2017 temperature of -24 ºC is likely to be beaten in 2018 due to extreme weather currently bringing a freezing start to the year in that part of the world.
If you have an Oracle Raspberry Pi Weather Station and would like to perform an annual review of your local data, you can use this Python script as a starting point. It will display a monthly summary of the temperature and rainfall for 2017, and you should be able to customise the code to focus on other sensor data or on a particular time of year. We’d love to see your results, so please share your findings with [email protected], and we’ll send you some limited-edition Weather Station stickers.
Since we launched the Oracle Weather Station project, we’ve collected more than six million records from our network of stations at schools and colleges around the world. Each one of these records contains data from ten separate sensors — that’s over 60 million individual weather measurements!
Weather station measurements in Oracle database
Weather data collection
Having lots of data covering a long period of time is great for spotting trends, but to do so, you need some way of visualising your measurements. We’ve always had great resources like Graphing the weather to help anyone analyse their weather data.
And from now on it’s going to be even easier for our Oracle Weather Station owners to display and share their measurements. I’m pleased to announce a new partnership with our friends at Initial State: they are generously providing a white-label platform to which all Oracle Weather Station recipients can stream their data.
Using Initial State
Initial State makes it easy to create vibrant dashboards that show off local climate data. The service is perfect for having your Oracle Weather Station data on permanent display, for example in the school reception area or on the school’s website.
But that’s not all: the Initial State toolkit includes a whole range of easy-to-use analysis tools for extracting trends from your data. Distribution plots and statistics are just a few clicks away!
Looks like Auntie Beryl is right — it has been a damp old year! (Humidity value distribution May–Nov 2017)
The wind direction data from my Weather Station supports my excuse as to why I’ve not managed a high-altitude balloon launch this year: to use my launch site, I need winds coming from the east, and those have been in short supply.
Chart showing wind direction over time
Initial State credentials
Every Raspberry Pi Oracle Weather Station school will shortly be receiving the credentials needed to start streaming their data to Initial State. If you’re super keen though, please email [email protected] with a photo of your Oracle Weather Station, and I’ll let you jump the queue!
The Initial State folks are big fans of Raspberry Pi and have a ton of Pi-related projects on their website. They even included shout-outs to us in the music video they made to celebrate the publication of their 50th tutorial. Can you spot their weather station?
Your home-brew weather station
If you’ve built your own Raspberry Pi–powered weather station and would like to dabble with the Initial State dashboards, you’re in luck! The team at Initial State is offering 14-day trials for everyone. For more information on Initial State, and to sign up for the trial, check out their website.
When James Puderer moved to Lima, Peru, his roadside runs left a rather nasty taste in his mouth. Hit by the pollution from old diesel cars in the area, he decided to monitor the air quality in his new city using Raspberry Pis and the abundant taxis as his tech carriers.
How to assemble the enclosure for my Taxi Datalogger project: https://www.hackster.io/james-puderer/distributed-air-quality-monitoring-using-taxis-69647e
Sensing air quality in Lima
Luckily for James, almost all taxis in Lima are equipped with the standard hollow vinyl roof sign seen in the video above, which makes them ideal for hacking.
With the onboard tech, the device collects data on longitude, latitude, humidity, temperature, pressure, and airborne particle count, feeding it back to an Android Things datalogger. This data is then pushed to Google IoT Core, where it can be remotely accessed.
Next, the data is processed by Google Dataflow and turned into a BigQuery table. Users can then visualize the collected measurements. And while James uses Google Maps to analyse his data, there are many tools online that will allow you to organise and study your figures depending on what final result you’re hoping to achieve.
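To illustrate the “query the BigQuery table” step, something like the snippet below would pull average particle counts per rough location; the project, dataset, table, and column names are invented for the example and won’t match James’s actual schema.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses your default GCP project and credentials

# Table and column names are hypothetical -- adjust to your own schema.
query = """
    SELECT ROUND(latitude, 3)  AS lat_bin,
           ROUND(longitude, 3) AS lon_bin,
           AVG(particle_count) AS avg_particles
    FROM `my-project.air_quality.taxi_readings`
    GROUP BY lat_bin, lon_bin
    ORDER BY avg_particles DESC
    LIMIT 20
"""

for row in client.query(query).result():
    print(row.lat_bin, row.lon_bin, round(row.avg_particles, 1))
```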
James hopped in a taxi and took his monitor on the road, collecting results throughout the journey
James has provided the complete build process, including all tech ingredients and code, on his Hackster.io project page, and urges makers to create their own air quality monitor for their local area. He also plans on building upon the existing design by adding a 12V power hookup for connecting to the taxi, functioning lights within the sign, and companion apps for drivers.
Sensing the world around you
We’ve seen a wide variety of Raspberry Pi projects using sensors to track the world around us, such as Kasia Molga’s Human Sensor costume series, which reacts to air pollution by lighting up, and Clodagh O’Mahony’s Social Interaction Dress, which she created to judge how conversation and physical human interaction can be scored and studied.
Kasia Molga’s Human Sensor — a collection of hi-tech costumes that react to air pollution within the wearer’s environment.
Many people also build their own Pi-powered weather stations, or use the Raspberry Pi Oracle Weather Station, to measure and record conditions in their towns and cities from the roofs of schools, offices, and homes.
Have you incorporated sensors into your Raspberry Pi projects? Share your builds in the comments below or via social media by tagging us.
Did you realise the Sense HAT has been available for over two years now? Used by astronauts on the International Space Station, the exact same hardware is available to you on Earth. With a new Astro Pi challenge just launched, it’s time for a retrospective/roundup/inspiration post about this marvellous bit of kit.
The Sense HAT on a Pi in full glory
The Sense HAT explained
We developed our scientific add-on board to be part of the Astro Pi computers we sent to the International Space Station with ESA astronaut Tim Peake. For a play-by-play of Astro Pi’s history, head to the blog archive.
Just to remind you, this is all the cool stuff our engineers have managed to fit onto the HAT: an 8×8 RGB LED matrix, a five-button joystick, and sensors for temperature, humidity, barometric pressure, and orientation (an accelerometer, gyroscope, and magnetometer).
Use the LED matrix and joystick to recreate games such as Pong or Flappy Bird. Of course, you could also add sensor input to your game: code an egg drop game or a Magic 8 Ball that reacts to how the device moves.
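If you’ve not played with the joystick and LED matrix before, a few lines of Python are enough to get a pixel moving around the display, which is the kernel of a Pong or Flappy Bird clone. This is just a minimal starting sketch, not one of the projects linked above.

```python
from sense_hat import SenseHat

sense = SenseHat()
x, y = 3, 3  # start roughly in the middle of the 8x8 matrix

sense.clear()
sense.set_pixel(x, y, 255, 255, 255)

while True:
    event = sense.stick.wait_for_event()
    if event.action != "pressed":
        continue
    sense.set_pixel(x, y, 0, 0, 0)  # erase the old position
    if event.direction == "up":
        y = max(0, y - 1)
    elif event.direction == "down":
        y = min(7, y + 1)
    elif event.direction == "left":
        x = max(0, x - 1)
    elif event.direction == "right":
        x = min(7, x + 1)
    sense.set_pixel(x, y, 255, 255, 255)
```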
If you like the great outdoors, you could also use your Sense HAT to recreate this Hiking Companion by Marcus Johnson. Take it with you on your next hike!
It’s also possible to incorporate Sense HAT data into your digital art! The Python Turtle module and the Processing language are both useful tools for creating beautiful animations based on real-world information.
A Sense HAT project that also uses this principle is Giorgio Sancristoforo’s Tableau, a ‘generative music album’. This device creates music according to the sensor data:
“There is no doubt that, as music is removed by the phonograph record from the realm of live production and from the imperative of artistic activity and becomes petrified, it absorbs into itself, in this process of petrification, the very life that would otherwise vanish.”
Our online resource shows you how to record the information your HAT picks up. Next you can analyse and graph your data using Mathematica, which is included for free on Raspbian. This resource walks you through how this software works.
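If you just want the gist of the recording step, logging the environmental sensors to a CSV file looks roughly like this; the resource linked above covers it properly, and the column layout here is only an example.

```python
import csv
import time
from datetime import datetime
from sense_hat import SenseHat

sense = SenseHat()

with open("sense_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "temperature_c", "humidity_pct",
                     "pressure_mbar"])
    while True:
        writer.writerow([
            datetime.now().isoformat(),
            round(sense.get_temperature(), 2),
            round(sense.get_humidity(), 2),
            round(sense.get_pressure(), 2),
        ])
        f.flush()      # make sure each row hits the disk
        time.sleep(10)  # one reading every ten seconds
```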
If you’re seeking inspiration for experiments you can do on our Astro Pis Izzy and Ed on the ISS, check out the winning entries of previous rounds of the Astro Pi challenge.
Thomas Pesquet with Ed and Izzy
But you can also stick to terrestrial scientific investigations. For example, why not build a weather station and share its data on your own web server or via Weather Underground?
Your code in space!
If you’re a student or an educator in one of the 22 ESA member states, you can get a team together to enter our 2017-18 Astro Pi challenge. There are two missions to choose from, including Mission Zero: follow a few guidelines, and your code is guaranteed to run in space!
As everyone knows, one of the problems with the weather is that it can be difficult to predict a long time in advance. In the UK we’ve had stormy conditions for weeks but, of course, now that I’ve finished my lightning detector, everything has calmed down. If you’re planning to make scientific measurements of a particular phenomenon, patience is often required.
Wake STEM ECH get ready to safely observe the eclipse
In the path of the eclipse
Fortunately, this wasn’t a problem for Mr Burgess and his students at Wake STEM Early College High School in Raleigh, North Carolina, USA. They knew exactly when the event they were interested in studying was going to occur: they were going to use their Raspberry Pi Oracle Weather Station to monitor the progress of the 2017 solar eclipse.
Through the @Celestron telescope #Eclipse2017 @WCPSS via @stemburgess
Measuring the temperature drop
The Raspberry Pi Oracle Weather Stations are always active and recording data, so all the students needed to do was check that everything was connected and working. That left them free to enjoy the eclipse, and take some amazing pictures like the one above.
You can see from the data how the changes in temperature lag behind the solar events – this makes sense, as it takes a while for the air to cool down. When the sun starts to return, the temperature rise continues on its pre-eclipse trajectory.
Weather station data 21st Aug: the yellow bars mark the start and end of the eclipse, the red bar marks the maximum sun coverage.
Reading Mr Burgess’ description, I’m feeling rather jealous. Being in the path of the Eclipse sounds amazing: “In North Carolina we experienced 93% coverage, so a lot of sunlight was still shining, but the landscape took on an eerie look. And there was a cool wind like you’d experience at dusk, not at 2:30 pm on a hot summer day. I was amazed at the significant drop in temperature that occurred in a small time frame.”
Close up of data showing temperature drop as recorded by the Raspberry Pi Oracle Weather Station. The yellow bars mark the start and end of the eclipse, the red bar marks the maximum sun coverage.
Weather Station in the classroom
“I’ve been preparing for the solar eclipse for almost two years, with the weather station arriving early last school year. I did not think about temperature data until I read about citizen scientists on a NASA website,” explains Mr Burgess, who is now in his second year of working with the Raspberry Pi Oracle Weather Station. Around 120 ninth-grade students (ages 14-15) have been involved with the project so far. “I’ve found that students who don’t have a strong interest in meteorology find it interesting to look at real data and figure out trends.”
Wake STEM EC Raspberry Pi Oracle Weather Station installation
As many schools have discovered, Mr Burgess found that the biggest challenge with the Weather Station project “was finding a suitable place to install the weather station in a place that could get power and Ethernet”. To help with this problem, we’ve recently added two new guides to help with installing the wind sensors outside and using WiFi to connect the kit to the Internet.
Raspberry Pi Oracle Weather Station
If you want to keep up to date with all the latest Raspberry Pi Oracle Weather Station activities undertaken by our network of schools around the world, make sure you regularly check our weather station forum. Meanwhile, everyone at Wake STEM ECH is already starting to plan for their next eclipse on Monday, April 8, 2024. I wonder if they’d like some help with their Weather Station?
Following a post-Christmas decision to keep illuminated decorations on her stairway bannister throughout the year, Lorraine Underwood found a new purpose for a strip of NeoPixels she had lying around.
Changed the stair lights from a string to a strip & they look awesome! #neopixel #raspberrypi https://t.co/dksLwy1SE1
Simply running the lights up the stairs, blinking and flashing at random, wasn’t enough for her. By using an API to check the outdoor weather, Lorraine turned her lights from decorative to informative: they now give an indication of outside weather conditions through their colour and the quantity illuminated.
“The idea is that more lights will light up as it gets warmer,” Lorraine explains. “The temperature is checked every five minutes (I think that may even be a little too often). I am looking forward to walking downstairs to a nice warm yellow light instead of the current blue!”
In total, Lorraine had 240 lights in the strip; she created a chart mapping a range of outside temperatures to the quantity of lights lit for each value, as well as specifying the colour of those lights, running from chilly blue through to scorching red.
Oh, Lorraine! We love your optimistic dreams of the British summer being more than its usual rainy 16 Celsius…
The lights are controlled by a Raspberry Pi Zero running code that can be found on Lorraine’s blog. The code dictates which lights are lit and when.
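The general shape of that logic — poll a weather API, map the temperature to a number of pixels, light them — is sketched below. It uses the Adafruit CircuitPython NeoPixel library and a placeholder get_outdoor_temperature() function rather than Lorraine’s actual code, and the temperature-to-colour mapping is made up for illustration.

```python
import time
import board     # Adafruit Blinka
import neopixel  # Adafruit CircuitPython NeoPixel library

NUM_PIXELS = 240
pixels = neopixel.NeoPixel(board.D18, NUM_PIXELS, auto_write=False)

def get_outdoor_temperature():
    # Placeholder: call your preferred weather API here and return
    # the current outside temperature in Celsius.
    return 16.0

def colour_for(temp_c):
    # Rough mapping from chilly blue through to scorching red
    if temp_c < 5:
        return (0, 0, 255)
    elif temp_c < 15:
        return (0, 128, 255)
    elif temp_c < 22:
        return (255, 200, 0)
    return (255, 0, 0)

while True:
    temp = get_outdoor_temperature()
    # Warmer weather lights more of the strip (0 C -> none, 30 C -> all)
    lit = max(0, min(NUM_PIXELS, int(NUM_PIXELS * temp / 30.0)))
    pixels.fill((0, 0, 0))
    for i in range(lit):
        pixels[i] = colour_for(temp)
    pixels.show()
    time.sleep(300)  # check every five minutes
```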
“Do I need a coat today? I’ll check the stairs.”
Lorraine is planning some future additions to the build, including a toddler-proof 3D housing, powering the Zero from the lights’ power supply, and gathering her own temperature data instead of relying on a third-party API.
While gathering the temperature data from outside her house, she may also want to look into building an entire weather station, collecting extra data on rain, humidity, and wind conditions. After all, this is the UK: just because it’s hot outside, it doesn’t mean it’s not also raining.
By any measure, the Raspberry Pi Foundation had a fantastic 2016. We ended the year with over 11 million Raspberry Pi computers sold, millions of people using our learning resources, almost 1,000 Certified Educators in the UK and US, 75,000 children regularly attending over 5,000 Code Clubs in the UK, hundreds of Raspberry Jams taking place all over the world, code written by schoolkids running in space (yes, space), and much, much more.
Fantastic to see 5,000 active Code Clubs in the UK, helping over 75,000 young people learn to code. https://t.co/OyShrUzAhI @Raspberry_Pi https://t.co/luFj1qgzvQ
As I’ve said before, what we achieve is only possible thanks to the amazing community of makers, educators, volunteers, and young people all over the world who share our mission and support our work. You’re all awesome: thank you.
So here we are, just over a week into the New Year, and I thought it might be a good time to share with you some of what we’ve got planned for 2017.
Young digital makers
At the core of our mission is getting more young people excited about computing, and learning how to make things with computers. That was the original inspiration for the Raspberry Pi computer and it remains our number-one objective.
One of the ways we do that is through Code Club, a network of after-school clubs for 9- to 11-year-olds run by teachers and volunteers. It’s already one of the largest networks of after-school clubs in the world, and this year we’ll be working with our existing partners in Australia, Bangladesh, Brazil, Canada, Croatia, France, Hong Kong, New Zealand, and Ukraine, as well as finding more partners in more countries, to bring Code Club to many more children.
This year also sees the launch of Pioneers, our new programme for teen digital makers. It’s built around a series of challenges that will inspire young people to make things with technology and share their makes with the world. Check out the first challenge here, and keep watching the hashtag #MakeYourIdeas across your favourite social media platforms.
UPDATE – The first challenge is now LIVE. Head here for more information: https://www.youtube.com/watch?v=OCUzza7LJog
Woohoo! Get together, get inspired, and get thinking. We’re looking for Pioneers to use technology to make something awesome. Get together in a team or on your own, post online to show us how you’re getting on, and then show the world your build when you’re done.
We’re also expanding our space programme Astro Pi, with 250 teams across Europe currently developing code that will be run on the ISS by French ESA astronaut Thomas Pesquet. And, building on our Weather Station project, we’re excited to be developing new ideas for citizen science programmes that get more young people involved in computing.
British ESA astronaut Tim Peake is safely back on Earth now, but French ESA astronaut Thomas Pesquet is onboard the ISS, keen to see what students from all over Europe can do with the Astro Pi units too.
Supporting educators
Another big part of our work is supporting educators who are bringing computing and digital making into the classroom, and this year we’re going to be doing even more to help them.
We’ll continue to grow our community of official Raspberry Pi Certified Educators, with Picademy training programmes in the UK and US. Watch out for those dates coming soon. We’re also opening up our educator training to a much wider audience through a series of online courses in partnership with FutureLearn. The first two courses are open for registration now, and we’ve got plans to develop and run more courses throughout the year, so if you’re an educator, let us know what you would find most useful.
We’re also really excited to be launching a brand-new free resource for educators later this month in partnership with CAS, the grass-roots network of computing educators. For now, it’s top-secret, but if you’re in the Bett Arena on 25 January, you’ll be the first to hear all about it.
Free educational resources
One of the most important things we do at Pi Towers is create the free educational resources that are used in Code Clubs, STEM clubs, CoderDojos, classrooms, libraries, makerspaces, and bedrooms by people of all ages learning about computing and digital making. We love making these resources and we know that you love using them. This year, we want to make them even more useful.
As a first step, later this month we will share our digital making curriculum, which explains how we think about learning and progression, and which provides the structure for our educational resources and programmes. We’re publishing it so that we can get feedback to make it better, but we also hope that it will be used by other organisations creating educational resources.
We’re also working hard behind the scenes to improve the content and presentation of our learning resources. We want to include more diverse content like videos, make it easier for users to track their own progress, and generally make the experience more interactive and social. We’re looking forward to sharing that work and getting your feedback over the next few months.
Community
Last, but by no means least, we will continue to support and grow the community around our mission. We’ll be doing even more outreach, with ever more diverse groups, and doing much more to support the Raspberry Jam organisers and others who do so much to involve people in the digital making movement.
The other big community news is that we will be formally establishing ourselves as a charity in the US, which will provide the foundation (see what I did there?) for a serious expansion of our charitable activities and community in North America.
As you can see, we’ve got big plans for the year. Let me know what you think in the comments below and, if you’re excited about the mission, there’s lots of ways to get involved.
Nick Corbett is a Senior Consultant for AWS Professional Services
Many of our customers choose to build their data lake on AWS. They find the flexible, pay-as-you-go, cloud model is ideal when dealing with vast amounts of heterogeneous data. While some customers choose to build their own lake, many others are supported by a wide range of partner products.
Today, we are pleased to announce another choice for customers wanting to build their data lake on AWS: the data lake solution. The solution is provided as an AWS CloudFormation script that you can use out-of-the-box, or as a reference implementation that can be customized to meet your unique data management, search, and processing needs.
In this post, I introduce you to the solution and show you why a data lake on AWS can increase the flexibility and agility of your analytics.
Data lake overview
The concepts behind a data lake seem simple: securely store all your data in a raw format and apply a schema on read. Indeed, the first description of a data lake compared it to a ‘large body of water in a more natural state’, whereas a data mart could be thought of as a ‘store of bottled water – cleansed and packaged and structured for easy consumption’.
A data lake is a bet against the future – you don’t know what analysis you might want to do, so why not just keep everything, to give yourself the best chance of satisfying any requirement that comes along?
If you spend some time reading about data lakes, you quickly unearth another term: the data swamp. Some organisations find their lakes are filled with unregulated and unknown content. Preventing a data swamp might seem impossible―how do you collect every bit of data that your company generates and keep it organized? How will you ever find it again? How do you keep your data lake clean?
At Amazon, we use a working backwards process when developing our products and services. You start with your customer and work your way backwards until you get to the minimum product that satisfies what they are trying to achieve. Applying this process when you build your data lake is one way to focus your efforts on building a lake rather than a swamp.
When you build a data lake, your main customers are the business users that consume data and use it for analysis. The most important things your customers are trying to achieve are agility and innovation. You made a bet when you decided to store data in your lake; your customers are looking to cash it in quickly when they start their next project.
After your data lake is mature, it will undoubtedly feed several data marts such as reporting systems or enterprise data warehouses. Using the data lake as a source for specific business systems is a recognized best practice. However, if that is all you needed to do, you wouldn’t need a data lake.
Having a data lake comes into its own when you need to implement change, whether adapting an existing system or building a new one. These projects are built on an opportunity for competitive advantage and need to run as quickly as possible. Your data lake customers need to be agile. They want their projects to either quickly succeed or fail fast and cheaply.
The data lake solution on AWS has been designed to solve these problems by managing metadata alongside the data. You can use this to provide a rich description of the data you are storing. A data lake stores raw data, so the quality of the data you store will not always be perfect (if you take steps to improve the quality of your data, you are no longer storing raw data). However, if you use metadata to give visibility of where your data came from, its lineage, and its imperfections, you will have an organized data lake that your customers can use to quickly find the data they need for their projects.
Data lake concepts
The central concept of this data lake solution is a package. This is a container in which you can store one or more files. You can also tag the package with metadata so you can easily find it again.
For example, the data you need to store may come from a vast network of weather stations. Perhaps each station sends several files containing sensor readings every 5 minutes. In this case, you would build a package each time a weather station sends data. The package would contain all the sensor reading files and would be tagged with metadata, for example the location of each station, and the date and time on which the readings were taken. You can configure the data lake solution to require that all packages have certain metadata tags. This helps ensure that you maintain visibility on the data added to your lake.
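To make that concrete, the metadata for one such package might look something like the dictionary below. The tag names mirror the weather station example in the text rather than any schema imposed by the solution, which only enforces whatever governance you configure.

```python
# Hypothetical metadata for one weather-station package; all names
# here are illustrative only.
package_metadata = {
    "package-name": "station-0042-2016-11-11T14-05",
    "tags": {
        "station-id": "0042",
        "location": "Santa Fe, NM",
        "region": "Southwest",
        "reading-timestamp": "2016-11-11T14:05:00Z",
    },
    "files": [
        "temperature.csv",
        "wind.csv",
        "rainfall.csv",
    ],
}
```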
Installing and configuring the data lake
You can follow the instructions in the installation guide to install the data lake in your own AWS account by running a CloudFormation script. A high-level view of the server-side architecture that is built is shown below:
The architecture is serverless; you don’t need to manage any Amazon EC2 instances. All the data is stored in managed AWS services and the processing is implemented by a microservices layer written using AWS Lambda.
When you install the Data Lake Solution, you set yourself up as an administrator by providing your name and email address to the CloudFormation script. During the installation, an Amazon Cognito user pool is created. Your details are added to the pool and you’re sent an activation email. There’s also a link in the email to take you to your data lake web console. The data lake console was also installed into your account by the CloudFormation template; it is hosted as a static website in an Amazon S3 bucket.
After you’ve logged in to the data lake console as the administrator, your first task is to configure the governance that you’ll enforce on packages. By choosing Settings on the left and then the Governance tab, you can configure the minimum set of metadata tags that must be applied to all new packages.
In the diagram below, you can see the data lake configured to capture the example weather data. All packages must be tagged with the location, region, and date and time. When users create packages, they can always add their own extra tags to provide more context. You can also specify that tags are optional if you want to enforce conformity over the use of metadata that isn’t always present:
As an administrator, you can also create data lake accounts for other people at your organisation. Choose users on the left side to create extra administrators or user accounts. Users can’t change governance settings or create other user accounts.
After you’ve configured your data lake, you are ready to create your first package. You can do this by choosing Create a Package on the left side and filling in the fields:
You can see that the metadata tags specified in the governance settings are now required before you can create the package. After it has been created, you can add files to the package to build its contents. You can either upload files that are on your local machine or link to files that are already stored on S3:
In practice, if you are creating lots of packages, you wouldn’t want to create each one using the console. Instead, you can create packages using the data lake Command Line Interface (CLI) or directly against the REST API that is implemented using Amazon API Gateway.
Storing data
When you create a package, the data is stored in S3 and the metadata is stored in both Amazon DynamoDB and Amazon Elasticsearch Service (Amazon ES). Storing data in S3 has many advantages; you can securely store your data in any format, it is durable and highly scalable, and you pay only for the storage that you actually use. Having your data in S3 also provides integration with other services. For example, you can use your data in an Amazon EMR cluster, load it into an Amazon Redshift data warehouse, visualize it in Amazon QuickSight, or build a machine learning model with Amazon Machine Learning.
Leveraging the S3 integration with other tools is key to establishing an agile analytics environment. When a project comes along, you can provision data into the system that’s most suitable for the task in hand and the skills of the business users. For example, a user with SQL skills may want to analyze their data in Amazon Redshift or load it into Amazon Aurora from S3. Alternatively, a data scientist may want to analyze the data using R.
Processing data
Separating storage from processing can also help to reduce the cost of your data lake. Until you choose to analyze your data, you need to pay only for S3 storage. This model also makes it easier to attribute costs to individual projects. With the correct tagging policy in place, you can allocate the costs to each of your analytical projects based on the infrastructure that they consume. In turn, this makes it easy to work out which projects provide most value to your organization.
The data lake stores metadata in both DynamoDB and Amazon ES. DynamoDB is used as the system of record. Each change of metadata that you make is saved, so you have a complete audit trail of how your package has changed over time. You can see this on the data lake console by choosing History in the package view:
The latest version of a package’s metadata is also written to Amazon ES and is used to power the search, allowing you to quickly find data based on metadata tags. For example, you may need to find all the packages created for weather stations in the Southwest on November 11, 2016:
After you’ve found a package that you are interested in, you use the data lake solution like a shopping website and add it to your cart. Choosing Cart on the left shows its contents:
When you are happy with the contents of your cart, you can choose Generate Manifest to get access to the data in the packages. This creates a manifest file that contains either presigned URLs or the S3 bucket and key for each object. The presigned URL allows you to download a copy of the data.
However, creating a copy isn’t always the most efficient way forward. As an alternative, you can ask for the bucket and key where the object is stored in S3. It’s important to remember that you need access to an IAM user or role that has permissions to get data from this location. Like the package creation process, you can use the CLI or API to search for packages, add them to your cart, and generate a manifest file, allowing you to fully automate the retrieval of data from the lake.
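For readers unfamiliar with presigned URLs, here’s roughly what generating and using one looks like with boto3. This illustrates the mechanism rather than the data lake solution’s own manifest code, and the bucket and key are made up.

```python
import boto3
import urllib.request

s3 = boto3.client("s3")

# Bucket and key are hypothetical examples.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-data-lake-bucket",
            "Key": "packages/station-0042/temperature.csv"},
    ExpiresIn=3600,  # link is valid for one hour
)

# Anyone holding the URL can download the object until it expires,
# without needing their own AWS credentials.
urllib.request.urlretrieve(url, "temperature.csv")
```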
Summary
A successful data lake strikes a balance. Although a data lake makes it easy to contribute data and build a vast organisational archive, it never loses control over the information that’s ingested. Ultimately, the lake is built to serve its customers, the business users that need to get to relevant data quickly so that they can execute projects for maximum return on investment (ROI).
By equally managing both data and metadata, the data lake solution on AWS allows you to govern the contents of your data lake. By using Amazon S3, your data is kept in secure, durable, and low-cost storage. S3 integrates with a wealth of other AWS services and third-party tools so that data lake customers can provision the right tool for their tasks.
The data lake solution is available for you to start using today. We welcome the feedback on this new solution and you can join in the discussion by leaving a comment below or visiting the AWS Solutions Forum.
About the author
Nick Corbett is a Senior Consultant for AWS Professional Services. He works with our customers to provide leadership on big data projects, helping them shorten their time to value when using AWS. In his spare time, he follows the Jürgen Klopp revolution.
In my day, you were lucky if you had some broken Clackers and a half-sucked, flocculent gobstopper in your trouser pockets. But here I am, half a century later, watching a swarm of school pupils running around the playground with entire computers attached to them.
Or microcontrollers, at least. This was Eastlea Community School’s Technology Day, and Steph and I had been invited along by ICT and computing teacher Mr Richards, a long-term Raspberry Pi forum member and Pi enthusiast. The day was a whole school activity, involving 930 pupils and 100 staff, showcasing how computing and technology can be used across the curriculum. In the playground, PE students had designed and coded micro:bits to measure all manner of sporting metrics. In physics, they were investigating g-forces. In the ICT and computing rooms, whole cohorts were learning to code. This was really innovative stuff.
All ICT classrooms should have shelves like this
A highlight of the tour was Mr Richards’ classroom, stuffed with electronics, robots, and hacking goodness, and pupils coming and going. It was a really creative space. Impressively, there are Raspberry Pis permanently installed on every desk, which is just how we envisaged it: a normal classroom tool for digital making.
All this was amazing, and certainly the most impressive cross-curricular use of computing I’ve seen in a school. But having lived and breathed the Raspberry Pi Oracle weather station project for several months, I was really keen to see what they’d done with theirs. And it was a corker. Students from the computing club had built and set up the station in their lunch breaks, and installed it in a small garden area.
Mr Richards and the Eastlea Community School weather station team
Then they had hacked it, adding a solar panel, battery and WiFi. This gets round the problems of how to power the station and how to transfer data. The standard way is Power over Ethernet, which uses the same cable for power and data, but this is not always the optimal solution, depending on location. It’s not as simple as sticking a solar panel on a stick either. What happens when it’s cloudy? Will the battery recharge in winter? Mr Richards and his students have spent a lot of time investigating such questions, and it’s exactly the sort of problem-solving and engineering that we want to encourage. Also, we love hacking.
Not content with these achievements, they plan to add a camera to monitor wildlife and vegetation, perhaps tying it in with the weather data. They’re also hoping to install another weather station elsewhere, so that they can compare the data and investigate the school microclimate in more detail. The weather station itself will be used for teaching and learning this September.
Eastlea Community School’s weather station really is a showcase for the project, and we’d like to thank Mr Richards and his students for working so hard on it. If you want to learn more about solar panels and other hacks, then head over to our weather station forum.
Weather station update
The remaining weather station kits have started shipping to schools this week! We sent an email out recently for people to confirm delivery addresses, and if you’ve done this you should have yours soon. If you were offered a weather station last year and have not had an email from us in the last few weeks (early July), then please contact us immediately at [email protected].
We spotted this aquarium project on YouTube, and were struck with searing pangs of fishy jealousy; imagine having a 2000-litre slice of the Cayman Islands, complete with the weather as it is right now, in your living room.
aMGee has equipped his (enormous) tropical fish tank, full of corals as well as fish, with an IoT Raspberry Pi weather system. It polls a weather station in the Cayman Islands every two minutes and duplicates that weather in the tank: clouds; wind speed and direction; exact sunset and sunrise times; and moon phase, including the direction the moon travels across the tank.
The setup uses three 100 W and eighteen 20 W multi-chip LEDs, which are controlled separately by an Arduino that lives on top of the lamp. There’s also a web interface, just in case you feel like playing Thor.
DIY LED aquarium lighting project for my reef tank. The 660 watts fixture simulates the weather from Cayman Islands in real time. 3 x 100 watts and 18 x 20 watts multi-chip leds controlled separately by an arduino sitting on the lamp).
If you want to learn more, aMGee answers questions about the build (which, sadly, doesn’t have a how-to attached) at the Reef Central forums.
It’s a beautiful project, considerably less expensive (and more satisfying) than any off-the-shelf equivalent; and a really lovely demonstration of meaningful IoT. Thanks aMGee!
Veronika Megler, Ph.D., is a Senior Consultant with AWS Professional Services
We are surrounded by more and more sensors – some of which we’re not even consciously aware of. As sensors become cheaper and easier to connect, they create an increasing flood of data that’s getting cheaper and easier to store and process.
However, sensor readings are notoriously “noisy” or “dirty”. To produce meaningful analyses, we’d like to identify anomalies in the sensor data and remove them before we perform further analysis. Or we may wish to analyze the anomalies themselves, as they may help us understand how our system really works or how our system is changing. For example, throwing away (more and more) high temperature readings in the Arctic because they are “obviously bad data” would cause us to miss the warming that is happening there.
The use case for this post is from the domain of road traffic: freeway traffic sensors. These sensors report three measures (called “features”): speed, volume, and occupancy, each of which is sampled several times a minute (see “Appendix: Measurement Definitions” at the end of this post for details). Each reading from the sensors is called an observation. Sensors of different types (radar, in-road, Bluetooth) are often mixed in a single network and may be installed in varied configurations. For in-road sensors, there’s often a separate sensor in each lane; in freeways with a “carpool” lane, that lane will have different traffic characteristics from the others. Different sections of the freeway may have different traffic characteristics, such as rush hour on the inbound vs. outbound side of the freeway.
Thus, anomaly detection is frequently an iterative process where the system, as represented by the data from the sensors, must first be segmented in some way and “normal” characterized for each part of the system, before variations from that “normal” can be detected. After these variations or anomalies are removed, we can perform various analyses of the cleaned data such as trend analysis, model creation, and predictions. This post describes how two popular and powerful open-source technologies, Spark and Hive, were used to detect anomalies in data from a network of traffic sensors. While it’s based on real usage (see "References" at the end of this post), here you’ll work with similar, anonymized data.
The same characteristics and challenges apply to many other sensor networks. Specific examples I’ve worked with include weather stations, such as Weather Underground (www.wunderground.com), that report temperature, air pressure, humidity, wind and rainfall, amongst other things; ocean observatories such as CMOP (http://www.stccmop.org/datamart/observation_network) that collect physical, geochemical and biological observations; and satellite data from NOAA (http://www.nodc.noaa.gov/).
Detecting anomalies
An anomaly in a sensor network may be a single variable with an unreasonable reading (speed = 250 m.p.h.; for a thermometer, air temperature = 200°F). However, each traffic sensor reading has several features (speed, volume, occupancy), and there can be situations where each feature on its own has a reasonable value, but the combination is highly unlikely (an anomaly). For traffic sensors, a speed of more than 100 m.p.h. is possible during times of low congestion (that is, low occupancy and low volume) but extremely unlikely during a traffic jam.
Many of these “valid” or “invalid” combinations are situational, as is the case here. Common combinations often have descriptive terms, such as “traffic jam”, “congested traffic”, or “light traffic”. These terms are representative of a commonly seen combination of characteristics, which would be represented in the data as a cluster of observations.
So, to detect anomalies: first, identify the common situations (each represented by a large cluster of similar combinations of features), and then identify observations that are sufficiently different from those clusters. You essentially apply two methods from basic statistics: cluster the data using the most common algorithm, k-means; then measure the distance from each observation to the closest cluster and classify the ones that are “far away” as anomalies. (Note that other anomaly detection techniques exist, some of which could be used against the same data, but they would reflect a different model or understanding of the problem.)
This post walks through the three major steps:
Clustering the data.
Choosing the number of clusters.
Detecting probable anomalies.
For the project, you process the data using Spark, Hive, and Hue on an Amazon EMR cluster, reading input data from an Amazon S3 bucket.
Clustering the data
To perform k-means clustering, you first need to know how many clusters exist in the data. However, in most cases, as is true here, you don’t know the “right” number to use. A common solution is to repeatedly cluster the data, each time using a different number (“k”) of clusters. For each “k”, calculate a metric: the sum of the squared distance of each point from its closest cluster center, known as the Within Set Sum of Squared Error (WSSSE). (My code extends this sample.) The smaller the WSSSE, the better your clustering is considered to be – within limits, as more clusters will almost always give a smaller WSSSE but having more clusters may distract rather than add to your analysis.
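To make the metric concrete, here is a minimal PySpark sketch – not the post’s actual kmeanswsssey.py, which is described below – that sweeps k and prints the WSSSE for each value. The input path is a placeholder, and it assumes the CSV layout defined in the next section, in which volume, speed, and occupancy are the last three fields:

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="wssse-sweep")

# Placeholder input path; keep only the three features: volume, speed, occupancy
rows = sc.textFile("s3://your-bucket/sensorinput/")
features = rows.map(lambda line: [float(v) for v in line.split(",")[-3:]]).cache()

def wssse(model, points):
    # Sum, over all points, of the squared distance to the closest cluster centre
    def sq_dist(p):
        centre = model.clusterCenters[model.predict(p)]
        return sum((x - c) ** 2 for x, c in zip(p, centre))
    return points.map(sq_dist).sum()

for k in range(1, 11):
    model = KMeans.train(features, k, maxIterations=20)
    print("k=%d WSSSE=%.1f" % (k, wssse(model, features)))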
Here, the input data is a CSV format file stored in an S3 bucket. Each row contains a single observation taken by a specific sensor at a specific time, and consists of 9 numeric values. There are two versions of the input:
s3://vmegler/traffic/sensorinput/, with 24M rows
s3://vmegler/traffic/sensorinputsmall/, an extract with 50,000 rows
In this post, I show how to run the programs here with the smaller input. However, the exact same code runs over the 24M row input. Here’s the Hive SQL definition for the input:
CREATE EXTERNAL TABLE sensorinput (
  highway int,         -- highway id
  sensorloc int,       -- one sensor location may have multiple sensors,
                       -- e.g. for different highway lanes
  sensorid int,        -- sensor id
  dayofyear bigint,    -- yyyyddd
  dayofweek bigint,    -- 0=Sunday, 1=Monday, etc.
  time decimal(10,2),  -- seconds since midnight; e.g. a value of 185.67
                       -- is 00:03:05.67 (3 min 5.67 s after midnight)
  volume int,          -- a count
  speed int,           -- average, in m.p.h.
  occupancy int        -- a count
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://vmegler/traffic/sensorinput/';
Start an EMR cluster in us-west-2 (where this bucket is located), specifying Spark, Hue, Hive, and Ganglia. (For more information, see Getting Started: Analyzing Big Data with Amazon EMR.) I’ve run the same program on two different clusters: a small cluster with 1 master and 2 core nodes, all m3.xlarge; and a larger cluster with 1 master and 8 core nodes, all m4.xlarge.
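If you prefer the AWS CLI to the console, a cluster along these lines can be created with a command like the one below; the release label, key pair name, and instance count are illustrative rather than prescriptive:

aws emr create-cluster \
  --region us-west-2 \
  --name "traffic-anomaly-detection" \
  --release-label emr-4.7.2 \
  --applications Name=Spark Name=Hive Name=Hue Name=Ganglia \
  --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=your-key-pair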
Spark has two interfaces that can be used to run a Spark/Python program: an interactive interface, pyspark, and batch submission via spark-submit. I generally begin my projects by reviewing my data and testing my approach interactively in pyspark, while logged on to the cluster master. Then, I run my completed program using spark-submit (see also Submitting User Applications with spark-submit). After the program is ready to operationalize, I start submitting the jobs as steps to a running cluster using the AWS CLI for EMR or from a script such as a Python script using Boto3 to interface to EMR, with appropriate parameterization.
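As an illustration of that last, operationalized stage, a Boto3 script could submit one of the PySpark programs described below as a step to a running cluster; the cluster ID, script path, and arguments here are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-west-2")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # your running cluster's ID
    Steps=[{
        "Name": "kmeans-wssse-sweep",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",   # runs an arbitrary command on the master
            "Args": [
                "spark-submit",
                "/home/hadoop/kmeanswsssey.py",           # script already copied to the master
                "s3://vmegler/traffic/sensorinputsmall/",
                "10", "run1",
                "s3://your-bucket/wssse-output/",
            ],
        },
    }],
)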
I’ve written two PySpark programs: one to repeatedly cluster the data and calculate the WSSSE using different numbers of clusters (kmeanswsssey.py); and a second one (kmeansandey.py) to calculate the distances of each observation from the closest cluster. The other parts of the anomaly detection—choosing the number of clusters to use, and deciding which observations are the outliers—are performed interactively, using Hue and Hive. I also provide a file (traffic-hive.hql), with the table definitions and sample queries.
For simplicity, I’ll describe how to run the programs using spark-submit while logged on to the master instance console.
To prepare the cluster for executing your programs, install some Python packages:
sudo yum install python-numpy python-scipy -y
Copy the programs from S3 onto the master node’s local disk; I often run this way while I’m still editing the programs and experimenting with slightly different variations:
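For example (the code bucket and paths below are placeholders rather than the post’s actual locations):

aws s3 cp s3://your-bucket/code/kmeanswsssey.py .   # WSSSE sweep program
aws s3 cp s3://your-bucket/code/kmeansandey.py .    # distance/anomaly program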
My first PySpark program (kmeanswsssey.py) calculates the WSSSE repeatedly, starting with 1 cluster (k=1), then 2 clusters, and so on, up to some maximum k that you define. It outputs a CSV file; for each k, it appends a set of lines containing the WSSSE and some statistics describing each of the clusters. The program takes the input file location, the maximum k to use, a run identifier to prepend to the output (for when I’m testing multiple variations), and the output location: <infile> <maxk> <runId> <outfile>. For example:
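(Illustrative only: the input shown is the post’s small sample, while the maximum k, run ID, and output bucket are placeholders.)

spark-submit kmeanswsssey.py \
  s3://vmegler/traffic/sensorinputsmall/ \
  10 run1 \
  s3://your-bucket/wssse-output/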
When run on the small cluster with the small input, this program took around 5 minutes. The same program, run on the 24M row input on the larger cluster, took 2.5 hours. Running the large input on the smaller cluster produces correct results, but takes over 24 hours to complete.
Choosing the number of clusters
Next, review the clustering results and choose the number of clusters to use for the actual anomaly detection.
A common and easy way to do that is to graph the WSSSE calculated for each k, and to choose “the knee in the curve”. That is, look for a point where the total distance has dropped sufficiently that increasing the number of clusters does not drop the WSSSE by much. If you’re very lucky, each cluster has characteristics that match your mental model for the problem domain such as low speed, high occupancy, and high volume, matching “congested traffic”.
Here you use Hue and Hive, conveniently selected when you started the cluster, for data exploration and simple graphing. In Hue’s Hive Query Editor, define a table that describes the output file you created in the previous step. Here, I’m pointing to a precomputed version calculated over the larger dataset:
CREATE EXTERNAL TABLE kcalcs (
  run string,
  wssse decimal(20,3),
  k decimal,
  clusterid decimal,
  clustersize decimal,
  volcntr decimal(10,1),
  spdcntr decimal(10,1),
  occcntr decimal(10,1),
  volstddev decimal(10,1),
  spdstddev decimal(10,1),
  occstddev decimal(10,1)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://vmegler/traffic/sensorclusters/'
tblproperties ("skip.header.line.count"="2");
To decide how many clusters to use for the next step, use Hue to plot a line graph of the number of clusters versus WSSSE. First, select the information to display:
SELECT DISTINCT k, wssse FROM kcalcs ORDER BY k;
In the results panel, choose Chart. Choose the icon representing a line graph, choose “k” for X-Axis and “wssse” for Y-Axis, and Hue builds the following chart. Hover your cursor above a particular bar, and Hue shows the value of the X and Y axis for that bar.
For the “best number of clusters”, you’re looking for the “knee in the curve”: the place where going to a higher number of clusters does not significantly reduce the total distance function (WSSSE). For this data, around 4 looks like a good choice, as the gains from going to 5 or 6 clusters look minimal.
You can explore the characteristics of the identified clusters with the following SELECT statement:
SELECT DISTINCT k, clusterid, clustersize, volcntr, spdcntr, occcntr, volstddev, spdstddev, occstddev FROM kcalcs ORDER BY k, spdcntr;
By looking at, for example, the lines for three clusters (k=3), you can see a “congestion” cluster (17.1 m.p.h., occupancy 37.7 cars), a “free-flowing heavy-traffic” cluster, and a “light traffic” cluster (65.2 m.p.h., occupancy 5.1). With k=4, you still see the “congestion” and “fast, light traffic” clusters, but the “free-flowing heavy-traffic” cluster from k=3 has been split into two distinct clusters with very different occupancy. Choose to stay with 4 clusters.
Detecting anomalies
Use the following method with these clusters to identify anomalies:
Assign each sensor reading to the closest cluster.
Calculate the distance (using some measure) for each reading to the assigned cluster center.
Filter for the entries with a greater distance than some chosen threshold.
I like to use Mahalanobis distance as the distance measure, as it compensates for differences in units (speed in m.p.h., while volume and occupancy are counts), averages, and scales of the several features I’m clustering across.
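To make the distance concrete, here is a small NumPy illustration with synthetic values (not the post’s data): the inverse covariance of a cluster’s members rescales each feature, so an observation whose individual values all look plausible can still score a large distance if the combination is unlikely for that cluster.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "light traffic" cluster: volume, speed (m.p.h.), occupancy
centre = np.array([12.0, 63.0, 5.0])
members = rng.normal(loc=centre, scale=[3.0, 4.0, 1.5], size=(200, 3))

cov_inv = np.linalg.inv(np.cov(members, rowvar=False))
mean = members.mean(axis=0)

def mahalanobis(obs):
    d = obs - mean
    return float(np.sqrt(d @ cov_inv @ d))

print(mahalanobis(np.array([13.0, 61.0, 5.0])))   # typical member: small distance
print(mahalanobis(np.array([12.0, 63.0, 35.0])))  # plausible values, unlikely combination: large distance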
Run the second PySpark program, kmeansandey.py (you copied this program onto local disk earlier, during setup). Give this program the number of clusters to use, decided in the previous step, and the input data. For each input observation, this program does the following:
Identifies the closest cluster center.
Calculates the Mahalanobis distance from this observation to the closest center.
Creates an output record consisting of the original observation, plus the cluster number, the cluster center, and the distance.
The program takes the following parameters: <infile> <k> <outfile>. The output is a CSV file, placed in an S3 bucket of your choice. To run the program, use spark-submit:
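(Again illustrative: the input is the post’s small sample, k is the value chosen above, and the output bucket is a placeholder.)

spark-submit kmeansandey.py \
  s3://vmegler/traffic/sensorinputsmall/ \
  4 \
  s3://your-bucket/sensoroutput/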
On the small cluster with the small input, this job finished in under a minute; on the bigger cluster, the 24M dataset took around 17 minutes. In the next step, you review the observations that are “distant” from the closest cluster as calculated by that distance calculation. Because these observations are unlike the majority of the other observations, they are considered outliers, and probable anomalies.
Exploring identified anomalies
Now you’re ready to look at the probable anomalies and decide whether they really should be considered anomalies. In Hive, define a table that describes the output file created in the previous step: it contains the original input columns (highway through occupancy) plus 5 new ones (clusterid through maldist). The smaller dataset’s output is less interesting to explore, as it only contains data from one sensor, so I’ve precomputed the output over the large dataset for this exploration. Here is the modified table definition:
CREATE EXTERNAL TABLE sensoroutput (
  highway int,            -- highway id
  -- ... the remaining input columns, as in sensorinput ...
  occupancy int,          -- a count
  clusterid int,          -- cluster identifier
  volcntr decimal(10,2),  -- cluster center, volume
  spdcntr decimal(10,2),  -- cluster center, speed
  occcntr decimal(10,2),  -- cluster center, occupancy
  maldist decimal(10,2)   -- Mahalanobis distance to this cluster
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://vmegler/traffic/sensoroutput/';
Explore your results. To look at the number of observations assigned to each cluster for each sensor, try the following query:
SELECT sensorid, clusterid,
       concat(cast(sensorid AS string), '.', cast(clusterid AS string)) AS senclust,
       count(*) AS howmany,
       max(maldist) AS dist
FROM sensoroutput
GROUP BY sensorid, clusterid
ORDER BY sensorid, clusterid;
The “concat” statement creates a compound column, senclust, that you can use in Hue’s built-in graphing tool to compare the clusters visually for each sensor. For this chart, choose a bar graph, choose the compound column “senclust” for X-Axis and “howmany” for Y-Axis, and Hue builds the following chart.
You can now easily compare the sizes, and the largest and average distances, for each cluster across the different sensors. The smaller clusters probably bear investigation; they either represent unusual traffic conditions or a cluster of bad readings. Note that an additional cluster of known bad readings (0 speed, volume, and occupancy) was identified using a similar process during a prior run; those observations are all assigned to a dummy clusterid of “-1” and have a high maldist. The following query summarizes the centre, size, and distance statistics for each cluster:
SELECT clusterid, volcntr, spdcntr, occcntr,
       count(*) AS num,
       max(maldist) AS maxmaldist,
       avg(maldist) AS avgmaldist,
       stddev_pop(maldist) AS stddevmal
FROM sensoroutput
GROUP BY clusterid, volcntr, spdcntr, occcntr
ORDER BY spdcntr;
How do you choose the threshold for defining an observation as an anomaly? This is another black art. I chose 2.5 by a combination of standard practice, discussing the graphs, and looking at how much and which data I’d be throwing away by using that assumption. To explore the distribution of outliers across the sensors, use a query like the following:
SELECT sensorid, clusterid,
       count(*) AS num_outliers,
       avg(spdcntr) AS spdcntr,
       avg(maldist) AS avgdist
FROM sensoroutput
WHERE maldist > 2.5
GROUP BY sensorid, clusterid
ORDER BY sensorid, clusterid;
The number of outliers varies quite a bit by sensor and cluster. You can explore the top 100 outliers for a single sensor and cluster – here, sensor 44, cluster 0:
SELECT *
FROM sensoroutput
WHERE maldist > 2.5 AND sensorid = 44 AND clusterid = 0
ORDER BY maldist DESC
LIMIT 100;
The query results show some entries that look reasonable (volume 6, occupancy 1), and others that look less so (volume 3, occupancy 10). Depending on your intended use, you may decide that the number of observations that might not really be anomalies is small enough that you can simply exclude all these entries – or you may want to study them further to find a pattern, such as a state that often occurs during transitions from one traffic pattern to another.
After you understand your clusters and the flagged “potential anomalies” sufficiently, you can choose which observations to exclude from further analysis.
Conclusion
This post describes anomaly detection for sensor data and works through a case of identifying anomalies in traffic sensor data. You’ve dived into some of the complexities that come with deciding which subsets of sensor data are dirty or not, and the tools used to ask those questions. I showed how an iterative approach is often needed, with each analysis leading to further questions and further analyses.
In the real use case (see "References" below), we iteratively clustered subsets of the data – for different highways, days of the week, different sensor types, and so on – to understand the data and the anomalies. We’ve seen here some of the challenges in deciding whether something is an anomaly in the data or an anomaly in our approach. We used Amazon EMR, along with Apache Spark, Apache Hive, and Hue, to implement the approach and explore the results, allowing us to quickly experiment with a number of alternative clusterings before settling on the combination that we felt best identified the real anomalies in our data.
Now, you can move forward: providing “clean data” to the business users; combining this data with weather, school holiday calendars, and sporting events to identify the causes of specific traffic patterns and pattern changes; and then using that model to predict future traffic conditions.
Appendix: Measurement Definitions
Volume measures how many vehicles have passed this sensor during the given time period. Occupancy measures the number of vehicles at the sensor at the measurement time. The combination of volume and occupancy gives a view of overall traffic density. For example: if the traffic is completely stopped, a sensor may have very high occupancy – many vehicles sitting at the sensor – but a volume close to 0, as very few vehicles have passed the sensor. This is a common circumstance for sensors at freeway entrances that limit freeway entry, often via lights that only permit one car from one lane to pass every few seconds.
Note that different sensor types may have different capabilities to detect these situations (radar vs. in-road sensors, for example), and different sensor types or models may have different defaults for how they report various situations. For example, “0,0,0” may mean no traffic, or known bad data, or assumed bad data based on hard limits, such as traffic above a specific density (ouch!). Thus sensor type, capability, and context are all important factors in identifying “bad data”. In this study, the analysis of which sensors were “similar enough” for their data to be analyzed together was performed prior to data extraction. The anomaly detection steps described here were performed separately for each set of similar sensors, as defined by that pre-analysis.
Big brown boxes
If this blog were an Ealing comedy, it would be a speeded-up montage of an increasingly flustered postman delivering huge numbers of huge boxes to school reception desks across the land. At the end, they’d push their cap up at a jaunty angle and wipe their brow with a large spotted handkerchief. With squeaky sound effects. Over the past couple of days, huge brown boxes have indeed been dropping onto the counters of school receptions across the UK, and they contain something wonderful: a Raspberry Pi Oracle Weather Station.
DJCS on Twitter: “Code club students building a weather station kindly donated by the @Raspberry_Pi foundation thanks @clivebeale” pic.twitter.com/yGQP4BQ6SP
This week, we sent out the first batch of Weather Station kits to 150 UK schools. Yesterday – World Meteorological Day, of course! – they started to appear in the wild.
DHFS Computing Dept on Twitter: “The next code club project has just arrived! Can’t wait to get stuck in! @Raspberry_Pi @clivebeale” pic.twitter.com/axA7wJ1RMF
Pilot “lite”
We’re running the UK delivery as a short pilot scheme. With almost 1,000 schools involved worldwide, it will give us a chance to tweak software and resources, and to get a feel for how we can best support schools. In the next few weeks, we’ll send out the remainder of the weather stations; we’ll have a good idea of when this will be next week, once the first kits have been in schools for a while. Once all the stations are shipped, we’ll be extending and expanding our teaching and learning resources. In particular, we would like resources for big data management and visualisation, and for non-computing subjects such as geography. And, of course, if you make any of your own, we’d love to see them.
BWoodhead Primary on Twitter: “Super exciting raspberry pi weather station arrived, very lucky to be one of the 150 uk schools @rasberrypi” pic.twitter.com/ZER0RPKqIf
“Just” a milestone
This is a big milestone for the project, but it’s not the end by any means. In fact, it’s just the beginning, as schools start to build their stations and use them to investigate the weather and to learn. We’re hoping to see and encourage lots of collaboration between schools. We started the project back in 2014, and over time it’s easy to take any project for granted, so it was brilliant to see the excitement of teachers and students when they received their kit.
Stackpole V.C School on Twitter: “We were really excited to receive our @Raspberry_Pi weather station today. Indoor trial tomorrow. @clivebeale” pic.twitter.com/7fsI7DYCYg
It’s been a fun two years, and if you’ve opened a big brown box this morning and found a weather station inside, we think you’ll agree that it’s been worth the wait.
Building and setting up your weather station
The weather station page has tutorials for building the hardware and setting up the software, along with a scheme of work for teachers and other resources.
Getting involved
The community is hugely important to us, and whether you’ve just received a weather station or not, we’d love to hear from you. The best way to get involved is to come to the friendly Weather Station corner of our forums and say hi. This is also the place to get help and to share ideas. If you’re tweeting, you can reach us @raspberry_pi or on the hashtag #weatherstation – thanks!
BA Science on Twitter: “Our weather station has arrived! Thanks to @Raspberry_Pi now need some students to help us build it! @BromptonAcademy” pic.twitter.com/8qZPG3JTaQ
Buying the kit
We’re often asked if we’ll be selling the kits. We’re currently looking into this and hope that they will be commercially available at some point. I’d love to see a Raspberry Pi Weather Station attached to every school – it’s a project that genuinely engages students across many subjects. In addition, the data gathered from thousands of weather stations, all sending data back to a central database, would be really useful.
That’s all for now
Now that the kits are shipped, there’ll be lots going on, so expect more news soon. And do pop into the forums for a chat.
Thanks
As well as the talented and lovely folk at Pi Towers, we’ve only made it this far with the help of others. At the risk of turning into a mawkish awards ceremony speech, a few shout-outs are needed: Oracle for their generous funding and database support, especially Nicole at Oracle Giving, Jane at Oracle Academy, and Jeff, who built our Apex database; Rachel, Kevin, and Team @cpc_tweet for the kit build (each kit has around 80 parts!) and amazing logistics support; and @HackerJimbo for sterling software development and the disk image. If I’ve missed you out, it doesn’t mean I don’t love you.