Tag Archives: machine learning

How do we develop AI education in schools? A panel discussion

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/ai-education-schools-panel-uk-policy/

AI is a broad and rapidly developing field of technology. Our goal is to make sure all young people have the skills, knowledge, and confidence to use and create AI systems. So what should AI education in schools look like?

To hear a range of insights into this, we organised a panel discussion as part of our seminar series on AI and data science education, which we co-host with The Alan Turing Institute. Here our panel chair Tabitha Goldstaub, Co-founder of CogX and Chair of the UK government’s AI Council, summarises the event. You can also watch the recording below.

As part of the Raspberry Pi Foundation’s monthly AI education seminar series, I was delighted to chair a special panel session to broaden the range of perspectives on the subject. The members of the panel were:

  • Chris Philp, UK Minister for Tech and the Digital Economy
  • Philip Colligan, CEO of the Raspberry Pi Foundation 
  • Danielle Belgrave, Research Scientist, DeepMind
  • Caitlin Glover, A level student, Sandon School, Chelmsford
  • Alice Ashby, student, University of Brighton

The session explored the UK government’s commitment in the recently published UK National AI Strategy stating that “the [UK] government will continue to ensure programmes that engage children with AI concepts are accessible and reach the widest demographic.” We discussed what it will take to make this a reality, and how we will ensure young people have a seat at the table.

Two teenage girls do coding during a computer science lesson.

Why AI education for young people?

It was clear that the Minister felt it is very important for young people to understand AI. He said, “The government takes the view that AI is going to be one of the foundation stones of our future prosperity and our future growth. It’s an enabling technology that’s going to have almost universal applicability across our entire economy, and that is why it’s so important that the United Kingdom leads the world in this area. Young people are the country’s future, so nothing is complete without them being at the heart of it.”

A teacher watches two female learners code in Code Club session in the classroom.

Our panelist Caitlin Glover, an A level student at Sandon School, reiterated this from her perspective as a young person. She told us that her passion for AI started initially because she wanted to help neurodiverse young people like herself. Her idea was to start a company that would build AI-powered products to help neurodiverse students.

What careers will AI education lead to?

A theme of the Foundation’s seminar series so far has been how learning about AI early may impact young people’s career choices. Our panelist Alice Ashby, who studies Computer Science and AI at the University of Brighton, told us about her own process of deciding on her course of study. She pointed to the fact that terms such as machine learning, natural language processing, self-driving cars, chatbots, and many others are currently all under the umbrella of artificial intelligence, but they’re all very different. Alice thinks it’s hard for young people to know whether it’s the right decision to study something that’s still so ambiguous.

A young person codes at a Raspberry Pi computer.

When I asked Alice what gave her the courage to take a leap of faith with her university course, she said, “I didn’t know it was the right move for me, honestly. I took a gamble, I knew I wanted to be in computer science, but I wanted to spice it up.” The AI ecosystem is very lucky that people like Alice choose to enter the field even without being taught what precisely it comprises.

We also heard from Danielle Belgrave, a Research Scientist at DeepMind with a remarkable career in AI for healthcare. Danielle explained that she was lucky to have had a Mathematics teacher who encouraged her to work in statistics for healthcare. She said she wanted to ensure she could use her technical skills and her love for math to make an impact on society, and to really help make the world a better place. Danielle works with biologists, mathematicians, philosophers, and ethicists as well as with data scientists and AI researchers at DeepMind. One possibility she suggested for improving young people’s understanding of what roles are available was industry mentorship. Linking people who work in the field of AI with school students was an idea that Caitlin was eager to confirm as very useful for young people her age.

We need investment in AI education in school

The AI Council’s Roadmap stresses how important it is to not only teach the skills needed to foster a pool of people who are able to research and build AI, but also to ensure that every child leaves school with the necessary AI and data literacy to be able to become engaged, informed, and empowered users of the technology. During the panel, the Minister, Chris Philp, spoke about the fact that people don’t have to be technical experts to come up with brilliant ideas, and that we need more people to be able to think creatively and have the confidence to adopt AI, and that this starts in schools. 

A class of primary school students do coding at laptops.

Caitlin is a perfect example of a young person who has been inspired about AI while in school. But sadly, among young people and especially girls, she’s in the minority by choosing to take computer science, which meant she had the chance to hear about AI in the classroom. But even for young people who choose computer science in school, at the moment AI isn’t in the national Computing curriculum or part of GCSE computer science, so much of their learning currently takes place outside of the classroom. Caitlin added that she had had to go out of her way to find information about AI; the majority of her peers are not even aware of opportunities that may be out there. She suggested that we ensure AI is taught across all subjects, so that every learner sees how it can make their favourite subject even more magical and thinks “AI’s cool!”.

A primary school boy codes at a laptop with the help of an educator.

Philip Colligan, the CEO here at the Foundation, also described how AI could be integrated into existing subjects including maths, geography, biology, and citizenship classes. Danielle thoroughly agreed and made the very good point that teaching this way across the school would help prepare young people for the world of work in AI, where cross-disciplinary science is so important. She reminded us that AI is not one single discipline. Instead, many different skill sets are needed, including engineering new AI systems, integrating AI systems into products, researching problems to be addressed through AI, or investigating AI’s societal impacts and how humans interact with AI systems.

On hearing about this multitude of different skills, our discussion turned to the teachers who are responsible for imparting this knowledge, and to the challenges they face. 

The challenge of AI education for teachers

When we shifted the focus of the discussion to teachers, Philip said: “If we really want to equip every young person with the knowledge and skills to thrive in a world that is shaped by these technologies, then we have to find ways to evolve the curriculum and support teachers to develop the skills and confidence to teach that curriculum.”

Teenage students and a teacher do coding during a computer science lesson.

I asked the Minister what he thought needed to happen to ensure we achieved data and AI literacy for all young people. He said, “We need to work across government, but also across business and society more widely as well.” He went on to explain how important it was that the Department for Education (DfE) gets the support to make the changes needed, and that he and the Office for AI were ready to help.

Philip explained that the Raspberry Pi Foundation is one of the organisations in the consortium running the National Centre for Computing Education (NCCE), which is funded by the DfE in England. Through the NCCE, the Foundation has already supported thousands of teachers to develop their subject knowledge and pedagogy around computer science.

A recent study recognises that the investment made by the DfE in England is the most comprehensive effort globally to implement the computing curriculum, so we are starting from a good base. But Philip made it clear that now we need to expand this investment to cover AI.

Young people engaging with AI out of school

Philip described how brilliant it is to witness young people who choose to get creative with new technologies. As an example, he shared that the Foundation is seeing more and more young people employ machine learning in the European Astro Pi Challenge, where participants run experiments using Raspberry Pi computers on board the International Space Station. 

Three teenage boys do coding at a shared computer during a computer science lesson.

Philip also explained that, in the Foundation’s non-formal CoderDojo club network and its Coolest Projects tech showcase events, young people build their dream AI products supported by volunteers and mentors. Among these have been autonomous recycling robots and AI anti-collision alarms for bicycles. Like Caitlin with her company idea, this shows that young people are ready and eager to engage and create with AI.

We closed out the panel by going back to a point raised by Mhairi Aitken, who presented at the Foundation’s research seminar in September. Mhairi, an Alan Turing Institute ethics fellow, argues that children don’t just need to learn about AI, but that they should actually shape the direction of AI. All our panelists agreed on this point, and we discussed what it would take for young people to have a seat at the table.

A Black boy uses a Raspberry Pi computer at school.

Alice advised that we start by looking at our existing systems for engaging young people, such as Youth Parliament, student unions, and school groups. She also suggested adding young people to the AI Council, which I’m going to look into right away! Caitlin agreed and added that it would be great to make these forums virtual, so that young people from all over the country could participate.

The panel session was full of insight and felt very positive. Although the challenge of ensuring we have a data- and AI-literate generation of young people is tough, it’s clear that if we include them in finding the solution, we are in for a bright future. 

What’s next for AI education at the Raspberry Pi Foundation?

In the coming months, our goal at the Foundation is to increase our understanding of the concepts underlying AI education and how to teach them in an age-appropriate way. To that end, we will start to conduct a series of small AI education research projects, which will involve gathering the perspectives of a variety of stakeholders, including young people. We’ll make more information available on our research pages soon.

In the meantime, you can sign up for our upcoming research seminars on AI and data science education, and peruse the collection of related resources we’ve put together.

The post How do we develop AI education in schools? A panel discussion appeared first on Raspberry Pi.

The machine learning effect: Magic boxes and computational thinking 2.0

Post Syndicated from Jane Waite original https://www.raspberrypi.org/blog/machine-learning-education-school-computational-thinking-2-0-research-seminar/

How does teaching children and young people about machine learning (ML) differ from teaching them about other aspects of computing? Professor Matti Tedre and Dr Henriikka Vartiainen from the University of Eastern Finland shared some answers at our latest research seminar.

Three smiling young learners in a computing classroom.
We need to determine how to teach young people about machine learning, and what teachers need to know to help their learners form correct mental models.

Their presentation, titled ‘ML education for K-12: emerging trajectories’, had a profound impact on my thinking about how we teach computational thinking and programming. For this blog post, I have simplified some of the complexity associated with machine learning for the benefit of readers who are new to the topic.

a 3D-rendered grey box.
Machine learning is not magic — what needs to change in computing education to make sure learners don’t see ML systems as magic boxes?

Our seminars on teaching AI, ML, and data science

We’re currently partnering with The Alan Turing Institute to host a series of free research seminars about how to teach artificial intelligence (AI) and data science to young people.

The seminar with Matti and Henriikka, the third one of the series, was very well attended. Over 100 participants from San Francisco to Rajasthan, including teachers, researchers, and industry professionals, contributed to a lively and thought-provoking discussion.

Representing a large interdisciplinary team of researchers, Matti and Henriikka have been working on how to teach AI and machine learning for more than three years, which in this new area of study is a long time. So far, the Finnish team has written over a dozen academic papers based on their pilot studies with kindergarten-, primary-, and secondary-aged learners.

Current teaching in schools: classical rule-driven programming

Matti and Henriikka started by giving an overview of classical programming and how it is currently taught in schools. Classical programming can be described as rule-driven. Example features of classical computer programs and programming languages are:

  • A classical language has a strict syntax, and a limited set of commands that can only be used in a predetermined way
  • A classical language is deterministic, meaning we can guarantee what will happen when each line of code is run
  • A classical program is executed in a strict, step-wise order following a known set of rules

When we teach this type of programming, we show learners how to use a deductive problem solving approach or workflow: defining the task, designing a possible solution, and implementing the solution by writing a stepwise program that is then run on a computer. We encourage learners to avoid using trial and error to write programs. Instead, as they develop and test a program, we ask them to trace it line by line in order to predict what will happen when each line is run (glass-box testing).
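To make the idea of tracing concrete, here is a minimal sketch of a rule-driven program in Python. The temperature thresholds and labels are made up for illustration and do not come from the seminar. Because every outcome follows from an explicit rule, the list of checks for any input can be predicted by hand before the code is run.

```python
# Hypothetical rules: (upper limit in Celsius, label). Checked in order.
RULES = [(0, "freezing"), (15, "cold"), (25, "mild")]


def classify_with_trace(celsius):
    """Return (label, steps): the result plus every rule check made on the way."""
    steps = []
    for limit, label in RULES:
        matched = celsius < limit
        steps.append(f"{celsius} < {limit}? {'yes' if matched else 'no'}")
        if matched:
            return label, steps
    # No rule matched, so the default label applies.
    return "hot", steps


if __name__ == "__main__":
    label, steps = classify_with_trace(10)
    for step in steps:
        print(step)
    print("result:", label)
```

Running the same input twice always gives the same label and the same trace, which is the deterministic, step-wise behaviour the list above describes.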

A list of features of rule-driven computer programming, also included in the text.
The features of classical (rule-driven) programming approaches as taught in computer science education (CSE) (Tedre & Vartiainen, 2021).

Classical programming underpins the current view of computational thinking (CT). Our speakers called this version of CT ‘CT 1.0’. So what’s the alternative Matti and Henriikka presented, and how does it affect what computational thinking is or may become?

Machine learning (data-driven) models and new computational thinking (CT 2.0) 

Rule-based programming languages are not being eradicated. Instead, software systems are being augmented through the addition of machine learning (data-driven) elements. Many of today’s successful software products, such as search engines, image classifiers, and speech recognition programs, combine rule-driven software and data-driven models. However, the workflows for these two approaches to solving problems through computing are very different.

A table comparing problem solving workflows using computational thinking 1.0 versus computational thinking 2.0, info also included in the text.
Problem solving is very different depending on whether a rule-driven computational thinking (CT 1.0) approach or a data-driven computational thinking (CT 2.0) approach is used (Tedre & Vartiainen, 2021).

Significantly, while in rule-based programming (and CT 1.0), the focus is on solving problems by creating algorithms, in data-driven approaches, the problem solving workflow is all about the data. To highlight the profound impact this shift in focus has on teaching and learning computing, Matti introduced us to a new version of computational thinking for machine learning, CT 2.0, which is detailed in a forthcoming research paper.

Because of the focus on data rather than algorithms, developing a machine learning model is not at all like developing a classical rule-driven program. In classical programming, programs can be traced, and we can predict what will happen when they run. But in data-driven development, there is no flow of rules, and no absolutely right or wrong answer.

A table comparing conceptual differences between computational thinking 1.0 versus computational thinking 2.0, info also included in the text.
There are major differences between rule-driven computational thinking (CT 1.0) and data-driven computational thinking (CT 2.0), which impact what computing education needs to take into account (Tedre & Vartiainen, 2021).

Machine learning models are created iteratively using training data and must be cross-validated with test data. A tiny change in the data provided can make a model useless. We rarely know exactly why the output of an ML model is as it is, and we cannot explain each individual decision that the model might have made. When evaluating a machine learning system, we can only say how well it works based on statistical confidence and efficiency. 
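As a contrast with rule-driven code, the sketch below builds a tiny data-driven model in plain Python. It is a nearest-centroid classifier, chosen here only because it is short. It is not a method named by the speakers, and the animal measurements are invented. It also uses a single held-out test set rather than full cross-validation. Nobody writes the decision rule: it comes out of the training data, and the model can only be judged by how often it is right on the test data.

```python
from statistics import mean


def train(examples):
    """Learn one centroid (mean point) per label from (features, label) pairs."""
    rows_by_label = {}
    for features, label in examples:
        rows_by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in rows_by_label.items()
    }


def predict(model, features):
    """Return the label whose centroid is closest to `features`."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    correct = sum(predict(model, features) == label for features, label in examples)
    return correct / len(examples)


# Invented toy data: (height_cm, weight_kg) -> label
TRAIN = [((20, 4), "cat"), ((25, 5), "cat"), ((23, 4.5), "cat"),
         ((60, 30), "dog"), ((55, 25), "dog"), ((65, 35), "dog")]
TEST = [((22, 4), "cat"), ((58, 28), "dog"), ((40, 12), "dog")]

if __name__ == "__main__":
    model = train(TRAIN)
    print("accuracy on test data:", accuracy(model, TEST))
```

With this data the small dog in the test set sits closer to the cat centroid, so the model gets two of three test examples right. Changing a few training examples moves the centroids and can change that result, which is the sensitivity to data described above.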

Machine learning education must cover ethical and societal implications 

The ethical and societal implications of computer science have always been important for students to understand. But machine learning models open up a whole new set of topics for teachers and students to consider, because of these models’ reliance on large datasets, the difficulty of explaining their decisions, and their usefulness for automating very complex processes. These topics include privacy, surveillance, diversity, bias, job losses, misinformation, accountability, democracy, and veracity, to name but a few.

Teaching machine learning: the challenges of magic boxes and new mental models

For teaching classical rule-driven programming, much time and effort has been put into researching learners’ understanding of what a program will do when it is run. This kind of understanding is called a learner’s mental model or notional machine. An approach teachers often use to help students develop a useful mental model of a program is to hide the detail of how the program works and only gradually reveal its complexity. This approach is described with the metaphor of hiding the detail of elements of the program in a box. 

Data-driven models in machine learning systems are highly complex and make little sense to humans. Therefore, they may appear like magic boxes to students. This view needs to be banished. Machine learning is not magic. We have just not figured out yet how to explain the detail of data-driven models in a way that allows learners to form useful mental models.

An example of a representation of a machine learning model in TensorFlow, an online machine learning tool (Tedre & Vartiainen, 2021).

Some existing ML tools aim to help learners form mental models of ML, for example through visual representations of how a neural network works (see the TensorFlow example above). But these explanations are still very complex. Clearly, we need to find new ways to help learners of all ages form useful mental models of machine learning, so that teachers can explain to them how machine learning systems work and banish the view that machine learning is magic.

Some tools and teaching approaches for ML education

Matti and Henriikka’s team piloted different tools and pedagogical approaches with different age groups of learners. In terms of tools, since large amounts of data are needed for machine learning projects, our presenters suggested that tools that enable lots of data to be easily collected are ideal for teaching activities. Media-rich education tools provide an opportunity to capture still images, movements, sounds, or sense other inputs and then use these as data in machine learning teaching activities. For example, to create a machine learning–based rock-paper-scissors game, students can take photographs of their hands to train a machine learning model using Google Teachable Machine.

Photos of hands are used to train a Teachable Machine machine learning model as part of a project to create a rock-paper-scissors game (Tedre & Vartiainen, 2021).

Similar to tools that teach classic programming to novice students (e.g. Scratch), some of the new classroom tools for teaching machine learning have a drag-and-drop interface (e.g. Cognimates). Using such tools means that in lessons, there can be less focus on one of the more complex aspects of learning to program: learning programming language syntax. However, not all machine learning education products include drag-and-drop interaction; some instead have their own complex languages (e.g. Wolfram Programming Lab), which are less attractive to teachers and learners. In their pilot studies, the Finnish team found that drag-and-drop machine learning tools appeared to work well with students of all ages.

The different pedagogical approaches the Finnish research team used in their pilot studies included an exploratory approach with preschool children, who investigated machine learning recognition of happy or sad faces; and a project-based approach with older students, who co-created machine learning apps with web-based tools such as Teachable Machine and Learn Machine Learning (built by the research team), supported by machine learning experts.

Example of a middle school (age 8 to 11) student’s design for a machine learning app that recognises different instruments and chords (Tedre & Vartiainen, 2021).

What impact these pedagogies have on students’ long-term mental models about machine learning has yet to be researched. If you want to find out more about the classroom pilot studies, the academic paper is a very accessible read.

My take-aways: new opportunities, new research questions

We all learned a tremendous amount from Matti and Henriikka and their perspectives on this important topic. Our seminar participants asked them many questions about the pedagogies and practicalities of teaching machine learning in class, and raised concerns about squeezing more into an already packed computing curriculum.

For me, the most significant take-away from the seminar was the need to shift focus from algorithms to data and from CT 1.0 to CT 2.0. Learning how to best teach classical rule-driven programming has been a long journey that we have not yet completed. We are forming an understanding of what concepts learners need to be taught, the progression of learning, key mental models, pedagogical options, and assessment approaches. For teaching data-driven development, we need to do the same.  

I see the shift in problem solving approach as a chance to strengthen the teaching of computing in general, because it opens up opportunities to teach about systems, uncertainty, data, and society. I think it will help us raise awareness about design, context, creativity, and student agency. But I worry about how we will introduce this shift. In my view, there is a considerable risk that we will be sucked into open-ended, project-based learning, with busy and fun but shallow learning experiences that result in restricted conceptual development for students.

I also worry about how we can best help teachers build up the knowledge and experience to support their students. In the Q&A after the seminar, I asked Matti and Henriikka about the role of their team’s machine learning experts in their pilot studies. It seemed to me that without them, the pilot lessons would not have worked, as the participating teachers and students would not have had the vocabulary to talk about the process and would not have known what was doable given the available time, tools, and student knowledge.

The question of how we make sure teachers have the necessary understanding is key. Many existing professional development resources for teachers wanting to learn about ML seem to imply that teachers will all need a PhD in statistics and neural network optimisation to engage with machine learning education. This is misleading. But teachers do need to understand the machine learning concepts that their students need to learn about, and I think we don’t yet know exactly what these concepts are. 

In summary, clearly more research is needed. There are fundamental questions still to be answered about what, when, and how we teach data-driven approaches to software systems development and how this impacts what we teach about classical, rule-based programming. But to me, that is exciting, and I am very much looking forward to the journey ahead.

Join our next free seminar

To find out what others recommend about teaching AI and ML, catch up on last month’s seminar with Professor Carsten Schulte and colleagues on centring data instead of code in the teaching of AI.

We have another four seminars in our monthly series on AI, machine learning, and data science education. Find out more about them on this page, and catch up on past seminar blogs and recordings here.

At our next seminar on Tuesday 7 December at 17:00–18:30 GMT, we will welcome Professor Rose Luckin from University College London. She will be presenting on what it is about AI that makes it useful for teachers and learners.

We look forward to meeting you there!

PS You can build your understanding of machine learning by joining our latest free online course, where you’ll learn foundational concepts and train your own ML model!

The post The machine learning effect: Magic boxes and computational thinking 2.0 appeared first on Raspberry Pi.

Zabbix 6.0 LTS at Zabbix Summit Online 2021

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/zabbix-6-0-lts-at-zabbix-summit-online-2021/16115/

With Zabbix Summit Online 2021 just around the corner, it’s time to have a quick overview of the 6.0 LTS features that we can expect to see featured during the event. The Zabbix 6.0 LTS release aims to deliver some of the long-awaited enterprise-level features while also improving the general user experience, performance, scalability, and many other aspects of Zabbix.

Native Zabbix server cluster

Many of you will be extremely happy to hear that the Zabbix 6.0 LTS release comes with out-of-the-box high availability (HA) for Zabbix Server. This means that HA will now be supported natively, without having to use external tools to create Zabbix Server clusters.

The native Zabbix Server cluster will have a speech dedicated to it during the Zabbix Summit Online 2021. You can expect to learn about the inner workings of the HA solution, its configuration, and of course the main benefits of using it. You can also take a look at the in-development version of the native Zabbix Server cluster in the latest Zabbix 6.0 LTS alpha release.

Business service monitoring and root cause analysis

Service monitoring is also about to go through a significant redesign, focusing on delivering additional value by providing robust Business service monitoring (BSM) features. This is achieved by delivering significant additions to the existing service status calculation logic. With features such as service weights, service status analysis based on child problem severities, and the ability to calculate service status based on the number or percentage of children in a problem state, users will be able to implement BSM on a whole new level. BSM will also support root cause analysis – users will be informed about the root cause problem of the service status change.
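To give a feel for what such status rules could look like, here is a small Python sketch. It is not Zabbix code and does not use the Zabbix API: the `Child` class, the function names, and the numeric severity scale are all hypothetical, invented only to illustrate count-based, percentage-based, and weight-based status calculation plus a simple root cause listing.

```python
from dataclasses import dataclass


@dataclass
class Child:
    name: str
    severity: int    # hypothetical scale: 0 = OK, higher = more severe
    weight: int = 1  # hypothetical service weight


def problem_children(children):
    """Children that are currently in a problem state."""
    return [c for c in children if c.severity > 0]


def status_by_count(children, min_count, status):
    """Return `status` if at least `min_count` children have a problem, else 0."""
    return status if len(problem_children(children)) >= min_count else 0


def status_by_percentage(children, min_percent, status):
    """Return `status` if at least `min_percent` % of children have a problem."""
    if not children:
        return 0
    share = 100 * len(problem_children(children)) / len(children)
    return status if share >= min_percent else 0


def status_by_weight(children, min_weight, status):
    """Return `status` if the summed weight of problem children reaches `min_weight`."""
    total = sum(c.weight for c in problem_children(children))
    return status if total >= min_weight else 0


def root_causes(children):
    """Problem children, most severe first, as candidate root causes."""
    return sorted(problem_children(children), key=lambda c: -c.severity)


if __name__ == "__main__":
    service = [Child("web-1", 0), Child("web-2", 4),
               Child("web-3", 3), Child("db", 0, weight=5)]
    print(status_by_count(service, min_count=2, status=5))
    print(status_by_percentage(service, min_percent=75, status=5))
    print([c.name for c in root_causes(service)])
```

In this made-up service, two of four children have problems, so the count rule fires while the 75% rule does not. The real calculation options are the ones documented for Zabbix 6.0 itself.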

All of this and more, together with examples and use cases, will be covered during a separate speech dedicated to BSM. In addition, some of the BSM features are available in the latest Zabbix 6.0 LTS alpha release – with more to come as we continue working on the Zabbix 6.0 release.

Audit log redesign

The Audit log is another existing feature that has received a complete redesign. With the ability to log each and every change performed both by the Zabbix Server and Zabbix Frontend, the Audit log will become an invaluable source of audit information. Of course, the redesign also takes performance into consideration – the redesign was developed with the least possible performance impact in mind.

The audit log is constantly in development and the current Zabbix 6.0 LTS alpha release offers you an early look at the feature. We will also be covering the technical details of the new audit log implementation during the Summit and will explain how we are able to achieve minimal performance impact with major improvements to Zabbix audit logging.

Geographical maps

With Geographical maps, our users can finally display their entities on a geographical map based on the coordinates of the entity. Geographical maps can be used with multiple geographical map providers and display your hosts with their most severe problems. In addition, geographical maps will react dynamically to zoom levels and support filtering.

The latest Zabbix 6.0 LTS alpha release includes the Geomap widget – feel free to deploy it in your QA environment, check out the different map providers, filter options, and other great features that come with this widget.

Machine learning

When it comes to problem detection, Zabbix 6.0 LTS will deliver multiple new trend functions. A specific set of functions provides machine learning functionality for Anomaly detection and Baseline monitoring.

The topic will be covered in depth during the Zabbix Summit Online 2021. We will look at the configuration of the new functions and also take a deeper dive into the logic and algorithms used under the hood.
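As a rough illustration of the general idea behind baseline monitoring and anomaly detection, the Python sketch below compares a new value with the mean and standard deviation of past values. This is a generic textbook technique, not the algorithm Zabbix uses, and the function names, the threshold, and the sample values are assumptions made for this example.

```python
from statistics import mean, stdev


def baseline(history):
    """Baseline as the plain average of past values (needs at least one value)."""
    return mean(history)


def deviation(history, value):
    """How many sample standard deviations `value` lies from the baseline.

    Needs at least two values in `history`, because stdev() requires them.
    """
    spread = stdev(history)
    if spread == 0:
        return 0.0 if value == baseline(history) else float("inf")
    return abs(value - baseline(history)) / spread


def is_anomaly(history, value, threshold=3.0):
    """Flag `value` when it is more than `threshold` deviations from the baseline."""
    return deviation(history, value) > threshold


if __name__ == "__main__":
    # Invented hourly CPU load percentages for the same hour on previous days.
    history = [50, 52, 48, 51, 49, 50, 53, 47]
    for value in (51, 60):
        print(value, "anomaly" if is_anomaly(history, value) else "normal")
```

With these numbers the baseline is 50 and the spread is 2, so a reading of 51 is well within range while a reading of 60 is flagged.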

During the Zabbix Summit Online 2021, we will also cover many other new features, such as:

  • New Dashboard widgets
  • New items for Zabbix Agent
  • New templates and integrations
  • Zabbix login password complexity settings
  • Performance improvements for Zabbix Server, Zabbix Proxy, and Zabbix Frontend
  • UI and UX improvements
  • New history and trend functions
  • And more!

Not only will you get the chance to have an early look at many new features not yet available in the latest alpha release, but you will also have a great chance to learn the inner workings of the new features, the upgrade and migration process to Zabbix 6.0 LTS, and much more!

We are extremely excited to share all of the new features with our community, so don’t miss out – take a look at the full Zabbix Summit Online 2021 agenda and register for the event by visiting our Zabbix Summit page, and we will see you at the Zabbix Summit Online 2021 on November 25!

Batch Inference at Scale with Amazon SageMaker

Post Syndicated from Ramesh Jetty original https://aws.amazon.com/blogs/architecture/batch-inference-at-scale-with-amazon-sagemaker/

Running machine learning (ML) inference on large datasets is a challenge faced by many companies. There are several approaches and architecture patterns to help you tackle this problem, but no single solution may deliver the desired results for efficiency and cost effectiveness. In this blog post, we will outline a few factors that can help you arrive at the best approach for your business. We will illustrate a use case and architecture pattern with Amazon SageMaker to perform batch inference at scale.

ML inference can be done in real time on individual records, such as with a REST API endpoint. Inference can also be done in batch mode as a processing job on a large dataset. While both approaches push data through a model, each has its own target goal when running inference at scale.

With real-time inference, the goal is usually to optimize the number of transactions per second that the model can process. With batch inference, the goal is usually tied to time constraints and the service-level agreement (SLA) for the job. Table 1 shows the key attributes of real-time, micro-batch, and batch inference scenarios.

Attribute           | Real Time                    | Micro Batch                    | Batch
Execution Mode      | Synchronous                  | Synchronous/Asynchronous       | Asynchronous
Prediction Latency  | Subsecond                    | Seconds to minutes             | Indefinite
Data Bounds         | Unbounded/stream             | Bounded                        | Bounded
Execution Frequency | Variable                     | Variable                       | Variable/fixed
Invocation Mode     | Continuous stream/API calls  | Event-based                    | Event-based/scheduled
Examples            | Real-time REST API endpoint  | Data analyst running a SQL UDF | Scheduled inference job

Table 1. Key characteristics of real-time, micro-batch, and batch inference scenarios

Key considerations for batch inference jobs

Batch inference tasks are usually good candidates for horizontal scaling. Each worker within a cluster can operate on a different subset of data without the need to exchange information with other workers. AWS offers multiple storage and compute options that enable horizontal scaling. Table 2 shows some key considerations when architecting for batch inference jobs.

  • Model type and ML framework. Models built with frameworks such as XGBoost and SKLearn require smaller compute instances. Those built with deep learning frameworks, such as TensorFlow and PyTorch, require larger ones.
  • Complexity of the model. Simple models can run on CPU instances while more complex ensemble models and large-scale deep learning models can benefit from GPU instances.
  • Size of the inference data. While all approaches work on small datasets, larger datasets come with a unique set of challenges. The storage system must provide sufficient throughput and I/O to reliably run the inference workload.
  • Inference frequency and job concurrency. The volume of jobs within a fixed interval of time is an important consideration to address Service Quotas. The frequency and SLA requirements also proportionally impact the number of concurrent jobs. This might create additional pressure on the underlying Service Quotas.

ML Framework
  • Traditional: XGBoost, SKLearn
  • Deep Learning: TensorFlow, PyTorch
Model Complexity
  • Low (linear models)
  • Medium (complex ensemble models)
  • High (large scale DL models)
Inference Data Size
  • Small (<1 GB)
  • Medium (<100 GB)
  • Large (<1 TB)
  • Hyperscale (>1 TB)
Inference Frequency
  • Hourly
  • Daily
  • Weekly
  • Monthly
Job Concurrency
  • 1
  • <10
  • <100
  • >100

Table 2. Key considerations when architecting for batch inference jobs

Real world Batch Inference use case and architecture

Often customers in certain domains such as advertising and marketing or healthcare must make predictions on hyperscale datasets. This requires deploying an inference pipeline that can complete several thousand inference jobs on extremely large datasets. The individual models used are typically of low complexity from a compute perspective. They could include a combination of various algorithms implemented in scikit-learn, XGBoost, and TensorFlow, for example. Most of the complexity in these use cases stems from large volumes of data and the number of concurrent jobs that must run to meet the service level agreement (SLA).

The batch inference architecture for these requirements typically is composed of three layers:

  • Orchestration layer. Manages the submission, scheduling, tracking, and error handling of individual jobs or multi-step pipelines
  • Storage layer. Stores the data that will be inferenced upon
  • Compute layer. Runs the inference job

There are several AWS services available that can be used for each of these architectural layers. The architecture in Figure 1 illustrates a real world implementation. Amazon SageMaker Processing and training services are used for the compute layer and Amazon S3 for the storage layer. Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon DynamoDB are used for the orchestration and job control layer.

Figure 1. Architecture for batch inference at scale with Amazon SageMaker

Orchestration and job control layer. Apache Airflow is used to orchestrate the training and inference pipelines with job metadata captured into DynamoDB. At each step of the pipeline, Airflow updates the status of each model run. A custom Airflow sensor polls the status of each pipeline. It advances the pipeline with the successful completion of each step, or resubmits a job in case of failure.
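The polling behaviour described above can be sketched in plain Python. This is not Airflow code: the `Status` values, the `next_action` and `run_pipeline` helpers, the retry limit of three attempts, and the in-memory `job_table` standing in for DynamoDB are all assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def next_action(status, attempts, max_attempts=3):
    """Decide what the orchestrator does after polling one pipeline step."""
    if status is Status.SUCCEEDED:
        return "advance"
    if status is Status.FAILED:
        # The retry limit is an assumption; the post does not state one.
        return "resubmit" if attempts < max_attempts else "give_up"
    return "wait"  # still pending or running: poll again later


def run_pipeline(steps, poll):
    """Walk the steps in order; poll(step, attempt) must return a Status."""
    job_table = {}  # hypothetical stand-in for the DynamoDB job metadata
    for step in steps:
        attempts = 1
        while True:
            status = poll(step, attempts)
            job_table[step] = {"status": status.value, "attempts": attempts}
            action = next_action(status, attempts)
            if action == "advance":
                break
            if action == "give_up":
                return job_table
            if action == "resubmit":
                attempts += 1
            # "wait": fall through and poll again
    return job_table
```

A real sensor would sleep between polls and would read the status from the job service rather than from a callback; the callback keeps the sketch self-contained.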

Compute layer. SageMaker Processing is used as the compute option for running the inference workload. SageMaker has a purpose-built batch transform feature for running batch inference jobs. However, this feature often requires additional pre- and post-processing steps to get the data into the appropriate input and output format. SageMaker Processing offers a general purpose managed compute environment to run a custom batch inference container with a custom script. In the architecture, the processing script takes the input location of the model artifact generated by a SageMaker training job and the location of the inference data, and performs pre- and post-processing along with model inference.

Storage layer. Amazon S3 is used to store the large input dataset and the output inference data. The ShardedByS3Key data distribution strategy distributes the files across multiple nodes within a processing cluster. With this option enabled, SageMaker Processing will automatically copy a different subset of input files into each node of the processing job. This way you can horizontally scale batch inference jobs by requesting a higher number of instances when configuring the job.
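As a mental model of sharded distribution, the sketch below assigns each object key to exactly one instance. The round-robin rule and the `shard_by_key` helper are assumptions for illustration only; the actual assignment is made by SageMaker and may differ.

```python
def shard_by_key(keys, instance_count):
    """Split object keys into disjoint subsets, one per instance.

    Round-robin over the sorted keys is an illustrative assumption,
    not SageMaker's documented algorithm.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical input file names
keys = [f"part-{n:04d}.parquet" for n in range(5)]
shards = shard_by_key(keys, instance_count=2)
```

The point of the sketch is that every file lands on one node and no node sees another node's files, so adding instances shrinks the per-node share without any coordination between workers.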

One caveat of this approach is that while many ML algorithms utilize multiple CPU cores during training, only one core is utilized during inference. This can be rectified by using Python’s native concurrency and parallelism frameworks such as concurrent.futures. The following pseudo-code illustrates how you can distribute the inference workload across all instance cores. This assumes the SageMaker Processing job has been configured to copy the input files into the /opt/ml/processing/input directory.

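from concurrent.futures import ProcessPoolExecutor, as_completed
from glob import glob
from multiprocessing import cpu_count
import os

import joblib
import pandas as pd


def inference_fn(model_dir, file_path, output_dir):
    # Load the model artifact and score a single input file.
    model = joblib.load(f"{model_dir}/model.joblib")
    data = pd.read_parquet(file_path)
    data["prediction"] = model.predict(data)

    # Write the scored file under the same name in the output directory.
    # (The write step is assumed; the listing returns the path without showing it.)
    output_path = f"{output_dir}/{os.path.basename(file_path)}"
    data.to_parquet(output_path)

    return output_path


input_files = glob("/opt/ml/processing/input/*")
model_dir = "/opt/ml/model"
output_dir = "/opt/ml/output"

# One worker process per core; each input file is an independent task.
with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
    futures = [executor.submit(inference_fn, model_dir, file_path, output_dir) for file_path in input_files]

# Collect output paths as jobs finish; result() re-raises any worker error.
results = []
for future in as_completed(futures):
    results.append(future.result())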

In this blog post, we described ML inference options and use cases. We primarily focused on batch inference and reviewed key challenges faced when performing batch inference at scale. We provided a mental model of some key considerations and best practices to consider as you make various architecture decisions. We illustrated these considerations with a real world use case and an architecture pattern to perform batch inference at scale. This pattern can be extended to other choices of compute, storage, and orchestration services on AWS to build large-scale ML inference solutions.

Open-Sourcing a Monitoring GUI for Metaflow

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/open-sourcing-a-monitoring-gui-for-metaflow-75ff465f0d60

Open-Sourcing a Monitoring GUI for Metaflow, Netflix’s ML Platform

tl;dr Today, we are open-sourcing a long-awaited GUI for Metaflow. The Metaflow GUI allows data scientists to monitor their workflows in real-time, track experiments, and see detailed logs and results for every executed task. The GUI can be extended with plugins, allowing the community to build integrations to other systems, custom visualizations, and embed upcoming features of Metaflow directly into its views.

Metaflow is a full-stack framework for data science that we started developing at Netflix over four years ago and which we open-sourced in 2019. It allows data scientists to define ML workflows, test them locally, scale-out to the cloud, and deploy to production in idiomatic Python code. Since open-sourcing, the Metaflow community has been growing quickly: it is now the 7th most starred active project on Netflix’s GitHub account with nearly 4800 stars. Outside Netflix, Metaflow is used to power machine learning in production by hundreds of companies across industries from bioinformatics to real estate.

Since its inception, Metaflow has been a command-line-centric tool. It makes it easy for data scientists to express even complex machine learning applications in idiomatic Python, test them locally, or scale them out in the cloud — all using their favorite IDEs and terminals. Following our culture of freedom and responsibility, Metaflow grants data scientists the freedom to choose the right modeling approach, handle data and features flexibly, and construct workflows easily while ensuring that the resulting project executes responsibly and robustly on the production infrastructure.

As the number and criticality of projects running on Metaflow increased (some of which are very central to our business), our ML platform team started receiving an increasing number of support requests. Frequently, the questions were of the nature “can you help me understand why my flow takes so long to execute” or “how can I find the logs for a model that failed last night.” Technically, Metaflow provides a Python API that allows the user to inspect all details, e.g., in a notebook, but writing code in a notebook to answer basic questions like this felt like overkill and unnecessarily tedious. After observing the situation for months, we started forming an understanding of the kind of new user interface that could address the growing needs of our users.

Requirements for a Metaflow GUI

Metaflow is a human-centered system by design. We consider our Python API and the CLI to be integral parts of the overall user interface and user experience, which singularly focuses on making it easier to build production-ready ML projects from scratch. In our approach, Python code provides a highly expressive and productive user interface for expressing complex business logic, such as ML models and workflows. At the same time, the CLI allows users to execute specific commands quickly and even automate common actions. When it comes to complex, real-life development work like this, it would be hard to achieve the same level of productivity on a graphical user interface.

However, textual UIs are quite lacking when it comes to discoverability and getting a holistic understanding of the system’s state. The questions we were hearing reflected this gap: we were lacking a user interface that would allow the users, quite simply, to figure out quickly what is happening in their Metaflow projects.

Netflix has a long history of developing innovative tools for observability, so when we began to specify requirements for the new GUI, we were able to leverage experiences from the previous GUIs built for other use cases, as well as real-life user stories from Metaflow users. We wanted to scope the GUI tightly, focusing on a specific gap in the Metaflow experience:

  1. The GUI should allow the users to see what flows and tasks are executing and what is happening inside them. Notably, we didn’t want to replace any of the functionality in the Metaflow APIs or CLI with the GUI — just to complement them. This meant that the GUI would be read-only: all actions like writing code and starting executions should happen on the users’ IDE and terminal as before. We also had no need to build a model-monitoring GUI yet, which is a wholly separate problem domain.
  2. The GUI would be targeted at professional data scientists. Instead of a fancy GUI for demos and presentations, we wanted a serious productivity tool with carefully thought-out user workflows that would fit seamlessly into our toolchain of data science. This requires attention to small details: for instance, users should be able to copy a link to any view in the GUI and share it e.g., on Slack, for easy collaboration and support (or to integrate with the Metaflow Slack bot). And, there should be natural affordances for navigating between the CLI, the GUI, and notebooks.
  3. The GUI should be scalable and snappy: it should handle our existing repository consisting of millions of runs, some of which contain tens of thousands of tasks without hiccups. Based on our experiences with other GUIs operating at Netflix-scale, this is not a trivial requirement: scalability needs to be baked into the design from the very beginning. Sluggish GUIs are hard to debug and fix afterwards, and they can have a significantly negative impact on productivity.
  4. The GUI should integrate well with other GUIs. A modern ML stack consists of many independent systems like data warehouses, compute layers, model serving systems, and, in particular, notebooks. It should be possible to find runs and tasks of interest in the Metaflow GUI and use a task-specific view to jump to other GUIs for further information. Our landscape of tools is constantly evolving, so we didn’t want to hardcode these links and views in the GUI itself. Instead, following the integration-friendly ethos of Metaflow, we want to embed relevant information in the GUI as plugins.
  5. Finally, we wanted to minimize the operational overhead of the GUI. In particular, under no circumstances should the GUI impact Metaflow executions. The GUI backend should be a simple service, optionally sitting alongside the existing Metaflow metadata service, providing a read-only, real-time view to the stored state. The frontend side should be easily extensible and maintainable, suggesting that we wanted a modern React app.

Monitoring GUI for Metaflow

As our ML Platform team had limited frontend resources, we reached out to Codemate to help with the implementation. As often happens in software engineering projects, the project took longer than expected to finish, mostly because tracking and visualizing thousands of concurrent objects in real-time in a highly distributed environment is a surprisingly non-trivial problem (duh!). After countless iterations, we are finally very happy with the outcome, which we have now used in production for a few months.

When you open the GUI, you see an overview of all flows and runs, both current and historical, which you can group and filter in various ways:

Runs Grouped by flows

We can use this view for experiment tracking: Metaflow records every execution automatically, so data scientists can track all their work using this view. Naturally, the view can be grouped by user. They can also tag their runs and filter the view by tags, allowing them to focus on particular subsets of experiments.

After you click a specific run, you see all its tasks on a timeline:

Timeline view for a run

The timeline view is extremely useful in understanding performance bottlenecks, distribution of task runtimes, and finding failed tasks. At the top, you can see global attributes of the run, such as its status, start time, parameters etc. You can click a specific task to see more details:

Task view

This task view shows logs produced by a task, its results, and optionally links to other systems that are relevant to the task. For instance, if the task had deployed a model to a model serving platform, the view could include a link to a UI used for monitoring microservices.

As specified in our requirements, the GUI should work well with the Metaflow CLI. To facilitate this, the top bar includes a navigation component where the user can copy-paste any pathspec, i.e., a path to any object in the Metaflow universe. Pathspecs are prominently shown in the CLI output. This way, the user can easily move from the CLI to the GUI to observe runs and tasks in detail.
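To make the pathspec idea concrete, here is a minimal sketch that splits one into its parts, assuming the usual FlowName/RunID/StepName/TaskID layout. The `Pathspec` type, the `parse_pathspec` helper, and the example flow name are hypothetical and are not part of Metaflow’s API.

```python
from typing import NamedTuple, Optional

LEVELS = ("flow", "run", "step", "task")


class Pathspec(NamedTuple):
    flow: str
    run: Optional[str] = None
    step: Optional[str] = None
    task: Optional[str] = None

    @property
    def kind(self):
        """Name of the deepest level this pathspec points at."""
        depth = sum(part is not None for part in self)
        return LEVELS[depth - 1]


def parse_pathspec(text):
    """Split 'Flow/Run/Step/Task' (or any prefix of it) into named parts."""
    parts = text.strip().strip("/").split("/")
    if not 1 <= len(parts) <= 4 or not all(parts):
        raise ValueError(f"not a valid pathspec: {text!r}")
    return Pathspec(*parts)


spec = parse_pathspec("TrainingFlow/1234/train/5678")  # hypothetical pathspec
```

Shorter prefixes address coarser objects: `parse_pathspec("TrainingFlow/1234").kind` is `"run"`.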

While the CLI is great, it is challenging to visualize flows in it. Each flow can be represented as a Directed Acyclic Graph (DAG), and so the GUI provides a much better way to visualize a flow. The DAG view presents all the steps of a flow and how they are related. Each step may have developer comments. Steps are colored to indicate their current state. Split steps are grouped by shaded boxes, while steps that participated in a foreach are grouped by a double-shaded box. Clicking on a step will take you to the Task view.
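A flow’s DAG structure can be illustrated with the standard library alone. The sketch below groups the steps of a made-up flow into levels that could run in parallel; the step names and the `execution_levels` helper are assumptions, and this is not how the GUI itself is implemented.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def execution_levels(next_steps):
    """Group steps into levels; steps in one level have no mutual dependencies."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors = {step: set() for step in next_steps}
    for step, successors in next_steps.items():
        for successor in successors:
            predecessors.setdefault(successor, set()).add(step)

    sorter = TopologicalSorter(predecessors)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


# Hypothetical flow: a split into two branches followed by a join.
flow = {
    "start": ["train_a", "train_b"],
    "train_a": ["join"],
    "train_b": ["join"],
    "join": ["end"],
    "end": [],
}
levels = execution_levels(flow)
```

Here the two branch steps end up in the same level, which mirrors how a split is drawn side by side in a DAG view.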

DAG View

Users at different organizations will likely have some special use cases that are not directly supported. The Metaflow GUI is extensible through its plugin API. For example, Netflix has its container orchestration platform called Titus. Users can configure tasks to utilize Titus to scale up or out. When failures happen, users will need to access their Titus containers for more information, and within the task view, a simple plugin provides a link for further troubleshooting.

Example task-level plugin

Try it at home!

We know that our user stories and requirements for a Metaflow GUI are not unique to Netflix. A number of companies in the Metaflow community have requested a GUI for Metaflow in the past. To support the thriving community and invite 3rd party contributions to the GUI, we are open-sourcing our Monitoring GUI for Metaflow today!

You can find detailed instructions for how to deploy the GUI here. If you want to see the GUI in action before deploying it, Outerbounds, a new startup founded by our ex-colleagues, has deployed a public demo instance of the GUI. Outerbounds also hosts an active Slack community of Metaflow users where you can find support for GUI-related issues and share feedback and ideas for improvement.

With the new GUI, data scientists don’t have to fly blind anymore. Instead of reaching out to a platform team for support, they can easily see the state of their workflows on their own. We hope that Metaflow users outside Netflix will find the GUI equally beneficial, and companies will find creative ways to improve the GUI with new plugins.

For more context on the development process and motivation for the GUI, you can watch this recording of the GUI launch meetup.

Open-Sourcing a Monitoring GUI for Metaflow was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Using Machine Learning to Guess PINs from Video

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/10/using-machine-learning-to-guess-pins-from-video.html

Researchers trained a machine-learning system on videos of people typing their PINs into ATMs:

By using three tries, which is typically the maximum allowed number of attempts before the card is withheld, the researchers reconstructed the correct sequence for 5-digit PINs 30% of the time, and reached 41% for 4-digit PINs.

This works even if the person is covering the pad with their hands.

The article doesn’t contain a link to the original research. If someone knows it, please put it in the comments.

Slashdot thread.

Learn the fundamentals of AI and machine learning with our free online course

Post Syndicated from Michael Conterio original https://www.raspberrypi.org/blog/fundamentals-ai-machine-learning-free-online-course/

Join our free online course Introduction to Machine Learning and AI to discover the fundamentals of machine learning and learn to train your own machine learning models using free online tools.

Drawing of a machine learning robot helping a human identify spam at a computer.

Although artificial intelligence (AI) was once the province of science fiction, these days you’re very likely to hear the term in relation to new technologies, whether that’s facial recognition, medical diagnostic tools, or self-driving cars, which use AI systems to make decisions or predictions.

By the end of this free online course, you will have an appreciation for what goes into machine learning and artificial intelligence systems — and why you should think carefully about what comes out.

Machine learning — a brief overview

You’ll also often hear about AI systems that use machine learning (ML). Very simply, we can say that programs created using ML are ‘trained’ on large collections of data to ‘learn’ to produce more accurate outputs over time. One rather funny application you might have heard of is the ‘muffin or chihuahua?’ image recognition task.

Drawing of a machine learning Mars rover trying to decide whether it is seeing an alien or a rock.

More precisely, we would say that an ML algorithm builds a model, based on large collections of data (the training data), without being explicitly programmed to do so. The model is ‘finished’ when it makes predictions or decisions with an acceptable level of accuracy. (For example, it rarely mistakes a muffin for a chihuahua in a photo.) It is then considered to be able to make predictions or decisions using new data in the real world.

It’s important to understand AI and ML — especially for educators

But how does all this actually work? If you don’t know, it’s hard to judge what the impacts of these technologies might be, and how we can be sure they benefit everyone — an important discussion that needs to involve people from across all of society. Not knowing can also be a barrier to using AI, whether that’s for a hobby, as part of your job, or to help your community solve a problem.

Some things that machine learning and AI systems can be built into: streetlamps, waste collecting vehicles, cars, traffic lights.

For teachers and educators it’s particularly important to have a good foundational knowledge of AI and ML, as they need to teach their learners what the young people need to know about these technologies and how they impact their lives. (We’ve also got a free seminar series about teaching these topics.)

To help you understand the fundamentals of AI and ML, we’ve put together a free online course: Introduction to Machine Learning and AI. Over four weeks, at two hours per week, you’ll learn how machine learning can be used to solve problems, without going too deeply into the mathematical details. You’ll also get to grips with the different ways that machines ‘learn’, and you will try out online tools such as Machine Learning for Kids and Teachable Machine to design and train your own machine learning programs.

What types of problems and tasks are AI systems used for?

As well as finding out how these AI systems work, you’ll look at the different types of tasks that they can help us address. One of these is classification — working out which group (or groups) something fits in, such as distinguishing between positive and negative product reviews, identifying an animal (or a muffin) in an image, or spotting potential medical problems in patient data.

You’ll also learn about other types of tasks ML programs are used for, such as regression (predicting a numerical value from a continuous range) and knowledge organisation (spotting links between different pieces of data or clusters of similar data). Towards the end of the course you’ll dive into one of the hottest topics in AI today: neural networks, which are ML models whose design is inspired by networks of brain cells (neurons).
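To give a feel for what regression means in practice, here is a minimal sketch that fits a straight line to a handful of points using ordinary least squares. The numbers are made up for illustration and are not taken from the course.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data that happens to lie exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Use the fitted model to predict a value for an input it has not seen
prediction = slope * 5 + intercept
```

The ‘model’ here is just the two numbers `slope` and `intercept`: they come from the data rather than from a rule someone wrote by hand.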

drawing of a small machine learning neural network.

Before an ML program can be trained, you need to collect data to train it with. During the course you’ll see how tools from statistics and data science are important for ML — but also how ethical issues can arise both when data is collected and when the outputs of an ML program are used.

By the end of the course, you will have an appreciation for what goes into machine learning and artificial intelligence systems — and why you should think carefully about what comes out.

Sign up to the course today, for free

The Introduction to Machine Learning and AI course is open for you to sign up to now. Sign-ups will pause after 12 December. Once you sign up, you’ll have access for six weeks. During this time you’ll be able to interact with your fellow learners, and before 25 October, you’ll also benefit from the support of our expert facilitators. So what are you waiting for?

Share your views as part of our research

As part of our research on computing education, we would like to find out about educators’ views on machine learning. Before you start the course, we will ask you to complete a short survey. As a thank you for helping us with our research, you will be offered the chance to take part in a prize draw for a £50 book token!

Learn more about AI, its impacts, and teaching learners about them

To develop your computing knowledge and skills, you might also want to:

If you are a teacher in England, you can develop your teaching skills through the National Centre for Computing Education, which will give you free upgrades for our courses (including Introduction to Machine Learning and AI) so you’ll receive certificates and unlimited access.

The post Learn the fundamentals of AI and machine learning with our free online course appeared first on Raspberry Pi.

Should we teach AI and ML differently to other areas of computer science? A challenge

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/research-seminar-data-centric-ai-ml-teaching-in-school/

Between September 2021 and March 2022, we’re partnering with The Alan Turing Institute to host a series of free research seminars about how to teach AI and data science to young people.

In the second seminar of the series, we were excited to hear from Professor Carsten Schulte, Yannik Fleischer, and Lukas Höper from the University of Paderborn, Germany, who presented on the topic of teaching AI and machine learning (ML) from a data-centric perspective. Their talk raised the question of whether and how AI and ML should be taught differently from other themes in the computer science curriculum at school.

Machine behaviour — a new field of study?

The rationale behind the speakers’ work is a concept they call the hybrid interaction system, referring to the way that humans and machines interact. To explain this concept, Carsten referred to a 2019 article published in Nature by Iyad Rahwan and colleagues: Machine behaviour. The article’s authors propose that the study of AI agents (complex and simple algorithms that make decisions) should be a separate, cross-disciplinary field of study, because of the ubiquity and complexity of AI systems, and because these systems can have both beneficial and detrimental impacts on humanity, which can be difficult to evaluate. (Our previous seminar by Mhairi Aitken highlighted some of these impacts.) The authors state that to study this field, we need to draw on scientific practices from across different fields, as shown below:

Machine behaviour as a field sits at the intersection of AI engineering and behavioural science. Quantitative evidence from machine behaviour studies feeds into the study of the impact of technology, which in turn feeds questions and practices into engineering and behavioural science.
The interdisciplinarity of machine behaviour. (Image taken from Rahwan et al [1])

In establishing their argument, the authors compare the study of animal behaviour and machine behaviour, noting that both fields consider aspects such as mechanism, development, evolution and function. They describe how part of this proposed machine behaviour field may focus on studying individual machines’ behaviour, while collective machines and what they call ‘hybrid human-machine behaviour’ can also be studied. By focusing on the complexities of the interactions between machines and humans, we can think both about machines shaping human behaviour and humans shaping machine behaviour, and a sort of ‘co-behaviour’ as they work together. Thus, the authors conclude that machine behaviour is an interdisciplinary area that we should study in a different way to computer science.

Carsten and his team said that, as educators, we will need to draw on the parameters and frameworks of this machine behaviour field to be able to effectively teach AI and machine learning in school. They argue that our approach should be centred on data, rather than on code. I believe this is a challenge to those of us developing tools and resources to support young people, and that we should be open to these ideas as we forge ahead in our work in this area.

Ideas or artefacts?

In the interpretation of computational thinking that she popularised in 2006, Jeannette Wing describes computational thinking as being about ‘ideas, not artefacts’. When we, the computing education community, started to think about computational thinking, we moved from focusing on specific technology (and how to understand and use it) to the ideas or principles underlying the domain. The challenge now is: have we gone too far in that direction?

Carsten argued that, if we are to understand machine behaviour, and in particular, human-machine co-behaviour, which he refers to as the hybrid interaction system, then we need to be studying artefacts as well as ideas.

Throughout the seminar, the speakers reminded us to keep in mind artefacts, issues of bias, the role of data, and potential implications for the way we teach.

Studying machine learning: a different focus

In addition, Carsten highlighted a number of differences between learning ML and learning other areas of computer science, including traditional programming:

  1. The process of problem-solving is different. Traditionally, we might try to understand the problem, derive a solution in terms of an algorithm, then understand the solution. In ML, the data shapes the model, and we do not need a deep understanding of either the problem or the solution.
  2. Our tolerance of inaccuracy is different. Traditionally, we teach young people to design programs that lead to an accurate solution. However, the nature of ML means that there will be an error rate, which we strive to minimise. 
  3. The role of code is different. Rather than the code doing the work as in traditional programming, the code is only a small part of a real-world ML system. 
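The first two differences can be illustrated with a tiny sketch: instead of a programmer writing the rule, the threshold is chosen from labelled examples, and the result still misclassifies some of them. The data, the single feature, and the `learn_threshold` helper are made up for illustration and are not taken from the seminar.

```python
def learn_threshold(examples):
    """Pick the threshold on one feature that misclassifies the fewest examples.

    Each example is (value, label); the learned rule predicts True when
    value > threshold. This is a one-level decision tree (a 'stump').
    """
    values = sorted({value for value, _ in examples})
    # Candidate thresholds sit halfway between neighbouring observed values.
    candidates = [(low + high) / 2 for low, high in zip(values, values[1:])]

    def errors(threshold):
        return sum((value > threshold) != label for value, label in examples)

    best = min(candidates, key=errors)
    return best, errors(best) / len(examples)


# Made-up data: grams of sugar per portion, and whether the item is a dessert
examples = [
    (2, False), (4, False), (6, False), (5, True),
    (7, True), (9, True), (12, True), (15, True),
]
threshold, error_rate = learn_threshold(examples)
```

Changing the examples changes the threshold without touching the code, and the error rate stays above zero because no single threshold separates this data perfectly.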

These differences imply that our teaching should adapt too.

A graphic demonstrating that in machine learning as compared to other areas of computer science, the process of problem-solving, tolerance of inaccuracy, and role of code is different.

ProDaBi: a programme for teaching AI, data science, and ML in secondary school

In Germany, education is devolved to state governments. Although computer science (known as informatics) was only last year introduced as a mandatory subject in lower secondary schools in North Rhine-Westphalia, where Paderborn is located, it has been taught at the upper secondary levels for many years. ProDaBi is a project that researchers have been running at Paderborn University since 2017, with the aim of developing a secondary school curriculum around data science, AI, and ML.

The ProDaBi curriculum includes:

  • Two modules for 11- to 12-year-olds covering decision trees and data awareness (ethical aspects), introduced this year
  • A short course for 13-year-olds covering aspects of artificial intelligence, through the game Hexapawn
  • A set of modules for 14- to 15-year-olds, covering data science, data exploration, decision trees, neural networks, and data awareness (ethical aspects), using Jupyter notebooks
  • A project-based course for 18-year-olds, including the above topics at a more advanced level, using Codap and Jupyter notebooks to develop practical skills through projects; this course has been running the longest and is currently in its fourth iteration

Although the ProDaBi project site is in German, an English translation is available.

Learning modules developed as part of the ProDaBi project.

Our speakers described example activities from three of the modules:

  • Hexapawn, a two-player game inspired by the work of Donald Michie in 1961. The purpose of this activity is to support learners in reflecting on the way the machine learns. Children can then relate the activity to the behaviour of AI agents such as autonomous cars. An English version of the activity is available.
  • Data cards, a series of activities to teach about decision trees. The cards are designed in a ‘Top Trumps’ style, and based on food items, with unplugged and digital elements. 
  • Data awareness, a module focusing on the amount of data an individual can generate as they move through a city, in this case through the mobile phone network. Children are encouraged to reflect on personal data in the context of the interaction between the human and data-driven artefact, and how their view of the world influences their interpretation of the data that they are given.

Questioning how we should teach AI and ML at school

There was a lot to digest in this seminar: challenging ideas and some new concepts, for me anyway. An important takeaway for me was how much we do not yet know about the concepts and skills we should be teaching in school around AI and ML, and about the approaches that we should be using to teach them effectively. Research such as that being carried out in Paderborn, demonstrating a data-centric approach, can really augment our understanding, and I’m looking forward to following the work of Carsten and his team.

Carsten and colleagues ended with this summary and discussion point for the audience:

“‘AI education’ requires developing an adequate picture of the hybrid interaction system — a kind of data-driven, emergent ecosystem which needs to be made explicitly to understand the transformative role as well as the technological basics of these artificial intelligence tools and how they are related to data science.”

You can catch up on the seminar, including the Q&A with Carsten and his colleagues, here:

Join our next seminar

This seminar really extended our thinking about AI education, and we look forward to introducing new perspectives from different researchers each month. At our next seminar on Tuesday 2 November at 17:00–18:30 BST / 12:00–13:30 EDT / 9:00–10:30 PDT / 18:00–19:30 CEST, we will welcome Professor Matti Tedre and Henriikka Vartiainen (University of Eastern Finland). The two Finnish researchers will talk about emerging trajectories in ML education for K-12. We look forward to meeting you there.

Carsten and his colleagues are also running a series of seminars on AI and data science: you can find out about these on their registration page.

You can increase your own understanding of machine learning by joining our latest free online course!

[1] Rahwan, I., Cebrian, M., Obradovich, N., Bongard, J., Bonnefon, J. F., Breazeal, C., … & Wellman, M. (2019). Machine behaviour. Nature, 568(7753), 477-486.

The post Should we teach AI and ML differently to other areas of computer science? A challenge appeared first on Raspberry Pi.

Machine Learning Prosthetic Arm | The MagPi #110

Post Syndicated from Phil King original https://www.raspberrypi.org/blog/machine-learning-prosthetic-arm-the-magpi-110/

This intelligent arm learns how to move naturally, based on what the wearer is doing, as Phil King discovers in the latest issue of The MagPi, out now.

Known for his robotic creations, popular YouTuber James Bruton is also a keen Iron Man cosplayer, and his latest invention would surely impress Tony Stark: an intelligent prosthetic arm that can move naturally and autonomously, depending on the wearer’s body posture and limb movements.

Equipped with three heavy-duty servos, the prosthetic arm moves naturally based on the data from IMU sensors on the wearer’s other limbs

“It’s a project I’ve been thinking about for a while, but I’ve never actually attempted properly,” James tells us. “I thought it would be good to have a work stream of something that could be useful.”

Motion capture suit

To obtain the body movement data on which to base the arm’s movements, James considered using a brain computer, but this would be unreliable without embedding electrodes in his head! So, he instead opted to train it with machine learning.

For this he created a motion capture suit from 3D-printed parts to gather all the data from his body motions: arms, legs, and head. The suit measures joint movements using rotating pieces with magnetic encoders, along with limb and head positions – via a special headband – using MPU-6050 inertial measurement units and Teensy LC boards.

Part of the motion capture suit, the headband is equipped with an IMU to gather movement data

Collected by a Teensy 4.1, this data is then fed into a machine learning model running on the suit’s Raspberry Pi Zero using AOgmaNeo, a lightweight C++ software library designed to run on low-power devices such as microcontrollers.

“AOgmaNeo is a reinforcement machine learning system which learns what all of the data is doing in relation to itself,” James explains. “This means that you can remove any piece of data and, after training, the software will do its best to replace the missing piece with a learned output. In my case, I’m removing the right arm and using the learned output to drive the prosthetic arm, but it could be any limb.”

James notes that, while AOgmaNeo is actually meant for reinforcement learning, “in this case we know what the output should be rather than it being unknown and learning through binary reinforcement.”

The motion capture suit comprises 3D-printed parts, each equipped with a magnetic rotary encoder, MPU-6050 IMU, and Teensy LC

To train the model, James used distinctive repeated motions, such as walking, so that the prosthetic arm would later be able to predict what it should do from incoming sensor data. He also spent some time standing still so that the arm would know what to do in that situation.

New model arm

With the machine learning model trained, Raspberry Pi Zero can be put into playback mode to control the backpack-mounted arm’s movements intelligently. It can then duplicate what the wearer’s real right arm was doing during training depending on the positions and movements of other body parts.
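
The idea of filling in a missing limb from the rest of the body can be illustrated with a much simpler stand-in technique. The sketch below is not AOgmaNeo and is not James’s code: it uses a plain nearest-neighbour lookup over recorded poses, and the sensor layout and angle values are invented for illustration.

```python
import math

# Invented training recordings: each sample pairs the readings from other limbs
# (left arm angle, left leg angle) with the right arm angle seen at the same
# moment. Angles are in degrees and purely illustrative.
RECORDINGS = [
    ((0.0, 0.0), 0.0),        # standing still: the arm hangs down
    ((30.0, -20.0), -30.0),   # left arm forward: right arm swings back
    ((-30.0, 20.0), 30.0),    # left leg raised: right arm swings forward
]


def predict_right_arm(other_limbs, recordings=RECORDINGS):
    """Return the right arm angle stored with the closest recorded pose."""

    def distance(sample):
        pose, _ = sample
        return math.dist(pose, other_limbs)

    _, right_arm = min(recordings, key=distance)
    return right_arm


# "Playback": the right arm reading is absent, so it is filled in from the
# readings that are still available.
print(predict_right_arm((28.0, -18.0)))
print(predict_right_arm((1.0, 0.5)))
```

A lookup like this only repeats poses it has already seen, which is one reason to record distinctive motions such as walking and standing still during training.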

So, as he demonstrates in his YouTube video, if James starts walking on the spot, the prosthetic arm swings the opposite way to his left arm as he strides along, and moves forward as he raises his left leg. If he stands still, the arm will hang down by his side. The 3D-printed hand was added purely for aesthetic reasons and the fingers don’t move.

Subscribe to James’ YouTube channel

James admits that the project is highly experimental and currently an early work in progress. “I’d like to develop this concept further,” he says, “although the current setup is slightly overambitious and impractical. I think the next step will be to have a simpler set of inputs and outputs.”

While he generally publishes his CAD designs and code, the arm “doesn’t work all that well, so I haven’t this time. AOgmaNeo is open-source, though (free for personal use), so you can make something similar if you wished.” What would you do with an extra arm? 

Get The MagPi #110 NOW!

MagPi 110 Halloween cover

You can grab the brand-new issue right now from the Raspberry Pi Press store, or via our app on Android or iOS. You can also pick it up from supermarkets and newsagents. There’s also a free PDF you can download.

The post Machine Learning Prosthetic Arm | The MagPi #110 appeared first on Raspberry Pi.

What’s a kangaroo?! AI ethics lessons for and from the younger generation

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/ai-ethics-lessons-education-children-research/

Between September 2021 and March 2022, we’re partnering with The Alan Turing Institute to host speakers from the UK, Finland, Germany, and the USA presenting a series of free research seminars about AI and data science education for young people. These rapidly developing technologies have a huge and growing impact on our lives, so it’s important for young people to understand them both from a technical and a societal perspective, and for educators to learn how to best support them to gain this understanding.

Mhairi Aitken.

In our first seminar we were beyond delighted to hear from Dr Mhairi Aitken, Ethics Fellow at The Alan Turing Institute. Mhairi is a sociologist whose research examines social and ethical dimensions of digital innovation, particularly relating to uses of data and AI. You can catch up on her full presentation and the Q&A with her in the video below.

Why we need AI ethics

The increased use of AI in society and industry is bringing some amazing benefits. In healthcare for example, AI can facilitate early diagnosis of life-threatening conditions and provide more accurate surgery through robotics. AI technology is also already being used in housing, financial services, social services, retail, and marketing. Concerns have been raised about the ethical implications of some aspects of these technologies, and Mhairi gave examples of a number of controversies to introduce us to the topic.

“Ethics considers not what we can do but rather what we should do — and what we should not do.”

Mhairi Aitken

One such controversy in England took place during the coronavirus pandemic, when an AI system was used to make decisions about school grades awarded to students. The system’s algorithm drew on grades awarded in previous years to other students of a school to upgrade or downgrade grades given by teachers; this was seen as deeply unfair and raised public consciousness of the real-life impact that AI decision-making systems can have.

An AI system was used in England last year to make decisions about school grades awarded to students — this was seen as deeply unfair.

Another high-profile controversy was caused by biased machine learning-based facial recognition systems and explored in Shalini Kantayya’s documentary Coded Bias. Such facial recognition systems have been shown to be much better at recognising a white male face than a black female one, demonstrating the inequitable impact of the technology.

What should AI be used for?

There is a clear need to consider both the positive and negative impacts of AI in society. Mhairi stressed that using AI effectively and ethically is not just about mitigating negative impacts but also about maximising benefits. She told us that bringing ethics into the discussion means that we start to move on from what AI applications can do to what they should and should not do. To show how ethics can be applied to AI, Mhairi first outlined four key ethical principles:

  • Beneficence (do good)
  • Nonmaleficence (do no harm)
  • Autonomy
  • Justice

Mhairi shared a number of concrete questions that ethics raises about new technologies including AI:

  • How do we ensure the benefits of new technologies are experienced equitably across society?
  • Do AI systems lead to discriminatory practices and outcomes?
  • Do new forms of data collection and monitoring threaten individuals’ privacy?
  • Do new forms of monitoring lead to a Big Brother society?
  • To what extent are individuals in control of the ways they interact with AI technologies or how these technologies impact their lives?
  • How can we protect against unjust outcomes, ensuring AI technologies do not exacerbate existing inequalities or reinforce prejudices?
  • How do we ensure diverse perspectives and interests are reflected in the design, development, and deployment of AI systems? 

Who gets to inform AI systems? The kangaroo metaphor

To mitigate negative impacts and maximise benefits of an AI system in practice, it’s crucial to consider the context in which the system is developed and used. Mhairi illustrated this point using the story of an autonomous vehicle, a self-driving car, developed in Sweden in 2017. It had been thoroughly safety-tested in the country, including tests of its ability to recognise wild animals that may cross its path, for example elk and moose. However, when the car was used in Australia, it was not able to recognise kangaroos that hopped into the road! Because the system had not been tested with kangaroos during its development, it did not know what they were. As a result, the self-driving car’s safety and reliability significantly decreased when it was taken out of the context in which it had been developed, jeopardising people and kangaroos.

A parent kangaroo with a young kangaroo in its pouch stands on grass.
Mitigating negative impacts and maximising benefits of AI systems requires actively involving the perspectives of groups that may be affected by the system, the ‘kangaroos’ in Mhairi’s metaphor.

Mhairi used the kangaroo example as a metaphor to illustrate ethical issues around AI: the creators of an AI system make certain assumptions about what an AI system needs to know and how it needs to operate; these assumptions always reflect the positions, perspectives, and biases of the people and organisations that develop and train the system. Therefore, AI creators need to include metaphorical ‘kangaroos’ in the design and development of an AI system to ensure that their perspectives inform the system. Mhairi highlighted children as an important group of ‘kangaroos’. 

AI in children’s lives

AI may have far-reaching consequences in children’s lives, where it’s being used for decision-making around access to resources and support. Mhairi explained the impact that AI systems are already having on young people’s lives through these systems’ deployment in children’s education, in apps that children use, and in children’s lives as consumers.

A young child sits at a table using a tablet.
AI systems are already having an impact on children’s lives.

Children can be taught not only that AI impacts their lives, but also that it can get things wrong and that it reflects human interests and biases. However, Mhairi was keen to emphasise that we need to find out what children know and want to know before we make assumptions about what they should be taught. Moreover, engaging children in discussions about AI is not only about them learning about AI, it’s also about ethical practice: what can people making decisions about AI learn from children by listening to their views and perspectives?

AI research that listens to children

UNICEF, the United Nations Children’s Fund, has expressed concerns about the impact of new AI technologies used on children and young people. They have developed the UNICEF Requirements for Child-Centred AI.

UNICEF Requirements for Child-Centred AI: Support children’s development and well-being. Ensure inclusion of and for children. Prioritise fairness and non-discrimination for children. Protect children’s data and privacy. Ensure safety for children. Provide transparency, explainability, and accountability for children. Empower governments and businesses with knowledge of AI and children’s rights. Prepare children for present and future developments in AI. Create an enabling environment for child-centred AI. Engage in digital cooperation.
UNICEF’s requirements for child-centred AI, as presented by Mhairi.

Together with UNICEF, Mhairi and her colleagues working on the Ethics Theme in the Public Policy Programme at The Alan Turing Institute are engaged in new research to pilot UNICEF’s Child-Centred Requirements for AI, and to examine how these impact public sector uses of AI. A key aspect of this research is to hear from children themselves and to develop approaches to engage children to inform future ethical practices relating to AI in the public sector. The researchers hope to find out how we can best engage children and ensure that their voices are at the heart of the discussion about AI and ethics.

We all learned a tremendous amount from Mhairi and her work on this important topic. After her presentation, we had a lively discussion where many of the participants relayed the conversations they had had about AI ethics and shared their own concerns and experiences and many links to resources. The Q&A with Mhairi is included in the video recording.

What we love about our research seminars is that everyone attending can share their thoughts, and as a result we learn so much from attendees as well as from our speakers!

It’s impossible to cover more than a tiny fraction of the seminar here, so I do urge you to take the time to watch the seminar recording. You can also catch up on our previous seminars through our blogs and videos.

Join our next seminar

We have six more seminars in our free series on AI, machine learning, and data science education, taking place every first Tuesday of the month. At our next seminar on Tuesday 5 October at 17:00–18:30 BST / 12:00–13:30 EDT / 9:00–10:30 PDT / 18:00–19:30 CEST, we will welcome Professor Carsten Schulte, Yannik Fleischer, and Lukas Höper from the University of Paderborn, Germany, who will be presenting on the topic of teaching AI and machine learning (ML) from a data-centric perspective (find out more here). Their talk will raise the questions of whether and how AI and ML should be taught differently from other themes in the computer science curriculum at school.

Sign up now and we’ll send you the link to join on the day of the seminar — don’t forget to put the date in your diary.

I look forward to meeting you there!

In the meantime, we’re offering a brand-new, free online course that introduces machine learning with a practical focus — ideal for educators and anyone interested in exploring AI technology for the first time.

The post What’s a kangaroo?! AI ethics lessons for and from the younger generation appeared first on Raspberry Pi.

How to improve visibility into AWS WAF with anomaly detection

Post Syndicated from Cyril Soler original https://aws.amazon.com/blogs/security/how-to-improve-visibility-into-aws-waf-with-anomaly-detection/

When your APIs are exposed on the internet, they naturally face unpredictable traffic. AWS WAF helps protect your application’s API against common web exploits, such as SQL injection and cross-site scripting. In this blog post, you’ll learn how to automatically detect anomalies in the AWS WAF metrics to improve your visibility into AWS WAF activity, identify malicious activity, and simplify your investigations. The service that this solution uses to detect anomalies is Amazon Lookout for Metrics.

Lookout for Metrics is a service you can use to monitor business or operational metrics such as successful or failed HTTP requests and detect anomalies by using machine learning (ML). You can configure Lookout for Metrics to monitor different data sources that contain AWS WAF metrics, including Amazon CloudWatch. Lookout for Metrics can also take actions such as publishing findings in AWS Security Hub.

Solution overview

The solution in this blog post uses Amazon API Gateway to serve a simple REST API. AWS WAF protects API Gateway with AWS Managed Rules for AWS WAF. Amazon Lookout for Metrics actively detects unusual patterns in AWS WAF rule actions and sends a finding to Security Hub when suspicious activity is detected. Figure 1 shows the solution architecture.

Because AWS WAF also integrates with Application Load Balancer, Amazon CloudFront distributions, and AWS AppSync GraphQL APIs, this solution applies to these services too.

Figure 1: Solution architecture


The workflow of the solution is as follows:

  1. An HTTP request reaches the API Gateway endpoint.
  2. AWS WAF analyzes the HTTP request using the configured rules.
  3. Amazon CloudWatch collects action metrics for each rule that is configured in AWS WAF.
  4. Amazon Lookout for Metrics monitors CloudWatch metrics, selects the best ML algorithm, and trains the ML model.
  5. Lookout for Metrics detects outliers and provides a severity score to diagnose the issue.
  6. Lookout for Metrics invokes an AWS Lambda function when an anomaly is detected.
  7. The Lambda function sends a finding to Security Hub for further analysis.

Let’s take a detailed look at the AWS services that you will use in this solution.

Amazon API Gateway

Amazon API Gateway is a serverless API management service that supports mock integrations for API methods. Using a mock integration is the easiest and most cost-effective way to implement this solution, but you can also use Amazon CloudFront, AWS AppSync GraphQL API, or Application Load Balancer to implement it in your workload.


AWS WAF

AWS WAF is a web application firewall you can associate with API Gateway for REST APIs, Amazon CloudFront, AWS AppSync for GraphQL API, or Application Load Balancer. AWS WAF is integrated with other AWS services such as CloudWatch. AWS WAF uses rules to detect common web exploits in the incoming HTTP requests. You can configure your own rules, or use managed rulesets from AWS or from a third-party vendor. In this solution, you use AWS Managed Rules, which includes the CrossSiteScripting_QUERYARGUMENTS rule.

Amazon CloudWatch

Amazon CloudWatch is a monitoring and observability service. CloudWatch receives specific metrics from AWS WAF every 5 minutes. In particular, for each AWS WAF rule, CloudWatch provides PassedRequests, BlockedRequests, and CountedRequests metrics.
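
To look at these metrics yourself, you could query CloudWatch with the AWS SDK for Python. The following is a minimal sketch: the helper names are hypothetical, and the dimension values are placeholders that you would replace with your own web ACL, rule, and Region names. It assumes the metric is published in the AWS/WAFV2 namespace with the Region, Rule, and WebACL dimensions used later in this post.

```python
from datetime import datetime, timedelta, timezone


def blocked_requests_query(web_acl, rule, region, hours=1):
    """Build keyword arguments for the CloudWatch get_metric_statistics call."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/WAFV2",
        "MetricName": "BlockedRequests",
        "Dimensions": [
            {"Name": "WebACL", "Value": web_acl},
            {"Name": "Rule", "Value": rule},
            {"Name": "Region", "Value": region},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # seconds, matching the 5-minute period described above
        "Statistics": ["Sum"],
    }


def fetch_blocked_requests(query):
    """Run the query; needs boto3 and AWS credentials when it is called."""
    import boto3  # imported lazily so the helper above has no AWS dependency

    response = boto3.client("cloudwatch").get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

Calling fetch_blocked_requests(blocked_requests_query("my-web-acl", "my-rule", "us-east-1")) would return the data points for the last hour, where the three names are placeholders.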

Amazon Lookout for Metrics

Amazon Lookout for Metrics uses machine learning (ML) algorithms to automatically detect and diagnose anomalies in your metrics. By using CloudWatch metrics as a data source for Lookout for Metrics, you can apply one of the Lookout for Metrics ML models to detect anomalies more quickly. In addition, you can provide feedback on detected anomalies to help improve the model accuracy over time. Lookout for Metrics is available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm) AWS Regions.

AWS Lambda

In this solution, you use an AWS Lambda function as an alert mechanism for Lookout for Metrics. When the machine learning model detects an outlier, it invokes the Lambda function, which runs custom code. The Lambda function then imports the anomaly as a finding to Security Hub.

AWS Security Hub

In this solution, you use AWS Security Hub as a centralized way to manage security findings. This integration has the advantage of providing a common place for the security team to diagnose security findings from various sources, and uniformly integrates with your existing Security Information and Event Management (SIEM) system.


Prerequisites

This solution uses Security Hub to collect anomaly detection findings. Before you deploy the solution, you need to enable Security Hub in your AWS account by following the instructions to enable Security Hub manually. After you enable Security Hub, you can optionally select the security standards that are relevant for your workload, as shown in Figure 2.

Figure 2: Manually enabling Security Hub in the AWS Management Console


Deploy the solution

A ready-to-use solution is provided as an AWS Cloud Development Kit (AWS CDK) application in the AWS WAF Anomaly Detection CDK project GitHub code repository. You can clone the GitHub repository and deploy the application by using the AWS CDK for Python.

Important: After you successfully deploy the solution, you should activate the Lookout for Metrics detector. This is not done as part of the CDK deployment. To activate the detector, in the AWS Management Console navigate to Amazon Lookout for Metrics, select the detector the solution created (WAFBlockingRequestDetector), and choose Activate. Alternatively, you can use the following AWS CLI command to activate your detector.

aws lookoutmetrics activate-anomaly-detector --anomaly-detector-arn arn:aws:lookoutmetrics:<REGION_ID>:<ACCOUNT_ID>:AnomalyDetector:WAFBlockingRequestDetector
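
If you prefer the AWS SDK for Python over the CLI, the same activation can be sketched as follows. The two function names are hypothetical; the ARN format is copied from the command above, and the account ID and Region are values you supply.

```python
def detector_arn(region, account_id, name="WAFBlockingRequestDetector"):
    """Assemble the detector ARN in the same format as the CLI example."""
    return f"arn:aws:lookoutmetrics:{region}:{account_id}:AnomalyDetector:{name}"


def activate_detector(region, account_id):
    """Activate the detector; needs boto3 and AWS credentials when it is called."""
    import boto3  # imported lazily so the ARN helper works without boto3

    client = boto3.client("lookoutmetrics", region_name=region)
    client.activate_anomaly_detector(
        AnomalyDetectorArn=detector_arn(region, account_id)
    )
```

Keeping the ARN construction in its own function makes it easy to check the value before you send the request.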

If you don’t want to run the CDK application, you can implement the same solution by using the AWS Management Console. In the following sections, I’ll go through the manual steps you can follow to achieve this.

Create an API to demonstrate the solution

First, you need an HTTP endpoint to protect. AWS WAF is integrated with CloudFront, Application Load Balancer, API Gateway, and AWS AppSync GraphQL API. In this blog post, I recommend a REST API in API Gateway because API Gateway is a fully managed service for creating and managing APIs. In addition, API Gateway provides a mechanism to implement mock APIs.

To build a REST API, follow the instructions for creating a REST API in Amazon API Gateway. After you create the API, create a GET method at the API root level and associate it with a mock integration, as shown in Figure 3. This is just enough to return an HTTP 200 status code for any GET request.

Figure 3: Creating an API with mock integration


Finally, deploy the API under the “prod” stage and keep all the default settings.

Create an AWS WAF web ACL to deploy the managed rules

Now that you’ve created an API in API Gateway, you need to create an AWS WAF web access control list (web ACL) by following the instructions in Creating a web ACL. A web ACL is the top-level configuration object of AWS WAF. This is the collection of AWS WAF rules that you will apply to your API. API Gateway is a regional service, so make sure to create a web ACL in the same AWS Region as the API. After you create the web ACL, add the Core rule set (CRS) rule group from AWS Managed Rules, also called AWSManagedRulesCommonRuleSet, as shown in Figure 4. This rule group contains the CrossSiteScripting_QUERYARGUMENTS rule, which you will use later to demonstrate the anomaly detection.

Figure 4: Adding AWSManagedRulesCommonRuleSet to the AWS WAF web ACL


By observing Web ACL rule capacity units used, you can see that the Core rule set is consuming 700 web ACL capacity units (WCUs). The maximum capacity for a web ACL is 1,500, which is sufficient for most use cases. If you need more capacity, contact the AWS Support Center.

Associate the web ACL with the API deployment

After you create the web ACL, you associate it with the API. To do this, in the AWS WAF console, navigate to the web ACL you just created. On the Associated AWS resources tab, choose Add AWS resources. When prompted, choose the API you created earlier, and then choose Add.

Figure 5: Associating the web ACL with the API


Create a Lambda function to forward the anomaly to Security Hub

It’s useful to get visibility into the anomalies that are detected by the solution, and there are various ways to do that. In this solution, you provide such visibility as findings to Security Hub. Security Hub provides a centralized place to manage different findings from your AWS solutions. It also provides graphical tools to help with diagnostics.

You use a Lambda function that receives each anomaly and imports it into Security Hub. You can find the lookout_alarm Lambda function on GitHub, or follow the instructions to build a Lambda function with Python. You will use this Lambda function to provide additional context enrichment in the finding.

import boto3

securityHub = boto3.client('securityhub')

def lambda_handler(event, context):
    # submit the finding to Security Hub
    result = securityHub.batch_import_findings(Findings = [...])

Before you use this Lambda function, make sure you enable Security Hub.
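
The excerpt above leaves out the content of the finding. As a hedged sketch, and not the code from the lookout_alarm function, the following shows one way to assemble a finding in the AWS Security Finding Format (ASFF) that batch_import_findings expects. The function names, the ID scheme, the finding type, and the mapping from anomaly score to severity label are assumptions made for illustration.

```python
from datetime import datetime, timezone


def severity_label(score):
    """Map a 0-100 anomaly score to an ASFF severity label (mapping is an assumption)."""
    if score >= 90:
        return "CRITICAL"
    if score >= 70:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    if score >= 1:
        return "LOW"
    return "INFORMATIONAL"


def build_finding(account_id, region, detector_arn, metric_name, anomaly_score):
    """Build one ASFF finding as a plain dictionary."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "SchemaVersion": "2018-10-08",
        "Id": f"{detector_arn}/{metric_name}/{now}",  # hypothetical ID scheme
        # Default product ARN for findings that an account imports itself.
        "ProductArn": f"arn:aws:securityhub:{region}:{account_id}:product/{account_id}/default",
        "GeneratorId": detector_arn,
        "AwsAccountId": account_id,
        "Types": ["Unusual Behaviors/Application"],
        "CreatedAt": now,
        "UpdatedAt": now,
        "Severity": {"Label": severity_label(anomaly_score)},
        "Title": f"Anomaly detected in AWS WAF metric {metric_name}",
        "Description": f"Lookout for Metrics reported an anomaly score of {anomaly_score}.",
        "Resources": [{"Type": "Other", "Id": detector_arn}],
    }
```

Inside the Lambda handler, the result could then be passed as securityHub.batch_import_findings(Findings=[build_finding(...)]), with the arguments taken from the alert payload and the function’s execution context.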

Create the Lookout for Metrics detector, dataset, and alarm

Now you have an API that is protected by an AWS WAF web ACL. You also have configured a way to integrate with Security Hub through a Lambda function. The next step is to create a Lookout for Metrics detector and connect all these elements together. The key concepts and terminology of Lookout for Metrics are:

  • Detector – A Lookout for Metrics resource that monitors a dataset and identifies anomalies.
  • Dataset – The detector’s copy of the data that Lookout for Metrics is analyzing.
  • Alert – A mechanism to send a notification or initiate a processing workflow when the detector finds an anomaly.

First, follow the instructions to create a detector. The only information you need to provide is a name and an interval. The interval is the amount of time between two analyses. Your choice of the interval depends upon criteria such as the metrics you are processing, or the retention time of your data. For more information on the detector interval, see Lookout for Metrics quotas. In the example in Figure 6, I chose an interval of 5 minutes, which is the minimum.

Figure 6: Creating an Amazon Lookout for Metrics detector


After you create the detector, follow the instructions to configure a dataset that uses CloudWatch as a data source. Select Create a role in the service role, choose Next, and enter the following parameters:

  • For the CloudWatch namespace, choose AWS/WAFV2.
  • For Dimensions, choose Region, Rule, and WebACL.
  • For Measure, choose BlockedRequests.
  • For Aggregation function, choose SUM.

Figure 7 shows the data source fields that the detector will check for anomalies.
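
The same detector and dataset settings can also be expressed as requests to the Lookout for Metrics API. The sketch below only builds the request parameters, so that you can inspect them before sending anything. The function names and the dataset name are hypothetical, and the IAM role ARN is a value you must supply; it stands in for the service role that the console creates for you.

```python
def detector_request(name="WAFBlockingRequestDetector"):
    """Keyword arguments for the lookoutmetrics create_anomaly_detector call."""
    return {
        "AnomalyDetectorName": name,
        # PT5M is the 5-minute interval chosen in Figure 6.
        "AnomalyDetectorConfig": {"AnomalyDetectorFrequency": "PT5M"},
    }


def metric_set_request(detector_arn, role_arn, name="waf-blocked-requests"):
    """Keyword arguments for the lookoutmetrics create_metric_set call."""
    return {
        "AnomalyDetectorArn": detector_arn,
        "MetricSetName": name,  # hypothetical dataset name
        "MetricSetFrequency": "PT5M",
        "MetricList": [
            {
                "MetricName": "BlockedRequests",
                "AggregationFunction": "SUM",
                "Namespace": "AWS/WAFV2",
            }
        ],
        "DimensionList": ["Region", "Rule", "WebACL"],
        "MetricSource": {"CloudWatchConfig": {"RoleArn": role_arn}},
    }


def create_detector_and_dataset(role_arn):
    """Send both requests; needs boto3 and AWS credentials when it is called."""
    import boto3  # imported lazily so the builders above work without boto3

    client = boto3.client("lookoutmetrics")
    arn = client.create_anomaly_detector(**detector_request())["AnomalyDetectorArn"]
    client.create_metric_set(**metric_set_request(arn, role_arn))
    return arn
```

The parameter values mirror the console choices listed above: the AWS/WAFV2 namespace, the three dimensions, the BlockedRequests measure, and the SUM aggregation.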

Figure 7: Creating an Amazon Lookout for Metrics dataset


Next, create a Lookout for Metrics alert to invoke the Lambda function. To do so, follow the instructions for working with alerts. You provide a name, a channel (the Lambda function), and a severity threshold. One of the main advantages of Lookout for Metrics is the scoring of the detected anomaly, which indicates the severity. Anomalies have a score from 0 to 100. You can set up different alerts with different thresholds that are associated to the same detector. This way, you can provide alerts for different severity levels. In the example in Figure 8, I created a single alert with a severity threshold of 10.
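
To illustrate how several alerts with different thresholds can share one detector, here is a minimal sketch that builds the parameters for the Lookout for Metrics create_alert call. The function name, the alert names, the second threshold of 70, and all ARNs are placeholders chosen for illustration; only the threshold of 10 comes from the example in this post.

```python
def alert_request(name, threshold, detector_arn, lambda_arn, role_arn):
    """Keyword arguments for the lookoutmetrics create_alert call."""
    if not 0 <= threshold <= 100:
        raise ValueError("severity threshold must be between 0 and 100")
    return {
        "AlertName": name,
        "AlertSensitivityThreshold": threshold,
        "AnomalyDetectorArn": detector_arn,
        "Action": {
            "LambdaConfiguration": {"RoleArn": role_arn, "LambdaArn": lambda_arn}
        },
    }


# Two alerts on the same detector, one per severity level (placeholder ARNs).
DETECTOR = "arn:aws:lookoutmetrics:us-east-1:123456789012:AnomalyDetector:WAFBlockingRequestDetector"
FUNCTION = "arn:aws:lambda:us-east-1:123456789012:function:lookout_alarm"
ROLE = "arn:aws:iam::123456789012:role/example-lookout-alert-role"

alerts = [
    alert_request("waf-anomaly-low", 10, DETECTOR, FUNCTION, ROLE),
    alert_request("waf-anomaly-high", 70, DETECTOR, FUNCTION, ROLE),
]
```

Each dictionary could then be passed to boto3.client("lookoutmetrics").create_alert(**request). Both alerts here invoke the same function, but each could just as well point to a different channel.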

Figure 8: Creating an Amazon Lookout for Metrics alert


The last step is to activate the detector, which causes Lookout for Metrics to select an ML model and train it. To do so, choose Activate on the detector details page.

Figure 9: Activating the Amazon Lookout for Metrics detector


Why does this solution use Lookout for Metrics anomaly detection?

Amazon CloudWatch offers native anomaly detection on a given metric. This function is useful to apply statistical and ML algorithms that continuously analyze metrics, determine normal baselines, and identify anomalies with minimal user intervention.

Lookout for Metrics provides a more sophisticated version of anomaly detection, which makes it the better choice for this solution. Lookout for Metrics automatically supports a collection of ML algorithms. Because no single algorithm works for all kinds of data, Lookout for Metrics inspects the data and applies the ML algorithm that suits it to accurately detect anomalies. In addition, Lookout for Metrics groups concurrent anomalies into logical groups, and sends a single alert for the anomaly group rather than separate alerts, so you can see the full picture. Finally, Lookout for Metrics allows you to provide feedback on the detected anomalies, which AWS uses to continuously improve the accuracy and performance of the models.

Publish the value zero in CloudWatch metrics

AWS WAF reports a metric only when its value is nonzero. This means that the BlockedRequests metric isn’t updated if AWS WAF isn’t blocking any requests. In the absence of real HTTP traffic, typically in a testing environment, the value zero must be published explicitly. In production, because AWS WAF is actively blocking illegitimate requests, this publication is not required. To train the ML model in the absence of blocked requests, you need to publish the value zero by calling the PutMetricData CloudWatch API method every 5 minutes.

In my example, I selected a 5-minute period to align with the Lookout for Metrics interval. You can publish a zero value every 5 minutes by using the CloudWatch metrics API, as shown in the following code. The zero value doesn’t affect the SUM statistic and ensures that at least one value is published every 5 minutes. You can use the cloudwatch_zero Lambda function on GitHub to publish the value zero by using the AWS SDK for Python.

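import boto3

cloudwatch = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Publish a zero data point so the metric has a value in every interval.
    result = cloudwatch.put_metric_data(
        Namespace='AWS/WAFV2',  # assumption: namespace of the AWS WAF metrics
        MetricData=[
            {
                'MetricName': 'BlockedRequests',
                'Dimensions': [...],  # elided in the original; supply your own dimensions
                'Value': 0
            }
        ]
    )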

To create a CloudWatch Events rule to schedule the call every 5 minutes

  1. Navigate to the CloudWatch Events console and choose Create Rule.
  2. Choose Schedule, keep the 5-minute default interval, and choose Add target.
  3. Select the name of the function you previously created, and then expand the Configure input section.
  4. Choose Constant (JSON text), as shown in Figure 10. In the text field, paste the following configuration:

  5. Choose Configure details.
  6. Enter a name for your rule, and then choose Create rule.


Figure 10: Creating a CloudWatch Events rule scheduled every 5 minutes


Training time

Before the activated detector attempts to find anomalies, it uses data from several intervals to learn. If no historical data is available, the training process takes approximately one day for a five-minute interval. When you first deploy this solution, you have no historical data in CloudWatch for your AWS WAF resources, and you’re facing a cold start of Lookout for Metrics anomaly detection. Because the Lookout for Metrics detector interval is set to 5 minutes, you have to wait for 25 hours before being able to detect an anomaly. If you deploy the solution against an AWS WAF resource that’s been in production for days, you’ll have a reduced training time.
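To put the 25-hour figure in terms of data points, the short calculation below converts it into a number of 5-minute intervals. It only restates the numbers given above; whether the learning period scales the same way for other interval lengths is not something this sketch assumes.

```python
from datetime import timedelta

interval = timedelta(minutes=5)    # detector interval used in this post
cold_start = timedelta(hours=25)   # waiting time quoted for a cold start

# Dividing one timedelta by another with // yields a whole number of intervals.
intervals_needed = cold_start // interval
print(intervals_needed)
```

In other words, the detector observes 300 intervals before it starts to flag anomalies.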

Test the anomaly detection

After 25 hours, Lookout for Metrics correctly selects an ML model that fits your metrics behavior, and correctly trains it based on your actual data. You can then start to test the anomaly detection. You can use a simple curl command, injecting a JavaScript alert() call in a query parameter as described in the AWS WAF documentation, to invoke the CrossSiteScripting_QUERYARGUMENTS managed rule. Make sure to inject a significant number of requests to ensure detection of blocked requests anomalies.

for i in {1..150}; do
  curl "https://<api_gateway_endpoint>?test=%3Cscript%3Ealert%28%22hello%22%29%3C%2Fscript%3E"
done

After you run the injection script, wait for the system to detect the anomaly. The CloudWatch BlockedRequests metric takes up to 5 minutes to update, and Lookout for Metrics is configured to detect anomalies in the CloudWatch data every 5 minutes. For those reasons, it can take up to 10 minutes to detect the simulated anomaly.
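The query parameter in the injection loop above is simply a percent-encoded JavaScript alert() call. As a small sketch of how that string can be produced, the following uses Python's standard urllib.parse.quote; the variable names are illustrative, and the endpoint placeholder is kept exactly as in the loop.

```python
from urllib.parse import quote

# The payload that the CrossSiteScripting rule is expected to catch.
payload = '<script>alert("hello")</script>'

# safe="" makes quote() encode "/" as well, matching the string in the loop.
encoded = quote(payload, safe="")
url = f"https://<api_gateway_endpoint>?test={encoded}"

print(encoded)
```

Encoding the payload this way keeps the shell from interpreting the angle brackets and quotation marks.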

After detection and processing time, the finding is visible in Security Hub. To view the finding, go to the AWS Management Console, choose Services, choose Security Hub, and then choose Findings.

Figure 11: AWS Security Hub findings


In Figure 11, you can see the new finding, coming from Lookout for Metrics, with a Low severity and an anomaly score of 100. You can use the remediation field to open the Lookout for Metrics console, where you can give feedback on the anomaly detection to improve the model for future detections.

Figure 12: Lookout for Metrics console, Finding view


Figure 12 shows the Lookout for Metrics graphical interface, where you can see the metrics related to the finding. The previous injection script impacted only one metric, but the same setup works to observe anomalies that arise between two or more metrics together. This feature makes diagnosis of issues easier.

For each of the impacted metrics, to confirm that the anomaly is relevant, choose the Yes button next to Is this relevant? above the graph.

Extend the solution

The solution in this post detects anomalies in the AWS WAF blocked request behavior. But you can also configure AWS WAF rule actions to count your requests instead of blocking them. This is usually done on legacy systems or for some particular rules of a managed ruleset that present an incompatibility with your workload. When you configure the rule action as a count, you increase the need for a comprehensive observability approach. By implementing anomaly detection against counted requests, this solution will help you to achieve better observability for your system.

For remediation, you can modify this solution by integrating it with other AWS services. For example, you can integrate the anomaly detection with your own SIEM system, or simply notify your security team distribution list by using Amazon Simple Notification Service (Amazon SNS).

AWS WAF provides additional information in its logs, such as the IP address for the client. To detect anomalies in AWS WAF logs, you can ingest the AWS WAF logs to Amazon Simple Storage Service (Amazon S3), and then use Lookout for Metrics with Amazon S3 as a data source.

Conclusion


AWS WAF is integrated with CloudWatch and provides metrics for passed requests, blocked requests, or counted requests. With Lookout for Metrics, you can detect unexpected behavior in CloudWatch metrics by using a machine learning (ML) model. In this blog, I showed you how to integrate both services to provide AWS WAF with an ML-based anomaly detection mechanism. ML is a way to gain more visibility into your AWS WAF behavior. In addition, you can easily be notified when the system detects abnormal levels of blocked (or counted) requests, in order to take the right remediation action.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS WAF forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Cyril Soler

Cyril is a Senior Solutions Architect at AWS, working with Spain-based enterprises. His interests include security and data protection. He has been passionate about computer science since he was 7. When he’s far from a keyboard, he enjoys mechanics. Cyril holds a Master’s degree from Polytech Marseille, School of Engineering.

IoT gets a machine learning boost, from edge to cloud

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/iot-gets-a-machine-learning-boost-from-edge-to-cloud/

Today, it’s easy to run Edge Impulse machine learning on any operating system, like Raspberry Pi OS, and on every cloud, like Microsoft’s Azure IoT. Evan Rust, Technology Ambassador for Edge Impulse, walks us through it.

Building enterprise-grade IoT solutions takes a lot of practical effort and a healthy dose of imagination. As a foundation, you start with a highly secure and reliable communication between your IoT application and the devices it manages. We picked our favorite integration, the Microsoft Azure IoT Hub, which provides us with a cloud-hosted solution backend to connect virtually any device. For our hardware, we selected the ubiquitous Raspberry Pi 4, and of course Edge Impulse, which will connect to both platforms and extend our showcased solution from cloud to edge, including device authentication, out-of-box device management, and model provisioning.

From edge to cloud – getting started 

Edge machine learning devices fall into two categories: some are able to run very simple models locally, and others have more advanced capabilities that allow them to be more powerful and have cloud connectivity. The second group is often expensive to develop and maintain, as training and deploying models can be an arduous process. That’s where Edge Impulse comes in to help to simplify the pipeline, as data can be gathered remotely, used effortlessly to train models, downloaded to the devices directly from the Azure IoT Hub, and then run – fast.

This reference project will serve you as a guide for quickly getting started with Edge Impulse on Raspberry Pi 4 and Azure IoT, to train a model that detects lug nuts on a wheel and sends alerts to the cloud.

Setting up the hardware

Hardware setup for Edge Impulse Machine Learning
Raspberry Pi 4 forms the base for the Edge Impulse machine learning setup

To begin, you’ll need a Raspberry Pi 4 with an up-to-date Raspberry Pi OS image which can be found here. After flashing this image to an SD card and adding a file named wpa_supplicant.conf

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
country=<Insert 2 letter ISO 3166-1 country code here>

network={
	ssid="<Name of your wireless LAN>"
	psk="<Password for your wireless LAN>"
}

along with an empty file named ssh (both within the /boot directory), you can go ahead and power up the board. Once you’ve successfully SSH’d into the device with 

$ ssh pi@<IP_ADDRESS>

and the password raspberry, it’s time to install the dependencies for the Edge Impulse Linux SDK. Simply run the next three commands to set up the NodeJS environment and everything else that’s required for the edge-impulse-linux wizard:

$ curl -sL https://deb.nodesource.com/setup_12.x | sudo bash -
$ sudo apt install -y gcc g++ make build-essential nodejs sox gstreamer1.0-tools gstreamer1.0-plugins-good gstreamer1.0-plugins-base gstreamer1.0-plugins-base-apps
$ npm config set user root && sudo npm install edge-impulse-linux -g --unsafe-perm

Since this project deals with images, we’ll need some way to capture them. The wizard supports both the Pi Camera modules and standard USB webcams, so make sure to enable the camera module first with 

$ sudo raspi-config

if you plan on using one. With that completed, go to the Edge Impulse Studio and create a new project, then run the wizard with 

$ edge-impulse-linux

and make sure your device appears within the Edge Impulse Studio’s device section after logging in and selecting your project.

Edge Impulse Machine Learning screengrab

Capturing your data

Training accurate machine learning models requires feeding plenty of varied data, which means a lot of images are required. For this use case, I captured around 50 images of a wheel that had lug nuts on it. After I was done, I headed to the Labeling queue in the Data Acquisition page and added bounding boxes around each lug nut within every image, along with every wheel.

Edge Impulse Machine Learning screengrab

To add some test data, I went back to the main Dashboard page and clicked the Rebalance dataset button, which moves 20% of the training data to the test data bin. 
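For readers who want to see what an 80/20 split amounts to, here is a minimal sketch in plain Python. It assumes samples are moved at random, which may differ from how the Rebalance dataset button actually chooses them; the function name and file names are hypothetical.

```python
import random

def rebalance(samples, test_fraction=0.2, seed=0):
    """Move a random test_fraction of the samples into a test set.

    Returns (train, test). Random selection is an assumption of this sketch.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Roughly 50 captured images, as in the text above.
images = [f"image_{i:02d}.jpg" for i in range(50)]
train, test = rebalance(images)
print(len(train), len(test))
```

With 50 images, this leaves 40 for training and 10 for testing.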

Training your models

So now that we have plenty of training data, it’s time to do something with it, namely train a model. The first block in the impulse is an Image Data block, and it scales each image to a size of 320 by 320 pixels. Next, image data is fed to the Image processing block which takes the raw RGB data and derives features from it.

Edge Impulse Machine Learning screengrab

Finally, these features are sent to the Transfer Learning Object Detection model which learns to recognize the objects. I set my model to train for 30 cycles at a learning rate of 0.15, but this can be adjusted to fine-tune the accuracy.

As you can see from the screenshot below, the model I trained was able to achieve an initial accuracy of 35.4%, but after some fine-tuning, it was able to correctly recognize objects at an accuracy of 73.5%.

Edge Impulse Machine Learning screengrab

Testing and deploying your models

In order to verify that the model works correctly in the real world, we’ll need to deploy it to our Raspberry Pi 4. This is a simple task thanks to the Edge Impulse CLI, as all we have to do is run 

$ edge-impulse-linux-runner

which downloads the model and creates a local webserver. From here, we can open a browser tab and visit the address listed after we run the command to see a live camera feed and any objects that are currently detected. 

Integrating your models with Microsoft Azure IoT 

With the model working locally on the device, let’s add an integration with an Azure IoT Hub that will allow our Raspberry Pi to send messages to the cloud. First, make sure you’ve installed the Azure CLI and have signed in using az login. Then get the name of the resource group you’ll be using for the project. If you don’t have one, you can follow this guide on how to create a new resource group. After that, return to the terminal and run the following commands to create a new IoT Hub and register a new device ID:

$ az iot hub create --resource-group <your resource group> --name <your IoT Hub name>
$ az extension add --name azure-iot
$ az iot hub device-identity create --hub-name <your IoT Hub name> --device-id <your device id>

Retrieve the connection string with 

$ az iot hub device-identity connection-string show --device-id <your device id> --hub-name <your IoT Hub name>
Edge Impulse Machine Learning screengrab

and set it as an environment variable with 

$ export IOTHUB_DEVICE_CONNECTION_STRING="<your connection string here>" 

in your Raspberry Pi’s SSH session, as well as 

$ pip install azure-iot-device

to add the necessary libraries. (Note: if you do not set the environment variable or pass it in as an argument, the program will not work!) The connection string contains the information required for the device to establish a connection with the IoT Hub service and communicate with it. You can then monitor output in the Hub with 

$ az iot hub monitor-events --hub-name <your IoT Hub name> --output table

 or in the Azure Portal.
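To make the role of the connection string more concrete, the sketch below splits one into its fields. It assumes the usual semicolon-separated Key=Value layout with HostName, DeviceId, and SharedAccessKey fields; the hub name, device ID, and key shown are made up, and the helper is illustrative rather than part of the Azure SDK.

```python
def parse_connection_string(conn_str):
    """Split a 'Key=Value;Key=Value' connection string into a dict."""
    parts = {}
    for field in conn_str.split(";"):
        if not field:
            continue
        # Split on the first "=" only, because keys often end in "=" padding.
        key, _, value = field.partition("=")
        parts[key] = value
    return parts

# Made-up example values, not real credentials.
example = "HostName=my-hub.azure-devices.net;DeviceId=my-pi;SharedAccessKey=ZmFrZS1rZXk="
info = parse_connection_string(example)
print(info["HostName"], info["DeviceId"])
```

In the real project, the string stays in the IOTHUB_DEVICE_CONNECTION_STRING environment variable, and the azure-iot-device library reads it for you.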

To check that it works, download and run this example and confirm that you can see the test message. For the second half of deployment, we’ll need a way to customize how our model is used within the code. Thankfully, Edge Impulse provides a Python SDK for this purpose. Install it with 

$ sudo apt-get install libatlas-base-dev libportaudio0 libportaudio2 libportaudiocpp0 portaudio19-dev
$ pip3 install edge_impulse_linux -i https://pypi.python.org/simple

There’s some simple code that can be found here on GitHub, and it works by setting up a connection to the Azure IoT Hub and then running the model.

Edge Impulse Machine Learning screengrab

Once you’ve either downloaded the zip file or cloned the repo into a folder, get the model file by running

$ edge-impulse-linux-runner --download modelfile.eim

inside of the folder you just created from the cloning process. This will download a file called modelfile.eim. Now, run the Python program with 

$ python lug_nut_counter.py ./modelfile.eim -c <LUG_NUT_COUNT>

where <LUG_NUT_COUNT> is the correct number of lug nuts that should be attached to the wheel (you might have to use python3 if both Python 2 and 3 are installed).

Now whenever a wheel is detected the number of lug nuts is calculated. If this number falls short of the target, a message is sent to the Azure IoT Hub.

And because messages are only sent when something is wrong, no bandwidth is wasted on empty payloads.
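The decision logic described above can be sketched without any cloud dependency. This is not the code from the project repository: the label names wheel and lug_nut, the shape of the detection list, and the send callback are all assumptions made for illustration.

```python
def missing_lug_nuts(bounding_boxes, expected_count):
    """Return how many lug nuts are missing, or None if no wheel is in view."""
    labels = [box["label"] for box in bounding_boxes]
    if "wheel" not in labels:
        return None
    return max(expected_count - labels.count("lug_nut"), 0)

def handle_frame(bounding_boxes, expected_count, send):
    """Call send() only when a wheel is visible and lug nuts are missing."""
    missing = missing_lug_nuts(bounding_boxes, expected_count)
    if missing:  # None (no wheel) and 0 (all present) both send nothing
        send({"missing_lug_nuts": missing})

# Stand-in for the IoT Hub client: collect messages in a list.
sent = []
frame = [{"label": "wheel"}] + [{"label": "lug_nut"}] * 3
handle_frame(frame, expected_count=5, send=sent.append)  # two missing
handle_frame(frame, expected_count=3, send=sent.append)  # none missing
print(sent)
```

Only the first frame produces a message, which is the bandwidth-saving behaviour described above.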

The possibilities are endless

Imagine utilizing object detection for an industrial task such as quality control on an assembly line, or identifying ripe fruit amongst rows of crops, or detecting machinery malfunction, or remote, battery-powered inferencing devices. Between Edge Impulse, hardware like Raspberry Pi, and the Microsoft Azure IoT Hub, you can design endless models and deploy them on every device, while authenticating each and every device with built-in security.

You can set up individual identities and credentials for each of your connected devices to help retain the confidentiality of both cloud-to-device and device-to-cloud messages, revoke access rights for specific devices, transmit code and services between the cloud and the edge, and benefit from advanced analytics on devices running offline or with intermittent connectivity. And if you’re really looking to scale your operation and enjoy a complete dashboard view of the device fleets you manage, it is also possible to receive IoT alerts in Microsoft’s Connected Field Service from Azure IoT Central – directly.

Feel free to take the code for this project hosted here on GitHub and create a fork or add to it.

The complete project is available here. Let us know your thoughts at [email protected]. There are no limits, just your imagination at work.

The post IoT gets a machine learning boost, from edge to cloud appeared first on Raspberry Pi.

Deter package thieves from your porch with Raspberry Pi

Post Syndicated from Ashley Whittaker original https://www.raspberrypi.org/blog/deter-package-thieves-from-your-porch-with-raspberry-pi/

This Raspberry Pi-based build aims to deter porch pirates from stealing packages left at your front door. In recent times, we’ve all relied on home-delivered goods more than ever, and more often than not we ask our delivery drivers to stash our package somewhere if we’re not home, leaving them vulnerable to thieves.

Watch the full build video: ‘Fighting porch pirates with artificial intelligence (and flour)’

Flashing lights, sirens, flour and sprinklers

When internet shopper and AI project maker Ryder had a package stolen from his porch, he wanted to make sure that didn’t happen again. He figured that package stealers would be deterred by blaring sirens and flashing red lights. He also went one step further, wanting to hamper the thief’s escape with motion-activated water sprinklers and a blast of flour ready to catch them as they run away.

package thief running away
A would-be package thief dropping their swag and running away from the sprinkler

A simple motion detector wouldn’t work because it would set off Ryder’s booby traps whenever an unsuspecting cat or legitimate visitor happened across his porch, or if Ryder himself arrived home and didn’t fancy a watery flour bath. So some machine learning and a Python script needed to be employed.

How does it catch package thieves?

inside the package thief build
It’s what’s on the inside that counts. Us. We’re on the inside.

The camera keeps an eye on Ryder’s porch and is connected wirelessly to a Raspberry Pi 4, which works with a custom TensorFlow machine learning model trained to recognise when a package is or isn’t present. If the system detects a package, it gets ready to deploy the anti-thief traps. The Raspberry Pi sets everything off if it detects that someone other than Ryder has removed the package from the camera’s view.
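The arming and triggering behaviour can be pictured as a tiny state machine. The sketch below is not Ryder's code; the class name and the two boolean inputs stand in for whatever the machine learning model reports on each camera frame.

```python
class PorchGuard:
    """Arms when a package appears; fires if it vanishes and Ryder isn't there."""

    def __init__(self):
        self.armed = False

    def update(self, package_visible, ryder_present):
        """Process one camera frame; return True if the traps should fire."""
        if package_visible:
            self.armed = True
            return False
        if self.armed:
            # The package has just left the camera's view.
            self.armed = False
            return not ryder_present
        return False

guard = PorchGuard()
# (package_visible, ryder_present) for three consecutive frames.
frames = [(False, False), (True, False), (False, False)]
print([guard.update(p, r) for p, r in frames])
```

The traps fire only on the last frame, when an armed package disappears and Ryder is not in view.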

And Ryder had an interesting technique to train the machine learning model to recognise him:

If you want to make your own anti-porch pirate device, Ryder has shared everything you need on GitHub.

Wanna see some cool dogs?

We can always rely on Ryder Calm Down’s YouTube channel for unique and quasi-bonkers builds.

If you’re not familiar with Ryder’s dog-detecting (and happiness-boosting) build, check it out below. We also blogged about this project when we needed a good dopamine boost during lockdown.

The post Deter package thieves from your porch with Raspberry Pi appeared first on Raspberry Pi.

Educating young people in AI, machine learning, and data science: new seminar series

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/ai-machine-learning-data-science-education-seminars/

A recent Forbes article reported that over the last four years, the use of artificial intelligence (AI) tools in many business sectors has grown by 270%. AI has a history dating back to Alan Turing’s work in the 1940s, and we can define AI as the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings.

A woman explains a graph on a computer screen to two men.
Recent advances in computing technology have accelerated the rate at which AI and data science tools are coming to be used.

Four key areas of AI are machine learning, robotics, computer vision, and natural language processing. Other advances in computing technology mean we can now store and efficiently analyse colossal amounts of data (big data); consequently, data science was formed as an interdisciplinary field combining mathematics, statistics, and computer science. Data science is often presented as intertwined with machine learning, as data scientists commonly use machine learning techniques in their analysis.

Venn diagram showing the overlaps between computer science, AI, machine learning, statistics, and data science.
Computer science, AI, statistics, machine learning, and data science are overlapping fields. (Diagram from our forthcoming free online course about machine learning for educators)

AI impacts everyone, so we need to teach young people about it

AI and data science have recently received huge amounts of attention in the media, as machine learning systems are now used to make decisions in areas such as healthcare, finance, and employment. These AI technologies cause many ethical issues, for example as explored in the film Coded Bias. This film describes the fallout of researcher Joy Buolamwini’s discovery that facial recognition systems do not identify dark-skinned faces accurately, and her journey to push for the first-ever piece of legislation in the USA to govern against bias in the algorithms that impact our lives. Many other ethical issues concerning AI exist and, as highlighted by UNESCO’s examples of AI’s ethical dilemmas, they impact each and every one of us.

Three female teenagers and a teacher use a computer together.
We need to make sure that young people understand AI technologies and how they impact society and individuals.

So how do such advances in technology impact the education of young people? In the UK, a recent Royal Society report on machine learning recommended that schools should “ensure that key concepts in machine learning are taught to those who will be users, developers, and citizens” — in other words, every child. The AI Roadmap published by the UK AI Council in 2020 declared that “a comprehensive programme aimed at all teachers and with a clear deadline for completion would enable every teacher confidently to get to grips with AI concepts in ways that are relevant to their own teaching.” As of yet, very few countries have incorporated any study of AI and data science in their school curricula or computing programmes of study.

A teacher and a student work on a coding task at a laptop.
Our seminar speakers will share findings on how teachers can help their learners get to grips with AI concepts.

Partnering with The Alan Turing Institute for a new seminar series

Here at the Raspberry Pi Foundation, AI, machine learning, and data science are important topics both in our learning resources for young people and educators, and in our programme of research. So we are delighted to announce that starting this autumn we are hosting six free, online seminars on the topic of AI, machine learning, and data science education, in partnership with The Alan Turing Institute.

A woman teacher presents to an audience in a classroom.
Everyone with an interest in computing education research is welcome at our seminars, from researchers to educators and students!

The Alan Turing Institute is the UK’s national institute for data science and artificial intelligence and does pioneering work in data science research and education. The Institute conducts many different strands of research in this area and has a special interest group focused on data science education. As such, our partnership around the seminar series enables us to explore our mutual interest in the needs of young people relating to these technologies.

This promises to be an outstanding series drawing from international experts who will share examples of pedagogic best practice […].

Dr Matt Forshaw, The Alan Turing Institute

Dr Matt Forshaw, National Skills Lead at The Alan Turing Institute and Senior Lecturer in Data Science at Newcastle University, says: “We are delighted to partner with the Raspberry Pi Foundation to bring you this seminar series on AI, machine learning, and data science. This promises to be an outstanding series drawing from international experts who will share examples of pedagogic best practice and cover critical topics in education, highlighting ethical, fair, and safe use of these emerging technologies.”

Our free seminar series about AI, machine learning, and data science

At our computing education research seminars, we hear from a range of experts in the field and build an international community of researchers, practitioners, and educators interested in this important area. Our new free series of seminars runs from September 2021 to February 2022, with some excellent and inspirational speakers:

  • Tues 7 September: Dr Mhairi Aitken from The Alan Turing Institute will share a talk about AI ethics, setting out key ethical principles and how they apply to AI before discussing the ways in which these relate to children and young people.
  • Tues 5 October: Professor Carsten Schulte, Yannik Fleischer, and Lukas Höper from Paderborn University in Germany will use a series of examples from their ProDaBi programme to explore whether and how AI and machine learning should be taught differently from other topics in the computer science curriculum at school. The speakers will suggest that these topics require a paradigm shift for some teachers, and that this shift has to do with the changed role of algorithms and data, and of the societal context.
  • Tues 3 November: Professor Matti Tedre and Dr Henriikka Vartiainen from the University of Eastern Finland will focus on machine learning in the school curriculum. Their talk will map the emerging trajectories in educational practice, theory, and technology related to teaching machine learning in K-12 education.
  • Tues 7 December: Professor Rose Luckin from University College London will be looking at the breadth of issues impacting the teaching and learning of AI.
  • Tues 11 January: We’re delighted that Dr Dave Touretzky and Dr Fred Martin (Carnegie Mellon University and University of Massachusetts Lowell, respectively) from the AI4K12 Initiative in the USA will present some of the key insights into AI that the researchers hope children will acquire, and how they see K-12 AI education evolving over the next few years.
  • Tues 1 February: Speaker to be confirmed

How you can join our online seminars

All seminars start at 17:00 UK time (18:00 Central European Time, 12 noon Eastern Time, 9:00 Pacific Time) and take place in an online format, with a presentation, breakout discussion groups, and a whole-group Q&A.

Sign up now and we’ll send you the link to join on the day of each seminar — don’t forget to put the dates in your diary!

In the meantime, you can explore some of our educational resources related to machine learning and data science:

The post Educating young people in AI, machine learning, and data science: new seminar series appeared first on Raspberry Pi.

Hiding Malware in ML Models

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/07/hiding-malware-in-ml-models.html

Interesting research: “EvilModel: Hiding Malware Inside of Neural Network Models”.

Abstract: Delivering malware covertly and detection-evadingly is critical to advanced malware campaigns. In this paper, we present a method that delivers malware covertly and detection-evadingly through neural network models. Neural network models are poorly explainable and have a good generalization ability. By embedding malware into the neurons, malware can be delivered covertly with minor or even no impact on the performance of neural networks. Meanwhile, since the structure of the neural network models remains unchanged, they can pass the security scan of antivirus engines. Experiments show that 36.9MB of malware can be embedded into a 178MB-AlexNet model within 1% accuracy loss, and no suspicious are raised by antivirus engines in VirusTotal, which verifies the feasibility of this method. With the widespread application of artificial intelligence, utilizing neural networks becomes a forwarding trend of malware. We hope this work could provide a referenceable scenario for the defense on neural network-assisted attacks.

News article.

Protecting Personal Data in Grab’s Imagery

Post Syndicated from Grab Tech original https://engineering.grab.com/protecting-personal-data-in-grabs-imagery

Image Collection Using KartaView

A few years ago, we recognised a strong need to better understand the streets where our drivers and clients travel, so that we could better fulfil their needs and adapt quickly to the rapidly changing environment in Southeast Asian cities.

One way to meet that need was to create KartaView, Grab Geo’s platform for geotagged imagery. It supports the collection, indexing, storage, and retrieval of imagery, as well as map data extraction.

KartaView is a public, partially open-sourced product, used both internally and externally by the OpenStreetMap community and other users. As of 2021, KartaView has public imagery in over 100 countries with various coverage degrees, and 60+ cities of Southeast Asia. Check it out at www.kartaview.com.

Figure 1 – KartaView platform

Why Image Blurring is Important

The collected images incidentally capture many people and licence plates, and their privacy is a serious concern. We deeply respect that privacy, so we use image obfuscation as the most effective anonymisation method for ensuring privacy protection.

Because manually annotating the regions of each picture that contain faces and licence plates is impractical, this problem has to be solved with machine learning and engineering techniques. Hence, we detect and blur all faces and licence plates that could be considered personal data.

Figure 2 – Sample blurred picture

In our case, we have a wide range of picture types: regular planar pictures, very wide pictures, and 360-degree pictures in equirectangular format collected with 360-degree cameras. Also, because we are collecting imagery globally, the vehicle types, licence plates, and human environments are quite diverse in appearance, and are not handled well by off-the-shelf blurring software. So we built our own custom blurring solution, which yielded higher accuracy and better overall cost-efficiency for blurring personal data.

Figure 3 – Example of equirectangular image where personal data has to be blurred

Behind the scenes, KartaView has a set of services that derive useful information from the pictures, such as image quality, traffic signs, and roads. Many of them use deep learning algorithms, which could be negatively affected by running over blurred pictures. In fact, based on the assessment we have done so far, the impact is extremely low, similar to the one reported in a well-known study of face obfuscation in ImageNet [9].

Outline of Grab’s Blurring Process

Roughly, the processing steps are the following:

  1. Transform each picture into a set of planar images. This way, all pictures are processed further in the same way, whatever their original format.
  2. Use an object detector able to detect all faces and licence plates in a planar image having a standard field of view.
  3. Transform the coordinates of the detected regions into original coordinates and blur those regions.
Figure 4 - Picture’s processing steps [8]

In the following sections, we describe in detail the interesting aspects of the second step, sharing the challenges and how we solved them. Let’s start with the first and most important part, the dataset.


Our current dataset consists of images from a wide range of cameras, including normal perspective cameras from mobile phones, wide field of view cameras and also 360 degree cameras.

It is the result of a series of data collections contributed by Grab’s data tagging teams, annotated with the two classes that are of interest to us: FACE and LICENSE_PLATE.

The data was collected using Grab internal tools and stored in queryable databases. This makes it possible to revisit the data and correct it if necessary, and lets data engineers select and filter the data of interest.

Dataset Evolution

Each iteration of the dataset was made to address certain issues discovered while having models used in a production environment and observing situations where the model lacked in performance.

                 Dataset v1   Dataset v2   Dataset v3
Nr. of images    15226        17636        30538
Nr. of labels    64119        86676        242534

The first version was basic and used a rough tagging strategy. We quickly noticed that the model trained on it was not detecting some special situations that appeared due to the pandemic: people wearing masks.

This led to another round of data annotation to include those scenarios.
The third iteration addressed a broader range of issues:

  • Small regions of interest (objects far away from the camera)
  • Objects in very dark backgrounds
  • Rotated objects or even upside down
  • Variation of the licence plate design due to images from different countries and regions
  • People wearing masks
  • Faces in the mirror – see below the mirror of the motorcycle
  • The main reason, however, was a scenario where the recording had close-ups of the operator checking the camera, typically (but not only) at the start or end. This led to images with large regions of interest containing the camera operator’s face, which were too large to be detected by the model.

An investigation into the dataset structure, splitting the data into bins based on the bbox sizes (in pixels), made something clear: the dataset was unbalanced.

We made bins for tag sizes with a stride of 100 pixels and went up to the maximum present in the dataset, which was a single sample of size 2000 pixels. The majority of the labels were small, and the larger the size, the fewer tags we had. This made it clear that we would need more targeted annotations to try to balance the dataset.
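The binning described above can be sketched in a few lines of Python. This is only an illustration: the function name, the sample boxes, and the choice of the longer box side as the "size" of a tag are assumptions, not the tooling actually used.

```python
from collections import Counter

def bin_tag_sizes(boxes, stride=100):
    """Count labels per size bin; boxes are (width, height) in pixels.

    The tag size is taken as the longer side of the box, which is an
    assumption made for this sketch.
    """
    counts = Counter()
    for width, height in boxes:
        size = max(width, height)
        # Bin k covers sizes in [k * stride, (k + 1) * stride).
        counts[int(size // stride) * stride] += 1
    return dict(sorted(counts.items()))

# Hypothetical labels: several small tags and one very large one.
boxes = [(20, 30), (45, 60), (80, 95), (150, 120), (2000, 1500)]
histogram = bin_tag_sizes(boxes)
for bin_start, count in histogram.items():
    print(f"{bin_start:>5}-{bin_start + 99:<5} {count}")
```

A histogram that is heavily skewed towards the first bins, as in this toy input, is the kind of imbalance that targeted annotation is meant to correct.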

All these scenarios required the tagging team to revisit the data multiple times and also change the tagging strategy by including more tags that were considered at a certain limit. It also required them to pay more attention to small details that may have been missed in a previous iteration.

Data Splitting

To better understand the strategy chosen for splitting the data we need to also understand the source of the data. The images come from different devices that are used in different geo locations (different countries) and are from a continuous trip recording. The annotation team used an internal tool to visualise the trips image by image and mark the faces and licence plates present in them. We would then have access to all those images and their respective metadata.

The chosen ratios for splitting are:

  • Train 70%
  • Validation 10%
  • Test 20%
Number of train images                 12733
Number of validation images            1682
Number of test images                  3221
Number of labels in train set          60630
Number of labels in validation set     7658
Number of labels in test set           18388

The split is not trivial, as it has to satisfy several requirements:

  • An image can have multiple tags from one or both classes but must belong to just one subset.
  • The tags should be split as close as possible to the desired ratios.
  • Different images can belong to the same trip and be geographically close to each other, so we need to force them into the same subset. This avoids having similar tags in both the train and test subsets, which would result in incorrect evaluations.
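One way to satisfy these three conditions together is a greedy, trip-level assignment. The sketch below is an assumption about how such a split could be done, not the procedure actually used: it treats the trip as the grouping unit, and the tuple layout, function name and sample data are hypothetical.

```python
from collections import defaultdict

def split_by_trip(images, ratios):
    """Assign whole trips to subsets so tag counts approach the given ratios.

    images: list of (image_id, trip_id, tag_count) tuples.
    ratios: dict such as {"train": 0.7, "val": 0.1, "test": 0.2}.
    """
    trips = defaultdict(list)
    for image_id, trip_id, tag_count in images:
        trips[trip_id].append((image_id, tag_count))

    total_tags = sum(tag_count for _, _, tag_count in images)
    assigned_tags = {name: 0 for name in ratios}
    subsets = {name: [] for name in ratios}

    def trip_tags(members):
        return sum(count for _, count in members)

    # Place the largest trips first; each trip goes, as a whole, to the
    # subset that is currently furthest below its target number of tags.
    ordered = sorted(trips.values(), key=trip_tags, reverse=True)
    for members in ordered:
        target = max(
            ratios,
            key=lambda name: ratios[name] * total_tags - assigned_tags[name],
        )
        subsets[target].extend(image_id for image_id, _ in members)
        assigned_tags[target] += trip_tags(members)
    return subsets

# Hypothetical data: three trips with 14, 4 and 2 tags respectively.
images = [("a1", "t1", 7), ("a2", "t1", 7), ("b1", "t2", 4), ("c1", "t3", 2)]
subsets = split_by_trip(images, {"train": 0.7, "val": 0.1, "test": 0.2})
```

Because trips are never divided, the resulting tag ratios are only approximate; the more trips there are relative to their size, the closer the split gets to the desired ratios.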

Data Augmentation

The application of data augmentation plays a crucial role while training the machine learning model. There are mainly three ways in which data augmentation techniques can be applied. They are:

  1. Offline data augmentation – enriching a dataset by physically multiplying some of its images and applying modifications to them.
  2. Online data augmentation – on the fly modifications of the image during train time with configurable probability for each modification.
  3. Combination of both offline and online data augmentation.

In our case, we are using the third option which is the combination of both.

The first method that contributes to offline augmentation is called image view splitting. This is necessary for us due to the different image types: perspective camera images, wide field of view images, and 360 degree images in equirectangular format. All these formats and fields of view, with their respective distortions, would complicate the data and make it hard for the model to generalise and to handle different image types that could be added in the future.

For this, we defined the concept of image views: extracted portions (views) of an image with some predefined properties, for example perspective projections of 75 by 75 degree field of view patches from the original image.

Here we can see a perspective camera image and the image views generated from it:

Figure 5 - Original image
Figure 6 - Two image views generated

The important thing here is that each generated view is an image on its own, with its associated tags. The views also overlap, so the same tag can appear in two views but from different perspectives. This is an indirect outcome of the first offline augmentation.

The second method for offline augmentation is the oversampling of some of the images (views). As mentioned above, we faced the problem of an unbalanced dataset. Specifically, we were missing tags that occupied large regions of the image, and even though our tagging teams tried to annotate as many as they could find, these were still scarce.

As our object detection model is an anchor-based detector, we did not even have enough of them to generate the anchor boxes correctly. This could be clearly seen in the accuracy of the previously trained models, which performed poorly on the bins of large sizes.

By randomly oversampling images that contained big tags, up to a minimum required number, we managed to have better anchors and increase the recall for those scenarios. As described below, the chosen object detector for blurring was YOLOv4 which offers a large variety of online augmentations. The online augmentations used are saturation, exposure, hue, flip and mosaic.
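The random oversampling step can be illustrated with a short sketch. The size threshold, the minimum count, the tuple layout and the function name are all assumptions chosen for the example; the real pipeline may select and duplicate images differently.

```python
import random

def oversample_big_tags(images, min_size, min_count, seed=0):
    """Duplicate images with big tags until at least min_count are present.

    images: list of (image_id, largest_tag_size_in_pixels) tuples.
    Returns a new list containing the originals plus random duplicates.
    """
    rng = random.Random(seed)
    big = [image for image in images if image[1] >= min_size]
    result = list(images)
    if not big:
        return result  # nothing to sample from
    missing = max(0, min_count - len(big))
    # Draw duplicates uniformly at random from the images with big tags.
    result.extend(rng.choice(big) for _ in range(missing))
    return result

# Hypothetical dataset: only two images contain a tag of 800 pixels or more.
images = [("a", 40), ("b", 900), ("c", 60), ("d", 1200)]
balanced = oversample_big_tags(images, min_size=800, min_count=5)
big_after = [image for image in balanced if image[1] >= 800]
```

Duplicated images are identical copies here; in training they would still differ from each other once the online augmentations are applied.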


As of summer 2021, the go-to solution for object detection in images is the convolutional neural network (CNN), a mature solution able to fulfil the needs efficiently.


Most CNN based object detectors have three main parts: Backbone, Neck and (Dense or Sparse Prediction) Heads. From the input image, the backbone extracts features which can be combined in the neck part to be used by the prediction heads to predict object bounding-boxes and their labels.

Figure 7 - Anatomy of one and two-stage object detectors [1]

The backbone is usually a CNN classification network pretrained on some dataset, like ImageNet-1K. The neck combines features from different layers in order to produce rich representations for both large and small objects. Since the objects to be detected have varying sizes, the topmost features are too coarse to represent smaller objects, so the first CNN based object detectors were fairly weak in detecting small sized objects. The multi-scale, pyramid hierarchy is inherent to CNNs so [2] introduced the Feature Pyramid Network which at marginal costs combines features from multiple scales and makes predictions on them. This or improved variants of this technique is used by most detectors nowadays. The head part does the predictions for bounding boxes and their labels.

YOLO belongs to the family of anchor-based one-stage object detectors and was originally developed in Darknet, an open source neural network framework written in C and CUDA. Back in 2015, it was the first end-to-end differentiable network of this kind that offered joint learning of object bounding boxes and their labels.

One reason for the big success of newer YOLO versions is that the authors carefully merged new ideas into one architecture, the overall speed of the model being always the north star.

YOLOv4 introduces several changes to its v3 predecessor:

  • Backbone – CSPDarknet53: YOLOv3 Darknet53 backbone was modified to use Cross Stage Partial Network (CSPNet [5]) strategy, which aims to achieve richer gradient combinations by letting the gradient flow propagate through different network paths.
  • Multiple configurable augmentation and loss function types, so called “Bag of freebies”, which by changing the training strategy can yield higher accuracy without impacting the inference time.
  • Configurable necks and different activation functions, they call “Bag of specials”.


For this task, we found that YOLOv4 gave a good compromise between speed and accuracy, as it doubles the speed of a more accurate two-stage detector while maintaining a very good overall precision/recall. For blurring, the main metric for model selection was the overall recall, while precision and intersection over union (IoU) of the predicted box come second, as we want to catch all personal data even if some detections are wrong. With a multitude of possibilities to configure the detector architecture and train it on our own dataset, we conducted several experiments with different configurations for backbones, necks, augmentations and loss functions to come up with our current solution.
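To make these metrics concrete, here is a minimal sketch of IoU, recall and precision for axis-aligned boxes. The 0.5 IoU threshold and the greedy one-to-one matching are assumptions made for the illustration; the text does not state which matching rule or threshold was used for model selection.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_and_precision(ground_truth, predictions, threshold=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(gt, pred), default=None)
        if best is not None and iou(best, pred) >= threshold:
            unmatched.remove(best)  # each ground-truth box is matched once
            true_positives += 1
    recall = true_positives / len(ground_truth) if ground_truth else 1.0
    precision = true_positives / len(predictions) if predictions else 1.0
    return recall, precision

# Hypothetical example: both objects are found, plus one false positive.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(1, 1, 10, 10), (50, 50, 60, 60), (21, 20, 30, 30)]
recall, precision = recall_and_precision(ground_truth, predictions)
```

In this example the recall is perfect while the precision is lower, which is the trade-off the blurring use case prefers: an extra blurred region costs little, whereas a missed face or licence plate is a privacy leak.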

We faced challenges in training a good model as the dataset posed a large object/box-level scale imbalance, small objects being over-represented in the dataset. As described in [3] and [4], this affects the scales of the estimated regions and the overall detection performance. In [3] several solutions are proposed for this out of which the SPP [6] blocks and PANet [7] neck used in YOLOv4 together with heavy offline data augmentation increased the performance of the actual model in comparison to the former ones.

Our evaluation shows that the model still has some issues:

  • Occlusion of the object, either by the camera view, head accessories or other elements:

These cases would need extra annotation in the dataset, just like the faces or licence plates that are really close to the camera and occupy a large region of interest in the image.

  • As we have a limited number of annotations of objects close to the camera, the model has not learnt these cases correctly and sometimes produces false positives in these situations:

Again, one solution for this would be to include more of these scenarios in the dataset.

What’s Next?

Grab spends a lot of effort ensuring privacy protection for its users so we are always looking for ways to further improve our related models and processes.

As far as efficiency is concerned, there are multiple directions to consider for both the dataset and the model. There are two main factors that drive the costs and the quality: further development of the dataset for additional edge cases (e.g. more training data of people wearing masks) and the operational costs of the model.

As the vast majority of current models require a fully labelled dataset, this puts a large work effort on the Data Entry team before creating a new model. Our dataset increased 4x for its third version, but there is still room for improvement, as described in the Dataset section.

As Grab extends its operations to more cities, new data is collected that has to be processed. This puts an increased focus on running detection models more efficiently.

Directions to pursue to increase our efficiency could be the following:

  • As plenty of unlabelled data is available from imagery collection, a natural direction to explore is self-supervised visual representation learning techniques to derive a general vision backbone with superior transfer performance for our subsequent tasks, such as detection and classification.
  • Experiment with optimisation techniques like pruning and quantisation to get a faster model without sacrificing too much on accuracy.
  • Explore new architectures: YOLOv5, EfficientDet or Swin-Transformer for Object Detection.
  • Introduce semi-supervised learning techniques to improve our model performance on the long tail of the data.


  1. Alexey Bochkovskiy et al. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934v1
  2. Tsung-Yi Lin et al. Feature Pyramid Networks for Object Detection. arXiv:1612.03144v2
  3. Kemal Oksuz et al. Imbalance Problems in Object Detection: A Review. arXiv:1909.00169v3
  4. Bharat Singh, Larry S. Davis. An Analysis of Scale Invariance in Object Detection – SNIP. arXiv:1711.08189v2
  5. Chien-Yao Wang et al. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. arXiv:1911.11929v1
  6. Kaiming He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729v4
  7. Shu Liu et al. Path Aggregation Network for Instance Segmentation. arXiv:1803.01534v4
  8. http://blog.nitishmutha.com/equirectangular/360degree/2017/06/12/How-to-project-Equirectangular-image-to-rectilinear-view.html
  9. Kaiyu Yang et al. A Study of Face Obfuscation in ImageNet. arXiv:2103.06191
  10. Zhenda Xie et al. Self-Supervised Learning with Swin Transformers. arXiv:2105.04553v2

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Hosting Hugging Face models on AWS Lambda for serverless inference

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/hosting-hugging-face-models-on-aws-lambda/

This post written by Eddie Pick, AWS Senior Solutions Architect – Startups and Scott Perry, AWS Senior Specialist Solutions Architect – AI/ML

Hugging Face Transformers is a popular open-source project that provides pre-trained, natural language processing (NLP) models for a wide variety of use cases. Customers with minimal machine learning experience can use pre-trained models to enhance their applications quickly using NLP. This includes tasks such as text classification, language translation, summarization, and question answering – to name a few.

First introduced in 2017, the Transformer is a modern neural network architecture that has quickly become the most popular type of machine learning model applied to NLP tasks. It outperforms previous techniques based on convolutional neural networks (CNNs) or recurrent neural networks (RNNs). The Transformer also offers significant improvements in computational efficiency. Notably, Transformers are more conducive to parallel computation. This means that Transformer-based models can be trained more quickly, and on larger datasets than their predecessors.

The computational efficiency of Transformers provides the opportunity to experiment and improve on the original architecture. Over the past few years, the industry has seen the introduction of larger and more powerful Transformer models. For example, BERT was first published in 2018 and was able to get better benchmark scores on 11 natural language processing tasks using between 110M-340M neural network parameters. In 2019, the T5 model using 11B parameters achieved better results on benchmarks such as summarization, question answering, and text classification. More recently, the GPT-3 model was introduced in 2020 with 175B parameters and in 2021 the Switch Transformers are scaling to over 1T parameters.

One consequence of this trend toward larger and more powerful models is an increased barrier to entry. As the number of model parameters increases, so does the computational infrastructure that is necessary to train such a model. This is where the open-source Hugging Face Transformers project helps.

Hugging Face Transformers provides over 30 pretrained Transformer-based models available via a straightforward Python package. Additionally, there are over 10,000 community-developed models available for download from Hugging Face. This allows users to use modern Transformer models within their applications without requiring model training from scratch.

The Hugging Face Transformers project directly addresses challenges associated with training modern Transformer-based models. Many customers want a zero administration ML inference solution that allows Hugging Face Transformers models to be hosted in AWS easily. This post introduces a low touch, cost effective, and scalable mechanism for hosting Hugging Face models for real-time inference using AWS Lambda.


Our solution consists of an AWS Cloud Development Kit (AWS CDK) script that automatically provisions container image-based Lambda functions that perform ML inference using pre-trained Hugging Face models. This solution also includes Amazon Elastic File System (EFS) storage that is attached to the Lambda functions to cache the pre-trained models and reduce inference latency.

Solution architecture

In this architectural diagram:

  1. Serverless inference is achieved by using Lambda functions that are based on container images
  2. The container image is stored in an Amazon Elastic Container Registry (ECR) repository within your account
  3. Pre-trained models are automatically downloaded from Hugging Face the first time the function is invoked
  4. Pre-trained models are cached within Amazon Elastic File System storage in order to improve inference latency

The solution includes Python scripts for two common NLP use cases:

  • Sentiment analysis: Identifying if a sentence indicates positive or negative sentiment. It uses a fine-tuned model on sst2, which is a GLUE task.
  • Summarization: Summarizing a body of text into a shorter, representative text. It uses a Bart model that was fine-tuned on the CNN / Daily Mail dataset.

For simplicity, both of these use cases are implemented using Hugging Face pipelines.


The following is required to run this example:

Deploying the example application

  1. Clone the project to your development environment:
    git clone https://github.com/aws-samples/zero-administration-inference-with-aws-lambda-for-hugging-face.git
  2. Install the required dependencies:
    pip install -r requirements.txt
  3. Bootstrap the CDK. This command provisions the initial resources needed by the CDK to perform deployments:
    cdk bootstrap
  4. Deploy the CDK application to its environment. During the deployment, the toolkit outputs progress indications:
    cdk deploy

Testing the application

After deployment, navigate to the AWS Management Console to find and test the Lambda functions. There is one for sentiment analysis and one for summarization.

To test:

  1. Enter “Lambda” in the search bar of the AWS Management Console.
  2. Filter the functions by entering “ServerlessHuggingFace”.
  3. Select the ServerlessHuggingFaceStack-sentimentXXXXX function.
  4. In the Test event, enter the following snippet and then choose Test:
    {
        "text": "I'm so happy I could cry!"
    }

The first invocation takes approximately one minute to complete. The initial Lambda function environment must be allocated and the pre-trained model must be downloaded from Hugging Face. Subsequent invocations are faster, as the Lambda function is already prepared and the pre-trained model is cached in EFS.
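The same test event can also be sent from code instead of the console. The following is a minimal sketch, assuming the boto3 Lambda client and its `invoke` call; the helper name `invoke_text_function` is hypothetical, and the function name shown in the usage comment is the placeholder from the steps above, which must be replaced with the name generated in your account.

```python
import json

def invoke_text_function(client, function_name, text):
    """Send a {"text": ...} event to a Lambda function and decode the reply.

    client is expected to behave like boto3.client("lambda").
    """
    response = client.invoke(
        FunctionName=function_name,
        Payload=json.dumps({"text": text}).encode("utf-8"),
    )
    # The response payload is a stream containing the function's JSON result.
    return json.loads(response["Payload"].read())

# Usage against a deployed stack (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("lambda")
#   result = invoke_text_function(
#       client, "ServerlessHuggingFaceStack-sentimentXXXXX",
#       "I'm so happy I could cry!")
#   print(result["body"])
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub, without needing AWS access.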

The JSON response shows the result of the sentiment analysis:

{
  "statusCode": 200,
  "body": {
    "label": "POSITIVE",
    "score": 0.9997532367706299
  }
}

Understanding the code structure

The code is organized using the following structure:

├── inference
│ ├── Dockerfile
│ ├── sentiment.py
│ └── summarization.py
├── app.py
└── ...

The inference directory contains:

  • The Dockerfile used to build a custom image to be able to run PyTorch Hugging Face inference using Lambda functions
  • The Python scripts that perform the actual ML inference

The sentiment.py script shows how to use a Hugging Face Transformers model:

import json
from transformers import pipeline

# Load the model once per execution environment, outside the handler
nlp = pipeline("sentiment-analysis")

def handler(event, context):
    response = {
        "statusCode": 200,
        "body": nlp(event['text'])[0]
    }
    return response

For each Python script in the inference directory, the CDK generates a Lambda function backed by a container image and a Python inference script.
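The mapping from script files to Lambda functions can be illustrated without any AWS dependency. The sketch below only mimics the file discovery; the `discover_functions` helper is hypothetical, and the `<module>.handler` command string is an assumption about how each container is pointed at its script.

```python
import os
import tempfile
from pathlib import Path

def discover_functions(docker_folder):
    """Map each Python script under docker_folder to a function name and handler."""
    functions = {}
    for path in sorted(Path(docker_folder).rglob('*.py')):
        base = os.path.basename(path)
        filename = os.path.splitext(base)[0]
        # "<module>.handler" is an assumed container command for this sketch.
        functions[filename] = filename + ".handler"
    return functions

# Demonstration against a throwaway directory that mimics the inference folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ("sentiment.py", "summarization.py", "Dockerfile"):
        Path(folder, name).touch()
    discovered = discover_functions(folder)

print(discovered)
```

Only the `.py` files are picked up, so adding a new script to the directory is enough to get a new entry, while the Dockerfile is ignored.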

CDK script

The CDK script is named app.py in the solution’s repository. The beginning of the script creates a virtual private cloud (VPC).

vpc = ec2.Vpc(self, 'Vpc', max_azs=2)

Next, it creates the EFS file system and an access point in EFS for the cached models:

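        # EFS file system that caches the downloaded models
        fs = efs.FileSystem(self, 'FileSystem', vpc=vpc)
        access_point = fs.add_access_point('MLAccessPoint',
                                           create_acl=efs.Acl(
                                               owner_gid='1001', owner_uid='1001', permissions='750'),
                                           path='/export/models',  # assumed root directory
                                           posix_user=efs.PosixUser(gid="1001", uid="1001"))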

It iterates through the Python files in the inference directory:

docker_folder = os.path.dirname(os.path.realpath(__file__)) + "/inference"
pathlist = Path(docker_folder).rglob('*.py')
for path in pathlist:

And then creates the Lambda function that serves the inference requests:

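            base = os.path.basename(path)
            filename = os.path.splitext(base)[0]
            # Lambda Function from docker image
            # (code/cmd arguments are reconstructed; memory and timeout settings are not shown)
            function = lambda_.DockerImageFunction(
                self, filename,
                code=lambda_.DockerImageCode.from_image_asset(
                    docker_folder, cmd=[filename + ".handler"]),
                vpc=vpc,
                filesystem=lambda_.FileSystem.from_efs_access_point(
                    access_point, '/mnt/hf_models_cache'),
                environment={
                    "TRANSFORMERS_CACHE": "/mnt/hf_models_cache"},
            )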

Adding a translator

Optionally, you can add more models by adding Python scripts in the inference directory. For example, add the following code in a file called translate-en2fr.py:

import json
from transformers import pipeline

# Load the translation model once, outside the handler
en_fr_translator = pipeline('translation_en_to_fr')

def handler(event, context):
    response = {
        "statusCode": 200,
        "body": en_fr_translator(event['text'])[0]
    }
    return response

Then run:

$ cdk synth
$ cdk deploy

This creates a new endpoint to perform English to French translation.

Cleaning up

After you are finished experimenting with this project, run “cdk destroy” to remove all of the associated infrastructure.


This post shows how to perform ML inference for pre-trained Hugging Face models by using Lambda functions. To avoid repeatedly downloading the pre-trained models, this solution uses an EFS-based approach to model caching. This helps to achieve low-latency, near real-time inference. The solution is provided as infrastructure as code using Python and the AWS CDK.

We hope this blog post allows you to prototype quickly and include modern NLP techniques in your own products.

The Future of Machine Learning and Cybersecurity

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2021/06/the-future-of-machine-learning-and-cybersecurity.html

The Center for Security and Emerging Technology has a new report: “Machine Learning and Cybersecurity: Hype and Reality.” Here’s the bottom line:

The report offers four conclusions:

  • Machine learning can help defenders more accurately detect and triage potential attacks. However, in many cases these technologies are elaborations on long-standing methods — not fundamentally new approaches — that bring new attack surfaces of their own.
  • A wide range of specific tasks could be fully or partially automated with the use of machine learning, including some forms of vulnerability discovery, deception, and attack disruption. But many of the most transformative of these possibilities still require significant machine learning breakthroughs.
  • Overall, we anticipate that machine learning will provide incremental advances to cyber defenders, but it is unlikely to fundamentally transform the industry barring additional breakthroughs. Some of the most transformative impacts may come from making previously un- or under-utilized defensive strategies available to more organizations.
  • Although machine learning will be neither predominantly offense-biased nor defense-biased, it may subtly alter the threat landscape by making certain types of strategies more appealing to attackers or defenders.

Building a Hyper Self-Service, Distributed Tracing and Feedback System for Rule & Machine Learning (ML) Predictions

Post Syndicated from Grab Tech original https://engineering.grab.com/building-hyper-self-service-distributed-tracing-feedback-system


At Grab, Trust, Identity, Safety, and Security (TISS) is a team of software engineers and AI developers working on fraud detection, login identity checks, safety issues, and more. There are many TISS services, like grab-fraud, grab-safety, and grab-id. They make billions of business decisions daily using the Griffin rule engine, which determines whether a passenger can book a trip or get a food promotion, or whether a driver gets a delivery booking.

There is a natural demand to log down all these important business decisions, store them and query them interactively or in batches. Data analysts and scientists need to use the data to train their machine learning models. RiskOps and customer service teams can query the historical data and help consumers.

That’s where Archivist comes in; it is a new tracing, statistics and feedback system for rule and machine learning-based predictions. It is reliable and performant. Its innovative data schema is flexible for storing events from different business scenarios. Finally, it provides a user-friendly UI, which has access control for classified data.

Here are the impacts Archivist has already made:

  • Currently, there are 2 teams with a total of 5 services and about 50 business scenarios using Archivist. The scenarios include fraud prevention (e.g. DriverBan, PassengerBan), payment checks (e.g. PayoutBlockCheck, PromoCheck), and identity check events like PinTrigger.
  • It takes only a few minutes to onboard a new business scenario (event type), by using the configuration page on the user portal. Previously, it took at least 1 to 2 days.
  • Each day, Archivist writes 80 million logs to the ElasticSearch cluster, which is about 200GB of data.
  • Each week, Customer Experience (CE)/Risk Ops goes to the user portal and checks Archivist logs for about 2,000 distinct customers. They can search based on numerous dimensions such as the Passenger/DriverID, phone number, request ID, booking code and payment fingerprint.


Each day, TISS services make billions of business decisions (predictions), based on the Griffin rule engine and ML models.

After the predictions are made, there are still some tough questions for these services to answer.

  • If a prediction is a false positive, a consumer could be wrongly banned. If this happens, how can consumers or Risk Ops quickly report this and feed it back into new rule and ML model training?
  • When Customer Service or Data Scientists investigate tickets opened due to TISS predictions/decisions, how do they know which rules and data were used? E.g. why a passenger triggered a selfie, or why a booking was blocked.
  • After Data Analysts/Data Scientists (DA/DS) launch a new rule/model, how can they track the performance in fine-granularity and in real-time? E.g. week-over-week rule performance in a country or city.
  • How can DA/DS access all prediction data for data analysis or model training?
  • How can the system keep up with Grab’s business launch speed, with maximum self-service?


To answer the questions above, TISS services previously used the company-wide Kibana to log predictions. For example, a log looks like: PassengerID:123,Scenario:PinTrigger,Decision:Trigger,.... This logging method had some obvious issues:

  • Logs in plain text don’t have any structure and are not friendly to ML model training as most ML models need processed data to make accurate predictions.
  • Furthermore, there is no fine-granularity access control for developers in Kibana.
  • Developers, DA and DS have no access control while CEs have no access at all. So CE cannot easily see the data and DA/DS cannot easily process the data.

To address all the Kibana log issues, we developed ActionTrace, a code library with a well-structured data schema. The logs, also called documents, are stored in a dedicated ElasticSearch cluster with access control implemented. However, after using it for a while, we found that it still needed some improvements.

  1. Each business scenario involves different types of entities and ActionTrace is not fully self-service. This means that a lot of development work was needed to support fast-launching business scenarios. Here are some examples:
    • The main entities in the taxi business are Driver and Passenger.
    • The main entities in the food business can be Merchant, Driver and Consumer.
    All these entities will need to be manually added into the ActionTrace data schema.
  2. Each business scenario may have its own custom information logged. Because there is no overlap, each of them will correspond to a new field in the data schema. For example:
    • For any scenario involving payment, a valid payment method and expiration date is logged.
    • For the taxi business, the geohash is logged.
  3. To store the log data from ActionTrace, different teams need to set up and manage their own ElasticSearch clusters. This increases hardware and maintenance costs.
  4. There was a simple Web UI created for viewing logs from ActionTrace, but there was still no fine-grained access control.


We developed Archivist, a new tracing, statistics, and feedback system for ML/rule-based prediction events. It’s centralised, performant and flexible. It answers all the issues mentioned above, and it is an improvement over all the existing solutions we have mentioned previously.

The key improvements are:

  • User-defined entities and custom fields
    • There are no predefined entity types. Users can define up to 5 entity types (e.g. PassengerId, DriverId, PhoneNumber, PaymentMethodId, etc.).
    • Similarly, there are a limited number of custom data fields to use, in addition to the common data fields shared by all business scenarios.
  • A dedicated service shared by all other services
    • Each service writes its prediction events to a Kafka stream. Archivist then reads the stream and writes to the ElasticSearch cluster.
    • The data writes are buffered, so it is easy to handle traffic surges in peak time.
    • Different services share the same Elastic Cloud Enterprise (ECE) cluster, but they create their own daily file indices so the costs can be split fairly.
  • Better support for data mining, prediction stats and feedback
    • Kafka stream data are simultaneously written to AWS S3. DA/DS can use the PrestoDB SQL query engine to mine the data.
    • There is an internal web portal for viewing Archivist logs. Customer service teams and Ops can use no-risk data to address CE tickets, while DA, DS and developers can view high-risk data for code/rule debugging.
  • A reduction of development days to support new business launches
    • Previously, it took a week to modify and deploy the ActionTrace data schema. Now, it only takes several minutes to configure event schemas in the user portal.
  • Time savings in RiskOps/CE investigations
    • The new web UI has access control in place, so different roles in the company, such as customer service and data analysts, can access the Archivist events with different levels of permissions.
    • It takes only a few clicks for them to find the relevant events that impact the drivers/passengers.

Architecture Details

Archivist’s system architecture is shown in the diagram below.

Archivist system architecture
  • Different services (like fraud-detection, safety-allocation, etc.) use a simple SDK to write data to a Kafka stream (the left side of the diagram).
  • In the centre of Archivist is an event processor. It reads data from Kafka and writes it to ElasticSearch (ES).
  • The Kafka stream also writes to the Amazon S3 data lake, so DA/DS can use the Presto SQL query engine to query the data.
  • The user portal (bottom right) can be used to view the Archivist log and update configurations. It also sends all the web requests to the API Handler in the centre.
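
The buffering behaviour of the event processor can be illustrated with a minimal sketch. It assumes an in-memory stand-in for both the Kafka stream and the ElasticSearch bulk write; the `BufferedEventProcessor` class, its method names and the batch size are hypothetical and not taken from Archivist itself.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, str]


class BufferedEventProcessor:
    """Buffers incoming events and writes them to a sink in batches."""

    def __init__(self, write_batch: Callable[[List[Event]], None], batch_size: int = 3):
        self._write_batch = write_batch  # stand-in for an ElasticSearch bulk write
        self._batch_size = batch_size    # illustrative value only
        self._buffer: Deque[Event] = deque()

    def consume(self, event: Event) -> None:
        """Accept one event from the stream and flush when the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write everything currently buffered as one batch."""
        if not self._buffer:
            return
        batch = list(self._buffer)
        self._buffer.clear()
        self._write_batch(batch)


def run_demo() -> List[List[Event]]:
    """Feed seven events through the processor and return the written batches."""
    written: List[List[Event]] = []
    processor = BufferedEventProcessor(written.append, batch_size=3)
    for i in range(7):
        processor.consume({"service_name": "GrabFraud", "request_id": str(i)})
    processor.flush()  # drain whatever is left in the buffer
    return written
```

Writing in batches means a traffic surge grows the buffer rather than the number of individual write requests, which is one plausible reading of why buffered writes help at peak time.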

The following diagram shows how internal and external users use Archivist as well as the interaction between the Griffin rule engine and Archivist.

Archivist use cases

Flexible Event Schema

In Archivist, a prediction/decision is called an event. The event schema can be divided into 3 main parts conceptually.

  1. Data partitioning: Fields like service_name and event_type categorise data by services and business scenarios.
    Field name | Type | Example | Notes
    service_name | string | GrabFraud | Name of the service
    event_type | string | PreRide | PaxBan/SafeAllocation
  2. Business decision making: request_id, decisions, reasons and event_content are used to record the business decision, the reason and the context (e.g. the input features of machine learning algorithms).
    Field name | Type | Example | Notes
    request_id | string | a16756e8-efe2-472b-b614-ec6ae08a5912 | A 32-digit id for web requests
    event_content | string | | Event context
    decisions | [string] | [“NotAllowBook”, “SMS”] | A list
    reasons | string | | JSON payload string of the response from the engine
  3. Customisation: Archivist provides user-defined entities and custom fields that we feel are sufficient and flexible for handling different business scenarios.
    Field name | Type | Example | Notes
    entity_type_1 | string | Passenger |
    entity_id_1 | string | 12151 |
    entity_type_2 | string | Driver |
    entity_id_2 | string | 341521-rdxf36767 |
    entity_id_5 | string | |
    custom_field_type_1 | string | “MessageToUser” |
    custom_field_1 | string | “please contact Ops” | User defined fields
    custom_field_type_2 | | “Prediction rule:” |
    custom_field_2 | string | “ML rule: 123, version:2” |
    custom_field_6 | string | |
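
To make the three parts of the schema concrete, here is a minimal sketch of how an event could be modelled and flattened into the numbered fields listed above. The `ArchivistEvent` class and `to_document` method are hypothetical names for illustration; the limit of 5 entities comes from the text, while the limit of 6 custom fields is only an assumption based on the highest field number in the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_ENTITIES = 5       # "up to 5 entity types" in the text
MAX_CUSTOM_FIELDS = 6  # assumption: inferred from custom_field_6 in the table


@dataclass
class ArchivistEvent:
    # Data partitioning
    service_name: str
    event_type: str
    # Business decision making
    request_id: str
    decisions: List[str]
    reasons: str = ""
    event_content: str = ""
    # Customisation: entity type -> entity id, custom field type -> value
    entities: Dict[str, str] = field(default_factory=dict)
    custom_fields: Dict[str, str] = field(default_factory=dict)

    def to_document(self) -> Dict[str, object]:
        """Flatten entities and custom fields into numbered schema fields."""
        if len(self.entities) > MAX_ENTITIES:
            raise ValueError(f"at most {MAX_ENTITIES} entities are allowed")
        if len(self.custom_fields) > MAX_CUSTOM_FIELDS:
            raise ValueError(f"at most {MAX_CUSTOM_FIELDS} custom fields are allowed")
        doc: Dict[str, object] = {
            "service_name": self.service_name,
            "event_type": self.event_type,
            "request_id": self.request_id,
            "event_content": self.event_content,
            "decisions": list(self.decisions),
            "reasons": self.reasons,
        }
        for i, (entity_type, entity_id) in enumerate(self.entities.items(), start=1):
            doc[f"entity_type_{i}"] = entity_type
            doc[f"entity_id_{i}"] = entity_id
        for i, (field_type, value) in enumerate(self.custom_fields.items(), start=1):
            doc[f"custom_field_type_{i}"] = field_type
            doc[f"custom_field_{i}"] = value
        return doc


example = ArchivistEvent(
    service_name="GrabFraud",
    event_type="PreRide",
    request_id="a16756e8-efe2-472b-b614-ec6ae08a5912",
    decisions=["NotAllowBook", "SMS"],
    entities={"Passenger": "12151", "Driver": "341521-rdxf36767"},
    custom_fields={"MessageToUser": "please contact Ops"},
).to_document()
```

Keeping entities and custom fields as numbered generic fields is what lets new business scenarios be configured without changing the underlying schema.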

A User Portal to Support Querying, Prediction Stats and Feedback

DA, DS, Ops and CE can access the internal user portal to see the prediction events, individually and on an aggregated city level.

A snapshot of the Archivist logs showing the aggregation of the data in each city

There are graphs on the portal, showing the rule/model performance on individual customers over a period of time.

Rule performance on a customer over a period of time

How to Use Archivist for Your Service

If you want to get onboard Archivist, the coding effort is minimal. Here is an example of a code snippet to log an event:

Code snippet to log an event
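
As a rough illustration of what logging an event could involve, the sketch below builds an event from the schema fields and serialises it for a stream. It is written in Python purely for illustration: the `ArchivistClient` class, its `log_event` method and the in-memory `sent` list (standing in for the Kafka stream) are all assumptions, not the real SDK.

```python
import json
import uuid
from typing import Dict, List


class ArchivistClient:
    """Hypothetical client; the real SDK's API is not described in this article."""

    def __init__(self, service_name: str):
        self.service_name = service_name
        self.sent: List[bytes] = []  # in-memory stand-in for the Kafka stream

    def log_event(self, event_type: str, decisions: List[str], **extra: str) -> Dict[str, object]:
        """Build one event, serialise it as JSON and append it to the stream."""
        event: Dict[str, object] = {
            "service_name": self.service_name,
            "event_type": event_type,
            "request_id": str(uuid.uuid4()),
            "decisions": decisions,
        }
        event.update(extra)  # entities and custom fields, e.g. entity_type_1
        self.sent.append(json.dumps(event).encode("utf-8"))
        return event


client = ArchivistClient("GrabFraud")
logged = client.log_event(
    "PreRide",
    ["NotAllowBook", "SMS"],
    entity_type_1="Passenger",
    entity_id_1="12151",
    custom_field_type_1="MessageToUser",
    custom_field_1="please contact Ops",
)
```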


During the implementation of Archivist, we learnt some things:

  • A good system needs to support multi-tenancy from the beginning. Originally, we thought we could use just one Kafka stream, and put all the documents from different teams into one ElasticSearch (ES) index. But after one team insisted on keeping their data separate from the others, we created more Kafka streams and ES indexes. We realised that this way, it’s easier for us to manage data and share the cost fairly.
  • Shortly after we launched Archivist, there was an incident where the ES data writes were choked. Because each document write is a goroutine, the number of goroutines increased to 400k and the memory usage reached 100% within minutes. We added a patch (2 lines of code) to limit the maximum number of goroutines in our system. Since then, we haven’t had any more severe incidents in Archivist.
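
The idea behind that patch, capping how many writes can be in flight at once, can be sketched as follows. This is a Python analogue using threads and a semaphore rather than the actual Go change, and the cap of 4 and the `write_all` function are illustrative assumptions.

```python
import threading
import time
from typing import List

MAX_IN_FLIGHT = 4  # illustrative cap, not the value used in Archivist


def write_all(documents: List[str], max_in_flight: int = MAX_IN_FLIGHT) -> int:
    """Write each document in its own thread, but never more than
    max_in_flight at once. Returns the peak number of concurrent writers."""
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    active = 0
    peak = 0

    def write(doc: str) -> None:
        nonlocal active, peak
        try:
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)  # stand-in for a slow ElasticSearch write
            with lock:
                active -= 1
        finally:
            slots.release()  # free the slot even if the write fails

    threads = []
    for doc in documents:
        slots.acquire()  # blocks the producer while all slots are taken
        worker = threading.Thread(target=write, args=(doc,))
        worker.start()
        threads.append(worker)
    for worker in threads:
        worker.join()
    return peak
```

Because the producer blocks when every slot is taken, a stalled downstream store slows intake instead of letting the number of pending writers (and their memory) grow without bound.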

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How We Improved Agent Chat Efficiency with Machine Learning

Post Syndicated from Grab Tech original https://engineering.grab.com/how-we-improved-agent-chat-efficiency-with-ml

In previous articles (see Grab’s in-house chat platform, workforce routing), we shared how chat has grown to become one of the primary channels for support in the last few years.

With continuous chat growth and a new in-house tool, helping our agents be more efficient and productive was key to ensuring faster support for our users and scaling chat even further.

Starting from the analysis on the usage of another third-party tool as well as some shadowing sessions, we realised that building a template-based feature wouldn’t help. We needed to offer personalisation capabilities, as our consumer support specialists care about their writing style and tone, and using templates often feels robotic.

We decided to build a machine learning model, called SmartChat, which offers contextual suggestions by leveraging several sources of internal data, helping our chat specialists type much faster, and hence serving more consumers.

In this article, we are going to explain the process from problem discovery to design iterations, and share how the model was implemented from both a data science and software engineering perspective.

How SmartChat Works

Diving Deeper into the Problem

Agent productivity became a key part in the process of scaling chat as a channel for support.

After splitting chat time into all its components, we noted that agent typing time represented a big portion of the chat support journey, making it the perfect problem to tackle next.

After some analysis on the usage of the third-party chat tool, we found out that even with functionalities such as canned messages, 85% of the messages were still free typed.

Hours of shadowing sessions also confirmed that the consumer support specialists liked to add their own flair. They would often use the template and adjust it to their style, which took more time than just writing it on the spot. With this in mind, it was obvious that templates wouldn’t be too helpful, unless they provided some degree of personalisation.

We needed something that reduces typing time and also:

  • Allows some degree of personalisation, so that answers don’t seem robotic and repeated.
  • Works with multiple languages and nuances. Grab operates in 8 markets, and even the English-speaking markets differ slightly in commonly used words.
  • Is contextual to the problem, taking into account the user type, the issue reported, and even the time of the day.
  • Ideally doesn’t require any maintenance effort, such as having to keep templates updated whenever there’s a change in policies.

Considering the constraints, this seemed to be the perfect candidate for a machine learning-based functionality, which predicts sentence completion by considering all the context about the user, issue and even the latest messages exchanged.

Usability is Key

To fulfil the hypothesis, there are a few design considerations:

  1. Minimising the learning curve for agents.
  2. Avoiding visual clutter if recommendations are not relevant.

To increase the probability of predicting an agent’s message, one of the design explorations is to allow agents to select the top 3 predictions (Design 1). To onboard agents, we designed a quick tip to activate SmartChat using keyboard shortcuts.

By displaying the top 3 recommendations, we learnt that it slowed agents down as they started to read all options even if the recommendations were not helpful. Besides, by triggering this component upon every recommendable text, it became a distraction as they were forced to pause.

In our next design iteration, we decided to reuse an interaction from a platform that agents already use: Gmail’s Smart Compose. As agents are familiar with Gmail, the learning curve for this feature would be less steep. First-time users will see a “Press tab” tooltip, which activates the text recommendation. The tooltip disappears after 5 uses.

To relearn the shortcut, agents can hover over the recommended text.

How We Track Progress

Knowing that this feature would come in multiple iterations, we had to find ways to track how well we were doing progressively, so we decided to measure the different components of chat time.

We realised that the agent typing time is affected by:

  • Percentage of characters saved. This tells us that the model predicted correctly, and also saved time. This metric should increase as the model improves.
  • Model’s effectiveness. This is the number of characters the agent has to type before getting the right suggestion, which should decrease as the model learns.
  • Acceptance rate. This tells us how many messages were written with the help of the model. It is a good proxy for feature usage and model capabilities.
  • Latency. If the suggestion is not shown in about 100-200ms, the agent would not notice the text and keep typing.
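
As an illustration of how two of these metrics could be computed from per-message counts, here is a minimal sketch. The exact definitions used in production are not given in this article, so the formulas, the `MessageStats` type and the sample numbers below are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MessageStats:
    typed_chars: int      # characters the agent typed themselves
    suggested_chars: int  # characters inserted by accepted suggestions


def characters_saved_pct(messages: List[MessageStats]) -> float:
    """Share of all sent characters that came from accepted suggestions."""
    total = sum(m.typed_chars + m.suggested_chars for m in messages)
    saved = sum(m.suggested_chars for m in messages)
    return 100.0 * saved / total if total else 0.0


def acceptance_rate_pct(messages: List[MessageStats]) -> float:
    """Share of messages written with at least one accepted suggestion."""
    if not messages:
        return 0.0
    accepted = sum(1 for m in messages if m.suggested_chars > 0)
    return 100.0 * accepted / len(messages)


# Made-up sample: four messages, two of which used a suggestion.
sample = [
    MessageStats(typed_chars=40, suggested_chars=10),
    MessageStats(typed_chars=30, suggested_chars=0),
    MessageStats(typed_chars=20, suggested_chars=20),
    MessageStats(typed_chars=50, suggested_chars=0),
]
```

For this sample, 30 of 170 characters come from suggestions (about 17.6%), and 2 of 4 messages used one (50%).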


The architecture involves support specialists initiating the fetch suggestion request, which is sent for evaluation to the machine learning model through API gateway. This ensures that only authenticated requests are allowed to go through and also ensures that we have proper rate limiting applied.

We have an internal platform called Catwalk, which is a microservice that offers the capability to execute machine learning models as an HTTP service. We used the Presto query engine to calculate and analyse the results from the experiment.

Designing the Machine Learning Model

I am sure all of us can remember an experiment we did in school when we had to catch a falling ruler. For those who have not done this experiment, feel free to try it at home! The purpose of this experiment is to define a ballpark number for typical human reaction time (equations also included in the video link).
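
For readers without the video, the ruler experiment rests on the free-fall relation d = ½·g·t², so the reaction time is t = √(2d/g), where d is how far the ruler falls before it is caught. The short sketch below applies that formula; the function name and the sample distance are just for illustration, and air resistance is ignored.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def reaction_time_ms(drop_distance_cm: float) -> float:
    """Reaction time implied by how far a ruler falls before being caught."""
    distance_m = drop_distance_cm / 100.0
    return 1000.0 * math.sqrt(2.0 * distance_m / G)


# A ruler caught after falling 20 cm implies a reaction time of roughly 200 ms.
caught_at_20cm = reaction_time_ms(20.0)
```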

Typically, the human reaction time ranges from 100ms to 300ms, with a median of about 250ms (read more here). Hence, we decided to set the upper bound for SmartChat response time to be 200ms while deciding the approach. Otherwise, the experience would be affected as the agents would notice a delay in the suggestions. To achieve this, we had to manage the model’s complexity and ensure that it achieves the optimal time performance.

Taking into consideration network latencies, the machine learning model would need to churn out predictions in less than 100ms, so that the entire product stays within the 200ms response-time budget.

As such, a few key components were considered:

  • Model Tokenisation
    • Model input/output tokenisation needs to be implemented along with the model’s core logic so that it is done in one network request.
    • Model tokenisation needs to be lightweight and cheap to compute.
  • Model Architecture
    • This is a typical sequence-to-sequence (seq2seq) task so the model needs to be complex enough to account for the auto-regressive nature of seq2seq tasks.
    • We could not use pure attention-based models, which are usually state of the art for seq2seq tasks, as they are bulky and computationally expensive.
  • Model Service
    • The model serving platform should be executed on a low-level, highly performant framework.

Our proposed solution considers the points listed above. We have chosen to develop in TensorFlow (TF), which is a well-supported framework for machine learning models and application building.

For Latin-based languages, we used a simple whitespace tokenizer, which is serialisable in the TF graph using the tensorflow-text package.

import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()

For the model architecture, we considered a few options but eventually settled on a simple recurrent neural network (RNN) architecture, in an Encoder-Decoder structure:

  • Encoder
    • Whitespace tokenisation
    • Single-layered Bi-Directional RNN
    • Gated Recurrent Unit (GRU) Cell
  • Decoder
    • Single-layered Uni-Directional RNN
    • Gated Recurrent Unit (GRU) Cell
  • Optimisation
    • Teacher forcing in training, greedy decoding in production
    • Trained with a cross-entropy loss function
    • Using the Adam (Kingma and Ba) optimiser
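
Greedy decoding, as used in production, simply picks the most probable next token at every step until an end token appears. The sketch below shows that loop in isolation; a hand-written lookup table stands in for the GRU decoder, so `toy_model`, `TOY_TABLE` and the `<end>` token are hypothetical and the probabilities are made up.

```python
from typing import Callable, Dict, List

END = "<end>"  # assumed end-of-sequence token


def greedy_decode(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    prefix: List[str],
    max_len: int = 10,
) -> List[str]:
    """Extend the prefix one token at a time, always taking the most
    probable next token, until END is chosen or max_len is reached."""
    tokens = list(prefix)
    completion: List[str] = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        if best == END:
            break
        completion.append(best)
        tokens.append(best)
    return completion


# Toy next-token distributions keyed by the last token (made-up numbers).
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "good": {"morning": 0.6, "evening": 0.4},
    "morning": {"how": 0.7, END: 0.3},
    "how": {"may": 0.9, END: 0.1},
    "may": {"i": 0.8, END: 0.2},
    "i": {"help": 0.9, END: 0.1},
    "help": {END: 0.6, "you": 0.4},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Stand-in for the decoder: looks only at the last token."""
    last = tokens[-1] if tokens else ""
    return TOY_TABLE.get(last, {END: 1.0})


suggestion = greedy_decode(toy_model, ["good"])
```

Greedy decoding needs only one decoder pass per generated token, which fits a tight latency budget better than search strategies that keep several candidate sequences alive.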


To provide context for the sentence completion tasks, we provided the following features as model inputs:

  • Past conversations between the chat agent and the user
  • Time of the day
  • User type (Driver-partners, Consumers, etc.)
  • Entrypoint into the chat (e.g. an article on cancelling a food order)

These features give the model the ability to generalise beyond a simple language model, with additional context on the nature of contact for support. This context also makes the suggestions more customised, which improves the user experience.

For example, the model is better aware of the nature of time in addressing “Good {Morning/Afternoon/Evening}” given the time of the day input, as well as being able to interpret meal times in the case of food orders. E.g. “We have contacted the driver, your {breakfast/lunch/dinner} will be arriving shortly”.

Typeahead Solution for the User Interface

With our goal to provide a seamless experience in showing suggestions to accepting them, we decided to implement a typeahead solution in the chat input area. This solution had to be implemented with the ReactJS library, as the internal web-app used by our support specialist for handling chats is built in React.

There were a few ways to achieve this:

  1. Modify the Document Object Model (DOM) using Javascript to show suggestions by positioning them over the input HTML tag based on the cursor position.
  2. Use a content editable div and have the suggestion span render conditionally.

After evaluating the complexity in both approaches, the second solution seemed to be the better choice, as it is more aligned with the React way of doing things: avoid DOM manipulations as much as possible.

However, when a suggestion is accepted we would still need to update the content editable div through DOM manipulation. It cannot be added to React’s state as it creates a laggy experience for the user to visualise what they type.

Here is a code snippet for the implementation:

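import React, { Component } from 'react';
import liveChatInstance from './live-chat';

export class ChatInput extends Component {
 constructor(props) {
   super(props); // required before using `this` in a subclass constructor
   this.state = {
     suggestion: '',
   };
 }

 getCurrentInput = () => {
   const { roomID } = this.props;
   const inputDiv = document.getElementById(`input_content_${roomID}`);
   // assumed id for the suggestion span; it must match the id used in render()
   const suggestionSpan = document.getElementById(
     `suggestion_content_${roomID}`,
   );

   // put the check for extra safety in case suggestion span is accidentally cleared
   if (suggestionSpan) {
     const range = document.createRange();
     range.setStart(inputDiv, 0);
     range.setEndBefore(suggestionSpan);
     return range.toString(); // content before suggestion span in input div
   }
   return inputDiv.textContent;
 };

 handleKeyDown = async e => {
   const { roomID } = this.props;
   // tab or right arrow for accepting suggestion
   if (this.state.suggestion && (e.keyCode === 9 || e.keyCode === 39)) {
     e.preventDefault(); // keep focus in the input instead of tabbing away
     this.insertContent(this.state.suggestion);
     this.setState({ suggestion: '' });
     return;
   }
   const parsedValue = this.getCurrentInput();
   // space
   if (e.keyCode === 32 && !this.state.suggestion && parsedValue) {
     // fetch suggestion
     const prediction = await liveChatInstance.getSmartComposePrediction(
       parsedValue.trim(), roomID);
     this.setState({ suggestion: prediction || '' });
   }
 };

 insertContent = content => {
   // insert content behind cursor
   const { roomID } = this.props;
   const inputDiv = document.getElementById(`input_content_${roomID}`);
   if (inputDiv) {
     inputDiv.focus();
     const sel = window.getSelection();
     // make sure a range exists before reading it
     if (sel.getRangeAt && sel.rangeCount) {
       const range = sel.getRangeAt(0);
       range.insertNode(document.createTextNode(content));
       range.collapse(false); // move the cursor to the end of the inserted text
     }
   }
 };

 render() {
   const { roomID } = this.props;
   return (
     <div className="message_wrapper">
       {/* assumed markup: a content editable div holding the conditional suggestion span */}
       <div
         id={`input_content_${roomID}`}
         contentEditable
         suppressContentEditableWarning
         onKeyDown={this.handleKeyDown}
       >
         {!!this.state.suggestion.length && (
           <span contentEditable={false} id={`suggestion_content_${roomID}`}>
             {this.state.suggestion}
           </span>
         )}
       </div>
     </div>
   );
 }
}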

The solution uses the spacebar as the trigger for fetching the suggestion from the ML model and stores it in React state. The ML model prediction is then shown in a conditionally rendered span.

We used the window.getSelection() and range APIs to:

  • Find the current input value
  • Insert the suggestion
  • Clear the input to type a new message

The implementation has also considered the following:

  • Caching. API calls are made on every space character to fetch the prediction. To reduce the number of API calls, we also cached the prediction until it differs from the user input.
  • Recover placeholder. There are data fields that are specific to the agent and consumer, such as agent name and user phone number, and these data fields are replaced by placeholders for model training. The implementation recovers the placeholders in the prediction before showing it on the UI.
  • Control rollout. Since rollout is by percentage per country, the implementation has to ensure that only certain users can access predictions from their country chat model.
  • Aggregate and send metrics. Metrics are gathered and sent for each chat message.
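
The caching and placeholder recovery described above can be sketched as follows. This is one possible reading, written in Python rather than the frontend's JavaScript: the `SuggestionCache` class, the `{agent_name}` placeholder syntax and the fake fetch function are all assumptions for illustration.

```python
from typing import Callable, Dict


class SuggestionCache:
    """Reuses the last prediction while the agent keeps typing along it."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch    # stand-in for the prediction API call
        self._base = ""        # input text at the time of the last fetch
        self._prediction = ""  # completion predicted for that input
        self.api_calls = 0

    def suggest(self, text: str) -> str:
        full = self._base + self._prediction
        still_matches = (
            bool(self._prediction)
            and len(self._base) <= len(text) < len(full)
            and full.startswith(text)
        )
        if still_matches:
            return full[len(text):]  # remainder of the cached prediction
        # The input diverged from the cached prediction: fetch a new one.
        self._base = text
        self._prediction = self._fetch(text)
        self.api_calls += 1
        return self._prediction


def recover_placeholders(prediction: str, values: Dict[str, str]) -> str:
    """Replace {name} placeholders with agent- or consumer-specific values."""
    for name, value in values.items():
        prediction = prediction.replace("{" + name + "}", value)
    return prediction


def fake_fetch(text: str) -> str:
    """Made-up model response used only for this example."""
    return "morning, this is {agent_name}." if text == "Good " else ""


cache = SuggestionCache(fake_fetch)
first = cache.suggest("Good ")       # triggers a fetch
second = cache.suggest("Good morn")  # served from the cache
third = cache.suggest("Good eve")    # diverged, triggers another fetch
shown = recover_placeholders(first, {"agent_name": "Sam"})
```

Placeholders are restored only at display time, so the cached text stays identical to what the model returned.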


The initial experiment results suggested that we managed to save 20% of characters, which improved the efficiency of our agents by 12% as they were able to resolve the queries faster. These numbers exceeded our expectations and as a result, we decided to move forward by rolling SmartChat out regionally.

What’s Next?

In the upcoming iteration, we are going to focus on non-Latin language support, caching, and continuous training.

Non-Latin Language Support and Caching

The current model only works with Latin languages, where sentences consist of space-separated words. We are looking to provide support for non-Latin languages such as Thai and Vietnamese. The result would also be cached in the frontend to reduce the number of API calls, providing the prediction faster for the agents.

Continuous Training

The current machine learning model is built with training data derived from historical chat data. In order to teach the model and improve the metrics mentioned in our goals, we will enhance the model by letting it learn from data gathered in day-to-day chat conversations. Along with this, we are going to train the model to give better responses by providing more context about the conversations.

Seeing how effective this solution has been for our chat agents, we would also like to expose this to the end consumers to help them express their concerns faster and improve their overall chat experience.

Special thanks to Kok Keong Matthew Yeow, who helped to build the architecture and implementation in a scalable way.
